A duplicated copy of id2b is an unusual sex-determining candidate gene on the Y chromosome of arapaima (Arapaima gigas)

Arapaima gigas is one of the largest freshwater fish species of high ecological and economic importance. Overfishing and habitat destruction are severe threats to the remaining wild populations. By incorporating a chromosomal Hi-C contact map, we improved the arapaima genome assembly to chromosome-level, revealing an unexpected high degree of chromosome rearrangements during evolution of the bonytongues (Osteoglossiformes). Combining this new assembly with pool-sequencing of male and female genomes, we identified id2bbY, a duplicated copy of the inhibitor of DNA binding 2b (id2b) gene on the Y chromosome as candidate male sex-determining gene. A PCR-test for id2bbY was developed, demonstrating that this gene is a reliable male-specific marker for genotyping. Expression analyses showed that this gene is expressed in juvenile male gonads. Its paralog, id2ba, exhibits a male-biased expression in immature gonads. Transcriptome analyses and protein structure predictions confirm id2bbY as a prime candidate for the master sex-determiner. Acting through the TGFβ signaling pathway, id2bbY from arapaima would provide the first evidence for a link of this family of transcriptional regulators to sex determination. Our study broadens our current understanding about the evolution of sex determination genetic networks and provide a tool for improving arapaima aquaculture for commercial and conservation purposes.

Identification and localization of the male-specific locus of arapaima. To identify the sex locus of arapaima, we sequenced a pool of male and a pool of female genomic DNAs. These pool-sequencing (Pool-Seq) datasets were aligned to the male (heterogametic sex, XY) and female (homogametic sex, XX) assemblies to search for sex-biased signatures. When the Pool-Seq reads were aligned to the female assembly, two prominent peaks of male-specific single nucleotide variants (SNVs) were detected (heterozygous variations in males but homozygous for the same allele in all females) on chromosomes (Chr) 26 and 5, and a minor peak on Chr 10 ( Fig. 1A). No signal of sex-linkage appeared using the male reference assembly, however (Fig. 1B). The sequence containing the peak of male-specific SNVs in the female assembly Chr05 (Fig. 1C) has high sequence identity with a small region of the male assembly of Chr26 (Fig. 1E). In the male pool, the latter region's normalized read coverage is half that of sequences elsewhere on the chromosome, and a 10 kb region has a complete absence of reads in the female pool (Fig. 1D). These results suggest that the distal part of Chr26 contains the sex locus of arapaima and that this originated by insertion of a small duplicated fragment region from a progenitor on the autosomal Chr05. The high male-specific SNV-peak detected on the female Chr26, may therefore result from a larger X/Y differentiated region around the small male sex-specific insertion, since, in general, the sex-determining region is known to accumulate mutations around the SD gene 15 . As the arapaima sex determination system is XX/XY male heterogamety 13 , the lack of signal on the male assembly could be considered as counter-intuitive. However, our findings are consistent with a male-specific duplication/insertion, whose male-specific reads can only align to the autosomal region of the XX female genome from which this duplication originated. Hence, the Y-specific sequences are haploid and no male-specific SNVs are expected in those regions, as females have no corresponding sequences in their genome to be compared with. This provides a clear and strong signal on that location on Chr05, which is easier to detect than the small region with coverage difference between the X and Y versions Chr26.
A duplicated id2b gene in the male-specific region of the Y chromosome. This small male-specific region (9656 bp) on Chr26, which shows high similarity to Chr05 (Fig. 1D), contains only two genes, a duplicated copy of id2b (id2bbY, [535 bp (DNA) and 411 bp (CDS)]) and a fragment of kidins220b (kidins220bΔ, [514 bp (DNA) and 231 bp (CDS)]) (Fig. 1D). The DNA sequence of kidins220bΔ is highly diverged (5.7% amino acid and 6.1% cDNA identity) from to the homologous kidins220b (15,043 bp DNA and 3477 bp CDS) on Chr05 suggesting that this gene fragment is corrupted. All RNA-seq datasets examined (see "Methods") lacked transcripts from the Chr26, supporting the view that it is non-functional. In contrast, the id2bbY copy is highly conserved, with 93.2% cDNA identity and 86.9% amino acid identity. (C) Zoomed view of the sex-specific SNVs (total per 5 kb window size) on the sex-biased region of Chr05 with the location of the inhibitor of DNA binding 2 b gene (id2b) and of the kinase D interacting substrate 220 b gene (kidins220b). (D) Zoomed view of the normalized coverage depth (average per 5 kb window size) of the sexbiased Chr05 homologous region on male Chr26 with location of the duplicated id2bbY copy of the Chr05 id2b gene and of the truncated duplicated kidins220bΔ of the Chr05 kidins220b.E. Multiple alignment plots of the percentage (%) of sequence identity between the id2bbY locus on Chr26, the coding sequences (CDS) of id2b and kidins220b and the corresponding autosomal Chr05 region used as a reference.  S1). This indicates a relaxation of purifying selection on id2b after its divergence from id2a, which may be connected to the multiple losses of id2b during evolution.
To assess whether id2bbY emerged before or after the divergence of arapaima and Asian arowana, we screened the whole genome assembly of arowana for id2bbY orthologs but failed to detect it. Thus, id2bbY either arose before the split of arapaima and arowana but was subsequently lost in arowana, or the gene duplication event occurred after the divergence of both fish lineages. To answer this question, we compared the Ks distance between arapaima id2bbY and id2ba to the genome-wide Ks between arapaima and arowana. The Ks distance represents synonymous changes, which are generally not exposed to selection. Hence, this parameter can be used under the assumption of an appropriate neutral molecular clock. We retrieved 18,621 one to one orthologous pairs with conserved synteny between arapaima and arowana. Their pairwise Ks values were then estimated and resulted in a distribution with median at 0.44 and mean at 0.46, while the Ks distance between arapaima id2bbY and id2ba was estimated to be 0.078, which is far smaller. This indicates that id2bbY emerged after arapaima diverged from arowana, and hence is a specific duplication in the arapaima lineage. Assuming the divergence time between arapaima and Asian arowana at 106 MYA 16  The id2bbY is a male-specific marker of arapaima. To validate if id2bbY can be used as a male-specific marker in arapaima, PCR amplifications with specific primers were performed in 8 different populations (Table 1). In our previous study, 25 males and 25 females derived from two different populations (Senador Guiomard and Cacoal, Brazil), were used for RAD-tag analyses 13 . Male and female sex-specific RAD-tags were extracted from those animals. While generally a good match between genotyped and phenotyped was recorded, three males showed a female pattern of tags (males #10, #13 and #24), and one female showed male tags (female #21). We genotyped all 50 animals for the id2bbY gene. Results fully agreed with RAD-tag genotyping and we confirmed that the outliers belong to the opposite sex ( Fig. S2 A and B). Most likely these outlier individuals were not accurately phenotypically sexed, which was done in the fish farms by gross body morphology. Similarly, the population from Pentecoste showed 3 outliers, showing the difficulties in identifying the sex of arapaima either by endoscopy of the gonad or by vitellogenin detection, which accurately identifies mature (vitellogenin producing) female, but with a risk to mis-classify immature females (non-vitellogenin producing if not induced by 17β-estradiol at this reproductive stage) 17 as males. Accordingly, all animals that were sexed using histology, a much more precise procedure, showed 100% concordance of phenotypic sex with the id2bbY genotype.
id2bbY is a candidate master sex-determining gene in arapaima. The origin of the id2bbY gene resembles the situation in medaka, in which an autosomal gene was duplicated and the new copy inserted as a small male-specific region into another chromosome, thus generating the proto-Y 18 . In medaka, the SD gene www.nature.com/scientificreports/ dmrt1bY (synonym dmy) is evolving faster than its ancestral gene, dmrt1a 19 , which supports of the hypothesis of a higher mutation rate in males than females, due to greater number of cell divisions in the male germ line 20 and/or lower copy number of Y chromosome compared to X chromosomes and autosomes, allowing genetic drift to act more strongly (collectively named "Y-driven evolution"). To assess if this is also the case for the arapaima candidate SD gene, we estimated the substitution rates across the gene tree of id2 (including id2a, id2b and id2bbY) using codeml under the free-ratio model. This analysis revealed that the synonymous substitution rate (Ks) of the arapaima id2bbY branch (0.0888) is 14 times higher than of the arapaima id2b branch (0.0062).
In addition, the mutation rate between the intronic region of id2bbY and id2b was calculated to exclude the possibility that the high Ks observed for the id2bbY branch is due to a codon usage bias. Using the arowana id2b as the reference, the pairwise distance (p-distance model in MEGA) of arapaima id2bbY intron (0.427) is longer than that of the arapaima id2b intron (0.387), consistent with the hypothesis of Y-driven evolution. In medaka, dmrt1bY has a higher ω value compared to its ancestor dmrt1a 19 . Similarly, id2bbY shows a higher ω value (0.97) than id2b (0.25, Fig. S3). This is estimated under a model 21 of one ω for id2bbY and a different one for id2ba. Among all branches, the log likelihood value under this model (− 1721.39) is higher than that (− 1724.89) of the model with an equal ω value but misses marginally the significance level (p = 0.06, likelihood ratio test). In spite of the higher ω, using the branch-site model 21 , we failed to detect positively selected sites in id2bbY.
To provide further evidence that id2bbY is most likely the male sex-determining gene of arapaima, we performed protein structure and expression analyses of the id2 genes. The Id2 proteins belong to the inhibitor of DNA binding (ID) family, which is characterized by a helix-loop-helix (HLH) domain. The ID family proteins do not bind to DNA, instead they interact directly with basic helix-loop-helix (bHLH) transcription factors, suppressing their heterodimerization and inhibiting their action in a dominant-negative manner. The amino acid sequence comparison between the Id2's of different vertebrate species revealed that Id2a is more conserved than Id2b (Fig. 3A). The divergence of Id2b sequences is also observed in the HLH domains of catfish, Mexican tetra, and Northern pike. However, in zebrafish, arowana and arapaima, the HLH domain of the Id2b's is similar to that of Id2a. Interestingly, besides changes at more variable positions, the Id2bbY peptide sequence has a proline to leucine (P56L) and an asparagine to serine (N58S) amino acid replacement in the highly conserved loop of its HLH domain. Protein structure prediction of Id2a, Id2ba and Id2bbY in arapaima revealed no obvious change in the 3-D structure of Id2bbY, indicating that it is a functional Id2 factor. These results motivate in-depth structure/function relationship studies in the future ( Fig. 3B-G).
Sex-determining genes are expected to show an expression bias towards one of the sexes during early gonadal development. In addition, Y-specific SD genes (e.g. gsdf Y , gdf6Y, amhy), derived by gene duplications or allelic variation display higher expression levels in the gonads during the early stages of sex determination and sex differentiation when compared to the autosomal copy or to the version on the X chromosome, respectively. For expression analyses, first two juvenile arapaima (about 1 year old and between 94 and 102 cm long) from the Thuringia aquaculture were used for RNA-seq. At this stage, females showed ovaries containing mainly oogonia and pre-vitellogenic oocytes, while male testis tubules were characterized by germinal epithelium containing only Sertoli cells and spermatogonia (Fig. S2C, D). Expression analyses of id2a, id2ba and id2bbY revealed that both, id2ba and id2bbY, exhibit male-specific expression in juveniles, with id2bbY being almost 10 times more expressed than id2ba (Fig. 4A). In previously established transcriptomes of adult gonads 13 , id2bbY expression is similar to the juvenile gonads. However, id2ba is upregulated in adult testis and ovary compared to the juvenile stages of the organs (Fig. 4A). The expression levels of id2a are similar in all samples, being slightly lower in adult ovaries (Fig. 4A). In summary, id2bbY follows the expected expression pattern of a potential sex-determining gene.
Transcriptome analyses of genes known to be involved in sex determination and gonadal differentiation in vertebrates showed that the male and female genes are active in both juvenile testis and ovary (Fig. 4B).  www.nature.com/scientificreports/ Transcription factors important for testis development, e.g., dmrt1 and sox9b, showed similarly high expression in juvenile and adult males. In addition, growth factors related to TGFβ signaling, gsdf and amh, which are important for testis differentiation, showed higher expression in juvenile testis than in the adult (Fig. 4B).

Discussion
Our high-quality chromosome-level genome of a male and a female arapaima considerably improved the former published genomic information. The new assemblies allowed to identify a most likely candidate for the male sex-determining gene and to establish a highly versatile PCR test for sex genotyping. Sex chromosome evolution can be a relatively rapid process, and it depends on the origin and fixation of a new SD gene 22 . Both Asian arowana and arapaima belong to the suborder Osteoglossoidei. While arapaima has a XX/XY system, Asian arowana females possess heteromorphic sex chromosomes, indicating a ZZ/ZW system 23 . A similar situation has been described for fish of the genus Oryzias, which further demonstrates that SD genes and sex chromosome systems can vary even between closely related species. The dmrt1bY gene (on LG1) is the SD gene in both Oryzias latipes 24,25 and Oryzias curvinotus 26 , while gsdf Y (LG12) is the SD gene in Oryzias luzonensis 27 and sox3 Y (LG10) in Oryzias dancena 28 . Oryzias hubbsi and Oryzias javanicus, differently from the other species of the genus, even have a ZZ/ZW sex chromosome system 29 .
The id2bbY gene is a duplicated copy of the autosome id2b gene on the Y chromosome (Fig. 5). Sequence differences from the autosomal precursor make it possible to use this gene as male-specific molecular marker. The importance of finding a marker is extremely useful for sex ratio control in aquaculture, for instance for the production of monosex populations, and for monitoring wild populations. These are desirable because of the existence of valuable traits associated with one sex (e.g. growth, color and shape) 11 . In arapaima, the giant size, and the long time to reach sexual maturation require a huge amount of resources to maintain the animals for the breeding process 9 . The most accurate method to identify the sex of an individual is the direct inspection of the differentiated gonad morphology confirmed by histology, especially for immature fish. However, for rearing purposes, other methods have to be applied, such as endoscopy and vitellogenin detection. The simple PCR amplification of id2bbY showed 100% accuracy in reliably sexed fish. This method is minimally invasive, and requires only a very small fin clip or oral swap and can be done already in small fish. This genotyping method was validated in different populations of arapaima, distributed along the Amazon territory (Fig. S4). Despite the broad range, recent studies have provided no evidence for the existence of different species of arapaima, and that the genetic divergence between populations is associated with sedentary behavior, the impact of fisheries on stocks, and the characteristics of each basin, e.g., floodplain dynamics [30][31][32][33] . www.nature.com/scientificreports/ All SD genes described so far are single gene duplicates or allelic variants of genes known to be related to sex determination and differentiation 34 . The only exception is the sdY gene in salmonids, a truncated copy of the irf9 (interferon regulatory factor 9) gene 35 , which has not been implicated in sexual development so far. At first sight, id2bbY appears as another exception because to date, id2 has never been assigned an important role in sex determination and differentiation. However, several studies showed that this gene and other ID proteins are involved in ovary maturation, granulosa cell differentiation and spermatogenesis [36][37][38] . In chicken, Id2 was proposed to be involved in ovarian follicle differentiation by increasing the levels of fshr mRNA 39 . The Id genes are transcriptionally regulated by TGF-β signaling 40 , which is known to be important for ovary and testis development, and germ cell differentiation 41,42 . TGF-β growth factors bind to their respective type 2 receptors, which in turn recruit a type 1 receptor activating the SMAD factors that regulate id2 gene expression 43 (Fig. 6). Components of the TGF-β signaling pathway recurrently became the SD trigger in different species, including amhy 44 , gdf6Y 45 , gsdf Y27 , bmpr1bbY 46 and amhr2-SNP 47 . Interestingly, those SD genes encode ligands and receptors, and therefore are located upstream in the signaling pathway. The id2bbY of arapaima would be the first reported example of a downstream factor of the TGF-β cascade that has evolved as master male SD gene (Fig. 6). A sex-linked SNP mutation in amhr2 of obscure puffer (Takifugu obscurus) showed that males have higher phosphorylation levels of Smads and also higher activity of id3 when compared to females 48 . Recent data using single-cell sequencing demonstrated that, in fetal testis of humans, ID2 upregulation is necessary for germ cell and testis development, and that it is induced by AMH and BMPR1B 49 . The current knowledge suggests that an increase of TGF-β signaling during the sex determining window, in this case caused by id2, can lead to testis development.
The SD trigger acts on the sex-regulatory network by promoting the genetic pathway of one of both sexes, and/or by repressing the opposite sex [50][51][52][53] . To understand if id2bbY is directly promoting the male SD program or repressing the female pathway it is important to investigate the ancestral function of id2b. Arapaima id2ba and id2bbY have a male-specific expression in juvenile, but id2bbY is significantly higher expressed, also in adult testis. In both, juvenile and adults testis, TGF-β signaling genes are strongly expressed, consistent with a male-promoting role. In addition, the 3-D analyses of the Id2bbY protein demonstrates no notable difference in the HLH domain, indicating that it is functional and can have some overlapping role with Id2ba. Interestingly, expression analyses of the id2 genes in sterlet and medaka showed higher expression in ovary compared to testis (Table S1), indicating that the male-bias of id2a and id2b of arapaima may be specific for this species.
To confirm the role of id2bbY as the master SD gene and its connection to TGF-β signaling, knock-out and gene addition genetic manipulations have to be done. Such experiments would also be useful to elucidate the role of Id2 proteins in sexual development. To date, however, the methodologies for genome modification and transgenesis are not developed in arapaima, yet.
In summary, we provide further evidence for the importance of TGF-β signaling in testis development, from which several upstream components are recurrently recruited as sex-determining trigger in different species. The id2bbY gene is the first candidate sex determining gene from the pathway that is located at a downstream position of the pathway. id2bbY is a reliable male -specific marker, which provides a versatile tool for sexing fish in aquaculture and for conservation measures. Genome sequencing and assembly. Genome sequencing and contig assemblies have been described in a previous paper 13 . Briefly, DNA was derived from fin tissue of a single adult female (ID F3) and from a single adult male (ID M14). Libraries were produced using the Truseq DNA Nano sample prep kit using the 550 pb insert size option and sequenced on a Hiseq 2500 using rapid v2 PE 2*250 nt mode. All sequences were assembled with DISCOVARdenovo (version 52488) (https:// softw are. broad insti tute. org/ softw are/ disco var/ blog/) using default parameters.
Hi-C sequencing. In situ Hi-C was performed according to previously described protocols 54   www.nature.com/scientificreports/ then disrupted with a Dounce pestle, nuclei were permeabilized using 0.5% SDS and then digested with Hin-dIII endonuclease. 5′-overhangs at HindIII-cut restriction sites were filled-in, in the presence of biotin-dCTP with the Klenow large fragment, and then re-ligated at a NheI restriction site. Nuclei were lysed and DNA was precipitated and then purified using Agencourt AMPure XP beads (Beckman Coulter) and quantified using the Qubit fluorometric quantification system (Thermo). T4 DNA polymerase was used to remove un-ligated biotinylated ends. Then, the Hi-C library was prepared according to Illumina's protocols using the Illumina TruSeq Nano DNA HT Library Prep Kit with a few modifications: 1.4 μg DNA was fragmented to 550 nt by sonication. Sheared DNA was then sized (200-600 bp) using Agencourt AMPure XP beads, and biotinylated ligation junctions were captured using M280 Streptavidin Dynabeads (Thermo) and then purified using reagents from the Nextera Mate Pair Sample preparation kit (Illumina). Using the TruSeq nano DNA kit (Illumina), the 3′ ends of blunt fragments were adenylated. Next, adaptors and indexes were ligated, and the library was amplified for 10 cycles. Library quality was assessed by quantifying the proportion of DNA cut by endonuclease NheI using a Fragment Analyzer (Advanced Analytical Technologies, Inc., Iowa, USA). Finally, the library was quantified by qPCR using the Kapa Library Quantification Kit (Roche). Sequencing was performed on an Illumina HiSeq3000 apparatus (Illumina, California, USA) using paired-end 2 × 150 nt reads. This produced 78 million read pairs (23.5 Gb of raw nucleotides).
Genome scaffolding. Contigs were scaffolded using Hi-C as a source of linking information. Reads were aligned to the draft genome using Juicer 55 with default parameters. A candidate assembly was then generated with 3D de novo assembly (3D-DNA) pipeline 56 with the − r 0 parameter. Finally, the candidate assembly was manually reviewed using Juicebox Assembly Tools 57 . Genome completeness was estimated using Benchmarking Universal Single-Copy Orthologs (BUSCO) v3.0 58 based on 4584 BUSCO orthologs derived from the Actinopterygii lineage.
Genome annotation. Repeats in the assembly were identified and masked by RepeatModeler and Repeat-Masker (http:// www. repea tmask er. org). The assembly was first scanned by RepeatModeler for de novo identification of repeats. The results, together with FishTEDB 59 , were then transferred to RepeatMasker for similarity scan and final repeat masking. Protein coding genes were annotated by collecting and synergizing gene evidence from homolog alignment, expression data mapping and ab inito prediction 60 . Briefly, for homolog alignment, 611,738 protein sequences from Swiss-Prot (www. unipr ot. org) and from related fish species were used as query. These species include Paramormyrops kingsleyae, Erpetoichthys calabaricus, Scleropages formosus, Lepisosteus oculatus, Salmo salar, Danio rerio and Callorhinchus milii (www. ensem bl. org). GeneWise 61 and Exonerate (https:// www. ebi. ac. uk/ about/ verte brate-genom ics/ softw are/ exone rate) were used independently to align protein queries to the assembly and to determine the gene structure. GenBlastA 62 were also used as support to GeneWise to roughly locate each protein sequence on the assembly. For expression data mapping, RNA-seq reads used in our previous study 13 were mapped on the assembly using HISAT 63 and parsed using StringTie 64 for gene structure. In parallel, transcripts were first assembled using Trinity 65 and then mapped to the assembly using PASA 66 . For ab initio prediction, SNAP (https:// github. com/ KorfL ab/ SNAP), GeneMark-ES 67 and AUGUSTUS 68 were used.
Finally, gene structures that were consistently predicted in each of the above parallel approaches were selected as high-quality gene models to train AUGUSTUS, and all collected gene evidences were transferred to the trained AUGUSTUS for final annotation.
The final gene models were mapped to Pfam (https:// pfam. xfam. org/), Swiss-Prot and NCBI nr database (fish only) using BLAST 69 , and genes with no hit and not supported by RNA-seq reads were removed.
Identification of the male sex-specific marker by PoolSex. Pool-sequencing libraries were prepared from male and female gDNA pools of the same individuals analyzed in Du et al. 13 , using the Illumina TruSeq Nano DNA HT Library Prep Kit (Illumina, San Diego, CA, USA) according to the manufacturer's protocol. After fragmentation of each gDNA pool (200 ng/pool) by sonication using an M220 Focused-ultrasonicator (COVA-RIS), the size selection was performed using SPB beads retaining fragments of 550 bp. Following the blunt 3′ end fragments mono-adenylation and ligation to specific paired-end adaptors, the amplification of the construction was performed using Illumina-specific primers. Library quality was verified with a Fragment Analyzer (Advanced Analytical Technologies) and then quantified by qPCR using the Kapa Library Quantification Kit Pool-sequencing datasets were analyzed with the PSASS-workflow pipeline (https:// github. com/ SexGe nomic sTool kit/ PSASS-workfl ow) that computes on a whole-genome assembly, the fixation index (FS T ), sex-specific single-nucleotide variation (SNVs, heterozygotes in one sex and homozygotes in the other sex), and male/female coverage differences. For the chromosome Manhattan plots, we used PSASS with the default settings i.e.,window-size = 50,000,-output-resolution = 1000,-group-snps = True,-freq-het = 0.5,-range-het = 0.1,-freqhom = 1, and-range-hom = 0.05. For the zoomed views of chromosomes 05 and 26, we used the small-window settings with a modified window-size = 5000,-output-resolution = 500.
Multiple alignment plots between the sex-biased region on chromosome 5 and the corresponding duplicated region on Chr26 of the male genome assembly were computed using mVista 70  Substitution rate analysis. Substitution rates were calculated using codeml in PAML 21 . Protein and coding sequences of id2 were retrieved from NCBI and Ensembl, and aligned using ClustalW 73 for protein sequence.
Coding sequence alignments were obtained from the protein sequence alignment using PAL2NAL 74 . Phylogenetic trees were built based on the alignment using maximum likelihood method in MEGA7 75 .
Genomic DNA extraction and genotyping. Muscle or caudal fins of arapaima were fixed in 100% ethanol and stored at 4 °C. The tissues were cut into smalls fragments and digested for 3 h at 60 °C in 750 µL extraction buffer (10 mM Tris, 0.1 mol/L EDTA, 0,5% SDS) and 10 µL proteinase K (20 mg/mL, Sigma-Aldrich). After digestion, 375 µL phenol and 350 µL chloroform:isoamyalcohol (24:1) were added and the phases mixed by gentle shaking for 10 min. After 12.000 rpm centrifugation at 4 °C the upper phase was transferred into a new tube, and 750 µL chloroform:isoamyalcohol (24:1) was added and mixed again for 10 min. The phases were separated by centrifugation. The genomic DNA contained in the upper phase was precipitated overnight at − 20 °C in 1400 µL absolute ethanol. After 12,000 rpm centrifugation the DNA pellet was washed twice with 70% ethanol and resuspended in 60 µL T.E. buffer (10 mM Tris-HCl and 1 mM EDTA). The genomic DNAs were quantified using NanoDrop-2000 and 100-250 ng was used in the PCR reactions.
The complete coding sequences of the male-specific gene of arapaima (id2bbY) and its autosomal copy (id2ba) were compared, and specific PCR primers (5′-CAA GTA GTC ATT CAG AAA CTT TTT CAG-3′ and 5′-GTA CGT TGG ATA TAG ATA CAC TTG GG-3′) for the Y chromosome copy were designed in the 3′UTR sequence of the gene (Fig. S5). The PCR products were resolved on 1% agarose gels.
RNA sequencing and transcriptome analysis of juvenile and adult gonads. RNA of gonads of one juvenile male (RIN 9.0) and one juvenile female (RIN 7.0) arapaima were extracted using the TRIzol Reagent (Thermo Fisher Scientific, Waltham, USA) according to the supplier's recommendation. RNA was then cleaned using RNeasy (Qiagen RNeasy Mini Kit cat#: 74104). Library processing and RNAseq were carried out by NOVOGENE (Cambridge, UK) on a NovaSeq 6000 PE 150, generating 10 Gb of data per sample.
Transcriptome sequences of male and female juvenile and adult gonads were mapped to the genome using the RNA-sequence aligner STAR (https:// github. com/ alexd obin/ STAR/ relea ses,-quantMode GeneCounts). Differentially expressed genes between testis and ovary were detected by DESeq2 81 (Bioconductor/R) for juveniles and adults. Genes were considered to be differentially expressed, if p value ≤ 0.05 AND log2FC ≤ − 2 (higher expression in male) and log2FC ≥ 2 (higher expression in female). Heat maps for sex-related genes were plotted and genes showing comparable regulation between male and female, and between adult and juvenile samples were selected.
Sexing procedures. Gonads of 10 juvenile individuals (Thuringia population) were dissected from the body cavity prior to gutting of the fish. A central gonadal section of > 1 cm length was immediately transferred into 4% Histofix (Carl Roth, Germany) for histological analyses. All specimens from Coari, Pimenta Bueno, Presidente Figueiredo and Senador La Rocque were anaesthetized with 0.01% benzocaine (Acros, Morris, NJ, USA) and euthanized by cerebral concussion. A small sample of the fin was clipped for DNA extraction and a piece of the gonad was dissected and immediately fixed in 5% glutaraldehyde in phosphate buffer (0.1 mol l −1 at pH 7.2), dehydrated and embedded in glycol methacrylate (Leica, Heidelberg, Germany). Samples were cut in 5 μm thick sections and stained with hematoxylin-eosin staining as described 82 . Animals were sexed according to da Costa Amaral et al. 83 .
The animals originated from Pentecoste were sexed by endoscopy 84 or vitellogenin detection, using the enzyme immune assay kit (Acobiom, Montpellier, France) developed specifically for Arapaima gigas according to the supplier's recommendation.
Ethics approval and consent to participate. Adult and juvenile arapaima were kept and sampled in accordance with national German legislation. Rearing of Arapaima at IGB is according to authorization ZH 114 (issued 06 February 2014) by LAGeSo, Berlin, Germany. Commercial aquaculture at Manich Foods in Thuringia is authorized by registration 16 061 111S6001 (FischSeuchV). The maintenance, usage, and sampling of experimental arapaima in Brazil were approved by the Ethics Committee for the Use of Animals Embrapa Amazonia Ocidental, (protocols 04/2016, 01/2018 and 09/2020) accredited by the National Council for the Control of Animal Experimentation, which belongs to the Ministry of Science, Technology, Innovations and Communications.

Data availability
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.