Characterization of a male specific region containing a candidate sex determining gene in Atlantic cod.

The genetic mechanisms determining sex in teleost fishes are highly variable and the master sex determining gene has only been identified in few species. Here we characterize a male-specific region of 9 kb on linkage group 11 in Atlantic cod (Gadus morhua) harboring a single gene named zkY for zinc knuckle on the Y chromosome. Diagnostic PCR test of phenotypically sexed males and females confirm the sex-specific nature of the Y-sequence. We identified twelve highly similar autosomal gene copies of zkY, of which eight code for proteins containing the zinc knuckle motif. 3D modeling suggests that the amino acid changes observed in six copies might influence the putative RNA-binding specificity. Cod zkY and the autosomal proteins zk1 and zk2 possess an identical zinc knuckle structure, but only the Y-specific gene zkY was expressed at high levels in the developing larvae before the onset of sex differentiation. Collectively these data suggest zkY as a candidate master masculinization gene in Atlantic cod. PCR amplification of Y-sequences in Arctic cod (Arctogadus glacialis) and Greenland cod (Gadus macrocephalus ogac) suggests that the male-specific region emerged in codfishes more than 7.5 million years ago.

SCIENtIFIC RePoRTS | (2019) 9:116 | DOI: 10.1038/s41598-018-36748-8 acquired this role by being recruited from conserved downstream regulators of gonadal differentiation, although the function of some actors differs among lineages 6,[27][28][29] . An exception is the novel salmonid sdY; a truncated male-specific copy of the interferon regulatory factor 9 (irf9), which is an immune-related gene not associated with sex 30 . Knowledge about how genes are functionalized and incorporated at the top of the sex regulatory cascade should broaden our understanding of the genetic regulation of sex determination and elucidate whether there are constraints on the types of genes that can be co-opted as master control switches. Atlantic cod is an economically important cold-water marine species widely distributed in the North Atlantic Ocean. Due to the decline in many cod stocks, cod farming has attracted interest but production is hampered by precocious early sexual maturation, particularly in males. This could partly be solved by production of all-female triploid stocks 4,31,32 . Gynogenetic and sex-reversed cod populations have demonstrated a XX-XY sex determination system 21,33,34 , but karyotyping of Atlantic cod and the closely related European hake (Merluccius merluccius) failed to reveal sex-linked chromosome heteromorphism 35,36 . Recently, whole genome sequence data from wild Atlantic cod was used to identify genotypes segregating closely with a XX-XY system in six putative sex determining regions distributed across five linkage groups 37 . Here we use whole genome re-sequencing data from 49 males and 53 females, together with long-read sequence data and Sanger sequencing of targeted PCR products, to characterize a Y-sequence of 9,149 base pairs on LG11 harboring a single gene which we have named zinc knuckle on the Y chromosome gene (zkY). Gene expression data from early development stages and modeling of the zinc knuckle structure offer circumstantial evidence consistent with a function in Atlantic cod sex determination.

Results and Discussion
Identification of male specific sequence on linkage group 11. Two independent approaches were used to identify a male specific genomic region in Atlantic cod. Firstly, we searched whole genome sequence data from males and females for SNPs that segregate according to an XX-XY system. Illumina short-reads (approximately 10X coverage per individual) generated from 49 males and 53 females were mapped to an initial in-house developed gadMor2.1 genome assembly, followed by variant detection. When applying stringent criteria demanding that gender specific variants should be heterozygous in all 49 males and homozygous in all 53 females, we detected 9 variants all distributed within a 15 kb region on LG11 (Supplementary File 1 and Fig. 1b). BLAST alignments revealed that these variants fell inside the 55 kb region previously reported as showing most evidence for being involved in sex determination by Star et al. 37 . Secondly, we analyzed resequencing data to identify positions in the genome that displayed significant differences in read depth between males and females. Two regions showed an almost complete absence of female reads while displaying >500 reads from males, i.e. Y-specific characteristics. These regions were separated by an intervening sequence displaying roughly 2-times coverage in females versus males, i.e. a possible X-sequence. Both the Y-and X-sequence regions fell within an interval flanked by the nine sex-linked markers on LG11 lending further support to the significance of this region in sex The Y-and X-sequences (grey boxes), including the Y-specific zkY gene, and the nine polymorphisms heterozygous in all males and homozygous in all females (black dots). Long reads confirming assembly integrity were generated using Oxford Nanopore (green) and Pacific Biosciences (blue) sequencing technologies. determination. Closer examination of the architecture of this region using IGV 38 revealed that despite a high read-depth, there was no evidence of single reads bridging, or read-pairs spanning, these differential read-depth sub-regions. Collectively these anomalies are suggestive of assembly errors in the initial gadMor2.1 assembly, and a hybridization of X-and Y-sequences, and underscores the importance of developing high-quality reference genomes.
In order to resolve this complex region we performed nested, long-range PCR to generate a Y-specific fragment, which was then sequenced using an Illumina MiSeq. Assembly construction using Gap Filler generated a 6 kb sequence, which was integrated into the initial gadMor2.1 assembly to produce a complete Y-sequence of 9,149 bp ( Fig. 1 and Supplementary File 2). The sequencing revealed an imperfect repeated element positioned close to the end of the Y-sequence (see underlined sequence in Supplementary File 2). An X-sequence of 425 bp was constructed by Sanger sequencing PCR fragments generated from females using primers flanking the Y-sequence. To confirm the integrity of these manually curated X-and Y-sequences we aligned publicly available long-read data (Pacific BioSciences) derived from the male used to generate the reference (accession numbers at ENA (http://www.ebi.ac.uk/ena), ERX1787826-ERX1787972) and long reads generated in-house using an Oxford Nanopore MinION device, also from a male. Although overall coverage was low, we were able to identify long-reads that aligned specifically to either the Y-or X-sequences and, in combination, spanned their entire lengths (Supplementary File 3 and Fig. 1b).
The Y-sequence contains a single gene. X-and Y-sequences of the public gadMor2.1 assembly were annotated using standard gene-model predictive software and RNA-Seq data produced from early larval stages together with publicly available RNA-Seq data from Atlantic cod (See Material and Methods). This revealed a single gene (which we have named zkY) in the Y-sequence (Supplementary File 2 and Fig. 1), which is inserted in a synteny block that is highly conserved in teleosts and in spotted gar (Lepisosteus oculatus) (Supplementary Figure 1). The intronless cod zkY codes for a zinc knuckle protein characterized by the zinc knuckle consensus sequence Cys-X2-Cys-X4-His-X4-Cys (X = any amino acid). This motif is mainly found in the RNA-binding retroviral nucleocapsid (NC) proteins, but also in various eukaryotic proteins binding single-stranded nucleotide targets [39][40][41][42] . The flanking basic residues bind nucleic acids non-specifically through electrostatic interactions with the phosphodiester backbone of nucleic acids 43 .
Zinc knuckle proteins are members of the large family of zinc finger proteins possessing a versatility of tetrahedral Cys-and His-containing motifs that bind to DNA and RNA target sites 44,45 . The DNA-binding domain of the DMRT transcription factors, including the sex determinant DmY in medaka, consists of intertwined CCHC and HCCC motifs binding in the minor groove of DNA 46 . The zinc finger protein ZFAND3 is essential for spermatogenesis in mice and the polymorphic tilapia zfand3 was recently mapped in the sex determining locus 47,48 . Hence, the recruitment of zinc finger proteins as the sex determinant seems to have occurred several times in teleosts.
Multiple autosomal copies of cod zkY. The sex determining gene in several teleosts exhibits an autosomal copy that differs from the male-specific gene in spatio-temporal expression patterns and/or functionality 28,49-52 . We identified 12 autosomal copies of cod zkY mapping to seven different linkage groups and one unassembled scaffold. Premature stop codons and single base indels were found in four of the genes encoding putatively non-functional proteins lacking the zinc knuckle domain. The remaining eight genes code for zinc knuckle proteins named zk1 to zk8, which differ in length from 143 to 633 amino acids (Supplementary File 4). Sequence alignment revealed several amino acid substitutions in the zinc knuckle domain in addition to the variable size of an imperfect repeat of basic residues preceding the domain ( Fig Functional implications of the amino acid substitutions in the zinc knuckle domain of the autosomal proteins was inferred by exploring the predicted 3D structure and the interactions with a putative RNA oligonucleotide target. The suitability of the d(ACGCC) sequence in the structure modeling was supported by the stacking of the conserved Trp442 between the two bases G3 and C4 in agreement with the interactions between specific base moieties and Trp at the corresponding position in NC proteins [53][54][55] . The three cysteines Cys443, Cys436 and Cys446 together with His441 coordinate the tetrahedral binding of the Zn 2+ ion in the modelled cod zinc knuckles. While the replacement of Met433 in zkY, zk1 and zk2 with the Ile residue in the four zk3-zk7 proteins maintains the hydrophobic interactions with Cys446, the Ile433 residue is predicted to interact with the G3 base of the pentanucleotide (Fig. 2C). The RNA-binding specificity of human HIV-1 was consistently altered by the Met-> Val and Met-> Lys substitutions in the same position of the second zinc knuckle 56 . The Asn435Lys change in the zk8 protein may affect the tetrahedral Zn 2+ ion coordination by interacting with Cys446, while Gln443 is predicted to form a hydrogen bond with the phosphate group connecting A1 and C2 (Fig. 2D). Additionally, electrostatic interactions are predicted between Arg437 and His434, which are moved away from its C2 contacts observed in the other variants.
Altogether, the predicted alterations in the contacts between the replaced amino acids and the interactions with the oligonucleotide sequence template suggest functional differences in the putative RNA-binding activity of the zinc knuckle proteins examined. In addition, the functional role played by the flanking basic residues in the non-specific binding of nucleic acids 44 suggests that these interactions might be altered by the extended flanking repeat in the zk6 and zk7 proteins. Based on the identical zinc knuckle domain and the similar flanking sequences in the three proteins zkY, zk1 and zk2, we decided to compare the larval expression of these coding genes.
Increased transcription of cod zkY prior to onset of sex differentiation. A prerequisite for being a sex determining gene is to show pronounced transcription levels prior to onset of sex differentiation, which in Atlantic cod likely occurs at start feeding around 35 days post hatching (dph) based on the high number of germ cells in females compared to males together with the female-biased expression of cyp19a1a 21,57 . RNA-Seq analysis of cod zkY and the autosomal zk1 and zk2 copies showed that the three genes were expressed at variable levels from hatching until 22 dph at which stage the transcription levels of zkY increased and stabilized at relatively high levels from 30 dph onwards (Fig. 3). In contrast, the larval expression of zk1 and zk2 was almost undetectable or very low from 11 dph. The increased larval expression of the male-specific zkY gene prior to the onset of sex differentiation is consistent with a possible function in sex determination. Multiple germline-specific RNA regulatory proteins are essential for germ cell specification, maintenance, migration and proliferation in various organisms 58 . While this regulatory network seems to be evolutionary conserved in nematodes, flies and mammals, the key role played by different sex determinants in the control of male germ cell proliferation in fish suggests the involvement of lineage-specific regulators 6 . Knockdown of the germline zinc knuckle helicases glh1-2 in C. elegans and the vasa homolog in mice resulted in male infertility [59][60][61] . Intriguingly, the loss of zinc knuckles in the RNA-binding Vasa

X-and Y-sequences in other gadoid species.
To elucidate the emergence of this putative sex determining region in other cod fishes we scanned draft genome assemblies from 12 species within the family Gadidae 63 for sequences matching the Atlantic cod Y-sequence. In its entirety, the 9,149 bp male-specific region showed only partial hits suggesting that the Y-sequence is absent or incorrectly assembled in these draft assemblies. Therefore, to find evidence for the sex determination region in other codfish species we were obliged to use a PCR based approach. Although the multi-species sequence alignments were highly fragmented, a close examination revealed isolated regions within and immediately outside the male-specific region of Atlantic cod which were, to some extent, conserved across species. These regions allowed us to develop primers, which amplified a specific fragment of the Y-sequence in both Greenland cod (Gadus macrocephalus ogac) and Arctic cod (Arctogadus glacialis), in addition to Atlantic cod (see alignment in Supplementary File 6). The X-sequence was efficiently amplified and sequenced in Greenland cod, Arctic cod, polar cod (Boreogadus saida), haddock (Melanogrammes aeglefinus) and burbot (Lota lota) (see alignment in Supplementary File 7). This result documents that the Y-specific sequence was present prior to the separation of Arctic cod from Atlantic cod more than 7.5 MYA 63,64 while the presence of X-sequence in burbot suggests that this arrangement predates the Lotinae -Gadinae split 45 MYA (Supplementary Figure 3). A diagnostic test for distinguishing male and female Atlantic cod is potentially useful for researchers and the emerging aquaculture industry. To efficiently determine gender, we developed a simple PCR reaction including two different forward primers (one Y-sequence specific and the other matching a common sequence upstream of the X-and Y-regions) and a common reverse primer (annealing to a sequence downstream of the X-and Y-regions) resulting in a single band in females and double in males (Fig. 4).
Sex determination is a complex process involving the coordinated actions of multiple factors, and a greater understanding of what genes are involved in the sex determination cascade is needed. Furthermore, this understanding must be developed in multiple species to learn whether these processes are species or taxa specific and to improve our knowledge of the evolution of sex determination. The main focus of this paper has been to improve our knowledge of the genomic basis for sex determination in Atlantic cod. Our results indicate that the zkY gene present within the male specific region on LG11 is involved in Atlantic cod sex determination, and documents the presence of X-and Y-sequences in other gadoids.

Materials and Methods
Ethics statement. To produce whole genome sequence data, fin clips were collected non-destructively in strict accordance with the welfare rules given by the National Animal Research Authority (NARA). Samples were collected as a part of an ongoing, long term project authorized and managed by Nofima AS which seeks to maintain a biobank of samples from parents of the National cod breeding program nucleus. For RNA sequencing (larvae 1-35 dph), permission for sampling biological material is not required for fish eggs and larvae collected before start-feeding according to NARA. Construction of the initial gadMor2.1 assembly and variant detection. The initial gadMor2.1 assembly was constructed by integrating a dense linkage map with two public draft assemblies (NEWB454 and CA454ILM https://figshare.com/articles/Transcript_and_genome_assemblies_of_Atlantic_cod/3408247) generated from sequencing the same male. The assemblies were constructed using different combinations of short-read sequencing data and different assembly programs (Newbler and Celera respectively) and displayed different qualities with NEWB454 having longer scaffold N50s and more gaps than CA454ILM 66 . The linkage map containing 9354 SNP markers was produced after genotyping a pedigree of 2951 fishes. Chimeric scaffolds were broken at junctions between contigs containing SNPs from different LGs. Overlapping scaffolds were identified by comparing SNPs mapping to both assemblies and were merged using coordinates from alignment with LASTZ 67 generating scaffolds that were used to build the final chromosome sequences. Finally, all scaffolds were oriented, ordered and concatenated into a new chromosome sequence based on information from the linkage map. Variants were detected by first filtering raw reads from each individual using Trimmomatic v0.32 68 , and subsequently aligning reads to the unmasked initial gadMor2.1 assembly using Bowtie2 v2.3.2 with the parameter 'sensitive' 69 . The resulting BAM files were merged by gender using Sambamba v0.6.5 70 and GATK HaplotypeCaller v3.8 71 was used to call variants with the following parameters: gt_mode DISCOVERY, minPruning 3. A Perl script was used to report SNPs, and their positions, that display heterozygous genotypes in all males and homozygous genotypes in all females. Read depth differences between males and females. For each sex, quality filtered reads from males (n = 49) and females (n = 53) were combined to generate two gender-specific bed files. Bedtools genomecov version 2.22.0 72 was then used to count read depth at each position in the public gadMor2.1 assembly. A Perl script was then used to calculate the average read depth per individual across successive 1000 bp intervals with an overlap of 500 bp (positions with 0 coverage in males and females were ignored). For each average, the absolute difference between the larger and smaller number was calculated and plotted. Across the genome, the region with the most extreme read depth differences was seen on LG11 (Fig. 1a).

Construction of Y-and X-sequences.
A nested PCR strategy was used to amplify the Y-specific sequence lacking from the initial gadMor2.1 assembly. The initial reaction used the following primers and conditions, (Fwd; 5′-CCTAAACCACAGTCCTGGGC-3′, Rev; 5′-ACATTGTGCACACACATTGTATC-3′, PrimeSTAR GXL DNA Polymerase (Takara, Japan), cycling 30 times with 10 s at 98 °C, 15 s at 57 °C, 3 min at 68 °C, final extension 72 °C for 10 min. PCR-product was used as template in a subsequent reaction using the same conditions but with the nested PCR primers (Fwd; 5′-CTCTGTAGTTTGTGGTGGGGT-3′, Rev; 5′-ACATGACGAATGGCCTCCTTT-3′). The resulting PCR product was purified using a QIAquick PCR purification kit from QIAGEN (Germany) and quantified. DNA was prepared for sequencing using a Nextera XT kit from Illumina (USA) according to manufacturer's instructions and the resulting library sequenced using 2 × 250 nt sequencing chemistry using a MiSeq (Illumina, USA). After quality trimming the resulting reads were anchored to the existing flanking Y-sequences using Gap filler 73 . The resulting sequence contained a single gap that was filled by Sanger sequencing a PCR product generated using the following PCR primers and conditions (Fwd; 5′-ACACAACGCAGAGTCTGTCC-3′, Rev: 5′-TCAGCTAGTCTCGCAATGGC-3′, denaturation 15 min at 95 °C, then cycling 30 times with 10 s at 98 °C, 15 s at 98 °C, 60 s at 68 °C, final extension 10 min at 72 °C). An X-sequence was constructed from Sanger sequencing a PCR product produced using primers annealing up-and down-stream of the Y-sequence.

Generation X-and Y-sequences in other gadoids. Sequence alignments of male specific region in
Atlantic cod with public assembly data from 12 gadid species 63 , using MUMmer version 3.23 74 , identified four rather short regions that showed good conservation across species, including two regions within the Y-sequence (B; 12,520,541-12,520,838 and C; 12,527,782-12,5280,27) and two regions (A; 12,518,827-12,519,231 and D; 12,528,480-12,529,285) immediately flanking the X-and Y-sequences. Primers were designed using Primer3 75 to amplify from within the Y-sequence conserved sequence C to flanking sequence D (i.e. a Y-specific PCR) and from flanking sequence A to D (an X-sequence specific PCR). Products from these reactions were Sanger sequenced and aligned using MAFFT v7.402 76  Nanopore Sequencing. High-molecular weight DNA was extracted using a phenol-chloroform method 77 .
Sequencing libraries were prepared using SQK-RAD003 and SQK-LSK108 kits and protocols from Oxford Nanopore Technology (Oxford, UK) and sequenced on a MinION device generating a combined total of 3.9 Gb sequence data with an N50 read length of 4.2 kb. Reads were aligned to the public gadMor2.1 assembly using GraphMap aligner v0.5.2 78 .
Transcript profiling of cod larvae. Cod larvae were hatched at the National Atlantic Cod Breeding Centre in Tromsø, Norway, after the incubation of fertilized eggs in seawater rearing tanks at 4.5 °C. For 1 and 7 dph larvae, RNA was extracted from a pool of 10 individuals using the QIAGEN AllPrep DNA/RNA/miRNA Universal kit (QIAGEN; Germany). Samples were prepared for sequencing using TruSeq Stranded mRNA kit from Illumina (USA) and sequenced using an Illumina HiSeq 2000 to produce 362 million reads. For the samples 12 and 35 dph samples, RNA was extracted using an RNeasy kit (QIAGEN, Germany) prepared for sequencing using TruSeq Stranded mRNA kit from Illumina (USA) and sequenced using an Illumina MiSeq (2 × 250nt) to produce 4.4 million reads. All reads were trimmed using Trimmomatic v0.32 68 80 . Gene models were tested by performing (i) open reading frame (ORF) prediction using TransDecoder 83 using both pfamA and pfamB databases for homology searches and a minimum length of 30 amino acids for ORFs without pfam support, and (ii) BLASTP analysis (evalue <1e-10) for all predicted proteins against zebrafish (Danio rerio) (v9.75) and three-spined stickleback (Gasterosteus aculeatus) (BROADS1.75) annotations from Ensembl. Only gene models with support from at least one of these homology searches were retained. Functional annotation of the predicted transcripts was done using blastx against the SwissProt database.
Modeling the zinc knuckle domain. The three-dimensional structure of the three different variants of the zinc knuckle domain were built using a threading approach which combines three-dimensional fold recognition by sequence alignment with template crystal structures and model structure refining. The query-template alignment was generated by HHPRED (https://toolkit.tuebingen.mpg.de) and then submitted to the program MODELLER 84 as implemented in Discovery Studio (Dassault Systèmes BIOVIA). Query coverage and e-value score were considered to define a suitable template structure. The structure of nucleocapsid protein NCp10 of retrovirus MoMuLV, which contains a single Cys-X2-Cys-X4-His-X4-Cys zinc knuckle domain bound to the oligonucleotide d(ACGCC) was selected as template (https://www.rcsb.org/structure/1a6b). Zinc ions and oligonucleotides were explicitly considered through molecular modeling steps. Fifty models, optimized by a short simulated annealing refinement protocol available in MODELLER, were generated and their consistency was evaluated on the basis of the probability density function violations provided by the program. Stereochemistry of selected models was checked using the program PROCHECK 85 . Visualization and manipulation of molecular images were performed with Discovery Studio (Dassault Systèmes BIOVIA).

Data Availability
The gadMor2.1 genome assembly, long-range PCR MiSeq Illumina data are available at https://figshare. com/s/313f8fe1fdcc82571a99. The 102 individual samples read data are available at ENA, with the study accession number PRJEB12803. The MiSeq whole NEAC larvae at 12,35 dph and 1,7 dph illumina samples are available at ENA, with the study accession number PRJEB25591. All data generated or analyzed during this study are included in this published article and its Supplementary Information files.