The origin and evolution of sex chromosomes from autosomes, and the mechanism of sex determination have long been subjects of interest to biologists. In eutherian mammals and birds, the sex chromosomes are highly dimorphic and have degenerated with extensive gene losses1,2,3. However in teleost fishes, cytogenetically different sex chromosomes are found in less than 10% of the species examined4,5, and their recent origin in several lineages makes them good models to study the early stages of divergence. The frequent turnover of sex chromosomes seems to be associated with the variety of sex determining genes, even in closely related species4,5,6. For example, in medaka fishes (genus Oryzias), the Y-chromosomal dmY or dmrt1bY copy determines the sex in O. latipes and O. curvinotus, while testicular differentiation in O. dancena and in O. luzonensis is initiated by male-specific regulatory elements upstream of sox3 and gsdf (gonadal soma-derived factor), respectively7,8,9,10. The various mechanisms of genetic sex determination in fish range from sex–specific alleles of the anti-Müllerian hormone (Amh) in Nile tilapia (Oreochromis niloticus) and the Amh type 2 receptor (Amhr2) in tiger pufferfish (Takifugu rubripes)11,12 to complex polygenic regulation involving several genomic regions as shown in the cichlid Astatotilapia burtoni13. The diversity of mechanisms is further increased by the insertion of X- and Y- sequences modulating the expression of the neighboring master sex determinant gene, as found in the sablefish (Anoplopoma fimbria)14.

Sexual differentiation in vertebrates is initially marked by highly increased proliferation of the germ cells in the presumptive ovaries compared to that in testes6,15,16,17,18,19. The time course of the sex-dimorphic germ cell proliferation varies greatly among teleost species and occurs around hatching in medaka20, at the start-feeding stage in Atlantic cod21 and in late juvenile stages in sea bass (Dicentrarchus labrax)22. Male germ cell divisions are inhibited by Dmy in the medaka O. latipes23, and the sex determinants Gdsf, Amh and Amhr2 of the TGFβ signaling pathway might play similar roles in inducing sex differentiation in other species6. However, the requirement of germ cells for gonadal development appears to vary among teleost species24,25,26 reflecting that sex determination might be triggered by alternative mechanisms. Most sex determinants found in vertebrates are thought to have acquired this role by being recruited from conserved downstream regulators of gonadal differentiation, although the function of some actors differs among lineages6,27,28,29. An exception is the novel salmonid sdY; a truncated male-specific copy of the interferon regulatory factor 9 (irf9), which is an immune-related gene not associated with sex30. Knowledge about how genes are functionalized and incorporated at the top of the sex regulatory cascade should broaden our understanding of the genetic regulation of sex determination and elucidate whether there are constraints on the types of genes that can be co-opted as master control switches.

Atlantic cod is an economically important cold-water marine species widely distributed in the North Atlantic Ocean. Due to the decline in many cod stocks, cod farming has attracted interest but production is hampered by precocious early sexual maturation, particularly in males. This could partly be solved by production of all-female triploid stocks4,31,32. Gynogenetic and sex-reversed cod populations have demonstrated a XX-XY sex determination system21,33,34, but karyotyping of Atlantic cod and the closely related European hake (Merluccius merluccius) failed to reveal sex-linked chromosome heteromorphism35,36. Recently, whole genome sequence data from wild Atlantic cod was used to identify genotypes segregating closely with a XX-XY system in six putative sex determining regions distributed across five linkage groups37. Here we use whole genome re-sequencing data from 49 males and 53 females, together with long-read sequence data and Sanger sequencing of targeted PCR products, to characterize a Y-sequence of 9,149 base pairs on LG11 harboring a single gene which we have named zinc knuckle on the Y chromosome gene (zkY). Gene expression data from early development stages and modeling of the zinc knuckle structure offer circumstantial evidence consistent with a function in Atlantic cod sex determination.

Results and Discussion

Identification of male specific sequence on linkage group 11

Two independent approaches were used to identify a male specific genomic region in Atlantic cod. Firstly, we searched whole genome sequence data from males and females for SNPs that segregate according to an XX-XY system. Illumina short-reads (approximately 10X coverage per individual) generated from 49 males and 53 females were mapped to an initial in-house developed gadMor2.1 genome assembly, followed by variant detection. When applying stringent criteria demanding that gender specific variants should be heterozygous in all 49 males and homozygous in all 53 females, we detected 9 variants all distributed within a 15 kb region on LG11 (Supplementary File 1 and Fig. 1b). BLAST alignments revealed that these variants fell inside the 55 kb region previously reported as showing most evidence for being involved in sex determination by Star et al.37. Secondly, we analyzed resequencing data to identify positions in the genome that displayed significant differences in read depth between males and females. Two regions showed an almost complete absence of female reads while displaying >500 reads from males, i.e. Y-specific characteristics. These regions were separated by an intervening sequence displaying roughly 2-times coverage in females versus males, i.e. a possible X-sequence. Both the Y- and X-sequence regions fell within an interval flanked by the nine sex-linked markers on LG11 lending further support to the significance of this region in sex determination. Closer examination of the architecture of this region using IGV38 revealed that despite a high read-depth, there was no evidence of single reads bridging, or read-pairs spanning, these differential read-depth sub-regions. Collectively these anomalies are suggestive of assembly errors in the initial gadMor2.1 assembly, and a hybridization of X- and Y-sequences, and underscores the importance of developing high-quality reference genomes.

Figure 1
figure 1

Genomic organization of the Y- and X-sequences on LG11 in Atlantic cod. (a) Comparison of read depth between male (n = 49) and female (n = 53) samples reveals an excess of Y-specific reads across a 9 kb interval on LG11 in the final and public gadMor2.1 assembly. Positions of cod zkY and the neighboring genes are indicated. (b) The Y- and X-sequences (grey boxes), including the Y-specific zkY gene, and the nine polymorphisms heterozygous in all males and homozygous in all females (black dots). Long reads confirming assembly integrity were generated using Oxford Nanopore (green) and Pacific Biosciences (blue) sequencing technologies.

In order to resolve this complex region we performed nested, long-range PCR to generate a Y-specific fragment, which was then sequenced using an Illumina MiSeq. Assembly construction using Gap Filler generated a 6 kb sequence, which was integrated into the initial gadMor2.1 assembly to produce a complete Y-sequence of 9,149 bp (Fig. 1 and Supplementary File 2). The sequencing revealed an imperfect repeated element positioned close to the end of the Y-sequence (see underlined sequence in Supplementary File 2). An X-sequence of 425 bp was constructed by Sanger sequencing PCR fragments generated from females using primers flanking the Y-sequence. To confirm the integrity of these manually curated X- and Y-sequences we aligned publicly available long-read data (Pacific BioSciences) derived from the male used to generate the reference (accession numbers at ENA (, ERX1787826-ERX1787972) and long reads generated in-house using an Oxford Nanopore MinION device, also from a male. Although overall coverage was low, we were able to identify long-reads that aligned specifically to either the Y- or X-sequences and, in combination, spanned their entire lengths (Supplementary File 3 and Fig. 1b).

The Y-sequence contains a single gene

X- and Y-sequences of the public gadMor2.1 assembly were annotated using standard gene-model predictive software and RNA-Seq data produced from early larval stages together with publicly available RNA-Seq data from Atlantic cod (See Material and Methods). This revealed a single gene (which we have named zkY) in the Y-sequence (Supplementary File 2 and Fig. 1), which is inserted in a synteny block that is highly conserved in teleosts and in spotted gar (Lepisosteus oculatus) (Supplementary Figure 1). The intronless cod zkY codes for a zinc knuckle protein characterized by the zinc knuckle consensus sequence Cys-X2-Cys-X4-His-X4-Cys (X = any amino acid). This motif is mainly found in the RNA-binding retroviral nucleocapsid (NC) proteins, but also in various eukaryotic proteins binding single-stranded nucleotide targets39,40,41,42. The flanking basic residues bind nucleic acids non-specifically through electrostatic interactions with the phosphodiester backbone of nucleic acids43.

Zinc knuckle proteins are members of the large family of zinc finger proteins possessing a versatility of tetrahedral Cys- and His-containing motifs that bind to DNA and RNA target sites44,45. The DNA-binding domain of the DMRT transcription factors, including the sex determinant DmY in medaka, consists of intertwined CCHC and HCCC motifs binding in the minor groove of DNA46. The zinc finger protein ZFAND3 is essential for spermatogenesis in mice and the polymorphic tilapia zfand3 was recently mapped in the sex determining locus47,48. Hence, the recruitment of zinc finger proteins as the sex determinant seems to have occurred several times in teleosts.

Multiple autosomal copies of cod zkY

The sex determining gene in several teleosts exhibits an autosomal copy that differs from the male-specific gene in spatio-temporal expression patterns and/or functionality28,49,50,51,52. We identified 12 autosomal copies of cod zkY mapping to seven different linkage groups and one unassembled scaffold. Premature stop codons and single base indels were found in four of the genes encoding putatively non-functional proteins lacking the zinc knuckle domain. The remaining eight genes code for zinc knuckle proteins named zk1 to zk8, which differ in length from 143 to 633 amino acids (Supplementary File 4). Sequence alignment revealed several amino acid substitutions in the zinc knuckle domain in addition to the variable size of an imperfect repeat of basic residues preceding the domain (Fig. 2A, Supplementary File 5 and Supplementary Figure 2).

Figure 2
figure 2

Amino acid substitutions in the zinc knuckle domain in cod zkY and its autosomal copies. (A) Sequence alignment of the zinc knuckle domain (boxed) and flanking regions in cod zkY and the eight autosomal protein variants. The characteristic Cys-Cys-His-Cys residues of the zinc knuckle are arrowed, substituted amino acids are indicated by colors and correspond to the labelled positions 435, 437, and 443 in (B-D). Repeated segments adjacent to the zinc knuckle are shaded. (B–D) Modeled structure of the three different zinc knuckle domains of cod zkY and the autosomal proteins. The Zn2+ ion is displayed together with the oligonucleotide d(ACGCC) template (see Methods). (B) Met443 variant of zkY, zk1, zk2, (C) Ile443 variant of zk3-zk7, (D) Lys435-Arg437-Gln443 variant of zk8. Hydrogen bonds are indicated by dotted green lines, Pi-donor hydrogen bonds in light blue, hydrophobic interactions by dotted magenta lines, and electrostatic bonds by dotted orange lines.

Functional implications of the amino acid substitutions in the zinc knuckle domain of the autosomal proteins was inferred by exploring the predicted 3D structure and the interactions with a putative RNA oligonucleotide target. The suitability of the d(ACGCC) sequence in the structure modeling was supported by the stacking of the conserved Trp442 between the two bases G3 and C4 in agreement with the interactions between specific base moieties and Trp at the corresponding position in NC proteins53,54,55. The three cysteines Cys443, Cys436 and Cys446 together with His441 coordinate the tetrahedral binding of the Zn2+ ion in the modelled cod zinc knuckles. While the replacement of Met433 in zkY, zk1 and zk2 with the Ile residue in the four zk3-zk7 proteins maintains the hydrophobic interactions with Cys446, the Ile433 residue is predicted to interact with the G3 base of the pentanucleotide (Fig. 2C). The RNA-binding specificity of human HIV-1 was consistently altered by the Met- > Val and Met- > Lys substitutions in the same position of the second zinc knuckle56. The Asn435Lys change in the zk8 protein may affect the tetrahedral Zn2+ ion coordination by interacting with Cys446, while Gln443 is predicted to form a hydrogen bond with the phosphate group connecting A1 and C2 (Fig. 2D). Additionally, electrostatic interactions are predicted between Arg437 and His434, which are moved away from its C2 contacts observed in the other variants.

Altogether, the predicted alterations in the contacts between the replaced amino acids and the interactions with the oligonucleotide sequence template suggest functional differences in the putative RNA-binding activity of the zinc knuckle proteins examined. In addition, the functional role played by the flanking basic residues in the non-specific binding of nucleic acids44 suggests that these interactions might be altered by the extended flanking repeat in the zk6 and zk7 proteins. Based on the identical zinc knuckle domain and the similar flanking sequences in the three proteins zkY, zk1 and zk2, we decided to compare the larval expression of these coding genes.

Increased transcription of cod zkY prior to onset of sex differentiation

A prerequisite for being a sex determining gene is to show pronounced transcription levels prior to onset of sex differentiation, which in Atlantic cod likely occurs at start feeding around 35 days post hatching (dph) based on the high number of germ cells in females compared to males together with the female-biased expression of cyp19a1a21,57. RNA-Seq analysis of cod zkY and the autosomal zk1 and zk2 copies showed that the three genes were expressed at variable levels from hatching until 22 dph at which stage the transcription levels of zkY increased and stabilized at relatively high levels from 30 dph onwards (Fig. 3). In contrast, the larval expression of zk1 and zk2 was almost undetectable or very low from 11 dph.

Figure 3
figure 3

Larval expression of zkY, zk1 and zk2 from hatching to juvenile stage. Expression levels are given as relative transcript read numbers. The onset of sex differentiation is indicated21.

The increased larval expression of the male-specific zkY gene prior to the onset of sex differentiation is consistent with a possible function in sex determination. Multiple germline-specific RNA regulatory proteins are essential for germ cell specification, maintenance, migration and proliferation in various organisms58. While this regulatory network seems to be evolutionary conserved in nematodes, flies and mammals, the key role played by different sex determinants in the control of male germ cell proliferation in fish suggests the involvement of lineage-specific regulators6. Knockdown of the germline zinc knuckle helicases glh1-2 in C. elegans and the vasa homolog in mice resulted in male infertility59,60,61. Intriguingly, the loss of zinc knuckles in the RNA-binding Vasa in vertebrates and insects was suggested to have coincided with the emergence of hitherto unknown zinc-knuckle cofactors conferring target specificity62.

X- and Y-sequences in other gadoid species

To elucidate the emergence of this putative sex determining region in other cod fishes we scanned draft genome assemblies from 12 species within the family Gadidae63 for sequences matching the Atlantic cod Y- sequence. In its entirety, the 9,149 bp male-specific region showed only partial hits suggesting that the Y- sequence is absent or incorrectly assembled in these draft assemblies. Therefore, to find evidence for the sex determination region in other codfish species we were obliged to use a PCR based approach. Although the multi-species sequence alignments were highly fragmented, a close examination revealed isolated regions within and immediately outside the male-specific region of Atlantic cod which were, to some extent, conserved across species. These regions allowed us to develop primers, which amplified a specific fragment of the Y-sequence in both Greenland cod (Gadus macrocephalus ogac) and Arctic cod (Arctogadus glacialis), in addition to Atlantic cod (see alignment in Supplementary File 6). The X-sequence was efficiently amplified and sequenced in Greenland cod, Arctic cod, polar cod (Boreogadus saida), haddock (Melanogrammes aeglefinus) and burbot (Lota lota) (see alignment in Supplementary File 7). This result documents that the Y-specific sequence was present prior to the separation of Arctic cod from Atlantic cod more than 7.5 MYA63,64 while the presence of X-sequence in burbot suggests that this arrangement predates the LotinaeGadinae split 45 MYA (Supplementary Figure 3).

A diagnostic test for distinguishing male and female Atlantic cod is potentially useful for researchers and the emerging aquaculture industry. To efficiently determine gender, we developed a simple PCR reaction including two different forward primers (one Y-sequence specific and the other matching a common sequence upstream of the X- and Y- regions) and a common reverse primer (annealing to a sequence downstream of the X- and Y-regions) resulting in a single band in females and double in males (Fig. 4).

Figure 4
figure 4

Agarose gel separation of PCR products from male and female Atlantic cod. (Cropped from a larger gel image presented in Supplementary Figure 4).

Sex determination is a complex process involving the coordinated actions of multiple factors, and a greater understanding of what genes are involved in the sex determination cascade is needed. Furthermore, this understanding must be developed in multiple species to learn whether these processes are species or taxa specific and to improve our knowledge of the evolution of sex determination. The main focus of this paper has been to improve our knowledge of the genomic basis for sex determination in Atlantic cod. Our results indicate that the zkY gene present within the male specific region on LG11 is involved in Atlantic cod sex determination, and documents the presence of X- and Y- sequences in other gadoids.

Materials and Methods

Ethics statement

To produce whole genome sequence data, fin clips were collected non-destructively in strict accordance with the welfare rules given by the National Animal Research Authority (NARA). Samples were collected as a part of an ongoing, long term project authorized and managed by Nofima AS which seeks to maintain a biobank of samples from parents of the National cod breeding program nucleus. For RNA sequencing (larvae 1–35 dph), permission for sampling biological material is not required for fish eggs and larvae collected before start-feeding according to NARA.

Sampling and Illumina whole-genome sequencing

Tissue was collected from 49 males and 53 females from the National Atlantic cod breeding program maintained by Nofima, Tromsø, Norway (available at, PRJEB12803 and Supplementary File 8). The fish were parents in year classes 2005 or 2006 and represented the second generation of cod produced in captivity. The original broodstock in the base population were sampled from different geographical areas along the Norwegian coast65. DNA was extracted using the DNeasy kit manufactured by QIAGEN (Germany) according to manufacturer’s instructions. Libraries were prepared using the Truseq Library prep kit (Illumina, San Diego, USA), and sequenced using an Illumina HiSeq 2500 instrument. A total of 13.1 billion paired-reads (2 × 100 PE) were produced, averaging 129 million paired reads per individual with coverage from 12.4 to 19.0X.

Construction of the initial gadMor2.1 assembly and variant detection

The initial gadMor2.1 assembly was constructed by integrating a dense linkage map with two public draft assemblies (NEWB454 and CA454ILM generated from sequencing the same male. The assemblies were constructed using different combinations of short-read sequencing data and different assembly programs (Newbler and Celera respectively) and displayed different qualities with NEWB454 having longer scaffold N50s and more gaps than CA454ILM66. The linkage map containing 9354 SNP markers was produced after genotyping a pedigree of 2951 fishes. Chimeric scaffolds were broken at junctions between contigs containing SNPs from different LGs. Overlapping scaffolds were identified by comparing SNPs mapping to both assemblies and were merged using coordinates from alignment with LASTZ67 generating scaffolds that were used to build the final chromosome sequences. Finally, all scaffolds were oriented, ordered and concatenated into a new chromosome sequence based on information from the linkage map.

Variants were detected by first filtering raw reads from each individual using Trimmomatic v0.3268, and subsequently aligning reads to the unmasked initial gadMor2.1 assembly using Bowtie2 v2.3.2 with the parameter ‘sensitive’69. The resulting BAM files were merged by gender using Sambamba v0.6.570 and GATK HaplotypeCaller v3.871 was used to call variants with the following parameters: gt_mode DISCOVERY, minPruning 3. A Perl script was used to report SNPs, and their positions, that display heterozygous genotypes in all males and homozygous genotypes in all females.

Read depth differences between males and females

For each sex, quality filtered reads from males (n = 49) and females (n = 53) were combined to generate two gender-specific bed files. Bedtools genomecov version 2.22.072 was then used to count read depth at each position in the public gadMor2.1 assembly. A Perl script was then used to calculate the average read depth per individual across successive 1000 bp intervals with an overlap of 500 bp (positions with 0 coverage in males and females were ignored). For each average, the absolute difference between the larger and smaller number was calculated and plotted. Across the genome, the region with the most extreme read depth differences was seen on LG11 (Fig. 1a).

Construction of Y- and X-sequences

A nested PCR strategy was used to amplify the Y-specific sequence lacking from the initial gadMor2.1 assembly. The initial reaction used the following primers and conditions, (Fwd; 5′-CCTAAACCACAGTCCTGGGC-3′, Rev; 5′-ACATTGTGCACACACATTGTATC-3′, PrimeSTAR GXL DNA Polymerase (Takara, Japan), cycling 30 times with 10 s at 98 °C, 15 s at 57 °C, 3 min at 68 °C, final extension 72 °C for 10 min. PCR-product was used as template in a subsequent reaction using the same conditions but with the nested PCR primers (Fwd; 5′-CTCTGTAGTTTGTGGTGGGGT-3′, Rev; 5′-ACATGACGAATGGCCTCCTTT-3′). The resulting PCR product was purified using a QIAquick PCR purification kit from QIAGEN (Germany) and quantified. DNA was prepared for sequencing using a Nextera XT kit from Illumina (USA) according to manufacturer’s instructions and the resulting library sequenced using 2 × 250 nt sequencing chemistry using a MiSeq (Illumina, USA). After quality trimming the resulting reads were anchored to the existing flanking Y-sequences using Gap filler73. The resulting sequence contained a single gap that was filled by Sanger sequencing a PCR product generated using the following PCR primers and conditions (Fwd; 5′-ACACAACGCAGAGTCTGTCC-3′, Rev: 5′- TCAGCTAGTCTCGCAATGGC-3′, denaturation 15 min at 95 °C, then cycling 30 times with 10 s at 98 °C, 15 s at 98 °C, 60 s at 68 °C, final extension 10 min at 72 °C). An X-sequence was constructed from Sanger sequencing a PCR product produced using primers annealing up- and down-stream of the Y-sequence.

Generation X- and Y-sequences in other gadoids

Sequence alignments of male specific region in Atlantic cod with public assembly data from 12 gadid species63, using MUMmer version 3.2374, identified four rather short regions that showed good conservation across species, including two regions within the Y-sequence (B; 12,520,541–12,520,838 and C; 12,527,782–12,5280,27) and two regions (A; 12,518,827–12,519,231 and D; 12,528,480–12,529,285) immediately flanking the X- and Y-sequences. Primers were designed using Primer375 to amplify from within the Y-sequence conserved sequence C to flanking sequence D (i.e. a Y-specific PCR) and from flanking sequence A to D (an X-sequence specific PCR). Products from these reactions were Sanger sequenced and aligned using MAFFT v7.40276 (Supplementary Files 6 and 7).

Diagnostic sex-test

A sex distinguishing duplex PCR was designed for Atlantic cod using the following primers and conditions (Fwd A; 5′-ACACACGGTCTGCTGTAGTG-3′, Fwd C; 5′- GGAGGGGAATTGTACAAACACG-3′, Rev D; 5′-GTGTGCCAAATGGATGCCAA-3′), denaturation for 15 min at 95 °C, cycling 35 times with 30 s at 94 °C, 60 s at 55 °C, 60 s at 72 °C, final extension 72 °C for 10 min (Supplementary File 9).

Nanopore Sequencing

High-molecular weight DNA was extracted using a phenol-chloroform method77. Sequencing libraries were prepared using SQK-RAD003 and SQK-LSK108 kits and protocols from Oxford Nanopore Technology (Oxford, UK) and sequenced on a MinION device generating a combined total of 3.9 Gb sequence data with an N50 read length of 4.2 kb. Reads were aligned to the public gadMor2.1 assembly using GraphMap aligner v0.5.278.

Transcript profiling of cod larvae

Cod larvae were hatched at the National Atlantic Cod Breeding Centre in Tromsø, Norway, after the incubation of fertilized eggs in seawater rearing tanks at 4.5 °C. For 1 and 7 dph larvae, RNA was extracted from a pool of 10 individuals using the QIAGEN AllPrep DNA/RNA/miRNA Universal kit (QIAGEN; Germany). Samples were prepared for sequencing using TruSeq Stranded mRNA kit from Illumina (USA) and sequenced using an Illumina HiSeq 2000 to produce 362 million reads. For the samples 12 and 35 dph samples, RNA was extracted using an RNeasy kit (QIAGEN, Germany) prepared for sequencing using TruSeq Stranded mRNA kit from Illumina (USA) and sequenced using an Illumina MiSeq (2 × 250nt) to produce 4.4 million reads. All reads were trimmed using Trimmomatic v0.3268 before further analysis.

Total available short read data (, PRJEB18628; and = SRP056073) was binned based on days before hatching (dph) before being aligned to the public gadMor2.1 assembly using star aligner STAR v2.3.1z12 as described previously. Potential transcripts were constructed using stringtie v1.3.379, while cuffmerge v2.2.180 was used to produce a GTF file containing key metrics. FPKM values for zkY, zky1, zky2, were calculated using only those reads with a mapping quality of ≥30.

Gene annotation

Data from various public sources was used to build gene models including (i) 3M transcriptome reads generated using GS-FLX 454 technology and hosted at NCBI’s SRA ( = SRP013269), (ii) >250 K ESTs hosted by NCBI ( (iii) 4.4 M paired-end mRNA MiSeq sequences from whole NEAC larvae at 12 and 35 dph (, PRJEB25591), (iv) Pacbio reads from (, PRJEB18628), (v) 362 M Illumina reads from 1 and 7 dph (, PRJEB25591) and (vi) approximately 1.7B Illumina reads from 4–63 dph as well as juvenile samples ( = SRP056073). To enable model building, short Illumina reads (<250nt) were mapped to the public gadMor2.1 assembly using STAR v2.3.1z12. Because Illumina reads from 4 dph - juvenile were generated using an unstranded library, the parameter’ outSAMstrandField intronMotif’ was used in alignment. Long reads from PacBio were mapped using STARlong v2.5.2a81 while 454 transcriptome reads were mapped using gmap v 2014-07-2882 with ‘–no-chimeras’ parameter in addition to default parameters. Cufflinks v2.2.180 was used to assemble the reads into transcript models for all alignments except for data from 4 dph - juvenile stage samples where stringtie v1.3.379 was used. Transcript models were merged using cuffmerge v2.2.180.

Gene models were tested by performing (i) open reading frame (ORF) prediction using TransDecoder83 using both pfamA and pfamB databases for homology searches and a minimum length of 30 amino acids for ORFs without pfam support, and (ii) BLASTP analysis (evalue <1e-10) for all predicted proteins against zebrafish (Danio rerio) (v9.75) and three-spined stickleback (Gasterosteus aculeatus) (BROADS1.75) annotations from Ensembl. Only gene models with support from at least one of these homology searches were retained. Functional annotation of the predicted transcripts was done using blastx against the SwissProt database.

Modeling the zinc knuckle domain

The three-dimensional structure of the three different variants of the zinc knuckle domain were built using a threading approach which combines three-dimensional fold recognition by sequence alignment with template crystal structures and model structure refining. The query-template alignment was generated by HHPRED ( and then submitted to the program MODELLER84 as implemented in Discovery Studio (Dassault Systèmes BIOVIA). Query coverage and e-value score were considered to define a suitable template structure. The structure of nucleocapsid protein NCp10 of retrovirus MoMuLV, which contains a single Cys-X2-Cys-X4-His-X4-Cys zinc knuckle domain bound to the oligonucleotide d(ACGCC) was selected as template ( Zinc ions and oligonucleotides were explicitly considered through molecular modeling steps. Fifty models, optimized by a short simulated annealing refinement protocol available in MODELLER, were generated and their consistency was evaluated on the basis of the probability density function violations provided by the program. Stereochemistry of selected models was checked using the program PROCHECK85. Visualization and manipulation of molecular images were performed with Discovery Studio (Dassault Systèmes BIOVIA).