Draft genomes of female and male turbot Scophthalmus maximus

Turbot (Scophthalmus maximus) is a commercially important flatfish species in aquaculture. It has a drastic sexual dimorphism, with females growing faster than males. In the present study, we sequenced and de novo assembled female and male turbot genomes. The assembled female genome was 568 Mb (scaffold N50, 6.2 Mb, BUSCO 97.4%), and the male genome was 584 Mb (scaffold N50, 5.9 Mb, BUSCO 96.6%). Using two genetic maps, we anchored female scaffolds representing 535 Mb onto 22 chromosomes. Annotation of the female anchored genome identified 87.8 Mb transposon elements and 20,134 genes. We identified 17,936 gene families, of which 369 gene families were flatfish specific. Phylogenetic analysis showed that the turbot, Japanese flounder and Chinese tongue sole form a clade that diverged from other teleosts approximately 78 Mya. This report of female and male turbot draft genomes and annotated genes provides a new resource for identifying sex determination genes, elucidating the evolution of adaptive traits in flatfish and developing genetic techniques to increase the sustainability of turbot aquaculture.


Background & Summary
Turbot (Scophthalmus maximus) is an economically important flatfish with both eyes on the upper side of the body, and it is commonly found along the Atlantic coast of Europe. Aquaculture of turbot was initiated in Scotland in the 1970s and subsequently expanded into other European countries by the early 1980s 1 . In the 1990s, turbot was introduced to China where its farming has since developed rapidly. China is currently the largest producer of turbot in the world 2 . Turbot growth is sexually dimorphic, with females eventually attaining sizes up to 50% larger than those of males 3 . An all-female stock can potentially increase the production value of turbot aquaculture. The sex determination system of turbot follows the ZW/ZZ model, and this system can be affected by environmental factors 4 . Therefore, understanding the genomic architecture of female and male turbot may enable screening for sex determination loci, improve understanding of the interactions between genetic and environmental factors in sex determination, and lead to the acquisition of genomic resources for molecular breeding. Four sex-related QTLs, located on four different linkage groups, have been found in turbot 5 . Though the turbot genome has been assembled, the sex-determining mechanism of turbot remains unclear 6 .
In this study, we sequenced, assembled and annotated the female and male turbot genomes, and conducted a phylogenetic analysis using the genome sequences of eight other closely related species. A 568 Mb female genome sequence and 584 Mb male genome sequence were assembled. The draft turbot genomes represent a valuable resource for isolating the sex determination genes, increasing our understanding of flatfish development and improving the molecular breeding techniques for turbot.

Methods
Turbot samples and genome sequencing. One female (ZW) and one male (ZZ) adult turbot were selected for whole genome shotgun sequencing and were temporarily maintained at 16 °C in laboratory facilities. Subsequently, the physiological sex of each turbot was determined by paraffin sectioning and HE staining of its gonadal tissues (Fig. 1). Blood samples were collected from the subjects using sterile syringes that contained anticoagulant solution (0.5 M EDTA, pH 8.0). Blood samples were stored at 4 °C. High-quality genomic DNA was extracted using Puregene Tissue Core Kit A (Qiagen, USA) for constructing DNA libraries (2 k~40 Kb). We constructed three paired-end (PE) libraries (170 bp, 500 bp and 800 bp) and five mate-paired (MP) libraries (2 kb, 5 kb, 10 kb, 20 kb and 40 kb) for female turbot using TruSeq PE Cluster Kit v3-cBot-HS (Illumina, USA) and Nextera Mate Pair Library Prep Kit (Illumina, USA). The samples were sequenced using the Illumina HiSeq. 2000 platform. We constructed six libraries (PE libraries 170 bp, 500 bp and 800 bp; MP libraries, 2 kb, 5 kb and 10 kb) for male turbot using HiSeq. 3000/4000 PE Cluster Kit and Nextera Mate Pair Library Prep Kit (Illumina, USA), the samples of which were sequenced using the Illumina HiSeq. 4000 platform. In total, we generated 99.5 Gb and 196.4 Gb of raw data for the female and male turbot, respectively. Before genome assembly, we filtered artificial and low-quality reads, resulting in 89.3 Gb and 174.6 Gb of clean data for the female and male fish, respectively (Table 1).
In general, genome size (G) can be calculated following the formula G = K_num/K_depth, where K_num is the total number of k-mer and K_depth is the expected coverage depth of k-mer 7 . We used the 31.3 Gb of female sequence data and the 59.8 Gb of male sequence data to estimate the genome sizes of turbot. The parameters used  Genome assembly and anchoring of the pseudo-chromosomes. The turbot genomes were de novo assembled using SOAPdenovo2 (v2.04) 8 with a parameter of "-K 29". SOAPdenovo2 employs the de Bruijn graph algorithm to simplify assembly and reduce computational complexity. The gaps were filled using GapCloser (v1.12) 8 with default parameters. Using this method, we assembled a genome spanning a contig length of 542 Mb, with a contig N50 of 12.16 kb, and a 568 Mb scaffold length, with a scaffold N50 of 6.2 Mb, for the female. The male genome had a contig length of 553 Mb with a contig N50 of 16.52 kb and a scaffold length of 584 Mb with a scaffold N50 of 5.9 Mb ( Table 2). The sizes of the draft assemblies in our study are a few Mb larger than that of a previous turbot draft assembly of unknown sex 6 . To construct pseudo-chromosomes of the turbot genome, we anchored the scaffolds of the female genome onto linkage groups on two genetic maps: one containing 514 SSRs and the other containing 6,647 SNPs 5,9 . We mapped SSR and SNP markers to the scaffolds using e-PCR and BLASTN (e-value ≤ 1e-5), and we linked the scaffolds that had SSRs and SNPs consistent with those on the maps onto the chromosomal regions, with strings of 'N's representing the gaps between adjacent scaffolds. The scaffolds with markers located on different chromosomes were filtered out. In total, 420 scaffolds with 535 Mb lengths were anchored onto 22 chromosomes, and 94% of the scaffolds were used. Genome annotation. Transposable elements (TEs) are abundant in vertebrate genomes and contribute to genome evolution 10 . We identified TEs in the female turbot genome using both homology-based and de novo prediction approaches. In the homology-based approach, we identified known TEs by searching for regions that match the RepBase TE library (v16.10) 11 using RepeatMasker (v3.3.0) 12 and RepeatProteinMask (v3.2.2). In addition, we constructed a de novo repeat library of the turbot genome using repeatScout (v1.0.5) 13 . Furthermore, we used the de novo library as a reference and used RepeatMasker to further identify TEs. In total, we identified 87.8 Mb of TEs, accounting for 15.44% of the genome, which represents a higher proportion of the genome than do the TEs of other flatfish genomes (56.2 Mb in Japanese flounder and 20.3 Mb in Chinese tongue sole) 14, 15 . Among the different types of TEs, DNA transposons were the most abundant (5.77%, 32.8 Mb) (Table 3).
We also used homology-based and de novo approaches to predict genes in the female genome assembly. For the homology-based prediction, Danio rerio, Gasterosteus aculeatus, Oryzias latipes, Takifugu rubripes, Tetraodon nigroviridis, Cynoglossus semilaevis, Paralichthys olivaceus and Homo sapiens proteins were downloaded from Ensembl (release 60) and NCBI, and we mapped the protein sequences onto the turbot genome using tblastN (e-value ≤ 1e-5). Homologous genome sequences were aligned against matching proteins using Genewise  www.nature.com/scientificdata www.nature.com/scientificdata/ (v2.4.0) 16 to define genes. We identified 11,245 to 20,057 homologous genes using the eight species reference. For de novo prediction, Augustus (v2.5.5) 17 and Genscan (v1.0) 18 were employed to analyze the repeat masked genome, which predicted 24,402 and 28,024 genes, respectively. Additionally, RNA-seq data were mapped to the genome to support 17,688 genes. To combine the results from the various analyses, we used Glean 19 to obtain a primary non-redundant gene set of 20,134 genes with a mean gene length of 10,322 bp and an average CDS length of 1,605 bp (Table 4). The average exons number per gene was 9.63, and the average length per gene was 166 bp (Fig. 2). For the predicted gene set, we annotated motifs and domains by using InterProScan (v5.16) 20 against publicly available databases including Pfam, ProDom, SMART, PRINTS, SUPERFAMILY and PROSITE 21 . We identified 18,434 genes containing conserved functional motifs in the predicted protein sequence. We also obtained Gene Ontology (GO) information for the predicted genes and found that 15,837 genes had GO annotations. We mapped the protein sequences from the turbot genome to the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway maps (KEGG database, release 58.0) using BLASTP (e-value ≤ 1e-5)., which assigned KEGG pathways to 9,930 genes. We also searched the SwissProt and Trembl databases (UniProt, release 2011_06) using BLASTP (e-value ≤ 1e-5), which resulted in a total of 14,806 and 18,441 assigned proteins, respectively. Only 353 genes (1.75%) were not supported by the protein databases.

Gene family clustering.
A gene family is a group of genes with similar structures, general with similar functions 22 . We clustered the genes into gene families of turbot and D. rerio, G. aculeatus, O. latipes, T. rubripes, T. nigroviridis, C. semilaevis, P. olivaceus and O. niloticus using OrthoFinder2 (v2.2.7) 23 . A total of 20,134 turbot genes were clustered into 14,440 gene families with an average of 1.39 genes per gene family. We identified 369 putative specific gene families among the three flatfish species included in the analysis. These lineage-specific gene families may have contributed to the evolution of flatfish (Fig. 3).
We used the BRMC approach to estimate species divergence times using MCMCTree through the Phylogenetic Analysis by Maximum Likelihood (PAML) package (v4.5) 27 . The MCMC process of the PAML MCMCTree program was run to sample 200,000 times, with sample frequency set to 2 and a burn-in of 20,000 iterations. Other parameters were set at their default values. The calibration times for the T. rubripes-T. nigroviridis divergence and D. rerio-O. latipes, G. aculeatus, T. rubripes, T. nigroviridis (min 149.85 Mya; max 165.2 Mya) were derived from previous research 28 .
Our analysis suggests that turbot, Japanese flounder and Chinese tongue sole, all of which belong to Pleuronectiformes, form a monophyletic clade. Our phylogenetic analysis suggests that the turbot and Japanese flounder likely shared a common ancestor approximately 65 million years ago (Mya) and that this ancestor diverged from the Chinese tongue sole approximately 78 Mya (Fig. 4). These findings are consistent with conclusions from previous evolutionary studies 14,29 .

Data records
The genomic sequence data have been deposited to NCBI Sequence Read Archive (SRA) with the accession number SRP197491 30 . The female genome 31 and male genome 32 assemblies are available at NCBI GenBank. The whole genome shotgun (WGS) project has the project accession VEVO00000000. This version of the project (01) has the accession number VEVO01000000, and consists of sequences VEVO01000001-VEVO01028256 33 . The list of gene families generated in this work, the annotation gff files of the female genome and the repeat annotations, the alignment file used for constructing the phylogenetic tree and the tree output are available at Figshare 34 .

Gene set Number Average gene length (bp) Average CDS length (bp) Average exon per gene Average exon length (bp) Average intron length (bp)
De  www.nature.com/scientificdata www.nature.com/scientificdata/

Technical Validation
Genome assembly quality assessment. To assess the qualities of the male and female genome assemblies. we first used BWA 35 with the default parameters to map the PE libraries reads used for assembly back to the corresponding genome, and we used the SAMtools flagstat function (SAMtools v1.9) 36 to count basic statistics. For the female genome, 99.6% of the PE library reads could be mapped back to the female assembled genome and 96.38% of the mapped reads could be mapped in proper pairs. For the male genome, the re-mapped reads and the reads mapped in proper pairs were 99.75% and 93.23%, respectively. We also calculated the coverage depth of each base pair with the SAMtools depth function (SAMtools v1.9) and found that the coverage depth was greater than 5 for more than 99.11% of male assembly sequences and for more than 96.76% of the female assembly sequences with the exception for the gap areas. We then used Benchmarking Universal Single-Copy Orthologs (BUSCO, v3.0.2) 37 to assess the assembled genome sequences. We used BUSCO with 4,854 single-copy orthologs from actinopterygii_odb9 to assess the completeness of the female and male turbot genome sequences. For the female genome, 4,427 (96.6%) complete Actinopterygii BUSCOs were present in the female turbot genome, including 4,319 (94.2%) single-copy Actinopterygii BUSCOs and 108 (2.4%) duplicated Actinopterygii BUSCOs. Seventy-two (1.6%) fragmented Actinopterygii BUSCOs were present, possibly due to incomplete assembly, and   www.nature.com/scientificdata www.nature.com/scientificdata/ only 85 (1.8%) Actinopterygii BUSCOs were considered missing in the female genome assembly. For the male genome, 4,467 (97.4%) complete Actinopterygii BUSCOs were present in the male genome, including 94.3% single-copy and 3.1% duplicated Actinopterygii BUSCOs. The fragmented and missing Actinopterygii BUSCOs in male genome represented 1.1% and 1.5%, respectively, of the genome.
To further validate the technical quality of the new male and female genome assemblies, we used nucmer 38 to compare our new male and female genome assemblies with the current reference genome assembly (GCA_003186165.1), then used dnadiff 39 to wrap the comparison results (Table 5). Moreover, LASTZ 40 with optimized parameters (-hspthresh = 4500 -gap = 600,150 -ydrop = 15000 -notransition) and Circos graph were used to make a correspondence analysis between 23 linkage groups and 22 chromosomes in reference genome. Consequently, 21 linkage groups have one-to-one corresponding chromosomes in reference genome, while Lg08 and Lg23 are both corresponding to chromosome 4 (Fig. 5). The above results indicated that the assembled genome sequences and the gene region assembly are acceptable.