Whole genome resequencing data for three rockfish species of Sebastes

Here we report Illumina-based whole genome sequencing of three rockfish species of Sebastes from the northwest Pacific. Whole genomic DNA was used to prepare 350-bp paired-end libraries, and high-throughput sequencing yielded 128.5, 137.5, and 124.8 million mapped reads, corresponding to 38.54, 41.26, and 37.43 Gb of sequence data for S. schlegelii, S. koreanus, and S. nudus, respectively. K-mer analyses estimated genome sizes of 846.4, 832.5, and 813.1 Mb, giving sequencing coverages of 45×, 49×, and 46× for the three rockfish, respectively. Comparative genomic analyses identified 46,624 genome-wide single nucleotide polymorphisms (SNPs). Phylogenetic analysis revealed that the three species are more closely related to one another than to six other rockfish species. Demographic analysis identified contrasting changes between S. schlegelii and the other two species, suggesting drastically different responses to climate change. The genome data reported in this study are valuable for further studies on comparative genomics and evolutionary biology of rockfish species.


Background & Summary
The rockfish genus Sebastes Cuvier 1829 is the most speciose in the family Sebastidae (Actinopterygii: Scorpaeniformes) 1,2. The genus contains nearly 110 species worldwide, and most of them are subject to substantial commercial and recreational fisheries 2. Such great species diversity is likely attributable to recent species diversification 2–4, which has resulted in taxonomic confusion in some areas due to morphological similarity. Rockfish have provided valuable opportunities for evolutionary studies, shedding light on the origin and diversification within the genus 3,5. In addition, as ovoviviparous teleosts, rockfish could provide exceptional clues for studying the evolution of reproductive ecology. Ovoviviparity is a distinctive reproductive mode in fish, in which fertilized eggs are retained in the female ovary until the embryos are mature 6. In these respects, molecular information such as whole genome data would provide more comprehensive insights into the evolutionary biology of these species.
In this study, we report whole genome data of three marine ovoviviparous fish in the genus Sebastes, viz., Sebastes schlegelii Hilgendorf 1880, Sebastes koreanus Kim and Lee 1994, and Sebastes nudus Matsubara 1943. The three rockfish are commercial species commonly distributed in Korea, Japan, and the northeast coast of China 1. Herein, a total of three male adults (one individual per species) were collected from the coastal waters of Qingdao, China. Prior to sequencing, the genome sizes of the three species were estimated as ~800 Mb; accordingly, nearly 40 Gb of sequencing data (about 50× genome coverage) were produced for each species on the Illumina HiSeq 2500 sequencing platform. We intend to develop genomic resources for further studies on the taxonomy, phylogenetics, conservation and evolution of these commercially important rockfish in the genus Sebastes.
The experimental design, sequencing and analysis pipeline is shown in Fig. 1. After data filtering, a total of 38.54, 41.26, and 37.43 Gb of sequence data were produced for S. schlegelii, S. koreanus, and S. nudus, respectively (Table 1). K-mer analyses estimated genome sizes of 846.4, 832.5, and 813.1 Mb for the three species, respectively (Table 2). The genome sequences of S. schlegelii, S. koreanus, and S. nudus were assembled into scaffolds with total sizes of 755.1, 751.7, and 748.5 Mb, respectively. The estimated genomic information of the three rockfish species is shown in Table 2.
The filtered clean data were mapped to the published S. steindachneri (GCA_001910785.2) reference genome, and the resulting BAM files were subsequently used in demographic analyses. A coalescent-based hidden Markov model, the pairwise sequentially Markovian coalescent (PSMC) model, was used to infer the history of effective population size (Ne). The PSMC results exhibited contrasting demographic changes during the last glacial period, revealing an increase in Ne in S. schlegelii and a decrease in the other two species (Fig. 2). The demographic analyses suggested that drastically different responses to climate change can be detected in closely related species, as reported for the demographic changes of two closely related dolphin species 7. Such contrasting demographic changes could be due to the altered ecology of competitors and the pattern of population differentiation 7. Further studies are warranted to characterize the contrasting demographic patterns among closely related species.
In addition, the phylogenetic relationships of species in the genus Sebastes were reconstructed based on whole genome sequences. Supplemented with six published genome sequences, a total of 14,821,089 single nucleotide polymorphisms (SNPs) were identified. After SNP filtering, the remaining 46,624 SNPs were used in phylogenetic reconstruction.
The neighbour-joining topology revealed that S. schlegelii, S. koreanus, and S. nudus are more closely related to one another than to the other rockfish species in this genus (Fig. 3). Based on a literature survey and to the authors' knowledge, the whole genome data reported in the present study are the first made publicly available for these three rockfish; these data could therefore be valuable for further studies on the taxonomy, phylogenetics and evolutionary biology of rockfish species.

Methods
Sample collection. Animal experiments were conducted in accordance with the guidelines approved by the Zhejiang Ocean University Animal Ethics Committee and the national legislation. The sample collection procedure followed the description in our previously published work 8. To obtain enough genomic DNA for Illumina sequencing, we collected fresh epaxial white muscle tissues from Sebastes schlegelii, S. koreanus, and S. nudus sampled from Qingdao, China. The samples were quickly frozen in liquid nitrogen for 1 hour before storage at −80 °C.
DNA extraction. Genomic DNA was extracted using a standard phenol/chloroform extraction protocol. The integrity of the genomic DNA was checked using agarose gel electrophoresis, which showed a main band around 20 kb and satisfied the requirement for Illumina library construction according to the manufacturer's protocol.
Whole-genome sequencing. Whole genome sequencing was performed commercially at Novogene Co. Ltd in Beijing. In brief, 1.0 μg of genomic DNA was fragmented using an E210 Focused-ultrasonicator (Covaris, Woburn, MA). The sheared DNA fragments were used to prepare paired-end libraries with an average insert size of 350 bp for all samples according to the manufacturer's instructions (Illumina Inc., San Diego, CA). Each library was sequenced in two independent lanes of a HiSeq 2500 platform (Illumina Inc.) in 150-bp paired-end mode. The raw data were converted to single-sample FASTQ files by base calling; after filtering out interfering information such as adaptors and low-quality reads, the clean FASTQ files of each sample were used for further bioinformatics analyses.
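As a rough consistency check on the reported sequencing yield, the data volume and coverage can be recomputed from the read counts, the 150-bp read length and the k-mer genome size estimates. The sketch below assumes that the reported read counts (128.5, 137.5, and 124.8 million) are read pairs; the text calls them "reads", so this is an assumption, adopted only because it reproduces the reported Gb totals. The function and variable names are illustrative.

```python
# Read counts (millions) and k-mer genome size estimates (Mb) as reported in the text.
SAMPLES = {
    "S. schlegelii": {"reads_million": 128.5, "genome_mb": 846.4},
    "S. koreanus": {"reads_million": 137.5, "genome_mb": 832.5},
    "S. nudus": {"reads_million": 124.8, "genome_mb": 813.1},
}

READ_LENGTH_BP = 150  # 150-bp paired-end sequencing
READS_PER_PAIR = 2    # ASSUMPTION: the reported counts are read pairs


def yield_gb(reads_million, read_length_bp=READ_LENGTH_BP,
             reads_per_pair=READS_PER_PAIR):
    """Total sequenced bases in Gb for a given number of read pairs (in millions)."""
    return reads_million * 1e6 * read_length_bp * reads_per_pair / 1e9


def coverage(data_gb, genome_mb):
    """Sequencing depth as total bases divided by estimated genome size."""
    return data_gb * 1e3 / genome_mb


def summarise(samples=SAMPLES):
    """Return {species: (yield in Gb, fold coverage)} for every sample."""
    result = {}
    for name, info in samples.items():
        data_gb = yield_gb(info["reads_million"])
        result[name] = (data_gb, coverage(data_gb, info["genome_mb"]))
    return result


if __name__ == "__main__":
    for species, (gb, depth) in summarise().items():
        print(f"{species}: {gb:.2f} Gb, {depth:.1f}x")
```

Under this assumption the recomputed yields agree with the reported 38.54, 41.26, and 37.43 Gb to within about 0.01 Gb, and the depths fall between 45× and 50×, in line with the reported coverages.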
Genome assembly. The genome size, heterozygosity ratio and repeat ratio were estimated using k-mer analysis (K = 17) performed in GCE v1.0.0 9. Paired-end reads were assembled into contigs and scaffolds in SOAPdenovo v2.01 10 with a k-mer size of 41, using the de Bruijn graph approach.
Phylogenetic analysis. The generated genome data were supplemented with publicly available sequences of six rockfish species in the genus Sebastes, i.e. S. steindachneri (GCA_001910785.2), S. aleutianus (GCA_001910805.2), S. minor (GCA_001910765.2), S. nigrocinctus (GCA_000475235.3), S. norvegicus (GCA_900302655.1), and S. rubrivinctus (GCA_000475215.1), downloaded from the NCBI database. The clean reads were aligned to the reference genome of S. steindachneri using the bwa-mem algorithm in BWA 0.7.12 11 with default parameters. Single nucleotide polymorphism (SNP) calling was implemented in SAMtools 1.3.1 12 with default parameters. SNP filtering was performed using VCFtools 13. The SNP calling procedure and parameters are expanded versions of the descriptions in our related work 14. To prevent sex bias from affecting the tree topology, contigs containing SNPs were cross-validated against the sex-determining loci identified in a previous study 15, and sex-determining SNP loci were excluded from the phylogenetic analysis. A phylogenetic tree of the nine Sebastes species was reconstructed from the filtered SNPs using the neighbour-joining (NJ) method in TASSEL 5 16 with default parameters. However, potential sampling bias should be noted as a caveat when performing phylogenetic analyses based on SNPs derived from a single individual per species. Further analyses that sample more individuals are warranted to obtain more robust results.
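Neighbour-joining takes a pairwise distance matrix as input. The following is a minimal sketch of how such a matrix can be derived from biallelic SNP genotypes. It is not TASSEL's implementation, and its simple allele-sharing distance may differ from the distance TASSEL computes. The genotype coding (0, 1, 2 copies of the alternative allele, None for missing) and the toy data are assumptions made for illustration only.

```python
from itertools import combinations


def allele_sharing_distance(g1, g2):
    """Mean per-site genotype difference between two individuals, scaled to [0, 1].

    Genotypes are coded as the number of alternative alleles (0, 1 or 2);
    None marks a missing call, and such sites are skipped pairwise.
    """
    diffs = [abs(a - b) / 2
             for a, b in zip(g1, g2)
             if a is not None and b is not None]
    if not diffs:
        raise ValueError("no sites are called in both individuals")
    return sum(diffs) / len(diffs)


def distance_matrix(genotypes):
    """Build a symmetric distance matrix as a nested dict keyed by sample name."""
    names = list(genotypes)
    dist = {name: {name: 0.0} for name in names}
    for a, b in combinations(names, 2):
        d = allele_sharing_distance(genotypes[a], genotypes[b])
        dist[a][b] = d
        dist[b][a] = d
    return dist


# Hypothetical toy genotypes at five SNP sites (not data from this study).
TOY_GENOTYPES = {
    "sp_A": [0, 0, 2, 1, None],
    "sp_B": [0, 2, 2, 1, 0],
    "sp_C": [2, 2, 0, None, 0],
}

if __name__ == "__main__":
    matrix = distance_matrix(TOY_GENOTYPES)
    for name, row in matrix.items():
        print(name, {other: round(d, 3) for other, d in row.items()})
```

With only one individual per species, each entry of such a matrix rests on a single diploid genotype per site, which is one way to see why the sampling caveat above matters.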
Demographic analysis. The demographic history of all three rockfish species was analysed using the PSMC model, as implemented in the PSMC package 17. The "fq2psmcfa" and "splitfa" tools from the PSMC package were used to create the input files for PSMC modelling. The PSMC command included the options "-N25" for the number of iterations of the algorithm, "-t15" for the upper limit of the time to the most recent common ancestor (TMRCA), "-r5" for the initial θ/ρ ratio, and "-p 4+25*2+4+6" for the pattern of atomic time intervals. The reconstructed population history was plotted using the "psmc_plot.pl" script with the substitution rate "-u 2.5e-8" adopted from medaka 18 and a generation time of 8 years. The generation time was calculated as g = a + [s/(1 − s)] 19, where s is the expected adult survival rate, assumed to be 80%, and a is the age at sexual maturation, which is 4 years for S. schlegelii 20. This gives g = 4 + 0.8/(1 − 0.8) = 8, so a generation time of 8 years was used in the PSMC analysis. To assess the variance in the estimated effective population size, we performed 100 bootstraps for each species.
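The generation-time formula and the "-p" interval pattern can both be checked with a few lines of arithmetic. The sketch below uses illustrative helper names that are not part of the PSMC package. It evaluates g = a + s/(1 − s) for the stated values and expands a pattern such as "4+25*2+4+6", in which a term "n*w" denotes n free parameters each spanning w atomic intervals.

```python
def generation_time(maturation_age, adult_survival):
    """Generation time g = a + s / (1 - s).

    maturation_age (a): age at sexual maturation, in years.
    adult_survival (s): expected annual adult survival rate, 0 <= s < 1.
    """
    if not 0 <= adult_survival < 1:
        raise ValueError("adult_survival must be in [0, 1)")
    return maturation_age + adult_survival / (1.0 - adult_survival)


def parse_psmc_pattern(pattern):
    """Return (atomic_intervals, free_parameters) for a PSMC -p pattern.

    Each '+'-separated term is either 'w' (one parameter spanning w atomic
    intervals) or 'n*w' (n parameters, each spanning w atomic intervals).
    """
    n_intervals = 0
    n_params = 0
    for term in pattern.replace(" ", "").split("+"):
        if "*" in term:
            repeats, width = (int(x) for x in term.split("*"))
        else:
            repeats, width = 1, int(term)
        n_intervals += repeats * width
        n_params += repeats
    return n_intervals, n_params


if __name__ == "__main__":
    # Values stated in the text: a = 4 years, s = 0.8.
    print(f"generation time: {generation_time(4, 0.8):.1f} years")
    intervals, params = parse_psmc_pattern("4+25*2+4+6")
    print(f"{intervals} atomic intervals, {params} free interval parameters")
```

For the pattern used here this gives 64 atomic intervals grouped into 28 free interval parameters, and a generation time of 8 years up to floating-point rounding.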

Data records
All raw sequencing reads for the three rockfish species have been deposited in the NCBI Sequence Read Archive 21, and the assembled genome sequences (Sebastes schlegelii 22, S. nudus 23, and S. koreanus 24) have been deposited in GenBank. In addition, the assembled genome sequences, aligned VCF files and phylogenetic tree file are stored in Figshare 25.

Technical Validation
In the present study, the sampled fish were captured by hook-and-line fishing in the coastal waters of Qingdao, China. Taxonomic identification was carried out in the laboratory based on morphological characters. DNA quality was checked using agarose gel electrophoresis (Fig. 4). The preprocessing steps, including quality evaluation and filtering of raw reads, followed the procedures of our previous study 8. The quality of raw reads was evaluated using FastQC 26, and low-quality reads were filtered using HTQC 27 according to the following criteria: (1) adaptors in the reads were trimmed and removed; (2) read pairs were removed when either read had more than 10% N bases; (3) read pairs were removed if either read had more than 20% low-quality bases (Phred quality score < 5); (4) ambiguous or low-quality fragments at the two ends of reads were trimmed using a window size of 5 bp and an average quality threshold of 20. Sequencing quality was also assessed by examining GC content, Q20 statistics and error rate (Table 1, Fig. 5). FastQC output files can also be viewed in the Supplementary Information. Moreover, the parameters used in the bioinformatics analyses followed the default settings or the published literature, as described in the Methods section.
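Filtering criteria (2) to (4) can be expressed compactly in code. The following is a minimal sketch, not HTQC's implementation: it assumes Phred+33 quality encoding, represents each read as a (sequence, quality string) tuple, and implements the end trimming as a sliding window moved inwards from each end until the mean window quality reaches the threshold. All function names are illustrative, and adaptor trimming (criterion 1) is not covered.

```python
PHRED_OFFSET = 33  # ASSUMPTION: Phred+33 (Sanger / Illumina 1.8+) encoding


def phred_scores(quality_string):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - PHRED_OFFSET for ch in quality_string]


def fraction_n(seq):
    """Fraction of ambiguous (N) bases in a read."""
    return seq.upper().count("N") / len(seq) if seq else 1.0


def fraction_low_quality(quals, threshold=5):
    """Fraction of bases with Phred score below the threshold."""
    return sum(q < threshold for q in quals) / len(quals) if quals else 1.0


def keep_pair(read1, read2, max_n=0.10, max_low=0.20, low_q=5):
    """Criteria (2) and (3): drop the pair if either read fails either test."""
    for seq, qual in (read1, read2):
        if fraction_n(seq) > max_n:
            return False
        if fraction_low_quality(phred_scores(qual), low_q) > max_low:
            return False
    return True


def trim_ends(seq, qual, window=5, min_mean=20):
    """Criterion (4): trim both ends until a window reaches the mean quality.

    Reads shorter than the window are returned unchanged (a simplification).
    """
    quals = phred_scores(qual)
    n = len(quals)
    if n < window:
        return seq, qual

    def window_mean(start):
        return sum(quals[start:start + window]) / window

    start = 0
    while start + window <= n and window_mean(start) < min_mean:
        start += 1
    end = n
    while end - window >= start and window_mean(end - window) < min_mean:
        end -= 1
    if end - start < window:
        return "", ""  # no window passed the threshold
    return seq[start:end], qual[start:end]
```

Because the window mean is used rather than per-base quality, a few low-quality bases can remain at the trimmed ends when their neighbours are of high quality.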

Code Availability
The code used at each step is described in the respective Methods subsections.