The genome resources for conservation of Indo-Pacific humpback dolphin, Sousa chinensis

The Indo-Pacific humpback dolphin (Sousa chinensis) is a threatened marine mammal and belongs to the First Order of the National Key Protected Wild Aquatic Animals List in China. However, limited genomic information is available for studies of its population genetics and biological conservation. Here, we assembled a genome sequence of this species using a whole genome shotgun (WGS) sequencing strategy after a pilot low-coverage genome survey. The total assembled genome size was 2.34 Gb, with a contig N50 of 67 kb and a scaffold N50 of 9 Mb (107.6-fold sequencing coverage). The S. chinensis genome contained 24,640 predicted protein-coding genes and approximately 37% repeat sequences. The completeness of the genome assembly was evaluated using benchmarking universal single-copy orthologs (BUSCOs): 94.3% of the 4,104 expected mammalian genes were identified as complete, and 2.3% were identified as fragmented. This newly produced high-quality genome assembly and annotation will greatly promote future studies of the genetic diversity, conservation and evolution of the species.

www.nature.com/scientificdata

whole-genome sequence information would be a valuable resource for biological, ecological, conservation and evolutionary studies.
To obtain a high-quality genome sequence of S. chinensis, we first performed a pilot genome survey with low-depth sequencing (32.9X) (Table 1) on the Illumina HiSeq 4000 platform to estimate the genome size and heterozygosity of the species. The survey assembly was about 2.29 Gb [25] (contig N50 = 13 kb and scaffold N50 = 163 kb), and only about 76% of the BUSCOs were complete in the genome survey [26]. The low-depth data gave an estimated genome size of about 2.7 Gb but produced an assembly of insufficient completeness [26]. Therefore, we constructed four additional insert-size libraries (besides the 500 bp and 2 kb libraries from the genome survey) and generated a total of 290.5 Gb (107.6X) of clean data after filtering (Tables 1 and 2). The S. chinensis genome was finally assembled into scaffolds with a total size of 2.34 Gb [27] (Tables 1 and 3). The contig and scaffold N50 of the assembly were 67 kb and 9 Mb, respectively, and the N50 and N90 numbers of scaffolds were 78 and 283, respectively (Table 3). BUSCO [28] identified 94.3% of 4,104 conserved genes as complete (Table 4). The quality of the newly assembled genome was much better than that of the genome survey (Table 1). In total, 878.3 Mb (37.41%) of the genome consists of repeat sequences (Table 5). Gene annotation yielded 24,640 coding genes, and 91.2% of the predicted genes were functionally annotated from biological databases (Tables 6 and 7). Approximately 95% of the BUSCOs were identified as complete by the BUSCO pipeline based on the annotation result (Table 8), which suggests a good-quality genome annotation.

Methods
Sample collection, DNA extraction and sequencing. Sample collection and DNA extraction followed the methods reported in a previously published study [26]. In addition to the previously constructed 500 bp and 2 kb libraries, new 300 bp and 800 bp small-insert libraries and 5 kb and 10 kb mate-pair libraries were constructed according to the manufacturer's protocol (Illumina, San Diego, CA, USA). After library construction, PE150 reads for the 300 bp library were sequenced on the Illumina HiSeq X Ten platform, while PE125 reads for the 800 bp library and PE50 reads for the 5 kb and 10 kb libraries were sequenced on the Illumina HiSeq 4000 platform. A total of approximately 370 Gb of raw data was obtained. We then filtered the reads with stringent criteria using SOAPnuke [29], which yielded 290.5 Gb of clean data (107.6X genome coverage) (Table 2).
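The reported fold coverage can be checked with simple arithmetic. The following is a minimal sketch, assuming that coverage is defined as the total number of clean bases divided by the estimated genome size of 2.7 Gb; the function name `fold_coverage` is illustrative and not part of the original pipeline.

```python
def fold_coverage(total_bases_gb: float, genome_size_gb: float) -> float:
    """Return sequencing depth as total bases divided by genome size."""
    if genome_size_gb <= 0:
        raise ValueError("genome size must be positive")
    return total_bases_gb / genome_size_gb


# Values taken from the text; the 2.7 Gb genome size is the survey estimate.
clean_gb = 290.5
raw_gb = 370.0  # reported only as "approximately 370 Gb"
genome_gb = 2.7

print(f"Clean-data coverage: {fold_coverage(clean_gb, genome_gb):.1f}X")
print(f"Fraction of raw data retained: {clean_gb / raw_gb:.1%}")
```

Because the raw data volume is given only approximately, the retained fraction is an estimate.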
Table 2. Statistics of raw and clean data. Note: Assuming the genome size is 2.7 Gb. *These data were used in the previous pilot study [26].
Genome assembly and evaluation. We used all the clean data to assemble the genome with Platanus [30]. First, contigs were constructed from the paired-end reads based on de Bruijn graphs. Second, the order of the contigs was fixed using the paired-end and mate-pair information in the scaffold construction process. Third, in the gap-closing step, each set of assembled reads was used to close the gaps, and each gap was covered with reads mapped on the scaffolds by the Platanus pipeline. After that, we filled the remaining gaps with GapCloser [31]. Finally, scaffolds were extended with SSPACE [32] using the mate-pair library data. The final assembled genome length was 2.34 Gb, with a contig N50 of 67 kb and a scaffold N50 of 9 Mb (Table 3). The assembly and gene annotation qualities were assessed using BUSCO [28]. The mammalian gene set used in the evaluation contained 4,104 genes.
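For readers unfamiliar with the N50 and N90 statistics quoted for the assembly, the following minimal sketch shows how they are conventionally computed from a list of sequence lengths. The function name `nx_stats` and the toy lengths are illustrative assumptions, not the actual assembly data.

```python
def nx_stats(lengths, fraction=0.5):
    """Return (Nx length, Nx number) for a collection of sequence lengths.

    The Nx length is the length of the shortest sequence in the smallest
    set of longest sequences that together cover at least `fraction` of
    the total assembly length; the Nx number is the size of that set.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("fraction must be between 0 and 1")


# Toy scaffold lengths (hypothetical), total length 300.
toy_lengths = [100, 80, 60, 40, 20]
print(nx_stats(toy_lengths, 0.5))  # (80, 2)
print(nx_stats(toy_lengths, 0.9))  # (40, 4)
```

Under this definition, the scaffold N50 of 9 Mb with an N50 number of 78 means that the 78 longest scaffolds, each at least 9 Mb long, contain half of the assembled bases.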

(Table columns: Contig Length (bp), Contig Number, Scaffold Length (bp), Scaffold Number)
Genome annotation. The genome was searched for tandem repeats using Tandem Repeats Finder [33]. Interspersed repeats were mainly identified using homology-based approaches. The Repbase [34] database of known repeats and a de novo repeat library generated by RepeatModeler (http://www.repeatmasker.org/RepeatModeler.html) were used, and the genome was screened against them using RepeatMasker (http://www.repeatmasker.org). The repeat content of this species is 37.4% (Table 5).

(Table columns: Type, Repeat Size, % of genome)
The coding genes in the S. chinensis genome were annotated based on evidence from known proteins and published RNA sequences. For protein homology-based prediction, proteins of B. taurus, T. truncatus, O. orca and B. mysticetus were downloaded from NCBI and aligned to the S. chinensis genome using TBLASTN [35] with an E-value ≤ 1E-5. The homologous genome sequences were then aligned to the matched proteins with GeneWise [36] to predict gene models. We filtered the models for redundancy and retained those with the highest scores. Because most open reading frames (ORFs) in the homology-based gene models were not intact, RNA-seq data were used to supplement the homology-based prediction. First, transcriptome data of S. chinensis (4,305,634,920 nucleotides in total), which were sequenced on the Illumina HiSeq 2000 platform and published in 2013 [24], were downloaded from https://www.ebi.ac.uk/ena/data/search?query=ERP003522. These reads were aligned to the assembled genome sequence using HISAT [37]. Subsequently, the HISAT mapping results were merged and sorted, and transcripts were assembled using StringTie with default parameters [38]. Finally, the GeneWise results were extended using the transcript ORFs, following the strategy of the Ensembl gene annotation system [39]. This strategy has been used extensively in genome research [40][41][42][43][44]. The 24,640 predicted genes (Table 6) were then functionally annotated by alignment to five databases: InterPro [45], Gene Ontology [46], KEGG [47], Swiss-Prot [48] and TrEMBL [48]. In total, 91.2% of the predicted genes were functionally annotated (Table 7).
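The redundancy filtering step is described only briefly. One common way to implement it is a greedy selection that keeps the highest-scoring model at each locus and discards lower-scoring models overlapping it on the same scaffold and strand. The sketch below illustrates that idea under this assumption; the `GeneModel` class, the overlap rule and the example coordinates are hypothetical and may differ from the procedure actually used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneModel:
    scaffold: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"
    score: float


def overlaps(a: GeneModel, b: GeneModel) -> bool:
    """True if two models share a scaffold, a strand and at least one base."""
    return (a.scaffold == b.scaffold and a.strand == b.strand
            and a.start <= b.end and b.start <= a.end)


def keep_best_models(models):
    """Greedily keep the highest-scoring model among overlapping ones."""
    kept = []
    for model in sorted(models, key=lambda m: m.score, reverse=True):
        if not any(overlaps(model, other) for other in kept):
            kept.append(model)
    return sorted(kept, key=lambda m: (m.scaffold, m.start))


# Hypothetical candidate models on one scaffold.
candidates = [
    GeneModel("scaffold1", 100, 500, "+", 80.0),    # overlaps the next model
    GeneModel("scaffold1", 400, 900, "+", 95.0),    # best at this locus
    GeneModel("scaffold1", 1000, 1500, "+", 60.0),  # no overlap
    GeneModel("scaffold1", 450, 600, "-", 50.0),    # opposite strand
]
for model in keep_best_models(candidates):
    print(model)
```

In this toy example the model scoring 80.0 is dropped because it overlaps the model scoring 95.0 on the same strand, while the opposite-strand model is retained.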

Data records
The genome assembly and annotation results have been deposited at DDBJ/ENA/GenBank [27]. Raw read files are available at the NCBI Sequence Read Archive [49].

Technical Validation
Evaluating the completeness of the genome assembly and annotation. To evaluate the completeness of the genome assembly and annotation, the BUSCO pipeline [28] was used to investigate the presence of highly conserved orthologous genes in the assembly and in the annotated gene set. BUSCO was run with the mammalian set, which includes a total of 4,104 orthologue groups. BUSCO identified 94.3% and 95.1% of these groups as complete in the genome assembly and the annotation result, respectively (Tables 4 and 8), which indicates a good quality of the genome assembly and gene set annotation.
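To make the BUSCO percentages more concrete, they can be converted into approximate gene counts. The sketch below uses the assembly values reported in this paper (94.3% complete and 2.3% fragmented out of 4,104) and assumes that the remainder is missing; because the inputs are rounded percentages, the counts are approximate. The function name `busco_breakdown` is illustrative.

```python
def busco_breakdown(n_total: int, complete_pct: float, fragmented_pct: float) -> dict:
    """Convert BUSCO percentages into approximate ortholog counts."""
    missing_pct = round(100.0 - complete_pct - fragmented_pct, 1)
    if missing_pct < 0:
        raise ValueError("percentages sum to more than 100")

    def to_count(pct: float) -> int:
        return round(n_total * pct / 100.0)

    return {
        "complete": to_count(complete_pct),
        "fragmented": to_count(fragmented_pct),
        "missing": to_count(missing_pct),
        "missing_pct": missing_pct,
    }


# Assembly-level values from the text.
print(busco_breakdown(4104, 94.3, 2.3))
# {'complete': 3870, 'fragmented': 94, 'missing': 140, 'missing_pct': 3.4}
```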
To further evaluate the accuracy of the genome assembly, reads from the paired-end short-insert libraries were aligned to the assembled genome with BWA-MEM (v0.7.15) [50] using default parameters. After the mapped reads were sorted by mapping coordinates with Picard (ver. 1.118) (http://broadinstitute.github.io/picard/), the mapping rate was 99.92% and the unique mapping rate was 75.81%. A total of 98.27% of the assembled genome was covered by reads, and the proportions of the genome covered at depths of at least 4X, 10X and 20X were 98.16%, 97.97% and 97.32%, respectively.
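The coverage figures above are breadth-of-coverage values at several depth thresholds. The following minimal sketch shows how such values can be derived from per-base read depths; the input is assumed to be one depth value per genome position (for example, as could be extracted from a per-position depth report), and the toy depths are hypothetical. It is not the script used in this study.

```python
def breadth_at_depths(depths, thresholds=(1, 4, 10, 20)):
    """Return the fraction of positions with depth >= each threshold."""
    depths = list(depths)
    if not depths:
        raise ValueError("no depth values given")
    total = len(depths)
    return {t: sum(d >= t for d in depths) / total for t in thresholds}


# Hypothetical per-base depths for a 10 bp region.
toy_depths = [0, 3, 5, 12, 25, 30, 8, 0, 15, 22]
for threshold, fraction in breadth_at_depths(toy_depths).items():
    print(f">= {threshold}X: {fraction:.0%}")
```

For a genome of this size, the depth values would in practice be streamed rather than held in a list, but the calculation is the same.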
Comparison with other cetacean genomes. A total of approximately 370 Gb of raw data was generated on the Illumina HiSeq X Ten and HiSeq 4000 platforms for the S. chinensis genome from six libraries with different insert sizes: 300 bp, 500 bp, 800 bp, 2 kb, 5 kb and 10 kb [49]. After data filtering, approximately 290.5 Gb of clean data, representing approximately 107.6-fold genome coverage, was obtained for genome assembly (Table 1). After assembly with Platanus, the total assembled genome length was approximately 2.34 Gb, with a contig N50 of 67 kb and a scaffold N50 of 9 Mb [27] (Table 3), which was better than those of the published B. acutorostrata, L. vexillifer and B. mysticetus genomes (Table 9). We predicted 24,640 coding genes in the S. chinensis genome (Table 6) using a homology-based approach supplemented with RNA-seq data, a strategy that has been used extensively in genome research [40][41][42][43][44]. By comparison, 27,924 genes were predicted in O. orca and approximately 20,000-23,000 genes in B. mysticetus, L. vexillifer and B. acutorostrata (Table 9).
Here, we report an updated high-quality genome sequence of the threatened Indo-Pacific humpback dolphin. This genome resource will greatly enhance further studies of gene function and conservation biology in S. chinensis. Our study is an important step towards a comprehensive understanding of the genetic background of S. chinensis at the genomic level. The data will also be valuable for facilitating studies of cetacean evolution, as well as population genetics and ecology.

Code availability
Several tools have been implemented in the data analyses, whose versions, settings and parameters are described below.

Table 9. Statistics of the assembled sequence length of published cetacean genomes (S. chinensis included).