Draft genome of the big-headed turtle Platysternon megacephalum

The big-headed turtle, Platysternon megacephalum, as the sole member of the monotypic family Platysternidae, has a number of distinct characteristics including an extra-large head, long tail, flat carapace, and a preference for low water temperature environments. We performed whole genome sequencing, assembly, and gene annotation of an adult male big-headed turtle based on the Illumina HiSeq X genomic sequencing platform. We generated ~497.1 Gb of raw sequencing data (×208.9 depth) and produced a draft genome with a total length of 2.32 Gb and contig and scaffold N50 sizes of 41.8 kb and 7.22 Mb, respectively. We also identified 924 Mb (39.84%) of repetitive sequences, 25,995 protein-coding genes, and 19,177 non-coding RNAs. We generated the first de novo genome of the big-headed turtle; these data will be essential to the further understanding and exploration of the genomic innovations and molecular mechanisms contributing to its unique morphology and physiological features.


Background & Summary
The big-headed turtle, Platysternon megacephalum (NCBI Taxonomy ID: 55544), as the sole member of the family Platysternidae, is native to Southeast Asian countries, including China, Cambodia, Laos, Myanmar Thailand and Vietnam 1 . The species inhabits flowing cool rocky mountain streams with water temperatures usually between 12-28 °C 2 . It is an aquatic predator with strong climbing abilities, preying on lizard, frogs, fish, shrimps, crabs, snails, earthworms, insects, and even small birds and mammals, along with consuming some fruit and plant matter 3 . The big-headed turtle has some defining features, such as an extra-large head, long tail, and flat carapace. Its head width is approximately equal to half of its carapace width and therefore cannot be retracted into the shell. This species has the longest tail relative to body size of any turtle, with tail length being more than half of the carapace length 4 . In addition, it has a very flat carapace and an eagle-like hooked upper jaw (Fig. 1). The species prefers its body temperature to be around 25.3 °C 5 . While big-headed turtles are unique in many morphological and physiological characteristics compared to other turtle species, they are not unique in the anthropogenic threats they face.
The Asian turtle crisis has resulted in population declines in many species, including the big-headed turtle. Over harvesting for the pet trade, traditional medicine, and food represent the main drivers of the population decline and have resulted in an over 89% reduction in population density in South China 6 . The species was listed in the CITES (Convention on International Trade in Endangered Species of Wild Fauna and Flora) Appendix I in 2013 and as an endangered species in the IUCN (International Union for Conservation of Nature) Red List of Threatened Species in 2000 1 . Conservation and restoration measures are needed to protect this endangered species, and a genomic resource will be an essential tool for conservation efforts. However, there is currently little knowledge about the big-headed turtle's genetic information, with previous research mainly focusing on mitochondrial DNA and the development of microsatellite markers [7][8][9][10][11] .
In this study, we report the first sequencing, assembly, and annotation of the big-headed turtle genome. The final draft genome assembly was approximately 2.32 Gb with a contig N50 of 41.8 kb and scaffold N50 of 7.22 Mb. A total of 25,995 protein-coding genes and 19,177 non-coding RNAs (409 rRNAs, 2,089 tRNAs, 16,050 miRNAs, and 629 snRNAs) were predicted from the genome assembly. The genomic resource of the big-headed turtle will be a key tool in the study of conservation genetics for this species. www.nature.com/scientificdata www.nature.com/scientificdata/

Methods
Sample collection and sequencing. The genomic DNA of the big-headed turtle was extracted from the leg muscles of a single adult male obtained from Heyuan, Guangdong Province, China. The sampled turtle was one of a number that the law enforcement agencies confiscated from the black market and then transferred to scientific research institutions for study and captive breeding. Our institute has government permission to use confiscated big-headed turtles for scientific research (e.g. conservation genetics, ecology). The sampled turtle was euthanized and the animal collection and utility protocols were reviewed and approved by the Animal Ethics Committee at the Guangdong Institute of Applied Biological Resources (No: GIABR20170103). Ten paired-end libraries including four short-insert libraries (250 bp × 2, and 500 bp × 2) and six long-insert libraries (2 kb × 2, 5 kb × 2, and 10 kb × 2) were constructed and sequenced on an Illumina HiSeq X platform according to the manufacturer's instructions (Illumina, San Diego, California, USA). The sequenced read length was 150 bp for each library. A total of 497.1 Gb (208.9×) of raw sequences were eventually obtained (Table 1). Prior to assembly, quality control was performed for raw reads using a SOAPfilter to filter out the adaptor sequences, the reads containing more than 10% unidentified nucleotides, and low-quality reads containing more than 50% bases with Illumina phred quality score ≤8. We obtained approximately 469.2 Gb of clean reads for further assembly. assembly and evaluation. A total of 170 Gb high-quality reads from the short-insert reads (350 bp) were used to estimate the genomic information of the big-headed turtle, and 17-mer frequency information was generated based on the K-mer analysis as implemented in the software GEC 12 . The heterozygous ratio was also evaluated based on the frequency of the heterozygous k-mers and homozygous k-mers using GCE software 12,13 . According to 17-mer analysis the estimated genome size of big-headed turtle was 2,383.87 Mb (~2.38 Gb), and the estimated heterozygous and repeat sequencing ratios were calculated to be 0.33 and 53%, respectively ( Fig. 2 and Table 2).
De novo assembly was performed from the generated clean reads using SOAPdenovo 14 , a de Bruijn graph algorithm-based de novo genome assembler. We assembled the big-headed turtle genome in three steps: contig construction, scaffolding, and gap filling. First, three K (45, 59, 71) values were used to assemble the genome, according to the N50 length of contig and the BUSCO assessments of three genomes. The genome of 59-mer was chosen as the final genome for subsequent analysis. The clean reads of short-insert libraries (250 bp and 500 bp) were used to construct the contigs with 59-mer. Then reads of long-insert libraries (2 kb, 5 kb and 10 kb) were implemented to link the contig sequences into scaffold sequences. To further improve assembly quality, GAPcloser 14 and SSPACE 15 were applied to reduce gap regions and raise scaffold length using a genome sketch assembled by SOAPdenovo. This last step improved the contig N50 and N90 sizes to 41,757 and 5,528 bp, and the scaffold N50 and N90 sizes to 7,221,511 and 257,323 bp, respectively, with the fragments being longer than 100 bp ( Table 3). The final assembly of the big-headed turtle genome had a total length of 2.32 Gb, which was similar to the three previously published turtle genomes: Chrysemys picta bellii 16 , Chelonia mydas and Pelodiscus sinensis 17 .  www.nature.com/scientificdata www.nature.com/scientificdata/ The key assembly statistics of the big-headed turtle genome are comparable to or better than those of previously published turtle genomes (Table 4).

annotation
Repeat annotation. There are two major types of repetitive sequences: tandem repeats and interspersed repeats 18 . For the repeat annotation of the big-headed turtle genome, both homology-based predictions and de novo methods were performed. In the homolog-based methods, interspersed repeats were identified using RepeatMasker 19 and RepeatProteinMask to search against the published RepBase sequences. In the de novo method, RepeatMasker and RepeatModeler (http://www.repeatmasker.org/RepeatModeler.html) were used to detect interspersed repeats in the genome. Tandem Repeats Finder (TRF) 20 was subsequently used to search for tandem repeats. Overall, the results identified a total of 924 Mb of non-redundant repetitive sequences in the bigheaded turtle genome, which account for 39.84% of the whole genome ( Table 5). The most predominant elements were long interspersed nuclear elements (LINEs), which accounted for 27.47% (637 Mb) of the genome (  www.nature.com/scientificdata www.nature.com/scientificdata/ Gallus gallus, Gekko japonicus, Nanorana parkeri, Parus major, Pelodiscus sinensis, Philomachus pugnax and Xenopus laevis were downloaded from the NCBI database and mapped onto the big-headed turtle genome using TBLASTn 21 with an E-value cutoff of 1 E −5 . Then, homologous genome sequences were aligned against the matching proteins using GeneWise 22 to define gene models. In the ab initio prediction, Augustus 23 , GlimmerHMM 24 and SNAP 25 were used to predict the coding regions of genes. To optimize the genome annotation, seven RNA tissue libraries (liver, skin, lung, intestine, heart, muscle and stomach) were constructed according to the manufacturer's instructions (Illumina, San Diego, California, USA) and a total of 61.27 Gb of sequence data was generated. RNA-seq reads were used in de novo assembly with Trinity 26 . The unique transcriptional sequences were employed to predict gene models using PASA 27 . In total, 25,995 non-redundant protein-coding genes were annotated in the big-headed turtle genome (Table 7).
Functional annotation of genes. The functional annotation of protein-coding genes of the big-headed turtle genome were predicted by aligning protein sequences against public databases including SwissProt (http:// www.gpmaw.com/html/swiss-prot.html), KEGG (http://www.genome.jp/kegg/) and TrEMBL (http://www. uniprot.org) using BLASTp with an E-value cutoff of 1 E −5 . Protein motifs and domains were annotated using InterPro 28 , and Gene Ontology (GO) 29 terms for each gene were retrieved from the corresponding InterPro results. Overall, 22,400 protein-coding genes (86.2%) were successfully annotated (Table 8).
Non-coding RNA annotation. The non-coding RNAs were predicted in the big-head genome based on four categories: ribosomal RNA (rRNA), transfer RNA (tRNA), microRNAs (miRNA) and small nuclear RNA (snRNA). tRNAscan-SE 30 was applied to identify tRNA with eukaryotic parameters according to the characteristics of tRNA. Because of the highly conserved characteristics of rRNA, BLASTN 31 was used to predict rRNA sequences by aligning with a human template with an E-value of 1 E −10 . The miRNA and snRNA sequences were identified using INFERNAL 32 by searching against the Rfam database. We identified a total of 409 rRNA, 2,089 tRNA, 16,050 miRNA, and 629 snRNA genes in the big-headed turtle genome (Table 9).

Data Records
This Whole Genome Shotgun project including the assembled genome sequence and the structural and functional annotation of genes has been deposited at DDBJ/ENA/GenBank under the accession QXTE00000000 33 Table 9. Summary of non-coding RNA.
www.nature.com/scientificdata www.nature.com/scientificdata/ technical Validation assessing the completeness of the genome assembly. To assess the quality of the genome assembly, we performed three independent evaluations as described below. First, the base content was counted with scaffolds longer than 100 bp and the results showed that the GC content for the big-headed turtle was 44.63% (Table 10), which was comparable to those of Chrysemys picta bellii, Chelonia mydas and Pelodiscus sinensis (Table 4). Second, the short-insert paired-end reads (250 bp and 500 bp) were mapped to the genome with BWA 36 . The mapping rate was 98.84% and the genome coverage was 99.57% (Table 11), indicating high reliability of genome assembly. Third, the SNPs (Single Nucleotide Polymorphisms) were counted to validate the uniformity of the genome using SAMtools 37 , and we found the ratio of homozygous SNPs was only 0.014% (Table 12), indicating that the assembly had a high base accuracy. Finally, CEGMA (http://korflab.ucdavis.edu/dataseda/ cegma/) and BUSCO (http://busco.ezlab.org/) were used to evaluate the completeness of the assembly. CEGMA assessment showed that our assembly captured 226 (91.13%) of the 248 ultra-conserved core eukaryotic genes, of which 202 (81.45%) were complete (Table 13). BUSCO analysis showed that 95.2% and 2.6% of the expected vertebrate genes were identified as complete and fragmented, respectively, while 2.2% were considered missing in the assembly (Table 14).
Annotation filtering and validation. The EVM software 38 was used to merge the above results of gene annotation and 39,212 genes were obtained from the merged set. To further revise the genome annotation, we removed the following type of genes: (1) genes with overlap regions of TE ≥ 20%; (2) premature termination genes; (3) genes with only de novo predictive support. After filtering, 25,995 genes were retained. In addition, total RNA-seq reads of 7 tissues were mapped onto the big-headed turtle genome to further identify exon regions and splice positions using Tophat 39 and Cufflinks 40 , and 20,028 (77.05%) genes had evidence supports of RNA data (RPKM value > 1) cross above 7 tissues.    www.nature.com/scientificdata www.nature.com/scientificdata/

Code availability
The execution of this work involved using many software tools, whose settings and parameters are described below.