The sequence and de novo assembly of Oxygymnocypris stewartii genome

Liu, Hai-Ping; Xiao, Shi-Jun; Wu, Nan; Wang, Di; Liu, Yan-Chao; Zhou, Chao-Wei; Liu, Qi-Yong; Yang, Rui-Bin; Jiang, Wen-Kai; Liang, Qi-Qi; Wangjiu; Zhang, Chi; Gong, Jun-Hua; Yuan, Xiao-Hui; Mou, Zhen-Bo

doi:10.1038/sdata.2019.9

Download PDF

Data Descriptor
Open access
Published: 05 February 2019

The sequence and de novo assembly of Oxygymnocypris stewartii genome

Hai-Ping Liu¹^na1,
Shi-Jun Xiao^1,2^na1,
Nan Wu³^na1,
Di Wang³,
Yan-Chao Liu¹,
Chao-Wei Zhou¹,
Qi-Yong Liu¹,
Rui-Bin Yang⁴,
Wen-Kai Jiang ORCID: orcid.org/0000-0001-8863-3117³,
Qi-Qi Liang³,
Wangjiu¹,
Chi Zhang¹,
Jun-Hua Gong¹,
Xiao-Hui Yuan² &
…
Zhen-Bo Mou¹^na1

Scientific Data volume 6, Article number: 190009 (2019) Cite this article

3754 Accesses
24 Citations
4 Altmetric
Metrics details

Subjects

Abstract

Animal genomes in the Qinghai-Tibetan Plateau provide valuable resources for scientists to understand the molecular mechanism of environmental adaptation. Tibetan fish species play essential roles in the local ecology; however, the genomic information for native fishes was still insufficient. Oxygymnocypris stewartii, belonging to Oxygymnocypris genus, Schizothoracinae subfamily, is a native fish in the Tibetan plateau living within the elevation from roughly 3,000 m to 4,200 m. In this report, PacBio and Illumina sequencing platform were used to generate ~385.3 Gb genomic sequencing data. A genome of about 1,849.2 Mb was obtained with a contig N50 length of 257.1 kb. More than 44.5% of the genome were identified as repetitive elements, and 46,400 protein-coding genes were annotated in the genome. The assembled genome can be used as a reference for future population genetic studies of O. stewartii and will improve our understanding of high altitude adaptation of fishes in the Qinghai-Tibetan Plateau.

Design Type(s)	sequence analysis objective • sequence annotation objective • sequence assembly objective • transcription profiling design
Measurement Type(s)	whole genome sequencing • transcription profiling assay
Technology Type(s)	DNA sequencing • RNA sequencing
Factor Type(s)
Sample Characteristic(s)	Oxygymnocypris stewartii • muscle tissue • ovary • gut wall • kidney • adipose tissue • eye • swim bladder • skin of body • liver • heart • gill • brain • Tibetan Plateau • river

Machine-accessible metadata file describing the reported data (ISA-Tab format)

Chromosome-level genome assembly of Plagiognathops microlepis based on PacBio HiFi and Hi-C sequencing

Article Open access 19 July 2024

Genome assembly and population genomic data of a pulmonate snail Ellobium chinense

Article Open access 04 January 2024

Chromosome-level genome assembly of Acrossocheilus fasciatus using PacBio sequencing and Hi-C technology

Article Open access 03 February 2024

Background & Summary

The Qinghai-Tibetan Plateau (QTP) is the largest and highest plateau in the world¹. The upshift of QTP has formed complex mountain systems in Southwest China and greatly reshaped the drainage at this area². The rapid alteration of topography in the QTP might act as significant barriers for gene flow of many species, leading to population isolations and initiating allopatric divergence and speciation³. Genomes of fish species in the QTP provide valuable resources for scientists to understand the molecular mechanism of environmental adaptation. Although we have successfully obtained the reference genome of Glyptosternon maculatum⁴, leading to the first high-quality fish genome in Tibet-plateau, the genome information of fish species in QTP is still lacking.

The schizothoracine fishes (Schizothoracinae subfamily, Cyprinidae family, Cypriniformes order), also known as “mountain carps”, which composed of approximately 100 species in 10–13 genera⁵. They can be diagnosed by two lines of enlarged scales along both sides of the urogenital opening and anus⁶. These fishes exhibit many unique traits that adapt to the extreme environment of the QTP⁷. Therefore, this taxon provides an excellent opportunity for investigating high altitude adaptation of teleost fishes.

Distributed in the QTP and its surrounding areas, they are the largest and most diverse taxon of the QTP icthyofauna⁶. Based on morphological traits, the schizothoracine fishes can be divided into three hierarchical groups that adapt to different environments of QTP: the primitive group (including Schizothorax, Schizocypris, and Aspiorhynchus), the specialized group (including Diptychus, Gymnodiptychus, and Ptychobarbus), and the highly specialized group (including Gymnocypris, Oxygymnocypris, Chuanchia, Herzensteinia, Platypharodon, and Schizopygopsis)⁶. The evolution of the three groups was proposed to be associated with the upshift history of the plateau^6,8. Thus, schizothoracine fishes represent an excellent model for the study of speciation caused by geographical isolation, as well as a good model for the study of adaptive evolutions of fish species in the QTP.

Another prominent feature in the evolution of schizothoracine fishes is the complex chromosome compositions, and the majority of fishes in this taxon are considered to be polyploids⁹. Whole genome duplication (WGD) plays a vital role in the evolutionary history of plant and animals. There are at least three rounds of whole genome duplications early in teleost diversification^10,11, and these events were suggested to be causally related to the evolutionary success of teleost^12,13. The polyploid nature and rapid diversification of schizothoracine fishes make them a good model for the study of polyploidy driven speciation.

Oxygymnocypris stewartii (Lloyd, 1908) (NCBI Taxon ID: 361644, Fig. 1a), a highly specialized schzothoracine fish, is a one-time spawning fish species mainly distributed in the tributaries of the middle reaches in the YarlungZangbo River across an elevation ranging from roughly 3,000 m to 4,200 m¹⁴ (Fig. 1b). O. stewartii is currently listed in the Red List by the World Conservation Union (IUCN) and identified as an endangered fish¹⁵. Therefore, it is imperative to protect and restore the population resources of the O. stewartii.

**Figure 1: A picture of *Oxygymnocypris stewartii*.**

In this report, we provide the whole genome sequence of O. stewartii through the PacBio single molecule sequencing technique (SMRT). The availability of a fully sequenced and annotated genome is essential to support basic biological studies and will be helpful to the development of further protection strategies for this endangered species. Its whole genome sequence will also provide a foundation to explore the adaptive evolutionary processes of highland fishes, supplied as a starting point to study speciation mechanisms caused by the rapid rising of the QTP.

Methods

Sample collection and sequencing

A healthy female fish captured from Gongga Country, Lhasa, Tibet (Fig. 1a,b) was used for genome sequencing. Genomic DNA was isolated using Qiagen DNA purification kit (Qiagen, Valencia, CA, USA) from the white muscular tissue as in our previous studies⁴.

To generate enough read data for the genome assembly, both the PacBio SEQUEL and the Illumina HiSeq 4000 platform were used for the sequencing. Long reads generated from the PacBio platform were used for genome assembly, and the short but accurate reads from the Illumina platform were analyzed for genome survey and base level correction after the assembly. For the PacBio platform, genomic sequencing libraries were constructed according to the PacBio suggested protocol and 141.1 Gb long sequencing reads were obtained from 27 SMRT cells. A total of 140.7 Gb (coverage of 74.3×) subreads were obtained after removing adaptors in polymerase reads (Table 1). The subreads N50 and average lengths were 14.2 and 9,0 kb, respectively. For the Illumina HiSeq 4000 sequencing platform, one ug genomic DNA molecules were used for sequencing library construction. DNA molecules were fragmented, end-paired and ligated to the adaptor, which was further fractionated on agarose gels and purified by PCR amplification. To improve the representativeness of reads for the O. stewartii genome, 11 paired-end sequencing libraries were constructed with insert length of 250 bp according to Illumina’s protocol (Illumina, San Diego, CA, USA). Finally, a total of 145.4 Gb (coverage of 70.8×) short sequencing reads were generated. Reads with the adaptors and a quality value lower than 20 (corresponding to a 1% error rate) were filtered out. As a result, we obtained 144.3 Gb cleaned reads for the k-mer analysis and base correction of the genome (Table 1).

Table 1 Sequencing data used for the Oxygymnocypris stewartii genome assembly.

Full size table

The individual used for the genomic sequencing was also used for the transcriptome sequencing, providing necessary gene expression data for the genome sequence annotation. Given that gene expression exhibited clear tissue-specificity, 12 tissues, including skin, eye, swim bladder, muscle, brain, gill, heart, liver, gut, ovary, fat tissue and kidney were collected for the following transcriptome sequencing. As per the similar method in our previous study⁴, RNA molecules were extracted using RNAiso Pure RNA Isolation Kit (Takara, Japan) for all samples, and DNase I treatment was performed to eliminate DNA contamination. After the quality assessment of the extracted RNAs using NanoVue Plus spectrophotometer (GE Healthcare, NJ, USA), RNA-seq libraries were constructed according to the protocol⁴ and were sequenced by Illumina HiSeq 4000 in paired-end 150 bp mode, resulting in a total of ~50 Gb transcriptome data. All genome and transcriptome sequencing data were summarized in Table 1.

De novo assembly of Oxygymnocypris stewartii genome

Genome size was estimated using Illunima sequencing data with the Kmer-based method¹⁶. As per our previous study⁴, we estimated the genome size of O. stewartii by the Kmer frequency distribution. Jellyfish (v2.1.3)¹⁷ was used to calculate the frequency of each Kmer from the short sequencing data (Table 2 and Fig. 2). As a result, we estimated the genome size of O. stewartii to be approximately 1,893.5 Mb.

Table 2 Statistics of 17-mer analysis for Oxygymnocypris stewartii genome.

Full size table

**Figure 2: 17-mer frequency distribution in *Oxygymnocypris stewartii* genomes.**

The long reads generated from the PacBio SEQUEL platform were assembled into contigs using the FALCON package¹⁸ with default parameters. After the self-error correction step in the FALCON, we got 104.9 Gb (55.4x coverage) of error-corrected pre-assembly reads. The assembly of the PacBio data alone resulted in a genome of 1,898.4 Mb with a contig N50 length of 240.3 kb. The assembled genomic sequences were further polished by two rounds of polishing with Quiver¹⁹ using the PacBio long reads. After that, another round of the genome-wide base-level correction was performed with the Illumina short sequencing data by Pilon²⁰. In the end, we obtained the final 1,849 Mb draft genome of O. stewartii with a contig N50 length of 257.1 kb (Table 3).

Table 3 The statistics of length and number for the de novo assembled genome of Oxygymnocypris stewartii.

Full size table

The completeness and the accuracy of the genome were evaluated by CEGMA, BUSCO and read mapping. The completeness of the genome assembly was assessed by the single copy orthologs (BUSCO, version 3.0)²¹ and CEGMA²² software. 94.2% complete and 3.6% partial of the 2,586 vertebrate BUSCO genes were identified in the final assembly. Using CEGMA²², we revealed that 95.56% of the 248 core genes were evolutionarily conserved genes identified in the genome. Both BUSCO and CEGMA confirmed the completeness of the genome assembly. The accuracy of the genome was evaluated by the Illumina short read mapping with BWA²³ and the transcript alignment with BLAT²⁴. More than 98.6% of the reads were aligned to the genome, and the insert length distribution exhibited a single peak that was consistent with the experimental design. Meanwhile, the transcriptome was de novo assembled by Trinity²⁵, and the transcripts were mapped to the genome assembly using BLAT²⁴ with default parameters. We found that the alignment coverage (alignment length to transcript length) of expressed genes ranged from 96.44 to 99.95% in the genome assembly.

Repetitive element and non-coding gene annotation in the O. stewartii genome

To annotate repeat elements in the O. stewartii genome, both homologous comparison and ab initio prediction were applied. The similar annotation process in our previous work⁴ was employed. For ab initio repeat annotation, LTR_FINDER²⁶, RepeatScout²⁷, and RepeatModeler (http://repeatmasker.org/RepeatModeler/) were used to construct a de novo repetitive element database, and the RepeatMasker²⁸ (http://repeatmasker.org/RMDownload.html) were used to annotate repeat elements with the database. Then, RepeatMasker and RepeatProteinMask²⁸ were used for known repeat element types by searching against Repbase database²⁹. Tandem repeats were also ab initio predicted using TRF tool³⁰. A total of 822.84 Mb repetitive elements were identified in the O. stewartii genome by those repeat annotation processes, accounting for 44.50% of the whole genome (Tables 4 and 5 and Fig. 3).

Table 4 The annotation of repeated sequences in the Oxygymnocypris stewartii genome using TRF, RepeatMasker, and RepeatProteinMask.

Full size table

Table 5 Summary statistics of repeat annotation in Oxygymnocypris stewartii.

Full size table

**Figure 3: Distribution of the divergence rate of each type of repetitive element in *Oxygymnocypris stewartii* genome.**

For non-coding genes, 24,208 tRNAs were predicted using tRNAscan-SE³¹, and 1,363 rRNA genes were annotated using BLASTN tool with an E-value of 1E-10³² against human rRNA sequence. Small nuclear and nucleolar RNAs in the O. stewartii genome were also annotated by the infernal tool³³ using Rfam database³⁴ (Table 6).

Table 6 The number of the annotated non-coding RNA in the Oxygymnocypris stewartii genome.

Full size table

Protein-coding gene prediction and functional annotation

The gene model prediction method in our previous study⁴ was applied to the protein-coding gene annotation in the O. stewartii genome. We merged the evidence of the gene prediction from multiple methods, including homolog based, ab initio and RNA-seq based annotations. The protein and coding sequences were obtained from the Ensembl database³⁵ for the following species, including human (Homo sapiens, GCF_000001405.37), mouse (Mus musculus, GCF_000001635.26), zebrafish (Barchydanio rerio var, GCF_000002035.5), common carp (Cyprinus carpio, GCF_000951615.1), tiger puffer (Takifugu rubripes, GCF_000180615.1), channel catfish (Ictalurus punctatus, GCF_001660625.1), Sinocyclocheilus graham (GCF_001515645.1) and grass carp³⁶ (Ctenopharyngodon idellus). The protein sequences were aligned against the O. stewartii genome using TBLASTN³⁷ search with parameters of e-value 1e-5. After filtering low-quality records, the gene structure was predicted by GeneWise³⁸ (referred to “Homology” in Table 7). Secondly, transcripts assembled from twelve tissues RNA-Seq data were aligned against the O. stewartii genome using Program to Assemble Spliced Alignment (PASA)³⁹ (referred to “PASA” in Table 7). Augustus⁴⁰, GeneID⁴¹, GeneScan⁴², GlimmerHMM⁴³, and SNAP⁴⁴ were used for ab initio prediction with the optimized parameters that trained using high-quality proteins that derived from the PASA gene models. RNA-seq reads were also aligned to the O. stewartii genome directly using TopHat⁴⁵ v2.0.9, and the gene models were constructed by Cufflinks⁴⁶ v2.2.1 (referred to Cufflinks in Table 7). Finally, EvidenceModeler³⁹ was applied to combine all gene models that were predicted by various methods with the identical weights with our previous work⁴. Untranslated regions (UTRs) and alternative splicing variations were annotated using PASA2³⁹ (referred to “PASA-update” in Table 7). Finally, 46,400 protein-coding genes with a mean of 8.41 exons per gene (Table 7) were annotated in the O. stewartii genome. The statistics of gene models, including lengths of a gene, CDS, intron, and exon in O. stewartii were comparable to those for close-related species (Table 8 and Fig. 4).

Table 7 The statistics of gene models of protein-coding genes annotated in the Oxygymnocypris stewartii genome.

Full size table

Table 8 The comparison of the gene models annotated from the Oxygymnocypris stewartii genome and other teleosts.

Full size table

**Figure 4: Comparisons of the prediction gene models in the *Oxygymnocypris stewartii* genome to other species.**

Public biological function databases of SwissProt⁴⁷, InterPro⁴⁸, NR from NCBI and Kyoto Encyclopedia of Genes and Genomes (KEGG)⁴⁹ were used for the functional annotation of the predicted genes. BLASTX utility³² were used for the homolog search with an E-value threshold of 1E-5. InterPro database⁴⁸ was used to predict protein function based on the conserved protein domains by InterproScan tool⁵⁰. A total of 45,991 genes (99.1%) were successfully annotated by at least one public database. (Table 9 and Fig. 5).

Table 9 The number of genes with homology or functional classification for Oxygymnocypris stewartii.

Full size table

Code Availability

The sequence data were generated using the software provided by the sequencing platform manufacturer and the sequencing data were processed with commands with the guidance from the public software that is cited in the manuscript. No custom computer codes were generated in this work.

Data Records

All PacBio long-read sequencing data and Illumina short-read sequencing data have been deposited to NCBI Sequence Read Archive (SRA) (Data Citation 1).

The transcriptome data are available through the NCBI SRA (Data Citation 2).

The assembled genome version is available at GenBank (Data Citation 3).

The annotation gff3 file of the assembled genome is available at Figshare (Data Citation 4).

Technical Validation

RNA integrity

The transcriptomes for twelve tissues from three fish individuals were sequenced. Before constructing RNA-Seq libraries, the concentration and quality of total RNA were evaluated using NanoVue Plus spectrophotometer (GE Healthcare, NJ, USA). The total amount of RNA, RNA integrity and rRNA ratio were used to estimate the quality, content and degradation level of RNA samples. In the present study, RNAs samples with a total RNA amount ≥10 μg, RNA integrity number ≥8, and rRNA ratio ≥1.5 were finally subjected to construct the sequencing library.

Quality filtering of Illumina sequencing raw reads

The raw sequencing reads generated from the Illumina platform were rigorously cleaned by the following procedures as in the previous study⁴. Firstly, adaptors in the reads were filtered out; secondly, reads with more than 10% of N bases were filtered out; thirdly, reads with more than 50% of the low-quality bases (phred quality score <= 5) were filtered out. If any end pair was classified as low quality, both pairs were discarded. The initially generated raw sequencing reads were also evaluated for quality distribution, GC content distribution, base composition, average quality score at each position and other metrics.

Additional information

How to cite this article: Liu, H. P. et al. The sequence and de novo assembly of Oxygymnocypris stewartii genome. Sci. Data. 6:190009 https://doi.org/10.1038/sdata.2019.9 (2019).

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

Liu, J.-Q., Wang, Y.-J., Wang, A.-L., Hideaki, O. & Abbott, R. J. Radiation and diversification within the Ligularia–Cremanthodium–Parasenecio complex (Asteraceae) triggered by uplift of the Qinghai-Tibetan Plateau. Molecular Phylogenetics and Evolution 38, 31–49 (2006).
Article CAS Google Scholar
Pan, G et al. Tectonic evolution of the Qinghai-Tibet Plateau. Journal of Asian Earth Sciences 53, 3–14 (2012).
Article ADS Google Scholar
Rieseberg, L. H. Chromosomal rearrangements and speciation. Trends in Ecology & Evolution 16, 351–358 (2001).
Article Google Scholar
Liu, H. et al. Draft genome of Glyptosternon maculatum, an endemic fish from Tibet Plateau. GigaScience 7, giy104–giy104 (2018).
ADS PubMed Central Google Scholar
Mirza, M. R. A contribution to the systematics of the Schizothoracine fishes (Pisces: Cyprinidae) with the description of three new tribes. Pakistan Journal of Zoology 23, 339–341 (1991).
Google Scholar
Cao, W., Chen, Y., Wu, Y. & Zhu, S. Origin and evolution of Schizothoracine fishes in relation to the upheaval of the Xizang Plateau. Science Press, Bejing. 125–126 (1981).
He, D., Chen, Y., Chen, Y. & Chen, Z. Molecular phylogeny of the specialized schizothoracine fishes (Teleostei: Cyprinidae), with their implications for the uplift of the Qinghai-Tibetan plateau. Chin Sci Bull 49, 39–48 (2004).
Article CAS Google Scholar
Yonezawa, T., Hasegawa, M. & Zhong, Y. Polyphyletic origins of schizothoracine fish (Cyprinidae, Osteichthyes) and adaptive evolution in their mitochondrial genomes. Genes & Genetic Systems 89, 187–191 (2014).
Article Google Scholar
Wang, X., Gan, X., Li, J., Chen, Y. & He, S. Cyprininae phylogeny revealed independent origins of the Tibetan Plateau endemic polyploid cyprinids and their diversifications related to the Neogene uplift of the plateau. Science China Life Sciences 59, 1–17 (2016).
Article ADS Google Scholar
Peer, Y. V. D., Maere, S. & Meyer, A. The evolutionary significance of ancient genome duplications. Nature Reviews Genetics 10, 725–732 (2009).
Article Google Scholar
Wittbrodt, J., Meyer, A. & Schartl, M. More genes in fish? Bioessays 20, 511–515 (1998).
Article Google Scholar
Hoegg, S., Brinkmann, H., Taylor, J. S. & Meyer, A. Phylogenetic Timing of the Fish-Specific Genome Duplication Correlates with the Diversification of Teleost Fish. Journal of Molecular Evolution 59, 190–203 (2004).
Article ADS CAS Google Scholar
Ravi, V. & Venkatesh, B. Rapidly evolving fish genomes and teleost diversity. Current Opinion in Genetics & Development 18, 544–550 (2008).
Article CAS Google Scholar
Yang, H. Y. et al. Status Quo of Fishery Resources in the Middle Reach of Brahmaputra River. Journal of Hydroecology 3, 120–126 (2010).
Google Scholar
Qi, D. L., Chao, Y., Tang, W. J. & Yang, C. Threatened fishes of the world: Oxygymnocypris stewartii (Lloyd 1908) (Cyprinidae). Environmental Biology of Fishes 86, 351 (2009).
Article Google Scholar
Liu, B. et al. Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects. Quantitative Biology 35, 62–67 (2013).
Google Scholar
Luo, R. et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience 1, 18 (2012).
Article Google Scholar
Pendleton, M. et al. Assembly and diploid architecture of an individual human genome via single-molecule technologies. Nature Methods 12, 780–786 (2015).
Article CAS Google Scholar
Chin, C. S. et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nature Methods 10, 563–569 (2013).
Article CAS Google Scholar
Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. Plos One 9, e112963 (2014).
Article ADS Google Scholar
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
Article Google Scholar
Parra, G., Bradnam, K. & Korf, I. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 23, 1061–1067 (2007).
Article CAS Google Scholar
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 27, 1754–1760 (2009).
Article Google Scholar
Kent, W. J. BLAT--the BLAST-like alignment tool. Genome Research 12, 656–664 (2002).
Article CAS Google Scholar
Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology 29, 644–652 (2011).
Article CAS Google Scholar
Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Research 35, W265–W268 (2007).
Article Google Scholar
Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in large genomes. Bioinformatics 21, i351–i358 (2005).
Article CAS Google Scholar
Tempel, S. Using and Understanding RepeatMasker. Methods Mol Biol 859, 29–51 (2012).
Article CAS Google Scholar
Bao, W., Kojima, K. K. & Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob DNA 6, 11 (2015).
Article Google Scholar
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research 27, 573–580 (1999).
Article CAS Google Scholar
Schattner, P., Brooks, A. N. & Lowe, T. M. The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic Acids Research 33, W686–W689 (2005).
Article CAS Google Scholar
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. Journal of Molecular Biology 215, 403–410 (1990).
Article CAS Google Scholar
Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29, 2933–2935 (2013).
Article CAS Google Scholar
Li, Y.-h. et al. De novo assembly of soybean wild relatives for pan-genome analysis of diversity and agronomic traits. Nature Biotechnology 32, 1045 (2014).
Article CAS Google Scholar
Hubbard, T. et al. The Ensembl genome database project. Nucleic Acids Research 30, 38–41 (2002).
Article CAS Google Scholar
Wang, Y. et al. The draft genome of the grass carp (Ctenopharyngodon idellus) provides insights into its evolution and vegetarian adaptation. Nat Genet 47, 625–631 (2015).
Article CAS Google Scholar
Gertz, E. M, Yu, Y. K, Agarwala, R, Schäffer, A. A . & Altschul, S. F. Composition-based statistics and translated nucleotide searches: improving the TBLASTN module of BLAST. BMC Biol 4, 41 (2006).
Article Google Scholar
Birney, E., Clamp, M. & Durbin, R. GeneWise and genomewise. Genome Res. 14, 988–995 (2004).
Article CAS Google Scholar
Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biology 9, R7 (2008).
Article Google Scholar
Stanke, M. & Morgenstern, B. AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Research 33, W465–W467 (2005).
Article CAS Google Scholar
Guigó, R., Knudsen, S., Drake, N. & Smith, T. Prediction of gene structure. J Mol Biol 226, 141–157 (1992).
Article Google Scholar
Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J Mol Biol 268, 78–94 (1997).
Article CAS Google Scholar
Majoros, W. H., Pertea, M. & Salzberg, S. L. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878–2879 (2004).
Article CAS Google Scholar
Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 59 (2004).
Article Google Scholar
Kim, D. et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology 14, R36 (2013).
Article Google Scholar
Trapnell, C. et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protoc 7, 562–578 (2012).
Article CAS Google Scholar
The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Research 45, D158–D169 (2016).
Article Google Scholar
Finn, R. D. et al. InterPro in 2017—beyond protein family and domain annotations. Nucleic Acids Research 45, D190–D199 (2016).
Article Google Scholar
Kanehisa, M. et al. Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Research 42, D199–D205 (2014).
Article CAS Google Scholar
Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240 (2014).
Article CAS Google Scholar

Data Citations

NCBI Sequence Read Archive SRP156257 (2018)
NCBI Sequence Read Archive SRP158092 (2018)
GenBank QVTF00000000 (2018)
Wu, N. Figshare https://doi.org/10.6084/m9.figshare.7350365.v1 (2018)

Download references

Acknowledgements

This work was supported by the special finance of Tibet autonomous region (No. 2017CZZX003, 2017CZZX004 and XZNKY-2018-C-040), the National Natural Science Foundation of China (No. 31560144 and 31602207), and the National Key Research and Development Program of China (No. 2016YFC1200500).

Author information

Hai-Ping Liu, Shi-Jun Xiao, Nan Wu and Zhen-Bo Mou: These authors contributed equally to this work.

Authors and Affiliations

Institute of Fisheries Science, Tibet Academy of Agricultural and Animal Husbandry Sciences, Lhasa, 850002, China
Hai-Ping Liu, Shi-Jun Xiao, Yan-Chao Liu, Chao-Wei Zhou, Qi-Yong Liu, Wangjiu, Chi Zhang, Jun-Hua Gong & Zhen-Bo Mou
School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China
Shi-Jun Xiao & Xiao-Hui Yuan
Novogene Bioinformatics Institute, Beijing, China
Nan Wu, Di Wang, Wen-Kai Jiang & Qi-Qi Liang
College of Fishery, Huazhong Agricultural University, Wuhan, China
Rui-Bin Yang

Authors

Hai-Ping Liu
View author publications
You can also search for this author in PubMed Google Scholar
Shi-Jun Xiao
View author publications
You can also search for this author in PubMed Google Scholar
Nan Wu
View author publications
You can also search for this author in PubMed Google Scholar
Di Wang
View author publications
You can also search for this author in PubMed Google Scholar
Yan-Chao Liu
View author publications
You can also search for this author in PubMed Google Scholar
Chao-Wei Zhou
View author publications
You can also search for this author in PubMed Google Scholar
Qi-Yong Liu
View author publications
You can also search for this author in PubMed Google Scholar
Rui-Bin Yang
View author publications
You can also search for this author in PubMed Google Scholar
Wen-Kai Jiang
View author publications
You can also search for this author in PubMed Google Scholar
Qi-Qi Liang
View author publications
You can also search for this author in PubMed Google Scholar
Wangjiu
View author publications
You can also search for this author in PubMed Google Scholar
Chi Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Jun-Hua Gong
View author publications
You can also search for this author in PubMed Google Scholar
Xiao-Hui Yuan
View author publications
You can also search for this author in PubMed Google Scholar
Zhen-Bo Mou
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

H.-P.L., W.-K.J., Q.-Y.L., and Z.-B.M. conceived the study. H.-P.L., R.-B.Y., X.-H.Y. and W.-K.J. designed the scientific objectives. Q.-Y.L. and Z.-B.M. managed the project; Y.-C.L., J.W., C.Z. and C.-W.Z. collected the samples and extracted the genomic DNA; N.W. and D.W. estimated the genome size and assembled the genome; Q.-Q.L. and S.-J.X. assessed the assembly quality; N.W. and J.H.G. carried out the repeat annotation and gene annotation. H.-P.L., S.-J.X., N.W., J.-H.G., and W.-K.J. wrote the manuscript. Also, all authors read, edited and approved the final manuscript.

Corresponding authors

Correspondence to Hai-Ping Liu or Zhen-Bo Mou.

Ethics declarations

Competing interests

The authors declare no competing interests.

ISA-Tab metadata

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ The Creative Commons Public Domain Dedication waiver http://creativecommons.org/publicdomain/zero/1.0/ applies to the metadata files made available in this article.

Reprints and permissions

About this article

Cite this article

Liu, HP., Xiao, SJ., Wu, N. et al. The sequence and de novo assembly of Oxygymnocypris stewartii genome. Sci Data 6, 190009 (2019). https://doi.org/10.1038/sdata.2019.9

Download citation

Received: 15 October 2018
Accepted: 06 December 2018
Published: 05 February 2019
DOI: https://doi.org/10.1038/sdata.2019.9

This article is cited by

Sequencing an F1 hybrid of Silurus asotus and S. meridionalis enabled the assembly of high-quality parental genomes
- Weitao Chen
- Ming Zou
- Jie Li
Scientific Reports (2021)
Recent genome duplications facilitate the phenotypic diversity of Hb repertoire in the Cyprinidae
- Yi Lei
- Liandong Yang
- Shunping He
Science China Life Sciences (2021)
Comparative transcriptome analysis of scaled and scaleless skins in Gymnocypris eckloni provides insights into the molecular mechanism of scale degeneration
- Xiu Feng
- Yintao Jia
- Yifeng Chen
BMC Genomics (2020)
Comprehensive transcriptome data for endemic Schizothoracinae fish in the Tibetan Plateau
- Chaowei Zhou
- Shijun Xiao
- Haiping Liu
Scientific Data (2020)

Subjects

Abstract

Similar content being viewed by others

Background & Summary

Methods

Sample collection and sequencing

De novo assembly of Oxygymnocypris stewartii genome

Repetitive element and non-coding gene annotation in the O. stewartii genome

Protein-coding gene prediction and functional annotation

Code Availability

Data Records

Technical Validation

RNA integrity

Quality filtering of Illumina sequencing raw reads

Additional information

References

References

Data Citations

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Competing interests

ISA-Tab metadata

Rights and permissions

About this article

Cite this article

Share this article

This article is cited by

Search

Quick links