Comprehensive transcriptome data for endemic Schizothoracinae fish in the Tibetan Plateau

The Schizothoracinae fishes, which are endemic to the Tibetan Plateau, are considered ideal models for investigating highland adaptation and speciation. Although several transcriptome studies of highland fishes have been reported, transcriptome information for Schizothoracinae is still lacking. To obtain comprehensive transcriptome data for Schizothoracinae, the transcriptomes of 183 samples from 14 representative Schizothoracinae species were sequenced and de novo assembled. In total, about 1,363 Gb of clean transcriptome data was obtained. After assembly, we obtained 76,602–154,860 unigenes per species, with N50 lengths of 1,564–2,143 bp. More than half of the unigenes were functionally annotated using public databases. The Schizothoracinae fishes in this work exhibit diverse ecological distributions, phenotypic characters and feeding habits; the comprehensive transcriptome data for these species therefore provide valuable information for studying the environmental adaptation and speciation of Schizothoracinae in the Tibetan Plateau.


Background & Summary
The Tibetan Plateau, the world's largest and highest plateau, has unique geographical and climatic characteristics, such as high altitude, large differences between day and night temperatures, and strong solar radiation 1. Because of this special geographical environment, many highland species distributed in and around the Tibetan Plateau have gradually evolved unique characteristics that allow them to tolerate harsh living conditions 2. The Schizothoracinae fishes, members of the family Cyprinidae, are endemic to the Asian highlands and comprise 15 genera and ca. 100 species 3. In China, more than 70 species, accounting for over 80% of the world's Schizothoracine fishes, are mainly distributed in the lakes and rivers of the Tibetan Plateau and adjacent areas 4. According to their morphological characteristics, the Schizothoracine fishes can be divided into three groups: the primitive group, the specialized group and the highly specialized group 5. Several studies on the morphology, archaeology and molecular biology of Schizothoracine fishes on the Tibetan Plateau have shown that species diversity is closely correlated with the uplift of the Tibetan Plateau 6,7 and that morphological traits of Schizothoracine fishes, such as pharyngeal teeth, scales and whiskers, are related to specific periods in the geological evolution of the Tibetan Plateau 5. The Schizothoracine fishes are therefore considered good model species for investigating highland adaptation and speciation. More genomic and transcriptome data are required to decipher the relationship between the speciation of Schizothoracine fishes and the uplift of the Tibetan Plateau.
Recent advances in sequencing technologies have made it possible to obtain the genomes of numerous highland animals, enabling a better understanding of the mechanisms of adaptive evolution in highland species. So far, the vast majority of genome studies on environmental adaptation have been performed on highland terrestrial animals (e.g., yak 8 and Tibetan antelope 9). Few studies have been reported on highland fish, especially Schizothoracinae fishes. One of the major reasons is the complexity of their genomes, such as the high repeat content and polyploidy 10. Transcriptome sequencing is a good way to construct a sequence dataset of transcribed genes in many polyploid cases 11. Although several transcriptome analyses of highland adaptation in Schizothoracine fishes have been reported [12][13][14][15][16], the species and tissues used for transcriptome sequencing are still limited. There is great demand for more transcriptome sequencing data to study the adaptation and evolution of Schizothoracine fishes in the Tibetan Plateau. In this work, we obtained and released a total of ∼1.36 Tb of high-quality transcriptome data for 183 samples from 14 representative Schizothoracine fishes, covering 5 genera from 6 drainage systems and 3 lakes in the Tibetan Plateau (Tables 1, 2 and Fig. 1). The differences in distribution, ecological position and phenotype among these Schizothoracine species make their transcriptomes invaluable genetic resources for studying the adaptation and speciation of endemic fish in the Tibetan Plateau.
All individuals were anesthetized with MS-222 (Solarbio, Beijing, China) for a few minutes before sample collection. A total of 183 tissue samples were collected from 14 representative Schizothoracine fishes in our study, including muscle, liver, spleen, gonads, skin, swim bladder, gut, eye, gill, kidney, heart, brain, blood, fat and vibrissa (Table 2). All tissues were frozen in liquid nitrogen immediately after dissection and then stored at −80 °C until total RNA isolation.
RNA extraction and sequencing. Total RNA was isolated from each sample using RNAiso Plus (TaKaRa, Dalian, China) according to the manufacturer's instructions, and RNA sample integrity was assessed with a photometer (Thermo Scientific, USA). RNA samples passing the quality criteria (see Technical Validation for details) were used for library preparation and RNA sequencing. All samples were sequenced on an Illumina HiSeq X Ten platform in 150 bp paired-end mode. In the present study, more than 10 billion raw paired-end reads in total were obtained from all libraries. After removal of adaptor sequences, contaminated reads and poor-quality reads, we obtained approximately 1.4 Tb of clean data, with the proportion of Q20 bases above 96.94%. On average, 7.6 Gb of sequencing data was obtained per sample (Supplementary Table S1). The transcriptome data for Oxygymnocypris stewarti, of the genus Oxygymnocypris, reported in our previous study 17, were also used for comparison in this work.
De novo assembly of fish transcriptome. We first used the publicly available Trinity software (version 2.5.1) 18 with default parameters for de novo assembly of the fish transcripts. Contigs shorter than 200 bp in each assembly were discarded from subsequent analysis. Next, redundant transcripts for each species were removed using the CD-HIT-EST program in the cd-hit-v4.6.6 package 19 with the parameters -c 0.98 -n 11 -d 0 -M 0 -T 8, and the longest transcript in each cluster was taken as the unigene. After assembly, the number of unigenes varied among the 15 Schizothoracine species (Table 3). The highest number of unigenes was observed in P. kaznakovi, and the lowest in S. labiatus. The GC content of the transcripts was stable across species, at around 40–42%. The N50 length of unigenes ranged from 1,564 to 2,143 bp, and the mean unigene length across all fish transcriptomes was 1,250 bp. As shown in Fig. 2, the unigene length distribution is comparable for all Schizothoracine species, and the average length per species ranged from 1,120 to 1,392 bp. The completeness of the assembled transcriptome sequences was evaluated with the BUSCO pipeline. Although BUSCO is generally used to evaluate the completeness of genome assemblies, we applied BUSCO version 3.0.2 to assess the quality of the transcriptome assemblies in this work. More than 98% of the 2,586 vertebrate BUSCO genes were detected in our transcriptomes, and 85–92% were identified as complete, depending on the species (Fig. 3), suggesting that the transcriptomes capture the conserved genes with a rather high level of completeness. We also found a high fraction of duplicated BUSCOs in all species (Fig. 3), which is consistent with the fact that the majority of Schizothoracine fishes are polyploid.
[Figure caption fragment] ... Table 1. Altitude is represented by the color bar, from white (high altitude) to green (low altitude).
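The length filter and the N50 statistic reported for the assemblies can be illustrated with a short Python sketch. This is not code from this work (no custom scripts were used); the function names and the toy contig lengths are hypothetical, and the N50 definition is the standard one: the length L such that contigs of length at least L contain at least half of the assembled bases.

```python
def filter_contigs(lengths, min_len=200):
    """Keep contigs of at least min_len bp (the text discards contigs < 200 bp)."""
    return [n for n in lengths if n >= min_len]


def n50(lengths):
    """Return the N50 of a list of contig lengths."""
    if not lengths:
        raise ValueError("no contig lengths given")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half of all bases are covered.
    for n in sorted(lengths, reverse=True):
        running += n
        if running >= half_total:
            return n


# Toy contig lengths in bp (hypothetical, not taken from the dataset).
toy_lengths = [150, 180, 250, 400, 900, 1600, 2100]
kept = filter_contigs(toy_lengths)
mean_length = sum(kept) / len(kept)
print(len(kept), n50(kept), mean_length)
```

In this toy case the N50 (1,600 bp) is larger than the mean length (1,050 bp), which mirrors the general pattern in the reported statistics, where N50 values exceed the average unigene lengths.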
www.nature.com/scientificdata
Functional annotation of transcriptome. To annotate the assembled unigenes, we searched all unigenes for homologous sequences in four publicly available functional databases (BLASTX search; E-value cutoff of 1 × 10^-10): the NCBI non-redundant protein database (NR), Swiss-Prot, the KEGG pathway database and the KOG database. Only the best hit, with the highest sequence homology, was used for annotation. Gene ontology (GO) term analysis of the predicted proteins, based on the NCBI NR results, was then performed with the Blast2GO software (version 3.1) with default parameters. At least 40.2% of the unigenes in each species were annotated based on proteins in the four public databases (Table 4 and Supplementary Fig. S1). We also found that longer assembled unigenes (≥2,000 bp) showed higher match efficiency than shorter unigenes (≤500 bp) during annotation; the same result has been reported in other animals 20.
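The best-hit rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure used in this work: it assumes BLAST tabular output with the standard 12 columns (query ID in column 1, subject ID in column 2, E-value in column 11, bit score in column 12), and it takes "best hit" to mean the hit with the highest bit score among those passing the E-value cutoff. The unigene and protein identifiers are hypothetical.

```python
import csv
import io

E_VALUE_CUTOFF = 1e-10  # cutoff stated in the text


def best_hits(tabular_text, cutoff=E_VALUE_CUTOFF):
    """Map each query ID to (subject ID, E-value, bit score) of its best hit."""
    best = {}
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > cutoff:
            continue  # hit does not pass the E-value cutoff
        # Assumption: rank hits by bit score; keep the highest per query.
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Toy tabular records (hypothetical identifiers and scores).
rows = [
    ["unigene1", "protB", "88.0", "300", "36", "0", "1", "900", "1", "300", "1e-80", "310"],
    ["unigene1", "protA", "95.0", "300", "15", "0", "1", "900", "1", "300", "1e-150", "520"],
    ["unigene2", "protC", "40.0", "90", "54", "2", "1", "270", "5", "94", "1e-5", "50"],
]
toy_table = "\n".join("\t".join(r) for r in rows)

hits = best_hits(toy_table)
annotated_fraction = len(hits) / 2  # two query unigenes in the toy input
print(hits, annotated_fraction)
```

Here unigene2 remains unannotated because its only hit exceeds the E-value cutoff, so the annotated fraction of the toy input is 0.5.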
Technical Validation
RNA integrity. The transcriptomes of twelve tissues from three fish individuals were sequenced. Before constructing the RNA-Seq libraries, the concentration and quality of total RNA were evaluated using a NanoVue Plus spectrophotometer (GE Healthcare, NJ, USA). The total amount of RNA, RNA integrity and rRNA ratio were used to estimate the quality, content and degradation level of the RNA samples. In the present study, RNA samples with a total RNA amount ≥ 10 μg, RNA integrity number ≥ 8, and rRNA ratio ≥ 1.5 were used to construct the sequencing libraries.
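As a minimal sketch, the three acceptance thresholds above can be written as a single predicate in Python. The class, field names and sample values are hypothetical and serve only to make the criteria explicit; they are not taken from this work.

```python
from dataclasses import dataclass


@dataclass
class RnaSample:
    name: str            # hypothetical sample label
    total_rna_ug: float  # total RNA amount in micrograms
    rin: float           # RNA integrity number
    rrna_ratio: float    # rRNA ratio as reported by the instrument


def passes_rna_qc(sample, min_ug=10.0, min_rin=8.0, min_rrna_ratio=1.5):
    """Apply the three thresholds stated in the text (all must hold)."""
    return (
        sample.total_rna_ug >= min_ug
        and sample.rin >= min_rin
        and sample.rrna_ratio >= min_rrna_ratio
    )


# Toy samples (hypothetical values).
samples = [
    RnaSample("liver_1", 25.0, 9.1, 1.8),
    RnaSample("gill_1", 12.0, 7.4, 1.6),   # fails the RIN threshold
    RnaSample("brain_1", 8.5, 8.6, 1.9),   # fails the amount threshold
]
passed = [s.name for s in samples if passes_rna_qc(s)]
print(passed)
```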
Quality filtering of Illumina sequencing raw reads. The raw sequencing reads generated on the Illumina platform were rigorously cleaned using the following procedure, as in a previous study 38. First, adaptors in the reads were filtered out; second, reads with more than 10% N bases were filtered out; third, reads in which more than 50% of the bases were of low quality (Phred quality score ≤ 5) were filtered out. If either read of a pair was classified as low quality, both reads of the pair were discarded. The raw sequencing reads were also evaluated with respect to quality distribution, GC content distribution, base composition, average quality score at each position and other metrics.
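The N-content and low-quality criteria, together with the pair-level rule, can be illustrated with the following Python sketch. It is not the filtering software used in this work. It assumes Phred+33 quality encoding, omits the adaptor step (which depends on the adaptor sequences and the tool used), and uses hypothetical function names and toy reads.

```python
def is_low_quality(seq, qual, max_n_frac=0.10, max_lowq_frac=0.50,
                   lowq_phred=5, offset=33):
    """Return True if a read fails the N-content or low-quality criterion.

    Assumes Phred+33 encoding: quality score = ord(character) - 33.
    """
    if not seq or len(seq) != len(qual):
        raise ValueError("sequence and quality strings must be non-empty and equal in length")
    n_frac = seq.upper().count("N") / len(seq)
    lowq_frac = sum(1 for c in qual if ord(c) - offset <= lowq_phred) / len(qual)
    return n_frac > max_n_frac or lowq_frac > max_lowq_frac


def keep_pair(read1, read2):
    """Keep a pair only if neither read is low quality; each read is (seq, qual)."""
    return not (is_low_quality(*read1) or is_low_quality(*read2))


# Toy reads (hypothetical). 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
good = ("ACGTACGTAC", "IIIIIIIIII")
many_n = ("ANNTACGTAC", "IIIIIIIIII")   # 20% N bases
low_q = ("ACGTACGTAC", "######IIII")    # 60% of bases at Q2

print(keep_pair(good, good), keep_pair(good, many_n), keep_pair(low_q, good))
```

Because the pair is the unit of filtering, a single failing mate removes both reads, which keeps the two output files synchronized for paired-end assembly.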

Code availability
No specific code or script was used in this work. All commands used in the data processing were executed according to the manuals and usage instructions of the corresponding bioinformatics software.
Table 4. Functional annotation summary by species. The numbers of hits for NR, Swiss-Prot, KOG, GO and KEGG are summarized. The ratio is the percentage of annotated unigenes among all assembled sequences. # The transcriptome data for Oxygymnocypris stewarti were reported in our previous study 17.