Full-length transcriptome of Misgurnus anguillicaudatus provides insights into evolution of genus Misgurnus

Reconstruction and annotation of transcripts, particularly for a species without reference genome, plays a critical role in gene discovery, investigation of genomic signatures, and genome annotation in the pre-genomic era. This study generated 33,330 full-length transcripts of diploid M. anguillicaudatus using PacBio SMRT Sequencing. A total of 6,918 gene families were identified with two or more isoforms, and 26,683 complete ORFs with an average length of 1,497 bp were detected. Totally, 1,208 high-confidence lncRNAs were identified, and most of these appeared to be precursor transcripts of miRNAs or snoRNAs. Phylogenetic tree of the Misgurnus species was inferred based on the 1,905 single copy orthologous genes. The tetraploid and diploid M. anguillicaudatus grouped into a clade, and M. bipartitus showed a closer relationship with the M. anguillicaudatus. The overall evolutionary rates of tetraploid M. anguillicaudatus were significantly higher than those of other Misgurnus species. Meanwhile, 28 positively selected genes were identified in M. anguillicaudatus clade. These positively selected genes may play critical roles in the adaptation to various habitat environments for M. anguillicaudatus. This study could facilitate further exploration of the genomic signatures of M. anguillicaudatus and provide potential insights into unveiling the evolutionary history of tetraploid loach.

Misgurnus anguillicaudatus, also known as oriental weatherloach, is distributed throughout China and mainly occurs in the Yangtze River basin 8,9 . Due to its superior nutritional values and excellent flavor, this famous loach is widely consumed and favored in East Asia region. Therefore, M. anguillicaudatus has become an important freshwater aquaculture species for several decades in China 10 . However, the genome and gene set of this species is still not available due to the high heterozygosity levels and complex polyploidy. Thus a full-length transcriptome is quite essential for documenting the genomic signatures of M. anguillicaudatus.
This study firstly proposed the full-length transcriptome of M. anguillicaudatus derived from 12 different tissues using the PacBio SMRT sequencing that could be used as a reference in some cases. Moreover, the annotation of M. anguillicaudatus gene set could improve the whole genome annotation, contribute towards better understanding of the complexity of the genome and serve as reference sequences for gene functional studies. Although previous studies 10- 13 have provided important insights into the evolutionary status and adaptation of M. anguillicaudatus, little information is available on the difference of evolutionary adaptations of regular diploid and tetraploid M. anguillicaudatus 10 . Therefore, a number of full-length transcripts of M. anguillicaudatus obtained in this study were also applied to evaluate the evolutionary status and adaptive evolution of diploid and tetraploid M. anguillicaudatus which would provide novel insights into the genetic characteristics of natural polyploidy.

SMRT sequencing, completeness and Functional annotation.
To obtain a representative fulllength transcriptome for diploid M. anguillicaudatus, two size fractions (1-5 kb and 4.5-10 kb) of sample were sequenced on two SMRT cells using PacBio Sequel system (Table 1). Sequence data were processed through the Pacbio Iso-Seq pipeline. A total of 1,149,167 reads of insert (ROIs), including 627,338 ROIs from a SMRT cell for 1-5 kb fractions and 521,829 ROIs from another SMRT cell for 4.5-10 kb fractions, were generated with an average of 3,468 bp in length. Non-Chimeric ROIs sequences were filtered into two groups of sequences comprised of full-length ROIs sequences and non-full-length ROIs sequences. Only the reads with two primers and poly-A tail were classified to full-length non-chmeric reads, and 422,274 full-length non-chmeric reads were remained after circular consensus sequence (CCS) generation and filtering for full-length read classification. These FL reads were clustered into consensus clusters based on Quiver principle, and then 135,316 high-quality (HQ) and consensus isoforms were merged into 76,787 final consensus sequences in total. The length of these consensus sequences ranged from 302 bp to 13,253 bp with an average length of 3,614 bp, and the length exhibited a bimodal distribution as expected (Fig. 1A). The transcript N50 was 5,234 bp, and the average GC content was 41.4% (Fig. 1B). Subsequently, we utilized CD-HIT-EST for further clustering to obtain the non-redundant transcripts and obtained a total of 33,330 full-length transcripts of M. anguillicaudatus. BUSCO (Benchmarking Universal Single Copy Orthologs) was utilized to determine completeness of our transcript dataset with the Actinopterygii dataset (http://busco.ezlab.org/v2/datasets/). The results showed that this dataset contains 65.6% complete and 10.5% partial BUSCO orthologues. Meanwhile, we also used an alternative method, CEGMA (Core Eukaryotic Genes Mapping Approach) to evaluate the transcripts completeness and found that 83.9% (208 genes) of the 248 core CEGMA genes are complete in this dataset.
Functional annotation of the non-redundant transcripts was investigated using 6 different protein public databases ( Fig. 2A). Using BLASTX, 30,968 sequences (92.9%) were successfully annotated based on the NR (NCBI non-redundant proteins) database. The transcripts were then classified using the Swiss-prot (a manually annotated, non-redundant protein database), InterPro (https://www.ebi.ac.uk/interpro/), COG (Clusters of Orthologous Groups) and KEGG (Kyoto Encyclopedia of Genes and Genomes) databases (Supplementary  Table S1). Of all the hits to the NR proteins from BLASTX, the annotation results showed that the majority of the transcripts closely matched the genes from other fish species. The transcripts had the highest number of hits to the Danio rerio (15,313 hits, 45.9%), as expected, followed by Cyprinus carpio (8,731 hits, 26.2%) and Astyanax mexicanus (1,643 hits, 4.9%) proteins (Fig. 2B). Notably, 4.5% of transcripts showed no homologous sequences in the public databases.
Prediction of gene families and coding sequences. We utilized the COGENT to further partition these non-redundant transcripts into putative gene families and reconstruct each family into one or several full-length unique transcript models. Of the 33,330 transcripts, COGENT constructed 6,918 gene families with two or more isoforms (23,894 transcripts in total). A total of 26,683 complete ORFs (Open Reading Frames) were identified with TransDecoder (Supplementary Table S2). The lengths of the complete coding sequences ranged from 300 to 6,867 bp with an average length of 1497 bp (Fig. 3). Among the 26,683 complete ORFs, a minority of ORFs (1,006, 3.8%) showed no homologous entry in Pfam database.  Analysis of candidate long non-coding RNAs. In addition to protein-coding RNAs, non-coding RNAs, which are attributed to poly-A non-coding RNAs, constitute a major component of the transcriptome. After annotation with the public databases, 2,776 non-annotated transcripts were remained. The number of lncRNAs predicted by each prediction method was shown in Fig. 4A. In total, 1,208 high-confidence lncRNAs were identified with an average length of 1,593 bp ( Fig. 4B). Of these lncRNAs, only a minority (105 lncRNAs) was annotated with known RNA motifs by searching against Rfam ncRNA families 14 , and most of these appeared to be precursor transcripts of miRNAs or snoRNAs (Supplementary Table S3). The functions of the lncRNAs, especially the unannotated ones, need to be further verified.

Identification of orthologous genes among Misgurnus species and phylogenetic tree.
To reveal the adaptive evolution of Misgurnus species respond to the specific environments they inhabit, a total of 1,918 one-to-one single-copy orthologous genes, ranging from 150 to 5796 bp in length, were identified among four Misgurnus species together with a outgroup Danio rerio. Subsequently, the alignments were trimmed with Gblocks. Totally, 1,905 single copy orthologous gene with length >150 bp were remained for the further analysis (Supplementary Table S4). Phylogenetic tree was inferred from the concatenated sequences of 1,905 orthologues using ML (Maximum Likelihood) method. The phylogenetic tree showed that diploid and tetraploid M. anguillicaudatus formed a clade with a bootstrap value of 99%, indicating a close evolutionary relationship (Fig. 5A).

Accelerated Evolution and positively selective genes in M. anguillicaudatus clade.
To assess the overall evolutionary rates within four Misgurnus lineages, the calculation of substitution rates (Ka and Ks) for each orthologues was performed with codeml using the free ratio model. The levels of the Ka/Ks ratios revealed a declining trend in the four Misgurnus lineages, with the tetraploid M. anguillicaudatus lineage revealing highest Ka/Ks ratios than the other three lineages (Fig. 5B), which indicates accelerated evolution occurred in tetraploid M. anguillicaudatus compared with other Misgurnus species. The positively selective genes (PSGs) were identified with branch site model using the diploid and tetraploid M. anguillicaudatus clade as foreground branch.
We used the branch-site model implemented in codeml to identify the positively selected genes (PSGs) and the positive selected sites (PSSs) in codons along the M. anguillicaudatus clade. We identified 28 PSGs in the diploid and tetraploid M. anguillicaudatus clade (Supplementary Table S5). We classified the PSGs by GO categories and plotted the distribution of GO classifications of PSGs among the GO categories (Fig. 6). GO categories "cellular processes", "metabolic process", "cell", "cell part", "binding" contained the largest percentage of genes. The GO enrichment results exhibited that these PSGs were significantly enriched in "regulation of body fluid levels" (P = 0.012), "polysaccharide binding" (P = 0.012) and "pattern binding" (P = 0.012), indicating that the PSGs play critical roles in these biological functions. Meanwhile, the PSSs were calculated using likelihood ratio tests (LRTs) with false discovery rate (FDR) correction (P < 0.05) and Bayes Empirical Bayes (BEB) estimates (BEB > 0.95). Nine PSGs with 10 positively selective sites were detected in the M. anguillicaudatus clade ( Table 2). Among these PSSs, two sites in Ramp2, which is involved in protein transporter activity, were identified in the diploid and tetraploid M. anguillicaudatus clade, indicating that it plays important roles in environmental adaptations of M. anguillicaudatus.

Discussion
Long read reference reconstruction of the full-length transcripts. Transcriptome reconstruction and annotation, particularly for species without a reference genome, has improved significantly with the development of sequencing techniques, and plays a critical role in gene discovery, deep exploration of genomic signatures, and genome annotation in the pre-genomic era 15 . In recent years, traditional short-read RNA-seq has been widely applied in the studies of investigating the RNA expression patterns although most of the transcripts assembled with the assemblers using short reads, such as Trinity 5 , SOAPdenovo-Trans 16 or Velvet 17 , were not full-length (<1000 bp). However, short-read RNA-seq techniques still have some limits in precise reconstruction and reliable expression investigation of transcriptomic isoforms due to the complexity of the alternative splicing mechanism of eukaryotic cells, particularly in species without a reference genome sequence. In contrast, for a genome-free species, PacBio Iso-Seq is a superior strategy for the direct generation of a comprehensive transcriptome with accurate alternative splicing isoforms and novel genes. The advantages of SMRT sequencing have been extensively documented in previous studies. In this study, a full-length transcriptome of M. anguillicaudatus was sequenced with Iso-Seq technique and comprehensively analyzed. Our results showed that the non-assembled transcripts from Iso-Seq (Transcript N50 of 5,234) were much longer than the assembled transcripts from Illumina sequencing platforms (Unigene N50 of 1,662 bp) 12 , and many Iso-Seq sequencing subreads contained full-length ORFs.
In this study, we generated 33,300 transcripts using the PacBio Iso-Seq, which was much smaller than 81,300 transcripts de novo assembled in Luo et al. 18 and showed high-efficiency in recovering full-length transcript. Even though the number of predicted ORFs in the PacBio dataset was less than that in the de novo dataset, the full-length transcripts included longer coding transcripts (26,683 ORFs with average length of 1457 bp) while the average length of ORFs in assembled transcripts was only 661 bp in the previous RNA-seq study 12 of M. anguillicaudatus. The CEGMA/BUSCO completeness alignment of full-length transcripts indicated that our dataset contained expected core conserved genes, and indirectly revealing that only a minority of the genes was missing in the full-length transcripts.
Many transcriptome studies of M. anguillicaudatus based on Illumina platform have been reported 12,[18][19][20] . Compared with those studies, our study produced a more comprehensive transcriptome data set with several features. First, more accurate full-length (with 22,683 completed ORFs) transcripts were generated, providing valuable information for detection of gene structure and variations and abundant sequences that can be directly used in the genetic functional studies without additional PCR cloning. Second, a total of 1,208 lncRNAs of M. anguillicaudatus were identified, and these lncRNAs could provide a useful resource for investigating the potential functions of lncRNAs in M. anguillicaudatus although only a minority of lncRNAs was annotated with known RNA motifs in Rfam database. Third, the full-length transcripts produced in this study would be used as reference to improve genome assembly and gene annotation of M. anguillicaudatus. Finally, the reliable ORFs and proteome of M. anguillicaudatus were essential to the identification of orthologous genes, which contribute to better understanding of the evolutionary status and adaptive evolution of diploid and tetraploid M. anguillicaudatus.
Evolutionary status and positive selection in Misgurnus linages. In recent years, the extensive ploidy variability of M. anguillicaudatus is a hot research topic because it has several natural ploidy types in  [18][19][20] . It's quite urgent to unveil the evolutionary history and genomic signatures of polyploid loaches so that we could proceed into post-genome era of M. anguillicaudatus. In this study, we firstly investigated the evolutionary status of tetraploid M. anguillicaudatus in genus Misgurnus. The results exhibited that the tetraploid and diploid M. anguillicaudatus grouped into one clade, which coincides with the results inferred with mitochondrial genome 10 . M. bipartitus showed closer relationship with M. anguillicaudatus clade rather than M. mohoity, and the same topological relationships of three Misgurnus species were observed in the previous study 12 .
In genus Misgurnus, only M. anguillicaudatus is widely distributed in both north and south China, indicating that M. anguillicaudatus has strong adaptation to the various environments. The overall evolutionary rates of the Misgurnus lineages showed that tetraploid M. anguillicaudatus exhibited accelerated evolution compared with those of diploid M. anguillicaudatus and M. mohoity, indicating a trend of increase in the evolutionary pressure   in the tetraploid M. anguillicaudatus. The function classifications of PSGs identified from the M. anguillicaudatus clade revealed significant enrichment in biological processes, particularly metabolic process and cellular process, which were under positive selection. Notably, seven PSGs (i. e. ncl, srpk1a, bach1a, rpa1, zgpat, naf1, rad9a) of M. anguillicaudatus were involved in "nucleic acid metabolic process" (GO: 0090304), implying that this functional process has played an important role in the adaptation of M. anguillicaudatus. Previous study 12 found that the api5 gene, an anti-apoptotic factor that efficiently suppresses E2F1-induced apoptosis, was under positive selection in diploid M. anguillicaudatus linage. Likewise, the api5 was also under positive selection in tetraploid M. anguillicaudatus, suggesting that anti-apoptosis is critical for the adaptation to various habitat environments of M. anguillicaudatus. Meanwhile, the endothelial lipase (lipg), which is involved in lipid absorption, transport, and metabolism 24 , was positively selected in M. anguillicaudatus clade. Notably, Yi et al. 12 has reported that the genes (thrsp and srd5a2a) involved in lipid metabolism closely associated with temperature stress tolerance in fish 25,26 were under positive selection in diploid M. anguillicaudatus. It indicates that lipid metabolism also play important roles in adaptation to the complex environments in tetraploid M. anguillicaudatus. The garem and emp2, which positively regulate and promote cell proliferation and division 27,28 , were positively selected in the M. anguillicaudatus clade, and the similar results were also observed in the previous study 12 , suggesting cell proliferation also act as a critical factor in the environmental adaptation in M. anguillicaudatus. After anaesthetizing with MS-222 (100 mg/L), 12 tissues (i.e. blood, gill, skin, muscle, liver, spleen, intestine, ovary, testis, kidney, heart and brain) of a male and a female individual were immediately sampled separately. The total RNA of each sample was isolated using TRIzol reagent (TaKaRa, Dalian, China), and then genomic DNA was removed using gDNA eraser (TaKaRa, Dalian, China). RNA quality and concentration were determined using NanoDrop 2000 (Thermo Scientific, DE, USA) and Agilent Bioanalyser 2100 (Agilent Technologies, CA, USA), respectively. Equal amounts of the total RNA from each of the 12 tissues were pooled to generate one sample for library preparation. The first-strand cDNA was synthesized using SMARTer PCR cDNA Synthesis Kit (Clontech, CA, USA). After a round of PCR amplification, the amplified cDNA was size selected into different size fractions to prevent preferential small template sequencing, using the Blue Pippin (Sage Science; MA, USA). Two Iso-Seq libraries (1-5 kb and 4.5-10 kb) were constructed and sequenced on two cells with PacBio Sequel system (PacBio, CA, USA).

Methods
Data processing, isoform clustering. Long reads produced by the PacBio Sequel sequencer were processed with the PacBio IsoSeq pipeline (https://github.com/PacificBiosciences/IsoSeq_SA3nUP) to generate full-length refined consensus transcripts. The reads were filtered using the standard protocols in the SMRT Analysis software suite (https://www.pacb.com/support/software-downloads/), and reads of insert (ROIs) were classified to four categories: full-length non-chimeric, full-length chimeric, non-full-length (non-FL) and short reads. Only reads with two primer and poly-A tail would be classified to full-length in this study. Both full-length non-chimeric reads (FL) and non-full-length reads were used in further analysis. Full-length non-chimeric reads were clustered into consensus. For each cluster, if there were sufficient FL and non-FL coverage, then Quiver was run to polish the consensus. Depending on the Quiver output QV, the script bins the Quiver polished output into either "low QV (LQ)" or "high QV (HQ)". Only HQ consensus sequences were remained, and the consensus from two libraries were merged and removed the redundancies for the further analysis. Subsequently, we used CD-HIT-EST program 29 to identify the isoforms based on sequence similarity with parameter: -c 0.98 -T 6 -G 0 -aL 0.90 -AL 100 -aS 0.98 -AS 30.
Gene annotation and prediction of non-coding RNA identification. Functional annotations were conducted by using BLAST toolkit (E-value ≤ 1 × 10 −5 ) against different protein and nucleotide databases of Clusters of Orthologous Groups (COG), Kyoto Encyclopedia of Genes and Genomes (KEGG), a database of conserved Protein families or domains (Pfam), Swiss-prot (a manually annotated, non-redundant protein database), NR (NCBI non-redundant proteins) and nucleotide database (NT). Meanwhile, we used InterProScan5 30 to obtain the InterPro annotation. Blast2GO 31 was used to annotate with Gene Ontology (GO) based on the NR annotation. To obtain a high-confidence set of lncRNA genes, we identified lncRNAs from Iso-Seq dataset using a customized pipeline comprised of two steps: (1) removal of annotated protein coding transcripts by BLAST against multiple protein databases together with NCBI Nt database; (2) calculation the coding potential of each transcript using PLEK 32 (predictor of long non-coding RNAs and messenger RNAs based on an improved k-mer scheme), CNCI 33 (Coding-Non-Coding Index) and CPAT 34  . We used PRANK program 39 to conduct codon alignments for the one to one single copy orthologues with parameter "−iterate = 5". The regions of ambiguous alignments were excluded using Gblocks 40 (version 0.91b) with the parameter "−t = c". Then the alignments with >150 bp (50 codons) were remained for the evolutionary analyses. The phylogenetic trees were inferred from the concatenated sequences of the orthologous genes (OGs) using the ML method with 1000 bootstraps. We used RAxML 41 with GTRGAMMA substitution model to perform the ML analysis. The consensus phylogenetic tree was used as the guide tree in the evolutionary analysis. The codeml program implemented in PAML v4.9 42 with the free-ratio model (model = 1) was selected to calculate the substitution rates of each taxa. When N × d N or S × dS < 1 or dS > 1, the orthologues were discarded according to the criterion presented in previous study 43 . The positively selective genes (PSGs) were identified using branch site model following the method described by Yi et al. 12 . The significance of LRTs was assessed to discriminate between alternative models. Multiple testing correction was performed with the FDR method. The BEB estimates from each model were used to identify amino acid sites under positive selection when FDR-adjusted P value < 0.05. GO enrichment analysis was performed with DAVID Functional Annotation Tool 44 .