Comprehensive transcriptome characterization of Grus japonensis using PacBio SMRT and Illumina sequencing

The red-crowned crane (Grus japonensis) is an endangered species distributed across southeast Russia, northeast China, Korea, and Japan. Here, we sequenced for the first time the full-length unreferenced transcriptome of red-crowned crane mixed samples using a PacBio Sequel platform. A total of 359,136 circular consensus sequences (CCS) were obtained via clustering to remove redundancy. A total of 303,544 full-length non-chimeric sequences were identified by judging whether CCS contained 5′ and 3′ adapters, and the poly(A) tail. Eight samples were sequenced using Illumina, and PacBio sequencing data were corrected according to the collected Illumina data to obtain more accurate full-length transcripts. A total of 4,100 long non-coding RNAs, 13,115 simple sequences repeat loci and 29 transcription factor families were identified. The expression of lncRNAs and TFs in pancreas was lowest comparing with other tissues. Many enriched immune-related transmission pathways (MHC and IL receptors) were identified in the spleen. This study will contribute to a better understanding of the gene structure and post-transcriptional regulatory network, and provide references for future studies on red-crowned cranes.

The red-crowned crane is a large wading bird that is mainly distributed in the southeast of Russia, northeast of China, Korea, and Japan [1][2][3][4] . The population of the red-crowned crane is split into continental and island populations, with continental groups being migratory birds and the island groups being resident birds 5 . In recent years, due to the intensification of human activities and reclaimed wetland, the breeding and wintering habitats of the migratory populations have experienced double loss of function and area [6][7][8] . Hence, the range of activity of this species in China has been continuously reduced and its population has started to decline significantly, with the number of red-crowned cranes observed in Winter being only 8% of that reported in the 1980s 5 , making it an endangered species, as declared by the International Union for Conservation of Nature in 2012 9 . Despite the awareness that the red-crowned crane is on the brink of extinction and the implementation of several protective measures, such as with the establishment of legislation and key conservation areas, the population of the migratory species continues to decline 2,10,11 . To protect red-crowned crane, many studies about its population dynamics, feeding habits, habitat protection, disease control, overwintering ecology and reproductive behavior have been conducted 7,[12][13][14][15][16] , including molecular studies mainly focused on marker development and genetic diversity analysis. Several markers were identified and characterized, such as simple sequence repeats (SSRs) and single nucleotide polymorphisms (SNPs) 11,17,18 and genetic diversity and population structure of the red-crowned crane populations were estimated using various markers, such as major histocompatibility complex (MHC), SSRs, and mitochondrial genes [19][20][21][22][23][24][25] . However, only two studies involving transcriptome and genome of the red-crowned crane have been performed, which included the analysis of the expression profiles of cytochrome P450 (CYP) 1-3 genes in red-crowned crane tissues by Illumina HiSeq 2500 and analysis of the genomic relationships between the crane and other birds by Illumina HiSeq 2000 26,27 .
Although second-generation sequencing is one of the most widely employed sequencing technologies, this is less effective in assembling the full-length transcripts for the limitation of short read length, and thus restricts the yield of full-length genes [28][29][30] . However, the third-generation sequencing technology represented by Single Molecule Real-Time (SMRT) effectively overcomes this limitation 31 . SMRT sequencing enables direct acquisition of the full-length transcriptome sequence of functional genes with an accuracy as high as 99.999% 32 [37][38][39] . Hence, SMRT sequencing technology provides an important method for understanding gene structure and transcription network of endangered animals 34,35 . This is of great significance for studying the regulatory mechanism of endangered animals and the differential expression of environmental impact genes. In this study, the full-length transcripts of red-crowned crane were obtained for the first time through PacBio SMRT sequencing and the expression landscapes of eight different tissues (brain, muscle, pancreas, heart, kidney, liver, lung, and spleen) were further evaluated using Illumina NovaSeq 6000 sequencing. In addition, detailed transcriptome analysis presented this novel information in light of transcript diversity, comprising information on full-length transcripts, transcription factors (TFs), long non-coding RNAs (lncRNAs), novel genes, SSRs, and immune-related signaling pathways. Therefore, the collected data provides a reference for understanding the gene structure and post-transcriptional regulatory network of red-crowned crane, and may lay the foundation for future molecular research related to red-crowned crane, and contribute to the protection of this endangered species.

Results and discussion
High quality non-redundant full-length transcriptome. It has been reported that SMRT Sequel has the advantage of ultra-long reading length, including the 5'-and 3'-untranslated regions (UTRs) and poly(A) tail of the transcript, which provides important information for the stability and translation of mRNA 40 . Herein, we used the PacBio sequence platform to sequence mixed samples of red-crowned crane and obtained a total of 24.54 G subreads base with an average length of 2,335 bp. The subread BAM of the offline data was self-corrected to obtain 359,136 CCS with an average length of 2,779 bp. Moreover, the hierarchical n*log(n) algorithm was used to cluster the full-length non-chimeric sequences of the same transcript to obtain the consensus sequence, which was further polished by arrow software to obtain 23,452 polished consensus. In addition, eight tissue samples (brain, heart, kidney, liver, lung, muscle, pancreas, and spleen) were analyzed by Illumina technology, from which a total of 181,391,092 raw reads were obtained. After filtering the raw data, we obtained 172,306,116 high quality clean reads (Table 1). To more accurately identify the full-length transcriptome, the third-generation data was corrected based on the second-generation data with high accuracy using LoRDEC 0.7 software 41 . Moreover, the CD-HIT software was used to align and cluster the 23,452 polished consensus sequences and remove redundant sequences, providing a total of 15,126 unigenes with an average length of 2,879 bp. The number of genes with different transcript length in the interval (< 0.5, 0.5-1, 1-2, 2-3, and > 3 kb) is shown in Table 2. Overall, 22,994 gene subtypes were identified, with most genes having only one isoform. In addition, 15,126 unigenes obtained after CD-HIT software processing were used as the reference sequence, and the clean reads of each tissue sequenced by Illumina were aligned to the reference sequence using RSEM. We found that the mapping rate between the pancreas and the reference sequence was up to 86.85%, and the minimum mapping rate between the brain and the reference sequence was only 56.36% (Table 3).
Compared with the traditional second-generation sequencing, the PacBio SMRT technology used in this study significantly improved the reading ability, not only providing the full-length transcripts, but also identifying lncRNAs, SSRs, and TF families. Regardless of whether it was a coding or non-coding gene, the prevalence of long transcripts was higher than previously predicted. We could predict the gene function of the obtained long transcripts and obtain more accurate information on red-crowned crane interleukin (IL) and MHC receptor families, which can provide additional insights into the immune function. Our sequencing research has largely Table 1. Summary of transcriptome sequencing data obtained using Illumina technology.

Sample
Raw reads Clean reads Clean bases (G) Error (%) Q20 (%) Q30 (%) GC (%)  1a). To further elucidate the main biological functions of red-crowned crane unigenes, GO, KOG, and KEGG pathways analyses were performed. A total of 10,739 unigenes were annotated using the GO database, which were further divided into 54 major functional groups according to biological processes, cellular component, and molecular function. We found that the main subgroups of biological processes were "cellular process" (GO: 0,009,987, 4,951) and "metabolic process" (GO: 0,008,152, 4,465). In the classification of cellular components, the predominant portion of transcripts represented "cell" (GO: 0,005,623, 2,584) and "cell part" (GO: 0,044,464, 2,587). The main categories in the classification of molecular functions were "binding" (GO: 0,005,488, 6840) and "catalytic activity" (GO: 0,003,824, 4149) (Fig. 1b). In addition, a total of 11,167 unigenes were subdivided into KOGs, among which "General function prediction only" (2,028, 16.18%) was the largest group, followed by "signal transduction mechanisms" (1,875, 14.96%), "posttranslational modification, protein turnover, chaperones" (1,084, 8.65%) (Fig. 1c). To explore the biological functions and interactions of the identified genes, we searched 34,350 redcrowned crane genes in the KEGG database. A total of 14,237 unigenes were found to match within the database information, and they were assigned to KEGG pathways, which were divided into six sub-categories, i.e., Cellular Processes (2,090), Environmental Information Processing (2,062), Genetic Information Processing (1,448), Human Diseases (4,349), Metabolism (3,010), and Organismal Systems (3,598) (Fig. 1d). These pathways in the KEGG pathway were closely related to normal life activities. "Human Diseases" pathway was annotated by enriched analysis, with the largest group of unigenes being signal transduction belonging to Human Diseases. It should be noted that the samples herein analyzed were collected from a dead red-crowned crane; thus, the main signal pathway of red-crowned cranes at the time of death may be mainly related to " Human Diseases ".
Analysis of other genetic features. TFs were identified using animalTFDB 2.0 and a total of 29 families were identified (Fig. 2a). ZBTB (124) was the most abundant, followed by zf-C2H2 (123). According to previous studies, TFs of these TF families can directly bind to DNA and control gene expression nearby, thereby regulating gene transcription and playing an important role in biological processes, such as cell proliferation, differentiation, apoptosis, and functional regulation 42,43 . Therefore, the highest expression level of these TF families may be explained by the origin of the samples used in this study, specifically dead red-crowned cranes. In addition, the expression of TFs in the pancreas was lowest, and differed sharply from other tissues (Fig. S1).
LncRNAs play an important role in epigenetics, transcriptional, and post-transcriptional regulation of gene expression. Increased expression of lncRNAs has been found to play an important role in cell physiological activities, as well as in cancer and other diseases. In this study, we obtained 4,100 lncRNAs by coding prediction using CNCI, CPC, Pfam, and PLEK databases, with an average length lower than that of mRNA sequences (Fig. 2b). Furthermore, the expression of lncRNAs was similar to TFs. The expression of lncRNAs in pancreas was lowest, and differed obviously from other tissues (Fig. S2). The herein provided updated high-quality transcriptome of the red-crowned crane contributes to enhanced gene annotation and lncRNA analysis of this species.
Furthermore, a total of 13,115 SSRs loci were identified by PacBio sequencing, the single nucleotide repeat motif was the dominant type, accounting for 76.35% of the total SSRs, followed by two and three nucleotide repeat motifs, with frequencies of 7.05% and 14.27%, respectively ( Table 4). The information regarding the SSR loci provides more abundant molecular markers for genetic diversity analysis and map construction of red-crowned crane. These sequencing results make the transcript dataset of red-crowned cranes more complete, which is conducive to providing a reference basis for accurate annotation research of the red-crowned crane in the future. www.nature.com/scientificreports/  www.nature.com/scientificreports/ Differential gene expression in different tissues. Next, the gene expression profile of different tissues of the red-crowned crane was analyzed and compared. The CD-HIT software was used to remove the redundancy of the corrected consensus sequence, and the unigenes were obtained as the reference sequence. Then, the clean reads of each sample sequenced by Illumina were aligned to the reference sequence by using the comparison software bowtie2 in RSEM, with the highest mapping rate of pancrease being 86.85% (Table 3). Based on gene expression criteria of FPKM > 0.1. We counted the number of genes under different expression conditions and found that the genes with higher expression were the least in pancrease. This may indicate that the gene expression profile of the pancreas is less affected by external factors (Fig. 3a). Principal component analysis (PCA) of the full-length transcriptome of eight different tissues of red-crowned crane revealed that the difference between the lung and spleen was the smallest, whereas that between the liver and brain was the largest (Fig. 3b). We analyzed the differential gene expression of the top 20 genes in the different tissues and found that the pancreas had the lowest expression profile (Fig. 3c). To obtain more accurate data, we analyzed the expression patterns of 17,535 genes and drew a volcano map comparing differential gene expression between two tissues. The largest difference was between the liver and the brain with 5,494 differentially expressed genes (DEGs), of which 1,970 were upregulated and 3,524 were downregulated. The smallest difference was between the lung and spleen with 2,771 DEGs, of which 1,512 were upregulated and 1,259 were downregulated. Random comparison of the remaining tissues showed 4,102 DEGs between the heart and pancreas, of which 3,407 were upregulated and 695 were downregulated. Moreover, a total of 5,168 DEGs were identified between the kidneys and muscles, including 3,043 upregulated and 2,125 downregulated sequences (Fig. 3d).  www.nature.com/scientificreports/

Expression of immune-related IL and MHC receptors.
To obtain relevant information about the immune system of the red-crowned crane in its infancy, we used the Swiss-Prot database to predict the function of the full-length transcriptome landscape identified for the different tissues. The expression of immune-related genes was analyzed by heatmap, revealing that the spleen was the tissue with the most immune expression (Fig. 4a). Moreover, many IL and MHC receptor families were identified, indicating the innate and adaptive immunity 44 . Among them, 13 types of IL receptors (IL-1R, IL-3R, IL-4R, IL-6R, IL-7R, IL-8R, IL-10R, IL-11R,  IL-13R, IL-17R, IL-22R, IL-34R, and IL-36R) and two MHC receptors (MHC I and MHC II) were identified. The IL receptor family was expressed mostly in the spleen, whereas was the lowest in the pancreas. In the spleen, the expression of IL-8R was the highest and that of IL-36R was the lowest (Fig. 4b). The MHC I and MHCII receptors family was expressed mostly in the lung, whereas was the lowest in the muscle. The number of MHC II receptors in the spleen was much higher than that of MHC I receptors, as the spleen is more sensitive to exogenous antigens (Fig. 4c). Through the expression of IL and MHC receptor families in different tissues, we identified that the spleen was the main immune organ in the red-crowned crane that died due to a leg bone fracture. As red-crowned cranes are endangered wild animals, we only use random individuals that died of disease for analysis; thus, we could not repeat our experiments to further confirm the obtained results. Most full-length transcripts were found to be enriched in immune-related signaling pathways, as expected in a sick state 30,36,[45][46][47] . Accordingly, the immune expression of the spleen was found to be the highest, suggesting that the spleen is the main immunity organ of the red-crowned cranes beyond all doubt.

Conclusion
In summary, we reported the first full-length transcriptome for critically endangered Grus japonensis using PacBio SMRT and Illumina sequencing technology. In this study, we used full-length transcriptome sequencing to present a new set of transcriptomic data comprising 13,115 SSRs, 4,100 lncRNAs, and 29 TFs. The purpose of our research was to obtain a more complete and accurate transcriptome dataset of red-crowned crane through three-generation sequencing technology, and then discuss the relevant tissues involved in immunity, to provide a basis for the research and protection of this endangered species. Through functional annotation and structural analysis of the transcriptome, many enriched immune-related signaling pathways were found. Our sequencing data of the full-length transcriptome of a juvenile red-crowned crane provides valuable information to close the knowledge gap in the transcriptome for this species, thereby providing a reference on new protein-coding genes and transcript subtypes. In conclusion, this study provides enhanced transcriptome information, improves our understanding regarding the gene structure and post-transcriptional regulatory network, and provides a reference for future studies on red-crowned cranes. www.nature.com/scientificreports/

Methods
Sample collection. The samples used in this study were collected from a red-crowned crane that died due to a leg bone fracture in the Hongshan Forest Zoo of Nanjing, and related scientific research was conducted with the relevant permission from the zoo. To maintain the original state of the samples, they were quickly stored at -80 °C. Brain, muscle, pancreas, heart, kidney, liver, lung, and spleen tissue samples were collected. PacBio SMRT was used to sequence the full-length transcriptome of a mixed sample, and each of the eight tissues were sequenced separately using Illumina sequencing platform NovaSeq 6000. The overall experimental flow chart was drawn by Adobe Photoshop v22.0.0 and shown in (Fig. 5). The project was approved by the Animal Ethics Committee of Nanjing Forestry University and the Nanjing Hongshan Forest Zoo.
RNA extraction and quality analysis. Trizol reagent was used to obtain total RNA from samples. To ensure that the database data of library construction was of high quality, RNA purity was detected by NanoPhotometer spectrophotometer (Implen, Westlake Village, CA, USA). The Qubit 2.0 Fluorometer (Life Technologies, Waltham, MA, USA) was used to accurately quantify the RNA concentration. The RNA Nano 6000 Assay Kit of the Bioanalyzer 2100 system (Agilent Technologies, Santa Clara, CA, USA) was used to accurately investigate RNA integrity. The results showed that RNA samples could be used for library construction.  www.nature.com/scientificreports/ Transcriptome sequencing library construction. mRNA samples were purified using magnetic beads with Oligo (dT), the first cDNA was synthesized by reverse transcription of mRNA using the SMARTer PCR cDNA synthesis Kit (Pacific Biosciences, Menlo Park, CA, USA). The synthesized cDNA was enriched by PCR amplification, the amplified cDNA fragments were divided into different size fractions using the Blue Pippin size selection system (Pacific Biosciences). After obtaining suitable fragments, an additional PCR was performed to obtain enough cDNA. Then, full-length cDNA for damage repair, end repair, and connection of SMRT dumbbell joints. Finally, exonuclease digestion was performed. The library was qualified and sequenced using PacBio Sequel platform.
Data processing. We used PacBio Sequel to sequence the mixed samples. The PacBio official software package SMRTlink 7.0 (Parameter: minLength = 50, maxLength = 15,000, minpasses 1) was used to process the original Iso-Seq offline data. CCSs were generated from subread BAM files (parameters: min_length, 50; max_drop_ fraction, 0.8; no_polish, TRUE; min_zscore, − 9999.0; min_passes, 1; min_predicted_accuracy, 0.8; max_length, 15,000) 46,48 . According to sequence, whether it contained 5 ' and 3' adapters and poly(A) tail, the sequence was divided into full-length or non-full length sequences. The hierarchical n*log(n) algorithm was used to cluster the full-length sequences and obtain a cluster consensus sequence, and then the final arrow polishing (parameters: hq_quiver_min_accuracy, 0.99; bin_by_primer, false; bin_size_kb, 1; qv_trim_5p, 100; qv_trim_3p, 30) was performed 46,49 . The obtained consensus sequences were calibrated with the non-full-length sequences to obtain high-quality sequences for subsequent analysis. The LoRDEC 0.7 software was used to correct with high accuracy the third-generation data based on the second-generation data 41 . Then, CD-HIT (-c 0.95-T 6-G 0-aL 0.00-aS 0.99) software was used for sequence alignment clustering. Redundant sequences were removed based on 95% similarity.
Differential expression analysis. The corrected consensus sequence was further filtered using CD-HIT, and the obtained unigenes were used as the reference sequence. The clean reads of each tissue sequenced by Illumina were aligned to the reference sequence using RSEM v1.3.0 (-phred33; -quals; -forward-prob 0.5; -time) 50 . Gene expression was estimated using RSEM for the eight tissue samples obtained from Grus japonensis. The expression profiles of the unigenes in different cDNA libraries were detected in terms of fragments per kilobase of transcript per million mapped reads (FPKM). We used RSEM software to calculate the mapping results, and further obtain the read counts value of each sample mapped to each gene and perform FPKM conversion. The standard of gene expression was set as FPKM > 0.1. R software was used for performing the PCA of the fulllength transcriptome. The top 20 DEGs among the eight tissues were screened by coefficient of variation, using NovoMagic v3.0 to draw the heatmaps. In addition, we also did cluster analyses on the expression of lncRNAs and TFs in different tissues (log 2 transformation for IncRNAs and log 10 transformation for TFs). www.nature.com/scientificreports/