Characterization and analysis of full-length transcriptomes from two grasshoppers, Gomphocerus licenti and Mongolotettix japonicus

Acrididae are diverse in size, body shape, behavior, ecology and life history; widely distributed; easy to collect; and important to agriculture. They represent promising model candidates for functional genomics, but their extremely large genomes have hindered this research; establishing a reference transcriptome for a species is the primary means of obtaining genetic information. Here, two Acrididae species, Gomphocerus licenti and Mongolotettix japonicus, were selected for full-length (FL) PacBio transcriptome sequencing. For G. licenti and M. japonicus, respectively, 590,112 and 566,165 circular consensus sequences (CCS) were generated, which identified 458,131 and 428,979 full-length nonchimeric (FLNC) reads. After isoform-level clustering, next-generation sequencing (NGS) short sequences were used for error correction, and remove redundant sequences with CD-HIT, 17,970 and 16,766 unigenes were generated for G. licenti and M. japonicus. In addition, we obtained 17,495 and 16,373 coding sequences, 1,082 and 813 transcription factors, 11,840 and 10,814 simple sequence repeats, and 905 and 706 long noncoding RNAs by analyzing the transcriptomes of G. licenti and M. japonicus, respectively, and 15,803 and 14,846 unigenes were annotated in eight functional databases. This is the first study to sequence FL transcriptomes of G. licenti and M. japonicus, providing valuable genetic resources for further functional genomics research.


Results
the full-length transcriptome sequences of G. licenti and M. japonicus. To identify as many transcripts as possible, high-quality RNA samples were extracted from three adult females and three adult males of G. licenti and then pooled together in equal amounts for library preparation and sequencing. A total of 34.16 Gb clean data were generated from PacBio sequencing, which yielded 590,112 circular consensus sequence (CCS) reads with a mean length of 3,107 bp. By searching for the presence of poly-A tails and the 5′ and 3′ primers, 458,131 full-length nonchimeric (FLNC) reads and 131,027 non-full-length (NFL) reads were further identified from the CCS reads. After isoform-level clustering based on the iterative clustering for error correction (ICE) algorithm and polishing based on the Arrow algorithm, a total of 29,340 polished FL consensus isoforms with an average length of 2,995 bp were generated from the FLNC reads, including 28,736 high-quality (HQ; accuracy ratio > 99%) and 601 low-quality (LQ; accuracy ratio ≤ 99%) sequences (Table 1). Furthermore, 41.46 Gb clean reads were generated after adaptor sequence trimming and LQ read filtering in an Illumina platform (Additional file 1: table S1). These clean reads were subsequently used to correct the low-quality isoforms from PacBio sequencing by Proovread v2.12 (https ://githu b.com/BioIn f-Wuerz burg/proov read) 24 software (Additional file 1: table S2). After redundancy removal via the CD-HIT program and filtering reads less than 200 bp in length, the consensus isoforms were finally clustered into a total of 17,932 unigenes for subsequent analysis ( Table 1). The final unigene sequences have been deposited at DDBJ/EMBL/GenBank by Transcriptome Shotgun Assembly project under the accession GICA00000000. The version described in this paper is the first version, GICA01000000.
The same method was used for the FL transcriptome analysis of M. japonicus, for which all the results of PacBio sequencing are shown in Table 1, and the final unigene sequences are available in DDBJ/EMBL/GenBank under the accession GIBZ00000000. The version described in this paper is the first version, GIBZ01000000. The Illumina results of M. japonicus are presented in Additional file 1: Table S1.
To test the completeness of the two FL transcriptomes, Benchmarking Universal Single-Copy Orthologs (BUSCO) v3.0.2 with the Insecta odb9 database 25  In addition, we also mapped PacBio unigenes and Illumina unigenes to the genome sequences of L. migratoria using Minimap2 (v2.15; https ://githu b.com/lh3/minim ap2) 28 . The results showed that the mapping ratio of PacBio unigenes of G. licenti and M. japonicus was obviously higher than Illumina unigenes (Additional file 1: Simple sequence repeat and long noncoding RNA identification. The MIcroSAtellite identification tool (MISA) 30 was used for SSR analysis of the unigenes with lengths greater than 500 bp. For G. licenti, 17,566 unigenes were subjected to SSR analysis, and a total of 11,840 SSRs were identified in 6,436 unigenes, of which 2,785 unigenes contained more than one SSR and 2,467 SSRs were present in compound form (Additional file 2: table S10). For M. japonicus, a total of 10,814 SSRs were identified in 16,351 unigenes for SSR analysis, of which 2,517 contained more than one SSR and 1,963 existed in a compound form (Additional file 2: table S10). In both species, mono-, di-, and trinucleotide repeats were the most abundant SSRs, accounting for 35.3%,  LncRNA is another important component of the transcriptome. To identify lncRNAs in the PacBio data, four analysis methods, including Coding Potential Calculator (CPC) 31 , Coding-Non-Coding Index (CNCI) 27 , Coding Potential Assessment Tool (CPAT) 32 and Pfam protein structure domain analysis, were used. In total, 905 and 706 lncRNAs were predicted from G. licenti and M. japonicus, respectively, by all four methods (Fig. 4c,d). By filtering transcripts < 300 bp in length, 829 transcripts were confirmed as lncRNAs in G. licenti and 632 transcripts as lncRNAs in M. japonicus.
As one unigene could correspond to multiple GO categories, the annotated 8,065 unigenes of G. licenti were classified into at least one of the three main GO categories: 3,509 (43.51%) were classified as cellular component, 6,842 (84.84%) as molecular function and 5,193 (64.39%) as biological process (Fig. 7a and Additional file 2: table S17). A total of 7,637 M. japonicus unigenes were annotated to GO categories, and the largest category was molecular function (6,466, 84.67%), followed by biological process (5,082, 66.54%) and cellular component (3,365,44.06%) (Fig. 7b and Additional file 2: table S18). Notably, in the biological process category, we found that 210 genes of G. licenti and 231 genes of M. japonicus were related to reproduction, 178 genes of G. licenti and 188 genes of M. japonicus were related to reproductive process, 607 genes of G. licenti and 567 genes of M. japonicus were related to developmental process, and both species had 67 genes related to growth.

Discussion
A good foundation for taxonomic and phylogenetic research exists in Acrididae, especially given the publication, in recent years, of a large number of mitochondrial genome sequences, which provide abundant molecular resources for molecular phylogenetic study of the family 14,35-39 . At the same time, Acrididae are extremely diverse in terms of size, body shape, feeding biology, ecology, geographical distribution and life-history traits 12,40,41 , and thus, this family has high potential to provide model candidates for functional genomics research. Although Acrididae are favored by many researchers for these reasons [42][43][44][45] , the lack of genome and transcriptome information has seriously hindered functional genomics studies of Acrididae. With the development of NGS technology, short-read RNA-seq has been widely applied in Acrididae functional genetic studies, although the transcripts must be assembled from short reads with assemblers such as Trinity 26 , SOAPdenovo-Trans 46 or Trans-ABySS 47 . Due to the complex splicing mechanisms of eukaryotic cells, it is difficult to accurately reconstruct and reliably express transcriptomic isoforms with these short-read RNA-seq techniques. In contrast, PacBio sequencing can directly generate superlong transcripts with a short run time and without assembly, comprehensively solving these problems.
Here, we successfully performed the high-quality collection of transcripts from two Acrididae insects, G. licenti and M. japonicus, using PacBio sequencing. Compared with the de novo assembled unigenes obtained by RNA-seq, the length of the unigenes obtained by PacBio sequencing was greatly improved. This result shows that PacBio sequencing provides great advantages for obtaining complete transcripts and is beneficial for complex de novo transcriptome analyses. However, the drawbacks of PacBio sequencing were also obvious. Despite the vast amount of sequence data (G. licenti, 34.16 Gb clean data; M. japonicus, 34.55 Gb clean data), fewer unigenes were obtained by PacBio sequencing than by RNA-seq. This phenomenon has also been observed in other FL transcriptome studies 9,48,49 . One of the reasons could be attribute to the strict parameters setting in the process of FL transcript cluster.
As a key part of the transcriptional regulation system, TFs can affect the phenotypes of organisms by regulating downstream target genes and signaling networks. For example, the Hox protein Abdominal-A (Abd-A) can downregulate the expression of the cell cycle regulator gene Cyclin E (CycE) and the wing disc cuticle protein gene WCP4, which plays an important role in insect embryo development, somite differentiation and proleg formation 50,51 . Using the present data, 1,082 putative TFs from 41 TF gene families were identified in G. licenti, and 813 putative TFs from 39 TF gene families were identified in M. japonicus. www.nature.com/scientificreports/ SSR markers have the characteristics of codominance, high reproducibility, abundant polymorphisms and easy detection, and they have therefore become an important genetic marker technology. SSR markers are often used to study the genetic differentiation, genetic diversity and population structure of species, which is of great significance for the protection of species, especially endangered species [52][53][54] . Here, we detected 11,840 and 10,814 SSRs from G. licenti and M. japonicus, respectively. These SSRs will serve as useful tools for analyzing genetic diversity, constructing genetic maps and investigating the population structures of G. licenti and M. japonicus.
LncRNAs are a group of RNA molecules with highly conserved secondary and tertiary structures, and their transcripts are generally longer than 200 nt. LncRNA can not only control gene expression through transcriptional and posttranscriptional regulation but also exert powerful biological functions by affecting protein localization and telomere replication. In recent years, a large number of lncRNAs have been identified from Drosophila melanogaster 55 , Bombyx mori 56 , Plutella xylostella 57 and other insects, laying an important foundation for further studies of the functions of lncRNA in insect growth and development. To date, researchers studying lncRNAs in D. melanogaster have confirmed that lncRNAs can participate in many biological processes, such as sex determination, male courtship and X chromosome inactivation [58][59][60][61] . In this study, we predicted 829 lncRNAs in G. licenti and 706 lncRNAs in M. japonicus with PacBio sequencing, laying a foundation for the next step in studying the biological functions of lncRNAs in these two insects.
Our gene annotation results clearly showed that a large number of unigenes from G. licenti and M. japonicus can be classified for functional research. The NR annotation revealed that the species with the largest distribution of unigenes in G. licenti and M. japonicus was Z. nevadensis rather than more closely related L. migratoria, which may be caused by the incomplete genome assembly of L. migratoria 17 . But compared with published RNA-seq of the Acrididae species, unigenes obtained by PacBio sequencing has a higher proportion of annotation on the genomes of L. migratoria 18,19,62 . The KEGG analysis showed that 209 and 211 KEGG pathways were successfully annotated in G. licenti and M. japonicus, respectively. And among which Wnt, Notch, Hedgehog, Hippo and other pathways played critical roles in insect growth and development, for example, the Wnt signaling pathway has been shown to be essential for both embryogenesis and organogenesis, such as controlling axis elongation and leg development of one short-germ insect 63 , and the Hippo signaling pathway is an evolutionally-conserved signaling cascade that plays a role in controlling organ size during animal development, such as controlling the size of silkworm wings 64 . GO annotations are classified into three main categories: biological process, cellular component and molecular function. In this study, 8,065 and 7,637 unigenes of G. licenti and M. japonicus were respectively enriched in these three main categories. Among them, in the biological process category, 210, 178, 607 and 67 unigenes in G. licenti and 231, 188, 567 and 67 unigenes in M. japonicus were classified into the reproduction, reproductive process, developmental process, and growth subcategories, respectively. These genes provide important information for the study of growth and development, sex determination, fertilization, oviposition and other related processes in G. licenti and M. japonicus.
In conclusion, PacBio sequencing was performed on G. licenti and M. japonicus, and we provide here the first report of FL transcriptomes in Acrididae. The obtained FL transcriptome datasets enrich the data resources of G. licenti and M. japonicus and will provide support for future research on their functional genomics.

Methods
Sample collection and RnA preparation. Three adult females and three adult males of G. licenti and M. japonicus were collected from Yan'an City of Shaanxi Province, China. The whole body, except the gut, was immediately collected and stored in liquid nitrogen. According to the manufacturer's instructions, total RNA was extracted from each individual using TRIzol reagent (Invitrogen, Carlsbad, CA, USA), and RNA degradation and contamination were detected with 1% agarose gels. The integrity and purity of RNA were assessed by the Agilent 2,100 Bioanalyzer (Agilent Technologies, CA, USA) and NanoDrop 2000 (Thermo Scientific, Wilmington, DE, USA). Only RNAs with an RNA integrity number (RIN) score > 8.0 and 1.8 < OD260/280 < 2.2 were used for the preparation and construction pf PacBio and Illumina libraries.
Library construction and sequencing. To construct the PacBio sequencing library, eligible RNAs from each individual were mixed in equal amounts and reverse-transcribed into FL cDNA using the SMARTer PCR cDNA Synthesis Kit (Clontech, CA, USA). The KAPA HiFi PCR Kits were used to amplify the FL cDNA by PCR, and the BluePippin Size Selection system (Sage Science, USA) was used to select the PCR products. The cDNA products with lengths of 1-6 kb were finally retained. After repairing the ends of the FL cDNA and connecting the SMRT dumbbell-type connector, the SMRTbell Template libraries were constructed using the SMRTbell Template Prep Kit. Agilent 2,100 Bioanalyzer and Qubit 2.0 (Life Technologies, Carlsbad, CA, USA) were used to evaluate the concentration and quality of these libraries. Finally, qualified libraries were sequenced using the PacBio Sequel platform (Pacific Biosciences, Menlo Park, CA, USA). Raw PacBio sequencing reads were stored in the Short Read Archive (SRA) of NCBI with accession number SRR10420895 for G. licenti and SRR10420906 for M. japonicus.
Six separate Illumina libraries were constructed for G. licenti and M. japonicus using the protocol of the Gene Expression Sample Prep Kit (Illumina, San Diego, CA, USA). Briefly, polyadenylated mRNA was isolated from total RNA using Oligo (dT) magnetic beads and then fragmented randomly with Fragmentation Buffer. The first-strand cDNA was synthesized with random hexamer primers using the fragmented mRNA as a template, and the second-strand cDNA was synthesized with DNA polymerase I (New England Biolabs) and RNase H (Invitrogen). After end repair, A-tail, adaptor ligation and purification with AMPure XP beads, PCR amplification was conducted. Finally, the six libraries were paired-end sequenced at 150 bp on Illumina HiSeq X Ten platform. Clean Illumina reads were produced after removing low-quality reads and adaptor reads, and then all clean reads from the same species were merged together for de novo assembled by using Trinity v2. 5 pacBio read error correction. The SMRT Link 5.1 pipeline from Pacific Biosciences 65 was used for PacBio data processing. Briefly, CCS reads were extracted from raw reads with minFullPass = 1 and minPredictedAccuracy = 0.80. After discarding CCS reads with lengths shorter than 50 bp, the retained CCS reads were classified into FLNC and NFL transcripts according to whether they simultaneously contained 5′ primers, 3′ primers and poly-A tails. The ICE algorithm was used to cluster FLNC sequences to obtain consensus isoforms, and then, Arrow software (https ://www.pacb.com/wp-conte nt/uploa ds/SMRT_Tools _Refer ence_Guide _v600.pdf) was used to refine the consensus isoforms using the NFL to obtain polished consensus sequences. All polished consensus sequences were corrected using the Illumina RNA-seq short reads with the software Proovread v2.12 24 with the default settings. To obtain the final transcriptome isoform sequences, the redundant sequences were removed by the CD-HIT software 66 , and the corrected consensus sequences were further screened. Due to the possible degradation of 5′ end sequences, isoforms from the same transcript were divided into different clusters, resulting in redundant sequences. Therefore, CD-HIT software was used to remove redundant sequences from the transcriptome isoform sequences and obtain the unigene sequences. Finally, BUSCO software 25 with the insect lineage database (insecta_orthoDB9, created 13/02/2016) was used to assess the completeness of the unigenes.
coding sequences and transcription factor prediction. The coding sequence and corresponding protein sequence of the unigenes were predicted by the TransDecoder v5.0.1 package based on the ORF length, loglikelihood score and Pfam database protein domain sequences. The TFs were predicted from protein sequences by a prediction tool in an animal transcription factor database (AnimalTFDB) 67 .
Simple sequence repeat prediction. An SSR is a set of repetitive DNA sequence with lengths varies from 2 base pairs to 13 base pairs and some motifs repeated 5-50 times, also known as a microsatellite 68 . MISA (v1.0; https ://webbl ast.ipk-gater slebe n.de/misa/) 30 is a software used to identify SSRs, and through the analysis of transcript sequences, seven kinds of SSR can be identified: mononucleotides, dinucleotides, trinucleotides, tetranucleotides, pentanucleotides, hexanucleotides and compound SSRs. Only transcripts > 500 bp in length were subjected to SSR detection.
Long noncoding RnA prediction. LncRNA is an important component of the transcriptome. Because lncRNA does not encode proteins, lncRNA can be obtained by screening transcripts for coding potential, judging whether they have coding potential or not, and filtering out the transcripts with coding potential. In this study, lncRNAs were predicted by screening the coding potential of transcripts using CPC 31 , CNCI 27 , CPAT 32 and Pfam protein structure domain analysis, and select the intersection of four results as the set of lncRNAs.
Identification of alternative splicing transcript isoforms. Due to the lack of annotated reference genome in G. licenti and M. japonicus, we used the all-vs-all BLAST method with high identity settings described by Liu et al. to de novo detecting AS transcript isoforms 69 . If the BLAST results meet the following conditions, it is considered as a candidate AS events: (a) the length of both sequences exceeded 1,000 bp, and the alignment contained two high-scoring Segment Pairs (HSPs); (b) the gap of AS exceeded 100 bp, and was located at least 100 bp from the 3′/5′ end; and (c) allowed a 5 bp overlap for all alternative scripts.