SMRT sequencing of full-length transcriptome of flea beetle Agasicles hygrophila (Selman and Vogt)

This study was aimed at generating the full-length transcriptome of flea beetle Agasicles hygrophila (Selman and Vogt) using single-molecule real-time (SMRT) sequencing. Four developmental stages of A. hygrophila, including eggs, larvae, pupae, and adults were harvested for isolating total RNA. The mixed samples were used for SMRT sequencing to generate the full-length transcriptome. Based on the obtained transcriptome data, alternative splicing event, simple sequence repeat (SSR) analysis, coding sequence prediction, transcript functional annotation, and lncRNA prediction were performed. Total 9.45 Gb of clean reads were generated, including 335,045 reads of insert (ROI) and 158,085 full-length non-chimeric (FLNC) reads. Transcript clustering analysis of FLNC reads identified 40,004 consensus isoforms, including 31,015 high-quality ones. After removing redundant reads, 28,982 transcripts were obtained. Total 145 alternative splicing events were predicted. Additionally, 12,753 SSRs and 16,205 coding sequences were identified based on SSR analysis. Furthermore, 24,031 transcripts were annotated in eight functional databases, and 4,198 lncRNAs were predicted. This is the first study to perform SMRT sequencing of the full-length transcriptome of A. hygrophila. The obtained transcriptome may facilitate further exploration of the genetic data of A. hygrophila and uncover the interactions between this insect and the ecosystem.

Alligator weed Alternanthera philoxeroides (Mart.) (Amaranthaceae) that originated from South America 1,2 , was introduced into China in the 1930 s. In China, A. philoxeroides is an important invasive species and has resulted in ecological and economic damage 3 . To control A. philoxeroides infestations, the flea beetle Agasicles hygrophila (Selman and Vogt) (Coleoptera: Chrysomelidae) was introduced as a biological control agent 4 . The use of A. hygrophila as a biological control of A. philoxeroides is acknowledged to be the world's first successful example of aquatic weed control 5 .
Study has reported that the physiological adaptations of biological control agents determine where and when they will be successful 6,7 . Host specificity and potential host shifting are considered as the most important factors for the evaluation of A. hygrophila in many countries where it is introduced for biological control of A. philoxeroides 8 . Due to the good performances in host selection and ecological adaptation, A. hygrophila has become a desirable model insect for the investigation of the relationships between insects and plants, as well as insects and ecosystem. Currently, the genome and transcriptome information of A. hygrophila has not been investigated, which hinders the study of the molecular mechanisms underlying the interaction between A. hygrophila and host plant and ecosystem.
Transcriptome could reflect the type and number of intracellular genes and reveal the physiological and biochemical processes at a molecular level 9 . Several technologies have been applied for transcriptome sequencing. Among these, short-read transcriptome sequencing has become a powerful tool for the description of gene expression levels 10,11 . However, most of these technologies are incapable of assembling full-length transcripts because of the shortness of sequencing reads, which necessitates efforts for exploring other technologies. Thus, RNA sample preparation. Total RNA samples (at four different developmental stages) were isolated using the RNeasy Plus Mini Kit (Qiagen, Valencia, CA, USA). RNA degradation and contamination were monitored using 1% agarose gels. The purity and concentration of RNA were measured using the NanoDrop ND-1000 spectrophotometer (NanoDrop Technologies, Rockland, DE, USA) with a OD 260 / 280 reading. The RNA integrity was assessed using the RNA Nano 6000 Assay Kit of the Agilent Bioanalyzer 2100 system (Agilent Technologies, CA, USA). The total RNA samples from four developmental stages were mixed together for the following experiments.
Library preparation and SMRT sequencing. mRNA was purified from 3 µg of mixed total RNA using poly (T) oligo-attached magnetic beads. Fragmentation was conducted using divalent cations under elevated temperatures in the NEBNext First Strand Synthesis Reaction Buffer (5×). The SMART PCR cDNA Synthesis Kit (Clontech, CA, USA) was used for synthesizing full-length cDNA. Remaining overhangs were converted into blunt ends via exonuclease/polymerase activities. After adenylation of the 3′ ends of the DNA fragments, NEBNext Adaptor with a hairpin loop structure was ligated to prepare for hybridization. BluePippin ® (Sage Science, Beverly, MA, USA) was used for size selection of the full-length cDNA and for building libraries of differently sized cDNA. The generated cDNA was then re-amplified using PCR, and the fragment size distribution was quantified using the Qubit fluorometer (Life Technologies, Carlsbad, CA, USA). The quality of the libraries was assessed using the Agilent Bioanalyzer 2100 system. SMRT sequencing was performed using the Pacific Biosciences' real-time sequencer using C2 sequencing reagents.
Next-generation sequencing. Total RNA (5 μg) was digested by using DNase I (NEB, Frankfurt, Germany). The sample was purified with Agencourt RNAClean XP Beads and fragmented into 130-170 nt. First-strand cDNA was generated by First Strand Master Mix and Super Script II reverse transcription (Invitrogen). Then second-strand cDNA was synthesized using Second Strand Master Mix. After end repairing, adding A and adaptor ligation, several rounds of PCR amplification with PCR Primer Cocktail and PCR Master Mix were performed to enrich the cDNA fragments. The final library is quantitated by using the Agilent 2100 bioanalyzer instrument, and real-time quantitative PCR. The qualified libraries was sequenced pair end on the Illumina HiSeq. 2000 System.
Preprocessing of SMRT reads. Raw SMRT sequencing reads were processed by removing polymerase reads that were <50 bp and had quality of <0.75, obtaining the clean reads. The obtained clean reads were processed into error-corrected reads of inserts (ROIs) with parameters of full passes of ≥0 and quality of >0.75. The ROI reads with both the 5′ and 3′ primer sequences and a poly(A) tail present were considered to be full-length transcripts. During the processes of library preparation, the chimeric sequences formed by the direct linkages of two cDNA template strands due to the low concentrations of adapter or SMRTbell are called artificial chimeric sequences. The non-chimeric sequences in the full-length sequence are called full-length non-chimeric (FLNC) sequences. The FLNC transcripts were determined by searching for the poly(A) tail signal and the 5′tail si cDNA primers in ROIs.
Iterative clustering for error correction (ICE) in SMRT analysis (v2.3.0) 12 was used to obtain consensus isoforms by approaching clustering, and the full-length consensus sequences from ICE were refined using Quiver. Full-length transcripts with post-correction accuracy of >99% (high-quality isoforms) were generated for further analysis. Any redundancy in high-quality full-length transcripts was removed by CD-HIT 14 .
Alternative splicing detection. RNA alternative splicing, occurring after a pre-mRNA transcript, is formed from template DNA, which results in a single gene coding for multiple proteins. During this process, particular exons of a gene may be included within or excluded from the final, processed mRNA produced from that gene 15 , which results in the proteins translated from alternatively spliced mRNAs containing differences in their amino acid sequence and biological functions. In this study, based on the obtained redundancy removed transcripts, we predicted the alternative splicing events. Briefly, all sequences were aligned to each other with BLAST 16 . The alignment results according with the following conditions were considered as alternative splicing events 17 : • Both sequences lengths were more than 1000 bp, besides there should be two High-scoring Segment Pairs in the alignment; • The alternative splicing Gap was greater than 100 bp, with at least 100 bp distance to the 3′/5′ end; • All alternatively spliced transcripts allowed 5 bp overlap.
Prediction of coding sequences. The coding sequences and corresponding amino acid sequences within the transcript sequences were predicted using TransDecoder (https://github.com/TransDecoder/TransDecoder/ releases). TransDecoder could identify candidate protein-coding regions based on nucleotide composition, open reading frame (ORF) length, log-likelihood score, and (optional) Pfam domain content 20 .
Functional annotation of transcripts. The obtained non-redundant transcript sequences were mapped to eight databases to obtain the annotation information of the transcript. These databases included NR 21 , Swiss-Prot 22 , Gene Ontology (GO; http://www.geneontology.org) 23 , Clusters of Orthologous Groups of proteins (COG; http://www.ncbi.nlm.nih.gov/COG) 24 , euKaryotic Ortholog Groups (KOG) 25  lncRNA prediction. The most widely used methods for analyzing coding potential are Coding Potential Calculator (CPC) 28 , Coding-Non-Coding Index (CNCI) 29 , Coding Potential Assessment Tool (CPAT) 21 , and pfam protein structure domain analysis. In this study, lncRNAs were predicted by screening the coding potential of transcripts using these four methods above.

Results
SMRT sequencing data output. Bases on the Pacific Biosciences' SMRT sequencing technology, 456,994 polymerase reads were generated. After preprocessing, 9.45 Gb of clean reads were obtained (Table 1). On the basis of the conditions of full passes of ≥0 and quality of >0.75, 335,045 ROIs were obtained (Table 2). In addition, 158,085 FLNC sequences were identified (Table 3).     (Fig. 2).

GO annotation.
The GO database is produced by the Gene Ontology Consortium and features a structured, precisely defined, common, controlled vocabulary for describing the roles of genes and gene products in any    organism. GO annotation system is a directed acyclic graph, including three categories: biological process (BP), molecular function (MF), and cellular component (CC). In this study, GO analysis revealed that the transcripts were enriched in several BP, MF, and CC associated terms (Fig. 3).

COG annotation.
The COG database is an attempt at phylogenetically classifying proteins encoded in 21 complete genomes of bacteria, archaea, and eukaryotes. The database can be used for the functional and phylogenetic annotation of newly sequenced genomes. This study also found that the number of transcripts that were enriched in function R was the most, followed by function J and function O (Fig. 4).
lncRNA prediction. The number of lncRNA transcripts, as predicted by CPC, CNCI, pfam protein structure domain analysis, and CPAT is shown in Fig. 5. In total, 4,198 lncRNA transcripts were predicted by all four methods.

Discussion
The methodological strengths of SMRT sequencing have been comprehensively investigated in human 13 , which is superior to methods of short read sequencing due to the advantage of obtaining full-length transcripts. Besides, it could be used for the analysis of alternative splicing events, and the primary-percusor-mature RNAs structures to help better understand the RNA processing. In this study, 9. Genomic sequencing clearly revealed that the great majority of genes specifying the core biological functions are shared by all eukaryotes 31 . The rational classification of proteins encoded in sequenced genomes is critical for maximizing the use of genome sequences for functional and evolutionary studies 24 . In this study, these transcripts were enriched in various subcategories such as metabolic process, cellular process, cell, cell part, binding, and catalytic activity in the three main categories BP, MF, and CC according to the GO annotation analysis. The results of COG annotation showed that the largest number of transcripts were enriched in the function of general function prediction only. The results suggested that the transcripts of A. hygrophila were associated with the abovementioned functions. LncRNAs, a novel class of nonprotein coding transcripts longer than 200 nt, are key regulatory molecules that can regulate gene expression at many different levels. Recently, increasing number of research has focused on the functions of lncRNAs in entomology, such as in Drosophila melanogaster, Plutella xylostella, and Nilaparvata lugens 32 , which provides a foundation for exploring the functions of lncRNA in insect development. This study identified 4,198 lncRNA transcripts with four analytical methods. However, their functions in A. hygrophila require further investigations.
In conclusion, our study, for the first time, completes SMRT sequencing of the full-length transcriptome of A. hygrophila. The obtained transcriptome may facilitate further studies on the genetic data of A. hygrophila and may help clarify the interactions between A. hygrophila and the ecosystem.