Full-length SMRT transcriptome sequencing and microsatellite characterization in Paulownia catalpifolia

Paulownia catalpifolia is an important, fast-growing timber species known for its high density, color and texture. However, few transcriptomic and genetic studies have been conducted in P. catalpifolia. In this study, single-molecule real-time sequencing technology was applied to obtain the full-length transcriptome of P. catalpifolia leaves treated with varying degrees of drought stress. The sequencing data were then used to search for microsatellites, or simple sequence repeats (SSRs). A total of 28.83 Gb data were generated, 25,969 high-quality (HQ) transcripts with an average length of 1624 bp were acquired after removing the redundant reads, and 25,602 HQ transcripts (98.59%) were annotated using public databases. Among the HQ transcripts, 16,722 intact coding sequences, 149 long non-coding RNAs and 179 alternative splicing events were predicted, respectively. A total of 7367 SSR loci were distributed throughout 6293 HQ transcripts, of which 763 complex SSRs and 6604 complete SSRs. The SSR appearance frequency was 28.37%, and the average distribution distance was 5.59 kb. Among the 6604 complete SSR loci, 1–3 nucleotide repeats were dominant, occupying 97.85% of the total SSR loci, of which mono-, di- and tri-nucleotide repeats were 44.68%, 33.86% and 19.31%, respectively. We detected 112 repeat motifs, of which A/T (42.64%), AG/CT (12.22%), GA/TC (9.63%), GAA/TTC (1.57%) and CCA/TGG (1.54%) were most common in mono-, di- and tri-nucleotide repeats, respectively. The length of the repeat SSR motifs was 10–88 bp, and 4997 (75.67%) were ≤ 20 bp. This study provides a novel full-length transcriptome reference for P. catalpifolia and will facilitate the identification of germplasm resources and breeding of new drought-resistant P. catalpifolia varieties.

www.nature.com/scientificreports/ conserved within and among genera 10 . Depending on their origin, SSR markers can be categorized as genomic SSRs or expressed sequence tag (EST) SSRs. EST-SSR markers are easier to obtain for a large number of plants that have no reference genome, although the polymorphism of EST-SSR is lower than that of genomic SSR markers. As functional molecular markers, EST-SSRs are more conserved, better universality, lower cost and more interspecific transferability 11,12 . Moreover, EST-SSR polymorphisms may be directly related to gene function 13 and can be used for researches of other related species 14,15 . In recent years, EST-SSR markers have been developed and applied in various tree species, including Eucalyptus globulus 16 , Euphrates Poplar 17 , the rubber tree 18 , Robinia pseudoacacia 19 , Fraxinus velutina 20 , and Pinus koraiensis 21 .
Single-molecule real-time (SMRT) sequencing technology (Pacific Biosciences), also known as third generation sequencing technology, can efficiently and accurately obtain high-quality (HQ), long and intact transcripts containing 5′-and 3′-untranslated regions and polyadenosine tails without assembly 22,23 . SMRT sequencing can be used to accurately identify features such as fusion genes, gene families, long non-coding RNAs (lncRNAs) and alternative splicing (AS) events 24,25 . SMRT sequencing technology is a reliable method for obtaining full-length transcripts that can be used to study the transcriptomes of non-model plants which lack reference genomes, such as Paulownia and Chinese catalpa. SMRT sequencing technology has been successfully applied to full-length transcriptome sequencing studies in animals, plants and insects [26][27][28] . Furthermore, full-length transcriptome sequences obtained using SMRT sequencing contain numerous EST SSRs 29,30 , which can be used for genetic analyses of the sequenced species and their related species, as well as for studies of conservation biology and molecular assisted breeding 23,31,32 . To the best of our knowledge, no full-length transcriptome sequence of P. catalpifolia has been reported.
In this study, we performed a full-length transcriptomic analysis of mixed P. catalpifolia leaves treated with varying degrees of drought stress using SMRT sequencing. We then performed function annotation analyses using publicly available databases and used various bioinformatics software to predict AS, lncRNAs and SSRs and to further analyze SSRs characteristics deeply. In the absence of Paulownia reference genome, the full-length transcriptome sequence acquired in our study not only can be used as a reference sequence for transcriptome sequencing, but also will support further genetic analyses in Paulownia species. In addition, the SSRs predicted in our study will facilitate the development of drought-resistant SSR markers, the discovery of drought-resistant genes and the study of the genetic relationships between P. catalpifolia and other related species.

Results
SMRT sequencing of the full-length transcriptome. We acquired full-length transcriptomic of P. catalpifolia using SMRT sequencing technology and obtained 28.83 Gb sequencing data. After removing the adapter sequences, approximately 454,554 polymerase reads remained, which then formed 19,052,345 subreads with an average read length of 1470 bp. After self-correction and merging, the subreads formed 405,034 circular consensus sequences (CCSs) (Fig. 1a) with an average length of 1693 bp, and 349,745 full-length non-chimeric sequences (FLNCs) (Fig. 1b). A total of 30,953 transcripts were obtained after clustering and removal of redundant sequences using the PacBio SMRT LINK Cluster tool, and 30,928 HQ transcripts with ≥ 99% accuracy and a full-length read support ≥ 2 were sequenced (Fig. 1c). The length range of the HQ transcripts was 362-7922 bp, the N50 was 1768 bp, and the mean transcript length was 1618 bp. Of the HQ transcripts, 10.47% and 86.07% were 362-900 bp and 1000-3000 bp in length, respectively. Long-length HQ transcripts (> 3000 bp) constituted 3.46% of the total HQ transcripts. After error correction and removal of all 100% identical sequences, 25,969 HQ transcripts remained, its individual transcript length ranging from 362 to 7922 bp, the average length of 1624 bp, and N50 of 1781 bp, which were used in subsequent analyses.
Identification of long non-coding RNAs, coding sequences and alternative splicing. The long non-coding RNAs (lncRNAs) are not translated into protein and its length are more than 200 nucleotides. LncR-NAs are vital for regulating the neighboring gene expression 33 . A total of 149 common lncRNAs were identified in P. catalpifolia Using four methods (CPC2, CPAT, PLEK and CNCI) (Fig. 4a). TransDecoder software was used to predict 24,982 coding sequences (CDSs), of which 16,722 were intact. The lengths of the amino acids encoded by the intact CDSs were in the range of 100-1840, with the number of amino acids decreasing as the length increased except 100-300 (Fig. 4b). Alternative splicing (AS) is one of crucial biological phenomenons, and it is helpful to produce different mature transcripts using the same RNA sequence 34 Table 2). The number of complete SSRs was 6604 in total and accounted for 89.64% of the total SSR loci, which included 2951 mononucleotide (44.68%), 2236 dinucleotide (33.86%), 1275 trinucleotide (19.31%), 50 tetranucleotide (0.76%), 24 pentanucleotide (0.36%) and 68 hexanucleotide SSRs (1.03%) (Fig. 5). The complete SSR lengths ranged from 10 to 88 bp, with a mean of 15.99 bp. The number of repeat SSR motifs ranged from 5 to 44, with a mean of 10.03. We found that SSRs with 6 motif repeats were the most common and accounted for 13.64% (901) of all SSRs, followed by SSRs with 10 repeats (897, 13.58%), 5 repeats (834, 12.63%) and 11 repeats (757, 11.46%), respectively. Furthermore, 4997 SSRs had motif repeat numbers ≤ 12, accounting for 75.67% of all SSR loci identified (Table 3).

Discussion
The lack of reference genome has impeded basic genetic research in P. catalpifolia and its related species. However, SMRT sequencing technology can generate full-length transcript sequences without a reference genome [35][36][37] and has been widely used to predict and validate gene models related to some unique traits in species 38 . In this study, we used the SMRT technique to perform full-length transcriptome sequencing in P. catalpifolia using PacBio RS II platform. In total, 28.83 Gb sequencing data were obtained including 349,745 full-length non-chimeric sequence reads, which was similar to the number of FLNC reads in Rhododendron lapponicum 39 . After subjecting the reads to clustering, error correction and redundant sequence removal, a total of 25,969 HQ transcripts were finally obtained. Very-long-read sequences were generated using the SMRT sequencing technology, and one read is considered a full-length transcript under normal circumstances 40 . The HQ transcripts generated using SMRT sequencing were longer in length than those generated using an Illumina system. In this study, the average length of the HQ transcripts in P. catalpifolia was 1624 bp, while the mean unigene length was 945 bp in tung tree 41 , 683 bp in Pueraria lobata 42 and 690 bp in Eucommia ulmoides 43 , each of which were sequenced using an Illumina system. In addition, we found that HQ transcripts > 1000 bp in length accounted for 84.04% of all HQ transcripts in our research, which was much higher than that in P. australis (40.09%) 44 and P. tomentosa (42.16%) 45 using Illumina sequencing technique. Our results demonstrated that SMRT sequencing is a reliable and efficient method to obtain full-length transcript sequences in species without an annotated reference genome.   www.nature.com/scientificreports/ We annotated 25,602 HQ P. catalpifolia transcripts using five public databases. The annotated HQ transcripts accounted for 98.59% of all HQ transcripts, a similar rate to those of transcriptomics studies in R. lapponicum 39 and Medicago sativa 46 . The 367 HQ transcripts with no predicted functions are likely to be species-specific or unknown genes in P. catalpifolia. GO classification of the HQ transcripts indicated that the majority were associated with the GO terms metabolic processes, binding, catalytic activity, cellular processes, cell and cell part. HQ transcript annotation using KOG indicated that a large number of transcripts were involved in posttranslational modifications, protein turnover, chaperones, translation, and ribosomal structure and biogenesis. A total of 13,829 HQ transcripts were assigned to specific KEGG pathways, such as carbohydrate metabolism, energy metabolism, translation, folding, sorting and degradation. We also found that many HQ transcripts exhibited multiple molecular functions and participated in diverse biological pathways. Our study provides a wealth of genetic information for molecular research into the growth and development of P. catalpifolia leaves, particularly in response to drought stress.

Repeat motif length
In recent years, SSR molecular markers have been widely used for genetic map construction, genetic diversity analyses and functional gene mining. However, the traditional methods of SSR primer development are timeconsuming, complex and costly, thus hindering their development seriously. While the SSR primers developed on the basis of transcriptome sequencing data information are economical, efficient, and abundant, which has gradually become one of important methods. Furthermore, SSR molecular markers are rapidly being developed alongside recent advancements in transcriptome sequencing technology 47,48 . In our study, a total of 7367 SSR loci were detected from 25,969 HQ transcripts, including 763 complex SSRs and 6604 complete SSRs. The frequency of the SSRs was 28.37%, and the average distribution distance was 5.59 kb. Among the 6604 complete SSRs, the most abundant and frequent mononucleotide, dinucleotide and trinucleotide motifs were A/T, AG/CT and GAA/ TTC, respectively; studies examining SSRs in Hevea brasiliensis 49 , Chinese cabbage 50 and R. lapponicum 39 produced similar results. A/T was the most abundant mononucleotide motif (2816, 95.43%), which was consistent with a study performed by Lagercrantz et al 51 . AG/CT (807, 36.09%) and GA/TC (636, 28.44%) were the most abundant dinucleotide motifs, and CT repeats usually existed in transcriptional regions that might take part in antisense transcription and have an effect on gene regulation 39,52 . There were differences in SSR abundance of different plant species in diverse researches, and repeat number of 6, 10, 5, 11, and 12 occupied 59.30% of the total complete SSR loci in our study. The SSR markers that we have developed in this work will facilitate mining for drought resistance genes, breeding drought resistant varieties, genetic diversity analyses and genetic map construction in P. catalpifolia. Of course, the SSRs found in this study were predicted theoretically and should be verified experimentally before further using.

Materials and methods
Plant materials and RNA extraction. P. catalpifolia seedlings were planted in separate pots at Mengzhou Forest Farm at the Paulownia Research and Development Center of State Administration of Forestry and Grasslands (Jiaozuo, Henan, China, 112° 42′ 58″ E, 34° 51′ 38″ N). The third and fourth fully expanded functional leaves from the top of the stem were collected at 0, 8 and 16 days after drought stress, respectively. The leaves were immediately frozen in liquid nitrogen and stored at − 80 °C until the experiment 23 . The Paulownia catalpifolia used in this study were identified by Paulownia Research and Development Center of State Administration of Forestry and Grassland, and the collection and use of Paulownia catalpifolia samples in our experiment comply with the guidelines of Paulownia Research and Development Center of State Administration of Forestry and Grassland. Total RNAs extraction were performed using the EZ-10 DNAaway RNA mini-prep kit (Sangon Biotech Co., Shanghai, China) following the manufacturer's instructions. The total RNAs of three samples above were mixed equally according to the method of Diao 53 to form the sample S for transcriptome sequencing. The degrees of RNA degradation and contamination were evaluated using 1% agarose gels 39 . The RNA purity and concentration were checked using the NanoPhotometer spectrophotometer (Implen, CA, USA) and Qubit RNA Assay Kit (Life Technologies, CA, USA), respectively 22 . RNA integrity was analyzed using an Agilent Bioanalyzer 2100 system (Agilent Technologies, CA, USA) 22 . The resulting high-quality RNA was used for full-length transcriptome sequencing. cDNA library construction and SMRT sequencing of the full-length transcriptome. Full-length cDNA was synthesised from 1.0 μg purified mRNA using the SMARTer PCR cDNA Synthesis Kit (Clontech, USA) according to the manufacturer's protocol, its size were selected using the BluePippin Size-Selection System (Sage Science, USA) and then PCR amplified again. The cDNA library was constructed after repairing the ends, connecting dumbbell-shaped SMRT adapters, performing exonuclease digestions and conducting a secondary screening using BluePippin. After the cDNA library had passed quality control using the Qubit 2.0 and Agilent 2100, full-length transcriptome sequencing of P. catalpifolia was performed using the PacBio RS II platform, based on the target data volume 23 . Quality control and functional annotation of the full-length transcriptome. The raw SMRT data were pre-processed using the SMRT Pipe analysis workflow within the PacBio SMRT Analysis software suite. Examination of the polyadenosine signal and 5′ and 3′ adaptors, as well as error correction, were performed following the methods similar to the one described 54 . Full-length SMRT transcripts were identified, and nonredundant HQ transcripts were acquired using CD-HIT-EST software 55 . Clustering and removal of redundant sequences were performed using the PacBio SMRT LINK Cluster tool, and all HQ transcripts were aligned to nucleotide and protein databases using BLASTX 54 . The databases used in this study were NCBI non-redundant, gene ontology (GO), eukaryotic orthologous groups (KOG), Kyoto encyclopedia of genes and genomes (KEGG) and Swiss-Prot. www.nature.com/scientificreports/ Identification of lncRNAs, coding sequences (CDSs) and AS variants. LncRNA candidates were identified using the following software: coding potential calculator 2 (CPC2), coding potential assessment tool (CPAT), predictor of long non-coding RNAs and messenger RNAs based on an improved k-mer scheme (PLEK), and the coding-non-coding index (CNCI), respectively. LncRNAs with > 200 nucleotides were selected. Trans-Decoder version 3.0.0 was used to identify candidate coding sequences (CDSs) in the full-length transcriptome of P. catalpifolia. All non-redundant HQ transcripts were aligned using a previously described method 56 . Candidate AS events were identified using the selection criteria described by Diao et al. 53 .
Identification and characterization of SSRs. The microsatellite identification tool (MISA) was used to identify SSRs within the 25,969 HQ transcripts, and the characteristics of the repeated motif types were further analyzed statistically. In this study, the SSR locus were identified according to the criteria below: the repeat number of mononucleotide motifs was ≥ 10 and the repeat numbers of di-, tri-, tetra-, penta-and hexanucleotide motifs were ≥ 6, 5, 5, 5 and 5, respectively.

Data availability
The raw data from SMRT sequencing are accessible at NCBI under bioproject (PRJNA565572).