De novo transcriptomic analysis and identification of EST-SSR markers in Stephanandra incisa

Stephanandra incisa is a wild shrub with attractive leaves and white flowers that is commonly used as a garden ornamental. However, the limited availability of genomic data for S. incisa has restricted its breeding. Here, we identified EST-SSR markers using de novo transcriptome sequencing. A transcriptome database containing 35,251 unigenes, with an average length of 985 bp, was obtained from S. incisa. From these unigene sequences, we identified 5,555 EST-SSRs, with a distribution density of one SSR per 1.60 kb. Dinucleotides (52.96%) were the most frequently detected SSRs, followed by trinucleotides (34.64%). From the EST-SSR loci, we randomly selected 100 sites for primer design and used DNA from 60 samples to verify polymorphism. The average values of the effective number of alleles (Ne), Shannon’s information index (I), and expected heterozygosity (He) were 1.969, 0.728, and 0.434, respectively. The polymorphism information content (PIC) ranged from 0.108 to 0.669, averaging 0.406, which represents a moderate level of polymorphism. Cluster analysis of S. incisa was also performed based on the EST-SSR data. Structure analysis showed that the 60 individuals could be classified into two groups. Thus, these novel EST-SSR markers provide valuable sequence information for analyzing the population structure, genetic diversity, and genetic resources of S. incisa and related species.

Comparative homology analysis of the S. incisa transcriptome unigenes against the Nr database showed the highest homology with Prunus mume. In total, 11,565 unigenes were annotated to Prunus mume, 3,561 to Malus domestica, and 3,189 to Pyrus x bretschneideri. The unigenes exhibited low homology with Medicago truncatula and Arabidopsis thaliana (Figure S1 and Table S1).

Development and validation of novel EST-SSRs.
We used Primer 5.0 software to design primer pairs that met the required criteria. We randomly selected and synthesized 100 primer pairs and initially screened them for polymorphism. Twenty-nine primer pairs showing polymorphism were resolved by agarose gel electrophoresis (Figure S2). Finally, amplification and amplicon polymorphism were assessed using DNA from 60 samples. Polymorphisms at these loci were detected by capillary electrophoresis (Fig. 7).
Polymorphism of SSR loci. Of the 100 loci tested, 29 were polymorphic. We detected 90 alleles in the two populations, an average of 3.1 alleles per locus. S21 was the locus with the largest number of alleles (6) (Table S6).
Cluster analysis. All amplified alleles were used for cluster analysis of the 60 individuals via the UPGMA method. The UPGMA dendrogram divided the individuals into two clusters: cluster 1 contained A27, LS06, LS07, and LS23, and cluster 2 contained all other individuals (Figure S3). The individuals showed a high level of genetic similarity, with similarity coefficients distributed in a narrow range of 0.75-1.0. These results indicate low diversity in the screened germplasm and a narrow genetic base.
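As an illustration of the clustering step, the following minimal Python sketch applies UPGMA (average linkage) to a small presence/absence allele matrix. The individual names and allele scores are hypothetical, and the simple-matching distance is an assumption, since the similarity coefficient used for the dendrogram is not stated in this section.

```python
from itertools import combinations

# Hypothetical presence/absence scores (1 = allele amplified) for four individuals.
PROFILES = {
    "ind1": [1, 0, 1, 1, 0, 1],
    "ind2": [1, 0, 1, 1, 0, 0],
    "ind3": [0, 1, 0, 1, 1, 0],
    "ind4": [0, 1, 0, 0, 1, 1],
}


def simple_matching_distance(a, b):
    """1 minus the fraction of allele scores on which two individuals agree."""
    matches = sum(x == y for x, y in zip(a, b))
    return 1.0 - matches / len(a)


def upgma(profiles):
    """Return the merge history [(cluster_a, cluster_b, distance), ...]."""
    names = list(profiles)
    # Pairwise distances between the original individuals.
    dist = {
        frozenset((a, b)): simple_matching_distance(profiles[a], profiles[b])
        for a, b in combinations(names, 2)
    }

    def avg_dist(c1, c2):
        # UPGMA: unweighted mean over all between-cluster pairs of individuals.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        # Join the two clusters with the smallest average distance.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: avg_dist(*pair))
        merges.append((c1, c2, avg_dist(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


for left, right, d in upgma(PROFILES):
    print(left, "+", right, "at distance", round(d, 3))
```

In this toy data set the last merge joins two pairs of individuals, which mirrors how a two-cluster split is read from the top of a dendrogram.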

Discussion
SSR markers are commonly used for genetic diversity analysis, cultivar DNA fingerprinting, marker-assisted breeding, and genome-wide association studies 28 . Transcriptome sequencing has been used efficiently to develop SSR markers in several plant species. It can also be applied to non-model plants, where it plays a crucial role in novel gene discovery, gene expression analysis, and the identification of molecular markers [29][30][31] . S. incisa is a non-model plant with high ornamental and medicinal value. However, there are no published reports on the transcriptome sequence of S. incisa. In this study, the transcriptome of S. incisa was sequenced using Illumina technology. We obtained 40,053,100 paired-end raw reads and 40,010,736 high-quality clean reads with a Q20 level of 97.98%, indicating reliable sequencing quality. The number of reads was lower than that reported for Saccharina japonica (44,362,190) but higher than that for Chinese hawthorn (28,888,844) 32,33 . The average length of the unigenes was greater than that of Pinus bungeana (922 bp) and Dalbergia odorifera (676 bp) 27,34 . Next, the assembled unigenes were annotated against public databases, including Swiss-Prot, Nr, KEGG, and KOG. These annotations provide useful information for future genetic diversity analysis of S. incisa. The annotation results showed that S. incisa was closely related to Prunus, Malus domestica, and Pyrus x bretschneideri, all of which belong to the family Rosaceae. These results are consistent with previous taxonomic studies. In the GO classification, metabolic process and catalytic activity were the largest groups among the three functional categories. In the KOG classification, the general function prediction cluster was prominent, consistent with previous results, followed by the signal transduction mechanisms category and the posttranslational modification, protein turnover, chaperones category.
In the KEGG pathway database, 11,076 (31.42%) unigenes had significant matches and were assigned to five categories comprising 133 KEGG pathways. These results provide valuable information for studying the genetic diversity and adaptation of S. incisa. Molecular markers can provide adequate genetic diversity information for plants [35][36][37] . EST-SSR markers have been an effective tool for analyzing gene structure, linkage mapping, and QTL analysis 38 . Thus, the identification of EST-SSR markers in S. incisa could promote its molecular breeding, especially in mapping and anchoring parental maps. However, there are no published reports on molecular markers in S. incisa. Here, we identified 5,555 EST-SSRs, among which dinucleotide and trinucleotide repeats accounted for a large proportion. The most common dinucleotide repeat in S. incisa was AG/CT, followed by AC/GT and AT/AT. Although the functional significance of SSRs in plant transcript regions is unclear, the AG/CT motif, a homopurine-homopyrimidine stretch frequently found in the 5′ untranslated region, reportedly plays a vital role in regulating gene expression and nucleic acid metabolism in plants 39 . The most common trinucleotide repeat was AAG/CTT, followed by AGC/CTG and ACC/GGT. These results agree with previous research showing that AAG may be the most important motif in dicotyledonous plants 40 . Additionally, the results agree with previous studies on the Rosaceae, such as Prunus mume 41 . The frequency of SSRs is closely linked to the size of the database, the mining tool used, the genome structure, and species differences 42 . The location of EST-SSR markers within genes facilitates their use for detecting genetic diversity that is possibly associated with valuable breeding traits.
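Motif labels such as AG/CT or AGC/CTG group a repeat unit with its reverse complement, because the same microsatellite reads differently on the two strands and at different starting offsets. The short Python sketch below shows one way such labels can be derived; the function names are hypothetical and the convention (smallest rotation of each strand, sorted alphabetically) is an assumption that happens to reproduce the labels used in this paper.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def min_rotation(motif):
    """Lexicographically smallest cyclic rotation of a repeat unit."""
    return min(motif[i:] + motif[:i] for i in range(len(motif)))


def motif_class(motif):
    """Label a repeat unit with both strand representations, e.g. 'AG/CT'."""
    motif = motif.upper()
    forward = min_rotation(motif)
    # Reverse complement: complement every base, then reverse the string.
    reverse = min_rotation(motif.translate(COMPLEMENT)[::-1])
    first, second = sorted((forward, reverse))
    return f"{first}/{second}"


for unit in ("TC", "GT", "AT", "TTC", "GCT", "GGT"):
    print(unit, "->", motif_class(unit))
```

With this grouping, a (TC)n repeat and a (GA)n repeat are both counted under AG/CT, which is why the dinucleotide classes collapse to a small number of labels.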
EST-SSRs are considered appropriate for designing specific primers because they yield high-quality amplicons 43 . In this study, we randomly selected and synthesized 100 primer pairs and identified 29 polymorphic genic SSR markers and 90 alleles among the two S. incisa populations. The average values of Ne, I, and He were 1.969, 0.728, and 0.434, respectively, indicating that S. incisa has a medium level of genetic diversity. The PIC value is used to assess the level of genetic information 44 . It is also used to assess the polymorphism level of markers, which are classified as high (PIC > 0.5), moderate (0.25 < PIC < 0.5), or low (PIC < 0.25) 45 . The PIC values ranged from 0.108 to 0.669 and averaged 0.406, which represents a moderate level of polymorphism. The level of polymorphism could be attributed to several factors, such as differences in sample size. Here, polymorphism was detected based on the identified EST-SSR-containing sequences, which might be a valuable resource for identifying genetic markers in future research on S. incisa. An admixture model-based approach was implemented to evaluate the population structure and suggested that two clusters best described the 60 S. incisa samples. Neighbor-joining analysis confirmed this result. Based on the cluster analysis, there was no obvious correlation between the genotypes of S. incisa and their geographical locations.
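For readers unfamiliar with these statistics, the Python sketch below computes Ne, I, He, and PIC for a single locus from its allele frequencies, using the standard textbook definitions (Ne = 1/Σp², I = −Σp ln p, He = 1 − Σp², and the Botstein form of PIC). It is not the software used in this study; the function names are hypothetical, He is the uncorrected form, and the class boundaries follow the commonly used thresholds of 0.25 and 0.5.

```python
import math


def diversity_indices(freqs):
    """Per-locus diversity statistics from allele frequencies that sum to 1."""
    hom = sum(p * p for p in freqs)  # expected homozygosity, sum of p_i^2
    cross = sum(
        2 * freqs[i] ** 2 * freqs[j] ** 2
        for i in range(len(freqs))
        for j in range(i + 1, len(freqs))
    )
    return {
        "Ne": 1.0 / hom,                                        # effective number of alleles
        "I": -sum(p * math.log(p) for p in freqs if p > 0),     # Shannon's information index
        "He": 1.0 - hom,                                        # expected heterozygosity
        "PIC": 1.0 - hom - cross,                               # polymorphism information content
    }


def pic_level(pic):
    """Classify a PIC value as high, moderate, or low polymorphism."""
    if pic > 0.5:
        return "high"
    if pic > 0.25:
        return "moderate"
    return "low"


# Hypothetical locus with two equally frequent alleles.
stats = diversity_indices([0.5, 0.5])
print(stats, pic_level(stats["PIC"]))
```

Applied to the values reported here, the mean PIC of 0.406 falls in the moderate class, while the extremes of 0.108 and 0.669 fall in the low and high classes, respectively.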
These polymorphic microsatellite markers provide an important reference for developing polymorphic molecular markers for S. incisa and could be applied in population genetics analysis, linkage mapping, and marker-assisted selective breeding in the future. Thus, the identified EST-SSR markers were effective for the genetic analysis of the S. incisa populations. Additionally, these results could serve as a valuable new resource for genomic studies on S. incisa. However, only 60 individuals from two populations were genetically analyzed. Thus, further research with a larger sample size is needed to analyze the genetic diversity and structure of the species.

Conclusions
This is the first study to analyze the transcriptome of S. incisa. A total of 35,251 unigenes, with an average length of 985 bp, were obtained from S. incisa to create a transcriptome database, from which 5,555 EST-SSRs were identified. Unigenes were annotated against the Nr (75.47%), Swiss-Prot (51.46%), KOG (41.63%), and KEGG (31.42%) databases. From these EST-SSR loci, we randomly selected 100 sites for primer design and used DNA from 60 samples to verify polymorphism. Diversity parameters showed that these primer pairs had a moderate level of polymorphism among the two S. incisa populations. Cluster analysis indicated that the 60 individuals could be classified into two groups. This is the first report to identify EST-SSR markers in S. incisa, which could be used for further genetic research and breeding of S. incisa.

Methods
Plant materials and RNA isolation. We collected fully grown leaves of S. incisa plants from Laoshan for transcriptome analysis. The leaves were flash-frozen in liquid nitrogen and stored at −80 °C until further use. Total RNA was extracted with TRIzol reagent following the manufacturer's protocol (Invitrogen, USA). To confirm polymorphism, 60 individuals from the Laoshan (LS) and Anshan (AS) populations were selected; sampled individuals were more than 150 m apart to minimize relatedness among the sampled plants (Fig. 8). Fresh, healthy leaves were stored in silica gel. Total genomic DNA was extracted using the CTAB method 46 . The extracted RNA and DNA were quantified using a Thermo Scientific NanoDrop 2000 spectrophotometer. Agarose gel electrophoresis was used to assess the integrity of the DNA and RNA. High-quality RNA samples were used to prepare the cDNA library.
Construction of cDNA library and Illumina sequencing. Following the manufacturer's instructions for the Illumina sequencing platform, first- and second-strand cDNA were synthesized. Next, the cDNA was purified, end-repaired, and ligated to adapters, followed by library enrichment. The cDNA quality and concentration were evaluated using a Qseq100 DNA Analyzer (Bioptic Inc, China). Finally, the cDNA library was sequenced on the Illumina HiSeq 2000 platform.
Sequence assembly and data analysis. The raw sequencing data were filtered by removing adapter contaminants, reads containing a high proportion of poly-N, and reads shorter than 20 bp. The SeqPrep program was used to screen high-quality clean reads for de novo assembly by discarding empty reads and reads with Q < 20, followed by calculation of the GC content, sequence duplication level, and the Q20 and Q30 values of the clean data 47 . Thus, we filtered the raw sequencing reads (reads with unknown base 'N', low-quality reads with adapters, and other low-quality sequences) to obtain clean reads, which were assembled using Trinity software (version 2.4.0, https://github.com/trinityrnaseq/trinityrnaseq/issues/270) to obtain consensus sequences. The software combined reads with overlapping sequences into contigs. Sequences that could not be extended at either end were defined as unigenes 48 . The assembled unigenes were used for annotation analysis. They were compared against public databases (E < 1e−5), such as NCBI non-redundant (Nr, http://www.ncbi.nlm.nih.gov/) 49 and Swiss-Prot (http://www.expasy.ch/sprot/) 50 .
Identification of EST-SSR loci and primer design. SSR loci were defined as di-, tri-, tetra-, penta-, and hexanucleotide motifs with at least 8, 6, 4, 3, and 3 repeats, respectively. The MIcroSAtellite identification tool (MISA, http://www.gramene.org/db/markers/ssrtool) was used to identify EST-SSRs 54 . Primer 5.0 (http://www.premierbiosoft.com/primerdesign/) was used to design primers based on the following parameters: annealing temperature (55-60 °C), GC content (40-60%), primer length (18-22 bp), and target PCR product size (100-300 bp) 55 . All primers were synthesized by the Beijing Ruibiotech Company.
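The repeat thresholds described above can be illustrated with a small regular-expression scan. The Python sketch below is a simplified stand-in for this kind of search, not MISA itself: the function name and the test sequence are hypothetical, compound and interrupted SSRs are ignored, and only the minimum repeat numbers (8/6/4/3/3 for di- to hexanucleotides) are taken from the text.

```python
import re

# Minimum repeat number per motif length, as stated in the Methods.
MIN_REPEATS = {2: 8, 3: 6, 4: 4, 5: 3, 6: 3}


def find_ssrs(seq):
    """Return sorted (start, end, motif, repeats) tuples for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # A unit followed by at least (min_rep - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip units that are themselves repeats of a shorter unit
            # (e.g. AAAA or AGAG), which belong to a shorter motif class.
            if any(
                motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            ):
                continue
            repeats = (m.end() - m.start()) // unit_len
            hits.append((m.start(), m.end(), motif, repeats))
    return sorted(hits)


# Hypothetical unigene fragment containing (AG)9 and (AAG)6.
example = "TTGC" + "AG" * 9 + "CCGTC" + "AAG" * 6 + "GTCA"
for hit in find_ssrs(example):
    print(hit)
```

Coordinates are 0-based and end-exclusive, so each reported span can be sliced directly from the input sequence when flanking regions are extracted for primer design.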
Validation and application of SSR markers. The CTAB method was used to extract total genomic DNA 56 . PCR was carried out in a final volume of 20 µL containing 2 × Mix (10 µL), forward primer (0.15 µL), reverse primer (0.15 µL), DNA (1 µL), and ddH2O to 20 µL. Amplification consisted of initial denaturation at 95 °C for 5 min, followed by 35 cycles of denaturation at 95 °C for 30 s, annealing at 52 °C for 30 s, and extension at 72 °C for 30 s, with a final extension at 72 °C for 10 min. All PCR products were verified by capillary electrophoresis. The electropherograms were read and analyzed using GeneMarker V2.2.0 (https://genemarker.software.informer.com/2.2/) software.
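The reaction setup above implies a water volume and a minimum run time that are easy to work out. The Python sketch below does this arithmetic; the variable and function names are hypothetical, the run time counts only the programmed hold times (temperature ramping is ignored), and the master mix helper adds no pipetting overage.

```python
FINAL_VOLUME_UL = 20.0

# Per-reaction volumes (µL) taken from the reaction description.
COMPONENTS_UL = {
    "2x Mix": 10.0,
    "forward primer": 0.15,
    "reverse primer": 0.15,
    "template DNA": 1.0,
}

# Thermal program: (step, temperature in °C, hold time in seconds).
INITIAL = ("initial denaturation", 95, 5 * 60)
CYCLE = [("denaturation", 95, 30), ("annealing", 52, 30), ("extension", 72, 30)]
N_CYCLES = 35
FINAL = ("final extension", 72, 10 * 60)


def water_volume_ul():
    """ddH2O needed to bring one reaction to the final volume."""
    return round(FINAL_VOLUME_UL - sum(COMPONENTS_UL.values()), 2)


def master_mix_ul(n_reactions):
    """Template-free master mix volumes for n reactions (no overage added)."""
    mix = {
        name: round(vol * n_reactions, 2)
        for name, vol in COMPONENTS_UL.items()
        if name != "template DNA"
    }
    mix["ddH2O"] = round(water_volume_ul() * n_reactions, 2)
    return mix


def hold_time_min():
    """Total programmed hold time in minutes, excluding ramping."""
    cycle_s = sum(seconds for _, _, seconds in CYCLE)
    return (INITIAL[2] + N_CYCLES * cycle_s + FINAL[2]) / 60


print("ddH2O per reaction (µL):", water_volume_ul())
print("Master mix for 60 reactions (µL):", master_mix_ul(60))
print("Programmed hold time (min):", hold_time_min())
```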

Data availability
The dataset is available from the NCBI Short Read Archive (SRA) with accession number SRR12158738.