De novo assembly of transcriptomes, mining, and development of novel EST-SSR markers in Curcuma alismatifolia (Zingiberaceae family) through Illumina sequencing

Curcuma alismatifolia widely used as an ornamental plant in Thailand and Cambodia. This species of herbaceous perennial from the Zingiberaceae family, includes cultivars with a wide range of colours and long postharvest life, and is used as an ornamental cut flower, as a potted plant, and in exterior landscapes. For further genetic improvement, however, little genomic information and no specific molecular markers are available. The present study used Illumina sequencing and de novo transcriptome assembly of two C. alismatifolia cvs, ‘Chiang Mai Pink’ and ‘UB Snow 701’, to develop simple sequence repeat markers for genetic diversity studies. After de novo assembly, 62,105 unigenes were generated and 48,813 (78.60%) showed significant similarities versus six functional protein databases. In addition, 9,351 expressed sequence tag-simple sequence repeats (EST-SSRs) were identified with a distribution frequency of 12.5% total unigenes. Out of 8,955 designed EST-SSR primers, 150 primers were selected for the development of potential molecular markers. Among these markers, 17 EST-SSR markers presented a moderate level of genetic diversity among three C. alismatifolia cultivars, one hybrid, three Curcuma, and two Zingiber species. Three different genetic groups within these species were revealed using EST-SSR markers, indicating that the markers developed in this study can be effectively applied to the population genetic analysis of Curcuma and Zingiber species. This report describes the first analysis of transcriptome data of an important ornamental ginger cultivars, also provides a valuable resource for gene discovery and marker development in the genus Curcuma.

www.nature.com/scientificreports www.nature.com/scientificreports/ auxiliary flower buds. Open flowers are small with purple flag petals (Fig. 1). To improve this species, traditional breeding programmes have been launched in previous studies 1 . A large number of cultivars and hybrids with variation in flower colour, from dark purple to white, are now available. Most commercially important cultivars of C. alismatifolia are hybrids and are propagated vegetatively.
To estimate genetic variation with high reproducibility within a variety of plant species, SSRs or microsatellites are known to be highly effective tools and are considered a robust class of molecular markers 2 . SSRs are classified based on their source (i.e., genomic SSRs or g-SSRs and genic SSRs or expressed sequence tags SSRs as EST-SSRs) within the coding region and are identified from transcribed RNA sequences 3 . Although EST-SSRs are closely linked to functional genes with a possible impact on important agronomic characters, g-SSRs are not necessarily expected to be either strongly linked with transcribed regions of the genome or to have genetic functions. A major disadvantage associated with EST-SSRs is sequence redundancy that results in multiple sets of markers at the same locus. This issue can be addressed through assembling short reads of RNA transcripts and ESTs into unigenes 4,5 . Before next generation sequencing (NGS) technologies, the development of SSR markers was costly and exhibited low throughput due to the necessity of building up genomic libraries for targeted SSR motifs to create recombinant DNA molecules using restriction enzymes for DNA fragmentation. Additionally, cloning of DNA fragments into a vector was performed, as well as sequencing of clones carrying SSRs 6 . Application of molecular markers to study genetic variation of Curcuma species has been reported in previous studies [7][8][9][10][11][12] . Moreover, in the Zingibaraceae family, within spice crops, such as turmeric (Curcuma longa L.) 10,[13][14][15][16] , Zingiber officinale 17,18 , and cardamom (Elettaria cardamomum Maton) 19 genomic SSRs and EST-SSRs have been developed. The development of SSRs based on transcriptome data can therefore be viewed as useful for estimation of genetic diversity and population structure to support strategies for the breeding of populations of C. alismatifolia.
The advent of NGS technologies, such as sequencing by synthesis (e.g., Illumina) can simplify the creation of an enormous amount of genomic or transcriptome sequence data, allowing cost-effective and efficient development of molecular markers, including SSR markers 20 .
RNA sequencing (RNA-seq) is a powerful technique because the dual aspects of quantification and discovery are blended within a single sequencing assay. It is possible to easily provide information on transcriptomes through the technology of high-throughput mRNA sequencing, which is regarded as a cost-effective and powerful tool for profiling gene expression within non-model organisms that have no reference genome 21 . RNA-seq technology and de novo transcriptome assembly are considered effective approaches to detect functional genes, as well as to characterize patterns of gene expression and associated regulatory networks [22][23][24][25][26][27][28][29][30][31] , having mapped out a direct, reliable, and simple strategy for cost-effective and efficient development of high-throughput identification www.nature.com/scientificreports www.nature.com/scientificreports/ of EST-SSRs 4 . Using RNA-seq technology, EST-SSRs have been utilized and developed in a variety of plant species, such as peanut 32 , bean 33 , Chinese cabbage 34 , Petunia spp. 35 mango 36 , Chinese bayberry 37 , and Tapiscia sinensis 38 . A list of plant species in which genomic and EST-SSRs have been developed using 454 and Illumina sequencing platforms is available in our review article 2 . Assembly of de novo transcriptomes have also been performed in Curcuma longa L., revealing novel transcripts related to antimalarial terpenoids and anticancer within rhizomes 39 and Zingiber officinale cv. Suruchi of Odisha 40 . To the best of our knowledge, however, there are no previous reports of de novo transcriptome assembly (RNA-seq) for C. alismatifolia, particularly for the development of EST-SSR markers. In this study, we present the first de novo transcriptome analysis of ornamental Curcuma species, C. alismatifolia cv. 'Chiang Mai Pink' and cv. 'UB Snow 701' based on Illumina HiSeq 4000 sequencing and the identification of SSR markers for Curcuma species. This transcriptome analysis offers new insights into the evolutionary origin of Curcuma, as well as a rich resource for genetic information that could be used for genetic improvement and breeding.

Results
Illumina sequencing and de novo assembly of paired-end reads. In the present study, approximately 13.19 Gb of raw read data was generated from Illumina HiSeq sequencing of the upper bracts of C. alismatifolia cultivars, 'Chiang Mai Pink' and 'UB Snow 701' . After raw read filtering, clean reads were assembled into 46,829 and 52,788 unigenes, reaching a total length of 40,992,261 bp and 50,231,516 bp for 'Chiang Mai Pink' and 'UB-Snow 701' cultivars, respectively. The assembled unigene length ranged from 301 to over 3000 bp with an average of 875 and 951 bp and N50 of 1,338 and 1,445 bp for 'Chiang Mai Pink' and 'UB-Snow 701' cultivars, respectively (Table 1).
Functional annotation and classification of the unigenes. After sequence contig assembly, unigenes were annotated by comparison against the six functional databases: NCBI Nr and Nt; COG; GO; KEGG and Swiss-Prot (Table 2, Supplementary Table S1). As indicated in Table 1, 47,160 (75.94%) of unigenes were significantly matched with known proteins in the Nr database, while 44,007 (70.86%) unigenes matched with entries in the nucleotide sequence (Nt) database, and 33,298 (53.62%) matched with proteins in the Swiss-Prot database. To further analyse BLAST results, E-value and similarity distributions were calculated using the Nr database. The E-value distribution of the top hits revealed that 53.98% of annotated sequences had high scores for homology (E-value < 10 −50 ), whereas 46.02% showed homology with E-values ranging from 10 −5 to 10 −50 (Fig. 3a). Additionally, 80.4% of sequences were found to have similarities of over 70% (Fig. 3b). These results reflect the high identities of the mapped sequences with known sequences, suggesting good assembly quality. Species distribution showed Musa acuminata subsp. malaccensis 'DH Pahang' (Musaceae) (order Zingiberales) to have a very high similarity score with 37,962 (80.5%) top BLASTx hits. Other species matched at below 6%, including African oil palm, Elaeis guineensis (Arecaceae) with 2,456 (5.21%) and date palm, Phoenix dactylifera (Arecaceae) with 1,815 (3.85%) (Fig. 3c). Further functional prediction and classification of all unigenes was performed using their annotation with COG, GO, and KEGG databases ( Table 2, Supplementary Table S1). COG function classification of the sequence of C. alismatifolia produced Nr hits for19,546 of 62,105 unigenes which, were annotated and classified functionally into 25 COG functional categories, including biochemistry metabolism, cellular structure, signal transduction, and molecular processing (Fig. 4). The cluster for general function prediction represented the largest group with 5,457 genes (15.6%) followed by transcription (4,035, 11.6%) and replication, recombination www.nature.com/scientificreports www.nature.com/scientificreports/ and repair with 2,875 genes (8.26%). Additionally, only a few unigenes were assigned to nuclear structure and extracellular structures with eight (0.02%) and seven (0.02%) genes, respectively.
In the present study, 3,651 (5.88%) C. alismatifolia unigenes were assigned to GO classes based on Nr annotation with 21,210 functional terms (Fig. 5). The annotated unigenes that belonged to three clusters of molecular function, cellular component, and biological process were categorized into 51 functional groups. Cellular component (8,917,42.04%) and biological process (8,292,39.09%) classifications represented the largest number of unigenes followed by molecular function (4001, 18.87%). Under the cellular component category, three subcategories, cell, organelle, and membrane, represented 38.28% of associated unigenes. The majority of the unigenes in the biological process category were specific for metabolic (1,817,8.56%) and cellular (1,814, 8.55%) processes followed by single-organism processes (1,367, 6.44%) and stimulus response (648, 3.05).
All unigenes were analysed by comparison with the KEGG pathway database for further analysis of the C. alismatifolia transcriptome. Carrying significant matches in this present study, 35,629 (57.37%) unigenes were assigned to 135 predicted metabolic pathways. Figure 6 shows the top 21 pathways based on six main categories. The metabolism pathway was the most represented among these six main categories, containing 20,743 unigenes (56.50%), followed by genetic information processing (9,106, 24.80%), cellular processes (2,387, 6.50%), organismal systems (2,277, 6.20%), environmental information processing (1,899, 5.17%), and human diseases pathways (298, 0.81%). These functional annotations provide useful information to further investigate specific developmental and biochemical processes of C. alismatifolia, as well as potential functions, structures, and pathways of genes.

GO-Annotated
Overall*  Table 2. Summary of functional annotation of unigenes of C. alismatifolia with six databases. *Overall: the number of unigenes which be annotated with at least one functional database. www.nature.com/scientificreports www.nature.com/scientificreports/ Identification of EST-SSRs from C. alismatifolia transcriptome. Using MISA software, all 62,105 unigenes were employed to mine potential EST-SSRs for development of SSRs in C. alismatifolia. A total of 7,811 unigene sequences were found to encode 9,351 potential EST-SSRs (Supplementary Table S2). Of these, 1,248 unigenes contained more than one SSR and 501 SSRs were present in compound formation (Table 3). Table 4 and Fig. 7 offer a summary of the number and type of EST-SSRs along with different numbers of tandem repeats.    www.nature.com/scientificreports www.nature.com/scientificreports/ From 61,786.426 kb of examined sequences, we detected a frequency of at least one SSR per 6.6 kb within the C. alismatifolia genome expressed fraction. In identified SSR loci considering sequence complementary, 213 motif sequence types were identified, of which mono, di-, tri-, quad-, penta-, and hexa-nucleotide repeats had 2, 4, 10,  www.nature.com/scientificreports www.nature.com/scientificreports/ 26, 57, and 114 types, respectively (Fig. 7). Trinucleotide repeat motifs were the most abundant (5,720, 56.35%) followed by dinucleotide repeat motifs (2, 473, 26.44%), whereas hexa-(443, 4.74%), penta-(313, 3.35%) and quad-nucleotide repeat motifs (244, 2.61%) were rare. Within the two possible types of mononucleotide repeat, A/T was the most abundant motif, containing 6.17% of total SSRs. The C/G motif was less abundant than the A/T with frequency of 0.33%. Among the dinucleotide repeat motif types, AG/CT, with frequency of 20.04% was the most abundant repeat motif while CG/CG was the least abundant motif, constituting 0.1% of total SSRs. The two most frequent repeats among the trinucleotide repeat motifs were AGG/CCT (13.82% of total SSRs) and CCG/ CGG (12.94% of total SSRs) followed by AAG/CTT (12.5%). Quad-, penta-, and hexa-nucleotide repeats motifs constituted 10.70% of the total SSRs (Table 4, Fig. 7). The number of SSR repeats ranged from 4 to 29, with five repeats (2,983, 31.90%) representing the most common followed by six (2,335, 24.97%) and seven (1,211, 12.95%) tandem repeats. ssR primer design and validation. Out of 9,351 SSRs, 8,955 high-quality primer pairs were successfully designed, and 150 primers were randomly selected for amplification of genomic DNA from C. alismatifolia (three cultivars and one hybrid), three Curcuma species, and two Zingiber species. Of the 150 tested primers, 70 primers amplified reproducible amplicons with the expected band sizes. Considering the amplification and polymorphic loci, 17 SSR primer pairs that presented clear and polymorphic loci were selected to evaluate polymorphism levels within the Curcuma and Zingiber species (Table 5, Supplementary File S3). Characteristics associated with the 17 polymorphic EST-SSR markers are listed in Table 6. The number of alleles (Na) per marker ranged from 3 to 6, with 75 alleles in total. The maximum and minimum levels of effective number (Ne) of alleles per locus was 4. Based on Dice's similarity matrix, we generated a UPGMA hierarchical clustering plot to evaluate the genetic relationships between nine cultivars and species (Fig. 8). The clustering plot clearly grouped the nine cultivars and species within three main clusters. Cluster I included four cultivars of C. alismatifolia with the highest similarity of 0.73 among its two sub-groups. Cluster II comprised three species of Curcuma with less similarity of 0.60 among C. longa and two C. angustifolia and C. cordata, and cluster III consisted of Z. officinale and Z. zerumbet with the least similarity of 0.50.

Discussion
In this study, we report the first analysis of transcriptome data from C. alismatifolia, a perennial ornamental bulb plant with long vase life and easy growth habits. Some cultivars of this plant have additional medicinal characteristics 41 . In spite of significant advances in DNA/RNA sequencing technologies and progress in developing SSR markers, there is a lack of genetic information to develop molecular markers for assessment of genetic diversity of C. alismatifolia populations compared to that in other ornamental plant species.
The C. alismatifolia transcriptome sequence offers a genomic resource for breeding and genetic study of Curcuma and provides a reference for plant transcriptome-scale evolutionary analyses of the Zingiberaceae family. De novo transcriptome analysis, which is based on RNA-seq data, is a significant alternative method for the whole genome sequencing for organisms with complex and large genomes that comport repetitive sequences 42,43 . In the present study, de novo transcriptomes derived from inflorescences of two cultivars of C. alismatifolia were assembled using Illumina paired-end RNA-seq technology. In total, 13.19 Gb of data was generated after sequencing. N50 and average length of all unigenes was 1,501 bp and 994, respectively (Table 1), which is comparable with N50 reported from the de novo transcriptome assembly of Curcuma longa (N50 = 1,515 bp) 39 , Zanthoxylum bungeanum (N50 = 846 bp) 21 , Zantedeschia rehmannii Engl. (N50 = 1476 bp) 44 , Cicer arietinum L. (N50 = 1192 bp) 22 , Euphorbia fischeriana (N50 = ~1500 bp) 29 , Cajanus cajan L. (N50 = 1510 bp) 45 , Hevea brasiliensis (N50 = 485 bp) 24 , Ipomoea batatas (N50 = 765bp) 26 , and Camellia sinensis (N50 = 506 bp) 27 using the Illumina method. In the assembly analysis, assemblies were measured by the size and accuracy of their transcripts and unigenes. In de novo assembly where the reference sequences are not available, assembly evaluation is more difficult. In such cases, some commonly used quality metrics include average length and N50 of unigenes. The unigene N50 is the median unigene size of transcriptome assembly. Although previous studies reported that larger N50 values and longer mean lengths indicate accurate and effective assembly 33,46 , these measures are www.nature.com/scientificreports www.nature.com/scientificreports/ primitive and often misleading 47 , e.g., N50, one of the most popular reference-free measures, can be maximized by trivial assemblies 48 . The assumption about N50 is that better assemblies will result from a larger number of identified overlaps between input reads and thus will have more reads assembled into longer unigenes. In short, N50 measures the continuity of unigenes but not their accuracy 49 . In transcriptomes, differential N50s for different tissues of the same plant may be observed, since different groups of genes are expressed 50 .
To predict unigene biological functions, unigene functional annotation was performed using six protein databases (Nr, Nt, COG, GO, KEGG, and Swiss-prot, Table 2) producing 48,813 out of 62,105 significant hits (Supplementary Table S1). Employing the same method, the C. alismatifolia unigene annotation rate was 78.60%, which was higher than that of C. longa (54.6%). The frequency of unigenes longer than 500 bp within the assembled transcripts may account for a higher percentage of unigene annotation. Generally, the rate of BLAST matches in protein databases for longer unigenes are more likely higher 51 . Other investigators also reported that because the significance of sequence similarity partially depends on the length of the query sequence, higher numbers of short reads in next generation sequencing (NGS) often cannot be matched with known genes 52 . Based on a BLASTX search against the Nr database, 80.5% of annotated unigenes of C. alismatifolia exhibited similarity to wild banana, Musa acuminata ssp. Malaccensis 'DH Pahang' , a perennial monocotyledonous herb of the same order, Zingiberales, as C. alismatifolia (Fig. 3c). One reason for this similarity could be the absence of whole-genome sequences within public databases created for any species of Curcuma. In this study, using the COG database, we classified annotated unigenes into 25 sub-terms or subcategories (Fig. 4), while with the GO database, unigenes were classified into 51 subcategories, including 11, 17, and 23 functions in molecular function, cellular components, and biological process aspects, respectively (Fig. 4), suggesting diverse transcripts in our transcriptome data for C. alismatifolia 53 21,39,44 . GO functional annotation helped us describe the macro level of gene functions and predict the physiological role of each unigene 55 . Results illustrated various molecular functions of assembled unigenes, suggesting their involvement in diverse metabolic pathways. Next, 35,629 unigenes were also annotated and mapped to 135 KEGG pathways. To better understand gene interaction and biological function, KEGG pathways were determined to be highly helpful. Functional classification provided based on KEGG pathways revealed that numerous important metabolic pathways within C. alismatifolia are still unknown and merit further investigation. These findings indicate that C. alismatifolia makes a huge investment in gene transcription control and capacity, as well as cell maintenance and defence. Approximately 80% of the top 21 hit pathways were involved in genetic information processing and metabolism, whereas the other pathways were related to those involved in cellular process, environmental information processing, human disease, and organismal systems (Fig. 6). In this study, unigenes associated with metabolism of terpenoids and polyketides, cofactors and vitamins, and biosynthesis of other secondary metabolites were also identified, providing evidence that numerous biologically active secondary metabolites have been isolated within Curcuma species 41,[56][57][58][59] . Briefly, functional analysis revealed that RNA-seq-based de novo transcriptome analysis for C. alismatifolia, a non-model organism with a complex genome, will facilitate further research on the physiology, biochemistry, and molecular genetics of C. alismatifolia or related species.   www.nature.com/scientificreports www.nature.com/scientificreports/ For gene-based studies detecting functional variations and studying population genetic structure, EST-SSRs are applicable. Transcriptome sequencing, generating enormous amount of sequence data, provides a good resource for development of SSRs. In the present study, out of 62,105 unigenes, 7,811 unigenes containing SSRs, representing approximately 12.57% of the transcriptomic sequences, possess SSR loci, with a distribution density of one SSR locus per 6.6 kb. This rate is comparable with SSR frequencies and distribution density in Torreya grandis (2.7%, 25.9 kb) 60 , Pummelo (14.7%, 5.6 kb) 61 , Amorphophallus (11.8%, 3.6 kb) 3 , Zingiber officinale (2.7%, 25.2 kb) 17 , calla lily (20.34%, 4.1 kb) 44 , Arachis hypogaea (17.7%, 3.3 kb) 4 , seagrass (17.5%, 5.8 kb) 62 , three varieties of Curcuma longa (14.6%, 5.3 kb; 14.9%, 5.2 kb; 20.5%, 4.8 kb) 39 , and three Macaronesian endemic plant species including Argyranthemum broussonetii (2.3%, 27 kb), Descurainia bourgaeana (3.5%, 22.9), and Echium wildpretii (1.8%, 38.2 kb) 63 . SSR frequency and differences in abundance amongst various species, can be partially attributed to differences between species, the size of the unigene assembly dataset, SSR search criteria, sequence redundancy, and the database-mining tools employed 33,44,64 . In our study, six different repeat motifs were identified, in which the most abundant were trinucleotide repeats (56.35%), followed by dinucleotides (26.44%). In contrast, hexa-(4.74%), penta-(3.35%), and quad-nucleotide repeats (2.61%) were rare (Table 4). This result is similar to previous findings for di-and tri-nucleotide motifs, which are reported as the most frequent SSR motif types within the transcriptome sequences of many other plants, including C. longa 39 , coloured calla lily 44 , Amorphophallus spp. 3 , Z. officinale 17 , Pummelo 61 , and cotton 65 . Trinucleotide repeats are the most abundant motifs for SSRs because open reading frames do not disturb with insertions and deletions within translated regions, whereas frameshift mutation may restrict the development of other motif types 4,33,66 . Among mononucleotide repeats, as in most plants, A/T repeats were far more abundant than G/C repeats 21,67,68 . The most abundant dinucleotide repeat (20.04%) was the AG/CT motif, as illustrated in Table 4, which is identical to previous findings in Zanthoxylum bungeanum 21 and Oryza sativa 69 followed by AT/TA (4.20%) and AC/GT (2.09%). The most abundant trinucleotide repeat motif in C. alismatifolia was AGG/CCT (13.82%) and closely followed by CCG/CGG (12.94%), similar to reports in calla lily and Amorphophallus spp. 3,17,44 . Consistent with previously reported observations, these results for C. alismatifolia indicate that the trinucleotide motif CCG/CGG is common in monocots. In addition, we also noticed that GC-rich trinucleotide motifs (CCG/CGG, AGG/CCT, AAG/CTT, AGC/CTG, ACG/CGT, and ACC/GGT > 67%) were more abundant than AT-rich trinucleotide (AAC/GTT, AAG/CTT, AAT/ATT, ACT/ AGT, ATC/ATG < 33%). The fact that high GC content and consequent codon usage bias can be considered specific features of monocot genomes is strongly supported by these results 44,70 .
As an efficient way to develop EST-SSR markers, mining transcriptome data provides great flexibility in selecting markers at different resolutions and for different applications 71 . Moreover, SSRs, which are derived from a transcriptome database, have advantages over other strategies, including needing less time and cost. In this study, 17 EST-SSR markers (Table 5) were obtained and verified, offering an informative and applicable approach for evaluation of genetic relationships within Curcuma and Zingiber species. Of the 150 primer pairs that were evaluated using a hybrid and three cultivars of C. alismatifolia, three Curcuma, and two Zingiber species, 17 polymorphic pairs, as EST-SSR markers, showed a moderate level of genetic diversity (Na  75,76 . In the present study, the usefulness of the 17 newly developed polymorphic EST-SSR markers for the evaluation of genetic diversity among Curcuma and Zingiber species was clearly demonstrated. Hierarchical cluster analysis revealed three distinct groups among the nine studied cultivars and species of Curcuma and Zingiber (Fig. 8), suggesting that the set of cross-species transferable EST-SSR markers developed in this study will enhance the current repository for the genus Curcuma. These markers can be useful for detection of markers associated with specific traits in other Zingiberaceae species and related genera in breeding programmes. Cross species transferability has been previously demonstrated among seven species of Zingiberaceae (Z. zerumbet, H. spicatum, C. longa, C. amada, C. aeruginosa, C. aromatica and C. angustifolia) for 16 EST-SSR markers from Z. officinale 17 . In another study, 100% transferability of EST-SSR markers from C. longa was observed among other species of Curcuma 16 . Genic SSRs are more transferable among distantly related species than genomic SSRs because their target coding domains are more likely to be conserved between relatives that makes them valuable for comparative mapping and evolutionary studies 77 .
In addition to the novel markers, the availability of the C. alismatifolia transcriptome dataset will allow an in-depth exploration of Curcuma specific genes with known or unknown biochemical functions along with a better understanding of its economically and agronomically important traits and to develop new cultivars for ornamental purposes or to produce valuable secondary metabolite components.

Materials and Methods
Rhizomes from C. alismatifolia cv. 'Chiang Mai Pink' and 'UB Snow 701' were obtained from a Curcuma Nursery (Ubonrat) in Thailand. Rhizomes were grown in a screen house at field no. www.nature.com/scientificreports www.nature.com/scientificreports/ was added. After shaking firmly for 15 s and incubation at room temperature for 5 min, mixtures were centrifuged at 12,000 g for 5 min at 4 °C to obtain a biphasic solution containing an upper colourless, aqueous phase and a lower phenol-chloroform phase with red colour. The aqueous phase was subsequently transferred into a new Eppendorf tube, and an equal volume of isopropanol was added to precipitate RNA from the aqueous phase. Samples were kept at room temperature for 10 min followed by centrifugation at 18,000 g for 15 min at 4 °C. After discarding the supernatant, 70% ethanol was added to the white RNA pellet and centrifuged at 13,000 g for 5 min at 4 °C. In the final stage, the pellet was air-dried and then dissolved in 30 μl diethyl pyrocarbonate (DEPC)-treated water 78 . Total isolated RNA was treated with DNase I to remove any residual DNA. Using a NanoDrop 2000 (Thermo Fisher Scientific Inc., MA, USA), the purity and concentration of isolated RNA were determined, and quality was verified by electrophoresis on a 1.5% agarose gel.
After total RNA extraction, Oligo (dT) magnetic beads were used to purify and enrich mRNA in each sample. Following purification, a fragmentation buffer was used to break the mRNA. Then, mRNA fragments were used as templates for cDNA synthesis using RT-PCR and random primers. Short cDNA fragments were purified and resolved with elution buffer (EB) for end reparation with T4 DNA and Klenow DNA polymerase. Subsequently, 3′-single adenylation was added to repaired cDNA fragments and connected with sequencing adapters. To construct the cDNA library, these products were purified and amplified via PCR. During quality check steps, Agilent 2100 Bioanaylzer (Agilent Technologies, Santa Clara, CA, USA) and ABI StepOnePlus Real-Time PCR System (Applied Biosystems, Foster City, CA, USA) were used in quantification and qualification of the cDNA library. Finally, cDNA libraries were sequenced on a flow cell using an Illumina HiSeq 4000 system (Illumina, San Diego, CA, USA) with 101-base paired-end reads at BGI Co., Ltd., Shenzhen, China.
De novo assembly and functional annotation of unigenes. After sequencing, raw reads were first filtered for low-quality reads with greater than 20% Q-score <20 bases, adaptor-polluted reads, ambiguous reads containing >5% unknown base (N), and for non-coding RNA such as rRNA, tRNA, and miRNA. After filtering, the remaining reads, called 'Clean Reads' , were stored in FASTQ format 79 . Clean reads were de novo assembled using Trinity (v2.0.6) 80 software with parameters setting of minimum contig length of 150 bp, min_kmer_cov set to 3 to increase the stringency for reads being assembled together, and sequence homology by 80%. The resulting Trinity sequences were called transcripts, which were obtained from connecting contigs that could not be extended on either end. Then, TGICL (v2.0.6) (TIGR gene indices clustering tool) 81 was used to cluster transcripts to obtain final unigenes.
Functional annotation of unigenes was performed using BLASTx 82 against NCBI databases, including Nonredundant (Nr), nucleotide sequence (Nt), Kyoto Encyclopaedia of Genes and Genomes (KEGG) 83 , SwissProt 84 , and Clusters of Orthologous Groups (COG), with an E-value threshold of 10 −5 . To obtain the GO annotation, Blast2GO 85 was applied according to the Nr annotation (GO annotation: http://www.geneontology.org). ssR locus detection and primer design. To detect SSRs in unigenes, the Perl script MIcroSAtellite identification tool (MISA, http://pgrc.ipk-gatersleben.de/misa/misa.html) 86,87 was utilized. To select the SSR loci, the minimum number of repeats was 12 for mono-, six for di-, five for tri-and quad-, four for penta-and hexanucleotide repeats, respectively. It was impossible to design primers for SSR loci in which the sequence failed to meet appropriate criteria for designing a primer or whose flanking sequences were too short 21 . SSR primers were designed for each SSR using Primer3 (http://bioinfo.ut.ee/primer3) 88 according to the following parameters: (1) primer length of 18 to 26 bp with 20 bp as the optimum; (2) PCR product size ranging from 80 to 185 bp; (3) melting temperature (Tm) between 56 °C and 64 °C with a difference of no greater than 4 °C between the Tm values of the forward and reverse primers and with 60 °C as the optimum annealing temperature; (4) GC content of 40% to 70% with an optimum of 50%; (5) designed primer sequence limited to the middle region, with 30 bp being removed from the ends of the contig sequence; and (6) primer pairs devoid of secondary structure or consecutive tracts of a single nucleotide 21 . For this study, 150 total primers were selected at random and synthesized.

DNA extraction, EST-SSR markers amplification and validation.
Total genomic DNA was extracted from young leaves of three C. alismatifolia cultivars; 'Chiang Mai Pink' , 'UB Snow 701' , 'Sweet Pink' , one hybrid 'Doi Tung 554' , three species of Curcuma angustifolia, Curcuma cordata, Curcuma longa, and two species of Zingiber officinale and Zingiber zerumbet using a modified extraction buffer of the cetyltrimethyl ammonium bromide (CTAB) protocol 1 . With a NanoDrop ND 2,000 spectrophotometer (Thermo Fisher Scientific Inc., MA, USA), the quality and quantity of DNA were evaluated, and samples were stored at −20 °C before use. Finally, DNA concentration was adjusted to 70 ng/ml. Amplification and polymorphism of 150 pairs of synthesized SSR primers were tested through polymerase chain reaction (PCR) and electrophoresis. DNA amplification through PCR was performed using a T100 Gradient Thermal cycler (Bio-Rad Laboratories, Inc., CA, USA) within a 15 μL final reaction volume containing 4.5 μL of double distilled water, 1 μL of each primer (100 μmol/L), 1 μL of genomic DNA (70 ng/μL), and 7.5 μL of 2X DreamTaq Green PCR Master Mix (Thermo Fisher Scientific, Inc., MA, USA). PCR condition were as follows: initial denaturation of template DNA at 94 °C for 3 min followed by 35 cycles of 94 °C for 40 s, 50 to 60 °C (depending on the melting temperature of the primer pair used) for 1 min, and 72 °C for 1 min followed by a final extension of 10 min at 72 °C. PCR products were separated on 2.5% MetaPhor ™ Agarose gel stained with Midori Green Advance (Nippon Genetics Inc., Japan) using a 50 bp DNA ladder (N3231S, Biolabs, Inc., UK). Gels were visualized under ultraviolet light and imaged using the Gel Doc XR + Imaging system (Bio-Rad Laboratories Inc., CA, USA). ssR data analysis. For SSR analysis, PCR products were manually scored based on allele size following data scoring as "0" in the absence of the band and "1" as its presence. The binary data matrix was incorporated in the Numerical Taxonomy and Multivariate Analysis System (NTSYSpc, version 2.10e; Applied Biostatistics Inc. WLB, www.nature.com/scientificreports www.nature.com/scientificreports/ CAS) 89 to generate Dice's similarity matrix. Similarity matrix was used to create a hierarchical clustering plot based on the unweighted paired group method with arithmetic mean (UPGMA) using NTSYSpc software. To calculate genetic diversity parameters, including observed (Ho) and expected heterozygosity (He), and observed (n a ) and effective number of alleles (n e ) , Shannon's Information Index (I) and polymorphic information content (PIC) of selected SSR primers using population genetic analysis software (POPGENE32, ver. 1.32) was employed 90 .

Data Availability
Transcriptome datasets supporting the conclusions of this article are available in the European Nucleotide Archive (ENA) under the accession number PRJEB18956.