Introduction

Phyllanthus emblica L. (syn. Emblica officinalis Gaertn.) also known as Indian gooseberry or aonla is a deciduous tree of the family Phyllanthaceae, distributed across the tropical and subtropical regions comprising over 800 species worldwide and over 50 species in India1. It is well known that P. emblica fruit is one of the richest natural sources of vitamin C2 and also contains several other vital bioactive phytoconstituents such as flavonoids; gallic acid, ellagic acid, rutin, quercetin and catechol3, terpenoids such as phyllaemblicins B, phyllaemblicins C, glochicoccinoside D4, tannins; mucic acid 1,4-lactone 3-o-gallate, isocorilagin, chebulanin, chebulagic acid and isomallotusinin5, respectively. P. emblica has been used as medicine and nutritional tonic in traditional medicine systems such as Ayurveda, Unani and Sidha to cure various infectious and non-infectious diseases6. Due to the presence of a variety of bioactive compounds in P. emblica extract, numerous therapeutic effects have been reported such as antimicrobial, antioxidant, anti-inflammatory, analgesic, antipyretic, adaptogenic, hepatoprotective, antitumor, antiulcerogenic and immunomodulatory activities7,8,9. From a pharmacological perspective, phenols and terpenoids are the major secondary metabolites of P. emblica while its high ascorbic acid content can be recommended as a high-quality and low-cost alternative for securing nutritional requirements.

Although P. emblica is well known to provide various medicinal benefits, only limited studies have been conducted to understand its genetic structure. To date, only 21 genomic8, 10 and 52 genic SSRs11 were reported in P. emblica that have been used for characterizing different populations. These microsatellite markers, however, are insufficient to investigate the genetic structure, variability and gene flow within the P. emblica population. Therefore, the development of molecular markers is essential for the characterization of P. emblica germplasm and phylogenetic studies of the species. This will help us better understand the genetics of P. emblica and enable the more effective use of a variety of germplasm for breeding programs.

Recent developments in sequencing technology offer enormous potential for genomic analysis and gene function study in both model and non-model organisms. Nowadays, de novo assembly is growing in popularity since it is a quick and cost-effective method for short reads when reference genomes are not available12. RNA-Seq platforms provided an opportunity to mine important molecular markers such as SNPs and SSRs at a much lower cost. RNA-Seq generates millions of short tags and subsequently assembled them, which can help to interpret genome and transcriptome sequences. The availability of transcriptome assembly and an adequate number of EST sequence data helps in the development of EST-SSR markers. Further, EST-SSRs are increasingly used for the evaluation of genetic relationships because they are abundant in gene-rich regions, co-dominant, highly polymorphic and easily transferable among phylogenetically related species13. Therefore, in this study, Illumina NextSeq500 sequencing technology was utilized to characterize the transcriptome of P. emblica shoot and to develop EST-SSR markers which can be used to evaluate the genetic diversity, linkage map construction and marker-assisted breeding. This study will provide valuable insights into the genetic structure of P. emblica that can be utilized in an improved breeding program.

Materials and methods

Plant material, RNA and DNA extraction

Phyllanthus emblica L. shoot tissues were procured from a fully grown healthy plant located at Mizoram University, Aizawl and were treated with 0.1% DEPC treated water followed by snap-frozen in liquid nitrogen and stored in a deep freezer (− 80 °C) until further use. For Illumina Sequencing, the total RNA was isolated from the shoot samples using the modified CTAB and lithium chloride (LiCl) method14. The concentration and purity of RNA samples were analyzed on 1% denaturing RNA agarose gel and bio-spectrophotometer (Eppendorf, Germany), respectively.

To examine the polymorphism of EST-SSR markers, total of 30 leaf samples were collected from different locations comprising nine commercial varieties and twenty-one wild species of Phyllanthus emblica L. which has listed in Table 1 (Fig. 1) and stored at − 80 °C until DNA extraction. Two grams of leaves were ground in liquid nitrogen and genomic DNA was extracted using the CTAB method described by Doyle and Doyle15 with modification. The DNA was quantified with a bio-spectrophotometer (Eppendorf, Germany) and 0.8% agarose gel electrophoresis analysis.

Table 1 Details of P. emblica experimental material.
Figure 1
figure 1

Map depicting different locations from where experimental material procured in the study.

RNA sequence library preparation and sequencing

The RNA-Seq paired-end sequencing library was prepared from the purified RNA samples after pooling them in equimolar concentration16 using TruSeq standard mRNA sample prep kit (Illumina, California, USA). Briefly, mRNA was enriched from the total RNA using poly-T attached magnetic beads, followed by enzymatic fragmentation, 1st strand cDNA conversion using SuperScript II and Act-D mix to facilitate RNA-dependent synthesis. The 1st strand cDNA was then synthesized to the second strand using a second strand mix. The dscDNA was then purified using AMPure XP beads followed by A-tailing, adapter ligation and then enriched by a limited number of PCR cycles. The PCR enriched library was analyzed on a 4200 Tape station system (Agilent Technologies, Chandigarh, India) using sensitivity D1000 screen tape as per manufacturer instructions. The prepared library was sequenced on the Illumina NextSeq500 platform to produce 150 bp paired-end reads.

De novo transcriptome assembly and data clustering

The sequence raw data were processed to obtain high-quality concordant reads using Trimmomatic v0.3817 and in-house script to remove the adapter, ambiguous reads (reads with unknown nucleotides “N” larger than 5%) and low-quality sequences (reads with more than 10% quality threshold (QV) < Phred score). Further, these high-quality reads (QV ≥ 20) were assembled into transcripts using Trinity de novo assembler (v2.8.4)18 with kmer of value 25. The assembled transcripts were then clustered together using CD-HIT-EST 4.619 to remove the isoform produced during assembly. The resulting sequences were identified as unigenes and were considered for downstream analysis.

Sequence annotation and gene ontology (GO) analysis

TransDecoder-v5.3.0 (http://transdecoder.github.io/) was used to predict coding sequences (CDS) from the retrieved unigenes and identified candidate coding regions within unigene sequences. The functional annotation of coding sequences was performed using the DIAMOND program20 which is a BLAST-compatible local aligner for mapping translated DNA query sequences and finds homologous sequences for transcripts against non-redundant protein database from NCBI. GO analysis of identified coding sequences was carried out using the Blast2GO program. GO mapping was executed to retrieve GO categories for functionally annotated transcripts. BlastX result accession IDs were used to retrieve gene name symbols, identified gene names or symbols are then explored in the species-specific appearances of the gene product tables of the GO database. BlastX result accession IDs are used to retrieve UniProt IDs utilizing PIR that incorporates PSD, UniProt, SwissProt, TrEMBL, RefSeq, PDB and GenPept databases.

Development and detection of EST-SSR markers

The Microsatellite searching tool (MISA, http://pgrc.ipk-gatersleben.de/misa) was used for the identification of potential microsatellites from all the unigenes found in the transcriptome sequence of P. emblica. Primer 3 software (http://primer3.sourceforge.net/releases.php) was used to design the EST-SSR markers by optimizing the primer parameters such as; primer length range between 18 and 23 bp, GC content 40 ± 60%, product size range between 100 and 300 bp and annealing temperature ranging from 55 to 66 °C. A total of 30 primer pairs were randomly selected, designed and used to study the germplasm characterization of P. emblica.

PCR amplification and PAGE analysis

A total of 30 EST-SSR markers were designed to evaluate the genetic diversity among the studied aonla genotypes. For PCR analysis, a final volume of 10 µl reaction mixture containing 1X Taq polymerase buffer with MgCl2, 0.2 mM dNTP, 0.3U Taq DNA polymerase, 10 pmol SSR primers (Eurofins, India), and 50 ng/l of genomic DNA was utilized. The following thermal profile was used for the PCR amplification in the thermal cycler (Applied Biosystems, USA): initial denaturation at 94 °C for 5 min; 35 cycles of 94 °C for 1 min; annealing at varying temperatures depending on the primer pair for 45 s; and extension at 72 °C for 45 s; followed by final extension of products at 72 °C for 8 min. Following polyacrylamide gel analysis of the PCR results and assessment of amplicon size were measured against a 50 bp DNA ladder (GeNei, Bangalore, India).

Genetic diversity analysis

To assess the genetic diversity among 30 genotypes of aonla germplasm (Supplementary Table 1), the dendrogram was constructed based on the unweighted pair group method of the arithmetic mean (UPGMA) by using Jaccard’s similarity coefficient with the help of DARwin software ver.6. Further, factorial analysis and a Neighbour-Joining tree were constructed with the help of DARwin software ver.621. Program POPGENE 1.32 was used to determine each primer’s polymorphic information content (PIC), Marker Index (MI), effective multiplex ratio (EMR), resolving power (Rp), estimates of gene diversity for each population across all loci in terms of alleles per locus (Na), the effective number of alleles (Ne), Shannon’s information index (I), observed heterozygosity (Ho) and expected heterozygosity (He)22, 23.

Research involving plants

The necessary permissions for procuring the Phyllanthus emblica germplasm used in the current study have taken from the mentioned collection sites. The said material comprising of commercial genotypes is authenticated and validated by Rajnish Sharma while the wild genotypes are being authenticated, validated and maintained by Suresh Kumar at Mizoram University. All experimental research and field studies on plants (either cultivated or wild), including the collection of plant material were carried out in accordance with relevant institutional, national, and international guidelines and legislation.

Results and discussion

De novo transcriptomic assembly

The transcriptome sequencing of P. emblica using the Illumina NextSeq500 platform produced 39,933,248 Pair End (PE) reads, 5,991,900,473 bases and ~ 6 Gb total data. After trimming of adapter and removal of ambiguous sequences, high-quality reads were assembled into a total of 1,26,606 transcripts with a mean transcript length of 951 bp (Table 2). The de novo transcriptome assembly was best developed at k-mer 25 at 1418 bp N50 value. The maximum transcript length was found 7539 bp with a minimum transcript length of 201 bp (Table 2). After the removal of isoforms from the 1,26,606 transcripts, a total of 87,771 unigene sequences were retrieved. The average unigene length was found 884 bp with a maximum unigene length of 7539 bp (Table 2). From the first transcriptomic study of leaf and flower tissue of P. emblica using Illumina Hiseq2000 platform 1,34,205 unigene sequences and 89,242 singletons with an average contig length of 278 bp reported were reported by Kumar et al.24. While, in another leaf transcriptome study using Illumina Hiseq4000, a total of 76,881 non-redundant genes were reported by Liu et al.11.

Table 2 Summary of transcriptome data generated in Illumina NextSeq500 for P. Emblica L.

Prediction of coding sequences and function annotation

TransDecoder (v5.3.0) was used to find the coding sequences within a total of 87,771 Unigenes. As a result, a total number of 43,377 coding sequences with a maximum coding sequence length of 6300 bp with an average length of 837 bp were obtained (Table 2). After finding the coding regions based on various structural and positional parameters, it became important to find the functional information associated with assembled coding regions. Hence, coding regions were aligned by DIAMOND (BlastX alignment tool) to find the homologous sequences against NR (Non-Redundant) protein database from NCBI. Out of 43,377 transcripts, 38,692 coding sequences found Blast hit and 4685 were found unique. Reciprocally, Kumar et al.24 have shown a similarity of 47,276 sequences with the NCBI-NR protein database. In terms of similarity of coding sequences among other species, maximum of 16% of transcripts showed similarity with Hevea brasiliensis followed by Jatropha curcas (12%), Manihot esculenta (10%), Ricinus communis (9%) Populus trichocarpa (9%), Populus eupharatica (5%), Citrus sinesis (2%), Theobroma cacao (2%), Vitis vinifera (1%) and Quercus suber (1%) (Supplementary Fig. 1). In contrast, Kumar et al.24 reported the maximum transcripts showed homology with genes of Vitis vinifera (29%) followed by Oryza sativa (14.35%). The resulting similarity of P. emblica coding sequences with four Euphorbiaceae plants viz. Hevea brasiliensis, Jatropha curcas, Manihot esculenta and Ricinus communis reveal the similarities in gene architecture among Phyllanthaceae and Euphorbiaceae families.

Gene ontology analysis

GO assignments were used to classify functions of predicted coding sequences and also provided ontology of defined categories representing protein properties which are grouped into three main domains: Biological Process (BP), Molecular Function (MF) and Cellular Components (CC). A total of 11,134 coding sequences were annotated using Blast2GO analysis. Molecular Functions (MF) was found to have the highest number of 8727 coding sequences associated with it followed by 7596 coding sequences in Biological Processes (BP). While the least number of coding sequences were found associated with cellular components (6135). In concordance, Kumar et al.24 also reported that maximum predicted unigenes were found associated with molecular functions followed by cellular functions and biological processes respectively, following significant hits of unigenes against the NR database. In the category of MFs of GO, organic cyclic compound binding (3296/16%), heterocyclic compound binding (3292/16%), ion binding (2959/14%), transferase activity (2157/10%), small molecule binding (1908/9%), hydrolase activity (1796/9%), carbohydrate derivative binding (1585, 8%), catalytic activity on proteins (1439/7%), drug binding (1343/6%) and oxidoreductase activity (1067/5%) were annotated GO categories. Under the biological process GO category, organic substance metabolic process (4628/20%), cellular metabolic process (4407/19%), primary metabolic process (4381/19%), nitrogen compound metabolic process (3829/16%), biosynthetic process (2001/9%), the establishment of localization (1227/5%), oxidation–reduction process (1078/5%), regulation of cellular process (1019/4%), small molecule metabolic process (850/4%) were found most enriched categories. In the GO category of cellular components, membrane (3322/25%), organelles (2718, 21%), intracellular organelle (2666/20%), an intrinsic component of membrane (2531/19%), cytoplasm (1934/15%) were found most represented categories (Supplementary Fig. 2).

Flavonoid and terpenoid biosynthesis

Phenylpropanoid pathway begins deamination of aromatic amino acid phenylalanine and leads to a synthesis of a wide range of phenolic acids as secondary metabolites while chalcone synthase generates the intermediated flavonoid compound called naringenin, further oxidation and hydroxylation of naringenin generates eriodictyol and dihydrotricetin respectively. Further enzyme catalytic reactions convert this product either into anthocyanidins or catechins. Catechins are flavonoids distributed in a variety of foods and herbs including tea, apples, persimmons, cacaos, grapes, and berries25. Since they have tremendous beneficial health implications for humans and also play an important part in plant growth and development. For full utilization of these compounds in food, medicine, and other purposes requires a thorough understanding of genes and distinct biosynthetic pathways of their production in cellular systems and this makes them a versatile target for metabolic engineering26. The study conducted by Zhang et al.27 listed the presence of flavones from the leaves and branches of P. emblica by isolating two acylatedflavanone glycosides [S-eriodictyol 7-O-(6″-O-trans-p-coumaroyl)-β-d-glucopyranoside and S-eriodictyol 7-O-(6″-O-galloyl)-β-D-glucopyranoside] together with a new phenolic glycoside, 2-(2-methylbutyryl) phloroglucinol 1-O-(6″-O-β-D-apiofuranosyl)-β-D-glucopyranoside. This study noted the existence of significant flavones in P. emblica leaves and branches, highlighting the need for further investigation in the crop’s transcriptomics and metabolomics areas. Aonla shoot tissue used in the current study's transcriptome analysis revealed 62 genes associated with the production of flavonoids (Supplementary Table 2).

Terpenoids also known as isoprenoids are secondary metabolites synthesized in plants through two non-homologous biosynthetic pathways; first, the cytosolic mevalonate pathway (MVA) leading to the synthesis of sesquiterpenoids (C15) and the second non-mevalonate or plastid methylerythritol 4-phosphate (MEP) pathway resulting in a synthesis of monoterpenoids (C10) and diterpenoids28. Transcriptome analysis revealed a total of 48 predicted unigenes involved in terpenoid biosynthesis. Among them, 27 unigenes were found associated with terpenoid backbone biosynthesis, 9 unigenes for sesquiterpenes, and 7 unigenes for diterpenoids followed by 5 unigenes for monoterpenoid biosynthesis. By taking into consideration of terpenoids’ chemical diversity and their health implication into account there is a great need to understand the molecular mechanisms involved in triterpenoid saponin production in planta to assist their exogenous engineering, pinpoint biosynthetic genes, transporters and transcription factors.

Development and characterization of EST-SSR markers

A total of 7477 SSR sequences were retrieved from the 87,771 unigenes of P. emblica and only 448 sequences with more than one SSR were reported. The frequency distribution of SSRs is mainly comprised of more than one repeat motif viz., tri-nucleotide repeats (76%), followed by di-nucleotide (13%), composite type (8.46%), tetra-nucleotide (1.48%), penta-nucleotide (0.67%) and hexa-nucleotide (0.08%) repeats, respectively (Fig. 2, Supplementary Table 3). Consequently, 30 of them were randomly selected for primer design using Primer3 software28.

Figure 2
figure 2

Frequency distribution of identified SSR motifs in the P. emblica transcriptome.

22 of the 30 EST-SSR primer pairs showed successful amplification with genomic DNA of 30 P. emblica accessions with the desired product size while the remaining 8 primer pairs did not show any amplification. Out of 22 EST-SSRs, only 12 (40%) were found to be polymorphic and further used to assess the genetic diversity among the studied P. emblica genotypes (Supplementary Figs. 3, 4). These 12 EST-SSRs produced a total of 122 bands, of which 12 (9.83%) were monomorphic while 110 (90.16%) were polymorphic. The allele number for each primer ranged from 2 (PES-28) to 23 (PES-21), with an average of 10.17 and amplicon size varied from 100 to 480 bp (Table 3). However, Liu et al.11 reported that the average number of alleles per locus varied from 11 to 44 and the size of amplified product ranging from 104 to 297 bp while assessing the three populations of P. emblica using EST-SSR primers. Likewise, Pandey and Changtragoon10 and Geethika et al.8 studied and found that the number of alleles per locus varied from 4 to 7 and 2 to 9 with product size ranging from 150–236 bp to 130–330 bp, respectively while evaluating the genetic diversity among the two natural populations and 20 accessions of P. emblica, respectively. The variability in the allele number generated by primes may depend on the compatibility of the primer’s association with the plant genome as well as the components of each of the nitrogenous bases.

Table 3 Diversity statistics inferred in aonla (P. emblica) germplasm collected from different locations using EST-SSRs based on shoot transcriptome data.

The Polymorphism Information Content (PIC) value is used to determine the informativeness of a molecular marker; the higher the PIC value the more informative the primer. In the present study, the PIC values ranged from 0.50 (PES-28) to 0.93 (PES-21), with an average of 0.80. As evident from Table 3 the PIC value for each primer is equal to or greater than 0.5, indicating that all the primers are very informative and can be further utilized for germplasm characterization of P. emblica. Moreover, the highest percent polymorphism (100%) was recorded among three EST-SSR primers namely PES-40, PES-42 and PES-49, while the lowest percent polymorphism was observed in only one primer named PES-28 (50%) with an average of 86.38 percent polymorphism (Table 3). Additionally, the resolving power/discriminatory power (Rp), Marker Index (MI), and effective multiplex ratio (EMR) was also calculated for each polymorphic primer. The resolving power of a primer indicates the discriminatory potential of the primer to distinguish the genotypes or individuals. The resolving power of each primer ranged from 2.67 (PES 28) to 15.47 (PES 21), with an average of 8.64 (Table 3). Likewise, the effective multiplex ratio (EMR) was also calculated for all the 12 polymorphic primers and varied from 0.50 (PES 28) to 17.39 (PES-21), with an average of 8.34 (Table 3). Marker index (MI) is a feature of marker which explains the discriminatory power of a marker and is the product of PIC and EMR value. The maximum MI was recorded for primer PES-21 (16.09) and minimum for primer PES-28 (0.25) (Table 3).

The average number of alleles (Na), effective number of alleles (Ne), Shannon index (I), expected heterozygosity (He) and observed heterozygosity (Ho) was also estimated in the present study. The highest average number of alleles (Na), effective number of alleles (Ne), Shannon index (I) i.e. 5.40, 3.89 and 1.43, respectively were observed for the primer PES-21 while the lowest average number of alleles (Na) was recorded for primer PES-32 (2.20), the effective number of alleles (Ne) and Shannon index (I) for primer PES-25 i.e. 1.96 and 0.69, respectively (Table 3). The observed and expected heterozygosity for currently studied P. emblica genotypes was ranged from 0.65 to 1.00 and 0.58 to 0.86, respectively, with an overall average of 2.10 and 1.76, respectively. Similarly, in the previous studies, high level of genetic diversity at species level in terms of observed and expected heterozygosity was also estimated while characterizing the germplasm of P. emblica for example, Pandey and Changtragoon9 reported the Ho and He ranged from 0.360 to 0.760 and 0.499 to 0.806, respectively, by using six microsatellite markers in two natural population of P. emblica. Likewise, Geethika et al.8 and Liu et al.11 also recorded the Ho (0 to 1.00; 0.24 to 0.86) and He (0.401 to 0.825; 0.75 to 0.93) while assessing the genetic diversity of twenty and ninety P. emblica accessions using fifteen and twenty-one microsatellite markers, respectively.

Genetic diversity analysis

Based on the data obtained using 12 informative EST-SSR primers, a dendrogram was generated (Fig. 3). The dendrogram is a diagrammatical representation in the form of a tree illustrating the arrangement of clusters generated by EST-SSR analysis using NTSYS software divided the genotypes under studies into two main clusters namely, A and B at a similarity coefficient of 0.47 as presented in Fig. 3. The major cluster A comprises of 22 aonla genotypes which further bifurcated at a similarity coefficient ~ 0.487 into two sub-groups A1 and A2. These results indicated the consistency of genetic structure with geographical distribution while showing occurrence of similar genetic variations within two individual groups as reported by Liu et al.29 showing 20 EST-SSR primers to group the 260 P. emblica accessions into two major clusters. The sub-group A1 consists of 20 genotypes which include all the commercial varieties and genotypes from Himachal, Punjab and Tripura. However, HPH4 and HPH5 from Himachal showed the highest similarity coefficient value of 0.96 indicating the presence of similar genetic base. Whereas, the sub-group A2 comprised only two genotypes i.e. UMR2 and UMR3 from Meghalaya. On the other hand, Cluster B consists of 8 P. emblica genotypes and is grouped into two sub-clusters namely, B1 and B2 at a similarity coefficient of ~ 0.497. Sub-cluster B1 contained four genotypes from Tripura, i.e. JRL1, CPG1, CPG5 and JRL2 and Sub-cluster B2 also comprised of 4 genotypes i.e. WNG2 and WNG3 from Meghalaya and SAIRANG1 and LENGTE1 from Mizoram. These results indicated that there is great intermixing between aonla genotypes which may be the result of cross-pollination. In addition, polymorphism in P. emblica may also be influenced by other geographical and environmental factors such as altitude and precipitation29. DARwin software was used for factorial analysis in which 4 genotypes from Tripura location were grouped likewise, 3 genotypes from Himachal Pradesh and 3 genotypes from Punjab. Similarly, the 5 commercial cultivars were found near to each other and 2 commercial cultivars (Banarsi and Krishna) grouped together as shown in dendrogram (Fig. 3). Moreover, the neighbor-joining cluster analysis (Fig. 4) from the results of DARwin software also showed different grouping of genotypes as earlier discussed in a dendrogram and factorial analysis like a grouping of different commercial cultivars, Tripura location's 4 genotypes, 3 genotypes of Himachal Pradesh, Banarsi and Krishna and also a group of 4 genotypes two from Meghalaya and rest of two from Mizoram location (Fig. 5). It can be emphasized from these findings that P. emblica exhibited high levels of polymorphism in the current study as it is widely distributed throughout the India and found in different habitats that have resulted in rich gene pool by everlasting adaptive evolution leading to high levels of genetic variation. Similarly, Liu et al.30 using 20 EST-SSR primers in 260 Chinese accessions of P. emblica established the presence of high levels of genetic diversity and low levels of genetic differentiation. Moreover, Rout et al.31 also carried out the cluster analysis using UPGMA of 12 species of Phyllanthus collected from different locations of India by using RAPD and ISSR markers that depicted high levels of genetic diversity.

Figure 3
figure 3

Dendrogram illustrating hierarchical clustering of aonla genotypes.

Figure 4
figure 4

Neighbour-Joining phylogenetic tree of 30 aonla genotypes.

Figure 5
figure 5

Factorial analysis of aonla genotypes.

Moreover, the corresponding sequences of the twelve (40%) EST-SSRs found to be polymorphic were BLAST against the GenBank nonredundant database using BLASTX and the top hits of all of them were found to be similar to different organisms, presented in Table 4. Out of these 12 EST-SSRs, 3 primers (PES-25, PES-28 and PES 30) code for enzymes that were involved in flavonoid biosynthesis pathway, 2 (PES-32 and PES-38) were involved in the synthesis of ascorbate and aldrate metabolism and one (PES-33) in calcium signaling pathway (Table 4). Similarly, Liu et al.11 reported twenty (38.5%) polymorphic EST-SSR markers. Out of 20 EST-SSRs, 7 shows top hits/similarity with transcription factor TCP7-like (Jatropha curcas), Sugar transporter ERD6-like 7 isoform X2 (Jatropha curcas), Hypothetical proteins from Citrus clementina, Jatropha curcas and Sorghum bicolor, 40S ribosomal protein S29, partial (Zea mays) and U-box domain-containing protein kinase family protein, putative (Theobroma cacaos) through BLASTX analysis.

Table 4 Sequence similarity details of EST-SSRs.

Conclusion

Exploring candidate genes involved in useful metabolic pathways is an important forward step towards better acceptability of the wild and commercial aonla to explore them in selecting superior genotypes. Thus, we developed novel EST-SSRs linked with secondary metabolisms which would be useful in investigating the population genetics, gene mining, population demographics, gene flow, and the genetic resource assessments of P. emblica. These findings would be contributing towards genetic structure of P. emblica, its evolutionary adaptations and genetic relationships among the closely related Phyllanthaceae species.