Genic microsatellite marker characterization and development in little millet (Panicum sumatrense) using transcriptome sequencing

Little millet is a climate-resilient and high-nutrient value plant. The lack of molecular markers severely limits the adoption of modern genomic approaches in millet breeding studies. Here the transcriptome of three samples were sequenced. A total of 4443 genic-SSR motifs were identified in 30,220 unigene sequences. SSRs were found at a rate of 12.25 percent, with an average of one SSR locus per 10 kb. Among different repeat motifs, tri-nucleotide repeat (66.67) was the most abundant one, followed by di- (27.39P), and tetra- (3.83P) repeats. CDS contained fewer motifs with the majority of tri-nucleotides, while 3′ and 5′ UTR carry more motifs but have shorter repeats. Functional annotation of unigenes containing microsatellites, revealed that most of them were linked to metabolism, gene expression regulation, and response to environmental stresses. Fifty primers were randomly chosen and validated in five little millet and 20 minor millet genotypes; 48% showed polymorphism, with a high transferability (70%) rate. Identified microsatellites can be a noteworthy resource for future research into QTL-based breeding, genetic resource conservation, MAS selection, and evolutionary genetics.


Scientific Reports
| (2021) 11:20620 | https://doi.org/10.1038/s41598-021-00100-4 www.nature.com/scientificreports/ according to the manufacturer's instructions. The consistency and concentration of RNA were determined using a 1.2 percent agarose gel electrophoresis and a Qubit® 2.0 Fluorometer (Thermo Scientific, USA). The RNA Nano 6000 Assay Kit was also used to determine the integrity of the RNA. mRNA isolation from total RNA was carried out using oligo-dT-attached magnetic beads (Dynabeads® mRNA DIRECT™ Micro Kit), and Ion total RNA seq-kit v2 was used to create cDNA libraries (Thermo Fisher Scientific). For cDNA library preparation, purified mRNA randomly fragmented into short fragments (~ 200 bp) by RNaseIII enzyme then subjected to Ion adapter v2 hybridization, ligation and cDNA was synthesized and amplified as per manufacturer's guidelines. After amplification, cDNA libraries were selected for target fragments of ~ 200 bp on 2% Agarose EGelTM (Thermo Fisher Scientific) and stored at − 20 °C until used 10,28 . The selected ~ 200 bp cDNA library was subjected to emulsion PCR (Ion OneTouch™ 2 System, Thermofisher Scientific, USA), followed by enrichment of template positive bead recovery. Sequencing was conducted in an Ion S5 Machine (S5TM, Life Technologies). During the library preparation, each sample was assigned a unique molecular barcode.
Reads processing and assembly. PRINSEQ (0.20.4) was utilized for omitting low-quality sequences after FastQC (v0.11.9); the quality of the reads (if the reads had more than 10% nucleotide bases with Q value 25) were checked. Trinity (v2.13.2) reconstructed and assembled the high-quality reads into unigenes, using the default parameter 25 . To benchmark completeness of individual and combined P. sumatrense transcriptome assembly, the Benchmarking Universal Single-Copy Orthologs (BUSCO) version v5.2.2 was used with the default E-value cut-off of 1e−03 against the ortholog set of Embryophyta_odb9 lineage (creation date: 2020-09-10, number of species: 70, and number of BUSCOs: 255) from OrthoDB v9 29 . Further, TransDecoder (v. 5.5.0) was used for estimating the coding sequence present in Master assembly with the default parameters 30  SSR marker identification and primer design. The MIcroSAtellite (MISA) Identification Tool Perl script (http:// pgrc. ipk-gater sleben. de/ misa/) was used to find simple sequence repeat regions. The assembled transcriptome was screened for di-, tri-, tetra, Penta-, and hexanucleotide repeat motifs. Primer 3.0 (http:// fokker. wi. mit. edu) was used to design primers from sequences containing SSRs. In addition, the following parameters were considered for primer designing: primer length 18-28 bp optimum length 20 bp; Tm-55-63 °C, with an optimum of 60 °C; GC content 40-60%, optimum value 50%; maximum Tm difference between forward and reverse primer 1.5 °C and product size range 100-300 bp, 150 bp being the ideal size 23,33 .
Assigning functions to unigenes containing microsatellites. Functional annotation of SSR containing sequence carried out using standalone blast with NCBI non-redundant Nr database (e-value cut-off 10 −6 ), Swiss-Prot (A manually annotated and reviewed protein sequence database), KOG (Clusters of eukaryotic Orthologous Groups), and Pfam (Protein family, assigned using the HMMER3.0 package) 30 . Furthermore, the Blast2GO software with a cut-off E value of 1e −5 was used to assign GO (Gene Ontology) annotation on all unigene containing microsatellite motifs. Also, KEGG (KAAS, http:// www. genome. jp/ kegg) was used to perform metabolic pathways analyses (E-value 10 −10 ).

DNA extraction, PCR, and eSSR validation.
A total of 50 randomly selected sets of primers were selected and synthesized (Merck, India). Genomic DNA isolation was illustrated using DNeasy plant Mini Kit (Qiagen, Valencia, CA) from young leaves samples of five little millet, five proso millet, five Kodo millet, five barnyard millet, and five finger millet genotypes (Table S1). The polymerase chain reaction was carried out using the following PCR components: contained 1μL template DNA (50 ng), 1.0 μL forward primers, 1.0 μL reverse primers 2 μL 10 × Taq buffer + MgCl2 (15 mM Polymorphism estimation and cluster analysis. The 24 polymorphic eSSR products were scored manually as '1' (presence) and '0' (absence). The PIC was calculated using the score from the polymorphic loci and the formula PIC = ∑1 − gpi (where pi is the frequency of ith allele for each locus) 37

Results and discussion
De novo assembly and characterization of unigene. In this study 11.4, 7.6, and 10.5 million reads with average read length 158 bp, 165 bp, and 150 bp were generated in leaf transcriptome at the vegetative stage, flowering, and maturity stage, respectively. Reads with Q25 bases (i.e., reads with a base quality ≥ 25) were selected as high-quality reads, through additional analysis (Table 1). Finally, using Trinity, a total of 29,629,186 (81.36 P) reads were assembled into 47,358 unigenes. All unigenes were classified based on the length of the sequences, and the average length of the unigenes was estimated to be 521.72 bp. The majority of the reads (91.7%) were in the 200-1000 bp range. There were 25,743 unigenes less than 1000 bp in length and 5,038 with more than 1000 bp (16.36P) (Fig. S1). The completeness of transcripts was determined by comparing them to universal single-copy orthologs (BUSCO). The percentage of detected BUSCOs is represented on the x-axis (Fig. 2). The powder blue diamond illustrates the complete (C) and single-copy (S) genes; the sunglow stands in for complete and duplicated (D) genes; the tacao diamond represents fragmented (F) genes; the coral diamond act for the missing (M) genes. In the vegetative, flowering, and maturity stages, the total number of core genes queried were 45 www.nature.com/scientificreports/ flowering stage, and maturity stage (Fig. 2). All of these findings indicated that our database was complete and ready for further investigation.
EST-SSR discovery, frequencies, and distribution. MISA software was used to identify potential microsatellites from all 47,358 generated unigenes. A total of 4443 EST-SSRs were discovered (Table S2), and 3593 primers were created (Table S3). The distribution of SSRs in the little millet genome was found to be one SSRs per 10 Kb, comparable to the distribution of SSRs in other cereal genomes 42 . Among the 4,443 EST-SSRs identified in little millet, the tri-nucleotide motif (66.67%) was the most common, followed by di-(27.39%), tetra-(3.83%), Penta-(1.37%), and hexanucleotide motifs (0.71%). This trend demonstrated that the frequency of repeats decreased as motif length increased (See Fig. 3a). The number of microsatellite motif tandem repeats varied from 4 to 98. The most popular microsatellite had eight tandem repeats (1,245, 28.02%), followed by six tandem repeats (890, 20.03%), four tandem repeats (537, 12.08%), and five tandem repeats (465, 10.46%). Microsatellite motifs with more than 24 tandem repeats accounted for just 1.9 percent of the total (Fig. 3b). AG/CT was found to be the amplest type of di-nucleotide repeat, and it was observed to be 59.70% of the repeats, followed by AC/GT (24.50%) and AT/AT (11.10%). Among the tri-nucleotide repeats, the most frequent motifs were AGC/ GCT (27.84%) and CCG/CGG (17.65%). The most common tetra-nucleotide repeat motif was AGAT/ATCT (10.52%), whereas the most prevalent Penta nucleotide repeat motif was AGAGG/CCTCT (21.31%). AGG ATG /ATC CTC and ACA CAG /CTG TGT (15.62%) had the most hexanucleotide motifs (Fig. 3c). Shorter motif forms (mono-, di-, and tri-nucleotide microsatellites) were found to be more prevalent in the transcriptome (Fig. 4a). Furthermore, there were substantial variations in microsatellite length between three regions (CDS, 5′-UTR, and 3′-UTR), the average length of microsatellites in CDS regions (18.24 bp) was slightly greater than that of UTRs (17.45 bp). (Fig. S2). In these regions, the 5′UTRs had the highest GC content (57.86%) followed by CDSs (52.94%) and 3′UTRs (44.23%). The AT-and GC-content of mono-to hexanucleotide SSRs of these expressed areas were computed, and the findings are shown in Fig. 4b, and Table S4.

Functional annotation of microsatellites-containing unigenes.
To functionally annotate the assembled non-redundant unigenes, a sequence similarity search against the 'nr' database and the Swiss-Prot database was performed using the BLASTx algorithm with cut-off E values of (1e−5) and 1e−10, respectively. According to our findings, 33,160 (69.4%) and 20,686 (43.29%) unigenes were homologous with sequences in the 'nr' and Swiss-Prot databases, respectively. Our result revealed that when subjected to BLAST against 'nr' database, the majority of the unigenes i.e., more than 78.25 percent of unigenes with lengths greater than 500 bp, matched, whereas 21.75 percent with lengths less than 300 bp matched (Fig. S3).
GO and KEGG enrichment analysis of microsatellites-containing unigenes. The Blast2GO was used to classify the predicted functions of the assembled unigenes. Figure 5a depicts an overview of the unigene classification in each GO slim expression. GO analysis revealed that 22,717 unigenes (47.96 percent) could be classified into 60 Go terms. Among these unigenes, 8199 (36 P) were classified as Biological Processes, with the most enriched terms being organic substance metabolic process (1,543), cellular metabolic process (1457), and primary metabolic process (1446). The second annotated category was Molecular Function, with organic cyclic compound binding (1331), heterocyclic compound binding (1331), and ion binding (1235) being the most represented GO groups. In the cellular component organization, intracellular anatomical structure (1053), membrane (1009), and organelle (888) were the most represented terms (Fig. 5a and Table S5).
To evaluate the comprehensiveness of our transcriptome library, and thus the efficacy of the annotation protocol, BLASTx was used to align the Unigenes with KEGG, the E-value was kept below 1e−5, and the associated pathways were identified. While 30,400 (60.46%) of the 47,358 unigenes were annotated, only 5,644 (18.56%) were assigned to KEGG pathways. Thiamine and Purine metabolism were the most enriched pathways of the five major KEGG pathways, which included Thiamine metabolism, Purine metabolism (399), Starch and sucrose metabolism (387), Riboflavin metabolism (191), Folate biosynthesis (172), and Pyrimidine metabolism (159). In addition to starch and thiamine metabolism, a plethora of unigenes were involved in the synthesis of other essential vitamins and amino acids, photosynthesis, and secondary metabolite biosynthesis ( Fig. 5b and Table S6).
SSR marker polymorphism assay. The 50 genic SSR primer pairs were randomly chosen from the total number of designed EST-SSR. In total, 39 were amplified and produced expected band size in all five tested minor millets' species, 24 microsatellite loci showed allelic polymorphism. They were used to analyze the genetic diversity of 25 minor millet genotypes. The observed heterozygosity (Ho) ranged from 0.12 to 0.91, with an average of Ho = 0.49 ( www.nature.com/scientificreports/  Table 3). The SSR polymorphism data were further used to perform genetic correlation analysis. The genetic similarity matrix was used to generate a dendrogram and accessions. The SSR data of 25 millet genotypes were subjected to similarity index and cluster analysis using Jaccard's coefficient, and UPGMA was performed using the NTSYSpc-2.02i package (Table S7). The dendrogram created using Jaccard's similarity coefficient, and the UPGMA method revealed the highest (80.5 percent) similarity between BAR-1406 and BAR-1407 and the lowest (18.3 percent) similarity between LIT-171 and FIN-1831. The tree plot was created with three main clusters, with an average similarity of 59% (Fig. 6). Cluster I consisted of 15 genotypes of all little millet and Kodo millet genotypes, and it consisted of three sub-clusters, the first sub-cluster of which has two branches, first one includes all little millet genotypes, the second sub-cluster is branched into two, the first one includes two genotypes of barnyard millet, and the second branch includes the remaining genotype, third sub-cluster include Proso Millet genotypes in two branches. The second cluster had two sub-cluster D and E with near about 41% likeness and included Kodo millet genotypes divided between four branches. The last cluster was further subdivided into three sub-cluster consisted of Finger Millet genotypes; among different genotypes, FIN-18301 and FIN-18322 are closely related with 78% similarity, while FIN-1828 has the greatest genetic distance from other genotypes.
The genes and enzymes involved in the biosynthesis of thiamin. Thiamine (vitamin B1) plays a crucial molecular role for all living organisms. Thiamine diphosphate (ThDP), also known as vitamin B1, serves as an enzymatic cofactor in glucose metabolism, the Krebs cycle, and the biosynthesis branched-chain amino acids in all living organisms. Humans, unlike plants and microorganisms, cannot synthesize ThDP from scratch and must obtain it from their diet. It is shown that severe thiamine deficiency would likely cause the lethal disease beriberi 43 . Therefore, thiamine is a vital micronutrient for humans. Plants are the primary dietary source of thiamine. Little millet is rich in thiamine (0.2-0.48 mg/100 g) 44 , riboflavin (0.12 mg/100 g) 45 , potassium, zinc, magnesium, and manganese minerals, which are essential for healthy human diet 46,47 . According to KEGG analysis results the most enriched pathway was thiamine metabolism; hence, all unigenes were studied and 12 unigenes involved in thiamine metabolism identified (Table 4). Among them, the expression level of TDPK1, TH1, THI1, THIC, TPK1, TPK2, PALE1 and TPK3 were higher in flowering and vegetative than the maturity phase, while transcript level of TDPC2, TENAC, THI42 and TPS1 was higher in maturity in comparison with other samples (Table 4).

Discussion
Transcriptome sequencing, assembly, and genic SSR characterization. Minor millets, in particular, little millet, is a highly underappreciated species in terms of popularity and research interest, despite being a traditional food and fodder crop with a high nutrient profile. Millet molecular breeding is hampered by a lack of literature on molecular breeding and the inaccessibility of co-dominant markers. The identification and use of eSSR markers to screen accessions, varieties, germplasm, or cultivars will speed up the breeding process 48 . Transcriptome sequencing is a sophisticated and efficient method to detect new genes, identifying expression patterns, and developing molecular markers 49 . In this research, the Ion torrent S5 technology platform was utilized, and a total of 30 million reads (about 15 GB) were generated from three different samples. For sequence annotation, BLASTx against the NCBI database's 'nr' protein revealed that 57.30% (27,340) of the 47,358 unigenes in our dataset had at least one significant homologous gene in other species, and approximately 42.7 percent of unigenes could not be functionally annotated either because they performed similar properties as protein with an uncharacterized function or there were no BLAST matches. In most cases, the length of the query sequence influences the ability to discover essential similarities between the sequences. Several previous www.nature.com/scientificreports/ studies found that the likelihood of BLAST matches in protein databases could be raised while increasing the unigenes length 50 . This was also perceived in the current research, where 80 percent of the unigenes with lengths greater than 500 bp had homologs, whereas only 20 percent of the unigenes with lengths less than 300 bp were detected be similar to other homologs. Furthermore, there is limited data on the transcriptome, genome, and genes of little millet; in consequence, numerous little millet genes are not publicly available. The scarcity of SSR markers limits basic and applied genomics research in little millet. Transcriptome sequencing generates a massive amount of sequence data, which can be used to create a large number of SSRs. Markers derived from transcriptomic sequences are far more helpful than markers derived from genomic sequence data for gene-based interpretation and detecting functional variation 51 . A total of, 9764 potential EST-SSR markers www.nature.com/scientificreports/ were identified from 47,358 unigene sequences. The genic microsatellites frequency was 11.25 percent, and the distribution density was one SSR per 10 kb, which was remarkably higher than sesame (8.9%), barley (2.8%), Asian lotus (8%), maize (7%), but lower than cotton (12%) 23,[52][53][54][55] . Some variability in the data mining method, search parameters, and scale of the unigene assembly dataset maybe addressed by variations in SSR abundance 56 .
Here we discovered six distinct repeat motifs. Tri-nucleotide repeats (66.67percent) were the most prevalent type, followed by di-nucleotide repeats (27.39 percent) and tetra-nucleotide (3.83 percent). The AGC/GCT motif (825, 30.99 percent) was the most common of the tri-nucleotide repeats, followed by the CCG/CGG motif (523, 17.52 percent). AG/CT (726, 20.44 percent) was the most dominant motif among the di-nucleotide repeats, which was similar with study in Triticum aestivum L. 57 , and Raphanus sativus L. 58 . In comparison, the CG/CG motif had the lowest frequency.
Genetic relationship, polymorphism, and transferability of EST-SSR markers. The majority of reports on genetic diversity analyses in little millet germplasm have used RAPD 59 , SNP and genomic SSR markers 60 . Tiwari, et al. 59 using 60 RAPD investigated the diversity of 36 little millet (Panicum sumatrense) genotypes, although these researchers observed high polymorphisms among the markers used, but as RAPD technique is notoriously laboratory dependent their results are hardly reproducible in other laboratory or in other   62 also applied salinity response transcriptome of P. sumatrense to develop 37,100 genic SSR markers, that might be associated to salinity and drought stress tolerance. Genic SSRs are advantageous and frequently preferred for coding regions of the genome. Furthermore, the degree of transferability of this type in related species can provide a high level of acceptability 63 . In this research using transcriptome sequencing of samples of three developmental stages, species transferable genic SSR marker sets were mined. To verify generated SSR markers, 50  www.nature.com/scientificreports/ primer pairs were randomly selected and validated, 39 of which were positively amplified in 25 millet genotypes (Table S1). The lack of amplicons from other primer pairs is most likely due to primer positioning across large introns, splice sites, or poor-quality sequences 64 . Because of more significant sequence conservation in the transcribed regions, the rate of polymorphism of genic SSRs is typically lower than that of genomic SSRs 65 . However; this was not the case in this research, because the genic SSRs discovered were highly polymorphic. The average number of alleles recorded in this research using genic SSR markers was 8.68 per locus, with values ranging from 3 to 16 alleles. Among the 39 primer pairs tested in our study, 24 (61 percent) exhibited polymorphism. The determined value was greater than the polymorphism recorded by Senthilvel et al. 66 in the examined pearl millet varieties. Polymorphism variation is related to the geographic origins of specimens, the number of samples, the genome sequence of the samples, and the primers used 67 . In addition, the average PIC value was higher (0.57) as the result of the greater depth of sequencing coverage achieved in this research. One of the critical factors influencing polymorphism is the length of the microsatellite. SSR length may be classified as short (12 bp), medium (12-20 bp), or high (> 20 bp) 68 . In this study, medium SSR lengths (2,055, 46.24 percent) had the greatest proportion. Thus, the microsatellites created from little millet transcriptome probably show a moderate degree of polymorphism. Six microsatellite motifs show high variations in length (Kruskal-Wallis rank-sum test, P 1.8 e−16), and the frequency of each decreased with increasing motif size (Nemenyi test, P 2.2e−16). Using five populations of minor millets, we confirmed the polymorphism of the 48 microsatellite markers. A total of 217 alleles were found. At the species level, Barnyard millet and Finger millet showed the highest and the lowest genetic diversity, respectively. Cluster analysis divided genotypes into three main clusters, with an average similarity of 59 percent.
Cluster-I contains two species, little millet, and Kodo millet, with 62 percent likeness. Cluster-II has Kodo millet genotypes and consists of four branches with a 58 percent likeness. The third cluster, which is divided into three sub-clusters, contains Finger Millet genotypes. Genetic variation in species is affected by internal and external (historical or evolution causes) factors, little millet-as an ancient plant, has collected a large number of genetic variants during the long-term evolution, and it is supposed to retain a high degree of genetic diversity. In this study at the population level, all five minor millet species showed low genetic diversity, which could be a result of a small number of members in each population and distribution properties. Kodo millet showed the largest genetic diversity among all studied spices. While Barnyard millet and Proso Millet revealed limited genetic diversity, which could be the reason of gene flow between the two species during evolution. Generally, Microsatellite markers derived from transcriptome data/ESTs have a higher degree of transferability in related species 69 .
In the current study, 39 microsatellite markers were successfully amplified in minor millets, with 24 of them being polymorphic in all minor millets' genotypes. The transferability ratio was 78%, which was higher than the transferability ratios recorded in Pistacia 26 but less than Moon Seed 70 and Arrowhead 71 . Genic SSRs generated in this research have strong potential for cross-species amplification and could be utilized in future genetic studies of other cereals species.
Potential function of unigenes containing microsatellites. Microsatellites derived from transcribed sequences can be closely associated with gene function and thereby may have important influences on gene products, causing phenotypic changes, and controlling gene expression. To expose the possible roles of these unigenes, functional annotations and classification of microsatellites-containing unigene sequences are performed. According to GO functional annotations, a significant amount of unigenes, including microsatellites were assigned to terms such as "regulation of cellular process, cellular metabolic process, biosynthetic process, cell communication catalytic action" and response to stress, implying that they could be associated with little millet's basic metabolism, growth and developmental activities. Likewise, KEGG analyses revealed that SSRs, including unigenes, served various biological roles and were involved in different vital features of little millets. Notably, thiamine metabolism and purine metabolism were the most enriched pathway in KEGG analysis. Looking deeper, we identified twelve genes involved in the thiamine biosynthetic pathway. Compared to the maturity stage, expression levels of genes involved in the vitamin B1 pathway were higher during the vegetative and flowering phases. These results are in agreement with the report of Colinas and Fitzpatrick, 2015, where the expression of genes involved in thiamine biosynthesis was higher in photosynthesizing tissues 72 . Most of the genes involved in thiamine biosynthesis are not fully known; hence to improve human health and nutrition, it is vital to identify the genes and enzymes involved in thiamine biosynthesis 73 .

Conclusions
Using transcriptome data, a total of 4443 new SSR markers were developed, and the frequency, distribution, and function of these genic microsatellites were characterized. The unigenes containing microsatellites performed a spread spectrum of biological roles, the majority of which were related to thiamine metabolism, purine metabolism, other metabolic process, and signaling pathways. These findings shed light on the function of microsatellites in the transcriptome. Furthermore, the polymorphism and transferability of 24 eSSR markers were examined in 25 minor millet genotypes. The results of this research can present an excellent resource for germplasm identification, genetic relationship studies, linkage maps, MAS reproduction, and diversity analysis in little millet and other related species in future genetic and genomic studies, as well as play a substantial role to update the millet genic SSR markers database.

Data availability
Transcriptome data generated in this study were submitted to NCBI with the SRA ID SRR14509267, SRX10855102, SRR14509268, SRX1085510, SRR14509269, SRX10855100, and SAMN19114419.

Funding
This research was not supported by any agency.