Genome-wide identification and characterization of functionally relevant microsatellite markers from transcription factor genes of Tea (Camellia sinensis (L.) O. Kuntze)

Tea, one of the most popular beverages, requires a large set of molecular markers for genetic improvement of quality, yield and stress tolerance. Functionally relevant microsatellite or simple sequence repeat (SSR) markers identified from regulatory transcription factor (TF) genes can be potential targets to expedite molecular breeding efforts. In the current study, 2776 transcripts encoding TFs and harbouring 3687 SSR loci, yielding 1843 flanking markers, were identified from a trait-specific transcriptome resource of 20 popular tea cultivars. Of these, 689 functionally relevant SSR markers were successfully validated and assigned to the 15 chromosomes (Chr) of the CSS genome. Interestingly, 589 polymorphic markers, including a core set of 403 TF-SSR markers, amplified 2864 alleles in key TF families (bHLH, WRKY, MYB-related, C2H2, ERF, C3H, NAC, FAR1, MYB and G2-like). Their significant network interactions with key genes corresponding to aroma, quality and stress tolerance suggest potential implications in trait dissection. Furthermore, single amino acid repeat reiteration in the CDS revealed the presence of favoured and hydrophobic amino acids. Successful deployment of the markers for genetic diversity characterization of 135 popular tea cultivars and their segregation in a bi-parental population suggest wider utility in high-throughput genotyping studies in tea.


Scientific Reports | (2022) 12:201 | https://doi.org/10.1038/s41598-021-03848-x www.nature.com/scientificreports/

in rapid identification of key QTLs, and expediting breeding of superior tea cultivars. Furthermore, because commercial tea plantations rely on cultural practices such as monoculture/clonal cultivation, a novel core set of SSR markers will have wide implications for developing unique DNA fingerprints to test varietal/cultivar purity and to authenticate potential tea cultivars or clones and various teas in the global market 13. Owing to multiple attributes such as multi-allelic nature, co-dominant inheritance, hyper-variability, chromosome-specific location, ubiquitous occurrence, high polymorphic information content (PIC) and reproducibility, the novel TF-derived SSR markers identified in this study can potentially be utilized for genetic improvement of tea 14,15. Dissection of the mechanisms underlying desirable complex traits is challenging because of highly regulated structural gene networks 16. Being 'master regulators' of various cellular processes, TF genes are an excellent target for identification of functionally relevant SSR markers with greater implications in molecular dissection of complex traits in tea. Earlier studies have reported the roles of Teosinte branched1 (Tb1), a member of the TCP TF family, in the domestication of maize and of the qSH1 TF, responsible for reduced grain shattering, in the domestication of rice 17,18. Furthermore, TF genes with well-characterised functional domains harbouring polymorphic SSR markers (expansion/contraction) that possibly affect gene function can assist in rapid identification of key QTLs in tea. Interestingly, cost-effective next-generation global transcriptome sequencing offers a greater opportunity for rapid elucidation of the regulatory networks underlying diverse agronomic traits and for creation of genome-wide functionally relevant marker resources 19.
In the present study, successful efforts were made for the first time to identify transcription factor (TF)-derived SSR markers in tea. A functionally relevant marker resource comprising 1843 novel TF-SSR markers with genome-wide representation across all 15 chromosomes 28 was developed using trait-specific (yield, quality and biotic/abiotic stress) in-house transcriptome data of 20 popular tea cultivars. Furthermore, protein-protein interaction, gene ontology and localization (CDS & UTRs) analyses identified the functional relevance of the novel TF-SSR markers in trait dissection. Given the constraints of existing SSR marker resources, with only a limited number of experimentally validated SSR markers available (~2000) 14,15,20-24, the identification of 589 novel polymorphic markers, including a core set of 403 TF-SSRs, in the current study will be an excellent asset for various genotyping studies in tea 13,25-27. Successful extrapolation of the informative core set of markers to genetic diversity assessment of 135 popular tea cultivars, together with the expected segregation patterns in a bi-parental mapping population, suggests the wider utility of the novel TF-SSR marker resource in marker-trait association, genetic diversity and phylogenetic studies in tea 20.

Results
Frequency and distribution of SSRs. De Fig. 1a). The shorter repeat motifs were more abundant, with an overall base composition bias towards As and Ts in the TF genes. Further, localisation identified the presence of SSR repeats in CDS (50%), 5'UTR

Table 1. Statistics of overall de novo assembled transcripts derived from transcriptome sequencing of twenty tea cultivars.

Identification, distribution of SSRs in TF genes. TFs control the physiological and regulatory networks of various functional genes to maintain the normal growth and response against various biotic and abiotic

Table 3. Overall abundance of SSR repeat motifs in transcripts encoding transcription factor genes of tea.

Suitability of TTFMS markers for genetic mapping. SSR markers have been the preferred markers for genome mapping studies and dissection of complex traits. Testing of the 589 polymorphic TF-SSR markers on two parental lines (P1 and P2) and their bi-parental F1 population (10 individuals) revealed five segregation patterns in 265 TF markers. Of these, 185 markers representing the hk × hk (77), lm × ll (49), nn × np (47), ab × cd (9) and ef × eg (3) segregation patterns can be utilised in future for construction of a genetic map and for establishing marker-trait associations of targeted traits in tea (Supplementary file 4).
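The five segregation codes follow the convention commonly used for full-sib families of cross-pollinated species, in which each code describes the parental genotype configuration at a locus. The sketch below illustrates one way such a classification could be made from the two parental genotypes. It is a minimal illustration under stated assumptions, not the procedure used in the study: the function name, the allele sizes in the usage note and the rule that any heterozygous × homozygous locus is coded lm × ll (or nn × np) are assumptions made here.

```python
# Counts per segregation type as reported in the text.
REPORTED_COUNTS = {
    "hk x hk": 77,
    "lm x ll": 49,
    "nn x np": 47,
    "ab x cd": 9,
    "ef x eg": 3,
}


def classify_segregation(p1, p2):
    """Return a segregation code for two diploid parental genotypes.

    p1, p2: 2-tuples of allele labels (for example fragment sizes in bp).
    Returns None when both parents are homozygous, because such a locus
    does not segregate in the F1.
    """
    alleles1, alleles2 = set(p1), set(p2)
    het1, het2 = len(alleles1) == 2, len(alleles2) == 2
    if het1 and het2:
        # Both parents heterozygous: the code depends on how many
        # distinct alleles the two parents carry between them.
        n_alleles = len(alleles1 | alleles2)
        return {2: "hk x hk", 3: "ef x eg", 4: "ab x cd"}[n_alleles]
    if het1:
        return "lm x ll"  # only the first parent is heterozygous
    if het2:
        return "nn x np"  # only the second parent is heterozygous
    return None


total_segregating = sum(REPORTED_COUNTS.values())
```

For example, `classify_segregation((150, 156), (150, 150))` gives `"lm x ll"`, and the reported per-type counts add up to the 185 markers quoted above.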

Potential of codon reiteration in TFs of Tea.
SSR repeats in coding regions contribute to repetitive patterns in protein sequences, as tandem tri- and hexa-nucleotide SSR repeats lead to single amino acid repeats (SAARs). SAAR formation, or codon reiteration, is a unique mechanism that increases the size of a protein through reiteration of some codons more than others. The current data indicate a high abundance of tri-nucleotide repeats in the CDS region, potentially coding for serine (19%) followed by glycine (11%), leucine (11%), aspartic acid (10%), threonine (10%), glutamine (10%), glutamate (10%), proline (10%) and histidine (9%), with tyrosine (1%) the least abundant in the TF genes (Table S1). Of these, serine and glycine are considered to be the most favoured amino acids (AAs) of a polypeptide. Nevertheless, hydrophobic proline and leucine were also found to be abundant in transcripts encoding TF genes in tea 36. Furthermore, the reiteration frequency of AAs in the coding region of TF genes varied from five to ten, and reiterations of eleven to twenty were also identified (Fig. S3a & b). Leucine, one of the most abundant AAs, was also the most frequently reiterated in TFs of tea. In total, 42 transcripts were identified in which a second reiterant was found near a reiterant in the coding region. However, a few tri-nucleotide repeats encoding histidine and glycine were interrupted, possibly due to mutations accumulated over time in tea 37 (Fig. S4a-d).
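How an in-frame tri-nucleotide repeat becomes a single amino acid repeat can be illustrated with a short sketch. The code below translates a coding sequence with the standard genetic code and reports uninterrupted runs of one amino acid. It is a minimal illustration, not the pipeline used in the study: the function name, the toy sequence in the usage note and the assumption that the sequence starts in frame are introduced here, and the default threshold of five repeats mirrors the lower bound mentioned in the text.

```python
from itertools import groupby, product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def find_saars(cds, min_repeats=5):
    """Return (amino_acid, protein_position, run_length) for each
    uninterrupted single amino acid repeat in an in-frame CDS."""
    cds = cds.upper()
    # Unknown or ambiguous codons are translated as 'X'.
    protein = "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds) - 2, 3)
    )
    runs, position = [], 0
    for aa, group in groupby(protein):
        length = len(list(group))
        if length >= min_repeats and aa not in "*X":
            runs.append((aa, position, length))
        position += length
    return runs
```

As a usage example, `find_saars("ATG" + "TCT" * 6 + "GGA" + "CAC" * 5 + "TAA")` reports a serine run of six residues starting at protein position 1 and a histidine run of five residues starting at position 8; each uninterrupted run counts as one event.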

Discussion
Simple sequence repeats (SSRs), the third major category of variation after CNVs and SNPs, have important functions in controlling long-range interactions and genome packaging 3,38-41. Being biased towards expansion rather than contraction, SSRs are considered an important tuning knob of evolution and of genome-level structural and functional variability. Tea, a widely consumed non-alcoholic beverage produced in more than 60 countries, has recorded an increasing trend in production and consumption, ensuring higher returns to farmers 7,8. Given their multiple desirable attributes and the key regulatory role of transcription factors, SSR markers derived from TFs can potentially be utilized for trait dissection and implementation of marker-assisted breeding in tea 5. Furthermore, the functionally relevant core set of 403 TF-SSR markers identified in this study can assist in high-throughput genotyping, authentication of various teas and large-scale fingerprinting studies. Therefore, the functionally relevant, experimentally validated, polymorphic TF-derived SSR marker resource developed in the present study will expedite fingerprinting, genome mapping, linkage and diversity analysis efforts to assist genetic improvement in tea 10,15,20-24,42.

Identification, distribution of SSRs in TF genes. The higher abundance of SSRs in transcripts encoding TFs (2,776 TF genes harbouring 3,687 SSR motifs) than in other plant species may be associated with the SSR search criteria and genomic attributes of the targeted species 34. The frequency of di-nucleotide repeats in TF genes is consistent with earlier reports of EST-SSR marker studies in tea and other dicotyledon crops like Actinidia eriantha, Luffa aegyptiaca, Paeonia, Amorphophallus, Colocasia esculenta, Rosa roxburghii and Hevea brasiliensis 8,13,43-50. Furthermore, the most frequent AG/CT repeats correspond to the codons GAG, AGA, CUC and UCU, which encode glutamate, arginine, leucine and serine, amino acids that are well represented in protein sequences of tea and other plant species 10,51,52. In contrast, the scarcity of GC repeat motifs in the data may be associated with a lower probability of CpG islands, avoiding methylation-mediated transcriptional interruptions 8. Likewise, the high abundance of tri-nucleotide motifs such as AAG/CTT in TF genes agrees with reports that they are predominant in dicotyledons 50. The localisation of tri-nucleotide repeats in the CDS region may be attributed to the fact that repeat length variation in multiples of three does not alter the reading frame of the protein. Likewise, length variation of di-nucleotide repeats in untranslated regions (5'UTR) does not affect the reading frame; hence, these repeats are tolerated more in untranslated regions than in the CDS 10,51. Moreover, variation in di-nucleotide repeats (GA/TC) present in the 5'UTR has been correlated with important agronomic traits like amylose content in rice 53. Interestingly, polymorphic di-nucleotide repeat TF-SSR markers belonging to functionally relevant TF families like B3, NAC, bHLH, C3H and Myb-related, and localized to CDS regions, were also assigned to various chromosomes of the CSS tea genome. Furthermore, the functionally relevant core set of markers localized to UTRs and protein-coding regions can be potential markers for genetic analysis and for establishing marker-trait associations in tea 49.
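Because SSR abundance depends on the search criteria, it may help to see how such criteria enter a motif search. The following is a minimal sketch of a regular-expression search for perfect di- to hexa-nucleotide repeats. The minimum repeat numbers in `MIN_REPEATS` are hypothetical placeholders, since the thresholds applied in the study are not stated in this section, and the function name is likewise an assumption; dedicated SSR mining tools apply additional rules (for example for compound repeats) that are not reproduced here.

```python
import re

# Hypothetical minimum repeat numbers per motif length (in bases).
# These are illustrative values only, not the criteria of the study.
MIN_REPEATS = {2: 6, 3: 5, 4: 4, 5: 4, 6: 4}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return (motif, start, repeat_number) for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for k, n in sorted(min_repeats.items()):
        # A k-base motif followed by at least n - 1 further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit
            # (for example AAA or AGAG), which belong to a smaller class.
            if any(
                motif == motif[:d] * (k // d)
                for d in range(1, k)
                if k % d == 0
            ):
                continue
            hits.append((motif, match.start(), len(match.group(0)) // k))
    return hits
```

With these placeholder thresholds, a sequence containing eight AG units and five AAG units yields one di-nucleotide and one tri-nucleotide hit; raising or lowering the thresholds changes the number of loci reported, which is the effect of search criteria referred to above.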

GO classification and functional relevance of TF-SSRs. The GO enrichment analysis of TF genes harbouring SSRs showed high representation of GO terms like response to stimulus, response to abiotic stress, response to metabolic processes, cellular macromolecule biosynthesis and transcription, transcription regulator activity and transcription factor activity, which suggests the potential utility of the current marker resource for identifying trait-specific variations in tea. Furthermore, the highly polymorphic core set of TF-SSR markers identified in differentially expressed TF genes can be an important resource for eQTL analysis. Moreover, polymorphic SSR markers derived from the bHLH (53), WRKY (37), C3H (32), Myb-related (31) and NAC (30) TF families, reportedly involved in regulation of multiple enzymatic steps of quality-related traits (flavonoid biosynthesis), are of utmost importance in targeted trait dissection 54. Likewise, WRKY (the second most represented family) and NAC TFs harbouring highly polymorphic SSR markers have been reported to have key functional roles in regulation of biotic 56-58 and ABA-mediated abiotic stress tolerance (cold and drought stresses) in tea 55,58,59. Similarly, C3H and Myb-related TF families regulate the dormancy status of vegetative buds 60 and accumulation of anthocyanin pigment in tea, respectively 9.
Localization of TF-SSR markers. Polymorphism, i.e. expansion/contraction of SSR loci in the CDS and untranslated regions (UTRs) of potential genes, may lead to key variations influencing gain or loss of targeted traits 52,61. Therefore, the 589 polymorphic TF-SSR markers identified in this study are potential functionally relevant markers for trait dissection 62. Furthermore, SSR polymorphism recorded in the UTRs of TFs may influence transcription/translation (5'UTR) and gene silencing (3'UTR) 63. Likewise, variations in the CDS region might result in truncated protein formation 64,65. An abundance of TFs harbouring short motifs in the transcribed region was also reported in many earlier studies 66-68. The scarcity of longer microsatellites in TF genes might be due to downward mutation bias and low persistence time 69. Moreover, contraction mutations become more frequent with increasing allele size, so longer alleles tend to become shorter, which prevents their infinite growth 65,70. Therefore, the pattern of SSRs in TF genes suggests that the tea genome may be under rapid evolution 71.

PPI network and functional significance. Protein-protein interaction is one of the important steps mediating the action of expressed proteins to precisely regulate signal transduction processes and homeostasis 5. TFs, being key molecular players controlling the expression of genes involved in various growth and development processes, undergo complex interactions with other proteins. Furthermore, variation in these proteins will have a profound impact on other interacting proteins. In the current study, the direct significant interactions identified between the TF genes of tea harbouring polymorphic markers and the volatile fatty acid biosynthesis, drought-responsive, plant-pathogen interaction and MAPK signalling pathways indicate their putative functional consequences. Therefore, understanding the interactions of TFs harbouring polymorphic SSR markers will assist in rapid prediction of their functional relevance in biological functions, and also has implications for QTL analysis and marker-assisted selection in tea 20.
Polymorphic potential, core marker selection, fingerprinting and genetic diversity analysis. Experimental validation of 862 functionally relevant markers identified 589 highly polymorphic and stable markers, including a core set of 403 TF-SSR markers, which can be utilised to study the impact of repeat expansion/contraction in targeted trait dissection (Table S4). Nevertheless, unsuccessful amplification at 20.2% of TTFMS marker loci might be due to insertions or deletions at the primer binding sites of the corresponding genomic sequences. Variations detected in UTRs and CDS regions may be correlated with regulation of gene function influencing quantitative and qualitative phenotypic variations in tea 27. The 18 polymorphic functional-domain-associated TF-SSR markers may have utility for mapping of specific regulatory genes along with direct allele selection 43,44 and for assessing its impact on comparative gene expression 46. The high polymorphic rate of the novel TTFMS markers (589; 79.8%), including the core set of 403 markers, suggests wider utility in genetic analysis in tea 28. Additionally, the comparable mean gene diversity (He: 0.48) and polymorphic information content (PIC: 0.60) also suggest the importance of the novel markers in various genotyping studies in tea 10,72-74, similar to earlier studies in various crop plants like rice 4, chickpea 31,32 and sugarcane 5. A subset of 15 informative polymorphic core-set TTFMS markers distinguishing 135 popular tea cultivars can be utilised in future as an informative set of markers for larger-scale fingerprinting studies 75. Successful DNA fingerprinting greatly depends on various marker attributes, including polymorphic potential, reproducibility and discrimination power. The high polymorphic potential (5.89 alleles per locus) detected with the core set of TF-SSR markers was comparable to other studies 76,77. Interestingly, the average PIC recorded with the core set of markers was higher than in earlier reports in tea 76-78. Moreover, clustering of tea cultivars based on phenotypic attributes (leaf characteristics) and biochemical parameters (ECG, EGCG, EC, catechin and caffeine) suggests implications for selection of potential parental groups for breeding of high-yielding quality tea cultivars 10,15,42,79. Additionally, the 185 TF-SSR markers with expected segregation patterns in the tested bi-parental population can be directly utilized for genetic map construction and QTL analysis in tea 13.
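The diversity statistics quoted above (Ho, He and PIC) are all functions of the genotypes and allele frequencies at a locus. The sketch below shows how they could be computed for a single co-dominant marker. It assumes the standard definitions (He = 1 − Σp_i², and PIC = 1 − Σp_i² − Σ_{i<j} 2p_i²p_j² as defined by Botstein et al.); the function names and the example genotypes are hypothetical, and the code does not reproduce any particular software implementation.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Allele frequencies from a list of diploid genotypes (2-tuples)."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return [n / total for n in counts.values()]


def observed_heterozygosity(genotypes):
    """Ho: fraction of individuals carrying two different alleles."""
    return sum(a != b for a, b in genotypes) / len(genotypes)


def expected_heterozygosity(freqs):
    """He (gene diversity): 1 minus the sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def pic(freqs):
    """Polymorphism information content from allele frequencies."""
    squares = [p * p for p in freqs]
    cross = sum(
        2 * squares[i] * squares[j]
        for i in range(len(squares))
        for j in range(i + 1, len(squares))
    )
    return 1.0 - sum(squares) - cross


# Invented genotypes (allele sizes in bp) for four individuals.
genotypes = [(150, 150), (150, 156), (156, 160), (150, 156)]
freqs = allele_frequencies(genotypes)
```

In this invented example the three alleles have frequencies 0.5, 0.375 and 0.125, giving Ho = 0.75 and He = 0.59375. Per-locus values of this kind are then averaged over markers to obtain summary figures.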

Expansion of codon repeats and their functional significance. Slippage-mediated expansion and contraction of tri-nucleotide repeats, which do not disturb the protein reading frame, are tolerated more in coding regions. In the current study, the higher abundance of tri-nucleotide repeats in the CDS region might be due to mutation pressure or to positive selection for specific amino acid repeats in the polypeptides encoded by TF genes of tea 80. Expansion of codon repeats encoding the hydrophilic AA serine (≥ 14 repeats) indicates more tolerance than for hydrophobic AAs in coding regions, owing to strong selection pressure eliminating basic and hydrophobic AA repeats 81. Further, two acidic (aspartic and glutamic acid), two neutral (serine and threonine) and one basic (histidine) amino acid repeats were found to be more reiterated due to tri-nucleotide repeat motifs, which supports the abundance of polar and acidic AAs in TF gene families of tea 82. Leucine, among the most abundant and frequently reiterated AAs in TF genes of tea, suggests SSR-dependent AA (leucine) reiteration, which is predominantly reported in higher plant species 83. Reiteration of single amino acids through tandem tri-/hexa-nucleotide repeats in various TF genes involved in dormancy (B3, C2H2 and MYB), secondary metabolite biosynthesis (bHLH and MYB), abiotic stress response (ERF, NAC, GRAS, HSF, Tri-helix, WRKY) and bud and leaf pigmentation (TCP) suggests positive selection pressure for accumulation of these repeats, which might have a functional role in quality, yield and biotic & abiotic stress tolerance in tea 36,81 (Fig. S6a-d).

Conclusion
SSR repeats in regulatory genes influence the normal activity and function of the genes, as repeat length variation (expansion and contraction) causes phenotypic changes in plants. Given the limited number of validated SSR markers available from regulatory genes, the identification of 1843 TF-SSR markers, including 589 potential polymorphic markers, provides a novel Tea Transcription Factor derived MicroSatellites (TTFMS) marker resource in tea. Furthermore, identification of a functionally relevant core set of 403 TF-SSR markers with desirable marker attributes (Na: 3-17 per locus; He: 0.48; Ho: 0.73; PIC: up to 0.90) and their successful extrapolation to diversity characterization of 135 popular tea cultivars suggest wider implications of the novel marker resource. Additionally, the appropriate segregation patterns of 185 markers in a bi-parental mapping population, representing hk × hk (77), lm × ll (49), nn × np (47), ab × cd (9) and ef × eg (3), indicate their potential applications in genetic mapping and establishing marker-trait associations in tea. The polymorphic core set of TF-SSR markers retrieved in bHLH, Myb-related, WRKY, C2H2, C3H, ERF, NAC, FAR1, G2-like and MYB suggests their key role in combining quality (flavonoid biosynthesis) and stress tolerance in high-yielding tea cultivars. Key attributes including polymorphic potential, stability, functional relevance and genome-wide representation across all 15 chromosomes suggest wider implications of the novel TF-SSR resource for accelerating molecular breeding efforts and trait dissection in tea.

Methods
Data utilised for mining of transcription factor (TFs) genes. Global  °C melting temperature and product size between 100 and 350 bp). Further, for codon reiteration analysis, only tri-/hexa-nucleotide repeats were targeted in the TF genes of tea. Tri-/hexa-nucleotide repeats encoding ≥ 5 consecutive copies of an amino acid were identified and recorded as single amino acid repeats in the TF genes of tea, and every uninterrupted single amino acid repeat was considered a unique event in the transcript 36.
PPI Network analysis of TFs harbouring SSR. The PPI network for TF genes harbouring SSRs was built utilising the STRING PPI network of Arabidopsis (https://string-db.org/) 86 and visualised using Cytoscape v3.4. Further, correlation between the TF genes was determined on the basis of significant correlation edges with their TAIR orthologs.
DNA isolation, PCR amplification and data analysis. Young green leaves were utilised for genomic DNA isolation from random cultivars representing three traditional varietal types, namely Assam (C. assamica), China (C. sinensis) and Cambod/Indian type (C. assamica ssp. lasiocalyx), for screening of TF-SSR markers (Table S2). Further, genomic DNA of 135 tea cultivars and of 10 individuals of the F1 mapping population along with the parental lines was isolated using the DNeasy Plant Mini Kit (Qiagen, Germany) to predict functional diversity and for genetic mapping analysis (Table S3). Quantity and quality of DNA were analysed using the NanoDrop 2000 OD260/OD280 ratio (Thermo Scientific, Lithuania), and integrity was checked on a 0.8% agarose gel. PCR amplification was performed using 25 ng of genomic DNA, and amplified products were separated on denaturing polyacrylamide gels containing 7% polyacrylamide and 7 M urea in 1 × TBE buffer. Denatured product was loaded onto the gel in a Sequi-Gen GT system (Bio-Rad, Australia) and its size was measured using a 50 bp ladder standard 10. SSR alleles were scored in binary format, 0 (absent)/1 (present), and were utilised for genetic relationship determination and estimation of marker amplification frequency and polymorphism potential in tea cultivars. The observed heterozygosity (Ho), expected heterozygosity (He) and polymorphism information content (PIC) were estimated using PowerMarker software version 3 87,88. Further, the dendrogram was constructed on the basis of Nei's genetic distance matrix using the neighbour-joining (NJ) method with 1000 bootstrap replicates 28,50,85,89,90. Further, for genetic mapping analysis, tea being a cross-pollinated plant species, four alleles representing five different segregation patterns, viz. hk × hk, lm × ll, nn × np, ab × cd and ef × eg, were utilised 13. The core set of TF-SSR markers was identified using PI and PIsibs statistics for individual markers in GenAlEx version 6.5 22,91. Further, additional parameters including PIC (≥ 0.5) and number of alleles (≥ 3 alleles/locus) were also considered for identification of the core set of TF-SSR markers 22.
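The core-set criteria can be made concrete with a short sketch. The code below computes the probability of identity (PI) and the probability of identity among siblings (PIsibs) for a single locus from its allele frequencies, using the commonly cited single-locus formulas PI = 2(Σp_i²)² − Σp_i⁴ and PIsibs = 0.25 + 0.5Σp_i² + 0.5(Σp_i²)² − 0.25Σp_i⁴, and applies the PIC and allele-number thresholds stated above. It is an illustration under these assumptions: the function names are hypothetical, the PI cut-offs applied in the study are not given in this section, and multiplying per-locus values assumes independent loci.

```python
from math import prod


def probability_of_identity(freqs):
    """Single-locus PI for unrelated individuals."""
    s2 = sum(p ** 2 for p in freqs)
    s4 = sum(p ** 4 for p in freqs)
    return 2 * s2 ** 2 - s4


def probability_of_identity_sibs(freqs):
    """Single-locus PI among full siblings (PIsibs)."""
    s2 = sum(p ** 2 for p in freqs)
    s4 = sum(p ** 4 for p in freqs)
    return 0.25 + 0.5 * s2 + 0.5 * s2 ** 2 - 0.25 * s4


def combined_pi(per_locus_freqs):
    """Multi-locus PI as the product of single-locus values,
    assuming the loci are independent."""
    return prod(probability_of_identity(f) for f in per_locus_freqs)


def passes_core_thresholds(n_alleles, pic_value, min_alleles=3, min_pic=0.5):
    """Apply the allele-number and PIC thresholds stated in the text."""
    return n_alleles >= min_alleles and pic_value >= min_pic
```

For a locus with two equally frequent alleles, PI is 0.375 and PIsibs is 0.59375; lower values indicate higher discriminating power, so markers with low PI and PIsibs that also meet the PIC and allele-number thresholds are natural core-set candidates.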