Introduction

Tea is an important plantation crop in India and is widely consumed as a non-alcoholic beverage around the world. As tea is a perennial, woody, cross-pollinated plant1, the conventional breeding program is extremely slow. Being a recalcitrant plant (i.e., difficult to regenerate in vitro), the transgenic or genome-editing approach for genetic improvement of tea is difficult2. Modern tea cultivars still rely primarily on hybridization as a method of genetic improvement. There are three botanical subgroups of tea plants (i.e., Assam, China, and Cambod type) based on morphological parameters, but due to their high outcrossing nature, they can all interbreed freely. The existing tea population today is mostly genetic admixtures of these three types3. Therefore, estimating the purity of tea genotypes using molecular markers is an important criterion for precious tea breeding.

Tea breeding is restricted to clonal selection of superior bush from the existing natural population. A systematic breeding technique for tea genetic improvement is not obscure. It is noteworthy to mention that a few draft genomes of tea, including the Assam and China types, have been reported4,5, providing insights into the tea genome's organization and genetic information. The development of molecular markers using the draft genomes of these two cultivars is one of the useful strategies for tagging the important QTLs and marker-assisted breeding for agronomically important traits. The development of a large number of diverse and informative genetic markers to cover the entire tea genome is thus necessary to accomplish such goals. Several DNA markers in tea plant, such as randomly amplified polymorphic DNA (RAPD), inter simple sequence repeats (ISSR), amplified fragment length polymorphism (AFLP), and simple sequence repeats (SSR), were reported primarily from the pre-genome sequence information6. However, they are insufficient to saturate the whole tea genome due to their large genome size5. A large number of SNPs and InDels were reported in tea plant7, however, these markers are expensive and require special skills to assay and analyze data in any typical laboratory setup. Therefore, the identification and characterization of a large number of robust, diverse, and easy-to-assay DNA markers via polymerase chain reaction (PCR) are crucial for genetic characterization of germplasm, tagging important QTLs, and trait introgression into elite tea genotypes through marker-assisted breeding.

It is also known that the tea genome contains a high proportion of repeat sequences (70–80%), the majority of which are transposable elements (TE) and other repeat-related elements. Miniature Inverted-repeat Transposable Elements (MITEs) have the structural features of DNA transposons, with terminal inverted repeats (TIRs) flanked by small direct repeats (target site duplication, TSD) at both ends of the element. MITEs are short, typically 70 bp to 800 bp in length with an AT-rich sequence. They are inserted preferentially into intergenic, adjacent to a gene, intronic, and exonic regions, thereby playing crucial roles in gene regulation and genome evolution8. MITE transposition in plant genomes is known to produce a wide range of variations in plants, both at the genotypic and phenotypic levels, which can help plants to adapt to different environments. On the other hand, MITE-related sequences may encode small RNAs that regulate specific target genes at the transcriptional and post-transcriptional levels9. Thus, MITE-derived molecular markers are excellent candidates for gene tagging, especially when targeting genes that govern quality traits in tea. In addition to MITE sequences, intronic regions of a gene often contain a plethora of repeat sequences that contribute to the diversity of genes10. The difference in length of an intron between individuals on a genome-wide scale is used to create DNA markers known as Intron Length Polymorphic (ILP) markers. The importance of ILP markers is due to their co-dominant nature, neutrality, ease of assay, higher reliability, and high cross-transferability across the related species11. Thus, genome comparison is exploited to develop potential intron polymorphism (PIP) markers by designing primers from the flanking exon sequences of an intron that vary in length12. A large number of ILP markers are successfully employed in different crops, such as rice13,14, foxtail millet15, onion16, and carrot17. In the current study, we report the simultaneous identification and characterization of a large number of CsMITE and CsILP markers by comparing the two tea genomes. We also propose that these markers have practical utility in the genetic characterization of tea germplasm for different traits, genetic diversity and germplasm characterization, tagging important QTLs, and the construction of linkage maps for advanced tea breeding.

Results

Identification and classification of CsMITEs

The prevalence of similar structural features in CsMITEs was employed for genome-wide identification of 23,170 and 37,958 potential CsMITE candidates in both the CSA and CSS tea genomes, respectively. The TSD lengths ranging from 2 to 10 bp and TIR lengths of at least 10 nucleotides were found in the identified potential CsMITE (Supplementary Table S1). Further, from these potential candidates, 180 representative MITEs were found to be part of conserved known families, and the rest 22,990 were novel in the CSA genome, whereas 377 MITEs were found to be part of conserved known families and 37,581 were found as novel sequences in the CSS genome. These 180 and 377 CsMITEs were subsequently classified into 43 and 60 CsMITE families and superfamilies in the CSA and CSS genomes, respectively. Some of the important CsMITE superfamilies identified in the present analysis are hAT-like, Tc1/Mariner, Mutator-like, PIF/Harbinger, and CACTA (Supplementary Table S2). The DTA Mae1 (superfamily hAT-like) and DTT Zem3 (superfamily Tc1/Mariner) families have a maximum of 27 CsMITE sequences in the CSA genome each, whereas the DTT Zem3 (superfamily Tc1/Mariner) family has 65 CsMITE sequences in the CSS genome (Tables 1 and 2).

Table 1 Classification of CsMITE families and superfamilies in the CSA tea genome.
Table 2 Classification of CsMITE families and superfamilies in the CSS tea genome.

In our analysis, around 1977 and 5466 CsMITEs were found to be located in the ‘genic’ region of CSA and CSS, respectively, whereas 1776 and 2573 were found to be located in the ‘near genic’ category in the CSA and CSS genomes, respectively. Further, a total of 18,934 and 29,144 CsMITEs were grouped in the ‘intergenic’ category in the CSA and CSS genomes, respectively. After a comparison of CsMITEs from the genic region in both CSA and CSS genomes, we found 53 and 154 unique MITEs in the CSA and CSS genomes, respectively (Supplementary Table S3). All identified CsMITEs in both the genomes were compared through a homology search to determine the common and unique CsMITEs. It was observed that 22,611 CsMITEs were shared in both the genomes, while 559 and 15,347 were found to be unique in the CSA and CSS genomes, respectively (Supplementary Table S4).

CsMITEs as precursors of miRNA sequences

Identified CsMITEs sequences in both the genomes were used as a query to perform a homology search against downloaded 48,885 miRNAs and 522 novel Camellia-specific miRNAs (Supplementary Table S5). The small RNA sequences with an exact match to the CsMITE sequences were pooled as MITE-derived sequences. Among aligned CsMITEs derived miRNA sequences, we found 3964 unique miRNAs in CSA and 5198 unique miRNAs in CSS as top hits against them. Further, Camellia-specific novel miRNAs showed a match with 430 and 448 predicted CsMITEs from the CSA and CSS genomes, respectively.

Identification and classification of non-redundant CsILP loci

A total of 36,951 CDS sequences from the CSA genome were searched using the online PIP marker database12, using Arabidopsis thaliana intron information as a model to find the best CsILP primer hits. The initial search designed a total of 15,087 CsILP primer pairs from 4287 unique CDSs, which matched with 3056 unique Arabidopsis CDSs. The identified CsILP primers were further filtered and removed the duplicate primer sequences, and we have assessed their unique primer binding sites on the CSA genome. This filtration process resulted in a non-redundant set of 4912 CsILP primer pairs, which have a unique primer binding site on the CSA genome and also matched to 1914 unique Arabidopsis CDS. Further, these 4912 CsILP loci mapped to 1780 scaffold sequences on the CSA genome (Supplementary Table S6). A comparison of these 4912 CsILP primer pairs to their potential primer binding sites revealed that they can bind to 4213 (85.8%) single or multiple binding sites in the CSA and CSS genomes, respectively. Out of these 4213 primers, 410 primers predicted more than one binding site in the CSS genome, whereas they were predicted to have a single binding site in the CSA genome.

Identification of Transcription Factors (TFs) associated with CsMITEs and CsILPs

The CsMITEs of both the tea genomes were searched against the Plant Transcription Factor and Transcriptional Regulator Categorization and Analysis Tool (PlantTFcat) database18, which harboured TFs belonging to WRKY, MYB, bZIP, bHLH, NAC, Zing finger, and AP2/ERF families. In the CSA genome, only two CsMITEs were found belonging to the bZIP and CCHC (Zn) TF families, whereas in the CSS genome, two CsMITES were found to belong to the C2H2 and HMG TF families (Tables 3A and B). A search against the PlantTFcat database produced a total of 193 hits that were relevant to TFs in the case of CsILP-containing CDS. The majority of the CsILP-containing CDS were found to be WD40-like (32%) TFs, followed by C2H2 (16%) and MYB/MYB-like (11%) TFs (Fig. 1). Other significant TFs associated with the ILP-containing CDS were AP2-EREBP (7%) and WRKY (5%), Homeobox-WOX (5%), E2F-DP (3%), bHLH (3%), bZIP (2%) and others (16%).

Table 3 Transcription factors identified in CsMITEs (A) CSA genome (B) CSS genome.
Figure 1
figure 1

TFs associated with CsILPs in tea genomes.

Gene ontology (GO) annotation of CsMITEs and CsILP loci

Functional annotation of genic CsMITEs from both the CSA and CSS genomes may help in understanding their roles in biological processes, molecular functions, and biological pathways. Therefore, GO annotation of the genic CsMITEs in both the CSA (1754 sequences) and CSS (4401 sequences) genomes was performed using the basic BLAST2GO software19. A total of 1328 (75.7%) and 3217 (73.1%) CsMITE sequences of the respective CSA and CSS genomes were annotated with at least one GO term associated with the cellular component (CC), molecular functions (MF), or biological process (BP). The AgriGO singular enrichment analysis (SEA)20 was performed against the reference database, The Arabidopsis Information Resource (TAIR) genome locus (TAIR10_2017)21, revealed 23 and 22 significantly enriched GO terms with FDR ≤ 0.05, respectively, for the genic CsMITEs from the CSA and CSS genomes. The blastx analysis indicated that 1624 (92.6%) and 4059 (92.2%) genic CsMITE sequences, separately from the CSA and CSS genomes, produced hits against the NCBI (National Centre for Biotechnology Information) customized plant non-redundant (nr) database. The majority of the genic CsMITEs (CSA: 448 sequences, 25.5%; CSS: 990 sequences, 22.5%) found top hits with Vitis vinifera sequences. According to the GO level distribution, the genic CsMITE sequences from the CSA and CSS genomes under the CC category produced the most significant sequences (p value ≤ 0.05) for ‘cell’ (CSA: 40.5%, CSS:35.2%) followed by the ‘cell part’ (CSA: 39.5%; CSS: 34.6%) and ‘organelle’ (CSA:27.8%, CSS:22.3%). Under the MF category of GO, the highest percent of genic CsMITEs were found significant (p value ≤ 0.05) for the ‘binding’ (CSA: 38.3%, CSS: 33.5%) function. In the case of the BP category, the ‘response to stimulus’ (CSA: 9.6%, CSS: 7.9%), ‘cellular process’ (CSA: 40.4%, CSS: 36.7%) and ‘biological regulation’ (CSA: 5.5%, CSS: 4.2%) were found significantly high (p ≤ 0.05) for the genic CsMITEs in the CSA and CSS genomes. Some of these genic CsMITEs might be related to important secondary metabolite pathways such as phenylpropanoid biosynthesis, isoflavonoid biosynthesis, anthocyanin biosynthesis, isoquinoline alkaloid biosynthesis, monoterpenoid biosynthesis, tropane, piperidine, biosynthesis of secondary metabolites, and pyridine alkaloid biosynthesis, which might serve as an important resource for marker development for tea quality breeding and genetic improvement (Fig. 2) (Supplementary Table S7). In our results, CsMITEs named MITE_T_15348 and MITE_T_12375 from the CSA genome and MITE_T_831, MITE_T_16821, MITE_T_5898, and MITE_T_29409 from the CSS genome were found to be involved in caffeine metabolism. MITE_T_8758, MITE_T_23119, MITE_T_11869, and MITE_T_14708 from the CSA genome, as well as MITE_T_27732, MITE_T_19350, MITE_T_34833, and MITE_T_18111 from the CSS genome, were discovered to be involved in the phenylpropanoid biosynthesis pathway (Supplementary Table S8).

Figure 2
figure 2

Gene ontology of CsMITEs detected in the genic region of CSA genome and CSS genomes using the online AgriGO v2.0—GO via the customized Singular Enrichment Analysis (SEA) tool against the Arabidopsis reference background annotation data (TAIR10_2017).

Similarly, the GO annotation of the CsILP containing CDS showed that 2123 (89.9%) out of 2362 could be related to at least one GO term associated with the CC, MF, or BP categories. Using the AgriGO-customized SEA tool, a total of 39 significantly enriched GO terms were identified with FDR ≤ 0.05 against TAIR genome locus (TAIR10_2017) reference annotation data. The BLASTx analysis further indicated that the majority of the CDS sequences (2340 sequences, 99.1%) produced hits against the plant ‘nr’ database and 497 CDS sequences (21%) found top hits for Vitis vinifera. Among the BP category, the GO terms associated with ‘cellular process’ (34%) followed by ‘metabolic process’ (33%) were the two major GO categories exhibited in the CsILP-containing CDS sequences. Under the MF category, ‘nucleotide binding’ (31%) and ‘hydrolase activity’ (25%) were found as the major two categories. Among the GO terms associated with CC, ‘cell’ (31%) and cell part (31%) were two major GO categories determined. Besides, ‘biosynthetic processes’ (19.6%) and ‘response to stimulus’ (13.4%) were also found as two notable GO categories, which might be fascinating in tea plant genetic improvement research (Fig. 3). Some of the CsILP markers, namely CSAPIP0022, CSAPIP1511, CSAPIP1721, CSAPIP2102, CSAPIP2983, CSAPIP3308 and CSAPIP3671 were found to be involved in theanine biosynthesis related pathways (Supplementary Table S8).

Figure 3
figure 3

Gene ontology of CsILPs detected in the genic region of CSA genome and CSS genomes using the online AgriGO v2.0—GO via the customized Singular Enrichment Analysis (SEA) tool against the Arabidopsis reference background annotation data (TAIR10_2017).

Validation of selected CsMITE and CsILP markers

We randomly selected 25 CsMITE and 33 CsILP primers for validation that have single primer binding sites in the tea genomes as predicted by in silico analysis (Supplementary Table S9). These 25 MITEs and 33 ILP markers are widely distributed across all the 15 chromosomes of the tea genome (Fig. 4). Initially, nine diverse tea genotypes were chosen to screen for the polymorphism that yielded 10 CsMITEs (Supplementary Table S10) and 15 CsILP polymorphic markers (Supplementary Table S11). Later, we used 36 diverse tea genotypes for further analysis at the genotypic level. The number of alleles per locus generated by each marker varied from 1 to 4 in CsMITEs and 1–6 in CsILP-based markers. The maximum number of alleles (i.e., 4) was generated by one CsMITE named MITE_T_23247 in all 36 tea genotypes (Supplementary Fig. S1). One CsILP, namely CSAPIP1038, has generated 6 alleles after running in PAGE (Supplementary Fig. S2). A phylogenetic tree was constructed using the 10 polymorphic CsMITE markers that divided all the 36 diverse sets of genotypes into 2 different main clusters. Cluster 1 contains 2 small sub-clusters which included 6 genotypes and 7 genotypes of tea (Fig. 5a). Cluster 2 was further divided into 2 sub-clusters; one is major with 19 genotypes and another is minor with 4 genotypes.

Figure 4
figure 4

Chromosomal location of MITEs and ILP markers selected for validation generated by MapChart.

Figure 5
figure 5

Phylogenetic tree of 36 diverse genotypes using the DARWIN 6 program with the neighbor-joining method (a) CsMITE markers (b) CsILP markers.

Similarly, the 15 polymorphic CsILP marker-based phylogenetic tree clustered 36 diverse genotypes into 2 clusters, the major cluster consisting of 30 genotypes, while the minor cluster grouped 6 genotypes (Fig. 5b). The major cluster is further divided into 2 sub-clusters in which one consists of 29 genotypes, leaving a single genotype out clustered.

Discussion

The advancement of whole-genome sequencing along with the availability of robust in silico tools can accelerate the development of low-cost, highly efficient gene-associated functional molecular markers for genotyping. The MITE-derived markers have an edge over other markers in terms of stability, and their high copy number can serve as a plentiful resource for producing genome-wide markers. Their close association with the genic regions can assist breeders to develop functional molecular markers to tag key agronomic traits.

In the present study, we took advantage of ILP and MITE polymorphic loci insertion to develop a large number of CsILP and CsMITE markers in the commercially important tea crop. The identification of MITEs is crucial as they are involved in the evolution of genomes and can significantly regulate the expression of host genes directly12 or through MITE-derived small RNAs9. We found a significant number of CsMITEs, which is about 83.464% and 78.42% located in the intergenic regions. Also, about 7.82% and 6.92% of CsMITEs were found adjacent to genic regions in the CSA and CSS genomes, respectively. Nevertheless, a considerably lower proportion of CsMITEs, i.e., 8.71% and 14.65%, were interestingly located in the genic region of the CSA and CSS genomes. A similar distribution pattern of MITEs is also reported in Arabidopsis thaliana22Oryza sativa L. ssp. japonica23 and Brassica genomes24. Introns, which were previously thought to be non-coding DNA, are now known to play important roles in gene expression regulation25. Therefore, by harnessing the advantage of publicly available genome sequences, we identified introns in the whole genome to exploit their length polymorphism as molecular markers in plants. Among the simple PCR-based markers, ILP is gene-specific, often hypervariable, neutral to the environment, and co-dominant, which has a high transferability rate in related species12. Previously, genome-wide intron-derived polymorphic markers in rice14, foxtail millet15 sorghum26, chickpea27, and Macrotyloma spp28 were reported. In the present study, we developed a large number of CsILP markers (4192) from the two sequenced genomes of tea harnessing the intron derived marker development tool12. As a part of CDS, these CsILP markers were further characterized with GO annotations and TF association to establish their functional implications in tea germplasm characterizations and breeding.

After detection of potential CsMITEs, they were classified into superfamilies, where Tc1/Mariner and hAT-like superfamilies constitute the maximum CsMITEs in both the genomes. In previous studies, Tc1-like elements have also been identified in angiosperms such as Oryza sativaBrassica rapaCannabis sativa, and Triticum urartu29. The pathway analysis with the CsMITEs from the genic region revealed their association with the important secondary metabolite biosynthetic pathways like phenylpropanoid biosynthesis, isoflavonoid biosynthesis, anthocyanin biosynthesis, and mono-terpenoid biosynthesis (Supplementary Table S7). In the current study, identified CsMITEs were found to be involved in secondary metabolite synthesis pathways were higher in the CSS genome as compared to the CSA genome. GO terms of CsMITEs from both CSA and CSS genomes showed that the highest percentage of CsMITEs fall under the ‘biological process’ category, such as response to stimulus, cellular process, and biological regulation. Similarly, based on GO term and pathway analysis of the CsILP loci in the tea genomes, numerous CsILPs could be associated with the pathways for caffeine metabolism and phenylalanine, tyrosine, and tryptophan biosynthesis. The major determinant of tea quality is the presence of bioactive compounds produced from different secondary metabolic pathways. Therefore, the genomic markers of CsILPs (e.g., CSAPIP1511-glutamine oxoglutarate aminotransferase and CSAPIP2102-glutamine synthetase for theanine synthesis) and CsMITEs (e.g., MITE_T_8010, MITE_T_33358, MITE_T_2957, MITE_T_5423, MITE_T_21237, and MITE_T_7677 for flavonoid biosynthesis, MITE_T_34833, and MITE_T_18111 for phenylpropanoid biosynthesis) related to important secondary metabolite pathways may aid in targeted breeding research in tea crop improvement (Supplementary Table S8).

In the present study, the majority of the small RNA derived-CsMITEs (i.e., 83.44% and 78.39%) were detected in the intergenic regions and only around 16.54% and 21.59% were mapped to genic and near genic regions of the CSA and CSS genomes, respectively, which was in accordance with those detected in the rice30. This may be due to fewer protein-coding sequences in comparison to transcriptionally active regions. The class II MITEs, during plant evolution, can act as mobile elements to shuffle TF-binding sites and modify transcriptional networks31. In order to find transcription factors, CsMITEs were investigated, but only two classes—bZIP and CCHC (Zn)—were found in the tea genomes. Altering bZIP gene expression patterns has been shown to influence many signalling and regulatory networks involved in a variety of physiological processes32, whereas CCHC (Zn) transcription factor is important for the gene expression regulation and cell cycle arrest33. In the present study, several CsILP loci (8.2%) were found associated with the TFs. Interestingly, the TFs or regulatory sequence-derived molecular markers were used in germplasm characterization in Medicago sp.34 and flax35. Therefore, identified CsMITE and CsILP markers related to TFs might serve as an important genomic resource for the characterization and tagging of agronomically important traits for tea breeding. In the current study, for the validation of CsMITE and CsILP markers, we used 36 diverse genotypes of tea, and the results were found to be very promising. We found 10 CsMITE markers to be polymorphic out of selected 25 markers and found to be involved in important pathways in the genome, e.g., MITE_T_22744 tangled with phosphate translocation and MITE_T_7141 is a TPR repeat-containing thioredoxin TTL3 involved in osmotic stress response (Supplementary Table S10). Likewise, 15 CsILP polymorphic markers were found to be involved in most of the important pathways in the genome, e.g., CsILP markers, namely CSAPIP2225, CSAPIP4702, and CSAPIP4263, were found to be involved in the biosynthesis of secondary metabolites (Supplementary Table S11). The phylogenetic tree of these 36 genotypes exposed huge variations among the 36 tea genotypes. In addition, polymorphism present in both the CsMITE and CsILP markers can be further evaluated and tested for association with the phenotypic variance of the trait, which can be successfully employed in the improvement of tea plants.

Conclusion

The current study revealed 22,990 novel CsMITEs sequences in the CSA genome and 37,581 novel CsMITEs sequences in the CSS genome. Similarly, we found 4213 non-redundant and putative CsILP markers that were shared by both tea genomes. Annotation of the CsMITE and CsILP marker-containing sequences revealed numerous markers associated with caffeine metabolism and secondary metabolite biosynthesis pathways, such as phenylalanine, tyrosine, and tryptophan biosynthetic pathways. The validation of the markers in tea germplasm revealed polymorphism and genetic diversity, which could be useful in genotype characterization, comparative mapping, and relationships at the genomic level in the tea crop. Furthermore, the high polymorphism potential in both marker types could be exploited to associate distinct phenotypic variances in tea quality traits. This functional molecular marker resource could be used to improve tea crops and other related perennial woody plant species through marker-assisted breeding.

Material and methods

Identification and classification of CsMITEs

The published whole genome sequence of two tea cultivars, namely, ‘Yunkang 10’ (CSA genome) 4 and cultivar ‘Shuchazao’ (CSS genome) 5, were chosen for this study. The entire workflow is depicted in Supplementary Fig. S3. The open-source software program MITE Tracker 36 was used to scan both the tea genomes (CSA and CSS) to identify potential CsMITE candidates with default parameters. This program identifies putative elements first by searching for accurate inverted repeat sequences, followed by a calculation of the local composition complexity (LCC) score. After these initial steps, valid candidates are identified, followed by clustering using the VSEARCH tool32. These putative CsMITE candidates were then classified by aligning the sequences with the annotated MITE sequences of plant MITE databases (P-MITE)37using the BLASTn homology search tool 38 from the NCBI with an e-value cut-off of ≤ 1e−5.

Based on the locations within or outside the gene sequences, CsMITEs of both the CSA and CSS genomes were classified as either genic, near genic, or intergenic using BED Tools v2. 27.039. The genic region is defined as a region where CsMITEs were found within a gene, near the genic region, and consists of sequences within 1000 bp sequences upstream or downstream to a gene, whereas intergenic CsMITEs were found beyond the gene but within its 1000 bp flanking regions on both sides8.

Genome-wide analysis of common and unique CsMITEs

A comparative genome-wide analysis was performed to identify conserved and unique CsMITEs sequences in the CSA versus CSS tea genomes through a sequence similarity approach. The similarity search was performed using the BLASTn program with an e-value cut-off ≤ 1e−5. To identify common and unique sequences in both the genomes, the CSS dataset was used as a query sequence to the database of the CSA genome, which compares one or more nucleotide query sequences to a subject nucleotide sequence or a database of nucleotide sequences. The potential sequences after the concurrent analyses in the next sections were then used to design and develop markers.

Identification of CsMITE as miRNA precursor

A total of 48,885 miRNA sequences from 271 organisms were downloaded from miRBase version 22 biological databases40, and 522 novel Camellia specific miRNAs were manually collected from research articles41,42,43. All the CsMITE elements identified earlier in both the genomes were used as a query to perform a homology search using the BLASTn program with an e-value of 1e−5 against these small RNA sequences. Only the predicted CsMITEs which had a perfect top match with the small RNAs were considered as CsMITE-derived small RNAs.

Identification and development of CsILP markers

Coding sequences of the CSA genome4 were used to identify potential ILP loci using the online ‘Develop’ tool of the database of PIP markers12. The tool to predict putative CsILP loci compared the dicot model plant, Arabidopsis thaliana polymorphic intron sequences. Accordingly, forward and reverse primers were designed from the 100 bp flanking sequences of an intronic sequence of ≤ 400 bp. The primers designed were first checked manually to remove the duplicate primer sequences. The resulting primer sequences were then subjected to in silico PCR against the CSA and CSS whole-genome assemblies using the ‘0’ mismatch in the Primer Search tool of EMBOSS software44 for detection of redundant primer-binding sites. Only the primers producing single amplimers were finally selected as potential non-redundant CsILP markers.

In silico mining of transcription factors associated with CsMITEs and CsILPs

Transcription factors in both the CsMITE and CsILP-containing sequences of the CSA and CSS genomes were identified using the online PlantTFcat tool18. The potential CsMITE and the CsILP-containing sequences were taken as input to the PlantTFcat database. These sequences were searched against all the transcription factor protein sequences present in this database by analysing their InterPro Scan domain patterns.

Annotation of CsMITEs and CsILP markers

The functional annotation and the pathways of genic CsMITEs and the CsILPs-containing CDS sequences were mapped separately using the basic BLAST2GO program GO v.5.2.519. In the BLAST2GO analysis, the local BLASTx tool of NCBI BLAST+ version 2.3.0 was run against a customized NCBI plant nr database using an e-value cut-off at 1e−5. The BLAST2GO interpro scan, GO mapping, and annotation tools were used with the default settings. GO-Slim analysis using the Plant slim and GO-Enzyme code mapping and KEGG was chosen for the final annotations. Finally, the GO annotated combined graph was plotted using the web gene ontology annotation plotting tool WEGO 2.045 at GO level 2 data. Additionally, the GO enrichment analysis was performed using the online AgriGO v2.020 via the customized Singular Enrichment Analysis (SEA) tool against the Arabidopsis reference background annotation data (TAIR10_2017)21.

Validation of CsMITEs and CsILPs markers

In total, 50 CsMITE sequences, i.e., ten each of short, medium, and long conserved sequences and 20 from novel CsMITE sequences of both the genomes, were used to design primer pairs from the flanking regions. For validation, a total of 25 primers were chosen that were anticipated to have single binding sites on both tea genomes (Supplementary Table S9). Similarly, out of 4912 CsILP-based markers, a total of 33 primers with more than 200 bp were selected (Supplementary Table S9). A chromosomal map was constructed which displayed the locations of 25 MITEs and 33 ILP markers distributed across all 15 chromosomes of tea using MapChart46. These CsILP-based primers have only one amplimer and have significant differences in amplicon size in two different tea genomes (in silico PCR). For validation of CsMITEs, we used 36 diverse tea genotypes (Table 4) to check polymorphism using CsMITEs and CsILPs-based markers. Genomic DNA was extracted by following the standard protocol47. Each PCR reaction was carried out in a 25 μL total volume consisting of 25 ng of genomic DNA, 0.5 μM each of forward and reverse primer, 0.25 mM of each of the dNTPs, 0.5 U Taq DNA polymerase, and 1 × Taq buffer. The PCR amplifications were carried out in a PCR (Applied Biosystems™) with the following thermal profile: 94 °C (5 min.), 30 cycles of 94 °C (60 s), 58–63 °C (60 s), 72 °C (45 s) and a final step of 72 °C (7 min.). PCR products were separated in a 6% polyacrylamide gel, stained with ethidium bromide and viewed under the Gel Documentation System (Gel Doc XR+ system, BioRad, USA). The number of alleles was scored manually for each CsMITE and CsILP-based marker. The molecular weight marker of 100 bp was used to identify the molecular weight of the amplified products. We determined the various parameters of genetic diversity and a phylogenetic tree was constructed using the software program DARwin v.6.048.

Table 4 36 diverse genotypes of Tea used for validation of CsMITEs and CsILPs.