Comparative transcriptome analysis to identify putative genes involved in thymol biosynthesis pathway in medicinal plant Trachyspermum ammi L.

Thymol, as a dietary monoterpene, is a phenol derivative of cymene, which is the major component of the essential oil of Trachyspermum ammi (L.). It shows multiple biological activities: antifungal, antibacterial, antivirus and anti-inflammatory. T. ammi, commonly known as ajowan, belongs to Apiaceae and is an important medicinal seed spice. To identify the putative genes involved in thymol and other monoterpene biosynthesis, we provided transcriptomes of four inflorescence tissues of two ajowan ecotypes, containing different thymol yield. This study has detected the genes encoding enzymes for the go-between stages of the terpenoid biosynthesis pathways. A large number of unigenes, differentially expressed between four inflorescence tissues of two ajowan ecotypes, was revealed by a transcriptome analysis. Furthermore, differentially expressed unigenes encoding dehydrogenases, transcription factors, and cytochrome P450s, which might be associated with terpenoid diversity in T. ammi, were identified. The sequencing data obtained in this study formed a valuable repository of genetic information for an understanding of the formation of the main constituents of ajowan essential oil and functional analysis of thymol-specific genes. Comparative transcriptome analysis led to the development of new resources for a functional breeding of ajowan.

Establishment of ajowan transcriptomes. To study thymol biosynthesis, short-read transcriptome sequencing from the inflorescence tissues of two ecotypes (Arak and Shiraz) was carried out (Fig. 2). Sequencing runs of the inflorescence tissues of Arak-3 and Arak-10 yielded 43,056,120 and 42,260,830 of high-quality reads, respectively. Similarly, 51,965,127 and 47,891,026 high-quality reads were generated from the inflorescence tissues of Shiraz-17 & Shiraz-21, respectively. Details of the generated sequencing data are illustrated in the Supplementary Table S1.
De novo assembly and annotation. To reconstruct the transcriptome dataset for ajowan, a total of four cDNA libraries were generated from the inflorescence tissues of the two ecotypes and were sequenced using the HiSeq. 2000 platform. A Trinity assembly of combined reads was selected for further analysis, which produced 123,488 unigenes with N50 of 994 bp and 151,115 transcripts with N50 of 1291 bp. The GC content of the assembled combined reads in the inflorescence tissues of two ecotypes was 38.35% (Table 1). The Trinity assembly of the combined reads obtained from Arak-3, Arak-10, Shiraz-17 and Shiraz-21 libraries resulted in the generation of  Table S1). The unigenes had an average length from 826 bases to 892 bases in all the libraries. Among all the unigenes, more than 27% were recognized as large unigenes, longer than 1000 bases in all libraries of the two ecotypes. The length distribution of the unigenes from all libraries is displayed in Supplementary Fig. S1.
The unigenes' annotation was done using BLASTx against KAAS, TAIR10, Uniprot, NCBI protein (NR) and carrot genome databases (Supplementary Table S2 Gene ontology classification. In order to provide a functional classification of unigenes, GO assignments was created by Trinotate software and visualization of the Gene ontology classification was done by using Web Gene Ontology Annotation Plot (WEGO). The WEGO Plot showed that all unigenes were classified into 55 functional categories (Fig. 4). Among all the categories of cellular components (CC), molecular function (MF), and biological process (BP) of Gene Ontology Annotation, the dominant categories were 'cell' and 'cell part' (≥75%) (Fig. 4). Furthermore, 'biological regulation' , 'cellular process, 'metabolic process' and 'response to stimulus' categories had high percentages (Fig. 4). Out of the total categorized unigenes, 24,373 unigenes (61.1%) were assigned to the 'metabolic process' category, among which 1303 unigenes (3.3%) were assigned to the 'secondary metabolic process' category.
Functional KEGG pathways identification. The functional biological pathways in T. ammi were identified by mapping 123,488 unigenes from the assembly to the canonical pathways reference in KEGG using KAAS. The result indicated that all unigenes were assigned to 355 KEGG pathways (Supplementary Table S3). In the secondary metabolic pathway (ko01110), 2,316 unigenes were identified, which related to 399 KEGG genes ko numbers ( Identification and classification of differentially expressed genes. Identification of differentially expressed genes was done based on fold change >4 and p_value < 1e-3. Out of 123,488 unigenes obtained from the Trinity assembly, 2,626 unigenes were recognized as differentially expressed unigenes in four inflorescence tissues of two ecotypes. Out of the 2,626 differentially expressed unigenes, 2,091 were annotated using different databases (Supplementary Table S2). The differentially expressed unigenes, which were uniquely annotated against KAAS, Uniprot, NR, TAIR10 and carrot genome databases were 866, 1,539, 1,773, 1,705 and 2,010, respectively (Supplementary Table S2 and Fig. S2). The clustering of expression patterns of the differentially expressed unigenes in inflorescence tissues of four genotypes is showed in Fig. 5. Among all the 15 clusters, Clusters 8 and 11 represented unigenes that were over expressed in high oil content ecotype tissues (Arak-3 & Arak-10) compared to low oil content ecotype tissues (Shiraz-17 & Shiraz-21) (Supplementary Table S4). Similarly, unigenes with a higher expression in low oil content ecotype (Shiraz-17 & Shiraz-21) were gathered in Clusters 5 and 15 (Supplementary Table S5). Unigenes presented in Clusters 9 and 10 displayed approximately equal expression patterns in genotypes Arak-10, Shiraz-17 and Shiraz-21. The heatmaps of classified differentially expressed unigenes which had differential expression patterns in inflorescence tissues of the two ecotypes (Clusters 5, 8, 11 and 15) are shown in Supplementary Fig. S3. GO classification of differentially expressed genes. Gene ontology classification showed that the differentially expressed unigenes were classified into 55 functional categories ( Supplementary Fig. S4). The dominant (≥50%) categories were 'cell' and 'cell part' , in the cellular component class of ontology and 'binding' in the molecular function class of ontology, and also 'metabolic process' and 'cellular process' in the biological process class of ontology. There were 65 (4.22%) differentially expressed unigenes related to the secondary metabolic process, categorized into 31 GO terms (Supplementary Table S6). The total number of GO terms related to the differentially expressed genes in all comparisons was 1415, which 1229 GO terms were unique (Fig. 6A). A comparison of the differentially expressed GO terms of four genotypes of the ajowan inflorescence of two ecotypes showed that Shiraz-17 and Shiraz-21 had only 124 differentially expressed GO terms, indicating that these two genotypes had more similar gene expression patterns (Fig. 6A). A classification of the GO terms into three main categories indicated that, in all comparison of pair genotypes, the largest differentially expressed GO term category was the biological process (BP), followed by molecular function (MF) and cellular component (CC), respectively (Fig. 6A). The differentially expressed GO terms from four genotypes produced six sets of data, which are shown in a Venn diagram (Fig. 6B). Among all the sets, three GO terms were related to the secondary metabolite, GO:0019748, as a secondary metabolic process in Arak-3 vs. Arak-10 and Arak-3 vs. Shiraz-17 sets, GO:0090487 as a secondary metabolite catabolic process in set Arak-3 vs. Arak-10, and GO:0044550 as a  Table S7). These showed that the two ajowan ecotypes (Arak and Shiraz) had different expression and regulation of the genes related to the terpenoids process.
GO enrichment of differentially expressed genes. GO enrichment analysis was done on the differentially expressed unigenes of each pair set to obtain over-represented GO categories of unigenes. In all, 1415 GO terms were used for enrichment from four genotypes (Fig. 6A). In the biological process class, the enriched categories were related to the metabolic processes of macromolecule metabolism, secondary metabolism, carotenoid biosynthesis, root development, fruit ripening, and response to growth hormones ( Fig. 7 and Supplementary  Fig. S5). The presence of GO categories related to terpenoid biosynthesis (Fig. 7A), geranylgeranyl diphosphate biosynthesis, and acetyl-CoA metabolism (Fig. 7B) led us to identify the genes related to differential terpenoid biosynthesis in two ecotypes. Enriched categories in the molecular function class were oxidoreductase, geranyl transtransferase, and cytom-chrome_c oxidase activity, which are probably related to putative terpenoid-encoding genes ( Supplementary  Fig. S6). In the cellular component class, enriched categories of cytoplasmic, cytosolic plastid, chloroplast, and thylakoid parts were observed ( Supplementary Fig. S7). The interactive graph view of enriched GO terms was constructed for Arak-3 vs. Shiraz-17 ( Fig. 8 and Supplementary Fig. S8), and Arak-10 vs. Shiraz-17 ( Supplementary  Fig. S9).

Identification of unigenes involved in biosynthesis of monoterpenoids.
Biosynthesis of monoterpenoids in ajwoan utilizes the terpenoid backbone pathway as an intermediate. This pathway consists of two steps; the final product of the first step is IPP, produced from MVA and MEP pathways. The products of the second step are GPP and FPP, depending on MEP or MVA pathway, produced from IPP ( Fig. 9). In this study, unigenes related to all the enzymes of the terpenoid backbone pathway, were identified in the inflorescence transcriptome of ajowan. The data showed that each enzyme was encoded by multiple copies of unigenes   Table 2. Global and overview KEGG pathway maps of Trachyspermum ammi transcripts defined by KAAS.

Identification of gene families involved in biosynthesis of monoterpenoids. Cyclic monoterpenes
such as thymol are the final products of different secondary transformations including isomerization-cyclization and hydroxylation of GPP as a substrate 38,39 . The gene families involved in the biosynthesis of monoterpenoids ( Fig. 9) might be terpene synthases (TPS), cytochrome P450s (CYP450s) 40 , dehydrogenase (DHs) 17,41 , and transcription factors (TFs) 42,43 , which may provide the biosynthesis of thymol in ajowan. The differentially expressed members of these gene families in ajowan transcriptome were identified by using annotated unigenes, some of which were putatively involved in the monoterpenoids biosynthesis pathway (Supplementary Table S9). The results of annotation using TAIR10 and carrot genome databases showed that 203 unigenes were of the CYP450 gene family, while 25 unigenes were differentially expressed in inflorescence tissues of four genotypes (Supplementary Table S9). Some of these might be involved in the biosynthesis of thymol and other monoterpenoids. Altogether 38 unigenes were annotated as Terpene synthases (TPs), of which four unigenes were differentially expressed in inflorescence tissues of four genotypes (Supplementary Table S9, Suplementary_Dataset_2). It was found that 1230 unigenes were related to the dehydrogenase (DHs) gene family and 53 unigenes among them showed differential expression in four genotypes (Supplementary Table S9 Table S11). In our study, the dominant families were bHLH, NAC, MYB-related, C2H2, ERF, bZIP, MYB, C3H, Trihelix, and WRKY (Fig. 10A). More than half (55%) of T. ammi's TFs had high homology with Daucus carota TFs (Fig. 10B). Seventy-three unigenes, belonging to 24 TFs families expressed differentially in the inflorescence tissues (Supplementary Table S9), included WRKY, bZIP, GATA, C3H, NAC, bHLH, and MYB families (Fig. 10C). The expression pattern of all differentially expressed TF gene families in inflorescence tissues of T. ammi is represented in Fig. 11.  Pathway enrichment analysis for differentially expressed genes. A pathway enrichment analysis of differentially expressed genes helped us to identify significant KEGG metabolic pathways, which included significantly more expressed genes. There were 10 KEGG pathways identified as significantly enriched, with a cutoff p-value 10e-3, in at least one of the pairwise genotype comparison (Table 5)

Discussion
The inflorescence of ajowan (Trachyspermum ammi) is a rich repository of secondary metabolites, especially monoterpens, such as thymol. Phytochemical analyses of inflorescence tissues of Arak and Shiraz ecotypes showed that three components-thymol, γ-terpinene, and p-cymene-to be the main components of ajowan essential oils, comprising 98% of essential oil components in the studied ecotypes. In other reports, too, the major components of the essential oils of ajowan were thymol, γ-terpinene and p-cymene 21,44 . Secondary metabolite biosynthesis and accumulation were tissue-specific and related to enzymes and regulator genes, demonstrating tissue-specific expression patterns 22,45 . Hence, the inflorescence tissues from two Iranian native ecotypes (Arak and Shiraz) were selected for transcriptome analysis. Phytochemical analyses of the inflorescence tissues (Fig. 1C) showed differences in the amount of the main components of the essential oils. These quantitative variations in the essential oil content can be due to the terpene synthase and other secondary metabolite related gene expression and regulation. The valuable data generated by the RNA-Seq analysis of the T. ammi transcriptome can be used to explain the thymol biosynthesis pathway, paralogues of genes and gene families related to thymol biosynthesis. The result of annotation indicated that more than 48% of the assembled unigenes of ajowan matched with the genomic databases of other plants. The unigenes, identified in this research, had a higher annotation percentage against the carrot genome database compared to other plant databases (Fig. 3A). This may be due to this fact that both carrot and ajowan are members of the Apiaceae family. The results of this research indicated the high similarity of assembled unigenes of T. ammi from Apiaceae family with Panax quinquefolius (American ginseng) from Araliaceae family 46 (Fig. 3B), which illustrated this fact that both two families are from the same order. Both Araliaceae and Apiaceae families are from Apiales order and they may be the remnants of an ancient group of pro-araliads 47 . The diversity of the GO terms related to the assembled unigenes, as demonstrated by functional GO assignments, showed the variety of the unigenes involved in the secondary metabolite biosynthesis pathway (Fig. 4). Furthermore, the detection of novel genes related to secondary metabolite processes may be possible from T. ammi RNA-seq data. A large number of unigenes involved in secondary metabolite processes in T. ammi was identified by the mapping of unigenes against the KEGG pathway database (Tables 2 and 3). A total of 2316 unigenes, including isoprenoid and putative terpenoid pathway genes, were involved in the secondary metabolite biosynthesis. The differentially expressed unigenes (2,626) of four different genotypes of T. ammi were represented by a differential gene expression analysis (Fig. 5). In all the fifteen clusters, unigenes grouped in clusters 5, 8, 11 and 15, as shown in Fig. 5, were expressed in different pattern in the high oil content ecotype (Arak-3 & Arak-10) compared to low oil content ecotype (Shiraz-17 & Shiraz-21) and might be involved in secondary metabolite biosynthesis. The presence of HDS gene (29437_0_2) and two transcription factors (56455_2_2 and 41958_0_3) in cluster 8, might be one of the causes of the increase essential oil amount in Arak ecotype (Supplementary Table S15). From 1531 DEGs unigenes classified by GO database (Supplementary Table S2), 65 differentially expressed unigenes (4.22%) were related to secondary metabolic process and categorized into 31 GO terms (Supplementary Table S6), suggesting that the differences in essential oil contents between four ajowan genotypes could have resulted from these genes. A comparison of the differentially expressed GO terms showed that genotype Shiraz-17 and Shiraz-21 had more similar gene expression patterns and regulation among four genotypes (Fig. 6). The differentially expressed GO terms, which related to the terpenoids process, are represented in four sets-Arak-10 vs. Shiraz-17, Arak-10 vs. Shiraz-21, Shiraz-17 vs. Arak-3 and Shiraz-21 vs. Arak-3. It can be concluded on the basis of the results that Arak and Shiraz ajowan ecotypes have different expressions and regulation patterns for genes related to the terpenoids process, probably due to different genetic backgrounds of the two ecotypes.
The summarization and clustering of GO terms present enriched GO clusters related to secondary metabolism processes, terpenoid biosynthesis, geranylgeranyl diphosphate biosynthesis, and acetyl-CoA metabolism particularly, leading us to the identification of genes related to differential terpenoid biosynthesis in the two ecotypes (Fig. 7, Supplementary Figs S5, S6 and S7). The interactive graph of the GO category's enrichment analysis (Fig. 8) showed that terpenoid biosynthesis and secondary metabolite biosynthesis GO terms were directly linked to lipid  biosynthesis GO term, highlighting the importance of those GO terms in secondary metabolite biosynthesis. Terpenoids are lipids and their production is biochemically dependent on sugar, amino acid, and triacylglycerol synthesis and catabolism through isoprene unit 48 . The terpenoid biosynthetic pathway, in other plants, has been studied and a lot of participating genes have been detected in this pathway [49][50][51] . Thymol biosynthesis utilizes the intermediates of terpenoid backbone 52 .
The KEGG pathway enrichment analysis identified 112 unigenes involved in the terpenoid backbone biosynthesis pathway (Table 3). For the terpenoid backbone pathway, more than one unigene for each enzymatic step was detected from annotation of unigenes. Hence, these unigenes could be part of one larger gene or members of the multigene families. This analysis may allow the detection of most of the paralogue genes, encoding the catalyzing enzymes for different steps. The number of identified unigenes for AACT, HDR and HMGR genes were 11, 9 and 8, respectively (Supplementary Table S8), suggesting these unigenes might have undergone various gene duplication events. These unigenes have been identified in other plant species producing terpenoid compounds such as Arabidopsis 53 , Gentiana macrophylla 54 , Phyllanthus amarus 37 , Gossypium raimondii and Glycine max 55 . The second step of the monoterpene biosynthetic pathway from GDS toward specific thymol biosynthesis in T. ammi is still uncharacterized. Accordingly, an effort has been made to detect the main genes and gene families involved in this step of thymol biosynthesis (Supplementary Tables S8, S9 and S13). The annotation results showed that three TPS genes were involved in monoterpenoid biosynthesis in ajowan (Fig. 9). QRT-PCR results showed two unigenes (56475 (ta_TPS2) and 37637 (ta_TPS1)) with complete CDS were significantly up-regulated in Arak ecotype compared to Shiraz ecotype, which might be the responsible unigenes for quantitative variation of monoterpene between two ecotypes. Based on the annotation results, 203 unigenes were identified as CYP450 gene family members. Similarly, the Arabidopsis genome contains 272 genes belonging to the CYP450 family. In our study, identification of differentially expressed unigenes related to CYP450s (25 unigenes) represented involvement of some of CYP450s genes in the thymol biosynthesis. Some of the CYP450s participate in essential oil biosynthesis 56,57 . In the menthol pathway in mint, a CYP450 responsible for the hydroxylation of the monoterpene limonene has been described 56 . Considering that thymol is an aromatic and hydroxylated compound, it is conceivable that cytochrome P450 enzymes are responsible for its formation. However, a definite proof of certain P450 enzymes that are able to catalyze the complete reaction still remains unknown. The formation of thymol in Thyme and Oregano species was also in the focus of a study 16 . C. Crocoll isolated the sequences of five cytochrome P450 enzymes from Origanum vulgare and Thymus vulgaris. It was shown that the expression of these genes correlated with the occurrence of thymol and carvacrol in the essential oils of thyme and oregano. Therefore a role in monoterpene biosynthesis can be hypothesized 16 . The relevance of cytochrome P450 genes to thymol biosynthesis was also shown in another study 57 . In this study, two P450 enzymes were investigated in order to clarify their role in the formation of thymol and carvacrol from γ-terpinene. In spite of some differences in details, all the mentioned studies demonstrate the role of CYP enzymes in hydroxylation of γ-terpinene and final production of thymol.
In this study, differentially expressed unigenes related to TFs gene families in inflorescence tissues of T. ammi were identified (Fig. 10C). Unigenes of NAC, WRKY, GATA, SBP, bZIP, bHLH, ERF, C3H, MYB, B3, and Nin-like TFs gene families had a high level of expression in inflorescence tissues of T. ammi (Fig. 11). In plants, secondary metabolism pathways have been found to be regulated by several transcription factor families including NAC, WRKY, ERF, MYB, and bHLH 58,59 . Identification of TFs regulating secondary metabolism pathways would be a powerful generic tool for plant metabolic engineering in ajowan. In general, this study gives rise to a valuable genomic resource data to explore tissue-specific thymol biosynthesis based on terpenoid diversity in future.  Fig. 2A,B).    RNA-Seq data processing and de novo assembly. Raw reads were subjected to quality control using the Trimmomatic software (Version 0.36) 62 to filter out adaptor and low-quality nucleotide/sequences. After trimming, FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) was used to examine the characteristics of the libraries and to verify the trimming efficiency. The high-quality filtered reads were used for downstream analyses. De novo assembly of the resulting pooled clean reads was conducted using Trinity (Release 2016-03-17) 63 and default parameters and kmer length of 25 (Fig. 2C). Resulted Trinity transcripts were clustered using CAP3 package by identity cutoff 95% and min overlap of 250 nt. Clustered assemblies and singletons were called unigenes, which represented putatively identified genes and the resultant sequences of Trinity were called unitranscripts, representing putative transcript isoforms. To identify the candidate Coding Sequence (CDS) regions within all transcript sequences, we used the TransDecoder tool (http://transdecoder.github.io) which is an ORF predictor. Protein sequences of predicted ORFs were used for annotation and functional analysis using databases such as KEGG and Uniprot. The identified unigenes from annotation result were checked for completeness of CDS.

Identification of unigenes and gene families related to terpenoid biosynthesis. Unigenes
involved in the triterpenoid backbone biosynthesis and monoterpenoid biosynthesis pathways were identified based on the annotation results. Metabolic pathway construction and visualization was performed by Pathvisio3 software 77 . CYP450, TPs, DHs and the TF gene family members were retrieved from the assembly and annotated results using in-house scripts. All identified unigenes and gene family members related to terpenoid biosynthesis were assessed and curated manually, using BLAST against NCBI and Uniprot databases. Differentially expressed unigenes of ajowan ecotypes was shown as a heatmap using the pheatmap package in RStudio 72 software. Calculation and visualization of Venn diagrams were performed by online software (http://bioinformatics. psb.ugent.be/webtools/Venn/).  Quantitative real-time PCR analysis. Quantitative expression of selected unigenes according RNA-Seq analysis of inflorescences tissues of ajowan ecotypes were carried out using the Real Time PCR Detection System (Corbett Rotor-Gene 6000 instrument, Corbett Life Science, Australia) and TaKaRa SYBR ® Green Permix Ex Taq TM II. Two internal control genes (SAND and eIF-4a) from T. ammi were used for estimating the relative transcript level of the analyzed unigenes. The REST software 78 (http://rest-2009.gene-quantification.info/) was used for data analysis of qRT-PCR amplification. Two technical replicates were used for all the qR-TPCR experiments. Specific oligonucleotides of selected genes for qRT-PCR analysis are shown in Supplementary Table S12.

Data Availability
All data generated during this study are included in this published article and its supplementary information files. Furthermore, the raw sequencing data of genotypes have been deposited in the NBCI (https://www.ncbi.nlm. nih.gov) under bioproject codes: PRJNA359623 and PRJNA362991; biosample accession numbers SRR5137050, SRR5137051, SRR5137053 and SRR5137052.