Introduction

Garden sage (Salvia officinalis L.) belongs to the genus Salvia, one of the best-known genera economically owing to its wide range of medicinal properties and rich aromatic oils. The genus Salvia (tribe Mentheae) is the largest of the Lamiaceae family and comprises nearly 1,000 species. Salvia plants are distributed mainly in three regions of the world, Central and South America (~500 species), West Asia (~200 species) and East Asia (~100 species), while the remaining Salvia species are spread throughout the world1. Most of these plants contain various medicinally active components used throughout history in folk medicine, e.g., S. japonica, S. tuxtlensis, S. guaranitica, S. miltiorrhiza, S. chloroleuca, S. aureus, S. przewalskii, S. epidermidis, S. santolinifolia, S. hydrangea, S. tomentosa, S. isensis, S. lavandulifolia, S. glabrescens, S. nipponica, S. fruticosa, S. allagospadonopsis, S. macrochlamys and S. recognita. Recently, Salvia species have become a valuable source for pharmaceutical research aimed at identifying and discovering biologically active compounds2. Essential oils of Salvia species exhibit significant bioactivities, including antimutagenic, anticancer, antimicrobial, anti-inflammatory, choleretic and antioxidant activities. Salvia essential oils contain more than 100 active compounds with pharmacological effects, which can be categorized into monoterpenes, sesquiterpenes, diterpenes and triterpenes2. During their biosynthesis, these terpenoids are built up sequentially from the C5 isoprene building blocks isopentenyl diphosphate (IPP) and dimethylallyl diphosphate (DMAPP). These units are condensed in a sequential manner by prenyltransferases, resulting in the formation of prenyl diphosphates such as geranyl diphosphate (GPP), farnesyl pyrophosphate (FPP) and geranylgeranyl pyrophosphate (GGPP)3. 
These prenyl diphosphates are the immediate precursors for the biosynthesis of mono-, sesqui-, di- and tetraterpenes. Despite the scientific and medicinal interest in the terpenoids of S. officinalis, the genes involved in the biosynthesis of these compounds have not yet been fully identified or understood. Plant secondary metabolites are widely used in the food and pharmaceutical industries, for example in fine chemicals and cosmetics. The biosynthesis, regulation and metabolic engineering of useful secondary metabolites have been extensively studied4. In recent years, next-generation sequencing (NGS)-based RNA sequencing (RNA-Seq) has become a powerful tool for discovering genes involved in the biosynthesis of various secondary metabolites in medicinal plants5. For example, the phenylpropanoid and terpenoid biosynthesis pathways in Ocimum sanctum and Ocimum basilicum6, the biosynthesis of active ingredients in Salvia miltiorrhiza7, the biosynthesis of carotenoids in Momordica cochinchinensis8, the biosynthesis of cellulose and lignin in Chinese fir (Cunninghamia lanceolata)9, and the biosynthesis of tea-specific compounds, i.e., the catechin, caffeine and theanine pathways in tea (Camellia sinensis)10, have been explored using NGS. Characterization of plant terpene synthases (TPSs) is typically carried out by producing the recombinant enzymes in Escherichia coli. This is often difficult because of enzyme solubility and codon usage issues. Furthermore, plant terpene synthases that are localized to the plastids, such as diterpene synthases, must be truncated, more or less by trial and error, to improve expression11,12. Transgenic tobacco (Nicotiana tabacum) is an efficient alternative and has been successfully used for the characterization of two diterpene genes in glandular trichomes, those for labdane and Z-abienol13. Here, we characterized genes that are involved in terpenoid biosynthesis in S. 
officinalis and determined their biological significance for terpenoid production in various tissues. In this study, a transcriptome database was established for S. officinalis leaves using NGS technology to identify and characterize genes related to the terpenoid biosynthesis pathway. The approaches used to achieve these objectives and to elucidate the complex metabolic pathways and genes underlying terpenoid production in S. officinalis were as follows: (i) transcriptome analysis of leaves using Illumina HiSeq 2000 sequencing; (ii) gas chromatography coupled with mass spectrometry (GC-MS) analysis of three fresh plant parts (old leaves, young leaves and stems); (iii) characterization of five terpene genes in transgenic N. tabacum; (iv) qRT-PCR of highly expressed genes involved in the biosynthesis of terpenoids; and (v) the combination of data from the transcriptome, qRT-PCR and the GC-MS metabolome to reveal the functions of metabolic genes involved in the biosynthesis of valuable terpenoids.

Results and Discussion

Identification of essential oil components

GC-MS analysis of n-hexane extracts from three fresh aerial parts of S. officinalis identified 236 bioactive phytochemical compounds. The numbers of bioactive phytochemical compounds obtained from young leaves, old leaves and stems were 113 (89.29%), 108 (91.54%) and 82 (85.27%), respectively. The results of the qualitative and quantitative analyses of all phytochemical compounds from the essential oils are reported in Table 1 and Supplementary Table S1. The identified phytochemical compounds are listed by retention time, compound mass and percentage of peak area (Fig. 1A,B). In young leaves, monoterpenes were the main group (66.64%), followed by sesquiterpenes (15.87%) and diterpenes (1.4%). In old leaves, monoterpenes were also the main group (52.7%), followed by sesquiterpenes (15.01%) and diterpenes (14.18%), and a single triterpene compound accounted for 0.16%. In stems, sesquiterpenes formed the main group (23%), followed by diterpenes (19.53%) and monoterpenes (19.11%), and a single triterpene compound accounted for 0.02% (Supplementary Table S1). Moreover, the three hexane extracts from the different tissues contained unique, common and major phytochemical compounds. For example, the essential oil extract of young leaves (A) had 61 unique compounds, 35 compounds shared with the extract from old leaves, five compounds shared with the extract from stems and 12 compounds shared among all three plant parts. The old leaves (B) contained 57 unique compounds and four compounds shared with the stems, and the stems (C) contained 61 unique compounds (Fig. 1C). 
Regarding the major phytochemical compounds, 1,8-cineole (41.20%) was the major compound in the essential oil extracts from young leaves, followed by β-caryophyllene (9.01%), camphor (6.27%), β-pinene (6.23%), and α-terpinenyl acetate (4.23%), whereas the essential oil extracts of old leaves were characterized by 1,8-cineole (25.93%), followed by camphor (11.52%), sugiol (10.80%), β-caryophyllene (5.51%), and α-caryophyllene (3.72%). Sugiol was the major compound of stem extracts (15.89%), followed by 1,8-cineole (12.37%), β-caryophyllene (10.23%), α-caryophyllene (7.30%), and then isocaryophyllene oxide (3.24%) (Table 1). Comparing the composition of the three essential oil extracts of S. officinalis, we found that some common compounds are present at different levels in the different plant parts (Fig. 1A). Additionally, some of the compounds found in S. officinalis have been detected in other Salvia species (Table 1 and Supplementary Table S1)14,15,16. We therefore suggest that the plant part can have a major effect on the composition of the essential oil. These and previous GC-MS data17,18 raise an important question: why do the monoterpene compounds of S. officinalis accumulate mostly in young leaves? This question was difficult to answer before the present work because information was lacking at the genetic level regarding the terpenoid biosynthetic pathway and how these compounds are synthesized in S. officinalis.
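The Venn counts reported for Fig. 1C can be cross-checked against the per-tissue totals with a few lines of Python. This is only a sketch: the dictionary layout and the one-letter tissue codes (Y, O, S) are illustrative, and it assumes that the 35, five and four shared compounds are each shared by exactly two plant parts.

```python
# Region sizes of the three-way Venn diagram (Fig. 1C), as reported in the text.
# Keys name the tissues sharing the compounds: Y = young leaves, O = old leaves, S = stems.
# Assumption: the pairwise counts exclude the 12 compounds common to all three parts.
venn_regions = {
    frozenset("Y"): 61,    # unique to young leaves
    frozenset("O"): 57,    # unique to old leaves
    frozenset("S"): 61,    # unique to stems
    frozenset("YO"): 35,   # young and old leaves only
    frozenset("YS"): 5,    # young leaves and stems only
    frozenset("OS"): 4,    # old leaves and stems only
    frozenset("YOS"): 12,  # all three plant parts
}


def tissue_total(tissue: str) -> int:
    """Sum every Venn region that includes the given tissue."""
    return sum(n for members, n in venn_regions.items() if tissue in members)


totals = {t: tissue_total(t) for t in "YOS"}
distinct = sum(venn_regions.values())  # each compound is counted in exactly one region

print(totals)
print(distinct)
```

Under these assumptions the per-tissue sums reproduce the 113, 108 and 82 compounds reported for young leaves, old leaves and stems. The regions add up to 235 distinct compounds, one fewer than the 236 stated in the text; the source does not explain the difference.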

Table 1 The major chemical compositions in the essential oils of S. officinalis.
Figure 1

Typical GC-MS chromatograms and mass spectra for terpenoids from young leaf, old leaf, and stem of Salvia officinalis. (A) GC-MS peaks of the essential oil extracts. (B) Mass spectrum of the GC peak, with retention time, for the major compound. (C) Three-way Venn diagram showing the number of unique and common compounds in the essential oil extracts from young leaf (A), old leaf (B), and stem (C) of Salvia officinalis.

Illumina sequencing and the de novo assembly of the S. officinalis leaf transcriptome

In the past few years, the Illumina sequencing platform has become a powerful method for analysing and discovering the genomes of non-model plants19,20. In this context, to generate transcriptome sequences, complementary DNA (cDNA) libraries were prepared from leaf tissues of S. officinalis, and the cDNA was then subjected to paired-end (PE) sequencing on an Illumina HiSeq 2000 platform. Previous reports involving Illumina sequencing showed that PE sequencing significantly improves the efficiency of de novo assembly and increases the depth of sequencing9,21. The cDNA sequencing generated 6.6 Gb of raw data from S. officinalis leaves. After filtering and removing the adapter sequences from the raw data, the number of reads was 21,487,871 (21.48 million), comprising 98,521,170 high-quality nucleotide bases, with 95.90% Q20, 91.69% Q30 and 48.73% GC content. For further analysis, high-quality reads were selected, and the transcriptome was assembled using the Trinity program22, which produced 88,554 transcripts with an N50 length of 1,793 bp, an N90 length of 479 bp and a mean length of 1,113 bp. Moreover, 48,671 unigenes were detected, with an N50 length of 1,485 bp, an N90 length of 298 bp and a mean length of 813 bp. The assembled transcript lengths ranged from 200 to >2,000 bp; the largest group (34,051 transcripts, 38.45%) ranged from 200 to 500 bp, followed by 22,529 transcripts (25.44%) ranging from 1,000 to 2,000 bp and then 17,658 transcripts (19.94%) ranging from 500 to 1,000 bp. The smallest group (14,316 transcripts, 16.17%) had a length of more than 2,000 bp. The assembled unigene lengths were likewise distributed between 200 and >2,000 bp. 
The largest group of unigenes (27,381 unigenes, 56.26%) ranged from 200 to 500 bp, followed by 8,576 unigenes (17.62%) ranging from 500 to 1,000 bp and then 8,068 unigenes (16.58%) ranging from 1,000 to 2,000 bp. Finally, the smallest group (4,646 unigenes, 9.54%) had a length of >2,000 bp. The length distributions of the transcripts and unigenes are shown in Supplementary Table S2 and Fig. S1. Our results are in good agreement with those for Boehmeria nivea, Medicago sativa, C. longa, Centella asiatica and Apium graveolens, in which the largest numbers of both transcripts and unigenes were found in the 75 to 500 bp range23,24.
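The assembly statistics quoted above follow standard definitions, which the sketch below illustrates in Python. The function names and the toy length list are hypothetical; the bin counts are the ones reported in the text. It is not the code used by Trinity or by the authors.

```python
def n_stat(lengths, fraction=0.5):
    """Return the Nx length (N50 for fraction=0.5, N90 for fraction=0.9).

    Sequences are sorted from longest to shortest; the Nx value is the length
    of the sequence at which the running total first reaches `fraction` of
    all assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    raise ValueError("lengths must be non-empty")


def bin_percentages(counts):
    """Convert per-bin sequence counts into percentages of the total."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


# Length-bin counts reported in the text (bp ranges).
transcript_bins = {"200-500": 34051, "500-1000": 17658, "1000-2000": 22529, ">2000": 14316}
unigene_bins = {"200-500": 27381, "500-1000": 8576, "1000-2000": 8068, ">2000": 4646}

print(sum(transcript_bins.values()), bin_percentages(transcript_bins))
print(sum(unigene_bins.values()), bin_percentages(unigene_bins))

# Toy example with made-up lengths, only to show how n_stat behaves.
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n_stat(toy_lengths), n_stat(toy_lengths, fraction=0.9))
```

The bin counts sum to the 88,554 transcripts and 48,671 unigenes given in the text, and the recomputed percentages agree with the reported values to within rounding.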

Functional annotation and classification of assembled S. officinalis unigenes

All 48,671 unigenes were compared against public databases, including the NCBI non-redundant protein sequences (NR), the NCBI nucleotide sequences (NT), the Kyoto Encyclopedia of Genes and Genomes (KEGG), the KEGG orthology (KO), Swiss-Prot, the protein family annotation (PFAM), Gene Ontology (GO), and the euKaryotic Ortholog Groups (KOG) annotation databases (Supplementary Table S3 and Fig. S2). The annotation percentages in this research were higher than those in other non-model plant studies [58% in safflower (Carthamus tinctorius) and 58.01% in Chinese fir (C. lanceolata)]9,25,26. The international standardized gene functional annotation system (GO annotation) provides a powerful way to recognize the functions and properties of sequences that have not been characterized for an organism27. The Blast2GO program was used to categorize the functions of the annotated unigenes, and a total of 22,891 unigenes (47.03% of all assembled unigenes) were mapped to at least one GO term. Based on sequence homology, the unigene sequences from S. officinalis were categorized into 48 functional groups under three general sections: 59,883 were assigned to the biological process (BP), 43,029 to the cellular component (CC) and 29,760 to the molecular function (MF) sections. Cellular process (13,933) and metabolic process (13,423) were the most enriched GO terms in the BP section. In the CC section, cell (8,737) and cell part (8,720) were the most enriched. Within the MF section, binding (13,539) and catalytic activity (11,726) were highly enriched (Fig. 2). These results revealed that the main GO classifications of the annotated unigenes were related to metabolism and fundamental biological regulation. These results were similar to previous results with the S. 
miltiorrhiza transcriptome and with the transcriptomes of O. sanctum and O. basilicum (members of the same family), in which metabolic process, cellular process, cell, cell part, binding and catalytic activity had the highest percentages28,29. Moreover, these results are in agreement with previous studies on de novo transcriptome assembly in the tuberous root of sweet potato, de novo transcriptome sequencing of R. sativus and de novo characterization of roots from the Chinese medicinal plant P. cuspidatum26,29. The categories with the fewest unigenes were channel regulator activity (56), cell junction (28) and cell killing (27). Therefore, the present work suggests that the large amount of data in the GO classifications can be used to identify new genes.

Figure 2

Functional annotation and classification of assembled unigenes from S. officinalis. Gene Ontology (GO) terms are summarized in three general sections of the biological process (BP), cellular component (CC) and molecular function (MF).

KEGG analysis of S. officinalis transcriptomes

The KEGG pathway database can facilitate the understanding of the functional annotations of enzymes and the biological functions of genes in terms of their networks7,30. To identify active biological pathways in the leaf tissues of S. officinalis, all 48,671 unigene sequences were mapped to the canonical pathways of KEGG, and 9,716 (19.96%) unigene sequences could be assigned to 267 KEGG pathways. Furthermore, all transcripts were classified into five larger pathway categories: cellular processes, environmental information processing, genetic information processing, metabolism and organismal systems (Fig. 3). The highest number of transcripts from S. officinalis was assigned to the metabolism category, followed by genetic information processing, organismal systems, and cellular processes, whereas the lowest number of transcripts was assigned to environmental information processing. Interestingly, 608 transcripts of S. officinalis were related to the biosynthesis of various secondary metabolites and were sorted into 27 subcategories, with phenylpropanoid biosynthesis (ko00940), terpenoid backbone biosynthesis (ko00900) and carotenoid biosynthesis (ko00906) representing the largest subcategories (Supplementary Table S4). These results are in agreement with previous results from the transcriptomes of O. sanctum and O. basilicum, which are members of the same family, and from de novo transcriptome sequencing of R. sativus, in which phenylpropanoid biosynthesis and terpenoid backbone biosynthesis had the highest percentages6,9.

Figure 3

KEGG classification into the five largest pathway categories: cellular processes (A), environmental information processing (B), genetic information processing (C), metabolism (D) and organismal systems (E).

Genes related to the biosynthesis of isoprenoids

Various types of terpenoids were found in the essential oil extracts of S. officinalis. The mixture contained mainly myrcene, (+)-neomenthol, 1,8-cineole, (3S)-linalool, α-humulene/β-caryophyllene, momilactone-A, gibberellin 3, gibberellin 2, ent-copalyl diphosphate, ent-kaurene, ent-kaurenoic acid, ent-isokaurene C2, gibberellin 20, and beta-amyrin. Precursor molecules for terpenoid biosynthesis are derived from the cytosolic mevalonate (MVA) and plastidial methyl-erythritol phosphate (MEP) pathways. Therefore, queries against the Lamiaceae family transcriptome libraries were used to identify genes that encode enzymes involved in the different steps of the terpenoid biosynthesis pathway, such as mevalonate diphosphate decarboxylase, isopentenyl phosphate kinase, isopentenyl pyrophosphate isomerase (which converts IPP to DMAPP), GPS (geranyl pyrophosphate synthase), FPS (farnesyl pyrophosphate synthase) and GGPS (geranylgeranyl pyrophosphate synthase)31,32. Furthermore, we identified isoprenoid genes and estimated their expression levels by using UniProt annotations against the transcriptome libraries (Table 2). From the annotation data analyses, we found many transcripts related to isoprenoid biosynthesis via the MEP pathway with higher expression levels, including SoDXS4, 1 (1-deoxy-D-xylulose-5-phosphate synthase 4, 1), SoDXR (1-deoxy-D-xylulose-5-phosphate reductoisomerase), SoMCT (2-C-methyl-D-erythritol 4-phosphate cytidylyltransferase), SoISPF (2-C-methyl-D-erythritol 2,4-cyclodiphosphate synthase), SoHDS2 ((E)-4-hydroxy-3-methylbut-2-enyl-diphosphate synthase 2), SoHDR2, 3 (4-hydroxy-3-methylbut-2-enyl diphosphate reductase 2, 3) and SoIDI1 (isopentenyl diphosphate isomerase 1). 
Additionally, we obtained some transcripts related to isoprenoid biosynthesis via the MVA pathway with higher expression levels, such as SoAACT1, 4 (acetyl-CoA C-acetyltransferase 1, 4), SoHMGS (hydroxymethyl glutaryl-CoA synthase), SoHMGR4, 3, 2 (hydroxymethyl glutaryl-CoA reductase 4, 3, 2), SoMVK (mevalonate kinase) and SoPMK (phospho-mevalonate kinase). Moreover, the transcriptome dataset of S. officinalis contained other genes, such as SoGPS, SoFPS2, and SoGGPSΙΙ10, which encode the enzymes that produce the immediate precursors of the mono-, sesqui-, and diterpene biosynthesis pathways. The SoGPS, SoFPS2, and SoGGPSΙΙ10 genes were highly abundant in leaves, with fragments per kilobase of transcript per million mapped fragments (FPKM) values of 20.23, 281.11 and 49.23, respectively (Fig. 4 and Table 2). Our results were similar to previously obtained results from the transcriptomes of O. sanctum and O. basilicum, which are members of the same family and have a higher number of transcripts for the DXS and GPPS genes related to the terpenoid biosynthesis pathway6.
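For readers unfamiliar with the unit, FPKM normalizes the number of fragments mapped to a transcript by both transcript length and sequencing depth. The following Python sketch shows the standard definition; the function name and the example numbers are hypothetical and are not taken from the S. officinalis dataset.

```python
def fpkm(fragments_on_transcript: int, transcript_length_bp: int, total_mapped_fragments: int) -> float:
    """Fragments per kilobase of transcript per million mapped fragments.

    FPKM = fragments * 1e9 / (transcript length in bp * total mapped fragments),
    i.e. fragments / (length in kb) / (library size in millions).
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return fragments_on_transcript * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical example: 1,000 fragments on a 2,000 bp transcript
# in a library of 20 million mapped fragments.
example = fpkm(1_000, 2_000, 20_000_000)
print(example)
```

Because the value is scaled by transcript length, a long and a short transcript with the same FPKM have different raw fragment counts, which is why FPKM rather than raw counts is used to compare genes such as SoGPS, SoFPS2 and SoGGPSΙΙ10.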

Table 2 Transcript abundance of MEP, MVA and other terpenoid backbone biosynthesis pathway genes as per the S. officinalis transcriptome data annotation.
Figure 4

Representative terpenoid biosynthesis pathway with cognate heat maps for transcript levels of genes from transcriptome data, with substrates and products; colored arrows connect substrates to their corresponding products. Green/red color-coded heat maps represent relative transcript levels of different terpenoid genes determined by Illumina HiSeq 2000 sequencing; red, upregulated; green, downregulated. Transcript levels are represented by FPKM (fragments per kilobase of transcript per million mapped fragments). MeV (MultiExperiment Viewer) software was used to depict transcript levels. DXS: 1-deoxy-D-xylulose-5-phosphate synthase, DXR: 1-deoxy-D-xylulose-5-phosphate reductoisomerase, MCT: 2-C-methyl-D-erythritol 4-phosphate cytidylyltransferase, ISPF: 2-C-methyl-D-erythritol 2,4-cyclodiphosphate synthase, HDS: (E)-4-hydroxy-3-methylbut-2-enyl-diphosphate synthase, HDR: 4-hydroxy-3-methylbut-2-enyl diphosphate reductase, IDI: isopentenyl-diphosphate delta isomerase, AACT: acetyl-CoA C-acetyltransferase, HMGS: hydroxymethyl glutaryl-CoA synthase, HMGR: hydroxymethyl glutaryl-CoA reductase (NADPH), MVK: mevalonate kinase, PMK: phospho-mevalonate kinase, GPPS: geranyl pyrophosphate synthase, FPPS: farnesyl pyrophosphate synthase, GGPS: geranylgeranyl pyrophosphate synthase, type II, CINO: 1,8-cineole synthase, MYS: myrcene/ocimene synthase, LINA: (3S)-linalool synthase, NEOM: (+)-neomenthol dehydrogenase, SABI: (+)-sabinene synthase, TPS6: (−)-germacrene D synthase, AMS: beta-amyrin synthase, SEQ: squalene monooxygenase, HUMS: α-humulene/β-caryophyllene synthase, GA2: gibberellin 2-oxidase, GA20: gibberellin 20-oxidase, E-KS: ent-kaurene synthase, MAS: momilactone-A synthase, GA3: gibberellin 3-beta-dioxygenase, E-KIA: ent-isokaurene C2-hydroxylase, E-KIH: ent-kaurenoic acid hydroxylase, E-CDS: ent-copalyl diphosphate synthase.

Genes related to terpene synthases

Plants produce various terpenoid compounds with highly diverse structures. These compounds play an important role and function in the interactions with environmental factors and in fundamental biological processes32,33. Multiple terpenoids are synthesized in plants by the expression of many TPS genes. Moreover, some TPSs have the ability to catalyse the production of multiple products. Thus, the TPS gene family was classified according to phylogenetic relationships into eight subfamilies (TPS a, b, c, d, e/f, g, and h), which comprise mono-, sesqui-, di- and triterpene synthases34. Therefore, the annotation of transcriptome data from S. officinalis against the Lamiaceae family and Arabidopsis revealed many terpene synthases involved in the terpenoid biosynthesis pathway, e.g., myrcene, (+)-neomenthol, 1,8-cineole, (3S)-linalool, α-humulene/β-caryophyllene, momilactone-A, gibberellin 3, gibberellin 2, ent-copalyl diphosphate, ent-kaurene, ent-kaurenoic acid, ent-isokaurene C2, gibberellin 20, beta-amyrin and squalene. From the dataset, 65 TPS unigenes were identified and determined based on sequence similarities with a TPS sequence in the canonical annotation reference database. Twenty unigenes were annotated as being involved in monoterpene biosynthesis, including myrcene/ocimene synthase, (+)-neomenthol dehydrogenase, 1,8-cineole synthase, (+)-sabinene synthase and (3S)-linalool synthase, and three other unigenes were annotated as being involved in sesquiterpene biosynthesis, including α-humulene/β-caryophyllene synthase and (−)-germacrene D synthase. Additionally, 29 unigenes were annotated as being involved in diterpene biosynthesis, including momilactone-A synthase, gibberellin 3-beta-dioxygenase, gibberellin 2-oxidase, ent-copalyl diphosphate synthase, ent-kaurene synthase, ent-kaurenoic acid hydroxylase, ent-isokaurene C2-hydroxylase and gibberellin 20-oxidase. 
Finally, 12 unigenes were annotated as being involved in triterpene biosynthesis, including beta-amyrin synthase, squalene monooxygenase, and farnesyl-diphosphate, and some of these 12 genes showed high abundance in leaves and higher FPKM values (Fig. 4 and Table 3). The compounds produced by these enzymes have significant pharmacological activities, such as anticancer, anti-HIV, antiviral, anti-inflammatory and antibacterial activities. Sesquiterpenoids are similar to triterpenoids in that both originate from farnesyl diphosphate (FDP). Triterpenoid compounds originate from the conversion of FDP into squalene by squalene synthase (SQS) and then into (S)-2,3-epoxysqualene by squalene monooxygenase (SQE). Subsequently, (S)-2,3-epoxysqualene is converted to beta-amyrin and camelliol C in the presence of multifunctional (S)-2,3-epoxysqualene cyclase via beta-amyrin synthase and camelliol C synthase, respectively. Similar reports about triterpenoid biosynthesis from (S)-2,3-epoxysqualene cyclases are available for O. basilicum and Catharanthus roseus35,36.

Table 3 Transcript abundance of TPS genes as per the S. officinalis transcriptome.

SSR discovery and analysis

The Illumina HiSeq 2000 system offers the opportunity to analyse molecular markers such as simple sequence repeats (SSRs) that are related to terpenoid pathway genes. SSR molecular markers have proven to be a powerful tool for understanding genetic variation. Moreover, polymorphic SSR markers are very important for the investigation of comparative genomics, genetic diversity, evolution, linkage mapping, gene-based association studies, and relatedness. Even though SNP markers have become promising, especially for studying complex genetic traits and high-throughput mapping, SSRs provide many advantages compared with other marker systems. Hence, SSRs have become the preferred codominant molecular marker for the construction of linkage maps37. The development of novel SSR molecular markers for S. officinalis could therefore be a valuable tool for breeding studies and genetic applications. SSR markers were identified from the transcriptome sequencing data using MISA (MIcroSAtellite) (http://pgrc.ipk-gatersleben.de/misa/misa.html). Of the 48,671 transcripts of S. officinalis, 7,439 transcripts were observed to have SSRs (Supplementary Table S5). The total number of SSRs identified in S. officinalis under the stringent selection criteria was 9,149. The analysis showed that dinucleotide repeats were the most abundant motif type in S. officinalis (4,295; 44.132%), followed by mononucleotide (2,348; 24.13%), trinucleotide (2,317; 23.81%), tetranucleotide (116; 1.191%), and hexanucleotide (39; 0.4%) types, while the pentanucleotide type was the least abundant motif (34; 0.35%) (Supplementary Tables S6 and S7 and Fig. S3). Except for the absence of mononucleotides, these results were similar to the previous results obtained from the transcriptomes of O. sanctum and O. 
basilicum (members of the same family), in which dinucleotide repeats were the most abundant motif type, followed by tri-, tetra- and hexanucleotide types, with the pentanucleotide type as the least abundant motif6. After analysing the number of repeat units for mono- to hexanucleotide motifs, we found that the most frequent number of repeat units in potential SSRs was 6, which accounted for 1,999 SSRs (21.86%), followed by 10 (1,490; 16.30%), 5 (1,411; 15.43%), and 7 (1,301; 14.23%), and the least frequent was ≥24 (7; 0.08%) (Supplementary Table S7). The AG/CT dinucleotide repeat was the most prevalent motif detected in all SSRs (2,999; 30.81%), followed by the A/T mononucleotide repeat (2,272; 23.34%). By contrast, the least abundant motifs in all SSRs (4; 0.041%) were the pentanucleotide repeats (AAAAC/GTTTT/AAAAG/CTTTT/AAAAT/ATTTT/AAACC/GGTTT) and the hexanucleotide repeats (AAAATG/ATTTTC/AAATAG/ATTTCT/AAATTC/AATTTG/AACAAT/ATTGTT). Finally, several SSR motifs were associated with many unique sequences that encode enzymes (e.g. SoDXS4, SoDXS5, SoHDR2, SoHMGS, SoHMGR3, SoFLDH, SoPCYOX1, SoFNTA, SoDHDDS1, SoDHDDS5, momilactone-A synthase, SoGGPSΙΙ7, SoGGPSΙΙ10, ent-copalyl diphosphate synthase, ent-kaurenoic acid hydroxylase, beta-amyrin synthase and squalene monooxygenase) involved in terpenoid biosynthesis (Supplementary Table S8).
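To make the idea of SSR mining concrete, the sketch below finds perfect mono- to hexanucleotide repeats in a sequence with regular expressions. It is a simplified illustration, not the MISA program itself. The minimum repeat counts in MIN_REPEATS are assumed values of the kind commonly used for this type of search; the text does not state the thresholds actually applied, and the test sequence is invented.

```python
import re

# ASSUMED minimum number of repeat units per motif length (1 = mononucleotide, ...).
# The thresholds used in the study are not given in the text.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(sequence: str):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    sequence = sequence.upper()
    covered = [False] * len(sequence)  # positions already assigned to an SSR
    hits = []
    for unit_len, min_rep in sorted(MIN_REPEATS.items()):
        # One motif copy followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit, e.g. "AA" or "AGAG".
            if any(
                motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            ):
                continue
            # Skip regions already reported for a shorter motif.
            if any(covered[match.start():match.end()]):
                continue
            for i in range(match.start(), match.end()):
                covered[i] = True
            hits.append((match.start(), motif, len(match.group(0)) // unit_len))
    return sorted(hits)


# Invented test sequence containing an (AG)7 and an (A)12 repeat.
demo = "TTGC" + "AG" * 7 + "CCGT" + "A" * 12 + "GCA"
print(find_ssrs(demo))
```

With thresholds of this kind, the shortest SSRs that can be reported have 6 dinucleotide, 10 mononucleotide or 5 trinucleotide repeat units. A production tool additionally handles compound and interrupted repeats, which this sketch ignores.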

Validation of the gene expression patterns by quantitative RT-PCR

To determine the reliability of the Illumina HiSeq 2000 read analysis, eleven candidate genes with higher differential expression were selected, and their expression profiles were compared among young leaf, old leaf, stem, flower and flower bud samples. Quantitative real-time (qRT) PCR was used to assess 'transcriptional control', that is, whether the number of mRNA copies of an enzyme corresponds to the quantity of its end-product. The correlation between the TPS mRNAs and their end-products therefore indicated a relationship between the terpenoid biosynthesis pathway of S. officinalis and the chosen differentially expressed genes (DEGs): monoterpene synthase (SoGPS; comp20551_c0), sesquiterpene synthase (SoFPS2; comp10352_c0), diterpene synthase (SoGGPS; comp25415_c0), myrcene/ocimene synthase (SoMYRS; comp11163_c0), 1,8-cineole synthase (SoCINS; comp26990_c0), (3S)-linalool synthase-2 (SoLINS; comp6814_c0), α-humulene/β-caryophyllene synthase (SoHUMS; comp101158_c0), (−)-germacrene D synthase (SoTPS6; comp26367_c0), squalene monooxygenase (SoSQUS; comp26984_c0), (+)-sabinene synthase (SoSABS; comp18462_c0) and (+)-neomenthol dehydrogenase (SoNEOD; comp10962_c0). SoACTIN was used as an internal reference gene (Supplementary Table S9). The expression patterns of the eleven selected DEGs in the young leaf, old leaf, stem, flower, and flower bud samples were examined by qRT-PCR (Fig. 5), and the results were consistent with those from the Illumina HiSeq 2000 read analysis. At this stage, we may be able to answer the question of which terpenoid compounds accumulate mostly in which S. officinalis tissue. The SoGPS, SoFPS, SoMYRS, and SoCINS genes showed the highest expression levels in young leaves, followed by old leaves, stems, flowers and flower buds. Moreover, the (+)-sabinene synthase (SoSABS) gene showed the highest expression levels in young leaves, followed by flower buds, old leaves, flowers, and stems. 
The (3S)-linalool synthase (SoLINS) gene showed the highest expression levels in stems, followed by flower buds, young leaves, old leaves, and flowers. Furthermore, the diterpene synthase gene SoGGPS showed the highest expression levels in stems, followed by old leaves, young leaves, flowers and flower buds. On the other hand, the SoTPS6 gene showed the highest expression levels in young leaves, followed by flower buds, old leaves, stems, and flowers. The squalene monooxygenase (SoSQUS) gene showed the highest expression levels in young leaves, followed by old leaves, flowers, flower buds, and stems. Finally, the α-humulene/β-caryophyllene synthase (SoHUMS) gene showed the highest expression levels in stems, followed by young leaves, old leaves, flower buds and flowers. These results were compatible with our GC-MS analysis data, which indicated that the main group of terpenes in young leaves, old leaves and stems consisted of mono- and sesquiterpenes. According to the GC-MS analysis, the major monoterpene compound in young and old leaves was 1,8-cineole (Table 1). Therefore, we suggest that young leaves are the primary site for the biosynthesis and accumulation of monoterpenes, sesquiterpenes and, in particular, 1,8-cineole, followed by old leaves and then stems. These results are in agreement with those of previous studies38,39, which reported that the main monoterpenes in S. officinalis and other Salvia species are formed and accumulate in very young leaf epidermal glands, as the formation of most epidermal glands and the accumulation of the monoterpenes take a very short time in young leaf tissues. Consequently, in our study we focused on young leaves, in which these genes are expressed at higher levels; monoterpenes and sesquiterpenes are also formed at their highest levels in young leaves. In addition, we found a correlation between 1,8-cineole accumulation and 1,8-cineole synthase (SoCINS) expression levels in different tissues. 
For instance, the most abundant 1,8-cineole accumulation and the highest SoCINS expression were in young leaves, followed by old leaves, stems, flowers and flower buds (Table 1 and Fig. 1). Our results are in line with those of previous studies40,41,42,43,44,45,46,47, which reported that monoterpene levels are thought to be controlled mainly at the transcriptional level of the different TPS enzymes. (+)-Neomenthol was not detected by GC-MS analysis, although it was expected from the gene expression analysis, in which a putative neomenthol dehydrogenase gene was detected in both the Illumina HiSeq 2000 reads and qRT-PCR. The reasons for this discrepancy are not yet known48. The combination of the analysed data from the Illumina HiSeq 2000 reads, qRT-PCR and GC-MS will pave the way for understanding the complex mechanisms that control and regulate the diverse production of terpene compounds.
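Relative expression from qRT-PCR with an internal reference gene such as SoACTIN is often calculated with the 2^(-ΔΔCt) method. The sketch below assumes that method for illustration only; the text here does not state which calculation was used, and the function name and Ct values are hypothetical.

```python
def relative_expression(ct_target: float, ct_reference: float,
                        ct_target_calibrator: float, ct_reference_calibrator: float) -> float:
    """Fold change of a target gene by the 2^(-ddCt) method (ASSUMED method).

    ct_target / ct_reference: Ct values of the target and reference gene in the sample.
    *_calibrator: the same two Ct values in the tissue chosen as the baseline.
    """
    delta_sample = ct_target - ct_reference
    delta_calibrator = ct_target_calibrator - ct_reference_calibrator
    return 2 ** -(delta_sample - delta_calibrator)


# Hypothetical Ct values: a target gene in young leaves relative to stems,
# both normalized to the reference gene.
fold_change = relative_expression(22.0, 18.0, 25.0, 18.0)
print(fold_change)
```

A fold change above 1 means the target transcript is more abundant in the sample than in the calibrator tissue after normalization to the reference gene, which is the kind of tissue ranking described above for genes such as SoCINS.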

Figure 5

Quantitative RT-PCR validation of the expression of terpene synthase genes selected from the DGE analysis in S. officinalis. Total RNA was extracted from young leaf, old leaf, stem, flower and bud flower samples, and the expression of SoNEOD, SoGPS, SoFPPS, SoGGPS, SoMYRS, SoLINS, SoHUMS, SoTPS6, SoSQUS, SoSABS and SoCINS was analysed using quantitative real-time PCR. SoACTIN was used as the internal reference. Values are means ± SE of three biological replicates.

Functional characterization of TPS genes in transgenic N. tabacum leaves

To test N. tabacum as a transgenic expression system for the production of Salvia terpenes, the following genes were selected from S. officinalis: (+)-neomenthol dehydrogenase (NEOD), 1,8-cineole synthase (CINS), (+)-sabinene synthase (SABS), (3S)-linalool synthase (LINS), and (−)-germacrene D synthase (TPS6), encoded by SoNEOD, SoCINS, SoSABS, SoLINS, and SoTPS6, respectively. Stable constitutive expression of the Salvia TPS genes in tobacco was achieved by infecting N. tabacum leaves with A. tumefaciens strain EHA105 carrying pB2GW7-NEOD, pB2GW7-CINS, pB2GW7-SABS, pB2GW7-LINS, and pB2GW7-TPS6 under the control of the 35S promoter. Leaf samples were collected 45 days after acclimatization of the transgenic tobacco plants (Fig. 6A). We then used semiquantitative RT-PCR to confirm the positive transgenic tobacco lines and to assess the expression levels of the terpene genes in the different samples (Fig. 6B and Supplementary Fig. S4). The terpenes were extracted with hexane and analysed by GC-MS. Mono-, sesqui-, di- and triterpene peaks were clearly detected, and the type and amount of each compound were expressed as the percentage of peak area (% peak area). Compounds were identified by comparing their mass spectra with mass spectral libraries. The annotation of the detected components was also confirmed by comparison with published references and with extracts of tobacco cultivars, which produce different types and amounts of terpenes49,50. Overexpression of SoNEOD, SoCINS, SoSABS, SoLINS, and SoTPS6 in tobacco plants produced different amounts of mono-, sesqui-, di-, and triterpenes and other terpenoids. Moreover, the results in Table 4, Supplementary Fig. 5 and Table S10 show that expression of the different Salvia TPS genes produced different types, as well as different amounts, of mono-, sesqui-, di-, and triterpenes and other terpenoid compounds.
We also observed a high similarity between the product patterns of the Salvia TPS genes and those of TPS genes from other plant species (Fig. 7).

Figure 6

Overexpression of five S. officinalis TPS genes in transgenic N. tabacum. (A) Transgenic tobacco plants after adaptation to soil pots. (B) Semiquantitative RT-PCR analysis of the terpene synthase gene expression.

Table 4 The major terpenoid compositions in transgenic N. tabacum leaves overexpressing SoNEOD, SoCINS, SoSABS, SoLINS, and SoTPS6.

Figure 7

Phylogenetic analysis of terpenoid biosynthesis genes from S. officinalis and other plants. The tree was built with the MEGA6 program using the neighbor-joining method.

The putative functions of the TPS genes isolated from S. officinalis were initially predicted from their conserved motifs using the InterPro protein sequence analysis and classification database (http://www.ebi.ac.uk/interpro/). The SoCINO protein, 591 aa in length, has an N-terminal domain (IPR001906) spanning residues 66–279 and a metal-binding domain (IPR005630) spanning residues 265–589. It contains two conserved motifs: an RR(x)8W motif (RRTGGYQPTLW) starting at residue 57 and a DDxxD motif (DDVFD) starting at residue 345, the latter lying within the metal-binding domain. The SoLINA protein, by contrast, is 505 aa in length. It has an N-terminal domain (IPR001906) spanning residues 1–183 and a metal-binding domain (IPR005630) spanning residues 171–497, and within the latter domain is a conserved DDxxD motif (DDIFD) starting at residue 250. Thus, these protein sequences contained one or two of the conserved motifs characteristic of the TPS gene family.
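
The motif positions reported above can be checked with a simple pattern search. The following is a minimal sketch, assuming the two motifs are written as the regular expressions RR.{8}W and DD..D; the function name and the toy sequence are hypothetical and are not taken from the S. officinalis proteins or from the InterPro workflow used in this study.

```python
import re

# Conserved terpene synthase motifs expressed as regular expressions.
# "x" in the motif notation means any residue, so it becomes "." here.
MOTIFS = {
    "RR(x)8W": re.compile(r"RR.{8}W"),
    "DDxxD": re.compile(r"DD..D"),
}


def find_motifs(protein):
    """Return (motif name, 1-based start, matched residues) for every hit."""
    hits = []
    for name, pattern in MOTIFS.items():
        for match in pattern.finditer(protein):
            hits.append((name, match.start() + 1, match.group()))
    # Report hits in order of their position along the protein.
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical toy sequence built around the two motif instances quoted in the text.
toy_protein = "MAS" + "RRTGGYQPTLW" + "GAGA" + "DDVFD" + "KL"
for hit in find_motifs(toy_protein):
    print(hit)
```

Positions are reported as 1-based residue numbers so that they can be compared directly with the coordinates given in the text.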

Croteau and coworkers revealed the carbocationic reaction mechanism common to all monoterpene synthases, reporting that the reaction is initiated by the divalent metal ion-dependent ionization of the substrate. The resulting cationic intermediate undergoes a series of hydride shifts or other rearrangements and cyclizations until the reaction is terminated by the addition of a nucleophile or by proton loss. They also illustrated this mechanism by studying the native enzymes with substrate inhibitors, analogues and intermediates51,52. Moreover, Croteau et al. (1987)53 elucidated the preliminary conversion of the geranyl cation to the tertiary linalyl cation, which facilitates cyclization to a six-membered ring. The linalyl cation then gives rise to the cyclic α-terpinyl cation through electrophilic attack of C1 on the C6–C7 double bond; the α-terpinyl cation is an important branch-point intermediate in the formation of all cyclic monoterpenes, because multiple terpene products can be obtained from it53. Taken together, the reaction mechanisms of monoterpene synthases are highly reticulate: an individual intermediate may have multiple fates, which explains the ability of terpene synthases to make various terpene products54,55,56,57. The carbocationic mechanism by which sesquiterpene synthases form sesquiterpenes through the cyclization of farnesyl pyrophosphate (FPP) is similar to that of monoterpene synthases. However, the larger carbon skeleton of FPP and the presence of three double bonds instead of two provide a rationale for the greater structural diversity of sesquiterpene products. Furthermore, the initial cyclization reactions of sesquiterpene synthases can be divided into two types.
The first type involves cyclization of the initially formed farnesyl cation to yield large rings, such as the 11-membered (E)-humulyl cation, with a C2–C3 double bond (this type has no barrier to cyclization). The second type involves cyclization that proceeds via the tertiary nerolidyl cation, which is produced by preliminary isomerization of the C2–C3 double bond. This isomerization is directly analogous to the isomerization of GPP to yield the linalyl cation in monoterpene synthesis. The nerolidyl cation is considered an intermediate in the sesquiterpene synthase mechanism58,59,60,61,62.

Collectively, the ability of TPS enzymes to convert a prenyl diphosphate substrate into diverse products during different reaction cycles is one of the unique traits of this enzyme class. As described above, this property is found in the majority of characterized monoterpene and sesquiterpene synthases. However, some monoterpene and sesquiterpene synthases convert their substrate into a single product, whereas others have specific mechanisms for forming multiple products. For example, γ-humulene synthase from A. grandis has two DDxxD motifs located on opposite sides and can generate 52 different sesquiterpenes. This protein is able to bind the substrate in two different conformations, resulting in different sets of products63. In another example, the first monoterpene synthase cloned from Salvia officinalis, (+)-sabinene synthase, produces 63% (+)-sabinene but also 21% γ-terpinene, 7.0% terpinolene, 6.5% limonene and 2.5% myrcene in in vitro assays64. These additional monoterpene products or their immediate metabolites are also found in the monoterpene-rich essential oil of the S. officinalis plant.

Conclusion

In this study, a large, high-quality transcriptome database was established for S. officinalis leaves using NGS technology to identify and characterize genes related to terpenoid biosynthesis. Through de novo sequencing and analysis of the S. officinalis transcriptome on the Illumina HiSeq 2000 system, we identified many genes that encode enzymes involved in the terpenoid biosynthesis pathway. Identifying these genes not only facilitates functional studies but also supports the development of biotechnology for improving the production of medicinal ingredients through metabolic engineering. We profiled terpenoids from three tissues of S. officinalis and used qRT-PCR to determine the correlation between the expression levels of TPS genes and their end-products. By combining transcriptome analysis (RNA-Seq and qRT-PCR) with metabolite analysis (GC-MS), this study paves the way for understanding the complex genetic basis of the production of diverse terpene compounds in garden sage. Our results will help to clarify the specific activities of TPSs in S. officinalis in producing compounds of interest and to develop new technologies for their utilization.

To our knowledge, this is the first study to use Illumina HiSeq 2000 paired-end sequencing technology to investigate the global transcriptome of S. officinalis. This valuable genetic resource will provide the foundation for future genetic and functional genomic research on S. officinalis or closely related species. We further studied the functions of several S. officinalis TPS genes (SoNEOD, SoCINS, SoSABS, SoLINS, and SoTPS6) by stably expressing them in transgenic N. tabacum plants. All five genes were functionally expressed in the leaves of N. tabacum, and the transgenes altered the levels of terpenoids, as confirmed by GC-MS analysis of transgenic leaf extracts. The GC-MS analysis revealed that these S. officinalis terpene synthases can convert a prenyl diphosphate substrate into diverse products, which is one of the unique traits of this type of enzyme. Our study provides new insights into plant terpenoid biosynthesis and its potential for biotechnological application.

Materials and Methods

Plant materials and tissue collection

Seeds of Salvia officinalis were obtained from the Egyptian Desert Gene Bank, North Sinai Research Station, Department of Plant Genetic Resources, Desert Research Center, Egypt, and grown at Huazhong Agricultural University, Wuhan, China. Different tissues were sampled from one-year-old S. officinalis plants. For RNA-Seq, three biological replicates of leaves were sampled. Each replicate consisted of two young and two old leaves from the same plant. For qRT-PCR, three biological replicates were collected from each of five plant parts (young leaves, old leaves, stems, flowers and bud flowers). All samples were immediately frozen in liquid nitrogen and then stored at −80 °C until RNA extraction. In addition, another three biological replicates from each of the three fresh parts were collected for isolation of the essential oil.

Isolation of chemical compounds

Reducing technical variability during sampling requires stopping cell metabolism and avoiding leakage of metabolites during the preparation steps that precede metabolite extraction. Therefore, three biological replicates from each of the three fresh parts were immediately frozen on dry ice. In the laboratory, the frozen samples were homogenized in liquid nitrogen with a mortar and pestle, after which the plant material (ca. 10 g) was soaked directly in n-hexane in 60-ml amber storage bottles; screw-top vials with silicone/PTFE septum lids (http://www.sigmaaldrich.com) were used to reduce the loss of volatiles to the headspace. The samples were then incubated with shaking at 37 °C and 200 rpm for 72 h. Afterwards, the solvent was transferred with a glass pipette to a 10-ml glass centrifuge tube with a screw top and silicone/PTFE septum lid and centrifuged at 5,000 rpm for 10 min at 4 °C to remove plant debris. The supernatant was pipetted into screw-cap glass vials and concentrated to 1.5 ml under a stream of nitrogen gas using a nitrogen evaporator (Organomation) and a water bath at room temperature (Toption-China-WD-12). The concentrated oil was transferred to a fresh 1.5-ml amber glass crimp vial; screw-top vials with silicone/PTFE septum lids were used to reduce the loss of volatiles to the headspace. For complete oil recovery, the film of crude oil remaining on the inner surface of the concentration vials was dissolved in a minimum volume of n-hexane, thoroughly mixed and transferred to the same 1.5-ml amber glass crimp vial. The vial was then either placed on the autosampler of the GC-MS system for analysis or, after closing with the screw top and silicone/PTFE septum lid, covered with parafilm and stored at −20 °C until GC-MS analysis.

GC-MS analysis of essential oil components

GC-MS analysis was performed using a Shimadzu GCMS-QP2010 Ultra system (Tokyo, Japan). An approximately 1 µl aliquot of each sample was injected (split ratio of 15:1) into the GC-MS, which was equipped with an HP-5 fused silica capillary column (30 m × 0.25 mm ID, 0.25 µm film thickness). Helium was used as the carrier gas at a constant flow of 1.0 ml min−1. Mass spectra were monitored between 50 and 450 m/z. The oven temperature was initially held isothermal at 60 °C for 10 min, then increased at a rate of 4 °C min−1 to 220 °C, held isothermal at 220 °C for 10 min, increased at 1 °C min−1 to 240 °C, held isothermal at 240 °C for 2 min, and finally held isothermal for 10 min at 350 °C. The volatile constituents were identified by comparing their recorded mass spectra with the data stored in the Wiley GC/MS Library (10th Edition) (Wiley, New York, NY, USA), with the Volatile Organic Compounds (VOC) Analysis S/W software, the NIST Library (2014 edition), the Adams Library (http://essentialoilcomponentsbygcms.com/list-of-compounds-in-the-essential-oil-components-database/) and the Terpenoids Library (http://massfinder.com/wiki/Terpenoids_Library_List), and by their retention time index (http://massfinder.com/wiki/MassFinder_Analysing_your_own_data). The relative amount (%) of each component was calculated by comparing its average peak area with the total peak area. All of the experiments were performed three times under the same conditions for each isolation technique, with a total GC running time of 80 min.
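
The relative amount calculation described above can be illustrated with a short sketch. It assumes that each compound has one integrated peak area per replicate run, that the areas are first averaged over the runs, and that each mean is then divided by the sum of all means. The function name and the peak areas are hypothetical and are not data from this study.

```python
def relative_peak_area(replicate_areas):
    """Convert per-replicate peak areas into relative amounts (% of total area).

    replicate_areas maps a compound name to a list of integrated peak areas,
    one per replicate GC-MS run.
    """
    # Average the peak area of each compound over its replicate runs.
    mean_area = {
        name: sum(areas) / len(areas) for name, areas in replicate_areas.items()
    }
    total_area = sum(mean_area.values())
    # Express each mean area as a percentage of the summed mean areas.
    return {name: 100.0 * area / total_area for name, area in mean_area.items()}


# Hypothetical peak areas (arbitrary units) from three replicate runs.
example_areas = {
    "1,8-cineole": [60.0, 62.0, 58.0],
    "camphor": [30.0, 29.0, 31.0],
    "alpha-humulene": [10.0, 9.0, 11.0],
}
for compound, percent in relative_peak_area(example_areas).items():
    print(f"{compound}: {percent:.1f}%")
```

Because the percentages are relative to the detected peaks only, they always sum to 100 and should not be read as absolute concentrations.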

RNA extraction

Total RNA was extracted from the three biological leaf replicates for RNA-Seq, from three biological replicates of each plant part (young leaves, old leaves, stems, flowers and bud flowers) for qRT-PCR, and from three biological replicates of transgenic N. tabacum for semiquantitative RT-PCR. RNA was extracted using TRIzol Reagent (Invitrogen, USA) and treated with DNase I (Takara). RNA quality was examined on 1% agarose gels, and the purity was analysed using a Nano-Photometer® spectrophotometer (IMPLEN, CA, USA). RNA concentration was determined using a Qubit® RNA Assay Kit in a Qubit® 2.0 Fluorometer (Life Technologies, CA, USA). RNA pools for the cDNA libraries were prepared by mixing equal volumes of the three RNA replicates in one tube.

cDNA library preparation and sequencing

Three micrograms of RNA per sample were used to generate a sequencing library. Sequencing libraries were generated using an RNA Library Prep Kit for Illumina® (NEB, USA) according to the manufacturer’s instructions. The first strand of cDNA was synthesized in the presence of random hexamer primers and M-MuLV Reverse Transcriptase (RNase H−), and the second strand of cDNA was synthesized in the presence of DNA polymerase I and RNase H. The remaining overhangs were converted into blunt ends via exonuclease/polymerase activities. After adenylation of the 3′ ends of the DNA fragments, a NEBNext adaptor with a hairpin loop structure was ligated to prepare for hybridization. To select cDNA fragments preferentially 150–200 bp in length, the library fragments were purified using an AMPure XP system (Beckman Coulter, Beverly, USA). Then, 3 μl of USER Enzyme (NEB, USA) was incubated with the size-selected, adaptor-ligated cDNA at 37 °C for 15 min followed by 95 °C for 5 min. Afterwards, PCR was performed with Phusion High-Fidelity DNA polymerase, universal PCR primers and Index (X) Primer. Finally, the PCR products were purified (AMPure XP system), and the library quality was assessed using an Agilent Bioanalyzer 2100 system (Agilent Technologies, CA, USA). Clustering of the index-coded samples was performed on a cBot Cluster Generation System using a TruSeq PE Cluster Kit v3-cBot-HS (Illumina) according to the manufacturer’s instructions (Novogene Experimental Department). After cluster generation, the library preparations were sequenced on an Illumina HiSeq 2000 platform, and paired-end reads were generated.

Quality control

Raw data (raw reads) in fastq format were first processed with in-house Perl scripts. During this step, clean data (clean reads) were obtained by removing reads containing adapters, reads containing poly-N and low-quality reads from the raw data. At the same time, the Q20, Q30, GC content and sequence duplication level of the clean data were calculated. All of the downstream analyses were based on high-quality clean data.
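
The Q20, Q30 and GC content statistics mentioned above can be sketched as follows. This is not the in-house Perl pipeline; it is a minimal illustration that assumes Phred+33 quality encoding and takes reads as (sequence, quality string) pairs. The function name and the example read are hypothetical.

```python
def read_quality_stats(reads, phred_offset=33):
    """Compute Q20, Q30 and GC content (all in %) over a collection of reads.

    reads is an iterable of (sequence, quality_string) pairs taken from FASTQ
    records. phred_offset=33 is an assumption about the quality encoding.
    """
    total_bases = gc_bases = q20_bases = q30_bases = 0
    for sequence, quality in reads:
        for base, quality_char in zip(sequence, quality):
            phred = ord(quality_char) - phred_offset  # decode the Phred score
            total_bases += 1
            gc_bases += base in "GCgc"
            q20_bases += phred >= 20
            q30_bases += phred >= 30
    if total_bases == 0:
        raise ValueError("no bases to summarize")
    return {
        "Q20": 100.0 * q20_bases / total_bases,
        "Q30": 100.0 * q30_bases / total_bases,
        "GC": 100.0 * gc_bases / total_bases,
    }


# Hypothetical read: 'I' encodes Q40, '5' encodes Q20 and '+' encodes Q10.
example_reads = [("GCAT", "II5+")]
print(read_quality_stats(example_reads))
```

Q20 and Q30 here are the percentages of bases with a Phred score of at least 20 and 30, respectively, computed over all bases rather than per read.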

De novo transcriptome assembly

De novo assembly of the processed reads was carried out using the Trinity program (version trinityrnaseq_r2012-10-05)20, with min_kmer_cov set to 2 and all other parameters set to their defaults. The Trinity method consists of three software modules, (1) Inchworm, (2) Chrysalis and (3) Butterfly, applied sequentially to process large volumes of RNA-Seq reads. In the first step, the read datasets were assembled into linear contigs by the first module (Inchworm). The minimally overlapping contigs were then clustered into sets of connected components (graph components) by the second module (Chrysalis), and the transcripts were constructed from each de Bruijn graph by the third module (Butterfly). Finally, transcripts were clustered using a multiple sequence alignment tool when the matched length exceeded 80% of the longer transcript or 90% of the shorter transcript.

Annotation of unigenes

Unigenes were used as query sequences to search the annotation databases, including the NCBI non-redundant protein sequences database (NR) (http://www.ncbi.nlm.nih.gov/) and Swiss-Prot (a manually annotated and reviewed protein sequence database) (http://www.ebi.ac.uk/uniprot/), and were annotated based on sequence homology to entries in the Gene Ontology (GO) database (http://www.geneontology.org/). Unigene sequences from S. officinalis were categorized into three general GO sections: biological process (BP), cellular component (CC) and molecular function (MF). Additionally, the unigenes were used as query sequences for searching the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways database (http://www.genome.jp/kegg/) and the Pfam (Protein family) database (http://pfam.sanger.ac.uk/).

Differential expression analysis

Expression levels of unigenes were normalized and calculated as fragments per kilobase of transcript per million mapped fragments (FPKM) during the assembly and clustering process. Differential expression analysis of unigenes was performed using the DESeq R package (1.10.1). DESeq provides statistical routines for assessing differential gene expression in leaf tissues; genes were assigned as differentially expressed when the P-value was < 0.05. P-values were corrected using the Benjamini and Hochberg approach for controlling the false discovery rate (FDR)65.
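
The two calculations named above, FPKM normalization and the Benjamini and Hochberg correction, can be sketched in a few lines. This is an illustration of the standard formulas only and is not the DESeq implementation; the function names and the example numbers are hypothetical.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    count = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(count), key=lambda index: p_values[index])
    adjusted = [0.0] * count
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonic adjusted values.
    for rank in range(count, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * count / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical unigene: 500 fragments, 2,000 bp long, 10 million mapped fragments.
print(fpkm(500, 2000, 10_000_000))
# Hypothetical raw P-values for four unigenes.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```

In the second example, the unigene with a raw P-value of 0.04 receives an adjusted value above 0.05, which shows why the correction matters when many unigenes are tested at once.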

Quantitative real-time PCR (qRT-PCR) analysis

Quantitative RT-PCR was performed using an iQ™5 Multicolor Real-Time PCR Detection System (Bio-Rad, USA) as described previously66 with SYBR Green Master (ROX) (Newbio Industry, China), following the manufacturer’s instructions, in a total reaction volume of 20 µl. Gene-specific primers for the reference gene SoActin and for eleven genes involved in terpene biosynthesis (SoNEOD, SoGPS, SoFPPS, SoGGPS, SoMYRS, SoLINS, SoHUMS, SoTPS6, SoSQUS, SoSABS and SoCINS) were designed using the primer design tools of IDTdna (http://www.idtdna.com) and are listed in Supplementary Table S9. The cycling conditions were 95 °C for 3 min, 40 cycles of amplification (95 °C for 10 s, 60 or 58 °C for 30 s and 72 °C for 20 s), and a final extension at 65 °C for 1 min. Gene expression was normalized using SoActin as the reference gene, and the relative expression levels of the eleven target genes were calculated by comparing their cycle thresholds (CTs) with that of SoActin using the 2^−ΔΔCt method67,68. The amplification products were 140–160 bp in size. The quantified data were analysed using the Bio-Rad iQ™5 Multicolor Real-Time Manager software. All reactions were performed with three replicates.
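
The 2^−ΔΔCt calculation can be written out as a brief sketch. It assumes one mean CT value per gene and sample and a calibrator sample against which the other samples are compared; the choice of calibrator is an assumption here, because the text does not state which tissue was used. The function name and the CT values are hypothetical.

```python
def relative_expression(ct_target, ct_reference, ct_target_cal, ct_reference_cal):
    """Relative expression of a target gene by the 2^-ddCt method.

    ct_target and ct_reference are the CT values of the target and reference
    gene in the sample of interest; the *_cal arguments are the same two
    values in the calibrator sample.
    """
    delta_ct_sample = ct_target - ct_reference
    delta_ct_calibrator = ct_target_cal - ct_reference_cal
    delta_delta_ct = delta_ct_sample - delta_ct_calibrator
    return 2.0 ** (-delta_delta_ct)


# Hypothetical CT values: the target crosses the threshold three cycles
# earlier in the sample than in the calibrator, at equal reference gene CT.
fold_change = relative_expression(
    ct_target=22.0, ct_reference=18.0, ct_target_cal=25.0, ct_reference_cal=18.0
)
print(fold_change)
```

The method assumes that the target and reference amplicons amplify with similar, near-100% efficiency, so that each cycle corresponds to a twofold change.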

Identification of simple sequence repeats (SSRs)

All of the transcripts of S. officinalis were analysed with the MISA program version 1.0 (http://pgrc.ipk-gatersleben.de/misa/misa.html) for the detection of SSR motifs with mono- to hexanucleotide repeats. In addition, primers for each SSR were designed using Primer3 version 2.3.5 (http://primer3.sourceforge.net/releases.php). The minimum number of SSR repeat units during analysis was ≥24 for mono- and dinucleotides and was 8, 7, 7, and 9 for tri-, tetra-, penta-, and hexanucleotide repeats, respectively. The default parameters for unigene SSR detection, given as unit size and minimum number of repetitions, were 1–10, 2–6, 3–5, 4–5, 5–5 and 6–5.
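
The kind of search that these parameters control can be sketched with regular expressions. The following is not the MISA program; it is a minimal illustration that assumes the unit size and minimum repeat pairs listed above (1–10, 2–6, 3–5, 4–5, 5–5 and 6–5) and ignores compound SSRs. The function name and the test sequence are hypothetical.

```python
import re

# Unit size (bp) mapped to the minimum number of repeats, as listed above.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(sequence, min_repeats=MIN_REPEATS):
    """Return (repeat unit, number of repeats, 1-based start) for each SSR."""
    sequence = sequence.upper()
    hits = []
    for unit_length, minimum in sorted(min_repeats.items()):
        # A unit followed by at least (minimum - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_length, minimum - 1))
        for match in pattern.finditer(sequence):
            unit = match.group(1)
            # Skip units that are themselves repeats of a shorter unit,
            # for example "AA" (mononucleotide) or "ATAT" (dinucleotide).
            if any(
                unit == unit[:size] * (unit_length // size)
                for size in range(1, unit_length)
                if unit_length % size == 0
            ):
                continue
            hits.append((unit, len(match.group()) // unit_length, match.start() + 1))
    return sorted(hits, key=lambda hit: hit[2])


# Hypothetical transcript fragment containing three SSRs.
example_sequence = "GGC" + "A" * 12 + "CGT" + "AG" * 7 + "TTC" + "CTT" * 5 + "G"
for ssr in find_ssrs(example_sequence):
    print(ssr)
```

Each repeat class is searched separately, so a run such as twelve A residues is reported once as a mononucleotide SSR rather than again as a dinucleotide repeat of AA.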

Full-length terpene synthase cDNA clones and vectors

Full-length cDNAs for SoNEOD, SoCINS, SoSABS, SoLINS and SoTPS6 were obtained by PCR amplification using short and long gene-specific primers (Supplementary Table S11) designed from the RNA-Seq sequence information obtained by transcriptome sequencing of S. officinalis leaves. Leaf cDNA was used as the template for the initial PCR amplification, which was performed using the short primers and KOD-Plus DNA polymerase (Novagen) under the following conditions: 3 min at 94 °C, followed by 35 cycles of 10 s at 98 °C, 30 s at the annealing temperature (60, 57, 59, 60 or 60 °C, depending on the gene) and 1.5 min at 68 °C, and then 10 min at 68 °C. The cDNA was then used as the template for PCR with the long primers and KOD-Plus DNA polymerase to generate products for cloning into the Gateway pDONR221 vector. The amplified PCR products were purified and cloned into the Gateway entry vector pDONR221 using BP Clonase (Invitrogen, USA). The resulting pDONR221 constructs harbouring the target genes were sequenced, and Gateway LR Clonase (Invitrogen, USA) was used for recombination into the destination vector pB2GW7 for tobacco transformation. All final constructs containing SoNEOD, SoCINS, SoSABS, SoLINS and SoTPS6 were confirmed by sequencing.

Nicotiana plant growth conditions and preparation of Agrobacterium cultures for infection

Wild-type Nicotiana tabacum plants were grown from seed under standard greenhouse conditions for ten days at the Wuhan Doublehelix Biology Science and Technology Company, Wuhan, Hubei, China. The pB2GW7 constructs carrying each of the inserted genes were introduced into Agrobacterium tumefaciens strain EHA105 by direct electroporation. Recombinant A. tumefaciens was grown for two days at 28 °C on solid LB medium supplemented with 50 μg/ml each of rifampicin and spectinomycin. An individual colony of each sample was inoculated into 1.0 ml of liquid medium of the same composition and grown overnight at 28 °C with agitation at 200 rpm. After 24 h, the 1.0 ml culture of each sample was transferred to a 250-ml conical flask containing 50 ml of LB medium with the same supplements and grown overnight at 28 °C in a shaker until an optical density at 600 nm (OD600) of 0.6–1.0 was reached. The cells were harvested by centrifugation at 5,000 rpm for 10 min at 4 °C, and the pellet was resuspended in the infection medium (50 ml of LB-free media + 50 μl of acetosyringone). Leaves of N. tabacum plantlets were collected from the greenhouse and sterilized by soaking in 70% ethanol for 30 s and in 0.1% HgCl2 for 6 min, followed by three washes of 3 min each in autoclaved water. The leaves were then cut into small pieces (1 cm × 1 cm), discarding the petiole and midrib, and the leaf pieces were soaked in infection medium in Petri dishes for 10 min with stirring every 2 min. The transformation procedure was performed as described previously69. More than 15 individual transgenic tobacco lines were generated for each transgene and screened by PCR, which identified more than 10 positive lines per transgene. Positive plants with good roots were transferred to the greenhouse for adaptation. The transgenic tobacco plants were then analysed for terpenoid profiles and target gene expression.

Semiquantitative RT-PCR analysis

Semiquantitative RT-PCR was performed on an Eppendorf PCR system (Eppendorf Mastercycler-Nexus GSX1, POCD Scientific, Australia) with a total reaction volume of 25 µl. Gene-specific primers for the reference gene NtEF-1α (Nicotiana tabacum EF-1-alpha-related GTP-binding protein) and for the five terpene biosynthesis genes SoNEOD, SoCINS, SoSABS, SoLINS, and SoTPS6 were designed using the primer design tools of IDTdna (http://www.idtdna.com/scitools/Applications/RealTimePCR/); the primer sequences are listed in Supplementary Table S9. The semiquantitative RT-PCR conditions were as follows: a predenaturation step at 95 °C for 4 min, 35 cycles of amplification (95 °C for 30 s, 58 or 60 °C for 30 s and 72 °C for 1 min), and a final extension step at 72 °C for 10 min. The PCR products were resolved on a 1% agarose gel to detect the expression levels of NtEF-1α, SoNEOD, SoCINS, SoSABS, SoLINS, and SoTPS6.

Metabolite extraction from transgenic tobacco leaves

Terpenoid compounds were extracted and isolated from non-transgenic tobacco leaves (control) and from transgenic tobacco leaves containing the SoNEOD, SoCINS, SoSABS, SoLINS, or SoTPS6 expression construct. For this, three leaves from each transgenic tobacco line (one leaf from each plant) were homogenized in liquid nitrogen with a mortar and pestle, after which the powdered plant material was soaked directly in n-hexane in 60-ml amber storage bottles; screw-top vials with silicone/PTFE septum lids (http://www.sigmaaldrich.com) were used to reduce the loss of volatiles to the headspace. The samples were then incubated with shaking at 37 °C and 200 rpm for 72 h. Afterwards, the solvent was transferred with a glass pipette to a 10-ml glass centrifuge tube with a screw top and silicone/PTFE septum lid and centrifuged at 5,000 rpm for 10 min at 4 °C to remove plant debris. The supernatant was pipetted into screw-cap glass vials and concentrated to 1.5 ml under a stream of nitrogen gas using a nitrogen evaporator (Organomation) and a water bath at room temperature (Toption-China-WD-12). The concentrated oil was transferred to a fresh 1.5-ml amber glass crimp vial; screw-top vials with silicone/PTFE septum lids were used to reduce the loss of volatiles to the headspace. For complete oil recovery, the film of crude oil remaining on the inner surface of the concentration vials was dissolved in a minimum volume of n-hexane, thoroughly mixed and transferred to the same 1.5-ml amber glass crimp vial. The vial was then either placed on the autosampler of the gas chromatography mass spectrometer (GC-MS) system for analysis or, after closing with the screw top and silicone/PTFE septum lid, covered with parafilm and stored at −20 °C until GC-MS analysis. The same programme and standard conditions used for the GC-MS analysis of S. officinalis essential oil components were applied.

Gene accession number

The genes studied here are available in GenBank under the following accession numbers: Salvia officinalis geranyl-diphosphate synthase (SoGPS, KY399788); farnesyl pyrophosphate synthetase (SoFPPS, KY399787); (3S)-linalool synthase (SoLINS, KY399786); terpene synthase 6 (SoTPS6, KY399785); (−)-germacrene D synthase (SoSABS, KY399783); Salvia officinalis 1,8-cineole synthase (SoCINS, KY399782); Salvia officinalis geranyl diphosphate synthase 2 (SoGGPP, KY486794); Salvia officinalis squalene monooxygenase (SoSQUS, KY486795).