De novo assembly and analysis of the Artemisia argyi transcriptome and identification of genes involved in terpenoid biosynthesis

Artemisia argyi Lev. et Vant. (A. argyi) is widely utilized for moxibustion in Chinese medicine, and the mechanism underlying terpenoid biosynthesis in its leaves is suggested to play an important role in its medicinal use. However, the A. argyi transcriptome has not been sequenced. Herein, we performed RNA sequencing for A. argyi leaf, root and stem tissues to identify as many as possible of the transcribed genes. In total, 99,807 unigenes were assembled by analysing the expression profiles generated from the three tissue types, and 67,446 of those unigenes were annotated in public databases. We further performed differential gene expression analysis to compare leaf tissue with the other two tissue types and identified numerous genes that were specifically expressed or up-regulated in leaf tissue. Specifically, we identified multiple genes encoding significant enzymes or transcription factors related to terpenoid synthesis. This study serves as a valuable resource for transcriptome information, as many transcribed genes related to terpenoid biosynthesis were identified in the A. argyi transcriptome, providing a functional genomic basis for additional studies on molecular mechanisms underlying the medicinal use of A. argyi.


Identification of genes involved in terpenoid backbone biosynthesis by KEGG analysis.
To discover the most significant biological pathways, 44,750 unigenes were annotated in the KEGG database and classified into five classes, cellular process, genetic information processing, metabolism, organismal systems and environmental information processing, comprising 19 subcategories (135 pathways) (Fig. 1A). A total of 12 pathways were involved in the biosynthesis of other secondary metabolites, among which the most genes were enriched in the phenylpropanoid biosynthesis pathway (Fig. 1B). The "metabolism of terpenoids and polyketides" subcategory contained 8 pathways, and the largest number of unigenes (241) were mapped to terpenoid backbone biosynthesis (Fig. 1C). Among these 241 unigenes, 114 were identified as encoding 16 key enzymes that control terpenoid biosynthesis, including acetyl-CoA acetyltransferase (AACT), hydroxymethylglutaryl-CoA synthase (HMGS), hydroxymethylglutaryl-CoA reductase (HMGR), mevalonate kinase (MK), phosphomevalonate kinase (PMK), mevalonate diphosphate decarboxylase (MVD), 1-deoxy-D-xylulose-5-phosphate synthase (DXS), 1-deoxy-D-xylulose-5-phosphate reductoisomerase (DXR), 2-C-methyl-D-erythritol 4-phosphate cytidylyltransferase (CMS), 4-diphosphocytidyl-2-C-methyl-D-erythritol kinase (CMK), 2-C-methyl-D-erythritol Overview of unigene expression. In each sample, all of the expressed unigenes (fragments per kilobase of transcripts per million fragments mapped (FPKM) > 1) 27 were determined, and 41,139, 41,516 and 44,750 unigenes were expressed in leaves, roots and stems, respectively (Fig. 3A). The overall expression levels were the highest for leaf transcripts, followed by the stem and root transcripts (Fig. 3B). Transcripts expressed at low levels in the three tissues were filtered with a geometric mean (FPKM + 1) < 3 as the threshold 27 , generating 43,023 unigenes in these tissues. Hierarchical clustering of the three tissues with these 43,023 unigenes showed that   leaves and stems clustered more tightly, demonstrating that the overall expression levels of transcripts in these two tissues were more closely related (Fig. 3C).
Identification of genes with leaf-specific expression and differentially expressed genes. A total of 24,505 shared unigenes were identified in all three tissues, and 8,541 were uniquely expressed in leaves (Fig. 4A). Among the shared unigenes expressed in all tissues, 603 showed leaf-specific up-regulation with fold changes (FCs) > 8, and these genes were further evaluated using GOSlim functional analysis. Based on sequence homology, these 603 unigenes were assigned to one or more ontologies, including 260 for cellular component, 328 for biological process, and 389 for molecular function (Supplementary Table S3). In the biological processes category, several genes were enriched for the term "secondary metabolic process", indicating important metabolic activities in leaves. The total differentially expressed genes (DEGs) were detected among the samples by using unigene expression analysis (Fig. 4B). Comparison of the leaves and roots revealed 25,049 DEGs, of which 15,376 were up-regulated (higher expression in leaves) and 9,673 were down-regulated (lower expression in leaves). Comparison of the leaves and stems revealed 19,588 DEGs, of which 8,409 were up-regulated in the leaves and 11,179 were down-regulated. To further connect the biological functions of the DEGs, all DEGs were mapped to the KEGG database and compared to the entire A. argyi transcriptome. Indeed, 134 pathways, primarily related to metabolism, biosynthesis of secondary metabolites and plant-pathogen interactions, were enriched in DEGs (Supplementary Table S4). The "metabolism of terpenoids and polyketides" subcategory was particularly enriched  in DEGs. Overall, 251 genes were up-regulated in leaves compared to roots, while 148 genes were up-regulated in leaves compared to stems ( Table 3).

Identification of transcription factors involved in terpenoid biosynthesis.
In plants, TFs participate in a wide variety of biological processes and play major roles in regulating gene expression at the transcriptional level to control secondary metabolite flux. A total of 2,056 unigenes encoding TFs were identified and classified into 59 different TF families (Fig. 5). Among these TFs, 200, 129, 106 and 26 unigenes were annotated to the AP2-EREBP, bHLH, WRKY and bZIP families, respectively.

Discussion
To extend the possible applications of A. argyi in TCM, three different tissues were utilized for library construction and sequencing, and approximately 74 billion clean reads were generated from each tissue. In total, 99,807 and 67,446 unigenes were assembled and annotated, respectively, among which 19,256 were co-annotated in the databases. However, 33% of the unigenes remain unannotated, probably because more unigenes were generated with the sequencing depth of 10 G and because the publicly available plant transcriptome and genome data are insufficient. These predicted CDSs, accounting for 60.0% of the total unigenes, provide information for studying crucial genes, including genes encoding lectins 28 and ribosome-inactivating proteins 29 , which are potential anticancer drugs.
The best hit for each unigene queried against the Nr database was used to assign functional GO annotations in terms of the categories cellular component, biological process and molecular function. The large number of  Table 3. The terpenoid and polyketide metabolic pathway and the numbers of related DEGs in leaves compared with the other two tissues. diverse GO terms assigned to the unigenes highlights the diversity of genes likely represented in the A. argyi leaf, root and stem transcriptomes. Upon mapping these unigenes to the KEGG database, numerous unigenes involved in terpenoid biosynthesis were identified. In addition, we examined the expression levels of unigenes encoding enzymes in the MVP and MEP pathways based on FPKM values (Fig. 2). The unigenes encoding PMK, MVD, DXR, CMS, CMK and HDS were highly differentially expressed in leaves, suggesting that these steps may be rate-limiting in IPP and DMAPP formation, which occur upstream of terpenoid synthesis. Characterization of these unigenes will further improve our understanding of the molecular mechanisms underlying terpenoid biosynthesis. The overall transcript expression level was higher in the leaves than in the roots and stems. According to the DEG annotation, hundreds of genes that were up-regulated in leaves were associated with the metabolism of terpenoids and polyketides. These up-regulated genes may be helpful for analysing terpenoid metabolites in A. argyi. In addition, the substantial numbers of genes showing leaf-specific expression and associations with secondary metabolic processes revealed the importance of metabolic activities in leaves. These genes showing leaf-specific expression or up-regulation might provide the transcriptomic support required to provide A. argyi leaves with their medicinal value.
In this work, 461 candidate TFs were assigned to the AP2-EREBP, bHLH, WRKY and bZIP families, and these TFs might play roles in regulating terpenoid biosynthesis. The bHLH transcription factor gene AabHLH1 in A. annua has been proven to effectively regulate the biosynthesis of the terpenoid artemisinin 30 . The use of genetic engineering methods to control TFs has substantial potential value and broad application prospects in studies on the regulation of terpenoid biosynthesis in A. argyi.
In this study, seven gene sequences, namely, those encoding ArHMGR1, ArHMGR2, ArMVD, ArDXS, ArDXR, ArHDS and ArHDR, were retrieved from the transcriptomic data and successfully cloned by PCR. These gene sequences were consistent with those identified from the A. argyi transcriptome, thus confirming the reliability of our transcriptional data. In addition, the expression level of HMGR, which was up-regulated in leaves, can reportedly increase the synthesis of artemisinin 31 , and the content of ginkgolide, another terpenoid, in transgenic Ginkgo biloba overexpressing HDR was significantly increased compared with that in the nontransgenic control line 32 . Therefore, our findings may help improve future studies on increasing the yield of terpenoids via gene regulation and the production of transgenic plants.
In summary, our study is the first exploration of the A. argyi transcriptome. We generated high-quality RNA-seq data from leaf, root and stem tissues of A. argyi. Using de novo transcriptome assembly, we assembled and annotated 99,807 and 67,446 unigenes, respectively. We analysed most of the unigenes encoding key enzymes involved in the terpenoid biosynthesis pathway and identified several TFs related to terpenoid synthesis. Our findings may help improve future studies on the molecular mechanisms of terpenoid biosynthesis and on increasing the yield of terpenoids via gene regulation and genetic engineering. Our transcriptomic dataset will also accelerate studies on A. argyi functional genomics.

Materials and Methods
Plant material and RNA extraction. Whole A. argyi plants (identified by Professor Qingshan Yang, Anhui University of Chinese Medicine) were harvested from the Anhui University of Chinese Medicine herb garden, cleaned with ultrapure water, dried on filter paper, and immediately soaked in liquid nitrogen after separation of the leaves, stems and roots. The leaves, roots and stems selected from five replicates were pooled together. Total RNAs from the plants were isolated with an RNA Plant Kit (Aidlab Biotech, Beijing, China) based on the manufacturer's instructions. RNA quality was verified using an Agilent 2100 Bioanalyzer (Agilent Technologies, Palo Alto, CA, USA), and the average RNA Integrity Number (RIN) was 8.63. cDNA library construction and RNA sequencing. Total RNAs were treated with DNase I to eliminate DNA residues and then mixed with oligo (dT)-cellulose to purify the mRNAs. The purified mRNAs were fragmented, and first-strand cDNAs were synthesized using these mRNA fragments as templates. After the second-strand cDNAs were synthesized, the double-stranded cDNAs were randomly fragmented. Short cDNA fragments were recovered and repaired, and a single nucleotide (adenine) was added to the 3′ ends. The cDNA fragments were then joined to adapters, and the appropriate fragments were selected and used for PCR amplification. Each sample library was quantified and evaluated for quality on an Agilent 2100 Bioanalyzer and an ABI StepOnePlus Real-Time PCR System (ABI, New York, NY, USA), respectively. Ultimately, the one library per tissue was sequenced on the Illumina HiSeq 4000 platform (Beijing Genomics Institute, Wuhan, China). After sequencing, raw data were received, and low-quality reads and adapters were filtered to generate clean data.
De novo transcriptome assembly. De novo transcriptome assembly was implemented using Trinity (version 2.06), which successively combines Inchworm, Chrysalis and Butterfly, to assemble clean reads 33 with the following parameters: min contig length of 200 and min kmer coverage of 4. Ultimately, full-length transcripts for alternatively spliced isoforms were generated by splicing transcripts corresponding to paralogous genes. All such sequences were known as transcripts. Transcript analysis was performed to cluster and remove redundancies with TGICL (version 2.06, parameters: -l 30 -v 35) to acquire non-redundant sequences, termed unigenes 34 . All unigenes were segmented into two categories: clusters (prefixed with CL) and singletons (prefixed with unigene).
Identification of differentially expressed genes. For comparing unigene expression levels in two tissues, such as leaf vs root tissue and leaf vs stem tissue, unigenes with FCs ≥ 2.00 and false discovery rate values ≤ 0.001 were described as DEGs by the PoissonDis method 41 . KEGG functional analysis showed that DEGs were enriched for each term in the KEGG database, and the number of unigenes in each pathway was calculated. Pathways showing significant enrichment among the DEGs compared to the entire A. argyi transcriptome were identified using the hypergeometric test 42 . In this test, the p-value was calculated as follows: Transcription factor analysis. After detecting the ORF of each unigene with Getorf (parameter: -minsize 150) 43 , ORFs were aligned to TF protein domains in PlnTFDB (plant transcription factor database) using Hmmsearch with the default parameters 44 . The abilities of the unigenes to encode proteins were evaluated based on the characteristics of TF families described in PlnTFDB.
Accession code. The RNA-seq dataset for the three A. argyi tissues has been deposited into the NCBI Gene Expression Omnibus (GEO) database under accession code GSE102404.