De novo assembly and characterization of root transcriptome in two distinct morphotypes of vetiver, Chrysopogon zizaniodes (L.) Roberty

Vetiver, a perennial C4 grass, has long been known for its multifarious uses in perfumery, medicine and environmental protection. Two distinct vetiver morphotypes have been identified in India, i.e., A. North Indian type characterized by thick and smooth fast growing roots that produce superior quality of laevorotatory oil; and B. South Indian type with more number of thin and hairy roots that produce inferior quality of dextrorotatory oil. The two morphotypes were targeted for transcriptome analysis to understand the contribution of genetic background on oil quality and root morphology. Sample A showed enhanced activity of flavonoid and terpenoid biosynthesis related genes, i.e. ERF, MYB, bHLH, bZIP and WRKY. Interestingly, expression analysis revealed that the genes involved in sesquiterpene biosynthesis pathway were up regulated in Sample A. Moreover, some of the genes involved in mevalonate pathway of sesquiterpene biosynthesis were unique to Sample A. Our results also demonstrated several transcripts involved in root development and hormonal regulation being up regulated in Sample A. To validate gene expression results of RNA-seq data, 20 transcripts were validated by qRT-PCR experiment. The present study provided an important start point for further discovery of genes related to root oil quality in different ecotypes of vetiver.

It has wide range of ecological distribution ranging from sandy sea coasts and swamps to plains and foothills, and also on the hilltops up to elevations of 800 m in the Kumaun hills of Uttarakhand in India.
Two distinct morphological complexes of vetiver are found to inhabit spatially separated geographic regions in India: one in the north along the Indo-Gangetic plains and adjoining areas mainly in the states of Rajasthan, Madhya Pradesh, Uttar Pradesh, Bihar and West Bengal, and the other in the south along the east and west coasts of Indian peninsula in the states of Andhra Pradesh, Karnataka, Tamil Nadu and Kerala 4 . The two forms are karyomorphologically distinct and reproductively differentiated geographic complexes 5 . The north Indian wild types are profusely flowering, high seed-setting and produce superior quality of laevorotatory root oil (ruh-khus or khus oil) whereas the south Indian cultivated types are low/late flowering, low/non seed-setting and produce inferior quality of dextrorotatory root oil (vetiver oil) 6 . Extensive work on evaluation of genetic diversity, genetic analysis, genetic improvement has been done on Indian vetiver, and a good number of superior clones, north × south hybrids and artificial polyploids have been segregated for high productivity of essential oil and high value perfumery notes ranging from earthy-to-rosy-to-saffron odour 4,7 .
During the last few years, a large number of transcriptomic and genomic sequences became available in several model organisms, which have greatly improved the understanding of the complexity of physiological processes in higher plants [8][9][10][11] . Despite the commercial and medicinal importance of vetiver, little genomic research has been done on this species. For vetiver, only a small set of of 32 EST sequences are available in GenBank database (http:// www.ncbi.nlm.nih.gov/genbank/). This small public domain data are not sufficient for elucidating the molecular mechanisms controlling the traits of interest. Therefore, extensive genomic and transcriptomic sequence data are needed for vetiver, which can be used for discovering new genes related to root structure and oil quality in different ecotypes and for developing high density microarrays for further characterization of gene expression profiles.
In the present study, we have carried out transcriptome sequence analyses in two strategically selected and contrasting morphotypes of vetiver, one representing the North Indian type having thick, smooth and fast growing roots, and the other the South Indian type having thin, hairy and more roots. We have applied Illumina paired-end sequencing technology to characterize the root transcriptome of vetiver and to develop SSR markers. De novo assembly for both samples was carried out using Trinity. Functional categorization and pathway analysis were also performed. This is the first report on a global comparative molecular analysis of root transcriptome in two distinct morphotypes of vetiver using Illumina paired-end sequencing. This dataset will serve as a public information platform for gene expression, genomics, and functional genomics studies in vetiver.

Results and Discussion
Morphometric differentiation of the two contrasting morphotypes. Vetiver is a tall (1-2 m), fast-growing, perennial tussock grass 12,13 . It has a long, massive tufted root system, which can penetrate the deeper layers of the soil. Its roots produce an essential oil that is considered as a perfume in its own right, and is highly valued in the perfume industry. Vetiver oil is rich in sesquiterpenes, and is psychologically grounding, calming, and stabilizing. In India there are two distinct Vetiver morphotypes, North Indian Type and South Indian Type. North Indian types are of profuse flowering, high seed forming with thicker roots (root diameter is around 2 cm) while leaves are broader (lamina width at the leaf base around 2 cm). It has lesser number of roots per tiller compared to the South Indian type. North Indian type has high quality essential oil enriched with ketones and aldehydes that add to its perfumery value. North Indian type vetiver plants are tall reaching up to 1.6 m, with inflorescence stalk reaching up to 2.5 m. South Indian type sports higher number but thin roots, and thin leaves while root diameter is around 1.2 mm, lamina width near leaf base is 1-1.2 cm, plant height up to 1.2 m and inflorescence stalk reaching 1.5 m. South Indian morphotypes are low or late flowering and low seed forming. Roots are rich in essential oil concentration but the oil is of poor quality and having positive optical rotation. In order to ensure best morphological distinction in the two morphotypes we identified two distinct clones i.e. Sample A and Sample B, representing North and South Indian geographical complexes, respectively for the present study. A brief account of their morphotypic differentiation depicting key features is provided in Table 1, Fig. 1.

Sequencing of vetiver root transcriptome.
To generate a broad survey of genes associated with root morphology and difference of oil quality of two different morphotypes, North Indian and South Indian type, a total of 6,33,37,598 raw sequencing reads with 101 bp were generated from a 200 bp insert library. An assembler, Trinity was employed for de novo assembly. After stringent quality check and data cleaning, approximately 5,67,50,434 million high quality reads were obtained with 89.60% Q20 bases (base quality more than 20). Based on the high quality reads, a total of 73,061 contigs were assembled with an average length of 101 bp. The length of contigs ranged from 201 to 12,238 bp. The NGSQCTOOLKIT 14 results are given in Table 2A,B.
De novo assembly and functional annotation. The de novo assembly of vetiver transcriptome was optimized after assessing the effect of various assembly parameters, trimming bases at sequence read ends and different assembly programs as described in materials and methods. We got ~70% high quality sequence reads that were assembled using Velvet program 15 at different k-mer length of 21, 27, 31, 37, 41, 47, 51 and 57 ( Fig. 2A-D). We analysed various output parameters like average contig length, N50 value, N90 value and number of contigs as a function of k-mer length. The results suggested that k-mer length inversely affects the number of contigs. We found the best assembly to be that for k 57.
Vetiver transcriptome sequencing data was analysed in terms of GC ratio as the latter plays a vital role in gene and genome regulation and also in determining the physical properties of the genome and nucleic acid stability as well 16,17 . As Fig. 3 indicates, number of sequences and GC content was more between 45-50 and 50-55 in both the samples. We analyzed the sequence conservation in vetiver transcripts with proteome sequences of Among all, Sorghum shows highest similarity (71.60%) with vetiver transcript followed by Zea mays, Bracypodium and Oryza. Arabidopsis thaliana, being a dicot, showed least similarity (58.11%) with vetiver transcript (Fig. 4). The sequence comparison result supports previous report of Adams et al. 18 about Sorghum, as the most closely related genus of Chrysopogon. Differential gene analysis. Differential gene analysis is one of the important data analysis strategies for expression studies of the transcripts. It also involves various approaches to determine whether counts for a transcript or exons are significantly different across experimental conditions. In order to analyse differentially expressed genes, edgeR program was used as described in materials and methods 19 . We got 65,779 differential genes from the edgeR program (Table S3). In total, we detected 13,635 significant transcripts expressed in both samples, on the basis of FDR (q value < 0.05). The distribution of these genes is presented in a Venn diagram (Fig. 5A), which shows that 6,982 genes were common among both samples. The number of unique genes detected was higher in Sample A than inSample B. Additionally, several hundred highly expressed transcripts were detected in Sample A as compared with Sample B (Fig. 5B). To experimentally confirm that the transcripts obtained from sequencing data were indeed expressed, 20 transcripts related to terpenoid biosynthesis, root specific and randomly selected genes were chosen for qRT-PCR analysis. A very good correlation was obtained by real-time PCR analysis with respect to transcriptomic data (Fig. 6). These results further confirmed the potential of NGS technologies to quantify gene expression.

Differential expression analysis of transcripts encoding putative transcription factors.
Transcription factors modulate gene expression by interacting with the promoter regions of related genes. There are several transcription factors (TFs) known to be involved in regulation of terpenoid biosynthesis pathways. Therefore, analysis of transcriptional factors in contrasting morphotypes was carried out by sequence comparison with known transcription factors gene families. As shown in Fig. 7       Gene ontology and pathway analysis. Gene ontology (GO) and pathway analysis were performed in order to classify the assembled transcripts. The genes with more than 10 FPKM were considered for GO and pathway analysis. Thus 11,017 genes from Sample A and 9,600 genes from Sample B were used for further analysis. WEGO ontology tool was used for gene ontology 27 . As shown in Fig. 8A, Sample A has more functions in comparison to sample B. In the cellular component, the most enriched classes were cell, cell part and organelle. The highest percentage of molecular function in GO terms was in catalytic activity. In the biological processes, where major subcomponents were taken, the majority of the GO terms were grouped into two categories-those of metabolic and cellular processes. In detailed GO analysis (Fig. 8B, Table S4A,B), lipid metabolism, transport, DNA metabolism, carbohydrate metabolism and catabolism related genes were overrepresented in Sample A. Additionally, biological processes, like secondary metabolism, phosphorylation, oxoacid metabolism, though present in both the samples, were found overrepresented in sample A. Among cellular components, cytosol, intracellular part, cytoplasmic fraction and cytoplasm were overrepresented in sample A as compared to Sample B. Finally, the EC numbers were classified in KEGG pathways 28 , enabling the presentation of enzymatic functions in the context of the metabolic pathways involved in terpenoid biosynthesis. Among the pathways identified, secondary metabolite related transcripts were over-represented (Fig. 9A, Table S5A,B). Moreover, Sample A showed more secondary metabolite -flavonoid, and terpenoid related transcripts compared to sample B. In order to check differential expression pattern of putative genes associated with secondary metabolism, MapMan analysis was performed. Several transcripts were differentially expressed among Sample A and Sample B (Fig. 9B). Genes involved in terpene biosynthesis showed overexpression in Sample A as compared to Sample B, which further confirmed over representation of terpenoid biosynthesis pathway in Sample A. Interestingly, expression analysis of genes involved in sesquiterpene biosynthesis pathway were up-regulated in Sample A as compared to Sample B (Fig. 10A). Interestingly, some of the genes, involved in mevalonate pathway of sesquiterpene biosynthesis, were uniquely present in Sample A while others were over expressed in Sample A as compared to Sample B (Fig. 10B). Further validation of some of the selected genes involved in terpenoid biosynthesis was carried out using qRT-PCR (Fig. 6A, Table S6). In the terpenoid pathway, vetiver homologues of TR29625|c0_g1_i1 (squalene monooxygenase), TR18509|c0_g1_i1 (mevalonate kinase), TR22969|c0_g1_i1 (phosphomevalonate kinase), TR23979|c0_ g1_i1 (2-C-methyl-D-erythritol 4-phosphate cytidylyltransferase), TR18928|c0_g1_i1 (diphosphomevalonate decarboxylase), TR26232|c0_g1_i1 (hydroxymethylglutaryl-CoA synthase), TR30426|c0_g2_i1 (geranylgeranyl reductase), TR12988|c0_g1_i4 (ditrans, polycis-polyprenyl diphosphate synthase) and TR36389|c0_g1_i1 ( geranyl diphosphate synthase) were expressed much higher in Sample A roots than in Sample B. Plants use both the mevalonate dependent and the mevalonate-independent, or deoxyxylulose 5-phosphate (DXP), pathways for isoprenoid synthesis [29][30][31][32][33] . Over representation and higher expression of genes involved in mevalonate dependent sesquiterpene biosynthesis in Sample A, may be responsible for better quantity and quality of vetiver oil in Sample A as compared to Sample B.

Expression analysis of transcripts invoved in hormonal regulation and root development.
As mentioned earlier, among the two morphotypes examined Sample A produced good quality vetiver oil and possessed a better root system architecture than the Sample B. In order to check expression pattern of some of the transcripts involved in root morphology and development, candidate genes analysis was carried out (Table S7). Results suggested that several transcripts involved in root development were differentially regulated among both the morphotypes. Phytohormones play an important role in root development. In order to check expression pattern of various phytohormones responsive genes, MapMan mediated visualization was carried out and as shown in Fig. 11A, several candidate genes involved in auxin and cytokinin mediated regulation were differentially expressed in both morphotypes. Additionally some of the ABA, GA and Ethylene responsive genes were also showed differential expression pattern among both samples. Some of the candidates genes involved in root development were also found to be differentially expressed in the two morphotypes. Based on the literature expression pattern of putative transcripts involved in root development and regulation were also analysed. Several genes, directly or indirectly involved in root architecture and development, were found differentially regulated in Sample A than in its counterpart (Fig. 11B and Table S8). Our data also suggests that several auxin    responsive transcripts were also up regulated in Sample A than in Sample B (Table S7). These transcripts have been reported to play an important role in root development. Products of OsORC3 (Origin recognition Complex 3) and OsCYP2 are involved in lateral root initiation and development in rice 34,35 (34)(35). Similarly, auxin efflux carrier (EIR1), a root-specific protein involved in auxin transport, is required for root development and gravitropism in Arabidopsis thaliana 36 . Some of the selected genes were further validated using qRT-PCR (Fig. 6C). Among the genes tested for q-RT-PCR, the expression levels of TR15490|c0_g1_i1 (Similar to PLETHORA1) and TR35777|c3_g1_i2 (auxin transport protein REH1) showed very good correlation with the transcriptome data. Overexpression of such transcripts may be responsible for the dense and deep root system of Sample A compared to sample B, which in turn leads to greater oil quality by an unknown link between its morphological superiority and better oil quality. Further study may explore such type of correlation in a better way. SSR identification. The transcript based SSR markers play an important role in determining functional genetic variations 37 . SSR markers are polymorphic in nature, easy to develop and are a rich source of diversity. We use MISA perl script to find SSR marker in vetiver. A total of 73,061 sequences (Total size of sequences was 5,80,98,639 bp) were examined for SSR identification ( Table 3). Out of those, total number of identified SSRs was 13,527, whereas number of SSR containing sequences were 11,240. Additionally, following the criteria used to identify these SSRs, tri-nucleotide repeats were the most abundant (6,295) and hexa-nucleotides were the least abundant (70).
In this study, we utilized Illumina paired-end sequencing technology to characterize the root transcriptome in two different morphotypes of vetiver and to develop SSR markers. Non-normalized cDNA collections from two different types of roots were used to generate a broad survey of genes associated with different morphological characters as well as secondary metabolites. To the best of our knowledge, this study is the first exploration to characterize the root transcriptome of vetiver through the analysis of large-scale transcript sequences resulting from Illumina paired-end sequencing. North Indian (Sample A) morphotype has good quality of oil as well as root system architecture due to over represented transcripts of secondary metabolite and root specific traits as compared to south Indian morphotype. Terpenoid, particularly sesquiterpene biosynthesis pathway played a major role in quality and quantity of Vetiver oil as evident by our transcriptome study. Functional characterization of some of the important genes will lead to complete understanding of Vetiver oil biosynthesis pathway.  (Table 1) were targeted for the study. The two distinct forms were segregated from the large collection of vetiver grown at the CSIR-Central Institute of Medicinal and Aromatic Plants, Lucknow, India after ascertaining the distinctness in growth/reproductive behavior and other biological and botanical features grown over several years. Root sample of both varieties were collected from the 15-day-old seedlings grown in autoclaved mixture (1:1). At least three independent biological replicates of each tissue sample were harvested, all the samples were ground in liquid N 2 and stored at − 80 °C for further analysis.
RNA isolation and quality controls. Frozen tissues were ground to a fine powder in liquid nitrogen and total RNA was extracted using RNeasy plant Mini Kit (QIAGEN, MD) and treated with RNase free DNaseI   (QIAGEN, MD) according to manufacturer's instructions. The quality and quantity of total RNA were analysed by agarose gel and spectrophotometric analysis (ND1000 Nanodrop, NanoDrop Technologies, USA). The equal amount of total RNA from three different preparations were pooled and used for further processing. Quantity as well as quality of pooled RNA was again checked on Agilent 2100 Bioanalyzer RNA chip (Agilent Technologies Inc., Santa Clara, CA). Only the RNA samples with 260 of 280 ratio from 1.8 to 1.9, 260 of 230 ratio from 2.0 to 2.5 and RIN (RNA integrity number) more than 9.0, were used for the analysis Illumina sequencing. The cDNA libraries were generated using mRNA assay for sequencing on IlluminaHiSeq2000 sequencing platform. Paired-end cDNA library was generated from root sample and sequencing was performed to generate the ~101 bp paired-end reads. Many quality controls and adaptor removal were done by NGSQCTOOLKIT (http://www.nipgr.res.in/ngsqctoolkit.html) software. This software was used for filtering of high quality reads based on quality score (Q > 20) value given in fastqfile. Removal of adaptor and trimming of reads were done by NGSQCTOOLKIT.
De novo assembly. The assembly was performed on workstation with 12 processor and 56 GB random access memory. We used Trinity (http://trinityrnaseq.github.io), Velvet (http://www.ebi.ac.uk/~zerbino/velvet/) and SOAPdenovo (http://soap.genomics.org.cn/soapdenovo.html) software for de novo assembly 14,38,39 . Trinity working bench was divided into three steps. First one is Inchworm in which construction of de Bruijn graph takes place. Construction of contigs takes place by k-mers. Second one is Chrysalis in which reads partitioning is done by overlapping contigs of inchworm. Third one is Butterfly which is used for traversal of graph by generated reads. deBruijn graph algorithm was used for assembly of short reads. We got good assembly results with our data set. Many assembly parameters were also optimized for best results. Differential gene expression analysis. Differential gene expression analysis was carried out using trinity in build tool, EdgeRbioC in R version (http://www.bioconductor.org/packages/release/bioc/html/edgeR.html). After running trinity assembly steps, abundance estimation was performed using RSEM. Finally EdgeR bioconductor was run for differential gene expression analysis. The data were visualized and figures were produced using the MapMan software. A downloadable Version 3.6.0RC1 is available at http://mapman.gabipd.org/web/guest. GC content and SSRs identification. GC content analysis was done between two samples. GC content was analysed using Perl program from sample wise assembled contig files. MISA tool (http://pgrc.ipk-gatersleben. de/misa) was used for SSRs identification. Parameters like mono-nucleotide's repeats more than ten times, di-nucleotide repeats more than six times, tri, tetra, penta, hexa-nucleotide repeats more than five times were considered as MISA search criteria.

Functional annotation and similarity search.
Gene ontology and pathway analysis. To perform gene ontology and pathway analysis, we used transcripts which have FPKM more than 10. Genes were selected from differential file generated using EdgeRbioC described in section2.6.Gene ontology analysis was performed using agriGO search tool (http://bioinfo.cau.edu. cn/agriGO/analysis.php) with singular enrichment analysis statistical test. Sorghum (Sorghum bicolor L.) reference genome was taken for gene ontology studies. These GO terms were used in WEGO tool (http://wego. genomics.org.cn/cgi-bin/wego/index.pl) for plotting GO annotation for molecular function, biological process and cellular component analysis. Pathway analysis performed using KEGG database (http://www.genome.jp/ kegg). KO ids for pathway analysis generated with the help of KAAS (www.genome.jp/tools/kaas/) by taking Oryza sativa as reference genome 40 .
Expression analysis using qRT PCR. Real time PCR was performed in 20 μ l for a set of selected genes using Fast SYBR Green PCR Master Mix (ABI, USA). The list of selected genes and oligonucleotide primers (Eurofins, India) used for each gene are listed in Table S2. Oligonucleotide primers for vetiver actin gene (Table  S2) were used as the internal control for establishing equal amounts of cDNA in all reactions. The reactions were performed using the following cycle conditions, an initial 94 °C for 2 min, followed by 30