De novo transcriptome analysis of Lantana camara L. revealed candidate genes involved in phenylpropanoid biosynthesis pathway

Lantana camara L. is an economically important essential oil producing plant belonging to family Verbenaceae. It is used in medication for treating various diseases like cancer, ulcers, tumor, asthma and fever. The plant is a useful source of essential bioactive compounds such as steroids, flavonoids and phenylpropanoid glycosides etc. Nonetheless, very little is known about the genomic or transcriptomic resources of L. camara, and this might be the reason of hindering molecular studies leading to identification of improved lines. Here we used Illumina sequencing platform and performed the L. camara leaf (LCL) and root (LCR) de novo transcriptome analyses. A total of 70,155,594 and 84,263,224 clean reads were obtained and de novo assembly generated 72,877 and 513,985 unigenes from leaf (LCL) and root (LCR) respectively. Furthermore, the pathway analysis revealed the presence of 229 and 943 genes involved in the phenylpropanoid biosynthesis in leaf and root tissues respectively. Similarity search was performed against publically available genome databases and best matches were found with Sesamum indicum (67.5%) that were much higher than that of Arabidopsis thaliana (3.9%). To the best of our knowledge, this is the first comprehensive transcriptomic analysis of leaf and root tissues of this non-model plant from family Verbenaceae and may serve as a baseline for further molecular studies.

Library preparation for transcriptome sequencing. NEBNext ® Ultra™ RNA Library Prep Kit was used to generate sequencing libraries. Briefly, mRNA was purified from total RNA using poly-T oligo-attached magnetic beads. Divalent cations were used to carry out fragmentation under high temperature in NEBNext (FSSR) first strand synthesis reaction Buffer (5×). Random hexamer primers were used to synthesize first strand cDNA using M-MuLV Reverse Transcriptase (RNase H-). Second strand cDNA synthesis was carried out using DNA Polymerase I and RNase H. Exonuclease/polymerase activity was performed to convert remaining overhangs into blunt ends. Ligation of NEBNext adaptor with hairpin loop was performed after adenylation of 3′end of DNA fragments in order to prepare sample for hybridization. The cDNA fragments of 250-300 bp were preferentially selected and purification of the library fragments was carried out with AMPure XP system (Beckman Coulter, Beverly, USA). 3 µl of Enzyme (NEB, USA) was used with size-selected, adaptor-ligated cDNA at 37 °C for 15 min, followed by 5 min at 95 °C. PCR was carried out with Phusion High-Fidelity DNA polymerase, Universal PCR primers and Index (X) Primer. PCR products were purified (AMPure XP system) and library quality was assessed on the Agilent Bioanalyzer 2100 system. Paired-end library of 150 bp (PE150) was prepared following Illumina protocol/instructions. Clustering, quality control and transcriptome assembly. Cluster generation system (PE Cluster Kit cBot-HS Illumina) was used to perform clustering of the index coded samples following manufacturer's instructions. The library preparations were sequenced on an Illumina platform after cluster generation and paired-end reads were generated. Transcriptome assembly was performed using Trinity software 33 with min_kmer_cov set to 2, and all other parameters with default settings.
Gene functional annotation and biological pathways assignment. Libraries of LCL and LCR tissues were annotated using BLASTX against the NCBI database and all unigenes were applied for homology searches. Gene functional annotation was performed using the following 7 databases: NCBI non-redundant protein sequences (NR), NCBI non-redundant nucleotide sequences (Nt), Protein family (Pfam), Clusters of Orthologous Groups of proteins (KOG/COG), manually annotated and reviewed protein sequence database (Swiss-Prot), KEGG Ortholog database (KO) 34 and Gene Ontology (GO) database. Best aligning results were selected to annotate the unigenes; if aligning results of these databases were not similar, NR database results were preferentially selected. For assignment of function to unigenes, Gene Ontology (GO) enrichment analysis was performed using AgriGO (https ://bioin fo.cau.edu.cn/agriG O/analy sis.php) 35 and annotated sequence/s may have more than one GO term. Similarly, to understand the high-level functions and gain an overview of gene pathway networks KEGG 36,37 was used. Enzyme commission (EC) numbers were assigned to unique sequences, based on the BLASTx search of protein databases, using a cutoff E value 10 − 5. The statistical enrichment of DEGs in KEGG pathways was carried out using KOBAS 38,39 . Quantification and differential expression analysis. The RNA-seq by Expectation Maximization (RSEM) package was used to estimate gene expression levels of both root and leaf tissues 40 . At first clean data was mapped back onto the assembled transcriptome and then read count for each gene was obtained from the mapping results. Prior to differential gene expression (DEGs) analysis of the sequenced library of root and leaf tissues, read counts were adjusted through one scaling normalized factor and analysis was performed using DEGseq (2010) R package 41 . P value was adjusted using q value of < 0.005 and |log2(foldchange)| > 1 was set as the threshold for significantly differential expression. To identify DEGs between the LCR and LCL datasets, raw read counts were initially filtered to exclude orphan transcripts. Pairwise comparison was performed using the

Results and discussion
RNA sequencing and de novo assembly. Advances in genomics have led to the development of NGS based trait mapping approaches that have tremendously increased the efficiency of traits selection and mapping in complex genomes 43 . Among the high throughput genotyping approaches, Next Generation RNA Sequencing technology has contributed to a more comprehensive understanding of functional genes and is widely used for characterization of transcriptome profiles of various model and non-model plants 44 . De novo transcriptome analysis is not only providing an excellent platform for finding novel genes, molecular markers development but also cater a base for the construction of networks of gene expressions for various tissues and organs of animals as well as plants. Although, the number of available high-quality reference genomes has been constantly growing still, de novo transcriptome approaches are mainly used for non-model species where whole genomes information are missing. The remarkable advances in RNA sequencing provide a cost-effective way to obtain large amounts of transcriptome data from many organisms and tissue types, especially in the complete absence of a reference genome thereby, allowing us to identify all expressed transcripts 33,34 . Here we provide, the first report on transcriptome profiling and DEGs of L. camara leaf and root tissues; RNA-Seq library was constructed and RNA samples with RIN value more than 6 were used (Supplementary file 1). A total of 76,315,644 and 87,218,768 raw reads were obtained for LCL and LCR respectively. For the removal of adapters, poly-A tail, primer sequences and short as well as low quality sequences trimming process was performed that resulted in a total of 70,155,594 clean reads from LCL and 84,263,224 from LCR. The total size of clean bases generated was 10.5 GB for LCL sample with percent error (0.01%), Q20 (97.39%), Q30 (93.47%) and GC content (45.61%), whereas for LCR sample 12.6 GB with a percent error (0.03%), Q20 (96.81%), Q30 (92.01%) and GC content (43.73%) ( Table 1). The sequences (raw data) generated were deposited to NCBI as PRJNA503321 (for leaf sample) and PRJNA605469 (for root sample). For samples lacking reference genomes, clean reads need to be assembled to get a reference sequence. We used Trinity assembler 33  functional annotation of genes. Unigenes of the LCL and LCR tissues were annotated using BLAST search against the NR, NT, PFAM, GO and KOG databases. Gene Ontology assignments based on the protein match and annotation results are based on Götz et al. 45 Of the 72,877 and 513,985 unique transcript sequences of LCL and LCR annotated, a large proportion of the sequences (73.27% and 70.11%) had hits in databases. Of the LCL tissue, a total of 50,540 (69.34%) unigenes matched with known proteins in the NR database, while 38,975 (53.48%) unigenes were annotated with entries in the Swiss-prot database, 36,209 (49.68%) unigenes matched Table 1. Quality of reads obtained after RNA-sequencing of leaf and root transcriptomes.

Samples
Raw reads Clean reads Clean bases (GB) Error (%) Q20 (%) Q30 (%) GC (%)  (Table 3). Of the assembled unigenes, only 9,849 (13.51%) and 45,051 (8.76%) in LCL and LCR respectively, were successfully annotated in all databases. Reasons for the non-annotated sequences could be, lack of conserved protein domains in short sequences, or in some cases the transcriptome have noncoding genes, UTRs, random transcriptional noise or incomplete spliced introns which are non-homologues to the sequences available in the public databases. This may also be one of the possible reasons that genes have not shown expression at the time of RNA extraction or those genes are expressed at very low levels 46 .
The assembled unigenes annotation of LCL and LCR tissues is given as Venn diagram (Fig. 1) against PFAM, GO, KOG, NR and NT databases. For leaf tissues, a total of 13,534 unigenes were annotated in 5 databases; 6,060 unigenes sequences showed homology in 3 databases including PFAM, NR and GO. While 1,167 unigenes were annotated in KOG as well as the above three databases and 2,030 unigenes were annotated in PFAM and GO databases (Fig. 1A). Likewise for root tissues, a total of 57,114 unigenes were annotated in five databases where, 28,396 unigenes were homologous in PFAM, NR and GO databases (Fig. 1B).
Gene ontology (GO) classification. GO assignments were used to classify the functions of the LCL and LCR unigenes, which classified unigenes under the categories of biological process (BP), molecular function (MF) and cellular component (CC). Sum of 36,209 LCL unigenes were classified into three major categories (Fig. 3A). The biological process category, the unique sequences were classified into 25 groups. The most characterized biological processes were 'cellular process' (57.59%, GO-ID: 0009987) followed by 'metabolic process' (53.72%, GO-ID: 0008152). The cellular components were divided into 21 groups. Interestingly, we found both the represented cellular components 'Cell' (31.80%, GO-ID: 0005623) and 'Cell part'' (31.80%, GO: 0044464) were similar to previous report 45 . The molecular functions category, the unique sequences clustered into 10 classes. Where, the highest sub-category was 'binding' (58.35%, GO-ID: 0005488) followed by 'catalytic activity' (46.37%, GO: 0003824) (Fig. 3A, Supplementary file 2).
Similarly, 255,964 LCR unigenes were annotated and also likewise classified into three main categories and further divided into 57 sub-groups (Fig. 3B). Among the biological processes classification, the highest was cellular process 142,929 (55.83%), followed by metabolic process 135,458 (52.92%) and single-organism process 107,769 (42.1%). Furthermore, cellular component was classified in 22 groups, among them high amount of unigenes were related to "cell" 75,691 (29.57%) and "cell part" 75,607 (29.53%), followed by "organelle" 50,774 www.nature.com/scientificreports/ (19.83%) and "membrane" 35,704 (13.94%) (Fig. 3B, Supplementary file 2). Only few unigenes were assigned to extracellular matrix, extracellular region and extra cellular region part. In the molecular function category, the majority of the unigenes were assigned to "binding" 133,482 (52.14%) and "catalytic activity" 114,050 (44.55%). The annotation and sequence information from Gene Ontology results provide important gene sources for future molecular level studies that underline selection and improvements of L. camara.
functional characterization using KeGG. For biological functioning of genes, pathway-based analyses are imperative and this assignment was performed in the KEGG database. In LCL sample, of the 20,506 unigenes, 18,853 were assigned to 5 major groups in KEGG with 26 sub-categories and 131 biochemical pathways (Fig. 5, Supplementary file 4). These major groups were cellular processes (1,004 unigenes), environmental www.nature.com/scientificreports/ information processing (694 unigenes), genetic information processing (4,285 unigenes), metabolism (8,737 unigenes) and organismal systems (878 unigenes). In these 5 groups, the topmost was the metabolism with 8,737 unigenes, and its further evaluation classified it into 10 subgroups (Fig. 6). Of these, 'carbohydrate metabolism' with (1,751) genes involved the highest number of genes, followed by 'amino acid metabolism' (1,064) genes and 'energy metabloism' with (886) genes (Fig. 6). Furthermore, 936 genes were identified that coded for important proteins and have matches with 1,113 enzymes. The function of these enzymes was assigned to 21 secondary metabolite pathways (Fig. 7). Among these pathways 546 genes encoded key enzymes involved in terpeniods biosynthesis including monoterpenoid (28 genes), diterpenoids (30 genes), terpenoid backbone (166 genes), sesquiterpenoid and triterpenoids (24 genes). Similarly, 41 genes were related to flavonoid biosynthesis pathway including flavone and flavonol (32 gene). Among all the secondary metabolite pathways, the phenylpropanoid biosynthesis pathway involved the highest numbers (229) of genes (Fig. 7). These results provide valuable insight into the metabolic pathways in leaves of L. camara. These results may provide basis for future studies to identify and characterize genes and transcripts that are involved in important metabolic pathways and may provide insights into the understanding of functions of these genes in the biosynthesis of active compounds in L. camara.
Of the 124,076 unigenes in LCR transcriptome, 115,383 were assigned into 5 groups in KEGG with 19 subcategories and 131 biochemical pathways (Fig. 5, Supplementary file 4). The major groups were cellular processes (7,793 unigenes), environmental information processing (3,384 unigenes), genetic information processing (39,630 unigenes), metabolism (50,349 unigenes) and organismal systems (3,111 unigenes). In these 5 groups, www.nature.com/scientificreports/ the topmost was the metabolism with 50,349 unigenes was further analysed, which we took for further consideration. The metabolism group was further categorized into 10 subgroups (Fig. 6). In all the 10 sub-groups of metabolism 'carbohydrate metabolism' was the highest with (11,724) genes, followed by 'amino acid metabolism' with (9,120) genes and 'energy metabolism' with (6,791) genes (Fig. 6). Furthermore, 4,166 genes were identified that coded for proteins and have important matches to 5,368 enzymes. The function of these enzymes was assigned to 21 KEGG secondary metabolite pathways (Fig. 7). Among these pathways 2,796 genes encoded key enzymes involved in terpeniods biosynthesis including monoterpenoid (92 genes), diterpenoids (75 genes), terpenoid backbone (853 genes), sesquiterpenoid and triterpenoids (103 genes). The results showed 119 genes were related to flavonoid biosynthesis pathway including flavone and flavonol biosynthesis (75 gene), whereas   www.nature.com/scientificreports/ the highest number of genes (943 genes) among all the secondary metabolite were involved in phenylpropanoid biosynthesis pathway (Fig. 7).
phenylpropanoid biosynthesis. The current study identified the maximum number of genes for phenylpropanoid biosynthesis pathway and this was taken into further consideration. Phenylpropanoids are phytobased natural compounds that are usually derived from phenylalanine 48 . Phenylpropanoid plays vital role in plant response to various biotic and abiotic stresses 49 . The process of phenylpropanoid biosynthesis starts with the formation of cinnamic acid from phenylalanine. This cinnamic acid is then converted into cinnamoyl-CoA, p-Coumaryl-CoA, p-coumaryl quinic acid, caffeoyl quinic acid, caffeoyl-CoA, feruloyl-CoA, and sinapoyl-CoA. caffeoyl quinic acid also known as chlorogenic acid, which is a highly soluble phenylpropanoid in Solanaceae and it has been mentioned to play a vital role as an antioxidant and defense molecule 49 . Further, we have identi-  www.nature.com/scientificreports/ fied a total of 229 and 943 genes as well as 17 and 19 enzymes in LCL and LCR transcriptomes respectively and these enzymes are required for phenylpropanoid biosynthesis pathway (Table 4). Our results are different from 50 where the authors reported only 11 genes from Solanum trilobatum that were involved in the biosynthesis of Gene expression analysis. De novo transcriptome filtered by Corset was used as a reference 51 . RSEM 40 that map reads back to transcriptome and quantify their expression level was applied. Total reads in gene expression analysis of LCL tissues were 70,155,594, followed by total mapped 54,875,118 (78.22%); whereas total reads in gene expression analysis of LCR tissues were 84,263,224 followed by total mapped 47,187,580 (56.00%). To calculate the gene expression level, RSEM analyzed the mapping results of Bowtie and read count for each gene was converted into FPKM value. In RNA-seq, it is the most common method of estimating gene expression levels, which takes into account the effects of both sequencing depth and gene length on counting of fragments. These results are summarized in Supplementary File 6. Venn diagram, heat map and volcano plot highlights the DEGs and shows genes that are unique and common to leaf and root samples. The total numbers of expressed genes in LCL and LCR tissues were 67,714 and 359,010 respectively, while a total of 49,072 unigenes were found as commonly expressed in both tissues (Fig. 9A). We used DESeq software, to analyze the expression of unigenes in both LCL and LCR transcriptomes by normalizing the values to Fragments Per Kilobase Million (FPKM). The Benjiamini-Hochberg method was used to verify and revise the p-values. Volcano plot demonstrates the fold changes in the expression and statistical comparison (Fig. 9B). A total of 20,044 differentially expressed genes (DEGs) were identified. Within these DEGs 11,496 genes were up regulated and 8,548 genes were down regulated (Fig. 9B). Further, the hierarchical clustering analysis was used to screen the DEGs and cluster them according to their expression in LCL and LCR samples (Fig. 10A). On the basis of pathway enrichment, 20 metabolic and/or biosynthetic processes were predominantly involved. Of these spliceosome and protein processing in the endoplasmic reticulum was comprised by approximately 300 genes (Fig. 10B). Other important processes identified included; synaptic vesicle cycle, starch and sucrose metabolism, plant hormone signal transduction, conclusion Molecular studies on L. camara are rare and high-throughput genotyping efforts are almost non-existent. This study investigated the transcriptomes assembly of L. camara leaf and root tissues. A massive data of 70,155,594 and 84,263,224 clean reads were de novo assembled that revealed 72,877 and 513,985 unigenes from leaf and root tissues respectively. Further, the identified unigenes were annotated and functionally characterized in 7 databases. Notably, the numbers of expressed genes in LCL and LCR tissues varied; and 49,072 unigenes were commonly expressed in both tissues. Still, of the 20,044 DEGs 11,496 were up regulated and 8,548 genes were down regulated. Pathway analysis revealed the involvement of 229 and 943 genes (coding for 17 and 19 enzymes) in the biosynthesis of phenylpropanoid pathway in leaf and root tissues respectively. Nonetheless, the genomic resources will not only provide foundation of genomic research in L. camara but it will also provide detailed insight into the expression as well as functional analysis, gene cloning and avenues for genomics-assisted breeding in L. camara. This study will also serve as baseline to understand the regulation and biosynthesis of crucial bioactive compounds and to select superior alleles/haplotypes of L. camara with desired traits in the future.