RNA sequencing and de novo assembly of Solanum trilobatum leaf transcriptome to identify putative transcripts for major metabolic pathways

Solanum trilobatum L. is an important medicinal plant in traditional Indian system of medicine belonging to Solanaceae family. However, non-availability of genomic resources hinders its research at the molecular level. We have analyzed the S. trilobatum leaf transcriptome using high throughput RNA sequencing. The de novo assembly of 136,220,612 reads produced 128,934 non-redundant unigenes with N50 value of 1347 bp. Annotation of unigenes was performed against databases such as NCBI nr database, Gene Ontology, KEGG, Uniprot, Pfam, and plnTFDB. A total of 60,097 unigenes were annotated including 48 Transcription Factor families and 14,490 unigenes were assigned to 138 pathways using KEGG database. The pathway analysis revealed the transcripts involved in the biosynthesis of important secondary metabolites contributing for its medicinal value such as Flavonoids. Further, the transcripts were quantified using RSEM to identify the highly regulated genes for secondary metabolism. Reverse-Transcription PCR was performed to validate the de novo assembled unigenes. The expression profile of selected unigenes from flavonoid biosynthesis pathway was analyzed using qRT-PCR. We have also identified 13,262 Simple Sequence Repeats, which could help in molecular breeding. This is the first report of comprehensive transcriptome analysis in S. trilobatum and this will be an invaluable resource to understand the molecular basis related to the medicinal attributes of S. trilobatum in further studies.

). An in-depth study of S. trilobatum transcriptome is needed for analysis and characterization of its functional genes. This would help to achieve large-scale production of drugs via molecular breeding, transgenic technology, and metabolic engineering. RNA Sequencing-based de novo assembly is a well-developed approach to understanding transcriptomes of non-model plants with limited genomic information. Moreover, RNA-Seq is a cost-effective tool, offers much data with better coverage and sufficient sequence depth for de novo assembly of transcriptomes. In the past few years, there has been a significant increase in utilizing RNA-Seq for discovery and identification of functional genes involved in the biosynthesis of active compounds in plants 16 . In this study, we report the transcriptome from the leaf of Solanum trilobatum using high throughput next-generation sequencing for the first time. The high-quality reads were de novo assembled into unique transcripts, which were then extensively evaluated and annotated to identify the putative pathways and genes responsible for its medicinal properties.

Materials and Methods
Plant material and RNA isolation. Solanum trilobatum L. collected from Guduvanchery, Kancheepuram district, Tamil Nadu was taxonomically identified by the Centre for Floristic Research, Madras Christian College, Tambaram, Chennai, with field no. 523. The mature leaves from top and middle parts of the healthy plant during its flowering stage were collected and used for total RNA extraction immediately after collection using modified CTAB method 17 . The extracted total RNA was treated with DNase A and purified using RNeasy MinElute clean up kit (Qiagen Inc., GmbH, Germany). The quality was assessed using NanodropLite spectrophotometer (Thermo Scientific, Wilmington, Delaware, USA) and Qubit 2.0 (Invitrogen, Carlsbad, California, USA). The RNA integrity value was measured using Bioanalyzer 2100 (Agilent Technologies, Santa Clara, California, USA). The purified total RNA was used for sequencing library preparation.
Library preparation and illumina sequencing. The total RNA was made ribosomal RNA free using Ribo-Zero rRNA removal kit (Illumina Inc., Singapore) and the remaining fraction was purified and eluted. The purified RNA was disrupted into short fragments using fragmentation buffer; these short fragments are used as a template for first strand cDNA synthesis using superscript II reverse transcriptase (Invitrogen, Carlsbad, California, USA) followed by second strand synthesis and purification. The purified double-stranded cDNA was polyadenylated, and adapter-ligated for paired-end library preparation. The adaptor primers were used for amplification of the library for the enrichment of the cDNA fragments. Caliper LabChip GX using HT DNA High Sensitivity Assay Kit (Caliper Life Sciences Inc., USA) was used for library quality assessment. The library was hybridized on a flowcell, and clonal clusters were generated on cBOT using TruSeq PE Cluster Kit v3-cBot-HS (Illumina Inc., USA). Sequencing was carried out on Illumina Hiseq. 2500 using TruSeq v3-HS kit to generate 100 bp paired-end reads (Illumina Inc., USA).
De novo assembly and clustering. The raw paired-end reads were quality assessed by FastQC v0.11.2 18 .
Pre-processing of raw reads was performed with Cutadapt v1.7.1 19 and Sickle v1.33 tools 20 for adapter trimming and quality filtering respectively. Reads with Phred score >=30 were retained. The filtered reads were further used for transcriptome assembly using Trinity 21 . Trinity, a de novo assembler consists of Inchworm, Chrysalis, and Butterfly modules, which are applied sequentially to process RNA-seq raw reads into full-length transcripts. The process begins with Inchworm which generates full-length transcripts from raw reads based on default k-mer values. The Chrysalis clusters the contigs generated by Inchworm and prepares de bruijn graph for each cluster. Finally, butterfly processes individual graphs reporting full-length transcripts for alternatively spliced isoforms. The redundancy of the assembled contigs was removed using CD-HIT v4.5.4 22 . Assessment of gene completeness. The gene completeness analysis was performed by using the TRAPID tool (http://bioinformatics.psb.ugent.be/webtools/trapid). The analysis was carried out by comparing the unigene transcripts against PLAZA 2.5 green plants clade database 23 with an E-value of <1E-5 for significant similarity search and annotation of unigenes. The completeness of unigenes was assessed by considering one or more hits in TRAPID database 24 for "full length", "quasi full length" or "partial" based on length of the open reading frame (ORF).
Functional annotation and classification. The de novo assembled sequences of S. trilobatum were compared against plant non-redundant (nr) protein database at National Centre for Biotechnology Information (NCBI) using BLASTX tool from stand alone BLAST+ package with an E-value parameter not greater than 1E-05 for identification of best significant match. The BLASTX results were further imported to Blast2GO suite 25 for retrieving Gene Ontology (GO) terms of assembled unigenes and for mapping, the annotation was further continued with unique enzyme codes (EC) and Kyoto Encyclopedia of Genes and Genomes (KEGG) maps. Moreover KEGG Automated Annotation Server (KAAS) was also used for pathway mapping in addition to Blast2GO. GO terms are precisely defined as controlled vocabulary which can be used to describe functions of genes or gene products. The assembled transcripts based on the retrieved GO terms were classified into three categories viz. Biological process, Molecular function, and Cellular component. The pathway maps were determined from KEGG database with an E-value of 1E-05. perfect di-, tri-, tetra-, penta-and hexa nucleotide motifs with a minimum thresholds of 6, 5, 5, 5 and 5 repeats respectively.
Transcript quantification. The estimation of unigene abundance was determined using RNA-Seq by Expectation-Maximization (RSEM) tool, which quantifies transcript level abundance from RNA Seq data. RSEM first generates and preprocesses a set of reference transcript sequences and then aligns reads to reference transcripts followed by estimation of transcript abundances. RSEM calculates fragments per kilo base per million (FPKM) and transcripts per million (TPM) values for the assembled individual unigenes from S. trilobatum 28 . FPKM and TPM values are calculated to understand the expression levels of unigenes involved in biosynthetic pathways of secondary metabolites.

Validation by reverse transcription PCR.
In order, to experimentally validate de novo transcriptome assembly, some of the assembled unigenes of S. trilobatum which share sequence similarity to various secondary metabolite biosynthetic pathway genes were selected for reverse transcription-PCR. All the primers were designed from final assembled sequences, and actin (house keeping gene) was used a positive control. The primer sequences are given in Supplementary Table S1. The reverse transcription-PCR products were electrophoresed on 1% agarose gel.
Gene expression analysis by qRT-PCR. The quantitative gene expression analysis was performed using QuantStudio 5 Real-Time PCR System (Thermo Scientific, Wilmington, Delaware, USA) and QuantiNova SYBR Green PCR Kit (Qiagen Inc., GmbH, Germany). For each primer pair, a control reaction without a template was also included. Elongation factor 1-alpha (ef1a) from Solanum trilobatum was used as an internal reference gene for normalization and estimation of gene expression. The data from qRT-PCR data was analysed using comparative Ct (2 −ΔΔCt ) method. Fold change in gene expression was calculated as 2 −ΔΔCt using ΔCt values 29 . All the experiments were repeated using three technical and two biological replicates. Gene specific oligonucleotides used for qRT-PCR analysis are provided in Supplementary Table S2.

Results and Discussion
Illumina paired-end sequencing and De novo assembly. A total of 136,220,612 raw reads were generated from Solanum trilobatum leaf transcriptome that accounts for about 35.4 GB of paired-end sequencing data. The raw data were deposited at National Centre for Biotechnology Information (NCBI) Short Read Archive (SRA) database under the accession number SRP132765. The pre-processing of raw reads was done for removal of adaptor sequences and low-quality reads (Phred score <30) and a total of 124,413,306 clean reads (Phred score >=30) with GC content 39.21% were retained. Trinity assembler was employed for de novo assembly of short read sequences. A total of 144,580 assembled transcripts were generated with a mean size of 823 bp. The assembled transcripts were further clustered, using CD-HIT, into 128,934 unigenes with a mean length of 745 bp and N50 contig length of 1347 bp. The transcriptome assembly details are given in Table 1. Among unigenes, the minimum length was 201 bp and maximum length 19956 bp. The length distribution of the unigenes is given in Fig. 1.
Functional annotation of unigenes. The assembled unigenes of S. trilobatum were annotated for sequence similarity search and comparison using BLASTX against plant non-redundant protein database at NCBI with an E-value cut off 1E-5. The BLASTX results showed 60,097 annotated unigenes, and 68,837 remained non-annotated unigenes out of 128,934 assembled transcripts. Among annotated unigenes, 1.5% were uninformative hits (e.g., Predicted: Uncharacterized or hypothetical proteins) due to inadequate S. trilobatum genome
Analysis of flavonoid biosynthesis genes. The coloring pigment of most flowers, fruits, and seeds are flavonoids. They are widely distributed in plants and classified into 6 major subgroups: chalcones, flavones, flavonols, flavandiols, anthocyanins and proanthocyandins, one more subgroup is found in some species, the aurones. Flavonoids are synthesized through Phenylpropanoid pathway, by transformation of phenylalanine to p-Coumaryl-CoA, which actually enters into the flavonoid biosynthesis pathway. The first enzyme specific to this pathway is Chalcone Synthase, which produces chalcone scaffolds used for the formation of other flavonoids 33 . In the present study, Chalcone synthase (EC: 2.3.1.74) and Chalcone isomerase (EC: 5.5.1.6) enzymes were identified and are responsible for the formation of Naringenin from p-Coumaroyl-CoA. The enzymes required for conversion of Naringenin to produce apiforol by flavonone-4-reductase (EC: 1.1.1.234), eriodictyol by flavonoid-3′5′-hydroxylase (EC: 1.14.1388) and dihydrotricetin by flavonoid 3′-monooxygenase (EC: 1.14.13.88) were also identified. Moreover, the enzyme 6′-deoxychalcone synthase (EC: 2.3.1.170) responsible for converting p-Coumaroyl CoA to Isoliquiritigenin for production of butein is also identified in our dataset where Chalcone isomerase further converts butein to butin. The flavonone 4-reductase also acts on eriodictyol to produce a bioactive compound luteoforol. The enzyme Naringenin-3-dioxygenase (EC: 1.14.11.9), found in our data set, acts on pinocembrin to produce pinobanksin which is further acted by Flavanol synthase (EC: 1.14.11.23) to convert it into galingin, Pinocembrin also gets converted to Pinostrobin. Naringenin-3-dioxygenase acts on liquiritigenin to produce garbanzol and also acts on eriodictoyl to produce dihydroquercetin or taxifolin which is further converted to quercetin by Flavonol synthase enzyme were identified in S. trilobatum leaf transcriptome dataset. Moreover, Naringenin 3-dioxygenase acts on dihydrotricetin to convert it into dihydromyricetin which is finally converted to myrecitin by Flavonol synthase. A similar set of flavonoid pathway genes has been already reported in leaf transcriptomes of endangered medicinal plant Chlorophytum borivilianum 34 and also in the leaf tissue of medicinal plant Phyllanthus amarus 35 . These reports support our findings in the present dataset and also suggest how flavonoid pathway is complimenting for medicinal properties of S. trilobatum. The current study identified the genes in the biosynthesis of various flavonoids like butein, butin, pinostrobin, naringenin, galingin, garbanzol, dihydrofisetin (futin), eriodictyol, homoeriodictyol, kaempferol, apiforol, luteoforol, myricetin, dihydroquercetin (taxifolin) and cyanidin. One of the study reported that butin has aromatase inhibitory action whereas butein shows its effect against breast and lung cancer [36][37][38] . Pinostrobin has been reported to have antiviral property 39 and it also possesses antioxidant and chemoprotective activity 40 . Naringenin, a flavonone flavonoid, has been shown to possess marked antioxidant and anti-inflammatory properties 41 . Naringenin has been reported to have a protective effect on carbon tetracholride induced acute nephrotoxicity in mouse models 42 . Naringenin is reported to be an effective chemotherapeutic agent for prostate cancer 43 . It has also been reported that S. trilobatum possesses neuroprotective role against pathology of Parkinson's disease 44 . Galangin is considered to be the potential candidate for new drugs against Alzheimer's disease 45 . It also possesses hepatoprotective and anticancer properties 46,47 . Quercetin is one of the important flavonoids reported in our dataset, it possess antioxidant, also shows anti-carcinogenic and hepatoprotective properties 48 . Luteoforol, quercetin, and myricetin are reported to have anti microbial properties 49,50 . The flavonol dihydrofisetin (fustin) is reported to show protective role in neuronal cell death 51 . The Kaempferol also identified in our dataset is a well-known phytoestrogen reported to induce osteoblastic differentiation 52 . Kaempferol is also reported to possess hepatoprotective, antioxidant and anticancer effects [53][54][55] . Dihydroquercetin (Taxifolin) reported in our data acts as a potential chemopreventive agent 56 . Cyanidin is also reported as potent inhibitor of EGFR, shutting off downstream signalling cascades 57 (Fig. 6). A summary of major genes involved in phenylpropanoid and flavonoid pathways has been presented in Table 3.
Transcription factors from S. trilobatum. Transcription factors (TFs) play a crucial role in regulation of secondary metabolites by regulating the expression of related genes of biosynthetic pathways. The identification of such TFs will benefit us in understanding gene regulatory networks. TFs known to regulate plant secondary metabolism include R2R3-MYB, bHLH proteins like CrMYC2, AP2/ERF family proteins, WRKY, NAC, DOF, bZIP, HD-ZIP and TFIIIA zinc finger TFs 59 . We identified a total of 654 unitranscripts representing 48 TF families. Among the represented families, AP2 family TFs (47 unitranscripts) were the most abundant followed by      Table 3). The list of top most abundant unigenes for de novo assembled S. trilobatum transcriptome is given in Table 5. The list includes genes mostly from chloroplast, and our data set is a leaf transcriptome so it's expected to have these genes abundantly reflected in our transcriptome data analysis.

Conclusion
The main aim of our study was to analyze the transcriptome of Solanum trilobatum using Illumina high throughput sequencing platform. In order to facilitate molecular research in Solanum trilobatum, we characterized the leaf transcriptome for identification of unitranscripts involved in biosynthesis of secondary metabolites, since several medicinal attributes are affiliated to leaf tissue of this plant. Our results suggested that Flavonoid biosynthesis pathway is highly represented in the leaf transcriptome and is possibly contributing for the medicinal attributes of this plant. The predicted biosynthetic pathway is going to serve as lead for identification and isolation of medicinally important phytocompounds. We have also quantified the expression levels of transcripts for various important metabolic pathways. We have validated the de novo assembly by amplifying randomly selected unigenes by Reverse Transcription PCR method and also performed gene expression analysis of selected key genes from flavonoid biosynthesis pathway by qRT-PCR. This is the first report on leaf transcriptome assembly and analysis of Solanum trilobatum and it will serve as an important resource for studying molecular mechanisms involved in biosynthesis of its medicinal compounds.