Transcriptome analysis of the curry tree (Bergera koenigii L., Rutaceae) during leaf development

The curry tree (Bergera koenigii L.) is a widely cultivated plant used in South Asian cooking. Next-generation sequencing was used to generate the transcriptome of the curry leaf to detect changes in gene expression during leaf development, such as those genes involved in the production of oils which lend the leaf its characteristic taste, aroma, and medicinal properties. Using abundance estimation (RSEM) and differential expression analysis, genes that were significantly differentially expressed were identified. The transcriptome was annotated with BLASTx using the non-redundant (nr) protein database, and Gene Ontology (GO) terms were assigned based on the top BLAST hit using Blast2GO. Lastly, functional enrichment of the assigned GO terms was analyzed for genes that were significantly differentially expressed. Of the most enriched GO categories, pathways involved in cell wall, membrane, and lignin synthesis were found to be most upregulated in immature leaf tissue, possibly due to the growth and expansion of the leaf tissue. Terpene synthases, which synthesize monoterpenes and sesquiterpenes, which comprise much of the curry essential oil, were found to be significantly upregulated in mature leaf tissue, suggesting that oil production increases later in leaf development. Enzymes involved in pigment production were also significantly upregulated in mature leaves. The findings were based on computational estimates of gene expression from RNA-seq data, and further study is warranted to validate these results using targeted techniques, such as quantitative PCR.


Materials and Methods
tissue Collection. Both immature and mature leaf tissue (differentiated by size and proximity to the apical meristem, with immature leaves ranging from 0.5-1.0 cm, and mature leaflets from 1.5-3.0 cm) were collected from the curry tree growing in the United States Botanic Gardens in Washington, D.C. (accession 06-0532). The voucher for this Bergera koenigii L. specimen is Van Neste 229 (US). Leaf samples were immediately stored in liquid nitrogen at −190 °C. Figure 1 shows both immature and mature leaves from the tree used in this study, located in the US Botanic Garden.
RNA extraction and Library preparation. RNA was extracted from ~50 mg leaf tissue using the Spectrum ™ Plant Total RNA Kit (Sigma-Aldrich, St. Louis, USA), following a modified procedure detailed in the kit manual designed for highly recalcitrant Citrus tissue, rich in secondary metabolites and oils. Prior to the Spectrum ™ Plant Total RNA Kit, various RNA extraction protocols were used on the curry tissue with poor results. The ISOLATE II RNA www.nature.com/scientificreports www.nature.com/scientificreports/ Mini Kit (Bioline, Taunton, MA, USA), Trizol RNA extraction 27 , and a guanidinium-free phenol extraction designed for use on tissues rich in secondary metabolites 28 , all yielded RNA extracts of low quality and concentration.
The concentrations of the extracted RNAs were determined using a NanoDrop 1000 Spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA). Extractions with concentrations greater than 25 ng/μL and a 260/280 absorbance ratio of 2.0 ± 0.1 were used for cDNA library preparation. The RNA extractions were sent to the Transcriptome and Genome Analysis Laboratory at the University of Göttingen (Göttingen, Germany) for library preparation and sequencing. There, RNA quality measured by RIN (RNA Integrity Number) was quantified using the Agilent 2100 Bioanalyzer (Agilent Technologies, Palo Alto, CA, USA) before the sequence library was prepared using the TruSeq RNA Sample Preparation Kit (Illumina Inc., San Diego, CA, USA). This library was quantified using the QuantiFluor dsDNA System (Promega, Madison, WI, USA), and size selected using the Bioanalyzer 2100 (Agilent, Santa Clara, CA, USA). Next-generation sequencing (NGS) was performed on the Illumina HiSeq. 4000 instrument (Illumina Inc., San Diego, CA, USA) following standard protocols. Data Analyses. Around 76 million ~50 bp single-end sequencing reads were filtered by quality score using the FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit) and assembled with Trinity 2.1.1 29 with default parameters. To determine the completeness of the assembly, 2121 single-copy orthologs standard to eudicots were aligned using Benchmarking Universal Single-Copy Orthologs (BUSCO) to the transcriptome 30 . The completeness of the assembly was compared with that of a previously published transcriptome of the curry leaf 31 . The assembled transcriptome was annotated using BLASTx 32 , which searches translated nucleotide sequences against the non-redundant (nr) BLAST database. Sequence descriptions for assembled transcripts were transferred from homologous BLAST hits with E-values <10 −4 . BLAST hits were then functionally annotated using Blast2GO 33,34 and Gene Ontology (GO) terms were assigned to transcripts with BLAST hits to understand gene function in the leaf tissue 35 .
The Trinity toolkit 36 was used to facilitate downstream analysis, including abundance estimation, differential expression, and GO enrichment. RNA-Seq by Expectation-Maximization (RSEM) was used for alignment-based abundance estimation 37 using Bowtie2 38 . Since the assembled transcriptome represents all the transcripts found in the leaf in both stages, the sequence reads from each sample were aligned to the assembled transcriptome to differentiate transcripts unique to one stage or present throughout leaf development. The read counts for each transcript were analyzed using edgeR 39 , a Bioconductor package 40 used for differential expression analysis, and the logarithmic fold change and false detection rate (FDR) were calculated for each transcript. The results of differential expression and GO annotation were combined for Gene Ontology enrichment using GOseq 41 , which identifies GO terms enriched as a result of differential expression between samples.

Results transcriptome Assembly statistics.
A transcriptome for the curry leaf was assembled using ~76 million single-end sequence reads, pooled from both immature and mature leaf libraries such that the final assembly shows a more accurate estimation of genes expressed throughout leaf ontogeny. Figure 2 shows the quality scores for each library after quality filtering of the reads. In the leaf transcriptome, 55,441 genes were assembled, comprising of 57,428 transcripts, with a median contig length of 362 bp. The N50 statistic for the transcriptome assembly was 1421 bp, representing the minimum contig length at which half the assembled bases can be found. Although the N50 statistic is similar to the median contig length, it inherently gives a larger weighting to longer contigs, resulting in a longer length compared to the median contig. Figure 3 shows a plot of the expressed N50 (exN50) value, which represents the N50 value computed using only a subset of genes that are highly expressed at a given expression percentile. The leaf transcriptome assembly reaches a peak exN50 at the 90 th percentile expression, indicating a high-quality assembly and adequate sequencing depth.
Benchmarking Universal Single-Copy Orthologs (BUSCO) was used to assess the completeness of the transcriptome assembly. By searching for 2121 single-copy orthologs standard in eudicots, the assembly quality of the transcriptome can be quantitatively determined. Of the 2121 benchmark single-copy orthologs, 1726 (83.1%) were complete and present in the assembly (of which 1520 [71.1%] were single-copy and 242 [11.4%] were found www.nature.com/scientificreports www.nature.com/scientificreports/ to be duplicated), 217 (10.2%) were present but fragmented, and 142 (6.7%) were missing from the assembly altogether. Less than 10% of orthologs were missing from the assembly, indicating a complete and high-quality assembly.
BUSCO was run on a previously published transcriptome for the curry leaf 31 , and the results were compared. Figure 4 shows the comparison of the completeness analysis between the two transcriptomes. This transcriptome was found to be more complete than the previously published curry leaf transcriptome, including more complete and fragmented benchmark single-copy orthologs. Of the complete orthologs present in both transcriptomes, the newly assembled transcriptome contained fewer duplications, indicating a higher quality assembly (since the benchmark orthologs used in the analysis were single-copy in origin).
transcript Abundance. Using alignment-based abundance estimation, the Fragment Per Kilobase of transcript per Million mapped reads (FPKM) value was calculated for each gene, and a threshold of 1 FPKM was used to determine if a gene was expressed in the sample. Of the 55,441 genes, 13,221 genes were unique to mature leaf tissue, 2449 were unique to immature leaf tissue, 37,271 were common to both developmental stages, and 2,500 genes were found to not be significantly expressed in either sample (FPKM values below the threshold).
Differential Expression. Abundance estimations were compared across samples to calculate the fold change for each gene, along with the false detection rate (FDR), that represents the statistical significance of the differential expression test. Figure 5 shows each gene plotted by its fold change and FDR, displaying genes with an FDR <0.05 in red. For this study, genes expressed with a greater than 6-fold (a factor of 64) change and an FDR <10 −12 between the two developmental stages were considered significantly differentially expressed. Of the 55,441 genes in the transcriptome, 54 genes were found to be significantly differentially expressed between the two samples. Figure 6 shows the significantly differentially expressed genes, clustered by expression pattern, in a heatmap. . The N50 statistic computed using subsets of highly expressed genes. Percentile expression represents the minimum percentile of expression for which to include a gene in the N50 calculation. The N50 value represents the contig length at which half the assembled bases in the assembly can be found. The peak N50 contig length is reached at the 90th percentile, indicating a high-quality transcriptome assembly.  31 ). Using Benchmarking Universal Single-Copy Orthologs (BUSCO), 2121 single-copy orthologs standard to eudicots were searched against each transcriptome to determine if they were found complete (single-copy/duplicated), fragmented, or missing. The newly-assembled transcriptome was found to be more complete than the previously published transcriptome, and fewer duplications of the single-copy benchmark orthologs were found.   www.nature.com/scientificreports www.nature.com/scientificreports/ Annotation and Go enrichment. Transcripts were annotated using BLASTx, assigning a sequence description using the top five BLAST hits with E-values <10 −4 . Through BLASTx, 33,427 transcripts (58.2% of the transcriptome) were annotated with a description based on sequence homology. The remaining ~40% may not have yielded any BLAST hits due to the lack of sequence data available for this species, with the closest well-annotated species being Citrus sinensis. These transcripts were functionally annotated using Gene Ontology (GO) terms to understand the role of the gene in the tissue. Of the 33,427 transcripts annotated using BLASTx, 23,650 (70.8%) were assigned GO terms. Figure 7 shows the spread and representation of GO categories expressed in the leaf transcriptome. Combining the results of GO annotation with the differential expression analysis, GO terms which were significantly enriched or depleted as a result of differentially expressed transcripts were identified. In immature tissue, 43 GO terms were found to be enriched due to the upregulation of genes in the sample, and 30 GO terms were found to be enriched in mature tissue.
After assigning GO terms, transcripts involved in terpene biosynthesis were identified by searching the assigned GO terms. 48 genes were identified with the GO term "GO:0010333", which represents terpene synthase activity. The differential expression of the identified terpene synthase transcripts was studied to compare terpene biosynthetic activity between the leaf developmental stages. Figure 8, similar to Fig. 5, shows each terpene synthase gene plotted by its fold change and FDR, displaying genes with an FDR <0.05 in red. Figure 9 shows the 25 most differentially expressed terpene synthase genes, labelled with the name of the closest homologous match from each respective BLAST query.

Discussion
Go enrichment in Immature tissue. GO enrichment analysis reveals information about which pathways likely were upregulated as a result of differential expression between the two developmental stages. In the immature leaf tissue, 13 out of 43 enriched GO terms were involved in the building or destruction of cell walls and other structural features of the tissue, such as the lignification of vascular tissue. Table 1 lists the GO terms related to structural activity in the immature leaf along with statistical significance.
The first pathways involved in the structural development of the leaf tissue is the lignin biosynthetic pathway. Genes for lignin metabolism are enriched in immature tissue, along with those for hydroquinone: oxygen oxidoreductase activity, such as "laccases", a class of enzymes known to be involved in the lignification of vascular tissue 42 . These lignification-related genes are upregulated as the vascular tissue in the leaf expands during development, and the lignified xylem provides structural reinforcement.
Immature leaf tissue also expresses an abundance of genes encoding enzymes that regulate pectin production and modification, such as aspartyl esterase and pectinesterase. The highly pectinaceous cell walls of the immature leaves are likely responsible for the great flexibility which protects growing and expanding leaves from structural damage 43 . The increase in pectin-digesting enzymes is expected in any tissue undergoing rapid cell division as cell wall polysaccharides are digested and modified during expansion and cytokinesis. Similar changes in pectin modification pathways have been well documented in the leaf venation of other species such as the potato (Solanum tuberosum) and mouse-ear cress (Arabidopsis thaliana) 44,45 .
Peroxidase enzymes used in cell wall modification are also up-regulated in immature leaf tissue. Hydrogen peroxide has been shown to play a signaling role in the modification and building of cell walls and other cell structures 46 . Expression of peroxidase genes was more abundant in immature leaf tissue, indicating their potential involvement in the development of these leaves. www.nature.com/scientificreports www.nature.com/scientificreports/ Go enrichment in Mature tissue. Table 2 lists selected GO terms related to housekeeping pathways, as well as oil and pigment production, upregulated in mature tissue, along with measures of their statistical significance. The majority of GO terms enriched in mature tissue were housekeeping genes, such as those involved in intracellular molecular transport (7 GO terms) and membrane production (3 GO terms). The upregulation of intracellular molecular transport genes could be a result of larger, fully differentiated cells that require more frequent and distant transport of intra-and extra-cellular materials than in immature tissue.
The terpene synthase gene family (p = 0.0028) was upregulated in mature tissue, including one enzyme class that is possibly homologous to synthases that produce limonene, pinene, (E)-beta-ocimene, and (+)-alpha-phellandrene (sequence homology E-value = 0) terpenes. These have been shown to be constituents of the essential oil in the curry leaf and contribute significantly to the aroma and flavor of the plant [13][14][15][16]20,21 . Recently, several of the secondary metabolites produced by enzymes in this polyketide synthase gene family have been identified via cloning of transcripts in an expression vector and subsequent GC-MS chromatography. Many terpene synthase genes were also identified using transcriptomic sequencing in the curry leaf and were compared with those of other species in a phylogenetic context 31 . Unlike in the present study, however, differential expression of these genes was not studied during leaf development and oil gland morphogenesis. During our study, the expression of these terpene synthase family genes was compared during leaf development, and Fig. 9 shows a significant trend of upregulation of terpene synthase activity after leaf maturation. Another upregulated class of sesquiterpene synthases are those which produce caryophyllene (sequence homology E-value <10 −110 ), a sesquiterpene oil that has been found to be the main constituent of curry essential oil 13,14,16,20 , yielding the various medicinal properties of the leaf 19 . This supports anatomical studies which show that oil gland counts continue to increase late into leaf development 47 . The upregulation of this class of enzymes indicates an increase in the production of these oils during leaf development. This corroborates with previous research that describes correlations between slower growing plants and an increased metabolic investment in antiherbivory compounds 48 Table 2. Housekeeping, pigment production, and oil production GO terms significantly enriched in mature leaf tissue as a result of differential expression.
www.nature.com/scientificreports www.nature.com/scientificreports/ upregulation in terpene synthases may also reflect an increase in gibberellin production. Fully mature leaf tissue may begin synthesizing gibberellic acid to regulate the onset of senescence and programmed cell death 25 .
The GO category for pigment biosynthesis was also enriched in mature leaf tissue (p = 0.0042), pigments which are necessary for the photosynthesis. The mature leaves have an increased cell count and size and a deeper green color than the nascent leaf tissue which appears etiolated (Fig. 1).
Further study for differential expression validation. The findings reported here for differential expression between leaf growth stages are based solely on computational approaches to RNA-seq analysis. The transcriptome was assembled using read overlap from the original RNA-seq dataset, and gene expression was estimated using read alignment to the assembled transcriptome. Using reads aligned to a transcript as a proxy for the abundance of the transcript only provides a computational estimate, which, in order to make valid claims about differential expression, must be validated by various approaches. The sequence data is provided to assist in future studies aimed at validating these results, using methods such as quantitative polymerase chain reaction (qPCR), or repeating the study with multiple biological replicates. Thus, the results presented will aid in the further study of the gene expression profile of the curry species.
Another opportunity for future study is to investigate the change in the metabolic profile of the leaves through development, correlating the changes in gene expression of relevant enzymes involved in metabolic synthesis to the changes in metabolites. By analyzing the change in compounds such as terpenes and other products in the essential oil, the results of differential gene expression studies can be further validated.

Conclusion
The transcriptome pattern of the curry leaf was a useful tool in the analysis of the differential expression of genes and pathways in immature and mature tissue. Until this study, only a few nuclear genes, complete chloroplast genome 49 , and a transcriptome 31 in mature leaves had been sequenced for the curry tree; this study adds 57,383 newly sequenced transcripts (including multiple transcriptional isoforms of the same gene), contributing to a new understanding of the molecular genetics of the curry leaf development. The upregulated pathways identified in immature leaves can now be studied in this species, as well as offer molecular resources for studying the antiherbivory mechanisms of vulnerable meristematic tissues in the absence of mature oil glands. Lastly, the classes of enzymes involved in the synthesis of terpenes were characterized in curry leaves, and the differential expression of these terpene synthase genes was studied, with results showing a significant increase in terpene biosynthesis after leaf maturation. This will enable future studies to investigate the role of these enzymes and metabolites in the ecology and physiology of this species as well as its application in food and medicine.

Data Availability
All sequence reads are available in the NCBI Sequence Read Archive (SRA) under the accessions SRR5590072 and SRR5590073.