Transcriptome analysis of five different tissues of bitter gourd (Momordica charantia L.) fruit identifies full-length genes involved in seed oil biosynthesis

The bitter gourd seed oil, rich in conjugated fatty acids, has therapeutic value to treat cancer, obesity, and aging. It also has an industrial application as a drying agent. Despite its significance, genomics studies are limited, and the genes for seed oil biosynthesis are not fully understood. In this study, we assembled the fruit transcriptome of bitter gourd using 254.5 million reads (Phred score > 30) from the green rind, white rind, pulp, immature seeds, and mature seeds. It consisted of 125,566 transcripts with N50 value 2,751 bp, mean length 960 bp, and 84% completeness. Transcript assembly was validated by RT-PCR and qRT-PCR analysis of a few selected transcripts. The transcripts were annotated against the NCBI non-redundant database using the BLASTX tool (E-value < 1E−05). In gene ontology terms, 99,443, 86,681, and 82,954 transcripts were classified under biological process, molecular function, and cellular component. From the fruit transcriptome, we identified 26, 3, and 10 full-length genes coding for all the enzymes required for synthesizing fatty acids, conjugated fatty acids, and triacylglycerol. The transcriptome, transcripts with tissue-specific expression patterns, and the full-length identified from this study will serve as an important genomics resource for this important medicinal plant.


Results and discussion
Assembly of bitter gourd fruit transcriptome. We generated 295.9 million 150 bp raw reads from the green rind, white rind, pulp, immature seeds, and mature seeds of bitter gourd. The reads from individual tissues varied between 33.3 million (green rind) and 84.2 million (immature seeds). Post quality filtering for low-quality reads (Phred score < 30) and adaptor removal, 254.5 million reads (86%) were retained for further processing (Table 1). About 14% of the reads were dis-carded after quality filtering. Even though the rejection rate is higher than the average percentage of data discarded after quality filtering, the total number of reads available for assembly is much higher than the requirement for a good assembly. Using trinity assembler, assembly of these reads generated 146,719 transcripts with 1,064 bp mean length and 2,855 bp N50. After clustering using CD-HIT, we obtained 125,566 transcripts with 960 bp mean length and 2,751 bp N50 ( Table 2). The length distribution of the assembled transcripts is given in Fig. 1. The assembly parameters indicated that the transcripts assembled in this study could be reliably used for gene prediction and other downstream analyses. Table 1. Summary of raw reads and Q30 reads obtained from different tissues of bitter gourd.

S. No Sample name Number of reads (million) Number of bases (Gb)
Q30 >  Table 2. De novo assembly statistics of bitter gourd fruit transcriptome.

Particulars Number
Number of raw reads (in million) 295.9 Number of clean reads (in million) 254.5 No of bases (after processing in Gb) 30 www.nature.com/scientificreports/ Annotation of the transcripts. Based on BLASTX similarity searches, we annotated 76,794 transcripts, and 48,771 transcripts did not show significant similarity with the existing genes. The BLASTX results were analyzed using BLAST2GO for further annotation. As shown in Fig. 2, the top-hit species for the bitter gourd transcripts was Cucumis melo followed by Cucumis sativus. According to the APG IV phylogenetic classification 30 , both the top-hit species are taxonomically closely related to bitter gourd.

Functional classification of the transcripts. The functional classification of the transcripts based on GO
terms showed that 99,443 and 86,681, and 82,954 were classified under biological process, molecular function, and cellular component, respectively. Under the biological process, the highest number of the transcripts were related to the metabolic process (18,231 transcripts), followed by the cellular metabolic process (16,342 transcripts) and primary metabolic process (14,567 transcripts). In the case of molecular function, transcripts involved in organic cyclic compound binding were more abundant (13,119 transcripts), followed by heterocyclic compound binding (11,765 transcripts) and ion binding (9,878 transcripts). In the cellular component category, transcripts responsible for the intracellular anatomical structure were found in more numbers (13,265 transcripts) than organelle (12,412 transcripts), membrane (10,768 transcripts), and others (Fig. 3).
Quantification and completeness analysis of the transcripts. The abundance of the transcripts in the transcriptome was calculated based on the FPKM value using RSEM. The data for all the transcripts from the respective tissues are given in Supplementary Table S3. The top 15 abundant transcripts in each tissue and overall abundance in the fruit transcriptome are given in Supplementary Table S4. The results showed that some genes are abundantly expressed in each tissue. The observed expression patterns of the transcripts can be used to isolate tissue-specific genes and promoters for applied research in bitter gourd. For example, the proteins from the bitter gourd fruit pulp showed antidiabetic properties in vitro and in vivo 31,32 but the corresponding genes are    Fig. S1), which indicated that the transcript assembly is of good quality.
Validation of transcript assembly. The Assembly of the transcripts was validated by performing reverse transcription PCR (RT-PCR) amplification of selected genes using the RNA from the same tissues. We selected ten transcripts ranging in size between 609 bp and 2,567 bp, and their details are given in Supplementary  Table S1. In agarose gel analysis of the RT-PCR products, we observed the amplification of cDNA fragments of approximately the same size as expected based on the transcript assembly (Fig. 4). This result indicated that the transcripts assembled in this study could be reliably used for gene identification.
Validation of gene expression. The expression of selected transcripts associated with terpenoid and steroid bio-synthesis pathways was validated using qRT-PCR. The genes chosen for this study included Phytoene synthase, Acetyl-CoA transferase, 3-Hydroxy-3-methylglutaryl-coenzyme A reductase 1, Phosphate kinase, and Hy-droxymethylglutaryl-CoA synthase. The FPKM values for these genes are given in Supplementary Table S5.
Expression of these genes, as assessed from the cycle threshold (ct) values, largely correlated with the expression levels in terms of FPKM (Fig. 5).
Identification of simple sequence repeats (SSRs). We identified 57,093 SSRs from 125,566 assembled transcripts. The majority of the SSRs were either di-nucleotide repeats (59.0%) or tri-nucleotide repeats (37.4%).  www.nature.com/scientificreports/ Di-nucleotide repeats with six repeats and tri-nucleotide repeats with five repeats were found to be the most abundant ones. Only a small proportion of the SSRs belonged to the tetra-nucleotide (2.0%), hexa-nucleotide (1.0%), and penta-nucleotide repeats (0.65%) ( Table 3). AG/CT was the most abundant type (68.98%) in terms of the actual sequence present in the repeats. AAG/CTT was the most abundant tri-nucleotide repeat (42.16%). The details of different types of repeats identified from the transcripts are given in Fig. 6. SSR marker-based study of genetic diversity in 211 accessions of bitter gourd showed clear segregation according to geographical origin 33 . Alhariri et al. 34 have analyzed 51 accessions using 61 primers and clustered the accession based on the polymorphism in SSR markers. Surprisingly, the 51 accessions formed three groups according to fruit size. These reports indicate that the SSR markers identified in this study will be highly useful for varietal identifications, mapping, and association studies in bitter gourd.

Identification of transcripts related to secondary metabolites.
In KEGG pathway analysis, we mapped 25,497 transcripts to 147 pathways. About half of the mapped transcripts (51.5%) were related to the metabolism of cofactors and vitamins (18%), carbohydrates (17.8%), amino acids (15.7%) (Fig. 7). Antidiabetic properties of bitter gourd fruits have been demonstrated in several in vitro, in vivo, and clinical studies 8,35,36 .
Most of the compounds responsible for its bitter taste and antidiabetic properties are triterpenoids and steroids 37 . The carbon skeletons of cucurbitacin triterpenoids are derived from 2,3-oxidosqualene by the action of oxidosqualene cyclase. Although cucurbitacins accumulated in high amounts in the fruits, cucurbitadienol synthase gene was highly expressed in the leaves. This indicated that cucurbitacins might be synthesized in the leaves and transported to the fruits 38 . Charantin also accumulated in high amounts in the fruits compared to the root, stem, leaves, and flowers 39 . The same study identified fifteen genes related to triterpenoid biosynthesis from the RNASeq library constructed from the bitter gourd seedlings. However, most of them encoded for the enzymes involved only up to the synthesis of squalene and 2,3-oxidosqualene. We identified 191 and 99 transcripts related to the terpenoid backbone and steroid biosynthesis from the bitter gourd fruit transcriptome. In our study also, most of the transcripts were mapped up to the synthesis of squalene and squalene-2,3-epoxide No. of SSR loci Repeat motif  www.nature.com/scientificreports/ Identification of full-length genes related to bitter gourd seed oil biosynthesis. We mapped 2,050 transcripts to lipid metabolism from the assembled transcripts, including 15 pathways related to the biosynthesis and metabolism of various fatty acids and lipids (Fig. 8). Among them, 150, 80, 297, and 339 transcripts were related to the biosynthesis of fatty acids, unsaturated fatty acids, glycerolipids, and glycerophospho-lipids. Plants synthesize various fatty acids and lipids, which have medicinal, cosmetic, and industrial applications [40][41][42] . Fatty acid biosynthesis occurs in plastids using acetyl Co-A as the major precursor and involves fatty acid synthases and carrier proteins. Fatty acids are synthesized up to 16:0 long-chain acyl carrier protein (ACP) during this process. The 16:0 fatty acids are elongated to 18:0 ACP and desaturated to 18:1 ACP by ketoacyl synthase II (KAS II) and stearoyl-ACP desaturase, respectively. All the long-chain ACPs from the chloroplast contribute to the long-chain acyl-CoA pool, which is used to syn-thesize oils and lipids in the endoplasmic reticulum. Fatty acids are incorporated into glycerol-3P to obtain diacylglycerol (DAG). The DAG pool is used to make triacylglycerol (oil) or conjugated fatty acids through distinct pathways. A schematic diagram representing the biochemical pathways of fatty acids, conjugated fatty acids, and triacylglycerol is shown in Fig. 9. Yang et al. 2010 studied the transcriptome of bitter gourd seeds by 454 sequencing and identified several transcripts related to oil biosynthesis genes. However, full-length genes were analyzed only for diacylglycerol acyltransferase 1 and 2 43 . In our study, we identified 2,050 transcripts related to lipid metabolism. The transcripts related to fatty acid biosynthesis, glycerolipid metabolism, and glycerophospholipid metabolism were analyzed to determine full-length genes coding for the enzymes and proteins involved in the biosynthesis of fatty acids conjugated fatty acids, and triacylglycerol. We identified 39 full-length genes (GenBank Accession No: ON175892 to ON175930), which code for 23 enzymes and two carrier proteins required for the biosynthesis of fatty acids, conjugated fatty acids, and triacylglycerol (Fig. 9, Table 4). This included ten full-length genes coding for the five enzymes needed for converting long-chain acyl-CoA to triacylglycerol. The highest number of four genes were identified for 1-acyl-sn-glycerol-3-phosphate acyltransferase (AGPAT). We also found two genes for diacylglycerol acyltransferase (DGAT) and one for phospholipid: diacylglycerol acyltransferase (PDAT), which are essential to convert diacyl glycerol pool to triacylglycerol. These oil biosynthesis genes can be used to increase the oil content through genetic engineering approaches. This is evident from the studies, which showed that overexpression of heterologous genes could achieve as high as a ten-fold increase in seed oil content 44,45 .
Bitter gourd seed oil's medicinal and industrial applications are attributed to the conjugated fatty acid, mainly α-eleostearic acid. Plants with a negligible amount of conjugated fatty acids can also be metabolically engineered to produce a higher amount of these fatty acids by transferring the relevant genes. In Soybean and Arabidopsis, conjugated fatty acid content increased by 15 to 20% when delta-9 and delta-12 conjugase genes were expressed under seed-specific promoters 46 . In Arabidopsis, the accumulation of con-jugated fatty acids was further improved by expressing both fatty acid desaturase and delta-9 and delta-12 conjugase 47 . Developing bitter gourd and other plant varieties with a high amount of conjugated fatty acids will require identifying the genes necessary to divert diacylglycerol from triacylglycerol synthesis. Three enzymes, CDP-choline diacyl glycerol choline phosphotransferase (AAPT), fatty acid desaturase (FAD), and fatty acid conjugase (FADX), are required to synthesize α-eleostearic acid from diacylglycerol 48 . Several transcripts related to the biosynthesis of conjugated fatty acids were identified by analyzing the normalized cDNA library of seed tissues 43 . However, the www.nature.com/scientificreports/ identification of full-length genes was limited to diacylglycerol acyltransferase 1 and 2, which do not directly affect the production of conjugated fatty acid. In the current study, we identified full-length genes for all three enzymes, all of which were present as single-copy genes (Fig. 9, Table 4). These genes will help to improve the conjugated fatty acid content in bitter gourd and other high-yielding oilseed crops like brassica, sunflower, groundnut, and soybean. Though this has been demonstrated in Arabidopsis by co-expressing FADX and DGAT genes from tung 49 , further research in important oilseed crops is required to achieve the scale of production needed for medicinal and industrial applications.

Materials and methods
Plant material. Seeds of Momordica charantia L. variety Co1 were obtained from Tamil Nadu Agricultural University, Coimbatore, Tamil Nadu, India. It is a cultivated variety and its fruits are usually long and dark green weighing upto 100-120 g. It was released for cultivation in 1978, and can yield up to 14 tonnes/ha. The plants were raised from the seeds in a greenhouse (23-25 ºC temperature, 16 h light, and 50% relative humidity). Green unripe fruits were cleaned with sterile distilled water, and dissected to separate green rind, white rind, pulp, and immature seeds. Mature seeds were collected from ripe fruits (Fig. 10). The separated tissues were flashfrozen under liquid nitrogen before storing in a freezer (-80 °C) until use. The plant studies were carried out in accordance with relevant institutional, national, and international guidelines and legislation. www.nature.com/scientificreports/ RNA isolation. RNA isolation from the different samples of M.charantia fruits was done using the Trizol reagent (Invitrogen). Frozen tissue sample of 100 mg was ground to fine powder under liquid nitrogen, 1 ml of Trizol was added, and the tissue suspension was transferred to a 1.5 ml centrifuge tube. Chloroform (0.2 volume) was added, mixed vigorously, and centrifuged at 12,000 rpm for 15 min at 4 °C. The aqueous phase was transferred to a fresh 1.5 ml centrifuge tube, and 0.7 volume of isopropanol was added. The content was mixed gently, centrifuged at 12,000 rpm for 10 min at 4 °C, and the supernatant was discarded. The pellet was washed by adding 500 µl of 70% ethanol, and centrifuging at 10,000 rpm for 5 min at room temperature. The supernatant was discarded, pellet was air-dried, and dissolved in nuclease-free water. The total RNA was checked in the gel, and treated DNase I (Qiagen, Germany) followed by column purification using RNeasy MinElute Cleanup Kit (Qiagen, Germany). The purified RNA was assessed using a spectrophotometer (Eppendorf, Germany), Qubit 2.0 fluorimeter (Invitrogen, USA), and Bioanalyzer (Agilent Technologies, USA). The RNA integrity number Table 4. Details of the full-length genes identified from the bitter gourd fruit transcriptome.
S. no Name of the gene mRNA (nt) CDS (bp) ORF (aa) 5ʹUTR (bp) 3ʹUTR (bp) Genbabank accession no Library preparation and sequencing. The sequencing library was prepared using one µg total RNA as a template using TruSeq mRNA Library Preparation Kit V2 (Illumina Inc, USA). Poly(A)-RNA was purified from total RNA using oligo-dT magnetic beads and fragmented before cDNA synthesis. The fragmented cDNAs were used as template for cDNA synthesis using SuperScript II reverse transcriptase (Invitrogen, USA). The first strand cDNA was converted to double strand cDNA and purified using AMPure XP beads. The purified cDNAs were end-repaired to generate blunt-ended cDNAs and single ' A' nucleotide tail was added to the 3' ends. These ' A' tails were complementary to the 'T' overhang in the adapters. Adapters were ligated to the cDNA ends for indexing and size selected for 420 bp fragments using AMPure XP beads. The size-selected cDNAs were enriched by PCR. The PCR-amplified library was purified, and analyzed in Bioanalyzer 2100 (Agilent Technologies, USA) using High Sensitivity DNA Chips to determine the insert size and yield. The DNA yield was normalized to 10 nM, and pooled for cluster generation in the flow cell. The libraries were subjected to 2 × 150 bp paired-end sequencing in the Illumina platform (Illumina, USA).
De novo assembly and clustering. FastQC v0.11.2 27 was used for the quality analysis of the reads.
Removal of adaptor sequences and trimming was performed using Cut adapt v1.7.1 50 and Sickle v1.33 51 . The high-quality (Q30 >) paired-end reads were assembled using the Trinity assembler with three modules 52 . Inchworm assembled unique transcripts using k-mer values. Chrysalis clustered related contigs and constructed de Bruijn graphs. Butterfly reconstructed full-length transcripts and distinct transcripts for splice isoforms. The redundancy among the assembled transcripts was reduced by clustering using CD-HIT v4.0 53 .

Annotation of the transcripts.
To perform functional annotation, de novo assembled transcripts of bitter gourd were analyzed using BLASTX algorithm plant non-redundant protein database at National Center for Biotechnology Information (NCBI). The annotations of the best hits were saved and subjected to gene ontology (GO) analysis using Blast2GO. Kyoto Encyclopedia of Genes and Genomes (KEGG) database was used, and the transcripts were mapped to different pathways 54,55 .
Transcript quantification. The abundance of the transcripts in the bitter gourd fruit tissues was measured using RNA-Seq by expectation maximization (RSEM) tool. In RSEM, the RNA-Seq reads were aligned to the reference transcripts to estimate the transcript abundance 56 . The fragments per kilobase per million (FPKM) and transcripts per million (TPM) were calculated to estimate the transcript abundance.
Identification of simple sequence repeats. Simple sequence repeats (SSRs) from bitter gourd were identified using the MISA tool 57 . The search parameters identified different repeat types and number repeat for each kind.

Assessment of gene completeness. Completeness of the transcripts was estimated using Benchmarking
Universal Single Copy Orthologs (BUSCO) analysis against a database of single-copy orthologous genes for plants (eudicotyledons_odb10). The BUSCO analysis identifies the conserved orthologs and helps to assess the completeness of transcripts 58 . www.nature.com/scientificreports/ Validation of transcript assembly. Ten assembled transcripts were selected at random and validated for the correctness of de novo assembly by reverse transcription PCR (RT-PCR) amplification of the transcripts. Transcript-specific primers were designed based on the sequence of the de novo assembled transcripts (Supplementary Table S1). cDNA was synthesized using oligo-dT primers, and individual transcripts were amplified using the respective transcript-specific primers. PCR amplification included initial denaturation at 95 ºC for 2 min, 35 cyles of denaturation at 95 ºC for 30 s, annealing (55-59 ºC) for 30 s, and extension at 72 ºC for 2 min, final extension at 72 ºC for 5 min, and storage at 4 ºC. The amplified products were analyzed by agarose gel electrophoresis.
Analysis of gene expression. Five transcripts were selected at random and analyzed for gene expression.
Tran-script-specific primers suitable for qRT-PCR were designed based on the sequence of the de novo assembled transcripts. Analysis of gene expression was performed using TB Green Premix Ex Taq II (Takara, Japan) and QuantStudio 5 (ThermoFisher Scientific, USA). As reported before, the serine/threonine phosphatase gene was used as an internal reference 59 . A negative control reaction without a template was included in all experiments done. The sequences of the primers designed for qRT-PCR analysis are given in Supplementary Table S2. Gene expression was analyzed using the individual efficiency corrected calculation method 60 .

Conclusions
In the current study, we investigated the fruit transcriptome of bitter gourd, focusing on identifying full-length genes involved in seed oil biosynthesis. We identified 39 full-length genes coding for 23 enzymes and two carrier proteins involved in the biosynthesis of fatty acids, conjugated fatty acids, and triacylglycerol. We also identified several transcripts related to the biosynthesis of the compounds responsible for the bitterness and antidiabetic properties. The tissue-specific abundance of the transcripts reported here will form the basis for identifying tissue-specific promoters. The full-length genes, gene expression data, and SSR markers reported from this study are vital genomic resources for further research.