Transcriptional profiling reveals differentially expressed genes involved in lipid biosynthesis during cacao seed development

Theobroma cacao is a plant of economic value due to the use of its seed lipid for chocolate, confectionery, and cosmetic industries. The seed lipid contains a stable ratio of saturated and unsaturated fatty acids, which determines its unique melting temperature. However, little is known about the molecular mechanism determining the fatty acid ratio and lipid content in cacao. To gain insight into the unique properties of lipid synthesis in cacao, biochemical and transcriptomic approaches were used to compare the lipid accumulation between high and low lipid content cacao accessions. Lipid accumulation rates and lipid content were different between the two accessions. Moreover, differentially expressed genes were detected between high and low lipid content cacao accessions. The data allowed the identification of distinct candidate genes and furthered our understanding of lipid accumulation, potentially explaining the differences in lipid content between various cacao accessions. The results might be used to develop molecular tools and engineer alternative pathways for cacao breeding with improved lipid production potentials.

The fatty acid biosynthetic pathway and lipid accumulation in cacao has been poorly studied in previous investigations, despite the uniqueness of cocoa butter for making chocolate. In this study, to obtain an overall comprehensive characterization of fatty acids and lipid metabolism in the cacao seed, lipid temporal accumulation patterns, transcriptional analysis, and differential gene expression profiles were assessed. The unique characteristics of lipid metabolism were described, revealing insight into transcriptional coordination in the cacao seed. The results generated in the present study provide an important foundation to further explore the regulatory mechanism of cacao seed lipid accumulation.

Results
Seed development and lipid content. In this study, we first collected the seeds of 65 T. cacao accessions and analysed their lipid content, ranging from approximately 30-60% of dry weight (data not shown). Then, two different lipid content accessions, TAS42 and TAS57, were selected for further research. The pod and seed development processes in the high (TAS42) and low (TAS57) lipid content accessions were the same. The immature pods are green, and the colour of the seeds gradually turns from light purple to dark (Fig. 1a-p). Under field conditions, the pods of the two accessions completed their development and maturation in approximately 165 DAP. To assess the dynamic lipid accumulation patterns in developing seeds, we evaluated the TAS42 and TAS57 seed lipid contents at different developmental stages. Significant global and developmental stage-specific changes in fatty acid concentrations, saturation levels, and lipid accumulation were found (Fig. 1). The seed lipid was mainly composed of saturated fatty acids such as palmitic acid (C16:0) and stearic acid (C18:0) and unsaturated fatty acids such as oleic acid (C18:1) and was up to more than 90% after 126 DAP. Linoleic acid (C18:2) and alpha-linolenic acid (C18:3) decreased from 105 to 126 DAP and then stabilized along with pod maturation. However, other fatty acids such as palmitoleic acid (C16:1), arachidic acid (C20:0), and behenic acid (C22:0) stabilized at low content throughout pod development (Fig. 2). A  sequencing, over 13 million raw reads were obtained in each library (Table 1). After reads containing adapters, reads containing poly-N sequences, and low-quality reads were removed, the total clean reads per library ranged from 10.1 to 12.8 million. The total reads mapped to the cacao cv. B97-61/B2 genome according to the HISAT2 analysis ranged from 8.6 to 12.3 million, and more than 73% of the reads had perfect matches and unique matches to reference genes ( Table 1).

Analysis of differential gene expression.
To compare the changes in gene expression between different accessions and different stages of maturation, fragments per kilobase of transcript per million mapped reads (FPKM) was introduced to calculate the expression level of each gene. All uniquely mapped reads were used to calculate the FPKM values of the genes. FPKM ≥1 or <1 was used as the threshold to identify whether a gene was expressed or not (Table 2). Biological replicates were necessary for high-throughput sequencing to obtain reliable analysis results. Our experiments showed a close relationship for each pair of biological replicates ( Supplementary  Fig. S1). The final FPKM of one gene with replicates in the same condition was the average value for all replicate data. The differentially expressed genes (DEGs) were hierarchically clustered based on the log 10 FPKM of the eight treatments, showing the overall gene expression pattern. The red bands indicate high gene expression, and the blue bands indicate low gene expression ( Supplementary Fig. S2). The genes whose expression differed in the two samples were identified and filtered for fold-change ≥2.0 and diverge probability ≥0.8. The DEGs were compared Values are the means of three biological replicates. C12:0, lauric acid; C16:0, palmitic acid; C18:0, stearic acid; C18:1, oleic acid; C18:2, linoleic acid; C18:3, alpha-linolenic acid; C20:0, arachidic acid; C22:0, behenic acid. between different stages within an accession and between accessions within a specific stage. The number of DEGs among these comparisons varied; approximately 25-791 genes displayed significant changes in expression, with an average of 291 DEGs. The numbers of upregulated and downregulated genes are shown in Fig. 3.
A Venn diagram was used to obtain the overlapping DEGs in the comparison group to see which DEGs were induced transiently and which DEGs were induced over a relatively long period of time. In this experiment, a comparative analysis of DEGs in the four developmental stages identified 78, 64, 425, and 21 DEGs at 105, 126, 147 and 168 DAP, respectively (Fig. 4a). Six genes were induced at both 105 and 126 DAP, 16 genes were induced at both 126 and 147 DAP, and 2 genes was induced at both 147 and 168 DAP, indicating that these genes may play important roles in seed development. (Fig. 4a). As shown in the Venn diagram, both TAS42 and TAS57 showed differential gene expression across seed development (Fig. 4b,c).
To illustrate the DEGs detected in the different developmental stages of pods from the high and low lipid content accessions, Gene Ontology (GO) enrichment analysis was performed to classify the gene functions. The DEGs were clustered into three main GO classification categories: Biological Process, Cellular Component and Molecular Function ( Supplementary Fig. S3). The genes from the different expression clusters associated with the different functional categories clearly indicate the biological and molecular events involved in cacao seed maturation. In addition, ninety-five DEGs detected along seed development were associated with the cardinal enriched www.nature.com/scientificreports www.nature.com/scientificreports/ terms of fatty acid biosynthetic process, fatty acid metabolic process, lipid biosynthetic process, and carboxylic acid metabolic process (Table 3). Especially, genes differentially detected in TAS57 from 147 DAP to 168 DAP were largely enriched in the biological process domain, with the cardinal term metabolic process. In contrast, very few genes differentially detected in TAS42 from 147 DAP to 168 DAP were found to be enriched in a particular domain (Supplementary Table S1). Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis of the DEGs revealed the most representation effects on starch and sucrose, pyruvate, glycerolipid, glycerophospholipid, alpha-linolenic acid, arachidonic acid, amino acids, and secondary metabolite biosynthesis pathways. These annotations provide clues for investigating specific processes, especially those involved in starch and sucrose metabolism, fatty acid biosynthesis and glycerolipid metabolism. Fatty acid biosynthesis-related genes. Plant fatty acid biosynthesis occurs in plastids and is performed by a FAS dissociable complex of monofunctional enzymes. During cacao seed maturation, we found 34 DEGs involved in almost all parts of fatty acid biosynthesis and metabolism including ACACA (2 genes), Fab (6 genes),   Table S4). There were almost 10 unique DEGs involving unsaturated fatty acid metabolism in TAS57, including LOX, EPHX, methyltransferases and benzoate carboxyl methyltransferase (Supplementary Table S4).
TAG assembly-related genes. TAG biosynthesis occurs in the endoplasmic reticulum through an acyl-CoA-dependent pathway and acyl-CoA-independent pathway. In the acyl-CoA-dependent pathway, acyl-CoA is used as a substrate for the serial incorporation of three acyl groups into the glycerol backbone, dependent on enzymes such as glycerol-3-phosphate acyltransferase (GPAT), lysophosphatidic acid acyl transferase (LPAT), phosphatidic acid phosphatase (PAP), and diacylglycerol acyltransferase (DGAT). In the acyl-CoA-independent pathway, phospholipids are used as acyl donors and diacylglycerol (DAG) as an acceptor, and phospholipid/diacylglycerol acyltransferase (PDAT) is used for TAG production. In the present study, we found differential expression of 1 gene coding for GPAT, 2 for LPAT and 6 for PDAT in the seed, suggesting that the acyl-CoA-independent pathway may be the most important pathway for the synthesis of TAGs in T. cacao. Five PDAT genes were detected in both accessions examined, among which 2 genes (18609038 and 18598114) showed a significant increase in expression from 105 DAP to 126 DAP and 2 genes (18608303, 18611047) showed a significant increase in expression from 147 DAP to 168 DAP (Supplementary Tables S5, 6 and 7). Moreover, the expression level of late embryogenesis abundant (LEA), zinc finger protein (ZNF3) and methyltransferase genes significantly changed over the course of seed development, and these genes have been shown to participate in lipid body formation and lipid synthesis regulation 16 . We found 7 DEGs involved in glycerolipid and glycerophospholipid metabolism in the comparison between TAS42 and TAS57. The expression level of aldehyde dehydrogenase (ALDH) in TAS42 was higher than that in TAS57 along seed development. The expression levels of PDAT (18611047) and glycerol-3-phosphate dehydrogenase (GPDH) in TAS42 were lower than those in TAS57 at stage 105 DAP to 126 DAP. However, in stages 147 DAP and 168 DAP, the expression levels of PDAT (18611047) and GPDH in TAS42 were higher than those in TAS57 (  The results showed that, although there were fold changes in gene expression among different seed developmental stages, the expression patterns determined for all fifteen genes were consistent, confirming the trends observed in the RNA-Seq results (Fig. 5).

Discussion
The cacao, a worldwide cash crop for chocolate and cosmetic industries, has a high quality and unique physicochemical characteristics. The genome sequences of cacao have been completed, and high-throughput RNA-Seq is an effective method to obtain large amounts of transcriptome data and DEGs from different types. In this work, the kinetic patterns of lipid contents and fatty acid compositions were detected at different developmental stages  www.nature.com/scientificreports www.nature.com/scientificreports/ of cacao pods, and the optimal accessions for comparative deep transcriptomic analysis were determined. The transcriptomes of different lipid content cacao seeds were sequenced by Illumina technology, and the DEGs were functionally annotated. The DEGs involved in lipid accumulation of developing pods were determined by the NOISeq method, and the role and regulation of some key genes was analysed.
Lipid accumulates in cacao seeds at a slow rate in the first 4 months of pod development and then accelerates into a rapid, linear rate after the seeds solidify until near the beginning of maturation, when lipid build-up proceeds at a greatly reduced rate 8 . The most noticeable changes concerned the percentage of linoleic acid (C18:2) between 105 and 126 DAP, which dropped from 16% to 3%. After 126 DAP, the amount of lipid in the seeds continued to increase. The fatty acid composition did not exhibit a significant change, and, importantly, palmitic acid (C16:0), stearic acid (C18:0) and oleic acid (C18:1) were the main components, the relative ratio of which was maintained during the examined developmental period (Fig. 2). These trends in fatty acid storage and accumulation are very different from those observed in other tropical wood lipid plants 15,16 . For example, in coconut, medium-chain fatty acids (C8-C14) predominate the lipid components 16 . These differences indicate that cacao seeds possess species-specific mechanisms for lipid accumulation.
In the last decade, the details of fatty acid and lipid biosynthesis derived from plant seeds have generally suggested that the substrate specificity of FatA and FatB acyl-ACP thioesterases, to a large extent, determine the fatty acid composition of storage lipids 17,18 . FatA is generally accepted to act on long-chain acyl-ACPs, preferentially on oleoyl-ACP, while FatB preferably hydrolyses acyl-ACPs with saturated fatty acyl chains 19,20 . In the present study, one FatA and one FatB gene were identified in the transcriptomes of the developing cacao seed. Gene duplication and neofunctionalization are common features among species that play a key role in driving evolutionary novelty for organisms to differentiate new organs or synthesize various storage substances 21,22 . FatB genes mainly act to form medium-chain fatty acids of lauric acid (C12:0) and myristic acid (C14:0) in oil palm and coconut 15,16 . As in tropical crops, the FatB isoform leading to long-chain fatty acid accumulation of palmitic acid (C16:0) and stearic acid (C18:0) was expressed in cacao seeds. These findings suggest that, in cacao, to lengthen the acyl chain, the cleavage of acyl-ACP requires the fatty acid-specific FatB gene. Previous investigations have indicated that synthesis of fatty acids and supply of pyruvate rather than acyl assembly into TAG might be the major factors that control lipid storage [23][24][25] . Our transcriptomic data found DEGs involved in almost all parts of fatty acid biosynthesis and metabolism including ACACA, Fab, FAT, ADH, OPR, LOX, EPHX, PECR, KCS, and ADH (Table 3). However, the expression patterns of these DEGs were significantly different; the sharp change in the expression   www.nature.com/scientificreports www.nature.com/scientificreports/ level in TAS57 was delayed from that observed in TAS42. The WRINKLED1 (WRI1) transcription factor acts as a coregulator of genes involved in fatty acid biosynthesis, especially at early seed stages, and WRI1 expression is vital in the carbon flux necessary for the synthesis of fatty acids in seeds [26][27][28][29] . We also found that WRI1 showed the delayed sharp change in the expression level in TAS57 than TAS42. (Supplementary Table S6, 7). During the last seed maturation stage, the lipid content of TAS57 decreased from 39.2% to 36.7%. Notably, lipid degradation plays an indispensable role in whole plant growth and the development process [30][31][32] . These results, combined with the results from other crops, suggest that fatty acid biosynthesis hysteresis and lipid degradation are the most important factors leading to lipid content reduction.
Generally, fatty acid synthesis may be accomplished by releasing the fatty acid from the acyl-ACP molecule by acyl-ACP thioesterases (FATA and FATB). The unsaturated bonds on the monounsaturated fatty acids at specifically defined positions are catalysed by SAD to form unsaturated fatty acids 24 . Interestingly, oleic acid (C18:1) and linoleic acid (C18:2) are the main unsaturated fatty acids of cacao seeds 10,33 . The cacao genome has been shown to contain eight SAD genes 1,4 . Among the eight SAD genes, TcSAD1 and TcSAD7 are the primary ones expressed in developing seeds, with TcSAD7 being more highly expressed over all stages except the maturation stage, implying that the activities of both TcSAD1 and TcSAD7 are involved in the synthesis and accumulation of C18:1 in cacao 4 . In our study, TcSAD7 (18592535) was detected in both TAS42 and TAS57 and exhibited a downward expression trend along seed development. At the onset of lipid accumulation, the total percentage of mainly unsaturated fatty www.nature.com/scientificreports www.nature.com/scientificreports/ acids (C18:1 + C18:2 + C18:3) was highest. The major fatty acid composition shift occurred between 105 and 126 DAP when the transiently high proportion of C18:2 dropped rapidly to approximately 3%. C18:2 synthesis catalysed by C18:1 desaturase ceased during this period, as indicated by the increase in dry seed weight with a decrease in C18:2 content 10 . Although the biochemical mechanism has not been clarified, C18:2 content may limit the activity potential of TcSAD and activate the termination of C18:2 synthesis, and, during the following stages, the ratio of unsaturated fatty acids and saturated fatty acids was maintained. The ratio of mainly fatty acids should be investigated in future studies.

Conclusions
Illumina sequencing was performed on messenger RNAs (mRNAs) obtained from developing cacao seeds. The comparison of high and low lipid content cacao accessions by RNA-Seq and qRT-PCR of genes involved in fatty acid synthesis, TAG accumulation and metabolism led to the hypothesis that fatty acid biosynthesis hysteresis and lipid degradation are the primary factors leading to lipid content reduction. An isoform of the Fab2 gene TcSAD7 (18592535) was expressed at high levels at the onset of massive lipid accumulation in the seed, which may be responsible for the stable ratio of unsaturated fatty acids. TAG assembly enzymes (PDAT and GPDH) may result in the considerable differences in lipid content between TAS42 and TAS57. Future research will aim to verify the function of key candidate genes using transgenic approaches in model organisms. Furthermore, future studies must confirm the differentiated lipid hypothesis, for example, by assessing the metabolomics of the seed organ at different time points. The results may help to understand and manipulate the lipid contents of cacao by means of genetic engineering.

Materials and Methods
Plant material. The high lipid content cacao accession T. cacao (TAS42) and low lipid content cacao accession T. cacao (TAS57) were used in this study. The plants were grown in the Spice and Beverage Research Institute of Chinese Academy of Tropical Agricultural Science, Wanning, Hainan, China. According to the data from our laboratory, the pod development periods of the two cultivars are similar. During flowering period, self-bred was carried out for each cultivar. Seed samples were collected at four different developmental stages from self-bred pods: 105 days after pollination (DAP), 126 DAP, 147 DAP and 168 DAP. After collection, samples were flash-frozen in liquid nitrogen and stored at −80 °C before use. Owing to pod setting rate, each cacao cultivar successfully produced several self-bred pods, only satisfied two replicates of each seed developmental stages. In the following investigation, two replicates were used in the DEG analysis.
Lipid content and fatty acid composition determination. The total lipid content in the cacao accessions was assayed according to the method by Bucheli et al. with minor modification 10 . Briefly, freeze-dried cacao seed powder (2-5 g) was poured into a round-bottom flask, followed by the addition of 60 mL chloroform-methanol (1:1, v/v) and homogenization for 1.5-2.0 min. Successively, the mixture was reflux-heated in a water bath at 60 °C for 1 h and then cooled to room temperature. After being filtered through a fluted filter paper, the resultant filtrate was collected and dried under reduced pressure using a vacuum evaporator. Then, the lipid obtained was resolved in 25 mL petroleum ether and mixed with 5 g sodium sulfate anhydrous to further remove the residual solvent. Subsequently, the mixture was separated by centrifugation at 3000 g for 5 min, and the supernatant was collected, evaporated and dried at 105 °C in succession. The residue was then weighed and stored at −20 °C. Approximately 10 mg of extracted lipid sample was dissolved in 2 mL of N-hexane with 2.5 mg/ mL of methyl nonadecanoate in a tube, and 0.1 mL of methanolic potassium hydroxide solution (2 mol/L) was added. The tube was capped tightly and shaken vigorously for approximately 2 min. The solution was then centrifuged at 3000 g for 5 min, and 1 mL of N-hexane upper phase was transferred into a 2 mL autosampler vial, and 5 μL was injected onto a Chrompack CP-Sil88 column (100 m × 0.25 mm i.d., 0.25 μm run on a film thickness; Varian Inc., USA) for subsequent gas chromatography analysis.

RNA isolation and quantification.
A total of 16 samples (two replicates of two differential lipid content accessions at four time points) were prepared for RNA extraction. Total RNA was extracted individually using TRIzol reagent (Life Technologies). High-quality RNA was obtained through twice chloroform/isoamyl alcohol (24:1) purification and then mixed with approximately the same quantity. RNA purity was checked using a NanoPhotometer ® spectrophotometer (IMPLEN, CA, USA).

Digital gene expression (DGE) library preparation and sequencing.
A total of 5 μg of RNA per sample was used as the input material for the RNA sample preparations. Sequencing libraries were generated using the NEBNext ® Ultra TM RNA Library Prep Kit for Illumina ® (NEB,USA) following the manufacturer's recommendations. In brief, mRNA was purified from the total RNA using oligo (dT) magnetic beads. After fragmentation with divalent cations in NEBNext First Strand Synthesis Reaction Buffer (5×), first-strand cDNA was synthesized using random hexamer primers and M-MuLV Reverse Transcriptase (RNaseH − ). Then, second-strand cDNA synthesis was performed using DNA polymerase I and RNaseH. To select cDNA fragments 150-200 bp in length, the library fragments were purified using the AMPure XP system (Beckman Coulter, Beverly, USA). PCR was then performed with Phusion High-Fidelity DNA polymerase, Universal PCR primers and an Index (X) Primer. Finally, the PCR products were purified (AMPure XP system), and the library quality was assessed on an Agilent Bioanalyzer 2100 system and ABI StepOnePlus Real-Time PCR System. The library was sequenced on the Illumina HiSeq. 2000 platform, and 100-bp paired-end reads were generated. www.nature.com/scientificreports www.nature.com/scientificreports/ data had adaptor sequences and a few low-quality sequences, along with several types of impurities. Clean data were obtained by removing low-quality reads containing adapters or poly-N sequences from the raw data. All downstream analyses were based on clean, high-quality data.
The reference cacao genome (B97-61/B2) and gene model annotation files were downloaded from the genome website directly. All clean reads were aligned to the reference genome using the HISAT2-2.0.4 aligner 34 . To estimate the expression levels of each gene for the four pod developmental stages, the read numbers mapped to each gene were counted using HTSeq v0.5.4p3. And then normalized expression values of genes were estimated as the fragments per kilobase of transcript per million mapped reads (FPKM) 35 by RSEM (v1.2.9) 36 . The differentially expressed genes (DEGs) were required to achieve the probability (P) thresholds of differential expression of 0.8 and a minimum two-fold change in expression between samples, as determined by NOISeq. 37 . A higher probability resulted in a remarkable change in expression between the two samples.
Gene Ontology (GO) functional enrichment analysis was implemented in the GOseq R package 38 , and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis was performed to examine the high-level functions and utilities of the DEGs 39 . In all tests, P values were calculated using Benjamini-corrected modified Fisher's exact test, and ≤0.05 was considered the threshold of significance.
qRT-PCR validation. To verify the differential expression detected by Illumina RNA-Seq, qRT-PCR was performed on the same samples that had been used previously. A set of fifteen genes was chosen at random, the expression of each of which was checked at different time points and compared with their observed FPKM.
qRT-PCR was performed by using an ABI QuantStudio ™ 6 Flex Real-Time thermocycler (Applied Biosystems, Foster City, CA, USA) with the SYBR Green Real-Time PCR Master Mix (TOYOBO, Osaka, Japan) following the manufacturer's instructions. The forward and reverse primers used for qRT-PCR are listed in Supplementary  Table S8. The thermal cycling conditions were as follows: 2 min at 95 °C, followed by 45 cycles of denaturation at 95 °C for 10 s, annealing at 60 °C for 15 s, and extension at 72 °C for 30 s. Optical data were acquired following the extension step, and the PCRs were subject to melting curve analysis beginning at 60 °C through 95 °C at 0.1 °C s −1 . Actin was chosen as the reference gene based on the study by Pinheirob et al. 40 . The data are presented as an average ± SD of three independently produced RT preparations used for PCR runs, each having at least three replicates. The relative expression levels were calculated using the delta-delta Ct method 41 .