Background & Summary

Cacao (Theobroma cacao L.) is an evergreen tree native to the Amazon rainforests1,2. Cacao is a diploid (2n = 2x = 20) species in the Malvaceae family. Among the 22 known species in the genus Theobroma, T. cacao is the only one grown commercially for the production of seeds (beans), which serve as the raw material for chocolate production. Cacao farming predominantly takes place on small-scale farms located in tropical developing nations. It is estimated that five to six million small holder farmers are engaged in cacao farming worldwide, supporting the livelihoods of approximately 40 to 50 million people3. With a global production of 4.75 million metric tons of dry cacao beans (The International Cocoa Organization, 2020), the global chocolate industry had a retail market value of USD 106.2 billion in 2017 and is expected to grow to USD 189.9 billion by 20264.

The domestication of cacao is believed to have occurred around 5,300 years ago5. It is proposed that domestication took place at multiple sites by different indigenous groups in tropical America5,6,7,8. The fermentation of sweet cacao for the production of an alcoholic beverage preceded the grinding of the seeds and may have played a crucial role in the initial development of cacao farming5,7,8. While cacao was initially classified into three genetic groups (Criollo, Forastero, and Trinitario), subsequent research suggests that cacao germplasms can be divided into ten distinct genetic clusters, including Amelonado, Contamana, Criollo, Curaray, Guianna, Iquitos, Marañon, Nacional, Nanay, and Purús9. Recent collecting expeditions in the Amazon have identified additional germplasm groups, expanding our understanding of cacao’s genetic diversity10,11. These natural populations represent the primary gene pool of cacao in the Amazon basin, spanning from French Guiana to Bolivia2,12,13. Consequently, cacao is among the tropical plants with natural populations distributed throughout the entire Amazon region. These populations exhibit differentiation and adaptation to diverse environmental conditions, making cacao an ideal species for studying the spatial distribution of genetic diversity in tree species within the Amazon.

The hypothesis of Pleistocene refugia has been proposed to explain the significant intra-specific divergence of T. cacao in the Amazon13,14. According to this theory, the glacial cycles of the Pleistocene period caused forest cover to contract to limited areas or refuges, leading to the isolation of populations within climatically suitable regions and facilitating allopatric differentiation15. However, research by Motamayor et al. (2008) has shown that the geographical distribution of cacao genetic clusters does not correspond to the proposed refuge centers for other species in the region16,17. Instead, the differentiation patterns observed in the studied populations appear to align with potential dispersal barriers created by ancient ridges or palaeoarches9. Apart from the conflicting hypotheses explaining the geographical distribution of cacao populations, the timing of intra-specific divergence has also been a subject of debate. Richardson et al.18 analyzed dated phylogenies of chloroplast and nuclear DNA sequences and demonstrated that cacao diverged from its most recent common ancestor approximately 9.9 million years ago during the mid-to-late Miocene18. This suggests that cacao had ample time to generate genetic diversity within the species. In the same study, two individuals of T. cacao formed a monophyletic group with stem and crown node ages estimated at 6.5 million years ago and 1.2 million years ago, respectively18. Conversely, Thomas et al.14 proposed that the last glaciation period (22,000 to 13,000 years BP) had the most significant pre-human impact on the present distribution and diversity of cacao14.

The genome size of cacao is estimated to be between 430 and 445 Mbp based on reported reference genome assemblies of the two domesticated populations, namely ‘B97-61/B2’ from the Criollo group and ‘Matina 1–6’ from the Amelonado19,20,21. These reference genomes have provided valuable resources for recent studies on the origin, evolution, and adaptation of cacao22,23,24. They have also facilitated the development of genomic tools for breeding new cacao varieties with improved yield, tolerance to biotic and abiotic stress, and desired quality attributes25,26,27,28,29.

However, in recent years, increased plant genome projects have revealed significant genomic variation among related individuals. This high degree of genomic variation highlights the importance of de novo sequencing of multiple reference genomes to capture the genome diversity present within a given plant species30,31. Furthermore, the two cacao varieties with published reference genomes (Criollo and Amelonado) are self-compatible, and represent domesticated varieties from Mesoamerica and East Amazonia, respectively, and are far removed from the putative origin and center of genetic diversity for cacao. Recently, thirty-one high-quality genome assemblies were reported from four genetic populations (Iquitos, Nanay, Marañon, and Guiana)23. These genomes were de novo assembled into contigs using short reads and then scaffolded using a reference-based approach (with the Matina 1–6 reference genome), resulting in assembly sizes smaller (ranging from 30 to 70 Mb) than the estimated k-mer sizes, suggesting sequence data was lost due to the lack of reference-free scaffolding.

In this study, we present three high-quality de novo genome assemblies and their annotations for wild cacao populations originating from the Upper Amazon region (Contamana, Iquitos, Nanay, Fig. 1). We produced completely de novo chromosome-scale genome assemblies using long reads and Hi-C technology for comparative genomics research in cacao. By analyzing the genomic distance based on single-copy orthologous genes, we provide new insights into the timing of the intra-specific divergence of cacao. This study provides valuable genomic resources for cacao, serving as essential references for future research on the conservation and utilization of cacao genetic diversity.

Fig. 1
figure 1

Geographic origins of the five cacao populations. The source populations of the three wild accessions sequenced in this paper (Nanay, Iquitos, and Contamana) are labeled in pink color.

Materials and Methods

Plant material and long read sequencing

Leaf samples of T. cacao clones “SCA 6” (Contamana population), “IMC 67” (Iquitos population), and “POUND 7” (Nanay population) were collected from the greenhouse of USDA-ARS Sustainable Perennial Crops Laboratory (SPCL), Beltsville, Maryland. These three wild cacao clones were originally collected from the Peruvian Amazon (indicated in the map in Fig. 1), the putative center of origin of T. cacao. DNA samples were sequenced on PacBio Sequel II 8 M SMRT cells generating 42.6, 38.1, 40.7 gigabases of data for Contamana, Iquitos, and Nanay, respectively, corresponding to more than 100X coverage of each genome. The PacBio HiFi reads were used as an input to Hifiasm1 v0.15.4-r34732,33 with default parameters. BLAST results of the Hifiasm output assembly against the nt database were used as input for blobtools2 v1.1.134 and scaffolds identified as possible contamination were removed from the assembly. Finally, purge_dups v1.2.535 was used to remove haplotigs and contig overlaps.

Chromatin conformation capture and sequencing

For each of the three cacao varieties, the Omni-C technique from Dovetail Genomics was employed to perform chromatin conformation capture36. In the process, chromatin was initially fixed using formaldehyde within the cell nucleus, followed by extraction. The fixed chromatin underwent DNAse I digestion, after which the ends of the chromatin were repaired and connected to a biotinylated bridge adapter. This was succeeded by a proximity ligation of ends containing the adapter. Post-proximity ligation, the crosslinks were undone, and the DNA was subsequently purified. The purified DNA was then subjected to a biotin removal process, specifically targeting biotin not incorporated within the ligated fragments. Sequencing libraries were created using NEBNext Ultra enzymes from New England Biolabs and adapters compatible with Illumina. Fragments containing biotin were separated using streptavidin beads, followed by PCR amplification of each library. The sequencing of the library was conducted on an Illumina HiSeqX system. Finally, only reads with a mapping quality (MQ) greater than 50 were selected for use in scaffolding.

de novo scaffolding the assembly with HiRise

The de novo genome assembly from HiFi long reads and Dovetail OmniC library reads were used as input data for HiRise v2.0.5, a software pipeline designed specifically for using proximity ligation data to scaffold genome assemblies37. Dovetail OmniC library sequences were aligned to the draft input assembly using bwa v0.7.1138. The separations of Dovetail OmniC read pairs mapped within draft scaffolds were analyzed by HiRise to produce a likelihood model for genomic distance between read pairs, and the model was used to identify and break putative misjoins, to score prospective joins, and make joins above a threshold.

Repeat region annotation and comparisons

Repeat regions were identified and annotated with a combination of homology and ab initio methods using RepeatModeler v2.0.239, RepeatMasker v4.1.340, and EDTA v2.0.041 on the five cacao genomes. With RepeatMasker, repeats are assigned to different transposable element (TE) families and classes.

The repeat content rates were compared among the five cacao genomes and stratified by chromosomes and TE classes/families (Fig. 2). Overall, the genomes of the three wild cacao varieties have higher repeat contents (61.5%, 61.2%, and 59.7% of the total genome lengths for Contamana, Iquitos, and Nanay, respectively, Fig. 2a) than the genomes of the two domesticated varieties (52.3% and 56.3% for Criollo and Matina, respectively). This may be because the domestication through artificial selection has purged the repetitive elements in the two genomes. These large differences of repeat contents among the five genomes also suggest that they have diverged early in evolution (species tree in Fig. 2a,see details later).

Fig. 2
figure 2

Repeat content comparisons among the five cacao genomes. (a) On the left is the species tree inferred by a super-alignment of 6,630 single-copy orthologous protein sequences from OrthoFinder. Bootstrap support is at 100 for all nodes. In addition to the five cacao genomes, Herrania umbratica is included as an outgroup. MCMCtree was used to infer the species divergence time. On the right is the bar plot of repeat content comparison of major transposable element (TE) classes. (b) Repeat content percentages in the 10 chromosomes of the five genomes. Repeats in all genomes were inferred using the same methods. (c) Comparison of repeat lengths (in kilobase pairs) at the TE family level among the five genomes. TE families are sorted according to their classes. (d) Kernel density (y axis) of sequence divergence rate (x axis) comparisons of the four most abundant TE families (all DNA families together as the DNA class). The Div rate is calculated as Kimura distance between each cacao repeat and their best Arabidopsis repeat match in the repeat library of RepeatMasker. Lower Div rate means the repeats are more similar between cacao and Arabidopsis or have diverged less after the two species separated.

The repeat content profile also varies significantly among chromosomes of the three new genomes and the two reference genomes (Fig. 2b). In all the five genomes, Chr7 and Chr10 tend to have higher repeat contents than other chromosomes; the two chromosomes also vary the most among the five genomes along with Chr6 and Chr8 (>12% difference between the lowest Criollo and the highest Contamana). Chr1 and Chr9 have the lowest repeat contents in the five genomes. These inter-chromosomal variations in terms of repeat contents suggest the evolutionary selection pressure varies on different chromosomes influencing TE repeat content.

Breaking down the repeat contents into TE classes revealed that the three new genomes have higher percentages of long terminal retroelement (LTR) Gypsy repeats (dark green bars in Fig. 2a) as well unclassified TE repeats (dark blue bars) than Criollo and Matina. Further looking at the TE families (Fig. 2c) under each class identified some genome-specific TE families, such as hAT, hAT-Charlie and SINE in Nanay, Ginger-1 in Iquitos, and Maverick in Contamana. There are also TE families significantly expanded in some genomes, such as CMC-EnSpm, L1-Tx1 and Penelope in Contamana, ERV1 and ERVK in Criollo and Matina, Helitron-2 in Iquitos and Contamana.

To study the divergence rate of different TEs, we used the RepeatMasker utility scripts to calculate the kimura distances between the repeats of each TE family in Cacao genomes and the Arabidopsis thaliana reference repeat sequences. The Arabidopsis repeat sequences are provided by RepeatMasker and chosen at the command line aligning cacao repeats with them. Next we used parseRM.pl42 (https://github.com/4ureliek/Parsing-RepeatMasker-Outputs) to parse the repeat sequence alignments and get the bins of the Kimura distances. These bins contain the counts of the repeat elements of each TE family with the Kimura distance in the range of the bin (e.g., [0,0.1]). We compared the divergence rates of different TE families (between cacao and A. thaliana) by plotting their Kimura distance distributions. All analyses were performed with custom python scripts.

Kimura distance for each repeat pair indicates the divergence rate after Criollo and Arabidopsis separated from each other. Taking all the repeat pairs of each TE class and their Kimura distances to make plots (Fig. 2d), we found that the LTR (Copia + Gypsy) repeat distributions had a peak at a lower divergence rate than the peak of DNA and LINE\L1 repeat distributions. This suggests that LTR repeats have evolved slower than other TE elements since the divergence of cacao and Arabidopsis. Comparing the five cacao genomes also showed that the most abundant repeat family gypsy in Criollo had evolved much faster that in the other cacao genomes (black curve in Fig. 2d with a kimura peak of a higher divergence rate). Compared to the other cacao genomes, Nanay had a peak of lower divergence rate in both Copia and Gypsy plots (blue curve in Fig. 2d). Overall, this means the LTR repeats in the domesticated cacao especially Criollo evolved faster.

Gene model prediction

To predict protein coding gene models from the three new cacao genomes, MAKER v3.01.0443 was utilized to predict protein-coding genes through a combination of ab initio and homology-based techniques. The software allows the use of transcriptome and protein evidence for homology-based gene discovery. In this study, T. cacao RNA-seq raw reads were retrieved from GenBank (PRJNA785999, PRJNA471714) and trimmed using Trim_galore v0.6.844. The clean reads were then assembled into transcripts utilizing Trinity v2.15.045. Furthermore, all protein sequences of T. cacao were downloaded from Phytozome46 and UniProt47, using cacao as a search term in the two databases. To facilitate ab initio gene prediction, MAKER merged SNAP v0.0.048 and Augustus v3.4.049 outcomes, and the process was iterated three times to enhance accuracy. During the first round, Trinity transcripts, A. thaliana protein sequences, and UniProt protein sequences were fed into MAKER. The resulting output was then used to train Augustus models with BUSCOs v450 assistance and SNAP models using gene sequences from maker2gff. During the second round, MAKER employed the Augustus and SNAP models to predict ab initio genes, followed by another round of model training and running using the output from the second round. The third MAKER runs outcome was used as the input to PASA v2.5.2 and Evidence modeler v2.0.051,52. The final gene models were annotated for protein functions with eggNOG-mapper v2.1.653.

For the two reference genomes, we downloaded their gene models from GenBank: Criollo (GCF_000208745.1) and Matina 1–6 (GCF_000403535.1). In addition, we have also run MAKER on the two reference genomes using the same pipeline described above, in order to verify the disease resistance gene count difference observed in the five genomes (see below).

The BUSCO scores with the gene model input indicated that the gene model predictions were of high quality, ranging from 89.2% to 95.1% for our three cacao genomes (Table 1). Not surprisingly, they are lower than the two reference genomes, whose gene models have undergone continuous improvement in the last decade19,20,21. However, the Matina genome had a substantially higher number of predicted gene models (27,329) than the other four genomes. Moreover, the Matina genome displayed structural differences in the number of single exon genes, the number of overlapping genes, and the mean gene length compared to the other four genomes (Table 1).

Table 1 Gene model prediction statistics.

Intra-species divergence time estimation

We combined protein sequences of the five cacao genomes and the Herrania umbratica (GCF_002168275.1, as an outgroup) to define orthologous groups (orthogroups). Genome annotation files were processed to only keep the longest isoform protein of each gene. Proteins of the six genomes were combined as input to OrthoFinder v2.5.454, which generates orthogroups with the alignment tool MMseqs. 2 v12-113e355.

A total 6,630 orthogroups were identified by OrthoFinder. All of the single-copy orthogroups (containing a single copy of gene from each of the six genomes) were aligned with MUSCLE v556. The alignments of the 6,630 single-copy orthogroups were concatenated into one super alignment. A phylogenetic tree was built using RAxMLv857 to represent the species tree with 100 bootstraps and the evolutionary model -m PROTGAMMAJTT. The species tree (Fig. 2a) with H. umbratica as the outgroup shows that the Mesoamerican Criollo separated from the other cultivars the earliest, followed by the East Amazonian Amelonado (Matina 1–6), then the three wild cultivars. The alignment of single copy orthologs (amino acids) that was used to infer the species tree was converted to codon alignment using the nucleotide coding sequences of the single copy orthologs. With the concatenated codon alignments (one sequence for each of the six genomes) inferred by 6,630 single copy orthogroups, we estimated the divergence time for the five cacao genomes plus H. umbratica (the closest to Cacao among the 12 species) by MCMCtree58 with one calibration: the divergence time of H. umbratica and T. cacao = 9~12 MYA. The calibration time was obtained from the TimeTree database59. We performed 10 MCMCtree runs to ensure the confidence of the results. In all 10 runs, we used a high sampling rate of 10,000 and excluded the missing columns using the option from the.ctl file of MCMCtree set to “1”. From all 10 runs, the results were almost identical with the results shown in Fig. 2a.

The cacao speciation time from Herrania umbratica at 10 MYA inferred here (Fig. 2a) is in line with the literature18,20,22. Our results showed that the population divergence of the five cacao genetic clusters occurred during the Pleistocene epoch (within 2 MYA). Criollo population emerged at ~1.83 MYA, Matina population at ~1.34 MY, Nanay population at ~0.91 MY and Contamana and Iquitos populations at ~0.69 MY (Fig. 2a). This finding fits the established theory of the Neogene and Pleistocene origin of many neotropical species. It rejects the hypothesis that cacao population differentiation was caused by the Last Glacial Maximum (LGM) induced refugia (Thomas et al.14), which happened 24,000 ~15,000 years ago. Accurate estimation of divergence time is essential for understanding the evolutionary history of cacao and provides a framework for making future predictions about the effects of environmental change and human activities on its populations.

Data Records

The raw PacBio HiFi reads (FastQ format) are available in the NCBI SRA database under the project number PRJNA982528 (Nanay: SRR2525651260, Contamana: SRR2525651061, Iquitos: SRR2525651162), so are the raw Omni-C short reads in FastQ format (Nanay: SRR2846438463, Iquitos: SRR2846438564, Contamana: SRR2846420165). The genome assemblies (Fasta format of DNA and protein sequences) and annotations (GFF format) are available at FigShare66 and https://bcb.unl.edu/USDA_genomes_CACAO/.

In addition, the transposable element and repeat annotation (plain text format from RepeatMasker), the assemblies (Fasta format) of the three cacao genomes and gene model annotations (GFF format), the protein function annotation (TSV format from eggNOG-mapper), and the structural variation were deposited in FigShare66 with the DOI number https://doi.org/10.6084/m9.figshare.25066010.v1. The three new cacao genome assemblies can be also found in GenBank: Iquitos GCA_958328385.167, Nanay GCA_958329735.168, Contamana GCA_958329045.169.

Technical Validation

The HiRise linkage density plots of the three new cacao genomes (Fig. 3) revealed 10 chromosomes, the same number as the two reference genomes (Criollo and Matina). The final assemblies had N50 values of 39.46, 39.49, and 34.43 Mbp for Contamana, Iquitos, and Nanay, respectively (Table 2). Compared to the Criollo and Matina reference genomes, Contamana and Iquitos genome assemblies showed better quality with higher N50, L50, N90, and L90 values (Table 2). The genome BUSCO50 scores (using eukaryote_odb10) were also slightly better in Nanay, Iquitos, and Contamana, than in the two reference genomes (Table 2).

Fig. 3
figure 3

HiRise scaffolding linkage density histograms of the three cacao genome assemblies. The two axes are the positions (in base pairs) of paired Omni-C reads mapped in the genome assembly. The grids separate the major linkage groups corresponding to the 10 pseudo-chromosomes. The gray areas contain scaffolds that are not placed in the 10 pseudo-chromosomes.

Table 2 Assembly statistics of the three new cacao genomes and the two reference genomes.

In addition, Merqury v1.4.1 was run on the DNA reads and genome assemblies of the five cacao cultivars to perform k-mer-based analyses70. K-mers shared by the sequencing reads and the genome assembly can be used to calculate the k-mer completeness (recovery rate). K-mers uniquely found in the genome assembly and absent in the sequencing reads can be considered as assembly consensus errors and used to calculate the assembly base-pair quality value (QV). To run Merqury, we used HiFi reads of Nanay, Iquitos, and Contamana that we sequenced (see SRA IDs above). We downloaded Illumina NovaSeq. 6000 reads (SRR21562109) for Criollo, and 454 GS FLX + reads (SRR866472, SRR866474, SRR866481, SRR866483, SRR866484, SRR866485, SRR866487, SRR866488) for Matina. These reads were originally used to build the Criollo and Matina reference genomes.

Comparing the five genomes, Criollo has the highest k-mer completeness at 92.0%, while Matina has the lowest at 74.5% (Table 2). The three new genomes have k-mer completeness (82.9~87.6%) better than Matina but lower than Criollo. For assembly QVs, all the three new genomes have much higher values (53.0~61.6) than Criollo and Matina. Contamana stands out with a QV at 61.0 and a completeness score at 86.4%. Criollo has a much lower QV at 38.8. Matina has QV at 32.4 being the lowest among the five genomes. Lastly, the HiFi reads were quality assessed to obtain the average Phred scores using FastQC v0.1244, which confirms that Nanay reads exhibit a lower Phred score (average score of 60.5) compared to Iquitos (69) and Contamana (71.1). However, in general the average quality of all HiFi reads is outstanding. This supports orthogonally the initial k-mer analysis findings.

These k-mer-based assessments underscore the high-quality nature of the three new cacao genome assemblies, which are generally better than the two reference genomes. Despite the inherent challenges in genome assembly, these QV scores and completeness percentages highlight their reliability in genomic data analysis.

Nanay, compared to the other genomes, has a lower genome assembly quality (many more scaffolds, much lower N50 and N90, Table 2). Its much larger total genome length is probably due to the much larger number of unplaced scaffolds in the chromosomes (Fig. 3), although its BUSCO score, QV, and k-mer completeness are comparable to the other genomes. If only consider the ten pseudo-chromosomes, the total genome length was 352.06 Mbp for Contamana, 351.84 Mbp for Iquitos, and 338.4 Mbp for Nanay. In contrast, the total chromosome length of Criollo and Matina was 314.18 Mbp and 330 Mbp, respectively.

In summary, the three new wild cacao genomes, like other recent cacao genome sequencing studies23,71, will further our understanding of cacao’s genetic diversity and evolution.