Introduction

The Amazon rainforest is the largest and most biodiverse in the world1,2,3, home to a large number of plant species4, providing globally important ecosystem services for the stabilization of climate and carbon fixation5,6. Outstanding functional and taxonomic biodiversity of Amazonia is allied to the presence of indigenous communities who have managed plants in ways that increased agrobiodiversity over millennial timescales7. While more than 30 million people living in rural or urban settlements use or at least are familiar with many elements of the Amazonian flora, circa 1.5 million indigenous people from more than 400 ethnic groups rely on the forest for their livelihoods8. This number represents only a small fraction of the estimated human population before the sixteenth century, calculated to have been up to 10 million before the arrival of Europeans9.

Amazonia is also the cradle of several plant species that have been used or domesticated over the past 12,000 years10. Currently, the food economy of Amazonia is one of the major and most attractive activities for the region11, and dozens of plant species are believed to have gone through some degree of domestication12,13,14. Some species are found in the wild and have been slightly modified via artificial selection, or are eaten in natura and planted elsewhere, such as the açaí palm (Euterpe oleracea, Arecaceae)15, guaraná (Paullinia cupana, Sapindaceae)16 and cacao (Theobroma cacao, Malvaceae)17,18. Other species were subjected to a strong domestication process, being selected from a wild relative with desirable traits, e.g., larger fruits with more pulp, or more seeds per pod. This is the case for pineapple (Ananas comosus, Bromeliaceae), with the selection of larger individual fruits19; cassava (Manihot esculenta, Euphorbiaceae), selected by indigenous people for higher toxicity, which increases protection against herbivory, and for greater starch storage20; and the peach palm (Bactris gasipaes, Arecaceae), with the selection of varieties with varying starch and oil content used for food21.

Despite the great biodiversity of Amazonia, many species remain overlooked. An interesting but relatively unexplored example is cupuaçu (Theobroma grandiflorum, Malvaceae), a fruit tree crop related to cacao found growing in the Amazon Basin, mostly in Brazil (Fig. 1)22,23. Cupuaçu seed-pulp is used to produce a range of desserts24, and the seeds can also be used to produce cupulate, a candy equivalent to chocolate but with a different taste. Although trade and cultivation of cupuaçu are currently associated with local community-based production, its economic importance has grown in recent decades. Estimates by Brazilian authorities show that production reached over 21,000 tons in 2017, generating an income of almost 55 million Brazilian reais (over 11 million US dollars) for that year25. In fact, control of cupuaçu-derived products resulted in an international dispute in the early 2000s, involving the improper use of the name “cupuaçu” as a trademark by a Japanese company26.

Fig. 1: Presenting cupuaçu and cupuí.

Overview of the major collections of T. grandiflorum (cupuaçu) and its closest wild relative, T. subincanum (cupuí), in the Amazon Basin (Supplementary Methods, Supplementary Fig. 1, and Supplementary Data 1). Cupuí is characterized by smaller fruits, typically measuring 7–15 cm in length and 5–8 cm in width, with reddish-purpureus structures in the flowers. In contrast, cupuaçu produces larger fruits, ranging from 10–25 cm (or even larger) in size, accompanied by yellowish-green flower structures. The pulp of cupuaçu fruits is highly valued and widely used in the preparation of a variety of desserts. The collections were conducted at multiple location sites, including Xapuri-Acre (ACRE), Balbina-Amazonas (BALB), Tapajós-Pará (PARA), and São Gabriel da Cachoeira-Amazonas (SGCA).

Cupuaçu’s closest relative is cupuí (T. subincanum), and in recent phylogenies the two species have consistently emerged as sister-groups27,28,29,30. Morphologically, cupuí is very similar to cupuaçu, but with smaller fruits and seeds and more pubescent leaves. They both share almost the same geographic distribution, except that cupuí extends further into Western Amazonia towards Colombia, Ecuador, and Venezuela22,23 (Fig. 1; Supplementary Data 1; Supplementary Methods). The geographic distribution of cupuaçu tends to be associated with human settlements. The seeds that are enveloped by its edible pulp are discarded and germinate, forming seedlings that grow near anthropized areas. Etymologically, cupuaçu and cupuí also show a close relationship, as the two words have a common Tupi-Guarani root (kupu-, “like cacao” and -uasu, “large”). Repeated cases of natural and artificial fertile hybrids between cupuaçu and cupuí have also been reported31,32,33,34.

Traditionally, cupuaçu has been defined as “incipiently domesticated”, i.e., a taxon whose populations have been subject to some degree of domestication, with an expected reduction in genetic variability, but whose average phenotype would still be within the range of its wild form35. A second study demonstrated cupuaçu’s limited genetic diversity36, which would be an unexpected pattern for a species traditionally considered as wild. Studies of crop domestication increasingly use genomic data37,38, and recent studies have shown that T. cacao, cupuaçu’s close relative, underwent strong domestication ~3600 years before the present (yBP), with the introduction of some individuals from Western Amazonia into Mesoamerica17. There is also archaeogenomic evidence of cacao consumption in Western Amazonia by 5200 yBP18. Here, we employed population RAD-seq data to determine the domestication history of cupuaçu. We demonstrate that cupuaçu is a domesticated form derived from cupuí, shedding light on its origins and geographic history. Our findings uncover intriguing patterns of genetic diversity and distribution among domesticated plants in Amazonia, providing valuable insights into the influence of both pre-Columbian and modern histories in the region.

Results and discussion

Cupuaçu as a domesticated form

We used multiple approaches to determine genomic relationships and compare genetic diversity between cupuaçu and cupuí, each of which indicates that cupuaçu samples are all nested as a genetic subset within cupuí specimens (Fig. 2a–c; Supplementary Fig. 4). Cupuaçu is therefore not a sister species of cupuí, as suggested in phylogenetic studies, but is instead nested within it. One of the signatures of domestication in the genome is a reduction in genetic diversity when compared to a potential wild form of the crop13,39,40. Our findings reveal that the genetic diversity of cupuaçu is lower than that of cupuí (πcupuaçu = 1.44 × 10⁻⁴ < πcupuí = 4.56 × 10⁻⁴, p-value < 0.001 for a pairwise Mann–Whitney U test; Supplementary Table 3; see Methods). Additionally, cupuaçu individuals were found to be less genetically diverse within their populations than cupuí individuals (Fig. 2a, b; cupuaçu weighted FST = 0.118 ± 0.078 versus FST = 0.491 ± 0.330 in cupuí, p-value < 0.001 for a pairwise Mann–Whitney U test; Supplementary Tables 4, 5). This was further supported by the strong genetic homogeneity observed in cupuaçu individuals from all locations, with K = 3 as the optimal number of ancestor groups as inferred by STRUCTURE analysis (Fig. 2d; see also Methods, Supplementary Fig. 5 and Supplementary Data 3). Expected heterozygosity levels of the identified ancestry group corresponding to cupuaçu were also low (HE = 0.06; see Supplementary Data 3). All of these patterns are consistent with a reduction in genetic diversity in cupuaçu as a consequence of inbreeding and a lack of introgression from cupuí to cupuaçu.

Fig. 2: Populational genomics and demographic history of cupuaçu and cupuí.

This figure presents the findings from our study on the populational genomics and demographic history of cupuaçu (Theobroma grandiflorum) and cupuí (T. subincanum). The figure consists of seven panels representing different analyses conducted in the study. a An UPGMA tree is shown, depicting the genetic relationships between samples from the four study locations. The branches on the tree represent the inferred ancestral ranges of the samples. The study locations are color-coded for clarity (see Supplementary Methods and Supplementary Fig. 6 for details). b A haplotype network illustrates the genetic distances among samples of T. grandiflorum (cupuaçu) and T. subincanum (cupuí). This network provides insights into the genetic diversity and relationships between the two species. c PCA plot reveals the genetic relationships between T. grandiflorum and T. subincanum samples collected from the study locations. This analysis aids in understanding the population structure and differentiation of the two species. d Results for STRUCTURE analysis to infer the relationships among the species by grouping them into K = 2 and K = 3 optimal clusters. K = 2 represents the expected clustering for the two species (T. grandiflorum and T. subincanum). Different ancestry groups are colored by different colors. Supplementary Methods and Supplementary Data 3 provide further details on this analysis. e Mutation loads and positive selection loci candidates in T. grandiflorum and T. subincanum samples were examined by calculating the ratio of nonsynonymous (dN) to synonymous (dS) changes across 1000 bootstrap replicates. This analysis focused on regions of the genome potentially under positive selection, with the number of candidate loci indicated on the x-axis. See Supplementary Methods and Supplementary Data 4 for more information. f Based on genomic data, the biogeographic history of cupuaçu was reconstructed, indicating its origin in the Middle/Upper Rio Negro, Northwestern Brazil. 
Additionally, the figure shows the extensive introduction of cupuaçu to various sites within the Amazon Basin and beyond. The study sites are marked as Xapuri-Acre (ACRE), Balbina-Amazonas (BALB), Tapajós-Pará (PARA), and São Gabriel da Cachoeira-Amazonas (SGCA). g Demographic history of cupuaçu, inferred using Stairway Plot 2. The y-axis represents the effective population size (Ne), which is proportional to nucleotide diversity. Decreases in Ne indicate bottleneck events, as highlighted in the plot. Additional details can be found in the Methods and Supplementary Data 5.

Additionally, our results suggest that cupuaçu conforms to the “cost-of-domestication” hypothesis41, as evidenced by the excessive mutation loads, particularly an abundance of nonsynonymous changes in cupuaçu compared to cupuí in candidate loci for selection (Fig. 2e; Supplementary Data 3; Supplementary Methods). Furthermore, in theory, domestication typically leads to fewer regions under selection, controlling most of the desired features with larger effects42,43. Our findings align with this expectation (6471 candidate loci have increased in frequency in cupuaçu vs. 10,445 candidate loci in cupuí; Fig. 2e; Supplementary Data 3; Supplementary Methods), although only a subset of these genes would have been positively selected for, with the majority of them increasing in frequency due to hitchhiking.

Cupuaçu in time and space

A stairway plot (Fig. 2g), along with the analyses outlined above, indicates a loss of genetic diversity in cupuaçu. This plot also suggests that cupuaçu was the product of two phases of domestication, the first of which occurred long before European colonization, in the mid-Holocene, somewhere in Northwestern Amazonia (Fig. 2f). The precise locality of domestication cannot be determined because of the limited extent of our geographic sampling. However, the earliest-diverging cupuaçu specimens are most closely related to cupuí samples from São Gabriel da Cachoeira and Balbina, in the Middle-Upper Rio Negro Basin (Fig. 2b; Supplementary Fig. 6; Supplementary Data 5), suggesting that initial domestication occurred somewhere around these localities.

Dating analyses indicate two major bottlenecks: one between 5347–7943 yBP, which would mark the initial domestication of cupuaçu, and a second one, which would represent a second phase of domestication, from c. 169 years ago (Fig. 2g; Supplementary Data 4). Between these two bottleneck events, we interpret that cupuaçu may have been dispersed only locally and over limited distances, which is consistent with the observed stabilization of the effective population size seen in Fig. 2g, as also theoretically suggested40. Additionally, samples of cupuaçu from all areas are not very genetically distinct from each other (Fig. 2b, d), suggesting a single initial domestication process in the Middle/Upper Rio Negro region. However, Allaby et al.44 showed how apparent monophyly and geographic origin of a crop might result from protracted domestication with individuals selected from multiple localities. Therefore, the initial domestication process would have been followed by the introduction of cupuaçu to all other areas of its current geographic distribution during the second phase of domestication (Fig. 2f).

Cupuaçu and the history of human occupation in Amazonia

Our dating of the domestication of cupuaçu matches archaeological evidence of the early occupation of the Amazon by indigenous societies from the early Holocene13,14. It is also consistent with evidence of systems of plant cultivation and management from the same period45,46,47,48, which included fruit-bearing trees47,48. In fact, the Rio Negro Basin has been occupied since at least 8500 yBP, while Northwestern Amazonia is considered among the very earliest hubs of human settlement and plant cultivation in the Americas46,47,48,49,50. Our dating somewhat precedes the earliest archaeological evidence for the use of cupuaçu, which dates to the late Holocene (4200–500 yBP), and all known sites with archaeobotanical remains in the Middle-Upper Rio Negro Basin9.

People from the Northwestern Amazon Basin belong to at least twenty ethnic groups that speak languages from three linguistic families: Arawak, Tukano and Makú51. The forest in this region is composed of terra-firme upland vegetation, which does not flood; campinarana forests, with stunted vegetation growing on well-drained, nutrient-poor sandy soils; and igapó forests, flooded throughout the year. Theobroma species are mostly associated with the terra-firme forests, growing on clayey, non-flooded soils up to 1000 meters elevation22,23, and cupuaçu would have been mostly planted in areas that allowed its adequate growth. Thus, in the Middle-Upper Rio Negro, the initial cultivation of cupuaçu would have been associated with other plants that are better adapted to non-flooded areas, such as cassava52. Archaeological and anthropological evidence indicates a gradual increase in the practice of creating agrobiodiversity by indigenous people since the early Holocene48,53. Evidence of the first “dark earths” (“terras pretas”), fertile and productive lands resulting from human management, also dates to the mid-Holocene53,54. We argue that our results add to the evidence that, during the Holocene, indigenous groups in Amazonia tended to have a broad-spectrum diet, including the use and management of palms, legumes, and other useful fruit trees9, such as cupuaçu.

Our results provide evidence against the idea that cupuaçu originated as a wild entity in the Eastern Amazon Basin13,14,20,22, as well as the idea that the species would be the product of extinct megafauna selection55, and point to a much faster, more recent, and intense human influence. Our findings also indicate that the introduction of cupuaçu to other regions of the Amazon Basin intensified only over the last two centuries (Fig. 2g), which coincides with historical evidence43. The popularization of cupuaçu from late 1880, but especially after 197056, reflects major historical events in the Brazilian Amazon such as, respectively, the late nineteenth century rubber boom57 and the large influx of migrants from other areas of Brazil promoted by the military governments of the 1970s and 1980s. During this time, there was an intensive influx of people from outside the Amazon Basin followed by a major decline in indigenous populations, with many locals being pushed away towards the inner forests to avoid attacks from rubber tappers. From this period on, the use and promotion of cupuaçu in popular culture would have increased, along with its commercial value52.

Our findings indicate that cupuaçu was domesticated prior to cacao17. This is perhaps not surprising given that cacao was likely first used for its pulp by indigenous societies18, and it was only later domesticated to produce a chocolate-like beverage from the seeds. The early use of pulp in both species would have made cupuaçu more attractive than cacao if a greater quantity and/or quality of pulp were found in the former or in representatives of its progenitor cupuí. But what is extraordinary about the origin of cupuaçu is that the artificial selection pressure imposed by humans on populations of its progenitor, cupuí, was intense enough to develop morphological traits that were sufficiently distinct for taxonomists to recognize it as a different species22.

It is important to acknowledge some limitations of our study, particularly the limited geographic sampling of populations. Results based on a sample of only four locations scattered across the vast Amazon Basin (see Methods and Supplementary Methods) should be treated with a degree of caution. Consequently, the area we propose as the center of domestication is comparatively broad, encompassing a region between two locations that are over 400 km apart (Fig. 2f). Future collections in additional locations may provide more refined information about the plant’s center of origin and a better understanding of changes in genetic diversity over time. Such research may also narrow the timeframe for cupuaçu’s domestication, which spans more than two millennia (5347–7943 yBP) in our estimate. This presents an exciting opportunity for further investigations, focusing not only on cacao’s crop wild relatives but also on other valuable Amazonian plants from other seed-plant families that may have undergone domestication processes similar to that of cupuaçu.

Methods

Sampling, DNA extractions and sequencing

A total of 47 specimens, comprising 24 individuals of T. grandiflorum (cupuaçu) and 23 individuals of T. subincanum (cupuí), were collected from four different locations: Acre, Amazonas-Balbina, Pará-Tapajós, and Amazonas-São Gabriel da Cachoeira (Fig. 1). The sampling aimed to cover a significant portion of the known geographic range of the species, as defined in previous taxonomic studies22,23,24 (refer to Supplementary Methods for more details). Following the recommended sampling framework and routine for population genomic studies58, six individuals per species were collected from each of the four locations, except for Pará-Tapajós, where five cupuí individuals were collected. Field expeditions for sampling took place between October and November 2021, and we avoided collecting intentionally planted specimens (see Supplementary Methods).

DNA extractions were performed on freshly collected leaf samples using QIAGEN’s DNeasy Plant Mini Kit according to the manufacturer’s instructions. For genomic analysis, restriction-site associated DNA sequencing (RAD-seq) data were generated by Floragenex (Oregon, USA). The DNA was digested with the PstI restriction enzyme, followed by sonication, ligation of sequencing adapters, and selection of 300–500 bp fragments for PCR amplification and library preparation. The resulting libraries were sequenced on an Illumina NovaSeq6000 platform, producing around 500 million 100 bp single-end reads.

Post-sequencing genomic manipulation and SNP calling

Reads were demultiplexed and assembled to each sample using the “stacks” v. 2.62 pipeline in R Environment59,60. Quality control of the sequencing data was assessed using “FastQC” v. 0.1.3 in R61. After demultiplexing, quality trimming (with a quality threshold of 30), alignment, and data normalization, an average of 3.9 million reads per sample were obtained (Supplementary Table 2 and Supplementary Data 2). The “QuasR” v. 1.38.0 package in R62 was used for data normalization. Five samples were excluded from further analysis due to generating relatively fewer reads compared to the other samples (see Supplementary Table 1).

Resulting data were mapped against the reference genome of T. cacao (Criollo cultivar B97-61/B2)63 using the “ShortRead” v. 4.2 package in R64. Variant calling was performed on the aligned files using “bcftools” v. 1.16, and SNP filtering was conducted using “vcftools” v. 0.1.16 in a Python v. 3.10.8 environment65,66. SNP filtering criteria retained only SNPs with a genotype call rate above 95% across all individuals and a minor allele count of three, with genotypes supported by fewer than three reads being recoded. The mean read depth of the samples was 2462.8 ± 803.9 (average ± standard deviation; see also Supplementary Figs. 1, 2), and the mean proportion of mapped reads (including SNPs and indels) was 206,614 ± 49,733 per sample (Supplementary Table 3; Supplementary Fig. 3; Supplementary Methods), which represented an average of 93.9% ± 2.7% of mapped variants per sample. This resulted in a final filtered and aligned variant calling file containing 42 samples from eight populations (21 samples per species, 4–6 samples per population) and 339,034 variant sites.
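The three filtering thresholds described above can be illustrated with a minimal Python sketch. This is not the vcftools command used in the study. The function names, the in-memory data layout, the reading of the minor allele count threshold as "at least three", and the choice to recode low-depth genotypes as missing are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# One genotype call is (alt_allele_count, read_depth). For a diploid sample
# alt_allele_count is 0, 1 or 2, or None when the genotype is missing.
Call = Tuple[Optional[int], int]

MIN_DEPTH = 3         # genotypes with fewer reads are recoded (assumed: as missing)
MIN_CALL_RATE = 0.95  # a site is kept only if its call rate is above this value
MIN_MAC = 3           # minor allele count threshold (assumed: "at least three")


def recode_low_depth(calls: List[Call]) -> List[Optional[int]]:
    """Set genotypes supported by fewer than MIN_DEPTH reads to missing (None)."""
    return [gt if (gt is not None and depth >= MIN_DEPTH) else None
            for gt, depth in calls]


def passes_filters(genotypes: List[Optional[int]]) -> bool:
    """Apply the call-rate and minor-allele-count thresholds to one site."""
    called = [gt for gt in genotypes if gt is not None]
    if not genotypes or len(called) / len(genotypes) <= MIN_CALL_RATE:
        return False
    alt = sum(called)              # copies of the alternate allele
    ref = 2 * len(called) - alt    # copies of the reference allele
    return min(alt, ref) >= MIN_MAC


def filter_snps(sites: List[List[Call]]) -> List[List[Optional[int]]]:
    """Return the recoded genotypes of every site that passes both filters."""
    recoded = (recode_low_depth(calls) for calls in sites)
    return [genotypes for genotypes in recoded if passes_filters(genotypes)]
```

Recoding is applied before the call-rate check, so a site can fail because too many of its genotypes were supported by too few reads.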

Genomic structure analysis and ancestral range reconstructions

We conducted a series of complementary analyses to examine the genomic structure of cupuaçu and cupuí populations. Firstly, we constructed a genetic distance tree using the “poppr” v. 2.9.3 package in R67 to illustrate the relatedness between each sample. Secondly, we generated a minimum spanning haplotype network and performed a principal component analysis (PCA) using the R packages “adegenet” v. 2.1.1068 and “igraph” v. 1.3.5, respectively. Additionally, we examined overall nucleotide diversity by comparing the average pairwise FST and average π statistics across all populations for the 10 chromosomes. This analysis, performed using “bcftools” v. 1.1665, used a window size of 10 kbp. Statistical differences between each pair of populations were evaluated using a Mann–Whitney U test in Python, with a significance threshold of p-value < 0.001.
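As an illustration of the windowed diversity comparison, the sketch below computes per-window nucleotide diversity (π) for biallelic SNPs and the Mann–Whitney U statistic for two sets of window values. It is a simplified stand-in, not the bcftools-based workflow used here. The function names, the 0-based positions, and the treatment of unlisted positions as invariant are assumptions, and no p-value is computed.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

WINDOW = 10_000  # window size in bp, matching the 10 kbp window described above


def site_pi(alt_count: int, n_chrom: int) -> float:
    """Per-site diversity of a biallelic SNP: the probability that two
    chromosomes drawn without replacement carry different alleles."""
    if n_chrom < 2:
        return 0.0
    return 2.0 * alt_count * (n_chrom - alt_count) / (n_chrom * (n_chrom - 1))


def windowed_pi(snps: Iterable[Tuple[int, int, int]]) -> Dict[int, float]:
    """snps holds (position, alt_count, n_chrom) for one chromosome and one
    population. Returns {window_start: pi per bp}; positions that are not
    listed are treated as invariant (an assumption of this sketch)."""
    totals: Dict[int, float] = defaultdict(float)
    for pos, alt, n_chrom in snps:
        totals[(pos // WINDOW) * WINDOW] += site_pi(alt, n_chrom)
    return {start: total / WINDOW for start, total in totals.items()}


def mann_whitney_u(x: List[float], y: List[float]) -> float:
    """U statistic for x: the number of (x_i, y_j) pairs with x_i > y_j,
    with ties counted as one half."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
```

In practice the window values of the two taxa would be passed to a statistics library to obtain the p-value; only the statistic is shown so that the example has no external dependencies.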

We utilized STRUCTURE v. 2.3.469 to determine ancestral groups. This allowed us to assign individuals to potential ancestry populations and calculate overall membership proportions among them. For the K populations, we ran eight analyses with K varying from 1 to 8, representing each group sampled here (two species, each collected from four sites). Each set of analysis consisted of three independent runs, and we obtained average scores from these runs. The parameters were set with 10,000 reps and a 1000 burn-in period after verifying that the likelihood values stabilized within these values. To visualize and decide the optimal K populations based on our data, we used the “pophelper” v. 2.3.1 package in R70.

Furthermore, we conducted Bayesian phylogenetic inference based on whole-genomic SNPs71 (see Supplementary Methods). The resulting phylogenetic tree served as a basis for ancestral area reconstruction analysis to estimate the historical geographic distribution of the studied groups. For this analysis, we used the “BioGeoBEARS” v. 1.1.2 package in R72,73, considering the study locations adopted in this research (see Supplementary Methods, Supplementary Tables 6 and 7 and Supplementary Data 6 for more details).

Selection estimates and testing for the “cost-of-domestication” hypothesis

To investigate the “cost-of-domestication” hypothesis41, we annotated genes in the coding regions and calculated dN/dS ratios for each chromosome to identify candidate loci in coding regions. We employed a linkage-disequilibrium (LD) approach to identify partial and incomplete selective sweeps, using two LD-based scores, iHS and XP-EHH, computed with the “rehh” v. 2.0.2 package74. After remapping genes against a reference genome, we identified regions of candidate genes under selection, most of which hitchhiked, with far fewer genes truly being positively selected. From this set of candidate loci, we measured the ratio of nonsynonymous to synonymous mutations (dN/dS) for each site and chromosome, comparing the changes between cupuaçu and cupuí samples. For detailed methodology and outputs, please refer to the Supplementary Methods and Supplementary Data 3.
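The bootstrap comparison of nonsynonymous and synonymous changes (Fig. 2e) can be sketched as follows. This is a deliberately simplified illustration: it resamples candidate loci with replacement and takes the ratio of raw change counts, without normalizing by the number of nonsynonymous and synonymous sites as a full dN/dS estimate would. The function name, input layout, and fixed random seed are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_ratio(loci: Sequence[Tuple[int, int]],
                    replicates: int = 1000,
                    seed: int = 0) -> List[float]:
    """loci holds (nonsynonymous_changes, synonymous_changes) per candidate locus.

    Each replicate resamples the loci with replacement and records the ratio
    of total nonsynonymous to total synonymous changes. Replicates with no
    synonymous change are skipped because the ratio is undefined.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    ratios: List[float] = []
    for _ in range(replicates):
        sample = rng.choices(loci, k=len(loci))
        nonsyn = sum(n for n, _ in sample)
        syn = sum(s for _, s in sample)
        if syn > 0:
            ratios.append(nonsyn / syn)
    return ratios
```

Running the function separately on the cupuaçu and cupuí candidate loci gives two distributions of ratios that can be compared. An excess in cupuaçu would be in line with the “cost-of-domestication” expectation.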

Demographic history of cupuaçu

The demographic history of cupuaçu was investigated to determine the time of origin of cupuaçu lineages. We employed the “easySFS” and “∂a∂i” inference tools in Python to estimate a folded site-frequency spectrum (SFS), which describes the distribution of allele frequencies across variant sites in the populations75. A 4-population model was used, with each location site (Acre, Balbina, Pará, and São Gabriel da Cachoeira in Brazil) represented. The resulting SFS file was then incorporated into the Java-based Stairway Plot 2 program76,77.
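For readers unfamiliar with the folded SFS, the following minimal sketch shows how it is tallied from per-site allele counts in a single population. It does not reproduce easySFS or ∂a∂i. It assumes no missing data (those tools handle missing data by projecting to a smaller sample size), and the function name is hypothetical.

```python
from typing import List, Sequence


def folded_sfs(alt_counts: Sequence[int], n_chrom: int) -> List[int]:
    """Folded site-frequency spectrum for one population.

    Entry i counts the sites whose minor allele is carried by exactly i of
    the n_chrom sampled chromosomes, for i = 0 .. n_chrom // 2. Folding is
    used when the ancestral allele is unknown, so an allele seen k times is
    treated the same as one seen n_chrom - k times.
    """
    sfs = [0] * (n_chrom // 2 + 1)
    for k in alt_counts:
        if not 0 <= k <= n_chrom:
            raise ValueError(f"allele count {k} is outside 0..{n_chrom}")
        sfs[min(k, n_chrom - k)] += 1
    return sfs
```

For example, with 10 sampled chromosomes, sites with alternate allele counts of 1 and 9 both fall in the singleton class of the folded spectrum.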

Stairway Plot 276 utilizes folded SNP frequency spectra data and has demonstrated good performance, particularly for estimating effective population size in non-model organisms with unknown ancestral allele frequencies. We followed a similar approach to the one used for T. cacao17, conducting a demographic inference with a mutation rate of μ = 3.1 × 10⁻⁹ mutations per bp per generation and assuming a generation time of three years, as typically reported for T. grandiflorum78,79. Based on these parameters, we created an input model file for the demographic analysis, simulating 100 independent samples and incorporating four break points, as recommended by the package developers78. A comprehensive list and description of the outputs from this analysis can be found in Supplementary Data 4, along with additional details in the Supplementary Methods section.
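The role of the two parameters above can be made explicit with a short sketch of the standard scaling relationships: θ = 4Neμ per site for a diploid population, and time in years equal to time in generations multiplied by the generation time. Stairway Plot 2 performs this scaling internally, so the helper functions below are hypothetical and serve only to show how the mutation rate and generation time enter the estimates.

```python
MU = 3.1e-9          # mutations per bp per generation (value stated above)
GENERATION_TIME = 3  # years per generation (value stated above)


def ne_from_theta(theta_per_bp: float, mu: float = MU) -> float:
    """Effective population size of a diploid population from the
    per-site population mutation rate, using theta = 4 * Ne * mu."""
    return theta_per_bp / (4.0 * mu)


def generations_to_years(generations: float,
                         generation_time: float = GENERATION_TIME) -> float:
    """Convert a time measured in generations to calendar years."""
    return generations * generation_time


def years_to_generations(years: float,
                         generation_time: float = GENERATION_TIME) -> float:
    """Convert a time measured in calendar years to generations."""
    return years / generation_time
```

Under these parameters, the 5347–7943 yBP interval reported for the first bottleneck corresponds to roughly 1780–2650 generations. Because both conversions are linear, a different assumed generation time would rescale the dates proportionally.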

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.