Polyploidy is a powerful evolutionary force that has shaped genome evolution across many eukaryotic lineages, possibly offering adaptive advantages in times of global change1,2. Such whole-genome duplications (WGDs) are particularly characteristic of plants3, and a large proportion of crop species are polyploid4,5,6,7,8,9,10,11. Our understanding of genome evolution following WGD is still incomplete, but outcomes can include genomic shock, in terms of activation of cryptic transposable elements (TEs), subgenome-partitioned gene regulation or fractionation, homoeologous exchange (HE), meiotic instability and even karyotype variation8,12,13,14,15,16. Alternatively, few or none of these phenomena may materialize, and the two subgenomes can coexist harmoniously, gradually adapting to new ploidy levels17. Regardless, the most common fate of polyploids appears to be fractionation and eventual reversion to the diploid state18.

With an estimated production of 10 million metric tons per year, coffee is one of the most traded commodities in the world. The most broadly appreciated coffee is produced from the allotetraploid species Coffea arabica, especially from cultivars belonging to the Bourbon or Typica lineages and their hybrids19. C. arabica (2n = 4x = 44 chromosomes) resulted from a natural hybridization event between the ancestors of present-day Coffea canephora (Robusta coffee, subgenome CC (subCC)) and Coffea eugenioides (subgenome EE (subEE)), each with 2n = 2x = 22. The founding WGD has previously been dated to between 10,000 and 1 million years ago20,21,22,23, with the Robusta-derived subgenome of C. arabica most closely related to C. canephora accessions from northern Uganda24. Arabica cultivation was initiated in fifteenth- to sixteenth-century Yemen (Extended Data Fig. 1). Around 1600, the so-called seven seeds were smuggled out of Yemen25, establishing Indian C. arabica cultivar lineages. A century later, the Dutch began cultivating Arabica in Southeast Asia, thus setting up the founders of the contemporary Typica group. One plant, shipped to Amsterdam in 1706, was used to establish Arabica cultivation in the Caribbean in 1723. Independently, the French cultivated Arabica on the island of Bourbon (presently Réunion)26, and the descendants of a single plant that survived by 1720 form the contemporary Bourbon group. Contemporary Arabica cultivars descend from these Typica or Bourbon lineages, except for a few wild ecotypes with origins in natural forests in Ethiopia. Due to its recent allotetraploid origin and strong bottlenecks during its history, cultivated C. arabica harbors a particularly low genetic diversity20 and is susceptible to many plant pests and diseases, such as coffee leaf rust (Hemileia vastatrix). As a result, the classic Bourbon–Typica lineages can be cultivated successfully in only a few regions around the world. Fortunately, a spontaneous C. canephora × C. arabica hybrid resistant to H. vastatrix was identified on the island of Timor27 in 1927. Many modern Arabicas contain C. canephora introgressions derived from this hybrid, which ensure rust resistance but also have unwanted side effects, such as decreased beverage quality28.

Modern genomic tools and a detailed understanding of the origin and breeding history of contemporary varieties are vital to developing new Arabica cultivars, better adapted to climate change and agricultural practices29,30,31. Here, we present chromosome-level assemblies of C. arabica and representatives of its progenitor species, C. canephora (Robusta) and C. eugenioides (hereafter Eugenioides). Whole-genome resequencing data of 41 wild and cultivated accessions facilitated in-depth analysis of Arabica history and dissemination routes, as well as the identification of candidate genomic regions associated with pathogen resistance.


The genomes of C. arabica, C. canephora and C. eugenioides

As reference individuals, we chose the di-haploid Arabica line ET-39 (ref. 32), a previously sequenced doubled haploid Robusta33 and the wild Eugenioides accession Bu-A. Long- and short-read-based hybrid assemblies were obtained (Methods and Supplementary Sections 2.1 and 2.2), spanning 672 megabases (Mb) (Robusta), 645 Mb (Eugenioides) and 1,088 Mb (Arabica). Upon Hi-C scaffolding, the Robusta and Arabica assemblies consisted of 11 and 22 pseudochromosomes, and spanned 82.7% and 62.5%, respectively, of the projected genome sizes (Table 1). To improve the Arabica assembly, we generated a second assembly using Pacific Biosciences (PacBio) HiFi technology followed by Hi-C scaffolding (Methods and Supplementary Sections 2.2 and 2.3). This assembly was 1,198 Mb long, of which 1,192 Mb (93.1% of the predicted genome size based on cytological evidence34) was anchored to pseudochromosomes (Table 1). Gene space completeness, assessed using Benchmarking Universal Single-Copy Orthologs (BUSCOs)35, was >96% for all assemblies. Importantly, 93.2% of the BUSCO genes were duplicated in the HiFi assembly (Table 1), indicating that most of the gene duplicates from the allopolyploidy event were retained.

Table 1 Statistics of the Coffea assemblies presented in this paper

The Robusta and Eugenioides genomes contained, respectively, 67.5% and 59.7% TEs (Supplementary Section 3.2), with Gypsy long terminal repeat (LTR) retrotransposons accounting for most of the difference between the two species. This difference was greatly reduced (63.1% and 63.8%) in the two Arabica subgenomes (subCC and subEE, stemming from Robusta and Eugenioides ancestors, respectively), possibly indicating TE transfer via HE. Robusta contained considerably more recent LTR TE insertion elements than Eugenioides. Again, the two Arabica subgenomes showed greater similarity to each other in recent LTR TE insertions than the two progenitor genomes. No major evidence was found for LTR TE mobilization following Arabica allopolyploidization, in contrast to what has been observed in tobacco36, but similar to Brassica synthetic allotetraploids37. Observed Arabica genome evolution instead more closely follows the ‘harmonious coexistence’ pattern38 seen in Arabidopsis hybrids17,39.

High-quality gene annotations, followed by manual curation of specific gene families (Supplementary Sections 3.1–3.4), resulted in 28,857, 32,192, 56,670 and 69,314 gene models for the Robusta, Eugenioides, PacBio Arabica and Arabica HiFi assemblies, respectively (Table 1). Altogether, ~97% of Robusta and 99.6% of Arabica HiFi gene models were placed on the pseudochromosomes, with 33,618 and 35,449, respectively, to subgenomes subCC and subEE (Table 1). Annotation completeness from BUSCO was ≥95% for Eugenioides and Robusta, and reached 97.3% for Arabica HiFi.

Genome fractionation and subgenome dominance

Comparison of Arabica subCC and subEE against their Robusta and Eugenioides counterparts revealed high conservation in terms of chromosome number, centromere position and numbers of genes per chromosome (Fig. 1 and Supplementary Section 4). Patterns of gene loss following the gamma paleohexaploidy event displayed high structural conservation between Robusta and Eugenioides during the 4–6 million years since their initial species split22,23 (Supplementary Section 4). Likewise, the structures of the two Arabica subgenomes were highly conserved between each other, with, since the Arabica-founding allotetraploidy event, only ~5% of BUSCO genes having reverted to the diploid state (Fig. 1a and Table 1). Syntenic comparisons revealed that genomic excision events, removing one or several genes at a time in similar proportions across the two subgenomes, have been the main driving force in genome fragmentation both before and after the polyploidy event (Fig. 1b and Supplementary Section 4). Fractionation occurred mostly in pericentromeric regions, whereas chromosome arms showed more moderate paralogous gene deletion (Fig. 1c and Supplementary Section 4). The Arabica allopolyploidy event seemingly did not affect the rate of genome fractionation, which remained roughly constant when comparing deletions in progenitor species versus Arabica subgenomes after the event. In support of the dosage-balance hypothesis40, subgenomic regions with high duplicate retention rates were significantly enriched for genes that originated from the Arabica WGD (Fisher exact test, P < 2.2 × 10⁻¹⁶). In contrast, low duplicate retention rate regions significantly overlapped with genes originating from small-scale (tandem) duplications (Supplementary Table 1). Genes with high retention rates were enriched in Gene Ontology (GO) categories such as ‘cellular component organization or biogenesis’, ‘primary metabolic process’, ‘developmental process’ and ‘regulation of cellular process’, while low retention rate genes were enriched in categories such as ‘RNA-dependent DNA biosynthetic process’ and ‘defense response’ (in both subgenomes), and ‘spermidine hydroxycinnamate conjugate biosynthetic process’ (involved in plant defense41) and ‘plant-type hypersensitive response’ (in subEE) (Supplementary Tables 2–5).
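The enrichment of WGD-derived genes in high-retention regions rests on a Fisher exact test over a 2 × 2 contingency table. The sketch below shows one way such a one-sided test can be computed from first principles; the function name and the example counts are hypothetical and are not taken from the study, whose actual counts are in the Supplementary material.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact test for enrichment in the 2x2 table [[a, b], [c, d]].

    Illustrative layout (an assumption, not the authors' table):
      a = WGD-derived genes inside high-retention regions
      b = other genes inside high-retention regions
      c = WGD-derived genes outside those regions
      d = other genes outside those regions
    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1 = a + b            # genes in high-retention regions
    col1 = a + c            # WGD-derived genes overall
    n = a + b + c + d       # all genes
    total = comb(n, row1)   # number of ways to choose the first row
    upper = min(row1, col1)
    tail = sum(comb(col1, k) * comb(n - col1, row1 - k)
               for k in range(a, upper + 1))
    return tail / total

# Hypothetical counts, for illustration only.
p_value = fisher_exact_greater(3, 1, 1, 3)
print(f"one-sided P = {p_value:.4f}")
```

A small P value here would indicate that WGD-derived genes fall in high-retention regions more often than expected by chance, which is the direction of the result reported above.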

Fig. 1: Patterns of synteny, fractionation and gene loss in C. arabica and its progenitor species C. canephora and C. eugenioides.

a, Corresponding syntenic blocks between CA subgenomes subCC (orange) and subEE (blue), and with the CC (orange) and CE (blue) genomes. b, The base pairs in intergenic DNA in synteny block gaps caused by fractionation in a subCC–subEE comparison, compared with numbers of base pairs in homoeologous unfractionated regions, as a function of numbers of consecutive genes deleted. c, Gene retention rates in synteny blocks plotted along subCC chromosome 2; subCC is plotted in orange and subEE in blue. The green box indicates the pericentromeric region. CA, C. arabica; CC, C. canephora; CE, C. eugenioides.

To study possible expression biases between subgenomes, we identified syntelogous gene pairs and removed the pairs showing HEs in the Arabica subgenomes (see under ‘Origin and domestication of Arabica coffee’ below)42 (Supplementary Section 5). Overall, no significant global subgenome expression dominance was observed (Supplementary Tables 6 and 7). However, gene families regularly displayed mosaic patterns of expression, including several encoding enzymes that contribute to cup quality, such as the N-methyltransferase (NMT), terpene synthase (TPS) and fatty acid desaturase 2 (FAD2) families, each with some members more highly expressed from one of the two subgenomes (Extended Data Fig. 2), in line with a recent study43. Similar gene family-wise patterns occur in other evolutionarily recent polyploids such as rapeseed10 and cotton44, which are also at early stages of transitioning back to a diploid state.

Origin and domestication of Arabica coffee

To obtain a genomic perspective on the evolutionary history of Arabica, we sequenced 46 accessions, including three Robusta, two Eugenioides and 41 Arabica. The latter included an eighteenth-century type specimen, kindly provided by the Linnean Society of London, 12 cultivars with different breeding histories, the Timor hybrid and five of its backcrosses to Arabica, and 17 wild and three wild/cultivated accessions collected from the Eastern and Western sides of the Great Rift Valley45,46 (Supplementary Table 8 and Fig. 2a).

Fig. 2: Population history of C. arabica.

a, Geographic origin of resequenced wild C. arabica accessions (red placeholders). Accession names are given in c. The red arrow indicates the probable route of migration to Yemen in historical times. b, Ancestral population assignments of C. arabica accessions for subCC (left) and subEE (right). Relationships among individuals are illustrated with phylogenetic trees obtained from independent SNPs. For magnified views of the trees, see Supplementary Fig. 37. c, Magnification of the bottom left part of a, showing the admixture values for each of the accessions in subCC (top) and subEE (bottom); the colors correspond to the analysis in b. d, Population sizes of wild and cultivated accessions, inferred using SMC++, suggest genetic bottlenecks at ~350 and 1 ka (limited to nonadmixed wild individuals). e, FastSimcoal2 output, suggesting a population split ~30.5 ka, followed by a period of migration between the populations until ~8.9 ka. This timing corresponds with increased population diversity in cultivars at a similar time, calculated using SMC++. Green rectangles along the timeline show ‘windows of opportunity’, times when Yemen was connected with the African continent wherein human migrations to the Arabian Peninsula may have occurred. The purple rectangle shows the last ice age. M, migration; OAE, out-of-Africa event. f, Directional gene flow analysis using Orientagraph suggests two hypotheses: gene flow from the shared ancestral population of all cultivars to the Ethiopian wild individuals (subCC), or gene flow from the Typica lineage to Ethiopia (subEE). Maps in a and c were generated with Google Earth and Google Maps, respectively.

HE between subgenomes has been observed in several recent polyploids8,10,42. Arabica generally displays bivalent pairing of homologous chromosomes and disomic inheritance47, but since the subgenomes share high similarity, occasional homoeologous pairing and exchange may also occur. We therefore explored the extent of HE among Arabica accessions and its possible contribution to genome evolution. Overall, all accessions shared a fixed allele bias toward subEE at one end of chromosome 7, which contained genes enriched for chloroplast-associated functions (Extended Data Fig. 3a, Supplementary Section 5 and Supplementary Table 9). Since the Arabica plastid genome is derived from Eugenioides48, HE in this region was likely selected for, due to compatibility issues between nuclear and chloroplast genes encoding chloroplast-localized proteins49. Surprisingly, all but one accession (BMJM) showed significant (Bonferroni-adjusted P values < 0.0005; chi-squared test, each d.f. = 1) 3:1 allelic biases toward subCC. The highly concordant HE patterns, present in both wild and cultivated Arabicas (Extended Data Fig. 4), suggested that (1) the allelic bias is an adaptive trait not associated with breeding and (2) it originated in a common ancestor of all sampled accessions, possibly immediately after the founding allopolyploidy event. Some exchanges, shared by only a few accessions, probably originated more recently (Extended Data Fig. 3b). More recent HE events were also found in some cultivars and also showed a bias toward subCC, except for BMJM, which showed bias toward subEE due to a single large crossover in chromosome 1 (Extended Data Fig. 3a). An interesting hypothesis for future investigation is that in a low-diversity polyploid species such as Arabica, HE could be a major contributor to phenotypic variation observed among closely related accessions50.
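The per-accession bias test above is a chi-squared test with one degree of freedom followed by a Bonferroni adjustment. The sketch below shows one plausible reading of that procedure: counts of HE loci favoring each subgenome are compared with an equal (1:1) expectation. The null model, the function names and the counts are assumptions made for illustration; the exact setup is described in the Supplementary material.

```python
import math

def chi2_sf_df1(stat):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

def he_bias_test(n_toward_subcc, n_toward_subee, n_tests):
    """Chi-squared goodness-of-fit test against an equal split (assumed null).

    n_tests is the number of accessions tested, used for Bonferroni adjustment.
    Returns (statistic, raw P value, Bonferroni-adjusted P value).
    """
    total = n_toward_subcc + n_toward_subee
    expected = total / 2.0
    stat = ((n_toward_subcc - expected) ** 2 / expected
            + (n_toward_subee - expected) ** 2 / expected)
    p_raw = chi2_sf_df1(stat)
    p_adj = min(1.0, p_raw * n_tests)
    return stat, p_raw, p_adj

# Hypothetical counts for one accession, adjusted for 41 Arabica accessions.
stat, p_raw, p_adj = he_bias_test(30, 10, n_tests=41)
print(f"chi2 = {stat:.2f}, raw P = {p_raw:.4g}, adjusted P = {p_adj:.4g}")
```

For one degree of freedom the chi-squared tail probability has the closed form erfc(√(x/2)), which is why no statistics library is needed here.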

We next studied population genetic statistics for each of the subgenomes (Supplementary Table 10). The 17 wild samples demonstrated low genomic diversities, indicative of small effective population sizes, while negative Tajima’s D suggested an expanding population, possibly following one or more population bottlenecks. The cultivars and wild population samples had similar genetic diversities, as demonstrated by low fixation index (FST) values. In cultivars, nucleotide diversities were only slightly lower than in wild populations and Tajima’s D scores were less negative, suggesting that only minor bottlenecks and subsequent population expansions occurred during domestication.
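For readers less familiar with these summary statistics, the following sketch computes the mean number of pairwise differences (π), Watterson's estimator and Tajima's D from a set of aligned haplotypes using the standard formulas. It is a toy illustration under the assumption of complete, phased data; the haplotypes are invented, and the study's own values were computed genome-wide per subgenome.

```python
from math import sqrt

def tajimas_d(haplotypes):
    """Return (pi, theta_w, D) for equal-length haplotype strings.

    pi is the mean number of pairwise differences, theta_w is Watterson's
    estimator S / a1, and D is Tajima's D (None if there are no variable sites).
    """
    n = len(haplotypes)
    n_pairs = n * (n - 1) / 2
    seg_sites = 0
    pi = 0.0
    for pos in range(len(haplotypes[0])):
        counts = {}
        for hap in haplotypes:
            counts[hap[pos]] = counts.get(hap[pos], 0) + 1
        if len(counts) > 1:
            seg_sites += 1
            identical_pairs = sum(c * (c - 1) / 2 for c in counts.values())
            pi += (n_pairs - identical_pairs) / n_pairs
    if seg_sites == 0:
        return pi, 0.0, None
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    theta_w = seg_sites / a1
    d = (pi - theta_w) / sqrt(e1 * seg_sites + e2 * seg_sites * (seg_sites - 1))
    return pi, theta_w, d

# Invented example: every variant is a singleton, as expected after expansion.
print(tajimas_d(["000", "100", "010", "001"]))
```

An excess of rare variants pulls π below Watterson's estimator and makes D negative, which is the signal interpreted above as population expansion after a bottleneck.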

SNP tree estimation and ADMIXTURE analyses (Fig. 2b) identified a three-population solution for subCC: Typica–Bourbon cultivars (Population 1), wild accessions (Population 2), and Timor hybrid-derived cultivars (Population 3). The old BMJM and the recently established Geisha cultivars showed admixed states on both subgenomes, similar to about half of the wild accessions. Indian varieties encompassed both Typica and Bourbon variation, in agreement with previous studies20. The Linnaean sample grouped with the cultivars, supporting its hypothesized origin from the Dutch East Indies25. A complementary principal component analysis (PCA) (Extended Data Fig. 5) was in agreement with ADMIXTURE analysis.

In wild accessions, both subgenomes concordantly showed two population bottlenecks (Fig. 2d) in the SMC++ (ref. 51) modeling. Assuming a 21-year generation time52, the older bottleneck began abruptly around 350 thousand years ago (ka) and ended around 15 ka, at the start of the African humid period53, when climatic conditions were more favorable for Arabica growth. The more recent bottleneck began more gradually around 5 ka and lasts to this day. Cultivated accessions, however, exhibited the older, but not the more recent, bottleneck. In part due to these differences, we also modeled Arabica population history using FastSimcoal2 (ref. 54), modeling the wild population and cultivars as two separate lineages. In the best-fitting model (Fig. 2e), the wild population was predicted to split from the cultivar founding population 1,450 generations ago (~30 ka), that is, before the last glacial maximum. The original founding event was analyzed using the nonadmixed wild individuals, revealing an ancestral population bottleneck at 350 ka (Extended Data Fig. 6a). Divergence estimates based on gene fractionation, the distribution of nonsynonymous mutations (Extended Data Fig. 6b) and calibrated SNP trees (Fig. 2b) suggested the allopolyploid founding event occurred at 610 ka, which is close to previous estimates22,23. The 350 ka bottleneck, on the other hand, corresponds to that found in the SMC++ analyses (Fig. 2d). We therefore consider 610–350 ka a likely time range for the polyploidization event (Fig. 2e). The wild and pre-cultivar lineages maintained some gene flow (in terms of migration) until ~8–9 ka, which may have contributed to the modeled increase in effective population size (Fig. 2d,e).
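The dates in this paragraph depend on converting coalescent time in generations to calendar time with the assumed 21-year generation time. A minimal sketch of that conversion follows; the helper names are hypothetical, and only the 21-year generation time and the 1,450-generation split come from the text.

```python
GENERATION_TIME_YEARS = 21  # generation time assumed in the text

def generations_to_ka(generations, generation_time=GENERATION_TIME_YEARS):
    """Convert a time in generations to thousands of years ago (ka)."""
    return generations * generation_time / 1000.0

def ka_to_generations(ka, generation_time=GENERATION_TIME_YEARS):
    """Convert a time in ka back to generations."""
    return ka * 1000.0 / generation_time

# Wild/cultivar split reported as 1,450 generations ago.
print(f"split: {generations_to_ka(1450):.2f} ka")
# End of migration at roughly 8.9 ka (Fig. 2e), expressed in generations.
print(f"end of migration: {ka_to_generations(8.9):.0f} generations")
```

Because the conversion is linear, any uncertainty in the generation time scales every reported date by the same factor.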

While these data were not able to identify the precise place of origin of the modern cultivated population (see also the following section), the extended period of migration between wild and cultivated accessions suggests that they were separated only by a relatively small geographic distance, such as along the two sides of the African Great Rift Valley (Fig. 2a–c). It is also possible that the cultivated lineage could have extended as far as Yemen and that the end of migration between the two populations could have been caused by the widening of the Bab al-Mandab strait (separating Yemen and Africa) due to rising sea levels55 at the end of the African humid period. A native Arabica population exists in Yemen56, which could support this hypothesis. The Linnaean sample, together with the Typica and Bourbon cultivars, originates from this second population, which was also used to establish cultivation in Yemen, as suggested by the SNP, ADMIXTURE and PCA analyses (Fig. 2b and Extended Data Fig. 5).

In conclusion, our analyses suggest that the Arabica allopolyploidy event occurred between 610 and 350 ka, when considering that inbreeding present in Coffea populations would accelerate coalescence estimation57,58. Earlier work proposing more recent timings, such as 20 ka (ref. 20), could be underestimates stemming from confounding effects of population bottlenecks in cultivated and wild lineages.

Origin of modern cultivars

The known breeding history of several of our Arabica cultivars provided us with a gold standard set for deducing the Arabica pedigree using Kinship-based INference for Gwas (KING)59 (Fig. 3). The method correctly identified the relationships between Bourbon and Typica group cultivars and the Bourbon–Typica crosses in subCC. In contrast, the subEE pedigree showed lower (second) order relationships, possibly due to HE in that subgenome (Extended Data Fig. 7). Timor hybrid-derived accessions did not show significant relationships to mainline cultivars in subCC (likely due to Robusta introgressions in this subgenome that broke the haplotype blocks; see below), while subEE showed second-degree relationships to both the Typica and Bourbon groups (Fig. 3 and Extended Data Fig. 7), confirming that subEE has not received substantial introgression.
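KING reports a pairwise kinship coefficient, which is then binned into degrees of relatedness. The sketch below applies the cutoffs documented for KING (powers of two centered on the expected kinship of each degree); whether the study used exactly these defaults is an assumption, and the function name is hypothetical.

```python
def relationship_degree(kinship):
    """Map a kinship coefficient to a degree of relatedness.

    Returns 0 for duplicates or monozygotic twins, 1 to 3 for first- to
    third-degree relatives, and None for more distant or unrelated pairs.
    Cutoffs are 2^-(degree + 1.5): about 0.354, 0.177, 0.0884 and 0.0442.
    """
    if kinship > 2 ** -1.5:
        return 0
    for degree in (1, 2, 3):
        if kinship > 2 ** -(degree + 1.5):
            return degree
    return None

# Expected kinship values: parent-offspring 0.25, grandparent 0.125.
for phi in (0.5, 0.25, 0.125, 0.0625, 0.01):
    print(phi, relationship_degree(phi))
```

In a highly inbred, predominantly selfing species, kinship estimates are inflated relative to an outbred pedigree, so the degrees are best read as relative closeness rather than literal generation counts.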

Fig. 3: Kinship estimation of C. arabica accessions, inferred from SNPs in subCC.

The degree of relatedness was estimated using KING and describes the number of generations between the related accessions. Thumbnail images show FDR-corrected F3 tests of introgression for each of the target individuals. Each cell in the matrix illustrates an F3 test result for the target accession containing introgression from two different sources (x and y axes); blue color illustrates significant adjusted Z-score (Z adj; value indicated by color key), indicative of gene flow (or allele sharing via identity by descent78) from the two source accessions to the target, while red color illustrates no support for gene flow. See Extended Data Fig. 7 for corresponding analyses in subEE. In the wild accessions, the dark green background highlights the admixed individuals (Fig. 2b), while the nonadmixed individuals are highlighted with red background. Relationships follow standard nomenclature (for example, second degree refers to an individual’s grandparents, grandchildren and so on, whereas third degree refers to great-grandparents, great-grandchildren and so on).
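The F3 test referred to in this legend asks whether a target's allele frequencies lie between those of two candidate sources. The sketch below computes the basic statistic as the mean of (c − a)(c − b) across SNPs. It is a simplified illustration: it omits the finite-sample correction and the block-jackknife Z-score on which significance testing relies, and the allele frequencies are invented.

```python
def f3(target, source_a, source_b):
    """Simplified f3(target; source_a, source_b) from per-SNP allele frequencies.

    Each argument is a list of allele frequencies for the same SNPs.
    Negative values indicate that the target is intermediate between the
    two sources, as expected when it carries ancestry from both.
    """
    terms = [(c - a) * (c - b) for c, a, b in zip(target, source_a, source_b)]
    return sum(terms) / len(terms)

# Invented frequencies: the target sits midway between two fixed sources.
print(f3([0.5, 0.5], [0.0, 1.0], [1.0, 0.0]))
```

A value near zero or above gives no support for the target being a mixture of the two sources, which corresponds to the red cells in the thumbnails.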

Interestingly, the Typica, Bourbon and JK1 individuals were also first degree related, suggesting direct parent–offspring relationships. Besides confirming their shared Yemeni origins, this finding also underscores the Yemeni germplasm’s limited genetic diversity. Further, the old cultivar lines JK1 (Indian), Erecta (Indonesian Typica), BMJM (Caribbean Typica), TIP1 (Brazilian Typica) and BB1 (Brazilian Bourbon) showed second- or higher-degree relationships with a cluster of closely related wild admixed accessions, centered on E016/136 (Fig. 2b). The recently established Geisha cultivar showed similar relationships to the wild admixed individuals and the Bourbon and Typica groups, suggesting common origins. Interestingly, admixed wild accession E016/136 was closely related to both wild and cultivated populations.

In a comparison of geographic origins, wild individuals from the Eastern side of the Great Rift Valley had some levels of admixture and were closely interrelated, while on the Western side, the admixed, related individuals were mostly concentrated around the Gesha region (Figs. 2c and 3). The E016/136 admixed accession, closest to cultivars, demonstrated a first-degree relationship with several wild accessions, of which only Ar35-06 and Eth28.2 were pure representatives of the wild population (Fig. 2b). Therefore, these two accessions are genetically closest, in our sample, to the hypothetical true wild parent of cultivated Arabica, with E016/136 representing an intermediate form. Ar35-06 was collected near Gesha mountain, close to the origin of the modern Geisha cultivar. Altogether, these data point to the Gesha region as a hotspot of wild accessions amenable to domestication.

Admixed wild samples may have originated from a recent hybridization event that occurred before or after their collection from the wild. A third alternative is that the Yemeni population (and hence the cultivars) originated from an admixed population from the Eastern side of the Great Rift Valley or the Gesha region. Analysis of admixture patterns with Orientagraph60 (Fig. 2f) suggested hybridization with the common ancestor of the Bourbon and Typica lineages in subCC, and of Typica in subEE. In the case of recent hybridization, introduced haplotypes would exist as long contiguous blocks (as in the Timor hybridization, which occurred 100 years ago), while for older events, the blocks would be more fragmented due to crossing-over. Analysis using the distance fraction (df) statistic61 showed the latter to be the case (Extended Data Fig. 8), indicating that admixture events among wild accessions were not very recent, supporting our third hypothesis.

Domestication and cultivation usually involve strong population bottlenecks based on high wild diversity, resulting in reduced genetic diversity in cultivars62. However, Arabica nucleotide diversity was already very low in the wild, probably as a result of earlier bottlenecks (Fig. 2d,e), but only marginally reduced in the pre-cultivated lineage (Extended Data Fig. 9a). Bourbon had lower diversity than Typica, probably resulting from the known single-individual bottleneck in this group. Also, the inbreeding coefficients in the wild and cultivated accessions were similar (Extended Data Fig. 9b), differing from general expectations for a domesticated species62.

To look for pathways under purifying selection in cultivars, we identified genes with high FST (95% quantile) between cultivars and wild accessions. This resulted in a set of 1,908 genes that were enriched for the GO categories ‘cellular response to nitrogen starvation’, ‘regulation of innate immune response’ and ‘regulation of defense response’ (Supplementary Table 11), and contained homologs of ammonium transporters AMT1 and AMT2, important for nitrogen uptake in Coffea63; a homolog of the salicylic acid receptor NONEXPRESSER OF PR GENES 1 (NPR1), required in salicylic acid signaling and systemic acquired resistance64; as well as a homolog of the Arabidopsis LSU2 gene, previously identified as a hub convergently targeted by effectors of pathogens from different kingdoms65. A second screen, focused on genes with a large number of high-impact nonsynonymous mutations shared among cultivars (>40% of individuals carrying the mutation), generated a list of 556 genes that were significantly enriched for only one GO category, ‘defense response’ (Supplementary Table 12). Of the 22 genes in this category, 16 were NB-ARC domain-containing resistance (R) genes, and two were members of the leucine-rich repeat (LRR) defense gene family. High diversity in immune-related responses is one possible pathogen resistance mechanism in plant communities66, and therefore reduced diversity may have compromised modern Arabica cultivar immunity.
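The first screen amounts to keeping genes whose FST exceeds the 95% quantile of the genome-wide distribution. A minimal sketch of such a filter is shown below, using a nearest-rank quantile; the quantile definition, the function name and the gene identifiers are assumptions for illustration rather than the authors' implementation.

```python
def genes_above_quantile(fst_by_gene, percent=95):
    """Return (threshold, genes) with FST strictly above the given quantile.

    Uses the nearest-rank definition: the threshold is the value at rank
    ceil(percent * n / 100) in the sorted list of per-gene FST values.
    """
    values = sorted(fst_by_gene.values())
    n = len(values)
    rank = -(-percent * n // 100)      # integer ceiling of percent * n / 100
    threshold = values[max(rank, 1) - 1]
    selected = sorted(g for g, v in fst_by_gene.items() if v > threshold)
    return threshold, selected

# Hypothetical per-gene FST values for 100 genes named g1 ... g100.
toy = {f"g{i}": i / 100 for i in range(1, 101)}
threshold, outliers = genes_above_quantile(toy)
print(threshold, outliers)
```

The selected gene list would then be passed to a GO enrichment test against the full gene set as background.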

The high level of conservation between the Arabica subgenomes and their diploid progenitors may have facilitated spontaneous interspecific hybridization events. This was the case for the Timor hybrid, a spontaneous Robusta × Arabica hybrid resistant to H. vastatrix27. Our sample set included five descendants of the original Timor hybrid, obtained by backcrossing to Arabica. As expected, the hybridization affected subCC more profoundly, with much higher levels of nucleotide divergence apparent (FST = 0.185) than in subEE (FST = 0.0897), when comparing cultivars and hybrids. The divergence from wild populations was even greater, with FST = 0.254 for subCC and FST = 0.138 for subEE, illustrating that introgression occurred almost exclusively within subCC.
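The FST values quoted here summarize allele-frequency divergence between groups of accessions. The text does not state which estimator was used, so the sketch below uses Hudson's estimator (ratio of averages) purely as an illustration; the function name and frequencies are hypothetical.

```python
def hudson_fst(freqs_pop1, freqs_pop2, n1, n2):
    """Hudson's FST estimator, combined across SNPs as a ratio of sums.

    freqs_pop1, freqs_pop2: per-SNP allele frequencies in each population.
    n1, n2: number of sampled chromosomes in each population (assumed
    constant across SNPs, that is, no missing data).
    """
    numerator = 0.0
    denominator = 0.0
    for p1, p2 in zip(freqs_pop1, freqs_pop2):
        numerator += ((p1 - p2) ** 2
                      - p1 * (1 - p1) / (n1 - 1)
                      - p2 * (1 - p2) / (n2 - 1))
        denominator += p1 * (1 - p2) + p2 * (1 - p1)
    return numerator / denominator if denominator > 0 else float("nan")

# Invented frequencies at three SNPs for two groups of 10 chromosomes each.
print(hudson_fst([0.9, 0.5, 0.1], [0.2, 0.5, 0.7], n1=10, n2=10))
```

Computing the statistic separately for SNPs on subCC and subEE would reproduce the kind of per-subgenome contrast reported above, where introgression inflates divergence mainly in subCC.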

In the Timor hybrids, the regions found with df statistics61 largely overlapped the introgressed loci identified using FST scans (Fig. 4a) and were found in large blocks, reflecting recent hybridization, and covering 7–11% of the genome (Fig. 4a and Extended Data Fig. 8). Transposon insertion polymorphisms (TIPs) also overlapped with introgressed regions (Gypsy P = 0.0002; Copia P = 0.035; Fisher exact test), confirming their recent origin from Robusta (Fig. 4b). The introgressed regions overlapped with regions of higher subgenome fractionation (P = 0.001873; Supplementary Table 13), possibly due to heterologous recombination between subCC and Robusta, resulting in unequal crossing-over.

Fig. 4: Introgression of C. canephora into H. vastatrix-resistant C. arabica lineages.

a, Introgression df statistic estimated for different Timor hybrid derivatives. Colored lines above the axis mark regions of significant introgression in the line under inspection, and are colored by chromosome. The shared introgressed region on chromosome 4 is colored in purple and boxed. TIPs are represented as lines below the x axis and exhibit overlap with introgressed regions. b, The shared introgressed genomic region on subCC chromosome 4 contains a cluster of R genes (RPP8), a cluster of homologs of a negative regulator of R genes (CPR1) and a cluster of homologs of Leaf rust resistance 10 kinases (LRK10L) (bottom). The heatmap shows, from the bottom up, (1) log fold change of gene expression after H. vastatrix inoculation, when comparing resistant Timor hybrid lineage against a susceptible cultivar; red color means elevated expression in the hybrid, and blue decreased expression. (2) Fixation index (FST) values for the introgressed lines versus cultivars and between cultivars and wild accessions. (3) Nucleotide diversity for the wild and cultivated accessions for each gene coding region, plus the flanking 2 kb upstream and downstream of the region. FC, fold change.

An introgressed region shared by all Timor hybrid lines was evident on chromosome 4 (Fig. 4a). We identified a set of 233 genes shared by all hybrids (Supplementary Table 14). The set contained members of three colocalized tandemly duplicated blocks of resistance-related genes on chromosome 4, subCC, and showed high FST values between cultivars and introgressed lines. A tandem array of five genes comprised homologs of Arabidopsis RPP8, a NOD-like receptor resistance locus conferring pleiotropic resistance to several pathogens67,68. RPP8 shows a great amount of variation in Arabidopsis alone, where intrachromosomal gene conversion combined with balancing selection contributes to its exceptional diversity69. The same subCC region also included a tandem array of ten homologs of CONSTITUTIVE EXPRESSER OF PR GENES 1 (CPR1), a negative regulator of defense response that targets resistance proteins70,71. Finally, we identified three duplicates encoding Leaf rust 10 disease-resistance locus receptor-like protein kinases (LRK10L). The LRK10L genes form a family that is widespread across plants. First identified as a protein kinase in a locus contributing leaf rust resistance in wheat72, they were found to be upregulated during various biotic and abiotic stresses73 and were confirmed as positive regulators of wheat hypersensitive resistance response to stripe rust fungus73 and powdery mildew74.

The high FST values between cultivated and introgressed, but not wild, individuals (Fig. 4b) indicate that the wild population cannot be the source for allelic asymmetries. Nucleotide diversities further illustrate this point; some genes demonstrate lower nucleotide diversity in wild individuals, suggesting these genes to have experienced selective sweeps. To further narrow down candidate genes involved in leaf rust resistance, we reanalyzed comparative gene expression data from susceptible and resistant accessions after H. vastatrix inoculation75. This analysis identified 723 differentially expressed genes, most of which were associated with defense responses (Fig. 4b and Supplementary Tables 14 and 15). The combination of high FST values, nucleotide diversities and differential expression data highlights several strong candidate genes (one RPP8, six CPR1 and one LRK10L) at this locus.


Besides providing genomic resources for molecular breeding of one of the most important agricultural commodities, our Arabica, Robusta and Eugenioides genomes provide a unique window into the genome evolution of a recently formed allopolyploid stemming from two closely related species. Our Arabica data did not suggest a genomic shock induced by allopolyploidy, but instead only a higher LTR transposon turnover rate. Genome fractionation rates remained basically unaltered before and after the allopolyploidy event. Likewise, no global subgenome dominance in gene expression was observed, but rather a mosaic-type pattern as in other recent polyploids10,44, affecting the expression of individual gene family members. However, similar to octoploid strawberry8, we detected genome dominance in terms of biased HEs favoring subCC. Since Robusta has one of the widest geographic ranges in the Coffea genus, whereas Eugenioides is more range-limited, this biased HE might be adaptive. This hypothesis was supported by the site frequency spectrum of HE loci, showing signs of directional selection (Extended Data Fig. 3). Intriguingly, transposon insertion polymorphisms significantly overlapped with tandem gene duplications and biosynthetic gene clusters, hinting at their possible roles in cluster evolution.

Domestication of perennial species such as Arabica coffee differs markedly from that of annual crops, consisting instead of three phases: selection of outstanding genotypes from wild forests, clonal propagation and cultivation, and then breeding and diversification76. In addition to being a perennial crop, Arabica is also a predominantly autogamous allopolyploid, which puts it in a class of its own. We show here that genetic diversity was already very low among wild accessions, due to multiple pre-domestication bottlenecks, and that the genotypes selected for cultivation by humans (both the ancient cultivated Ethiopian landraces and the recent Geisha cultivar) already were somewhat admixed between divergent lineages. The resequenced accessions displayed a geographic split along the Eastern versus Western sides of the Great Rift Valley, with cultivated coffee variants all placed with the Eastern population. Such admixture has played a large role in breeding many fruit-bearing crops, the nonpolyploid allogamous perennial lychee being one of the most extreme cases58.

The prevalent autogamy of Arabica, combined with the multiple genetic bottlenecks it underwent in the wild, may have selectively purged deleterious alleles, explaining the capacity of the species to survive single-plant bottlenecks that occurred during its cultivation. An additional element buffering deleterious alleles was probably Arabica’s allopolyploidy itself, which provided some level of heterosis77. However, the narrow genetic basis of both cultivated and wild modern Arabica constitutes a major drawback, as well as an obstacle for its breeding using wild genepool diversity. On the other hand, the extensive collinearity of its CC and EE subgenomes with those of its Robusta and Eugenioides progenitors is likely to facilitate introgression of interesting traits from these species, as already happened historically in the Timor spontaneous hybrid. The high-quality genome sequences of the three species provided in this work, together with the identification of the genomic region conferring resistance to coffee leaf rust, constitute a cornerstone for the breeding of novel Arabica varieties with superior adaptability and pathogen resistance.


Genome sequencing

For the three Coffea species, genomic DNA was extracted from leaf tissue. A Qiagen kit was used for DNA extraction for Illumina sequencing. Illumina short reads and PacBio 20-kilobase (kb) libraries were prepared following the manufacturer’s instructions. Sequencing was performed on a HiSeq2000 instrument for the short reads, and the PacBio RSII platform for long reads (specifications given in Supplementary Table 16). For the generation of HiFi reads, DNA was extracted from C. arabica leaf tissue following nuclei purification by centrifugation followed by lysis, phenol–chloroform extraction and isopropanol precipitation. DNA was fragmented to 20 kb using a Megaruptor 3. SMRTbell libraries were sequenced on a single SMRTcell on a Sequel IIe platform.

For the resequencing of 39 wild and cultivated C. arabica accessions, libraries were prepared using the KAPA HyperPrep Kits (Roche) following the manufacturer’s instructions, and paired-end (2 × 125) sequenced on an Illumina HiSeq2500 instrument to ~40× coverage. The Linnaean herbarium sample was sequenced to 46× coverage with Ion Torrent technology.


Contig-level assembly for C. canephora was obtained with MHAP79 and scaffolded using BAC-end sequences and 454 paired-end sequences generated previously33. Both C. eugenioides and C. arabica were assembled with Falcon80, and C. arabica was subsequently phased using Falcon_unzip. All three genomes were error-corrected with Pilon81 using Illumina short reads (Supplementary Section 2.2). C. canephora and C. arabica were further scaffolded into pseudochromosomes using Dovetail Hi-C technology. For C. eugenioides no more material could be obtained for further improvement of the assembly contiguity, and the assembly was scaffolded into pseudomolecules using C. canephora as reference. Gaps in the scaffolds were filled with PBJelly82, after which six more rounds of polishing were done with Pilon using the Illumina shotgun sequenced genomic DNA as well as RNA sequencing (RNA-seq) reads.

The resulting chromosome assemblies for C. canephora were checked and corrected using an ultra-high-density linkage map83 generated during the project. To further improve the quality of the C. arabica assembly, Bionano genome maps were generated.

The C. arabica HiFi assembly was carried out with hifiasm v.0.16.1 (ref. 84), followed by scaffolding using Hi-C data from Dovetail technology and the ALLHiC85 pipeline. Final quality checks and manual adjustments of the assembly were carried out using 3d-DNA86 and juicebox87.

The completeness of the different assemblies was assessed using BUSCO v.5.2.2 (ref. 35) with the eudicots_odb10 database (2,326 genes; Table 1). Telomeric repeats were searched across the chromosomes using CoGeBLAST88.

To assess the phasing of both subgenomes from C. arabica, synonymous nucleotide substitution (Ks) values were obtained from CoGe89 and compared between C. arabica and each of two diploid outgroups, C. canephora and C. eugenioides, using scripts in R.

Linkage map

A reference genetic map was constructed from a cross between a Congolese group genotype (BP409) and a Congolese × Guinean hybrid parent (Q121). The segregating population was composed of 93 F1 individuals90. The parents were sequenced to 60× and progeny to 20× coverage using the Illumina HiSeq2000 platform at Nestlé Research. Following quality control with FastQC and trimming with Trimmomatic v.0.36 (ref. 91), the reads were mapped against the C. canephora reference assembly using BWA-MEM v.0.7.15 (ref. 92). The linkage mapping was conducted with Lep-MAP3 (ref. 83). The markers were clustered into paternal and maternal linkage groups by using a logarithm of the odds score of 18 in a segregation distortion aware model. The final curation of the assembly, combining the two parental maps, solving conflicts as well as identification of haplotype alleles, was carried out manually.

TE annotation and analysis

EDTA93 was used to de novo identify TEs in the C. canephora, C. eugenioides as well as C. arabica subgenomes. Inpactor2 (ref. 94) was used to recover full-length LTR retrotransposons in the three genomes and to classify them at the lineage level. EDTA and Inpactor2 libraries were merged and clustered using cd-hit95. Clusters were manually inspected to remove nested and false predictions. After curation, libraries were used for annotation using Repeat Masker (default parameters). Annotations with length >200 base pairs (bp) were retained. The timing of LTR retrotransposon insertions was studied in the three genomes using individual sequences recovered by Inpactor2 and using an average base substitution rate of 1.3 × 10^−8 (ref. 96), similar to Orozco-Arias et al.97.
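
As a rough illustration of the dating step, the age of an LTR retrotransposon insertion is commonly estimated as T = K / (2r), where K is the sequence divergence between the two LTRs of one element and r is the substitution rate. Only the rate (1.3 × 10^−8) comes from the text above; the formula, the Jukes-Cantor correction and all function names below are assumptions made for this sketch, not a description of the pipeline actually used.

```python
import math

# Substitution rate per site per year, as stated in the text.
SUBSTITUTION_RATE = 1.3e-8


def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance from the proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


def insertion_age_years(ltr5, ltr3, rate=SUBSTITUTION_RATE):
    """Estimate insertion age from two aligned LTR sequences of equal length.

    Columns containing gaps or ambiguous bases are ignored (an assumption).
    """
    if len(ltr5) != len(ltr3):
        raise ValueError("LTR sequences must be aligned to the same length")
    pairs = [
        (a, b)
        for a, b in zip(ltr5.upper(), ltr3.upper())
        if a in "ACGT" and b in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    p = sum(a != b for a, b in pairs) / len(pairs)
    # Both LTRs accumulate substitutions independently, hence the factor 2.
    return jukes_cantor_distance(p) / (2 * rate)
```

Under these assumptions, two LTRs differing at 1 site in 100 would correspond to an insertion roughly 0.39 million years old.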

Gene prediction

RNA-seq and IsoSeq reads were generated to support de novo gene prediction. A MAKER-P pipeline98 was used to combine several de novo gene callers with the IsoSeq and junction information from short-read RNA-seq. High-evidence gene models with Annotation Edit Distance score < 0.5 were selected for the annotation. For C. arabica HiFi assembly, the annotations were first transferred from CC, CE and the previous CA assembly using GeMoMa v.1.9 (ref. 99), and then combined. All genes of interest linked to coffee flavor were subjected to manual inspection and gene model curation. Following the annotation, BUSCO completeness scores were assessed for the CC, CE and CA predicted transcriptomes.

Gene expression

Three gene families, encoding terpene synthases (TPS), N-methyltransferases (NMT) and fatty acid desaturase 2 (FAD2), were further characterized and used to investigate the influence of the presence of the extra gene copies in the allopolyploid using previously published expression data100. The expression data presented here are the TPM (transcripts per million) normalized counts with log-scaling: log10(x + 1 × 10^−4), where x is the TPM count from STARaligner101. For leaf rust differential expression analysis, previously published RNA-seq data75 were reanalyzed by mapping the reads to the C. arabica HiFi assembly using STARaligner. Differential expression in Timor hybrid versus susceptible Caturra accession after inoculation with H. vastatrix was analyzed with DEseq2 (ref. 102) in R. False discovery rate (FDR) adjustment was carried out using the Benjamini–Hochberg method; adjusted P value < 0.05 was considered statistically significant.
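
The log-scaling of TPM values can be sketched in a few lines. The pseudocount of 1 × 10^-4 is taken from the text; the function name and the list-based input are hypothetical choices for illustration.

```python
import math

# Pseudocount from the text; it keeps unexpressed genes (TPM = 0) finite.
PSEUDOCOUNT = 1e-4


def log_scale_tpm(tpm_values, pseudocount=PSEUDOCOUNT):
    """Return log10(x + pseudocount) for each TPM value x."""
    return [math.log10(x + pseudocount) for x in tpm_values]
```

With this transform, a gene with zero TPM maps to −4, which sets the floor of the scaled expression values.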

Evolution of synteny and fractionation

Synteny information was obtained using the SynMap tool on the CoGe platform88,89. Only genes within synteny blocks were considered; these included not only gene pairs but also singleton genes in each genome that have lost their counterpart in the other genome due to fractionation or other gene loss.

We used the ‘peaks’ method103, as calculated by the R function geom_density, for the three events that generate duplicate genomes during genome evolution of C. arabica, that is, the gamma triplication at the origin of the core eudicots, the speciation underlying the CC/CE divergence and the allotetraploidization event.


Syntenic genes between CE, CC, subCC and subEE were identified using the SynMap tool on the CoGe platform. Identification of allele biases was carried out by mapping the C. arabica short-read sequencing data against combined CE and CC assemblies using BWA-MEM92 and calculating sequencing coverages on syntenic genes using bedtools. Differential coverage across the chromosomes was visualized using custom R scripts. To reduce noise, a sliding window of ten genes was used to calculate the average coverage along chromosomes. The allele balance was calculated as A = 4 × ((CC/(CC + EE)) − 0.5), where CC and EE are the subCC and subEE syntelog coverages, respectively, so that A ranges from −2 (only EE coverage) to 2 (only CC coverage). Allele balances <−1.5 or >1.5 were considered homozygous for EE or CC, respectively, while balances between −0.5 and 0.5 were considered equal.
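
The allele balance calculation and its thresholds can be sketched as follows. The formula and cut-offs follow the text; the function names, the "intermediate" label for values outside the named classes and the step size of one gene for the sliding window are assumptions of this sketch.

```python
def allele_balance(cc_cov, ee_cov):
    """A = 4 * (CC / (CC + EE) - 0.5); returns None when there is no coverage."""
    total = cc_cov + ee_cov
    if total == 0:
        return None
    return 4 * (cc_cov / total - 0.5)


def classify_balance(balance):
    """Map an allele balance to the classes described in the text."""
    if balance < -1.5:
        return "EE"
    if balance > 1.5:
        return "CC"
    if -0.5 < balance < 0.5:
        return "equal"
    return "intermediate"  # label assumed here; the text does not name this class


def sliding_mean(values, window=10):
    """Average over each run of `window` consecutive genes (step of one gene)."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]
```

For example, equal coverage on both syntelogs gives A = 0, whereas coverage on the subCC copy only gives A = 2.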

SNP calling

Following quality control with FastQC104, Illumina short reads were trimmed using Trimmomatic v.0.36 (ref. 91) and mapped on the C. arabica reference assembly with BWA-MEM v.0.7.16a-r1181 (ref. 105). For the Linnaean sample, the reads were processed according to the protocols recommended for degraded DNA analysis in MapDamage v.2.0.8 (ref. 106). GATK (v.3.8.0) pipeline was used for SNP calling. Duplicates were marked and removed using Picard v.2.0.1 and genotype likelihoods were called into GVCF files using HaplotypeCaller (GATK). For the diploid progenitors, to allow interspecies comparisons, the mapping was done to each of the subgenomes separately, including chromosome zero, that is, contigs not assembled into pseudomolecules, in both mappings. Joint calling was carried out using GenotypeGVCFs (GATK)107 and snpEff v.4.3t was used to assess the impact of the SNPs108. To remove regions with cross-species mappings, we removed the SNPs that were called as heterozygous when mapping the di-haploid ET-39 sequencing data to the Arabica reference genome.

Genome-wide nucleotide diversity was calculated with vcftools v.0.1.17 (ref. 109), by calculating the mean of pi values from sliding windows of 100 kb with 10-kb step size. Similarly, genome-wide Tajima’s D was calculated from the mean of Tajima’s D values with window size of 100 kb. PCA was run using Plink v.1.90 (ref. 110). ADMIXTURE v.1.3.0 (ref. 111) was run for SNP data where the variants in repeat regions were filtered out and the outgroup species (diploid Coffea species) were excluded. The SNPs were filtered for linkage disequilibrium (LD) according to the recommendation in the ADMIXTURE manual with (--indep-pairwise 50 10 0.1) while allowing maximum 10% missing values (--geno 0.1). Admixture analysis was run using tenfold cross-validation. The solution giving lowest cross-validation score was selected as the best solution. Nonsynonymous nucleotide diversity, π0, and neutral, intergenic πs were calculated using the PiNSiR R package and ANGSD v.0.933 (ref. 112), similar to ref. 58.

Analysis of GBS data

Read data from 736 PstI GBS libraries of C. arabica20 were downloaded from the SRA repository (bioproject PRJNA554647). The samples were 100-bp single-end reads sequenced on an Illumina HiSeq2000 instrument. After trimming and quality filtering, the data were mapped onto the reference genome sequence of C. arabica using the BWA-MEM algorithm with default settings in BWA v.0.7.17 (ref. 105). SNPs were called using the Unified Genotyper in GATK v.3.7 (ref. 107).

F3 statistics

The Admixtools package113 was used to calculate the F3 statistics, and the obtained P values were subjected to FDR correction using the procedure developed by Salojärvi et al.114, where the Z-scores were converted into P values, subjected to FDR correction using Benjamini–Hochberg correction and then converted back to Z-scores.
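
The Z-score to P value to Z-score round trip can be sketched with the Python standard library. The text does not state whether the P values were one- or two-sided; this sketch assumes two-sided P values and restores the original sign of each Z-score afterwards. All function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def z_to_p(z):
    """Two-sided P value of a Z-score (two-sidedness is an assumption)."""
    return 2.0 * STD_NORMAL.cdf(-abs(z))


def p_to_z(p, sign_source):
    """Convert a two-sided P value back to a Z-score with the sign of sign_source."""
    p = min(max(p, 1e-300), 1.0)  # guard against underflow to zero
    magnitude = -STD_NORMAL.inv_cdf(p / 2.0)
    return math.copysign(magnitude, sign_source)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def fdr_adjust_z(zscores):
    """Z -> P -> Benjamini-Hochberg -> Z, as outlined in the text."""
    adjusted = benjamini_hochberg([z_to_p(z) for z in zscores])
    return [p_to_z(p, z) for p, z in zip(adjusted, zscores)]
```

Because the adjustment can only increase a P value, each corrected Z-score is no larger in magnitude than the original.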

SNP trees

The SNPs were filtered for repetitive regions, followed by filtering for LD > 0.4 and loci with >40% missing values, as well as minor allele prevalence <10%. The obtained fasta file of the selected sites was input for RAxML with -T 30 -m GTRGAMMA model, using 30 starting trees and 1,000 bootstrap samples115.
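
The missingness and minor-allele filters can be illustrated per site as below. The thresholds (40% missing, 10% minor allele prevalence) are from the text; the data representation (one allele call per sample, None for missing), the treatment of multiallelic sites and the function name are assumptions, and the repeat and LD filters are not shown.

```python
from collections import Counter


def keep_site(calls, max_missing=0.40, min_maf=0.10):
    """Decide whether a SNP site passes the missingness and minor-allele filters.

    calls: one allele per sample (e.g. "A", "G"), with None for missing data.
    """
    if not calls:
        return False
    observed = [c for c in calls if c is not None]
    n_missing = len(calls) - len(observed)
    if n_missing > max_missing * len(calls):
        return False  # more than 40% missing values
    counts = Counter(observed)
    if len(counts) < 2:
        return False  # monomorphic among the observed calls
    # Second most common allele is treated as the minor allele (assumption).
    minor_count = counts.most_common()[1][1]
    return not minor_count < min_maf * len(observed)
```

Sites passing such a filter would then be concatenated into the alignment supplied to the tree inference.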

Pairwise sequentially Markovian coalescent modeling

For each individual, the reads were mapped against the full CA reference assembly. The mappings were then filtered to remove indels (using bcftools) and regions with <8× or >100× coverage. After filtering, the obtained pairwise sequentially Markovian coalescent (PSMC) fastq file was split into subCE and subCC specific parts and PSMC demography was estimated using standard parameter settings (-N25 -t15 -r5)116. The inferred history was then visualized using R and the ggplot2 package.

Ancestral state estimation

The ancestral state was inferred from reads of two representatives of each of the diploid coffee species, C. canephora (BUD15, Q121) and C. eugenioides (BU-A, DA56), mapped against each of the subgenomes and the unassigned contigs. Subsequently, a majority vote was carried out to infer the ancestral allele using ANGSD v.0.933 (ref. 112) with options -doFasta 2 and -doCounts 1. The SNP calls in the VCF file were then flipped to the ancestral states using bcftools +fixref117.
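
The majority-vote idea can be sketched as follows. This is a simplified stand-in for the ANGSD step and not its implementation: it assumes that the bases observed at each site are available as a string, and it reports ties and uncovered sites as "N", which is a choice made for this sketch.

```python
from collections import Counter


def majority_base(bases):
    """Most common base among the reads covering one site, or "N" if undecided."""
    counts = Counter(b for b in bases.upper() if b in "ACGT")
    if not counts:
        return "N"  # no usable coverage
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "N"  # tie between the two most common bases
    return ranked[0][0]


def ancestral_sequence(pileup_columns):
    """Build an ancestral sequence from per-site strings of observed bases."""
    return "".join(majority_base(column) for column in pileup_columns)
```

For example, the columns "AAAG", "CCTT", "" and "ggga" would yield the sequence ANNG under these rules.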


The input data for SMC++ comprised the VCF file where the ancestral state was used as reference (see above) and the SNPs in repeat regions were filtered out. For the cultivar population, the representatives of Bourbon and Typica lineages were included (TIP1, Bourbon, Mundo Novo, BMJM, Moka, Rubi, Topazio, Bourbon pointu, Catuai99, BB1, Erecta, JK1, Guatemalense, Amsterdam); Geisha was removed from the analysis because of its unknown pedigree. SMC++ parameter selection was carried out using threefold cross-validation (smc++ cv) implemented in SMC++ v.1.15.3 (ref. 51).

Kinship analysis

Before kinship analysis, the diploid species were removed from the SNP file and the kinship was estimated using KING software v.2.2.5 with the --kinship option59. The results were visualized using Keynote, for each subgenome separately.

Introgression analyses

Orientagraph v.1.0 (ref. 60) was run for each of the subgenomes separately according to the developer recommendations by carrying out filtering for linkage as recommended for TreeMix118. PopGenome R package was used to calculate d_f statistics61. For the subCE introgression, BUD15 was used as outgroup, DA56 as the source of introgression and E383 as the nonadmixed wild representative. For subCC, DA56 was used as outgroup and BUD15 as the source of introgression. The statistic was calculated in 20-kb nonoverlapping windows using weighted jackknife to assess the significance of introgression. The results were visualized using R.

Population simulations

FastSimCoal v.2.6 was used for population simulations54. Site frequency spectrum was calculated using ANGSD112 with the VCF file containing wild individuals and repetitive regions filtered out. The ancestral states were estimated as described above. For each of the models, 100 parameter files were simulated. For each parameter file, 1,000,000 simulations were run; monomorphic sites were not used. Maximum composite likelihood estimation of parameters was carried out with 40 expectation-conditional maximization iterations.

Fixation index

Site-wise FST values between wild and cultivated individuals were calculated for each gene annotation and 2-kb flanking regions using vcftools109. Then, mean FST values were calculated for each gene model in R.
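
The per-gene averaging can be sketched as below. The 2-kb flank is from the text; the input structures, the skipping of undefined (NaN) site values and the function name are assumptions of this sketch, which is written in Python rather than R.

```python
import math


def mean_fst_per_gene(genes, sites, flank=2000):
    """Average site-wise FST over each gene plus its flanking regions.

    genes: dict mapping gene ID to (chromosome, start, end).
    sites: list of (chromosome, position, fst) tuples.
    Returns a dict mapping gene ID to mean FST, or None if no site falls in range.
    """
    result = {}
    for gene_id, (chrom, start, end) in genes.items():
        values = [
            fst
            for site_chrom, pos, fst in sites
            if site_chrom == chrom
            and start - flank <= pos <= end + flank
            and not math.isnan(fst)  # undefined site values are skipped
        ]
        result[gene_id] = sum(values) / len(values) if values else None
    return result
```

This linear scan is adequate for illustration; a genome-wide analysis would normally index the sites by chromosome and position first.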

TE insertion polymorphisms

We studied LTR retrotransposon insertions via analysis of short-read whole-genome resequencing data using TIP_finder119, using the discordant mapping pair approach.

Biosynthetic gene clusters

Biosynthetic gene clusters were identified with the Plantismash web server following default analysis protocols120.

Statistical testing

Statistical significance of overlaps between various gene sets was assessed using Fisher's exact test in R. Gene set enrichments were carried out by first assigning each gene to the GO category of the closest Arabidopsis homolog (using E-value threshold 1 × 10^−5). Tests for enrichment were carried out using goatools121. A Bonferroni-corrected P value of 0.05 was used as the threshold for significance. Tests for the allele balance were carried out using the chi-squared test; each test had d.f. = 1.
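
The overlap test can be illustrated with the hypergeometric tail that underlies Fisher's exact test. The text does not say whether the tests were one- or two-sided; this sketch assumes a one-sided test for enrichment, and the function names are hypothetical.

```python
from math import comb


def overlap_pvalue(n_universe, n_a, n_b, n_overlap):
    """One-sided Fisher exact P value for the overlap of two gene sets.

    Probability of observing at least n_overlap shared genes when sets of
    size n_a and n_b are drawn from n_universe genes (hypergeometric tail).
    """
    total = comb(n_universe, n_b)
    tail = sum(
        comb(n_a, k) * comb(n_universe - n_a, n_b - k)
        for k in range(n_overlap, min(n_a, n_b) + 1)
    )
    return tail / total


def bonferroni(pvalues):
    """Bonferroni-adjusted P values, capped at 1."""
    n_tests = len(pvalues)
    return [min(1.0, p * n_tests) for p in pvalues]
```

For instance, two sets of 5 genes drawn from 10 that overlap completely give P = 1/252 under this one-sided assumption.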

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.