A single polyploidization event at the origin of the tetraploid genome of Coffea arabica is responsible for the extremely low genetic variation in wild and cultivated germplasm

The genome of the allotetraploid species Coffea arabica L. was sequenced to assemble independently the two component subgenomes (putatively deriving from C. canephora and C. eugenioides) and to perform a genome-wide analysis of the genetic diversity in cultivated coffee germplasm and in wild populations growing in the center of origin of the species. We assembled a total length of 1.536 Gbp, 444 Mb and 527 Mb of which were assigned to the canephora and eugenioides subgenomes, respectively, and predicted 46,562 gene models, 21,254 and 22,888 of which were assigned to the canephora and to the eugeniodes subgenome, respectively. Through a genome-wide SNP genotyping of 736 C. arabica accessions, we analyzed the genetic diversity in the species and its relationship with geographic distribution and historical records. We observed a weak population structure due to low-frequency derived alleles and highly negative values of Taijma’s D, suggesting a recent and severe bottleneck, most likely resulting from a single event of polyploidization, not only for the cultivated germplasm but also for the entire species. This conclusion is strongly supported by forward simulations of mutation accumulation. However, PCA revealed a cline of genetic diversity reflecting a west-to-east geographical distribution from the center of origin in East Africa to the Arabian Peninsula. The extremely low levels of variation observed in the species, as a consequence of the polyploidization event, make the exploitation of diversity within the species for breeding purposes less interesting than in most crop species and stress the need for introgression of new variability from the diploid progenitors.


Gene prediction
The raw gene prediction produced a total of 101,644 genes. As described in material and methods, we applied several filters to reduce the number of false positive predictions. 67.9% of the genes passed the quality filters, while 1.9% of the genes were discarded because too short (length < 300 bp), 1.9% because of a low evidence support and 28.7% because with a low ab-initio support. Genes with a low ab-initio support were further processed and searched for similarity against C. canephora proteome allowing the rescue of further 9,351 genes. After the last step with PASA, for UTRs and alternative splicing prediction, the final dataset had a total of 78,311 protein-coding genes and 93,078 transcripts. Similarity search against NR database shows that 97.1% of the transcripts present at least a significant match, suggesting that the filters applied to the gene predictions allowed to eliminate the majority of wrong predictions. Using Blast2GO, we were able to assign 5,841 distinct gene ontology terms to 82% of the transcripts: 48.3% of the terms belonged to biological process ontology, 39.2% to molecular function and 12.5% to cellular component. Predicted genes cover 92.4% of the plant orthologs set of BUSCO, suggesting a high completeness of the prediction.
To remove the redundancy due to assembly strategy, we performed a sequence clustering using CD-HIT. We obtained a final number of 46,562 clusters, 21,254 containing as cluster representative sequence a gene assigned to the canephora parental genome, 22,888 with a representative gene assigned to the eugenioides parental genome, while 2,420 with a gene that was not assigned to any parental genome. We found that 91.4% of the clusters are composed by genes from the same parental genomes, while only 8.6% of the clusters are composed by genes from the two genomes, indicating that the identity threshold used was able to discriminate between genes from the two parental genomes. Finally, 62.5% of the clusters are singleton, i.e. they are composed by just one sequence, while 20% are composed by two sequences. The number of clusters containing more than 5 sequences is only 1.3%.

Genetic diversity structuring in Coffea Arabica
After running STRUCTURE on the whole dataset as presented in the paper, we run STRUCTURE only on G1 and assigned the color codes on the PCA according to the new ancestry assignments. This further analysis revealed a clear structuration of the Ethiopian germplasm ( Figure S2, Figure S3A, Figure Figure   S3B, Table S3). We propose to rename the G2 population as the 'Harar-Yemen' group as it included Ethiopian germplasm from the eastern region around the city of Harar and all the Yemeni varieties (Fig. 2F). The population G1B included all accessions from the Sheka forest, 91% of the accessions collected from Mizan-Teferi and 92% of those collected from Teppi (Table S3), around a rainforest area that is located approximately 200 km west of the city of Jimma. We hence propose to call this population "Sheka".
We found that all accessions collected around Jimma at the sites of Agaro, Bonga, the Didessa wildlife sanctuary, Shebe, Gera and Wush-Wush were assigned to the G1A population (Table S3). Some accessions that were naturally found further east across the Rift valley in the region of Sidamo (Yirga Cheffe) were also grouped into G1A. Therefore, population G1A includes germplasm dispersed over a vast region around Jimma delimited northwards by the Diddessa wildlife sanctuary and southwards by Maji, Shebe, Bonga, Wush-Wush, and the Bonga forest. In this mountainous area, the natural forest has been interspersed with patches of small coffee plantations since the time the surveys were made. The diversity of habitats included forest edges, agricultural fields, graze land and swamps. We propose to call this population 'Jimma-Bonga'.
We also classified individuals within the three populations according to the plant material category assigned by FAO and ORSTOM reports ( Figure S4). The categories 'Forest', 'Garden', and 'Intensive plantation' were adopted by the FAO (1964FAO ( -1965 and ORSTOM (1966) surveys to refer to material collected in the wilderness, around human settlements, or in large scale plantations. In this study we considered that 'wild' accessions originated from forest and garden at the time of the survey except if the provenance of the seeds was clearly cited, as for example was the case of some accessions for which the authors mentioned 'Harar' or 'Kenya'. At the time of the surveys, 92% of the 'Harar-Yemen' population (G2) was classified as 'intensive plantation' by the botanists, whereas only 15% of the Jimma-Bonga population (G1A) and no accessions of the Sheka population (G1B) were categorized as such. As much as 63% of the Sheka population and 27% for the Jimma-Bonga population were previously classified as forest accessions ( Figure S4).

Fig. S1. 16-mer analysis of paired-end WGS sequences.
On the x axis the k-mers coverage, on the y axis the abundance of k-mers. When the two subgenomes are different they provide the expected coverage at ~54X, while identical 16-mers from homoeologous regions provide a double coverage at ~108X. Forest' correspond to categories adopted by the FAO (1964FAO ( -1965 and the ORSTOM (1966) surveys. Intensive plantations refer to coffee cultivated after land clearing with systematic soil preparation and seedling planting and managed in order to maximize the volume of production and productivity. Garden represents coffee grown in smallholdings under a few shade trees usually combined with other crops and fruit trees. Forest represents coffee grown in forests (or semi-forested) areas where eventually farmers slash weeds, lianas and competing shrubs, thin forest trees and fill open spaces with local seedlings (Labouisse and Kotecha 2008). In this study we considered that the 'wild' accessions came from forest and garden at the time of the survey except if the provenance of the seeds was mentioned as was the case for example for some accessions for which the authors mentioned 'Harar' or 'Kenya'. Figure S5. Identification of the optimal threshold of sequence identity to collapse duplicates using the CD-HIT software. DNA coding sequences were clustered using CD-HIT. CD-HIT was initially run with the default clustering percentage identity of 0.9. Genes are classified by CD-HIT as representative or non-representative within each cluster. In order to identify a threshold for collapsing redundant gene sequences of the same homoeolog while separating inter-homoeologous gene sequences, we plotted for each bin of sequence identity (%) the number of sequences in each cluster consistent with the subgenome assignment of the representative gene of that cluster (in blue) versus the number of sequences in each cluster inconsistent with the subgenome assignment of the representative gene of that cluster (in orange). We then identified empirically the percentage of identity (0.9961) representing a tradeoff between sensitivity and specificity.