Genome evolution across 1,011 Saccharomyces cerevisiae isolates

Large-scale population genomic surveys are essential to explore the phenotypic diversity of natural populations. Here we report the whole-genome sequencing and phenotyping of 1,011 Saccharomyces cerevisiae isolates, which together provide an accurate evolutionary picture of the genomic variants that shape the species-wide phenotypic landscape of this yeast. Genomic analyses support a single ‘out-of-China’ origin for this species, followed by several independent domestication events. Although domesticated isolates exhibit high variation in ploidy, aneuploidy and genome content, genome evolution in wild isolates is mainly driven by the accumulation of single nucleotide polymorphisms. A common feature is the extensive loss of heterozygosity, which represents an essential source of inter-individual variation in this mainly asexual species. Most of the single nucleotide polymorphisms, including experimentally identified functional polymorphisms, are present at very low frequencies. The largest numbers of variants identified by genome-wide association are copy-number changes, which have a greater phenotypic effect than do single nucleotide polymorphisms. This resource will guide future population genomics and genotype–phenotype studies in this classic model system.


Figure S6. Narrow-sense heritability of fitness traits. The genome-wide heritability (GW heritability) ranges from 0.47 to 0.9 with an average of 0.69. Growth conditions are described in Table S2. See supplementary Table 21 for confidence intervals.

Nevertheless, one subclade is enriched in wild strains (two-sided Mann-Whitney test p-value = 3.6e-4). This subclade contains 8 strains from wine-related sources (6 from grape must, 1 from wine, and 1 from grapes), and 10 from "domesticated nature" sources, i.e. fir, plum, apple, apricot, and pear (trees and fruits) (n=18).
Twelve of these eighteen harbour a partial region C (ranging from ORF 2 to 6). The most divergent subclade (1.4) is composed mainly of Georgian wine strains isolated from wine conserved in amphorae (n=39). This subclade is sufficiently divergent that ADMIXTURE described it as a separate clade (see Fig. S8). The Caucasus is thought to be the birthplace of winemaking (~6000 years ago) and these isolates might represent the closest current relatives of the original pool of wine-domesticated S. cerevisiae strains.
Americas at the bottom of the phylogenetic tree. Those similarities are also captured by the DAPC analysis on the genomic content difference matrix, which creates a single cluster for all these strains (Fig. S7).

06.A -African beer (n=20). This lineage is characterised by high ploidy. It is one of the two clades with a median ploidy greater than 2 (median 3.5, max 5). South African Kaffir beer strains have variable ploidy (1, 2 or 3) while bili bili (Chad) strains have ploidies of 3 or 4. The highest ploidy is 5 among the pearl millet beer isolates from Ivory Coast. This clade also includes 3 African bakery strains (all diploids) that form a small subgroup and 3 natural strains (two from the mainland are tetraploid, one from Madagascar is diploid).
The production of these African beverages does not rely on commercial starter strains but occurs by spontaneous fermentation. The lineage appears to have undergone a strong process of specialization, losing several genes and becoming the lineage with the second lowest number of unique ORFs (median 5973). ADMIXTURE results suggest that this is a clean lineage.

07.M -Mosaic beer (n=21).
This group consists mainly of beer strains but also includes a few strains with different ecological origins. Both the DAPC and ADMIXTURE analyses suggest this is a mosaic clade with shared ancestry common to several other mosaic strains, especially those belonging to the group M3 and similar to the ones from Brazilian bioethanol and other mosaic groups M1 and M2. The ploidy in this clade is low compared to the other two beer clusters (median 2). The clade was already described previously 1 (see Beer 2 clade) but its mosaic component was not evident.
M2.M -Mosaic region 2 (n=20). More than half (n=12) of these mosaic single-branch strains are natural isolates and include a cluster of strains isolated from the Israeli Evolution Canyon (n=8) and a smaller cluster of cider isolates (n=3). Both PCA and ADMIXTURE indicate that all isolates are mosaic.

08.M -Mixed origin (n=72). A large cluster of strains isolated from a wide array of ecological sources.
This group includes the majority of the baking isolates (n=23 out of 38) as well as clinical (11), and wild isolates (26). A total of 10% of the beer strains in our collection belong to this clade (6 out of 59).
ADMIXTURE indicates a largely clean clade closely related to the Ale beer lineage. The clade has been described previously 1.

09.M -Mexican agave (n=7).
These strains were isolated from artisanal Agave fermentation for Mezcal production in Tamaulipas, Mexico. We detected in this group the highest number of unique ORFs per genome, due to a massive amount of S. paradoxus introgressions largely maintained in a heterozygous state. The ADMIXTURE results suggest that they originated from an outcross between Wine/European and French Guiana strains, but we were unable to confirm this conclusively with additional SNP analysis approaches.

10.F -French Guiana human (n=31). These strains were isolated from a remote village of Wayampi Amerindians in French Guiana. Peculiarly, these strains were mostly isolated from the stool of healthy people, with a few from plants and animals associated with this indigenous human population. This clade also has a large number of introgressed S. paradoxus genomic blocks driving a large genome content difference. SNP analysis using ADMIXTURE suggests a clean lineage without contribution from other ancestries.

17.T -Taiwanese (n=3).
These strains were isolated from Taiwanese forest soil. They now represent the most diverged S. cerevisiae clade ever described. This lineage has an average difference of 1.1% in nucleotide composition compared to the other isolates, which is more than twice the typical distance between different clades (0.5%) and exceeds the divergence of CHN I (0.8%), which was hitherto the most divergent known lineage.
Isolation of mutants (mutation breeding) and generation of hybrids (cross breeding) were the approaches used to generate isolates for sake production. Only a limited number of individuals were used during domestication, leading to the observed low level of genetic diversity.

18.F -Far East
Interestingly, one Japanese sake strain (CMG) carries 14 ORFs introgressed from S. mikatae, consistent with their geographic overlap. This event represents the only large example of introgression other than those from S. paradoxus. Notably, S. mikatae is the next most closely related species after S. paradoxus, perhaps suggesting that the level of sequence divergence plays a role in the generation or fate of genomic introgressions.
9). Based on sequence, three classes of plasmids were previously described 5 and we obtained their relative frequencies in the population. Class A is the most common form (n=463) and is also present in the reference strain S288C, while class C is a much less common form (n=26 . 17), perhaps the same that also has transferred the 2µ plasmid. We used the sequence coverage to infer the plasmid copy numbers, which revealed extreme variation across isolates and plasmid types (Supplementary fig. 9).
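The text does not state how sequence coverage was converted into plasmid copy number. As a minimal sketch, assuming the copy number is approximated by the ratio of median read depth on the plasmid to median read depth on the nuclear genome, scaled by ploidy, the calculation could look like this. The function name, the `ploidy` parameter and the depth values are hypothetical and are not taken from the study.

```python
from statistics import median


def relative_copy_number(plasmid_depths, nuclear_depths, ploidy=1):
    """Estimate plasmid copies per cell from per-site read depths.

    Assumption (not from the source): copy number is the ratio of median
    plasmid depth to median nuclear depth, multiplied by the ploidy so
    that the result is expressed per cell rather than per haploid genome.
    """
    nuclear = median(nuclear_depths)
    if nuclear <= 0:
        raise ValueError("nuclear depth must be positive")
    return ploidy * median(plasmid_depths) / nuclear


# Hypothetical toy depths, for illustration only.
plasmid = [100, 120, 110]
nuclear = [10, 10, 10]
print(relative_copy_number(plasmid, nuclear))            # per haploid genome
print(relative_copy_number(plasmid, nuclear, ploidy=2))  # per diploid cell
```

Medians are used here instead of means because they are less sensitive to localized coverage spikes, for example in repeated regions.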

Supplementary Note 3 -Detailed description of introgression and horizontal gene transfer events
In total, 913 introgressed ORFs were identified, with 885 coming from S. paradoxus. Interestingly, introgressed ORFs are rare in the highly diverged lineages, consistent with secondary contacts with S. paradoxus occurring mainly after the out-of-China dispersal (Fig. 2). No ORFs can be traced to S. kudriavzevii or S. eubayanus, despite hybrids with S. cerevisiae being frequently described. These results imply that introgression from more divergent species either does not occur because of higher sequence divergence or is selected against because of biological incompatibilities.
The amount of introgressed content is highly variable between the different clades (Supplementary fig. 18).
Massive enrichment was found in the alpechin, bioethanol, Mexican agave and French Guiana subpopulations (two-sided Mann-Whitney-Wilcoxon test p-value = 9.46e-46), i.e. human-associated niches where the two species might coexist and which consequently represent interspecific hybrid zones. As mentioned, there is a striking match between the geographic origins of the four S. cerevisiae clades and the ancestry of the introgressed S. paradoxus material. Introgressions from the American S. paradoxus subpopulation were found in the French Guiana, Brazilian bioethanol and Mexican agave clades, whereas the alpechin lineage, mainly isolated from Europe, carries introgressions from the European S. paradoxus.
We also identified 6 large HGT events (regions A-F) that account for a total size of ~500 kb, with 3 events of  fig. 21b).
The region A was detected in 44 strains, 36 of which belong to the Wine/European clade. In addition, a smaller version lacking the terminal two ORFs was found in two additional strains (CHE and CHF).
For region B, acquired from Zygosaccharomyces bailii sensu lato, the pattern is more complex. We detected a single strain (BMD) that harbours an extremely large event (117 kb versus the 17 kb originally described) with at least 22 additional ORFs. These additional ORFs are syntenic to a region present in Z. bailii on chromosome II (Supplementary fig. 22a) and are contiguous in the BMD de novo assembly, supporting the view that this extended region B represents the ancestral event before size reduction by multiple independent deletions. The 5 ORFs originally described are the most common in the population, but additional ORFs from the ancestral event can be retained. Remnants of the ancient region B are present in 575 strains, 166 of which carry only one ORF, and they are not restricted to the Wine/European clade.
A similar scenario is observed for the 65 kb region C transferred from Torulaspora microellipsoides and initially described in EC1118 7. We collected multiple lines of evidence indicating a larger ancestral event followed by deletions leading to size reduction. A single strain isolated from the Carlsberg brewery contains the long ancestral event of ~165 kb with at least 41 ORFs. Small relics of this larger event are detected in 186 additional strains. Recently, two strains were described carrying ORFs located at the two extremities of the ancestral event we detected (Supplementary fig. 22b) 8.
The three newly identified regions (D, E, F) all show clear signs of size reductions (Supplementary fig. 21).
Region D consists of 16 ORFs spanning 54 kb with high sequence identity (~95%) to Torulaspora delbrueckii. This block is syntenic to T. delbrueckii chromosome 3, except for one extremity, which aligns to the right subtelomere of chromosome 8 (Supplementary fig. 22c). Region E (17 ORFs, 70 kb) and region F (10 ORFs, 50 kb) could not be traced to a clear donor yeast species, although the best identity matches are with Saccharomycetaceae genera. Of the three strains in which region E has been identified, one lacks three terminal ORFs, while four of the six strains containing region F lack five terminal ORFs (Supplementary fig. 21b).
We also identified at least 46 ORFs found isolated or in very small clusters in single isolates, which we refer to as candidate HGT events. We detected an event encompassing 6 ORFs in a South American isolate (ALI).
In addition to the HGT events coming from yeast species, a handful of inter-kingdom HGT events have been detected. Two ORFs from bacteria were previously characterized in the S288C reference: YLR157C, present in 88 strains, and YOL164W, present in 114 strains. In addition, we found 3 ORFs of likely bacterial origin (found in A, 27 and 228 strains) and one ORF, found in 3 strains, that is related to the viral yeast killer protein M28 found in S. paradoxus (protein identity of 82%), which is known to be integrated in some isolates, conferring a killer phenotype. These three strains belong to separate clades (African beer, mosaic and African palm wine) but share an African origin.

Supplementary Note 4 -Timing estimation of major events in the evolutionary history of yeast
Given that no well-defined fossil record is available for yeasts to be used for calibration-based methods, we performed our molecular dating analysis using a molecular-clock-based method based on previous studies 9,10.
These previous studies used synonymous substitution sites as a proxy for sequences under neutral evolution and inferred the divergence time between different strains accordingly by assuming a strict molecular clock.
Here, we restricted the analysis to the 4-fold degenerate (4D) sites to further minimize the confounding effect introduced by natural selection (e.g. selection for biased codon usage). Based on the 41-way CDS alignments that we constructed for the phylogenetic analysis shown in Figure 2a and two independent estimates 9,11 of the spontaneous mutation rate per base pair (bp) per generation, $\mu = 1.84 \times 10^{-10}$ and $\mu = 1.67 \times 10^{-10}$, we can estimate the minimal bound for the divergence time between strain A and strain B using the formula $T = (d_{JC}/\mu)/G$. In this way, we obtained tentative estimates for the timing of the S. cerevisiae - S. paradoxus speciation, the S. cerevisiae out-of-China event and different S. cerevisiae domestication events as follows. Our estimates regarding the evolutionary history of the sake and wine lineages are consistent with previous estimates using different datasets 9,10. More precise estimates could be obtained in the future when lineage-specific generation time and mutation rate data become available.
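The dating formula above, T = (d/µ)/G, can be illustrated with a short sketch. It rests on two assumptions that the text does not state explicitly: that the distance d is a Jukes-Cantor corrected distance computed from the proportion of differing 4D sites, and that G is the number of generations per unit of time. The value of G and the proportion of differing sites used below are hypothetical placeholders, not values from the study. Only the two mutation rates come from the text.

```python
import math


def jukes_cantor_distance(p):
    """Jukes-Cantor (JC69) distance from the proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


def divergence_time(d_jc, mu, generations_per_time_unit):
    """Apply T = (d_JC / mu) / G as written in the text.

    Assumption: G is interpreted as generations per unit of time, so T is
    expressed in that same unit.
    """
    return (d_jc / mu) / generations_per_time_unit


# Hypothetical inputs, for illustration only.
p_diff = 0.01   # proportion of differing 4D sites (placeholder)
G = 2920        # generations per year (placeholder assumption)

d = jukes_cantor_distance(p_diff)
for mu in (1.84e-10, 1.67e-10):  # mutation rates quoted in the text
    print(f"mu={mu:.2e}  T={divergence_time(d, mu, G):.3e}")
```

Because T scales inversely with both µ and G, the two mutation rate estimates give a range for T, and any uncertainty in the generation time carries over proportionally.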