Contrasting evolutionary genome dynamics between domesticated and wild yeasts

Structural rearrangements have long been recognized as an important source of genetic variation with implications in phenotypic diversity and disease, yet their detailed evolutionary dynamics remain elusive. Here, we use long-read sequencing to generate end-to-end genome assemblies for 12 strains representing major subpopulations of the partially domesticated yeast Saccharomyces cerevisiae and its wild relative Saccharomyces paradoxus. These population-level high-quality genomes with comprehensive annotation allow for the first time a precise definition of chromosomal boundaries between cores and subtelomeres and a high-resolution view of evolutionary genome dynamics. In chromosomal cores, S. paradoxus exhibits faster accumulation of balanced rearrangements (inversions, reciprocal translocations and transpositions) whereas S. cerevisiae accumulates unbalanced rearrangements (novel insertions, deletions and duplications) more rapidly. In subtelomeres, both species show extensive interchromosomal reshuffling, with a higher tempo in S. cerevisiae. Such striking contrasts between wild and domesticated yeasts likely reflect the influence of human activities on structural genome evolution.


Introduction
Understanding how genetic variation translates into phenotypic diversity is a central theme in biology. With the rapid advancement of sequencing technology, genetic variation in large natural populations has been extensively explored for humans and several model organisms1-9. However, our current knowledge of natural genetic variation is heavily biased towards single nucleotide variants (SNVs). Large-scale structural variants (SVs) such as inversions, reciprocal translocations, transpositions, novel insertions, deletions, and duplications are much less well characterized due to technical difficulties in detecting them using short-read sequencing data. This is a critical problem to address given that SVs often account for a substantial fraction of genetic variation and can have significant implications in adaptation, speciation and disease susceptibility10-12.
The long-read sequencing technologies from Pacific Biosciences (PacBio) and Oxford Nanopore offer powerful tools for high-quality genome assembly13. Their recent applications provided highly continuous genome assemblies with many complex regions correctly resolved, even for large mammalian genomes14, 15. This is especially important in characterizing SVs, which are frequently embedded in complex regions. For example, eukaryotic subtelomeres, which profoundly contribute to genetic and phenotypic diversity, are known hotspots of SVs due to rampant ectopic sequence reshuffling16-19.
The baker's yeast Saccharomyces cerevisiae is a leading biological model system with great economic importance in agriculture and industry. Discoveries in S. cerevisiae have illuminated almost every aspect of molecular biology and genetics. It is the first eukaryote to have its genome sequence, population genomics and genotype-phenotype map extensively explored1,20,21. Here, we applied PacBio sequencing to 12 representative strains of S. cerevisiae and its wild relative Saccharomyces paradoxus and revealed striking interspecific contrasts in structural dynamics across their genomic landscapes. This is the first study in eukaryotes that brings long-read sequencing technologies to the field of population genomics and studies genome evolution using multiple reference-quality genome sequences. nuclear and mitochondrial genomes exhibited compelling completeness and accuracy, with most chromosomes assembled into single contigs and highly complex regions accurately assembled ( Supplementary Fig. 1). After manual gap filling and Illumina-read-based error correction (See Methods), we obtained end-to-end assemblies for almost all the 192 chromosomes, with only the rDNA array on chromosome XII (chrXII) and 26 of 384 (6.8%) chromosome-ends remaining not fully assembled. We estimate that only 45-202 base-level sequencing errors remain across each 12 Mb nuclear genome (Supplementary Tables 3-4). For each assembly, we annotated centromeres, protein-coding genes, tRNAs, Ty retrotransposable elements, core X-elements, Y'-elements and mitochondrial RNA genes (Supplementary Tables 5-7). Chromosomes were named according to their encompassed centromeres.
When evaluated against the current S. cerevisiae and S. paradoxus reference genomes, our PacBio assemblies of the same strains (S288C and CBS432 respectively) show clean collinearity for both nuclear and mitochondrial genomes (Figs. 1a-b), with only a few discrepancies at finer scales actually caused by assembly problems in the reference genomes. For example, we found five non-reference Ty1 insertions on chrIII in our S288c assembly (Fig. 1a, inset), which were corroborated by previous studies22-24 as well as our own long-range PCR amplifications. Likewise, we found a mis-assembly on chrIV (Fig. 1b, inset) in the S. paradoxus reference genome, which were confirmed by both Illumina and Sanger reads1. Moreover, we checked several known cases of copy number variants (CNVs) (e.g. Y'-elements25, the CUP1 locus6 and ARR6 gene clusters) and SVs (e.g. those in the Malaysian S. cerevisiae UWOPS03-461.426) and they were all correctly recaptured in our assemblies.
The final assembly sizes of these 12 strains range from 11.73 to 12.14 Mb for the nuclear genome (excluding rDNA gaps) and from 69.95 to 85.79 kb for the mitochondrial genome (Fig. 1c-d and Supplementary Tables 8-9). The Ty and Y'-element abundance substantially contributed to the nuclear genome size differences ( Fig. 1c and Supplementary Table 8). For example, we observed strain-specific enrichment of full-length Ty1 in S. cerevisiae S288C, Ty4 in S. paradoxus UFRJ50816 and Ty5 in S. paradoxus CBS432, whereas no full-length Ty was found in S. cerevisiae UWOPS03-461.4 (Supplementary Table 6). Similarly, >30 copies of the Y'-element were found in S. cerevisiae SK1 but none in S. paradoxus N44 (Supplementary Table 5). Mitochondrial genome size variation is heavily shaped by the presence/absence dynamics of group I and group II introns in COB1, COX1 and rnl ( Fig. 1d and Supplementary Tables 9-10). Despite large-scale interchromosomal rearrangements in a few strains (S. cerevisiae UWOPS03-461.4, S. paradoxus UFRJ50816 and S. paradoxus UWOPS91-917.1), the 12 strains all maintained 16 nuclear chromosomes.

Molecular evolutionary rate and diversification timescale
To gauge structural dynamics in a well-defined evolutionary context, we performed phylogenetic analysis for the 12 strains and six Saccharomyces sensu stricto outgroups based on 4,717 one-to-one orthologs of nuclear protein-coding genes (Supplementary Data Set 1). The resulting phylogeny is consistent with our prior knowledge about these strains (Fig. 1e). Analyzing this phylogenetic tree, we found the entire S. cerevisiae lineage to have evolved faster than the S. paradoxus lineage as indicated by the overall longer branch from the common ancestor of the two species to each tip of the tree (Fig. 1e). We confirmed such rate differences by Tajima's relative rate test27 for all S. cerevisiae versus S. paradoxus strain pairs, using S. mikatae as the outgroup (P < 1x10 -5 for all pairwise comparisons). In contrast, molecular dating analysis reveals that the cumulative diversification time for the five S. paradoxus strains is 3.87-fold of that for the seven S. cerevisiae strains, suggesting a much longer time span for accumulating species-specific genetic changes in the former lineage ( Supplementary Fig. 2a). This timescale difference is further supported by the synonymous substitution rate (dS) (Supplementary Fig. 2b).

Core-subtelomere chromosome partitioning
Conceptually, linear nuclear chromosomes can be partitioned into internal chromosomal cores, interstitial subtelomeres and terminal chromosome-ends. However, their precise boundaries are challenging to demarcate without a rigid subtelomere definition. Here, we propose an explicit way to pinpoint yeast subtelomeres based on multi-genome comparison, which can be further applied to other eukaryotic organisms. For each subtelomere, we located its proximal boundary based on the sudden loss of synteny conservation and demarcated its distal boundary by the telomere-associated core X-and Y'-elements (See Methods; Supplementary Fig. 3). The partitioning for the left arm of chrI is illustrated in Fig. 2a. The strict gene synteny conservation is lost after GDH3, thus marking the boundary between the core and the subtelomere for this chromosome arm (Fig. 2a). All chromosomal cores, subtelomeres, and 358 out of 384 chromosome-ends across the 12 strains could be defined in this way (Supplementary Tables 11-13 and Supplementary Data Sets 2-3). For the remaining 26 chromosome-ends, both X/Y'-elements and telomeric repeats (TG 1-3 ) are missing. We assigned the orthology of subtelomeres from different strains based on the ancestral chromosomal identity of their flanking chromosomal cores (see Methods). Here, we used Arabic numbers to denote such ancestral chromosomal identities and the associated subtelomeres, which takes into account the large-scale interchromosomal rearrangements having occurred in some strains (Supplementary Fig. 4 and Supplementary Table 12). Such accurately assigned subtelomere orthology, together with explicit chromosome partitioning, allows for an in-depth examination of subtelomeric evolutionary dynamics.
Our analysis captures distinct properties of chromosomal cores and subtelomeres. All previously defined essential genes in S. cerevisiae S288C28 fell into the chromosomal cores, whereas all previously described subtelomeric duplication blocks in S288C (see URLs ) were fully enclosed in our defined S288C subtelomeres. Furthermore, the genes from our defined subtelomeres show 36.6-fold higher level of CNV accumulation than those from the cores (one-sided Mann-Whitney U test, P < 2.2x10 -16 ) (Figs. 2b-c). When only considering oneto-one orthologs, the subtelomeric genes show 8.4-fold higher level of gene order loss (GOL)29-31 than their core counterparts (one-sided Mann-Whitney U test, P < 2.2x10 -16 ) (Figs. 2d-e). Additionally, subtelomeric one-to-one orthologs also show significantly higher nonsynonymous-to-synonymous substitution rate ratio (dN/dS) than those from the cores in the S. cerevisiae-S. cerevisiae and S. cerevisiae-S. paradoxus comparisons (one-sided Mann-Whitney U test, P < 2.2x10 -16 ), although no clear trend was found in the S. paradoxus-S. paradoxus comparison (one-sided Mann-Whitney U test, P = 0.936). These observations fit well with known properties of cores and subtelomeres and provide the first quantitative assessment of the core-subtelomere contrasts in genome dynamics. Interestingly, aside from such core-subtelomere contrasts, we also observed clear interspecific differences in all three measurements. S. cerevisiae strains show faster CNV accumulation (one-sided Mann-Whitney U test; P = 6.7x10 -5 for cores, P = 5.1x10 -5 for subtelomeres) and more rapid GOL (one-sided Mann-Whitney U test, P = 5.5x10 -5 for cores and P = 2.6x10 -5 for subtelomeres) than S. paradoxus strains in both core and subtelomeres respectively ( Fig. 2c and 2e). Similarly, S. cerevisiae subtelomeric genes also show higher dN/dS than their S. paradoxus counterparts (one-sided Mann-Whitney U test, P = 4.3x10 -4 ), although their core genes appear to have similar dN/dS (one-sided Mann-Whitney U test, P = 1.000). These observations collectively suggest accelerated evolution in S. cerevisiae relative to S. paradoxus, especially in subtelomeres.

Structural rearrangements in chromosomal cores
Structural rearrangements can be balanced (e.g. inversions, reciprocal translocations, and transpositions) or unbalanced (e.g. large-scale novel insertions, deletions, and duplications) depending on whether the copy number of genetic material is affected10. We identified 35 balanced rearrangements in total, including 28 inversions, six reciprocal translocations, and one massive rearrangement (Fig. 3a, Supplementary Figs. 5a-c, Supplementary Data Set 4). All events occurred during the species-specific diversification of the two species, with 29 events occurring in S. paradoxus and only six in S. cerevisiae. Factoring in the cumulative evolutionary time difference, S. paradoxus still shows 1.25-fold faster accumulation of balanced rearrangements than S. cerevisiae. Six inversions are tightly packed into a ~200 kb region on chrVII of the South American S. paradoxus UFRJ50816, indicating a strainspecific inversion hotspot (Fig. 3b). With regard to interchromosomal rearrangements, six of them are reciprocal translocations that occurred in two S. paradoxus strains ( Fig. 3c and Supplementary Figs. 5a-b). The remaining one found in the Malaysian S. cerevisiae UWOPS03-461.4 is particularly striking: chrVII, chrVIII, chrX, chrXI, and chrXIII were heavily reshuffled, confirming recent chromosomal contact data26 ( Fig. 3c and Supplementary Fig. 5c). We describe this as a "massive rearrangement" because it cannot be explained by typical independent reciprocal translocations. This is more likely to result from a single catastrophic event resembling the chromothripsis observed in tumor cells32,33. This massive rearrangement in the Malaysian S. cerevisiae and the rapid accumulation of inversions and translocations in the South American S. paradoxus resulted in extensively altered genome configurations, which explain the reproductive isolation of these two lineages34, 35. As previously observed in yeasts on larger divergence scales36,37, the breakpoints of those balanced rearrangements are associated with tRNAs and Tys, highlighting the roles of these elements in triggering genome instability and suggesting nonallelic homologous recombination (NAHR) as the mutational mechanism.
Considering unbalanced structural rearrangements in chromosomal cores, we identified seven novel insertions, 32 deletions, four dispersed duplications and at least seven tandem duplications ( Fig. 3a and Supplementary Data Set 5). There are two additional cases of which the evolutionary history cannot be confidently determined due to potentially multiple independent origins or secondary deletions (Supplementary Data Set 5). Although this is a conservative estimate, our identified unbalanced structural rearrangements clearly outnumbered the balanced ones, as recently reported in Lachancea yeasts38. We found that S. cerevisiae accumulated as many unbalanced rearrangements as S. paradoxus despite its much shorter cumulative diversification time. We noticed that the breakpoints of these unbalanced rearrangements (except for tandem duplications) were also frequently associated with Tys and tRNAs, mirroring our observation for balanced rearrangements. Finally, we found genes involved in unbalanced rearrangements to be significantly enriched for gene ontology (GO) terms related to the binding, transporting and detoxification of metal ions (e.g. Na + , K + , Cd 2+ and Cu 2+ ) (Supplementary Table 14), hinting that these events likely are adaptive.

Structural evolutionary dynamics of subtelomeres
The complete assemblies and well-defined subtelomere boundaries enabled us to examine subtelomeric regions with unprecedented resolution. We found both the size and gene content of the subtelomere to be highly variable across different strains and chromosome arms ( Fig. 4a and Supplementary Data Set 3). The subtelomere size ranges from 0.13 to 76 kb (median = 15.6 kb) while the number of genes enclosed in each subtelomere varies between 0 and 19 (median = 4) and the total number of subtelomeric genes varied between 134-169 (median = 146) per strain. While the very short subtelomeres (e.g. chr04-R and chr11-L) can be explained by an unexpected high degree of synteny conservation extending all the way to the end, some exceptionally long subtelomeres are instead the products of multiple mechanisms. For example, the chr15-R subtelomere of S. cerevisiae DBVPG6765 has been drastically elongated by a 65 kb horizontal gene transfer (HGT)39 ( Fig. 4b and Supplementary Fig. 6a). The chr07-R subtelomere of S. paradoxus CBS432 was extended by a series of tandem duplications of MAL31-like and MAL33-like genes, as well as the addition of the ARR cluster ( Fig. 4c and Supplementary Fig. 6b). The chr15-L subtelomere of S. paradoxus UFRJ50816 increased size by duplications of subtelomeric segments from two other chromosomes ( Fig. 4d and Supplementary Fig. 6c). Inversions have also occurred in subtelomeres, including one affecting the HMRA1-HMRA2 locus in UFRJ50816 and another affecting a MAL11-like gene in CBS432 ( Supplementary Fig. 7).
The enrichment of segmental duplication blocks occurring via ectopic sequence reshuffling is a common feature of eukaryotic subtelomeres, however, incomplete genome assemblies have prevented population-level quantitative analysis of this phenomenon. Here, we identified subtelomeric duplication blocks based on pairwise comparisons of different subtelomeres within the same strain ( Fig. 5a and Supplementary Data Set 6). In total, we identified 173 pairs of subtelomeric duplication blocks across the 12 strains, with 8-26 pairs for each strain (Supplementary Table 15). Among the 16 pairs of subtelomeric duplication blocks previously identified in S288C (mentioned above), all the 12 larger pairs passed our filtering criteria. Interestingly, the Hawaiian S. paradoxus UWOPS91-917.1 has the most subtelomeric duplication blocks and half of these are strain-specific, suggesting unique subtelomere evolution in this strain. The duplicated segments always maintained the same centromere-telomere orientation, supporting a mutational mechanism of double-strand break (DSB) repair as previously suggested in other species40,41. We further summarized those 173 pairs of duplication blocks based on the orthologous subtelomeres involved. This led to 75 unique duplicated subtelomere pairs, 59 (78.7%) of which are new compared to what was previously identified in S288C (Supplementary Data Set 7). We found 31 (41.3%) of these unique pairs to be shared between strains, or even between species with highly dynamic strain-sharing patterns ( Fig. 5b and Supplementary Fig. 8a). Most (87.1%) of this sharing pattern could not be explained by the strain phylogeny (Supplementary Data Set 7). This suggests a constant gain and loss process of subtelomeric duplications throughout evolutionary history.
Given the rampant subtelomere reshuffling, we investigated to what extent the similarity in orthologous subtelomere composition reflects the intra-species phylogenies. We measured the proportion of conserved orthologous subtelomeres in all strain pairs within the same species and performed hierarchical clustering accordingly (Fig. 5c). While the clustering in S. paradoxus correctly recapitulated the true phylogeny, the clustering in S. cerevisiae revealed a quite different topology, with only the relationship of the most recently diversified strain pair (DBVPG6044 vs. SK1) being correctly recovered. Interestingly, the distantly related Wine/European (DBVPG6765) and Sake (Y12) S. cerevisiae strains were clustered together, suggesting possible convergent subtelomere evolution during their respective domestication for alcoholic beverage production. The proportion of conserved orthologous subtelomeres between S. cerevisiae strains (56.3%-81.3%) is comparable to that between S. paradoxus strains (50.0%-81.3%), despite the much smaller diversification timescales of S. cerevisiae. This translates into a 3.8-fold difference in subtelomeric reshuffling intensity between the two species during their respective diversifications (one-sided Mann-Whitney U test, P = 2.93x10 -8 ) (Fig. 5d). The frequent reshuffling of subtelomeric sequences often has drastic impacts on gene content both qualitatively and quantitatively. For example, four genes (PAU3, ADH7, RDS1, and AAD3) were lost in S. cerevisiae Y12 due to a single chr08-L to chr03-R subtelomeric duplication event ( Supplementary Fig. 8b). Therefore, the accelerated subtelomere reshuffling in S. cerevisiae is likely to have important functional implications.

Native non-canonical chromosome-end structures
S. cerevisiae chromosome-ends are characterized by two telomere associated sequences: the core X-and the Y'-element42. The core X-element is present in nearly all chromosomeends, whereas the number of Y'-element varies across chromosome-ends and strains. The two previously described chromosome-end structures are (1) with a single core X-element and (2) with a single core X-element followed by 1-4 distal Y'-elements42. S. paradoxus chromosome-ends also contain core X-and Y'-elements43, but their detailed structures and genome-wide distributions have not been systematically characterized. Across our 12 strains, most (~85%) chromosome-ends have one of the two structures described above but we also discovered several novel chromosome-ends (Supplementary Table 13). We found several examples of tandem duplications of the core X-element in both species. In most cases, including the ones in the S. cerevisiae reference genome (chrVIII-L and chrXVI-R), the proximal duplicated core X-elements were degenerated. Nevertheless, we found two examples where intact duplicated copies were retained: the chrXII-R in S. cerevisiae Y12 and the chrIII-L in S. paradoxus CBS432. The latter case is especially striking, with six core X-elements (including three complete copies) tandemly arranged. Surprisingly, we discovered five chromosome-ends consisting of only Y'-elements (one or more copies) but no core X-elements. This is unexpected given the importance of core X-elements in maintaining genome stability44,45. The discovery of these non-canonical chromosome-end structures offers a new paradigm to investigate the functional role of core X-elements.

Mitochondrial genome evolution
Despite being highly repetitive and AT-rich, we found the mitochondrial genomes of all S. cerevisiae strains show high degrees of collinearity (Fig. 6a). In contrast, S. paradoxus mitochondrial genomes show lineage-specific structural rearrangements. The two Eurasian strains (CBS432 and N44) share a transposition of the entire COX3-rnpB-rns segment, in which rns was further inverted (Fig. 6b-d). In addition, given the gene order in two outgroups, the COB gene was relocated in the S. cerevisiae-S. paradoxus common ancestor (Fig. 6e). The phylogenetic tree inferred from mitochondrial protein-coding genes show clear deviation from the nuclear tree (Fig. 6e). In particular, the Eurasian S. paradoxus lineage (CBS432 and N44) clustered together with the seven S. cerevisiae strains before joining with the other S. paradoxus strains, which supports the idea of mitochondrial introgression from S. cerevisiae46 (Fig. 6e). We found low topology consensus (normalized quartet score = 0.59 versus 0.92 for the nuclear gene tree) across different mitochondrial gene loci, suggesting heterogeneous phylogenetic histories. Together with the drastically dynamic presence/absence patterns of mitochondrial group I and group II introns (Supplementary Table 10), this reinforces the argument for extensive cross-strain recombination in yeast mitochondrial evolution47. In addition, we noticed that the COX3 gene in S. paradoxus UFRJ50816 and UWOPS91-917.1 started with GTG rather than the typical ATG start codon, which was further supported by Illumina reads. This suggests either an adoption of an alternative ATG start codon nearby (e.g. 45 bp downstream) or a rare case of near-cognate start codon48-50.

Fully-resolved SVs illuminate complex phenotypic traits
SVs are expected to account for a substantial fraction of phenotypic variation, therefore fully resolved SVs can be crucial in understanding complex phenotypic traits. Here, we used the copper-tolerance related CUP1 locus and the arsenic-tolerance related ARR cluster as two examples of associations between fully-characterized genomic compositions (i.e. copy numbers and genotypes) and conditional growth rates. The PacBio assemblies precisely resolve these complex loci and phenotype associations are consistent with previous findings based on copy number analysis6,21,51 (Fig. 7a-d and Supplementary Note). We further illustrated their phenotypic contributions via linkage mapping using 826 phased outbred lines (POLs) derived from crossing the North American (YPS128) and West African (DBVPG6044) S. cerevisiae52 (see Methods). The linkage analysis accurately mapped a large-effect quantitative trait locus (QTL) at the chr03-R subtelomere (the location of ARRs in DBVPG6044), but showed no arsenic resistance association with the YPS128 ARRs on the chr16-R subtelomere (Fig. 7e). This profile is consistent with the relocation of an active ARR cluster to the chr03-R subtelomere in DBVPG6044 and the presence of deleterious mutations predicted to inactivate the ARR cluster in YPS1286,35. Thus, a full understanding of the relationship between genome sequence and arsenic resistance phenotype is not provided by the knowledge of copy number alone, but rather requires the combined knowledge of genotype, genomic location, and copy number as provided by our end-to-end assemblies (Fig. 7f).

Discussion
The landscape of genetic variation is shaped by multiple evolutionary processes, including mutation, drift, recombination, gene flow, natural selection and demographic history. The combined effect of these factors can vary considerably both across the genome and between species, resulting in different patterns of evolutionary dynamics. The complete genome assemblies that we generated for multiple strains from both domesticated and wild yeasts provide a unique dataset for exploring such patterns with unprecedented resolution.
Considering the evolutionary dynamics across the genome, eukaryotic subtelomeres are exceptionally variable compared to chromosomal cores40,53,54, with accelerated evolution manifested by extensive CNV accumulation, rampant ectopic reshuffling, and rapid functional divergence6,41,55-57. Our study provides the first quantitative comparison between subtelomeres and cores in structural genome evolution and a high-resolution view of the extreme evolutionary plasticity of subtelomeres. This rapid evolution of subtelomeres can substantially alter the gene repertoire and generate novel recombinants with adaptive potential57. Given that subtelomeric genes are highly enriched in functions mediating interactions with external environments (e.g. stress response, nutrient uptake, and ion transport)6,55,58, it is tempting to speculate that the accelerated subtelomeric evolution reflects selection for evolvability, i.e. the ability to respond and adapt to changing environments59.
With regard to the genome dynamics between species, external factors such as selection and demographic history play important roles. The ecological niches and recent evolutionary history of S. cerevisiae have been intimately associated with human activities, with many strains isolated from human-associated environments, like breweries, bakeries and even clinical patients60. Consequently, this wide-spectrum of selection schemes could significantly shape the genome evolution of S. cerevisiae. In addition, human activities also promoted admixture and crossbreeding of S. cerevisiae strains from different geographical locations and ecological niches61, resulting in many mosaic strains with mixed genetic backgrounds1. In contrast, the wild-living S. paradoxus occupies very limited ecological niches, with most strains isolated from trees in the Quercus genus62. S. paradoxus strains from different geographical subpopulations are genetically well-differentiated with partial reproductive isolations34,63. Such interspecific differences in their life history could result in distinct evolutionary genome dynamics, which is captured in our study (Fig. 8). In chromosomal cores, S. cerevisiae strains show slower accumulation of balanced structural rearrangements compared with S. paradoxus strains. This pattern might be explained by the admixture between different S. cerevisiae subpopulations during their recent association with human activities, which would considerably impede the fixation of balanced structural rearrangements. In contrast, geographical isolation of different S. paradoxus subpopulations would favor relatively fast fixation of balanced structural rearrangements64. We observed an opposite pattern for unbalanced rearrangements in chromosomal cores. The S. cerevisiae strains accumulate such changes more rapidly than their S. paradoxus counterparts, which is likely shaped by selection considering the biological functions of those affected genes. Likewise, the more rapid subtelomeric reshuffling and higher dN/dS of subtelomeric genes in S. cerevisiae than in S. paradoxus are probably also driven by selection. As a consequence of such unbalanced rearrangements and subtelomeric reshuffling, we observed more rapid CNV accumulation and GOL in S. cerevisiae strains, which reinforce this argument. In addition, the mitochondrial genomes of S. cerevisiae strains maintained high degrees of collinearity, whereas those of S. paradoxus strains showed lineage-specific structural rearrangements and introgression, suggesting distinct modes of mitochondrial evolution. Taken together, many of these observed differences between S. cerevisiae and S. paradoxus likely reflect the influence of human activities on structural genome evolution, which shed new light on why S. cerevisiae, but not its wild relative, is one of our most biotechnologically important organisms.

Strain sampling, preparation, and DNA extraction
Based on previous population genomics surveys1, we sampled seven S. cerevisiae and five S. paradoxus strains (all in the haploid or homozygous diploid forms) to represent major evolutionary lineages of the two species (Supplementary Table1). The reference strains for S. cerevisiae (S288C) and S. paradoxus (CBS432) were included for quality control. All strains were taken from our strain collection stored at -80°C and cultured on YPD plates. A single colony for each strain was picked and cultured in 5 mL YPD liquid at 30°C 220 rpm overnight. The DNA extraction was carried out using the MasterPure™ Yeast DNA Purification Kit (Epicentre, USA).

PacBio sequencing and raw assembly
The sequencing center at the Wellcome Trust Sanger Institute (Cambridge, UK) performed library preparation and sequencing using the PacBio Single Molecule, Real-Time (SMRT) DNA sequencing technology (platform: PacBio RS II; chemistry: P4-C2 for the pilot phase and P6-C4 for the main phase). The raw reads were processed using the standard SMRT analysis pipeline (v2.3.0). The de novo assembly was carried out following the hierarchical genome-assembly process (HGAP) assembly protocol with Quiver polishing65.

Assembly evaluation and manual refinement
We retrieved the reference genomes (Supplementary Note) for both species to assess the quality of our PacBio assemblies. For each polished PacBio assembly, we first used RepeatMasker (v4.0.5) (see URLs ) to soft-mask repetitive regions (option: -species fungi -xsmall -gff). The soft-masked assemblies were subsequently aligned to the reference genomes using the nucmer program from MUMmer (v3.23)66 for chromosome assignment. For most chromosomes, we have single contigs covering the entire chromosomes. For the cases where internal assembly gaps occurred, we performed manual gap closing by consulting the assemblies generated in the pilot phase of this project. The only gap that we were unable to close is the highly repetitive rDNA array (usually consisting 100-200 copies of 9.1 kb unit) on chrXII. The S. cerevisiae reference genome used a 17,357 bp sequence of two tandemly arranged rDNA copies to represent this complex region. For our assemblies, we trimmed off the partially assembled rDNAs around this gap and re-linked the two contigs with 17,357 bp Ns to keep consistency. The mitochondrial genomes of the 12 strains were recovered by single contigs in the raw HGAP assemblies. We further circularized them and reset their starting position as the ATP6 gene using Circlator (v1.1.4)67. The circularized mitochondrial genome assemblies were further checked by consulting the raw PacBio reads and manual adjustment was applied when necessary.

Illumina sequencing, reads mapping, and error correction
In addition to the PacBio sequencing, we also performed Illumina 151 bp paired-end sequencing for each strain at Institut Curie (Paris, France). We examined the raw Illumina reads via FastQC (v0.

Base-level error rate estimation for the final PacBio assemblies
Eight of our 12 strains have previously been sequenced using Illumina technology with moderate-to-high depths6. We retrieved those raw reads and mapped them to our PacBio assemblies (both before and after the Pilon correction) following the same protocol described above. The SNPs and Indels were called by FreeBayes (v1.0.1-2)73 (option: -p 1) to assess the performance of the Pilon correction and estimate the remaining base-level error rate in our final assemblies. The raw SNP and Indel calls were filtered by the vcffilter tool from vcflib (see URLs

Assembly completeness evaluation
We compared our S288C PacBio assembly with three published S. cerevisiae assemblies generated by different sequencing technologies (i.e. PacBio, Oxford Nanopore and Illumina)74,75. We aligned these three assemblies as well as our S288C PacBio assembly to the S. cerevisiae reference genome using nucmer from MUMmer (v3.23)66. The nucmer alignments were filtered by delta-filter (from the same package) (option: -1). We converted the output file to the "BED" format and used bedtools (v2.15.0)76 to calculate the intersection between our genome alignment and various annotation features (e.g. chromosomes, genes, retrotransposable elements, telomeres, etc) of the S. cerevisiae nuclear reference genome. The percent coverages of these annotation features by different assemblies were summarized accordingly.

Annotation of the protein-coding genes, tRNA genes, and other genomic features
For nuclear genomes, we assembled an integrative pipeline that combines three existing annotation tools to form an evidence-leveraged protein-coding gene annotation. First, we used the RATT package77 for directly transferring the non-dubious S. cerevisiae reference gene annotations to our PacBio assemblies based on whole genome alignments. Furthermore, we used the Yeast Genome Annotation Pipeline (YGAP)78 to annotate our PacBio assemblies (default options without scaffolds reordering) based on gene sequence homology and synteny conservation. A custom Perl script was used to remove redundant, truncated, or frameshifted genes annotated by YGAP. Finally, we used the Maker pipeline (v2.31.8)79 to perform de novo gene discovery with EST/protein alignment support (Supplementary Note). As a by-product, tRNA genes were also annotated via the tRNAscan-SE (v1.3.1)80 module of the Maker pipeline. The gene annotations produced by RATT, YGAP, and Maker together with the EST/protein alignment evidences generated by Maker were further leveraged by EVidenceModeler (EVM)81 to form an integrative annotation. Manual curation was carried out for selected cases (e.g. the CUP1 and ARR clusters) and pseudogenes were manually labeled when verified. The same pipeline was used for upgrading the protein-coding gene annotation of S. arboricolus, for which the originally annotated coding sequences (CDSs) and protein sequences was used for initial EST/protein alignment. In addition, for the 12 strains, we systematically annotated other genomic features encoded in their nuclear genomes, such as centromeres, Ty retrotransposable elements, and telomere-associated core X-and Y'-elements (Supplementary Note). Proteincoding genes that overlap with truncated/full-length Tys, core X-or Y'-elements were removed from our final annotation.
As for mitochondrial genomes, the protein-coding genes, tRNA genes and other mitochondrial RNA genes such as RNase P RNA (rnpB), small (rns) and large (rnl) subunit rRNA were annotated by MFannot (see URLs ). The exon-intron boundaries of annotated mitochondrial genes were manually curated based on BLAST and the 12-way mitochondrial genome alignment generated by mVISTA82.

Phylogenetic reconstruction
For nuclear genes, we performed the phylogenetic analysis based on those one-to-one orthologs that are shared across all 18 strains (seven S. cerevisiae + five S. paradoxus + six outgroups) using two complementary approaches: the concatenated tree approach and the consensus tree approach. For each one-to-one ortholog, we used MUSCLE (v3.8.1551)85 to align protein sequences and used PAL2NAL (v14)86 to align codons accordingly. For the concatenated tree approach, we generated a concatenated codon alignment across all orthology groups and fed it into RAxML (v8.2.6)87 for maximum likelihood (ML) tree building. Alignment partition was configured by the first, second, and third codon positions. The GTRGAMMA model was used for phylogenetic inference. The rapid bootstrapping method built in RAxML was used to assess the stability of internal nodes (option: -# 100). The final ML tree was visualized in FigTree (v1.4.2) (see URLs ). For the consensus tree approach, we built individual gene trees with RAxML using the same method described above, which were further summarized into a coalescent-based consensus species tree by ASTRAL (v4.7.12)88. The normalized quartet score was calculated to assess the reliability of the final species tree given individual gene trees. For mitochondrial genes, we performed the same phylogenetic analysis based on the eight mitochondrial protein-coding genes.

Relative rate test
To test the rate heterogeneity between S. cerevisiae and S. paradoxus in molecular evolution, we constructed three-way sequence alignments by sampling one strain for each species together with S. mikatae as the outgroup. The sequences were drawn from the concatenated nuclear CDS alignment described above. The extracted sequences were fed into MEGA (v7.0.16)89 for Tajima's relative rate test27. We conducted this test for all possible S. cerevisiae versus S. paradoxus strain pairs.

Molecular dating
Since no yeast fossil record can be used for reliable calibration, we performed the molecular dating analysis based on a relative time scale. We used the phylogenetic tree constructed from the nuclear one-to-one orthologs as the input and performed least-square based fast dating with LSD90 (options: -c -v -s). We specified S. bayanus var. uvarum CBS7001 and S. eubayanus FM1318 as outgroups for this analysis.

Conserved synteny block identification
We used SynChro from the CHROnicle package (version: January 2015)91,92 to identify conserved synteny blocks. We prepared the input files for SynChro with custom Perl scripts to provide the genomic coordinates of all annotated features together with the genome assembly and proteome sequences. SynChro subsequently performed exhaustive pairwise comparisons to identify synteny blocks shared in the given strain pair.

Subtelomere definition and chromosome partitioning
An often-used yeast subtelomere definition is 20-30 kb from the chromosome-ends. However, this definition is arbitrary in the sense that it treats all subtelomeres indiscriminately. In this study, we defined yeast subtelomeres based on gene synteny conservation profiles across the 12 strains. For each chromosome arm, we examined all syntenic blocks shared across the 12 strains and used the most distal one to define the distal boundary for the chromosomal core (Supplementary Table 11). Meanwhile, we defined the proximal boundary of the chromosome-end for this chromosome arm based on the first occurrence of core X-or Y'-elements. The region between these two boundaries was defined as the subtelomere for this chromosome arm with 400 bp interstitial transition zones on both sides ( Supplementary Fig. 3).
Given that some strains (i.e. UWOPS03-461.4, UFRJ50816, and UWOPS91-917.1) are involved in large-scale interchromosomal rearrangements, the current chromosomal identities (determined by centromeres) might not necessarily agree with the ancestral chromosomal identities (determined by gene contents). Therefore, we used Roman and Arabic numbers to respectively denote these two identities for all 12 strains to avoid potential confusion when it comes to those interchromosomal rearrangements (Supplementary Fig. 4 and Supplementary Table 12). Each defined subtelomere was named according to the ancestral chromosomal identity of its flanking chromosomal core and denoted also using Arabic numbers (Supplementary Data Sets 2-3).

Identification of balanced and unbalanced structural rearrangements in chromosomal cores
To identify balanced rearrangements, we first used ReChro from CHROnicle (version: January 2015)91,92. We set the synteny block stringency parameter "delta=1" for the main analysis. A complementary run was performed with "delta=0" to identify single gene inversions. Alternatively, we started with the one-to-one ortholog gene pairs (identified by our orthology group identification) in chromosomal cores between any given strain pair and examined their relative orientation and chromosomal locations. If the two one-to-one orthologous genes locate on the same chromosome but with opposite orientations, an inversion should be involved. If they reside on different chromosomes, a translocation or transposition should be involved.
As for unbalanced rearrangements, we first generated whole genome alignment for every strain pair by nucmer66 (options: -maxmatch -c 500) and used Assemblytics93 to identify potential insertions, deletions and duplications/contractions. All candidates were further intersected with our gene annotations by bedtools intersect76 to only keep those encompassing at least one protein-coding gene. Alternatively, we started with all the genes enclosed in chromosomal cores of any given strain pair and filtered out those completely covered by unique genome alignment between this strain pair. All the remaining genes were classified as candidates potentially involved in unbalanced rearrangements.
All identified candidate cases were manually examined by dotplots using Gepard (v1.30)94. All verified rearrangements in chromosomal cores were further mapped to the phylogeny of the 12 strains to reconstruct their evolutionary histories based on the maximum parsimony principle. The corresponding genomic regions in those six outgroups were also checked by dotplots to provide further support for our evolutionary history inferences.

Gene ontology analysis
The CDSs of the S. cerevisiae non-dubious reference genes were BLAST against the NCBI non-redundant (nr) database using blastx (E-value = 1x10 -3 ) and further annotated by BLAST2GO (v.3.2)95,96 to generate gene ontology (GO) mapping for each gene. We performed Fisher's exact test97 to detect significantly enriched GO terms of our test gene set relative to the genome-wide background. False discovery rate (FDR) (cutoff: 0.05)98 was used for multiple correction. The significantly enriched GO terms were further processed by the "Reduce to most specific terms" function implemented in BLAST2GO to only keep child terms.

Molecular evolutionary rates, CNV accumulation, and GOL estimation
For the one-to-one orthologs in each strain pair, we calculate synonymous substitution rate (dS), nonsynonymous substitution rate (dN) and nonsynonymous-to-synonymous substitution rate ratio (dN/dS) using the yn00 program from PAML (v4.8a)99 based on Yang & Nielsen (2000) model100. We also measured the proportion of genes involved in CNVs (i.e. those are not one-to-one orthologs) in any strain pair. We denoted this measurement as P CNVs , a quantity analogous to the P-distance in sequence comparison. To correct for multiple changes at the same gene loci, the Poisson distance D CNVs can be given by −ln (1 − P CNVS ). This value can be further adjusted with evolutionary time by dividing 2T, where T is the diversification time of the two compared strains obtained from our molecular dating analysis. To further capture evolutionary dynamics in terms of gene order changes, we further measured gene order loss (GOL) for those one-to-one orthologs using the method proposed by previously studies without allowing for intervening genes29-31. For GOL, we performed similar Poisson correction and evolutionary time adjustment as we did for CNV accumulation. The calculation values for dN/dS, CNV accumulation, and GOL were further summarized by "core genes" and "subtelomeric genes" based on our genome partitioning described above.

Subtelomeric homology search
For each defined subtelomeric region, we hard-masked all the enclosed Ty-related features (i.e. full-length Ty, truncated Ty and Ty solo-LTRs) and then searched against all the other subtelomeric regions for shared sequence homology. The search was performed by BLAT101 (options: -noHead -stepSize=5 -repMatch=2253 -minIdentity=80 -t=dna -q=dna -mask=lower -qMask=lower). We used pslCDnaFilter (options: -minId=0.9 -minAlnSize=1000 -bestOverlap -filterWeirdOverlapped) to filter out trivial signals and used pslScore to calculate sequence alignment scores for those filtered BLAT matches. Since the two reciprocal scores obtained from the same subtelomere pair are not symmetrical (depending on which sequence was used as the query), we took their arithmetic mean in our analysis. Such subtelomeric homology search was carried out for both within-strain and cross-strain comparisons and subtelomere pairs with strong sequence homology (BLAT alignment score >= 5000 and sequence identity >= 90%) were recorded.

Hierarchical clustering analysis and reshuffling rate calculation for orthologous subtelomeres
For all the strains within the same species, we performed pairwise comparisons of their subtelomeric regions to identify conserved orthologous subtelomeres in any given strain pairs based on homology search described above. For each strain pair, the proportion of conserved orthologous subtelomeres was calculated as a measurement of the overall subtelomere conservation between the two strains. Such measurements were converted into a distance matrix by the dist() function in R (v3.1)102, based on which the hclust() function was further used for hierarchical clustering. We gauged the reshuffling intensity of orthologous subtelomeres in a way similar to what we did for measuring CNV accumulation and GOL. For any given strain pair, we first calculated the proportion of the non-conserved orthologous subtelomeres in this strain pair as P reshuffling and then applied the Poisson correction and evolutionary time adjustment by −ln (1 − P reshuffling )/2T, in which T is the diversification time of the two compared strains.

Phenotyping the growth rates of yeast strains in copper-and arsenite-rich medium
The homozygous diploid versions of the 12 strains were pre-cultured in Synthetic Complete (SC) medium overnight to saturation. To examine their conditional growth rates in copperand arsenite-rich environment, we mixed 350 µl conditional media (CuCl 2 (0.38 mM) and arsenite (As[III], 3 mM) for the two environment respectively) with 10 µl saturated culture to the wells of Honeycomb plates. Oxygen permeable films were placed on top of the plates to enable a uniform oxygen distribution throughout the plate. The automatic screening was done with Bioscreen Analyser C (Thermic Labsystems Oy, Finland) at 30°C for 72 hours, measuring in 20 minute intervals using a wide-band filter at 420-580 nm103. Growth data pre-processing and phenotypic trait extraction were performed by PRECOG104.

Linkage analysis in diploid S. cerevisiae hybrids
A total of 826 phased outbred lines (POLs) were constructed and phenotyped in the same fashion as previously described52. Briefly, advanced intercrossed lines (AILs) were generated by successive rounds of mating and sporulation from the YPS128 and DBVPG6044 strains105.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material. (a) Dotplot for the comparison between the S. cerevisiae reference genome (strain S288C; X-axis) and our S288C PacBio assembly (Y-axis). Sequence homology signals were depicted in red (forward match) or blue (reverse match). The two insets show the zoomed-in comparison for chromosome III (chrIII) and the mitochondrial genome (chrmt) respectively.   (a) An example of subtelomeric duplication blocks shared among the chr01-L, chr01-R and chr08-R subtelomeres in S. cerevisiae S288C. The grey blocks denote their shared homologous regions with >= 90% sequence identity. (b) Subtelomeric duplication signals shared across the seven S. cerevisiae strains (left) and the five S. paradoxus strains (right).    The interspecific contrasts in nuclear chromosomal cores, subtelomeres and mitochondrial genomes were summarized respectively.