Main

Understanding how genetic variation translates into phenotypic diversity is a central theme in biology. With the rapid advancement of sequencing technology, genetic variation in large natural populations has been explored extensively for humans and several model organisms1,2,3,4,5,6,7,8,9. However, current knowledge of natural genetic variation is heavily biased toward single nucleotide variants (SNVs). Large-scale structural variants (SVs) such as inversions, reciprocal translocations, transpositions, novel insertions, deletions and duplications are not as well characterized owing to technical difficulties in detecting them with short-read sequencing data. This is a critical problem to address given that SVs often account for a substantial fraction of genetic variation and can have significant implications in adaptation, speciation and disease susceptibility10,11,12.

The long-read sequencing technologies from Pacific Biosciences (PacBio) and Oxford Nanopore offer powerful tools for high-quality genome assembly13. Their recent applications provided highly continuous genome assemblies with many complex regions correctly resolved, even for large mammalian genomes14,15. This is especially important in characterizing SVs, which are frequently embedded in complex regions. For example, eukaryotic subtelomeres, which contribute to genetic and phenotypic diversity, are known hot spots of SVs due to rampant ectopic sequence reshuffling16,17,18,19.

Baker's yeast, S. cerevisiae, is a leading biological model system with great economic importance in agriculture and industry. Discoveries in S. cerevisiae have helped shed light on almost every aspect of molecular biology and genetics. It was the first eukaryote to have its genome sequence, population genomics and genotype–phenotype map extensively explored1,20,21. Here we applied PacBio sequencing to 12 representative strains of S. cerevisiae or its wild relative S. paradoxus and identified notable interspecific contrasts in structural dynamics across their genomic landscapes. This study brings long-read sequencing technologies to the field of population genomics, studying genome evolution using multiple reference-quality genome sequences.

Results

End-to-end population-level genome assemblies

We applied deep PacBio (100×–300×) and Illumina (200×–500×) sequencing to seven S. cerevisiae and five S. paradoxus strains representing evolutionarily distinct subpopulations of both species1,6 (Supplementary Tables 1 and 2). The raw PacBio de novo assemblies of both nuclear and mitochondrial genomes showed compelling completeness and accuracy, with most chromosomes assembled into single contigs, and highly complex regions accurately assembled (Supplementary Fig. 1). After manual gap filling and Illumina-read-based error correction (Online Methods), we obtained end-to-end assemblies for almost all the 192 chromosomes, with only the rDNA array on chromosome XII and 26 of 384 (6.8%) chromosome ends remaining not fully assembled. We estimate that only 45–202 base-level sequencing errors remain across each 12-Mb nuclear genome (Supplementary Tables 3 and 4). For each assembly, we annotated centromeres, protein-coding genes, tRNAs, Ty retrotransposable elements, core X elements, Y′ elements and mitochondrial RNA genes (Supplementary Tables 5–7). Chromosomes were named according to their encompassed centromeres.

When evaluated against the current S. cerevisiae and S. paradoxus reference genomes, our PacBio assemblies of the same strains (S288C and CBS432, respectively) show clean collinearity for both nuclear and mitochondrial genomes (Fig. 1a,b) with only a few discrepancies at finer scales, which were caused by assembly problems in the reference genomes. For example, we found five nonreference Ty1 insertions on chromosome III in our S288C assembly (Fig. 1a, inset), which were corroborated by previous studies22,23,24 as well as our own long-range PCR amplifications. Likewise, we found a misassembly on chromosome IV (Fig. 1b, inset) in the S. paradoxus reference genome, which was confirmed by Illumina and Sanger reads1. Moreover, we checked several known cases of copy number variants (CNVs) (for example, Y′ elements25, the CUP1 locus6 and ARR6 gene clusters) and SVs (for example, those in the Malaysian S. cerevisiae UWOPS03-461.4)26 and they were all correctly recaptured in our assemblies.

Figure 1: End-to-end genome assemblies and phylogenetic framework.

(a) Comparison of the S. cerevisiae reference genome (strain S288C) and our S288C PacBio assembly. Sequence homology signals are indicated in red (forward match) or blue (reverse match). Insets, zoomed-in comparisons for chromosome III (chrIII) and the mitochondrial genome (chrmt). Black arrows indicate Ty-containing regions missing in the S. cerevisiae reference genome. (b) Comparison of the S. paradoxus reference genome (strain CBS432) and our CBS432 PacBio assembly, color coded as in a. Insets, zoomed-in comparison for chromosome IV (chrIV) and chrmt. Black arrow indicates the misassembly on chromosome IV in the S. paradoxus reference genome. (c,d) Cumulative lengths of annotated genomic features relative to the overall assembly size of the nuclear (c) and mitochondrial genome (d). CDS, coding sequence. (e) Phylogenetic relationships of the seven S. cerevisiae strains (blue) and five S. paradoxus strains (red) sequenced in this study. Six strains from other closely related Saccharomyces species were used as outgroups. All internal nodes have 100% fast-bootstrap supports. Inset, detailed relationships of the S. cerevisiae strains.

The final assembly sizes of these 12 strains ranged from 11.73 to 12.14 Mb for the nuclear genome (excluding rDNA gaps) and from 69.95 to 85.79 kb for the mitochondrial genome (Fig. 1c,d and Supplementary Tables 8 and 9). The abundance of Ty and Y′ elements substantially contributed to the nuclear genome size differences (Fig. 1c and Supplementary Table 8). For example, we observed strain-specific enrichment of full-length Ty1 in S. cerevisiae S288C, Ty4 in S. paradoxus UFRJ50816 and Ty5 in S. paradoxus CBS432, whereas no full-length Ty was found in S. cerevisiae UWOPS03-461.4 (Supplementary Table 6). Similarly, >30 copies of the Y′ element were found in S. cerevisiae SK1 but none in S. paradoxus N44 (Supplementary Table 5). Mitochondrial genome size variation is heavily shaped by the presence or absence of group I and group II introns in COB1, COX1 and 21S rRNA (rnl) (Fig. 1d and Supplementary Tables 9 and 10). Despite large-scale interchromosomal rearrangements in a few strains (S. cerevisiae UWOPS03-461.4, S. paradoxus UFRJ50816 and S. paradoxus UWOPS91-917.1), all 12 strains maintained 16 nuclear chromosomes.

Molecular evolutionary rate and diversification timescale

To gauge structural dynamics in a well-defined evolutionary context, we performed phylogenetic analysis for the 12 strains and 6 Saccharomyces sensu stricto outgroups based on 4,717 one-to-one orthologs of nuclear protein-coding genes (Supplementary Data Set 1). The resulting phylogeny is consistent with our prior knowledge about these strains (Fig. 1e). Analyzing this phylogenetic tree, we found the entire S. cerevisiae lineage to have evolved faster than the S. paradoxus lineage, as indicated by the overall longer branch from the common ancestor of the two species to each tip of the tree (Fig. 1e). We confirmed such rate differences by Tajima's relative rate test27 for all S. cerevisiae–S. paradoxus strain pairs, using Saccharomyces mikatae as the outgroup (P < 1 × 10−5 for all pairwise comparisons). In contrast, molecular dating analysis shows that the cumulative diversification time for the five S. paradoxus strains was 3.87-fold that for the seven S. cerevisiae strains, suggesting a much longer time span for accumulating species-specific genetic changes in the former (Supplementary Fig. 2a). This timescale difference was further supported by the synonymous substitution rate (dS) (Supplementary Fig. 2b).
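As an illustration of the relative rate test mentioned above, the following minimal sketch computes Tajima's statistic for one ingroup pair and an outgroup. It assumes three pre-aligned nucleotide sequences of equal length and skips any column containing a gap or ambiguous base; the function name and the example sequences are hypothetical and are not taken from the study's pipeline.

```python
import math


def tajima_relative_rate(seq1, seq2, outgroup):
    """Tajima's relative rate test on three aligned sequences.

    m1 counts sites where only seq1 differs (seq2 == outgroup != seq1);
    m2 counts sites where only seq2 differs (seq1 == outgroup != seq2).
    Under equal rates, (m1 - m2)^2 / (m1 + m2) follows a chi-square
    distribution with 1 degree of freedom.
    """
    if not (len(seq1) == len(seq2) == len(outgroup)):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    m1 = m2 = 0
    for a, b, o in zip(seq1.upper(), seq2.upper(), outgroup.upper()):
        if a not in valid or b not in valid or o not in valid:
            continue  # skip gaps and ambiguous bases
        if b == o and a != o:
            m1 += 1
        elif a == o and b != o:
            m2 += 1
    if m1 + m2 == 0:
        return m1, m2, 0.0, 1.0  # no informative sites
    chi2 = (m1 - m2) ** 2 / (m1 + m2)
    # Survival function of the chi-square distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return m1, m2, chi2, p_value


# Toy alignment (hypothetical): seq1 carries five unique changes, seq2 one.
m1, m2, chi2, p = tajima_relative_rate("CCCCCAAAAA-", "AAAAAAAAACA", "AAAAAAAAAAA")
print(m1, m2, round(chi2, 3))
```

In a setting like the one described, seq1 and seq2 would be a S. cerevisiae and a S. paradoxus strain and the outgroup would be S. mikatae, with the test repeated for every strain pair.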

Core–subtelomere chromosome partitioning

Conceptually, linear nuclear chromosomes can be partitioned into internal chromosomal cores, interstitial subtelomeres and terminal chromosome ends. However, their precise boundaries are challenging to demarcate without a rigid subtelomere definition. Here we propose an explicit way to pinpoint yeast subtelomeres on the basis of multi-genome comparison, which can be further applied to other eukaryotic organisms. For each subtelomere, we located its proximal boundary on the basis of the sudden loss of synteny conservation and demarcated its distal boundary by the telomere-associated core X and Y′ elements (Online Methods and Supplementary Fig. 3). The partitioning for the left arm of chromosome I is illustrated in Figure 2a. The strict gene synteny conservation is lost after GDH3, thus marking the boundary between the core and the subtelomere for this chromosome arm (Fig. 2a). All chromosomal cores and subtelomeres and 358 out of 384 chromosome ends across the 12 strains could be defined in this way (Supplementary Tables 11–13 and Supplementary Data Sets 2 and 3). For the remaining 26 chromosome ends, X and Y′ elements and telomeric repeats (TG1–3) were missing. We assigned the orthology of subtelomeres from different strains on the basis of the ancestral chromosomal identity of their flanking chromosomal cores (Online Methods). Here we use Arabic numbers to denote such ancestral chromosomal identities and the associated subtelomeres, taking into account the large-scale interchromosomal rearrangements that have occurred in some strains (Supplementary Fig. 4 and Supplementary Table 12). Such accurately assigned subtelomere orthology, together with explicit chromosome partitioning, allows an in-depth examination of subtelomeric evolutionary dynamics.
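The idea of locating the proximal subtelomere boundary at the point where synteny conservation is lost can be sketched as follows. This is a simplified illustration and not the procedure from the Online Methods: it assumes that each strain's genes on one chromosome arm have already been mapped to ortholog identifiers and listed from a shared core anchor outward toward the telomere. All gene names other than GDH3 are hypothetical placeholders.

```python
def last_syntenic_gene(arm_gene_orders):
    """Return (index, gene) of the last gene shared at the same position by all strains.

    arm_gene_orders maps strain name -> list of ortholog IDs ordered from the
    chromosomal core toward the telomere. Returns None if even the first
    position is not conserved.
    """
    orders = list(arm_gene_orders.values())
    if not orders:
        raise ValueError("at least one strain is required")
    boundary = None
    # zip stops at the shortest list, so a strain that runs out of genes
    # also ends the conserved block.
    for i, genes in enumerate(zip(*orders)):
        if len(set(genes)) != 1:
            break  # synteny is lost at this position
        boundary = (i, genes[0])
    return boundary


# Hypothetical gene orders for a left chromosome arm; GDH3 closes the syntenic block.
orders = {
    "S288C": ["coreA", "coreB", "GDH3", "sub1", "sub2"],
    "SK1": ["coreA", "coreB", "GDH3", "sub9"],
    "CBS432": ["coreA", "coreB", "GDH3", "sub2", "sub1"],
}
print(last_syntenic_gene(orders))
```

A real implementation would need to tolerate strain-specific gene loss and annotation differences, which this strict position-by-position comparison does not.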

Figure 2: Explicit nuclear chromosome partitioning.

(a) Partitioning of the left arm of chromosome I into the core (green), subtelomere (yellow) and chromosome end (pink) based on synteny conservation and the yeast telomere-associated core X and Y′ elements. Cladogram (left) shows the phylogenetic relationships of the 12 strains; gene arrangement map (right) illustrates the syntenic conservation profile in both the core and subtelomeric regions. The names of genes within the syntenic block are underlined. (b,c) CNV accumulation (b) and CNV accumulation adjusted by diversification time (c) of strain pairs within S. cerevisiae (S.c.–S.c.), within S. paradoxus (S.p.–S.p.) and between the two species (S.c.–S.p.) (log10 scale). (d,e) GOL (d) and GOL adjusted by diversification time (e) of strain pairs. Center lines, median; boxes, interquartile range (IQR); whiskers, 1.5× IQR. Data points beyond the whiskers are outliers.

Our analysis captures distinct properties of chromosomal cores and subtelomeres. All previously defined essential genes in S. cerevisiae S288C28 fell into the chromosomal cores, whereas all previously described subtelomeric duplication blocks in S288C (http://www2.le.ac.uk/colleges/medbiopsych/research/gact/images/clusters-fixed-large.jpg) were fully enclosed in our defined S288C subtelomeres. Furthermore, the genes from our defined subtelomeres showed 36.6-fold higher CNV accumulation than those from the cores (one-sided Mann–Whitney U test, P < 2.2 × 10−16) (Fig. 2b,c). When considering only one-to-one orthologs, the subtelomeric genes showed 8.4-fold higher gene order loss (GOL)29,30,31 than their core counterparts (one-sided Mann–Whitney U test, P < 2.2 × 10−16) (Fig. 2d,e). Additionally, subtelomeric one-to-one orthologs showed a significantly higher nonsynonymous-to-synonymous substitution rate ratio (dN/dS) than those from the cores in the S. cerevisiae–S. cerevisiae and S. cerevisiae–S. paradoxus comparisons (one-sided Mann–Whitney U test, P < 2.2 × 10−16), although no clear trend was found in the S. paradoxus–S. paradoxus comparison (one-sided Mann–Whitney U test, P = 0.936). These observations fit well with known properties of cores and subtelomeres and provide the first quantitative assessment of the core–subtelomere contrasts in genome dynamics. Notably, aside from such core–subtelomere contrasts, we also observed clear interspecific differences in all three measurements. S. cerevisiae strains showed faster CNV accumulation (one-sided Mann–Whitney U test; P = 6.7 × 10−5 for cores, P = 5.1 × 10−5 for subtelomeres) and more rapid GOL (one-sided Mann–Whitney U test; P = 5.5 × 10−5 for cores, P = 2.6 × 10−5 for subtelomeres) than S. paradoxus strains in both cores and subtelomeres (Fig. 2c,e). Similarly, S. cerevisiae subtelomeric genes also showed higher dN/dS than their S. paradoxus counterparts (one-sided Mann–Whitney U test, P = 4.3 × 10−4), although their core genes appear to have similar dN/dS (one-sided Mann–Whitney U test, P = 1.000). These observations collectively suggest accelerated evolution in S. cerevisiae relative to S. paradoxus, especially in subtelomeres.
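For readers unfamiliar with the one-sided Mann–Whitney U test used in the comparisons above, a minimal sketch is given below. It assumes a plain normal approximation with no tie or continuity correction, so its P values will differ slightly from those of a full statistical package; the function name and the toy values are hypothetical and do not reproduce any result from the study.

```python
import math


def mann_whitney_u_greater(x, y):
    """One-sided Mann-Whitney U test of whether values in x tend to exceed those in y.

    Returns (U, p), where U counts the (x_i, y_j) pairs with x_i > y_j
    (ties count as 0.5) and p comes from a normal approximation without
    tie or continuity correction.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail probability
    return u, p


# Toy values (hypothetical): per-pair rates for subtelomeres versus cores.
subtelomere_rates = [5.0, 6.0, 7.0, 8.0]
core_rates = [1.0, 2.0, 3.0, 4.0]
u, p = mann_whitney_u_greater(subtelomere_rates, core_rates)
print(u, p < 0.05)
```

With samples as small as in this toy example, an exact test would be preferable to the normal approximation.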

Structural rearrangements in chromosomal cores

Structural rearrangements can be balanced (as with inversions, reciprocal translocations and transpositions) or unbalanced (as with large-scale novel insertions, deletions and duplications) depending on whether the copy number of genetic material is affected10. We identified 35 balanced rearrangements in total, including 28 inversions, 6 reciprocal translocations and 1 massive rearrangement (Fig. 3a, Supplementary Fig. 5a–c and Supplementary Data Set 4). All events occurred during the species-specific diversification of the two species, with 29 events occurring in S. paradoxus and only 6 in S. cerevisiae. Factoring in the cumulative evolutionary time difference, S. paradoxus still showed 1.25-fold faster accumulation of balanced rearrangements than S. cerevisiae. Six inversions were tightly packed into a 200-kb region on chromosome VII of South American S. paradoxus UFRJ50816, indicating a strain-specific inversion hot spot (Fig. 3b). With regard to interchromosomal rearrangements, six were reciprocal translocations that occurred in two S. paradoxus strains (Fig. 3c and Supplementary Fig. 5a,b). The remaining one, in the Malaysian S. cerevisiae UWOPS03-461.4, was particularly notable: chromosomes VII, VIII, X, XI and XIII were heavily reshuffled, confirming recent chromosomal contact data26 (Fig. 3c and Supplementary Fig. 5c). We describe this as a massive rearrangement because it cannot be explained by typical independent reciprocal translocations but is more likely to result from a single catastrophic event resembling the chromothripsis observed in tumor cells32,33. This massive rearrangement in the Malaysian S. cerevisiae and the rapid accumulation of inversions and translocations in the South American S. paradoxus resulted in extensively altered genome configurations, explaining the reproductive isolation of these two lineages34,35. 
As previously observed in yeasts on larger divergence scales36,37, the breakpoints of those balanced rearrangements are associated with tRNAs and Tys, highlighting the roles of these elements in triggering genome instability and suggesting nonallelic homologous recombination as the mutational mechanism.
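The 1.25-fold figure quoted above can be reproduced from numbers given in the text: the event counts (29 versus 6) and the 3.87-fold longer cumulative diversification time of S. paradoxus. The short sketch below shows that arithmetic; it assumes that the time adjustment is a simple division of the count ratio by the time ratio, which matches the reported value but may not be the exact procedure used, and the function name is hypothetical.

```python
def time_adjusted_fold(events_a, events_b, time_ratio_a_over_b):
    """Ratio of per-unit-time event accumulation in lineage A versus lineage B."""
    if events_b <= 0 or time_ratio_a_over_b <= 0:
        raise ValueError("events_b and time_ratio_a_over_b must be positive")
    return (events_a / events_b) / time_ratio_a_over_b


# 29 balanced rearrangements in S. paradoxus versus 6 in S. cerevisiae,
# with S. paradoxus having 3.87-fold the cumulative diversification time.
fold = time_adjusted_fold(29, 6, 3.87)
print(round(fold, 2))  # 1.25
```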

Figure 3: Structural rearrangements in the nuclear chromosomal cores.

(a) Balanced (left) and unbalanced (right) structural rearrangements occurred along the evolutionary history of the 12 strains. IV, inversion; TL, translocation; MR, massive rearrangement; IS, insertion; DL, deletion; DD, dispersed duplication; TD, tandem duplication. (b) The six clustered inversions on chromosome VII (chrVII) of the S. paradoxus strain UFRJ50816; highlighted region (top) is shown in zoomed-in plot (bottom). (c) Genome organization of UWOPS03-461.4, UFRJ50816 and UWOPS91-917.1 relative to that of S288C, which is free from large-scale interchromosomal rearrangements. White diamonds indicate positions of centromeres. Different colors are used to differentiate gene contents in different ancestral S. cerevisiae chromosomes.

Considering unbalanced structural rearrangements in chromosomal cores, we identified 7 novel insertions, 32 deletions, 4 dispersed duplications and at least 7 tandem duplications (Fig. 3a and Supplementary Data Set 5). There were two additional cases of which the evolutionary history could not be confidently determined owing to multiple potential independent origins or secondary deletions (Supplementary Data Set 5). Although this is a conservative estimate, our identified unbalanced structural rearrangements clearly outnumbered the balanced ones, as recently reported in Lachancea yeasts38. We found that S. cerevisiae accumulated as many unbalanced rearrangements as S. paradoxus despite its much shorter cumulative diversification time. We noticed that the breakpoints of these unbalanced rearrangements (except for tandem duplications) were also frequently associated with Tys and tRNAs, mirroring our observation for balanced rearrangements. Finally, we found genes involved in unbalanced rearrangements to be significantly enriched for Gene Ontology (GO) terms related to the binding, transporting and detoxification of metal ions (for example, Na+, K+, Cd2+ and Cu2+) (Supplementary Table 14), hinting that these events are probably adaptive.

Structural evolutionary dynamics of subtelomeres

The complete assemblies and well-defined subtelomere boundaries enabled us to examine subtelomeric regions with unprecedented resolution. We found both the size and gene content of the subtelomere to be highly variable across different strains and chromosome arms (Fig. 4a and Supplementary Data Set 3). The subtelomere size ranged from 0.13 to 76 kb (median = 15.6 kb), the number of genes enclosed in each subtelomere varied between 0 and 19 (median = 4), and the total number of subtelomeric genes varied between 134 and 169 (median = 146) per strain. Whereas the very short subtelomeres (for example, chromosome 04-R and chromosome 11-L) can be explained by an unexpectedly high degree of synteny conservation extending all the way to the end, some exceptionally long subtelomeres are the products of multiple mechanisms. For example, the chromosome 15-R subtelomere of S. cerevisiae DBVPG6765 has been drastically elongated by a 65-kb horizontal gene transfer (HGT)39 (Fig. 4b and Supplementary Fig. 6a). The chromosome 07-R subtelomere of S. paradoxus CBS432 was extended by a series of tandem duplications of MAL31-like and MAL33-like genes, as well as the addition of the ARR cluster (Fig. 4c and Supplementary Fig. 6b). The chromosome 15-L subtelomere of S. paradoxus UFRJ50816 increased in size through duplications of subtelomeric segments from two other chromosomes (Fig. 4d and Supplementary Fig. 6c). Inversions have also occurred in subtelomeres, including one affecting the HMRA1–HMRA2 locus in UFRJ50816 and another affecting a MAL11-like gene in CBS432 (Supplementary Fig. 7).

Figure 4: Subtelomere size plasticity and structural rearrangements.

(a) Size variation of the 32 orthologous subtelomeres across the 12 strains. (b) Chromosome 15-R (chr15-R) subtelomere comparison between S. cerevisiae DBVPG6765 and S288C. The extended DBVPG6765 chr15-R subtelomere is explained by a eukaryote-to-eukaryote HGT event39. (c) Chromosome 07-R (chr07-R) subtelomere comparison between S. paradoxus CBS432 and N44. The chr07-R subtelomere expansion in CBS432 is explained by a series of tandem duplications of the MAL31-like and MAL33-like genes and an addition of the ARR-containing segment from the ancestral chromosome 16-R subtelomere. (d) Chromosome 15-L subtelomere comparison between S. paradoxus UFRJ50816 and YPS138. The expanded chromosome 15-L subtelomere in UFRJ50816 is explained by the relocated subtelomeric segments from the ancestral chromosome 10-L and chromosome 03-R subtelomeres. Region coordinates in b–d are based on the defined subtelomeres rather than the full chromosomes.

The enrichment of segmental duplication blocks occurring via ectopic sequence reshuffling is a common feature of eukaryotic subtelomeres; however, incomplete genome assemblies have prevented population-level quantitative analysis of this phenomenon. Here we identified subtelomeric duplication blocks based on pairwise comparisons of different subtelomeres within the same strain (Fig. 5a and Supplementary Data Set 6). In total, we identified 173 pairs of subtelomeric duplication blocks across the 12 strains, with 8–26 pairs for each strain (Supplementary Table 15). Among the 16 pairs of subtelomeric duplication blocks previously identified in S288C (mentioned above), all the 12 larger pairs passed our filtering criteria. Notably, the Hawaiian S. paradoxus UWOPS91-917.1 had the most subtelomeric duplication blocks, and half of these were strain-specific, suggesting unique subtelomere evolution in this strain. The duplicated segments always maintained the same centromere–telomere orientation, supporting a mutational mechanism of double-strand break (DSB) repair like those previously suggested in other species40,41. We further summarized those 173 pairs of duplication blocks according to the orthologous subtelomeres involved. This led to 75 unique duplicated subtelomere pairs, 59 (78.7%) of which have not been described before (Supplementary Data Set 7). We found 31 (41.3%) of these unique pairs to be shared between strains or even between species with highly dynamic strain-sharing patterns (Fig. 5b and Supplementary Fig. 8a). Most (87.1%) of this sharing pattern could not be explained by the strain phylogeny (Supplementary Data Set 7). This suggests a constant gain-and-loss process of subtelomeric duplications throughout evolutionary history.
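The step of collapsing per-strain duplication blocks into unique subtelomere pairs and counting how many strains share each pair can be sketched as below. The sketch borrows the homology thresholds stated in the Figure 5b legend (BLAT score ≥5,000 and identity ≥90%) and assumes they can be applied directly to a flat list of hits; the tuple layout, function name, scores and identities are hypothetical, and the study's actual filtering criteria may involve further steps.

```python
from collections import defaultdict

MIN_SCORE = 5000       # alignment score threshold (from the Figure 5b legend)
MIN_IDENTITY = 90.0    # percent identity threshold (from the Figure 5b legend)


def shared_duplicated_pairs(hits):
    """Count, for each unordered subtelomere pair, the strains with a passing hit.

    hits is an iterable of (strain, subtelomere_a, subtelomere_b, score, identity).
    Self-hits are ignored, and (a, b) is treated the same as (b, a).
    """
    strains_by_pair = defaultdict(set)
    for strain, sub_a, sub_b, score, identity in hits:
        if sub_a == sub_b:
            continue  # only comparisons between different subtelomeres count
        if score >= MIN_SCORE and identity >= MIN_IDENTITY:
            pair = tuple(sorted((sub_a, sub_b)))
            strains_by_pair[pair].add(strain)
    return {pair: len(strains) for pair, strains in strains_by_pair.items()}


# Hypothetical hits; scores and identities are made up for illustration.
hits = [
    ("S288C", "chr01-L", "chr08-R", 12000, 97.5),
    ("S288C", "chr08-R", "chr01-L", 11800, 97.5),   # same pair, reversed order
    ("SK1", "chr01-L", "chr08-R", 9000, 95.0),
    ("SK1", "chr01-L", "chr01-R", 4000, 99.0),      # fails the score threshold
    ("Y12", "chr01-R", "chr08-R", 7000, 88.0),      # fails the identity threshold
]
print(shared_duplicated_pairs(hits))
```

A table of this kind, with one count per subtelomere pair, is the sort of input a strain-sharing heat map could be drawn from.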

Figure 5: Evolutionary dynamics of subtelomeric duplications.

(a) An example of subtelomeric duplication blocks shared among the chromosome 01-L (chr01-L), chr01-R and chr08-R subtelomeres in S. cerevisiae S288C. Gray shading indicates shared homologous regions with ≥90% sequence identity. (b) Subtelomeric duplication signals shared across strains. For each subtelomere pair, the number of strains showing strong sequence homology (BLAT score ≥5,000 and identity ≥90%) is indicated in the heat map. (c) Hierarchical clustering based on the proportion of conserved orthologous subtelomeres in cross-strain comparisons within S. cerevisiae and S. paradoxus. (d) Subtelomere reshuffling intensities (log10 scale) within S. cerevisiae (S.c.–S.c.) and within S. paradoxus (S.p.–S.p.), adjusted by the diversification time of the compared strain pair. Center lines, median; boxes, interquartile range (IQR); whiskers, 1.5× IQR. Data points beyond the whiskers are outliers.

Given the rampant subtelomere reshuffling, we investigated to what extent the similarity in orthologous subtelomere composition reflects the intra-species phylogenies. We measured the proportion of conserved orthologous subtelomeres in all strain pairs within the same species and performed hierarchical clustering accordingly (Fig. 5c). The clustering in S. paradoxus correctly recapitulated the true phylogeny, whereas the clustering in S. cerevisiae showed a different topology, and only the relationship of the most recently diversified strain pair (DBVPG6044 versus SK1) was correctly recovered. Notably, the distantly related Wine/European (DBVPG6765) and Sake (Y12) S. cerevisiae strains were clustered together, suggesting possible convergent subtelomere evolution during their respective domestication for alcoholic beverage production. The proportion of conserved orthologous subtelomeres among S. cerevisiae strains (56.3–81.3%) is comparable to that among S. paradoxus strains (50.0–81.3%), despite the much smaller diversification timescales of S. cerevisiae. This translates into a 3.8-fold difference in subtelomeric reshuffling intensity between the two species during their respective diversifications (one-sided Mann–Whitney U-test, P = 2.93 × 10−8) (Fig. 5d). The frequent reshuffling of subtelomeric sequences often has drastic impacts on gene content, both qualitatively and quantitatively. For example, four genes (PAU3, ADH7, RDS1 and AAD3) were lost in S. cerevisiae Y12 owing to a single subtelomeric duplication event (chromosome 08-L to chromosome 03-R) (Supplementary Fig. 8b). Therefore, the accelerated subtelomere reshuffling in S. cerevisiae is likely to have important functional implications.

Native noncanonical chromosome end structures

S. cerevisiae chromosome ends are characterized by two telomere-associated sequences: the core X and the Y′ element42. The core X element is present in nearly all chromosome ends, whereas the number of Y′ elements varies across chromosome ends and strains. The two previously described chromosome end structures have (i) a single core X element or (ii) a single core X element followed by 1–4 distal Y′ elements42. S. paradoxus chromosome ends also contain core X and Y′ elements43, but their detailed structures and genome-wide distributions have not been systematically characterized. Across our 12 strains, most (85%) chromosome ends had one of the two structures described above, but we also discovered novel chromosome ends (Supplementary Table 13). We found several examples of tandem duplications of the core X element in both species. In most cases, including the ones in the S. cerevisiae reference genome (chromosome VIII-L and chromosome XVI-R), the proximal duplicated core X elements had degenerated, but we found two examples where intact duplicated copies were retained: chromosome XII-R in S. cerevisiae Y12 and chromosome III-L in S. paradoxus CBS432. The latter was especially notable, with six core X elements (including three complete copies) arranged in tandem. We discovered five chromosome ends consisting of only Y′ elements (one or more copies) but no core X elements. This was unexpected given the importance of core X elements in maintaining genome stability44,45. The discovery of these noncanonical chromosome end structures offers a new paradigm to investigate the functional role of core X elements.
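The chromosome end structures described above can be summarized as a small classification rule. The sketch below assumes that each chromosome end has been reduced to an ordered list of annotated elements (proximal to distal), encoded as "X" for a core X element and "Y'" for a Y′ element; the encoding, function name and category labels are hypothetical, and degenerate versus intact copies are not distinguished.

```python
def classify_chromosome_end(elements):
    """Classify a chromosome end from its ordered core X ("X") and Y' ("Y'") elements."""
    n_x = elements.count("X")
    n_y = elements.count("Y'")
    if n_x + n_y != len(elements):
        raise ValueError("elements may only contain \"X\" and \"Y'\"")
    if not elements:
        return "no elements annotated"
    if n_x == 0:
        return "noncanonical: Y' only"
    if n_x > 1:
        return "noncanonical: tandem core X"
    # Exactly one core X from here on.
    if elements[0] != "X":
        return "noncanonical: other"  # Y' found proximal to the core X
    if n_y == 0:
        return "canonical: single core X"
    if n_y <= 4:
        return "canonical: core X followed by Y'"
    return "noncanonical: other"  # more than four Y' copies


print(classify_chromosome_end(["X"]))
print(classify_chromosome_end(["X", "Y'", "Y'"]))
print(classify_chromosome_end(["Y'", "Y'"]))
print(classify_chromosome_end(["X", "X", "X", "X", "X", "X"]))
```

The last call mirrors the arrangement of six tandem core X elements reported for chromosome III-L in S. paradoxus CBS432.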

Mitochondrial genome evolution

Despite being highly repetitive and AT-rich, the mitochondrial genomes of the S. cerevisiae strains showed high degrees of collinearity (Fig. 6a). In contrast, S. paradoxus mitochondrial genomes showed lineage-specific structural rearrangements. The two Eurasian strains (CBS432 and N44) share a transposition of the entire COX3–RPM1 (rnpB)–15S rRNA (rns) segment, in which 15S rRNA was further inverted (Fig. 6b–d). In addition, given the gene order in two outgroups, the COB gene was relocated in the common ancestor of S. cerevisiae and S. paradoxus (Fig. 6e). The phylogenetic tree inferred from mitochondrial protein-coding genes showed clear deviation from the nuclear tree (Fig. 6e). In particular, the Eurasian S. paradoxus lineage (CBS432 and N44) clustered with the seven S. cerevisiae strains before joining with the other S. paradoxus strains, which supports the idea of mitochondrial introgression from S. cerevisiae46 (Fig. 6e). We found low topology consensus (normalized quartet score = 0.59, versus 0.92 for the nuclear gene tree) across different mitochondrial gene loci, suggesting heterogeneous phylogenetic histories. Together with the drastically dynamic presence and absence patterns of mitochondrial group I and group II introns (Supplementary Table 10), this reinforces the argument for extensive cross-strain recombination in yeast mitochondrial evolution47. In addition, the COX3 gene in S. paradoxus UFRJ50816 and UWOPS91-917.1 started with GTG rather than the typical ATG start codon, which was further supported by Illumina reads. This suggests either an adoption of an alternative ATG start codon nearby (for example, 45 bp downstream) or a rare case of a near-cognate start codon48,49,50.

Figure 6: Comparative mitochondrial genomics.

(a–d) Pairwise comparisons for the mitochondrial genomes of S288C and DBVPG6044 from S. cerevisiae (a), CBS432 and YPS138 from S. paradoxus (b), S. cerevisiae S288C and S. paradoxus CBS432 (c) and S. cerevisiae S288C and S. paradoxus YPS138 (d). (e) Genomic arrangement of the mitochondrial protein-coding genes and RNA genes across the 12 sampled strains. Left, phylogenetic tree constructed on the basis of mitochondrial protein-coding genes, with the number at each internal node showing rapid bootstrap support. The detailed gene arrangement map is shown on the right. There is a large inversion in S. arboricolus that encompasses the entire COX2–ATP8 segment (according to its original mitochondrial genome assembly); we inverted this segment back for better visualization.

Fully resolved SVs illuminate complex phenotypic traits

SVs are expected to account for a substantial fraction of phenotypic variation; fully resolved SVs can therefore be crucial in understanding complex phenotypic traits. We used the copper tolerance–related CUP1 locus and the arsenic tolerance–related ARR cluster as examples of associations between fully characterized genomic compositions (i.e., copy numbers and genotypes) and conditional growth rates. The PacBio assemblies precisely resolved these complex loci, and phenotype associations were consistent with previous findings based on copy number analysis6,21,51 (Fig. 7a–d and Supplementary Note). We further illustrated their phenotypic contributions via linkage mapping using 826 phased outbred lines (POLs) derived from crossing the North American (YPS128) and West African (DBVPG6044) S. cerevisiae52 (Online Methods). The linkage analysis accurately mapped a large-effect quantitative trait locus (QTL) at the chromosome 03-R subtelomere (the location of the ARR genes in DBVPG6044), but showed no arsenic resistance association with the YPS128 ARR locus on the chromosome 16-R subtelomere (Fig. 7e). This profile is consistent with the relocation of an active ARR cluster to the chromosome 03-R subtelomere in DBVPG6044 and the presence of deleterious mutations predicted to inactivate the ARR cluster in YPS128 (refs. 6,35). Thus, a full understanding of the relationship between genome sequence and arsenic resistance phenotype is not provided by the knowledge of copy number alone but rather requires the combined knowledge of genotype, genomic location and copy number as provided by our end-to-end assemblies (Fig. 7f).

Figure 7: Structural rearrangements illuminate complex phenotypic variation.

(a–d) Copy number and gene arrangement of the CUP1 locus (a) and the ARR cluster (c) across the 12 strains (asterisks denote involvement of pseudogenes), and generation time of the 12 strains in high-copper (b) and high-arsenic conditions (d). (e) The rearrangement that relocates the ARR cluster to the chromosome 03-R (chr03-R) subtelomere in the West African S. cerevisiae DBVPG6044 is consistent with the linkage mapping analysis using phased outbred lines (POLs) derived from North American (YPS128) and West African (DBVPG6044) S. cerevisiae. (f) Phenotypic distribution of the 826 POLs for generation time in arsenic condition partitioned for genotype positions at the chr03-R and chr16-R subtelomeres and inferred copies of ARR clusters (bottom). Center lines, median; boxes, interquartile range (IQR); whiskers, 1.5× IQR. Data points beyond the whiskers are outliers.

Discussion

The landscape of genetic variation is shaped by multiple evolutionary processes, including mutation, drift, recombination, gene flow, natural selection and demographic history. The combined effect of these factors can vary considerably both across the genome and between species, resulting in different patterns of evolutionary dynamics. The complete genome assemblies that we generated for multiple strains from both domesticated and wild yeasts provide a unique data set for exploring such patterns with unprecedented resolution.

Considering the evolutionary dynamics across the genome, eukaryotic subtelomeres are exceptionally variable compared to chromosomal cores40,53,54, with accelerated evolution manifest in extensive CNV accumulation, rampant ectopic reshuffling and rapid functional divergence6,41,55,56,57. Our study provides a quantitative comparison of subtelomeres and cores in structural genome evolution and a high-resolution view of the extreme evolutionary plasticity of subtelomeres. This rapid evolution of subtelomeres can substantially alter the gene repertoire and generate novel recombinants with adaptive potential57. Given that subtelomeric genes are highly enriched in functions mediating interactions with external environments (for example, stress response, nutrient uptake and ion transport)6,55,58, it is tempting to speculate that the accelerated subtelomeric evolution reflects selection for evolvability, i.e., the ability to respond and adapt to changing environments59.

With regard to the genome dynamics between species, external factors such as selection and demographic history have important roles. The ecological niches and recent evolutionary history of S. cerevisiae have been intimately associated with human activities, with many strains isolated from human-associated environments such as breweries, bakeries and even clinical patients60. Consequently, this wide spectrum of selection schemes could significantly shape the genome evolution of S. cerevisiae. In addition, human activities also promoted admixture and cross-breeding of S. cerevisiae strains from different geographical locations and ecological niches61, resulting in many mosaic strains with mixed genetic backgrounds1. In contrast, the wild-living S. paradoxus occupies very limited ecological niches, with most strains isolated from trees in the Quercus genus62. S. paradoxus strains from different geographical subpopulations are genetically well differentiated with partial reproductive isolation34,63. Such interspecific differences in their history could result in distinct evolutionary genome dynamics, which are captured in our study (Fig. 8). In chromosomal cores, S. cerevisiae strains show slower accumulation of balanced structural rearrangements compared with S. paradoxus strains. This pattern might be explained by the admixture between different S. cerevisiae subpopulations during their recent association with human activities, which would considerably impede the fixation of balanced structural rearrangements. In contrast, geographical isolation of different S. paradoxus subpopulations would favor relatively fast fixation of balanced structural rearrangements64. We observed an opposite pattern for unbalanced rearrangements in chromosomal cores. The S. cerevisiae strains accumulate such changes more rapidly than their S. paradoxus counterparts, which is probably driven by selection, considering the biological functions of those affected genes.
Likewise, the more rapid subtelomeric reshuffling and higher dN/dS of subtelomeric genes in S. cerevisiae than in S. paradoxus are probably also driven by selection. As a consequence of such unbalanced rearrangements and subtelomeric reshuffling, S. cerevisiae strains show more rapid CNV accumulation and GOL, which reinforces this argument. In addition, the mitochondrial genomes of S. cerevisiae strains maintained high degrees of collinearity, whereas those of S. paradoxus strains showed lineage-specific structural rearrangements and introgression, suggesting distinct modes of mitochondrial evolution. Taken together, many of these observed differences between S. cerevisiae and S. paradoxus probably reflect the influence of human activities on structural genome evolution, which sheds new light on why S. cerevisiae, but not its wild relative, is one of our most biotechnologically important organisms.

Figure 8: Contrasting evolutionary dynamics across the genomic landscape between S. cerevisiae and S. paradoxus.

The interspecific contrasts in nuclear chromosomal cores, subtelomeres and mitochondrial genomes are summarized. SR, structural rearrangement.

Methods

Strain sampling, preparation and DNA extraction.

On the basis of previous population genomics surveys1, we sampled seven S. cerevisiae and five S. paradoxus strains (all in the haploid or homozygous diploid forms) to represent major evolutionary lineages of the two species (Supplementary Table 1). The reference strains for S. cerevisiae (S288C) and S. paradoxus (CBS432) were included for quality control. All strains were taken from our strain collection stored at −80 °C and cultured on yeast extract–peptone–dextrose (YPD) plates. A single colony for each strain was picked and cultured in 5 mL YPD liquid at 30 °C and 220 r.p.m. overnight. DNA extraction was carried out using the MasterPure Yeast DNA Purification Kit (Epicentre).

PacBio sequencing and raw assembly.

The sequencing center at the Wellcome Trust Sanger Institute performed library preparation and sequencing using the PacBio Single Molecule, Real-Time (SMRT) DNA sequencing technology (platform: PacBio RS II; chemistry: P4-C2 for the pilot phase and P6-C4 for the main phase). The raw reads were processed using the standard SMRT analysis pipeline (v2.3.0). The de novo assembly was carried out following the hierarchical genome-assembly process (HGAP) assembly protocol with Quiver polishing65.

Assembly evaluation and manual refinement.

We retrieved the reference genomes (Supplementary Note) for both species to assess the quality of our PacBio assemblies. For each polished PacBio assembly, we first used RepeatMasker (v4.0.5) (URLs) to soft-mask repetitive regions (option: -species fungi -xsmall -gff). The soft-masked assemblies were subsequently aligned to the reference genomes using the nucmer program from MUMmer (v3.23)66 for chromosome assignment. For most chromosomes, a single contig covers the entire chromosome. For the cases where internal assembly gaps occurred, we performed manual gap closing by consulting the assemblies generated in the pilot phase of this project. The only gap we were unable to close is the highly repetitive rDNA array (usually consisting of 100–200 copies of a 9.1-kb unit) on chromosome XII. The S. cerevisiae reference genome used a 17,357-bp sequence of two tandemly arranged rDNA copies to represent this complex region. For our assemblies, we trimmed off the partially assembled rDNAs around this gap and re-linked the two contigs with 17,357 bp of Ns to keep consistency. The mitochondrial genomes of the 12 strains were recovered as single contigs in the raw HGAP assemblies. We further circularized them and reset their starting position as the ATP6 gene using Circlator (v1.1.4)67. The circularized mitochondrial genome assemblies were further checked by consulting the raw PacBio reads and manual adjustment was applied when necessary.

Illumina sequencing, read mapping and error correction.

In addition to the PacBio sequencing, we also performed Illumina 151-bp paired-end sequencing for each strain at Institut Curie. We examined the raw Illumina reads via FastQC (v0.11.3) (URLs) and performed adaptor removal and quality-based trimming with Trimmomatic (v0.33)68 (options: ILLUMINACLIP:adapters.fa:2:30:10 SLIDINGWINDOW:5:20 MINLEN:36). For each strain, the trimmed reads were mapped to the corresponding PacBio assembly by BWA (v0.7.12)69. The resulting read alignments were subsequently processed by SAMtools (v1.2)70, Picard tools (v1.131) (URLs) and GATK (v3.5-0)71. On the basis of Illumina read alignments, we further performed error correction with Pilon (v1.12)72 to generate final assemblies for downstream analysis.

Base-level error rate estimation for the final PacBio assemblies.

Eight of our twelve strains were previously sequenced using Illumina technology with moderate to high depths6. We retrieved those raw reads and mapped them to our PacBio assemblies (both before and after Pilon correction) following the protocol described above. SNPs and indels were called by FreeBayes (v1.0.1-2)73 (option: -p 1) to assess the performance of the Pilon correction and estimate the remaining base-level error rate in our final assemblies. The raw SNP and indel calls were filtered by the vcffilter tool from vcflib (URLs) with the filter expression: QUAL > 30 & QUAL / AO > 10 & SAF > 0 & SAR > 0 & RPR > 1 & RPL > 1.
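The filter expression above can be read as a conjunction of six per-variant conditions. The following is a minimal Python sketch of that logic, not the vcffilter implementation itself. It assumes each variant has already been parsed into a dictionary keyed by the FreeBayes field names used in the expression (QUAL, AO, SAF, SAR, RPR, RPL); the function name `passes_filter` and the guard against AO = 0 are illustrative additions that are not part of the original protocol.

```python
def passes_filter(rec):
    """Return True if a parsed variant record meets every condition of
    'QUAL > 30 & QUAL / AO > 10 & SAF > 0 & SAR > 0 & RPR > 1 & RPL > 1'.

    rec is a hypothetical dict with numeric values for the six fields.
    """
    return (
        rec["QUAL"] > 30                  # overall call quality
        and rec["AO"] > 0                 # assumption: avoid division by zero
        and rec["QUAL"] / rec["AO"] > 10  # quality per alternate observation
        and rec["SAF"] > 0                # alternate reads on the forward strand
        and rec["SAR"] > 0                # alternate reads on the reverse strand
        and rec["RPR"] > 1                # alternate reads placed to the right
        and rec["RPL"] > 1                # alternate reads placed to the left
    )


good = {"QUAL": 600, "AO": 20, "SAF": 9, "SAR": 11, "RPR": 8, "RPL": 12}
strand_biased = dict(good, SAR=0)
print(passes_filter(good), passes_filter(strand_biased))
```

Requiring support on both strands and on both sides of the variant is what removes calls backed only by strand-biased or position-biased reads.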

Assembly completeness evaluation.

We compared our S288C PacBio assembly with three published S. cerevisiae assemblies generated by different sequencing technologies (PacBio, Oxford Nanopore and Illumina)74,75. We aligned these three assemblies as well as our S288C PacBio assembly to the S. cerevisiae reference genome using nucmer from MUMmer (v3.23)66. The nucmer alignments were filtered by delta-filter (from the same package) (option: -1). We converted the output file to BED format and used bedtools (v2.15.0)76 to calculate the intersection between our genome alignment and various annotation features (such as chromosomes, genes, retrotransposable elements, telomeres) of the S. cerevisiae nuclear reference genome. The percentage coverage of these annotation features by the different assemblies was summarized accordingly.
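The coverage summary described above amounts to intersecting aligned reference intervals with annotation intervals and dividing by the total annotated length. The sketch below illustrates that arithmetic in plain Python for a single chromosome. It is not the bedtools implementation, and it assumes BED-style half-open coordinates and non-overlapping features; the function names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def percent_covered(features, aligned):
    """Percentage of feature bases overlapped by aligned intervals.

    Both arguments are lists of half-open (start, end) tuples on one
    chromosome; features are assumed not to overlap each other.
    """
    merged = merge_intervals(aligned)
    total = sum(end - start for start, end in features)
    covered = 0
    for f_start, f_end in features:
        for a_start, a_end in merged:
            covered += max(0, min(f_end, a_end) - max(f_start, a_start))
    return 100.0 * covered / total if total else 0.0


genes = [(0, 100), (200, 300)]
alignments = [(50, 150), (120, 250)]
print(percent_covered(genes, alignments))
```

Merging the alignment intervals first prevents bases covered by two overlapping alignments from being counted twice.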

Annotation of the protein-coding genes, tRNA genes and other genomic features.

For nuclear genomes, we assembled an integrative pipeline that combines three existing annotation tools to form an evidence-leveraged protein-coding gene annotation. First, we used the RATT package77 for directly transferring the nondubious S. cerevisiae reference gene annotations to our PacBio assemblies on the basis of whole genome alignments. Furthermore, we used the Yeast Genome Annotation Pipeline (YGAP)78 to annotate our PacBio assemblies (default options without scaffolds reordering) based on gene sequence homology and synteny conservation. A custom Perl script (available on request) was used to remove redundant, truncated, or frameshifted genes annotated by YGAP. Finally, we used the Maker pipeline (v2.31.8)79 to perform de novo gene discovery with EST–protein alignment support (Supplementary Note). As a by-product, tRNA genes were also annotated via the tRNAscan-SE (v1.3.1)80 module of the Maker pipeline. Gene annotations produced by RATT, YGAP and Maker together with the EST–protein alignment evidence generated by Maker were further leveraged by EVidenceModeler (EVM)81 to form an integrative annotation. Manual curation was carried out for selected cases (for example, the CUP1 and ARR clusters) and pseudogenes were manually labeled when verified. The same pipeline was used for upgrading the protein-coding gene annotation of S. arboricolus, for which the originally annotated coding sequences (CDSs) and protein sequences were used for initial EST–protein alignment. In addition, for the 12 strains, we systematically annotated other genomic features encoded in their nuclear genomes, such as centromeres, Ty retrotransposable elements and telomere-associated core X and Y′ elements (Supplementary Note). Protein-coding genes that overlap with truncated or full-length Tys, core X or Y′ elements were removed from our final annotation.

As for mitochondrial genomes, the protein-coding genes, tRNA genes and other mitochondrial RNA genes such as RPM1 (RNase P RNA) and the 15S (small subunit) and 21S (large subunit) rRNAs were annotated by MFannot (URLs). The exon–intron boundaries of annotated mitochondrial genes were manually curated based on BLAST and the 12-way mitochondrial genome alignment generated by mVISTA82.

Orthology group identification.

For nuclear protein-coding genes, we used Proteinortho (v5.15)83 to identify gene orthology across the 12 strains and six other sensu stricto outgroups: Saccharomyces mikatae (strain IFO1815), Saccharomyces kudriavzevii (strain IFO1802), Saccharomyces kudriavzevii (strain ZP591), Saccharomyces arboricolus (strain H6), Saccharomyces eubayanus (strain FM1318) and Saccharomyces bayanus var. uvarum (strain CBS7001). The orthology identification took into account both sequence homology and synteny conservation (the PoFF feature84 of Proteinortho). For each annotated strain, the systematic names of nondubious genes in the Saccharomyces Genome Database (SGD) (URLs) were mapped to our annotated genes based on the orthology groups identified above.

Phylogenetic reconstruction.

For nuclear genes, we performed the phylogenetic analysis on the basis of one-to-one orthologs that are shared across all 18 strains (seven S. cerevisiae + five S. paradoxus + six outgroups) using two complementary approaches: the concatenated tree approach and the consensus tree approach. For each one-to-one ortholog, we used MUSCLE (v3.8.1551)85 to align protein sequences and PAL2NAL (v14)86 to align codons accordingly. For the concatenated tree approach, we generated a concatenated codon alignment across all orthology groups and fed it into RAxML (v8.2.6)87 for maximum likelihood (ML) tree building. Alignment partition was configured by the first, second, and third codon positions. The GTRGAMMA model was used for phylogenetic inference. The rapid bootstrapping method built in RAxML was used to assess the stability of internal nodes (option: -# 100). The final ML tree was visualized in FigTree (v1.4.2) (URLs). For the consensus tree approach, we built individual gene trees with RAxML using the same method described above, which were further summarized into a coalescent-based consensus species tree by ASTRAL (v4.7.12)88. The normalized quartet score was calculated to assess the reliability of the final species tree given individual gene trees. For mitochondrial genes, we performed the same phylogenetic analysis based on the eight mitochondrial protein-coding genes.
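Partitioning a concatenated codon alignment by codon position is commonly expressed to RAxML through a partition file in which each line selects every third column. The sketch below generates such lines for an alignment of a given length. The partition names (`codon1` to `codon3`) and the function name are illustrative assumptions; the exact partition file used in this study is not given in the text.

```python
def codon_partitions(alignment_length):
    """Return RAxML-style partition lines, one per codon position.

    The 'start-end\\3' syntax selects every third column beginning at
    'start'. The alignment length must be a positive multiple of three.
    """
    if alignment_length <= 0 or alignment_length % 3 != 0:
        raise ValueError("codon alignment length must be a positive multiple of 3")
    return [
        f"DNA, codon{pos} = {pos}-{alignment_length}\\3"
        for pos in (1, 2, 3)
    ]


for line in codon_partitions(3000):
    print(line)
```

Separating the three positions lets the substitution model parameters differ between the fast-evolving third position and the more constrained first and second positions.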

Relative rate test.

To test the rate heterogeneity between S. cerevisiae and S. paradoxus in molecular evolution, we constructed three-way sequence alignments by sampling one strain for each species together with S. mikatae as the outgroup. The sequences were drawn from the concatenated nuclear CDS alignment described above. The extracted sequences were fed into MEGA (v7.0.16)89 for Tajima's relative rate test27. We conducted this test for all possible S. cerevisiae–S. paradoxus strain pairs.
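For readers unfamiliar with the test, the sketch below illustrates the usual one-degree-of-freedom form of Tajima's relative rate test on a three-way alignment. It counts sites at which only the first ingroup sequence differs (m1) and sites at which only the second differs (m2), then compares them with a chi-square statistic (m1 − m2)²/(m1 + m2). This is an illustration under stated assumptions, not MEGA's implementation: sites containing any character other than A, C, G or T are skipped, and the function name is hypothetical.

```python
import math


def tajima_relative_rate(seq1, seq2, outgroup):
    """Return (m1, m2, chi2, p) for aligned sequences of equal length."""
    if not (len(seq1) == len(seq2) == len(outgroup)):
        raise ValueError("sequences must be aligned to the same length")
    m1 = m2 = 0
    for a, b, o in zip(seq1.upper(), seq2.upper(), outgroup.upper()):
        if any(base not in "ACGT" for base in (a, b, o)):
            continue  # assumption: ignore gaps and ambiguous bases
        if a != b and b == o:
            m1 += 1   # change unique to seq1
        elif a != b and a == o:
            m2 += 1   # change unique to seq2
    if m1 + m2 == 0:
        return m1, m2, 0.0, 1.0
    chi2 = (m1 - m2) ** 2 / (m1 + m2)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2))
    return m1, m2, chi2, p


print(tajima_relative_rate("AAAATTTT", "AAAAAAAA", "AAAAAAAA"))
```

Under equal rates, unique changes are expected to fall on the two ingroup lineages equally often, so a large imbalance between m1 and m2 indicates rate heterogeneity.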

Molecular dating.

As no yeast fossil record can be used for reliable calibration, we performed molecular dating analysis using a relative time scale. We used the phylogenetic tree constructed from the nuclear one-to-one orthologs as the input and performed least-squares-based fast dating with LSD90 (options: -c -v -s). We specified S. bayanus var. uvarum CBS7001 and S. eubayanus FM1318 as outgroups for this analysis.

Conserved synteny block identification.

We used SynChro from the CHROnicle package (January 2015 version)91,92 to identify conserved synteny blocks. We prepared the input files for SynChro with custom Perl scripts (available on request) to provide the genomic coordinates of all annotated features together with the genome assembly and proteome sequences. SynChro subsequently performed exhaustive pairwise comparisons to identify synteny blocks shared in the given strain pair.

Subtelomere definition and chromosome partitioning.

An often-used yeast subtelomere definition is 20–30 kb from the chromosome ends. However, this definition is arbitrary in the sense that it treats all subtelomeres indiscriminately. In this study, we defined yeast subtelomeres on the basis of gene synteny conservation profiles across the 12 strains. For each chromosome arm, we examined all syntenic blocks shared across the 12 strains and used the most distal one to define the distal boundary for the chromosomal core (Supplementary Table 11). Meanwhile, we defined the proximal boundary of the chromosome end for this chromosome arm according to the first occurrence of core X or Y′ elements. The region between these two boundaries was defined as the subtelomere for this chromosome arm, with 400-bp interstitial transition zones on both sides (Supplementary Fig. 3).

Given that some strains (i.e., UWOPS03-461.4, UFRJ50816 and UWOPS91-917.1) are involved in large-scale interchromosomal rearrangements, the current chromosomal identities (determined by centromeres) might not necessarily agree with the ancestral chromosomal identities (determined by gene contents). Therefore, we used Roman and Arabic numbers, respectively, to denote these two identities for all 12 strains and avoid potential confusion about those interchromosomal rearrangements (Supplementary Fig. 4 and Supplementary Table 12). Each defined subtelomere was named according to the ancestral chromosomal identity of its flanking chromosomal core and denoted also using Arabic numbers (Supplementary Data Sets 2 and 3).

Identification of balanced and unbalanced structural rearrangements in chromosomal cores.

To identify balanced rearrangements, we first used ReChro from CHROnicle (January 2015 version)91,92. We set the synteny block stringency parameter “delta=1” for the main analysis. A complementary run was performed with “delta=0” to identify single gene inversions. Alternatively, we started with the one-to-one ortholog gene pairs (identified by our orthology group identification) in chromosomal cores between any given strain pair and examined their relative orientation and chromosomal locations. If the two one-to-one orthologous genes are located on the same chromosome but have opposite orientations, an inversion should be involved. If they reside on different chromosomes, a translocation or transposition should be involved.
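The ortholog-based rule in the last two sentences can be sketched as a small classifier. This is only an illustration of the decision logic: it assumes each gene is described by a chromosome label and a strand, that the chromosome labels of the two strains are directly comparable, and that candidates flagged this way still require the manual verification described below. The `Gene` type and `classify_pair` function are hypothetical names.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    chrom: str   # chromosome label, assumed comparable between strains
    strand: str  # "+" or "-"


def classify_pair(gene_a, gene_b):
    """Classify a one-to-one ortholog pair from two strains."""
    if gene_a.chrom != gene_b.chrom:
        return "translocation_or_transposition"
    if gene_a.strand != gene_b.strand:
        return "inversion"
    return "no_balanced_rearrangement_detected"


print(classify_pair(Gene("chr05", "+"), Gene("chr05", "-")))
print(classify_pair(Gene("chr05", "+"), Gene("chr12", "+")))
```

A pair that shares chromosome and orientation can still have moved within the chromosome, which is why the location-based comparison and the ReChro analysis complement this rule.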

As for unbalanced rearrangements, we first generated whole-genome alignment for every strain pair by nucmer66 (options: -maxmatch -c 500) and used Assemblytics93 to identify potential insertions, deletions and duplications or contractions. All candidates were further intersected with our gene annotations by bedtools intersect76 to only keep those encompassing at least one protein-coding gene. Alternatively, we started with all the genes enclosed in chromosomal cores of any given strain pair and filtered out those completely covered by unique genome alignment between this strain pair. All the remaining genes were classified as candidates potentially involved in unbalanced rearrangements.

All identified candidate cases were manually examined by dot plots using Gepard (v1.30)94. All verified rearrangements in chromosomal cores were further mapped to the phylogeny of the 12 strains to reconstruct their evolutionary histories based on the maximum parsimony principle. The corresponding genomic regions in those six outgroups were also checked by dot plots to provide further support for our evolutionary history inferences.

Gene Ontology analysis.

The CDSs of the S. cerevisiae nondubious reference genes were BLASTed against the NCBI nonredundant (nr) database using blastx (E-value = 1 × 10⁻³) and further annotated by BLAST2GO (v.3.2)95,96 to generate Gene Ontology (GO) mapping for each gene. We performed Fisher's exact test97 to detect significantly enriched GO terms of our test gene set relative to the genome-wide background. False discovery rate (FDR) (cutoff 0.05)98 was used for multiple testing correction. Significantly enriched GO terms were further processed by the 'Reduce to most specific terms' function implemented in BLAST2GO to keep only child terms.
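The enrichment statistics can be illustrated with a short standard-library sketch. It makes two assumptions that the text does not confirm: that enrichment is assessed with a one-sided Fisher's exact test (the upper hypergeometric tail, with the test set treated as a subset of the background), and that the FDR adjustment follows the Benjamini–Hochberg step-up procedure. The function names are hypothetical and this is not BLAST2GO's code.

```python
from math import comb


def enrichment_p(k, n, K, N):
    """One-sided P(X >= k): N background genes, K carrying the GO term,
    n genes in the test set, k test-set genes carrying the term."""
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from largest p to smallest
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(pvals)
significant = [q <= 0.05 for q in adjusted]
print(adjusted, significant)
```

Terms whose adjusted value falls at or below the 0.05 cutoff would then be passed to the term-reduction step.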

Molecular evolutionary rates, CNV accumulation and GOL estimation.

For the one-to-one orthologs in each strain pair, we calculated the synonymous substitution rate (dS), the nonsynonymous substitution rate (dN) and the nonsynonymous-to-synonymous substitution rate ratio (dN/dS) using the yn00 program from PAML (v4.8a)99 based on Yang and Nielsen100. We also measured the proportion of genes involved in CNVs (i.e., those that are not one-to-one orthologs) in any strain pair. We denoted this measurement as P_CNVs, a quantity analogous to the P-distance in sequence comparison. To correct for multiple changes at the same gene loci, the Poisson distance D_CNVs can be given by −ln (1 − P_CNVs). This value can be further adjusted for evolutionary time by dividing by 2T, where T is the diversification time of the two compared strains obtained from our molecular dating analysis. To capture evolutionary dynamics in terms of gene order changes, we also measured GOL for those one-to-one orthologs using the method proposed by previous studies without allowing for intervening genes29,30,31. For GOL, we performed similar Poisson correction and evolutionary time adjustment as for CNV accumulation. The calculated values for dN/dS, CNV accumulation and GOL were further summarized by 'core genes' and 'subtelomeric genes' on the basis of the genome partitioning described above.
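The Poisson correction and time adjustment described in this paragraph reduce to a single expression, −ln(1 − P)/2T. The sketch below evaluates it for an observed proportion P and a diversification time T on the relative time scale used for dating. The function name and the example values are illustrative only and do not come from the study's data.

```python
import math


def poisson_corrected_rate(proportion, divergence_time):
    """Return -ln(1 - P) / (2T).

    proportion: observed fraction of affected genes, 0 <= P < 1.
    divergence_time: diversification time T of the two strains, T > 0.
    """
    if not 0 <= proportion < 1:
        raise ValueError("proportion must satisfy 0 <= P < 1")
    if divergence_time <= 0:
        raise ValueError("divergence time must be positive")
    distance = -math.log(1 - proportion)   # Poisson distance D
    return distance / (2 * divergence_time)


# Hypothetical inputs: 5% of genes affected, relative divergence time 0.1.
print(poisson_corrected_rate(0.05, 0.1))
```

The factor of 2 reflects that changes accumulate independently along both lineages since the two strains diverged.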

Subtelomeric homology search.

For each defined subtelomeric region, we hard-masked all the enclosed Ty-related features (i.e., full-length Ty, truncated Ty and Ty solo-LTRs) and then searched against all the other subtelomeric regions for shared sequence homology. The search was performed by BLAT101 (options: -noHead -stepSize=5 -repMatch=2253 -minIdentity=80 -t=dna -q=dna -mask=lower -qMask=lower). We used pslCDnaFilter (options: -minId=0.9 -minAlnSize=1000 -bestOverlap -filterWeirdOverlapped) to filter out trivial signals and pslScore to calculate sequence alignment scores for those filtered BLAT matches. As the two reciprocal scores obtained from the same subtelomere pair are not symmetrical (depending on which sequence was used as the query), we took their arithmetic mean in our analysis. Such subtelomeric homology search was carried out for both within-strain and cross-strain comparisons, and subtelomere pairs with strong sequence homology (BLAT alignment score ≥5000 and sequence identity ≥90%) were recorded.
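The symmetrization and thresholding step can be sketched as follows. The sketch assumes that the two reciprocal pslScore values and a single percent-identity value per subtelomere pair are already available; how identity is combined across the two search directions is not specified in the text, so passing one value is an assumption. The function name is hypothetical.

```python
def strong_homology(score_ab, score_ba, identity_percent,
                    min_score=5000, min_identity=90.0):
    """Average the two reciprocal alignment scores and apply both cutoffs.

    Returns (mean_score, is_strong).
    """
    mean_score = (score_ab + score_ba) / 2
    is_strong = mean_score >= min_score and identity_percent >= min_identity
    return mean_score, is_strong


print(strong_homology(5200, 4900, 93.5))
print(strong_homology(5200, 4700, 93.5))
```

Averaging makes the recorded score independent of which subtelomere served as the query, so each pair is evaluated once.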

Hierarchical clustering analysis and reshuffling rate calculation for orthologous subtelomeres.

For all strains within the same species, we performed pairwise comparisons of their subtelomeric regions to identify conserved orthologous subtelomeres in any given strain pair on the basis of the homology search described above. For each strain pair, the proportion of conserved orthologous subtelomeres was calculated as a measurement of the overall subtelomere conservation between the two strains. Such measurements were converted into a distance matrix by the dist() function in R (v3.1)102, based on which the hclust() function was further used for hierarchical clustering. We gauged the reshuffling intensity of orthologous subtelomeres similarly to how we measured CNV accumulation and GOL. For any given strain pair, we first calculated the proportion of the nonconserved orthologous subtelomeres in this strain pair as P_reshuffling and then applied the Poisson correction and evolutionary time adjustment by −ln (1 − P_reshuffling)/2T, in which T is the diversification time of the two compared strains.

Phenotyping the growth rates of yeast strains in copper- and arsenite-rich medium.

The homozygous diploid versions of the 12 strains were pre-cultured in synthetic complete (SC) medium overnight to saturation. To examine their conditional growth rates in copper- and arsenite-rich environments, we mixed 350 μl conditional medium (CuCl2 (0.38 mM) and arsenite (As(III), 3 mM) for the two environments, respectively) with 10 μl saturated culture in the wells of honeycomb plates. Oxygen-permeable films were placed on top of the plates to enable uniform oxygen distribution throughout the plate. The automatic screening was done with Bioscreen Analyser C (Thermo Labsystems Oy) at 30 °C for 72 h, measuring in 20-min intervals using a wide-band filter at 420–580 nm (ref. 103). Growth data pre-processing and phenotypic trait extraction were performed by PRECOG104.

Linkage analysis in diploid S. cerevisiae hybrids.

A total of 826 phased outbred lines (POLs) were constructed and phenotyped as previously described52. Briefly, advanced intercrossed lines (AILs) were generated by successive rounds of mating and sporulation from the YPS128 and DBVPG6044 strains105. The resulting haploid AILs were sequenced106 and crossed in different combinations to yield the 826 POLs used for the analysis. The POL diploid genotypes can be accurately inferred from the haploid AILs. Effectively, these 826 POLs constitute a subset of the larger set of POLs in Hallin et al.52 but were constructed and phenotyped independently. Phenotyping of the POLs, each with four replicates, was performed using Scan-o-Matic107 on solid agar plates (0.14% yeast nitrogen base, 0.5% ammonium sulfate, 2% (w/v) glucose and pH buffered to 5.8 with 1% (w/v) succinic acid, 0.077% complete supplement mixture (CSM, Formedium), 2% agar) supplemented with varying arsenite concentrations (0, 1, 2, and 3 mM). Using the deviations between the POL phenotype and the estimated parental mean phenotype in the mapping to combat population structure issues52, quantitative trait loci (QTLs) were mapped using the scanone() function in R/qtl108 with the marker regression method.

Statistics.

Tajima's relative rate test27 was performed in MEGA (v7.0.16)89. Fisher's exact test97 with FDR correction98 was performed in BLAST2GO (v.3.2)95,96. The Mann–Whitney U-test was performed in R (v3.1)102 using the wilcox.test() function, with a one-sided alternative hypothesis. P < 0.05 was considered statistically significant in all statistical tests.

Data availability.

All genome sequencing, assembly and annotation data that support the findings of this study have been deposited in public repositories. The PacBio sequencing reads for this project have been deposited in the European Nucleotide Archive (ENA) under accession code PRJEB7245. Illumina sequencing reads have been deposited in the Sequence Read Archive (SRA) under accession code PRJNA340312. The genome assemblies and annotations generated by this study are available at https://yjx1217.github.io/Yeast_PacBio_2016/data/ and in GenBank under accession code PRJEB7245.

URLs.

Previously identified subtelomeric duplication blocks in S. cerevisiae S288C, http://www2.le.ac.uk/colleges/medbiopsych/research/gact/images/clusters-fixed-large.jpg; RepeatMasker, http://www.repeatmasker.org; FastQC, http://www.bioinformatics.babraham.ac.uk/projects/fastqc/; Picard tools, http://broadinstitute.github.io/picard/; vcflib, https://github.com/vcflib/vcflib; MFannot, http://megasun.bch.umontreal.ca/cgi-bin/mfannot/mfannotInterface.pl; The Saccharomyces Genome Database (SGD), http://www.yeastgenome.org; FigTree, http://tree.bio.ed.ac.uk/software/figtree/.

Additional information

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.