Transient Structural Variations Alter Gene Expression and Quantitative

The effects of structural variants on phenotypic diversity and evolution are poorly understood. We recently described the genetic and phenotypic variation among fission yeast strains and showed that genome-wide association studies are informative in this model. Here we extend this work by systematically identifying structural variations and investigating their consequences. We establish a curated catalog of copy number variants (CNVs) and rearrangements, including inversions and translocations. We find that SVs frequently vary within clonal populations and are weakly tagged by SNPs, consistent with rapid turnover. We show that CNVs produce measurable effects on gene expression both within and outside the duplicated regions. CNVs contribute to quantitative traits such as cell shape, cell growth under diverse conditions, sugar utilization in winemaking and antibiotic resistance, whereas rearrangements are strongly associated with reproductive isolation. Collectively, these findings have broad implications for evolution and for our understanding of quantitative traits and complex human diseases.

alignments in multiple strains for all 315 candidates. This meticulous approach aimed to ensure a high-quality call set and to mitigate the high uncertainty associated with SV calling (Layer et al., 2014). This curation produced a set of 113 SVs, comprising 23 deletions, 64 duplications, 11 inversions and 15 translocations (Figure 1a). Reassuringly, when our variant calling methods were applied to an engineered knockout strain, we correctly identified the known deletions and called no false positives. Attempts to validate all rearrangements by PCR and BLAST searches of de novo assemblies positively verified 76% of the rearrangements, leaving only a few PCR-intractable variants unverified (Methods).
bioRxiv preprint first posted online Apr. 13, 2016; doi: http://dx.doi.org/10.1101/047266. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.
Most SVs were present at low frequencies, with 28% discovered in only one of the strains analyzed (Figure 1b). The deletions were generally small (median length 595 bp, Figure 1c), whereas duplications showed a median length of 20 kb, with the largest duplication extending to 510 kb and covering 200 genes (a singleton in strain JB1207/NBRC10570). The majority of CNVs were present in copy numbers varying between zero and sixteen (subsequently we refer to amplifications of two or more copies as 'duplications').
Deletions and duplications were strongly biased towards the ends of chromosomes (Figure 1d, Supplementary Figure 1), which are characterized by high genetic diversity, frequent transposon insertions, and a paucity of essential genes (Jeffares et al., 2015). All SVs preferentially occurred in positions of low gene density and showed a strong tendency not to overlap with essential genes (Supplementary Figure 2). To describe SVs further, we conducted gene enrichment analysis with the AnGeLi tool (Supplementary Table 1), which interrogates gene lists for functional enrichments using multiple qualitative and quantitative information sources (Bitton et al., 2015). The CNV-overlapping genes were enriched for caffeine/rapamycin-induced genes and genes induced during meiosis (P = 4x10^-7 and 1x10^-5, respectively); they also showed lower relative DNA polymerase II occupancy and were less likely to contain genes that are known to produce abnormal cell phenotypes (P = 1.8x10^-5 and 3x10^-5, respectively). These analyses are all broadly consistent with a paucity of CNVs in genes that encode essential mitotic functions. Rearrangements disrupted only a few genes and showed no significant enrichments.

Duplications are transient within clonal populations
Our previous work identified 25 clusters of near-clonal strains, which differed by <150 SNPs within each cluster (Jeffares et al., 2015). We expect that these clusters reflect either repeat depositions of strains differing only at few sites (e.g. mating-type variants of reference strains h90 and h- differ by 14 SNPs) or natural populations of strains collected from the same location. Such 'clonal populations' reflect products of mitotic propagation from a very recent common ancestor, without any outbreeding. We therefore expected that SVs should be largely shared within these clonal populations.
Surprisingly, our genotype predictions indicated that most SVs present in clonal populations were segregating, i.e. were not fixed within the clonal population (68/95 SVs, 72%). Furthermore, we observed instances of the same SVs that were present in two or more different clonal populations but were not fixed within any clonal population. These SVs could be either incorrect allele calls in some strains, or alternatively, recent events that have emerged during mitotic propagation. To distinguish between these two scenarios, we re-examined the read coverage of all 49 CNVs present within at least one clonal population. Since translocations and inversions were more challenging to accurately genotype, we did not re-examine these variants. This analysis verified that 40 of these 49 CNVs (37 duplications, 3 deletions) were clearly segregating within at least one clonal cluster (Figure 2a, Supplementary Figure 3). For example, one clonal population of seven closely related strains, collected together in 1966 from grape must in Sicily, has an average pairwise difference of only 19 SNPs (diversity π = 1.5x10^-6). Notably, this collection showed four non-overlapping segregating duplications. This striking finding indicates that CNVs can arise or disappear frequently during evolution.

Transient duplications affect gene expression
Partial aneuploidies of 500-700 kb in the S. pombe reference strain are known to alter gene expression levels within and, to some extent, outside of the duplicated region (Chikashige et al., 2007). The naturally occurring duplications we described are typically smaller (median length: 46 kb), including an average of 6.5 genes. To examine whether naturally occurring CNVs have similar effects on gene expression, we examined eight pairs of closely related strains (<150 SNPs among each pair) that contained at least one unshared duplication (Figure 2b, Supplementary Figure 3, Supplementary Table 2). Several of these strain pairs have been isolated from the same substrate at the same time, and all pairs are estimated to have diverged approximately 50 to 65 years ago (Supplementary Table 2). We assayed transcript expression from log phase cultures using DNA microarrays, each time comparing a duplicated to a non-duplicated strain from within the same clonal population. In seven of the eight strain pairs, the expression levels of genes within duplications were significantly induced, although the degree of expression changes between genes was variable (Figure 2c). The increased transcript levels correlated with the increased genomic copy numbers, so that higher copy numbers produced correspondingly more transcripts (Supplementary Figure 4). No changes in gene expression were evident immediately adjacent to the duplications (Supplementary Figure 4), suggesting that the local chromatin state was not strongly altered by the CNVs.
Some genes outside the duplicated regions also showed altered expression levels (Figure 2d, Supplementary Table 3). For example, two strain pairs differ by a single 12 kb duplication. Here, five of seven genes within the duplication showed induced expression, while 45 genes outside the duplicated region also showed consistently altered expression levels (38 protein-coding genes, 7 non-coding RNAs) (Figure 2d, arrays 7 and 8). As environmental growth conditions were tightly controlled, these changes in gene expression probably reflect indirect and compensatory effects of the initial perturbation caused by the duplication (Supplementary Figure 5). We conclude that these evolutionarily unstable duplications reproducibly affect the expression of distinct sets of genes and thus have the potential to influence cellular function and phenotypes.
induced (up) and repressed (down) genes, both inside and outside the duplicated regions. Arrays 2,3 and 7,8 (in yellow shading) are replicates within the same clonal population that contain the same duplications, so we list the number of up- and down-regulated genes that are consistent between both arrays. See Supplementary Tables 3 and 4 for details.

Copy number variants influence quantitative traits
To test whether SVs affect phenotypes, we examined the contributions of SNPs, CNVs and rearrangements to 53 quantitative traits, including 11 cell shape parameters and colony size on solid media assaying 42 stress and nutrient conditions (Jeffares et al., 2015). For each phenotype, we used mixed model analysis to estimate the total proportion of variance explained by the additive contribution of genomic variants (the narrow-sense heritability). When we determined heritability using only SNP data, estimates varied between 0% and 74%. After adding CNVs and rearrangements to SNPs in a composite model, the estimated overall heritability increased for nearly all traits, in some cases by more than 2-fold (e.g., resistance to rapamycin) (Fig 3a). This finding indicates that the CNVs and rearrangements can explain a substantial proportion of the trait variance. Using this composite model, we quantified the individual contributions of heritability best explained by SNPs, CNVs and rearrangements (Fig 3b). On average, SNPs explained 30% of trait variance, CNVs 14%, and rearrangements 5% (Supplementary Figure 6, Supplementary Table 5). Analysis of simulated data confirmed that the contribution of CNVs could not be explained by linkage to causal SNPs alone (Supplementary Figure 6).
As some of the strains have recently been shown to have fermentation properties that may be beneficial for winemaking (Benito et al., 2016), we examined three traits related to wine fermentations (glucose/fructose utilization, malic acid degradation, acetic acid content, all measured in grape must). Remarkably, the heritability of these wine-fermentation traits was almost entirely due to SVs, with negligible contributions from SNPs (Supplementary Figure 7, Supplementary Table 6). For glucose/fructose utilization, the CNVs accounted for the entire heritability of 0.53 (Supplementary Figure 8). Since many of these strains have been collected from fermentations (Supplementary Table 7), the strong influence of CNVs may represent recent strong selection and adaptation to fermentation conditions that has occurred via recent CNV acquisition.
To locate specific SVs that affected these traits, we performed mixed model genome-wide association studies, using all 68 SVs with minor allele counts >5 as well as 139,396 SNPs and 22,058 indels. Trait-specific significance thresholds for 5% family-wise error rates were computed via permutation analysis, and were approximately 10^-4 (SVs) and 10^-6 (SNPs and indels). Five SVs were significantly associated with traits (3 duplications, 1 deletion, 1 translocation) (Supplementary Table 8). The median effect size of these SVs was 9% (range 6-13%). The strongest signal was from a 32 kb duplication that covered 11 protein-coding genes and was significantly associated with 15 growth traits. This duplication is segregating (not fixed) within three clonal populations (Fig. 2b, Supplementary Figure 9), often as part of a larger duplication. It was the most significantly associated variant (from SNPs, indels and SVs) for growth in the antibiotic Brefeldin, where it contributes 15% of the trait variance (Fig 3c). Three of the 11 duplicated genes encode transmembrane proteins, any of which could contribute to the trait. Since some fungi produce Brefeldin to inhibit competitive growth, this duplication is a striking example of a transient CNV that could provide a strong selective advantage in a natural environment.
Our analysis of heritability showed that SNPs do not capture all of the genetic contribution of SVs (Fig. 3). To examine whether trait-influencing SVs will be effectively detected by tagging SNPs in this population, we examined the linkage of all 113 SVs with SNPs. We found that only 63 of these SVs (55%) are in strong linkage (r^2 >0.6), leaving 45% of the SVs weakly linked (Supplementary Table 16). This lack of linkage is consistent with SVs being transient, rather than persisting within haplotypes. Such weakly linked SVs will not be tagged well by SNPs (as shown in Fig. 3c). Collectively, these analyses indicate that SVs, most notably CNVs, contribute substantially to quantitative traits and suggest that GWAS analyses conducted without genotyping SVs could fail to capture these genetic factors.
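The linkage criterion used here (r^2 >0.6 between an SV and a SNP) can be illustrated with a minimal Python sketch. It assumes haploid genotypes coded as 0 (reference) and 1 (alternate), and it assumes that an SV counts as tagged when its best r^2 with any SNP exceeds the threshold; the function names and the toy genotypes are hypothetical and are not taken from the analysis itself.

```python
def r_squared(a, b):
    """Squared Pearson correlation between two equal-length 0/1 genotype vectors."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("genotype vectors must be non-empty and of equal length")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a monomorphic variant carries no linkage information
    return cov * cov / (var_a * var_b)


def best_tag_r2(sv, snps):
    """Highest r^2 between one SV and any SNP in the panel (assumed tagging rule)."""
    return max(r_squared(sv, snp) for snp in snps)


def is_strongly_linked(sv, snps, threshold=0.6):
    """Apply the r^2 > 0.6 criterion to the best tagging SNP."""
    return best_tag_r2(sv, snps) > threshold


# Toy data: six strains, one SV and two SNPs (hypothetical genotypes).
sv = [1, 1, 0, 0, 0, 0]
snp_perfect = [1, 1, 0, 0, 0, 0]
snp_poor = [1, 0, 1, 0, 0, 0]
print(is_strongly_linked(sv, [snp_poor]))               # weakly linked
print(is_strongly_linked(sv, [snp_poor, snp_perfect]))  # strongly linked
```

A transient SV that arises on several haplotypes would resemble the `snp_poor` case: no single SNP shares its pattern across strains, so its best r^2 stays low.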
derived from 1000 permutations for SNPs (black), indels (green) and SVs (red). Each is set to ensure a 5% family-wise error rate. The trait measured in this case was colony size in the presence of 80 μM Brefeldin (as a ratio of colony size without Brefeldin). Similar results, including a significant P-value for the 32 kb duplication, were found with 40 μM and 120 μM Brefeldin (Supplementary Table 8).

Structural variations contribute to intrinsic reproductive isolation
Crosses between S. pombe strains produce between 90% and <1% viable offspring (Avelar et al., 2013, Zanders et al., 2014). We have previously shown that spore viability correlates inversely with the number of SNPs between the parental strains (Jeffares et al., 2015). This intrinsic reproductive isolation may be due to the accumulation of Dobzhansky-Muller incompatibilities (variants that are neutral in one population, but incompatible when combined) (Dobzhansky, 1933, Muller, 1939). However, genetically distant strains also accumulate SVs, which are known to lower hybrid viability and drive reproductive isolation (Rieseberg, 2001). In S. pombe, engineered inversions and translocations reduce spore viability by ~40% (Avelar et al., 2013).
As the numbers of SNP and rearrangement differences between mating parents are themselves correlated (τ = 0.53, P = 1.3x10^-8), we also estimated the influence of each factor alone using partial correlations. When either SNPs or rearrangements were controlled for, both remained significantly correlated with offspring viability (P = 0.04, P = 0.02, respectively) (Figure 4). Taken together, these analyses indicate that both rearrangements and SNPs contribute to reproductive isolation, but CNVs do not.
A surprising aspect of our analysis is that duplications are generated and/or lost frequently within clonal populations. For example, within seven strains that differ by as few as 19 SNPs, we discovered four segregating duplications. A similar rapid occurrence of duplications in S. pombe has been observed in laboratory conditions, where spontaneous duplications suppress cdc2 mutants at least 100 times more frequently than SNPs, and these suppressor strains lose their duplications with equal frequency (Carr et al., 1989). Similarly, duplications frequently occur during experimental evolution with budding yeast (Dunham et al., 2002). Consistent with the transience of these variants, they are frequently not well tagged by SNPs. These CNVs subtly alter the expression of genes within and beyond the duplications, and contribute considerably to quantitative traits. Within small populations, CNVs may produce larger effects on traits in the short term than SNPs, as demonstrated by the 32 kb duplication that is associated with resistance to Brefeldin A (an antibiotic produced by entophytic fungi).
This analysis has relevance for human diseases, since de novo CNV formation in the human genome occurs at a measurable rate (approximately one CNV/10 generations (Itsara et al., 2010)), and CNVs are known to contribute to a wide variety of diseases (Zhang et al., 2009). Indeed, both the population genetics and the effects of SVs within S. pombe seem similar to those in humans, in that CNVs are associated with stoichiometric changes in gene expression, and SVs are in weak linkage with SNPs (Stranger et al., 2007, Sudmant et al., 2015) and therefore may be badly tagged by SNPs in GWAS. We show that CNVs in fission yeast contribute to quantitative traits and are also weakly linked to SNPs. These findings highlight the need to identify SVs when describing traits using GWAS, and indicate that a failure to call SVs can lead to an overestimation of the impact of SNPs on traits or contribute to the problem that large proportions of the heritable component of trait variation are not discovered in GWAS (the 'missing heritability').
In summary, we show that different types of SVs have distinct influences at the phenotype level and that a small number of SVs produce profound effects on the biology of this species.
This module includes methods to convert the method-specific output formats to a VCF format. SVs were filtered out if they were unique to one of the three VCF files. Two SVs were defined as overlapping if they occurred on the same chromosome, their start and stop coordinates were within 1 kb, and they were of the same type. In the end, SURVIVOR produced one VCF file containing the filtered calls. SURVIVOR is available at github.com/fritzsedlazeck/SURVIVOR.
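The overlap rule described above can be sketched in a few lines of Python. This is an illustration of the stated criteria only (same chromosome, same type, start and stop each within 1 kb, and support from more than one call set), not a reimplementation of SURVIVOR; the `SV` record, the function names and the example coordinates are hypothetical, and treating "within 1 kb" as a distance of at most 1000 bp is an assumption.

```python
from typing import NamedTuple


class SV(NamedTuple):
    chrom: str
    start: int
    stop: int
    svtype: str


def same_event(a, b, max_dist=1000):
    """True if two calls match on chromosome, type, and both breakpoints (<= max_dist)."""
    return (
        a.chrom == b.chrom
        and a.svtype == b.svtype
        and abs(a.start - b.start) <= max_dist
        and abs(a.stop - b.stop) <= max_dist
    )


def consensus_calls(call_sets, max_dist=1000):
    """Keep one representative of every call supported by at least two call sets."""
    kept = []
    for i, calls in enumerate(call_sets):
        for sv in calls:
            # Is the call matched by a call from any *other* call set?
            supported = any(
                same_event(sv, other, max_dist)
                for j, others in enumerate(call_sets)
                if j != i
                for other in others
            )
            # Avoid reporting the same event twice.
            duplicate = any(same_event(sv, k, max_dist) for k in kept)
            if supported and not duplicate:
                kept.append(sv)
    return kept


# Three hypothetical call sets, standing in for the three VCF files.
caller_a = [SV("I", 1000, 5000, "DEL"), SV("II", 200, 900, "DUP")]
caller_b = [SV("I", 1400, 5300, "DEL")]
caller_c = [SV("III", 10, 20000, "INV")]
print(consensus_calls([caller_a, caller_b, caller_c]))
```

Only the chromosome I deletion survives, because the other two calls are unique to a single call set.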

Read mapping and detection of structural variants
Illumina paired-end sequencing data for 161 S. pombe strains were collected as described in Jeffares et al. (2015), with the addition of Leupold's reference 975 h+ (JB32) and excluding JB374 (known to be a gene-knockout version of the reference strain, see below). Leupold's 968 h90 and Leupold's 972 h- were included as JB50 and JB22, respectively (Supplementary Table 7). For all strains, reads were mapped to the S. pombe reference genome (version ASM294v2.22) using NextGenMap (version 0.4.12) (Sedlazeck et al., 2013) with the parameter -X 1000000. Reads with 20 base pairs or more clipped were extracted using the script split_unmapped_to_fasta.pl included in the LUMPY package (version 0.2.9) (Layer et al., 2014) and were then mapped using YAHA (version 0.1.83) (Faust and Hall, 2012) to generate split-read alignments. The two mapped files were merged using Picard-tools (version 1.105) (http://broadinstitute.github.io/picard), and all strains were then down-sampled to 40x coverage using Samtools (version 0.1.18) (Li et al., 2009).
To identify further CNVs, we ran cn.MOPS (Klambauer et al., 2012) with parameters tuned to collect large duplications/deletions as follows: read counts were collected from bam alignment files (as above) with getReadCountsFromBAM and WL=2000, and CNVs were predicted using haplocn.mops with minWidth=6, all other parameters as default. Hence, the minimum variant size detected was 12 kb (six windows of 2 kb). CNVs were predicted for each strain independently by comparing the alternative strain to the two reference strains (JB22, JB32) and four reference-like strains that differed from the reference by less than 200 SNPs (JB1179, JB1168, JB937, JB936).
After CNV calling, allele calling was achieved by comparing counts of coverage in 100 bp windows for the two reference strains (JB22, JB32) to each alternate strain using custom R scripts. Alleles were called as non-reference duplications if the one-sided Wilcoxon rank sum test p-values for both JB22 and JB32 vs alternate strain were less than 1x10^-10 (showing a difference in coverage) and the ratio of alternate/reference coverage (for both JB22 and JB32) was >1.8 (duplications), or <0.2 (deletions). Manual inspection of coverage plots showed that the vast majority of the allele calls were in accordance with what we discerned by eye. These R scripts were also used to examine CNVs predicted to be segregating within clusters (clonal populations). All such CNVs were examined in all clusters that contained at least one non-reference allele call (Supplementary Table 11).
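The allele-calling rule can be sketched as follows. The custom R scripts are not reproduced here; this Python version is an illustration under stated assumptions: the rank sum test uses a normal approximation without tie or continuity correction (R's wilcox.test may differ in detail), the coverage ratio is taken as the ratio of mean window counts, and all function names and toy coverage values are hypothetical.

```python
from math import erfc, sqrt
from statistics import mean


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def rank_sum_p_greater(x, y):
    """One-sided P-value that x tends to exceed y (normal approximation, assumed)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    z = (u - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return 0.5 * erfc(z / sqrt(2))  # upper tail of the standard normal


def call_allele(alt, refs, p_max=1e-10, dup_ratio=1.8, del_ratio=0.2):
    """Call 'duplication', 'deletion' or 'reference' against every reference strain."""
    ratios = [mean(alt) / mean(ref) for ref in refs]
    if all(r > dup_ratio for r in ratios) and all(
        rank_sum_p_greater(alt, ref) < p_max for ref in refs
    ):
        return "duplication"
    if all(r < del_ratio for r in ratios) and all(
        rank_sum_p_greater(ref, alt) < p_max for ref in refs
    ):
        return "deletion"
    return "reference"


# Toy coverage counts for 120 windows of 100 bp (a 12 kb region).
ref1 = [40 + i % 5 for i in range(120)]
ref2 = [42 + i % 5 for i in range(120)]
dup = [80 + i % 5 for i in range(120)]
deleted = [i % 3 for i in range(120)]
print(call_allele(dup, [ref1, ref2]), call_allele(deleted, [ref1, ref2]))
```

Requiring both the ratio and the test to pass against each reference strain mirrors the text: the ratio guards against tiny but statistically significant shifts, while the test guards against noisy regions with few windows.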

Reduction of false discovery rate
This filtering produced 315 variant calls. However, because 31 of these 315 (~10%) were called within the two reference strains (JB22, JB32), we expected that this set still contained false positives. To further reduce the false positive rate, we looked for parameters that would reduce calls made in reference strains (JB22 and JB32) but not reduce calls in strains more distantly related to the reference (JB1177, JB916 and JB894, which have 68,223, 60,087 and 67,860 SNP differences to the reference (Jeffares et al., 2015)). The reasoning was that we expected to locate few variants in the reference, and more variants in the more distantly related strains. This analysis showed that paired end support, repeats and mapping quality were of primary value.
We therefore discarded all SVs that had a paired end support of 10 or less. In addition, we ignored SVs that appeared in low mapping quality regions (i.e. regions where reads with MQ=0 map) or overlapped with previously identified retrotransposon LTRs (Jeffares et al., 2015).
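These three filters can be expressed as a short Python sketch. The record layout, the function names and the example regions are hypothetical, and treating any overlap between the SV span and a masked interval as disqualifying is an assumption; the text does not specify whether the test was applied to the whole span or only to the breakpoints.

```python
def overlaps(start, stop, intervals):
    """True if the closed interval [start, stop] overlaps any (s, e) interval."""
    return any(start <= e and s <= stop for s, e in intervals)


def keep_sv(sv, low_mq_regions, ltr_regions, max_discarded_support=10):
    """Apply the paired-end support, mapping-quality and LTR filters to one call.

    sv is a dict with keys chrom, start, stop, pe_support (hypothetical layout);
    the region arguments map chromosome name -> list of (start, stop) tuples.
    """
    if sv["pe_support"] <= max_discarded_support:
        return False  # paired-end support of 10 or less is discarded
    for regions in (low_mq_regions, ltr_regions):
        if overlaps(sv["start"], sv["stop"], regions.get(sv["chrom"], [])):
            return False
    return True


# Hypothetical masks and calls.
low_mq = {"I": [(5000, 6000)]}
ltrs = {"II": [(100, 400)]}
calls = [
    {"chrom": "I", "start": 10000, "stop": 12000, "pe_support": 25},  # kept
    {"chrom": "I", "start": 10000, "stop": 12000, "pe_support": 10},  # weak support
    {"chrom": "I", "start": 5500, "stop": 9000, "pe_support": 30},    # low-MQ region
    {"chrom": "II", "start": 350, "stop": 800, "pe_support": 30},     # overlaps LTR
]
filtered = [sv for sv in calls if keep_sv(sv, low_mq, ltrs)]
print(len(filtered))
```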
Finally, to ensure a high specificity call set, these filtered SVs were manually curated using IGV (Thorvaldsdottir et al., 2013) (Supplementary Tables 12, 13). We assigned each SV a score based on inspection of alignments through IGV (0: not reliable, 1: unclear, 2: reliable). Only calls passing this manual curation as reliable (score 2) were included in the final data set of 113 variants utilized for all further analyses. These filtering and manual curation steps reduced our variant calls substantially, from 315 to 113. At this stage only 1/113 (~1%) of these variants was called within the two reference strains (JB22, JB32).

PCR validation
PCR analysis was performed to confirm 10 of the 11 inversions and all 15 translocations from the curated data set. One inversion was too small to examine by PCR (INV.AB325691:6644..6784, 140 nt). Primers were designed using Primer3 (Untergasser et al., 2012) to amplify both the reference and alternate alleles. PCR was carried out with each primer set using a selection of strains that our genotype calls predict to include at least one alternate allele and at least one reference allele (usually 6 strains). Products were scored according to product size and presence/absence (Supplementary Tables 14, 15).
Inversions: 9/10 variants were at least partially verified by either reference or alternate allele PCR (3 variants were verified by both reference and alternate PCRs), and 7/10 inversions also received support from BLAST (see below). Translocations: 10/15 were at least partially verified by either reference or alternate allele PCR (5/15 variants were verified by both reference and alternate PCRs). One additional translocation received support from BLAST (see below), meaning that 11/15 translocations were supported by PCR and/or BLAST. Three of the four translocations that could not be verified were probably nuclear copies of mitochondrial genes (NUMTs) (Lenglez et al., 2010), because one breakpoint was mapped to the mitochondrial genome.

Validation by BLAST of de novo assemblies
We further assessed the quality of the predicted breakpoints for the inversions and translocations by comparing them to the previously created de novo assemblies for each of the 161 strains (Jeffares et al., 2015). To this end, we created BLAST databases for the scaffolds of each strain that were >1 kb. We then created the predicted sequence for 1 kb around each junction of the validated 10 inversions and 15 translocations. These sequences were used to search the BLAST databases using BLAST+ with -gapopen 1 -gapextend 1 parameters. We accepted any BLAST HSP with a length >800 bp as supporting the junction (because these must contain at least 300 bp at each side of the break point). Four inversions and three translocations gained support from these searches (Supplementary File Tables2-PCR.xlsx).
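The reasoning behind the 800 bp cut-off can be checked with a small Python sketch. It assumes a 1,000 bp query with the junction between positions 500 and 501 (1-based), and it measures HSP length as the span on the query; both are assumptions made for illustration, and the function names are hypothetical.

```python
def flanks(qstart, qend, junction=500):
    """Bases of an HSP on the left and right of the junction (1-based, inclusive)."""
    left = max(0, min(qend, junction) - qstart + 1)
    right = max(0, qend - max(qstart - 1, junction))
    return left, right


def supports_junction(qstart, qend, min_len=800):
    """Accept an HSP whose query span is longer than min_len."""
    return (qend - qstart + 1) > min_len


# Smallest flank over every possible HSP longer than 800 bp in a 1,000 bp query.
query_len = 1000
min_flank = min(
    min(flanks(s, e))
    for s in range(1, query_len + 1)
    for e in range(s, query_len + 1)
    if supports_junction(s, e)
)
print(min_flank)
```

Any accepted HSP therefore spans the junction with at least 301 bp on each side, which is consistent with the "at least 300 bp" statement in the text.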

Knockout strain control
Our sample of sequenced strains included one strain (JB374) that is known to contain deletions of the his3 and ura4 genes. Our variant calling and validation methods identified only two variants in this strain, both deletions that corresponded to the positions of these genes, as below. The his3 gene is located on chromosome II at 1489773-1488036; a deletion was detected at II:1488228-1489646. The ura4 gene is located on chromosome III at 115589-116726; a deletion was detected at III:115342-117145. This strain was not included in the further analyses of the SVs.

Microarray expression analysis
Cells were grown in YES (Formedium, UK) and harvested at OD600 = 0.5. RNA was isolated followed by cDNA labeling (Lyne et al., 2003). Agilent 8 x 15K custom-made S. pombe expression microarrays were used. Hybridization, normalization and subsequent washes were performed according to the manufacturer's protocols. The obtained data were scanned and extracted using GenePix and processed for quality control and normalization using in-house developed R scripts. Subsequent analysis of normalized data was performed using R. Microarray data have been submitted to ArrayExpress (accession number E-MTAB-4019). Genes were considered as induced if their expression signal after normalization was >1.9, and repressed if <0.51.

Time to most recent common ancestor (TMRCA) estimates
Previously, based on the genetic distances between these strains and the 'dated tip' dating method implemented in BEAST (Drummond et al., 2012), we have estimated the divergence times between all 161 S. pombe strains sequenced (Jeffares et al., 2015). To determine the TMRCA for pairs of strains, we re-examined the BEAST outputs using FigTree to obtain the median and 95% confidence intervals.

SNP and indel calling
SNPs were called as described (Jeffares et al., 2015). Insertions and deletions (indels) were called in 160 strains using stampy-mapped, indel-realigned bams as described previously (Jeffares et al., 2015). We accepted indels that were called by both the Genome Analysis Toolkit HaplotypeCaller (DePristo et al., 2011) and Freebayes (Garrison and Marth, 2012), and then genotyped all these calls with Freebayes.
Briefly, indels were called on each strain's bam with HaplotypeCaller, and filtered for call quality >30 and mapping quality >30 (bcftools filter --include 'QUAL>30 && MQ>30'). Separately, indels were called on each strain's bam with Freebayes, and filtered for call quality >30. All Freebayes vcf files were merged, accepting only positions called by both Freebayes and HaplotypeCaller. These indels were then genotyped with Freebayes using a merged bam (containing reads from all strains), using the --variant-input flag for Freebayes to genotype only the union calls. Finally, indels were filtered by score, mean reference mapping quality and mean alternate mapping quality >30 (bcftools filter --include 'QUAL>30 && MQM>30 & MQMR>30'). These methods identified 32,268 indels. Only 50 of these segregated between Leupold's h- reference (JB22) and Leupold's h90 reference (JB50), whereas 12,109 indels segregated between the JB22 reference and the divergent strain JB916.

Heritability and GWAS
We selected 53 traits that contained values from at least 100 strains (Jeffares et al., 2015), and so included multiple individuals from within clonal populations (growth rates on 42 different solid media and 11 cell shape characters measured with automated image analysis). Trait values were normalized using a rank-based transformation in R; for each trait vector y, normal.y = qnorm(rank(y)/(1+length(y))). Total heritability, and the contribution of SNPs, CNVs and rearrangements, were estimated using LDAK (version 5.94) (Speed et al., 2012), with kinship matrices derived from all SNPs, 87 CNVs, and 26 rearrangements. To assess whether the contribution of CNVs could be primarily due to linkage with causal SNPs, we simulated trait data using the --make-phenos function of LDAK with the relatedness matrix from all SNPs, assuming that all variants contributed to the trait (--num-causals -1). We made one simulated trait data set per trait, for each of the 53 traits, with total heritability defined as predicted from the real data. We then estimated the heritability using LDAK, including the joint matrix of SNPs, CNVs and rearrangements. To assess the extent to which the contribution of SNPs to heritability was overestimated, we performed another simulation using the relatedness matrix from the 87 segregating CNVs alone, and then estimated the contribution of SNPs, CNVs and rearrangements in this simulated data as above.
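For readers who do not use R, the rank-based transformation qnorm(rank(y)/(1+length(y))) can be mirrored in Python using only the standard library. This is a sketch, assuming no missing values and average ranks for ties (the default behaviour of R's rank); the function names are hypothetical.

```python
from statistics import NormalDist


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def rank_normalize(y):
    """Map trait values to standard normal quantiles of rank / (n + 1)."""
    n = len(y)
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf(r / (n + 1)) for r in average_ranks(y)]


# Three hypothetical trait values: ranks 3, 1, 2 give quantiles 0.75, 0.25, 0.5.
z = rank_normalize([3.2, 1.0, 2.5])
print([round(v, 4) for v in z])
```

Dividing by n + 1 rather than n keeps every quantile strictly between 0 and 1, so the largest observation never maps to an infinite value.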
Genome-wide associations were performed with LDAK (version 5) using default parameters, using a mixed model derived from kinship of all SNPs called previously (Jeffares et al., 2015). Association analysis was run separately for 68 SVs with a minor allele count >5, for 139,396 SNPs and for 22,058 indels, both with minor allele counts >5. We examined the same 53 traits as for the heritability analysis (above). For each trait, we carried out 1000 permutations of trait data, and defined the 5th percentile of these permutations as the trait-specific P-value threshold.
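The permutation procedure can be sketched generically in Python. The sketch assumes that the quantity recorded for each permutation is the smallest P-value across all tested variants, which is the usual way to control the family-wise error rate but is not stated explicitly above. The association test itself is passed in as a function; `assoc_pvalues` is a hypothetical placeholder and not an LDAK call, and simple shuffling ignores any kinship structure that a mixed model would account for.

```python
import random


def permutation_threshold(trait, assoc_pvalues, n_perm=1000, fwer=0.05, seed=0):
    """Trait-specific P-value threshold from permuted trait values.

    assoc_pvalues(trait_values) must return one P-value per tested variant.
    """
    rng = random.Random(seed)
    shuffled = list(trait)
    minima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)                     # break trait-genotype association
        minima.append(min(assoc_pvalues(shuffled)))
    minima.sort()
    index = max(0, int(round(fwer * n_perm)) - 1)  # e.g. the 50th smallest of 1000
    return minima[index]


# Toy stand-in for an association scan: the k-th call returns a smallest
# P-value of k/1000, so the 5th percentile of 1000 permutations is 0.05.
calls = [0]


def fake_scan(trait_values):
    calls[0] += 1
    return [calls[0] / 1000, 1.0]


threshold = permutation_threshold([0.1, 0.5, 0.9], fake_scan, n_perm=1000)
print(threshold)
```

Running the scan separately for SVs, SNPs and indels, as described above, would give each variant class its own threshold, because the number of tests and hence the distribution of minima differs between classes.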

Offspring viability and genetic distance
Cross spore viability data and self-mating viability were collected from previous analyses (Avelar et al., 2013, Jeffares et al., 2015). The number of differences between each pair was calculated using vcftools vcf-subset (Danecek et al., 2011), and correlations were estimated using R, with the ppcor package. When calculating the number of CNV differences between strains, we altered our criteria for 'different' variants (to merge variants whose starts and ends were within 1 kb), and merged CNVs if their overlap was >50% and their allele calls were the same.
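The idea of a partial rank correlation (the correlation between viability and one class of variant while controlling for the other) can be illustrated with a small Python sketch. It uses Kendall's tau-b and the standard first-order partial correlation formula; whether the ppcor package arrives at numerically identical values is an assumption, the function names are hypothetical, and the toy vectors are not data from this study.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b from concordant and discordant pairs (O(n^2))."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue               # tied in both variables: ignored
            elif dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denom


def partial_tau(x, y, z):
    """Correlation of x and y after controlling for z (first-order partial)."""
    t_xy, t_xz, t_yz = kendall_tau_b(x, y), kendall_tau_b(x, z), kendall_tau_b(y, z)
    return (t_xy - t_xz * t_yz) / sqrt((1 - t_xz ** 2) * (1 - t_yz ** 2))


# Toy vectors standing in for viability (x), SNP distance (y) and
# rearrangement distance (z) across four hypothetical crosses.
x = [1, 2, 3, 4]
y = [1, 3, 2, 4]
z = [2, 1, 4, 3]
print(round(kendall_tau_b(x, y), 4), round(partial_tau(x, y, z), 4))
```

The formula is undefined when either variable is perfectly correlated with the controlling variable, since the denominator is then zero.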
Supplementary Figure 3. Duplications that segregate within closely related strains. Plots show the average coverage in 1 kb non-overlapping windows for strains with a duplication (red) and all closely related strains without duplication (green); all these strains differ by <150 SNPs. The coverage of the two standard reference strains (h [legend truncated]

No significant increase in gene expression immediately adjacent to duplications. For each duplication examined with DNA arrays, we show the relative expression (strain 1 vs strain 2) near the duplication. P-values show the support for the genes within the duplication (red vertical lines), or the 50 kb adjacent to the duplication (green vertical lines) being more highly expressed than all other genes in the chromosome (one-sided Wilcoxon rank sum tests). The grey horizontal lines show the 5th, 50th and 95th percentiles for gene expression data on the chromosome. The bottom right panel shows that the median increase in expression level within a duplication correlates with the increase in genomic copy number. The solid black line shows the expected increase for the 1:1 correspondence between genomic copy number and relative expression (the line y=x), and the dashed line shows the linear model for the data.
[Panel annotations: within DUP, P = 0.99; near DUP, P = 0.5; within DUP, P = 6.2e-05; near DUP, P = 0.64; within DUP, P = 0.0013; near DUP, P = 0.86]
[Partial correlation annotations: ( ... | rearr): τ = -0.19, P = 0.038; ( ... | SNPs): τ = -0.22, P = 0.016]