Abstract
It has been hypothesized that individually-rare hidden structural variants (SVs) could account for a significant fraction of variation in complex traits. Here we identified more than 20,000 euchromatic SVs from 14 Drosophila melanogaster genome assemblies, of which ~40% are invisible to high specificity short-read genotyping approaches. SVs are common, with 31.5% of diploid individuals harboring an SV in genes larger than 5 kb, and 24% harboring multiple SVs in genes larger than 10 kb. SVs segregate at lower minor allele frequencies than amino acid polymorphisms, suggesting that SVs are more deleterious. We show that a number of functionally important genes harbor previously hidden structural variants likely to affect complex phenotypes. Furthermore, SVs are overrepresented in candidate genes associated with quantitative trait loci mapped using the Drosophila Synthetic Population Resource. We conclude that SVs are ubiquitous, frequently constitute a heterogeneous allelic series, and can act as rare alleles of large effect.
Introduction
Understanding the molecular basis of heritable variation in complex traits is of central importance to evolution, animal and plant breeding, and medical genetics1,2,3,4. Over the last decade, short read genomic data (50–150 bp reads) appropriate for characterizing SNPs and small indels in non-repetitive genomic regions has accumulated at an exponential rate5,6. This in turn has catalyzed hundreds of quantitative trait locus (QTL) mapping and genome-wide association (GWAS) studies in model organisms, humans, and agriculturally important animals and plants7,8,9. Despite these efforts, for most traits, GWAS hits only explain a small fraction of known trait heritability10,11. One hypothesis accounting for hidden genetic variation is that individually rare hidden mutations that alter genome structure make significant contributions to complex trait variation11,12. These structural variants (SVs) change the genome via duplication, deletion, transposition, and inversion of sequences. This hypothesis is attractive since rare causative variants are difficult to detect with GWAS13. Moreover, genotyping approaches based on short reads or microarrays fail to detect a significant number of SVs14,15. Finally, it is reasonable to assume that SVs are on average likely to be more deleterious and deleterious more often than SNPs16,17,18,19.
High quality genomes provide a direct and reliable path to comprehensive identification of SVs15,20,21. To achieve this goal, we assembled reference-quality genomes for fourteen geographically diverse Drosophila melanogaster strains (Fig. 1a) using single molecule real time sequencing22. These assemblies are contiguous and complete (N50 18.9–22.3 Mb; BUSCO23 99.9–100%) (Table 1, Fig. 1b, Supplementary Table 1), making them comparable to the D. melanogaster reference genome, arguably the best metazoan genome assembly. Thirteen of the fourteen strains are near isogenic founders of the Drosophila Synthetic Population Resource (DSPR)24, a large set of advanced intercross recombinant inbred lines (RILs) designed to map QTLs25. We also assembled the genome of Oregon-R, an outbred stock widely used as a “wild-type” strain both by Drosophila geneticists and by large scale community projects like modENCODE26,27,28.
Using these reference quality genome assemblies, we show that SVs are common in D. melanogaster genes, with almost one third of diploid individuals harboring an SV in genes larger than 5 kb, and more than a third of burdened genes carrying multiple SVs. The site frequency spectrum (SFS) of SV alleles relative to amino acid polymorphisms suggests that SVs are under stronger purifying selection, and thus are more likely to impact phenotype than nonsynonymous SNPs. We further show that a number of functionally important genes harbor previously hidden SVs likely to affect complex phenotypes (e.g., Cyp6g1, Drsl5, Cyp28d1, Cyp28d2, InR, and Gss1&2). Finally, we find that SVs are overrepresented in candidate genes associated with mapped QTL. We conclude that SVs are pervasive in genomes, frequently manifest as heterogeneous allelic series affecting the same gene, and exhibit all the properties that make them prime candidates for being rare alleles of large effect.
Results
De novo assembly reveals novel functionally important SVs
Our assemblies are extremely contiguous, with the majority of each chromosome arm represented by a single contig (Fig. 1b). We also close the two remaining gaps in the major chromosome arms of the euchromatic D. melanogaster reference genome29 in all our assemblies (Supplementary Figs. 1–3). We identified SVs by comparing each assembly to the reference ISO1 genome15, focusing our attention on large (>100 bp) euchromatic SVs (Supplementary Table 2), and ignoring heterochromatin regions as they are gene poor30 and require specialized assembly approaches and extensive validation31. Manual inspection of 267 randomly sampled SVs indicates that mis-annotations are rare (3/267), and occur in ambiguously aligned structurally complex genomic regions (Supplementary Fig. 5; see Methods). We discovered 7347 TE insertions, 1178 duplication CNVs, 4347 indels, and 62 inversions in the 94.5 Mb of euchromatin spanning the five major chromosome arms across the DSPR founders (Fig. 1c–d). Each founder strain exhibits 637 TE insertions, 134 duplications, 694 indels, and 7 inversions on average (Table 2). We estimate that 36% of non-reference TEs, 26% of deletions, 48% of insertions, and 60% of duplication CNVs are not routinely detected using high coverage paired end Illumina reads and high specificity SV genotyping methods15 (Supplementary Fig. 6).
We uncover many examples of previously hidden SVs predicted to affect complex traits. Extensive evidence links complex SV alleles of the cytochrome P450 gene Cyp6g1 to varying levels of DDT resistance32,33. Despite extensive study of this locus, we discovered three new SV alleles involving TE insertions that likely have different functional consequences (Supplementary Fig. 7a, b). Similarly, we discovered a previously hidden tandem duplication of the antifungal, innate immunity gene Drsl534 that exhibits >1000-fold higher expression relative to its single copy counterpart in line A4 (Supplementary Fig. 8a, b). Read pair orientation and split read methods failed to detect this mutation because one allele bears a 5 kb spacer sequence derived from the first exon and intron of Kst inserted between the gene copies (Supplementary Fig. 8a). Another duplicate allele of Drsl5 contains a Tirant LTR retrotransposon inserted into the same spacer sequence (Supplementary Fig. 8a). We also easily detect the two SV mutations underlying the D. melanogaster recessive visible genes cinnabar35 (cn) and speck (sp) present in the ISO1 reference genome36 (Supplementary Figs. 9 and 10). In the case of sp a large insertion in the reference genome is mis-annotated as an intron. For cn a large exonic deletion is not identified as such36. Both alleles are likely knock-outs.
SVs are deleterious
Most TEs and duplicates are present in only one strain (Fig. 1e), with the folded SFS of the TEs and duplicates exhibiting a greater proportion of rare variants than non-synonymous SNPs (nsSNPs) assayed in the same strains (Fig. 1e; p-value < 1 × 10^−10, χ² test between frequency classes of these two types of SVs and non-synonymous SNPs). Since nsSNPs were ascertained via high coverage short reads from virtually isogenic strains24, the low frequency skew of the site frequency spectrum of SVs relative to nsSNPs is unlikely due to SNP miscalls (see Methods). It is well-known that the SFS is affected by demographic history37,38, but selective constraints can be inferred with some confidence by comparing site classes from the same sample37,39,40,41. The skew toward rare variants we observe in our SVs relative to nsSNPs is strongly indicative of SVs being under stronger purifying selection, consistent with previous work in which SVs were ascertained with higher bias and/or errors16,18,19. Furthermore, TEs are more enriched for rare variants than duplicates, indicating that TE insertions as a class are more deleterious than duplicates (Fig. 1e; p-value < 1 × 10^−10, χ² test between frequency classes of TEs and duplicates). Under mutation selection balance models42,43, rare deleterious variants (minor allele frequency or MAF <1%) are predicted to contribute significantly to the variation in complex traits, yet are unlikely to be tagged by SNPs typically used in GWAS experiments10. Although demography can impact the proportion of variation due to rare deleterious alleles, recent population bottlenecks or growth44,45 tend to amplify the contribution of rare alleles to variation in a complex trait.
SVs are common in genes and enriched at mapped QTLs
To illustrate how common SV genotypes are in heterozygous individuals, we quantified the per gene SV burden per synthetic diploid D. melanogaster individual (Fig. 2a, b; each synthetic diploid is one of 78 possible pairings of the thirteen assembled DSPR founders). On average, SVs appear in 9.3% of genes in diploid individuals (1285/13761). Of those, more than a third of burdened genes in diploids (443/1285) bear multiple SV mutations. One or more SVs burden more than half of genes in and above the 20–35 kb range (Fig. 2a). Furthermore, individual genes bearing multiple SVs comprise more than a third of burdened genes between 20 and 35 kb in length and more than half of larger genes (Fig. 2b). Thus, although generally having rare minor allele frequencies, SVs are ubiquitous in the functional elements of the D. melanogaster genome.
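The synthetic-diploid tally described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the input layout (a mapping from founder to gene to a set of SV identifiers) and the function name `diploid_burden` are assumptions made for the example, and the toy identifiers are invented.

```python
from itertools import combinations

def diploid_burden(svs_by_founder):
    """Tally SV burden for every pairing of founder haplotypes.

    svs_by_founder: {founder: {gene: set of SV ids}} (hypothetical layout).
    Returns {(founder_a, founder_b): (burdened_genes, multi_sv_genes)}, where a
    gene is "burdened" if the pair carries at least one SV in it and "multi"
    if the pair carries two or more distinct SVs in it.
    """
    result = {}
    for a, b in combinations(sorted(svs_by_founder), 2):
        genes = set(svs_by_founder[a]) | set(svs_by_founder[b])
        burdened = multi = 0
        for gene in genes:
            # Union of SV ids, so an SV shared by both haplotypes counts once.
            distinct = svs_by_founder[a].get(gene, set()) | svs_by_founder[b].get(gene, set())
            if distinct:
                burdened += 1
            if len(distinct) >= 2:
                multi += 1
        result[(a, b)] = (burdened, multi)
    return result

# Toy input with invented SV ids; thirteen founders would yield 78 pairings.
toy = {
    "A1": {"InR": {"sv1"}, "Gss1": {"sv2"}},
    "A2": {"InR": {"sv3"}},
    "A3": {},
}
pairs = diploid_burden(toy)
```

Averaging the two counts over all pairings, and dividing by the number of annotated genes, would give per-individual proportions of the kind reported in the text.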
Although hypotheses employing SVs to explain missing heritability have been proposed10,46, the systematic under-identification of SVs via short read- and microarray-based genotyping21 limits their explanatory power. Using our comprehensive SV map, we measured the prevalence of SVs at the candidate genes reported in eight complex trait mapping experiments employing DSPR (Supplementary Data 1). We consider only genes in mapped QTLs explicitly cited by the authors of the original QTL studies (Supplementary Data 1; see Methods). In total, we identified 31 candidate genes of which 15 (48.4%) possess at least one SV in one founder strain, whereas only 23.4% (3237/13,830) of other D. melanogaster genes harbor SVs (p = 0.0023; Fisher’s exact test). The 31 candidate genes from QTL mapping work (Supplementary Data 1) are more than twice as large as an average Drosophila gene (12.2 vs 5.4 kb, Wilcoxon rank sum test; p = 1.4 × 10^−4; Supplementary Data 2). We then tested whether QTL candidate genes are enriched for SVs independent of gene size. To control for the observed elevated size distribution of QTL candidate genes, we randomly drew 100,000 Monte Carlo gene samples matching the candidate gene length distribution in order to generate a null distribution for the number of SVs we expect to observe. In the Monte Carlo sample, 10.4 genes are expected to harbor SVs by chance whereas we observe 15, an enrichment of 45% (p = 0.026; Fig. 2c). Similarly, enrichment is also observed when burden is measured as density (# burdened genes/bp) in comparison of QTL candidate genes to the remaining genes (39.6 burdened genes/Mbp of genic DNA vs 27.3 burdened genes/Mbp of genic DNA, p = 0.017; Supplementary Data 2). Finally, candidate genes also exhibit greater SV mutation density compared to the expected SV density in the Monte Carlo samples (205.8 SVs/Mbp vs 138.4 SVs/Mbp, p = 0.047; Supplementary Data 2), an enrichment of 49%.
These observations suggest that SVs may be disproportionately associated with QTL candidate genes even after accounting for gene size. However, due to the limited sample size of QTL candidate genes, the power to detect even strong effects (e.g. the 45% enrichment reported here) is limited. Consequently, further QTL studies identifying candidate genes would substantially improve our understanding of this effect. These observations suggest that the contribution of rare SVs of large effect to complex traits could be pervasive.
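The size-naive comparison of candidate genes with the rest of the genome can be reproduced approximately from the counts quoted above (15 of 31 candidate genes versus 3237 of 13,830 other genes). The sketch below computes a one-sided Fisher's exact test directly from the hypergeometric distribution. The function name `fisher_one_sided` is an assumption for the example, and because the text does not state the sidedness of the reported test, the value obtained here need not equal the published p-value exactly.

```python
from math import comb

def fisher_one_sided(k, n, K, N):
    """Upper-tail probability P(X >= k) for a hypergeometric draw.

    N: total number of genes, K: genes carrying at least one SV,
    n: number of candidate genes, k: candidate genes carrying an SV.
    """
    upper = min(n, K)
    # Exact integer arithmetic; converted to a float only at the final division.
    tail = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return tail / comb(N, n)

# Counts taken from the text: 15/31 candidate genes vs 3237/13,830 other genes.
candidates, candidates_with_sv = 31, 15
others, others_with_sv = 13830, 3237
p_value = fisher_one_sided(
    candidates_with_sv,
    candidates,
    candidates_with_sv + others_with_sv,
    candidates + others,
)
print(f"one-sided Fisher p = {p_value:.4g}")
```

This test ignores gene length, which is why the text follows it with a length-matched Monte Carlo comparison.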
Functional structural variation at mapped QTL
GWAS experiments are poorly powered to detect the segregation of multiple alleles at a causal gene43. Although allelic heterogeneity can be readily identified in multi-parent panels (MPPs) via QTL mapping25, mapping resolution is often poor, forcing investigators to identify mutations of obvious functional significance in the genomic interval most likely to harbor the QTL. Both GWAS and QTL mapping suffer if putatively causative SVs disproportionately escape detection by short read sequencing15. This limitation can be readily solved in MPPs, as the SV genotypes of the large panel of mapping lines can be imputed from de novo assemblies of the much smaller number of MPP founders24.
A nicotine resistance mapping study employing the DSPR identified differentially expressed cytochrome P450 genes Cyp28d1 and Cyp28d2 as candidate causative genes at a mapped QTL, but proposed no causative mutations47. A previous de novo assembly of a single DSPR founder strain identified a resistant allele possessing tandem copies of the Cyp28d1 gene separated by an Accord LTR retrotransposon fragment15 (Fig. 3a; Supplementary Fig. 11). Our assemblies of additional DSPR founder strains reveal a total of seven structurally distinct alleles in this region, including additional candidate resistant alleles harboring gene duplications (Fig. 3a–b). For example, the resistant strain A2 carries a tandem duplication of a 15 kb segment containing both Cyp28d genes. The expression level of Cyp28d1 in the adult female heads of RILs bearing the A2 genotype is highest among all founder genotypes measured (Fig. 3c). Consistent with this, DSPR RILs bearing the A2 genotype show the highest average resistance to nicotine toxicity among the RILs derived from the A set of founders47 (Fig. 3b). This implies that the extra copies of Cyp28d1 and/or Cyp28d2 account for the increased expression and concomitant resistance to nicotine. Similarly, the B4 allele comprises a tandem duplication of a 6 kb segment, containing one extra copy of Cyp28d1 and a nearly complete copy of Cyp28d2 (Fig. 3a; Supplementary Fig. 11). RILs carrying the B4 genotype at the Cyp28d locus also show high resistance to nicotine, making the duplication a compelling candidate for the causative mutation. On the other hand, in two alleles, TE insertions disrupt Cyp28d gene structure. For instance, A1 has a duplication sharing the same breakpoints as B4, but a 4.7 kb F element inserted in the 5th exon disrupts the protein coding sequence of the second Cyp28d1 copy, likely rendering the copy nonfunctional (Fig. 3a; Supplementary Fig. 11).
Consistent with the hypothesis that the duplication causes increased nicotine resistance, the A1 genotype is more susceptible to nicotine than B4 (Fig. 3b). All of these SV alleles are singletons, and thus represent a hidden allelic series composed of individually rare alleles.
SVs may also affect genes central to life history traits. Expression levels of the insulin signaling pathway genes show substantial variation in F1 hybrids between DSPR panel B RILs and the A4 founder48. Among these is Insulin Receptor (InR), which plays a key role in several life history traits related to lifespan and is likely a key molecular mediator of the tradeoff between reproductive success and longevity49,50,51. Amino acid polymorphism in InR evolves under positive selection and some non-synonymous variants affect fecundity and stress response51,52. Expression variation of InR also affects body size, lifespan, and fecundity53,54, suggesting that natural cis-regulatory variation might also be under selection. We discovered a 215 bp fragment of a DOC6 element within a second intron enhancer55 (Fig. 4b, c) of InR on the AB8 haplotype, and this allele exhibits reduced gene expression relative to reference genotypes (Fig. 4b). This mutation potentially disrupts the enhancer (Supplementary Fig. 12), making it a plausible candidate for expression variation in InR. Another founder, A6, carries a 1,042 bp insertion of DMRT1A (LINE) in the 2nd intron and a 946 bp insertion of a fragment of PROTOP in the 3rd intron. Both insert within known cis-regulatory elements55 (Fig. 4b, c). Except for A2 and A6, all strains, including ISO1, harbor an FB-NOF element (FB{}1698) inside the first intron of InR (Fig. 4a). Like many genes, the first intron of InR possesses several transcription factor binding sites (TFBS), including those for factors Nejire and Caudal56 (Fig. 4c). The FB-NOF element is inserted within this dense cluster of TFBS and active enhancer marks (Fig. 4c). Furthermore, the FB element is segregating at high frequency in the strains discussed here (13/15), a North American population57 (125/170), and a French population58 (4/9), but is rare in populations derived from D. melanogaster’s ancestral range in Africa58,59 (Cameroon: 0/10, Rwanda: 1/27, Zambia: 10/139) (Fig. 4d). This raises the possibility that the FB element is more common in temperate cosmopolitan populations, similar to a previously described adaptive amino acid variant in InR52. In total, InR harbors a remarkable amount of potentially functional structural diversity. Including the variants described above, there are nine TE insertions and two deletions throughout the gene, many of which impinge on candidate regulatory regions or transcribed portions of the gene (Fig. 4a, c).
Public resources like modENCODE annotate molecular phenotypes (e.g., RNAseq, ChIPseq, DNase-seq) against reference genomes which are often genetically different than the strains assayed26,27,28,56. Canton-S (our DSPR founder A1) and Oregon-R are strains commonly used in phenotypic assays26,27,28, and we observe SVs segregating between these two strains and the reference (Table 2). Interpretation of functional genomics data such as RNA-seq can be misleading when gene copy number varies between strains. We explored the glutathione synthetase region (containing Gss1 and Gss2), which is just one example among hundreds in modENCODE that likely suffer from misleading annotations. A tandem duplication present in ISO1 has created two copies of Gss1 and Gss2, which are associated with toxin metabolism and linked to tolerance to arsenic60 and ethanol induced oxidative stress61. While this duplication segregates at high frequency in DSPR strains (9/13), it is absent in Oregon-R (Fig. 5a) and escapes detection via short-read methods. As a result, using transcript and ChIP data derived from Oregon-R (as used in modENCODE27,28) results in misleading annotations of the two copies in ISO1. Indeed, among the eight structurally distinct Gss alleles in our dataset, ISO1 is the sole representative of its allele (Fig. 5a). The two most common Gss alleles include one that contains only a single Gss gene (in four strains, including Oregon-R) and one carrying only a tandem duplication, creating the Gss1/Gss2 pair (in five strains, including Samarkand/AB8) (Fig. 5a). The remaining six alleles have SV genotypes represented by only a single individual in the sample. Collectively, this sample represents a haplotype network of structural variation involving five TE insertions, one duplication, one insertion comprising TE and simple repeats, and two non-TE indels. 
The single copy allele with a 5′ insertion of a 14 kb repetitive sequence comprising Nomad retrotransposon fragments exhibits the highest expression, followed by duplicate alleles, whereas single copy alleles and duplicate alleles with intronic TE insertions generally have the lowest expression levels (Fig. 5b).
Discussion
Despite claims that a significant proportion of complex trait variation in humans, model organisms, and agriculturally important animals and plants is likely due to rare SVs of large effect11, systematic inquiry of this hypothesis has been impeded by genotyping approaches attuned to SNP detection21. As reference quality de novo assemblies of population samples for eukaryotic model systems become increasingly cost-effective, methodical evaluation of the contribution of SVs to the genetic architecture of complex traits becomes feasible. Our comprehensive map of SVs in Drosophila provides the means to systematically quantify the contribution of rare SVs to heritable complex trait variation (Figs. 2a, 3a, 4a, and 5a). The value of comprehensive SV detection is underscored by the presence of SVs in ~50% of the candidate genes underlying mapped Drosophila QTL, and by the observation that a large fraction of Drosophila genes harbor multiple rare SV alleles. The genomes of humans and agriculturally important plants and animals harbor more SVs than Drosophila, and thus are likely more burdened with genic SVs.
The genetic heterogeneity hypothesis for variation in complex traits posits that a sizable fraction of human complex disease is associated with an allelic series consisting of individually rare causative mutations at several genes of large effect62. Furthermore, models for complex traits under either stabilizing63,64 or purifying selection42,43 with constant mutational input predict the existence of genes segregating several individually rare causative alleles that account for a sizable fraction of trait variation. We provide examples of SVs in genes of functional significance, and show that genes harboring SVs are overrepresented in a collection of QTL candidate genes. Hidden SVs are thus examples of collectively common but individually rare deleterious genetic variants predicted under the genetic heterogeneity hypothesis. Future de novo assemblies of other genomes, including humans, models, and agriculturally important species, would quantify the generality of observations from Drosophila.
Methods
DNA extraction
Genomic DNA was extracted from females following the protocols described previously22 and the genomic DNA was sheared using 10 plunges of a 21-gauge needle, followed by 10 of a 24-gauge needle (Jensen Global, Santa Barbara). All testing and research involving flies were performed in compliance with relevant ethical regulations. The SMRTbell template library was prepared following the manufacturer’s guidelines and sequenced using P6-C4 chemistry on the Pacific Biosciences RSII platform at the University of California Irvine Genomics High Throughput Facility. The total number of SMRTcells and base pairs sequenced, and read length metrics for each strain, are given in Table S6.
Genome assembly
The genomes were assembled following the approach described in Chakraborty et al.22. For all calculations of sequence coverage, a genome size of 130 Mbp is assumed (G = 130 × 10^6 bp). For each strain, we generated a hybrid assembly with DBG2OLC65 and the longest 30X PacBio reads, and a PacBio assembly with canu v1.366 (Supplementary Data 3). The paired end Illumina reads were obtained from King et al.24. The hybrid assemblies were merged with the PacBio only assemblies with quickmerge v0.222,67 (l = 2 Mb, ml = 20000, hco = 5.0, c = 1.5), with the hybrid assembly being used as the query. Because the PacBio assembly sizes were closer to the genome size of D. melanogaster, we added the contigs that were present only in the PacBio only assembly but not the hybrid assembly by performing a second round of quickmerge67. For the second round of quickmerge (l = 5 Mb, ml = 20000, hco = 5.0, c = 1.5), the PacBio assembly was used as the query and the merged assembly from the first merging round was used as the reference assembly. The resulting merged assembly was processed with finisherSC to remove the redundant sequences and to perform additional gap filling using raw reads68. The assemblies were then polished twice with quiver (SMRTanalysis v2.3.0p5) and once with Pilon v1.1669 using the same Illumina reads as used for the hybrid assemblies.
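One plausible reading of "longest 30X PacBio reads" is to sort reads by length and keep the longest ones until their summed length reaches 30 × G. The sketch below illustrates that arithmetic only. The function name and the stopping rule are assumptions for the example, and the read selection actually used follows the cited protocol22.

```python
def longest_reads_for_coverage(read_lengths, genome_size=130_000_000, target_cov=30):
    """Pick the longest reads until their total length reaches target coverage.

    read_lengths: iterable of read lengths in bp.
    Returns (chosen_lengths, achieved_coverage).
    """
    needed = genome_size * target_cov
    chosen, total = [], 0
    for length in sorted(read_lengths, reverse=True):
        if total >= needed:
            break
        chosen.append(length)
        total += length
    return chosen, total / genome_size

# Toy example with a 100 bp "genome" and a 2X target.
chosen, coverage = longest_reads_for_coverage([50, 10, 80, 90, 30], genome_size=100, target_cov=2)
```

With the assumed G of 130 Mbp, a 30X subset corresponds to roughly 3.9 Gbp of sequence.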
Comparative scaffolding
We scaffolded the contigs for each assembly based on the scaffolds from the reference assembly70, following a previously described approach15. Briefly, TEs and repeats in the assemblies were masked using RepeatMasker (v4.0.7) and the masked contigs were aligned to the repeat-masked chromosome arms (X, 2L, 2R, 3L, 3R, and 4) of the D. melanogaster ISO1 assembly using MUMmer71. After filtering out alignments caused by repeats (delta-filter -1), contigs were assigned to specific chromosome arms on the basis of the mutually best alignment. The scaffolded contigs were joined by 100 Ns, a convention representing assembly gaps. The unscaffolded sequences were named with a “U” prefix.
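The joining step can be illustrated with a short sketch. It assumes contig placements have already been derived from the filtered alignments, orders contigs by reference start position, and ignores orientation. The ordering rule, the function name `scaffold`, the `placements` layout, and the exact form of the "U" prefix are all assumptions made for the example.

```python
def scaffold(contigs, placements, gap=100):
    """Join placed contigs per chromosome arm with a run of N characters.

    contigs: {name: sequence}
    placements: {name: (arm, reference_start)} (hypothetical, e.g. parsed
        from filtered whole-genome alignments)
    Returns (scaffolds_by_arm, unplaced_sequences).
    """
    by_arm = {}
    for name, (arm, ref_start) in placements.items():
        by_arm.setdefault(arm, []).append((ref_start, name))
    spacer = "N" * gap  # gap convention: 100 Ns between adjacent contigs
    scaffolds = {
        arm: spacer.join(contigs[name] for _, name in sorted(hits))
        for arm, hits in by_arm.items()
    }
    # Contigs without a placement keep their sequence and get a "U" prefix.
    unplaced = {"U_" + name: seq for name, seq in contigs.items() if name not in placements}
    return scaffolds, unplaced

toy_contigs = {"c1": "AAA", "c2": "CCC", "c3": "GGG"}
toy_placements = {"c2": ("2L", 500), "c1": ("2L", 10)}
scaffolds, unplaced = scaffold(toy_contigs, toy_placements, gap=2)
```

A fuller version would also reverse-complement contigs that align to the minus strand before joining.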
BUSCO analysis
We ran BUSCO (v3.02)72 on the Pilon polished pre-scaffolding assemblies to evaluate the completeness of all the assemblies relative to the ISO1 release 6 (r6.13) assembly. We used both the arthropoda and diptera datasets for the BUSCO evaluation. For the arthropoda database, three orthologs (EOG090X0BNZ, EOG090X0M0J, and EOG090X049L) were not found in any of the 15 strains (ISO1, Oregon-R, and 13 DSPR founders). Further inspection of these orthologs revealed that they are present in ISO1 even though the BUSCO analysis misses them when applied to ISO1 (EOG090X0BNZ is CG3223, EOG090X0M0J is Pa1, and EOG090X049L is CG40178). Consequently, we removed these three genes from consideration as uninformative.
Variant detection
For variant detection, we aligned each DSPR assembly individually to the ISO1 release 6 assembly (release 6.13)70 using nucmer (nucmer --maxmatch --noextend)71. We identified and classified the variants using SVMU 0.2beta (Structural variants from MUMmer) (n = 10)15. SVMU classifies the structural differences between two assemblies as insertion, deletion, duplication, and inversion based on whether the DSPR assemblies have longer, shorter, more copies of, or inverted sequence, respectively, with respect to the reference genome. The variant calls for individual genomes were combined using bedtools merge73 and converted into a VCF file using a custom script (https://github.com/mahulchak/dspr-asm). TE insertions were identified by examining the overlap between RepeatMasker identified TEs and SVMU insertion calls using bedtools, requiring that at least 90% of a RepeatMasker TE annotation overlap with an SVMU insertion annotation. The 12.8% of SV mutations for which annotation was complicated by secondary mutations were flagged as “complex” (CE = 2 in the VCF file). Additionally, the 16.3% of SVs that were located within 5 kb of a complex SV were often part of a complex event and were also assigned a tag (CE = 1) to differentiate them from the unambiguously annotated SVs (CE = 0).
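The 90% overlap criterion for calling an insertion a TE insertion can be written out explicitly. The sketch below is a plain-Python stand-in for the bedtools overlap step, not the command the authors ran. It assumes both interval sets are in the same coordinate system and use half-open (chrom, start, end) tuples, and the function name `te_insertion_calls` is hypothetical.

```python
def te_insertion_calls(te_annotations, insertion_calls, min_fraction=0.9):
    """Pair insertion calls with TE annotations they largely contain.

    Both inputs are lists of (chrom, start, end) half-open intervals.
    A pair is reported when at least min_fraction of the TE annotation
    lies inside the insertion call.
    """
    pairs = []
    for ins in insertion_calls:
        for te in te_annotations:
            te_length = te[2] - te[1]
            if te[0] != ins[0] or te_length <= 0:
                continue
            overlap = min(te[2], ins[2]) - max(te[1], ins[1])
            if overlap >= min_fraction * te_length:
                pairs.append((ins, te))
    return pairs

tes = [("2L", 100, 200)]
calls = [("2L", 95, 195), ("2L", 150, 400), ("3R", 100, 200)]
te_hits = te_insertion_calls(tes, calls)
```

The quadratic loop is fine for illustration; genome-scale inputs would normally use sorted intervals or an interval tree.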
Genotype validation
To determine the genotyping error rate, a set of 50 randomly selected simple (CE = 0) SVs obtained from SVMU was manually inspected on a UCSC genome browser representation of the multiple genome alignment of the 15 genomes (http://goo.gl/LLpoNH). Furthermore, to estimate the genotyping accuracy of the SVs occurring in the vicinity of the complex mutations, where mutation annotation is complicated by alignment ambiguities, we manually inspected 217 SVs occurring within 20 kb of 50 randomly selected complex (CE = 2) SVs. Of the 217 SVs near complex mutations, 3 were absent in the UCSC browser, as were 0 of the 50 simple SVs; the absent SVs are therefore likely mis-annotated by our pipeline. The mis-annotated SVs (insertion in A1 and tandem array CNV in A7) are located in a complex, repetitive, structurally variable genomic region on chromosome 3L (3L:7669500-7679100) (Supplementary Fig. 5).
Comparing SV genotypes from de novo assemblies to short read only calls
TE genotypes for the founders18 were downloaded from flyrils.org and the insertion coordinates were lifted over to the current release (release 6) of the reference genome70 using the UCSC liftOver tool74. For detection of the duplicates, we have previously found that a discordant read pair based method (Pecnv)75 was comparable to split read mapping76 and more reliable than methods based on coverage alone15,77, so we used Pecnv. Pecnv was run using the settings described before15. Because SVMU reports tandem duplicate CNVs as insertions (with appropriate CNV tags to separate them from TE and other insertions) and Pecnv reports the sequence range being duplicated, the SVMU CNV insertion coordinates were extended by 100 bp before the comparison (bedtools intersect) between Pecnv output and SVMU output was conducted. The non-TE indel genotypes were obtained from Pindel output (the “LI” and “D” events) using the commands described previously15. For determining the population frequency of indel SVs (e.g. the reference FB element in InR), Pindel output based on the alignment bam files was used. We only estimate the false negative rate of short read only callers, but note that these methods also generate false positive SV calls.
Gene expression analysis
The preprocessed expression data for female heads78 and IIS/TOR expression data48 from whole bodies were downloaded from www.flyrils.org. Expression QTL analysis (Supplementary Fig. 11) for Cyp28d1 and Gss1 using the head expression data was performed using the R package DSPRqtl following the instructions provided in the manual (DSPRscan, model = gene ~ 1, design = “ABcross”). When expression data for multiple isoforms were present, only the expression data for the longest transcript expressed in the head were used. The genotype values at the eQTL were determined using the function DSPRpeaks included in the DSPRqtl package. No eQTL were found for InR, so the genotype values for the InR expression data were obtained by assigning the founder genotypes to the RILs used in the IIS/TOR expression dataset, using the posterior probabilities of the forward-backward decoding of the HMM for the panel B RILs available on www.flyrils.org. Drsl5 expression levels in A4 and A3 were obtained from a publicly available RNAseq dataset47.
Comparison of site frequency spectra
The histogram of allele frequencies (site frequency spectrum or SFS) was collated for four categories: synonymous SNPs, non-synonymous SNPs, duplicate CNVs, and TE insertions. The frequencies of SNPs were collected from the VCF file24 using vcftools and bcftools79,80. The frequencies of SVs were collected from column 4 of the combined SVMU output for the TE insertions and duplication CNVs from all DSPR strains (https://github.com/mahulchak/dspr-asm). Complex mutations (CE = 1 and CE = 2) were excluded from the analysis. Let N be the sample size and x_i be the number of sites in frequency class i, where 0 < i < N. The SFS was “folded”, meaning we focused attention on the minor allele frequency (MAF), or y_i = minimum (x_i, N − x_i). Pairwise comparisons between different SFS site categories were conducted using the χ² test on allele frequencies and site categories. For allele frequencies, two types of classifications were used: (1) every y_i for 0 < i < N (N − 1 df); and (2) considering singletons versus the other frequency categories, or y_i for i = 1 versus 2 ≤ i < N (1 df).
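A compact sketch of the singleton-versus-rest comparison follows. It folds per-site allele counts into minor allele counts, builds a 2 × 2 table (site category by singleton status), and computes an uncorrected Pearson χ² statistic with its 1 df p-value. Folding per site is one common convention and is an assumption here, as are the function names and the toy spectra. The authors' exact computation may differ, for example in the use of a continuity correction.

```python
from collections import Counter
from math import erfc, sqrt

def folded_sfs(allele_counts, n):
    """Histogram of minor allele counts for segregating sites among n haplotypes."""
    return Counter(min(c, n - c) for c in allele_counts if 0 < c < n)

def singleton_chi2(sfs_a, sfs_b):
    """Pearson chi-square (1 df, no continuity correction) on singletons vs rest.

    Assumes every row and column total of the 2 x 2 table is non-zero.
    """
    table = []
    for sfs in (sfs_a, sfs_b):
        singletons = sfs.get(1, 0)
        table.append([singletons, sum(sfs.values()) - singletons])
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 df.
    return chi2, erfc(sqrt(chi2 / 2))

# Invented spectra: category A is mostly singletons, category B is not.
sfs_te_like = Counter({1: 90, 2: 10})
sfs_snp_like = Counter({1: 50, 2: 50})
chi2, p = singleton_chi2(sfs_te_like, sfs_snp_like)
```

The full N − 1 df comparison in the text extends the same table to one column per frequency class.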
Candidate genes associated with mapped QTL
The candidate genes from DSPR QTL papers were selected based on the following criteria: (1) the gene falls within the QTL peak; (2) additional functional data is cited by the authors of the respective study to highlight the gene; (3) the functional information cited by the authors did not use knowledge about structural variation affecting the candidate locus (Supplementary Data 1). The additional data can either be expression data collected by the authors or existing functional data known about the genes. Only 44 candidate genes from eight studies fulfilled these criteria, but three of these fell outside the euchromatic boundaries used here (Supplementary Table 1). Hence only 41 candidate genes were included in the SV enrichment analysis. Of the 41 candidate genes identified, 10 were at a single locus (GstE1-10). As a result, we carry out our analysis treating GstE1-10 as either a single gene or ten different genes (the qualitative outcome is unchanged). To test if candidate genes are longer than average genes, we considered all genes (Supplementary Data 2) as well as the dataset excluding the GstE1-10 genes (Supplementary Data 2). The lengths of candidate genes were compared against the rest of the genome using a Mann–Whitney U test.
Candidate gene enrichment analysis
To determine whether candidate genes are enriched for SVs relative to the rest of the genome, we analyzed the candidate gene dataset without the GstE1-10 gene array (Table S4). The genes comprising GstE1-10 were excluded to avoid confounding effects of the complex structure of the locus and the identity of GstE1, GstE3, GstE5, and GstE6. As described below, the conclusions of the enrichment analysis do not depend on this locus, so we report the results obtained when excluding it. A Fisher’s Exact Test was applied to the counts in categories of candidate gene vs. rest of the genome and SV-free vs. SV-burdened genes. To account for the candidate genes being longer than genes in the rest of the genome, we performed a Monte Carlo sampling of the whole genome according to the histogram of gene sizes in the candidate gene list (Supplementary Data 2). We sampled from the genome by drawing from each gene length bin according to a hypergeometric distribution, where n is the number of candidate genes in the candidate bin, K is the number of SV-burdened genes in the genome bin, and N − K is the number of SV-free genes in the genome bin (Supplementary Data 2). We then tallied the number of observed SVs across all bins. We repeated this 100,000 times to construct a Monte Carlo distribution of the SV burden expected of genes matching the size distribution observed in the actual candidate genes. This led to 100,000 simulated size distributions that matched the observed size distribution (every Mann–Whitney U p-value of Monte Carlo sample lengths compared against the observed candidate lengths >0.1). Expected density of SV-burdened genes (number of burdened genes per Mbp of gene spans) and expected SV density (total number of SVs per Mbp of gene spans) were also calculated from the Monte Carlo samples.
Although we present the enrichment results from analysis performed on the gene set without the GstE1-10 array, inclusion of the array either as a single 13 kb locus or as individual genes does not alter the conclusion of the enrichment analysis (single Gst locus: enrichment p-value = 0.021, length p-value = 6.5 × 10−5; individual Gst genes: enrichment p-value = 2.9 × 10−3, length p-value = 0.034).
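The length-matched Monte Carlo procedure described in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: each length bin is summarized as a tuple (n, K, N) with n candidate genes, K SV-burdened genes, and N total genes in the genome bin; the sketch counts SV-burdened genes per replicate rather than individual SVs; and the empirical p-value is taken as the fraction of replicates at least as large as the observed count. The names `simulate_burden` and `empirical_p` are hypothetical.

```python
import random


def simulate_burden(bins, n_reps=100_000, seed=0):
    """Monte Carlo distribution of SV-burdened gene counts.

    bins: list of (n, K, N) tuples, one per gene length bin, where
      n = candidate genes in the bin,
      K = SV-burdened genes in the genome bin,
      N = all genes in the genome bin (so N - K are SV-free).
    Drawing n genes without replacement from a bin and counting the
    burdened ones is one hypergeometric draw.
    """
    rng = random.Random(seed)
    # Represent each genome bin as 1s (burdened) and 0s (SV-free).
    pools = [([1] * K + [0] * (N - K), n) for n, K, N in bins]
    totals = []
    for _ in range(n_reps):
        total = sum(sum(rng.sample(pool, n)) for pool, n in pools)
        totals.append(total)
    return totals


def empirical_p(totals, observed):
    """Fraction of replicates with a burden >= the observed burden."""
    return sum(t >= observed for t in totals) / len(totals)
```

For example, `empirical_p(simulate_burden(bins), observed)` would give a one-sided enrichment p-value for a hypothetical list of `bins` and an `observed` count of burdened candidate genes. Sampling without replacement within bins is what keeps each replicate matched to the candidate length distribution.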
Calculating the SV burden in genes in diploid individuals
To calculate the distribution of SV burden expected in diploids, the haploid genotype of each founder was paired with that of every other founder, for a total of 78 possible pairings. For each of these diploid pairings, the number of unique SV mutations for each gene in the genome was recorded. A mutation is said to affect a gene if it falls within the gene span, which is defined as the nucleotides between the start and end coordinates of the gene feature in the D. melanogaster release 6.16 gff file36. The number of SV mutations overlapping a gene in a given diploid combination is considered that gene’s multiplicity for that combination. Any gene with a multiplicity ≥1 for a particular diploid pairing is considered SV-burdened for that diploid.
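The pairing and multiplicity calculation can be illustrated with a short sketch. It assumes, for illustration only, that the SVs overlapping each gene span have already been assigned per founder and are stored as a mapping from founder name to gene name to a set of SV identifiers; the function names and data layout are hypothetical. With 13 founders, the pairwise combinations give the 78 pairings mentioned above.

```python
from itertools import combinations


def diploid_multiplicity(haploid_svs):
    """Per-gene SV multiplicity for every pairing of two founders.

    haploid_svs: {founder: {gene: set of SV identifiers}}.
    Returns {(founder_a, founder_b): {gene: multiplicity}}, where the
    multiplicity is the number of unique SVs across the two haplotypes.
    Genes with no SV in either haplotype are omitted (multiplicity 0).
    """
    result = {}
    for a, b in combinations(sorted(haploid_svs), 2):
        genes = haploid_svs[a].keys() | haploid_svs[b].keys()
        result[(a, b)] = {
            g: len(haploid_svs[a].get(g, set()) | haploid_svs[b].get(g, set()))
            for g in genes
        }
    return result


def burdened_genes(multiplicity, min_count=1):
    """Genes whose multiplicity is at least min_count in one diploid."""
    return {g for g, m in multiplicity.items() if m >= min_count}
```

Taking the set union means an SV shared by both haplotypes is counted once, which matches counting unique mutations per gene; `burdened_genes(m, min_count=2)` would pick out genes carrying multiple SVs in a given diploid.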
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
All scaffolded assemblies and the raw data (HDF5 files and their respective metadata) have been deposited in NCBI under the BioProject accession PRJNA418342. All raw SV outputs and processed data are available at https://github.com/mahulchak/dspr-asm.
Code availability
All scripts and code have been deposited on GitHub and are freely accessible from https://github.com/mahulchak/dspr-asm.
References
Mauricio, R. Mapping quantitative trait loci in plants: uses and caveats for evolutionary biology. Nat. Rev. Genet. 2, 370–381 (2001).
Mackay, T. F., Stone, E. A. & Ayroles, J. F. The genetics of quantitative traits: challenges and prospects. Nat. Rev. Genet. 10, 565–577 (2009).
Goddard, M. E. & Hayes, B. J. Mapping genes for complex traits in domestic animals and their use in breeding programmes. Nat. Rev. Genet. 10, 381–391 (2009).
Stranger, B. E., Stahl, E. A. & Raj, T. Progress and promise of genome-wide association studies for human complex trait genetics. Genetics 187, 367–383 (2011).
Shendure, J. & Ji, H. Next-generation DNA sequencing. Nat. Biotechnol. 26, 1135–1145 (2008).
Bansal, V. et al. Accurate detection and genotyping of SNPs utilizing population sequencing data. Genome Res. 20, 537–545 (2010).
Varshney, R. K., Nayak, S. N., May, G. D. & Jackson, S. A. Next-generation sequencing technologies and their implications for crop genetics and breeding. Trends Biotechnol. 27, 522–530 (2009).
Day-Williams, A. G. & Zeggini, E. The effect of next-generation sequencing technology on complex trait research. Eur. J. Clin. Invest. 41, 561–567 (2011).
Davey, J. W. et al. Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Nat. Rev. Genet. 12, 499–510 (2011).
Manolio, T. A. et al. Finding the missing heritability of complex diseases. Nature 461, 747–753 (2009).
Eichler, E. E. et al. Missing heritability and strategies for finding the underlying causes of complex disease. Nat. Rev. Genet. 11, 446–450 (2010).
Frazer, K. A., Murray, S. S., Schork, N. J. & Topol, E. J. Human genetic variation and its contribution to complex traits. Nat. Rev. Genet. 10, 241–251 (2009).
Spencer, C. C. A., Su, Z., Donnelly, P. & Marchini, J. Designing genome-wide association studies: sample size, power, imputation, and the choice of genotyping chip. PLoS Genet. 5, e1000477 (2009).
Huddleston, J. & Eichler, E. E. An incomplete understanding of human genetic variation. Genetics 202, 1251–1254 (2016).
Chakraborty, M. et al. Hidden genetic variation shapes the structure of functional elements in Drosophila. Nat. Genet. 50, 20–25 (2018).
Emerson, J. J., Cardoso-Moreira, M., Borevitz, J. O. & Long, M. Natural selection shapes genome-wide patterns of copy-number polymorphism in Drosophila melanogaster. Science 320, 1629–1631 (2008).
Conrad, D. F. et al. Origins and functional impact of copy number variation in the human genome. Nature 464, 704–712 (2010).
Cridland, J. M., Macdonald, S. J., Long, A. D. & Thornton, K. R. Abundance and distribution of transposable elements in two Drosophila QTL mapping resources. Mol. Biol. Evol. 30, 2311–2327 (2013).
Rogers, R. L. et al. Tandem duplications and the limits of natural selection in Drosophila yakuba and Drosophila simulans. PLoS. One. 10, e0132184 (2015).
Huddleston, J. et al. Discovery and genotyping of structural variation from long-read haploid genome sequence data. Genome Res. 27, 677–685 (2017).
Alkan, C., Coe, B. P. & Eichler, E. E. Genome structural variation discovery and genotyping. Nat. Rev. Genet. 12, 363–376 (2011).
Chakraborty, M., Baldwin-Brown, J. G., Long, A. D. & Emerson, J. J. Contiguous and accurate de novo assembly of metazoan genomes with modest long read coverage. Nucleic Acids Res. https://doi.org/10.1093/nar/gkw654 (2016).
Simao, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
King, E. G. et al. Genetic dissection of a model complex trait using the Drosophila Synthetic Population Resource. Genome Res. 22, 1558–1566 (2012).
Long, A. D., Macdonald, S. J. & King, E. G. Dissecting complex traits using the Drosophila Synthetic Population Resource. Trends Genet. 30, 488–495 (2014).
The modENCODE Consortium et al. Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science 330, 1787–1797 (2010).
Graveley, B. R. et al. The developmental transcriptome of Drosophila melanogaster. Nature 471, 473–479 (2011).
Schwartz, Y. B. et al. Nature and function of insulator protein binding sites in the Drosophila genome. Genome Res. 22, 2188–2198 (2012).
Chang, C.-H. & Larracuente, A. M. Heterochromatin-enriched assemblies reveal the sequence and organization of the Drosophila melanogaster Y chromosome. Genetics 211, 333–348 (2019).
Smith, C. D., Shu, S. Q., Mungall, C. J. & Karpen, G. H. The Release 5.1 annotation of Drosophila melanogaster heterochromatin. Science 316, 1586–1591 (2007).
Khost, D. E., Eickbush, D. G. & Larracuente, A. M. Single-molecule sequencing resolves the detailed structure of complex satellite DNA loci in Drosophila melanogaster. Genome Res. 27, 709–721 (2017).
Daborn, P. J. et al. A single P450 allele associated with insecticide resistance in Drosophila. Science 297, 2253–2256 (2002).
Schmidt, J. M. et al. Copy number variation and transposable elements feature in recent, ongoing adaptation at the Cyp6g1 locus. PLoS Genet. 6, e1000998 (2010).
Yang, W. Y. et al. Functional divergence of six isoforms of antifungal peptide Drosomycin in Drosophila melanogaster. Gene 379, 26–32 (2006).
Warren, W. D., Palmer, S. & Howells, A. J. Molecular characterization of the cinnabar region of Drosophila melanogaster: identification of the cinnabar transcription unit. Genetica 98, 249–262 (1996).
dos Santos, G. et al. FlyBase: introduction of the Drosophila melanogaster Release 6 reference genome assembly and large-scale migration of genome annotations. Nucleic Acids Res. 43, D690–D697 (2015).
Tajima, F. Statistical-method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123, 585–595 (1989).
Slatkin, M. & Hudson, R. R. Pairwise comparisons of mitochondrial-DNA sequences in stable and exponentially growing populations. Genetics 129, 555–562 (1991).
Andolfatto, P. Adaptive evolution of non-coding DNA in Drosophila. Nature 437, 1149–1152 (2005).
Williamson, S. H. et al. Simultaneous inference of selection and population growth from patterns of variation in the human genome. Proc. Natl. Acad. Sci. U. S. A. 102, 7882–7887 (2005).
Akashi, H. & Schaeffer, S. W. Natural selection and the frequency distributions of “silent” DNA polymorphism in Drosophila. Genetics 146, 295–307 (1997).
Pritchard, J. K. Are rare variants responsible for susceptibility to complex diseases? Am. J. Hum. Genet. 69, 124–137 (2001).
Thornton, K. R., Foran, A. J. & Long, A. D. Properties and modeling of GWAS when complex disease risk is due to non-complementing, deleterious mutations in genes of large effect. PLoS Genet. 9, e1003258 (2013).
Lohmueller, K. E. The impact of population demography and selection on the genetic architecture of complex traits. PLoS Genet. 10, e1004379 (2014).
Simons, Y. B., Turchin, M. C., Pritchard, J. K. & Sella, G. The deleterious mutation load is insensitive to recent population history. Nat. Genet. 46, 220–224 (2014).
Sudmant, P. H. et al. An integrated map of structural variation in 2,504 human genomes. Nature 526, 75–81 (2015).
Marriage, T. N., King, E. G., Long, A. D. & Macdonald, S. J. Fine-mapping nicotine resistance loci in Drosophila using a multiparent advanced generation inter-cross population. Genetics 198, 45–57 (2014).
Stanley, P. D., Ng’oma, E., O’Day, S. & King, E. G. Genetic dissection of nutrition-induced plasticity in insulin/insulin-like growth factor signaling and median life span in a Drosophila multiparent population. Genetics 206, 587–602 (2017).
Tatar, M. et al. A mutant Drosophila insulin receptor homolog that extends life-span and impairs neuroendocrine function. Science 292, 107–110 (2001).
Toivonen, J. M. & Partridge, L. Endocrine regulation of aging and reproduction in Drosophila. Mol. Cell. Endocrinol. 299, 39–50 (2009).
Paaby, A. B., Bergland, A. O., Behrman, E. L. & Schmidt, P. S. A highly pleiotropic amino acid polymorphism in the Drosophila insulin receptor contributes to life-history adaptation. Evolution 68, 3395–3409 (2014).
Paaby, A. B., Blacket, M. J., Hoffmann, A. A. & Schmidt, P. S. Identification of a candidate adaptive polymorphism for Drosophila life history by parallel independent clines on two continents. Mol. Ecol. 19, 760–774 (2010).
Brogiolo, W. et al. An evolutionarily conserved function of the Drosophila insulin receptor and insulin-like peptides in growth control. Curr. Biol. 11, 213–221 (2001).
Rauschenbach, I. Y. et al. Interplay of insulin and dopamine signaling pathways in the control of Drosophila melanogaster fitness. Dokl. Biochem. Biophys. 461, 135–138 (2015).
Wei, Y. et al. Complex cis-regulatory landscape of the insulin receptor gene underlies the broad expression of a central signaling regulator. Development 143, 3591–3603 (2016).
Negre, N. et al. A cis-regulatory map of the Drosophila genome. Nature 471, 527–531 (2011).
Mackay, T. F. C. et al. The Drosophila melanogaster Genetic Reference Panel. Nature 482, 173–178 (2012).
Pool, J. E. et al. Population Genomics of sub-saharan Drosophila melanogaster: African diversity and non-African admixture. PLoS Genet. 8, e1003080 (2012).
Lack, J. B. et al. The Drosophila genome nexus: a population genomic resource of 623 Drosophila melanogaster genomes, including 197 from a single ancestral range population. Genetics 199, 1229–1241 (2015).
Ortiz, J. G. M., Opoka, R., Kane, D. & Cartwright, I. L. Investigating arsenic susceptibility from a genetic perspective in Drosophila reveals a key role for glutathione synthetase. Toxicol. Sci. 107, 416–426 (2009).
Logan-Garbisch, T. et al. Developmental ethanol exposure leads to dysregulation of lipid metabolism and oxidative stress in Drosophila. G3-Genes Genom. Genet 5, 49–59 (2015).
McClellan, J. & King, M. C. Genetic heterogeneity in human disease. Cell 141, 210–217 (2010).
Turelli, M. Heritable genetic-variation via mutation selection balance - lerch zeta meets the abdominal bristle. Theor. Popul. Biol. 25, 138–193 (1984).
Johnson, T. & Barton, N. Theoretical models of selection and mutation on quantitative traits. Philos. Trans. R. Soc. B-Biol. Sci. 360, 1411–1425 (2005).
Ye, C., Hill, C. M., Wu, S., Ruan, J. & Ma, Z. S. DBG2OLC: efficient assembly of large genomes using long erroneous reads of the third generation sequencing technologies. Sci. Rep. 6, 31900 (2016).
Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736 (2017).
Solares, E. A. et al. Rapid low-cost assembly of the Drosophila melanogaster reference genome using low-coverage, long-read sequencing. G3 8, 3143–3154 (2018).
Lam, K. K., LaButti, K., Khalak, A. & Tse, D. FinisherSC: a repeat-aware tool for upgrading de novo assembly using long reads. Bioinformatics 31, 3207–3209 (2015).
Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS. One. 9, e112963 (2014).
Hoskins, R. A. et al. The Release 6 reference sequence of the Drosophila melanogaster genome. Genome Res. 25, 445–458 (2015).
Kurtz, S. et al. Versatile and open software for comparing large genomes. Genome Biol. 5, R12 (2004).
Waterhouse, R. M. et al. BUSCO applications from quality assessments to gene prediction and phylogenomics. Mol. Biol. Evol. https://doi.org/10.1093/molbev/msx319 (2017).
Quinlan, A. R. BEDTools: The Swiss-Army Tool for Genome Feature Analysis. Curr. Protoc. Bioinforma. 47, 11–34 (2014).
Kent, W. J. et al. The human genome browser at UCSC. Genome Res. 12, 996–1006 (2002).
Rogers, R. L. et al. Landscape of standing variation for tandem duplications in Drosophila yakuba and Drosophila simulans. Mol. Biol. Evol. 31, 1750–1766 (2014).
Ye, K., Schulz, M. H., Long, Q., Apweiler, R. & Ning, Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 25, 2865–2871 (2009).
Abyzov, A., Urban, A. E., Snyder, M. & Gerstein, M. CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res. 21, 974–984 (2011).
King, E. G., Sanderson, B. J., McNeil, C. L., Long, A. D. & Macdonald, S. J. Genetic dissection of the Drosophila melanogaster female head transcriptome reveals widespread allelic heterogeneity. PLoS Genet. 10, e1004322 (2014).
Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
Li, H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987–2993 (2011).
Acknowledgements
We wish to acknowledge support from the following grants OD010974 (SJM and ADL), GM115562 (ADL), R01GM123303-1 and University of California, Irvine setup funds (JJE), and K99GM129411 (MC). We thank Luna Thanh Ngo and Daniel Na for help with data management and fly maintenance. This work was made possible, in part, through access to the Genomics High-Throughput Facility Shared Resource of the Cancer Center Support Grant CA-62203 at the University of California, Irvine, and NIH shared-instrumentation grants 1S10RR025496-01, 1S10OD010794-01, and 1S10OD021718-01.
Author information
Contributions
M.C., J.J.E., S.J.M., A.D.L. conceived of the work and wrote the paper. M.C. assembled the genomes and wrote the variant caller.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Communications thanks Wen Huang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Chakraborty, M., Emerson, J.J., Macdonald, S.J. et al. Structural variants exhibit widespread allelic heterogeneity and shape variation in complex traits. Nat Commun 10, 4872 (2019). https://doi.org/10.1038/s41467-019-12884-1
DOI: https://doi.org/10.1038/s41467-019-12884-1