Structural variants exhibit widespread allelic heterogeneity and shape variation in complex traits

Chakraborty, Mahul; Emerson, J. J.; Macdonald, Stuart J.; Long, Anthony D.

doi:10.1038/s41467-019-12884-1

Article
Open access
Published: 25 October 2019

Nature Communications volume 10, Article number: 4872 (2019)

Abstract

It has been hypothesized that individually rare hidden structural variants (SVs) could account for a significant fraction of variation in complex traits. Here we identified more than 20,000 euchromatic SVs from 14 Drosophila melanogaster genome assemblies, of which ~40% are invisible to high specificity short-read genotyping approaches. SVs are common, with 31.5% of diploid individuals harboring an SV in genes larger than 5 kb, and 24% harboring multiple SVs in genes larger than 10 kb. SV minor alleles are rarer than amino acid polymorphisms, suggesting that SVs are more deleterious. We show that a number of functionally important genes harbor previously hidden structural variants likely to affect complex phenotypes. Furthermore, SVs are overrepresented in candidate genes associated with quantitative trait loci mapped using the Drosophila Synthetic Population Resource. We conclude that SVs are ubiquitous, frequently constitute a heterogeneous allelic series, and can act as rare alleles of large effect.

Introduction

Understanding the molecular basis of heritable variation in complex traits is of central importance to evolution, animal and plant breeding, and medical genetics^1,2,3,4. Over the last decade, short read genomic data (50–150 bp reads) appropriate for characterizing SNPs and small indels in non-repetitive genomic regions have accumulated at an exponential rate^5,6. This in turn has catalyzed hundreds of quantitative trait locus (QTL) mapping studies and genome-wide association studies (GWAS) in model organisms, humans, and agriculturally important animals and plants^7,8,9. Despite these efforts, for most traits, GWAS hits explain only a small fraction of known trait heritability^10,11. One hypothesis accounting for hidden genetic variation is that individually rare hidden mutations that alter genome structure make significant contributions to complex trait variation^11,12. These structural variants (SVs) change the genome via duplication, deletion, transposition, and inversion of sequences. This hypothesis is attractive since rare causative variants are difficult to detect with GWAS¹³. Moreover, genotyping approaches based on short reads or microarrays fail to detect a significant number of SVs^14,15. Finally, it is reasonable to assume that SVs are, on average, more likely to be deleterious, and more strongly deleterious, than SNPs^16,17,18,19.

High quality genomes provide a direct and reliable path to comprehensive identification of SVs^15,20,21. To achieve this goal, we assembled reference-quality genomes for fourteen geographically diverse Drosophila melanogaster strains (Fig. 1a) using single molecule real time sequencing²². These assemblies are contiguous and complete (N50 18.9–22.3 Mb; BUSCO²³ 99.9–100%) (Table 1, Fig. 1b, Supplementary Table 1), making them comparable to the D. melanogaster reference genome, arguably the best metazoan genome assembly. Thirteen of the fourteen strains are near isogenic founders of the Drosophila Synthetic Population Resource (DSPR)²⁴, a large set of advanced intercross recombinant inbred lines (RILs) designed to map QTLs²⁵. We also assembled the genome of Oregon-R, an outbred stock widely used as a “wild-type” strain both by Drosophila geneticists and by large scale community projects like modENCODE^26,27,28.
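Contiguity is summarized above as N50: the contig length at which contigs of that length or longer contain at least half of the assembled bases. The following is a minimal sketch of that calculation; the contig lengths are invented for illustration and are not values from Table 1.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (any consistent unit)."""
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half of the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Hypothetical contig lengths in Mb, for illustration only.
toy_contigs = [23, 21, 19, 5, 2]
print(n50(toy_contigs))
```

Because N50 depends only on the sorted length distribution, it says nothing about correctness; the BUSCO completeness figures reported alongside it address a different property of the assemblies.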

Table 1 Summary of assembly metrics

Using these reference quality genome assemblies, we show that SVs are common in D. melanogaster genes, with almost one third of diploid individuals harboring an SV in genes larger than 5 kb, and more than a third of burdened genes carrying multiple SVs. The site frequency spectrum (SFS) of SV alleles relative to amino acid polymorphisms suggests that SVs are under stronger purifying selection, and thus are more likely to impact phenotype than nonsynonymous SNPs. We further show that a number of functionally important genes harbor previously hidden SVs likely to affect complex phenotypes (e.g., Cyp6g1, Drsl5, Cyp28d1, Cyp28d2, InR, and Gss1&2). Finally, we find that SVs are overrepresented in candidate genes associated with mapped QTL. We conclude that SVs are pervasive in genomes, frequently manifest as heterogeneous allelic series affecting the same gene, and exhibit all the properties that make them prime candidates for being rare alleles of large effect.

Results

De novo assembly reveals novel functionally important SVs

Our assemblies are extremely contiguous, with the majority of each chromosome arm represented by a single contig (Fig. 1b). We also close the two remaining gaps in the major chromosome arms of the euchromatic D. melanogaster reference genome²⁹ in all our assemblies (Supplementary Figs. 1–3). We identified SVs by comparing each assembly to the reference ISO1 genome¹⁵, focusing our attention on large (>100 bp) euchromatic SVs (Supplementary Table 2), and ignoring heterochromatic regions as they are gene poor³⁰ and require specialized assembly approaches and extensive validation³¹. Manual inspection of 267 randomly sampled SVs indicates that mis-annotations are rare (3/267) and occur in ambiguously aligned, structurally complex genomic regions (Supplementary Fig. 5; see Methods). We discovered 7347 TE insertions, 1178 duplication CNVs, 4347 indels, and 62 inversions in the 94.5 Mb of euchromatin spanning the five major chromosome arms across the DSPR founders (Fig. 1c–d). Each founder strain exhibits 637 TE insertions, 134 duplications, 694 indels, and 7 inversions on average (Table 2). We estimate that 36% of non-reference TEs, 26% of deletions, 48% of insertions, and 60% of duplication CNVs are not routinely detected using high coverage paired end Illumina reads and high specificity SV genotyping methods¹⁵ (Supplementary Fig. 6).

Table 2 Number of euchromatic SVs in the sequenced DSPR founder strains and Oregon-R

We uncover many examples of previously hidden SVs predicted to affect complex traits. Extensive evidence links complex SV alleles of the cytochrome P450 gene Cyp6g1 to varying levels of DDT resistance^32,33. Despite extensive study of this locus, we discovered three new SV alleles involving TE insertions that likely have different functional consequences (Supplementary Fig. 7a, b). Similarly, we discovered a previously hidden tandem duplication of the antifungal, innate immunity gene Drsl5³⁴ that exhibits >1000-fold higher expression relative to its single copy counterpart in line A4 (Supplementary Fig. 8a, b). Read pair orientation and split read methods failed to detect this mutation because one allele bears a 5 kb spacer sequence derived from the first exon and intron of Kst inserted between the gene copies (Supplementary Fig. 8a). Another duplicate allele of Drsl5 contains a Tirant LTR retrotransposon inserted into the same spacer sequence (Supplementary Fig. 8a). We also easily detect the two SV mutations underlying the D. melanogaster recessive visible genes cinnabar³⁵ (cn) and speck (sp) present in the ISO1 reference genome³⁶ (Supplementary Figs. 9 and 10). In the case of sp a large insertion in the reference genome is mis-annotated as an intron. For cn a large exonic deletion is not identified as such³⁶. Both alleles are likely knock-outs.

SVs are deleterious

Most TEs and duplicates are present in only one strain (Fig. 1e), with the folded SFS of the TEs and duplicates exhibiting a greater proportion of rare variants than non-synonymous SNPs (nsSNPs) assayed in the same strains (Fig. 1e; p-value < 1 × 10⁻¹⁰, χ² test between frequency classes of these two types of SVs and non-synonymous SNPs). Since nsSNPs were ascertained via high coverage short reads from virtually isogenic strains²⁴, the low frequency skew of the site frequency spectrum of SVs relative to nsSNPs is unlikely due to SNP miscalls (see Methods). It is well-known that the SFS is affected by demographic history^37,38, but selective constraints can be inferred with some confidence by comparing site classes from the same sample^37,39,40,41. The skew toward rare variants we observe in our SVs relative to nsSNPs is strongly indicative of SVs being under stronger purifying selection, consistent with previous work in which SVs were ascertained with higher bias and/or errors^16,18,19. Furthermore, TEs are more enriched for rare variants than duplicates, indicating that TE insertions as a class are more deleterious than duplicates (Fig. 1e; p-value < 1 × 10⁻¹⁰, χ² test between frequency classes of TEs and duplicates). Under mutation selection balance models^42,43, rare deleterious variants (minor allele frequency or MAF <1%) are predicted to contribute significantly to the variation in complex traits, yet are unlikely to be tagged by SNPs typically used in GWAS experiments¹⁰. Although demography can impact the proportion of variation due to rare deleterious alleles, recent population bottlenecks or growth^44,45 tend to amplify the contribution of rare alleles to variation in a complex trait.

SVs are common in genes and enriched at mapped QTLs

To illustrate how common SV genotypes are in heterozygous individuals, we quantified the per gene SV burden per synthetic diploid D. melanogaster individual (Fig. 2a, b; each synthetic diploid is one of 78 possible pairings of the thirteen assembled DSPR founders). On average, SVs appear in 9.3% of genes in diploid individuals (1285/13761). Of those, more than a third of burdened genes in diploids (443/1285) bear multiple SV mutations. One or more SVs burden more than half of genes in and above the 20–35 kb range (Fig. 2a). Furthermore, individual genes bearing multiple SVs comprise more than a third of burdened genes between 20 and 35 kb in length and more than half of larger genes (Fig. 2b). Thus, although SVs generally have low minor allele frequencies, they are ubiquitous in the functional elements of the D. melanogaster genome.
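The count of 78 synthetic diploids is the number of unordered pairs of thirteen founders. The sketch below enumerates those pairs and tallies per-gene burden for a single gene. The founder labels, the SV identifiers, and the choice to count the distinct SVs across the two haplotypes are all assumptions made for illustration; the text does not spell out how burden was tallied.

```python
from itertools import combinations

# Placeholder labels standing in for the thirteen assembled DSPR founders.
founders = [f"F{i}" for i in range(1, 14)]

# Hypothetical SVs found within one gene in each founder (IDs are invented).
svs_in_gene = {name: set() for name in founders}
svs_in_gene["F1"] = {"TE_ins_a"}
svs_in_gene["F2"] = {"dup_b"}
svs_in_gene["F3"] = {"TE_ins_a", "del_c"}

# Every unordered pair of distinct founders is one synthetic diploid.
pairs = list(combinations(founders, 2))


def burden(pair):
    """Number of distinct SVs the synthetic diploid carries in the gene."""
    first, second = pair
    return len(svs_in_gene[first] | svs_in_gene[second])


burdened = sum(1 for pair in pairs if burden(pair) >= 1)
multiple = sum(1 for pair in pairs if burden(pair) >= 2)
print(len(pairs), burdened, multiple)
```

Repeating this tally over every annotated gene and averaging over the 78 pairings would give per-individual fractions of the kind quoted above, under the stated counting assumption.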

Although hypotheses employing SVs to explain missing heritability have been proposed^10,46, the systematic under-identification of SVs via short read- and microarray-based genotyping²¹ limits their explanatory power. Using our comprehensive SV map, we measured the prevalence of SVs at the candidate genes reported in eight complex trait mapping experiments employing the DSPR (Supplementary Data 1). We consider only genes in mapped QTLs explicitly cited by the authors of the original QTL studies (Supplementary Data 1; see Methods). In total, we identified 31 candidate genes of which 15 (48.4%) possess at least one SV in one founder strain, whereas only 23.4% (3237/13,830) of other D. melanogaster genes harbor SVs (p = 0.0023; Fisher’s exact test). The 31 candidate genes from QTL mapping work (Supplementary Data 1) are more than twice as large as an average Drosophila gene (12.2 vs 5.4 kb, Wilcoxon rank sum test; p = 1.4 × 10⁻⁴; Supplementary Data 2). We then tested whether QTL candidate genes are enriched for SVs independent of gene size. To control for the observed elevated size distribution of QTL candidate genes, we randomly drew 100,000 Monte Carlo gene samples matching the candidate gene length distribution in order to generate a null distribution for the number of SVs we expect to observe. In the Monte Carlo sample, 10.4 genes are expected to harbor SVs by chance whereas we observe 15, an enrichment of 45% (p = 0.026; Fig. 2c). Similarly, enrichment is also observed when burden is measured as density (number of burdened genes per bp) in comparison of QTL candidate genes to the remaining genes (39.6 burdened genes/Mbp of genic DNA vs 27.3 burdened genes/Mbp of genic DNA, p = 0.017; Supplementary Data 2). Finally, candidate genes also exhibit greater SV mutation density compared to the expected SV density in the Monte Carlo samples (205.8 SVs/Mbp vs 138.4 SVs/Mbp, p = 0.047; Supplementary Data 2), an enrichment of 49%. These observations suggest that SVs may be disproportionately associated with QTL candidate genes even after accounting for gene size. However, due to the limited sample size of QTL candidate genes, the power to detect even strong effects (e.g. the 45% enrichment reported here) is limited. Consequently, further QTL studies identifying candidate genes would substantially improve our understanding of this effect. Taken together, these results suggest that the contribution of rare SVs of large effect to complex traits could be pervasive.
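The first comparison above is a 2 × 2 table: 15 of 31 candidate genes carry an SV, against 3237 of 13,830 other genes. As a rough illustration, the upper tail of Fisher's exact test can be computed directly from the hypergeometric distribution. Whether the reported p-value is one- or two-sided is not stated, so the one-sided form below is an assumption and need not reproduce the published figure exactly.

```python
from math import comb


def fisher_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which carry an SV (upper tail of the hypergeometric distribution)."""
    upper = min(n, K)
    tail = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return tail / comb(N, n)


# Counts taken from the text.
candidates, candidates_with_sv = 31, 15
others, others_with_sv = 13830, 3237

p = fisher_upper_tail(
    k=candidates_with_sv,
    n=candidates,
    K=candidates_with_sv + others_with_sv,
    N=candidates + others,
)
print(f"one-sided p = {p:.4f}")
```

This test ignores gene length, which is why the text goes on to a length-matched Monte Carlo comparison.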

Functional structural variation at mapped QTL

GWAS experiments are poorly powered to detect the segregation of multiple alleles at a causal gene⁴³. Although allelic heterogeneity can be readily identified in multi-parent panels (MPPs) via QTL mapping²⁵, mapping resolution is often poor, forcing investigators to identify mutations of obvious functional significance in the genomic interval most likely to harbor the QTL. Both GWAS and QTL mapping suffer if putatively causative SVs disproportionately escape detection by short read sequencing¹⁵. This limitation can be readily solved in MPPs, as the SV genotypes of the large panel of mapping lines can be imputed from de novo assemblies of the much smaller number of MPP founders²⁴.

A nicotine resistance mapping study employing the DSPR identified differentially expressed cytochrome P450 genes Cyp28d1 and Cyp28d2 as candidate causative genes at a mapped QTL, but proposed no causative mutations⁴⁷. A previous de novo assembly of a single DSPR founder strain identified a resistant allele possessing tandem copies of the Cyp28d1 gene separated by an Accord LTR retrotransposon fragment¹⁵ (Fig. 3a; Supplementary Fig. 11). Our assemblies of additional DSPR founder strains reveal a total of seven structurally distinct alleles in this region, including additional candidate resistant alleles harboring gene duplications (Fig. 3a–b). For example, the resistant strain A2 carries a tandem duplication of a 15 kb segment containing both Cyp28d genes. The expression level of Cyp28d1 in the adult female heads of RILs bearing the A2 genotype is highest among all founder genotypes measured (Fig. 3c). Consistent with this, RILs bearing the A2 genotype show the highest average resistance to nicotine toxicity among the RILs derived from the A set of founders⁴⁷ (Fig. 3b). This implies that the extra copies of Cyp28d1 and/or Cyp28d2 account for the increased expression and concomitant resistance to nicotine. Similarly, the B4 allele comprises a tandem duplication of a 6 kb segment, containing one extra copy of Cyp28d1 and a nearly complete copy of Cyp28d2 (Fig. 3a; Supplementary Fig. 11). RILs carrying the B4 genotype at the Cyp28d locus also show high resistance to nicotine, making the duplication a compelling candidate for the causative mutation. On the other hand, in two alleles, TE insertions disrupt Cyp28d gene structure. For instance, A1 has a duplication sharing the same breakpoints as B4, but a 4.7 kb F element inserted in the 5th exon disrupts the protein coding sequence of the second Cyp28d1 copy, likely rendering the copy nonfunctional (Fig. 3a; Supplementary Fig. 11). Consistent with the hypothesis that the duplication causes increased nicotine resistance, the A1 genotype is more susceptible to nicotine than B4 (Fig. 3b). All of these SV alleles are singletons, and thus represent a hidden allelic series composed of individually rare alleles.

SVs may also affect genes central to life history traits. Expression levels of the insulin signaling pathway genes show substantial variation in F1 hybrids between DSPR panel B RILs and the A4 founder⁴⁸. Among these is Insulin Receptor (InR), which plays a key role in several life history traits related to lifespan and is likely a key molecular mediator of the tradeoff between reproductive success and longevity^49,50,51. Amino acid polymorphism in InR evolves under positive selection and some non-synonymous variants affect fecundity and stress response^51,52. Expression variation of InR also affects body size, lifespan, and fecundity^53,54, suggesting that natural cis-regulatory variation might also be under selection. We discovered a 215 bp fragment of a DOC6 element within a second intron enhancer⁵⁵ (Fig. 4b, c) of InR on the AB8 haplotype, and this allele exhibits reduced gene expression relative to reference genotypes (Fig. 4b). This mutation potentially disrupts the enhancer (Supplementary Fig. 12), making it a plausible candidate for expression variation in InR. Another founder, A6, carries a 1,042 bp insertion of DMRT1A (LINE) in the 2nd intron and a 946 bp insertion of a fragment of PROTOP in the 3rd intron. Both insert within known cis-regulatory elements⁵⁵ (Fig. 4b, c). Except for A2 and A6, all strains, including ISO1, harbor an FB-NOF element (FB{}1698) inside the first intron of InR (Fig. 4a). Like many genes, the first intron of InR possesses several transcription factor binding sites (TFBS), including those for factors Nejire and Caudal⁵⁶ (Fig. 4c). The FB-NOF element is inserted within this dense cluster of TFBS and active enhancer marks (Fig. 4c). Furthermore, the FB element is segregating at high frequency in the strains discussed here (13/15), a North American population⁵⁷ (125/170), and a French population⁵⁸ (4/9), but is rare in populations derived from D. melanogaster’s ancestral range in Africa^58,59 (Cameroon: 0/10, Rwanda: 1/27, Zambia: 10/139) (Fig. 4d). This raises the possibility that the FB element is more common in temperate cosmopolitan populations, similar to a previously described adaptive amino acid variant in InR⁵². In total, InR harbors a remarkable amount of potentially functional structural diversity. Including the variants described above, there are nine TE insertions and two deletions throughout the gene, many of which impinge on candidate regulatory regions or transcribed portions of the gene (Fig. 4a, c).
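The population counts quoted above for the FB element translate into frequencies with simple arithmetic. The short sketch below converts each reported count into a frequency and pools the three African samples. The pooling is done here only for illustration and is not an analysis reported in the text.

```python
# (carriers, sampled) as reported in the text.
counts = {
    "strains in this study": (13, 15),
    "North America": (125, 170),
    "France": (4, 9),
    "Cameroon": (0, 10),
    "Rwanda": (1, 27),
    "Zambia": (10, 139),
}

for population, (carriers, sampled) in counts.items():
    print(f"{population}: {carriers / sampled:.1%}")

# Pool the three samples from the ancestral African range (illustrative only).
african = ("Cameroon", "Rwanda", "Zambia")
african_carriers = sum(counts[name][0] for name in african)
african_sampled = sum(counts[name][1] for name in african)
print(f"pooled African samples: {african_carriers / african_sampled:.1%}")
```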

Public resources like modENCODE annotate molecular phenotypes (e.g., RNAseq, ChIPseq, DNase-seq) against reference genomes which are often genetically different than the strains assayed^26,27,28,56. Canton-S (our DSPR founder A1) and Oregon-R are strains commonly used in phenotypic assays^26,27,28, and we observe SVs segregating between these two strains and the reference (Table 2). Interpretation of functional genomics data such as RNA-seq can be misleading when gene copy number varies between strains. We explored the glutathione synthetase region (containing Gss1 and Gss2), which is just one example among hundreds in modENCODE that likely suffer from misleading annotations. A tandem duplication present in ISO1 has created two copies of Gss1 and Gss2, which are associated with toxin metabolism and linked to tolerance to arsenic⁶⁰ and ethanol induced oxidative stress⁶¹. While this duplication segregates at high frequency in DSPR strains (9/13), it is absent in Oregon-R (Fig. 5a) and escapes detection via short-read methods. As a result, using transcript and ChIP data derived from Oregon-R (as used in modENCODE^27,28) results in misleading annotations of the two copies in ISO1. Indeed, among the eight structurally distinct Gss alleles in our dataset, ISO1 is the sole representative of its allele (Fig. 5a). The two most common Gss alleles include one that contains only a single Gss gene (in four strains, including Oregon-R) and one carrying only a tandem duplication, creating the Gss1/Gss2 pair (in five strains, including Samarkand/AB8) (Fig. 5a). The remaining six alleles have SV genotypes represented by only a single individual in the sample. Collectively, this sample represents a haplotype network of structural variation involving five TE insertions, one duplication, one insertion comprising TE and simple repeats, and two non-TE indels. 
The single copy allele with a 5′ insertion of a 14 kb repetitive sequence comprising Nomad retrotransposon fragments exhibits the highest expression, followed by duplicate alleles, whereas single copy alleles and duplicate alleles with intronic TE insertions generally have the lowest expression levels (Fig. 5b).

Discussion

Despite claims that a significant proportion of complex trait variation in humans, model organisms, and agriculturally important animals and plants is likely due to rare SVs of large effect¹¹, systematic inquiry of this hypothesis has been impeded by genotyping approaches attuned to SNP detection²¹. As reference quality de novo assemblies of population samples for eukaryotic model systems become increasingly cost-effective, methodical evaluation of the contribution of SVs to the genetic architecture of complex traits becomes feasible. Our comprehensive map of SVs in Drosophila provides the means to systematically quantify the contribution of rare SVs to heritable complex trait variation (Figs. 2a, 3a, 4a, and 5a). The value of comprehensive SV detection is underscored by the presence of SVs in ~50% of the candidate genes underlying mapped Drosophila QTL, and by the observation that a large fraction of Drosophila genes harbor multiple rare SV alleles. The genomes of humans and agriculturally important plants and animals harbor more SVs than Drosophila, and thus are likely more burdened with genic SVs.

The genetic heterogeneity hypothesis for variation in complex traits posits that a sizable fraction of human complex disease is associated with an allelic series consisting of individually rare causative mutations at several genes of large effect⁶². Furthermore, models for complex traits under either stabilizing^63,64 or purifying selection^42,43 with constant mutational input predict the existence of genes segregating several individually rare causative alleles that account for a sizable fraction of trait variation. We provide examples of SVs in genes of functional significance, and show that genes harboring SVs are overrepresented in a collection of QTL candidate genes. Hidden SVs are thus examples of collectively common but individually rare deleterious genetic variants predicted under the genetic heterogeneity hypothesis. Future de novo assemblies of other genomes, including humans, models, and agriculturally important species, would quantify the generality of observations from Drosophila.

Methods

DNA extraction

Genomic DNA was extracted from females following the protocols described previously²², and the genomic DNA was sheared using 10 plunges of a 21-gauge needle, followed by 10 plunges of a 24-gauge needle (Jensen Global, Santa Barbara). All testing and research involving flies were performed in compliance with relevant ethical regulations. The SMRTbell template library was prepared following the manufacturer’s guidelines and sequenced using P6-C4 chemistry on the Pacific Biosciences RSII platform at the University of California Irvine Genomics High Throughput Facility. The total number of SMRTcells and base pairs sequenced, and the read length metrics for each strain, are given in Table S6.

Genome assembly

The genomes were assembled following the approach described in Chakraborty et al.²². For all calculations of sequence coverage, a genome size of 130 Mbp is assumed (G = 130 × 10⁶ bp). For each strain, we generated a hybrid assembly with DBG2OLC⁶⁵ and the longest 30X of PacBio reads, and a PacBio assembly with canu v1.3⁶⁶ (Supplementary Data 3). The paired end Illumina reads were obtained from King et al.²⁴. The hybrid assemblies were merged with the PacBio only assemblies with quickmerge v0.2^22,67 (l = 2 Mb, ml = 20000, hco = 5.0, c = 1.5), with the hybrid assembly being used as the query. Because the PacBio assembly sizes were closer to the genome size of D. melanogaster, we added the contigs that were present in the PacBio only assembly but not in the hybrid assembly by performing a second round of quickmerge⁶⁷. For the second round of quickmerge (l = 5 Mb, ml = 20000, hco = 5.0, c = 1.5), the PacBio assembly was used as the query and the merged assembly from the first merging round was used as the reference assembly. The resulting merged assembly was processed with finisherSC to remove redundant sequences and to fill additional gaps using raw reads⁶⁸. The assemblies were then polished twice with quiver (SMRTanalysis v2.3.0p5) and once with Pilon v1.16⁶⁹ using the same Illumina reads as used for the hybrid assemblies.
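Coverage here is total sequenced bases divided by the assumed genome size G, so "the longest 30X" of reads means the longest reads whose summed length reaches 30 × G. The sketch below shows one way to make that selection. The function names and the exact stopping rule are assumptions for illustration, not a description of the tooling actually used.

```python
GENOME_SIZE = 130_000_000  # G = 130 x 10^6 bp, as assumed in the text


def coverage(read_lengths, genome_size=GENOME_SIZE):
    """Sequence coverage: total bases divided by genome size."""
    return sum(read_lengths) / genome_size


def longest_reads_for_coverage(read_lengths, target_x, genome_size=GENOME_SIZE):
    """Keep the longest reads until their summed length reaches target_x * G."""
    needed = target_x * genome_size
    kept, total = [], 0
    for length in sorted(read_lengths, reverse=True):
        if total >= needed:
            break
        kept.append(length)
        total += length
    return kept


# Toy example with a 1 kb "genome" so the numbers are easy to follow.
toy_reads = [500, 400, 300, 200, 100, 50]
subset = longest_reads_for_coverage(toy_reads, target_x=1, genome_size=1000)
print(subset, coverage(subset, genome_size=1000))
```

The selected subset can slightly overshoot the target because whole reads are kept; here three reads give 1.2X for a 1X target.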

Comparative scaffolding

We scaffolded the contigs for each assembly based on the scaffolds from the reference assembly⁷⁰, following a previously described approach¹⁵. Briefly, TEs and repeats in the assemblies were masked using RepeatMasker (v4.0.7), and the masked contigs were aligned to the repeat-masked chromosome arms (X, 2L, 2R, 3L, 3R, and 4) of the D. melanogaster ISO1 assembly using MUMmer⁷¹. After filtering out alignments due to repeats (delta-filter -1), contigs were assigned to specific chromosome arms on the basis of the mutually best alignment. The scaffolded contigs were joined by 100 Ns, a convention representing assembly gaps. The unscaffolded sequences were named with a “U” prefix.
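The final joining step can be sketched as follows, assuming each contig has already been assigned to an arm and given a reference start coordinate from its best alignment. The data structures, the `U_` name format, and the omission of contig orientation are simplifications introduced here for illustration.

```python
GAP = "N" * 100  # 100 Ns mark an assembly gap of unknown size


def scaffold(contigs, placements):
    """Join contigs per chromosome arm in reference order.

    contigs: {contig_name: sequence}
    placements: {contig_name: (arm, reference_start)} for placed contigs
    Returns {scaffold_name: sequence}; unplaced contigs get a "U" prefix.
    Orientation is ignored in this sketch.
    """
    by_arm = {}
    for name, (arm, ref_start) in placements.items():
        by_arm.setdefault(arm, []).append((ref_start, name))

    scaffolds = {}
    for arm, hits in by_arm.items():
        ordered = [contigs[name] for _, name in sorted(hits)]
        scaffolds[arm] = GAP.join(ordered)

    for name, sequence in contigs.items():
        if name not in placements:
            scaffolds["U_" + name] = sequence
    return scaffolds


# Toy contigs and placements (all values invented).
toy_contigs = {"c1": "AAAA", "c2": "CCCC", "c3": "GG"}
toy_placements = {"c2": ("2L", 100), "c1": ("2L", 5000)}
result = scaffold(toy_contigs, toy_placements)
print(sorted(result))
```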

BUSCO analysis

We ran BUSCO (v3.02)⁷² on the Pilon polished pre-scaffolding assemblies to evaluate the completeness of all the assemblies relative to the ISO1 release 6 (r6.13) assembly. We used both the arthropoda and diptera datasets for the BUSCO evaluation. For the arthropoda database, three orthologs (EOG090X0BNZ, EOG090X0M0J, and EOG090X049L) were not found in any of the 15 strains (ISO1, Oregon-R, and 13 DSPR founders). Further inspection of these orthologs revealed that they are present in ISO1 even though the BUSCO analysis misses them when applied to ISO1 (EOG090X0BNZ is CG3223, EOG090X0M0J is Pa1, and EOG090X049L is CG40178). Consequently, we removed these three genes from consideration as uninformative.

Variant detection

For variant detection, we aligned each DSPR assembly individually to the ISO1 release 6 assembly (release 6.13)⁷⁰ using nucmer (nucmer --maxmatch --noextend)⁷¹. We identified and classified the variants using SVMU 0.2beta (Structural Variants from MUMmer) (n = 10)¹⁵. SVMU classifies the structural differences between two assemblies as insertion, deletion, duplication, or inversion based on whether the DSPR assemblies have longer, shorter, more copies of, or inverted sequence, respectively, with respect to the reference genome. The variant calls for individual genomes were combined using bedtools merge⁷³ and converted into a VCF file using a custom script (https://github.com/mahulchak/dspr-asm). TE insertions were identified by examining the overlap between RepeatMasker identified TEs and SVMU insertion calls using bedtools, requiring that at least 90% of the RepeatMasker TE annotation overlap with the SVMU insertion annotation. The 12.8% of SV mutations for which mutation annotation was complicated by secondary mutations were flagged as “complex” (CE = 2 in the VCF file). Additionally, the 16.3% of SVs that were located within 5 kb of a complex SV were often part of a complex event and were also assigned a tag (CE = 1) to differentiate them from the unambiguously annotated SVs (CE = 0).
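The two interval rules in this paragraph (the 90% TE overlap requirement and the 5 kb proximity tag) can be sketched in a few lines. This is not the bedtools-based pipeline itself. It assumes half-open `(start, end)` intervals on the same sequence and coordinate system, and the function names are invented for illustration.

```python
def overlap(a, b):
    """Length of the overlap between two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def is_te_insertion(insertion, te_annotations, min_fraction=0.9):
    """True if at least min_fraction of some TE annotation lies within
    the insertion call."""
    return any(
        overlap(insertion, te) >= min_fraction * (te[1] - te[0])
        for te in te_annotations
    )


def complexity_tag(sv, is_complex, complex_svs, window=5_000):
    """CE tag: 2 for a complex SV, 1 for an SV within `window` bp of a
    complex SV, and 0 otherwise."""
    if is_complex:
        return 2
    for other in complex_svs:
        # Distance between intervals; 0 if they touch or overlap.
        gap = max(other[0] - sv[1], sv[0] - other[1], 0)
        if gap <= window:
            return 1
    return 0


# Toy coordinates (invented).
print(is_te_insertion((100, 1100), [(150, 1150)]))
print(complexity_tag((10_000, 10_500), False, [(14_000, 15_000)]))
```

Note that the overlap fraction is taken relative to the TE annotation, not the insertion call, matching the wording of the requirement above.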

Genotype validation

To determine the genotyping error rate, a set of 50 randomly selected simple (CE = 0) SVs obtained from SVMU was manually inspected on a UCSC genome browser representation of the multiple genome alignment of the 15 genomes (http://goo.gl/LLpoNH). Furthermore, to estimate the genotyping accuracy of the SVs occurring in the vicinity of the complex mutations, where mutation annotation is complicated by alignment ambiguities, we manually inspected 217 SVs occurring within 20 kb of 50 randomly selected complex (CE = 2) SVs. Of the 217 SVs near complex mutations, 3 were absent in the UCSC browser, whereas none of the 50 simple SVs were absent; the absent SVs are therefore likely mis-annotated by our pipeline. The mis-annotated SVs (insertion in A1 and tandem array CNV in A7) are located in a complex, repetitive, structurally variable genomic region on chromosome 3L (3L:7669500-7679100) (Supplementary Fig. 5).

Comparing SV genotypes from de novo assemblies to short read only calls

TE genotypes for the founders¹⁸ were downloaded from flyrils.org and the insertion coordinates were lifted over to the current release (release 6) of the reference genome⁷⁰ using the UCSC liftOver tool⁷⁴. For detection of the duplicates, we have previously found that a discordant read pair based method (Pecnv)⁷⁵ was comparable to split read mapping⁷⁶ and more reliable than methods based on coverage alone^15,77, so we used Pecnv. Pecnv was run using the settings described before¹⁵. Because SVMU reports tandem duplicate CNVs as insertions (with appropriate CNV tags to separate them from TE and other insertions) and Pecnv reports the sequence range being duplicated, the SVMU CNV insertion coordinates were extended by 100 bp before the comparison (bedtools intersect) between Pecnv output and SVMU output was conducted. The non-TE indel genotypes were obtained from Pindel output (the “LI” and “D” events) using the commands described previously¹⁵. For determining the population frequency of indel SVs (e.g. the reference FB element in InR), Pindel output based on the alignment BAM files was used. We only estimate the false negative rate of short read only callers, but note that these methods also generate false positive SV calls.

Gene expression analysis

The preprocessed expression data for female heads⁷⁸ and IIS/TOR expression data⁴⁸ from whole bodies were downloaded from www.flyrils.org. Expression QTL analysis (Supplementary Fig. 11) for Cyp28d1 and Gss1 using the head expression data was performed using the R package DSPRqtl following the instructions provided in the manual (DSPRscan, model = gene ~ 1, design = “ABcross”). When expression data for multiple isoforms were present, only the expression data for the longest transcript expressed in the head were used. The genotype values at the eQTL were determined using the function DSPRpeaks included in the DSPRqtl package. No eQTL were found for InR, so the genotype values for the InR expression data were obtained by assigning the founder genotypes to the RILs used in the IIS/TOR expression dataset, using the posterior probabilities of the forward-backward decoding of the HMM for the panel B RILs available on www.flyrils.org. Drsl5 expression levels in A4 and A3 were obtained from a publicly available RNAseq dataset⁴⁷.

Comparison of site frequency spectra

The histogram of allele frequencies (site frequency spectrum or SFS) was collated for four categories: synonymous SNPs, non-synonymous SNPs, duplicate CNVs, and TE insertions. The frequencies of SNPs were collected from the VCF file²⁴ using vcftools and bcftools^79,80. The frequencies of SVs were collected from column 4 of the combined SVMU output for the TE insertions and duplication CNVs from all DSPR strains (https://github.com/mahulchak/dspr-asm). Complex mutations (CE = 1 and CE = 2) were excluded from the analysis. Let N be the sample size and x_i be the number of sites in frequency class i, where 0 < i < N. The SFS was “folded”, meaning we focused attention on the minor allele frequency (MAF), or y_i = minimum (x_i, N − x_i). Pairwise comparisons between different SFS site categories were conducted using the χ² test on allele frequencies and site categories. For allele frequencies, two types of classifications were used: (1) every y_i for 0 < i < N (N − 1 df); and (2) singletons versus all other frequency categories, or y_i for i = 1 versus 2 ≤ i < N (1 df).
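As a rough sketch of the singleton-versus-rest comparison, the code below folds per-site allele counts into minor allele classes and computes a 2 × 2 χ² statistic with its 1 df p-value. Reading the fold as "a site with allele count i in a sample of N goes to class min(i, N − i)" is an interpretation of the definition above, and the input data and function names are invented for illustration.

```python
from math import erfc, sqrt


def folded_sfs(allele_counts, n):
    """Histogram {minor allele count: number of sites} for sites whose
    allele count i satisfies 0 < i < n."""
    sfs = {}
    for i in allele_counts:
        minor = min(i, n - i)
        sfs[minor] = sfs.get(minor, 0) + 1
    return sfs


def singleton_chi2(sfs_a, sfs_b):
    """Chi-square test (1 df, no continuity correction) of singletons versus
    all other classes between two site categories. All four margins of the
    2 x 2 table must be non-zero."""
    a1 = sfs_a.get(1, 0)
    a0 = sum(sfs_a.values()) - a1
    b1 = sfs_b.get(1, 0)
    b0 = sum(sfs_b.values()) - b1
    total = a1 + a0 + b1 + b0
    chi2 = total * (a1 * b0 - a0 * b1) ** 2 / (
        (a1 + a0) * (b1 + b0) * (a1 + b1) * (a0 + b0)
    )
    # Survival function of the chi-square distribution with 1 df.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Toy data: allele counts at five sites in a sample of 13 chromosomes.
print(folded_sfs([1, 12, 6, 7, 2], 13))
# Toy spectra for two site categories (counts invented).
chi2, p_value = singleton_chi2({1: 30, 2: 10}, {1: 10, 2: 30})
print(chi2, p_value)
```

The comparison across all frequency classes follows the same pattern with a larger contingency table and correspondingly more degrees of freedom.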

Candidate genes associated with mapped QTL

The candidate genes from DSPR QTL papers were selected based on the following criteria: (1) the gene falls within the QTL peak; (2) additional functional data are cited by the authors of the respective study to highlight the gene; and (3) the functional information cited by the authors did not use knowledge about structural variation affecting the candidate locus (Supplementary Data 1). The additional data can be either expression data collected by the authors or existing functional data known about the genes. Only 44 candidate genes from eight studies fulfilled these criteria, and three of these fell outside the euchromatic boundaries used here (Supplementary Table 1). Hence, only 41 candidate genes were included in the SV enrichment analysis. Of these 41 candidate genes, 10 were at a single locus (GstE1-10). We therefore carried out our analysis treating GstE1-10 as either a single gene or ten different genes (the qualitative outcome is unchanged). To test whether candidate genes are longer than average genes, we considered all genes as well as the dataset excluding the GstE1-10 genes (Supplementary Data 2). The lengths of candidate genes were compared against the rest of the genome using a Mann–Whitney U test.
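For readers unfamiliar with the length comparison, here is a minimal, dependency-free Python sketch of a two-sided Mann–Whitney U test. It is an illustration only: it uses midranks for ties and a normal approximation without tie or continuity correction, which may differ from the implementation the authors used, and the gene lengths are made-up toy values.

```python
import math


def midranks(values):
    """1-based ranks of values, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U for sample x, two-sided p-value by normal approximation).
    No tie or continuity correction is applied (a simplification)."""
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    return u1, math.erfc(abs(z) / math.sqrt(2))


# Toy gene lengths in bp (hypothetical values, not from the paper).
candidate_lengths = [12000, 45000, 8000, 30000, 52000]
other_lengths = [1500, 2200, 900, 4000, 3100, 7000, 1200]

u, p = mann_whitney_u(candidate_lengths, other_lengths)
print(f"U = {u}, p = {p:.4f}")
```

With samples as small as this toy example, an exact test would be preferable to the normal approximation; the approximation is more appropriate for genome-scale gene lists.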

Candidate gene enrichment analysis

To determine whether candidate genes are enriched for SVs relative to the rest of the genome, we analyzed the candidate gene dataset without the GstE1-10 gene array (Table S4). The genes comprising GstE1-10 were excluded to avoid confounding effects of the complex structure of the locus and the identity of GstE1, GstE3, GstE5, and GstE6. As described below, the conclusions of the enrichment analysis do not depend on this locus, so we report the results of excluding it. A Fisher’s Exact Test was applied to the counts in categories of candidate gene vs. rest of the genome and SV-free vs. SV-burdened genes. To account for the candidate genes being longer than the rest of the genome, we performed a Monte Carlo sampling of the whole genome according to the histogram of gene sizes in the candidate gene list (Supplementary Data 2). We sampled from the genome by drawing from each gene length bin according to a hypergeometric distribution, where n is the number of candidate genes in the candidate bin, K is the number of SV-burdened genes in the genome bin, and N − K is the number of SV-free genes in the genome bin (Supplementary Data 2). We then tallied up the number of observed SVs across all bins. We repeated this 100,000 times to construct a Monte Carlo distribution of the SV burden expected of genes matching the size distribution observed in the actual candidate genes. This led to 100,000 simulated size distributions that matched the observed size distribution (every Mann–Whitney U p-value of Monte Carlo sample lengths compared against the observed candidate lengths >0.1). The expected density of SV-burdened genes (number of burdened genes per Mbp of gene spans) and the expected SV density (total number of SVs per Mbp of gene spans) were also calculated from the Monte Carlo samples.
Although we present the enrichment results from analysis performed on the gene set without the GstE1-10 array, inclusion of the array either as a single 13kb locus or as individual genes does not alter the conclusion of the enrichment analysis (single Gst locus: enrichment p-value = 0.021, length p-value = 6.5 × 10⁻⁵; individual Gst genes: length p-value = 0.034 and enrichment p-value = 2.9 × 10⁻³).
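The length-matched Monte Carlo procedure described above can be sketched as follows. This is a simplified Python illustration, not the authors' script: each genome length bin is represented as an urn of SV-burdened (1) and SV-free (0) genes, and sampling n genes without replacement from the urn is a hypergeometric draw. The bin labels, counts, function names, and the add-one empirical p-value are assumptions made for the example.

```python
import random


def simulate_burden(candidate_bins, genome_bins, n_iter=100_000, seed=0):
    """Simulate the number of SV-burdened genes in length-matched samples.

    candidate_bins: {bin label: n, candidate genes in that length bin}
    genome_bins:    {bin label: (K, N)}, with K SV-burdened genes
                    among N genes in that genome length bin
    """
    rng = random.Random(seed)
    # One urn per bin: 1 marks an SV-burdened gene, 0 an SV-free gene.
    urns = {b: [1] * k + [0] * (n - k) for b, (k, n) in genome_bins.items()}
    simulated = []
    for _ in range(n_iter):
        burdened = 0
        for b, n_draw in candidate_bins.items():
            # Sampling without replacement is a hypergeometric draw.
            burdened += sum(rng.sample(urns[b], n_draw))
        simulated.append(burdened)
    return simulated


def empirical_p(simulated, observed):
    """One-sided empirical p-value with an add-one correction (assumed)."""
    extreme = sum(s >= observed for s in simulated)
    return (extreme + 1) / (len(simulated) + 1)


# Toy length bins (hypothetical labels and counts).
candidate_bins = {"0-5kb": 2, "5-20kb": 3}
genome_bins = {"0-5kb": (10, 100), "5-20kb": (40, 80)}

sims = simulate_burden(candidate_bins, genome_bins, n_iter=2000, seed=1)
print("mean simulated burden:", sum(sims) / len(sims))
print("empirical p for observed = 4:", empirical_p(sims, 4))
```

The usage example runs 2,000 iterations to keep it fast; the default of 100,000 matches the number of replicates reported in the text.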

Calculating the SV burden in genes in diploid individuals

To calculate the distribution of SV burden expected in diploids, the haploid genotype of each founder was paired with that of every other founder, for a total of 78 possible pairings. For each of these diploid pairings, the number of unique SV mutations for each gene in the genome was recorded. A mutation is said to affect a gene if it falls within the gene span, that is, if it affects nucleotides between the start and end coordinates of the gene feature in the D. melanogaster release 6.16 gff file³⁶. The number of SV mutations overlapping a gene in a given diploid combination is considered that gene’s multiplicity for that combination. Any gene with a multiplicity ≥1 for a particular diploid pairing is considered SV-burdened for that diploid.
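A small Python sketch can make the pairing and multiplicity calculation concrete. It assumes that "unique" means an SV shared by both haplotypes of a pairing is counted once, so the multiplicity is the size of the union of the two founders' SV sets for a gene. The founder labels, SV identifiers, and function name are hypothetical. Note that 78 pairings corresponds to choosing 2 of 13 founders.

```python
from itertools import combinations


def diploid_multiplicity(founder_svs):
    """Per-gene SV multiplicity for every pairing of two founders.

    founder_svs: {founder: {gene: set of SV identifiers in that gene}}
    Returns {(founder1, founder2): {gene: multiplicity}}, listing only
    genes with multiplicity >= 1 (the SV-burdened genes of that diploid).
    """
    result = {}
    for f1, f2 in combinations(sorted(founder_svs), 2):
        genes = set(founder_svs[f1]) | set(founder_svs[f2])
        counts = {}
        for gene in genes:
            # An SV carried by both haplotypes is counted once.
            svs = founder_svs[f1].get(gene, set()) | founder_svs[f2].get(gene, set())
            if svs:
                counts[gene] = len(svs)
        result[(f1, f2)] = counts
    return result


# Toy genotypes (hypothetical SV identifiers).
founder_svs = {
    "A1": {"InR": {"TE_1", "dup_7"}},
    "A2": {"InR": {"TE_1"}, "Cyp28d1": {"del_3"}},
    "A3": {},
}

multiplicity = diploid_multiplicity(founder_svs)
for pair, genes in multiplicity.items():
    print(pair, genes)

# 13 founders give 13 * 12 / 2 = 78 pairings.
n_pairings = sum(1 for _ in combinations(range(13), 2))
```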

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

All scaffolded assemblies and the raw data (HDF5 files and their respective metadata) have been deposited in NCBI under the BioProject accession PRJNA418342. All raw SV outputs and processed data are available at https://github.com/mahulchak/dspr-asm.

Code availability

All scripts and code have been deposited on GitHub and are freely accessible from https://github.com/mahulchak/dspr-asm.

References

1. Mauricio, R. Mapping quantitative trait loci in plants: uses and caveats for evolutionary biology. Nat. Rev. Genet. 2, 370–381 (2001).
2. Mackay, T. F., Stone, E. A. & Ayroles, J. F. The genetics of quantitative traits: challenges and prospects. Nat. Rev. Genet. 10, 565–577 (2009).
3. Goddard, M. E. & Hayes, B. J. Mapping genes for complex traits in domestic animals and their use in breeding programmes. Nat. Rev. Genet. 10, 381–391 (2009).
4. Stranger, B. E., Stahl, E. A. & Raj, T. Progress and promise of genome-wide association studies for human complex trait genetics. Genetics 187, 367–383 (2011).
5. Shendure, J. & Ji, H. Next-generation DNA sequencing. Nat. Biotechnol. 26, 1135–1145 (2008).
6. Bansal, V. et al. Accurate detection and genotyping of SNPs utilizing population sequencing data. Genome Res. 20, 537–545 (2010).
7. Varshney, R. K., Nayak, S. N., May, G. D. & Jackson, S. A. Next-generation sequencing technologies and their implications for crop genetics and breeding. Trends Biotechnol. 27, 522–530 (2009).
8. Day-Williams, A. G. & Zeggini, E. The effect of next-generation sequencing technology on complex trait research. Eur. J. Clin. Invest. 41, 561–567 (2011).
9. Davey, J. W. et al. Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Nat. Rev. Genet. 12, 499–510 (2011).
10. Manolio, T. A. et al. Finding the missing heritability of complex diseases. Nature 461, 747–753 (2009).
11. Eichler, E. E. et al. Missing heritability and strategies for finding the underlying causes of complex disease. Nat. Rev. Genet. 11, 446–450 (2010).
12. Frazer, K. A., Murray, S. S., Schork, N. J. & Topol, E. J. Human genetic variation and its contribution to complex traits. Nat. Rev. Genet. 10, 241–251 (2009).
13. Spencer, C. C. A., Su, Z., Donnelly, P. & Marchini, J. Designing genome-wide association studies: sample size, power, imputation, and the choice of genotyping chip. PLoS Genet. 5, e1000477 (2009).
14. Huddleston, J. & Eichler, E. E. An incomplete understanding of human genetic variation. Genetics 202, 1251–1254 (2016).
15. Chakraborty, M. et al. Hidden genetic variation shapes the structure of functional elements in Drosophila. Nat. Genet. 50, 20–25 (2018).
16. Emerson, J. J., Cardoso-Moreira, M., Borevitz, J. O. & Long, M. Natural selection shapes genome-wide patterns of copy-number polymorphism in Drosophila melanogaster. Science 320, 1629–1631 (2008).
17. Conrad, D. F. et al. Origins and functional impact of copy number variation in the human genome. Nature 464, 704–712 (2010).
18. Cridland, J. M., Macdonald, S. J., Long, A. D. & Thornton, K. R. Abundance and distribution of transposable elements in two Drosophila QTL mapping resources. Mol. Biol. Evol. 30, 2311–2327 (2013).
19. Rogers, R. L. et al. Tandem duplications and the limits of natural selection in Drosophila yakuba and Drosophila simulans. PLoS ONE 10, e0132184 (2015).
20. Huddleston, J. et al. Discovery and genotyping of structural variation from long-read haploid genome sequence data. Genome Res. 27, 677–685 (2017).
21. Alkan, C., Coe, B. P. & Eichler, E. E. Genome structural variation discovery and genotyping. Nat. Rev. Genet. 12, 363–376 (2011).
22. Chakraborty, M., Baldwin-Brown, J. G., Long, A. D. & Emerson, J. J. Contiguous and accurate de novo assembly of metazoan genomes with modest long read coverage. Nucleic Acids Res. https://doi.org/10.1093/nar/gkw654 (2016).
23. Simao, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
24. King, E. G. et al. Genetic dissection of a model complex trait using the Drosophila Synthetic Population Resource. Genome Res. 22, 1558–1566 (2012).
25. Long, A. D., Macdonald, S. J. & King, E. G. Dissecting complex traits using the Drosophila Synthetic Population Resource. Trends Genet. 30, 488–495 (2014).
26. The modENCODE Consortium et al. Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science 330, 1787–1797 (2010).
27. Graveley, B. R. et al. The developmental transcriptome of Drosophila melanogaster. Nature 471, 473–479 (2011).
28. Schwartz, Y. B. et al. Nature and function of insulator protein binding sites in the Drosophila genome. Genome Res. 22, 2188–2198 (2012).
29. Chang, C.-H. & Larracuente, A. M. Heterochromatin-enriched assemblies reveal the sequence and organization of the Drosophila melanogaster Y chromosome. Genetics 211, 333–348 (2019).
30. Smith, C. D., Shu, S. Q., Mungall, C. J. & Karpen, G. H. The Release 5.1 annotation of Drosophila melanogaster heterochromatin. Science 316, 1586–1591 (2007).
31. Khost, D. E., Eickbush, D. G. & Larracuente, A. M. Single-molecule sequencing resolves the detailed structure of complex satellite DNA loci in Drosophila melanogaster. Genome Res. 27, 709–721 (2017).
32. Daborn, P. J. et al. A single P450 allele associated with insecticide resistance in Drosophila. Science 297, 2253–2256 (2002).
33. Schmidt, J. M. et al. Copy number variation and transposable elements feature in recent, ongoing adaptation at the Cyp6g1 locus. PLoS Genet. 6, e1000998 (2010).
34. Yang, W. Y. et al. Functional divergence of six isoforms of antifungal peptide Drosomycin in Drosophila melanogaster. Gene 379, 26–32 (2006).
35. Warren, W. D., Palmer, S. & Howells, A. J. Molecular characterization of the cinnabar region of Drosophila melanogaster: identification of the cinnabar transcription unit. Genetica 98, 249–262 (1996).
36. dos Santos, G. et al. FlyBase: introduction of the Drosophila melanogaster Release 6 reference genome assembly and large-scale migration of genome annotations. Nucleic Acids Res. 43, D690–D697 (2015).
37. Tajima, F. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123, 585–595 (1989).
38. Slatkin, M. & Hudson, R. R. Pairwise comparisons of mitochondrial DNA sequences in stable and exponentially growing populations. Genetics 129, 555–562 (1991).
39. Andolfatto, P. Adaptive evolution of non-coding DNA in Drosophila. Nature 437, 1149–1152 (2005).
40. Williamson, S. H. et al. Simultaneous inference of selection and population growth from patterns of variation in the human genome. Proc. Natl. Acad. Sci. U. S. A. 102, 7882–7887 (2005).
41. Akashi, H. & Schaeffer, S. W. Natural selection and the frequency distributions of “silent” DNA polymorphism in Drosophila. Genetics 146, 295–307 (1997).
42. Pritchard, J. K. Are rare variants responsible for susceptibility to complex diseases? Am. J. Hum. Genet. 69, 124–137 (2001).
43. Thornton, K. R., Foran, A. J. & Long, A. D. Properties and modeling of GWAS when complex disease risk is due to non-complementing, deleterious mutations in genes of large effect. PLoS Genet. 9, e1003258 (2013).
44. Lohmueller, K. E. The impact of population demography and selection on the genetic architecture of complex traits. PLoS Genet. 10, e1004379 (2014).
45. Simons, Y. B., Turchin, M. C., Pritchard, J. K. & Sella, G. The deleterious mutation load is insensitive to recent population history. Nat. Genet. 46, 220 (2014).
46. Sudmant, P. H. et al. An integrated map of structural variation in 2,504 human genomes. Nature 526, 75 (2015).
47. Marriage, T. N., King, E. G., Long, A. D. & Macdonald, S. J. Fine-mapping nicotine resistance loci in Drosophila using a multiparent advanced generation inter-cross population. Genetics 198, 45–57 (2014).
48. Stanley, P. D., Ng’oma, E., O’Day, S. & King, E. G. Genetic dissection of nutrition-induced plasticity in insulin/insulin-like growth factor signaling and median life span in a Drosophila multiparent population. Genetics 206, 587–602 (2017).
49. Tatar, M. et al. A mutant Drosophila insulin receptor homolog that extends life-span and impairs neuroendocrine function. Science 292, 107–110 (2001).
50. Toivonen, J. M. & Partridge, L. Endocrine regulation of aging and reproduction in Drosophila. Mol. Cell. Endocrinol. 299, 39–50 (2009).
51. Paaby, A. B., Bergland, A. O., Behrman, E. L. & Schmidt, P. S. A highly pleiotropic amino acid polymorphism in the Drosophila insulin receptor contributes to life-history adaptation. Evolution 68, 3395–3409 (2014).
52. Paaby, A. B., Blacket, M. J., Hoffmann, A. A. & Schmidt, P. S. Identification of a candidate adaptive polymorphism for Drosophila life history by parallel independent clines on two continents. Mol. Ecol. 19, 760–774 (2010).
53. Brogiolo, W. et al. An evolutionarily conserved function of the Drosophila insulin receptor and insulin-like peptides in growth control. Curr. Biol. 11, 213–221 (2001).
54. Rauschenbach, I. Y. et al. Interplay of insulin and dopamine signaling pathways in the control of Drosophila melanogaster fitness. Dokl. Biochem. Biophys. 461, 135–138 (2015).
55. Wei, Y. et al. Complex cis-regulatory landscape of the insulin receptor gene underlies the broad expression of a central signaling regulator. Development 143, 3591–3603 (2016).
56. Negre, N. et al. A cis-regulatory map of the Drosophila genome. Nature 471, 527–531 (2011).
57. Mackay, T. F. C. et al. The Drosophila melanogaster Genetic Reference Panel. Nature 482, 173–178 (2012).
58. Pool, J. E. et al. Population genomics of sub-Saharan Drosophila melanogaster: African diversity and non-African admixture. PLoS Genet. 8, e1003080 (2012).
59. Lack, J. B. et al. The Drosophila genome nexus: a population genomic resource of 623 Drosophila melanogaster genomes, including 197 from a single ancestral range population. Genetics 199, 1229–1241 (2015).
60. Ortiz, J. G. M., Opoka, R., Kane, D. & Cartwright, I. L. Investigating arsenic susceptibility from a genetic perspective in Drosophila reveals a key role for glutathione synthetase. Toxicol. Sci. 107, 416–426 (2009).
61. Logan-Garbisch, T. et al. Developmental ethanol exposure leads to dysregulation of lipid metabolism and oxidative stress in Drosophila. G3 5, 49–59 (2015).
62. McClellan, J. & King, M. C. Genetic heterogeneity in human disease. Cell 141, 210–217 (2010).
63. Turelli, M. Heritable genetic variation via mutation-selection balance: Lerch’s zeta meets the abdominal bristle. Theor. Popul. Biol. 25, 138–193 (1984).
64. Johnson, T. & Barton, N. Theoretical models of selection and mutation on quantitative traits. Philos. Trans. R. Soc. B-Biol. Sci. 360, 1411–1425 (2005).
65. Ye, C., Hill, C. M., Wu, S., Ruan, J. & Ma, Z. S. DBG2OLC: efficient assembly of large genomes using long erroneous reads of the third generation sequencing technologies. Sci. Rep. 6, 31900 (2016).
66. Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736 (2017).
67. Solares, E. A. et al. Rapid low-cost assembly of the Drosophila melanogaster reference genome using low-coverage, long-read sequencing. G3 8, 3143–3154 (2018).
68. Lam, K. K., LaButti, K., Khalak, A. & Tse, D. FinisherSC: a repeat-aware tool for upgrading de novo assembly using long reads. Bioinformatics 31, 3207–3209 (2015).
69. Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS ONE 9, e112963 (2014).
70. Hoskins, R. A. et al. The Release 6 reference sequence of the Drosophila melanogaster genome. Genome Res. 25, 445–458 (2015).
71. Kurtz, S. et al. Versatile and open software for comparing large genomes. Genome Biol. 5, R12 (2004).
72. Waterhouse, R. M. et al. BUSCO applications from quality assessments to gene prediction and phylogenomics. Mol. Biol. Evol. https://doi.org/10.1093/molbev/msx319 (2017).
73. Quinlan, A. R. BEDTools: the Swiss-army tool for genome feature analysis. Curr. Protoc. Bioinforma. 47, 11–34 (2014).
74. Kent, W. J. et al. The human genome browser at UCSC. Genome Res. 12, 996–1006 (2002).
75. Rogers, R. L. et al. Landscape of standing variation for tandem duplications in Drosophila yakuba and Drosophila simulans. Mol. Biol. Evol. 31, 1750–1766 (2014).
76. Ye, K., Schulz, M. H., Long, Q., Apweiler, R. & Ning, Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 25, 2865–2871 (2009).
77. Abyzov, A., Urban, A. E., Snyder, M. & Gerstein, M. CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res. 21, 974–984 (2011).
78. King, E. G., Sanderson, B. J., McNeil, C. L., Long, A. D. & Macdonald, S. J. Genetic dissection of the Drosophila melanogaster female head transcriptome reveals widespread allelic heterogeneity. PLoS Genet. 10, e1004322 (2014).
79. Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
80. Li, H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987–2993 (2011).

Acknowledgements

We wish to acknowledge support from the following grants: OD010974 (SJM and ADL), GM115562 (ADL), R01GM123303-1 and University of California, Irvine setup funds (JJE), and K99GM129411 (MC). We thank Luna Thanh Ngo and Daniel Na for help with data management and fly maintenance. This work was made possible, in part, through access to the Genomics High-Throughput Facility Shared Resource of the Cancer Center Support Grant CA-62203 at the University of California, Irvine, and NIH shared-instrumentation grants 1S10RR025496-01, 1S10OD010794-01, and 1S10OD021718-01.

Author information

Authors and Affiliations

Department of Ecology and Evolutionary Biology, University of California Irvine, Irvine, CA, 92697, USA
Mahul Chakraborty, J. J. Emerson & Anthony D. Long
Department of Molecular Biosciences, University of Kansas, Lawrence, KS, 66045, USA
Stuart J. Macdonald

Contributions

M.C., J.J.E., S.J.M., and A.D.L. conceived of the work and wrote the paper. M.C. assembled the genomes and wrote the variant caller.

Corresponding authors

Correspondence to Mahul Chakraborty or Anthony D. Long.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Communications thanks Wen Huang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Chakraborty, M., Emerson, J.J., Macdonald, S.J. et al. Structural variants exhibit widespread allelic heterogeneity and shape variation in complex traits. Nat Commun 10, 4872 (2019). https://doi.org/10.1038/s41467-019-12884-1

Received: 27 December 2018
Accepted: 25 September 2019
Published: 25 October 2019
DOI: https://doi.org/10.1038/s41467-019-12884-1
