A structural variation reference for medical and population genetics

Structural variants (SVs) rearrange large segments of DNA1 and can have profound consequences in evolution and human disease2,3. As national biobanks, disease-association studies, and clinical genetic testing have grown increasingly reliant on genome sequencing, population references such as the Genome Aggregation Database (gnomAD)4 have become integral in the interpretation of single-nucleotide variants (SNVs)5. However, there are no reference maps of SVs from high-coverage genome sequencing comparable to those for SNVs. Here we present a reference of sequence-resolved SVs constructed from 14,891 genomes across diverse global populations (54% non-European) in gnomAD. We discovered a rich and complex landscape of 433,371 SVs, from which we estimate that SVs are responsible for 25–29% of all rare protein-truncating events per genome. We found strong correlations between natural selection against damaging SNVs and rare SVs that disrupt or duplicate protein-coding sequence, which suggests that genes that are highly intolerant to loss-of-function are also sensitive to increased dosage6. We also uncovered modest selection against noncoding SVs in cis-regulatory elements, although selection against protein-truncating SVs was stronger than all noncoding effects. Finally, we identified very large (over one megabase), rare SVs in 3.9% of samples, and estimate that 0.13% of individuals may carry an SV that meets the existing criteria for clinically important incidental findings7. This SV resource is freely distributed via the gnomAD browser8 and will have broad utility in population genetics, disease-association studies, and diagnostic screening. A large empirical assessment of sequence-resolved structural variants from 14,891 genomes across diverse global populations in the Genome Aggregation Database (gnomAD) provides a reference map for disease-association studies, population genetics, and diagnostic screening.

coverage) 1 , and the substantial technical challenges of SV discovery from WGS 15 has led to non-uniform SV analyses across contemporary studies [16][17][18][19][20] . Moreover, short-read WGS is unable to capture a subset of SVs accessible to more expensive niche technologies, such as long-read WGS 21 . Owing to the combination of these challenges, SV references are dwarfed by contemporary resources for short variants, such as the Exome Aggregation Consortium (ExAC) and its successor, the Genome Aggregation Database (gnomAD), which have jointly analysed more than 140,000 individuals 4,6 . Publicly available resources such as ExAC and gnomAD have transformed many aspects of human genetics, including defining sets of genes constrained against damaging coding mutations 6 and providing frequency filters for variant interpretation 5 . As short-read WGS is rapidly becoming the predominant technology in large-scale human disease studies, and will probably displace conventional methods for diagnostic screening, there is a mounting need for comparable references of SVs across global populations.
In this study, we developed gnomAD-SV, a sequence-resolved reference for SVs from 14,891 genomes. Our analyses revealed diverse mutational patterns among SVs, and principles of selection acting against reciprocal dosage changes in genes and noncoding cis-regulatory elements. From these analyses, we determined that SVs represent more than 25% of all rare protein-truncating events per genome, emphasizing the unrealized potential of routine SV detection in WGS studies. This SV reference has been integrated into the gnomAD browser (http://gnomad.broadinstitute.org) with no restrictions on reuse so that it can be mined for new insights into genome biology and applied as a resource to interpret SVs in diagnostic screening.

SV discovery and genotyping
We analysed WGS data for 14,891 samples (average coverage of 32×) aggregated from large-scale sequencing projects, of which 14,237 (95.6%) passed all quality thresholds, representing a general adult population depleted for severe Mendelian diseases (median age of 49 years) (Supplementary Table 1, Supplementary Figs. 1, 2). This cohort included 46.1% European, 34.9% African or African American, 9.2% East Asian, and 8.7% Latino samples, as well as 1.2% samples from admixed or other populations (Fig. 1). Following family-based analyses using 970 parent-child trios for quality assessments, we pruned all first-degree relatives from the cohort, retaining 12,653 unrelated genomes for subsequent analyses.
We discovered and genotyped SVs using a cloud-based, multi-algorithm pipeline for short-read WGS ( Supplementary Fig. 3), which we prototyped in a study of 519 autism quartet families 20 . This pipeline integrated four orthogonal evidence types to capture SVs across the size and allele frequency spectra, including six classes of canonical SVs (Fig. 1a) and 11 subclasses of complex SVs 22 (Fig. 2). We augmented this pipeline with new methods to account for the technical heterogeneity of aggregated datasets (Extended Data Fig. 1, Supplementary Figs. 4,5), and discovered 433,371 SVs (Fig. 1c). After excluding low-quality SVs, which were predominantly (61.6%) composed of incompletely resolved breakpoint junctions (that is, 'breakends') that lack interpretable alternative allele structures for functional annotation and produce high false-discovery rates 20 (Extended Data Fig. 2a), we retained 335,470 high-quality SVs for subsequent analyses (Supplementary Table 3). This final set of high-quality SVs corresponded to a median of 7,439 SVs per genome, or more than twice the number of variants per genome captured by previous WGS-based SV studies such as the 1000 Genomes Project (3,441 SVs per genome from approximately 7× coverage WGS), which underscores the benefits of high-coverage WGS and improved multi-algorithm ensemble methods for SV discovery.
Given that there are no gold-standard benchmarking procedures for SVs from WGS, we evaluated the technical qualities of gnomAD-SV using seven orthogonal approaches. These analyses are described in detail in Extended Data Figs. 2, 3, Supplementary Figs. 6-12, Supplementary Table 4 and Supplementary Note 1, but we highlight just a few here to demonstrate that gnomAD-SV conforms to many fundamental principles of population genetics, including Mendelian segregation, genotype distributions, and linkage disequilibrium. We found that the precision of gnomAD-SV was comparable to our previous study of 519 autism quartets that attained a 97% molecular validation rate for all de novo SV predictions 20 : in gnomAD, analyses of 970 parent-child trios indicated a median Mendelian violation rate of 3.8% and a heterozygous de novo rate of 3.0%. We also observed that 86% of SVs were in Hardy-Weinberg equilibrium, and common SVs were in strong linkage disequilibrium with nearby SNVs or indels (median peak R² = 0.85). We performed extensive in silico confirmation of 19,316 SVs predicted from short-read WGS using matched long-read WGS from four samples 21,23 , finding a 94.0% confirmation rate with breakpoint-level read evidence, and revealing that 59.8% of breakpoint coordinates were accurate within a single nucleotide of the long-read data. These and other benchmarking approaches suggested that gnomAD-SV was sufficiently sensitive and specific to be used as a reference dataset for most applications in human genomics.
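As an illustration of the Hardy-Weinberg check mentioned above, the following minimal sketch applies a one-degree-of-freedom chi-square goodness-of-fit test to the genotype counts of a single biallelic SV. It is not necessarily the test used to produce the 86% figure; the function name and the example genotype counts are hypothetical.

```python
import math


def hwe_chi_square(n_ref_hom: int, n_het: int, n_alt_hom: int) -> tuple[float, float]:
    """Return (chi-square statistic, P value) for a 1-d.f. Hardy-Weinberg test."""
    n = n_ref_hom + n_het + n_alt_hom
    if n == 0:
        raise ValueError("no genotypes supplied")
    # Allele frequencies estimated from the observed genotype counts.
    p = (2 * n_ref_hom + n_het) / (2 * n)  # reference allele
    q = 1.0 - p                            # alternative (SV) allele
    observed = (n_ref_hom, n_het, n_alt_hom)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 d.f.:
    # P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: the first site matches Hardy-Weinberg proportions,
# the second has no heterozygotes at all.
in_equilibrium = hwe_chi_square(810, 180, 10)
out_of_equilibrium = hwe_chi_square(900, 0, 100)
print(in_equilibrium, out_of_equilibrium)
```

In practice an exact test is often preferred for rare variants, because the chi-square approximation degrades when expected homozygote counts are small.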

Population genetics and genome biology
The distribution of SVs across samples matched expectations based on human demographic history, with the top three components of genetic variance separating continental populations (Fig. 1d, Supplementary Fig. 13). African and African American samples exhibited the greatest genetic diversity and their common SVs were in weaker linkage disequilibrium with nearby short variants than Europeans, whereas East Asians featured the highest levels of homozygosity ( Fig. 1e, Extended Data Fig. 4a-d, Supplementary Fig. 7). The mutational diversity of gnomAD-SV was extensive: we completely resolved 5,295 complex SVs across 11 mutational subclasses, of which 3,901 (73.7%) involved inverted segments (Fig. 2), confirming that inversion variation is predominantly composed of complex SVs rather than canonical inversions 1,24 . Across all SV classes, most SVs were small (median size of 331 bp) and rare (allele frequency < 1%; 92% of SVs), with half of all SVs (49.8%) appearing as 'singletons' (that is, only one allele observed across all samples) (Fig. 1f, g). Although the proportion of singletons varied by SV class, it was strongly dependent on SV size across all classes, which suggests that the amount of DNA rearranged is a key determinant of selection against most SVs (Fig. 1h, Extended Data Fig. 5a). Mutation rate estimates for SVs have remained elusive owing to limited sample sizes, poor resolution of conventional technologies, technical challenges of SV discovery, and use of cell line-derived DNA in population studies 1,25 . Here, we used the Watterson estimator 26 to project a mean mutation rate of 0.29 de novo SVs (95% confidence interval 0.13-0.44) per generation in regions of the genome accessible to short-read WGS, or roughly one new SV every 2-8 live births, with mutation rates varying markedly by SV class (Fig. 3a). 
Although this imperfect method extrapolates from data pooled across unrelated individuals, we previously demonstrated comparable rates from molecularly validated observations in 519 quartet families 20 . Like mutation rates, the distribution of SVs throughout the genome was non-uniform, significantly correlated with repetitive sequence contexts, and was enriched near centromeres and telomeres 23 (Supplementary Fig. 16). These trends were dependent on SV class, as biallelic deletions and duplications were predominantly enriched at telomeres, whereas MCNVs were enriched in centromeric segmental duplications (Fig. 3b-d). Given the reduced sensitivity of short-read WGS in repetitive sequences, this study certainly underestimates the true SV mutation rates; nevertheless, these analyses implicate several aspects of chromosomal context and SV class in determining SV mutation rates throughout the genome.
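The Watterson estimator referred to above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure used for gnomAD-SV: the relation θ = 4·Ne·μ, the effective population size passed in, and the site and chromosome counts in the usage example are all assumptions chosen for illustration. The final helper only restates the quoted rate and confidence interval as "one new SV every N births".

```python
def harmonic_number(k: int) -> float:
    """Sum of 1/i for i = 1..k."""
    return sum(1.0 / i for i in range(1, k + 1))


def watterson_theta(segregating_sites: int, n_chromosomes: int) -> float:
    """Watterson's theta: segregating sites divided by the (n-1)th harmonic number."""
    if n_chromosomes < 2:
        raise ValueError("need at least two sampled chromosomes")
    return segregating_sites / harmonic_number(n_chromosomes - 1)


def mutation_rate(theta: float, effective_population_size: float) -> float:
    """Rate per haploid genome per generation, assuming theta = 4 * Ne * mu."""
    return theta / (4.0 * effective_population_size)


def births_per_new_sv(rate_per_generation: float) -> float:
    """Expected number of births per de novo event at the given rate."""
    return 1.0 / rate_per_generation


# Hypothetical inputs: 11 segregating sites among 4 sampled chromosomes.
theta = watterson_theta(11, 4)
print(theta, mutation_rate(theta, effective_population_size=10_000))

# Restating the rate quoted in the text (0.29, 95% CI 0.13-0.44) as births per event.
for rate in (0.44, 0.29, 0.13):
    print(rate, round(births_per_new_sv(rate), 1))
```

The reciprocals of the confidence bounds, roughly 2.3 and 7.7, correspond to the "one new SV every 2-8 live births" phrasing in the text.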

Dosage sensitivity of coding and noncoding loci
Owing to their size and mutational diversity, SVs can have varied consequences on protein-coding genes 12 (Fig. 4a, Supplementary Fig. 17). In principle, any SV can result in predicted loss-of-function (pLoF), either by deleting coding nucleotides or altering open-reading frames. Coding duplications can result in copy-gain of entire genes, or of a subset of exons within a gene (referred to here as intragenic exonic duplication, or IED). The average genome in gnomAD-SV contained a mean of 179.8 genes altered by biallelic SVs (144.3 pLoF, 24.3 copy-gain, and 11.2 IED), of which 11.6 were predicted to be completely inactivated by homozygous pLoF (Fig. 4b, Extended Data Fig. 4e-h). When restricted to rare (allele frequency < 1%) SVs, we observed a mean of 10.2 altered genes per genome (5.5 pLoF, 3.4 copy-gain, and 1.3 IED). By comparison, a companion gnomAD paper estimated 122.4 pLoF short variants per genome, of which 16.3 were rare 4 . These analyses suggest that 29.4% of rare heterozygous gene inactivation events per individual are contributed by SVs, or conservatively 25.2% of pLoF events if we exclude IEDs given the context-dependence of their functional impact.
Figure legend (partial): ... Fig. 16). c, The distribution of SVs along the meta-chromosome was dependent on variant class. d, SV enrichment by class and chromosomal position provided as mean and 95% confidence intervals (CI). C, centromeric; I, interstitial; T, telomeric. P values were computed using a two-sided t-test and were Bonferroni-adjusted for 21 comparisons. *P ≤ 2.38 × 10⁻³.
A fundamental question in human genetics is the degree to which natural selection acts on coding and noncoding loci. The proportion of singleton variants has been established as a proxy for strength of selection 6 ; however, this metric is confounded for SVs given the strong correlation between allele frequency and SV size, among other factors. Therefore, we developed a new metric, adjusted proportion of singletons (APS), to account for SV class, size, genomic context, and other technical covariates (Extended Data Fig. 5, Supplementary Fig. 14). Under this normalized APS metric, a value of zero corresponds to a singleton proportion comparable to intergenic SVs, whereas values greater than zero reflect purifying selection, similar to the 'mutability-adjusted proportion of singletons' (MAPS) metric used for SNVs 6 . Applying this APS model revealed signals of pervasive selection against nearly all classes of SVs that overlap genes, including intronic SVs, whole-gene inversions, SVs in gene promoters, and deletions as small as a single exon (Fig. 4c, Extended Data Fig. 6, Supplementary Fig. 18).
The one notable exception was copy-gain duplications, which showed no clear evidence of selection beyond what could already be explained by their sizes, which were vastly larger than non-copy-gain duplications (median copy-gain duplication size = 134.8 kb; median non-copy-gain duplication size = 2.7 kb; one-tailed Wilcoxon test, W = 1.18 × 10⁸, P < 10⁻¹⁰⁰). This result could have numerous explanations, but it is consistent with the known diverse evolutionary roles of gene duplication events, including positive selection reported in humans 27,28 .
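The share of rare gene-inactivating events attributed to SVs earlier in this section can be checked with simple arithmetic. The sketch below shows one reading that reproduces the quoted 29.4% and 25.2%: rare pLoF SVs (with or without IEDs) divided by the sum of those SVs and rare pLoF short variants. The constant and function names are illustrative, and the assumption that copy-gain events are left out of both figures is ours.

```python
# Mean counts of genes altered by rare variants per genome, as quoted in the text.
RARE_SV_PLOF = 5.5       # rare pLoF SVs
RARE_SV_IED = 1.3        # rare intragenic exonic duplications
RARE_SHORT_PLOF = 16.3   # rare pLoF short variants (companion gnomAD paper)


def sv_fraction(sv_events: float, short_variant_events: float) -> float:
    """Fraction of all counted events that are contributed by SVs."""
    return sv_events / (sv_events + short_variant_events)


# Counting IEDs alongside pLoF SVs.
with_ied = sv_fraction(RARE_SV_PLOF + RARE_SV_IED, RARE_SHORT_PLOF)
# Conservative estimate that excludes IEDs.
without_ied = sv_fraction(RARE_SV_PLOF, RARE_SHORT_PLOF)

print(f"{with_ied:.1%}  {without_ied:.1%}")
```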
Methods that quantify evolutionary constraint on a per-gene basis, such as the probability of intolerance to heterozygous pLoF variation (pLI) 6 and the pLoF observed/expected upper fraction (LOEUF) 4 , have become core resources in human genetics. Nearly all existing metrics, including pLI and LOEUF, are derived from SNVs. Although previous studies have attempted to compute similar scores using large CNVs detected by microarray and exome sequencing 29,30 , or to correlate deletions with pLI 18 , no gene-level metrics comparable to LOEUF exist for SVs at WGS resolution. To gain insight into this problem, we built a model to estimate the depletion of rare SVs per gene compared to expectations based on gene length, genomic context, and the structure of exons and introns. This model is imperfect, as current sample sizes are too sparse to derive precise gene-level metrics of constraint from SVs. Nevertheless, we found strong concordance between the depletion of rare pLoF SVs and existing pLoF and missense SNV constraint metrics 4 (pLoF Spearman correlation test, ρ = 0.90, P < 10⁻¹⁰⁰) (Fig. 4d, Supplementary Fig. 19). Notably, a comparable positive correlation was also observed for copy-gain SVs and SNV constraint (pLoF Spearman correlation test, ρ = 0.78, P < 10⁻¹⁰⁰), whereas a weaker yet significant correlation was detected for IEDs (pLoF Spearman correlation test, ρ = 0.58, P = 2.0 × 10⁻¹¹). As orthogonal support for these trends, we identified an inverse correlation between APS and SNV constraint across all functional categories of SVs, which was consistent with our observed depletion of rare, functional SVs in constrained genes (Extended Data Fig. 6f). These comparisons confirm that selection against most classes of gene-altering SVs mirrors patterns observed for short variants 18,30 .
They further suggest that SNV-derived constraint metrics such as LOEUF capture a general correspondence between haploinsufficiency and triplosensitivity for a large fraction of genes in the genome. It therefore appears that the most highly pLoF-constrained genes not only are sensitive to pLoF, but also are more likely to be intolerant to increased dosage and other functional alterations.
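The Spearman correlations reported above compare rank orderings rather than raw values. A minimal, dependency-free sketch of the statistic is given below; the binned constraint scores and observed/expected SV ratios in the usage example are hypothetical and are not taken from the study.

```python
import math
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical bins of genes ordered from most to least SNV-constrained,
# with a made-up observed/expected ratio of rare pLoF SVs for each bin.
constraint_bin = [1, 2, 3, 4, 5, 6]
sv_obs_over_exp = [0.30, 0.45, 0.40, 0.70, 0.90, 1.10]
print(round(spearman_rho(constraint_bin, sv_obs_over_exp), 3))
```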
In contrast to the well-studied effects of coding variation, the effects of noncoding SVs on regulatory elements are largely unknown. There are a handful of examples of SVs with strong noncoding effects, although they are scarce in humans and model organisms 31,32 . In gnomAD-SV, we explored noncoding dosage sensitivity across 14 regulatory element classes, ranging from high-confidence experimentally validated enhancers to large databases of computationally predicted elements (Supplementary Table 5). We found that noncoding CNVs overlapping most element classes had increased proportions of singletons, although none exceeded the APS observed for pLoF SVs (Fig. 5a). In general, the effects of noncoding deletions appeared stronger than noncoding duplications, and CNVs predicted to delete or duplicate entire elements were under stronger selection than partial element disruption (Fig. 5b). We also observed that primary sequence conservation was correlated with selection against noncoding CNVs (Fig. 5c, d), which provides a foothold for future work on interpretation and functional effect prediction for noncoding SVs. Broadly, these results followed trends we observed for protein-coding SVs, which we interpreted as evidence for weak but widespread selection against CNVs altering most classes of annotated regulatory elements.
Figure legend (partial): ... Supplementary Fig. 19 for comparisons to missense constraint.
... Fig. 7), 14.8% of which matched a reported association from the NHGRI-EBI GWAS catalogue or a recent analysis of 4,203 phenotypes in the UK Biobank 33,34 . Common SVs in linkage disequilibrium with GWAS variants were enriched for genic SVs across multiple functional categories (Supplementary Table 6), and included candidate SVs such as a deletion of a thyroid enhancer in the first intron of ATP6V0D1 at a hypothyroidism-associated locus 34 (Extended Data Fig. 7).
We also identified matches for previously proposed causal SVs tagged by common SNVs, including pLoF deletions of CFHR3 or CFHR1 in nephropathies and of LCE3B or LCE3C in psoriasis 35,36 . These results demonstrate the value of imputing SVs into GWAS, and for the eventual unification of short variants and SVs in all trait association studies. Given the potential value of this resource, we have released these linkage disequilibrium maps in Supplementary Table 7.

As genomic medicine advances towards diagnostic screening at sequence resolution, computational methods for variant discovery from WGS and population references for interpretation will become indispensable. One category of disease-associated SVs, recurrent CNVs mediated by homologous segmental duplications known as genomic disorders, is particularly important because these CNVs collectively represent a common cause of developmental disorders 37 . Accurate detection of large, repeat-mediated CNVs is thus crucial for WGS-based diagnostic testing as chromosomal microarray is the recommended first-tier diagnostic screen at present for unexplained developmental disorders 37 . Using gnomAD-SV, we evaluated our ability to detect genomic disorders in WGS data by calculating CNV carrier frequencies for 49 genomic disorders across 10,047 unrelated samples with no known neuropsychiatric disease. These carrier frequencies were consistent with those reported from chromosomal microarray in the UK Biobank 38 (R² = 0.669; Pearson correlation test, P = 7.38 × 10⁻¹³) (Fig. 6a, Supplementary Table 8, Supplementary Fig. 20). The frequencies of carriers of genomic disorders did not vary significantly among populations, with the exception of duplications of NPHP1 at 2q13, in which carrier frequencies in East Asian samples were up to 4.6-fold higher than in other populations, further highlighting the potential for variant interpretation to be confounded by the limited diversity of existing SV references (Supplementary Fig. 21).
In the context of variant interpretation, the current gnomAD-SV resource will permit a screening threshold of allele frequencies less than 0.1% when matching on ancestry to the populations sampled here, and allele frequencies less than 0.004% globally. In the current release, we catalogued at least one pLoF or copy-gain variant for 36.9% and 23.7% of all autosomal genes, respectively, and 490 genes with at least one homozygous pLoF SV (Fig. 6b, Extended Data Fig. 6e, Supplementary Fig. 22). We also benchmarked carrier rates for several categories of clinically relevant variants in gnomAD-SV. First, 0.32% of samples carried a very rare (allele frequency < 0.1%) SV resulting in pLoF of a gene for which incidental findings are clinically actionable, nearly half of which (that is, 0.13% of all samples) would meet diagnostic criteria as pathogenic or likely pathogenic based upon the American College of Medical Genetics (ACMG) recommendations 7 (Fig. 6c). Second, 7.22% of individuals were heterozygous carriers of rare pLoF SVs in known recessive developmental disorder genes 39 . Third, we estimated that 3.8% of the general population (95% confidence interval of 3.2-4.6%) carries at least one very large (≥1 Mb) rare autosomal SV, roughly half of which (45.2%) were balanced or complex (Fig. 6d). Among these was an example of localized chromosome shattering involving at least 49 breakpoints, yet resulting in largely balanced products, reminiscent of chromothripsis, in an adult with no known severe disease or DNA repair defect 13,14,22 (Fig. 6e, Extended Data Fig. 8). Collectively, these analyses highlight the potential of gnomAD-SV and WGS-based SV methods to augment disease-association studies and clinical interpretation across a broad spectrum of variant classes and study designs.
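The screening thresholds quoted at the start of this paragraph can be related to cohort size. One plausible reading, which is our assumption rather than a stated derivation, is that the global figure is the smallest non-zero allele frequency observable in the cohort: one allele among 2N chromosomes from N unrelated diploid genomes. The sketch below uses the 12,653 unrelated genomes reported earlier; the function name is illustrative.

```python
def min_nonzero_allele_frequency(n_diploid_samples: int) -> float:
    """Frequency of a single allele copy observed among 2N autosomal chromosomes."""
    if n_diploid_samples < 1:
        raise ValueError("need at least one sample")
    return 1.0 / (2 * n_diploid_samples)


# Global resolution for the 12,653 unrelated genomes described in the text.
global_resolution = min_nonzero_allele_frequency(12_653)
print(f"{global_resolution:.5%}")  # close to the 0.004% quoted in the text

# Under the same reading, resolving a 0.1% allele frequency needs about
# 500 unrelated genomes from the matching population.
print(min_nonzero_allele_frequency(500))
```

A variant absent from a reference of this size is therefore only bounded, not proven, to be rarer than the threshold; sampling error around a single observed allele remains large.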

Discussion
Human genetic research and clinical diagnostics are becoming increasingly invested in capturing the complete landscape of variation in individual genomes. Ambitious international initiatives to generate short-read WGS in many thousands of individuals from common disease cohorts have underwritten this goal 40,41 , and millions of genomes will be sequenced in the coming years from national biobanks 42,43 . A central challenge to these efforts will be the uniform analysis and interpretation of all variation accessible to WGS, particularly SVs, which are frequently invoked as a source of added value offered by WGS. Indeed, early WGS studies in cardiovascular disease and autism have been largely consistent in their analyses of short variants, but every study has differed in its analysis of SVs [18][19][20]40,41 . Thus, while ExAC and gnomAD have prompted remarkable advances in medical and population genetics for short variants, the same gains have not yet been realized for SVs. Although gnomAD-SV is not exhaustively comprehensive, it was derived from WGS methods and a reference genome that match those currently used in many research and clinical settings, which will help to facilitate the eventual standardization of SV discovery, analysis, and interpretation across studies.
Most foundational assumptions about human genetic variation were consistent between SVs and short variants in gnomAD, most notably that SVs segregate stably on haplotypes in the population and experience selection commensurate with their predicted biological consequences. This study also spotlights unique aspects of SVs, such as their remarkable mutational diversity, their varied functional effects on coding sequence, and the intense selection against large and complex SVs. Our analyses also demonstrate that gene-altering effects of SVs beyond pLoF are remarkably similar to the mutational constraints of SNVs, and that SNV constraint metrics are not specific to haploinsufficiency but underlie a general intolerance to alterations of both gene dosage and structure. Beyond genes, we uncovered widespread but modest selection against noncoding dosage alterations of many families of cis-regulatory elements. This study represents one of the largest empirical assessments of noncoding dosage sensitivity in humans, and underscores that: (1) few, if any, classes of noncoding cis-regulatory variants are likely to experience selection as strong as protein-truncating variants; (2) sequence conservation is unsurprisingly one of the strongest features associated with selection against noncoding SVs; and (3) current WGS sample sizes are vastly underpowered to identify individual constrained functional elements in the noncoding genome.
Figure legend (partial): ... 38 . Light bars indicate binomial 95% confidence intervals. Solid grey line represents linear best fit. b, At least one pLoF or copy-gain SV was detected in 36.9% and 23.7% of all autosomal genes, respectively. 'Constrained' and 'unconstrained' includes the least and most constrained 15% of all genes based on LOEUF 4 , respectively. c, Carrier rates for very rare (allele frequency < 0.1%) pLoF SVs in medically relevant genes across several gene lists 7,39,44 . SVs per category listed in Supplementary Table 9. d, Carrier rates for very large (≥1 Mb) rare autosomal SVs among 12,653 genomes. Bars represent binomial 95% confidence intervals. e, A complex SV involving at least 49 breakpoints and seven chromosomes (also see Extended Data Fig. 8). Teal arrows indicate insertion point into chromosome 1.
The value of the multi-algorithm ensemble approach and deep WGS is evident in the improved sensitivity of SV detection in gnomAD-SV. However, short-read WGS remains limited by comparison to emerging long-read technologies 21 . Given that short-read WGS is blind to a disproportionate fraction of repeat-mediated SVs and small insertions by comparison to long-read methods, this study certainly underestimates the true mutation rates within such hypermutable regions. Similarly, although our approach involves extensive methods to resolve complex SV alleles, some variants such as high-copy-state MCNVs often involve complicated haplotype configurations, and we expect that emerging de novo assembly and graph-based genome representations will greatly expand our knowledge of such SVs 21,23 . Nonetheless, 92.7% of all known autosomal protein-coding nucleotides are not localized to simple- or low-copy repeats, and therefore we expect that the catalogues of SVs accessible to short-read WGS across large populations like gnomAD-SV will capture a majority of the most interpretable gene-disrupting SVs in humans.
The scale of short-read WGS datasets currently in production has magnified the need for publicly available SV resources, and gnomAD-SV represents an initial effort to fill this void. Although these data remain insufficient to derive accurate estimates of gene-level constraint, sequence-specific mutation rates, and intolerance to noncoding SVs, they provide a step towards these goals and reinforce the value of data sharing and harmonized analyses of aggregated genomic data sets. These data have been made available without restrictions on reuse (https://gnomad.broadinstitute.org), and this resource will catalyse new discoveries in basic research while providing immediate clinical utility for the interpretation of rare structural rearrangements across human populations.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41586-020-2287-8.

Data availability
All gnomAD-SV site-frequency data for appropriately consented samples (n = 10,847) have been distributed in VCF and BED format via the gnomAD browser (https://gnomad.broadinstitute.org/downloads/), as well as from NCBI dbVar under accession nstd166. Furthermore, these SVs have been integrated directly into the gnomAD browser 8 . The architecture of the gnomAD browser is described in the main gnomAD study 4 , as well as instructions for how to access and query the data hosted therein.
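As a sketch of how a site-frequency VCF of this kind might be used as a frequency filter, the example below keeps records whose allele frequency falls below a threshold. It assumes the INFO column carries `AF` and `SVTYPE` keys, which should be checked against the header of the downloaded file; the function names and the three example records are hypothetical and are not real gnomAD-SV entries.

```python
from typing import Iterable, Iterator, Optional


def parse_info(info_field: str) -> dict[str, str]:
    """Split a VCF INFO column into a key-to-value mapping (flags map to '')."""
    info: dict[str, str] = {}
    for item in info_field.split(";"):
        if item:
            key, _, value = item.partition("=")
            info[key] = value
    return info


def rare_svs(
    lines: Iterable[str], max_af: float = 0.01, svtype: Optional[str] = None
) -> Iterator[tuple[str, int, str, str, float]]:
    """Yield (chrom, pos, id, svtype, af) for records with AF below max_af."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        info = parse_info(fields[7])
        if "AF" not in info:
            continue
        af = float(info["AF"].split(",")[0])  # first ALT allele only
        if af >= max_af:
            continue
        if svtype is not None and info.get("SVTYPE") != svtype:
            continue
        yield fields[0], int(fields[1]), fields[2], info.get("SVTYPE", "."), af


# Hypothetical records for illustration only.
EXAMPLE_VCF = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t10000\tsv_example_1\tN\t<DEL>\t500\tPASS\tEND=10500;SVTYPE=DEL;AF=0.0004",
    "1\t20000\tsv_example_2\tN\t<DEL>\t700\tPASS\tEND=29000;SVTYPE=DEL;AF=0.25",
    "2\t30000\tsv_example_3\tN\t<DUP>\t650\tPASS\tEND=45000;SVTYPE=DUP;AF=0.002",
]

for record in rare_svs(EXAMPLE_VCF, max_af=0.01):
    print(record)
```

For a compressed download, the same generator can be fed from `gzip.open(path, "rt")` in place of the in-memory list.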

Code availability
The gnomAD-SV discovery pipeline is publicly available via a series of methods configured for the FireCloud/Terra platform (https://portal.firecloud.org/#methods) under the methods namespace 'Talkowski-SV'. The svtk software package used extensively in the gnomAD-SV discovery pipeline is publicly available via GitHub (https://github.com/talkowski-lab/svtk). Most custom scripts used in the production and/or analysis of the gnomAD-SV dataset are publicly available via GitHub (https://github.com/talkowski-lab/gnomad-sv-pipeline). All code is made available under the MIT license, unless stated otherwise.

Figure legend (partial): ... 4 . For this analysis, intronic, promoter and UTR SVs were required to have precise breakpoints (that is, have 'split-read' support) to protect against any cryptic overlap with coding sequence unable to be annotated due to imprecise breakpoints. For c, d and f, points and vertical bars represent 95% confidence intervals from 100-fold bootstrapping, respectively. Counts of SVs per category in c and d are provided in Supplementary Table 9. For d and f, deletions in highly repetitive or low-complexity sequence (≥30% coverage by annotated segmental duplications or simple repeats) were excluded.

Extended Data Fig. 7 | gnomAD-SV can augment disease association studies.
a, Functional enrichments of 2,307 common SVs in strong linkage disequilibrium (R² ≥ 0.8) with an SNV associated with a trait or disease in the GWAS catalogue or the UK Biobank 33,34 . Points represent odds ratios of SVs being in strong linkage disequilibrium with at least one GWAS-significant SNV among all SVs in strong linkage disequilibrium with at least one SNV (total n = 15,634 SVs). Single and triple asterisks correspond to nominal (P < 0.05) and Bonferroni-corrected (P < 0.0083) significance thresholds from a two-sided Fisher's exact test, respectively. Bars represent 95% confidence intervals. Test statistics, SV counts, and P values are provided in Supplementary Table 6.
b, Example locus at 16q22.1, where we identified a 336-bp deletion in strong linkage disequilibrium with SNVs significantly associated with hypothyroidism in the UK Biobank 34 . Top, the GWAS signal among genotyped SNVs in the UK Biobank, coloured by strength of linkage disequilibrium (Pearson's R² value) with the 336-bp deletion identified in gnomAD-SV. Bottom, the local genomic context of this deletion, which overlaps an annotated intronic Alu element near (<1 kb) the first exon of a highly constrained, thyroid-expressed gene, ATP6V0D1. The deletion lies amidst histone mark peaks commonly found at active enhancers (H3K27ac and H3K4me1) based on publicly available chromatin data from adult thyroid samples, a phenotype-relevant tissue 48 . Human Alu elements are known to frequently act as enhancers, and the sentinel hypothyroidism SNV from the UK Biobank GWAS is a significant expression-modifying variant (that is, eQTL) for ATP6V0D1 and other nearby genes across many tissues, which indicates that the hypothyroidism risk haplotype modifies expression of ATP6V0D1 and/or other genes, potentially through the deletion of an intronic enhancer 4,49 .

Extended Data Fig. 8 | An extremely complex SV involving 49 breakpoints and seven chromosomes.
A highly complex insertion rearrangement from gnomAD-SV in which 47 segments from six different chromosomes were duplicated and inserted into a single locus on chromosome 1, forming a 626,065 bp stretch of contiguous inserted sequence composed of shattered fragments. Given the involvement of multiple chromosomes, the signature of localized shattering, and the clustered breakpoints, we note that this rearrangement has several hallmarks of germline chromothripsis, which has been observed in healthy adults previously, albeit rarely 22 . However, unlike previous reports of germline chromothripsis, there are no apparent whole-chromosome translocations, and all segments were duplicated before being inserted in a compound manner into chromosome 1, potentially suggesting a replication-based repair mechanism. The exact origin of this rearrangement is unclear. a, Circos representation of all 49 breakpoints and seven chromosomes involved in this SV. Teal arrows indicate insertion point into chromosome 1. b, The median segment size was 8.4 kb. c, Linear representation of the rearranged inserted sequence. Colours correspond to chromosome of origin, and arrows indicate strandedness of the inserted sequence, relative to the GRCh37 reference.

Reporting Summary
Last updated by author(s): Feb 27, 2020
Nature Research wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency in reporting. For further information on Nature Research policies, see Authors & Referees and the Editorial Policy Checklist.


Software and code
Policy information about availability of computer code

Data collection
All software used for data collection is described in the Methods. All software is publicly available, and all custom software developed in this study has been released under the MIT license. Details on code access are provided in Supplementary Information.

Data analysis
All software used for data analysis is described in the Methods. All software is publicly available, and all custom software developed in this study has been released under the MIT license. Details on code access are provided in Supplementary Information.

Data
Policy information about availability of data
The entire gnomAD-SV reference map, from which all conclusions in the study are drawn, has been made publicly accessible through multiple sources. See the Data Availability Statement in the Supplementary Information for more instructions on data access.