An open resource of structural variation for medical and population genetics

Structural variants (SVs) rearrange the linear and three-dimensional organization of the genome, which can have profound consequences in evolution, diversity, and disease. As national biobanks, human disease association studies, and clinical genetic testing are increasingly reliant on whole-genome sequencing, population references for small variants (i.e., SNVs & indels) in protein-coding genes, such as the Genome Aggregation Database (gnomAD), have become integral for the evaluation and interpretation of genomic variation. However, no comparable large-scale reference maps for SVs exist to date. Here, we constructed a reference atlas of SVs from deep whole-genome sequencing (WGS) of 14,891 individuals across diverse global populations (54% non-European) as a component of gnomAD. We discovered a rich landscape of 498,257 unique SVs, including 5,729 multi-breakpoint complex SVs across 13 mutational subclasses, and examples of localized chromosome shattering, like chromothripsis, in the general population. The mutation rates and densities of SVs were non-uniform across chromosomes and SV classes. We discovered strong correlations between constraint against predicted loss-of-function (pLoF) SNVs and rare SVs that both disrupt and duplicate protein-coding genes, suggesting that existing per-gene metrics of pLoF SNV constraint do not simply reflect haploinsufficiency, but appear to capture a gene’s general sensitivity to dosage alterations. The average genome in gnomAD-SV harbored 8,202 SVs, and approximately eight genes altered by rare SVs. When incorporating these data with pLoF SNVs, we estimate that SVs comprise at least 25% of all rare pLoF events per genome. We observed large (≥1Mb), rare SVs in 3.1% of genomes (∼1:32 individuals), and a clinically reportable pathogenic incidental finding from SVs in 0.24% of genomes (∼1:417 individuals). 
We also estimated the prevalence of previously reported pathogenic recurrent CNVs associated with genomic disorders, which highlighted differences in frequencies across populations and confirmed that WGS-based analyses can readily recapitulate these clinically important variants. In total, gnomAD-SV includes at least one CNV covering 57% of the genome, while the remaining 43% is significantly enriched for CNVs found in tumors and individuals with developmental disorders. However, current sample sizes remain markedly underpowered to establish estimates of SV constraint on the level of individual genes or noncoding loci. The gnomAD-SV resources have been integrated into the gnomAD browser (https://gnomad.broadinstitute.org), where users can freely explore this dataset without restrictions on reuse, which will have broad utility in population genetics, disease association, and diagnostic screening.
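The "1 in N" figures quoted in the abstract are reciprocals of the reported percentages. The sketch below reproduces that arithmetic; the helper name `one_in_n` is hypothetical and is introduced only for illustration.

```python
def one_in_n(percent: float) -> int:
    """Convert a frequency given in percent to the N of a '1 in N' ratio."""
    if percent <= 0:
        raise ValueError("percent must be positive")
    # 100 / percent is the reciprocal of the fraction; round to the nearest individual.
    return round(100 / percent)


# Percentages taken from the abstract; the labels are shorthand for this example.
for label, pct in [("rare SV >= 1 Mb", 3.1), ("pathogenic incidental SV", 0.24)]:
    print(f"{label}: {pct}% of genomes is roughly 1 in {one_in_n(pct)}")
```

With the abstract's values this gives 1 in 32 for large rare SVs and 1 in 417 for clinically reportable incidental findings.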


INTRODUCTION
Structural variants (SVs) are genomic rearrangements that alter segments of DNA ≥50 bp. By virtue of their size and abundance, 1 SVs represent an important mutational force shaping genome evolution and function, 2,3 and a significant contributor to germline and somatic disease. [4][5][6] The profound impact of SVs is partially attributable to the varied mechanisms by which intra- and inter-chromosomal rearrangements can alter linear and three-dimensional genome structure, which can disrupt protein-coding sequences and/or cis-regulatory architecture. 5,[7][8][9] Genomic rearrangements can be grouped into distinct mutational classes, including "unbalanced" SVs associated with gains or losses of DNA (e.g., copy-number variants [CNVs]), and "balanced" SVs that rearrange genomic segments without corresponding dosage alterations (e.g., inversions & translocations) (Figure 1a). 10 Other common forms of SVs include mobile elements that insert themselves throughout the genome, 11 and multiallelic CNVs (MCNVs) that exist at high copy states. 12 Beyond these canonical classes, more exotic species of complex SVs exist in all individuals. 13,14 These variants do not conform to a single canonical class, and instead involve two or more SV signatures from a single mutational event interleaved within the same allele. Complex SVs can range from CNV-flanked inversions (e.g., dupINVdup) to rare instances of localized chromosome shattering, such as chromothripsis. 8,15 The variant spectrum of germline SVs in all humans is therefore diverse, as is their influence on genome structure and function.
While SVs alter more nucleotides per genome than single nucleotide variants (SNVs) and small insertion/deletion variants (indels; <50 bp), 1 surprisingly little is known about their mutational spectra, patterns of natural selection, and functional impact on a global scale. The paucity of population-scale characterization of SVs is primarily attributable to the technical challenges of their ascertainment and the limited availability of whole-genome sequencing (WGS) datasets. Whereas gold-standard methods for profiling SNVs and indels are well-established, such as with the Genome Analysis Toolkit (GATK), 16 the uniform detection of SVs from short-read WGS has presented a much greater challenge. Analyses of SVs require specialized computational methods that simultaneously consider multiple SV signatures, and even high-coverage short-read WGS fails to capture a significant component of the variant spectrum accessible to more expensive niche data types such as long-read WGS, optical mapping, or strand-specific sequencing. 17 Current population references of SVs from WGS are thus restricted to the 1000 Genomes Project (N=2,504; 4-8X sequence coverage) or smaller European-centric cohorts. 1,18 This stands in stark contrast to references for coding SNVs from resources such as the Exome Aggregation Consortium (ExAC), 19 and its second iteration, the Genome Aggregation Database (gnomAD), which have jointly analyzed data from >140,000 individuals. 20 These references have transformed most aspects of medical and population genetics research, including the definition of genes constrained against predicted loss-of-function (pLoF) variation, 19,21 and have become integral in the clinical interpretation of small coding variants. 22 Therefore, as short-read WGS becomes the prevailing platform for large-scale human disease studies, and is likely to eventually displace conventional technologies in diagnostic screening, there is a critical need for similar resources of SVs across diverse global populations.
In this study, we developed gnomAD-SV, a reference atlas of SVs from deep WGS in ~15,000 samples aggregated as part of gnomAD. Our analyses reveal diverse mutational patterns among SVs, and principles of strong selection acting against reciprocal dosage changes. We also find that SVs contribute approximately 25% of all rare pLoF events currently accessible to short-read WGS in each genome, and that 0.24% of individuals in the general population harbor a clinically reportable, likely pathogenic incidental finding from SVs. These reference maps have been directly incorporated into the gnomAD browser (http://gnomad.broadinstitute.org), which can be mined for new insights into genome biology and will provide an openly accessible resource for interpretation of SVs in diagnostic screening.

SV discovery & genotyping
We analyzed 14,891 samples in gnomAD-SV, of which 14,216 (95.5%) passed all data quality thresholds (Supplementary Table 1 and Supplementary Figure 1). Samples were aggregated across population genetic and complex disease association studies, and the samples in gnomAD-SV represent a subset of the overall gnomAD project (see Supplementary Table 2). 20 All samples were previously aligned to the GRCh37/hg19 human reference assembly. This gnomAD-SV reference included 45.6% European (N=6,484), 34.7% African/African-American (N=4,937), 9.2% East Asian (N=1,304), and 7.8% Latino (N=1,109) samples, as well as 2.7% samples from admixed or other populations (N=382; Figure 1b). We discovered and genotyped SVs using a cloud-based version of a multi-algorithm pipeline for Illumina short-read WGS, which has been previously described in a disease association study of autism spectrum disorder (ASD) in 519 quartet families, where molecular assays yielded a 97% validation rate for predicted de novo SVs (Supplementary Figure 2). 23 In brief, this pipeline integrates four orthogonal signatures of SVs to delineate variants across the size and allele frequency (AF) spectrum accessible to short-read WGS, including six classes of canonical SVs (Figure 1a; deletions, duplications, MCNVs, inversions, insertions, translocations) and 13 subclasses of complex SVs (Figure 2). 14 We augmented these methods with approaches to account for the technical heterogeneity of aggregated WGS datasets (Extended Data Figure 1 and Supplementary Figures 3-4). In total, these methods discovered 498,257 distinct SVs (Figure 1c and Supplementary Table 3). Following family-based analyses from 966 parent-child trios included for quality assessment (e.g., de novo rates), we pruned all first-degree relatives from further analyses, retaining a total of 12,549 unrelated genomes.
Analyses of SVs from short-read WGS also produce thousands of incompletely resolved non-reference breakpoint junctions per genome, sometimes referred to as 'breakends' (BNDs; Figure 1a), which can be valuable to document as deviations from reference sequence, but lack interpretable alternate allele structures for biological annotation. Given that these BNDs substantially inflated our variant counts (16.3% of all SVs detected), were enriched in false positives (Extended Data Figure 2a), 23 and cannot be interpreted for functional impact, we removed them from our final dataset. All analyses were thus performed on 382,460 unique, completely resolved SVs from 12,549 unrelated genomes (Supplementary Table 3).
While the number of SVs per genome in gnomAD-SV using the integration of multiple algorithms (n=8,202) is a marked increase from publicly accessible references from short-read WGS, such as the 1000 Genomes Project (3,441 SVs per genome from ~7X coverage WGS) and the GTEx project (3,658 SVs per genome from ~50X coverage WGS), 1,24 it is far lower than estimates of the total SVs per genome from recent long-read WGS analyses (24,825 per genome from 40X long-read coverage). 17 In the absence of gold-standard SV benchmarking methods, we evaluated the technical qualities of the gnomAD-SV callset using five orthogonal approaches summarized in Extended Data Figure 2, Supplementary Figures 5-7, and Supplementary Table 4. Briefly, we assessed Mendelian inheritance in 966 parent-child trios (2,898 genomes). Almost all SVs that violate Mendelian transmission patterns represent algorithmic false positives or false negatives in the child and/or parents, and thus provide a proxy for the performance of SV detection and genotyping accuracy. Here, we observed an average Mendelian violation rate of 4.2% per trio (Extended Data Figure 2a). We found 97.8% sensitivity to detect large CNVs (>40 kb) previously reported from microarrays in 1,893 individuals. 25 As another proxy for genotyping accuracy, we calculated that 87% of SVs across all populations were in Hardy-Weinberg Equilibrium, although this is an imperfect metric given the potential confounding assumptions and population genetic forces that may not hold true for all SV sites (Extended Data Figure 2b). We also leveraged Pacific Biosciences long-read WGS 17 in four individuals and found long-read support for up to 88.1% of SVs predicted from short-read WGS. The AFs from gnomAD-SV were correlated with variants also observed in the 1000 Genomes Project (R²=0.67; Extended Data Figure 2c and Supplementary Figures 6-7), 1 though 87% of SVs in gnomAD-SV were novel compared to the 1000 Genomes Project, reflecting the increase in scale and sensitivity of the current dataset.

Insights into population genetics & genome biology
The properties of SVs in the gnomAD-SV dataset followed expectations from human demographic history, 26 with the top two principal components projecting samples onto well-established axes according to population structure (Figure 1d). African/African-American samples exhibited the greatest genetic diversity (median 9,177 SVs per genome compared to 7,888 per non-African genome) (Figure 1e), and East Asian genomes featured the highest levels of homozygosity (median 1,582 homozygous SVs per East Asian genome compared to 1,475 per non-East Asian genome) (Extended Data Figure 3a-d).
Most SVs were small (median SV size=374 bp; Figure 1f) and rare (AF<1%; 92% of SVs; Figure 1g). Nearly half of all SVs (46.4%) were singletons (i.e., only one allele observed across all samples), and the singleton proportion varied by SV class and was strongly dependent on SV size (Figure 1h and Extended Data Figure 4). We completely resolved 5,729 complex SVs across 13 mutational subclasses, of which 4,341 (75.8%) involved inverted segments (Figure 2), confirming prior predictions that most inversion variation accessible to short-read WGS is composed of complex SVs rather than canonical inversions. 1,27 Among canonical SVs, deletions were collectively rarer than other classes (P < 1×10⁻¹⁰⁰; one-sided Wilcoxon Test; Supplementary Figure 8). However, complex SVs were rarer than all canonical classes, including deletions (P < 1×10⁻¹⁰⁰; one-sided Wilcoxon Test), suggesting that purifying selection on SVs is likely strongest against loss of genomic content and extensive structural rearrangement.
Mutation rates for SVs have remained difficult to estimate due to technical limitations of SV discovery from WGS, and the frequent use of cell line-derived DNA rather than whole blood in population studies. 1 Using the Watterson Estimator, 28 we projected a mean mutation rate of 0.35 de novo SVs per generation in regions of the genome accessible to short-read WGS (95% confidence interval: 0.18-0.52 SV/generation), or roughly one new SV every 2-6 live births, with mutation rates varying markedly by SV class (Figure 3a). While this method estimates mutation rates from variation aggregated across unrelated individuals, we previously demonstrated comparable rates from molecularly validated de novo SVs in WGS analyses of 519 quartet families. 23 However, our calculations certainly underestimate the true mutation rates for SVs given the reduced sensitivity of short-read WGS in repetitive and low-complexity sequences that can mediate their formation. 29 We anticipate that emerging long-read WGS and assembly methods will greatly increase future estimates of SV mutation rates and clarify their associated mechanisms. Despite the limitations of short-read WGS in repetitive sequence, it was notable that the density of SVs in this study was significantly enriched near centromeres and telomeres (Figure 3b and Supplementary Figure 9). This trend was strongly dependent on SV class: biallelic deletions and duplications were predominantly enriched at telomeres, whereas MCNVs were preferentially enriched near centromeres (Figure 3c-d). Conversely, inversions and complex SVs were apparently depleted in telomeres, although these variants might be more susceptible to false negatives than CNVs due to local repeat structures. These analyses indicate that the processes influencing SV mutation rates and mechanisms of formation vary by SV class and chromosomal context.

Constraint against SVs in protein-coding genes
By virtue of their size and mutational diversity, SVs can have varied consequences on protein-coding sequence (Figure 4a and Supplementary Figure 10). All classes of SVs can result in pLoF, either by deletion of coding nucleotides or alteration of open-reading frames, and many

SVs can duplicate or invert coding and noncoding loci. 30 Coding duplications can result in copy-gain of entire genes (CG) or duplication of a subset of exons contained within a gene, referred to here as intragenic exonic duplication (IED). The average genome in gnomAD-SV harbored 253 genes altered by biallelic SVs (199 pLoF, 18 IED, and 36 CG), of which 24 were predicted to be completely inactivated by homozygous biallelic pLoF SVs (Figure 4b and Extended Data Figure 2e-h). When restricted to rare (AF<1%) SVs, the average genome harbored 8 altered genes (5 pLoF, 1 IED, and 2 CG), effectively all of which were in the heterozygous state. By comparison, prior analyses estimated 120 pLoF SNV/indels per genome, of which 18 were rare, 19 suggesting that up to 25% of all rare pLoF events per genome are likely to result from SVs. We found signals of pervasive selection, such as the proportion of singleton variants, 20 against all classes of SVs that overlap genes, including intronic SVs and pLoF SVs as small as single-exon deletions (Figure 4c and Extended Data Figure 5a-d).
While further methods development will continue to refine these annotations, these data suggest that SVs represent a substantial fraction of all gene-altering variants per genome.
Metrics that quantitatively estimate the strength of selection on functional variation per gene, such as the probability of LoF intolerance (pLI), have become a core resource in human genetics. 19,21 No comparable metrics exist for SVs due to small variant counts by comparison to SNVs. To gain some insight into this problem, we estimated the number of rare SVs expected per gene while adjusting for gene length, exon-intron structure, and genomic context (see Methods). This model is imperfect, as expectations can be influenced by many known and unknown covariates, and SVs are too sparse to derive precise gene-level estimates of SV constraint at current sample sizes. Nevertheless, the results from this model displayed several clear and informative patterns. We found strong concordance between pLoF constraint metrics from gnomAD exome analyses and the depletion of rare pLoF SVs across 100 bins of 175 genes each, ordered by SNV constraint (Figure 4d; Spearman's rho=0.89). 20 This result was also true of missense constraint, as expected given the strong correlation of missense and pLoF constraint (Supplementary Figure 11). We also discovered a comparable positive correlation between CG from rare SVs and pLoF constraint from SNVs (rho=0.80). A weaker, yet significant correlation was detected for IED as well (rho=0.58). By contrast, there was no correlation between pLoF constraint and rare inversions of entire genes without directly disrupting their open reading frames (rho=-0.16), despite canonical and complex inversions appearing under particularly strong selection based on other metrics, such as the proportion of singleton SVs. Intriguingly, we found evidence for strong selection against noncoding inversions involving two or more recombination hotspots, which might suggest that large inversions influence meiotic mechanics in the general population (Extended Data Figure 6). When we cross-examined these relationships by using variant frequency distributions as a proxy for the strength of selection, we found the expected trend of an inverse correlation between proportion of singleton SVs and SNV constraint across all functional categories of SVs (Extended Data Figure 5f). These comparisons confirm that selection against multiple classes of gene-altering SVs is consistent with patterns observed for SNVs and indels. They further suggest that constraint metrics like pLI, which are derived from pLoF point mutations alone, underlie a general correspondence between haploinsufficiency and triplosensitivity, on average, for a large fraction of genes in the genome. Furthermore, these results imply that many highly constrained genes are not simply sensitive to pLoF, but intolerant to increased dosage and structural alterations more broadly.

Figure 2 | Complex SVs are abundant in the human genome
We discovered and fully resolved 5

Figure 3 | Genome-wide mutational patterns of SVs
(a) We estimated the mutation rate (µ) for each SV class using the Watterson estimator, 28

Relevance to disease association & clinical genetics
Most large-scale disease association studies of SVs have relied upon chromosomal microarrays (CMA), which are limited to detection of large CNVs and have not had reliable reference resources to restrict analyses to ultra-rare variants. 31 We evaluated gnomAD-SV as a filtering tool for previously published CMA-based association studies (N>10,000 samples) that have identified a significant contribution of large CNVs to developmental disorders (DDs), 32 ASD, 25 schizophrenia, 33 and cancer (Extended Data Figure 7). 34 Filtering based on gnomAD-SV AFs magnified the previously reported associations of rare genic CNVs in DDs, ASD, and cancer, with less pronounced differences between schizophrenia cases and controls at ultra-rare AFs, consistent with expectations from genetic architecture studies.
We next considered previously defined recurrent CNVs associated with syndromic phenotypes, or genomic disorders (GD), which are often mediated by recombination of long flanking segments of homologous sequences. 35 These GDs are among the most prevalent genetic causes of DDs, 36 and accordingly CMA remains the recommended first-tier genetic diagnostic screen for DDs of unknown etiology. 31 Thus, it is critical that these GD CNVs can be reliably captured from WGS for both routine clinical screening and studies of developmental and neuropsychiatric disease. Here, we calculated sequence-resolution CNV carrier frequencies in gnomAD for 49 GDs recently reported in the UK BioBank (UKBB), and found consistent carrier frequency estimates between WGS in gnomAD-SV and those reported by CMA in the UKBB (R²=0.69; P=1.22×10⁻¹³; Pearson correlation test; Figure 5a), 37 further confirming the accuracy of read depth-based discovery of large repeat-mediated CNVs from WGS. GD carrier frequencies did not vary dramatically between populations in gnomAD-SV, with the exception of a single GD (duplications of NPHP1 at 2q13), where carrier frequencies in East Asian samples were 2.5-to-4.9-fold higher than other populations (Figure 5b). This finding underscores the value of characterizing putatively disease-associated SVs across diverse populations. Finally, given the correlation of CNV frequencies between gnomAD-SV and the UKBB, we calculated the combined CNV frequencies from these resources, which were inversely correlated with previously reported odds ratios (ORs) (Figure 5c). 32 These data estimate that roughly 0.05% of the population (~1:2,000 individuals) is a carrier of a GD-associated CNV with an estimated OR > 14.0 (e.g., the top quartile of the 49 GDs), compared to 1.86% (~1:54 individuals) for GDs with an OR < 1.7 that represent relatively common polymorphic variants in the population (e.g., the bottom quartile) (Figure 5d).
As genomic medicine advances toward diagnostic screening at sequence resolution, publicly accessible WGS references will be indispensable for variant interpretation. The current gnomAD-SV dataset will permit a screening threshold of AF < 0.1% when matching on ancestry to the African, East Asian, European, or Latino populations sampled here, and AF < 0.004% when compared against all samples collectively. For instance, filtering all SVs found in an individual genome versus gnomAD-SV dramatically reduced the number of singleton SVs in that genome to a median of 13, just one of which was pLoF (Figure 6a and Extended Data Figure 3). This reference dataset also aids in gene-level interpretation, as we catalogued at least one SV resulting in pLoF or CG for 40.4% and 23.5% of all autosomal genes, respectively, and 586 genes with at least one homozygous pLoF SV (Figure 6b, Extended Data Figure 5e, and Supplementary Figure 12).
However, these data are still extremely sparse as compared to SNVs and indels, where analyses of the 120,000 gnomAD exomes have documented at least one pLoF SNV for 95.8% of all genes. 20 When further restricted to clinically relevant SVs using American College of Medical Genetics criteria, 38 we find that 0.4% of samples carry a very rare (AF<0.1%) SV resulting in pLoF of a gene for which incidental findings are clinically reportable, roughly half of which (i.e., 0.26% of all samples) likely meet ACMG diagnostic criteria as pathogenic or likely pathogenic (Figure 6c). We also observed that 9.5% of individuals were heterozygous carriers of rare pLoF SVs in known recessive DD genes. Finally, we used the gnomAD-SV dataset to catalog rare chromosomal abnormalities (SVs ≥ 1Mb). We estimate that 3.1% of the general population (95% CI: 2.5-3.9%) carries at least one rare autosomal SV ≥1Mb in size, roughly half of which are balanced or complex (Figure 6d). Among these events was an example of highly complex localized chromosome shattering involving at least 49 breakpoints yet resulting in largely balanced products, reminiscent of chromothripsis, which was identified in an adult individual from the general population with no indication of severe developmental or pediatric disease (Figure 6e and Extended Data Figure 8). 8,14,15

An online, interactive SV reference browser
A key aspect of ExAC and gnomAD for analyses of coding SNVs and indels was the open release of variant information via a user-friendly online interface. 19 Now in its second generation, the gnomAD browser (https://gnomad.broadinstitute.org) has been augmented to incorporate the gnomAD-SV callset described here. Users can query genes and regions to view all SVs, including their mutational class, frequency across populations, predicted genic effects, genotype quality, and other variant metadata (Extended Data Figure 9). These features are directly integrated into the same interface as the gnomAD SNV and indel callsets, where users can toggle between viewing SVs and smaller point mutations within the same window. Finally, all SVs described in this study are provided for download in two common file formats via the gnomAD browser, with no use restrictions on the reanalysis of these data.

DISCUSSION
The fields of human genetics research and clinical diagnostics are becoming increasingly invested in defining the complete spectrum of variation in individual genomes. Ambitious international initiatives to generate short-read WGS in hundreds of thousands of individuals from complex disease cohorts have underwritten this goal, 41-44 and millions of genomes from unselected individuals will be sequenced in the coming years from national biobanks. 45,46 A central challenge to these efforts will be the uniform analysis and interpretation of all variation accessible to short-read WGS, particularly SVs, which are frequently cited as a source of added value offered by WGS over conventional technologies. 47 Indeed, early efforts to deploy WGS in cardiovascular disease, ASD, and type 2 diabetes were largely consistent in their analyses of SNVs using GATK, but all studies have differed in their analyses of SVs. 23,36,[42][43][44]48,49 Thus, while ExAC and gnomAD have catalyzed remarkable advances in medical and population genetics, the same opportunities for new discovery and translational impact have not yet been realized for SVs. Although gnomAD-SV is by no means comprehensive, the half-million SVs it contains will begin to address the dearth of population SV datasets. Given that gnomAD-SV was constructed with contemporary WGS technologies and a reference genome that match those currently used in clinical settings, we anticipate that these data will augment disease

Figure 6 | gnomAD-SV as a resource for clinical WGS interpretation
(a) Filtering SVs against gnomAD-SV reduces individual genomes to ~13 singleton variants at current sample sizes. (b) At least one pLoF or CG SV was detected in 40.4% and 23.5% of all autosomal genes, respectively. "Constrained" and "unconstrained" includes the least and most constrained 15% of all genes based on pLoF SNV observed:expected ratios, respectively. 20 (c) Up to 1.3% of genomes in gnomAD-SV harbored a very rare (AF<0.1%) pLoF SV in a medically relevant gene across several gene lists. [38][39][40] Manual review of all very rare pLoF SVs indicated that 0.24% of genomes carry a pathogenic or likely pathogenic variant in a clinically reportable gene for incidental findings. 38 We also found that 9.5% of genomes carried pLoF SVs of recessive DD genes in the heterozygous state. 39  bioRxiv Preprint joint analyses of aggregated datasets by the field. The gnomAD-SV resource has been made available without restrictions on reuse, and has been integrated directly into the widely adopted gnomAD Browser (https://gnomad.broadinstitute.org), where users can freely view, sort, and filter the SV dataset described here. This resource will catalyze new discoveries in basic research and provide immediate clinical utility for the interpretation of rare structural rearrangements in the human genome.

METHODS & SUPPLEMENTARY INFO
There is supplementary information associated with this study, which includes detailed methods. These materials have been provided in a separate document, which will be linked directly from bioRxiv. MET) and the Simons Foundation for Autism Research Initiative (SFARI #573206 to MET). We are grateful to all of the families at the participating Simons Simplex Collection (SSC) sites, as well as the SSC principal investigators. RLC was supported by NHGRI T32HG002295 and NSF GRFP #2017240332. HB was supported by NIDCR K99DE026824. Most foundational assumptions of human genetic variation were consistent between SNVs/indels from the gnomAD exome study and SVs reported here, 20 most notably that SVs experience selection commensurate with their predicted biological consequences. This study also spotlights unique aspects of SVs, such as their remarkable mutational diversity, their varied functional impact on coding sequence, and the strong selection against large and complex SVs in the genome. We provide resolved structures for nearly six thousand such complex SVs, and predict that SVs comprise up to 25% of all rare pLoF variation in each genome. These analyses also demonstrate that gene-altering effects of SVs beyond pLoF parallel measures of mutational constraint derived from analyses of SNVs. Despite the strong correlation between SNV and SV constraint in this study, we made several assumptions that likely underestimate the true diversity of possible functional outcomes. For instance, we assigned any deletion of an exon from a canonical gene transcript as pLoF. There are technical and biological explanations for why that assumption will not universally hold, 3 yet the proportion of singleton SVs was nearly identical for partial or single exon deletions as for loss of a full copy of a gene (Extended Data Figure 5d). More sophisticated models of SV annotation will continue to refine future predictions of their biological impact. 
The patterns we observed for whole-gene copy gains (CG) and intragenic exonic duplications (IEDs) against pLoF constraint imply that existing SNV constraint metrics are not specific to depletion of pLoF variation, but rather underlie a more generalizable intolerance to alterations of both gene dosage and structure. Indeed, similar patterns of selection were observed for CG and pLoF SVs among the most constrained genes in the genome. Like complex SVs, IEDs are also an intriguing class of SVs that may operate in a context-dependent manner. Analogous to missense variation, IEDs can result in pLoF, neutral variation, or perhaps other effects, and thus represent an exciting area for future investigation. Finally, the strong selection against canonical and complex inversions despite no clear correspondence with existing gene constraint metrics is intriguing, and our analyses suggest that this may be related to large inversions blocking recombination through meiotic interference.
Technical barriers associated with short-read WGS preclude the establishment of a complete catalogue of SVs in gnomAD-SV. A recent study incorporating most extant genomics technologies demonstrated that short-read WGS is limited in low-complexity and repetitive sequence contexts. 17 The technology and methods relied upon here are thus blind to a disproportionate fraction of repeat-mediated SVs, and underestimate the true mutation rates within these hypermutable regions. Similarly, high copy state MCNVs often require specialized algorithms and manual curation to fully delineate their numerous haplotypes, 12,50,51 suggesting that the 1,053 MCNVs reported here comprise an incomplete portrait of extreme copy-number polymorphisms. We expect that emerging technologies, de novo assemblies, and graph-based genome representations are likely to expand our knowledge of SVs in repetitive sequences. 51,52 Nevertheless, based on current estimates, 92.7% of known autosomal protein-coding nucleotides are not localized to simple and low-copy repeats. This suggests that catalogues of SVs accessible to short-read WGS across large populations, like gnomAD-SV, will likely capture a majority of the most interpretable gene-disrupting SVs in humans.
The oncoming deluge of short-read WGS datasets has magnified the need for publicly available large-scale resources of SVs. In this study, we aimed to begin to bridge the gap between the existence of such references for SNVs/indels and those for SVs. While the dataset provided here significantly exceeds current references in terms of sample size and sensitivity, these data remain insufficient to derive accurate estimates of gene-level constraint, and are dramatically underpowered to explore sequence-specific mutation rates and intolerance to noncoding SVs. Nonetheless, these data provide an initial step toward these goals, and demonstrate the value of a commitment to open data sharing and

Extended Data Figure 2 | Benchmarking the technical qualities of the gnomAD-SV callset
We evaluated the quality of gnomAD-SV with five orthogonal analyses detailed in Supplementary Table 4 and Supplementary Figures 5-7.

Extended Data Figure 6 | Evidence of selection against possible meiotic interference caused by large noncoding inversions and CNVs
We tested the hypothesis that large SVs might cause meiotic interference by blocking recombination, using the proportion of singletons as a proxy for the strength of selection across categories of SVs. We compared SVs in this dataset against recombination hotspots, conditioned on whether each SV had any predicted direct effect on coding sequence. For inversions, deletions, and duplications, we found that rearrangements with no predicted genic effect that were also predicted to alter more than two recombination hotspots were under particularly strong selection, surpassing the degree of selection against SVs from the same class with direct coding effects. Although sample sizes were small, these analyses suggest that noncoding SVs may be selected against when predicted to disrupt multiple recombination hotspots, potentially due to mechanisms of meiotic interference.
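
As an illustration only, the singleton proportion used here as a proxy for selection can be sketched as the fraction of variants in a category that are observed exactly once. The function name, the category labels, and the toy records below are hypothetical and are not drawn from gnomAD-SV; the sketch also assumes each variant has already been assigned to a category and annotated with its allele count.

```python
from collections import defaultdict


def singleton_proportion(variants):
    """Return {category: fraction of variants with allele count == 1}.

    `variants` is an iterable of (category, allele_count) pairs.
    """
    totals = defaultdict(int)
    singletons = defaultdict(int)
    for category, allele_count in variants:
        totals[category] += 1
        if allele_count == 1:
            singletons[category] += 1
    return {cat: singletons[cat] / totals[cat] for cat in totals}


# Hypothetical toy records (category label, allele count); not real data.
toy = [
    ("noncoding, >2 hotspots", 1),
    ("noncoding, >2 hotspots", 1),
    ("noncoding, >2 hotspots", 3),
    ("coding", 1),
    ("coding", 5),
    ("coding", 12),
    ("coding", 2),
]

props = singleton_proportion(toy)
for category, proportion in props.items():
    print(f"{category}: {proportion:.2f}")
```

A higher singleton proportion in one category than another is read as stronger purifying selection, under the assumption that the categories are otherwise comparable (for example, in size and detection sensitivity).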

Extended Data Figure 7 | gnomAD-SV can augment disease association studies of SVs
(a) Filtering CNV calls from microarray disease association studies against gnomAD-SV can magnify reported signals of association between ultra-rare genic CNVs and various diseases, including DDs, 32 ASD, 25 and cancer. 34 Bars represent the total number of large (≥100kb), rare (frequency<0.1% in the original study) CNVs overlapping at least one protein-coding exon across all cases or controls after filtering versus gnomAD-SV. Asterisks correspond to P-value thresholds of 0.05, 0.005, and 0.0005, respectively. AF=max allele frequency in gnomAD-SV; AC=max allele count in gnomAD-SV; "Not. Obs"=not observed in gnomAD-SV. (b) Odds ratios and 95% confidence intervals corresponding to the filtering procedures used in (a).
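
A minimal sketch of the two steps described in this legend follows: filtering CNV calls against a reference frequency, then computing an odds ratio with a 95% confidence interval. The legend does not state how the interval was computed, so the Wald interval on the log scale used here is an assumption, as are the function names, the `ref_af` field, and the toy counts. The sketch also assumes each CNV has already been matched to the reference, and treats each sample simply as a carrier or non-carrier.

```python
import math


def filter_against_reference(cnvs, max_ref_af):
    """Keep CNVs not observed in the reference (ref_af is None)
    or observed there at a frequency below max_ref_af."""
    return [c for c in cnvs if c["ref_af"] is None or c["ref_af"] < max_ref_af]


def odds_ratio_wald(case_carriers, case_total, ctrl_carriers, ctrl_total, z=1.96):
    """Odds ratio with a Wald confidence interval on the log scale.

    z=1.96 gives an approximate 95% interval.
    """
    a = case_carriers
    b = case_total - case_carriers
    c = ctrl_carriers
    d = ctrl_total - ctrl_carriers
    if min(a, b, c, d) <= 0:
        raise ValueError("Wald interval needs all four cells to be positive")
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Hypothetical CNV calls annotated with a reference allele frequency.
toy_cnvs = [
    {"id": "cnv1", "ref_af": None},    # not observed in the reference
    {"id": "cnv2", "ref_af": 0.02},    # common in the reference, filtered out
    {"id": "cnv3", "ref_af": 0.0004},  # rare in the reference, kept
]
kept = filter_against_reference(toy_cnvs, max_ref_af=0.001)

# Hypothetical carrier counts after filtering: 20/1000 cases, 10/1000 controls.
or_, lo, hi = odds_ratio_wald(20, 1000, 10, 1000)
print([c["id"] for c in kept], round(or_, 2), round(lo, 2), round(hi, 2))
```

The Wald interval is undefined when any cell of the 2x2 table is zero, which is common for ultra-rare CNVs; an exact method such as Fisher's test may be more appropriate in that setting.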

Extended Data Figure 8 | An extremely complex SV involving 49 breakpoints and seven chromosomes
In the gnomAD-SV cohort, we identified one highly complex insertion rearrangement where 47 segments from six different chromosomes were duplicated and inserted into a single locus on chromosome 1, forming a 626,065 bp stretch of contiguous inserted sequence composed of shattered fragments. Given the involvement of multiple chromosomes, the signature of localized shattering, and the clustered breakpoints, we note that this rearrangement has several hallmarks of germline chromothripsis. 14