Landscape of multi-nucleotide variants in 125,748 human exomes and 15,708 genomes

Wang, Qingbo; Pierce-Hoffman, Emma; Cummings, Beryl B.; Alföldi, Jessica; Francioli, Laurent C.; Gauthier, Laura D.; Hill, Andrew J.; O’Donnell-Luria, Anne H.; Karczewski, Konrad J.; MacArthur, Daniel G.

doi:10.1038/s41467-019-12438-5

Article
Open access
Published: 27 May 2020

Nature Communications volume 11, Article number: 2539 (2020) Cite this article

An Author Correction to this article was published on 02 February 2021

This article has been updated

Abstract

Multi-nucleotide variants (MNVs), defined as two or more nearby variants existing on the same haplotype in an individual, are a clinically and biologically important class of genetic variation. However, existing tools typically do not accurately classify MNVs, and understanding of their mutational origins remains limited. Here, we systematically survey MNVs in 125,748 whole exomes and 15,708 whole genomes from the Genome Aggregation Database (gnomAD). We identify 1,792,248 MNVs across the genome with constituent variants falling within 2 bp distance of one another, including 18,756 variants with a novel combined effect on protein sequence. Finally, we estimate the relative impact of known mutational mechanisms - CpG deamination, replication error by polymerase zeta, and polymerase slippage at repeat junctions - on the generation of MNVs. Our results demonstrate the value of haplotype-aware variant annotation, and refine our understanding of genome-wide mutational mechanisms of MNVs.

Introduction

Multi-nucleotide variants (MNVs) are defined as clusters of two or more nearby variants existing on the same haplotype in an individual^1,2 (Fig. 1a). When variants in an MNV are found within the same codon, the overall impact may differ from the functional consequences of the individual variants³. For instance, the two variants depicted in Fig. 1b are each predicted individually to have missense consequences, but in combination result in a nonsense variant. Such cases, which would be missed by virtually all existing tools for clinical variant annotation, can result both in missed diagnoses and false positive pathogenic candidates in analyses of families affected by genetic diseases^1,2.

MNV identification tools^4,5,6,7,8 have been applied to databases of human genetic variation at varying scales, including 1000 Genomes⁹ Phase 3 (2504 individuals with high coverage exome and low coverage genome-sequencing data), and the Exome Aggregation Consortium¹ (60,706 individuals with high coverage exome data). Together, these analyses identified over 10,000 MNVs altering protein sequences, demonstrating the pervasive nature of MNV annotation in the population-level data. In addition, analysis of the 1000 Genomes data set highlighted differences in the frequencies of MNVs depending on sequence context¹⁰. In combination with yeast experiments^11,12,13, biological mechanisms that account for the enrichment of specific types of MNVs, such as DNA replication error by polymerase zeta, have been suggested.

Studies of newly occurring (de novo) MNVs have also been performed using trio data sets^2,14,15,16; analysis of 283 trios with whole-genome sequence data¹⁶ confirmed that MNV events occur much more frequently than expected by random chance. By focusing on noncoding regions, this study also highlighted potentially different mechanisms that dominate MNV generation depending on the genomic region and the distance between the two constitutive variants. As part of the Deciphering Developmental Disorders (DDD) study¹⁷, Kaplanis et al.² analyzed exome-sequence data from over 6000 trios to quantify the pathogenic impact of MNVs in developmental disorders, showing that such variants are substantially more likely to be deleterious than SNVs and further clarifying the mutational mechanisms that generate them. These analyses also have provided estimates of the germline MNV rate per generation, falling into a consistent range of 1–3% of the SNV rate. Although these studies have provided valuable information about the mutational origins and functional impact of MNVs, to date there has been no analysis that investigated MNVs across the entire genome (including noncoding regions) in many thousands of deeply sequenced individuals, limiting our understanding of the genome-wide profile and complete frequency distribution of this class of variation.

Here, we present the analysis of a large-scale collection of MNVs, along with clinical interpretation of MNVs from over 6000 sequenced individuals from rare disease families. We also provide gene-level statistics on MNVs and describe the distribution of MNVs by functional consequence and by gene-level constraint. Finally, to enhance our understanding of MNV mechanisms, we examine the distributions of MNVs stratified by more than ten different functional annotations across the human genome, as well as estimates of the genome-wide per-base frequencies of the dominant mutational processes generating MNVs.

Results

Read-based phasing for identification of MNVs

Identification of MNVs requires the constituent variants to be properly phased—that is, to be identified accurately as either both occurring on the same haplotype (in cis) or on two different haplotypes (in trans). Phasing can be performed following three broad strategies: read-based phasing¹⁸, which assesses whether nearby variants co-segregate on the same reads in DNA sequencing data; family-based phasing¹⁹, which assesses whether pairs of variants are co-inherited within families; and population-based phasing²⁰, which leverages haplotype sharing between members of a large genotyped population to make a statistical inference of phase. Read-based phasing is particularly effective for pairs of nearby variants, making it suitable for the analysis of MNVs.

For this project, we generated read-based phasing results for variants in the Genome Aggregation Database (gnomAD) v2.1 callset using GATK HaplotypeCaller²¹, yielding 125,748 human whole exomes and 15,708 genomes with local phase information; the properties of this callset are described in detail in an accompanying paper²². To assess phasing accuracy, we used 5785 family trios with exome-sequencing data and 635 family trios with whole-genome sequencing data that largely overlapped with the gnomAD 2.1 release data. We calculated the phasing sensitivity, defined as the fraction of heterozygous variant pairs that have read-based phase information assigned for both variants, and found that it was 87.9% for adjacent heterozygous variant pairs, reflecting the stringent haplotype-calling criteria of GATK²¹ (Supplementary Tables 1–3). We used Phase-By-Transmission (PBT)¹⁹, a family-based phasing method (Fig. 1c), to assess our phasing specificity, and found that over 99.8% of the MNVs identified with read-based phasing were consistent with the PBT trio-based phasing. The sensitivity and specificity of our read-based phasing remained high even when the two variants of the MNV were 10 bp apart (82.8% and 99.8%; Supplementary Fig. 1 and Supplementary Table 1). These results demonstrate high specificity and sensitivity for the detection of MNV events across the genome.
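The two accuracy measures described above can be illustrated with a minimal Python sketch. The record layout (one `(read_phase, trio_phase)` tuple per heterozygous variant pair, with `None` meaning that no phase was assigned) and the function names are assumptions made for illustration; they do not reproduce the gnomAD pipeline.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical record: (read_phase, trio_phase), each "cis", "trans", or None.
Pair = Tuple[Optional[str], Optional[str]]


def phasing_sensitivity(pairs: Iterable[Pair]) -> float:
    """Fraction of heterozygous pairs that received a read-based phase."""
    pairs = list(pairs)
    phased = sum(1 for read, _ in pairs if read is not None)
    return phased / len(pairs)


def cis_concordance(pairs: Iterable[Pair]) -> float:
    """Among read-based cis calls that trio phasing could resolve,
    the fraction that trio phasing also places in cis."""
    informative = [(r, t) for r, t in pairs if r == "cis" and t is not None]
    agree = sum(1 for _, t in informative if t == "cis")
    return agree / len(informative)


# Toy data (invented): four of five pairs are read-phased, and both
# trio-informative cis calls agree with the trio phase.
toy = [("cis", "cis"), ("cis", "cis"), ("trans", "trans"),
       (None, "cis"), ("cis", None)]
sensitivity = phasing_sensitivity(toy)
concordance = cis_concordance(toy)
```

Pairs for which trio phasing is uninformative are left out of the concordance denominator, since they can neither confirm nor contradict the read-based call.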

Functional impact of MNVs

In order to provide an overview of the functional impact of MNVs (Fig. 1b), we examined all phased high-quality SNV pairs (i.e., SNV pairs that pass stringent filtering criteria; see the Methods section) within 2 bp distance of each other across the 125,748 exome-sequenced individuals from our gnomAD 2.1 data set, resulting in the discovery of 31,575 MNVs that exist within the same codon. When the two variants comprising the MNV were considered together, the resulting functional impact on the protein differed from the independent impacts of the individual variants in ~60% of cases (18,756 MNVs; Fig. 2a; Supplementary Data 1). Among the differing annotations of functional consequence, 407 were gained nonsense (neither individual SNV was a nonsense mutation, but the resulting MNV is), and 1821 were rescued nonsense (at least one of the two individual SNVs would create a nonsense mutation, but the resulting MNV does not). Such categories of MNVs have a major impact on variant interpretation, and thus are critical for accurate variant annotation. There was an average of 55.2 variants with altered functional interpretation (including 0.062 gained and 4.42 rescued nonsense) due to MNVs per individual.
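The gained and rescued nonsense categories defined above can be illustrated with a short Python sketch that compares the consequence of each SNV alone with the consequence of the two SNVs combined in one codon. The function names, the 0-based codon positions, and the reduced label set are assumptions made for illustration; the example codons are invented and are not taken from Fig. 1b.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def consequence(ref_codon: str, alt_codon: str) -> str:
    """Coarse consequence of replacing ref_codon with alt_codon."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop_lost"
    return "missense"


def apply_snv(codon: str, pos: int, base: str) -> str:
    """Substitute one base at a 0-based position within the codon."""
    return codon[:pos] + base + codon[pos + 1:]


def classify_mnv(ref_codon, snv1, snv2):
    """snv1 and snv2 are (position, alt_base) tuples in the same codon.
    Returns (consequence of snv1, consequence of snv2, combined, label)."""
    c1 = consequence(ref_codon, apply_snv(ref_codon, *snv1))
    c2 = consequence(ref_codon, apply_snv(ref_codon, *snv2))
    both = apply_snv(apply_snv(ref_codon, *snv1), *snv2)
    combined = consequence(ref_codon, both)
    if combined == "nonsense" and "nonsense" not in (c1, c2):
        label = "gained_nonsense"
    elif combined != "nonsense" and "nonsense" in (c1, c2):
        label = "rescued_nonsense"
    else:
        label = "other"  # simplification: remaining categories not split out
    return c1, c2, combined, label


# GAT (Asp): G>T alone gives TAT (Tyr), T>A alone gives GAA (Glu),
# but together they give TAA (stop).
gained = classify_mnv("GAT", (0, "T"), (2, "A"))
# CAA (Gln): C>T alone gives TAA (stop), but with A>C the codon is TCA (Ser).
rescued = classify_mnv("CAA", (0, "T"), (1, "C"))
```

A haplotype-unaware annotator would report only the first two elements of each tuple, which is how the combined effect gets missed.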

To understand the overall impact of correctly annotating the functional consequence of MNVs in a population-level data set, we counted the number of gained/rescued nonsense mutations per gene in gnomAD (Fig. 2b; Supplementary Data 1). For rescued nonsense mutations, we found 1538 sites that are rescued in all the individuals with the component variants. A total of 1633 genes carried gained or rescued nonsense mutations within our data set, including 41 genes that are disease-relevant (reported by OMIM²³ or annotated as haploinsufficient by ClinGen^24,25). In addition, the proportion of rescued nonsense mutations falling in predicted loss-of-function (pLoF) constrained genes (genes with a significant depletion of pLoFs compared with an expectation based on a mutational model^1,26, defined as LOEUF²² decile <20%) was higher (proportion = 0.219) when compared with all the other classes of MNVs (proportion = 0.192; Fisher’s exact test, p = 0.0247; Fig. 2c; Supplementary Fig. 2). Conversely, gained nonsense mutations are depleted among constrained genes (proportion = 0.0620) compared with all other classes of MNVs (Fisher’s exact test, p = 1.01 × 10⁻¹¹). These results suggest a significant enrichment of LoF annotation errors in the absence of MNV annotation.

In addition, we have investigated another class of variant pairs whose combined interpretation can be highly different from either of the individual component variants: insertion/deletion (indel) pairs that result in frame restoration (e.g., 4 bp deletion + 7 bp insertion, resulting in 3 bp = 1 amino acid insertion), and have annotated such frame-restoring indel pairs (n = 1406) when separated by up to 30 bp (considering the limitations of read-based phasing; Supplementary Fig. 3). When we compared the LoF confidence of constituent indels, we found that the proportion of frame-restoring indel pairs falling on LoF-constrained genes was significantly higher when the constituent indels are high-confidence (HC) LoFs (proportion = 0.0262 for low-confidence, LC, and 0.167 for HC pairs; Fisher’s exact test, p = 1.66 × 10⁻⁷; Supplementary Fig. 3h), suggesting that frame-restoring indel pairs can also be a source of LoF annotation errors.
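The frame-restoration criterion can be written down as a short check. The sketch below is an illustration under stated assumptions: each indel is reduced to a signed length change (negative for deletions), the 30 bp limit is passed as a plain distance, and the function name is hypothetical.

```python
def is_frame_restoring(len_change_1: int, len_change_2: int,
                       distance_bp: int, max_distance_bp: int = 30) -> bool:
    """True if each indel alone shifts the reading frame but the pair,
    taken together, changes the coding length by a multiple of 3."""
    if distance_bp > max_distance_bp:
        return False
    each_shifts_frame = len_change_1 % 3 != 0 and len_change_2 % 3 != 0
    pair_in_frame = (len_change_1 + len_change_2) % 3 == 0
    return each_shifts_frame and pair_in_frame


# The example from the text: a 4 bp deletion plus a 7 bp insertion
# gives a net +3 bp change, i.e. a one-amino-acid insertion.
example = is_frame_restoring(-4, +7, distance_bp=12)
```

Pairs in which either indel is already in-frame on its own are excluded here, because they do not change a frameshift annotation.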

Finally, in order to understand the impact of these variants in clinical applications, we also annotated MNVs in 6072 sequenced individuals from rare disease families, including 4275 case samples. This resulted in 16 gained nonsense mutations and 110 changed missense MNVs with high CADD²⁷ scores and low frequencies in gnomAD (CADD >20 and <10 individuals in gnomAD; Supplementary Data 2). However, after close manual curation, none of the corresponding MNVs were definitively causal variants for the diseases affecting the family, suggesting that MNVs contribute to only a small fraction of total rare disease diagnoses, in line with expectations based on their relative rarity and previous results².

Genome-wide mutational mechanisms of MNVs

We next turned our attention to understanding the mutational mechanisms underlying the origins of MNVs genome-wide, focusing on whole-genome sequence data from 15,708 individuals in the gnomAD v2.1 callset. We considered pairs of high-quality variants in autosomes separated by up to 10 bp, resulting in the assembly of a catalogue of 5,513,219 MNVs including 1,792,248 MNVs within 2 bp distance—an order-of-magnitude increase in size over previous collections.

We considered three established major categories of mutational origins of MNVs with constituent SNVs falling next to each other (adjacent MNVs; Fig. 3a), each of which is biased toward certain MNV patterns: (1) combinations of distinct single-nucleotide mutation events; (2) replication errors by error-prone polymerase zeta; and (3) polymerase slippage events at repeat junctions. MNVs in the first category are a product of two or more SNVs, which typically occur in different generations and may thus have different allele frequencies. We expect to see an enrichment of CpG transitions compared with non-CpG transversions for this class, due to the underlying difference in SNV mutation rates^28,29,30. The second category, replication error introduced by DNA polymerase zeta (pol-zeta), is a well-known class of replication error that introduces MNVs. Previous studies^10,11,12,13,31 have shown that pol-zeta is prone to specific types of replication error, mainly TC → AA, GC → AA, and their reverse complements, with experimental evidence that these MNV patterns occur in a single generation; thus, the constituent SNVs will typically have the same allele frequencies. The third category, replication slippage, is another known mode of DNA replication error^32,33,34. This process is especially frequent at sites with repetitive sequence context; previous studies^35,36,37 have shown that the indel rate can be up to 10⁶ times higher than the SNV mutation rate at these sites. As shown in Fig. 3a, the combination of an insertion and then a deletion of two base pairs can result in an MNV.

We observed the signature of each of these MNV mechanisms in our data set. First, we calculated the number of MNVs for each MNV pattern (Fig. 3b) and observed that the most frequent MNV pattern is CA → TG substitutions, which are likely to occur as a combination of an A → G transition, followed by a high-mutation-rate C → T CpG transition (Supplementary Fig. 4a). On the other hand, the least frequent MNV pattern is TA → GC substitutions, which occur as a combination of two non-CpG transversions. The 273.4-fold difference (270,071 versus 988) in the frequency of MNVs between these two patterns is comparable with the theoretical ratio calculated based on the mutation rate of the component SNVs (475.6-fold), and the overall correlation between the theoretical and observed frequency of each MNV pattern was strong (Pearson correlation r = 0.839 with p = 9.15 × 10⁻²² in log space; Supplementary Fig. 4b–e).

To investigate the extent of pol-zeta signature, we calculated the number of MNVs in which the gnomAD allele counts of the constitutive single-nucleotide variants are equal (following previous methodology², also described in the Methods section), and observed that these one-step MNVs are significantly enriched in MNV patterns matching the pol-zeta signature (90.5% for GA → TT, and 80.5% for GC → AA, compared with 39.9% overall; Fisher’s exact test, p < 10⁻¹⁰⁰; Fig. 3c).
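The one-step criterion (equal allele counts for the two constituent SNVs) lends itself to a compact illustration. The sketch below assumes a hypothetical record layout of `(pattern, allele_count_snv1, allele_count_snv2)` and uses invented toy counts; it shows only the per-pattern tally, not the enrichment test.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record: (MNV pattern, allele count of SNV 1, allele count of SNV 2)
Record = Tuple[str, int, int]


def one_step_fraction_by_pattern(mnvs: Iterable[Record]) -> Dict[str, float]:
    """Fraction of MNVs per pattern whose two SNVs have equal allele counts."""
    total = defaultdict(int)
    one_step = defaultdict(int)
    for pattern, ac1, ac2 in mnvs:
        total[pattern] += 1
        if ac1 == ac2:
            one_step[pattern] += 1
    return {p: one_step[p] / total[p] for p in total}


# Toy data (invented): both GA>TT sites look one-step, and one of the two
# CA>TG sites has very different allele counts for its constituent SNVs.
toy = [("GA>TT", 3, 3), ("GA>TT", 1, 1), ("CA>TG", 5, 120), ("CA>TG", 2, 2)]
fractions = one_step_fraction_by_pattern(toy)
```

Equal allele counts are a proxy rather than proof of a single mutational event, since two independent SNVs can share an allele count by chance, particularly at low counts.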

Finally, in order to capture polymerase slippage events, we calculated the fraction of MNVs in repetitive contexts per MNV pattern (Fig. 3d). For the MNV pattern AA → TT, >30% of all the MNVs observed were in repetitive contexts. The fractions of the MNV patterns AT → TA and TA → AT in repetitive contexts were also high, exceeding 10% (Fisher’s exact test, p < 10⁻¹⁰⁰ compared with the 3.15% across all patterns). For all MNV patterns in repeat contexts, we see a significant excess of MNVs compared with the expected number based on a model that assumes MNVs are a simple combination of two SNV events (Supplementary Fig. 4). These observations support the role of replication slippage as one of the major drivers of MNVs. In addition, we did not see a correlation between the frequency of one-step MNVs and the frequency of MNVs in repetitive contexts (Pearson correlation r = 0.0561, p > 0.05; the fraction of one-step MNVs exceeded 80% for AT → TA and TA → AT, but was 46% for AA → TT), suggesting that multiple slippage events leading to MNV generation can take place either as a single event (i.e., in a single generation) or as multiple events (i.e., in different generations), or even recurrently. These findings come with the caveat that variants in repetitive regions will have higher error rates due to slippage and misalignment errors, but we have reduced this risk by applying random forest filtering for individual sites, as well as removing all the variants in low-complexity regions from our analysis (see the Methods section).

Estimation of global mutation rate of MNVs

In order to compare the frequency of three different mechanisms, we quantified the contribution of two single-nucleotide variation events vs other replication error modes, such as pol-zeta errors or replication slippage, using a simple probabilistic model. Specifically, focusing on adjacent MNVs, we assigned the MNV frequency for each MNV pattern to be the sum of the probability of two SNV events (P) and the probability of other replication error factors (Q), and estimated the Q term. In other words, we estimated the divergence of the observed number of MNV sites from the number expected by a simple SNV mutation model (see the Methods section). The resulting estimated proportion of two SNV events and other replication error events is described in Fig. 4a.
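The idea behind the P and Q terms can be illustrated with a deliberately simplified sketch in count space. It assumes that an expected number of MNV sites under the two-SNV model is already available for a given pattern; how that expectation is derived is described in the Methods section and is not reproduced here. The function name and the toy numbers are hypothetical.

```python
from typing import Tuple


def decompose_mnv_pattern(observed: float,
                          expected_two_snv: float) -> Tuple[float, float]:
    """Split the observed MNV count for one pattern into the fraction
    explained by two independent SNV events (P) and the remaining excess
    attributed to other replication errors (Q)."""
    if observed <= 0:
        raise ValueError("observed must be positive")
    # The two-SNV component cannot explain more than what was observed.
    p_count = min(expected_two_snv, observed)
    q_count = observed - p_count
    return p_count / observed, q_count / observed


# Toy numbers (invented): 980 of 1000 observed sites are expected from
# the SNV model alone, leaving a 2% excess.
frac_two_snv, frac_other = decompose_mnv_pattern(observed=1000,
                                                 expected_two_snv=980)
```

Clipping at the observed count is a choice made for this sketch so that both fractions stay within 0 and 1; it is not a statement about how the original estimate handles patterns with fewer MNVs than expected.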

As expected, the proportion differs substantially from one MNV pattern to another. For example, while 98.0% of CA → TG MNVs appear to be caused by combinations of simple SNV events, the corresponding proportion is 5.84% for GA → TT, 18.9% for GC → AA, and 9.52% for AA → TT MNVs. We presume that the lower proportion of two simple SNV events is mainly due to pol-zeta errors for GA → TT and GC → AA, and polymerase slippage for AA → TT. Since 83.2% of the overall MNVs were classified as either SNV combination, repeat context, or pol-zeta error at GA → TT or GC → AA, our analysis suggests that these three major categories explain a substantial fraction of MNV events genome wide, although some possible additional mechanisms with smaller frequencies might exist. These calculations also allow us to estimate the genome-wide mutation rate of MNVs caused by pol-zeta: 1.59 × 10⁻¹⁰ per 2 bp per generation for GA → TT, and 4.08 × 10⁻¹⁰ for GC → AA. Given that there are ~1.66 × 10⁸ GA pairs and 1.20 × 10⁸ GC pairs in the reference human genome, we estimate there are on average 0.026 GA → TT and 0.049 GC → AA mutations per generation (Supplementary Data 3).
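The per-generation estimates follow from multiplying each per-site rate by the number of matching dinucleotide sites. The short sketch below reproduces only that arithmetic from the figures quoted in the text; the variable and function names are illustrative.

```python
# Per-site pol-zeta MNV rate (per 2 bp per generation) and the number of
# matching dinucleotide sites in the reference genome, as quoted in the text.
POL_ZETA = {
    "GA>TT": (1.59e-10, 1.66e8),
    "GC>AA": (4.08e-10, 1.20e8),
}


def expected_per_generation(rate_per_site: float, n_sites: float) -> float:
    """Expected number of new MNVs per generation for one pattern."""
    return rate_per_site * n_sites


expected = {pattern: expected_per_generation(rate, n_sites)
            for pattern, (rate, n_sites) in POL_ZETA.items()}
# The products are about 0.0264 (GA>TT) and 0.0490 (GC>AA).
```

Both products round to the reported values of 0.026 and 0.049.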

We also explored the potential mutational mechanisms for MNVs with a greater distance between the component variants (Supplementary Figs. 5–7), and observed signatures of non-independence of mutation events extending over distances up to 10 bp, with an enrichment of motifs consistent with pol-zeta and polymerase slippage mechanisms for adjacent MNVs (minimum 1.08, maximum 4.06-fold enrichment of one-step MNV, Fisher’s exact test, p-value < 0.05; Supplementary Figs. 8,9). This confirms the presence of mutational mechanisms capable of creating simultaneous mutations separated by considerable distances^16,29,38,39,40, although further work will be required to fully characterize the underlying processes.

Overall, our analysis of MNVs in 15,708 whole-genome-sequenced individuals supports the three previously suggested major mechanisms of MNVs and quantifies the different contribution of each mechanism for different MNV patterns at the genome-wide scale.

MNV distribution across different genomic regions

We next examined how MNV pattern distributions differ between functional annotation categories. We used 13 different functional annotations such as coding sequence, enhancer, and promoter from Finucane et al.⁴¹, and the DNA methylation annotation from the Encyclopedia of DNA Elements (ENCODE)⁴², to calculate the number of MNVs that fall into each category (Supplementary Table 4). MNV density, defined as the number of MNVs observed in each functional category divided by the total length of the genomic interval belonging to each category, is shown in Fig. 4b and c. We found that MNV density of the substitution patterns typically involving CpG transitions is positively correlated with the methylation level (linear regression, Pearson correlation r = 0.95 for CG → TA and r = 0.87 for CA → TG, p < 10⁻³). Conversely, MNV density for non-CpG transversion-related substitution patterns, and the substitution patterns related to pol-zeta errors, negatively correlates with methylation status (linear regression, Pearson correlation r = −0.90 for GA → TC, r = −0.91 for AG → CC, r = −0.91 for GA → TT, and r = −0.92 for GC → AA, p < 10⁻⁵; Fig. 4b, c).
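The density definition and the correlation with methylation can be sketched in a few lines of Python. The per-category counts, interval lengths, and methylation levels below are invented toy values chosen to be exactly linear; they are not taken from Supplementary Table 4, and the function names are illustrative.

```python
import math
from typing import Sequence


def mnv_density(n_mnvs: int, interval_bp: int) -> float:
    """MNVs per base pair for one functional category."""
    return n_mnvs / interval_bp


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Toy categories (invented): MNV count, total interval length in bp,
# and mean methylation level.
counts = [20, 40, 210, 90]
lengths_bp = [2000, 1000, 3000, 1000]
methylation = [0.1, 0.4, 0.7, 0.9]

densities = [mnv_density(c, l) for c, l in zip(counts, lengths_bp)]
r = pearson_r(methylation, densities)
```

Dividing by interval length before correlating matters because the functional categories differ greatly in size, so raw counts would mostly reflect how much of the genome each category covers.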

Finally, we explored the effect of genic context on MNV origins and discovery: we selected the seven major regional annotations around gene-coding sequences^43,44, and calculated the fraction of MNVs likely explained by different mutational origins in each of these regions (Fig. 4d). Across all regions, we found that the MNV signal is primarily dominated by CpG transitions. The fractions of non-CpG transversions and polymerase slippage at repeats were consistently lower than (or nearly equal to) 5% of the overall signal. Pol-zeta signature was not as dominant as CpG transitions, except at the transcription start site region, which has by far the lowest methylation rate of those seven annotations, and is thus expected to have a lower rate of CpG deamination mutations (which are dependent on the methylation of the original cytosine).

Overall, our results suggest that MNV density is highly dependent on the CpG methylation status of the surrounding sequence, and that MNVs that originate from non-CpG transversions or polymerase slippage at repeat junctions are relatively uncommon compared with those driven by CpG transitions or pol-zeta errors. Finally, MNVs that originate from pol-zeta error are the most common class of MNVs in the region close to the transcription start sites of genes, as low methylation levels in these regions result in low levels of CpG transitions.

Discussion

We analyzed 125,748 human exomes and 15,708 genomes and identified 1,792,248 MNVs across the genome with constituent variants falling within 2 bp distance, including 31,575 that exist within a codon. We have shown that MNVs represent an important class of genetic variation, and that they have a significant impact on the functional interpretation of genomic data, both at the population and individual level. Although we did not encounter an individual in which an MNV is the likely cause of a rare disease after sequencing 6072 individuals from rare disease families, we expect that applying our pipeline to larger numbers of disease samples will identify previously missed diagnoses, as has been observed in another study of developmental delay cases².

The large number and high quality of variant calls in the gnomAD database provided increased power for statistical analysis of the three major mutational mechanisms (combinations of independent SNVs; replication errors by pol-zeta; and polymerase slippage at repeat junctions) responsible for the generation of MNVs, and importantly allowed us to estimate the relative contribution of each of these processes.

Our estimates of substitution pattern-specific MNV mutation rate and fraction come with important caveats. Our approach assumes that the local SNV mutation rate is invariant across instances of a specific 3 bp context; however, prior work has shown considerable regional variation in mutation rate across the genome, as well as variation driven by ancestry, environment, and other factors^45,46,47,48. Another important limitation is the lack of confident estimates of insertion and deletion rate as a function of repeat length, which limits the confidence of our estimate of the fraction of polymerase slippage. Future large genome-scale data sets with more accurate insertion and deletion calls, likely involving long-read sequencing data, will be required to improve modeling of insertion and deletion mutations.

One clear feature of our data set was the signature of non-independence of mutational events separated by up to 10 bp, as suggested in various de novo studies^16,29,38,39,40; further investigation of these clustered mutations, and contextualizing them with known sources of genomic instability, such as homologous recombination⁴⁹ or transposable elements^50,51, will be informative in exploring the mechanisms of clustered mutations.

The complete list of MNVs identified in gnomAD is publicly available (https://gnomad.broadinstitute.org/downloads), with the allele count annotated for both genome and exome. For the coding regions, we have also annotated the functional consequence of constituent SNVs and MNVs separately, and made the result viewable in an intuitive browser (https://gnomad.broadinstitute.org). Although some fraction of MNVs is missing from this list due to incomplete phasing sensitivity and read coverage, the database provides the most comprehensive set of estimates of MNV allele frequencies to date, valuable for further analysis of mutational mechanisms as well as the interpretation of MNVs in rare disease and cancer genomics^52,53.

Finally, despite the large sample size of our MNV data set, the fraction of MNVs that we have observed out of all the possible MNV configurations is still very far from saturating the space of possible MNVs, with only ~0.005% of all possible adjacent MNVs observed in our data (Supplementary Figs. 10, 11). Increasing the number of sequenced individuals⁵⁴ in both disease and non-disease cohorts will permit the discovery and determination of the phenotypic impact of an increasingly comprehensive catalogue of variation. This study confirms the importance of incorporating haplotypic phase into these efforts to permit the discovery and accurate interpretation of the full range of human variation.

Methods

Ethics

We have complied with all relevant ethical regulations. This study was overseen by the Broad Institute’s Office of Research Subject Protection and the Partners Human Research Committee, and was given a determination of Not Human Subjects Research. Informed consent was obtained from all participants.

MNV calling

125,748 human exomes and 15,708 genomes from the gnomAD 2.1 callset were used for the analyses (Supplementary Tables 5,6). We used Hail (https://github.com/hail-is/hail), an open source, cloud-based scalable analysis tool for large genomic data. For MNV discovery, we exhaustively looked for variants that appear in the same individual, in cis, and within 2 bp distance for the exome data set and 10 bp distance for the genome data set, using the Hail window_by_locus function; that is, for every individual we computationally checked every pair of genotypes within a given window size to see whether the individual carries one or more pairs of mutations on the same haplotype (see Supplementary Methods for further detail). We did not expand the window size beyond 10 bp for MNV discovery, as phasing sensitivity drops significantly when the distance between variants is >10 bp, as shown in Supplementary Fig. 1d. For trio-based analyses, we expanded the range to 100 bp to obtain a more macroscopic view. Although we performed MNV calling in sex chromosomes for the coding region, we restricted our analysis to autosomes, in order to control for differences in zygosity.
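The windowed pairwise check can be illustrated for a single individual in plain Python, independently of Hail. This is a sketch under stated assumptions: read-based phase is represented by the phase-set identifier (PID) and phased genotype (PGT) that GATK HaplotypeCaller emits, two heterozygous calls are treated as in cis when both fields match, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PhasedCall:
    pos: int   # 1-based position on one chromosome
    ref: str
    alt: str
    pid: str   # phase-set identifier (assumed GATK-style PID)
    pgt: str   # phased genotype within the set, e.g. "0|1" or "1|0"


def find_mnv_pairs(calls: List[PhasedCall],
                   max_dist: int = 2) -> List[Tuple[PhasedCall, PhasedCall]]:
    """Return all pairs of calls in one individual that lie within
    max_dist bp of each other and sit on the same haplotype (in cis)."""
    ordered = sorted(calls, key=lambda c: c.pos)
    pairs = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            dist = second.pos - first.pos
            if dist > max_dist:
                break  # calls are sorted, so later ones are even farther
            if dist == 0:
                continue  # same site, not a pair of nearby variants
            if first.pid == second.pid and first.pgt == second.pgt:
                pairs.append((first, second))
    return pairs


# Toy individual (invented): 100 and 101 are in cis; 102 lies on the
# other haplotype; 120 is outside the 2 bp window.
calls = [
    PhasedCall(100, "G", "T", "100_G_T", "0|1"),
    PhasedCall(101, "A", "T", "100_G_T", "0|1"),
    PhasedCall(102, "C", "G", "100_G_T", "1|0"),
    PhasedCall(120, "T", "C", "100_G_T", "0|1"),
]
pairs = find_mnv_pairs(calls, max_dist=2)
```

Calling `find_mnv_pairs(calls, max_dist=10)` would mirror the wider window used for the genome data set, subject to the same phasing assumptions.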

MNV calling in rare disease samples was performed in a similar fashion as in the gnomAD exome data set. In total, 6072 rare disease whole-exome sequences were curated at the Broad Center for Mendelian Genomics (CMG)⁵⁵ and went through the MNV calling pipeline with the window size of 2 bp distance. The phenotypes observed in the cohort include: muscle disease such as Limb Girdle Muscular Dystrophy (LGMD; roughly one-third of the total), neurodevelopmental disorders, or severe phenotypes in eye, kidney, cardiac, or other orphan diseases (Supplementary Data 2).

MNV filtering

In the gnomAD MNV analysis, variant pairs for which one or both of their components have low-quality reads were filtered out. Specifically, we only selected the variant sites that pass the Random Forest filtering, resulting in acceptance of 53.3% of the initial MNV candidates (Supplementary Fig. 12a). We also filtered out variant sites that fall in low-complexity regions (LCRs) identified with the symmetric DUST algorithm⁵⁶ at a score threshold of 30, and additionally applied adjusted threshold criteria (GQ ≥ 20, DP ≥ 10, and allele balance > 0.2 for heterozygote genotypes) for filtering individual variants (Supplementary Table 7). For each MNV site, we annotated the number of alleles that appear as MNV, as well as the number of individuals carrying the MNV as a homozygous variant. The distribution of MNV sites that contain homozygous MNVs is shown in Supplementary Fig. 13. We also collapsed the MNV patterns that are reverse complements of each other, after observing that the number of MNVs is roughly symmetric (before collapsing, the ratio of each MNV pattern to its corresponding reverse complement pattern was mostly close to 1, with 0.95 being the lowest and 1.10 being the highest for adjacent MNVs) (Supplementary Fig. 14). All the MNV patterns in the main text and figures are equivalent to their reverse complement, and we do not distinguish them.
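Two of the steps above, the genotype-level thresholds and the collapsing of reverse-complement patterns, can be sketched in Python as follows. The function names are hypothetical, and choosing the lexicographically smaller string as the canonical pattern is an assumption made for this sketch; the text does not state which member of each pair is used as the label.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def collapse_pattern(ref: str, alt: str) -> str:
    """Canonical label shared by an MNV pattern and its reverse complement.
    Assumption: the lexicographically smaller 'REF>ALT' string is kept."""
    forward = f"{ref}>{alt}"
    reverse = f"{reverse_complement(ref)}>{reverse_complement(alt)}"
    return min(forward, reverse)


def passes_genotype_filter(gq: int, dp: int, allele_balance: float,
                           is_het: bool) -> bool:
    """Thresholds quoted in the text: GQ >= 20, DP >= 10, and
    allele balance > 0.2 for heterozygous genotypes."""
    if gq < 20 or dp < 10:
        return False
    if is_het and allele_balance <= 0.2:
        return False
    return True


# TC>AA and GA>TT are reverse complements and collapse to one label.
label_a = collapse_pattern("TC", "AA")
label_b = collapse_pattern("GA", "TT")
```

Patterns such as TA → GC are their own reverse complement, so collapsing leaves them unchanged.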

For the rare disease cohort, since our motivation was to find a definite example where an MNV is acting as a causal variant for a rare disease with severe phenotype rather than obtaining the population-level statistics, we did not apply site and sample-specific filtering, as opposed to the gnomAD MNV analysis. Instead of being computationally filtered by read quality, the 129 putative MNVs (16 gained nonsense mutations, 110 changed missense with high CADD score and low gnomAD MNV frequency, and 3 gained missense) went through manual inspection by the analysts at the Center for Mendelian Genomics (CMG) at the Broad Institute⁵⁵, after annotating the affected gene. Specifically, all the variants were checked manually under the criteria below:

- Whether the gene affected is constrained in the gnomAD population.

- Whether the case has already been solved with another causal variant.

- Whether the MNV looks real in the Integrative Genomics Viewer (IGV)⁵⁷.

- Whether the MNV is in the proband and, if applicable, the segregation pattern of the MNV.

- Whether the known function of the gene affected matches the patient phenotype.

MNVs were filtered out if they failed one or more of the criteria above. These results suggest that MNVs explain only a small fraction of undiagnosed genetic disease cases, consistent with their overall frequency as a class of variation, and with prior work in large disease-affected cohorts². The summary of the MNV analysis in the rare disease cohort is also available in Supplementary Data 2.

Analysis of phasing sensitivity

In order to compare the phasing information derived from different methods (read-based and trio-based), we compared the relative phase (a binary classification of whether the two SNVs of an MNV are on the same haplotype or not), as shown in Supplementary Table 8. We investigated the heterozygous variant pairs whose phasing information is not provided by the trio-based phasing and observed that the majority (83.5%) of the cases reflected both parents carrying a heterozygous variant, a scenario where trio-based phasing is inherently uninformative. We also investigated the heterozygous variant pairs whose phasing information is not provided by the read-based phasing. Specifically, unphased pairs tend to have either low or high read depth (odds ratio = 3.20, Fisher’s exact test, p < 10⁻¹⁰⁰ for low, and odds ratio = 2.33, Fisher’s exact test, p < 10⁻¹⁰⁰ for high read depth; Supplementary Table 3), consistent with our previous understanding that an excess of reads can lead to the involvement of erroneous reads and thus reduce the phasing confidence of HaplotypeCaller⁵⁸, whereas a shortage of reads reduces the calling rate. All statistical tests are two-sided throughout the paper.

Analysis of functional impact in coding region

We focused on the coding region of the canonical transcript of each gene and examined the codon change and its consequence for all MNVs that fall within a single codon (see Supplementary Tables 9 and 10 for the number of MNVs that span two codons). When comparing with population-level constraint, we annotated each MNV with the constraint metric (LOEUF²²) of the gene whose protein product is affected. For rescued nonsense mutations, we kept only those that are rescued in all individuals carrying the component variants (i.e., we excluded those for which the allele count of the MNV does not equal the allele count of the SNV that introduces the nonsense mutation), resulting in 1538 out of 1821 rescued nonsense mutations. We next used the Loss-Of-Function Transcript Effect Estimator (LOFTEE²²) to exclude nonsense mutations that are unlikely to affect protein function. This resulted in 371 high-confidence (HC) gained nonsense mutations and 1400 HC rescued nonsense mutations, which were used for the population-level constraint analysis. In addition, we stratified the gene sets by core essential/nonessential genes from CRISPR/Cas knockout experiments⁵⁹,⁶⁰ as an orthogonal indicator of gene constraint (Supplementary Fig. 2).
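
To make the idea of a combined codon effect concrete, here is a small sketch that translates a reference codon, each SNV alone, and the two SNVs together, then labels the result. It is a simplified illustration and not the annotation code used in the study: the category names, the function names, and the three-way labeling are assumptions, and only the standard genetic code is used.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def apply_snv(codon, pos, alt):
    """Return the codon with the base at 0-based position `pos` replaced by `alt`."""
    return codon[:pos] + alt + codon[pos + 1:]


def classify_mnv(ref_codon, snv1, snv2):
    """Compare the MNV consequence with the consequences of each SNV alone.

    `snv1` and `snv2` are (position, alt_base) tuples within the same codon.
    """
    aa_ref = CODON_TABLE[ref_codon]
    aa_snv1 = CODON_TABLE[apply_snv(ref_codon, *snv1)]
    aa_snv2 = CODON_TABLE[apply_snv(ref_codon, *snv2)]
    aa_mnv = CODON_TABLE[apply_snv(apply_snv(ref_codon, *snv1), *snv2)]

    snv_is_stop = "*" in (aa_snv1, aa_snv2)
    if aa_mnv == "*" and not snv_is_stop and aa_ref != "*":
        return "gained nonsense"    # stop codon appears only in combination
    if aa_mnv != "*" and snv_is_stop:
        return "rescued nonsense"   # a stop from one SNV is removed by the other
    if aa_mnv not in (aa_snv1, aa_snv2):
        return "changed consequence"
    return "unchanged"


# CAG (Gln): C>T alone gives TAG (stop); together with A>G the codon is TGG (Trp).
rescued = classify_mnv("CAG", (0, "T"), (1, "G"))
# TCT (Ser): neither SNV alone creates a stop, but together they give TAA (stop).
gained = classify_mnv("TCT", (1, "A"), (2, "A"))
```

Annotating the two SNVs independently would report a nonsense variant in the first example and none in the second, which is the misannotation that haplotype-aware analysis avoids.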

We did not include or correct for MNVs consisting of three SNVs in a single codon in the analysis of functional impact in coding regions, since the number and frequency of such MNVs are very low (228 in total, with 5 newly gained nonsense, but no re-rescued or re-gained nonsense; 0.220 in total per person). The full list of such MNVs is available as a separate file at: https://gnomad.broadinstitute.org/downloads.

Frame-restoring indel analysis was performed in a similar fashion. We used the gnomAD exome data set to call and filter the insertion/deletion pairs using the same filtering criteria (except for the fact that we did not restrict our analysis to cases where the frameshift effect would be rescued in all individuals), and focused on the canonical transcripts for the functional impact evaluation.

Defining one-step MNVs and MNVs in repetitive contexts

A one-step MNV was defined as an MNV for which the allele counts of the two constituent SNVs are the same and close to the allele count of the MNV itself. We also compared the allele counts of the constituent SNVs (AC1 and AC2) with the allele count of the corresponding MNV (AC_mnv), and observed that the majority of the one-step MNVs we discovered have AC_mnv divided by AC1 > 0.9 (Supplementary Fig. 15). Therefore, we expect the false discovery rate of one-step MNVs (misclassifying an MNV whose AC1 and AC2 are equal just by chance) to be limited. The full distribution of all the allele counts, including per-population characterizations, is shown in Supplementary Fig. 16 and Supplementary Table 11.
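
The definition above can be sketched as a simple predicate on allele counts. The text does not state a numeric cutoff for "close to", so the `min_ratio` parameter below is a hypothetical choice, set to 0.9 only because most one-step MNVs were observed above that ratio; the function name is likewise an assumption.

```python
def is_one_step_mnv(ac1, ac2, ac_mnv, min_ratio=0.9):
    """Heuristic check for a one-step MNV from allele counts.

    ac1, ac2 : allele counts of the two constituent SNVs
    ac_mnv   : allele count of the MNV (both SNVs on the same haplotype)
    min_ratio: hypothetical threshold on AC_mnv / AC1 (not given in the text)
    """
    if ac1 <= 0 or ac1 != ac2:
        return False
    return ac_mnv / ac1 > min_ratio
```

For example, `is_one_step_mnv(10, 10, 10)` holds, whereas a pair with unequal SNV allele counts, or with an MNV allele count far below that of its SNVs, does not.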

Repetitive sequences were defined by taking the ±4 bp context of the MNV and setting the threshold manually, based on the distribution of repeat contexts around all the MNVs (Supplementary Figs. 17, 18). Specifically, a sequence is defined as repetitive if (i) the number of dinucleotide repeat units is > 1 for both the reference and the alternative ±4 bp context, and (ii) the number of dinucleotide repeat units is > 2 for either the reference or the alternative ±4 bp context. In addition, (iii) for adjacent MNVs only, if the reference and/or alternative 2 bp is a mononucleotide repeat, the threshold is increased by one mononucleotide repeat unit.

Here, a dinucleotide repeat unit is defined as the reference or the alternative allele itself (with the gap when d > 1, and counting overlaps; for example, the reference and alternative dinucleotide repeat counts for TATATAT → TAAAAAT are both 3). The third criterion was added specifically for adjacent MNVs to adjust for counting the overlap more than once. This threshold was set so that the number of MNVs with equal or higher repeats would be <5% of the total, corresponding to two standard deviations away from the mean, and also because the estimated mutation rate in these repetitive contexts is likely to be orders of magnitude higher than the background MNV mutation rate originating from the combination of two SNV events³⁵,³⁶,³⁷.
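
The counting rule in the worked example above (units separated by the gap d, with overlapping occurrences counted) can be sketched as follows. This is one reading of the definition that reproduces the stated count of 3 for both alleles of TATATAT → TAAAAAT; the function name and signature are hypothetical, and the thresholds themselves are not implemented here.

```python
def count_repeat_units(context, first, second, d=1):
    """Count occurrences of the pair (first, second) separated by distance d.

    Overlapping occurrences are counted, and for d > 1 the bases between the
    two positions (the gap) are ignored.
    """
    return sum(
        1
        for i in range(len(context) - d)
        if context[i] == first and context[i + d] == second
    )


# Example from the text: the MNV changes T_T to A_A at distance d = 2.
ref_units = count_repeat_units("TATATAT", "T", "T", d=2)
alt_units = count_repeat_units("TAAAAAT", "A", "A", d=2)
```

Both `ref_units` and `alt_units` evaluate to 3, matching the counts given in the text.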

Calculating the proportion of MNVs per biological origin

We calculated the proportion of MNVs per biological origin by comparing the observed number of MNVs (that are not in repetitive contexts) with the expected number of MNVs under a single-nucleotide mutational model.

Specifically, if we simply hypothesize that most MNVs are a combination of two single-nucleotide substitution events, we can estimate the relative probability of an MNV event per substitution pattern. For example, the probability of observing a CA to TG MNV in a single individual at a single site, \(p(\mathrm{CA} \to \mathrm{TG})\), is proportional to \(p(\mathrm{CA} \to \mathrm{TA}) \cdot p(\mathrm{TA} \to \mathrm{TG}) + p(\mathrm{CA} \to \mathrm{CG}) \cdot p(\mathrm{CG} \to \mathrm{TG})\), and the probability of a TA to GC MNV, \(p(\mathrm{TA} \to \mathrm{GC})\), is proportional to \(p(\mathrm{TA} \to \mathrm{GA}) \cdot p(\mathrm{GA} \to \mathrm{GC}) + p(\mathrm{TA} \to \mathrm{TC}) \cdot p(\mathrm{TC} \to \mathrm{GC})\). The former expression involves a transition at a CpG, whereas both terms of the latter are products of transversions at non-CpG sites, which is a reasonable explanation for the frequency difference between these two MNV patterns.

Using the same principle (and accounting for reference base pair frequency, population number and the global SNV mutation rate defined by 3 bp context²⁶), we first constructed a null model of the MNV distribution. In reality, this null model does not reproduce the distribution we observe, because additional biological mechanisms introduce MNVs. Therefore, we allowed an additional factor q that denotes a mutational event in which two SNVs are introduced at the same time. For the example of \(p(\mathrm{CA} \to \mathrm{TG})\), we model this probability to be proportional to \(p(\mathrm{CA} \to \mathrm{TA}) \cdot p(\mathrm{TA} \to \mathrm{TG}) + p(\mathrm{CA} \to \mathrm{CG}) \cdot p(\mathrm{CG} \to \mathrm{TG}) + q(\mathrm{CA} \to \mathrm{TG})\), and estimate the q term, which corresponds to the proportion of MNVs that are explained by non-SNV (and non-repeat) factors. Further details are explained in the Supplementary Methods (section “Models and assumptions for calculating the proportion of MNV per biological mechanism”).
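
The two-step expectation can be illustrated with a toy calculation. The sketch below sums over the two possible intermediate dinucleotides for an adjacent MNV; the relative rates in `TOY_RATE` are hypothetical placeholders (not the 3 bp context rates used in the study), CpG status is judged only within the dinucleotide itself, and `excess_q` is a simplified stand-in for the estimation detailed in the Supplementary Methods.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

# Hypothetical relative rates, for illustration only.
TOY_RATE = {"ti_cpg": 20.0, "ti": 2.0, "tv": 1.0}


def snv_rate(ref2, alt2):
    """Relative rate of a single-base change between two dinucleotides."""
    diffs = [i for i in range(2) if ref2[i] != alt2[i]]
    if len(diffs) != 1:
        raise ValueError("ref2 and alt2 must differ at exactly one position")
    i = diffs[0]
    if (ref2[i], alt2[i]) in TRANSITIONS:
        # C>T or G>A within a CG dinucleotide is a transition at CpG.
        return TOY_RATE["ti_cpg"] if ref2 == "CG" else TOY_RATE["ti"]
    return TOY_RATE["tv"]


def expected_two_step(ref2, alt2):
    """Relative probability of ref2 -> alt2 via two successive SNVs."""
    via_first = alt2[0] + ref2[1]    # first base mutates first
    via_second = ref2[0] + alt2[1]   # second base mutates first
    return (snv_rate(ref2, via_first) * snv_rate(via_first, alt2)
            + snv_rate(ref2, via_second) * snv_rate(via_second, alt2))


def excess_q(observed, ref2, alt2, scale=1.0):
    """Observed signal not explained by the two-step null (floored at zero)."""
    return max(0.0, observed - scale * expected_two_step(ref2, alt2))


ca_tg = expected_two_step("CA", "TG")  # includes the CG -> TG CpG transition
ta_gc = expected_two_step("TA", "GC")  # transversions only
```

With these toy rates `ca_tg` is much larger than `ta_gc`, mirroring the qualitative argument that the CpG transition makes CA→TG far more frequent than TA→GC under the null model.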

In addition, we annotated the predicted major mechanism for each MNV pattern, in the following order:

1. Pol-zeta, for the patterns known as the polymerase zeta signature (GA→TT and GC→AA)

2. Repeat, for the patterns whose fraction of MNVs in repeat contexts is >10% (corresponding to two standard deviations away from the mean; AA→TT, AT→TA, and TA→AT)

3. One of Ti at CpG, Ti, Ti at CpG + Tv, Ti + Tv, or Tv combination, based on the possible combinations of single-nucleotide mutational processes. For example, Ti at CpG applies when a transition at a CpG combined with another transition can produce the pattern (Supplementary Data 3).

Estimation of the global MNV rate per substitution pattern

To estimate the global MNV mutation rate for adjacent MNVs, as well as the mutation rate per MNV pattern, we first focused on the number of one-step MNVs, assuming that there are no recurrent mutations and therefore that the allele frequencies of the constituent SNVs are equal if and only if they originate from an MNV event in a single generation. In this section, we simply refer to a one-step MNV of distance 1 bp (i.e., adjacent) as an MNV.

We then calculated the global MNV mutation rate under the Watterson estimator model, as in Kaplanis et al.². Specifically, we divided the number of MNV sites by the number of SNV sites in our gnomAD data set, and scaled by the global single-nucleotide mutation rate identified in previous research (1.2 × 10⁻⁸), which yielded 2.94 × 10⁻¹¹ per 2 bp per generation. This is roughly two-thirds of the estimate provided by Kaplanis et al.² using trio data; the smaller value is presumably due to differing filtering methods. Next, to obtain the mutation rate per 2 bp for each of the MNV patterns, we simply scaled the global MNV mutation rate described above by the number of reference 2 bp and the coverage difference. The full data for all the 78 patterns are shown in Supplementary Data 3. Further details are explained in the Supplementary Methods (section “Models and assumptions for estimation of the global MNV rate per substitution pattern”).
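
The scaling step above amounts to a single ratio. The sketch below is illustrative only: the site counts are hypothetical, chosen so that their ratio (0.00245) matches the ratio implied by the reported figures, 2.94 × 10⁻¹¹ / 1.2 × 10⁻⁸; the function and constant names are assumptions.

```python
SNV_MUTATION_RATE = 1.2e-8  # per bp per generation, as quoted in the text


def global_mnv_rate(n_mnv_sites, n_snv_sites, snv_rate=SNV_MUTATION_RATE):
    """Scale the SNV mutation rate by the ratio of MNV sites to SNV sites."""
    return n_mnv_sites / n_snv_sites * snv_rate


# Hypothetical counts with ratio 245 / 100000 = 0.00245.
toy_rate = global_mnv_rate(245, 100_000)
```

With this ratio, `toy_rate` comes out at approximately 2.94e-11 per 2 bp per generation.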

Functional enrichment

Thirteen functional annotations were collected from Finucane et al.⁴¹ as a BED file (originating from databases such as ENCODE, Roadmap⁶¹ and the UCSC genome browser⁶²). For the methylation data, we collected genome methylation levels from ENCODE, calculated the fraction of methylated CpGs out of all the CpGs in each region, and ordered the regions by this fraction (Supplementary Table 4).

The MNV density calculation was performed under the null hypothesis that the number of MNVs of type WX→YZ we observe in an arbitrary genomic interval is proportional to the number of WX dinucleotides in the interval. Specifically, the MNV density of WX→YZ in interval I is defined as

\(D(\mathrm{WX} \to \mathrm{YZ} \mid I) = \frac{N(\mathrm{WX} \to \mathrm{YZ} \mid I)}{N(\mathrm{WX} \mid I)}\), where N(WX→YZ|I) is the number of MNVs of type WX→YZ, and N(WX|I) is the number of WX dinucleotides in the reference genome within that genomic interval. We then normalized the density by dividing by D(WX→YZ|I = whole genome) for scaling purposes (i.e., a normalized density of k means that the probability of observing a mutation of WX→YZ given a sequence context of WX is k times higher in genomic functional category I than in the genome overall).
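
As a worked illustration of the density and its normalization, the sketch below computes D for an interval and divides it by the genome-wide value. The counts in the example are hypothetical, and the function names are assumptions.

```python
def mnv_density(n_mnv, n_ref_dinuc):
    """D(WX->YZ | I): MNVs of one pattern per reference WX dinucleotide in I."""
    return n_mnv / n_ref_dinuc


def normalized_density(n_mnv_interval, n_ref_interval, n_mnv_genome, n_ref_genome):
    """Density in the interval relative to the genome-wide density."""
    return (mnv_density(n_mnv_interval, n_ref_interval)
            / mnv_density(n_mnv_genome, n_ref_genome))


# Hypothetical counts: 30 MNVs over 1,000 WX sites in the interval,
# against 1,000 MNVs over 100,000 WX sites genome-wide.
toy_k = normalized_density(30, 1_000, 1_000, 100_000)
```

Here `toy_k` is about 3, meaning the pattern would be roughly three times as dense in the interval as in the genome overall.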

To estimate the fraction of MNVs per origin, we took a thresholding approach based on the results in Fig. 3. We defined four MNV patterns (CA→TG, AC→GT, CC→TT, and GA→AG) as the CpG signal, two (GC→AA, GA→TT) as the pol-zeta signal, three (AA→TT, TA→AT, AT→TA) as the repeat signal, and six (TA→GC, CG→AT, AT→CG, CG→GC, GC→CG, CG→AC) as the transversion signal. All the other 78 − (4 + 2 + 3 + 6) = 63 patterns were left as others, in order to highlight the strongest signals. The fraction of MNVs per origin is then defined simply as the number of MNVs that fall into that group of patterns divided by the number of all MNVs in the genomic interval. The coverage difference per interval was negligible (Supplementary Table 4).
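
The pattern bookkeeping above can be sketched in a few lines. The example assumes that the 78 patterns arise from collapsing each adjacent dinucleotide substitution with its reverse complement (an assumption consistent with the count of 78, but not stated in this section), and then assigns each MNV to one of the signal groups listed above. All names are hypothetical.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(ref2, alt2):
    """Collapse a pattern with its reverse complement (assumed convention)."""
    return min((ref2, alt2), (revcomp(ref2), revcomp(alt2)))


DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]
ALL_PATTERNS = {canonical(r, a) for r in DINUCS for a in DINUCS
                if r[0] != a[0] and r[1] != a[1]}

# Signal groups as listed in the text.
SIGNALS = {
    "CpG": ["CA>TG", "AC>GT", "CC>TT", "GA>AG"],
    "pol-zeta": ["GC>AA", "GA>TT"],
    "repeat": ["AA>TT", "TA>AT", "AT>TA"],
    "transversion": ["TA>GC", "CG>AT", "AT>CG", "CG>GC", "GC>CG", "CG>AC"],
}
ORIGIN = {canonical(*p.split(">")): origin
          for origin, pats in SIGNALS.items() for p in pats}


def fraction_per_origin(mnvs):
    """Fraction of (ref2, alt2) MNVs per signal group within one interval."""
    counts = Counter(ORIGIN.get(canonical(r, a), "other") for r, a in mnvs)
    total = sum(counts.values())
    return {origin: n / total for origin, n in counts.items()}


# Toy interval: TG>CA is the reverse complement of CA>TG, GC>TT of GC>AA.
toy_fractions = fraction_per_origin(
    [("TG", "CA"), ("CA", "TG"), ("GC", "TT"), ("AG", "CT")]
)
```

Under this convention `ALL_PATTERNS` contains 78 entries and `ORIGIN` contains 15, leaving 63 patterns in the "other" group, in agreement with the arithmetic in the text.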

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

The list of coding MNVs in the gnomAD exomes is available at gs://gnomad-public/release/2.1/mnv/gnomad_mnv_coding.tsv (tab-separated file). The coding MNVs consisting of three SNVs in a single codon are available as a separate file at gs://gnomad-public/release/2.1/mnv/gnomad_mnv_coding_3bp.tsv. The list of frame-restoring indel pairs is available at gs://gnomad-public/release/2.1/mnv/frame_restoring_indels.tsv. The list of all the MNVs in the gnomAD genomes is available at gs://gnomad-public/release/2.1/mnv/genome/gnomad_mnv_genome_d{i}.tsv.bgz (tab-separated file, compressed) or gs://gnomad-public/release/2.1/mnv/genome/gnomad_mnv_genome_d{i}.ht (Hail table); in both cases, replace {i} (0 < i < 11) with the distance between the two SNVs of the MNV. Explanations for each column in each file can be found at gs://gnomad-public/release/2.1/mnv/mnv_readme.md. All the files above are also available on the downloads page of the gnomAD browser (https://gnomad.broadinstitute.org/downloads).
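
For convenience, the per-distance genome file paths can be enumerated by substituting {i} = 1, …, 10 into the templates above. The helper below only builds the path strings; its name is hypothetical, and it does not check that the files exist or download them.

```python
GENOME_MNV_DIR = "gs://gnomad-public/release/2.1/mnv/genome"


def genome_mnv_paths(extension="tsv.bgz"):
    """Paths for distances 1..10 (0 < i < 11); extension is "tsv.bgz" or "ht"."""
    return [f"{GENOME_MNV_DIR}/gnomad_mnv_genome_d{i}.{extension}"
            for i in range(1, 11)]


tsv_paths = genome_mnv_paths()
hail_paths = genome_mnv_paths("ht")
```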

Code availability

The code used in the study is available at https://github.com/macarthur-lab/gnomad_mnv.

Change history

04 February 2021
A Correction to this paper has been published: https://doi.org/10.1038/s41467-021-21077-8

References

1. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
2. Kaplanis, J. et al. Exome-wide assessment of the functional impact and pathogenicity of multinucleotide mutations. Genome Res. gr.239756.118 (2019).
3. Rosenfeld, J. A., Malhotra, A. K. & Lencz, T. Novel multi-nucleotide polymorphisms in the human genome characterized by whole genome and exome sequencing. Nucleic Acids Res. 38, 6102–6111 (2010).
4. Wei, L. et al. MAC: identifying and correcting annotation for multi-nucleotide variations. BMC Genomics 16, 569 (2015).
5. Lai, Z. et al. VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research. Nucleic Acids Res. 44, e108 (2016).
6. Cheng, S.-J. et al. Accurately annotate compound effects of genetic variants using a context-sensitive framework. Nucleic Acids Res. 45, e82 (2017).
7. Danecek, P. & McCarthy, S. A. BCFtools/csq: haplotype-aware variant consequences. Bioinformatics 33, 2037–2039 (2017).
8. Khan, W. et al. MACARON: a python framework to identify and re-annotate multi-base affected codons in whole genome/exome sequence data. Bioinformatics 34, 3396–3398 (2018).
9. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68 (2015).
10. Harris, K. & Nielsen, R. Error-prone polymerase activity causes multinucleotide mutations in humans. Genome Res. 24, 1445–1454 (2014).
11. Zhong, X. et al. The fidelity of DNA synthesis by yeast DNA polymerase zeta alone and with accessory proteins. Nucleic Acids Res. 34, 4731–4742 (2006).
12. Sakamoto, A. N. et al. Mutator alleles of yeast DNA polymerase ζ. DNA Repair 6, 1829–1838 (2007).
13. Stone, J. E., Lujan, S. A. & Kunkel, T. A. DNA polymerase zeta generates clustered mutations during bypass of endogenous DNA lesions in Saccharomyces cerevisiae. Environ. Mol. Mutagenesis 53, 777–786 (2012).
14. Chen, J.-M., Férec, C. & Cooper, D. N. Closely spaced multiple mutations as potential signatures of transient hypermutability in human genes. Hum. Mutat. 30, 1435–1448 (2009).
15. Schrider, D. R., Hourmozdi, J. N. & Hahn, M. W. Pervasive multinucleotide mutational events in eukaryotes. Curr. Biol. 21, 1051–1054 (2011).
16. Besenbacher, S. et al. Multi-nucleotide de novo mutations in humans. PLOS Genet. 12, e1006315 (2016).
17. The Deciphering Developmental Disorders Study et al. Large-scale discovery of novel genetic causes of developmental disorders. Nature 519, 223–228 (2015).
18. Garrison, E. & Marth, G. Haplotype-based variant detection from short-read sequencing. arXiv preprint: arXiv:1207.3907 [q-bio] (2012).
19. Francioli, L. C. et al. A framework for the detection of de novo mutations in family-based sequencing data. Eur. J. Hum. Genet. 25, 227–233 (2017).
20. Choi, Y., Chan, A. P., Kirkness, E., Telenti, A. & Schork, N. J. Comparison of phasing strategies for whole human genomes. PLOS Genet. 14, e1007308 (2018).
21. Poplin, R. et al. Scaling accurate genetic variant discovery to tens of thousands of samples. Preprint at: https://doi.org/10.1101/201178v3 (2018).
22. Karczewski, K. J. et al. Variation across 141,456 human exomes and genomes reveals the spectrum of loss-of-function intolerance across human protein-coding genes. Preprint at: https://doi.org/10.1101/531210v3 (2019).
23. Hamosh, A., Scott, A. F., Amberger, J. S., Bocchini, C. A. & McKusick, V. A. Online Mendelian inheritance in man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 33, D514–D517 (2005).
24. Landrum, M. J. et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 42, D980–D985 (2014).
25. Rehm, H. L. et al. ClinGen–the clinical genome resource. N. Engl. J. Med. 372, 2235–2242 (2015).
26. Samocha, K. E. et al. A framework for the interpretation of de novo mutation in human disease. Nat. Genet. 46, 944–950 (2014).
27. Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).
28. Nachman, M. W. & Crowell, S. L. Estimate of the mutation rate per nucleotide in humans. Genetics 156, 297–304 (2000).
29. Francioli, L. C. et al. Genome-wide patterns and properties of de novo mutations in humans. Nat. Genet. 47, 822–826 (2015).
30. Xue, Y. et al. Human Y chromosome base-substitution mutation rate measured by direct sequencing in a deep-rooting pedigree. Curr. Biol. 19, 1453–1457 (2009).
31. Northam, M. R. et al. DNA polymerases ζ and Rev1 mediate error-prone bypass of non-B DNA structures. Nucleic Acids Res. 42, 290–306 (2014).
32. Montgomery, S. B. et al. The origin, evolution, and functional impact of short insertion–deletion variants identified in 179 human genomes. Genome Res. 23, 749–761 (2013).
33. Bacolla, A. et al. Local DNA dynamics shape mutational patterns of mononucleotide repeats in human genomes. Nucleic Acids Res. 43, 5065–5080 (2015).
34. Ananda, G. et al. Microsatellite interruptions stabilize primate genomes and exist as population-specific single nucleotide polymorphisms within individual human genomes. PLOS Genet. 10, e1004498 (2014).
35. Leclercq, S., Rivals, E. & Jarne, P. DNA slippage occurs at microsatellite loci without minimal threshold length in humans: a comparative genomic approach. Genome Biol. Evol. 2, 325–335 (2010).
36. Lai, Y. & Sun, F. The relationship between microsatellite slippage mutation rate and the number of repeat units. Mol. Biol. Evol. 20, 2123–2131 (2003).
37. Pumpernik, D., Oblak, B. & Borštnik, B. Replication slippage versus point mutation rates in short tandem repeats of the human genome. Mol. Genet. Genomics 279, 53–61 (2008).
38. Chan, K. & Gordenin, D. A. Clusters of multiple mutations: incidence and molecular mechanisms. Annu. Rev. Genet. 49, 243–267 (2015).
39. Supek, F. & Lehner, B. Clustered mutation signatures reveal that error-prone DNA repair targets mutations to active genes. Cell 170, 534–547.e23 (2017).
40. Michaelson, J. J. et al. Whole-genome sequencing in autism identifies hot spots for de novo germline mutation. Cell 151, 1431–1442 (2012).
41. Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015).
42. The ENCODE Project Consortium. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science 306, 636–640 (2004).
43. Maston, G. A., Evans, S. K. & Green, M. R. Transcriptional Regulatory Elements in the Human Genome. Annu. Rev. Genomics Hum. Genet. 7, 29–59 (2006).
44. Kulaeva, O. I., Nizovtseva, E. V., Polikanov, Y. S., Ulianov, S. V. & Studitsky, V. M. Distant Activation of transcription: mechanisms of enhancer action. Mol. Cell. Biol. 32, 4892–4897 (2012).
45. Aggarwala, V. & Voight, B. F. An expanded sequence context model broadly explains variability in polymorphism levels across the human genome. Nat. Genet. 48, 349–355 (2016).
46. Duret, L. Mutation patterns in the human genome: more variable than expected. PLOS Biol. 7, e1000028 (2009).
47. Ségurel, L., Wyman, M. J. & Przeworski, M. Determinants of mutation rate variation in the human germline. Annu. Rev. Genomics Hum. Genet. 15, 47–70 (2014).
48. Harris, K. Evidence for recent, population-specific evolution of the human mutation rate. PNAS 112, 3439–3444 (2015).
49. Guirouilh-Barbat, J., Lambert, S., Bertrand, P. & Lopez, B. S. Is homologous recombination really an error-free process? Front. Genet. 5, 175 (2014).
50. Smit, A. F. Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr. Opin. Genet. Dev. 9, 657–663 (1999).
51. Wicker, T. et al. A unified classification system for eukaryotic transposable elements. Nat. Rev. Genet. 8, 973–982 (2007).
52. Roberts, S. A. et al. An APOBEC cytidine deaminase mutagenesis pattern is widespread in human cancers. Nat. Genet. 45, 970–976 (2013).
53. Alexandrov, L. B. et al. Signatures of mutational processes in human cancer. Nature 500, 415–421 (2013).
54. Stark, Z. et al. Integrating genomics into healthcare: a global responsibility. Am. J. Hum. Genet. 104, 13–20 (2019).
55. Centers for Mendelian Genomics, Bamshad, M. J. The genetic basis of Mendelian phenotypes: discoveries, challenges, and opportunities. Am. J. Hum. Genet. 97, 199–215 (2015).
56. Morgulis, A., Gertz, E. M., Schäffer, A. A. & Agarwala, R. A fast and symmetric DUST implementation to mask low-complexity DNA sequences. J. Comput. Biol. 13, 1028–1040 (2006).
57. Robinson, J. T. et al. Integrative genomics viewer. Nat. Biotechnol. 29, 24–26 (2011).
58. Li, H. Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics 30, 2843–2851 (2014).
59. Lenoir, W. F., Lim, T. L. & Hart, T. PICKLES: the database of pooled in-vitro CRISPR knockout library essentiality screens. Nucleic Acids Res. 46, D776–D780 (2018).
60. Hart, T. et al. High-resolution CRISPR screens reveal fitness genes and genotype-specific cancer liabilities. Cell 163, 1515–1526 (2015).
61. Roadmap Epigenomics Consortium et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).
62. Kent, W. J. et al. The human genome browser at UCSC. Genome Res. 12, 996–1006 (2002).
63. Wagih, O. ggseqlogo: a versatile R package for drawing sequence logos. Bioinformatics 33, 3645–3647 (2017).


Acknowledgements

We would like to thank the many individuals whose sequence data are aggregated in gnomAD for their contributions to research, and for making this work possible. The results published here are in part based upon data: (1) generated by The Cancer Genome Atlas managed by the NCI and NHGRI (accession: phs000178.v10.p8). Information about TCGA can be found at http://cancergenome.nih.gov, (2) generated by the Genotype-Tissue Expression Project (GTEx) managed by the NIH Common Fund and NHGRI (accession: phs000424.v7.p2), (3) generated by the Exome Sequencing Project, managed by NHLBI, (4) generated by the Alzheimer’s Disease Sequencing Project (ADSP), managed by the NIA and NHGRI (accession: phs000572.v7.p4). We would like to thank the Hail team for developing tools essential for the large-scale computation in this work. We would like to thank the analysis team of the Broad’s Rare Disease Group for their manual inspection of MNVs in rare disease cohorts. This work was funded by NIDDK U54 DK105566, NIGMS R01 GM104371, and NHGRI UM1 HG008900-01. Q.W. was supported by the Nakajima Foundation Scholarship. K.J.K. was supported by NIGMS F32 GM115208. A.O.D.L. was supported by NICHD K12 HD052896.

Author information

Authors and Affiliations

Program in Medical and Population Genetics, The Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
Qingbo Wang, Emma Pierce-Hoffman, Beryl B. Cummings, Jessica Alföldi, Laurent C. Francioli, Laura D. Gauthier, Andrew J. Hill, Anne H. O’Donnell-Luria, Irina M. Armean, Ryan L. Collins, Mark J. Daly, Stacey Donnelly, Namrata Gupta, Kristen M. Laricchia, Eric V. Minikel, Benjamin M. Neale, Timothy Poterba, Andrea Saltzman, Molly Schleicher, Matthew Solomonson, Grace Tiao, Arcturus Wang, James S. Ware, Nicholas A. Watts, Nicola Whiffin, Patrick T. Ellinor, Tõnu Esko, Jose Florez, Sekar Kathiresan, Steven A. Lubitz, James B. Meigs, Aarno Palotie, Samuli Ripatti, Jeremiah Scharf, Konrad J. Karczewski & Daniel G. MacArthur
Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, 02114, USA
Qingbo Wang, Beryl B. Cummings, Jessica Alföldi, Laurent C. Francioli, Anne H. O’Donnell-Luria, Irina M. Armean, Mark J. Daly, Kristen M. Laricchia, Benjamin M. Neale, Timothy Poterba, Cotton Seed, Matthew Solomonson, Grace Tiao, Christopher Vittal, Arcturus Wang, Nicholas A. Watts, Konrad J. Karczewski & Daniel G. MacArthur
Program in Bioinformatics and Integrative Genomics, Harvard Medical School, Boston, MA, 02115, USA
Qingbo Wang & Ryan L. Collins
Program in Biomedical and Biological Sciences, Harvard Medical School, Boston, MA, 02115, USA
Beryl B. Cummings
Data Sciences Platform, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
Laura D. Gauthier, Eric Banks, Louis Bergelson, Kristian Cibulskis, Miguel Covarrubias, Yossi Farjoun, Jeff Gentry, Thibault Jeandet, Diane Kaplan, Christopher Llanwarne, Ruchi Munshi, Sam Novod, Nikelle Petrillo, David Roazen, Valentin Ruano-Rubio, Jose Soto, Kathleen Tibbetts, Charlotte Tolonen, Gordon Wade & Ben Weisburd
Department of Genome Sciences, University of Washington, Seattle, WA, 98195, USA
Andrew J. Hill
Centre for Population Genomics, Garvan Institute of Medical Research, and UNSW Sydney, Sydney, Australia
Daniel G. MacArthur
Centre for Population Genomics, Murdoch Children’s Research Institute, Melbourne, Australia
Daniel G. MacArthur
European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
Irina M. Armean
Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, 02114, USA
Ryan L. Collins, Jose Florez, Sekar Kathiresan, Steven McCarroll & Jeremiah Scharf
Genomics Platform, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
Kristen M. Connolly
Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
Mark J. Daly, Timothy Poterba, Cotton Seed, Christopher Vittal, Arcturus Wang, Aarno Palotie & Jeremiah Scharf
Broad Genomics, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
Steven Ferriera, Stacey Gabriel & Namrata Gupta
Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
Kaitlin E. Samocha
National Heart & Lung Institute and MRC London Institute of Medical Sciences, Imperial College London, London, W12 0NN, UK
James S. Ware & Nicola Whiffin
Cardiovascular Research Centre, Royal Brompton & Harefield Hospitals NHS Trust, London, SW3 6NP, UK
James S. Ware & Nicola Whiffin
Unidad de Investigacion de Enfermedades Metabolicas. Instituto Nacional de Ciencias Medicas y Nutricion, Mexico City, 14080, Mexico
Carlos A. Aguilar Salinas
Peninsula College of Medicine and Dentistry, Exeter, EX25DW, UK
Tariq Ahmad
Division of Preventive Medicine, Brigham and Women’s Hospital, Boston, MA, 02115, USA
Christine M. Albert & Daniel Chasman
Division of Cardiovascular Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, 02115, USA
Christine M. Albert
Department of Cardiology, University Hospital, 43100, Parma, Italy
Diego Ardissino
Department of Biology, Faculty of Natural Sciences, University of Haifa, Haifa, 3498838, Israel
Gil Atzmon
Departments of Medicine and Genetics, Albert Einstein College of Medicine, Bronx, NY, 10461, USA
Gil Atzmon
Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic, Cleveland, OH, 44122, USA
John Barnard
Sorbonne Université, APHP, Gastroenterology Department, Saint Antoine Hospital, Paris, 75012, France
Laurent Beaugerie & Harry Sokol
NHLBI and Boston University’s Framingham Heart Study, Framingham, MA, 01702, USA
Emelia J. Benjamin & Josée Dupuis
Department of Medicine, Boston University School of Medicine, Boston, MA, 02118, USA
Emelia J. Benjamin
Department of Epidemiology, Boston University School of Public Health, Boston, MA, 02118, USA
Emelia J. Benjamin
Department of Biostatistics and Center for Statistical Genetics, University of Michigan, Ann Arbor, MI, 48109, USA
Michael Boehnke
National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, 20892, USA
Lori L. Bonnycastle
The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
Erwin P. Bottinger, Judy Cho & Ruth J. F. Loos
Department of Biochemistry, Wake Forest School of Medicine, Winston-Salem, NC, 27101, USA
Donald W. Bowden
Center for Genomics and Personalized Medicine Research, Wake Forest School of Medicine, Winston-Salem, NC, 27157, USA
Donald W. Bowden
Center for Diabetes Research, Wake Forest School of Medicine, Winston-Salem, NC, 27101, USA
Donald W. Bowden
Department of Cardiovascular Sciences, University of Leicester, Leicester, LE1 7RH, UK
Matthew J. Bown & Nilesh J. Samani
NIHR Leicester Biomedical Research Centre, Glenfield Hospital, Leicester, LE3 9QP, UK
Matthew J. Bown & Nilesh J. Samani
Department of Epidemiology and Biostatistics, Imperial College London, London, W2 1PG, UK
John C. Chambers
Department of Cardiology, Ealing Hospital NHS Trust, Southall, UB1 3HW, UK
John C. Chambers & Jaspal Kooner
Imperial College Healthcare NHS Trust, Imperial College London, London, W2 1NY, UK
John C. Chambers & Jaspal Kooner
Department of Medicine and Therapeutics, The Chinese University of Hong Kong, Hong Kong, China
Juliana C. Chan & Ronald C. W. Ma
Department of Medicine, Harvard Medical School, Boston, MA, 02115, USA
Daniel Chasman, Bruce Cohen, Jose Florez, Gad Getz, Sekar Kathiresan, James B. Meigs & Dost Ongur
Departments of Cardiovascular Medicine, Cellular and Molecular Medicine, Molecular Cardiology and Quantitative Health Sciences, Cleveland Clinic, Cleveland, OH, 44195, USA
Mina K. Chung
McLean Hospital, Belmont, MA, 02478, USA
Bruce Cohen & Dost Ongur
Department of Medicine, University of Mississippi Medical Center, Jackson, MS, 39216, USA
Adolfo Correa
Department of Epidemiology, Colorado School of Public Health, Aurora, CO, 80045, USA
Dana Dabelea
Department of Medicine and Pharmacology, University of Illinois at Chicago, Chicago, IL, 60612, USA
Dawood Darbar
Department of Genetics, Texas Biomedical Research Institute, San Antonio, TX, 78227, USA
Ravindranath Duggirala
Department of Biostatistics, Boston University School of Public Health, Boston, MA, 02118, USA
Josée Dupuis
Cardiac Arrhythmia Service and Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, 02114, USA
Patrick T. Ellinor
Cardiovascular Epidemiology and Genetics, Hospital del Mar Medical Research Institute (IMIM), Barcelona, 08003, Catalonia, Spain
Roberto Elosua & Steven A. Lubitz
CIBER CV, Barcelona, 08017, Catalonia, Spain
Roberto Elosua
Department of Medicine, Medical School, University of Vic-Central University of Catalonia, Barcelona, 08500, Spain
Roberto Elosua & Jaume Marrugat
Institute for Cardiogenetics, University of Lübeck, Lübeck, 23562, Germany
Jeanette Erdmann
DZHK (German Research Centre for Cardiovascular Research), Partner Site Hamburg/Lübeck/Kiel, 23562, Lübeck, Germany
Jeanette Erdmann
University Heart Center Lübeck, 23562, Lübeck, Germany
Jeanette Erdmann
Estonian Genome Center, Institute of Genomics, University of Tartu, Tartu, 51003, Estonia
Tõnu Esko & Andres Metspalu
Helsinki University and Helsinki University Hospital, Clinic of Gastroenterology, Helsinki, 00100, Finland
Martti Färkkilä
Institute of Clinical Molecular Biology (IKMB), Christian-Albrechts-University of Kiel, Kiel, 24118, Germany
Andre Franke
Bioinformatics Program, MGH Cancer Center and Department of Pathology, Boston, MA, 02129, USA
Gad Getz
Cancer Genome Computational Analysis, Broad Institute, Cambridge, MA, 02142, USA
Gad Getz
Endocrinology and Metabolism Department, Hadassah-Hebrew University Medical Center, Jerusalem, 91120, Israel
Benjamin Glaser
Department of Psychiatry and Behavioral Sciences, SUNY Upstate Medical University, Oneida, NY, 13421, USA
Stephen J. Glatt
Institute for Genomic Medicine, Columbia University Medical Center, Hammer Health Sciences, 1408, 701 West 168th Street, New York, NY, 10032, USA
David Goldstein
Department of Genetics & Development, Columbia University Medical Center, Hammer Health Sciences, 1602, 701 West 168th Street, New York, NY, 10032, USA
David Goldstein & Matthew Harms
Centro de Investigacion en Salud Poblacional. Instituto Nacional de Salud Publica MEXICO, Mexico, 62100, Mexico
Clicerio Gonzalez
Lund University, Lund, SE-221 00, Sweden
Leif Groop
Institute for Molecular Medicine Finland (FIMM), HiLIFE, University of Helsinki, Helsinki, 00014, Finland
Leif Groop, Aarno Palotie, Samuli Ripatti, Tuomi Tiinamaija & Maija Wessman
Lund University Diabetes Centre, Lund, SE-214 28, Sweden
Christopher Haiman & Jaakko Kaprio
Human Genetics Center, University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
Craig Hanis
Department of Neurology, Columbia University, New York, NY, 10032, USA
Matthew Harms
Institute of Biomedicine, University of Eastern Finland, Kuopio, 70210, Finland
Mikko Hiltunen
Department of Psychiatry, PL 320, Helsinki University Central Hospital, Lapinlahdentie, 00 180, Helsinki, Finland
Matti M. Holi
Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, 171 77, Sweden
Christina M. Hultman
Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
Christina M. Hultman
Department of Neurology, Helsinki University Central Hospital, Helsinki, 00290, Finland
Mikko Kallela & Patrick F. Sullivan
Department of Public Health, Faculty of Medicine, University of Helsinki, Helsinki, 00014, Finland
Jaakko Kaprio, Samuli Ripatti & Erkki Vartiainen
Center for Genome Science, Korea National Institute of Health, Chungcheongbuk-do, 363-951, Republic of Korea
Bong-Jo Kim & Young Jin Kim
MRC Centre for Neuropsychiatric Genetics & Genomics, Cardiff University School of Medicine, Hadyn Ellis Building, Maindy Road, Cardiff, CF24 4HQ, UK
George Kirov, Michael C. O’Donovan & Michael J. Owen
National Heart and Lung Institute, Cardiovascular Sciences, Hammersmith Campus, Imperial College London, London, SW3 6LY, UK
Jaspal Kooner
Department of Health, THL-National Institute for Health and Welfare, 00271, Helsinki, Finland
Seppo Koskinen
Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine, New Haven, CT, 06510, USA
Harlan M. Krumholz
Center for Outcomes Research and Evaluation, Yale-New Haven Hospital, New Haven, CT, 06510, USA
Harlan M. Krumholz
Division of Pediatric Gastroenterology, Emory University School of Medicine, Atlanta, GA, 30322, USA
Subra Kugathasan
Department of Internal Medicine, Seoul National University Hospital, Seoul, 03080, Republic of Korea
Soo Heon Kwak & Kyong Soo Park
Institute of Clinical Medicine, The University of Eastern Finland, Kuopio, 70210, Finland
Markku Laakso
Kuopio University Hospital, Kuopio, 70210, Finland
Markku Laakso
Department of Clinical Chemistry, Fimlab Laboratories and Finnish Cardiovascular Research Center-Tampere, Faculty of Medicine and Health Technology, Tampere University, Tampere, 33720, Finland
Terho Lehtimäki & Kari M. Mattila
The Mindich Child Health and Development Institute, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
Ruth J. F. Loos
Li Ka Shing Institute of Health Sciences, The Chinese University of Hong Kong, Hong Kong, China
Ronald C. W. Ma
Hong Kong Institute of Diabetes and Obesity, The Chinese University of Hong Kong, Hong Kong, China
Ronald C. W. Ma
Cardiovascular Research REGICOR Group, Hospital del Mar Medical Research Institute (IMIM), Barcelona, 08003, Catalonia, Spain
Jaume Marrugat
Department of Genetics, Harvard Medical School, Boston, MA, 02115, USA
Steven McCarroll
Oxford Centre for Diabetes, Endocrinology and Metabolism, University of Oxford, Churchill Hospital, Old Road, Headington, Oxford, OX3 7LJ, UK
Mark I. McCarthy
Wellcome Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford, OX3 7BN, UK
Mark I. McCarthy
Oxford NIHR Biomedical Research Centre, Oxford University Hospitals NHS Foundation Trust, John Radcliffe Hospital, Oxford, OX3 9DU, UK
Mark I. McCarthy
F Widjaja Foundation Inflammatory Bowel and Immunobiology Research Institute, Cedars-Sinai Medical Center, Los Angeles, CA, 90048, USA
Dermot McGovern
Atherogenomics Laboratory, University of Ottawa Heart Institute, Ottawa, ON K1Y 4W7, Canada
Ruth McPherson
Division of General Internal Medicine, Massachusetts General Hospital, Boston, MA, 02114, USA
James B. Meigs
Department of Clinical Sciences, University Hospital Malmo Clinical Research Center, Lund University, Malmo, 205 02, Sweden
Olle Melander
Lund University, Department of Clinical Sciences, Skane University Hospital, Malmo, 222 42, Sweden
Peter M. Nilsson
Instituto Nacional de Medicina Genómica (INMEGEN), Mexico City, 14610, Mexico
Lorena Orozco
Medical Research Institute, Ninewells Hospital and Medical School, University of Dundee, Dundee, DD1 9SY, UK
Colin N. A. Palmer
Department of Molecular Medicine and Biopharmaceutical Sciences, Graduate School of Convergence Science and Technology, Seoul National University, Seoul, 08826, Republic of Korea
Kyong Soo Park
Department of Psychiatry, Keck School of Medicine at the University of Southern California, Los Angeles, CA, 90033, USA
Carlos Pato
Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
Ann E. Pulver
Division of Genetics and Epidemiology, Institute of Cancer Research, London, SM2 5NG, UK
Nazneen Rahman
Medical Research Center, Oulu University Hospital, Oulu, Finland and Research Unit of Clinical Neuroscience, Neurology, University of Oulu, Oulu, 90014, Finland
Anne M. Remes
Research Center, Montreal Heart Institute, Montreal, Quebec, H1T 1C8, Canada
John D. Rioux
Department of Medicine, Faculty of Medicine, Université de Montréal, Québec, H3T 1J4, Canada
John D. Rioux
Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, 37212, USA
Dan M. Roden
Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, 37212, USA
Dan M. Roden
Department of Biostatistics and Epidemiology, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA, 19104, USA
Danish Saleheen
Department of Medicine, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA, 19104, USA
Danish Saleheen
Center for Non-Communicable Diseases, Karachi, 75300, Pakistan
Danish Saleheen
National Institute for Health and Welfare, Helsinki, 00271, Finland
Veikko Salomaa & Jaana Suvisaari
Deutsches Herzzentrum München, München, 80636, Germany
Heribert Schunkert
Technische Universität München, München, 80333, Germany
Heribert Schunkert
Division of Cardiovascular Medicine, Nashville VA Medical Center and Vanderbilt University, School of Medicine, Nashville, TN, 37232-8802, USA
Moore B. Shoemaker
Department of Psychiatry, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
Pamela Sklar
Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
Pamela Sklar
Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
Pamela Sklar
Institute of Clinical Medicine, Neurology, University of Eastern Finland, Kuopio, 80101, Finland
Hilkka Soininen
Department of Twin Research and Genetic Epidemiology, King’s College London, London, WC2R 2LS, UK
Tim Spector
Departments of Genetics and Psychiatry, University of North Carolina, Chapel Hill, NC, 27599, USA
Patrick F. Sullivan
Saw Swee Hock School of Public Health, National University of Singapore, National University Health System, Singapore, 117549, Singapore
E. Shyong Tai & Yik Ying Teo
Department of Medicine, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
E. Shyong Tai
Duke-NUS Graduate Medical School, Singapore, 169857, Singapore
E. Shyong Tai
Life Sciences Institute, National University of Singapore, Singapore, 117456, Singapore
Yik Ying Teo
Department of Statistics and Applied Probability, National University of Singapore, Singapore, 117546, Singapore
Yik Ying Teo
Folkhälsan Institute of Genetics, Folkhälsan Research Center, Helsinki, 00250, Finland
Tuomi Tiinamaija & Maija Wessman
HUCH Abdominal Center, Helsinki University Hospital, Helsinki, 00100, Finland
Tuomi Tiinamaija
Center for Behavioral Genomics, Department of Psychiatry, University of California, San Diego, CA, 92093, USA
Ming Tsuang
Institute of Genomic Medicine, University of California, San Diego, CA, 92093, USA
Ming Tsuang & Dan Turner
Juliet Keidan Institute of Pediatric Gastroenterology, Shaare Zedek Medical Center, The Hebrew University of Jerusalem, Jerusalem, 91905, Israel
Dan Turner
Instituto de Investigaciones Biomédicas UNAM, Mexico City, 04510, Mexico
Teresa Tusie-Luna
Instituto Nacional de Ciencias Médicas y Nutrición Salvador Zubirán Mexico City, Mexico City, 14080, Mexico
Teresa Tusie-Luna
Radcliffe Department of Medicine, University of Oxford, Oxford, OX3 9DU, UK
Hugh Watkins
Department of Gastroenterology and Hepatology, University of Groningen and University Medical Center Groningen, Groningen, 9713, The Netherlands
Rinse K. Weersma
Department of Physiology and Biophysics, University of Mississippi Medical Center, Jackson, MS, 39216, USA
James G. Wilson
Program in Infectious Disease and Microbiome, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
Ramnik J. Xavier
Center for Computational and Integrative Biology, Massachusetts General Hospital, Boston, MA, 02114, USA
Ramnik J. Xavier
Department of Psychiatry and Human Behavior, University of California Irvine, Irvine, CA, USA
Marquis P. Vawter

Authors

Qingbo Wang
Emma Pierce-Hoffman
Beryl B. Cummings
Jessica Alföldi
Laurent C. Francioli
Laura D. Gauthier
Andrew J. Hill
Anne H. O’Donnell-Luria
Konrad J. Karczewski
Daniel G. MacArthur

Consortia

Genome Aggregation Database Production Team

Irina M. Armean, Eric Banks, Louis Bergelson, Kristian Cibulskis, Ryan L. Collins, Kristen M. Connolly, Miguel Covarrubias, Mark J. Daly, Stacey Donnelly, Yossi Farjoun, Steven Ferriera, Stacey Gabriel, Jeff Gentry, Namrata Gupta, Thibault Jeandet, Diane Kaplan, Kristen M. Laricchia, Christopher Llanwarne, Eric V. Minikel, Ruchi Munshi, Benjamin M. Neale, Sam Novod, Nikelle Petrillo, Timothy Poterba, David Roazen, Valentin Ruano-Rubio, Andrea Saltzman, Kaitlin E. Samocha, Molly Schleicher, Cotton Seed, Matthew Solomonson, Jose Soto, Grace Tiao, Kathleen Tibbetts, Charlotte Tolonen, Christopher Vittal, Gordon Wade, Arcturus Wang, James S. Ware, Nicholas A. Watts, Ben Weisburd & Nicola Whiffin

Genome Aggregation Database Consortium

Carlos A. Aguilar Salinas, Tariq Ahmad, Christine M. Albert, Diego Ardissino, Gil Atzmon, John Barnard, Laurent Beaugerie, Emelia J. Benjamin, Michael Boehnke, Lori L. Bonnycastle, Erwin P. Bottinger, Donald W. Bowden, Matthew J. Bown, John C. Chambers, Juliana C. Chan, Daniel Chasman, Judy Cho, Mina K. Chung, Bruce Cohen, Adolfo Correa, Dana Dabelea, Dawood Darbar, Ravindranath Duggirala, Josée Dupuis, Patrick T. Ellinor, Roberto Elosua, Jeanette Erdmann, Tõnu Esko, Martti Färkkilä, Jose Florez, Andre Franke, Gad Getz, Benjamin Glaser, Stephen J. Glatt, David Goldstein, Clicerio Gonzalez, Leif Groop, Christopher Haiman, Craig Hanis, Matthew Harms, Mikko Hiltunen, Matti M. Holi, Christina M. Hultman, Mikko Kallela, Jaakko Kaprio, Sekar Kathiresan, Bong-Jo Kim, Young Jin Kim, George Kirov, Jaspal Kooner, Seppo Koskinen, Harlan M. Krumholz, Subra Kugathasan, Soo Heon Kwak, Markku Laakso, Terho Lehtimäki, Ruth J. F. Loos, Steven A. Lubitz, Ronald C. W. Ma, Jaume Marrugat, Kari M. Mattila, Steven McCarroll, Mark I. McCarthy, Dermot McGovern, Ruth McPherson, James B. Meigs, Olle Melander, Andres Metspalu, Peter M. Nilsson, Michael C. O’Donovan, Dost Ongur, Lorena Orozco, Michael J. Owen, Colin N. A. Palmer, Aarno Palotie, Kyong Soo Park, Carlos Pato, Ann E. Pulver, Nazneen Rahman, Anne M. Remes, John D. Rioux, Samuli Ripatti, Dan M. Roden, Danish Saleheen, Veikko Salomaa, Nilesh J. Samani, Jeremiah Scharf, Heribert Schunkert, Moore B. Shoemaker, Pamela Sklar, Hilkka Soininen, Harry Sokol, Tim Spector, Patrick F. Sullivan, Jaana Suvisaari, E. Shyong Tai, Yik Ying Teo, Tuomi Tiinamaija, Ming Tsuang, Dan Turner, Teresa Tusie-Luna, Erkki Vartiainen, Hugh Watkins, Rinse K. Weersma, Maija Wessman, James G. Wilson, Ramnik J. Xavier & Marquis P. Vawter

Contributions

Q.W. conducted the study, performed the analysis, and wrote the paper. E.P.H., A.J.H., and B.B.C. defined the MNV classification and drafted the research. A.O.D.L. provided the data set for rare disease analysis. L.C.F. and L.D.G. generated the trio-based and read-based phasing information. J.A., B.B.C., and K.J.K. reviewed and edited the paper. D.G.M. conceived the project, supervised the overall work, and reviewed and edited the paper. All authors listed under The Genome Aggregation Database Consortium contributed to the generation of the primary data incorporated into the gnomAD resource.

Corresponding author

Correspondence to Daniel G. MacArthur.

Ethics declarations

Competing interests

D.G.M. is a founder with equity in Goldfinch Bio, and has received research support from AbbVie, Astellas, Biogen, BioMarin, Eisai, Merck, Pfizer, and Sanofi-Genzyme. K.J.K. owns stock in Personalis. E.V.M. has received research support in the form of charitable contributions from Charles River Laboratories and Ionis Pharmaceuticals, and has consulted for Deerfield Management. M.I.M.: The views expressed in this article are those of the author(s) and not necessarily those of the NHS, the NIHR, or the Department of Health. He has served on advisory panels for Pfizer, NovoNordisk, Zoe Global; has received honoraria from Merck, Pfizer, NovoNordisk, and Eli Lilly; has stock options in Zoe Global and has received research funding from Abbvie, Astra Zeneca, Boehringer Ingelheim, Eli Lilly, Janssen, Merck, NovoNordisk, Pfizer, Roche, Sanofi Aventis, Servier, and Takeda. As of June 2019, M.I.M. is an employee of Genentech, and holds stock in Roche. R.K.W. has received unrestricted research grants from Takeda Pharmaceutical Company. M.J.D. is a founder of Maze Therapeutics. B.M.N. is a member of the scientific advisory board at Deep Genomics and consultant for Camp4 Therapeutics, Takeda Pharmaceutical, and Biogen. A.O.D.L. has received honoraria from ARUP and Chan Zuckerberg Initiative.

Additional information

Peer review information: Nature Communications thanks Jeffrey Rosenfeld and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Dataset 1

Dataset 2

Dataset 3

Peer Review File

Description of Additional Supplementary Files

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Wang, Q., Pierce-Hoffman, E., Cummings, B.B. et al. Landscape of multi-nucleotide variants in 125,748 human exomes and 15,708 genomes. Nat Commun 11, 2539 (2020). https://doi.org/10.1038/s41467-019-12438-5

Received: 02 April 2019
Accepted: 09 September 2019
Published: 27 May 2020
DOI: https://doi.org/10.1038/s41467-019-12438-5
