Landscape of multi-nucleotide variants in 125,748 human exomes and 15,708 genomes

Abstract

Multi-nucleotide variants (MNVs), defined as two or more nearby variants existing on the same haplotype in an individual, are a clinically and biologically important class of genetic variation. However, existing tools typically do not accurately classify MNVs, and understanding of their mutational origins remains limited. Here, we systematically survey MNVs in 125,748 whole exomes and 15,708 whole genomes from the Genome Aggregation Database (gnomAD). We identify 1,792,248 MNVs across the genome with constituent variants falling within 2 bp distance of one another, including 18,756 variants with a novel combined effect on protein sequence. Finally, we estimate the relative impact of known mutational mechanisms - CpG deamination, replication error by polymerase zeta, and polymerase slippage at repeat junctions - on the generation of MNVs. Our results demonstrate the value of haplotype-aware variant annotation, and refine our understanding of genome-wide mutational mechanisms of MNVs.

Introduction

Multi-nucleotide variants (MNVs) are defined as clusters of two or more nearby variants existing on the same haplotype in an individual1,2 (Fig. 1a). When variants in an MNV are found within the same codon, the overall impact may differ from the functional consequences of the individual variants3. For instance, the two variants depicted in Fig. 1b are each predicted individually to have missense consequences, but in combination result in a nonsense variant. Such cases, which would be missed by virtually all existing tools for clinical variant annotation, can result both in missed diagnoses and false positive pathogenic candidates in analyses of families affected by genetic diseases1,2.

Fig. 1
figure1

Definition and an example of MNVs, and validation of phasing sensitivity. a Definition and an example of an MNV. In this paper, an MNV is defined as two or more nearby variants existing on the same haplotype in the same individual. b Impact of MNVs in coding regions. The amino acid change caused by an MNV can be different from either of the individual single-nucleotide variants, which creates the potential for missannotation of the functional consequence of variants. c Graphical overview of the analysis of phasing sensitivity and specificity using trio samples from our gnomAD callset. We identified all heterozygous variant pairs that pass quality control (see the Methods section) and compared the phase information assigned by read-based phasing with that of trio-based phasing

MNV identification tools4,5,6,7,8 have been applied to databases of human genetic variation at varying scales, including 1000 Genomes9 Phase 3 (2504 individuals with high coverage exome and low coverage genome-sequencing data), and the Exome Aggregation Consortium1 (60,706 individuals with high coverage exome data). Together, these analyses identified over 10,000 MNVs altering protein sequences, demonstrating the pervasive nature of MNV annotation in the population-level data. In addition, analysis of the 1000 Genomes data set highlighted differences in the frequencies of MNVs depending on sequence context10. In combination with yeast experiments11,12,13, biological mechanisms that account for the enrichment of specific types of MNVs, such as DNA replication error by polymerase zeta, have been suggested.

Studies of newly occurring (de novo) MNVs have also been performed using trio data sets2,14,15,16; analysis of 283 trios with whole-genome sequence data16 confirmed that MNV events occur much more frequently than expected by random chance. By focusing on noncoding regions, this study also highlighted potentially different mechanisms that dominate MNV generation depending on the genomic region and the distance between the two constitutive variants. As part of the Deciphering Developmental Disorders (DDD) study17, Kaplanis et al.2 analyzed exome-sequence data from over 6000 trios to quantify the pathogenic impact of MNVs in developmental disorders, showing that such variants are substantially more likely to be deleterious than SNVs and further clarifying the mutational mechanisms that generate them. These analyses also have provided estimates of the germline MNV rate per generation, falling into a consistent range of 1–3% of the SNV rate. Although these studies have provided valuable information about the mutational origins and functional impact of MNVs, to date there has been no analysis that investigated MNVs across the entire genome (including noncoding regions) in many thousands of deeply sequenced individuals, limiting our understanding of the genome-wide profile and complete frequency distribution of this class of variation.

Here, we present the analysis of a large-scale collection of MNVs, along with clinical interpretation of MNVs from over 6000 sequenced individuals from rare disease families. We also provide gene-level statistics on MNVs and describe the distribution of MNVs by functional consequence and by gene-level constraint. Finally, to enhance our understanding of MNV mechanisms, we examine the distributions of MNVs stratified by more than ten different functional annotations across the human genome, as well as estimates of the genome-wide per-base frequencies of the dominant mutational processes generating MNVs.

Results

Read-based phasing for identification of MNVs

Identification of MNVs requires the constituent variants to be properly phased—that is, to be identified accurately as either both occurring on the same haplotype (in cis) or on two different haplotypes (in trans). Phasing can be performed following three broad strategies: read-based phasing18, which assesses whether nearby variants co-segregate on the same reads in DNA sequencing data; family-based phasing19, which assesses whether pairs of variants are co-inherited within families; and population-based phasing20, which leverages haplotype sharing between members of a large genotyped population to make a statistical inference of phase. Read-based phasing is particularly effective for pairs of nearby variants, making it suitable for the analysis of MNVs.

For this project, we generated read-based phasing results for variants in the Genome Aggregation Database (gnomAD) v2.1 callset using GATK HaplotypeCaller21, yielding 125,748 human whole exomes and 15,708 genomes with local phase information; the properties of this callset are described in detail in an accompanying paper22. To assess phasing accuracy, we used 5785 family trios with exome-sequencing data and 635 family trios with whole-genome sequencing data that largely overlapped with the gnomAD 2.1 release data. We calculated the phasing sensitivity, defined as the fraction of heterozygous variant pairs that have read-based phase information assigned for both variants, and found that it was 87.9% for adjacent heterozygous variant pairs, reflecting the stringent haplotype-calling criteria of GATK21 (Supplementary Tables 13). We used Phase-By-Transmission (PBT)19, a family-based phasing method (Fig. 1c), to assess our phasing specificity, and found that over 99.8% of the MNVs identified with read-based phasing were consistent with the PBT trio-based phasing. The sensitivity and specificity of our read-based phasing remained high even when the two variants of the MNV were 10 bp apart (82.8% and 99.8%; Supplementary Fig. 1 and Supplementary Table 1). These results demonstrate high specificity and sensitivity for the detection of MNV events across the genome.

Functional impact of MNVs

In order to provide an overview of the functional impact of MNVs (Fig. 1b), we examined all phased high-quality SNV pairs (i.e., SNV pairs that pass stringent filtering criteria; see the Methods section) within 2 bp distance of each other across the 125,748 exome-sequenced individuals from our gnomAD 2.1 data set, resulting in the discovery of 31,575 MNVs exist within the same codon. When the two variants comprising the MNV were considered together, the resulting functional impact on the protein differed from the independent impacts of the individual variants in ~60% of cases (18,756 MNVs; Fig. 2a; Supplementary Data 1). Among the differing annotations of functional consequence, 407 were gained nonsense (neither individual SNV was a nonsense mutation, but the resulting MNV is), and 1821 were rescued nonsense (at least one of the two individual SNVs would create a nonsense mutation, but the resulting MNV does not). Such categories of MNVs have a major impact on variant interpretation, and thus are critical for accurate variant annotation. There was an average of 55.2 variants with altered functional interpretation (including 0.062 gained and 4.42 rescued nonsense) due to MNVs per individual.

Fig. 2
figure2

Functional impact of MNVs. a The number of MNVs in the gnomAD exome data set per MNV category. Of the 1821 rescued nonsense mutations, 1538 are rescued in all individuals that harbor the original nonsense mutation and are used for the analysis in (b) and (c). Gained and rescued nonsense MNVs were further filtered to HC pLoF in (b) and (c). b The number of gained/rescued nonsense mutations per gene, and examples of disease-associated genes with two or more gained/rescued nonsense mutations. c The fraction of each category of MNV found in a set of 3941 constrained genes (top two deciles of constraint22)

To understand the overall impact of correctly annotating the functional consequence of MNVs in a population-level data set, we counted the number of gained/rescued nonsense mutations per gene in gnomAD (Fig. 2b; Supplementary Data 1). For rescued nonsense mutations, we found 1538 sites that are rescued in all the individuals with the component variants. A total of 1633 genes carried gained or rescued nonsense mutations within our data set, including 41 genes that are disease-relevant (reported by OMIM23 or annotated as haploinsufficient by Clingen24,25). In addition, the proportion of rescued nonsense mutations of falling in predicted loss-of-function (pLoF) constrained genes (genes with a significant depletion of pLoFs compared with an expectation based on a mutational model1,26, defined as LOEUF22 decile <20%) was higher (proportion = 0.219) when compared with all the other classes of MNVs (proportion = 0.192; Fisher’s exact test, p = 0.0247; Fig. 2c; Supplementary Fig. 2). Conversely, gained nonsense mutations are depleted among constrained genes (proportion = 0.0620) compared with all other classes of MNVs (Fisher’s exact test, p = 1.01 × 10−11). These results suggest a significant enrichment of LoF annotation errors in the absence of MNV annotation.

In addition, we have investigated another class of variant pairs whose combined interpretation can be highly different from either of the individual component variants: insertion/deletion (indel) pairs that result in frame restoration (e.g., 4 bp deletion + 7 bp insertion, resulting in 3 bp = 1 amino acid insertion), and have annotated such frame-restoring indel pairs (n = 1406) when separate by up to 30 bp (considering the limitations of read-based phasing; Supplementary Fig. 3). When we compare the LoF confidence of constituent indels, we found that the proportion of frame-restoring indel pairs falling on LoF-constrained genes were significantly higher when the constituent indels are high-confidence (HC) LoFs (proportion = 0.0262 for low-confidence, LC, and 0.167 for HC pairs. Fisher’s exact test, p = 1.66 × 10−7; Supplementary Fig. 3h), suggesting that frame-restoring indel pairs can also be a source of LoF annotation errors.

Finally, in order to understand the impact of these variants in clinical applications, we also annotated MNVs in 6072 sequenced individuals from rare disease families, including 4275 case samples. This resulted in 16 gained nonsense mutations and 110 changed missense MNVs with high CADD27 scores and low frequencies in gnomAD (CADD >20 and <10 individuals in gnomAD; Supplementary Data 2). However, after close manual curation, none of the corresponding MNVs were definitively causal variants for the diseases affecting the family, suggesting that MNVs contribute to only a small fraction of total rare disease diagnoses, in line with expectations based on their relative rarity and previous results2.

Genome-wide mutational mechanisms of MNVs

We next turned our attention to understanding the mutational mechanisms underlying the origins of MNVs genome-wide, focusing on whole-genome sequence data from 15,708 individuals in the gnomAD v2.1 callset. We considered pairs of high-quality variants in autosomes separated by up to 10 bp, resulting in the assembly of a catalogue of 5,513,219 MNVs including 1,792,248 MNVs within 2 bp distance—an order-of-magnitude increase in size over previous collections.

We considered three established major categories of mutational origins of MNVs with constituent SNVs falling next to each other (adjacent MNVs. Figure 3a), each of which is biased toward certain MNV patterns: (1) combinations of distinct single-nucleotide mutation events; (2) replication errors by error-prone polymerase zeta; and (3) polymerase slippage events at repeat junctions. MNVs in the first category are a product of two or more SNVs, which typically occur in different generations and may thus have different allele frequencies. We expect to see an enrichment of CpG transition compared with non-CpG transversion for this class, due to the underlying difference of SNV mutation rate28,29,30. The second category, replication error introduced by DNA polymerase zeta (pol-zeta), is a well known class of replication error that introduces MNVs. Previous studies10,11,12,13,31 have shown that pol-zeta is prone to specific types of replication error, mainly TC- > AA, GC- > AA, and their reverse complements, with experimental evidence that these MNV patterns occur in a single generation; thus, the constituent SNVs will typically have the same allele frequencies. The third category, replication slippage, is another known mode of DNA replication error32,33,34. This process is especially frequent at sites with repetitive sequence context; previous studies35,36,37 have shown that the indel rate can be up to 106 times higher than the SNV mutation rate at these sites. As shown in Fig. 3a, the combination of an insertion and then a deletion of two base pairs can result in an MNV.

Fig. 3
figure3

Mutational origins of MNVs. a Three major categories of the mutational origin of MNVs. (Left) A combination of single-nucleotide mutational events. Since the baseline global mutation rate is highly different between transversions and CpG and non-CpG transitions, even a simple combination of single-nucleotide mutational events could result in a highly skewed distribution of MNVs. (Center) One-step mutation caused by error-prone DNA polymerases. For this class of MNVs, since the two mutations occur at once during DNA replication, the allele frequency of the two constituent SNVs of the MNV is more likely to be equal. (Right) Polymerase slippage at repeat junctions. Mutation rates are highly elevated in repeat regions, and are therefore likely to cause various complex patterns of mutations, occasionally resulting in MNVs. b The log-scaled number of MNVs per substitution pattern. c The fraction of one-step MNVs per substitution pattern. Error bars represent standard error of the mean (often smaller than the dot size). d The fraction of MNVs that are in repetitive contexts, and bits representation63 of sequence contexts. Error bars represent standard error of the mean. Colors in the bars in panels bd represents the predicted major mechanism of MNVs for each substitution pattern

We observed the signature of each of these MNV mechanisms in our data set. First, we calculated the number of MNVs for each MNV pattern (Fig. 3b) and observed that the most frequent MNV pattern is CA- > TG substitutions, which are likely to occur as a combination of an A- > G transition, followed by a high mutation rate C- > T CpG transition (Supplementary Fig. 4a). On the other hand, the least frequent MNV pattern is TA- > GC substitutions, which occur as a combination of two non-CpG transversions. The 273.4-fold difference (270,071 versus 988) of the frequency of MNVs between these two patterns is comparable with the theoretical ratio calculated based on the mutation rate of the component SNVs (475.6-fold), and the overall correlation between the theoretical and observed frequency of each MNV pattern was strong (Pearson correlation r = 0.839 with p = 9.15 × 10−22 in log space; Supplementary Fig. 4b–e).

To investigate the extent of pol-zeta signature, we calculated the number of MNVs in which the gnomAD allele counts of the constitutive single-nucleotide variants are equal (following previous methodology2, also described in the Methods section), and observed that these one-step MNVs are significantly enriched in MNV patterns matching the pol-zeta signature (90.5% for GA- > TT, and 80.5% for GC- > AA, compared with 39.9% overall; Fisher’s exact test, p < 10−100; Fig. 3c).

Finally, in order to capture polymerase slippage events, we calculated the fraction of MNVs in repetitive contexts per MNV pattern (Fig. 3d). For the MNV patterns AA- > TT, >30% of all the MNVs observed were in repetitive contexts. The fractions of the MNV patterns AT- > TA and TA- > AT in repetitive contexts were also high, exceeding 10% (Fisher’s exact test, p < 10−100 compared with the 3.15% across all patterns). For all MNV patterns in repeat contexts, we see a significant excess of MNVs compared with the expected number based on a model that assumes MNVs are simple combination of two SNV events (Supplementary Fig. 4). These observations support the role of replication slippage as one of the major drivers of MNVs. In addition, we did not see a correlation between the frequency of one-step MNVs and the frequency of MNVs in repetitive contexts (Pearson correlation r = 0.0561, p > 0.05; The fraction of one-step MNVs exceeded 80% for AT- > TA and TA- > AT, but was 46% for AA- > TT), suggesting that multiple slippage events leading to MNV generation can take place either as a single event (i.e., in single generation) or multiple events (i.e., in different generation), or even recurrently. These findings come with the caveat that variants in repetitive regions will have higher error rates due to slippage and misalignment errors, but we have reduced this risk by applying random forest filtering for individual sites, as well as removing all the variants in low-complexity regions from our analysis (see the Methods section).

Estimation of global mutation rate of MNVs

In order to compare the frequency of three different mechanisms, we quantified the contribution of two single-nucleotide variation events vs other replication error modes, such as pol-zeta errors or replication slippage, using a simple probabilistic model. Specifically, focusing on adjacent MNVs, we assigned the MNV frequency for each MNV pattern to be the sum of the probability of two SNV events (P) and the probability of other replication error factors (Q), and estimated the Q term. In other words, we estimated the divergence of the observed number of MNV sites from the number expected by a simple SNV mutation model (see the Methods section). The resulting estimated proportion of two SNV events and other replication error events is described in Fig. 4a.

Fig. 4
figure4

Distribution of MNVs across genome. a The number and the fraction of MNVs per origin, per substitution pattern. Gray are the estimated fraction of MNV originating from two single-nucleotide substitution events, brown for polymerase slippage at repeat contexts and purple are the others (presumably mainly replication error by pol-zeta). The colors along the bottom represent the estimated biological origins that dominate MNVs of that specific substitution pattern. b, c MNV density, defined as the number of MNVs per functional annotation divided by the base pair length in the annotation (relative to the whole-genome region), ordered by the methylation level of the functional category. d Estimated fraction of MNVs by different origins, per functional category around the coding region

As expected, the proportion differs substantially from one MNV pattern to another. For example, while 98.0% of CA- > TG MNVs appear to be caused by combinations of simple SNV events, the corresponding proportion is 5.84% for GA- > TT, 18.9% for GC- > AA, and 9.52% for AA- > TT MNVs. We presume that the lower proportion of two simple SNV events is mainly due to pol-zeta errors for GA- > TT and GC- > AA, and polymerase slippage for the AA- > TT. Since 83.2% of the overall MNVs were classified as either SNV combination, repeat context, or pol-zeta error at GA- > TT or GC- > AA, our analysis suggests that these three major categories explain a substantial fraction of MNV events genome wide, although some possible additional mechanisms with smaller frequencies might exist. These calculations also allow us to estimate the genome-wide mutation rate of MNVs caused by pol-zeta: 1.59 × 10−10 per 2 bp per generation for GA- > TT, and 4.08 × 10−10 for GC- > AA. Given that there are ~1.66 × 108 GA pairs and 1.20 × 108 GC pairs in the reference human genome, we estimate there are on average 0.026 GA- > TT and 0.049 GC- > AA mutations per generation (Supplementary Data 3).

We also explored the potential mutational mechanisms for MNVs with a greater distance between the component variants (Supplementary Figs. 57), and observed signatures of non-independence of mutation events extending over distances up to 10 bp, with an enrichment of motifs consistent with pol-zeta and polymerase slippage mechanisms for adjacent MNVs (minimum 1.08, maximum 4.06-fold enrichment of one-step MNV, Fisher’s exact test, p-value < 0.05; Supplementary Figs. 8,9). This confirms the presence of mutational mechanisms capable of creating simultaneous mutations separated by considerable distances16,29,38,39,40, although further work will be required to fully characterize the underlying processes.

Overall, our analysis of MNVs in 15,708 whole-genome-sequenced individuals supports the previously suggested three major mechanism of MNVs and quantifies the different contribution of each mechanism for different MNV patterns at the genome-wide scale.

MNV distribution across different genomic regions

We next examined how MNV pattern distributions differ between functional annotation categories. We used 13 different functional annotations such as coding sequence, enhancer, and promoter from Finucane et al.41, and the DNA methylation annotation from the Encyclopedia of DNA Elements (ENCODE)42, to calculate the number of MNVs that fall into each category (Supplementary Table 4). MNV density, defined as the number of MNVs observed in each functional category divided by the total length of the genomic interval belonging to each category, is shown in Fig. 4b and c. We found that MNV density of the substitution patterns typically involving CpG transitions is positively correlated with the methylation level (linear regression Pearson correlation r = 0.95 for CG- > TA and r = 0.87 for CA- > TG, p < 10−3). Conversely, MNV density for non-CpG transversion-related substitution patterns, and the substitution patterns related to pol-zeta slippage, negatively correlates with methylation status (linear regression Pearson correlation r = −0.90 for GA- > TC, r = −0.91 for AG- > CC, r = −0.91 for GA- > TT, and r = −0.92 for GC- > AA, p <10−5; Fig. 4b, c).

Finally, we explored the effect of genic context on MNV origins and discovery: we selected the seven major regional annotations around gene-coding sequences43,44, and calculated the fraction of MNVs likely explained by different mutational origins in each of these regions (Fig. 4d). Across all regions, we found that the MNV signal is primarily dominated by CpG transitions. The fraction of non-CpG transversions and polymerase slippage at repeats were consistently lower than (or nearly equal to) 5% of the overall signal. Pol-zeta signature was not as dominant as CpG transitions, except for at the transcription start site region, which has by far the lowest methylation rate in those seven annotations, and is thus expected to have a lower rate of CpG deamination mutations (which are dependent on the methylation of the original cytosine).

Overall, our results suggest that MNV density is highly dependent on the CpG methylation status of the surrounding sequence, and that MNVs that originate from non-CpG transversions or polymerase slippage at repeat junctions are relatively uncommon compared with those driven by CpG transitions or pol-zeta errors. Finally, MNVs that originate from pol-zeta error are the most common class of MNVs in the region close to the transcription start sites of genes, as low methylation levels in these regions result in low levels of CpG transitions.

Discussion

We analyzed 125,748 human exomes and 15,708 genomes and identified 1,792,248 MNVs across genome with constituent variants falling within 2 bp distance, including 31,575 that exist within a codon. We have shown that MNVs represent an important class of genetic variation, and that they have a significant impact on the functional interpretation of genomic data, both at the population and individual level. Although we did not encounter an individual in which an MNV is the likely cause of a rare disease after sequencing 6072 individuals from rare disease families, we expect that applying our pipeline to larger numbers of disease samples will identify previously missed diagnoses, as has been observed in another study of developmental delay cases2.

The large number and high quality of variant calls in the gnomAD database provided increased power for statistical analysis of the three major mutational mechanisms (combinations of independent SNVs; replication errors by pol-zeta; and polymerase slippage at repeat junctions) responsible for the generation of MNVs, and importantly allowed us to estimate the relative contribution of each of these processes.

Our estimates of substitution pattern-specific MNV mutation rate and fraction come with important caveats. Our approach assumes that the local SNV mutation rate is invariant across instances of a specific 3 bp context; however, prior work has shown considerable regional variation in mutation rate across the genome, as well as variation driven by ancestry, environment, and other factors45,46,47,48. Another important limitation is the lack of confident estimates of insertion and deletion rate as a function of repeat length, which limits the confidence of our estimate of the fraction of polymerase slippage. Future large genome-scale data sets with more accurate insertion and deletion calls, likely involving long-read sequencing data, will be required to improve modeling of insertion and deletion mutations.

One clear feature of our data set was the signature of non-independence of mutational events separated by up to 10 bp, as suggested in various de novo studies16,29,38,39,40; further investigation of these clustered mutations, and contextualizing them with known sources of genomic instability, such as homologous recombination49 or transposable elements50,51, will be informative in exploring the mechanisms of clustered mutations.

The complete list of MNVs identified in gnomAD is publicly available (https://gnomad.broadinstitute.org/downloads), with the allele count annotated for both genome and exome. For the coding regions, we have also annotated the functional consequence of constituent SNVs and MNVs separately, and made the result viewable in an intuitive browser (https://gnomad.broadinstitute.org). Although some fraction of MNVs is missing from this list due to incomplete phasing sensitivity and read coverage, the database provides the most comprehensive set of estimates of MNV allele frequencies to date, valuable for further analysis of mutational mechanisms as well as the interpretation of MNVs in rare disease and cancer genomics52,53.

Finally, despite the large sample size of our MNV data set, the fraction of MNVs that we have observed out of all the possible MNV configurations is still very far from saturating the space of possible MNVs, with only ~0.005% of all possible adjacent MNVs observed in our data (Supplementary Figs. 10, 11). Increasing the number of sequenced individuals54 in both disease and non-disease cohorts will permit the discovery and determination of the phenotypic impact of an increasingly comprehensive catalogue of variation. This study confirms the importance of incorporating haplotypic phase into these efforts to permit the discovery and accurate interpretation of the full range of human variation.

Methods

Ethics

We have complied with all relevant ethical regulations. This study was overseen by the Broad Institute’s Office of Research Subject Protection and the Partners Human Research Committee, and was given a determination of Not Human Subjects Research. Informed consent was obtained from all participants.

MNV calling

125,748 human exomes and 15,708 genomes from gnomAD 2.1 callset were used for the analyses (Supplementary Tables 5,6). We used hail (https://github.com/hail-is/hail), an open source, cloud-based scalable analysis tool for large genomic data. For MNV discovery, we exhaustively looked for variants that appear in the same individual, in cis, and within 2 bp distance for the exome data set and 10 bp distance for the genome data set, using the hail window_by_locus function (i.e., we computationally checked every pair of genotypes within a certain window size, for every individual, to see whether the individual carries a pair(s) of mutation in the same haplotype. See Supplementary Methods for further detail. Also, we did not expand the window size >10 bp for MNV discovery, as phasing sensitivity significantly drops when the distance between variants is >10 bp, as shown in Supplementary Fig. 1d). For trio-based analyses, we expanded the range to 100 bp to obtain a more macroscopic view. Although we performed MNV calling in sex chromosomes for the coding region, we restricted our analysis to autosomes, in order to control for differences in zygosity.

MNV calling in rare disease samples was performed in a similar fashion as in the gnomAD exome data set. In total, 6072 rare disease whole-exome sequences were curated at the Broad Center for Mendelian Genomics (CMG)55 and went through the MNV calling pipeline with the window size of 2 bp distance. The phenotypes observed in the cohort include: muscle disease such as Limb Girdle Muscular Dystrophy (LGMD; roughly one-third of the total), neurodevelopmental disorders, or severe phenotypes in eye, kidney, cardiac, or other orphan diseases (Supplementary Data 2).

MNV filtering

In the gnomAD MNV analysis, variant pairs for which one or both of their components have low quality reads were filtered out. Specifically, we only selected the variant sites that pass the Random Forest filtering, resulting in acceptance of 53.3% of the initial MNV candidates (Supplementary Fig. 12a). We also filtered out variant sites that are classified as low-complexity regions (LCRs) identified with the symmetric DUST algorithm56 at a score threshold of 30, and additionally applied adjusted threshold criteria (GQ ≥ 20, DP ≥ 10, and allele balance > 0.2 for heterozygote genotypes) for filtering individual variants (Supplementary Table 7). For each MNV site, we annotated the number of alleles that appear as MNV, as well as the number of individuals carrying the MNV as a homozygous variant. The distribution of MNV sites that contain homozygous MNVs is shown in Supplementary Fig. 13. We also collapsed the MNV patterns that are reverse complements of each other, after observing that the number of MNVs are roughly symmetric (before collapsing, the ratio of each MNV pattern to its corresponding reverse complement pattern was mostly close to 1, with 0.95 being the lowest and 1.10 being the highest for adjacent MNVs) (Supplementary Fig. 14). All the MNV patterns in the main text and figures are equivalent to their reverse complement, and we do not distinguish them.

For the rare disease cohort, since our motivation was to find a definite example where an MNV is acting as a causal variant for a rare disease with severe phenotype rather than obtaining the population-level statistics, we did not apply site and sample-specific filtering, as opposed to the gnomAD MNV analysis. Instead of being computationally filtered by read quality, the 129 putative MNVs (16 gained nonsense mutations, 110 changed missense with high CADD score and low gnomAD MNV frequency, and 3 gained missense) went through manual inspection by the analysts at the Center for Mendelian Genomics (CMG) at the Broad Institute55, after annotating the affected gene. Specifically, all the variants were checked manually under the criteria below:

- Whether the gene affected is constrained in the gnomAD population.

- Whether the case has already been solved with other causal variant.

- Whether the MNV looks real in the Interactive Genome Browser (IGV).57

- Whether the MNV is in the proband and, if applicable, the segregation pattern of the MNV

- Whether the known function of the gene affected matches the patient phenotype.

MNVs were filtered out if they failed one or more of the criteria above. These results suggest that MNVs explain only a small fraction of undiagnosed genetic disease cases, consistent with their overall frequency as a class of variation, and with prior work in large disease-affected cohorts2. The summary for MNV analysis in rare disease cohort is also available at Supplementary Data 2.

Analysis of phasing sensitivity

In order to compare the phasing information derived from different methods (read-based and trio-based), we took an approach of comparing the relative phase (binary classification of whether two SNVs of MNV are in the same haplotype or not), as shown in Supplementary Table 8. We investigated the heterozygous variant pairs whose phasing information is not provided by the trio-based phasing and observed that majority (83.5%) of the cases reflected both parents carrying a heterozygous variant, a scenario where trio-based phasing is inherently uninformative. We also investigated the heterozygous variant pairs whose phasing information is not provided by the read-based phasing. Specifically, unphased pairs tend to have either low- or high-read depth (odds ratio = 3.20, Fisher’s exact test, p < 10−100 for low, and odds ratio = 2.33, Fisher’s exact test, p < 10−100 for high-read depth; Supplementary Table 3), consistent with our previous understanding that an excess of reads can lead to involvement of erroneous reads and thus reduce the confidence of phasing of HaplotypeCaller58 (as well as the lack of the number of reads reduces the calling rate). All the statistical tests are two-sided, throughout the paper.

Analysis of functional impact in coding region

We focused on the coding region of the canonical transcript of genes and examined the codon change and their consequence for all the MNVs that fall in a single codon (see Supplementary Tables 9,10 for the number of MNVs that spans across two codons). When comparing with population-level constraint, for each MNV, we annotated the constraint metric (LOEUF22) of the gene whose protein product is affected. For rescued nonsense mutations, we took only the ones are rescued in all the individuals with the component variants (i.e., we excluded the ones whose allele count of MNVs are not equal to the allele count of the SNV that introduces a nonsense mutation), resulting in 1538 out of 1821 rescued nonsense mutations. We next used Loss-Of-Function Transcript Effect Estimator (LOFTEE22) in order to exclude the nonsense mutations that are not likely to affect the protein function. This resulted in 371 high-confidence (HC) gained nonsense mutations and 1400 HC rescued nonsense mutations, which were used for the population-level constraint analysis. In addition, we stratified the gene sets by core essential/nonessential genes from CRISPR/Cas knockout experiments59,60 as an orthogonal indicator of gene constraint (Supplementary Fig. 2).

We did not include and correct for MNVs consisting of three SNVs in a single codon in the analysis of functional impact in coding region, since the number and frequency of such MNVs are significantly low (228 in total, with 5 newly gained nonsense, but no re-rescued or re-gained nonsense; 0.220 in total per person). The full list of such MNVs are available as a separate file at: https://gnomad.broadinstitute.org/downloads.

Frame-restoring indel analysis was performed in a similar fashion. We used the gnomAD exome data set to call and filter the insertion/deletion pairs using the same filtering criteria (except for the fact that we did not restrict our analysis to cases where the frameshift effect would be rescued in all individuals), and focused on the canonical transcripts for the functional impact evaluation.

Defining one-step MNVs and MNVs in repetitive contexts

A one-step MNV was defined as a MNV for which the allele count of both SNVs that make up the MNV is the same and close to the allele count of the MNV itself. We also compared the allele count of constituent SNVs (AC1 and AC2) with the allele count of the corresponding MNV (AC_mnv), and observed that the majority of one-step MNVs we discovered have AC_mnv divided by AC1 >0.9 (Supplementary Fig. 15). Therefore, we expect the false discovery rate of one-step MNVs (misclassifying the MNV whose AC1 and AC2 are equal just by chance) to be limited. The full distribution of all the allele counts, including per-population characterizations, are shown in Supplementary Fig. 16 and Supplementary Table 11.

Repetitive sequences are defined by taking the ±4 bp context of the MNV and setting the threshold manually, by looking at the distribution of repeat contexts around all the MNVs (Supplementary Figs. 17, 18). Specifically, a sequence is defined as repetitive if the number of dinucleotide repeat units > 1, for both reference and alternative ±4 bp context, and the number of dinucleotide repeat units > 2, for either reference or alternative ±4 bp context, and, for adjacent MNVs only, if the reference and/or alternative 2 bp are mononucleotide repeat, increase the threshold by one mononucleotide repeat unit.

Here, dinucleotide repeat unit is defined as the reference or the alternative allele itself (with the gap when d > 1 and counting the overlap. For example, the reference and alternative dinucleotide repeat counts for TATATAT - > TAAAAAT are both 3). The third criteria was added specifically for adjacent MNVs to adjust for counting the overlap more than once. This threshold was set so that the number of MNVs with equal or higher repeats would be <5% of the total, corresponding to two standard deviations away from the mean, and also because the estimated mutation rate in these repetitive contexts is likely to be orders of magnitude higher than the background MNV mutation rate originating from the combination of two SNV events35,36,37.

Calculating the proportion of MNVs per biological origin

We calculated the proportion of MNV per biological origin by comparing the observed number of MNVs (that are not in repetitive contexts) with the expected number of MNV under single-nucleotide mutational model.

Specifically, if we simply hypothesize most of the MNV are combination of two single-nucleotide substitution events, we can estimate the relative probability of MNV event per substitution pattern. For example, probability of observing a CA to TG MNV in a single individual, single site (p(CA → TG)) is proportional to p(CA → TA) p(TA → TG) + p(CA → CG) p(CG → TG), and probability of TA to GC MNV (p(TA → GC)) is proportional to \(p({\mathrm{TA}} \to {\mathrm{GA}}) \cdot p({\mathrm{GA}} \to {\mathrm{GC}}) \, + p({\mathrm{TA}} \to {\mathrm{TC}}) \cdot p({\mathrm{TC}} \to {\mathrm{GC}})\). Former equation involves the product of transition at CpG, while both term of the latter are product of transversion at non-CpG, which works as a reasonable explanation of the frequency difference of those two MNV patterns.

Using the same principle (and accounting for reference base pair frequency, population number and global SNV mutation rate defined by 3 bp context26, we first constructed a null model of MNV distribution. In reality, this null model does not represent the real distribution we observe, due to biological mechanisms that introduce MNV. Therefore, we allowed additional factor q, that denotes the mutational event where two SNVs are introduced at the same time. For the example of \(p({\mathrm{CA}} \to {\mathrm{TG}})\), we model this probability to be proportional to \({l}p({\mathrm{TA}} \to {\mathrm{GA}}) \cdot p({\mathrm{GA}} \to {\mathrm{GC}}) + p({\mathrm{TA}} \to {\mathrm{TC}}) \cdot p({\mathrm{TC}} \to {\mathrm{GC}}) + q({\mathrm{CA}} \to {\mathrm{TG}})\), and try to estimate the q term, which corresponds to the proportion of MNVs that are explained by non-SNV (and non-repeat) factor. Further details are explained in the Supplementary Methods (section “Models and assumptions for calculating the proportion of MNV per biological mechanism”).

In addition, for each of MNV pattern, we annotated the predicted major mechanism for each MNV pattern in the following order:

1. Pol-zeta, for the patterns known as polymerase signature (GA- > TT and GC- > AA)

2. Repeat, for the patterns whose fraction of MNVs in repeat contexts are >10% (corresponding to two standard deviations away from the mean; AA- > TT, AT- > TA, and TA- > AT)

3. One of Ti at CpG, Ti, Ti at CpG + Tv, Ti + Tv, Tv combination, based on possible combinations of single-nucleotide mutational processes. For example, Ti at CpG is when transition in CpG combined with another transition can occur in the mutational processes (Supplementary Data 3).

Estimation of the global MNV rate per substitution pattern

In order to estimate the global MNV mutation rate for adjacent MNVs, as well as the mutation rate per MNV pattern, we first focused the number of one-step MNVs, assuming that there are no recurrent mutations and therefore the allele frequency of constituent SNVs are equal if and only if it originates from an MNV event in a single generation. In this section, we will simply write one-step MNV of distance 1 bp (i.e., adjacent) as MNV.

We then calculated the global MNV mutation rate under the Watterson estimator model, as in Kaplanis et al.2. Specifically, we divided the number of MNV sites by the number of SNV sites in our gnomAD data set, and scaled by the global single-nucleotide mutation rate identified in previous research (1.2 × 10−8), which yielded 2.94 × 10−11 per 2 bp per generation. This is roughly two-thirds of the estimation provided by the Kaplanis et al.2 using trio data, slightly smaller presumably due to differing filtering method. Next, In order to get the mutation rate per 2 bp for each of the MNV patterns, we simply scaled the global MNV mutation rate described above by the number of reference 2 bp and the coverage difference. The full data for all the 78 patterns are shown in Supplementary Data 3. Further details are explained in the Supplementary Methods (section “Models and assumptions for estimation of the global MNV rate per substitution pattern”).

Functional enrichment

Thirteen functional annotations were collected from Finucane et al.41 as a bed file (which originates from database, such as ENCODE, Roadmap61 and UCSC genome browser62.) For the methylation data, we collected the genome methylation level from ENCODE, and calculated the fraction of methylated CpG out of all the CpGs in the region, and ordered by the fraction (Supplementary Table 4).

MNV density calculation was performed under the null hypothesis that the number of MNV of type WX→YZ we observe in an arbitrary genomic interval is proportional to the number of WX in the interval. Specifically, the MNV density of WX→YZ in interval I is defined as

\(D({\mathrm{WX}} \to {\mathrm{YZ}}|I) = \frac{{N({\mathrm{WX}} \to {\mathrm{YZ}}|I)}}{{N({\mathrm{WX}}|I)}}\), where N(WX→YZ|I) is the number of MNVs of WX→YZ, and N(WX|I) is the number of WX in the reference genome we observe in that specific genomic interval. We then normalized the density by dividing by D(WX→YZ|I = whole genome) for scaling purpose (i.e., D(WX→YZ|I) = k means that the probability of observing a mutation of WX→YZ given a sequence context of WX is k times higher in genomic functional category I than the overall genome.)

For estimating the fraction of MNVs per origin, we took a thresholding approach and defined four MNVs (CA- > TG, AC- > GT, CC- > TT, and GA- > AG) as CpG signal, two (GC- > AA, GA- > TT) as pol-zeta, three as repeat (AA- > TT, TA- > AT, AT- > TA) and six transversion (TA- > GC, CG- > AT, AT- > CG, CG- > GC, GC- > CG, CG- > AC) signal (and left all the other 78-(4 + 2 + 3 + 6) = 63 patterns as others, in order to highlight the strongest signals) based on the result from Fig. 3. The fraction of MNVs per origin is then defined simply as the number of MNVs that fall into that pattern divided by all the MNVs, in the genomic interval. The coverage difference per interval was as small as negligible (Supplementary Table 4).

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

The list of coding MNVs in gnomAD exome are available at gs://gnomad-public/release/2.1/mnv/gnomad_mnv_coding.tsv (tab separated file). The coding MNVs consisting of three SNVs in a single codon is available as a separate file at gs://gnomad-public/release/2.1/mnv/gnomad_mnv_coding_3bp.tsv. The list of frame-restoring indel pairs are available at gs://gnomad-public/release/2.1/mnv/frame_restoring_indels.tsv. The list of all the MNVs in gnomAD genomes are available at gs://gnomad-public/release/2.1/mnv/genome/gnomad_mnv_genome_d{i}.tsv.bgz (tab separated file, compressed. Replace {i} (0 < i < 11) with the distance between two SNVs of MNV.), or gs://gnomad-public/release/2.1/mnv/genome/gnomad_mnv_genome_d{i}.ht (hail table. Replace {i} (0 < i < 11) with the distance between two SNVs of MNV.). Explanations for each column in each file can be found at gs://gnomad-public/release/2.1/mnv/mnv_readme.md. All the files above are also available at the download page of the gnomAD browser (https://gnomad.broadinstitute.org/downloads).

Code availability

The code used in the study is available at https://github.com/macarthur-lab/gnomad_mnv.

References

  1. 1.

    Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  2. 2.

    Kaplanis, J. et al. Exome-wide assessment of the functional impact and pathogenicity of multinucleotide mutations. Genome Res. gr.239756.118 (2019).

  3. 3.

    Rosenfeld, J. A., Malhotra, A. K. & Lencz, T. Novel multi-nucleotide polymorphisms in the human genome characterized by whole genome and exome sequencing. Nucleic Acids Res. 38, 6102–6111 (2010).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  4. 4.

    Wei, L. et al. MAC: identifying and correcting annotation for multi-nucleotide variations. BMC Genomics 16, 569 (2015).

  5. 5.

    Lai, Z. et al. VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research. Nucleic Acids Res. 44, e108 (2016).

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  6. 6.

    Cheng, S.-J. et al. Accurately annotate compound effects of genetic variants using a context-sensitive framework. Nucleic Acids Res. 45, e82 (2017).

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  7. 7.

    Danecek, P. & McCarthy, S. A. BCFtools/csq: haplotype-aware variant consequences. Bioinformatics 33, 2037–2039 (2017).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  8. 8.

    Khan, W. et al. MACARON: a python framework to identify and re-annotate multi-base affected codons in whole genome/exome sequence data. Bioinformatics 34, 3396–3398 (2018).

    CAS  PubMed  Article  Google Scholar 

  9. 9.

    Consortium, T. 1000 G. P. A global reference for human genetic variation. Nature 526, 68 (2015).

    ADS  Article  CAS  Google Scholar 

  10. 10.

    Harris, K. & Nielsen, R. Error-prone polymerase activity causes multinucleotide mutations in humans. Genome Res. 24, 1445–1454 (2014).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  11. 11.

    Zhong, X. et al. The fidelity of DNA synthesis by yeast DNA polymerase zeta alone and with accessory proteins. Nucleic Acids Res. 34, 4731–4742 (2006).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  12. 12.

    Sakamoto, A. N. et al. Mutator alleles of yeast DNA polymerase ζ. DNA Repair 6, 1829–1838 (2007).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  13. 13.

    Stone, J. E., Lujan, S. A. & Kunkel, T. A. DNA polymerase zeta generates clustered mutations during bypass of endogenous DNA lesions in Saccharomyces cerevisiae. Environ. Mol. Mutagenesis 53, 777–786 (2012).

    CAS  Article  Google Scholar 

  14. 14.

    Chen, J.-M., Férec, C. & Cooper, D. N. Closely spaced multiple mutations as potential signatures of transient hypermutability in human genes. Hum. Mutat. 30, 1435–1448 (2009).

    CAS  PubMed  Article  Google Scholar 

  15. 15.

    Schrider, D. R., Hourmozdi, J. N. & Hahn, M. W. Pervasive multinucleotide mutational events in eukaryotes. Curr. Biol. 21, 1051–1054 (2011).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  16. 16.

    Besenbacher, S. et al. Multi-nucleotide de novo mutations in humans. PLOS Genet. 12, e1006315 (2016).

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  17. 17.

    The Deciphering Developmental Disorders Study et al. Large-scale discovery of novel genetic causes of developmental disorders. Nature 519, 223–228 (2015).

    ADS  Article  CAS  Google Scholar 

  18. 18.

    Garrison, E. & Marth, G. Haplotype-based variant detection from short-read sequencing. arXiv preprint: arXiv:1207.3907 [q-bio] (2012).

  19. 19.

    Francioli, L. C. et al. A framework for the detection of de novo mutations in family-based sequencing data. Eur. J. Hum. Genet. 25, 227–233 (2017).

    CAS  PubMed  Article  Google Scholar 

  20. 20.

    Choi, Y., Chan, A. P., Kirkness, E., Telenti, A. & Schork, N. J. Comparison of phasing strategies for whole human genomes. PLoS Genet. 14, e1007308 (2018).

  21. 21.

    Poplin, R. et al. Scaling accurate genetic variant discovery to tens of thousands of samples. Preprint at: https://doi.org/10.1101/201178v3 (2018).

  22. 22.

    Karczewski, K. J. et al. Variation across 141,456 human exomes and genomes reveals the spectrum of loss-of-function intolerance across human protein-coding genes. Preprint at: https://doi.org/10.1101/531210v3 (2019).

  23. 23.

    Hamosh, A., Scott, A. F., Amberger, J. S., Bocchini, C. A. & McKusick, V. A. Online Mendelian inheritance in man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 33, D514–D517 (2005).

    CAS  PubMed  Article  Google Scholar 

  24. 24.

    Landrum, M. J. et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 42, D980–D985 (2014).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  25. 25.

    Rehm, H. L. et al. ClinGen–the clinical genome resource. N. Engl. J. Med. 372, 2235–2242 (2015).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  26. 26.

    Samocha, K. E. et al. A framework for the interpretation of de novo mutation in human disease. Nat. Genet. 46, 944–950 (2014).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  27. 27.

    Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  28. 28.

    Nachman, M. W. & Crowell, S. L. Estimate of the mutation rate per nucleotide in humans. Genetics 156, 297–304 (2000).

    CAS  PubMed  PubMed Central  Google Scholar 

  29. 29.

    Francioli, L. C. et al. Genome-wide patterns and properties of de novo mutations in humans. Nat. Genet 47, 822–826 (2015).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  30. 30.

    Xue, Y. et al. Human Y chromosome base-substitution mutation rate measured by direct sequencing in a deep-rooting pedigree. Curr. Biol. 19, 1453–1457 (2009).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  31. 31.

    Northam, M. R. et al. DNA polymerases ζ and Rev1 mediate error-prone bypass of non-B DNA structures. Nucleic Acids Res. 42, 290–306 (2014).

    ADS  CAS  PubMed  Article  Google Scholar 

  32. 32.

    Montgomery, S. B. et al. The origin, evolution, and functional impact of short insertion–deletion variants identified in 179 human genomes. Genome Res. 23, 749–761 (2013).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  33. 33.

    Bacolla, A. et al. Local DNA dynamics shape mutational patterns of mononucleotide repeats in human genomes. Nucleic Acids Res. 43, 5065–5080 (2015).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  34. 34.

    Ananda, G. et al. Microsatellite interruptions stabilize primate genomes and exist as population-specific single nucleotide polymorphisms within individual human genomes. PLOS Genet. 10, e1004498 (2014).

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  35. 35.

    Leclercq, S., Rivals, E. & Jarne, P. DNA slippage occurs at microsatellite loci without minimal threshold length in humans: a comparative genomic approach. Genome Biol. Evol. 2, 325–335 (2010).

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  36. 36.

    Lai, Y. & Sun, F. The relationship between microsatellite slippage mutation rate and the number of repeat units. Mol. Biol. Evol. 20, 2123–2131 (2003).

    CAS  PubMed  Article  Google Scholar 

  37. 37.

    Pumpernik, D., Oblak, B. & Borštnik, B. Replication slippage versus point mutation rates in short tandem repeats of the human genome. Mol. Genet. Genomics 279, 53–61 (2008).

    CAS  PubMed  Article  Google Scholar 

  38. 38.

    Chan, K. & Gordenin, D. A. Clusters of multiple mutations: incidence and molecular mechanisms. Annu Rev. Genet 49, 243–267 (2015).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  39. 39.

    Supek, F. & Lehner, B. Clustered mutation signatures reveal that error-prone DNA repair targets mutations to active genes. Cell 170, 534–547 (2017). e23.

    CAS  PubMed  Article  Google Scholar 

  40. 40.

    Michaelson, J. J. et al. Whole-genome sequencing in autism identifies hot spots for de novo germline mutation. Cell 151, 1431–1442 (2012).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  41. 41.

    Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  42. 42.

    Consortium, T. E. P. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science 306, 636–640 (2004).

    ADS  Article  CAS  Google Scholar 

  43. 43.

    Maston, G. A., Evans, S. K. & Green, M. R. Transcriptional Regulatory Elements in the Human Genome. Annu. Rev. Genom. Hum. Genet. 7, 29–59 (2006).

    CAS  Article  Google Scholar 

  44. 44.

    Kulaeva, O. I., Nizovtseva, E. V., Polikanov, Y. S., Ulianov, S. V. & Studitsky, V. M. Distant Activation of transcription: mechanisms of enhancer action. Mol. Cell. Biol. 32, 4892–4897 (2012).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  45. 45.

    Aggarwala, V. & Voight, B. F. An expanded sequence context model broadly explains variability in polymorphism levels across the human genome. Nat. Genet. 48, 349–355 (2016).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  46. 46.

    Duret, L. Mutation patterns in the human genome: more variable than expected. PLOS Biol. 7, e1000028 (2009).

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  47. 47.

    Ségurel, L., Wyman, M. J. & Przeworski, M. Determinants of mutation rate variation in the human germline. Annu. Rev. Genomics Hum. Genet. 15, 47–70 (2014).

    PubMed  Article  CAS  Google Scholar 

  48. 48.

    Harris, K. Evidence for recent, population-specific evolution of the human mutation rate. PNAS 112, 3439–3444 (2015).

    ADS  CAS  PubMed  Article  Google Scholar 

  49. 49.

    Guirouilh-Barbat, J., Lambert, S., Bertrand, P. & Lopez, B. S. Is homologous recombination really an error-free process? Front. Genet. 5, 175 (2014).

  50. 50.

    Smit, A. F. Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr. Opin. Genetics Dev. 9, 657–663 (1999).

    CAS  Article  Google Scholar 

  51. 51.

    Wicker, T. et al. A unified classification system for eukaryotic transposable elements. Nat. Rev. Genet. 8, 973–982 (2007).

    CAS  PubMed  Article  Google Scholar 

  52. 52.

    Roberts, S. A. et al. An APOBEC cytidine deaminase mutagenesis pattern is widespread in human cancers. Nat. Genet. 45, 970–976 (2013).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  53. 53.

    Alexandrov, L. B. et al. Signatures of mutational processes in human cancer. Nature 500, 415–421 (2013).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  54. 54.

    Stark, Z. et al. Integrating genomics into healthcare: a global responsibility. Am. J. Hum. Genet. 104, 13–20 (2019).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  55. 55.

    Centers for Mendelian Genomics, Bamshad, M. J. The genetic basis of Mendelian phenotypes: discoveries, challenges, and opportunities. Am. J. Hum. Genet. 97, 199–215 (2015).

    PubMed Central  Article  CAS  Google Scholar 

  56. 56.

    Morgulis, A., Gertz, E. M., Schäffer, A. A. & Agarwala, R. A fast and symmetric DUST implementation to mask low-complexity DNA sequences. J. Comput. Biol. 13, 1028–1040 (2006).

    MathSciNet  CAS  PubMed  Article  Google Scholar 

  57. 57.

    Robinson, J. T. et al. Integrative genomics viewer. Nat. Biotechnol. 29, 24–26 (2011).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  58. 58.

    Li, H. Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics 30, 2843–2851 (2014).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  59. 59.

    Lenoir, W. F., Lim, T. L. & Hart, T. PICKLES: the database of pooled in-vitro CRISPR knockout library essentiality screens. Nucleic Acids Res. 46, D776–D780 (2018).

    CAS  PubMed  Article  Google Scholar 

  60. 60.

    Hart, T. et al. High-resolution CRISPR screens reveal fitness genes and genotype-specific cancer liabilities. Cell 163, 1515–1526 (2015).

    CAS  Article  Google Scholar 

  61. 61.

    Roadmap Epigenomics Consortium et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).

    PubMed Central  Article  CAS  Google Scholar 

  62. 62.

    Kent, W. J. et al. The human genome browser at UCSC. Genome Res. 12, 996–1006 (2002).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  63. 63.

    Wagih, O. ggseqlogo: a versatile R package for drawing sequence logos. Bioinformatics 33, 3645–3647 (2017).

    CAS  PubMed  Article  Google Scholar 

Download references

Acknowledgements

We would like to thank the many individuals whose sequence data are aggregated in gnomAD for their contributions to research, and for making this work possible. The results published here are in part based upon data: (1) generated by The Cancer Genome Atlas managed by the NCI and NHGRI (accession: phs000178.v10.p8). Information about TCGA can be found at http://cancergenome.nih.gov, (2) generated by the Genotype-Tissue Expression Project (GTEx) managed by the NIH Common Fund and NHGRI (accession: phs000424.v7.p2), (3) generated by the Exome Sequencing Project, managed by NHLBI, (4) generated by the Alzheimer’s Disease Sequencing Project (ADSP), managed by the NIA and NHGRI (accession: phs000572.v7.p4). We would like to thank the Hail team for developing tools essential for the large-scale computation in this work. We would like to thank the analysis team of the Broad’s Rare Disease Group for their manual inspection of MNVs in rare disease cohorts. This work was funded by NIDDK U54 DK105566, NIGMS R01 GM104371, and NHGRI UM1 HG008900-01. Q.W. was supported by the Nakajima Foundation Scholarship. K.J.K. was supported by NIGMS F32 GM115208. A.O.D.L. was supported by NICHD K12 HD052896.

Author information

Affiliations

Authors

Consortia

Contributions

Q.W. conducted the study, performed the analysis, and wrote the paper. E.P.H., A.J.H., and B.B.C. defined the MNV classification and drafted the research. A.O.D.L. provided the data set for rare disease analysis. L.C.F. and L.D.G. generated the trio-based and read-based phasing information. J.A., B.B.C., and K.J.K. reviewed and edited the paper. D.G.M. conceived the project, supervised the overall work, reviewed and edited the paper.

Corresponding author

Correspondence to Daniel G. MacArthur.

Ethics declarations

Competing interests

D.G.M. is a founder with equity in Goldfinch Bio, and has received research support from AbbVie, Astellas, Biogen, BioMarin, Eisai, Merck, Pfizer, and Sanofi-Genzyme. K.J.K. owns stock in Personalis. E.V.M. has received research support in the form of charitable contributions from Charles River Laboratories and Ionis Pharmaceuticals, and has consulted for Deerfield Management. M.I.M.: The views expressed in this article are those of the author(s) and not necessarily those of the NHS, the NIHR, or the Department of Health. He has served on advisory panels for Pfizer, NovoNordisk, Zoe Global; has received honoraria from Merck, Pfizer, NovoNordisk, and Eli Lilly; has stock options in Zoe Global and has received research funding from Abbvie, Astra Zeneca, Boehringer Ingelheim, Eli Lilly, Janssen, Merck, NovoNordisk, Pfizer, Roche, Sanofi Aventis, Servier, and Takeda. As of June 2019, M.I.M. is an employee of Genentech, and holds stock in Roche. R.K.W. has received unrestricted research grants from Takeda Pharmaceutical Company. M.J.D. is a founder of Maze Therapeutics. B.M.N. is a member of the scientific advisory board at Deep Genomics and consultant for Camp4 Therapeutics, Takeda Pharmaceutical, and Biogen. A.O.D.L. has received honoraria from ARUP and Chan Zuckerberg Initiative.

Additional information

Peer review information Nature Communications thanks Jeffrey Rosenfeld and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Wang, Q., Pierce-Hoffman, E., Cummings, B.B. et al. Landscape of multi-nucleotide variants in 125,748 human exomes and 15,708 genomes. Nat Commun 11, 2539 (2020). https://doi.org/10.1038/s41467-019-12438-5

Download citation

Further reading

  • Exploring human genomic diversity with gnomAD

    • Linda Koch

    Nature Reviews Genetics (2020)

  • The mutational constraint spectrum quantified from variation in 141,456 humans

    • Konrad J. Karczewski
    • , Laurent C. Francioli
    • , Grace Tiao
    • , Beryl B. Cummings
    • , Jessica Alföldi
    • , Qingbo Wang
    • , Ryan L. Collins
    • , Kristen M. Laricchia
    • , Andrea Ganna
    • , Daniel P. Birnbaum
    • , Laura D. Gauthier
    • , Harrison Brand
    • , Matthew Solomonson
    • , Nicholas A. Watts
    • , Daniel Rhodes
    • , Moriel Singer-Berk
    • , Eleina M. England
    • , Eleanor G. Seaby
    • , Jack A. Kosmicki
    • , Raymond K. Walters
    • , Katherine Tashman
    • , Yossi Farjoun
    • , Eric Banks
    • , Timothy Poterba
    • , Arcturus Wang
    • , Cotton Seed
    • , Nicola Whiffin
    • , Jessica X. Chong
    • , Kaitlin E. Samocha
    • , Emma Pierce-Hoffman
    • , Zachary Zappala
    • , Anne H. O’Donnell-Luria
    • , Eric Vallabh Minikel
    • , Ben Weisburd
    • , Monkol Lek
    • , James S. Ware
    • , Christopher Vittal
    • , Irina M. Armean
    • , Louis Bergelson
    • , Kristian Cibulskis
    • , Kristen M. Connolly
    • , Miguel Covarrubias
    • , Stacey Donnelly
    • , Steven Ferriera
    • , Stacey Gabriel
    • , Jeff Gentry
    • , Namrata Gupta
    • , Thibault Jeandet
    • , Diane Kaplan
    • , Christopher Llanwarne
    • , Ruchi Munshi
    • , Sam Novod
    • , Nikelle Petrillo
    • , David Roazen
    • , Valentin Ruano-Rubio
    • , Andrea Saltzman
    • , Molly Schleicher
    • , Jose Soto
    • , Kathleen Tibbetts
    • , Charlotte Tolonen
    • , Gordon Wade
    • , Michael E. Talkowski
    • , Benjamin M. Neale
    • , Mark J. Daly
    •  & Daniel G. MacArthur

    Nature (2020)

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.