Landscape of multi-nucleotide variants in 125,748 human exomes and 15,708 genomes

Multi-nucleotide variants (MNVs), defined as two or more nearby variants existing on the same haplotype in an individual, are a clinically and biologically important class of genetic variation. However, existing tools for variant interpretation typically do not accurately classify MNVs, and understanding of their mutational origins remains limited. Here, we systematically survey MNVs in 125,748 whole exomes and 15,708 whole genomes from the Genome Aggregation Database (gnomAD). We identify 1,996,125 MNVs across the genome with constituent variants falling within 2 bp distance of one another, of which 31,510 exist within the same codon, including 405 predicted to result in gain of a nonsense mutation, 1,818 predicted to rescue a nonsense mutation event that would otherwise be caused by one of the constituent variants, and 16,481 additional variants predicted to alter protein sequences. We show that the distribution of MNVs is highly non-uniform across the genome, and that this non-uniformity can be largely explained by a variety of known mutational mechanisms, such as CpG deamination, replication error by polymerase zeta, or polymerase slippage at repeat junctions. We also provide an estimate of the dinucleotide mutation rate caused by polymerase zeta. Finally, we show that differential CpG methylation drives MNV differences across functional categories. Our results demonstrate the importance of incorporating haplotype-aware annotation for accurate functional interpretation of genetic variation, and refine our understanding of genome-wide mutational mechanisms of MNVs.


Introduction
Multi-nucleotide variants (MNVs) are defined as clusters of two or more nearby variants existing on the same haplotype in an individual 1,2 (Fig. 1a). When variants in an MNV are found within the same codon, the overall impact may differ from the functional consequences of the individual variants 3 . For instance, the two variants depicted in Fig. 1b  MNV identification tools [4][5][6][7][8] have been applied to databases of human genetic variation at varying scales, including 1000 Genomes 9 Phase 3 (2,504 individuals with high coverage exome and low coverage genome sequencing data), and the Exome Aggregation Consortium 1 (60,706 individuals with high coverage exome data). Together, these analyses identified over 10,000 MNVs altering protein sequences, demonstrating the pervasive nature of MNV annotation in population level data. Additionally, analysis of the 1000 Genomes dataset highlighted differences in the frequencies of MNVs depending on sequence context 10 . In combination with yeast experiments [11][12][13] , biological mechanisms that account for the enrichment of specific types of MNVs, such as DNA replication error by polymerase zeta, have been suggested.
Studies of newly occurring (de novo) MNVs have also been performed using trio datasets 2,14-16 ; analysis of 283 trios with whole genome sequence data 16 confirmed that MNV events occur much more frequently than expected by random chance. By focusing on noncoding regions, this study also highlighted potentially different mechanisms that dominate MNV generation depending on the genomic region and the distance between the two constitutive variants. As part of the Deciphering Developmental Disorders (DDD) study 17 ,Kaplanis et. al. 2 analyzed exome sequence data from over 6,000 trios to quantify the pathogenic impact of MNVs in developmental disorders, showing that such variants are substantially more likely to be deleterious than SNVs and further clarifying the mutational mechanisms that generate them.
These analyses also have provided estimates of the germline MNV rate per generation, falling into a consistent range of 1-3% of the SNV rate. Although these studies have provided valuable information about the mutational origins and functional impact of MNVs, to date there has been no analysis that investigated MNVs across the entire genome (including non-coding regions) in many thousands of deeply sequenced individuals, limiting our understanding of the genomewide profile and complete frequency distribution of this class of variation.
Here, we present the analysis of the largest collection of MNVs assembled to date, along with clinical interpretation of MNVs from over 6,000 sequenced individuals from rare disease families. We also provide gene-level statistics on MNVs and describe the distribution of MNVs by functional consequence and by gene-level constraint. Finally, to enhance our understanding of MNV mechanisms, we examine the distributions of MNVs stratified by more than ten different functional annotations across the human genome, as well as estimates of the genome-wide perbase frequencies of the dominant mutational processes generating MNVs.

a.
b. c.

Figure 1. Definition and an example of MNVs, and validation of phasing sensitivity a,
Definition and an example of an MNV. In this manuscript, an MNV is defined as two or more nearby variants existing on the same haplotype in the same individual. b, Impact of MNVs in coding regions. The amino acid change caused by an MNV can be different from either of the individual single nucleotide variants, which creates the potential for missannotation of the functional consequence of variants. c, Graphical overview of the analysis of phasing sensitivity and specificity using trio samples from our gnomAD callset. We identified all heterozygous variant pairs that pass quality control (Methods) and compared the phase information assigned by read based phasing with that of trio based phasing.

Read-based phasing for identification of MNVs
Identification of MNVs requires the constituent variants to be properly phased -that is, to be identified accurately as either both occurring on the same haplotype (in cis) or on two different haplotypes (in trans). Phasing can be performed following three broad strategies: readbased phasing 18 , which assesses whether nearby variants co-segregate on the same reads in DNA sequencing data; family-based phasing 19 , which assesses whether pairs of variants are co-inherited within families; and population-based phasing 20 , which leverages haplotype sharing between members of a large genotyped population to make a statistical inference of phase.
Read-based phasing is particularly effective for pairs of nearby variants, making it suitable for the analysis of MNVs.
For this project, we generated read-based phasing results for variants in the Genome

Functional impact of MNVs
In order to provide an overview of the functional impact of MNVs (Fig. 1b) To understand the overall impact of correctly annotating the functional consequence of

Genome-wide mutational mechanisms of MNVs
We next turned our attention to understanding the mutational mechanisms underlying the origins of MNVs genome-wide, focusing on whole genome sequence data from 15,708 individuals in the gnomAD v2.1 callset. We considered pairs of high-quality variants in autosomes separated by up to 10 bp, resulting in the assembly of a catalogue of 6,261,326 MNVs including 1,996,125 MNVs within 2 bp distance -an order-of-magnitude increase in size over previous collections.
We considered three established major categories of mutational origins of MNVs with constituent SNVs falling next to each other ("adjacent" MNVs. Fig. 3a), each of which is biased towards certain MNV patterns: (1) combinations of distinct single nucleotide mutation events; (2) replication errors by error-prone polymerase zeta; and (3) polymerase slippage events at repeat junctions. MNVs in the first category are a product of two or more SNVs, which typically occur in different generations and may thus have different allele frequencies. We expect to see an enrichment of CpG transition compared to non-CpG transversion for this class, due to the underlying difference of SNV mutation rate [28][29][30] . The second category, replication error introduced by DNA polymerase zeta (pol-zeta), is a well known class of replication error that introduces MNVs. Previous studies 10,[11][12][13]31 have shown that pol-zeta is prone to specific types of replication error, mainly TC->AA, GC->AA, and their reverse complements, with experimental evidence that these MNV patterns occur in a single generation; thus, the constituent SNVs will typically have the same allele frequencies. The third category, replication slippage, is another known mode of DNA replication error [32][33][34] . This process is especially frequent at sites with repetitive sequence context; previous studies [35][36][37] have shown that the insertion and deletion (indel) rate can be up to 10 6 times higher than the SNV mutation rate at these sites. As shown in To investigate the extent of pol-zeta signature, we calculated the number of MNVs in which the gnomAD allele counts of the constitutive single nucleotide variants are equal (following previous methodology 2 ), and observed that these "one-step" MNVs are significantly enriched in MNV patterns matching the pol-zeta signature (90.8% for GA->TT, and 80.5% for GC->AA, compared to 40.7% overall; Fisher's exact test p< 10 −100 ; Fig. 3c).
Finally, in order to capture polymerase slippage events, we calculated the fraction of MNVs in repetitive contexts per MNV pattern (Fig. 3d). For the MNV patterns AA->TT and AA->CC, more than 50% of all the MNVs observed were in repetitive contexts. The fractions of the MNV patterns AT->TA and TA->AT in repetitive contexts were also high, exceeding 30%  although further work will be required to fully characterize the underlying processes.
Overall, our analysis of MNVs in 15,708 whole genome-sequenced individuals supports the previously suggested three major mechanism of MNVs and quantifies the different contribution of each mechanism for different MNV patterns at genome-wide scale.

MNV distribution across different genomic regions
We next examined how MNV pattern distributions differ between functional annotation categories. We used 13 different functional annotations such as coding sequence, enhancer, and promoter from Finucane et al 39  Finally, we explored the effect of genic context on MNV origins and discovery: we selected the seven major regional annotations around gene coding sequences 41,42 , and calculated the fraction of MNVs likely explained by different mutational origins in each of these regions (Fig. 4d). Across all regions, we found that the MNV signal is primarily dominated by

Discussion
We analysed 125,748 human exomes and 15,708 genomes and identified 1,996,125 MNVs across genome with constituent variants falling within 2 bp distance, including 31,510 that exist within a codon. We have shown that MNVs represent an important class of genetic variation, and that they have a significant impact on the functional interpretation of genomic data, both at the population and individual level. Although we did not encounter an individual in which an MNV is the likely cause of a rare disease after sequencing 6,072 individuals from rare disease families, we expect that applying our pipeline to larger numbers of disease samples will identify previously missed diagnoses, as has been observed in another study of developmental delay cases 2 .
The large number and high quality of variant calls in the gnomAD database provided increased power for statistical analysis of the three major mutational mechanisms (combinations of independent SNVs; replication errors by pol-zeta; and polymerase slippage at repeat junctions) responsible for the generation of MNVs, and importantly allowed us to estimate the relative contribution of each of these processes.
Our estimates of substitution pattern-specific MNV mutation rate and fraction come with important caveats. Our approach assumes that the local SNV mutation rate is invariant across instances of a specific 3 bp context; however, prior work has shown considerable regional variation in mutation rate across the genome, as well as variation driven by ancestry, environment, and other factors [43][44][45][46] . Another important limitation is the lack of confident estimates of insertion and deletion rate as a function of repeat length, which limits the confidence of our estimate of the fraction of polymerase slippage. Future large genome-scale data sets with more accurate insertion and deletion calls, likely involving long-read sequencing data, will be required to improve modeling of insertion and deletion mutations.  Fig. 9, 10). Increasing the number of sequenced individuals in both disease and non-disease cohorts will permit the discovery and determination of the phenotypic impact of an increasingly comprehensive catalogue of variation. This study confirms the importance of incorporating haplotypic phase into these efforts to permit the discovery and accurate interpretation of the full range of human variation.

MNV filtering
In the gnomAD MNV analysis, MNVs for which one or both of their components have low quality reads were filtered out. Specifically, we only selected the variant sites that pass the Random Forest filtering, resulting in acceptance of 53.3% of the initial MNV candidates ( Supplementary Fig. 11). For the exome dataset, we also applied adjusted threshold criteria (GQ ≥ 20, DP ≥ 10, and allele balance > 0.2 for heterozygote genotypes) for filtering individual variants. For each MNV site, we annotated the number of alleles that appear as MNV, as well as the number of individuals carrying the MNV as a homozygous variant. The distribution of MNV sites that contain homozygous MNVs is shown in Supplementary Fig. 12. We also collapsed the MNV patterns that are reverse complements of each other, after observing that the number of MNVs are roughly symmetric (Before collapsing, the ratio of each MNV pattern to its corresponding reverse complement pattern was mostly close to 1, with 0.91 being the lowest and 1.05 being the highest for adjacent MNVs) (Supplementary Fig. 13). All the MNV patterns in the main text and figures are equivalent to their reverse complement, and we do not distinguish them.
For the rare disease cohort, since our motivation was to find a definite example where an MNV is acting as a causal variant for a rare disease with severe phenotype rather than obtaining the population level statistics, we did not apply site and sample-specific filtering, as

Analysis of phasing sensitivity
In order to compare the phasing information derived from different methods (read-based and trio-based), we took an approach of comparing the relative phase (binary classification of whether two SNVs of MNV are in the same haplotype or not), as shown in Supplementary Table   4. We investigated the heterozygous variant pairs whose phasing information is not provided by the trio-based phasing and observed that majority (83.5%) of the cases reflected both parents carrying a heterozygous variant, a scenario where trio-based phasing is inherently uninformative. We also investigated the heterozygous variant pairs whose phasing information is not provided by the read-based phasing. Specifically, unphased pairs tend to have either low or high read depth (odds ratio =3.20, Fisher's exact test p< 10 −100 for low, and odds ratio =2.33, Fisher's exact test p< 10 −100 for high read depth; Supplementary Table 1), consistent with our previous understanding that an excess of reads can lead to involvement of erroneous reads and thus reduce the confidence of phasing of HaplotypeCaller 58 (as well as the lack of the number of reads reduces the calling rate).

Analysis of functional impact in coding region
We focused on the coding region of the canonical transcript of genes and examined the

Defining one-step MNVs and MNVs in repetitive contexts
A one-step MNV was defined as a MNV for which the allele count of both SNVs that make up the MNV is the same and close to the allele count of the MNV itself. We also compared the allele count of constituent SNVs (AC1 and AC2) with the allele count of the corresponding MNV (AC_mnv), and observed that the majority of one-step MNVs we discovered have AC_mnv/AC1 >0.9 (Supplementary Fig 14). Therefore we expect the false discovery rate of one-step MNVs (misclassifying the MNV whose AC1 and AC2 are equal just by chance) to be limited. The full distribution of all the allele counts are shown in Supplementary   Fig 15. Repetitive sequences are defined by taking the +/-4 bp context of the MNV and setting the threshold manually, by looking at the distribution of repeat contexts around all the MNVs ( Supplementary Fig. 16, 17). Specifically, for adjacent MNVs, a sequence is defined as repetitive if -there is a >6 bp mononucleotide repeat, for both reference and alternative +/-4 bp context, or -there is a ≥6 bp dinucleotide repeat, for both reference and alternative +/-4 bp context This threshold was set so that the number of MNVs with equal or higher repeats would be less than 5% of the total. Also, the estimated mutation rate under those repetitive contexts is thought to be orders of magnitude higher than the background MNV mutation rate originating from the combination of two SNV events.

Calculating the proportion of MNVs per biological origin
We CpG, while both term of the latter are product of transversion at non-CpG, which works as a reasonable explanation of the frequency difference of those two MNV patterns.
Using the same principle (and accounting for reference base pair frequency, population number and global SNV mutation rate defined by 3 bp context 26 , we first constructed a "null model" of MNV distribution. In reality, this null model does not represent the real distribution we observe, due to biological mechanisms that introduce MNV. Therefore, we allowed additional factor , that denotes the mutational event where two SNVs are introduced at the same time. For the example of ( → ), we model this probability to be proportional to ( → ) ⋅ , and try to estimate the term, which corresponds to the proportion of MNVs that are explained by non-SNV (and non-repeat) factor.
Further details are explained in the supplementary text (section "Models and assumptions for calculating the proportion of MNV per biological mechanism").
In addition, for each of MNV pattern, we annotated the predicted major mechanism for each MNV pattern in the following order: 1. "pol-zeta", for the patterns known as polymerase signature (GA->TT and GC->AA) 2. "repeat", for the patterns whose fraction of MNVs in repeat contexts are higher than 20% 3. One of "CpG_Ti", "Ti", "CpG_Ti_Tv", "TiTv", "Tv", based on possible combinations of single nucleotide mutational processes. For example, "CpG_Ti" is when transition in CpG combined with another transition can occur in the mutational processes (Supplementary File 3)   Table 3).

Data availability:
The list of coding MNVs in gnomAD exome are available at Explanations for each column in each file can be found at gs://gnomadpublic/release/2.1/mnv/mnv_readme.md .
All the files above are also available at the download page of the gnomAD browser (https://gnomad.broadinstitute.org/downloads).

Code availability
The code used in the study is available at https://github.com/macarthurlab/gnomad_mnv .

Acknowledgements:
We would like to thank the many individuals whose sequence data are aggregated in gnomAD for their contributions to research, and for making this work possible.