Landscape of multi-nucleotide variants in 125,748 human exomes and 15,708 genomes

Multi-nucleotide variants (MNVs), defined as two or more nearby variants existing on the same haplotype in an individual, are a clinically and biologically important class of genetic variation. However, existing tools typically do not accurately classify MNVs, and understanding of their mutational origins remains limited. Here, we systematically survey MNVs in 125,748 whole exomes and 15,708 whole genomes from the Genome Aggregation Database (gnomAD). We identify 1,792,248 MNVs across the genome with constituent variants falling within 2 bp distance of one another, including 18,756 variants with a novel combined effect on protein sequence. Finally, we estimate the relative impact of known mutational mechanisms - CpG deamination, replication error by polymerase zeta, and polymerase slippage at repeat junctions - on the generation of MNVs. Our results demonstrate the value of haplotype-aware variant annotation, and refine our understanding of genome-wide mutational mechanisms of MNVs.


Landscape of multi-nucleotide variants in 125,748 human exomes and 15,708 genomes (Wang et al, NCOMMS-19-10112-T) Response to reviewers
We thank the reviewers for their thoughtful comments and have made a number of changes to the manuscript in response, which we believe have substantially improved this work. In particular we wish to note that a new filter for highly repetitive regions was applied across all of our variants in response to reviewer 2 (point 3 below), which slightly alters all of the numbers and figures in the text, but we believe makes our overall findings more robust to potential mapping artifacts.
Response to Reviewer 1: 1. A large amount of the text is listing statistics concerning the MNVs that were found. These paragraphs are difficult to read and the data might be better presented either as tables or as plots. Table 3, and Figures 2 and 3 describe most of the numbers that are important in the manuscript. However, we agree that there are some numbers that are difficult to extract from these display items. In order to make these clearer, we have added another subsection that summarizes the numbers as Supplementary  Table 3b.

In general usage, there is a distinction between SNP and SNV concerning the prevalence of a variant in the population. The authors should comment on this naming for multinuclotide variants and state which should be viewed as MNPs and which as MNVs
Although we understand that some researchers make a distinction between SNP and SNV, depending on the frequency of the variant, it has been difficult to find consensus on the frequency threshold used for this nomenclature. Therefore, for the sake of consistency, we use the more general term multi-nucleotide "variant" instead of "polymorphism" throughout the paper.
3. The limit of 10bp for the phasing analysis and the MNV analysis seems a bit arbitrary, especially when the reads are at least 100bp in size. This cutoff should be justified, or the authors should go up to collecting phased variants over longer haplotypes and analyze them.
There is a significant drop in the fraction of phased heterozygous pairs of variants when the distance is larger than 10bp, as shown in Supplementary Figure 1d. Phasing over longer haplotypic regions is definitely an interesting area to explore, but falls out of this paper's scope, as it requires methods other than read-based phasing. To clarify the rationale for this cut-off we added a sentence in the method section ("Also, we did not expand the window size larger than 10 bp for the MNV discovery, as the phasing sensitivity significantly drops when the distance between variants is larger than 10 bp, as shown in Supplementary Fig. 1d") and supplementary figure legend ("The fraction significantly drops at >10 bp, reflecting limitations of read-based phasing.") to explicitly state that fact.
4. How often are there more than 2 variants within 10bp? This should be explicitly noted. If the phasing results are sufficient, it would be nice to see this over longer distances.
Our analysis allows us to estimate that out of all the MNVs within a single codon (n=31,575), 0.72 % (= 228/31,575) are also observed as MNVs consisting of more than 2 variants.
The focus in this paper is on MNVs predicted to have different functional consequences compared to what would be predicted from their constituent SNVs. As such, while we do have a list of MNVs that consist of three SNVs within a single codon (n=228), we have not performed an exhaustive inference of MNVs of size >2 across the rest of the genome; performing this calculation would be very computationally expensive, which we believe would not be justified in terms of increased biological understanding.

It is noted that only ~0.005% of all possible MNVs were detected in this study.
Presumably if the entire population of the world were sequenced, we still would not get every possible MNV across the genome.
This is a great point, and indeed we confidently expect that to be true. We have added a figure plotting the number of samples against the number of MNVs discovered (by downsampling of the gnomAD data set), both in real and log scale (Supplementary Fig. 11 C and D,included below).
If we make a strong assumption that the log-linearity we observe in the current sample size holds true even with orders of magnitude higher sample size, we expect the percentage of all possible MNVs that we could observe from sequencing the entire human population would be roughly 1.7%, as shown in the figure below. We have included this in the legend of Supplementary Fig. 11 C and D, but given the considerable uncertainty associated with the linear assumption above, we have not emphasized this figure in the manuscript.
c. d. Figure S11. percentage of MNVs observed c, d, Number of MNV that are observed, as a function of down-sampled population size (median of 3 different random seeds. Error bars, denoting the maximum and minimum of 3 seeds, were constantly smaller than the dot size), in linear (c) and log (d) scale. If we make a strong assumption that the log-linearity we observe in the current sample size holds true even with sample sizes an order of magnitude larger, we expect the percentage of all possible MNVs that we observe when we sequence the entire population (n=7.7 × 10 9 ) would be roughly 1.6% Projection of the human population size (included for the interests of the reviewer, but not added to the paper): 6. Instead of just considering the functional impact of MNVs within one codon, it would be interesting to know how much of an impact all MNVs produce regardless of whether they are within one codon or across multiple codons. This is a more difficult computational task, but the real question of altered function is of a protein as a whole, not just an individual codon.
We have added a table that describes the number of MNVs in coding region that span across two codons, and showed the breakdown of functional consequence of individual SNVs, as well as the total number of such MNVs (Supplementary Table 5). This is now stated in the "Analysis of functional impact in coding region" section in the methods, as "(See Supplementary table 5 for the number of MNVs that spans across two codons)".
However, in those pairs of SNVs spanning two codons, functional inference can be performed without explicitly defining MNVs, since these variants act independently in the context of protein translation. Therefore, although we agree these are biologically interesting (e.g. in some cases there may be biochemical interactions between pairs of adjacent missense variants on the same haplotype), there is no clear way to determine which of these variants is most likely to have such an effect. As such we do not perform any further functional characterization of this class of MNV in this manuscript.
In response to this reviewer question, as well as an independent reviewer request for the gnomAD flagship paper that is being reviewed in parallel, we have considered another class of variant pairs whose combined interpretation can be highly different from either of the individual component variants: insertion/deletion pairs that result in frame restoration (e.g. 4bp deletion + 7bp insertion, resulting in 3bp = 1 amino acid insertion). We have made the list of such variant pairs available up to 30 bp distance (considering the limitation of read based phasing), showed that such indel pairs that rescue HC LoF variants are enriched in constrained genes, and suggested they mainly originate from one-step events. The results are summarized in Supplementary Fig. 3, and are mentioned in the "Functional Impact of MNVs" section of the results. We have deleted the description "Although theoretically a combination of insertions and deletions of different lengths could also change the individual consequence of the variants (for example, an insertion of length 4 followed by a nearby deletion of length 1 results in an insertion of length 3, which restores the codon reading frame), we focused on SNV combinations and did not try to identify such class of variants in this work" that was originally in the methods section from our manuscript.  Figure S3. Properties of frame-restoring indel pairs a, The number of indel pairs (orange = all, blue = phased) is shown as a function of distance between the indels. We set the threshold distance to be 30bp as there are relatively few indel pairs past this distance. b, The distribution of the distance between indel pairs resulting in frame restoration (exome only, same for c~h). c, The distribution of the resulting insertion or deletion length for frame-restoring indel pairs. d, The number of frame-restoring indel pairs per gene, and the list of genes with more than six such variants. e-f, The allele count distribution of framerestoring indels (e) and the distribution of allele counts divided by the maximum allele count of constituent SNVs (f). The value is exactly 1 for 81.5% of overall frame-restoring indel pairs, suggesting that the majority of such indel events are likely the result of one-step mutational event. g-h, The mean LOEUF (constraint) score (g) and the fraction of LoF-constrained genes for frame-restoring indel pairs (h), per combination of LOFTEE filters of the constituent indels.
Response to reviewer 2: 1. The authors made the following statement, "the most frequent MNV pattern is CA->TG substitutions, which are likely to occur as a combination of an A->G transition, followed by a high mutation rate C->T CpG transition." If this two-step process indeed occurs in this particular order, they should be able to find samples with higher frequency of CG (A->G occurring first) but not TA (assuming not everyone has the second mutation given a large sample size) at the CA->TG sites. Is this indeed the case?
This is a great point, and indeed this is very clearly the case. We have added the relevant information in supplementary

"
To investigate the extent of pol-zeta signature, we calculated the number of MNVs in which the gnomAD allele counts of the constitutive single nucleotide variants are equal (following previous methodology2), and observed that these "one-step" MNVs are significantly enriched in MNV patterns matching the pol-zeta signature…" Why not show that all samples either have WT-WT or ALT-ALT but not WT-ALT (WT-ALT would indicate a two-step process)?
Supplementary Fig. 15 contains this information: we observe that for almost all of the one-step MNVs (except for CG->TA, for which the process is likely dominated by two CpG transitions happening in close succession), the number of WT-ALT is less than 10% of that of ALT-ALT. However, we agree that this figure is not intuitive. We have added a sentence and another supplementary figure (S15a) to show that (as suggested by the reviewer) WT-ALT is very rare for one-step MNVs.
Also, to clarify, the presence of WT-ALT haplotypes in the population does indicate that there was an SNV event that did not result in an MNV at that position, but it is worth noting that this does not necessarily indicate a two-step process, as theoretically this outcome could result from an event in which a one-step MNV event is followed by an SNV that reverts one base to the intermediate stage in a subset of haplotypes in the population, or a recombination event between the two adjacent variants, both of which are extremely unlikely scenarios. However, our data supports the notion that AC1==AC2 (i.e. allele counts being equal for the two constituent SNVs) is generally a highly reliable indicator of a one-step MNV event.
Discussion of these issues can be found in the Defining one-step MNVs and MNVs in repetitive contexts section of the methods, but we have also added a sentence ("also described in Methods") to the Results to clarify this for readers.
a. Figure S15. Allele count of individual SNVs for one-step MNV a, Distribution of allele count of one-step MNV divided by that of constituent SNVs.

Finding MNVs in repetitive sequence content is going to be problematic because misalignment can definitely contribute to this. How many of the total MNVs reside within repetitive regions (about 10%?).
As we have shown in Figure 3d and Supplementary Fig. 17 of the original manuscript, this proportion depends heavily on the MNV pattern, with the highest proportion being slightly below 60% for AA->TT MNVs. For most MNV classes, it is less than 20%, and for all MNV patterns combined, it is slightly less than 10% (9.25%, as stated in the original main text "Genome-wide mutational mechanisms of MNVs" section), as you suggested.
We completely agree that there is an increased risk of misalignment errors in highly repetitive regions. Indeed, when we calculated the fraction of MNVs falling in extremely low-complexity regions (LCRs), we found that a substantial fraction of overall MNVs in non-coding regions (5% for adjacent MNVs) were in such LCRs. As we might expect, this fraction was especially high for what we define as "MNVs in repetitive regions", where it reaches 27%.
We agree with the reviewer that variant calls in these repetitive regions are highly enriched for errors. As a result, we decided to filter out MNVs in LCRs entirely from our analysis, and reperformed all analyses in the paper using this filtered data set. To be consistent with the analysis in exome dataset, we also decided to apply individual variant filtering with genotype quality criteria (adj ; GQ >= 20, DP >= 10, and allele balance > 0.2 for heterozygous genotypes).
While this resulted in (typically very small, except for figure 3d) changes to every single number, table and figure in the manuscript, our results, discussion and the major conclusions of the paper remain the same. The changes in the numbers before and after LCR filtering are summarized in the figure below, as well as in Supplementary Table 7 (e.g. After applying LCR filtering and re-defining the repetitive context that are likely to cause polymerase slippage, overall % of MNVs in repetitive regions are 3.15%, and that for AA->TT 30.2%.), and a new column in Supplementary Table 2. The fraction of variant filtered out in each filtering step is summarized in Supplementary Fig. 12, and the final fraction of MNVs falling in repetitive contexts are summarized in Supplementary Fig 18 (Note, those MNVs fall in junctions of local repetitive contexts, but are not classified as LCR). Variants in LCRs are flagged but not removed from the release file, so the users who are interested in such variants can still investigate them.
We have described these filtering steps in the MNV filtering and Defining one-step MNVs and MNVs in repetitive contexts sections in the methods, but of course this filter is not perfect, and there will remain some degree of sequencing artifacts in the MNVs we classify as likely polymerase slippage errors. To clarify this for readers we have added a sentence to the mutational mechanisms section of the results that reads "These findings come with the caveat that variants in repetitive regions will have higher error rates due to slippage and misalignment errors ( Supplementary Fig. 9), but we have reduced this risk by applying random forest filtering for individual sites, as well as removing all the variants in low-complexity regions from our analysis (see Methods)." a. b. c.

Figure. Comparison of the effect of LCR / adj filtering on figure 3 b, c and d of the main text (For the interest of reviewers, not in the final manuscript)
a, The number of MNVs, before and after LCR/adj filtering (corresponding to figure 3b ; The blue colored bars, labeled with "post-filtering" corresponds with figure 3b after in the final manuscript, and the orange bars with figure 3b in the old manuscript). The effect is very small and the order itself does not change. b, The number of one-step MNVs, before and after LCR/adj filtering. The left panel is based on the order and color of the previous figure, and the right panel is in the order and color of the updated figure. After applying the filter, one-step MNVs in AC->GT increases, but the essential information (those involving transition in CpG has the lowest, and pol-zeta signature has high fraction of one-step MNV) remains the same. c, The number of MNVs in repetitive contexts that are likely to cause polymerase slippage, before and after LCR/adj filtering and updating the definition of repetitive contexts. Left panel is based on the order and the color of the previous figure, and the right panel is in the order and color of the updated figure. Some of the top results before LCR/adj filtering (e.g. AA->CC) disappears, and thus are not defined as repetitive contexts in the new definition, but AA->TT, AT->TA, and TA->AT remains to be in the top and are still defined as repetitive contexts. c.

Figure S12. Fraction of variant pairs that are filtered out by random forest (a) / by regional filter (LCR= low complexity region) (b) / by failing the adjusted criteria (c), up to 10 bp
Row denotes the distance between two SNVs of variant pairs, and column denotes the MNV pattern (and reverse complement). The fraction is represented as color. The denominator is the number before filtering In (a), the number after applying random forest filtering in (b), and the number after applying random forest and LCR filtering in (c). Figure S18. Fraction of MNV that are in repetitive contexts, up to 10 bp Row denotes the distance between two SNVs of variant pairs, and column denotes the MNV pattern (and reverse complement). The fraction is represented as color.

4, The axes for figure 3b-d are slightly confusing when they switch from number to fraction in panel c and d. Best to stick with numbers for consistency.
We agree that this is confusing, and have explored multiple options to clarify this for the reader. Our challenge in this figure is showing the number in all cases actually obscures some key aspects of the data because the number of CpG-driven events is enormously higher than other classes. Therefore we have continued to show fraction in panels c and d, but changed the display in these panels to dot plots, so that the readers can more intuitively distinguish the panels showing number (which use histograms) versus fraction (which use dots). We applied this criteria to Figure 2 and Supplementary Fig. 2 as well (except for Supplementary Fig 2b, where dot plots would be too dense and confusing). We have also noted "Error bars represent standard error of the mean (often smaller than the dot size)" to clarify the error estimation. We believe this adds visual consistency to the figures and hopefully reduces confusion for readers.
c. d. These are excellent suggestions. The average number of MNV within codon per individual can already be found at Functional impact of MNVs section of the results ("There was an average of 55.2 variants with altered functional interpretation (including 0.062 gained and 4.42 rescued nonsense) due to MNVs per individual").
The overall allele frequency distribution can be found at Supplementary Fig. 16. However we do agree that the figure is not easy to follow, and more information could be provided. We have thus added the following information to Supplementary Fig 16: - We have also added Supplementary Table 6, which summarizes the number of MNV in each consequence category, for each population.
These results fit well with our understanding from standard SNV/indel analysis, showing features such as higher proportion of homozygous carriers in bottlenecked populations (Finns).
We will also release the ethnicity distribution of each MNV in the public gnomAD browser before publication of this manuscript.  6. I am surprised that the phasing sensitivity is only 85% for adjacent heterozygous variant pairs. Since exomes and genomes are typically with >30X coverage, there must be enough high quality reads that cover adjacent variants. An explanation of the low sensitivity is needed.
Yes, we were also initially surprised by these numbers and spent some time exploring the origins of the reduced sensitivity. As mentioned in the main text, we believe this reflects the very stringent phasing criteria employed by GATK. Specifically, as shown in supplementary table 1d and stated in the Analysis of phasing sensitivity section of the methods, we found the fraction of unphased variant pairs becomes low not only when the read depth is low, but also when it is too high. This appears to be because when read depth is high, the probability of a base error that is inconsistent with the haplotype phase increases. This error mode is also mentioned as the basis of the idea of maximum read depth filter in the reference 59: Li, H. Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics 30, 2843-2851 (2014). It is clear that better error modeling within GATK could thus improve phasing sensitivity, but such work falls outside the scope of this paper, and the application of such new methods would require complete re-calling of all gnomAD samples.

Other edits made to the manuscript:
In addition to the revisions described above, we have also made some additional minor changes to the manuscript: -We now include the MNVs in hemizygotes in the analysis. This resulted in a slight increase in the number of MNVs we call and analyze in the exome dataset (from 31510 to 31667 pre-LCR filtering, and to 31575 after LCR filtering). -We changed the lower y-axis limit of figure 3a to be 1 instead of 10^2, to avoid overinterpretation. -We changed the upper y-axis limit of figure 3d to be 1 instead of 0.6, for the consistency of the scale with figure 3c. -In figure 3 c and d, we changed the number of top and bottom MNV patterns to show, to make it consistent across the paper. Specifically, we show top 6 and bottom 4 MNV patterns. We also updated the legend accordingly, and made it consistent with the MNV pattern classification in figure 4a. -We corrected the number of MNVs consisting of 3 SNVs to 229 (228 after LCR filtering. -In the Fig 4d and section Functional enrichment in the method, we included the pattern TA->CG as a transversion signal by error, where the correct pattern is TA->GC. We have corrected this in the method section and re-generated the figure 4d, which resulted in a very minor change in the figure (minor reduction of the estimated fraction of transversion origin). -There was a small bias in the quantification of repetitiveness in the case of d>1, where the number of dinucleotide repeats were counted for the 2 bp context including the upstream constituent SNV of the MNV, but not for the downstream constituent SNV. We have corrected this and added a line in the method section that clearly describes our quantification of repetitiveness. This resulted in a minor change in the Supplementary   Fig. 17 and 18, without major change in the biological interpretation. We also changed the threshold for the quantification of repetitiveness, after applying the LCR filtering. -We have made a minor change in the mathematical notation in the supplementary methods section. Specifically, what we previously wrote as x/y is now denoted as , to better represent that it is x divided by y.
-When explaining the estimation of MNV mutation rate per generation, we mis-annotated the digit of the number of reference base pairs. We fixed this and changed the "Given that there are roughly 1.68 × 10 9 GA pairs and 1.20 × 10 9 GC pairs in the reference human genome, we estimate there are on average 0.22 GA->TT and 0.40 GC->AA mutations per generation (Supplementary File 3)." in the main text to "Given that there are roughly 1.66 × 10 8 GA pairs and 1.20 × 10 8 GC pairs in the reference human genome, we estimate there are on average 0.026 GA->TT and 0.049 GC->AA mutations per generation (Supplementary File 3). ". (Not only the digit but the number itself slightly changes because of the LCR filtering we applied in the revision, but this does not affect any major conclusions) These changes have no material consequences for the major results discussed in the manuscript.