Abstract
The mutation rate of a specific position in the human genome depends on the sequence context surrounding it. Modeling the mutation rate by estimating a rate for each possible kmer, however, only works for small values of k, since the data becomes too sparse for larger values of k. Here we propose a new method that solves this problem by grouping similar kmers. We refer to the method as kmer pattern partition and have implemented it in a software package called kmerPaPa. We use a large set of human de novo mutations to show that this new method leads to improved prediction of mutation rates and makes it possible to create models using wider sequence contexts than previous studies. As the first method of its kind, it predicts rates not only for point mutations but also for insertions and deletions. We have additionally created a software package called Genovo that, given a kmer pattern partition model, predicts the expected number of synonymous, missense, and other functional mutation types for each gene. Using this software, we show that the created mutation rate models increase the statistical power to detect genes containing disease-causing variants and to identify genes under strong selective constraint.
Introduction
The germline mutation process is the source of all genetic variation, including all adaptive and deleterious variants. Understanding and modeling this process can be used to calibrate variant calling^{1}, infer demographic history^{2}, infer patterns of genome evolution^{3}, identify sequences of clinical relevance for human diseases^{4}, and infer selective constraints of genes^{5}. The main factor determining the mutation rate of a given site in the human genome is the sequence context surrounding the position. For example, spontaneous deamination of methylated cytosines results in ten times higher C→T rates at CpG sites^{6,7,8,9}. Other factors that affect the mutation rate—such as GC content, CpG islands, epigenetic modifications—are associated with varied mutation rates depending on nucleotide context^{10} and thus cannot be studied without taking sequence context into account.
Since nucleotide context is the main determinant of the mutation rate, it is natural to build models that estimate the mutation rate of a position from that feature. Previous studies estimated such rates by assigning an independent rate to each kmer^{10,11} or by using a logistic regression model with a dummy variable for each nucleotide at each kmer position^{12}, allowing interactions between at most four positions. Both strategies have been used to build models that predict mutation rates using 7mer contexts but become infeasible for longer contexts. Furthermore, these previous studies focused on point mutations, and little attention has been paid to estimating the position-specific rate of germline indels. Neither Carlson et al.^{10} nor Aggarwala and Voight^{12} consider indels. Samocha et al. calculate the rate of frameshift indels per gene but do not model the indel mutation probabilities at each site^{11}. Instead, they estimate the rate of frameshift indels in a given gene by assuming that the rate of such variants is proportional to the rate of nonsense mutations. While some correlation exists between the number of polymorphic nonsense mutations and the number of polymorphic frameshift indels (r = 0.493, see methods), there is no observable correlation between the number of de novo nonsense mutations and de novo frameshift indels (r = −0.003, see methods). This lack of correlation among de novo variants indicates that the correlation between nonsense mutations and frameshift indels in segregating variants is primarily due to selection, not mutation, which casts doubt on the strategy of calculating the rate of frameshift indels from the rate of nonsense mutations. Since frameshift indels account for 44% of the LoF mutations in human genes^{5}, building models that can predict the rates of such mutations is an important task.
In this work, we present a new method that can predict the mutation rate of both point mutations and indels based on nucleotide context.
Results
To overcome the problem that many kmers will have few or zero observations when long kmers are used to predict mutation rates, we propose a method that we call kmer Pattern Partition (kmerPaPa). The main idea is to partition the set of all kmers by a set of IUPAC patterns so that each kmer is matched by one and only one of the IUPAC patterns in the set. This partition should be chosen so that each pattern matches kmers with similar mutation probabilities. Figure 1 shows how the 16 possible 3mers containing C→G mutations can be partitioned using 10 patterns. An exact algorithm to calculate the pattern partition that optimizes the loss function is presented in the methods section. The loss function contains two regularizing hyperparameters, c (complexity penalty) and α (pseudo count), which are fitted using two-fold cross-validation.
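The partition requirement (each kmer matched by exactly one pattern) can be checked mechanically. The following sketch verifies this property for a hypothetical five-pattern partition of the 16 3mers with a central C; these patterns are illustrative and are not the ten patterns of Fig. 1:

```python
from itertools import product

# Bases matched by each IUPAC nucleotide code
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
         "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}

def expand(pattern):
    """Set of kmers matched by an IUPAC pattern."""
    return {"".join(bases) for bases in product(*(IUPAC[c] for c in pattern))}

def is_partition(patterns, k, middle="C"):
    """True if every kmer with the given middle base is matched by
    exactly one pattern in the set."""
    universe = {km for km in map("".join, product("ACGT", repeat=k))
                if km[k // 2] == middle}
    expanded = [expand(p) for p in patterns]
    # cover the whole universe, with no kmer matched twice
    return (set().union(*expanded) == universe
            and sum(map(len, expanded)) == len(universe))

# A hypothetical partition of the 16 central-C 3mers into 5 patterns
print(is_partition(["NCG", "NCC", "NCT", "RCA", "YCA"], 3))  # True
# Overlapping patterns do not form a partition
print(is_partition(["NCG", "NCN"], 3))  # False
```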
Testing prediction of point mutation probabilities
To make models that only reflect the mutation rate, with as little bias from selection and biased gene conversion as possible, we use observed de novo mutations as training input to our model (see methods). To test the model, we first separate the data into a training and a test set, using the even-numbered chromosomes for training and the odd-numbered chromosomes as test data (see Supplementary Fig. 1 for an overview of the data used in different analyses). We then fit independent pattern partition models to each of the six mutation types (A→C, A→G, A→T, C→A, C→G, C→T) for different values of k. We compared the performance of kmerPaPa to a model that assigns a rate to each kmer (called “all kmers” in Fig. 2). As with kmerPaPa, we also include a pseudo count in the “all kmers” model and fit its value using 2-fold cross-validation. Figure 2a shows the out-of-sample performance of these models on the test data for different values of k. For kmerPaPa, the joint Nagelkerke r^{2} across the 6 mutation types keeps increasing as k increases, whereas the alternative “all kmers” model begins overfitting at 5mers and performs poorly for larger values of k. The results reveal that nucleotides situated more than three base pairs away can affect a position’s mutation rate, as the 9mer partition outperforms the 7mer partition in four of the six mutation types, even though the 9mer models tend to include fewer patterns (see Supplementary Fig. 2).
Testing prediction of indel mutation probabilities
Besides point mutations, it is essential to consider insertions and deletions. We assign mutation probabilities to indels by looking at the kmers around the breakpoints: one kmer for insertions and two for deletions (start and end). However, unlike point mutations, it is often impossible to determine the position of an indel precisely. Usually, more than one possible mutation event could have changed the reference sequence to the alternative sequence. If, for instance, “CAG” is changed to “CAAG”, it is impossible to know whether the DNA break and base insertion happened between C and A or between A and G. We handle this uncertainty by enumerating all the possible positions and randomly selecting one of them for each observed event (Fig. 3). We evaluate the performance of the predicted indel rates using the same test/train split as we used for point mutations (Fig. 2b). For both all kmers and kmerPaPa, the out-of-sample performance increases with k, but the kmerPaPa model outperforms the all kmers model for all values of k.
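The enumeration of equivalent breakpoints can be sketched as follows; `insertion_breakpoints` is a hypothetical helper (not part of the kmerPaPa code) that finds every position where inserting the extra bases into the reference yields the alternative sequence:

```python
import random

def insertion_breakpoints(ref, alt):
    """Enumerate all breakpoints where inserting the extra sequence into
    `ref` yields `alt` (assumes ref and alt differ by one insertion)."""
    assert len(alt) > len(ref)
    ins_len = len(alt) - len(ref)
    positions = []
    for i in range(len(ref) + 1):
        # the inserted bases would be alt[i:i + ins_len] at breakpoint i
        if ref[:i] + alt[i:i + ins_len] + ref[i:] == alt:
            positions.append(i)
    return positions

# "CAG" -> "CAAG": the extra A could sit before or after the existing A
print(insertion_breakpoints("CAG", "CAAG"))  # [1, 2]

# randomly assign one breakpoint per observed event, as described above
chosen = random.choice(insertion_breakpoints("CAG", "CAAG"))
```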
After validating that the extended kmer models built by kmerPaPa give good results on an independent test set, we trained new kmerPaPa models on the whole data set. The following sections use these models to investigate the sequence contexts that increase mutation rates, find genes harboring diseasecausing variants, and quantify the intolerance to mutations of human genes.
Identification of patterns with unusually high or low mutation rates
An advantage of a kmer pattern partition compared to a regression model is the direct interpretability of the output. By ordering the output patterns based on their mutation rate, we can directly observe patterns with extreme mutation rates. Figure 4 shows the relative mutation rate of each of the patterns for each of the mutation types (blue points) and the rate of each 3mer (red points). For most mutation types, the range of the mutation rates predicted using kmer patterns is an order of magnitude larger than that predicted by 3mers. Many of the patterns with high point mutation probabilities match the patterns previously reported by Carlson et al.^{10} and Aggarwala and Voight^{12}. For instance, the top pattern for A→G mutations, “YYCAATG”, matches the CCAAT motif reported by Carlson et al.^{10} and the YCAATB pattern reported by Aggarwala and Voight^{12}.
Previous studies have reported that polymerase slippage at short tandem repeats is responsible for 75% of indels^{13}. It is thus not surprising that many of the indel patterns we observe contain repeated sequences. For insertions, we observe the highest mutation rate for (T)n and (A)n mononucleotide repeats, followed by (G)n and (C)n repeats. For deletions, the top patterns correspond to (AG)n and (CT)n dinucleotide repeats. After these, we observe a high deletion rate of the middle two bases in the palindromic sequence CACATGTG. This sequence has not previously been described as a mutation hotspot, and the reason behind its high deletion rate is unknown. One possible explanation is that the palindromic sequence can form a hairpin structure. DNA hairpins are known to cause transient polymerase stalling, which leads to indel formation, whereas other alternative DNA structures (e.g., G4) cause persistent polymerase stalling and result in point mutations^{14}.
Detection of genes where de novo mutations cause disease
Genes in which germline mutations cause disease can be found by looking for genes with surprisingly many de novo mutations in afflicted children. The kmer pattern partitions we have created can be used to calculate how many mutations of a specific functional category (synonymous, missense, etc.) to expect in a given gene.
We have created a software tool—Genovo—to enumerate all the possible variants of a specific functional category and look up their mutation rates. Given a list of observed mutations, Genovo can then calculate p-values by sampling from the Poisson-binomial distribution given by this list of rates (see methods and Supplementary Fig. 3). This functionality is similar to that provided by the tool denovolyzeR^{15}, so we compare our results to this tool.
As a first test, we compare the genic mutation rates predicted by Genovo and denovolyzeR to the number of segregating variants in each gene. First, we look at the number of rare (MAF < 1%) synonymous variants for each gene in the gnomAD database, where we observe a better correlation with the synonymous rate per gene estimated by Genovo (r = 0.977) than by denovolyzeR (r = 0.958) (Fig. 5a). Most of the variation in the number of synonymous variants per gene can be explained by gene length alone. But if we look at the rate per site and divide both the observed number and the predicted rate by the length of the coding sequence, we still observe high correlations (r = 0.699 for Genovo and r = 0.679 for denovolyzeR) (Fig. 5b). For nonsynonymous variants, we expect smaller correlations, since differences in selective pressure between genes greatly influence these counts. For all classes, we observe better correlations with the rates predicted by Genovo than with those predicted by denovolyzeR.
Second, we compare the statistical power of Genovo and denovolyzeR for identifying disease-causing genes. To do this, we look at 4293 trios from the Deciphering Developmental Disorders consortium^{16} and divide the set of trios into equally sized training and test data sets. 350 genes contain at least one loss-of-function (LoF) mutation in the training data set. 45 of these show significant LoF enrichment after Bonferroni correction according to Genovo, and 33 of these are also significant in the test data set. Running denovolyzeR yields 34 significant genes in the training set, 24 of which are significant in the test set. This means that Genovo has a higher validation rate in the independent test data set (73.3% compared to 70.6%), even though it identified a larger number of significant genes in the training data set.
Quantifying genic tolerance to lossoffunction mutations
Besides finding genes where de novo mutations cause disease, we can also use our mutation rate models and the Genovo software to estimate the strength of negative selection acting on a given gene. To do this, we look for genes containing fewer segregating functional variants than expected because selection has purged deleterious variants from the population. The types of variants most likely to be deleterious are those that we expect to completely inactivate a gene, such as stop-gain, essential splice, and frameshift variants. The observed number of such loss-of-function (LoF) variants for a gene divided by the expected number predicted by our mutation rate model—the LoF O/E ratio—provides an estimate of evolutionary constraint. If a gene has an LoF O/E ratio around one, it indicates that the gene is evolving neutrally and that no substantial fitness cost is associated with losing it. In contrast, haploinsufficient genes, where two functional alleles are essential for survival, will have LoF O/E ratios at or close to zero. We have calculated this ratio for each gene using the observed mutations from gnomADv2 and compared the results to the ratios reported for each gene by gnomAD. Looking at genes predicted to be haploinsufficient by ClinGen^{17}, we observe that 55.5 percent of the genes have a Genovo O/E ratio in the first decile, compared to 47 percent for the O/E ratio calculated by gnomAD (see Fig. 6a). If we instead use the upper bound of the 90% confidence interval of the ratio (LOEUF), as suggested by Karczewski et al.^{5}, we see similar results (see Fig. 6b). We also compared Genovo and gnomAD using two lists of essential genes used in the gnomAD article^{5} and obtained similar results (see Supplementary Fig. 4). Looking at the ability to classify whether a gene is haploinsufficient, we obtain significantly higher AUC values using Genovo (p = 2.42 × 10^{−153}, see Fig. 6c).
Discussion
The kmer pattern partition method introduced in this manuscript makes it possible to build robust models of the germline mutation process using kmers. We have demonstrated the method by building models for each of the six point mutation types as well as for short insertions and deletions, using de novo mutations as input. The results show that the out-of-sample error decreases as k increases, demonstrating that even nucleotides four bp away from a position are informative about its mutation probability. An advantage of the pattern partition models is their interpretability, with overrepresented patterns readily available and not requiring any further analysis of the created models. The kmer pattern partition approach is general and could, in principle, also be used to solve other problems where the input consists of small position-specific kmers. The main shortcoming of the method in its current form is that the running time of the presented dynamic programming algorithm is exponential in k, which makes it infeasible for kmers longer than nine bases. A possible improvement would be to consider heuristic algorithms that quickly find a good (but not necessarily optimal) solution and thus enable analyses with larger values of k. We do consider one such heuristic algorithm in the article (Supplementary Fig. 2), but it should be possible to come up with better heuristics that balance accuracy with computational cost.
The predictive models created using kmerPaPa are not only robust but also easily interpretable. The method can be seen as a dimensionality reduction method that reduces the high-dimensional kmer space to a lower-dimensional space where each dimension is represented by an IUPAC pattern. This set of patterns with associated mutation rates makes it easy to see the general differences between high and low mutation rate kmers. Other methods have previously been used to create models of point mutation rates based on sequence context, but this study is, to our knowledge, the first to also create robust models of indel mutation rates based on sequence context. To do that, we had to overcome the problem that the exact position of an indel often cannot be determined. We do this by enumerating all the possible positions and randomly choosing one of them for each indel. Future studies that train indel kmerPaPa models using larger input data sets and further examine the resulting patterns can hopefully increase our understanding of the indel mutation process.
Besides the kmer pattern partition method implemented in the kmerPaPa software, we also present a tool called Genovo that can calculate the expected number of mutations with a specific functional consequence in a gene. Furthermore, Genovo can calculate p-values for whether a gene contains significantly more observed mutations of a particular functional category than expected. This functionality can be used to find genes with causative de novo mutations in disorders such as autism and developmental disorders. We have compared Genovo to denovolyzeR, which provides similar functionality. When predicting the number of synonymous mutations in each gene reported in gnomAD, we do slightly better than denovolyzeR. Our ability to predict the number of frameshift mutations in a gene is, however, much better than denovolyzeR’s. This reflects that we train independent mutation rate models for insertions and deletions, whereas denovolyzeR calculates the expected number of indel variants by assuming a correlation with the number of point mutations. Unlike denovolyzeR, which is tied to a specific reference genome (hg19) and gene annotation, Genovo is more general and can calculate expectations and p-values for any reference genome and gene annotation. This allows users to use the newest reference genome and gene annotation at any time and makes it possible to use kmerPaPa and Genovo to analyze the enrichment or depletion of functional mutational categories in other species.
In conclusion, we have created a new robust method for predicting mutation probabilities based on sequence context for both point mutations and indels. Furthermore, we have developed a flexible software tool that makes it easy to use the created mutation rate models to find genes where de novo mutations cause disease or to measure the evolutionary constraint on genes.
Methods
Definition of kmer pattern partition
A kmer pattern partition is a set of IUPAC-encoded nucleotide patterns that partitions a set of kmers so that each kmer is matched by one and only one of the IUPAC patterns in the set. To find a pattern partition that groups kmers with similar mutation rates as well as possible while not overfitting the data, we select the partition P = {p_{1}, …, p_{n}} that optimizes the following loss function:
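The original equation display is not reproduced here; a form consistent with the surrounding description (the negative binomial log-likelihood plus a complexity penalty of c per pattern, so that c = 2 recovers AIC, as noted below) is:

```latex
\mathrm{Loss}(P) \;=\; \sum_{i=1}^{n} -2 \left( M_i \log r_i + U_i \log (1 - r_i) \right) \;+\; c \cdot n
```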
where M_{i} and U_{i} are the number of mutated and unmutated positions that match the i’th pattern (p_{i}), c is a regularizing complexity penalty, and r_{i} is the rate estimate for sites that match the i’th pattern, regularized to be closer to the mean mutation rate, μ, using a pseudo count, α:
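One standard pseudo-count form consistent with this description, treating α as the weight of prior observations at the mean rate μ (a reconstruction, since the original display is not reproduced here), is:

```latex
r_i = \frac{M_i + \alpha \mu}{M_i + U_i + \alpha}
```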
M_{i}, U_{i}, and μ can easily be calculated from the kmer counts. If m_{j} and u_{j} denote the number of times kmer k_{j} has been observed mutated and unmutated, respectively, then we can calculate the pattern counts by summing over the kmers that match the pattern:
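In symbols, with m_{j} and u_{j} the per-kmer counts:

```latex
M_i = \sum_{j :\, k_j \text{ matches } p_i} m_j , \qquad U_i = \sum_{j :\, k_j \text{ matches } p_i} u_j
```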
And μ is just the average rate:
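That is, the total number of mutated observations divided by the total number of observations:

```latex
\mu = \frac{\sum_j m_j}{\sum_j \left( m_j + u_j \right)}
```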
The loss function consists of the negative log-likelihood of the binomially distributed counts for each pattern plus a complexity penalty of c per pattern. This makes the loss function a generalization of several widely used model selection criteria; if, for instance, we use c = 2, the loss function corresponds to the Akaike information criterion (AIC).
The two regularizing hyperparameters, c and α, are fitted using cross-validation.
Calculating the optimal pattern partition
Starting with the most general pattern, we can create any possible pattern partition by:

1. Picking a position in the pattern.
2. Splitting the IUPAC code at that position into two subcodes that form a partition (we call such a split a two-partition).
3. Possibly repeating these steps on each of the two subpatterns.

Table 1 shows all the possible two-partitions for each IUPAC code.
Since all pattern partitions can be created using the strategy above, we can find the minimal loss for the pattern, p, using the following recursion formula:
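Written out (as a reconstruction consistent with the three steps above), the minimal loss L(p) of a pattern p is the smaller of the loss of keeping p as a single pattern and the best split of p at some position j into a two-partition (a, b) of the IUPAC code at that position:

```latex
L(p) = \min \left( \mathrm{Loss}(\{p\}) ,\; \min_{j,\,(a,b)} \left[ L(p[j \to a]) + L(p[j \to b]) \right] \right)
```

where p[j → a] denotes p with the code at position j replaced by subcode a.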
Using this formula, we compute the optimal partitions bottom-up using dynamic programming, so that the optimal partition for a given pattern is never calculated more than once. This algorithm’s running time and memory usage are proportional to the number of possible patterns, and thus exponential in k. Our python implementation, sped up using numba^{18} for just-in-time compilation, can calculate the optimal 9mer pattern partition for C→T mutations in ~4 h (given α and c). The size of the IUPAC alphabet is 15 (the four standard nucleotides plus the 11 characters shown in Table 1). This means that calculating the optimal 11mer partition would multiply the 4 h running time by 15^{2} = 225, which is not feasible using this algorithm.
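A minimal, unoptimized python sketch of this dynamic program is shown below (the real implementation is bottom-up and numba-compiled; here memoization plays the same role). The per-pattern loss uses the pseudo-count rate estimate assumed above, and `TWO_PARTS` lists the two-partitions of each ambiguous IUPAC code, as in Table 1:

```python
from functools import lru_cache
from itertools import product
from math import log

# Bases matched by each IUPAC code
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
         "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}

# Two-partitions of each ambiguous IUPAC code (as in Table 1)
TWO_PARTS = {"R": [("A", "G")], "Y": [("C", "T")], "S": [("C", "G")],
             "W": [("A", "T")], "K": [("G", "T")], "M": [("A", "C")],
             "B": [("C", "K"), ("G", "Y"), ("T", "S")],
             "D": [("A", "K"), ("G", "W"), ("T", "R")],
             "H": [("A", "Y"), ("C", "W"), ("T", "M")],
             "V": [("A", "S"), ("C", "R"), ("G", "M")],
             "N": [("A", "B"), ("C", "D"), ("G", "H"), ("T", "V"),
                   ("R", "Y"), ("S", "W"), ("M", "K")]}

def best_partition_loss(pattern, m, u, alpha, mu, c):
    """Minimal loss over all pattern partitions refining `pattern`.
    m and u map each kmer to its mutated / unmutated counts."""
    @lru_cache(maxsize=None)
    def loss(p):
        # pattern-level counts: sum over all kmers matched by p
        M = U = 0
        for km in map("".join, product(*(IUPAC[ch] for ch in p))):
            M += m.get(km, 0)
            U += u.get(km, 0)
        r = (M + alpha * mu) / (M + U + alpha)  # regularized rate estimate
        best = -2 * (M * log(r) + U * log(1 - r)) + c  # keep p whole
        for j, ch in enumerate(p):  # or split p at some position j
            for a, b in TWO_PARTS.get(ch, []):
                best = min(best, loss(p[:j] + a + p[j + 1:])
                                 + loss(p[:j] + b + p[j + 1:]))
        return best
    return loss(pattern)
```

Because of the memoization, each sub-pattern's optimal loss is computed once, giving the running time proportional to the number of possible patterns described above.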
Besides the optimal algorithm described above, we have also implemented a greedy heuristic algorithm in which we only consider the most promising split of a given pattern into two and do not recursively test the other possible splits. This heuristic is much faster and consumes much less memory, which makes it possible to test larger kmers. The results in Supplementary Fig. 2 show that the improved speed comes at the cost of decreased performance. For point mutations, the models trained using the greedy algorithm were slightly worse, but for indel mutations, the models were much worse than those generated using the optimal algorithm. Only for the A→T mutation type was there a benefit from being able to use a higher k, as the 11mer model did better than the 9mer model for this mutation type. For the rest of the mutation types, the results achieved with the optimal algorithm were clearly better than those achieved using the greedy algorithm. We thus decided to use the optimal algorithm to generate all models reported in this article.
Estimating hyperparameters
To find the optimal α and c, we do cross-validation over a grid, which means that we need to calculate the optimal partition for each fold and each combination of α and c. This grid search is, however, easy to parallelize, since each parameter combination can be run separately on different machines. To create the cross-validation folds, the kmerpapa software uses a hypergeometric distribution to sample subsets of the kmer count tables. For the models presented in this article, we used repeated 2-fold cross-validation, where we created 5 independent cross-validation data sets for each parameter combination and selected the parameter combination with the lowest average out-of-sample negative log-likelihood. Supplementary Fig. 5 compares the negative log-likelihood for different combinations of α, c, and k. The results show that the optimal c increases with k, which can explain why, for several of the mutation types, we end up with fewer parameters in the 9mer model than in the 7mer model.
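The hypergeometric fold split can be sketched as follows: for each kmer in turn, we draw how many of its counts land in fold 1, conditional on a fixed fold-1 total (a sketch of the idea, not the kmerpapa implementation):

```python
import numpy as np

def split_counts(counts, rng):
    """Split a kmer count table into two folds with fixed totals.
    For each kmer in turn, the number of its counts that fall in fold 1 is
    hypergeometric: of the `remaining` counts overall, `n` belong to this
    kmer, and `quota` fold-1 slots are still unfilled."""
    total = sum(counts.values())
    quota = total // 2            # target size of fold 1
    remaining = total
    fold1, fold2 = {}, {}
    for kmer, n in counts.items():
        x = int(rng.hypergeometric(n, remaining - n, quota)) if n and quota else 0
        fold1[kmer], fold2[kmer] = x, n - x
        quota -= x
        remaining -= n
    return fold1, fold2

rng = np.random.default_rng(42)
f1, f2 = split_counts({"AAA": 10, "AAC": 7, "AAG": 3}, rng)
```

Because the fold-1 quota is consumed exactly, the two folds always have fixed sizes, unlike an independent binomial split.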
De novo mutations and covered regions
To train kmerPaPa models, we use de novo mutations from 6 different WGS studies of parent-offspring trios^{19,20,21,22,23,24}. Together these studies include 379,330 autosomal point mutations from 7206 trios. Four of the six studies also report indel mutations (39,877 autosomal variants). Studies mapped to hg19 were first lifted over to hg38.
Detection of de novo mutations requires strict filters, and usually there will be a substantial fraction of the genome where de novo mutations cannot be called due to insufficient coverage or low-complexity sequence. We do not know the exact genomic regions where a mutation could have been called in each of the six studies. To deal with this, we have chosen a conservative set of regions—the strict callable mask from the 1000 genomes project^{25}—and assume that mutations could have been called in this subset of the genome in all of the studies. Because we want to infer a genome-wide mutation model, we further exclude the C→G enriched regions described in ref. ^{26}, since they are known to have mutation patterns that differ substantially from the average. After removing the C→G enriched regions from the 1000 genomes callable mask, the callable regions contain 1754 million autosomal sites, and after discarding all mutations that fall outside these regions, we are left with 249,437 autosomal SNVs and 22,331 autosomal indels.
We use the program kmer_counter (https://github.com/BesenbacherLab/kmer_counter) to calculate the kmer counts used as input to kmerPaPa. As explained in the results section, we first test the kmerPaPa method by using the even-numbered chromosomes as training data and the odd-numbered chromosomes as test data. Afterwards, we train models using mutations on all chromosomes (see Supplementary Fig. 1). Besides training indel models using all insertions and all deletions, as presented in Fig. 4, we also train separate models for indels whose length is a multiple of 3 and indels whose length is not. These models are used as input to the Genovo software to allow it to calculate separate probabilities for frameshift and in-frame indels.
Computing pseudo r^{2} values
To measure how well the models do on the test data, we calculate Nagelkerke’s pseudo r^{2}:
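In its standard formulation (consistent with the description of the adjustment below), with n sites in the test data:

```latex
r^2 = \frac{1 - \exp\left( \frac{2}{n} (l_0 - l_M) \right)}{1 - \exp\left( \frac{2}{n}\, l_0 \right)}
```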
where l_{M} is the log-likelihood of a model in which sites matching the i’th pattern are predicted to mutate with rate r_{i}:
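Using the test-data counts M_{i} and U_{i} defined below, this log-likelihood is:

```latex
l_M = \sum_i \left( M_i \log r_i + U_i \log (1 - r_i) \right)
```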
Here M_{i} and U_{i} are the numbers of mutated and unmutated sites in the test data that match the i’th pattern, and l_{0} is the log-likelihood of a model with no predictors.
Nagelkerke’s pseudo r^{2} is equal to the Cox and Snell r^{2} value divided by an adjustment that ensures that the maximum value is always 1. Because pseudo r^{2} values are a function of the number of data points in the data set, these statistics cannot be directly compared between type-specific models, where the number of data points varies. Due to the intrinsic randomness of the mutation process, it will never be possible to predict exactly where mutations happen and obtain an r^{2} value of 1.
Genovo
From a list of observed mutations, the Genovo software can determine the number of (1) synonymous, (2) missense, (3) nonsense, (4) start-codon disruption, (5) stop-codon disruption, (6) canonical splice site disruption, (7) in-frame indel, and (8) frameshift indel events in each transcript. Furthermore, the software takes the genomic sequence of every transcript and enumerates all possible point mutations. For each possible point mutation, it determines the type of point mutation and its probability derived from our kmerPaPa models. The possible mutations of each mutation type are tallied up to calculate the expected number of point mutations of each type. In addition, the possible mutations and their probabilities are used to calculate a Poisson-binomial distribution of mutation events for every transcript, because each possible mutation event has a different probability of occurring. We then sample from this distribution to calculate the p-value for the number of observed mutations of each mutation type and to calculate the confidence interval around the number of expected mutations (see Supplementary Fig. 3).
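The sampling step can be sketched as a Monte Carlo p-value for observing at least the given number of mutations under the Poisson-binomial model (a sketch with far fewer samples than the 10 million used in the article; `poisson_binomial_pvalue` and the example numbers are hypothetical, not Genovo's API):

```python
import numpy as np

def poisson_binomial_pvalue(site_probs, observed, n_samples=10_000, seed=1):
    """Monte Carlo p-value for observing at least `observed` mutations when
    each possible mutation i occurs independently with probability
    site_probs[i] (a Poisson-binomial model)."""
    rng = np.random.default_rng(seed)
    p = np.asarray(site_probs)
    # simulate n_samples replicates; each row is one draw of all sites
    counts = (rng.random((n_samples, p.size)) < p).sum(axis=1)
    # add-one correction keeps the p-value strictly positive
    return (np.count_nonzero(counts >= observed) + 1) / (n_samples + 1)

# e.g., 500 possible LoF mutations in a transcript, each with
# probability ~1e-3, and 5 observed LoF mutations
pval = poisson_binomial_pvalue([1e-3] * 500, observed=5)
```

The empirical quantiles of the same simulated counts give the confidence interval around the expected number of mutations.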
Scaling mutation rates
The mutation rates output by kmerPaPa correspond to the number of mutations expected in the number of individuals that we have data from. To turn these estimates into rates per generation, we multiply all SNV rates by a factor so that the average mutation rate becomes 1.28 × 10^{−8} per base per generation^{26}, and we scale indel rates to an average rate of 0.68 × 10^{−9} per base per generation^{27}.
In Genovo, these mutation rates should then be scaled to reflect the number of generations that the observed mutations correspond to. To avoid mutation probabilities above one for data sets with many samples, we use the following formula to scale the rate per base per generation, r, to the probability that the mutation has occurred in n_{gen} observed generations:
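Treating each generation as an independent chance for the mutation to occur gives the standard form, which stays below one for any n_{gen}:

```latex
p = 1 - (1 - r)^{n_{\mathrm{gen}}}
```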
For the gnomAD analysis, where we do not have de novo mutations from a certain number of trios but segregating variants accumulated over an unknown number of generations, we set n_{gen} so that the expected number of synonymous mutations fits the observed number of synonymous mutations in the whole exome.
Correction for coverage
Coverage bias in exome sequencing data results in fewer sequenced individuals for some transcripts compared to others in the gnomADv2 data. This means that the expected number of mutations predicted by Genovo will be too high for genes with low coverage. To correct for this, we fit a lowess curve, using the statsmodels package^{28} in python, to a scatter plot of observed/expected synonymous values (y axis) vs. the number of called samples (x axis) for each transcript (see Supplementary Fig. 6). The number of called samples for a transcript was estimated as the average AN (from the gnomADv2 vcf file) over all variants in the transcript. In the lowess function, we used 0.05 as the fraction of the data used when estimating each y value. Based on this inference, we made a cutoff where we only kept transcripts with an average AN value above 50,000. For each transcript, we then divided the expected variant count by the corresponding lowess estimate for all variant classes to get a coverage-corrected expectation.
Running Genovo
To produce the expected variant count for each transcript, we ran Genovo using hg19 and gencode v. 19 (the same as used for gnomADv2) with a scaling factor of 2.2 × 10^{−7}. Then we applied the coverage correction as described above. For the comparison with denovolyzeR (version 0.2.0) shown in Fig. 5, we used the VEP annotations provided in “gnomad.exomes.r2.1.1.sites.vcf” to get the observed number of variants in each functional class.
Of the 4293 trios in the DDD data set, 3664 contain at least one exome mutation. When splitting the data into test and training data sets, we put half of the 3664 individuals with a mutation in each group. When calculating p-values using Genovo, we used 10 million random samples from the null model. For both Genovo and denovolyzeR, we do Bonferroni correction of the training set p-values based on the number of genes that have at least one observed LoF mutation in the training data (350 genes). For each gene, we only test the transcript used by denovolyzeR when running Genovo.
Calculation of Genovo LOEUF score
When calculating the “loss-of-function observed/expected upper bound fraction” (LOEUF score)^{5} from the output of the Genovo software, we divide the number of observed LoF variants by the lower bound of the 90% confidence interval of the expected number of LoF variants. Like the other expected values, this lower bound used in the denominator was also adjusted for coverage differences between transcripts, as described above in the section Correction for coverage. As there were many transcripts with zero observed mutations, we added a pseudo count of 0.5 to the observed and expected LoF counts when calculating the LoF O/E ratio.
gnomAD and Genovo comparison
We used the precalculated LoF O/E ratios and LOEUF scores provided in the publicly available data from the gnomAD consortium (gnomad.v2.1.1.lof_metrics.by_transcript.txt.bgz) to compare O/E ratios and LOEUF scores between gnomAD and Genovo. We only compared transcripts present in both the gnomAD and Genovo data sets. We then split the transcripts into deciles and counted how many transcripts fell in each decile. We did this for each model—gnomAD or Genovo—and visually compared the two for three different groups of genes: genes deemed essential according to two lists (mouse gene knockout experiments and cellular inactivation assays) and haploinsufficient genes. We used the same curated gene lists as the gnomAD paper^{5}, downloaded from https://github.com/macarthurlab/gene_lists: haploinsufficient genes according to the ClinGen Dosage Sensitivity Map (as of 13 Sep 2018)^{17}, genes deemed essential in mouse^{29,30,31}, genes deemed essential in human^{32}, and genes deemed essential and nonessential in human by CRISPR screening^{33}.
We conducted the ROC AUC analysis using the sklearn.metrics.roc_curve function from the scikit-learn module^{34} in Python. For each pair of AUC scores obtained for the two prediction scores (gnomAD and Genovo), we also conducted a DeLong test^{35} to assess whether the AUC scores were significantly different. To conduct this test, we used Python code adapted from https://github.com/Netflix/vmaf/.
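The AUC computation can be sketched with scikit-learn as below; here roc_auc_score is used, which is equivalent to integrating the curve returned by roc_curve. The labels and scores are toy values, and negating the score so that higher values predict essentiality is an assumption of this sketch (lower LOEUF means more constrained).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy data (illustrative, not from the paper): 1 = essential gene.
essential = np.array([0, 0, 1, 1])
# Constraint scores such as LOEUF, where lower means more constrained;
# negate so that larger values predict the positive class.
loeuf = np.array([0.9, 0.6, 0.65, 0.2])
auc = roc_auc_score(essential, -loeuf)
```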
Correlation between nonsense variants and frameshift variants
To calculate the correlations between the number of nonsense mutations and frameshift indels mentioned in the introduction, we used two different data sets. For polymorphic variants, we calculated the correlation based on gnomAD v3. For de novo variants, we used the de novo mutations reported in Halldorsson et al.^{19}. In each case, we calculated the Pearson correlation between the number of nonsense variants per coding site in each gene and the number of frameshift indels per coding site in each gene.
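The per-gene correlation can be sketched as below; the per-site rates are hypothetical numbers chosen for illustration, and the plain-Python implementation of Pearson's r stands in for whatever library routine was actually used.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient (plain-Python sketch)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical per-gene rates: nonsense variants and frameshift indels
# per coding site for four genes.
nonsense_rate = [0.01, 0.03, 0.02, 0.05]
frameshift_rate = [0.008, 0.025, 0.018, 0.04]
r = pearson(nonsense_rate, frameshift_rate)
```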
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
All de novo mutation data sets used as training data are publicly available. Two of them can be found as supplementary tables in the articles reporting them^{19,20}. Two are available via links to external files provided in the articles^{21,24}. The last two data sets^{22,23} can be downloaded from denovo-db^{36,37}. The gnomAD data can be downloaded from https://gnomad.broadinstitute.org. De novo mutations from the Deciphering Developmental Disorders consortium^{16} were downloaded from Supplementary Table S1 of the article^{16}. The strict mask from the 1000 Genomes Project can be downloaded here: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/working/20160622_genome_mask_GRCh38/StrictMask/20160622.allChr.mask.bed. The mutation rate models and predicted gene constraints generated in this study are provided as Supplementary Data. The kmerPaPa models trained on even-numbered chromosomes are available as Supplementary Data 1. The kmerPaPa models trained on all chromosomes are available as Supplementary Data 2. These models, formatted as input files to Genovo, can furthermore be found at https://github.com/BesenbacherLab/Genovo_Input. The predicted numbers of variants of each functional type for each GENCODE v19 transcript used for the gnomAD comparison are in Supplementary Data 3.
Code availability
kmerPaPa can be installed from https://pypi.org/project/kmerpapa/ and the source code is available at https://github.com/BesenbacherLab/kmerPaPa. Genovo can be installed from https://crates.io/crates/genovo and the source code is available at https://github.com/BesenbacherLab/genovo. kmer_counter can be installed from https://pypi.org/project/kmercounter/ and the source code is available at https://github.com/BesenbacherLab/kmer_counter. The adapted Python code used to calculate the DeLong test comparing ROC AUCs is available at https://github.com/BesenbacherLab/ROCutils.
References
Li, H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987–2993 (2011).
Li, H. & Durbin, R. Inference of human population history from individual whole-genome sequences. Nature 475, 493–496 (2011).
Nielsen, R. et al. Genomic scans for selective sweeps using SNP data. Genome Res. 15, 1566–1575 (2005).
MacArthur, D. G. et al. Guidelines for investigating causality of sequence variants in human disease. Nature 508, 469–476 (2014).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Hodgkinson, A. & Eyre-Walker, A. Variation in the mutation rate across mammalian genomes. Nat. Rev. Genet. 12, 756–766 (2011).
Ehrlich, M. & Wang, R. Y. 5-Methylcytosine in eukaryotic DNA. Science 212, 1350–1357 (1981).
Rideout, W. M. 3rd, Coetzee, G. A., Olumi, A. F. & Jones, P. A. 5-Methylcytosine as an endogenous mutagen in the human LDL receptor and p53 genes. Science 249, 1288–1290 (1990).
Kong, A. et al. Rate of de novo mutations and the importance of father’s age to disease risk. Nature 488, 471–475 (2012).
Carlson, J. et al. Extremely rare variants reveal patterns of germline mutation rate heterogeneity in humans. Nat. Commun. 9, 3753 (2018).
Samocha, K. E. et al. A framework for the interpretation of de novo mutation in human disease. Nat. Genet. 46, 944–950 (2014).
Aggarwala, V. & Voight, B. F. An expanded sequence context model broadly explains variability in polymorphism levels across the human genome. Nat. Genet. 48, 349–355 (2016).
Montgomery, S. B. et al. The origin, evolution, and functional impact of short insertion-deletion variants identified in 179 human genomes. Genome Res. 23, 749–761 (2013).
Murat, P., Guilbaud, G. & Sale, J. E. DNA polymerase stalling at structured DNA constrains the expansion of short tandem repeats. Genome Biol. 21, 209 (2020).
Ware, J. S., Samocha, K. E., Homsy, J. & Daly, M. J. Interpreting de novo variation in human disease using denovolyzeR. Curr. Protoc. Hum. Genet. 87, 7.25.1–7.25.15 (2015).
Deciphering Developmental Disorders Study. Prevalence and architecture of de novo mutations in developmental disorders. Nature 542, 433–438 (2017).
Rehm, H. L. et al. ClinGen–the clinical genome resource. N. Engl. J. Med. 372, 2235–2242 (2015).
Lam, S. K., Pitrou, A. & Seibert, S. Numba: a LLVM-based Python JIT compiler. in Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC 1–6 (Association for Computing Machinery, 2015).
Halldorsson, B. V. et al. Characterizing mutagenic effects of recombination through a sequence-level genetic map. Science 363, eaau1043 (2019).
Goldmann, J. M. et al. Parent-of-origin-specific signatures of de novo mutations. Nat. Genet. 48, 935–939 (2016).
Francioli, L. C. et al. Genome-wide patterns and properties of de novo mutations in humans. Nat. Genet. 47, 822–826 (2015).
Yuen, R. K. C. et al. Whole genome sequencing resource identifies 18 new candidate genes for autism spectrum disorder. Nat. Neurosci. 20, 602–611 (2017).
Turner, T. N. et al. Genomic patterns of de novo mutation in simplex autism. Cell 171, 710–722.e12 (2017).
Sasani, T. A. et al. Large, three-generation human families reveal postzygotic mosaicism and variability in germline mutation accumulation. Elife 8, e46922 (2019).
1000 Genomes Project Consortium. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Jónsson, H. et al. Parental influence on human germline de novo mutations in 1,548 trios from Iceland. Nature 549, 519–522 (2017).
Kloosterman, W. P. et al. Characteristics of de novo structural changes in the human genome. Genome Res. 25, 792–801 (2015).
Seabold, S. & Perktold, J. Statsmodels: econometric and statistical modeling with python. in Proceedings of the 9th Python in Science Conference (SciPy, 2010). https://doi.org/10.25080/majora-92bf1922-011.
Blake, J. A. et al. The Mouse Genome Database (MGD): premier model organism resource for mammalian genomics and genetics. Nucleic Acids Res. 39, D842–D848 (2011).
Georgi, B., Voight, B. F. & Bućan, M. From mouse to human: evolutionary genomics analysis of human orthologs of essential genes. PLoS Genet. 9, e1003484 (2013).
Liu, X., Jian, X. & Boerwinkle, E. dbNSFP v2.0: a database of human nonsynonymous SNVs and their functional predictions and annotations. Hum. Mutat. 34, E2393–E2402 (2013).
Hart, T., Brown, K. R., Sircoulomb, F., Rottapel, R. & Moffat, J. Measuring error rates in genomic perturbation screens: gold standards for human functional genomics. Mol. Syst. Biol. 10, 733 (2014).
Hart, T. et al. Evaluation and design of genome-wide CRISPR/SpCas9 knockout screens. G3 7, 2719–2727 (2017).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
DeLong, E. R., DeLong, D. M. & Clarke-Pearson, D. L. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44, 837–845 (1988).
denovo-db, Seattle, WA (URL: denovo-db.gs.washington.edu), accessed January 2019.
Turner, T. N. et al. denovo-db: a compendium of human de novo variants. Nucleic Acids Res. 45, D804–D811 (2017).
Acknowledgements
This work was funded by a Sapere Aude–Starting Grant (no. 418100490) (to S.B.) from the Independent Research Fund Denmark.
Author information
Contributions
S.B. designed and implemented the kmerPaPa software. J.B. designed and implemented the Genovo software. S.B. ran the kmerPaPa software to produce mutation rate models and the Genovo software to produce predicted rates per mutation type per transcript. A.K. performed coverage correction and the gnomAD comparison analysis. S.B. and A.K. wrote the manuscript. All authors reviewed and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Rayan Chikhi and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Bethune, J., Kleppe, A. & Besenbacher, S. A method to build extended sequence context models of point mutations and indels. Nat Commun 13, 7884 (2022). https://doi.org/10.1038/s41467-022-35596-5