Extremely rare variants reveal patterns of germline mutation rate heterogeneity in humans

A detailed understanding of the genome-wide variability of single-nucleotide germline mutation rates is essential to studying human genome evolution. Here, we use ~36 million singleton variants from 3560 whole-genome sequences to infer fine-scale patterns of mutation rate heterogeneity. Mutability is jointly affected by adjacent nucleotide context and diverse genomic features of the surrounding region, including histone modifications, replication timing, and recombination rate, sometimes suggesting specific mutagenic mechanisms. Remarkably, GC content, DNase hypersensitivity, CpG islands, and H3K36 trimethylation are associated with both increased and decreased mutation rates depending on nucleotide context. We validate these estimated effects in an independent dataset of ~46,000 de novo mutations, and confirm our estimates are more accurate than previously published results based on ancestrally older variants without considering genomic features. Our results thus provide the most refined portrait to date of the factors contributing to genome-wide variability of the human germline mutation rate.

For the 3,716 individuals that passed our initial sample-level filters, we summarized the per-sample 32 distribution of extremely rare variants (ERVs) across 3-mer subtypes and used this information to flag 33 individuals that showed abnormal patterns of variation indicative of systematic sequencing errors or 34 batch effects. In brief, we adapted the non-negative matrix factorization (NMF) technique described by 35 Lawrence et al. 1 to deconvolute the 3-mer mutation spectra as a composite of 3 distinct "signatures." 36 Assuming the population has been susceptible to the same mutation processes over the timespan in 37 which ERVs have accumulated, we expect that the relative contribution of the 3 NMF signatures is doubletons in the BRIDGES sample and hence present in the population at a higher frequency (i.e., 55 having arose further in the past) than the average singleton, so retaining these ambiguous variants 56 might inadvertently affect the distribution of variants. 57 58 We estimate the false discovery rate among BRIDGES ERVs using the following method.

Supplementary Note 2. Estimation of false discovery rate by Ts/Tv statistics
Assuming the true Ts/Tv ratio ( ) is between 2.0 and 2.1, by this calculation we estimate a false 71 discovery rate of 0.1-2.9% among the BRIDGES ERVs. 72 with standard filtering protocols 5 . We therefore assume that most motif-specific errors are filtered by the 86 default strand-bias and quality filters used in our variant calling pipeline, and any undetected errors 87 have a negligible impact on our calculation of relative mutation rates and downstream analyses. 88 89 We expect the majority of ERVs in our data are mapped with high confidence, as the pre-filtering steps 90 in our variant calling pipeline remove sites occurring on reads with average phred-scaled mapping 91 quality score (MQ) <20 and/or where more than 10% of reads were ambiguously mapped (MQ0>10). 92

Mapping error
This filtering strategy is similar to the filters employed by other large-scale sequencing projects that 93 have demonstrated well-controlled error rates among singleton calls 6,7 . Because mapping errors are 94 more likely to occur in highly-repetitive regions, such as centromeric and pericentromeric loci 8 , including 95 these regions in our analyses might bias our estimates of motif-specific mutation rates and/or the 96 impact of genomic features. However, excluding these regions entirely might have detrimental side 97 effects: dropping ERVs in these regions will reduce the precision of our estimates, and removing hard-98 to-map regions might preclude our ability to assess mutation patterns unique to these regions, as they 99 may have many levels of heterogeneous overlap with genomic features. 100 To determine if excluding repeat-rich regions systematically influenced our inferred rates, we compared 101 the 7-mer relative mutation rates estimated from the full, unfiltered set of ERVs with 7-mer rates 102 estimated if we only count ERVs and reference motifs within the 1000 Genomes strict accessibility 103 mask, which delineates the most uniquely mappable regions of the genome (covering ~72% of non-N 104 bases). These two sets of estimates were very well-correlated: within-type correlations were >0.96, 105 indicating the estimated rates were highly consistent regardless of whether hard-to-map regions were 106 removed (Supplementary Fig. 6a). Moreover, subtypes with larger differences between the two 107 estimates tended to have fewer ERVs (Supplementary Fig. 6b), suggesting that most observed 108 discrepancies might simply be an artifact of reduced precision among rare mutation classes. 109 When we applied the masked rates to predict the set of de novo mutations, we found these estimates 110 had worse predictive performance than the unmasked estimates ( Table 1). This result leads us to 111 conclude that aggressively filtering for the highest-confidence call set comes at a cost of substantially 112 reducing the precision of the relative mutation rate estimates, and potentially causing greater bias by 113 ignoring the information captured by ERVs in the masked regions. Although we cannot entirely exclude 114 the possibility of mapping error biases among the unmasked estimates, the benefits of having more 115 numerous singletons across more contiguous genomic regions in the unmasked data outweigh the 116 concerns about errors caused by poor mapping quality. 117

118
While most singletons in the BRIDGES sample are the true derived allele, population genetic theory 119 suggests that <1/N=0.014% of singletons in a sample are the ancestral allele, and hence subject to the 120 same evolutionary biases we wish to avoid. These mispolarized singletons may be hard to detect, as 121 we expect ~0.25% of all singletons to carry the same allele in human and chimpanzee due to parallel 122 mutations that have occurred since splitting from a common ancestor. Intuitively, these parallel 123 mutations are especially likely to occur in hypermutable loci, so removing the 0.25% "ancestral" alleles 124 created by parallel mutation may create a bigger bias than including the 0.015% truly ancestral alleles. 125 To understand the impact of removing all putatively ancestral alleles, we used an ancestral genome 126 inferred by 6-way primate alignment 9 to annotate each allele with the putative ancestral state. We 127 identified 363,705 singletons (~1% of all singletons) where the alternative allele was the same as the 128 ancestral allele, and recalculated 7-mer relative mutation rates after removing these putatively 129 mispolarized singletons. We found that this polarization filter did not strongly affect estimated rates: 130 across all types combined as well as within each type, the rates before and after removal of these sites 131 were nearly perfectly correlated (Spearman's r>0.999). Further, we found that only 9 of the 24,576 7-132 mer rates differed significantly after applying this filter, and the re-estimated rates for these 9 subtypes 133 differed from the original rates by no more than 10%. More importantly, 8 of these 9 subtypes were 134 hypermutable CpG>TpG subtypes, consistent with our intuition that many putatively mispolarized sites 135 are in fact parallel mutations in the human and chimpanzee lineages. 136 As a final analysis of the potential effects of mispolarization on our estimates, we applied these filtered 137 rates to predict the GoNL/ITMI de novo mutations in the same logistic regression framework used to 138 compare other estimation strategies. Goodness-of-fit statistics indicated that the filtered rates predicted 139 de novo mutations better than 7-mer rates estimated without the polarization filter (ΔAIC=298). 140 However, when comparing goodness-of-fit between type-specific models, these differences largely 141 disappeared, with seven types showing negligible differences in AIC (ΔAIC < 7), and the unfiltered 142 rates had lower AIC for three of these (non-CpG C>T, CpG>GpG, and CpG>ApG). Only two types had 143 differences in AIC greater than 10: A>T types were predicted slightly better by the filtered rates 144 (ΔAIC=16), but CpG>TpG types were predicted better by the unfiltered rates (ΔAIC=22), suggesting the 145 accuracy of the filtered rates is particularly affected by parallel mutations at hypermutable CpG sites. 146 Given this lack of consistent type-specific improvement when applying the polarization filter, we 147 performed all subsequent analyses using the full set of 35.6 million ERVs. 148

149
A potential concern with comparisons between our ERV-derived mutation rate estimates and 150 Aggarwala and Voight's 1000G-based estimates 10 is that discrepancies might be partially attributable to 151 technical differences between the two samples, not necessarily because the 1000G estimates are 152 based on ancestrally older SNVs. For a more direct comparison, we curated a set of higher-frequency 153 SNVs found in the BRIDGES data, removing the possibility that the dissimilar estimates are a result of 154 differences in sequencing platform, variant calling, QC methods, and sampled individuals. 155 Aggarwala and Voight's mutation rate estimates are based on 7,051,667 intergenic variants observed 156 in N=379 Europeans from the 1000 Genomes Phase I study 10 . Aggarwala and Voight do not state the 157 exact site frequency spectrum for the European intergenic variants, but claim 26% of intergenic variants 158 in the 1000G Phase I African sample are singletons or doubletons 10 . Thus, it is reasonable to assume 159 that >80% of European intergenic SNVs in the 1000G data occur at a frequency greater than 160 1/(379*2)=0.0013 (i.e., the sample MAF of a singleton in the 1000G sample). To obtain SNVs in the 161 BRIDGES sample in a frequency range comparable to this, we selected all SNVs with a minor allele 162 count ≥10 (MAF≥0.0014). We identified 12,088,037 MAC10+ variants in our data, from which we 163 estimated 7-mer relative mutation rates. We compared these estimates to 1) a set of ERV-derived 7-164 mer estimates calculated after randomly downsampling to an equivalent number (12,088,037 ERVs), 165 and 2) the 1000G estimates. These comparisons show that the MAC10+ estimates are more closely 166 correlated with the 1000G estimates (Supplementary Fig. 3) than with the downsampled ERV-derived 167 estimates (Supplementary Fig. 4). We also used the MAC10+ estimates to predict the GoNL/ITMI de 168 novo mutations, and found that this model tended to perform comparably to the 1000G model 169 (Supplementary Table 5

Estimated Relative
Mutation Rate c.

Supplementary Figure 1
High-resolution heatmaps of relative mutation rates for mutation subtypes up to a 7-mer resolution, estimated from the BRIDGES ERVs (a) estimates for 3-mer mutation subtypes. (b) estimates for 5-mer mutation subtypes. (c) estimates for 7-mer mutation subtypes. Each cell delineates a subtype defined by the upstream sequence (y-axis) and downstream sequence (x-axis) from the central (mutated) nucleotide.

a.
b. c.

Supplementary Figure 3
Comparison of 7-mer relative mutation rates estimated from BRIDGES MAC10+ variants and 1000G Intergenic SNVs (a) Scatterplot of 7-mer subtype rates estimated from the BRIDGES MAC10+ data (x-axis), and 1000G intergenic SNV data (y-axis) (b) Type-specific 2D-density plots, as situated in the scatterplot of a. The dashed line indicates an expected least-squares regression line if there is no bias present. (c) Heatmap shows ratio between relative mutation rates calculated on MAC10+ variants and 1000G variants for each 7-mer mutation subtype. Subtypes with higher 1000G-derived rates relative to MAC10+-derived rates are shaded gold, and subtypes with lower 1000G-derived rates relative to MAC10+-derived rates are shaded green. 1000G-derived rates shown here are scaled relative to the MAC10+-derived rates.
a. b. c.

Supplementary Figure 4
Comparison of 7-mer relative mutation rates estimated from BRIDGES ERVs and BRIDGES MAC10+ variants (a) Scatterplot of 7-mer subtype rates estimated from the BRIDGES ERV data, after randomly downsampling the ERVs to 12,088,037 (x-axis) and the BRIDGES MAC10+ data (y-axis). (b) Type-specific 2D-density plots, as situated in the scatterplot of a. The dashed line indicates an expected least-squares regression line if there is no bias present. (c) Heatmap shows ratio between relative mutation rates calculated on MAC10+ variants and ERVs for each 7-mer mutation subtype. Subtypes with higher MAC10+-derived rates relative to ERV-derived rates are shaded gold, and subtypes with lower MAC10+-derived rates relative to ERV-derived rates are shaded green.

a.
b.

Supplementary Figure 6
Genome-wide estimates for ERV-based 7-mer subtypes are consistent with estimates from ERVs restricted to uniquely-mappable regions (a) Relationship between masked and unmasked 7-mer relative mutation rate estimates, separated by type. (b) Relationship between number of ERVs per subtype (x axis) and discordance between the masked and unmasked rates, measured as the log ratio between the estimates (y axis).

Distributions of effect sizes on mutability for 14 genomic features and depth of sequencing
For each feature, we plotted the empirical distributions of the subtype-specific odds ratios for each basic mutation type, as estimated by our logistic regression models. *Replication timing is coded with negative values indicating later replicating regions, so an OR<1 means mutation rate increases in latereplicating regions. a. b.

Supplementary Table 2 t-tests for differences in mean 1000G/ERV ratio of GC-poor vs. GCrich 7-mer motifs
For each mutation subtype, we calculated the ratio between 1000G-derived and ERV-derived relative mutation rates. Then, for each of the 9 basic types, we grouped 7-mer subtypes into low C/G subtypes (≤3 C/G bases in the +/-3 flanking positions) and high C/G subtypes (≥4 C/G bases in the +/-3 flanking positions) and performed t-tests for differences in the mean 1000G/ERV ratios of these two groups.