replying to L. Wang et al. Nature https://doi.org/10.1038/s41586-023-06314-y (2023)
Wang and colleagues1 argue that our report2 of lower mutation rates in gene bodies, essential genes and regions marked by H3K4me1 must result from DNA sequencing errors. We appreciate the issues raised by them and by other colleagues3. Although we overlooked some sources of errors, these are insufficient to invalidate our conclusions, which are confirmed by more stringent reanalyses of our original data, new analyses restricted to high-confidence germline mutations4, and direct demonstration of plant DNA repair proteins being recruited to gene bodies, essential genes and H3K4me1, where they reduce local mutation rates5,6.
Wang and colleagues1 identify issues with somatic mutation calling, suggesting that homopolymer bleed-through errors in Illumina sequencing are responsible for patterns observed in somatic mutations, and that elevated cytosine deamination in transposable elements is responsible for the patterns in germline mutations. Here we address these concerns.
Consecutive runs of identical nucleotides, or homopolymers, pose challenges to discovering rare mutations because they can lead to Illumina sequencing errors at immediately neighbouring nucleotides through homopolymer bleed-through7. At the same time, homopolymer regions have higher true mutation rates even at local but non-adjacent sites (for example, ref. 8). Wang and colleagues1 found that the distribution of simulated homopolymer errors mirrors the overall distribution of mutations we reported around genes (their Fig. 1a). However, there are several reasons why such homopolymer errors cannot be the source of inferred mutation bias.
Wang and colleagues1 assume that homopolymer bleed-through errors affect sequences up to five nucleotides away from homopolymers, although on modern Illumina platforms these errors occur at positions immediately adjacent to a run of identical bases7. Moreover, their simulation of sequencing errors apparently assumes that 100% of sequencing errors result from homopolymer bleed-through. By contrast, empirical estimates attribute only 0.7 to 5.2% of sequencing errors to homopolymer bleed-through7. Across all data in our study, only 12.0% of total single-nucleotide variant calls (10.2% for high-quality germline calls) could be potential homopolymer-adjacent bleed-through errors; on their own, these cannot explain the approximately 50% reduction in mutation rate that we observed in gene bodies relative to intergenic regions2.
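The arithmetic behind this bound can be explored with a minimal sketch. It assumes a deliberately extreme scenario that is not taken from the study: true mutation rates are identical in gene bodies and intergenic regions, and every potential error falls in intergenic sequence. The function name and the intergenic fraction of the genome are hypothetical placeholders, and the result is sensitive to that fraction.

```python
def max_apparent_ratio(error_fraction: float, intergenic_fraction: float) -> float:
    """Genic/intergenic rate ratio that errors alone could produce.

    Assumptions (illustrative only): true per-site rates are equal in both
    regions, and all erroneous calls land in intergenic sequence.
    error_fraction      -- share of all variant calls that are errors
    intergenic_fraction -- share of analysed sites that are intergenic
    """
    # Errors per true call, concentrated on the intergenic share of sites.
    inflation = error_fraction / ((1.0 - error_fraction) * intergenic_fraction)
    return 1.0 / (1.0 + inflation)


# 0.12 is the proportion of potential bleed-through calls quoted in the text;
# 0.5 is a hypothetical intergenic fraction, NOT a value from the study.
ratio = max_apparent_ratio(0.12, 0.5)
print(f"apparent genic/intergenic ratio: {ratio:.2f}")
print(f"apparent reduction in gene bodies: {1.0 - ratio:.0%}")
```

With these placeholder inputs the ratio is 11/14, or about 0.79, which corresponds to an apparent reduction of roughly 21% even in this worst case.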
More importantly, Wang and colleagues’ own analysis1 reports that the proportion of potential homopolymer bleed-through errors in our data is actually higher in gene bodies (exons plus introns), which should lead to gene body mutation rates being overestimated, not underestimated. We confirm across our datasets that the proportion of potential homopolymer bleed-through errors is not lower in gene bodies (Fig. 1a, left), and differs from the pattern of mutation calls (Fig. 1a, right). Similarly, the proportion of potential homopolymer bleed-through errors is not reduced in essential genes (Fig. 1b). The distribution of potential homopolymer bleed-through errors, therefore, disagrees with the hypothesis of Wang and colleagues1. By contrast, the observed pattern is expected if gene bodies and essential genes experienced a reduction in true mutation rates, as noise introduced by sequencing errors should have a proportionally larger effect on regions with truly low mutation rates.
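The last point can be illustrated with a small sketch. It assumes, purely for illustration, that sequencing errors occur at the same per-site rate everywhere while the true mutation rate differs between regions. All numbers are hypothetical and are not estimates from the study.

```python
def error_share(true_rate: float, error_rate: float) -> float:
    """Fraction of variant calls in a region that are errors, given the
    per-site true mutation rate and per-site error rate in that region."""
    return error_rate / (true_rate + error_rate)


# Hypothetical per-site rates in arbitrary units (illustration only).
ERROR_RATE = 1.0        # assumed identical in every region
INTERGENIC_TRUE = 8.0   # assumed true rate outside gene bodies
GENIC_TRUE = 4.0        # assumed true rate in gene bodies (half as high)

print(f"intergenic error share: {error_share(INTERGENIC_TRUE, ERROR_RATE):.2f}")
print(f"gene body error share:  {error_share(GENIC_TRUE, ERROR_RATE):.2f}")
```

Under these assumptions the error share is 1/9 (about 0.11) in intergenic regions and 1/5 (0.20) in gene bodies, so a uniform error rate yields a higher proportion of errors wherever the true mutation rate is lower.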
Homopolymeric sequences (but not potential homopolymer bleed-through errors) are enriched outside gene bodies, as reported by Wang and colleagues1. The observed mutation rate heterogeneity is therefore consistent with previous evidence that homopolymer-rich regions have higher true mutation rates8. The enrichment of homopolymers outside gene bodies is, in turn, the expected long-term evolutionary consequence of lower DNA repair activity, as the expansion of homopolymers is a signature of lower mismatch repair activity (Supplementary Note 3). Moreover, both preferential repair of exons by mismatch repair and higher intronic mutation rates in somatic tissues have been widely documented (Supplementary Note 3). Likewise, considerable differences in mutation rate and spectra between somatic and germline cells are well known, with somatic cells having orders of magnitude higher mutation rates. Indeed, differences between mitotic and meiotic cells have been previously proposed for Arabidopsis thaliana by Wang and colleagues9 (Supplementary Note 3).
Wang and colleagues1 further suggest that the patterns we observed in germline mutations might result largely from elevated deamination of methylated cytosines (GC-to-AT mutations) in transposable elements. Several findings are inconsistent with this hypothesis: cytosine methylation was included as a covariate in our original models, mutation accumulation experiments consistently indicate that mutation rates are lower in gene bodies relative to non-transposable element intergenic regions in A. thaliana (Fig. 2a,b; see below), and removing all GC-to-AT mutations from our original germline dataset does not alter the observed pattern, with H3K4me1 remaining the strongest epigenomic predictor of lower mutation (described in detail recently4). The same has been demonstrated for mutation rate variation in rice, in which mutation rates are lower in gene bodies relative to both intergenic regions and transposable elements6.
To further address concerns with somatic mutation calls in general, we re-called putative somatic mutations in the original 107 lines10 by mapping reads to an improved reference genome11 and applying more stringent filtering (Supplementary Note 1). This led to more complete and higher-quality read mapping (Supplementary Fig. 1) and resolved several issues described by Wang and colleagues1 (for example, high intron-versus-exon mutation ratio and the proportion of potential homopolymer bleed-through errors; Supplementary Fig. 2). These data confirm that gene bodies experience lower mutation rates, including when manually removing potential homopolymer bleed-through errors (Supplementary Note 1). Many of the analyses by Wang and colleagues are affected by unreliable centromeric mutations, which constituted 41% of questioned somatic mutations1. These sites, however, could not have affected our conclusions because they were excluded from all of our original analyses (Supplementary Note 2 and Supplementary Fig. 3).
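One way to flag potential homopolymer bleed-through errors for removal is sketched below. The sketch is not the filter used in the reanalysis (that is described in Supplementary Note 1). It assumes that a call is suspect when the alternative base matches a run of identical bases immediately next to the variant site. The function name, the minimum run length of three and the choice to check both flanks are hypothetical.

```python
def is_potential_bleed_through(ref_seq: str, pos: int, alt: str, min_run: int = 3) -> bool:
    """Return True if a substitution to `alt` at 0-based `pos` lies directly
    next to a run of at least `min_run` copies of `alt` in the reference.

    The definition and the default `min_run` are illustrative assumptions.
    """
    seq = ref_seq.upper()
    alt = alt.upper()

    # Length of the run of `alt` ending immediately to the left of the site.
    left = 0
    i = pos - 1
    while i >= 0 and seq[i] == alt:
        left += 1
        i -= 1

    # Length of the run of `alt` starting immediately to the right of the site.
    right = 0
    i = pos + 1
    while i < len(seq) and seq[i] == alt:
        right += 1
        i += 1

    return left >= min_run or right >= min_run


# Toy reference: a run of five T bases at positions 3 to 7.
reference = "ACGTTTTTGCA"
print(is_potential_bleed_through(reference, 8, "T"))  # G>T next to the T run
print(is_potential_bleed_through(reference, 8, "A"))  # G>A at the same site
```

In this toy example the G>T call at position 8 is flagged and the G>A call at the same position is not. Checking both flanks is a conservative choice, because it ignores read orientation.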
Wang and colleagues1 examined essential genes with approaches that were not in our original study. They used subsets of our initial datasets, focusing on either about 2,000 germline or about 4,000 somatic single-nucleotide variants, finding that neither dataset directly revealed a statistically significantly lower mutation rate in essential genes. This approach seems underpowered, yielding near-zero values for mutation counts in entire gene classes, an indication that the data are poorly suited for χ² approximation (Supplementary Note 5).
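Why near-zero counts undermine the χ² approximation can be seen from the expected cell counts of a contingency table. The sketch below applies the common rule of thumb that every expected count should be at least five. The table values, the function names and the threshold default are hypothetical, not data from either study.

```python
def expected_counts(table):
    """Expected cell counts under independence of rows and columns."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def chi2_approximation_is_doubtful(table, min_expected=5.0):
    """True if any expected count falls below the rule-of-thumb minimum."""
    return any(e < min_expected for row in expected_counts(table) for e in row)


# Hypothetical counts: rows are gene classes, columns are mutation categories.
sparse_table = [[2, 40], [30, 400]]
dense_table = [[50, 60], [70, 80]]

print(chi2_approximation_is_doubtful(sparse_table))
print(chi2_approximation_is_doubtful(dense_table))
```

The sparse table has an expected count of about 2.8 in its first cell and is therefore flagged, whereas the dense table is not. When a table is flagged, an exact test is the usual alternative to the χ² approximation.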
In our study2, we had instead modelled genome-wide mutation rates, and using these models, identified a connection between gene essentiality and mutation rate corresponding to epigenome differences—essential genes are enriched for H3K4me1, for example, which we found to be associated with lower mutation rates. We subsequently tested whether this expectation is met in a very large set of several hundred thousand loosely filtered putative somatic mutations with ample power to compare gene classes. We agree that somatic mutation calling is very difficult, as most real somatic mutations and unrepaired damaged sites (with DNA damage occurring 10,000 to 100,000 times per day per cell; Supplementary Note 3) are expected to be present in only one cell and thus detectable only by a single read. In Supplementary Note 4 and an accompanying Correction12, we discuss why singletons were called as putative mutations in one of our reanalyses, from 64 leaves13, owing to inadvertently mapping forward reads twice. However, analyses of variant quality in these data do not support the hypothesis that our results are simply due to higher rates of poor-quality calls in non-genic regions or non-essential genes (Supplementary Note 4 and Supplementary Fig. 4).
Finally, to directly address the possibility that our conclusions reflect unknown sources of bias in inherently uncertain somatic calls, we reanalysed germline mutations from our study2 along with mutation accumulation experiment data generated in several independent studies (Supplementary Table 1). This meta-analysis of >10,000 germline mutations confirmed the previously reported, nearly universal reduction in single-nucleotide mutation rates in gene bodies, essential genes and regions marked by H3K4me1 (Fig. 2a–c; ref. 4). The notable exception comes from plants lacking the mismatch repair protein MSH2 (Fig. 2a; ref. 5). A similar pattern is seen when somatic mutations are called with very stringent criteria in plants deficient for the MSH2 partner MSH6, using a tool specifically designed for rare somatic mutations14 (Fig. 2d). This was as predicted from H3K4me1 directly attracting MSH6 to gene bodies6, confirming that DNA repair in A. thaliana is targeted to gene bodies, as is well known in humans (Supplementary Note 3). In addition, analyses of >43,000 experimentally induced de novo germline mutations in rice (previously validated with 99% accuracy) also show that gene bodies, conserved genes and H3K4me1-marked regions experience lower mutation rates, even when considering only silent (synonymous) mutations6.
Relationships between histone modifications, DNA repair and mutation rate are widely known (Supplementary Note 3). Our work2 considered the evolutionary implication of these relationships. We had leveraged models of the drift-barrier hypothesis to discover that natural selection could favour mechanisms linking DNA repair to widely distributed epigenomic features, such as H3K4me1, which is not only enriched in gene bodies and essential genes in A. thaliana, but is also the histone modification most strongly associated with lower mutation rates in our data2. An important higher-order test of our conclusions is therefore whether they are mechanistically supported. Since publication of our article2, it has been demonstrated that plant DNA repair proteins are recruited by H3K4me1 to gene bodies and essential genes. These repair proteins, which contain Tudor ‘reader’ domains that bind H3K4me1, include PDS5C, involved in homology-directed repair, and MSH6, which functions as a dimer with MSH2 in the mismatch repair pathway and recruits MutY of the base-excision repair pathway15. The genome-wide distribution of PDS5C, as measured by chromatin immunoprecipitation followed by sequencing4,6,16, confirms that regions subject to elevated repair protein activity coincide with features at which we detected lower spontaneous mutation rates4,6,16.
We conclude that the reported relationships between epigenomic features and mutation rates2 are well supported mechanistically (Fig. 2e). We agree that there are issues and inherent uncertainties with somatic mutation calling, which make it difficult to know the accuracy of individual calls in the very large set of loosely filtered somatic variants2. However, the proposal that the observed patterns result only from sequencing errors is inconsistent with multiple lines of evidence from the original study, independent analyses and emerging parallel work.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The TAIR10 A. thaliana reference genome can be found at https://arabidopsis.org/download. The more recent, improved A. thaliana reference genome can be found at https://github.com/schatzlab/Col-CEN. Sequencing reads for 107 A. thaliana mutation accumulation lines are stored in the National Center for Biotechnology Information Sequence Read Archive, accession number SRP133100. Additional mutation datasets were downloaded from publications cited in Supplementary Table 1.
Code availability
Code for this work uses functions maintained in https://github.com/greymonroe/polymorphology, with additional scripts and data for analyses and figures in https://github.com/greymonroe/mutation_bias_analysis2.
References
1. Wang, L., Ho, A. T., Hurst, L. D. & Yang, S. Re-evaluating evidence for adaptive mutation rate variation. Nature https://doi.org/10.1038/s41586-023-06314-y (2023).
2. Monroe, J. G. et al. Mutation bias reflects natural selection in Arabidopsis thaliana. Nature 602, 101–105 (2022).
3. Liu, H. & Zhang, J. Is the mutation rate lower in genomic regions of stronger selective constraints? Mol. Biol. Evol. 39, msac169 (2022).
4. Monroe, J. G. et al. Report of mutation biases mirroring selection in Arabidopsis thaliana unlikely to be entirely due to variant calling errors. Preprint at bioRxiv https://doi.org/10.1101/2022.08.21.504682 (2022).
5. Belfield, E. J. et al. DNA mismatch repair preferentially protects genes from mutation. Genome Res. 28, 66–74 (2018).
6. Quiroz, D. et al. The H3K4me1 histone mark recruits DNA repair to functionally constrained genomic regions in plants. Preprint at bioRxiv https://doi.org/10.1101/2022.05.28.493846 (2022).
7. Stoler, N. & Nekrutenko, A. Sequencing error profiles of Illumina sequencing instruments. NAR Genom. Bioinform. 3, lqab019 (2021).
8. Tran, H. T., Keen, J. D., Kricker, M., Resnick, M. A. & Gordenin, D. A. Hypermutability of homonucleotide runs in mismatch repair and DNA polymerase proofreading yeast mutants. Mol. Cell. Biol. 17, 2859–2865 (1997).
9. Yang, S. et al. Parent–progeny sequencing indicates higher mutation rates in heterozygotes. Nature 523, 463–467 (2015).
10. Weng, M.-L. et al. Fine-grained analysis of spontaneous mutation spectrum and frequency in Arabidopsis thaliana. Genetics 211, 703–714 (2019).
11. Naish, M. et al. The genetic and epigenetic landscape of the Arabidopsis centromeres. Science 374, eabi7489 (2021).
12. Monroe, J. G. et al. Author Correction: Mutation bias reflects natural selection in Arabidopsis thaliana. Nature https://doi.org/10.1038/s41586-023-06387-9 (2023).
13. Wang, L. et al. The architecture of intra-organism mutation rate variation in plants. PLoS Biol. 17, e3000191 (2019).
14. Kim, S. et al. Strelka2: fast and accurate calling of germline and somatic variants. Nat. Methods 15, 591–594 (2018).
15. Gu, Y. et al. Human MutY homolog, a DNA glycosylase involved in base excision repair, physically and functionally interacts with mismatch repair proteins human MutS homolog 2/human MutS homolog 6. J. Biol. Chem. 277, 11135–11142 (2002).
16. Niu, Q. et al. A histone H3K4me1-specific binding protein is required for siRNA accumulation and DNA methylation at a subset of loci targeted by RNA-directed DNA methylation. Nat. Commun. 12, 3367 (2021).
17. Zhu, X. et al. Non-CG DNA methylation-deficiency mutations enhance mutagenesis rates during salt adaptation in cultured Arabidopsis cells. Stress Biol. 1, 12 (2021).
18. Willing, E.-M. et al. UVR2 ensures transgenerational genome stability under simulated natural UV-B in Arabidopsis thaliana. Nat. Commun. 7, 13522 (2016).
19. Ossowski, S. et al. The rate and molecular spectrum of spontaneous mutations in Arabidopsis thaliana. Science 327, 92–94 (2010).
20. Lu, Z. et al. Genome-wide DNA mutations in Arabidopsis plants after multigenerational exposure to high temperatures. Genome Biol. 22, 160 (2021).
21. Jiang, C. et al. Environmentally responsive genome-wide accumulation of de novo Arabidopsis thaliana mutations and epimutations. Genome Res. 24, 1821–1829 (2014).
22. Belfield, E. J. et al. Thermal stress accelerates Arabidopsis thaliana mutation rate. Genome Res. 31, 40–50 (2021).
Acknowledgements
Research was conducted at the University of California, Davis, which is located on land that was the home of the Patwin people for thousands of years.
Author information
Contributions
J.G.M., K.D.M., W.X., T.S., P.C.-B. and D.W. contributed to data analysis. J.G.M., K.D.M., W.X., T.S., P.C.-B. and D.W. contributed to the writing. J.G.M., K.D.M., W.X., T.S., P.C.-B., C.B., M.L., M.E.-A., M.K., J.H., M.N., D.K., M.-L.W., E.I., J.Å., M.T.R., C.B.F. and D.W. contributed to the interpretation of the results. K.D.M. and W.X., who were not part of the study by J.G.M. et al.2, carried out analyses to validate the impact of an improved genome reference sequence on reducing sequencing errors.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
This file contains Supplementary Table 1, Notes 1–5 (with Figs 1–4) and References.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Monroe, J.G., Murray, K.D., Xian, W. et al. Reply to: Re-evaluating evidence for adaptive mutation rate variation. Nature 619, E57–E60 (2023). https://doi.org/10.1038/s41586-023-06315-x
DOI: https://doi.org/10.1038/s41586-023-06315-x