Germline mutation rates are essential for genetic and evolutionary analyses. Yet, estimating accurate fine-scale mutation rates across the genome is a great challenge, due to relatively few observed mutations and intricate relationships between predictors and mutation rates. Here, we present Mutation Rate Learner (MuRaL), a deep learning framework to predict mutation rates at the nucleotide level using only genomic sequences as input. Harnessing human germline variants for comprehensive assessment, we show that MuRaL achieves better predictive performance than current state-of-the-art methods. Moreover, MuRaL can build models with relatively few training mutations and a moderate number of sequenced individuals, and can leverage transfer learning to further reduce data and time demands. We apply MuRaL to produce genome-wide mutation rate maps for four representative species—Homo sapiens, Macaca mulatta, Drosophila melanogaster and Arabidopsis thaliana—demonstrating its high applicability. As an example, we use improved mutation rate estimates to stratify human genes into distinct groups that are enriched for different functions, and highlight that many developmental genes are subject to high mutational burden. The open-source software and generated mutation rate maps can greatly facilitate related research.
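To make "using only genomic sequences as input" concrete, the following is a minimal sketch of how a sequence window around a focal nucleotide could be turned into numeric model input by one-hot encoding. It is a generic illustration, not MuRaL's actual preprocessing: the function name `one_hot_window`, the `flank` parameter, and the choice to encode padding and ambiguous bases as all-zero rows are all assumptions made here.

```python
from typing import List

BASES = "ACGT"  # column order of the one-hot encoding (an assumption of this sketch)


def one_hot_window(genome: str, pos: int, flank: int) -> List[List[int]]:
    """One-hot encode the 2*flank+1 bases centred on 0-based position `pos`.

    Positions that fall outside the sequence, and bases other than A/C/G/T
    (for example N), are encoded as all-zero rows.
    """
    rows = []
    for i in range(pos - flank, pos + flank + 1):
        # Treat out-of-range positions as the ambiguous base N.
        base = genome[i].upper() if 0 <= i < len(genome) else "N"
        rows.append([1 if base == b else 0 for b in BASES])
    return rows


# Hypothetical usage: a 5-base window centred on the second base of a toy sequence.
window = one_hot_window("ACGTN", pos=1, flank=2)
```

Each row of `window` corresponds to one position in the window, so a model that consumes such matrices sees the focal site together with its sequence context.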
All analyses in this study were based on published data. Human variants were from the gnomAD database (v2.1.1; http://www.gnomad-sg.org/downloads/; ref. 15) and the gene4denovo database (http://www.genemed.tech/gene4denovo/download/; ref. 46). M. mulatta variants were from the UCSC Genome Browser (https://hgdownload.soe.ucsc.edu/gbdb/rheMac10/rhesusSNVs/; ref. 31). D. melanogaster variants were from the Drosophila Genetic Reference Panel (http://dgrp2.gnets.ncsu.edu/data/website/dgrp2.vcf; ref. 59). A. thaliana variants were from the 1001 Genomes project (https://1001genomes.org/; ref. 52). Human gene annotations were from GENCODE (v37; https://www.gencodegenes.org/human/; ref. 55). The ClinVar variants were from NCBI (https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/clinvar_20220730.vcf.gz; ref. 60). The predicted mutation rate maps for the human, M. mulatta, D. melanogaster and A. thaliana genomes are available at the ScienceDB repository (https://doi.org/10.11922/sciencedb.01173; ref. 61). Source data are provided with this paper.
Veltman, J. A. & Brunner, H. G. De novo mutations in human genetic disease. Nat. Rev. Genet. 13, 565–575 (2012).
Acuna-Hidalgo, R., Veltman, J. A. & Hoischen, A. New insights into the generation and role of de novo mutations in health and disease. Genome Biol. 17, 241 (2016).
Hodgkinson, A. & Eyre-Walker, A. Variation in the mutation rate across mammalian genomes. Nat. Rev. Genet. 12, 756–766 (2011).
Schiffels, S. & Durbin, R. Inferring human population size and separation history from multiple genome sequences. Nat. Genet. 46, 919–925 (2014).
Pavlidis, P. & Alachiotis, N. A survey of methods and tools to detect recent and strong positive selection. J. Biol. Res. (Thessalon.) 24, 7 (2017).
Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).
Messer, P. W. Measuring the rates of spontaneous mutation from deep and large-scale polymorphism data. Genetics 182, 1219–1232 (2009).
Zhu, Y. O., Sherlock, G. & Petrov, D. A. Extremely rare polymorphisms in Saccharomyces cerevisiae allow inference of the mutational spectrum. PLoS Genet. 13, e1006455 (2017).
Carlson, J. et al. Extremely rare variants reveal patterns of germline mutation rate heterogeneity in humans. Nat. Commun. 9, 3753 (2018).
Agarwal, I. & Przeworski, M. Signatures of replication timing, recombination, and sex in the spectrum of rare variants on the human X chromosome and autosomes. Proc. Natl Acad. Sci. USA 116, 17916–17924 (2019).
Aggarwala, V. & Voight, B. F. An expanded sequence context model broadly explains variability in polymorphism levels across the human genome. Nat. Genet. 48, 349–355 (2016).
Zhao, Z. & Boerwinkle, E. Neighboring-nucleotide effects on single nucleotide polymorphisms: a study of 2.6 million polymorphisms across the human genome. Genome Res. 12, 1679–1686 (2002).
Li, C. & Luscombe, N. M. Nucleosome positioning stability is a modulator of germline mutation rate variation across the human genome. Nat. Commun. 11, 1363 (2020).
Segurel, L., Wyman, M. J. & Przeworski, M. Determinants of mutation rate variation in the human germline. Annu. Rev. Genomics Hum. Genet. 15, 47–70 (2014).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Sherman, M. A. et al. Genome-wide mapping of somatic mutation rates uncovers drivers of cancer. Nat. Biotechnol. 40, 1634–1643 (2022).
Monroe, J. G. et al. Mutation bias reflects natural selection in Arabidopsis thaliana. Nature 602, 101–105 (2022).
Tyekucheva, S. et al. Human-macaque comparisons illuminate variation in neutral substitution rates. Genome Biol. 9, R76 (2008).
Mugal, C. F. & Ellegren, H. Substitution rate variation at human CpG sites correlates with non-CpG divergence, methylation level and GC content. Genome Biol. 12, R58 (2011).
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Eraslan, G., Avsec, Z., Gagneur, J. & Theis, F. J. Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet. 20, 389–403 (2019).
Avsec, Z. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 18, 1196–1203 (2021).
Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015).
Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).
Schwessinger, R. et al. DeepC: predicting 3D genome folding using megabase-scale transfer learning. Nat. Methods 17, 1118–1124 (2020).
Jaganathan, K. et al. Predicting splicing from primary sequence with deep learning. Cell 176, 535–548 e524 (2019).
Kull, M. et al. Beyond temperature scaling: obtaining well-calibrated multi-class probabilities with Dirichlet calibration. In Advances in Neural Information Processing Systems 32 (NIPS, 2019).
Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
Nusbaum, C. et al. DNA sequence and analysis of human chromosome 8. Nature 439, 331–335 (2006).
Goldmann, J. M. et al. Germline de novo mutation clusters arise during oocyte aging in genomic regions with high double-strand-break incidence. Nat. Genet. 50, 487–492 (2018).
Warren, W. C. et al. Sequence diversity analyses of an improved rhesus macaque genome enhance its biomedical utility. Science 370, eabc6617 (2020).
Taylor, M. S. et al. Heterotachy in mammalian promoter evolution. PLoS Genet. 2, e30 (2006).
Halldorsson, B. V. et al. The sequences of 150,119 genomes in the UK Biobank. Nature 607, 732–740 (2022).
Kimura, M. Evolutionary rate at the molecular level. Nature 217, 624–626 (1968).
di Iulio, J. et al. The human noncoding genome defined by genetic diversity. Nat. Genet. 50, 333–337 (2018).
Ovadia, Y. et al. Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems 32 (NIPS, 2019).
Trabelsi, A., Chaabane, M. & Ben-Hur, A. Comprehensive evaluation of deep learning architectures for prediction of DNA/RNA sequence binding specificities. Bioinformatics 35, i269–i277 (2019).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).
Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32, 8026–8037 (2019).
Dale, R. K., Pedersen, B. S. & Quinlan, A. R. Pybedtools: a flexible Python library for manipulating genomic datasets and annotations. Bioinformatics 27, 3423–3424 (2011).
Kopp, W., Monti, R., Tamburrini, A., Ohler, U. & Akalin, A. Deep learning for genomics using Janggu. Nat. Commun. 11, 3488 (2020).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations (Eds. Bengio, Y. & LeCun, Y.) (ICLR, 2015).
Liaw, R. et al. Tune: a research platform for distributed model selection and training. Preprint at https://arxiv.org/abs/1807.05118 (2018).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
Zhao, G. et al. Gene4Denovo: an integrated database and analytic platform for de novo mutations in humans. Nucleic Acids Res. 48, D913–D926 (2020).
Jonsson, H. et al. Parental influence on human germline de novo mutations in 1,548 trios from Iceland. Nature 549, 519–522 (2017).
Yuen, R. et al. Whole genome sequencing resource identifies 18 new candidate genes for autism spectrum disorder. Nat. Neurosci. 20, 602–611 (2017).
An, J. Y. et al. Genome-wide de novo risk score implicates promoter variation in autism spectrum disorder. Science 362, eaat6576 (2018).
Milholland, B. et al. Differences between germline and somatic mutation rates in humans and mice. Nat. Commun. 8, 15183 (2017).
Vasimuddin, M., Misra, S., Li, H. & Aluru, S. Efficient architecture-aware acceleration of BWA-MEM for multicore systems. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 314–324 (IEEE, 2019).
The 1001 Genomes Consortium. 1,135 genomes reveal the global pattern of polymorphism in Arabidopsis thaliana. Cell 166, 481–491 (2016).
Huang, W. et al. Natural variation in genome architecture among 205 Drosophila melanogaster genetic reference panel lines. Genome Res. 24, 1193–1208 (2014).
Lyko, F., Ramsahoye, B. H. & Jaenisch, R. DNA methylation in Drosophila melanogaster. Nature 408, 538–540 (2000).
Frankish, A. et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 47, D766–D773 (2019).
Ramirez, F. et al. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res. 44, W160–W165 (2016).
Wu, T. et al. clusterProfiler 4.0: a universal enrichment tool for interpreting omics data. Innovation 2, 100141 (2021).
Berrio, A., Haygood, R. & Wray, G. A. Identifying branch-specific positive selection throughout the regulatory genome using an appropriate proxy neutral. BMC Genomics 21, 359 (2020).
Mackay, T. F. et al. The Drosophila melanogaster genetic reference panel. Nature 482, 173–178 (2012).
Landrum, M. J. et al. ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Res. 44, D862–D868 (2016).
Fang, Y., Deng, S. & Li, C. Whole genome mutation rate maps for multiple species. Science Data Bank https://doi.org/10.11922/sciencedb.01173 (2022).
Fang, Y., Deng, S. & Li, C. Code MuRaL v1.0.0. Zenodo https://doi.org/10.5281/zenodo.6989025 (2022).
We thank X. He, X. Shen and N. Luscombe for insightful comments on the manuscript. We thank all lab members for discussion and help throughout this project. This work was supported by the National Natural Science Foundation of China (32070593), the Guangdong Basic and Applied Basic Research Foundation (2022A1515010888) and the Science and Technology Planning Project of Guangzhou (202102020816).
The authors declare no competing interests.
Peer review information
Nature Machine Intelligence thanks Varun Aggarwala and Jedidiah Carlson for their contribution to the peer review of this work.
About this article
Cite this article
Fang, Y., Deng, S. & Li, C. A generalizable deep learning framework for inferring fine-scale germline mutation rate maps. Nat Mach Intell 4, 1209–1223 (2022). https://doi.org/10.1038/s42256-022-00574-5