  • Article

A generalizable deep learning framework for inferring fine-scale germline mutation rate maps

A preprint version of the article is available at bioRxiv.

Abstract

Germline mutation rates are essential for genetic and evolutionary analyses. Yet, estimating accurate fine-scale mutation rates across the genome is a great challenge, due to relatively few observed mutations and intricate relationships between predictors and mutation rates. Here, we present Mutation Rate Learner (MuRaL), a deep learning framework to predict mutation rates at the nucleotide level using only genomic sequences as input. Harnessing human germline variants for comprehensive assessment, we show that MuRaL achieves better predictive performance than current state-of-the-art methods. Moreover, MuRaL can build models with relatively few training mutations and a moderate number of sequenced individuals, and can leverage transfer learning to further reduce data and time demands. We apply MuRaL to produce genome-wide mutation rate maps for four representative species—Homo sapiens, Macaca mulatta, Drosophila melanogaster and Arabidopsis thaliana—demonstrating its high applicability. As an example, we use improved mutation rate estimates to stratify human genes into distinct groups that are enriched for different functions, and highlight that many developmental genes are subject to high mutational burden. The open-source software and generated mutation rate maps can greatly facilitate related research.

Fig. 1: Schematics of the MuRaL model and evaluation strategies.
Fig. 2: Comparison of MuRaL models trained with different rare-variant data.
Fig. 3: Comparison of MuRaL and existing models.
Fig. 4: Training DNM models and transfer learning.
Fig. 5: Application of MuRaL to other species.
Fig. 6: Clustering human coding genes based on mutation rate profiles.

Data availability

All analyses in this study were based on published data. Human variants were from the gnomAD database (v2.1.1; http://www.gnomad-sg.org/downloads/; ref. 15) and the Gene4Denovo database (http://www.genemed.tech/gene4denovo/download/; ref. 46). M. mulatta variants were from the UCSC Genome Browser (https://hgdownload.soe.ucsc.edu/gbdb/rheMac10/rhesusSNVs/; ref. 31). D. melanogaster variants were from the Drosophila Genetic Reference Panel (http://dgrp2.gnets.ncsu.edu/data/website/dgrp2.vcf; ref. 59). A. thaliana variants were from the 1001 Genomes project (https://1001genomes.org/; ref. 52). Human gene annotations were from GENCODE (v37; https://www.gencodegenes.org/human/; ref. 55). The ClinVar variants were from NCBI (https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/clinvar_20220730.vcf.gz; ref. 60). The predicted mutation rate maps for the genomes of human, M. mulatta, D. melanogaster and A. thaliana are available at the ScienceDB repository (https://doi.org/10.11922/sciencedb.01173; ref. 61). Source data are provided with this paper.

Code availability

Source code of the MuRaL package is available at https://github.com/CaiLiLab/MuRaL. The MuRaL version (v1.0.0) used for this publication is also available at Zenodo (https://doi.org/10.5281/zenodo.6989025; ref. 62).

References

  1. Veltman, J. A. & Brunner, H. G. De novo mutations in human genetic disease. Nat. Rev. Genet. 13, 565–575 (2012).

  2. Acuna-Hidalgo, R., Veltman, J. A. & Hoischen, A. New insights into the generation and role of de novo mutations in health and disease. Genome Biol. 17, 241 (2016).

  3. Hodgkinson, A. & Eyre-Walker, A. Variation in the mutation rate across mammalian genomes. Nat. Rev. Genet. 12, 756–766 (2011).

  4. Schiffels, S. & Durbin, R. Inferring human population size and separation history from multiple genome sequences. Nat. Genet. 46, 919–925 (2014).

  5. Pavlidis, P. & Alachiotis, N. A survey of methods and tools to detect recent and strong positive selection. J. Biol. Res. (Thessalon.) 24, 7 (2017).

  6. Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).

  7. Messer, P. W. Measuring the rates of spontaneous mutation from deep and large-scale polymorphism data. Genetics 182, 1219–1232 (2009).

  8. Zhu, Y. O., Sherlock, G. & Petrov, D. A. Extremely rare polymorphisms in Saccharomyces cerevisiae allow inference of the mutational spectrum. PLoS Genet. 13, e1006455 (2017).

  9. Carlson, J. et al. Extremely rare variants reveal patterns of germline mutation rate heterogeneity in humans. Nat. Commun. 9, 3753 (2018).

  10. Agarwal, I. & Przeworski, M. Signatures of replication timing, recombination, and sex in the spectrum of rare variants on the human X chromosome and autosomes. Proc. Natl Acad. Sci. USA 116, 17916–17924 (2019).

  11. Aggarwala, V. & Voight, B. F. An expanded sequence context model broadly explains variability in polymorphism levels across the human genome. Nat. Genet. 48, 349–355 (2016).

  12. Zhao, Z. & Boerwinkle, E. Neighboring-nucleotide effects on single nucleotide polymorphisms: a study of 2.6 million polymorphisms across the human genome. Genome Res. 12, 1679–1686 (2002).

  13. Li, C. & Luscombe, N. M. Nucleosome positioning stability is a modulator of germline mutation rate variation across the human genome. Nat. Commun. 11, 1363 (2020).

  14. Segurel, L., Wyman, M. J. & Przeworski, M. Determinants of mutation rate variation in the human germline. Annu. Rev. Genomics Hum. Genet. 15, 47–70 (2014).

  15. Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).

  16. Sherman, M. A. et al. Genome-wide mapping of somatic mutation rates uncovers drivers of cancer. Nat. Biotechnol. 40, 1634–1643 (2022).

  17. Monroe, J. G. et al. Mutation bias reflects natural selection in Arabidopsis thaliana. Nature 602, 101–105 (2022).

  18. Tyekucheva, S. et al. Human-macaque comparisons illuminate variation in neutral substitution rates. Genome Biol. 9, R76 (2008).

  19. Mugal, C. F. & Ellegren, H. Substitution rate variation at human CpG sites correlates with non-CpG divergence, methylation level and GC content. Genome Biol. 12, R58 (2011).

  20. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).

  21. Eraslan, G., Avsec, Z., Gagneur, J. & Theis, F. J. Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet. 20, 389–403 (2019).

  22. Avsec, Z. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 18, 1196–1203 (2021).

  23. Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015).

  24. Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).

  25. Schwessinger, R. et al. DeepC: predicting 3D genome folding using megabase-scale transfer learning. Nat. Methods 17, 1118–1124 (2020).

  26. Jaganathan, K. et al. Predicting splicing from primary sequence with deep learning. Cell 176, 535–548.e24 (2019).

  27. Kull, M. et al. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with dirichlet calibration. In Advances in Neural Information Processing Systems 32 (NIPS, 2019).

  28. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).

  29. Nusbaum, C. et al. DNA sequence and analysis of human chromosome 8. Nature 439, 331–335 (2006).

  30. Goldmann, J. M. et al. Germline de novo mutation clusters arise during oocyte aging in genomic regions with high double-strand-break incidence. Nat. Genet. 50, 487–492 (2018).

  31. Warren, W. C. et al. Sequence diversity analyses of an improved rhesus macaque genome enhance its biomedical utility. Science 370, eabc6617 (2020).

  32. Taylor, M. S. et al. Heterotachy in mammalian promoter evolution. PLoS Genet. 2, e30 (2006).

  33. Halldorsson, B. V. et al. The sequences of 150,119 genomes in the UK Biobank. Nature 607, 732–740 (2022).

  34. Kimura, M. Evolutionary rate at the molecular level. Nature 217, 624–626 (1968).

  35. di Iulio, J. et al. The human noncoding genome defined by genetic diversity. Nat. Genet. 50, 333–337 (2018).

  36. Ovadia, Y. et al. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems 32 (NIPS, 2019).

  37. Trabelsi, A., Chaabane, M. & Ben-Hur, A. Comprehensive evaluation of deep learning architectures for prediction of DNA/RNA sequence binding specificities. Bioinformatics 35, i269–i277 (2019).

  38. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).

  39. Paszke, A. et al. Pytorch: an imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32, 8026–8037 (2019).

  40. Dale, R. K., Pedersen, B. S. & Quinlan, A. R. Pybedtools: a flexible Python library for manipulating genomic datasets and annotations. Bioinformatics 27, 3423–3424 (2011).

  41. Kopp, W., Monti, R., Tamburrini, A., Ohler, U. & Akalin, A. Deep learning for genomics using Janggu. Nat. Commun. 11, 3488 (2020).

  42. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations (Eds. Bengio, Y. & LeCun, Y.) (ICLR, 2015).

  43. Liaw, R. et al. Tune: a research platform for distributed model selection and training. Preprint at https://arxiv.org/abs/1807.05118 (2018).

  44. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).

  45. Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).

  46. Zhao, G. et al. Gene4Denovo: an integrated database and analytic platform for de novo mutations in humans. Nucleic Acids Res. 48, D913–D926 (2020).

  47. Jonsson, H. et al. Parental influence on human germline de novo mutations in 1,548 trios from Iceland. Nature 549, 519–522 (2017).

  48. Yuen, R. et al. Whole genome sequencing resource identifies 18 new candidate genes for autism spectrum disorder. Nat. Neurosci. 20, 602–611 (2017).

  49. An, J. Y. et al. Genome-wide de novo risk score implicates promoter variation in autism spectrum disorder. Science 362, eaat6576 (2018).

  50. Milholland, B. et al. Differences between germline and somatic mutation rates in humans and mice. Nat. Commun. 8, 15183 (2017).

  51. Vasimuddin, M., Misra, S., Li, H. & Aluru, S. Efficient architecture-aware acceleration of BWA-MEM for multicore systems. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 314–324 (IEEE, 2019).

  52. The 1001 Genomes Consortium. 1,135 genomes reveal the global pattern of polymorphism in Arabidopsis thaliana. Cell 166, 481–491 (2016).

  53. Huang, W. et al. Natural variation in genome architecture among 205 Drosophila melanogaster genetic reference panel lines. Genome Res. 24, 1193–1208 (2014).

  54. Lyko, F., Ramsahoye, B. H. & Jaenisch, R. DNA methylation in Drosophila melanogaster. Nature 408, 538–540 (2000).

  55. Frankish, A. et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 47, D766–D773 (2019).

  56. Ramirez, F. et al. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res. 44, W160–W165 (2016).

  57. Wu, T. et al. clusterProfiler 4.0: a universal enrichment tool for interpreting omics data. Innovation 2, 100141 (2021).

  58. Berrio, A., Haygood, R. & Wray, G. A. Identifying branch-specific positive selection throughout the regulatory genome using an appropriate proxy neutral. BMC Genomics 21, 359 (2020).

  59. Mackay, T. F. et al. The Drosophila melanogaster genetic reference panel. Nature 482, 173–178 (2012).

  60. Landrum, M. J. et al. ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Res. 44, D862–D868 (2016).

  61. Fang, Y., Deng, S. & Li, C. Whole genome mutation rate maps for multiple species. Science Data Bank https://doi.org/10.11922/sciencedb.01173 (2022).

  62. Fang, Y., Deng, S. & Li, C. Code MuRaL v1.0.0. Zenodo https://doi.org/10.5281/zenodo.6989025 (2022).

Acknowledgements

We thank X. He, X. Shen and N. Luscombe for insightful comments on the manuscript. We thank all lab members for discussion and help throughout this project. This work was supported by National Natural Science Foundation of China (32070593), Guangdong Basic and Applied Basic Research Foundation (2022A1515010888), and Science and Technology Planning Project of Guangzhou (202102020816).

Author information

Contributions

C.L. designed and supervised the project. C.L. developed the MuRaL framework, with input from Y.F. and S.D. for detailed evaluation. Y.F. and S.D. performed comparative analyses and generated mutation rate maps. C.L., Y.F. and S.D. wrote the manuscript.

Corresponding author

Correspondence to Cai Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Varun Aggarwala and Jedidiah Carlson for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–24 and Tables 1–11.

Reporting Summary

Source data

Source Data Fig. 2

Statistical source data.

Source Data Fig. 3

Statistical source data.

Source Data Fig. 4

Statistical source data.

Source Data Fig. 5

Statistical source data.

Source Data Fig. 6

Statistical source data.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Fang, Y., Deng, S. & Li, C. A generalizable deep learning framework for inferring fine-scale germline mutation rate maps. Nat Mach Intell 4, 1209–1223 (2022). https://doi.org/10.1038/s42256-022-00574-5

  • DOI: https://doi.org/10.1038/s42256-022-00574-5
