A genomic mutational constraint map using variation in 76,156 human genomes

An Author Correction to this article was published on 15 January 2024

This article has been updated

Abstract

The depletion of disruptive variation caused by purifying natural selection (constraint) has been widely used to investigate protein-coding genes underlying human disorders1,2,3,4, but attempts to assess constraint for non-protein-coding regions have proved more difficult. Here we aggregate, process and release a dataset of 76,156 human genomes from the Genome Aggregation Database (gnomAD)—the largest public open-access human genome allele frequency reference dataset—and use it to build a genomic constraint map for the whole genome (genomic non-coding constraint of haploinsufficient variation (Gnocchi)). We present a refined mutational model that incorporates local sequence context and regional genomic features to detect depletions of variation. As expected, the average constraint for protein-coding sequences is stronger than that for non-coding regions. Within the non-coding genome, constrained regions are enriched for known regulatory elements and variants that are implicated in complex human diseases and traits, facilitating the triangulation of biological annotation, disease association and natural selection to non-coding DNA analysis. More constrained regulatory elements tend to regulate more constrained protein-coding genes, which in turn suggests that non-coding constraint can aid the identification of constrained genes that are as yet unrecognized by current gene constraint metrics. We demonstrate that this genome-wide constraint map improves the identification and interpretation of functional human genetic variation.

Fig. 1: Distribution of Gnocchi scores across the genome.
Fig. 2: Correlation between Gnocchi and functional non-coding annotations.
Fig. 3: Performance of Gnocchi and other predictive metrics in prioritizing non-coding variants.
Fig. 4: Contribution of non-coding constraint in evaluating CNVs.
Fig. 5: Correlation of constraint between non-coding regulatory elements and protein-coding genes.

Data availability

The aggregated allele frequency dataset is available in a browser at https://gnomad.broadinstitute.org, with bulk downloads for VCF files and Hail tables, as well as all constraint statistics described in this manuscript. Additionally, we provide a subset of the dataset that includes individual-level data for the HGDP85 and 1000 Genomes projects86—the generation and use of this dataset is described in a companion manuscript75. There are no restrictions on the aggregate data released. External datasets used in this study are available in the following public resources: ENCODE cCREs, https://screen-v2.wenglab.org/; super enhancers, http://www.licpathway.net/sedb/download.php; FANTOM5 enhancers, https://fantom.gsc.riken.jp/5/datafiles/reprocessed/hg38_latest/extra/enhancer/; miRNA, https://genome.ucsc.edu/cgi-bin/hgTables (All GENCODE V32 track); FANTOM5 lncRNA, https://fantom.gsc.riken.jp/cat/v1/#/genes; GWAS Catalog, https://genome.ucsc.edu/cgi-bin/hgTables (GWAS Catalog track); GWAS fine-mapping, https://www.finucanelab.org/data; CNV morbidity map of DD, https://genome.ucsc.edu/cgi-bin/hgTables (Development Delay track); ClinVar, https://genome.ucsc.edu/cgi-bin/hgTables (ClinVar Variants track); TOPMed, https://bravo.sph.umich.edu/freeze8/hg38/downloads; ClinGen, https://genome.ucsc.edu/cgi-bin/hgTables (ClinGen track); MGI, https://www.informatics.jax.org/; OMIM, https://www.omim.org/; Roadmap Epigenomics Enhancer-Gene Linking, https://ernstlab.biolchem.ucla.edu/roadmaplinking/; GTEx https://gtexportal.org/home/datasets.

Code availability

All code to perform quality control of the resource is publicly available at https://github.com/broadinstitute/gnomad_qc, and many of the functions are documented in a Python package (gnomad) at https://broadinstitute.github.io/gnomad_methods/index.html. The code to compute the constraint statistics is available at https://github.com/atgu/gnomad_nc_constraint.

References

  1. Short, P. J. et al. De novo mutations in regulatory elements in neurodevelopmental disorders. Nature 555, 611–616 (2018).

  2. Satterstrom, F. K. et al. Large-scale exome sequencing study implicates both developmental and functional changes in the neurobiology of autism. Cell 180, 568–584.e523 (2020).

  3. Singh, T. et al. The contribution of rare variants to risk of schizophrenia in individuals with and without intellectual disability. Nat. Genet. 49, 1167–1173 (2017).

  4. Ganna, A. et al. Quantifying the impact of rare and ultra-rare coding variation across the phenotypic spectrum. Am. J. Hum. Genet. 102, 1204–1211 (2018).

  5. Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).

  6. Petrovski, S., Wang, Q., Heinzen, E. L., Allen, A. S. & Goldstein, D. B. Genic intolerance to functional variation and the interpretation of personal genomes. PLoS Genet. 9, e1003709 (2013).

  7. Samocha, K. E. et al. A framework for the interpretation of de novo mutation in human disease. Nat. Genet. 46, 944–950 (2014).

  8. Hindorff, L. A. et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl Acad. Sci. USA 106, 9362–9367 (2009).

  9. Lanyi, J. K. Photochromism of halorhodopsin. cis/trans isomerization of the retinal around the 13–14 double bond. J. Biol. Chem. 261, 14025–14030 (1986).

  10. Mathelier, A., Shi, W. & Wasserman, W. W. Identification of altered cis-regulatory elements in human disease. Trends Genet. 31, 67–76 (2015).

  11. Spielmann, M. & Mundlos, S. Looking beyond the genes: the role of non-coding variants in human disease. Hum. Mol. Genet. 25, R157–R165 (2016).

  12. Zhang, F. & Lupski, J. R. Non-coding genetic variants in human disease. Hum. Mol. Genet. 24, R102–R110 (2015).

  13. Seplyarskiy, V. B. & Sunyaev, S. The origin of human mutation in light of genomic data. Nat. Rev. Genet. 22, 672–686 (2021).

  14. Seplyarskiy, V. B. et al. Population sequencing data reveal a compendium of mutational processes in the human germ line. Science 373, 1030–1035 (2021).

  15. Gussow, A. B. et al. Orion: Detecting regions of the human non-coding genome that are intolerant to variation using population genetics. PLoS ONE 12, e0181604 (2017).

  16. di Iulio, J. et al. The human noncoding genome defined by genetic diversity. Nat. Genet. 50, 333–337 (2018).

  17. Halldorsson, B. V. et al. The sequences of 150,119 genomes in the UK Biobank. Nature 607, 732–740 (2022).

  18. Ritchie, G. et al. Functional annotation of noncoding sequence variants. Nat. Methods 11, 294–296 (2014).

  19. Vitsios, D., Dhindsa, R. S., Middleton, L., Gussow, A. B. & Petrovski, S. Prioritizing non-coding regions based on human genomic constraint and sequence context with deep learning. Nat. Commun. 12, 1504 (2021).

  20. Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005).

  21. Pollard, K. S., Hubisz, M. J., Rosenbloom, K. R. & Siepel, A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 20, 110–121 (2010).

  22. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).

  23. Halldorsson, B. V. et al. Characterizing mutagenic effects of recombination through a sequence-level genetic map. Science 363, eaau1043 (2019).

  24. An, J. Y. et al. Genome-wide de novo risk score implicates promoter variation in autism spectrum disorder. Science 362, eaat6576 (2018).

  25. Collins, R. L. et al. A structural variation reference for medical and population genetics. Nature 581, 444–451 (2020).

  26. The ENCODE Project Consortium. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 583, 699–710 (2020).

  27. Andersson, R. et al. An atlas of active enhancers across human cell types and tissues. Nature 507, 455–461 (2014).

  28. Jiang, Y. et al. SEdb: a comprehensive human super-enhancer database. Nucleic Acids Res. 47, D235–D243 (2019).

  29. Pott, S. & Lieb, J. D. What are super-enhancers? Nat. Genet. 47, 8–12 (2015).

  30. Bartel, D. P. Metazoan microRNAs. Cell 173, 20–51 (2018).

  31. Welter, D. et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 42, D1001–D1006 (2014).

  32. Kanai, M. et al. Insights from complex trait fine-mapping across diverse populations. Preprint at medRxiv https://doi.org/10.1101/2021.09.03.21262975 (2021).

  33. Jung, R. G. et al. Association between plasminogen activator inhibitor-1 and cardiovascular events: a systematic review and meta-analysis. Thromb. J. 16, 12 (2018).

  34. Song, C., Burgess, S., Eicher, J. D., O’Donnell, C. J. & Johnson, A. D. Causal effect of plasminogen activator inhibitor type 1 on coronary heart disease. J. Am. Heart Assoc. 6, e004918 (2017).

  35. Schaefer, A. S. et al. Genetic evidence for PLASMINOGEN as a shared genetic risk factor of coronary artery disease and periodontitis. Circ. Cardiovasc. Genet. 8, 159–167 (2015).

  36. Li, Y. Y. Plasminogen activator inhibitor-1 4G/5G gene polymorphism and coronary artery disease in the Chinese Han population: a meta-analysis. PLoS ONE 7, e33511 (2012).

  37. Drinane, M. C., Sherman, J. A., Hall, A. E., Simons, M. & Mulligan-Kehoe, M. J. Plasminogen and plasmin activity in patients with coronary artery disease. J. Thromb. Haemost. 4, 1288–1295 (2006).

  38. Lowe, G. D. et al. Tissue plasminogen activator antigen and coronary heart disease. Prospective study and meta-analysis. Eur. Heart J. 25, 252–259 (2004).

  39. Wang, Q. S. et al. Leveraging supervised learning for functionally informed fine-mapping of cis-eQTLs identifies an additional 20,913 putative causal eQTLs. Nat. Commun. 12, 3394 (2021).

  40. Landrum, M. J. et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 46, D1062–D1067 (2018).

  41. Stenson, P. D. et al. Human Gene Mutation Database (HGMD): 2003 update. Hum. Mutat. 21, 577–581 (2003).

  42. Davydov, E. V. et al. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput. Biol. 6, e1001025 (2010).

  43. Greenway, S. C. et al. De novo copy number variants identify new genes and loci in isolated sporadic tetralogy of Fallot. Nat. Genet. 41, 931–935 (2009).

  44. Mefford, H. C. et al. Recurrent reciprocal genomic rearrangements of 17q12 are associated with renal disease, diabetes, and epilepsy. Am. J. Hum. Genet. 81, 1057–1069 (2007).

  45. Sebat, J. et al. Strong association of de novo copy number mutations with autism. Science 316, 445–449 (2007).

  46. Stefansson, H. et al. Large recurrent microdeletions associated with schizophrenia. Nature 455, 232–236 (2008).

  47. Walsh, T. et al. Rare structural variants disrupt multiple genes in neurodevelopmental pathways in schizophrenia. Science 320, 539–543 (2008).

  48. Wright, C. F. et al. Genetic diagnosis of developmental disorders in the DDD study: a scalable analysis of genome-wide research data. Lancet 385, 1305–1314 (2015).

  49. Spielmann, M., Lupianez, D. G. & Mundlos, S. Structural variation in the 3D genome. Nat. Rev. Genet. 19, 453–467 (2018).

  50. Spielmann, M. & Mundlos, S. Structural variations, the regulatory landscape of the genome and their alteration in human disease. Bioessays 35, 533–543 (2013).

  51. Coe, B. P. et al. Refining analyses of copy number variation identifies specific genes associated with developmental delay. Nat. Genet. 46, 1063–1071 (2014).

  52. Cooper, G. M. et al. A copy number variation morbidity map of developmental delay. Nat. Genet. 43, 838–846 (2011).

  53. Klopocki, E. et al. Copy-number variations involving the IHH locus are associated with syndactyly and craniosynostosis. Am. J. Hum. Genet. 88, 70–75 (2011).

  54. Barroso, E. et al. Identification of the fourth duplication of upstream IHH regulatory elements, in a family with craniosynostosis Philadelphia type, helps to define the phenotypic characterization of these regulatory elements. Am. J. Med. Genet. A 167A, 902–906 (2015).

  55. Will, A. J. et al. Composition and dosage of a multipartite enhancer cluster control developmental expression of Ihh (Indian hedgehog). Nat. Genet. 49, 1539–1545 (2017).

  56. Roadmap Epigenomics Consortium. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).

  57. Rehm, H. L. et al. ClinGen—the Clinical Genome Resource. N. Engl. J. Med. 372, 2235–2242 (2015).

  58. Blake, J. A. et al. The Mouse Genome Database (MGD): premier model organism resource for mammalian genomics and genetics. Nucleic Acids Res. 39, D842–D848 (2011).

  59. McKusick, V. A. Mendelian Inheritance in Man and its online version, OMIM. Am. J. Hum. Genet. 80, 588–604 (2007).

  60. Consortium, G. T. The Genotype–Tissue Expression (GTEx) project. Nat. Genet. 45, 580–585 (2013).

  61. Xu, H. et al. Elevated ASCL2 expression in breast cancer is associated with the poor prognosis of patients. Am. J. Cancer Res. 7, 955–961 (2017).

  62. Jubb, A. M. et al. Achaete-scute like 2 (ascl2) is a target of Wnt signalling and is upregulated in intestinal neoplasia. Oncogene 25, 3445–3457 (2006).

  63. Tian, Y. et al. MicroRNA-200 (miR-200) cluster regulation by achaete scute-like 2 (Ascl2): impact on the epithelial-mesenchymal transition in colon cancer cells. J. Biol. Chem. 289, 36101–36115 (2014).

  64. Guo, M. H. et al. Inferring compound heterozygosity from large-scale exome sequencing data. Nat. Genet. https://doi.org/10.1038/s41588-023-01608-3 (2023).

  65. Zhu, P. et al. Single-cell DNA methylome sequencing of human preimplantation embryos. Nat. Genet. 50, 12–19 (2018).

  66. Tang, W. W. et al. A unique gene regulatory network resets the human germline epigenome for development. Cell 161, 1453–1467 (2015).

  67. Ross, D. A., Lim, J., Lin, R.-S. & Yang, M.-H. Incremental learning for robust visual tracking. Int. J. Comput. Vision 77, 125–141 (2008).

  68. Karolchik, D. et al. The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 32, D493–D496 (2004).

  69. Li, H. Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics 30, 2843–2851 (2014).

  70. Davis, C. A. et al. The Encyclopedia of DNA elements (ENCODE): data portal update. Nucleic Acids Res. 46, D794–D801 (2018).

  71. Goldmann, J. M. et al. Germline de novo mutation clusters arise during oocyte aging in genomic regions with high double-strand-break incidence. Nat. Genet. 50, 487–492 (2018).

  72. Zhao, H. et al. CrossMap: a versatile tool for coordinate conversion between genome assemblies. Bioinformatics 30, 1006–1007 (2014).

  73. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).

  74. Kent, W. J., Zweig, A. S., Barber, G., Hinrichs, A. S. & Karolchik, D. BigWig and BigBed: enabling browsing of large distributed datasets. Bioinformatics 26, 2204–2207 (2010).

  75. Koenig, Z. et al. A harmonized public resource of deeply sequenced diverse human genomes. Preprint at bioRxiv https://doi.org/10.1101/2023.01.23.525248 (2023).

  76. Harrow, J. et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 22, 1760–1774 (2012).

  77. Hon, C. C. et al. An atlas of human long non-coding RNAs with accurate 5′ ends. Nature 543, 199–204 (2017).

  78. Wang, G., Sarkar, A., Carbonetto, P. & Stephens, M. A simple new approach to variable selection in regression, with application to genetic fine-mapping. J. R. Stat. Soc. B 82, 1273–1300 (2020).

  79. Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299 (2021).

  80. Budescu, D. V. Dominance analysis: a new approach to the problem of relative importance of predictors in multiple regression. Psych. Bull. 114, 542 (1993).

  81. Azen, R. & Budescu, D. V. The dominance analysis approach for comparing predictors in multiple regression. Psych. Methods 8, 129 (2003).

  82. Ernst, J. et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature 473, 43–49 (2011).

  83. Liu, Y., Sarkar, A., Kheradpour, P., Ernst, J. & Kellis, M. Evidence of reduced recombination rate in human regulatory domains. Genome Biol. 18, 193 (2017).

  84. Robin, X. et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 12, 1–8 (2011).

  85. Bergstrom, A. et al. Insights into human genetic variation and population history from 929 diverse genomes. Science 367, eaay5012 (2020).

  86. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).

Acknowledgements

The authors thank the individuals whose data is in gnomAD for their contributions to research. Development of the Genome Aggregation Database was supported by NIDDK U54DK105566 and the NHGRI of the National Institutes of Health under award number U24HG011450. Additional funding for Genome Aggregation Database Consortium members is listed in the Supplementary Information. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Author information

Contributions

S.C., L.C.F., J.K.G., Q.W., A.O.-L., H.L.R., M.J.D., B.M.N., D.G.M. and K.J.K. contributed to the writing of the manuscript and generation of figures. S.C., R.L.C., M.K. and K.J.K. contributed to the analysis of data. L.C.F., Q.W., C.V., L.D.G., T.P., C.S., M.E.T., B.M.N. and K.J.K. developed tools and methods. L.C.F., J.K.G., J.A., M.W.W., Y.T., W.P., M.T.Y., Z.K., Y.F., E.B., S.D., S.G., N.G., S.F., C.T., S.N., L.B., D.R., V.R.-R., M.C., C.L., N.P., G.W., T.J., R.M., K.T., A.R.M., G.T. and K.J.K. contributed to the production and quality control of the gnomAD dataset. N.A.W., R.G., M.S. and K.J.K. contributed to the gnomAD browser. All authors listed under The Genome Aggregation Database Consortium contributed to the generation of the primary data incorporated into the gnomAD resource. All authors reviewed the manuscript.

Corresponding authors

Correspondence to Siwei Chen or Konrad J. Karczewski.

Ethics declarations

Competing interests

K.J.K. is a consultant for Vor Biopharma and Tome Biosciences, and is on the Scientific Advisory Board of Nurture Genomics. D.G.M. is a paid advisor to GSK, Insitro, Variant Bio and Overtone Therapeutics, and has previously received research support from AbbVie, Astellas, Biogen, BioMarin, Eisai, Merck, Pfizer and Sanofi-Genzyme.

Peer review

Peer review information

Nature thanks Slavé Petrovski, Ryan Dhindsa and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer review reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Construction of mutational model and Gnocchi score.

a,b, Estimation of trinucleotide context-specific mutation rates. The proportion of possible variants observed for each substitution and context in 76,156 gnomAD genomes (y-axis) is exponentially correlated with the absolute mutation rate estimated from 1,000 downsampled genomes (x-axis). Fit lines were modeled separately for human autosomes (a) and chromosome X (b). c, Estimation of the effects of regional genomic features on mutation rates. The effects of 13 genomic features at four scales (window sizes 1kb-1Mb; x-axis) on the mutation rate of 32 trinucleotide contexts (y-axis) are shown, colored by the coefficient from regressing de novo mutations (DNMs) on each specific feature and window size. Red/Blue color indicates a positive/negative effect of increasing the feature value on mutation rates; grey crosses indicate significant features at the smallest possible window size after Bonferroni correction for 13×4 = 52 tests. Abbreviations: LCR=low-complexity region, SINE/LINE=short/long interspersed nuclear element, Dist=Distance, Recomb=Recombination, Methyl=Methylation. d,e, The distribution of Gnocchi score as a function of expected and observed variation. Each point represents the Gnocchi score of a 1kb window on the genome (N = 1,984,900 on autosomes (d) and N = 57,729 on chromosome X (e)), which quantifies the deviation of observed variation from expectation. A positive Gnocchi score (red) indicates depletion of variation (observed<expected) and the higher the score the stronger the depletion; the red dashed line indicates the 99th percentile of Gnocchi scores across the autosomes (d) or chromosome X (e).
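The legend describes the Gnocchi score as the deviation of observed variation from expectation, with positive values marking depletion. A minimal sketch of one way such a signed score could be computed is given here. It assumes a chi-squared-style deviation, (expected − observed)/√expected, which is not stated in this legend; the function name and the example counts are hypothetical.

```python
import math

def signed_deviation_score(observed: float, expected: float) -> float:
    """Signed z-like score: positive when observed < expected (depletion).

    Assumed form (not taken from the legend): the square root of
    (observed - expected)^2 / expected, signed so that depletion is positive.
    This is algebraically (expected - observed) / sqrt(expected).
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    return (expected - observed) / math.sqrt(expected)

# Hypothetical 1kb windows: (observed variants, expected variants)
windows = [(100, 144), (150, 144), (144, 144)]
scores = [signed_deviation_score(o, e) for o, e in windows]
for (o, e), s in zip(windows, scores):
    print(f"observed={o}, expected={e}, score={s:.2f}")
```

Under this assumed form, a window with fewer variants than expected gets a positive score, and the magnitude grows with the size of the shortfall relative to √expected.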

Extended Data Fig. 2 Comparison of Gnocchi score between coding and non-coding regions.

a, The proportion of highly constrained windows (Gnocchi ≥ 4) as a function of the percentage of coding sequences in a window (left to right: N = 1,906/49,525, 3,244/55,676, 2,240/18,461, 1,506/7,094, 969/3,519, 569/1,946, 364/1,223, 283/910, 243/724, 10,392/30,138). The intervals (x-axis) are left exclusive and right inclusive. “Exonic only” refers to the 1kb windows created by directly concatenating coding exons into 1kb sequences. Error bars indicate standard errors of the proportions. b, The exonic-only regions (N = 27,875; purple) have a significantly higher Gnocchi score than regions that are exclusively non-coding (N = 1,843,559; blue). Dashed lines indicate the medians. c, The proportion of highly constrained windows (Gnocchi ≥ 4) as a function of the proportion of exonic windows added to the dataset of non-coding windows. d, Gnocchi score percentiles of non-coding versus exonic windows. About 0.05% (100 − 99.95%) and 3.12% (100 − 96.88%) of the non-coding windows exhibit constraint similar to the 90th and 50th percentiles of exonic regions, respectively.
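The proportions and standard errors in panel a can be recomputed from the counts listed in the legend. The sketch below assumes each N pair reads as (highly constrained windows)/(total windows) in the bin, and that the error bars use the usual binomial standard error √(p(1 − p)/n); neither assumption is spelled out in the legend, and the helper name is hypothetical.

```python
import math

# (highly constrained windows, total windows) per bin, copied from the legend.
# Reading each pair as constrained/total is an assumption.
bins = [
    (1906, 49525), (3244, 55676), (2240, 18461), (1506, 7094), (969, 3519),
    (569, 1946), (364, 1223), (283, 910), (243, 724), (10392, 30138),
]

def proportion_with_se(successes, total):
    """Return the proportion and its binomial standard error sqrt(p(1-p)/n)."""
    p = successes / total
    return p, math.sqrt(p * (1 - p) / total)

for k, n in bins:
    p, se = proportion_with_se(k, n)
    print(f"{k}/{n}: proportion={p:.4f}, SE={se:.4f}")
```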

Extended Data Fig. 3 Estimation of constraint for aggregated regulatory annotations.

a,b, Gnocchi scores of aggregated promoter (dark purple), enhancer (light purple), microRNA (miRNA; dark blue), and long non-coding RNA (lncRNA; light blue) annotations are compared against those of exonic (a) and non-coding (b) regions at a 1kb scale. The Gnocchi score percentiles of each annotation (y-axis) are benchmarked by the score deciles of exonic or non-coding regions (10–100 percentiles; x-axis); the grey dashed vertical line indicates the median (50th percentile).

Extended Data Fig. 4 Applications of Gnocchi for characterizing non-coding regions in addition to existing functional annotations.

a, Use of Gnocchi for prioritizing non-coding regions with or without a regulatory annotation (N = 464,504 and 1,379,055, respectively). Constrained non-coding regions are enriched for GWAS variants, independent of the candidate cis-regulatory element (cCRE) annotation from ENCODE. Error bars indicate 95% confidence intervals of the odds ratios. b, Use of Gnocchi in statistical fine-mapping. The increase in posterior inclusion probability (PIP) when incorporating Gnocchi score as a functional prior into previous fine-mapping results (that used a uniform prior; denoted as PIPGnocchi and PIPunif, respectively) is shown for 164 new likely causal associations with a PIPGnocchi ≥0.8 as a function of PIPGnocchi.
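Panel a reports enrichment as odds ratios with 95% confidence intervals. As a sketch of how such an interval is commonly obtained, the snippet below computes an odds ratio from a 2×2 table with a Wald interval on the log scale. The legend does not say which interval method was used, so the Wald approach is an assumption, and the counts are made up purely for illustration.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for the 2x2 table [[a, b], [c, d]].

    z=1.96 corresponds to an approximate 95% interval.
    """
    estimate = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se_log)
    upper = math.exp(math.log(estimate) + z * se_log)
    return estimate, lower, upper

# Made-up counts, for illustration only:
# rows = constrained / unconstrained windows,
# columns = windows with / without a GWAS variant.
estimate, lower, upper = odds_ratio_ci(30, 70, 20, 80)
print(f"OR={estimate:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```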

Extended Data Fig. 5 Comparison of Gnocchi and other predictive metrics in prioritizing non-coding variants.

a, Receiver operating characteristic (ROC) curves of Gnocchi and seven other metrics in classifying putative functional non-coding variants (“positive” variant set) – left to right: 9,229 GWAS Catalog variants, 2,191 GWAS fine-mapping variants, a subset of 140 high-confidence fine-mapped variants, and 1,026 likely pathogenic variants – against a “negative” variant set randomly drawn from the population with a similar allele frequency (AF). AF>5% and allele count (AC) = 1 were applied to match the three GWAS variant sets and the likely pathogenic variant set, respectively, based on their AF distributions in TOPMed (shown in b). b, AUCs of the classification with a varying AF threshold for the negative variant set. As most GWAS variants are common and most likely pathogenic variants are very rare (not seen in the population), AF>5% and AC = 1 were applied, respectively, in the primary analyses shown in a.
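The AUC summarizing each ROC curve equals the probability that a randomly chosen positive variant scores higher than a randomly chosen negative one, with ties counted as one half. The sketch below implements that rank-based definition directly. It is a generic illustration with toy scores, not the code used to produce the figure.

```python
def auc(positive_scores, negative_scores):
    """AUC as P(positive outranks negative), counting ties as 0.5."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Toy constraint scores, not real data
pos = [4.2, 3.1, 5.0, 1.0]   # "positive" (putatively functional) variants
neg = [0.5, 2.0, 3.1, -1.0]  # "negative" (frequency-matched) variants
print(f"AUC = {auc(pos, neg):.3f}")
```

The pairwise loop is quadratic in the number of variants; for sets as large as those in the legend, a sort-based rank computation would be the usual choice.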

Extended Data Fig. 6 Comparison of constraint scores built from different mutational models and genomic windows.

Gnocchi (presented in this study) outperforms scores rebuilt from mutational models that consider only local sequence context – trinucleotide (trimer-only) or heptanucleotide (heptamer-only) – without adjusting the mutation rate for regional genomic features, and the performance is robust to the artificial breaks between genomic windows when scores are computed on 1kb windows sliding by 100bp.
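The sliding-window scheme mentioned above (1kb windows advanced in 100bp steps) can be sketched as follows. The half-open coordinate convention and the function name are assumptions made for illustration.

```python
def sliding_windows(start, end, size=1000, step=100):
    """Yield half-open [s, s + size) windows that lie fully inside [start, end)."""
    s = start
    while s + size <= end:
        yield (s, s + size)
        s += step

# A 2kb stretch yields 11 overlapping 1kb windows at a 100bp step.
windows = list(sliding_windows(0, 2000))
print(len(windows), windows[0], windows[-1])
```

Because consecutive windows overlap by 900bp, every position away from the edges is covered by ten windows, which is what makes the scores less sensitive to where fixed window boundaries happen to fall.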

Extended Data Fig. 7 Pairwise correlations between different constraint/conservation metrics.

The Spearman’s rank correlation between each pair of the eight metrics was computed based on the mean value of each score on 1kb windows across the genome.
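Spearman's rank correlation is the Pearson correlation of rank-transformed values, with tied values sharing their average rank. A self-contained sketch is given below; the function names and the per-window mean scores are hypothetical, and in practice a library routine such as `scipy.stats.spearmanr` would normally be used.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical mean scores of two metrics over the same five 1kb windows
metric_a = [0.2, 1.5, 3.9, -0.7, 2.1]
metric_b = [0.1, 0.9, 2.5, -0.2, 3.0]
print(f"Spearman rho = {spearman(metric_a, metric_b):.3f}")
```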

Extended Data Fig. 8 Power of constraint detection.

a,b, The sample size required for well-powered non-coding constraint detection. The percentage of non-coding regions powered to detect constraint (Gnocchi ≥ 4) at a 1kb (a) and 100bp (b) scale under varying levels of selection (depletion of variation) is shown as a function of log-scaled sample size. Lighter color indicates milder depletion of variation (weaker selection), which requires a larger sample size to detect constraint; the grey dashed vertical line indicates the current sample size of 76,156 genomes. Dotted curves (left to right) benchmark the 95th, 90th, and 50th percentile of depletion of variation observed in coding exons of similar size. The number of samples required to obtain an 80% detection power is labeled at corresponding benchmarks. c, AUCs of Gnocchi scores computed on different window sizes in identifying putative functional non-coding variants. 1kb (used in this study) is the optimal window size, giving high performance while maintaining reasonable resolution. d, AUCs of Gnocchi scores computed from different subsets of gnomAD in identifying putative functional non-coding variants. At an equal sample size, the downsampled dataset with diverse ancestries presents higher performance than the Non-Finnish European (NFE)-only dataset.

Supplementary information

Supplementary Information

This file provides detailed information about the aggregation, processing, and release of 76,156 human genomes from the Genome Aggregation Database (gnomAD), including Supplementary Figs. 1–8, Supplementary Tables 1–3, and descriptions of supplementary datasets.

Reporting Summary

Peer Review File

Supplementary Datasets

This zipped file contains supplementary dataset items 1–6: see Supplementary Information for supplementary dataset guide.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chen, S., Francioli, L.C., Goodrich, J.K. et al. A genomic mutational constraint map using variation in 76,156 human genomes. Nature 625, 92–100 (2024). https://doi.org/10.1038/s41586-023-06045-0

  • DOI: https://doi.org/10.1038/s41586-023-06045-0
