  • Article
A deep catalogue of protein-coding variation in 983,578 individuals

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Abstract

Rare coding variants that significantly impact function provide insights into the biology of a gene (refs. 1-3). However, ascertaining their frequency requires large sample sizes (refs. 4-8). Here, we present a catalogue of human protein-coding variation, derived from exome sequencing of 983,578 individuals across diverse populations. 23% of the Regeneron Genetics Center Million Exome data (RGC-ME) comes from non-European individuals of African, East Asian, Indigenous American, Middle Eastern, and South Asian ancestry. This catalogue includes over 10.4 million missense and 1.1 million predicted loss-of-function (pLOF) variants. We identify individuals with rare biallelic pLOF variants in 4,848 genes, 1,751 of which have not been previously reported. From precise quantitative estimates of selection against heterozygous loss-of-function, we identify 3,988 loss-of-function intolerant genes, including 86 that were previously assessed as tolerant and 1,153 lacking established disease annotation. We also define regions of missense depletion at high resolution. Notably, 1,482 genes have regions depleted of missense variants despite being tolerant to pLOF variants. Finally, we estimate that 3% of individuals have a clinically actionable genetic variant, and that 11,773 variants reported in ClinVar with unknown significance are likely to be deleterious cryptic splice sites. To facilitate variant interpretation and genetics-informed precision medicine, we make this important resource of coding variation from the RGC-ME accessible via a public variant allele frequency browser.

Author information

Corresponding authors

Correspondence to William Salerno or Suganthi Balasubramanian.

Supplementary information

Supplementary Information

This Supplementary Information file contains the following: descriptions of Supplementary Tables 1-14; Supplementary Methods and descriptions of Supplementary Analyses; Supplementary Figures 1-7; and Supplementary References. Supplementary Tables 1-6 and 11 are provided as separate Excel data tables. Supplementary Tables 7-10 and 12-14 are embedded within the Supplementary Information document.

Reporting Summary

Supplementary Table 1

This table includes: Sample subsets of RGC-ME used in different analyses; Full breakdown of sample counts in fine-scale ancestry groups used in Fig. 1 and for the browser; and Sample sizes and collaborator details for each dataset in RGC-ME. See main Supplementary Information PDF for full legend.

Supplementary Table 2

s_het values for 16,710 genes, with additional annotations including LOEUF scores from gnomAD, minor allele frequency, and coding sequence length. See main Supplementary Information PDF for full legend.

Supplementary Table 3

List of continuous segments of missense constrained regions found in 12,349 genes (canonical transcripts), based on the top 15-percentile threshold of MTR values. See main Supplementary Information PDF for full legend.

Supplementary Table 4

Jaccard index analysis between the MTR-constrained regions and features from UniProt (release 2022_05). See main Supplementary Information PDF for full legend.
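
As a rough illustration of the kind of overlap measure named above, the Jaccard index between two sets of residue positions can be sketched as follows. This is a minimal sketch, not the authors' pipeline: the function names, the inclusive residue-interval representation, and the example coordinates are all assumptions made for illustration.

```python
def covered_positions(intervals):
    """Expand inclusive (start, end) residue intervals into a set of positions."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def jaccard_index(intervals_a, intervals_b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a = covered_positions(intervals_a)
    b = covered_positions(intervals_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here when both sets are empty
    return len(a & b) / len(union)


# Hypothetical coordinates, for illustration only (not taken from the paper).
constrained_region = [(10, 20)]  # assumed missense-constrained segment
annotated_feature = [(15, 30)]   # assumed annotated protein feature
score = jaccard_index(constrained_region, annotated_feature)
```

A value of 1 means the two sets of positions coincide exactly, and 0 means they share no positions.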

Supplementary Table 5

List of genes with a significant proportion of CDS in the top 1, 5, 10, 15, and 20 percentiles of exome-wide MTR missense constraint scores, based on one-sided binomial tests. See main Supplementary Information PDF for full legend.
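
A one-sided binomial test of this general kind can be sketched as below. This is a hedged illustration only: it assumes the null proportion equals the percentile fraction (for example 0.15 for the top 15 percentile), which the legend here does not state, and the function names and gene counts are hypothetical.

```python
from math import exp, lgamma, log


def log_binom_pmf(i, n, p):
    """Log of the binomial probability P(X = i) for X ~ Binomial(n, p), 0 < p < 1."""
    log_choose = lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
    return log_choose + i * log(p) + (n - i) * log(1 - p)


def upper_tail_p_value(k, n, p):
    """One-sided p-value P(X >= k), summed in log space to avoid overflow."""
    if k <= 0:
        return 1.0
    total = sum(exp(log_binom_pmf(i, n, p)) for i in range(k, n + 1))
    return min(1.0, total)


# Hypothetical gene: 400 CDS positions, 90 of which fall in the top 15 percentile.
# The null proportion of 0.15 is an assumption made for this sketch.
p_value = upper_tail_p_value(90, 400, 0.15)
```

In practice, a correction for testing many genes would normally be applied on top of such per-gene p-values; the procedure used in the paper is described in its Supplementary Information.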

Supplementary Table 6

List of 4,848 genes with rare (alternate allele frequency <1%) biallelic pLOF variants (homozygous alternate and compound heterozygous) reported for the entire RGC-ME dataset including related individuals. See main Supplementary Information PDF for full legend.

Supplementary Table 11

List of highly differentiated variants (FST > 0.15).

Peer Review File

About this article

Cite this article

Sun, K.Y., Bai, X., Chen, S. et al. A deep catalogue of protein-coding variation in 983,578 individuals. Nature (2024). https://doi.org/10.1038/s41586-024-07556-0

  • DOI: https://doi.org/10.1038/s41586-024-07556-0
