A deep catalogue of protein-coding variation in 983,578 individuals

Sun, Kathie Y.; Bai, Xiaodong; Chen, Siying; Bao, Suying; Zhang, Chuanyi; Kapoor, Manav; Backman, Joshua; Joseph, Tyler; Maxwell, Evan; Mitra, George; Gorovits, Alexander; Mansfield, Adam; Boutkov, Boris; Gokhale, Sujit; Habegger, Lukas; Marcketta, Anthony; Locke, Adam E.; Ganel, Liron; Hawes, Alicia; Kessler, Michael D.; Sharma, Deepika; Staples, Jeffrey; Bovijn, Jonas; Gelfman, Sahar; Di Gioia, Alessandro; Rajagopal, Veera M.; Lopez, Alexander; Varela, Jennifer Rico; Alegre-Díaz, Jesús; Berumen, Jaime; Tapia-Conyer, Roberto; Kuri-Morales, Pablo; Torres, Jason; Emberson, Jonathan; Collins, Rory; Cantor, Michael; Thornton, Timothy; Kang, Hyun Min; Overton, John D.; Shuldiner, Alan R.; Cremona, M. Laura; Nafde, Mona; Baras, Aris; Abecasis, Gonçalo; Marchini, Jonathan; Reid, Jeffrey G.; Salerno, William; Balasubramanian, Suganthi

doi:10.1038/s41586-024-07556-0

Download PDF

Article
Open access
Published: 20 May 2024

A deep catalogue of protein-coding variation in 983,578 individuals

Nature volume 631, pages 583–592 (2024)Cite this article

19k Accesses
67 Altmetric
Metrics details

Subjects

Abstract

Rare coding variants that substantially affect function provide insights into the biology of a gene^1,2,3. However, ascertaining the frequency of such variants requires large sample sizes^4,5,6,7,8. Here we present a catalogue of human protein-coding variation, derived from exome sequencing of 983,578 individuals across diverse populations. In total, 23% of the Regeneron Genetics Center Million Exome (RGC-ME) data come from individuals of African, East Asian, Indigenous American, Middle Eastern and South Asian ancestry. The catalogue includes more than 10.4 million missense and 1.1 million predicted loss-of-function (pLOF) variants. We identify individuals with rare biallelic pLOF variants in 4,848 genes, 1,751 of which have not been previously reported. From precise quantitative estimates of selection against heterozygous loss of function (LOF), we identify 3,988 LOF-intolerant genes, including 86 that were previously assessed as tolerant and 1,153 that lack established disease annotation. We also define regions of missense depletion at high resolution. Notably, 1,482 genes have regions that are depleted of missense variants despite being tolerant of pLOF variants. Finally, we estimate that 3% of individuals have a clinically actionable genetic variant, and that 11,773 variants reported in ClinVar with unknown significance are likely to be deleterious cryptic splice sites. To facilitate variant interpretation and genetics-informed precision medicine, we make this resource of coding variation from the RGC-ME dataset publicly accessible through a variant allele frequency browser.

FarGen: Elucidating the distribution of coding variants in the isolated population of the Faroe Islands

Article Open access 21 November 2022

A structural variation reference for medical and population genetics

Article Open access 27 May 2020

GATK-gCNV enables the discovery of rare copy number variants from exome sequencing data

Article 21 August 2023

Main

Exome sequencing has enabled the discovery of rare coding variants, and has thus provided insights into gene function that have accelerated the pace of disease-associated gene discovery across Mendelian and common disorders^{1,2,3,6,9,10,11,12}. Furthermore, exome sequencing has identified protective alleles that highlight drug targets that could be amenable to pharmacological intervention^{2,13,14,15,16,17}. For example, anti-PCSK9 drug therapy is based on the observation that a loss of PCSK9 function is associated with reduced levels of cholesterol¹⁸.

Cataloguing rare coding variation can help with the implementation of precision medicine^19,20. Large datasets of genetic variation that are representative of the human population are essential for the comprehensive discovery and interpretation of rare variants. Roadmaps for numerous large-scale sequencing studies have been proposed, and several efforts are now underway^21,22,23,24. The Genome Aggregation Database⁴ (gnomAD) and Trans-Omics for Precision Medicine⁸ (TOPMed) initiatives have developed large public databases of genetic variation derived from approximately 200,000 individuals and 132,000 individuals, respectively. Here, we describe a harmonized collection of exonic data derived from 983,578 individuals who represent a diverse array of ancestries. We calculate continental and fine-scale ancestry-based allele frequencies across this dataset and make the data publicly available through the RGC research browser: https://rgc-research.regeneron.com/me.

Survey of variation in the RGC-ME dataset

The Regeneron Genetics Center Million Exome (RGC-ME) dataset contains the genetic variation observed in 983,578 individuals. These data span dozens of collaborations, including large biobanks and health systems. All data were generated by Regeneron Genetics Center using a single harmonized sequencing and informatics protocol. Previously published datasets, such as the UK Biobank and the Mexico City Prospective Study, were reprocessed^9,25. The RGC-ME dataset comprises both outbred and founder populations spanning African (AFR), European (EUR), East Asian (EAS), Indigenous American (IAM), Middle Eastern (MEA) and South Asian (SAS) continental ancestries, and includes cohorts with relatively high rates of consanguinity. More than 190,000 of the unrelated participants (23%) are of non-EUR ancestry in the RGC-ME dataset, as compared with 35,000 in gnomAD genomes (v.3.1.2), 53,000 in gnomAD exomes (v.2.1.1), and 91,000 in TOPMed Freeze 8, indicating that RGC-ME represents a large increase in the number of individuals of non-EUR ancestry in datasets of genetic variation^4,8 (Fig. 1a and Supplementary Table 1a).

**Fig. 1: Variant survey and population counts in the RGC-ME dataset.**

We performed a comprehensive survey of genetic variation, encompassing single-nucleotide variants (SNV) and insertion–deletion (indel) variants. To estimate population allele frequencies, we focused on 821,979 unrelated samples (referred to hereafter as the 822K unrelated set; Supplementary Table 1a). We identified 16,425,629 unique mutated genomic positions (that is, sites) in autosomal and X-chromosomal coding regions, with one unique reference–alternate allele change (that is, variant) every two bases on average. In canonical transcripts within sequencing target regions, mutations at 35.6%, 32.2% and 9.5% of all possible genomic positions that can lead to synonymous, missense and stop-gained variants, respectively, were observed. In highly methylated CpG sites, we observed 95.0% of all possible synonymous, 92.2% of missense and 78.6% of stop-gained variants. Across all mutational contexts, 21.4% and 8.4% of all possible synonymous variants and stop-gained variants, respectively, were observed (Extended Data Fig. 1). Thus, RGC-ME represents a major advance towards the comprehensive discovery of rare variants.

Among coding variation in canonical transcripts, 1,115,116 pLOF variants were identified, which include those causing a premature stop, affecting essential splice donor and acceptor sites or causing frameshifts (Fig. 1c). Of these pLOF variants, 53.3% were observed as singletons; that is, only observed in one individual. In addition, 4,645,092 synonymous (35.7% as singletons) and 10,444,562 missense (40.0% as singletons) variants in canonical transcripts were detected. A total of 48% of coding variants in canonical transcripts were unique to RGC-ME and absent in other large-scale datasets^4,8 (Fig. 1b). Each sample had a median of 137 pLOF, 8,652 missense and 10,184 synonymous variants (Fig. 1c). AFR individuals had, on average, 18.6% more variants across all functional categories compared with individuals of other ancestries (Extended Data Fig. 2), as expected on the basis of the ‘Out of Africa’ model of human population history²⁶.

Constrained genes

Population-scale sequencing allows the quantification of pLOF variation in genes, which is key to understanding the relationship between genes and diseases. Several gene constraint metrics have been developed to estimate the pLOF tolerance of genes²⁷. Here, we estimated pLOF depletion using the cumulative frequency of pLOF variants in a gene to derive a selection coefficient, s_het, that quantifies fitness loss due to heterozygous pLOF variation²⁸. We estimated the indispensability of 16,710 protein-coding genes on the basis of the observed number of rare pLOF variants per gene with a cumulative alternate allele frequency (AAF) of less than 0.1% compared with the expected number based on gene-specific mutation rates (Supplementary Table 2).

The mean s_het value in the RGC-ME dataset for canonical transcripts was 0.073 (95% highest posterior density (HPD)): [0.043, 0.12] (median s_het = 0.021) (Fig. 2a), which suggests that, on average, a pLOF would result in 7.3% lower evolutionary fitness relative to the reference allele. This estimate is comparable with the mean s_het value of 0.073 [0.029, 0.18] (median = 0.028) computed using the same method on the ExAC dataset²⁹ (n ≈ 60,000). Our sample size (n ≈ 822,000) helped to accurately quantify rare pLOF variants and compute more precise constraint scores than the ExAC values. This finding is best illustrated in known haploinsufficient genes, which are expected to be more constrained and thus have larger s_het values relative to all genes (Extended Data Fig. 3a). Compared with values that were computed with ExAC data, s_het values for haploinsufficient genes in the RGC-ME dataset were significantly higher (\({\Delta \bar{s}}_{\text{het}}=0.045\), P = 0.002) and had smaller 95% HPD ranges despite those larger means (∆Var(s_het) = −0.026, P = 4.5 × 10⁻²¹). Estimates for all genes were more precise in 822K samples compared with a randomly downsampled set of 60,000 samples from the RGC-ME dataset (Extended Data Fig. 3b), in which mean and median 95% HPD ranges were 6.2- and 4.0-fold larger, respectively.

**Fig. 2: Estimates of gene-level constraint, representing s_het, from the RGC-ME dataset.**

The s_het value is higher in genes that are associated with Mendelian diseases^28,30 (Extended Data Fig. 3a), and can differentiate groups of genes under varying degrees of selection (Fig. 2b). We used s_het to identify constrained genes by comparing the s_het scores of known high-constraint genes (haploinsufficient, autosomal dominant and developmental-specific autosomal dominant) with those of low-constraint genes (haplosufficient and genes with rare biallelic pLOF variants from the RGC-ME dataset) (Extended Data Fig. 4a). Among 1,476 genes in the ‘high-constraint’ and 3,893 genes in the ‘low-constraint’ groups, 89.1% of genes with a s_het score greater than the mean (0.073) and 66.6% of genes with a s_het score greater than the median (0.021) belonged to the high group (Supplementary Table 2). These thresholds served as cut-offs for mean and lower bound (2.5% HPD), respectively, to identify highly constrained genes with fitness deficits on a par with dominant disease-causing genes that also reflect uncertainty in the mean.

We compared s_het to other published LOF constraint measures, such as LOEUF⁴, and an alternate method for estimating s_het based on approximate Bayesian computing³¹, which we refer to as s_het-ABC (Supplementary Figs. 2–4). Spearman rank correlations between s_het from RGC-ME and these estimates were high (−0.768 with LOEUF; 0.778 with s_het-ABC). However, s_het derived from RGC-ME had higher sensitivity and specificity in differentiating between constrained and unconstrained genes, compared with LOEUF and s_het-ABC (Extended Data Fig. 4a).

Improving s_het estimates is most valuable for genes with few expected pLOF mutations⁴, particularly shorter genes. The RGC-ME dataset allowed s_het to be estimated more precisely for the smallest quantiles of gene coding sequence (CDS) length (Extended Data Fig. 3c), and derived more informative constraint metrics using an allele-frequency-based approach and a larger sample. We derived constraint scores for 923 genes that had 5 or fewer expected pLOF variants, deemed underpowered for similar analyses with LOEUF³². These 923 underpowered genes were significantly shorter, with a mean CDS length of 573 base pairs compared to 1,797 base pairs for genes with more than 5 expected pLOF variants. Eighty-six genes were highly constrained, with a mean s_het value greater than 0.073 and a lower bound greater than 0.021 (Extended Data Fig. 4b), and are promising candidates for efforts to discover new disease-associated genes. Thirty per cent (26 of 86) have been linked to human diseases or shown to be essential in mice or cell lines (Supplementary Table 2). These include well-studied genes with known importance in cellular function, such as the transcription factor TWIST1 (ref. ³³), DNA- and RNA-binding protein BANF1 (refs. ^34,35) and transactivator CITED2 (ref. ³⁶).

Overall, 3,988 highly constrained genes had s_het values greater than 0.073 and a lower bound greater than 0.021. Although 1,153 of these lack known associations with human diseases or lethal mouse knockout phenotypes, they are likely to have high functional importance. These constrained genes might lack disease associations because the loss of even a single copy is incompatible with life or causes reduced reproductive success without clinical disease³⁷.

Constrained coding regions

Identifying sub-genic regions that are intolerant of mutations can reveal functionally important regions that would otherwise be missed when constraint scores are aggregated at the gene level. Models of local coding constraint are powerful tools for identifying protein domains with crucial functions and for variant prioritization^38,39,40. In addition to gene constraint derived from pLOF variation, we also identified regions depleted of missense variation using the missense tolerance ratio (MTR)^40,41, defined as the ratio of the observed to the expected proportion of missense variants adjusted by synonymous variation in a defined codon window. Using the 822K unrelated samples, we calculated the MTR for each amino acid along the CDS within sliding windows of 21 and 31 amino acids (MTR scores available on figshare; see ‘Data availability’) and characterized continuous segments of missense constrained regions (Supplementary Table 3).

Compared with benign missense variants, ClinVar pathogenic missense (two stars or more) variants were highly enriched in the top percentile of exome-wide MTR scores (odds ratio = 100.0 and 89.8, computed with 21- and 31-codon windows, respectively; Fig. 3a). Our sample size, which is nearly four times larger than that used in previous MTR estimates⁴¹, resulted in improved discrimination between pathogenic and benign variants for top-10-percentile MTR scores in which we observed significant enrichment (Fig. 3a). This larger sample size enabled us to identify 24% more missense variants in the top-1-percentile constrained MTR scores (512,499 versus 413,147) compared with a subsampled set of 225,000, after adjusting for a false discovery rate (FDR) lower than 0.1. In addition, the increased power derived from 822K samples resulted in higher resolution for distinguishing pathogenic from benign variants for MTR computed with 21-codon windows, albeit at the expense of having fewer scored missense variants overall (295,958 constrained missense variants).

**Fig. 3: Missense regional constraint captured by MTR.**

Deleterious variants are expected to have lower allele frequencies than neutral variants, owing to negative selection. We can infer the functional importance of different classes of variation by comparing the proportion of singletons in each class. We computed the deleteriousness of variants using an updated mutability-adjusted proportion of singletons (MAPS) metric^5,32 and derived an MTR score threshold at which their MAPS score corresponds to that of missense variants that were predicted to be deleterious by five out of five prediction algorithms in dbNSFP (v.3.2; see Supplementary Information); that is, 5/5 missense variants. Variants with MTR values in the top-15-percentile exome-wide threshold (MTR < 0.841) were predicted to be as deleterious as 5/5 missense variants (Extended Data Fig. 5a). For 31-codon windows, 1.24% (129,990) of all missense variants (excluding known ClinVar pathogenic variants) observed in the RGC-ME dataset had significant MTR scores in the top 15 percentile. These missense variants in the top 15 percentile of exome-wide MTR are potentially deleterious and could be suitable for prioritization in projects aiming to discover disease-associated genes.

MTR is a useful metric of regional constraint that may capture functionally important segments within genes. We defined MTR-constrained regions as continuous regions within a protein that have variants with MTR values in the top-15-percentile threshold (Supplementary Information and Extended Data Fig. 6a). We identified 41,114 missense constrained regions in 12,349 genes (Supplementary Table 3). Our findings overlap with results from a previous study³⁸ that estimated the regional observed-to-expected missense ratio (ɣ) from around 60,000 ExAC samples (Extended Data Fig. 6b) to derive a composite missense deleterious score called MPC. We refer to the MPC-derived constrained regions as MPC segments, and compared these with MTR-constrained regions. MTR-constrained regions had a median length of 22 residues [14–35, quartile 1–quartile 3], compared with 358 [208, 579] in MPC segments. Overall, we identified 8.59 times more MTR-constrained regions than MPC segments (ɣ ≤ 0.612, top 15 percentile) across 2,832 transcripts with data from both methods (Extended Data Fig. 6c).

We examined the distribution of de novo missense variants in MTR-constrained regions and observed a significant enrichment (P = 2.61 × 10⁻¹⁰) of variants identified in individuals with neurodevelopmental disorders (Extended Data Fig. 7a,b). Case variants were 1.85 times [1.50, 2.31 (95% CI)] more likely to occur in constrained regions, compared with controls. As expected, well-supported (two stars or more) ClinVar pathogenic missense variants were also highly enriched (P ≈ 0) in MTR-constrained regions. Pathogenic variants were 8.82 times [8.17, 9.53 (95% CI)] more likely to occur in missense constrained regions than were benign variants.

Missense constrained sites were found in key functional regions, such as DNA-binding regions and active sites (Fig. 3b). Among membrane proteins, transmembrane regions ranked higher in MTR-constrained regions than did cytoplasmic and extracellular domains. We also compared the overlap of MTR-constrained and functional regions by computing Jaccard indices. Ubiquitin-conjugating (UBC) core domains and DNA-binding regions had the highest overlap with constrained regions (Jaccard index = 0.52 and 0.18, respectively), suggesting that, among UBC enzymes, more than half of the union set between MTR-constrained regions and core domains overlapped. Other enriched functional regions included protein kinases and nuclear receptor ligand-binding domains (Supplementary Table 4).

A total of 4,064 genes contained regions depleted in missense variation with a significant proportion of their coding sequence in the top 15 percentile of MTR (binomial test with π₀ = 0.15, P < 0.05 after multiple testing correction; Supplementary Table 5). To identify genes with signatures of missense-only constraint, we assessed the LOF-constraint metric, s_het, of these highly missense constrained genes (Fig. 3c). Among the 4,064 genes, 1,482 either were not LOF constrained or lacked s_het estimates. These genes had significantly shorter CDS lengths than those of the 1,424 LOF-specific constrained genes (P = 2.9 × 10⁻⁴⁰, Wilcoxon test; Extended Data Fig. 7c). Estimating region-level LOF constraint is difficult owing to strong selection against pLOF variants, which leads to a paucity of pLOF variation. MTR serves as a complementary lens for identifying, first, functionally important regions at a higher resolution than gene-level LOF constraint, and, second, regions within genes that are depleted of missense variation but tolerant of LOF variation. For example, KRAS, a well-known oncogene, is LOF tolerant (s_het = 0.002, LOEUF = 1.24); however, the first 80 amino acids (42%) of the protein sequence were ranked in the top 1 percentile of exome-wide MTR (Fig. 3d). This region includes the P-loop, switch 1 and switch 2 functional domains, which form crucial binding interfaces for effector proteins⁴², and these results therefore highlight the importance of regional constraint metrics.

Understanding ‘human knockouts’

Identifying genes with biallelic pLOF variants provides an opportunity to understand gene function directly through the phenotypic characterization of individuals who have such variants—effectively, naturally occurring ‘human knockouts’. The RGC-ME dataset includes founder populations and cohorts with high rates of consanguinity, contributing to a comprehensive collection of homozygous loss-of-function variation^25,43,44,45. Overall, we identified 4,686 genes comprising 8,576 rare (AAF < 1%) homozygous pLOF variants in 64,852 individuals (Supplementary Table 6). Furthermore, we identified 1,205 genes with carriers of rare (AAF < 1%) heterozygous pLOF variants in trans; that is, compound heterozygotes, 162 of which lacked homozygous pLOFs. In total, 4,848 genes were discovered with carriers of biallelic pLOF variants in which both alleles of a gene were affected by pLOF variation and could be described as putative gene knockouts (pKOs). Of these, 1,751 (1,650 from homozygous pLOFs only) have not to our knowledge been previously reported. Biallelic pLOF variants in RGC-ME are rare; 64.3% of homozygous pLOF variants and 37.4% of pKOs were detected in one participant (Fig. 4a). As expected, cohorts with higher rates of consanguinity were enriched in homozygous pLOF variants, compared with outbred populations, despite smaller sample sizes (Fig. 4b,c and Extended Data Table 1).

**Fig. 4: Rare biallelic pLOF variants and ‘human knockouts’ in the RGC-ME dataset.**

pKOs were significantly less constrained, with a lower s_het (on average −0.074 [−0.077, −0.071 (95% CI)], t-test) relative to all other genes. Only 2.67% of pKOs had an s_het value greater than 0.073, as compared with 21.6% of all human genes, and 47.2% of pKOs were in the lowest quintile of s_het scores exome-wide (s_het < 7.07 × 10⁻³). A caveat is that s_het, like most gene-specific measures of constraint, is designed to capture the effect of heterozygous LOF⁴⁶. Although genes containing biallelic pLOF variants are under less heterozygous selective pressure, existing sample sizes are inadequate⁴⁷ to directly compute selection on homozygous variation. pKOs are overrepresented in drug and xenobiotic metabolism pathways (Supplementary Fig. 6).

Among very rare doubleton variants for which we observed exactly two copies of the alternate allele, we observed a clear excess of homozygotes that is likely to be explained by population structure and background inbreeding. For example, among missense and synonymous variants, we observed 5,857 and 2,490 homozygotes among 1,580,917 and 679,335 doubleton variants, respectively, compared with a Hardy–Weinberg equilibrium (HWE) expectation of fewer than one homozygote in each case. These estimates corresponded to a background inbreeding coefficient of 0.37%. Among pLOF variants, we observed only 406 homozygotes among 129,405 doubleton variants (Supplementary Table 7). Although this number is much larger than HWE expectations, it is around 15% less than the expected 479 homozygotes calculated using an inbreeding coefficient of 0.37% (P = 0.0095, Fisher’s exact test). This suggests that a notable proportion of these homozygotes were never observed in our sample population.

Genes with biallelic inactivating mutations could reveal potential drug targets that can be disrupted with minimal side effects⁴³. Drug targets with homozygous pLOF variants in humans are more likely to progress from phase I trials to approval⁴⁴. Of 997 inhibitory preclinical targets listed in the Drug Repurposing Hub, 182 (18.3%) had at least one individual with a rare biallelic pLOF variant in the RGC-ME dataset⁴⁸. In-depth phenotyping of human knockouts can help researchers to better understand the efficacy and side-effect profiles of these potential drug targets. Human knockouts provide a way to understand the consequences of lifelong deficiency of a gene⁴⁹.

Annotation of splice-affecting variants

Several prediction tools^50,51,52,53 have been developed to understand the effects of genetic variants on alternative splicing. Although these tools mainly assess whether a variant affects splicing, some also provide a pathogenicity metric or score threshold as a measure of deleteriousness. Predicted cryptic splice sites with SpliceAI scores greater than 0.8 have been validated at high rates using RNA sequencing and are as depleted at common allele frequencies as pLOF variants⁵⁰. Here, we used human genetic data to optimize splice prediction score thresholds enriched for deleterious variants that affect splicing. We systematically quantified the deleteriousness of variants at various splice prediction score thresholds using the MAPS metric. As previously demonstrated^2,4, pLOF variants had the highest MAPS scores, followed by missense, synonymous and noncoding variants, respectively (Fig. 5a).

**Fig. 5: Identification of deleterious variants that are predicted to affect splicing.**

We used splice predictions from SpliceAI⁵⁰ and MMSplice⁵¹ to group variants into predicted splice score bins, and identified the minimum threshold at which the MAPS score of the variants is equal to that of 5/5 missense variants (variants predicted to be deleterious by five out of five prediction methods). The proposed prediction score thresholds of 0.35 for SpliceAI and 0.97 for MMSplice pathogenicity (Fig. 5a and Extended Data Fig. 5b) identify predicted deleterious splice-affecting variants (SAVs).

A total of 296,696 predicted deleterious coding SAVs (inclusive of canonical splice sites, splice region and untranslated region (UTR) splice sites) in the RGC-ME dataset had scores that exceeded the MAPS-derived splicing thresholds for both SpliceAI and MMSplice (referred to as the intersection set; Extended Data Fig. 8a). Of these, 43.5% (129,118) were cryptic splice sites (that is, non-canonical splice sites). Unsurprisingly, canonical splice sites and variants within the splice region comprised the largest category of predicted deleterious SAVs. Both SpliceAI and MMSplice identified around 80% of LOFTEE (loss of function transcript effect estimator; ref. ⁴) high-confidence splice sites and around 10% of variants within splice regions as predicted deleterious SAVs (Extended Data Fig. 8a,b). In addition, around 68% of LOFTEE low-confidence splice sites were predicted to be deleterious SAVs (94% of low-confidence splice sites were in the UTR). The impact of non-canonical splice variants on alternative splicing is often underestimated; we found that missense variants accounted for 11.3% of all predicted deleterious SAVs identified by both SpliceAI and MMSplice in the RGC-ME dataset (Extended Data Fig. 8a,b).

Predicted deleterious SAVs were enriched in well-supported ClinVar pathogenic variants (two stars or more) compared with other metrics of variant deleteriousness (Fig. 5b); for example, compared with combined annotation dependent depletion (CADD)^54,55 score ≥ 20 (odds ratio = 4.5, P = 0). Missense SAVs were significantly enriched for pathogenic variants compared with 5/5 missense variants (odds ratio = 1.8, P = 3.3 × 10⁻⁸) and missense variants with CADD ≥ 20 (odds ratio = 3.6, P = 1.01 × 10⁻²⁶), respectively. Notably, splice sites in the intersection set were also significantly enriched for pathogenic variants compared to LOFTEE high-confidence splice sites, indicating that the MAPS-derived metric identifies deleterious splice sites. Similar results were obtained when we evaluated the enrichment of pathogenic variants compared with benign ones (Supplementary Table 8a).

We next assessed the MAPS-derived splice prediction thresholds for variants that have been experimentally assessed for splicing effects^56,57,58 (Supplementary Table 9). Predicted deleterious SAVs identified in the intersection set were significantly enriched in experimentally validated large-effect splice-disrupting variants (SDVs) compared with non-SDVs in all functional categories except the splice site category, although the odds ratio was greater than one for splice sites (Fig. 5c). Variants of unknown significance (VUSs) in ClinVar that were predicted as deleterious SAVs were also significantly enriched in experimentally validated SDVs (Fig. 5c). Of the 563 predicted deleterious SAVs assayed in the experimental data, 346 (61.5%) were SDVs and more than half were cryptic splice sites, including 13 ClinVar VUSs (Extended Data Fig. 8d).

We also derived stringent thresholds to identify SAVs by removing canonical splice sites and calibrating exclusively coding non-splice-site (nonSS) variants to a MAPS score comparable with 5/5 missense variants. These thresholds corresponded to a SpliceAI score of 0.43 and an MMSplice score of 0.97 (Extended Data Fig. 5c). Pathogenic enrichment was consistent when comparing deleterious coding nonSS and missense SAVs with corresponding variant categories filtered by CADD ≥ 20 (Supplementary Table 8b,c). Consistent results were also obtained when comparing the enrichment of deleterious SAVs in SDVs to non-SDVs after applying thresholds for coding nonSS variants (Extended Data Fig. 8c,e).

Clinical utility of rare variants

To understand the prevalence of disease-associated alleles in the general population, we identified well-supported ClinVar⁵⁹ pathogenic variants (two stars or more) across 2,042 genes in 822K unrelated RGC-ME samples. We found that 40.7% of pathogenic variants (20,343/49,990) were observed in the RGC-ME dataset, of which 99.6% (20,262) had an AAF of less than 0.1% and 17.8% (3,619) were observed once. In comparison, 20% (9,821) and 29% (14,700) of pathogenic variants were observed in ExAC exomes (n ≈ 60,000) and gnomAD v.2.1.1 exomes (n ≈ 126,000), respectively (Extended Data Fig. 9a). This highlights the importance of the RGC-ME dataset’s larger sample size in identifying rare pathogenic variants. On average, individuals carry 1.58 pathogenic variants, with the majority of these individuals being heterozygous carriers of these variants. Specifically, 61.4% of the 822K unrelated individuals were heterozygous carriers of pathogenic recessive alleles in 1,143 of 2,659 known autosomal recessive genes (mean, 0.98 pathogenic alleles per person); 0.21% of the samples were homozygotes of pathogenic variants in 167 autosomal recessive genes; and 3.64% were heterozygous carriers of 353 of 1,629 total autosomal dominant genes. Pathogenic variant annotations should be interpreted cautiously owing to the incomplete penetrance of disease alleles⁶⁰.

The American College of Medical Genetics identified a set of genes (ACMG SF v.3.1) with clinically actionable variants that predispose individuals to diseases and for which medical interventions are available to reduce mortality and morbidity⁶¹. Among the 822K unrelated individuals, 22,846 (2.77%) had at least one ClinVar-reported (two stars or more) pathogenic missense or pLOF variant for 72 out of 76 autosomal genes on the ACMG list (Supplementary Table 10). As expected, two of the most prevalent pathogenic variations were the HFE Cys282Tyr allele (enriched in EUR, n_{EUR-homozygotes} = 3,220 and AAF_EUR = 13.8%) and the TTR Val142Ile allele (enriched in AFR, n_AFR = 1,670 and AAF_AFR = 3.4%).

We also tallied carriers of likely pathogenic pLOF variants (novel variants not yet reported as pathogenic in ClinVar) in 44 genes in which truncation is known to lead to disease. A total of 2,357 (0.3%) individuals in the RGC-ME dataset carried 1,407 likely pathogenic variants across 40 of these genes. In total, 3.06% of the individuals in the RGC-ME dataset were carriers of pathogenic or likely pathogenic variants. Excluding individuals with high-frequency pathogenic variants in the HFE (Cys282Tyr) and TTR (Val142Ile) genes, 2.38% of the individuals in the RGC-ME dataset carried an actionable variant (Supplementary Table 10). This number is comparable with those from other reports^6,7,62 of actionable variants, which range from 2% to 4.1% for gene sets that include ACMG v.2.0 and v.3.0. As expected, pathogenic variants are rare in large-scale studies of the general population. We found that 39% and 79% of pathogenic and likely pathogenic variants, respectively, were singletons. Focusing on non-ACMG genes, we found that 1.27% of individuals were heterozygous carriers of pathogenic variants in autosomal dominant genes, and 0.21% were homozygotes of pathogenic variants in autosomal recessive genes.

Because the RGC-ME dataset includes uniformly processed exome data from a relatively large proportion of individuals from continental ancestries other than EUR, we assessed the range of allele frequencies of variants present in ClinVar across four continental populations: AFR, EUR, IAM and SAS. Approximately 34% of unique pathogenic coding variants in equalized subsamples were observed only in individuals of non-EUR ancestry, which indicates that sampling diverse populations is necessary for the comprehensive identification of rare variation. Across all unrelated individuals, on average, those of EUR ancestry had 63% more pathogenic variants that were well characterized (rated two stars or more) per sample than did individuals of AFR ancestry. Conversely, individuals of EUR ancestry had, per sample, 25.6% fewer VUSs (Extended Data Fig. 9b,c) and 18.6% fewer variants across all functional types (Extended Data Fig. 2). In individuals of AFR ancestry, a consistent pattern of significantly fewer high-confidence (two stars or more) pathogenic variants (−0.576 [−0.567, −0.585 (95% CI)], t-test) to a surplus of VUSs (42.13 [421.97, 42.28]), compared with individuals of EUR ancestry, suggests that the most well-characterized pathogenic variants were depleted in this population (Extended Data Fig. 9b). Recruiting diverse individuals to enable the identification and characterization of novel pathogenic variants might help to address this ascertainment bias. Further analyses of pathogenic coding variants and differentiated alleles between ancestries are included in Supplementary Fig. 7 and Supplementary Table 11.

Understanding VUSs is currently a bottleneck in the interpretation of variation in clinically relevant genes and a challenge in clinical management¹⁹. Although VUSs have less empirical evidence for pathogenicity, they comprise the bulk of ClinVar, with more than one million variants. Notably, VUSs in regions of low MTR may be deleterious, comprising 5,079 (0.68%) VUSs in the top 1 percentile of MTR-constrained regions and 17,500 VUSs (2%) in the top 15 percentile (Supplementary Table 12). Using the MAPS-derived splicing score thresholds, we identified more than 11,000 candidate deleterious cryptic splice sites among VUSs (1,366 synonymous variants in 822 genes and 10,407 missense variants in 3,501 genes), offering potential insights into their functional consequences for clinical prioritization and interpretation efforts.

Discussion

The RGC-ME dataset, derived from 983,578 exomes, provides a harmonized catalogue of around 20 million coding variants in individuals from a diverse array of ancestries and is publicly accessible at https://rgc-research.regeneron.com/me/home. Cataloguing variation at scale provides an opportunity to accurately estimate the frequency of rare variants—allowing us to precisely compute gene and regional constraint metrics, expand the compendium of rare human knockouts, annotate deleterious cryptic splice sites, characterize variant frequencies across different ancestries and assess the population prevalence of pathogenic variation. RGC-ME will be an invaluable resource for interpreting rare variants and is a step towards the realization of precision medicine.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Genetic variation data for 821,979 unrelated individuals are made publicly available through the RGC-ME browser (https://rgc-research.regeneron.com/me/home). Features include genomic locations, alleles, fine-scale ancestry assignments, population-specific allele frequencies and functional annotations for the genetic variants. In addition, vcf files can be downloaded from the web portal. Exome-wide MTR scores are available for download from figshare: https://doi.org/10.6084/m9.figshare.24587328 (ref. ⁶³). The human reference genome GRCh38 can be obtained from ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa. Ensembl Release 100 gene and transcript builds can be accessed from https://ftp.ensembl.org/pub/release-100/gtf/homo_sapiens/ and corresponding gene and transcript reference nucleotide and protein sequence data from https://ftp.ensembl.org/pub/release-100/fasta/homo_sapiens/. Individual-level sequence data have been deposited with the UK Biobank and are freely available to approved researchers. Instructions for access to UK Biobank data are available at https://www.ukbiobank.ac.uk/enable-your-research. Information about the data access policy for researchers interested in the Mexico City Prospective Study data can be found at https://www.ctsu.ox.ac.uk/research/prospective-blood-based-study-of-150-000-individuals-in-mexico. Geisinger Health System individual-level data are available to qualified academic, non-commercial researchers through an information transfer agreement by contacting Lance J. Adams (ljadams1@geisinger.edu). Information about the data access policy, procedures and contact details for the cohorts included in this dataset can be obtained through the URLs given in the RGC-ME browser at https://rgc-research.regeneron.com/me/data-contributors. This information is also provided in Supplementary Table 1c, with relevant references if available.

Code availability

Publicly available software and packages used in this study are described in the Supplementary Information. In summary, sequencing reads were generated using bcl2fastq v.2.20 and were mapped to references using BWA-MEM v.0.7.17. Variants were identified using DeepVariant v.0.10, aggregated with GLnexus v.1.4.3 and converted to bed, bim or fam format using PLINK v.1.9. Variants were annotated with VEP (Ensembl, v.100.4) and pLOF variants were further classified with the VEP LOFTEE plug-in. Array variants were phased using Eagle v.2.4 and imputed using Minimac4. PLINK v.2 was used for principal component analysis and to compute Fst, a measure of genetic differentiation among population groups. The csq function in BCFtools v.1.18 was used to annotate in-frame indels resulting from a combination of frameshift indels on the same haplotype, and bedtools v.2.30.0 was used to determine the genetic context and neighbouring nucleotides of variants. Picard LiftoverVcf v.3.0.0 was used to transform sequence coordinates to GRCh38. Relatedness was determined with PRIMUS: https://primus.gs.washington.edu/primusweb/res/documentation.html. We adapted scripts from https://github.com/pjshort/dddMAPS to compute updated MAPS metrics. For identifying compound heterozygous variants, exome variants were merged with a well-imputed common variant backbone and phased using SHAPEIT5 (https://github.com/odelaneau/shapeit5). Large-scale data manipulation used Scala v.2.12 on a 10.4 LTS runtime (Apache Spark v.3.2.1) with standard Spark functions. Beyond standard R packages, visualization tools and data-processing libraries (for example, dplyr, ggplot2 and data.table), we used rstan (v.2.33) to build Bayesian hierarchical models for calculating heterozygous selection coefficients, rmutil (v.4.1.2) to project LOF accrual and boot (v.4.1.1) for bootstrapping. Python code used standard packages (for example, scipy, numpy and pandas) for analysis, scikit-learn (v.1.0) to model variant quality (see Supplementary Tables 13 and 14) and sqlalchemy (v.2.0.23) to store and query tables. Custom code to generate LOF projection curves is available at https://github.com/rgcgithub/rgc_me_analysis.

References

Baxter, S. M. et al. Centers for Mendelian Genomics: a decade of facilitating gene discovery. Genet. Med. 24, 784–797 (2022).
Article CAS PubMed PubMed Central Google Scholar
Musunuru, K. et al. Exome sequencing, ANGPTL3 mutations, and familial combined hypolipidemia. N. Engl. J. Med. 363, 2220–2227 (2010).
Article CAS PubMed PubMed Central Google Scholar
Soutar, A. K. & Naoumova, R. P. Mechanisms of disease: genetic causes of familial hypercholesterolemia. Nat. Clin. Pract. Cardiovasc. Med. 4, 214–225 (2007).
Article CAS PubMed Google Scholar
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Article ADS CAS PubMed PubMed Central Google Scholar
Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
Article CAS PubMed PubMed Central Google Scholar
Dewey, F. E. et al. Distribution and clinical impact of functional variants in 50,726 whole-exome sequences from the DiscovEHR study. Science 354, aaf6814 (2016).
Article PubMed Google Scholar
Van Hout, C. V. et al. Exome sequencing and characterization of 49,960 individuals in the UK Biobank. Nature 586, 749–756 (2020).
Article ADS PubMed PubMed Central Google Scholar
Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299 (2021).
Article ADS CAS PubMed PubMed Central Google Scholar
Backman, J. D. et al. Exome sequencing and analysis of 454,787 UK Biobank participants. Nature 599, 628–634 (2021).
Article ADS CAS PubMed PubMed Central Google Scholar
Wang, Q. et al. Rare variant contribution to human disease in 281,104 UK Biobank exomes. Nature 597, 527–532 (2021).
Article ADS CAS PubMed PubMed Central Google Scholar
Park, J. et al. Exome-wide evaluation of rare coding variants using electronic health records identifies new gene-phenotype associations. Nat. Med. 27, 66–72 (2021).
Article CAS PubMed PubMed Central Google Scholar
Kessler, M. D. et al. Common and rare variant associations with clonal haematopoiesis phenotypes. Nature 612, 301–309 (2022).
Article ADS CAS PubMed PubMed Central Google Scholar
Akbari, P. et al. Sequencing of 640,000 exomes identifies GPR75 variants associated with protection from obesity. Science 373, eabf8683 (2021).
Article CAS PubMed PubMed Central Google Scholar
Nag, A. et al. Human genetics uncovers MAP3K15 as an obesity-independent therapeutic target for diabetes. Sci. Adv. 8, eadd5430 (2022).
Article CAS PubMed PubMed Central Google Scholar
Praveen, K. et al. ANGPTL7, a therapeutic target for increased intraocular pressure and glaucoma. Commun. Biol. 5, 1051 (2022).
Article CAS PubMed PubMed Central Google Scholar
Tanigawa, Y. et al. Rare protein-altering variants in ANGPTL7 lower intraocular pressure and protect against glaucoma. PLoS Genet. 16, e1008682 (2020).
Article CAS PubMed PubMed Central Google Scholar
Verweij, N. et al. Germline mutations in CIDEB and protection against liver disease. N. Engl. J. Med. 387, 332–344 (2022).
Article CAS PubMed Google Scholar
Rifai, M. A. & Ballantyne, C. M. PCSK9-targeted therapies: present and future approaches. Nat. Rev. Cardiol. 18, 805–806 (2021).
Article CAS PubMed Google Scholar
McInnes, G. et al. Opportunities and challenges for the computational interpretation of rare variation in clinically important genes. Am. J. Hum. Genet. 108, 535–548 (2021).
Article CAS PubMed PubMed Central Google Scholar
Zhou, Y. et al. Rare genetic variability in human drug target genes modulates drug response and can guide precision medicine. Sci. Adv. 7, eabi6856 (2021).
Article ADS CAS PubMed PubMed Central Google Scholar
The All of Us Research Program Investigators. The “All of Us” Research Program. N. Engl. J. Med. 381, 668–676 (2019).
Article PubMed Central Google Scholar
GenomeAsia, K. C. The GenomeAsia 100K Project enables genetic discoveries across Asia. Nature 576, 106–111 (2019).
Article ADS Google Scholar
Wong, E. et al. The Singapore National Precision Medicine Strategy. Nat. Genet. 55, 178–186 (2023).
Article CAS PubMed Google Scholar
Fatumo, S. et al. Promoting the genomic revolution in Africa through the Nigerian 100K Genome Project. Nat. Genet. 54, 531–536 (2022).
Article CAS PubMed Google Scholar
Ziyatdinov, A. et al. Genotyping, sequencing and analysis of 140,000 adults from Mexico City. Nature 622, 784–793 (2023).
Article CAS PubMed PubMed Central Google Scholar
Li, J. Z. et al. Worldwide human relationships inferred from genome-wide patterns of variation. Science 319, 1100–1104 (2008)
Article ADS CAS PubMed Google Scholar
Bartha, I., di Iulio, J., Venter, J. C. & Telenti, A. Human gene essentiality. Nat. Rev. Genet. 19, 51–62 (2018).
Article CAS PubMed Google Scholar
Cassa, C. A. et al. Estimating the selective effects of heterozygous protein-truncating variants from human exome data. Nat. Genet. 49, 806–810 (2017).
Article CAS PubMed PubMed Central Google Scholar
Seplyarskiy, V. et al. A mutation rate model at the basepair resolution identifies the mutagenic effect of polymerase III transcription. Nat. Genet. 55, 2235–2242 (2023).
Article CAS PubMed Google Scholar
Weghorn, D. et al. Applicability of the mutation-selection balance model to population genetics of heterozygous protein-truncating variants in humans. Mol. Biol. Evol. 36, 1701–1710 (2019).
Article CAS PubMed PubMed Central Google Scholar
Agarwal, I., Fuller, Z. L., Myers, S. R. & Przeworski, M. Relating pathogenic loss-of-function mutations in humans to their evolutionary fitness costs. eLife 12, e83172 (2023).
Article CAS PubMed PubMed Central Google Scholar
Chen, S. et al. A genomic mutational constraint map using variation in 76,156 human genomes. Nature 625, 92–100 (2024).
Article ADS CAS PubMed Google Scholar
Qin, Q., Xu, Y., He, T., Qin, C. & Xu, J. Normal and disease-related biological functions of Twist1 and underlying molecular mechanisms. Cell Res. 22, 90–106 (2012).
Article CAS PubMed Google Scholar
Puente, X. S. et al. Exome sequencing and functional analysis identifies BANF1 mutation as the cause of a hereditary progeroid syndrome. Am. J. Hum. Genet. 88, 650–656 (2011).
Article CAS PubMed PubMed Central Google Scholar
Samwer, M. et al. DNA cross-bridging shapes a single nucleus from a set of mitotic chromosomes. Cell 170, 956–972 (2017).
Article CAS PubMed PubMed Central Google Scholar
Kuna, M. et al. CITED2 is a conserved regulator of the uterine-placental interface. Proc. Natl Acad. Sci. USA 120, e2213622120 (2023).
Article CAS PubMed PubMed Central Google Scholar
Gardner, E. J. et al. Reduced reproductive success is associated with selective constraint on human genes. Nature 603, 858–863 (2022).
Article ADS CAS PubMed Google Scholar
Samocha, K. E. et al. Regional missense constraint improves variant deleteriousness prediction. Preprint at bioRxiv https://doi.org/10.1101/148353 (2017).
Havrilla, J. M., Pedersen, B. S., Layer, R. M. & Quinlan, A. R. A map of constrained coding regions in the human genome. Nat. Genet. 51, 88–95 (2019).
Article CAS PubMed Google Scholar
Traynelis, J. et al. Optimizing genomic medicine in epilepsy through a gene-customized approach to missense variant interpretation. Genome Res. 27, 1715–1729 (2017).
Article CAS PubMed PubMed Central Google Scholar
Silk, M., Petrovski, S. & Ascher, D. B. MTR-Viewer: identifying regions within genes under purifying selection. Nucleic Acids Res. 47, W121–W126 (2019).
Article CAS PubMed PubMed Central Google Scholar
Pantsar, T. The current understanding of KRAS protein structure and dynamics. Comput. Struct. Biotechnol. J. 18, 189–198 (2020).
Article CAS PubMed Google Scholar
Saleheen, D. et al. Human knockouts and phenotypic analysis in a cohort with a high rate of consanguinity. Nature 544, 235–239 (2017).
Article ADS CAS PubMed PubMed Central Google Scholar
Narasimhan, V. M. et al. Health and population effects of rare gene knockouts in adult humans with related parents. Science 352, 474–477 (2016).
Article ADS CAS PubMed PubMed Central Google Scholar
Sulem, P. et al. Identification of a large set of rare complete human knockouts. Nat. Genet. 47, 448–452 (2015).
Article CAS PubMed Google Scholar
Fuller, Z. L., Berg, J. J., Mostafavi, H., Sella, G. & Przeworski, M. Measuring intolerance to mutation in human genetics. Nat. Genet. 51, 772–776 (2019).
Article CAS PubMed PubMed Central Google Scholar
Balick, D. J., Do, R., Cassa, C. A., Reich, D. & Sunyaev, S. R. Dominance of deleterious alleles controls the response to a population bottleneck. PLoS Genet. 11, e1005436 (2015).
Article PubMed PubMed Central Google Scholar
Corsello, S. M. et al. The Drug Repurposing Hub: a next-generation drug library and information resource. Nat. Med. 23, 405–408 (2017).
Article CAS PubMed PubMed Central Google Scholar
McGregor, T. L. et al. Characterising a healthy adult with a rare HAO1 knockout to support a therapeutic strategy for primary hyperoxaluria. eLife 9, e54363 (2020).
Article CAS PubMed PubMed Central Google Scholar
Jaganathan, K. et al. Predicting splicing from primary sequence with deep learning. Cell 176, 535–548 (2019).
Article CAS PubMed Google Scholar
Cheng, J. et al. MMSplice: modular modeling improves the predictions of genetic variant effects on splicing. Genome Biol. 20, 48 (2019).
Article PubMed PubMed Central Google Scholar
Gelfman, S. et al. Annotating pathogenic non-coding variants in genic regions. Nat. Commun. 8, 236 (2017).
Article ADS PubMed PubMed Central Google Scholar
Zeng, T. & Li, Y. I. Predicting RNA splicing from DNA sequence using Pangolin. Genome Biol. 23, 103 (2022).
Article CAS PubMed PubMed Central Google Scholar
Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).
Article CAS PubMed PubMed Central Google Scholar
Baird, D. D., Dunson, D. B., Hill, M. C., Cousins, D. & Schectman, J. M. High cumulative incidence of uterine leiomyoma in black and white women: ultrasound evidence. Am. J. Obstet. Gynecol. 188, 100–107 (2003).
Article PubMed Google Scholar
Soemedi, R. et al. Pathogenic variants that alter protein code often disrupt splicing. Nat. Genet. 49, 848–855 (2017).
Article CAS PubMed PubMed Central Google Scholar
Adamson, S. I., Zhan, L. & Graveley, B. R. Vex-seq: high-throughput identification of the impact of genetic variation on pre-mRNA splicing efficiency. Genome Biol. 19, 71 (2018).
Article PubMed PubMed Central Google Scholar
Cheung, R. et al. A multiplexed assay for exon recognition reveals that an unappreciated fraction of rare genetic variants cause large-effect splicing disruptions. Mol. Cell 73, 183–194 (2019).
Article CAS Google Scholar
Landrum, M. J. et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 46, D1062–D1067 (2018).
Article CAS PubMed Google Scholar
Forrest, I. S. et al. Population-based penetrance of deleterious clinical variants. J. Am. Med. Assoc. 327, 350–359 (2022).
Article CAS Google Scholar
Miller, D. T. et al. ACMG SF v3.1 list for reporting of secondary findings in clinical exome and genome sequencing: a policy statement of the American College of Medical Genetics and Genomics (ACMG). Genet. Med. 24, 1407–1414 (2022).
Article CAS PubMed Google Scholar
Halldorsson, B. V. et al. The sequences of 150,119 genomes in the UK Biobank. Nature 607, 732–740 (2022).
Article ADS CAS PubMed PubMed Central Google Scholar
Sun, K. Exome-wide MTR scores computed with RGC-ME data for all possible missense variants in canonical transcripts. figshare https://doi.org/10.6084/m9.figshare.24587328 (2024).

Download references

Acknowledgements

This work was supported in part by the Intramural Research Program of the National Institute of Mental Health (ZIA-MH002843) and a grant from R01 NCI R01 CA157823. Ethical approval for the UK Biobank was previously obtained from the North West Centre for Research Ethics Committee (11/ NW/0382). The work described herein was approved by the UK Biobank under application number 26041. Informed consent was obtained for all study participants. Approval for Geisinger Health System MyCode analyses was provided by the Geisinger Health System Institutional Review Board under project number 2006-0258. Informed consent was obtained for all study participants. Appropriate consent for the Penn Medicine BioBank was obtained from each participant regarding the storage of biological specimens, genetic sequencing and genotyping and access to all available EHR data. This study was approved by the Institutional Review Board of the University of Pennsylvania and complied with the principles set out in the Declaration of Helsinki. All individuals participating in the Mayo–RGC project generation provided informed consent for the use of specimens and data in genetic and health research and ethical approval for project generation was provided by the Mayo Clinic Institutional Review Board (09-007763). All research performed in this study used de-identified data (without any Protected Health Information data) with no possibility of re-identifying any of the participants. Approval for the Indiana Biobank was provided by the Indiana University Institutional Review Board under project number 1105005445. For participants in the Mexico City Prospective Study, approval for the study was given by the Mexican Ministry of Health, the Mexican National Council of Science and Technology (0595 P-M) and the Central Oxford Research Ethics Committee (C99.260) and the Ethics and Research commissions from the Medicine Faculty at the National Autonomous University of Mexico (UNAM) (FMED/CI/SPLR/067/2015). All study participants provided written informed consent. Study participants were recruited from the BioMe Biobank Program of the Charles Bronfman Institute for Personalized Medicine at Mount Sinai Medical Center from 2007 onward. The BioMe Biobank Program (Institutional Review Board 07-0529) operates under a Mount Sinai Institutional Review Board-approved research protocol. All study participants provided written informed consent. The authors thank the RGC-ME Cohort Partners for contributions to this initiative. A list of cohorts (with links to the projects where available) and the data contributors can be accessed from the RGC-ME browser (https://rgc-research.regeneron.com/me/data-contributors). The authors thank everyone who made this work possible, the professionals from the member institutions who contributed to and supported this work and, most especially, all of the participants, without whom this research would not be possible. This study is funded by Regeneron Genetics Center and Regeneron Pharmaceuticals.

Author information

These authors contributed equally: Kathie Y. Sun, Xiaodong Bai

Authors and Affiliations

Regeneron Genetics Center, Tarrytown, NY, USA
Kathie Y. Sun, Xiaodong Bai, Siying Chen, Suying Bao, Chuanyi Zhang, Manav Kapoor, Joshua Backman, Tyler Joseph, Evan Maxwell, George Mitra, Alexander Gorovits, Adam Mansfield, Boris Boutkov, Sujit Gokhale, Lukas Habegger, Anthony Marcketta, Adam E. Locke, Liron Ganel, Alicia Hawes, Michael D. Kessler, Deepika Sharma, Jeffrey Staples, Jonas Bovijn, Sahar Gelfman, Alessandro Di Gioia, Veera M. Rajagopal, Alexander Lopez, Jennifer Rico Varela, Gonçalo Abecasis, Giovanni Coppola, Andrew Deubler, Aris Economides, Adolfo Ferrando, Luca A. Lotta, Alan Shuldiner, Katherine Siminovitch, Christina Beechert, Erin D. Brian, Laura M. Cremona, Hang Du, Caitlin Forsythe, Zhenhua Gu, Kristy Guevara, Michael Lattari, Kia Manoochehri, Prathyusha Challa, Manasi Pradhan, Raymond Reynoso, Ricardo Schiavo, Maria Sotiropoulos Padilla, Chenggu Wang, Sarah E. Wolf, Amelia Averitt, Nilanjana Banerjee, Dadong Li, Sameer Malhotra, Justin Mower, Mudasar Sarwar, Jeffrey C. Staples, Sean Yu, Aaron Zhang, Andrew Bunyea, Krishna Pawan Punuru, Sanjay Sreeram, Gisu Eom, Benjamin Sultan, Rouel Lanche, Vrushali Mahajan, Eliot Austin, Sean O’Keeffe, Razvan Panea, Tommy Polanco, Ayesha Rasool, Lance Zhang, Evan Edelstein, Ju Guan, Olga Krasheninina, Samantha Zarate, Adam J. Mansfield, Evan K. Maxwell, Kathie Sun, Manuel Allen Revez Ferreira, Kathy Burch, Adrian Campos, Lei Chen, Sam Choi, Amy Damask, Sheila Gaynor, Benjamin Geraghty, Arkopravo Ghosh, Salvador Romero Martinez, Christopher Gillies, Lauren Gurski, Joseph Herman, Eric Jorgenson, Michael Kessler, Jack Kosmicki, Nan Lin, Adam Locke, Priyanka Nakka, Karl Landheer, Olivier Delaneau, Maya Ghoussaini, Joelle Mbatchou, Arden Moscati, Aditeya Pandey, Anita Pandit, Charles Paulding, Jonathan Ross, Carlo Sidore, Eli Stahl, Maria Suciu, Peter VandeHaar, Sailaja Vedantam, Scott Vrieze, Jingning Zhang, Rujin Wang, Kuan-Han Wu, Bin Ye, Blair Zhang, Andrey Ziyatdinov, Yuxin Zou, Kyoko Watanabe, Mira Tang, Brian Hobbs, Jon Silver, William Palmer, Rita Guerreiro, Amit Joshi, Antoine Baldassari, Cristen Willer, Sarah Graham, Ernst Mayerhofer, Mary Haas, Niek Verweij, George Hindy, Tanima De, Parsa Akbari, Luanluan Sun, Olukayode Sosina, Arthur Gilly, Peter Dornbos, Juan Rodriguez-Flores, Moeen Riaz, Gannie Tzoneva, Momodou W. Jallow, Anna Alkelai, Ariane Ayer, Veera Rajagopal, Vijay Kumar, Jacqueline Otto, Neelroop Parikshak, Aysegul Guvenek, Jose Bras, Silvia Alvarez, Jessie Brown, Jing He, Hossein Khiabanian, Joana Revez, Kimberly Skead, Valentina Zavala, Lyndon J. Mitnaul, Marcus B. Jones, Esteban Chen, Michelle G. LeBlanc, Jason Mighty, Nirupama Nishtala, Nadia Rana, Jennifer Rico-Varela, Jaimee Hernandez, Alison Fenney, Randi Schwartz, Jody Hankins, Samuel Hart, Ann Perez-Beals, Gina Solari, Johannie Rivera-Picart, Michelle Pagan, Sunilbe Siceron, Michael Cantor, Timothy Thornton, Hyun Min Kang, John D. Overton, Alan R. Shuldiner, M. Laura Cremona, Mona Nafde, Aris Baras, Gonçalo Abecasis, Jonathan Marchini, Jeffrey G. Reid, William Salerno & Suganthi Balasubramanian
Faculty of Medicine, National Autonomous University of Mexico (UNAM), Mexico City, Mexico
Jesús Alegre-Díaz, Jaime Berumen, Roberto Tapia-Conyer & Pablo Kuri-Morales
Instituto Tecnológico y de Estudios Superiores de Monterrey, Monterrey, Mexico
Pablo Kuri-Morales
Clinical Trial Service Unit and Epidemiological Studies Unit, Nuffield Department of Population Health, University of Oxford, Oxford, UK
Jason Torres, Jonathan Emberson & Rory Collins
Accelerated Cures, Waltham, MA, USA
David Gwynne
Lundquist Institute, Torrance, CA, USA
Jerome I. Rotter & Robert Weinreb
Case Western Reserve University, Cleveland, OH, USA
Jonathan L. Haines
University of Miami, Miami, FL, USA
Margaret A. Pericak-Vance
University of Pennsylvania, Philadelphia, PA, USA
Dwight Stambolian, Yuki Bradford, Scott Damrauer, Stephanie DerOhannessian, Theodore Drivas, Scott Dudek, Joseph Dunn, Ned Haubein, Renae Judy, Yi-An Ko, Colleen Morse Kripke, Meghan Livingstone, Nawar Naseer, Kyle P. Nerz, Afiya Poindexter, Marjorie Risman, Salma Santos, Giorgio Sirugo, Julia Stephanowski, Teo Tran, Fred Vadivieso, Anurag Verma, Shefali S. Verma, JoEllen Weaver, Colin Wollack, Daniel J. Rader, Marylyn Ritchie, Joan O’Brien & Brendan Keating
Albert Einstein College of Medicine, Bronx, NY, USA
Nir Barzilai, Yousin Suh & Zhengdong Zhang
University of Maryland School of Medicine, Baltimore, MD, USA
Elliot Hong & Braxton Mitchell
Menzies Institute for Medical Research, University of Tasmania, Hobart, Tasmania, Australia
Nicholas B. Blackburn, Bennet J. McComish & Bruce V. Taylor
Griffith University, Gold Coast, Queensland, Australia
Simon Broadley
Murdoch University, Perth, Western Australia, Australia
Marzena J. Fabis-Pedrini
Perron Institute for Neurological and Translational Science, Nedlands, Western Australia, Australia
Marzena J. Fabis-Pedrini & Allan G. Kermode
Monash University, Melbourne, Victoria, Australia
Vilija G. Jokubaitis
Florey Institute of Neuroscience and Mental Health, Melbourne, Victoria, Australia
Trevor J. Kilpatrick & Justin P. Rubio
University of Newcastle, Newcastle, New South Wales, Australia
Jeanette Lechner-Scott & Rodney J. Scott
Melbourne Integrative Genomics, School of Mathematics and Statistics, University of Melbourne, Melbourne, Victoria, Australia
Stephen Leslie & Allan Motyer
University of Sydney, Sydney, New South Wales, Australia
Grant P. Parnell
Center for Non-Communicable Diseases, Karachi, Pakistan
Danish Saleheen
Cincinnati Children’s Hospital, Cincinnati, OH, USA
Ken Kaufman, Leah Kottyan, Lisa Martin & Marc E. Rothenberg
Columbia University, New York City, NY, USA
Abdullah Ali & Azra Raza
UT Southwestern Medical Center, Dallas, TX, USA
Jonathan Cohen
University of Pittsburgh, Pittsburgh, PA, USA
Adam Glassman, Erin E. Kershaw, Georgios Papachristou & David C. Whitcomb
Duke University, Durham, NC, USA
William E. Kraus, Christopher B. Newgard & Svati H. Shah
Flinders University of South Australia, Adelaide, South Australia, Australia
Jamie Craig & Alex Hewitt
Indiana University School of Medicine, Indianapolis, IN, USA
Naga Chalasani, Tatiana Foroud, Suthat Liangpunsakul & Lois B. Travis
Vanderbilt University Medical Center, Nashville, TN, USA
Nancy J. Cox, Eric Gamazon & Eric D. Austin
University of Chicago, Chicago, IL, USA
Eileen Dolan & Omar El-Charif
Loyola University, Chicago, IL, USA
Heather Wheeler
Kaiser Permanente, Oakland, CA, USA
Lori Sakoda & John Witte
Mayo Clinic, Rochester, MN, USA
Kostantinos Lazaridis
Geisinger Health System, Danville, PA, USA
Adam Buchanan, David J. Carey, Christa L. Martin, Michelle N. Meyer, Kyle Retterer & David Rolston
National Institute of Mental Health, National Institutes of Health, Bethesda, MD, USA
Nirmala Akula, Emily Besançon, Sevilla D. Detera-Wadleigh, Layla Kassem, Francis J. McMahon & Thomas G. Schulze
Northwestern University, Chicago, IL, USA
Adam Gordon, Maureen Smith & John Varga
Mount Sinai School of Medicine, New York City, NY, USA
Erwin Bottinger & Judy Cho
University of Alabama at Birmingham, Birmingham, AL, USA
S. Louis Bridges & Robert Kimberly
University of California, Los Angeles, Los Angeles, CA, USA
Marlena Fejzo
University of Colorado School of Medicine, Aurora, CO, USA
Richard A. Spritz
University of Michigan Medical School, Ann Arbor, MI, USA
James T. Elder, Rajan P. Nair, Philip Stuart & Lam C. Tsoi
Ann Arbor Veterans Affairs Hospital, Ann Arbor, MI, USA
James T. Elder
University of Ottawa, Ottawa, Ontario, Canada
Robert Dent & Ruth McPherson
University of Texas Health Science Center at Houston, Houston, TX, USA
Shervin Assassi & Maureen D. Mayes

Authors

Kathie Y. Sun
View author publications
You can also search for this author in PubMed Google Scholar
Xiaodong Bai
View author publications
You can also search for this author in PubMed Google Scholar
Siying Chen
View author publications
You can also search for this author in PubMed Google Scholar
Suying Bao
View author publications
You can also search for this author in PubMed Google Scholar
Chuanyi Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Manav Kapoor
View author publications
You can also search for this author in PubMed Google Scholar
Joshua Backman
View author publications
You can also search for this author in PubMed Google Scholar
Tyler Joseph
View author publications
You can also search for this author in PubMed Google Scholar
Evan Maxwell
View author publications
You can also search for this author in PubMed Google Scholar
George Mitra
View author publications
You can also search for this author in PubMed Google Scholar
Alexander Gorovits
View author publications
You can also search for this author in PubMed Google Scholar
Adam Mansfield
View author publications
You can also search for this author in PubMed Google Scholar
Boris Boutkov
View author publications
You can also search for this author in PubMed Google Scholar
Sujit Gokhale
View author publications
You can also search for this author in PubMed Google Scholar
Lukas Habegger
View author publications
You can also search for this author in PubMed Google Scholar
Anthony Marcketta
View author publications
You can also search for this author in PubMed Google Scholar
Adam E. Locke
View author publications
You can also search for this author in PubMed Google Scholar
Liron Ganel
View author publications
You can also search for this author in PubMed Google Scholar
Alicia Hawes
View author publications
You can also search for this author in PubMed Google Scholar
Michael D. Kessler
View author publications
You can also search for this author in PubMed Google Scholar
Deepika Sharma
View author publications
You can also search for this author in PubMed Google Scholar
Jeffrey Staples
View author publications
You can also search for this author in PubMed Google Scholar
Jonas Bovijn
View author publications
You can also search for this author in PubMed Google Scholar
Sahar Gelfman
View author publications
You can also search for this author in PubMed Google Scholar
Alessandro Di Gioia
View author publications
You can also search for this author in PubMed Google Scholar
Veera M. Rajagopal
View author publications
You can also search for this author in PubMed Google Scholar
Alexander Lopez
View author publications
You can also search for this author in PubMed Google Scholar
Jennifer Rico Varela
View author publications
You can also search for this author in PubMed Google Scholar
Jesús Alegre-Díaz
View author publications
You can also search for this author in PubMed Google Scholar
Jaime Berumen
View author publications
You can also search for this author in PubMed Google Scholar
Roberto Tapia-Conyer
View author publications
You can also search for this author in PubMed Google Scholar
Pablo Kuri-Morales
View author publications
You can also search for this author in PubMed Google Scholar
Jason Torres
View author publications
You can also search for this author in PubMed Google Scholar
Jonathan Emberson
View author publications
You can also search for this author in PubMed Google Scholar
Rory Collins
View author publications
You can also search for this author in PubMed Google Scholar
Michael Cantor
View author publications
You can also search for this author in PubMed Google Scholar
Timothy Thornton
View author publications
You can also search for this author in PubMed Google Scholar
Hyun Min Kang
View author publications
You can also search for this author in PubMed Google Scholar
John D. Overton
View author publications
You can also search for this author in PubMed Google Scholar
Alan R. Shuldiner
View author publications
You can also search for this author in PubMed Google Scholar
M. Laura Cremona
View author publications
You can also search for this author in PubMed Google Scholar
Mona Nafde
View author publications
You can also search for this author in PubMed Google Scholar
Aris Baras
View author publications
You can also search for this author in PubMed Google Scholar
Gonçalo Abecasis
View author publications
You can also search for this author in PubMed Google Scholar
Jonathan Marchini
View author publications
You can also search for this author in PubMed Google Scholar
Jeffrey G. Reid
View author publications
You can also search for this author in PubMed Google Scholar
William Salerno
View author publications
You can also search for this author in PubMed Google Scholar
Suganthi Balasubramanian
View author publications
You can also search for this author in PubMed Google Scholar

Consortia

Regeneron Genetics Center

RGC Management and Leadership Team
- Gonçalo Abecasis
- , Aris Baras
- , Michael Cantor
- , Giovanni Coppola
- , Andrew Deubler
- , Aris Economides
- , Adolfo Ferrando
- , Luca A. Lotta
- , John D. Overton
- , Jeffrey G. Reid
- , Alan Shuldiner
- & Katherine Siminovitch
Sequencing and Lab Operations
- John D. Overton
- , Christina Beechert
- , Erin D. Brian
- , Laura M. Cremona
- , Hang Du
- , Caitlin Forsythe
- , Zhenhua Gu
- , Kristy Guevara
- , Michael Lattari
- , Alexander Lopez
- , Kia Manoochehri
- , Prathyusha Challa
- , Manasi Pradhan
- , Raymond Reynoso
- , Ricardo Schiavo
- , Maria Sotiropoulos Padilla
- , Chenggu Wang
- & Sarah E. Wolf
Clinical Informatics
- Michael Cantor
- , Amelia Averitt
- , Nilanjana Banerjee
- , Dadong Li
- , Sameer Malhotra
- , Justin Mower
- , Mudasar Sarwar
- , Deepika Sharma
- , Jeffrey C. Staples
- , Sean Yu
- & Aaron Zhang
Genome Informatics and Data Engineering
- Jeffrey G. Reid
- , Mona Nafde
- , George Mitra
- , Sujit Gokhale
- , Andrew Bunyea
- , Krishna Pawan Punuru
- , Sanjay Sreeram
- , Gisu Eom
- , Sujit Gokhale
- , Benjamin Sultan
- , Rouel Lanche
- , Vrushali Mahajan
- , Eliot Austin
- , Sean O’Keeffe
- , Razvan Panea
- , Tommy Polanco
- , Ayesha Rasool
- , William Salerno
- , Xiaodong Bai
- , Lance Zhang
- , Boris Boutkov
- , Evan Edelstein
- , Alexander Gorovits
- , Ju Guan
- , Lukas Habegger
- , Alicia Hawes
- , Olga Krasheninina
- , Samantha Zarate
- , Adam J. Mansfield
- , Evan K. Maxwell
- , Suganthi Balasubramanian
- , Suying Bao
- , Kathie Sun
- & Chuanyi Zhang
Analytical Genetics and Data Science
- Gonçalo Abecasis
- , Manuel Allen Revez Ferreira
- , Joshua Backman
- , Kathy Burch
- , Adrian Campos
- , Lei Chen
- , Sam Choi
- , Amy Damask
- , Liron Ganel
- , Sheila Gaynor
- , Benjamin Geraghty
- , Arkopravo Ghosh
- , Salvador Romero Martinez
- , Christopher Gillies
- , Lauren Gurski
- , Joseph Herman
- , Eric Jorgenson
- , Tyler Joseph
- , Michael Kessler
- , Jack Kosmicki
- , Nan Lin
- , Adam Locke
- , Priyanka Nakka
- , Jonathan Marchini
- , Karl Landheer
- , Olivier Delaneau
- , Maya Ghoussaini
- , Anthony Marcketta
- , Joelle Mbatchou
- , Arden Moscati
- , Aditeya Pandey
- , Anita Pandit
- , Charles Paulding
- , Jonathan Ross
- , Carlo Sidore
- , Eli Stahl
- , Maria Suciu
- , Timothy Thornton
- , Peter VandeHaar
- , Sailaja Vedantam
- , Scott Vrieze
- , Jingning Zhang
- , Rujin Wang
- , Kuan-Han Wu
- , Bin Ye
- , Blair Zhang
- , Andrey Ziyatdinov
- , Yuxin Zou
- , Olivier Delaneau
- , Maya Ghoussaini
- , Jingning Zhang
- , Kyoko Watanabe
- & Mira Tang
Therapeutic Area Genetics
- Adolfo Ferrando
- , Giovanni Coppola
- , Luca A. Lotta
- , Alan Shuldiner
- , Katherine Siminovitch
- , Brian Hobbs
- , Jon Silver
- , William Palmer
- , Rita Guerreiro
- , Amit Joshi
- , Antoine Baldassari
- , Cristen Willer
- , Sarah Graham
- , Ernst Mayerhofer
- , Mary Haas
- , Niek Verweij
- , George Hindy
- , Jonas Bovijn
- , Tanima De
- , Parsa Akbari
- , Luanluan Sun
- , Olukayode Sosina
- , Arthur Gilly
- , Peter Dornbos
- , Juan Rodriguez-Flores
- , Moeen Riaz
- , Manav Kapoor
- , Gannie Tzoneva
- , Momodou W. Jallow
- , Anna Alkelai
- , Giovanni Coppola
- , Ariane Ayer
- , Veera Rajagopal
- , Sahar Gelfman
- , Vijay Kumar
- , Jacqueline Otto
- , Neelroop Parikshak
- , Aysegul Guvenek
- , Jose Bras
- , Silvia Alvarez
- , Jessie Brown
- , Jing He
- , Hossein Khiabanian
- , Joana Revez
- , Kimberly Skead
- & Valentina Zavala
Research Program Management and Strategic Initiatives
- Lyndon J. Mitnaul
- , Marcus B. Jones
- , Esteban Chen
- , Michelle G. LeBlanc
- , Jason Mighty
- , Nirupama Nishtala
- , Nadia Rana
- , Jennifer Rico-Varela
- & Jaimee Hernandez
Senior Partnerships and Business Operations
- Alison Fenney
- , Randi Schwartz
- , Jody Hankins
- & Samuel Hart
Business Operations and Administrative Coordinators
- Ann Perez-Beals
- , Gina Solari
- , Jaimee Hernandez
- , Johannie Rivera-Picart
- , Michelle Pagan
- & Sunilbe Siceron

RGC-ME Cohort Partners

Accelerated Cures
- David Gwynne
African Descent and Glaucoma Evaluation Study (ADAGES) III
- Jerome I. Rotter
- & Robert Weinreb
Age-related macular degeneration in the Amish
- Jonathan L. Haines
- , Margaret A. Pericak-Vance
- & Dwight Stambolian
Albert Einstein College of Medicine
- Nir Barzilai
- , Yousin Suh
- & Zhengdong Zhang
Amish Connectome Project
- Elliot Hong
Amish Research Clinic
- Braxton Mitchell
The Australia and New Zealand MS Genetics Consortium
- Nicholas B. Blackburn
- , Simon Broadley
- , Marzena J. Fabis-Pedrini
- , Vilija G. Jokubaitis
- , Allan G. Kermode
- , Trevor J. Kilpatrick
- , Jeanette Lechner-Scott
- , Stephen Leslie
- , Bennet J. McComish
- , Allan Motyer
- , Grant P. Parnell
- , Rodney J. Scott
- , Bruce V. Taylor
- & Justin P. Rubio
Center for Non-Communicable Diseases (CNCD)
- Danish Saleheen
Cincinnati Children’s Hospital
- Ken Kaufman
- , Leah Kottyan
- , Lisa Martin
- & Marc E. Rothenberg
Columbia University
- Abdullah Ali
- & Azra Raza
Dallas Heart Study
- Jonathan Cohen
Diabetic Retinopathy Clinical Research (DRCR) Retina Network
- Adam Glassman
Duke University
- William E. Kraus
- , Christopher B. Newgard
- & Svati H. Shah
Flinders University of South Australia
- Jamie Craig
- & Alex Hewitt
Indiana Biobank
- Naga Chalasani
- , Tatiana Foroud
- & Suthat Liangpunsakul
Indiana University School of Medicine
- Nancy J. Cox
- , Eileen Dolan
- , Omar El-Charif
- , Lois B. Travis
- , Heather Wheeler
- & Eric Gamazon
Kaiser Permanente
- Lori Sakoda
- & John Witte
Mayo Clinic
- Kostantinos Lazaridis
Mexico City Prospective Study (MCPS)
- Jesús Alegre-Díaz
- , Jaime Berumen
- , Rory Collins
- , Jonathan Emberson
- , Pablo Kuri-Morales
- , Roberto Tapia-Conyer
- & Jason Torres
MyCode-DiscovEHR Geisinger Health System Biobank
- Adam Buchanan
- , David J. Carey
- , Christa L. Martin
- , Michelle N. Meyer
- , Kyle Retterer
- & David Rolston
National Institute of Mental Health
- Nirmala Akula
- , Emily Besançon
- , Sevilla D. Detera-Wadleigh
- , Layla Kassem
- , Francis J. McMahon
- & Thomas G. Schulze
Northwestern University
- Adam Gordon
- , Maureen Smith
- & John Varga
Penn Medicine Biobank
- Yuki Bradford
- , Scott Damrauer
- , Stephanie DerOhannessian
- , Theodore Drivas
- , Scott Dudek
- , Joseph Dunn
- , Ned Haubein
- , Renae Judy
- , Yi-An Ko
- , Colleen Morse Kripke
- , Meghan Livingstone
- , Nawar Naseer
- , Kyle P. Nerz
- , Afiya Poindexter
- , Marjorie Risman
- , Salma Santos
- , Giorgio Sirugo
- , Julia Stephanowski
- , Teo Tran
- , Fred Vadivieso
- , Anurag Verma
- , Shefali S. Verma
- , JoEllen Weaver
- , Colin Wollack
- , Daniel J. Rader
- & Marylyn Ritchie
Primary Open-Angle African American Glaucoma Genetics (POAAG) study
- Joan O’Brien
Regeneron–Mt. Sinai BioMe Biobank
- Erwin Bottinger
- & Judy Cho
UAB GWAS in African Americans with rheumatoid arthritis
- S. Louis Bridges
UAB Whole exome sequencing of systemic lupus erythematosus patients
- Robert Kimberly
University of California, Los Angeles
- Marlena Fejzo
University of Colorado School of Medicine
- Richard A. Spritz
University of Michigan Medical School
- James T. Elder
- , Rajan P. Nair
- , Philip Stuart
- & Lam C. Tsoi
University of Ottawa
- Robert Dent
- & Ruth McPherson
University of Pennsylvania
- Brendan Keating
University of Pittsburgh
- Erin E. Kershaw
- , Georgios Papachristou
- & David C. Whitcomb
University of Texas Health Science Center at Houston
- Shervin Assassi
- & Maureen D. Mayes
Vanderbilt University Medical Center
- Eric D. Austin

Contributions

Conceptualization: S. Balasubramanian, W.S., J.G.R., J.M., G.A., A.B. and H.M.K. Data generation: X.B., W.S., J.G.R., J.D.O, M.C., E.M., A.G., A. Mansfield, B.B., L.H., A. Marcketta, A.E.L. and A.L. Formal analysis: K.Y.S., S.C., S. Bao, C.Z., M.K., J. Backman, T.J., L.G. and A.H. Browser application: E.M., A.G., B.B., G.M., S.G. and M.N. Supervision: S. Balasubramanian and W.S. Writing (original draft): K.Y.S., X.B., S.C., S. Bao, M.K. and S. Balasubramanian. Writing (review and editing): K.Y.S., S. Balasubramanian, S. Bao, J.G.R., A.R.S., G.A., J.M., M.L.C., M.D.K., D.S., J.S., J. Backman, S.G., A.D.G., V.M.R., J.A., J. Bovijn, R.T.-C., P.K.-M., J.T., J.E., R.C., T.T. and C.Z. Project management: J.R.V.

Corresponding authors

Correspondence to William Salerno or Suganthi Balasubramanian.

Ethics declarations

Competing interests

K.Y.S., X.B., S.C., S. Bao, C.Z., M.K., J. Backman, T.J., E.M., G.M., A.G., A. Mansfield, B.B., S.G., L.H., A. Marcketta, A.E.L., L.G., A.H., M.D.K., D.S., J.S., J. Bovijn, S.G., A.D.G., V.M.R., A.L., J.R.V., M.C., T.T., J.D.O., A.R.S., M.L.C., M.N., A.B., G.A., J.M., J.G.R., W.S. and S. Balasubramanian are current employees and/or stockholders of Regeneron Genetics Center or Regeneron Pharmaceuticals. H.M.K. was a former employee of Regeneron Genetics Center.

Peer review

Peer review information

Nature thanks Luke O’Connor and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Mutation saturation survey in RGC-ME data.

Counts are based on variant-transcript pairs. *Methylation level: mean methylation values across tissues of >0.65, 0.2–0.65, <0.2 correspond to methylation level of 2, 1, 0, respectively. CpG sites with methylation level of 2 are highly methylated, sites with methylation level 0 or 1 are grouped as lowly methylated sites. Variants are subset to target exome regions. Only variants that passed QC are included in the number of all possible variants.

Extended Data Fig. 2 Box plots summarizing counts of variants observed in each unrelated sample in the RGC-ME dataset.

The lower bound, centre and upper bound of each box plot represents the 25, 50, and 75 percentiles of the distributions of counts. Points represent outliers, and whisker minima and maxima represent the smallest and largest points 1.5-times beyond the interquartile range. Individuals were assigned discrete ancestries based on overall fine-scale ancestry probabilities >50% and a total of 755,261 individuals were included. See Supplementary Table 1 for sample size breakdowns.

Extended Data Fig. 3 s_het distribution across gene essentiality and disease categories, and evidence that a larger sample improves the precision of s_het estimates.

a, Proportion of genes in each disease/essentiality annotation list (gathered from published literature and databases) that correspond to each s_het decile. b, Ratio of variances from MCMC sampling for s_het calculated on full dataset (824k) and randomly downsampled set of 60,000 individuals. The mean ratio around 0.05 suggests that gene-level variance from the full dataset is 20x smaller than variance for the same gene using the downsampled set. This is the case despite similar, or even higher, s_het mean estimates as shown by the colour bar. c, Mean (points) and 95% HPD (green bars) from MCMC sampling for s_het estimates of genes in each CDS length quantile for downsampled 60k (left) and full 822K (right) samples. HPD and variance in panels b,c were derived from MCMC sampling for 8,000 final iterations.

Extended Data Fig. 4 Comparisons between s_het computed with RGC-ME and other gene constraint metrics.

a, Receiver operator curves showing the discrimination between the “high” and “low” constraint genes using different constraint metrics: s_het-RGCME, s_het-ABC, and LOEUF for transcripts with values from all three methods. The highly constrained category comprised 1,476 haploinsufficient and autosomal dominant (including developmental-specific) genes and the comparison group was represented by 3,893 haplosufficient and genes with rare, biallelic pLOFs in RGC-ME. Dotted lines indicate specificity and sensitivity for s_het = 0.073. Sensitivity = TP / (TP+FN), Specificity = TN / (TN+FP). Spearman rank correlations between s_het from RGC-ME with these estimates are high (-0.768 with LOEUF, 0.778 with s_het-ABC). b, s_het vs LOEUF results for 973 genes with ≤5 expected LOFs. 5 genes are highlighted for short coding sequence length (CDS), that are constrained according to s_het (both mean>0.073 and lower bound>0.021) and unconstrained according to LOEUF. LOEUF is an alternate measure of pLOF-based gene constraint and ranges from 0 to around 2; a value < 0.35 is considered constrained. Error bars (grey lines) around s_het denote 95% highest posterior density.

Extended Data Fig. 5 Determination of the MAPS score threshold to define constrained regions and SAVs.

a–c, To systematically evaluate the deleteriousness of missense variants at various MTR scores, we compared MAPS scores across MTR (a) and splice score (b,c) thresholds. a, MTR was divided from 5% to 100% with step size of 5% (lower MTR percentile is more constrained). For each MTR percentile threshold, we divided variants into two sets: variants that pass the percentile threshold (red dots), and variants greater than the percentile (blue dots). The y axis represents MAPS score for each set of variants; the x-axis shows the tested MTR percentile threshold. Error bars represent standard deviation around the mean proportion of singletons per bin. A total of 68,636,473 variants were included in this analysis. b,c, Schematic of MAPS-derived filters for SpliceAI and MMSplice for all variants (b) and coding nonSS variants (c), respectively. A list of prediction score thresholds were set up for both SpliceAI and MMSplice, ranging from 0.1 to 0.99, with step size of 0.02. For each threshold, we divided variants into two sets: the set of variants that passed the threshold, represented as red dots, and the set of variants that failed the filter, represented as blue dots. The y axis represents MAPS score for each set of variants; the x-axis shows the tested score threshold. All MAPS scores were calculated based on the set of variants that pass QC metrics and have splice prediction scores in RGC-ME unrelated samples. Error bars represent standard deviation around the mean proportion of singletons per bin. For all variants (b), the total number of variants were n = 12,886,467 and 20,604,651 for SpliceAI and MMSplice, respectively. For coding nonSS variants (c), n = 5,468,779 and 4,013,820 for SpliceAI and MMSplice, respectively.

Extended Data Fig. 6 MTR-based segmentation and comparison of constrained regions in 2,832 genes that have constrained regions in both MTR and MPC (matched on transcript IDs).

a, Dashed line indicates the top-15-percentile exome-wide MTR threshold used for MTR-based segmentation (0.841). Blue and red regions represent MTR-constrained and unconstrained regions, respectively. b, Constrained MTR regions that overlap with MPC segments. 80% of MTR-constrained regions (represented by the combined area of red and yellow) overlap with MPC-constrained segments (yellow), whereas 40% of constrained MPC segments (represented by the combined area of green and yellow) are included in the intersection (yellow). The numbers indicated on the Venn diagram represent number of amino acids. Aside from 2,832 genes that had both MTR- and MPC-constrained regions, an additional 297 genes had only MPC-constrained segments, and 9,514 genes had only MTR-constrained regions (not included in the Venn diagram). c, Distribution of the fold change of the number of MTR-constrained regions and constrained MPC segments (\(\gamma \le 0.6\)) per gene. Dashed line indicates the median fold change of 3. Data shown is for the 2,832 genes that have constrained region annotations both in MTR and MPC segments.

Extended Data Fig. 7 Constrained MTR regions are enriched in ClinVar pathogenic variants and shorter than LOF-specific constrained genes.

a, Odds ratios (points) comparing enrichment of pathogenic versus benign ClinVar variants (solid line) and de novo variants in cases versus controls (dotted line) in MTR-constrained regions. Error bars represent 95% CIs (Fisher’s exact test). The total number of variants in each comparison are shown in b for the MTR-percentile cut-offs highlighted in yellow. In total, 5,818 case and 553 control variants were used in the de novo analysis and 7,944 pathogenic and 11,993 benign variants were used in the ClinVar analysis. b, Table of case and control variants in constrained and unconstrained regions to compute statistical tests for ClinVar (“CV”) and de novo (“DN”) variants across 5 different MTR-percentile thresholds (13-17%, yellow boxed region in a). Statistics include hypergeometric tests (p-value for enrichment of case and control variants in constrained regions) and odds ratios comparing enrichment of case vs control in constrained regions. The background rate of constrained regions among variants in the comparison set represented by “% constrained background”. c, CDS length comparison between 1,482 missense-specific constrained genes (defined where >15% of gene is in the top 15 percentile of MTR, based on one-sided binomial tests with π₀ = 0.15, p < 0.05, Bonferroni corrected) and 1,424 LOF-specific constrained genes with s_het score <0.073. Log10 CDS length for all 19,644 genes (canonical transcripts) shown in the grey curve. The missense-specific constrained genes had significantly shorter CDS length than LOF-specific constrained genes (p = 2.9 × 10⁻⁴⁰, two-sided Wilcoxon test).

Extended Data Fig. 8 Characteristics of predicted deleterious SAVs.

a, Summary of unique variant-transcript pairs of predicted deleterious SAVs in different functional categories using MAPS-defined prediction score thresholds for SpliceAI and MMSplice. Percents are computed out of total variants in each effect class. b, Distribution of different functional categories of predicted deleterious SAVs. (HC: High confidence, LC: Low confidence. These annotation tags are derived from LOFTEE). c, Empirical validation of MAPS-predicted deleterious coding nonSS SAVs: enrichment of predicted coding nonSS deleterious SAVs in experimentally validated SDVs compared to non-SDVs. Odds ratios (points) were derived using two-sided Fisher’s exact test and error bars show 95% CIs. A total of n = 17,395 variants, of which 147 SAVs were validated SDVs, were included. d,e, Fraction of predicted deleterious SAVs (d) and coding nonSS SAVs (e) that were validated as SDVs by any of the three splice reporter assays.

Extended Data Fig. 9 ClinVar variant counts.

a, Counts of autosomal pathogenic ClinVar high-confidence variants (two stars or more) observed in large-scale exome sequencing studies, including RGC-ME, gnomAD (exomes, v2.1.1), and ExAC, are indicated on the top of each bar. The left axis indicates the total coverage of ClinVar pathogenic variants represented in each dataset. b, Bars and points both depict the mean per cent difference of per-individual counts in ClinVar categories (pathogenic 0+, 1+, 2+, VUS+CI) across continental ancestries using all unrelated samples, with respect to EUR (e.g. [count_AFR – count_EUR]/count_AFR × 100). CV0, 1, and 2 refer to ClinVar pathogenic star rating 0+, 1+, and 2+ categories, respectively. VUS/CI combines variants annotated as “variants of unknown significance” and “conflicting interpretations”. All per-individual counts in non-EUR were significantly different compared to counts in EUR (e.g. per-individual counts of ClinVar 2+ variants in AFR were 0.576 [0.567, 0.585] lower than those in EUR) except for ClinVar 0+ counts in IAM compared with EUR (t-tests with Bonferroni correction). Error bars show 95% CI of mean per cent difference from t-test. c, Per-individual count of VUSs and conflicting information (CI, left); and pathogenic variants (right) in RGC-ME for all unrelated samples. The lower bound, centre and upper bound of each box plot represents the 25, 50, and 75 percentiles of the distributions of counts. Points represent outliers, and whisker minima and maxima represent the smallest and largest points 1.5-times beyond the interquartile range. For b,c, a total of 749,584 unrelated individuals were included across 4 ancestries and individuals were assigned discrete ancestries based on overall FSA probabilities >50%; see Supplementary Table 1a for sample size breakdowns. The x-axis indicates ClinVar star rating which is a measure of the confidence of the annotation (pathogenic/benign).

Extended Data Table 1 Sample breakdown of individuals with rare biallelic pLOF variants by ancestry

Full size table

Supplementary information

Supplementary Information

This Supplementary information file contains the following. Description of Supplementary Tables 1–14. Supplementary Tables 1–6 and 11 are provided as separate data Excel tables. Supplementary Tables 7–10 and 12–14 are embedded within the Supplementary Information document. Supplementary Methods and descriptions of Supplementary Analyses. Supplementary Figures 1–7. Supplementary References.

Reporting Summary

Supplementary Table 1

This table includes: sample subsets of RGC-ME used in different analyses; full breakdown of sample counts in fine-scale ancestry groups used in Fig. 1 and for the browser; and sample sizes and collaborator details for each dataset in RGC-ME. See main Supplementary Information PDF for full legend.

Supplementary Table 2

s_het values for 16,710 genes and other annotations, including additional annotations, LOEUF scores from gnomAD, minor allele frequency and coding sequence length. See main Supplementary Information PDF for full legend.

Supplementary Table 3

List of continuous segments of missense constrained regions found in 12,349 genes (canonical transcripts), based on the top-15-percentile threshold of MTR values. See main Supplementary Information PDF for full legend.

Supplementary Table 4

Jaccard index analysis between the MTR-constrained regions and features from UniProt (release 2022_05). See main Supplementary Information PDF for full legend.

Supplementary Table 5

List of genes with significant proportion of CDS in the top 1, 5, 10, 15 and 20 percentile of exome-wide MTR missense constraint scores based on one-sided binomial tests. See main Supplementary Information PDF for full legend.

Supplementary Table 6

List of 4,848 genes with rare (alternate allele frequency < 1%) biallelic pLOF variants (homozygous alternate and compound heterozygous) reported for the entire RGC-ME dataset including related individuals. See main Supplementary Information PDF for full legend.

Supplementary Table 11

List of highly differentiated variants (F_ST > 0.15).

Peer Review File

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Sun, K.Y., Bai, X., Chen, S. et al. A deep catalogue of protein-coding variation in 983,578 individuals. Nature 631, 583–592 (2024). https://doi.org/10.1038/s41586-024-07556-0

Download citation

Received: 19 June 2023
Accepted: 10 May 2024
Published: 20 May 2024
Issue Date: 18 July 2024
DOI: https://doi.org/10.1038/s41586-024-07556-0

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Subjects

Abstract

Similar content being viewed by others

Main

Survey of variation in the RGC-ME dataset

Constrained genes

Constrained coding regions

Understanding ‘human knockouts’

Annotation of splice-affecting variants

Clinical utility of rare variants

Discussion

Reporting summary

Data availability

Code availability

References

Acknowledgements

Author information

Authors and Affiliations

Consortia

Regeneron Genetics Center

RGC Management and Leadership Team

Sequencing and Lab Operations

Clinical Informatics

Genome Informatics and Data Engineering

Analytical Genetics and Data Science

Therapeutic Area Genetics

Research Program Management and Strategic Initiatives

Senior Partnerships and Business Operations

Business Operations and Administrative Coordinators

RGC-ME Cohort Partners

Accelerated Cures

African Descent and Glaucoma Evaluation Study (ADAGES) III

Age-related macular degeneration in the Amish

Albert Einstein College of Medicine

Amish Connectome Project

Amish Research Clinic

The Australia and New Zealand MS Genetics Consortium

Center for Non-Communicable Diseases (CNCD)

Cincinnati Children’s Hospital

Columbia University

Dallas Heart Study

Diabetic Retinopathy Clinical Research (DRCR) Retina Network

Duke University

Flinders University of South Australia

Indiana Biobank

Indiana University School of Medicine

Kaiser Permanente

Mayo Clinic

Mexico City Prospective Study (MCPS)

MyCode-DiscovEHR Geisinger Health System Biobank

National Institute of Mental Health

Northwestern University

Penn Medicine Biobank

Primary Open-Angle African American Glaucoma Genetics (POAAG) study

Regeneron–Mt. Sinai BioMe Biobank

UAB GWAS in African Americans with rheumatoid arthritis

UAB Whole exome sequencing of systemic lupus erythematosus patients

University of California, Los Angeles

University of Colorado School of Medicine

University of Michigan Medical School

University of Ottawa

University of Pennsylvania

University of Pittsburgh

University of Texas Health Science Center at Houston

Vanderbilt University Medical Center