Skip to main content

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

Exome-wide evaluation of rare coding variants using electronic health records identifies new gene–phenotype associations

Abstract

The clinical impact of rare loss-of-function variants has yet to be determined for most genes. Integration of DNA sequencing data with electronic health records (EHRs) could enhance our understanding of the contribution of rare genetic variation to human disease1. By leveraging 10,900 whole-exome sequences linked to EHR data in the Penn Medicine Biobank, we addressed the association of the cumulative effects of rare predicted loss-of-function variants for each individual gene on human disease on an exome-wide scale, as assessed using a set of diverse EHR phenotypes. After discovering 97 genes with exome-by-phenome-wide significant phenotype associations (P < 10−6), we replicated 26 of these in the Penn Medicine Biobank, as well as in three other medical biobanks and the population-based UK Biobank. Of these 26 genes, five had associations that have been previously reported and represented positive controls, whereas 21 had phenotype associations not previously reported, among which were genes implicated in glaucoma, aortic ectasia, diabetes mellitus, muscular dystrophy and hearing loss. These findings show the value of aggregating rare predicted loss-of-function variants into ‘gene burdens’ for identifying new gene–disease associations using EHR phenotypes in a medical biobank. We suggest that application of this approach to even larger numbers of individuals will provide the statistical power required to uncover unexplored relationships between rare genetic variation and disease phenotypes.

This is a preview of subscription content, access via your institution

Relevant articles

Open Access articles citing this article.

Access options

Buy article

Get time limited or full article access on ReadCube.

$32.00

All prices are NET prices.

Fig. 1: Flow chart for exome-by-phenome-wide association analysis using electronic health record phenotypes.
Fig. 2: ExoPheWAS plot exhibits the landscape of gene–phenotype associations across the exome and phenome in the PMBB.

Data availability

All summary statistics for significant gene–phenotype associations from the discovery phase in the PMBB, as well as significant replications from each replication cohort are fully detailed in Supplementary Tables 116. Data for the individual rare pLOF and missense variants in significant genes that were used for gene burden analyses in the PMBB discovery cohort are also included in Supplementary Tables 23 and 24. In addition, a list of all of the single variants that were used for replication analyses across all the cohorts are provided in Supplementary Table 25. Each variant in Supplementary Tables 2325 is annotated with information regarding genomic location, variant effect, amino acid change, REVEL score (for missense) and MAF in gnomAD, as well as in the PMBB discovery cohort. Additionally, up-to-date summary data for genetic variants captured using WES in the PMBB can be accessed via the PMBB Genome Browser (https://pmbb.med.upenn.edu/allele-frequency/). Individual-level data are not publicly available due to research participant privacy concerns; however, requests from accredited researchers for access to individual-level data relevant to this manuscript can be made by contacting the corresponding author. Additionally, public expression datasets were obtained from the OTDB (https://genome.uiowa.edu/otdb/), Glaucoma Discovery Platform (http://glaucomadb.jax.org/glaucoma/), and NCBI GEO (https://www.ncbi.nlm.nih.gov/geo/). From NCBI GEO, we interrogated 11 different microarray and RNA-seq datasets of human fibroblasts from various tissues treated with TGF-β (GSE1724, GSE65069, GSE64192, GSE39394, GSE79621, GSE68164, GSE97833, GSE97823, GSE135065, GSE125519 and GSE40266), as well as microarray data from muscle biopsies in participants with tibial muscular dystrophy (GSE42806).

References

  1. Dewey, F. E. et al. Distribution and clinical impact of functional variants in 50,726 whole-exome sequences from the DiscovEHR study. Science 354, https://doi.org/10.1126/science.aaf6814 (2016).

  2. Stessman, H. A., Bernier, R. & Eichler, E. E. A genotype-first approach to defining the subtypes of a complex disease. Cell 156, 872–877 (2014).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  3. Bush, W. S., Oetjens, M. T. & Crawford, D. C. Unravelling the human genome–phenome relationship using phenome-wide association studies. Nat. Rev. Genet. 17, 129–145 (2016).

    Article  CAS  PubMed  Google Scholar 

  4. Verma, A. et al. Human-disease phenotype map derived from PheWAS across 38,682 individuals. Am. J. Hum. Genet. 104, 55–64 (2019).

    Article  CAS  PubMed  Google Scholar 

  5. Zhang, X., Basile, A. O., Pendergrass, S. A. & Ritchie, M. D. Real-world scenarios in rare-variant association analysis: the impact of imbalance and sample size on the power in silico. BMC Bioinformatics 20, 46 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  6. Lee, S., Abecasis, G. R., Boehnke, M. & Lin, X. Rare-variant association analysis: study designs and statistical tests. Am. J. Hum. Genet. 95, 5–23 (2014).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  7. Park, J. et al. A genome-first approach to aggregating rare genetic variants in LMNA for association with electronic health record phenotypes. Genet. Med. https://doi.org/10.1038/s41436-019-0625-8 (2019).

  8. Haggerty, C. M. et al. Genomics-first evaluation of heart disease associated with titin-truncating variants. Circulation 140, 42–54 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  9. Guo, M. H., Plummer, L., Chan, Y. M., Hirschhorn, J. N. & Lippincott, M. F. Burden testing of rare variants identified through exome sequencing via publicly available control data. Am. J. Hum. Genet. 103, 522–534 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  10. Ioannidis, N. M. et al. REVEL: an ensemble method for predicting the pathogenicity of rare missense variants. Am. J. Hum. Genet. 99, 877–885 (2016).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  11. Ciesielski, T. H. et al. Diverse convergent evidence in the genetic analysis of complex disease: coordinating omic, informatic and experimental evidence to better identify and validate risk factors. BioData Min. 7, 10 (2014).

    Article  PubMed  PubMed Central  Google Scholar 

  12. Casals, T. et al. Bronchiectasis in adult patients: an expression of heterozygosity for CFTR gene mutations? Clin. Genet. 65, 490–495 (2004).

    Article  CAS  PubMed  Google Scholar 

  13. Haufroid, V. & Hantson, P. CYP2D6 genetic polymorphisms and their relevance for poisoning due to amfetamines, opioid analgesics and antidepressants. Clin. Toxicol. 53, 501–510 (2015).

    Article  CAS  Google Scholar 

  14. Stoetzel, C. et al. BBS10 encodes a vertebrate-specific chaperonin-like protein and is a major BBS locus. Nat. Genet. 38, 521–524 (2006).

    Article  CAS  PubMed  Google Scholar 

  15. Consortium, G. T. The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 45, 580–585 (2013).

    Article  CAS  Google Scholar 

  16. Elbedour, K., Zucker, N., Zalzstein, E., Barki, Y. & Carmi, R. Cardiac abnormalities in the Bardet–Biedl syndrome: echocardiographic studies of 22 patients. Am. J. Med. Genet. 52, 164–169 (1994).

    Article  CAS  PubMed  Google Scholar 

  17. Ji, H. L. et al. δENaC: a novel divergent amiloride-inhibitable sodium channel. Am. J. Physiol. Lung Cell. Mol. Physiol. 303, L1013–L1026 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  18. Battaglia, A. Del 1p36 syndrome: a newly emerging clinical entity. Brain Dev. 27, 358–361 (2005).

    Article  PubMed  Google Scholar 

  19. Gronich, N., Kumar, A., Zhang, Y., Efimov, I. R. & Soldatov, N. M. Molecular remodeling of ion channels, exchangers and pumps in atrial and ventricular myocytes in ischemic cardiomyopathy. Channels 4, 101–107 (2010).

    Article  CAS  PubMed  Google Scholar 

  20. Bowl, M. R. et al. A large scale hearing loss screen reveals an extensive unexplored genetic landscape for auditory dysfunction. Nat. Commun. 8, 886 (2017).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  21. Ingham, N. J. et al. Mouse screen reveals multiple new genes underlying mouse and human hearing loss. PLoS Biol. 17, e3000194 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  22. Liu, H. et al. Characterization of transcriptomes of cochlear inner and outer hair cells. J. Neurosci. 34, 11085–11095 (2014).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  23. Gilling, C. E. & Carlson, K. A. The effect of OTK18 upregulation in U937 cells on neuronal survival. In Vitro Cell. Dev. Biol. Anim. 45, 243–251 (2009).

    Article  CAS  PubMed  Google Scholar 

  24. Cacciottolo, M. et al. Muscular dystrophy with marked dysferlin deficiency is consistently caused by primary dysferlin gene mutations. Eur. J. Hum. Genet. 19, 974–980 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  25. Bonventre, J. A. et al. Fer1l6 is essential for the development of vertebrate muscle tissue in zebrafish. Mol. Biol. Cell 30, 293–301 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  26. Burgess, R. W. et al. Evidence for a conserved function in synapse formation reveals Phr1 as a candidate gene for respiratory failure in newborn mice. Mol. Cell. Biol. 24, 1096–1105 (2004).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  27. Wan, H. I. et al. Highwire regulates synaptic growth in Drosophila. Neuron 26, 313–329 (2000).

    Article  CAS  PubMed  Google Scholar 

  28. Zhen, M., Huang, X., Bamber, B. & Jin, Y. Regulation of presynaptic terminal organization by C. elegans RPM-1, a putative guanine nucleotide exchanger with a RING-H2 finger domain. Neuron 26, 331–343 (2000).

    Article  CAS  PubMed  Google Scholar 

  29. Laizure, S. C., Herring, V., Hu, Z., Witbrodt, K. & Parker, R. B. The role of human carboxylesterases in drug metabolism: have we overlooked their importance? Pharmacotherapy 33, 210–222 (2013).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  30. Bergamaschi, D. et al. iASPP oncoprotein is a key inhibitor of p53 conserved from worm to human. Nat. Genet. 33, 162–167 (2003).

    Article  CAS  PubMed  Google Scholar 

  31. Howell, G. R. et al. Molecular clustering identifies complement and endothelin induction as early events in a mouse model of glaucoma. J. Clin. Invest. 121, 1429–1444 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  32. Wilson, A. M. et al. Inhibitor of apoptosis-stimulating protein of p53 (iASPP) is required for neuronal survival after axonal injury. PLoS ONE 9, e94175 (2014).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  33. Nickells, R. W. Apoptosis of retinal ganglion cells in glaucoma: an update of the molecular pathways involved in cell death. Surv. Ophthalmol. 43, S151–S161 (1999).

    Article  PubMed  Google Scholar 

  34. Snow, B. E. et al. GTPase activating specificity of RGS12 and binding specificity of an alternatively spliced PDZ (PSD-95/Dlg/ZO-1) domain. J. Biol. Chem. 273, 17749–17755 (1998).

    Article  CAS  PubMed  Google Scholar 

  35. Cui, S. et al. The antagonist of CXCR1 and CXCR2 protects db/db mice from metabolic diseases through modulating inflammation. Am. J. Physiol. Endocrinol. Metab. 317, E1205–E1217 (2019).

    Article  CAS  PubMed  Google Scholar 

  36. Mori, M. et al. Transcriptional regulation of the cartilage intermediate layer protein (CILP) gene. Biochem. Biophys. Res. Commun. 341, 121–127 (2006).

    Article  CAS  PubMed  Google Scholar 

  37. Zhang, C. L. et al. Cartilage intermediate layer protein-1 alleviates pressure overload-induced cardiac fibrosis via interfering TGF-β1 signaling. J. Mol. Cell. Cardiol. 116, 135–144 (2018).

    Article  CAS  PubMed  Google Scholar 

  38. Szklarczyk, D. et al. STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 47, D607–D613 (2019).

    CAS  PubMed  Google Scholar 

  39. Pinard, A., Jones, G. T. & Milewicz, D. M. Genetics of thoracic and abdominal aortic diseases. Circ. Res. 124, 588–606 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  40. Sirugo, G., Williams, S. M. & Tishkoff, S. A. The missing diversity in human genetic studies. Cell 177, 26–31 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  41. Fry, A. et al. Comparison of sociodemographic and health-related characteristics of UK Biobank participants with those of the general population. Am. J. Epidemiol. 186, 1026–1034 (2017).

    Article  PubMed  PubMed Central  Google Scholar 

  42. Cirulli, E. T. et al. Genome-wide rare-variant analysis for thousands of phenotypes in over 70,000 exomes from two cohorts. Nat. Commun. 11, 542 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  43. Zhao, Z. et al. UK Biobank whole-exome sequence binary phenome analysis with robust region-based rare-variant test. Am. J. Hum. Genet. 106, 3–12 (2020).

    Article  CAS  PubMed  Google Scholar 

  44. Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164 (2010).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  45. Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  46. Carroll, R. J., Bastarache, L. & Denny, J. C. R PheWAS: data analysis and plotting tools for phenome-wide association studies in the R environment. Bioinformatics 30, 2375–2376 (2014).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  47. Denny, J. C. et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat. Biotechnol. 31, 1102–1110 (2013).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  48. Price, A. L. et al. Pooled association tests for rare variants in exon-resequencing studies. Am. J. Hum. Genet. 86, 832–838 (2010).

    Article  PubMed  PubMed Central  Google Scholar 

  49. Gauderman, W. J., Morrison, J. M. & Morrison, W. G. J. QUANTO 1.1: a computer program for power and sample size calculations for genetic-epidemiology studies. http://hydra.usc.edu/gxe (2006).

  50. Edgar, R., Domrachev, M. & Lash, A. E. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 30, 207–210 (2002).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  51. Harrison, P. F., Pattison, A. D., Powell, D. R. & Beilharz, T. H. Topconfects: a package for confident effect sizes in differential expression analysis provides a more biologically useful ranked gene list. Genome Biol. 20, 67 (2019).

    Article  PubMed  PubMed Central  Google Scholar 

  52. Wagner, A. H. et al. Exon-level expression profiling of ocular tissues. Exp. Eye Res. 111, 105–111 (2013).

    Article  CAS  PubMed  Google Scholar 

  53. Libby, R. T. et al. Inherited glaucoma in DBA/2J mice: pertinent disease features for studying the neurodegeneration. Vis. Neurosci. 22, 637–648 (2005).

    Article  PubMed  Google Scholar 

  54. Howell, G. R., Walton, D. O., King, B. L., Libby, R. T. & John, S. W. Datgan, a reusable software system for facile interrogation and visualization of complex transcription profiling data. BMC Genomics 12, 429 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  55. Yang, W. et al. Generation of iPSCs as a pooled culture using magnetic activated cell sorting of newly reprogrammed cells. PLoS ONE 10, e0134995 (2015).

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  56. Chavali, V. R. M. et al. Dual SMAD inhibition and Wnt inhibition enable efficient and reproducible differentiations of induced pluripotent stem cells into retinal ganglion cells. Sci. Rep. 10, 11828 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  57. Verkuil, L. et al. SNP located in an AluJb repeat downstream of TMCO1, rs4657473, is protective for POAG in African Americans. Br. J. Ophthalmol. 103, 1530–1536 (2019).

    Article  PubMed  Google Scholar 

  58. Campbell-Thompson, M. et al. Network for pancreatic organ donors with diabetes (nPOD): developing a tissue biobank for type 1 diabetes. Diabetes Metab. Res. Rev. 28, 608–617 (2012).

    Article  PubMed  PubMed Central  Google Scholar 

  59. Wang, Y. J. et al. Single-cell transcriptomics of the human endocrine pancreas. Diabetes 65, 3028–3038 (2016).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  60. Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies and species. Nat. Biotechnol. 36, 411–420 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  61. McGinnis, C. S., Murrow, L. M. & Gartner, Z. J. DoubletFinder: doublet detection in single-cell RNA-sequencing data using artificial nearest neighbors. Cell Syst. 8, 329–337 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  62. Pliner, H. A., Shendure, J. & Trapnell, C. Supervised classification enables rapid annotation of cell atlases. Nat. Methods 16, 983–986 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  63. Baron, M. et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst. 3, 346–360 (2016).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  64. Schwartz, G. W. et al. TooManyCells identifies and visualizes relationships of single-cell clades. Nat. Methods 17, 405–413 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  65. Wang, T., Li, B., Nelson, C. E. & Nabavi, S. Comparative analysis of differential gene expression analysis tools for single-cell RNA-sequencing data. BMC Bioinformatics 20, 40 (2019).

    Article  PubMed  PubMed Central  Google Scholar 

Download references

Acknowledgements

The PMBB is funded by the Perelman School of Medicine at the University of Pennsylvania, a gift from the Smilow family, and the National Center for Advancing Translational Sciences of the National Institutes of Health under CTSA Award Number UL1TR001878. We thank D. Birtwell, H. Williams, P. Baumann and M. Risman for informatics support regarding the PMBB. We thank the staff of the Regeneron Genetics Center for whole-exome sequencing of DNA from PMBB participants. We thank S. Rathi for help with the real-time PCR experiment on iPSC-RGCs and J. He for help with iPSC-RGC cultures. We thank S. Dudek for assistance with the PMBB Genome Browser. Research reported in this paper was supported by grants from the National Human Genome Research Institute of the National Institutes of Health under award number F30HG010442 (to J.P.); the National Eye Institute of the National Institutes of Health under award number R21EY028273-01A1, BrightFocus Foundation, Lisa Dean Moseley Foundation and Research to Prevent Blindness, F.M. Kirby Foundation and The Paul and Evanina Bell Mackall Foundation Trust (to V.R.M.C); American Heart Association SFRN in Vascular Disease under award numbers 18SFRN33960114 and 18SFRN33960163 (to S.A.L. and A.D.) and National Institutes of Health under award number 1R01HL143359 (to Y.H.S. and S.A.L.); Sarnoff Cardiovascular Research Foundation (to S.A.); Institute for Translational Medicine and Therapeutics Transdisciplinary Program in Translational Medicine and Therapeutics (to R.L.K.)

Author information

Authors and Affiliations

Authors

Contributions

The study was conceived and designed by J.P., M.D.R and D.J.R. Association analyses for the study were conducted by J.P., A.M.L., X.Z., K.C., J.H.C., G.N., A.D., G.C., N.S.J., N.K., J.H.B, M.A.R.F., A.H.L. and A.E.J. Association data interpretation was performed by J.P., A.M.L., N.K., Y.B., A.V., J.C., S.M.D., T.L.E., A.E.J., R.D., M.D.R. and D.J.R. Chart review was conducted by J.P., S.A. and T.G.D. Experiments specific to PPP1R13L and glaucoma were performed by V.R.M.C. Single-cell RNA-seq of human pancreas was performed by M.F., A.N., K.H.K. and G.V. Single-cell RNA-seq of mouse and human aortae was performed by H.S., A.D., Y.L., C.Z., S.A.L. and Y.H.S. Data acquisition for association analyses was performed by J.P., X.Z., J.W., A.V., R.L.J., R.L.K., J.D.O., J.G.R., A.B., S.M.D., A.E.J., R.D., M.D.R. and D.J.R. The manuscript was written by J.P., M.D.R. and D.J.R., and revised by all authors.

Corresponding author

Correspondence to Daniel J. Rader.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Michael Basson and Kate Gao were the primary editors on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Distribution of number of carriers for rare predicted loss-of-function (pLOF) variants per gene in the Penn Medicine Biobank.

a, Histogram plot for the distribution of number of heterozygous carriers for rare (MAF ≤ 0.1%) pLOF variants per gene in the Penn Medicine Biobank’s (PMBB) exome sequenced cohort. The x-axis represents number of heterozygous pLOF carriers per gene in bin widths of 10, and the log-scaled y-axis represents the number of genes with the x-axis-specified number of heterozygous carriers. b, Histogram plot for the distribution of number of homozygous carriers for rare pLOF variants per gene in the PMBB’s exome sequenced cohort. The x-axis represents number of homozygous pLOF carriers per gene in bin widths of one, and the log-scaled y-axis represents the number of genes with the x-axis-specified number of homozygous carriers.

Extended Data Fig. 2 Power analyses for association of gene burdens with at least 25 heterozygous carriers for rare pLOF variants with phenotypes of various case counts.

Power analyses for association of gene burdens collapsing rare pLOF variants with 25 heterozygous carriers (allele frequency = 25/2N ≈ 0.001, where N = 2172 (AFR) + 8198 (EUR)) with phenotypes having various case counts. Phenotype case counts range from 20 to 6500 to reflect the range of case counts for phecodes in the Penn Medicine Biobank discovery cohort, and the power of the gene burden association with each phenotype as a function of odds ratio (OR=exp(beta)) is plotted on separate lines per the plot legend.

Extended Data Fig. 3 Quantile-quantile plot of gene burden testing results from discovery phase of exome-by-phenome-wide association studies in the Penn Medicine Biobank.

a, Quantile-quantile plot of p values from all exome-by-phenome-wide associations using gene burdens collapsing rare (MAF ≤ 0.1%) predicted loss-of-function (pLOF) variants per gene in the Penn Medicine Biobank (PMBB). The x-axis represents the expected –log10(p value) under the uniform distribution of p values. The y-axis represents the observed –log10(p value) from the discovery phase of the exome-by-phenome-wide gene burden association studies collapsing rare pLOF variants in the PMBB. Each point represents an association between one of 1518 gene burdens and one of 1000 phecodes via logistic regression. The solid line shows the relationship between the expected and observed p values under the uniform p value distribution. The dashed line represents the observed fit line between the 50th and 95th percentile of gene burden associations, and the slope of this line is λ∆95 = 1.558. b, AFR-specific QQ plot of p values from all exome-by-phenome-wide associations using gene burdens collapsing rare (MAF ≤ 0.1%) predicted loss-of-function (pLOF) variants per gene in the PMBB. Data is presented in a similar manner to panel A. The slope of the fitted line is the AFR-specific λ∆95 = 1.09. C) EUR-specific QQ plot of p values from all exome-by-phenome-wide associations using gene burdens collapsing rare (MAF ≤ 0.1%) predicted loss-of-function (pLOF) variants per gene in the PMBB. Data is presented in a similar manner to panel A. The slope of the fitted line is the EUR-specific λ∆95 = 1.251.

Extended Data Fig. 4 MYCBP2 is downregulated in tibial muscular dystrophy.

a, Comparison of MYCBP2 expression levels in human distal lower extremity muscles in tibial muscular dystrophy (TMD; N=6 independent muscle samples) versus healthy controls (N=5 muscle samples). Data is presented as mean transformed signal intensity, and error bars denote SEM. Transformed signal intensity values were obtained from GEO Series GSE42806, which are baseline-transformed and MAS5.0-normalized signal intensities, and individual values are plotted overlaying the bar plot. Statistical comparison was based on a moderated t-statistic, and p values were adjusted by Benjamini & Hochberg (FDR) correction. b, Comparison of MYCBP2 expression levels in each distal lower extremity muscle included in the comparison in Extended Data Fig. 4a. Data is presented as a bar plot showing mean fold change as compared to a single control sample, and individual values are plotted overlaying the bar plot. Fold changes were calculated based on inverse log-transformed signal intensity values from each lower extremity muscle, including extensor digitorum longus (N=2 independent TMD samples, 2 independent control samples), tibialis posterior (N=1 TMD sample, 1 control sample), soleus (N=1 TMD sample, 1 control sample), and tibialis anterior (N=2 TMD samples, 1 control sample).

Extended Data Fig. 5 Functional validation for the association between PPP1R13L and primary open-angle glaucoma.

a, Differential expression profile of Ppp1r13l transcript in mouse optic nerve head (ONH) with varying stages of intraocular pressure (IOP)-induced glaucoma. Data represent the fold change in Ppp1r13l expression between different stages of D2 mice (glaucoma, N=50 mice) and D2 Gpnmb+ samples (control, N=10 mice). b, Localization of PPP1R13L protein in the human retina. Shown is the distribution of PPP1R13L by immunohistochemical localization in the retina from normal 68-year-old donor eyes. Overlay of images from DAPI (blue; nuclei) and PPP1R13L (red) in adult human retinal layers are shown on the right. The left represents primary antibody control. Scale bars are shown in each image. The experiment was performed twice independently with consistent results. ONL, outer nuclear layer; OPL, outer plexiform layer; IPL, inner plexiform layer; GCL, ganglion cell layer. c, Relative expression of PPP1R13L transcript in response to oxidative stress in induced pluripotent stem cell-derived retinal ganglion cells (iPSC-RGCs). A two-tailed unpaired Student’s t test was used for statistical analysis, and significant p values are shown. Expression of PPP1R13L in iPSC-RGCs is shown under increasing concentrations of H2O2 treatment (N=3 independent experiments per condition). Plotted are the mean fold changes in comparison to no H2O2, error bars represent standard error of the mean (SEM), and individual values are plotted overlaying the bar plot.

Extended Data Fig. 6 Single-cell RNA-seq of human pancreatic cells shows that RGS12 is not differentially expressed in pancreatic exocrine and endocrine cells, but is reduced in type 1 diabetic peri-islet macrophages.

Comparison of RGS12 expression levels in type 1 diabetes (T1D) versus control in pancreatic beta (endocrine; N=2 T1D donors (410 cells), N=6 control donors (1573 cells)), endothelial (N=5 T1D donors (441 cells), N=6 control donors (166 cells)), stellate (exocrine; N=5 T1D donors (910 cells), N=6 control donors (356 cells)), and peri-islet immune (CD45+ macrophages; N=5 T1D donors (95 cells), N=4 control donors (40 cells)) cells based on single-cell RNA-seq. Differential expression of RGS12 in each cell type was determined by edgeR, which fits normalized expression data to a negative binomial model and uses an exact test with false discovery rate (FDR) control to determine differential expressed genes. Data is presented as bars representing mean normalized RGS12 expression and error bars representing standard error of the mean (SEM). Individual points are plotted overlaying their respective bar plots. Differential expression as determined by edgeR are displayed for each cell type as log2 fold change and p values adjusted by FDR correction.

Extended Data Fig. 7 CILP is expressed in aortic adventitial fibroblasts, and is downregulated in human fibroblasts in response to treatment with TGF-β.

a, t-SNE plot of aortic single cells in mice. Colors denote 6 cell types: smooth muscle cell (SMC), fibroblast, endothelial cell (EC), macrophage, stem cell, unknown. b, Relative expression of Cilp in all cells projected onto a t-SNE plot based on single-cell RNA-seq. The red arrows indicate where Cilp is expressed. c, t-SNE plot of aortic single cells in humans, with fibroblasts highlighted. d, Relative expression of CILP in all cells projected onto a t-SNE plot based on single-cell RNA-seq. The red box indicates where CILP is expressed. e,Volcano plot displaying differential expression of genes from meta-analysis of microarray and RNA-seq data for human fibroblasts treated with TGF-β (see Methods or the Reporting Summary for details about the datasets used). Meta-analysis of differential expression across the datasets was achieved using the Fisher’s combined probability test. The x-axis represents meta-analyzed log2(fold change), and the y-axis represents meta-analyzed –log10(p value). The top 1% of differentially expressed genes across all datasets are labeled in red (upregulation) or blue (downregulation).

Supplementary information

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Park, J., Lucas, A.M., Zhang, X. et al. Exome-wide evaluation of rare coding variants using electronic health records identifies new gene–phenotype associations. Nat Med 27, 66–72 (2021). https://doi.org/10.1038/s41591-020-1133-8

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1038/s41591-020-1133-8

This article is cited by

Search

Quick links

Nature Briefing

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing