Exome sequence analysis identifies rare coding variants associated with a machine learning-based marker for coronary artery disease

Abstract

Coronary artery disease (CAD) exists on a spectrum of disease represented by a combination of risk factors and pathogenic processes. An in silico score for CAD built using machine learning and clinical data in electronic health records captures disease progression, severity and underdiagnosis on this spectrum and could enhance genetic discovery efforts for CAD. Here we tested associations of rare and ultrarare coding variants with the in silico score for CAD in the UK Biobank, All of Us Research Program and BioMe Biobank. We identified associations in 17 genes; of these, 14 show at least moderate levels of prior genetic, biological and/or clinical support for CAD. We also observed an excess of ultrarare coding variants in 321 aggregated CAD genes, suggesting more ultrarare variant associations await discovery. These results expand our understanding of the genetic etiology of CAD and illustrate how digital markers can enhance genetic association investigations for complex diseases.

Fig. 1: Schematic of the study design.
Fig. 2: Evidence supporting the role of 17 genes associated with an ISCAD in CAD biology.

Data availability

Genetic association summary statistics are available on the GWAS Catalog (study accessions GCST90370243 and GCST90370244, both available at https://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90370001-GCST90371000/) and Zenodo (https://zenodo.org/records/11086022; ref. 55).

Code availability

All analysis code is available on Zenodo (https://zenodo.org/records/11086022; ref. 55).

References

  1. Roth, G. A. et al. Global burden of cardiovascular diseases and risk factors, 1990–2019. J. Am. Coll. Cardiol. 76, 2982–3021 (2020).

  2. Khera, A. V. & Kathiresan, S. Genetics of coronary artery disease: discovery, biology and clinical translation. Nat. Rev. Genet. 18, 331–344 (2017).

  3. Chen, Z. & Schunkert, H. Genetics of coronary artery disease in the post-GWAS era. J. Intern. Med. 290, 980–992 (2021).

  4. Aragam, K. G. et al. Discovery and systematic characterization of risk variants and genes for coronary artery disease in over a million participants. Nat. Genet. 54, 1803–1815 (2022).

  5. Tcheandjieu, C. et al. Large-scale genome-wide association study of coronary artery disease in genetically diverse populations. Nat. Med. 28, 1679–1692 (2022).

  6. Plenge, R. M., Scolnick, E. M. & Altshuler, D. Validating therapeutic targets through human genetics. Nat. Rev. Drug Discov. 12, 581–594 (2013).

  7. Plenge, R. M. Disciplined approach to drug discovery and early development. Sci. Transl. Med. 8, 349ps15 (2016).

  8. Szustakowski, J. D. et al. Advancing human genetics research and drug discovery through exome sequencing of the UK Biobank. Nat. Genet. 53, 942–948 (2021).

  9. Do, R. et al. Exome sequencing identifies rare LDLR and APOA5 alleles conferring risk for myocardial infarction. Nature 518, 102–106 (2015).

  10. Yao, K. et al. Exome sequencing identifies rare mutations of LDLR and QTRT1 conferring risk for early-onset coronary artery disease in Chinese. Natl Sci. Rev. 9, nwac102 (2022).

  11. Khera, A. V. et al. Gene sequencing identifies perturbation in nitric oxide signaling as a nonlipid molecular subtype of coronary artery disease. Circ. Genom. Precis. Med. 15, e003598 (2022).

  12. Martin, S. S. et al. 2024 heart disease and stroke statistics: a report of US and global data from the American Heart Association. Circulation 149, e347–e913 (2024).

  13. Maddox, T. M. et al. Nonobstructive coronary artery disease and risk of myocardial infarction. JAMA 312, 1754–1763 (2014).

  14. Park, D. W. et al. Extent, location, and clinical significance of non-infarct-related coronary artery disease among patients with ST-elevation myocardial infarction. JAMA 312, 2019–2027 (2014).

  15. Forrest, I. S. et al. Machine learning-based marker for coronary artery disease: derivation and validation in two longitudinal cohorts. Lancet 401, 215–225 (2023).

  16. Petrazzini, B. O. et al. Coronary risk estimation based on clinical data in electronic health records. J. Am. Coll. Cardiol. 79, 1155–1166 (2022).

  17. Mbatchou, J. et al. Computationally efficient whole-genome regression for quantitative and binary traits. Nat. Genet. 53, 1097–1103 (2021).

  18. Willer, C. J., Li, Y. & Abecasis, G. R. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics 26, 2190–2191 (2010).

  19. Sveinbjornsson, G. et al. Weighting sequence variants based on their annotation increases power of whole-genome association studies. Nat. Genet. 48, 314–317 (2016).

  20. Zhou, W. et al. Efficiently controlling for case–control imbalance and sample relatedness in large-scale genetic association studies. Nat. Genet. 50, 1335–1341 (2018).

  21. Loh, P. R., Kichaev, G., Gazal, S., Schoech, A. P. & Price, A. L. Mixed-model association for biobank-scale datasets. Nat. Genet. 50, 906–908 (2018).

  22. Nikpay, M. et al. A comprehensive 1,000 genomes–based genome-wide association meta-analysis of coronary artery disease. Nat. Genet. 47, 1121–1130 (2015).

  23. Tarugi, P. et al. Molecular diagnosis of hypobetalipoproteinemia: an ENID review. Atherosclerosis 195, e19–e27 (2007).

  24. Ference, B. A. et al. Variation in PCSK9 and HMGCR and risk of cardiovascular disease and diabetes. N. Engl. J. Med. 375, 2144–2153 (2016).

  25. Schmidt, A. F. et al. PCSK9 genetic variants and risk of type 2 diabetes: a mendelian randomisation study. Lancet Diabetes Endocrinol. 5, 97–105 (2017).

  26. Lotta, L. A. et al. Association between low-density lipoprotein cholesterol–lowering genetic variants and risk of type 2 diabetes: a meta-analysis. JAMA 316, 1383–1391 (2016).

  27. Benn, M., Nordestgaard, B. G., Grande, P., Schnohr, P. & Tybjærg-Hansen, A. PCSK9R46L, low-density lipoprotein cholesterol levels, and risk of ischemic heart disease: 3 independent studies and meta-analyses. J. Am. Coll. Cardiol. 55, 2833–2842 (2010).

  28. Ghoussaini, M. et al. Open Targets Genetics: systematic identification of trait-associated genes using large-scale genetics and functional genomics. Nucleic Acids Res. 49, D1311–D1320 (2021).

  29. Thomas, D. G., Wei, Y. & Tall, A. R. Lipid and metabolic syndrome traits in coronary artery disease: a Mendelian randomization study. J. Lipid Res. 62, 100044 (2021).

  30. Liberzon, A. et al. The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Syst. 1, 417–425 (2015).

  31. Schrodi, S. J. The impact of diagnostic code misclassification on optimizing the experimental design of genetic association studies. J. Healthc. Eng. 2017, 7653071 (2017).

  32. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).

  33. Klarin, D. et al. Genetic analysis in UK Biobank links insulin resistance and transendothelial migration pathways to coronary artery disease. Nat. Genet. 49, 1392–1397 (2017).

  34. Honigberg, M. C. et al. Premature menopause, clonal hematopoiesis, and coronary artery disease in postmenopausal women. Circulation 143, 410–423 (2021).

  35. Khera, A. V. et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat. Genet. 50, 1219–1224 (2018).

  36. Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4, 7 (2015).

  37. Kursa, M. B. & Rudnicki, W. R. Feature selection with the Boruta package. J. Stat. Softw. 36, 1–13 (2010).

  38. Rajkomar, A., Dean, J. & Kohane, I. Machine learning in medicine. N. Engl. J. Med. 380, 1347–1358 (2019).

  39. Liaw, A. & Wiener, M. Classification and regression by randomForest. R News 2, 18–22 (2002).

  40. Kuhn, M. Building predictive models in R using the caret package. J. Stat. Softw. 28, 1–26 (2008).

  41. Grün, B., Kosmidis, I. & Zeileis, A. Extended beta regression in R: shaken, stirred, mixed, and partitioned. J. Stat. Softw. 48, 1–25 (2012).

  42. McCaw, Z. R., Lane, J. M., Saxena, R., Redline, S. & Lin, X. Operating characteristics of the rank-based inverse normal transformation for quantitative trait analysis in genome-wide association studies. Biometrics 76, 1262–1272 (2020).

  43. Wojcik, G. L. et al. Genetic analyses of diverse populations improves discovery for complex traits. Nature 570, 514–518 (2019).

  44. Adzhubei, I. A. et al. A method and server for predicting damaging missense mutations. Nat. Methods 7, 248–249 (2010).

  45. Ng, P. C. & Henikoff, S. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res. 31, 3812–3814 (2003).

  46. Chun, S. & Fay, J. C. Identification of deleterious mutations within three human genomes. Genome Res. 19, 1553–1561 (2009).

  47. Schwarz, J. M., Cooper, D. N., Schuelke, M. & Seelow, D. MutationTaster2: mutation prediction for the deep-sequencing age. Nat. Methods 11, 361–362 (2014).

  48. Wu, M. C. et al. Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 89, 82–93 (2011).

  49. Liu, Y. et al. ACAT: a fast and powerful P value combination method for rare-variant analysis in sequencing studies. Am. J. Hum. Genet. 104, 410–421 (2019).

  50. McLaren, W. et al. The Ensembl Variant Effect Predictor. Genome Biol. 17, 122 (2016).

  51. Mendez, D. et al. ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res. 47, D930–D940 (2019).

  52. Online Mendelian Inheritance in Man, OMIM®. McKusick-Nathans Institute of Genetic Medicine. (Johns Hopkins University, 2022); https://omim.org/

  53. R Core Team. R: a language and environment for statistical computing. (R Foundation for Statistical Computing, 2019); https://www.r-project.org/

  54. Robin, X. et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinform. 12, 77 (2011).

  55. Petrazzini, B. O. et al. Exome sequence analysis identifies rare coding variants associated with a machine learning-based marker for coronary artery disease. Zenodo https://doi.org/10.5281/zenodo.11086022 (2024).

Acknowledgements

S.N.G. is supported by VA MERIT grant 1I01CX002560. R.S.R. is supported by the National Institute on Aging of the National Institutes of Health (NIH) (R01 AG061186-0) and the National Heart, Lung, and Blood Institute of the NIH (R01HL157439-01). R.D. is supported by the National Institute of General Medical Sciences of the NIH (R35-GM124836) and the National Heart, Lung, and Blood Institute of the NIH (R01-HL139865 and R01-HL155915). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

Contributions

B.O.P., I.S.F. and R.D. conceived and designed the study. B.O.P. performed statistical analyses. B.O.P., I.S.F., G.R., H.M.T.V., C.M.-L., A.D., R.C., J.K.P., K.G., S.N.G., W.A.M., R.S.R., D.M.J. and R.D. provided administrative, technical and material support. B.O.P. and R.D. drafted the manuscript. R.D. supervised the study. B.O.P. and R.D. had access to all of the data in the study and take responsibility for the integrity of the data and accuracy of the analysis.

Corresponding author

Correspondence to Ron Do.

Ethics declarations

Competing interests

R.D. reports being a scientific cofounder, consultant and equity holder for Pensieve Health (pending) and being a consultant for Variant Bio, all not related to this study. R.S.R. reports research funding to his institution from Amgen, Arrowhead, Eli Lilly, Merck, NIH, Novartis, Novo Nordisk, Regeneron and 89Bio; consulting fees from Amgen, Avilar, CRISPR Therapeutics, Editas, Eli Lilly, Lipigon, New Amsterdam, Novartis, Precision Biosciences, Regeneron, UltraGenyx and Verve Therapeutics; nonpromotional honoraria from Meda Pharma; royalties from Wolters Kluwer (UpToDate); and stock holding in MediMergent, LLC. He reports patent applications on methods and systems for biocellular marker detection and diagnosis using a microfluidic profiling device (EFS ID: 32278349; application no. PCT/US2019/026364; provisional), all unrelated to this study. The other authors declare no competing interests.

Peer review

Peer review information

Nature Genetics thanks Matthias Heinig, Samuli Ripatti and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Area under the receiver operating characteristic curves on the testing sets used to evaluate the in silico score for coronary artery disease (ISCAD).

We trained and tested 100 models with independent random sampling. Receiver operating characteristic curves are shown for the current ISCAD model trained on the UK Biobank (a), the All of Us biobank (b) and the BioMe biobank (c). AUC: area under the receiver operating characteristic curve.
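As a rough illustration of the AUC metric in this legend, the sketch below computes the area under the ROC curve as the probability that a randomly chosen case scores higher than a randomly chosen control, then summarizes that value over repeated random test subsets. It is only a sketch under stated assumptions: the function names, the toy scores and the 50% test fraction are hypothetical, and the code re-evaluates one fixed set of scores rather than retraining 100 models as the authors did.

```python
import random
import statistics


def auc(case_scores, control_scores):
    """AUC as the fraction of case/control pairs ranked correctly (ties count half)."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def repeated_test_aucs(cases, controls, n_repeats=100, test_fraction=0.5, seed=0):
    """Evaluate AUC on n_repeats independent random test subsets (no retraining)."""
    rng = random.Random(seed)
    n_cases = max(1, int(len(cases) * test_fraction))
    n_controls = max(1, int(len(controls) * test_fraction))
    results = []
    for _ in range(n_repeats):
        test_cases = rng.sample(cases, n_cases)
        test_controls = rng.sample(controls, n_controls)
        results.append(auc(test_cases, test_controls))
    return results


# Hypothetical scores in [0, 1]; not data from the study.
toy_cases = [0.91, 0.78, 0.66, 0.59, 0.83, 0.47]
toy_controls = [0.12, 0.35, 0.52, 0.08, 0.61, 0.27]
aucs = repeated_test_aucs(toy_cases, toy_controls)
print("median AUC over", len(aucs), "test subsets:", statistics.median(aucs))
```

The pairwise definition is quadratic in sample size; at biobank scale a rank-based computation would be the more practical choice.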

Extended Data Fig. 2 Distribution of the in silico score for coronary artery disease (ISCAD) in cases and controls.

We trained and tested 100 models with independent random sampling. Score distributions are shown separately for CAD cases and controls for the current ISCAD model trained on the UK Biobank (a), the All of Us biobank (b) and the BioMe biobank (c). Vertical dotted lines represent the median value of each distribution. ISCAD: in silico score for coronary artery disease.

Extended Data Fig. 3 Manhattan plot of rare coding variant association meta-analysis.

We tested 2,738,849 rare missense and protein-truncating variants from 604,915 individuals in the UK Biobank, the All of Us Research Program and the BioMe Biobank. The dotted horizontal line represents an exome-wide significance threshold of P = 4.3 × 10^−7. Two-sided P values, shown on a base-10 logarithmic scale, were obtained from a fixed-effect inverse-variance weighted meta-analysis. Italicized text indicates gene names.
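As a rough illustration of the fixed-effect inverse-variance weighted meta-analysis named in this legend, the sketch below pools per-cohort effect estimates into a single estimate with a two-sided P value and compares it with the stated exome-wide threshold. The function name, the three effect sizes and their standard errors are hypothetical and are not taken from the study; the authors' own analysis code is the Zenodo deposit, which this sketch does not reproduce.

```python
import math

# Threshold quoted in the figure legend.
EXOME_WIDE_P = 4.3e-7


def ivw_fixed_effect(betas, ses):
    """Pool effect estimates using weights equal to 1 / standard error squared."""
    if not betas or len(betas) != len(ses):
        raise ValueError("betas and ses must be non-empty and of equal length")
    weights = [1.0 / (se * se) for se in ses]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    # Two-sided P value under a standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z, p_value


# Hypothetical per-cohort estimates for one variant (three cohorts).
beta, se, z, p = ivw_fixed_effect([0.30, 0.25, 0.40], [0.08, 0.10, 0.20])
print("pooled beta:", beta, "SE:", se)
print("-log10(P):", -math.log10(p), "exome-wide significant:", p < EXOME_WIDE_P)
```

Because each weight is the inverse of the squared standard error, the cohort with the most precise estimate dominates the pooled result. This is why a large cohort can drive a meta-analysis signal.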

Extended Data Table 1 Baseline characteristics of the study population
Extended Data Table 2 Performance metrics for the different machine learning models defining the in silico score of coronary artery disease (ISCAD)
Extended Data Table 3 Rare coding variants associated with arterial stiffness index, left ventricular ejection fraction, myocardial infarction, heart failure and/or arrhythmia
Extended Data Table 4 Gene-level aggregates of ultra-rare deleterious coding variants associated with arterial stiffness index, left ventricular ejection fraction, myocardial infarction, heart failure and/or arrhythmia

Supplementary information

Supplementary Information

Supplementary Results, Methods, Note, Figs. 1–14 and Tables 2–4, 9–10, 13–17, 21–22 and 26–27.

Reporting Summary

Supplementary Tables

Supplementary Tables 1, 5–8, 11–12, 18–20, 23–25 and 28.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Petrazzini, B.O., Forrest, I.S., Rocheleau, G. et al. Exome sequence analysis identifies rare coding variants associated with a machine learning-based marker for coronary artery disease. Nat Genet (2024). https://doi.org/10.1038/s41588-024-01791-x

  • DOI: https://doi.org/10.1038/s41588-024-01791-x
