Electronic health records and polygenic risk scores for predicting disease risk


Accurate prediction of disease risk based on the genetic make-up of an individual is essential for effective prevention and personalized treatment. Nevertheless, to date, individual genetic variants from genome-wide association studies have achieved only moderate prediction of disease risk. The aggregation of genetic variants under a polygenic model shows promising improvements in prediction accuracies. Increasingly, electronic health records (EHRs) are being linked to patient genetic data in biobanks, which provides new opportunities for developing and applying polygenic risk scores in the clinic, to systematically examine and evaluate patient susceptibilities to disease. However, the heterogeneous nature of EHR data brings forth many practical challenges along every step of designing and implementing risk prediction strategies. In this Review, we present the unique considerations for using genotype and phenotype data from biobank-linked EHRs for polygenic risk prediction.


Fig. 1: Data integration for a PRS in EHRs.
Fig. 2: Extracting phenotypes from EHR data for deriving PRSs.
Fig. 3: Risk prediction in EHR data using the PRS with other clinical factors.




This work was supported by National Institutes of Health grants LM010098 and AI116794.

Author information




The authors contributed equally to all aspects of the article.

Corresponding author

Correspondence to Jason H. Moore.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information


Related links

Michigan Genomics Initiative: https://precisionhealth.umich.edu/michigangenomics/

Office of the National Coordinator for Health Information Technology: https://dashboard.healthit.gov/quickstats/pages/physician-ehr-adoption-trends.php

Phenotype KnowledgeBase (PheKB): https://www.phekb.org/

Precision Medicine Initiative: https://ghr.nlm.nih.gov/primer/precisionmedicine/initiative


Genome-wide association studies

(GWAS). Studies in which associations between genetic variation and a phenotype or trait of interest are identified by genotyping cases (for example, diseased individuals) and controls (for example, healthy individuals) for a set of genetic variants that capture variation across the entire genome.

Polygenic risk score

(PRS). A weighted score calculated from numerous genetic variants for predicting disease risk. A PRS is calculated as the sum, across variants, of the number of risk alleles an individual carries multiplied by each variant’s association coefficient.

Electronic health records

(EHRs). Also known as electronic medical records. Digitally stored records of patients’ medical histories.


Biobanks

Repositories that store biological samples, including blood or tissue samples, for research use. Increasingly, the term biobank is used to denote a population cohort study with stored biological samples.


Endophenotypes

The physiological traits that are related to a disease trait; for example, for hypertension this could include blood pressure, angiotensin levels or salt sensitivity.


Pleiotropic

Pertaining to a gene that affects multiple phenotypes or traits.

Positive predictive value

The proportion of true positives among positive results.

k-Nearest neighbour

A machine learning method that is based on similarities between samples.

Decision tree

A machine learning method that learns decision rules from the data and represents them in a tree-like structure. The tree is used to perform classification or regression.

Random forest

An ensemble approach that learns from multiple decision trees.

Support vector machine

A supervised machine learning method that uses a hyperplane to perform classification.

Naive Bayes

A simple classification method that is based on Bayes’ theorem.

Lasso logistic regression

A penalized version of regular logistic regression. The additional penalty term forces some features to have zero coefficients.

Population stratification

The presence of allele frequency differences between subpopulations within a larger population.

Relative risk

The ratio of the probability of an event (such as disease) occurring in an at-risk group to the probability of it occurring in a population that is not considered at risk.

Absolute risk

The actual probability of disease occurrence.

Omnigenic model

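A model that proposes that the genetic architecture of a trait or disease is affected by many genes, potentially including most genes expressed in disease-relevant cells and not only those in core disease pathways.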
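
The polygenic risk score entry in the glossary above describes a weighted sum of risk-allele counts. A minimal sketch of that calculation is given below. It is illustrative only and is not the authors' implementation. The function name, the three variants and their effect sizes are hypothetical values chosen for the example, and the weights are assumed to be per-allele association coefficients (for instance, log odds ratios from GWAS summary statistics).

```python
from typing import Sequence


def polygenic_risk_score(dosages: Sequence[float], weights: Sequence[float]) -> float:
    """Return the weighted sum of risk-allele counts for one individual.

    dosages: number of risk alleles carried at each variant (0, 1 or 2,
             or a fractional value for imputed genotypes).
    weights: per-allele association coefficient for each variant, in the
             same order as dosages.
    """
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


# Hypothetical example: three variants with made-up effect sizes.
example_weights = [0.10, 0.25, -0.05]
example_dosages = [2, 1, 0]  # risk-allele counts for one individual
example_score = polygenic_risk_score(example_dosages, example_weights)
```

In practice the raw score is usually interpreted relative to its distribution in a reference population rather than as an absolute value, and variant selection and weighting depend on the PRS method used.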


Cite this article

Li, R., Chen, Y., Ritchie, M.D. et al. Electronic health records and polygenic risk scores for predicting disease risk. Nat Rev Genet 21, 493–502 (2020). https://doi.org/10.1038/s41576-020-0224-1
