  • Review Article

Harnessing EHR data for health research

Abstract

With the increasing availability of rich, longitudinal, real-world clinical data recorded in electronic health records (EHRs) for millions of patients, there is growing interest in leveraging these records to improve the understanding of human health and disease and to translate these insights into clinical applications. However, there is also a need to consider the limitations of these data due to various biases and to understand the impact of missing information. Recognizing and addressing these limitations can inform the design and interpretation of EHR-based informatics studies that avoid confusing or incorrect conclusions, particularly when applied to population or precision medicine. Here we discuss key considerations in the design, implementation and interpretation of EHR-based informatics studies, drawing from examples in the literature across hypothesis generation, hypothesis testing and machine learning applications. We outline the growing opportunities for EHR-based informatics studies, including association studies and predictive modeling, enabled by evolving AI capabilities, while addressing limitations and potential pitfalls to avoid.

Fig. 1: EHR data flow and sources of heterogeneity and bias.

Author information

Corresponding author

Correspondence to Marina Sirota.

Ethics declarations

Competing interests

B.N. is an employee at Qualified Health. The other authors declare no competing interests.

Peer review

Peer review information

Nature Medicine thanks Wenbo Wu and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Karen O’Leary, in collaboration with the Nature Medicine team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Tang, A.S., Woldemariam, S.R., Miramontes, S. et al. Harnessing EHR data for health research. Nat Med 30, 1847–1855 (2024). https://doi.org/10.1038/s41591-024-03074-8

  • DOI: https://doi.org/10.1038/s41591-024-03074-8
