Early prediction of patient outcomes is important for targeting preventive care. This protocol describes a practical workflow for developing deep-learning risk models that can predict various clinical and operational outcomes from structured electronic health record (EHR) data. The protocol comprises five main stages: formal problem definition, data pre-processing, architecture selection, calibration and uncertainty, and generalizability evaluation. We have applied the workflow to four endpoints (acute kidney injury, mortality, length of stay and 30-day hospital readmission). The workflow can enable continuous (e.g., triggered every 6 h) and static (e.g., triggered at 24 h after admission) predictions. We also provide an open-source codebase that illustrates some key principles in EHR modeling. This protocol can be used by interdisciplinary teams with programming and clinical expertise to build deep-learning prediction models with alternate data sources and prediction tasks.
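The distinction between continuous and static predictions can be made concrete with a small scheduling sketch. The code below is illustrative only and is not taken from the authors' codebase: the function names `continuous_triggers` and `static_trigger` are hypothetical, and it assumes that a prediction is made only while the patient is still admitted.

```python
from datetime import datetime, timedelta


def continuous_triggers(admit, discharge, every_hours=6):
    """Prediction times at a fixed interval from admission until discharge.

    Assumption: the first prediction is made one interval after admission.
    """
    step = timedelta(hours=every_hours)
    times = []
    t = admit + step
    while t <= discharge:
        times.append(t)
        t += step
    return times


def static_trigger(admit, discharge, after_hours=24):
    """Single prediction time a fixed delay after admission.

    Returns None if the stay ends before the trigger point.
    """
    t = admit + timedelta(hours=after_hours)
    return t if t <= discharge else None


# Example: a 36 h stay (synthetic timestamps).
admit = datetime(2021, 1, 1, 8, 0)
discharge = datetime(2021, 1, 2, 20, 0)

cont = continuous_triggers(admit, discharge)
stat = static_trigger(admit, discharge)

print([t.isoformat() for t in cont])
print(stat)
```

In this sketch a continuous model produces one prediction per 6 h step of the stay, whereas a static model produces at most one prediction per admission.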
Open Access articles citing this article.
Can Robots Do Epidemiology? Machine Learning, Causal Inference, and Predicting the Outcomes of Public Health Interventions
Philosophy & Technology Open Access 26 February 2022
The clinical data used for the training, validation and test sets were collected at the VA and transferred in de-identified format to a secure data center with strict access controls. Data were used with both local and national permissions. The dataset is not publicly available, and restrictions apply to its use. The full results from the evaluation of our AKI model can be found in Tomašev et al. (ref. 15).
Code is available at https://github.com/google/ehr-predictions. This example code illustrates the core components of the continuous prediction architecture, task configuration and auxiliary heads. The full data pre-processing pipeline is not included here because it is highly specific to this dataset. However, we do include synthetic examples of the pre-processing stages with an accompanying data-reading notebook. We believe this exemplar code can be appropriately customized to other EHR datasets and tasks.
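As a rough illustration of how a task configuration with auxiliary heads might be combined into a single training objective, consider the sketch below. It does not reproduce the API of the repository above: the task names, the loss weights and the `total_loss` helper are all hypothetical, and the predictions are plain numbers standing in for the outputs of model heads.

```python
import math


def binary_cross_entropy(p, y, eps=1e-7):
    """Log loss for one predicted probability p and a binary label y."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def squared_error(pred, target):
    """Squared error for one regression prediction."""
    return (pred - target) ** 2


# Hypothetical task configuration: one primary head and one auxiliary head.
# Names and weights are assumptions made for this example only.
TASKS = {
    "primary_adverse_event": {"loss": binary_cross_entropy, "weight": 1.0},
    "auxiliary_lab_value": {"loss": squared_error, "weight": 0.1},
}


def total_loss(predictions, labels, tasks=TASKS):
    """Weighted sum of the per-head losses for a single prediction step."""
    return sum(
        cfg["weight"] * cfg["loss"](predictions[name], labels[name])
        for name, cfg in tasks.items()
    )


predictions = {"primary_adverse_event": 0.5, "auxiliary_lab_value": 2.0}
labels = {"primary_adverse_event": 1, "auxiliary_lab_value": 1.0}
loss = total_loss(predictions, labels)
print(round(loss, 4))
```

Down-weighting the auxiliary head, as assumed here, is one common way to let it regularize the shared representation without dominating the primary task.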
Royal College of Physicians. National Early Warning Score (NEWS) 2: Standardising the assessment of acute-illness severity in the NHS. Updated report of a working party. https://www.rcplondon.ac.uk/file/8636/download (2017).
van Walraven, C. et al. Derivation and validation of an index to predict early death or unplanned readmission after discharge from hospital to the community. CMAJ 182, 551–557 (2010).
Sutton, R. T. et al. An overview of clinical decision support systems: benefits, risks, and strategies for success. NPJ Digit. Med. 3, 17 (2020).
Johnson, A. E. W. & Mark, R. G. Real-time mortality prediction in the Intensive Care Unit. AMIA Annu. Symp. Proc. 2017, 994–1003 (2017).
Barnes, S., Hamrock, E., Toerper, M., Siddiqui, S. & Levin, S. Real-time prediction of inpatient length of stay for discharge prioritization. J. Am. Med. Inform. Assoc. 23, e2–e10 (2016).
Horng, S. et al. Creating an automated trigger for sepsis clinical decision support at emergency department triage using machine learning. PLoS ONE 12, e0174708 (2017).
Henry, K. E., Hager, D. N., Pronovost, P. J. & Saria, S. A targeted real-time early warning score (TREWScore) for septic shock. Sci. Transl. Med. 7, 299ra122 (2015).
Wong, A. et al. Development and validation of an electronic health record-based machine learning model to estimate delirium risk in newly hospitalized patients without known cognitive impairment. JAMA Netw. Open 1, e181018 (2018).
Xiao, C., Choi, E. & Sun, J. Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review. J. Am. Med. Inform. Assoc. 25, 1419–1428 (2018).
Fagerström, J., Bång, M., Wilhelms, D. & Chew, M. S. LiSep LSTM: a machine learning algorithm for early detection of septic shock. Sci. Rep. 9, 15132 (2019).
Bedoya, A. D. et al. Machine learning for early detection of sepsis: an internal and temporal validation study. JAMIA Open 3, 252–260 (2020).
Hyland, S. L. et al. Early prediction of circulatory failure in the intensive care unit using machine learning. Nat. Med. 26, 364–373 (2020).
Seneviratne, M. G., Shah, N. H. & Chu, L. Bridging the implementation gap of machine learning in healthcare. BMJ Innov. 6, 45–47 (2019).
Sendak, M. P. et al. A path for translation of machine learning products into healthcare delivery. EMJ Innov. https://doi.org/10.33590/emjinnov/19-00172 (2020).
Tomašev, N. et al. A clinically applicable approach to the continuous prediction of future acute kidney injury. Nature 572, 116–119 (2019).
Prosperi, M. et al. Causal inference and counterfactual prediction in machine learning for actionable healthcare. Nat. Mach. Intell. 2, 369–375 (2020).
Riley, R. D. et al. Calculating the sample size required for developing a clinical prediction model. BMJ 368, m441 (2020).
Rajkomar, A., Hardt, M., Howell, M. D., Corrado, G. & Chin, M. H. Ensuring fairness in machine learning to advance health equity. Ann. Intern. Med. 169, 866–872 (2018).
Mitchell, M. et al. Model cards for model reporting. In FAT* ’19: Proceedings of the Conference on Fairness, Accountability, and Transparency 220–229 (Association for Computing Machinery, 2019).
Rajkomar, A. et al. Scalable and accurate deep learning with electronic health records. NPJ Digit. Med. 1, 18 (2018).
Assale, M., Dui, L. G., Cina, A., Seveso, A. & Cabitza, F. The revival of the notes field: leveraging the unstructured content in electronic health records. Front. Med. 6, 66 (2019).
Huang, K., Altosaar, J. & Ranganath, R. ClinicalBERT: modeling clinical notes and predicting hospital readmission. Preprint at https://arxiv.org/abs/1904.05342 (2019).
Kemp, J., Rajkomar, A. & Dai, A. M. Improved hierarchical patient classification with language model pretraining over clinical notes. Preprint at https://arxiv.org/abs/1909.03039 (2019).
Chen, P.-H. C., Liu, Y. & Peng, L. How to develop machine learning models for healthcare. Nat. Mater. 18, 410–414 (2019).
Liu, Y., Chen, P.-H. C., Krause, J. & Peng, L. How to read articles that use machine learning: users’ guides to the medical literature. JAMA 322, 1806–1816 (2019).
Wiens, J. et al. Do no harm: a roadmap for responsible machine learning for health care. Nat. Med. 25, 1337–1340 (2019).
Ghassemi, M. et al. Practical guidance on artificial intelligence for health-care data. Lancet Digit. Health 1, e157–e159 (2019).
Esteva, A. et al. A guide to deep learning in healthcare. Nat. Med. 25, 24–29 (2019).
Collins, G. S. & Moons, K. G. M. Reporting of artificial intelligence prediction models. Lancet 393, 1577–1579 (2019).
Sounderajah, V. et al. Developing specific reporting guidelines for diagnostic accuracy studies assessing AI interventions: the STARD-AI Steering Group. Nat. Med. 26, 807–808 (2020).
Liu, X. et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat. Med. 26, 1364–1374 (2020).
Cruz Rivera, S. et al. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Nat. Med. 26, 1351–1363 (2020).
Harutyunyan, H. et al. Multitask learning and benchmarking with clinical time series data. Sci. Data 6, 96 (2019).
Purushotham, S., Meng, C., Che, Z. & Liu, Y. Benchmarking deep learning models on large healthcare datasets. J. Biomed. Inform. 83, 112–134 (2018).
Nemati, S. et al. An interpretable machine learning model for accurate prediction of sepsis in the ICU. Crit. Care Med. 46, 547–553 (2018).
Caicedo-Torres, W. & Gutierrez, J. ISeeU: visually interpretable deep learning for mortality prediction inside the ICU. J. Biomed. Inform. 98, 103269 (2019).
Shickel, B. et al. DeepSOFA: a continuous acuity score for critically ill patients using clinically interpretable deep learning. Sci. Rep. 9, 1879 (2019).
Avati, A. et al. Improving palliative care with deep learning. BMC Med. Inform. Decis. Mak. 18, 122 (2018).
Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90–95 (2007).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Li, R. C., Asch, S. M. & Shah, N. H. Developing a delivery science for artificial intelligence in healthcare. NPJ Digit. Med. 3, 107 (2020).
Blecker, S. et al. Interruptive versus noninterruptive clinical decision support: usability study. JMIR Hum. Factors 6, e12469 (2019).
Selby, N. M., Hill, R. & Fluck, R. J. Standardizing the early identification of acute kidney injury: the NHS England national patient safety alert. Nephron 131, 113–117 (2015).
Amland, R. C. & Hahn-Cover, K. E. Clinical decision support for early recognition of sepsis. Am. J. Med. Qual. 31, 103–110 (2016).
Wang, S. et al. MIMIC-Extract: a data extraction, preprocessing, and representation pipeline for MIMIC-III. In CHIL ’20: Proceedings of the ACM Conference on Health, Inference, and Learning 222–235 (Association for Computing Machinery, 2020).
Khwaja, A. KDIGO clinical practice guidelines for acute kidney injury. Nephron Clin. Pract. 120, c179–c184 (2012).
Ding, D. Y. et al. The effectiveness of multitask learning for phenotyping with electronic health records data. Preprint at https://arxiv.org/pdf/1808.03331.pdf (2018).
McDermott, M. B. A. et al. A comprehensive evaluation of multi-task learning and multi-task pre-training on EHR time-series data. Preprint at https://arxiv.org/abs/2007.10185 (2020).
Lipton, Z. C., Kale, D. C. & Wetzel, R. C. Directly modeling missing data in sequences with RNNs: improved classification of clinical time series. In Proceedings of the 1st Machine Learning for Healthcare Conference, PMLR Vol. 56, 253–270 Available at https://arxiv.org/abs/1606.04130 (2016).
Beaulieu-Jones, B. K. et al. Characterizing and managing missing structured data in electronic health records: data analysis. JMIR Med. Inform. 6, e11 (2018).
Xue, Y., Klabjan, D. & Luo, Y. Mixture-based multiple imputation model for clinical data with a temporal dimension. In 2019 IEEE International Conference on Big Data (Big Data) 245–252 (IEEE, Los Angeles, CA, USA, 2019).
Yoon, J., Jordon, J. & van der Schaar, M. GAIN: missing data imputation using generative adversarial nets. In ICML ’18: Proceedings of the 35th International Conference on Machine Learning (eds. Dy, J. & Krause, A.) 5689–5698 (International Machine Learning Society, 2018).
Saito, T. & Rehmsmeier, M. The precision recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 10, e0118432 (2015).
Lee, C., Yoon, J. & van der Schaar, M. Dynamic-DeepHit: a deep learning approach for dynamic survival analysis with competing risks based on longitudinal data. IEEE Trans. Biomed. Eng. 67, 122–133 (2020).
Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. In KDD ’16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794 (Association for Computing Machinery, 2016).
Efron, B. & Tibshirani, R. J. An Introduction to the Bootstrap (Chapman & Hall/CRC Press, 1994).
Miotto, R., Li, L., Kidd, B. A. & Dudley, J. T. Deep Patient: an unsupervised representation to predict the future of patients from the electronic health records. Sci. Rep. 6, 26094 (2016).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
Collins, J., Sohl-Dickstein, J. & Sussillo, D. Capacity and learnability in recurrent neural networks. Preprint at https://arxiv.org/abs/1611.09913 (2017).
Lei, T., Zhang, Y., Wang, S. I., Dai, H. & Artzi, Y. Simple recurrent units for highly parallelizable recurrence. Preprint at https://arxiv.org/abs/1709.02755 (2017).
Bradbury, J., Merity, S., Xiong, C. & Socher, R. Quasi-recurrent neural networks. Preprint at https://arxiv.org/abs/1611.01576 (2016).
Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. Preprint at https://arxiv.org/abs/1412.3555 (2014).
Graves, A., Wayne, G. & Danihelka, I. Neural Turing machines. Preprint at https://arxiv.org/abs/1410.5401 (2014).
Santoro, A. et al. One-shot learning with memory-augmented neural networks. In ICML ’16: Proceedings of the 33rd International Conference on Machine Learning Vol. 48 (eds. Balcan, M. F. & Weinberger, K. Q.) 1842–1850 (International Machine Learning Society, 2016).
Graves, A. et al. Hybrid computing using a neural network with dynamic external memory. Nature 538, 471–476 (2016).
Santoro, A. et al. Relational recurrent neural networks. Preprint at https://arxiv.org/abs/1806.01822 (2018).
Zilly, J. G., Srivastava, R. K., Koutník, J. & Schmidhuber, J. Recurrent highway networks. In ICML ’17: Proceedings of the 34th International Conference on Machine Learning Vol. 70 (eds. Precup, D. & Teh, Y. W.) 4189–4198 (International Machine Learning Society, 2017).
Zeiler, M. D. & Fergus, R. Visualizing and understanding convolutional networks. In Computer Vision—ECCV 2014 (eds. Fleet, D. et al.) 818–833 (Springer, 2014).
Ancona, M., Öztireli, C. & Gross, M. H. Explaining deep neural networks with a polynomial time algorithm for Shapley values approximation. In Proceedings of the 36th International Conference on Machine Learning Vol. 97 (eds. Chaudhuri, K. & Salakhutdinov, R.) 272–281 (International Machine Learning Society, 2019).
Oakden-Rayner, L., Dunnmon, J., Carneiro, G. & Ré, C. Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. In CHIL ’20: Proceedings of the ACM Conference on Health, Inference, and Learning 151–159 (Association for Computing Machinery, 2020).
Guo, C., Pleiss, G., Sun, Y. & Weinberger, K. Q. On calibration of modern neural networks. In ICML ’17: Proceedings of the 34th International Conference on Machine Learning Vol. 70 (eds. Precup, D. & Teh, Y. W.) 1321–1330 (International Machine Learning Society, 2017).
Zadrozny, B. & Elkan, C. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 694–699 (Association for Computing Machinery, 2002).
Niculescu-Mizil, A. & Caruana, R. Predicting good probabilities with supervised learning. In ICML ’05: Proceedings of the 22nd International Conference on Machine Learning (eds. Raedt, L. D. & Wrobel, S.) 625–632 (Association for Computing Machinery, 2005).
De Fauw, J. et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat. Med. 24, 1342–1350 (2018).
Lakshminarayanan, B., Pritzel, A. & Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems Vol. 30, 6402–6413 (2017).
Gal, Y. & Ghahramani, Z. Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In ICML ’16: Proceedings of the 33rd International Conference on Machine Learning (ICML) Vol. 48 (eds. Balcan, M. F. & Weinberger, K. Q.) 1050–1059 (2016).
Dusenberry, M. W. et al. Analyzing the role of model uncertainty for electronic health records. In CHIL ’20: Proceedings of the ACM Conference on Health, Inference, and Learning (Association for Computing Machinery, 2020).
Romero-Brufau, S., Huddleston, J. M., Escobar, G. J. & Liebow, M. Why the C-statistic is not informative to evaluate early warning scores and what metrics to use. Crit. Care 19, 285 (2015).
Nestor, B. et al. Feature robustness in non-stationary health records: caveats to deployable model performance in common clinical machine learning tasks. In Machine Learning for Healthcare (MLHC) Vol. 106, 1–23 (PMLR, 2019).
Johnson, A. E. W. et al. A comparative analysis of sepsis identification methods in an electronic database. Crit. Care Med. 46, 494–499 (2018).
Bates, D. W. et al. Big data in health care: using analytics to identify and manage high-risk and high-cost patients. Health Aff. 33, 1123–1131 (2014).
Verburg, I. W. M., de Keizer, N. F., de Jonge, E. & Peek, N. Comparison of regression methods for modeling intensive care length of stay. PLoS ONE 9, e109684 (2014).
Shillan, D., Sterne, J. A. C., Champneys, A. & Gibbison, B. Use of machine learning to analyse routinely collected intensive care unit data: a systematic review. Crit. Care 23, 284 (2019).
Nakas, C. T., Schütz, N., Werners, M. & Leichtle, A. B. Accuracy and calibration of computational approaches for inpatient mortality predictive modeling. PLoS ONE 11, 1–11 (2016).
Aczon, M. et al. Dynamic mortality risk predictions in pediatric critical care using recurrent neural networks. Preprint at https://arxiv.org/abs/1701.06675 (2017).
Che, Z., Purushotham, S., Cho, K. & Sontag, D. Recurrent neural networks for multivariate time series with missing values. Sci. Rep. 8, 6085 (2018).
Mayampurath, A. et al. Combining patient visual timelines with deep learning to predict mortality. PLoS ONE 14, e0220640 (2019).
Fritz, B. A. et al. Deep-learning model for predicting 30-day postoperative mortality. Br. J. Anaesth. 123, 688–695 (2019).
Xia, J. et al. A long short-term memory ensemble approach for improving the outcome prediction in intensive care unit. Comput. Math. Methods Med. 2019, 8152713 (2019).
Nielsen, A. B. et al. Survival prediction in intensive-care units based on aggregation of long-term disease history and acute physiology: a retrospective study of the Danish National Patient Registry and electronic patient records. Lancet Digit. Health 1, e78–e89 (2019).
Johnson, A. E. W. et al. MIMIC-III, a freely accessible critical care database. Sci. Data 3, 160035 (2016).
Brajer, N. et al. Prospective and external evaluation of a machine learning model to predict in-hospital mortality of adults at time of admission. JAMA Netw. Open 3, e1920733 (2020).
Jamei, M., Nisnevich, A., Wetchler, E., Sudat, S. & Liu, E. Predicting all-cause risk of 30-day hospital readmission using artificial neural networks. PLoS ONE 12, e0181173 (2017).
Hilton, C. B. et al. Personalized predictions of patient outcomes during and after hospitalization using artificial intelligence. NPJ Digit. Med. 3, 51 (2020).
Liu, S., Davison, A. J. & Johns, E. Self-supervised generalisation with meta auxiliary learning. In 33rd Conference on Neural Information Processing Systems (NeurIPS 2019) (Neural Information Processing Systems Foundation Inc., 2019).
Ghassemi, M. et al. A review of challenges and opportunities in machine learning for health. Preprint at https://arxiv.org/abs/1806.00388 (2020).
Kelly, C. J. et al. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 17, 195 (2019).
We thank the veterans and their families under the care of the VA. We also thank A. Phalen, A. Graves, O. Vinyals, K. Kavukcuoglu, S. Chiappa, T. Lillicrap, R. Raine, P. Keane, A. Schlosberg, O. Ronneberger, J. De Fauw, K. Ruark, M. Jones, J. Quinn, D. Chou, C. Meaden, G. Screen, W. West, R. West, P. Sundberg and the Google Research team, J. Besley, M. Bawn, K. Ayoub and R. Ahmed. Special thanks to K. Peterson and the many other VA staff, including physicians, administrators and researchers who worked on the data collection. Thanks to the many DeepMind and Google Health colleagues for their support, ideas and encouragement. G.R. and H.M. were supported by University College London and the National Institute for Health Research (NIHR) University College London Hospitals Biomedical Research Centre. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health.
G.R., H.M. and C.L. are paid contractors of DeepMind/Google Health.
Peer review information Nature Protocols thanks Issam El Naqa and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Key reference using this protocol
Tomašev, N. et al. Nature 572, 116–119 (2019): https://doi.org/10.1038/s41586-019-1390-1
Cite this article
Tomašev, N., Harris, N., Baur, S. et al. Use of deep learning to develop continuous-risk models for adverse event prediction from electronic health records. Nat Protoc 16, 2765–2787 (2021). https://doi.org/10.1038/s41596-021-00513-5