Hospital systems, payers, and regulators have focused on reducing length of stay (LOS) and early readmission, with uncertain benefit. Interpretable machine learning (ML) may assist in transparently identifying the risk of important outcomes. We conducted a retrospective cohort study of hospitalizations at a tertiary academic medical center and its branches from January 2011 to May 2018. A consecutive sample of all hospitalizations in the study period was included. Algorithms were trained on medical, sociodemographic, and institutional variables to predict readmission, LOS, and death within 48–72 h. Prediction performance was measured by the area under the receiver operating characteristic curve (AUC), the Brier score loss (BSL), which measures how well predicted probability matches observed probability, and other metrics. Interpretations were generated using multiple feature extraction algorithms. The study cohort included 1,485,880 hospitalizations for 708,089 unique patients (median age 59 years, first and third quartiles (QI) [39, 73]; 55.6% female; 71% white). There were 211,022 30-day readmissions, for an overall readmission rate of 14% (16% for patients ≥65 years). Median LOS, including observation and labor and delivery patients, was 2.94 days (QI [1.67, 5.34]), or, if these patients are excluded, 3.71 days (QI [2.15, 6.51]). Predictive performance was as follows: 30-day readmission (AUC 0.76/BSL 0.11); LOS > 5 days (AUC 0.84/BSL 0.15); death within 48–72 h (AUC 0.91/BSL 0.001). Explanatory diagrams showed the factors that impacted each prediction.
Patients and providers face considerable uncertainty before, during, and after hospital encounters. Predictive modeling holds promise for identifying patients at the highest risk of adverse events, such as extended length of stay (LOS), 30-day readmission, and death during the hospital encounter. Despite the success of predictive models in achieving discriminatory power in these and other areas, simplistic models cannot account for complicated intersections of medical, institutional, and demographic factors. Conversely, complex models that account for these interactions are difficult or impossible to interpret or audit, and therefore may be unactionable or harmful if put into use, and can also be difficult for healthcare providers to understand or accept1,2,3. Recent studies suggest that a focus on metrics such as 30-day readmission without addressing underlying causes may lead to increased patient mortality and increased cost without improving patient outcomes4.
Significant recent advances in artificial intelligence (AI), machine learning (ML), and deep learning (DL) have yielded compelling innovations including self-driving cars5, product recommendations6, and superhuman performance in complex games, such as chess and Go7. These advances have also started to impact healthcare, particularly in visual tasks: detecting diabetic retinopathy in ophthalmology images8, cancers in biopsy slides9, and malignant versus benign skin lesions, often with accuracy comparable to or exceeding that of trained physicians10. As electronic healthcare record (EHR) data increase in size and complexity, AI and ML may provide predictive modeling tools that can improve patient safety and outcomes while decreasing cost. A major hurdle for ML in healthcare is the “black box phenomenon,” or lack of explainability, to patients and healthcare providers. However, recent advances have provided algorithms that reliably extract important variables and explain model decisions, allowing for auditing and exploration. Such approaches can ensure that the variables included in the final model are clinically relevant and can be recognized and understood, and they may lead to new insights and hypotheses. Most importantly, explainable ML supports clinician and patient decision-making, rather than supplanting it, by making explicit the nature and character of the variables the algorithm considered most important when making its predictions.
In this study, we hypothesized that interpretable predictive models would achieve comparable or superior performance to existing models and enable an understanding of factors associated with adverse outcomes. Here we report ML models with high predictive power for readmission and extended LOS, along with patient-level and cohort-level interpretations, and discuss the use of ML as a tool to aid understanding.
In the study period, there were 1,485,880 hospitalizations for 708,089 unique patients, 439,696 (62%) of whom had only 1 hospitalization recorded. The median number of hospitalizations per patient was 1 (first and third quartile (QI) [1.0, 2.0]). There were 211,022 30-day readmissions for an overall readmission rate of 14%. Among patients aged ≥65 years, the 30-day readmission rate was 16%. The median LOS, including patients in observation status and labor and delivery patients, was 2.94 days (QI [1.67, 5.34]), or if these patients are excluded, 3.71 days (QI [2.15, 6.51]). The demographic and clinical characteristics of the patient cohort are summarized in Table 1. Higher rates of 30-day readmissions were observed in patients who were older (median age 62 vs. 59 years), African American (rate of 17% vs. 13% in whites), divorced/separated or widowed (17% vs. 13% in married/partnered or single patients), on Medicare insurance (rate of 17% vs. 10% for private insurance), and had one or multiple chronic conditions such as cancer, renal disease, congestive heart failure, chronic obstructive pulmonary disease, etc. (Table 1).
Prediction of inpatient outcomes
Thirty-day readmissions were predicted with an area under the receiver operating characteristic curve (ROC AUC, here abbreviated simply as “AUC”) of 0.76 (Supplementary Fig. 1a). The Brier score loss (BSL) was 0.11; the calibration curve is shown in Supplementary Fig. 1b. Average precision was 0.38 (see Supplementary Fig. 2c). Other off-the-shelf ML models, including a deep neural network, were trained on the same task, with performance generally inferior to the Gradient Boosting Machine (GBM) or, in the case of the deep neural network, similar (see Supplementary Fig. 2 and Supplementary Table 1). When trained and evaluated on a smaller cohort of 300,000 hospitalizations, performance metrics were similar: AUC 0.75, BSL 0.11. The most impactful features, ranked from most to least important, included primary diagnosis, days between the current admission and the previous discharge, number of past admissions, LOS, total emergency department visits in the past 6 months, number of reported comorbidities, admission source, discharge disposition, and body mass index (BMI) on admission and discharge, among others (Fig. 1a, b; see also Supplementary Fig. 3). Including more than the top ten variables in the model did not improve predictive power for the cohort overall but did allow for a more specific rationale for prediction in certain patients, as well as examination of feature interactions for further exploration. Sample individualized predictions with their explanations are shown in Fig. 1c, d, and further examples are shown in Supplementary Fig. 4. The examples in Supplementary Fig. 4 show patients with comparable predicted probabilities but different compositions of features leading to these predictions.
To examine possible changes in the causes of readmission risk as a function of time from discharge, we predicted readmission risk at several readmission thresholds and calculated SHAP (SHapley Additive exPlanations) values for each. SHAP values for 3- and 7-day readmission are shown in Supplementary Fig. 5a, b, respectively. For example, 7-day readmission risk prediction achieved an AUC of 0.70 with a BSL of 0.05 (Table 2). The most impactful feature remained primary diagnosis, but other features played more important roles: for example, BlockGroup rose to second most important variable (from ninth), number of emergency department visits in the past 6 months rose to third (from fourth), admission blood counts increased in importance, and insurance provider rose to eighth (from twelfth). BMI on admission fell several places, and BMI on discharge no longer appears among the top variables. The BMI variables are unique in that missing values tend to be important, in addition to extreme values, perhaps correlating with disease burden and/or hospital practices that could be further investigated.
LOS was predicted both in terms of the number of days and binarized at various thresholds. LOS in days was predicted poorly, with a root mean square error (RMSE) of 3.97 days (against an average LOS of 2.94–3.71 days). LOS over 5 days was predicted with an AUC of 0.84 (Fig. 2a) and a BSL of 0.15 (calibration curve shown in Supplementary Fig. 1d). Average precision was 0.70 (see Supplementary Fig. 2d). When trained and evaluated on a cohort of 300,000 patients, performance was similar: AUC 0.81 and BSL 0.17. Other ML models, including a deep neural network, were trained on the same task, with performance generally inferior to the GBM (see Supplementary Fig. 2 and Supplementary Table 1). The most impactful features included the type of admission, primary diagnosis code, patient age, admission source, LOS of the most recent prior admission, medications administered in the hospital in the first 24 h, insurance, and early admission to the intensive care unit, among others shown in Fig. 2c, d. Impactful features for LOS at thresholds of 3 and 7 days are shown in Supplementary Fig. 5c, d, respectively. The AUC did not differ at these thresholds compared with 5 days (Table 2). Given that primary diagnosis is often assigned late in the hospital encounter or even after discharge, we trained the LOS models with and without this feature for comparison. Results are shown in Supplementary Table 1d. Overall, predictive performance decreased, as expected: the AUC for LOS > 5 days was 0.781, the BSL was 0.173, and the average precision was 0.640.
Death within 48–72 h of admission was predicted with an AUC of 0.91 and a BSL of 0.001 (Table 2). However, owing to extreme class imbalance (in the testing set there were 260,518 non-deaths and 390 deaths), this was achieved by predicting non-death in every case. Strategies to produce a reliable model by addressing the class imbalance, such as data oversampling, were unsuccessful. AUC and BSL therefore do not reliably indicate model performance and applicability in this clinical setting.
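This degeneracy can be illustrated with a short sketch (the class counts mirror the testing set above; the constant predicted probability is a hypothetical stand-in): a model that assigns every hospitalization a near-zero probability of death attains a BSL near the ~0.0015 base rate despite being clinically useless.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

# Class counts mirroring the testing set: 260,518 non-deaths, 390 deaths.
y_true = np.zeros(260_518 + 390)
y_true[:390] = 1.0

# A clinically useless model: the same near-zero death probability for all.
y_prob = np.full_like(y_true, 0.0015)

bsl = brier_score_loss(y_true, y_prob)
print(f"BSL = {bsl:.4f}")  # close to the ~0.0015 base rate
```

The low BSL here reflects only the rarity of the outcome, not any discriminative ability, which is why the BSL of 0.001 above cannot be read as evidence of a usable model.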
SHAP analysis also allows examination of interactions between variables. Key variable interactions are shown in Supplementary Figs 6 and 7. For example, high and low values of heart rate were shown to affect probability of readmission differently for patients at different ages. With older patients, there is a clearer trend toward lower heart rates on discharge contributing to lower readmission risk and higher heart rates contributing to higher readmission risk, though modestly (SHAP values from −0.1 to +0.1–0.2). With younger patients, higher discharge heart rates overall are observed, and the positive trend is more modest. This may highlight the importance of considering a variable such as heart rate in a more complete clinical setting, such as one that includes patient age and clinical reasoning (e.g., an adult is unlikely to be discharged with marked tachycardia) (Supplementary Fig. 6c). A similar finding is observed in Supplementary Fig. 7c for LOS prediction, though clinical reasoning is less likely to play a role compared with more purely physiologic phenomena: higher heart rates overall are observed for pediatric patients, and the relationship between heart rate and LOS is not observed to be as linear for pediatric patients (high and low SHAP values are observed more uniformly for given levels of tachycardia in pediatric patients).
Our investigation of ML methods for predicting and explaining inpatient outcomes was initiated as a result of increased focus on the costs and risks of inpatient stays in the United States and other countries, availability of complex data in the EHR, and the development of explainable predictive models. In addition, recent concerns over the impact of metrics such as readmission rates4 yield an opportunity to develop models that may be used to not only predict but also understand the components of risk and their interactions. We therefore sought to predict and understand current and future readmissions and the LOS during hospitalization.
Our models achieved comparable performance to the existing state of the art in the prediction of readmission and LOS but with more explainable models11,12. By using a model that accounts for non-linear interactions, we can flexibly predict outcomes across a large number of patients with many diagnoses and comorbidities. In addition to reporting AUC, which assesses performance across classification cutoffs, we show that our models are well calibrated when using raw probabilities, which may be more useful than binary classifications in many settings13. The most important components of the probability prediction for each patient can be examined, which would ideally lead to items that can be further studied, perhaps leading to quality improvement efforts (e.g., patients with a high number of emergency department visits contributing significantly to their risk of readmission may be targeted for hotspotting efforts rather than the usual scheduled in-office follow-up)14,15,16,17 or at least to a deeper understanding of the current situation (e.g., a given diagnosis or necessary therapeutic agent may be associated with a higher risk of readmission or another adverse outcome, but these features are not likely modifiable)18. We also generate cohort-level diagrams that explain the contributions of each variable to the model output as well as key variable interactions.
Because of the focus on interpretability, the study was designed to cast a broad net with regard to inclusion criteria. Rather than including only CMS (Centers for Medicare and Medicaid Services)-defined readmissions, we chose to include all patients who survived the index hospitalization, including those in observation status. We also included all available diagnoses and ranges of demographic categories, including age. This allowed us to examine the impacts of these variables, as well as to develop a broadly applicable model for the institution as a whole, which comprises many specialties, hospitals, and a range of socioeconomic environs. Using diverse data also allowed us to find interactions, such as the varying impacts of heart rate and number of administered medications on readmission risk across the range of ages. We also found, as have others19, that the presence or missingness of data within the EHR can be informative on its own, as in the case of BMI measurement in Fig. 2c, d.
Our study is additionally unique in balancing a relatively simple model architecture and hand-selected variables with a robust and generalizable explanatory method. Rajkomar et al. achieved comparable results using a DL model trained on nearly 47 billion data points spread over ~215,000 patients, acquired with an automated data collection method11. Their explanatory method highlighted areas of the medical record that were most important for prediction but used restricted and less performant versions of their models, retrained on a single data type (text, laboratory results, etc.). Our approach is a direct interpretation of the full predictive algorithm and also explains the impact of variables across the range of possible values, rather than simply highlighting which variables were important. It may be the case that more highly tuned DL or other, less complex approaches would achieve similar or superior predictive power, but likely at the expense of either interpretability or richness20,21,22. It is also important to note that our approach and that of Rajkomar et al. are not directly comparable, given the heavily specialized algorithms and explanatory methods used in their approach, with a different cohort, different data format, and breadth of variables considered. We used off-the-shelf algorithms that are free and open source and do not require advanced computational power, and that may therefore be more accessible in less resource-rich settings. One key contribution of Rajkomar et al. was the use of an interoperable, rich, dynamic data format, and hence their approach has an increased focus on the data pipeline proper, whereas ours is a simpler database query with a modest amount of feature engineering. However, we share the goal of predicting adverse outcomes with a high degree of explainability that targets decision support and hypothesis generation, rather than automated decision-making.
Further, given the comparable performance metrics achieved by our approach and others in similar cohorts, it may be that the inherent complexity of readmissions and long LOS confer a natural upper limit on predictive power, encouraging a further focus on interpretability.
The study has several limitations. First, we selected only variables available at the beginning and end of the hospitalization. Second, because we only used data available in our EHR, we could only assess for readmissions with reference to our hospital system. We therefore did not capture the total readmission rate, nor could we account for admissions to our system that were readmissions from another system. Third, this was a retrospective study based on data from a single health system. It therefore requires external validation, though the most important variables that impacted each outcome were also described as important prognostic factors in prior reports, which suggests that our model could be applicable in other systems. Fourth, primary diagnosis code was used as a predictor. This is typically not available until some time after the encounter has completed and financial teams have processed the hospitalization and so would not be available for either LOS or readmission predictions in a live system. We are exploring ways to dynamically assign primary diagnosis within an encounter for our in-house implementations of the model, such as ranking the electronic medical record problem list according to surrogate markers of severity. Finally, and in summary, as with all ML seeking to explore causal relationships, this is a hypothesis-generating work, in need of rigorous validation, independent studies on promising components, and, ultimately, patient and clinician judgment as regards application. We hope that an emphasis on intelligence augmentation, decision support, and explainability will lead to a more nuanced and skilled adoption of ML as yet another tool in a holistic approach to patient care and research.
In conclusion, we generated prediction models that reliably predict the probability of readmission and LOS, which are explainable on the patient level and cohort level. We propose the use of this approach as an auditable decision aid that also contributes to hypothesis generation.
Hospitalizations with a discharge date from January 2011 to May 2018 were extracted from the Cleveland Clinic (CC) EHR. Clinical, demographic, and institutional features were extracted using natural language processing and parsing of structured data available within the EHR (see Supplementary Table 2). Data available at the time of hospitalization (i.e., within roughly 24 h of encounter creation) and discharge were marked as such and used as appropriate to the predictive task. Publicly available American Community Survey census information was retrieved for each patient’s census block group (BlockGroup), which is based on home address and reports aggregate sociodemographic data for a small geographic region23. This study was approved by the CC Institutional Review Board with a waiver of individual informed consent due to the retrospective nature of the study and conducted in accordance with the Declaration of Helsinki.
The cohort of hospitalized patients was split into three groups for analysis: 80% for model development, 10% for testing, and 10% for validation. Selection of hospitalizations for inclusion in each group was random with the exception of ensuring that the rate of the positive class (30-day readmission, LOS over 5 days, etc.) was consistent between sets.
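This stratified 80/10/10 split can be sketched as follows (a minimal illustration with synthetic data; the variable names and the ~14% positive rate are stand-ins for the real cohort):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the hospitalization table; in the study, y would be
# the positive class (30-day readmission, LOS over 5 days, etc.).
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = rng.binomial(1, 0.14, size=10_000)  # ~14% positive rate, as in the cohort

# 80% development set, stratified on the outcome so the positive-class rate
# stays consistent between sets.
X_dev, X_rest, y_dev, y_rest = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=42)

# Split the remaining 20% evenly into testing (10%) and validation (10%).
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, train_size=0.5, stratify=y_rest, random_state=42)

print(len(X_dev), len(X_test), len(X_val))  # 8000 1000 1000
```

Stratification matters here because the positive classes are minorities; a purely random split could otherwise leave the smaller sets with noticeably different event rates.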
GBM algorithms were used to produce the predictive models. GBMs are nonparametric methods that train many decision trees in succession, using information from each tree to optimize the performance of the next iteration24. GBMs achieve state-of-the-art performance relative to other ML methods, especially on structured data25. They also allow for the inclusion of many types of variables and can explicitly account for missing data, and thus do not require imputation of missing values. More information regarding the GBM algorithm is available in the Supplementary Materials. To reduce model overfitting, we employed a standard train/test/validation split and early stopping at 200 iterations26,27. For comparison, we also trained a deep neural network, logistic regression, and several other ML algorithms on the same data, applying standard imputation and scaling techniques. We performed ten-fold, ten-repeat cross-validation to generate confidence intervals. Given that primary diagnosis is often not assigned until after the hospital encounter, we trained the LOS models with and without this feature for comparison. Finally, we trained our final model on a smaller subset of 300,000 hospitalizations to examine the effect of training data size on model performance.
To extract the important variables that impacted the algorithm and ensure the appropriateness of the final models, cohort and personalized model predictions were interpreted using SHAP values28. SHAP values, based on the Shapley value from coalitional game theory, are consistent and accurate calculations of the contributions of each feature to any ML model’s prediction. They are additionally able to account for feature interactions, including situations where a given value may either increase or decrease risk (for example, a child with a heart rate of 130 vs. a geriatric patient with the same heart rate). SHAP values also overcome limitations inherent to the standard variable importance information available in tree-based models, which yields only an ordering of all variables by their overall impact on predictions: SHAP shows the impact of variables across the range of their values and the interactions of variables with each other, and allows for case-specific (here, patient-specific) explanations as well as cohort-level exploration. More details regarding the SHAP package are summarized in the Supplementary Materials.
Descriptive statistics were used to summarize the patient cohort overall and in each subgroup. Model performance was assessed with metrics appropriate to the prediction endpoint. For binary outcomes, the BSL, AUC, and area under the precision-recall curve (average precision) were calculated. We also produced appropriate figures for these metrics, including calibration curves, which show the quality of a model’s proposed probability by comparing it with the percentage of patients at that probability who had the outcome of interest (i.e., proposed probability vs. actual probability). Numeric outcomes, including LOS in days and days until readmission, were evaluated with RMSE. All analyses were performed with scikit-learn v0.20.329 and Python v3.6.6. More details regarding the statistical methods are summarized in the Supplementary Materials.
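These binary-outcome metrics can be sketched as follows (synthetic predictions constructed to be perfectly calibrated, so the proposed and actual probabilities should nearly coincide within each calibration bin):

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import (average_precision_score, brier_score_loss,
                             roc_auc_score)

# Synthetic, perfectly calibrated predictions: each outcome is drawn with
# exactly its predicted probability.
rng = np.random.default_rng(1)
y_prob = rng.uniform(0, 1, size=20_000)
y_true = rng.binomial(1, y_prob)

auc = roc_auc_score(y_true, y_prob)           # discrimination across cutoffs
bsl = brier_score_loss(y_true, y_prob)        # mean squared probability error
ap = average_precision_score(y_true, y_prob)  # area under precision-recall

# Calibration curve: observed event rate vs. mean predicted probability
# per bin (i.e., proposed probability vs. actual probability).
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
print(f"AUC {auc:.2f}, BSL {bsl:.2f}, AP {ap:.2f}")
print(np.abs(frac_pos - mean_pred).max())  # small for a calibrated model
```

Because discrimination (AUC) and calibration (BSL, calibration curve) capture different failure modes, reporting them together, as done here, guards against a model that ranks patients well but proposes misleading probabilities, or vice versa.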
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
The data that support the findings of this study are available in a deidentified form from Cleveland Clinic, but restrictions apply to the availability of these data, which were used under Cleveland Clinic data policies for the current study, and so are not publicly available.
We used only free and open-source software. The software packages used are described in the “Methods” section.
Auerbach, A. D., Neinstein, A. & Khanna, R. Balancing innovation and safety when integrating digital tools into health care. Ann. Intern. Med. 168, 733–734 (2018).
Cabitza, F., Rasoini, R. & Gensini, G. F. Unintended consequences of machine learning in medicine. JAMA 318, 517 (2017).
Sniderman, A. D., D’Agostino, R. B. Sr & Pencina, M. J. The role of physicians in the era of predictive analytics. JAMA 314, 25–26 (2015).
Wadhera, R. K. et al. Association of the Hospital Readmissions Reduction Program with mortality among Medicare beneficiaries hospitalized for heart failure, acute myocardial infarction, and pneumonia. JAMA 320, 2542–2552 (2018).
Bojarski, M. et al. End to end learning for self-driving cars. Preprint at https://arxiv.org/abs/1604.07316 (2016).
Bobadilla, J., Ortega, F., Hernando, A. & Gutiérrez, A. Recommender systems survey. Knowledge-Based Syst. 46, 109–132 (2013).
Silver, D. et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362, 1140–1144 (2018).
Gulshan, V. et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 316, 2402 (2016).
Coudray, N. et al. Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning. Nat. Med. 24, 1559–1567 (2018).
Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017).
Rajkomar, A. et al. Scalable and accurate deep learning with electronic health records. NPJ Digital Med. 1, 18 (2018).
Artetxe, A., Beristain, A. & Grana, M. Predictive models for hospital readmission risk: a systematic review of methods. Comput. Methods Prog. Biomed. 164, 49–64 (2018).
Steyerberg, E. W. et al. Assessing the performance of prediction models: a framework for some traditional and novel measures. Epidemiology 21, 128 (2010).
Donzé, J., Aujesky, D., Williams, D. & Schnipper, J. L. Potentially avoidable 30-day hospital readmissions in medical patients: derivation and validation of a prediction model. JAMA Intern. Med. 173, 632–638 (2013).
Leppin, A. L. et al. Preventing 30-day hospital readmissions: a systematic review and meta-analysis of randomized trials. JAMA Intern. Med. 174, 1095–1107 (2014).
Burke, R. E. et al. The HOSPITAL score predicts potentially preventable 30-day readmissions in conditions targeted by the hospital readmissions reduction program. Med. Care 55, 285 (2017).
Auerbach, A. D. et al. Preventability and causes of readmissions in a national cohort of general medicine patients. JAMA Intern. Med. 176, 484–493 (2016).
Saunders, N. D. et al. Examination of unplanned 30-day readmissions to a comprehensive cancer hospital. J. Oncol. Pract. 11, e177–e181 (2015).
Agniel, D., Kohane, I. S. & Weber, G. M. Biases in electronic health record data due to processes within the healthcare system: retrospective observational study. BMJ 361, k1479 (2018).
Aubert, C. E. et al. Simplification of the HOSPITAL score for predicting 30-day readmissions. BMJ Qual. Saf. 26, 799–805 (2017).
Garrison, G. M., Robelia, P. M., Pecina, J. L. & Dawson, N. L. Comparing performance of 30-day readmission risk classifiers among hospitalized primary care patients. J. Eval. Clin. Pract. 23, 524–529 (2017).
Sun, C., Shrivastava, A., Singh, S. & Gupta, A. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision 843–852 (IEEE, 2017).
US Census Bureau. American community survey 5-year estimates, https://data.census.gov/cedsci/table?q=United%20States&tid=ACSDP5Y2015.DP05 (2015).
Natekin, A. & Knoll, A. Gradient boosting machines, a tutorial. Front. Neurorobotics 7, 21 (2013).
Ke, G. et al. LightGBM: a highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems 3146–3154 (Neural Information Processing Systems Foundation, Inc., 2017).
Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer Science & Business Media, 2009).
Zhang, T. & Yu, B. Boosting with early stopping: convergence and consistency. Ann. Stat. 33, 1538–1579 (2005).
Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30 (eds Guyon, I. et al.) 4765–4774 (Curran Associates, Inc., 2017).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
The authors wish to acknowledge the Cleveland Clinic for providing support and funding for this project.
The authors declare no competing interests.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Hilton, C.B., Milinovich, A., Felix, C. et al. Personalized predictions of patient outcomes during and after hospitalization using artificial intelligence. npj Digit. Med. 3, 51 (2020). https://doi.org/10.1038/s41746-020-0249-z