Introduction

Patients and providers face considerable uncertainty before, during, and after hospital encounters. Predictive modeling holds promise for identifying patients at the highest risk for adverse events, such as extended length of stay (LOS), 30-day readmission, and in-hospital death. Despite the success of predictive models in achieving discriminatory power in these and other areas, simplistic models cannot account for the complicated intersections of medical, institutional, and demographic factors. Conversely, complex models that do account for these interactions are difficult or impossible to interpret or audit, and therefore may be inactionable or harmful if put into use; they can also be difficult for healthcare providers to understand or accept1,2,3. Recent studies suggest that a focus on metrics such as 30-day readmission without addressing underlying causes may increase patient mortality and cost without improving outcomes4.

Significant recent advances in artificial intelligence (AI), machine learning (ML), and deep learning (DL) have yielded compelling innovations including self-driving cars5, product recommendations6, and superhuman performance in complex games, such as chess and Go7. These advances have also started to impact healthcare, particularly in visual tasks: detecting diabetic retinopathy in ophthalmology images8, cancers in biopsy slides9, and malignant versus benign skin lesions, often with accuracy comparable to or exceeding that of trained physicians10. As electronic healthcare record (EHR) data increase in size and complexity, AI and ML may provide predictive modeling tools that improve patient safety and outcomes while decreasing cost. A major hurdle for ML in healthcare is the “black box phenomenon,” or lack of explainability, to patients and healthcare providers. However, recent advances have provided algorithms that reliably extract important variables and explain model decisions, allowing for auditing and exploration. Such approaches can ensure that the variables included in the final model are clinically relevant and recognizable, and they may lead to new insights and hypotheses. Most importantly, explainable ML supports clinician and patient decision-making rather than supplanting it, by making explicit the nature and character of the variables the algorithm considered most important when making its predictions.

In this study, we hypothesized that interpretable predictive models would achieve comparable or superior performance to existing models and enable an understanding of factors associated with adverse outcomes. Here we report ML models with high predictive power for readmission and extended LOS, along with patient-level and cohort-level interpretations, and discuss the use of ML as a tool to aid understanding.

Results

Study cohort

In the study period, there were 1,485,880 hospitalizations for 708,089 unique patients, 439,696 (62%) of whom had only 1 hospitalization recorded. The median number of hospitalizations per patient was 1 (interquartile range (IQR) [1.0, 2.0]). There were 211,022 30-day readmissions, for an overall readmission rate of 14%. Among patients aged ≥65 years, the 30-day readmission rate was 16%. The median LOS was 2.94 days (IQR [1.67, 5.34]) including patients in observation status and labor and delivery patients, or 3.71 days (IQR [2.15, 6.51]) with these patients excluded. The demographic and clinical characteristics of the patient cohort are summarized in Table 1. Higher rates of 30-day readmission were observed in patients who were older (median age 62 vs. 59 years), African American (17% vs. 13% in white patients), divorced/separated or widowed (17% vs. 13% in married/partnered or single patients), or on Medicare insurance (17% vs. 10% for private insurance), and in those with one or more chronic conditions such as cancer, renal disease, congestive heart failure, and chronic obstructive pulmonary disease (Table 1).

Table 1 Characteristics of hospital encounters in the study sample, overall and according to readmission and extended length of stay.

Prediction of inpatient outcomes

Thirty-day readmissions were predicted with an area under the receiver operator characteristic curve (ROC AUC, hereafter simply “AUC”) of 0.76 (Supplementary Fig. 1a). The Brier score loss (BSL) was 0.11; the calibration curve is shown in Supplementary Fig. 1b. Average precision was 0.38 (see Supplementary Fig. 2c). Other off-the-shelf ML models were trained on the same task, with performance generally inferior to the Gradient Boosting Machine (GBM) or, in the case of a deep neural network, similar to it (see Supplementary Fig. 2 and Supplementary Table 1). When trained and evaluated on a smaller cohort of 300,000 hospitalizations, performance metrics were similar: AUC 0.75, BSL 0.11. The most impactful features included (ranked from most to least important): primary diagnosis, days between the current admission and the previous discharge, number of past admissions, LOS, total emergency department visits in the past 6 months, number of reported comorbidities, admission source, discharge disposition, and Body Mass Index (BMI) on admission and discharge, among others (Fig. 1a, b; see also Supplementary Fig. 3). Including more than the top ten variables in the model did not improve predictive power for the cohort overall, but it did allow more specific rationales for prediction in certain patients, as well as examination of feature interactions for further exploration. Sample individualized predictions with their explanations are shown in Fig. 1c, d, and further examples are shown in Supplementary Fig. 4. The examples in Supplementary Fig. 4 show patients with comparable predicted probabilities but different compositions of features leading to these predictions.
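
For readers who wish to reproduce this setup, a minimal sketch follows, assuming a LightGBM-style GBM (one plausible implementation; the study's exact configuration is not shown here) and hypothetical column names standing in for the features listed above:

    import lightgbm as lgb
    import pandas as pd
    from sklearn.metrics import brier_score_loss, roc_auc_score
    from sklearn.model_selection import train_test_split

    # One row per hospitalization; all column names below are illustrative
    df = pd.read_csv("hospitalizations.csv")
    features = ["primary_diagnosis", "days_since_last_discharge",
                "n_past_admissions", "length_of_stay", "ed_visits_6mo",
                "n_comorbidities", "admission_source", "discharge_disposition",
                "bmi_admission", "bmi_discharge"]
    X = df[features].copy()
    for col in ("primary_diagnosis", "admission_source", "discharge_disposition"):
        X[col] = X[col].astype("category")   # trees split on categoricals directly
    y = df["readmitted_30d"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.1, stratify=y, random_state=0)
    model = lgb.LGBMClassifier(n_estimators=500).fit(X_train, y_train)
    p = model.predict_proba(X_test)[:, 1]    # NaNs are handled natively, no imputation
    print(roc_auc_score(y_test, p), brier_score_loss(y_test, p))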

Fig. 1: 30-Day readmission.

a Shows the most impactful features on prediction (ranked from most to least important). b Shows the distribution of the impacts of each feature on the model output. The colors represent the feature values for numeric features: red for larger values and blue for smaller. The line is made of individual dots representing each admission, and the thickness of the line is determined by the number of examples at a given value (for example, most patients have a low number of past admissions). A negative SHAP value (extending to the left) indicates a reduced probability, while a positive one (extending to the right) indicates an increased probability. For non-numeric features, such as primary diagnosis, the gray points represent specific possible values, with certain diagnoses greatly increasing or reducing the model’s output, while the majority of diagnoses have relatively mild impact on prediction. c, d Show the composition of individualized predictions for two patients. The patient in c was admitted from the emergency outpatient unit with a headache and stayed for >7 days. In addition, this patient had been hospitalized 3 times prior to this admission and had been discharged from the last admission only 8 days prior. The predicted probability of 30-day readmission (~0.30) was three times the baseline value predicted by the model (~0.1). All of the listed features increased the model’s prediction of risk by the relative amounts shown by the size of the red bars. Conversely, the patient in d was admitted for a complete uterovaginal prolapse, stayed less than a full day, and had no reported comorbidities, such as hypertension, depression, or a history of cancer. The model predicted their probability of 30-day readmission at 0.03, or roughly one-third of the baseline prediction. The top contributing variables that fit on the chart are shown; the others can be queried in the live system. The model considers all variables, and SHAP internally attributes the prediction across all of them, but the plots are truncated for legibility.

In order to examine possible changes in causes of readmission risk as a function of time from discharge, we predicted readmission risk at several readmission thresholds and calculated SHAP (SHapley Additive exPlanations) values for each. SHAP values for 3- and 7-day readmission are shown in Supplementary Fig. 5a, b, respectively. For example, 7-day readmission risk prediction achieved an AUC of 0.70 with a BSL of 0.05 (Table 2). The most impactful feature remained primary diagnosis, but other features played more important roles: BlockGroup rose from ninth to second most important variable, the number of emergency department visits in the past 6 months rose from fourth to third, admission blood counts increased in importance, and insurance provider rose from twelfth to eighth. BMI on admission fell several places, and BMI on discharge no longer appeared among the top variables. The BMI variables are unique in that missing values, in addition to extreme values, tend to be important, perhaps correlating with disease burden and/or hospital practices that could be further investigated.
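
The multi-horizon analysis amounts to relabeling the same cohort at each threshold. A hedged sketch, reusing the hypothetical df/X_train/X_test from the previous example and assuming a days_to_readmission field (NaN where no readmission occurred):

    import lightgbm as lgb
    import shap

    for horizon in (3, 7, 30):
        # NaN (never readmitted) compares False, i.e., labeled negative
        y_h = (df["days_to_readmission"] <= horizon).astype(int)
        m = lgb.LGBMClassifier(n_estimators=500).fit(X_train, y_h.loc[X_train.index])
        sv = shap.TreeExplainer(m).shap_values(X_test)
        sv = sv[1] if isinstance(sv, list) else sv  # some SHAP versions return per-class arrays
        shap.summary_plot(sv, X_test)               # per-horizon analog of Supplementary Fig. 5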

Table 2 Performance of predictive models.

LOS was predicted both as a number of days and as a binary outcome at various thresholds. LOS in days was predicted poorly, with a root mean square error (RMSE) of 3.97 days (median LOS 2.94–3.71 days). LOS over 5 days was predicted with an AUC of 0.84 (Fig. 2a) and a BSL of 0.15 (calibration curve shown in Supplementary Fig. 1d). Average precision was 0.70 (see Supplementary Fig. 2d). When trained and evaluated on a cohort of 300,000 patients, performance was similar: AUC 0.81 and BSL 0.17. Other ML models, including a deep neural network, were trained on the same task, with performance generally inferior to the GBM (see Supplementary Fig. 2 and Supplementary Table 1). The most impactful features included the type of admission, primary diagnosis code, patient age, admission source, LOS of the most recent prior admission, medications administered in the hospital in the first 24 h, insurance, and early admission to the intensive care unit, among others shown in Fig. 2c, d. Impactful features for LOS at thresholds of 3 and 7 days are shown in Supplementary Fig. 5c, d, respectively. The AUC at these thresholds did not differ from that at 5 days (Table 2). Given that primary diagnosis is often assigned late in the hospital encounter or even after discharge, we trained the LOS models with and without this feature for comparison. Results are shown in Supplementary Table 1d. Overall, predictive performance decreased, as expected: AUC for LOS > 5 days was 0.781, BSL was 0.173, and average precision was 0.640.
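
Both LOS formulations follow the same pattern; a brief sketch under the same assumptions as above, with los_train and los_test as hypothetical LOS-in-days targets:

    import lightgbm as lgb
    import numpy as np
    from sklearn.metrics import mean_squared_error, roc_auc_score

    # Regression on LOS in days, scored by RMSE
    reg = lgb.LGBMRegressor(n_estimators=500).fit(X_train, los_train)
    rmse = np.sqrt(mean_squared_error(los_test, reg.predict(X_test)))

    # Classification of LOS > 5 days, scored by AUC
    clf = lgb.LGBMClassifier(n_estimators=500).fit(X_train, los_train > 5)
    auc = roc_auc_score(los_test > 5, clf.predict_proba(X_test)[:, 1])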

Fig. 2: Length of stay >5 days.

a Shows the most impactful features on prediction (ranked from most to least important). b Shows the distribution of the impacts of each feature on the model output. The colors represent the feature values for numeric features: red for larger values and blue for smaller. The line is made of individual dots representing each admission, and the thickness of the line is determined by the number of examples at a given value (for example, many of our patients are elderly). A negative SHAP value (extending to the left) indicates a reduced probability, while a positive one (extending to the right) indicates an increased probability. For example, advanced age increases the probability of extended length of stay (SHAP value between zero and one), while young age tends toward a SHAP value between roughly −1 and zero, corresponding to reduced probability. For non-numeric features, such as primary diagnosis, the gray points represent specific possible values, with certain diagnoses greatly increasing or reducing the model’s output, while the majority of diagnoses have relatively mild impact on prediction. c, d Show the composition of individualized predictions for two patients. The 75-year-old patient in c was admitted to the inpatient service directly from a physician’s office with leakage of a heart valve graft. The patient received 32 medications in the first 24 h and had Medicare Part A insurance coverage. The model predicted that the patient’s probability of staying >5 days was 0.80, nearly four times the baseline prediction of ~0.2. The majority of the model’s prediction was based on the diagnosis, followed by the number of initial medications, and then the other variables as shown. The patient in d, on the other hand, had a predicted probability of extended length of stay of 0.06, or roughly one-fourth of the baseline, despite being admitted to the ICU within 24 h of admission. The major contributor to this low probability was the diagnosis of antidepressant poisoning, followed by a private insurance provider, and finally by a lack of BMI recorded in the chart for this encounter. The reason a missing BMI value is informative is unclear, but the pattern appears repeatedly across several analyses and may relate to systematic recording practices within the hospital system (see Agniel et al.19 for an exploration of this phenomenon).

Death within 48–72 h of admission was predicted with an AUC of 0.91 and a BSL of 0.001 (Table 2). However, owing to extreme class imbalance (e.g., in the testing set there were 260,518 non-deaths and 390 deaths), these figures were achieved by predicting non-death in every case. Strategies to produce a reliable model by addressing the class imbalance, such as data oversampling, were unsuccessful. AUC and BSL therefore do not reliably indicate model performance and applicability in this clinical setting.
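
The misleading effect of this imbalance on the BSL is easy to reproduce: a constant prediction at the base rate scores near zero while flagging no deaths at all. Using the test-set counts above:

    import numpy as np
    from sklearn.metrics import brier_score_loss

    y = np.zeros(260_518 + 390, dtype=int)
    y[:390] = 1                              # 390 deaths among 260,908 test encounters
    p = np.full(y.shape, 390 / len(y))       # constant prediction at the base rate
    print(brier_score_loss(y, p))            # ~0.0015, despite identifying no one at risk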

Variable interactions

SHAP analysis also allows examination of interactions between variables. Key variable interactions are shown in Supplementary Figs 6 and 7. For example, high and low values of heart rate were shown to affect the probability of readmission differently for patients of different ages. In older patients, there is a clearer, though modest, trend: lower heart rates on discharge contribute to lower readmission risk and higher heart rates to higher risk (SHAP values roughly −0.1 to +0.2). In younger patients, higher discharge heart rates are observed overall, and the positive trend is more modest. This may highlight the importance of considering a variable such as heart rate in a more complete clinical setting, such as one that includes patient age and clinical reasoning (e.g., an adult is unlikely to be discharged with marked tachycardia) (Supplementary Fig. 6c). A similar finding is observed in Supplementary Fig. 7c for LOS prediction, though clinical reasoning is less likely to play a role compared with more purely physiologic phenomena: higher heart rates overall are observed in pediatric patients, and the relationship between heart rate and LOS is less linear for them (high and low SHAP values are observed more uniformly at given levels of tachycardia in pediatric patients).
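
These interaction analyses rely on SHAP's interaction values and dependence plots. The sketch below is illustrative only, with hypothetical discharge_heart_rate and age feature names and the readmission model from the earlier sketches:

    import shap

    explainer = shap.TreeExplainer(model)
    inter = explainer.shap_interaction_values(X_test)  # (n_samples, n_features, n_features)
    sv = explainer.shap_values(X_test)
    sv = sv[1] if isinstance(sv, list) else sv         # per-class output in some versions
    # Color heart-rate effects by age, as in Supplementary Figs 6c and 7c
    shap.dependence_plot("discharge_heart_rate", sv, X_test, interaction_index="age")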

Discussion

Our investigation of ML methods for predicting and explaining inpatient outcomes was initiated as a result of the increased focus on the costs and risks of inpatient stays in the United States and other countries, the availability of complex data in the EHR, and the development of explainable predictive models. In addition, recent concerns over the impact of metrics such as readmission rates4 present an opportunity to develop models that may be used not only to predict but also to understand the components of risk and their interactions. We therefore sought to predict and understand current and future readmissions and the LOS during hospitalization.

Our models achieved performance comparable to the existing state of the art in the prediction of readmission and LOS while remaining more explainable11,12. By using a model that accounts for non-linear interactions, we can flexibly predict outcomes across a large number of patients with many diagnoses and comorbidities. In addition to reporting AUC, which assesses performance across classification cutoffs, we show that our models are well calibrated when using raw probabilities, which may be more useful than binary classifications in many settings13. The most important components of the probability prediction for each patient can be examined, which would ideally lead to items that can be further studied, perhaps leading to quality improvement efforts (e.g., patients with a high number of emergency department visits contributing significantly to their risk of readmission may be targeted for hotspotting efforts rather than the usual scheduled in-office follow-up)14,15,16,17 or at least to a deeper understanding of the current situation (e.g., a given diagnosis or necessary therapeutic agent may be associated with a higher risk of readmission or another adverse outcome, but these features are not likely modifiable)18. We also generate cohort-level diagrams that explain the contributions of each variable to the model output, as well as key variable interactions.

Because of the focus on interpretability, the study was designed to cast a broad net with regard to inclusion criteria. Rather than including only CMS (Centers for Medicare and Medicaid Services)-defined readmissions, we chose to include all patients who survived the index hospitalization, including those in observation status. We also included all available diagnoses and ranges of demographic categories, including age. This allowed us to examine the impacts of these variables, as well as to develop a broadly applicable model for the institution as a whole, which comprises many specialties, hospitals, and a range of socioeconomic environs. Using diverse data also allowed us to find interactions, such as the varying impacts of heart rate and number of administered medications on readmission risk across the range of ages. We also found, as have others19, that the presence or missingness of data within the EHR can be informative on its own, as in the case of BMI measurement in Fig. 2c, d.

Our study is additionally unique in balancing a relatively simple model architecture and hand-selected variables with a robust and generalizable explanatory method. Rajkomar et al. achieved comparable results using a DL model trained on nearly 47 billion data points spread over ~215,000 patients, acquired with an automated data collection method11. Their explanatory method highlighted areas of the medical record that were most important for prediction but used restricted and less performant versions of their models, retrained on a single data type (text, laboratory results, etc.). Our approach is a direct interpretation of the full predictive algorithm and also explains the impact of variables across the range of possible values, rather than simply highlighting which variables were important. More highly tuned DL or other, less complex approaches might achieve similar or superior predictive power, but likely at the expense of either interpretability or richness20,21,22. It is also important to note that our approach and that of Rajkomar et al. are not directly comparable, given the heavily specialized algorithms and explanatory methods used in their work, with a different cohort, a different data format, and a different breadth of variables considered. We used off-the-shelf algorithms that are free and open source and do not require advanced computational power, and that may therefore be more accessible in less resource-rich settings. One of Rajkomar et al.'s key contributions was the use of an interoperable, rich, dynamic data format, and hence their approach has an increased focus on the data pipeline proper, whereas ours is a simpler database query with a modest amount of feature engineering. However, we share the goal of predicting adverse outcomes with a high degree of explainability that targets decision support and hypothesis generation, rather than automated decision-making. Further, given the comparable performance metrics achieved by our approach and others in similar cohorts, it may be that the inherent complexity of readmissions and long LOS imposes a natural upper limit on predictive power, encouraging a further focus on interpretability.

The study has several limitations. First, we selected only variables available at the beginning and end of the hospitalization. Second, because we only used data available in our EHR, we could only assess readmissions to our own hospital system. We therefore did not capture the total readmission rate, nor could we account for admissions to our system that were readmissions from another system. Third, this was a retrospective study based on data from a single health system. It therefore requires external validation, though the most important variables impacting each outcome have also been described as important prognostic factors in prior reports, which suggests that our model could be applicable in other systems. Fourth, primary diagnosis code was used as a predictor. This is typically not available until some time after the encounter has ended and financial teams have processed the hospitalization, and so it would not be available for either LOS or readmission predictions in a live system. We are exploring ways to dynamically assign a primary diagnosis within an encounter for our in-house implementations of the model, such as ranking the electronic medical record problem list according to surrogate markers of severity. Finally, as with all ML seeking to explore causal relationships, this is a hypothesis-generating work, in need of rigorous validation, independent studies of promising components, and, ultimately, patient and clinician judgment regarding application. We hope that an emphasis on intelligence augmentation, decision support, and explainability will lead to a more nuanced and skilled adoption of ML as yet another tool in a holistic approach to patient care and research.

In conclusion, we generated prediction models that reliably predict the probability of readmission and LOS, which are explainable on the patient level and cohort level. We propose the use of this approach as an auditable decision aid that also contributes to hypothesis generation.

Methods

Data collection

Hospitalizations with a discharge date from January 2011 to May 2018 were extracted from the Cleveland Clinic (CC) EHR. Clinical, demographic, and institutional features were extracted using natural language processing and parsing of structured data available within the EHR (see Supplementary Table 2). Data available at the time of hospitalization (i.e., within roughly 24 h of encounter creation) and discharge were marked as such and used as appropriate to the predictive task. Publicly available American Community Survey census information was retrieved for each patient’s census block group (BlockGroup), which is based on home address and reports aggregate sociodemographic data for a small geographic region23. This study was approved by the CC Institutional Review Board with a waiver of individual informed consent due to the retrospective nature of the study and conducted in accordance with the Declaration of Helsinki.

The cohort of hospitalized patients was split into three groups for analysis: 80% for model development, 10% for testing, and 10% for validation. Selection of hospitalizations for inclusion in each group was random, except that the rate of the positive class (30-day readmission, LOS over 5 days, etc.) was held consistent between sets.
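
With scikit-learn, such a split can be written as two stratified draws; a minimal sketch assuming a feature matrix X and binary label y:

    from sklearn.model_selection import train_test_split

    # First carve off the 10% test set, then 1/9 of the remainder (10% overall)
    X_dev, X_test, y_dev, y_test = train_test_split(
        X, y, test_size=0.10, stratify=y, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(
        X_dev, y_dev, test_size=1 / 9, stratify=y_dev, random_state=0)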

Predictive modeling

GBM algorithms were used to produce the predictive models. GBMs are nonparametric methods that train many decision trees in succession, with each new tree fit to correct the errors of the preceding ensemble24. GBMs achieve state-of-the-art performance relative to other ML methods, especially on structured data25. They also allow for the inclusion of many types of variables and can explicitly account for missing data, and thus do not require imputation of missing values. More information regarding the GBM algorithm is available in Supplementary Materials. To reduce model overfitting, we employed a standard train/test/validation split and early stopping at 200 iterations26,27. For comparison, we also trained a deep neural network, logistic regression, and several other ML algorithms on the same data, applying standard imputation and scaling techniques. We performed ten-fold ten-repeat cross-validation to generate confidence intervals. Given that primary diagnosis is often not assigned until after the hospital encounter, we trained the LOS models with and without this feature for comparison. Finally, we trained our final model on a smaller subset of 300,000 hospitalizations to examine the effect of training data size on model performance.
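
A hedged sketch of this training procedure, assuming LightGBM as one plausible GBM implementation (its pre-4.0 API is shown; newer versions pass callbacks=[lgb.early_stopping(200)] instead) and scikit-learn's repeated cross-validation:

    import lightgbm as lgb
    from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

    model = lgb.LGBMClassifier(n_estimators=5000)
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)],
              early_stopping_rounds=200)  # stop when validation loss stalls for 200 rounds

    # Ten-fold, ten-repeat cross-validation for confidence intervals
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
    scores = cross_val_score(lgb.LGBMClassifier(n_estimators=500), X, y,
                             cv=cv, scoring="roc_auc")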

Model interpretation

To extract important variables that impacted the algorithm and to ensure the appropriateness of the final models, cohort and personalized model predictions were interpreted using SHAP values28. SHAP values, based on the Shapley value from coalitional game theory, are consistent and accurate calculations of the contribution of each feature to any ML model's prediction. They are additionally able to account for feature interactions, including situations where a given value may either increase or decrease risk (for example, a child with a heart rate of 130 vs. a geriatric patient with the same heart rate). SHAP values also overcome limitations inherent to the standard variable importance information available in tree-based models, which yields only an ordering of the variables by their overall impact on predictions. In contrast, SHAP shows the impact of variables across the range of their values and their interactions with each other, and it allows for case-specific (here, patient-specific) explanations as well as cohort-level exploration. More details regarding the SHAP package are summarized in Supplementary Materials.
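
In code, the interpretation step is brief. A sketch with the SHAP package, producing a cohort-level summary and a single-patient explanation analogous to Figs 1 and 2 (model and X_test as in the earlier sketches):

    import numpy as np
    import shap

    explainer = shap.TreeExplainer(model)
    sv = explainer.shap_values(X_test)
    sv = sv[1] if isinstance(sv, list) else sv     # per-class output in some SHAP versions
    base = np.ravel(explainer.expected_value)[-1]  # baseline (expected) model output

    shap.summary_plot(sv, X_test)                  # cohort level, as in Figs 1a, b and 2a, b
    shap.force_plot(base, sv[0], X_test.iloc[0])   # one patient, as in Figs 1c, d and 2c, d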

Statistical analysis

Descriptive statistics were used to summarize the patient cohort overall and in each subgroup. Model performance was assessed with metrics appropriate to the prediction endpoint. For binary outcomes, the BSL, AUC, and area under the precision-recall curve (average precision) were calculated. We also produced appropriate figures for these metrics, including calibration curves, which assess the quality of a model's proposed probabilities by comparing them with the percentage of patients at each proposed probability who actually had the outcome of interest (i.e., proposed probability vs. actual probability). Numeric outcomes, including LOS in days and days until readmission, were evaluated with RMSE. All analyses were performed with scikit-learn v0.20.3 (ref. 29) and Python v3.6.6. More details regarding the statistical methods are summarized in Supplementary Materials.
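
These metrics map directly onto scikit-learn calls; in the sketch below, y and p are hypothetical test-set labels and predicted probabilities, and los_true/los_pred are LOS values in days:

    import numpy as np
    from sklearn.calibration import calibration_curve
    from sklearn.metrics import (average_precision_score, brier_score_loss,
                                 mean_squared_error, roc_auc_score)

    auc = roc_auc_score(y, p)
    bsl = brier_score_loss(y, p)
    ap = average_precision_score(y, p)                     # area under precision-recall curve
    actual, proposed = calibration_curve(y, p, n_bins=10)  # actual vs. proposed probability
    rmse = np.sqrt(mean_squared_error(los_true, los_pred))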

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.