Predicting acute clinical deterioration with interpretable machine learning to support emergency care decision making

The emergency department (ED) is a fast-paced environment responsible for large volumes of patients with varied disease acuity. Operational pressures on EDs are increasing, which creates the imperative to efficiently identify patients at imminent risk of acute deterioration. The aim of this study is to systematically compare the performance of machine learning algorithms based on logistic regression, gradient boosted decision trees, and support vector machines for predicting imminent clinical deterioration for patients based on cross-sectional patient data extracted from electronic patient records (EPR) at the point of entry to the hospital. We apply state-of-the-art machine learning methods to predict early patient deterioration, based on their first recorded vital signs, observations, laboratory results, and other predictors documented in the EPR. Clinical deterioration in this study is measured by in-hospital mortality and/or admission to critical care. We build on prior work by incorporating interpretable machine learning and fairness-aware modelling, and use a dataset comprising 118,886 unplanned admissions to Salford Royal Hospital, UK, to systematically compare model variations for predicting mortality and critical care utilisation within 24 hours of admission. We compare model performance to the National Early Warning Score 2 (NEWS2) and yield up to a 0.366 increase in average precision, up to a 21.16% reduction in daily alert rate, and a median 0.599 reduction in differential bias amplification across the protected demographics of age and sex.
We use SHapley Additive exPlanations (SHAP) to justify the models' outputs, verify that the captured data associations align with domain knowledge, and pair predictions with the causal context of each patient's most influential characteristics. Introducing our modelling to clinical practice has the potential to reduce alert fatigue and to identify high-risk patients with a lower NEWS2 who might currently be missed, but further work is needed to trial the models in clinical practice. We encourage future research to follow a systematised approach to data-driven risk modelling to obtain clinically applicable support tools.

Since the choice of scale is determined by the responsible clinical staff on a case-by-case basis, we use the following criteria to infer which scale to use when re-computing the NEWS2 sub-score for SpO2:
- Patients receiving oxygen using NIV, as recorded directly or in their set of coded procedures (OPCS-4 E85.2).
Finally, if the patient is receiving supplemental oxygen, there is ambiguity as to whether a high NEWS2 oxygen sub-score indicates very high or very low saturation. In that case, we mark the value as missing. Patient records pre-dating the introduction of NEWS2 use the original NEWS scale (SpO2 Scale 1) for their O2 saturation score.
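For concreteness, the Scale 2 sub-scoring that produces this ambiguity can be sketched as follows. This is a minimal illustration, not code from the paper: the band boundaries are taken from the published RCP NEWS2 chart rather than from the study itself, and the function name is hypothetical. Note how both very low and very high saturations on supplemental oxygen score 3, so the sub-score alone cannot be inverted to a saturation value.

```python
def spo2_subscore_scale2(spo2: int, on_oxygen: bool) -> int:
    """NEWS2 SpO2 sub-score on Scale 2 (hypercapnic respiratory failure).

    Band boundaries follow the published RCP NEWS2 chart (an assumption,
    not taken from the paper). On supplemental oxygen, saturations <= 83%
    and >= 97% both score 3, which is the ambiguity described above.
    """
    if spo2 <= 83:
        return 3
    if spo2 <= 85:
        return 2
    if spo2 <= 87:
        return 1
    # 88-92% scores 0 on Scale 2; above 92% on air also scores 0.
    if spo2 <= 92 or not on_oxygen:
        return 0
    if spo2 <= 94:
        return 1
    if spo2 <= 96:
        return 2
    return 3  # >= 97% while on supplemental oxygen
```

Because `spo2_subscore_scale2(80, True)` and `spo2_subscore_scale2(98, True)` both return 3, a recorded sub-score of 3 for a patient on oxygen is uninvertible, matching the decision above to mark such values as missing.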
• Respiration Rate. In addition to the range given in Table 1, we assume triple-digit values to be erroneous entries of two-digit values (e.g., 250 → 25.0).
• Oxygen Flow Rate. This supplemental parameter is recorded in mixed units (Litres/min or FiO2). We translate all values to FiO2 where possible:
- Values 1–15 are inferred to be in Litres/min.
- Decimal values are inferred to be FiO2, with the exception of 0.5.
- Values of 0.5, and any remaining values, are resolved based on the device used to deliver the oxygen: nasal cannula and simple mask correspond to Litres/min, while other devices correspond to FiO2.

[Figure caption: each feature set (Table 4) is used as training data. For each one, and for each x-axis value, we train independent models to identify critical deterioration up to the corresponding number of days after admission and measure their AUROC on the validation set (y-axis). The coloured lines represent the performance for the indicated feature set, and the gray lines reproduce the lines from the other sections for easier visual comparison.]

Table 8. Differential Fairness Bias Amplification (95% bootstrapped confidence interval) of each classifier type trained on each feature set. The columns "Sex", "Age", and "Sex & Age" indicate the protected characteristic for each measurement: biological sex, age group (per Figure 1), or both.

Table 9. Summary of model performance compared to NEWS2 across the tested feature sets. Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, F2 score, and number needed to evaluate (NNE) of NEWS2, GBDT (LightGBM), and logistic regression with L2 penalty (LR-L2). We fix the sensitivity of the models at three levels (0.602, 0.396, and 0.220) that match the observed sensitivity of NEWS2 at thresholds 3, 5, and 7, respectively.

[Figure caption: all classifier types trained on the complete feature set. We plot the between-group component of the generalised entropy index, representing unfairness between demographic groups defined by the protected characteristics of age group and sex, per Figure 1. The remainder of the generalised entropy, as presented in Figure 6, is the within-group component, representing all other potential biases. A lower value on the y-axis indicates a fairer distribution of 'benefit', i.e. of receiving a positive prediction, between the demographic groups we consider.]
[Figure caption: a theoretical 'perfect' model would yield a single point (0, 1) in the lower-right corner of the plot.]

TRIPOD reporting checklist (items 7a-22), with the page or table where each item is reported:

Predictors
  7a. Clearly define all predictors used in developing or validating the multivariable prediction model, including how and when they were measured. (14; Table 4)
  7b. Report any actions to blind assessment of predictors for the outcome and other predictors. (N/A)
Sample size
  8. Explain how the study size was arrived at. (12; Supplementary Figure 2)
Missing data
  9. Describe how missing data were handled (e.g., complete-case analysis, single imputation, multiple imputation) with details of any imputation method. (14)
Statistical analysis methods
  10a. Describe how predictors were handled in the analyses. (14)
  10b. Specify type of model, all model-building procedures (including any predictor selection), and method for internal validation. (14)
  10d. Specify all measures used to assess model performance and, if relevant, to compare multiple models. (15-16)
Risk groups
  11. Provide details on how risk groups were created, if done.
Participants
  13a. Describe the flow of participants through the study, including the number of participants with and without the outcome and, if applicable, a summary of the follow-up time. A diagram may be helpful.
  13b. Describe the characteristics of the participants (basic demographics, clinical features, available predictors), including the number of participants with missing data for predictors and outcome.
Model specification
  15a. Present the full prediction model to allow predictions for individuals (i.e., all regression coefficients, and model intercept or baseline survival at a given time point). (Supplementary Table 2)
  15b. Explain how to use the prediction model. (14)
Model performance
  16. Report performance measures (with CIs) for the prediction model.
Limitations
  18. Discuss any limitations of the study (such as nonrepresentative sample, few events per predictor, missing data). (11-12)
Interpretation
  19b. Give an overall interpretation of the results, considering objectives, limitations, and results from similar studies, and other relevant evidence. (8, 11)
Implications
  20. Discuss the potential clinical use of the model and implications for future research. (11)
Other information
  21. Supplementary information: Provide information about the availability of supplementary resources, such as study protocol, Web calculator, and data sets. (16)
  22. Funding: Give the source of funding and the role of the funders for the present study. (16)