DeepSOFA: A Continuous Acuity Score for Critically Ill Patients using Clinically Interpretable Deep Learning

Traditional methods for assessing illness severity and predicting in-hospital mortality among critically ill patients require time-consuming, error-prone calculations using static variable thresholds. These methods do not capitalize on the emerging availability of streaming electronic health record data or capture time-sensitive individual physiological patterns, a critical task in the intensive care unit. We propose a novel acuity score framework (DeepSOFA) that leverages temporal measurements and interpretable deep learning models to assess illness severity at any point during an ICU stay. We compare DeepSOFA with SOFA (Sequential Organ Failure Assessment) baseline models using the same model inputs and find that at any point during an ICU admission, DeepSOFA yields significantly more accurate predictions of in-hospital mortality. A DeepSOFA model developed in a public database and validated in a single institutional cohort had a mean AUC for the entire ICU stay of 0.90 (95% CI 0.90–0.91) compared with baseline SOFA models with mean AUC 0.79 (95% CI 0.79–0.80) and 0.85 (95% CI 0.85–0.86). Deep models are well-suited to identify ICU patients in need of life-saving interventions prior to the occurrence of an unexpected adverse event and inform shared decision-making processes among patients, providers, and families regarding goals of care and optimal resource utilization.

advantages despite the potentially confounding impact of transient and self-limited fluctuations in real-time data 6 , they are only feasible if implemented as an autonomous real-time process.
The availability of temporal trends and high-fidelity physiologic measurements in the ICU offers the opportunity to apply computational approaches beyond existing conventional models [7][8][9] . Our primary aim was to develop an acuity score framework that encompasses the full scope of a patient's physiologic measurements over time to generate dynamic in-hospital mortality predictions. Our solution uses deep learning, a branch of machine learning that encompasses models and architectures that learn optimal features from the data itself, capturing increasingly complex representations of raw data by combining layers of nonlinear data transformations 10,11 . Deep learning models automatically discover latent patterns and form high-level representations from large amounts of raw data without the need for manual feature extraction based on a priori domain knowledge or practitioner intuition, which is time-consuming and error-prone. Deep learning has revolutionized natural language processing, speech recognition, and computer vision, and is gaining momentum within healthcare 12 . Computer vision has been used to identify diabetic retinopathy 13 and recognize skin cancer with accuracy similar to that of a board-certified dermatologist 14 . Deep models have also been used to predict pain responses 15 , the onset of heart failure 16 , and ICU mortality 17 .
Here we report the development and external validation of DeepSOFA, a deep learning model that employs a clinician-interpretable variant of recurrent neural network (RNN) to analyze multivariate temporal clinical data in the ICU. Experiments were performed with two independent hospital populations and were designed to be cross-institutional; we report internal and externally validated results for both hospital cohorts. Cohorts were derived from ICU admissions at the University of Florida Health Hospital and the publicly available Medical Information Mart for Intensive Care (MIMIC-III) dataset that contains records for ICU patients from the Beth Israel Deaconess Medical Center in Boston, Massachusetts 18 . We compared deep learning mortality prediction models trained on hourly measurements with baseline models using traditional SOFA score definitions and the same hourly measurements using the entirety of a patient's data stream over the same time period. Two baseline SOFA models were tested: a Bedside SOFA model using published mortality rates correlating with any given total SOFA score 2 , and a Traditional SOFA model in which hourly SOFA scores are correlated with in-hospital mortality for individual patients 5 . Because deep models automatically learn the complex, nonlinear associations among input variables, we hypothesized that DeepSOFA would yield greater accuracy in predicting in-hospital mortality among ICU patients compared with traditional SOFA techniques.

Results
Development of DeepSOFA model. Two datasets (UFHealth and MIMIC) derived from two distinct cohorts of ICU patients from two academic medical centers, University of Florida Health (Gainesville, FL) and Beth Israel Deaconess Medical Center (Boston, MA), respectively, were used for model development and external cross validation (  p < 0.05 and the Traditional SOFA AUC of 0.64, 95% CI 0.64-0.65, p < 0.05). This advantage remained statistically significant for the remaining duration of ICU admissions. Although all models gained accuracy over time as more input data became available, DeepSOFA accuracy increased at a greater rate during the first 24 hours following ICU admission (Fig. 1A,B).
In addition to using all 14 variables in our primary DeepSOFA model, we also assessed the accuracy of six separate DeepSOFA models using only individual subsets of variables defined in each of the SOFA organ systems (Fig. 1C,D). The individual DeepSOFA component systems most predictive of mortality were central nervous system (Glasgow Coma Scale (GCS) score), respiratory (partial pressure of arterial oxygen, fraction of inspired oxygen, and mechanical ventilation status), and cardiovascular (mean arterial pressure and vasopressor administration), all relying on more frequent time series (GCS assessed every three hours, oxygen saturation and mean arterial blood pressure assessed every minute and averaged per hour).
We also examined model accuracy as a function of the predictive window being further away from the time of death or hospital discharge. As expected, all models achieved maximum AUC in the last hour of the predictive window when using data available from the entire ICU stay in both the UFHealth cohort (DeepSOFA AUC of 0.93, 95% CI 0.93-0.94, p < 0.05 compared to Bedside SOFA AUC of 0.82, 95% CI 0.81-0.83 and p < 0.05 compared to Traditional SOFA AUC of 0.88, 95% CI 0.88-0.89) and the MIMIC cohort (DeepSOFA AUC of 0.93, 95% CI 0.92-0.93, p < 0.05 compared to Bedside SOFA AUC of 0.81, 95% CI 0.80-0.82 and p < 0.05 compared to Traditional SOFA AUC of 0.85, 95% CI 0.84-0.86). Although model performance decreased slightly when prediction occurred over a longer time window, DeepSOFA retained excellent AUC above 0.87, 95% CI 0.87-0.88 in the UFHealth cohort and above 0.83, 95% CI 0.82-0.83 in the MIMIC cohort up to 100 hours away from discharge, regardless of the mortality time point of interest (Fig. 2). These findings were consistent across all development and validation cohorts.

Usability of DeepSOFA.
To demonstrate the feasibility of clinical application, DeepSOFA and Bedside SOFA scores were applied to a single patient encounter from the UFHealth cohort. The patient was a 25-year-old female with cystic fibrosis who was admitted to a Medical ICU following angioembolization of the blood supply to a lung abscess and remained in the ICU for 112 hours prior to death following cardiac arrest. Figure 3A illustrates the predicted probability of death according to DeepSOFA and Bedside SOFA scores. During the second day of ICU admission, despite increased supplemental oxygen requirements and worsening chest pain (Fig. 3D,E), the patient's vital signs remained relatively stable over time (Fig. 3B), and the Bedside SOFA model continued to estimate a low probability of death (<5%). However, predicted mortality according to DeepSOFA increased during these events, and continued to increase significantly as the patient developed increased work of breathing and required procedures to decompress the stomach and place a breathing tube, estimating a 50-80% probability of death, while the Bedside SOFA model continued to estimate a 5% probability of death. The Bedside SOFA score did not reflect clinical decompensation until the time of cardiac arrest. In the final five hours before death, the Bedside SOFA model estimated a 51.5% probability of mortality, while DeepSOFA estimated a 99.6% probability of mortality. Translating our mortality prediction task into a real-time continuous acuity score is possible by examining the predicted probability of death at each hour of a patient's ICU stay (Fig. 4, Supplementary Fig. S4). Given mean mortality probabilities stratified by survival status, the traditional SOFA score tended to underestimate the severity of illness, predicting relatively low chances of death for both survivors (<5%) and non-survivors (20-30%). 
In contrast, DeepSOFA is better equipped to quantify illness severity for non-survivors, estimating mortality probability of 60-90% among non-survivors compared with 20-40% for survivors. DeepSOFA overestimated the probability of death for survivors, but Bedside SOFA underestimated the probability of death for non-survivors by a greater margin.
DeepSOFA Interpretability. DeepSOFA includes added mechanisms designed to improve the human interpretability of mortality predictions (see Supplemental Section Model Details). Our self-attention approach is designed to highlight particular time steps of the input time series that the model believes to be most important in formulating its final mortality prediction. Since DeepSOFA is focused on real-time prediction, at each new hour after ICU admission, the model learns to distribute its internal "attention" in such a way to assign more weight to time steps it deems more influential for overall prediction. Self-attention can be visualized as a two-dimensional matrix. At each time step after ICU admission (columns), the model assigns weights to all preceding time steps (rows) in such a way that the column weights sum to 1. Figure 5 shows examples of self-attention matrices for one survivor and one non-survivor, along with raw time series aligned by hours after ICU admission. For the example survivor, the model focused on what occurred five hours after ICU admission and continued to focus on that hour for the remaining seven hours of the encounter. By consulting the raw time series, it appears that a clinically significant decrease in creatinine and clinically significant increases in urine output and GCS contributed to DeepSOFA's overall survival prediction. Figure 3C features a modified version of self-attention, where we visualize only the diagonal of the two-dimensional matrix. This answers the question, "how important was each time step of data at the moment it was received by the model?" For the example non-survivor, we see several attention updates in the beginning and end of the ICU stay, with changes corresponding to salient changes in the clinical time series.
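The structure of such an attention matrix can be illustrated with a short sketch. This is not the DeepSOFA implementation: the raw relevance scores below are random placeholders (the actual scoring function is specified in the supplement), and the function name `causal_attention_weights` is hypothetical. The sketch only shows how weights restricted to preceding hours can be normalized so that each column sums to 1, and how the diagonal view can be read off.

```python
import numpy as np

def causal_attention_weights(scores):
    """Turn raw relevance scores into column-normalized attention weights.

    scores[i, t] is a hypothetical, unnormalized relevance of hour i for the
    prediction made at hour t. Only hours i <= t may receive weight.
    """
    n_hours = scores.shape[0]
    # Rows are attended hours, columns are prediction hours, so the allowed
    # entries form the upper triangle (including the diagonal).
    allowed = np.triu(np.ones((n_hours, n_hours), dtype=bool))
    masked = np.where(allowed, scores, -np.inf)
    # Column-wise softmax; subtracting the column maximum avoids overflow.
    masked = masked - masked.max(axis=0, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
raw_scores = rng.normal(size=(6, 6))   # placeholder scores for a 6-hour stay
attention = causal_attention_weights(raw_scores)

# Diagonal view: weight given to each hour at the moment it was received.
diagonal = np.diag(attention)
```

Under these assumptions, the first column places all weight on the first hour, and a column in which one row dominates corresponds to the model continuing to focus on a single earlier hour.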

Discussion
In large, heterogeneous populations of ICU patients, we have developed and externally validated a dynamic deep learning model (DeepSOFA) that uses a time-honored illness severity score framework to predict in-hospital mortality with significantly greater accuracy than traditional methods. We also demonstrate that the deep model may be used to generate real-time prognostic data for a single patient with visual representation of model attention, indicating time periods during which model inputs made a significant impact on predictions, improving model interpretability and application. When used to predict the likelihood of death for a single patient, DeepSOFA exhibited a consistent and proportionate response to clinical events. Because DeepSOFA may be automated, it is well suited to capitalize on the emerging availability of streaming EHR data. In this regard, deep models may augment clinical decision-making by serving as an early warning system to identify patients in need of therapeutic interventions and by informing the shared decision-making processes among patients, providers, and families regarding goals of care and resource utilization by instantaneously assessing large volumes of data over time, a task which is difficult and time-consuming for clinicians.
The superior accuracy of deep models is partially attributable to their ability to learn latent structure and complex relationships from low-level data, including temporal trends in the case of recurrent neural networks. Due to their internal memory mechanisms, recurrent mortality prediction models based on sequential time series learn temporal patterns from potentially long-term dependencies in time series variables. These complex relationships are lost in traditional models, especially when applying worst-value thresholds like SOFA score calculations.
Previous work has often employed multivariable regression models in predicting mortality for ICU patients. The Simplified Acute Physiology Score (SAPS) 19,20 and Mortality Probability Model (MPM) 21 have each been used to predict in-hospital mortality using data available within one hour of ICU admission. Afessa et al. 22   Physiology and Chronic Health Evaluation (APACHE) IV 23 score was used with data from the first 24 hours of ICU admission for 110,558 patients from 45 hospitals in the United States, and achieved AUC 0.88. Although these methods have produced reasonably accurate predictions of in-hospital mortality, their accuracy is inferior to that of deep models, and their clinical application is cumbersome compared with automated models that have the capacity for integration of streaming electronic health record data.
This study was limited by using data from hospitals within a single country. Patient populations and practice patterns from UFHealth and MIMIC-III may differ from those of other ICU settings, limiting the generalizability of these findings. This study is also limited by restricting the deep learning input data to SOFA components rather than the full spectrum of variables in electronic health records. Future studies should apply DeepSOFA to live streaming electronic health record data and investigate the efficacy of expanding the input variables beyond SOFA score components to include the full spectrum of variables in electronic health records.
To our knowledge, DeepSOFA is the first application of deep learning toward generating real-time patient acuity scores. Our interpretability mechanism is also a novel application of recent advances in deep learning self-attention, where sequence elements involved in the self-attention calculation are distinct hours of a patient's ICU trajectory. We utilized these attention scores to determine and visualize the severity of fundamental time series patterns and their overall effect on the resulting acuity scores, an important contribution toward the interpretability of deep learning techniques in clinical settings.
DeepSOFA models trained on time series data were more accurate than baseline SOFA models for predicting in-hospital mortality among ICU patients. Baseline SOFA models significantly underestimated the probability of death, especially among non-survivors; DeepSOFA overestimated the probability of death among survivors, albeit to a lesser degree. Magnitude of error aside, the latter is less likely to contribute to a scenario in which clinicians fail to rescue a decompensating patient, a primary concern in ICUs. DeepSOFA may be applied to individual patients, exhibiting consistent and proportionate responses to clinical events, with visual representation of the probability of death and time periods during which model inputs disproportionately contributed to predictions. These findings suggest that the SOFA score can be augmented with more nuanced and intelligent mechanisms for assessing patient acuity. Deep learning technology may be used to augment clinician decision-making by generating accurate real-time prognostic data to identify patients in need of therapeutic interventions and inform shared decision-making processes among patients, providers, and families.

Methods
Study Design.
Using the University of Florida Health Integrated Data Repository as Honest Broker, we created a single-center longitudinal dataset (referred to as UFHealth) that was extracted directly from the electronic medical records derived from 84,350 patients 18 years or older at University of Florida Health during their admissions between January 1, 2012 and April 1, 2016 as well as all encounters within one-year history and one-year follow-up. All electronic health records were de-identified, except that dates of service were maintained. The dataset includes structured and unstructured clinical data, demographic information, vital signs, laboratory values, medications, diagnoses, and procedures. Among these hospital encounters, there were 33,953 distinct encounters related to 27,660 unique patients and 36,216 ICU stays in which the patient was at least 18 years old, had their ICU stay last between 4 hours and 30 days, and had at least one measurement of mean arterial pressure and either PaO2 or SpO2 ( Supplementary Fig. S5). Identical selection criteria were applied to the publicly available MIMIC-III 18  This was a retrospective study. To predict in-hospital mortality, we made predictions every hour starting when a subject first entered the ICU, with the first mortality predictions generated one hour after ICU admission and ending at the time of ICU discharge or death. Prediction modeling was limited to data accrued during ICU admission. For patients transferred out of the ICU to an intermediate care unit or hospital ward, the end-point of hospital discharge or death was assessed at the conclusion of that hospital admission. For every prediction we used all information for our selected 14 variables available in the EHR up to the time at which the prediction was made.
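The selection criteria and hourly prediction schedule can be expressed as a small sketch. The `IcuStay` record and its field names are hypothetical and do not reflect the actual data schema; treating the 4-hour and 30-day limits as inclusive, and rounding the stay length down to whole hours, are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class IcuStay:
    # Hypothetical summary record for one ICU stay.
    age_years: float
    length_of_stay_hours: float
    n_map: int    # number of mean arterial pressure measurements
    n_pao2: int   # number of PaO2 measurements
    n_spo2: int   # number of SpO2 measurements

def is_eligible(stay):
    """Apply the cohort criteria described in the text."""
    adult = stay.age_years >= 18
    # Assumption: both duration bounds are inclusive.
    duration_ok = 4 <= stay.length_of_stay_hours <= 30 * 24
    has_map = stay.n_map >= 1
    has_oxygenation = stay.n_pao2 >= 1 or stay.n_spo2 >= 1
    return adult and duration_ok and has_map and has_oxygenation

def prediction_hours(length_of_stay_hours):
    """Hours after ICU admission at which a prediction is generated.

    The first prediction is made one hour after admission; the last whole
    hour before ICU discharge or death is assumed to be the final one.
    """
    return list(range(1, int(length_of_stay_hours) + 1))

example = IcuStay(age_years=45, length_of_stay_hours=72, n_map=10, n_pao2=0, n_spo2=5)
eligible = is_eligible(example)
schedule = prediction_hours(5.5)
```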
Data processing. For both cohorts, all raw time series data were extracted for the 14 variables in electronic health records (mean arterial pressure, fraction of inspired oxygen, partial pressure of oxygen, mechanical ventilation status, Glasgow Coma Scale, urine output, platelet count, serum bilirubin, serum creatinine and dosing for dopamine, dobutamine, epinephrine and norepinephrine) used in the original SOFA score, as well as for blood oxygen saturation, a commonly used respiratory measurement when partial pressure of oxygen is unavailable ( Table 1, Supplementary Table S2). Although additional variables would have likely improved mortality prediction accuracy, the deep learning models were limited to the use of SOFA input variables to facilitate direct comparison with baseline SOFA models and as a starting point for real-time continuous acuity assessment. Variable time series began at ICU admission and ended at ICU discharge or death.
Following variable extraction, measurement outliers were removed from both cohorts according to rules in Supplementary Table S2, which come from both expert-defined ranges and modified Z-scores. We also employed an FiO2 imputation strategy outlined in Supplementary Table S1 and Supplementary Fig. S6 for calculating FiO2 based on respiratory device and oxygen flow rates. Raw time series were then resampled to an hourly frequency, taking the mean value when multiple measurements existed for the same encounter during the same one-hour window. Following resampling, gaps in the resulting time series were filled by forward-propagating previous values for vital signs and laboratory tests and substituting 0 for vasopressor rates and the use of mechanical ventilation. For all remaining missing values, including instances in which a variable was missing entirely from an admission or before the first measurement became available, clinically normal ranges defined by experts were imputed (Supplementary Table S2).
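A minimal sketch of the resampling and gap-filling steps is given below. It is illustrative only: the function name `resample_hourly`, the alignment of one-hour windows to the time of ICU admission, and the example values (including the "normal" mean arterial pressure of 75.0) are assumptions, not values taken from Supplementary Table S2. Outlier removal and FiO2 imputation are omitted.

```python
import math
from collections import defaultdict
from statistics import mean

def resample_hourly(measurements, n_hours, fill, default):
    """Resample (hours_since_admission, value) pairs to one value per hour.

    fill="forward": gaps repeat the previous hourly value (vital signs, labs).
    fill="zero":    gaps are set to `default`, intended to be 0 (vasopressor
                    rates, mechanical ventilation status).
    Hours before the first measurement, or a variable with no measurements
    at all, receive `default` (a clinically normal value when forward-filling).
    """
    buckets = defaultdict(list)
    for hours, value in measurements:
        hour = math.floor(hours)          # window index, aligned to admission
        if 0 <= hour < n_hours:
            buckets[hour].append(value)

    series, last = [], None
    for hour in range(n_hours):
        if hour in buckets:
            last = mean(buckets[hour])    # mean of same-hour measurements
            series.append(last)
        elif fill == "forward" and last is not None:
            series.append(last)
        else:
            series.append(default)
    return series

# Hypothetical values for illustration only.
map_series = resample_hourly([(1.2, 80.0), (1.7, 90.0), (3.1, 70.0)],
                             n_hours=5, fill="forward", default=75.0)
norepi_series = resample_hourly([(1.5, 0.1)],
                                n_hours=4, fill="zero", default=0.0)
```

In this example the two mean arterial pressure readings in the second hour are averaged, the following empty hour repeats that value, and the hour before the first reading receives the assumed normal value; the vasopressor series is zero wherever no rate was recorded.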
The primary outcome was in-hospital mortality. Discharges to hospice in which death occurred within 7 days of hospital discharge (3% of encounters in UFHealth and 1.1% in MIMIC) were treated as mortalities.
Model Development and Analysis. For predicting in-hospital mortality, we used a modification of a recurrent neural network (RNN) with gated recurrent units (GRU) 24 , a deep learning model ideal for working with sequentially ordered temporal data. Figure 6 shows a high-level overview of our model at three increasing levels of abstraction. Background, motivation, and detailed technical specification for our model can be found in the Supplementary Section Model Details. Briefly, the RNN internally and continuously updates its hidden state based on multivariate inputs from both the current time step and previous time steps. As such, a mortality prediction incorporates patterns detected across the entirety of an ICU admission, with recognition of longer-range temporal relationships aided by the addition of GRUs.
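For readers unfamiliar with gated recurrent units, the following sketch shows a standard GRU update applied hour by hour, with a sigmoid output producing one mortality probability per hour. It is a generic textbook formulation under stated assumptions, not the DeepSOFA architecture: the weights are random and untrained, the hidden size of 8 is arbitrary, the helper names are hypothetical, and the self-attention layer is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(n_inputs, n_hidden, rng):
    """Random, untrained weights for illustration only."""
    p = {}
    for gate in ("z", "r", "h"):
        p["W" + gate] = rng.normal(scale=0.1, size=(n_hidden, n_inputs))
        p["U" + gate] = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
        p["b" + gate] = np.zeros(n_hidden)
    p["w_out"] = rng.normal(scale=0.1, size=n_hidden)
    p["b_out"] = 0.0
    return p

def gru_step(x, h_prev, p):
    """One GRU update: combine the current input with the previous state."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev + p["bz"])   # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev + p["br"])   # reset gate
    h_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev) + p["bh"])
    # One common convention; some references swap the roles of z and 1 - z.
    return (1.0 - z) * h_prev + z * h_cand

def hourly_predictions(series, p):
    """Return one probability per hour, each using all data seen so far."""
    h = np.zeros(p["bz"].shape[0])
    preds = []
    for x in series:
        h = gru_step(x, h, p)
        preds.append(sigmoid(p["w_out"] @ h + p["b_out"]))
    return np.array(preds)

rng = np.random.default_rng(0)
params = init_params(n_inputs=14, n_hidden=8, rng=rng)
stay = rng.normal(size=(12, 14))   # placeholder: 12 hours x 14 variables
risk = hourly_predictions(stay, params)
```

Because the hidden state is carried from one hour to the next, the prediction at any hour depends on the full preceding sequence rather than on the current measurements alone.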
One of the weaknesses of deep learning techniques is the inherent difficulty in understanding the relative importance of model inputs in generating the output. In the case of mortality prediction, clinicians are interested not only in the likelihood of death, but also in knowing which factors are primarily responsible for the risk of death. If such factors are modifiable, then they represent therapeutic targets. If such factors are not modifiable, then the sustained provision of life-prolonging interventions may reach futility. To improve clinical interpretability, inspired by state-of-the-art results in other deep learning domains, we modified the traditional GRU-RNN network to include a final self-attention mechanism to allow clinicians to understand why the deep network is making its predictions. At each hour during a real-time ICU stay, the model's attention mechanism focuses on salient deep representations of all previous time points, assigning relevance scores to every preceding hour that determine the magnitude of each hour's contribution to the model's overall mortality prediction. Subject to the constraint that each hour's relevance scores must sum to 1, we are able to see exactly which hours of the multivariate time series the model thinks are most important, and how suddenly attention shifts. An example of this interpretable attention mechanism is shown in Fig. 5, where, along with a mapping back to the original input time series, the model is able to justify its mortality predictions by changes in each of the input variables.
DeepSOFA mortality predictions were compared with two baseline models using traditional SOFA scores, which were calculated at each hour using the previous 24 hours of EHR data. The mortality predictions associated with calculated SOFA scores were derived both from published mortality rate correlations with any given score 2 , which we refer to as "Bedside SOFA", and from overall AUC derived from raw SOFA scores, which we refer to as "Traditional SOFA". At any hour during an ICU admission, the Bedside SOFA baseline model associated the current SOFA score with a predicted probability of mortality, as would be performed using an online calculator, in which total SOFA scores correlate with mortality ranges. The Traditional SOFA model is based on retrospective analysis that derives AUC from raw SOFA scores and outcomes, and while not suitable for real-time prediction in practice, is a reasonable and contemporary baseline and an appropriate challenger to compare with DeepSOFA. A high-level comparison between the prediction and AUC calculation for all three models used in our experiments can be found in Supplementary Table S3. Our baselines are based on both current practice (Bedside SOFA) and recent retrospective methods (Traditional SOFA). Both of these baselines utilize a single feature (current SOFA score) from patient time series for making hourly predictions. As a sensitivity analysis, we also trained two additional conventional machine learning models (logistic regression, random forest) using 84 aggregate features recalculated at every hour after ICU admission, including the following for each of the 14 SOFA variables: minimum value, maximum value, mean value, standard deviation, first value, and last value. A summary of these results can be found in Supplementary Tables S4 and S5 for internal and external validation, respectively. The SOFA baselines included in our study outperformed these additional machine learning models.
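The 84 aggregate features used in the sensitivity analysis (6 statistics for each of 14 variables) can be sketched as follows. The function name `aggregate_features`, the ordering of features by statistic, and the use of the population standard deviation are assumptions for illustration; the input data are random placeholders.

```python
import numpy as np

def aggregate_features(history):
    """Summarize all hours observed so far into one feature vector.

    history: array of shape (n_hours_so_far, n_variables).
    Returns 6 statistics per variable, grouped by statistic.
    """
    stats = [
        history.min(axis=0),    # minimum value
        history.max(axis=0),    # maximum value
        history.mean(axis=0),   # mean value
        history.std(axis=0),    # standard deviation (population, ddof=0)
        history[0],             # first value
        history[-1],            # last value
    ]
    return np.concatenate(stats)

rng = np.random.default_rng(0)
stay = rng.normal(size=(10, 14))   # placeholder: 10 hours x 14 variables

# Recalculate the features at every hour using only data available so far.
hourly_features = np.stack(
    [aggregate_features(stay[: t + 1]) for t in range(len(stay))]
)
```

Each row of `hourly_features` could then serve as the input to a conventional classifier for the prediction made at that hour.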
In the internal validation experiments, we performed 5-fold cross-validation in which a model was trained on a random 80% of ICU admissions and tested on the remaining 20%, repeated for 5 non-overlapping iterations to yield a prediction trajectory for every ICU stay in the cohort. In the external validation experiments, a DeepSOFA model was trained on the entirety of one cohort and tested on the entirety of the other cohort. Baseline models did not require training and were applied to the same testing cohorts as DeepSOFA. All reported results are from the external validation experiments. Internal cross-validation results, which are more optimistic than their external counterparts, can be found in the online supplement. Predictions involved in our experiments were performed on individual ICU encounters; as such, a single patient could have multiple ICU stays that appear as distinct prediction units. We performed a sensitivity analysis involving two variations of adjusting for patients with multiple ICU encounters in their EHR, including (1) only keeping their first ICU encounter, and (2) removing such patients entirely from the dataset. All models were retrained and tested using these modified datasets, and a summary of these prediction results can be found in Supplementary Tables S6 and S7 for internal and external validation, respectively. Both modifications resulted in increased performance for all models.
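The 5-fold scheme, in which every ICU stay appears in exactly one test fold, can be sketched as below. The function name `five_fold_splits` and the fixed seed are hypothetical, and splitting is done at the level of ICU stays as described in the text, so stays from the same patient may fall in different folds.

```python
import numpy as np

def five_fold_splits(n_stays, seed=0):
    """Yield (train_idx, test_idx) pairs over ICU stay indices.

    Stays are shuffled once and divided into 5 non-overlapping test folds,
    so each split trains on roughly 80% and tests on roughly 20%.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_stays)
    folds = np.array_split(order, 5)
    for k in range(5):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(5) if j != k])
        yield train_idx, test_idx

splits = list(five_fold_splits(23))
# Pooling the test folds covers every stay exactly once.
all_test = np.sort(np.concatenate([test for _, test in splits]))
```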

Model evaluations and statistical analysis.
For all models, a prediction was obtained at every hour, beginning one hour after ICU admission and ending at the time of ICU discharge or death. We assessed model discrimination by calculating area under the receiver operating characteristic curve (AUC), and calculated 95% confidence intervals using 100 bootstrapped iterations of sampling mortality prediction probabilities with replacement. At each hour, all ICU stays were included in reported results; for encounters with duration less than the current hour, the final prediction was used in AUC calculations. Figures 4 and S4 show the number of active ICU encounters and corresponding mortality rates by hour.
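The evaluation procedure can be sketched as follows. This is an illustrative reading of the description above rather than the analysis code: the function names are hypothetical, the AUC is computed from pairwise comparisons with ties counted as one half, and the 95% interval is taken as the 2.5th and 97.5th percentiles of the bootstrap distribution, which is an assumption because the interval construction is not specified. The example labels and scores are synthetic.

```python
import numpy as np

def auc(labels, scores):
    """Probability that a non-survivor is scored above a survivor."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def bootstrap_auc_ci(labels, scores, n_boot=100, seed=0):
    """Percentile 95% CI from n_boot resamples drawn with replacement."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    n = len(labels)
    samples = []
    while len(samples) < n_boot:
        idx = rng.integers(0, n, size=n)
        if labels[idx].min() == labels[idx].max():
            continue   # AUC is undefined when a resample has one class only
        samples.append(auc(labels[idx], scores[idx]))
    lower, upper = np.percentile(samples, [2.5, 97.5])
    return lower, upper

def prediction_at_hour(trajectory, hour):
    """Carry the final prediction forward for stays shorter than `hour`."""
    return trajectory[min(hour, len(trajectory)) - 1]

# Synthetic example: 200 stays, non-survivors scored higher on average.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200)
p = np.clip(0.3 + 0.3 * y + rng.normal(scale=0.2, size=200), 0.0, 1.0)
point = auc(y, p)
ci_low, ci_high = bootstrap_auc_ci(y, p)
```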

Data Availability
The MIMIC cohort is derived from the publicly available MIMIC-III database 18 . UFHealth cohort data are available from the University of Florida Institutional Data Access/Ethics Committee for researchers who meet the criteria for access to confidential data and may require additional IRB approval.