Forecasting adverse surgical events using self-supervised transfer learning for physiological signals

Hundreds of millions of surgical procedures take place annually across the world, generating a prevalent type of electronic health record (EHR) data comprising time series physiological signals. Here, we present a transferable embedding method (i.e., a method to transform time series signals into input features for predictive machine learning models) named PHASE (PHysiologicAl Signal Embeddings) that enables us to more accurately forecast adverse surgical outcomes from physiological signals. We evaluate PHASE on minute-by-minute EHR data from more than 50,000 surgeries in two operating room (OR) datasets and from patient stays in an intensive care unit (ICU) dataset. PHASE outperforms other state-of-the-art approaches, such as long short-term memory networks trained on raw data and gradient boosted trees trained on handcrafted features, in predicting six distinct outcomes: hypoxemia, hypocapnia, hypotension, hypertension, phenylephrine, and epinephrine. In a transfer learning setting, where we train embedding models on one dataset and then embed signals and predict adverse events in unseen data, PHASE achieves significantly higher prediction accuracy at lower computational cost than conventional approaches. Finally, given the importance of understanding models in clinical applications, we demonstrate that PHASE is explainable and validate our predictive models using local feature attribution methods.

In Figure 2, we only utilized SAO2 (SpO2) from the ICU data because it was one of the most predictive signals for our outcomes (out of SAO2, ETCO2, and NIBPM), with >80% of measurements available by time point (Supplementary Table 1).

For hypoxemia, a particular time point t is labelled one if the minimum of the next five minutes is hypoxemic (min(SAO2_{t+1:t+6}) ≤ 93). All points where the current time step is already hypoxemic (SAO2_t < 93) are ignored. Additionally, we ignore time points where the past ten minutes or the next five minutes are entirely missing. Hypocapnia, hypotension, and hypertension have slightly stricter label conditions. We label the current time point t one if min(S_{t−10:t}) > T and the minimum of the next five minutes is "hypo" (min(S_{t+1:t+5}) ≤ T) (maximum and ≥ for hypertension). We label the current time point t zero if min(S_{t−10:t}) > T and the minimum of the next ten minutes is not "hypo" (min(S_{t+1:t+10}) > T).
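As a concrete illustration, the hypoxemia labelling rules above can be sketched as follows. The function name and the NumPy representation (NaN marking missing or ignored points) are our own choices for this sketch, not the paper's implementation; treating a missing current value as unlabelable is likewise our assumption.

```python
import numpy as np

def label_hypoxemia(sao2, thresh=93):
    """Sketch of the hypoxemia labelling rules.

    Returns 1.0 if SaO2 drops to <= thresh within the next five minutes,
    0.0 otherwise, and NaN for ignored points (currently hypoxemic or
    missing, or the past ten / next five minutes entirely missing).
    """
    n = len(sao2)
    labels = np.full(n, np.nan)
    for t in range(n):
        past = sao2[max(0, t - 10):t]
        future = sao2[t + 1:t + 6]
        # Missing current value (our assumption) or already hypoxemic: ignore.
        if np.isnan(sao2[t]) or sao2[t] < thresh:
            continue
        # Past ten or next five minutes entirely missing: ignore.
        if np.all(np.isnan(past)) or np.all(np.isnan(future)):
            continue
        labels[t] = 1.0 if np.nanmin(future) <= thresh else 0.0
    return labels
```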

For hypertension, we use max(S_{t+1:t+5}) ≥ T rather than min and an analogous filtering procedure. All other

Given that different surgeries have different outcomes and challenges associated with them, we aim to analyze the performance of our models for a few major surgical diagnoses. We focus on XGB models trained with next embeddings (from Figure 2). First, we identify the top ten diagnoses for each target operating room dataset according to the distributions in the test set. We then evaluate our model's predictions for subsets of samples that correspond to procedures with a given diagnosis. We report the performance for each diagnosis in Supplementary Tables 13-15. Furthermore, because we are comparing across outcomes with different base rates, we report and rely on ROC AUC rather than average precision (PR AUC), which is more sensitive to base rates.

We find that for hypoxemia, our model performs strongly across most categories in OR 0, with the exception of "prev c-sect nos-deliver" (repeat cesarean delivery), where hypoxemia is more prevalent. Furthermore, for "cataract" in OR 1, we find that our models are unsuccessful at identifying hypoxemia. For hypocapnia, challenging diagnoses include "senile cataract unspecified" and "cataract". For hypotension, challenging diagnoses include "prev c-sect nos-deliver" and "atrial fibrillation". One possible reason these particular diagnoses (cataracts and cesareans) were challenging is that they differ from the majority of procedures in the dataset. In terms of anesthesia, both are given minimal anesthesia, and cesareans may involve a spinal anesthetic, which has its own characteristic effects.
In particular, cesareans and cataracts typically involve regional (97%) or monitored (MAC) (92%) anesthesia, respectively, in comparison to the 85% of procedures that involve general anesthesia. This finding is in line with recent research in distributionally robust optimization suggesting that even performant models often perform poorly in "minority" or undersampled regions of the training data [10], as is the case with MAC and regional anesthesia. Furthermore,

10% the number of procedures that the source dataset does. We report these results in Supplementary Figure 6.

We find that training from scratch is not very beneficial in comparison to raw or ema when the target training set is so small. This is not surprising, given that neural networks often require extremely large datasets to be effective, and suggests that the performance of next in Figure 2

Figure 6: Performance on a smaller target set (10%). We perform the same experiments from Figures 2 and 3, where we train next embedding models in three ways: (1) standard embedding, (2) transferred embedding, and (3) fine-tuned embedding. However, when we perform the experiments, we first greatly subsample the target training dataset to 3,000 procedures (roughly 10% of each target dataset's full sample size). This simulates the setting where the target hospital is much smaller than the source hospital.
data sets. We also find that the performance of next′ is quite high, suggesting that the transferred embedding model from a much larger dataset is the best approach in this setting. Finally, we find that fine-tuning (next_ft) does consistently improve over the standard embedding setting (next), but the improvements are small relative to the transferred embedding approach (next′). This suggests that the fine-tuned embedding models find local minima similar to those of the standard embedding models, which are simply not as performant due to the low sample size of the target dataset. Although fine-tuned embedding models were not the best approach in this low-sample-size setting, it is important to note that they may still be advisable under dataset drift.
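The transferred-embedding workflow discussed above can be sketched end to end. As a toy stand-in for the self-supervised LSTM, we fit a linear encoder (via SVD) on the source dataset only, freeze it, and apply it to a much smaller target dataset; all names, sizes, and the linear encoder itself are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: many 60-minute signal windows at the source hospital,
# far fewer at the target hospital.
source = rng.normal(size=(1000, 60))
target = rng.normal(size=(100, 60))

# "Self-supervised" encoder fit on the SOURCE only: a linear
# autoencoder via truncated SVD, standing in for the LSTM.
k = 8                                    # embedding size (hypothetical)
_, _, vt = np.linalg.svd(source - source.mean(0), full_matrices=False)
encoder = vt[:k].T                       # frozen after source training

def embed(windows):
    """Transferred embedding: apply the frozen source-trained encoder."""
    return windows @ encoder

src_emb = embed(source)                  # features for the source model
tgt_emb = embed(target)                  # reused on the target at no training cost
# A downstream model (e.g., gradient boosted trees) would now train on tgt_emb.
```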

For our experiment we utilized a random subsample of the target dataset, which we found to be similar

pre-training with two tasks: (1) predicting randomly masked input tokens and (2) predicting next sentences.

Both papers utilize a task analogous to our next task in different domains, and predicting the current frames as in Srivastava, Mansimov, and Salakhutdinov [20] is analogous to our auto task. In a more related domain,

transformations were applied to the input signal. Tang et al. [21] build upon this self-supervised approach using teacher-student learning.

A final major category of self-supervised approaches falls under the broad umbrella of contrastive learning.

The high-level goal of contrastive learning is to learn representations that group together similar images

LSTMs from before but predict both the auto (t−60:t) and next (t:t+5) outcomes (t−60:t+5).
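A minimal sketch of a contrastive objective in this spirit is an InfoNCE (NT-Xent-style) loss, where two views of the same window are pulled together and all other pairs in the batch act as negatives. This is a generic formulation for illustration, not the paper's exact loss; the temperature value is an assumption.

```python
import numpy as np

def info_nce(z1, z2, tau=0.5):
    """InfoNCE-style contrastive loss (generic sketch).

    z1[i] and z2[i] are embeddings of two "views" of the same window;
    pairs (z1[i], z2[j]) with j != i serve as negatives.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = (z1 @ z2.T) / tau            # cosine similarity matrix
    # Row-wise cross-entropy with the diagonal as the positive class.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Matching views drive the diagonal similarities up and the loss down; with random, unrelated pairs the loss sits near log(batch size).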

• augauto - utilizes the augmentation techniques of Saeed, Ozcelebi, and Lukkien [18] and Tang et al. [21] and trains the LSTM embedding models to predict whether one of seven transformations of the original signal was performed: noised, scaled, rotated, negated, horizontally flipped, permuted, or time-warped. We exclude the final transformation (channel shuffling) because it is not applicable to our signals. We

In Supplementary Figure 9, we find that, compared to our previous approaches, the augmentation-based approaches (augauto and contrast) do not improve performance, and in fact appear to be destructive compared to the straightforward auto task. One possible reason is that these approaches likely depend heavily on the augmentations we apply and would require a great deal of hyperparameter searching to identify the most appropriate augmentations. Finally, autonext appears to improve performance relative to auto, but is consistently worse than next. Put together, these results suggest that these additional self-supervised approaches fail to improve performance beyond our straightforward next approach for the signals and outcomes we consider. Nevertheless, future work on data augmentation techniques or contrastive learning that incorporates future signals may be valuable.

Figure 2). We report the absolute performance of XGB raw below 0 on the x axis in parentheses.
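For concreteness, the listed signal transformations might be sketched as below for a univariate signal. All parameters (noise scale, scaling range, segment count, warp strength) are our assumptions; rotation is omitted here since, like channel shuffling, it acts across channels of a multichannel signal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketches of the named transformations for a univariate signal x.
def noised(x):  return x + rng.normal(0.0, 0.1, size=x.shape)
def scaled(x):  return x * rng.uniform(0.7, 1.3)
def negated(x): return -x
def hflip(x):   return x[::-1]

def permuted(x, n_segments=4):
    # Split into segments and shuffle their order.
    segs = np.array_split(x, n_segments)
    return np.concatenate([segs[i] for i in rng.permutation(n_segments)])

def time_warp(x):
    # Resample x along a randomly stretched, monotone time axis.
    t = np.linspace(0.0, len(x) - 1, len(x)) ** rng.uniform(0.8, 1.25)
    t = t / t[-1] * (len(x) - 1)
    return np.interp(t, np.arange(len(x)), x)
```

The pretext task then labels each transformed window with which transformation produced it.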

Evaluating alternative outcome definitions
There are many possible thresholds that have been defined for hypoxemia, hypocapnia, and hypotension. Importantly, the goal of PHASE is not to identify the best definition of each outcome, but rather to evaluate our self-supervised approach on a variety of diverse outcomes. However, to show that our models are relatively robust to the choice of threshold, we include an analysis of our models from

consists of the following: a dense layer with 100 nodes (with a ReLU activation), followed by a dropout layer with dropout rate 0.5, followed by another dense layer with 100 nodes (with a ReLU activation), followed by a dropout layer with dropout rate 0.5, followed by a dense output layer with one node and a sigmoid activation function. We

ema: exponential moving averages/variances (… sec., 1 min., 5 min.)
rand: LSTM with random initialization
auto: LSTM trained to predict S_{t−60:t}
next: LSTM trained to predict S_{t+1:t+5}
min: LSTM trained to predict min(S_{t+1:t+5})
hypo: LSTM trained to predict the target task (e.g., min(SAO2_{t+1:t+5}) ≤ 93)

By default, we embed with the source OR data set that matches the target OR data set; alternatively, we embed with a source OR data set that does not match the target OR data set; or we embed SAO2 using ICU P as a source data set and embed the remaining fourteen signals using the target data set as the source data set.

Figure 18: Comparison of different embedding sizes. We train XGB models on next LSTM-embedded data (15 embedded signals and 6 static features) using different embedding sizes. We focus on predicting hypotension, which has fewer samples than the alternative outcomes. The shaded regions show 99% confidence intervals.

We train XGB models on a single embedded feature (SAO2 for hypoxemia, ETCO2 for hypocapnia, and NIBPM for hypotension). The y axis represents the test average precision of the XGB model and the x axis represents the time slice we utilize for the final embedding. t−1 corresponds to the embedding for the most recent time point (size 200), t−2:t−1 corresponds to the embedding for the two most recent time points (size 400), and t−5:t−1 corresponds to the embedding for the five most recent time points (size 1000). The shaded regions show 99% confidence intervals.
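The network described above (Dense(100, ReLU), Dropout(0.5), Dense(100, ReLU), Dropout(0.5), Dense(1, sigmoid)) can be sketched as an inference-time forward pass in plain NumPy, where dropout reduces to the identity. The weight initialization and the 200-dimensional input are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):    return np.maximum(x, 0.0)
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

# Dense(100, ReLU) -> Dropout(0.5) -> Dense(100, ReLU) -> Dropout(0.5)
# -> Dense(1, sigmoid); dropout is the identity at inference time.
d_in = 200                                  # e.g., one 200-dim embedding
W1, b1 = rng.normal(0, 0.1, (d_in, 100)), np.zeros(100)
W2, b2 = rng.normal(0, 0.1, (100, 100)), np.zeros(100)
W3, b3 = rng.normal(0, 0.1, (100, 1)), np.zeros(1)

def mlp_predict(x):
    h = relu(x @ W1 + b1)                   # first hidden layer
    h = relu(h @ W2 + b2)                   # second hidden layer
    return sigmoid(h @ W3 + b3)             # probabilities in (0, 1)
```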
the final embedding, we include more time steps (slices) for a single feature for our three hypo outcomes

in the hypoxemia outcome. Including multiple time slices would multiplicatively increase this, with five time slices constituting more than a terabyte of memory. In Supplementary Figure 19, we see that increasing the number of time slices can improve performance for the transferred models (next′), but overall the marginal improvement from incorporating more time slices is relatively small.

Furthermore, for the non-transferred models (next), we do not find that more time slices yield better performance. As such, utilizing the final time step's embedding, as we have done throughout our experiments, seems to be the best approach.

We train raw XGB models on all fifteen physiological features and static features. We use varying window sizes for the physiological signals and report test average precision.
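The time-slice feature construction can be sketched as follows: each time step of one signal yields a 200-dimensional embedding, so concatenating the last k slices yields 200k features (sizes per the embedding-size comparison). The array names are illustrative.

```python
import numpy as np

# Toy per-minute embeddings for one signal: 60 time steps, 200 dims each.
emb = np.random.default_rng(0).normal(size=(60, 200))

def last_k_slices(emb, k):
    """Features from the k most recent time slices, t-k:t-1, flattened."""
    return emb[-k:].reshape(-1)

single = last_k_slices(emb, 1)   # size 200: the default used in our experiments
```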
The window size we use in our experiments is 60 minutes, which amounts to 60 features from each signal because they are sampled minute by minute. However, the window size is actually an additional hyperparameter. The primary motivation for using 60 minutes is twofold: (1) an hour is an easy-to-understand choice of hyperparameter, and (2) we found that 60 minutes was sufficient to make the best possible predictions.

In particular, we can evaluate (2) by training XGB models with raw data that includes different window sizes: 60 minutes (the default choice in our paper), 40 minutes, 20 minutes, 5 minutes, and 1 minute. In