Prediction of postoperative cardiac events in multiple surgical cohorts using a multimodal and integrative decision support system

Postoperative patients are at risk of life-threatening complications such as hemodynamic decompensation or arrhythmia. Automated detection of patients with such risks via a real-time clinical decision support system may provide opportunities for early and timely interventions that can significantly improve patient outcomes. We utilize multimodal features derived from digital signal processing techniques and tensor formation, as well as the electronic health record (EHR), to create machine learning models that predict the occurrence of several life-threatening complications up to 4 hours prior to the event. In order to ensure that our models are generalizable across different surgical cohorts, we trained the models on a cardiac surgery cohort and tested them on vascular and non-cardiac acute surgery cohorts. The best performing models achieved an area under the receiver operating characteristic curve (AUROC) of 0.94 on training and 0.94 and 0.82, respectively, on testing for the 0.5-hour interval. The AUROCs only slightly dropped to 0.93, 0.92, and 0.77, respectively, for the 4-hour interval. This study serves as a proof-of-concept that EHR data and physiologic waveform data can be combined to enable the early detection of postoperative deterioration events.

Patients who undergo cardiovascular surgery are at risk of developing hemodynamic decompensation 1 . Decompensation can include arrhythmias (e.g., atrial fibrillation), hypotension, and pulmonary edema. Monitoring devices in intensive care units (ICU) produce a wealth of data, but the deluge of information, in addition to false alarms, alarm fatigue, and cognitive biases, can negatively affect patient care [2][3][4] . As such, there is a growing need for predictive models and effective clinical decision support systems (CDSS) that distill only the relevant data from multiple sources into actionable information for healthcare providers.
Previous work has incorporated electronic health record (EHR) data into machine learning (ML) models, as EHR data are noninvasive and routinely collected. Examples include models to predict coronary artery disease 5 , to compute survival risk scores 6 , and to predict mortality or 30-day readmission [7][8][9] .
Other research focuses on the utility of physiological signals. Several models [10][11][12] used features from heart rate variability (HRV) to predict mortality or vascular events. Belle et al. 13 showed that HRV and electrocardiogram (ECG) features could better predict hemodynamic stability compared to standard biomarkers used in hemodynamic assessment. Hernandez et al. 14 took a multimodal approach, using ECG, HRV, arterial blood pressure (ABP) waveform from an arterial line, pulse plethysmography (PPG) waveform from a pulse oximeter, and EHR data to predict hemodynamic decompensation.
In this study, we use multimodal features from physiological signals and EHR data to create ML models that predict hemodynamic decompensation after surgery in surgical ICU patients. We do this to test the hypothesis that models trained on a cardiovascular surgery cohort can be successfully applied to other cohorts that underwent different surgical procedures (vascular and acute non-cardiac surgery). We then share the results.

Patient cohorts. The three cohorts of surgical patients have increasing heterogeneity. Cohort 1 consists of patients recovering from elective cardiac surgery, Cohort 2 from major vascular surgeries, and Cohort 3 from acute conditions requiring urgent and/or major non-cardiac surgery. Examples of surgeries in Cohort 1 include coronary artery bypass grafting, cardiac valve repair/replacement, and thoracic aortic procedures; in Cohort 2, abdominal aortic aneurysm surgeries (both open repairs and endovascular stenting) and major vascular bypass procedures (e.g., aortofemoral bypass, axillofemoral bypass, and femoral-popliteal artery bypass); in Cohort 3, major abdominal surgeries (e.g., exploratory laparotomies), orthopedic surgeries (e.g., total hip replacements for hip fractures), and neurosurgeries (e.g., craniotomies and spinal fusions). Seven adverse events associated with hemodynamic decompensation are included in our study: low cardiac index, sustained low mean arterial pressure, epinephrine bolus, inotropic therapy initiated, inotropic therapy escalated by ≥100%, vasopressor therapy initiated, and vasopressor therapy escalated by ≥100%. In addition, we include two adverse events that may not necessarily result from hemodynamic decompensation but are clinically significant enough to warrant detection by an algorithm: re-intubation and mortality; an algorithm missing these two events while detecting other decompensation events would be clinically less meaningful.
The exact definitions, details, and rationale for inclusion of these events as agreed by our clinical team are available in Hernandez et al. 14 .
Signal processing for feature extraction. We divided the ECG, ABP, and PPG signals into five non-overlapping tumbling windows, each three minutes in length. We created four prediction windows of different lengths: 30 minutes, 1 hour, 2 hours, and 4 hours.
Data preprocessing. To remove artifacts, each ECG tumbling window was preprocessed using a second order Butterworth bandpass filter with cutoff frequencies of 0.5 and 40 Hz. Similarly, each ABP window was preprocessed with a third order Butterworth bandpass filter with cutoff frequencies 1.25 and 25 Hz, and each PPG window with a third order Butterworth bandpass filter with cutoff frequencies 1.75 and 10 Hz.
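The preprocessing step described above can be sketched with SciPy's Butterworth design and zero-phase filtering. This is an illustrative sketch, not the authors' implementation; the sampling rate and the test signal are assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass(signal, fs, low, high, order):
    # Design a Butterworth bandpass filter and apply it forward-backward
    # (filtfilt) so the filtered signal has no phase distortion.
    b, a = butter(order, [low, high], btype="bandpass", fs=fs)
    return filtfilt(b, a, signal)

fs = 240  # assumed sampling rate in Hz (not stated in this section)
t = np.arange(0, 10, 1 / fs)
# A 25 Hz component (inside the ECG passband) plus slow 0.05 Hz baseline
# wander (outside all three passbands).
x = np.sin(2 * np.pi * 25 * t) + 2 * np.sin(2 * np.pi * 0.05 * t)

ecg_filtered = bandpass(x, fs, 0.5, 40, order=2)   # ECG: 0.5-40 Hz, 2nd order
abp_filtered = bandpass(x, fs, 1.25, 25, order=3)  # ABP: 1.25-25 Hz, 3rd order
ppg_filtered = bandpass(x, fs, 1.75, 10, order=3)  # PPG: 1.75-10 Hz, 3rd order
```

On the toy signal, the ECG filter keeps the 25 Hz component while removing the baseline wander, and the narrower PPG band attenuates both.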
Heart rate variability. We identified peaks within the filtered ECG signal using the peak detection method defined in Hernandez et al. 14 . We then computed the time differences between subsequent peaks to produce the HRV series.
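The peak-to-peak interval computation can be sketched as follows; `find_peaks` is a generic stand-in for the peak detector of Hernandez et al., and the synthetic beat train is an illustrative assumption.

```python
import numpy as np
from scipy.signal import find_peaks

fs = 240  # assumed sampling rate in Hz (not specified in this section)
t = np.arange(0, 180, 1 / fs)  # one 3-minute tumbling window

# Toy "ECG": unit spikes at a steady 75 bpm (one beat every 0.8 s).
beat_times = np.arange(0.5, 180, 0.8)
ecg = np.zeros_like(t)
ecg[np.round(beat_times * fs).astype(int)] = 1.0

# Detect R peaks, enforcing a refractory-period-like minimum spacing.
peaks, _ = find_peaks(ecg, height=0.5, distance=int(0.4 * fs))
rr_intervals = np.diff(peaks) / fs  # HRV series: time between subsequent peaks
```

On this toy input every interval comes out at 0.8 s; real HRV features are then computed from the variability of this series.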
Taut string. Taut String (TS) 17 has previously been used to capture hemodynamic instability in ECG 13,14 , and we utilize these same TS features in our method with varying values for ǫ.
Given a discrete signal f = (f_1, f_2, ..., f_n), we define the first-order finite difference as

diff(f) = (f_2 − f_1, f_3 − f_2, ..., f_n − f_{n−1}).

We use a parameter ǫ > 0 to create the piecewise linear function g, the TS estimate of f, such that the max norm of (f − g) is ≤ ǫ and the Euclidean norm of diff(g) is minimal, where

‖f − g‖∞ = max_{1≤i≤n} |f_i − g_i| and ‖diff(g)‖_2 = ( Σ_{i=1}^{n−1} (g_{i+1} − g_i)² )^{1/2},

respectively. Next, diff* is defined as

diff*(y_1, y_2, ..., y_{n−1}) = (−y_1, y_1 − y_2, ..., y_{n−2} − y_{n−1}, y_{n−1}).

Here, diff* : R^{n−1} → R^n is dual to diff : R^n → R^{n−1}. Over the same constraint set, the function g also minimizes the total variation Σ_{i=1}^{n−1} |g_{i+1} − g_i|. One can visualize the function g as a string pulled tightly between f + ǫ and f − ǫ . It is generally a piecewise linear function that results in a smoother line than f. Figure 2 shows a sample TS estimate of ECG. We applied TS to each tumbling window of the ECG, HRV, ABP, and PPG signals with different ǫ values. A summary of the TS estimations performed and the number of features extracted are listed in Table 1. These features are detailed in Hernandez et al. 14 .
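The TS estimate admits a direct (if inefficient) formulation as a bound-constrained least-squares problem: minimize ‖diff(g)‖_2 subject to f − ǫ ≤ g ≤ f + ǫ. The sketch below solves it with a generic SciPy solver rather than the specialized algorithm of ref. 17; the signal and ǫ value are illustrative.

```python
import numpy as np
from scipy.optimize import lsq_linear

def taut_string(f, eps):
    # Minimize ||diff(g)||_2 subject to the tube constraint
    # f - eps <= g <= f + eps (i.e., ||f - g||_inf <= eps).
    n = len(f)
    D = np.diff(np.eye(n), axis=0)  # (D g)_i = g_{i+1} - g_i
    return lsq_linear(D, np.zeros(n - 1), bounds=(f - eps, f + eps)).x

# A small oscillating signal; eps = 0.45 is an illustrative tube width.
f = np.array([0.0, 1.0, 0.2, 1.1, 0.1, 1.2, 0.3, 1.0])
g = taut_string(f, eps=0.45)
```

The result stays within ǫ of f everywhere while being markedly smoother, matching the "string pulled tight through a tube" picture.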
Dual-tree complex wavelet packet transform. The dual-tree complex wavelet packet transform (DTCWPT) 18 has previously been used in an ECG context 14 . More detail is available in Bayram et al. 18 , but briefly: at each level k, DTCWPT uses a high- and low-pass perfect reconstruction wavelet filter bank to decompose the previous level's subbands. Increasing k yields increased frequency resolution, but at computational expense. We select k = 2 in this study.
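For intuition about the subband structure at level k, a single-tree wavelet packet decomposition with orthonormal Haar filters is sketched below. This is a simplified stand-in for the complex dual-tree transform of Bayram et al., not the actual DTCWPT; the test signal is illustrative.

```python
import numpy as np

def split(x):
    # One filter-bank stage: orthonormal Haar low- and high-pass filtering
    # followed by dyadic downsampling (equivalent to pairwise sums and
    # differences scaled by 1/sqrt(2)).
    lo = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    hi = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return lo, hi

def wavelet_packet(x, level):
    # Wavelet packet transform: unlike the plain wavelet transform, *both*
    # subbands are split again at each level, giving 2**level subbands.
    subbands = [np.asarray(x, dtype=float)]
    for _ in range(level):
        subbands = [part for s in subbands for part in split(s)]
    return subbands

x = np.random.default_rng(0).standard_normal(64)
bands = wavelet_packet(x, level=2)  # k = 2 as in the study -> 4 subbands
```

Because the Haar pair is orthonormal, the decomposition preserves signal energy exactly, a sanity check that also holds for the perfect reconstruction filter banks used in DTCWPT.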
The two filter banks are selected such that the frequency response of each branch of the second filter bank is the discrete Hilbert transform of the corresponding branch of the first. This allows for approximate shift-invariance. If ψ_h(t) is the wavelet of the first filter bank and ψ_g(t) that of the second, then ψ_g(t) ≈ H{ψ_h(t)}; this holds when the low-pass filters of the two banks differ by a half-sample delay, i.e., G_0(e^{jω}) = e^{−jωd/2} H_0(e^{jω}), where d is an odd integer and |ω| < π. Following this, if H^{(k)}(e^{jω}) is the equivalent response at level k, the same half-sample delay relation carries over to the equivalent responses of the two trees at each level.

Electronic health records. EHR data were available for each subject in addition to the physiological signals. They included static information, such as age, race, and comorbidities, as well as temporal information, such as lab results and medications administered. The features that required different levels of representation through one-hot encoding are presented in Supplementary Tables S1 through S4 and further explained in Hernandez et al. 14 . For completeness, the temporal EHR information for each tumbling window was carried over from the most recent record.
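The carry-forward of temporal EHR values can be sketched as a last-observation-carried-forward lookup per tumbling window; the record times, values, and the lab itself are hypothetical.

```python
import numpy as np

def carry_forward(record_times, record_values, window_starts):
    # For each tumbling window, take the most recent EHR record at or
    # before the window start (last-observation-carried-forward).
    out = []
    for t in window_starts:
        idx = np.searchsorted(record_times, t, side="right") - 1
        out.append(record_values[idx] if idx >= 0 else np.nan)
    return np.array(out)

# Example: a hypothetical lab drawn at minute 0 and minute 7,
# with five 3-minute tumbling windows starting at 0, 3, 6, 9, 12.
times = np.array([0.0, 7.0])
values = np.array([1.8, 2.4])
windows = np.arange(0.0, 15.0, 3.0)
filled = carry_forward(times, values, windows)  # -> [1.8, 1.8, 1.8, 2.4, 2.4]
```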
Tensor formation and reduction. Taking into account all values of ǫ from TS estimation, the filter banks of DTCWPT, and the five tumbling windows, a total of 5,150 features are extracted from the ECG, HRV, ABP, and PPG signals. To reduce the feature space of these data while maintaining structural information, we rely on tensor decomposition. First, we format the data into a tensor. Each signal feature type has the structure of five tumbling windows and five values of ǫ , with varying numbers of features. We standardize the features within each tumbling window and ǫ value using the mean and standard deviation (SD) of each ǫ-window-feature entry from the training set. These training set values are later used to standardize the ǫ-window-feature entries from the test set. We construct a third-order tensor of structure ( ǫ × feature × window ) for each feature type, visualized in Table 2. Once the tensors of a feature type have been created for all subjects in the training set, they are stacked along a new fourth mode, generating a tensor of size ( ǫ × feature × window × N_train ), where N_train is the number of individuals in the training set. This is repeated for the tensors of each feature type. We perform higher-order singular value decomposition (HOSVD) [19][20][21] using Tensor Toolbox 22 to reduce the DTCWPT, ABP, and PPG tensors to their core tensors, G_DTCWPT, G_ABP, G_PPG, and reserve their respective transformation matrices ( U_DTCWPT, U_ABP, U_PPG ).
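The HOSVD step can be sketched in numpy (rather than the Tensor Toolbox used by the authors): each factor matrix holds the leading left singular vectors of the corresponding mode unfolding, and the core tensor is the original tensor multiplied by the transposed factors along every mode. The toy tensor shape follows the paper's ( ǫ × feature × window ) layout with illustrative sizes.

```python
import numpy as np

def unfold(T, mode):
    # Mode-n unfolding: the mode-n fibers of T become the columns.
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_multiply(T, M, mode):
    # n-mode product: multiply matrix M along the given mode of T.
    return np.moveaxis(np.tensordot(M, np.moveaxis(T, mode, 0), axes=1), 0, mode)

def hosvd(T, ranks):
    # HOSVD: U_n = leading left singular vectors of the mode-n unfolding;
    # core G = T multiplied by U_n^T along every mode.
    Us = [np.linalg.svd(unfold(T, m), full_matrices=False)[0][:, :r]
          for m, r in enumerate(ranks)]
    G = T
    for m, U in enumerate(Us):
        G = mode_multiply(G, U.T, m)
    return G, Us

# Toy (epsilon x feature x window) tensor with illustrative sizes.
rng = np.random.default_rng(1)
T = rng.standard_normal((5, 8, 5))
G, Us = hosvd(T, ranks=(5, 8, 5))  # full ranks -> lossless

# Reconstruction from the core tensor and the transformation matrices.
T_hat = G
for m, U in enumerate(Us):
    T_hat = mode_multiply(T_hat, U, m)
```

With full ranks the reconstruction is exact; truncating the ranks yields the reduced core tensors used as features.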
Next, we stack all fourth-order tensors from the different feature types along the feature mode (second mode) to create a new tensor T. We then perform a Canonical Polyadic (CP) decomposition 23 to obtain T's four factor matrices (A, B, C, D). Given the tensor T ∈ R^{n_1 × n_2 × n_3 × n_4} and rank r, the CP decomposition produces the tensor

T̂ = Σ_{i=1}^{r} a_i ∘ b_i ∘ c_i ∘ d_i,

where a_i, b_i, c_i, d_i are the columns of A, B, C, D and ∘ denotes the outer product, such that ‖T − T̂‖ is minimized. The tensor Frobenius norm ‖·‖ is defined as

‖T‖ = ( Σ_{i_1} Σ_{i_2} Σ_{i_3} Σ_{i_4} T_{i_1 i_2 i_3 i_4}² )^{1/2}.

We use alternating least squares (ALS) 23 to solve the CP decomposition, setting rank r = 4 as done in Hernandez et al. 14 . After solving for T̂ , we reserve factor matrices A and C for feature extraction. Now that we have the transformation matrices from HOSVD and the factor matrices from CP-ALS, we can compute the core tensor of any DTCWPT, ABP, or PPG tensor in the test set. We perform the following to compute features for each individual j in the test set: using the transformation matrices from HOSVD ( U_DTCWPT, U_ABP, U_PPG ), core tensors are computed for DTCWPT, ABP, and PPG. Next, all of j's tensors are stacked along mode 2 to build the new tensor S_j. Lastly, the factor matrices A and C from T's CP decomposition are used to solve for B via least squares by minimizing

‖S_j(2) − B (C ⊙ A)^T‖.

The optimal solution is calculated as

B = S_j(2) ((C ⊙ A)^T)† ,

where S_j(2) is the mode-2 unfolding of tensor S_j, ⊙ is the Khatri-Rao product, and † is the Moore-Penrose pseudoinverse 23 .
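The final least-squares step can be sketched in numpy by recovering the feature-mode factor B of a synthetic rank-4 CP tensor. One caveat: the Khatri-Rao factor ordering below is chosen to match numpy's row-major reshape of the unfolding, so it appears as (A ⊙ C) here even though the conventional column-major notation writes (C ⊙ A); all sizes and factors are illustrative.

```python
import numpy as np

def khatri_rao(X, Y):
    # Column-wise Kronecker product; rows of X vary slowest, matching the
    # C-order (row-major) unfolding used below.
    r = X.shape[1]
    return np.einsum("ir,jr->ijr", X, Y).reshape(-1, r)

rng = np.random.default_rng(0)
n_eps, n_feat, n_win, r = 5, 6, 5, 4
A = rng.standard_normal((n_eps, r))        # epsilon-mode factor
B_true = rng.standard_normal((n_feat, r))  # feature-mode factor (to recover)
C = rng.standard_normal((n_win, r))        # window-mode factor

# Rank-r CP tensor S_j = sum_i a_i o b_i o c_i for one test individual.
S_j = np.einsum("ir,jr,kr->ijk", A, B_true, C)

# Mode-2 unfolding: feature mode first, remaining modes flattened (C order).
S2 = np.moveaxis(S_j, 1, 0).reshape(n_feat, -1)

# Least squares: minimize ||S2 - B KR^T|| and solve via the pseudoinverse.
KR = khatri_rao(A, C)
B = S2 @ np.linalg.pinv(KR.T)
```

Since the synthetic tensor is exactly rank r, the solve recovers B_true up to numerical precision.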
Machine learning. The ML models are created to predict an occurrence of the nine adverse events detailed in the Data section, in prediction windows 0.5, 1, 2, and 4 hours before the event. The training set consisted of Cohort 1, while the test set was Cohort 2 or 3 exclusively. The incidence rates of adverse events by cohort and prediction window are available in Table 3. Given that clinicians may want to detect more cases with adverse outcomes at the risk of somewhat increased false positives, we balanced the training set to a ratio of 0.35 to 0.65 between positive and negative cases by discarding excess negative cases. No such adjustment was performed for the test sets. Patients without complete signals in the analysis windows (e.g., ECG signals with missing R peaks) were excluded. The respective sample size for each prediction window is available in Table 4. Model training consisted of 3-fold cross-validation (CV) on Cohort 1 data to select optimal hyperparameters for each model using a validation set. Models were then trained on all three folds with the selected hyperparameters in Cohort 1 and tested against Cohort 2 and Cohort 3 data. This process was repeated 101 times, with shuffling of data across CV folds, for each model to obtain the mean area under the receiver operating characteristic curve (AUROC).
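The 0.35/0.65 rebalancing by discarding excess negatives can be sketched as follows; the class labels and sampling seed are illustrative.

```python
import numpy as np

def balance_training_set(y, pos_frac=0.35, seed=0):
    # Return indices of a training subset in which positives make up
    # pos_frac of the cases, achieved by randomly discarding excess
    # negatives (negatives are assumed to be in excess, as in the paper).
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_neg_keep = int(round(len(pos) * (1 - pos_frac) / pos_frac))
    keep = np.concatenate([pos, rng.choice(neg, size=n_neg_keep, replace=False)])
    rng.shuffle(keep)
    return keep

# Example: 100 positives among 900 negatives -> keep roughly 186 negatives.
y = np.array([1] * 100 + [0] * 900)
idx = balance_training_set(y)
```

All positives are retained, so the adjustment trades specificity for sensitivity only through the negatives that are dropped.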
Naive Bayes (NB) models with normal distribution were trained with no hyperparameter tuning, serving as the simplest baseline models.
Random forest (RF) models 24 were trained with varying numbers of trees (50, 75, or 100), minimum leaf size (1, 5, 10, 15, or 20), percentage of features to include for the maximum number of splits (25, 50, 75, or 100%), split criterion (Gini impurity or cross entropy), and number of predictors to sample ([10, 100] in increments of 10), using grid search to select hyperparameters.

Support vector machines (SVM) 25 were trained with a linear kernel via sequential minimal optimization. Grid search was used to determine the optimal box constraint C and scaling parameter γ , where C ∈ [10⁻⁷, 10¹²] and γ ∈ [10⁻¹², 10¹²] consist of logarithmically spaced values.

Table 4. Mean AUROC and standard deviation of the models. The means and standard deviations (SDs) of AUROC for models trained on Cohort 1 (Cardiac Surgery) and tested on Cohorts 2 (Vascular Surgery) and 3 (Acute Non-cardiac Surgery). The n in each row represents the number of deterioration events (i.e., the sample size). The best performance in each prediction window is boldfaced.

Ethics declarations. The human data collection and use were approved by the University of Michigan's Institutional Review Board (IRB) under HUM00092309. Informed consent was obtained, and all experiments in this study were performed in accordance with this approval.
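The hyperparameter search spaces described for RF and SVM can be enumerated as plain grids; each configuration would then be scored by 3-fold cross-validated AUROC (scoring omitted here). The number of logarithmically spaced points per SVM range is an assumption, as the text states only the ranges.

```python
from itertools import product

import numpy as np

# RF grid exactly as described in the text.
rf_grid = {
    "n_trees": [50, 75, 100],
    "min_leaf_size": [1, 5, 10, 15, 20],
    "max_split_features_pct": [25, 50, 75, 100],
    "criterion": ["gini", "cross_entropy"],
    "n_predictors_to_sample": list(range(10, 101, 10)),
}
rf_configs = [dict(zip(rf_grid, vals)) for vals in product(*rf_grid.values())]

# SVM grid: logarithmically spaced C and gamma over the stated ranges
# (grid densities of 20 and 25 points are assumptions).
C_grid = np.logspace(-7, 12, num=20)
gamma_grid = np.logspace(-12, 12, num=25)
svm_configs = [{"C": c, "gamma": g} for c, g in product(C_grid, gamma_grid)]
```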

Results
The AUROCs of each ML model are presented in Table 4.

Cohort 2 - testing with vascular surgery cohort. The RF models achieve the highest AUROCs on testing with Cohort 2. The Learning Using Concave and Convex Kernels (LUCCK) models perform marginally better on Cohort 2 than on the training set, achieving AUROCs similar to those of RF. The NB and SVM models also achieve AUROCs comparable with those obtained on the training set, with the NB models on the 2- and 4-hour windows actually achieving higher AUROCs than their respective training sets. The NB and SVM models demonstrate higher SDs compared to the LUCCK and RF models.

Cohort 3 - testing with acute non-cardiac surgery cohort. With Cohort 3, the overall performance is lower across all models. The RF models achieve the highest performance, with the models on the 0.5- and 1-hour windows achieving AUROCs greater than 0.8. LUCCK models maintain mean AUROCs between 0.74 and 0.77 across all windows, while the SVMs' AUROCs fluctuate to a larger extent, between 0.67 and 0.81. The NB models decline to the greatest extent, exhibiting an AUROC as low as 0.42 on the 0.5-hour window.
Shapley values for the best performing model. The best performing model was the 36th Random Forest model.

Discussion
In this study of postoperative deterioration events among three successively more heterogeneous cohorts of surgical patients, the incidence rates of a patient-level deterioration event were 16-19%, 19-28%, and 75-90%, respectively, although for Cohort 1, we adjusted the rate of positive cases to 35% during training. The best performing models for each cohort yielded AUROCs of 0.94, 0.94 and 0.82, respectively, all for 0.5-hour prediction windows. Our study serves as a proof-of-concept that EHR data and physiologic waveform data may be combined to improve the early detection of postoperative deterioration events, even when trained on one surgical cohort and applied to other cohorts.
The predictive performance is well maintained throughout all prediction windows during training with Cohort 1; the RF models achieve the highest performance. With Cohort 2, likely due to the smaller sample size, the SDs of the models' AUROCs tend to be higher than those of the training set. Nonetheless, these models achieve performance comparable to that achieved on the training set, with some models (especially LUCCK) even exceeding the mean AUROCs achieved in training. Overall, the models trained on Cohort 1 generalize extremely well to Cohort 2.
When the models are tested on Cohort 3, the acute non-cardiac surgery cohort, the performance of all models declines. This is likely due to the paucity of samples in Cohort 3, which is a fraction of the size of Cohort 2, let alone Cohort 1; the SDs of the AUROCs are also much larger than their counterparts in Cohorts 1 and 2. The NB models suffer most from the performance decline, with the models on the 0.5- and 1-hour prediction windows performing near or below the level of random guessing. Given the simplicity of the NB models, it is not surprising that they fail to generalize to a small dataset from a different surgical cohort.
On the other hand, the other three models maintain acceptable performance for most prediction windows despite the paucity of data. The RF models achieve the highest AUROC with the 0.5- and 1-hour prediction windows at 0.82. The AUROC declines to 0.72 for the 2-hour prediction window but recovers to 0.77 at the 4-hour prediction window. The SVM models display even more fluctuation, with the highest AUROC of 0.81 on the 2-hour prediction window and the lowest AUROC of 0.67 on the 1-hour prediction window. The LUCCK models maintain the most consistent range of performance, between 0.74 and 0.77 across all prediction windows, with generally smaller SDs than RF and SVM. LUCCK also does not experience the steep decline that the RF and SVM models show in the 1- and 2-hour prediction windows, respectively. It should also be noted that LUCCK's generalization to Cohort 2 was the best among the four models, showing the largest increase in performance from training when tested on Cohort 2. These findings are consistent with previous reports indicating that LUCCK models are capable of more robust and generalizable performance, especially when the number of samples is small 14,26 , which is often the case with clinical data.
The Shapley values obtained from the best performing model indicate that both the signal and EHR features make meaningful contributions to the prediction, as both categories of features are well represented among the Shapley values with the highest magnitudes (Figure 3). A positive Shapley value indicates the importance of the feature with respect to the positive class (i.e., an occurrence of a deterioration event), while a negative value indicates the same for the negative class (i.e., the non-occurrence of a deterioration event). We may also infer that the signal features contribute more to the prediction of an occurrence, rather than a non-occurrence, of an adverse cardiovascular event, as indicated by the greater number of signal features with strongly positive Shapley values than strongly negative ones.

A few predictive scoring systems for assessing mortality risk in ICU patients have been developed in clinical settings. The APACHE II scoring system, which utilizes 12 physiological measurements, age, and previous health conditions from 5030 ICU patients in 13 hospitals to create a logistic regression model predicting hospital death, reported an AUROC of 0.863 28 .
The SAPS II scoring system, which utilizes 12 physiological variables, age, type of admission, and three underlying disease variables to create a logistic regression model predicting the hospital mortality based on 12,997 ICU patients from 12 countries, reported an AUROC of 0.88 on the developmental dataset and 0.86 on the validation dataset 29 .
The evaluation of Sequential Organ Failure Assessment (SOFA) scoring system 30 , which utilizes physiological variables from various organ systems to assign a score between 0 and 24, demonstrated the highest AUROC of 0.90 based on 352 patients when using the highest SOFA score taken during the entire ICU stay.
Our study differs from these studies in two major aspects: (1) we have included features derived from several digital signal processing (DSP) techniques performed on the physiological waveform data, and (2) our models are trained to predict other clinical deterioration events besides mortality, given that the ultimate form of these models is intended for real-time clinical monitoring in a CDSS. Despite being trained on just a few hundred cases, our best-performing models achieve AUROCs between 0.90 and 0.94 for all prediction windows on Cohort 2. Although Cohort 3's best performing models achieve a maximum AUROC of 0.82, we expect that such performance can be improved by including additional high-quality physiological waveform data, given that the largest sample size in this cohort was 21.
Monitoring patients in the ICU for cardiovascular adverse events and complications is a crucial component of postoperative critical care. Early warning systems for patients predicted to be at risk of major cardiovascular complications in advance of clinicians' attention can enable potentially life-saving interventions to be pursued before deterioration onset and significantly improve clinical outcomes. However, because different surgical procedures imply different underlying patient pathologies, along with varying levels of invasiveness and resultant derangements to organ systems, a CDSS for predicting postoperative adverse outcomes may be challenged by inadequate generalizability across a variety of surgical populations. In this study, we test the hypothesis that models trained on one surgical cohort may be successfully applied to other cohorts that underwent different surgical procedures, by training them on the cardiac surgery cohort and testing them on the vascular and acute non-cardiac surgery cohorts. The results obtained in this work suggest such generalizability and serve as a prototype for a CDSS in which ML models trained on one surgical cohort can be successfully applied to others.
Additionally, one should consider ensembles of different ML models in the implementation of a CDSS, since no model definitively stands out as the best in all cases. For example, in Cohort 3, RF has the highest performance of 0.82 in the 0.5- and 1-hour prediction windows, SVM of 0.81 in the 2-hour window, and LUCCK of 0.77 (equal to that of RF but with a smaller SD) in the 4-hour window.
As for limitations, sufficient data for training were only available for Cohort 1, allowing us to train the models only on Cohort 1 and test on Cohorts 2 and 3, but not vice versa. Adjusting the positive/negative case ratio to increase sensitivity may result in more false positives. Also, all cases in this study were from a single quaternary care center; the care processes and patient populations at other facilities can be substantially different. Another limitation is the small number of negative cases in Cohort 3, resulting in very high incidence rates of adverse events that are unlikely to be encountered in actual clinical settings. In future work, more non-cardiac surgical cases from other surgical care facilities would be needed to further increase the generalizability and robustness of these models. Finally, our results must be interpreted with caution given that they were derived entirely from surgical cohorts and may not extrapolate to non-surgical cohorts.

Conclusion
In this study, we utilized DSP techniques such as TS estimation and DTCWPT to generate features from physiological waveforms. A novel tensor dimension reduction algorithm successfully reduced the feature space while still demonstrating translatable performance across different surgical cohorts. Four different types of ML models were trained on a cardiac surgery cohort and tested on different surgery cohorts. The RF models perform the best in terms of AUROC, but the LUCCK models maintain the most consistent range of performance, especially when sample sizes are small. Our study suggests that ML models trained on a combination of waveform and EHR data from one group of surgical cases to predict life-threatening cardiovascular complications have the potential to be successfully applied to other types of surgical cases, opening the door to clinical decision support systems that enable early detection of such events and timely interventions to improve clinical outcomes across a wide variety of surgical cases.

Data availability
The datasets generated or analyzed in this study were collected by Michigan Medicine. The University of Michigan's Innovation Partnerships (UMIP) office will handle potential charges/arrangements of the use of data by external entities, using such methods as material transfer agreements. The executable code developed in this work is available from the corresponding author upon UMIP approval.

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.