Dynamic predictions of postoperative complications from explainable, uncertainty-aware, and multi-task deep neural networks

Accurate prediction of postoperative complications can inform shared decisions regarding prognosis, preoperative risk-reduction, and postoperative resource use. We hypothesized that multi-task deep learning models would outperform conventional machine learning models in predicting postoperative complications, and that integrating high-resolution intraoperative physiological time series would result in more granular and personalized health representations that would improve prognostication compared to preoperative predictions. In a longitudinal cohort study of 56,242 patients undergoing 67,481 inpatient surgical procedures at a university medical center, we compared deep learning models with random forests and XGBoost for predicting nine common postoperative complications using preoperative, intraoperative, and perioperative patient data. Our study indicated several significant results across experimental settings that suggest the utility of deep learning for capturing more precise representations of patient health for augmented surgical decision support. Multi-task learning improved efficiency by reducing computational resources without compromising predictive performance. Integrated gradients interpretability mechanisms identified potentially modifiable risk factors for each complication. Monte Carlo dropout methods provided a quantitative measure of prediction uncertainty that has the potential to enhance clinical trust. Multi-task learning, interpretability mechanisms, and uncertainty metrics demonstrated potential to facilitate effective clinical implementation.


INTRODUCTION
In the United States, more than 15 million major, inpatient surgeries are performed each year. 1 Complications occur in up to 32% of cases; major complications decrease quality of life and increase health care costs by as much as $11,000. 2,3 Accurate, personalized predictions of postoperative complications can inform shared decisions between patients and surgeons regarding prognosis, the appropriateness of surgery, prehabilitation strategies targeting modifiable risk factors (e.g., smoking cessation), and postoperative resource use (e.g., triage to intensive care or general wards), suggesting opportunities to augment clinical risk prediction with objective, machine learning-enabled decision-support.
Most existing perioperative predictive analytic decision-support tools are hindered by suboptimal performance, time constraints imposed by manual data entry requirements, and lack of intraoperative data and clinical workflow integration. [4][5][6][7][8][9] These challenges are theoretically mitigated by automated deep learning models that capture latent, nonlinear data structure and relationships among raw feature representations in large datasets, 10 now widely available in electronic health records (EHRs). 11 Despite these potential advantages, [12][13][14][15][16][17][18][19] deep learning using the full spectrum of preoperative and intraoperative, patient-specific EHR data to predict postoperative complications has not been previously reported. Recognition that deep learning models with high overall accuracy are nevertheless capable of egregious errors, along with their lack of interpretability, has invited skepticism regarding the clinical application of deep learning-enabled decision-support; model interpretability and uncertainty-awareness mechanisms have the potential to improve clinical applicability, but their efficacy remains unclear.
Using a longitudinal cohort of 56,242 patients who underwent 67,481 inpatient surgeries, we tested the hypothesis that deep learning models would outperform random forest models in predicting postoperative complications using both preoperative and intraoperative physiological time series data. We also explored the utility of multi-task learning 20 by training a single deep learning model on several postoperative complications simultaneously to improve model efficiency, integrated gradients to promote model interpretability, and uncertainty metrics that represent variance across predictions.

RESULTS

Participant Baseline Characteristics and Outcomes
Cohort characteristics are summarized in Table 1 and detailed cohort statistics are presented in Supplementary Tables S1-S4. The overall study population had a mean age of 56 years and 50% were female. In the validation cohort of 20,293 surgical procedures, the incidence of complications was: 33.3% prolonged ICU stay (48 hours or more), 7.8% prolonged mechanical ventilation, 20.2% neurological complications, 16.9% acute kidney injury, 16.3% cardiovascular complications, 5.4% venous thromboembolism, 21.4% wound complications, 8.7% sepsis, and 1.6% in-hospital mortality. The distribution of complications was similar between development and validation cohorts.

Multi-Task Learning Improved Efficiency without Compromising Predictive Performance
For deep learning models trained on preoperative data alone, there were no significant differences between multi-task and outcome-specific models. For models trained on intraoperative time series alone, the multi-task model yielded significantly higher AUROC for some outcomes. For models trained on both preoperative and intraoperative data, the multi-task postoperative model yielded somewhat higher AUROC for prolonged mechanical ventilation, sepsis, venous thromboembolism, and in-hospital mortality, and lower AUROC for prolonged ICU stay, wound complications, neurological complications, and acute kidney injury, though the differences were not statistically significant. A comprehensive AUROC comparison between individual models and multi-task learning is shown in Fig. 1a-c. Given that multi-task models had marginally stronger performance and reduced computational requirements and training times compared with nine individual models, the multi-task approach is used henceforth as our deep learning-based postoperative model, unless stated otherwise. Full results are shown in Supplementary Table S5.
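The shared-trunk, multi-head arrangement that makes this efficiency gain possible can be sketched as follows. This is a minimal NumPy illustration with synthetic weights and data, not the study's trained model: one hidden representation feeds nine outcome-specific sigmoid heads, and a single summed binary cross-entropy loss trains all nine at once.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

N_FEATURES, N_HIDDEN, N_OUTCOMES = 402, 64, 9  # nine complications share one trunk

# Shared trunk weights plus one small head per complication outcome.
W_trunk = rng.normal(0, 0.1, (N_FEATURES, N_HIDDEN))
heads = [rng.normal(0, 0.1, (N_HIDDEN, 1)) for _ in range(N_OUTCOMES)]

def predict(x):
    """One shared patient representation feeds nine outcome-specific heads."""
    h = relu(x @ W_trunk)                       # shared representation
    return np.array([sigmoid(h @ w).item() for w in heads])

def multitask_loss(probs, labels, eps=1e-12):
    """Sum of per-outcome binary cross-entropies trains all heads jointly."""
    p = np.clip(probs, eps, 1 - eps)
    return float(np.sum(-(labels * np.log(p) + (1 - labels) * np.log(1 - p))))

x = rng.normal(size=N_FEATURES)        # one synthetic preoperative feature vector
y = rng.integers(0, 2, N_OUTCOMES)     # synthetic labels for the nine outcomes
probs = predict(x)
print(probs.shape, multitask_loss(probs, y))
```

Training one trunk instead of nine separate networks is where the reduction in computation and training time comes from: a single backward pass updates the shared representation for all outcomes.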

Deep Learning Outperformed Random Forests
Both deep learning and baseline models used the same feature sets with one exception: due to the nature of sequential deep learning methods, our deep intraoperative models processed the entire physiological time series minute-by-minute, whereas the baseline intraoperative and postoperative models required extraction of summary statistics. A full list of random forest time series features is described in Supplementary Table S6. A full comparison among all models, performance metrics, and complication outcomes is described in Supplementary Methods and Supplementary Table S5.

Preoperative Models
The deep multi-task model trained only on static, preoperative descriptors yielded higher AUROC compared with baseline random forest models for all nine outcomes, with significant performance increases for prolonged mechanical ventilation.

Intraoperative Models
Using intraoperative time series input data alone, multi-task deep learning yielded higher AUROC for all complications except prolonged ICU stay, for which AUROC was equivalent. Significant AUROC improvements were yielded for wound complications.

Postoperative Models
The deep postoperative multi-task model trained on all available data yielded the results shown in Fig. 1d. Using deep multi-task preoperative predictions as a benchmark, the deep multi-task postoperative models made significant overall reclassification improvements for prolonged ICU stay (correctly reclassifying 3.7% of all surgical encounters, p<0.01), prolonged mechanical ventilation (correctly reclassifying 4.8%, p<0.01), and cardiovascular complications (correctly reclassifying 0.3%, p<0.01). There were no statistically significant declines in reclassification. In some cases, deep models for individual complications yielded better net reclassification indices than multi-task models, including wound complications.

Model Uncertainty
We applied the method of Monte Carlo dropout to derive measures of prediction uncertainty, representing variance across predictions, for each of our deep learning models. Uncertainty results for each prediction phase and training procedure are shown in Table 2, where uncertainty is expressed as prediction variance over 100 stochastic trials using dropout at inference time. Interestingly, models trained only using intraoperative data resulted in the lowest uncertainty for each postoperative complication. Within each outcome and prediction phase, individual models yielded lower predictive uncertainty compared with multi-task model counterparts. Using the models with the least uncertain training scheme for each outcome and prediction phase, postoperative predictions were less uncertain than preoperative predictions for prolonged mechanical ventilation, wound complications, cardiovascular complications, and in-hospital mortality; postoperative uncertainty was higher for the remaining five complications.
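The Monte Carlo dropout procedure described above can be sketched as follows: dropout stays active at inference time, the same input is passed through the network repeatedly, and the variance of the resulting predictions serves as the uncertainty measure. The toy two-layer network and its weights below are arbitrary stand-ins, not the study's trained model.

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy two-layer network; weights are arbitrary stand-ins for a trained model.
W1 = rng.normal(0, 0.5, (20, 32))
W2 = rng.normal(0, 0.5, (32, 1))

def predict_with_dropout(x, p_drop=0.2):
    """One stochastic forward pass: dropout remains ACTIVE at inference time."""
    h = np.maximum(x @ W1, 0.0)
    mask = rng.random(h.shape) > p_drop      # randomly silence hidden units
    h = h * mask / (1.0 - p_drop)            # inverted-dropout rescaling
    return sigmoid(h @ W2).item()

def mc_dropout_uncertainty(x, n_trials=100):
    """Monte Carlo dropout: repeat stochastic passes, report mean and variance."""
    preds = np.array([predict_with_dropout(x) for _ in range(n_trials)])
    return preds.mean(), preds.var()

x = rng.normal(size=20)                      # one synthetic patient vector
mean_risk, uncertainty = mc_dropout_uncertainty(x)
print(f"risk={mean_risk:.3f}  variance={uncertainty:.5f}")
```

The mean over the 100 stochastic trials serves as the reported risk, and the variance is the per-prediction uncertainty value of the kind tabulated in Table 2.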

Model Interpretability
We applied integrated gradients to our multi-task deep learning postoperative prediction model. The top 10 features per complication outcome for every sample in the validation cohort are shown with corresponding attribution scores in Table 3.

Prolonged ICU Stay
The most important feature was peak inspiratory pressure; the presence of such a value indicates the performance of mechanical ventilation, and higher values could indicate intrinsic lung disease, proximal airway or breathing tube narrowing or obstruction, or the transmission of increased intra-abdominal pressure, each of which suggest greater illness severity. The next two most important features were heart rate and blood oxygen saturation, both of which are major determinants of cardiac output and oxygen delivery.

Prolonged Mechanical Ventilation
Peak inspiratory pressure and heart rate were again top features, along with fraction of inspired oxygen, the number one feature. This result is consistent with prior observations that most etiologies of hypoxemia improve with increasing fraction of inspired oxygen, apart from right-to-left shunt, which is often accompanied by another pathophysiologic process that is responsive to higher fraction of inspired oxygen.

Wound Complications
The major factors affecting wound complications (i.e., infection, dehiscence, and non-healing) are the type of surgery and its associated degree of wound contamination. 22,23 These factors are aligned with the top five important features for wound complication prediction: primary procedure, surgeon specialty, attending surgeon, surgery type, and scheduled surgery room. Although body mass index is unexpectedly missing from the top 10 feature list, several other factors relate to known risk factors for wound complications, including malnutrition, long duration of surgery, blood loss, and anemia.

Neurological Complications
Similar to wound complications, neurological complications are primarily a function of type of surgery; neurosurgical procedures typically involve pre-existing neurological pathology and confer above-average risk for postoperative neurological pathology relative to other types of surgery. Accordingly, primary procedure and surgery type were the top two important features in predicting neurological complications.

Cardiovascular Complications
Cardiovascular complications may be caused by or lead to cardiac and respiratory pathophysiology, primarily measured by cardiac and respiratory vital signs and mechanical ventilator measurements. 24 Consistent with these phenomena, the top five important features for cardiovascular complications were systolic blood pressure, peak inspiratory pressure, blood oxygen saturation, heart rate, and diastolic blood pressure.

Sepsis
Important features for sepsis were similar to those of wound complications, with the exception of heart rate, which was the most important feature for sepsis. One might expect that fever, leukocytosis, and hypotension would be important features in predicting sepsis, but it is possible that these elements would occur later after surgery when sepsis was developing as a postoperative complication, and they can also represent sterile postoperative inflammation from tissue damage without infection. Heart rate variability, which would be learned from intraoperative time series heart rate values, is well established as a strong predictor of sepsis and associated adverse outcomes. 25,26

Acute Kidney Injury
Serum creatinine is the primary method for measuring kidney function among hospitalized patients and tends to be more reliable than volume of urine output, which is difficult to record accurately in the absence of an indwelling bladder catheter.
Accordingly, the number one important feature in predicting acute kidney injury was serum creatinine. Several other important features represented kidney perfusion or red blood cell production, which is affected by the endogenous renal hormone erythropoietin.

Venous Thromboembolism
Major risk factors for venous thromboembolism are encompassed by Virchow's triad of vessel injury, altered blood flow, and hypercoagulability. 27 These elements are represented in two of the top three important features for predicting venous thromboembolism (i.e., primary procedure and serum prothrombin time), as well as several other variables in the top 10 feature list.

DISCUSSION
In predicting postoperative complications among adult patients undergoing major, inpatient surgery, deep neural networks outperformed random forest classifiers, exhibiting strongest performance when leveraging the full spectrum of preoperative and intraoperative EHR data. Intraoperative physiological time-series had meaningful associations with postoperative patient outcomes, suggesting that prediction models augmented with intraoperative data may have utility for routine clinical tasks such as sharing prognostic information with patients and caregivers and making clinical management decisions regarding triage destination and resource use after surgery.
Deep models maintained high performance using efficient multi-task methods predicting nine complications simultaneously, rather than predicting individual complications with separate models that require extra training time. Uncertainty metrics revealed that variance across model predictions is lowest when using intraoperative data alone, consistent with the perspective that many preoperative EHR predictor variables represent clinician decision-making (e.g., the lack of preoperative bilirubin values indicates a decision to forego hepatic function testing) rather than pure physiology, and therefore introduce greater variance in predictions. Finally, applying integrated gradients interpretability methods elucidated feature importance patterns that were biologically plausible and consistent with medical knowledge, experience, and evidence, harboring the potential to gain trust from patients and clinicians. 21
Previous studies have established that for many clinical prediction tasks, deep neural networks outperform other methods, such as logistic regression classifiers. 28,29 Parametric regression equations often fail to accurately represent complex, non-linear associations among input variables, limiting their predictive performance. More than thirty years ago, Schwartz et al. 30 suggested that human disease is too broad and complex to be accurately represented by rule-based algorithms, and that machine learning models obviate this limitation by learning from data. In our study, deep learning also outperformed random forest models, likely because the deep models capitalized on the availability of intraoperative time series data. As EHR data volumes expand, deep learning healthcare applications gain greater potential for clinical application. 31 However, this will require integration with real-time clinical workflow. Therefore, it seems prudent to design models that make updated predictions as EHR data become available. We sought to achieve this objective by using recurrent neural networks that can update their predictions as new data become available. Our results suggest that these models would perform well in prospective clinical settings.
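The recurrent-update idea can be sketched with a minimal Elman-style cell; this is not the study's architecture, just an illustration of how a recurrent model folds in each new minute of intraoperative vitals and emits a refreshed risk estimate. Weights and data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

N_VITALS, N_HIDDEN = 14, 16              # 14 intraoperative physiological signals
Wx = rng.normal(0, 0.2, (N_VITALS, N_HIDDEN))
Wh = rng.normal(0, 0.2, (N_HIDDEN, N_HIDDEN))
w_out = rng.normal(0, 0.2, N_HIDDEN)

def streaming_risk(timesteps):
    """Elman-style recurrent cell: fold in one minute of vitals at a time and
    emit an updated complication-risk estimate after every step."""
    h = np.zeros(N_HIDDEN)
    risks = []
    for x_t in timesteps:
        h = np.tanh(x_t @ Wx + h @ Wh)     # update hidden state with new data
        risks.append(sigmoid(h @ w_out))   # refreshed prediction each minute
    return risks

minutes = rng.normal(size=(120, N_VITALS))   # two hours of synthetic vitals
risks = streaming_risk(minutes)
print(len(risks), round(risks[-1], 3))
```

Because the hidden state carries everything seen so far, the same model serves the preoperative-to-postoperative workflow: each new operating-room measurement revises the running prediction without retraining.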
Multi-task methods did not yield predictive performance advantages in our study, but they have yielded performance advantages in previous studies. Multi-task learning can improve model generalizability by penalizing the exploration of certain regions of the available function space, thus reducing overfitting from the false assumption that data noise is sparse or absent. This was demonstrated by Si and Roberts 32 in applying CNN multi-task learning to word embeddings in MIMIC-III clinical notes data, showing that multi-task learning models outperformed single-task models in predicting mortality within 1, 3, 5, and 20 different timeframes. In addition, multi-task learning can act as a regularizer for learning classifiers from a finite set of examples by penalizing complexity in a loss function, as demonstrated by Harutyunyan et al. 20 in predicting mortality and physiological decompensation among ICU patients in the publicly available MIMIC-III database. 33 However, multi-task learning was not advantageous for phenotyping acute care conditions; the authors postulated that this occurred because phenotyping is multi-task by nature, i.e., already benefits from regularization across phenotypes. This may not hold true for rare, complex phenotypes, for which multi-task learning can reduce neural network sensitivity to hyperparameter settings (i.e., parameters that are set before learning begins), as demonstrated by Ding et al. 34 Properly applied, multi-task learning can improve model generalizability and classification in deep learning clinical prediction models, optimizing performance and usability across diverse settings and datasets, with the added advantage of reduced model training times relative to training multiple individual models.
One barrier to clinical adoption of deep learning clinical prediction models is difficulty interpreting outputs. Patients, caregivers, and clinicians may be more willing to incorporate model predictions in shared decision-making processes if they understand how and why a prediction was made and believe that the prediction is consistent with medical knowledge and evidence. Integrated gradients techniques attempt to explain predictions made by deep learning models, usually by feeding perturbed inputs to the model, evaluating effects on outputs, and using this information to quantify and convey feature importance. Sayres et al. 35 used integrated gradients to identify retinal image regions contributing to deep learning-based diabetic retinopathy diagnoses, which was associated with improved ophthalmologist diagnostic accuracy and confidence. These methods have the potential to facilitate clinical adoption of deep learning prediction models by allowing patients, caregivers, and clinicians to understand how and why an output was produced. Finally, demonstrating low variance across predictions with uncertainty metrics could assuage well-founded patient and clinician fears that an individual model output represents a rare but egregious prediction error, for which deep learning models are infamous.
This study was limited by its single-institution, retrospective design. Although multi-task methods may reduce overfitting, the use of data from a single institution limits generalizability. Our models have not been tested using prospective, real-time data, which may present data pre-processing challenges. Future research should seek prospective, multi-center validation of these findings. This will be difficult to perform until cloud sharing of standardized EHR data or federated learning are achieved at scale. 36 Finally, it remains unknown how the predictions generated by models presented herein would affect shared decision-making processes and patient outcomes.
In summary, deep learning yielded greater discrimination than random forests for predicting complications after major, inpatient surgery. Uncertainty metrics and predictive performance were optimal when leveraging the full spectrum of preoperative and intraoperative physiologic time-series data as predictor variables in an efficient multi-task deep learning model. Uncertainty-aware deep learning may have utility for understanding the probability that a prediction deviates substantially from usual predictions and represents a rare, major prediction error. Integrated gradients interpretability mechanisms identified biologically plausible important features. The accurate, interpretable, uncertainty-aware predictions presented herein require further investigation regarding their potential to augment surgical decision-making during preoperative and immediate postoperative phases of care.

METHODS
All analyses were performed on a retrospective, single-center, longitudinal cohort of surgical patients that included data from both preoperative and intraoperative phases of care. We used deep learning and random forest models to predict the onset of nine major postoperative complications following surgery with three primary objectives: (1) compare deep learning techniques with random forest models in predicting postoperative complications, (2) compare deep learning predictions made at two phases of perioperative care: immediately before surgery (using preoperative data alone, referred to henceforth as preoperative prediction), and immediately after surgery by two different methods: (a) using intraoperative data alone (referred to henceforth as intraoperative prediction), and (b) using both preoperative and intraoperative data (referred to henceforth as postoperative prediction), and (3) explore multi-task learning, interpretability, and uncertainty mechanisms for these models. Recommendations were followed from both the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD 37 ) guidelines and from best practices for prediction modeling from Leisman et al. 38 All methods were performed in accordance with relevant guidelines and regulations.

Data Source
The University of Florida Integrated Data Repository was used as an honest broker to build a longitudinal dataset representing patients admitted to University of Florida Health between June 1, 2014 and September 20, 2020 who were at least 18 years of age and underwent at least one surgical procedure during hospitalization. The dataset was constructed by integrating electronic health records with other clinical, administrative, and public databases. 9 The resulting dataset included information on patient demographics, laboratory values, vital signs, diagnoses, medications, blood product administration, procedures, and clinical outcomes, as well as detailed intraoperative physiologic and monitoring data.

Predictors
Our final cohort included electronic health record data from both before and during surgery. Preoperative models were trained on data available between one year prior to surgery and the day of surgery, prior to surgery start time (i.e., preoperative features alone). Intraoperative models were trained on data created during the surgical procedure (i.e., intraoperative features alone). Postoperative models were trained on data available between one year prior to surgery through the end of the surgical procedure (i.e., both preoperative and intraoperative features).
We identified 402 preoperative features, including demographic and socioeconomic indicators, planned procedure and provider information, Charlson comorbidities, and summary statistics of select medications, laboratory tests, and physiological measurements (e.g., vital signs such as heart rate and blood pressure) taken prior to a surgical procedure over one-year and one-week time windows. We calculated Charlson comorbidity indices using International Classification of Diseases (ICD) codes. 39 We modeled procedure types on ICD-9-CM codes with a forest structure in which nodes represent groups of procedures, roots represent the most general groups of procedures, and leaf nodes represent specific procedures. Medications were derived from RxNorm codes grouped into drug classes as previously described.
Intraoperative data consisted of 14 physiological measurements taken during surgery: systolic blood pressure, diastolic blood pressure, mean arterial pressure, heart rate, blood oxygen saturation (SpO2), fraction of inspired oxygen (FiO2), end-tidal carbon dioxide (EtCO2), tidal volume, respiration rate, peak inspiratory pressure (PIP), minimum alveolar concentration (MAC), temperature, urine output, and operative blood loss. These variables were presented to deep learning models as variable-length multivariate time series. For random forest models, a set of 49 statistical features were extracted from each encounter's intraoperative measurements. Supplementary Table S6 summarizes all input features and relevant preprocessing procedures.
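The flattening step for the random forest models can be illustrated with a small subset of the 49 statistics (the full list is in Supplementary Table S6); the heart-rate series below is synthetic, and this is an illustrative sketch rather than the study's extraction code.

```python
import numpy as np

def longest_run(mask):
    """Length of the longest consecutive run of True values."""
    best = cur = 0
    for flag in mask:
        cur = cur + 1 if flag else 0
        best = max(best, cur)
    return best

def summarize_series(values):
    """Flatten a variable-length intraoperative time series into fixed-size
    features; a subset of the 49 statistics used by the baseline models."""
    v = np.asarray(values, dtype=float)
    diffs = np.diff(v)
    return {
        "min": v.min(),
        "max": v.max(),
        "mean": v.mean(),
        "median": float(np.median(v)),
        "std": v.std(),
        "abs_energy": float(np.sum(v ** 2)),
        "abs_sum_of_changes": float(np.sum(np.abs(diffs))),
        "count_above_mean": int(np.sum(v > v.mean())),
        "longest_strike_above_mean": longest_run(v > v.mean()),
        "sequence_length": len(v),
    }

hr = [72, 75, 80, 90, 88, 70, 65, 64, 66, 71]  # synthetic minute-by-minute heart rate
feats = summarize_series(hr)
print(feats["mean"], feats["longest_strike_above_mean"])
```

Because every series reduces to the same fixed-length feature vector, surgeries of any duration become comparable inputs for a random forest, at the cost of discarding the minute-by-minute dynamics that the sequential deep models retain.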

Participants
We excluded patients with intraoperative mortality or who were missing the variables necessary to classify postoperative complications. If a single patient's hospital encounter included more than one surgery, only the first surgery during that encounter was included in our analyses. Our final dataset included 56,242 patients who underwent 67,481 surgeries. Supplementary Fig. S5 illustrates derivation of the study population and cohort selection criteria.

Outcomes
We used several different machine learning methods to model the risk of nine postoperative complications: prolonged intensive care unit stay (greater than 48 hours), prolonged mechanical ventilation requirement (greater than 48 hours), neurological complications, cardiovascular complications, acute kidney injury, sepsis, venous thromboembolism, wound complications, and in-hospital mortality.

Sample Size
We chronologically divided our perioperative cohort into development and validation sets. Using a validation cohort of 20,293 surgeries, the overall sample size allows for a maximum width of the 95% confidence interval for area under the receiver operating characteristic curve (AUROC) between 0.01 and 0.03 for postoperative complications with prevalence ranging between 5.4% and 33.3%, assuming AUROC of 0.80 or higher. The sample size allows for a maximum width of 0.06 for in-hospital mortality given 1.6% prevalence.

Predictive Analytic Workflow
The postoperative models update preoperative risk predictions using data collected during surgery. This workflow emulates clinical scenarios in which patients' preoperative information is enriched by the influx of new data from the operating room.
The model consists of two main layers, preoperative and intraoperative, each containing a data transformer core and a data analytics core. 9 The data transformer integrates data from multiple sources, including EHR data with zip code links to US Census data for patient neighborhood characteristics and distance from the hospital. The data transformer then performs preprocessing and feature transformation steps to optimize the data for analysis. In the multi-task setting, this preoperative data representation was passed through nine branches corresponding to our nine postoperative complication outcomes.
Each branch contained one outcome-specific fully connected layer followed by a sigmoid activation function to produce a per-outcome prediction score, interpreted as the probability of a preoperative patient developing a given postoperative complication. We apply the method of integrated gradients to our final postoperative multi-task model to illuminate the specific input features that yielded the largest impacts on predicting each of the nine complication outcomes. A complete discussion of this technique is beyond the scope of this study; we refer interested readers to the work of Sundararajan et al. 43 Briefly, integrated gradients is a comparative technique for local interpretability, centered on the analysis of model outputs for a given input and corresponding baseline values, that assigns attribution values to every input feature. In theory, the features most influential to a given prediction receive larger attribution values; taken over an entire population, this can reveal the importance of the features that drive model predictions. We use a zero-vector reference value for these computations, and because all variables are Z-normalized to zero mean and unit variance, such a reference can be viewed as the per-variable mean value across the entire cohort.
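The computation can be sketched end-to-end with a stand-in model whose input gradient has a closed form, so no autodiff library is needed; the logistic scorer, its weights, and the patient vector below are illustrative assumptions, not the study's network. The sketch also checks the completeness axiom from Sundararajan et al.: attributions sum to the difference between the model output at the input and at the baseline.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Stand-in model: logistic score over Z-normalized features, chosen because
# its input gradient is analytic (a real network would use autodiff).
w = np.array([1.5, -2.0, 0.5, 0.0])
b = 0.1

def model(x):
    return sigmoid(x @ w + b)

def grad_model(x):
    s = model(x)
    return s * (1.0 - s) * w          # d(sigmoid(w.x + b)) / dx

def integrated_gradients(x, baseline, steps=200):
    """Average the input gradient along the straight path from baseline to x,
    then scale by (x - baseline), per Sundararajan et al."""
    alphas = (np.arange(steps) + 0.5) / steps        # midpoint Riemann sum
    path_grads = np.array([grad_model(baseline + a * (x - baseline))
                           for a in alphas])
    return (x - baseline) * path_grads.mean(axis=0)

x = np.array([0.8, -1.2, 0.3, 2.0])   # one Z-normalized patient vector
baseline = np.zeros_like(x)           # zero vector = per-feature cohort mean
attr = integrated_gradients(x, baseline)

# Completeness axiom: attributions sum to f(x) - f(baseline).
print(attr, attr.sum(), model(x) - model(baseline))
```

Note that the fourth feature, whose weight is zero, receives zero attribution, which is the behavior one wants from a faithful attribution method: features the model ignores are reported as unimportant.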

Model Validation
All models were validated on surgeries performed through September 20, 2020. For each model performance metric, ninety-five percent nonparametric confidence intervals were calculated using 1,000 bootstrapped samples with replacement.
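The bootstrap procedure can be sketched as follows for AUROC; the rank-based AUROC estimator, the synthetic labels, and the risk scores are illustrative, not the study's data or code.

```python
import numpy as np

rng = np.random.default_rng(7)

def auroc(y_true, y_score):
    """Rank-based AUROC: probability a positive case outranks a negative one."""
    y_true = np.asarray(y_true)
    order = np.argsort(y_score)
    ranks = np.empty(len(y_score))
    ranks[order] = np.arange(1, len(y_score) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def bootstrap_ci(y_true, y_score, n_boot=1000, alpha=0.05):
    """Nonparametric 95% CI: resample encounters with replacement 1,000 times
    and take the empirical 2.5th and 97.5th percentiles of the statistic."""
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if y_true[idx].sum() in (0, n):       # degenerate resample, skip
            continue
        stats.append(auroc(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Synthetic validation set: scores weakly separate positives from negatives.
y = rng.integers(0, 2, 500)
scores = y * 0.4 + rng.normal(0.5, 0.3, 500)
lo, hi = bootstrap_ci(y, scores)
print(f"AUROC={auroc(y, scores):.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```

The same resampling loop applies unchanged to sensitivity, specificity, or any other scalar metric: only the statistic computed inside the loop changes.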

Model Performance
Model performance was evaluated by sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, area under the precision-recall curve (AUPRC), and area under the receiver operating characteristic curve (AUROC).

Reported metrics include class predictions based on a Youden's index threshold on predicted risk scores, which maximizes sensitivity and specificity, as the cutoff point for low versus high risk. 44 When predicting rare events, models can exhibit deceivingly high accuracy by predicting negative outcomes in predominantly negative datasets. 45
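Choosing the cutoff by Youden's index amounts to scanning candidate thresholds and keeping the one that maximizes J = sensitivity + specificity - 1. A small sketch with synthetic labels and scores (not the study's data):

```python
import numpy as np

def youden_threshold(y_true, y_score):
    """Pick the risk-score cutoff maximizing J = sensitivity + specificity - 1."""
    best_j, best_t = -1.0, None
    for t in np.unique(y_score):              # every observed score is a candidate
        pred = y_score >= t
        tp = np.sum(pred & (y_true == 1))
        tn = np.sum(~pred & (y_true == 0))
        sens = tp / max(np.sum(y_true == 1), 1)
        spec = tn / max(np.sum(y_true == 0), 1)
        j = sens + spec - 1.0
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j

# Synthetic risk scores: positives cluster high, negatives low.
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.35, 0.6, 0.4, 0.7, 0.8, 0.9, 0.95])
thr, j = youden_threshold(y, scores)
print(f"cutoff={thr}  Youden J={j:.2f}")
```

Unlike a fixed 0.5 cutoff, this threshold adapts to each outcome's prevalence, which matters for rare events such as in-hospital mortality where accuracy alone is misleading.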

DATA AVAILABILITY

Data are available from the University of Florida Institutional Data Access/Ethics Committee for researchers who meet the criteria for access to confidential data and may require additional IRB approval.

COMPETING INTERESTS
The authors declare no competing interests.

Supplementary table footnotes (input features and preprocessing):

a. For continuous variables, values that fell in the top and bottom 1% of the distribution were considered outliers and capped to the respective values at the 1st and 99th percentiles.
b. Missing numerical values were replaced with the median from the development cohort, and missing nominal variables were assigned to a distinct "missing" category.
c. Continuous variables were standardized to zero mean and unit variance.
d. Nominal variables with fewer than 10 levels were represented as zero vectors of length equal to the number of levels, with level indicators equal to one.
e. Using residency zip code, we linked to US Census data to calculate residing neighborhood characteristics and distance from the hospital.
f. Nominal variables with 10 levels or greater were transformed to a numeric integer identifier ranging from 0 to the number of unique levels minus one, where implicit variable representations were learned as part of the model training process.
g. To preserve relative proximity, temporally recurring features such as month and day of admission were cyclically embedded as two separate features by sine- and cosine-based transformation. For example, December (12) is near January (1), and Sunday (7) is near Monday (1).
h. Medications taken within the one-year timeframe prior to surgery were derived from RxNorm data grouped into drug classes according to the US Department of Veterans Affairs National Drug File-Reference Terminology. 24
i. Measurement values lying outside expert-defined clinically normal value ranges for each variable were discarded. If two measurements existed at the same timestamp for a given patient, one was kept at random.
j. For each surgical procedure, a time series was constructed by arranging intraoperative measurements chronologically, resampling to one-minute intervals, performing linear interpolation in both directions (except for blood loss and urine output, which were imputed with zero), and imputing the development-cohort median at every timestep for procedures lacking a single measurement of a particular variable.
k. For baseline models, a set of 49 statistical features was extracted from each intraoperative time series: minimum, maximum, mean, median, standard deviation, sum of values, variance, kurtosis, skewness, absolute energy, absolute sum of changes, counts above and below the mean, first and last locations of both minimum and maximum, sequence length, longest strike above and below the mean, mean absolute change, mean change, ratio of unique values to sequence length, variance larger than standard deviation, 9 quantiles, 9 index mass quantiles, 10-binned entropy, number of peaks, and range count.

Figure S4. Temporal integrated gradients attributions for an example patient developing postoperative cardiovascular complications. The model correctly predicted elevated risk based on intraoperative physiological time series.
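The preprocessing steps in footnotes a-c and g can be sketched as follows; the development-cohort values and the new encounter's measurements are synthetic, and the function names are illustrative, not the study's pipeline.

```python
import numpy as np

def preprocess_continuous(dev, new):
    """Footnotes a-c: cap outliers at the development cohort's 1st/99th
    percentiles, impute missing values with the development median, and
    standardize to zero mean and unit variance."""
    dev = np.asarray(dev, dtype=float)
    p01, p99 = np.percentile(dev, [1, 99])
    capped_dev = np.clip(dev, p01, p99)
    med = np.median(capped_dev)
    mu, sd = capped_dev.mean(), capped_dev.std()
    v = np.asarray(new, dtype=float)
    v = np.where(np.isnan(v), med, v)     # footnote b: median imputation
    v = np.clip(v, p01, p99)              # footnote a: outlier capping
    return (v - mu) / sd                  # footnote c: standardization

def cyclic_encode(value, period):
    """Footnote g: embed recurring time features so December (12) sits next
    to January (1); use period=12 for month, period=7 for weekday."""
    angle = 2 * np.pi * value / period
    return np.sin(angle), np.cos(angle)

dev = np.random.default_rng(1).normal(100, 15, 5000)   # development cohort values
x = np.array([40.0, np.nan, 102.0, 300.0])             # new encounter, with a gap
z = preprocess_continuous(dev, x)
dec_sin, dec_cos = cyclic_encode(12, 12)
jan_sin, jan_cos = cyclic_encode(1, 12)
print(z.round(2), np.hypot(dec_sin - jan_sin, dec_cos - jan_cos).round(2))
```

In the cyclic embedding, the distance between December and January equals the distance between any other pair of adjacent months, which a raw integer month number would not preserve.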