Ensemble machine learning prediction and variable importance analysis of 5-year mortality after cardiac valve and CABG operations

Despite a post-operative complication profile similar to that of coronary artery bypass grafting (CABG), cardiac valve operations are associated with a higher mortality rate, and few predictors of long-term mortality are known. In this study, we applied an ensemble machine learning (ML) algorithm to 88 routinely collected peri-operative variables to predict 5-year mortality after different types of cardiac operations. The Super Learner algorithm was trained on prospectively collected peri-operative data from 8241 patients who underwent cardiac valve, CABG, and combined operations. Model performance and calibration were determined for all models, and variable importance analysis was conducted for all peri-operative parameters. Predictive accuracy was highest for solitary mitral (0.846 [95% CI 0.812–0.880]) and solitary aortic (0.838 [0.813–0.864]) valve operations, confirming that ensemble ML applied to routinely collected peri-operative data can predict 5-year mortality after cardiac operations with high accuracy. Additionally, post-operative urea was identified as a novel and strong predictor of mortality for several types of operation, with a seemingly additive effect on top of better-known risk factors such as age and post-operative creatinine.


Content of supplementary material
• Data descriptives table (table 1).
• Tables of P-values of the DeLong test for difference in areas under the curve for prediction performance of the different Super Learners (tables 2-4).
• Calibration plots and ECI coefficients for all models (table 5 and figures 1-11).
• Peri-operative and pre-operative model comparison (figure 12).
• Specificity and sensitivity per operation type with adjusted risk thresholds (tables 14 and 15).
• Detailed explanation of the Super Learner, and model hyper-parameter definitions and tuning.

Table 1 legend: All values presented as mean (95% CI), and categorical variables with the percentage in parentheses. BMI = body mass index, eCCR = estimated creatinine clearance, CPB = cardio-pulmonary bypass, HR = heart rate, SBP = systolic blood pressure, DBP = diastolic blood pressure, CVP = central venous pressure, PaCO2 = arterial CO2 pressure, PaO2 = arterial oxygen pressure, SaO2 = oxygen saturation, ICU = intensive care unit, ESR = erythrocyte sedimentation rate, LDH = lactate dehydrogenase, Hb = hemoglobin, ALAT = alanine aminotransferase, ASAT = aspartate aminotransferase, AKI = acute kidney injury.

Specificity and sensitivity per operation type with adjusted risk thresholds
Table 14. Specificity and sensitivity for the outcome "Non-survivors" for the Super Learners trained on individual operation groups, using the default cut-off of 0.50, the cut-off defined by a 50% increase in mortality risk, and the cut-off defined by the maximum Youden index.
Table 15. Specificity and sensitivity for the outcome "Non-survivors" with the Super Learner and GLM models trained on the full cohort, using the default cut-off of 0.50, the cut-off defined by a 50% increase in mortality risk, and the cut-off defined by the maximum Youden index.

Peri-operative and pre-operative model comparison
For the comparison between a model built only with pre-operative variables and the full peri-operative model, the following variables were selected: age, gender, body mass index (BMI), pre-operative creatinine, pre-operative urea, pre-operative creatinine clearance (eCCR), pre-operative erythrocyte sedimentation rate (ESR), pre-operative alanine aminotransferase (ALAT), pre-operative aspartate aminotransferase (ASAT), pre-operative lactate dehydrogenase (LDH), pre-operative hemoglobin, pre-operative thrombocytes, and pre-operative leukocytes. The difference between the two models was significant (P < 0.01).

Detailed explanation of the Super Learner, and model hyper-parameter definitions and tuning
The Super Learner algorithm was designed by Dudoit and van der Laan, and is a generalization of the stacking algorithms developed by Breiman (1996); it chooses the optimal regression algorithm from a set of candidates based on a loss function after k-fold cross-validation (Dudoit and van der Laan, 2006; Legrand et al., 2013). In this process, the dataset is divided into k mutually exclusive and exhaustive subsets of nearly equal size, with one of the k sets serving as a validation set while the other sets are used to train each candidate algorithm (Van der Laan, Polley, and Hubbard, 2007). At the patient level, this means that each patient appears exactly once in a validation set and is included in the training set in all other rounds (Pirracchio et al., 2015). For each candidate learner, k risks are calculated and averaged into a "cross-validated risk", on the basis of which the learners with the minimal risk are selected and applied to the entire dataset. These are then included in the new, weighted estimator (the SL), which attributes a relative coefficient to each of its constituent learners, so that only those that reduce the calculated risk the most contribute to the final weighted prediction.
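The cross-validated risk and convex weighting described above can be sketched as follows. This is an illustrative simplification, not the paper's implementation: it assumes squared-error loss, two toy candidate learners (a mean predictor and an ordinary least-squares fit), and a simple grid search over convex weights.

```python
import numpy as np

def mean_learner(X_tr, y_tr, X_va):
    # Trivial candidate: always predict the training mean.
    return np.full(len(X_va), y_tr.mean())

def ols_learner(X_tr, y_tr, X_va):
    # Candidate: ordinary least squares with an intercept.
    Xd = np.c_[np.ones(len(X_tr)), X_tr]
    beta, *_ = np.linalg.lstsq(Xd, y_tr, rcond=None)
    return np.c_[np.ones(len(X_va)), X_va] @ beta

def super_learner_weights(X, y, learners, k=5, grid=np.linspace(0, 1, 101)):
    rng = np.random.default_rng(0)
    folds = rng.permutation(len(y)) % k
    # Out-of-fold ("level-one") predictions: each patient is predicted
    # exactly once, by models that never saw it during training.
    Z = np.zeros((len(y), len(learners)))
    for f in range(k):
        tr, va = folds != f, folds == f
        for j, learn in enumerate(learners):
            Z[va, j] = learn(X[tr], y[tr], X[va])
    # Per-learner cross-validated risk (mean squared error here).
    cv_risk = ((Z - y[:, None]) ** 2).mean(axis=0)
    # Search convex combinations w*Z1 + (1-w)*Z2 for the minimal-risk ensemble.
    risks = [(((w * Z[:, 0] + (1 - w) * Z[:, 1]) - y) ** 2).mean() for w in grid]
    w = grid[int(np.argmin(risks))]
    return np.array([w, 1 - w]), cv_risk, min(risks)

X = np.linspace(0, 10, 200).reshape(-1, 1)
y = 2.0 * X[:, 0] + np.random.default_rng(1).normal(0, 1, 200)
weights, cv_risk, ens_risk = super_learner_weights(X, y, [mean_learner, ols_learner])
```

Because the weight grid includes the pure weights 0 and 1, the ensemble's cross-validated risk can never exceed that of the best single candidate, which is the property the SL exploits.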
By automatically estimating the weights of the ensemble, and automatically removing models that do not contribute to the prediction, the SL eliminates the manual tuning and experimentation that a non-automated majority-vote or weighted ensemble requires. Moreover, the SL provides individual-patient predicted probabilities of 5-year mortality per ensemble and per learner (Polley et al., 2018). This is necessary to define cut-offs for binary classification that increase the relevance and potential clinical applicability of the findings of predictive modelling.
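One of the cut-off definitions used in tables 14 and 15, the maximum Youden index (J = sensitivity + specificity − 1), can be computed directly from predicted probabilities. A minimal sketch, using made-up probabilities and outcome labels:

```python
import numpy as np

def youden_cutoff(p, y):
    """Scan candidate thresholds and return the one maximizing J = sens + spec - 1."""
    best_j, best_t = -1.0, 0.5
    for t in np.unique(p):
        pred = p >= t
        tp = np.sum(pred & (y == 1)); fn = np.sum(~pred & (y == 1))
        tn = np.sum(~pred & (y == 0)); fp = np.sum(pred & (y == 0))
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        j = sens + spec - 1
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j

# Toy predicted mortality probabilities and observed 5-year outcomes:
p = np.array([0.05, 0.10, 0.20, 0.35, 0.60, 0.80, 0.90])
y = np.array([0,    0,    0,    1,    0,    1,    1])
cutoff, j = youden_cutoff(p, y)  # cutoff = 0.35, J = 0.75 on this toy data
```

Lowering the cut-off below the default 0.50 in this way trades specificity for sensitivity, which is the point of the adjusted-threshold analyses.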
Five candidate algorithms were included in the Super Learner: Bayesian additive regression trees (BART), extremely randomized trees, elastic net, support vector machine (SVM), and extreme gradient boosting machine (XGBoost). To optimize the performance of the algorithm, multiple hyper-parameter combinations were generated for each candidate algorithm. The hyper-parameters tuned and the values defined as optimal are listed below, per algorithm.

Bayesian additive regression trees (BART)
BART is an ensemble-of-trees method that differs from Random Forests and stochastic gradient boosting methods such as Gradient Boosting Machines in that it relies on an underlying Bayesian probability model instead of a purely algorithmic procedure (Chipman et al., 2010; Kapelner and Bleich, 2016; Breiman, 2001). It sets a number of priors for the structure and the leaf parameters of the trees it generates, which provides additional regularization to the model and allows for variable importance exploration, including permutation tests and interaction detection. Furthermore, the bartMachine R implementation of this algorithm includes an external predict function that allows forecasts to be generated without the need to re-fit the whole model (Kapelner and Bleich, 2016).
The user-defined hyper-parameters that generate different model configurations are alpha (α), beta (β), and k. The first two, α and β, are respectively the base and power hyper-parameters of the prior probability that a node at depth d is nonterminal, given by the expression α(1 + d)^(−β), with α ∈ (0, 1) and β ∈ [0, ∞). Chipman et al. (2010) advise taking α = 0.95 and β = 2 as pre-defined values for these parameters. The higher the alpha, the more splits are encouraged, even in situations where predictive gains are modest, which is in line with the tendency of BART to include spurious splits (Chipman et al., 2010). In turn, k determines the prior probability that E(Y|X) lies between −3 and 3, expressed as the number of standard deviations on either side of the mean. The larger k, the broader the coverage of the variance of the response values in the training set, and the more conservative the model fit. As the prior specifications for variable selection via BART are a topic of on-going research, we opted not to define hyper-parameter values for this algorithm, using instead the default α, β, and k.
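The tree-structure prior above is simple to evaluate numerically. A short sketch, using the default values α = 0.95 and β = 2 from Chipman et al. (2010):

```python
# Prior probability that a node at depth d is nonterminal (splits further):
# alpha * (1 + d) ** -beta. Shrinking this probability with depth is what
# keeps BART's individual trees shallow and regularized.

def p_nonterminal(depth, alpha=0.95, beta=2.0):
    return alpha * (1.0 + depth) ** (-beta)

# The root (depth 0) almost always splits; deeper nodes rarely do:
probs = [p_nonterminal(d) for d in range(4)]
# probs[0] = 0.95, probs[1] = 0.2375, and the sequence decreases with depth.
```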

Extremely randomized trees
This Random Forest-based algorithm, available in the R package "extraTrees", is a better-performing update to the original "randomForest" R package developed by Liaw and Wiener (2002), mainly due to its novel node-splitting process (Simm et al., 2014; Liaw and Wiener, 2002). It also facilitates parallel processing by building a large number of binary decision trees independently of each other, and it can learn non-linear models even in large datasets with over 100,000 training samples. The algorithm extends multi-task learning (MTL) to a binary decision tree-based ensemble method, allowing the user to dedicate tree branches to certain specific tasks. This limits the reciprocal influence of multiple tasks on each other, but also reduces a model's ability to capture relevant information from other tasks. The implementation of extraTrees in the Super Learner is rich in hyper-parameterization, especially in sub-task definition. Since we did not define any a priori tasks, owing to our naïve approach to the data analysis process, data from all tasks were pooled together and no task-wise split hyper-parameter had to be optimized. Therefore, only the parameters ntree, the number of trees to be built, and mtry, the number of features tried at each node, were chosen for tuning. The number of random cuts for each chosen feature was set to 1, in accordance with the original extraTrees method. Ultimately, 21 extraTrees models were used, with mtry ranging from 1 to 7 and ntree set to 500, 1000, and 2000.
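The 21-model tuning grid described above is simply the Cartesian product of the two tuned parameters, with the number of random cuts held fixed. A sketch (the dictionary keys mirror the extraTrees argument names):

```python
from itertools import product

# Every combination of mtry (features tried per node, 1-7) and ntree
# (number of trees, 500/1000/2000), with numRandomCuts fixed at 1:
mtry_values = range(1, 8)          # 7 values
ntree_values = (500, 1000, 2000)   # 3 values

grid = [{"mtry": m, "ntree": n, "numRandomCuts": 1}
        for m, n in product(mtry_values, ntree_values)]
# 7 x 3 = 21 model configurations, matching the text.
```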

Elastic net
The biglasso wrapper in the Super Learner implements computationally efficient sparse linear and logistic regression models with lasso, ridge, and elastic net penalties, providing an improved feature-screening process to identify and discard inactive features from the lasso optimization (Zeng and Breheny, 2017). The elastic net algorithm, developed by Zou and Hastie, outperforms lasso alone because it groups strongly correlated predictors instead of arbitrarily selecting among them, and it is available in the "biglasso" package by adjusting the penalty hyper-parameter (Zou and Hastie, 2005). This method of variable selection, although best suited to cases where the number of predictors (p) is much larger than the number of observations (n), which is not the case in our dataset, increases the interpretability of the model, a key issue when addressing its possible clinical applicability.
We defined 5 different models, all with a sequential strong rule screening algorithm, but with different mixing values (α) for the elastic net penalty (0.05, 0.4, 0.5, 0.6, and 0.95), which moves from close to ridge at α = 0 to close to lasso at α = 1, as defined by the penalty α‖β‖₁ + ((1 − α)/2)‖β‖₂². For lambda (λ), a hyper-parameter representing the shrinkage penalty of the fitting process that directly affects variance, a maximum of 100 values was tried out by the Super Learner algorithm for optimization (Zeng and Breheny, 2017).
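The mixing behaviour of the penalty above can be made concrete by evaluating it at the two extreme α values we used. A numeric sketch (the coefficient vector is made up for illustration; in the fitted objective this penalty is further scaled by λ):

```python
import numpy as np

def elastic_net_penalty(beta, alpha):
    """P_alpha(beta) = alpha * ||beta||_1 + ((1 - alpha) / 2) * ||beta||_2^2."""
    l1 = np.abs(beta).sum()
    l2_sq = (beta ** 2).sum()
    return alpha * l1 + (1.0 - alpha) / 2.0 * l2_sq

beta = np.array([0.5, -1.0, 0.0, 2.0])    # toy coefficient vector
ridge_like = elastic_net_penalty(beta, alpha=0.05)  # dominated by ||beta||_2^2
lasso_like = elastic_net_penalty(beta, alpha=0.95)  # dominated by ||beta||_1
```

At α = 1 the penalty reduces to the pure lasso term ‖β‖₁, which for this toy vector equals 3.5; intermediate α values such as 0.4-0.6 blend the two norms and are what lets the elastic net keep groups of correlated predictors together.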

SVM
A Support Vector Machine (SVM) is a class of supervised learning algorithms which classifies data points into two different classes by taking datapoints in a multidimensional space and constructing the hyperplane that best differentiates between the two (Chapelle and Zien, 2005). The projection of the input data to a higher-dimensional space can further be potentiated by the use of kernels, which can increase the efficacy of the trained models.
The hyper-parameters to be optimized are the cost (C) and sigma (σ). Cost controls the misclassification tolerance: the higher the cost, the harder the margin and the smaller the tolerance for misclassification. In turn, sigma defines the smoothing of the Radial Basis Function (RBF) kernel we chose. Different values of sigma define how strongly a single training example influences the model, with greater smoothing constraining the model towards near-linear behaviour.
We defined 36 different models for SVM, with all possible combinations of cost between 2⁻² and 2²¹ and sigma between 2⁻⁷ and 2⁻²¹. Since an SVM tries to maximize the distance between the separating plane and each support vector (the datapoints closest to the hyperplane), attributes with greater numeric ranges can dominate those with smaller ranges (Hastie et al., 2009). Therefore, dummy variables were created for all categorical variables, and our numeric predictor variables were centred (subtracting the mean) and scaled (dividing by the standard deviation).
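The scaling step and the RBF kernel can be sketched together. This assumes the kernlab-style parameterization k(x, x') = exp(−σ‖x − x'‖²); the toy data below are made up:

```python
import numpy as np

def standardize(X):
    """Centre each column (subtract the mean) and scale it (divide by the SD)."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    return (X - mu) / sd

def rbf_kernel(x, z, sigma):
    # Without standardization, a feature measured on a large numeric range
    # would dominate the squared distance inside the kernel.
    return np.exp(-sigma * np.sum((x - z) ** 2))

# Toy predictors on very different scales (e.g. age in years, a lab ratio):
X = np.array([[60.0, 1.2], [72.0, 0.9], [55.0, 2.1], [81.0, 1.5]])
Xs = standardize(X)
k_same = rbf_kernel(Xs[0], Xs[0], sigma=2.0 ** -7)  # identical points -> 1.0
k_diff = rbf_kernel(Xs[0], Xs[1], sigma=2.0 ** -7)  # strictly between 0 and 1
```

After standardization every column has mean 0 and unit standard deviation, so each predictor contributes on an equal footing to the kernel distance.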

XGBoost
The Extreme Gradient Boosting Machine (XGBoost) algorithm is a scalable tree boosting system that is used widely in data science and has very interesting, potentially clinically oriented properties. Like Ridgeway's original GBM, it is based on consecutively fitting new models (base-learners) to the training data set in order to provide a more accurate estimate of the outcome variable (Ridgeway, 2012). Decision trees are combined and increasingly weigh the "difficult to predict" events to a greater degree, from which a cross-validation error is estimated using k-fold cross-validation (Natekin and Knoll, 2013). The novelties in XGBoost are its use of column sampling, borrowed from Random Forests, and a more robust handling of data sparsity patterns (Chen, 2016).
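The boosting loop described above can be sketched in miniature: each round fits a depth-1 regression tree (stump) to the current residuals and adds it with a shrinkage factor, and, echoing XGBoost's column sampling, each round only considers a random subset of the features. This is an illustrative toy, not XGBoost itself; the data, learning rate, and sizes are made up:

```python
import numpy as np

def fit_stump(x, r):
    """Best single-feature threshold split minimizing squared error on residuals r."""
    best = None
    for thr in np.unique(x)[:-1]:
        left, right = r[x <= thr], r[x > thr]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    return best

def boost(X, y, n_rounds=50, lr=0.1, colsample=0.5, seed=0):
    rng = np.random.default_rng(seed)
    pred = np.full(len(y), float(y.mean()))  # start from the outcome mean
    for _ in range(n_rounds):
        # Column sampling: each round sees only a random subset of features.
        cols = rng.choice(X.shape[1], max(1, int(colsample * X.shape[1])), replace=False)
        # Fit a stump to the residuals on each sampled column; keep the best.
        sse, thr, lv, rv, j = min(fit_stump(X[:, c], y - pred) + (c,) for c in cols)
        # Shrunken additive update: hard-to-predict points keep large residuals
        # and therefore keep driving later rounds.
        pred += lr * np.where(X[:, j] <= thr, lv, rv)
    return pred

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.1, 200)
pred = boost(X, y)
mse_before = ((y - y.mean()) ** 2).mean()
mse_after = ((y - pred) ** 2).mean()
```

Because each stump is the least-squares fit to the current residuals, every round with a non-trivial split strictly lowers the training error; the shrinkage factor is what slows this down enough to generalize.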