Introduction

Patients who undergo open-heart surgery are commonly admitted to the intensive care unit (ICU) and face various complications such as acute kidney injury (AKI), sepsis, septic shock, chronic kidney disease (CKD), pneumonia, thrombocytopenia, and inflammatory responses1,2,3,4,5,6,7. Lysak et al8 showed that because AKI and CKD are prevalent and generate high expenditure, early diagnosis is necessary to prevent these comorbidities from deteriorating. Early prediction of comorbidities in critically ill patients after cardiac surgery is therefore vital for patients' prognosis and for doctors' decision making.

Compared with traditional clinical prediction models built with logistic regression (LR), machine learning prediction models offer higher accuracy and robustness. Traditional algorithms such as LR require researchers to manually select independent variables X that are highly related to the outcome, whereas modern machine learning algorithms can learn the relationship between X and Y automatically. Many researchers have constructed prediction models for patients who underwent cardiac surgery. Meyer et al9 used a recurrent neural network (RNN) to predict mortality, bleeding, and renal failure after heart surgery. Lei et al10, Tseng et al11, and Lee et al12 used acute kidney injury (AKI) as their primary outcome, and Kilic et al13 considered prolonged ventilation and reoperation as prediction objectives. Most of this work focused on the most common complications after cardiac surgery, including AKI and sepsis; research on other comorbidities such as septic shock, liver dysfunction, and severe thrombocytopenia has been limited.

Vardon-Bounes et al14 reported that thrombocytopenia, with a prevalence of 50%, is a common hemostatic disorder in the ICU and is associated with bleeding, high illness severity, organ failure, and poor prognosis15. Moreover, Kunutsor et al16 demonstrated that alanine transaminase (ALT) and aspartate transaminase (AST), as indicators of liver dysfunction, are inversely associated with coronary heart disease (CHD) and positively associated with stroke. In addition, Ambrosy et al17 showed that higher ALT and AST levels are associated with lower survival rates. Elevated transaminases often indicate that the body is in a state of hypoperfusion or hypoxemia; without timely intervention, patients may experience adverse outcomes such as AKI or even death. Font et al18 noted that during septic shock the body produces large amounts of inflammatory cytokines, causing multiple organ failure, including septic cardiomyopathy, acute respiratory distress syndrome, septic encephalopathy, and other complications. Therefore, early prediction of septic shock is particularly important to limit further deterioration of the patient's condition.

Therefore, in this study, we aimed to build multiple machine learning models to predict several prognostic outcomes after open-heart surgery. Our primary outcomes were all-cause 30-day mortality, septic shock, severe thrombocytopenia, and liver dysfunction (abnormal AST and ALT).

Results

Study population

Of the 6844 patients who underwent heart surgery, 5475 (80%) were randomly assigned to the training set and 1369 (20%) to the test set. Table 1 shows the differences in characteristics between the two sets; most variables showed no significant differences. Among the 6844 patients enrolled in this study, 219 (3.1%) died within 30 days after heart surgery. Septic shock, liver dysfunction, and severe thrombocytopenia occurred in 32 (0.5%), 248 (3.6%), and 202 (2.9%) patients, respectively. Table 2 shows that most input variables differed significantly between positive samples (ill patients) and negative samples (normal patients) (P < 0.05).

Table 1 Baseline characteristics and variables 1.
Table 2 Baseline characteristics and variables 2.

Machine learning models’ performance

The accuracy, area under the curve (AUC), F1 score, precision, and recall of the four models for all complications are shown in Table 3, and the ROC curves of the 4 primary outcomes are plotted in Fig. 1.

Table 3 Model evaluation.
Figure 1 ROC curves of the 4 outcomes.

XGBoost achieved the highest AUC and F1 score for every outcome (septic shock: AUC 0.99, F1 score 0.70; 30-day mortality: AUC 0.88, F1 score 0.58; thrombocytopenia: AUC 0.88, F1 score 0.55; liver dysfunction: AUC 0.89, F1 score 0.40), indicating that it was the most robust model.

Compared with the other algorithms, XGBoost showed better overall performance in terms of AUC, test accuracy, and F1 score. In Fig. 2, decision curve analysis shows that, in terms of net benefit, XGBoost and RF were better than LR and ANN, and XGBoost was slightly better than RF. Therefore, we selected the XGBoost model as the final model of this study and, based on the XGBoost model files, built a Windows 10 application to present our results, as shown in Fig. 3; the download link is available at https://github.com/Zhihua-PredictionModel/ML-Prediction-Model. The source files of the XGBoost models for the 4 outcomes, created with Scikit-learn, were also uploaded to the repository. Other researchers or programmers can load these trained model files (the ".model" files can be loaded with joblib, a Python package) for their own customized prediction tasks.
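As a minimal sketch of how such a ".model" file could be reused, the snippet below loads a saved classifier with joblib and scores a new patient. The file name and the all-zero feature vector are placeholders, not part of the repository; the real input must contain the 35 variables in the same order used during training.

```python
import joblib
import numpy as np

# Load a trained model file from the repository (file name is an assumption).
model = joblib.load("xgb_septic_shock.model")

# Placeholder input: 1 patient x 35 variables, in the training column order
# (all zeros here purely for illustration).
x_new = np.zeros((1, 35))

prob = model.predict_proba(x_new)[:, 1]   # predicted probability of the outcome
label = model.predict(x_new)              # 0/1 class prediction
print(prob, label)
```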

Figure 2 Decision curve analysis of the 4 outcomes.

Figure 3 Windows 10 software for patients who underwent open-heart surgery.

The top 5 predictors that influenced the decision making of XGBoost are shown in Table 4. The first, second, and third predictors for the 4 outcomes were as follows. 30-day mortality: vasopressin, pH, and creatinine; septic shock: hemoglobin, hematocrit, and lactate; severe thrombocytopenia: vasopressin, bicarbonate, and lactate; liver dysfunction: partial thromboplastin time, gender, and partial pressure of oxygen.

Table 4 Feature importance outputted by XGBoost.
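Rankings such as those in Table 4 come from the fitted model's feature importances; a brief sketch is shown below, assuming `xgb_model` is a fitted xgboost.XGBClassifier (see Methods) and `feature_names` holds the 35 input column names (both names are assumptions).

```python
import pandas as pd

# Gain-based importances from the fitted booster, labeled with column names.
importances = pd.Series(xgb_model.feature_importances_, index=feature_names)

# Top 5 predictors, analogous to the rankings reported in Table 4.
print(importances.sort_values(ascending=False).head(5))
```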

Discussion

In our study, four machine learning models were constructed and compared for 30-day mortality and 3 comorbidities after heart-related surgery. Other researchers have also developed prediction models for this patient population. Based on 2010 patients in the database of Seoul National University Hospital, Lee et al12 found that among machine learning algorithms including decision tree, support vector machine, and random forest, XGBoost (test accuracy: 0.74; AUC: 0.78) had the best performance for predicting AKI after cardiac surgery, and a website was created to process patients' data in real time. Kilic et al13 also applied XGBoost to predict multiple complications, including operative mortality (AUC: 0.771), renal failure (AUC: 0.776), prolonged ventilation (AUC: 0.739), reoperation (AUC: 0.637), stroke (AUC: 0.684), and deep sternal wound infection (AUC: 0.599), for adult patients after surgical aortic valve replacement in the Society of Thoracic Surgeons National Database. Most prior work focused on common complications such as AKI, sepsis, and hospital mortality, and research on other complications remains limited. Therefore, our study aimed to predict 30-day mortality, septic shock, liver dysfunction, and severe thrombocytopenia, which are also important for patients' prognosis.

Several predictors of the different comorbidities were output by XGBoost. According to Table 4, among the 4 primary outcomes, lactate and platelet count appeared 3 times and vasopressin and creatinine appeared 2 times, suggesting that they were important factors for our outcomes. The models built by Kilic et al13 also showed that creatinine is an important factor for predicting mortality and renal failure after heart surgery.

Our study has some limitations. First, all experiments were conducted on MIMIC-III, a clinical database of critically ill patients, which means our machine learning models may perform well for critically ill patients in the United States but may not generalize as well to populations in other regions. Further study is therefore needed to obtain as much data as possible from various databases to construct a more comprehensive model that works well on any population in any area.

Second, a class imbalance problem occurred in the experiment. There is a trade-off between accuracy, F1 score, precision, recall, and AUC, because medical data are usually highly imbalanced: among 100 patients, there may be only 3 positive samples and 97 negative (normal) samples. In this situation, machine learning algorithms tend to classify samples into the majority class. Therefore, we increased the weight of the positive samples of complications in the loss function, which improved the precision, recall, and F1 score of the models at the cost of reducing AUC and accuracy. By tuning other hyperparameters and using subsampling, we kept XGBoost well balanced between precision and recall. How to improve accuracy, F1 score, and AUC simultaneously on imbalanced medical data remains an open problem.
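The re-weighting idea can be sketched as follows: the positive class is up-weighted roughly by the negative-to-positive ratio via XGBoost's scale_pos_weight. The toy labels and the other parameter values are illustrative assumptions; the grid actually searched is listed in Table 5.

```python
import numpy as np
from xgboost import XGBClassifier

# Toy label vector mirroring the 3-in-100 example above.
y_train = np.array([0] * 97 + [1] * 3)
ratio = (y_train == 0).sum() / (y_train == 1).sum()   # ~32:1

# Up-weight positives in the loss; subsample and max_delta_step further
# stabilize training on the minority class (values are placeholders).
clf = XGBClassifier(scale_pos_weight=ratio, subsample=0.8, max_delta_step=1)
# clf.fit(X_train, y_train)  # X_train/y_train: the 80% training split
```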

In conclusion, four machine learning algorithms were trained to predict 30-day mortality and 3 comorbidities after open-heart surgery. The XGBoost model was the most robust, with the highest AUC, F1 score, and net benefit. A Windows 10 application was created and is available to clinical staff at the repository mentioned above. Moreover, the predictors output by the XGBoost model indicate the relevance of these factors to the comorbidities and generate hypotheses; whether these factors can serve as independent biochemical indexes remains an open question.

Methods

Data source and participants

The Medical Information Mart for Intensive Care (MIMIC-III) is a freely available database containing critically ill patients who were admitted to the ICU of the Beth Israel Deaconess Medical Center between 2001 and 201219. Patients who underwent coronary artery bypass surgery, aortic valve replacement, or insertion of an implantable heart assist system (ICD-9 codes 3961, 3615, 3612, 8872, 3521, 6311, 3522, 3614, 3733, and 3524) were enrolled in this study. A total of 6844 related samples were extracted from the MIMIC-III clinical database using PostgreSQL and Python 3 (version 3.7.8).
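A minimal cohort-selection sketch is shown below, assuming a local PostgreSQL installation of MIMIC-III with the standard `mimiciii` schema; the connection string and the exact joins used in the study are assumptions, and only the procedure-code filter mirrors the text above.

```python
import pandas as pd
from sqlalchemy import create_engine

# Connection details are placeholders for a local MIMIC-III instance.
engine = create_engine("postgresql://user:password@localhost:5432/mimic")

codes = ('3961', '3615', '3612', '8872', '3521',
         '6311', '3522', '3614', '3733', '3524')

query = f"""
SELECT DISTINCT subject_id, hadm_id
FROM mimiciii.procedures_icd
WHERE icd9_code IN {codes}
"""
cohort = pd.read_sql(query, engine)   # admissions with a qualifying procedure
print(len(cohort))
```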

Definition and primary outcomes

The primary outcomes were 30-day mortality and three comorbidities, namely septic shock, liver dysfunction, and severe thrombocytopenia, after heart-related surgery. 30-day mortality was defined as death within 30 days after discharge from the ICU. A patient was labeled as having liver dysfunction if his/her first test values of aspartate transaminase (AST) and alanine transaminase (ALT) were normal (10–45 IU/L for ALT; 10–35 IU/L for AST) and any later value of ALT or AST exceeded the upper limit of normal (45 IU/L for ALT; 37 IU/L for AST)20. Severe thrombocytopenia was defined as a first platelet count above 50 K/uL with any later platelet count below 50 K/uL14. Because septic shock is a severe disease with acute symptoms, it was identified by its ICD-9 code (785.52) in the MIMIC-III database21. Only data from each patient's first ICU admission were considered.
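The laboratory-based labels can be written down directly from the definitions above. The sketch below assumes a per-patient, time-ordered DataFrame `labs` with columns 'alt', 'ast', and 'platelet' (column names are assumptions for illustration).

```python
import pandas as pd

def label_outcomes(labs: pd.DataFrame) -> dict:
    first = labs.iloc[0]
    later = labs.iloc[1:]

    # Liver dysfunction: first ALT/AST normal, any later value above the upper limit.
    normal_first = (10 <= first['alt'] <= 45) and (10 <= first['ast'] <= 35)
    liver_dysfunction = normal_first and (
        (later['alt'] > 45).any() or (later['ast'] > 37).any()
    )

    # Severe thrombocytopenia: first platelet > 50 K/uL, any later platelet < 50 K/uL.
    severe_thrombocytopenia = first['platelet'] > 50 and (later['platelet'] < 50).any()

    return {'liver_dysfunction': int(liver_dysfunction),
            'severe_thrombocytopenia': int(severe_thrombocytopenia)}
```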

Machine learning models

Logistic regression (LR) is a classic classification algorithm that makes a linear combination of the input variables and passes it through the sigmoid function to output a probability. The main LR hyperparameter is the regularization strength C.

Neurons in an artificial neural network (ANN) make a linear combination of the outputs of the previous layer's neurons, pass it through a sigmoid function, and output a value to the neurons of the next layer22. The width and depth of the hidden layers influence the performance of the ANN.

Compared with a single classifier, the ensemble learning algorithm random forest (RF), which merges multiple weak classifiers into a strong classifier, shows stronger performance in classification tasks23. Its main hyperparameters are n_estimators, max_depth, and max_leaf_nodes.

Extreme gradient boosting (XGBoost) is also an ensemble model of decision trees24. Its main hyperparameters are n_estimators, max_depth, reg_lambda, gamma, min_child_weight, scale_pos_weight (when samples are imbalanced, this parameter changes the weight of the positive samples in the loss function), max_delta_step, and subsample.
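For orientation, the four classifiers with the hyperparameters named above can be instantiated as sketched below; the concrete values are placeholders, and the candidate values actually searched are listed in Table 5.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

models = {
    # LR: C controls regularization strength.
    'LR': LogisticRegression(C=1.0, max_iter=1000),
    # ANN: width/depth of hidden layers set via hidden_layer_sizes.
    'ANN': MLPClassifier(hidden_layer_sizes=(64, 32), activation='logistic'),
    # RF: number of trees, tree depth, and leaf count.
    'RF': RandomForestClassifier(n_estimators=200, max_depth=6, max_leaf_nodes=50),
    # XGBoost: the hyperparameters listed in the text (placeholder values).
    'XGBoost': XGBClassifier(n_estimators=200, max_depth=4, reg_lambda=1.0,
                             gamma=0.1, min_child_weight=1, scale_pos_weight=30,
                             max_delta_step=1, subsample=0.8),
}
```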

Statistical method

Thirty-five input variables, including demographics, use of vasopressin, laboratory variables, vital signs, comorbidities, and first-day urine output, and 4 output variables were extracted from the database, as shown in Table 1. Some important variables, such as the fraction of inspired O2 (FiO2), were excluded because of too much missing data; variables with less than 40% missing data were retained25. All missing values were filled with the mean value of the corresponding variable, and winsorization was used to handle outliers.
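A brief preprocessing sketch under these rules is shown below: drop variables with 40% or more missing values, impute the rest with column means, and winsorize numeric columns. The 1st/99th percentile limits are an assumption, as the paper does not state the winsorization cut-offs.

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Keep variables with less than 40% missing data.
    df = df.loc[:, df.isna().mean() < 0.40]
    # Fill remaining missing values with the column mean.
    df = df.fillna(df.mean(numeric_only=True))
    # Winsorize each numeric column (assumed 1st/99th percentile limits).
    for col in df.select_dtypes('number').columns:
        df[col] = np.asarray(winsorize(df[col].to_numpy(), limits=(0.01, 0.01)))
    return df
```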

Figure 4 shows the flow chart of data processing. The 6844 samples were then randomly divided into training data (5475) and test data (1369) at a ratio of 80:20. The chi-square test and the Wilcoxon rank-sum test were used to compare differences in categorical and continuous variables, respectively. They were first applied to compare the training and test data to ensure that the distributions of the two datasets were as similar as possible; in Table 1, P > 0.05 was considered to indicate no significant distribution difference between training and test data. In Table 2, the same tests were used to compare positive and negative samples to examine the association between the independent variables X and the outcome variables Y, with P < 0.05 considered to indicate a significant association.
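The split and the two tests can be sketched as below, assuming `df` holds the input variables; the random seed and the example column names ('creatinine', 'gender') are assumptions for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency, ranksums
from sklearn.model_selection import train_test_split

# 80/20 random split (seed is an assumption).
train_df, test_df = train_test_split(df, test_size=0.20, random_state=42)

# Wilcoxon rank-sum test for a continuous variable (example column name).
stat, p_cont = ranksums(train_df['creatinine'], test_df['creatinine'])

# Chi-square test for a categorical variable (example column name).
table = pd.crosstab(
    pd.concat([train_df['gender'], test_df['gender']]),
    ['train'] * len(train_df) + ['test'] * len(test_df),
)
chi2, p_cat, dof, expected = chi2_contingency(table)
```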

Figure 4 Flow chart of data processing.

Machine learning model training

The training data were evenly split into 5 parts: 4 parts were used to train a model with a given set of hyperparameters, and the remaining part, the validation set, was used to test the performance of those hyperparameters. This process was repeated 5 times to obtain 5 validation scores, and the average score was used to evaluate the model. This method, fivefold cross-validation, is commonly used to select the best hyperparameters. The 4 machine learning algorithms LR, ANN, RF, and XGBoost were employed to fit the data, and each has several hyperparameters that must be specified, as shown in Table 5. Using a grid search, the candidate hyperparameters were searched automatically in Python and the best combination was selected for each model. By comparing the performance of the four models, the best one, XGBoost, was chosen as the final model of our study, and its hyperparameters were further fine-tuned manually to obtain better performance.
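A minimal sketch of the fivefold cross-validated grid search is shown below; the candidate grid and `X_train`/`y_train` are placeholders, and the candidates actually searched are listed in Table 5.

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Placeholder candidate grid (the real candidates are in Table 5).
param_grid = {
    'n_estimators': [100, 200, 400],
    'max_depth': [3, 4, 6],
    'scale_pos_weight': [10, 30, 50],
}

search = GridSearchCV(XGBClassifier(subsample=0.8), param_grid,
                      cv=5, scoring='roc_auc', n_jobs=-1)
search.fit(X_train, y_train)   # X_train/y_train: the 80% training split
print(search.best_params_, search.best_score_)
```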

Table 5 Candidate parameters for grid search and fine-tune of parameters.

Evaluating a model only on the validation set and its score overestimates its performance; therefore, the test data set aside at the beginning were used to obtain the final scores presented in Table 3. In addition, decision curve analysis was applied to evaluate the models, as shown in Fig. 2. All machine learning experiments were conducted in Python (version 3.7.8).
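The final evaluation can be sketched as below, assuming `model`, `X_test`, and `y_test` come from the steps above; the net-benefit function follows the standard decision curve analysis formula, and the 0.5 classification threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, roc_auc_score, f1_score,
                             precision_score, recall_score)

y_prob = model.predict_proba(X_test)[:, 1]
y_pred = (y_prob >= 0.5).astype(int)   # assumed classification threshold

print('accuracy :', accuracy_score(y_test, y_pred))
print('AUC      :', roc_auc_score(y_test, y_prob))
print('F1       :', f1_score(y_test, y_pred))
print('precision:', precision_score(y_test, y_pred))
print('recall   :', recall_score(y_test, y_pred))

def net_benefit(y_true, y_prob, threshold):
    """Net benefit at a probability threshold: TP/n - FP/n * pt/(1 - pt)."""
    n = len(y_true)
    pred_pos = y_prob >= threshold
    tp = np.sum(pred_pos & (y_true == 1))
    fp = np.sum(pred_pos & (y_true == 0))
    return tp / n - fp / n * threshold / (1 - threshold)

thresholds = np.linspace(0.01, 0.99, 99)
nb_curve = [net_benefit(np.asarray(y_test), y_prob, t) for t in thresholds]
```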