Machine learning prediction models for prognosis of critically ill patients after open-heart surgery

We aimed to build multiple machine learning models to predict 30-day mortality and 3 complications, septic shock, thrombocytopenia, and liver dysfunction, after open-heart surgery. Patients who underwent coronary artery bypass surgery, aortic valve replacement, or other heart-related surgeries between 2001 and 2012 were extracted from the MIMIC-III database. Extreme gradient boosting (XGBoost), random forest, artificial neural network, and logistic regression were employed to build models using fivefold cross-validation and grid search. Receiver operating characteristic curves, area under the curve (AUC), decision curve analysis, test accuracy, F1 score, precision, and recall were applied to assess performance. Among the 6844 patients enrolled in this study, 215 (3.1%) died within 30 days after surgery, and some patients developed liver dysfunction (248; 3.6%), septic shock (32; 0.5%), or thrombocytopenia (202; 2.9%). XGBoost, selected as our final model, achieved the best performance with the highest AUC and F1 score. The AUC and F1 score of XGBoost for the 4 outcomes were: 0.88 and 0.58 for 30-day mortality, 0.98 and 0.70 for septic shock, 0.88 and 0.55 for thrombocytopenia, and 0.89 and 0.40 for liver dysfunction. We developed a promising model, delivered as software, to support monitoring of ICU patients and to improve prognosis.

rithms like LR require researchers to manually select the highly related independent variables X, whereas cutting-edge machine learning algorithms can discover the relationship between X and Y automatically. Many researchers have tried to construct prediction models for patients who underwent cardiac surgery. Meyer et al 9 used a recurrent neural network (RNN) to predict mortality, bleeding, and renal failure after patients received heart surgery. Lei et al 10, Tseng et al 11, and Lee et al 12 used acute kidney injury (AKI) as their primary outcome; moreover, Kilic et al 13 considered prolonged ventilation and reoperation as their prediction objectives. Many researchers have paid attention to the most common complications after cardiac surgery, including AKI and sepsis; however, research on other comorbidities such as septic shock, liver dysfunction, and severe thrombocytopenia has been limited.
Vardon-Bounes et al 14 suggested that thrombocytopenia, with a prevalence of 50%, is a common hemostatic disorder in the ICU and is associated with bleeding, high illness severity, organ failure, and poor prognosis 15 . Moreover, Kunutsor et al 16 demonstrated that alanine transaminase (ALT) and aspartate transaminase (AST), as indicators of liver dysfunction, are inversely associated with coronary heart disease (CHD) and positively associated with stroke. In addition, Ambrosy et al 17 showed that the higher the ALT and AST, the lower the survival rate. An increase in transaminases often indicates that the body is in a state of hypoperfusion or hypoxemia; it signals that timely intervention is needed, otherwise patients may suffer adverse prognoses such as AKI or even death. Font et al 18 claimed that during septic shock the body produces a large number of inflammatory cytokines, causing multiple organ failures such as septic cardiomyopathy and acute respiratory distress syndrome.

Results
Study population. Among 6844 patients after heart surgery, 5475 (80%) were randomly assigned to the training data and 1369 (20%) to the test data. Table 1 showed the characteristics of these two groups, and most of the variables had no significant differences. Among the 6844 patients enrolled in this study, 219 (3.1%) died within 30 days after heart surgery. Septic shock, liver dysfunction, and thrombocytopenia accounted for 32 (0.5%), 248 (3.6%), and 202 (2.9%) patients, respectively. Table 2 showed that most of the input variables differed significantly between positive samples (ill patients) and negative samples (normal patients) (P < 0.05).
Machine learning models' performance. The accuracy, area under the curve (AUC), F1 score, precision, and recall of the four models for all complications were shown in Table 3, and the ROC curves of the 4 primary outcomes were plotted in Fig. 1.
Compared with the other algorithms, XGBoost had better overall performance in terms of AUC, test accuracy, and F1 score. In Fig. 2, decision curve analysis showed that, in terms of net benefit, XGBoost and RF were better than LR and ANN, and XGBoost was slightly better than RF. Therefore, we selected the XGBoost model as our final model in this study, and based on the XGBoost model files we built Windows 10 software, which can be downloaded from https://github.com/Zhihua-PredictionModel/ML-Prediction-Model, to present our research results as shown in Fig. 3. Source files of XGBoost for the 4 outcomes from Sklearn were also uploaded to the website. Other researchers or programmers can easily apply these trained model files (a ".model" file can be loaded with joblib, a Python package) to practical, customized prediction.
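The reload workflow described above can be sketched as follows. Since the authors' actual ".model" files live on their GitHub page, this minimal example instead trains and persists a small scikit-learn classifier; the file name, toy data, and classifier choice are illustrative assumptions, but the joblib dump/load mechanism is the same one the text describes.

```python
# Sketch: persisting and reloading a trained model with joblib, as described
# for the paper's ".model" files. The classifier, data, and file name here
# are hypothetical stand-ins.
import numpy as np
from joblib import dump, load
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # stand-in for the input variables
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

clf = LogisticRegression().fit(X, y)
dump(clf, "mortality.model")                  # hypothetical file name

reloaded = load("mortality.model")            # same API works for saved models
proba = reloaded.predict_proba(X[:1])[0, 1]   # predicted risk for one patient
```

In practice one would load the downloaded model file directly and pass a patient's preprocessed feature vector to `predict_proba`.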
The top 5 predictors that influenced the decision-making of XGBoost were calculated as shown in Table 4. The first, second, and third predictors for the 4 outcomes are as follows. 30-day mortality: vasopressin (first), pH (second), and creatinine (third); septic shock: hemoglobin, hematocrit, and lactate; severe thrombocytopenia: vasopressin, bicarbonate, and lactate; liver dysfunction: partial thromboplastin time, gender, and partial pressure of oxygen.

Discussion
In our study, four machine learning models were constructed and compared for 30-day mortality and 3 comorbidities after heart-related surgery. Other researchers have also conducted many studies on prediction models for such patients. Based on 2010 patients in the database of Seoul National University Hospital, Lee et al 12 found that among machine learning algorithms including decision tree, support vector machine, and random forest, XGBoost (test accuracy: 0.74; AUC: 0.78) performed best in predicting AKI after cardiac surgery, and a website was created to process patients' data in real time. In addition, other researchers usually paid attention to common complications such as AKI, sepsis, and hospital mortality, whereas research on other complications has been limited. Therefore, our study set out to predict 30-day mortality, septic shock, liver dysfunction, and severe thrombocytopenia, which are also important for patients' prognosis.
Several predictors of the different comorbidities were output by XGBoost. According to Table 4, among the 4 primary outcomes, lactate and platelet each appeared 3 times, and vasopressin and creatinine each appeared 2 times, which means they were important factors for our outcomes. Models built by Kilic et al 13 also showed that creatinine is an important factor in predicting mortality and renal failure after heart surgery.
Our study has some limitations. First, all experiments were conducted on MIMIC-III, a clinical database of critically ill patients, which means our machine learning models may perform well for patients who are critically ill and living in America, but may not work as well for people living in other regions. Therefore, further study is needed to obtain as much data as possible from various databases to construct a more comprehensive model that can work well for any population in any area.
Second, a sample imbalance problem occurred in the experiment. There is a trade-off among accuracy, F1 score, precision, recall, and AUC because medical data are usually highly unbalanced: among 100 patients, there may be 3 positive samples and 97 negative samples (normal samples). In this situation, ML algorithms tend to classify samples into the class with the most data. Therefore, we adjusted the weight of the positive samples of each complication in the loss function, which improved the precision, recall, and F1 score of the models at the cost of reducing AUC and accuracy. By setting other hyperparameters and using subsampling, we kept XGBoost well balanced between precision and recall. When facing unbalanced medical data, how to improve accuracy, F1 score, and AUC simultaneously as much as possible remains an open problem.
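A common heuristic for the positive-class weight discussed above is the ratio of negative to positive samples, which is how XGBoost's scale_pos_weight parameter is typically set. The short sketch below computes it for the study's rarest outcome; treating n_negative / n_positive as the weight is a standard convention, not a value reported by the authors.

```python
# Sketch: a conventional choice of positive-class weight for an unbalanced
# outcome, using the septic-shock counts reported above (32 of 6844).
n_total = 6844
n_pos = 32                      # septic shock cases (positive samples)
n_neg = n_total - n_pos         # remaining patients (negative samples)

# Heuristic: weight each positive sample by the class ratio
scale_pos_weight = n_neg / n_pos
print(round(scale_pos_weight, 1))  # → 212.9
```

With a weight this large, the loss function penalizes a missed septic-shock case roughly 213 times more than a missed normal case, which is why precision/recall improve while raw accuracy can drop.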
In conclusion, four machine learning algorithms were built to predict 30-day mortality and 3 comorbidities after open-heart surgery. The XGBoost model was the most robust model, with the highest AUC, F1 score, and net benefit. Besides, Windows 10 software was created and is available on the website mentioned above for clinical use.

Definition and primary outcomes. The primary outcomes were 30-day mortality and three comorbidities, septic shock, liver dysfunction, and severe thrombocytopenia, after heart-related surgery. 30-day mortality was defined as death within 30 days after discharge from the ICU. A patient was marked as having liver dysfunction if his/her first test values of aspartate transaminase (AST) and alanine transaminase (ALT) were normal (10-45 IU/L for ALT; 10-35 IU/L for AST) and later values of AST or ALT were greater than the maximum normal value (45 IU/L for ALT; 37 IU/L for AST) 20 . Severe thrombocytopenia was defined as a first platelet count higher than 50 K/uL with a later platelet count lower than 50 K/uL 14 . Considering that septic shock is a severe disease with acute symptoms, it was diagnosed by ICD-9 code (785.52) in the MIMIC-III database 21 . Only data from each patient's first ICU admission were considered.
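The liver-dysfunction and thrombocytopenia labelling rules above can be expressed directly in code. This is a minimal sketch of those rules as stated in the text; the function names and the list-of-values record format are assumptions, not the authors' extraction code.

```python
# Sketch of the outcome-labelling rules described above. Thresholds follow
# the text; record format (chronologically ordered lab values) is assumed.
ALT_MAX = 45   # IU/L, upper normal limit for ALT
AST_MAX = 37   # IU/L, upper normal limit for AST

def liver_dysfunction(alt_values, ast_values):
    """First ALT/AST normal, with a later ALT or AST above the upper limit."""
    first_normal = (10 <= alt_values[0] <= 45) and (10 <= ast_values[0] <= 35)
    later_elevated = (any(v > ALT_MAX for v in alt_values[1:])
                      or any(v > AST_MAX for v in ast_values[1:]))
    return first_normal and later_elevated

def severe_thrombocytopenia(platelet_counts):
    """First platelet count > 50 K/uL, any later count < 50 K/uL."""
    return (platelet_counts[0] > 50
            and any(p < 50 for p in platelet_counts[1:]))
```

For example, a patient whose ALT rises from 30 to 80 IU/L while AST stays normal would be labelled positive, whereas a patient whose first ALT is already elevated would not.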
Machine learning models. Logistic regression (LR) is a classic classification algorithm that makes a linear combination of the input variables and uses the sigmoid function to output a probability; its main hyperparameter is C. Neurons in an artificial neural network (ANN) make a linear combination of the output values from the upper layer's neurons, pass it through a sigmoid function, and finally output a value to the next neurons 22 . The width and depth of the hidden layers influence the performance of the ANN.
Compared with a single classifier, the ensemble learning algorithm random forest (RF), which merges multiple weak classifiers into a strong classifier, shows more powerful performance in classification tasks 23 . Its main hyperparameters are n_estimators, max_depth, and max_leaf_nodes.
Extreme gradient boosting (XGBoost) is also an ensemble model of decision trees 24 . Its main hyperparameters are n_estimators, max_depth, reg_lambda, gamma, min_child_weight, scale_pos_weight (when samples are unbalanced, this parameter changes the weight of positive samples in the loss function), max_delta_step, and subsample.
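A configuration for XGBClassifier using the hyperparameters just listed might look as follows. The numeric values here are purely illustrative placeholders; the parameters actually selected in this study are those listed in Table 5.

```python
# Sketch: an XGBClassifier hyperparameter configuration of the kind tuned
# in this study. Values are illustrative, not the paper's selected ones.
xgb_params = {
    "n_estimators": 300,       # number of boosted trees
    "max_depth": 4,            # depth limit per tree
    "reg_lambda": 1.0,         # L2 regularisation on leaf weights
    "gamma": 0.5,              # minimum loss reduction required to split
    "min_child_weight": 5,     # minimum sum of instance weight in a child
    "scale_pos_weight": 30,    # up-weights the rare positive class
    "max_delta_step": 1,       # caps each tree's weight update
    "subsample": 0.8,          # row subsampling fraction per tree
}

# Usage (requires the xgboost package):
# from xgboost import XGBClassifier
# model = XGBClassifier(**xgb_params).fit(X_train, y_train)
```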
Statistical method. 35 input variables, including demographics, use of vasopressin, laboratory variables, vital signs, comorbidities, and urine output of the first day, plus 4 output variables, were extracted from the database as shown in Table 1. Some important variables, such as the fraction of inspired O2 (FiO2), were excluded due to too much missing data; variables with less than 40% missing data were retained 25 . All missing values were filled with the mean value of the variable. A statistical method called winsorization was used to deal with outliers. Figure 4 showed the flow chart of data processing. After that, the 6844 samples were randomly divided into training data (5475) and test data (1369) in an 80:20 ratio. The chi-square test and Wilcoxon rank-sum test were used to compare the differences in categorical and continuous variables, respectively. They were employed to compare the differences between training data and test data to ensure the distributions of the two datasets were as similar as possible. In Table 1, P values were calculated, and P > 0.05 was considered to indicate no significant distribution difference between training data and test data. In Table 2, the chi-square test and Wilcoxon rank-sum test were used to compare positive samples and negative samples to observe the correlation between the independent variables X and the outcome variables Y; P < 0.05 was considered to indicate a strong correlation.
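The preprocessing pipeline described above (mean imputation, winsorization, and the 80/20 random split) can be sketched with NumPy alone. The synthetic matrix and the 1st/99th winsorization percentiles are assumptions for illustration; the text does not state which cut-offs the authors used.

```python
# Sketch of the described preprocessing: mean imputation of missing values,
# winsorization of outliers, and an 80/20 random split. The data and the
# 1%/99% clipping bounds are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(6844, 35))               # stand-in for the 35 variables
X[rng.random(X.shape) < 0.05] = np.nan        # simulate missing entries

# Mean imputation, column by column
col_means = np.nanmean(X, axis=0)
X = np.where(np.isnan(X), col_means, X)

# Winsorization: clip each column to its 1st/99th percentiles
lo, hi = np.percentile(X, [1, 99], axis=0)
X = np.clip(X, lo, hi)

# 80/20 random train/test split
idx = rng.permutation(len(X))
n_train = int(0.8 * len(X))                   # 5475 of 6844
X_train, X_test = X[idx[:n_train]], X[idx[n_train:]]
```

Note that 0.8 x 6844 reproduces the paper's 5475/1369 split exactly.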
Machine learning model training. The training data were evenly split into 5 parts: 4 parts were used to train a model with a given hyperparameter setting, and the remaining part, also called the validation set, was used to test the performance of that setting. This process was conducted 5 times to obtain 5 validation scores, and the average score was used to evaluate the performance of the model. Data scientists call this method fivefold cross-validation, and it is commonly used to select the best hyperparameters. The 4 machine learning algorithms, LR, ANN, RF, and XGBoost, were employed to fit the data, and all of these models have many hyperparameters that need to be specified, as shown in Table 5. Using grid search, various combinations of parameters were searched automatically in Python and the best one was selected; each model has its own best parameters. By comparing the four models' performance, the best one, XGBoost, was picked as the final model of our study. We continued to fine-tune the hyperparameters of XGBoost manually to obtain better performance.
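The fivefold cross-validated grid search described above is exactly what scikit-learn's GridSearchCV implements. The sketch below applies it to logistic regression's C on a small unbalanced synthetic dataset; the grid values and data are illustrative, not the grids of Table 5.

```python
# Sketch: fivefold cross-validated grid search, as described above, shown
# for logistic regression's C hyperparameter on illustrative toy data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Unbalanced toy data (~95% negative), mimicking the rare outcomes
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.95], random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},  # illustrative grid
    cv=5,                                  # fivefold cross-validation
    scoring="roc_auc",                     # mean validation AUC picks the winner
)
search.fit(X, y)
best_C = search.best_params_["C"]
```

The same pattern, one param_grid per algorithm, covers all four models; the winning estimator is available afterwards as `search.best_estimator_`.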
Using only the validation set and its score to evaluate the model would overestimate model performance; therefore, the test data, set aside at the beginning, were used to obtain the final scores presented in Table 3. In addition, decision curve analysis was applied to evaluate the models, as shown in Fig. 2. All machine learning experiments were conducted in Python (version 3.7.8).
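The held-out evaluation reported in Table 3 boils down to computing the five metrics on test-set predictions. A minimal sketch, with hypothetical labels and predicted probabilities standing in for a model's test-set output:

```python
# Sketch: computing the Table 3 metrics (accuracy, AUC, F1, precision,
# recall) on held-out predictions. Labels and probabilities are hypothetical.
import numpy as np
from sklearn.metrics import (accuracy_score, roc_auc_score, f1_score,
                             precision_score, recall_score)

y_test = np.array([0, 0, 0, 0, 1, 0, 1, 0, 0, 1])          # true outcomes
y_prob = np.array([0.1, 0.2, 0.3, 0.1, 0.9,
                   0.4, 0.8, 0.2, 0.6, 0.7])               # predicted risks
y_pred = (y_prob >= 0.5).astype(int)                        # 0.5 threshold

metrics = {
    "accuracy":  accuracy_score(y_test, y_pred),
    "auc":       roc_auc_score(y_test, y_prob),   # uses probabilities
    "f1":        f1_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall":    recall_score(y_test, y_pred),
}
```

Note that AUC is threshold-free (it consumes the probabilities), while F1, precision, and recall depend on the chosen decision threshold, which is one source of the trade-off discussed in the limitations.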

Data availability
Original data were extracted from the MIMIC-III database by Z.Z., the first author, who passed the online training and obtained access to the database, https://mimic.mit.edu. If needed, related data of this article can be obtained from F.L., the corresponding author, on reasonable request.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.