Predicting Short-term Survival after Liver Transplantation using Machine Learning.

Liver transplantation is one of the most effective treatments for end-stage liver disease, but the demand for livers far exceeds the supply of donor organs. The Model for End-stage Liver Disease (MELD) score is commonly used to prioritize patients, but previous studies have indicated that it may predict postoperative outcomes poorly. This work proposes a data-driven approach to devise a model that predicts postoperative survival within 30 days based on a patient's preoperative physiological measurements. We use random forest (RF) to select important features, including clinically established features and new features discovered from the physiological measurements. Moreover, we propose a new imputation method to deal with missing values, and the results show that it outperforms the alternatives. In the predictive model, we use patients' blood test data from day 1 to day 9 before surgery to predict postoperative survival. The experimental results on a real data set indicate that RF outperforms the other alternatives. On the temporal validation set, the proposed model achieves an area under the curve (AUC) of 0.771 and a specificity of 0.815, showing superior discrimination power in predicting postoperative survival.


Results
Random forest (RF) 29 is a state-of-the-art algorithm that can estimate feature importance from out-of-bag samples and permutation tests, in which informative variables produce a systematic decrease in accuracy when permuted. This work uses RF to estimate feature importance, and the top nine features are international normalized ratio (INR), lymphocytes, prothrombin time (PT), platelets, white blood cells (WBC), magnesium (Mg), sodium (Na), age, and BMI. PT and INR represent the same measurement, so we use only INR to construct the model in the following experiments, to avoid the bias introduced by duplicated variables.
To evaluate model performance, we use the AUC as the evaluation metric. The AUC summarizes the receiver operating characteristic (ROC) curve as a single number, the area under that curve. The AUC lies between 0 and 1, and a larger value indicates better classifier performance. ROC curves are commonly used in the medical field to determine thresholds for patient diagnosis; for a more detailed introduction to ROC and AUC, see Fawcett 30. Besides AUC, sensitivity and specificity are also used as metrics. Notably, specificity is more important than sensitivity in this work, because correctly identifying the patients at risk of non-survival is the clinical priority when predicting survival outcome after liver transplantation.
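As a minimal illustration (not the paper's code, which was written in R; the function name is ours), the AUC equals the probability that a randomly chosen positive case is scored above a randomly chosen negative one, which can be computed directly from the Mann-Whitney pair counts:

```python
# Illustrative sketch: AUC as the Mann-Whitney U statistic,
# equivalent to the area under the ROC curve.
def auc(labels, scores):
    """labels: 1 = survival (positive), 0 = non-survival; scores: model outputs."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need examples of both classes")
    # Count pairs where the positive example outranks the negative one;
    # ties contribute 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A perfect ranking yields AUC = 1.0; random scores hover around 0.5.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

This rank-based view also explains why AUC is threshold-free: it evaluates the ordering of scores rather than any single cut-off.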
To confirm that each selected feature contributes to prediction, we conduct experiments with seven models, each using a different feature combination. The first model is constructed with the patient-information features only, and we gradually add the blood test features to construct the subsequent models. Table 1 shows the experimental results, indicating that model performance increases as more features are used. Moreover, once the model uses all eight features, the AUC reaches 0.799 with the data from the 10 days before surgery as the data source, meaning that the selected features are important for predicting survival after liver transplantation.
Once feature selection and pre-processing are completed, we use the selected features to learn a predictive model. This work proposes RF for the predictive model, and we compare RF with other alternatives using patient basic information and blood test data from day 1 to day 9 before surgery as the final data source. Note that this work uses RF for two tasks: feature selection and the predictive model. We use the derivation set with 10-fold cross-validation to evaluate model performance and confirm the generalization ability of the proposed model on the temporal validation set. Besides, we compare the proposed method with eXtreme Gradient Boosting (XGBoost) 31, logistic regression, and decision tree 32. The results are presented in Table 2, which shows that RF yields the best AUCs on both the derivation and temporal validation sets. The AUC of 0.771 on the temporal validation set indicates that the proposed model achieves better discrimination power than the alternatives in predicting postoperative survival. We attribute this to RF's bagging approach, which combines many decision trees and thus tends to perform well on imbalanced data sets. XGBoost uses another ensemble learning approach, boosting, and it also works well in the experiments.
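The 10-fold cross-validation scheme used on the derivation set can be sketched as follows (an illustrative sketch, not the study's R code; the helper name is ours): the data is partitioned into ten disjoint folds, and each fold serves once as the held-out set while the remaining nine form the training set.

```python
# Illustrative sketch: partitioning a derivation set into k disjoint folds
# for cross-validation, as used to compare RF, XGBoost, logistic regression
# and decision tree models.
import random

def k_fold_indices(n_samples, k=10, seed=42):
    """Return k disjoint index lists that together cover all samples."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = k_fold_indices(100, k=10)
# Each fold is held out once; the other nine folds form the training set.
for i, test_idx in enumerate(folds):
    train_idx = [j for f_i, f in enumerate(folds) if f_i != i for j in f]
    assert len(train_idx) + len(test_idx) == 100
```

Averaging the per-fold AUCs then gives the derivation-set performance reported in Table 2.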

Discussion
In this study, we apply a feature selection technique to select important features from the physiological measurement items. The top features presented in Fig. 1 are obtained from RF as well as step-wise selection, and they include INR, lymphocytes, platelets, WBC, Mg, Na, age, and BMI. Detailed discussion of the findings is presented below.
Missing data is a common problem in medical research, and we propose an imputation method based on the characteristics of the features, in which we use a conservative strategy to replace the missing values. We compare the proposed imputation method with the alternatives, namely imputation with the mean, maximum, median, and minimum, respectively. The results are presented in Table 3, indicating that the proposed method achieves the best predictive performance. The proposed method considers the characteristics of the features, and we believe that this is the main reason it outperforms the other alternatives.
This work uses the data from the 10 days before surgery as the data source to construct the first model. Subsequently, we conduct experiments to find out which period of data before surgery has the most impact on the prediction results. The experiments use RF as the machine learning algorithm and different ranges of data as the data sources to train different predictive models. The experimental results are presented in Fig. 2 and show that the data from day 1 to day 9 before surgery is more important than the other ranges. This result matches intuition, as day 10 is the day farthest from the surgery in the range.
The MELD score is a formula involving bilirubin, INR, and creatinine, so it can be considered a combination of these three features. To compare the performance impact of the features, we use RF with the MELD score and several features to learn a predictive model. The features used by the comparison model include MELD score, hepatitis, HCC, DX1, age, gender, and BMI. The results are presented in the top of Table 4, indicating that our proposed model outperforms the alternative model in predicting postoperative survival. Besides, we apply the Cox proportional hazards model with the two feature combinations to perform survival analysis within 30 days, and the results are listed in the bottom of Table 4. Both are statistically significant, but the features selected by RF achieve a higher concordance index (0.85) than those used by the MELD score. Moreover, the same experiments are applied to the temporal validation set to investigate the generalization capability of our proposed model, and the results are presented in Table 5. They indicate that the features selected by RF provide more discriminative capability than the features used by the MELD score in predicting survival outcome after liver transplantation. Besides the above analysis, the hazard ratios (HR) from the Cox proportional hazards model are presented in Fig. 3, which shows only the basic features and the blood test data of day 9 owing to space limits. The significant features comprise INR, platelets, and age, which conforms to the bedside experience of the domain expert.
The medical data used in this study is imbalanced, as in most medical studies. We propose RF 29 to construct the predictive model, which provides not only accurate performance but also the capability to deal with imbalanced data. This is because RF uses a technique called bagging that can reduce or mitigate the bias caused by imbalanced data [33][34][35].
The bagging approach uses bootstrap sampling to draw many sub-samples with replacement from the initial data set, and each sub-sample is used to train a predictive model. In RF, each model is a classification and regression tree (CART) 32, a decision tree algorithm. The final prediction aggregates all these models, and the majority vote rule is a typical approach for determining the final result. Bagging provides a way to eliminate the bias caused by unstable models, and the experimental results indicate that RF works well on the imbalanced data used in this study.
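The bootstrap-and-vote procedure above can be sketched on a 1-D toy problem (an illustrative sketch, not the study's code; for brevity a 1-nearest-neighbour rule stands in for the CART base learner that RF actually uses):

```python
# Illustrative sketch of bagging: bootstrap sub-samples with replacement,
# one base learner per sub-sample, then a majority vote over all learners.
import random
import statistics

def fit_base_learner(points):
    """Trivial base learner: predict the label of the nearest training point.
    (In RF proper, each base learner is a CART decision tree.)"""
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]

def bagged_predict(points, x_new, n_models=25, seed=0):
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        # Bootstrap: draw len(points) samples with replacement.
        sample = [rng.choice(points) for _ in points]
        votes.append(fit_base_learner(sample)(x_new))
    return statistics.mode(votes)  # majority vote over all bootstrap models

data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
print(bagged_predict(data, 0.85))  # class 1 by majority vote
```

Because each base learner sees a slightly different bootstrap sample, their individual errors tend to cancel in the vote, which is the stabilizing effect the text describes.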
In conclusion, the analysis of the experimental results yields two findings. First, among the important features, most are blood test items that have been clinically proven to have an impact on survival outcome after liver transplantation, with the exception of Mg; our experimental results show that Mg is also an important feature affecting survival outcome after liver transplantation. Second, the experimental results show that RF is robust on imbalanced data. Most medical data sets are imbalanced, and many medical applications are interested in the risk factors that lead to the outcomes. Thus, RF is a very good machine learning model for the medical domain.
Although previous research has indicated that patients who have undergone orthotopic liver transplantation may be especially predisposed to hypomagnesaemia 36, the domain experts pointed out that Mg has not been used clinically to predict survival outcome after liver transplantation. However, blood magnesium ion concentration is indeed a very important electrolyte. A patient with malabsorption or a history of diuretic use would be considered at high risk of hypomagnesemia. When the blood magnesium ion concentration is too low, it directly affects the recovery of many other electrolytes. Moreover, previous research has indicated that Mg has a direct relationship with heart function 37, which associates Mg with mortality 38,39, such as the association between hypomagnesemia and fatal cardiac arrhythmia 40. Thus, another benefit of using a data-driven approach to devise a predictive model is that one may discover factors that are not directly related to the organ under study.
In summary, this work proposes an imputation method that considers the characteristics of the features to deal with missing values, and the results show that the proposed method works well. Central to this study is using machine learning to predict short-term survival, which can detect high-risk patients in the early phase after liver transplantation and discover important factors that are essential in liver transplantation; we argue that machine learning could help physicians make decisions. Once higher-risk patients are identified by the model, several treatment options could be given to them. For example, immunosuppression drugs could be administered earlier or at a relatively high concentration to avoid triggering acute rejection and the complications to which these patients are vulnerable, such as acute kidney injury and secondary bacterial infection.

Methods
In this study, the experimental data was collected by the liver transplantation ICU of Chang Gung Memorial Hospital, Linkou, and the study was approved by the institutional review board (IRB) of Chang Gung Memorial Hospital under case number 103-6018B. All data handling and methods were performed in accordance with the relevant guidelines and regulations of the IRB of Chang Gung Memorial Hospital. Additionally, this work is a retrospective study, and the IRB waived the need for informed consent. The patient data ranges from January 2004 to December 2013, and the number of data records is approximately two million. We divide the whole research process into several stages, as shown in Fig. 4.
The first stage is data pre-processing, which includes two steps: (1) Data cleaning: we follow the suggestions of domain experts to clean the data, including unifying the names of test items, processing extreme values, removing duplicated data, and so on. (2) Defining survival time: because our objective is to predict postoperative survival, we use the "Postoperative survival days" field in the patient's personal information as an indicator to define survival time.
The second stage is feature selection. In this study, we propose to use RF to select features, with the goal of selecting important features from the whole set of blood tests. A model with too many features may overfit, so the final model is expected to benefit from feature selection. The third stage is imputation of missing values, replacing them with meaningful values. In this work, we separate the data into a derivation set and a temporal validation set based on time information. Table 6 shows the patient characteristics of the derivation and temporal validation sets. The derivation set is used to train a predictive model, while the temporal validation set is used for model evaluation. Therefore, in the fourth stage, we use the derivation set to train the RF model. Finally, we use the temporal validation set to evaluate the model.
Data pre-processing. The exported data was in comma-separated values (CSV) format, and we used R (version 3.6.1) to process the data and build the predictive model. Data pre-processing involves two tasks in this study, which are described in the following sections.
Data cleaning. This work focuses on the prediction of postoperative survival, so it is natural to construct the predictive model from the data records before surgery. We retain the data records from the 10 days before surgery for analysis. Several data cleaning steps are applied to make the data suitable for the subsequent analysis; they are listed below. The final items used in the experiments are listed in Table A.1 of our previous work on the prediction of acute allograft rejection after liver transplantation 41.
• Retaining the data within 10 days: The original data comprises the patients' postoperative records and records from a long-term follow-up study. Since the goal is to predict postoperative survival, we kept only the data from the 10 days before surgery.
• Removing urine test items: Urinalysis results are less accurate than blood tests, so this work uses blood test data in the model and discards urine test data.
• Removing duplicated measurements: When multiple measurement records were present in the examination results, we used their average to represent the measurement value.
• Calculating BMI: We used Eq. (1) to calculate BMI from the patient's height and weight.
Data labeling. In this study, we retain the data records from the 10 days before surgery for analysis and use the "Postoperative survival days" field in the patient's personal information as an indicator to define survival. To exclude factors that affect modeling, such as quality of life, diet, and others, we focus on short-term survival prediction. In addition, liver function does not recover until 30 days after liver transplantation, so we define a survival time of more than 30 days as "Survival" and anything else as "Non-survival". Once the data labeling is completed, the data set is ready for the subsequent stages.
Feature selection. RF combines the bagging technique with the random subspace method 42 to construct a large number of decision trees. It is important to construct uncorrelated decision trees during the learning process, and the random subspace method is an ensemble learning method that reduces the correlation between trees by using a random sample of the features to construct each decision tree. As RF relies on a collection of decision trees to make predictions, it provides a way to estimate feature importance across all the decision trees by measuring the impact of each feature on model accuracy.
The idea is to permute the values of a feature and test its importance by measuring how much the permutation decreases the accuracy of the model. For important features, the permutation significantly decreases model performance; in contrast, permuting unimportant ones should have little impact. Once the importance scores are available, one can use them to rank the features.
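The permutation test above can be sketched as follows (an illustrative sketch, not the study's R code; the toy model and function names are ours): shuffle one feature's column, re-measure accuracy, and report the average drop.

```python
# Illustrative sketch: permutation importance of a single feature, measured
# as the average drop in accuracy after shuffling that feature's column.
import random

def accuracy(model, X, y):
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, feature_idx, n_repeats=20, seed=0):
    rng = random.Random(seed)
    base = accuracy(model, X, y)
    drops = []
    for _ in range(n_repeats):
        col = [row[feature_idx] for row in X]
        rng.shuffle(col)  # break the link between this feature and the labels
        X_perm = [row[:feature_idx] + [v] + row[feature_idx + 1:]
                  for row, v in zip(X, col)]
        drops.append(base - accuracy(model, X_perm, y))
    return sum(drops) / n_repeats

# Toy model that uses only feature 0; feature 1 is pure noise.
model = lambda row: 1 if row[0] > 0.5 else 0
X = [[0.9, 0.1], [0.8, 0.9], [0.2, 0.8], [0.1, 0.3]]
y = [1, 1, 0, 0]
print(permutation_importance(model, X, y, 0))  # large drop: informative feature
print(permutation_importance(model, X, y, 1))  # 0.0: the model ignores it
```

RF performs the same measurement on its out-of-bag samples, which is how the ranking in Fig. 1 is obtained.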

Imputation of missing values.
Missing values are always present in the data records; they may come from human error or from tests the patient did not undergo. Ignoring these values may make the model unstable. As a consequence, we propose a method that replaces missing values with reasonable values. The proposed approach is a conservative strategy, and the imputation is based on feature characteristics and domain knowledge. The steps are listed as follows: • STEP 1 - Stratifying the data by MELD score: Based on the MELD score, we divide all data into several groups: 1-9, 10-19, 20-29, 30-39, and 40 or more. • STEP 2 - Dividing features into three categories: • A features: The higher the value, the worse the prognosis. For example, age and INR.
• B features: The lower the value, the worse the prognosis. For example, Na.
• C features: A value that is either too low or too high indicates a worse prognosis. For example, WBC.
• STEP 3 - Replacing the missing values: Different replacement rules are applied depending on the MELD score group and the feature category.
• If the missing value belongs to an A feature, we replace it with the maximum for that group.
• If the missing value belongs to a B feature, we replace it with the minimum for that group.
• If the missing value belongs to a C feature, we replace it with the average for that group.
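The three steps above can be sketched as follows (an illustrative sketch, not the study's R code; the feature-to-category mapping follows the examples given in the text, and the function names are ours):

```python
# Illustrative sketch of the proposed imputation: stratify by MELD group,
# then replace a missing value with the group maximum ("A" features),
# group minimum ("B" features) or group mean ("C" features).
FEATURE_CATEGORY = {"INR": "A", "age": "A", "Na": "B", "WBC": "C"}

def meld_group(score):
    """MELD strata from the text: 1-9, 10-19, 20-29, 30-39, 40 or more."""
    return min(score // 10, 4)

def impute(value, feature, group_values):
    """Replace a missing value using the values observed in the same MELD group."""
    if value is not None:
        return value  # nothing to impute
    rule = FEATURE_CATEGORY[feature]
    if rule == "A":   # higher is worse -> conservatively assume the worst case
        return max(group_values)
    if rule == "B":   # lower is worse -> again assume the worst case
        return min(group_values)
    return sum(group_values) / len(group_values)  # "C": use the group mean

# Missing Na for a patient with MELD 15: take the minimum Na in the 10-19 group.
na_in_group = [128.0, 132.0, 136.0]
print(impute(None, "Na", na_in_group))  # 128.0
```

Imputing the worst plausible value within the patient's own severity stratum is what makes the strategy conservative: a missing test never makes a patient look healthier than the sickest observed member of their group.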
Model construction. Since the scales of the physiological measurements are quite different, we first take their natural logarithm. In the previous steps, the important items obtained from feature selection are INR, lymphocytes, PT, platelets, WBC, Mg, Na, age, and BMI. However, according to expert experience, INR and PT represent the same measurement, so we use only INR to construct the model in order to prevent bias. Next, we separate the data into a derivation set and a temporal validation set. The models are trained with the RF, XGBoost, decision tree, and logistic regression algorithms, and each algorithm produces ten different results. Notably, we use AUC as the performance metric because of the data imbalance.
One limitation of this work is that we do not apply external validation to the proposed model. This is our initial attempt to use machine learning to develop a predictive model for short-term survival after liver transplantation. To assess the proposed model objectively, we use a systematic approach to develop it and separate the data into a derivation set and a temporal validation set based on time information. We believe that the findings in this work are useful for other researchers, and applying external validation to the proposed model is our future work.
Table 7. The causes of death for the short-term survival patients in derivation and temporal validation sets.