Predicting graft failure in pediatric liver transplantation based on early biomarkers using machine learning models

The early detection of graft failure in pediatric liver transplantation is crucial for appropriate intervention. Graft failure is associated with numerous perioperative risk factors. This study aimed to develop an individualized predictive model for 90-day graft failure in pediatric liver transplantation using machine learning methods. We conducted a single-center retrospective cohort study. A total of 87 liver transplantation cases performed in patients aged < 12 years at the Severance Hospital between January 2010 and September 2020 were included as data samples. Preoperative conditions of recipients and donors, intraoperative care, postoperative serial laboratory parameters, and events observed within seven days of surgery were collected as features. A least absolute shrinkage and selection operator (LASSO)-based method was used for feature selection to overcome the high dimensionality and collinearity of variables. Among 146 features, four variables were selected as the resultant features, namely, preoperative hepatic encephalopathy, sodium level at the end of surgery, hepatic artery thrombosis, and total bilirubin level on postoperative day 7. These features were selected from different time points and represent distinct clinical aspects. The model with logistic regression demonstrated the best prediction performance among the machine learning methods tested (area under the receiver operating characteristic curve (AUROC) = 0.898 and area under the precision–recall curve (AUPR) = 0.882). The risk scoring system developed based on the logistic regression model showed an AUROC of 0.910 and an AUPR of 0.830. Together, the prediction of graft failure in pediatric liver transplantation using the proposed machine learning model exhibited high discriminative power and, therefore, can provide valuable information to clinicians for their decision making during the postoperative management of the patients.

The number of pediatric patients who receive liver transplants has increased over the years 1. Liver transplantation is the standard treatment for children with end-stage liver disease, malignancy, and metabolic disorders related to the liver 2. However, the number of suitable donor livers is limited, and numerous children remain on the waiting list for long periods or even die before receiving a transplant 3. Owing to better management during the pretransplant and post-transplant phases, the graft survival rate has improved 4. However, the individualized prediction of graft failure in pediatric liver transplantation remains challenging because numerous perioperative factors, including the preoperative status of patients, intraoperative anesthetic management, and postoperative complications such as vascular thrombosis or biliary leakage, can together influence the outcome of transplantation 2,5,6. Moreover, the number of pediatric liver transplant patients in single centers is usually small. The use of traditional statistical methods can be limited under these conditions. Several studies have employed statistical methods to evaluate the predictive factors of graft failure and developed predictive models for pediatric liver transplantation 7-10. However, the complex interactions between numerous parameters derived from patient medical records make it difficult to select specific predictors of graft failure. Recently, several attempts have been made to analyze medical data using machine learning for application in clinical practice 11,12. Several machine learning models have been used to predict graft survival/failure in liver transplantation (Supplementary Table S1) 13-15. However, those models excluded pediatric patients and did not use serial early biomarkers as features.
Machine learning techniques enable the analysis of considerable amounts of complex data and the development of models for estimating risks or predicting events. Specifically, the least absolute shrinkage and selection operator (LASSO)-based method can be used when the number of collected features is large relative to the number of subjects and when these features are highly correlated 16,17. By using this method, we sought to overcome the high correlation among features, such as serial laboratory data, that were collected from a small number of patients.
The objective of this study was to develop a high-performance machine learning model for individualized prediction of 90-day graft failure in pediatric liver transplantation using perioperative parameters collected within 7 days after surgery. For this, we retrospectively collected patient data from the electronic medical records at a single center. The dataset included 146 features, which exceeds the number of patients and makes the problem intractable with traditional statistical methods. We utilized a LASSO-based method to identify the potential predictors of 90-day graft failure. These predictors served as input variables for the development of machine learning models. Graft failure was defined as failure of the liver allograft that required re-transplantation or resulted in death. Deaths from causes other than liver failure were not defined as graft failure. Finally, we developed a risk scoring system for graft failure for easy utilization in the routine clinical setting.

Methods
Patients and data collection. This study was approved by the Institutional Review Board (IRB) of our hospital (IRB No. 4-2018-0205). The requirement for informed consent was waived by the IRB of Severance Hospital, Yonsei University Health System, Seoul, Korea, owing to the retrospective nature of this study. All methods were performed in accordance with the relevant guidelines and regulations. We conducted a single-center retrospective cohort study. The data were obtained from electronic medical records. Patients aged below 12 years who underwent pediatric liver transplantation surgeries at the Severance Hospital between January 2010 and September 2020 were selected as the subjects.
Moreover, the characteristics of recipients and donors, anesthetic and surgical events, complications during hospitalization, and serial laboratory results until the seventh postoperative day (Supplementary Table S2) were collected as the features. The primary endpoint was graft failure at 90 days after transplantation.
Statistical and machine learning methods. Nested cross-validation. Because the total number of transplantations and graft failure cases was small, splitting the dataset into training and testing could cause additional uncertainties in the analysis. Therefore, we employed a nested cross-validation scheme for feature selection and predictive machine-learning model development. The training-test split was repeated as outer cross-validation (Fig. 1). For each outer training fold, additional inner cross-validation was applied for feature selection, hyperparameter tuning, and performance assessment during model development.
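The nested scheme described above can be illustrated with a minimal sketch. The study itself was carried out in R; the Python code below is only an illustration, and the function names (`k_fold_indices`, `nested_cv_splits`) are hypothetical. It assumes plain (unstratified) random folds, which the text does not specify. The point it shows is that every inner split is drawn only from the outer training fold, so the outer test fold never influences feature selection or tuning.

```python
import random

def k_fold_indices(indices, k, rng):
    """Shuffle `indices` and split them into k nearly equal folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def nested_cv_splits(n_samples, outer_k=5, inner_k=5, repeats=10, seed=0):
    """Yield (repeat, outer_train, outer_test, inner_splits) tuples.

    inner_splits is a list of (inner_train, inner_val) pairs built only
    from outer_train, so outer_test stays unseen during model selection.
    """
    rng = random.Random(seed)
    for rep in range(repeats):
        outer_folds = k_fold_indices(range(n_samples), outer_k, rng)
        for i, outer_test in enumerate(outer_folds):
            outer_train = [j for f, fold in enumerate(outer_folds)
                           if f != i for j in fold]
            inner_folds = k_fold_indices(outer_train, inner_k, rng)
            inner_splits = []
            for v, inner_val in enumerate(inner_folds):
                inner_train = [j for f, fold in enumerate(inner_folds)
                               if f != v for j in fold]
                inner_splits.append((inner_train, inner_val))
            yield rep, outer_train, outer_test, inner_splits

# 87 cases, five outer folds, five inner folds, 10 repeats
splits = list(nested_cv_splits(n_samples=87))
```

With these settings the generator produces 50 outer train/test pairs (5 folds x 10 repeats), each carrying five inner splits.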
Feature selection. To select the features, we employed stability selection, a LASSO-based method implemented in the R package stabs 18. This method assesses how often (or stably) each feature is selected by LASSO across bootstrapped subsamples of the given input data by quantifying the stability score of each feature (the number of inclusions divided by the number of bootstrapped subsamples). Bootstrapping improves robustness and reduces the selection of false positives. The features listed in Table 1 and Supplementary Table S2 were considered candidate predictors. Given the missing values (119 out of 12,702 items) across subjects (17 out of 87 subjects) and features (56 out of 146 features) and the high collinearity between features, we sought to maximize the utilization of data in feature selection by filling the missing values with the mean values of the corresponding features. Stability selection was applied with five-fold outer and inner nested cross-validation, repeated 10 times (Fig. 1). For each outer training fold, the features were ranked in order of decreasing stability score. Logistic regression models with incrementally added top-ranking features were then evaluated based on the predictive performance in five-fold inner cross-validation. After all iterations across the outer folds, the consensus stability scores for all features were obtained by averaging the stability scores across the outer folds. Subsequently, the optimal number of features was determined as the minimum number of features associated with the maximum averaged cross-validated predictive performance. Features were then selected up to the rank of the optimal number in the consensus stability scores: preoperative hepatic encephalopathy (HE), Na level at the end of operation (Endop_Na), hepatic artery thrombosis (HA_thrombosis), and total bilirubin level on postoperative day (POD) 7 (POD7_Tbilirubin).
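The stability score described above can be sketched as follows. This is not the stabs implementation: it is a simplified, standard-library Python illustration that assumes a linear-response LASSO fitted by coordinate descent (the study's exact LASSO variant and penalty are not stated here), half-size subsamples drawn without replacement, and an arbitrary penalty `lam`. All function names and the synthetic data are hypothetical.

```python
import random

def impute_mean(X):
    """Replace None entries with the mean of the observed values in that column."""
    means = []
    for j in range(len(X[0])):
        seen = [row[j] for row in X if row[j] is not None]
        means.append(sum(seen) / len(seen))
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in X]

def standardize_columns(X):
    """Return column-major data with mean 0 and unit variance (constant columns -> 0)."""
    n, cols = len(X), []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        mu = sum(col) / n
        sd = (sum((v - mu) ** 2 for v in col) / n) ** 0.5
        cols.append([(v - mu) / sd if sd > 0 else 0.0 for v in col])
    return cols

def soft_threshold(z, t):
    return z - t if z > t else z + t if z < -t else 0.0

def lasso_support(X, y, lam, n_iter=30):
    """Features with nonzero coefficient for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, cols = len(y), standardize_columns(X)
    y_mean = sum(y) / n
    resid = [v - y_mean for v in y]
    beta = [0.0] * len(cols)
    for _ in range(n_iter):
        for j, col in enumerate(cols):
            norm = sum(c * c for c in col) / n
            if norm == 0:
                continue
            rho = sum(c * r for c, r in zip(col, resid)) / n + norm * beta[j]
            new = soft_threshold(rho, lam) / norm
            if new != beta[j]:
                delta = new - beta[j]
                resid = [r - delta * c for r, c in zip(resid, col)]
                beta[j] = new
    return {j for j, b in enumerate(beta) if b != 0.0}

def stability_scores(X, y, lam, n_subsamples=40, seed=0):
    """Fraction of subsamples in which LASSO selects each feature."""
    rng = random.Random(seed)
    n, counts = len(y), [0] * len(X[0])
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), n // 2)  # assumed: half-size subsamples
        for j in lasso_support([X[i] for i in idx], [y[i] for i in idx], lam):
            counts[j] += 1
    return [c / n_subsamples for c in counts]

# Synthetic illustration (not study data): features 0 and 3 drive the outcome.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(200)]
y = [1.0 if row[0] + row[3] > 0 else 0.0 for row in X]
scores = stability_scores(impute_mean(X), y, lam=0.1)
ranking = sorted(range(6), key=lambda j: -scores[j])
```

In this toy setting the two informative features should receive the highest stability scores, which mirrors how features were ranked within each outer training fold before the number of retained features was chosen by inner cross-validation.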
Machine learning methods. Based on the selected features, we constructed machine learning models using logistic regression, elastic net, random forests, extreme gradient boosting, support vector machines (SVM), and neural networks. We excluded two patients with missing values for POD7_Tbilirubin, which obtained the second highest rank in the consensus stability scores. We employed five-fold outer and three-fold inner nested cross-validation, repeated 10 times, to assess and compare the model performance across machine learning methods and the hyperparameters of each machine learning method using the R caret package. For the prediction performance, we computed the areas under the receiver operating characteristic (AUROC) and precision-recall (AUPR) curves. A nomogram was constructed based on the final logistic regression model using the R rms package.
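For readers unfamiliar with the two performance measures, the sketch below computes them from scratch in Python. It is illustrative only: the study used R, and the exact interpolation used for AUPR is not stated, so the step-wise "average precision" form is assumed here. The labels and predicted risks in the example are made up.

```python
def auroc(y_true, scores):
    """Probability that a random positive case is scored above a random
    negative case (ties count as one half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve: the mean of the
    precision values at the rank of each positive case."""
    ranked = sorted(zip(scores, y_true), key=lambda t: -t[0])
    n_pos = sum(y_true)
    true_pos, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            total += true_pos / rank
    return total / n_pos

# Made-up example: 1 = graft failure, scores = predicted risk
labels = [1, 0, 1, 0, 0]
risks = [0.9, 0.8, 0.7, 0.3, 0.1]
roc_value = auroc(labels, risks)              # 5 of 6 positive/negative pairs ordered correctly
pr_value = average_precision(labels, risks)   # (1/1 + 2/3) / 2
```

AUPR is reported alongside AUROC because, with few graft failures relative to successes, precision-recall behavior is more sensitive to how well the rare class is ranked.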
Risk scoring system. The risk scoring system was developed by first transforming the continuous features into categorical ones by binning. The regression coefficients of the categories were then used to assign appropriate risk scores. To determine the optimal binning boundaries, we first obtained the deciles of each continuous feature, generating nine binary categorization schemes, and retrained two-fold cross-validated logistic regression models for each categorization scheme. Subsequently, we selected the scheme with the highest cross-validated AUROC. The same procedure was applied recursively until the cross-validated AUROC did not improve to determine the optimal bins for all continuous features, which were then filtered based on the statistical significance of the regression coefficients of categorical features and their effects on the Akaike information criteria. Finally, the scores of categorical features were derived from regression coefficients by finding the corresponding integer values that best preserved the relative ratios between the coefficients.
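Two steps of this procedure lend themselves to a short sketch: generating the nine decile-based candidate cut-points for a continuous feature, and converting regression coefficients into integer scores that preserve their relative ratios. The Python code below is a hedged illustration, not the authors' R code; the function names, the cap of 10 points, and the squared-error criterion for "best preserved" are assumptions, and the coefficients in the example are invented rather than taken from the study. The cross-validated model refitting used to choose among cut-points is omitted.

```python
def decile_cutpoints(values):
    """Nine decile boundaries (10th to 90th percentile, linear interpolation).
    Each boundary defines one candidate binary categorization of the feature."""
    xs = sorted(values)
    n = len(xs)
    cuts = []
    for d in range(1, 10):
        pos = d * (n - 1) / 10
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        cuts.append(xs[lo] + (pos - lo) * (xs[hi] - xs[lo]))
    return cuts

def integer_scores(coefs, max_score=10):
    """Integers whose ratios best match the ratios of the coefficients.

    For each candidate value of the largest score (1..max_score), the
    coefficients are rescaled and rounded; the rounding with the smallest
    squared error (measured on the coefficient scale) is kept, preferring
    smaller scores when errors tie.
    """
    largest = max(abs(c) for c in coefs)
    best_err, best_scores = None, None
    for top in range(1, max_score + 1):
        scale = top / largest
        scores = [round(c * scale) for c in coefs]
        err = sum((c * scale - s) ** 2 for c, s in zip(coefs, scores)) / scale ** 2
        if best_err is None or err < best_err - 1e-12:
            best_err, best_scores = err, scores
    return best_scores

# Invented inputs for illustration only
cuts = decile_cutpoints(list(range(1, 12)))
points = integer_scores([1.0, 2.0, 3.0])
```

Because the integer scores are close to a rescaled copy of the coefficients, the total score tracks the model's linear predictor, which is the property the scoring system relies on.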
Software. All analyses were performed using R (version 3.6.2).

Results
Among the 87 cases of pediatric liver transplantation selected in this study, 17 developed liver graft failure within 90 days of surgery (Table 1). The baseline demographic data, perioperative conditions, laboratory data, and complication-related features differed significantly (p < 0.05) across the graft outcomes (Table 1 and Supplementary Table S2). On POD 1, the transplanted livers that later failed already showed signs of tissue destruction (reflected by higher ALT), followed by decreased production of coagulation factors on POD 2 (reflected by higher aPTT) and decreased excretory function (reflected by higher total and direct bilirubin) on POD 7 (see Supplementary Table S3 and Supplementary Fig. S1). These significant individual associations between input features and graft outcomes suggested that early patient-derived features are likely to have predictive power for the graft status at later time points. However, the overall high dimensionality of the data, with the 146 input features exceeding the number of patients, and the high collinearity between redundant features, such as those measured repeatedly across multiple time points, made it challenging to identify the most predictive features for developing predictive models of liver graft failure (Supplementary Fig. S2).
To address this, we conducted feature selection using stability selection, a LASSO-based method in the nested cross-validation scheme (Fig. 1). The consensus stability scores obtained by stability selection were ranked in decreasing order, as shown in Fig. 2A. The top four features were determined as the rank cutoff because this was the optimal number of features that resulted in the best averaged cross-validation AUROC curve across the outer training fold (Fig. 2B). The selected features included preoperative HE, Na level at the end of operation (Endop_Na), hepatic artery thrombosis (HA_thrombosis), and total bilirubin level on POD7 (POD7_Tbilirubin). Interestingly, these are from various time points and reflect various pathophysiological aspects of liver transplant surgery. This suggests that our feature selection approach delineated non-redundant predictive features, thereby overcoming the high dimensionality and collinearity of the dataset.
Next, we sought to build predictive machine learning models based on these selected features. Logistic regression was among the best-performing models, along with elastic net, based on the cross-validation performance (AUROC = 0.898 and AUPR = 0.882; Supplementary Table S4 and Supplementary Fig. S3). The final logistic regression models were generated using the entire dataset, as summarized in Supplementary Table S5.
Finally, we constructed a nomogram based on the final logistic regression model ( Supplementary Fig. S4). In addition, we developed a risk scoring system after categorizing the continuous features (Table 2 and Fig. 3). The cross-validation prediction performance of the optimal categorized logistic regression model exhibited an AUROC and AUPR of 0.910 and 0.830, respectively. The scoring system robustly reflected the categorized logistic regression model with a Pearson correlation coefficient of 1.00 between the scores and linear predictors. The scoring system could delineate a 50-fold difference in the risk of graft failure across score intervals. These findings may guide early therapeutic interventions to prevent graft failure after liver transplant surgeries.

Discussion
We developed a predictive machine learning model for 90-day graft failure in pediatric liver transplantation patients using the features derived until POD 7. The number of extracted features was greater than the number of patients. Therefore, we used a LASSO-based method to overcome the high dimensionality and collinearity of features. To further ensure robust feature selection by minimizing the detection of false positive features, we employed the stability selection method developed by Meinshausen and Bühlmann in nested cross-validation 18. The model with logistic regression based on the selected features exhibited the best prediction performance (AUROC = 0.898 and AUPR = 0.882). Furthermore, we developed a risk scoring system to predict graft failure.
Pediatric liver transplantation differs from adult transplantation in terms of etiology and outcome 19. Some models use traditional statistical approaches for predicting the prognosis of pediatric transplantation 8-10. Recently, machine learning methods have exhibited outstanding performance in analyzing large volumes of medical data and predicting the outcomes of patients to help clinicians in decision making 20. The early detection of graft failure is important for registration on the waiting list when patients need to undergo re-transplantation. In pediatric liver transplantation, specifically, the proportion of living donor transplantations relative to deceased donor transplantations is higher than that in adult liver transplantation 3,21. Therefore, the early prediction of graft failure is essential for the prompt evaluation of other living candidate donors.
Recently, Wadhwani et al. reported a machine learning model for predicting the ideal outcomes three years after pediatric liver transplantation 22. They used features including the postoperative characteristics until one year after surgery. We aimed to predict graft outcomes at an earlier phase so that surgeons can prepare for proper management or re-transplantation. According to a recent outcome analysis using the National Registry of Korea, the Kaplan-Meier survival curves of patients and grafts decreased rapidly until three months after transplantation and reached a plateau thereafter 21. This is why we predicted 90-day graft failure using only perioperative data collected until POD 7. We also used serial perioperative laboratory data for each patient to obtain more precise predictions and personalized algorithms.
HE is a brain dysfunction caused by liver insufficiency and/or portosystemic blood shunting 23. The pathophysiology of HE has not been completely understood 24. However, it has been reported as an independent risk factor of poor outcome in liver transplantation 25. Recently, Sahinturk et al. reported that preoperative HE is a predictor of postoperative prolonged mechanical ventilation 26. Prolonged postoperative mechanical ventilation could affect the blood flow and oxygenation of the graft 27,28. Therefore, HE is an important feature of the proposed model.
The values of sodium and total bilirubin were serially collected from the preoperative phase to POD 7. Among these values, the sodium level at the end of surgery and total bilirubin level measured on POD 7 were selected as the resultant features. Elevated sodium levels may reflect the administration of large volumes of packed red blood cells, fresh frozen plasma, or albumin. Liver transplantation often requires massive transfusion of blood components. The sodium concentration in packed red blood cells is 150 mEq/L 29 and that in fresh frozen plasma is 172 mEq/L 30. Higher sodium concentration in these components can explain the higher sodium concentration at the end of surgery. Hypernatremia could also be associated with sodium bicarbonate infusion for correcting severe metabolic acidosis during surgery. Hyperbilirubinemia can be caused by impaired liver function, massive transfusions, or cholestasis 31,32. Preoperative total bilirubin levels may be influenced by the preoperative status of recipients. Bilirubin levels before POD 7 could be affected by increased heme breakdown resulting from massive transfusion. Bilirubin levels measured on POD 7 may be associated with the postoperative graft function or cholestasis. This may explain why the total bilirubin level on POD 7 was selected as an important predictive feature.
Thrombosis in the hepatic artery or portal vein is known to be a significant risk factor for graft survival in pediatric liver transplantation 7. The incidence of early hepatic artery thrombosis is reported to be higher in children than in adults 33. Hepatic artery and other vascular thromboses were reported as the most common cause of re-transplantation 1. Early detection and revascularization can improve graft survival. Re-transplantation can be lifesaving if other interventions fail 34,35.
We used the thrombotic events observed until POD 7 as a feature, which is an earlier period than that used in other studies 34-36. A recent study reported a median time interval of 5.5 days between transplantation and hepatic artery complication 21. Despite the different criteria, thrombosis in the hepatic artery still served as an important risk factor in graft survival.
Limitation. First, we conducted a single-center retrospective study with a small sample size. Moreover, high correlations were observed between the collected features. To overcome the high dimensionality and collinearity in the dataset, we used a LASSO-based method to select predictive features. The top four predictive features were selected based on different clinical aspects and at different time points. We applied nested cross-validation during model development to avoid additional noise when splitting small-sized data into training and test datasets.
Second, we did not apply external validation to our prediction model because this was a single-center study. It is difficult to use serial laboratory data from other organizations for external validation. Further randomized controlled studies could help overcome this limitation and evaluate the impact of our machine learning model.
Third, we included both living and deceased donor transplantations. Although donor type was not selected as a predictive feature, this might have biased the results of our study. In addition, the proportion of living donor transplantations was higher in our country, including our institution, than that in other countries. However, the inclusion of both types of donors could help in the generalizability of our study.

Conclusion
We developed a machine learning model that predicts 90-day graft failure with high accuracy by overcoming the high dimensionality and collinearity of the dataset. The most predictive features were preoperative HE, Endop_Na, HA_thrombosis, and POD7_Tbilirubin. Based on this prediction model, we further developed a nomogram and risk scoring system for easy utilization in the routine clinical setting. These methods can serve as decision support systems for surgeons in identifying high-risk patients and preparing for proper intervention, including re-transplantation, during the early stage after surgery.

Data availability
The data is not publicly available due to privacy or ethical restrictions, but will be made available on reasonable request from the corresponding author, with the permission of the Institutional Review Board (IRB) of Severance Hospital. Restrictions apply to the availability of these data, which were used under license for this study. The code developed for this study is available on reasonable request from the corresponding author.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.