A Machine Learning Approach Using Survival Statistics to Predict Graft Survival in Kidney Transplant Recipients: A Multicenter Cohort Study

Accurate prediction of graft survival after kidney transplant is limited by the complexity and heterogeneity of risk factors influencing allograft survival. In this study, we applied machine learning methods, in combination with survival statistics, to build new prediction models of graft survival that included immunological factors, as well as known recipient and donor variables. Graft survival was estimated from a retrospective analysis of data from a multicenter cohort of 3,117 kidney transplant recipients. We evaluated the predictive power of machine learning algorithms (survival decision tree, bagging, random forest, and ridge and lasso regression) and compared the outcomes to those of conventional models (decision tree and Cox regression). Using a conventional decision tree model, a 3-month post-transplant serum creatinine level above the cut-off of 1.65 mg/dL predicted a graft failure rate of 77.8% (index of concordance, 0.71). Using a survival decision tree model increased the index of concordance to 0.80, with an episode of acute rejection during the first year post-transplant being associated with a 4.27-fold increase in the risk of graft failure. Our study revealed that early acute rejection in the first year is associated with a substantially increased risk of graft failure. Machine learning methods may provide versatile and feasible tools for forecasting graft survival.


Results
Baseline characteristics. Among the 3,117 KT recipients enrolled in our study, graft failure (GF) occurred in 304 (9.8%) over a mean follow-up period of 85.2 months. Baseline characteristics for patients who experienced GF, compared to those without GF, are summarized in Tables 1 and 2. Among the 304 cases of GF, 201 (66.1%) were male. Compared to the non-GF group, patients who experienced GF also exhibited a higher prevalence of the following pre-transplant risk factors: smoking, a history of hepatitis B or C virus infection, and ischemic heart disease. Other relevant factors were comparable between the two groups, including pre-transplant dialysis modality, recipient age, and comorbidities such as diabetes, hypertension, and cerebral and peripheral vascular disease (Table 1).
With regard to donor characteristics, 30.5% of patients in the GF group received a deceased donor transplant, whereas 53.1% of patients in the non-GF group received a living donor transplant. No between-group differences were identified in donor age, sex, kidney function at the time of donation, body mass index, or cytomegalovirus IgG titer (Table 2).
In terms of immunological profiles, HLA-B and HLA-DR mismatches were the only differences identified between the two groups. Cyclosporine A was more frequently used as the maintenance immunosuppressant among patients with GF than among those without GF. Moreover, patients with GF experienced more frequent episodes of infection during the first year post-KT. Laboratory findings at 3 months post-KT differed significantly between the two groups, with higher serum creatinine and lower serum sodium and total CO2 levels among patients who experienced GF than among those who did not (Table 2).
Conventional decision tree modelling of the 10-year graft survival prediction. When estimating survival at N years post-transplant, distortion can be introduced into the estimate by using data from patients who have not been followed up for N years. We tried to overcome the disadvantages of retrospective medical records, including missing data, with various methods, such as multiple imputation (Fig. 1, setting 1).
In this process, we identified donor-specific antibody (DSA) as an important issue to consider. As shown in several retrospective studies using the same dataset, the proportion of DSA-positive patients in our study was very low. We assumed this low prevalence of DSA-positive patients (n = 23) was because these cases did not proceed to transplantation or because a positive DSA was not detected prior to the formal implementation of testing after 2008 (Table 1). The ratio of DSA-negative to DSA-positive patients in our study group was 550:23, and DSA status was not available for the remaining 2,535 cases. Considering that formal testing was implemented only after 2008, we constructed two different models, one that included DSA (setting 1) and one that did not (setting 2).
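As a rough illustration of this imputation step (not the study's actual pipeline), the R sketch below uses the mice package; the data frame `kt` and its `DSA` column are hypothetical names assumed for the example.

```r
# Sketch only: multiple imputation of missing attributes prior to modelling.
# The data frame 'kt' and the 'DSA' column are hypothetical.
library(mice)

# Setting 1: impute all attributes, including DSA (5 imputed datasets, default methods)
imp1       <- mice(kt, m = 5, seed = 123)
kt_imputed <- complete(imp1, action = 1)   # take the first completed dataset

# Setting 2: drop DSA before modelling, so its missingness does not need to be imputed
kt_no_dsa <- subset(kt, select = -DSA)
```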
Parameters used in our predictive models are summarized in Table 3, including the method of imputation, test ratio, validation method, validation ratio, number of folds, training and test performance of the classifiers, and validation of the mixed dataset in the conventional decision tree under different model settings. The performance of the classifier, with the associated index of concordance, is also reported in Table 3. The routines used to implement conventional classification and regression trees (CART) are shown in Fig. 2 24,25.
The binary classification tree derived from the dataset by conventional methods is shown in Fig. 2, with the index of concordance ranging from 0.65 to 0.71. In the decision tree, internal nodes represent attributes, and leaf nodes give the relative frequency of graft failure (coded as '1').
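As a rough illustration (not the study's actual code), a conventional CART classifier and its index of concordance could be obtained in R as sketched below; the data frames and all column names are hypothetical.

```r
# Sketch only: a conventional CART classifier (Gini split rule) for 10-year graft
# failure, and its concordance against the observed time-to-graft-failure data.
# 'kt_train', 'kt_test', and all column names are hypothetical.
library(rpart)
library(survival)

fit_cart <- rpart(factor(gf_10yr) ~ cr_3mo + recipient_age + hla_b_mismatch + donor_age,
                  data = kt_train, method = "class")   # Gini impurity is the default split rule

# Predicted probability of graft failure on the test set (second column = failure class)
p_fail <- predict(fit_cart, newdata = kt_test, type = "prob")[, 2]

# Index of concordance for time-to-graft-failure data (higher risk should mean earlier failure)
concordance(Surv(follow_up_months, graft_failure) ~ p_fail, data = kt_test, reverse = TRUE)
```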
Among the various conventional decision tree (DT) models, the 3-month serum creatinine level post-transplant was identified at the first decision node, being the most important risk factor for GF (predicted 10-year GF rate, 77.8%; Fig. 2(B), setting 2; index of concordance, 0.71). The age of the recipient at the time of transplantation was selected at the next node in the group of patients with a 3-month serum creatinine level <1.65 mg/dL, with a predicted 10-year GF rate of 71.4% for patients aged ≥57.5 years without HLA mismatch. To set up the 10-year graft survival prediction, we first applied a conventional decision tree model at each horizon from 1 to 10 years after KT, and additionally verified the results by applying a weighting method (Table 4) 6. However, these models were problematic to use for participants with a shorter follow-up period. Therefore, we used a different predictive approach based on survival statistics.
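One possible implementation of such a weighting scheme for censored cases, in the spirit of Zupan et al. 6, uses Kaplan-Meier survival probabilities; the R sketch below is an assumption-laden illustration with hypothetical variable names, not the study's exact procedure.

```r
# Sketch: Kaplan-Meier based weighting of short-follow-up (censored) cases for an
# N-year classification target, one possible reading of the Zupan et al. weighting.
# 'kt' and its columns are hypothetical.
library(survival)

horizon <- 120   # prediction horizon in months (10 years)
km      <- survfit(Surv(follow_up_months, graft_failure) ~ 1, data = kt)
S       <- stepfun(km$time, c(1, km$surv))   # Kaplan-Meier estimate S(t) as a step function

# Censored cases with follow-up shorter than the horizon
cens   <- kt$graft_failure == 0 & kt$follow_up_months < horizon
# Probability of graft failure by the horizon, conditional on surviving to the censoring time
p_fail <- 1 - S(horizon) / S(kt$follow_up_months[cens])

# Each such case can enter training twice: as a positive example with weight p_fail and
# as a negative example with weight 1 - p_fail, alongside fully observed cases with weight 1.
```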
Decision tree modelling using survival statistics, with comparisons to previous conventional models. The models built using survival statistics included the survival decision tree, bagging, random forest, and ridge and lasso regression. These models use a survival analysis statistic 26, rather than the Gini index or entropy used as the split rule in conventional tree algorithms. The parameters of the survival models, including the index of concordance, are presented in Table 5. The survival tree algorithm performed better than conventional tree modelling, with an index of concordance of 0.80 compared to 0.71, respectively. GF in the survival tree models was predicted as a survival hazard ratio (HR), with the overall graft survival of the study group used as the reference value. The survival tree algorithm identified an episode of acute rejection within the first year after KT as the most deterministic node for predicting GF (Fig. 3(B), setting 2), associated with a 4.27-fold increase in the risk of GF compared to the overall GF rate of the study group. The 3-month serum creatinine level and the age of the recipient were also significantly associated with allograft survival. Among patients with a normal serum creatinine level (<1.65 mg/dL) who did not experience an episode of graft rejection in the first year post-transplant, recipient age was the next predictive node of graft survival. Among patients who did not experience an episode of rejection over the first year after KT, a 3-month serum creatinine level >1.65 mg/dL was predictive of GF (HR, 3.01). Age was an independent risk factor for GF, with an HR of 2.39 for patients over the age of 59.5 years, even in the absence of an episode of rejection.
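A minimal sketch of a survival tree and its concordance in R is shown below; rpart's exponential-model splitting is one standard implementation of a survival split rule, and the data frames and column names are assumptions, not the study's actual routine.

```r
# Sketch: a survival decision tree via rpart's exponential-model splitting, with the
# index of concordance computed on a held-out set. All names are hypothetical.
library(rpart)
library(survival)

fit_stree <- rpart(Surv(follow_up_months, graft_failure) ~ acute_rejection_1yr + cr_3mo +
                     recipient_age + hla_b_mismatch + donor_age,
                   data = kt_train, method = "exp")

risk <- predict(fit_stree, newdata = kt_test)   # predicted event rate relative to the cohort
concordance(Surv(follow_up_months, graft_failure) ~ risk, data = kt_test, reverse = TRUE)
```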
Results for the application of the conventional Cox survival model for settings 1, 2, and 3 are summarized in Table 6.

Table 3. Performance of the prediction model by conventional decision tree using different model settings. *Test performance is presented as the concordance index for time-to-graft-failure data. Model setting 1: use all attributes, with multiple imputation. Model setting 2: use all attributes except DSA, with/without imputation. Model setting 3: use all attributes except 3-month follow-up data, with imputation. The test ratio was fixed at 0.3. 'Use One: False (F)': if the follow-up period is shorter than the period to be predicted in the classification, the case was excluded from the training process. 'Use One: True (T)': a patient who experienced graft failure is used as a positive example even though the follow-up period is short. The weighting method of Zupan et al. is used when the follow-up period is short, with both positive and negative examples used but with different weights 6.
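As a rough illustration of the conventional Cox model, together with the ridge and lasso variants referred to above (not the study's actual code; data frames and covariate names are hypothetical):

```r
# Sketch: a conventional Cox model and penalized (ridge/lasso) Cox models for comparison.
# 'kt_train', 'kt_test', and the covariates are hypothetical.
library(survival)
library(glmnet)

fit_cox <- coxph(Surv(follow_up_months, graft_failure) ~ acute_rejection_1yr + cr_3mo +
                   recipient_age + donor_age + hla_b_mismatch, data = kt_train)
concordance(fit_cox, newdata = kt_test)   # index of concordance on the test set

# Ridge (alpha = 0) and lasso (alpha = 1) Cox regression with a cross-validated penalty
x <- model.matrix(~ . - follow_up_months - graft_failure, data = kt_train)[, -1]
y <- cbind(time = kt_train$follow_up_months, status = kt_train$graft_failure)
fit_ridge <- cv.glmnet(x, y, family = "cox", alpha = 0, nfolds = 10)
fit_lasso <- cv.glmnet(x, y, family = "cox", alpha = 1, nfolds = 10)
```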

Discussion
Machine learning forms the basis of data mining and provides a robust methodology to extract information from extremely large databases 5,6. This methodology is particularly well suited to the field of transplant medicine, where numerous factors influence clinical outcomes [5][6][7]. Our study has several major strengths that underline the feasibility of using machine learning approaches to predict graft survival after KT. First, we applied a range of widely used machine learning methods (survival decision tree, bagging, random forest, and ridge and lasso) to a large dataset obtained from three Korean centers with expertise in KT. Second, we maximized the use of censored patient data, with survival statistics providing meaningful clinical information (Fig. 3) [24][25][26]. Third, we prioritized factors predictive of allograft survival. Using tree modelling with survival statistics, we identified acute rejection, confirmed by renal biopsy, during the first year after KT as the greatest risk factor for graft loss, regardless of serum creatinine levels (adjusted HR, 4.27). Among patients who did not experience an episode of acute rejection within the first year after KT, the 3-month serum creatinine level was the most important risk factor for graft loss (adjusted HR, 3.01; Fig. 3). These predictive factors were identified by considering post-transplant laboratory findings in combination with 33 attributes describing the pre-transplant status of recipients and the characteristics of donors (Table 5).
Although the association between biopsy-proven acute rejection (BPAR) and early graft loss is well known 4, as is its potential impact on long-term graft survival regardless of kidney function, there is controversy as to how much influence a histologically confirmed rejection episode has on graft survival [27][28][29]. Several studies have evaluated the effect size of a rejection episode on long-term graft survival, and the debate has persisted as to whether early (from 3 months to 2 years post-transplant) acute rejection is a cause of long-term graft failure [30][31][32][33]. Of note, the most recent data on GF post-transplantation identified an association between acute rejection in the first six months after KT and death-censored GF among 83,008 deceased donor kidney transplant recipients and 48,399 living donor kidney transplant recipients, adjusting for recipient characteristics and the kidney donor profile index (KDPI) for living donors 33. Therefore, acute rejection in the first six months post-transplant substantially increases the risk of early and late graft loss 33. The persisting debate on the effects of early acute rejection on long-term graft survival may be due to differences in how the rejection episode is defined between studies, as well as differences in the characteristics of the study populations. For example, previous studies that reported a negative association between early acute rejection and graft loss limited their analysis to rejection events sustained within the first 3 months post-transplant 34,35.
In addition, the initial diagnosis of graft rejection in those studies was made clinically, without confirmation by biopsy 35. Studies conducted in Korea have defined early acute rejection as an event that occurs up to 1 year post-transplant 30. Among patients who did not develop transplant rejection within 1 year in our study group, the serum creatinine level at 3 months post-transplant was the most important predictor of graft loss, with the age of the recipient and the length of observation from the time of transplant being additional risk factors (Fig. 3(B)). The predictive value of serum creatinine or changes in the estimated glomerular filtration rate for the long-term survival of a kidney graft has been debated [36][37][38][39]. Recently, using data from the Australia and New Zealand Dialysis and Transplant Registry, Clayton et al. identified a 30% decrease in the glomerular filtration rate (GFR) at 1 year post-transplant, relative to the baseline obtained immediately after KT, as the best predictor of graft survival, in comparison with a 2-fold increase in serum creatinine and the absolute glomerular filtration rate 40. Our study supports these findings, with early graft rejection and the 3-month serum creatinine level being predictive of long-term graft survival.
An important precaution when using a machine learning approach with an observational cohort is the prevalence of censored observations. Such observational data are not easy to handle, as they do not provide deterministic information with which to predict final patient outcomes. Figure 4 shows examples of censored data. When predictor variables are present from the start of observation, regression modelling can be applied accurately, as for the observations presented in Fig. 4(A and C), whose events occurred before the last observation. In contrast, observations (B) and (D) in the same figure are problematic because of the lack of specific knowledge on when the event will occur in the future. This problem extends to classification and makes accurate prediction difficult [24][25][26]. To overcome these problems, classification can proceed using censored data, with survival statistics applied to various machine learning techniques with different attribute settings (Fig. 1) 26.

In addition, since graft rejection is a time-varying event, there are inevitable limitations to using conventional statistical methods, such as Cox regression. Because of the characteristics of the retrospective data, all covariates were treated as time-fixed variables reflecting baseline characteristics, except for the time point of the acute rejection. To address this issue, we implemented the following conditions in our predictive modelling. First, if the follow-up period of a patient was shorter than the period to be predicted in the classification, the patient's data were excluded from the training process, identified by the category "Use One: FALSE" in Tables 3, 5 and 6 (condition 1). If an acute rejection event was not identified over a follow-up period >1 year, the rejection event was coded as '0' (condition 1-1). For cases in which a rejection event was observed, the event was coded as '0' if it occurred within the first year post-transplant and as '1' if it occurred at >1 year post-transplant (condition 1-2). All other 'FALSE' events were coded as '2' (not treated). In cases in which the patient experienced GF despite a short follow-up period, the event was categorized as "Use One: TRUE" in Tables 3, 5 and 6 (condition 2). If the acute rejection event associated with the 'TRUE' graft failure event occurred at >1 year of follow-up, the event was coded as '0' (condition 2-1), and as '1' if the acute rejection occurred at <1 year post-transplant (condition 2-2). All other 'TRUE' events were coded as '2' (not treated).

In our study, we applied these techniques to three types of tree-based models: survival decision tree, bagging, and random forest models [41][42][43]. Survival tree modelling (Table 5) has been shown to have higher predictive power than existing conventional decision tree models, increasing the reliability of the identified factors in predicting clinical outcomes 26. Moreover, we compared the performance of our decision tree models with a standard Cox regression analysis in Tables 3 and 6. Compared to Cox regression, decision tree modelling has the advantage of clarifying the classification and prediction processes, represented by the inference rules of the tree structure. Several prior studies have attempted to predict the survival of kidney grafts using conventional decision tree models [15][16][17]44.
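A hedged R sketch of one reading of these event-coding conditions is given below; the function name, arguments, and time units are assumptions for illustration, and the exact coding used in the study may differ.

```r
# Sketch: one reading of the 'Use One' event-coding conditions described above.
# All names are hypothetical; times are in months; rejection_time is NA if no rejection was seen.
code_rejection <- function(follow_up, rejection_time, horizon, use_one) {
  if (!use_one) {
    if (follow_up < horizon) return(NA_integer_)        # condition 1: excluded from training
    if (is.na(rejection_time)) {
      if (follow_up > 12) return(0L)                    # condition 1-1: >1 yr follow-up, no rejection
      return(2L)                                        # other 'FALSE' events: not treated
    }
    return(if (rejection_time <= 12) 0L else 1L)        # condition 1-2
  }
  # 'Use One: TRUE': a graft-failure case kept as a positive example despite short follow-up (condition 2)
  if (is.na(rejection_time)) return(2L)                 # other 'TRUE' events: not treated
  return(if (rejection_time > 12) 0L else 1L)           # conditions 2-1 and 2-2
}
```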
Based on the data of 92,844 KT recipients, Krikov et al. predicted the 10-year graft survival with high accuracy (AUROC, 0.90) 15. However, the researchers did not identify the factors associated with graft loss. Focusing on recipients of grafts from living donors at a single center with KT experience in Egypt, Fouad et al. also compared decision tree classifiers to regression analysis 17. However, their comparison was limited by insufficient information regarding the model attributes, the inclusion of pre-transplant information only, and a relatively small study group. In our study, we also applied conventional decision tree modelling to predict the 10-year graft survival with a weighting method (Tables 3 and 4). A 3-month serum creatinine level >1.65 mg/dL was associated with a GF rate of 77.8% (Fig. 2). The next most important predictive factor was the age of the recipient at the time of transplant, with age >57.5 years being associated with a 71.5% rate of GF, despite serum creatinine levels <1.65 mg/dL. For recipients <57.5 years old, an HLA-B mismatch was associated with a 10-year GF rate of 6.2%. In this group, donor age was also an important predictor of graft survival, with more favorable outcomes identified when grafts were obtained from young donors. A donor age >35 years, combined with hyponatremia at 3 months post-KT, was associated with a GF rate as high as 59%. It is well known that the effects of HLA-DR and HLA-B mismatch on graft outcome are considerable, while an HLA-A mismatch does not produce a significant effect 45,46. In our study, the risk associated with HLA-B mismatch was higher among younger patients than that associated with HLA-DR mismatch. Further research is needed to clarify the biological mechanism of HLA-DR mismatch-associated graft rejection.
The limitations of our study need to be acknowledged. First, donor and recipient ethnicity was not considered in our models. Previous studies from western countries have included Asians living in western countries, and it is likely that environmental differences influence clinical outcomes for patients of the same ethnicity living in different countries. Moreover, the characteristics of Asian individuals are likely to vary even between middle and eastern regions of Asia (e.g., Japan, China, and Korea). Overall, the inclusion of Asian populations has been quite limited in studies of kidney transplantation; in fact, in the landmark trial of transplant outcomes by Wolfe et al., only 3.9% of the study sample was Asian 47. We tried to overcome these limitations by using a study sample that is representative of Asians who live in Asia, minimizing the effects of environmental, socio-cultural, and racial differences. The second limitation is the high proportion of missing donor-specific antibody (DSA) values, even though we applied different model settings to address this issue (Fig. 1). Moreno-Gonzalez et al. presented an excellent retrospective study of an allograft prediction model using a combination of alloantibody and one-year surveillance data 48, which included DSA information at 1 year for 42% of their cases, compared to 18% in our study. Adding this information (biopsy rejection score at 1 year post-transplant and DSA) to the model for death-censored graft loss increased the predictability (C statistic, 0.90). Shabir et al. provided a more detailed method for predicting outcomes, with external validation against data from a multicenter cohort 49. This study showed good-to-excellent discrimination for death-censored transplant failure (C statistics, 0.78-0.90) using data at 12 months post-transplant to develop their model. The absence of external validation is a major limitation of our study.
Despite these limitations, our survival decision tree model improved the accuracy of predicting GF (index of concordance, 0.80) over conventional decision tree and Cox regression models. As such, our model may be clinically relevant for identifying patients with a specific group of risk factors associated with graft loss and, thus, may enable clinicians to predict outcomes of KT. Moreover, we provide evidence supporting the application of advanced machine learning techniques, using survival statistics, in the field of KT. Using this approach, we prioritized risk factors for graft failure, with acute rejection, confirmed by renal biopsy, within 1 year of KT being the most important risk factor for graft failure. We also emphasize the importance of the post-transplant serum creatinine level as a predictive factor. Our findings will assist clinicians in predicting outcomes of KT.

Methods
Study participants. Our study cohort included 3,117 adults (>18 years old) who had undergone KT at three tertiary care hospitals with expertise in KT in Korea between 1997 and 2012. Our analysis included >50 attributes extracted from the 3,117 medical records of recipients and donors, which included complete records for 2,017 recipients. The institutional review board at each hospital approved our methods (Institutional Review Board No. H-1409-086-609), and the study was performed according to the ethical standards of the Declaration of Helsinki. This was a retrospective, non-interventional study, and written informed consent was waived owing to the study design.
Attributes used for modelling. The following recipient-related factors were extracted from the medical records for analysis: age; sex; smoking; primary cause of end-stage renal disease (ESRD); body mass index (BMI); ABO type and ABO-incompatible transplant; infection history within 1 year after transplant; type of maintenance immunosuppressant (cyclosporine versus tacrolimus); and a history of pre-transplant ischemic heart disease, peripheral vascular disease, diabetes mellitus, hypertension, and hepatitis B or C virus infection. The duration of observation, measured from the time of transplant to the endpoint of follow-up (23 March 2015), was included as an attribute, as it reflects the transplant era.
The following pre-transplant laboratory values were also obtained for analysis: serum creatinine, hemoglobin, white blood cell (WBC) count, albumin, cholesterol, low-density lipoprotein (LDL), and high-sensitivity C-reactive protein. The following pre-transplant immunological factors were also included in the analysis: donor-specific antibody and panel reactive antibody titer. The following laboratory findings obtained at 3 and 6 months post-transplantation were also included: serum creatinine, serum sodium, total CO2, calculated anion gap, and blood levels of cyclosporine and tacrolimus. The following donor-related factors were included in the analysis: age, sex, BMI, donor type, cytomegalovirus IgG titer, mismatch of human leukocyte antigen (HLA) A, B and DR, and cross-matching results. Among all factors considered, 33 independent attributes likely to influence graft survival were selected for inclusion in our prediction models, using individual and ensemble learners with multiple imputation (Tables 1 and 2).
Modelling process. We created training sets for machine learning using 80% of the overall dataset. We used 10-fold cross-validation in the training process, with the performance of the trained model evaluated on the remaining 20% of the overall dataset. We used five different seeds to measure the final performance, with the index of concordance considered the main criterion for evaluating model performance. Two widely used individual learning models (classification and regression trees and logistic regression) 24 were applied as baseline models.

Individual learners. The decision tree, also called classification and regression trees (CART), is a conceptually simple but statistically powerful approach 25. As decision trees are more expressive than other models in terms of their classification process, they are easier to implement and interpret than many other machine learning algorithms and are applicable to non-linear datasets. The CART formulation forms a binary tree and minimizes the training error in each leaf. CART uses the Gini coefficient, which estimates the purity of the internal nodes, to choose the best splitting variable. Tree models represent data by a set of binary decision rules 25. Logistic regression is based on the logistic function applied to a linear combination of the independent variables and is formulated as π(x) = 1/(1 + e^(−β^T x)), where β^T x = β0 + β1x1 + β2x2 + … and π(x) is the probability p(y = 1|x) that the dependent variable (y) is of class 1, given the independent variables (xi) 25. We used the generalized linear model (GLM) library in R to fit the logistic regression model 41. The GLM for baseline modelling used a binomial distribution for the response with a logit link function 41.
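As a rough illustration of this modelling process (not the study's actual code), the R sketch below shows an 80/20 split, a 10-fold cross-validation control, and the two baseline individual learners; the data frame `kt`, its columns, and the package choices are assumptions.

```r
# Sketch of the modelling process: an 80/20 split, 10-fold cross-validation, and the two
# baseline individual learners (CART and logistic regression). 'kt' and its columns are hypothetical.
library(rpart)
library(caret)

set.seed(1)   # one of several seeds; performance would be averaged over five seeds
idx      <- createDataPartition(factor(kt$graft_failure), p = 0.8, list = FALSE)
kt_train <- kt[idx, ]
kt_test  <- kt[-idx, ]

ctrl <- trainControl(method = "cv", number = 10)   # 10-fold cross-validation

# Baseline 1: CART with the default Gini split rule
fit_cart <- train(factor(graft_failure) ~ ., data = kt_train, method = "rpart", trControl = ctrl)

# Baseline 2: logistic regression (binomial GLM with a logit link)
fit_glm <- glm(graft_failure ~ cr_3mo + recipient_age + donor_age + hla_b_mismatch,
               family = binomial(link = "logit"), data = kt_train)
```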
Ensemble learners. Ensemble methods classify data by combining the results of multiple learners, with the aim of improving the predictive performance of a given statistical learning model or fitting technique. Bagging and random forest are different ensembling techniques. Bagging is an acronym for bootstrap aggregating 42; it builds predictive models from repeated bootstrap samples of the training dataset and aggregates the resulting predictors. For aggregation, the average is used for regression models and a plurality vote for classification models 42. We chose CART as the base learner among various algorithms. The random forest algorithm adds more randomness to bagging: it also uses a decision tree as the base learner, but it randomly chooses a specific number of attributes at each node and finds the best split among these attributes 43.

Module structure. Conventional prediction models are inherently limited by the censored data used with individual and ensemble learners (Fig. 4). Typically, this limitation is addressed by omitting these data, which often results in an insufficient follow-up period for prediction modelling. An alternative to omission is treating censored data as non-recurring samples (classification) or treating their follow-up time as survival time (regression). However, both of these solutions introduce bias, and this biasing effect is large when the rate of event occurrence is low. To avoid such biasing effects, we included all censored data in our models using survival statistics (Fig. 4) 26.
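A hedged sketch of the two ensemble approaches applied to this kind of data, using a bagged CART classifier and a random survival forest for the censored outcome, is given below; the package choices and variable names are illustrative assumptions, not the study's implementation.

```r
# Sketch: bootstrap-aggregated CART classifiers and a random survival forest.
# Package choices and column names are assumptions.
library(ipred)            # bagging of rpart (CART) base learners
library(randomForestSRC)  # random forest for right-censored survival data
library(survival)

# Bagging: 50 bootstrapped CART classifiers combined by plurality vote
fit_bag <- bagging(factor(graft_failure) ~ ., data = kt_train, nbagg = 50, coob = TRUE)

# Random (survival) forest: a random subset of attributes is tried at each node
fit_rsf <- rfsrc(Surv(follow_up_months, graft_failure) ~ ., data = kt_train,
                 ntree = 500, importance = TRUE)
fit_rsf$err.rate[fit_rsf$ntree]   # out-of-bag error; for survival forests this is 1 - C-index
```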
Decision tree using survival statistics. A general decision tree is built by recursively finding a split rule, such as the Gini index or entropy, that lowers the impurity of the classified patient outcomes. A survival decision tree instead uses a survival statistic as the split-rule criterion, which can be formulated as follows: let c_i denote the observed event count and t_i the observed follow-up time for observation i. Within a node, the event rate is estimated as λ = Σ_i c_i / Σ_i t_i, and the node deviance is D = Σ_i 2[c_i log(c_i/(λ t_i)) − (c_i − λ t_i)]. The improvement of a candidate split is D_parent − (D_left + D_right). The goal is to maximize the improvement for each split, which is equivalent to minimizing the within-node deviance. Based on this process, we can identify the tree model that maximizes the likelihood of a specific classification at a specific node and, thus, optimizes the fit of the data 26.
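A small R sketch of the within-node deviance and split improvement described above is given below; the function names and the example variables are hypothetical and are intended only to illustrate the split rule under these assumptions.

```r
# Sketch: within-node deviance under an exponential event-rate model and the improvement
# of a candidate split, as described above. 'events' (c_i) and 'time' (t_i) are hypothetical vectors.
node_deviance <- function(events, time) {
  lambda <- sum(events) / sum(time)                    # within-node event rate
  # deviance contribution; the c_i * log(c_i / (lambda * t_i)) term is 0 when c_i = 0
  term <- ifelse(events == 0, 0, events * log(events / (lambda * time)))
  sum(2 * (term - (events - lambda * time)))
}

split_improvement <- function(events, time, goes_left) {
  node_deviance(events, time) -
    (node_deviance(events[goes_left], time[goes_left]) +
       node_deviance(events[!goes_left], time[!goes_left]))
}

# Example: improvement of splitting on a hypothetical 3-month creatinine cut-off of 1.65 mg/dL
# improvement <- split_improvement(kt$graft_failure, kt$follow_up_months, kt$cr_3mo < 1.65)
```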