A Donor Quality Index for liver transplantation: development, internal and external validation

Organ shortage leads to using non-optimal liver grafts. Thus, to determine the graft quality, the Donor Risk Index and the Eurotransplant Donor Risk Index have been proposed. In a previous study we showed that neither could be validated on the French database. Our aim was then dedicated to propose an adaptive Donor Quality Index (DQI) using data from 3961 liver transplantation (LT) performed in France between 2009 and 2013, with an external validation based on 1048 French LT performed in 2014. Using Cox models and three different methods of selection, we developed a new score and defined groups at risk. Model performance was assessed by means of three measures of discrimination corrected by the optimism using a bootstrap procedure. An external validation was also performed in order to evaluate its calibration and discrimination. Five donor covariates were retained: age, cause of death, intensive care unit stay, lowest MDRD creatinine clearance, and liver type. Three groups at risk could be discriminated. The performances of the model were satisfactory after internal validation. Calibration and discrimination were preserved in the external validation dataset. The DQI exhibited good properties and is potentially adaptive as an aid for better guiding decision making for LT.


Results
The DQI. The studied variables and log-rank tests are shown in Table 1. As stated in the methods section, a Cox model with adjustment on recipient characteristics was retained to create the DQI. Several variables (Table 2) were common to the three models. The retained donor's covariates were: age; cause of death (COD); and intensive care unit (ICU) stay. Several recipient covariates were retained for adjustment: re-transplantation; being on dialysis; status at time of LT; presence of hepatitis C virus antibody; diabetes; and decompensated cirrhosis. The retained full model included all the variables present in at least two of the three models and the set of covariates retained in addition (see methods section) ( Table 3): the lowest MDRD creatinine clearance; the liver type; the MELD exception; donor and recipient's blood groups.
The liver type covariate was retained since it was present in both DRI and ET-DRI. The MELD exception and donor and recipient's blood groups covariates were also retained, but for adjustment only.
Hence the proposed DQI score was as follows: . In Table 4, the progression of the score through different combinations is presented. In our dataset, the average score was 1.83 (SD: 0.44) with a median of 1.76 and a range between 1 and 3.12.
Calibration plot. The calibration of the score was assessed. As shown in Fig. 1S (see Supplementary Figures) the score seems to slightly underestimate the probability of death/graft loss at all months tested. This underestimation was not significant for all the predicted probabilities since more than 50% of the 95 band of the calibration plot contained the first bisector.
Of note, the 95 band was wide at the higher probability of death/graft loss due to few remaining patients at risk in this category. However, despite a slight underestimation, the longer the follow-up, the lower was the underestimation of the risk of death/graft loss.

Model performance and internal validation.
As the predictive performances of the different models were close (Table 2), the covariates selection was implemented with the simplest method, namely the backward selection using Akaiké criterion. The corrected performances of the model by the bootstrap method were 0.609 (0.591-0.627) for the Harrell C-index, 0.604 (0.590-0.617) for the Gönen and Heller K statistic and 0.464 (0.408-0.520) for the Royston and Sauerbrei R D 2 . The confidence intervals were calculated using 200 bootstrap estimates. The performances of the model remained satisfactory even after correction by the optimism.
External validation. An external validation 4,6 was performed in the validation dataset, which was constituted of grafted patients in 2014. This dataset was independent of the derivation dataset. To calculate the DQI in the validation dataset, we followed the same procedures as proposed previously in the construction steps. Calibration and discrimination were evaluated through seven steps.
Comparing the databases. The comparison of the datasets appears in Table 5. The distribution of the DQI from the derivation and validation datasets is presented in Fig. 2S (see Supplementary Figures). It appears that the DQI is higher in the validation dataset than in the construction dataset. This might be due to older donors and more frequent strokes.
Regression on prognostic index (PI). The slope on the PI, i.e. ln(DQI), was 0.88 (SD = 0.29). The slope was not different from 1 (P-value = 0.68, likelihood-ratio test); thus, the discrimination was preserved in the validation dataset.
Check model misspecification/fit. Since the estimation of the β* coefficients were not different from 0 (p = 0.64), no adjustment on the DQI covariates was needed.
We tested if there was a lack of fit for the adjustment covariates, i.e. recipients' covariates. The β ⁎ adjustment were different from 0 (p < 0.01), an adjustment on the adjustment covariates was needed.
The assumption of proportional hazards was verified for all covariates. Kaplan-Meier curves for groups at risk. Failure free survival per DQI categories for derivation and validation datasets is given in Fig. 3. The follow-up in the derivation data was truncated to 800 days.
First, Kaplan Meier curves were well separated. Then, the risk groups were discriminative. Second, Kaplan Meier curves from derivation and validation data were close. For greater accuracy, we then plotted three survival curves for each risk group with confidence intervals in derivation and validation datasets (Fig. 3). Validation  curves are within the confidence interval of the derivation dataset curves (except for group 1: from 300 to 500 days and the tail of group 2). The apparent calibration thus seems to be preserved.
Hazard ratios between groups at risk. HRs between groups at risk are presented in Fig. 3. They were not significantly different.
Calibration. The Kaplan Meier curves (green, Fig. 3, see "calibration") seem to underestimate survival. The predicted survival curves in the two datasets (red and black) are superimposed for the first risk group, very close for the second risk group and more separated for the third group. Then, there is some similarity in the PI distribution in each risk group in the derivation and validation datasets. The empirical cumulative distribution functions of the PI by dataset and risk group, on

Discussion
A DQI, which qualifies the liver graft, is one of the key factors for a better matching between donor and recipient. As the existing DRIs were not discriminant (i.e. slope on the PI: 0.57 (SE 0.15) and 0.64 (SE 0.16) for the DRI and ET-DRI respectively, P-value < 0.001) and miscalibrated (e.g. survival, according to DRIs categories, was not consistent between the construction and validation datasets) according to our dataset 4,5 , because of population differences. We thus decided to create a DQI, which characterizes the graft in view of transplantation. It is based on Cox model adjusted on the recipients' characteristics. Three different ways to elaborate the model were tested. We explored new covariates that were not taken into account in the DRI and ET-DRI, defining the donor such as MDRD clearance or adjustment covariates such as MELD exception, decompensated cirrhosis or HCC. Five donor covariates were retained in the final model: age, cause of death, intensive care unit stay, lowest MDRD creatinine clearance, and liver type (split or total). We obtained three groups of different sizes (Fig. 2). Of note, there is no consensus in the literature regarding selecting the number of risk groups or in positioning the cut-off points to delineate these groups 10 . Too many groups could be unstable and consequently discrimination becomes insufficient. The recommendation is to create three to five groups, not necessarily of the same size, in order to highlight extreme groups 10 . After construction of the risk groups, HRs (Fig. 2), between groups, showed significant differences in mortality/graft loss. Indeed, the curves were all well separated, and the survival decreased with the increase of the score (Fig. 2). The third group will  correspond to donors at higher risk that we may define as "extended criteria donors". As shown by Collett et al. 11 , the average MELD scores for recipient in the three groups at risk (20.6, 20.8 and 19.3, respectively) were not different (p = 0.06). The performance of the model remained satisfactory even after correction by the optimism. Finally, an external validation 4,6 was performed; discrimination and apparent calibration were preserved in the validation dataset. Our study has some limitations. First, it is very difficult to evaluate the intrinsic quality of a graft since the allocation procedure associates to each donor a recipient. The quality of the graft is a function of the graft/recipient survival. We then assume that the allocation was appropriate. The graft/recipient survival provides an indirect measure of the donor quality and thus can be considered as a surrogate variable. We assumed that this variable was highly correlated to the target variable, i.e. the graft quality, and was thus able to be extrapolated as an outcome.
In prognostic models, discrimination measures are generally not high 12 which the case in our study. This occurred even when models were built in large datasets 2,13 .
A selection bias has been explored as 858 recipient/donor pairs were removed from the model due to missing data. We compared the distributions of data of this subset to the data used in the analysis either for covariates belonging to the recipient (age, sex, cancer, decompensated cirrhosis, waiting time, medical condition before LT) or donor (age, sex, COD, liver type). After Bonferroni correction, only the presence of a cancer and medical condition before LT were significant (χ 2 tests). Of note, only the covariate "presence of a liver cancer" was not part of the score. However, the frequency of liver cancer was higher in the analysis data set, which limits the occurrence of a potential bias.   As the size was part of the DRI and the weight of donors is often difficult to establish due to fluctuations in hydration during the ICU stay, we gave more importance to the size and we did not include the weight in our study. Nevertheless, we tested the weight in the final model. The weight did not provide a pertinent information to be added in the model (p = 0.82). This result is consistent with the DRI and ET-DRI models 1,2 .
Macro/micro-steatosis was not included in the model since this information was not available in the French database.
Since the cold ischemia time (CIT) is only known at the time of LT, this covariate was not retained as an intrinsic characteristic of the graft in the model, as outlined in 11,14 , even though this covariate has an impact on post LT survival, such as post-and peri-operative period covariates. These covariates are not known at the time of graft procurement and can't contribute to the creation of a score, which qualifies the donor. The aim of this work was to characterize the donor independently of the potential recipient. Of note, the distance between the organ procurement center and the transplant center was not correlated to the cold ischemia time (Pearson correlation coefficient r = 0.08), indicating that cold ischemia time is highly dependent on the logistic of transplantation. Moreover, the introduction of perfusion machines may completely reverse the paradigm of organ preservation, and thus will modify the impact of the cold ischemia time. In effect, increasing the duration of organ conservation in standardized conditions may consistently improve the chances for a patient to benefit from the most compatible graft. Whatever the case may be, cold ischemia time did not go through the three different ways of selecting the covariates. About the exploration of a potential bias, as CIT is not known at the time of graft procurement, it does not lead to a selection or misclassification bias. However, there may be confounding bias because CIT exerts an effect on the recipient/graft survival. We went back and found that this was not the case. In effect, CIT had no specific weight in the Cox model (HR: 0.9998 [0.9996-1.0001], p = 0.26), therefore we did not include it in the final model even as an adjustment covariate. Adding CIT in the final model as adjustment covariate would have decreased both the quality and discrimination performance of the model (i.e. increase in Akaiké criterion (AIC) and decrease in c-index, respectively). For example, in case-control studies, when a model is forced to include non-retrained covariates, the model often becomes overfitted. An overfitted model has a weak validity and the model would have insufficient discriminative power in new patients.
Donor age was different between the Organ Procurement and Transplantation Network (OPTN) database 1 and the French database 4,5 . Since age below 70 did not exert a significant effect on recipient survival in our  database, the first three quartiles were grouped together. Age over 70 represented 25% in the French database compared to 4.3% in the OPTN database.
For COD, 60% of patients presented a CVA, 12% an anoxia and 25% a trauma. This distribution was quite different from that observed in the OPTN database 1 , which was 44%, 9% and 45%, respectively (p < 0.001). The risk of death/graft loss was increased in patients who experienced a CVA compared to anoxia (HR: 1.35 [1.08-1.69], p < 0.01). This result was consistent with Feng et al. 1 and Singhal et al. 15 .
In three out of four stays in ICU the donor's length of stay was less than 4 days. Shorts stays in ICU were associated with an increased risk of death/graft loss. As outlined above, CVA was associated with an increased risk of death/graft loss compared either to anoxia or trauma COD. Given that the length of ICU stays of donors with a CVA was ≤4 days in more than 80% of cases (which represents 49% of the donors), it supports the fact that short stay in ICU in our series was related to a shorter recipient survival. In the literature the donor ICU stay did not appear to influence graft survival 16,17 . However, Cuende et al. 18 included 3429 LT and showed that an ICU stay of more than six days represented a moderate risk (RR: 1.21 [1.1-1.4]).
Finally, concerning the external validation, the survival among the different DQI groups was not significantly different (HRs with large confidence interval). This may be due to a lack of power given that the follow up was shorter than in the derivation dataset. We were thus truncated the derivation data follow-up. A longer follow-up period would have been beneficial to obtain a better accuracy. Nevertheless, Kaplan Meier curves were well separated (Fig. 3). Then, the risk groups were discriminative. For the calibration step, an under-estimation was observed. This may be due to the specific structure of the DQI. Indeed, this score includes only a part of the PI of the original model (the β adjustment for recipient covariates were not included). But, baseline survival was calculated on the original model, which includes all the covariates; both donor and recipient. The weights of the recipients' covariates then may play an important role in the estimation of S t ( ) deriv 0 , i.e. the baseline survival in the derivation dataset. We checked whether this hypothesis was consistent by computing the baseline survival on the derivation dataset, without the recipients' covariates (Fig. 4S, Supplementary Figures). Compared to those in Fig. 3, the predicted survival was under-estimated. It confirms that baseline survival is affected by recipients' covariates. This result therefore impacts the seventh step of the external validation. Nevertheless, according to Kaplan Meier curves (Fig. 3), the apparent calibration and the discrimination were preserved in the validation dataset.
In a recent review, Flores and Asrani 14 suggested that the donor score might benefit from being updated. We aimed to create a flexible DQI model and adaptable. Indeed, if calibration is lacking with a validation dataset comprising a longer follow-up, and if the discrimination is preserved, a re-calibration of the DQI will be possible. This type of procedure is dedicated to be repeated yearly using LT data from subsequent years. This adaptive procedure will enable to gradually take into account the modifications of the graft allocation system. This approach may be of interest for other countries. Indeed, French allocation system is quite similar to most European systems. Moreover, in France, the MELD and MELD exceptions are also used such as in the United States of America. For an international-reader, our model is a tool for decision support in countries with a high activity of harvest of brain dead organ donors, with a significant increase of elderly donors, such as in Italy, Spain or Portugal thanks to a political proactive organ census and collection of old or very old donors. We encourage these countries to make an external validation of the DQI on their own datasets. Furthermore, recently, another national index was introduced, the DLI (Donor Liver Index, which has been calculated in the French dataset (Table 1)) 11 . This score is an example that a national index gives complementary information to bring an aid to improving the evaluation of the quality of grafts.
This DQI allows considering a next step to explore the optimal matching between donor and recipient and the "extended criteria donor" graft attribution, to improve the success of LTs, through sequential stratification method 3 . The DQI will be used to qualify the graft. We also aim to create a score based on the survival benefit, which combines two models: a pre-transplantation model and a post-transplantation model. The post-transplantation model takes into account the DQI and integrate the results of the sequential stratification study. A validation will be performed using a discrete-event simulation model.
More, use of machine perfusion protocols are starting this year in France. Until now, the only source of grafts are brain dead donors. These perfusion machines will likely allow rehabilitation of critical grafts and also will offer viability tests, which may eventually be compared to the DQI to test its relevance. Machine perfusion are quite expensive, where such a prognostic score may be of interest to target grafts and who would benefit the most from the infusion.

Methods
In order to simplify the interpretation of the model, and moreover to protect against outliers, we transformed the quantitative variables into qualitative variables. When there was no clinical threshold generally accepted (contrary to biological covariates such as alkaline phosphatases or Modification of the Diet in Renal Disease (MDRD) creatinine clearance for which thresholds are recommended) we defined four groups according to the quartiles of the covariate distributions. Graft survival curves were plotted according to the product limit method. They were then compared using hazard ratios (HR). In the absence of significant difference between groups, they were re-grouped together. If no groups at risk were identifiable for a given covariate, then this covariate was kept in the model for adjustment as a quantitative variable (or its natural logarithmic form, when appropriate), according to the three models presented below.

Definitions of covariates.
In France, all donors are cared for in intensive care units (ICU). The ICU length of stay is based on the number of nights spent, rather than the number of days. Therefore, a stay of 0 day is observable. For recipient, decompensated cirrhosis was identified in the database using the MELD score and the Child-Pugh score. All recipients presenting with cirrhosis of the liver, a MELD score ≥16 and a Child-Pugh B or C were considered as decompensated cirrhosis.
Patients first-transplanted, without cancer and cirrhosis, were considered as having a non-cirrhotic liver disease.
The estimated distance between donor and recipient centers was calculated on the basis of a geographic model taking into account road distances in minutes.
MELD exceptions were identified and resulted in extra points while on the waiting list 19 . The variable named "Hors tour" means that after at least five consecutive refusals by the transplant teams, the graft is then supplied to a transplant team who have identified an appropriate candidate. Score creation. In the derivation dataset, as in Feng's 1 and Braat's 2 studies we used a Cox model with adjustment on recipient characteristics to create the score (Table 1). We tested three different ways to elaborate this model: (1) A complete model with a covariate selection according to AIC; (2) An analysis using a two-step procedure. First, performing log rank tests with a threshold of 20% for the type I error for each covariate, according to the outcome. Then, followed by a multivariate model that included the selected covariates on the basis of the previous step with a selection threshold set at 20% for the type I error.
(3) An analysis using the same first step of the procedure above was performed but followed by two multivariate models for donor and recipient, respectively. These models included the selected covariates on the basis of the first step with a selection threshold set at 20% for the type I error. Finally, a multivariate model was set up with all the covariates selected in the previous models with a selection threshold set at 20% for the type I error.
We then retained a model including all the covariates present in at least two of the three models. Moreover, in this model we also tested each covariate, which appeared in only one of the three models. Covariates were retained if their P-values were lower than 5%. Then, the final model included the set of covariates present in two of the three models and the set of covariates retained in addition. This approach prevented omission any relevant covariates.

Risk groups.
To create risk groups, we used 10 groups according to the deciles of the score. We drew survival curves according to the corresponding Kaplan Meier estimates, which were compared according to the HR. In the absence of significant difference between groups, they were thus grouped together.

Calibration plot.
A calibration plot reports graphically predicted outcome probabilities (on the x-axis) against observed outcome frequencies (on the y-axis) 20 . A well-calibrated prediction implies that the curve lies on the first bisector. Thus, for each point, the predicted probability is equal to the observed outcome frequencies. In a Cox model, calibration may not be easily assessed 20 . The model allows estimation of relative risk differences between patients presenting with different characteristics. However, since it does not estimate the baseline survival function, it does not estimate absolute risks (event probabilities), in contrast to parametric models 6 . However, absolute risks can be calculated by focusing on a fixed time point (e.g., risk at month 3). Thus, we plotted three calibration plots at months 3, 6 and 12. We calculated the corresponding baseline survival rates, estimated according to Breslow's estimator on the complete model 21 . Then we calculated the predicted probabilities of death/graft loss 20 as: where S 0 (t) is the baseline survival, β i the donor coefficients of the Cox model and X i the donor covariates. Note that the estimated parameters of the score did not change according to the baseline survival. Coefficients are those estimated at the previous step.

Model performance and internal validation.
To evaluate the model performance, we calculated several indices: the C-index of Harrell 22 , the K statistic of Gönen and Heller 23 and the R D 2 of Royston and Sauerbrei 24 . Harrell's C-index is defined as the proportion of all usable patient pairs for which the predictions and outcomes are concordant 22 . Gönen and Heller's K statistic is used to evaluate the discriminant power and the predictive accuracy of nonlinear statistical models. It is a function of the regression parameters and the covariates distribution only, and is therefore asymptotically unbiased 23 . Royston and Sauerbrei's R D 2 is a measure of the proportion of explained variation, based on D, a measure of the ability of a model to discriminate between good and poor patient outcomes 24 .
Internal validation is a necessary part of model development 25 , although according to Moons 20 , internal validation quantifies the predictive ability of a model on the derivation data (often called apparent performance [26][27][28] ). Internal validation was assessed using a bootstrapping procedure in order to quantify the optimism associated with the performance of the model. A seven-step procedure was performed according to the method described in Moons 20 . The confidence intervals of the 3 indices were calculated using 200 bootstrap estimates. External validation. We performed an external validation of the DQI according to Royston and Altman 6,24 ; the method described in a previous study 4 . The external validation of a model explores the assessment of its performance on an independent database 6 . Model performance was evaluated considering two fundamental aspects: discrimination and calibration. Discrimination, known as "separation", allows differentiation of patients' prognoses through risk estimates from the model. Calibration reflects the prediction accuracy; if well-calibrated, a score assigns the appropriate probability at each level of the predicted risk 6 .
Comparison of the datasets. A comparison of the construction dataset with the validation dataset was performed using χ 2 tests or ANOVA, when appropriate.
Regression on the prognostic index (PI) in the validation data. In order to obtain the prognostic index (PI), we used ln(DQI). The model fitted was as follows: DRI a djustment adjustment with X adjustment the covariates used in the DQI and β adjustment fixed at the estimated value in the original model. The discrimination is considered good when this coefficient is equal to 1 and poor if the slope is lower than 1; a coefficient >1 is considered very good. In order to test β DQI = 1, we used a likelihood ratio test.
Check model misspecification/fit. A possible reason for a PI coefficient less than 1 is poor adjustment of one or more covariates. To test whether one or more of the DQI covariates needed an adjustment we fitted:

DQI a djustment adjustment
where Z were the covariates used to build the DQI. In this model, the β DQI was set at 1, and the β adjustment was fixed at the estimated value in the original model. Next, we performed a likelihood ratio test to test the following equation: β* = 0. The proportional hazards risk assumption was checked using the Schoenfeld residuals.
Measures of discrimination. As in the construction step we calculated three discrimination indexes: Harrell C-index 22 , Gönen and Heller K statistic 23 , and Royston and Sauerbrei R D 2 24 .
Kaplan-Meier curves for groups at risk. We plotted the survival curves according to the groups at risk using the Kaplan Meier estimates. We also did a visual comparison of these curves with those of the construction dataset to evaluate the calibration. Indeed, if the survival curves of the construction and validation data for each group at risk were superimposable, then the visual calibration was considered as preserved. Discrimination can also be evaluated from the curves; the more the survival curves are separate the better the discrimination.
Hazard ratios (HR) between groups at risk. HR were estimated for each group at risk using a Cox model. The higher the discrimination, the larger the HR.
Calibration. In this step, we estimated the baseline survival in the derivation: S t ( ) deriv 0 data, using Breslow's estimator 21 . We then calculated the estimated survival function in the validation dataset as: = S t DQI S t ( , ) () val i deriv DQI 0 i . For each group at risk, we averaged S t ( ) i val , at each time-point, to obtain the expected survival curves. The same procedure was performed for S t DQI ( , ) deriv i . We then superimposed the predicted survival curves for each risk group in the derivation and validation dataset, and the Kaplan Meier curves observed for each group in the validation dataset.
All analyses were performed using R software, version 3.3.0 (R Development Core Team, A Language and Environment for Statistical Computing, Vienna, Austria, 2016. https://www.R-project.org/).

Data availability.
The data that support the findings of this study are available from the French Agency for Biomedecine but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. According to the French regulation, upon reasonable request, data are however available with permission from the Agency for Biomedicine.