Introduction

Kidney cancer is showing an increasing incidence worldwide, with 431,288 new cases in 2020 (refs. 1,2). It is responsible for significant mortality, with almost 180,000 deaths3. The growing number of imaging procedures performed each year is leading to an increase in the diagnosis of renal cell carcinoma (RCC) at localized stages, for which surgery is the standard of care4,5,6.

The risk of recurrence after surgery is substantial, with rates varying from 20 to 50% at 5 years, depending on the stage7,8,9. In the absence of a consensus, current European recommendations for surveillance are based on prognostic scores that offer only moderate predictive performance4,10. These recommendations suggest regular CT scans for 5 to 10 years depending on the patient’s prognostic class.

In this context and as we enter the era of personalized medicine, it becomes increasingly important to accurately predict the individual risk of kidney cancer recurrence after surgery. This would enable the identification of high-risk patients who could be considered for adjuvant treatments, as well as low-risk patients for whom reduced surveillance and radiation exposure are possible.

Machine learning (ML) can analyse large datasets and may predict outcomes more accurately than traditional tools. In healthcare, there is a wealth of clinical, biological, pathological, and imaging data available. The exponential growth in information makes analysis and interpretation difficult using traditional statistics, while the ML approach offers promising perspectives and is already being used in many medical specialties11,12,13.

Our aim is, therefore, to use ML on real-world data from a large prospective cohort of patients to propose an individual prediction of the recurrence risk after surgical management of localized or locally advanced kidney cancer.

Results

Cohort description

A total of 3372 patients managed for localized or locally advanced RCC were included. The median age was 62 years (IQR 52–69), and 2319 (69%) patients were male. The median tumor size was 4 cm (IQR 2.8–6.2), with a majority (66%) of pT1 and 71% of clear cell RCC (ccRCC). Baseline patient and tumor characteristics are reported in Table 1.

Table 1 Patient and tumor characteristics

The median follow-up, defined as the median of the intervals between surgery and censoring or death, was 30 months. Four hundred and eighty patients (14.2%) experienced an event during follow-up (122 locoregional recurrences, 270 metastatic progressions and 88 deaths). The estimated median DFS was 12 years (95% CI 9.2–not reached), and the 5-year DFS probability was estimated at 72.9% (95% CI 70.2–75.5%).

The training dataset consisted of 2241 patients (66%) from ten centers, and 1131 patients (34%) from 13 other centers were assigned to the test dataset. Baseline patient characteristics, missing data rates and DFS curves were similar between the training and test cohorts (Table 1 and Fig. 1).

Fig. 1: Kaplan–Meier estimates of disease-free survival stratified by train and test datasets.

The DFS curves of the training cohort (in blue) and of the test cohort (in yellow) were similar (p = 0.67).
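As an illustration of how a disease-free survival curve such as those in Fig. 1 can be estimated, the following minimal Python sketch implements the Kaplan–Meier product-limit estimator. The function name and the toy follow-up data are assumptions made for illustration only; they are not taken from the UroCCR cohort or from the authors' code.

```python
from typing import List, Tuple


def kaplan_meier(times: List[float], events: List[int]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) pairs at each distinct event time.

    times  : follow-up in months since surgery
    events : 1 if recurrence, progression or death was observed, 0 if censored
    """
    data = sorted(zip(times, events))  # order subjects by follow-up time
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0   # events observed exactly at time t
        n_leaving = 0  # subjects (events + censored) leaving the risk set at t
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1.0 - n_events / n_at_risk
            curve.append((t, survival))
        n_at_risk -= n_leaving
    return curve


# Hypothetical toy data (months, event indicator)
toy_times = [6, 12, 12, 20, 30, 45, 60]
toy_events = [1, 0, 1, 1, 0, 1, 0]
km_curve = kaplan_meier(toy_times, toy_events)
```

Censored patients contribute to the risk set until their last follow-up, which is why the curve only steps down at observed events.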

Model evaluation

The best predictive performance was achieved by combining multiple imputations for missing values and Cox proportional hazards models for time-to-event data. Once features that were not informative due to null variance, redundancy or imbalance, as well as features that were not independently associated with the outcome, had been eliminated, the final ML model included 24 clinical, pathological, and biological variables. The permutation-based importance of each feature is displayed in Fig. 2. Tumor size, histological subtype and age at surgery were found to be the most important features of the ML model.

Fig. 2: Permutation-based feature importance of the developed ML model.

ECOG Eastern Cooperative Oncology Group performance status; NLR neutrophil-to-lymphocyte ratio; ASA American Society of Anesthesiologists score; NSS nephron-sparing surgery.

This ML model demonstrated good discrimination and calibration abilities when applied to the test dataset, with an integrated AUC of 0.81 [95% CI 0.77–0.85] and an integrated Brier score of 0.11 [95% CI 0.10–0.13] (Table 2). The robustness of the signature was verified, with stable estimated predictive metrics between the training stage and the external validation stage. Both calibration and discrimination performance decreased over time, most likely because both the number of patients still at risk and the number of events observed decrease over time.

Table 2 Predictive performance of the developed ML model

Decision curve analysis (Fig. 3) highlights the clinical utility of using the ML model to predict the recurrence risk within 5 years following surgery, with higher net benefit for the ML model than the competing decisions, assuming that all patients or no patient will recur, for all threshold probabilities between 10% and 50%. For a threshold probability of 30%, the ML model achieved a net benefit of 0.10, which means that 10 additional recurrences for every 100 patients would have been identified, without increasing the number of false positive predictions.

Fig. 3: Decision curve.

Decision curve for prediction of recurrence risk within 5 years after surgery. The green curve assumes no patient will recur. The red curve assumes all patients will recur. The blue curve is associated with the use of the machine learning model. The graph shows the expected net benefit for a range of threshold probabilities. The expected net benefit corresponds to the number of patients for every 100 patients who were correctly predicted with recurrence, without increasing the number of false positive predictions. The machine learning model showed better net benefit than the competing decisions for all plausible threshold probabilities, between 10% and 50%.
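The net benefit plotted on a decision curve can be sketched as follows. This is a simplified illustration for a binary outcome (recurrence within 5 years, yes or no) that ignores censoring, whereas the study works with time-to-event data; the function names and toy values are assumptions, not the authors' implementation. The net benefit at a threshold probability p_t is the true-positive rate per patient minus the false-positive rate per patient weighted by p_t / (1 - p_t).

```python
from typing import Sequence


def net_benefit(risks: Sequence[float], outcomes: Sequence[int], threshold: float) -> float:
    """Net benefit of flagging patients whose predicted risk is >= threshold."""
    n = len(risks)
    true_pos = sum(1 for r, y in zip(risks, outcomes) if r >= threshold and y == 1)
    false_pos = sum(1 for r, y in zip(risks, outcomes) if r >= threshold and y == 0)
    return true_pos / n - false_pos / n * threshold / (1.0 - threshold)


def net_benefit_treat_all(outcomes: Sequence[int], threshold: float) -> float:
    """Net benefit of the competing decision 'all patients will recur'."""
    prevalence = sum(outcomes) / len(outcomes)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical predicted 5-year risks and observed recurrences (1 = recurred)
toy_risks = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.15, 0.80]
toy_outcomes = [0, 0, 0, 1, 0, 1, 0, 1, 0, 1]

nb_model = net_benefit(toy_risks, toy_outcomes, threshold=0.30)
nb_all = net_benefit_treat_all(toy_outcomes, threshold=0.30)
nb_none = 0.0  # 'no patient will recur' flags nobody, so its net benefit is zero
```

A model is clinically useful at a given threshold when its net benefit exceeds both competing decisions, which is the comparison reported for thresholds between 10% and 50%.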

Individual prediction and stratification into risk groups

Each individual prediction was displayed using SHAP values, with characteristics that increase the risk of recurrence in red and protective factors in blue. An example is given in Fig. 4a, in which the patient has a 5-year recurrence risk of 63% (compared with an average risk of 20% in the training population). This increased risk is explained by the clear cell histological subtype, the Fuhrman grade 4, the presence of a necrotic component and the large tumor size. The young age of the patient reduces this risk.

Patients assigned to the test cohort were stratified into four risk groups (Fig. 4b) using thresholds determined from the training cohort, achieving an iAUC of 0.79 [95% CI 0.74–0.83]. The threshold for the very low-risk group was set to include patients with a recurrence risk within 5 years lower than 10%. The resulting group represents 19% of the population, with an actual 5-year recurrence rate lower than 2% and no death observed within this time frame. The thresholds for the low- and medium-risk groups were set to obtain recurrence risks within 5 years between 10% and 22% and between 22% and 41%, respectively. These groups represent 43% of the population with an actual 5-year DFS of 83% (low-risk group) and 22% of the population with a DFS of 54% (medium-risk group). Finally, the last group isolates patients with a recurrence risk greater than 41% within 5 years, resulting in 17% of the population having an actual 5-year DFS of 49%.

Fig. 4: Interpretability tools.

a SHAP values. Individual risk of recurrence within five years after surgery explained using SHAP values for a patient. The average estimated risk in the training population (base value) is 20%. The individual risk prediction for the patient is higher, at 63%, with features in red that increase the patient’s risk of recurrence and features in blue that decrease it. b Risk group stratification. Actual disease-free survival in the test cohort (n = 1131) according to the stratified risk score. 211 (18.7%) patients were classified as very low risk, 484 (42.8%) patients as low risk, 245 (21.7%) patients as medium risk and 191 (16.9%) patients as high risk of recurrence within 5 years following surgery. The black curve represents the predicted survival curve for the patient in (a).
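The mapping from a predicted 5-year recurrence risk to one of the four risk groups can be sketched as below, using the reported cut-offs of 10%, 22% and 41%. The handling of values falling exactly on a cut-off is an assumption (they are assigned to the higher group here), and the function name is hypothetical.

```python
from bisect import bisect_right

# Cut-offs on the predicted 5-year recurrence risk, as reported in the text
RISK_THRESHOLDS = [0.10, 0.22, 0.41]
RISK_LABELS = ["very low", "low", "medium", "high"]


def risk_group(predicted_risk: float) -> str:
    """Return the risk group for a predicted 5-year recurrence risk in [0, 1]."""
    if not 0.0 <= predicted_risk <= 1.0:
        raise ValueError("predicted_risk must be a probability between 0 and 1")
    # bisect_right counts how many cut-offs are <= predicted_risk
    return RISK_LABELS[bisect_right(RISK_THRESHOLDS, predicted_risk)]


example_group = risk_group(0.63)  # the patient illustrated in Fig. 4a
```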

Comparison of predictive performance

The performance of the ML model was compared with conventional risk scores on the test dataset (Table 3). Because of incomplete data, the UISS, SSIGN, GRANT and Leibovich risk scores could be calculated for only 882 (78%), 946 (84%), 1008 (89%) and 578 (51%) patients, respectively. The machine learning model outperformed the GRANT (p < 0.001), SSIGN (p = 0.01), and UISS (p < 0.001) risk scores. Additionally, it was available for twice as many patients as the Leibovich 2018 score.

Table 3 Comparison of the predictive performance of the models in terms of calibration and discrimination

Discussion

According to the literature, 20–50% of patients with localized or locally advanced kidney cancer will develop recurrence after surgery (ref. 7). Accurate and routinely usable predictive models of this risk are therefore essential to advise patients, plan follow-up and propose adjuvant treatment. We have developed a predictive model of DFS after surgery in a multicenter NCI-HAS co-labelled cohort of patients with localized or locally advanced kidney cancer. We used clinical, biological, and pathological data available in routine practice.

The variables that were found to be of utmost importance in our predictive model are consistent with known prognostic factors and have been used in several prognostic models. Indeed, the tumor, node, and metastasis (TNM)14 classification has been one of the most used prognostic factors for years. The same applies to Fuhrman grade15 and histological subtype, which are recommended by the EAU guidelines4. Several studies have shown that patients with ccRCC have a worse prognosis than those with papillary and chromophobe RCC16,17. Performance status is recognized as an important predictor of clinical outcomes and is a common inclusion criterion in clinical trials. Finally, a meta-analysis including almost 15,000 patients showed a 2-to-3-fold higher risk of recurrence, metastatic progression, and cancer-related death in patients with vascular emboli on pathology18.

The association of inflammatory markers with poor prognosis has been demonstrated in several cancers19 and the neutrophil-to-lymphocyte ratio (NLR) is often used as a prognostic biomarker20. In kidney cancer, its predictive value has been evaluated several times21,22,23.

The UISS24, developed on a retrospective cohort of 661 patients, classifies patients with localized kidney cancer into 3 risk groups based on Fuhrman grade, ECOG score and pT stage. Its predictive value is moderate, with a c-index ranging between 0.56 and 0.72 in different external validation studies25,26,27. The SSIGN system, which integrates stage, tumor size, Fuhrman grade and the presence of a necrotic component, predicts cancer-specific survival (CSS) in patients with ccRCC, with a c-index of 0.84 in the initial cohort. However, the accuracy is somewhat lower in different external validation studies, with c-indexes ranging between 0.63 and 0.78 (ref. 26). Leibovich et al. developed three different models depending on the histological type of the tumor (clear cell, papillary or chromophobe). They used a CoxPH model on a monocentric cohort, with c-indexes for DFS and CSS of 0.83 and 0.86, respectively. Once again, the performance seems to be slightly lower in external validation studies (c-index ranging from 0.73 to 0.81; refs. 25,28). Finally, the GRANT score has been recently published. It includes Fuhrman grade, age, stage, and lymph node involvement, classifying patients into two risk groups. Its external validation revealed a low concordance score of 0.59 (ref. 29).

As the predictive performance of these models appears to be moderate, a few articles have suggested using machine learning to predict recurrence with greater accuracy. For instance, Byun et al.30 developed a model to predict recurrence-free survival (RFS) and CSS in a cohort of 2139 ccRCC patients. The best results were obtained with a DeepSurv model. Meanwhile, Kim et al.31 described a high accuracy in predicting recurrence using a Naive Bayes model in a cohort of 2814 patients. Nevertheless, the methodology used to develop and validate these models is poorly described and no individual predictions are presented in these articles.

More recently, Khene et al. 25 published a model based on a cohort of 4067 patients randomly assigned to either a training or a test group. They tested three machine learning algorithms and found that the Random Survival Forests model achieved the highest c-index (0.79). However, the paper had some methodological limitations and statistical biases. It lacked external center validation, did not investigate risk group stratification, had unclear handling of missing data, and did not address the applicability of usual risk scores in cases of incomplete observations. Patients included in this study were also enrolled in the UroCCR database, which may lead to minimal patient overlap between the two studies. However, considering that the UroCCR database comprises over 16,500 patients from 44 different centers and that the methodologies of our two studies differ, the cohorts and findings of our studies are distinct. This contributes new evidence to the prediction of recurrence in localized or locally advanced kidney cancer.

Furthermore, Gui et al. 32 published a multimodal model that combines genomic and pathomic with clinical features to predict the recurrence-free interval after surgery in a cohort of patients with ccRCC, using a nomogram. Nevertheless, there are several limitations, beginning with data obtained from a retrospective review of clinical files, in contrast to our data obtained from a real-world database collected prospectively. Clinical data are also selected a priori and based on the outdated Leibovich 2003 score33, which was revised in 2018. Additionally, the utilization of pathomics and genomics remains in the realm of research. These technologies are indeed prohibitively expensive and not readily available for routine clinical use. The authors themselves acknowledge that the associated tasks are too time-consuming for large-scale clinical application.

Recently, results from the Keynote 564 trial34 were reported, showing for the first time a benefit of adjuvant immunotherapy on DFS in patients who underwent surgical management for localized kidney cancer. However, with a median follow-up of 30 months, approximately 60% of patients in the placebo group remained disease free, while about 19% of patients in the experimental group experienced grade 3–5 adverse events. Therefore, it is essential to identify the right candidates for such treatment, specifically patients whose risk of recurrence justifies the use of a drug with potentially significant and long-lasting side effects. Furthermore, other phase III adjuvant trials have failed to demonstrate any post-surgery benefits35,36, and the selection of patients deemed high risk is a matter of controversy. Patients with relatively low recurrence risk may have been included, potentially masking the improvements in clinical outcomes offered by adjuvant therapy. Utilizing ML algorithms could enhance patient screening and the selection of patient profiles that would derive greater benefits from adjuvant treatment.

Our model provides individual DFS prediction following surgery for localized RCC with a high degree of accuracy. It outperformed most of the common prognostic scoring systems and offers the advantage of predicting outcomes for every single patient, even with incomplete data, in contrast to traditional scores. Displaying the SHAP values makes it possible to explain the prediction and the impact of each factor at the individual patient level. Integrating this tool into the UroCCR database will automatically provide physicians with the individual risk assessment for each patient included, enabling personalized management and follow-up. The prediction algorithm will also be publicly available on the website of the French kidney cancer research network (www.uroccr.fr), where the model variables will have to be entered manually.

Additionally, our model can be used to stratify patients into four distinct prognostic categories with strong discriminatory power. This allows us to identify a group of patients with a very low risk of recurrence, constituting 19% of the overall cohort. In this population, a less intensive post-operative follow-up can be considered, thus reducing medical costs and radiation exposure.

The strengths of our study include a large number of patients who are representative of the population managed for localized or locally advanced kidney cancer, and an external validation method that supports generalization of the model. The study also has some limitations. First, the study is retrospective. Second, the median follow-up time of 30 months is relatively short. Third, ethnicity distribution is not available, as research in France is strictly regulated by the CNIL (French National Commission for Information Technology and Civil Liberties), which prohibits any ethnic categorization. This model is therefore probably not generalizable to African and Asian populations, which are poorly represented in France. Finally, the data are extracted from a multicenter database, so management and follow-up may vary from one center to another. Although most cases were monitored, database completion, particularly event reporting, may also differ between centers, which could introduce bias when designing the model.

Finally, as previously mentioned, models perform differently when validated on different cohorts. The same likely applies to our model, which should therefore be validated on other populations and prospectively.

Our study suggests that machine learning applied to a real-world evidence dataset from patients undergoing surgery for localized or locally advanced kidney cancer can provide a more accurate individual prediction of DFS than conventional prognostic scores. This has the potential to enhance candidate selection for adjuvant therapy and identify patients who would benefit from less intensive surveillance.

Methods

Study population

From UroCCR, the database of the French research network on kidney cancer (NCT03293563), which has been labelled by the French National Cancer Institute (NCI) and the French High Authority of Health (HAS), we included all patients who underwent surgery between May 2000 and January 2020 for a localized or locally advanced renal cell carcinoma (pTany, Nany, M0). Patients with hereditary RCC, non-primary renal tumors, benign lesions, concomitant malignant disease or metastases, as well as patients with no follow-up information after surgery or with insufficient data, were excluded. The surgical procedure was a partial or radical nephrectomy, performed via either a laparoscopic or an open approach in one of the 23 participating tertiary centers. To make use of biological data, we excluded patients with conditions that could alter blood test results (hemopathies, chronic inflammatory diseases) (Supplementary Fig. 1). All data were collected prospectively in the UroCCR database after written consent had been obtained. It was approved by the French Advisory Committee on the Processing of Health Research Information and the French Data Protection Agency, and it complies with all ethical regulations including the Declaration of Helsinki.

Study objectives

The primary objective was to predict individual disease-free survival (DFS) based on baseline multimodal data. The secondary objective was to stratify patients into risk groups to identify a population with very low risk and a population at high risk of recurrence within 5 years following surgery.

Predictors

We extracted more than 200 demographic and clinical variables, including sex, age at surgery, American Society of Anesthesiologists (ASA) score, body mass index (BMI), Eastern Cooperative Oncology Group performance status (ECOG PS), symptoms at diagnosis, chronic kidney disease (CKD) score and time from diagnosis to surgery. Biological data included hemoglobin, thrombocytes, leucocytes, polymorphonuclear neutrophils (PMN), lymphocytes and serum creatinine level. Preoperative tumor characteristics encompassed size on contrast-enhanced imaging and multifocal or bilateral status. Surgical data comprised duration, nephrectomy type (partial vs. total), approach (laparoscopic vs. open), blood loss, presence of lymph node dissection or adrenalectomy, as well as intra- and postoperative complications.

Finally, we examined pathological findings including tumor size and stage, Fuhrman grade, histological subtype, surgical margins, and the presence of necrosis or microvascular invasion.

Follow-up and outcome

Post-operative follow-up was conducted according to common practices of each center, typically aligning with the recommendations of the French Society of Urology37, including visits at post-operative month 1–3 and every 6 months for 3 years, followed by annual visits. Radiological follow-up involved a contrast-enhanced examination of the abdomen and pelvis (CT scan or MRI) and a chest CT scan.

The primary outcome was DFS, defined as the time elapsed between surgery and the diagnosis of local recurrence, metastatic progression, or death from any cause, whichever occurred first.
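The composite endpoint defined above can be expressed as a small helper that returns the DFS time and an event indicator for one patient. This is a sketch under the assumption that all times are recorded in months since surgery, with `None` for events that were not observed; the function and argument names are hypothetical.

```python
from typing import Optional, Tuple


def dfs_endpoint(
    local_recurrence: Optional[float],
    metastatic_progression: Optional[float],
    death: Optional[float],
    last_follow_up: float,
) -> Tuple[float, int]:
    """Return (time, event) for disease-free survival.

    The event time is the earliest of local recurrence, metastatic progression
    or death from any cause. If none occurred, the patient is censored at the
    last follow-up (event = 0).
    """
    observed = [t for t in (local_recurrence, metastatic_progression, death) if t is not None]
    if observed:
        return min(observed), 1
    return last_follow_up, 0


# A patient with metastatic progression at 18 months who died at 40 months
time_a, event_a = dfs_endpoint(None, 18.0, 40.0, 40.0)
# A patient with no event, last seen at 30 months
time_b, event_b = dfs_endpoint(None, None, None, 30.0)
```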

Hold-out validation

Participating sites were randomly assigned to either the training or testing cohort, ensuring an approximate 2:1 ratio of patients and similar distributions of DFS. The model and risk-group thresholds were optimized on the training cohort and then applied to the testing cohort to evaluate predictive performance. The workflow for machine learning model development and evaluation is illustrated in Fig. 5.
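A site-level hold-out split of this kind keeps all patients from a given center in the same cohort. The sketch below shows one possible way to do it: centers are shuffled repeatedly and the assignment whose training share is closest to two thirds is kept. The exact randomization procedure used in the study is not described, so this search strategy, the function name and the toy center sizes are all assumptions; the additional check on similar DFS distributions is not implemented here.

```python
import random
from typing import Dict, List, Tuple


def split_sites(
    site_sizes: Dict[str, int],
    train_fraction: float = 2 / 3,
    n_trials: int = 1000,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Assign whole sites to train or test, targeting a given share of patients."""
    rng = random.Random(seed)
    sites = sorted(site_sizes)
    total = sum(site_sizes.values())
    best_gap, best_train = float("inf"), []
    for _ in range(n_trials):
        shuffled = sites[:]
        rng.shuffle(shuffled)
        train, n_train = [], 0
        for site in shuffled:
            if n_train >= train_fraction * total:
                break  # target share reached; remaining sites go to the test cohort
            train.append(site)
            n_train += site_sizes[site]
        gap = abs(n_train / total - train_fraction)
        if gap < best_gap:
            best_gap, best_train = gap, train
    test = [s for s in sites if s not in best_train]
    return sorted(best_train), test


# Hypothetical numbers of patients per center
toy_sites = {"A": 300, "B": 250, "C": 200, "D": 150, "E": 100, "F": 100}
train_sites, test_sites = split_sites(toy_sites)
train_share = sum(toy_sites[s] for s in train_sites) / sum(toy_sites.values())
```

Splitting by center rather than by patient means the test cohort behaves like an external validation set, since no test center contributed to model fitting.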

Fig. 5: Workflow for machine learning model (ML) development and evaluation.

Two thousand two hundred and forty-one patients from 10 centers were randomly assigned to the training cohort. Missing data were multiply imputed and several time-to-event models were trained. The best trained model was then externally validated on the testing cohort of 1131 patients from 13 different centers and compared with existing risk scores.

Model development

Categorical features with unbalanced modalities were recoded. Categorical features were then one-hot encoded while numerical features were normalized. Missing data were multiply imputed (3 imputations) using the MICE (Multiple Imputation using Chained Equations) algorithm38,39 and gradient-boosted decision trees. Several time-to-event models were trained on the training dataset, including Cox proportional hazards models with LASSO regularization40, random survival forests and gradient-boosted survival trees. Hyperparameters of each algorithm were tuned using a repeated cross-validation procedure (3 × 10 folds) and Bayesian optimization of the integrated AUC (iAUC) over the time window from 6 to 60 months after surgery. Table 4 lists the explored spaces of each hyperparameter. The discriminative power of the machine learning (ML) models was assessed using the iAUC, which represents the averaged cumulative-dynamic time-dependent area under the ROC curve (AUC) over the studied time interval. The iAUC values range from 0 to 1, with 0.5 for a random prediction and 1 for perfect discrimination. Blanche et al.41 argued that the AUC should be preferred to the C-index42 because the former compares the ranks of the predictions with the binary event status while the latter compares the ranks of the predictions with the ranks of the actual event status.
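To make the iAUC more concrete, the sketch below computes a simplified cumulative/dynamic AUC at a given horizon and averages it over a grid of horizons. It deliberately omits the inverse probability of censoring weights used in the study: patients censored before the horizon are simply left out, and the average over horizons is unweighted. Function names and toy data are assumptions for illustration.

```python
from typing import Sequence


def cumulative_dynamic_auc(
    times: Sequence[float],
    events: Sequence[int],
    risks: Sequence[float],
    horizon: float,
) -> float:
    """Naive AUC at a horizon: cases had an event by the horizon, controls were still event-free."""
    cases = [r for t, e, r in zip(times, events, risks) if t <= horizon and e == 1]
    controls = [r for t, r in zip(times, risks) if t > horizon]
    if not cases or not controls:
        raise ValueError("need at least one case and one control at this horizon")
    # Probability that a random case has a higher predicted risk than a random control
    concordant = sum(
        1.0 if rc > rk else 0.5 if rc == rk else 0.0 for rc in cases for rk in controls
    )
    return concordant / (len(cases) * len(controls))


def integrated_auc(times, events, risks, horizons: Sequence[float]) -> float:
    """Unweighted mean of the time-dependent AUC over the chosen horizons."""
    values = [cumulative_dynamic_auc(times, events, risks, h) for h in horizons]
    return sum(values) / len(values)


# Hypothetical follow-up (months), event indicators and predicted risks
toy_times = [6, 12, 12, 20, 30, 45, 60]
toy_events = [1, 0, 1, 1, 0, 1, 0]
toy_risks = [0.9, 0.3, 0.4, 0.6, 0.2, 0.5, 0.1]

auc_24 = cumulative_dynamic_auc(toy_times, toy_events, toy_risks, horizon=24)
iauc = integrated_auc(toy_times, toy_events, toy_risks, horizons=[24, 48])
```

At each horizon the outcome is treated as binary (event by that time or not), which is the property that distinguishes the time-dependent AUC from the C-index in the argument cited above.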

Table 4 Models’ parameters space for Bayesian optimization

Model evaluation

The ML model with the best cross-validated predictive performance was chosen and evaluated on the test dataset for external validation.

The model was assessed in terms of both discrimination and calibration using the time-dependent AUC and the time-dependent Brier score. The Brier score is used to measure the model calibration, ranging from zero to one, with zero being the best score and one the worst. The metrics were estimated using Kaplan–Meier-based inverse probability of censoring weighting (IPCW) to account for censoring. 95% confidence intervals were calculated using the Nadeau and Bengio correction43 for the cross-validation stage on the training dataset, and percentile bootstrapping for the external validation on the test dataset.
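Percentile bootstrapping can be sketched generically as follows: the data are resampled with replacement, the statistic is recomputed on each resample, and the 2.5th and 97.5th percentiles of the replicates form the 95% confidence interval. In the study the resampled units would be test-cohort patients and the statistic a metric such as the iAUC; here the statistic is a simple mean of hypothetical values, and the function name and defaults are assumptions.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple


def percentile_bootstrap_ci(
    values: Sequence[float],
    statistic: Callable[[List[float]], float],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return the (lower, upper) percentile bootstrap confidence interval."""
    rng = random.Random(seed)
    n = len(values)
    replicates = []
    for _ in range(n_boot):
        # Draw n observations with replacement and recompute the statistic
        resample = [values[rng.randrange(n)] for _ in range(n)]
        replicates.append(statistic(resample))
    replicates.sort()
    lower = replicates[int(alpha / 2 * n_boot)]
    upper = replicates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Hypothetical per-patient values, used only to exercise the function
toy_scores = [0.62, 0.71, 0.80, 0.77, 0.69, 0.85, 0.74, 0.79, 0.66, 0.82, 0.73, 0.78]
ci_low, ci_high = percentile_bootstrap_ci(toy_scores, mean)
```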

The clinical utility of the model was assessed using a decision curve based on the estimated risks of recurrence within five years following surgery. The permutation-based importance of each feature in the whole ML model was evaluated by computing the decrease in the optimization metric when the values of a given feature are randomly shuffled. SHAP (SHapley Additive exPlanations) values44 were then computed to explain each patient’s predicted probability of recurrence within the 5 years following surgery.
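Permutation-based importance, as described above, can be sketched in a model-agnostic way: the metric is computed once on the intact data, then again after shuffling one feature column, and the average decrease is reported. The `predict` and `metric` callables, the toy model and the toy data are hypothetical stand-ins; in the study the metric would be the iAUC and the model the fitted survival model.

```python
import random
from typing import Callable, List, Sequence


def permutation_importance(
    predict: Callable[[List[List[float]]], List[float]],
    metric: Callable[[Sequence[float], Sequence[float]], float],
    rows: List[List[float]],
    targets: Sequence[float],
    feature_index: int,
    n_repeats: int = 10,
    seed: int = 0,
) -> float:
    """Mean decrease in the metric when one feature column is randomly shuffled."""
    rng = random.Random(seed)
    baseline = metric(targets, predict(rows))
    drops = []
    for _ in range(n_repeats):
        column = [row[feature_index] for row in rows]
        rng.shuffle(column)
        shuffled_rows = [
            row[:feature_index] + [value] + row[feature_index + 1:]
            for row, value in zip(rows, column)
        ]
        drops.append(baseline - metric(targets, predict(shuffled_rows)))
    return sum(drops) / n_repeats


def toy_predict(rows: List[List[float]]) -> List[float]:
    """Hypothetical model that only uses the first feature."""
    return [row[0] for row in rows]


def negative_mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Higher is better, so a drop indicates lost predictive information."""
    return -sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


toy_rows = [[float(i), float((i * 7) % 5)] for i in range(8)]
toy_targets = [row[0] for row in toy_rows]

importance_used = permutation_importance(toy_predict, negative_mse, toy_rows, toy_targets, 0)
importance_unused = permutation_importance(toy_predict, negative_mse, toy_rows, toy_targets, 1)
```

A feature the model ignores gets an importance of zero, whereas shuffling an informative feature degrades the metric.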

Stratification into risk groups

Patients were stratified into four risk groups of recurrence within five years following surgery: very low, low, medium, and high-risk groups. The thresholds were notably set to obtain a large proportion of patients with a very low actual relapse rate (very low-risk group) and a significant proportion of patients with a high actual relapse rate (high-risk group). These stratification thresholds were determined using the training dataset and then applied to patients in the test dataset (Fig. 6). The DFS curves for the four risk groups were estimated using the Kaplan–Meier method and compared using log-rank tests.

Fig. 6: Negative predictive value and positive predictive value at 5 years on the training cohort.

Determination of the stratification thresholds on the training cohort. The left panel shows the false omission rate (equivalent to 1 − negative predictive value) at five years according to various decision thresholds. The right panel shows the positive predictive value at five years according to various decision thresholds. The machine learning model provides a relapse risk for all horizon times t that have been seen in the training dataset. For our use case, we set t to 5 years, as it is the standard horizon clinicians would consider when building a surveillance plan for their patients. Our primary goal is to find a sizeable group of patients with a very low risk of recurrence at 5 years. To do so, we plotted the false omission rate as a function of the cumulative frequency of patients in the very low-risk group by varying the risk threshold. We defined the very low-risk threshold as the point beyond which the false omission rate increases markedly. A similar strategy was then used with the positive predictive value (PPV) to determine a high-risk group of patients, by looking for a PPV plateau to set the risk threshold. The same method was reused to differentiate the medium- and low-risk groups.
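The two quantities used to choose the thresholds can be sketched for a binary 5-year outcome as follows. This simplified version ignores censoring, and the function names and toy data are assumptions; it only illustrates what is being plotted against the decision threshold.

```python
from typing import Sequence


def false_omission_rate(risks: Sequence[float], outcomes: Sequence[int], threshold: float) -> float:
    """Share of patients below the threshold who nevertheless recurred (1 - NPV)."""
    below = [y for r, y in zip(risks, outcomes) if r < threshold]
    return sum(below) / len(below) if below else float("nan")


def positive_predictive_value(risks: Sequence[float], outcomes: Sequence[int], threshold: float) -> float:
    """Share of patients at or above the threshold who actually recurred."""
    above = [y for r, y in zip(risks, outcomes) if r >= threshold]
    return sum(above) / len(above) if above else float("nan")


# Hypothetical predicted 5-year risks and observed recurrences (1 = recurred)
toy_risks = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.15, 0.80]
toy_outcomes = [0, 0, 0, 1, 0, 1, 0, 1, 0, 1]

# Sweep candidate thresholds, as one would to draw the two panels
sweep = [
    (t, false_omission_rate(toy_risks, toy_outcomes, t), positive_predictive_value(toy_risks, toy_outcomes, t))
    for t in (0.10, 0.22, 0.41)
]
for_at_22 = false_omission_rate(toy_risks, toy_outcomes, 0.22)
ppv_at_41 = positive_predictive_value(toy_risks, toy_outcomes, 0.41)
```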

Comparison with usual risk scores

The ML model was compared to four prognostic scores commonly used in guidelines and clinical trials: the UISS (University of California at Los Angeles Integrated Staging System)24, the SSIGN (Stage, Size, Grade, and Necrosis)45, the GRANT (GRade, Age, Nodes and Tumor)46 and the Leibovich47 scores. These prognostic scores could not be computed for the entire testing cohort due to incomplete observations. Each pairwise comparison was conducted on the subset of patients for whom the prognostic score was available, with one-sided p-values estimated using bootstrapping.
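One common way to obtain a one-sided bootstrap p-value for a paired comparison is to resample patients, recompute the metric for both models on each resample, and count how often the first model fails to do better. The study does not detail its exact procedure, so the sketch below is an assumption; it also uses a censoring-free negative Brier score on a binary outcome as a stand-in metric, and all names and data are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

# (observed outcome, risk predicted by model A, risk predicted by model B)
Record = Tuple[int, float, float]


def negative_brier(records: Sequence[Record], column: int) -> float:
    """Negative mean squared error of the predictions in `column` (higher is better)."""
    return -sum((rec[column] - rec[0]) ** 2 for rec in records) / len(records)


def one_sided_bootstrap_p(records: Sequence[Record], n_boot: int = 1000, seed: int = 0) -> float:
    """P-value for the hypothesis that model A is not better than model B."""
    rng = random.Random(seed)
    n = len(records)
    not_better = 0
    for _ in range(n_boot):
        # Resample patients so both models are scored on the same bootstrap sample
        sample: List[Record] = [records[rng.randrange(n)] for _ in range(n)]
        difference = negative_brier(sample, 1) - negative_brier(sample, 2)
        if difference <= 0:
            not_better += 1
    # Add-one correction keeps the estimated p-value strictly positive
    return (not_better + 1) / (n_boot + 1)


toy_records: List[Record] = [
    (1, 0.8, 0.6), (0, 0.2, 0.4), (1, 0.7, 0.5),
    (0, 0.1, 0.3), (1, 0.9, 0.6), (0, 0.3, 0.5),
]
p_value = one_sided_bootstrap_p(toy_records)
```

Restricting each comparison to patients for whom the conventional score is available, as described above, corresponds to filtering the records before calling the function.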

This model was developed in accordance with the SPIRIT-AI guidelines48, and its integration does not entail any specific requirements. The data and algorithm can be made available upon request.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.