Introduction

Prostate cancer is a common malignancy among men, affecting 1.4 million per year1. A significant proportion of these men will receive the primary curative treatment of a prostatectomy. This surgery’s success can partly be judged by the concentration of prostate-specific antigen (PSA) in the blood. While it has a dubious role in prostate cancer screening2,3, this protein is a valuable biomarker in PCa patients’ follow-up post-prostatectomy. In a successful surgery, the concentration will mostly be undetectable (<0.1 ng/mL) after 4–6 weeks4.

However, in ~30% of the patients5,6,7, PSA will rise again after surgery, called biochemical recurrence, pointing to regrowth of prostate cancer cells. Biochemical recurrence is a prognostic indicator for subsequent progression to clinical metastases and prostate cancer death8. Estimating chances of biochemical recurrence could help to better stratify patients for specific adjuvant treatments.

The risk of biochemical recurrence of prostate cancer is currently assessed in clinical practice through a combination of the ISUP grade9, the PSA value at diagnosis and the TNM staging criteria. In a recent European consensus guideline, these factors were proposed to separate the patients into a low-risk, intermediate-risk and high-risk group10. A high ISUP grade independently can, independently of other factors, assign a patient to the intermediate (grade 2/3) or high-risk group (grade 4/5).

Based on the distribution of the Gleason growth patterns11, which are prognostically predictive morphological patterns of prostate cancer, pathologists assign cancerous tissue obtained via biopsy or prostatectomy into one of five groups. They are commonly referred to as International Society of Urological Pathology (ISUP) grade groups, the ISUP grade, Gleason grade groups, or just grade groups.9,12,13,14. Throughout this paper, we will use the term ISUP grade. The ISUP grade suffers from several well-known limitations. For example, there is substantial disagreement in the grading using the Gleason scheme15,14. Furthermore, although the Gleason growth patterns have seen significant updates and additions since their inception in the 1960s, they remain relatively coarse descriptors of tissue morphology. As such, the prognostic potential of more fine-grained morphological features has been underexplored. We hypothesize that artificial intelligence, and more specifically deep learning, has the potential to discover such information and unlock the true prognostic value of morphological assessment of cancer. Specifically, we developed a deep learning system (DLS), trained on H&E-stained histopathological tissue sections, yielding a score for the likelihood of early biochemical recurrence.

Deep learning is a recent new class of machine learning algorithms that encompasses models called neural networks. These networks are optimized using training data; images with labels, such as recurrence information. From the training data, relevant features to predict the labels are automatically inferred. During development, the generalization of these features is tested on separated training data, which is not used for learning. Afterwards, a third independent set of data, the test set, is used to ensure generalization. Since features are inferred, handcrafted feature engineering is not needed anymore to develop machine learning models. Neural networks are the current state-of-the-art in image classification16.

Deep learning has previously been shown to find visual patterns to predict genetic mutations from morphology, for example, in lymphoma17 and lung cancer18. Additionally, deep learning has been used for feature discovery in colorectal cancer19 and intrahepatic cholangiocarcinoma20 using survival data. Although deep learning has been used with biochemical recurrence data on prostate cancer, Leo et al.21. assumed manual feature selection beforehand, strongly limiting the extent of new features to be discovered. Yamamoto et al.22. used whole slide images and a deep-learning-based encoding of the slides to tackle the slides’ high resolution. They leverage classical regression techniques and support-vector machine models on these encodings. The deep learning model was not directly trained on the outcome, limiting the feature discovery in this work as well.

A common critique of deep learning is its black-box nature of the inferred features23. Especially in the medical field, decisions based on these algorithms should be extensively validated and be explainable. Besides making the algorithms’ prediction trustworthy and transparent, from a research perspective, it would be beneficial to visualize the data patterns which the model learned, allowing insight into the inferred features. We can visualize the patterns learned by the network leveraging a new technique called Automatic Concept Explanations (ACE)24. ACE clusters patches of the input image using their intermediate inferred features showing common patterns inferred by the network. We were interested in finding these common concepts over a range of images to unravel patterns that the model has identified.

This study aimed to use deep learning to develop a new prognostic biomarker based on tissue morphology for recurrence in patients with prostate cancer treated by radical prostatectomy. As training data, we used a nested case-control study25. This study design ensured we could evaluate whether the network learned differentiating patterns independent of Gleason patterns. The prognostic biomarker provides a strong correlation with biochemical recurrence in two sets (n = 182 and n = 204) from separate institutions. Furthermore, the Automatic Concept-based Explanations provided tissue patterns interpretable by our pathologist.

Methods

Cohorts

Two independent cohorts of patients who underwent prostatectomy for clinically localized prostate cancer were used in this study. Patients were treated at either the Johns Hopkins Hospital in Baltimore or New York Langone Medical Centre. Both cohorts were accessed via the Prostate Cancer Biorepository Network26. The Johns Hopkins University School of Medicine Institutional Review Board and The New York University School of Medicine Institutional Review Board provided ethical regulatory approval for collection and disbursement of data and materials from the respective institutions. The need for acquiring informed consent was waived by the institutional ethical review boards.

For the development of the novel deep-learning-based biomarker (further referred to as DLS biomarker), we used a nested case-control study of patients from Johns Hopkins. This study consists of 524 matched pairs (724 unique patients) containing four tissue spots per patient. They were sampled from 4860 prostate cancer patients with clinically localized prostate cancer who received radical retropubic prostatectomy between 1993 and 2001. Men were routinely checked after prostatectomy at 3 months and at least yearly thereafter. Surveillance for recurrence was conducted using digital rectal examination and measurement of serum PSA concentration. Patients were followed for outcome until 2005, with a median follow-up of 4.0 years. The outcome was defined as recurrence, based on biochemical recurrence (serum PSA > 0.2 ng/mL on 2 or more occasions after a previously undetectable level after prostatectomy), or events indicating biochemical recurrence before this was measured; local recurrence, systemic metastases, or death from prostate cancer. Controls were paired to cases with recurrence using incidence density sampling27. For each case, a control was selected who had not experienced recurrence by the date of the case’s recurrence and was additionally matched based on age at surgery, race, pathologic stage, and Gleason sum in the prostatectomy specimen based on the pathology reports. Given the incidence density sampling of controls, some men were used as controls for multiple cases, and some controls developed recurrence later and became cases for that time period.

The TMA spots were cores (0.6 mm in diameter) from the highest-grade tumour nodule. Random subsamples were taken in quadruplicate for each case. The whole slides were scanned using a Hamamatsu NanoZoomer-XR slide scanner at 0.23 μm/px. TMA core images were extracted using QuPath (v0.2.328,). We discarded analysis of cores with <25% tissue. The cores were manually checked (HP) for prostate cancer, excluding 535 cores without clear cancer cells present in the TMA cross-section, resulting in a total of 2343 TMA spots. The nested case-control set was split based on the matched pairs into a development set (268 unique pairs), and a test set (91 pairs); the latter was used for evaluation only. We leveraged cross-validation by subdividing the development into three folds to tune the models on different parts of the development set. We divided paired patient, randomly, keeping into account the distribution of the matched variables. The random assignment was done using the scikit-multilearn package29, specifically the ‘IterativeStratification’ method in ‘skmultilearn.model_selection’. After splitting the dataset into training and test, we split the training dataset into three folds using the same method for the cross-validation.

To validate the DLS biomarker on a fully independent external set, we used the cohort from New York Langone Medical Centre. This external validation cohort consists of 204 patients with localized prostate cancer treated with radical prostatectomy between 2001 and 2003. Patients were followed for outcome until 2019, with a median follow-up of 5 years. Biochemical recurrence was defined as either a single PSA measurement of ≥0.4 ng/m or PSA level of ≥0.2 ng/ml followed by increasing PSA values in subsequent follow-up. Cores were sampled from the largest tumour focus or any higher-grade focus (>3 mm). Subsamples were taken in quadruplicate for each case. Images were scanned using a Leica Aperio AT2 slide scanner at 0.25 μ/px.

Model details

For developing the convolutional neural networks (CNNs) we used PyTorch30. As an architecture, we used ResNet50-D31 pretrained on ImageNet from PyTorch Image Models32. We used the Lookahead optimizer33 with RAdam34, with a learning rate of 2e-4 and mini-batch size of 16 images. We used weight decay (7e-3), and a drop-out layer (p = 0.15) before the final fully-connected layer. We used EfficientNet-style35 dropping of residual connections (p = 0.3) as implemented in PyTorch Image Models. We used Bayesian Optimization to find the optimal values (See Supplementary Notes 1 for details about the searchspace).

We resized the TMAs to 1.0 mu/pixel spacing and cropped to 768 × 768 pixels. Extensive data augmentations were used to promote generalization. The transformations were: flipping, rotations, warping, random crop, HSV colour augmentations, jpeg compression, elastic transformations, Gaussian blurring, contrast alterations, gamma alterations, brightness alterations, embossing, sharpening, Gaussian noise and cutout36. Augmentations were implemented using albumentations37 and fast.ai38.

TMA spots from cases experiencing recurrence were assigned a value of 0–4, depending on the year on which the first event, either biochemical recurrence, metastases, or prostate cancer-related death, was recorded, with 0 meaning recurrence within a year, four meaning after 4+ years. TMA spots from cases without an event were also assigned the label 4.

We validated the model on the development validation fold each epoch with a moving average of the weights from five subsequent epochs. We used the concordance index as a metric to decide which model performed the best.

As the final prediction at the patient level, the TMA spot with the highest score was used. The final DLS consists of an ensemble of 15 convolutional neural networks. Using cross-validation as described above, 15 networks were trained for each fold, of which the five best performing were used for the DLS. See Fig. 1 for a graphical overview of the methods, further details can be found in the Supplementary Methods.

Fig. 1: Overview of the methods summarizing the biomarker development and the Automatic Concept Explanations (ACE) process.
figure 1

Cores were extracted from TMA slides and used to train a neural network to predict the years to biochemical recurrence. On the nested case-control test set, a matched analysis was performed. For ACE, patches were generated from the cores, inferenced through the network and clustered based on their intermediate features.

Statistics and reproducibility

For primary analysis of the nested case-control study, odds ratios (OR) and 95% confidence intervals (CI) were calculated using conditional logistic regression, following Dluzniewski et al.39. Due to the study design, calculating hazard ratios using a Cox proportional hazard regression is not appropriate. For the primary analysis, the continuous DLS marker was given as the only variable. For a secondary analysis, we added the non-matched variables PSA, positive surgical margins, and a binned indicator variable for year of surgery. Since matching was done on Gleason sum, and our goal was to identify patterns beyond currently used Gleason patterns, we corrected for the residual differences of the ISUP grade between cases and control (see Table 1). A correction was performed by adding a continuous covariate since, due to the small differences, an indicator covariate did not converge. Analysis was done using the lifelines Python package (v. 0.25.10)40 with Python (v. 3.7.8). P-values were calculated as a Wald test per single parameter. Since the DLS predicts the time-to-recurrence, high values indicate a low probability of recurrence. We multiplied the DLS output by −1 to make the analysis more interpretable. For three patients (1 from the Johns Hopkins cohort and 2 from the New York Langone cohort), PSA values were missing and were therefore replaced by the median.

Table 1 Baseline characteristics of test set and development set from the John Hopkins Hospital, prostate cancer recurrence cases and controls, men who underwent radical prostatectomy for clinically localized disease between 1993 and 2001.

For primary analysis of the New York Langone cohort, we calculated hazard ratios (HR) using a Cox proportional hazards regression. We report a secondary multivariable analysis including indicator variables for relevant clinical covariates, Gleason sum, pathological stage, and surgical margin status. We tested the proportional hazards assumption as satisfactory (every p-value > 0.01) using the Pearson correlation between the residuals and the rank of follow-up time. Kaplan–Meier plots were generated for the New York Langone cohort. Due to the nested case-control design for the Johns Hopkins set, this set could not be visualized in a Kaplan–Meier plot.

Automatic concept explanations

To generate concepts, we picked the best performing single CNN from the DLS based on its validation set fold. We used a combination of the methods of Yeh et al., 202041 and Ghorbani et al., 201924.

We tiled the TMA images into 256 × 256 patches within the tissue, discarding patches with >50% whitespace. These patches were padded to the original input shape of the CNN (768 × 768 pixels). The latent space of layer 42 of 50 was saved for each tile. Afterwards, we used PCA (50 components) to lower the dimensionality and then performed k-means (k = 15) to cluster the latent spaces.

In contrast to Yeh et al. and Ghorbani et al., we did not sort the concepts on completeness of the explanations or importance for prediction of individual samples. We sorted the concepts to find interesting new patterns related to recurrence across images by ranking the concepts based on the DLS score of the TMA spot from which they originated.

For each concept, 25 examples were randomly picked and visually inspected by a pathologist (JvI), with a special interest in uropathology, blinded to the case characteristics and prediction of the network.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Results

The DLS system was developed on the Johns Hopkins cohort with 2343 TMA spots of 685 included unique patients (39 patients were excluded due to insufficient tumour amount in the cores). Four hundred ninety-two patients were recurrence cases (72%). The 685 included patients were split into a development set of 503 unique patients and a test set of 91 matched pairs of cases and controls (182 unique patients).

In the external validation cohort, 38 out of the 204 patients (19%) had biochemical recurrence after complete remission, PSA nadir after 3 months post-prostatectomy. From the 204 patients, 620 TMA spots were included. Clinical characteristics of the cohorts can be found in Table 1 and Table 2.

Table 2 Baseline characteristics of the cohort from New York Langone hospital, prostate cancer recurrence cases and controls, men who underwent radical prostatectomy between 2001 and 2003.

The DLS marker showed a strong association in the primary analyses on the test set of the Johns Hopkins cohort with an OR of recurrence of 3.28 (95% CI 1.73–6.23; p < 0.005) per unit increase, with DLS system continuous output ranging from 0–3, with two cases below 0 (−0.27 and −0.24) (Table 3).

Table 3 Conditional logistic regression analyses of the Johns Hopkins test set.

In addition, for the John Hopkins cohort, we checked for confounding by ISUP grade, PSA level at diagnosis, positive surgical margins, and year of prostatectomy. Neither covariate was found to bias the estimates of effect substantially. The biomarker maintained a strong correlation of OR 3.32 (CI 1.63–6.77; p = 0.001) per unit increase, adjusting for these factors and the continuous term for the residual difference between cases and controls in the ISUP grade.

In the univariable analysis, the DLS marker was strongly associated with recurrence in the New York Langone external validation cohort with an HR of 5.78 (95% CI 2.44–13.72; p < 0.005) per unit increase. In the multivariate model, including ISUP grade and the other prognostic indicators in addition to the DLS biomarker, the DLS biomarker was still strongly associated with recurrence with an HR of 3.02 (CI 1.10–8.29; p = 0.03) per unit increase (Table 4). Kaplan–Meier curves based on a median cut-off, and four-group categorization, show a clear separation of the low-risk and high-risk groups (Fig. 2).

Table 4 Cox proportional hazard analyses of New York Langone external validation cohort.
Fig. 2: Kaplan–Meier plot for New York Langone external validation cohort.
figure 2

Groups were separated using the median DLS biomarker score in this cohort (a) and using four thresholds (b).

Automatic Concept Explanations provided semantically meaningful concepts (Fig. 3). Concepts were identified that correlated with either a relatively rapid or slow biochemical recurrence. Visual inspection by JvI reveals that generally, the concepts with adverse behaviour show mainly Gleason pattern 4 and some Gleason pattern 5, with cribriform configuration in TMAs within the concepts with most adverse behaviour. The two intermediate concepts show mainly stroma and less aggressive growth patterns. The two concepts predicted to be part of late recurrence cases show mainly Gleason 3 patterns, with readily recognizable well-formed glands. See the Supplementary Notes 2 for a detailed analysis.

Fig. 3: Examples of automatic concepts explanations.
figure 3

Concepts were sorted by their average score for the cores in which the pattern occurs. Showing the two most benign concepts, two intermediate and two aggressive concepts. The boxes show the quartiles of the concept predictions while the whiskers extend to show the rest of the distribution except for outlier points that lie below the 25% or above 75% of the data, by 1.5 times the interquartile range. Green, yellow and red shaded areas indicate 33%, 66% percentiles. See the Supplementary Notes 2 for all concepts.

Discussion

We have developed a deep-learning-based morphological biomarker for the prediction of prostate cancer biochemical recurrence based on prostatectomy tissue microarrays. Using a nested case-control study, we trained convolutional neural networks end-to-end with biochemical recurrence data. The DLS marker provides a continuous score based on the speed of biochemical recurrence it perceived. The DLS marker had an OR of 3.32 (CI 1.63–6.77; p = 0.001) per unit increase for the test set, and an HR of 3.02 (CI 1.10–8.29; p = 0.03) per unit increase for the external validation set. These findings support our hypothesis that there is more morphological information in the tissue besides the ISUP grade.

In the Kaplan–Meier plot (Fig. 2), the biomarker especially seems able to separate men with relatively rapid recurrence from men without (<5 years). However, we hypothesize that the decreased long-term separation in those survival curves is less due to the training cohort containing a median follow-up for 4 years. Furthermore, we choose to group patients together with >4 years of no biochemical recurrence, this limits the model’s capabilities to differentiate patients with very late recurrence. Additionally, due to the limitations of the morphology of the present tumour to inform about long-term outcomes (e.g., cells that escaped the primary tumour may subsequently acquire genomic changes that influence recurrence). Furthermore, it should be noted that the number of at-risk patients was small at these long-term time points.

The nested case-control study contained follow-up information in timespans of years, this limited the use of survival based loss functions42. When more granular follow-up information is at hand, future work could investigate usage of Cox regression based loss functions to better leverage the information of the clinical cohort.

The DLS marker showed strong and similar association in both cohorts prepared at different pathology laboratories, which supports the robustness to differences in tissue preparation, staining protocols and scanners.

We showed that Automatic Concept Explanation may be helpful to find concepts correlated with good and poor prognosis. The most discriminatory concepts followed the morphological patterns of Gleason grading. Well-defined prostate cancer glands were predicted to undergo biochemical recurrence later than disorganized sheets of prostate cancer cells. These concepts support the DLS system capturing the expected morphological patterns in support of the validity of the DLS approach.

This study focused on the use of deep learning to automatically discover features relevant for biochemical recurrence prediction. Compared to before-mentioned studies on prostate cancer prognostics models21,22, as far as we know, we report the first paper to directly optimize a neural network from prostatectomy tissue towards biochemical recurrence. Additionally, we report that training towards the biochemical recurrence endpoint results in patterns in the networks’ features aligning with the ISUP grading.

In the increasing digitalisation of pathology labs, our DLS marker may be applied on digitally chosen regions of interest. Our marker is trained on tissue microarray spots that were selected at the highest-grade cancer focus. Furthermore, it has to be noted that a TMA core allows for only limited assessment of the overall prostate cancer growth patterns. Since these tissue cores represent only limited samples from what is usually a much larger tumour lesion, the potential more aggressive patterns may still be present outside of the chosen regions, including regions of potential extraprostatic extension and perineural invasion. Validation will need to be done on entire prostatectomy sections and across cancer foci.

There have been improvements to prostate cancer grading11,13, and recently the cribriform pattern is suggested to be important for prognostics14,43. However, the evaluation of this pattern can show a range of inter-observer variability44, although a recent consensus approach could help decrease this variability45. Although we certainly have to keep in mind all the before-mentioned limitations, our findings are in line with outcomes concerning adverse behaviour in earlier work. The DLS system identified a concept that consisted of fields with cribriform-like growth patterns. This cribriform-like growth pattern was found to be part of the concept that was most associated with early recurrent cases.

The results in this study are limited to newer insights of prostate cancer growth, information on cribriform-growth and intraductal carcinoma were not readily available for use in the multivariate analysis, although the external validation cohort was graded using the 2005 ISUP consensus46 partly encoding the presence of cribriform growth inside the ISUP grade.

Although biochemical recurrence is a common endpoint to study prostate cancer progression, a clinical utility would be mostly found in assessing time-to-metastases or death. However, time-wise, they are typically significantly further separated from the surgical event, making it harder to identify relationships between tissue morphology and these endpoints. Nevertheless, we would like to investigate them in the future.

Conclusions

In summary, we have developed a deep-learning-based visual biomarker for prostate cancer recurrence based on tissue microarray hotspots of prostatectomies. The DLS marker provides a continuous score predicting the speed of biochemical recurrence. We obtained an odds ratio of 3.32 (CI 1.63–6.77; p = 0.001) for a nested case-control study from Johns Hopkins Hospital, matched on Gleason sum on other factors. Additionally, we obtained an HR of 3.02 (CI 1.10–8.29; p = 0.03) for an external validation cohort from the New York Langone hospital, adjusted for ISUP grade, pathological stage, preoperative PSA concentration, and surgical margins status. Thus, this visual biomarker may provide prognostic information in addition to the current morphological ISUP grade.