Introduction

Breast cancer is the most frequently diagnosed cancer and the second leading cause of cancer-related death for females worldwide1,2. Estrogen receptor-positive (ER+) and lymph node-negative (LN−) disease is a common subtype of invasive breast cancer (IBC)3,4, for which the standard treatment includes breast-conserving surgery followed by radiation and adjuvant hormonal therapy. Adjuvant chemotherapy, however, is typically reserved for patients at high risk of recurrence (ROR). Given the significant side effects of adjuvant chemotherapy5,6, it is critical to identify ER+ and LN− IBC patients with lower ROR who may safely avoid chemotherapy.

Oncotype Dx (ODx)7,8,9 is a widely applied and extensively validated molecular assay in clinical practice with ODx score aiding in estimating the level of recurrence risk of ER+ and LN− IBC and treatment benefit from adjuvant chemotherapy. The ODx test is, however, usually tissue destructive and remains expensive10. More importantly, some recent studies11,12 have suggested that the ODx assigned risk categories are not always optimal. For example, the test was less accurate on African American patients as compared to Caucasian patients11. In addition, some patients identified as in one ODx risk category might actually have opposite ROR11,12. The inclusion of additional information or new models that can provide more granular risk stratification within the ODx risk categories could allow for more accurate personalized treatment regimens for women with ER+ and LN− IBC.

The Nottingham grading system (NGS)13,14,15 is routinely used by pathologists to evaluate ROR for ER+ and LN− IBC. The NGS consists of a three-component visual assessment: (1) nuclear pleomorphism referring to variations in nuclear shape, size, and chromatin appearance, (2) mitotic activity relating to tumor cell division and proliferation, and (3) tubule formation reflecting the percentage of tumor cells forming tubule structures. However, subjectivity and inter-observer variability remain critical challenges for using NGS in clinical practice, with unsatisfactory concordance reported in a number of studies16,17,18,19,20,21,22.

With the advent of digital pathology, quantitative histomorphometry (QH) has been widely used to quantify tumor morphology from digitized tissue slides to uncover information potentially undiscernible by human vision systems23,24,25. Features related to individual NGS components including nuclear shape variability, mitotic index and ratio of tubule nuclei have been explored in numerous QH studies and validated as associated with risk stratification in breast cancer26,27,28,29,30,31. However, most studies have not performed a comprehensive and simultaneous quantitative analysis of all three NGS components and have not investigated the added prognostic value that QH-based biomarkers could provide over the ODx test.

In this study, we hypothesize that integrating all three components using QH analysis will improve breast cancer prognosis for clinical decision making. In this work, we present an Image-based Risk Score (IbRiS) classifier that combines computer-extracted features of nuclear morphology, mitotic rates, and tubule formation to prognosticate outcomes for ER+ and LN− IBC. The overall workflow is shown in Fig. 1. First, we trained three different deep learning models on H&E-stained Whole Slide Images (WSI) of breast cancer, namely (a1) a Generative Adversarial Network (GAN) for nuclei segmentation, (a2) a deep Convolutional Neural Network (CNN) for mitosis detection, and (a3) a U-Net model for tubule segmentation. Second, based on these computationally derived segmentation/detection masks, we extracted a total of 343 QH features related to nuclear morphology, mitotic count, and tubule formation from the tumor region. Subsequently, we identified the top four prognostic features from each of the three feature categories using a Cox proportional hazards regression model32. The top identified features were further ensembled to construct a final prognostic Cox regression model (IbRiS) by associating them with patient clinical outcomes. Finally, we independently validated the prognostic significance of IbRiS on two cohorts from two different institutions, comprising a total of 205 patients with ER+ and LN− IBC. Given the diverse representation of race, tumor grade, and treatment regimen between the training and testing sets, we sought to demonstrate the generalizability of IbRiS for assessing the aggressiveness of breast cancer using computer-extracted histologic features. The prognostic performance of IbRiS was also evaluated within each ODx derived risk category (i.e., low, intermediate, and high).

Fig. 1: Illustration of the overall workflow for the experimental design.

a Three deep-learning models: (a1) a CNN, (a2) a pixel2pixel GAN, and (a3) a U-Net model, were trained to detect mitosis, nuclei, and tubules, respectively. b Tumor tiles were exhaustively extracted from the tumor regions delineated by an experienced breast pathologist. c After detection of mitosis, nuclei, and tubules, quantitative patient-level features were extracted to describe mitotic rates, nuclear pleomorphism, and tubule formation, respectively. d The four most prognostic features were selected from each feature category by their association with disease-free survival (DFS) using a Cox regression model. e The top features identified from individual feature families were ensembled to train a final combined Cox proportional hazards model to stratify the ER+ and LN− breast cancer patients into high- and low-risk categories on the training set D1 with group differences assessed by two-sided log-rank test. f The prognostic model was subsequently locked down (g) and evaluated on two independent validation cohorts, D2 and D3 with the differences between high- and low-risk categories measured by two-sided log-rank test.

Results

Clinicopathological variables of the patient cohorts

The clinicopathological variables and clinical outcomes of patient cohorts D1, D2, and D3 are provided in Table 1. Patients were primarily in their 50s and 60s, and multiple ethnicity groups were included (non-Hispanic white: 62.6%, South Asian: 26.2%, non-Hispanic black: 9%, other: 2.2%). Notably, unlike the non-Hispanic-white-dominated training set D1 and the validation set D2, all patients in the D3 validation set were South Asian women. Approximately 82% of the patients in D1, D2, and D3 were diagnosed as histologic grade 2 or grade 3. In particular, 63% of the patients in D3 were grade 3, much higher than the 16% in D1 and the 27.3% in D2. The vast majority of the patients in D1 and D3 were HER2 negative (HER2−) (except one HER2 positive (HER2+) case in D1), while in D2, 42% of patients were HER2−, 20% were HER2+, and 38% had unknown HER2 status. Additionally, 65% of all the patients in D1+2+3 (D1 + D2 + D3) were treated with adjuvant chemotherapy (28% in D1, 100% in D2, and 68% in D3). Of note, chemotherapy use in D1 was likely guided by the ODx score, unlike in the other two cohorts.

Table 1 Summary of clinicopathological variables of the three patient cohorts.

Experiment 1: model construction and validation

A total of 12 prognostic features were obtained by combining the top 4 features identified in each of the three feature categories (i.e., nuclear morphology, mitotic rates, and tubule formation) using a Cox regression model targeting DFS on D1 (see Supplementary Table 1 for a brief description of the 12 identified top features). The distribution of the four identified features from each of the nuclear, mitotic, and tubule feature categories between the high-risk and low-risk groups predicted by IbRiS on all cohorts D1+2+3 is illustrated in Supplementary Fig. 2. Three representative features (i.e., mitotic counts, locally connected nuclear clusters, and the ratio of tubule nuclei count to non-tubule nuclei count) are presented in Fig. 2. Figure 2 shows that patients who did not have DFS events tended to have fewer mitotic events, fewer connected nuclear clusters, and a higher proportion of tubule nuclei in relation to those patients who did experience an event.

Fig. 2: Representative H&E WSI comparison of a recurrent (top row) and a censored (bottom row) patient.

The first column (a, f) shows the original WSI with the pathologist-annotated tumor region. The second column (b, g) illustrates the distribution of mitotic counts on the WSI with warmer color in the scale bar indicating a higher mitosis number. The third column (c, h) is a magnified view of a tumor tile. The fourth column (d, i) demonstrates the top-identified nuclear feature, which quantifies the number of connected nuclei clusters (connected in green line). The fifth column (e, j) shows the tubule feature “ratio of tubule nuclei count to non-tubule nuclei count” with tubule nuclei highlighted in cyan.

A LASSO regularized Cox regression model (IbRiS) was constructed with the 12 identified features correlating to DFS on D1 (n = 116) (see Supplementary Table 2 for the non-zero coefficients of the features). A dichotomized risk category was generated from the model as described in the “Methods” section. The distribution of the continuous risk scores for each individual cohort is illustrated in Supplementary Fig. 3. KM survival curves were generated for high (IbRiSH) and low (IbRiSL) risk groups for datasets D1, D2, and D3, respectively, with hazard ratio (HR) = 6.36 (95% confidence interval (CI) = 2.69–15, p = 2 × 10⁻⁵) on D1, HR = 2.33 (95% CI = 1.02–5.32, p = 0.045) on D2, and HR = 2.94 (95% CI = 1.18–7.35, p = 0.0208) on D3 (see Fig. 3). Patients predicted as high risk by IbRiS had a significantly worse outcome in terms of DFS than patients in the low-risk group. Notably, the separation of KM curves between IbRiSH and IbRiSL risk groups was more evident beyond the early survival times (~50 months), which suggests that the model is able to identify late DFS events. Since 20% of patients in D2 were HER2 positive and 38% had unknown HER2 status, we additionally performed survival analysis of IbRiS on HER2− patients in D2 after excluding the patients with HER2+ or unknown HER2 status (1st plot in Supplementary Fig. 4) as well as on HER2− and HER2 unknown patients in D2 after excluding patients with HER2+ status (2nd plot in Supplementary Fig. 4). In both KM curves, the trend that the IbRiSH group had a poorer outcome in terms of DFS was observed, although the difference in survival was not statistically significant, potentially due to the low number of patients included.

Fig. 3: Prognostic performance of IbRiS on D1-D3.

KM curve estimates for DFS for IbRiSH (red) versus IbRiSL (blue) across D1–D3 (a–c), with IbRiSH demonstrating a significantly worse prognosis compared to IbRiSL on D1, D2, and D3 using a two-sided log-rank test.

Univariate and multivariable Cox proportional hazards analyses for DFS on IbRiS-derived risk category, clinicopathological variables, chemotherapy treatment, and ODx risk category on D1, D2, and D3 are shown in Table 2. On univariate analysis, except for IbRiS-derived risk categories and age on D1, none of the clinicopathological factors was significantly prognostic of DFS on D1, D2, and D3. The patients in IbRiSH had significantly worse DFS compared to those in IbRiSL, with HR = 6.36 (95% CI = 2.69–15, p = 2 × 10⁻⁵) on D1, HR = 2.33 (95% CI = 1.02–5.32, p = 0.0450) on D2, and HR = 2.94 (95% CI = 1.18–7.35, p = 0.0208) on D3. The ODx risk category was significantly prognostic on D1 (HR = 2.48, 95% CI = 1–6.2, p = 0.0497) and D2 (HR = 14, 95% CI = 1.74–110, p = 0.0132) when combining the intermediate and high-risk categories into a single group. In multivariable analysis, IbRiS was found to be independently prognostic of DFS in the training set and both independent testing sets, with HR = 6.05 (95% CI = 2.33–16, p = 0.0002) on D1, HR = 4.51 (95% CI = 1.1–18, p = 0.0366) on D2, and HR = 4.12 (95% CI = 1.45–12, p = 0.0078) on D3. Note that we excluded the ODx risk category from the multivariable analysis on D2 due to the limited number of patients with ODx scores available (23% in D2). To investigate the interdependency between IbRiS and ODx risk category on D2, Lin’s concordance correlation coefficient33 was calculated, with a value of 1 indicating perfect agreement and −1 indicating complete disagreement. The concordance was found to be low between IbRiS (low versus high-risk group) and the ODx test (low and intermediate versus high ODx risk category: 0.16 (95% CI = −0.21–0.49); low versus intermediate and high ODx risk category: 0.26 (95% CI = −0.08–0.54)).
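As an illustration of the concordance measure used above, the following is a minimal sketch of Lin's concordance correlation coefficient for two paired ratings. It assumes the binary risk categories are encoded as 0 (low) and 1 (high), which is our assumption rather than a detail stated in the text, and it does not compute the confidence interval. The function name `lins_ccc` is hypothetical.

```python
def lins_ccc(x, y):
    """Lin's concordance correlation coefficient for paired values.

    Uses population (1/n) moments, as in Lin's original definition:
    ccc = 2 * s_xy / (s_x^2 + s_y^2 + (mean_x - mean_y)^2).
    Assumes at least one of x, y is not constant (non-zero denominator).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return 2 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)


# Hypothetical 0/1 encodings of two risk classifiers on four patients.
perfect = lins_ccc([0, 1, 1, 0], [0, 1, 1, 0])   # identical ratings
opposite = lins_ccc([0, 1], [1, 0])              # fully reversed ratings
unrelated = lins_ccc([0, 0, 1, 1], [0, 1, 0, 1]) # zero covariance
```

With this definition, identical ratings give 1, fully reversed ratings give −1, and ratings with zero covariance give 0, which matches the interpretation of the coefficient given in the text.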

Table 2 Univariate and multivariable analysis for DFS on IbRiS-derived risk category, clinicopathological variables, chemotherapy treatment, and ODx risk category on D1, D2, and D3.

Experiment 2: IbRiS-derived risk category versus ODx risk category

We sought to demonstrate the prognostic ability of IbRiS-derived risk scores within each individual ODx category. ODx scores were available for n = 116 patients in D1 and n = 28 patients in D2. As shown in the KM curves in Fig. 4, patients in the IbRiSH group experienced a higher relapse probability than those classified as IbRiSL in the high ODx categories for both D1 and D2. Specifically, in the high ODx risk category (D1+2), among the 10 patients predicted as IbRiSL, 9 patients had favorable outcomes (non-DFS event with a median follow-up of ~7 years) while among the 7 patients identified as high risk by IbRiS, 5 of them suffered recurrence/death.

Fig. 4: Prognostic performance of IbRiS within individual ODx risk category in D1-D2.

KM curve estimates for DFS for IbRiSH (red) versus IbRiSL (blue) in the low, intermediate, and high ODx risk categories, respectively across D1+2 (a, d, g), D1 (b, e, h) and D2 (c, f, i) with the differences between the risk categories assessed by two-sided log-rank test. IbRiS was significantly prognostic within high ODx risk category for both D1 and D2.

We additionally generated KM plots for DFS for the low versus intermediate versus high ODx risk categories to demonstrate the prognostic performance of ODx risk category on D1 and D2, as shown in Supplementary Fig. 5.

Experiment 3: IbRiS-derived risk category versus histologic grade

We sought to demonstrate the prognostic ability of IbRiS-derived risk categories in subgroups stratified by pathologist-assigned histologic grades. As shown in the KM curves in Fig. 5, for the high-grade groups, patients predicted as IbRiSH had significantly worse prognosis than those predicted as IbRiSL for all the three cohorts. Specifically, for the pathologist-assigned high-grade group (D1+2+3), 50% of patients identified as IbRiSH suffered from DFS events, while among the patients classified as IbRiSL only 14% recurred/died.

Fig. 5: Prognostic performance of IbRiS within individual histology grade in D1-D3.

KM curve estimates for DFS for IbRiSH (red) versus IbRiSL (blue) in the low, intermediate, and high histologic grades, respectively across D1+2+3 (a, d, h), D1 (b, e, i), D2 (c, f, j) and D3 (g, k) with the differences between the risk categories assessed by two-sided log-rank test. IbRiS was significantly prognostic within high histologic grade groups for all three cohorts.

We additionally generated KM plots for DFS for the low versus intermediate versus high histologic grade groups to demonstrate the prognostic performance of histologic grade on D1, D2 and D3 as shown in Supplementary Fig. 5. The survival analysis of clinical risks (simultaneously considering tumor grade and tumor size)34 in terms of DFS was also performed on the combination of three cohorts (D1+2+3) as shown in Supplementary Fig. 6.

Discussion

Oncotype Dx (ODx)8,9 is a widely applied multi-gene-based assay in clinical practice that has been clinically validated to be prognostic and predictive of treatment benefit of adjuvant chemotherapy. However, ODx is expensive and tissue destructive. More importantly, consistent disagreement in risk classification between the ODx test and other molecular assays has been identified, with ODx incorrectly identifying a number of patients who are likely to have a low risk of recurrence as high risk. In one comparison study12 between ODx and Prosigna (another FDA approved Prognostic Gene Signature), Prosigna was found to be a better indicator of ROR than the ODx test. In addition, ODx was found to be significantly less accurate in African American versus Caucasian breast cancer patients, suggesting ODx was not well calibrated for racial/ethnic minority populations11.

In this study, we presented a digital-pathology-based classifier to risk stratify ER+ and LN− breast cancer patients by comprehensively measuring characteristics related to the nuclear histomorphology, tubule formation, and mitotic activity from H&E-stained slide images. Additionally, we investigated if the image risk model (IbRiS) was able to provide further granular prognostic value over the ODx test. While a few studies have shown the association between QH features extracted from digitized H&E-stained slides and the ODx risk categories30,31,35,36, these studies either solely focused on one component of the three feature categories (i.e., nuclear morphology, mitotic rates, and tubule formation) or did not explore the added prognostic value the image-based signatures could offer over the ODx test. For example, Whitney et al.35,36 assessed the ability of computerized nuclear shape and architecture features to predict ODx risk categories for breast cancer patients. Romo-Bucheli et al.30 developed a deep learning based mitosis detection classifier on WSIs and further evaluated the correlation of mitosis count with ODx risk categories for breast cancer patients.

In our study, from the survival analysis of IbRiS in the subgroups of ODx risk categories, we found that IbRiS was able to add significant prognostic value to the ODx risk category (Fig. 4). For the patients distributed in the high ODx category, IbRiS was able to identify patients with true low ROR. These results suggest that among the patients identified as high risk by ODx test in clinical practice, some of them, however, are in fact true low risk and could be effectively identified by IbRiS, thus safely avoiding aggressive adjuvant chemotherapy.

IbRiS was validated as prognostic on two independent validation cohorts, independent of clinicopathological variables. In addition, we performed survival analysis using histologic grade on D1, D2 and D3 and ODx risk category on D1 and D2 (Supplementary Fig. 5). Notably, the rate of chemotherapy administration varied among the subgroups in D1, with 26.6% for IbRiSL versus 35% for IbRiSH, 1.15% for the low versus 23.9% for the intermediate versus 70.6% for the high histologic grade, and 6.67% for low versus 42.9% for intermediate versus 83% for high ODx risk category. This heterogeneity in treatment among the subgroups (in terms of IbRiS-derived risk groups, histologic grades, or ODx risk categories) had a differing impact on the corresponding patient outcomes. Under homogeneous therapy, the additional survival benefit that the high-risk group derives from its higher chemotherapy administration rate would be removed, so the separation between risk groups in D1 could potentially be larger.

The Nottingham Grading Scheme (NGS) is one of the most commonly utilized traditional prognostic factors for IBC by pathologists in routine clinical practice13,14. The NGS includes the measurements of nuclear pleomorphism, mitotic count, and tubular differentiation to assess tumor aggressiveness and stratify breast cancer patients by ROR. While not significantly prognostic, histologic grades still showed a certain level of prognostic value in our study, as evidenced by the marginal significance in univariate analysis in D1 and D2 (D1: p = 0.0609 for high versus intermediate and low grade; D2: p = 0.0599 for high and intermediate versus low grade) and survival analysis for D1+2+3 in Supplementary Fig. 5. A possible reason for the non-significant prognostic value of histologic grade could be the relatively small number of patients and DFS events in the cohorts considered. However, the poor to moderate inter-reader agreement in breast cancer grading has remained a critical challenge in pathology practice16,17,18,19,20,21,22. For example, in an ECOG study of inter-observer reproducibility of NGS in stage II breast cancer, two committee pathologists only concurred on the histologic grade for 54% of patients, marginally higher than the agreement rate expected by chance16. Taking this into account, it is imperative to develop an objective and accurate prognostic model as a complementary tool to NGS in clinical practice to improve the assessment of ROR for breast cancer patients. For example, Wang et al.37 built DeepGrade, a CNN-based model for further risk stratification of breast tumors of intermediate histologic grade. DeepGrade was trained for binary classification of low versus high histologic grade with H&E WSIs of breast tumors. DeepGrade was then applied to re-stratify the tumors with intermediate histologic grade into high- (similar to high grade) or low- (similar to low grade) risk groups, with the predicted high-risk group showing a significantly elevated ROR compared with the low-risk group. Similarly, Jaroensri et al.38 constructed three DL models to predict pathologist-based reference standards for the three NGS components, respectively. They also found that the AI-NGS combining the output of the three DL models delivered non-inferior performance for breast cancer prognosis compared with pathologists' grading. Our study differs from these two studies in two critical ways. First, the abstract image representations captured in the two studies for model predictions lack biologic interpretability as compared with the biologically explainable QH features employed in our study. Second, DeepGrade and AI-NGS were both trained with histologic grade as ground truth, and the models’ prognostic significance was then demonstrated. In contrast, in our study, IbRiS was constructed by directly associating the image features with survival outcome. Additionally, the prognostic relevance of the IbRiS classifier was investigated in the context of all three histologic grade groups. As shown in Fig. 5, IbRiS significantly stratified the high- and low-risk patients within the high histologic grade groups. These results suggest that, with the added prognostic value of IbRiS to histologic grade, patients at true low risk could be distinguished within the high-grade group and could potentially safely omit adjuvant chemotherapy.

The potential clinical impact of IbRiS lies in complementing the ODx test and histologic grade in clinical practice for a more accurate assessment of ROR for breast cancer patients. In some ways, IbRiS more closely mirrors a multi-gene expression-based test like ODx in that it produces a recurrence score based on the weighted combination of the expression of individual and interpretable image features. At least part of the reason for the widespread clinical adoption of ODx has been the inherent interpretability of the test39, i.e., being able to connect the recurrence score to the individual genes. Therefore, it stands to reason that IbRiS might be more amenable to clinical adoption than black-box-based deep learning models like DeepGrade. In addition, IbRiS only requires a digital slide image of the biopsy or surgically excised tissue specimen and computing resources to provide the risk score. With the prevalence of WSI scanners, the IbRiS model holds vast potential to serve as an inexpensive and faster alternative prognostic tool in clinical settings, especially in low resource settings where molecular assays like ODx may not be available. Furthermore, IbRiS provides an opportunity to efficiently analyze tumor heterogeneity by processing multiple tissue slides from one tumor and identifying the most relevant features from across the slides for predicting cancer outcomes.

Limitations of our study include the fact that our model was retrospectively validated based on prognostic significance, unlike the ODx test, which was prospectively validated for both prognostic significance and treatment benefit prediction. Additionally, our study focused solely on LN− and ER+ IBC patients and had a relatively small sample size. Future work will entail validating the digital pathology-based pipeline in additional independent cohorts spanning multiple stages and molecular subtypes, and also in terms of its ability to predict benefit from adjuvant chemotherapy.

In summary, this study was the first to quantitatively measure the joint QH features of nuclear morphology, mitotic rates, and tubule formation on H&E WSIs and demonstrate their prognostic significance in terms of DFS for ER+ and LN− IBC. In addition, the QH features-based model provided more granular risk stratification within the ODx-defined risk categories. The prognostic capability of these identified features could also potentially be applicable in IBCs with positive lymph nodes as well as other molecular subtypes.

Methods

Dataset description

Our study comprised three independent cohorts (D1, D2, and D3) of patients with ER+ and LN− IBC. H&E-stained slides of surgically resected tumor specimen (no neoadjuvant treatment was administered) from D1, D2, and D3 were respectively digitized using a Roche Ventana iScan HT Scanner, a Philips Scanner, and a Ventana DP 200 Scanner (Hemel Hempstead, UK) at ×40 magnification (0.25 micron per pixel). In our experiments, D1 served as a training set for feature selection and model construction. D2 and D3 served as independent validation sets to evaluate the performance of the final locked-down prognostic model.

The flowchart of the inclusion and exclusion criteria for patient selection on D1, D2 and D3 is shown in Supplementary Fig. 1. A summary of clinicopathological variables of the three cohorts of ER+ and LN− breast cancers is shown in Table 1.

Training cohort D1: breast cancer patients treated between 1996 and 2018 at University Hospitals (UH) having a corresponding ODx score available were identified and retrieved by the pathologists from the hospital archive. The slides were subsequently digitized and transferred. H&E-stained tumor WSI along with clinicopathological/outcome information were available for 519 patients. Patients without any event (recurrence/metastasis/death) were only recruited in this study when at least a 5-year follow-up was available. This process resulted in n = 116 ER+ and LN− breast cancer patients (n = 22 events) in D1. This study was approved by the Institutional Review Board (IRB) at University Hospitals (IRB No 02-13-42C).

Validation cohort D2: The Eastern Cooperative Oncology Group (ECOG) 2197 trial40 is a prospective, randomized, phase III clinical trial that recruited n = 2778 patients with IBC (1 to 3 positive LN/LN− with tumor size ≥1 cm) from 1998 to 2007 to compare patient outcomes under two different adjuvant chemotherapy regimens (i.e., doxorubicin plus docetaxel versus doxorubicin plus cyclophosphamide; a previous study40 identified no significant difference in patient outcomes between the two treatment regimens). ECOG 2197 is deemed an ideal validation set due to the homogeneity in treatment (all the patients received adjuvant chemotherapy), which reduced the effect of treatments on patient outcomes. The access to the ECOG dataset involved a 2-year long process including a proposal review first through ECOG and subsequently through the Cancer Therapy Evaluation Program (CTEP) at the National Cancer Institute (NCI). From this superset, D2 comprises the subset of n = 121 ER+ and LN− breast cancer patients (n = 23 events), whose corresponding WSIs and clinical information were both accessible. ECOG provided us with the de-identified clinical data from the archived clinical trial along with the de-identified images. This study was approved by the IRB at University Hospitals (IRB No 02-13-42C).

Validation cohort D3: D3 comprises n = 84 ER+ and LN− Indian patients treated in 2009 at Tata Memorial Center (TMC), with follow-up until 2020 (n = 21 events), who were identified and retrieved by the pathologist from the hospital archive. The H&E-stained tumor slides for individual patients were digitized at and subsequently transferred from TMC. The study was approved by the Institutional Ethics Committee, TMC (IEC no. 1712).

The study conformed to Health Insurance Portability and Accountability Act (HIPAA) guidelines and was approved by the IRB at University Hospitals (IRB No 02-13-42C). The need for written consent from participants was waived due to the use of de-identified retrospective data.

The tumor region in the WSI was manually annotated or checked by a pathologist, with artifacts intentionally avoided (i.e., tissue folding, pen mark, staining artifacts, and bubbles). The slide with the largest representative tumor was selected for the subsequent analysis for the patients who have multiple slides available.

Feature extraction

A set of 343 QH features were extracted to describe nuclear morphology, mitotic rates, and tubule formation based on the masks of nuclei, mitosis, and tubules, respectively, generated by three different deep learning models. Additional details regarding deep learning models, algorithms for feature calculation, and descriptions of features extracted are available in the Supplementary Methods. All tiles (2000 × 2000 pixels at ×40 magnification) containing tumor region as annotated by a pathologist were exhaustively extracted from the WSIs. In each individual tile, nuclear morphology, mitotic activity, and tubule formation-related characteristics were computed. The patient-level features were then calculated by aggregating (i.e., mean, median, max, sum, standard deviation, skewness, kurtosis, histogram entropy, and approximate entropy) these features across all the tiles.
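The tile-to-patient aggregation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `aggregate_tile_feature`, the use of population (1/n) moments, and the choice of 10 equal-width histogram bins are all our assumptions, and approximate entropy is omitted for brevity.

```python
import math
import statistics


def aggregate_tile_feature(values, n_bins=10):
    """Summarize one per-tile feature into patient-level statistics.

    `values` holds the feature measured on each tumor tile of one slide.
    Bin count and population moments are illustrative assumptions.
    """
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd > 0:
        # Standardized third and fourth central moments.
        skewness = sum((v - mean) ** 3 for v in values) / (n * sd ** 3)
        kurtosis = sum((v - mean) ** 4 for v in values) / (n * sd ** 4)
    else:
        skewness = kurtosis = 0.0
    # Shannon entropy (in bits) of an equal-width histogram of the values.
    lo, hi = min(values), max(values)
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / (hi - lo) * n_bins), n_bins - 1) if hi > lo else 0
        counts[idx] += 1
    entropy = -sum((c / n) * math.log2(c / n) for c in counts if c)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "max": hi,
        "sum": sum(values),
        "std": sd,
        "skewness": skewness,
        "kurtosis": kurtosis,
        "hist_entropy": entropy,
    }


# Hypothetical mitotic counts on four tiles of one slide.
summary = aggregate_tile_feature([1, 2, 3, 4])
```

Applying the same function to every tile-level measurement yields one fixed-length feature vector per patient, regardless of how many tiles the tumor region contains.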

Nuclear histomorphometric feature extraction: we employed a Pixel2Pixel GAN for nuclei segmentation. Following nucleus segmentation, we extracted 242 nuclear features to quantify the nuclear histomorphology of each WSI, including global graph41, shape, cell cluster graph (CCG)42, cell orientation entropy (CORE)43, and Haralick texture feature families44. Global graph and CCG feature families, respectively, describe the global and local spatial distribution of nuclei; shape features capture nuclear boundary properties such as smoothness and elongation; the CORE feature family quantitatively measures the degree of disorder of nuclear orientations; Haralick texture features characterize chromatin patterns within nuclei.

Mitosis feature extraction: a CNN was trained to detect mitotic events on H&E-stained WSIs. In addition, an epithelium segmentation model was trained to identify epithelial nuclei for subsequent mitosis ratio calculation. Forty-five features were extracted from each WSI based on detected mitoses to describe the mitosis prevalence status. More specifically, these features included: (1) multiple statistical measurements of the mitotic count; (2) ratio of mitotic count to epithelial nuclei count, ratio of mitotic count to blue-ratio nuclei count, and ratio of mitotic count to nuclei count, over all of the extracted tumor tiles across the WSI; (3) the proportion of tiles presenting a specific mitotic density within the WSI; and (4) quantitative proliferation score calculated by simulating the mitosis prevalence assessment in clinical practice.

Tubule feature extraction: tubule formation represents the portion of tumor cells forming tubular glands19. We trained a U-Net to automatically segment tubules in breast cancer histopathological images. A total of 56 tubule features were extracted to measure tubule formation based on the segmented tubule masks. Those features comprised various statistical summaries of tubule ratio metrics on all the tiles across the WSI of each patient (i.e., the ratio of tubule nuclei count to the non-tubule nuclei count, the ratio of tubule nuclei count to the epithelium nuclei count, and the ratio of tubule nuclei count to the nuclei count) as well as the number of tiles falling between different tubule ratio intervals.
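The tubule ratio metrics and the interval counts described above can be sketched per slide as follows. This is a hedged illustration: the function name `tubule_ratio_features`, the input format (one pair of nuclei counts per tile), and the four equal-width intervals on the tubule-to-all-nuclei ratio are our assumptions, since the text does not specify the interval boundaries.

```python
def tubule_ratio_features(tiles, n_intervals=4):
    """Compute per-tile tubule ratios and count tiles per ratio interval.

    `tiles` is a list of (tubule_nuclei, total_nuclei) counts, one per tile.
    The interval layout is an illustrative assumption.
    """
    ratio_to_non_tubule = []
    ratio_to_all = []
    interval_counts = [0] * n_intervals
    for tubule, total in tiles:
        if total == 0:
            continue  # a tile without detected nuclei contributes nothing
        fraction = tubule / total
        ratio_to_all.append(fraction)
        non_tubule = total - tubule
        if non_tubule > 0:
            # Undefined when every nucleus lies in a tubule, so it is skipped.
            ratio_to_non_tubule.append(tubule / non_tubule)
        # Equal-width intervals on [0, 1]; a ratio of 1.0 falls in the last one.
        idx = min(int(fraction * n_intervals), n_intervals - 1)
        interval_counts[idx] += 1
    return {
        "ratio_to_non_tubule": ratio_to_non_tubule,
        "ratio_to_all": ratio_to_all,
        "interval_counts": interval_counts,
    }


# Hypothetical nuclei counts on five tiles; the last tile has no nuclei.
features = tubule_ratio_features([(10, 100), (30, 100), (60, 100), (100, 100), (0, 0)])
```

The per-tile ratio lists would then be summarized with patient-level statistics, while the interval counts are already patient-level values.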

Feature selection and classifier construction

In total, 343 features were finally extracted (242 nuclear pleomorphism features, 45 mitotic count features, and 56 tubule formation features). A Cox proportional hazards regression model (henceforth referred to as Cox regression model)32, regularized by Least Absolute Shrinkage and Selection Operator (LASSO)45, was constructed to identify important predictors of DFS. First, to keep the balance among the three feature categories, we implemented a Cox regression model to identify the top four prognostic features associated with DFS separately on each of the three categories on training set D1. The total number of top features (n = 12) for inclusion within the model was determined as ~10% of the patient number in the training set. Following feature identification, a final LASSO regularized Cox regression model was used to compute the coefficients for each of the features; 11 features were assigned non-zero coefficients as part of inclusion within IbRiS while one feature had a zero-coefficient value.
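For reference, a LASSO-regularized Cox model of the kind described above is commonly written as the following penalized partial-likelihood problem. The notation is ours and not taken from the paper: x_i is the vector of the 12 selected features for patient i, δ_i is the DFS event indicator, R(t_i) is the set of patients still at risk at time t_i, and λ is the penalty weight. Software packages differ in how they scale the likelihood term, so this should be read as one standard form rather than the exact objective used.

```latex
\hat{\beta}
  = \arg\max_{\beta}
    \left[
      \sum_{i:\,\delta_i = 1}
        \left(
          x_i^{\top}\beta
          - \log \sum_{j \in R(t_i)} \exp\!\left(x_j^{\top}\beta\right)
        \right)
      - \lambda \sum_{k=1}^{12} \lvert \beta_k \rvert
    \right],
\qquad
\text{risk score}_i = x_i^{\top}\hat{\beta}
```

The L1 penalty can shrink individual coefficients exactly to zero, which is consistent with one of the 12 features receiving a zero coefficient.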

An optimal risk score threshold (denoted as “θopt” hereafter) was identified from the training set D1 (see Supplementary Methods for details) to dichotomize the continuous risk scores into binary high/low-risk categories.
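Applying a locked-down model of this form amounts to a weighted sum followed by a threshold. The sketch below is illustrative only: the feature names, coefficient values, and threshold are hypothetical placeholders, not the values reported in the Supplementary material. It follows the convention that a score strictly above θopt is high risk and a score at or below it is low risk.

```python
def risk_score(features, coefficients):
    """Continuous risk score: weighted sum of feature values."""
    return sum(coefficients[name] * features[name] for name in coefficients)


def risk_group(score, theta_opt):
    """Dichotomize: 'high' if score > theta_opt, otherwise 'low'."""
    return "high" if score > theta_opt else "low"


# Hypothetical coefficients (as if estimated on the training set) and threshold.
coefficients = {
    "mitotic_count_mean": 0.8,
    "tubule_ratio_median": -0.5,
    "nuclear_cluster_count": 0.0,  # a feature shrunk to zero by LASSO
}
theta_opt = 0.0

# Hypothetical feature values for one validation patient.
patient = {
    "mitotic_count_mean": 1.0,
    "tubule_ratio_median": 2.0,
    "nuclear_cluster_count": 3.0,
}
score = risk_score(patient, coefficients)
group = risk_group(score, theta_opt)
```

Because both the coefficients and the threshold are fixed from the training cohort, scoring a validation patient requires no access to outcome data.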

Statistical analysis

IbRiS was validated on two independent testing sets, D2 and D3. Specifically, we calculated a continuous risk score for each patient on D2 and D3 using the feature coefficients estimated from D1. We then classified the patients into a binary high (risk score >θopt) versus low (risk score ≤θopt) risk category of recurrence by applying θopt identified from D1. DFS was defined as the time from diagnosis/random treatment assignment until first recurrence (loco-regional or distant metastasis) or death, whichever occurred earlier. Patients were censored when they did not have an event at the termination of the study or were lost to follow-up at any time during the study. Kaplan–Meier (KM) survival analysis with DFS as the endpoint was performed between the IbRiS-derived high- versus low-risk categories. The rate of DFS was estimated using the KM method, and the difference of DFS was assessed using log-rank test46 between the high- and low-risk categories predicted by IbRiS on D1, D2, and D3. We also performed subgroup survival analysis respectively for high, intermediate, and low ODx risk categories (traditional recurrence score categorization was applied: low: <18, intermediate: 18–30, high: >30)9 as well as high, intermediate, and low histologic grades assigned by pathologists.
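The Kaplan–Meier estimate and the two-group log-rank comparison described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the software used in the study: each patient is represented by a DFS time and an event flag (1 for recurrence or death, 0 for censored), the function names are hypothetical, and the p-value is the two-sided value from a chi-square distribution with one degree of freedom.

```python
import math


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1.0 - n_events / n_at_risk
            curve.append((t, survival))
        n_at_risk -= n_leaving  # events and censored patients both leave
    return curve


def logrank_p(times_a, events_a, times_b, events_b):
    """Two-sided log-rank p-value for two groups (assumes >= 1 event)."""
    pooled = list(zip(times_a, events_a)) + list(zip(times_b, events_b))
    event_times = sorted({t for t, e in pooled if e})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)
        n_b = sum(1 for x in times_b if x >= t)
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = obs_minus_exp ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical follow-up data (months) for illustration.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
p_same = logrank_p([1, 2, 3], [1, 1, 1], [1, 2, 3], [1, 1, 1])
p_diff = logrank_p([1, 2, 3, 4], [1, 1, 1, 1], [5, 6, 7, 8], [1, 1, 1, 1])
```

In the first example the censored patient at month 2 leaves the risk set without lowering the curve, which is how censoring enters the estimate.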

We conducted a univariate Cox proportional hazards analysis to evaluate if any of the routinely examined clinical parameters, treatments, and ODx risk categories were prognostic of DFS on D1, D2, and D3. The clinical parameters include age (≤50 years versus >50 years), race (white versus other), tumor size (<20 mm versus ≥20 mm), Progesterone Receptor (PR) status (negative versus positive), HER2 status (negative versus positive), and histologic grade (Grade 1 versus Grade 2 versus Grade 3). Multivariable Cox analysis47 was also performed to assess the independent prognostic significance of IbRiS after accounting for the other clinicopathological variables on D1, D2, and D3.
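To make the univariate analysis concrete, the following is a minimal sketch of fitting a Cox model with a single covariate and reporting a hazard ratio with a 95% Wald confidence interval. It is not the software used in the study: it assumes Newton–Raphson optimization of the partial likelihood with Breslow handling of ties, a covariate that varies within the risk sets, and data for which a finite estimate exists. The function name `univariate_cox` and the example data are hypothetical.

```python
import math


def univariate_cox(times, events, x, n_iter=50, tol=1e-10):
    """Fit h(t | x) = h0(t) * exp(beta * x) for one covariate.

    Returns (hazard_ratio, ci_low, ci_high) using a 95% Wald interval.
    """
    beta = 0.0
    info = 0.0
    for _ in range(n_iter):
        score, info = 0.0, 0.0
        for t_i, e_i, x_i in zip(times, events, x):
            if not e_i:
                continue  # censored patients only appear in risk sets
            # Weighted moments of x over patients still at risk at t_i.
            s0 = s1 = s2 = 0.0
            for t_j, x_j in zip(times, x):
                if t_j >= t_i:
                    w = math.exp(beta * x_j)
                    s0 += w
                    s1 += w * x_j
                    s2 += w * x_j * x_j
            mean_x = s1 / s0
            score += x_i - mean_x                # gradient of log partial likelihood
            info += s2 / s0 - mean_x ** 2        # observed information
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    # `info` was evaluated just before the final (negligible) step.
    se = 1.0 / math.sqrt(info)
    return math.exp(beta), math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se)


# Hypothetical data: x = 1 marks a high-risk group whose events occur earlier.
times = [1, 3, 5, 2, 4, 6]
events = [1, 1, 1, 1, 1, 1]
group = [1, 1, 1, 0, 0, 0]
hr, ci_low, ci_high = univariate_cox(times, events, group)
```

A hazard ratio above 1 for a binary covariate indicates a higher event rate in the group coded 1; the interval is symmetric on the log scale, as are the intervals reported in Table 2.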

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.