AI-enabled routine H&E image based prognostic marker for early-stage luminal breast cancer

Breast cancer (BC) grade is a well-established subjective prognostic indicator of tumour aggressiveness. Tumour heterogeneity and subjective assessment result in a high degree of variability among observers in BC grading. Here we propose an objective Haematoxylin & Eosin (H&E) image-based prognostic marker for early-stage luminal/HER2-negative BReAst CancEr that we term the BRACE marker. The proposed BRACE marker is derived from AI-based assessment of heterogeneity in BC at a detailed level using deep learning. The prognostic ability of the marker is validated in two well-annotated cohorts (Cohort-A/Nottingham: n = 2122 and Cohort-B/Coventry: n = 311) of early-stage luminal/HER2-negative BC patients treated with endocrine therapy and with long-term follow-up. The BRACE marker is able to stratify patients for both distant metastasis free survival (p = 0.001, C-index: 0.73) and BC specific survival (p < 0.0001, C-index: 0.84), showing prediction accuracy comparable to the Nottingham Prognostic Index and Magee scores, which are both derived from manual histopathological assessment, and can identify luminal BC patients who may be likely to benefit from adjuvant chemotherapy.


INTRODUCTION
Breast cancer (BC) is the most common cancer in women, with an estimated 2.3 million cases and 0.7 million deaths reported worldwide in 2020 1. BC is a heterogeneous disease with different molecular subtypes and variable morphological presentation, behaviour and response to therapy 2,3. With the introduction of endocrine therapy, the prognosis of early-stage oestrogen receptor positive (ER+) and human epidermal growth factor receptor 2 negative (HER2−) BC, which comprises approximately 40% of BC 4, has improved 5, but in about 20% of cases the disease can still recur post-treatment 6. Because some patients in this early-stage luminal BC group will benefit from adjuvant chemotherapy while others will only require endocrine therapy, it is important to risk-stratify these patients for better treatment management 7,8. Stratification of patients into risk groups based on their survival outcome is key for personalised treatment and therapeutic interventions and, therefore, identification of clinicopathological factors and biomarkers is an important area of clinical research 9,10.
Despite several advancements in BC diagnosis and management, the existing risk stratification tools are subjective and unable to cope with the highly heterogeneous morphology of BC histology. Current BC management relies on the availability of robust clinical and pathological prognostic factors to support clinical management decision making. The Nottingham grading system (NGS) 11, which comprises the assessment of three morphological features (tubule formation, mitotic count and nuclear pleomorphism), is a well-established prognostic marker in BC that is recommended by the World Health Organisation and other national and international organisations 12,13 as the gold standard BC grading system. NGS is a simple and cost-effective prognostic tool that was recently incorporated into the tumour-node-metastasis (TNM) stage system (prognostic stage) 14. However, NGS still relies on subjective assessment of histology samples, a limitation that needs to be resolved for reproducible, robust and reliable patient stratification.
Although NGS is an established prognostic marker, its performance in predicting the outcome of the clinically indeterminate group of early-stage luminal BC is still suboptimal. Reproducibility concerns have been raised due to inter-observer disagreement regarding grade components and the complexity of intra-tumour heterogeneity 15,16. Therefore, molecular tests including multigene assays such as Oncotype DX (ODX) 17 and PAM50 18 are increasingly used to risk stratify this group of BC. However, the relatively high cost and turnaround time and the relatively low concordance between assays make the development of objective, reproducible and reliable alternative methods, such as AI-based prognostic tools, highly warranted.
Deep learning (DL) based analysis of haematoxylin & eosin (H&E) stained scanned histology slides has produced remarkable results for objective assessment of morphological features [19][20][21][22][23][24][25][26]. Several studies [27][28][29][30][31] have adapted DL for survival analysis, including an ensemble of deep convolutional neural network (CNN) models for risk stratification of BC 32. Recently, other studies 33,34 have focused on the importance of developing image-based tools to risk stratify the clinically indeterminate risk class of BC. The correlation of ODX and mitotic count has been previously demonstrated in different studies 35,36, whereas some studies have reported a combination of image-based features with clinicopathological variables 37,38. Although some studies 36,39 have shown that different structures, tissue types and cell types have prognostic importance, a comprehensive phenotyping of such structures and their relationship with survival outcome is less studied, especially for luminal early-stage BC.
In this study, we show that DL-based automated phenotyping of BC can provide a cost-effective and reproducible prognostic tool. Using whole slide images (WSIs) of a large well-characterised cohort of luminal early-stage BC, we develop an AI-based BRACE marker to capture tumour and stromal heterogeneity and mitotic activity in a quantitative and reproducible manner (Fig. 1). In addition to objectively quantifying the stromal and pleomorphic heterogeneity and digital mitotic score, the proposed marker utilises the spatial composition of digital local tumour grading (Tumour Grade Composition (TGC)) as opposed to a single case-level grade. The method is validated for prognostication of early-stage luminal breast cancer on an external cohort. Development of such AI-based markers will provide new research alternatives leading to integrated solutions along with gene expression profiling.

RESULTS
To estimate the BRACE marker, the ductal carcinoma in situ (DCIS) regions were first filtered out. Second, tumour-rich areas were identified from tumour cell detections. This step identified regions of interest (ROI) for digital mitotic counting as well as for predicting local tumour grades and nuclear pleomorphism in the tumour-rich areas. Third, tissue regions were segmented to quantify stromal cell density and tumour area percentage. Finally, the tumour grade composition, local variations in nuclear pleomorphism and stromal cell density, digital mitotic count and tumour area percentage were used for survival prediction (Fig. 1). Note that the tissue area was selected by thresholding, but this step is omitted from the figure for simplicity.

AI-based WSI phenotyping
Two well-characterised retrospective cohorts (Supplementary Table 1) treated with endocrine therapy were used for model construction and for analysis of two endpoints, i.e. distant metastasis free survival (DMFS) and BC specific survival (BCSS), with data splits detailed in Supplementary Fig. 1. A BC WSI contains a rich heterogeneous phenotype including tumour morphology, stromal variations, mitotic activity, tumour infiltrating lymphocytes (TILs) etc. The power of AI was utilised to explore a range of features (Supplementary Table 2) related to six main categories: (a) tumour morphology (grade, pleomorphism), (b) tumour-stroma relationship, (c) TILs quantification, (d) heterogeneity in terms of tumour, stroma and TILs, (e) mitotic cell counting, and (f) counts/ratios of different nuclei. Following feature selection (see 'Feature selection' for details), the final features included: digital local grade composition in the form of local grade 1 percentage (LG1 %), LG2 %, LG3 %, tumour area %, pleomorphic contrast, stromal contrast, co-occurrences of stromal nuclei patches with low density, and digital mitotic score.
Figure 2 shows examples of stromal contrast and nuclear pleomorphic contrast for two WSIs (Fig. 2c

DCIS filter and Region Segmentor
Tissue area was segmented using thresholding and morphological operations. DCIS filter (a trained CNN model) distinguished between invasive tumour regions and DCIS and achieved average F1 scores of 0.713 and 0.9 for invasive tumour and DCIS segmentation, respectively (Supplementary Table 3). Supplementary Fig. 2a shows an example of WSI-level tumour and DCIS segmentation output. Similarly, Region Segmentor (another trained CNN model) performed semantic segmentation of stromal and other non-ROI regions (fat, normal tissue, blood vessels, artifacts) and achieved a Dice score of 0.76 for stroma and 0.69 for other regions (Supplementary Table 3). The trained models were used to generate segmentation masks for the WSIs (Supplementary Fig. 2b).

Tumour-rich area identification (Tumour Detector)
Tumour Detector (Supplementary Fig. 3; a cell segmentation and classification model) was used to generate WSI-level nuclei contours (used to measure tumour-nuclei morphology) and nuclei types (used to obtain tumour-nuclei density). The output of this module was then used for ROI selection for training the TGC module (Local Grade Predictor) and for counting mitotic figures. Supplementary

AI-based prediction of grade composition (Local Grade Predictor)
Despite the presence of intra-tumour heterogeneity 2,40,41, a single grade is often assigned to the entire BC case. Local Grade Predictor performed AI-based grade prediction of individual patches within the WSI to capture variations due to intra-tumour heterogeneity. Tumour Detector was utilised to select an ROI based on tumour nuclei density, size and shape, and patches from the selected ROIs, rather than patches from all over the WSI, were used to train Local Grade Predictor. Based on a linear support vector machine (SVM) trained with the proportions of digital local grades to predict the clinical grade of each WSI, the selection of ROI based on tumour nuclei density, size and shape improved the receiver operating characteristic area under the curve (ROC-AUC) from 0.65 ± 0.023 (for baseline random ROIs) to 0.83 ± 0.014 (Supplementary Table 3). Another strategy, where ROIs were selected from areas with maximum tumour tissue, produced a ROC-AUC of 0.79 ± 0.011. TGC considered local intra-tumour heterogeneity based on the proportion of areas in a given WSI that can be associated with grades 1-3. Supplementary Fig. 4 shows the association of the clinical grade and its components (mitotic score, tubule formation, and nuclear pleomorphism) with the TGC for both internal and external validation cohorts. It can also be observed that the Local Grade Predictor model learnt not only the overall grade but also the heterogeneity of clinical grade. LG1 predictions were present for all three grades, showing that at least some areas in most of the WSIs are similar to Grade 1.
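The TGC summary described above can be sketched in a few lines. This is a minimal illustration, assuming that every patch covers the same area (so that the fraction of patches equals the fraction of area) and that patch-level predictions are available as integers 1-3; the function name and the example predictions are hypothetical and not taken from the study.

```python
from collections import Counter


def tumour_grade_composition(patch_grades):
    """Return the percentage of patches predicted as local grade 1, 2 and 3.

    patch_grades: list of patch-level grade predictions (integers 1, 2 or 3).
    Assumes equally sized patches, so patch share stands in for area share.
    """
    if not patch_grades:
        raise ValueError("at least one patch-level grade is required")
    if any(g not in (1, 2, 3) for g in patch_grades):
        raise ValueError("grades must be 1, 2 or 3")
    counts = Counter(patch_grades)
    total = len(patch_grades)
    return {f"LG{g}%": 100.0 * counts.get(g, 0) / total for g in (1, 2, 3)}


# Hypothetical patch-level predictions for one WSI.
example = [1, 1, 2, 3, 3, 3, 3, 2, 1, 3]
print(tumour_grade_composition(example))
```

Keeping all three percentages, rather than only the majority grade, is what lets a downstream model see that a predominantly grade 3 slide still contains grade 1 areas.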

Digital mitotic count
A stain-robust mitotic detection model 42 was used to detect mitotic figures. Mitotic figures were detected in an ROI from each WSI, selected using the same criteria as for training the Local Grade Predictor. The mitotic detection model first segmented potential mitotic figures and then refined the classification via a deep learning classifier. To assign a mitotic score to a WSI, an ROI from the WSI was passed through the model and the resulting mitotic count was used to decide the score. To reduce the effect of any under- or over-detections, the counts were discretised to obtain a digital mitotic score from 1 to 3.
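The discretisation step can be illustrated as follows. The cut points used in the study are not stated in this section, so the sketch takes them as arguments; the values 10 and 20 in the usage lines are placeholders chosen only for illustration, and the function name is hypothetical.

```python
def digital_mitotic_score(mitotic_count, low_cut, high_cut):
    """Map a raw mitotic count in an ROI to a score of 1, 2 or 3.

    low_cut and high_cut are caller-supplied thresholds (assumed, not
    taken from the paper): counts <= low_cut score 1, counts <= high_cut
    score 2, and anything above scores 3.
    """
    if mitotic_count < 0:
        raise ValueError("mitotic_count must be non-negative")
    if not low_cut < high_cut:
        raise ValueError("low_cut must be smaller than high_cut")
    if mitotic_count <= low_cut:
        return 1
    if mitotic_count <= high_cut:
        return 2
    return 3


# Placeholder thresholds, for illustration only.
for count in (3, 15, 60):
    print(count, digital_mitotic_score(count, low_cut=10, high_cut=20))
```

Binning in this way means that a detector that over- or under-counts by a few figures usually lands in the same bin, which is the robustness argument made in the text.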

Stromal and pleomorphic contrast
Stromal nuclei detections from Tumour Detector were used to construct a co-occurrence matrix (CM; Supplementary Fig. 5) where each entry represented the number of times a patch containing a certain number of stromal nuclei co-occurred with another patch containing a certain number of stromal nuclei. The CM was then used to calculate WSI-level contrast, which quantified the local stromal variations. Another stromal nuclei related feature was also calculated, which quantified the co-occurrences of patches with low stromal density. Similarly, the patch-level pleomorphic predictions from Local Grade Predictor were used to construct a CM containing entries for local pleomorphism, which was then summarised at WSI-level as pleomorphic contrast.
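A co-occurrence matrix and its contrast can be sketched as below. This is an illustration under stated assumptions rather than the study's implementation: patches are assumed to lie on a regular grid, per-patch stromal nuclei counts are assumed to be already binned into integer levels, "co-occurring" is taken to mean horizontally or vertically adjacent, contrast follows the usual texture-analysis definition (the sum of (i - j)^2 weighted by the normalised matrix), and the low-density feature is taken to be the normalised entry for the lowest level paired with itself. All function names are hypothetical.

```python
def cooccurrence_matrix(grid, n_levels):
    """Count how often level i is adjacent to level j on a patch grid.

    grid: 2D list of integer levels in [0, n_levels), one per patch.
    Right and lower neighbours are visited once and counted in both
    directions, so the matrix is symmetric.
    """
    cm = [[0] * n_levels for _ in range(n_levels)]
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    i, j = grid[r][c], grid[rr][cc]
                    cm[i][j] += 1
                    cm[j][i] += 1
    return cm


def contrast(cm):
    """Sum of (i - j)^2 weighted by the normalised co-occurrence matrix."""
    total = sum(sum(row) for row in cm)
    if total == 0:
        return 0.0
    weighted = sum((i - j) ** 2 * v
                   for i, row in enumerate(cm)
                   for j, v in enumerate(row))
    return weighted / total


def low_density_cooccurrence(cm):
    """Share of adjacent patch pairs in which both patches are at level 0."""
    total = sum(sum(row) for row in cm)
    return cm[0][0] / total if total else 0.0


uniform = [[1, 1], [1, 1]]        # no local variation
checker = [[0, 2], [2, 0]]        # strong local variation
print(contrast(cooccurrence_matrix(uniform, 3)))
print(contrast(cooccurrence_matrix(checker, 3)))
```

The same two functions apply unchanged to a grid of patch-level pleomorphism predictions, which is how a pleomorphic contrast of this kind could be obtained.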

Survival analysis
The selected features, i.e. TGC (LG1%, LG2%, LG3%), percentage of tumour area, pleomorphic contrast, stromal contrast, co-occurrences of stromal nuclei patches with low density, and digital mitotic score, were combined by a Cox proportional hazard regression model to generate the BRACE risk score. Supplementary Fig. 6 shows the contribution of each component of the BRACE marker, where higher percentages of LG1 and higher stromal contrast are associated with better outcomes whereas higher percentages of LG3, higher pleomorphic contrast, higher mitotic score and larger tumour area percentages are associated with worse outcomes. Table 1 shows the outcome results (P value of the log-rank test, concordance or C-index, and hazard ratio (HR) with 95% confidence interval (CI)) of the proposed BRACE marker and other clinical features on internal and external validation sets when used for stratifying patients into high-risk and low-risk groups in terms of DMFS and BCSS. In comparison to clinical grade, BRACE produced higher C-indices (with significant P values of the log-rank test) for both DMFS and BCSS, especially when generalising to the external cohort. Supplementary Table 5 shows the outcome results of BRACE on the discovery set. For comparison, Supplementary Table 6 also lists the outcome results of a larger set of features excluding the features included in BRACE. Similarly, to compare with a simple baseline DL model, Supplementary Table 7 shows the outcome results of ResNet-18 (pretrained on ImageNet) used to extract features from the same set of patches as BRACE; a clear drop in performance was noted in comparison to BRACE.
No clear effect of the scanner type on prediction accuracy was observed for the compared features (Supplementary Table 8). For example, for BCSS, both Grade and BRACE produced higher C-indices on cases scanned with the Pannoramic scanner than with the Philips scanner, whereas for NPI it was the other way around.
For DMFS, the C-index of Grade was almost the same for both scanners. Overall, NPI's predictions were better for the Philips scanner whereas both Grade and BRACE performed better for the Pannoramic scanner. The effect of scanner type might become more obvious when studied in a larger cohort with a sufficient number of events.
In another experiment two components of NPI i.e. lymph node status and invasive tumour size were included in BRACE (this experiment is denoted by BRACE* to differentiate from main BRACE).Except for DMFS (LN 0-3) of external validation set, BRACE* showed comparable or improved prediction performance for both DMFS and BCSS in internal as well as external validation sets (Supplementary Table 9).
Figure 4 shows Kaplan-Meier (KM) curves for DMFS in LN− cases using different clinicopathological features and the BRACE marker on internal as well as external validation cohorts. The number of events in the high-risk group (n = 16) and low-risk group (n = 7) of BRACE were almost the same as in the NPI high- and low-risk groups, but for clinical grade the split of events was 11 (high-risk) to 12 (low-risk). This trend was more evident for the clinical grade on the external cohort, where the number of events in the low-risk group (n = 9) was higher than in the high-risk group (n = 4), suggesting the limitation of a discrete grade risk score in stratification. Supplementary Figs. 7 and 8 show KM curves for DMFS and BCSS, respectively, in the discovery set. Similarly, Supplementary Figs. 9-11 show KM curves for DMFS (LN 0-3), BCSS (LN−), and BCSS (LN 0-3), respectively, in the validation sets.
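Since the C-index is the main ranking metric reported in this section, a minimal sketch of how it can be computed may help. The version below is Harrell's concordance index in its simplest form, assuming that a higher risk score should correspond to a shorter time to event and ignoring pairs with tied times; it is an illustration with made-up data, not the code used to produce Table 1.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data.

    times:       follow-up time for each patient
    events:      1 if the event was observed, 0 if censored
    risk_scores: higher value means higher predicted risk
    A pair (i, j) is comparable when patient i had an observed event
    strictly before patient j's follow-up time.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Made-up follow-up data: the third patient is censored.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
print(concordance_index(times, events, [0.9, 0.7, 0.4, 0.1]))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported values of 0.73 and 0.84 should be read.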

Identification of high-risk patients for additional chemotherapy
A feature of the BRACE marker is its ability to identify high-risk patients who could benefit from additional chemotherapy. To test the predictive ability of the BRACE marker, we compared the KM curves of the high-risk group predicted by our model with the actual survival curve of patients who received additional chemotherapy. Figure 5a, b shows the actual survival of patients treated with endocrine therapy only and those treated with additional chemotherapy in the whole of Cohort-A. Figure 5c, d shows the overlap between the survival curves of the predicted high-risk group (in red) in endocrine therapy treated patients and the actual survival curves of high-risk chemotherapy treated patients (in black), with no significant difference (DMFS: P = 0.26, BCSS: P = 0.53), suggesting the ability of our marker to identify patients who can benefit from additional chemotherapy. Patients in both blue and red curves in Fig. 5c, d were treated only with endocrine therapy. Similarly, Supplementary Fig. 12 shows the same analysis restricted to the internal validation set.

Comparison with Magee equation
The predictive ability of the BRACE marker was indirectly compared, in terms of C-index, with the ODX risk score via the results of a previously published study in which New Magee equation 2 43 showed a moderate correlation (Pearson's correlation coefficient r = 0.604) with the ODX risk score. This equation is based on clinicopathological variables including NPI, tumour size, ER HScore, PR HScore and HER2 status. For Cohort-A, PR HScore was missing for the Magee equation. On the validation set of Cohort-A, the BRACE marker achieved C-indexes of 0.73 ± 0.06 and 0.84 ± 0.04 for LN− DMFS and BCSS, respectively, whereas the Magee equation produced C-indexes of 0.69 ± 0.05 and 0.78 ± 0.04, suggesting better risk ranking by the former.

Correlation with clinicopathological parameters
The BRACE marker was significantly associated with larger tumour size (P < 0.0001) and high tumour grade (P < 0.0001), including high pleomorphism (P < 0.0001), low tubule formation (P < 0.0001), and high mitotic scores (P < 0.0001). It also showed significant association with high NPI score (P < 0.0001; Supplementary Table 10). It was significantly associated with lymphovascular invasion (LVI) on Cohort-A but the association was not significant on

DISCUSSION
The decision on chemotherapy administration to patients with early-stage ER+/HER2− BC is critical to avoid unwarranted chemotherapy side effects. The current methods in histopathological analysis are mostly subjective or based on gene expression profiling, which is expensive and time consuming. In this study, we developed an AI-based method (BRACE marker) which can identify patients in need of chemotherapy in this intermediate risk group using an objective and reproducible method, reducing the effect of variations present in manual clinical grading. Furthermore, the BRACE marker utilises H&E slides used in routine clinical practice, so no extra sampling is needed, saving the labour and costs of gene expression assays.
By using AI and image analysis, the BRACE pipeline extracted a rich set of features including tumour morphology, tumour-stroma relationship, TILs quantification, phenotypic heterogeneity and mitotic activity from H&E-stained breast cancer images. These important biologically driven features cannot be objectively and accurately quantified by current BC grading. Capturing more detailed morphological patterns in the form of grade proportions, and combining the features based on such local grades with tumour area percentage and pleomorphic and stromal variations, can help in predicting DMFS and BCSS. BRACE followed a bottom-up approach where pathologists' supervision in the form of cellular and regional annotations was utilised so that the results are explainable as well as appropriate for the prognostication task. For example, the method can quantify how much pleomorphic or stromal variation is present in a WSI or what the proportions of different grades are, along with their locality.
Our results on univariate analysis (Table 1) showed that the proposed BRACE marker can rank (in terms of C-index) patients at risk of distant metastasis as well as BC-specific death better than conventional manual clinical grade and comparably to NPI. More importantly, these results suggested better generalisation of the BRACE marker to the external cohort as compared to clinical grade.
proportions); (c) by incorporating tumour region proportion, and pleomorphic and stromal variation.
Both inter- and intra-tumour heterogeneity have effects on patient outcome and response to therapy 2,3. These variations include tumour differentiation, cellularity, stromal and immune response, tumour architecture and histological tumour type. Such heterogeneities pose challenges for quantification and hence obtaining a single risk score for a case is not possible from visual assessment. It has also been shown that high heterogeneity of tumours is associated with poor prognosis because of less immune cell infiltration 44. With the power of AI, detailed information can be obtained in an objective manner and multiple features can be integrated into a single score which can then be used for prognosis. The BRACE pipeline extracted a rich set of features from the WSI which can support a wide range of further research explorations. BRACE quantified the overall nuclear heterogeneity within a WSI into one score (Pleomorphic Contrast) where a higher score, i.e. more local pleomorphic variation, was associated with poor prognosis, as also reported in previous studies 2,3,44.
Stroma-to-tumour ratio (STR) has been shown to have independent prognostic relevance in different tumours including BC 45,46. However, the relevance of STR still needs further research in BC because some studies have shown an association of high STR with poor prognosis 45 while others demonstrated good prognosis 46,47. These inconsistencies might be attributed to the variations and subjectivity in the assessment of tumour stroma as well as the heterogeneous nature of BC. The BRACE feature set included a quantitative assessment of stromal cell ecology and entropy in terms of Stromal Contrast, which measures the local variations in stromal cells in a WSI and may provide an alternative to STR quantification. High stromal cell variation was associated with good prognosis and vice versa.
Another recently published DL-based histological grading work related to BC survival analysis (DeepGrade 32) is in line with BRACE but differs from this work in the following aspects. DeepGrade stratified only Grade 2 patients whereas BRACE produced a composition of all grades along with other image-based features. The main outcome for DeepGrade was recurrence-free survival, but for BRACE both DMFS and BCSS were the main outcomes. BRACE followed a bottom-up approach (region segmentation, cell segmentation and classification, followed by feature generation) for producing interpretable features, as compared to the purely deep features of DeepGrade. Furthermore, DeepGrade utilised an ensemble of twenty CNN models for grade prediction whereas one CNN model was used for local grade prediction in BRACE.
To achieve better stratification of patients, BRACE combined multiple important histological features. This idea was supported by a previous study 35 analysing the correlation between ODX score and DL-based mitotic count. Using the frequency of mitotic figures, a linear SVM classified a patient as either high- or low-risk. Their analysis showed that mitotic count cannot be used alone for risk stratification of the intermediate risk group, suggesting the addition of other pathologic features. The digital mitotic count of BRACE was higher for high BRACE scores than for low BRACE scores (Supplementary Fig. 15) and a higher mitotic score corresponded to higher risk, which was again supported by the lower mean number of mitoses in low ODX vs the higher mean number of mitoses in high ODX in their study 35. The importance of mitotic figures for predicting ODX was also shown in another study 36.
For explainable results, BRACE followed a bottom-up approach where region and cell level information was used to generate high-level features in different categories. A similar approach was adopted in another work 36 where DL-based features were generated in three main categories related to structures, cell types, and tissue types.
The ROI selection for training the local grade model in BRACE was based on tumour cell density and eccentricity to ensure the model learnt features representative of tumour patterns. A similar approach to ROI selection was adopted in another work 37 where the ROIs for counting tumour and immune cells were based on high tumour cell density. Their study also corroborated the benefit of using DL features in BRACE by reporting an improvement in correlation with ODX score when DL features were added to Magee features. To generate an interpretable model, feature selection in that study 37 was mainly based on domain knowledge. Similarly, another work 38 weighted the recurrence score predicted from H&E image tiles by tile-level tumour likelihood and combined it with clinicopathological characteristics.
Unlike BRACE, most of these studies used the ODX score as a label for training a regression model and combined DL features with other clinicopathological parameters. A more closely related work 39 used DL to extract features related to the three components of NGS and demonstrated their prognostic significance.
Limitations of this study include its retrospective nature, a consequence of the challenges associated with designing and conducting prospective studies. However, the method was validated on a large cohort of more than 2100 cases and its generalisability was also validated on an external cohort. With the availability of more data, multicentric training and validation will be more useful for developing a robust model. Tumour and stromal architectures, which could potentially add to a more significant indicator, were not included in this work. Furthermore, due to the unavailability of multigene assays, the method was indirectly compared with ODX via previously published results of the Magee equation in terms of ranking the patients with C-index.
In conclusion, the BRACE marker is an AI-based method which can identify high-risk patients in the intermediate risk group of ER+/HER2− BC with high significance, adds clinically relevant information over routine manual histological features, and provides a potential reproducible and cost-effective alternative to existing gene-based methods. This work should encourage further research in image-based prognostics in BC and other types of cancer. Our future plan is to investigate other image-based features such as the arrangements of tumour cells and stromal structures and to apply the proposed method to other types of cancer (such as prostate). With access to multigene assays, the method could be further validated for predictiveness, and a prospective study could also be designed for validation. H&E image-based features could also be combined with features from other stained images such as IHC as well as other clinicopathological features.

Datasets
This study included a large well-characterised luminal (ER+/HER2−) BC cohort (n = 2122) of patients who had received endocrine treatment, without chemotherapy, collected from the Nottingham University Hospital, Nottingham, UK from 1998 to 2020. This cohort (called Cohort-A) was used for discovery and internal validation. To validate the generalisability of the BRACE marker, an external validation cohort (n = 311), referred to here as Cohort-B, with the same BC subgroup and clinicopathological data as Cohort-A, was collected from University Hospital Coventry and Warwickshire (UHCW), Coventry, UK from 2011 to 2014. Distant metastasis-free survival (DMFS), i.e. time from surgery to development of distant metastasis, and BC specific survival (BCSS), i.e. time from initial diagnosis to the time of BC related death, were the two endpoints of the analysis. Median follow-up duration for Cohort-A for DMFS and BCSS was 80 and 83 months, respectively, whereas for Cohort-B it was 96 months for both DMFS and BCSS.
Clinicopathological data (Supplementary Table 1) of female patients with age at diagnosis varying from 20 to 92 years included: lymph node (LN) status, clinical histological grade, tumour size, lympho-vascular invasion (LVI), Nottingham Prognostic Index (NPI), progesterone receptor (PR) status, follow-up and treatment data. Representative sections of formalin fixed paraffin embedded tissue blocks of surgical excision specimens from each case were H&E stained and scanned with a Philips UFS scanner at 0.25 µm/pixel at ×40 to produce WSIs (n = 1417). A subset of cases (n = 705) were scanned using a Pannoramic 250 Flash III scanner (3DHistech, Budapest, Hungary). For each patient, one H&E-stained WSI was utilised for developing an AI-based BRACE marker for survival prediction.
Cohort-A was divided into discovery (n = 1496) and internal validation (n = 626) sets (Supplementary Fig. 1a). Three different splits were formed from the discovery set for cross-validation (Supplementary Fig. 1b). Supplementary Fig. 1b shows the detail of Cohort-B. A subset of cases (n = 174) from the source hospital of Cohort-A who received both endocrine therapy and chemotherapy was used as a control group. To keep the evaluation fully blinded, the survival times and events of the validation sets were hosted on a webserver.
5600 × 5600 pixels (about 1344 µm) at 40× magnification. Patches of size 512 × 512 pixels (about 123 µm) at 40× magnification from the ROIs were used for training Local Grade Predictor to predict TGC, i.e. the grade at the patch level across an entire WSI.

ROI selection
Based on three-fold cross-validation, different strategies were evaluated for ROI selection, including random ROI, ROI with maximum tumour tissue, and ROI with high tumour nuclei density and eccentricity. The last strategy was adopted because it gave the highest ROC-AUC for predicting the clinical grade of a WSI. This empirical ROI selection criterion (eq1) gave more weight to areas with larger and more deformed tumour nuclei.

Patch selection
The extracted ROIs were cut into smaller patches of size 512 × 512 pixels (about 123 µm) at ×40 magnification so that they could fit into computer memory for training Local Grade Predictor. To allow the model to pay more attention to tumour morphology, patches containing fewer than fifteen tumour nuclei were discarded. The threshold was selected based on the discovery set of each of the three splits.
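The tiling and filtering step can be sketched as follows. This is a minimal illustration, assuming a square ROI of 5600 × 5600 pixels, non-overlapping tiles with partial border tiles dropped, and tumour nuclei supplied as centroid coordinates in ROI pixel space; the function name, the argument names and the tiling scheme are assumptions rather than details given in the text.

```python
import math


def select_patches(nuclei_xy, roi_size=5600, patch_size=512,
                   min_tumour_nuclei=15):
    """Return (row, col) indices of patches with enough tumour nuclei.

    nuclei_xy: iterable of (x, y) tumour-nucleus centroids in ROI pixels.
    Patches with fewer than min_tumour_nuclei nuclei are discarded.
    """
    n_tiles = roi_size // patch_size  # partial tiles at the border are dropped
    counts = {}
    for x, y in nuclei_xy:
        col = math.floor(x / patch_size)
        row = math.floor(y / patch_size)
        if 0 <= col < n_tiles and 0 <= row < n_tiles:
            counts[(row, col)] = counts.get((row, col), 0) + 1
    return sorted(tile for tile, n in counts.items()
                  if n >= min_tumour_nuclei)


# Hypothetical centroids: 15 nuclei in tile (0, 0), 14 in tile (1, 2).
points = [(10 + i, 20) for i in range(15)]
points += [(2 * 512 + 5 + i, 512 + 7) for i in range(14)]
print(select_patches(points))
```

With these inputs only tile (0, 0) survives, since the second tile falls one nucleus short of the threshold.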

Local Grade Predictor training
To train a model for predicting TGC, we enhanced the performance of Inception V3 Predictor were used to count the co-occurrence of these predictions.Supplementary Fig. 5 further explains how CM was constructed.

Feature selection
A set of features (n = 700) in different categories (Supplementary Table 2) was extracted to identify features which could be explained from clinical point of view and perhaps could also be applied to other subgroups of BC.The prognostic importance of the features for ER+/HER2− patients was assessed by Cox L1 regression in terms of C-Index, P value (of the log-rank test) and HR on the discovery set and eight features (i.e.percentages of digital local grade LG1%, LG2%, LG3%, percentage of tumour area, pleomorphic contrast, stromal contrast, co-occurrences of stromal nuclei patches with low density, and digital mitotic score) were selected for final model development.
AI based grade has proven to be a prognostic marker for BC survival prediction by identifying useful morphological patterns 32,54 .To quantify grade at a detailed level and to put it in relation with overall tumour area, BRACE included the percentages of local grades and overall tumour percentage.Although nuclear pleomorphism is an important component of BC grade but due to high inter-observer variability 16,55 , it needs better quantification.To subjectively quantify the overall nuclear heterogeneity with in a WSI into a single score, BRACE included pleomorphic contrast which measures the local variation in nuclear pleomorphism where a low value represents less variations and a high value represents more variations.Similarly, mitotic count has been a well-known prognostic marker 35,56,57 therefore its digital counterpart has been included in BRACE.Recently, the importance of stromal variations has also been found to be of prognostic significance 45,46,[58][59][60] .To represent a quantitative measure of stromal cell ecology and entropy BRACE included stromal contrast and co-occurrences of stromal nuclei patches with low density, respectively.
It was noted that adding clinically relevant features from Category C (related to TIL features) and other ML features from Category F (different types of cell counts) did not add further information. The former might be attributed to the unestablished utility of TILs for the ER+/HER2− subgroup, whereas the latter are likely less useful because of the difficulty of reducing highly variable cell counts to a single value at WSI level.

Statistical analyses
To assess the prognostic ability of the proposed BRACE marker, a Cox proportional hazard regression model (from the lifelines package for Python, https://lifelines.readthedocs.io/en/latest/fitters/regression/CoxPHFitter.html) was fitted on three different splits of the discovery set. After feature and parameter selection, a single model was fitted on the whole discovery set. Parameters for the model were set as: estimation method (Breslow), L1 (0.5), L2 (0.5), and penalty (0.001). The fitted model was then evaluated on the internal and independent external validation cohorts. The regression coefficients were used to compute a predictive risk score for each patient. KM curves were used to show risk stratification, and a P value < 0.05 for the two-tailed log-rank test was considered significant. Based on the three splits of the discovery set, the cut-off for the BRACE marker was set at the 70th percentile. Similarly, for clinical variables such as grade, grade 1-2 was taken as reference against grade 3, whereas the cut-off point for NPI was set at the 60th percentile. Forest plots were generated using the R function 'coxph' and the R package 'forestmodel'. The Chi-square test was used for analysis of categorical data.
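The step from regression coefficients to risk groups can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the published pipeline: the risk score is taken to be the exponential of the linear predictor (the partial hazard of a Cox model without centring), the coefficient values and feature names are hypothetical, and the percentile uses linear interpolation between order statistics.

```python
import math
from typing import Dict, List, Sequence

# Hypothetical coefficients; the fitted BRACE coefficients are not given here.
coefficients: Dict[str, float] = {"lg3_pct": 0.02, "mitotic_score": 0.5}


def risk_score(features: Dict[str, float], coefs: Dict[str, float]) -> float:
    """Partial hazard exp(sum_k beta_k * x_k) for one patient (assumed form)."""
    return math.exp(sum(coefs[name] * value for name, value in features.items()))


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile (0-100) with linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def stratify(scores: Sequence[float], cutoff: float) -> List[str]:
    """Label each patient as high- or low-risk relative to the cut-off."""
    return ["high" if score > cutoff else "low" for score in scores]


# Hypothetical discovery-set patients used only to derive the cut-off.
discovery = [
    {"lg3_pct": 10.0, "mitotic_score": 1.0},
    {"lg3_pct": 35.0, "mitotic_score": 2.0},
    {"lg3_pct": 77.0, "mitotic_score": 3.0},
]
discovery_scores = [risk_score(patient, coefficients) for patient in discovery]
cutoff = percentile(discovery_scores, 70)  # 70th percentile, as in the text
groups = stratify(discovery_scores, cutoff)
```

The cut-off is derived on the discovery set only and then applied unchanged to validation patients, so that the validation cohorts play no part in choosing the threshold.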
) with a low and a high BRACE risk score. Contrast measures the local variation in a feature, where a low value represents less variation and a high value represents more variation. Higher stromal contrast (score = 20.81; Fig. 2d) and low pleomorphic contrast (score = 0.065; Fig. 2e) are associated with good prognosis (BRACE risk score = 0.249), whereas low stromal contrast (score = 10.16) and high pleomorphic contrast (score = 12.58) are associated with bad prognosis (BRACE risk score = 10.38). Similarly, Fig. 3 shows digital tumour grade composition and mitotic count features. A higher percentage of LG3 (77%; Fig. 3b) and a higher digital mitotic count (count = 60; Fig. 3c) are associated with poor prognosis (BRACE risk score = 10.38).

Fig. 1 Proposed BRACE marker workflow for breast cancer survival prediction. DCIS regions are filtered for exclusion in the following steps; Tumour Detector segments and classifies nuclei for tumour-rich area identification/ROI selection; Local Grade Predictor, which is trained in a supervised way using clinical grade as the WSI-level label, predicts local grade composition (digital local grade LG1-3) and pleomorphism; Region Segmentor segments tissue regions to quantify stromal cell density and tumour area percentage; the extracted features are used to form the BRACE marker for survival prediction.

Fig. 2 Stromal contrast and nuclear pleomorphism features. a Sample patches/areas of low, medium and high stromal cell density used to calculate stromal contrast in (d). b Sample patches of low, medium and high nuclear pleomorphism used to calculate pleomorphic contrast in (e). c Two WSIs with their corresponding BRACE risk scores. d Stromal cell density for calculation of stromal contrast. e Nuclear pleomorphism used for calculation of pleomorphic contrast.

Fig. 3 Tumour grade composition and mitotic counting features. a Two WSIs with their corresponding BRACE risk scores. b Digital tumour grade composition showing percentage of each grade in a WSI. c Mitotic counts in a sample area extracted from an ROI. Mitotic figures are in yellow circles.
Figure 6 shows the forest plots from the multivariate Cox proportional hazard regression model for DMFS (LN−) cases, where the HRs along with their 95% confidence intervals are listed for both validation Cohort-A and Cohort-B. The BRACE marker was found to be an independent prognostic variable against other clinicopathological variables for DMFS in LN− cases of Cohort-A (P = 0.04, HR = 2.79, CI: 1.04-7.48; Fig. 6a) as well as

Fig. 4 KM curves for LN− DMFS. KM curves for the high-risk (red line) and low-risk (blue line) groups of LN− DMFS as stratified by BRACE marker and other clinicopathological variables on the validation sets (Cohort-A: n = 499; Cohort-B: n = 267). P values are for the log-rank test.

Fig. 5 KM curves for identifying cases for chemotherapy (Cohort-A). KM curves for actual survival in endocrine therapy only treated (n = 2122) and endocrine+chemotherapy treated (n = 174) patients for endpoints DMFS (a) and BCSS (b). KM curves for high-risk (red line) and low-risk (blue line) groups of LN 0-3 as stratified by BRACE marker for endpoints DMFS (c) and BCSS (d) in patients treated with endocrine therapy only in discovery Cohort-A. With an appropriate cut-off, BRACE identified cases which could have benefited from additional chemotherapy, as shown by the overlap of the predicted high-risk curve (red line) with the actual survival curve (black line) of cases treated with chemotherapy. BRACE-high and BRACE-low represent cases identified as high- and low-risk, respectively, by BRACE marker. P values are for the log-rank test.
From the multivariate analysis it was observed that the BRACE marker adds clinically relevant information over other clinicopathological variables including tumour size, age at diagnosis, and grade. Although the Local Grade Predictor in our proposed pipeline was trained in a supervised way by utilising the clinical grade as the WSI-level label, our BRACE marker ranked patients better in terms of C-Indices than the clinical grade for the following main reasons: (a) by training the local grade composition model only on ROIs selected by a criterion of high tumour nuclei density and high tumour nuclei eccentricity; (b) by capturing more detailed morphological patterns in the form of local grades (grade

Fig. 6 Multivariate analysis for DMFS, LN− cases. Forest plots showing the HR with 95% confidence intervals (CI) and P values (of the log-rank test) for BRACE marker when adjusted for other clinicopathological variables on DMFS (LN−) for internal validation set of Cohort-A (n = 499) (a) and external validation set Cohort-B (n = 260) (b).

Table 1. Results on internal and external validation sets. Internal validation set (Cohort-A), external validation set (Cohort-B). P value (of the log-rank test), C-index and hazard ratio (HR) with 95% confidence interval (CI) for the proposed BRACE marker and other clinical features for DMFS and BCSS on a subgroup of endocrine treated patients with LN− and LN 0-3 are listed. Events are censored at 10 years. x ± sd for the C-Index represents the mean and one standard deviation over 1000 bootstrap runs.
local changes of a feature over the WSI. It ranges from 0 (a constant image) to (size of CM-1). To calculate stromal contrast, the co-occurrences of each stromal cell, in each patch of size 256 × 256 pixels, with any other cell at eight different angles were counted and put as entries in the CM. A standard Python library function (greycoprops) was used to calculate different properties (contrast, dissimilarity, homogeneity, etc.), which served as a compact summary of the CM for each WSI. Pleomorphic contrast was calculated in a similar manner, where the patch predictions in the form of local pleomorphic 1 (LP1), LP2 and LP3 from Local Grade