Introduction

Convolutional neural networks (CNNs) are powerful pattern recognizers that have been used successfully to classify cancer tissue morphology and predict patient outcomes1,2,3,4,5,6,7. In addition, recent studies have suggested that information about gene mutations and other molecular features of cancerous tissue can also be obtained from tissue morphology using only weakly supervised machine learning2,8. Examples of CNN-mediated identification of molecular events associated with cancer include the prediction of up to six different gene mutations in lung cancer9 and microsatellite instability in colorectal cancer10. It is, however, not known whether the CNN-derived morphological features that predict the molecular status of a tumor could also be used to guide the choice of molecularly targeted therapies.

In recent studies on breast cancer, a CNN trained on tissue histomorphology predicted the steroid hormone receptor status, expression of the Ki-67 protein (indicator of cell proliferation), human epidermal growth factor receptor 2 (ERBB2, alias HER2; HGNC:3430) and a series of other tissue biomarkers in a large proportion of the patients11,12. A specific question related to breast cancer is therefore whether a CNN trained to predict the ERBB2 status of a tumor also could predict the efficacy of anti-ERBB2 adjuvant treatment.

The identification of amplified ERBB2 as a major driver in approximately 20% of breast cancers and the subsequent development of anti-ERBB2-targeted therapies, such as trastuzumab, turned out to be a major success13. Patients with ERBB2-positive breast cancer had unfavorable survival rates in the past, but systemic regimens with ERBB2-targeted agents have substantially improved survival outcomes14. Patients who benefit from ERBB2-targeted treatments are usually identified by demonstrating the presence of ERBB2 amplification with in situ hybridization or an excess of ERBB2 tyrosine kinase protein with immunohistochemistry. The criteria for defining ERBB2-positive cancer are generally accepted, but they have required modification over time, and some borderline cases remain challenging to classify15. Therefore, there is a need for more accurate approaches to predict the efficacy of anti-ERBB2 treatment.

In the current study, we explored whether a CNN weakly supervised by tumor ERBB2 gene amplification status, as determined by chromogenic in situ hybridization (CISH), and trained with standard hematoxylin and eosin (H&E) stained tissue samples can predict breast cancer ERBB2 status. We refer to the CNN-based prediction as the H&E-ERBB2 score. Additionally, we explored whether the H&E-ERBB2 score is associated with survival in patient populations treated with or without adjuvant trastuzumab. Interestingly, the CNN trained in the current study was not only a statistically significant and independent predictor of the ERBB2 status but also identified patients with CISH ERBB2-positive cancer who benefited more from trastuzumab. In addition, the H&E-ERBB2 score identified patients who were CISH ERBB2-negative but, according to the CNN, exhibited tissue morphological features resembling ERBB2-positive cancer and had unfavorable outcomes.

Methods

Patient series

The study was based on cancer tissue samples, clinicopathological data, and follow-up data from three independent breast cancer cohorts: the FinProg patient series16, the FinProg validation series17, and the FinHer clinical trial (ISRCTN76560285)18. The FinProg patient series, with data from 2,936 patients, is a nationwide cohort that includes approximately 50% of all women diagnosed with breast cancer in Finland in 1991 or 1992 (ref. 19) and covers most (93%) of the patients with breast cancer diagnosed within five selected geographical regions. The FinProg validation series consisted of 565 patients diagnosed mainly in the Helsinki region who were treated at the Departments of Surgery and Oncology, Helsinki University Hospital, from 1987 to 1990. The outcome and cause of death data were retrieved from the files of the Finnish Cancer Registry and Statistics Finland. Breast cancer-specific survival was used as the endpoint in the present study. Corresponding clinical information and pathologic tumor characteristics, including histological grade, were available from the hospital and laboratory records.

The FinHer trial (ISRCTN76560285) was an open-label multicenter randomized trial that included 1,010 patients in Finland in 2000–2003 (ref. 20). Eligible women were ≤ 65 years of age, had undergone breast surgery with axillary nodal dissection, and had either axillary lymph node-positive or high-risk node-negative cancer. Histological grading of cancer was done by pathologists at the time of the diagnosis according to the World Health Organization guidelines. Cancer estrogen receptor (ER), progesterone receptor (PR), and ERBB2 expression were determined by immunohistochemistry. For all patient samples considered positive for ERBB2 expression by immunohistochemistry (either 2+ or 3+ on a scale from 0 to 3+), ERBB2 amplification status was determined centrally by chromogenic in situ hybridization (CISH). Cancers with ≥ 6 gene copies were considered ERBB2-positive. The patients were randomly assigned to receive three cycles of docetaxel or vinorelbine, followed in both groups by three cycles of fluorouracil, epirubicin, and cyclophosphamide (FEC). The 232 (23.0%) patients with ERBB2-positive cancer underwent a second randomization either to receive concomitant intravenous trastuzumab for 9 weeks or not to receive trastuzumab. Distant disease-free survival (DDFS) was used as the endpoint in the FinHer trial. The median follow-up time was 5.2 years after random assignment18.

Tumor tissue microarray preparation and digitization

Tumor tissue microarrays (TMAs) were prepared from each patient’s representative formalin-fixed paraffin-embedded breast cancer samples16. We prepared 23 TMA blocks, each containing 50 to 144 tumor samples, from the 2,306 breast tumor samples available. H&E-stained TMAs (FinProg) and whole-slide tissue sections (FinHer) were digitized using a whole-slide scanner (Pannoramic 250 FLASH, 3DHISTECH Ltd., Budapest, Hungary) (see eMethods 1 in Supplement). The slides were scanned with a 20 × objective lens and the pixel size of the whole-slide images (WSIs) was 0.24 μm. The WSIs were compressed (Enhanced Compressed Wavelet format) and digital stain normalization was performed to adjust for the color intensity variation in the H&E stains21.

Determination of ERBB2 amplification status

In the FinProg cohorts, ERBB2 amplification was quantified by CISH on TMA cores as described previously16. In the FinHer cohort, ERBB2 amplification was determined by CISH on full tumor sections as part of the FinHer study18. Binary ERBB2 amplification status was derived for each tissue sample16,18 from the original CISH images. Because this single label summarizes cell-level information for the entire tissue sample, we refer to the data as weakly labeled.

Training of deep neural networks

We built a deep neural network comprising a stack of convolutional layers taken from a squeeze-and-excitation CNN architecture22 (se_resnext50_32x4d) and a fully connected block. We employed a transfer learning approach by initializing the convolutional layers with weights trained on ImageNet23. The training configuration and CNN parameters can be found in the Supplement. All experiments were carried out using stratified fivefold cross-validation, i.e., preserving the target class distribution within each fold. Training, validation and testing of the networks were performed on image tiles extracted from the FinProg data, and additional (external) evaluation was performed on FinHer tissue samples. A detailed description of the tile extraction procedure is provided in the Supplement. Image tiles used for training originated from FinProg patient samples that were not used for model testing. To assess the generalization of the models, we additionally tested them on a completely independent patient cohort (FinHer). Data utilization for training and testing is depicted in Fig. 1. Patient splits are summarized in Table 1. The resulting networks predicted probabilities of ERBB2 amplification, which we refer to as the H&E-ERBB2 score. All deep learning models were trained and evaluated using the PyTorch deep learning library24.
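The stratified fivefold split described above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the function name `stratified_kfold`, the round-robin assignment, and the toy cohort of 20 amplified and 80 non-amplified patients are assumptions made for illustration. The split operates on one index per patient, so that tiles from the same patient would never appear in both training and test data.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Return k lists of sample indices with class proportions roughly preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the members of each class round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Hypothetical cohort: one entry per patient, 1 = ERBB2-amplified, 0 = not amplified.
labels = [1] * 20 + [0] * 80
folds = stratified_kfold(labels, k=5)
for i, test_idx in enumerate(folds):
    # Every patient outside the current fold is available for training.
    train_idx = [j for f in folds if f is not test_idx for j in f]
    positives = sum(labels[j] for j in test_idx)
    print(f"fold {i}: {len(train_idx)} train, {len(test_idx)} test, {positives} positive")
```

In practice a library routine such as scikit-learn's `StratifiedKFold` would typically be used; the sketch only makes the idea of preserving class proportions explicit.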

Figure 1

Flow chart of the study design. Deep convolutional neural networks were trained on hematoxylin and eosin stained tissue microarray spots from a nationwide breast cancer series (FinProg) to predict the ERBB2 gene amplification status of the primary tumor. The networks were trained using a transfer learning approach with ImageNet pretrained weights, and only the deepest layers (colored in yellow) were finetuned by minimizing the focal loss, weakly supervised by the ground truth ERBB2 gene amplification status as determined by chromogenic in situ hybridization. At the test phase, the networks generate probabilities of ERBB2 amplification (the H&E-ERBB2 score). The classification accuracy was summarized with receiver operating characteristic and precision-recall curves. Additionally, we applied Kaplan–Meier plots and Cox regression analysis to correlate the H&E-ERBB2 scores with patient treatment outcome data.
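As a rough illustration of the focal loss mentioned in the caption, the following sketch evaluates the standard binary focal loss for a single prediction. The focusing parameter `gamma` and the class weight `alpha` shown here are commonly used defaults and are assumptions; the values actually used in the study are given in the Supplement and are not reproduced here.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for one predicted probability p = P(amplified) and label y in {0, 1}.

    gamma and alpha are assumed illustrative defaults, not the study's settings.
    """
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # The (1 - p_t)**gamma factor down-weights examples that are already well classified.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct prediction contributes far less than a confidently wrong one.
easy = binary_focal_loss(0.9, 1)
hard = binary_focal_loss(0.1, 1)
print(easy < hard)
```

With `gamma=0` the expression reduces to a class-weighted cross-entropy, which is why the focal loss is often described as a generalization of cross-entropy for imbalanced data.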

Table 1 Biological characteristics of breast cancers and patient survival.

H&E-ERBB2 score maps and activation maps

The continuous H&E-ERBB2 score values were overlaid on the corresponding locations of the digitized whole-tissue section images to generate H&E-ERBB2 score maps, i.e., to study the spatial distribution of ERBB2-associated features. To further identify the subregions in the H&E-stained breast tumor sections that were most informative for a high versus low H&E-ERBB2 score, we used gradient-weighted class activation mapping (Grad-CAM)25.
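One way to picture the score map is as a coarse grid in which each cell holds the score of the tile extracted at that position. The sketch below assumes that tiles are indexed by hypothetical (row, column) grid coordinates and that positions without tissue are left empty; the actual tile layout and rendering used in the study are described in the Supplement.

```python
def build_score_map(tile_scores, n_rows, n_cols, background=None):
    """Arrange tile-level scores into a 2D grid that can be rendered as a heatmap.

    tile_scores maps (row, col) grid positions to H&E-ERBB2 scores; positions
    without a tile (e.g. background glass) keep the `background` value.
    """
    grid = [[background] * n_cols for _ in range(n_rows)]
    for (row, col), score in tile_scores.items():
        grid[row][col] = score
    return grid


# Hypothetical 2 x 3 tile grid with one position lacking tissue.
tile_scores = {(0, 0): 0.1, (0, 1): 0.8, (1, 1): 0.6, (1, 2): 0.3, (0, 2): 0.2}
score_map = build_score_map(tile_scores, n_rows=2, n_cols=3)
for row in score_map:
    print(row)
```

The resulting grid could then be upsampled to the slide resolution and blended with the H&E image using any plotting library.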

Statistical analysis

To perform the analysis on the FinHer whole-slide sections, we first pooled the tile-level H&E-ERBB2 scores into a whole-slide score by taking the median value within each slide. The area under the receiver operating characteristic (ROC) curve (AUC) was used to quantify the capability of the model to distinguish ERBB2-positive from ERBB2-negative cancers as assessed by CISH. Because the data were imbalanced, with a larger proportion of CISH ERBB2-negative cancers, we also generated precision-recall curves (PRCs), which depict the positive predictive value of the classifier (precision) at each sensitivity (recall) value across various thresholds26. Average precision (AP) was calculated from the PRCs. Confidence intervals were calculated for both AUCs and APs using a stratified bootstrapping technique with 2,000 iterations. To assess whether the H&E-ERBB2 score was independent of the grade of tumor differentiation, we fitted a logistic regression model with ERBB2 CISH status as the dependent variable. We used the statsmodels27 Python package to perform the logistic regression analysis.
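The pooling and evaluation steps can be sketched with the standard library alone. This is an illustrative sketch rather than the study's code: the slide identifiers and scores are invented, the AUC uses the rank-based (Mann-Whitney) formulation, the AP is the mean precision at the rank of each positive sample with ties broken by input order, and the percentile method for the bootstrap interval is an assumption, since the text specifies only a stratified bootstrap with 2,000 iterations.

```python
import random
from statistics import median


def roc_auc(y_true, y_score):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, y_score):
    """Mean of the precision values at the rank of each positive sample."""
    ranked = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


def stratified_bootstrap_ci(y_true, y_score, metric, n_iter=2000, level=0.95, seed=0):
    """Percentile interval; positives and negatives are resampled separately."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(y_true) if y == 1]
    neg = [i for i, y in enumerate(y_true) if y == 0]
    stats = []
    for _ in range(n_iter):
        # Keeping the class counts fixed preserves the prevalence in every resample.
        sample = rng.choices(pos, k=len(pos)) + rng.choices(neg, k=len(neg))
        stats.append(metric([y_true[i] for i in sample], [y_score[i] for i in sample]))
    stats.sort()
    lower = stats[int((1 - level) / 2 * n_iter)]
    upper = stats[int((1 + level) / 2 * n_iter) - 1]
    return lower, upper


# Hypothetical tile-level scores per slide and CISH labels (1 = amplified).
tile_scores = {
    "slide_a": [0.9, 0.7, 0.8],
    "slide_b": [0.2, 0.4, 0.3],
    "slide_c": [0.25, 0.5, 0.1],
    "slide_d": [0.3, 0.2, 0.6],
}
cish = {"slide_a": 1, "slide_b": 0, "slide_c": 1, "slide_d": 0}

slide_ids = sorted(tile_scores)
y_score = [median(tile_scores[s]) for s in slide_ids]  # median pooling per slide
y_true = [cish[s] for s in slide_ids]

auc = roc_auc(y_true, y_score)
ap = average_precision(y_true, y_score)
baseline_ap = sum(y_true) / len(y_true)  # precision of a random classifier
auc_ci = stratified_bootstrap_ci(y_true, y_score, roc_auc, n_iter=200)
print(auc, ap, baseline_ap, auc_ci)
```

Resampling within each class keeps the proportion of ERBB2-positive samples constant across iterations, which matters for AP because its baseline equals the prevalence.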

To assess the relationship between the prediction of cancer ERBB2 status and distant disease-free survival (DDFS), we applied Cox regression. DDFS in the FinHer series was computed from the date of randomization to the first detection of metastases outside of the locoregional area or to death from any cause, whichever occurred first. Kaplan–Meier survival curves were drawn using the median value of the predictor as the cut-off.
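The median dichotomization and the Kaplan–Meier estimate can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names and toy follow-up data are hypothetical, and treating scores strictly above the median as "high" is an assumption, because the text does not state how values equal to the cut-off were handled. Cox regression is not reproduced here.

```python
from statistics import median


def split_by_median(scores):
    """Flag each score as high (True) if it lies strictly above the median (assumed rule)."""
    cutoff = median(scores)
    return [s > cutoff for s in scores]


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    events[i] is 1 if the endpoint was observed at times[i] and 0 if censored.
    """
    at_risk = len(times)
    survival, curve = 1.0, []
    for t in sorted(set(times)):
        observed = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        leaving = sum(1 for ti in times if ti == t)  # events plus censored at time t
        if observed:
            survival *= 1.0 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical follow-up times (years), event indicators and slide-level scores.
times = [1, 2, 3, 4]
events = [1, 0, 1, 1]
scores = [0.1, 0.4, 0.6, 0.9]

high = split_by_median(scores)
curve_all = kaplan_meier(times, events)
curve_high = kaplan_meier(
    [t for t, h in zip(times, high) if h],
    [e for e, h in zip(events, high) if h],
)
print(high, curve_all, curve_high)
```

Patients censored at a given time are counted as at risk at that time and removed afterwards, which is the usual convention for the product-limit estimator.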

Ethical approval

The use of the FinProg patient series of breast cancer samples and the clinical data was approved by the operative Ethics Committee of the hospital district of Helsinki and Uusimaa (94/13/03/02/2012). Also, the National Supervisory Authority for Welfare and Health (Valvira) approved the use of human tissues (7717/06.01.03.01/2015). The National Committee on Medical Research Ethics operates in conjunction with Valvira. Profiling of tumors from the FinHer patient series was approved by the institutional review board of Helsinki University Hospital (HUS 177/13/03/02/2011). FinHer study participants provided written informed consent. All methods were carried out in accordance with relevant guidelines and regulations.

Consent for publication

Not applicable.

Results

Patient selection criteria

We included 1,886 patients from the original FinProg patient series and 427 patients from the FinProg validation series (see Supplementary Fig. 1). Patients with missing follow-up data, carcinoma in situ, or distant metastases at the time of diagnosis, and those who did not undergo surgery for the primary tumor, were excluded. Additionally, we excluded patients with synchronous or metachronous bilateral breast cancer and those with other malignancies. The median follow-up of the patients alive at the end of follow-up was 15.4 years (range, 15.0–20.9 years). Further, for the current study we excluded patients with missing, detached or non-representative tissue spots (e.g. tissue sample area < 0.02 mm²) and patients with missing ERBB2 data. After all the exclusions (see Supplementary Fig. 1), 1,047 tissue microarray spots (one per patient) were retained for further analysis (Table 1). In the FinProg series, 30.3% of the patients (n = 310) died of breast cancer by the end of follow-up. Among the FinHer patients, whole-slide tumor tissue sections from the primary breast cancer were available from 712 (70.6%) patients (Supplementary Fig. 2), and 16.7% of the patients (n = 119) developed distant metastasis by the end of follow-up (i.e. were uncensored in the survival analysis).

Prediction of tumor ERBB2 status from tissue morphology

We first trained a CNN to predict breast cancer ERBB2 status on 693 H&E-stained patient samples from the FinProg series. To validate the generalization of the trained models, we evaluated them on two test sets. First, we assessed the performance of the models on held-out FinProg tissue microarray samples (n = 354). Then, the models were tested on a completely independent set of 712 whole-slide tissue sections from the FinHer patient series (Table 1). The overall setup of the computational experiment and analysis is depicted in Fig. 1. On the randomly selected internal test set from the FinProg series, the model achieved an AUC of 0.70 (95% confidence interval (CI), 0.63–0.77) and an AP of 0.35 (95% CI, 0.28–0.47) with a baseline AP of 0.19. The accuracy of the individual models trained within fivefold cross-validation is reported in Supplementary Fig. 3. Additionally, a contingency table calculated on the FinProg hold-out data is shown in Supplementary Fig. 4. When we next applied the model to 712 breast cancer whole-slide tissue images from the FinHer dataset (external test set), the CNN predicted ERBB2 status with an AUC of 0.67 (95% CI, 0.62–0.71) and an AP of 0.37 (95% CI, 0.32–0.44) with a baseline AP of 0.23 (Fig. 2). These results suggest that the H&E-ERBB2 score is a predictor of ERBB2 status as determined by CISH and that the model generalizes from small tumor areas (tissue microarray spots) to whole-slide samples from an independent test cohort.

Figure 2

The H&E-ERBB2 score is a significant predictor of tumor ERBB2 gene amplification as determined by chromogenic in situ hybridization. The accuracy of neural network-mediated prediction of breast cancer ERBB2 status was evaluated with (A) receiver operating characteristic (ROC) curves and (B) precision-recall curves (PRC). The results are presented for both the internal (FinProg) and the external (FinHer) test sets. Area under the ROC curve (AUC) and average precision (AP) were calculated with 90% confidence intervals. The PRC curves were compared to the baseline precision, i.e. precision of a random classifier. The baseline is the ratio of ERBB2-positive samples over the total number of ERBB2-positive and ERBB2-negative samples as determined by chromogenic in situ hybridization.

H&E-ERBB2 score independent of tumor histological grade

Tumor histological grade also predicted cancer ERBB2 status in the FinProg series (P < 0.001). When both the H&E-ERBB2 score and histological grade were included as covariates in a logistic regression analysis, the H&E-ERBB2 score remained an independent predictor of breast cancer ERBB2 status in the FinProg test set (P = 0.005) and also in the FinHer external test set (P < 0.001). Additional details of the logistic regression analysis are provided in Supplementary Tables 1 and 2, respectively. This suggests that the deep learning model identified morphological patterns associated with ERBB2 gene amplification that are not explained by the grade of differentiation.

ERBB2-associated morphology predicts distant disease-free survival

As CISH ERBB2-positive patients were randomly assigned to receive or not receive adjuvant trastuzumab as part of the FinHer trial, we performed the analysis in each of the treatment groups separately. CISH ERBB2-positive patients with high H&E-ERBB2 score and treated with trastuzumab had a more favorable DDFS than those who had a low H&E-ERBB2 score (HR, 0.37; 95% CI, 0.15–0.93; P = 0.034). CISH ERBB2-positive patients not treated with trastuzumab and who had a high H&E-ERBB2 score tended to have a less favorable DDFS (HR, 2.03; 95% CI, 0.69–5.94; P = 0.20; Fig. 3A). This suggests that the H&E-ERBB2 score could contribute to a more accurate prediction of trastuzumab efficacy than the CISH ERBB2 status alone.

Figure 3

Cancer tissue morphology-based H&E-ERBB2 score and distant disease-free survival. (A) Evaluation of H&E-ERBB2 scores and distant disease-free survival (DDFS) in patients with ERBB2-positive breast cancer as determined by chromogenic in situ hybridization (CISH) in the FinHer trial series. Left panel: Patients treated after breast surgery with chemotherapy plus adjuvant trastuzumab. Right panel: Patients treated with chemotherapy but without trastuzumab. (B) The DDFS of CISH ERBB2-negative patients in the FinHer series stratified by the H&E-ERBB2 score. None of the patients judged ERBB2-negative by CISH received adjuvant trastuzumab.

Among the 548 patients who were ERBB2-negative by CISH and who, therefore, were not eligible to receive trastuzumab in the FinHer trial, 246 (45%) had a high H&E-ERBB2 score and 302 (55%) had a low score, as determined by the median H&E-ERBB2 score of the entire external test set. In this subset, the patients with a high H&E-ERBB2 score had a less favorable survival than the patients with a low score (HR 2.15; 95% CI, 1.36–3.41; P = 0.001; Fig. 3B). Altogether, these observations suggest that some of the CISH ERBB2-negative patients might benefit from anti-ERBB2 treatment and that these patients can be detected by analyzing conventional H&E-stained cancer tissue samples.

Activation maps for the deep learning model trained to predict the ERBB2 gene amplification status

We observed substantial variability of the H&E-ERBB2 score within and across the samples, suggesting that the ERBB2-associated patterns learned by the CNN are heterogeneously distributed in the tissue (Fig. 4). According to the Grad-CAM activation maps, the tissue morphological features that were most predictive of ERBB2 gene amplification based on the CNN analysis were regions of tumor epithelium and in situ carcinoma components, as well as individual epithelial cells and fibroblasts in the stromal regions (Fig. 4C,D).

Figure 4

H&E-ERBB2 score and Grad-CAM activation maps. (A) H&E-ERBB2 score map as predicted by the convolutional neural network (CNN) based on a hematoxylin and eosin (H&E) stained sample from the FinHer cohort. The score map was overlaid as a heatmap on top of the H&E-stained whole-slide tissue image, representing variable levels of the CNN-derived H&E-ERBB2 score. (B) Left: Magnified image of the box shown in panel A. Right: Areas representing high H&E-ERBB2 scores indicated in red. (C) Grad-CAM activation map of a region predicted to have a low H&E-ERBB2 score. (D) Grad-CAM activation map of a region with a high H&E-ERBB2 score. The high-score areas indicate clusters of cancerous epithelial cells. The sample presented is ERBB2-positive by chromogenic in situ hybridization and had a high overall H&E-ERBB2 score.

Discussion

In this study, we have shown that a CNN trained on primary breast tumor tissue morphology is able to learn patterns predictive of breast cancer ERBB2 gene amplification status as assessed by chromogenic in situ hybridization. More importantly, our findings generalized from limited tumor areas, i.e. tissue microarray (TMA) spots, to whole-slide tumor sections and to samples from multiple centers. This suggests that ERBB2 status is reflected in breast cancer tissue architecture, and that this information can be captured and quantified with machine learning.

Further, we have shown that ERBB2 amplification–associated morphology, reflected by a high H&E-ERBB2 score, is correlated with the efficacy of adjuvant trastuzumab therapy and predicts a significantly more favorable distant disease-free survival in CISH ERBB2-positive patients. Conversely, in CISH ERBB2-positive patients who had a high H&E-ERBB2 score but were randomized not to receive trastuzumab, a trend towards a less favorable DDFS was seen. Similarly, in CISH ERBB2-negative patients who were not eligible to receive trastuzumab, a high H&E-ERBB2 score was associated with a significantly less favorable DDFS. This indicates that the ERBB2-associated morphology may contain significant therapy-predictive information to complement the molecular (CISH-based) analysis.

Our findings related to morphology-based prediction of the ERBB2 status are in line with those from recent studies, in which deep learning-based methods were used to predict ERBB2 and a series of other biomarkers in breast cancer from H&E-stained tissue microarray spots11,12. In our work, the accuracy of the prediction of the ERBB2 status was on a similar level (AUC 0.70) as in the previous study by Shamai et al. (AUC 0.74)11 when validated on tissue microarray samples. Our training set was smaller than theirs, which could explain the lower accuracy. Shamai et al. showed that increasing the number of training samples from 1,000 to 4,000 tumor samples correspondingly increased the AUC from 0.66 to 0.74 (ref. 11), suggesting that our results may be considered on par with theirs when the number of training samples is taken into account.

In this study, we were able to address several of the limitations raised in the previous study such as expanding the testing material from small tumor areas (TMAs) to whole slide samples and from a single center to a nationwide setting and multiple centers.

To better understand what ERBB2-associated morphological patterns the CNN has learned, we visualized network scores and activation maps on top of the corresponding histological images. In general, activation of the CNN-based ERBB2 predictor was focused on tumor epithelium, and larger and smaller nests of malignant epithelial cells, rather than stromal areas. In certain cases, the activation of the CNN was focused on individual cancerous epithelial cells. This is in line with what would be expected, given that the ERBB2 gene amplification and protein overexpression is occurring in cancerous epithelial cells.

Regarding the predictive information of the H&E-ERBB2 score that we found to complement the ERBB2 gene amplification status, one could speculate that the CNN has learned auxiliary tumor features associated with the efficacy of trastuzumab therapy and with DDFS. While features learned with weakly supervised machine learning may not necessarily represent the actual target (e.g. cells with a specific mutation or gene amplification), the tissue patterns identified using this approach can still yield relevant information about tumor biology2. Prognostic tissue microarchitectural features in breast cancer have been previously described and captured both with machine learning1 and conventional image analysis28. In these studies, a series of features including the tumor-stroma interface, the distance between stromal regions and the size of tumor nests were reported to be associated with survival in breast cancer1,28. In that context, the relationship between the H&E-ERBB2 score and molecular subtypes, tumor-infiltrating lymphocytes (TILs), and other prognostic tissue biomarkers should be further explored.

Future studies should aim to explore these tissue microarchitectural features in more detail, for example with multiplexed methods that allow morphological and molecular characterization (e.g. multiplexed immunohistochemistry or in situ sequencing) of the same tissue sections or using cell-level registration of consecutive tissue sections.

Our study has certain limitations that should be addressed in future studies. We did not have access to consecutive sections of samples with ERBB2 gene amplification results as determined by chromogenic in situ hybridization (CISH) or ERBB2 protein expression assessed with immunohistochemistry (IHC). Consecutive slides or multiplexed analysis of the same tissue section with H&E, CISH, IHC or in situ sequencing could allow improved localization of the ERBB2-associated features in the H&E morphology that predicted the efficacy of anti-ERBB2 treatment in our study. The overall accuracy, and especially the specificity, of the ERBB2 predictions could be improved by, for example, incorporating more training samples from external datasets. ERBB2 expression heterogeneity and its effect on survival should also be analyzed on the whole-slide sections. Another limitation was the restricted number of patients in some of the subgroups. To confirm our findings, analyses of additional patient series treated with adjuvant or neoadjuvant anti-ERBB2-targeted therapy are needed.

To the best of our knowledge, the present study is the first to show that a deep learning algorithm trained on tissue morphology and weakly supervised by the molecular status can learn patterns that predict the efficacy of adjuvant systemic therapy in patients with breast cancer. The present study suggests that machine learning can be used to extract predictive information when applied to routine tumor tissue sections and may help in identifying patients with ERBB2-positive cancer who would benefit the most from adjuvant trastuzumab. Our findings also suggest that some patients whose cancer is identified as ERBB2-negative by CISH do in fact have ERBB2-associated morphology, and thus potentially could benefit from targeted anti-ERBB2 therapy. This warrants further studies, perhaps with the H&E-ERBB2 score as a companion diagnostic assay in clinical trials with agents that show clinical activity in ERBB2-low-expression breast cancer29. Further research should also elucidate the amount of clinically relevant and actionable information that remains to be extracted from ubiquitous, inexpensive H&E-stained samples, such as features that predict the efficacy of hormone therapy and other molecularly targeted treatments.