Introduction

Clinical biomarkers are critically important in the diagnostic workup and treatment of breast cancer. Three molecular biomarkers form a cornerstone for optimized clinical decision making by providing prognostic information and predicting response to specific therapies: estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2)1. In current clinical practice, biomarker status is typically determined by histological inspection of immunohistochemistry (IHC)-stained tissue, using a separate IHC stain for each biomarker.

Biomarker profiles are known to be correlated with histologic features in breast cancer2,3,4,5. This is further recognized in the 2019 American Society of Clinical Oncology (ASCO) and College of American Pathologists (CAP) guidelines, which recommend follow-up for observed discordance between ER status and histologic findings, such as low-grade but ER-negative carcinoma6. Such follow-up includes a second review or repeat IHC staining and is meant to help ensure that technical issues, tumor heterogeneity, or interpretation variability are ruled out before the biomarker status is used for treatment decisions.

With the adoption of digital workflows in histopathology and recent advancements in machine learning, initial efforts have explored the possibility of using algorithms to predict biomarker status from hematoxylin and eosin (H&E)-stained tissue in breast cancer7,8,9 and other cancer types10,11. Not only could this approach provide a more efficient and accessible option than IHC, but it also offers the intriguing scientific potential to identify morphological features that correlate with biomarker status. Explainability efforts have provided some insights into the features learned via slide-level determination of biomarker status12, but further investigation of localized feature-prediction associations, as well as comparison of features learned by different modeling approaches, remains an important next step. Such work may also help pathologists identify morphological findings that further inform histology-biomarker discordances and reduce incorrect biomarker status reporting.

In this study, we develop models to predict three clinically relevant breast cancer biomarkers from H&E images, providing biomarker predictions for localized tumor regions as well as the slide-level summarization that has been the focus of prior studies. We also leverage multiple model-interpretation techniques to further investigate the associations between morphologic features and biomarker status predictions learned by the models. Taken together, our modeling and interpretability results provide both qualitative and quantitative assessment of morphological features relevant to biomarker prediction in breast cancer, highlighting that deep learning approaches in pathology can be accurate, informative, and interpretable.

Methods

Datasets

De-identified breast cancer data for this study came from three sources (Table 1): a tertiary teaching hospital, a medical laboratory, and TCGA13, 14. The teaching hospital contributed both formalin-fixed paraffin-embedded (FFPE) tissue blocks (from which new IHC-stained slides could be prepared) and archived H&E-stained slides. The medical laboratory contributed only tissue blocks, and TCGA contributed only archival H&E slides. Pathology reports were available for cases from all three sources. Inclusion criteria for the H&E images required the presence of invasive carcinoma in primary breast tissue specimens, as determined by pathologist review.

Table 1 Dataset summary for slides and cases used in model development and evaluation.

Slides from the teaching hospital and medical laboratory (whether archival or newly prepared) were scanned by Aperio AT2 digital scanners, and TCGA whole-slide images were digitized by Aperio and 3DHistech scanners and obtained via the Genomic Data Commons Data Portal (https://gdc.cancer.gov). The study protocol was approved and informed consent was waived by the Naval Medical Center San Diego (NMCSD) Institutional Review Board (IRB). This IRB approval covered the use of de-identified cases for the data from the tertiary hospital, the medical laboratory, and TCGA as used in this study.

First, to develop the patch-level DLS that predicts the biomarker status of each region of tissue, we prepared new, paired H&E and IHC slides from available tissue blocks from the hospital and laboratory. For each block, three 4 μm serial sections were prepared (Fig. 1); each section was first stained with H&E and digitized, then de-stained and re-stained with IHC for ER, PR, or HER2, respectively. Quality review of these IHC-stained slides was performed by multiple pathologists for a limited number of sample specimens prior to adopting this approach and showed results comparable to serial-section staining. Quality assurance requirements were also employed throughout, including recutting and re-staining for a small number of sections with poor staining of the initial H&E or re-stained IHC. After digitization and alignment (see Supplementary Methods), these adjacent sections enabled simultaneous review of H&E and corresponding IHC for precise determination and manual annotation of biomarker status. The patch-level stage of the DLS was trained using pathologist-labeled patches from 140 blocks, tuned using 60 blocks, and tested on 64 blocks (see Table 2 for numbers of labeled patches).

Fig. 1: Overview of data annotation, the two-stage deep learning system (DLS), and interpretability techniques.
figure 1

Annotation overview: paired H&E and IHC images were used to develop regional biomarker annotations (see Supplementary Fig. 4). Case-level biomarker status labels were obtained from available pathology reports. DLS overview: a model based on the Inception-v3 architecture was developed for each biomarker. Model interpretability: saliency maps and unsupervised clustering provided an exploratory approach to interpretability, while concept activation vector analysis provided hypothesis-driven analysis of features associated with biomarker predictions. H&E Hematoxylin and Eosin, ER Estrogen Receptor, PR Progesterone Receptor, HER2 human epidermal growth factor receptor 2.

Table 2 Data summary for patches used in model development and evaluation.

The second stage of the DLS, which provides slide-level summarization of biomarker status, was trained on biomarker status extracted from the pathology reports. This slide-level stage of the DLS was trained on 757 slides, tuned on 1377 slides, and tested on 3274 slides (see Tables 1 and 2 for a breakdown by source).

Biomarker labels

Labels were provided by a cohort of 16 pathologists (median post-training experience 8.5 years, range 3–21 years). For patch-level labels, pathologists were provided pairs of aligned H&E and IHC images to enable easy review of IHC status for specific regions of each H&E slide. Contiguous regions of invasive carcinoma were identified, without enrichment or selection of specific subtypes, and annotated as positive or negative for biomarker status using a 1% threshold for ER and PR and a 10% threshold of complete circumferential staining for HER2. For the slide-level review of biomarker status, pathologists reviewed the IHC slides when available and assigned positive, negative, or indeterminate for ER and PR and 0, 1+, 2+, or 3+ for HER2, as per CAP guidelines15.

For all archival cases, pathologists also reviewed de-identified pathology reports and extracted the reported biomarker status. For ER and PR, each case was categorized as either positive, negative, or unavailable. For HER2 status, IHC results were reported as positive, negative, equivocal, or unavailable. For equivocal cases, HER2 fluorescence in situ hybridization (FISH) results from pathology reports were also recorded as positive, negative, or unavailable and used when available.
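As a concrete illustration of this label-resolution rule, the hypothetical helper below derives a case-level HER2 label from reported IHC and FISH values; the function and value names are ours and are not part of the source pipeline.

```python
def resolve_her2_label(ihc_result, fish_result=None):
    """Resolve a case-level HER2 label from pathology-report values.

    Hypothetical helper illustrating the rule described above:
    an equivocal IHC result defers to FISH when a FISH result is available.
    """
    if ihc_result in ("positive", "negative"):
        return ihc_result
    if ihc_result == "equivocal" and fish_result in ("positive", "negative"):
        return fish_result
    return "unavailable"  # no usable case-level label for model development
```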

Deep Learning System (DLS) development

We developed a separate DLS for each biomarker (ER, PR, and HER2) to enable exploration of the possibility that different morphological features might be associated with each biomarker. Each DLS consists of two stages: the first is a deep convolutional neural network based on the Inception-v316 architecture that operates on image patches cropped from the whole-slide image. Each input image patch was 512 × 512 pixels at 5X magnification (1024 µm wide at 2 µm/pixel). The patches used for model development were randomly sampled across the complete training set without any enrichment for morphological features or histologic subtypes. The model was trained to classify each patch into one of three categories: biomarker-positive invasive carcinoma, biomarker-negative invasive carcinoma, and “non-tumor” (i.e., not invasive carcinoma). After the softmax layer, all predictions were in the range [0,1].
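As a rough illustration of this first stage, the following Keras sketch sets up a three-class Inception-v3 patch classifier. The optimizer, loss, and weight initialization shown here are placeholders rather than the training configuration reported in Supplementary Table 1.

```python
# Minimal sketch of a three-class patch classifier built on Inception-v3.
# Illustrative only: the authors' exact hyperparameters are in Supplementary Table 1.
import tensorflow as tf

NUM_CLASSES = 3   # biomarker-positive tumor, biomarker-negative tumor, non-tumor
PATCH_SIZE = 512  # 512 x 512 px at 5X magnification (2 um/pixel)

def build_patch_model():
    base = tf.keras.applications.InceptionV3(
        include_top=False, weights=None,  # weight initialization is an assumption
        input_shape=(PATCH_SIZE, PATCH_SIZE, 3))
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)  # predictions in [0, 1]
    model = tf.keras.Model(base.input, outputs)
    model.compile(optimizer="adam",  # placeholder optimizer
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```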

Training labels were collected as described above for each biomarker (one H&E-stained section labeled for each of the three biomarkers). To increase the number of labeled patches, pathologist-annotated H&E regions for each biomarker were replicated to the two available serial H&E images as a form of data augmentation that was found empirically to improve performance on a tuning set. For example, if a region of the H&E section was labeled as ER-positive (based on the paired IHC image), that ER-positive region label was propagated to the two serial H&E sections (even though those specific sections were not stained or evaluated for ER status). Further training details and other hyperparameters are provided in Supplementary Table 1.

The second stage utilizes features extracted from the patch-level output (i.e., heatmaps) to classify each slide as positive or negative for each biomarker. For each biomarker, three sets of features were used: a single feature indicating the ratio of biomarker-positive to biomarker-negative area, and two normalized histograms of patch-level predicted values, one for biomarker positivity and one for biomarker negativity.

To compute biomarker-positive and biomarker-negative areas from the continuous stage-1 predictions, we defined a single threshold t applied to both the biomarker-positive and biomarker-negative predictions. For example, patches with predicted probability of ER-positivity >= t were considered ER-positive, and patches with predicted probability of ER-negativity >= t were considered ER-negative. Patches not meeting either threshold were considered “non-tumor” (i.e., not invasive carcinoma). The ratio of positive to negative patches was used as the first feature for the stage-2 model. Next, we incorporated information about the full spectrum of predictions so that “borderline” predictions were not discarded and to reduce dependency on the exact threshold t. Specifically, we computed histograms with b evenly spaced bins over the patch-level predicted probabilities for biomarker-positivity and for biomarker-negativity. Prediction outputs below 0.1 were discarded because they tended to correspond to non-tumor tissue (i.e., neither biomarker positive nor negative). The histogram values were then normalized by the sum of biomarker-positive and biomarker-negative patch counts (based on the above threshold t). Finally, these features were used as input to a regularized logistic regression model (implemented in Python’s sklearn library, v0.23.2).
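The feature construction described above can be sketched as follows. The variable names, the handling of the 0.1 cutoff, and the exact bin placement reflect our reading of the text rather than the authors' implementation.

```python
# Sketch of the stage-2 feature construction (ratio + two normalized histograms).
import numpy as np

def slide_features(p_pos, p_neg, t=0.7, b=5):
    """Build slide-level features from stage-1 patch predictions.

    p_pos, p_neg: 1-D numpy arrays of per-patch predicted probabilities for
    biomarker-positive and biomarker-negative invasive carcinoma.
    """
    n_pos = np.sum(p_pos >= t)           # patches called biomarker-positive
    n_neg = np.sum(p_neg >= t)           # patches called biomarker-negative
    ratio = n_pos / max(n_neg, 1)        # positive-to-negative area ratio

    bins = np.linspace(0.1, 1.0, b + 1)  # b evenly spaced bins; <0.1 discarded
    denom = max(n_pos + n_neg, 1)        # normalize by total called patches
    hist_pos = np.histogram(p_pos[p_pos >= 0.1], bins=bins)[0] / denom
    hist_neg = np.histogram(p_neg[p_neg >= 0.1], bins=bins)[0] / denom

    return np.concatenate([[ratio], hist_pos, hist_neg])
```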

The threshold t, number of histogram bins b, L1 vs L2 regularization, and regularization strength C for the logistic regression were all tuned using 10-fold cross validation on the slide-level train and tune dataset (Table 1). The final hyperparameters used were: t = 0.7, b = 5 for ER and PR, b = 7 for HER2, L1 regularization with C = 0.077, 0.045, and 0.024 for ER, PR, and HER2, respectively.
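One possible form of this cross-validated search is shown below, reusing the hypothetical slide_features helper from the previous sketch. The candidate grids and the ROC AUC scoring criterion are illustrative assumptions rather than the authors' exact search.

```python
# Sketch of 10-fold cross-validated tuning of the stage-2 hyperparameters.
# Assumes the slide_features() helper from the sketch above is in scope.
from itertools import product
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def tune_stage2(pos_probs, neg_probs, labels):
    """pos_probs/neg_probs: lists of per-slide patch prediction arrays; labels: 0/1."""
    best = None
    for t, b, penalty, C in product([0.5, 0.6, 0.7, 0.8],      # threshold candidates
                                    [5, 7, 10],                 # histogram bin counts
                                    ["l1", "l2"],               # regularization type
                                    [0.01, 0.03, 0.1, 0.3, 1.0]):  # regularization strength
        X = np.stack([slide_features(p, n, t=t, b=b)
                      for p, n in zip(pos_probs, neg_probs)])
        clf = LogisticRegression(penalty=penalty, C=C, solver="liblinear")
        auc = cross_val_score(clf, X, np.asarray(labels), cv=10,
                              scoring="roc_auc").mean()
        if best is None or auc > best[0]:
            best = (auc, dict(t=t, b=b, penalty=penalty, C=C))
    return best
```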

Model interpretability

Testing association of DLS patch-level predictions with specific histologic concepts

To evaluate the association of specific histomorphological features with the biomarker status predictions made by our models, we performed analysis of concept activation vectors (CAV analysis)17. Briefly, CAV analysis quantitatively evaluates the degree to which a DLS associates a user-specified ‘concept’ with a particular predicted classification. This approach is hypothesis driven and thus requires a proposed set of concepts to test. Based on discussions with experienced breast subspecialist pathologists regarding known or likely associations with biomarker status, we identified six concepts for CAV analysis: high-grade carcinoma, low-grade carcinoma, invasive lobular carcinoma, ductal carcinoma in situ (DCIS), tumor-adjacent desmoplastic stromal changes, and tumor-infiltrating lymphocytes (TILs).

CAV analysis requires three categories of patches: patches representing particular histomorphological concepts (concept patches), patches known to be specific biomarker classes (known-class patches), and patches selected randomly from the entire dataset (random patches) to serve as a control. A minimum of 100 concept patches were randomly sampled from pathologist-annotated regions for the six concepts across 45 slides from the slide-level tuning set. Known-class patches consisted of 500 patches randomly sampled from the annotated positive and negative regions for ER, PR, and HER2 in the slide-level tuning set. Random patches consisted of 10 sets of 500 patches each, sampled randomly from all tissue areas.

To conduct CAV analysis, we generated activations at the final concatenation layer between the convolutional blocks and the fully connected layers for all concept patches and random patches; trained a set of 20 linear support vector machine (SVM) classifiers to distinguish between samplings of random and concept activations (see Supplementary Methods for more details); and then measured the directional derivative of the model’s prediction for a given class along a vector orthogonal to each SVM decision boundary (the concept activation vector)17. For each concept–biomarker pair, we report the TCAV score, defined as the fraction of known-class patches with a positive directional derivative.
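A simplified, single-CAV version of this computation might look as follows. It assumes precomputed activations at the relevant layer and a Keras head_model comprising the layers above that layer; both names and the SVM regularization strength are hypothetical choices, not the authors' pipeline.

```python
# Sketch of a TCAV score computation for one concept and one predicted class.
import numpy as np
import tensorflow as tf
from sklearn.svm import LinearSVC

def cav_vector(concept_acts, random_acts):
    """Train a linear SVM and return the unit normal to its decision boundary (the CAV)."""
    X = np.concatenate([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    svm = LinearSVC(C=0.01).fit(X, y)   # C is an illustrative value
    v = svm.coef_.ravel()
    return v / np.linalg.norm(v)

def tcav_score(head_model, class_index, known_class_acts, cav):
    """Fraction of known-class patches whose class score increases along the CAV.

    head_model: hypothetical Keras model of the layers above the activation layer.
    """
    acts = tf.convert_to_tensor(known_class_acts, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(acts)
        preds = head_model(acts)[:, class_index]
    grads = tape.gradient(preds, acts).numpy()
    directional = grads @ cav            # directional derivative along the CAV
    return float(np.mean(directional > 0))
```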

Unsupervised clustering

In this analysis, all patches from the patch-level training and tuning sets that were labeled as invasive carcinoma were clustered using a deep-learning-based model that has previously been shown to retrieve visually similar histopathology image patches18, 19. This model takes input patches of size 299 × 299 pixels; therefore, to obtain embeddings for our model’s input patches (512 × 512 pixels), we concatenated the image-similarity model’s embeddings for the 2 × 2 grid of overlapping sub-patches that constitute each of our patches.
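This sub-patch embedding step could be sketched as below, where embed_fn stands in for the previously published image-similarity model; the crop layout is our interpretation of the 2 × 2 overlapping grid.

```python
# Sketch: embed a 512 x 512 patch by concatenating embeddings of its four
# overlapping 299 x 299 corner crops.
import numpy as np

def patch_embedding(patch, embed_fn, sub=299):
    """patch: H x W x 3 array with H = W = 512; embed_fn: 299 x 299 x 3 -> 1-D vector."""
    size = patch.shape[0]
    offsets = [0, size - sub]  # top-left corners of the 2 x 2 overlapping grid
    parts = [embed_fn(patch[r:r + sub, c:c + sub])
             for r in offsets for c in offsets]
    return np.concatenate(parts)
```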

We qualitatively selected the minimal number of clusters (25) that maintained within-cluster visual consistency. Each cluster contained 900–3000 patches, representing 30–100 cases. Next, 5 pathologists, blinded to the DLS predictions, reviewed 10 patches from each cluster. These patches were selected as those closest to the cluster center while ensuring that each patch came from a distinct case. The pathologists provided free-text characterization of morphological features and completed a structured survey of histologic features.

To better understand which clusters were most similar in terms of predicted biomarker status, we next computed the patch-level predictions of ER, PR, and HER2 biomarker-positivity for every patch from the tuning set, obtaining a distribution of biomarker status predictions for each cluster. We then used hierarchical clustering with average linkage to group these clusters, using as the distance metric the sum of the earth mover’s distance across all three biomarker-positivity predictions. Finally, to characterize each group resulting from the hierarchical clustering, we manually examined the mean and median biomarker-positivity scores of each group.
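A minimal sketch of this grouping step is shown below, using SciPy's one-dimensional Wasserstein (earth mover's) distance and average-linkage clustering. The data structure and the cut into four groups are assumptions based on the description above and in the Results.

```python
# Sketch: group the 25 patch clusters by the similarity of their biomarker
# prediction distributions. cluster_preds[k] is assumed to be a dict with
# 'ER', 'PR', and 'HER2' arrays of per-patch positivity scores for cluster k.
import numpy as np
from scipy.stats import wasserstein_distance
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def group_clusters(cluster_preds, n_groups=4):
    keys = sorted(cluster_preds)
    n = len(keys)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # Sum of earth mover's distances over the three biomarkers.
            d = sum(wasserstein_distance(cluster_preds[keys[i]][m],
                                         cluster_preds[keys[j]][m])
                    for m in ("ER", "PR", "HER2"))
            dist[i, j] = dist[j, i] = d
    Z = linkage(squareform(dist), method="average")   # average linkage
    return dict(zip(keys, fcluster(Z, t=n_groups, criterion="maxclust")))
```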

Saliency maps

Finally, to better understand the predictions at the pixel level, we leveraged SmoothGrad20. Briefly, SmoothGrad calculates the gradient of the output prediction with respect to the input pixels and averages these gradients across multiple copies of the input image (n = 8 in our work), each with pixel-wise Gaussian noise added. Saliency maps were generated for a minimum of 100 patches per biomarker (50 each for the positive and negative classes), and a subset was manually chosen for independent review by two pathologists. The pathologists were presented with a high-magnification version of the model input patch and the corresponding saliency map overlay, and asked to qualitatively assess the most salient regions on the patch.
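A minimal SmoothGrad sketch for a Keras patch classifier is shown below; the noise level and the reduction to a single-channel saliency map are illustrative choices not specified in the text.

```python
# Minimal SmoothGrad sketch, assuming `model` maps a batch of patches to
# class probabilities; n = 8 noisy copies as described above.
import numpy as np
import tensorflow as tf

def smoothgrad(model, patch, class_index, n=8, sigma=0.1):
    """Return a pixel-wise saliency map for one patch (H x W x 3, values in [0, 1])."""
    patches = np.repeat(patch[None], n, axis=0)
    patches = patches + np.random.normal(0.0, sigma, patches.shape)  # Gaussian noise
    x = tf.convert_to_tensor(patches, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        preds = model(x)[:, class_index]
    grads = tape.gradient(preds, x).numpy()
    # Average gradients over the noisy copies; collapse channels for display.
    return np.abs(grads).mean(axis=0).max(axis=-1)
```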

Statistical analysis

Model performance was evaluated for both patch-level and slide-level predictions by calculating the AUC for each biomarker. Confidence intervals for patch-level and slide-level AUCs were computed via bootstrapping by sampling with replacement (1000 iterations) using Python’s sklearn package, v0.23.2. Reported patch-level AUCs are one-vs-all for the biomarker-positive invasive carcinoma class. Slide-level AUCs represent binary classification of biomarker-positive versus biomarker-negative status. Confidence intervals for the TCAV analyses were computed using the same method with 100 iterations.
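The bootstrap procedure can be sketched as follows; the resampling unit (individual predictions) and the fixed random seed are our assumptions.

```python
# Sketch of a bootstrap AUC confidence interval (1000 resamples by default).
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_iter=1000, alpha=0.05, seed=0):
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    rng = np.random.default_rng(seed)
    aucs = []
    for _ in range(n_iter):
        idx = rng.integers(0, len(y_true), len(y_true))  # sample with replacement
        if len(np.unique(y_true[idx])) < 2:              # AUC needs both classes
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return roc_auc_score(y_true, y_score), (lo, hi)
```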

Results

Our approach involves a two-stage deep learning system (DLS) for each biomarker. The first stage predicts the local biomarker status for individual, cropped image patches representing small regions of tissue. The output of this prediction is one of three classes: biomarker-positive tumor, biomarker-negative tumor, or non-tumor. The second stage of the DLS predicts the slide-level biomarker status using the predictions of the first stage across every patch in the slide (Fig. 1, Methods).

Patch-level model status prediction

The first stage of the DLS was developed (trained and tuned) using 1.21 billion patches from 576 slides across 200 cases, and evaluated on a test set of all patches from 181 slides across 64 cases (Table 1 and Supplementary Table 2). We report one-versus-all classification performance across all patches (biomarker-positive tumor, biomarker-negative tumor, or non-tumor). The patch-level areas under the receiver operating characteristic curves (AUCs) were 0.939 (95% CI 0.936–0.941), 0.938 (95% CI 0.936–0.940), and 0.808 (95% CI 0.802–0.813) for ER, PR, and HER2, respectively (Fig. 2a). Examples of the patch-level predictions along with the corresponding IHC images are shown in Fig. 3 and Supplementary Figs. 1 and 2. We observed that heterogeneous staining was indeed associated with heterogeneous predictions, in contrast to the uniformly positive patch-level predictions observed for homogeneous cases (Supplementary Figs. 1 and 2).

Fig. 2: Deep learning system performance.
figure 2

ROC curves for model performance are shown for a patch-level predictions across all tissue patches of whole-slide images, b slide-level predictions of the stage-2 model output on the full test set, and c, d subanalyses of slide-level performance on the independent data sources of the slide-level test set. Patch-level analysis represents 3-class performance (biomarker-positive invasive carcinoma, biomarker-negative invasive carcinoma, or non-tumor) and slide-level performance represents positive versus negative classification for biomarker status (all slides in the final datasets contain tumor). Binary patch-level performance for biomarker status across tumor regions only is shown in Supplementary Fig. 5. The numbers of slides, cases, and patches used for this analysis are indicated in Tables 1 and 2. ER Estrogen Receptor, PR Progesterone Receptor, HER2 human epidermal growth factor receptor 2.

Fig. 3: Visualization of predictions and annotations.
figure 3

a Sample cases for which DLS predictions are concordant with region-level pathologist annotations. b Sample cases for which DLS predictions are discordant with region-level pathologist annotations. Predictions are represented by heatmaps with colors ranging from blue (low predicted probability) to red (high predicted probability), and with black indicating non-tissue. In the annotation masks (black and white), annotations are represented by white regions, and corresponding labels indicated in the column header. H&E Hematoxylin and Eosin, IHC Immunohistochemistry, DLS Deep Learning System, ER Estrogen Receptor, PR Progesterone Receptor, HER2 human epidermal growth factor receptor 2.

Slide-level biomarker status prediction

The second stage of the DLS, which provides slide-level biomarker status predictions, was developed (trained and tuned) using 2134 slides from 264 cases, and evaluated on a test set containing a total of 3274 slides from 1249 cases across a tertiary hospital and The Cancer Genome Atlas (TCGA13, 14, representing 36 unique tissue source sites), summarized in Table 1. On the combined test set, the AUCs for binary biomarker status classification were 0.86 (95% CI 0.84–0.87), 0.75 (95% CI 0.73–0.77), and 0.60 (95% CI 0.56–0.64) for ER, PR, and HER2, respectively (Fig. 2b).

We also evaluated AUC separately for the two test set data sources, observing AUCs of 0.87 (95% CI 0.85–0.89) for ER, 0.76 (95% CI 0.74–0.78) for PR, and 0.59 (95% CI 0.53–0.66) for HER2 for the tertiary hospital (Fig. 2c), and 0.83 (95% CI 0.80–0.86) for ER, 0.72 (95% CI 0.69–0.76) for PR, and 0.58 (95% CI 0.53–0.63) for HER2 for TCGA (Fig. 2d).

Model interpretability

To further understand the biomarker predictions, we leveraged three approaches for model interpretability. First, we quantitatively tested whether the features used by the trained model corresponded to existing histologic concepts using an approach called TCAV17 (Testing with Concept Activation Vectors). Second, we performed histopathologic characterization of features shared by patches clustered by visual similarity and grouped by predicted biomarker patterns. Finally, we explored what features in image patches most strongly impacted biomarker predictions using a pixel-based saliency approach (SmoothGrad20).

Testing association of DLS patch-level predictions with specific histologic concepts

Based on discussion with breast subspecialist pathologists, we selected 6 specific histologic features for which we generated concepts for TCAV analysis: high-grade carcinoma, low-grade carcinoma, invasive lobular carcinoma, ductal carcinoma in situ (DCIS), tumor-adjacent desmoplastic stromal changes, and tumor-infiltrating lymphocytes (TILs) (Fig. 4a). In this analysis, a high TCAV score for a given concept (e.g., high-grade carcinoma) indicates that the specific DLS biomarker prediction is associated with that concept (see Methods for additional details). Figure 4b and Supplementary Table 3 show each concept’s TCAV score for both positive and negative status predictions for each biomarker. ER-positive predictions were found to be associated with the low-grade concept, and ER-negative predictions with TILs. For PR, positive predictions were more strongly associated with the low-grade, lobular, DCIS, and desmoplasia concepts, while PR-negative predictions were more strongly associated with the high-grade concept. For HER2, negative predictions were more strongly associated with the low-grade carcinoma and lobular carcinoma concepts.

Fig. 4: Testing with Concept Activation Vector (TCAV) analysis.
figure 4

a Representative concept patches for each of the six concepts used for TCAV analysis. b TCAV scores for positive (blue) and negative (red) biomarker status for each of the three biomarkers. Error bars are 95% confidence intervals over 20 trials using 500 class-of-interest patches, 500 random patches, and 100 concept patches per trial. Higher scores indicate stronger association of the concept with the model’s representation of that biomarker status. Detailed TCAV scores are provided in Supplementary Table 3 and Supplementary Data 2. TIL tumor infiltrating lymphocytes, ER Estrogen Receptor, PR Progesterone Receptor, HER2 human epidermal growth factor receptor 2, DCIS Ductal Carcinoma In Situ.

Unsupervised clustering

Next, to further evaluate associations of histologic features with biomarker predictions in an open-ended manner, we generated 25 clusters of visually similar patches using a deep-learning based clustering approach (Methods). These clusters were then presented to pathologists for histologic characterization (without knowledge of predicted biomarker status). Then, to evaluate the predicted biomarker status for these 25 clusters, we performed hierarchical grouping of the clusters based on the three biomarker prediction scores across each cluster (Methods). This identified four groups of clusters with characteristic biomarker scores: high ER/PR and low HER2 (10 clusters), low ER/PR/HER2 “triple negative” (6 clusters), high ER/PR/HER2 “triple positive” (2 clusters), and “intermediate/mixed” ER/PR/HER2 (7 clusters). These groupings represent the patch-level prediction scores across the cluster and not the final slide-level classification status, for which even small regions of positivity may correspond to positive status. These clusters are further described and shown in Fig. 5, Supplementary Fig. 3a–d, and Supplementary Data 1.

Fig. 5: Visualizing similar patches grouped by predicted biomarker status.
figure 5

Sample patches from clusters belonging to each biomarker status prediction grouping are shown. First, patches were clustered based on visual similarity (Clusters 0–24), and then clusters were grouped by applying a second clustering step (hierarchical clustering) using the predicted biomarker status distributions for each cluster (as visualized in the dendrogram). Distributions of biomarker status predictions within each cluster and additional patches for each cluster are shown in Supplementary Fig. 2a–d. Pathologist-provided descriptions of tumor and stromal features for each cluster are summarized in Supplementary Data 1, with additional biomarker prediction and cluster data provided as Supplementary Data 3. For the sample clusters with representative patches in this figure, brief descriptions are as follows: clusters 10 and 1: largely low- and intermediate-grade tumor with high stromal content; clusters 24 and 8: intermediate-grade tumor with a diffuse pattern of invasion resulting in a predominant adipose tissue component; cluster 14: intermediate-grade tumor with inflammatory cell infiltrates; cluster 15: intermediate-grade tumor forming sheets and occasional cribriform morphology; cluster 4: intermediate-grade tumor forming cords; clusters 20, 17, and 9: low- (17), intermediate- (9), and high-grade (20) tumor with variable architecture, occasional cribriform morphology, and stroma with moderate sclerosis; clusters 23 and 22: intermediate-grade, high-tumor-content patches with moderate sclerosis; clusters 19 and 2: low tumor content and presence of extracellular mucin; clusters 11 and 13: high-to-intermediate-grade tumor and high tumor content as well as inflammatory cell infiltrates. Image patches all represent 1024 μm × 1024 μm. ER Estrogen Receptor, PR Progesterone Receptor.

In the first group (high ER/PR and low HER2; Supplementary Fig. 3a), we observed clusters exhibiting largely low and intermediate grade tumor along with high stromal content (clusters 10 and 1), high fat content (clusters 24, 7, 8, and 5), and inflammatory cell infiltrates (clusters 14 and 5). The “triple negative” group (low ER/PR/HER2; Supplementary Fig. 3d) exhibited several clusters with high-intermediate grade tumor and high tumor content as well as inflammatory cell infiltrates (clusters 11, 12, and 13). Additional clusters in this group were notable for low tumor content and presence of extracellular mucin (clusters 2 and 19) or predominant adipose tissue (cluster 3). The third group (high ER/PR/HER2; Supplementary Fig. 3c) consisted of only two clusters, with high-tumor-content patches exhibiting desmoplasia/sclerosis (clusters 22 and 23) and inflammatory cell infiltrates (cluster 22). The last group (Supplementary Fig. 3b) was characterized by substantial variability in model-predicted biomarker status (i.e., the predicted biomarker status was intermediate or mixed, with some patches predicted to be biomarker positive and others negative within a given cluster). This group exhibited largely intermediate grade tumor with variable tumor architecture and stroma. Four of the clusters in this group were noted by pathologists as containing tumor with cribriform morphology (clusters 18, 20, 17, 9).

Saliency maps

Finally, we utilized SmoothGrad20 to examine the pixels within each patch that most influenced the patch-level biomarker predictions (Fig. 6). Pathologists reviewed saliency maps for 30 patches randomly selected from each of the positive and negative classes for each biomarker model. For all three biomarkers, saliency maps consistently highlighted tumor cells. Low-grade tumor and linear arrangements of invasive carcinoma cells (consistent with lobular carcinoma) were identified as highly salient elements for ER- and PR-positive predictions. Saliency maps for HER2 positive patches highlighted small clusters of tumor cells as one consistent feature.

Fig. 6: Saliency analysis for patch-level biomarker status prediction.
figure 6

Examples of saliency maps for patches representing each biomarker status prediction; ER (a), PR (b), and HER2 (c). Inset regions highlight higher magnification on both H&E and IHC for the most relevant regions of saliency analysis. Warm colors on the overlay indicate that the underlying pixels from the H&E slide were salient to the model’s predicted probability for the indicated biomarker status. ER Estrogen Receptor, PR Progesterone Receptor, HER2 human epidermal growth factor receptor 2.

Discussion

Our results demonstrate deep learning-based prediction of ER, PR, and HER2 status directly from histologic features on H&E-stained breast cancer resection specimens. The deep learning models were validated across multiple datasets representing variability in staining and slide preparation. In addition to slide-level predictions, our models provide interpretable, precise predictions of biomarker status for individual tissue regions on H&E slides. These “patch-level” models further enabled multiple interpretability approaches to probe the features associated with the model’s predictions.

Prior reports have described promising results for predicting slide-level biomarker status in breast cancer from H&E images. These works have largely focused on ER prediction, including work on tissue microarray (TMA) datasets7, 8 and more recently on whole-slide images utilizing weakly supervised12 and unsupervised approaches9 for model development. Our findings with a strongly supervised approach further validate the feasibility of slide-level predictions of biomarker status in breast cancer.

In contrast to prior work, however, we adopted a strongly supervised approach for patch-level model training that was enabled by H&E images with paired IHC slides for all three biomarkers. This enables direct interpretation of the models’ regional predictions and demonstrates that it is possible to identify a link between regional histology and biomarker status. Even if the model prediction for specific regions does not always map perfectly to the IHC stain itself, the observation that heterogeneous predictions correspond to heterogeneous IHC staining suggests that the models could be used to help identify and evaluate cases with biomarker heterogeneity (Supplementary Fig. 1). Because even a small amount of biomarker positivity can influence the clinical status interpretation (e.g., >1% for ER is considered positive), region-level precision could also be useful to facilitate improved quality control review of IHC by highlighting regions likely to be biomarker positive or negative. Additionally, our patch-level approach enables useful tumor vs. non-tumor segmentation and the potential to distinguish slides containing varying amounts of invasive carcinoma. By contrast, prior weakly supervised approaches do not provide information about the quantity of tumor. As only a single slide is typically selected for IHC, the ability of our model to provide information on both tumor quantity and biomarker status could also improve and potentially help automate the slide selection process for IHC and/or pathologist review. Lastly, while we focused on binary classification as a clinically relevant task, a similar patch-level approach could also be used to develop semi-quantitative models to predict staining intensity and the proportion of positive cells in specific regions. Such models could provide additional utility for flagging equivocal or “low positive” cases and highlighting relevant regions within those slides.

Another important contribution of this work is the utilization of multiple interpretability methods to understand how the models associate morphological features with biomarker status. These specific approaches were enabled by the development of a patch-level model and the results demonstrate human-interpretable associations between morphological features and biomarker status learned by the deep learning model. This work also represents one of the first applications of CAV analysis to histopathology, and demonstrates the opportunity to apply this type of hypothesis-driven feature exploration to other concepts in breast cancer and other deep learning efforts in pathology. The clustering-based analyses utilize semi-quantitative review from multiple pathologists to characterize tumor and stromal features in an unbiased manner, thus expanding on initial qualitative efforts to use clustered visual features12, 21.

The interpretability finding that the models learned known histology-biomarker associations (involving tumor grade, lobular versus ductal type, and TILs) demonstrates that these approaches can highlight relevant features, representing an important step towards building pathologist trust in deep learning approaches. Interestingly, many of these features overlap with those highlighted recently by Naik et al. utilizing a weakly supervised modeling approach and an attention-based explainability approach, providing a unique example of the ability to compare features learned via different approaches for a related histopathology task. Other observed associations, such as the enrichment of adipose-predominant patch clusters with ER+/PR+/HER2− predictions or cribriform tumor morphology for clusters with mixed biomarker prediction scores, form a basis for future efforts to explore potentially novel feature-biomarker associations. Also, although the three biomarker models were trained independently, the aggregate predictions provided cluster groupings that are at least in part consistent with the known subtypes of breast cancer. For example, the cluster groupings include an “ER-positive/HER2-negative” group and a “triple-negative” group, consistent with luminal and basal-like subtypes, respectively. Utilizing deep learning and interpretability techniques to further explore morphological associations with combinations of biomarkers, rather than individual biomarkers, represents another potential application of this type of work.

Nonetheless, not all associations found were consistent with known histologic-biomarker relationships. For example, the finding that patches from mucinous tumors (e.g., cluster 19) were predominantly classified as ER/PR negative does not align with studies that find these tumors to be mostly ER/PR positive22. This may be due in part to limited training data for mucinous tumors, as well as to the fact that IHC staining itself may be variable in mucinous patches. Relatedly, as the cases used for this study were not selected or enriched for any particular invasive breast carcinoma subtype, the training dataset inherently consists mostly of the most common histologic subtypes and morphologic features. Developing and evaluating biomarker models for specific histologic subtypes of breast cancer may also be useful and may result in different or additional insights and learned features.

While our models appear to have primarily learned known feature-biomarker associations, we envision that hypothesis-driven TCAV analysis, along with hypothesis-generating cluster analysis, can form a useful interpretability framework for identifying both known and unknown features learned by AI models in pathology. This raises interesting questions of both feasibility and trust in regard to identifying known versus unknown features learned by a model. A key challenge for any effort aimed at discovering novel features is avoiding the inherent bias of humans towards “seeing” and describing only known features. Utilizing complementary approaches and obtaining “blinded” input from multiple experts may represent an initial step in addressing this challenge, but creative approaches that incorporate machine learning feature extraction and pathologist expertise will be required, given that some learned features may not be identifiable by visual review alone. Any novel features or associations that are identified will also require validation and proven reliability, just as for the known biomarker-feature associations that have been established over time.

Our work does have some limitations. With regard to slide-level evaluation, the historical biomarker statuses (sampled and tested between 1988 and 2016) are only provided at the case level (without reference to the specific slide or block used for IHC evaluation). By applying case labels to every slide within a given case, we may be training or evaluating with noisy slide-level biomarker labels in instances of heterogeneous biomarker expression across slides in a case. While we expect such within-case heterogeneity to affect only a small portion of cases, this issue does not impact the IHC-based patch-level training and evaluation, further highlighting the value of this aspect of model development as an important contribution of the present work. Additionally, technical and interpathologist variability impacting interpretation of IHC may also affect annotations and historically reported biomarker status, particularly for the TCGA dataset because it includes data from multiple institutions. Of note, interobserver variability is likely to be a more substantial issue for HER2 than for ER and PR23,24,25, and as such, the lower model performance for HER2 may in part reflect a less reliable ground truth for this particular biomarker, especially for labels based on historical clinical reports. IHC protocols and interpretation guidelines have also changed over time, further contributing to potential variability across historical clinical labels. Still, the relatively lower performance of the HER2 models warrants future exploration to understand whether associated feature diversity, the relative proportion of positive and equivocal cases, or other factors contribute, and whether larger training datasets can overcome these challenges. Future work using IHC-based labels for both slides and cases, complete clinical slide sets, and prospective studies all represent valuable steps towards clinical validation and implementation. Lastly, demographic data such as race, ethnicity, and age were not consistently available in our datasets, limiting subgroup analysis of the model’s performance. Further validation across demographically diverse cohorts is required.

In summary, this study demonstrates generalizable deep learning models for predicting ER, PR, and HER2 status in breast cancer from H&E images and expands upon the growing body of literature on rapid biomarker estimation from routine histology slides. While further performance improvement and validation are still needed before automated breast cancer biomarker prediction models find their way into clinical workflows, initial utility may be realized via research and quality control applications. Specifically, biomarker-based selection or triage of patients within large clinical trials could create substantial efficiency gains for therapy development pipelines. Automated biomarker interpretations could also supplement IHC workflows by identifying equivocal cases for appropriate follow-up evaluation or flagging potential technical issues based on discordant IHC and model predictions. Furthermore, this approach could help identify heterogeneous tumors or select the most informative tissue blocks for biomarker evaluation. Lastly, interpretability methods for identifying histologic features associated with biomarker status could guide researchers to investigate new biological mechanisms and molecular targets related to the underlying morphologic findings.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.