Main

The companion diagnostic approach seeks to dictate therapeutic strategy based on a molecular description of a patient's disease, with drugs targeting the HER2 receptor one of the most widely adopted examples. There are well-established guidelines for selecting patients for anti-HER2 adjuvant therapies in breast cancer treatment. However, even with patient selection, many trastuzumab-treated patients do not benefit from therapy, as their disease progresses or becomes recurrent. For example, about 1/3 of breast cancer patients given Herceptin fail to respond (de novo resistance), and about 1/5 of the responsive patients become refractory (acquired resistance).1 The proportion of patients who are not responsive to therapy, even with the inclusion of a companion diagnostic to predict patient response, indicates that the current approaches to treatment strategy and patient selection may not be as robust as possible.

The current HER2 immunohistochemistry (IHC) score methodology does not account for heterogeneity. Since 2007, the American Society of Clinical Oncology and College of American Pathologists (ASCO/CAP) have recommended specific guidelines for HER2 scoring.2 These guidelines call for a consistent process of sample preparation and staining (IHC) or hybridization (FISH), as well as score reporting. The ASCO/CAP guidelines also suggest using the terms ‘positive’ (3+), ‘equivocal’ (2+), or ‘negative’ (1+ to 0) to define HER2 scoring. According to the IHC scoring methodology, the difference between a 1+ and a 2+ score is a description of ‘faint’ (0/1+) compared with ‘weak-to-moderate’ (2+) membrane staining in more than 10% of the tumor cells. In contrast, a 3+ score is described as uniform intense membrane staining of >30% of tumor cells. Thus, this widely used scoring approach is semiquantitative, as it relies on a threshold percentage of positive cells to determine the score. Importantly, this overall score does not include any additional information about the percentages of tumor cells that score beyond the threshold levels. The ASCO/CAP guidelines for HER2 FISH scoring also rely on a stratified HER2/CEP17 ratio for the score.

This lack of information about variability within the tumor, or between tumors with the same score, blinds clinicians to a potential readout that could represent a biology responsible for non-effective responses to therapy. It is intuitive that differential cell populations within or between tumors could contribute to clinical resistance to therapy and thereby affect patient outcomes. Not all of the factors within individual patients that contribute to a lack of response are known, but cancer biologists have long hypothesized that such disparate populations within the tumor can be selected for outgrowth and emerge as a resistant tumor. This concept of tumor heterogeneity leading to drug resistance was debated as early as the 1950s as the ‘Greenstein Hypothesis’, and has become part of cancer biology doctrine.3 In more recent times, as more targeted therapies are being developed, the issue of tumor heterogeneity has re-emerged as a factor significant to clinical strategy. Thus, there is a need for clinical evaluation of tumor heterogeneity that is aligned with the emerging understanding of cancer biology.

Studies of intratumoral heterogeneity from the same site demonstrate that heterogeneity can affect prognosis in 2+ scored tissues.2, 4 Another study found 16% of 3+ score cases exhibiting tumor heterogeneity.5 A recent case6 documented the personal significance of tumor heterogeneity, where a patient with invasive breast carcinoma demonstrated HER2 gene amplification on core biopsy, but relapsed while on adjuvant trastuzumab therapy after mastectomy, dying 15 months after diagnosis. Metastases harvested at autopsy demonstrated no evidence of HER2 gene amplification, and retrospective examination of the carcinoma in the patient's mastectomy specimen revealed only focal HER2 amplification within the tumor, localized to the region of the prior core biopsy site, highlighting the importance of both adequate sampling and awareness of heterogeneity issues. Another case was noted7 where a patient with breast cancer had areas of the tumor that were 3+ positive and negative for HER2/neu by IHC, adjacent to each other. These cases represent an underlying biology of tumor heterogeneity, which contributes to the clinical outcome.

The assessment of HER2 protein expression status in breast cancer provides a useful working example of tumor heterogeneity for future biomarker studies. There are substantial biological and clinical implications of intratumor clonal heterogeneity.8, 9 This heterogeneity may reside within a single tumor (intratumoral), or between tumors at different sites (intertumoral). Consequently, researchers have attempted to identify the levels of clinically observed heterogeneity in multiple studies of HER2/neu in breast carcinoma, the results of which are summarized in Table 1. Eight different studies of HER2 heterogeneity between primary breast tumor and metastasis demonstrate low discordance rates, between 0 and 13%, with the majority of studies under 5% discordant. Thus, determining the disparity between primary tumor and metastases may not be of high clinical priority. However, one recent study found discordance rates of 14% between core needle and excisional biopsies, suggesting that tumor heterogeneity could contribute to misclassification when needle biopsies are used.10 The ASCO/CAP guidelines define HER2 genetic heterogeneity in FISH testing as >5%, and noted that the incidence of intratumor heterogeneity by this definition ranged in the literature from 5 to 30%.11

Table 1 Studies of tumor heterogeneity in HER2/neu

Accordingly, the ability to measure tumor heterogeneity may assist clinicians in verifying the predictive value of the HER2 score. It is critically important that the profession begin to develop improved approaches for reporting heterogeneity in samples. In the discipline of stereology, unbiased sampling is obtained by utilizing an entire tissue block, and randomly sampling both the sections and regions within a section to eliminate bias.12 However, a heterogeneity measurement seeks to start with the entire population, and then sample in an unbiased manner to determine a representative variation. In addition, in clinical trials, it is difficult, if not nearly impossible, to obtain the blocks required for stereology sampling, so the industry is left with one or several tissue slides as the specimen from which to obtain heterogeneity assessments.

As pathology evolves into a more digital and quantitative discipline, the challenge of quantifying tumor heterogeneity comes more clearly into focus. Whole slide imaging and quantification techniques for the evaluation of IHC biomarkers facilitate an approach for measuring tumor heterogeneity. The ability to distinguish and score individual cells across the whole tissue provides sufficient content to reliably assess the diversity of a biomarker within the sample. Combined with a mathematical approach to describe a measure of variation within the sample, a heterogeneity index can be created. In this report, a novel, functional approach is described that assigns a numerical value to HER2 score diversity within a tumor sample, and thus serves to quantify heterogeneity. This output can be included with other digital pathology-based measurements of IHC biomarkers to provide a more contextual value to the numerical score. Two definitions are introduced to further assist with describing heterogeneity: cell-level and tumor-level heterogeneity (Figure 1). Cell-level heterogeneity (Hetcell) is the variability of cells within a nest of tumor, and tumor-level heterogeneity (Hettumor) is the variability of nests of cells across an entire tumor. There is only one score per slide for Hettumor, but as each nest or sampled region in a tumor has its own Hetcell score, it is challenging to combine these into a single measure for a given slide. Thus, several approaches are examined to aggregate measures of cell-level heterogeneity across a slide.

Figure 1

Definitions of cell-level (above) and tumor-level heterogeneity (below). Slide-level heterogeneity is a sampling substitute for tumor-level heterogeneity. The lower panel also illustrates some contributions of anatomic heterogeneity, as parts of the lesser stained areas are ductal carcinoma in situ (DCIS).

Numerical Indices of Tumor Cell Diversity

Diversity measurement is a well-established field in the ecological sciences, and numerous approaches to quantifying the variability of species have been utilized in this discipline. Ecologists describe diversity in terms of richness and evenness, and areas can be ranked differently depending on the weighting of these concepts. For example, one area might have only two species, each covering half the area. The second area might have six different species, with one dominant species covering 95% of the area, and the other five each only covering 1%. Defined in terms of richness, the second area with six different species would be considered more diverse. Defined in terms of evenness of distribution, the first area would be more diverse as it avoids having one type dominating over all others. Two commonly used diversity indices for measuring plant and animal species diversity are the Shannon index13 and the Simpson index.14 The Shannon index of diversity is defined as:

$$H = -\sum_{i=1}^{N} p_i \ln p_i$$

where N is the number of biological types and pi the proportional abundance of the ith type. This index, ranging in theory from 0 to infinity, estimates the average uncertainty in predicting to which species type a randomly selected subunit of area belongs. The Simpson index is defined as:

$$D = 1 - \sum_{i=1}^{N} p_i^{2}$$

Producing values from 0 to 1, Simpson's index defines the probability that two randomly selected equal-sized subunits of terrain belong to different species. A recent evaluation of tumor heterogeneity pioneered the use of both Shannon and Simpson indices in evaluating 8q24 copy number gain in both CD24+ and CD44+ cell populations in ductal carcinoma in situ and invasive regions of tumors.15 Copy numbers at each of three levels were considered as separate ‘species’ and the indices applied to deliver a measure of heterogeneity within each sample. Two distinct tumor subtypes of high and low diversity of 8q24 copy number, as measured by the Shannon index, were identified, and the group with lower diversity contained fewer samples of HER2+ tumors. There was no difference between diversity of the luminal A tumors and the normal cells, although basal-like tumors tended to have higher diversity scores. In this study, few qualitative differences were seen between Shannon and Simpson indices, although the data set was small. The Shannon index tends to blur distinctions of species richness and evenness, while the Simpson index can be dominated by the most abundant species in the population.
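As a rough illustration of how the two indices behave, the following Python sketch evaluates them on the two hypothetical areas of the ecological example above. It assumes the natural logarithm for the Shannon index and the "1 minus sum of squares" form of the Simpson index; the function and variable names are illustrative and are not taken from the study.

```python
import math

def shannon_index(proportions):
    """Shannon index H = -sum(p_i * ln p_i); types with zero abundance are skipped."""
    return -sum(p * math.log(p) for p in proportions if p > 0)

def simpson_index(proportions):
    """Simpson index D = 1 - sum(p_i ** 2)."""
    return 1.0 - sum(p * p for p in proportions)

# Hypothetical areas from the ecological example (proportions sum to 1).
area_one = [0.5, 0.5]                             # two species, evenly split
area_two = [0.95, 0.01, 0.01, 0.01, 0.01, 0.01]   # six species, one dominant

for name, area in (("area one", area_one), ("area two", area_two)):
    print(name,
          "richness:", len(area),
          "Shannon:", round(shannon_index(area), 3),
          "Simpson:", round(simpson_index(area), 3))
```

Under these assumptions both indices rank the first area as more diverse (Simpson 0.5 against roughly 0.10), even though the second area is richer in species, which shows how strongly evenness weighs in both measures.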

The disadvantage of both Shannon and Simpson indices is that they do not account for taxonomic distance between species. In the world of clinical anatomic pathology, most cells are binned and scored as one of three or four classes. In HER2 scoring methodology, pathologists (or pathologist-trained computer programs) score cells as populations of either 0+, 1+, 2+, or 3+ intensity. Consider two regions: Region A with ten 0+ cells and ten 3+ cells, and region B with ten 1+ cells and ten 2+ cells. Clearly, Region A has a higher level of heterogeneity than Region B, but Shannon and Simpson indices would score these as equal heterogeneity. To overcome this problem, an ecological diversity approach known as Rao's quadratic entropy (QE)16 was used. A distance matrix is incorporated in the diversity index, where, for example, a difference between a 0+ and 3+ cell would be weighted a ‘3’, and a 1+ to 2+ would be weighted a ‘1’. When all weights are the same, the scoring schemes tend to be equivalent to those mentioned previously.

The equation is as follows:

$$QE = \sum_{i=1}^{N}\sum_{j=1}^{N} d_{ij}\, p_i\, p_j$$

where N is the total number of species, pi and pj are the proportions of the ith and jth species, respectively, in the sampling unit, and dij is a member of the symmetric taxonomic distance matrix D̄ (dij=dji and dii=0). The values of D̄ can be adjusted to match differences in classes in a particular biological application, and the matrix shown above was utilized in this study for HER2-expressing cells. As an example of the flexibility of this approach, studies in plant ecology17 have utilized the following matrix, where dij=1 if both species belong to the same genus, dij=2 if both species belong to the same family but different genera, dij=3 if both species belong to the same order but different families, dij=3.5 if both species belong to the same subclass but different orders, dij=4 if both species belong to the same class but different subclasses, dij=4.5 if both species are angiosperms but are from different classes, and finally dij=5 otherwise.

The differences in diversity index scores are illustrated in Table 2, where several different distributions of samples in a given region are assumed. The QE weighs distances between species and will generally range from 0 (entirely homogeneous population) to 1.5 (split evenly between extremes), although the upper range depends on the distance matrix used. Using a distance matrix also minimizes the effect of minor changes between cells classed into different adjacent categories. Minimizing the effect of minor cell classification changes is important, as many researchers have noted the relative nature of IHC, and the difficulties associated with using IHC for quantitative analyses.18 One can further increase the numeric value between two classes in the distance matrix to make changes from one class to an adjacent class far less important than a change across several classes (eg the distances from 0+ to 0+, 1+, 2+, and 3+ could be changed from 0, 1, 2, 3 to 0, 1, 4, 9, respectively). The appropriate distance matrix should be discussed within the context of a particular protein and the therapeutic goals for prognosis decisions.
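The effect of the distance weighting can be sketched in Python using the two hypothetical regions discussed earlier (Region A: ten 0+ and ten 3+ cells; Region B: ten 1+ and ten 2+ cells). This sketch makes two assumptions that are not stated explicitly in the text: the QE is summed over all ordered pairs of classes, which reproduces the stated 0 to 1.5 range, and the distance between classes is |i − j| raised to a chosen power. All names are illustrative.

```python
def class_distance_matrix(n_classes=4, power=1):
    """Assumed distance d_ij = |i - j| ** power between staining classes 0+..3+."""
    return [[abs(i - j) ** power for j in range(n_classes)]
            for i in range(n_classes)]

def quadratic_entropy(proportions, distance):
    """Rao's QE as the sum of d_ij * p_i * p_j over all ordered pairs (i, j)."""
    n = len(proportions)
    return sum(distance[i][j] * proportions[i] * proportions[j]
               for i in range(n) for j in range(n))

def proportions_from_counts(counts):
    """Convert per-class cell counts [n0, n1, n2, n3] into proportions."""
    total = sum(counts)
    return [c / total for c in counts]

region_a = proportions_from_counts([10, 0, 0, 10])   # ten 0+ and ten 3+ cells
region_b = proportions_from_counts([0, 10, 10, 0])   # ten 1+ and ten 2+ cells

linear = class_distance_matrix(power=1)    # distances 0, 1, 2, 3 from class 0+
squared = class_distance_matrix(power=2)   # distances 0, 1, 4, 9 from class 0+

print("linear  A, B:", quadratic_entropy(region_a, linear),
      quadratic_entropy(region_b, linear))
print("squared A, B:", quadratic_entropy(region_a, squared),
      quadratic_entropy(region_b, squared))
```

With the linear matrix, Region A evaluates to 1.5 and Region B to 0.5, whereas an index without distances (such as Simpson) would treat the two regions identically. If the sum were taken over unordered pairs instead, every value would simply be halved.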

Table 2 Example diversity indices and their scores for various hypothetical regions

In this study, a simple approach to heterogeneity is sought that can be utilized in the clinic as anatomic pathology is practiced today, to ensure immediate assistance to improving clinical trials practice. The constraints include: (1) dealing with a single slide; (2) working in brightfield IHC (there are 16 brightfield FDA clearances for protein expression in tissue and none in fluorescence); (3) utilizing image analysis scoring approaches that have already been cleared for use in clinical practice and are familiar to practicing pathologists; and (4) delivering a scoring system that is easily communicated between pathologists and oncologists. A measure of heterogeneity (HetMap) was developed that incorporates both cell- and tumor-level heterogeneity measures. The approach was evaluated on HER2 IHC-stained breast cancer samples, using 200 specimens across two different laboratories, with three pathologists at each laboratory outlining 10–25 regions of tumor for scoring by automatic image analysis.

MATERIALS AND METHODS

Two slide sets prepared and scored by two different clinical laboratories were used for this study. One slide set of 100 breast carcinomas was selected with an equal distribution of slides scored from 0+ to 3+, and a second slide set of 100 breast carcinomas was taken from routine operation with a distribution of slides representative of the target population. The tissues were formalin-fixed, paraffin-embedded breast tissue specimens immunohistochemically stained using the Dako in vitro diagnostic FDA-approved HercepTest (Dako, Carpinteria, CA, USA). All slides were scanned on an Aperio ScanScope, and three board-certified pathologists for each slide set manually drew between 10 and 25 regions of interest of tumor on the slides using the Aperio ImageScope interface. The pathologists were asked to draw representative regions of tumor on each slide for routine HER2 scoring using automatic image analysis. A total of 8549 tumor regions on 100 slides were drawn electronically by three pathologists on the first slide set and 6002 tumor regions on 100 slides were drawn by three other pathologists on the second slide set. The Aperio HER2 membrane algorithm was adjusted and run on these regions of interest to identify cells and classify them as 0+, 1+, 2+, or 3+ staining (http://www.aperio.com). The algorithm was adjusted by consensus of the pathologists before the study on control slides, and then used with a fixed parameter set for each of the slide sets. HER2 IHC protein expression status was classified per cell following ASCO/CAP guidelines. Data from the automated image analysis scoring of cells for each region were exported to tab-delimited file format and then analyzed with R.19

HER2 Scoring

In this study, two different scoring schemes for HER2 were compared to the classic ASCO/CAP scoring methodology. All scores were calculated for each region and for each slide (sum of all cells in all regions of a slide).

ASCO/CAP HER2 Score Determination

HER2 scoring was calculated according to 2010 CLIA/CAP guidelines (2). Regions in which no staining was observed, or in which membrane staining was observed in <10% of the tumor cells, received a score of 0 (‘negative’). Regions containing more than 10% of cells with a faint perceptible membrane staining received a score of 1+ (‘negative’). Regions containing more than 10% of cells with weak-to-moderate complete membrane staining received a score of 2+ (‘weakly positive’). Regions containing more than 30% of cells with a strong complete membrane staining received a score of 3+ (‘positive’).

H-Score Determination

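A classic H-score was calculated as follows:

$$\text{H-score} = 1 \times (\%\ \text{of } 1{+}\ \text{cells}) + 2 \times (\%\ \text{of } 2{+}\ \text{cells}) + 3 \times (\%\ \text{of } 3{+}\ \text{cells})$$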
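As a minimal sketch, assuming the conventional H-score definition (the percentage of cells in each staining class weighted by intensities 1, 2, and 3, giving a range of 0 to 300), the calculation can be written in Python as below. The function name and the example inputs are hypothetical.

```python
def h_score(pct_1plus, pct_2plus, pct_3plus):
    """H-score from the percentages (0-100) of 1+, 2+, and 3+ cells.

    Assumes the conventional intensity weights 1, 2, and 3; 0+ cells
    contribute nothing.
    """
    return 1 * pct_1plus + 2 * pct_2plus + 3 * pct_3plus

# Two hypothetical tumors with very different population profiles.
tumor_x = h_score(pct_1plus=0, pct_2plus=0, pct_3plus=30)   # 30% 3+, rest 0+
tumor_y = h_score(pct_1plus=60, pct_2plus=0, pct_3plus=10)  # 10% 3+, 60% 1+

print(tumor_x, tumor_y)
```

Both hypothetical tumors evaluate to 90, which illustrates that a single intensity-weighted sum cannot distinguish between dissimilar cell populations.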

Continuous HER2 Score Determination

A continuous HER2 score (HER2cont) was developed and computed for each region and each slide (sum of all cells across all regions). The HER2cont provides the same scoring report structure as the classic HER2 score, by rounding the HER2cont score graduation to the nearest integer ([0.0, …, 0.5): 0, [0.5, …, 1.5): 1+, [1.5, …, 2.5): 2+, and [2.5, …, 3.0]: 3+). To provide the threshold graduations for the HER2 score, two components, both intuitive to pathologists, were used for measurement of the HER2 score:

(a) The percentage of cells that is responsible for the actual HER2 score (percentage of 3+, 2+, and 1+ cells when the HER2 score is a 1+; percentage of 3+ and 2+ cells when the HER2 score is a 2+; and percentage of 3+ cells when the HER2 score is a 3+) contributes to the HER2cont score. As the percentage of those cells increases beyond the critical threshold of the actual score, so does the HER2cont in a linear manner from where the critical threshold was passed.

(b) The percentage of cells that is critical for the HER2 scoring to the next higher score (percentage of 3+, 2+, and 1+ cells, when the HER2 score is 0; percentage of 3+ and 2+ cells, when the HER2 score is a 1+; and percentage of 3+ cells, when the HER2 score is a 2+) also contributes to the HER2cont score. As the percentage of those cells increases to reach the critical threshold to the next higher score, so does the HER2cont in a linear manner (0%: −0.5; 1%: −0.4; 2%: −0.3; 3%: −0.2; 4%: −0.1; 5%: 0.0; 6%: 0.1; 7%: 0.2; 8%: 0.3; 9%: 0.4; and 10%: 0.5=critical threshold for next higher score).

The two measurement components of the HER2cont score complement each other as the HER2 score moves from one score to the next higher score. Because an HER2 score is assigned to each individual cell, histogram data can be generated, which tallies the quantum (0, 1+, 2+, 3+) scores for each region. The HER2cont scoring approach examines the components of this histogram and assigns a threshold value based on a weighted mean of the data values of this histogram. Thus, this score fully represents the two components of data captured in a histogram: (1) the percent of total cells with each quantum score; and (2) consideration of the population profile of these cells. For example, a classic HER2 score of 2+ can be represented by an HER2cont score ranging from 1.5 to 2.4. The specific value of the HER2cont score consists, in part, of the sum of the quantum score multiplied by the number of cells given to each score, and divided by the number of cells. However, the percentages of these cells that contributed most to this mean value are weighted to determine a threshold value of the score. To understand how this approach works, we will use an example of a region that contained 10 cells, of which 4 cells had a score of 3+, 3 cells had a score of 2+, and the remaining 3 cells had a score of 1+. Intuitively, we could deduce the mean of this score as (3+3+3+3+2+2+2+1+1+1)/10=21/10=2.1. However, in actuality, the percentages of cells that contributed to the HER2cont score are weighted to yield the final score. To understand how this weighting process contributes to the final HER2cont score, we can examine the components of a score of 1.5, right at the threshold: If back-calculated, we would find that there are exactly 10% of 2+ or 3+ cells. In contrast, a score of 1.9 must result from either 82% of 2+ or 3+ cells, or 4% of 3+ cells. Finally, a score of 2.4 must result from 9% of 3+ cells, moving the score very close to the next higher 3+ score. 
Thus, the HER2cont score better captures the effects of intra-regional variability, which contribute to the overall profile of cells within a given summary HER2 score for that region.

The HER2cont was designed to capture the deficiency in a classic HER2 scoring approach, which has been in part addressed by the new HER2 scoring according to the ASCO/CAP guidelines, using the 30% of 3+ cells threshold for a 3+ score. The newer ASCO/CAP threshold approach to defining a 3+ score is based on a more stringent requirement for HER2 positivity, based on the now widely held understanding that the cells with the most HER2 expression are most responsive to trastuzumab. To understand how this decision was critical to predicting trastuzumab sensitivity, we can use the demonstration of how a classic H-scoring approach can result in an identical score for two dissimilar population profiles. For example, if a tumor had 30% of the cells being 3+ with the remaining cells being 0+, the H-score would be 90. If another tumor had 10% of the cells 3+, 60% of the cells 1+, and remaining cells 0+, the H-score would also be 90. However, these tumors clearly have a different molecular profile. The HER2cont score does not facilitate this scenario, and thus represents a similar approach to the ASCO/CAP HER2 scoring paradigm because the graduation of the continuous 3+ HER2 score only depends on the percentage of 3+ cells. For example, the new HER2 score can be obtained by applying a threshold of 2.61 (equals 30% of 3+ cells) to get a 3+ HER2 score (instead of 2.5 by rounding):

• If 3+ %cells≥10%, then region=2.5+

b. (3+ % cells–10%)/90% × 0.5

• If (3+ %cells+2+ %cells)≥10%, then region=1.5+

MAXIMUM of

a. (3+ %cells)/10%

b. (3+ %cells+2+ %cells–10%)/90% × 0.5

• If (3+ %cells+2+ %cells+1+ %cells)≥10%, then region=0.5+

MAXIMUM of

a. (3+ %cells+2+ %cells)/10%

b. (3+ %cells+2+ %cells+1+ %cells–10%)/90% × 0.5

• Else region=0.0+

a. (3+ %cells+2+ % cells+1+ %cells)/10% × 0.5
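The piecewise rules above can be sketched in Python as follows. This is one reading of the listed rules, not the study's own code: percentages are taken on a 0 to 100 scale, the first rule is assumed to have a single term, and the function names `her2_cont` and `her2_class` are hypothetical.

```python
def her2_cont(pct_1plus, pct_2plus, pct_3plus):
    """Continuous HER2 score from the percentages (0-100) of 1+, 2+, 3+ cells."""
    p3 = pct_3plus
    p32 = pct_3plus + pct_2plus
    p321 = p32 + pct_1plus
    if p3 >= 10:
        # 3+ tier: grows linearly from 2.5 (10% of 3+ cells) to 3.0 (100%).
        return 2.5 + (p3 - 10) / 90 * 0.5
    if p32 >= 10:
        # 2+ tier: the larger of "approach to the 3+ threshold" and
        # "excess over the 2+ threshold".
        return 1.5 + max(p3 / 10, (p32 - 10) / 90 * 0.5)
    if p321 >= 10:
        # 1+ tier: same construction one level down.
        return 0.5 + max(p32 / 10, (p321 - 10) / 90 * 0.5)
    # 0 tier: linear approach to the 1+ threshold.
    return p321 / 10 * 0.5

def her2_class(score):
    """Map a continuous score to a class by rounding at 0.5, 1.5, and 2.5."""
    if score < 0.5:
        return "0"
    if score < 1.5:
        return "1+"
    if score < 2.5:
        return "2+"
    return "3+"

# Values quoted in the text: 30% of 3+ cells maps to about 2.61, and a score
# of 1.9 arises from 82% of 2+/3+ cells or from 4% of 3+ cells.
print(round(her2_cont(0, 0, 30), 2))
print(round(her2_cont(0, 82, 0), 2), round(her2_cont(0, 6, 4), 2))
print(her2_class(her2_cont(0, 1, 9)))
```

Note that under this reading the "4% of 3+ cells" case reaches 1.9 only when the combined 2+ and 3+ percentage is at least 10%, because the tier is selected before the maximum is taken.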

Heterogeneity Scoring

The goal was to develop two measures of heterogeneity, one representing cell-level heterogeneity (Hetcell) and one representing slide-level heterogeneity (Hetslide), as a surrogate for tumor-level heterogeneity (Figure 2). By reporting both the cell- and tumor-level heterogeneity, the pathologist can communicate a view of both microvariation (cell-level) and macrovariation (tumor-level) in the tumor. HetMap is then a graph of the entire patient population, displaying cell-based heterogeneity within individual regions on one axis, and the slide-level heterogeneity on the other axis.

Figure 2

Diagrammatic illustration of the HetScore. The Het score is derived from analysis of the HER2 score for each individual cell (HER2cell—color-coded circles represent different HER2 values assigned to each cell) or the HER2 score from each individual region (HER2region—the overall score of cells within an outlined and shaded region). The quadratic entropy (QE) value of an outlined region (QEregion) is determined from the individual cellular HER2 scores measured within a region. Hetcell is defined as the mean of the QE value of all the regions. The QE value of the summary HER2 scores for all regions across the whole slide (QEslide) is defined as Hetslide. The QE value ranges from 0 to 1.5, with 0 being entirely homogeneous and 1.5 displaying the maximum heterogeneity possible.

To compute Hetcell, one first computes the individual quadratic entropy scores for all sampled regions on a slide. Then, these scores need to be combined into a meaningful representation per slide. This could be done by presenting the average, the maximum, or the minimum values. As each approach (average, maximum, minimum) potentially conveys different information content, it is not clear a priori as to which is the best approach for determining this value. Thus, the values for all three approaches for reporting Hetcell were determined, and the advantages and disadvantages of each are discussed. Hetslide is defined as QEslide, the quadratic entropy score of the population of HER2 scores for each sampled region on a slide.
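The two slide-level measures can be sketched in Python as below. The sketch assumes a class distance of |i − j|, a QE summed over all ordered pairs, per-cell and per-region scores encoded as integers 0 to 3, and region-level HER2 scores that have already been assigned by some scoring rule. The data and all names are hypothetical.

```python
from collections import Counter
from statistics import mean

def quadratic_entropy(labels, n_classes=4):
    """QE of a list of integer class labels (0..3) with assumed distance |i - j|."""
    counts = Counter(labels)
    total = len(labels)
    props = [counts.get(k, 0) / total for k in range(n_classes)]
    return sum(abs(i - j) * pi * pj
               for i, pi in enumerate(props)
               for j, pj in enumerate(props))

def het_cell(cells_per_region):
    """Aggregate the per-region QE values in the three ways considered."""
    qe_regions = [quadratic_entropy(cells) for cells in cells_per_region]
    return {"mean": mean(qe_regions),
            "max": max(qe_regions),
            "min": min(qe_regions)}

def het_slide(region_scores):
    """QE of the population of region-level HER2 scores on one slide."""
    return quadratic_entropy(region_scores)

# Hypothetical slide with three regions; each list holds per-cell scores.
cells_per_region = [
    [0] * 10 + [3] * 10,   # mixed extremes
    [1] * 10 + [2] * 10,   # mixed adjacent classes
    [2] * 20,              # homogeneous
]
region_scores = [3, 2, 2]  # hypothetical region-level HER2 scores

print(het_cell(cells_per_region))
print(het_slide(region_scores))
```

Keeping the three aggregates side by side makes it straightforward to compare how the mean, maximum, and minimum behave on the same slide before committing to one of them.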

Variability Analysis by Lab and Pathologist

Once calculations were performed on each slide with regions drawn by each pathologist, these values could then be compared to determine the effect of each pathologist selecting his or her own regions, as well as the variability between labs. Each pathologist's annotations of regions on a slide were analyzed independently, and then the same digital slide was compared between the three pathologists to analyze dependence on the regions of interest chosen by the pathologists. The HER2 score, H-score, HER2cont, and Hetcell were calculated for each region and the HER2 score, H-score, HER2cont, Hetcell, and Hettumor were calculated across all regions for all 600 slides (2 labs × 3 pathologists × 100 slides). Across all the regions on a single slide, the mean, standard deviation, maximum, and minimum of the region scores for HER2, H-score, HER2cont, and Hetcell were also computed. Once these computations were made, the differences between pathologists and their choice of regions could be evaluated. This was done by computing the standard deviation and coefficient of variation (CV) (standard deviation/mean) for the values computed by each pathologist's analysis.
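A minimal Python sketch of the per-slide variability calculation is shown below. It assumes the sample standard deviation (n − 1 denominator), which the text does not specify, and reports the CV as a percentage; the slide values and names are hypothetical.

```python
import math
from statistics import mean, stdev

def coefficient_of_variation(values, as_percent=True):
    """CV = standard deviation / mean for one slide across pathologists.

    Uses the sample standard deviation (an assumption); returns NaN when the
    mean is zero, because the ratio is undefined in that case.
    """
    avg = mean(values)
    if avg == 0:
        return math.nan
    cv = stdev(values) / avg
    return cv * 100 if as_percent else cv

# Hypothetical HER2cont values for one slide, one value per pathologist.
her2cont_by_pathologist = [2.0, 2.2, 1.8]

print(round(coefficient_of_variation(her2cont_by_pathologist), 1))
```

Slides whose mean is zero (for example, a heterogeneity score of zero from every pathologist) need separate handling before CVs are averaged across a slide set.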

RESULTS

Comparison of HER2 Scoring Approaches

As might be anticipated, the H-scores of the regions were nonlinear with the HER2 scores, with the HER2cont score being a closer representation of the CAP/CLIA guidelines (Figure 3). The nonlinearity is due to the heavier weighting of higher intensity staining over lower intensity staining to calculate the score, which disproportionately biases higher HER2 scored slides towards having a higher H-score. For example, a tumor with 30% of 1+ cells receives the same H-score as a tumor with 10% of 3+ cells, but the first one would be a 1+ HER2 score and the second a 3+ HER2 score. These two slides, with the same H-score, have a quite different meaning to a pathologist and clinician determining the therapeutic strategy. Thus, the H-score, which was designed as a standard scoring scheme to provide continuous scores, is not well suited for scoring HER2-stained slides.

Figure 3

(a) H-score and (b) HER2cont vs HER2 scores. Left column: Lab A, 8500 tumor regions across 100 slides; right column: Lab B, 6000 tumor regions across 100 slides.

In contrast to the H-score, the HER2cont score remains linear with HER2 scores determined using the CAP/CLIA guidelines. Linearity of HER2cont with the classic HER2 score is preserved because of the use of two components in calculating the score: the percentage of cells that is responsible for the actual HER2 score, and the percentage of cells that is critical for the HER2 scoring to the next higher score. The first component of the measurement confirms that the critical threshold is passed, whereas the second component measurement demonstrates how the different cell populations are moving the value towards the next score level. The H-score attempts to accomplish this by a score-weighted value. In contrast, the second component HER2cont score assigns an unweighted, score-independent value to the effect of cell variability on the HER2 score. As such, the HER2cont score remains linear with the ASCO/CAP-derived score.

Relationship of Heterogeneity with HER2 Score

The QEregion generally increases with the classic HER2 and HER2cont scores (Figure 4). This observation is consistent with the attributes of tumors with increasing HER2 scores. The increasing variability observed in tumors with 3+ scores is a result of increasing complexity in the tumor: a 1+ or 0 tumor has little to no staining and thus is very homogeneous; and a 2+ or 3+ tumor has intensely staining cells. These more diverse components inherently permit more variability than a 0 or 1+ tumor.

Figure 4

Cell-level heterogeneity vs (a) HER2 and (b) HER2cont scores. Top set of four graphs measures cell-level heterogeneity as the average (QEcell) values and bottom set of four graphs as the maximum (QEcell) values.

The lack of information about what factors contributed to the determination of the HER2 score using the classic method results in a less complex plot comparing QEregion vs classic HER2 than the comparison of the QEregion vs HER2cont. The plot comparing the QEregion vs HER2cont score demonstrates a similar pattern, with QE scores increasing as the HER2cont score increases towards a threshold. However, non-intuitively, the QE score decreases dramatically for the very highly scored 3+ tumors. This is due to the maximum staining intensity of any single cell being 3+. As there is no higher scoring population (4+), all the high staining cells are by definition 3+. Thus, the QEregion actually decreases as the percentage of 3+ intensity cells increases; as the amount of tumor with highly staining cell types increases, the variability decreases. At the extreme values of HER2cont, which have the most homogeneity (highest and lowest percent of 3+ intensity cells), QEregion would be expected to decrease, as is observed.
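The ceiling effect described above can be illustrated with a short Python sketch. It considers a hypothetical region that contains only 2+ and 3+ cells and assumes a class distance of |i − j| with the QE summed over all ordered pairs; the names are illustrative.

```python
def quadratic_entropy(proportions):
    """QE with assumed distance |i - j| between staining classes 0+..3+."""
    return sum(abs(i - j) * pi * pj
               for i, pi in enumerate(proportions)
               for j, pj in enumerate(proportions))

def qe_for_3plus_fraction(frac_3plus):
    """QE of a hypothetical region with only 2+ and 3+ cells."""
    return quadratic_entropy([0.0, 0.0, 1.0 - frac_3plus, frac_3plus])

# As the 3+ fraction rises from 50% to 100%, the QE falls towards zero
# because there is no class above 3+ for the cells to spread into.
for frac in (0.5, 0.7, 0.9, 1.0):
    print(frac, round(qe_for_3plus_fraction(frac), 3))
```

In this two-class case the QE reduces to 2f(1 − f) for a 3+ fraction f, so it peaks at an even split and vanishes when every cell is 3+.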

Three approaches to measuring an aggregate cell-heterogeneity score on a whole slide, the average, minimum, and maximum QEregion scores, are shown in Figure 4. Utilizing the maximum QEregion demonstrates the strongest correlation between HER2 score or HER2cont and QEregion. This again is, in part, due to a larger dynamic range of HER2 scores enabled by an increasing complexity with the presence of 3+ stained cells within the region. Utilizing the maximum score, by definition, captures the maximum entropy observed, enabling clearer visualization. As expected, using the minimum value has the opposite effect. As might be anticipated, the range of values for the average was lower than that of the maximum score, as the use of the average smoothed the variability within the sample by decreasing the dynamic range of the QE score reported. Thus, the maximum QEregion value from a single whole slide is proposed as the reporting measure for cell heterogeneity, because it has the greatest dynamic range and potentially represents the most relevant aspects of cell biology leading to therapy resistance.

Figure 5 displays these three different cell-level heterogeneity (QEregion) approaches vs the slide heterogeneity (Hetslide) scores for each lab. As the data demonstrate, there is almost no correlation (all correlation values are R² < 0.07) between cell- and tumor-level heterogeneity (QEregion vs Hetslide). This is true whether the minimum, maximum, or average QEregion is used. This indicates that the variability between regions within a tumor is independent of variability between neighboring cells. This concept is consistent with the accepted model of cell-autonomous, genetic regulation of HER2 expression.

Figure 5

Tumor-level heterogeneity (Hetslide) is not correlated with any of the three measures of cell-level heterogeneity: (a) minimum, (b) maximum, and (c) average.

A final representation of ‘unblinded’ data in Figure 6 displays a patient cohort visualization approach called HetMap. HetMap includes pseudocoloring to identify the HER2 score for each slide. HetMap is thus a graphical representation of the type of heterogeneity as it relates to a HER2 score. As demonstrated in Figure 4, the QE increases with the HER2 score, due to increased complexity of the sample with more 3+ staining cells. This representation shows a clear stratification of cell heterogeneity by HER2 score, as tumors with 3+ and 2+ scores demonstrate significantly more cell heterogeneity than tumors with 0 and 1+ scores. Again, there is no relationship between HER2 score and Hetslide, consistent with the data shown in Figure 5. However, the inclusion of the HER2 score in this graph indicates that there are specific subsets of patients within each score who have either a large or a small degree of heterogeneity across the whole tumor. The current understanding of the effect of tumor heterogeneity on clinical response suggests that this measure has potential clinical value, as discussed later.

Figure 6

An illustration of HetMap showing each slide in the context of the patient population, including the IHC score, the tumor-level heterogeneity, and one of three measures of cell-level heterogeneity—(a) minimum, (b) maximum, and (c) average.

Effect of Pathologist Choice of Regions on the HetMap Score

The standard deviation, average, and CV (standard deviation/average × 100) values were determined for the three sets of regions drawn by the three pathologists for each slide. These values for slide-level heterogeneity (Hetslide), the HER2 continuous score (HER2cont), and the three approaches to cell-level heterogeneity, Max(QEregion), Min(QEregion), and Ave(QEregion) are displayed for the two laboratories in Table 3. The variation of HER2 as a continuous score between pathologists drawing their own regions was low in both labs (8 and 9% CVs), validating this scoring methodology. This was lower than the ASCO/CAP recommended concordance rate of 95%, but far better than published concordance rates for HER2 IHC tests around 20%.20, 21
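The CV calculation described above can be sketched as follows. The sample standard deviation is used here as an assumption, since the text does not state whether the sample or the population form was applied, and the three HER2cont values are hypothetical.

```python
from statistics import mean, stdev

def cv_percent(values):
    """Coefficient of variation: standard deviation / average x 100.

    Assumption: sample standard deviation (n - 1 denominator).
    """
    return stdev(values) / mean(values) * 100

# Hypothetical HER2cont scores for one slide, one value per pathologist.
her2cont_by_pathologist = [2.0, 2.2, 2.4]
print(f"CV = {cv_percent(her2cont_by_pathologist):.1f}%")
```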

Table 3 Comparison of the variability introduced by the pathologist selecting regions of interest

The CV values for cell-level heterogeneity were higher for both Max(QEregion) (16 and 15% CVs) and Ave(QEregion) (17 and 16% CVs). The CV values for measures of cell-level heterogeneity based on a minimum (60 and 45% CVs) were much higher than for those based on a maximum. The Min(QEregion) determination showed the most disparity in CV values between pathologists, reinforcing the idea that this particular assessment is subject to more noise interference due to the nature of its determination, and should not be used. Tumor-level heterogeneity (Hetslide) was far more impacted by pathologists’ choice of regions (30 and 25% CVs) than the other measurements. It is likely that the number of regions sampled is not sufficient to make determinations of tumor-level heterogeneity. Relying on a methodology that samples all the tumor on a slide is becoming possible and practical, and may be required for this type of analysis.

Tumor-Level Heterogeneity Impacts Pathologist Concordance Rates

Figure 7 shows the distribution of tumor-level heterogeneity for slides that had either concordant or discordant reads between the three pathologists in a laboratory. Discordant reads between 1+ and 2+ are particularly troubling for clinical operation, as a 1+ will not be reflexed for FISH HER2 testing and investigated further for possible treatment options. There was a significant difference (P<0.01) between the concordant 1+ slides and those discordant with either 0+ or 2+.

Figure 7

Tumor heterogeneity plays a role in discordant reading by pathologists. The concordant slides (1+/1+) had significantly lower tumor heterogeneity (P<0.01) than 0+/1+ or 1+/2+ discordant slides.

DISCUSSION

Although molecular-targeted therapies such as trastuzumab are based on a strong biological rationale for the target, a significant percentage of patients who express the target are refractory to the therapy. Markers of resistance are not intuitive or well established, and are often not related to the target itself. For example, ErbB2 mutations that lead to trastuzumab resistance are not found in the clinic, and resistance is a result of amplification of many other genes besides HER2.22 HER2-expressing cells may overcome anti-HER2 therapy through a variety of other survival or compensatory signaling mechanisms that currently cannot be identified clinically.23 This is consistent with the well-accepted idea that HER2+ breast tumors are not a single biological system, but are composed of several different subtypes of HER2+ and HER2− cells within a tumor. Cancer biologists have long hypothesized that minor populations of cells within the bulk of the tumor are key to understanding mechanisms that underlie resistance to trastuzumab treatment. Thus, a persistent problem that confounds ideas about trastuzumab resistance is the concept of tumor heterogeneity. Therefore, for the continued development of the best strategy for trastuzumab and other therapies, methodologies are needed that can describe heterogeneity in the clinical context in relation to the targeted therapy.

As shown in the literature review summarized in Table 1, there are different ways of defining heterogeneity in the context of HER2 expression. First, there is the most classical measure of heterogeneity between different samples from the same tumor site taken at the same time. This can be within a single sample, or between multiple samples, such as that seen between needle core biopsy and excisional biopsy. Second, heterogeneity can be measured between different subtypes of the same disease within the same patient, such as a comparison between ductal carcinoma in situ and invasive disease, or between primary site and metastasis. Finally, heterogeneity can be measured between different tumor samples within a population. This can also include measures of heterogeneity in a patient population before and after a treatment. Regardless of the type of heterogeneity, its measure is determined by the variability of a biomarker score within an entire population.

In this report, heterogeneity is defined as the measure of the variability among samples within a population found in a single breast tumor. The first assumption made with this approach is that the tissue section analyzed is indeed representative of the whole biopsy. The second assumption made is that the biopsy is, in fact, representative of the tumor. Although these assumptions are inherent in and necessary for the current approaches to clinical biomarker evaluation, to date there has been no methodology to address the potential effects of these assumptions on the measurement. To account for the potential effects of heterogeneity on a HER2 IHC score, a methodology must be used that can simultaneously incorporate a measurement of heterogeneity with evaluation of a biomarker score.

Classical IHC scoring approaches, which allow visualization of the tissue, do not sufficiently address heterogeneity. Within a tumor, HER2 heterogeneity may manifest as variability in HER2 expression between neighboring cells, or as variability in HER2 expression between different parts of the tumor section. Normally, pathologists will take an average expression across a slide by selecting several random regions to score and combining the scores from each region into a final score. Although this method scores individual cells and different regions of the tumor, it does not consider either component separately in determining the score. Furthermore, there is no consideration in these assessments for variation within the sample that may have impacted the score. In this way, the current paradigm of assessing HER2 is not aligned with the fundamental understanding of tumor heterogeneity in cancer biology. With the current approach, potentially valuable information about the variability between areas sampled is lost.

To create a method that included these important aspects of tumor biology, a method of scoring heterogeneity called HetMap was developed that could be incorporated into existing workflows and readily implemented in the clinic. To be a biologically relevant measure, the measure of heterogeneity needs to capture both cell- and tumor-level variation. To be a valid measurement, it needs to be independent of the pathologist's choice of regions, as well as the laboratory or antibody manufacturer. To be a useful measurement, it needs to be amenable to immediate use in a clinical environment. With this in mind, an approach was developed that meets these requirements, first by developing methods to score HER2, then cell- and tumor-level heterogeneity separately, and finally assembling these components into the graphical output of HetMap.

First, to better capture cell- and tumor-level heterogeneity, a continuous HER2 scoring method was developed. As designed, the HER2cont scoring allows for a better benchmark when observing variability using standard deviation. In contrast to the classic H-score methodology, the HER2cont score remains linear with the ASCO/CAP HER2 scores. Perhaps most significantly, the HER2cont also ensures that the gradation of the HER2 score remains meaningful to the pathologists, in contrast to the classic HER2 score, which is devoid of information about the factors that determined the score. Thus, the HER2cont score enables better understanding of the variability that contributed to the overall score. Although the HER2cont proved a useful approach for analyzing the data, its use outside of this context is not recommended, given the regulatory difficulty of making any changes to accepted scoring schemes in clinical environments. In this context, the HER2cont could be utilized as a complement to the current reporting approach to HER2 analysis by IHC, to convey meaningful contextual information to the clinician about the reported HER2 score, or to identify near-threshold scores that may require re-evaluation.

To report the heterogeneity of a biomarker within a tumor directly, population diversity approaches used in ecology were applied. The measure of taxonomical diversity with Rao's QE proved to be a simple and versatile complement to HER2 scoring. In combining the cell-heterogeneity scores for each region into a single score for a slide, three alternative approaches using the minimum, maximum, or average QEregion score were presented. The maximum QEregion demonstrated the clearest trend between an increase in HER2 score and an increasing QEregion. It is hypothesized that this could, in part, be due to a larger dynamic range of HER2 scores enabled by an increasing complexity with the presence of 3+ stained cells within the region. However, it also suggests that there may be a causative relationship between HER2 expression and heterogeneity. This could be for several hypothetical reasons pertaining to the complexities of HER2 biology, but a possible simple explanation points to well-known factors that are related to HER2 gene amplification: HER2 expression is regulated by the amplification of a ‘hot spot’ on chromosome 17q, in a so-called ‘firestorm pattern’. As such, HER2 amplification also results in the amplification of multiple neighboring gene loci for other known oncogenes, through both known and unknown mechanisms. For this reason, HER2 expression is clinically well correlated with measures of disease progression, drug resistance, and recurrence.22 As discussed earlier, these same measures are well correlated with tumor heterogeneity. As such, the linear relationship between HER2 score and QEregion may, in fact, be causal and is in line with our current understanding of HER2 biology.

Although most useful with the HER2cont scoring method, the relationship between HER2 score and variability was demonstrated with either HER2 scoring method. One can imagine biological reasons for favoring each of the three QEregion approaches, depending on the biomarker of interest. Reporting maximum QEregion may be most beneficial for biomarkers whose overexpression is predictive, such as HER2. Although less useful for HER2, minimum cell-level QE may be important when examining the downregulation of a biomarker for susceptibility or resistance, like the tumor suppressor gene p53. The average QEregion may be most valuable for examining the levels of a biomarker that is more homogeneously expressed, but with varied expression depending on the phenotypic state of the cell. An example of this might be the widely monitored cell adhesion protein E-cadherin. In any case, there must be congruency between the biology of the biomarker and the variability of its expression within a tumor. Ultimately, it will require a comparison to clinical data to determine if the QE output is prognostic or predictive, and which of the three approaches will be most useful.

Throughout all the data, the results were highly reproducible between the two laboratories, even though the two labs had different specimens, different pathologists, and different standardization procedures (although both complied with CLIA/CAP guidelines for HER2 testing). This is encouraging both for clinical practice and for these measures of heterogeneity. These data require further investigation, but the initial finding that the lowest CV values are found with the HER2cont and the maximum QEregion outputs suggests that the pathologists are fairly consistently drawn to areas with higher staining (they may contain the highest HER2 score and maximum heterogeneity regions). This idea is congruent with the long-held observation that there is more concordance between pathologists in HER2 3+ scored slides compared to 2+ stained slides.20 These data also initially suggest that pathologists do not systematically select areas with lower staining (containing perhaps lower levels of heterogeneity), demonstrated as weaker correlations between minimum QEregion score and HER2 score, and the increased CV values seen with these measures. Finally, the data suggest that pathologists tend to more systematically include both low and high levels of staining in the same regions, demonstrated as modest CV values seen with these measures.

To demonstrate a working HetMap model in the widely used HER2 scoring platform, the data generated from these approaches were integrated into a graphical output. Perhaps counter-intuitively, there was little correlation between any of the cell-level HER2 heterogeneity measures and the tumor-level HER2 heterogeneity measures. A possible simple explanation is that HER2 expression is a cell-autonomous event, which is determined primarily by genetic amplification. For this reason, HER2 FISH scores correlate well with HER2 IHC scores, and both methods are acceptable for clinical use. Thus, HER2 expression is not influenced by known factors that affect the tumor cell microenvironment and heterogeneity within the tumors, such as autocrine and paracrine loops, hypoxia, inflammatory cells, and extracellular matrix components. The observation that cell- and slide-level heterogeneity did not correlate with each other, despite the clear correlation between all three approaches to cell-level heterogeneity and HER2 score, supports this idea. In contrast, other biomarkers that represent a phenotype more dependent on these physiological/environmental factors, such as the proliferation marker Ki67, may show a correlation between cell- and tumor-level heterogeneity.

The HetMap not only provides a general measure of tumor heterogeneity, but also has the potential to identify at-risk patients who may benefit from a different course of therapy. For example, as Figure 6 shows, there are subsets of tumors on the far right of the HetMap that have significant intratumor heterogeneity. Although this cannot be determined from this data set, our current understanding of tumor heterogeneity would suggest that these outliers have a poorer prognosis and may be likely to have de novo resistance to trastuzumab, despite having HER2 expression. In contrast, patients with tumors on the far left side of the HetMap, which have low heterogeneity and express HER2, may benefit the most from trastuzumab therapy. Also, a correlation between HER2-expressing tumors with modest heterogeneity, which is indicative of a tumor subset that becomes refractory (acquired resistance), may be found. However, in the absence of patient outcome data, these scenarios can only be speculative. Despite this, the HER2 HetMap links discrete subgroups of patients who have extreme or little heterogeneity to their HER2 score, thereby facilitating progress toward applying principles of tumor biology in clinical practice.

ASCO/CAP has not yet provided guidance on a potential diagnostic approach to consider HER2 IHC heterogeneity. In contrast, ASCO/CAP has provided specific guidance on assessments of HER2 FISH heterogeneity: These guidelines define ‘heterogeneous amplification’ as the presence of between 5 and 50% cells with HER2/CEP17 ratios of more than 2.20. A survey of 1329 consecutive breast cancer cases demonstrated that when utilizing the previously used criteria, 23.2% of cases demonstrated heterogeneity, of which 81.6% were not amplified and 15.5% were equivocal by the standard criteria. In contrast, the new ASCO/CAP criteria classified only 6.5% of cases as heterogeneous, of which only 8% were not amplified and 79% were equivocal by standard criteria. Thus, this new approach of defining a ‘heterogeneous amplification’ patient population better captures patients who truly have amplified HER2 in a non-uniform manner without introducing false positives.24 Similarly, the TEAM (Tamoxifen vs Exemestane Adjuvant Multicentre) pathology study, which included 6461 eligible cases, demonstrated that 33.5% of the cases exhibited FISH heterogeneity defined by the newer ASCO/CAP criteria. These aforementioned studies show that carcinomas with HER2 genetic heterogeneity can still have an overall negative HER2 amplification status, despite still containing a significant number of tumor cells with HER2 gene amplification.

Interestingly, there was no prognostic value found in comparing the ‘heterogeneous amplification’ population to outcome in this study.25 Although the lack of prognostic value is unexpected, a recent study found that tumors classified as ‘heterogeneous amplification’ lacked specific clinicopathological characteristics that would normally correlate with prognosis, suggesting that other more important factors may be critical to patient response. However, these same authors found that HER2 genetic heterogeneity according to the ASCO/CAP definition is most often present in breast carcinomas with an equivocal (2+) HER2 score.26 This correlates with our observation that patients who received an equivocal (2+) HER2 score also tended to have the highest heterogeneity.

This study is the first to show the impact of tumor-level heterogeneity on discordant reading by pathologists. There was no significant association between cell-level heterogeneity and pathologist discordance, as might be expected. Tumor-level heterogeneity presents a much larger challenge to accurate and precise reading by pathologists, as the location for reading on a slide can impact the scoring. This study relied on pathologists to manually identify regions of interest on the whole slide image, and measured the variability introduced by this human sampling approach. There are several potential approaches to eliminating this sampling bias. One could increase the number of regions, but the time already required for drawing regions manually is beyond economic viability in most laboratories. One could computationally deconvolve each IHC image, displaying only the hematoxylin to the pathologist, to reduce the selection bias towards highly staining DAB regions.27 Alternatively, it may be better to overlay a grid on a digital image of the sample, similar to stereology approaches, and then ask the pathologist to draw the closest ‘tumor’ nest within a set distance of randomly sampled grid points to be used as the data set. Another improvement could be the use of pattern recognition, or special staining followed by pattern recognition, on a consecutive section. This approach could remove bias by either using a central slide without the biomarker of interest on it or having the pathologist select tumor nests with only the hematoxylin- and eosin-stained section. This method of aligning serial sections to be treated as identical sections could also increase the sampling size, and could automate the choice of regions. One can take the use of aligned serial sections further by including not just one IHC marker but several, to provide a classification of cells based on multiple markers. As pattern recognition methods improve and whole slide imaging becomes more standardized, this may facilitate the identification of the optimal heterogeneity approaches.
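The grid-based sampling idea mentioned above could be prototyped along the following lines. This is only a sketch under stated assumptions: tumor nests are represented by precomputed centroid coordinates rather than being drawn by a pathologist, and the function name `sample_nests`, the grid spacing, and the distance cutoff are all hypothetical.

```python
import math
import random

def sample_nests(nest_centroids, width, height, spacing, n_points, max_dist, seed=0):
    """Pick the nearest tumor nest to each randomly sampled grid point.

    Assumption: nests are given as (x, y) centroids in pixel coordinates.
    Grid points with no nest within max_dist contribute nothing.
    """
    rng = random.Random(seed)
    grid = [
        (x, y)
        for x in range(0, width + 1, spacing)
        for y in range(0, height + 1, spacing)
    ]
    sampled = rng.sample(grid, min(n_points, len(grid)))
    selected = set()
    for point in sampled:
        nearest = min(nest_centroids, key=lambda c: math.dist(c, point), default=None)
        if nearest is not None and math.dist(nearest, point) <= max_dist:
            selected.add(nearest)
    return sorted(selected)

# Hypothetical nests: one inside the gridded area, one far outside it.
nests = [(10, 10), (500, 500)]
chosen = sample_nests(nests, width=100, height=100, spacing=50, n_points=9, max_dist=20)
print(chosen)
```

Because the grid points, rather than the staining pattern, determine which nests are scored, a scheme of this kind is intended to reduce the bias towards highly staining regions.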

The HetMap conveys the cell-level, tumor-level, and IHC expression score for each patient in the context of the entire patient population, and further clinical studies to determine its utility are anticipated. Unfortunately for the development of new companion diagnostics, HER2 may actually represent a best-case scenario for assessing heterogeneity. Several of the studies listed in Table 1 also included other biomarkers, and consistently found HER2 to have substantially lower heterogeneity. In particular, multiple studies find higher heterogeneity in the estrogen and progesterone markers. One study found HER2 to have far lower levels of heterogeneity than other biomarkers studied, with c-myc and cyclinD1 exhibiting heterogeneity in 83 and 100% of samples, respectively.28 With regard to tumor type, high levels of heterogeneity are well known in gastric carcinoma,29 and one study found rates of intratumoral heterogeneity of KRAS mutations as high as 35–47% in colorectal tumors.30 HER2 expression is generally far more heterogeneous in gastric carcinoma than in breast tumors.31 The generally low levels of HER2 heterogeneity between primary and metastatic site cannot be assumed to apply to other therapeutic targets under development. Given that new therapeutic agents and their corresponding companion diagnostics will present greater heterogeneity challenges than HER2, the urgency to move anatomic pathology in the direction of measuring and reporting IHC heterogeneity only increases.