INTRODUCTION

HER-2/neu gene amplification or protein overexpression is seen in 20–30% of invasive breast cancers (1). Many studies have shown that this represents an adverse prognostic factor (2). Recent clinical studies have also suggested that overexpression may serve as a predictive marker of therapeutic response (3, 4, 5) to the monoclonal antibody trastuzumab (Herceptin; Genentech, Inc., CA), which directly targets the HER-2/neu protein. The introduction of this therapy highlighted the need for valid testing of HER-2/neu in our laboratory.

Immunohistochemistry (IHC) for HER-2/neu protein expression is the most attractive routine test based on issues of cost, convenience, and biological relevance. However, problems with variability in IHC staining have been reported. Specifically, differences in sensitivities and specificities among the various commercially available antibodies (6), variability in IHC interpretation (7), and technical artifact (8) have proven problematic. Although a true gold standard for testing does not exist, HER-2/neu gene amplification by fluorescent in situ hybridization (FISH) has been shown to correlate with protein overexpression by IHC and be similarly predictive and prognostic (9, 10, 11, 12, 13, 14). The goal of this study was to assess the performance of IHC compared with that of FISH on fixed tissue from invasive breast carcinomas submitted to our laboratory from hospital laboratories in British Columbia, Canada, using several commercially available antibodies. Also, we sought to examine the effect of interobserver variability on IHC interpretation.

MATERIALS AND METHODS

Study Population

Study material was derived from patients with advanced breast cancer considered candidates for Herceptin therapy. Biopsy material was submitted from laboratories throughout the province of British Columbia to the BC Cancer Agency (BCCA) laboratory for HER-2/neu testing between January and December of 1999. Cases were initially tested by IHC with a polyclonal antibody without antigen retrieval (DAKO, c-erbB2, clone A0485). A proportion of these cases was difficult to classify, and these were considered by the original pathologist to be equivocal between negative and positive (IHC stain intensity, +1 versus +2). For this study, all of these equivocal cases plus a mix of negative (stain intensity 0 and +1) and positive cases (stain intensity +2 and +3) were chosen. Only cases with sufficient invasive carcinoma for multiple slides were included.

Paraffin blocks were obtained from breast biopsies (n = 102) and metastases to lymph nodes (n = 11), skin (n = 3), ovary (n = 1), and liver (n = 1). Blocks were received from 24 laboratories. All laboratories used 10% buffered formalin for primary tissue fixation. Duration of exposure to the fixative varied considerably, as typically encountered in surgical pathology practice. Two laboratories used postfixation treatment, one with alcohol and one with microwave. Two laboratories used a combination of alcohol and xylene in the tissue processor.

For each case, 4-μm-thick tissue sections were cut from paraffin blocks. FISH was performed in the BCCA laboratory using the Vysis PathVysion HER-2/neu DNA Probe Kit (Vysis, Inc., Downers Grove, IL). IHC was performed at the Vancouver Hospital with three anti-HER-2 antibodies (DAKO, c-erbB2, clone A0485; Zymed, HER2, clone TAB250; and Novocastra, HER2, clone CB11) and at the BCCA with the HercepTest Kit (DAKO).

FISH for HER-2/neu Gene Amplification

FISH for HER-2/neu gene amplification was performed using the Vysis PathVysion HER-2/neu DNA Probe Kit according to the manufacturer's instructions. In brief, tissue sections were baked overnight at 56° C, deparaffinized in xylene, dehydrated in 100% ethanol, and air-dried. Pretreatment consisted of slide immersion in 0.2 N hydrochloric acid, a purified water wash, a wash in Vysis Pretreatment Wash Buffer, and immersion in Vysis Pretreatment Reagent at 80° C for 30 minutes. Slides were then washed in purified water, washed in Vysis Wash Buffer, and immersed in Vysis Protease solution at 37° C for 20 minutes. Protease was removed with two changes of wash buffer, and slides were allowed to air dry. Ten microliters of PathVysion HER-2 Probe was added to the sample area of each section, and the section was coverslipped and the edges sealed with rubber cement. Slides were placed on a HyBrite (Vysis, Inc., Downers Grove, IL) for codenaturation, with 5 minutes at 73° C to denature the DNA followed by 18 hours at 37° C to hybridize the probe and DNA. After hybridization, the rubber cement was removed, posthybridization solution was used to remove the coverslips, and slides were washed in posthybridization wash buffer and air-dried in the dark. Nuclei were counterstained with 10 μl of DAPI/antifade and coverslipped. Slides were stored at −20° C for up to 7 days. Control slides, either negative or positive, were included in each Coplin jar of every run. Controls consisted of sections of cases of known HER-2 FISH status.

FISH Interpretation

Sections were examined with a fluorescent microscope (Zeiss Axioskop) using filter sets recommended by Vysis (DAPI/SpectrumOrange dual bandpass, DAPI/SpectrumGreen dual bandpass). A comparable hematoxylin and eosin (H&E) section was examined to assist with location of invasive tumor within each section. Forty malignant cell nuclei from invasive foci were scored for CEP-17 signal and HER-2 signal. Only nuclei with both signals interpretable were scored. A ratio of HER-2/CEP-17 was calculated for each specimen. A technologist and a pathologist independently enumerated each case.
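The ratio calculation described above can be sketched as follows. This is a minimal illustration, not the laboratory's scoring procedure: the function names and the per-nucleus input format are hypothetical, the ratio is assumed to be total HER-2 signals divided by total CEP-17 signals over the scored nuclei, and the cutoff of 2.0 is an assumption taken from the lower bound of the amplified range reported in the Results.

```python
from typing import List, Tuple

# Assumed cutoff: amplified cases in this study had ratios of 2.0 or higher.
AMPLIFICATION_CUTOFF = 2.0


def her2_cep17_ratio(nuclei: List[Tuple[int, int]]) -> float:
    """Return total HER-2 signals divided by total CEP-17 signals.

    `nuclei` holds one (her2_signals, cep17_signals) pair per scored
    nucleus; only nuclei with both signals interpretable should be passed.
    """
    total_her2 = sum(her2 for her2, _ in nuclei)
    total_cep17 = sum(cep17 for _, cep17 in nuclei)
    if total_cep17 == 0:
        raise ValueError("no CEP-17 signals were counted")
    return total_her2 / total_cep17


def is_amplified(ratio: float, cutoff: float = AMPLIFICATION_CUTOFF) -> bool:
    """Classify a specimen as amplified when the ratio reaches the cutoff."""
    return ratio >= cutoff


# Hypothetical specimen: 40 nuclei, each with 10 HER-2 and 2 CEP-17 signals.
example_ratio = her2_cep17_ratio([(10, 2)] * 40)
```

For the hypothetical specimen above, the ratio is 5.0, which `is_amplified` classifies as amplified.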

IHC for HER-2/neu Protein Expression

Sections from each case were deparaffinized and rehydrated in graded alcohol solutions. Processing for each antibody was as follows: A0485: no antigen retrieval, dilution 1:25; TAB250: pretreatment with protease Type 1 (Ventana) for 4 minutes, dilution 1:50; CB11: no antigen retrieval, dilution 1:25; HercepTest: heat-induced antigen retrieval with temperature-controlled water bath. (Note: We now use antigen retrieval with A0485 as currently recommended by DAKO. At the time of this study, DAKO was not recommending antigen retrieval for the A0485 antibody.)

At the Vancouver Hospital laboratory, all staining was performed on a Ventana ES Immunostainer (Ventana Medical Systems, Tucson, AZ), according to the manufacturer's instructions. The primary antibody was applied for 32 minutes. All cases were run with known positive- and negative-biopsy control material. The detection system used was AEC Detection Kit (Sigma), which included endogenous peroxidase inhibitor, biotinylated goat anti-mouse IgG/IgM, anti-rabbit IgG, peroxidase-labeled streptavidin, and AEC chromogen and substrate (hydrogen peroxide). Sections were counterstained with hematoxylin.

At the BCCA laboratory, staining for HER-2/neu protein expression was performed manually using the DAKO HercepTest Kit according to the protocol in the manufacturer's guide. Paraffin sections were either freshly cut or were cut no longer than four weeks before the assay was performed. In brief, tissue sections were deparaffinized in xylene and rehydrated through a series of alcohols to distilled water. Epitope retrieval was in preheated (95° C) DAKO Epitope Retrieval Solution (0.01 mol/L citrate buffer, pH 6.0) in a water bath for 40 minutes. Peroxidase-blocking reagent was applied for 5 minutes. The primary antibody (prediluted rabbit anti-human antibody to HER-2/neu protein) was applied to test and positive-control sections for 30 minutes. The negative-control reagent (normal rabbit serum) was similarly applied to negative-control sections. The detection system consisted of DAKO visualization reagent (dextran polymer conjugated with horseradish peroxidase and goat anti-rabbit immunoglobulins) with 3,3′-diaminobenzidine chromogen solution. Sections were counterstained with hematoxylin. Controls for each run consisted of a negative and positive tissue control, a negative control for each patient, and the pellet control supplied with the kit.

IHC Interpretation

Slides were independently interpreted in a blinded fashion by an experienced breast pathologist (Observer 1), a general surgical pathologist (Observer 2), and a surgical pathology resident (Observer 3). Only invasive cancer was scored. Staining was interpreted on the maximum area of staining intensity as follows: 0 = no staining; +1 = weak, incomplete membranous staining; +2 = moderate, complete membranous staining of at least 10% of invasive tumor cells; and +3 = strong membranous staining of at least 10% of invasive tumor cells. Cases interpreted as 0 or +1 were considered negative, and cases interpreted as +2 or +3 were considered positive.
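The scoring rule above reduces to a simple mapping from stain intensity to a negative or positive call. The short sketch below illustrates it; the function name and the integer encoding of the scores (0 to 3 for 0, +1, +2, +3) are hypothetical conveniences, not part of the study protocol.

```python
VALID_SCORES = (0, 1, 2, 3)  # encodes stain intensities 0, +1, +2, +3


def ihc_status(score: int) -> str:
    """Map a stain intensity to the study's negative/positive call.

    Scores of 0 or +1 are negative; scores of +2 or +3 are positive.
    """
    if score not in VALID_SCORES:
        raise ValueError(f"unexpected IHC score: {score!r}")
    return "positive" if score >= 2 else "negative"


# One call per possible stain intensity.
calls = {score: ihc_status(score) for score in VALID_SCORES}
```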

Statistical Methods

The distributions of the original A0485 antibody staining and FISH amplification in the study population were compared with those in the patient population using Pearson's χ2 test. Patients with complete data (FISH and all four antibodies reviewed by each of the three observers) were compared with those with incomplete data using the t test for continuous measurements and Pearson's χ2 test for categorical measurements.

Positive predictive values (PPV) and negative predictive values (NPV) were calculated for each of the four antibodies and for each of the three observers. Sensitivity, specificity, false-negative proportion (1 − sensitivity), and false-positive proportion (1 − specificity) were also calculated using all possible cutpoints. Because the enriched sampling scheme biased the sensitivity and specificity, adjusted sensitivity and specificity were computed using the following formulas. Let T+ and T− denote positive and negative antibody stains for a particular cutoff, and let F+ and F− denote amplified and nonamplified FISH assays. Then the sensitivity, the probability that the antibody stain is positive given that FISH is amplified, can be written Pr(T+ | F+). Using Bayes' theorem, and with an estimate of the prevalence of a positive test, Pr(T+), the sensitivity can be expressed as a function of the PPV, Pr(F+ | T+), and the NPV, Pr(F− | T−), at that cutpoint:
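Written out from the definitions above, the Bayes' theorem step takes the following form. This is a reconstruction from the stated definitions rather than a verbatim reproduction of the published formula. The shorthand p = Pr(T+) is introduced here for brevity, and the specificity expression follows by the same argument.

```latex
\begin{align*}
\text{Sensitivity} = \Pr(T^{+} \mid F^{+})
  &= \frac{\Pr(F^{+} \mid T^{+})\,\Pr(T^{+})}
          {\Pr(F^{+} \mid T^{+})\,\Pr(T^{+}) + \Pr(F^{+} \mid T^{-})\,\Pr(T^{-})} \\
  &= \frac{\mathrm{PPV}\cdot p}
          {\mathrm{PPV}\cdot p + (1-\mathrm{NPV})(1-p)}, \\[6pt]
\text{Specificity} = \Pr(T^{-} \mid F^{-})
  &= \frac{\mathrm{NPV}\,(1-p)}
          {\mathrm{NPV}\,(1-p) + (1-\mathrm{PPV})\,p}.
\end{align*}
```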

The distribution of A0485 in the population (mixed, single reviewers) was used as the population prevalence for all antibodies. Receiver operating characteristic (ROC) curves were also computed. For each observer and antibody stain, adjusted c statistics (the area under the ROC curve [AUC]) were computed.
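The prevalence adjustment and the c statistic described in this section can be sketched in code. The sketch below is illustrative only and makes several assumptions: the function names are hypothetical, the adjustment is the direct Bayes' theorem rearrangement of PPV, NPV, and Pr(T+) at a single cutpoint, and the AUC is approximated by the trapezoidal rule over the (false-positive proportion, sensitivity) points from each cutpoint. The authors' actual software and any refinements they applied are not known.

```python
from typing import Iterable, Tuple


def adjusted_sens_spec(ppv: float, npv: float, p_test_pos: float) -> Tuple[float, float]:
    """Return (sensitivity, specificity) from PPV, NPV, and Pr(T+).

    `p_test_pos` is the population prevalence of a positive stain at the
    chosen cutpoint, not the prevalence in the enriched study sample.
    """
    p_pos = p_test_pos
    p_neg = 1.0 - p_pos
    sens_den = ppv * p_pos + (1.0 - npv) * p_neg   # equals Pr(F+)
    spec_den = npv * p_neg + (1.0 - ppv) * p_pos   # equals Pr(F-)
    if sens_den == 0.0 or spec_den == 0.0:
        raise ValueError("sensitivity or specificity is undefined for these inputs")
    return ppv * p_pos / sens_den, npv * p_neg / spec_den


def auc_trapezoid(points: Iterable[Tuple[float, float]]) -> float:
    """Approximate the area under an ROC curve by the trapezoidal rule.

    `points` are (false_positive_proportion, sensitivity) pairs, one per
    cutpoint; the corners (0, 0) and (1, 1) are added automatically.
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical check: a test with sensitivity 0.8 and specificity 0.9 in a
# population where 30% are FISH amplified has Pr(T+) = 0.31,
# PPV = 0.24/0.31, and NPV = 0.63/0.69; the adjustment recovers 0.8 and 0.9.
sens, spec = adjusted_sens_spec(0.24 / 0.31, 0.63 / 0.69, 0.31)
auc = auc_trapezoid([(1.0 - spec, sens)])
```

With four stain levels there are three interior cutpoints, so a full ROC curve would pass one point per cutpoint to `auc_trapezoid` rather than the single point used in this check.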

Interobserver variability was assessed by multiobserver kappa statistics (15).
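One common multiobserver agreement statistic is Fleiss' kappa, sketched below. Whether this is the exact estimator described in reference 15 is an assumption, and the function name and input layout are hypothetical. The sketch expects, for each case, the number of observers who assigned each stain category.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Compute Fleiss' kappa for a fixed number of observers per case.

    counts[i][j] is the number of observers who placed case i in
    category j; every row must sum to the same number of observers.
    """
    n_cases = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each case needs the same number (>= 2) of observers")
    n_categories = len(counts[0])

    # Overall proportion of all ratings that fall in each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_cases * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement for each case, then averaged over cases.
    per_case = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_case) / n_cases
    expected = sum(p * p for p in p_cat)  # agreement expected by chance
    if expected == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (observed - expected) / (1.0 - expected)


# Hypothetical data: 4 cases, 3 observers, 2 categories (negative, positive).
example_kappa = fleiss_kappa([[3, 0], [0, 3], [2, 1], [1, 2]])
```

For the hypothetical data above, two cases have unanimous calls and two are split 2 to 1, giving a kappa of about 0.33.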

RESULTS

FISH for HER-2/neu Gene Amplification

In total, 127 cases were assessed by FISH for HER-2/neu gene amplification and by IHC for protein expression. FISH analysis failed in 9 cases; 5 cases failed because of inadequate digestion, 3 cases failed because of loss of tumor from the section, and 1 case failed because of unknown reasons.

Of 118 cases with informative FISH results, 52 (44%) were gene amplified and 66 (56%) were not amplified. The FISH score (HER-2/CEP-17 ratio) for amplified cases ranged from 2.0 to 11.0 (mean = 5.36).

Comparison of Study Group with Population

From June 1998 to February 2000, 484 cases were tested with the A0485 antibody. The results were negative (0) for 333 (68.8%), negative (+1) for 28 (5.8%), positive (+2) for 26 (5.4%), and positive (+3) for 97 (20%). By comparison, for the 114 patients with informative FISH analysis, the distribution of A0485 staining (Observer 1) was negative (0) for 46 (39.0%), negative (+1) for 20 (17.5%), positive (+2) for 14 (11.9%), and positive (+3) for 34 (28.8%). This was significantly different from the population distribution (Pearson χ2 = 38.3, P < .001). In addition, compared with the 26% prevalence of positive HER-2 by IHC in our laboratory, 44% of the study cases were gene amplified by FISH. Therefore, this study set was enriched with positive cases and cases equivocal between +1 and +2 over that encountered in our clinical practice.
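The comparison of the two stain distributions can be illustrated with a Pearson χ2 calculation on the counts quoted above. This is a minimal sketch, not the authors' analysis code: the function name is hypothetical, and treating the 484 laboratory cases and the 114 study cases as two separate groups in a 2 × 4 table is an assumption about how the test was set up.

```python
from typing import List, Tuple


def pearson_chi_square(table: List[List[float]]) -> Tuple[float, int]:
    """Return the Pearson chi-square statistic and degrees of freedom.

    `table` is a contingency table given as a list of rows; every row and
    column total must be nonzero so that all expected counts are positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Counts by stain intensity (0, +1, +2, +3) as quoted in the text.
population_counts = [333, 28, 26, 97]   # 484 laboratory cases
study_counts = [46, 20, 14, 34]         # 114 study cases (Observer 1)

chi2, dof = pearson_chi_square([population_counts, study_counts])
```

Under these assumptions, the statistic comes to approximately 38.3 on 3 degrees of freedom, in line with the value reported above.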

Comparison of Complete and Incomplete Data

Complete data (FISH and 4 antibody results for 3 observers) were available in 95 cases, and statistical analysis was done on this data set. In 23 cases, a complete set of IHC stains was not available because the original blocks could not be recovered for analysis. The profiles of the complete (95 cases) and incomplete (23 cases) data sets were similar (Table 1). For the 95 cases with complete results, 43% (41 cases) were gene amplified.

TABLE 1 Comparison of Complete and Incomplete Cases

Comparison of FISH and IHC Assays

Table 2 shows the complete data set (95 cases) for each observer for FISH score and antibody stain results.

TABLE 2 Cross-Tabulation of FISH Score by Antibody Staining for Each Observer

The overall interobserver agreement (kappa) for each antibody was as follows: A0485, 64.2; CB11, 74.4; TAB250, 79.2; and HercepTest, 69.5 (Table 3). Although interobserver agreement was high (77–95.6) for 0 and +3 staining, it was generally poor (32.8–59.1) for +1 and +2 staining (Table 3). When combined scores of 0 and +1 were interpreted as negative and combined scores of +2 and +3 were interpreted as positive, the kappa values were as follows: A0485, 79.5; CB11, 96.8; TAB250, 86.0; and HercepTest, 81.6.

TABLE 3 Reliability (Kappa) with 95% Bootstrap Confidence Interval for Each of the Four Antibodies

The antibody sensitivity, specificity, and c statistic (AUC) for each observer are shown in Table 4. The results suggest similar overall test performance for A0485, HercepTest, and TAB250. The CB11 antibody had the lowest sensitivity for all observers (range, 63–66%) but good specificity (range, 97–98%). TAB250 was the most sensitive antibody (range, 82–100%) and showed very good specificity (range, 94–95%). These results are shown graphically as a sample ROC curve (Fig. 1) for the 4 antibodies for one observer.

TABLE 4 Range of Antibody Sensitivities and Specificities and c-Statistics (AUC) for the Three Observers*

FIGURE 1 Receiver operating characteristic (ROC) curves for Observer 1 with the four antibodies.

The variation in individual-observer positive predictive value of a positive test (PPV) and negative predictive value of a negative test (NPV) for each level of membrane staining with each antibody is shown in Table 5. The NPVs and PPVs, respectively, were high for all antibodies when staining was absent (0) and when staining was strongly positive (+3). However, for TAB250 and HercepTest, a stain intensity of +2 was poorly predictive of gene amplification (TAB250, 40–56%; HercepTest, 50–54%), indicating a high false-positive rate for these antibodies at this stain level. Also, for CB11 and HercepTest, +1 stain intensity was poorly predictive of no gene amplification (30–33% and 50–70%, respectively), indicating an increased false-negative rate at this stain level. Although the numbers are small, the data for CB11 suggest that any membrane stain at all favors gene amplification.

TABLE 5 Antibody Performance (Positive Predictive Value = PPV; Negative Predictive Value = NPV) for Individual Observers

Cases Nonconcordant between IHC and FISH

For the FISH-amplified cases, only 1 case (HER-2/CEP-17 = 3.3) had no evidence of protein overexpression with all 4 antibodies. For the FISH-nonamplified cases, 13 cases showed protein expression by at least one antibody. One case was IHC positive with all 4 antibodies (HER-2/CEP-17 = 1.3), one case was IHC positive with 3 of the 4 antibodies (HER-2/CEP-17 = 1.5), and two cases were positive with 2 of 4 antibodies (HER-2/CEP-17 = 1.2 and 1.3). In the remaining 9 cases, only one antibody was positive.

Cases without IHC Consensus

In 10 cases, each observer interpreted the stain intensity differently. These cases were reviewed by all observers at a multiheaded microscope. Heterogeneity of staining was noted in all these biopsies. Specific problems noted (Table 6) included enhanced edge staining on small biopsies, excess cytoplasmic staining mimicking membrane staining, and nonspecific staining of stroma around cell groups but not around individual cell membranes. Observers varied in the interpretation of equivocal staining (+1 versus +2), with some “scoring up” to positive (+2) and others “scoring down” to negative (+1).

TABLE 6 Factors Contributing to Interobserver Variability in IHC Interpretation with All Antibodies*

DISCUSSION

Clinical demand for HER-2/neu assessment is increasing, and there is a need for standardization of HER-2/neu assessment and reporting so that results are comparable between laboratories (16). As a reference hospital for the treatment of breast cancer, the BC Cancer Agency was particularly interested in defining a reliable approach to assess HER-2/neu status for the province of British Columbia, Canada.

The study goal was to examine the ability of IHC using four antibodies to predict gene amplification as defined by FISH analysis. An ideal antibody test would have near-perfect interobserver agreement on stain interpretation, would be sufficiently sensitive to capture all possible treatment candidates, and would have a high PPV (low false-positive rate) to minimize overtreatment. None of the antibodies approached this ideal.

It should be emphasized that although a sampling bias was introduced in enriching the study with equivocal cases, this bias was corrected using statistical methods allowing comparison to our normal population prevalence of HER-2 positive cases.

Three antibodies (A0485, TAB250, and HercepTest) were found to have similar performance. TAB250 had the highest AUC of the three antibodies (Table 4). Although TAB250 appeared to perform slightly better, the difference was not statistically significant. With all 3 antibodies, absent staining (score of 0) was highly predictive of a nonamplified case (NPV, 93–100%), and strong complete membrane staining (+3) was highly predictive of an amplified case (PPV, 89–100%). Stain intensity scores of +1 or +2 were variably predictive of gene amplification. Others have recently reported similar findings (16), and for this reason, we recommend FISH analysis for cases scored as +1 or +2.

Interobserver agreement varied with stain intensity. For stain intensities interpreted as 0 or +3, interobserver agreement was generally high. In contrast, interobserver agreement was generally poor for stain intensities interpreted as +1 or +2. This is significant, as distinction between these staining intensities is clinically relevant. At present, patients with breast cancers showing +1 staining for HER-2 protein are usually not offered Herceptin therapy, whereas those with +2 staining often are. Interobserver variability at these staining intensities further supports a role for FISH in evaluating cases scored as +1 and +2.

In 10 cases, there was no agreement on stain interpretation, reflecting the number of equivocal cases in the study set. A particular problem was noted with small and/or fragmented biopsies in which significant edge enhancement of stain could occur. In these cases, unless the result is clearly negative (0) or positive (+3), it would be prudent to repeat the test on another block and/or consider FISH analysis.

Another possible source of interobserver variation was suggested after the study was complete. The initial observer, who always had a positive control slide for comparison, noted that occasionally, the stain intensity of the positive control slide varied. Because control comparison in the study was not rigidly adhered to by all observers, this may have accounted for some interpretation variation.

In this study, sections from tissues up to 18 years old were received from 24 different laboratories. Despite concern that variables in tissue fixation, processing, and embedding would be significant, tissue from all 24 laboratories was suitable for testing with both IHC and FISH.

Variability in sensitivity and specificity of commercially available HER-2/neu antibodies used with IHC has been previously noted (6, 17). In the study by Press et al. (6), the sensitivities and specificities of 28 antibodies were compared and showed variation in sensitivities ranging from 6 to 82% and specificities of 92 to 100%. In the present study, the sensitivities were found to vary between observers with the same antibody and between antibodies. Overall, the sensitivity ranged from 59 to 100% and the specificity ranged from 91 to 98%. CB11 showed the highest specificity (97–98%) but a low sensitivity (63–66%), and thus was unsuitable as a primary test. TAB250 had the highest sensitivity (82–100%) and very good specificity (94–95%). The negative predictive value of TAB250 was consistently high between observers (100% for absent stain and 80–100% for +1 stain). This suggests that if another antibody were used as the first test, TAB250 may have use as a second IHC test in cases with equivocal results (+1 versus +2). HercepTest and A0485 showed a similar range of sensitivities between observers (A0485, 76–93%; HercepTest, 76–90%), but HercepTest was slightly less specific (A0485, 91–96%; HercepTest, 83–85%) and had a lower PPV and NPV than A0485.

With all antibodies, cases were interpreted as IHC negative that were FISH amplified. However, in only one FISH-amplified case did all 4 antibodies indicate that no protein overexpression was present. This one case may reflect a technical problem with the specimen or a true failure of protein production. For the other cases, the therapeutic significance of this discrepancy is not yet clear.

The performance of HercepTest, the only approved IHC method of HER-2 testing, has recently been questioned (18). The study reported a high level of staining of normal epithelium and a high number of apparent false-positive tests compared with FISH and two other antibodies. This may have been a result of the use of alcoholic formalin as a primary fixative. DAKO specifically advises the use of neutral-buffered formalin in the HercepTest package insert. All our cases were initially fixed with buffered formalin, and staining of normal epithelium was not a problem, emphasizing the importance of the initial fixative. We consider membranous staining of normal epithelium unacceptable for stain interpretation. In contrast, others have found HercepTest to perform well (19). In the present study, HercepTest sensitivity and specificity were comparable to A0485 and TAB250. However, with TAB250 and HercepTest, the +2 stained cases had a low PPV in predicting gene amplification, and if +2 stain were interpreted as positive, more patients with FISH nonamplified status would be treated. In addition, the NPV of +1 with HercepTest and A0485 was poor.

Antigen retrieval may increase the sensitivity of IHC. However, using the same primary polyclonal antibody, this study found comparable results with antigen retrieval (HercepTest) and without antigen retrieval (A0485). We now use A0485 with a microwave antigen retrieval method as specified by the manufacturer. A higher antibody dilution is required to avoid excess cytoplasmic and background staining. This again highlights the need for controls that include nonstaining benign epithelium.

An unknown in this study is the number of false-negative FISH cases. It is possible that some of the IHC false-positive cases were actually FISH false-negative cases. Problems we noted in the interpretation of FISH staining included weak signal interpretation, variability in signal intensity over the section, separation of in situ from invasive cancer cells, and loss of tissue architecture with digestion. The biological significance of the IHC positive–FISH nonamplified cases remains to be seen, although a recent report (13) suggests that the FISH status may be relevant to therapy response.

There is a need to correlate clinical response to Herceptin therapy with HER-2 protein and gene expression levels. As this remains a critical issue in predictive testing, until we have examined our own outcome data in terms of test results and response to Herceptin therapy, we propose to continue to collect both IHC and FISH data on Herceptin-treated cases.

CONCLUSIONS

The results of this study can be summarized as follows.

  1. Provided that initial fixation of tissue is with 10% buffered formalin, blocks from all hospitals and of all ages were suitable for IHC and FISH analysis.

  2. For DAKO A0485, TAB250, and HercepTest, there was generally good concordance between IHC and FISH.

  3. Interobserver agreement for stain interpretation was high for the 4 antibodies when there was negative (0) or strong (+3) staining but was poor for weak (+1) and moderate (+2) staining.

  4. The positive and negative predictive values varied by stain intensity. For most antibodies, stain intensities of +1 and +2 were not sufficiently predictive, suggesting that FISH should be added as a second test. Intensity should always be compared with a positive (+3) control slide that includes nonstaining normal epithelium.

  5. Sections with scant invasive carcinoma, marked tissue fragmentation, excess cytoplasmic staining, or staining of normal epithelium should be rejected. Repeat IHC staining on a more suitable section, with a lower antibody concentration, or FISH analysis should be performed as needed.