Main

The reproducibility of the classification and grading of invasive breast cancer and the cause(s) of interobserver disagreement among pathologists have not been adequately evaluated. Prior studies evaluating interobserver concordance in categorizing breast lesions have documented improved diagnostic agreement when the pathologists involved used agreed-upon criteria,1 but other potential sources of poor interobserver agreement, such as the difficulties in the application of the individual histologic criteria, the individual pathologist's variation in use of these criteria, and most importantly, the ambiguous or borderline and heterogeneous nature of the cases themselves have received less attention. Since it is unlikely that any significant pathologic differences between patient subgroups can be detected without accounting for the presence of tumor heterogeneity (the latter of which may well play a key role in determining variability in clinical behavior), it is important that the collection of pathologic data for cases accessioned into a breast cancer registry database incorporate this variability of tumor classification and grading into the classification scheme.

The Breast/Ovarian Cancer Family Registry is an international consortium that was initiated in 1995 and is supported through the United States National Cancer Institute (NCI). This group was established to provide a comprehensive infrastructure for interdisciplinary research studies of hereditary breast cancer. The participating sites include an Informatics Support Center (Irvine, CA, USA) and six Registry sites. There are three population-based sites: the Ontario Cancer Genetics Network, Cancer Care Ontario, Canada; the Northern California Cancer Center, San Francisco; and the University of Melbourne, Melbourne, Australia; and three clinical-based sites: Fox Chase Cancer Center, Philadelphia; the Huntsman Cancer Center, Salt Lake City; and Columbia University, New York. Each site has at least one study pathologist and some sites also have pathology fellows involved in the review of cases. Currently, funded activities of the registry include establishment of common databases for family history, epidemiology, biospecimens and pathology. As of early 2005, 12 507 families and 37 724 individuals had been enrolled in this Registry.

The Pathology Working Group of the Breast/Ovarian Cancer Family Registry developed a pathology data collection and retrieval system for registry cases that would afford optimal diagnostic concordance among the participating registry sites without sacrificing potentially important information that could be obtained from cases with heterogeneous or borderline histologic features. An abbreviated version of the data collection system was then used to evaluate the registry group pathologists’ diagnostic accuracy and reproducibility of invasive breast cancer classifications using an initial group training session with a standard set of slides, followed by individual assessment of a separate set of slides, using the agreed upon criteria. One of the registry pathologists assessed the study set of slides twice in order to establish a reference standard and ascertain the degree of intraobserver agreement of the chosen reference standard. Since one of the aims of the registry was to obtain an accurate assessment of familial breast cancer subtype(s) and to determine whether there are specific phenotypes of hereditary cancer, we also evaluated the flexibility of the pathology data collection system, given the presence of interobserver disagreement, in identifying all potential examples of these phenotypes.

Materials and methods

Study Design

To identify all areas in which potential diagnostic inaccuracy and/or poor interobserver reproducibility would engender significant misclassification of individual breast lesions subsequently enrolled in the registry, ‘problematic’ breast cancer slides were circulated and discussed at an initial meeting of the Pathology Working Group of the Breast/Ovarian Cancer Family Registry. Based on review and discussion of the problematic slides, a Registry Pathology Review form was specifically designed to capture all relevant pathologic findings in such a manner that ‘borderline’ or ‘ambiguous’ lesions could be identified and retrieved from the registry database without a laborious rereview of all the registry slides. For example, using agreed upon criteria, the form was designed to capture all potential medullary carcinomas of the breast entered into the registry database by a search for ‘medullary carcinoma’ and ‘atypical medullary carcinoma’, as well as ‘ductal carcinoma, not otherwise specified’, and restricting the latter to only those cases with marked lymphocytic infiltrate, presence of syncytial growth pattern or circumscribed margins. Similarly, all potential infiltrating lobular carcinomas could be retrieved by a global search for ‘typical lobular carcinoma’, ‘pleomorphic lobular carcinoma’ or ‘mixed ductal and lobular cancer’. Since the form called for assignation of a primary and if present, a secondary pattern, even those cases in which only a small component of a particular histologic type was identified could be retrieved for future clinicopathologic, epidemiologic or basic research investigations. Following a trial use of the form by the members of the group, modifications were made and the form was standardized (Figure 1). Criteria used for scoring the individual components of the data form were based on published criteria.2
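The retrieval logic described above can be illustrated with a short sketch. This is not the registry's actual database schema or query code: the `ReviewForm` class, its field names and the helper functions are hypothetical, chosen only to mirror the search terms named in the text.

```python
from dataclasses import dataclass
from typing import Optional

# Search terms taken from the text; the data model itself is hypothetical.
MEDULLARY = {"medullary carcinoma", "atypical medullary carcinoma"}
DUCTAL_NOS = "ductal carcinoma, not otherwise specified"
LOBULAR = {
    "typical lobular carcinoma",
    "pleomorphic lobular carcinoma",
    "mixed ductal and lobular cancer",
}


@dataclass
class ReviewForm:
    """One reviewer's entry for one slide (illustrative fields only)."""
    slide_id: int
    primary_pattern: str
    secondary_pattern: Optional[str] = None
    lymphocytic_infiltrate: str = "absent"  # absent, mild, moderate, marked
    syncytial_growth: bool = False
    circumscribed_margins: bool = False


def patterns(form: ReviewForm) -> set:
    """Primary and (if present) secondary pattern as a set."""
    return {p for p in (form.primary_pattern, form.secondary_pattern) if p}


def is_potential_medullary(form: ReviewForm) -> bool:
    """Medullary or atypical medullary in either pattern, or ductal NOS
    with at least one medullary-type feature."""
    pats = patterns(form)
    if pats & MEDULLARY:
        return True
    has_feature = (
        form.lymphocytic_infiltrate == "marked"
        or form.syncytial_growth
        or form.circumscribed_margins
    )
    return DUCTAL_NOS in pats and has_feature


def is_potential_lobular(form: ReviewForm) -> bool:
    """Any of the lobular search terms in either pattern."""
    return bool(patterns(form) & LOBULAR)
```

Because both patterns are searched, a case with only a minor lobular or medullary component is still returned, which is the behavior the form was designed to support.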

Figure 1

Data form used to evaluate breast carcinomas.

A study set of slides was selected to test the accuracy and reproducibility of invasive breast cancer diagnoses using the agreed upon classification system and data entry form. To assess the utility of the data entry form, 35 cases of primary invasive breast cancer were selected by the study group chair (FOM) from routinely processed archival cases accessioned during the same period as the cases enrolled in the registry. Cases were selected to highlight problem areas identified in the initial intergroup meeting. For the purposes of the study, the ‘gold standard’ or reference diagnosis was that rendered by the study chair. For each study case, a single set of 5-μm-thick hematoxylin and eosin-stained sections was prepared, all by the same laboratory. Since in many instances, it is the submitting pathologist and not the actual registry pathologist, who selects the actual registry slide, it was concluded that a single representative slide for each of the cases most optimally simulated actual data recovery. All patient and hospital identifiers were removed and a study number was assigned to each slide. The complete set was evaluated by each of the participating pathologists following a brief training session using a separate set of 15 slides, each selected to depict potential problem areas in invasive breast cancer diagnosis. Since geographic limitations prevented a single training session with all of the members of the working group, two separate training sessions were conducted, each by the same individual and with the same set of training slides. Each participant was asked to evaluate 35 slides with the pathology form (Figure 1). In addition, one participant (FOM) assessed the set of 35 slides twice, the second time after an interval of more than a year in order to set the reference diagnosis and to ascertain reproducibility of the assessments for the reference diagnosis. 
Other than the single page data entry form, no other instructions or teaching sets were supplied. Each participant evaluated the same slide set. The forms were returned to the study coordinator and entered in the database by preassigned codes, thus masking the identity of the pathologist.

The individual characteristics of the cases in the study set with representative comparisons to the registry database are shown in Table 1. The histologic type of invasive breast cancer is summarized in two ways: (1) the primary pattern, collapsed into two categories (no special type vs all others), and (2) the presence or absence of any individual pattern, either primary or secondary. ‘Any medullary feature’ includes both classical and atypical types. Similarly, ‘any lobular feature’ includes both classical and pleomorphic types. The histologic grade of invasive breast cancer is also summarized in two ways: (1) the overall Nottingham grade; I, II or III, and (2) Nottingham score summarized as below seven or seven and above. Cribriform architecture and blood vessel invasion were not present in any of the slides selected and were not further assessed.
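The two summaries of grade and of histologic type can be sketched as follows. The grade cut-offs (total score 3–5, 6–7 and 8–9 for grades I, II and III) are the conventional Nottingham thresholds and are assumed here rather than quoted from the data form; the function names and the substring matching on pattern labels are illustrative only.

```python
from typing import Optional


def nottingham_grade(score: int) -> str:
    """Map a total Nottingham score (3-9) to grade I, II or III.
    Cut-offs are the conventional ones, assumed for this sketch."""
    if not 3 <= score <= 9:
        raise ValueError("Nottingham score must be between 3 and 9")
    if score <= 5:
        return "I"
    if score <= 7:
        return "II"
    return "III"


def score_group(score: int) -> str:
    """Binary summary used in the study: below seven vs seven and above."""
    return ">=7" if score >= 7 else "<7"


def primary_group(primary_pattern: str) -> str:
    """Collapse the primary pattern into 'no special type' vs 'all others'."""
    return "no special type" if primary_pattern == "no special type" else "all others"


def has_any_feature(primary: str, secondary: Optional[str], keyword: str) -> bool:
    """True if the keyword (e.g. 'medullary', 'lobular') appears in either
    pattern, so 'atypical medullary' counts as 'any medullary feature'."""
    return any(keyword in p for p in (primary, secondary) if p)
```

Note that a score of 7 falls in grade II but in the upper binary group, so the two summaries do not partition the cases identically.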

Table 1 Distribution of histologic features of study casesa

The distribution of the histologic characteristics as well as the association among the individual characteristics of the 35 cases based on the reference standard were tabulated. Cramér's V statistic was used as a measure of association since it is suitable for categorical data and takes on values between 0 and 1, and can be interpreted in a manner similar to the magnitude of the usual Pearson's correlation coefficient.3 Data were analyzed using two approaches. In the first approach, category-specific multirater percent agreement and κ statistics were used to characterize the agreement among the 13 pathologists.4 This approach does not assume knowledge of the ‘truth’ or gold standard diagnosis for the given slide, but simply assesses the extent to which the pathologists agree among themselves. κ statistics were calculated to evaluate levels of agreement adjusted for agreement expected to occur by chance alone. Since κ is influenced by the prevalence of the characteristic being measured, agreement was measured by category-specific κ and percent agreements to accommodate uncommon or low-prevalence features.5 In general, κ statistics less than 0.4 are associated with relatively poor agreement, values of 0.4–0.6 with moderate agreement, values of 0.6–0.8 with substantial (good) agreement and values greater than 0.8 with excellent (almost perfect) agreement.6
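As a minimal sketch of these statistics, the functions below compute a category-specific multirater agreement and κ, and Cramér's V for a contingency table. The formulas are the standard Fleiss-type category-specific form and the usual chi-square-based definition of V; the text does not give the exact formulas used, so this should be read as an assumption, and the function names and input layout are hypothetical. The sketch also assumes every slide was scored by the same number of raters.

```python
import math


def category_specific_agreement(counts, j):
    """counts[i][k] = number of raters who assigned slide i to category k.
    Returns (percent agreement, kappa) for category j, Fleiss-type.
    Assumes the same number of raters n for every slide."""
    n = sum(counts[0])                       # raters per slide
    total_j = sum(row[j] for row in counts)  # all assignments to category j
    p_j = total_j / (len(counts) * n)        # overall proportion in category j
    if total_j == 0 or p_j == 1:
        raise ValueError("category never used or always used; kappa undefined")
    # Probability that a second rater also chooses j, given one rater chose j.
    agreement = sum(row[j] * (row[j] - 1) for row in counts) / (total_j * (n - 1))
    kappa = (agreement - p_j) / (1 - p_j)    # chance-corrected agreement
    return agreement, kappa


def cramers_v(table):
    """Cramer's V for an r x c table of counts (all marginals non-zero)."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for row, rt in zip(table, row_tot):
        for obs, ct in zip(row, col_tot):
            expected = rt * ct / n
            chi2 += (obs - expected) ** 2 / expected
    k = min(len(table), len(col_tot)) - 1
    return math.sqrt(chi2 / (n * k))
```

For a feature that most raters score as absent, the agreement for "present" can be low while the agreement for "absent" remains high, which is why the two categories are reported separately.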

In the second approach, the reproducibility (or accuracy) of the study pathologists’ diagnoses was assessed relative to the reference standard. This latter analysis can be extrapolated as a reflection of completely centralized pathology review vs the semicentralized review actually implemented. High accuracy in this approach was defined as a high probability of an individual pathologist detecting a feature given that it is detected by the reference standard. In order to assess the intraobserver agreement of the standard, category-specific κ's and percentages of agreement were calculated using the first and second assessments by the reference standard. The other pathologists’ assessments were then compared to the reference to determine what percentage of the slides assigned to a category by the reference was also assigned to that category by the reviewing pathologists.

Occasionally, reviewing pathologists did not score individual items on the assessment sheets, but since the number of nonscored items was low (on average, 2.5% were missing), the nonscored items were accommodated by adjusting the denominator number of slides appropriately in the calculations.
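The accuracy measure and the handling of nonscored items can be sketched together. The function below is illustrative only: the name `accuracy_vs_reference` and the use of `None` for a nonscored item are assumptions, not part of the study's software.

```python
def accuracy_vs_reference(reference, reviewer, category):
    """Fraction of slides assigned to `category` by the reference standard
    that the reviewer also assigned to `category`.

    `reference` and `reviewer` are parallel lists of scores per slide.
    A reviewer score of None marks a nonscored item; such slides are
    dropped from the denominator rather than counted as disagreements.
    Returns None if no scored slide falls in the category.
    """
    hits = 0
    scored = 0
    for ref, rev in zip(reference, reviewer):
        if ref != category or rev is None:
            continue
        scored += 1
        if rev == category:
            hits += 1
    return hits / scored if scored else None
```

Applying the same function to the first and second assessments by the reference pathologist gives a simple percentage check of intraobserver consistency for each category.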

Results

The study group consisted of six pathologists from population-based sites and seven from clinical-based sites. Three have a special interest in breast pathology, three were surgical pathology fellows during the study period and all remaining participants either practiced general pathology or have a special interest in other areas of surgical pathology.

Since cases were selected to (1) highlight problem areas previously identified in the initial intergroup meeting and (2) represent cases accessioned into the registry database, the study set tended to over-represent unusual histologic subtypes and higher grade tumors. The distribution of cases in the study set and in the registry database is presented in Table 1. Representative dot plots of the individual pathologists’ scores for primary and secondary histologic patterns, Nottingham grade and lymphocytic infiltrate for each of the 35 slides are presented in Figures 2, 3, 4 and 5.

Figure 2

Dot plot depicting distribution of scores for primary histologic pattern, (specific patterns are provided at left), of 35 invasive breast cancers evaluated by the 13 pathologists. The slide number of the individual breast cancer is provided above each boxed entry. Grids at top and bottom represent individual pathologists, with the reference standard at 1. If a specific histologic pattern is not identified by an individual pathologist as the primary pattern (eg lobular, pathologist 7 case 33), it is often identified as the secondary pattern (see Figure 3).

Figure 3

Dot plot depicting distribution of scores for secondary histologic pattern, (specific patterns are provided at left), of 35 invasive breast cancers evaluated by the 13 pathologists. The slide number of the individual breast cancer is provided above each boxed entry. Grids at top and bottom represent individual pathologists, with the reference standard at 1.

Figure 4

Dot plot depicting distribution of scores for Nottingham grade of 35 invasive breast cancers evaluated by the 13 pathologists. The slide number of the individual breast cancer is provided above each boxed entry. Grids at top and bottom represent individual pathologists, with the reference standard at 1. Interobserver reproducibility for Nottingham grade is good overall, but significantly better for grade III cancers than grade II cancers based on the reference standard.

Figure 5

Dot plot depicting distribution of scores for lymphocytic infiltration in 35 invasive breast cancers evaluated by the 13 pathologists. The slide number of the individual breast cancer is provided above each boxed entry. Grids at top and bottom represent individual pathologists, with the reference standard at 1. Although agreement appears low, when collapsed into a binary score (absent or mild vs moderate or marked) the accuracy of classification with respect to the reference standard was quite high (mean, 73.7–92.9%).

Classification of the specific subtype of breast cancer by primary pattern showed generally high agreement (Figure 2). Causes of discrepant diagnoses were most commonly attributed to a ‘no special type’ classification by one reviewer and a ‘lobular’, ‘atypical medullary’, or ‘mucinous’ type by other reviewers. In many of these cases, the discrepant diagnoses were ultimately captured in the secondary pattern; that is, although ‘no special type’ was assigned to the primary pattern by a reviewer, ‘lobular’ or ‘medullary’ or ‘mucinous’ was assigned to the secondary pattern (and vice versa for the other reviewers) (Figures 2 and 3). In these instances, when the primary and secondary patterns were grouped, discrepancies in classification of histologic type were significantly decreased, but not completely eradicated (Table 2).

Table 2 Category-specific κ and percent interobserver and intraobserver agreement

The interobserver percent agreement for classification of the specific histologic type of the 35 invasive breast cancers by primary or secondary pattern ranged from 35 to 99.5%; this corresponded to a category-specific κ range of 0.3–1.0 (Table 2). Despite the relatively large range in category-specific κ for the entire group, most of the poor reproducibility of classification of the invasive cancers could be attributed to the classification/misclassification of specific uncommon subtypes of invasive breast cancer. This was true for the 13 pathologists’ interobserver agreement as well as for the reference pathologist's intraobserver agreement. Persistent causes for disagreement involved classification of the primary pattern as ‘micropapillary’, ‘medullary’ or ‘metaplastic’ carcinoma by one reviewer and classification of the primary pattern as ‘no special type’ by another reviewer without assignation of a specific secondary pattern. However, interobserver percent agreement was quite high when each of these diagnoses was considered to be absent. The category-specific agreement was highest for tubular (78.7%), mucinous (96.0%) and lobular (78.0%) subtypes. Not surprisingly, there was significant interobserver disagreement in the classification of cancers as ‘medullary’. Despite the wide range in κ for the classification of histologic type, the accuracy for assignation of histologic subtype (defined as the degree to which the reviewing pathologist identified the same feature as the reference pathologist) was quite high (Table 3), especially for ductal carcinoma, no special type (mean, 92%), any mucinous carcinoma (mean, 95.8%) and any lobular carcinoma (mean, 90%). The accuracy for classification of metaplastic carcinoma was quite low, in part due to the presence of only one such case in the study set. The case included was particularly difficult to interpret, as the vast majority of the lesion was composed of high-grade epithelial cells, with a small focus of chondroid change present at the edge of the section.

Table 3 Accuracy of classification of histologic features in invasive breast cancer relative to reference standarda

Category-specific κ values for the Nottingham grade of the 35 invasive breast cancers ranged from 0.5 to 0.7, with a corresponding percent agreement of 61.4–87.8%. κ values for Nottingham score ≥7 or <7 were slightly better (κ=0.7). The intraobserver percent agreement for Nottingham grade (87–100%) and score (94.7–98%) was markedly better than the interobserver agreement (the value of this level of intraobserver agreement is admittedly limited in that only one pathologist was utilized for this calculation, but it provides a sense of reproducibility for the reference standard used to assess the accuracy of classification in this study). Disagreement on grading was usually attributed to differences in classification of the grade II carcinomas. Even though there was a relatively wide range in interobserver agreement for Nottingham grade, the accuracy for classification of the grade was quite high, ranging from 75 to 100% (mean, 83.3%) for grade I; 50 to 83.3% (mean, 64.6%) for grade II and 79 to 100% (mean, 92.3%) for grade III tumors.

The category-specific percent agreement among the 13 pathologists for the presence of lymphatic space invasion was 55% in this study, whereas the category-specific agreement for the absence of lymphatic space invasion was 91%. The comparative intraobserver agreement for the presence or absence of this histologic feature was 90.9 and 98.3%, respectively. The accuracy for the individual pathologists ranged from 33.3 to 100% (mean, 65.6%) for the determination of the presence of lymphatic space invasion, while the accuracy for the determination of the absence of lymphatic space invasion ranged from 48.2 to 100% (mean, 92.9%). Two-thirds or more of the cases were accurately classified as positive for lymphatic space invasion by 50% of the reviewing pathologists while all but one reviewing pathologist accurately classified two-thirds or more of the cases as negative for lymphatic space invasion.

The interobserver percent agreement for lymphocytic infiltration ranged from 31.4 to 58.3% (category-specific κ=0.2–0.4) when the infiltrate was evaluated using a four-tier score (absent, mild, moderate, marked), but improved to 73.8–80% (category-specific κ=0.6) when collapsed into a binary score (absent or mild vs moderate or marked). The level of intraobserver agreement for this feature was not markedly better than the interobserver agreement and was only modestly improved by the use of the binary score. However, the accuracy of classification with respect to the reference standard was quite high with the binary scheme (mean, 73.7–92.9%) (Table 3). The interobserver and intraobserver percent agreement for the presence of syncytial growth and circumscribed margins was 61.2 and 66.7%, and 50.9 and 75%, respectively; however, the level of agreement was markedly higher for the absence of these two features (91.4 vs 96.9% and 83.3 vs 92.6%, respectively). Average accuracy of identifying the presence of syncytial growth or circumscribed margins was 97.2 and 67.9%, respectively (Table 3).

Since one of the primary aims of the cancer registry involved correlation of specific histologic subtypes of invasive cancer with BRCA1 and BRCA2 mutation status, as well as with other potential molecular, epidemiologic, and therapeutic outcomes, the ability to search and identify specific histologic types within the registry was of interest to the Pathology Working Group. Although the category-specific interobserver agreement for the individual histologic criteria for medullary carcinoma (lymphocytic infiltrate, syncytial growth, circumscribed margin) varied widely, on average 94% of the primary and secondary pattern medullary cases identified by the reference standard could be identified in the data sheet for each reviewer by either primary or secondary histologic pattern, marked lymphocytic infiltrate, presence of circumscribed margins or presence of a syncytial growth pattern. By widening the net in this way, the number of potential medullary cases was increased on average three-fold compared to primary/secondary patterns alone. Similarly, of the primary and secondary patterns identified as tubular or lobular by the reference standard, on average 96 and 88%, respectively could be retrieved from the data sheets for each reviewer by also including cases with Nottingham grade I.
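The effect of widening the search can be expressed with two simple quantities per reviewer: the fraction of reference-standard cases that the widened search retrieves, and the factor by which the candidate list grows relative to a search on primary/secondary pattern alone. The sketch below uses hypothetical slide numbers purely for illustration; none of the values are study data.

```python
def retrieval_summary(reference_cases, narrow_hits, wide_hits):
    """Summarize one reviewer's data sheets against the reference standard.

    reference_cases: slide ids the reference standard called 'any medullary'
    narrow_hits:     slide ids the reviewer's primary/secondary pattern retrieves
    wide_hits:       slide ids retrieved by pattern OR marked lymphocytic
                     infiltrate OR circumscribed margins OR syncytial growth
    Returns (fraction of reference cases retrieved, expansion factor).
    """
    retrieved = len(reference_cases & wide_hits) / len(reference_cases)
    expansion = len(wide_hits) / len(narrow_hits)
    return retrieved, expansion


# Hypothetical example for a single reviewer (not study data).
reference_cases = {3, 11, 20, 27}
narrow_hits = {3, 11}
wide_hits = {3, 5, 8, 11, 14, 20}
retrieved, expansion = retrieval_summary(reference_cases, narrow_hits, wide_hits)
```

Averaging both quantities over the reviewers gives figures of the kind reported above. The trade-off is explicit: a wider net recovers more of the reference cases at the cost of a longer candidate list that requires rereview.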

Discussion

Recent studies have indicated that interobserver agreement in breast cancer grading and typing can be optimized by the use of well-defined, agreed upon criteria and terminology.7, 8 Despite these advances in our understanding of breast cancer diagnosis and histopathologic classification, an unspecified degree of interobserver and intraobserver disagreement is inescapable in the assignation of any classification, grade or overall score to a pathologic process that (1) is based on subjective distinctions along a histologic continuum and (2) requires evaluation for a variety of pathologic characteristics, some of which may be relatively uncommon. In implementing the pathology review and data collection for the breast cancer registry, our goals were to optimize interobserver agreement by establishing uniform, well-specified and agreed upon criteria for classification, to identify potential sources of persistent interobserver disagreement and to design a data entry form that could accommodate this level of disagreement.

Given the constraints imposed by the overall registry goals, the level of agreement obtained by the registry pathologists in grading these 35 invasive carcinomas was quite good. Category-specific κ scores were moderate to good for Nottingham grade (κ=0.5–0.7), good for histologic score <7 vs ≥7 (κ=0.7) and, although not directly comparable, somewhat higher than those reported previously. Although Frierson et al10 reported moderate to substantial κ values for interobserver agreement for histologic grade, Delides et al9 found low interobserver agreement, and overall κ values for the European Working Group and for the Japan National Surgical Adjuvant Study were moderate at best.8, 9, 10, 11 Interobserver and intraobserver agreement, as well as accuracy of classification relative to the reference standard, were higher for the grade I and grade III tumors than for the grade II tumors. These results are similar to those of Dalton et al,12 who showed that excellent agreement for histologic grade was more likely to occur for extremely low-grade and extremely high-grade cancers. In our study, it appeared that differences in assessment of the degree of nuclear pleomorphism were most commonly responsible for differences in assessment of overall histologic score, followed by mitotic index and tubule formation (data not shown).

The significant degree of interobserver disagreement that occurs in the allocation of nuclear grade in breast cancer has been noted in previous studies.9, 10, 12, 13, 14 In at least one series, it was suggested that pathologists who are not specialists in breast disease tend to underscore, possibly due to a preconception that invasive breast cancer sorts equally into each of the three grades.15 It has also been argued that reproducibility of nuclear pleomorphism is difficult because of the nonquantitative nature of the scoring method,16 but in our opinion, the intermediate nature of some breast cancers and the heterogeneity in nuclear pleomorphism that can occur in these malignancies is underappreciated and probably contributes to the relatively high and variable degree of interobserver disagreement, especially with respect to the intermediate grade tumors. In comparison to nuclear pleomorphism, the criteria for scoring tubule formation and mitotic score are relatively robust. Concordance in mitotic counts is highest when counts are determined in the same area, using established counting methods and established criteria.16, 17, 18, 19 Mitotic counts also depend on the quality of the tissue processing and the size of the ocular lens.16, 18, 19 Since the registry relies in part upon the ability of the participating pathologists to select the optimum area for mitotic counts, there was no attempt to guide the reviewers to any single designated area on the study slides and this is likely responsible for some of the interobserver disagreement. Nevertheless, it is likely that the category-specific κ obtained in our study overestimates the level of agreement that would occur during the actual performance of the registry data collection, since the area of determination of mitotic counts is occasionally selected from among several sections of tumor by the registry pathologist during the actual review procedure. 
The level of degradation in interobserver agreement would depend on the numbers of cases in which several sections were examined, the number of sections examined and the relative contribution of moderately differentiated tumors, all of which would likely vary depending on the registry center and the submitting hospital.

Only one previous study has evaluated the ability of a group of pathologists to assign a histologic subtype to a range of invasive carcinomas. In the study conducted by the European Working Group, it was found that subtyping was most consistent for mucinous carcinoma, followed by lobular carcinoma, and least consistent for medullary carcinoma.11 Our results are similar, with the additional finding of a relatively high degree of consistency for subtyping tubular carcinoma. The poor reproducibility for the diagnosis of invasive lobular carcinoma relative to ductal and mucinous tumors has been noted by others and appears to be due to (1) a tendency for overdiagnosis of lobular cancer; (2) confusion regarding diagnostic criteria for the pleomorphic subtype and (3) suboptimal histology.20 The moderate degree of interobserver agreement for medullary carcinoma has been the subject of prior studies.21, 22, 23 Utilizing the criteria of Ridolfi et al,24 consensus diagnoses were achieved in 56.3% of cases in the current study. These results are similar to those achieved by others.21, 22 In our study, the range in interobserver agreement was greatest for lymphocytic infiltrate, followed by margin status and syncytial growth pattern. Moreover, when the coassociation of individual histologic features was analyzed, the medullary subtype was most highly associated with circumscribed margins, followed by syncytial growth and lymphocytic infiltration (data not shown). These findings are similar to those of Gaffey et al23 and contrast with those of Pedersen et al,22 who found that interobserver agreement was lowest for circumscription, although the latter authors used a three-tiered scoring system for circumscription and lymphocytic infiltrate.

It was anticipated that cases could be ambiguous either due to the histologic features of the individual tumor or due to difficulties in the application of the individual criteria. Therefore, the pathology data sheet was designed to accommodate cases that appeared to be ambiguous, mixed or borderline to the reviewer, regardless of whether it was due to the borderline nature of the tumor itself or due to underspecified or poorly specified criteria. As expected, interobserver disagreement in classification of histologic type was largely due to differences in classification of lobular and medullary carcinoma, but differences in classification of the former diagnosis were markedly decreased by assigning a primary and secondary pattern to each of the cases. Thus, a case that was scored as ‘lobular’ for the primary pattern by most reviewers was scored as ‘lobular’ for the secondary pattern by the remaining reviewers in two cases (100%) and all but one reviewer in the other two cases (92%) (Figure 3). Classification of cancers exhibiting a medullary pattern was also improved by incorporating primary and secondary pattern, but not to the same degree, in part due to the absence of classical medullary carcinomas in the study set and in part due to the inherent poor reproducibility for this diagnosis. However, even though reproducibility for the diagnosis of medullary carcinoma is quite poor, the four cancers scored as ‘any medullary’ by the reference standard in this study could be identified by most pathologists by a combination of medullary pattern, marked lymphocytic infiltrate, circumscribed margins or syncytial pattern. The ability to identify the majority of cases falling within the diagnostic range for these particular subtypes is important, given the predilection for lobular and medullary cancer in familial and hereditary breast cancer families.25, 26, 27, 28, 29, 30, 31

Central review continues to be a necessary component to any large cooperative study involving pathologic materials. However, we have shown that a well-designed data entry sheet for pathology review obviates the need for a single central review and permits the review process to occur on a more localized basis, provided the data entry form is designed to facilitate the identification and retrieval of the histologically ambiguous and unambiguous cases. This approach promotes a shift from the dualistic paradigms of lobular/ductal or medullary/nonmedullary to one that embraces a histologic continuum and recognizes tumor heterogeneity. In our opinion, this latter approach, in conjunction with epidemiologic, therapeutic and molecular developments, is the approach that is most likely to advance our understanding of carcinogenesis and ultimately, our therapeutic decisions.