Introduction

Breast carcinoma grading schemes have evolved over the last century, and histologic grading is one of the most important prognostic features in the evaluation of early-stage breast carcinoma [1,2,3,4,5,6]. The Nottingham system, which has been endorsed by the College of American Pathologists and the World Health Organization, utilizes three variables: gland formation, nuclear grade, and mitotic rate [1]. In the latest edition of the AJCC Cancer Staging Manual, in addition to the anatomic stage groups (based on TNM alone), breast carcinomas can be organized into prognostic stage groups based on additional information including grade, biomarker status (i.e., estrogen receptor [ER], progesterone receptor [PR], and human epidermal growth factor receptor 2 [HER2]), and molecular testing results [7]. According to a recent publication that attempted to validate the new staging system, grade was statistically associated with overall survival, and the prognostic stage group system was shown to outperform TNM alone [8].

For prognostic markers, such as histologic grade, to be robust, there must be high reproducibility and low interobserver variability. Studies have shown that interobserver variability in breast carcinoma grading ranges from fair to good based on kappa statistics [9,10,11,12,13]. Since the incorporation of grading into the AJCC manual, little is known about how variability in grading might affect prognostic stage groups [14].

Advances in technology have led to the advent of virtual microscopy (VM) using digital whole-slide imaging (WSI), in which glass slides are digitally scanned at high resolution for viewing on a screen. While the technology has mostly been used in educational, research, image analysis, and quality assurance settings, there is increased interest in broadly applying VM to the clinical domain [15,16,17,18,19,20]. Some platforms have been approved by the US Food and Drug Administration for diagnostic use [21]. Data are limited regarding the variability in breast carcinoma grading using VM; however, recent studies have shown moderate concordance between grading using VM versus light microscopy (LM) [12, 13].

Considering the recent changes to the AJCC staging manual and organization of breast carcinomas into prognostic stage groups, understanding the interobserver variability of breast cancer grading is critical. Furthermore, as the push for using VM rather than LM in primary sign-out increases, it is important to evaluate pathologists’ concordance in this setting. We sought to evaluate interobserver variability amongst a multi-institutional group of academic breast pathologists using digital WSI. As a secondary measure, we also evaluated whether discordances in grading would affect prognostic stage groups.

Materials and methods

Patient cohort

Cases of consecutive invasive breast carcinoma from the calendar year 2016 were identified in the pathology files at New York-Presbyterian Hospital/Weill Cornell Medicine. Cases of microinvasive carcinoma, those with insufficient tumor area to perform formal mitotic counts (MCs), and those treated with neoadjuvant chemotherapy were excluded. The final cohort consisted of 143 consecutive invasive breast carcinomas. Archived hematoxylin and eosin slides were reviewed by one pathologist (PSG), who selected one representative slide for each lesion to be scanned into the digital slide platform. Pertinent clinicopathologic variables including age, gender, laterality, hormone receptor (HR) status, HER2 status, tumor focality, tumor size, and lymph node involvement were obtained from a review of the patients' surgical pathology reports. Institutional review board approval was obtained for all parts of this study.

Digital whole-slide scanning

Slides were scanned at a ×40 magnification using a single z-plane via an Aperio AT2 whole-slide scanner (Leica Biosystems, San Diego, CA, USA). Scanned digital WSI were evaluated for quality and to ensure that they were in focus. De-identified digital files in (.svs) format were stored on an image server for remote evaluation using the Aperio ImageScope application (Leica Biosystems, Buffalo Grove, IL, USA).

Pathologic examination and grading

The digital WSIs were independently reviewed by six pathologists (PSG, RI, TMD, SF, SJ, and MH). All pathologists were instructed to grade tumors based on established criteria for tubule formation (TF), nuclear pleomorphism (NP), and MC according to the Nottingham Grading System [1, 7]. Since the area viewed on digital slides differs based on screen size, browser size, etc., the pathologists were provided instructions for annotating areas corresponding to a total area of 2.38 mm², which corresponds to the area of ten high-power fields evaluated using an eyepiece with a field diameter of 0.55 mm, in which to perform MCs. Within this area, MCs of ≤8, 9–17, and ≥18 were scored as 1, 2, and 3, respectively. All pathologists included in this study have a subspecialty interest and/or fellowship training in breast pathology, with attending-level sign-out experience ranging from 4 to 25 years (median: 14 years). Pathologists were blinded to the original LM grade as well as other clinicopathologic parameters.
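The stated annotation area follows from simple geometry: ten circular high-power fields of diameter 0.55 mm cover approximately 2.38 mm². A minimal sketch of the arithmetic (the field diameter and field count are taken from the text above):

```python
import math

FIELD_DIAMETER_MM = 0.55  # eyepiece field diameter used in the study
N_FIELDS = 10             # number of high-power fields in a standard mitotic count

# Area of one circular high-power field: pi * r^2
field_area = math.pi * (FIELD_DIAMETER_MM / 2) ** 2

# Total area equivalent to ten high-power fields
total_area = N_FIELDS * field_area  # ~2.38 mm^2
```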

Evaluation of potential confounders

Following VM grading, participants were invited to complete a questionnaire (Supplementary Fig. 1). Seven questions were used to assess the experience (number of years in practice), work environment (academic and/or nonacademic laboratory), daily work method (conventional LM and/or digital pathology), weekly amount of time dedicated to breast pathology, the habit of reporting nuclear grade in cases with heterogeneity, the method used to determine the mitotic rate, and whether any tumors were graded based on the assumption that it represented a special type of carcinoma.

Statistical analysis

Fleiss’ κ for overall agreement amongst all observers was calculated for overall grade and its individual components, for pairwise comparisons between individual pathologists, and for histopathologic types of invasive carcinoma. Levels of agreement based on the kappa statistic were defined as follows: ≤0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 good, and 0.81–1.00 very good [22, 23]. The most common grade (statistical mode) was taken as the gold standard, and interobserver concordance was evaluated based on this grade. When appropriate, t tests were performed to examine associations between the degree of interobserver variability and the possible confounders addressed in the questionnaire. A P value of <0.05 (two-tailed) was considered significant. All analyses were performed in R (version 3.6.1) using the irr package.
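Fleiss’ κ generalizes Cohen’s κ to more than two raters by comparing the mean observed per-case agreement against the agreement expected by chance from the marginal category frequencies. The analysis here used the irr package in R; for illustration only, a minimal pure-Python implementation of the same statistic is sketched below (the six-rater, three-grade example data are hypothetical, not drawn from this study):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table of counts.

    ratings: list of rows, one per case; each row holds the number of
    raters assigning that case to each category, and every row sums to
    the (constant) number of raters.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cat = len(ratings[0])

    # Overall proportion of assignments falling in each category
    p_j = [sum(row[j] for row in ratings) / (n_items * n_raters)
           for j in range(n_cat)]

    # Observed agreement for each case (pairs of raters who agree)
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in ratings]

    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # chance-expected agreement
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical example: 4 cases graded 1-3 by 6 raters; columns are the
# number of raters choosing grade 1, 2, and 3 for each case.
ratings = [
    [6, 0, 0],  # unanimous grade 1
    [0, 6, 0],  # unanimous grade 2
    [0, 0, 6],  # unanimous grade 3
    [2, 2, 2],  # maximal disagreement
]
kappa = fleiss_kappa(ratings)  # 0.7 for this toy table
```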

Results

Patient and clinicopathologic characteristics

One hundred forty-three consecutive invasive breast carcinomas from 135 patients were identified. Two patients had bilateral invasive carcinoma. Three patients had multiple morphologically distinct ipsilateral invasive carcinomas. Another patient had bilateral invasive carcinoma and multiple morphologically distinct ipsilateral invasive carcinomas. The cohort included 134 female patients and one male patient, with a mean age of 63 years (range: 29–98). One hundred twenty-five tumors were HR-positive, HER2-negative/equivocal; seven tumors were HR-positive, HER2-positive; ten tumors were triple-negative; and one tumor was HR-negative, HER2-positive. Additional histopathologic features are described in Table 1.

Table 1 Clinicopathologic features of cohort.

Agreement in breast carcinoma grading

Perfect agreement was observed in 43 cases (30%) (Fig. 1). Perfect agreement was achieved in 14 grade 1 carcinomas (9.7%), 14 grade 2 carcinomas (9.7%), and 15 grade 3 carcinomas (10.5%). Discordance between grades 1 and 2 was observed in 28 cases (19.6%), and discordance between grades 2 and 3 was observed in 68 cases (47.6%). Four cases (2.8%) demonstrated a discrepancy between grades 1 and 3, three of which also showed a 1–3 category discrepancy in the mitotic rate. None of these cases showed a 1–3 category discrepancy in TF. In one case, there was an even split between pathologists in terms of grade. Excluding the single case with an even split between grades, complete concordance amongst all pathologists was observed in 56% (14/25), 20% (14/70), and 32% (15/47) of tumors with a modal grade of 1, 2, and 3, respectively (p = 0.003, χ² = 24.7).

Fig. 1: Examples of cases with perfect and 2-step discordance.
figure 1

A Whole-slide scanned image of a case with perfect overall grading concordance shows a homogeneous tumor lacking any tubule formation. B On higher magnification, the carcinoma shows pronounced nuclear pleomorphism. The presence of apoptotic debris (circles) did not affect enumeration of the conspicuous mitoses (arrows). C Whole-slide scanned image of a case with two-step overall grading discordance shows a tumor with variable tubule formation. D While nuclear pleomorphism was predominantly intermediate, occasional higher-grade cells were present (not shown). In this case, differentiating mitoses (arrows) from apoptotic debris (circles) likely contributed to a two-step discordance in mitotic rates amongst the six pathologists. E Whole-slide scanned image of a case with two-step overall grading discordance amongst pathologists shows a tumor with heterogeneous tubule formation. F Nuclear pleomorphism scoring was split evenly amongst pathologists between grades 2 and 3. Both heterogeneity in mitotic activity and difficulties in differentiating mitoses (arrows) from apoptotic debris (circles) likely contributed to a two-step discordance in mitotic rates amongst the six pathologists.

For the individual components, perfect agreement was reached for TF in 70 cases (49%), NP in 45 cases (31.5%), and mitotic activity in 28 cases (19.6%). Perfect agreement on grading was attained in 31 of 108 cases (28.7%) of invasive ductal carcinoma, no special type (IDC), 6 of 23 cases (26%) of invasive lobular carcinoma (ILC), and 6 of 12 cases (50%) of special types of invasive carcinoma.

Overall interobserver variability in breast carcinoma grading

Interobserver agreement for grade was moderate (κ = 0.497), with the best agreement for grade 1 (κ = 0.705), followed by grade 3 (κ = 0.491), and only fair agreement for grade 2 (κ = 0.375) (Table 2). For observer pairs, concordance ranged from fair to good (κ = 0.354–0.684) (Table 3).

Table 2 Interobserver variability based on grade, individual grading components, and histopathologic type.
Table 3 Pairwise Fleiss’ κa for overall grade interobserver variability.

Interobserver agreement was fair to moderate for the individual components with kappas of 0.281, 0.403, and 0.503 for the mitotic rate, NP, and TF, respectively (Table 2). For the individual categories of the grade components, the degree of agreement ranged from slight to good, with the least concordance for the mitotic rate category 2 (κ = 0.121) and the best concordance for TF categories 1 and 3 (κ = 0.613 each) (Table 2). Interobserver agreement was better for patients with IDC (κ = 0.490) than ILC (κ = 0.092). Concordance was good for the other types of invasive carcinomas (κ = 0.606) (Table 2).

Impact of interobserver variability on pathologic prognostic stage

Of the 143 cases, 127 were from patients with a single tumor evaluated for prognostic staging. For the three patients with bilateral tumors, both tumors were evaluated for prognostic staging. For the four patients with multiple histologically distinct ipsilateral tumors, the largest tumor was evaluated for prognostic staging. In 14 cases, lymph nodes were not submitted, precluding pathologic prognostic staging; these cases were excluded from the analysis. In all, 124 tumors were evaluated for the impact of interobserver variability on prognostic stage, of which 38 demonstrated complete agreement amongst pathologists in histologic grading. There were 86 cases with discrepancies in histologic grading, of which 17 (19.8%) led to changes in prognostic staging (Table 4). Discrepancies in grading most frequently resulted in a change of stage from IA to IB (n = 9; 10.4%), followed by IB to IIA (n = 3; 3.5%), IB to IIB (n = 3; 3.5%), and IIIA to IIIB (n = 2; 2.3%). All of the cases in which discrepancies in grading led to changes in prognostic staging were HR-positive, HER2-negative/equivocal. Of the cases where discordances in grading led to differences in prognostic stage, eight had Oncotype DX testing performed. In two of these cases, the Oncotype DX recurrence score was <11, which would have resulted in a prognostic stage of IA regardless of grade. For both of these cases, the discrepancy in grade resulted in a change from IB to IA.

Table 4 Pathologic prognostic stage of cases with discordant tumor grades.

Confounders

Potential confounders that might affect variability were evaluated. These included experience, work setting (academic, nonacademic), type of microscope used (conventional LM, or digital microscopy [DM] plus conventional LM), time dedicated to breast pathology, nuclear grading in cases of heterogeneity, the method used to determine the mitotic rate, and the influence of special type classification on grading. No significant associations were observed for years of experience (when dichotomized as ≤14 years versus >14 years) or the method used to determine the mitotic rate (P > 0.05) (Table 5). Since all the participating pathologists practice in a predominantly academic setting, we cannot determine whether interobserver agreement would differ in the community setting. The majority of pathologists in our study spent at least 40% of their time on breast sign-out, so we cannot exclude the possibility that interobserver variability would be significantly different among pathologists who devote less than 40% of their time to breast sign-out. Finally, while we observed neither a difference in interobserver variability between pathologists who graded based on special type (i.e., cribriform, tubular, lobular, etc.) and those who did not, nor a difference related to the habit of reporting nuclear grade in cases with heterogeneity, we lack the statistical power to confirm these observations.

Table 5 Distribution of answers of the 6 participating pathologists regarding potential confounders that might influence the degree of interobserver variability.

Discussion

Breast cancer grading has long been an important prognostic factor in breast carcinoma and, with its incorporation into prognostic staging in the most recent AJCC staging manual, continues to be a key pathologic feature used in the treatment of breast cancer patients [6, 7, 24]. Digital WSI and VM are increasingly being incorporated into routine clinical practice, which may include sharing of digital WSIs in lieu of glass slides for second opinion diagnosis. As such, demonstrating reasonable concordance amongst pathologists using this platform, particularly at multiple institutions, is of the utmost importance. Refinements in breast carcinoma grading, which include specific criteria for assessing TF, NP, and mitotic scoring, render this system amenable to assessing reproducibility amongst pathologists using digital WSI [1, 5].

Many studies have evaluated variability in breast carcinoma grading by pathologists using LM. When compared to both single- and multi-institution LM studies, which have mostly demonstrated moderate-to-good levels of interobserver agreement, we found a similar rate of concordance in overall breast cancer grading using VM [9, 14, 25,26,27,28,29]. While our pairwise agreement, which ranged from fair to good (κ = 0.354–0.684), is similar to some studies [30], others have demonstrated higher degrees of concordance [14]. Our results resembled those of other published studies wherein agreement for the individual components has mostly been fair to moderate [9, 14]. We too found that agreement for grade 2 tumors tended to fall below that of grade 1 and grade 3 tumors [10, 28, 29]. Similar to others, we found that variability was lowest for TF [14, 25, 26, 30]. While some studies have also demonstrated the greatest variability for the mitotic rate [25, 26, 30], as was demonstrated in our study, others have observed greater variability in NP [14, 28]. Finally, this and other studies have shown that while discrepancies of one step (grade 1 versus grade 2, grade 2 versus grade 3) are common, discrepancies of more than one step (i.e., grade 1 versus grade 3) are, with rare exceptions, infrequent (1–5%) [9, 10, 12,13,14, 25, 30, 31]. Our study showed that concordance using VM is not largely different from that observed in studies using LM.

Multi-institutional studies evaluating concordance of breast cancer grade using VM are limited [32]. In one such study, VM interobserver concordance for overall breast cancer grade, performed on ×40 magnification digital WSI, was moderate and was similar to that observed using LM [32]. As for the individual components, agreement was greatest for TF with moderate concordance (κ = 0.54), followed by the mitotic rate with fair concordance (κ = 0.35), and was worst for NP with only slight concordance (κ = 0.15) [32]. These results are mostly similar to our findings; however, the reason for the slight agreement for the NP component of the grading system in their study is unclear. Other studies that have evaluated breast cancer grading using VM primarily compared VM to LM grade and studied intraobserver variability using VM [12, 13]. In these studies, VM breast cancer grading performed on ×20 magnification digital WSI was compared to the routinely reported grade using LM [12, 13]. Overall concordance between VM and LM was moderate (unweighted κ = 0.51). In their study, Rakha et al. showed that VM tends to downgrade tumors, a finding that they attributed to a relatively reduced ability to identify MCs on the screen, which in part could be due to scanning at ×20 magnification [12, 13]. However, although we and others scanned slides at ×40 magnification, concordance for the mitotic rate remained low. This is an interesting observation that may be related to the inability to assess different focal planes on VM; however, it requires further consideration in future studies. While beyond the scope of this study, VM lends itself well to the use of artificial intelligence (AI) programs such as mitotic recognition software, which may be useful in the future as a means to improve concordance in mitotic scoring [33,34,35].
This assertion is supported by a recent study that demonstrated improved accuracy, precision, and sensitivity of mitosis counting by pathologists at all levels of experience with the assistance of AI software [36]. This certainly deserves further study. As there was no attempt to guide reviewers to a single designated area on the slide, it is also possible that some interobserver disagreement could be due to differences in the participating pathologists' selection of the optimal area for MCs. Since the Fixed Size and Freehand Annotation methods for determining the mitotic rate were equally split amongst pathologists, it seems less likely that the method used influenced variability.

Data on the impact of interobserver variability in breast cancer grading on AJCC prognostic staging are limited [14]. One study showed that of 100 cases, discordance resulted in differences in prognostic staging in 25 and 29 cases during two rounds of scoring, for an average rate of prognostic stage change of 27%. In both rounds, a change from stage IA to IB was the most common (18 and 21 cases, respectively). Less frequently, changes from IA to IIA, IB to IIA, IB to IIB, and IIIB to IIIC were also observed [14]. We too found that discordant grading amongst pathologists led to changes in prognostic staging, at a rate of 19.8%. Similarly, we most frequently noted changes from IA to IB (10.4%) and fewer cases of IB to IIA (3.5%), IB to IIB (3.5%), and IIIA to IIIB (2.3%). While we found that changes in prognostic staging were limited to HR-positive, HER2-negative/equivocal tumors in this cohort, there are circumstances in which grading discrepancies can result in alterations in prognostic staging for triple-negative tumors (e.g., a change from IB to IA in a triple-negative tumor graded 2–3 versus 1). While we did not observe alterations in prognostic stage due to grading discrepancies in triple-negative breast carcinoma, the limited number of such cases in our cohort (n = 10) likely contributed to this finding, which ought to be confirmed in a larger cohort. Finally, in contrast, HR-negative, HER2-positive tumors are not susceptible to prognostic staging changes based on grading discrepancies. While the Rabe et al. study was unable to evaluate the impact of Oncotype DX testing on prognostic staging in discordant cases, we found that for two cases where the discrepancy in grade resulted in a change from prognostic stage group IB to IA, Oncotype DX results would have also downgraded these cases to IA. We only had Oncotype DX results for 8 of 17 cases with grading discrepancies that resulted in changes in prognostic stage groups.
While Oncotype DX results may ultimately be used to determine the prognostic stage regardless of grade in some cases, we must acknowledge that Oncotype DX and grading are two different tools used for prognostication, and one cannot be used to mitigate the variability of the other. Additional studies would be necessary to provide more insight into the clinical significance of Oncotype DX results in cases with discrepancies in grading.

We were unable to determine the impact of work setting (academic and nonacademic), type of microscope used (conventional LM, or DM plus conventional LM), and time dedicated to breast pathology on grading variability due to the similarities in practice amongst the participating pathologists. We also lacked the statistical power to evaluate other potential confounders, such as differences in interobserver variability between pathologists who graded based on special type (i.e., cribriform, tubular, lobular, etc.) and those who did not, or differences related to the habit of reporting nuclear grade in cases with heterogeneity. As we did not require pathologists to save the annotations used for determining mitotic rates, we are unable to determine whether area selection influenced discordance in this parameter. We recognize that pathologists in our study were split in their approach to scoring nuclear grade in cases of heterogeneity, grading tumors of special type, and the approach used to determine the mitotic rate by VM. This suggests that further clarification regarding standardization of histologic grading in these settings, particularly when grading using VM, would be beneficial to the pathology community at large and requires further study.

Our cohort was biased toward HR-positive, HER2-negative tumors, and there was a paucity of HR-negative tumors. This bias may have resulted in increased variability because HR-positive, HER2-negative tumors are commonly graded as grade 2 and also limited our ability to evaluate the effect of grading discordance on prognostic staging in the triple-negative breast carcinoma subtype.

Using VM, a multi-institutional cohort of pathologists showed moderate concordance for breast cancer grading, a finding similar to that seen in studies using LM. Agreement was best at the extremes of grade and for the evaluation of TF. How VM influences variability in the mitotic rate remains to be elucidated. The clinical relevance of grading discrepancies for prognostic staging, and the impact of Oncotype DX results in determining the prognostic stage in cases with grading discrepancies, require further study.