Introduction

Ductal carcinoma in situ (DCIS), a nonobligate precursor of invasive breast cancer, is a health problem worldwide [1]. The introduction of breast screening programmes in Western countries resulted in a significant rise of its diagnosis with a current incidence of ~20% of all breast cancers [2,3,4]. Prior to those screening programmes, symptomatic DCIS accounted for around 1% of all breast cancer diagnoses [3]. Little is known about its natural history, and hence its appropriate management [5].

Many screen detected DCIS patients are likely overtreated, as a consequence of identifying low risk DCIS cases that would not otherwise be diagnosed outside the screening setting. To address this issue, four randomized clinical trials (LORIS, LORD, COMET, and LARRIKIN) are currently ongoing to investigate the potential noninferiority of active surveillance for patients with a biopsy diagnosis of low risk DCIS by comparing watchful waiting with the current surgical standard of care [6]. The single-arm Japanese LORETTA trial will investigate the value of endocrine therapy without surgery for estrogen receptor-positive, HER2-negative low risk DCIS [7]. The definitions of “low risk” DCIS vary slightly among these trials, but all take into account the degree of nuclear atypia (i.e., DCIS grade) [5,6,7]. This inclusion criterion signifies a major challenge, since previous studies have shown that histopathological assessment of nuclear grade is characterized by substantial interobserver variability, regardless of the grading system used [8,9,10,11,12].

This variability is not restricted to study settings, as a nationwide evaluation of DCIS grading in The Netherlands revealed significant variation among different laboratories [13]. Similar variations at the population level were reported for grading of invasive breast cancers, and the presence of borderline features for grade resulting in discordant grades was associated with decreased disease-free survival [14,15,16]. Although the inclusion criteria of the aforementioned noninferiority trials encompass strict theoretical definitions of “low risk” DCIS [6], the actual implementation of these definitions will probably result in variable eligibility rates among different laboratories, should their inclusion criteria be generalized to routine practice. The LORIS trial aims to reduce this variation by performing central review of all cases diagnosed as DCIS [17]. While this approach applies uniform histological criteria for confirming/refusing patient eligibility, it cannot be generalized to routine practice, where biopsies are often evaluated by only one or two pathologists.

We have previously shown that pathologists are better at distinguishing grade 2 from grade 3 DCIS than distinguishing grade 1 from grade 2 DCIS [18]. Similar differences between different cutoffs were noted upon post hoc dichotomization of the multicategorical assessment of comedonecrosis, stromal inflammation, and stromal architecture [18]. We hypothesized that ad hoc dichotomization of multicategorical histopathological features according to the “ideal” cutoff might result in acceptable degrees of interobserver variability. The goal of the current study was therefore to perform upfront dichotomous histopathological assessment of DCIS and to explore the degree of interobserver variability. In addition, we aimed to assess pathologists’ concordance in quantifying tumor-infiltrating lymphocytes as a percentage and determine the cutoff for dichotomization characterized by the highest interobserver agreement.

Materials and methods

Patient samples

A consecutive series of DCIS was selected, based on organ (breast) and lesion codes in the electronic histopathological reports (LIS DaVinci, MIPS, Ghent, Belgium). All patients underwent breast-conserving surgery or mastectomy for DCIS between January 1, 2014 and August 31, 2018 at the Cliniques universitaires Saint-Luc (Brussels, Belgium). Needle biopsies and vacuum-assisted core biopsies were excluded. DCIS with associated microinvasive foci (≤1 mm) or frankly invasive carcinoma (invasive component >1 mm) were excluded. Hematoxylin and eosin stained slides were retrieved from the archives of the Department of Pathology (Cliniques universitaires Saint-Luc) and reviewed by one pathologist (HD), who selected one representative slide for each lesion.

Slides containing the biopsy site were avoided, as biopsy reactions hamper the assessment of myxoid stromal periductal changes and stromal inflammation. Resection specimens with limited amounts of residual DCIS (i.e., one duct with DCIS) were excluded from this study. Selected slides were scanned by an automated slide scanner with Z-stack feature (NanoZoomer 2.0-RS, Hamamatsu Photonics K.K., Hamamatsu City, Japan). Digital images were available on a password-protected online platform (DIH, Leica Biosystems, Dublin, Ireland). This study was approved by the Ethics Committee of the Cliniques universitaires Saint-Luc (2018/21NOV/443).

Participants

All participating pathologists (pseudonyms P1–P39) had to meet the following criteria: (1) being a board-certified pathologist with a special interest in breast disease or equivalent; (2) actively working as a pathologist, either in an academic or nonacademic pathology laboratory, or both; and (3) assessing at least fifty primary oncologic breast cancer resection specimens per year, according to the EUSOMA criteria for dedicated breast pathologists [19].

DCISion setup

All participants were invited to complete the DCISion questionnaire (Supplementary Fig. 1), which was partially based on the study of van Dooijeweert et al. [13]. Eleven questions were used to assess the experience (number of years in practice), work environment (academic and/or nonacademic laboratory), daily work method (conventional light microscopy and/or digital pathology), weekly amount of time dedicated to breast pathology, the system used for DCIS grading, and the habit of routinely reporting specific histopathological features. No training set was used. Instead, all pathologists were provided with the study protocol, which contained written definitions for each histopathological feature (Supplementary Fig. 2), the DCISion poster with exemplary photographs based on the previous study (Supplementary Fig. 3), and the relevant literature concerning the applied histopathological definitions [20, 21]. All participants received a log-in and password, which allowed access to the digital slides for four months. Scores were entered in an Excel template. Completion of the informed consent form was a prerequisite for subsequent data processing.

Definitions for dichotomous histopathological assessment

Eight histopathological features were assessed dichotomously, based on previously determined cutoffs, and were illustrated in the DCISion poster [18]. Nuclear grade was assessed as nonhigh versus high grade, after adaptation of the American Society of Clinical Oncology/College of American Pathologists protocol for examination of DCIS specimens [22]. While the architectural types include solid, cribriform, papillary, and micropapillary growth [23], in the current study, DCIS architecture was assessed as predominantly solid (≥50% solid growth) or predominantly nonsolid (<50% solid growth), as previously described [18, 24]. Both DCIS architecture and nuclear grade were assessed regardless of the presence or absence of necrosis. Necrosis was classified into two categories as previously described: no or single cell necrosis versus any amount of comedonecrosis [18]. Comedonecrosis was defined by areas of confluent dirty necrosis, i.e., confluent eosinophilic material, often containing ghost cells and karyorrhectic debris, and easily detected at low magnification. Intraductal calcifications within the DCIS were scored as present or absent, regardless of the size and the number of ducts with calcifications.

The architecture of the periductal stroma was recorded as either sclerotic or myxoid. Sclerotic stroma resembles the regular fibrous mammary stroma and consists of regularly arranged collagen fibers. Myxoid stroma was defined as loosely arranged collagen fibers, often interspersed with an amorphous, slightly basophilic substance, as illustrated in the DCISion poster and in previous reports [18, 25, 26]. Stromal architecture was divided into two categories (<33% or ≥33% of ducts surrounded by myxoid stroma), and preferentially assessed at low magnification. If the DCIS was located within adipose tissue without any surrounding fibrous tissue, the case was considered as predominantly sclerotic. Lobular cancerization was defined as the presence of DCIS tumor cells within breast lobules, with preservation of the normal lobular architecture. Lobular cancerization was assessed as either absent or present in the digital slide, regardless of its extent.

The presence and extent of chronic inflammatory infiltrates in the periductal stroma (regardless of its architecture) were recorded in a semiquantitative manner as previously described and were preferentially assessed at low magnification, distant from the biopsy site (if present) [18, 24, 25, 27]. Low stromal inflammation was defined as periductal stroma that was not infiltrated by lymphocytes, or that was infiltrated by few loosely arranged lymphocytes with apparent intervening stroma [27]. High stromal inflammation was defined as periductal stroma containing a chronic inflammatory infiltrate that consisted of at least one lymphoid aggregate (i.e., any infiltrate that consists of lymphocytes abutting one another without intervening collagenous stroma) [27]. Lymphoid follicle formation could be present but was not a prerequisite. Assessment of the stromal architecture could be hampered by the density of the inflammatory infiltrate.

Assessment of tumor-infiltrating lymphocytes

Tumor-infiltrating lymphocytes were assessed according to the standardized method proposed by the International Immuno-oncology Biomarkers Working Group [20, 21, 28]. The “supplementary Fig. 1” of Pruneri et al. was used as a visual aid during assessment [29]. Tumor-infiltrating lymphocytes percentages signified the percentage of lymphocytes related to the total periductal stromal surface area, which served as a denominator [20, 28]. The percentage of tumor-infiltrating lymphocytes was assessed without considering the score for “stromal inflammation”. Participants were asked to provide an average percentage for tumor-infiltrating lymphocytes surrounding all ducts affected by DCIS in that particular slide [28]. Hotspots were not taken into account. Participants were also asked to assess tumor-infiltrating lymphocytes dichotomously, using an upfront cutoff (<50% versus ≥50%), as previously described by Pruneri et al. [29].

Statistical analysis

Statistical analyses were performed with IBM SPSS statistics 25.0 (IBM, Chicago, IL, USA) as previously reported [18]. Pie charts were constructed in Excel (Excel Windows 10, Microsoft Corporation, Redmond, WA, USA). The arithmetic mean for each histopathological characteristic was calculated per lesion, to evaluate the distribution of each characteristic within the DCIS cohort. The mean corresponds to the most commonly assigned category for a specific characteristic per lesion. Percentages of absolute agreement were determined, signifying the number of lesions with 100% concordance. Krippendorff’s alpha (KA) reliability estimates were calculated using the “Kalpha” macro provided by Hayes and Krippendorff (http://afhayes.com/spss-sas-and-mplus-macros-and-code.html). This macro was loaded into SPSS to compute KA for all dichotomous data to investigate overall interobserver variability per characteristic [30, 31]. The number of bootstrap samples was set at 10,000.
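
To make the KA statistic concrete, the following is a minimal Python sketch of Krippendorff's alpha for nominal (here dichotomous) scores. It is not the "Kalpha" SPSS macro used in this study: the function name and the toy data are assumptions made for illustration, and the bootstrap confidence intervals computed by the macro are omitted.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(ratings):
    """Krippendorff's alpha for nominal data.

    ratings: one list per lesion, holding one label per rater
             (None marks a missing score).
    """
    coincidences = Counter()
    for unit in ratings:
        values = [v for v in unit if v is not None]
        m = len(values)
        if m < 2:
            continue  # a lesion scored by fewer than two raters is not pairable
        # Every ordered pair of scores from two different raters, weighted by 1/(m - 1).
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one category is used")
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Hypothetical toy data: 4 lesions scored 0/1 by 3 raters.
    toy = [[1, 1, 1], [0, 0, 0], [1, 0, 1], [0, 0, 1]]
    print(round(krippendorff_alpha_nominal(toy), 3))
```

With 39 raters and no missing scores, each lesion simply contributes all of its rater pairs; the same function applies unchanged.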

The intraclass correlation coefficient was calculated for tumor-infiltrating lymphocytes assessed as a percentage and was interpreted according to Koo and Li [32]. Intraclass correlation coefficient settings were: two-way mixed, single measures, absolute agreement. Cohen’s kappa (κ) values were calculated for all dichotomous variables for each observer duo (i.e., 741 kappa values for each dichotomous histopathological characteristic). The kappa’s distribution was visualized by box-and-whisker plots. Interpretation of the kappa values was performed according to Landis and Koch [33]. Pearson’s and Spearman’s correlation tests were performed when appropriate, to investigate correlations between the degree of interobserver variability and any possible confounder mentioned in the DCISion questionnaire. Multiple linear regression analysis was performed to correct for multiple potential confounders. All tests were two-sided. The statistical significance level was set at 0.05.
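
The pairwise analysis can be sketched as follows in Python. This is an illustration under stated assumptions, not the SPSS procedure used in the study: the function names are hypothetical, and the category labels follow the Landis and Koch bands. With 39 observers, the number of duos is 39 × 38 / 2 = 741.

```python
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two raters scoring the same lesions (nominal labels)."""
    if len(a) != len(b) or not a:
        raise ValueError("both raters must score the same, non-empty set of lesions")
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal category frequencies.
    p_expected = sum((a.count(c) / n) * (b.count(c) / n) for c in set(a) | set(b))
    if p_expected == 1.0:
        return float("nan")  # undefined when both raters use one identical category
    return (p_observed - p_expected) / (1.0 - p_expected)


def pairwise_kappas(scores_by_rater):
    """Kappa for every unordered rater duo; keys are (rater_1, rater_2) tuples."""
    return {
        (r1, r2): cohen_kappa(scores_by_rater[r1], scores_by_rater[r2])
        for r1, r2 in combinations(sorted(scores_by_rater), 2)
    }


def landis_koch(kappa):
    """Verbal label for a kappa value according to the Landis and Koch bands."""
    if kappa != kappa:  # NaN check
        return "undefined"
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


if __name__ == "__main__":
    # Hypothetical scores of three raters on five lesions (1 = high, 0 = nonhigh).
    scores = {"P1": [1, 1, 0, 0, 1], "P2": [1, 0, 0, 0, 1], "P3": [1, 1, 0, 1, 1]}
    for duo, kappa in pairwise_kappas(scores).items():
        print(duo, round(kappa, 3), landis_koch(kappa))
```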

Results

Characteristics of the DCISion participants

In total, 47 pathologists were invited to participate in this study. Thirty-nine pathologists (83%) from nine different countries representing three continents and thirty different laboratories responded. The DCISion questionnaire was completed by 38 pathologists (97%); one participant had skipped four questions regarding habits of reporting (Table 1). The participants had been practicing as certified pathologists for 13.7 years on average (range 1–30 years, excluding the years of training). Twenty-five pathologists (64%) work in an academic laboratory, twelve pathologists (31%) work in a nonacademic laboratory and two pathologists (5%) work in both an academic and nonacademic laboratory. The average weekly time dedicated to breast pathology amounts to less than 1 day for seven participants (18%), one to 2 days for ten participants (26%), and 2–3 days for four participants (10%). Eight participants (20%) spend between 3 and 4 days on breast pathology, and ten participants spend more than four days on breast pathology (26%). Three participants (8%) use both conventional and digital microscopy in daily practice; thirty-six participants (92%) only use conventional light microscopy on a daily basis.

Table 1 Distribution of the answers of the 39 participating pathologists regarding potential confounders that might influence the degree of interobserver variability

Case selection

Since a previous report has shown that histopathological diagnosis based on a single slide is characterized by substantial interobserver variability regarding the distinction between benign breast disease, DCIS and (micro-)invasive breast cancer [34], a consecutive series of 149 slides encoded as pure DCIS was provided to all participants. Upon completion of the digital histopathological assessment, all the participants’ comments regarding the eligibility of the DCIS lesions were gathered and reviewed together with the digital images, the original glass slides, and the available archived immunohistochemically stained slides (performed by HD, CG, and MRVB). In total, twenty-one lesions (14%) were removed from the series because of: (1) insufficient amount of DCIS (i.e., one single duct) in one case; (2) insufficient size (<2 mm), favoring a diagnosis of atypical ductal hyperplasia, in six cases; (3) presence of microinvasive carcinoma in six cases; (4) presence of a 2 mm focus of invasive lobular carcinoma in one case; (5) lesions falling short of being designated DCIS, favoring a diagnosis of usual ductal hyperplasia, flat epithelial atypia or apocrine atypia in seven cases. Eventually, 128 unequivocal pure DCIS (86%) remained for the final analyses. This method aimed to prevent the introduction of a selection bias, which was likely to occur upon removal of specific cases judged by a single pathologist [34].

Absolute agreement

The mean histopathological scores were calculated and designated Px (Fig. 1a). Absolute agreement reflects the number of cases that were rated identically by all pathologists (100% agreement) or all but one pathologist (97% agreement; Table 2). Absolute agreement was lowest for nuclear atypia and lobular cancerization (6% and 7%, respectively). Dichotomously assessed tumor-infiltrating lymphocytes showed the highest absolute agreement (59%). Of note, the high concordance was due to the rarity of DCIS cases with ≥50% tumor-infiltrating lymphocytes (i.e., 12% of the cohort on average), resulting in high agreement on the cases with <50% tumor-infiltrating lymphocytes and no agreement at all on the cases with ≥50% tumor-infiltrating lymphocytes.
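
The two agreement levels can be illustrated with a short Python sketch. The function name and the toy scores are assumptions; the only figure taken from the text is that 38 of 39 concordant raters corresponds to roughly 97% agreement (38/39 ≈ 0.974).

```python
from collections import Counter


def absolute_agreement(ratings, max_dissenters=0):
    """Fraction of lesions on which all raters, except at most
    `max_dissenters` of them, assigned the same category.

    ratings: one list per lesion, holding one label per rater.
    """
    if not ratings:
        raise ValueError("at least one lesion is required")
    agreeing = 0
    for unit in ratings:
        largest_group = max(Counter(unit).values())
        if len(unit) - largest_group <= max_dissenters:
            agreeing += 1
    return agreeing / len(ratings)


if __name__ == "__main__":
    # Hypothetical toy data: 4 lesions scored by 4 raters.
    toy = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0], [0, 0, 0, 0]]
    print(absolute_agreement(toy))                    # unanimous lesions only
    print(absolute_agreement(toy, max_dissenters=1))  # all but one rater
```

Because this measure ignores chance agreement, a strongly unbalanced category (such as the rare ≥50% tumor-infiltrating lymphocytes class) can inflate it, which is why chance-corrected statistics are reported alongside it.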

Fig. 1 a Pie charts illustrating the histopathological assessment of the “average pathologist” (Px) as a percentage, based on the arithmetic mean of all pathologists’ scores per dichotomously assessed histopathological feature. KA signifies the Krippendorff’s alpha statistic per feature. b Box-and-whisker plots illustrating the differences in the distribution in Cohen’s kappa values per pathologist duo per histopathological feature. Circles represent outliers. DCIS ductal carcinoma in situ, TILs tumor-infiltrating lymphocytes

Table 2 Absolute agreement among pathologists regarding the evaluation of eight different histopathological features in ductal carcinoma in situ of the breast

Overall interobserver variability

KA was the lowest for lobular cancerization (KA = 0.396), and slightly higher for stromal architecture (KA = 0.450) and nuclear atypia (KA = 0.422). The KAs were fairly high for stromal inflammation (KA = 0.564), dichotomously assessed tumor-infiltrating lymphocytes (KA = 0.520), and comedonecrosis (KA = 0.539). The highest KAs were observed for solid ductal carcinoma in situ architecture (KA = 0.602) and presence of intraductal calcifications (KA = 0.676).

Comparison of pathologist duos

As KA does not permit detailed evaluation of interobserver variability among different pathologists, Cohen’s kappa values were determined for each pathologist duo (Supplementary Tables 1–8). These kappa values confirmed that evaluation of calcifications and lobular cancerization was characterized by the highest and lowest concordance, respectively (Table 3). The narrowest interquartile range was observed for evaluation of calcifications, whereas stromal inflammation showed the highest dispersion of kappa values (Fig. 1b).

Table 3 Descriptive statistics for the Cohen’s kappa values among 39 pathologists per histopathological feature of ductal carcinoma in situ of the breast

Comparison of different cutoffs for tumor-infiltrating lymphocytes

The intraclass correlation coefficient was calculated for tumor-infiltrating lymphocytes assessed as a percentage, showing good overall agreement with an average of 0.821 (range 0.566–0.933; Supplementary Table 10). Since intraclass correlation coefficient and KA are different statistical measures, their values cannot be mutually compared. Therefore, all participants were asked to rate stromal tumor-infiltrating lymphocytes ad hoc as a dichotomous variable by using a cutoff at 50% [29]. The tumor-infiltrating lymphocytes percentages were also dichotomized post hoc according to different cutoffs with 10% increments (10, 20, 30, and 40%). The number of cases with average low and high tumor-infiltrating lymphocytes according to each cutoff was compared with the number of DCIS presenting with low and high “stromal inflammation” (Fig. 2), resulting in the following kappa values: 0.881 (10% cutoff), 0.837 (20% cutoff), 0.637 (30% cutoff), 0.437 (40% cutoff), and 0.245 (50% cutoff).
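
For readers who wish to reproduce the intraclass correlation coefficient outside SPSS, the following Python sketch computes the single-measures, absolute-agreement coefficient from a two-way layout (lesions in rows, raters in columns). The function names are assumptions, complete data are assumed, and the verbal bands follow Koo and Li (below 0.50 poor, 0.50–0.75 moderate, 0.75–0.90 good, above 0.90 excellent).

```python
def icc_absolute_single(table):
    """Two-way, single-measures, absolute-agreement ICC.

    table: one row per lesion, one column per rater (no missing values).
    """
    n, k = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n * k)
    row_means = [sum(row) / k for row in table]
    col_means = [sum(table[i][j] for i in range(n)) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                # between-lesion mean square
    ms_cols = ss_cols / (k - 1)                # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual mean square
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def koo_li(icc):
    """Verbal label for an ICC value according to the Koo and Li bands."""
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.90:
        return "good"
    return "excellent"


if __name__ == "__main__":
    # Hypothetical percentages from three raters on four lesions.
    tils = [[5, 10, 5], [20, 30, 25], [60, 50, 70], [10, 10, 15]]
    value = icc_absolute_single(tils)
    print(round(value, 3), koo_li(value))
```

Under this definition, systematic differences between raters (one rater scoring consistently higher than another) lower the coefficient, which is what distinguishes absolute agreement from consistency.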

Fig. 2 Bar charts (a–e) and ROC-curve (f) illustrating the interrelationship between semiquantitatively assessed stromal inflammation versus dichotomized tumor-infiltrating lymphocytes (TILs) assessed as a percentage in ductal carcinoma in situ (DCIS). The agreement between both parameters depends upon the cutoff used: Cohen’s kappa values (κ) are mentioned for a threshold of 10% (a), 20% (b), 30% (c), 40% (d), or 50% (e). An ROC-curve (f) illustrates this interrelationship for each pathologist separately

As for the other dichotomously assessed histopathological features, KAs were calculated for tumor-infiltrating lymphocytes according to the different cutoffs: the KA amounted to 0.512 for the 10% cutoff, 0.669 for the 20% cutoff, 0.641 for the 30% cutoff, 0.604 for the 40% cutoff, and 0.520 for the 50% cutoff. The kappa values were calculated for each cutoff per pathologist duo, which confirmed that the highest level of concordance was observed for the 20% cutoff (Fig. 3).
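
Post hoc dichotomization at several cutoffs can be sketched in Python as below. The percentages are hypothetical, the helper names are assumptions, and a score equal to the cutoff is counted as high, in line with the "<50% versus ≥50%" convention used for the upfront assessment. The two binary vectors being compared may be two pathologists, or dichotomized tumor-infiltrating lymphocytes versus stromal inflammation.

```python
def dichotomize(percentages, cutoff):
    """Return 1 (high) for scores at or above the cutoff, otherwise 0 (low)."""
    return [1 if p >= cutoff else 0 for p in percentages]


def binary_kappa(a, b):
    """Cohen's kappa for two equally long 0/1 vectors."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    p_expected = pa * pb + (1 - pa) * (1 - pb)
    if p_expected == 1.0:
        return float("nan")  # both vectors constant and identical
    return (p_observed - p_expected) / (1 - p_expected)


def kappa_per_cutoff(scores_a, scores_b, cutoffs=(10, 20, 30, 40, 50)):
    """Dichotomize both raters' percentages at each cutoff and compare them."""
    return {
        c: binary_kappa(dichotomize(scores_a, c), dichotomize(scores_b, c))
        for c in cutoffs
    }


if __name__ == "__main__":
    # Hypothetical percentages from two raters on four lesions.
    rater_a = [5, 15, 25, 60]
    rater_b = [10, 10, 35, 45]
    for cutoff, kappa in kappa_per_cutoff(rater_a, rater_b).items():
        print(cutoff, kappa)
```

The toy output illustrates why the choice of cutoff matters: two raters whose percentages differ by only a few points can agree perfectly at one threshold and not at all (after chance correction) at another.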

Fig. 3 Box-and-whisker plots illustrating the differences in the distribution in Cohen’s kappa values per pathologist duo per cutoff for dichotomized tumor-infiltrating lymphocytes (TILs). The 20% and 50% cutoffs for tumor-infiltrating lymphocytes assessed as a percentage showed the lowest and highest interobserver variability, respectively. Circles represent outliers; asterisks represent extremes

Concordance for a combined risk score

Previously, a combined risk score for DCIS, based on dichotomously assessed nuclear atypia, stromal architecture and stromal inflammation, was shown to be associated with recurrence risk after breast-conserving surgery. We calculated the combined risk score for each participant, based on the available individual histopathological characteristics: DCIS with a combination of high nuclear grade, predominantly myxoid stromal architecture and high stromal inflammation were considered to have a high combined risk score. When any of these features was lacking in a lesion, it was considered to have a low combined risk score. The overall agreement for this post hoc determined dichotomous variable was rather low (KA = 0.482). The kappa values were calculated per pathologist duo (Supplementary Table 9). Descriptive statistics for the kappa values of the combined risk score assessment are shown in Table 3.
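
The rule described above amounts to a logical AND over the three dichotomous features. A minimal Python sketch, with hypothetical class and field names, could look like this:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LesionScore:
    """One pathologist's dichotomous scores for one lesion (names are illustrative)."""
    high_nuclear_grade: bool
    myxoid_stroma: bool              # scored as predominantly myxoid
    high_stromal_inflammation: bool


def combined_risk(score: LesionScore) -> str:
    """'high' only when all three adverse features are present, otherwise 'low'."""
    all_present = (
        score.high_nuclear_grade
        and score.myxoid_stroma
        and score.high_stromal_inflammation
    )
    return "high" if all_present else "low"


if __name__ == "__main__":
    print(combined_risk(LesionScore(True, True, True)))
    print(combined_risk(LesionScore(True, False, True)))
```

Since the combined score is high only when all three features are scored high, a disagreement on any single feature can flip the combined category, which may partly explain the modest agreement on the combined score.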

Confounders

The influence of potential confounders on the (dis)agreement among pathologists was investigated, including: experience, time dedicated to breast pathology, work environment, classification system used for DCIS grading (for nuclear atypia only), and habit of reporting the characteristic of interest. No significant associations were observed (p > 0.05), except for two. Interobserver variability for lobular cancerization was significantly lower for pathologists claiming to always mention this feature, and highest for pathologists stating never to report this feature (p = 0.014). This observation was independent of the laboratory environment, experience and time dedicated to breast pathology. The degree of concordance for stromal inflammation was significantly higher for pathologists with more time dedicated to breast pathology (p = 0.031), independent of the laboratory environment and experience. This association was not observed for tumor-infiltrating lymphocytes assessment. The influence of a “center effect” (i.e., pathologists working together in one center might show higher concordance) was not investigated as the number of colleagues was too low to allow sufficiently powered statistical analysis.

Discussion

Grading systems for DCIS classify these lesions in three categories, analogous to the Nottingham grading system for invasive breast cancer [35,36,37,38]. Previous studies showed that DCIS grading is characterized by substantial interobserver variability, regardless of the grading system used [8,9,10,11,12, 39]. The current ad hoc dichotomization as nonhigh versus high nuclear grade resulted in moderate agreement with an average kappa of 0.430, which is lower than the kappa values previously reported for post hoc dichotomization of DCIS grade (0.55 by Rakha et al. and 0.53 by Van Bockstal et al.) [18, 40]. Nuclear atypia represents a biological spectrum, ranging from monotonous, slightly atypical nuclei to extreme pleomorphism. The extremes of this spectrum are easily assessed, but a large gray zone exists in between. Heterogeneous morphology throughout a single DCIS lesion can further hamper the adequacy and reproducibility of morphological assessment. Molecular and genomic studies provided evidence for a two-tier (low grade versus high grade) pathway in breast cancer development, wherein morphologic grade 2 DCIS cluster with either morphologic grade 1 or grade 3 DCIS [41,42,43]. Two-tier classification systems for morphological evaluation of dysplasia are currently common practice in other organ systems, such as gastrointestinal epithelial neoplasia, cervical and vulvar squamous intraepithelial lesions [44, 45]. These classifications improved interrater concordance [44, 45]. Logically, the discordance among observers is generally larger when the number of available categories increases. We therefore hypothesized that two-tier assessment might improve the current interobserver variability in DCIS grading.

We previously examined the interobserver variability among 13 pathologists for multicategorical histopathological features and reported the “ideal” cutoffs for two-tier assessment after post hoc dichotomization [18]. The prognostic value of this two-tier assessment was subsequently investigated in a cohort of 211 DCIS patients, wherein high grade nuclear atypia, high stromal inflammation and myxoid stromal architecture were associated with increased overall recurrence risk after breast-conserving surgery [27].

The DCISion study applied these binary cutoffs upfront, which resulted in overall moderate agreement for all individual histopathological features, as well as for the post hoc combined risk score assessment (i.e., average kappa values between 0.41 and 0.60). Despite dichotomization, the interobserver variability remains considerable for all features except intraductal calcifications. Of note, the current study setting lacked some essential features of the “real-life setting” of histopathological diagnosis, e.g., immunohistochemistry, deeper levels and multiple tissue blocks. In addition, only three pathologists reported that they used digital pathology on a regular basis. These factors might have negatively influenced the degree of interobserver variability. Nevertheless, the observed discordance might explain the variable prognostic power of different histopathological features in previous reports. For example, some large retrospective study cohorts and randomized trials could not confirm the association between high nuclear grade and increased recurrence risk [46,47,48,49]. We were unable to investigate the impact of interobserver variability on the recurrence risk stratification in this cohort, as most patients were recently diagnosed and adequate follow-up data were not available. However, it is likely that interobserver variability has an effect on the association with recurrence risk, i.e., some pathologists’ scores might be associated with recurrence after breast-conserving surgery, and others might not. We aim to investigate this in the future, when the median follow-up time of this cohort reaches at least 5 years. Such a large-scale study would enable the definition of an allowed margin of error that is associated with a limited degree of discordance among pathologists, without significantly affecting recurrence risk stratification.

Robust prognostic histopathological markers require reproducibility of their assessment. Although features assessed as a continuous measure (such as stromal tumor-infiltrating lymphocytes) show overall acceptable concordance rates, clinical decision making is often dichotomous (e.g., to treat or not to treat) and therefore generally requires a particular cutoff. For example, positive or negative hormone receptor status and HER2 status in invasive breast cancer will determine eligibility for (neo)adjuvant hormonal treatment and HER2-targeted therapy. DCIS patients who undergo breast-conserving surgery are often treated with adjuvant radiotherapy, and in some countries also with selective estrogen receptor modulators or aromatase inhibitors. The choice for adjuvant treatment is mainly led by the recurrence risk, as up to 30% of nonirradiated DCIS patients will develop a loco-regional recurrence afterwards [50]. To date, margin size and histopathological features remain the cornerstone for recurrence risk stratification, although molecular tests such as Oncotype DX® DCIS (Genomic Health, Redwood City, CA, USA) are emerging [51]. It is therefore important that pathologists speak the same language, to ensure that patients are treated in a consistent manner throughout different hospitals.

The DCISion study illustrates that the current histopathological evaluation, though acceptable, should improve. It is challenging to identify the precise causes of the observed disagreement. As mentioned above, discordance might have partly been induced by the use of digital histological images, since the majority of the participants commonly use glass slides in daily practice. In general, diagnostic concordance between glass slides and digital slides is reported as high, but these studies usually investigate intraobserver equivalency [52, 53]. However, interobserver variability using glass slides will probably result in interobserver variability using digital slides. Thus, the use of digital slides cannot entirely explain the observed discordance in the DCISion study. Intra-tumor heterogeneity, differences in selected regions of interest and different interpretations of the provided definitions may also account for increased discordance rates. The current study was limited to the evaluation of morphological features in a single H&E slide per lesion. Multiple H&E slides, additional levels and/or immunohistochemical markers might increase the degree of concordance in DCIS grading. Future studies should explore whether there is a role for hormone receptor status and HER2 status in DCIS risk stratification, as well as for immunohistochemical characterization of the stromal immune response.

Reproducibility and robustness of assessment are of particular importance for inclusion in the ongoing randomized trials that investigate noninferiority of active surveillance. In these trials, risk assessment and eligibility are based on histopathological morphological features, although the LARRIKIN, COMET, and LORETTA trials also take into account hormone receptor and HER2 status [6, 7]. It will take several years before these trials’ findings are translated to routine clinical practice, but it is likely that watchful waiting will become a valid “treatment” option. Deep learning algorithms or so-called artificial intelligence might offer a solution to cope with the interobserver variability in histopathological assessment, but we should be careful not to introduce interobserver variability into these deep learning networks. If all the available deep learning algorithms are initially taught by only one or two pathologists, we might end up with similar interobserver variability in artificial intelligence. It would be of interest to use the results of a multiheaded panel evaluation to enable the creation of a robust deep learning algorithm. The current DCISion study cohort might therefore be of interest, as it has been assessed by 39 different observers, representing thirty different laboratories. We aim to train an algorithm based on the DCIS cases with 100% absolute agreement for a particular feature, and to explore its prognostic power in an independent DCIS patient cohort with available follow-up data.

One limitation of the current study is the lack of a gold standard, since nobody knows who is “right” or “wrong” regarding the DCIS cases with total lack of absolute agreement for a specific characteristic, such as the lesion that was regarded as nonhigh grade by 19 participants and as high grade by 20 other participants (Supplementary Fig. 4). A deep learning algorithm trained by using cases with 100% absolute agreement for a particular characteristic might act as an artificial reference. This study also highlights that poor concordance (i.e., low kappa values) between two observers does not give any information on who is “right” and who is “wrong.” It only indicates that this particular feature has been perceived differently by two observers, and this poor concordance might be due to unclear definitions of the investigated characteristics. For instance, Harrison et al. recently showed that the threshold for comedonecrosis is highly variable, even among experienced breast pathologists [54]. Discussion and the development of national/international guidance on how to define or redefine particular histopathological features, including their applied cutoffs, should rank highly when setting a research agenda for DCIS.

This is of particular interest for stromal inflammation or tumor-infiltrating lymphocytes. The International Immuno-Oncology Biomarker Working Group proposed to quantify tumor-infiltrating lymphocytes in as much detail as possible, i.e., as a continuous variable by using percentages [28]. This method allows in-depth analysis of potentially clinically relevant cutoffs. In the DCISion study, we asked all participants to quantify tumor-infiltrating lymphocytes both dichotomously and as a percentage by using previously published photographs as a visual aid (such as Supplementary Fig. S1 of Pruneri et al.) [20, 21, 29]. By evaluating cutoffs in 10% increments, we identified the 20% tumor-infiltrating lymphocytes cutoff as the one associated with the highest interobserver agreement. However, the 10% tumor-infiltrating lymphocytes cutoff corresponded best with semiquantitative assessment of stromal inflammation, implying that most DCIS with ≥10% stromal tumor-infiltrating lymphocytes present at least one dense aggregate of lymphocytes (designated “high stromal inflammation”). Presence or absence of dense lymphoid aggregates (with or without tertiary follicle formation) might serve as a visual aid for pathologists to classify a particular DCIS lesion as having low or high stromal tumor-infiltrating lymphocytes. Since interobserver agreement seems to be the highest for the 20% tumor-infiltrating lymphocytes cutoff, future work should also focus on investigating its prognostic value for DCIS recurrence risk stratification after breast-conserving surgery. Overall agreement on stromal inflammation and stromal tumor-infiltrating lymphocytes was moderate in the DCISion study, but we may improve the current degree of concordance by providing clear definitions and adequate visual aids to enable training of pathologists. The website of the International Immuno-Oncology Biomarker Working Group (www.tilsinbreastcancer.org) provides such a training tool for assessing tumor-infiltrating lymphocytes. The added value of immunohistochemical characterization of the immune infiltrate should also be explored, as well as the role of the location of stromal tumor-infiltrating lymphocytes. Toss et al. have recently demonstrated that dense touching tumor-infiltrating lymphocytes (i.e., tumor-infiltrating lymphocytes touching the basement membrane or lying within one lymphocyte cell thickness of it) are associated with decreased recurrence-free survival [55]. This promising observation warrants validation in an independent patient cohort, as it indicates that the spatial arrangement of tumor-infiltrating lymphocytes might be more important than their quantity. Moreover, the assessment of touching tumor-infiltrating lymphocytes seems reproducible, because it showed higher interobserver concordance than assessment of tumor-infiltrating lymphocytes as a percentage [55].
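The post hoc cutoff comparison discussed above can be sketched as follows: each observer's percentage score is dichotomized at every candidate cutoff in 10% increments, and a multi-rater agreement statistic is computed per cutoff. This is a minimal illustration only. Fleiss' kappa is assumed here as one common multi-rater statistic and may differ from the statistic used in the study; the function names and the toy percentages are hypothetical and are not study data.

```python
from typing import Dict, Iterable, List, Optional


def fleiss_kappa_binary(labels: List[List[int]]) -> Optional[float]:
    """Fleiss' kappa for binary labels (rows = cases, columns = observers)."""
    n_cases = len(labels)
    n_raters = len(labels[0])
    agreement_sum = 0.0
    positives = 0
    for row in labels:
        pos = sum(row)
        neg = n_raters - pos
        positives += pos
        # Proportion of agreeing observer pairs for this case.
        agreement_sum += (pos * pos + neg * neg - n_raters) / (
            n_raters * (n_raters - 1)
        )
    p_bar = agreement_sum / n_cases            # mean observed agreement
    p_pos = positives / (n_cases * n_raters)   # overall share of "high" ratings
    p_exp = p_pos ** 2 + (1 - p_pos) ** 2      # agreement expected by chance
    if p_exp == 1.0:
        return None  # undefined when all ratings fall in a single category
    return (p_bar - p_exp) / (1 - p_exp)


def scan_cutoffs(
    percentages: List[List[int]], cutoffs: Iterable[int]
) -> Dict[int, Optional[float]]:
    """Dichotomize percentage scores at each cutoff and compute kappa."""
    results = {}
    for cutoff in cutoffs:
        binary = [[1 if v >= cutoff else 0 for v in row] for row in percentages]
        results[cutoff] = fleiss_kappa_binary(binary)
    return results


# Hypothetical percentages: five lesions scored by four observers.
toy_percentages = [
    [5, 10, 5, 10],
    [30, 40, 30, 50],
    [10, 20, 10, 10],
    [0, 5, 0, 5],
    [60, 50, 70, 60],
]
kappas = scan_cutoffs(toy_percentages, range(10, 100, 10))
best_cutoff = max(
    (c for c in kappas if kappas[c] is not None), key=lambda c: kappas[c]
)
```

Cutoffs at which every rating falls into the same category yield an undefined kappa and are skipped when the best cutoff is selected. Because the toy values are invented, the cutoff that this example selects carries no meaning for real DCIS cases.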

Conclusions

Despite upfront dichotomous evaluation, the interobserver variability for histopathological assessment of DCIS remains considerable, and agreement is at best acceptable, although it varies between the evaluated features. This large-scale international multicenter study allowed us to compare two different methods to assess the inflammatory response in the periductal stroma. This comparison suggests that a semiquantitative method (absence or presence of lymphoid aggregates) corresponds best with tumor-infiltrating lymphocytes assessed as a percentage when a cutoff at 10% is used for the latter. Nevertheless, a post hoc applied cutoff at 20% for stromal tumor-infiltrating lymphocytes results in the highest interobserver concordance. Future research should validate the upfront use of this 20% cutoff and investigate its relation to postoperative outcomes. Furthermore, the impact of the current degree of interobserver variability on DCIS prognostication should be explored. Forthcoming machine learning algorithms and additional immunohistochemistry might be useful to tackle these substantial diagnostic challenges.