Introduction

Breast cancer biomarkers, namely Estrogen Receptor (ER), Progesterone Receptor (PR), Androgen Receptor (AR), and Human Epidermal Growth Factor Receptor 2 (HER2), are crucial components of pathology reporting and known prognostic factors used to determine therapy for patients both in the primary setting and with recurrent or metastatic disease. Studies have shown a substantial survival benefit from targeted therapy, especially against hormone receptors such as ER and/or oncogenic proteins such as HER21. These prognostic markers in breast cancer are routinely tested using both histopathology specimens and cytology preparations, including cell blocks2. Digital pathology systems have been developed and validated for routine histopathology diagnoses using digitized glass slides as whole slide images (WSI). Digital pathology systems have also been demonstrated to assist in rapid diagnostic consultation in breast care clinics3.

Progress in the field of digital pathology has yielded new opportunities for digital reporting of such biomarkers. Over the past decade, WSI has become popular for seeking a second opinion in teleconsultation4, for education5, and most importantly for primary diagnosis, especially in surgical pathology6,7,8,9,10,11,12,13,14,15,16,17,18,19. There is limited literature on the use of WSI in cytopathology for primary diagnosis in comparison to surgical pathology, mainly due to the characteristics of cytological specimens, longer scan times, and the need for higher scanning resolution20,21. More recently, the adoption of WSI for primary diagnosis has increasingly become a reality despite some barriers to implementation, such as cost and workflow considerations. Despite this increased adoption, relatively few publications address the validation of WSI for cytopathology diagnostic use22,23,24,25.

Although the College of American Pathologists (CAP) published a formal guideline on validating WSI for primary diagnosis in 201315, there are limited data to support the use of WSI in reporting immunohistochemistry markers in cytology specimens. The FDA has also recently approved WSI for primary diagnosis purposes. There are currently two WSI scanner devices that have been cleared for use in formalin-fixed paraffin-embedded (FFPE) hematoxylin and eosin stained tissue specimens, but they do not make claims on use for cytology specimens26,27.

Few studies have attempted to show that WSI diagnosis is equivalent to light microscopy interpretation in breast cytology specimens28,29, and to the best of our knowledge, none have validated breast biomarkers reporting in cytology specimens. Among the few studies that used cytology specimens, only a single whole slide scanner was used, which raises questions about inter-instrument as well as inter-observer variability and diagnostic accuracy.

The advantages of WSI for surgical pathology can potentially be used in cytopathology specimens for interpretation of prognostic immunohistochemistry (IHC). Validation of WSI on multiple scanners is crucial to ensure that diagnostic performance based on digitized slides is non-inferior to that of glass slides and light microscopy. Among the few studies that used WSI in non-gyn cytology specimens22, none have validated the reporting of breast hormonal biomarkers in cytology specimens using three different scanners.

In this study, we aimed to evaluate the utility of the digital review of breast cancer biomarkers (ER, PR, AR, and HER2) and to validate their reporting in cytology specimens. We also assessed concordance between digitally reviewed WSI and the conventional microscope results using three different whole slide scanners.

Material and methods

Case selection

This study presents a validation of digital pathology reporting of breast biomarkers in cytology specimens at a large academic tertiary cancer center in New York City after approval from the institutional review board. The validation encompasses digitization of fine needle aspiration and fluid specimens from patients with known breast cancer (recurrent or metastatic). Glass slides generated from formalin-fixed paraffin-embedded cell blocks included hematoxylin & eosin (H&E) stains and breast biomarker immunohistochemical stains (ER, PR, HER2, and AR). All methanol-fixed specimens were excluded from the study.

Digitization of glass slides

The whole slide imaging process included pre-analytic quality assurance of slide preparations, the analytic process of glass slide digitization on each whole slide scanner, and post-analytic quality assurance of the generated WSI for digital artifacts (e.g., out of focus, stitching, banding). The whole slide scanning process included three different vendor whole slide scanners acquiring WSI from a glass slide at high resolution (~0.25 µm/pixel). Each whole slide scanner uses an objective lens paired with an image acquisition sensor and stitches all captured images together to form a single digital file that can be navigated in a whole slide image viewer, similarly to a glass slide on a microscope. All glass slides were scanned in a single z-plane.

All glass slides used for validation (ER, PR, AR, and HER2), together with their corresponding H&E cell block slides, were scanned using three different scanners at x40 equivalent magnification (~0.25 µm/pixel). The scanners used are listed below:

  • Leica Aperio GT450 (Leica Biosystems, Buffalo Grove, IL, USA)

  • Pannoramic 1000 (3DHistech, Budapest, Hungary)

  • Ultra Fast Scanner (Philips Health, Amsterdam, Netherlands)

Whole slide scanner precision

Each glass slide was scanned in triplicate on every scanner included in this study to evaluate the intra-scanner precision of digitization with respect to barcode detection, tissue detection, and image quality (e.g., blur, digital artifacts). A successful scan was defined as correct barcode decoding, complete capture of tissue on the glass slide, and a digital slide free of image quality defects. The success rate of each scan was tabulated for all three scanners.

Digital slide review and scoring

After whole slide scanning on each of the three different scanners at x40 equivalent resolution, slides were de-identified and were distributed to the study pathologists using an internally developed WSI viewer application30.

A total of 96 glass slides were scanned three times on each scanner: 20 cell block H&E slides and 20 ER, 20 PR, 16 AR, and 20 HER2 immunohistochemistry slides. All immunohistochemical stained slides had routine control tissue placement on each slide for pathologist reference. Cases were randomized and distributed to the study pathologists for digital cytology reporting on cases from the three different scanners. Quantification of ER, PR, AR, and HER2 was assessed according to the American Society of Clinical Oncology (ASCO) and the College of American Pathologists (CAP) guidelines31,32 by all four pathologists, who were blinded to the reported semi-quantitative IHC results, after a washout period of at least 6 months. All participating pathologists (n = 4) were breast pathologists with cytology experience and at least 3 years of experience using WSI for secondary diagnostic use cases (tumor boards, reviewing archived scanned slides, etc.). Pathologists used HP Z24n 24-inch monitors with 1920 × 1200 resolution. The monitors are not color calibrated by default, and no color calibration adjustment was done. The review was conducted in 2 phases. In the first phase, each pathologist randomly reviewed a total of 15 digital cases with their corresponding markers, 5 cases from each scanner, on 3 different occasions at least 14 days apart.

The second phase was conducted 6 months after completion of phase one. Two of the four pathologists (Pathologist B and Pathologist C) reviewed all 20 digital cases with their corresponding IHC markers on each scanner to test the inter-instrument concordance rate across scanners and to investigate inter-observer variability (Fig. 1).

Fig. 1: Study Design.

A total of 96 glass slides were scanned three times on each scanner: 20 cell block H&E slides and 20 ER, 20 PR, 16 AR, and 20 HER2 immunohistochemistry slides. The review was conducted in 2 phases. In the first phase, each of the four pathologists (Pathologist A-D) randomly reviewed a total of 15 digital cases with their corresponding markers, 5 cases from each scanner, on 3 different occasions at least 14 days apart. The second phase was conducted 6 months later: two of the four pathologists (Pathologist B and C) reviewed all 20 digital cases with their corresponding IHC markers on each scanner to test the inter-instrument concordance rate between scanners and to investigate inter-observer variability. ER Estrogen Receptor, PR Progesterone Receptor, AR Androgen Receptor, HER2 Human Epidermal Growth Factor Receptor 2, H&E Hematoxylin and Eosin stain.

A referee pathologist who did not participate in the study reads verified all cases included in the study by checking pathologist case assignment and whole slide image quality and by arranging rescanning of glass slides that failed scanning, noting the reason for each failure. The referee also annotated the biomarker readings and recorded the percentage of positive nuclear staining for ER, PR, and AR and the membranous staining for HER2 for analysis.

Statistical analysis

Interobserver variability was calculated using unweighted Cohen’s κ, whereby a value of 0.01 to 0.20 indicated slight concordance, 0.21 to 0.40 fair, 0.41 to 0.60 moderate, 0.61 to 0.80 substantial, and 0.81 to 0.99 strong concordance. Statistical significance was established at p < 0.05.
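As an illustration of the statistic described above, the following is a minimal Python sketch of unweighted Cohen's κ together with the interpretation bands used in this study. It is not the software used for the analysis; the function names, the example reads, and the labels for values at or below 0 and for exactly 1.0 are assumptions added for illustration.

```python
import math
from collections import Counter


def cohens_kappa(reads_a, reads_b):
    """Unweighted Cohen's kappa for two equal-length lists of categorical reads."""
    if len(reads_a) != len(reads_b) or len(reads_a) == 0:
        raise ValueError("need two non-empty lists of equal length")
    n = len(reads_a)
    # Observed agreement: fraction of paired reads with identical categories.
    observed = sum(a == b for a, b in zip(reads_a, reads_b)) / n
    # Chance agreement: product of the marginal frequencies, summed over categories.
    counts_a, counts_b = Counter(reads_a), Counter(reads_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical category; kappa is undefined.
        return math.nan
    return (observed - expected) / (1.0 - expected)


def interpret_kappa(kappa):
    """Map kappa to the bands stated in the text."""
    if math.isnan(kappa):
        return "undefined"
    if kappa >= 1.0:
        return "perfect"  # assumption: this band is not named in the text
    if kappa > 0.80:
        return "strong"
    if kappa > 0.60:
        return "substantial"
    if kappa > 0.40:
        return "moderate"
    if kappa > 0.20:
        return "fair"
    if kappa > 0.0:
        return "slight"
    return "no agreement beyond chance"  # assumption: not named in the text


# Hypothetical glass vs digital reads for ten slides (not study data).
glass = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "pos"]
digital = ["pos", "pos", "pos", "neg", "pos", "neg", "pos", "pos", "neg", "pos"]
kappa = cohens_kappa(glass, digital)
print(round(kappa, 2), interpret_kappa(kappa))
```

The sketch treats every disagreement equally, which is what "unweighted" means here; a weighted κ would instead penalize disagreements between distant ordered categories more heavily.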

The percentage agreement, 95% CI, and the level of significance (using Fisher’s exact test) were calculated using IBM SPSS Statistics for Windows, version 26 (IBM Corp., Armonk, N.Y., USA). This study was performed with the approval of the Institutional Review Board of Memorial Sloan-Kettering Cancer Center (New York, NY).
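For readers who want to reproduce a percentage agreement with a 95% CI outside SPSS, a minimal Python sketch follows. The interval method used by SPSS in this study is not stated, so the Wilson score interval below is an assumption, as are the function name and the example counts.

```python
import math


def percent_agreement_ci(n_agree, n_total, z=1.96):
    """Percentage agreement with a Wilson score interval (z=1.96 for ~95%).

    Assumption: the Wilson method is chosen for illustration only; the
    study does not state which interval method was used.
    """
    if n_total <= 0 or not 0 <= n_agree <= n_total:
        raise ValueError("counts must satisfy 0 <= n_agree <= n_total, n_total > 0")
    p = n_agree / n_total
    denom = 1 + z * z / n_total
    center = (p + z * z / (2 * n_total)) / denom
    half = z * math.sqrt(p * (1 - p) / n_total + z * z / (4 * n_total ** 2)) / denom
    lower = max(0.0, center - half)
    upper = min(1.0, center + half)
    return 100 * p, 100 * lower, 100 * upper


# Illustrative counts: 19 concordant reads out of 20 paired reads.
pct, lower, upper = percent_agreement_ci(19, 20)
print(f"{pct:.0f}% agreement (95% CI {lower:.1f}-{upper:.1f}%)")
```

The Wilson interval is a common choice for small samples because, unlike the simple normal approximation, it stays within 0–100% when agreement is close to 100%.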

Concordance

Concordance analysis was performed in two steps:

  • Kappa concordance was initially calculated for all pathologists, with concordance defined as complete agreement between the originally signed-out biomarker IHC results and the digital reads (i.e., positive vs negative for ER, PR, AR, and HER2, or equivocal for HER2 staining).

  • We also conducted a second analysis for ER, PR, and AR to test within-group concordance for pathologists B and C after dividing the IHC score readings into groups: Group 1 = 0%, Group 2 = 1–10%, Group 3 = 11–75%, and Group 4 = >75% nuclear staining.
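The grouping used in the second analysis can be sketched in Python as follows. The function names are hypothetical, and the handling of non-integer scores between 0% and 1% (assigned to Group 2) is an assumption, since the text only describes whole-percentage boundaries.

```python
def ihc_group(percent_positive):
    """Return the study group (1-4) for a nuclear staining percentage.

    Group 1 = 0%, Group 2 = 1-10%, Group 3 = 11-75%, Group 4 = >75%.
    Assumption: scores are whole percentages; any non-zero value up to
    10% is placed in Group 2.
    """
    if not 0 <= percent_positive <= 100:
        raise ValueError("percentage must be between 0 and 100")
    if percent_positive == 0:
        return 1
    if percent_positive <= 10:
        return 2
    if percent_positive <= 75:
        return 3
    return 4


def same_group(glass_percent, digital_percent):
    """True when the glass and digital scores fall in the same group."""
    return ihc_group(glass_percent) == ihc_group(digital_percent)


# Hypothetical paired scores (glass %, digital %), not study data.
pairs = [(0, 0), (5, 0), (60, 70), (80, 75)]
for glass, digital in pairs:
    print(glass, digital, same_group(glass, digital))
```

Under this scheme, a pair such as 60% vs 70% counts as concordant because both scores fall in Group 3, whereas 80% vs 75% crosses the Group 3/Group 4 boundary.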

Discordance

  • Minor discordances were defined as differences that will not result in clinical or prognostic implications.

  • Major discordances between microscopic and digital reads for all pathologists were classified when a clinically relevant change was seen (i.e., positive vs negative for ER, PR, AR, HER2 or equivocal versus negative for HER2 staining) between paired samples. Major discordances for pathologist B and C in the second phase of analysis were classified as differences in biomarkers scores that reflected different groups.

Final diagnoses for the discordant cases were made by consensus between two pathologists (ME, OL).

Results

Study cohort

Following random selection, the study cohort included 20 matched cell block H&E, ER, PR, and HER2 samples from 20 patients, with AR available in 16 cases. All samples were from metastatic disease (n = 19) or locally recurrent disease (n = 1) in female patients. Among the study samples, 70% (n = 14), 35% (n = 7), and 75% (n = 12) were reported positive for ER, PR, and AR, respectively, by the originally reported brightfield microscope interpretation. Fifteen cases were HER2 negative (0 or 1+), one case was reported HER2 positive (3+), and four cases were HER2 equivocal (2+) by immunohistochemistry (Supplementary Table 1). FISH results were available for 3 of the 4 equivocal HER2 cases, of which one case showed amplification. Forty-five percent (9/20) of the study samples were pleural fluid specimens, 40% (8/20) were from lymph nodes, 10% (2/20) were from chest wall, and 5% (1/20) were from breast local recurrences. Clinical pathological characteristics are shown in Table 1.

Table 1 Summary of histology, specimen type, site of 20 breast cancer samples.

The study set consisted of 96 glass slides scanned on the three scanners (total = 288 scanned images). Quality assurance of WSIs by technicians or the referee pathologist prior to initiation of digital readings revealed that 65 slides (23%) required rescanning due to barcode detection failures, 26 (9%) due to tissue detection failure, 13 (5%) due to partial tissue detection, and 5 (2%) due to out-of-focus areas (Supplemental Table 2). Since scanner 1 does not support manual focus plane depth adjustment, manual adjustments for out-of-focus and tissue detection failures were only made on scanner 2 and scanner 3. The slides for scanner 1 were rescanned without any adjustment, relying on the scanner’s own focus and tissue detection. After remediation of the glass slides and scanner adjustments, failures decreased to a total of 3 images: one tissue detection failure for PR on scanner 1 and two out-of-focus HER2 images on scanners 2 and 3 (Supplemental Table 2).
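The rescan percentages above follow from dividing each failure count by the 288 scanned images. A short Python sketch of that arithmetic is given here; the variable and function names are illustrative, and rounding to the nearest whole percent is assumed from the reported figures.

```python
SLIDES = 96
SCANNERS = 3
total_images = SLIDES * SCANNERS  # 96 slides on each of 3 scanners

# First-pass failure counts as reported in the text.
first_pass_failures = {
    "barcode detection failure": 65,
    "tissue detection failure": 26,
    "partial tissue detection": 13,
    "out-of-focus areas": 5,
}


def failure_percentages(failures, total):
    """Percentage of all scanned images per failure reason, rounded to whole percent."""
    return {reason: round(100 * count / total) for reason, count in failures.items()}


percentages = failure_percentages(first_pass_failures, total_images)
for reason, pct in percentages.items():
    print(f"{reason}: {first_pass_failures[reason]}/{total_images} = {pct}%")
```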

The first-time successful scan rate and the average number of rescans needed to successfully scan each slide are shown in Table 2. Technical data for first-time failed scans, stratified by negative versus positive immunohistochemistry staining, are shown in Table 3.

Table 2 First-time successful scan rate, average number of rescans to successfully scan for each slide, scanner precision by stain and by scanner.
Table 3 Technical Data: First-time failed scan for negative versus positive IHC; due to barcode or other reasons.

We also separately tested scanner precision by scanning all slides from each case in triplicate on each scanner. Scanner 1 had the best average precision (92%) compared to scanner 2 and scanner 3, with 78% and 73%, respectively. When comparing precision between ER, PR, AR, and HER2 staining, the HER2 immunohistochemical stains had the lowest intra-scanner precision at 64%.

Concordance

All four pathologists successfully completed digital review of all WSI. Each pathologist randomly reviewed a total of 15 digital cases with their corresponding IHC biomarkers, 5 cases from each scanner on 3 different occasions. A total of 228 reads were performed using both WSI and glass slides.

There was strong concordance between the digital and glass slide readings for all four pathologists (κ = 0.97, 0.85, 0.93, and 0.90 for pathologists A, B, C, and D, respectively; P-value <0.0001) (Fig. 2). Complete concordance between all study pathologists and the original sign-out diagnosis was achieved in 90% of ER, 80% of PR, 100% of AR, and 95% of HER2 stains.

Fig. 2: Kappa concordance.

(1) Kappa concordance of ER, PR, AR, and HER2 scores between all pathologists’ digital reads and the microscope. (2) Kappa concordance of ER, PR, and AR scores divided into groups between all scanners for pathologist B. (3) Kappa concordance of ER, PR, and AR scores divided into groups between all scanners for pathologist C. The p-value corresponds to a two-sided hypothesis test comparing reader-averaged accuracy with each scanner to the microscope and between scanners (0.41 to 0.60 moderate, 0.61 to 0.80 substantial, and 0.81 to 0.99 strong concordance). ER Estrogen Receptor, PR Progesterone Receptor, AR Androgen Receptor, HER2 Human Epidermal Growth Factor Receptor 2.

Inter-instrument concordance

Six months after the initial scoring, two out of four pathologists (pathologist B and pathologist C) reviewed all digital cases on all three scanners. Concordance was similarly determined by comparing between multiple scanners to evaluate inter-observer variability on digital reads from different scanners.

There was strong concordance (kappa = 0.97, 0.89, and 0.92 for pathologist B, and kappa = 0.92, 0.94, and 0.86 for pathologist C) when comparing reads between scanners (scanner 1 vs scanner 2, scanner 1 vs scanner 3, and scanner 2 vs scanner 3; P < 0.001) (Fig. 2).

Discordance

Reassessment of the glass slide and the WSI for all discordant cases by 2 referee pathologists (ME, OL) revealed that overall discordance was seen in 2.3% (n = 6) of glass/digital pairs (1 for ER, 4 for PR, and 1 for HER2), of which 5 were minor discordances without any clinical or prognostic implication. One ER and three PR stains from three different specimens that were initially reported with rare nuclear stain positivity on the glass slide (5, 1, and 1%, respectively) had a negative digital read. Another PR stain was reported negative on the initial glass read and was digitally reported positive (2%). All cases had very few tumor cells (<20 tumor cells) with weak nuclear staining, as shown in Fig. 3.

Fig. 3: Five discordant cases IHC stained slides.

A: ER stain: reported 5% on the glass slide and had a negative digital read. B: PR stain: initially reported negative on the glass slide and had a 2% digital read. C–E: PR stains: initially reported 1% on the glass slide and had a negative digital read. F: HER2 stain: initially reported equivocal (2+) on the glass read and had a negative (1+) digital read. IHC Immunohistochemistry, ER Estrogen Receptor, PR Progesterone Receptor, AR Androgen Receptor, HER2 Human Epidermal Growth Factor Receptor 2.

One HER2 slide (0.3%) showed a major discordance. This HER2 stain, initially reported as equivocal (2+) on the glass read, had a negative (1+) digital read. FISH results from this case, performed at the time of the original IHC reporting, showed HER2 amplification. Table 4 details the discordant cases.

Table 4 Discordant Diagnoses and Reasons: 5 Discordant cases showing glass reads, digital reads, reviewers reads, scanners used, and reasons for discordance.

Discussion

Digital validation reporting of hormonal breast cancer markers has not been clearly established in the literature, especially in cytology specimens. In breast cancer patients, determination of prognosis and treatment strategies based on ER, PR, AR, and HER2 status greatly depends on the accurate evaluation of overexpression by IHC and/or FISH.

Studies have attempted to use WSI to validate primary diagnosis in multiple surgical pathology specialty applications12 including prostate33, pediatric6, dermatopathology14, gastrointestinal16,19, and gynecological pathology specimens34. Krenacs et al. were among the first to address the potential use of digital imaging in breast cancer18. This was followed by the use of WSI in primary breast cancer diagnosis17,28 and in reporting prognostic factors such as the Nottingham histology grading17,35, PDL-136, and HER2 immunohistochemistry stains37,38 in histology specimens.

In surgical pathology, among the largest studies that used WSI to validate primary diagnosis in multiple organs, the concordance rate was 98% and 96% and the major discordance rate was 0.7% and 0.9% in 3017 and 1070 specimens, respectively39,40. In breast pathology, the concordance and discordance rate followed a similar trend, 95–97% and 3%, respectively9.

However, when surgical pathology and cytology studies are compared, the reported results are variable. Two recent meta-analyses, one in surgical pathology and one in cytology, showed slightly different concordance rates between WSI and the original diagnosis. Araujo et al. in 2019 (13 surgical pathology studies) showed overall strong concordance of 87–98.3% with an inter-observer κ coefficient of 0.8–0.98. In contrast, Girolami et al. in 2020 (19 cytology studies) showed a very wide range of concordance and inter-observer κ coefficient, 14–100% and 0.57–0.82, respectively, in cytology smears only and not cell blocks. Even after correction for the difference in study size, the mean percentage concordance and κ coefficient remained inferior to those reported in surgical pathology, at 84% and 0.69, respectively22. This can be partially explained by the differences in histological characteristics between surgical pathology and cytology specimens.

Focusing on the application of immunohistochemistry WSI for reporting prognostic breast markers, Campbell et al. used IHC (e.g., AE1/AE3, P63, and E-cadherin) WSI to study the diagnostic reads in breast core needle biopsies9. Others have evaluated digital reporting of prognostic immunohistochemical factors such as PDL-1 and HER2 in breast cancers and showed similar concordance results. Two studies of digital HER2 and PDL-1 reporting showed substantially equivalent kappa coefficients (0.72) and percent agreement ranging from 61–92% in 180 and 79 cases, respectively36,41. Others have confirmed these findings and concluded that interpretation of breast marker IHC is non-inferior whether performed on glass slides or digital images37. The 90–97% concordance of glass/digital pairs for all pathologists seen in this study is comparable to that published in the literature.

In our study, overall discordance was seen in 2.3% (n = 6) of glass/digital pairs, of which one case (HER2) showed a major discordance. All 5 cases with minor discordance had very few tumor cells staining (1–5%). There are limited data on the overall benefit of endocrine therapies for patients with low-level (1–10%) ER expression. The literature suggests that tumors with such results are heterogeneous in both behavior and biology and may be more similar to ER-negative cancers31. For this reason, we considered those cases to be relatively concordant.

The only case that had a major discordance was a HER2 stain that was initially reported equivocal (2+) on the glass read and had a negative (1+) digital read. FISH studies performed at the time of the original IHC reporting showed HER2 amplification. Upon review of this case, the tumor cells had heterogeneous HER2 membranous staining ranging from weak to strong, which might be subject to inter-observer variation regardless of the method used for scoring. HER2 scores on WSI were shown to be higher than those on glass slides in a previous study, possibly due to increased color contrast on WSI42, an issue that we did not face in this study since immunohistochemistry positive and negative controls were part of the scanned slide. Additionally, previous studies have addressed diagnostic concerns regarding color inconsistency of WSI between different scanners and within the same scanner on different occasions43. These parameters may be important for future validation guidelines, since adjusting color or contrast/brightness might alter visibility and impact reporting of membranous or even nuclear staining.

In 2017, the FDA cleared the first device for using WSI for primary diagnosis26. A systematic review showed that among the studies that mentioned the type of scanners used, Leica Aperio seemed to be the most used (37%), followed by Hamamatsu (21%) and Roche/Ventana (16%)22. Another study39 used eight different scanners from five different manufacturers and showed no significant discordances when rendering diagnosis using digital versus glass slides. As per laboratory accreditation guidelines44, each digital pathology system requires its own validation studies. In this single large center study, three different scanners from three different manufacturers were used, and the concordance rate between scanners was analyzed after accounting for interobserver variability, showing excellent concordance between all three scanners.

Error rates, measured by the first-time successful scan rate, differed by scanner vendor. Each scanner has its own technical specifications, including tissue detection, where immunohistochemistry typically has lower contrast than hematoxylin and eosin-stained tissue. Scanners also require barcode decoding to ensure digital slides are viewable within the laboratory information system. Additionally, scanned glass slides may show digital artifacts, such as out-of-focus regions on the digital slide, that may necessitate a rescan. Scanner 1 showed the highest performance for first-time successful scans, as well as for intra-scanner precision. One PR stain persistently showed tissue detection failure on scanner 1; scant tissue and pale counterstain might have contributed.

The HER2 immunostain showed the lowest first-time successful scan performance relative to the other immunohistochemical stains. After scanner adjustments, two out-of-focus HER2 images persisted. On closer review of the images, we believe that staining heterogeneity and faint tissue causing low contrast might have contributed to the focus issues. Understanding these parameters is important in evaluating scanners as well as validations. Implementing WSI in primary reporting of immunohistochemistry will refine pathology practice for prognostication where limited tissue is available, particularly in subspecialties like breast cytopathology.

Digital validation will help integrate pathology images with clinical information, which will eventually make it easy to compare a patient’s prior and prospective breast hormonal status, to better understand disease progression, and to aid treatment modification in recurrent or metastatic breast cancer cases.

When comparing the pathologists’ digital biomarker quantification with the original glass slide microscope reads, the respective kappa values were consistent, even when comparing digital reads across scanners. However, some difficulties were encountered in reporting biomarker stains in cases with low cellularity and heterogeneous weak nuclear staining, even after manual adjustment with multiple scans at different focus plane depths.

Excluding methanol-fixed cases is one of this study’s major limitations. Another limitation was the time needed for image exploration. Examining WSI was perceived to take more time than evaluation by conventional microscope (although no formal timing was conducted in this study). Additional research is required to document the minimum number of tumor cells required in cytology specimens for optimal WSI performance and IHC digital reporting. Such data are important for validation guidelines or protocols dedicated to cytology specimens to enhance daily pathology practice.

In conclusion, this study is the first to address the feasibility of WSI of breast biomarkers in cytology specimens and to validate primary reporting using inter-instrument comparisons between three different scanners. Digital scanning is an acceptable method to report ER, PR, AR, and HER2 quantification assessment for clinical decision-making and offers similar reproducibility to routine microscope reads. More studies are needed in cytology specimens to better understand discordances and compare concordance between different scanners, resolutions, and specimen preparations. This will ultimately help refine WSI guidelines and standards dedicated to the cytology subspecialty (Table 4).