Systematically higher Ki67 scores on core biopsy samples compared to corresponding resection specimen in breast cancer: a multi-operator and multi-institutional study

Ki67 has potential clinical importance in breast cancer but has yet to see broad acceptance due to inter-laboratory variability. Here we tested an open-source and calibrated automated digital image analysis (DIA) platform to: (i) investigate the comparability of Ki67 measurement across corresponding core biopsy and resection specimen cases, and (ii) assess section-to-section differences in Ki67 scoring. Two sets of 60 previously stained slides containing 30 core-cut biopsy and 30 corresponding resection specimens from 30 estrogen receptor-positive breast cancer patients were sent to 17 participating labs for automated assessment of average Ki67 expression. The blocks were centrally cut and immunohistochemically (IHC) stained for Ki67 (MIB-1 antibody). The QuPath platform was used to evaluate tumoral Ki67 expression. Calibration of the DIA method was performed as in published studies. A guideline for building an automated Ki67 scoring algorithm was sent to participating labs. Very high correlation and no systematic error (p = 0.08) were found between consecutive Ki67 IHC sections. Ki67 scores were higher for core biopsy slides compared to paired whole sections from resections (p ≤ 0.001; median difference: 5.31%). The systematic discrepancy between core biopsy and corresponding whole sections was likely due to pre-analytical factors (tissue handling, fixation). Therefore, Ki67 IHC should be tested on core biopsy samples to best reflect the biological status of the tumor.


INTRODUCTION
It has long been acknowledged that the immunohistochemical (IHC) detection of Ki67-positive tumor cells provides important clinical information in breast cancer 1. More recently, Ki67 gained clinical utility in the T1-2, N0-1, estrogen receptor (ER)-positive and HER2-negative patient group by identifying patients who are unlikely to benefit from adjuvant chemotherapy 2. However, Ki67 has not been consistently adopted for clinical care due to unacceptable reproducibility across laboratories 3-5.
Therefore, the International Ki67 in Breast Cancer Working Group (IKWG) originally published consensus recommendations in 2011 for best practices in the application of Ki67 IHC in breast cancer 6. According to this consensus, parameters that predominantly influence Ki67 IHC results can be grouped into preanalytical (type of biopsy, tissue handling), analytical (IHC protocol), interpretation and scoring, and data analysis steps 6. As the scoring method was the largest contributor to test variability 7, the IKWG has undertaken serious efforts to standardize the Ki67 scoring method of pathologists 8,9. Although standardized Ki67 scoring methods reached pre-defined thresholds for adequate reproducibility in multi-institutional studies 9,10, this was only after completing calibration training and by using tedious counting methods. In this context, recently updated guidelines by the IKWG now recommend Ki67 IHC for clinical adoption in specific situations, including the identification of very low (<5%) or very high (>30%) proliferation indices that render more expensive gene expression tests unnecessary 2.
An important additional issue that can cause variability in Ki67 measurements is the type of specimen (core biopsy vs excision) and its effect on Ki67 scoring in a multi-center setting 2 . Indeed, the IKWG recommended use of core biopsies (CB), based on apparent superior results for Ki67 when visual evaluation was compared to that of whole sections (WS).
In this multi-observer and multi-institutional study, we aimed to investigate the comparability of Ki67 measurements across corresponding core biopsy and resection specimens from the same breast cancer cases, when evaluated using a calibrated, automated reading system. Furthermore, we assessed differences in Ki67 scoring between consecutive sections, because the absence of such differences would simplify selecting the tumor block on which to perform the IHC staining.

MATERIALS AND METHODS

Patients
This study used 30 cases of ER-positive breast cancer from phase 3 of the IKWG initiatives: 15 cases from the UK and 15 cases from Japan, selected to cover a range of Ki67 scores 9. No outcome data were collected for this cohort. Patient selection was irrespective of patients' age at diagnosis, grade, tumor size or lymph node status. The clinicopathological characteristics of these 30 cases can be found in our previous publications 9,10.

Tissue preparation and immunohistochemistry (IHC)
Tissues from UK patients, both core biopsies and surgical resections, were collected according to ASCO/CAP guidelines, while tissues from patients in Japan were collected following ISO (International Organization for Standardization) 15189, approved by the Japan Accreditation Board. Preparation of the Ki67 slides of the first cohort has been previously described 9. Briefly, the corresponding core-cut biopsy and surgical resection blocks were centrally cut and stained for Ki67, resulting in 60 Ki67 slides from 30 cases. The IHC was performed using monoclonal antibody MIB-1 at dilution 1:50 (DAKO UK, Cambridgeshire, UK) using an automated staining system (Ventana Medical Systems, Tucson, AZ, USA) according to the consensus criteria established by the International Ki67 Working Group 6. Sections from the same block were stained in a single immunohistochemistry run, except for four cases where the staining was performed in two different runs. This approach limits technical variation in staining.

Sample distribution
Twenty volunteer pathologists from 15 countries, most of whom participated in the previous Phase 3A study, were invited to participate. Four adjacent sections from each of the 60 blocks were centrally stained as follows: the first section with haematoxylin and eosin (H&E), the second with p63 (a myoepithelial marker, to assist the distinction of DCIS from invasive breast cancer) and the third and fourth with Ki67 (designated as slide sets 1-2).
The Aperio ScanScope XT platform was used at 20× magnification to digitize the slides (pixel size: 0.4987 µm × 0.4987 µm), which were uploaded to a server and distributed as digital images. Seventeen pathologists successfully completed the study (Fig. 1).

Fig. 1 Study design. Thirty patients with ER-positive breast cancer were enrolled, comprising 15 cases from the UK and 15 cases from Japan. Corresponding core-cut biopsy and surgical resection blocks were centrally cut into two adjacent sections per case and stained for Ki67. Seventeen pathologists from 15 countries were given 60 Ki67 slides (30 core-cut biopsy slides and 30 surgical resection specimen slides) to score.

Digital image analysis (DIA)
The QuPath open-source software platform was used to build automated Ki67 scoring algorithms for breast cancer 11. A detailed guideline for setting up and building an automated Ki67 scoring algorithm was sent to the participating labs. All the participating labs were requested to build their own Ki67 scoring algorithm following the instructions and apply it to these 60 slides. The complete step-by-step instructions are available in Supplementary File 1. We asked each lab to build its own algorithm, instead of using the same pre-trained and locked-down Ki67 algorithm, to mimic clinical practice. As of the date of the study, no generalizable Ki67 scoring algorithm was available that provides whole slide scoring. Thus, theoretically, all the labs would need to adjust/optimize any such DIA approach to their lab characteristics (different fixation, different antibodies and IHC protocols etc.), necessitating a lab-specific DIA approach. Calibration of the DIA method/guideline was performed in our previous studies, demonstrating very good reproducibility among users 12,13. Briefly, after the whole invasive cancer area on a digitized slide was annotated, hematoxylin and DAB stain estimates for each case were refined using the "estimate stain vectors" command. We used watershed cell detection 14 to segment the cells in the image with the following settings: detection image: optical density sum; requested pixel size: 0.5 µm; background radius: 8 µm; median filter radius: 0 µm; sigma: 1.5 µm; minimum cell area: 10 µm²; maximum cell area: 400 µm²; threshold: 0.1; maximum background intensity: 2. In order to classify detected cells into tumor cells, immune cells, stromal cells, necrosis and others (false detections, background) (Supplementary File 1), we used random trees as a supervised machine-learning method. The features used in the classification are described in Supplementary Table 1.
After setting the optimal color deconvolution and cell segmentation, two independent classifiers were trained on a randomly selected, pre-defined core biopsy (CB classifier) and a resection specimen slide (WSI classifier). Both CB and WSI classifiers were run on both CB slides and resection specimen slides in order to adjust for potentially different characteristics of the two specimen types (Fig. 2).
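As an illustration of how a score could be derived once cells have been detected and classified, the following minimal Python sketch computes a Ki67 index as the percentage of Ki67-positive cells among cells classified as tumor. It is not the QuPath implementation or the guideline in Supplementary File 1: the function name, the (class, positivity) record layout, and the toy detections are assumptions made for this example only.

```python
from collections import Counter


def ki67_index(cells):
    """Percent of Ki67-positive cells among cells classified as tumor.

    `cells` is an iterable of (cell_class, is_positive) pairs. This record
    layout is hypothetical and is not QuPath's export format.
    """
    counts = Counter()
    for cell_class, is_positive in cells:
        if cell_class != "tumor":
            continue  # immune, stromal, necrosis and false detections are ignored
        counts["positive" if is_positive else "negative"] += 1
    total = counts["positive"] + counts["negative"]
    if total == 0:
        raise ValueError("no tumor cells detected")
    return 100.0 * counts["positive"] / total


# Toy detections (made up for illustration, not study data)
toy_cells = (
    [("tumor", True)] * 30
    + [("tumor", False)] * 70
    + [("immune", True)] * 15
    + [("stroma", False)] * 40
)
print(ki67_index(toy_cells))  # 30.0
```

Restricting the denominator to tumor cells is the reason the classification step matters: in the toy data, counting the positive immune cells as well would change the result.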

Statistical analysis
For statistical analysis, SPSS 22 software (IBM, Armonk, USA) was used. Degree of agreement was evaluated by Bland-Altman plot and linear regression. To assess differences between specimen types, the Wilcoxon signed-rank test was applied, since the data were not normally distributed. Data were visualized using boxplot, spaghetti plot, and dot-plot.
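For readers who wish to reproduce the paired comparison outside SPSS, the sketch below implements a two-sided Wilcoxon signed-rank test with a normal approximation, using only the Python standard library. It is not the SPSS routine used in this study: it omits tie and continuity corrections, so its p-values are approximate, and the function name and toy scores are assumptions made for illustration.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Zero differences are dropped and tied absolute differences receive
    average ranks. No tie or continuity correction is applied.
    Returns (w_plus, w_minus, p_value).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean_w) / sd_w
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return w_plus, w_minus, p_value


# Toy paired scores in percent (made up for illustration, not study data)
cb = [12.0, 25.5, 8.0, 40.0, 18.5, 30.0, 22.0, 15.0]
ws = [10.0, 18.0, 9.0, 28.0, 14.0, 24.5, 16.5, 15.0]
w_plus, w_minus, p = wilcoxon_signed_rank(cb, ws)
print(w_plus, w_minus, round(p, 3))
```

With small samples such as this toy set, an exact test is generally preferable to the normal approximation; the approximation is shown here only because it is compact.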

RESULTS
Between-(consecutive) section difference in Ki67 scoring
Very high correlation and no systematic error (bias: −0.6%; p = 0.08) were found between the two consecutive (serial) sections regarding Ki67 scores. The difference between sections tended to be greater for cases with higher Ki67 scores (proportional error p = 0.002; Fig. 3); however, this difference (0.6% mean difference) does not reach clinical relevance.

Specimen type (CB vs resection specimen) difference in Ki67 scoring
A low correlation was found between core biopsy and whole section excision images (Fig. 4). Ki67 scores were higher when determined on core biopsy slides compared to paired whole sections (p ≤ 0.001; median difference: 5.31%; IQR: 11.50%) from subsequent surgical excisions of the same tumor. Systematic error occurred between specimens from the same patient, as core biopsy Ki67 scores were greater, with a clinically relevant mean difference of 6.6% (bias p = 0.001). The limits of agreement must also be considered wide from a clinical perspective (between −13.7% and 27%). Furthermore, Ki67 scores on CB were even higher compared to WS in cases with higher Ki67 scores (proportional error p = 0.001). Moreover, the variability of differences in Ki67 scores between CB and WS showed an increasing trend, proportional to the magnitude of the Ki67 score (Fig. 4). The same results were found irrespective of the origin of the specimens (CB vs WS p < 0.001 for both UK and Japan cases; Fig. 5).
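The agreement quantities reported in this section (bias, limits of agreement, proportional error) can be sketched as follows. This is a minimal Bland-Altman computation in Python, assuming the conventional 95% limits of bias ± 1.96 SD and using the least-squares slope of the paired difference on the paired mean as the proportional-error term; the exact SPSS procedure may differ, and the function name and toy values are hypothetical.

```python
import statistics


def bland_altman(a, b):
    """Bias, 95% limits of agreement, and slope of difference on mean."""
    diffs = [x - y for x, y in zip(a, b)]
    means = [(x + y) / 2 for x, y in zip(a, b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    # Least-squares slope of difference on mean: a non-zero slope indicates
    # that the disagreement grows (or shrinks) with the magnitude of the score.
    mean_of_means = statistics.mean(means)
    sxx = sum((m - mean_of_means) ** 2 for m in means)
    sxy = sum((m - mean_of_means) * (d - bias) for m, d in zip(means, diffs))
    slope = sxy / sxx
    return {"bias": bias, "lower": lower, "upper": upper, "slope": slope}


# Toy paired scores in percent (made up for illustration, not study data)
cb_scores = [10.0, 20.0, 30.0, 40.0]
ws_scores = [9.0, 17.0, 25.0, 33.0]
result = bland_altman(cb_scores, ws_scores)
print(result)
```

Under the same 1.96 SD assumption, the reported limits of −13.7% and 27% are centered near the reported 6.6% bias and would correspond to a standard deviation of the paired differences of roughly 10.4 percentage points.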

DISCUSSION
In this study, we observed that clinically relevant and systematic discrepancies occurred in Ki67 scores between core biopsy and corresponding surgical specimens when evaluated with an automated reading system. Overall, Ki67 scores were higher on CB compared to WS samples. Furthermore, this discrepancy was even more pronounced in tumors that expressed higher levels of Ki67 in general.
Ki67 is one of the most promising yet controversial biomarkers in breast cancer, with limited adoption into clinical practice due to its high inter- and intra-laboratory variability 3,15. Although Ki67 is widely used in many countries, there is wide variability in its use (to distinguish luminal A-like vs B-like tumors; to decide whether to perform gene-expression profiling; as an adjunct to mitotic counts, etc.), with still no uniformity between clinicians on how to use this biomarker, let alone which cut-off to use. Although the IKWG set up a guideline in 2011 to improve preanalytical and analytical performance, inter-laboratory protocols still demonstrated low reproducibility related to different sampling, fixation, antigen retrieval, staining and scoring methods 6,7. As the latter was the largest single contributor to assay variability, the IKWG has undertaken multi-institution efforts that have standardized visual scoring of Ki67 in a manner which requires on-line calibration tools and careful scoring of several hundred cells, which may or may not be ideal for pathologists in daily practice with time constraints 8,9. This suggests that digital solutions may still be required to address this issue.
The rise of digital image analysis (DIA) platforms has improved capacity and automation in biomarker evaluation 16,17. DIA platforms are able to assess nuclear IHC biomarkers such as Ki67, and numerous studies have been conducted to compare human visual scoring with DIA platforms 12,18-28. Although the latest guideline of the IKWG recommends Ki67 for clinical practice in specific situations, the type of specimen as a potential pre-analytical factor contributing to Ki67 variability was not specifically investigated in a multi-operator/multi-center setting. In this study we aimed to address these biospecimen questions, including assessment by specimen type and between serial sections.
One explanation for our finding would be the presence of tumor heterogeneity, and the broader field of review in a whole section from a resection specimen. However, one would expect this cause of discrepancy to result in random discordance, not the consistent finding that Ki67 scores on core biopsies are higher than those on resection specimens. Rather, we conclude that lower Ki67 in resection specimens is more easily explained by preanalytical factors. For example, since longer times to fixation occur with resection specimens compared to CB, persistent cell division will occur even in an unfixed, hypoxic environment. Further, epitope degradation also occurs with prolonged time to fixation 29-31.
In addition, one can expect that hot spot scoring might lead to less discrepancy between CB and WS because it considers only the hottest area of Ki67 positivity (highest percentiles of Ki67 distribution) on both specimen types, while global assessment evaluates the total Ki67 distribution which can be variable 10 . However, there remains a fundamental issue of exact hot spot definition and where pathologists set its boundaries. Moreover, the International Ki67 Working Group has recommended global scoring over hot spot as it did show a consistent trend towards increased reproducibility in both core biopsy 9 and excision 10 specimens.
Additional support for the conclusion that the difference in Ki67 between CB and WS is systematic is provided by the observation of clinically relevant differences between specimens in cases from different institutions used in this study, independently scored multiple times by 17 pathologists. Although many studies focused on assessing the level of agreement between CB and resection samples in Ki67 scoring, consensus was not possible due to lack of standardization 32.
Our results are consistent with previous results showing poor/moderate concordance (κ = 0.195-0.814) between CB and resection specimens in Ki67 scoring 1,33-46. However, some studies showed higher Ki67 scores on resection samples 35,36,38. This discrepancy among studies may be due to lack of standardization in methodology leading to different scoring methods, which we have previously demonstrated to be highly variable 2. Moreover, inter-institutional discrepancies could also be the result of different antibodies and protocols used to detect Ki67, different tissue handling/fixation protocols and, to some extent, intratumoral Ki67 heterogeneity 6. Thus, our findings provide further support to the latest IKWG recommendations that Ki67 should ideally be tested on CB samples, because doing so minimizes many fixation problems; Ki67 IHC is more sensitive than ER or HER2 to variability in fixation 2. Since pre-analytical factors are critical in diagnostic pathology, the IKWG recommends that breast cancer samples for Ki67 testing should be processed in line with ASCO/CAP guidelines 2.
There are a number of limitations in this study. This study only focused on analytical and preanalytical questions, therefore we cannot demonstrate the clinical validation of the calibrated tool. There are many other studies that address the prognostic or predictive value of this test, and that goal was beyond the scope of this effort. For the same reason, further clinical studies are needed to demonstrate how this consistent difference in Ki67 between corresponding core-cuts and resection specimens affects prognostic value or the assessment of neoadjuvant endocrine therapy benefit. Furthermore, the low correlation suggests a critical difference between a core biopsy score and a whole section excision score, which can undermine the use of data on outcome, derived predominantly from resection samples, to identify patients at high risk using a score derived from a core biopsy. Therefore, this study suggests caution in this approach, given that even without intervening therapy a clinically relevant change in Ki67 may occur. Further, the Ki67 assessments were based on biospecimens from only 2 central sites. While the participating pathologists within the IKWG represented 15 countries, specimens were centrally acquired and stained. Whereas other investigators have compared specimens from multiple different sites 5,7,47, we limited the number of sites to remove the variables associated with the technical aspect of the stain. Finally, while the core cut biopsy and resection are from the same case, only a single core was assessed. Thus, we could be missing heterogeneity seen in larger resection specimens. The effect of heterogeneity could be decreased by taking multiple core cuts when the clinical situation allows. However, since examination of a single core cut represents the clinical standard of care in several countries, we did not pursue multiple cores.
In conclusion, while we find no significant difference in digitally assessed Ki67 index between serial sections, we do find a systematic discrepancy between core biopsy and corresponding whole sections: core biopsy samples yield higher scores (likely due to pre-analytical factors, including more standard and prompt tissue handling, fixation, etc.). Therefore, this work suggests that Ki67 IHC tested on core biopsy samples should be preferred to excision specimens in clinical decision-making, because doing so minimizes the influence of many pre-analytical factors.

DATA AVAILABILITY
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

ETHICS APPROVAL AND CONSENT TO PARTICIPATE
The study was approved by the British Columbia Cancer Agency's Clinical Research Ethics Board (H10-03420).