The nuclear proliferation biomarker Ki67 has potential prognostic, predictive, and monitoring roles in breast cancer. Unacceptable between-laboratory variability has limited its clinical value. The International Ki67 in Breast Cancer Working Group investigated whether Ki67 immunohistochemistry can be analytically validated and standardized across laboratories using automated machine-based scoring. Sets of pre-stained core-cut biopsy sections of 30 breast tumors were circulated to 14 laboratories for scanning and automated assessment of the average and maximum percentage of tumor cells positive for Ki67. Seven unique scanners and 10 software platforms were involved in this study. Pre-specified analyses included evaluation of reproducibility between all laboratories (primary) as well as among those using scanners from a single vendor (secondary). The primary reproducibility metric was intraclass correlation coefficient between laboratories, with success considered to be intraclass correlation coefficient >0.80. Intraclass correlation coefficient for automated average scores across 16 operators was 0.83 (95% credible interval: 0.73–0.91) and intraclass correlation coefficient for maximum scores across 10 operators was 0.63 (95% credible interval: 0.44–0.80). For the laboratories using scanners from a single vendor (8 score sets), intraclass correlation coefficient for average automated scores was 0.89 (95% credible interval: 0.81–0.96), which was similar to the intraclass correlation coefficient of 0.87 (95% credible interval: 0.81–0.93) achieved using these same slides in a prior visual-reading reproducibility study. Automated machine assessment of average Ki67 has the potential to achieve between-laboratory reproducibility similar to that for a rigorously standardized pathologist-based visual assessment of Ki67. 
The observed intraclass correlation coefficient was worse for maximum compared to average scoring methods, suggesting that maximum score methods may be suboptimal for consistent measurement of proliferation. Automated average scoring methods show promise for assessment of Ki67, but require further standardization and subsequent clinical validation.
The Ki67 immunohistochemistry assay is widely performed to assess cellular proliferation in breast cancer [1, 2], yet its assessment has never been standardized. This has limited its value for both clinical trial and routine diagnostic usage. The International Ki67 in Breast Cancer Working Group was convened in 2010 to address this problem. This group designed and executed several studies to first assess the problem and then to develop methods of standardization, beginning with visual assessment.
The International Ki67 in Breast Cancer Working Group (Supplemental Table 1) has previously demonstrated that, in the absence of standardized scoring, concordance for visual reading of Ki67 in previously stained sections was satisfactory within, but not between, different laboratories. When training was instituted to standardize scoring, interlaboratory reproducibility improved substantially [5, 6]. However, these prior studies were performed using slides containing previously cut and stained tissue microarray sections, and no effort to associate the data with clinical outcomes was undertaken. Therefore, evidence remains insufficient to support Ki67 use in routine clinical care. Now the International Ki67 in Breast Cancer Working Group is conducting additional studies to determine whether similar reproducibility can be achieved using core-cut and whole-section biopsy specimens representative of materials used in clinical practice.
During the last decade, technological advances have permitted development of software for automated assessment of immunohistochemical expression. Although digital image analysis algorithms might be expected to be superior to visual analysis, computational approaches have not been proven superior in recognizing cancer and consistently selecting the correct objects to score. Recent progress in generation of image capture platforms and software packages has raised the possibility that machine-based approaches might rival pathologist-based visual assessments for scoring Ki67.
To investigate this possibility, we undertook a study to assess reproducibility of multiple existing technologies for automated machine measurement of Ki67 expression using slides from core-cut biopsies previously analyzed in the International Ki67 in Breast Cancer Working Group phase 3 study that evaluated reproducibility of visual Ki67 assessment. In that study, a standardized visual approach to scoring Ki67 met its pre-specified criterion for success. We now report results comparing 14 laboratories from 6 countries, using 7 different scanners and 10 software packages, in the absence of any prescribed scanner or software harmonization.
Materials and methods
This study was approved by the British Columbia Cancer Agency Clinical Research Ethics Board (protocol H10–03420). All samples used were donated by patients who signed a generic consent. All core-cut biopsy material used was excess to diagnostic requirements and ethically available for quality control studies.
Fourteen volunteer laboratories (three of which participated in the phase 3 visual scoring study), representing six countries, completed this automated image analysis study. Two laboratories contributed two sets of analysis results each, and these were treated as though they were independent laboratories for purposes of the analysis. Preparation of the Ki67, H&E, and myoepithelial marker (p63) slides was as previously described. Briefly, 5 adjacent sections from each of the 30 core-cut biopsy source blocks were centrally cut and stained for H&E (1 section), p63 (1 section), and Ki67 (3 sections), resulting in 3 groups of 30 Ki67 slides from 30 cases. One group of slides was damaged in the previous study, leaving only two sets available for this study. Participating laboratories were divided into two groups (seven laboratories in each group) and members within the same group were given the same set of glass slides to analyze (Supplemental Fig. 1). Each laboratory had 2 weeks to scan the slides and then send them to the next laboratory on the list. Two sets of slides were circulated to expedite study completion, with the assumption that serial sections would be essentially identical with respect to Ki67 expression.
Slide scanning and analysis
Image analysis systems selected by site principal investigators for use in this study covered a wide range (Table 1). The most common scanning platform was the Aperio, which was used by 7 of the 14 laboratories. Once scanned images were generated, each site implemented its own choice of software package for analysis, with 10 different packages chosen by the 14 laboratories. Even where groups used the same software package, they did not use it identically. Five laboratories scanned the slides at 40×, while nine scanned at 20×. Most systems required some human intervention in the image analysis process. Specifically, 9 systems required a human operator for an initial training step, 11 required human visual selection of a “region of interest”, and 2 had the user specify the number of “fields of view” for analysis. Five systems did not analyze by “field of view” methods, instead counting and averaging across the entire slide, and hence they were not able to generate maximum scores. One system analyzed images based solely on pixel colors, while all other systems included some notion of shape/size object selection. One system (laboratory D) did not use a slide scanner, but used a live microscope camera directly connected to the image analysis software. Two systems were based on open-source software.
Participating laboratories were instructed to score the 30 core-cut biopsy slides using the image analysis system of their choice following their own standard operating procedure. No further instructions were given, no standardization slides were sent out, and participants were unaware of others’ scores (including previous visual scores). All participating laboratories were given online access to the H&E and myoepithelial marker (p63) images for the 30 study cases. A Microsoft Excel spreadsheet was sent out to each laboratory for entering (1) the number of fields of view analyzed, (2) maximum score of the fields analyzed, (3) average score across all the fields analyzed, (4) timing data, and (5) any comments they may have on the study slides. All laboratories provided details of their image analysis system by answering a set of questions. Two laboratories (Lab H and Lab L) submitted scores using two image analysis approaches. Table 1 shows the details of the image analysis systems used in this study.
Ki67 score calculation
The various image analysis systems used in this study have their own definition of Ki67 score. Most defined Ki67 score as the percentage of invasive tumor cells positively stained in the examined field(s). However, one measured simply the percentage of pixels with a certain color. Five (out of 14) image analysis systems did not analyze the scanned image by field of view; instead, the entire region of interest was analyzed and a single Ki67 score was reported.
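The distinction between average and maximum scores can be illustrated with a minimal sketch, assuming each field of view yields a count of positively stained and total invasive tumor cells. The function name and the example counts are hypothetical, and the unweighted mean of per-field percentages is only one plausible reading of "average"; the systems in this study each applied their own definitions.

```python
from statistics import mean


def ki67_scores(fields):
    """Return (average, maximum) Ki67 percentage over fields of view.

    `fields` is a list of (positive_cells, total_tumor_cells) pairs, one per
    field of view. Assumption: the average is the unweighted mean of the
    per-field percentages; a system could instead pool cells across fields.
    """
    # Skip fields without tumor cells to avoid division by zero.
    percentages = [100.0 * pos / total for pos, total in fields if total > 0]
    if not percentages:
        raise ValueError("no field contains tumor cells")
    return mean(percentages), max(percentages)


# Hypothetical counts for three fields of view from one case.
average, maximum = ki67_scores([(10, 100), (30, 100), (20, 100)])
```

A system that analyzes the whole region of interest as a single field would return the same value for both quantities, which is why such systems could not contribute a meaningful maximum score.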
Statistical design and analysis
Intraclass correlation coefficient as the reproducibility metric
Intraclass correlation coefficient estimates (ranging from 0 to 1, with 1 representing perfect reproducibility) were computed by variance component analysis as previously described (see Statistics Supplement). Analyses partitioned total variability in log-transformed Ki67 scores into variance contributions from scoring laboratory, patient tumor (biological variation; each core-cut biopsy block represents a unique patient), section (slide) of the core-cut biopsy block, and remaining variability absorbed in residual error. Same-section (laboratories scoring the same set of slides) and different-section (laboratories scoring different sections of the same block) intraclass correlation coefficients were computed, representing the proportion of the total variation (biological + technical) attributable to biological variability between patients at the tumor section level and patient biopsy level, respectively.
Variance component and intraclass correlation coefficient estimates with 95% credible intervals were obtained using packages lme4 and MCMCglmm in R version 3.2.1. Data were visualized using heat maps, boxplots, and spaghetti plots.
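As a rough sketch of how the two coefficients relate to the variance components, the function below assumes that the different-section coefficient treats only patient-level variance as signal, while the same-section coefficient also counts section-level variance as signal. This reading follows the description above but is an assumption; the exact definitions are in the Statistics Supplement. The function name and the numeric inputs are hypothetical, not study estimates.

```python
def icc_from_variance_components(patient, section, laboratory, residual):
    """Return (same_section_icc, different_section_icc).

    Inputs are variance components of log-transformed Ki67 scores.
    Assumed definitions (not confirmed against the Statistics Supplement):
      different-section ICC = patient / total
      same-section ICC      = (patient + section) / total
    """
    total = patient + section + laboratory + residual
    if total <= 0:
        raise ValueError("total variance must be positive")
    different_section = patient / total
    same_section = (patient + section) / total
    return same_section, different_section


# Hypothetical variance components chosen only to illustrate the arithmetic.
same_icc, diff_icc = icc_from_variance_components(
    patient=0.80, section=0.04, laboratory=0.08, residual=0.08
)
```

Under these definitions the same-section coefficient can never be lower than the different-section coefficient, because moving from shared slides to serial sections only adds a source of technical variation.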
Pre-specified criteria for success
Primary criteria for success used in the phase 3 visual scoring study were also used here: achieving an intraclass correlation coefficient significantly greater than 0.80 for both same-section and different-section intraclass correlation coefficient. Significance was interpreted as the 95% credible interval for intraclass correlation coefficient lying completely above 0.80 (see Statistics Supplement for power analysis).
Handling of revised data from one laboratory
After initial data analysis, data from one laboratory (Lab B) appeared markedly different from data from all other laboratories. Study leadership asked Lab B to conduct a quality review of its data, without revealing to Lab B how its data differed from other laboratories’ data. Lab B identified problems in its process and was permitted to submit revised data, on the condition that both data sets be acknowledged in the study report (see Statistics Supplement). Summary statistics reported here are based on the revised data unless otherwise specified. However, for figures/plots showing individual data points, both the initial and revised results from Lab B are shown.
Interlaboratory reproducibility of Ki67 according to score type
Participating laboratories were divided into Groups 1 and 2, with seven different laboratories in each group (Labs H and L each submitted scores using two image analysis approaches, and these were analyzed as though they came from independent laboratories, resulting in four sets of scores from these two laboratories combined). A pre-stained set of 30 specimens, covering a representative range of Ki67 levels for ER+ breast cancer, was sent to an initial laboratory, scanned, and then sent to the next laboratory within each group. Figure 1 displays the side-by-side boxplots of Ki67 scores across laboratories, by group. Summary statistics for the Ki67 scores across the 14 laboratories are given in Supplemental Tables 3 and 4.
Variance components analysis produced estimates of the biological, laboratory, section, and residual variances for the average and maximum scoring methods (Supplemental Tables 5a–b). Estimates for different-section intraclass correlation coefficient, obtained without standardization across laboratories and using originally submitted data, were 0.83 (95% credible interval: 0.73–0.91) for automated average scores across 16 operators and 0.63 (95% credible interval: 0.44–0.80) for maximum scores across 10 operators (Supplemental Table 6). However, original data submissions were discovered to include outlier results from one laboratory that failed to follow its internal standard operating procedures. After quality review and correction of that laboratory’s aberrant data, revised intraclass correlation coefficient estimates were 0.86 (95% credible interval: 0.79–0.93) for average scores and 0.76 (95% credible interval: 0.64–0.88) for maximum scores. The corresponding same-section intraclass correlation coefficient estimates for the average and maximum scores were 0.89 (95% credible interval: 0.83–0.95) and 0.77 (95% credible interval: 0.64–0.88), respectively. This observation indicates excellent reproducibility for average score between automated image analysis systems scoring the same physical glass slides. Although the revised different-section intraclass correlation coefficients did not meet the pre-specified success criterion (lower bound of 95% credible interval did not exceed 0.80), the one for average score using corrected data from Lab B came very close.
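The pre-specified criterion can be applied mechanically to the revised estimates reported above. The sketch below is illustrative only: the function and dictionary names are hypothetical, and the numbers are the point estimates and 95% credible interval bounds quoted in this section.

```python
THRESHOLD = 0.80  # pre-specified success threshold for the ICC


def meets_criterion(lower_bound, threshold=THRESHOLD):
    """True when the 95% credible interval lies completely above the threshold."""
    return lower_bound > threshold


# (estimate, lower bound, upper bound) for the revised data, as reported above.
reported = {
    "average, different-section": (0.86, 0.79, 0.93),
    "average, same-section": (0.89, 0.83, 0.95),
    "maximum, different-section": (0.76, 0.64, 0.88),
    "maximum, same-section": (0.77, 0.64, 0.88),
}

results = {
    name: meets_criterion(lower) for name, (_, lower, _) in reported.items()
}
```

Only the same-section interval for average scores clears the threshold; the different-section interval for average scores misses it by 0.01 at its lower bound, which is the sense in which that result "came very close".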
When the secondary analysis was performed restricting to only the subgroup of laboratories using the Aperio platform (8 score sets), different-section intraclass correlation coefficient for automated average scores was 0.89 (95% credible interval: 0.81–0.96) (only two laboratories in this subgroup reported maximum Ki67 scores, so variance components analysis was not conducted for that method). A modest numerical increase compared to the analogous intraclass correlation coefficient for the full group of laboratories was observed. Although perhaps not a statistically significant increase, this result provides motivation to investigate whether standardization of automated scoring could further improve reproducibility.
Variance component analyses show that, regardless of scoring method, biological variation among different patients was the largest component of the total variation, indicating that the Ki67 score is reflecting inherent properties of the tumor and that the variation in scores introduced by different laboratories’ scanning and scoring is not obscuring biological signal (Fig. 2, Supplemental Tables 5a–b).
Comparisons of absolute Ki67 scores between laboratories
The variation in scores across laboratories is shown in Fig. 3, in spaghetti plot format. Each line represents scores from one laboratory for each of the 30 core biopsy cases. The between-laboratory reproducibility at the lower end of the range of Ki67 values appears to be particularly good for the average/global method using the automated approach but this good performance did not extend to the lower values of Ki67 using the automated maximum/hot-spot method.
Agreement of categorical Ki67 scores
In routine clinical laboratory settings, some pathologists may provide categorical Ki67 scores rather than exact staining percentages. To reflect this, an analysis was performed on a categorical level (instead of continuous 0–100% scale), considering categories <10, 10–20, and >20% (commonly interpreted as low, intermediate, and high Ki67 indices). Concordance of these categorical scores across laboratories and cases can be appreciated in a heat map format with the columns (laboratories) sorted (within each group) by the median scores across cases, and the rows (cases) sorted by the median scores across laboratories (Fig. 4). Each box (representing one laboratory’s score for one case) is color-coded, according to the three categories. Among the 30 breast cancer cases, 11 showed complete agreement across laboratories for categorized average scores, and 12 showed complete agreement using categorized maximum scores (Figs. 3, 4). This display also illustrates that laboratories measuring higher or lower than others did so fairly consistently, presumably influenced by thresholds set by the software each laboratory used.
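The categorical comparison can be sketched as follows, assuming the intermediate category includes both boundary values (10% and 20%), as the stated categories imply. The function names and the example scores are hypothetical and are not study data.

```python
def categorize(score):
    """Map a Ki67 percentage to the low / intermediate / high category."""
    if score < 10:
        return "low"
    if score <= 20:
        return "intermediate"
    return "high"


def complete_agreement(scores_by_lab):
    """True when every laboratory's score for a case falls in one category."""
    return len({categorize(score) for score in scores_by_lab}) == 1


# Hypothetical scores for two cases, each read by three laboratories.
case_a = [4.2, 6.8, 9.1]     # all below 10%
case_b = [18.5, 22.0, 25.3]  # straddles the 20% boundary

agreement_a = complete_agreement(case_a)
agreement_b = complete_agreement(case_b)
```

The second case shows why scores near a cutoff are vulnerable: a spread of a few percentage points between laboratories is enough to break categorical agreement even when the continuous scores are close.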
Comparison with standardized visual scoring
Standardized visual scores obtained previously on these same slides and the non-standardized automated scores obtained in the current study show a high degree of similarity across the spectrum of cases, although the similarity is greater for average score than for maximum score methodology (Fig. 5). Although this study was not statistically designed to compare standardized visual to non-standardized automated scoring, observed score ranges and reproducibility appear similar: intraclass correlation coefficient for average standardized visual = 0.87 (95% credible interval: 0.81–0.93) compared to intraclass correlation coefficient for average non-standardized automated (using Lab B’s revised results) = 0.86 (95% credible interval: 0.79–0.93).
The analysis of Ki67, while often perceived as valuable, has not been widely adopted for directing routine breast cancer management, mostly due to lack of standardization across laboratories. While other International Ki67 in Breast Cancer Working Group studies have focused on standardization of visual scoring, here we tested the hypothesis that simple adoption of an automated method could achieve standardization. The data support the hypothesis in that reproducibility across independent laboratories from around the world was observed to be much higher with non-standardized digital imaging analysis compared to what was seen in the first similar effort involving pathologist visual scoring without standardization. Given this higher “starting point” for reproducibility, we are optimistic that the addition of standardization to the automated process may lead to a highly uniform and reproducible scoring method suitable for eventually achieving clinical validation and ultimately broad clinical adoption, but this remains to be assessed.
Arguably, establishing comparability of machine scores to human reads is an important step toward incorporating Ki67 results into routine clinical care. However, since we felt that machine-based scoring should first be shown to have good reproducibility prior to comparison to visual scoring, this study was primarily designed to assess reproducibility among (unstandardized) automated methods. Although this study was not designed for a statistically well-powered comparison of the two approaches, we did conduct an exploratory comparison of reproducibility of automated scoring vs. human visual scoring of these slides. In this regard, the difference between the non-standardized automated average score intraclass correlation coefficient (using Lab B’s revised results) of 0.86 (95% credible interval: 0.79–0.93) and the standardized visual intraclass correlation coefficient of 0.87 (95% credible interval: 0.81–0.93) appears minimal. This observation suggests that a statistically powered formal comparison of the ability of Ki67 to predict clinical outcome when scored according to the various approaches should proceed only following standardization of the automated systems. Standardization should include an assessment of whether the apparent superior performance of the automated average/global method of scoring at lower levels of Ki67 can be confirmed.
Another key observation was provided, perhaps inadvertently, by Lab B. Upon assessment of its initially submitted data, it was clear that Lab B was an outlier compared to all other laboratories (Fig. 1). In consultation with the director of Lab B, it became evident that deviations from the laboratory’s standard operating procedures had occurred. Submission of revised scores was permitted with agreement to report both sets of results. Although Lab B’s second data set remained somewhat high compared to data from other laboratories, differences were less dramatic. This illustrates the importance of both careful human oversight of machine data and also standardization across laboratory sites, whether for pathologist reads or for machine calibration.
Limitations to this work, besides inadequate power to compare automated to visual scoring, are important to appreciate. There was heterogeneity in the scanners and software used by the laboratories, but insufficient numbers using each platform for formal comparison. All sections were cut and pre-stained in a single laboratory using a uniform method, but these factors would contribute additional variability in Ki67 determinations across clinical laboratories in practice. Further, some laboratories batched scanning before analysis while others scanned and analyzed individual cases in succession. These aspects could affect results, but the impact was not quantifiable to this level of detail.
The relative prognostic performance of hot spot (i.e., determining score based on most mitotically active area of tumor) vs. average scoring of Ki67 expression has been a longstanding and still unresolved issue [9, 10]. We used image analysis-determined maximum scores to attempt to reflect the human concept of hot spot, but these assessments relied on selection of a field of view (FOV) without standardized hot spot sampling criteria. FOV size may be another important aspect of standardization, as suggested by reports that larger FOV sizes for hot spot determination are associated with decreased Ki67 scores. Further studies are needed to define optimal criteria for hot spot analysis to improve reproducibility of both visual and machine measurement. Future studies of Ki67 that include clinical outcome data are also needed to determine which of average, hot spot, or other score quantification can deliver Ki67 values most predictive of outcome when analytically standardized.
Leung SCY, Nielsen TO, Zabaglo L, et al. Analytical validation of a standardized scoring protocol for Ki67: phase 3 of an international multicenter collaboration. NPJ Breast Cancer. 2016;2:16014.
Yerushalmi R, Woods R, Ravdin PM, et al. Ki67 in breast cancer: prognostic and predictive potential. Lancet Oncol. 2010;11:174–83.
Dowsett M, Nielsen TO, A’Hern R, et al. Assessment of Ki67 in breast cancer: recommendations from the international Ki67 in breast cancer working group. J Natl Cancer Inst. 2011;103:1656–64.
Nielsen T, Polley M, Leung S, Mastropasqua MG, Zabaglo LA, Bartlett JMS, et al. An international Ki67 reproducibility study. Cancer Res. 2012;72:SABCS abstrS4–6.
Polley MY, Leung SC, McShane LM, et al. An international Ki67 reproducibility study. J Natl Cancer Inst. 2013;105:1897–906.
Polley MY, Leung SC, Gao D, et al. An international study to increase concordance in Ki67 scoring. Mod Pathol. 2015;28:778–86.
R Core Team. R: a language and environment for statistical computing. Vienna: R Foundation for Statistical Computing; 2017.
Harris LN, Ismaila N, McShane LM, et al. Use of biomarkers to guide decisions on adjuvant systemic therapy for women with early-stage invasive breast cancer: American Society of Clinical Oncology clinical practice guideline. J Clin Oncol. 2016;34:1134–50.
Jang MH, Kim HJ, Chung YR, et al. A comparison of ki-67 counting methods in luminal breast cancer: the average method vs. the hot spot method. PLoS ONE. 2017;12:e0172031.
Brown JR, DiGiovanna MP, Killelea B, et al. Quantitative assessment ki-67 score for prediction of response to neoadjuvant chemotherapy in breast cancer. Lab Invest. 2014;94:98–106.
Christgen M, von Ahsen S, Christgen H, et al. The region-of-interest size impacts on Ki67 quantification by computer-assisted image analysis in breast cancer. Hum Pathol. 2015;46:1341–9.
Schuffler PJ, Fuchs TJ, Ong CS, et al. TMARKER: a free software toolkit for histopathological cell counting and staining estimation. J Pathol Inform. 2013;4:S2.
Klauschen F, Wienert S, Schmitt WD, et al. Standardized Ki67 diagnostics using automated scoring—clinical validation in the GeparTrio breast cancer study. Clin Cancer Res. 2015;21:3651–7.
Wienert S, Heim D, Kotani M, et al. CognitionMaster: an object-based image analysis framework. Diagn Pathol. 2013;8:34.
Wienert S, Heim D, Saeger K, et al. Detection and segmentation of cell nuclei in virtual microscopy images: a minimum-model approach. Sci Rep. 2012;2:503.
This work was supported by a generous grant from the Breast Cancer Research Foundation (DFH). Additional funding for the UK laboratories was received from Breakthrough Breast Cancer and the National Institute for Health Research Biomedical Research Centre at the Royal Marsden Hospital. Funding for the Ontario Institute for Cancer Research is provided by the Government of Ontario. JH is the Lilian McCullough Chair in Breast Cancer Surgery Research and the CBCF Prairies/NWT Chapter. We are grateful to the Breast International Group and North American Breast Cancer Group (BIG-NABCG) collaboration, including the leadership of Nancy Davidson, Thomas Buchholz, Martine Piccart, and Larry Norton.
Conflict of interest
DR works or has worked as a consultant to AstraZeneca, Agendia, Agilent, Biocept, BMS, Cell Signaling Technology, Cepheid, Merck, OptraScan, Perkin Elmer, and Ultivue; has equity in PixelGear; and received research funding from AstraZeneca, Cepheid, Navigate/Novartis, Gilead Sciences, Ultivue, and Perkin Elmer. JB received honorarium from Oncology Education. JB has a consulting or advisory role with Insight Genetics, BioNTech, Due North, and Biotheranostics. CD received honoraria from Novartis, Pfizer, Amgen, MSD, Roche, Celgene, and Teva. CD has been a cofounder and shareholder of Sividon Diagnostics, Cologne. CD has a patent or intellectual property interest for VmScope Digital Pathology Software. CG is the executive vice president, chief medical officer, and laboratory director of Molecular MD. AG is the chief executive officer of Optra Technologies. AJ is the director of digital pathology of Optra Technologies. RL is the co-founder of MUSE Microscopy Inc. KM has a consulting or advisory role with Visiopharm. LP has a consulting or advisory role with Hamamatsu, Leica, Ibex, and Cambridge Healthtech Institute. MD has received lecture fees from Myriad. The remaining authors declare that they have no conflict of interest.
Rimm, D.L., Leung, S.C.Y., McShane, L.M. et al. An international multicenter study to evaluate reproducibility of automated scoring for assessment of Ki67 in breast cancer. Mod Pathol 32, 59–69 (2019). https://doi.org/10.1038/s41379-018-0109-4