Ki-67 assessment of pancreatic neuroendocrine neoplasms: Systematic review and meta-analysis of manual vs. digital pathology scoring

Ki-67 assessment is a key step in the diagnosis of neuroendocrine neoplasms (NENs) from all anatomic locations. Several challenges exist related to quantifying the Ki-67 proliferation index due to lack of method standardization and inter-reader variability. The application of digital pathology coupled with machine learning has been shown to be highly accurate and reproducible for the evaluation of Ki-67 in NENs. We systematically reviewed all published studies on the subject of Ki-67 assessment in pancreatic NENs (PanNENs) employing digital image analysis (DIA). The most common advantages of DIA were improvement in the standardization and reliability of Ki-67 evaluation, as well as its speed and practicality, compared to the current gold standard approach of manual counts from captured images, which is cumbersome and time consuming. The main limitations were attributed to higher costs, lack of widespread availability (as of yet), operator qualification and training issues (if it is not done by pathologists), and most importantly, the drawback of image algorithms counting contaminating non-neoplastic cells and other signals like hemosiderin. However, solutions are rapidly developing for all of these challenging issues. A comparative meta-analysis for DIA versus manual counting shows very high concordance (global coefficient of concordance: 0.94, 95% CI: 0.83–0.98) between these two modalities. These findings support the widespread adoption of validated DIA methods for Ki-67 assessment in PanNENs, provided that measures are in place to ensure counting of only tumor cells either by software modifications or education of non-pathologist operators, as well as selection of standard regions of interest for analysis. NENs, being cellular and monotonous neoplasms, are naturally more amenable to Ki-67 assessment. 
However, lessons of this review may be applicable to other neoplasms where proliferation activity has become an integral part of theranostic evaluation including breast, brain, and hematolymphoid neoplasms.


INTRODUCTION
The World Health Organization (WHO) released in 2019 a consensus document entitled "Recommendations on digital interventions for health system strengthening", acknowledging that artificial intelligence (AI) and digital technologies can offer limitless possibilities to advance health management and achievements (https://apps.who.int/iris/handle/10665/311980, last accessed 10/30/2021). Indeed, AI-based technologies are emerging in every medical field, especially in radiology and pathology [1][2][3]. In pathology, AI-based systems utilize machine- and/or deep-learning models to assist pathologists in analyzing digital images to perform different tasks, including screening for rare events, quantification, diagnosing lesions, and prognostication 1,4,5. Digital pathology, which includes the digitizing of glass slides to generate whole slide images, facilitates the application of AI in pathology [6][7][8]. A key benefit of employing AI-based systems in pathology is to provide reliable, objective and reproducible results, thereby reducing inter- and intrapathologist variability and enabling automation to augment routine practice 1,6-8. In this context, digital image analysis (DIA) has been utilized to evaluate neuroendocrine neoplasms (NENs). Among well-differentiated neuroendocrine tumors (NETs), grading is based on the assessment of mitotic rate and the proliferation index determined by Ki-67 immunostaining [9][10][11]. Currently, the WHO classification of some NENs specifies that Ki-67 should be assessed by manual counting on a printed image including at least 500 neoplastic cells from the regions of highest labeling (hotspots) 9,10. Recently, different DIA-based systems have been developed to assist pathologists with this important task (Fig. 1), which has implications for the clinical management of patients with NENs. To date, the majority of studies on this topic have been performed on NENs of the gastro-entero-pancreatic system.
The aim of our study was to systematically review all published studies that compared manual Ki-67 assessment in pancreatic NENs (PanNENs) with DIA-based determination, highlighting the benefits and drawbacks of each approach. A comparative meta-analysis is also undertaken of manual counting versus DIA for PanNENs.

MATERIALS AND METHODS
This systematic review adhered to the MOOSE guidelines 12 and PRISMA statement 13 . Studies were considered eligible for inclusion if they reported original data on DIA-based assessment of Ki-67 in PanNENs. Both neuroendocrine tumors (PanNETs) and carcinomas (PanNECs) were included. For the comparative meta-analysis of manual counting vs. DIA, we considered all manuscripts reporting an analytical comparison between these two modalities used in the assessment of Ki-67. In the case of duplicate cohorts, the largest and then most recent was selected. Exclusion criteria included no definitive histological diagnosis of PanNEN, and in vitro or animal studies.

Study selection and data extraction
Following the aforementioned search strategy, duplicates were removed and then two reviewers (CL, PA) independently screened titles and abstracts of all potentially eligible articles. These two authors applied eligibility criteria and reviewed the full texts of included studies. A final list of articles was subsequently established for both the systematic review and comparative meta-analysis by consensus with a third independent author (AE). Two authors were involved in extracting data in a preset Excel database: one (CL) extracted data from the selected articles; the other (AS) independently validated the extracted data. For each article, we extracted the following information: authors; year of publication; country study originated from; number of cases; patient demographics; type of analyzed material; tumor grading; as well as methods for manual counting and DIA. For the comparative meta-analysis, the primary outcome was the coefficient of agreement between manual counting vs. DIA in the assessment of Ki-67 in PanNENs.
Fig. 1 An example of the use of a digitalized system for assessing Ki-67 in pancreatic neuroendocrine neoplasms is shown here. This is a particularly illustrative case due to the presence of a lymphocytic infiltrate at the tumor periphery, which represents a potential source of bias for Ki-67 assessment with digital systems. A A pancreatic neuroendocrine tumor, G2, is shown (hematoxylin-eosin, 10x original magnification); B the digitalized system can count all cells present in a specific field, also on hematoxylin-eosin slides; C, D modern systems can select a specific area for the Ki-67 count: in this example, the field with lymphocytes has been excluded from the count, reducing potential important biases in tumor grading (Ki-67 immunohistochemistry, 10x original magnification).

Data synthesis, quality, and publication bias assessment
The comparative meta-analysis was conducted using Comprehensive Meta-Analysis v2 software (Biostat; Englewood, NJ, USA). Furthermore, the Newcastle-Ottawa Scale (NOS) was used to assess study quality, following existing guidelines 14,15. Finally, we investigated publication bias by visual inspection of funnel plots and with the Egger bias test 16.
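As an illustration of the regression underlying the Egger bias test, the following is a minimal Python sketch and not the procedure implemented in Comprehensive Meta-Analysis. It assumes that each study contributes a Fisher z-transformed correlation with standard error 1/sqrt(n - 3); the function names and the input values are hypothetical. The standardized effect (effect/SE) is regressed on precision (1/SE), and an intercept far from zero suggests funnel plot asymmetry. A formal test would additionally derive a p value for the intercept from a t distribution with k - 2 degrees of freedom, which is omitted here.

```python
import math


def fisher_z_se(n):
    """Standard error of a Fisher z-transformed correlation from n cases."""
    if n <= 3:
        raise ValueError("n must be greater than 3")
    return 1.0 / math.sqrt(n - 3)


def egger_regression(effects, std_errors):
    """Ordinary least squares of effect/SE on 1/SE.

    Returns (intercept, slope). The intercept is the Egger bias estimate;
    the slope estimates the underlying effect size.
    """
    if len(effects) != len(std_errors) or len(effects) < 3:
        raise ValueError("need at least 3 studies with matching inputs")
    x = [1.0 / se for se in std_errors]                    # precision
    y = [e / se for e, se in zip(effects, std_errors)]     # standardized effect
    k = len(x)
    mean_x, mean_y = sum(x) / k, sum(y) / k
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("standard errors must not all be equal")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical correlations and sample sizes, for illustration only.
rs = [0.90, 0.95, 0.93, 0.96]
ns = [30, 50, 70, 90]
zs = [math.atanh(r) for r in rs]
ses = [fisher_z_se(n) for n in ns]
bias, effect = egger_regression(zs, ses)
print(f"Egger intercept: {bias:.3f}, slope: {effect:.3f}")
```

With only a handful of studies, as in most meta-analyses of this size, such a regression has limited power, so the result is best read alongside the funnel plot.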

Search results
The search yielded a total of 4286 potentially eligible studies. Following in-depth screening based on title/abstract, only 56 (1.3%) of these studies were retrieved for further analysis. Of these, 22 were considered eligible for the systematic review, and 4 for the correlation meta-analysis (Supplementary Fig. 1) 25,27,28,34.

Study and patient characteristics
The most important features from the extracted data are summarized in Table 1. Overall, the selected studies reported data on a total of 752 PanNENs. The majority of the investigated cohorts (59.1%) were from the USA, with the remaining composed of European patients (27.3%) and mixed cohorts including Asian patients (13.6%). There was an almost equal distribution of male (50.5%) and female (49.5%) patients. Regarding tumor grading (G), the majority of cases were G1 (55.3%), followed by G2 (40.6%) and G3 (4.1%). The type of specimen material analyzed varied, including surgical resection specimens, biopsies, and cytology cell blocks. The majority of the studies (54.5%) did not report specific data on the type of specimens analyzed. The reported procedures used for manual counting and the specific DIA technologies adopted in the assessment of Ki-67 in PanNENs are summarized in Table 1.

Advantages and limitations of DIA-based systems in the assessment of Ki-67
The key advantages and limitations of DIA-based systems in the assessment of Ki-67 in PanNENs are summarized in Table 2. The most commonly described advantages of DIA were improved reproducibility and reliability, as well as reduced time required for Ki-67 assessment. The most common limitations of DIA were the counting of non-neoplastic "contaminant" cells (e.g., lymphocytes), the higher cost compared with manual counting, and the potential delay in diagnosis, which depended on certain procedures or on technician availability.
Comparative meta-analysis, quality and publication bias assessment
Overall, 4 studies including 238 patients with PanNEN were selected for the comparative meta-analysis 25,27,28,34. The pooled correlation estimate was 0.94 (95% CI: 0.83-0.98; I² = 24.15%), indicating a high correlation between manual (reference value) and digital counts. The heterogeneity was low (i.e., I² < 50%), reinforcing the reliability of these results. The quality of the studies did not indicate a risk of bias (mean score on the Newcastle-Ottawa Scale: 8). Furthermore, no publication bias emerged (Egger's test = 1.42; p = 0.90). The fail-safe number was 660, a value that indicates strong statistical significance of our results based on existing guidelines 15,16.
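To make the pooled estimate and the I² statistic more concrete, the sketch below shows one common way of pooling correlation coefficients: Fisher z transformation followed by DerSimonian-Laird random-effects weighting. This is an assumption for illustration; the analysis reported here was run in Comprehensive Meta-Analysis v2, and the exact model settings are not restated in this section. The function name and the input values are hypothetical and are not the data of the four included studies.

```python
import math


def pool_correlations(rs, ns):
    """Pool correlations with a DerSimonian-Laird random-effects model.

    rs: per-study correlation coefficients (each strictly between -1 and 1)
    ns: per-study sample sizes (each greater than 3)
    """
    if len(rs) != len(ns) or not rs:
        raise ValueError("rs and ns must be non-empty and of equal length")
    if any(n <= 3 for n in ns):
        raise ValueError("each sample size must be greater than 3")
    zs = [math.atanh(r) for r in rs]        # Fisher z transform
    vs = [1.0 / (n - 3) for n in ns]        # sampling variance of z
    w = [1.0 / v for v in vs]               # fixed-effect weights
    z_fixed = sum(wi * zi for wi, zi in zip(w, zs)) / sum(w)
    # Cochran's Q and the derived heterogeneity measures
    q = sum(wi * (zi - z_fixed) ** 2 for wi, zi in zip(w, zs))
    df = len(rs) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    # Random-effects weights add the between-study variance tau2
    w_re = [1.0 / (v + tau2) for v in vs]
    z_re = sum(wi * zi for wi, zi in zip(w_re, zs)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    low, high = z_re - 1.96 * se, z_re + 1.96 * se
    return {
        "r": math.tanh(z_re),                       # back-transform to r
        "ci": (math.tanh(low), math.tanh(high)),    # 95% confidence interval
        "i2": i2,
        "tau2": tau2,
    }


# Hypothetical inputs, for illustration only.
result = pool_correlations([0.90, 0.95, 0.93, 0.96], [30, 50, 70, 90])
print(result)
```

The confidence interval is computed on the z scale and back-transformed, which is why intervals for high correlations are asymmetric around the pooled estimate, as with the reported 0.83-0.98 interval around 0.94.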

DISCUSSION
The Ki-67 proliferative index is critical in the pathologic assessment of PanNEN, and has important clinical implications for patient management. The adoption of international recommendations released by the WHO classification of tumors and the European Neuroendocrine Tumor Society (ENETS) for assessing Ki-67 has improved the standardization of methodologies for this task 9,39. However, given the persistence of interlaboratory and interobserver discrepancies, as well as potential inconsistencies with different scoring systems, accurately grading PanNENs remains a challenge for pathologists, especially in the G1-G2 and G2-G3 transition areas for PanNETs. Multiple factors affect the interpretation of the Ki-67 index, including the use of different antibody clones and staining protocols, tissue section thickness, tumor cell density, and difficulty distinguishing tumor from non-tumor cells. According to Adsay, "to count or not to count is not the question, but rather how to count" 40. Manually counting camera-captured or printed images is generally favored over eyeballing. More recently, DIA has proven to be an acceptable method for Ki-67 assessment. In this study, we reviewed all published investigations that employed DIA for Ki-67 determination in PanNENs, highlighting some of the advantages and limitations of utilizing this technology. Furthermore, by comparing the coefficient of correlation between manual counting and DIA by means of a comparative meta-analysis, we demonstrated a high value of consistency (0.94, 95% CI: 0.83-0.98) between these two approaches.
The advantages derived from utilizing DIA include more reproducible results, higher accuracy, and reduced time to evaluate Ki-67 in PanNENs 1,6-8. Current guidelines for assessing Ki-67 recommend manual counting from a printed image that includes at least 500 neoplastic cells from tumor hotspots. While still time consuming, this manual method does promote standardization that helps reduce interobserver variability 24. However, for grade transitions between G1 and G2 (Ki-67 of 3%) and between G2 and G3 (Ki-67 of 20%), there were still discrepancies with manual counting from a printed image. The use of DIA for Ki-67 determination resulted in greater consistency in grading of all PanNEN cases, particularly for those cases belonging to the aforementioned gray transition areas G1-G2 and G2-G3. However, it should be noted that even when using DIA one can obtain different results depending on the selection of hotspots and on the number of cells counted. Access to DIA allows rapid counting of more cells, and that alone can push a tumor from G2 to G1 or G3 to G2, whereas counting fewer cells in the same hotspot can achieve the reverse 41.
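The sensitivity of grading to the number of cells counted can be shown with a short Python sketch. It assumes the cut-offs discussed in this review (G1 below 3%, G2 from 3% up to and including 20%, G3 above 20%); the handling of a value of exactly 20% is an assumption of this sketch, the mitotic rate is ignored, and the function names and cell counts are hypothetical.

```python
def ki67_index(positive_cells, total_cells):
    """Ki-67 labeling index as a percentage of counted neoplastic cells."""
    if total_cells <= 0:
        raise ValueError("total_cells must be positive")
    if not 0 <= positive_cells <= total_cells:
        raise ValueError("positive_cells must lie between 0 and total_cells")
    return 100.0 * positive_cells / total_cells


def pannet_grade(index_percent):
    """Grade from the Ki-67 index alone (assumed cut-offs: 3% and 20%)."""
    if index_percent < 3.0:
        return "G1"
    if index_percent <= 20.0:
        return "G2"
    return "G3"


# Hypothetical counts: a single positive cell, or a larger denominator,
# is enough to move a borderline case across the G1/G2 threshold.
for positive, total in [(14, 500), (15, 500), (28, 1000)]:
    index = ki67_index(positive, total)
    print(f"{positive}/{total} cells: {index:.1f}% -> {pannet_grade(index)}")
```

In this example, 15 positive cells out of 500 fall in G2, whereas extending the count around the same hotspot to 1000 cells with 28 positive cells returns the case to G1, which mirrors the effect described above.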
DIA assistance with grading PanNEN is of particular benefit in fine needle aspiration (FNA) cytology samples. Guidelines established using histological samples have been extrapolated to grading PanNENs in cytology material (e.g., cell blocks) procured by FNA. Several studies have demonstrated that Ki-67 assessment by manual counting of tumor cells in cell blocks can result in under-grading of these neoplasms when compared with matched surgical resection specimens 36,42 , with discrepancies more often observed in G2 cases 20,29 . Intriguingly, Abi-Raad and colleagues demonstrated that counting hotspots instead of the complete cell block can provide a higher concordance with surgical specimens, especially for FNA samples that contain ≥ 1000 cells 43 . A different perspective was provided by Satturwar and colleagues who investigated the potential role of augmented reality microscopy (ARM) for Ki-67 assessment in cytology specimens 36 . ARM, which is basically a modified microscope associated with an attached computer unit, enables real-time image analysis using a traditional light microscope and glass slides, without the need to first photograph or digitize slides 36,44,45 . If coupled with image analysis software, ARM allows quantifying immunohistochemical stains including Ki-67, and can also be combined with elaborate AI-based algorithms to perform more complex tasks [44][45][46] . Like other DIA methods, ARM has significant potential for improving PanNEN grading on cell block material 36 .
C. Luchini et al.
Table 1. Summary of studies about AI-based systems used for Ki-67 assessment in PanNENs.
This study investigated a total of 25 cases, 3 with pancreatic origin, 5 with ileal origin, 5 with duodenal origin, 2 with gastric origin, 3 nodal metastases, 1 ileal metastasis, 5 liver metastases and 1 diaphragmatic metastasis; Ω this study reported data on a total of 45 cases but the total number of patients was 44: there were 22 females (one had two tumors, for a total of 23 tumors) and 22 males.

Currently, DIA for Ki-67 assessment has some limitations that may need to be addressed if counting in practice is to be performed with this approach. The most commonly reported drawback is the risk of counting dividing non-neoplastic "contaminating" cells (e.g., endothelial cells, lymphocytes), thereby erroneously increasing the overall tumor grade. Other brown-pigmented signals (hemosiderin and hematoidin) also cause this over-counting phenomenon. Such issues are enhanced in NEC, especially due to the effect of artefacts (e.g., smeared chromatin material, nuclear molding in small cell NEC) on DIA. However, these problems can be overcome by having pathologists directly annotate regions of interest to be scored, with the intent of excluding contaminating cells from being counted. Further studies that specifically address these challenges in PanNEC are needed. This issue becomes particularly important if non-pathologist personnel such as trainees and technicians are used as key operators. Of note, more sophisticated AI algorithms are being developed that only count neoplastic cells [47][48][49] and become more operator-independent. One potential solution that also has been employed is to utilize double-stained slides (e.g., Ki-67 and synaptophysin) with deep learning algorithms to improve the accuracy of Ki-67 index quantification [50][51][52][53]. More recently, some investigators have shown that they are able to predict Ki-67 positive cells directly from H&E images using AI-based methods 51. Another important pending issue that needs to be addressed for improving Ki-67 assessment in PanNENs is related to standardizing hotspot size and number to be evaluated 41,54. Hotspots are defined as tumor areas with higher Ki-67 nuclear staining. It has been shown that the greater the hotspot size, the lower the Ki-67 count, highlighting the importance of standardizing this parameter for reliable evaluation 34,41. Furthermore, not only is the size of the hotspot difficult to define, but so is the shape 55. Most pathologists and algorithms define a hotspot as a circular shape; however, there is no biological evidence to support this notion. Another important factor to consider is the number of hotspots when determining the Ki-67 index.
It is also important to train operators not to select a geographic region that leads to hyper-selection of positive cells within a given hotspot, which erroneously creates higher "percentage" positivity. However, all of the above shortcomings are relatively easy to address with proper training and application of improved AI software. Despite the clear advantages of DIA for determination of the Ki-67 labeling index, scoring with this digital modality is still subject to the fundamental limitation that applies to any cut-off in a continuous variable: it can be changed arbitrarily, as it was for PanNETs in 2017 when the cut-off for Grade 1 was moved from 2% to 3% 9. Moreover, any cut-off of a continuous variable can be shown to have value, but the actual grading is inherently arbitrary 41. Indeed, how best to employ Ki-67 as a reliable prognosticator of PanNETs remains a work in progress. For example, in 2017 the cut-off was clarified such that cases with an index less than 3.0% (including 2.99%), for which the grade was previously unclear, are now clearly included in Grade 1 9. Naturally, as in any grading and staging system that assesses a continuous variable, the Ki-67 index-based system is imperfect 54. For example, it can be expected that cases with 2.99% (now in G1) and 3.0% (now in G2) will be similar in biological behavior. Nevertheless, DIA will help standardize the process, not only offering more reproducible results in daily clinical practice, but also allowing for better comparison between studies that aim to fine-tune this grading system. For example, there have been proposals to move the G1/G2 cut-off to 5%, but it is difficult to verify the results of these proposals due to variation in counting methods. Fundamentally, the reality is that even with more accurate analysis provided by DIA, a G2 tumor with a Ki-67 of 4% will still be more likely to behave in an indolent fashion than a G2 tumor with a Ki-67 of 19%.
Thus, the issue of a continuous variable, which is a complex concept involving statistical and biological sciences 56 , enhances the need for accurate Ki-67 quantification and may ultimately be more important than the actual grade score. Finally, a significant limitation of DIA for widespread adoption has been the accessibility of this technology due to cost and maintenance. However, as whole slide scanners and digital cameras (and related software) become more widely available, the adaptation of facilities to perform DIA for Ki-67 counting is becoming increasingly feasible and amenable to employ 57 . Another issue to be considered is the need to better integrate Ki-67 counting by DIA into routine workflow 24,58 .
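The dependence of the Ki-67 index on hotspot size, noted above, can be illustrated with a deliberately simplified Python sketch. It assumes that tumor cells are laid out along a one-dimensional strip and that a hotspot is the contiguous window of a fixed number of cells with the most positive nuclei. Real hotspots are two-dimensional regions whose shape is itself debated, so this is a toy model only; the function name and the cell pattern are hypothetical.

```python
def hotspot_index(cells, window):
    """Highest Ki-67 percentage over all contiguous windows of `window` cells.

    cells: sequence of 0/1 flags (1 = Ki-67 positive nucleus)
    window: hotspot size, expressed as a number of cells
    """
    if not 0 < window <= len(cells):
        raise ValueError("window must be between 1 and len(cells)")
    count = sum(cells[:window])      # positives in the first window
    best = count
    for i in range(window, len(cells)):
        # Slide the window one cell to the right
        count += cells[i] - cells[i - window]
        best = max(best, count)
    return 100.0 * best / window


# Hypothetical strip of 16 cells with one cluster of positive nuclei.
strip = [0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0]
for size in (4, 8, 16):
    print(f"hotspot of {size} cells: {hotspot_index(strip, size):.2f}%")
```

For this strip, the measured index falls from 75% to 50% to 31.25% as the hotspot grows from 4 to 8 to 16 cells, because larger windows dilute the positive cluster with surrounding negative cells. This is the reason a standardized hotspot size matters for comparing results between laboratories.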
In this review, we chose to focus on PanNENs. However, the topic of manual vs. digital pathology scoring of Ki-67 is also certainly of importance for NENs at many other anatomic sites 54 23 . Similar results were replicated by Kroneman et al. 25 and Conemans et al. 30, showing substantial similarities in terms of prognostication between manual and DIA scoring of Ki-67. It is important to note that only four studies in the literature provided data on this fundamental topic. Moreover, all of these studies were conducted prior to the introduction of the 2017 grading system. Thus, further studies on larger cohorts and based on current grading methods are needed. We advocate that DIA-based systems could provide a more standardized method, guaranteeing a more reliable basis for prognostic stratification.
In summary, this systematic review and comparative meta-analysis demonstrates that the advantages outweigh the limitations of using DIA to assess Ki-67 in PanNENs. We advocate that the next logical step for more broadly adopting DIA in pathology practice would be to further explore the relationship between hotspot parameters (number, size, and shape) and the Ki-67 index with patient outcome. Currently, most studies use manual counting from captured images as the gold standard; however, the ultimate validation will naturally come from prognostic correlation. Based upon current evidence provided by peer-reviewed literature, DIA appears to offer pathologists higher reliability and reproducibility than manual counting for grading PanNENs. The overall findings of this review, therefore, support widespread adoption of carefully optimized and validated DIA-based methods for this important diagnostic task in clinical practice. Lessons learned from the application of DIA to the PanNEN model can also be extrapolated to different tumors in other organ systems, such as breast carcinoma, in which Ki-67 quantification is increasingly becoming a key driver for patient management.

DATA AVAILABILITY
All data/information are available in the manuscript and in the supplementary material.