Introduction

In 2019, the World Health Organization (WHO) released a consensus document entitled “Recommendations on digital interventions for health system strengthening”, acknowledging that artificial intelligence (AI) and digital technologies can offer limitless possibilities to advance health management and achievements (https://apps.who.int/iris/handle/10665/311980, last accessed 10/30/2021). Indeed, AI-based technologies are emerging in every medical field, especially in radiology and pathology1,2,3. In pathology, AI-based systems utilize machine- and/or deep-learning models to assist pathologists in analyzing digital images to perform different tasks, including screening for rare events, quantification, diagnosing lesions, and prognostication1,4,5. Digital pathology, which includes the digitization of glass slides to generate whole slide images, facilitates the application of AI in pathology6,7,8. A key benefit of employing AI-based systems in pathology is that they provide reliable, objective and reproducible results, thereby reducing inter- and intra-pathologist variability and enabling automation to augment routine practice1,6,7,8.

In this context, digital image analysis (DIA) has been utilized to evaluate neuroendocrine neoplasms (NENs). Among well-differentiated neuroendocrine tumors (NETs), grading is based on the assessment of mitotic rate and the proliferation index determined by Ki-67 immunostaining9,10,11. Currently, the WHO classification of some NENs specifies that Ki-67 should be assessed by manual counting on a printed image including at least 500 neoplastic cells from the regions of highest labeling (hotspots)9,10. Recently, different DIA-based systems have been developed to assist pathologists with this important task (Fig. 1), which has implications for the clinical management of patients with NENs. To date, the majority of studies on this topic were performed on NENs of the gastro-entero-pancreatic system.

Fig. 1: An example of the use of a digitalized system for assessing Ki-67 in pancreatic neuroendocrine neoplasms is shown here.

This is a particularly illustrative case due to the presence of a lymphocytic infiltrate at the tumor periphery, which represents a potential source of bias for Ki-67 assessment with digital systems. (A) A pancreatic neuroendocrine tumor, G2, is shown (hematoxylin-eosin, 10x original magnification); (B) the digitalized system can count all cells present in a specific field, also on hematoxylin-eosin slides; (C, D) modern systems can select a specific area for the Ki-67 count: in this example, the field with lymphocytes has been excluded from the count, reducing potentially important biases in tumor grading (Ki-67 immunohistochemistry, 10x original magnification).

The aim of our study was to systematically review all published studies that compared manual Ki-67 assessment in pancreatic NENs (PanNENs) with DIA-based determination, highlighting the benefits and drawbacks of each approach. A comparative meta-analysis of manual counting versus DIA for PanNENs was also undertaken.

Materials and methods

This systematic review adhered to the MOOSE guidelines12 and PRISMA statement13. Studies were considered eligible for inclusion if they reported original data on DIA-based assessment of Ki-67 in PanNENs. Both neuroendocrine tumors (PanNETs) and carcinomas (PanNECs) were included. For the comparative meta-analysis of manual counting vs. DIA, we considered all manuscripts reporting an analytical comparison between these two modalities used in the assessment of Ki-67. In the case of duplicate cohorts, the largest and then most recent was selected. Exclusion criteria included no definitive histological diagnosis of PanNEN, and in vitro or animal studies.

Data sources and literature search strategy

Two investigators (CL, PA) independently searched PubMed, Embase and SCOPUS databases up until 30/06/2021. The search strategy included combinations of the following keywords: #1 “digital”[Title/Abstract] AND “pathology”[Title/Abstract]; #2 “image”[Title/Abstract] AND “analysis”[Title/Abstract]; #3 “artificial intelligence”[Title/Abstract] OR “AI”[Title/Abstract] OR “machine learning”[Title/Abstract] OR “deep learning”[Title/Abstract] OR “automated”[Title/Abstract] OR “semiautomated”[Title/Abstract] OR “algorithm*”[Title/Abstract] OR “neural network”[Title/Abstract] OR “computer-aid”[Title/Abstract] OR “computer-aided”[Title/Abstract] OR “image analysis”[Title/Abstract] OR “digital pathology”[Title/Abstract] OR “WSI”[Title/Abstract] OR “whole slide”[Title/Abstract] OR “digital”[Title/Abstract]; #4 #1 OR #2 OR #3; #5 “neuroendocrine”[Title/Abstract] OR “carcinoid”[Title/Abstract] OR “medullary”[Title/Abstract]; #6 #4 AND #5; #7 “artificial intelligence”[MeSH Terms]; #8 “Neuroendocrine Tumors”[MeSH Terms] OR “carcinoma, neuroendocrine”[MeSH Terms] OR “Gastro-enteropancreatic neuroendocrine tumor” [Supplementary Concept] OR “Carcinoid Tumor”[MeSH Terms]; #9 #7 AND #8; #10 #6 OR #9.

Study selection and data extraction

Following the aforementioned search strategy, duplicates were removed and two reviewers (CL, PA) then independently screened the titles and abstracts of all potentially eligible articles. These two authors applied the eligibility criteria and reviewed the full texts of included studies. A final list of articles was subsequently established for both the systematic review and comparative meta-analysis by consensus with a third independent author (AE). Two authors were involved in extracting data into a preset Excel database: one (CL) extracted data from the selected articles; the other (AS) independently validated the extracted data. For each article, we extracted the following information: authors; year of publication; country of origin; number of cases; patient demographics; type of analyzed material; tumor grading; and methods for manual counting and DIA. For the comparative meta-analysis, the primary outcome was the coefficient of agreement between manual counting and DIA in the assessment of Ki-67 in PanNENs.

Data synthesis, quality, and publication bias assessment

The comparative meta-analysis was conducted using Comprehensive Meta-Analysis v2 software (Biostat; Englewood, NJ, USA). Furthermore, the Newcastle–Ottawa Scale (NOS) was used to assess study quality, following existing guidelines14,15. Finally, we investigated publication bias by visual inspection of funnel plots and with the Egger bias test16.
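For readers who wish to reproduce this type of pooling outside dedicated software, the following is a minimal sketch of one common approach: correlations are Fisher z-transformed and combined with a DerSimonian-Laird random-effects model. This is an assumption about a reasonable workflow, not a description of the internal routines of the software used here, and the function name `pool_correlations` and the input values are hypothetical (they are not data from the included studies).

```python
import math


def pool_correlations(rs, ns, z_crit=1.96):
    """Pool correlation coefficients with a DerSimonian-Laird random-effects model.

    rs: per-study correlation coefficients; ns: per-study sample sizes (each > 3).
    Returns the pooled r, its confidence interval, I^2 (%) and tau^2.
    """
    # Fisher z-transform; the sampling variance of z is approximately 1 / (n - 3).
    zs = [math.atanh(r) for r in rs]
    variances = [1.0 / (n - 3) for n in ns]
    weights = [1.0 / v for v in variances]

    # Fixed-effect mean and Cochran's Q, needed to estimate heterogeneity.
    z_fixed = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    q = sum(w * (z - z_fixed) ** 2 for w, z in zip(weights, zs))
    df = len(rs) - 1

    # DerSimonian-Laird between-study variance (tau^2) and I^2.
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # Random-effects weights, pooled z, and back-transformation to r.
    re_weights = [1.0 / (v + tau2) for v in variances]
    z_pooled = sum(w * z for w, z in zip(re_weights, zs)) / sum(re_weights)
    se = math.sqrt(1.0 / sum(re_weights))
    return {
        "r": math.tanh(z_pooled),
        "ci": (math.tanh(z_pooled - z_crit * se), math.tanh(z_pooled + z_crit * se)),
        "i2": i2,
        "tau2": tau2,
    }


# Hypothetical inputs for illustration only.
example = pool_correlations(rs=[0.90, 0.95, 0.92, 0.97], ns=[40, 60, 50, 88])
```

Working on the z scale keeps the confidence interval inside the valid range of a correlation after back-transformation, which is why the interval around a high pooled r is typically asymmetric.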

Results

Search results

The search yielded a total of 4286 potentially eligible studies. Following in-depth screening based on title/abstract, only 56 (1.3%) of these studies were retrieved for further analysis. Of these, 22 were considered eligible for the systematic review17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38, and 4 for the correlation meta-analysis (Supplementary Fig. 1)25,27,28,34.

Study and patient characteristics

The most important features from the extracted data are summarized in Table 1. Overall, the selected studies reported data on a total of 752 PanNENs. The majority of the investigated cohorts (59.1%) were from the USA, with the remainder composed of European patients (27.3%) and mixed cohorts including Asian patients (13.6%). There was an almost equal distribution of male (50.5%) and female (49.5%) patients. Regarding tumor grading (G), the majority of cases were G1 (55.3%), followed by G2 (40.6%) and G3 (4.1%). The type of specimen material analyzed varied, including surgical resection specimens, biopsies, and cytology cell blocks. The majority of the studies (54.5%) did not report specific data on the type of specimens analyzed. The reported procedures used for manual counting and the specific DIA technologies adopted in the assessment of Ki-67 in PanNENs are summarized in Table 1.

Table 1 Summary of studies about AI-based systems used for Ki-67 assessment in PanNENs.

Advantages and limitations of DIA-based systems in the assessment of Ki-67

The key advantages and limitations of DIA-based systems in the assessment of Ki-67 in PanNENs are summarized in Table 2. The most commonly described advantages of DIA were improved reproducibility and reliability, as well as reduced time required for Ki-67 assessment. The most common limitations of DIA were the counting of non-neoplastic (“contaminant”) cells (e.g., lymphocytes), the higher cost compared with manual counting, and the potential delay in diagnosis, which depended on certain procedures or on technician availability.

Table 2 Summary of reported advantages and limitations when utilizing DIA systems to assess Ki-67 for PanNENs.

Comparative meta-analysis, quality and publication bias assessment

Overall, 4 studies including 238 patients with PanNEN were selected for the comparative meta-analysis25,27,28,34. The pooled correlation estimate was 0.94 (95% CI: 0.83–0.98; I² = 24.15%), indicating a high correlation between manual (reference value) and digital count. The heterogeneity was low (i.e., I² < 50%), reinforcing the reliability of these results. The quality of the studies did not indicate a risk of bias (mean Newcastle–Ottawa Scale score: 8). Furthermore, no publication bias emerged (Egger’s test = 1.42; p = 0.90). The fail-safe number was 660, a value that, based on existing guidelines15,16, supports the robustness and statistical significance of our results.
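To make the fail-safe number more concrete, the sketch below implements the classic Rosenthal formulation, in which the fail-safe N is the number of additional null-result studies that would bring the combined (Stouffer) Z down to the chosen significance threshold. Whether this exact variant and threshold underlie the value reported here is an assumption; the function name `fail_safe_n` and the inputs are hypothetical.

```python
def fail_safe_n(z_scores, z_alpha=1.96):
    """Rosenthal-style fail-safe N.

    z_scores: per-study standard normal deviates (Z) for the effect.
    z_alpha: critical Z for the chosen alpha (1.96 for two-tailed 0.05;
             1.645 is the one-tailed value used in Rosenthal's original paper).
    Solves sum(Z) / sqrt(k + N) = z_alpha for N.
    """
    k = len(z_scores)
    n_fs = (sum(z_scores) ** 2) / (z_alpha ** 2) - k
    # A negative value means the combined result is already non-significant.
    return max(0.0, n_fs)


# Hypothetical example: four studies, each with Z = 2.0, threshold Z = 2.0.
print(fail_safe_n([2.0, 2.0, 2.0, 2.0], z_alpha=2.0))
```

A large fail-safe N relative to the number of included studies suggests that a substantial body of unpublished null findings would be required to overturn the pooled result.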

Discussion

The Ki-67 proliferative index is critical in the pathologic assessment of PanNEN, and has important clinical implications for patient management. The adoption of international recommendations released by the WHO classification of tumors and the European Neuroendocrine Tumor Society (ENETS) for assessing Ki-67 has improved the standardization of methodologies for this task9,39. However, given the persistence of interlaboratory and interobserver discrepancies, as well as potential inconsistencies with different scoring systems, accurately grading PanNENs remains a challenge for pathologists, especially in the G1-G2 and G2-G3 transition areas for PanNETs. Multiple factors affect the interpretation of the Ki-67 index, including the use of different antibody clones and staining protocols, tissue section thickness, tumor cell density, and difficulty distinguishing tumor from non-tumor cells. According to Adsay, “to count or not to count is not the question, but rather how to count”40. Manually counting camera-captured or printed images is generally favored over eyeballing. More recently, DIA has proven to be an acceptable method for Ki-67 assessment. In this study, we reviewed all published investigations that employed DIA for Ki-67 determination in PanNENs, highlighting some of the advantages and limitations of utilizing this technology. Furthermore, by comparing the coefficient of correlation between manual counting and DIA by means of a comparative meta-analysis, we demonstrated a high level of consistency (0.94, 95% CI: 0.83–0.98) between these two approaches.

The advantages derived from utilizing DIA include more reproducible results, higher accuracy, and reduced time to evaluate Ki-67 in PanNENs1,6,7,8. Current guidelines for assessing Ki-67 recommend manual counting from a printed image that includes at least 500 neoplastic cells from tumor hotspots. While still time consuming, this manual method does promote standardization that helps reduce interobserver variability24. However, for grade transitions between G1 and G2 (Ki-67 of 3%) and between G2 and G3 (Ki-67 of 20%), there were still discrepancies with manual counting from a printed image. The use of DIA for Ki-67 determination resulted in greater consistency in grading of all PanNEN cases, particularly for those cases belonging to the aforementioned gray transition areas G1-G2 and G2-G3. However, it should be noted that even when using DIA one can obtain different results depending on the selection of hotspots and on the number of cells counted. Access to DIA allows rapid counting of more cells, and that alone can push a tumor from G2 to G1 or G3 to G2, whereas counting fewer cells in the same hotspot can achieve the reverse41.
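The arithmetic behind these transition-zone discrepancies can be illustrated with a short sketch. It assumes the cut-offs discussed above for well-differentiated tumors (G1 below 3%, G2 from 3% to 20%, G3 above 20%) and uses a Wilson score interval to show how much sampling uncertainty remains around a 500-cell count. The function names and the cell counts are hypothetical and serve only as an illustration.

```python
import math


def ki67_index(positive, total):
    """Ki-67 labeling index as a percentage of counted neoplastic cells."""
    if total <= 0:
        raise ValueError("total must be a positive cell count")
    return 100.0 * positive / total


def grade(index_pct):
    """Assign a grade using the assumed 3% and 20% cut-offs."""
    if index_pct < 3.0:
        return "G1"
    if index_pct <= 20.0:
        return "G2"
    return "G3"


def wilson_interval(positive, total, z=1.96):
    """Wilson score interval (in %) for the proportion of positive cells."""
    p = positive / total
    denom = 1.0 + z * z / total
    centre = (p + z * z / (2.0 * total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / total + z * z / (4.0 * total * total)) / denom
    return 100.0 * (centre - half), 100.0 * (centre + half)


# Hypothetical hotspot: 15 positive cells among 500 gives 3.0%, i.e. G2.
print(grade(ki67_index(15, 500)))
# Extending the count to 1000 cells with 25 positives gives 2.5%, i.e. G1.
print(grade(ki67_index(25, 1000)))
# A 14/500 count (2.8%, G1) has an interval that straddles the 3% cut-off.
print(wilson_interval(14, 500))
```

Under these assumptions, a single additional positive cell or a larger denominator is enough to change the grade near a cut-off, which is consistent with the dependence on the number of cells counted described above.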

DIA assistance with grading PanNEN is of particular benefit in fine needle aspiration (FNA) cytology samples. Guidelines established using histological samples have been extrapolated to grading PanNENs in cytology material (e.g., cell blocks) procured by FNA. Several studies have demonstrated that Ki-67 assessment by manual counting of tumor cells in cell blocks can result in under-grading of these neoplasms when compared with matched surgical resection specimens36,42, with discrepancies more often observed in G2 cases20,29. Intriguingly, Abi-Raad and colleagues demonstrated that counting hotspots instead of the complete cell block can provide a higher concordance with surgical specimens, especially for FNA samples that contain ≥ 1000 cells43. A different perspective was provided by Satturwar and colleagues who investigated the potential role of augmented reality microscopy (ARM) for Ki-67 assessment in cytology specimens36. ARM, which is basically a modified microscope associated with an attached computer unit, enables real-time image analysis using a traditional light microscope and glass slides, without the need to first photograph or digitize slides36,44,45. If coupled with image analysis software, ARM allows quantifying immunohistochemical stains including Ki-67, and can also be combined with elaborate AI-based algorithms to perform more complex tasks44,45,46. Like other DIA methods, ARM has significant potential for improving PanNEN grading on cell block material36.

Currently, DIA for Ki-67 assessment has some limitations that may need to be addressed if counting in practice is to be performed with this approach. The most commonly reported drawback is the risk of counting dividing non-neoplastic “contaminating” cells (e.g., endothelial cells, lymphocytes), thereby erroneously increasing the overall tumor grade. Other brown-pigmented signals (hemosiderin and hematoidin) can also cause this over-counting phenomenon. Such issues are amplified in NEC, especially due to the effect of artefacts (e.g., smeared chromatin material, nuclear molding in small cell NEC) on DIA. However, these problems can be overcome by having pathologists directly annotate regions of interest to be scored, with the intent of excluding contaminating cells from being counted. Further studies that specifically address these challenges in PanNEC are needed. This issue becomes particularly important if non-pathologist personnel such as trainees and technicians are used as key operators. Of note, more sophisticated AI algorithms are being developed that count only neoplastic cells47,48,49 and are more operator-independent. One potential solution that has also been employed is to utilize double-stained slides (e.g., Ki-67 and synaptophysin) with deep learning algorithms to improve the accuracy of Ki-67 index quantification50,51,52,53. More recently, some investigators have shown that they are able to predict Ki-67 positive cells directly from H&E images using AI-based methods51. Another important pending issue that needs to be addressed for improving Ki-67 assessment in PanNENs is the standardization of the size and number of hotspots to be evaluated41,54. Hotspots are defined as tumor areas with the highest Ki-67 nuclear staining. It has been shown that the greater the hotspot size, the lower the Ki-67 count, highlighting the importance of standardizing this parameter for reliable evaluation34,41. Furthermore, not only is the size of the hotspot difficult to define, but so is the shape55. Most pathologists and algorithms define a hotspot as a circular shape; however, there is no biological evidence to support this notion. Another important factor to consider is the number of hotspots when determining the Ki-67 index. It is also important to train operators not to select a geographic region that leads to hyper-selection of positive cells within a given hotspot, which erroneously inflates the “percentage” positivity. However, all of the above shortcomings are relatively easy to address with proper training and application of improved AI software.
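The inverse relationship between hotspot size and Ki-67 index can be illustrated with a simplified sketch. It assumes that a slide region has been divided into a grid of equal tiles with known positive and total neoplastic cell counts, and that a hotspot is the square window of tiles with the highest index. The square shape is chosen only for simplicity (as noted above, no shape has a biological justification), and the function name `hotspot_index` and the tile counts are hypothetical.

```python
def hotspot_index(pos, tot, size):
    """Highest Ki-67 index (%) over all size x size windows of tiles.

    pos, tot: 2D lists of positive and total neoplastic cell counts per tile.
    size: window edge length, in tiles.
    """
    rows, cols = len(tot), len(tot[0])
    best = 0.0
    for r in range(rows - size + 1):
        for c in range(cols - size + 1):
            window = [(i, j) for i in range(r, r + size) for j in range(c, c + size)]
            p = sum(pos[i][j] for i, j in window)
            t = sum(tot[i][j] for i, j in window)
            if t > 0:
                best = max(best, 100.0 * p / t)
    return best


# Hypothetical 4 x 4 grid: 100 neoplastic cells per tile, one proliferative focus.
positive = [
    [1, 2, 1, 0],
    [2, 12, 6, 1],
    [1, 5, 4, 2],
    [0, 1, 2, 1],
]
total = [[100] * 4 for _ in range(4)]

for size in (1, 2, 4):
    print(size, hotspot_index(positive, total, size))
```

With these illustrative counts, the index falls from 12% for a single-tile hotspot to 6.75% for a 2 x 2 window and to about 2.6% for the whole region, so the same tumor would cross the 3% cut-off depending solely on the hotspot size chosen.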

Despite the clear advantages of DIA for determination of the Ki-67 labeling index, scoring with this digital modality is still subject to the fundamental limitation that applies to any cut-off in a continuous variable: it can be changed arbitrarily, as it was for PanNETs in 2017, when the Grade 1 cut-off was moved from 2% to 3% (ref. 9). Moreover, any cut-off of a continuous variable can be shown to have value, but the actual grading is inherently arbitrary41. Indeed, how best to employ Ki-67 as a reliable prognosticator of PanNETs has been a study in progress. For example, in 2017 the cut-off was clarified such that cases with an index less than 3.0 (including 2.99), whose grade was previously unclear, are now clearly included in Grade 1 (ref. 9). Naturally, as in any grading and staging system that assesses a continuous variable, the Ki-67 index-based system is imperfect54. For example, it can be expected that cases with 2.99 (now in G1) and 3.0 (now in G2) will be similar in biological behavior. Nevertheless, DIA will help standardize the process, not only offering more reproducible results in daily clinical practice, but also allowing for better comparison between studies that aim to fine-tune this grading system. For example, there have been proposals to move the G1/G2 cut-off to 5%, but it is difficult to verify the results of these proposals due to variation in counting methods. Fundamentally, the reality is that even with more accurate analysis provided by DIA, a G2 tumor with a Ki-67 of 4% will still be more likely to behave in an indolent fashion than a G2 tumor with a Ki-67 of 19%. Thus, the issue of a continuous variable, which is a complex concept involving statistical and biological sciences56, heightens the need for accurate Ki-67 quantification, which may ultimately be more important than the actual grade score. Finally, a significant limitation of DIA for widespread adoption has been the accessibility of this technology due to cost and maintenance. However, as whole slide scanners and digital cameras (and related software) become more widely available, adapting facilities to perform DIA for Ki-67 counting is becoming increasingly feasible57. Another issue to be considered is the need to better integrate Ki-67 counting by DIA into routine workflow24,58.

In this review, we chose to focus on PanNENs. However, the topic of manual vs. digital pathology scoring of Ki-67 is certainly also of importance for NENs at many other anatomic sites54, as well as for other neoplasms in which DIA-based systems are being leveraged to assess biomarkers. In 2015, Joseph et al., studying a cohort of 48 lung carcinoids, demonstrated an overall similarity of manual counting vs. DIA, although Ki-67 estimation resulted in slightly higher values than manual counting59. Of note, a more recent analysis by Swarts et al., comparing the use of manual analysis vs. DIA (in-house Leica Qwin program) in a cohort of 201 lung tumors, described a substantial equivalence of both methodologies60. It is also worth noting that Ki-67 assessment may be of importance in other tumor types. For example, in 2020 Hida et al. compared the use of manual analysis vs. DIA (Visiopharm software) for proliferative index evaluation in a total of 413 cases of breast cancer, showing a high correlation (coefficient of correlation = 0.82, p < 0.0001) between both methods61. Alataki et al. corroborated these findings, demonstrating a high correlation in Ki-67 assessment between manual counting and DIA in both surgical breast resections and biopsies62.

An important question is whether the choice between manual and DIA-based Ki-67 assessment influences clinical management and prognostication. Among all selected manuscripts, only four provided data on this specific topic20,23,25,30. Goodell et al. demonstrated significant reliability in predicting nodal and distant metastasis of PanNETs with the Ventana Image Analysis System (VIAS), with the highest specificity (94% in their cohort) demonstrated when analyzing 10 consecutive and randomly selected fields20. Similarly, van Velthuysen et al., investigating the performance of manual vs. digital (ImageJ) Ki-67 scoring in a cohort of 73 PanNENs, showed that tumor grading correlated with survival irrespective of the way Ki-67 was assessed23. Similar results were reported by Kroneman et al.25 and Conemans et al.30, showing substantial similarities in terms of prognostication between manual and DIA scoring of Ki-67. The evidence on this fundamental topic is therefore limited to these four studies, all of which were conducted prior to the introduction of the 2017 grading system. Thus, further studies on larger cohorts and based on current grading methods are needed. We advocate that DIA-based systems could provide a more standardized method, offering a more reliable basis for prognostic stratification.

In summary, this systematic review and comparative meta-analysis demonstrates that the advantages outweigh the limitations of using DIA to assess Ki-67 in PanNENs. We advocate that the next logical step for more broadly adopting DIA in pathology practice would be to further explore the relationship between hotspot parameters (number, size, and shape) and the Ki-67 index with patient outcome. Currently, most studies use manual counting from captured images as the gold standard; however, the ultimate validation will naturally come from prognostic correlation. Based upon current evidence provided by peer-reviewed literature, DIA appears to offer pathologists higher reliability and reproducibility than manual counting for grading PanNENs. The overall findings of this review, therefore, support widespread adoption of carefully optimized and validated DIA-based methods for this important diagnostic task in clinical practice. Lessons learned from the application of DIA to the PanNEN model can also be extrapolated to different tumors in other organ systems, such as breast carcinoma in which Ki-67 quantification is increasingly becoming a key driver for patient management.