Introduction

Colonoscopy can effectively detect colorectal polyps of various histological subtypes, including hyperplastic, adenomatous, and malignant polyps. Premalignant polyps, such as adenomas, should be resected endoscopically to prevent their progression to colorectal cancer (CRC)1. Early CRCs with favorable histological features and invasion limited to the mucosa or the superficial submucosa (less than 1,000 μm from the muscularis mucosa) can also be managed by endoscopic resection2. However, endoscopic resection is not recommended for early CRCs with unfavorable histological features, such as massive submucosal invasion deeper than 1,000 μm, and surgery should be performed for these tumors3,4. Diminutive hyperplastic polyps can be left in situ without resection because they have no malignant potential, an approach known as the diagnose-and-leave strategy5. Owing to the wide variety of management strategies, the accurate assessment of colorectal polyp histopathology is of crucial importance.

Image-enhanced endoscopy using narrow-band imaging (NBI), blue light imaging, and i-Scan has enabled clear visualization of the microvascular architecture and surface structure of colorectal polyps6,7,8. Systematic classification systems have been developed to predict the histopathology of colorectal polyps based on NBI findings, which include the NBI International Colorectal Endoscopic classification and Japan NBI Expert Team classification9,10. Although these classification systems showed good accuracy in the prediction of colorectal polyp histopathology9, their performance is highly endoscopist-dependent, and the performance of optical diagnosis in nonacademic centers has been disappointing11. Therefore, a significant learning curve with repetitive training on NBI images is needed to achieve high-confidence diagnostic performance12,13.

Recently, artificial intelligence (AI) has been introduced in an attempt to revolutionize the field of endoscopy. Multi-layered rapid image analysis and feature extraction performed by machine learning, a subset of AI, have been applied in the endoscopic recognition and assessment of colorectal polyps14,15,16. Initial experiences have enabled endoscopists to expect more detailed applications of AI and the innovative shifts in endoscopy practices by overcoming several limitations, including the inter-observer and intra-observer variability of completely endoscopist-dependent practices.

We herein present the development and validation of a computer-aided diagnostic system (CAD) for predicting colorectal polyp histology using AI-based deep learning. We aimed to investigate the possibility of a CAD application in the formation of treatment plans for colorectal polyps.

Results

Baseline characteristics of colorectal polyps

The baseline characteristics of 1169 colorectal polyps are presented in Table 1. The median size was 10 mm (range, 2–100 mm). Gross morphology of the Is type was the most common. BA was the most common histological diagnosis (705/1169, 60.3%), while DSMC was the least common (91/1169, 7.8%).

Table 1 Baseline characteristics of the included colorectal polyps.

Diagnostic performance of the CAD and comparison with endoscopists

The schematic view of the training strategy for the CAD is presented in Fig. 1 and detailed in the Methods section. Among the 182 NBI images of colorectal polyps in test dataset I, the CAD correctly classified 148 images (81.3%). The CAD correctly classified 32 (82.1%) of 39 serrated polyps (SPs), 106 (84.1%) of 126 benign conventional adenoma (BA)/mucosal or superficial submucosal cancer (MSMC) polyps, and 10 (58.8%) of 17 deep submucosal cancer (DSMC) polyps. The overall Cohen’s kappa value for the CAD was 0.614 (95% CI, 0.488–0.730), implying substantial agreement between the actual and predicted histological diagnoses. The Cohen’s kappa value for the trainee endoscopists was 0.368 (95% CI, 0.281–0.459) and that of expert endoscopists was 0.649 (95% CI, 0.564–0.725). Thus, the CAD diagnostic performance was better overall than that of the trainees and comparable to that of expert endoscopists. Detailed kappa values according to polyp size, location, and morphology are presented in Table 2 and show a similar tendency.

Figure 1

A schematic of the training strategy of the computer-aided diagnostic system (CAD) using a 50-layered convolutional neural network and image patches. SP, serrated polyp; BA, benign conventional adenoma; MSMC, mucosal or superficial submucosal cancer; DSMC, deep submucosal cancer.

Table 2 Cohen’s kappa value measuring the agreement between true and predicted histopathological diagnoses in test sets I and II.

In test dataset II analyzing the diagnostic performance of the CAD in prospectively acquired real-time NBI images of 363 colorectal polyps, the Cohen’s kappa value for the CAD was also significantly higher than that of trainee endoscopists (0.642 vs. 0.401), while it was comparable or slightly inferior to that of expert endoscopists (0.642 vs. 0.735). Detailed findings are presented in Table 2.

Diagnostic performance of the CAD in each histological group

The diagnostic performances of the CAD and endoscopists according to the three histological groups are presented in Table 3. In test dataset I, the overall diagnostic accuracy of the CAD was 81.3% and that of the expert endoscopists was 82.4%, with no statistically significant difference. However, the CAD showed significantly better overall diagnostic accuracy than the trainee endoscopists (81.3% vs. 71.8%, P = 0.005) (Table 3). Other performance indicators, including sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), were similar between the CAD and the experts in the three histological groups, whereas the performance parameters of the CAD were superior to those of trainees (Table 3).

Table 3 Diagnostic performance of the CAD in each histological group in comparison with the diagnostic performance of endoscopists.

In test dataset II, the overall diagnostic accuracy of the CAD was significantly higher than that of trainee endoscopists (82.4% vs. 63.8%, P < 0.001). The CAD showed inferior overall diagnostic accuracy to that of expert endoscopists (82.4% vs. 87.3%, P = 0.005). The analytical results of other performance indicators in test dataset II are presented in Table 3.

Area under the receiver operating characteristic (ROC) curves (AUC) for the CAD showed good-to-excellent diagnostic performances at 0.93, 0.86, and 0.91 for the SP, BA/MSMC, and DSMC groups, respectively, in test dataset I. ROC curves of the CAD in test dataset II showed similar findings (AUCs: 0.95, 0.89, and 0.89 in the SP, BA/MSMC, and DSMC groups, respectively) (Fig. 2). The diagnostic performance of the CAD as demonstrated by the ROC curves was comparable or slightly inferior to that of experts and clearly superior to that of trainees in both test datasets I and II (Fig. 2).

Figure 2

The receiver operating characteristic (ROC) curves evaluating the diagnostic performance of the computer-aided diagnostic system (CAD). The performance of the CAD was evaluated and compared with the performances of three expert endoscopists and three trainees using ROC curves. (A–C) The ROC curves for the CAD in the SP, BA/MSMC, and DSMC groups of test dataset I; (D–F) The ROC curves for the CAD in the SP, BA/MSMC, and DSMC groups of test dataset II. AUC, area under the ROC curve; SP, serrated polyp; BA, benign conventional adenoma; MSMC, mucosal or superficial submucosal cancer (cancer with invasion depth <1000 µm from the muscularis mucosa); DSMC, deep submucosal cancer (cancer with invasion depth ≥1000 µm from the muscularis mucosa).

The visualized class activation map images demonstrated that the CAD was able to correctly predict the histological diagnosis of colorectal polyps by perceiving the characteristic surface area of the appropriate, relevant histological group (Fig. 3)17.

Figure 3

The visualized class activation map images. The figures in small rectangles in each image show the probability of each class being predicted by the computer-aided diagnostic system (CAD). The red area represents the region that the CAD considers to be compatible with the particular histology with high probability. The blue area represents the region that CAD considers to have a low probability for the particular histology. SP, serrated polyp; BA, benign conventional adenoma; MSMC, mucosal or superficial submucosal cancer; DSMC, deep submucosal cancer.

Diagnostic performance of trainees assisted by the CAD

In test dataset I, the overall Cohen’s kappa value of the CAD+trainees was 0.665 (95% CI, 0.560–0.758), which was higher than that of the trainees (0.368, 95% CI 0.281–0.459) (Table 2). Moreover, the overall diagnostic accuracy of the CAD+trainees was significantly higher than that of the trainee endoscopists (84.2% vs. 71.8%, respectively; P < 0.001) (Table 3). The ROC curves also showed improvements in the diagnostic performance of the trainees with CAD assistance (Fig. 4).

Figure 4

Improvement of the diagnostic performance of trainees with the assistance of the computer-aided diagnostic system (CAD). All empty circles representing trainees’ performance moved to solid circles representing the performance of the CAD+trainees at the left upper side or near the yellow curved line; this suggests that the performance of the CAD+trainees was superior to that of trainees and comparable to that of the CAD (yellow curved line). (A–C) Improved diagnostic performance of the CAD+trainee in the SP, BA/MSMC, and DSMC groups of test dataset I; (D–F) Improved diagnostic performance of the CAD+trainee in the SP, BA/MSMC, and DSMC groups of test dataset II. SP, serrated polyp; BA, benign conventional adenoma; MSMC, mucosal or superficial submucosal cancer; DSMC, deep submucosal cancer.

Analysis of the diagnostic performance of the trainees assisted by the CAD in test dataset II showed findings similar to those in test dataset I (Tables 2 and 3, Fig. 4).

Diagnostic performance of the CAD according to the size, location, and morphology of colorectal polyps

The diagnostic performance of the CAD in test dataset I was generally good, with no definite, consistent differences according to polyp size (>10 mm vs. ≤10 mm), location (right vs. left colon), or morphology (laterally spreading tumor [LST] vs. Is; Table 4). The diagnostic performance of the CAD according to polyp size, location, and morphology in test dataset II showed tendencies similar to those in test dataset I (Supplementary Table 2).

Table 4 Diagnostic performance of the CAD according to the size, location, and morphology of the colorectal polyps in test set I.

Inference time of CAD

The average inference time for histological assessment by the CAD was 0.02 seconds with ResNet-50 and 0.04 seconds with DenseNet-201.

Discussion

In this study, the CAD based on NBI near-focus images showed a good diagnostic accuracy of >80%–90% irrespective of polyp size, location, and morphology. The area under the ROC curves for the CAD was 0.86–0.95, implying good-to-excellent predictability. The CAD showed a better performance compared to trainees and a slightly inferior or comparable performance to that of experts. In addition, CAD assistance significantly improved the diagnostic performance of trainees. These findings suggest that the AI CAD system helps inexperienced endoscopists to correctly predict the histopathology of colorectal polyps and gives expert endoscopists increased confidence in their histological assessments. Therefore, we believe that CAD assistance will help endoscopists more reliably determine the appropriate treatment plan for colorectal polyps.

Several recent studies have investigated the use of CAD to differentiate between neoplastic and non-neoplastic lesions; these studies are summarized in Supplementary Table 3. Our study is distinctive in several aspects. First, the majority of previous studies developed a CAD based on magnifying images or endocytoscopy images; these are not widely available in current clinical practice, particularly in nonacademic hospitals15,18,19,20,21,22. In contrast, we developed a CAD using NBI near-focus images that can be easily obtained in many centers, including primary care units. Second, previous CAD studies focused on discriminating only diminutive polyps14,15,19. However, appropriate treatment plans for both large and small colorectal polyps are essential to achieve successful screening and surveillance colonoscopy. Therefore, we enrolled all colorectal polyps regardless of their size and trained the CAD to classify the polyps into three histological groups with different treatment plans. The BA/MSMC group is endoscopically resectable, while the DSMC group is endoscopically unresectable and requires surgery. Since differentiating between hyperplastic and sessile serrated polyps is difficult even through histological evaluation by pathologists23,24, the treatment plan for SP is currently determined clinically based on the size and location of the SP; endoscopic resection is recommended for SP > 5 mm at any location and SP of any size at a location proximal to the sigmoid colon. Accordingly, real-time differentiation between the two types is not mandatory in current practice. Therefore, we suggest that our three histological groups adequately represent the treatment planning of most colorectal polyps, and our study showed that the CAD is a potentially good modality to aid in the differentiation of these three histological groups.

Interestingly, the diagnostic performance of trainees improved significantly with CAD assistance through our heuristic algorithm. We suggest that this type of assessment considering both the endoscopist’s confidence level and the CAD probability may be a good way to implement AI CAD into real clinical practices. If an AI device shows a clearly superior performance, it could entirely replace human clinicians. However, consensus on the definition of “clear superiority” is difficult to achieve. Therefore, approaches combining both human and AI-suggested decisions can be a practical solution, and we believe our algorithm shows such an example of colonoscopy practices using a CAD.

The class activation map images showed that the CAD correctly perceived the representative surface area of colorectal polyps (Fig. 3). In general, a deep-learning model is regarded as a black box because it is a data-driven method whose inference does not follow well-defined scientific laws. However, if the inference by the model cannot be interpreted, it would not be useful in clinical practice. As part of the solution to this problem, a class activation map can be extracted to show the basis of the model's inference. As shown in Fig. 3, the class activation map indicated that the model inferred histopathological diagnoses by correctly perceiving the characteristic polyp surface, similarly to the endoscopists.

One disappointing aspect of our study was the relatively unsatisfactory performance of the CAD in the DSMC group. The CAD correctly classified only 10 (58.8%) of 17 DSMC polyps in test set I and 18 (62.1%) of 29 DSMC polyps in test set II. This may be partly related to the small number of DSMC cases. Another possibility is that only a single NBI near-focus image in test set I and 1–5 NBI near-focus images in test set II may not represent all the features of the entire tumor, particularly for DSMC. A larger DSMC dataset and diagnosis based on multiple images for each DSMC may be warranted in future studies.

This study has several limitations. First, our CAD predicted colorectal polyp histology based on still images; an AI CAD system that assesses motion images should be developed and validated. Nonetheless, because the images used for the CAD training in this study were unmodified from the original images, we are optimistic about the application of motion images to the CAD. In addition, we assessed the performance of the CAD in two separate test datasets. Of those, in test dataset II, real-time assessment of histological diagnoses was performed after the instant transmission of still images to the CAD. We believe this experiment showed the feasibility of real-time assessment even with still images in clinical practice. Second, all NBI near-focus pictures in this study were taken by expert endoscopists. Inexperienced endoscopists may not pinpoint the representative region of a given colorectal polyp, thereby decreasing the performance of the CAD. Despite these limitations, our study is significant in that we demonstrated the possibility of using an AI CAD as a real-time histological diagnostic tool for not only small but also large colorectal polyps, including early cancers. The short inference time of only 0.02–0.04 seconds by our CAD is another important factor in the clinical utility of this system, because rapid diagnosis is mandatory in daily practice.

In conclusion, a CAD developed using a deep-learning model predicted colorectal polyp histology from NBI images with high accuracy. The diagnostic performance of the CAD was comparable to that of expert endoscopists and better than that of trainees. Real-time histological assessment of colorectal polyps by the CAD may enhance endoscopists’ decision-making and confidence in the selection of appropriate treatment plans.

Methods

Patients and data collection

We collected NBI near-focus images of endoscopically resected colorectal polyps at Asan Medical Center between 2014 and 2018. All endoscopic images, including both white light and NBI, were taken using CF-H290 colonoscopes (Olympus Co, Tokyo, Japan). The exclusion criteria were as follows: (1) colorectal polyps without NBI near-focus images, (2) those with dirty mucus and/or feces on their surface, (3) out-of-focus images, and (4) images with evident motion blurring. First, we retrospectively collected 806 NBI near-focus images of 806 polyps, with one image per polyp, in 646 patients. Among these, 624 images were used as the training dataset and 182 were used as test dataset I. Second, we prospectively collected NBI near-focus images of colorectal polyps as test dataset II for a separate real-time performance test of the developed CAD system. The same exclusion criteria as those described above for the training dataset and test dataset I were applied. Test dataset II included 546 near-focus images of 363 colorectal polyps, with 1–5 images per polyp, in 305 patients. In total, 1352 NBI near-focus images of 1169 colorectal polyps in 951 patients were collected.

This study was approved by the institutional review board (IRB) of Asan Medical Center (2017–1357). Due to the retrospective study design, written informed consent was not obtained from participants. The IRB of our institution waived the need for informed consent based on the non-invasive and anonymized nature of this study. This study was conducted in accordance with institutional ethical guidelines and the Declaration of Helsinki.

Histopathological classification of colorectal polyps

The histopathology of all colorectal polyps was evaluated by board-certified gastrointestinal pathologists. The polyps were classified into three histological groups: (1) SP, (2) BA/MSMC, and (3) DSMC. The SP group encompassed hyperplastic and sessile serrated polyps. Superficial submucosal cancer was defined as cancer with an invasion depth <1000 µm from the muscularis mucosa. DSMC was defined as cancer with an invasion depth ≥1000 µm from the muscularis mucosa.

Development of the CAD

A deep-learning model was used to develop a CAD. ResNet-50 (ref. 25) and DenseNet-201 (ref. 26), models with proven performance in the ImageNet Large-Scale Visual Recognition Competition (ILSVRC)27, were used as the deep-learning architectures and were trained with weak supervision from the histological diagnoses of the NBI near-focus images. ResNet-50 was initially adopted, and then the recently introduced DenseNet-201 was used to improve the performance of the CAD.

Among the retrospectively collected NBI near-focus images of 806 polyps, 624 were used as the training dataset and 182 were used as test dataset I. Because a model trained to predict from a single NBI near-focus image per tumor can easily overfit when the dataset is small, we employed a simple curriculum learning strategy. The tumor area of each collected NBI near-focus image was marked with a rectangle. Then, 20 half-size image patches containing the center point of the marked rectangle were extracted from each entire image. Through this process, 12480 image patches measuring 224 × 224 were extracted from 624 entire images of 448 × 448 size in the training set. To pre-train on image patches and fine-tune the model using entire images, two steps of training were performed as shown in Fig. 1. First, the model that was pre-trained on the ILSVRC dataset was trained using an augmented dataset of 12480 image patches as the training data. Thereafter, it was fine-tuned using 624 entire images. This curriculum learning strategy was intended to lead the model to a better local minimum.
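The patch-extraction step can be illustrated with a minimal sketch. The sampling scheme is an assumption made for illustration: the text states only that each 224 × 224 patch contains the center point of the marked rectangle, so the uniform random placement, the function names, and the example rectangle below are hypothetical and are not the authors' implementation.

```python
import random


def rectangle_center(rect):
    """Center pixel of a (left, top, right, bottom) rectangle."""
    left, top, right, bottom = rect
    return ((left + right) // 2, (top + bottom) // 2)


def sample_patch_boxes(center, image_size=448, patch_size=224,
                       n_patches=20, seed=0):
    """Return n_patches crop boxes (left, top, right, bottom).

    Every box lies inside the image and contains `center`.
    Uniform random placement is an assumption, not taken from the paper.
    """
    cx, cy = center
    rng = random.Random(seed)
    # Valid range for the left/top edge: the patch must cover the center
    # pixel and must not extend past the image border.
    lo_x, hi_x = max(0, cx - patch_size + 1), min(cx, image_size - patch_size)
    lo_y, hi_y = max(0, cy - patch_size + 1), min(cy, image_size - patch_size)
    boxes = []
    for _ in range(n_patches):
        left = rng.randint(lo_x, hi_x)
        top = rng.randint(lo_y, hi_y)
        boxes.append((left, top, left + patch_size, top + patch_size))
    return boxes


# Hypothetical tumor rectangle on a 448 x 448 image.
center = rectangle_center((250, 60, 350, 140))
boxes = sample_patch_boxes(center)
```

With 20 patches per image, 624 training images yield 624 × 20 = 12480 patches, which matches the count given above.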

There was an imbalance in the number of samples among the SP, BA/MSMC, and DSMC groups. Since this imbalance could have led to a paradoxical outcome, an oversampling strategy was employed to extract the same number of samples from each group per training epoch. Each sample was standardized and trained using common data augmentation techniques, such as adding Gaussian noise, rotating, zooming, and shifting.
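One common way to realize such oversampling is sketched below. The details are assumptions: the text does not state the per-class target or whether sampling was with replacement, so this sketch draws minority-class samples with replacement until every class matches the largest class. The function name and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def balanced_epoch_indices(labels, seed=0):
    """Return sample indices for one epoch with equal counts per class.

    Minority classes are topped up by sampling with replacement until
    they match the largest class (an assumed scheme).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    target = max(len(idxs) for idxs in by_class.values())
    epoch = []
    for idxs in by_class.values():
        epoch.extend(idxs)  # keep every original sample once
        epoch.extend(rng.choices(idxs, k=target - len(idxs)))  # top up
    rng.shuffle(epoch)
    return epoch


# Toy labels for illustration only; not the study's class counts.
labels = ["SP"] * 3 + ["BA/MSMC"] * 8 + ["DSMC"] * 2
epoch = balanced_epoch_indices(labels)
per_class = Counter(labels[i] for i in epoch)
```

In this toy case each class contributes 8 indices to the epoch, so the model sees the three groups equally often.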

All experiments were implemented in Keras with a TensorFlow backend; a stochastic gradient descent optimizer28 was used with a learning rate of 5e−5 and a decay rate of 5e−5.

Since the model was validated using the test set without a separate validation set, the reliability of the model’s performance was assessed by 5-fold cross-validation. Supplementary Table 1 shows the cross-validation results for ResNet-50 and DenseNet-201. The mean accuracies were 77.4% for ResNet-50 and 81.4% for DenseNet-201 (P = 0.08). Although there was no statistically significant difference, DenseNet-201 showed a numerically higher accuracy. Therefore, the final CAD system was developed using DenseNet-201.

Diagnostic performance of the CAD and comparison with endoscopists

The diagnostic performance of the CAD was tested twice separately. The first test was conducted with NBI near-focus images of 182 colorectal polyps in test dataset I. Diagnosis by the CAD was made based on the probabilities of the three histological groups. For example, if the CAD showed a 5% probability for SP, 10% probability for BA/MSMC, and 85% probability for DSMC, the diagnosis was finalized as DSMC, as it had the highest probability. Diagnostic accuracy, sensitivity, specificity, PPV, and NPV were investigated in the differential diagnosis among SP, BA/MSMC, and DSMC. Diagnostic performances were further assessed according to polyp size, morphology, and location. Polyp morphology was classified into Ip (pedunculated), Is (sessile), and LST. LSTs were further categorized into granular (LST-G) and non-granular (LST-NG) types. Tumors were also categorized by location as being in the left colon or the right colon (above the splenic flexure).
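The highest-probability rule and the per-group performance indicators can be written out as a short sketch. The one-vs-rest formulation (treating one histological group as positive and the other two as negative) is an assumption about how the per-group indicators were computed; the function names and the six-polyp toy data are hypothetical and are not study data.

```python
def cad_diagnosis(probabilities):
    """Return the histological group with the highest predicted probability."""
    return max(probabilities, key=probabilities.get)


def one_vs_rest_metrics(true, pred, positive):
    """Sensitivity, specificity, PPV, and NPV for one group versus the rest."""
    pairs = list(zip(true, pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# The example from the text: 5% SP, 10% BA/MSMC, 85% DSMC.
example = cad_diagnosis({"SP": 0.05, "BA/MSMC": 0.10, "DSMC": 0.85})

# Toy labels for illustration only.
true = ["SP", "SP", "BA/MSMC", "BA/MSMC", "BA/MSMC", "DSMC"]
pred = ["SP", "BA/MSMC", "BA/MSMC", "BA/MSMC", "DSMC", "DSMC"]
dsmc = one_vs_rest_metrics(true, pred, "DSMC")
accuracy = sum(t == p for t, p in zip(true, pred)) / len(true)
```

The sketch does not guard against empty denominators, which would need handling if a group were absent from the true or predicted labels.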

To compare the CAD performance with that of endoscopists, six endoscopists blinded to the histological diagnoses were asked to classify the same 182 polyps of test dataset I into three histological groups based on NBI near-focus images. Endoscopists provided their diagnosis with a confidence level (high vs. low). Three of the six endoscopists were board-certified expert colonoscopists who had performed approximately 2,500 NBI colonoscopies over ≥5 years. The other three were trainees who had performed approximately 100 NBI colonoscopies over <6 months.

We also evaluated the diagnostic performance of trainees assisted by the CAD (CAD+trainee). The final diagnosis by the CAD+trainee was made according to the following algorithm: (1) If the CAD and trainee made the same diagnosis, it was considered the final diagnosis of the CAD+trainee; (2) if the CAD and trainee diagnoses were different and the diagnostic probability by the CAD was ≥80%, the CAD diagnosis was considered the diagnosis of CAD+trainee; (3) if the CAD and trainee diagnoses were different, the diagnostic probability by the CAD was <80%, and the confidence level of the trainee diagnosis was high, the trainee’s diagnosis was considered the diagnosis of the CAD+trainee; (4) if the CAD and trainee diagnoses were different, the diagnostic probability by the CAD was <80%, and the confidence level of trainee diagnosis was low, the CAD diagnosis was considered the diagnosis of the CAD+trainee. A diagnostic probability of 80% is a heuristic parameter that can be changed depending on the model.
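The four rules above reduce to a small decision function, sketched here. The function and parameter names are hypothetical, and "diagnostic probability by the CAD" is taken to mean the probability the CAD assigns to its own top-ranked group, which is an interpretation of the text.

```python
def cad_plus_trainee(cad_probabilities, trainee_diagnosis,
                     trainee_high_confidence, threshold=0.80):
    """Combine the CAD output with a trainee's diagnosis.

    cad_probabilities: dict mapping each histological group to a probability.
    trainee_high_confidence: True for a high-confidence trainee diagnosis.
    threshold: heuristic CAD probability cutoff (80% in the text).
    """
    cad_diagnosis = max(cad_probabilities, key=cad_probabilities.get)
    # Rule 1: agreement settles the diagnosis.
    if cad_diagnosis == trainee_diagnosis:
        return cad_diagnosis
    # Rule 2: on disagreement, a confident CAD overrides the trainee.
    if cad_probabilities[cad_diagnosis] >= threshold:
        return cad_diagnosis
    # Rules 3 and 4: otherwise the trainee's confidence level decides.
    return trainee_diagnosis if trainee_high_confidence else cad_diagnosis


# CAD below the threshold, trainee disagrees with high confidence (rule 3).
probs = {"SP": 0.10, "BA/MSMC": 0.55, "DSMC": 0.35}
final = cad_plus_trainee(probs, "DSMC", trainee_high_confidence=True)
```

Under these rules the trainee's diagnosis prevails only when the CAD probability is below the threshold and the trainee is highly confident; in every other case the CAD diagnosis is the final one.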

The second test for the diagnostic performance of the CAD was performed with NBI near-focus images of 363 colorectal polyps in test dataset II. To test the real-time performance, the following steps were performed. First, 1–5 representative NBI near-focus still images of a colorectal polyp were acquired during colonoscopy. Second, the images were transmitted to the CAD on a laptop via the picture archiving and communication system (PACS) of our center. Finally, the CAD determined the histological diagnostic group in real time. The same three expert endoscopists and another three trainees assessed the histological diagnoses of the 363 colorectal polyps in test dataset II after all the NBI near-focus images were collected.

Statistical analysis

Continuous variables were expressed as medians with ranges and categorical variables as frequencies with percentages. We used the unpaired Student’s t-test or Mann–Whitney U-test to compare continuous variables and the chi-square or Fisher’s exact test to compare categorical variables. The agreement between the true and predicted histological diagnoses was evaluated using Cohen’s kappa coefficient. The average of the kappa values for each endoscopist was calculated, and the 95% CI was estimated by the percentile bootstrap method based on 1000 resamples. Sensitivity, specificity, accuracy, PPV, and NPV were compared using logistic regression with generalized estimating equations that accounted for the clustering of the same patient. An ROC curve was generated to evaluate the diagnostic performance in each histological group. P < 0.05 was considered statistically significant. All statistical analyses were performed using SPSS ver. 21.0 for Windows (IBM SPSS; IBM Co., NY, USA) and SAS (version 9.4; SAS Institute, Cary, NC, USA).
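For readers unfamiliar with the agreement statistic, the following is a minimal sketch of Cohen’s kappa with a percentile bootstrap CI for a single rater. It is not the SPSS or SAS procedure used in the study: it resamples polyps independently and ignores patient-level clustering and the averaging across endoscopists, and the function names and toy labels are hypothetical.

```python
import random
from collections import Counter


def cohen_kappa(true, pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(true)
    agree = sum(t == p for t, p in zip(true, pred))
    count_true, count_pred = Counter(true), Counter(pred)
    # Chance agreement scaled by n * n, kept as an integer.
    chance = sum(count_true[k] * count_pred[k] for k in count_true)
    if chance == n * n:
        # Degenerate case: both label sets use one identical class.
        return 1.0
    p_o = agree / n
    p_e = chance / (n * n)
    return (p_o - p_e) / (1 - p_e)


def bootstrap_kappa_ci(true, pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for kappa, resampling cases with replacement."""
    rng = random.Random(seed)
    n = len(true)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohen_kappa([true[i] for i in idx],
                                 [pred[i] for i in idx]))
    stats.sort()
    lower_index = int(n_resamples * alpha / 2)
    upper_index = n_resamples - lower_index - 1
    return stats[lower_index], stats[upper_index]


# Toy labels for illustration only.
true = ["SP", "SP", "BA", "BA", "BA", "BA", "DSMC", "DSMC", "SP", "BA"]
pred = ["SP", "BA", "BA", "BA", "BA", "DSMC", "DSMC", "BA", "SP", "BA"]
kappa = cohen_kappa(true, pred)
ci_low, ci_high = bootstrap_kappa_ci(true, pred)
```

A fixed seed keeps the interval reproducible; with clustered data such as several polyps per patient, resampling at the patient level would be the more appropriate variant.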