Endoscopic diagnosis and treatment planning for colorectal polyps using a deep-learning model

We aimed to develop a computer-aided diagnostic system (CAD) for predicting colorectal polyp histology using deep-learning technology and to validate its performance. Near-focus narrow-band imaging (NBI) pictures of colorectal polyps were retrieved from the database of our institution. Of these, 12480 image patches of 624 polyps were used as a training set to develop the CAD. The CAD performance was validated with two test datasets of 545 polyps. Polyps were classified into three histological groups: serrated polyp (SP), benign adenoma (BA)/mucosal or superficial submucosal cancer (MSMC), and deep submucosal cancer (DSMC). The overall kappa value measuring the agreement between the true polyp histology and the expected histology by the CAD was 0.614–0.642, which was higher than that of trainees (n = 6, endoscopists with experience of 100 NBI colonoscopies in <6 months; 0.368–0.401) and almost comparable with that of the experts (n = 3, endoscopists with experience of 2,500 NBI colonoscopies in ≥5 years) (0.649–0.735). The areas under the receiver operating curves for CAD were 0.93–0.95, 0.86–0.89, and 0.89–0.91 for SP, BA/MSMC, and DSMC, respectively. The overall diagnostic accuracy of the CAD was 81.3–82.4%, which was significantly higher than that of the trainees (63.8–71.8%, P < 0.01) and comparable with that of experts (82.4–87.3%). The kappa value and diagnostic accuracies of the trainees improved with CAD assistance: that is, the kappa value increased from 0.368 to 0.655, and the overall diagnostic accuracy increased from 63.8–71.8% to 82.7–84.2%. CAD using a deep-learning model can accurately assess polyp histology and may facilitate the diagnosis of colorectal polyps by endoscopists.


Results
Baseline characteristics of colorectal polyps. The baseline characteristics of 1169 colorectal polyps are presented in Table 1. The median size was 10 mm (range, 2-100 mm). Gross morphology of the Is type was the most common. BA was the most common histological diagnosis (705/1169, 60.3%), while DSMC was the least common (91/1169, 7.8%).
Diagnostic performance of the CAD and comparison with endoscopists. The schematic view of the training strategy for the CAD is presented in Fig. 1 and detailed in the Methods section. Among the 182 NBI images of colorectal polyps in test dataset I, the CAD correctly classified 148 images (81.3%). The CAD correctly classified 32 (82.1%) of 39 serrated polyps (SPs), 106 (84.1%) of 126 benign conventional adenoma (BA)/mucosal or superficial submucosal cancer (MSMC) polyps, and 10 (58.8%) of 17 deep submucosal cancer (DSMC) polyps. The overall Cohen's kappa value for the CAD was 0.614 (95% CI, 0.488-0.730), implying substantial agreement between the actual and predicted histological diagnoses. The Cohen's kappa value for the trainee endoscopists was 0.368 (95% CI, 0.281-0.459) and that of expert endoscopists was 0.649 (95% CI, 0.564-0.725). Thus, the CAD diagnostic performance was better overall than that of the trainees and comparable to that of expert endoscopists. Detailed kappa values according to polyp size, location, and morphology are presented in Table 2 and show a similar tendency.
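As a quick check on the arithmetic, the per-group counts reported above for test dataset I can be turned back into the quoted percentages. The following is a minimal sketch; the function and variable names are illustrative and are not taken from the study's code.

```python
# Per-group counts reported for test dataset I: (correctly classified, total).
counts = {
    "SP": (32, 39),
    "BA/MSMC": (106, 126),
    "DSMC": (10, 17),
}

def per_group_sensitivity(counts):
    """Fraction of polyps in each true histological group that was classified correctly."""
    return {group: correct / total for group, (correct, total) in counts.items()}

def overall_accuracy(counts):
    """Correct classifications summed over groups, divided by all polyps."""
    correct = sum(c for c, _ in counts.values())
    total = sum(n for _, n in counts.values())
    return correct / total

sens = per_group_sensitivity(counts)
acc = overall_accuracy(counts)
for group, value in sens.items():
    print(f"{group}: {value:.1%}")
print(f"overall: {acc:.1%}")
```

The overall figure is 148/182, which matches the reported 81.3%.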
In test dataset II analyzing the diagnostic performance of the CAD in prospectively acquired real-time NBI images of 363 colorectal polyps, the Cohen's kappa value for the CAD was also significantly higher than that of trainee endoscopists (0.642 vs. 0.401), while it was comparable or slightly inferior to that of expert endoscopists (0.642 vs. 0.735). Detailed findings are presented in Table 2.
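For readers unfamiliar with the agreement statistic used here, Cohen's kappa compares the observed agreement with the agreement expected by chance from the marginal totals. The sketch below computes it from a 3 × 3 confusion matrix. The matrix shown is a hypothetical toy example, not data from this study, because the full confusion matrices are not given in the text.

```python
def cohens_kappa(confusion):
    """Cohen's kappa for a square confusion matrix (rows: true class, columns: predicted)."""
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    # Observed agreement: proportion of cases on the diagonal.
    p_observed = sum(confusion[i][i] for i in range(k)) / n
    # Chance agreement: product of the row and column marginals, summed over classes.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    p_expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical toy matrix (NOT study data); classes ordered SP, BA/MSMC, DSMC.
toy_matrix = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
toy_kappa = cohens_kappa(toy_matrix)
```

In this toy case the observed agreement is 24/30 and the chance agreement is 1/3, so kappa is approximately 0.7.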
Diagnostic performance of the CAD in each histological group. The diagnostic performances of the CAD and endoscopists according to the three histological groups are presented in Table 3. In test dataset I, the overall diagnostic accuracy of the CAD was 81.3%, compared with 82.4% for expert endoscopists, with no statistically significant difference. However, the CAD showed significantly better overall diagnostic accuracy than the trainee endoscopists (81.3% vs. 71.8%, P = 0.005) (Table 3). Other performance indicators, including sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), were similar between the CAD and the experts in the three histological groups, whereas the performance parameters of the CAD were superior to those of trainees (Table 3). In test dataset II, the overall diagnostic accuracy of the CAD was significantly higher than that of trainee endoscopists (82.4% vs. 63.8%, P < 0.001). The CAD showed inferior overall diagnostic accuracy to that of expert endoscopists (82.4% vs. 87.3%, P = 0.005). The analytical results of other performance indicators in test dataset II are presented in Table 3.
Area under the receiver operating characteristic (ROC) curves (AUC) for the CAD showed good-to-excellent diagnostic performances at 0.93, 0.86, and 0.91 for the SP, BA/MSMC, and DSMC groups, respectively, in test dataset I. ROC curves of the CAD in test dataset II showed similar findings (AUCs: 0.95, 0.89, and 0.89 in the SP, BA/MSMC, and DSMC groups, respectively) (Fig. 2). The diagnostic performance of the CAD as demonstrated by the ROC curves was comparable or slightly inferior to that of experts and clearly superior to that of trainees in both test datasets I and II (Fig. 2).
The visualized class activation map images demonstrated that the CAD was able to correctly predict the histological diagnosis of colorectal polyps by perceiving the characteristic surface area of the relevant histological group (Fig. 3) 17.
Diagnostic performance of trainees assisted by the CAD. In test dataset I, the overall Cohen's kappa value of the CAD+trainees was 0.665 (95% CI, 0.560-0.758), which was higher than that of the trainees (0.368, 95% CI 0.281-0.459) (Table 2). Moreover, the overall diagnostic accuracy of the CAD+trainees was significantly higher than that of the trainee endoscopists (84.2% vs. 71.8%, respectively; P < 0.001) (Table 3). The ROC curves also showed improvements in the diagnostic performance of the trainees with CAD assistance (Fig. 4).
Analysis of the diagnostic performance of the trainees assisted by the CAD in test dataset II showed findings similar to those in test dataset I (Tables 2 and 3, Fig. 4).
Diagnostic performance of the CAD according to the size, location, and morphology of colorectal polyps. The diagnostic performance of the CAD in test dataset I was generally good, with no definite, consistent differences according to polyp size (>10 mm vs. ≤10 mm), location (right vs. left colon), or morphology (laterally spreading tumor [LST] vs. Is; Table 4). The diagnostic performance of the CAD according to polyp size, location, and morphology in test dataset II showed tendencies similar to those in test dataset I (Supplementary Table 2).

Discussion
In this study, the CAD based on NBI near-focus images showed a good diagnostic accuracy of >80%-90% irrespective of polyp size, location, and morphology. The area under the ROC curves for the CAD was 0.86-0.95, implying good-to-excellent predictability. The CAD showed a better performance compared to trainees and a slightly inferior or comparable performance to that of experts. In addition, CAD assistance significantly improved the diagnostic performance of trainees. These findings suggest that the AI CAD system helps inexperienced endoscopists to correctly predict the histopathology of colorectal polyps and gives expert endoscopists increased confidence in their histological assessments. Therefore, we believe that CAD assistance will help endoscopists more reliably determine the appropriate treatment plan for colorectal polyps.
Several recent studies have investigated the use of CAD to differentiate between neoplastic and non-neoplastic lesions; these are summarized in Supplementary Table 3. Our study is distinctive in several aspects. First, the majority of previous studies developed a CAD based on magnifying images or endocytoscopy images; these are not widely available in current clinical practice, particularly in nonacademic hospitals 15,18-22. In contrast, we developed a CAD using NBI near-focus images that can be easily obtained in many centers, including primary care units. Second, previous CAD studies focused on discriminating only diminutive polyps 14,15,19. However, appropriate treatment plans for both large and small colorectal polyps are essential to achieve successful screening and surveillance colonoscopy. Therefore, we enrolled all colorectal polyps regardless of their size and trained the CAD to classify the polyps into three histological groups with different treatment plans. The BA/MSMC group is endoscopically resectable, while the DSMC group is endoscopically unresectable and requires surgery. Since differentiating between hyperplastic and sessile serrated polyps is difficult even through histological evaluation by pathologists 23,24, the treatment plan for SP is currently determined clinically based on the size and location of the SP; endoscopic resection is recommended for SP > 5 mm at any location and SP of any size at a location proximal to the sigmoid colon. Accordingly, real-time differentiation between the two types is not mandatory in current practice. Therefore, we suggest that our three histological groups adequately represent the treatment planning of most colorectal polyps, and our study showed that the CAD is a potentially good modality to aid in the differentiation of these three histological groups.
Interestingly, the diagnostic performance of trainees improved significantly with CAD assistance through our heuristic algorithm. We suggest that this type of assessment, which considers both the endoscopist's confidence level and the CAD probability, may be a good way to implement AI CAD in real clinical practice. If an AI device shows a clearly superior performance, it could entirely replace human clinicians. However, consensus on the definition of "clear superiority" is difficult to achieve. Therefore, approaches combining both human and AI-suggested decisions can be a practical solution, and we believe our algorithm provides an example of such an approach in colonoscopy practice using a CAD.
The class activation map images showed the correct perception of the representative surface area of colorectal polyps by the CAD (Fig. 3). In general, a deep-learning model is regarded as a black box function because it is a data-driven method without inference by well-defined scientific laws. However, if the inference by the model cannot be interpreted, it would not be useful in clinical practice. As part of the solution to this problem, the class activation map can be extracted to show the basis for the model's inference. As shown in Fig. 3, the class activation map indicated that the model inferred histopathological diagnoses by correctly perceiving the characteristic polyp surface, similarly to the endoscopists.
One disappointing aspect of our study was the relatively unsatisfactory performance of the CAD in the DSMC group. The CAD correctly classified only 10 (58.8%) of 17 DSMC polyps in test set I and 18 (62.1%) of 29 DSMC polyps in test set II. This may be partly related to the small number of DSMC cases. Another possibility is that only a single NBI near-focus image in test set I and 1-5 NBI near-focus images in test set II may not represent all the features of the entire tumor, particularly for DSMC. A larger DSMC dataset and diagnosis based on multiple images for each DSMC may be warranted in future studies.
This study has several limitations. First, our CAD predicted colorectal polyp histology based on still images, and an AI CAD system assessing motion images should be developed and validated. Nonetheless, because the images used for the CAD training in this study were unmodified from the original images, we are optimistic about the application of motion images to the CAD. In addition, we assessed the performance of the CAD in two separate test datasets. Of those, in test dataset II, real-time assessment of histological diagnoses was performed after the instant transmission of still images to the CAD. We believe this experiment showed the feasibility of real-time assessment even with still images in clinical practice. Second, all NBI near-focus pictures in this study were taken by expert endoscopists. Inexperienced endoscopists may not pinpoint the representative region of a given colorectal polyp, thereby decreasing the performance of the CAD. Despite these limitations, our study is significant in that we demonstrated the possibility of using an AI CAD as a real-time histological diagnostic tool for not only small but also large colorectal polyps, including early cancers. A short inference time of only 0.02-0.04 seconds by our CAD is another important factor in the clinical utility of this system, because rapid diagnosis is mandatory in daily practice.
In conclusion, a CAD developed using a deep-learning model predicted colorectal polyp histology based on NBI images with high accuracy. The diagnostic performance of the CAD was comparable to that of expert endoscopists and better than that of trainees. Real-time histological assessment of colorectal polyps by the CAD may enhance endoscopists' decision-making and confidence in the selection of appropriate treatment plans.

Methods
Patients and data collection. We collected NBI near-focus images of endoscopically resected colorectal polyps at Asan Medical Center between 2014 and 2018. All endoscopic images, including both white light and NBI, were taken using CF-H290 colonoscopes (Olympus Co, Tokyo, Japan). The exclusion criteria were as follows: (1) colorectal polyps without NBI near-focus images, (2) those with dirty mucus and/or feces on their surface, (3) out-of-focus images, and (4) images with evident motion blurring. First, we retrospectively collected 806 NBI near-focus images of 806 polyps, with one image per polyp, in 646 patients. Among these, 624 images were used as the training dataset and 182 were used as test dataset I. Second, we prospectively collected NBI near-focus images of colorectal polyps as test dataset II for a separate real-time performance test of the developed CAD system. The same exclusion criteria as those described above for the training dataset and test dataset I were applied. Test dataset II included 546 near-focus images of 363 colorectal polyps, with 1-5 images per polyp, in 305 patients. In total, 1352 NBI near-focus images of 1169 colorectal polyps in 951 patients were collected.
This study was approved by the institutional review board (IRB) of Asan Medical Center (2017-1357). Due to the retrospective study design, written informed consent was not obtained from participants. The IRB of our institution waived the need for informed consent based on the non-invasive and anonymized nature of this study. This study was conducted in accordance with institutional ethical guidelines and the Declaration of Helsinki.
Histopathological classification of colorectal polyps. The histopathology of all colorectal polyps was evaluated by board-certified gastrointestinal pathologists. The polyps were classified into three histological groups: (1) SP, (2) BA/ MSMC, and (3) DSMC. The SP group encompassed hyperplastic and sessile serrated polyps. Superficial submucosal cancer was defined as cancer with an invasion depth <1000 µm from the muscularis mucosa. DSMC was defined as cancer with an invasion depth ≥1000 µm from the muscularis mucosa.  27 were used as a deep-learning architecture to train the weak supervisions of histological diagnoses of NBI near-focus images. ResNet-50 was initially adopted, and then the recently introduced DenseNet-201 was used to improve the performance of the CAD.
Among the retrospectively collected NBI near-focus images of 806 polyps, 624 were used as the training dataset and 182 were used as test dataset I. Since a model that predicts histology from a single NBI near-focus image of a tumor can easily overfit when training data are scarce, we employed a simple curriculum learning strategy. The tumor area of each collected NBI near-focus image was marked with a rectangle. Then, 20 half-size image patches containing the center point of this rectangle were extracted from each entire image. Through this process, 12480 image patches measuring 224 × 224 were extracted from the 624 entire images of 448 × 448 size in the training set (20 patches × 624 images). Training was performed in two steps, pre-training on image patches followed by fine-tuning on entire images, as shown in Fig. 1. First, the model that was pre-trained on the ILSVRC dataset was trained using an augmented dataset of 12480 image patches as the training data. Thereafter, it was fine-tuned using the 624 entire images. This curriculum learning strategy was intended to lead the model to a better local minimum.
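The patch extraction step can be illustrated with a short sketch. The text states only that each 448 × 448 image yields 20 half-size (224 × 224) patches containing the center point of the annotated tumor rectangle; how the patch positions were chosen is not described, so the uniform random placement below, the function name, and the fixed seed are all assumptions made for illustration.

```python
import random

IMAGE_SIZE = 448         # entire image is 448 x 448 (from the text)
PATCH_SIZE = 224         # half-size patch is 224 x 224 (from the text)
PATCHES_PER_IMAGE = 20   # from the text

def sample_patch_boxes(center, n=PATCHES_PER_IMAGE, rng=None):
    """Return n crop boxes (x0, y0, x1, y1) that lie inside the image and contain `center`.

    Assumption: the top-left corner is drawn uniformly from all valid positions.
    """
    if rng is None:
        rng = random.Random(0)
    cx, cy = center
    boxes = []
    for _ in range(n):
        # The corner must satisfy x0 <= cx < x0 + PATCH_SIZE and keep the patch in bounds.
        x0 = rng.randint(max(0, cx - PATCH_SIZE + 1), min(IMAGE_SIZE - PATCH_SIZE, cx))
        y0 = rng.randint(max(0, cy - PATCH_SIZE + 1), min(IMAGE_SIZE - PATCH_SIZE, cy))
        boxes.append((x0, y0, x0 + PATCH_SIZE, y0 + PATCH_SIZE))
    return boxes

# Hypothetical rectangle center, for illustration only.
boxes = sample_patch_boxes((300, 120))
total_patches = 624 * PATCHES_PER_IMAGE  # 624 training images, 20 patches each
```

With 624 training images this gives 624 × 20 = 12480 patches, matching the number reported.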
There was an imbalance in the number of images among the SP, BA/MSMC, and DSMC groups. Since this imbalance could have led to a paradoxical outcome, an oversampling strategy was employed to draw the same number of samples from each group per training epoch. Each sample was standardized and trained using common data augmentation techniques, such as adding Gaussian noise, rotating, zooming, and shifting.
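One common way to realize such class-balanced oversampling is to draw the same number of indices from each group, with replacement, at the start of every epoch. The sketch below illustrates this idea under that assumption; it is not the study's implementation, and the label counts in the example are hypothetical.

```python
import random
from collections import Counter, defaultdict

def balanced_epoch(labels, per_class, rng=None):
    """Return shuffled sample indices with exactly `per_class` draws from each class.

    Sampling is with replacement, so minority-class images are repeated
    (oversampled) within the epoch.
    """
    if rng is None:
        rng = random.Random(0)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    epoch = []
    for label in sorted(by_class):
        epoch.extend(rng.choices(by_class[label], k=per_class))
    rng.shuffle(epoch)
    return epoch

# Hypothetical, deliberately imbalanced label list (NOT the study's class counts).
labels = ["SP"] * 3 + ["BA/MSMC"] * 10 + ["DSMC"] * 2
epoch_indices = balanced_epoch(labels, per_class=10)
epoch_counts = Counter(labels[i] for i in epoch_indices)
```

Each epoch then presents every histological group equally often, regardless of how many original images the group has.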
All experiments were implemented in Keras with a TensorFlow backend; a stochastic gradient descent optimizer 28 was used with a learning rate of 5e-5 and a decay rate of 5e-5.
Since the model was validated using the test set without a separate validation set, the reliability of the model's performance was assessed by 5-fold cross-validation. Supplementary Table 1 shows the cross-validation results for ResNet-50 and DenseNet-201. The mean accuracies were 77.4% for ResNet-50 and 81.4% for DenseNet-201 (P = 0.08). Although the difference was not statistically significant, DenseNet-201 showed a numerically higher accuracy. Therefore, the final CAD system was developed using DenseNet-201.

Figure 3. The visualized class activation map images. The figures in small rectangles in each image show the probability of each class being predicted by the computer-aided diagnostic system (CAD). The red area represents the region that the CAD considers to be compatible with the particular histology with high probability. The blue area represents the region that the CAD considers to have a low probability for the particular histology. SP, serrated polyp; BA, benign conventional adenoma; MSMC, mucosal or superficial submucosal cancer; DSMC, deep submucosal cancer.

Diagnostic performance of the CAD and comparison with endoscopists. The diagnostic performance of the CAD was tested twice separately. The first test was conducted with NBI near-focus images of 182 colorectal polyps in test dataset I. Diagnosis by the CAD was made based on the probabilities of the three histological groups. For example, if the CAD showed a 5% probability for SP, 10% probability for BA/MSMC, and 85% probability for DSMC, the diagnosis was finalized as DSMC, as it had the highest probability. Diagnostic accuracy, sensitivity, specificity, PPV, and NPV were investigated in the differential diagnosis among SP, BA/MSMC, and DSMC. Diagnostic performances were further assessed according to polyp size, morphology, and location. Polyp morphology was classified into Ip (pedunculated), Is (sessile), and LST. LSTs were further categorized into granular (LST-G) and non-granular (LST-NG) types. Tumors were also categorized by location as being in the left colon or the right colon (proximal to the splenic flexure).
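The highest-probability rule and the per-group indicators can be sketched as follows. Each histological group is treated as "positive" in turn, with the other two groups pooled as "negative" (a one-vs-rest reading of the text, which is an assumption). The function names and the toy label lists are illustrative only and do not come from the study.

```python
def diagnose(probabilities):
    """Return the histological group with the highest CAD probability."""
    return max(probabilities, key=probabilities.get)

def one_vs_rest_metrics(truth, predicted, group):
    """Sensitivity, specificity, PPV, and NPV for `group` against all other groups."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == group and p == group for t, p in pairs)
    fn = sum(t == group and p != group for t, p in pairs)
    fp = sum(t != group and p == group for t, p in pairs)
    tn = sum(t != group and p != group for t, p in pairs)

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as None.
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }

# The worked example from the text: 5% SP, 10% BA/MSMC, 85% DSMC.
example = diagnose({"SP": 0.05, "BA/MSMC": 0.10, "DSMC": 0.85})

# Hypothetical toy labels (NOT study data).
truth = ["SP", "SP", "BA/MSMC", "BA/MSMC", "DSMC", "DSMC"]
predicted = ["SP", "BA/MSMC", "BA/MSMC", "BA/MSMC", "DSMC", "BA/MSMC"]
dsmc_metrics = one_vs_rest_metrics(truth, predicted, "DSMC")
```

For the worked example, `diagnose` returns DSMC, as described in the text.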
To compare the CAD performance with that of endoscopists, six endoscopists blinded to the histological diagnoses were asked to classify the same 182 polyps of test dataset I into three histological groups based on NBI near-focus images. Endoscopists provided their diagnosis with a confidence level (high vs. low). Three of the six endoscopists were board-certified expert colonoscopists who had performed approximately 2,500 NBI colonoscopies over ≥5 years. The other three were trainees who had performed approximately 100 NBI colonoscopies in <6 months.
We also evaluated the diagnostic performance of trainees assisted by the CAD (CAD+trainee). The final diagnosis by the CAD+trainee was made according to the following algorithm: (1) If the CAD and trainee made the same diagnosis, it was considered the final diagnosis of the CAD+trainee; (2) if the CAD and trainee diagnoses were different and the diagnostic probability by the CAD was ≥80%, the CAD diagnosis was considered the diagnosis of CAD+trainee; (3) if the CAD and trainee diagnoses were different, the diagnostic probability by the CAD was <80%, and the confidence level of the trainee diagnosis was high, the trainee's diagnosis was considered the diagnosis of the CAD+trainee; (4) if the CAD and trainee diagnoses were different, the diagnostic probability by the CAD was <80%, and the confidence level of trainee diagnosis was low, the CAD diagnosis was considered the diagnosis of the CAD+trainee. A diagnostic probability of 80% is a heuristic parameter that can be changed depending on the model.
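The four rules above reduce to a short decision function. The sketch below follows the rules as stated; the function name, argument names, and the string encoding of the confidence level are illustrative choices, and the 80% threshold is exposed as a parameter because the text describes it as a heuristic.

```python
def cad_plus_trainee(cad_diagnosis, cad_probability,
                     trainee_diagnosis, trainee_confidence, threshold=0.80):
    """Combine the CAD and trainee diagnoses into the CAD+trainee diagnosis.

    cad_probability is the CAD's probability for its own diagnosis (0 to 1);
    trainee_confidence is "high" or "low".
    """
    # Rule 1: agreement, so the shared diagnosis is final.
    if cad_diagnosis == trainee_diagnosis:
        return cad_diagnosis
    # Rule 2: disagreement with a confident CAD (probability >= threshold).
    if cad_probability >= threshold:
        return cad_diagnosis
    # Rule 3: disagreement, CAD below threshold, trainee highly confident.
    if trainee_confidence == "high":
        return trainee_diagnosis
    # Rule 4: disagreement, CAD below threshold, trainee confidence low.
    return cad_diagnosis

# Hypothetical case: the CAD says DSMC at 70%, a confident trainee says BA/MSMC.
combined = cad_plus_trainee("DSMC", 0.70, "BA/MSMC", "high")
```

Note that the trainee's diagnosis prevails only in rule 3; in every other case of disagreement the CAD diagnosis is used.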
The second test for the diagnostic performance of the CAD was performed with NBI near-focus images of 363 colorectal polyps in test dataset II. To test the real-time performance, the following steps were performed. First, 1-5 representative NBI near-focus still images of a colorectal polyp were acquired during colonoscopy. Second, the images were transmitted to the CAD on a laptop via the picture archiving and communication system (PACS) of our center. Finally, the CAD determined the histological diagnostic group in real time. The same three expert endoscopists and another three trainees assessed the histological diagnoses of the 363 colorectal polyps in test dataset II after all the NBI near-focus images were collected.

Figure 4. Improvement of the diagnostic performance of trainees with the assistance of the computer-aided diagnostic system (CAD). All empty circles representing trainees' performance moved to solid circles representing the performance of the CAD+trainees at the left upper side or near the yellow curved line; this suggests that the performance of the CAD+trainees was superior to that of trainees and comparable to that of the CAD (yellow curved line). (A-C) Improved diagnostic performance of the CAD+trainee in the SP, BA/MSMC, and DSMC groups of test dataset I; (D-F) Improved diagnostic performance of the CAD+trainee in the SP, BA/MSMC, and DSMC groups of test dataset II. SP, serrated polyp; BA, benign conventional adenoma; MSMC, mucosal or superficial submucosal cancer; DSMC, deep submucosal cancer.

Table 4. Diagnostic performance of the CAD according to the size, location, and morphology of the colorectal polyps in test set I. CAD, computer-aided diagnostic system; BA, benign conventional adenoma; MSMC, mucosal or superficial submucosal cancer; DSMC, deep submucosal cancer; PPV, positive predictive value; NPV, negative predictive value; LST, laterally spreading tumor.