Artificial intelligence-assisted analysis of endoscopic retrograde cholangiopancreatography image for identifying ampulla and difficulty of selective cannulation

The advancement of artificial intelligence (AI) has facilitated its application in medical fields. However, there has been little research on AI-assisted endoscopy, despite the clinical significance of efficient and safe cannulation in endoscopic retrograde cholangiopancreatography (ERCP). In this study, we aim to assist endoscopists performing ERCP through automatic detection of the ampulla and identification of cannulation difficulty. We developed a novel AI-assisted system based on convolutional neural networks that predicts the location of the ampulla and the difficulty of cannulating it. ERCP data of 531 and 451 patients were used to evaluate our model on each task. Our model detected the ampulla with a mean intersection-over-union of 64.1%, a precision of 76.2%, a recall of 78.4%, and a centroid distance of 0.021. In classifying cannulation difficulty, it achieved recalls of 71.9% for easy cases and 61.1% for difficult cases. Remarkably, our model accurately detected AOVs of varying morphological shape, size, and texture on par with a human expert and showed promising results in recognizing cannulation difficulty. It demonstrates the potential to improve the quality of ERCP by assisting endoscopists.

Device and definition. ERCP was performed using a duodenoscope (JF 240 or TJF-260 V; Olympus Medical Systems Co. Ltd., Tokyo, Japan), and the endoscopic images were taken using a video endoscopy system (EVIS LUCERA CLV 260; Olympus Medical Systems Co. Ltd., Tokyo, Japan). The following endoscopic images were selected: (1) images showing a frontal view of the ampulla, (2) images of the naïve ampulla before it was touched by any endoscopic device, and (3) images taken after washing off bubbles and food material from the duodenum.
The cannulation time was defined as the total time from approaching the ampulla to achieving successful deep cannulation of the CBD. Additionally, data on whether additional cannulation techniques were used were obtained from the records written by the endoscopists. The additional cannulation techniques included the double-guidewire technique, needle-knife fistulotomy, and changing the cannulation device after an initial trial. A cannulation was considered a difficult case when the cannulation time exceeded 5 min, when additional techniques were used, or when the cannulation failed.
For the ampulla-detection task, each captured endoscopic image was annotated with a bounding box (bbox) indicating the location of the AOV. A bbox annotation comprises four real values: the x and y coordinates of the upper-left point, and the width and height of the box. The annotation was performed by an endoscopist with more than five years of experience in ERCP who had conducted the procedure more than 2000 times.
Detection of the location of AOV in the duodenum. This paper aims to assist endoscopists performing ERCP through two tasks: the automatic detection of the ampulla and the identification of cannulation difficulty. The former is introduced in this section, and the latter is explained in detail in the following section. Given the annotated locations of the AOV as bboxes, it would be common to design a neural network that estimates them directly, i.e., generates real-valued output bboxes. However, instead of predicting bboxes that delimit the exact extent of the AOV, our model estimates the probability that each image pixel belongs to the AOV. A strict bbox dividing AOV from non-AOV is not suitable for this task because the AOV is not sharply distinguished from the background; rather, it gradually blends into the background, much as a probability distribution continuously ranges from 1 to 0. Predicting the exact extent of bboxes indicating the AOV is difficult owing to variations in the data (e.g., the boundary of the AOV may be ambiguous, and its morphology varies across patients). Such methods have proven effective for datasets with similarly ambiguous boundaries (Kim et al. 18). In this sense, our model generates its output as a pixel-wise soft mask, a density map giving the probability that each pixel belongs to the AOV, instead of a strict bbox.
To this end, we transformed the bbox annotations into pixel-wise soft masks. Figure 1 illustrates a sample image with the original bbox annotation and the new soft-mask label. First, regarding the probability of a pixel belonging to the AOV, the centroid of the bbox should have the maximum probability, and the probability should become smaller as the pixel moves farther from the centroid. In addition, given that the shape of the ampulla resembles a circle rather than a rectangle, a bivariate normal distribution parametrized by a mean and a covariance matrix was used to render the masks. It is a bell-shaped curve with its maximum at the mean and values decreasing at a rate determined by the covariance matrix. To formulate the new label, we set the mean to the centroid of the bbox annotation and the horizontal and vertical variances to half the width and height, respectively, of the box. As the box becomes wider, the curve decreases more slowly along the corresponding axis.
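As a concrete illustration, the bbox-to-soft-mask conversion described above can be sketched as follows. This is a minimal NumPy version; the function name is hypothetical, and rendering the density unnormalized so that it peaks at exactly 1.0 is our assumption rather than a detail stated in the text.

```python
import numpy as np

def bbox_to_soft_mask(x, y, w, h, img_w, img_h):
    """Render a pixel-wise soft mask from a bbox (upper-left x, y, width, height).

    An axis-aligned bivariate normal density is centred at the bbox centroid,
    with the horizontal/vertical variances set to half the bbox width/height,
    and scaled so its peak value is 1.0 at the centroid (our assumption).
    """
    cx, cy = x + w / 2.0, y + h / 2.0   # mean of the Gaussian = bbox centroid
    var_x, var_y = w / 2.0, h / 2.0     # variances = half the width/height
    xs = np.arange(img_w)[None, :]      # shape (1, W)
    ys = np.arange(img_h)[:, None]      # shape (H, 1)
    # Unnormalized bivariate normal density with diagonal covariance.
    mask = np.exp(-((xs - cx) ** 2 / (2 * var_x) + (ys - cy) ** 2 / (2 * var_y)))
    return mask                         # shape (H, W), peak 1.0 at the centroid
```

A wider box yields a larger horizontal variance, so the mask decays more slowly along that axis, matching the behaviour described above.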
In this manner, the new annotation is a pixel-wise soft mask whose values follow a bell-shaped probability distribution. While learning to minimize the difference between the outputs and the ground-truth (GT) labels using the binary cross-entropy loss, the network learns to assign the maximum probability at the centroid of the bbox, and the shape of its outputs becomes close to a circle or an oval. During evaluation, the output of the model is transformed back into bbox form: the bbox includes the peak of the predicted mask and only those pixels around the peak with a probability larger than a threshold. We normalized the predicted mask to have a maximum value of one at the peak and set the probability threshold to 0.6 to determine the boundary of the box.
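The mask-to-bbox conversion at evaluation time can be sketched as follows. This is a simplified illustration with a hypothetical function name: it takes the tight rectangle around all supra-threshold pixels, which matches the described behaviour when the predicted mask is a single blob around the peak (the typical case for a bell-shaped output).

```python
import numpy as np

def soft_mask_to_bbox(mask, threshold=0.6):
    """Convert a predicted soft mask back to a bbox (x, y, w, h).

    The mask is normalised so its peak equals 1, pixels above `threshold`
    are kept, and the tight rectangle around them is returned. The paper
    keeps only pixels around the peak; for a single-blob mask the tight
    rectangle over all supra-threshold pixels is equivalent.
    """
    norm = mask / mask.max()            # peak value becomes exactly 1.0
    ys, xs = np.nonzero(norm > threshold)
    x0, x1 = xs.min(), xs.max()
    y0, y1 = ys.min(), ys.max()
    return int(x0), int(y0), int(x1 - x0 + 1), int(y1 - y0 + 1)
```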
For our neural network architecture, we adopted U-Net 19 with a VGGNet-based 20 encoder and decoder. VGGNet is a CNN model with stacked blocks of multiple 3 × 3 convolutional layers, each block followed by a max pooling layer. U-Net consists of an encoder and a decoder with skip connections for precise localization of objects within images. We also utilized transfer learning for faster training and improved prediction accuracy by fine-tuning a model pre-trained on the ImageNet dataset 21.
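The overall architecture can be illustrated with a deliberately tiny PyTorch sketch. The layer sizes, the single encoder/decoder level, and the absence of pre-trained weights are all simplifications for illustration; the actual model uses the full VGG-based U-Net with ImageNet pre-training.

```python
import torch
import torch.nn as nn

def vgg_block(c_in, c_out):
    """Two 3x3 convolutions with ReLU (VGG-style); pooling is applied
    separately in the encoder."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    """Minimal U-Net sketch: VGG-style blocks, one down/up level, a skip
    connection by channel concatenation, and a sigmoid head producing the
    per-pixel soft-mask probability."""
    def __init__(self):
        super().__init__()
        self.enc1 = vgg_block(3, 16)
        self.pool = nn.MaxPool2d(2)
        self.enc2 = vgg_block(16, 32)
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec1 = vgg_block(32, 16)   # 16 (skip) + 16 (upsampled) channels
        self.head = nn.Conv2d(16, 1, 1)

    def forward(self, x):
        s1 = self.enc1(x)                       # encoder features (skip)
        x = self.enc2(self.pool(s1))            # bottleneck
        x = self.up(x)                          # upsample back to input size
        x = self.dec1(torch.cat([s1, x], dim=1))  # concatenate skip connection
        return torch.sigmoid(self.head(x))      # per-pixel AOV probability
```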
Classification of the ampulla-cannulation difficulty. After the location of the AOV is detected, our next goal is to classify the endoscopic images of the duodenum according to the cannulation difficulty in ERCP. Depending on how the difficulty labels were grouped, we conducted the prediction in two ways: binary classification and four-class classification. First, we divided all the cases into two groups, the "easy case" group and the "difficult case" group. The "difficult case" group comprised cases with a cannulation time of over 5 min, cases requiring additional cannulation techniques, and cases of failed selective cannulation, as stated earlier; all remaining cases formed the "easy case" group. Furthermore, the cases were subdivided into four classes: an easy class, a class with a cannulation time of over 5 min, a class requiring additional cannulation techniques, and a failure class.
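The grouping rules above can be expressed as a small labeling function. This is a hypothetical sketch: the precedence among overlapping criteria (failure first, then additional techniques, then the 5-min cutoff) is inferred from the mutually exclusive class counts reported in the Results, not stated explicitly in the text.

```python
def difficulty_labels(cannulation_time_s, used_additional_technique, failed):
    """Return (binary_label, four_class_label) for one ERCP case.

    Binary: "easy" vs. "difficult".
    Four-class: "easy", "over_5_min", "additional_technique", "failure".
    Precedence among overlapping criteria is our inference.
    """
    if failed:
        return "difficult", "failure"
    if used_additional_technique:
        return "difficult", "additional_technique"
    if cannulation_time_s > 5 * 60:          # 5-min cutoff
        return "difficult", "over_5_min"
    return "easy", "easy"
```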
Similar to the AOV detection task, CNN-based classification models with transfer learning were used. Specifically, modified versions of VGG19 with batch normalization 20, ResNet50 22, and DenseNet161 23 were used. VGG19 is a VGGNet architecture with 19 layers, and batch normalization is a technique that keeps the distribution of activation values in a network stable. ResNet is a CNN model that allows residual mappings by adopting skip connections between layers, effectively alleviating the vanishing-gradient problem. DenseNet uses skip connections "densely" to maximize their advantage. A single three-channel endoscopic image is used as the input to the model. During training, the data were augmented at every iteration by applying various transformations to the endoscopic images, e.g., flipping, shearing, and rotating. We also used early stopping to avoid overfitting. All the networks in this study were implemented with the PyTorch deep learning framework 24.
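Per-iteration augmentation of this kind can be sketched as follows. This is a NumPy-based illustration for brevity rather than the PyTorch pipeline actually used; shearing is omitted, and the specific transforms and probabilities are our assumptions.

```python
import random
import numpy as np

def augment(image):
    """Draw a fresh random flip/rotation every call, as in on-the-fly
    augmentation. `image` is an H x W x 3 array (H == W assumed here so
    that 90-degree rotations preserve the shape)."""
    if random.random() < 0.5:
        image = np.fliplr(image)        # horizontal flip
    if random.random() < 0.5:
        image = np.flipud(image)        # vertical flip
    k = random.randrange(4)             # rotate by 0/90/180/270 degrees
    image = np.rot90(image, k)
    return np.ascontiguousarray(image)
```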
Outcome measures. When evaluating the model performance on the ampulla detection task, the following measures were used in addition to the main outcome measures such as recall.
• Centroid distance is a relative coordinate error between centroids of the GT bbox and the estimated bbox.
Its mathematical expression is as follows:

$d = \sqrt{\left(\frac{x_g - x_e}{W}\right)^2 + \left(\frac{y_g - y_e}{H}\right)^2}$

where $x_g$ and $y_g$ denote the centroid coordinates of the GT bbox, $x_e$ and $y_e$ those of the estimated bbox, and $W$ and $H$ the width and height of the image, respectively.
• A success plot shows success rates over decreasing mean intersection-over-union (mIoU) thresholds (or increasing centroid-distance thresholds). A prediction for an image is counted as a success if its IoU with the GT label is greater than the threshold (or its centroid distance is lower than the threshold).
• Human performance was compared with that of our model. We randomly sampled 30 images from the test set of the first fold using the Python NumPy library, and an expert endoscopist conducted ampulla detection on the sampled data. The performance of our model was measured on the same images.
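The centroid-distance measure and one point of a success plot can be computed as in the following sketch; the function names are illustrative.

```python
import math

def centroid_distance(xg, yg, xe, ye, img_w, img_h):
    """Relative error between the GT centroid (xg, yg) and the estimated
    centroid (xe, ye), with each coordinate difference normalised by the
    image width/height before taking the Euclidean norm."""
    return math.hypot((xg - xe) / img_w, (yg - ye) / img_h)

def success_rate(distances, threshold):
    """Fraction of predictions whose centroid distance falls below
    `threshold` -- one point of a success plot."""
    return sum(d < threshold for d in distances) / len(distances)
```

For example, centroids ten pixels apart horizontally in a 100-pixel-wide image give a centroid distance of 0.1.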

Results
The baseline characteristics of the patients and the results of ERCP are listed in Table 1. A total of 531 patients were included in this study. Their mean age was 66.0 ± 15.2 years, and 303 (57%) patients were men. The median cannulation time was 130.0 ± 305.5 s. The cannulation time was over 5 min in 69 patients, and there were 6 cases of cannulation failure. Additional techniques were used in 94 patients, and in all such cases the cannulation time exceeded 5 min. In total, 169 cases (31.8%) were considered to have experienced cannulation difficulty. For the AOV detection task, the endoscopic images of 451 patients were selected and annotated with the AOV location; 80 images were excluded because of poor image quality or ambiguous AOV location. All 531 cases were used to estimate the cannulation difficulty.
For all experiments, we performed fivefold cross-validation on our dataset, and all results are reported as the average of the five folds with the standard deviation unless noted otherwise. In the ampulla location prediction task, our model achieved mIoU 0.641 ± 0.021, precision 0.762 ± 0.035, recall 0.784 ± 0.006, and centroid distance 0.021 ± 0.003. Examples of the model output and the GT label are shown in Fig. 2; the examples cover IoU values ranging from 0.196 to 0.968 and centroid distances ranging from 0.001 to 0.066. Figure 3 shows two success plots, one with IoU thresholds ranging from 0.0 to 0.9 and the other with increasing centroid-distance thresholds ranging from 0.01 to 0.1. As shown in Fig. 3a, the proposed method with soft-mask output achieved an average success rate of 91.4 ± 2.3% at a 0.3 IoU threshold, compared with 84.9 ± 4.7% for the model with bbox outputs. Similarly, Fig. 3b shows that the average success rate of the proposed model for achieving a centroid distance lower than 5% of the image resolution is 92.0 ± 1.28%, compared with 84.9 ± 4.7% for the bbox output model.
For classifying the difficulty of cannulation, the performance results of the CNN-based models on the binary classification are shown in Table 2. Among them, ResNet achieved the best performance, with an average recall of 0.719 ± 0.081 for the easy class and 0.611 ± 0.098 for the difficult class. The F1-score was 0.757 ± 0.062 for the easy class and 0.553 ± 0.062 for the difficult class on average. As shown in Table 3, our best model scored a macro-average F1-score of 0.429 ± 0.062 and an accuracy of 0.667 ± 0.078 on the four-class classification task, with a recall of 0.8039 for the easy class and 0.5638 for the class of cases requiring additional cannulation techniques.

Figure 2. Examples of the model prediction and GT label. A model prediction is presented in two different ways, the green bbox (upper) and heatmap visualization (lower). In both cases, the white bbox indicates the GT label. For each prediction, the IoU and centroid distance are written above. The heatmap results show that the predicted masks from our model accurately match AOVs in size and shape, even ones with IoU around 30%.

Discussion
Studies on AI in medicine have been widely conducted in recent years, and remarkable progress has been made.
Notably, CNN-based models have proved their potential in medical imaging applications and have been widely applied in the field of gastroenterology. For example, Constantinescu et al. 27 adopted AI to detect polyps from endoscopic images and achieved 93.75% recall and 91.38% specificity, a diagnostic performance similar to the physician-led one of 94.79% recall and 93.68% specificity. Wu et al. 28 proposed a CNN-based model that detected early gastric cancer and gastric locations better than endoscopists. Saito et al. 29 obtained 98.6% accuracy in detecting protruding lesions from wireless capsule endoscopy images. This study is the first to develop an AI-based endoscopy support system for ERCP by comprehensively analyzing endoscopic images together with clinical outcomes. The result has clinical significance in that the efficiency and safety of cannulation in conventional ERCP might be improved with the support of this new AI technology. Although performing a perfect biliary cannulation has been a challenge for most endoscopists, an unsolved question remains: what is the optimal cannulation technique for ERCP? We propose an AI-assisted system that detects the location of the AOV and estimates the cannulation difficulty in advance while performing ERCP.
Our CNN-based models achieved competent performance in these tasks, especially in the ampulla detection task, showing robust performance across variations among patients, such as morphological shape, size, texture, location, and types of diverticulum (Fig. 2). Moreover, our model even identifies the shape of the ampulla precisely; e.g., if the ampulla is vertically long or circular, so is the model output. These results are especially meaningful in that our model successfully detected both the location and the morphological shape of the AOV with only bbox annotations, not costly pixel-level annotations. Furthermore, the performances of our model and the bbox output model are compared in Fig. 3; the former consistently outperforms the latter, demonstrating the effectiveness of the soft mask.
Also, it is notable that even predictions with IoU between 0.3 and 0.4 identified the location adequately enough to practically assist the ERCP procedure. In this sense, we counted how many predictions achieved an IoU greater than 0.3. Remarkably, the average success rate was 91.4% (Fig. 3), showing that our model learned to detect the ampulla despite its unclear boundary. Moreover, as shown in Fig. 3, the average success rate for a centroid-distance threshold of 5% of the image resolution was 92.0%; the examples in Fig. 2 show that 5% is an appropriate threshold for assessing this performance. This result demonstrates that our model accurately detected the location of the AOV in most cases.
The comparison with the human expert results demonstrates that our model (mIoU 0.684, recall 0.825) achieved comparable performance with the human expert (mIoU 0.554, recall 0.602) in recognizing the range of AOV on average although the endoscopist (precision 0.917) was better at excluding unnecessary parts than our model (precision 0.789). Also, the centroid distance results show that its capability to pinpoint the location of AOV is on par with the level of a human expert.
Since the boundary of the AOV is ambiguous, different annotators can draw different bounding boxes for the same AOV. In this sense, the difference between the human expert and the ground-truth labels can be regarded as inter-annotator disagreement. Thus, we also measured the model performance with the human expert's annotation as a new GT. Although the mIoU and the precision are relatively small, the recall becomes even higher than when the original GT label is used (recall 0.825). This indicates that our model is not biased toward the GT label and is more similar to the expert than the GT label is in terms of recall. Moreover, the centroid distance between our model and the human expert (0.008) is smaller than the distance between the GT label and the human expert (0.012). Even though our model never saw the expert's annotations during training, it pinpoints the AOV closer to the human expert than the GT label does. These results support that our model is generalizable.
In the task of binary classification of cannulation difficulty, our model showed high performance in identifying easy cases for selective cannulation, with an average precision and recall of 0.802 and 0.719, respectively (Table 2). However, the identification of difficult cases still has a low recall of 0.611 on average. Therefore, further improvement would be required before our model can be used in clinical ERCP practice.
On the other hand, there were some interesting and promising results in the four-class classification. Although recall for the long-cannulation-time class was low, as in the binary classification, the AI-assisted models showed favorable performance in predicting the cases requiring additional techniques (recall 0.564) even though only 17.70% of the data belong to this class. This suggests that endoscopic images alone can indicate whether additional techniques will be necessary while performing ERCP, without repeated attempts and failures. Therefore, an AI-assisted procedure is feasible for obtaining such additional information during ERCP.
Additionally, in Fig. 4, we visualized class activation maps using gradient-weighted class activation mapping (Grad-CAM) 30 to interpret the model behavior for correctly classified examples. It shows which visual features were meaningful to the model. The model focused on the surroundings of the ampulla for the easy cases and the bulging ampulla for cases of the more-than-five-min class. On the other hand, the upper crease of AOV was considered as an important feature for detecting cases requiring additional techniques. These results demonstrate that our model makes decisions according to learned features related to cannulation difficulties (possibly involving other visual features not mentioned) and that it can provide us further insights to explain the relationship between the various anatomical factors of patients and cannulation difficulties.
In future work, collecting additional information from radiologic data (e.g., computed tomography or magnetic resonance imaging) would improve the classification performance. For example, the inner structure of the ampulla is one factor that determines the cannulation difficulty but may not be easily captured from the endoscopic images. Also, data imbalance in difficulty classes needs to be addressed, which is one of the typical problems that degrade model performance in classification tasks [31][32][33][34][35][36][37]. Furthermore, sharing the learned knowledge between the cannulation difficulty prediction model and the ampulla detection model would lead to improvements in prediction accuracy for both tasks.
In conclusion, this study shows the potential of clinically applicable AI-based automatic ERCP procedures, even with a small amount of data. The AI-assisted system found the location of the ampulla in the duodenum during ERCP with high accuracy, on par with the level of a human expert. It is expected to help make decisions in ambiguous situations during ERCP.

Data availability
The datasets generated and/or analyzed during the current study are not publicly available owing to the patient privacy protection act.