Introduction

Upper gastrointestinal endoscopy plays a pivotal role in diagnosing and managing upper gastrointestinal disorders1, with early identification and precise diagnosis of lesions significantly affecting treatment strategies and overall prognosis2. Countries with a high prevalence of gastric cancer that have implemented national screening programs with upper gastrointestinal endoscopy have experienced an increase in the early gastric cancer detection rate and a decrease in gastric cancer mortality3,4,5. The early detection and precise diagnosis of gastric lesions, particularly malignant and premalignant lesions, are essential for timely and effective treatment, ultimately leading to enhanced survival rates6,7. In addition, when malignancy is suspected, endoscopic prediction of invasion depth is important for deciding treatment modalities, such as endoscopic resection and surgery8.

However, the accurate identification and diagnosis of gastric lesions during endoscopy require a thorough inspection of the stomach, discrimination of abnormal lesions from normal gastric mucosa, and the decision to perform a biopsy, all of which can be influenced by the expertise and experience of endoscopists9. Notably, missed gastric cancer during endoscopy is common, with miss rates of up to 12%10. Factors that influence the miss rate include endoscopist errors, such as failure to detect a lesion, detection of a lesion without performing a biopsy, and delays in follow-up11. To overcome these limitations, numerous recent studies have used convolutional neural networks (CNNs)12,13,14,15,16,17 or other machine learning methods18,19,20,21 to improve lesion detection and differentiation rates during endoscopy.

Deep learning algorithms are widely used in various fields and have growing clinical relevance in the medical domain. They can assist clinicians in decision-making22, enhance lesion detection, and alleviate the fatigue experienced by endoscopists23,24,25.

In this study, we aimed to develop a novel algorithm for the detection and classification of gastric cancer and premalignant and benign gastric lesions that are commonly identified through upper gastrointestinal endoscopy, while predicting the depth of invasion of gastric cancer.

Materials and methods

Dataset

We retrospectively gathered still white-light endoscopic images of pathologically confirmed gastric lesions from patients who underwent upper gastrointestinal endoscopy between January 1, 2018, and December 31, 2021, at Seoul National University Hospital (SNUH). These included cases of gastric cancer (early gastric cancer [EGC] and advanced gastric cancer [AGC]), gastric premalignant lesions (low-grade and high-grade dysplasia), benign gastric lesions (benign gastric ulcers [BGU], benign polyps [hyperplastic and fundic gland polyps], and benign erosions), and normal endoscopy cases (normal gastric mucosa with no visible lesions). The exclusion criteria were as follows: (1) inappropriate images (low resolution, blurring, artifacts, bubbles, shadowing, inadequate air inflation, etc.) and (2) images without pathology results (except for images of a normal stomach). The models for lesion detection and invasion depth classification were designed as shown in Fig. 1.

Figure 1

Schematic diagram for the automated Multi-Class Lesion Detection and T-stage Classification Model.

Table 1 shows the composition of image categories in the datasets used in this study. A total of 10,181 white-light images from 2606 participants were included and split into training, validation, and test sets at an 8:1:1 ratio at the patient level, so that no participant's images appeared in more than one set. Specifically, the training, validation, and test sets included 1997 (7632 images), 303 (1156 images), and 306 (1393 images) participants, respectively.
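As an illustration of how such a patient-level split can be enforced, the following is a minimal sketch using scikit-learn's GroupShuffleSplit; the table layout and column names are illustrative assumptions, not the study's actual data schema.

```python
# Minimal sketch of a patient-level 8:1:1 split; the DataFrame layout is a
# hypothetical stand-in for the study's actual image inventory.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy inventory: one row per image, ~3 images per patient (10 patients).
df = pd.DataFrame({
    "image_path": [f"img_{i:03d}.png" for i in range(30)],
    "patient_id": [i // 3 for i in range(30)],
})

# First carve out 80% of patients for training.
gss = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=42)
train_idx, rest_idx = next(gss.split(df, groups=df["patient_id"]))
train, rest = df.iloc[train_idx], df.iloc[rest_idx]

# Split the remaining patients evenly into validation and test (1:1).
gss2 = GroupShuffleSplit(n_splits=1, train_size=0.5, random_state=42)
val_idx, test_idx = next(gss2.split(rest, groups=rest["patient_id"]))
val, test = rest.iloc[val_idx], rest.iloc[test_idx]

# Grouping by patient_id guarantees no patient spans two sets.
assert not set(train["patient_id"]) & set(val["patient_id"])
assert not set(train["patient_id"]) & set(test["patient_id"])
assert not set(val["patient_id"]) & set(test["patient_id"])
```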

Table 1 Data distribution of the training, validation, and test sets for the detection model of gastric lesions.

All endoscopic procedures were performed and reviewed by experienced endoscopists, each with more than 6000 cases of prior experience. Gastric cancers and adenomas were treated with either endoscopic submucosal dissection or surgery, and the pathological results of the resected tumors were reviewed.

The lesions were classified by combining endoscopic findings with the pathology reports reviewed by the endoscopists (HSC and BKK). Endoscopic images were classified into six categories: EGC, AGC, gastric dysplasia, BGU, benign polyps, and benign erosions. Images were also classified according to their malignant potential: neoplasm versus benign and cancer versus non-cancer. Cancers included EGC and AGC, whereas neoplasms included both gastric cancer (EGC and AGC) and gastric dysplasia (low-grade dysplasia [LGD] or high-grade dysplasia [HGD]). For gastric cancers, the pathology results of the resected specimens were reviewed, and the depth of invasion was identified as mucosal cancer (T1a), submucosal invasion (T1b), proper muscle invasion (T2), subserosal invasion (T3), or serosal invasion or invasion of adjacent structures (T4). The training dataset for the model that classified the depth of invasion is presented in Table 2.

Table 2 Data distribution of the training, validation, and test sets for the model classifying the depth of invasion.

Characteristics of the included images

The data were categorized into six classes, as listed in Table 1. Specifically, 48.24% (4911/10,181) of the entire dataset fell into the “neoplasm” category (including EGC, AGC, HGD, and LGD), 23.38% (2380/10,181) were classified as “non-neoplasm” (which included BGU, benign polyps, and benign erosions), and 28.39% (2890/10,181) as “normal mucosa.”

Within the neoplasm category, dysplasia images comprised the highest proportion at 20.71% (2108/10,181), followed by AGC at 13.80% (1405/10,181) and EGC at 13.73% (1398/10,181). In the non-neoplasm category, benign polyp images constituted the largest portion at 8.99% (915/10,181), followed by erosions at 8.07% (822/10,181) and BGU at 6.32% (643/10,181). Normal mucosa images were not separately categorized during training; however, they were used as background images and as negative examples in the test set. Endoscopic images were extracted from the picture archiving and communication system of SNUH in PNG format and had been captured using GIF-H290 endoscopes and EVIS LUCERA ELITE CV-290 video processing systems (Olympus Medical Systems, Tokyo, Japan). Furthermore, to anonymize the patient data, sections corresponding to patient information were cropped and removed from the original endoscopic images. Consequently, only the images corresponding to the field of view of the gastrointestinal endoscope were retained after preprocessing (the minimum resolution of these images was 371 × 322 pixels).
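As an illustration of this de-identification step, a minimal sketch follows; the directory names and crop box are hypothetical placeholders, not the coordinates used in the study.

```python
# Sketch of cropping away the patient-information banner so that only the
# endoscopic field of view remains. Paths and crop box are hypothetical.
from pathlib import Path
from PIL import Image

SRC, DST = Path("raw_png"), Path("deidentified")
CROP_BOX = (340, 40, 1260, 1030)  # (left, upper, right, lower); assumed values

DST.mkdir(exist_ok=True)
for path in SRC.glob("*.png"):
    image = Image.open(path)
    image.crop(CROP_BOX).save(DST / path.name)  # keep only the endoscopic view
```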

In medical datasets, achieving a natural balance can be challenging, resulting in imbalances in the number of data points across different lesion types when the images are used for classification training. Various methods have been used to address this issue. In this study, we adopted data augmentation techniques (Fig. 2), including horizontal flip, HSV channel translation, affine augmentation, polar augmentation26, mosaic augmentation27, and copy-paste augmentation28. Lesion images from patients undergoing upper endoscopy typically consist of approximately three to four images per patient, taken from various angles and distances; we therefore also addressed class imbalance by employing image stitching29 in the validation set (Fig. 2).

The authors assert that all procedures contributing to this work complied with the ethical standards of the relevant national and institutional committees on human experimentation and with the Declaration of Helsinki of 1975, as revised in 2008. The requirement for written consent was waived by the Institutional Review Board (IRB) of Seoul National University Hospital (no. 2108-030-1242; approval date August 31, 2021).

Figure 2

Application examples of image augmentation: augmentation using the imgaug library, with affine transformations on the left and polar augmentation on the right of the original image. Image stitching with homography aligns multiple multi-angle images of a lesion for augmentation.
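To make the still-image pipeline concrete, the following is a minimal sketch of the flip, HSV, affine, and polar augmentations using imgaug 0.4.0; the probabilities and parameter ranges are illustrative assumptions rather than the study's tuned values, and mosaic/copy-paste are YOLO-style batch operations typically applied inside the detector's data loader.

```python
# Sketch of still-image augmentation with imgaug 0.4.0; parameter ranges are
# illustrative only. Mosaic and copy-paste augmentation are batch-level,
# YOLO-style operations handled in the detector's data loader, not here.
import numpy as np
import imgaug.augmenters as iaa

augmenter = iaa.Sequential([
    iaa.Fliplr(0.5),                              # horizontal flip
    iaa.AddToHueAndSaturation((-20, 20)),         # HSV channel translation
    iaa.Affine(scale=(0.9, 1.1),                  # affine augmentation
               rotate=(-15, 15),
               translate_percent=(-0.1, 0.1)),
    iaa.Sometimes(0.3,                            # polar augmentation
                  iaa.WithPolarWarping(iaa.Affine(translate_percent=(-0.05, 0.05)))),
])

frame = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)  # stand-in image
augmented = augmenter(image=frame)
```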

Model development and main outcome

All deep learning models were developed using the Python programming language (version 3.9.0)30 and PyTorch 1.11.031. The imgaug library (version 0.4.0)26 was used for data augmentation. We employed YOLOv732 to develop a multiclass detection model for the six lesion classes. To identify the configuration yielding the best-performing model, we performed hyperparameter optimization with a genetic algorithm for YOLOv732 and present the results as an optimized parameter table (Table 3). The hardware setup used for training comprised two RTX 3090 Ti graphics processing units, a 12th-generation Intel® Core™ i9-12900K processor, and 32 GB of RAM.
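The genetic search can be pictured as a mutate-train-select loop over bounded hyperparameters. The following self-contained sketch illustrates the idea; the hyperparameter names, bounds, and stub fitness function are illustrative assumptions, not the actual YOLOv7 evolution code.

```python
# Minimal sketch of genetic-algorithm hyperparameter evolution in the style of
# the YOLO family's evolve loop. train_and_score() is a stub standing in for a
# full YOLOv7 training run; names and bounds are examples only.
import random

SEARCH_SPACE = {                      # name: (low, high) bounds, illustrative
    "lr0": (1e-4, 1e-1),
    "momentum": (0.6, 0.98),
    "weight_decay": (0.0, 1e-3),
    "hsv_s": (0.0, 0.9),
}

def train_and_score(hyp):
    """Stub fitness; in practice, train the detector and return e.g. mAP@0.5."""
    return random.random()

def mutate(parent, sigma=0.2):
    """Perturb each hyperparameter multiplicatively and clip to its bounds."""
    child = {}
    for name, (low, high) in SEARCH_SPACE.items():
        value = parent[name] * (1.0 + random.gauss(0.0, sigma))
        child[name] = min(max(value, low), high)
    return child

best = {name: random.uniform(low, high) for name, (low, high) in SEARCH_SPACE.items()}
best_fitness = train_and_score(best)

for _ in range(30):                   # 30 generations, one candidate each
    candidate = mutate(best)
    fitness = train_and_score(candidate)
    if fitness > best_fitness:        # greedy (1+1)-style selection
        best, best_fitness = candidate, fitness

print("best hyperparameters:", best)
```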

Table 3 Hyperparameters of the detection model after optimization with the genetic algorithm.

We also developed a classification model to distinguish the depth of invasion in cancer images. Based on the T stage from the pathological reports of resected specimens from patients with gastric cancer, we developed a binary classification model for T stage estimation.

Notably, we evaluated our model on images showing discrepancies between the initial endoscopic impression and the actual T stage reported from the resected specimen. These included images from 13 patients who were initially thought to have EGC based on endoscopic findings but were upstaged to AGC after resection, and 75 patients who were initially thought to have AGC based on endoscopic findings but were downstaged to EGC after resection.

The primary outcome was the lesion detection rate of the detection model. Additional performance metrics, computed from the confusion-matrix counts as in the sketch after this list, included the following:

• Positive predictive value (PPV): true positive / (true positive + false positive).

• Sensitivity: true positive / (true positive + false negative).

• Negative predictive value (NPV): true negative / (true negative + false negative).

• Specificity: true negative / (true negative + false positive).

• Accuracy: (true positive + true negative) / (true positive + true negative + false positive + false negative).
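A minimal sketch of these definitions as code, using arbitrary placeholder counts:

```python
# The listed metrics computed from raw confusion-matrix counts; the example
# counts are arbitrary placeholders.
def diagnostic_metrics(tp, fp, tn, fn):
    return {
        "ppv": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "npv": tn / (tn + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

print(diagnostic_metrics(tp=80, fp=20, tn=90, fn=10))
```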

Comparative performance analysis with experts

To compare the model's performance with that of experts, we conducted an additional analysis involving our AI model and four expert endoscopists. We collected an additional set of 104 anonymized endoscopic images that were not part of the model's development dataset and came from a different period (2023–2024). These images were evenly distributed across the six diagnostic classes. The model and the four expert endoscopists independently reviewed the images without any knowledge of each other's assessments. Following these evaluations, we compiled and analyzed the results to compare the diagnostic performance of the AI model and the experts.

Ethics declarations

Approval of all ethical and experimental procedures and protocols was granted by the Institutional Review Board (IRB) of Seoul National University Hospital (IRB no. 2108-030-1242). Owing to the retrospective nature of the study, the IRB waived the requirement to obtain informed consent.

Results

Test performance of the computer-aided detection (CADe) model

A schematic of the established lesion detection system is depicted in Fig. 3. The lesion detection rate, assessed on a per-patient basis, was 95.22% (219 of 230 patients) in the test set. In the context of endoscopic inspection, we opted to analyze the test results on a per-lesion basis, prioritizing continuous observation of specific lesions over individual static image evaluations. This approach yielded per-lesion detection rates of 100% for EGC, 97.22% for AGC, 96.49% for dysplasia, 75.00% for BGU, 97.22% for benign polyps, and 80.49% for benign erosions. In internal testing of the classification models, the six-class category exhibited a maximum accuracy of 73.43% (95% CI 71.01–75.85%), accompanied by a sensitivity of 80.90% (95% CI 76.45–85.36%), specificity of 83.32% (95% CI 80.69–85.95%), PPV of 73.68% (95% CI 69.39–77.98%), and NPV of 88.53% (95% CI 86.29–90.77%) (Table 4). When lesions were categorized into cancer and non-cancer, detection rates were as high as 98.61% and 89.24%, respectively. Cancer lesions (EGC and AGC) demonstrated an NPV of 88.57% (95% CI 86.22–90.93%) and a sensitivity of 78.62% (95% CI 72.39–84.85%). For neoplasms versus non-neoplasms, the lesion detection rates were 97.67% and 85.15%, respectively. The CADe model also achieved a high NPV of 89.80% (95% CI 87.29–92.31%) for neoplasms, indicating particular strength in ruling out neoplastic lesions (Table 4).
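Under this per-lesion convention, a lesion counts as detected if the model flags it in at least one of its images; the following minimal sketch illustrates the aggregation with a hypothetical record layout.

```python
# Sketch of per-lesion aggregation: a lesion counts as detected if it is
# detected in at least one of its images. The record layout is hypothetical.
from collections import defaultdict

# (lesion_id, detected_in_this_image) pairs, one per test image.
image_results = [
    ("L001", False), ("L001", True),    # lesion L001: missed once, found once
    ("L002", False), ("L002", False),   # lesion L002: missed in both images
]

per_lesion = defaultdict(bool)
for lesion_id, hit in image_results:
    per_lesion[lesion_id] |= hit        # any hit marks the lesion detected

rate = sum(per_lesion.values()) / len(per_lesion)
print(f"per-lesion detection rate: {rate:.2%}")  # 50.00% for this toy data
```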

Figure 3

The structure of YOLOv7; MP: MaxPooling; ConvModule: Conv2d-BatchNormalization-SiLU activation; RepConv: Reparameterization.

Table 4 Diagnostic performance of the model in classifying lesions on endoscopic images.

The detailed training parameters for the lesion detection model were as follows (a PyTorch sketch of the optimizer and scheduler setup appears after the list):

(a) Augmentation methods: horizontal flip, HSV channel translation, translation, scale, mosaic, and copy-paste.

(b) Batch size: 64.

(c) Epochs: 100.

(d) Optimizer: SGD with a momentum of 0.842.

(e) Learning rate scheduler: OneCycleLR.

(f) Label smoothing: 0.1.
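A minimal PyTorch sketch of the optimizer, scheduler, and label-smoothing settings listed above; the one-layer model stub, base learning rate, and steps per epoch are placeholders, and the real YOLOv7 loss combines several terms beyond this classification component.

```python
# PyTorch sketch of the listed optimizer, scheduler, and label-smoothing
# settings. The stub model, base learning rate, and steps per epoch are
# placeholders; YOLOv7's loss also includes objectness and box terms.
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, kernel_size=3)   # placeholder for the YOLOv7 network
EPOCHS, STEPS_PER_EPOCH = 100, 120        # steps per epoch depends on the dataset

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.842)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.01, epochs=EPOCHS, steps_per_epoch=STEPS_PER_EPOCH)

# Label smoothing of 0.1 on the classification component of the loss.
cls_loss = nn.CrossEntropyLoss(label_smoothing=0.1)
```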

Test performance of the T-stage classification model

In this study, we developed an algorithm to classify cancer images by their depth of invasion into T stages, employing the EfficientNet-B3 model. When the T stage was classified as T1 versus T2–T4, the model achieved an accuracy of 85.17%, sensitivity of 88.68%, specificity of 79.81%, PPV of 87.04%, and NPV of 82.18% (Table 4). Notably, among the 75 patients whose initial endoscopic impression was AGC but whose pathologically proven stage was EGC, our model accurately predicted the T stage in 65 patients. In addition, among the 13 patients whose initial endoscopic impression was EGC but whose actual T stage was AGC, the model accurately predicted the T stage in nine cases.

The detailed training parameters for the T-stage classification model were as follows (a minimal PyTorch sketch of the model and optimizer setup appears after the list):

(a) Batch size: 48.

(b) Epochs: 100.

(c) Optimizer: SGD with a momentum of 0.9 and a weight decay of 0.00034.

(d) Learning rate: 0.0096.

(e) Loss function: sparse categorical cross-entropy.
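A minimal PyTorch sketch of a two-class (T1 vs. T2–T4) EfficientNet-B3 setup consistent with the parameters above; using torchvision's implementation is an assumption about tooling, and "sparse categorical cross-entropy" (integer class labels) corresponds to nn.CrossEntropyLoss in PyTorch.

```python
# Sketch of a binary (T1 vs. T2-T4) EfficientNet-B3 classifier consistent with
# the listed parameters. Using torchvision's implementation is an assumption;
# `pretrained=True` matches the torchvision API contemporary with PyTorch 1.11.
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b3

model = efficientnet_b3(pretrained=True)
in_features = model.classifier[1].in_features      # final linear layer input size
model.classifier[1] = nn.Linear(in_features, 2)    # two classes: T1, T2-T4

optimizer = torch.optim.SGD(model.parameters(), lr=0.0096,
                            momentum=0.9, weight_decay=0.00034)
# Integer-label ("sparse") cross-entropy, as listed above.
criterion = nn.CrossEntropyLoss()
```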

Comparative performance analysis with experts

In our expanded analysis involving four expert endoscopists and our CNN model, the model performed robustly across a diverse dataset covering six lesion types. We analyzed its performance on several tasks: distinguishing cancer from non-cancer, distinguishing neoplasm from non-neoplasm, and classifying each of the six lesion types (Supplementary Tables 1–3). For the classification of cancer versus non-cancer, the most important distinction in endoscopic examination, the NPV and sensitivity of the AI were 88.89% and 98.51%, respectively, which were superior to those of the experts (78.38% and 88.06%, respectively) (Supplementary Table 1).

We further identified every case in which our model correctly classified the lesion type while the expert endoscopists did not (Supplementary Table 4), with representative images shown in Fig. 4. Our model accurately recognized dysplasia in cases where some experts categorized the same lesions as benign erosion (Fig. 4A,B) or EGC (Fig. 4C). Furthermore, our model correctly identified benign polyps that some experts misclassified as dysplasia (Fig. 4D,E) or AGC (Fig. 4F). This is of particular importance because such distinctions can have significant implications for subsequent clinical management, including treatment decisions and follow-up endoscopy scheduling.

Figure 4

Representative cases in which the AI correctly classified the lesion while experts did not. (A, B) Cases where the AI accurately recognized dysplasia that experts categorized as benign erosion. (C) A case where the AI accurately recognized dysplasia that experts categorized as EGC. (D, E) Cases where the AI accurately recognized benign polyps that experts misclassified as dysplasia. (F) A case where the AI accurately recognized a benign polyp that experts misclassified as AGC.

Discussion

In this study, we developed an automated system for detecting and classifying malignant, premalignant, and benign gastric lesions. Previous studies25 classified lesions using a separate artificial intelligence model applied to images flagged by an anomaly detection algorithm. However, this approach requires an additional step to indicate whether histological examination is needed even after lesion detection, deviating from the primary goal of reducing workload and fatigue23,24,25. To overcome this limitation, we developed a multiclass detection algorithm with real-time processing that eliminates the need for dual processing. Although detection algorithms with high sensitivity can identify subtle lesions, there is a risk of overprediction. Misclassifying non-lesions as lesions can inundate clinicians with false alarms, possibly overshadowing true lesions and misguiding lesion classification9,10. This can compound clinician fatigue, which is a challenge that CADe aims to address. In addition, a major concern during endoscopic examinations is the possibility of missed lesions such as cancers. Considering these challenges, our algorithm was primarily tailored to yield a high NPV for cancerous and neoplastic lesions11. From the data encompassing 2606 patients, we achieved an NPV of 88.53% and a sensitivity of 80.90% for the six-class classification. Notably, cancerous lesions had an NPV of 88.57% and a sensitivity of 78.62%, and neoplastic lesions had an NPV of 89.80% and a sensitivity of 83.26%. Overall, the detection rate was 95.22% in the 306 patients evaluated.

The decision to perform a biopsy is often based on the assessment of the malignant and premalignant potential of a lesion. This nuanced judgment is based on extensive endoscopic experience. Our methodology can substantially aid in refining biopsy decisions and in classifying lesions as cancerous, neoplastic, or benign.

In addition, when estimating the depth of invasion for gastric cancer, discrepancies between endoscopist impressions and the actual pathological T stage are often observed. Our model achieved high performance in the prediction of the T stage, even in cases that showed discrepancies in the endoscopic and pathological T stages. Because treatment options for gastric cancer, such as endoscopic or surgical resection, often rely on the estimated T stage, this model could aid in accurate decisions for optimal treatment.

However, this study has a few limitations. The data were obtained exclusively from a single institution; therefore, the algorithms require external validation. Moreover, because Seoul National University Hospital is a tertiary referral center, advanced cancer cases were overrepresented relative to EGC and BGU cases, resulting in somewhat lower evaluation metrics, particularly for the BGU and EGC classes. This data imbalance also raises the possibility of bias toward cancerous lesions. Hence, securing multi-institutional data is recommended for external validation in future studies. Furthermore, because we could not compare the performance of the developed algorithm with that of endoscopists on the same test images, the true extent of the algorithm's enhancement of the lesion detection rate and its potential clinical impact could not be determined.

Compared with previous studies12,16,22,23, the strength of this study is its comprehensive inclusion of various lesions. Not only did we account for diverse cancers and premalignant lesions, but we also incorporated common benign lesions that show diverse endoscopic appearances, including BGUs, benign polyps, and benign erosions. This wide-ranging analysis ensured a thorough and representative evaluation, thus enhancing the applicability and robustness of the findings.

In summary, our model, which was tailored for detecting and classifying gastric cancers, dysplasia, and various benign lesions, demonstrated an outstanding performance and has the potential to assist clinicians in decision-making during endoscopic procedures.