Introduction

Colorectal cancer (CRC) is the third most common cancer diagnosed globally1. A screening colonoscopy is the proven modality that enables a reduction in CRC risk through early detection and removal of premalignant colorectal polyps2. The two main types of precancerous lesions of CRC are conventional adenomas (ADs, the precursors of 70% of all CRCs) and sessile serrated lesions (SSLs, the precursors of 15–30% of all CRCs)3. The detection rate of premalignant polyps is the key quality indicator in a colonoscopy4. However, the overall detection rate has varied significantly among individual endoscopists, owing to different recognition skills and the withdrawal technique5. Additionally, risk factors leading to missed polyps, such as flat or sessile shapes, a pale color, and small size, can affect the detection rate6. Previous studies reported a wide variation in the adenoma detection rate (9.4–32.7%) and SSL detection rate (0.6–11%) across endoscopists, and the overall polyp miss rate is approximately 22%5,7,8. The variation of the detection rate and the high polyp miss rate significantly affect the efficacy of colonoscopies for CRC prevention9.

Artificial intelligence technology based on deep learning is being applied in various medical fields, and research is being actively conducted to develop computer-aided detection (CADe) systems for colonoscopies to overcome the limitation of the variance of human skills10,11,12. Color, texture, and shape-based features have been used to detect polyps, and polyp detection systems using convolutional neural networks (CNNs) have shown promising results in recent studies11,12,13,14. A meta-analysis that included five randomized control trials reported that CADe groups exhibited a 44% increase (36.6% vs. 25.2%) in the adenoma detection rate (ADR) and a 70% increase (58% vs. 36%) in the number of ADs per colonoscopy (APCs) when compared with the control groups15. These well-trained CADe systems demonstrated high performance for adenoma detection15,16,17. However, the fidelity of CADe systems for SSL detection is still lacking, owing to the very low SSL detection performance when compared with the clinical benchmark for the SSL detection rate (> 5%)15,18. SSLs are a high-risk precursor for CRC; however, they can be easily missed even by experienced endoscopists, owing to their subtle morphology with indistinct border, pale color, and flat or sessile shape19,20,21,22,23,24. Large individual variations in the SSL detection rate have also been observed in clinical experts who have been extensively trained to detect SSLs25. Therefore, systems specializing in SSL detection must be developed to improve detection performance by detecting additional CRC precursors that may not be visually recognized26.

Training data composition is important for a colonoscopy CADe system because colon polyp datasets collected from clinical practice are typically imbalanced27,28. Prevalence studies indicate that ADs are approximately eight times more prevalent than SSLs, and each type of polyp exhibits unique endoscopic features29. These data imbalance problems can introduce a bias in the training process, thereby decreasing the performance of the CADe system30. To address these problems, traditional augmentation methods have been explored to expand minor types of data, including flipping, rotation, scaling, and cropping31,32. Recently, generative adversarial networks (GANs) have led active research on medical data synthesis and have been considered for various applications33,34,35. However, in the field of endoscopy, it is difficult to generate whole synthetic images because endoscopy does not have a formal protocol and structure36. To overcome this problem, studies have employed GANs to generate endoscopic images that include polyps by synthesizing a normal mucosa background and a lesion patch37,38. They showed improved detection performance on gastric cancers and colorectal polyps. Nevertheless, these synthesized images have relatively low quality when compared with actual endoscopic images, and they may not reflect an actual endoscopic environment including polyp features such as color and texture. In the present study, a style-based GAN (StyleGAN) method was adopted to decrease the type imbalance by synthesizing high-resolution endoscopic images, almost indistinguishable from real images and including features of SSLs, proven through a visual Turing test. Then, a polyp detection system contributing to the detection of easily missed polyps was developed based on the validated GAN-synthesized images.

Results

SSL image synthesis with GAN

In Fig. 1, the synthesized SSL images exhibited global consistency and realistic mucosa features, including texture, color, folds, and blood vessels. Most importantly, the SSLs in the synthesized images depicted the real endoscopic features of SSLs, including sessile or flat morphology, a pale color, disrupted vascular pattern, altered fold contour, indistinct borders with mucus capping, and a rim of bubbles or debris (Fig. 1b). Those features and the quality of synthesized images were identified in the assessment by clinical experts. Additionally, we identified that SSL images with combined features of two or more polyp images present in the GAN training datasets were synthesized (Supplementary Fig. S1). These images can provide more information about the real data distribution. To evaluate the quality of images synthesized with the GAN, we conducted t-SNE visualization and assessments by clinical experts.

Figure 1
figure 1

Generated SSL images by StyleGAN2. (a) Representative synthetic images with high-quality SSL features, and (b) endoscopic features of SSLs discovered in GAN-synthesized SSL images including an indistinct border, flat and irregular shape, mucous cap, a cloud-like surface without vessels, and a dark spot inside the crypts.

Fréchet inception distance (FID) score computation

While training StyleGAN2 on the SSL images, we quantified the quality of the synthesized images using an FID score. The FID score was initially 426.0625, while the final FID score was 42.1885 (Supplementary Fig. S2).

Visualization of t-SNE

We used t-SNE visualization to further analyze the synthesized images. The t-SNE algorithm for dimensionality reduction enables the embedding of high-dimensional data into a 2D space. We used t-SNE on a random selection of 200 SSL real images from the original dataset and 200 SSL synthetic images generated by the GAN. As presented in Fig. 2a, the synthetic SSL data group is not separated from the original data distribution, which means that the synthetic images reflected a real distribution. Moreover, it may be helpful to provide more features about the real distribution uncovered by the original dataset. However, we confirmed a discrete distribution between the real images in the original dataset and the traditionally augmented images in Fig. 2b. New features extracted from the traditional augmentation dataset may provide additional information; however, they may not be helpful to approximate the real distribution.

Figure 2
figure 2

Visualization of SSL augmented images with t-SNE. (a) t-SNE of SSL images between original dataset (blue) and GAN augmented dataset (orange), (b) t-SNE of SSL images between original dataset (blue) and traditional augmentation dataset (orange).

Quality assessment with clinical experts

We evaluated the image quality of synthetic SSL images with four experts. The prevalence of the samples was blinded to the participating experts. In the first assessment, a visual Turing test was conducted to differentiate between the real and synthetic colonoscopy images; we tested a total of 50 colonoscopy images, which consisted of 25 real and 25 synthetic SSL images that were randomly selected from the real images and GAN-synthesized images. The real images were evaluated as positive, and the synthetic images as negative. All experts showed low performance in identifying whether the lesions shown were true or synthetic. The overall accuracy was 63% (ranging between 60 and 66%), and the sensitivity and specificity for real images was 79% (ranging between 68 and 92%) and 47% (ranging between 32 and 60%), respectively (Fig. 3a, Supplementary Table S1). In the second assessment, a micro assessment of synthetic SSL images was performed according to the characteristic endoscopic features of SSLs, including the indistinct border, pale color, and mucus cap. From the 2400 synthetic SSL images, 120 images were randomly selected for the micro assessment. The image quality was rated in three grades (good, moderate, and poor), representatively shown in Supplementary Fig. S3. The experts judged that 76% (above moderate) of the synthetic SSL images reflected the endoscopic characteristics well (Fig. 3b).

Figure 3
figure 3

Results of quality assessment with clinical experts. (a) Results of a visual Turing test by four experts to differentiate between real and GAN-synthesized SSL colonoscopy images in 1 s, (b) Assessment of SSL endoscopic features in 120 representative samples synthesized by the GAN.

Performance comparison

Performance comparison on polyp detection

To illustrate the validity of the SSL augmentation techniques to improve the detection of premalignant polyps, such as AD and SSL, we evaluated the performance of the augmentation models with the validation dataset, which was composed of 1106 polyp images with 1141 polyps and 1000 normal mucosa images without any polyps. Each polyp image can contain more than one polyp; therefore, the metrics were evaluated per polyp. We implemented the detection models on the whole validation dataset to check the improvement in performance depending on the augmentation techniques and the ratio of the augmented SSLs. To address the imbalance in the organizational ratio of AD, hyperplastic polyps (HP), and SSL within dataset (11:7:1), we augmented SSL images by 1200/1800/2400 corresponding to HP, the average of HP and AD, and AD. As can be observed in Table 1, the models trained on the augmented dataset with two augmentation methods achieved relatively higher performance when compared with the original model. Although the GAN augmentation (GAN-aug) models showed slightly less positive predictive value (PPV) than the traditional augmentation (T-aug) models, they achieved better performance than other models for the average precision (AP) and F1-score, which indicates the overall performance of the detection model. To compare all the augmented models, we focused on AP, which is the representative metric of object detection. Supplementary Fig. S4 shows the variation in the performance (AP) on polyp detection according to the proportion of augmented SSL images in the training dataset in comparison with the original dataset, original + 1200 (aug1), original + 1800 (aug2), and original + 2400 (aug3) augmented datasets. Compared with the performance of the original model, both augmentation methods exhibited a highly improved performance. However, it can be confirmed that data augmentation over a particular proportion causes performance degradation. Through the augmentation methods, we identified that 1800 SSL image generation leads to the best performance in each method. In Table 1, the detection model with the GAN-aug2 dataset achieved the best performance, except for the specificity and PPV. SSL is a polyp type that has inconspicuous features; thus, the increase in SSL images can lead to high sensitivity and low specificity. The ROCs of each of the best models in the original, T-aug, and GAN-aug datasets are shown in Supplementary Fig. S5. In addition, public datasets were evaluated to compare the performance of our model with that of models developed in other studies31,39,40. A CNN-based approach CUMED of the MICCAI 2015 challenge on polyp detection39 and two optimal models using a faster R-CNN proposed by Shin et al.31 were compared on ETIS-LaribPolypDB41. A model based on YOLOv2 proposed by Lee et al.40 was computed on CVC-ClinicDB42. As shown in Supplementary Table S2, the GAN-aug2 model achieved the highest sensitivity value on both datasets: 89.4% on ETIS-LaribPolypDB and 91.0% on CVC-ClinicDB. The detected polyp images obtained using our SSL images, ETIS-LaribPolypDB, and CVC-ClinicDB are shown in Supplementary Fig. S6.

Table 1 Performance of original, traditional augmentation (T-aug), and GAN augmentation (GAN-aug) models for the validation dataset including polyp (n = 1141) and negative images (n = 1000).

Performance comparison on histological polyp type

To identify the performance improvement in AD and SSL detection, we separated polyps in the validation dataset into three histological types: AD, HP, and SSL. The polyps included in the validation dataset comprised 620 AD, 438 HP, and 63 SSLs. To accurately evaluate the effectiveness of the detection system for SSLs, we collected 130 additional SSL images for the temporal validation dataset between March 2020 and October 2020 from the same institution.

We evaluated three models on the type-separated polyp validation dataset, original model, T-aug2 model, and GAN-aug2 model. Two augmented models were selected because they are representative of the whole validation dataset in each augmentation method. Compared with the original model, polyp detection sensitivity for each of the three types increased for the GAN-aug2 model. Notably, the GAN-aug2 model exhibited a 19.1% sensitivity improvement when compared with the original model on SSL images (Supplementary Table S3).

Because the number of SSLs in the validation dataset was small to assure the effectiveness of the GAN augmentation, we validated with the SSL temporal dataset, which included 130 images with 133 SSLs. As shown in Table 2, by evaluating all models in the SSL temporal validation dataset, we obtained similar results to the augmented models with 1800 SSL images, which showed high performance when compared with the other models. In addition, GAN augmentation methods also exhibited better performance than the traditional augmentation methods in this case. As a result, we can confirm that the GAN is validated for use in augmentation methods. Furthermore, its performance changed according to the identified augmentation ratio, and we could identify the importance of augmentation considering the distribution of the data.

Table 2 Sensitivity, positive predictive value (PPV), and average precision (AP) of original model, traditional augmentation models, and GAN augmentation models for sessile serrated lesion (SSL) temporal validation dataset (n = 133).

Discussion

In this paper, we aimed to generate synthetic SSL images using a GAN for data augmentation to address the data imbalance issue that commonly exists in medical image analysis. As observed in Table 3, the distribution of the polyp histology classes is imbalanced, especially for SSLs, which are a high-risk precursor of CRC but are easily missed during a colonoscopy23,24. To improve the performance of the polyp detection system on easily missed polyps, data augmentation with GAN-synthesized SSL images was applied. In previous studies, there were some trials to synthesize endoscopy images using a GAN37,38,43,44. For domain adaptation and 3D reconstruction through endoscopy images, GAN techniques were used to synthesize intermediate data of the network, which is operated to translate original endoscopy images to another domain43,44. Although GAN synthesis studies have also been conducted for detection problems, the methods were bounded to integrate normal mucosa background images and polyp lesion patches37,38. These synthesized images through the randomized combinations of lesion patches and normal mucosa have relatively low quality when compared with the actual endoscopic images, and they may not reflect the endoscopic locational information about histological features, such as color, shape of folds according to thickness, and vessel distributions45,46,47. Additionally, only deterministic polyps are generated including limited variations in polyp features in terms of color and texture38. To improve detection performance on easily missed polyps, augmented data must contain subtle features of SSLs that are difficult to detect. In our study, StyleGAN2 exhibited the synthesized high-resolution whole endoscopic images including the GAN mixed-style images, which showed the combined features of two or more SSL images. The quality of the synthesized polyp images was evaluated by four expert endoscopists to verify the reality and endoscopic features of the SSLs. The annotation of the synthetic images was checked twice by the operators. The result proved that it is possible to generate high-resolution endoscopy images using a GAN with only polyp images as the input. Moreover, with one histological type, a GAN can generate polyps, including the endoscopic features of each type, such as texture, color, folds, and blood vessels ensuring the reliability of data synthesis with lesion features in the medical domain.

Through the use of these synthetic images, we can achieve high performance, especially on SSLs. Compared with a traditional augmentation method, GAN-aug models exhibited high detection performance overall (Tables 1, 2, and Supplementary Table S3). This revealed that the GAN augmentation technique can be more effective in augmenting endoscopy datasets for detection than traditional augmentation methods. Our results showed that the synthesized polyp images have meaningful features and can be helpful in solving the data imbalance problem in developing detection systems. Moreover, we could identify the reason for the relatively low performance of T-aug models when compared with GAN-aug models through data distribution visualized using t-SNE. In Fig. 2, the augmented images from the GAN are distributed in the same space with real data; however, they may fill in the distributions by adding data points that are not covered by the original dataset. Considering the augmentation number in the training dataset, it is important to consider the ratio of the augmented data according to the composition. We identified that additional augmented data does not lead to improved performance. As shown in Tables 1 and 2, aug2, which augmented the data with 1800 synthetic SSL images, exhibits the best performance. The reason why aug3 does not exhibit an improved performance when compared with aug2 may be that the distribution of 2400 augmented data differs from the actual data distribution. When the generator learns to map a small subset of the possible realistic modes, partial mode collapse occurs48. Data augmentation over a certain value with partial mode collapse could distort the data distribution49,50.

To develop an effective CADe system, high detection performance for premalignant polyps that are hard to detect, such as SSLs, is the key factor. Zhou et al.51 reported an 84.10% sensitivity to SSL frames with a system that achieved an overall sensitivity of 94.07% and 0.98 AUC in a study by Wang et al.11. In particular, we could identify a definite difference in detecting SSL polyps between the original model, T-aug model, and GAN-aug model (Supplementary Table S3, Table 2). The GAN-aug2 model achieved a sensitivity of 95.24% and an AP of 0.9338 in the SSL images included in the validation dataset. Additionally, in the SSL temporal validation dataset, we achieved 93.98% sensitivity and 0.9302 AP using the GAN-aug2 model. When performing external validation using public datasets, the GAN-aug2 model also showed the best performance in terms of sensitivity compared with models of other studies31,39,40. Using synthesized polyp images with the GAN resulted in a high detection performance for SSLs and was proved to significantly improve the overall polyp detection performance. GAN-synthesized images may approximate the data distribution, which helps the algorithm to be trained in the desired direction.

As a part of future research, we are planning to develop the system with a multi-center dataset and conduct simultaneous video tests to identify the ability of the polyp detection system in comparison with clinical experts. Furthermore, we will synthesize other polyp types, AD and HP, to confirm the effect of augmentation ratio of each type on the detection performance, and thus develop the classification study on polyp type diagnosis using the GAN-synthesized methods. These results can be extended to multi-class detection studies that simultaneously conduct polyp detection and diagnosis. In addition to the GAN augmentation technique, an advanced CADe system can be developed by improving the resolution of endoscopy with real-time image super-resolution and training various parts of polyps through deep neural network correlation learning mechanism52,53.

In conclusion, we proposed an enhanced polyp detection system, which demonstrated higher detection performance, especially on SSLs, which are easily missed during a colonoscopy. Moreover, we verified the capacity of GAN in detection problems and confirmed that GAN-synthesized images are realistic and reflect the endoscopic features of polyps. These features can provide additional information and ensure a comprehensive distribution of the non-augmented dataset, which helps the model to achieve the intended direction of training. Furthermore, the endoscopic features of polyp types in GAN-synthesized images could be utilized for histological type diagnosis problems.

Methods

In this section, we describe the dataset used in developing the polyp detection system, and explain the method for generating synthetic SSL images using GAN and traditional augmentation methods. The quality of the generated SSL images was evaluated using both quantitative and qualitative techniques. To identify the effect of data augmentation depending on the proportion between imbalanced types, SSL minor type augmentations were conducted three times (with an increase in the major type, middle type, and average between those). Then, the performance analysis of the polyp detection models trained with original data, GAN augmentations, and traditional augmentations could be conducted (Fig. 4). The study protocol adhered to the ethical guidelines of the 1975 Declaration of Helsinki and its subsequent revisions, and was approved by Seoul National University Hospital Institutional Review Board (number H-2001-083-1095). A study protocol was designed in consideration of the use of retrospective data of the previous study (number H-1505-019-670), and informed consent from the patients was waived by Seoul National University Hospital Institutional Review Board.

Figure 4
figure 4

Procedure to develop automatic polyp detection system with SSL augmentation using GAN. In the generator network, “A” means a learned affine transform and “B” operates the noise broadcast.

Dataset

We developed a polyp detection system using 5503 white-light colonoscopy images with polyps from colonoscopy examinations undertaken at the Seoul National University Hospital, Healthcare System Gangnam Center between October 2015 and February 2020. The endoscopic images for the development of the colonoscopy AI system were collected from the retrospective and prospective databases. Retrospective data was collected from the previous study’s database, and informed consent from the patients was waived54. Prospective data was collected under written informed consent. All colonoscopies were performed using high-definition colonoscopes (EVIS LUCERA CV260SL/CV290SL, Olympus Medical Systems Co., Ltd., Japan) by nine endoscopists. Polyp images were randomly split in a ratio of 8:2 to compose the training and validation datasets (both images included different sets of polyps). To evaluate negative performance, we used 1000 normal mucosa colonoscopy images with no polyps. In addition, because the number of SSLs in the validation dataset was less than 100, we collected another 130 SSL images in the same process between March and October 2020 for temporal validation. The organizational ratios within the dataset show that ADs, HPs, and SSLs have a ratio of approximately 11:7:1, and SSLs constitute a very small proportion (Table 3). To address this imbalance, we augmented the minor type SSL images by 1200/1800/2400 images corresponding to HP, the average of HP and AD, and AD. Additionally, for external validation, ETIS-LaribPolypDB41 and CVC-ClinicDB42 were used to evaluate the models. ETIS-LaribPolypDB is composed of 196 images with 208 polyps, and CVC-ClinicDB includes 612 images containing 646 polyps. Details of all the datasets are provided in Supplementary Fig. S7.

Table 3 Polyp characteristics for the training, validation, and SSL temporal validation datasets.

Synthetic SSL image generation

We used two strategies to generate SSL synthetic data: a traditional augmentation technique involving geometric transformations, and synthesized images through a GAN model, which learned using SSL images from the training dataset. In each of the two augmentation methods, three datasets were randomly sampled to a ratio of 1200/1800/2400 depending on the type imbalance. The number of augmented SSL images was determined by the proportional difference between the major (AD), middle (HP), and minor (SSL) types. Datasets to which 1200 synthetic SSL images were added to match the ratio with that of the middle type (HP) are called aug1. Similarly, the other two datasets are called aug2 (1800 images) and aug3 (2400 images), respectively. For performance comparison, we used the same environment and the same hyperparameters.

Traditional augmentation method

Data augmentation including geometric transformations, such as rotating, flipping, cropping, and scaling has been traditionally conducted to enlarge the minor type of the training dataset31,32. We implemented augmentation techniques on all 222 SSL images in the detector training dataset. Each image was flipped vertically. Then, original non-flipped images and flipped images were rotated three times at random angles \(\theta _{1}=[0^{\circ }, \ldots, 120^{\circ }], \theta _{2}=[120^{\circ }, \ldots, 240^{\circ }], \theta _{3}=[240^{\circ }, \ldots, 360^{\circ }]\). Subsequently, the rotated images were randomly cropped to a size of \(256\times 256\) pixels twice. Consequently, each image was augmented 12 times \(((1+1_{flip})\times 3_{rotation}\times 2_{crop})\); 186 images were removed because they could not be cropped while including the whole polyp region. Then, we sampled the 1200/1800/2400 augmented images randomly to create three traditional augmented datasets, T-aug1, T-aug2, and T-aug3.

Image augmentation with GAN

The style-based generator architecture (StyleGAN2) has a hierarchical generator with a skip connection, and it normalizes the convolution weights using estimated statistics instead of normalizing them with actual statistics55. It is a promising model for the generation of medical images, owing to its capacity to synthesize high-resolution images with a realistic level of details56. The aim of training StyleGAN2 was to synthesize SSL colonoscopy images for improving the CADe performance in polyp detection. The dataset is composed of SSL images, included in the polyp detection training datasets. To train StyleGAN2, 203 SSL images were cropped to a size of \(256\times 256\) pixels that included the polyp region; the remaining 19 SSL images were excluded because they could not be cropped to include the polyp. Adam optimizer was used to train StyleGAN2 with momentum parameters \(\beta _{1}=0\) and \(\beta _{2}=0.99\) at a learning rate of \(2\times 10^{-3}\).The main loss function was a non-saturating logistic loss with \(R_{1}\) regularization. The training phase required approximately 5625 ticks for a total of 45,000 iterations (54 days) with a batch size of 16 on an NVIDIA GeForce RTX 2080 Ti GPU\(\times\) 2 (32 GB RAM). StyleGAN2 proposed an FID score to quantify the quality of the synthesized images every 10 ticks. The generated image resolution was adjusted to \(256\times 256 \; pixels\). To verify the differences between the basic image generation model, the image-to-image translation model, and StyleGAN2, we additionally trained DCGAN and CycleGAN on SSL images with the same resolution. After the loss converged, the synthesis results of the three GAN models were compared. As shown in Supplementary Fig. S8, the images synthesized using StyleGAN2 could be confirmed to have the highest resolution and most realistic lesion characteristics. Differences depending on the network type are discussed in the Supplementary Information (Description of the GAN models used).

For quantitative evaluation of the quality of the generated images, we measured the differences between two distributions in the high-dimensional feature space of an Inceptionv3 classifier with FID57. If the activations on the real and synthesized data are N(mC) and \(N(m_{w}, C_{w})\) respectively, FID is defined as

$$\begin{aligned} \left\| m-m_{w} \right\| _{2}^{2}+Tr(C+C_{w}-2(CC_{w})^{\frac{1}{2}}) \end{aligned}$$
(1)

FID was computed with all images in the training dataset and 50,000 random images generated at every 10 ticks. All generated images were annotated with the expected SSL region; those that could not be annotated were discarded. Randomly sampled 1200/1800/2400 synthetic SSL images were added to the original training dataset to compose the GAN-aug1, GAN-aug2, and GAN-aug3 datasets.

Assessment of GAN-synthesized images

We used quantitative and qualitative techniques to analyze the results and demonstrate the quality of the synthetic images generated by the GAN58,59. In the qualitative evaluation, two methods were applied: (i) t-SNE visualization and (ii) quality assessment by clinical experts. Clinical experts in the quality assessment process were board-certified gastroenterologists with colonoscopy experience ranging from 5 to 20 years, and their annual volume of procedures was over 500. Written informed consent was obtained from all participating physicians.

First, the t-SNE algorithm for dimensionality reduction enabled the embedding of high-dimensional data into a 2D space. It represented the similarities between the data points by iteratively comparing the probability distributions of different data points in both high- and low-dimensional spaces60. By applying t-SNE, the distribution of real/synthetic images can be visually analyzed33,61. We randomly selected 200 images each from the real-SSL original dataset, traditionally augmented images, and GAN-synthesized images. Then, all images were resized to \(224\times 224\) pixels. We set a perplexity of 30 for 1000 iterations.

Second, the synthesized images were evaluated by clinical experts33,37,43. Four clinical experts evaluated the visual quality of the SSL samples generated by the GANs. Two test modules were developed using a PowerPoint presentation, and the quality of the evaluation was assessed in two steps: Test 1—visual Turing test: differentiation of a polyp image between a real and GAN-synthesized image within 1 s; Test 2—micro assessment of synthetic SSL image quality (corresponding SSL endoscopic features) in 10% representative SSL synthetic images. Through this process, we aimed to answer the following: (1) Is the appearance of the synthesized lesions realistic? (2) Do the synthesized lesions sufficiently include the characteristic endoscopic features of SSLs? All images used in the tests had a size of \(256\times 256\) pixels, and the prevalence of the samples was blinded.

Training polyp detection model

To develop the real-time polyp detection system, we applied one-stage detection algorithms with high inference speed. Three representative one-stage object detection models, RetinaNet, single shot detector (SSD), and YOLOv3 were trained on the original training dataset with the same environment for performance comparison (Supplementary Table S4). Considering frame per second (FPS) and overall metrics, subsequent experiments were conducted using YOLOv3. Additionally, we confirmed the performance of YOLOv3 according to the backbones used as the feature extractor. Darknet-53 was compared with Inceptionv3, ResNet50, and AlexNet. As YOLOv3 performs detection at three scales, three inputs should be provided to the front model of YOLOv3 in \(52\times 52\times 256\), \(26\times 26\times 512\), \(13\times 13\times 1024\). We placed the outputs from three layers of each model into the front model of YOLOv3 and trained it on the original training dataset with the same parameter settings. As shown in Supplementary Table S5, Darknet-53 achieved a higher performance overall compared to other backbone networks.

Through transfer learning, faster loss convergence can be achieved by initializing the current model with the learned weights of a pre-trained model62,63. In this study, a pre-trained Darknet-53 model was used to customize the YOLOv3 detection model. The backbone classifier is the Darknet-53 network, and features were extracted in three different scales for the detection of objects of various sizes64 as shown in Supplementary Fig. S9. It uses the sum-squared error between the predictions/ground truth and binary cross-entropy in class prediction as a loss.

$$\begin{aligned} &\lambda _{coord}\sum _{i=0}^{S^{2}}\sum _{j=0}^{B}{\mathbb {I}}_{ij}^{obj}(2-w_{i}\times h_{i})[(x_{i}-\hat{x_{i}})^{2}+(y_{i}-\hat{y_{i}})^{2}]\\&\qquad +\lambda _{coord}\sum _{i=0}^{S^{2}}\sum _{j=0}^{B}{\mathbb {I}}_{ij}^{obj}(2-w_{i}\times h_{i})[(w_{i}-\hat{w_{i}})^{2}+(h_{i}\hat{h_{i}})^{2}]\\&\qquad +\lambda _{obj}\sum _{i=0}^{S^{2}}\sum _{j=0}^{B}{\mathbb {I}}_{ij}^{obj}[\hat{C_{i}}log(C_{i})+(1-\hat{C_{i}})log(1-C_{i})]\\&\qquad +\lambda _{noobj}\sum _{i=0}^{S^{2}}\sum _{j=0}^{B}{\mathbb {I}}_{ij}^{noobj}[\hat{C_{i}}log(C_{i})+(1-\hat{C_{i}})log(1-C_{i})]\\&\qquad +\lambda _{class}\sum _{i=0}^{S^{2}}{\mathbb {I}}_{ij}^{obj}\sum _{c\in classes}^{}[\hat{p_{i}}(c)log(p_{i}(c))+(1-\hat{p_{i}}(c))log(1-p_{i}(c))] \end{aligned}$$
(2)

where \(x_{i}, y_{i}\) is the centroid location of an anchor box, \(w_{i}, h_{i}\) are the width/height of the anchor; \(C_{i}\) is the objectness, which is the same as the confidence score; and \(p_{i}(c)\) is the classification loss. When an object exists in the boxj of celli, \({\mathbb {I}}_{ij}^{obj}\) is 1; otherwise, it is 0. If the boxj in celli has no object, the value of \({\mathbb {I}}_{ij}^{noobj}\) is 1; otherwise, it is 0. The size of the feature map is described as \(S^{2}\), and B is the number of anchor boxes. Lambda coefficients, \(\lambda _{coord}, \lambda _{obj}, \lambda _{noobj}\), and \(\lambda _{class}\) are the weight parameters for localization, objects, no-object, and classes that were initially set to 1, 5, 1, and 1, respectively, by default. An increase in the lambda coefficients affects the focused part of the loss. Specifically, and increment in \(\lambda _{coord}, \lambda _{obj}, \lambda _{noobj}\), and \(\lambda _{class}\) affects the intersection over union (IoU), sensitivity, specificity, and confidence score, respectively. We set the lambda coefficients as 1, 9, 2, and 1 based on experimental trials. The model was initialized with ImageNet weights and then fine-tuned to learn the endoscopic features of polyps.

We trained YOLOv3 with a pre-trained Darknet-53 model with a batch size of 8, and the learning rate for the Adam optimizer was \(1\times 10^{-7}\), which was selected from the range [\(1\times 10^{-4}\) \(1\times 10^{-8}\)]. The network input size was set to \(416\times 416\) such that input images were all resized to \(416\times 416\), and nine anchors, which were obtained by clustering the dimensions using K-means for each dataset, were used. Anchors that overlapped the ground truth object by less than the IoU threshold value (0.5) were ignored. The non-maximum suppression threshold value was set to 0.2, and the confidence score was selected as 0.5. More details can be found in the Supplementary Information (Description of training detection algorithm).

The original training dataset is composed of 4397 WL polyp images. All images were added twice to include one with a full monitor image and one that is cropped to involve only the endoscopic view. Additionally, we composed augmentation training datasets with traditional geometric transformation methods and GAN. SSL augmented images were added in 1200/1800/2400 to the original training dataset so that the proportion of SSL in the dataset corresponded to the HP, the average of HP and AD, and AD. To compare the detection performance according to the augmentation methods and augmented ratio of the minor type, we trained the polyp detection model using the same hyperparameters as the original and augmented datasets: (i) original model trained without augmented images; (ii) T-aug1, T-aug2, and T-aug3 models trained with original + 1200/1800/2400 traditionally augmented SSL images; and (iii) GAN-aug1, GAN-aug2, and GAN-aug3 models trained with original + 1200/1800/2400 synthetic SSL images generated by StyleGAN2. We evaluated these models using the original validation dataset and SSL temporal validation dataset, including 130 SSL images with confidence score and IoU thresholds both at 0.5.

Statistical analysis

In polyp images, the polyp detection box with an IoU higher than 0.5 was evaluated as true-positive (TP) and that lower than 0.5 was evaluated as false-positive (FP). If the system could not detect the polyp box in the polyp images, those images are evaluated as false-negative (FN). To evaluate the instances of a true-negative (TN) result, we composed 1000 negative images with no polyp. Using the TP, FP, FN, and TN indicators, we calculated the sensitivity (= recall), specificity, PPV (= precision), negative predictive value (NPV), AP, F1-score, and the area under the receiver operating characteristic (AUROC).