Evaluating the performance of generative adversarial network-synthesized periapical images in classifying C-shaped root canals

This study evaluated the performance of generative adversarial network (GAN)-synthesized periapical images for classifying C-shaped root canals, which are challenging to diagnose because of their complex morphology. GANs have emerged as a promising technique for generating realistic images, offering a potential solution for data augmentation in scenarios with limited training datasets. Periapical images were synthesized using the StyleGAN2-ADA framework, and their quality was evaluated based on the average Frechet inception distance (FID) and the visual Turing test. The average FID was 35.353 (± 4.386) for synthesized C-shaped canal images and 25.471 (± 2.779) for non C-shaped canal images. In the visual Turing test, conducted by two radiologists on 100 randomly selected images, the observers had difficulty distinguishing real from synthetic images. These results indicate that GAN-synthesized images exhibit satisfactory visual quality. The classification performance of the neural network improved when the training data were augmented with GAN data compared with using real data alone, and this augmentation could be advantageous under class imbalance. GAN-generated images proved to be an effective data augmentation method, addressing the constraints of limited training data and computational resources in diagnosing dental anomalies.

The C-shaped canal configuration is a unique anatomical variation commonly found in mandibular molars. This configuration poses challenges for clinicians because of its complex canal morphology and high susceptibility to periodontal disease 1. Proper identification and management of C-shaped canals play a crucial role in achieving successful endodontic treatment outcomes and improving the overall prognosis, because the presence of additional canals and isthmuses can lead to incomplete debridement and inadequate disinfection, compromising the long-term success of the treatment 2. Therefore, understanding the complexities of C-shaped canals and their classification is paramount in providing optimal care for patients with these challenging anatomical variations. C-shaped canal anatomy is typically detected in clinical settings by taking a periapical radiograph or, if available, a panoramic radiograph for overall screening. However, these 2-dimensional images have limitations, such as image distortion or superimposition, which can affect diagnostic accuracy. To overcome these limitations, cone-beam computed tomography (CBCT) imaging is commonly used. CBCT produces 3D images that are reported to provide high diagnostic accuracy comparable with that obtained from conventional CT scans while having lower radiation doses 3. However, CBCT examinations still result in significantly higher radiation doses compared with conventional panoramic or periapical radiographs, limiting their indications.
images were prepared (non C-shaped = 803, C-shaped = 653), and cut into patches to include only the second mandibular molar as the region of interest and resized to 512 × 512 pixel sized JPG images. This retrospective study was conducted according to the principles of the Declaration of Helsinki and was performed in accordance with current scientific guidelines. The study protocol was approved by the Institutional Review Board (IRB) of Yonsei University Dental Hospital, Seoul, Korea (approval number: 2-2023-0026). Written informed consent was waived by the IRB.
The CBCT images were acquired with the patients in a standard upright position using RAYSCAN Alpha plus (Ray Co, Hwaseong, Korea) or Pax-Zenith 3D (Vatech Co, Hwaseong, Korea) machines. The scanning parameters included a scanning time of 14 s, a field of view of 100 × 100 cm, a tube voltage of 90 kVp, a tube current of 12 mA, and a voxel size of 0.18, which were determined based on the patient's size on the scanning device. Periapical images were taken with a tube voltage of 60 kV, a tube current of 7 mA, and an acquisition time of 0.125 s using an electric intraoral sensor.

Frechet inception distance (FID)
The FID serves as a metric for evaluating the quality of images produced by GANs. It gauges the similarity between two distributions, typically the distribution of real images and the distribution of generated images. A lower FID score indicates a higher level of realism in the generated images. FID is advantageous in multiple ways: it provides a quantitative evaluation of the realism of generated images, distinguishes between real and generated samples, aligns with human perceptual evaluations, detects distortions, and is computationally and sample efficient 43. During the training of StyleGAN2-ADA to generate periapical image patches, the FID was continuously monitored. The analysis focused on the lowest FID score achieved while training the GAN model. The FID was calculated five times (n = 5) using different random seeds, and the average (± standard deviation) was calculated for both C-shaped and non C-shaped canal images.
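To illustrate what the metric measures, the Fréchet distance between two Gaussians fitted to feature vectors can be sketched as below. This is a minimal sketch and not the metric code used in the study: it assumes the Inception features have already been extracted, and the function name `frechet_distance` and the random toy features are hypothetical.

```python
import numpy as np


def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two feature sets.

    Each input has shape (n_samples, n_features). In FID the features come
    from an Inception network; here they are assumed to be precomputed.
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)

    # Tr(sqrt(cov_r @ cov_g)) equals the sum of the square roots of the
    # eigenvalues of the product, which are real and non-negative in theory;
    # clipping guards against small negative values from rounding.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)


# Toy example with random stand-in features (hypothetical data).
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 8))
close = rng.normal(0.1, 1.0, size=(500, 8))
far = rng.normal(2.0, 1.0, size=(500, 8))

fid_same = frechet_distance(real, real)    # identical sets: essentially zero
fid_close = frechet_distance(real, close)  # slightly shifted distribution
fid_far = frechet_distance(real, far)      # strongly shifted distribution
```

A distribution that lies closer to the real one gives a smaller distance, which is why a lower FID is read as higher realism.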

Visual Turing test
To validate the perceptual quality of the generated images, 50 generated periapical images of mandibular molars that seemed real at first glance (25 with C-shaped and 25 with non C-shaped canal configuration) were randomly selected, and 50 real images (likewise, 25 with C-shaped and 25 with non C-shaped canal configuration) were added, for a total of 100 images. The 100 images were then randomly shuffled. These images were evaluated by two radiologists, each of whom had over 15 years of experience. Neither observer had any experience with synthesized periapical images.
The test was performed by showing the images one by one in a Google Form-based interface, and the images were displayed in the same order for each observer. The observers were informed that there were 50 real and 50 generated images, and 50 mandibular second molars with C-shaped canals and 50 with non C-shaped canals.
The observers were allowed to provide only a single response to each question, without the option to revisit previous questions. They were instructed to determine whether the displayed image was genuine or synthesized. The accuracy, sensitivity, and specificity of the test results were then assessed. Fleiss kappa was used to evaluate the interobserver agreement in the visual Turing test.
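The assessment described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: treating "synthesized" as the positive class is an assumption (the text does not state the convention), the function names are hypothetical, and the toy answers are not the study data.

```python
from typing import Dict, Sequence


def turing_test_metrics(truth: Sequence[int], answer: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity and specificity for one observer.

    Labels: 1 = synthesized, 0 = real (assumed convention).
    """
    tp = sum(t == 1 and a == 1 for t, a in zip(truth, answer))
    tn = sum(t == 0 and a == 0 for t, a in zip(truth, answer))
    fp = sum(t == 0 and a == 1 for t, a in zip(truth, answer))
    fn = sum(t == 1 and a == 0 for t, a in zip(truth, answer))
    return {
        "accuracy": (tp + tn) / len(truth),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def fleiss_kappa(ratings: Sequence[Sequence[int]], n_categories: int = 2) -> float:
    """Fleiss' kappa; ratings[i] holds the label each rater gave to image i."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j] = number of raters who assigned category j to image i
    counts = [[list(row).count(j) for j in range(n_categories)] for row in ratings]
    # Overall proportion of assignments to each category
    p_j = [sum(c[j] for c in counts) / (n_items * n_raters) for j in range(n_categories)]
    # Per-image agreement among raters
    p_i = [(sum(x * x for x in c) - n_raters) / (n_raters * (n_raters - 1)) for c in counts]
    p_bar = sum(p_i) / n_items       # observed agreement
    p_e = sum(p * p for p in p_j)    # agreement expected by chance
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical toy answers for eight images (not the study data).
truth = [1, 1, 1, 1, 0, 0, 0, 0]
observer_1 = [1, 1, 0, 0, 0, 0, 1, 1]
observer_2 = [1, 0, 1, 0, 0, 1, 0, 1]

metrics_1 = turing_test_metrics(truth, observer_1)
kappa = fleiss_kappa(list(zip(observer_1, observer_2)))
```

An accuracy near 0.5 on a balanced set means the observer performs at chance level, and a kappa near 0 means the two observers agree no more often than chance would predict.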

Predictive performance based on CNN classification
EfficientNet 43 was used for subsequent classification to evaluate the validity of the synthesized images generated by GAN. A pretrained EfficientNet-B0, trained on over a million images from the ImageNet database, was employed. The weights from this pretraining were used (https://pytorch.org/hub/nvidia_deeplearningexamples_efficientnet/), and PyTorch 2.0 with CUDA 11.8 was used for development.
Six different scenarios were constructed, and the classification performances were compared. The first scenario (A) was trained only on real data (280 real images), whereas the second scenario (B) was trained only on GAN data (280 generated images). The third scenario (C) was trained on double the number of real data (560 real images), and the fourth scenario (D) was trained with a mix of real and GAN data (280 real images + 280 generated images). The fifth scenario (E) was trained on real data with class imbalance (280 real images; 56 C-shaped), and the sixth scenario (F) was trained by adding 168 synthetic images of the minority class (C-shaped) to balance the training subset with real data. The class ratio between C-shaped and non C-shaped images was equal (5:5) in scenarios A to D and F, whereas scenario E had a class imbalance of 2:8 (C vs non C), corresponding to the real prevalence of C-shaped canal configurations in the Asian population 44.
The validation and test datasets consisted entirely of real periapical images. More specifically, the validation set comprised 80 real images, with 40 images in the C-shaped category and 40 in the non C-shaped category. The test set comprised 40 real images, with 20 images in the C-shaped category and 20 in the non C-shaped category. Each scenario underwent five folds of trials. To achieve this, five sets of validation and testing datasets were initially prepared, one for each fold. Subsequently, the training data were randomly selected for each fold. In scenarios A, C, D, E, and F, a total of 280 real images were assigned to the training dataset while maintaining specific matching ratios (A, B, D: 50% C-shaped, E, F: 20% C-shaped). Without the data augmentation applied in scenarios C, D, and F, the ratio between training, validation, and test data was 7:2:1. Particular attention was given to ensuring that the training, validation, and test datasets did not include the same images. Moreover, the validation and test sets for each fold were also ensured to have no overlapping items.
EfficientNet was then trained on the images for binary classification of C-shaped or non C-shaped canals. The accuracy, precision, recall, specificity, false positive rate (FPR), false negative rate (FNR), and the area under the receiver operating characteristic curve (AUROC) were calculated. To assess the effectiveness of data augmentation using GAN-generated images, scenarios A and D were compared using the chi-square test (specifically, McNemar's test). Similarly, scenarios E and F were compared to evaluate the effectiveness of GAN-generated images in addressing imbalanced dataset situations. The null hypothesis stated that there would be no significant difference between each of the two scenarios. The significance level (alpha) was set to 0.05.
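The scenario composition and the reported metrics can be checked with a short sketch. The per-class counts for scenarios A to D are inferred from the stated 5:5 class ratio rather than given explicitly, the choice of C-shaped as the positive class is an assumption, and the names `SCENARIOS`, `c_shaped_fraction`, and `classification_metrics` as well as the example confusion counts are hypothetical.

```python
from typing import Dict, Tuple

# Training-set composition as (real C, real non-C, GAN C, GAN non-C).
# Splits for A to D are inferred from the stated 5:5 class ratio.
SCENARIOS: Dict[str, Tuple[int, int, int, int]] = {
    "A": (140, 140, 0, 0),      # 280 real images
    "B": (0, 0, 140, 140),      # 280 generated images
    "C": (280, 280, 0, 0),      # 560 real images
    "D": (140, 140, 140, 140),  # 280 real + 280 generated
    "E": (56, 224, 0, 0),       # 280 real, class imbalance
    "F": (56, 224, 168, 0),     # E plus 168 generated C-shaped images
}


def c_shaped_fraction(name: str) -> float:
    """Fraction of C-shaped images in the training set of one scenario."""
    real_c, real_non_c, gan_c, gan_non_c = SCENARIOS[name]
    total = real_c + real_non_c + gan_c + gan_non_c
    return (real_c + gan_c) / total


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Metrics listed in the text, with C-shaped as the positive class (assumed)."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),        # also called sensitivity
        "specificity": tn / (tn + fp),
        "fpr": fp / (fp + tn),           # 1 - specificity
        "fnr": fn / (fn + tp),           # 1 - recall
    }


fractions = {name: c_shaped_fraction(name) for name in SCENARIOS}

# Hypothetical outcome on a 40-image test set (20 C-shaped, 20 non C-shaped).
example = classification_metrics(tp=18, fp=3, tn=17, fn=2)
```

Adding 168 generated C-shaped images to the 56 real ones gives 224 C-shaped images, matching the 224 real non C-shaped images, which is how scenario F restores the 5:5 ratio.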

Ethics approval and consent to participate
This study was approved by the Institutional Review Board (IRB) of Yonsei University Dental Hospital (approval no.2-2023-0026).Written informed consent was waived by the IRB.
www.nature.com/scientificreports/

Results
The results of the visual Turing test are shown in Table 1 with mean accuracy, sensitivity, and specificity. The accuracies were 0.490 and 0.590, sensitivities were 0.429 and 0.592, and specificities were 0.549 and 0.588, respectively, for each observer. The Fleiss kappa was low (κ = 0.28), indicating poor interobserver agreement. The overall results indicate that the radiographic images generated by GAN were difficult to tell apart from real radiographs and showed no notable difference from them.
Table 2 presents the predictive performance of the classification models used in scenarios A to F. The classification model's performance was weakest in scenario B, where only a limited number of synthetic images from GAN were used. The accuracy, sensitivity, and specificity were 0.715 ± 0.095, 0.810 ± 0.134, and 0.620 ± 0.268, respectively. Scenarios C and D yielded better performances as the training data were augmented with either real or synthetic images, resulting in increased accuracy, sensitivity, and specificity. Scenario C had an accuracy of 0.845 ± 0.069, sensitivity of 0.810 ± 0.089, and specificity of 0.880 ± 0.084 with training data augmentation using real images. Scenario D had an accuracy of 0.890 ± 0.065, sensitivity of 0.910 ± 0.055, and specificity of 0.870 ± 0.120 with training data augmentation using synthesized images. A t-test indicated a significant difference between scenarios C and D only in sensitivity (p = 0.034) and FNR (p = 0.034) (Supplementary Table 1: T-test results). In scenarios E and F, performance was better in scenario F, where the class imbalance was resolved by adding GAN-synthesized radiographs. The accuracy and specificity values were 0.855 ± 0.048 and 0.800 ± 0.050 for scenario F, and 0.800 ± 0.108 and 0.660 ± 0.219 for scenario E, respectively. However, sensitivity was higher in scenario E (0.940 ± 0.065) than in scenario F (0.910 ± 0.102).
The receiver operating characteristic (ROC) curves and corresponding AUROC values for scenarios A to F are displayed in separate plots in Fig. 2. The AUROC values for scenarios A to F were 0.87, 0.78, 0.92, 0.93, 0.90, and 0.94, respectively. The chi-square test yielded values of 4.114 (scenarios A vs. D) and 4.326 (scenarios E vs. F), with corresponding p values of 0.042 and 0.037, indicating a statistically significant difference in model performance. However, the chi-square value was 2.630 with a corresponding p value of 0.104 for scenario C versus D, indicating no significant difference. Examples of qualitative comparison between the original periapical images and GAN synthesized images having C-shaped or non C-shaped canal configuration are shown in Fig. 3.
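The relationship between the reported chi-square values and p values can be sketched as follows. This is a minimal sketch: it assumes one degree of freedom, as in McNemar's test, and the function names are hypothetical. Whether the study applied a continuity correction is not stated, so the statistic function exposes it as an option.

```python
import math


def mcnemar_chi2(b: int, c: int, continuity: bool = False) -> float:
    """McNemar statistic from the two discordant counts.

    b = test images classified correctly by the first model only,
    c = test images classified correctly by the second model only.
    """
    numerator = (abs(b - c) - 1) ** 2 if continuity else (b - c) ** 2
    return numerator / (b + c)


def chi2_p_value_1dof(stat: float) -> float:
    """Upper-tail p value of a chi-square statistic with 1 degree of freedom."""
    # For 1 degree of freedom, P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


# Chi-square values reported in the text.
reported = {"A vs D": 4.114, "E vs F": 4.326, "C vs D": 2.630}
p_values = {pair: chi2_p_value_1dof(stat) for pair, stat in reported.items()}

# Decision at the significance level alpha = 0.05 used in the study.
significant = {pair: p < 0.05 for pair, p in p_values.items()}
```

Under these assumptions the computed p values come out close to the reported 0.042, 0.037, and 0.104, and only the A vs D and E vs F comparisons fall below alpha = 0.05.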

Discussion
In recent years, the use of GANs has become increasingly popular in a wide range of applications within the medical and dental fields, including image and video synthesis as well as text generation. The ability of GANs to generate realistic and diverse data samples has made GANs an important tool in machine learning and artificial intelligence research, and their potential use in dental radiography is especially compelling because of the limited availability of high-quality dental radiographs for research and training purposes. In this study, we aimed to evaluate the feasibility and effectiveness of using GANs to generate synthetic periapical radiographs and evaluate their performance in image quality and prediction outcomes.
In our study, the FID scores for the generated images from a pretrained StyleGAN2-ADA were 72.762 (± 0.723) and 61.373 (± 3.035) for radiographs with C and non-C canal configurations, respectively. These results may appear unsatisfactory when compared with other medical studies. One study 21 reported FID scores of 5.22 (± 0.17) for a liver CT dataset on a StyleGAN2 network with transfer learning from the FFHQ dataset, and FIDs of 10.78, 3.52, 21.17, and 5.39 on the publicly available SLIVER07, ChestX-ray14, ACDC, and Medical Segmentation Decathlon (brain tumors) datasets. In another study 45, the FID was approximately 20 for synthesized magnetic resonance and CT images. Nevertheless, the synthesized images in this study still proved useful in data augmentation and yielded good results in the visual Turing test and classification performances. The visual Turing test showed that the synthesized images had realistic and diverse characteristics (average accuracy = 0.54). Furthermore, the performance results from the EfficientNet classification indicate that the images generated from StyleGAN2-ADA could serve as a useful database for data augmentation. On the other hand, it is worth noting that although the authors were able to produce images of acceptable quality after sufficient training, the root tips and periodontal ligaments in the apical portion show a relatively fuzzy appearance in the images generated using StyleGAN2-ADA compared with real periapical images, and the root canals also seem less prominent. The overall results of this study show that the generated images have sufficient image quality for the diagnosis of the C-shaped canal anatomy for both human observers and the CNN, yet these aspects call for improvement in future studies.
There may be some limitations in applying FID in medical images because the ImageNet dataset does not contain medical images. Thus, some studies argue that using FID for medical imaging is neither practical nor feasible and suggest replacing the inception network with their own encoding networks 46,47. Nonetheless, recent studies using StyleGAN2 have reported their results using FID 21,45, which is different from the approach of using their own encoding networks for FID evaluation in medical imaging. This is because the alternative approach lacks consistency in evaluating and comparing FID because it does not use the same encoding model as ImageNet 21,48. For these reasons, the original FID was used as the calculating metric of this study.
Despite the promising results of this study in using GAN-generated data to improve the performance of dental radiographic image classification models, there are several limitations that should be acknowledged. First, the … synthetic data. Thus, learning a disentangled representation is still a challenge in the field of the interpretability of GANs, and identifying errors or biases in the generated data is challenging. Studies seeking methods to modify traditional GANs to ensure explainability and interpretability by manipulating feature or saliency maps of the corresponding filters are showing promising results 49,50; therefore, implementing interpretable GANs must be considered in future studies for stable image generation and increased applicability in the clinical field.
Second, the evaluation of generative models lacks an objective loss function, making it necessary to assess the quality of the generated synthetic images. Although manually inspecting and judging the generated examples at different iteration steps is a basic and useful approach to evaluate a GAN, this method has limitations because it is inherently subjective and may include the biases of the reviewer. Additionally, it requires domain knowledge to distinguish between realistic and unrealistic images, making it essential to involve experts in the field. Furthermore, the number of images that can be reviewed is limited by the capacity to undertake manual inspection, and no clear best practice has emerged for qualitatively assessing the generated images, as it is likely to vary depending on the specific case. In this study, it was particularly important to rely on the assessments of dental specialists when evaluating the synthetic periapical radiographs. More methods that can objectively evaluate GANs are needed in future GAN implementations for image data augmentation.
Table 1. Average assessment results of the two observers on the visual Turing test. The Fleiss kappa coefficient was found to be low (κ = 0.28), indicating a poor level of agreement between the observers.

Third, the high cost of model training and the need for a larger and diverse dataset are also significant challenges to consider. This study used a dataset of 1400 images, a relatively small number when compared with other studies, and the model was trained using a single GPU. Although the authors of StyleGAN2-ADA suggest that the model produces promising results from a small dataset of over 1000 images, the amount of data needed is still unclear; therefore, a larger and more diverse dataset from different institutions is strongly recommended for better generation outcomes. Additionally, further studies using external datasets are needed to evaluate the generalizability of the proposed method. Within the GAN training dataset of this study, the truncation was manually set (Ψ = 1, 0.7, 0.5, − 0.5, − 0.7, − 1) to generate moderately diverse yet stable results and avoid mode collapse. Generated images also went through manual screening by looking at the thumbnails, which implies that generated images can seem unrealistic in some cases despite their visual quality and must be inspected by experts. An alternative that can generate realistic and diverse data in a more stable and automatic manner and within limited computational resources should be implemented in future studies. In this study, the image size was limited to 512 × 512 pixels, and the region of interest was manually cut out. Future work should focus on the full periapical image synthesis.
Fourth, the effect of adding GAN data must be further investigated. In general, incorporating GAN-generated images into the real dataset led to an improvement in classification performance, indicating that GAN-generated data can serve as an effective method for data augmentation. However, when examining the ROC, training the CNN solely with GAN-synthesized data (scenario B) displayed a lower true positive rate compared with the CNN trained solely with real data (scenario A) or augmented data (scenarios C, D). Nonetheless, the false positive rate was found to be similar between the two approaches. Notably, the true positive rate increased as more data were added, regardless of whether they were real data or GAN-generated data. The low true positive rate observed when training the CNN with GAN-synthesized data suggests that the generated images might not accurately capture the specific features necessary for correctly identifying C-shaped canals. The GAN-generated data may lack certain details or exhibit variations that make it more challenging for the CNN to accurately classify C-shaped canals. However, the similar false positive rate between the CNN trained with GAN-synthesized data and the CNN trained with real data suggests that the GAN data do not significantly contribute to an increased number of false positives. This might indicate that the GAN-synthesized images do not introduce additional incorrect classifications beyond those already produced by the CNN trained with real data. Overall, these results suggest that the GAN-synthesized data might not be fully representative of the specific characteristics required for accurate classification of C-shaped canals, but to this extent can serve as an effective data augmentation method in situations both with and without data imbalance. This conclusion is supported by the observation that the chi-square scores were 4.114 (scenarios A vs D) and 4.326 (scenarios E vs F), with corresponding p-values of 0.042 and 0.037. Additionally, the ROC curves and AUROC values are comparable between data augmentation using additional real images (scenario C) and data augmentation by adding GAN-synthesized data (scenario D). Therefore, within this context, GAN-generated data offer a viable approach for enhancing the training dataset.
Finally, while GANs have shown great potential for image generation tasks in dentistry, their use in combination with other deep learning tasks in the field remains largely unexplored. In medicine, the use of GANs has been thoroughly investigated, not only in data augmentation 11 but also in image segmentation 12, denoising and super-resolution, domain transfer, and post intervention prediction 51. Therefore, the development momentum of GAN-based dental deep learning studies must be updated and reinforced, and future studies should investigate the potential benefits of using GANs in combination with other deep learning approaches by conducting cooperative studies with multiple facilities.

Conclusion
In conclusion, the results of this study suggest that the periapical images generated using StyleGAN2-ADA may not fully capture the precise characteristics required for accurate classification of C-shaped canals; however, these generated images exhibit satisfactory visual quality and demonstrate potential as an effective data augmentation method. The use of GAN-generated images can help overcome the challenges associated with insufficient training data and limited computational resources in diagnosing dental anomalies or diseases. Future studies are anticipated to enhance the quality of GAN-generated periapical images by incorporating larger and more diverse datasets that encompass a broader range of anatomical and disease features.
repositories were used with the default parameters and without a hyperparameter search. Mirroring (horizontal flip) and ADA were enabled, and training was initialized with the official StyleGAN2 pretrained weights from the Flickr-Faces High Quality Dataset (FFHQ) (https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada/pretrained/). The training was performed on an NVIDIA A100-SXM graphics processing unit (GPU) with 40.0 gigabytes of GPU RAM and was executed for 600 ticks for each type of periapical image (C-shaped and non C-shaped canals), with metrics computed and weights saved every 16 ticks. The training process was repeated three times to test the algorithm stability, and training convergence was evaluated by computing the average generator loss and FID improvement and by manually reviewing the generated images. The development was carried out in Pytorch version 1.19 with CUDA 11.1. All 1456 images were used for training: namely, 803 non C-shaped images and 653 C-shaped images. After training for 600 ticks for each of the C-shaped and non C-shaped canal types, images were generated from the truncated latent space by setting the threshold Ψ, which is used to truncate and resample the latent vectors, to certain values (Ψ = 1, 0.7, 0.5, − 0.5, − 0.7, − 1). With the truncation Ψ set, sampling is done from a truncated normal distribution, in which values that fall outside a given range are resampled to fall inside that range. The generated images were then manually examined as thumbnails (10 × 10 pixel size), and ultimately 280 images of C-shaped mandibular second molars and 280 images of non C-shaped mandibular second molars were prepared. The conceptual architecture for training StyleGAN2-ADA and generating images is depicted in Fig. 1.
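The role of Ψ can be illustrated with a small sketch. The text above describes truncation as resampling from a truncated normal; in the StyleGAN family, truncation is commonly applied instead as a linear interpolation of the intermediate latent vector toward its average, and the sketch below assumes that interpolation form. The function name `truncate_latent` and the toy vectors are hypothetical, and this is not the code used in the study.

```python
from typing import List


def truncate_latent(w: List[float], w_avg: List[float], psi: float) -> List[float]:
    """Move a latent vector toward the average latent by the factor psi.

    psi = 1 leaves w unchanged, psi = 0 collapses it onto w_avg, and a
    negative psi reflects it to the opposite side of w_avg.
    """
    return [avg + psi * (value - avg) for value, avg in zip(w, w_avg)]


# Toy three-dimensional latent and average latent (hypothetical values).
w = [2.0, -1.0, 0.5]
w_avg = [0.0, 1.0, 0.5]

# The truncation settings listed in the text.
psi_values = [1.0, 0.7, 0.5, -0.5, -0.7, -1.0]
truncated = {psi: truncate_latent(w, w_avg, psi) for psi in psi_values}
```

Under this assumption, values of Ψ with a smaller magnitude pull samples toward the average image, trading diversity for stability, which is consistent with the stated aim of generating moderately diverse yet stable results.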

Figure 1. Conceptual artwork illustrating the architecture of the StyleGAN2-ADA used to generate periapical images of C-shaped and non C-shaped canal configurations, and the methods used to evaluate the quality of the generated images.
https://doi.org/10.1038/s41598-023-45290-1

Figure 2. Receiver operating characteristic (ROC) curves and the corresponding area under the ROC curve (AUROC) values for scenarios A to F.

Figure 3. Examples of qualitative comparison between the original periapical images and GAN synthesized images having C-shaped or non C-shaped canal configuration. Some generated images show lower precision in the apical area with features such as fuzzy appearing apexes or periodontal ligaments, and less prominent root canals.