Introduction

Oral cancer accounts for almost 3% of all cancer cases diagnosed worldwide1. According to the World Health Organization, more than 370,000 cases of oral cancer were reported in 20202. Several studies have shown that tongue cancer is the most common type of oral cancer (42%)3,4. Oral cancer is most prevalent in individuals from Asia (65.8%) and is ranked as the sixth most frequent malignancy in Asia5. Lifestyle factors common in the Asian population, such as chain smoking, alcohol consumption, and betel quid chewing, are strong risk factors for oral cancer6,7.

The early detection of tongue cancer is essential8,9. The overall 5-year survival rate for patients with tongue cancer is 68.1%10. According to the Surveillance, Epidemiology, and End Results database, the 5-year survival rates for local, regional, and distant stages are 82%, 68%, and 40%, respectively. Beyond the poorer prognosis, patients with advanced tongue cancer experience difficulties with eating and speaking11. Furthermore, when the diagnosis is delayed, the scope of surgery broadens and more invasive treatments are required, resulting in increased side effects after treatment12.

Endoscopy is a simple, effective, and non-invasive method for diagnosing tongue cancer13. However, only a limited number of specialists can accurately interpret endoscopic findings. For example, if a suspicious lesion is identified in a local clinic, the patient should be referred to a specialist for confirmation of disease status and further management14. However, general physicians who lack experience in treating patients with tongue cancer might mistake its visual patterns for signs of ulceration or oral mucosal disease15.

Studies on the early detection of various malignancies using characteristics of the tongue have been undertaken in the past16,17,18. More recently, the development of primary diagnostic methods based on artificial intelligence (AI) analysis of oral endoscopic images has opened the possibility of earlier diagnosis of tongue cancer. However, previous studies of oral cancer used images acquired in non-clinical environments with smartphones or digital cameras rather than in a validated medical environment with an endoscope, and the number of images was small (< 300 images)19,20. In addition, studies have shown that existing diagnostic algorithms risk misdiagnosing or underestimating the risk to critically ill patients in clinical applications21. This problem has been attributed to the low quality of the data collected for AI training22. Hence, in this study, we verified the quality of the constructed dataset. Based on these data, we explored the feasibility of endoscopy-imaging-based deep learning models for tongue cancer diagnosis.

Results

Dataset characteristics

We retrospectively constructed a dataset of 12,400 verified endoscopic images obtained from five university hospitals in South Korea between 2010 and 2020. Of these 12,400 images, 5576 images of the tongue were extracted. The total dataset used for development and validation (N = 5576) comprised 1941 endoscopic images of malignant lesions and 3635 non-malignant endoscopic images. The ratio of malignant to non-malignant images differed across medical institutions (Table 1). The internal validation dataset contained 1809 images of malignant lesions and 3415 of non-malignant lesions. The external validation dataset consisted of 132 images of malignant lesions and 220 of non-malignant lesions.

Table 1 Dataset characteristics.

Parameter tuning and training

To ensure a fair comparison, all training hyperparameters were kept identical across experiments (Fig. 1). The networks were trained for 300 epochs using binary cross-entropy loss with a batch size of 32. To avoid overfitting, we considered overfitting to have begun when the validation loss started to increase relative to the training loss; training then continued for ten additional exploratory epochs. If the trend persisted, early stopping was applied, and the parameters from the epoch at which the validation loss first exceeded the training loss were retained as the final parameters. Rather than using a scheduler that changes the learning rate according to the training state, we applied Bayesian optimization to find the optimal learning rate for building the model.
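In practice, this stopping rule can be approximated with a standard callback. The following is a minimal sketch assuming a Keras/TensorFlow workflow and hypothetical dataset objects (`train_ds`, `val_ds`); it monitors the validation loss with a ten-epoch patience and restores the best weights, which mirrors but does not reproduce the exact custom logic described above.

```python
import tensorflow as tf

# Rough stand-in for the early-stopping rule described above: monitor the
# validation loss, tolerate up to ten further epochs without improvement,
# and roll back to the weights of the best epoch.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=10,
    restore_best_weights=True,
)

# Hypothetical training call with the fixed settings reported above.
# history = model.fit(train_ds, validation_data=val_ds,
#                     epochs=300, batch_size=32, callbacks=[early_stop])
```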

Figure 1 Overview of the development and evaluation of the tongue cancer diagnosis algorithm.

Testing and model selection

After training, we evaluated the classification models using the internal and external validation datasets. The evaluation results are summarized in Table 2. The optimal point of the ROC curve was determined where the AUROC reached its maximum value. When AUPRC, AUROC, specificity, and F1-score were compared across models, the DenseNet models showed excellent performance. Among them, DenseNet169 had a higher AUROC, AUPRC, and accuracy than DenseNet201 and DenseNet121. Therefore, DenseNet169 was selected as the final model (Fig. 2).

Table 2 Diagnostic performance of CNN models in internal validation (a) and external validation (b).
Figure 2 Receiver operating characteristic curves and precision-recall curves for the deep learning algorithm on the internal validation dataset (A) and the external validation dataset (B).

AI vs. human readers

Figure 3 presents the test results of the best-performing algorithm and the human readers on the external test dataset. The algorithm achieved an accuracy of 84.7%, with a sensitivity of 81.1% and a specificity of 86.8%, for detecting tongue cancer. Among the human readers, the oncology specialist achieved a higher accuracy (92%) than the developed model, whereas the general physician achieved a lower accuracy (75.9%). Sensitivity and specificity varied considerably between the two human readers: the AI model scored lower than the specialist (sensitivity: 91.7%; specificity: 90.1%) but considerably higher than the general physician (sensitivity: 77.3%; specificity: 75.0%).

Figure 3 Performance of the deep learning model and comparison with human readers.

The agreement between the model and each human reader was estimated using the kappa statistic. Good agreement was observed between the model and the oncology specialist (kappa value = 0.685, 95% CI 0.606–0.763, p < 0.001), and moderate agreement was confirmed between the model and the general physician (kappa value = 0.482, 95% CI 0.389–0.575, p < 0.001) (Table 3).

Table 3 Agreement of the model and human readers.

Discussion

This study developed a deep learning algorithm based on DenseNet169 with acceptable performance (AUROC 0.895 and AUPRC 0.918 on the external validation dataset) for tongue cancer diagnosis from endoscopic images (Table 2 and Fig. 2). Some existing medical imaging studies have reported higher figures; however, unlike this study, most of them reported only internal validation results rather than performance on an external test set23,24. The AI model developed in our study could derive the visual patterns of cancer from cluttered oral endoscopic images. Such an AI-based diagnostic tool could have clinical significance for the early diagnosis of cancer.

Although tongue cancer should be diagnosed early, diagnosis is sometimes delayed25. In that case, as the cancer stage increases, the prognosis worsens and the scope of surgery expands, resulting in severe postoperative side effects such as dysarthria26. Early detection is difficult, and from the patient's viewpoint, knowledge and awareness of tongue cancer are lacking27. Furthermore, general physicians in local clinics find it difficult to diagnose cancer using only endoscopic images28. Therefore, cancer should be diagnosed by an oncology specialist with extensive clinical experience. In previous studies, a screening system involving trained head and neck cancer specialists reduced oral cancer mortality29.

However, the number of such specialists is small, and most work in large medical institutions, such as university hospitals, with limited accessibility for patients. In the present study, the developed deep learning model performed better than the general physician but worse than the oncology specialist in diagnosing cancer (Fig. 3). This difference is likely because general physicians have relatively little clinical experience with cancer patients30. It indicates that AI-based diagnostic models have the potential to help general physicians with little clinical oncology experience interpret endoscopic images. Other studies have likewise reported increased cancer diagnostic accuracy with the aid of AI31. Moreover, the kappa coefficient showed good agreement between the model developed in this study and the specialist in terms of lesion classification (kappa value = 0.685, 95% CI 0.606–0.763) (Table 3). Therefore, as with gastrointestinal endoscopy, combining the developed model with the oral endoscopy already available at primary medical institutions could help general physicians improve the accuracy of tongue cancer diagnosis.

Recently, several studies have reported the usefulness of medical image analysis based on deep learning models. A CNN model based on ResNet-50 simultaneously learned to detect and characterize lesions on magnetic resonance imaging (MRI)32. In addition, a CNN model built on VGGNet classified benign and malignant lesions in medical image data33. In this study, we retrained existing CNN models, originally developed on a large dataset of general natural images, using oral endoscopic images (Fig. 1). Six different model families were used: a plain CNN, ResNet, EfficientNet, VGGNet, MobileNet, and DenseNet. Because a plain CNN is the most basic model for image classification, it served as the baseline for comparing the performance of the other models.
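As an illustration of this transfer-learning setup, the sketch below loads an ImageNet-pretrained DenseNet169 backbone in Keras and attaches a binary (malignant vs. non-malignant) classification head. The dropout rate, learning rate, and dataset objects (`train_ds`, `val_ds`) are hypothetical placeholders and do not reproduce the study's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import DenseNet169

# ImageNet-pretrained backbone without its original classification head.
backbone = DenseNet169(include_top=False, weights="imagenet",
                       input_shape=(224, 224, 3), pooling="avg")

# Binary classification head: predicted probability of malignancy.
model = models.Sequential([
    backbone,
    layers.Dropout(0.3),                    # illustrative regularization
    layers.Dense(1, activation="sigmoid"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # placeholder rate
    loss="binary_crossentropy",
    metrics=[tf.keras.metrics.AUC(name="auroc")],
)

# Fine-tuning on the endoscopic images (hypothetical tf.data pipelines):
# model.fit(train_ds, validation_data=val_ds, epochs=300, batch_size=32)
```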

VGGNet, ResNet, and DenseNet share a common architectural skeleton, and deepening their layers can yield better predictive performance. Using these related models, we could identify trends in the data and select an appropriate model. MobileNet and VGGNet train relatively quickly while meeting the required performance, so they were used to check results rapidly and to probe the data features more efficiently. ResNet, DenseNet, and EfficientNet are composed of deep layers; their training is therefore relatively slow, but their performance is acceptable. In particular, DenseNet achieves superior performance with fewer parameters than ResNet. ResNet combines features by summation as they pass through layers, whereas DenseNet concatenates the features instead of adding them.
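The distinction can be made concrete with two toy building blocks. This is a schematic Keras sketch, not the actual ResNet or DenseNet implementation; it assumes the input tensor already has a channel count compatible with the residual summation.

```python
from tensorflow.keras import layers

def residual_block(x, filters):
    """ResNet-style block: features are merged by element-wise summation, so
    the channel count is unchanged (x must already have `filters` channels)."""
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    return layers.Activation("relu")(layers.Add()([x, y]))

def dense_step(x, growth_rate):
    """DenseNet-style step: new features are concatenated onto the existing
    feature map, so the channel count grows by `growth_rate` at every step."""
    y = layers.Conv2D(growth_rate, 3, padding="same", activation="relu")(x)
    return layers.Concatenate()([x, y])
```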

Unlike previous studies that used standardized CT and MRI images, this study analyzed atypical oral images using the deep learning algorithms mentioned above. Because tongue cancer is a rare disease, we removed as much noise as possible from the images rather than increasing the amount of data. By minimizing the deviation within the data, the difference between the sample population and the overall population was reduced. DenseNet169, which was evaluated as the most suitable algorithm in this study, has also proven effective in image evaluation in previous studies. In a study classifying pathological images, which were atypical images similar to those used here, effective results were obtained even with a small number of images34. Similarly, DenseNet169 showed the best performance in a study of an AI model for classifying the quality of tongue images35. Therefore, applying and optimizing AI algorithms according to the characteristics of each image dataset is essential. In particular, we believe that the model derived from this study will be meaningful for atypical data with large deviations among images, including endoscopic images.

Despite recent innovative advances in deep learning technology, a large, validated dataset remains one of the prerequisites for improving diagnostic performance. Driks emphasized the problem of "Frankenstein datasets"22. A Frankenstein dataset comprises information collected from multiple sources and assembled piece by piece. If an algorithm is tested with the same data used to train the model, it tends to appear more accurate than it actually would be on more realistic data or in practical applications. Therefore, we focused on constructing a well-organized, high-quality dataset. Previous studies used easily accessible smartphone and digital camera images; in this study, by contrast, the dataset was constructed from oral endoscopy images created at clinical sites19,20. Poor-quality images can distort the analysis of image features and directly lead to a wrong diagnosis, severely interfering with the development of an AI model. Oral endoscopic images are particularly difficult to classify in this regard: because no guidelines for imaging have been established, oral endoscopy performed during the treatment process varies in character from examiner to examiner.

These circumstances could introduce incorporation bias into the dataset. To create a relatively stable tongue image dataset, tongue images were collected using uniform endoscopic equipment. Additionally, to improve the quality of the dataset, several head and neck cancer specialists from multiple institutions directly participated in the data collection and review process. The data were de-identified, and data inspection was performed more than twice. Moreover, verification was conducted by TTA, an external institution. The radiomics approach used in previous studies involves manual ROI segmentation and the extraction of several texture features36. In this study, by contrast, the deep learning network can be trained automatically without ROI segmentation, which reduces training time and the cost of annotation work. The method extracts features directly from the dataset without requiring segmentation or manual processing. We additionally processed the images to remove regions other than the important areas so that the model could more easily identify patterns in the image data.

We preprocessed the dataset before developing the AI model. The endoscopic images varied in size, lighting conditions, and angle. In addition, owing to equipment noise, some pixels appeared sporadically as outliers in the oral endoscopy images. Some images also contained overlaid text and guide lines (Fig. 1). Beyond standard preprocessing steps such as scaling and exposure adjustment, we developed and applied a new standardization algorithm, which proceeded as follows: (1) we created a background image by converting the target image to black and white; (2) we removed text from the background image; (3) we applied Gaussian blur to the background image to suppress outliers; (4) we searched for the lesion region in the background image; and (5) we cropped uninformative parts from the original image based on the lesion region found in the background image (Supplement 1). All images were then converted into JPEG format, as required by our deep learning framework. Depending on the model, they were then resized to 224 × 224 or 300 × 300 pixels, the required input sizes, before training.
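A simplified version of this standardization pipeline could be sketched with OpenCV as follows. The thresholds, kernel sizes, and the lesion-search heuristic are illustrative assumptions only and do not reproduce the study's actual algorithm.

```python
import cv2

def standardize_endoscopic_image(path, size=(224, 224)):
    """Rough sketch of the five standardization steps described above."""
    img = cv2.imread(path)

    # (1) Build a grayscale working copy ("background" image).
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # (2) Suppress burned-in text by masking very bright pixels and inpainting
    #     them (an illustrative heuristic, not the study's text-removal method).
    _, text_mask = cv2.threshold(gray, 230, 255, cv2.THRESH_BINARY)
    gray = cv2.inpaint(gray, text_mask, 3, cv2.INPAINT_TELEA)

    # (3) Blur to damp sporadic outlier pixels caused by the equipment.
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)

    # (4) Locate the informative region, e.g. the largest bright connected
    #     area, as a stand-in for the lesion search.
    _, fg = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(fg, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))

    # (5) Crop the original color image to that region and resize it to the
    #     input size expected by the chosen model (224 x 224 or 300 x 300).
    return cv2.resize(img[y:y + h, x:x + w], size)
```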

The current study had several limitations. First, the developed model cannot provide a definite diagnosis of benign tongue lesions, such as leukoplakia and ulcers. In future studies, we plan to develop a model that clearly distinguishes benign from malignant lesions by classifying images into three categories: normal, benign, and malignant. Second, the oral endoscopic images used in this study differ in character from conventional CT and MRI images: they are atypical and non-standardized, have a high degree of freedom, and are affected by the habits of the individual endoscopist. We used various data preprocessing techniques to compensate for these shortcomings; when collecting data in future studies, applying endoscopy imaging guidelines would be beneficial. Third, a cancer diagnosis model built from endoscopic images alone has inherent limits. In future research, higher-performance diagnostic models are expected if images are combined with various clinical data. Fourth, several medical institutions participated in this study, resulting in differences between institutions in the amount of data, image characteristics, and the ratio of malignant to non-malignant images (Table 1). Data preprocessing was performed to correct for this; in future research, the ratio and amount of data from each participating institution should be distributed more uniformly. Finally, this study did not perform lesion detection. In future work, we plan to collect additional information on lesions and use it to develop an AI model that highlights suspected lesions with heat maps using Grad-CAM.

In conclusion, we constructed a quality-validated dataset of oral endoscopy images from several medical institutions. A deep learning model based on this dataset showed acceptable performance for application to tongue cancer diagnosis. Compared with human readers, it showed lower diagnostic performance than an oncology specialist and higher diagnostic performance than a general physician. The developed algorithm could therefore serve as an assistive tool for general physicians to improve the diagnosis and screening of tongue cancer in clinical settings.

Methods

Dataset

We retrospectively collected 12,400 clinical endoscopic images from five hospitals in South Korea (Seoul National University Hospital, Yonsei University Hospital, Ajou University Hospital, National Cancer Center, and Boramae Medical Center) between December 9, 2010, and September 24, 2020. By querying the medical databases (i.e., EMR and PACS), we extracted endoscopic images taken for the diagnosis of tongue cancer together with their pathological reports. The extracted endoscopic images were read and reviewed by at least two head and neck oncologists at each hospital, and image preprocessing, such as de-identification, was performed. The diagnosis of each oral image was classified as malignant, benign, or normal; benign and normal images were grouped as non-malignant. The constructed dataset underwent and passed external verification of data structure and format accuracy by the Telecommunications Technology Association (TTA).

Of the 5576 total tongue images, we selected 5224 images (internal validation dataset) to develop the algorithm and used the remaining 352 images (external validation dataset) for testing (Fig. 4). The pathological diagnosis served as the ground truth for developing and validating the deep learning model. The Institutional Review Board of Ajou University Hospital approved this study (IRB No. AJIRB-MDB-20-311). Informed consent from all participants was waived by the IRB because of the retrospective nature of this study. All methods were performed in accordance with the Declaration of Helsinki.

Figure 4 Validation and test structure diagram of the tongue cancer dataset for deep learning.

Deep learning model

To detect malignancy in oral endoscopic images (Fig. 1), we developed an automated deep learning algorithm using a cascaded convolutional neural network (CNN). The backbone networks for detection and classification were initialized with a model pre-trained on millions of images from the ImageNet dataset and further fine-tuned using the development dataset37. The tensors converted from the images were subjected to data scaling, data-type adjustment, and padding to maintain the image aspect ratio. To optimize the hyperparameters, we used Bayesian optimization during the training and internal validation processes38. The optimization target was the area under the receiver operating characteristic curve (AUROC), and the hyperparameters that maximized the AUROC were selected. The minibatch size was set to 32 to improve generalization performance. After the optimal hyperparameters were determined, we obtained the best model and evaluated its performance on the testing set.
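As a sketch of this search, the snippet below uses the bayes_opt package (the Bayesian optimization library listed in the statistical analysis section) to maximize validation AUROC over a learning-rate range; the `train_and_validate` helper and the search bounds are hypothetical.

```python
from bayes_opt import BayesianOptimization

def objective(log_lr):
    """Hypothetical objective: train the network with the proposed learning
    rate and return the validation AUROC, which is to be maximized."""
    learning_rate = 10 ** log_lr
    return train_and_validate(learning_rate=learning_rate,
                              batch_size=32)   # placeholder helper

optimizer = BayesianOptimization(
    f=objective,
    pbounds={"log_lr": (-5.0, -2.0)},   # assumed search range, 1e-5 to 1e-2
    random_state=42,
)
optimizer.maximize(init_points=5, n_iter=20)
print(optimizer.max)   # best learning rate and the AUROC it achieved
```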

A CNN architecture was constructed to calculate the probability of malignancy of an endoscopic image using ResNet (ResNet34, ResNet101, and ResNet152)39, EfficientNet B340, VGGNet (VGG16 and VGG19)41, MobileNet (MobileNetV1 and MobileNetV2)42, and DenseNet (DenseNet121, DenseNet169, and DenseNet201)43. These models are multi-layer neural networks commonly used for image classification. We applied stratified k-fold cross-validation (k = 10) to assess the deep learning models. Ten random splits were generated with fixed seeds to keep the malignant and non-malignant ratios equal. During internal validation, we randomly partitioned the dataset into approximately 70% training, 10% validation, and 20% test sets (Fig. 4). We determined the number of epochs using early stopping. For internal validation, we used the dataset consisting of images obtained from Seoul National University Hospital, Severance Hospital, Ajou University Hospital, and the National Cancer Center. With this method, the risk of overfitting was assumed to rise from the moment the validation loss increased relative to the training loss; training was therefore ended after an additional exploration period.
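A minimal sketch of such a seeded, stratified split is shown below, assuming NumPy arrays `image_paths` and `labels` (hypothetical names) in which labels encode malignant (1) versus non-malignant (0).

```python
from sklearn.model_selection import StratifiedKFold

# Fixed seed so the ten random splits are reproducible across experiments.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

for fold, (train_idx, val_idx) in enumerate(skf.split(image_paths, labels)):
    # Stratification keeps the malignant/non-malignant ratio equal per fold.
    print(f"fold {fold}: "
          f"train malignant fraction = {labels[train_idx].mean():.3f}, "
          f"val malignant fraction = {labels[val_idx].mean():.3f}")
```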

After training the models, we performed external validation by examining how accurately the trained models distinguished malignant from non-malignant images from another clinical research center. To this end, we constructed a new test dataset of 352 tongue images from Boramae Medical Center.

Comparison with observer classification

We compared the performance of the algorithm with that of human readers using the external validation dataset. The human readers were divided into two groups according to their professional backgrounds and clinical experience. The specialist reader was a head and neck surgical oncologist with more than seven years of clinical experience. The general physician reader was a non-specialist doctor with four years of post-licensure experience.

Each human reader reviewed the same dataset and classified cases as malignant or non-malignant without any prior knowledge of the patient history. The readers blindly evaluated the de-identified endoscopic images and assessed the possibility of malignancy. The AI model with the best performance among the candidate models was evaluated on the same dataset.

The performance of the readers was assessed by comparing their predictions with the corresponding pathological reports. We evaluated the final results and calculated the overall accuracy, sensitivity, and specificity. We estimated the kappa values with linear weighting and 95% confidence intervals (CIs) to compare the diagnostic results of human readers and the model. The kappa value scale for agreement strength was as follows: poor: < 0.2; fair: 0.21–0.40; moderate: 0.41–0.60; good: 0.61–0.80; and very good: 0.81–1.0044.
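A sketch of this agreement analysis using scikit-learn is given below. The bootstrap confidence interval is one possible way to obtain the 95% CI and is an assumption, since the study does not specify its exact method; the prediction arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def kappa_with_ci(ratings_a, ratings_b, n_boot=2000, seed=42):
    """Linearly weighted kappa with a bootstrap 95% confidence interval."""
    kappa = cohen_kappa_score(ratings_a, ratings_b, weights="linear")
    rng = np.random.default_rng(seed)
    n = len(ratings_a)
    boot = [cohen_kappa_score(ratings_a[idx], ratings_b[idx], weights="linear")
            for idx in (rng.integers(0, n, n) for _ in range(n_boot))]
    low, high = np.percentile(boot, [2.5, 97.5])
    return kappa, (low, high)

# Example: agreement between model predictions and the specialist's reads,
# both encoded as 0 = non-malignant, 1 = malignant (hypothetical arrays).
# kappa, ci = kappa_with_ci(model_preds, specialist_preds)
```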

Statistical analysis

We evaluated the performance of the classification models using objective evaluation metrics, including specificity, precision, sensitivity, F1-score, and accuracy. These metrics are derived from the true positive (TP), true negative (TN), false negative (FN), and false positive (FP) counts of the models' predictions. In addition, we used the AUROC to evaluate the deep learning algorithm's ability to distinguish malignant from non-malignant images. We plotted the receiver operating characteristic (ROC) curve by calculating the TP rate (sensitivity) and the FP rate (1 − specificity) at different predicted probability thresholds and then calculated the AUC values. Because the distribution of the binary classes was not uniform, we also estimated the area under the precision-recall curve (AUPRC) to evaluate the trained models. The corresponding 95% confidence interval was computed for each metric. The performance of the CNN models and the two readers in distinguishing malignant from non-malignant images was evaluated using these indicators.
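The evaluation metrics can be computed from true labels and predicted probabilities as in the following scikit-learn sketch; the function name, variable names, and the 0.5 decision threshold are illustrative assumptions.

```python
from sklearn.metrics import (auc, confusion_matrix, f1_score,
                             precision_recall_curve, roc_auc_score)

def evaluate(y_true, y_prob, threshold=0.5):
    """Compute the metrics described above from binary labels (1 = malignant)
    and predicted probabilities of malignancy."""
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    precision, recall, _ = precision_recall_curve(y_true, y_prob)
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),           # true positive rate
        "specificity": tn / (tn + fp),           # 1 - false positive rate
        "precision":   tp / (tp + fp),
        "f1":          f1_score(y_true, y_pred),
        "auroc":       roc_auc_score(y_true, y_prob),
        "auprc":       auc(recall, precision),   # area under the PR curve
    }
```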

We selected the model that best classified the endoscopic images by comparing model performance, considering the AUROC first and the AUPRC second. A model that performed well in internal validation but poorly in external validation was excluded from model selection. All statistical analyses were performed using the pandas (version 0.22.1), scikit-learn (0.24.1), NumPy (1.19.5), Matplotlib (3.3.4), OpenCV-Python (4.5.2), and Bayesian optimization (1.2.0) Python packages. We used Keras, a deep learning framework that acts as an interface for the TensorFlow 2 library. Model structures were developed on graphics processing unit servers with multiple NVIDIA Tesla V100 GPUs (32 GB × 4) and a Xeon Gold 6248 (2.5 GHz/20-core/150 W, 512 GB RAM) as the central processing unit.

Ethical statement

The Institutional Review Board of Ajou University Hospital approved this study (IRB No. AJIRB-MDB-20-311). Further, informed consent from all participants was waived by the IRB because of the retrospective nature of this study.