Dental implants are among the most widely used and commonly accepted treatment modalities for oral rehabilitation of partially and completely edentulous patients1,2. The occurrence of various major or critical mechanical (such as fractures of screws or fixtures) and biological (such as peri-implantitis) problems is steadily and inevitably increasing, affecting long-term survival and reintervention outcomes3,4. Therefore, dental implant-related complications are a growing concern in the dental community worldwide and are a public health problem associated with a high socio-economic burden5,6.

In particular, early detection and appropriate treatment of simple mechanical complications such as screw loosening can prevent more severe complications, such as fixture fracture or severe peri-implantitis7,8. Early and fast intervention requires that the dental implant system (DIS) placed in the oral cavity be unambiguously identified and classified. However, in actual clinical practice, properly identifying or classifying the many different types of DIS after implant surgery is not easy because of clinical and environmental factors, including the closure of a dental hospital or the loss of dental records.

Although two-dimensional dental radiography, including panoramic and periapical radiographs, is the most useful tool for identifying and classifying DIS after implant surgery, there is a fundamental and practical limit to classifying thousands of different types of DIS with similar shapes and physical properties9,10. In addition, three-dimensional cone-beam computed tomography (CBCT) has been actively used for dental implant surgery; however, whether CBCT can better classify DIS is debatable because the sharpness and resolution of CBCT are still significantly lower than those of periapical radiographs11.

Deep learning (DL), a subfield of artificial intelligence (AI), has a wide range of applications in medicine; this technology achieves high accuracy in medical image analysis for edge detection, classification, and segmentation based on a cascade of multiple computational and hidden layers in a deep neural network12. In dentistry, deep convolutional neural networks have rapidly become the methodology of choice for two- and three-dimensional dental image analyses13,14,15,16. Several studies have demonstrated that DL algorithms are an emerging state-of-the-art approach in terms of accuracy for identifying and classifying various types of DIS, often outperforming dental professionals specialized in implantology17,18,19,20,21,22,23,24,25. However, because most previous studies were based on fewer than a few thousand DIS images or fewer than 10 different types of DIS, the available evidence is insufficient for implementation in actual clinical practice19,20,21,22,23,24. This study aimed to evaluate the accuracy of an automated DL algorithm for the identification and classification of DIS using a large-scale, comprehensive multicenter dataset.

Results

The performance metrics of the automated DL algorithm for the total of 156,965 panoramic and periapical radiographic images were 88.53% accuracy, 85.70% precision, 82.30% recall, and 84.00% F1 score. Using only panoramic images (n = 116,756), the DL algorithm achieved 87.89% accuracy, 85.20% precision, 81.10% recall, and 83.10% F1 score, whereas using only periapical images (n = 40,209), it achieved 86.87% accuracy, 84.40% precision, 81.70% recall, and 83.00% F1 score. No statistically significant difference in classification accuracy was observed between the three groups; the detailed accuracy of DL in the classification of DIS is listed in Table 1.

Table 1 Accuracy performance of automated deep learning algorithm.

Figure 3 shows the normalized confusion matrices, containing a summary of the classification of the 27 different types of DIS based on the automated DL algorithm (full details are provided in Appendix 3). Using panoramic and periapical images, the classification accuracy of DL was the highest for Nobel Biocare Branemark (100.0%) and Megagen Exfeel external (100.0%), and the lowest for Warantec IT (35.3%). Using only panoramic images, the classification accuracy was the highest for Osstem US III (100.0%) and Megagen Exfeel external (100.0%), and the lowest for Warantec IT (19.0%). When using only periapical images, the classification accuracy was the highest for Megagen Exfeel external (100.0%) and Dentsply Xive (100.0%), and the lowest for Warantec IT (19.2%).

Figure 1

Schematic illustration of dataset collection and verification. All study protocols and related procedures were supervised by the National Information Society Agency (NIA) and the Korean Academy of Oral & Maxillofacial Implantology (KAOMI).

Discussion

In the late 2010s, AI-based large-scale machine learning and DL, which facilitated the accurate diagnosis of medical radiographic images, garnered attention in biomedical engineering and provided novel insights into precision medicine26,27,28. More recently, deep convolutional neural network algorithms have gained popularity in dentistry and have achieved considerable success in analyzing dental radiographic images29. The potential clinical applications of DL technology depend on (1) deeper and more sophisticated neural network structures and (2) large, high-quality annotated datasets. In particular, a gold-standard dataset annotated and verified by medical and dental professionals is essential for creating a reliable radiographic image-based DL model in the medical and dental fields26,27.

Evaluating the performance of DL-based identification and classification of various types of DIS in actual clinical practice requires a large, highly accurate, and reliable dataset. Recently, a large-scale, comprehensive multicenter dataset suitable for DL-based identification and classification of DIS was collected and released openly through a national initiative. To our knowledge, the dataset used in the present study contained a larger number of radiographic images and types of DIS than any previously reported implant-related dataset. The present study is therefore expected to show greater clinical feasibility than previous implant-related DL research.

Most previous studies evaluated the accuracy of conventional or minimally modified DL architectures (e.g., YOLO, SqueezeNet, ResNet, GoogLeNet, and VGG-16/19) using fewer than a few thousand dental radiographic images and usually fewer than 10 different types of DIS, reporting classification accuracies ranging from 70 to 100%17,18,19,20,21,22,23,24,25. One study that utilized a ResNet architecture trained on 9767 panoramic images of 12 types of DIS reported a high accuracy of 98% or more23. Our previous pilot study, which utilized automated DL on 11,980 images of six types of DIS, also showed reliable outcomes and achieved a very high accuracy of 95.4% (sensitivity: 95.5%; specificity: 85.3%)18. Conversely, another study based on YOLOv3 using 1282 panoramic images showed a relatively low average accuracy in the 70% range22.

The automated DL algorithm used in this study achieved an AUC of 0.885 based on the combination of periapical and panoramic radiographs. When only panoramic radiographs were used, the AUC was 0.878, and when only periapical radiographs were used, the AUC was 0.868. The combination of panoramic and periapical images thus had the highest classification accuracy and periapical images alone the lowest, but there was no statistically significant difference between the three groups. These outcomes are consistent with the previously reported absence of a significant difference in classification accuracy between panoramic and periapical images and are also likely due to the fact that almost three times more panoramic images (n = 105,080) than periapical images (n = 36,188) were used for training and validation17,18.

Specifically, Nobel Biocare Branemark, Megagen Exfeel external, Osstem US III, and Dentsply Xive showed a classification accuracy of 100.0%, whereas Warantec IT showed low accuracy (19.0–35.3%) despite having a conventional fixture morphology with an internally tapered shape, owing to its relatively small number of radiographic images (only 238 panoramic and 208 periapical images). From this perspective, DL has great advantages in identifying and classifying similar types of DIS; however, its accuracy varies significantly with the amount of training data, which is considered a fundamental limitation of existing DL algorithms. Further research should confirm whether the amount of training data required can be reduced by adopting an algorithm more specialized for DIS classification than the one used in this study.

In the radiographs used in this study, the main region of interest (ROI) was the implant fixture, but a number of confounding conditions (such as the surrounding alveolar bone, cover screws, healing abutments, and provisional or definitive prostheses) were also included. For use in actual clinical practice, datasets should comprise implant fixtures with different confounding conditions and angles, rather than implant fixtures with perfect/intact shapes and standard angles. Several previous studies, including this one, have confirmed that implant datasets with varying angles and confounding conditions achieve a high accuracy of over 80%17,18,25. Furthermore, using the Gradient-weighted Class Activation Mapping (Grad-CAM) technique, we found that the types of DIS were classified by focusing on the implant fixture itself rather than on the various confounding components of the DIS. Therefore, various confounding factors and angles do not appear to have a significant impact on the accuracy of DL-based implant system classification.

In a recent study in which healthcare professionals with no coding experience evaluated the feasibility of automated DL models using five publicly available, open-source medical image datasets, most classification models showed accuracy and diagnostic properties comparable to those of state-of-the-art DL algorithms30. Developing customized DL models according to the types and characteristics of datasets requires highly specialized skills and expertise. This study confirmed that an automated DL model built without coding, rather than one hand-crafted by computer scientists and engineers, achieved excellent classification accuracy of over 86% on a 27-class task involving similarly designed but different types of DIS.

Identifying and classifying DIS with varying features and characteristics from limited clinical and radiographic information is a challenge not only for inexperienced dental professionals but also for dentists with sufficient experience in implant surgery and prosthetics. In the past, several studies identified DIS from a forensic perspective based on radiographs, and efforts to classify DIS have continued until recently; however, most are based on empirical evidence, making high reliability difficult to achieve9,10,31. More recently, computer-based implant recognition software and web-based DIS classification platforms have been developed and used; however, most require manual classification of DIS features (such as the coronal interface, flange, thread type, taper, and apex shape) or contain only a small number of DIS datasets, limiting their active use in clinical practice32.

The first long-term goal of this research is to build a database of almost all types of DIS used worldwide and to train sophisticated, refined DL algorithms optimized for DIS classification, achieving a level of reliability sufficient for actual clinical practice. The second goal is to create a web- or cloud-based environment where datasets can be freely stored, trained, and validated in real time. Achieving these goals requires the proactive development of standard protocols that facilitate data sharing and integration, secure the transmission and storage of large datasets, and enable federated learning33,34.

This study had several limitations. Collecting a dataset for supervised learning requires considerable tangible and intangible resources, including finances, time, trained personnel, hardware, and software. Therefore, unsupervised learning, a technique for overcoming small-scale and imbalanced datasets, has been introduced and cautiously tested in dentistry; however, it remains a challenging approach35. Large-scale, multicenter datasets may be useful for future DL-based research and actual clinical trials to identify and classify various types of DIS. Nevertheless, the dataset used in this study had inherent limitations regarding the interpretability of the results. Although the raw NIA dataset consisted of 165,700 radiographs and 42 different types of DIS, the number of panoramic and periapical images for each type of DIS was highly heterogeneous. In addition, DIS manufactured by foreign companies or made of non-titanium materials (such as non-metallic ceramic zirconia), which are rarely used in South Korea, were scarce or absent from the raw dataset. To overcome the potential problems of overfitting and selection bias, we selected only DIS with at least 100 panoramic and 100 periapical radiographic images.

Conclusion

We verified that automated DL shows a high classification accuracy based on large-scale and multicenter datasets. Furthermore, no significant difference in accuracy was observed between panoramic and periapical radiographic images. The clinical feasibility of the automated DL algorithm will have to be confirmed using additional datasets and clinical research in the future.

Materials and methods

Ethics

This study was approved by and conducted in accordance with the following Institutional Review Boards (IRBs): Seoul National University Dental Hospital (ERI21024), Yonsei University Dental Hospital (2-2021-0049), Gangnam Severance Dental Hospital (3-2021-0175), Wonkwang University Daejeon Dental Hospital (W2104/003-002), Dankook University Dental Hospital (2021-8-004), and the national public IRB (P01-202109-21-020). All of these IRBs approved a waiver of informed consent for this retrospective large-scale, multicenter data analysis. All methods in this study were performed in accordance with the relevant guidelines and regulations.

Dataset collection and verification

All included dental radiographic images were managed and supervised by the National Information Society Agency (NIA), under the Ministry of Science and ICT, and the Korean Academy of Oral and Maxillofacial Implantology (KAOMI). The dataset was collected from five college dental hospitals (Seoul National University Dental Hospital, Yonsei University Dental Hospital, Yonsei University Gangnam Severance Dental Hospital, Wonkwang University Daejeon Dental Hospital, and Dankook University Dental Hospital) and 10 private dental clinics. Appendix 1 summarizes the detailed consortium of the dataset collection.

Digital Imaging and Communications in Medicine (DICOM)-format panoramic and periapical images were converted into de-identified JPEG- or PNG-format images of 512 × 512 pixels or smaller, and one implant fixture per image was cropped as the region of interest (ROI). The collected ROI images were reviewed by 10 dental professionals employed by KAOMI to ensure that cropping, resolution, and sharpness were adequate. Subsequently, based on the medical and dental records provided by the college dental hospitals and private clinics, each implant fixture was labeled with the manufacturer, brand and system, diameter and length, placement position, surgery date, age, and sex using customized labeling and annotation tools. All included radiographic images were validated by a board-certified oral and maxillofacial radiologist who was not involved in dataset management. The final dataset consisted of 165,700 panoramic and periapical radiographic images and 42 types of DIS. Appendix 2 provides a detailed list of the raw NIA dataset (Fig. 2).

Figure 2

Dataset containing 116,756 panoramic and 40,209 periapical radiographic images and comprising 10 manufacturers and 27 types of dental implant systems.

We included only DIS with at least 100 periapical and 100 panoramic images from the raw NIA dataset. The final dataset used in this study contained 116,756 panoramic and 40,209 periapical images, comprising 10 manufacturers and 27 types of DIS. Specifically, the dataset included Neobiotech (n = 21,260, 13.54%), Nobel Biocare (n = 3644, 2.32%), Dentsply (n = 15,296, 9.74%), Dentium (n = 41,096, 26.18%), Dioimplant (n = 1530, 0.97%), Megagen (n = 7801, 4.97%), Straumann (n = 4977, 3.17%), Shinhung (n = 3376, 2.15%), Osstem (n = 42,920, 27.34%), and Warantec (n = 15,065, 9.60%). The detailed types of DIS are listed in Table 2 and illustrated in Fig. 3.
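The inclusion rule above amounts to a simple threshold filter over per-system image counts. A minimal sketch in Python (the system names and counts in the example are hypothetical, not the study's actual numbers):

```python
def select_implant_systems(image_counts, min_images=100):
    """Keep only implant systems with at least `min_images` panoramic AND
    `min_images` periapical images. `image_counts` maps a system name to a
    (n_panoramic, n_periapical) tuple."""
    return {name: (pano, peri)
            for name, (pano, peri) in image_counts.items()
            if pano >= min_images and peri >= min_images}

# Hypothetical counts for illustration only
raw_counts = {"System A": (5000, 1800), "System B": (90, 300), "System C": (400, 40)}
selected = select_implant_systems(raw_counts)  # only "System A" passes both thresholds
```

Note that both modalities must clear the threshold independently; a system rich in panoramic images but poor in periapical images is still excluded.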

Table 2 Number of panoramic and periapical radiographic images for each dental implant system.
Figure 3

Multi-label classification confusion matrix with normalization using panoramic and periapical radiographic images.

Implementation of automated DL algorithm

For the identification and classification of the 156,965 radiographic images, a customized automatic DL engine (Neuro-T version 3.0.1, Neurocle Inc., Seoul, Korea) was adopted in this study. The automated DL algorithm is a self-training architecture that, within the available computing resources and training time, selects appropriate DL models and optimizes the hyperparameters (including the resize method, number of network convolutional layers, decay method, learning rate, dropout rate, batch size, number of epochs and patience, and optimizer) to fit the model to the customized dataset36.
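The internals of the Neuro-T engine are proprietary, so as an illustration only, the kind of hyperparameter optimization described above can be approximated by a random search over a candidate space. The parameter names below follow those listed in the text; the value ranges are assumptions, not the engine's actual settings:

```python
import random

# Hypothetical search space for illustration; the engine's actual ranges are not public.
SEARCH_SPACE = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "dropout_rate": [0.1, 0.3, 0.5],
    "batch_size": [16, 32, 64],
    "optimizer": ["adam", "sgd"],
}

def random_search(evaluate, n_trials=20, seed=0):
    """Sample hyperparameter sets at random and keep the one whose
    validation score (returned by `evaluate`) is highest."""
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(n_trials):
        params = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
        score = evaluate(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```

In practice, `evaluate` would train a candidate model and return its validation accuracy; automated engines typically use more sample-efficient strategies than pure random search, but the budget-bounded select-and-keep-best loop is the same idea.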

The dataset comprised three groups: panoramic (n = 116,756), periapical (n = 40,209), and combined panoramic and periapical (n = 156,965) images. Each group was randomly divided into three subsets: training (80%), validation (10%), and testing (10%). After the division, the training dataset was augmented tenfold with random rotations (90°), hue (range of − 0.1 to 0.1), brightness (range of − 0.12 to 0.12), saturation (range of 0.6–1.5), contrast (range of 0.6–1.5), noise (0.05), and horizontal and vertical flips. Training was performed on two NVIDIA A6000 graphics processing units (48 GB memory, NVIDIA, Mountain View, CA, USA). The models were trained for a maximum of 500 epochs, and training was stopped early if the validation loss did not improve for more than 20 epochs.
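The 80/10/10 division described above can be sketched in plain Python. This is an illustrative sketch, not the engine's actual splitting code; in a class-imbalanced setting like this one, a stratified split per DIS type would be a natural refinement:

```python
import random

def split_dataset(items, seed=42):
    """Randomly divide a list of images into training (80%),
    validation (10%), and test (10%) subsets."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * 0.8)
    n_val = int(len(shuffled) * 0.1)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])  # remainder goes to the test set
```

Assigning the remainder to the test subset guarantees every image lands in exactly one subset even when the counts are not divisible by ten.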

Statistical analysis

Categorical and continuous variables were expressed as frequencies (n) and percentages (%). The performance metrics evaluated were accuracy, precision, recall, and F1 score (Eqs. (1)–(4); TP: true positive, FP: false positive, FN: false negative, and TN: true negative):

$$ {\text{Accuracy}} = \frac{{\left( {{\text{TP}} + {\text{TN}}} \right)}}{{\left( {{\text{TP}} + {\text{FP}} + {\text{TN}} + {\text{FN}}} \right)}} $$
(1)
$$ {\text{Precision}} = \frac{{{\text{TP}}}}{{\left( {{\text{TP}} + {\text{FP}}} \right)}} $$
(2)
$$ {\text{Recall}} = \frac{{{\text{TP}}}}{{\left( {{\text{TP}} + {\text{FN}}} \right)}} $$
(3)
$$ {\text{F}}1{\text{ score}} = 2{ } \times { }\frac{{\left( {{\text{Precision }} \times {\text{ Recall}}} \right)}}{{\left( {{\text{Precision }} + {\text{Recall}}} \right)}} $$
(4)
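Eqs. (1)–(4) can be computed directly from the confusion counts. A minimal Python sketch (the counts in the usage example are hypothetical, not taken from this study):

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 score from confusion counts,
    following Eqs. (1)-(4)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts for a single class (not study data)
acc, prec, rec, f1 = classification_metrics(tp=82, fp=14, tn=60, fn=18)
```

For a multi-class task such as the 27-way DIS classification, these per-class values are typically averaged (macro- or micro-averaged) to obtain the single precision, recall, and F1 figures reported.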

Additionally, a normalized confusion matrix for each DIS was calculated based on the test dataset. All data processing and statistical analyses were conducted using a commercial statistical package (Neuro-T version 3.0.1, Neurocle Inc., Seoul, Korea) and a non-commercial statistical package (R version 4.2.0, R Foundation for Statistical Computing, Vienna, Austria).
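A row-normalized confusion matrix of the kind reported here can be sketched in plain Python, where entry [i][j] is the fraction of images of true class i that the model predicted as class j (the class labels in the test below are hypothetical):

```python
def normalized_confusion_matrix(y_true, y_pred, n_classes):
    """Row-normalized confusion matrix: entry [i][j] is the fraction of
    samples of true class i that were predicted as class j."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Guard against classes absent from the test set
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix
```

With this normalization, each row sums to 1 (for classes present in the test set), and the diagonal entries are the per-class recall values, such as the 100.0% and 35.3% figures quoted in the Results.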