Deep learning based prediction of extraction difficulty for mandibular third molars

This paper proposes a convolutional neural network (CNN)-based deep learning model for predicting the difficulty of extracting a mandibular third molar using a panoramic radiographic image. The applied dataset includes a total of 1053 mandibular third molars from 600 preoperative panoramic radiographic images. The extraction difficulty was evaluated based on the consensus of three human observers using the Pederson difficulty score (PDS). The classification model used a ResNet-34 pretrained on the ImageNet dataset. The correlation between the PDS values determined by the proposed model and those measured by the experts was calculated. The prediction accuracies for C1 (depth), C2 (ramal relationship), and C3 (angulation) were 78.91%, 82.03%, and 90.23%, respectively. The results confirm that the proposed CNN-based deep learning model could be used to predict the difficulty of extracting a mandibular third molar using a panoramic radiographic image.


Discussion
Deep learning has been widely used in various fields. Among the many deep learning models, CNNs are the most efficient 6 . CNNs have shown excellent results in the analysis of radiographic images when compared with the results of medical experts. Previous studies have shown that deep learning can be used to recognize anatomical structures, find anomalies, measure distances, and classify structures in medical images 1,3-15 . However, in most studies, object detection was conducted manually, and tasks were limited to simple measurements, comparisons, or classifications. In this study, all processes, including object detection, were applied automatically. In addition, the task is quite complicated because the normal anatomical structure is assessed and scored based on three criteria. The Single Shot Multibox Detector (SSD) is a representative CNN-based model used for object detection. Liu et al. proposed this model, which discretizes the output space of the bounding boxes into a set of default boxes over various aspect ratios and scales 18 . This approach makes the training and integration of the detection system straightforward. Consequently, SSD shows a fast inference speed and achieves outstanding detection performance.
We zero-padded the image edges during preprocessing to unify the image size. The main reason for unifying the image size is to enable mini-batch learning. Notably, mini-batch learning not only speeds up convergence but also increases the model efficiency. Meanwhile, ResNet, which is also a CNN, has delivered excellent image recognition performance through residual learning implemented with skip connections.
The experimental results are impressive. Figure 2 shows the results of the classification models for each criterion as confusion matrices. As the confusion matrices in Fig. 2 show, there are few cases in which the difference between the predicted and actual scores of a misclassified tooth is large. Although a misclassification is a problem, if it does occur, the smaller the difference between the predicted and actual scores, the less significant the diagnostic error. Although unintended, this behavior is meaningful.
The primary objective of this study was to use a CNN to evaluate the difficulty of extracting mandibular third molars based on the features present in radiography images. Therefore, the correlation between the PDS evaluated by the proposed model and that measured by experts must be verified.
In general deep learning-based classification, the class index with the largest value among the calculated class probabilities is selected. However, a score selected in this way may not accurately represent what the model predicts. For example, if the model produces a probability distribution such as that depicted in Fig. 3A, the tooth will be assigned a score of 1, even though the model also has high confidence in a score of 2. Conversely, if a probability distribution similar to that depicted in Fig. 3B is obtained, the score will be misclassified as 2, although the probability of a score of 1 (the actual score) is also high. In such cases, the classified scores cannot fully reflect the intention of the model. Therefore, we computed the predicted PDS from the inferred probability distribution to reflect this intention. Given a probability $P_{s_c}$ for score $s_c \in \{1, 2, 3, 4\}$ of each criterion $c \in \{C1, C2, C3\}$, the predicted PDS $\hat{y}$ can be calculated as follows:

$$\hat{y} = \sum_{c \in \{C1, C2, C3\}} \sum_{s_c} s_c \, P_{s_c}$$

In this way, the results can reflect the intention of the model. The results show that accurate predictions of mandibular third molar extraction difficulty can be achieved using a CNN. However, although the model performed well for scores of 4 through 7, it overestimated the cases of PDS 3 and underestimated the cases of PDS 8 and 9 (Fig. 1). Because PDS 3 is the lowest possible score, a CNN can only estimate such cases as PDS 3 or higher, leading to overestimated results on average. A similar effect was observed for PDS 8 and 9. In addition, because there are no cases of PDS 10 and only two cases of PDS 9, one of which was used for testing rather than training, the CNNs had little opportunity to learn about such cases and therefore have little information about them. This led to an underestimation of the cases of PDS 8 and 9. It is likely that there are so few cases of PDS 9 and 10 because teeth with high scores for all criteria are rare.
For example, a tooth with a vertical or distal angle (C3, with a score of 3 or 4) will not be interfered with by the adjacent mandibular second molar, and thus the score for C1 would be 1 or 2.
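As a rough illustration of the difference between selecting the most probable score and using the full probability distribution, the following minimal Python sketch computes both for one tooth. The function names and the example probabilities are hypothetical and chosen only for illustration; they are not taken from the study's data or code.

```python
from typing import Dict, List


def argmax_score(probs: List[float]) -> int:
    """Return the 1-based score with the highest probability."""
    return max(range(len(probs)), key=lambda i: probs[i]) + 1


def expected_score(probs: List[float]) -> float:
    """Return the probability-weighted score, sum of s * P(s) for s = 1, 2, ..."""
    return sum((i + 1) * p for i, p in enumerate(probs))


def predicted_pds(probs_by_criterion: Dict[str, List[float]]) -> float:
    """Sum the expected scores over the criteria C1, C2, and C3."""
    return sum(expected_score(p) for p in probs_by_criterion.values())


# Hypothetical class probabilities for one tooth (illustrative values only).
example = {
    "C1": [0.55, 0.40, 0.05, 0.00],  # argmax gives 1, but a score of 2 is also likely
    "C2": [0.10, 0.80, 0.10, 0.00],
    "C3": [0.00, 0.00, 0.90, 0.10],
}

hard_pds = sum(argmax_score(p) for p in example.values())  # 1 + 2 + 3 = 6
soft_pds = predicted_pds(example)                          # 1.5 + 2.0 + 3.1, about 6.6
```

In this made-up case the argmax-based total is 6, whereas the probability-weighted total is about 6.6, because the second most likely score for C1 still contributes to the estimate.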
To the best of our knowledge, this is the first study on evaluating the difficulty of extracting a mandibular third molar using a deep learning model. These predictions will help the operator plan and prepare in advance, prior to the extraction process. The prediction results can also be used to inform patients about their conditions and seek their consent. In addition, objective data can be used to determine the treatment cost for extraction based on the level of difficulty.
One limitation of this study, however, is that we used only panoramic images. Panoramic images can show a broad range of anatomical structures in a single 2D image, although distortions inevitably occur in both the vertical and horizontal dimensions 19 . In addition, it is extremely difficult to evaluate transverse angulation or dilaceration.
Clinically, many other factors can affect the difficulty of extracting mandibular third molars, including gender, age, root morphology, bone density, and proximity to the inferior alveolar nerve 15,16 . Some studies have previously suggested that deep learning models can be used to evaluate certain factors related to the extraction difficulty. Hiraiwa et al. showed that CNNs can assess the rough morphology of the root of the mandibular first molar using a panoramic image 14 . This approach can be applied to the mandibular third molar, although more studies on evaluating the detailed morphology of the root, such as dilaceration or partial curvature, will be needed. Lee et al. showed that osteoporosis can be detected by analyzing the textural and morphological features in panoramic images using a CNN 13 . By analyzing and quantifying the bone around the mandibular third molar in a panoramic image, it will be possible to determine how much the bone density affects the difficulty of extraction. Regarding proximity to the inferior alveolar nerve, many previous studies have shown that CNNs can detect the inferior alveolar nerve using panoramic images and cone beam computed tomography 2,3,20 . It is possible to calculate the distance and positional relationship between the inferior alveolar nerve and the mandibular third molar using a CNN. However, there are no standardized variables for evaluating the difficulty of mandibular third molar extraction based on the distance or location. Further studies will be needed to quantify and standardize such variables and, finally, to synthesize them.

The 600 panoramic images included images of 1053 mandibular third molars. Each tooth was scored based on three criteria (depth, ramal relationship, and angulation) according to the Pederson scale (Table 2).
Scoring was applied with the consensus of three dentists, i.e., one oral and maxillofacial surgeon, one oral and maxillofacial resident, and one oral and maxillofacial radiologist, using two CX50N monitors (WIDE Co., Hwaseung, Korea). Because there was no precise boundary between the scores, the observers were calibrated as described below.
1. Depth (C1)
The midpoint of the occlusal surface of the impacted third molar was set as the evaluation point. When the evaluation point was above the occlusal surface of the mandibular second molar, we recorded the score as a 1, and when it was below, we recorded it as a 2. When the entire tooth was below the occlusal surface of the mandibular second molar, we recorded the score as a 3.

2. Ramal relationship (C2)
In mesioangular and horizontal cases, the contact point of the mandibular third molar and the mandibular ramus was set as the evaluation point. The evaluation point was compared with the distal point of the cemento-enamel junction of the mandibular third molar. When the contact point was disto-apical, we recorded the score as a 1. When the contact point was mesio-occlusal, we recorded it as a 2. In the vertical and distoangular cases, we used the same points but only considered the occluso-apical position. When the contact point was apical, we recorded it as a 1. When the contact point was occlusal, we recorded it as a 2. Cases in which the entire crown was impacted were recorded as a 3.

3. Angulation (C3)
The occlusal surface of the mandibular third molar was compared with the distal surface of the mandibular second molar. When they were close to perpendicular, we recorded the score as a 3, and when they were close to parallel, we recorded the score as a 2; otherwise, we recorded the score as a 1. Finally, we scored cases with an angle below 90° as a 4.
To draw an objective conclusion, every score was cross verified. In the case of a disagreement, we followed the majority opinion. Subsequently, the PDS was determined as the sum of all scores obtained from each criterion 17 . Each radiograph was manually labeled by drawing rectangular bounding boxes around the mandibular third molars for region of interest (ROI) detection training.
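The labeling rule described above (a majority decision among the three observers for each criterion, followed by a sum over the criteria) can be sketched as follows. This is a minimal illustration with hypothetical function names and made-up observer scores. How a case with three different scores was resolved is not stated here, so the sketch simply raises an error in that situation.

```python
from collections import Counter
from typing import Dict, List


def majority_score(scores: List[int]) -> int:
    """Return the score chosen by a strict majority of observers."""
    score, count = Counter(scores).most_common(1)[0]
    if count <= len(scores) // 2:
        # Assumption: no tie-breaking rule is modeled in this sketch.
        raise ValueError(f"no majority among observer scores {scores}")
    return score


def pederson_score(observer_scores: Dict[str, List[int]]) -> int:
    """Sum the majority scores of the three criteria to obtain the PDS."""
    return sum(majority_score(observer_scores[c]) for c in ("C1", "C2", "C3"))


# Made-up scores from three observers for one tooth.
tooth = {"C1": [2, 2, 1], "C2": [1, 1, 1], "C3": [3, 4, 3]}
pds = pederson_score(tooth)  # 2 + 1 + 3 = 6
```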
Preprocessing and composition. Preprocessing was required before the acquired images could be used for training and verification. Figure 4 shows the preprocessing procedure. First, the original image was split into two sections (left and right) of equal proportion along the width of the panoramic image. The second image in Fig. 4 is the split image. Next, the edges were zero-padded to unify the images to the same size. The sizes of the acquired panoramic images differed because the field of view varied slightly depending on the sizes of the objects. After preprocessing, the entire dataset was randomly split by subject at a ratio of 1:1, and the two parts were used as the training and testing sets, respectively. The split was performed only once, at the outset, and all experiments used the same split. For validation, 10% of the data was set aside from the training set. Model training and the search for optimal hyperparameters were performed on the training and validation sets. Only the finally selected model was evaluated on the test set and is presented in this paper.
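The preprocessing and data composition steps can be sketched in plain Python as follows. This is only an illustrative sketch: images are represented as nested lists of pixel values, the padding is centered (the text states only that the edges were zero-padded), and the function names and the random seed are assumptions.

```python
import random
from typing import Hashable, Iterable, List, Set, Tuple

Image = List[List[int]]  # rows of pixel intensities


def split_left_right(image: Image) -> Tuple[Image, Image]:
    """Split an image into left and right halves along its width."""
    mid = len(image[0]) // 2
    return [row[:mid] for row in image], [row[mid:] for row in image]


def zero_pad(image: Image, height: int, width: int) -> Image:
    """Place the image at the center of a zero-filled canvas of the target size."""
    h, w = len(image), len(image[0])
    if h > height or w > width:
        raise ValueError("target size must not be smaller than the image")
    top, left = (height - h) // 2, (width - w) // 2
    canvas = [[0] * width for _ in range(height)]
    for r in range(h):
        canvas[top + r][left:left + w] = image[r]
    return canvas


def split_by_subject(subject_ids: Iterable[Hashable], seed: int = 0) -> Tuple[Set, Set]:
    """Randomly assign subjects (not individual teeth) to train and test sets at 1:1."""
    ids = sorted(set(subject_ids))
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return set(ids[:half]), set(ids[half:])


tiny = [[1, 2, 3, 4], [5, 6, 7, 8]]
left, right = split_left_right(tiny)
padded = zero_pad(left, 4, 4)
train_ids, test_ids = split_by_subject(range(10))
```

Splitting by subject rather than by tooth keeps both third molars of one patient in the same set, which avoids leakage between the training and testing data.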
Augmentation. Augmentation helps prevent overfitting and helps the model learn various features. As augmentation techniques, we employed random flipping and rescaling in our detection model. The image was flipped with a probability of 0.5, and the scale was randomly changed within a ratio range of (0.8, 1.0). The brightness and contrast variation factors were randomly selected within the range of (0.8, 1.2). In addition, the ROI was randomly cropped from the entire image within a ratio range of (0.9, 1.0). The augmentation parameters were resampled at each iteration.
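A minimal sketch of how the augmentation parameters might be sampled at each iteration is given below. The parameter ranges come from the text; the class and function names, the order in which contrast and brightness are applied, and the pivot value of 127.5 for 8-bit images are assumptions made only for illustration.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class AugmentParams:
    flip: bool
    scale: float
    brightness: float
    contrast: float
    crop_ratio: float


def sample_params(rng: random.Random) -> AugmentParams:
    """Draw one set of augmentation parameters using the ranges given in the text."""
    return AugmentParams(
        flip=rng.random() < 0.5,            # flip with probability 0.5
        scale=rng.uniform(0.8, 1.0),        # rescaling ratio
        brightness=rng.uniform(0.8, 1.2),   # brightness factor
        contrast=rng.uniform(0.8, 1.2),     # contrast factor
        crop_ratio=rng.uniform(0.9, 1.0),   # ROI crop ratio
    )


def flip_horizontal(image: List[List[float]]) -> List[List[float]]:
    """Mirror an image (nested lists of pixels) left to right."""
    return [row[::-1] for row in image]


def adjust_pixel(value: float, brightness: float, contrast: float, mid: float = 127.5) -> float:
    """Apply contrast around `mid`, then brightness, and clamp to [0, 255]."""
    out = ((value - mid) * contrast + mid) * brightness
    return min(255.0, max(0.0, out))


rng = random.Random(0)
params = [sample_params(rng) for _ in range(100)]  # a fresh draw per iteration
```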
Proposed diagnosis model. Our proposed diagnosis model, as illustrated in Fig. 5, can be divided into two phases: ROI detection and difficulty index classification. First, we find an ROI that includes the region of the mandibular third molar using the object detection model. The detection model outputs the coordinates

Table 2. Pederson scale used in this study for an evaluation of the difficulty of extraction. Level A: the occlusal surface of the mandibular third molar is at the same level as the occlusal surface of the mandibular second molar. Level B: the occlusal surface of the mandibular third molar is between the occlusal surface and the cemento-enamel junction of the mandibular second molar. Level C: the occlusal surface of the mandibular third molar is below the cemento-enamel junction of the mandibular second molar. Class 1: there is sufficient space between the mandibular ramus and mandibular second molar for the crown part of the mandibular third molar. Class 2: the space between the mandibular ramus and mandibular second molar is insufficient for the crown part of the mandibular third molar. Class 3: almost the entire crown of the mandibular third molar is impacted in the mandible.

1. ROI detection We used the SSD 18 as the ROI detection model. The size of the input image was downscaled from 1500 × 1500 to 512 × 512 because the original was too large to be used as an input. We used a VGG16 pretrained on the ImageNet dataset as the backbone network of the detection model. In addition, the mandibular third molar region, which is the ROI, has less variability in terms of scale and proportion than that of ordinary objects. Therefore, by removing the less useful default boxes, the overall computation and processing time could also be reduced. The aforementioned model aims to find a suitable region for a score evaluation. Thus, to train this model, the target data including the region information need to be identified.
In this study, we determined the suitable scope while simultaneously estimating the difficulty score.
2. Difficulty classification An image cropped from the ROI of the original image, as predicted by the detection model, is used as the input. We proposed applying a classification model because the scores did not match the gradual variation in the radiographic image. The backbone network of the classification model used in our study was a ResNet-34 21 pretrained on the ImageNet dataset. Feature maps extracted by the backbone network were

Training details. We used stochastic gradient descent as the optimizer with a learning rate of 0.01, weight decay of 0.9, mini-batch size of 32, and momentum of 0.9. We divided the learning rate by 10 for 250 iterations.
In addition, we used gradient clipping to ensure that the training remained stable. The detection model loss is the weighted sum between the localization loss and the confidence loss. The localization loss is Smooth L1 loss, and the confidence loss is SoftMax cross-entropy loss.
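The structure of the detection loss described above can be illustrated with the following simplified sketch for a single matched box. The weighting factor `alpha`, the `beta` threshold of the Smooth L1 loss, and the function names are assumptions; the full SSD objective also involves box matching, normalization over matched boxes, and hard negative mining, none of which are modeled here.

```python
import math
from typing import Sequence


def smooth_l1(pred: Sequence[float], target: Sequence[float], beta: float = 1.0) -> float:
    """Smooth L1 loss summed over box coordinates (quadratic near zero, linear beyond beta)."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total


def softmax_cross_entropy(logits: Sequence[float], label: int) -> float:
    """Cross-entropy of the softmax over logits, computed with the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def detection_loss(loc_pred, loc_target, logits, label, alpha: float = 1.0) -> float:
    """Weighted sum of the confidence loss and the localization loss for one box."""
    return softmax_cross_entropy(logits, label) + alpha * smooth_l1(loc_pred, loc_target)


# Made-up values: box offsets and two-class logits (background vs. third molar).
loss = detection_loss([0.5, 0.0, 2.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0], label=1)
```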

Statistical analysis.
A statistical analysis was conducted by calculating the accuracy, sensitivity, and specificity as listed in Table 1. In addition, the root mean square error (RMSE) between the predicted and true Pederson scores was calculated to analyze whether our proposed model was able to predict the extraction difficulty of mandibular third molars similarly to the experts. Given the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), the sensitivity and specificity were calculated using the following equations for each class.

$$\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}$$
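The accuracy and Cohen's kappa score ($\kappa$) were calculated as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \kappa = \frac{P_o - P_e}{1 - P_e}$$

where $P_o$ is the observed agreement, which is the same as the accuracy, and $P_e$ is the expected agreement, which is due to chance. In addition, $P_e$ is given by

$$P_e = \frac{(TP + FN)(TP + FP) + (FP + TN)(FN + TN)}{(TP + TN + FP + FN)^2}$$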
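As a worked illustration of these metrics, the sketch below computes them from a confusion matrix whose rows are true classes and whose columns are predicted classes. The function names and the example matrix are hypothetical, and the kappa computation uses the general multi-class form of the expected agreement, which reduces to the two-class expression when there are only two classes. Classes with no positive or no negative samples would cause a division by zero and are not handled here.

```python
from typing import List, Tuple

Matrix = List[List[int]]  # cm[true][predicted]


def per_class_counts(cm: Matrix, k: int) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN) for class k in a one-vs-rest view."""
    total = sum(map(sum, cm))
    tp = cm[k][k]
    fn = sum(cm[k]) - tp
    fp = sum(row[k] for row in cm) - tp
    tn = total - tp - fn - fp
    return tp, tn, fp, fn


def sensitivity(tp: int, fn: int) -> float:
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    return tn / (tn + fp)


def accuracy(cm: Matrix) -> float:
    return sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))


def cohen_kappa(cm: Matrix) -> float:
    """Kappa = (P_o - P_e) / (1 - P_e), with P_e from the row and column marginals."""
    total = sum(map(sum, cm))
    p_o = accuracy(cm)
    p_e = sum(sum(cm[k]) * sum(row[k] for row in cm) for k in range(len(cm))) / total ** 2
    return (p_o - p_e) / (1 - p_e)


cm = [[8, 2], [1, 9]]  # made-up example
tp, tn, fp, fn = per_class_counts(cm, 0)
```

For this made-up matrix the accuracy is 0.85, the expected agreement is 0.5, and kappa is therefore 0.7.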
We can also calculate the RMSE as follows:

$$\text{RMSE} = \sqrt{\frac{1}{N} \sum_{n=1}^{N} \left( \hat{y}_n - y_n \right)^2}$$

where $\hat{y}_n$ and $y_n$ are the predicted and true Pederson scores of the $n$-th sample, respectively, and $N$ is the number of samples.

Ethical approval and informed consent. This study was conducted in accordance with the guidelines of the World Medical Association Helsinki Declaration for biomedical research involving human subjects and was approved by the Institutional Review Board of Daejeon Dental Hospital, Wonkwang University (W2004/001-001). The IRB waived the need for individual informed consent, either written or verbal, from the participants owing to the non-interventional retrospective design of this study and because all data were analyzed anonymously.