Introduction

Cephalometric analysis is an essential tool for orthodontic diagnosis and for treatment planning in orthognathic surgery. The first step of cephalometric analysis is identifying cephalometric landmarks, a labour-intensive and time-consuming task even for well-trained orthodontists. In addition, cephalometric analysis suffers from two types of errors: projection errors caused by projecting a 3D object onto a 2D X-ray image, and identification errors caused by incorrect landmark identification, tracing, and measurement1,2,3. Among these, inconsistency in landmark identification may be the largest source of error4. Variation in landmark definitions, the bony complexity of the surrounding region, and the quality of the X-ray image can all affect the accuracy of landmark identification. Even after expert orthodontists received standardized training in landmark identification, inter-observer disagreement was inevitable5. To overcome these problems, several studies developed automated cephalometric analysis to reduce analysis time and improve the accuracy of landmark identification6,7. Various approaches to automate landmark identification have been proposed; however, they have not proved accurate enough for clinical use8,9,10.

Recently, deep learning with convolutional neural networks (CNNs) has achieved remarkable results in computer vision tasks and has been applied to classification, detection, and semantic segmentation in medical imaging11. Accordingly, automated landmark prediction has rapidly been adopted in cephalometric analysis12,13,14,15. To improve prediction performance within a restricted search area, various image analysis methods first preprocess the images to find regions of interest (ROIs)16. Similarly, for landmark detection, an approach that predicts a landmark after first detecting an ROI has also been proposed17.

All these studies reported that automatic landmark identification systems performed not only accurately but also quickly. However, the landmark prediction error within 2 mm reported in these studies may still be too large for clinical use. Some investigators divided the cephalometric X-ray image into small ROIs to increase the accuracy of automatic identification14,18. In addition, Hwang et al. compared human and automated landmark identification errors and reported that the automated system was more accurate19. However, the accuracy of the automated landmark prediction system was only comparable to that between different observers, given inter-observer variability, and was inferior to that of repeated trials by a single observer.

In this study, we propose a cascade network that first detects the ROI associated with each landmark using a region proposal network and then finds the exact position of the landmark within that ROI using a semantic segmentation network, mimicking how orthodontists determine cephalometric landmarks. This design could improve the robustness of landmark identification to the level of an expert orthodontist.

Materials and methods

Dataset

This retrospective study was conducted according to the principles of the Declaration of Helsinki and performed in accordance with current scientific guidelines. The study protocol was approved by the Institutional Review Board Committee of Seoul National University School of Dentistry and Seoul National University Dental Hospital, Seoul, Korea (S-D 2018010 and ERI 19007). The requirement for informed patient consent was waived by the same Institutional Review Board Committee. A total of 1000 consecutive lateral cephalometric X-ray images were acquired from 509 patients of the department of orthodontics at Kooalldam Dental Hospital between 2017 and 2018. All patients had permanent dentition without dentofacial deformity. Of these radiographs, 140 were from 140 patients who were about to start orthodontic treatment, and the other 860 were from 369 patients who had completed treatment. Although informed consent had been obtained from all patients, all personal information was removed. All cephalometric X-ray images are grayscale images of approximately 2 k × 3 k pixels with 8-bit depth, stored in Digital Imaging and Communications in Medicine (DICOM) format. Preserving the aspect ratio of the original images, all cephalometric X-ray images were resized to 700 × 1000 pixels, and pixel values were normalized to the range 0–1 by dividing by 255.0.
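As an illustration of this preprocessing, the following minimal sketch resizes a DICOM cephalogram to 700 × 1000 pixels and scales pixel values to 0–1; the use of pydicom and OpenCV and the function name are our assumptions, not the authors' released code.

```python
import numpy as np
import pydicom
import cv2

def load_and_preprocess(dicom_path):
    """Read a lateral cephalogram, resize it to 700 x 1000, and scale to [0, 1].

    Illustrative sketch only: the library choices and the 8-bit assumption follow
    the description in the text rather than the authors' implementation.
    """
    ds = pydicom.dcmread(dicom_path)
    image = ds.pixel_array.astype(np.float32)   # ~2k x 3k, 8-bit grayscale
    image = cv2.resize(image, (700, 1000))      # dsize is (width, height)
    return image / 255.0                        # pixel values in the range 0-1
```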

Landmark definition

All the images were traced by one orthodontist (JWP) with more than 20 years of clinical experience. Forty-two landmarks were traced, as shown in Fig. 1 and Table 1. Among them, 28 landmarks were selected from the hard tissue and 14 from the soft tissue. To evaluate intra-observer variability, twenty-eight images from the dataset were randomly selected and measured again by the same orthodontist (JWP).

Figure 1

An example case of the forty-two landmarks (numbered 0–41) in a lateral cephalometric X-ray image of approximately 2 k × 3 k pixels used in this study.

Table 1 The landmark names with corresponding numbers in Fig. 1.

The cascade network

Because cephalometric X-ray images are very large, finding the exact location of landmarks with a single simple deep learning model is very challenging. To overcome this issue, we propose a fully automated landmark prediction algorithm with a cascade network that improves prediction accuracy and reduces false-positive regions. Figure 2 shows a diagram of our proposed algorithm with the cascade network20. The proposed algorithm consists of two steps: (1) ROI detection and (2) landmark prediction. First, candidate ROIs with sizes that depend on each landmark are learned by an ROI detection network. The complexity of the area surrounding each landmark should be considered for more robust ROI detection, since expert orthodontists generally require a different range of view when identifying each landmark; accordingly, several ROI sizes were evaluated. Then, the exact location of each landmark is detected by a semantic segmentation network applied to the result of the preceding ROI detection.

Figure 2

The general schematic of our proposed algorithm for finding the exact location of landmarks with a cascade network. The proposed algorithm consists of two parts: ROI detection (upper part), which proposes the area of interest, and landmark prediction (lower part), which finds the exact location of each landmark.
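To make the two-step procedure described above concrete, the sketch below shows how an ROI detector and a segmentation model could be chained at inference time; roi_detector, seg_model, and their interfaces are illustrative placeholders, not the authors' code.

```python
import numpy as np

def predict_landmark(image, roi_detector, seg_model, roi_size):
    """Cascade inference: (1) detect the landmark-specific ROI, (2) segment
    the landmark inside it and return its position in image coordinates.

    roi_detector and seg_model stand in for the trained RetinaNet and U-Net;
    their call signatures here are assumptions made for illustration.
    """
    # Step 1: ROI detection -- centre (px, py) of the proposed patch
    px, py = roi_detector(image)

    # Crop the landmark-specific patch (256 or 512 square) around that centre
    half = roi_size // 2
    patch = image[py - half:py + half, px - half:px + half]

    # Step 2: semantic segmentation inside the patch (probability map)
    mask = seg_model(patch)

    # Landmark = centroid of the predicted circular mask, mapped back to the image
    ys, xs = np.nonzero(mask > 0.5)
    return px - half + xs.mean(), py - half + ys.mean()
```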

ROI detection

RetinaNet, a state-of-the-art CNN-based detection algorithm, was used to detect the ROIs21. RetinaNet is a one-stage detector that uses a feature pyramid network to train the model efficiently by extracting features from feature maps of various sizes. The dataset was split into training, validation, and test sets at a ratio of 8:1:1. For training, ROI patches centred on each landmark, marked with coordinates \((T_x, T_y)\), were extracted. The model was trained from scratch, given the relatively large dataset and to preserve the originality of our data. As shown in Fig. 3, a different ROI extent was proposed and evaluated for each landmark, similar to the range of view used by orthodontists. Two ROI sizes, 256 × 256 and 512 × 512, were evaluated.

Figure 3

Two sizes of ROIs in the cephalometric X-ray image. (a) ROIs of 256 × 256 and 512 × 512 pixels extracted for the landmarks. (b) Sella, nasion, and menton require a small 256 × 256 ROI (red), and (c) hinge, corpus, and Md6 root require a wide 512 × 512 ROI (blue).
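The training patches for the detector can be thought of as fixed-size crops centred on the annotated landmark \((T_x, T_y)\), with the patch size chosen per landmark as in Fig. 3. A minimal sketch follows; the landmark-to-size mapping and the clamping at the image border are our assumptions.

```python
import numpy as np

# Hypothetical landmark-to-ROI-size mapping, following the examples in Fig. 3
ROI_SIZE = {"sella": 256, "nasion": 256, "menton": 256,
            "hinge": 512, "corpus": 512, "md6_root": 512}

def extract_roi_patch(image, tx, ty, size):
    """Crop a size x size training patch centred on the annotated landmark (tx, ty)."""
    half = size // 2
    h, w = image.shape
    # Clamp the patch to the image bounds so landmarks near the border stay usable
    x0 = int(np.clip(tx - half, 0, w - size))
    y0 = int(np.clip(ty - half, 0, h - size))
    patch = image[y0:y0 + size, x0:x0 + size]
    return patch, (tx - x0, ty - y0)   # patch and landmark position within it
```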

Various augmentation methods, including Gaussian noise, random brightness, blurring, random contrast, flipping, and random rotation, were used to train the detection model. The Adam optimizer and focal loss were used, and the accuracy of the ROI detection model was expressed as the Euclidean distance between the ground-truth centre \((T_x, T_y)\) and the centre \((P_x, P_y)\) of the ROI patch predicted by the detection model.
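This evaluation metric reduces to a Euclidean distance between the ground-truth and predicted centres; a short sketch is given below, where the pixel-to-millimetre calibration factor is an assumed parameter.

```python
import numpy as np

def roi_distance_error(t_xy, p_xy, mm_per_pixel=1.0):
    """Euclidean distance between the ground-truth centre (Tx, Ty) and the
    predicted ROI centre (Px, Py); mm_per_pixel is an assumed calibration."""
    t = np.asarray(t_xy, dtype=float)
    p = np.asarray(p_xy, dtype=float)
    return float(np.linalg.norm(t - p)) * mm_per_pixel
```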

Landmark prediction

Because the first (ROI detection) model was trained independently, separate datasets were generated for the second model. The second model, a U-Net11, was used to find the exact location of each landmark within the ROI patch obtained from the first model. Two models, one for the small ROIs (256 × 256) and one for the large ROIs (512 × 512), were trained independently. To account for the ROI detection's mean distance error \((D_x, D_y)\), the centre of each ROI patch was represented as \((|T_x - D_x|, |T_y - D_y|)\) instead of \((T_x, T_y)\), and the patches were extracted accordingly. Circular segmentation labels of diameter \(d\) were generated at the centre of each ROI. If the diameter \(d\) is too small, information may be lost during the CNN's training process; conversely, a larger \(d\) leads to a greater prediction error of the model. Through several experiments, the most appropriate \(d\) was empirically determined to be 50 pixels.
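A circular label of this kind can be generated as a binary disk centred on the landmark position within the ROI patch, as in the sketch below (the function name and array layout are illustrative).

```python
import numpy as np

def circular_label(patch_size, cx, cy, d=50):
    """Binary segmentation label: a disk of diameter d (50 px, the value chosen
    empirically in the text) centred on the landmark position (cx, cy)."""
    ys, xs = np.mgrid[0:patch_size, 0:patch_size]
    return ((xs - cx) ** 2 + (ys - cy) ** 2 <= (d / 2.0) ** 2).astype(np.float32)
```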

Various augmentation methods such as Gaussian noise, random brightness, blurring, random contrast, flipping, and random rotation were used to train the segmentation model. Adam was used as the optimizer; in both networks, the learning rate was initially set to 0.0001 and then decreased by a factor of 10 whenever the validation accuracy stopped improving. In total, the learning rate was decreased three times before training ended. The Dice similarity coefficient (DSC) was used both as the loss function and to measure model performance. As an ablation study to evaluate the effectiveness of the first ROI detection step, three models (without ROI detection, with fixed-size ROI detection, and with variable-size ROI detection) were compared using the average distance errors over all landmarks.
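For reference, the DSC and the corresponding training loss can be written as in the following sketch (a NumPy formulation assuming binary masks; the smoothing constant is our addition to avoid division by zero).

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-6):
    """Dice similarity coefficient between a predicted mask and the circular label."""
    intersection = np.sum(pred * target)
    return (2.0 * intersection + eps) / (np.sum(pred) + np.sum(target) + eps)

def dice_loss(pred, target):
    """Training loss for the segmentation network: 1 - DSC."""
    return 1.0 - dice_coefficient(pred, target)
```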

Statistics analysis

The accuracy of ROI detection was evaluated by the distance between the predicted and ground-truth ROI centres. Statistical comparisons between the models (without ROI detection, with fixed-size ROI detection, and with variable-size ROI detection) were carried out to determine whether their performances differed significantly. Two-sided paired t-tests were performed to compare the landmark prediction accuracy of the three models. The significance level was set at 0.05 (p < 0.05). To compare the reproducibility of landmark prediction between the cascade model and an expert orthodontist, 28 cephalometric X-ray images from 28 patients were randomly selected and manually measured twice by the orthodontist with an interval of 6 months. Differences in landmark positions between the two trials were calculated as reproducibility and compared with those of the deep learning model. All statistical evaluations were performed with MedCalc (MedCalc Software, Ostend, Belgium), version 19.1.3.
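The study performed these tests in MedCalc; purely for illustration, an equivalent two-sided paired t-test on paired distance errors could be expressed as follows (scipy is our substitution, not the software actually used).

```python
from scipy import stats

def compare_models(errors_a, errors_b, alpha=0.05):
    """Two-sided paired t-test on paired distance errors from two models.

    Illustrative only: the study itself performed the tests in MedCalc.
    """
    t_stat, p_value = stats.ttest_rel(errors_a, errors_b)
    return t_stat, p_value, p_value < alpha
```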

Results

ROI detection

Figure 4 shows the results of ROI detection and landmark prediction with different ROI sizes depending on the information required for each landmark. Landmarks with small 256 × 256 ROIs (red boxes), including the sella, nasion, and menton, and landmarks with large 512 × 512 ROIs (blue boxes), including the A-point, porion, and left corpus, were predicted by the RetinaNet algorithm. Based on these ROIs, patches were extracted as input to the semantic segmentation network (U-Net) for landmark prediction. The mean and standard deviation of the distance errors between the predicted centres of these ROIs and the ground truth over all landmarks were 1.55 ± 2.17 mm (Table 2).

Figure 4

Region of interest (ROI) detection and landmark prediction results with different ROI sizes depending on the information required for each landmark. (a) ROIs (red and blue boxes) predicted by the RetinaNet algorithm, (b) ROI patches used as input to the semantic segmentation network for landmark prediction, and (c) ground-truth masks from the test set for each landmark.

Table 2 Comparisons of distance error (mean ± STD, unit: mm) between predictions of the four different networks and the ground truth in the test set.

Landmark prediction

The means and standard deviations of the distance errors for the experiments with and without ROI detection of fixed and variable sizes are listed in Table 2. Landmark prediction with variable-size ROI detection shows the best accuracy of all models (Table 3). In the landmark-based analyses, the distance error of each landmark predicted by the model without ROI detection and by the model with variable-size ROI detection is compared in Table 4. Approximately 55% of landmarks predicted with variable-size ROI detection showed significantly better accuracy. To validate our model, we also conducted comparative experiments with previous methods15,22. The proposed model performs significantly better than the previous models, including Mask R-CNN. In addition, after evaluating various patch sizes and U-Net depths, a U-Net with variable patch size and a depth of 5 was selected based on the experimental results (Fig. 5).

Table 3 Comparisons of distance error (mean ± STD, unit: mm) using ninefold cross validation.
Table 4 Comparisons of distance error (Mean ± STD, unit: mm) of each landmark between prediction without the ROI detection and with the ROI detection of variable size.
Figure 5

Experimental results of our proposed model. (a) shows the highest accuracy and (b) the lowest accuracy in cephalometric X-ray images (red point, landmark predicted by deep learning; green point, ground truth).

To avoid overfitting, ninefold cross-validation was conducted; the results are shown in Table 3.

Comparison with reproducibility of an expert orthodontist

To measure the reproducibility of landmark identification by an expert orthodontist, 28 cephalometric X-ray images from 28 patients were randomly selected and manually measured again by the orthodontist after 6 months. Differences in landmark positions between the two trials were calculated as reproducibility and compared with the different models. The orthodontist's mean reproducibility over all 42 landmarks was 0.80 ± 0.79 mm, and the reproducibility of each landmark is listed in Table 5, which shows accuracies very similar to those of landmark prediction with variable-size ROI detection.

Table 5 Comparisons of distance error (mean ± STD, unit: mm) between the first and second labelling.

Discussion

Cephalometric X-ray images provide orthodontists with important information for determining orthodontic and maxillofacial surgery treatment options. However, the quality of cephalometric analysis depends on the accuracy of landmark delineation, which is vulnerable to inter- and intra-observer variation. In addition, the large number of landmarks requires orthodontists to spend considerable time on each patient's analysis, leading to fatigue. The present study introduces a new algorithm that increases CNN performance in cephalometric landmark identification in a fully automated manner.

The original cephalometric X-ray images are too large, and irrelevant information could prevent precise landmark prediction with a single network. Therefore, in this study we proposed a cascade CNN that splits the task into two steps. In the first step, the ROIs were predicted using RetinaNet. In the second step, U-Net was used to detect the precise landmark within each ROI containing the relevant information, which significantly enhanced the overall landmark prediction accuracy compared with the other methods (Table 2). Furthermore, we demonstrated superior performance over recent regression-based models22 and single detection models15.

In general, orthodontists need a variable field of view to identify each landmark, which motivated training the model with variable-sized ROIs. Matching the ROI size of each landmark to the orthodontist's field of view was more effective for landmark identification. In addition, compared with the orthodontist's intra-observer variation, this method shows substantially better consistency, indicating robust accuracy.

Previous studies investigated a limited number (< 20) of hard tissue landmarks, and the results were not satisfactory for clinical orthodontic practice12,13. Recently, Hwang et al. reported the accuracy of 42 landmarks, including 23 hard tissue and 19 soft tissue landmarks16. However, that study did not cover all the landmarks needed for hard tissue, soft tissue, and occlusal plane analysis. With the results of this model, we can perform occlusal plane analysis as well as hard and soft tissue analyses.

This study has several limitations. First, it was evaluated only with a dataset from a single centre annotated by a single observer. Therefore, we need to extend this study with datasets from multiple centres, vendors, and observers. We suspect that the high quality of the gold standard produced by a single observer contributed to the accuracy of our model being comparable to that of an expert orthodontist. In addition, this study could suffer from bias in disease prevalence, partly because the data came from a single centre. Therefore, we also need to test our model in varied clinical settings, such as maxillofacial surgery and plastic surgery, which likewise require automated cephalometric analysis.

Conclusion

In this paper, we proposed connecting two different models in a cascade manner to develop a fully automated landmark prediction model for cephalometric X-ray images. The cascaded CNN with variable ROI size shows significantly better accuracy than the other models, is comparable to an expert orthodontist with more than 20 years' experience, and could be applied in actual clinical practice.