Comparing intra-observer variation and external variations of a fully automated cephalometric analysis with a cascade convolutional neural net

The quality of cephalometric analysis depends on the accuracy of landmark delineation in orthodontic and maxillofacial surgery. Because of the large number of landmarks, each analysis costs orthodontists considerable time per patient, leading to fatigue and to inter- and intra-observer variability. We therefore proposed a fully automated cephalometric analysis with a cascade convolutional neural network (CNN). One thousand cephalometric X-ray images (2 k × 3 k pixels) were used. The dataset was split into training, validation, and test sets at a ratio of 8:1:1. The 42 landmarks in each image were identified by an expert orthodontist. To evaluate intra-observer variability, 28 images from the dataset were randomly selected and measured again by the same orthodontist. To improve accuracy, a cascade CNN consisting of two steps was used for transfer learning. In the first step, the regions of interest (ROIs) were predicted by RetinaNet. In the second step, U-Net detected the precise landmarks in the ROIs. The average error of ROI detection alone was 1.55 ± 2.17 mm. The model with the cascade CNN showed an average error of 0.79 ± 0.91 mm (paired t-test, p = 0.0015). The orthodontist's average reproducibility error was 0.80 ± 0.79 mm. An accurate and fully automated cephalometric analysis was successfully developed and evaluated.

www.nature.com/scientificreports/ analysis [12][13][14][15]. To improve prediction performance within the search area, various image analysis methods have been proposed that first preprocess the images to find the regions of interest (ROIs) 16. Similarly, for landmark detection, a method that predicts a landmark after detecting an ROI has been proposed 17. All these studies claimed that automatic landmark identification was not only accurate but also fast. However, a landmark prediction error of up to 2 mm, as reported in these studies, may be too large for clinical use. Some investigators divided the cephalometric X-ray image into small ROIs to increase the accuracy of automatic identification 14,18. In addition, Hwang et al. compared the prediction errors of human and automated landmark identification and reported that the automated system was more accurate 19. However, the accuracy of the automated system was only comparable to the agreement between different users (inter-observer variability) and was inferior to the agreement between repeated trials by a single user (intra-observer variability).
In this study, we proposed a cascade network that first detects the ROI related to each landmark with a region proposal network and then finds the exact position of the landmark within that ROI with a semantic segmentation network, much as orthodontists do when determining cephalometric landmarks. This approach could improve the robustness of landmark identification to the level of an orthodontist.

Materials and methods
Dataset. This retrospective study was conducted according to the principles of the Declaration of Helsinki and in accordance with current scientific guidelines. The study protocol was approved by the Institutional Review Board Committee of Seoul National University School of Dentistry and Seoul National University Dental Hospital, Seoul, Korea (S-D 2018010 and ERI 19007). The requirement for informed patient consent was waived by the same Institutional Review Board Committee. A total of 1000 consecutive lateral cephalometric X-ray images were acquired from 509 patients at the department of orthodontics in Kooalldam dental hospital from 2017 to 2018. All patients had permanent dentition without dentofacial deformity. Of the radiographs, 140 were from 140 patients who wanted to start orthodontic treatment, and the other 860 were from 369 patients who had completed treatment. Although informed consent was received from all the patients, all personal information was deleted. All cephalometric X-ray images are grayscale images of 2 k × 3 k pixels with 8-bit depth, stored in the Digital Imaging and Communications in Medicine (DICOM) file format. To preserve the aspect ratio of the original images, all cephalometric X-ray images were resized to 700 × 1000 pixels, and pixel values were normalized to the range 0-1 by dividing by 255.0.
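The preprocessing described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the interpolation method is not stated in the text, so nearest-neighbour sampling is assumed, 700 is assumed to be the width and 1000 the height, and all function names are hypothetical. Landmark coordinates must be rescaled by the same factors as the image.

```python
def resize_nearest(image, new_h, new_w):
    """Resize a 2-D list of pixel values by nearest-neighbour sampling (assumed method)."""
    old_h, old_w = len(image), len(image[0])
    return [
        [image[r * old_h // new_h][c * old_w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]


def normalize(image):
    """Scale 8-bit pixel values into the range 0-1 by dividing by 255.0."""
    return [[p / 255.0 for p in row] for row in image]


def scale_landmark(x, y, old_w, old_h, new_w, new_h):
    """Map a landmark from original to resized pixel coordinates."""
    return x * new_w / old_w, y * new_h / old_h


# Toy example: a 2 x 2 image enlarged to 4 x 4, then normalized.
tiny = [[0, 255], [128, 64]]
resized = normalize(resize_nearest(tiny, 4, 4))
# A landmark at the centre of a 2000 x 3000 image, mapped to 700 x 1000.
centre = scale_landmark(1000, 1500, 2000, 3000, 700, 1000)
```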

Landmark definition. All the images were traced by one orthodontist (JWP) with more than 20 years of clinical experience. Forty-two landmarks were traced, as shown in Fig. 1 and Table 1. Of these, 28 were hard tissue landmarks and 14 were soft tissue landmarks. To evaluate intra-observer variability, 28 images from the dataset were randomly selected and measured again by the same orthodontist (JWP).
The cascade network. Because the cephalometric X-ray image is very large, finding the exact location of landmarks with a single, simple deep learning model is very challenging. To overcome this issue, we proposed a fully automated landmark prediction algorithm with a cascade network to improve prediction accuracy and reduce false-positive regions. Figure 2 shows a diagram of the proposed algorithm with the cascade network 20. The algorithm consists of two steps: (1) ROI detection and (2) landmark prediction. First, an ROI detection network was trained to propose candidate ROIs whose size depends on the landmark. The complexity of the area surrounding each landmark should be considered for more robust ROI detection, and expert orthodontists generally need a different range of view to identify each landmark. With these considerations in mind, various ROI sizes were evaluated. Then, the exact location of each landmark was detected by a semantic segmentation network within the ROI produced by the preceding detection step.
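The two-step flow can be illustrated with a small sketch. It is not the authors' code: `detect_roi` and `locate_in_patch` are hypothetical stand-ins for the trained detection and segmentation networks, and the image is assumed to be at least as large as the ROI. The point it shows is that the second step works in patch coordinates, so its output must be shifted by the patch origin to recover full-image coordinates.

```python
def cascade_predict(image, detect_roi, locate_in_patch, roi_size):
    """Two-step prediction: coarse ROI centre, then fine landmark inside the patch.

    Assumes the image is at least roi_size pixels in both dimensions.
    """
    img_h, img_w = len(image), len(image[0])
    cx, cy = detect_roi(image)  # step 1: coarse ROI centre
    # Clamp the patch so that it stays inside the image.
    x0 = min(max(int(cx) - roi_size // 2, 0), img_w - roi_size)
    y0 = min(max(int(cy) - roi_size // 2, 0), img_h - roi_size)
    patch = [row[x0:x0 + roi_size] for row in image[y0:y0 + roi_size]]
    px, py = locate_in_patch(patch)  # step 2: position inside the patch
    return x0 + px, y0 + py  # back to full-image coordinates


def brightest_pixel(patch):
    """Stand-in for the segmentation step: (x, y) of the largest value."""
    best = max((v, x, y) for y, row in enumerate(patch) for x, v in enumerate(row))
    return best[1], best[2]


# Toy demo with stand-in models: a 10 x 10 image with one bright pixel.
image = [[0] * 10 for _ in range(10)]
image[3][7] = 1  # true landmark at x=7, y=3
coarse = lambda img: (6, 4)  # pretend detector output, slightly off target
result = cascade_predict(image, coarse, brightest_pixel, roi_size=4)
```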
ROI detection. RetinaNet, a state-of-the-art CNN-based detection algorithm, was used to detect ROIs 21. RetinaNet is a one-stage detector that uses a feature pyramid network to train the model efficiently by extracting features from feature maps of various sizes. The dataset was split into training, validation, and test sets at a ratio of 8:1:1. For training, ROI patches centred on the landmark coordinates (Tx, Ty) were extracted. The model was trained from scratch because the dataset was relatively large and in order to preserve the characteristics of our dataset. As shown in Fig. 3, a different ROI range was proposed and evaluated for each landmark, similar to the orthodontists' field of view. Two ROI sizes, 256 × 256 and 512 × 512, were evaluated.
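A minimal sketch of the 8:1:1 split and of building a training box around a landmark is given below. Several points are assumptions rather than details from the text: the split is done per image with a fixed random seed, and the function names are hypothetical. The per-landmark sizes in `ROI_SIZE` list only the examples named in the Results (Fig. 4).

```python
import random

# ROI side length per landmark; only the examples named in the Results are listed.
ROI_SIZE = {
    "sella": 256, "nasion": 256, "menton": 256,
    "a-point": 512, "porion": 512, "corpus left": 512,
}


def split_indices(n, seed=0):
    """Shuffle n sample indices and split them 8:1:1 into train/val/test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train, n_val = n * 8 // 10, n // 10
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def roi_box(tx, ty, size):
    """Square box (x0, y0, x1, y1) of side `size` centred on landmark (tx, ty)."""
    half = size // 2
    return tx - half, ty - half, tx + half, ty + half


train, val, test = split_indices(1000)
box = roi_box(350, 500, ROI_SIZE["sella"])
```

Because several radiographs in this dataset come from the same patient, a patient-level split may be preferable in practice; the text does not say which was used.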
Various augmentation methods, including Gaussian noise, random brightness, blurring, random contrast, flip, and random rotation, were used to train the detection model. The Adam optimizer and focal loss were used. The accuracy of the ROI detection model was expressed as the Euclidean distance between the ground-truth centre point (Tx, Ty) and the centre of the ROI patch (Px, Py) predicted by the ROI detection model.
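The distance metric and the focal loss can be written out as a short sketch. The pixel spacing `mm_per_pixel` is a hypothetical parameter, since the conversion from pixels to millimetres is not given in the text. The focal loss follows the standard binary form, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t); the defaults alpha = 0.25 and gamma = 2 are those commonly used with RetinaNet and are assumed here, not taken from this study.

```python
import math


def distance_error_mm(true_xy, pred_xy, mm_per_pixel):
    """Euclidean distance between (Tx, Ty) and (Px, Py), converted to millimetres."""
    dx = pred_xy[0] - true_xy[0]
    dy = pred_xy[1] - true_xy[1]
    return math.hypot(dx, dy) * mm_per_pixel


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction p in (0, 1) with label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # The (1 - p_t) ** gamma factor down-weights examples that are already easy.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


err = distance_error_mm((0, 0), (3, 4), mm_per_pixel=0.1)
```

With gamma set to 0 and alpha set to 1, the expression reduces to ordinary cross-entropy, which is a convenient sanity check.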
Landmark prediction. Because the first model (ROI detection) was trained independently, separate datasets were generated for the second model. The second model, U-Net 11, was used to find the exact location of each landmark within the ROI patch obtained from the first model. Two models, one with small ROIs (256 × 256) and one with large ROIs (512 × 512), were trained independently. The centre of ROI patches was represented as [...] Various augmentation methods, such as Gaussian noise, random brightness, blurring, random contrast, flip, and random rotation, were used to train the segmentation model. Adam was used as the optimizer. In both networks, the learning rate was initially set to 0.0001 and was then decreased by a factor of 10 whenever the validation accuracy stopped improving. Training ended after the learning rate had been decreased three times. The Dice similarity coefficient (DSC) was used both as the loss function and as the measure of model performance.
For the ablation study evaluating the effectiveness of the initial ROI detection, three models (without ROI detection, with ROI detection of fixed size, and with ROI detection of variable size) were compared using the average distance error over all landmarks.
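The Dice similarity coefficient and the learning-rate schedule described above can be sketched as follows. This is an illustration under stated assumptions: masks are flat lists of 0/1 values, the smoothing term `eps` and the `patience` (number of non-improving epochs tolerated before a decrease) are hypothetical, and the class name is invented for this example.

```python
def dice_coefficient(pred, target, eps=1e-7):
    """Dice similarity coefficient between two binary masks (flat lists of 0/1)."""
    inter = sum(p * t for p, t in zip(pred, target))
    return (2.0 * inter + eps) / (sum(pred) + sum(target) + eps)


def dice_loss(pred, target):
    """Loss to minimise: 1 - DSC."""
    return 1.0 - dice_coefficient(pred, target)


class PlateauSchedule:
    """Divide the learning rate by 10 when validation accuracy stops improving;
    signal the end of training after the third decrease."""

    def __init__(self, lr=1e-4, patience=5, max_drops=3):
        self.lr, self.patience, self.max_drops = lr, patience, max_drops
        self.best, self.stale, self.drops = float("-inf"), 0, 0

    def step(self, val_acc):
        """Update with one epoch's validation accuracy; return True to keep training."""
        if val_acc > self.best:
            self.best, self.stale = val_acc, 0
            return True
        self.stale += 1
        if self.stale >= self.patience:
            self.lr /= 10.0
            self.drops += 1
            self.stale = 0
        return self.drops < self.max_drops


dsc = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])
sched = PlateauSchedule(patience=1)
history = [sched.step(0.5) for _ in range(4)]
```

Deep learning frameworks provide equivalent plateau-based schedulers; the hand-written version here only makes the stopping rule explicit.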
Statistical analysis. The accuracy of ROI detection was evaluated by the distance between the predicted ROI centres and the ground-truth ROI centres. Statistical comparisons between the models (ROI detection only, without ROI detection, and with ROI detection using fixed and variable sizes) were carried out to determine whether the differences in performance were significant. Two-sided paired t-tests were performed to compare the landmark prediction accuracy of the three models. The significance level was set at 0.05 (p < 0.05). To compare the reproducibility of the cascade model with that of the expert orthodontist, a total of 28 cephalometric X-ray images from 28 patients were randomly selected and manually measured again by the orthodontist after an interval of 6 months. Differences in the landmark positions between the two trials were calculated as the reproducibility, which was compared with the error of the deep learning model. All statistical evaluations were performed with MedCalc (MedCalc Software, Ostend, Belgium) version 19.1.3.
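For readers who want to reproduce the comparison outside MedCalc, the paired t statistic can be computed directly from per-landmark errors of two models. The sketch below uses only the Python standard library and made-up numbers, not data from this study; the function name is hypothetical. It returns the statistic and degrees of freedom; the two-sided p-value then follows from the Student t distribution, and a routine such as SciPy's `scipy.stats.ttest_rel` would return it directly.

```python
import math
import statistics


def paired_t_statistic(errors_a, errors_b):
    """t statistic and degrees of freedom for paired samples of equal length."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    # Mean difference divided by its standard error (sample SD / sqrt(n)).
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Illustrative (invented) distance errors in mm for the same four landmarks.
without_roi = [1.0, 2.0, 3.0, 4.0]
with_roi = [0.5, 1.0, 2.5, 3.0]
t_value, dof = paired_t_statistic(without_roi, with_roi)
```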

Results
ROI detection. Figure 4 shows the results of ROI detection and landmark prediction with ROI sizes chosen according to the information required to predict each landmark. RetinaNet predicted small ROIs of 256 × 256 (red boxes) for landmarks including the sella, nasion, and menton, and large ROIs of 512 × 512 (blue boxes) for landmarks including A-point, porion, and corpus left. Based on these ROIs, patches were extracted as input to the semantic segmentation network (U-Net), which predicted each landmark. The mean and standard deviation of the distance errors between the predicted ROI centres and the ground truth over all landmarks were 1.55 ± 2.17 mm (Table 2).
Landmark prediction. The mean and standard deviation of the distance errors with and without ROI detection of fixed and variable sizes are listed in Table 2. Landmark prediction with ROI detection of variable size showed the best accuracy of all models (Table 3). In the landmark-based analysis, the distance errors of each landmark predicted by the two models, without ROI detection and with ROI detection of variable size, are compared in Table 4. Approximately 55% of landmarks were predicted with significantly better accuracy when ROI detection of variable size was used. To validate our model, we also conducted comparative experiments with previous methods 15,22. The proposed model showed significantly better performance than the previous models, including Mask R-CNN. In addition, after comparing various patch sizes and U-Net depths, a U-Net model with variable patch size and a depth of 5 was selected based on the experimental results (Fig. 5).
To check for overfitting, ninefold cross-validation was conducted (Table 3).
Comparison with reproducibility of an expert orthodontist. To measure the reproducibility of landmark identification by an expert orthodontist, a total of 28 cephalometric X-ray images from 28 patients were randomly selected and manually measured again by the orthodontist after 6 months. Differences in the landmark positions between the two trials were calculated as the reproducibility and were compared with the errors of the different models. Over all 42 landmarks, the orthodontist's reproducibility (mean ± standard deviation of the distance error) was 0.80 ± 0.79 mm. The per-landmark values are listed in Table 5 and are very similar to the accuracies of landmark prediction with ROI detection of variable size.
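The reproducibility figure is the mean and standard deviation of the distances between the first and second tracings of the same landmarks. A minimal sketch follows, with invented coordinates, a hypothetical pixel spacing `mm_per_pixel`, and the sample standard deviation assumed (the text does not state whether the sample or population form was used).

```python
import math
import statistics


def reproducibility(first, second, mm_per_pixel):
    """Mean and sample SD (mm) of distances between two tracings of the same landmarks."""
    d = [math.dist(p, q) * mm_per_pixel for p, q in zip(first, second)]
    return statistics.mean(d), statistics.stdev(d)


# Invented pixel coordinates for three landmarks traced twice.
first_label = [(0, 0), (10, 10), (5, 5)]
second_label = [(3, 4), (10, 10), (5, 15)]
mean_mm, sd_mm = reproducibility(first_label, second_label, mm_per_pixel=0.1)
```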

Discussion
Cephalometric X-ray images provide orthodontists with important information for choosing among orthodontic and maxillofacial surgery treatment options. However, the quality of cephalometric analysis depends on the accuracy of landmark delineation, which is vulnerable to inter- and intra-observer variation. In addition, the large number of landmarks requires orthodontists to spend considerable time on each patient's analysis, leading to fatigue. The present study introduces a new algorithm that increases CNN performance in fully automated cephalometric landmark identification. The original cephalometric X-ray images are very large, and irrelevant information can prevent a single network from predicting landmarks precisely. Therefore, we proposed a two-step cascade CNN used in a transfer-learning manner. In the first step, the ROIs were predicted using RetinaNet. In the second step, U-Net was used to detect the precise landmarks within those ROIs, which contain only the relevant information; this significantly improved the overall accuracy of landmark prediction compared with the other methods (Table 2). Furthermore, we demonstrated superior performance over recent regression-based models 22 and single detection models 15.
In general, orthodontists need a variable field of view to detect each landmark, which motivated training the model with variable-sized ROIs. To identify a landmark, it was more effective to match the ROI size of that landmark to the field of view of the orthodontist. In addition, this method shows substantially better intra-observer variation compared to the orthodontist, meaning that this method shows robust accuracy.
Previous studies investigated a limited number (< 20) of hard tissue landmarks, and the results may not be satisfactory for clinical orthodontic practice 12,13. Recently, Hwang et al. reported the accuracy of 42 landmarks, including 23 hard tissue landmarks and 19 soft tissue landmarks 16. However, that study did not consider all possible landmarks for hard tissue analysis, soft tissue analysis, and occlusal plane analysis. With the results of our model, the occlusal plane can be analysed in addition to the hard and soft tissue.
This study has several limitations. First, it was evaluated only with a dataset from a single centre and a single observer. Therefore, we need to extend this study with datasets from multiple centres, multiple vendors, and multiple observers. We suspect that the high quality of the gold standard used for training, produced by a single observer, is what makes the accuracy of our model comparable to that of an expert orthodontist. In addition, this study could be affected by disease prevalence, partly because it was conducted at a single centre. Therefore, we also need to test our model in varied clinical settings, such as maxillofacial surgery and plastic surgery, which also need automated cephalometric analysis.

Conclusion
In this paper, we propose the idea of connecting two different models in a cascade manner to develop a fully automated landmark prediction model for cephalometric X-ray images. The model with the cascade CNN with variable ROI size shows significantly better accuracy than the other models, is comparable to an expert orthodontist with more than 20 years' experience, and could be applied in actual clinical practice.

Table 5. Comparisons of distance error (mean ± STD, unit: mm) between first label and second label. Landmarks measured for the first time in 28 X-rays were called first labels, and landmarks measured for the same patient after 6 months were called second labels.