Canal-Net for automatic and robust 3D segmentation of mandibular canals in CBCT images using a continuity-aware contextual network

The purpose of this study was to propose a continuity-aware contextual network (Canal-Net) for the automatic and robust 3D segmentation of the mandibular canal (MC) with consistently high accuracy throughout the entire MC volume in cone-beam CT (CBCT) images. The Canal-Net was designed based on a 3D U-Net with bidirectional convolutional long short-term memory (ConvLSTM) under a multi-task learning framework. Specifically, the Canal-Net learned the 3D anatomical context of the MC by incorporating spatio-temporal features from the ConvLSTM, and, in a complementary manner, the structural continuity of the overall MC volume through multi-planar projection losses under the multi-task learning framework. The Canal-Net showed higher segmentation accuracies in 2D and 3D performance metrics (p < 0.05), and in particular a significant improvement in Dice similarity coefficient scores and mean curve distance (p < 0.05) throughout the entire MC volume compared with other popular deep learning networks. As a result, the Canal-Net achieved consistently high accuracy in 3D segmentation of the entire MC despite areas of low visibility caused by unclear and ambiguous cortical bone layers. Therefore, the Canal-Net demonstrated automatic and robust 3D segmentation of the entire MC volume by improving the structural continuity and boundary details of the MC in CBCT images.


Materials and methods
Data acquisition and preparation. We included 50 patients (27 females and 23 males; mean age 25.56 ± 6.73 years) who underwent dental implant surgeries or third molar extractions at the Seoul National University Dental Hospital (2019-2020). The patients had different mandibular canal shapes with various dental conditions including the metallic crowns and implants. The patient data were obtained at 80 kVp and 8 mA using CBCT (CS9300; Carestream Health, New York, USA). The CBCT images had dimensions of 841 × 841 × 289 pixels, voxel sizes of 0.2 × 0.2 × 0.2 mm 3 , and 16-bit depth. This study was performed with approval from the institutional review board of the Seoul National University Dental Hospital (ERI18001). The ethics committee approved the waiver for the informed consent because this was a retrospective study. The study was performed in accordance with the Declaration of Helsinki.
The mandibular canals, including the surrounding cortical bone, were manually annotated by an oral and maxillofacial radiologist using software (3D Slicer for Windows 10, Version 4.10.2; MIT, Massachusetts, USA)30. To reduce the memory requirement, we used cropped images consisting of 200 slices of 128 × 128 pixels centered at the left and right mandibular regions. Zero-padding was performed to maintain the same input volume length for all patients, whose mandibular canals had different lengths. For deep learning, we prepared 60 volumes from 30 patients for the training dataset, 20 from ten patients for the validation dataset, and 20 from ten patients for the test dataset, where the right mandible images were horizontally flipped to match the left. We performed five-fold cross-validation, where each training cycle consisted of 60, 20, and 20 volumes for the training, validation, and test datasets, respectively.
We estimated the minimum sample size required to detect significant differences in accuracy between the Canal-Net and the other networks when both assessed the same subjects (CBCT images). The study was designed to detect a mean accuracy difference of 0.05 with a standard deviation of 0.10 between the Canal-Net and the other networks. Based on an effect size of 0.5, a significance level of 0.05, and a statistical power of 0.80, we obtained a sample size of N = 128 (G*Power for Windows 10, Version 3.1.9.7; Universität Düsseldorf, Germany). Eventually, we split the CBCT dataset of 2D images into 10,185, 2546, and 3183 images for the training, validation, and test datasets, respectively.
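The reported sample size can be reproduced with a standard power calculation. The sketch below, assuming the two-sided independent two-sample t-test that G*Power applies for an effect size of d = 0.5 (an assumption on our part; the exact G*Power test family is not stated above), uses statsmodels instead of G*Power:

```python
import math

from statsmodels.stats.power import TTestIndPower

# Power analysis mirroring the reported G*Power settings:
# effect size d = 0.5, alpha = 0.05, power = 0.80, two-sided test.
analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.5, alpha=0.05, power=0.80, alternative="two-sided"
)

# Round up to whole subjects per group, then total over both groups.
n_total = 2 * math.ceil(n_per_group)
print(n_total)  # → 128, matching the sample size reported above
```

Under these assumptions the computation yields 64 subjects per group, i.e. a total of N = 128, consistent with the value reported in the text.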

Continuity-aware contextual network (Canal-Net).
We designed a continuity-aware contextual network (Canal-Net) with a 3D encoder-decoder architecture under a multi-task learning framework consisting of time-distributed convolution blocks, multi-scale inputs31, skip connections, and bidirectional convolutional LSTM (ConvLSTM) with side-output layers31,32 (Fig. 1). The bidirectional ConvLSTM was used to capture anatomical context information in the concatenated feature maps extracted from the corresponding encoding path and the previous decoding up-sampling layer. A multi-task learning approach was adopted to simultaneously output the entire MC volume and its 2D multi-planar projections in the three anatomical planes, which helped the network learn the overall MC volume and its structural continuity (multi-planar projection outputs and the output volume in Fig. 1). The network was optimized in an end-to-end manner, where the MC segmentation output was generated directly from the input volumes of the CBCT images. At the encoder, the time-distributed convolution blocks processed the 3D volumetric inputs as sequences of features from 2D slices33 (white blocks in Fig. 1); a typical convolution passed to a time-distributed wrapper can be applied to every temporal frame of the input independently33. The convolution blocks comprised two repeated modules of two 3 × 3 × 3 convolutions, batch normalization, ReLU, and 2 × 2 × 2 max-pooling along the encoder path. The number of feature maps gradually decreased from 128 to 64, 32, and 16. To mitigate the spatio-temporal information loss caused by the max-pooling operations, multi-scale inputs down-sampled from the original input volume by 2 × 2 × 2 average pooling were concatenated at each level of the encoder (multi-scale inputs in Fig. 1).
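The multi-scale input pyramid described above can be sketched in a few lines. The snippet below, a minimal NumPy stand-in for the Keras layers actually used, builds successively halved copies of a toy volume via 2 × 2 × 2 average pooling, one copy per encoder level (shapes and level count are illustrative, not the network's actual configuration):

```python
import numpy as np

def avg_pool_2x2x2(vol):
    """2x2x2 average pooling used to build the multi-scale inputs
    concatenated at each encoder level (illustrative sketch)."""
    d, h, w = (s // 2 for s in vol.shape)
    v = vol[:2 * d, :2 * h, :2 * w]          # crop to even dimensions
    return v.reshape(d, 2, h, 2, w, 2).mean(axis=(1, 3, 5))

# A small toy volume stands in for the 200x128x128 CBCT crop.
vol = np.arange(4 * 4 * 4, dtype=float).reshape(4, 4, 4)
scales = [vol]
for _ in range(2):                           # one pooled copy per level
    scales.append(avg_pool_2x2x2(scales[-1]))
print([s.shape for s in scales])             # → [(4, 4, 4), (2, 2, 2), (1, 1, 1)]
```

Because each pooled copy averages non-overlapping blocks, the overall mean intensity is preserved at every scale, which is what lets the coarser copies compensate for detail lost to max-pooling.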
At the decoder, the features from the time-distributed convolutions at the encoder33 were concatenated with the corresponding up-sampling layer and fed to bidirectional ConvLSTM blocks (skip connections and yellow blocks in Fig. 1). Long short-term memory (LSTM), a type of recurrent neural network (RNN)34, is efficient at handling spatio-temporal data and is widely used in contextual processing such as natural language processing35 and video segmentation36. In a ConvLSTM, the internal matrix multiplications of the original LSTM are replaced by convolution operations to maintain the input dimensions37. The ConvLSTM blocks were composed of two repeated modules of two 3 × 3 × 3 bidirectional ConvLSTMs, batch normalization, ReLU, and 2 × 2 × 2 up-sampling along the decoder path. The number of feature maps gradually increased from 16 to 32, 64, and 128. The ConvLSTM captured 3D local anatomical context information more effectively by learning the spatio-temporal features of the 3D volumetric data37.
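The bidirectional ConvLSTM mechanism can be illustrated with a deliberately tiny NumPy version: a single-channel ConvLSTM cell scanned forward and backward over a stack of slices, with the two passes concatenated per slice. This is a conceptual sketch only (random untrained weights, one channel, naive convolution); the actual network uses multi-channel Keras bidirectional ConvLSTM layers:

```python
import numpy as np

def conv2d_same(x, k):
    """Naive 'same'-padded 2D cross-correlation, sufficient for the sketch."""
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_scan(frames, Wx, Wh):
    """Single-channel ConvLSTM pass over a slice sequence.
    Wx and Wh each hold one 3x3 kernel per gate (i, f, o, g);
    biases are omitted for brevity."""
    h = np.zeros_like(frames[0])
    c = np.zeros_like(frames[0])
    outs = []
    for x in frames:
        i = sigmoid(conv2d_same(x, Wx[0]) + conv2d_same(h, Wh[0]))  # input gate
        f = sigmoid(conv2d_same(x, Wx[1]) + conv2d_same(h, Wh[1]))  # forget gate
        o = sigmoid(conv2d_same(x, Wx[2]) + conv2d_same(h, Wh[2]))  # output gate
        g = np.tanh(conv2d_same(x, Wx[3]) + conv2d_same(h, Wh[3]))  # candidate
        c = f * c + i * g
        h = o * np.tanh(c)
        outs.append(h)
    return outs

rng = np.random.default_rng(0)
frames = [rng.standard_normal((8, 8)) for _ in range(5)]  # toy slice stack
Wx = rng.standard_normal((4, 3, 3)) * 0.1
Wh = rng.standard_normal((4, 3, 3)) * 0.1

fwd = convlstm_scan(frames, Wx, Wh)                 # forward over slices
bwd = convlstm_scan(frames[::-1], Wx, Wh)[::-1]     # backward, realigned
bidir = [np.stack([a, b]) for a, b in zip(fwd, bwd)]  # concat per slice
print(bidir[0].shape)  # → (2, 8, 8): forward + backward features
```

The backward scan gives each slice access to context from slices ahead of it in the stack, which is how the bidirectional variant sees the canal's course in both directions.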
At the output layer, the averaged side-outputs generated from a local output map at every level of the decoder were merged and fed to the bidirectional ConvLSTM, which mitigated the vanishing gradient problem by encouraging the back-propagation of gradient flow (side output and average layers in Fig. 1). The 3D volume loss and the multi-planar projection losses (MPL) from the 2D projections simultaneously encouraged the network to learn the global structural continuity of the canal under the multi-task learning framework. The MPL were calculated from the 2D projection maps of the output in the three anatomical planes. The Dice similarity coefficient score (DSC) was used for both loss functions38. The loss function of the Canal-Net, L_total = αDL_vol + β(DL_ap + DL_cp + DL_sp), consisted of the 3D volume loss DL_vol for the entire canal volume and the MPL as the sum of the 2D projection losses in the axial (DL_ap), coronal (DL_cp), and sagittal (DL_sp) planes, where α and β were constant weights for the 3D volume loss and the sum of the 2D projection losses, respectively (equation of L_total in Fig. 1). The weights α and β were optimized for the best performance through an ablation study; weights of 0.7 and 0.3 for the 3D volume loss and the MPL, respectively, exhibited the best performance compared with the other weight options (Table 1).
Figure 1. The Canal-Net architecture with a 3D encoder-decoder under a multi-task learning framework consisting of time-distributed convolution blocks, multi-scale inputs, skip connections, and bidirectional convolutional LSTM (ConvLSTM) with side-output layers. The bidirectional ConvLSTM was utilized to capture anatomical context information, and a multi-task learning approach was performed to learn the overall MC volume and structural continuity.
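A minimal sketch of this composite loss is given below, assuming soft Dice losses and maximum-intensity projections along each axis for the 2D maps (the projection operator is our assumption; the paper does not specify how the projection maps are formed), with the ablation-selected weights α = 0.7 and β = 0.3:

```python
import numpy as np

def dice_loss(p, g, eps=1e-6):
    """Soft Dice loss: 1 - 2|P ∩ G| / (|P| + |G|)."""
    inter = np.sum(p * g)
    return 1.0 - (2.0 * inter + eps) / (np.sum(p) + np.sum(g) + eps)

def canal_net_loss(pred, gt, alpha=0.7, beta=0.3):
    """L_total = alpha*DL_vol + beta*(DL_ap + DL_cp + DL_sp).
    The 2D maps are max-intensity projections along each axis
    (an assumption for this sketch)."""
    vol_loss = dice_loss(pred, gt)                       # DL_vol
    proj_loss = sum(dice_loss(pred.max(axis=ax), gt.max(axis=ax))
                    for ax in range(3))                  # DL_ap + DL_cp + DL_sp
    return alpha * vol_loss + beta * proj_loss

# Toy ground-truth canal volume (a small cube of ones).
gt = np.zeros((8, 8, 8))
gt[2:6, 2:6, 2:6] = 1.0
print(canal_net_loss(gt, gt))  # → 0.0 for a perfect prediction
```

A prediction that misses the canal entirely is penalized through both terms, whereas one that is volumetrically close but discontinuous is penalized mainly through the projection terms, which is the intended continuity signal.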
The proposed networks were trained using the Adam optimizer with an initial learning rate of 0.00025, reduced on plateau by a factor of 0.5 every 25 epochs, for 300 epochs with a batch size of 1. They were implemented in Python 3 with Keras on a TensorFlow backend using a single NVIDIA Titan RTX GPU (24 GB).

Performance evaluation of Canal-Net for MC segmentation.
Segmentation performance was evaluated using the Dice similarity coefficient score (DSC), Jaccard index (JI), precision (PR), recall (RC), volumetric overlap error (VOE), and relative volume difference (RVD), where V_gt and V_pred represented the number of voxels in the ground truth and in the predicted volume, respectively. We also evaluated the mean curve distance (MCD), where t denotes the coordinates of a ground-truth voxel14 and C(·) is an operation that extracts the center curve line through skeletonization for a set of voxels14. Higher values of DSC, JI, PR, and RC, and lower values of VOE, RVD, and MCD indicated better segmentation performance. We used paired two-tailed t-tests to compare the performances of the Canal-Net and the other networks (SPSS Statistics for Windows 10, Version 26.0; IBM, Armonk, New York, USA). The statistical significance level was set at 0.05. We also performed Bland-Altman analyses to assess the bias and limits of agreement of the segmentation models between the number of pixels in the ground truth and in the prediction results.
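The overlap-based metrics above can be computed directly from binary volumes. The sketch below uses the common definitions (VOE as 1 − JI and RVD as the absolute volume difference relative to the ground truth are conventions we assume here; MCD is omitted because it additionally requires 3D skeletonization, e.g. via scikit-image):

```python
import numpy as np

def seg_metrics(pred, gt):
    """Overlap metrics for binary volumes V_pred and V_gt."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    tp = np.sum(pred & gt)                            # true positives
    fp = np.sum(pred & ~gt)                           # false positives
    fn = np.sum(~pred & gt)                           # false negatives
    dsc = 2 * tp / (2 * tp + fp + fn)                 # Dice similarity coefficient
    ji = tp / (tp + fp + fn)                          # Jaccard index
    return {
        "DSC": dsc,
        "JI": ji,
        "PR": tp / (tp + fp),                         # precision
        "RC": tp / (tp + fn),                         # recall
        "VOE": 1 - ji,                                # volumetric overlap error
        "RVD": abs(int(pred.sum()) - int(gt.sum())) / gt.sum(),
    }

# Two equal-sized toy cubes offset by one voxel in every direction.
gt = np.zeros((8, 8, 8), dtype=bool)
gt[2:6, 2:6, 2:6] = True
pred = np.zeros((8, 8, 8), dtype=bool)
pred[3:7, 3:7, 3:7] = True
m = seg_metrics(pred, gt)
print(round(m["DSC"], 4))  # → 0.4219 (overlap 27 voxels, 64 voxels each)
```

Here both volumes contain 64 voxels and overlap in 27, so DSC = 2·27/128 ≈ 0.42 while RVD = 0, illustrating why volume agreement alone (RVD) does not guarantee spatial overlap (DSC).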

Results
The performances of the Canal-Net, ConvLSTM 3D U-Net, MPL 3D U-Net, 3D U-Net, SegNet, and 2D U-Net were evaluated on a total of 20 mandibular canals not used for training. Among them, the ConvLSTM 3D U-Net, MPL 3D U-Net, and 3D U-Net were evaluated to demonstrate the effectiveness of the corresponding components of the Canal-Net, while the other networks were used for performance comparisons between 2D and 3D CNN-based approaches. In addition, the Canal-Net was evaluated for the impact of the weights α and β on the 3D volume loss and the MPL, respectively. The Canal-Net with loss weights of α = 0.7 and β = 0.3 achieved the best segmentation performance, with DSCs of 0.87, 0.93, 0.91, and 0.94 for the 3D volume and the axial, coronal, and sagittal planes, respectively (Table 1). Table 2 shows the quantitative segmentation performance of the networks. The Canal-Net achieved the highest values of 0.87 DSC (p < 0.05), 0.80 JI (p < 0.05), 0.89 PR (p = 0.05), and 0.88 RC (p = 0.05) in the 2D performance metrics, and the lowest values of 0.14 RVD (p < 0.05), 0.10 VOE (p < 0.05), and 0.62 MCD (p < 0.05) in the 3D performance metrics (Table 2). The Canal-Net outperformed all the other networks in DSC, JI, PR, RC, RVD, and VOE, and significantly so in MCD (p < 0.05) (Table 2). The performance of the networks is also plotted in boxplots (Fig. 2). The Canal-Net achieved higher performance than the other networks, with a smaller dispersion of the data, shorter whiskers, and fewer outliers (Fig. 2).
In Fig. 3, the Canal-Net exhibited more accurate predictions with more true positives (yellow) and fewer false positives (red) and false negatives (green) than the other networks for MCs with unclear and ambiguous cortical bone layers and metallic objects in lower-contrast CBCT images (Fig. 3a-e). In the 3D segmentation results, the Canal-Net also demonstrated better predictions with fewer false positives and false negatives than the other networks in the mental foramen area across the various MC shapes (Fig. 4a-e). Furthermore, only a few outlier cases were observed for the Canal-Net, attributable to other causes such as the presence of a third molar beside the MC (Figs. 3f, 4f). The Canal-Net predicted the entire MC volume more accurately, and demonstrated improved structural continuity and boundary details of the MC from the mental foramen to the mandibular foramen compared with the other networks (Fig. 4a-e). The DSC and MCD for the whole test dataset were plotted from the mental foramen to the mandibular foramen, and the 3D networks generally exhibited less variation in performance than the 2D networks (Figs. 5 and 6). The Canal-Net demonstrated the most consistent performance, with the smallest fluctuations in true segmentation among the networks throughout the entire MC volume (Figs. 5 and 6). As a result, the Canal-Net showed the best 3D segmentation accuracies in RVD, VOE, and MCD throughout the entire MC volume among the networks (Table 2). The Bland-Altman plot between the ground truth and the predictions of the Canal-Net showed higher linear relationships and better limits of agreement than those of the other networks (Fig. 7).
Table 1. Comparison of Dice similarity coefficient scores (DSC) for α, the weight of the 3D volume loss, and β, the weight of the MPL in the axial, coronal, and sagittal planes, to determine the optimal loss weights for the Canal-Net.
Therefore, the Canal-Net provided more accurate and robust segmentation of the entire MC than the other networks.
Table 2. Mean (SD) Dice similarity coefficient score (DSC), Jaccard index (JI), precision (PR), recall (RC), volumetric overlap error (VOE), relative volume difference (RVD), and mean curve distance (MCD) for Canal-Net (ours), ConvLSTM 3D U-Net (ours), MPL 3D U-Net (ours), 3D U-Net, SegNet, and 2D U-Net by five-fold cross-validation. *Significant difference between Canal-Net and ConvLSTM 3D U-Net (p < 0.05). †Between Canal-Net and MPL 3D U-Net (p < 0.05). ‡Between Canal-Net and 3D U-Net (p < 0.05). +Between Canal-Net and SegNet (p < 0.05). ΦBetween Canal-Net and 2D U-Net (p < 0.05).

Discussion
In this study, we proposed a continuity-aware contextual network (Canal-Net) that learned the 3D local anatomical context and the global continuity of the MC in a complementary manner in order to segment the MC with consistently high accuracy throughout the entire MC volume in cone-beam CT (CBCT) images. We employed time-distributed convolution layers for handling time-distributed sequential features with multi-scale inputs along the encoder path33, and bidirectional ConvLSTM layers for extracting spatio-temporal features along the decoder path37. The Canal-Net was able to learn the local anatomical variations of the MC by incorporating the spatio-temporal features effectively and, complementarily, the global structural continuity of the MC under the multi-task learning framework. The Canal-Net used optimized weights for the 3D volume loss and the multi-planar projection losses in multi-task learning. Therefore, the Canal-Net improved the performance of automatic MC segmentation by combining anatomical context information with global structural continuity information, resulting in consistently high accuracy throughout the entire MC volume in CBCT images. We compared the Canal-Net with other popular segmentation networks such as 2D U-Net, SegNet, and 3D U-Net, and also with our MPL 3D U-Net and ConvLSTM 3D U-Net for MC segmentation. In MC segmentation of CBCT images, 2D U-Net and SegNet generally exhibited lower accuracies than the 3D networks. False negatives and false positives were observed at a higher rate around the mental foramen area with ambiguous or unclear cortical bone layers. Since the 2D networks were not able to learn the 3D contextual features of the MC volume in CBCT images, they produced coarser 3D segmentation volumes with larger fluctuations of 3D performance accuracy from the mental to the mandibular foramen regions.
In terms of learning 3D spatial contextual information between image slices of 3D anatomical structures, the 3D U-Net was generally expected to generate more accurate segmentation results than the 2D networks41. In the present study, the 3D U-Net predicted more accurate segmentations of the MC with fewer false positives and negatives than the 2D U-Net and SegNet. However, by learning only 3D spatial information between image slices, the 3D U-Net still had limitations in accurately segmenting MC regions with unclear cortical bone layers, and exhibited inaccurate segmentation results with disconnections around the mental foramen area.
Both the MPL 3D U-Net and the ConvLSTM 3D U-Net demonstrated better segmentation results than the 3D U-Net, in different respects. The MPL 3D U-Net showed an improved travel course of the MC compared with the 3D U-Net because its spatial information was complemented with global structural continuity information learned through the multi-planar projections. Although the structural continuity of the MC volume was improved by multi-task learning, the MPL 3D U-Net had difficulty producing detailed segmentation boundaries around the mental foramen area. On the other hand, the ConvLSTM learned anatomical context information through spatio-temporal features, and the resulting MC volume showed smooth boundaries with more consistent accuracies even in regions with unclear cortical bone layers in the CBCT images. Therefore, the Canal-Net demonstrated the most accurate segmentation of the entire MC volume compared with the other networks by simultaneously learning global structural continuity through the MPL and anatomical context information through the ConvLSTM. Compared with previous studies using 3D U-Net10,14, our Canal-Net achieved a DSC of 0.87 and a mean intersection over union (IoU) of 0.80, whereas the two previous studies reported a DSC of 0.5810,14 and a mean IoU of 0.5810,14. The Canal-Net thus showed substantially enhanced performance for MC segmentation in CBCT images.
In the Canal-Net, the MPL provided global structural continuity from the three anatomical projection maps, complementing the anatomical context information captured by the ConvLSTM through spatio-temporal features. In MC areas of low visibility with ambiguous or unclear cortical bone layers in CBCT images, the Canal-Net exhibited the best outcomes, with continuous and consistent MC volumes from the mental to the mandibular foramen. The Canal-Net especially surpassed the other networks by showing continuous MC volumes around the mental foramen area, where the visibility of the MC tends to diminish15, and in areas affected by metallic objects such as implant fixtures or dental crowns in CBCT images. As a result, the Canal-Net demonstrated the most robust MC segmentation, with consistently high DSC throughout the entire MC volume in CBCT images. The primary reason for the improved segmentation performance of the Canal-Net was that its architecture was constructed to complementarily learn the 3D anatomical context information of the MC through the spatio-temporal features of the bidirectional ConvLSTM layers and the global structural continuity information through the MPL. In the Canal-Net, this complementary context information was successfully learned within the proposed framework, leading to continuous and consistent MC volumes from the mental to the mandibular foramen areas. The proposed learning process has several advantages. First, it could increase the discriminative capability of intermediate feature representations through multiple regularizations that disentangle subtly correlated tasks48, potentially improving the robustness of the segmentation performance. Second, in the application of MC segmentation, the multi-task learning framework could also provide complementary context information that serves to segment the MC while maintaining overall continuous and consistent volumes.
This could improve the performance accuracy of MC segmentations substantially, especially in MC regions with ambiguous or unclear cortical bone layers in lower contrast CBCT images.
The accurate identification of the whole MC structure in the mandible is an essential prerequisite for the preoperative planning of third molar extractions and implant surgeries to avoid surgical complications7. However, the exact recognition of the entire canal structure is considered a challenging and delicate task for several reasons15. CBCT, the most commonly used 3D dental imaging tool, has lower contrast than CT, which negatively affects the ability to distinguish MCs10,42. As a result, the low visibility of MCs, such as in ambiguous or unclear cortical bone regions, affects the structural continuity of MC segmentation in CBCT images10,14. Furthermore, the visibility of the MC itself is low due to the variable cortication and bone density of the canal wall, the diverse travel courses of the canal, and the spread of vessels and nerve branches15,43-47. The Canal-Net could therefore be used for automatic and robust 3D segmentation of the MC structure in the preoperative planning of third molar extractions and implant surgeries with CBCT images. The automatic segmentation of the MC volume by the Canal-Net could provide clinicians with accurate identification of the MC structure in the mandible, with consistently high accuracy throughout the entire MC volume from the mental foramen to the mandibular foramen, while reducing time and effort. However, our study had several limitations. First, because reducing memory requirements is a problem when running deep 3D networks on large amounts of data on the GPU, it was necessary to optimize memory usage to maximize GPU utilization. We therefore used cropped images with smaller dimensions than the originals, and this preprocessing required additional time and labor.
Second, our study had a potential limitation in generalization ability because it used internal data from a single institution. Overfitting, in which a deep learning model learns statistical regularities specific to the training dataset, could negatively impact the model's ability to generalize to a new dataset49. Although the proposed network did not show evidence of overfitting on the internal dataset in the five-fold cross-validation, it needs to be trained and evaluated on large datasets from multiple institutions or devices for generalization. Third, the results presented in this study were based on datasets from 50 patients. The proposed method needs to be evaluated on datasets from more patients with various dental restorations and implants. In future studies, we will improve the generalization ability and clinical efficacy of the Canal-Net by using large CBCT datasets acquired under various imaging conditions from multiple institutions and devices.

Conclusions
In this study, we proposed a continuity-aware contextual network (Canal-Net) that was robust to ambiguous or unclear cortical bone regions of the MC and to the lower contrast of CBCT images in 3D segmentation of the entire MC. The Canal-Net was designed based on a 3D U-Net with the ConvLSTM under a multi-task learning framework using MPL in order to complementarily learn anatomical context and global structural continuity information. As a result, the Canal-Net achieved substantially enhanced performance compared with other networks such as 2D U-Net, SegNet, 3D U-Net, MPL 3D U-Net, and ConvLSTM 3D U-Net in both 2D and 3D performance metrics. Furthermore, the Canal-Net demonstrated automatic and robust 3D segmentation of the entire MC volume by improving the structural continuity and boundary details of the MC in CBCT images. The Canal-Net could contribute to the accurate and automatic identification of the MC structure for the preoperative planning of third molar extractions and implant surgeries to avoid surgical complications.

Data availability
The datasets generated and/or analyzed during the current study are not publicly available due to the restriction by the Institutional Review Board (IRB) of Seoul National University Dental Hospital in order to protect patients' privacy but are available from the corresponding author on reasonable request. Please contact the corresponding author for any commercial implementation of our research.