Accuracy of facial skeletal surfaces segmented from CT and CBCT radiographs

The accuracy of three-dimensional (3D) facial skeletal surface models derived from radiographic volumes has not been extensively investigated yet. For this, ten human dry skulls were scanned with two Cone Beam Computed Tomography (CBCT) units, a CT unit, and a highly accurate optical surface scanner that provided the true reference models. Water-filled head shells were used for soft tissue simulation during radiographic imaging. The 3D surface models that were repeatedly segmented from the radiographic volumes through a single-threshold approach were used for reproducibility testing. Additionally, they were compared to the true reference model for trueness measurement. Comparisons were performed through 3D surface approximation techniques, using an iterative closest point algorithm. Differences between surface models were assessed through the calculation of mean absolute distances (MAD) between corresponding surfaces and through visual inspection of facial surface colour-coded distance maps. There was very high reproducibility (approximately 0.07 mm) and trueness (0.12 mm on average, with deviations extending locally to 0.5 mm), and no difference between radiographic scanners or settings. The present findings establish the validity of lower radiation CBCT imaging protocols at a similar level to the conventional CT images, when 3D surface models are required for the assessment of facial morphology.


Material
The material consisted of ten human dry skulls collected from the Municipal cemetery of Serres, Greece, following the required approval from the local authorities (Municipality of Serres, Greece, Protocol Number: 4044/12.07.2018) (Supplementary Fig. 1).This was performed in the context of a large project investigating 3-dimensional superimposition techniques on skeletal structures of the head 20,22 .As reported previously 20 , all handling of human tissues was compliant with the relevant local legislation.The specimens belonged to humans that were deceased between 8 and 12 years prior to the study implementation.At the time of acquisition, there were no claims from any relatives and the identity of the specimens was not known.Hence, informed consent was not pursued, as the ethics committee granted a waiver for it.Our goal was to choose intact skulls of adult size, free from any indications of significant aging or pathology.The sample size was arbitrarily defined based on data and resource availability and the authors' research experience.According to our previous study, a minimum sample of eight specimens was considered adequate 23 , but we decided to increase the sample size to ten to facilitate the statistical analysis and obtain more robust findings 22 .

Image acquisition
All skulls were directly scanned in hydrated conditions 20 with a structured-light, 3D surface optical scanner and were also subjected to CT and CBCT imaging in similar conditions.The radiographic images were obtained at hydrated conditions for soft-tissue simulation.The same conditions were created for the direct surface scans to account for the effects of hydration on dimensional integrity 20 and ensure comparability.
The dry skulls were embedded in tap water for approximately 15 min.The water was in room temperature (22-25 °C) and had a pH of approximately 7.5.Afterwards, they were removed from water, they were gently patted with tissue paper, and were immediately scanned using a high-accuracy, optical 3D surface scanner (Artec Space Spider, Artec3D, Luxembourg; Software: Artec Studio 12, Version 12.1.6.16).Prior to scanning, the scanner was calibrated according to the manufacturer's instructions.The complete outer facial surface was the primary target, although the complete image of the skull was acquired.The entire sample preparation and scanning process has been published previously and showed high precision at about 40 μm 20,22 .The subsequent 3D surface models were used as gold standard models for the study.
Within a few days, CT and CBCT images were obtained from the same specimens, with soft tissue simulation achieved by enclosing each specimen in a 3D printed head shell (PETG, MasterFill Premium PETG Pro, 3DHUB, Greece), filled with water 17,18,24 .The entire head bony specimen was centered in the head shell, using radiolucent water resorbing sponges (Fig. 1).Three head shells of different size were designed in a way to accommodate every specimen, while allowing for realistic soft-tissue thickness between the shells' inner wall and the outer surface of the specimens.
Each specimen underwent four full head radiographic scans using the following acquisition settings: 1. CT machine (Revolution CT 256, GE Healthcare; 251 Hellenic Airforce Hospital, Athens, Greece).kV: 120, mA: 490 in the area of interest (automatically configured based on tissue mass and density), exposure time: 1 s, slice thickness: 0.625 mm, voxel size: 0.49 to 0.62 × 0.49 to 0.62 × 0.31 (interslice) mm, FOV: full head (displayed FOV: 25 cm).All radiographic scans were performed by professional radiologists under standardized conditions regarding the water embedding process and the anatomical form stability of the specimens.Corresponding slices from radiographic volumes of one skull, through each acquisition setting, are provided in Supplementary Fig. 2.

Optical surface scanner data
The raw data of the directly scanned skulls through the surface scanner were semi-automatically post-processed in the Artec Studio 16 software (Version 16.0.5.114,Luxembourg, Luxembourg) to create a complete full head surface model.The surface models were saved as STL and OBJ file formats.The detailed 3D model generation process has been published previously and showed very high precision ranging from 5 to 10 μm 22 .
All 10 surface models that were acquired through surface scanning were considered as the gold standard reference.These were imported in Viewbox 4 software (dHAL Software, Kifisia, Greece) for comparison with the corresponding radiographically acquired surface models, to test the trueness of the latter.

Radiographic data
The CT and CBCT radiographic images were exported in DICOM format and imported in Viewbox 4 to extract the facial surface models through a visually defined single threshold segmentation process.One experienced operator (MG) assessed all models and selected the optimal threshold through gradual adjustment.To facilitate this process, image contrast was enhanced by removing the irrelevant extreme greyscale values from visualization.Afterwards the threshold isoline was adjusted manually to best conform on the outer skeletal surface edge on several 2D radiographic slices.The final threshold was the one that, after potential adaptation, provided the best segmentation of the entire facial model, according to the visual assessment of the operator.Each selected threshold was recorded in a Microsoft Excel sheet (Microsoft Corporation, Redmond WA, USA).The subsequent dense triangular mesh models, which were generated in Viewbox 4 software using a variant of the marching cubes algorithm 25 , were exported and saved as STL files.They consisted of approximately 900,000, 1,600,000 and 4,000,000 vertices, for CT, Newtom, and Planmeca unit-generated volumes, respectively.

Intra and inter-operator reproducibility of the visually defined segmentation threshold
The visual segmentation process of all radiographs was repeated by the same operator (M.G.) at least 2 weeks following the first extraction to test intra-operator reproducibility.A second operator (M.J.) with experience in single threshold identification repeated the entire process for 16 randomly selected radiographic volumes (four from each setting), following a calibration session with the first operator.The selected threshold values were recorded in a Microsoft Excel sheet and compared.

Intra-operator reproducibility of the visually segmented surface models
Intra-operator reproducibility of the visually segmented surface models was tested at three levels, similarly to a previously applied method 2 .First, the mean absolute distance (MAD), as well as the standard deviation of the absolute distances (SDAD), of the repeatedly extracted models was calculated, considering the distance of each vertex point of one mesh model to the closest point on the second model, at three predefined measurement areas consisting of 2000 triangles each.For this assessment, the segmented surface models retained their spatial relations in the source radiographic volume.The measurement areas were located on the forehead, the zygomatic process, and the maxillary complex, bilaterally.The bilaterally selected triangles on each anatomical structure were considered as one measurement area (Fig. 2A).Colour coded distance maps between repeatedly segmented entire models were generated to represent the cases with minimum, average and maximum difference on all measurement areas.Afterwards, each pair of repeated models was superimposed through a variant of the iterative closest point (ICP) algorithm 26 , under the following software settings: 100% estimated overlap of meshes, matching point to plane, exact nearest neighbour search, 100% point sampling, 50 iterations.The used superimposition reference area is shown in Fig. 2C.The rotational and translational movements required for the best fit approximation of each pair of models were recorded to describe differences between their original position and the position after superimposition.Finally, the MAD (SDAD) between the superimposed models at the predefined measurement areas was calculated to test their morphological differences, independent of their position in space.Zero MAD prior to superimposition and zero movements were considered as perfect reproducibility.

Trueness of radiographically derived surface models
The trueness of the radiographically derived skeletal surface models was tested through comparison with directly obtained models using a high-accuracy optical surface scanner 20,22 .For this purpose, each radiographically derived surface model was best-fit approximated to its corresponding optical scanner derived model, using the same settings and reference areas described above, regarding the intra-operator reproducibility of the visually segmented surface models.The congruence of the superimposed models on three pre-defined measurement areas, consisting of 4,000 triangles each, was calculated as MAD (SDAD) and this was the metric to assess trueness.The measurement areas were placed bilaterally on the forehead, the zygomatic process, and the maxillary complex.The bilaterally selected triangles on each anatomical structure were considered as one measurement area (Fig. 2B).These measurement areas were selected only once for each skull, on the optical scanner derived surface, and were used for all comparisons to eliminate confounding due to measurement area selection.Colour coded distance maps between superimposed entire models were generated to represent the cases with minimum,  A, B).Three measurement areas defined in each skull through bilaterally selected mesh surfaces at the forehead (blue), the zygomatic process (green), and the maxillary process (red) for reproducibility and trueness assessment, respectively.Each circular area shown in the images consists of 1000 triangles.(C).Reference area (light blue) used for all surface-based superimpositions performed in the study.

Intra-operator reproducibility of the segmented surface models
Intra-operator reproducibility of the visually segmented surface models was high, with the maximum MAD being 0.067 mm (SDAD: 0.026 mm) and 0.027 mm (SDAD: 0.023 mm), before and after superimposition of the repeatedly segmented models, respectively.Both were detected in the maxillary process depicted in a CT scan for MADs and in a Newtom scan for SDADs.
When each area was analysed as a single variable, Newtom showed consistently better reproducibility (MAD) than CT in all areas prior to best-fit approximation of the models (Forehead: p = 0.008, Zygoma: p = 0.008, Maxilla: p = 0.005).After approximation of the models, this outcome remained significant only for the maxilla (p = 0.038).Regarding the SDADs of the repeated models, before and after superimposition, all comparisons showed no differences, except from one (CT vs. Planmeca U: p = 0.038) (Fig. 4).
For all acquisition settings, colour coded distance maps between repeatedly segmented surface models, at their original position, revealed uniformly distributed differences, limited within a range of 0.1 mm (Fig. 5).
There was no difference between acquisition settings in any rotational or translational movement required to best-fit approximate repeatedly segmented surface models by the same operator, (X-translation: p = 0.608, Y-translation: p = 0.263, Z-translation: p = 0.431, X-rotation: p = 0.354, Y-rotation: p = 0.145, Z-rotation: p = 0.501).The magnitude of translation or rotation was very limited, consistently less than 0.08 mm or degrees, respectively (Fig. 6).
When each measurement area was analysed as a single variable, there was no difference in trueness (MAD of the best-fit approximated models) between acquisition settings.Similarly, for the SDAD very few, limited differences between acquisition settings were detected only at the maxilla (Newtom vs. Planmeca F: p = 0.013, Newtom vs. Planmeca U: p = 0.033) (Fig. 7).
The colour coded distance maps between the best-fit approximated segmented and directly scanned facial surface models revealed consistent outcomes within and between acquisition settings, with deviations from the   true model only locally reaching a maximum of 0.5 mm (Fig. 8).Large deviations were consistently present at the sides of the skull, located distant to the used superimposition reference area.

Discussion
The present study tested the accuracy of radiographically derived 3D facial surface models against a gold standard reference model, applying advanced 3D image analysis methods.Previous studies attempted to test similar outcomes, albeit with certain shortcomings, such as the usage of artificially prepared models 27 , the lack of softtissue simulation [28][29][30] , or the use of inter-landmark distances as test outcomes 5,29 .Other studies highlighted these shortcomings and their impact on applicability issues 10,16,19,20,31,32 .We aimed to simulate real life conditions, by incorporating study design techniques that allowed for the simulation of soft-tissues 24 and the acquisition of true reference models 22 , and we considered the entire facial surface for outcome assessment.
Although soft tissues have varying densities, and their representation in radiographic images may not be fully accurate with single material simulation, various materials have been shown to provide adequate soft-tissue simulation in similar experimental settings, including water, wax sheets, or gel-like materials 17,18,24 .We embedded the dry skulls in water during radiographic imaging to enable multiple scans without compromising the integrity of the specimens 20,22 .We hydrated the dry skulls similarly, prior to the true reference model acquisition by the high-accuracy optical scanner, since a previous study revealed that hydration alters the anatomical form of dry bones 20 .
This study tested the accuracy of both CT and CBCT radiographic scanners, using regular radiation dose settings and a low radiation setting, offered by one CBCT unit.The image acquisition protocols were defined according to the standard practice for the assessment of craniofacial morphology, by specialist radiologists that were regularly using the radiographic machines for clinical purposes.All tested methods performed similarly, exhibiting an average trueness of 0.12 mm, without any significant difference between machines or settings.Colour coded distance maps revealed local deviations extending up to 0.5 mm, but the overall models were robust in all cases, despite the different image acquisition configurations.This finding has significant implications when 3D surface models are sought for diagnosis, growth or treatment outcome evaluation, or applications such as virtual surgical planning and manufacturing of prostheses in maxillofacial surgery.In every circumstance, whether these accuracy levels are acceptable or not, depends on the specific research or clinical context in which they are intended to be applied.The present findings showed that imaging methods applying lower radiation than the conventional CT can provide adequate digital models for the assessment of facial morphology, without compromises in accuracy.This holds true even for the low dose acquisition with the Planmeca CBCT scanner, which was performed using half of the radiation dose of the regular scan by the same unit.In the same line, a previous qualitative study evaluating the visibility of selected anatomical structures of the jaws in large FOV CBCT images with lower exposure settings showed encouraging results 33 .
The overall trueness of 0.12 mm was very high when considering the magnitude of the optical scanner error 22 and the segmentation error, at 0.03 mm and 0.07 mm, respectively.Larger deviations up to 0.5 mm were consistently present at the sides of the skull, which were distant to the selected superimposition reference area 3,10,22,34 , as well as to local sites over the entire tested surface, with no specific spatial pattern.The latter findings might be partially attributed to direct optical surface scanner inaccuracies that might sometimes exceed 0.2 mm locally 20,22 .
The segmentation reproducibility was lower for the CT derived surface models, when assessed at their original position in space.This difference can be attributed to how thresholding functions in each imaging modality.CT images use Hounsfield units, which is a standardized quantitative scale consistent throughout a dataset, meaning that two voxels with the same value correspond to the same tissue type and density.Therefore, slight changes in the threshold value may have a more distinct effect on the resulting 3D surface.On the other hand, grayscale values in CBCT do not have the same consistency, as they vary according to the surrounding structures of each target area and to its position in space.In CBCTs, small differences in the threshold values appeared to have a lesser effect.The voxel size differences might be another contributing factor, although it did not affect the trueness findings, even within the same machine (CT, Supplementary Fig. 3).In any case, the differences between repeatedly segmented models were consistently lower than 0.1 mm and the differences between repeatedly selected threshold values were consistently lower than 2% of the total range of values.Thus, any of the aforementioned factors had a minor effect on the reproducibility outcomes.
Although trueness was comparable among radiographic scanners and settings, its variance was higher for the CT and CBCT data from the Planmeca machine, especially in the maxillary areas.A possible explanation could be the porous structure of the maxilla in conjunction with the usage of a single threshold value for each dataset.The complex bone-air interface in this area may pose a challenge to depict accurately and reliably, when other factors such as the voxel size, FOV, and radiation dose are considered.
For the assessment of reproducibility and trueness, the left and right measurement areas were unified to consolidate homologous findings and facilitate the statistical analysis.This decision was based on previous studies that did not find differences between contralateral sides 3,10 .Moreover, we selected four circular surfaces for each measurement area to assess trueness, but only two for reproducibility.The reason is the higher resolution of the optical scanner as opposed to the radiographically derived 3D surfaces.Therefore, the gold standard reference models had a higher triangle count for a given area.To compensate for this difference, we performed four selections in each measurement area to assess trueness that had similar extent to the two analogous areas used for reproducibility testing.

Strengths and limitations
Based on the outcomes of a previous study 20 , the hydration of the skulls has a significant effect on their geometric stability.This is important when investigating accuracy at a submillimetre level.Every specimen was embedded in water for 15 min prior to any image acquisition to closely simulate real life conditions, namely the presence of soft-tissues on radiographic images, and acquire directly comparable true reference models.Another strength is the inclusion of multiple widely used radiographic scanners to increase the study's generalisability.The acquisition parameters were defined by radiology specialists to reflect imaging for actual clinical purposes, namely adequate diagnostic quality for the representation of actual anatomical morphology.Finally, the rigorous testing of the used hardware and software applications that provided the true models 20,22 , as well as of the 3D imaging, the superimposition, and the assessment methods, by our team, ensure the validity of the presented outcomes 2,3,10,23,[34][35][36] .
A limitation of our study might be the usage of a single threshold value for each dataset.An alternative method is the manual segmentation for each anatomical region, but this is a very time consuming procedure that is still prone to error 37 .More recent, sophisticated approaches involving deep learning and artificial intelligence applications manage to reduce time considerably, but the segmentation accuracy levels are often not properly assessed or not substantially enhanced 38,39 .Due to its straightforward application, the single threshold segmentation remains the standard process for many imaging software, despite its weaknesses, especially in CBCT images.Here, both the intra-and inter-operator differences in the visual segmentation threshold values were small and did not differ between machines.Similar outcomes were evident in a recent study on anterior cranial base surface models, where the visually defined single threshold value showed very small differences from the manually defined 2 .The repeatedly extracted facial surface models consistently showed differences lower than 0.1 mm in the entire facial surface.With this straightforward approach, the present study showed high accuracy outcomes, despite the weaknesses stemming from the CBCT image characteristics, where the grayscale intensity values do not directly correspond to the tissue type and density throughout the entire radiographic volume.The latter, combined with the large extent of the assessed surface models, was expected to affect negatively the outcomes, since the optimal threshold might differ between anatomical areas of the same tissue type and density that are located at different positions in the radiographic volumes.Regardless, the accuracy outcomes were robust throughout the tested facial surfaces.It should be mentioned that when imaging actual patients, motion artefacts may lead to greater inaccuracies 40 .Moreover, the software's algorithm measures the distances between selected vertices on one surface and the closest vertices on the other surface, which might not necessarily be anatomically correspondent.This minimizes the differences between the two approximated surface meshes, and thus, the actual anatomical differences may be slightly larger.Finally, the age that the tested skulls represented was not known.It is expected that most of the skulls, if not all, would belong to adult subjects.As a result, we should anticipate a specific age range, but this cannot be further defined.However, the absence of this information did not impact the comparisons within the study, as the same skulls were utilized across all methods.The outcomes were robust with no outliers, but still no solid conclusions about children can be drawn from the present study.
This study focused on facial surface models, which comprise an important anatomical area for several fields 5,10,12 .Other areas, with thin cortical bone, such as the mandibular condyles or the anterior cranial base, which might be hard to segment in large field of view scans 2,19 , were not assessed.That would require the application of newly designed assessment protocols, including bone segmentation procedures, and could be the topic of future studies.Future research should also focus on the assessment of craniofacial morphology in three dimensions, while minimizing patient exposure to radiation.The fact that the tested low radiation protocol showed comparable accuracy to the standard protocols is encouraging.A decrease of the field of view and the proper adjustment of other scanning parameters can lead to further reduction of radiation exposure 23,33,41 .

Conclusions
The repeatedly extracted facial surface models through a visually defined single threshold showed consistently differences lower than 0.1 mm over the entire facial surface, for all acquisition configurations.All CT and CBCT radiographic images exhibited an average trueness of 0.12 mm, with local deviations reaching 0.5 mm in certain cases.There were no significant differences between radiographic scanners or settings and the inaccuracies were evenly distributed over the entire facial surface.These findings are of high significance establishing the accuracy of lower radiation CBCT imaging at a similar level to the conventional CT images.Furthermore, a lower radiation CBCT protocol was shown to perform adequately and similarly to the standard CBCT imaging protocols in the depiction of the actual skeletal facial morphology, shifting the cost/benefit ratio of a given acquisition to a more favourable level for our patients.

Figure 1 .
Figure 1.(A).Final configuration of an entire skull subjected to radiographic imaging.(B) The 3D printed soft-tissue head shell was filled with water, through the hole present at its upper part.(C).Radiographic image acquisition with soft-tissue simulation.

Figure 2 .
Figure 2. (A, B).Three measurement areas defined in each skull through bilaterally selected mesh surfaces at the forehead (blue), the zygomatic process (green), and the maxillary process (red) for reproducibility and trueness assessment, respectively.Each circular area shown in the images consists of 1000 triangles.(C).Reference area (light blue) used for all surface-based superimpositions performed in the study.

Figure 3 .
Figure 3. Box plots showing the intra-and inter-operator differences in visually defined facial surface segmentation threshold values for radiographic volumes.Outliers are shown as black circles (further from the median more than 1.5 times the IQR).A difference of 10 in threshold values corresponds to 0.25% of the full range of voxel values of the CT images, 0.22% of the Newtom images, and 0.29% of the Planmeca images.

Figure 4 .
Figure 4. Box plots showing the intra-operator differences between repeatedly segmented facial surface models from radiographic volumes.The upper graphs show Mean Absolute Distances (MAD) between the corresponding surface models and the lower graphs the Standard Deviations of the absolute distances (SD).The lines connect variables that show significant differences (p < 0.05) detected through Kruskal-Wallis, followed by Mann-Whitney U test (Bonferroni adjusted).Outliers are shown as black circles (further from the median more than 1.5 times the IQR) or asterisks in more extreme cases (further from the median more than 3 times the IQR).

Figure 5 .
Figure 5. Colour coded distance maps between repeatedly segmented surface models by the same operator, representative of the minimum, average, and maximum differences detected in the sample for each acquisition setting.The compared models retained their original spatial relation within the source radiographic volume (status before superimposition).

Figure 6 .
Figure 6.Box plots showing the rotational (°) or translational (mm) movements required to best-fit approximate the repeatedly segmented surface models by the same operator, for each acquisition setting.Outliers are shown as black circles (further from the median more than 1.5 times the IQR).X-translation: lateral movement, Y-translation: vertical movement, Z-translation: anteroposterior movement, X-rotation: around the lateral axis, Y-rotation: around the vertical axis, Z-rotation: around the anteroposterior axis.

Figure 7 .
Figure 7. Box plots showing the trueness of the segmented facial surface models indicated by the distances of the radiographically derived models from the direct optical scans, following their best-fit approximation.The Mean Absolute Distances (MAD) and the standard deviations of the absolute distances (SD) between the superimposed surface models are shown.The lines connect variables that show significant differences (p < 0.05) detected through Kruskal-Wallis, followed by Mann-Whitney U test (Bonferroni adjusted).Outliers are shown as black circles (further from the median more than 1.5 times the IQR).

Figure 8 .
Figure 8. Colour coded distance maps between best-fit approximated segmented surface models and directly obtained models through an optical surface scanner, representative of the minimum, average, and maximum deviations in trueness, detected in the sample for each acquisition setting.