Resolving complex cartilage structures in developmental biology via deep learning-based automatic segmentation of X-ray computed microtomography images

The complex shape of embryonic cartilage represents a true challenge for phenotyping and basic understanding of skeletal development. X-ray computed microtomography (μCT) enables inspecting relevant tissues in all three dimensions; however, most 3D models are still created by manual segmentation, which is a time-consuming and tedious task. In this work, we utilised a convolutional neural network (CNN) to automatically segment the most complex cartilaginous system, represented by the developing nasal capsule. The main challenges of this task stem from the large size of the image data (over a thousand pixels in each dimension) and a relatively small training database, which includes genetically modified mouse embryos whose analysed structures differ from the norm. We propose a CNN-based segmentation model optimised for the large image size, which we trained on a unique manually annotated database. The segmentation model was able to segment the cartilaginous nasal capsule with a median accuracy of 84.44% (Dice coefficient). The time necessary for segmentation of new samples shortened from approximately 8 h of manual segmentation to a mere 130 s per sample. This will greatly accelerate the throughput of μCT analysis of cartilaginous skeletal elements in animal models of developmental diseases.

To understand the complexity of embryonic development, it is essential to assess the shape and structure of tissues and organs in three-dimensional space; this also enables dissecting the sequential steps of their formation. Pioneering work introduced tissue contrasting techniques that enabled the detection of previously hidden

[…] training due to the homogeneity of the segmented structure. To our knowledge, there is only one work dealing with a similarly high-resolution segmentation of the chondrocranium in developing mouse embryos imaged via µCT, published by Zheng and colleagues23. A manually annotated database was not available to those authors, so they approached the segmentation task using sparse annotation with uncertainty-guided self-training. They segment cartilage in the whole chondrocranium and evaluated the performance of their method on selected, manually annotated subregions of the whole 3-D volume, as manual annotations of the whole chondrocranium were not available. This manual selection of the evaluation region may skew the reported accuracy.
Here, we provide a methodology for fully automatic segmentation of the highly complex cartilaginous nasal capsule in µCT images of mouse embryos. We utilised a CNN trained in a supervised mode on a unique database of 14 manually annotated µCT scans of mouse embryo heads at the 17th day of embryonic development. We made several modifications to a basic encoder-decoder CNN architecture to improve the segmentation performance of the model and experimentally validated the proposed methodology on this particular image segmentation task. The µCT image segmentation model can be further used to segment newly scanned mouse embryos, thus greatly reducing the time required for processing new samples. The segmentation model is ready to be trained on additional embryonic developmental stages or used as a basis for transfer learning in other high-resolution µCT segmentation tasks. We also show that the data provided by the proposed automatic segmentation methodology can be quantitatively analysed in the same manner as manually segmented data.

Methods
Samples. The database for training and testing the proposed segmentation method consists of 14 micro-CT scans of mouse embryonic heads at developmental stage E17.5. The heads were contrasted using the PTA-staining procedure before scanning, which enabled the detection of tissues with low density (e.g., cartilage and muscle)24. The staining protocol was previously described in2,3,25. A subset of the dataset was published and is available for inspection in26. All samples utilised in this work are summarised in Suppl. Table S1. All animal work was approved by the Local Ethical Committee on Animal Experiments (Norra Djurförsöksetiska Nämd, ethical permits N226/15 and N5/14) and conducted according to the recommendations of The Swedish Animal Agency's Provisions and Guidelines for Animal Experimentation. In order to comply with the 3R strategy of animal welfare, we used data generated for previous studies2,3; no additional animals were used in this study. All experiments on animals were conducted in compliance with the ARRIVE guidelines.
Multiple genetically modified embryos with altered cartilage development were included in the database to improve the generalisability of the developed method. Because proper sample preparation is very important, we also included one embryo that had been improperly stained during the sample preparation procedure. The differences are visualised in the tomographic cross-section in Fig. 1a. The changes in the geometry and morphology of the cartilaginous nasal capsule in genetically modified samples range in severity from moderate to severe. The shape differences found in mutant embryos are visualised as 3-D renders in Fig. 1b.

Sample preparation. Mice were sacrificed with an isoflurane (Baxter KDG9623) overdose or cervical dislocation, and embryos were dissected and collected in ice-cold PBS. Subsequently, the samples were fixed in 4% paraformaldehyde (PFA) in PBS solution for 24 h at +4 °C with slow rotation. Before contrasting, samples were dehydrated in incrementally increasing ethanol concentrations (30%, 50%, 70%), one day in each concentration, to minimise tissue shrinkage. Samples were then transferred into 1.5% PTA (phosphotungstic acid) in 90% methanol for tissue contrasting. The PTA-methanol solution was changed every 2–3 days. Samples were stained for seven weeks. The contrasting procedure was followed by rehydration of the samples by incubation in an ethanol series (90%, 70%, 50% and 30%).

μCT measurement. The samples were scanned with a laboratory μCT system GE Phoenix v|tome|x L 240 (Waygate Technologies GmbH, Germany). The system was equipped with a high-contrast flat panel detector DXR250 with 2048 × 2048 pixel resolution and 200 × 200 μm² pixel size. The embryos were fixed in polyimide tubes filled with 1% agarose gel to prevent sample movement during the µCT stage rotation. Two thousand projections were acquired with an exposure time of 900 ms per projection.
Each projection was captured three times, and the averaged signal was used to improve the signal-to-noise ratio. The acceleration voltage of the X-ray tube was 60 kV, and the tube current was 200 μA. The X-ray beam was filtered with a 0.1 mm aluminium plate. Tomographic reconstruction of the obtained set of projections was performed using the FDK reconstruction algorithm27 in GE phoenix datos|x 2.0 3D computed tomography software (Waygate Technologies GmbH, Germany). The reconstructed CT slices were output as 16-bit integers. To compensate for small and smooth drifts of the axis (samples and detector) and focus (X-ray tube) positions, the scan optimiser module was applied during the reconstruction. Beam hardening correction was applied using the commercially available module in the reconstruction software, with parameter 7 for different materials. The voxel size varied depending on the sample size (see Suppl. Table S1 for complete information).
Manual segmentation. Avizo image processing software (version 7, Thermo Fisher Scientific, USA) was used to manually segment the nasal capsule cartilage in the reconstructed CT images. The data were aligned so that each embryo head had the same orientation. The manual segmentation of the cartilaginous nasal capsule tissue takes at least 8 h16, depending on the sample and the operator's experience. Because the cartilage was segmented by multiple operators, some inter-operator variability is introduced into the manually segmented samples. This was partially mitigated by a quality check performed by a single expert, but it might still affect the quality of the dataset and hence the evaluation of the segmentation accuracy. To make the 3D segmentation workload easier to handle, only every third slice was manually segmented, and the remaining slices were calculated by linear interpolation between adjacent manually segmented slices. Figure 2 depicts the segmented structure in the context of the whole head in the 3-D volume.
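The slice-interpolation scheme can be sketched in a few lines of NumPy. This is an illustrative sketch only (the actual interpolation was performed inside the image processing software); the 0.5 threshold and the assumption that annotations sit at every third slice index are our choices:

```python
import numpy as np

def interpolate_masks(volume_masks, step=3):
    """Fill unannotated slices of a binary mask volume by linear
    interpolation between manually segmented slices, assuming the
    annotated slices lie at indices 0, step, 2*step, ..."""
    out = volume_masks.astype(float).copy()
    n = volume_masks.shape[0]
    for lo in range(0, n - step, step):
        hi = lo + step
        for k in range(1, step):
            t = k / step  # interpolation weight towards the upper slice
            out[lo + k] = (1 - t) * out[lo] + t * out[hi]
    # re-binarise the interpolated slices
    return (out >= 0.5).astype(np.uint8)
```

A voxel annotated in both neighbouring key slices stays foreground; a voxel annotated in only one of them is assigned to whichever key slice is closer.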
Neural network architecture. We aimed to fully preserve the resolution provided by the µCT imaging modality. In the CNN architecture design, we had to account for the large size of the segmented images, which exceed 1000 voxels in all three dimensions. Utilising a fully 3-D CNN architecture for the segmentation of image data of this size is not feasible due to memory limitations. Piecewise segmentation of patches extracted from the 3-D volume might seem a possible solution; however, even the segmented structure itself is enormous for typical segmentation via a 3-D CNN (see Suppl. Table S1), exceeding 700 × 1000 × 600 voxels in every case. By extracting patches from the whole 3-D volume, much of the global spatial context needed for proper localisation and segmentation of the cartilage would be lost. For these reasons, a slice-by-slice approach to segmentation is the most appropriate. Manual segmentation was performed in the axial slices of the whole 3-D volume, and we therefore used the axial plane for training and subsequent inference of the developed segmentation model. We use the basic U-Net shape; however, in the original implementation12 the input is downsampled only four times. To compensate for the large image size, two additional levels were added to the architecture. This means that the input, set to a fixed size of 1792 × 1280 pixels, is downsampled a total of six times, to 28 × 20 pixels at the lowest level of the network. This makes the network very deep, and issues such as the vanishing gradient could significantly hinder the training of the model. For this reason, the architecture was enhanced with residual blocks, first proposed in20. Residual blocks are structures of stacked layers utilised in CNNs.
They improve the information flow through the deep network, prevent vanishing-gradient problems, and improve the network's training28. Three types of residual blocks are used in the architecture (see Fig. 3). A downsampling residual block implements dimensionality reduction in the encoding part of the CNN architecture: strided convolutional layers reduce the dimensions in the convolutional path of the block, and max-pooling does so in the identity path. Because the number of filters in U-Net-based architectures doubles with each dimension-reduction level, the number of filters in the identity path must be increased accordingly; this is performed by a 1 × 1 convolutional layer with the required number of filters, so that the feature maps from the convolutional and identity paths can be added. The second type of residual block in the proposed architecture is a so-called flat block, which outputs feature maps with the same dimensions as its input. The third type is an upsampling block, the residual equivalent of the transposed convolutional layers in the decoder part of the basic U-Net architecture. Upsampling is performed by transposed convolutional layers in the convolutional path of the block and by nearest-neighbor interpolation in the identity path; a 1 × 1 convolutional layer in the identity path again ensures the correct number of feature maps for the addition. As in any U-Net-based architecture, feature maps from the encoder are concatenated with the decoder feature maps. The overall CNN architecture is visualised in Fig. 3. Furthermore, we used the SELU activation function29 with LeCun normal weight initialisation30 in the proposed CNN architecture.
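A downsampling residual block of this kind can be sketched in PyTorch. This is an illustrative sketch, not the authors' exact layer configuration: the kernel sizes, the number of convolutions per path, and the placement of the activations are our assumptions.

```python
import torch
import torch.nn as nn

class DownResBlock(nn.Module):
    """Downsampling residual block: a strided-convolution path plus an
    identity path (max-pooling followed by a 1x1 convolution that
    adjusts the filter count so the two paths can be added)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv_path = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.SELU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        )
        self.identity_path = nn.Sequential(
            nn.MaxPool2d(kernel_size=2),          # halve spatial dims
            nn.Conv2d(in_ch, out_ch, kernel_size=1),  # match filter count
        )
        self.act = nn.SELU()

    def forward(self, x):
        return self.act(self.conv_path(x) + self.identity_path(x))
```

Applied to a batch of shape (1, 32, 64, 64) with `DownResBlock(32, 64)`, the output has shape (1, 64, 32, 32): spatial dimensions halved, filter count doubled, as in each encoder level of the described architecture.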
SELU was designed by its authors to have a so-called self-normalising property, which makes the training of the network more stable and thereby improves its performance. A great advantage of SELU over other normalisation techniques is that it requires no hyperparameter tuning and does not depend on the mini-batch size. To support weight updates even in the deepest part of the network, additional paths were added to each upsampling block: a 1 × 1 convolutional layer with a sigmoidal activation function followed by a basic upsampling layer that transforms the feature map to the dimensions of the ground-truth mask. The losses were weighted as follows, from the deepest layer to the shallowest: 0.03, 0.05, 0.08, 0.12, 0.15, 0.2 and 0.37, with the largest weights given to the layers with feature maps of the largest dimensions.

Data preparation. As the proposed CNN architecture requires a fixed-size input, the dimensions of the CT images and the corresponding manual segmentation masks had to be unified. First, we rescaled the data to a unified voxel size of 6 μm by bilinear interpolation. A suitable dimension proved to be 1792 × 1280 pixels: this value allowed us to crop the tomographic cross-sections of larger datasets without any loss of relevant information. Where one or both dimensions of the data were smaller than this value, the image data were padded with zero-value pixels. The prepared data were then standardised to zero mean and unit standard deviation.
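The crop/pad-and-standardise step can be sketched in NumPy. This is a minimal illustration, assuming the 6 μm rescaling has already been done; the corner-anchored crop and the small epsilon guard against a zero standard deviation are our choices, not taken from the paper:

```python
import numpy as np

TARGET_SHAPE = (1280, 1792)  # fixed CNN input size (rows, columns)

def prepare_slice(img, target=TARGET_SHAPE):
    """Crop or zero-pad a tomographic slice to the fixed input size,
    then standardise to zero mean and unit standard deviation."""
    out = np.zeros(target, dtype=np.float32)
    # crop dimensions that exceed the target size
    img = img[:target[0], :target[1]].astype(np.float32)
    # zero-pad dimensions that are smaller than the target size
    out[:img.shape[0], :img.shape[1]] = img
    mean, std = out.mean(), out.std()
    return (out - mean) / (std + 1e-8)
```

After this step every slice, regardless of the original scan dimensions, has the same shape and intensity statistics expected by the network.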
Training. For better generalisation of the trained segmentation model, a custom augmentation procedure is proposed. The augmentation consists of random rotation, vertical flipping, elastic deformation, gamma transform with a random gamma parameter, and random scaling (see Table 1 for the transform parameters). Each training image has a certain probability of undergoing two consecutive augmentation transforms.

Performance evaluation. The performance of the proposed segmentation method was evaluated using the Dice similarity coefficient (DSC), a commonly used measure of overlap between binary segmentation masks. Its maximum value is 1, which signifies complete overlap between the evaluated segmentation mask and the ground-truth mask38.

Wall thickness analysis. Wall thickness analysis was performed using VG Studio MAX 3.5 software (Volume Graphics GmbH, Germany). The wall thickness at each voxel was calculated as the diameter of the largest sphere inscribed in the volume that still contains the centre of that voxel.
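The Dice coefficient used throughout the evaluation is straightforward to compute from two binary masks; a NumPy sketch (the convention of returning 1.0 for two empty masks is our assumption):

```python
import numpy as np

def dice_coefficient(pred, truth):
    """Dice similarity coefficient between two binary masks:
    2 * |A & B| / (|A| + |B|), ranging from 0 (no overlap) to 1."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    total = pred.sum() + truth.sum()
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return 2.0 * intersection / total
```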

Results and discussion
The results of the sevenfold cross-validation are summarised in Table 3. The results of the segmentation were compared with the ground-truth segmentation masks via the Dice coefficient. The results are also visualised as a boxplot (Fig. 4a), where each point represents the segmentation accuracy of one 3-D segmented sample. According to the Dice coefficient, the median segmentation accuracy is 84.44%, with the largest outlier being Sample 4, with a segmentation accuracy of merely 55.68%. As shown in Fig. 1, Sample 4 was improperly stained during the sample preparation procedure, and the proposed segmentation model could not correctly identify the features necessary for accurate segmentation of the cartilage. It is thus essential that the staining protocol performed prior to the μCT measurement be followed correctly for the segmentation model to perform well. Sample 10 is a severely affected mutant embryo, significantly different from the rest of the available database. It was included in the training and evaluation of the CNN to demonstrate its capability to process even morphologically different samples. Its DSC of 71.16% is relatively low compared to the rest of the database, and more scans of mutant mouse embryos should be included in the training database to improve the model's segmentation accuracy on this type of sample. The moderately changed mutant embryo (Sample 6) was segmented with an above-average accuracy of 86.67%. See Fig. 5a for a visualisation of the differences in segmentation accuracy in genetically modified embryos. Figure 5b shows an example of both manual and automatic segmentation in a selected tomographic cross-section of Sample 8.
We also evaluated the proposed method on 100 randomly selected tomographic slices from the validation fold of the available database, segmented by a second independent operator, to see whether the proposed CNN behaves similarly to an independent human operator performing the manual segmentation. The segmentation was performed in the same way as the segmentation of the ground-truth data, using Avizo (Thermo Fisher Scientific, USA). Both the CNN segmentation and the independent operator segmentation were compared with the ground-truth segmentation masks using the Dice coefficient. The results of this experiment are summarised as a boxplot in Fig. 4b, where each data point represents one segmented tomographic slice. The median accuracy of the automatic CNN segmentation with respect to the ground-truth data was 87.43%, and that of the second operator was 88.14%. There was also a moderate positive correlation between the values (Spearman coefficient 0.59, p < 0.01). This shows that the CNN operates within the scope of inter-operator variability; the segmentation error might thus be caused partially by the uncertainty of the manual segmentation in some regions of the cartilage. The performance of the trained segmentation model was also evaluated on samples from developmental stages not present in the training database (specifically embryos from the 12th to the 18th day of development). The accuracy of such segmentation was 86% (DSC) for the sample on the 18th day of development (E18.5) and 72% for the scan of embryos on the 16th day of development (E16.5). We performed Theiler staging of the embryos in this external dataset25,39. Theiler stages objectively evaluate the development of embryos based on their morphology, independently of their gestational age.
The Theiler stage of the 17-day-old and the 18-day-old embryos is the same (Theiler stage 26), with the 16-day-old embryo being only one stage lower (Theiler stage 25). These samples were not involved in the development of the proposed method, and these results thus show that the proposed methodology performs well even on an external test set of embryos at comparable developmental stages. When the network is applied to earlier stages, the segmentation accuracy decreases rapidly. At developmental stages of 12 to 13 days, when the cartilage is not fully developed and mesenchymal condensations are still present, the trained CNN fails completely (see Fig. 4c). Including other developmental stages in the training database might improve the robustness of the method; however, using the same segmentation model to segment images of embryos earlier than 14 days after conception, before the cartilage is formed, does not seem feasible.
As a further qualitative check of the segmentation accuracy, we performed a wall thickness analysis of the segmented structure for both the 3-D model created by manual segmentation and the 3-D model created by the proposed CNN (Fig. 6). Wall thickness analysis is a routine follow-up analysis used to reveal additional developmental changes. Figure 6 shows the wall thickness analysis of Sample 8. As the wall thickness histogram (c) in Fig. 6 shows, the results of the wall thickness analysis performed on the two 3-D models are very similar. This is also demonstrated by the very high positive correlation of the wall thickness distributions (Spearman coefficient 0.98, p < 0.01). Slight differences may be caused by the step artefact produced by the manual segmentation, which is performed only in a single plane. Although the CNN also segments in a single plane, its predictions are much smoother.
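The wall-thickness definition used here (the diameter of the largest inscribed sphere that still contains the voxel) was evaluated in VG Studio MAX; purely as an illustration of the definition, it can be reproduced with a brute-force NumPy sketch on a toy volume (O(n²) in the number of foreground voxels, so not usable on real scans):

```python
import numpy as np

def local_thickness(mask):
    """For each foreground voxel, the diameter of the largest sphere
    that fits inside the foreground and still contains that voxel."""
    fg = np.argwhere(mask)
    bg = np.argwhere(mask == 0)
    # radius of the largest sphere centred at each foreground voxel
    radii = np.array([np.linalg.norm(bg - p, axis=1).min() for p in fg])
    thickness = np.zeros(mask.shape, dtype=float)
    for p in fg:
        d = np.linalg.norm(fg - p, axis=1)   # distance to each candidate centre
        covering = radii[d <= radii]          # spheres that contain voxel p
        thickness[tuple(p)] = 2 * covering.max()
    return thickness
```

For a 5 × 5 square of foreground in a 7 × 7 image, the centre voxel gets a thickness of 6 (the largest inscribed disc has radius 3), matching the intuitive wall thickness of the square.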
We performed an ablation experiment to evaluate the contribution of each proposed modification to the CNN architecture and training strategy towards the total nasal capsule cartilage segmentation accuracy. Here, we removed the modifications from the complete architecture one by one and evaluated the segmentation accuracy of each resulting model by sevenfold cross-validation. The results of this experiment can be seen in Fig. 7. The proposed methodology, employing the increased depth of the CNN, deep supervision, SELU activations, residual blocks and the proposed image augmentation strategy, provides the highest median segmentation accuracy: 74.58% (DSC). Note that this number is considerably lower than the median segmentation accuracy presented in Fig. 4a; the CNNs in the ablation experiment were trained on a reduced training set to make the experiment less time-demanding. Deep supervision seems to provide only a minor improvement to the total segmentation accuracy, as the median is lower by only ~2% (DSC) when training without it. Training models without the residual blocks or without the proposed augmentation procedure leads to a more significant drop in cross-validation accuracy, to a median of ~67% (DSC). This justifies the use of residual blocks to improve the training of the CNN. The segmentation accuracy of the CNN without the increased depth drops even further, to a median of 63.52% (DSC). This decrease is expected, as the shallower network has fewer trainable parameters and cannot benefit from the abstract features extracted in the deep layers of the proposed CNN. Finally, the most significant drop in accuracy is observed when the ReLU activations are kept instead of being substituted with SELU activations.
This shows that the reported self-normalising property of the SELU activation function dramatically improves the final segmentation accuracy and generalisability of the trained image segmentation model, making SELU an extremely valuable addition to the CNN architecture. As in many supervised machine learning applications, the performance and generalisability of the trained model are closely tied to the distribution of the training database. In our work, the proposed CNN was trained exclusively on data originating from a single μCT scanner, with the samples measured under a unified methodology (sample staining, scanning parameters, resolution, image size). The methodology described here should be followed as closely as possible to achieve segmentation performance comparable to the results shown in this work. We artificially enlarged the training database by applying selected data augmentation techniques; nevertheless, a decrease in performance should be expected when deviating from the outlined data acquisition methodology. This decrease in segmentation accuracy was demonstrated in the case of Sample 4, where the staining of the sample is significantly different from the rest of the database. Expanding the training database

Conclusion
In this work, we have demonstrated a highly efficient and time-saving application of a custom U-Net-based CNN for the segmentation of cartilaginous tissue in μCT images of mouse embryos. We employed this architecture and trained it on a database of 14 manually segmented 3-D μCT scans. We have shown that highly accurate, fully automatic segmentation (84.44% overlap with ground truth according to the Dice coefficient) of the complex cartilaginous structures in a developing mouse head is achievable via deep learning and will be vital for accelerating research on the mammalian chondrocranium. One of the primary motivations for this work was to reduce the time required to process new data by employing a fully automatic segmentation procedure, which shortens the processing time to roughly 130 s per sample (Fig. 4d), depending on the number of tomographic cross-sections present and on the available hardware. This segmentation model will be further used to segment new samples, including models of major congenital craniofacial and skeletal diseases. It is also possible to obtain an even larger training database by manually correcting initial segmentation results, making the final model even more robust.

Data availability
Due to the training data coming from multiple sources and studies, it is currently not feasible to share the complete training and testing database; however, a subset of the whole database was published as an X-ray microtomography-based atlas of mouse embryo cranial development and can be accessed at26. The trained models and accompanying code can be found in a public GitHub repository: https://github.com/janmatula/deep-mouse-cartilage.

Figure 7. Results of the ablation experiment. Each box represents the sevenfold cross-validation segmentation accuracy of a CNN trained without one of the modifications to the architecture or training strategy employed in this work. The boxes extend from the first quartile (Q1) to the third quartile (Q3), and their length represents the interquartile range (IQR = Q3 − Q1). The whiskers extend to the largest and smallest data points lying within 1.5·IQR below Q1 and above Q3. Outliers are marked as black dots.