Abdominal organ segmentation via deep diffeomorphic mesh deformations

Abdominal organ segmentation from CT and MRI is an essential prerequisite for surgical planning and computer-aided navigation systems. It is challenging due to the high variability in the shape, size, and position of abdominal organs. Three-dimensional numeric representations of abdominal shapes with point-wise correspondence to a template are further important for quantitative and statistical analyses thereof. Recently, template-based surface extraction methods have shown promising advances for direct mesh reconstruction from volumetric scans. However, the generalization of these deep learning-based approaches to different organs and datasets, a crucial property for deployment in clinical environments, has not yet been assessed. We close this gap and employ template-based mesh reconstruction methods for joint liver, kidney, pancreas, and spleen segmentation. Our experiments on manually annotated CT and MRI data reveal limited generalization capabilities of previous methods to organs of different geometry and weak performance on small datasets. We alleviate these issues with a novel deep diffeomorphic mesh-deformation architecture and an improved training scheme. The resulting method, UNetFlow, generalizes well to all four organs and can be easily fine-tuned on new data. Moreover, we propose a simple registration-based post-processing that aligns voxel and mesh outputs to boost segmentation accuracy.


Introduction
The goal of medical image segmentation is to extract anatomical structures in the image. Despite the progress in tomographic imaging, the image resolution of CT and MRI is typically on the order of millimeters. Hence, even accurately predicted segmentations, as well as manual annotations, can only approximate the true organ contour since they are limited to the voxel grid. The consequence is the staircase artifacts shown in Figure 1. While voxel-based biomedical segmentation is sufficient for coarse morphological analyses, fine-grained geometric analyses of organs and clinical navigation systems need highly accurate segmentations of organ boundaries [1]. Therefore, voxel-based segmentation methods such as UNet [2] and its variants [3][4][5] are only of limited suitability for such applications.
With 3D meshes, we can overcome the limitations of the voxel grid by having an explicit representation of the contour, see Figure 1. 3D meshes are a natural representation of shapes as they can capture the topology and smoothness of the organ surfaces. Recent approaches for surface reconstruction from image data based on template deformations [6][7][8][9][10][11][12][13] seem particularly promising for abdominal organs since anatomical priors, such as spherical topology, can be directly engraved into the template, a property that is difficult to achieve with implicit representations [14,15]. In addition, homologous points, i.e., point-to-point correspondences, of the output shapes to the template are established through the deformation [16]. These correspondences are essential for the creation of statistical shape models and for modeling longitudinal changes.
The origin of template-based segmentation can be traced back to the idea of active contours [17] in 2D and active shapes [18] for 3D segmentation. The main idea is to compute an ideally diffeomorphic deformation of an input template to the target surfaces as visualized in Figure 2. Recent approaches for direct surface reconstruction from image data learn these deformations from training data without requiring pre-defined landmark points. Based on the type of neural network that predicts the template warping, three groups can be distinguished. The first class of methods uses multi-layer perceptrons (MLPs) for the computation of displacement vectors [6,7] or a neural deformation field that can be integrated numerically [8]. The latter has the advantage that optimal integration of continuous-time deformation fields, which amounts to solving the deformation-describing ordinary differential equation (ODE), naturally prevents self-intersections [8]. Similarly, the second class of methods [9,19] relies on CNNs to predict deformation fields. A third branch of works [10][11][12][13] focuses more on the graph structure of meshes. These methods combine CNNs with deep sequential graph neural networks (GNNs) that operate directly on the template mesh and can aggregate neighboring vertex features locally for computing deformation vectors. In contrast to the ODE-based approach, GNN-based deformation methods require carefully tuned regularization loss functions to enforce the smoothness of reconstructed surfaces and to avoid self-intersections. In addition, these methods predict a voxel segmentation next to the meshes. This serves as an auxiliary task for learning meaningful image features but has not yet been considered further.
In this work, we perform the first thorough evaluation of deep template-based surface reconstruction methods for joint abdominal multi-organ segmentation. This is not straightforward since most methods were developed for single shapes, and the performance typically relies on a large number of hyper-parameters. Therefore, we implement six deep mesh-deformation architectures and evaluate them on 1,000 abdominal CT scans and a smaller MRI dataset (70 scans). Our results highlight two weaknesses of existing methods: (i) corrupt pancreas segmentation due to a limited capability to generalize to organs of different geometry and (ii) difficulties training on smaller datasets. To address these issues, we develop a new deep diffeomorphic mesh-deformation method called UNetFlow, where we introduce a novel deep mesh supervision (DMS) scheme and an additional voxel branch (VB). Our experiments show that UNetFlow yields segmentation accuracy on par with or better than existing methods while achieving the best topological measures on average over all organs. Furthermore, we analyze the relation between voxel and mesh outputs, which has been overlooked in previous works, and we showcase a simple yet effective post-processing step to improve mesh segmentation via alignment to the voxel output of the model. Finally, we introduce a fine-tuning scheme for training on small datasets and show its effectiveness on the MRI scans.

Homologous mesh extraction with UNetFlow
The primary result of this work is a new deep diffeomorphic shape deformation method called UNetFlow. The architecture of UNetFlow is depicted in Figure 3 and further details are described in the "Methods" section. As input, it takes a 3D medical scan and a mesh template comprising all anatomical shapes under consideration. We pass the scan through a residual 3D UNet [5] and predict deformation fields $\Phi_0, \dots, \Phi_4$ from the decoder. The "resolution" of each deformation field corresponds to the respective decoder stage, which increases the granularity of mesh deformations in a coarse-to-fine manner. To enable the joint reconstruction of multiple organs at a time, we predict an individual deformation field for each organ. An exception is $\Phi_4$, where we compute a single deformation field for all organs to improve the delineation of close-by surface boundaries. Formally, the warping of the input template is modeled by the ODE $\frac{\mathrm{d}x(t)}{\mathrm{d}t} = \Phi(x(t), t)$, with the initial condition at $t = 0$ given by the points on the template and $\Phi$ being composed of $\Phi_0, \dots, \Phi_4$. To solve this ODE, we use the Euler integration scheme due to its computational efficiency. Since we keep the connectivity of the meshes unchanged throughout the deformation, homologous points with an analog on the template are induced. We train UNetFlow with a combination of deeply supervised [20] voxel ($\mathcal{L}_{CE}$) and mesh ($\mathcal{L}_{M}$) losses. Finally, we propose to align the output meshes to the predicted voxel segmentation for the best accuracy.
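The Euler scheme mentioned above can be illustrated with a minimal sketch. It assumes a velocity field given as a Python callable; the names `euler_integrate` and `radial_flow` are hypothetical, and the analytic toy field merely stands in for the predicted fields, which in an actual model would be voxel grids sampled at the vertex positions by interpolation.

```python
import numpy as np

def euler_integrate(points, flow, n_steps=5, t0=0.0, t1=1.0):
    """Forward-Euler solution of dx/dt = flow(x, t) for an (N, 3) array of points."""
    x = np.asarray(points, dtype=float).copy()
    h = (t1 - t0) / n_steps          # step size
    t = t0
    for _ in range(n_steps):
        x = x + h * flow(x, t)       # x_{k+1} = x_k + h * Phi(x_k, t_k)
        t += h
    return x

def radial_flow(x, t):
    """Hypothetical toy field (an assumption, not a learned field): velocity equals position."""
    return x

# Template vertices and faces; only the vertices move, the connectivity is left
# untouched, so vertex i of the output corresponds to vertex i of the template.
template = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
faces = np.array([[0, 1, 2]])
deformed = euler_integrate(template, radial_flow, n_steps=5)
```

For this toy field, each Euler step scales the vertices by a factor of 1 + h, so five steps with h = 0.2 scale the template by 1.2^5. Because the face array is never modified, the point-wise correspondence to the template follows directly from the vertex indices.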

Characterization of template-based mesh extraction methods
Table 1 characterizes current deep template-based mesh extraction methods based on model parameters, end-to-end training ability, type of mesh decoder, vertex features (if any), number of loss terms, and inference time. For best comparability, we assume all methods employ the same UNet backbone and are implemented to reconstruct all shapes jointly, as in our implementation but usually not in the original works that proposed the methods. When studying the table, we realized that NMF reconstructs shapes from a compressed latent vector that represents the entire shape "globally" and that the size of this latent representation affects reconstruction quality. However, we are interested in comparing the different neural network architectures for mesh-based segmentation independent of the ability to learn a compressed representation of shapes. Hence, we implemented an MLP-based variant of V2C [13] (V2C-MLP), which is identical to V2C except that every graph layer is replaced with a linear layer. We expect that the effect of the GNN compared to an MLP can be better assessed by comparing V2C to V2C-MLP instead of NMF.
With around 6.6M parameters, UNetFlow has the most compact architecture. The reason is that UNetFlow consists essentially of the UNet backbone alone, whereas the other approaches add an MLP (NMF, V2C-MLP), a GNN (V2M, V2C), or more UNets (CF) to it. This is also reflected by the inference time, where UNetFlow has a slight advantage over all other methods for reconstructing five individual shapes (liver, two kidneys, spleen, and pancreas). All methods, besides CF, which relies on a sequential training procedure of the UNets, are trainable in an end-to-end fashion. End-to-end training is important for transferring trained models to new data via fine-tuning and it typically makes the training less error-prone. Another characteristic is the number of mesh loss terms used for training the different architectures, ranging from one in NMF to five in V2C/V2C-MLP. Note that we only consider the number of different loss functions, i.e., geometric or regularization terms, not taking into account loss terms from different deformation stages or deep supervision. In general, it can be assumed that a larger number of loss functions makes it harder to transfer the method to new data since the weighting among loss terms is a sensitive parameter that needs to be tuned thoroughly.

Table 1. Characterization of deep mesh-deformation methods. Inference time is measured on an Nvidia A100 and includes reconstruction of all four organs except for NMF, which has only been implemented for the liver. The number of mesh loss terms counts the different loss functions computed on output meshes, ignoring terms that arise from multiple outputs.
Method | Model parameters | End-to-end | Mesh decoder | Vertex features | Mesh loss terms | Inference time

Reconstruction accuracy and topological correctness
We report segmentation Dice scores and average symmetric surface distance (ASSD) on the CT test set in Table 2 for all methods from Table 1. We also compute the number of self-intersecting faces (SIF) per organ as a topological measure of surface quality. We observe that NMF is not competitive in liver segmentation with the other methods, which we attribute to the reconstruction from a global feature vector that is probably unable to cover a high level of surface details. For kidneys, liver, and spleen, the performance of the other methods is largely comparable. For the pancreas, however, V2C, V2M, and V2C-MLP yield considerably lower Dice scores. This quantitative observation is underpinned by Figure 4, which shows that the pancreas mesh is "wrapped around" multiple times in these methods. The defect is likely related to the many loss functions in these methods, as it occurs similarly for the GNN and the MLP decoder. UNetFlow and CF turn out to be more robust in this regard as they achieve reasonable segmentation of the pancreas, which we attribute to the fewer loss functions and a numerical integration scheme. Notably, we find a slight but consistent advantage of UNetFlow (Dice of 0.87, ASSD 2.16 mm, SIF 0.02% on average over all organs) compared to CF (Dice of 0.86, ASSD 2.67 mm, SIF 0.4%). Table 3 assesses the impact of novel architectural design choices in UNetFlow on the CT validation set. It can be observed that both the proposed deep mesh supervision (DMS) and the added voxel branch (VB) improve the mean segmentation accuracy in terms of Dice score and ASSD on average over all organs. Importantly, the DMS reduces the number of self-intersecting faces (SIF), a measure of topological correctness, by up to 97%, from 0.30% to 0.01%.
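To make the two accuracy measures concrete, the following minimal sketch computes a Dice score on binary voxel masks and an ASSD on two sets of surface points. The function names are hypothetical, and the exact conventions are assumptions not stated in the text: the ASSD below pools the nearest-neighbor distances of both directions into one mean and works on sampled points rather than on mesh surfaces.

```python
import numpy as np

def dice_score(pred, target):
    """Dice overlap 2|A ∩ B| / (|A| + |B|) of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * np.logical_and(pred, target).sum() / denom

def assd(points_a, points_b):
    """Average symmetric surface distance between two (N, 3) point sets."""
    a = np.asarray(points_a, dtype=float)
    b = np.asarray(points_b, dtype=float)
    # Pairwise Euclidean distances, shape (len(a), len(b)).
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    a_to_b = d.min(axis=1)  # distance from each point of a to its nearest point in b
    b_to_a = d.min(axis=0)  # distance from each point of b to its nearest point in a
    return (a_to_b.sum() + b_to_a.sum()) / (len(a) + len(b))

# Two 4x4x4 masks, each covering two slices, overlapping in one slice.
pred = np.zeros((4, 4, 4), dtype=bool)
target = np.zeros((4, 4, 4), dtype=bool)
pred[:2] = True
target[1:3] = True
dice = dice_score(pred, target)

# Two point pairs offset by one unit along z.
surface_a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
surface_b = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 1.0]])
dist = assd(surface_a, surface_b)
```

In this toy case the masks hold 32 voxels each and share 16, giving a Dice score of 0.5, and every point lies exactly one unit from its nearest counterpart, giving an ASSD of 1. The dense pairwise distance matrix is only practical for small point sets; a spatial index would be the usual choice for full meshes.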

Alignment of surface meshes and voxel segmentation
V2M, V2C, and UNetFlow predict a voxel segmentation next to the surface meshes. However, the relation between these outputs has not been considered so far. We close this gap and demonstrate in Table 4 that, by using rigid Iterative Closest Point (ICP) [21] or non-rigid ICP (NRICP) [22] registration of the meshes to the voxel prediction, the segmentation accuracy of the meshes predicted by V2C and UNetFlow can easily be improved. At the same time, the point-wise correspondence to the template is preserved by the registration. As an additional baseline, we also evaluate direct non-rigid registration of the template to the voxel segmentation produced by the UNet backbone in UNetFlow (a rigid registration could not cover individual shape details).
In Table 4, we find that the non-rigid registration of the template to the UNet output is not competitive with the learned deformations of V2C and UNetFlow. For V2C and UNetFlow, Dice and ASSD scores are generally better after the registration except for V2C on the pancreas, where the ASSD is slightly higher after the application of ICP. While the improvement in accuracy is mostly larger for non-rigid registration, it also introduces more topological errors. On the other hand, the surface topology and, hence, the number of self-intersecting faces (SIF), remains unchanged when applying the ICP algorithm.
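The rigid variant of this post-processing can be sketched as a point-to-point ICP loop with a closed-form (Kabsch) update. This is a minimal illustration under stated assumptions, not the implementation evaluated in Table 4: the names `best_rigid_transform` and `rigid_icp` are hypothetical, and `target` stands in for boundary points that would be extracted from the predicted voxel segmentation.

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t with R @ src_i + t ~ dst_i (Kabsch)."""
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = -1.0 if np.linalg.det(Vt.T @ U.T) < 0 else 1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

def rigid_icp(vertices, target_points, n_iters=20):
    """Rigidly align mesh vertices to target points; the vertex order is kept."""
    moved = np.asarray(vertices, dtype=float).copy()
    target = np.asarray(target_points, dtype=float)
    for _ in range(n_iters):
        # Closest target point for every vertex (brute force).
        d = np.linalg.norm(moved[:, None, :] - target[None, :, :], axis=-1)
        nearest = target[d.argmin(axis=1)]
        R, t = best_rigid_transform(moved, nearest)
        moved = moved @ R.T + t
    return moved

# Toy data: target points are cube corners; the "predicted mesh" is the same
# cube, slightly rotated about z and shifted.
target = np.array([[x, y, z] for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)])
angle = np.deg2rad(5.0)
Rz = np.array([[np.cos(angle), -np.sin(angle), 0.0],
               [np.sin(angle),  np.cos(angle), 0.0],
               [0.0, 0.0, 1.0]])
center = target.mean(axis=0)
predicted = (target - center) @ Rz.T + center + np.array([0.1, -0.05, 0.02])
aligned = rigid_icp(predicted, target)
```

Since a single rotation and translation is applied to all vertices, the mesh connectivity and the vertex indices stay untouched, which is consistent with the observation that rigid ICP leaves the number of self-intersecting faces and the correspondence to the template unchanged.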

Generalization to MRI
Abdominal MRI is an integral part of large population studies [23][24][25]. Nonetheless, the number of annotated training samples is typically smaller than in publicly available CT datasets. Therefore, we assess the generalization of template-based segmentation models to MRI. Since CF relies on an intricate sequential training procedure and it is therefore unclear how to transfer a trained model to new data, we train CF from scratch on the MRI data. As seen from Table 5, the results for CF are far from satisfying, with an average Dice score of 0.71 over all organs. This is underwhelming since, although the number of training samples (47) is relatively low, this is not an atypical size for segmentation datasets and is often sufficient for segmentation models [4]. We further evaluate the best two models from Table 2, V2C and UNetFlow, in two different settings: training from scratch on the MRI data and fine-tuning the models pre-trained on CT. It turns out that the pre-training on CT drastically improves the accuracy on MRI compared to the models trained from scratch. The scores for the pre-trained models are mostly on par, with an advantage of UNetFlow in segmentation Dice on the liver, pancreas, and spleen. UNetFlow also shows a low number of self-intersections (≤ 0.06% for all organs) compared to V2C (3.92% on the pancreas).

Discussion
The automated segmentation of abdominal shapes is challenging because of the high variability in size, shape, and position. Furthermore, point correspondences across surfaces have to be established for vertex-wise statistical analyses. Recent advances in template-based surface reconstruction, comprising methods like NMF [8], V2M [11], CF [9], and V2C [13], seem to be promising for this task as they do not require any mesh extraction nor any other post-processing but directly produce meshes with point-wise correspondence to a template. However, the generalization to different shapes and imaging modalities is crucial for deploying deep learning models to clinical practice and has not been evaluated thoroughly.
Our experiments on four organs (liver, kidneys, spleen, pancreas) and two imaging modalities (CT, MRI) uncovered the fragility of existing mesh template-based shape reconstruction methods. More precisely, methods that do not use a numerical integration scheme but instead rely on regularization terms (V2M [11], V2C [13]) were unable to successfully reconstruct organs of different shape and size, which becomes evident from the defective, practically useless pancreas meshes in Figure 4. Further, even though it yields topologically flawless liver shapes, we found NMF [8] not to be competitive regarding segmentation accuracy. We suppose this is due to the global latent vector from which the organ is recovered, which is different from the localized view of the other evaluated methods. CF accomplished reasonable results on the pancreas segmentation, cf. Table 2, but also delivered a surprisingly high number of self-intersecting faces on the organs in the CT test set (not so on the MRI data with identical parameters). Moreover, training CF on a smaller MRI dataset did not lead to reasonable segmentation results, cf. Table 5, and fine-tuning of a pre-trained model is prevented by its intricate sequential training procedure. Finally, we improved the robustness to shape variations and transfer learning by simplifying the architecture to its core UNet and training it with a novel deep mesh supervision technique and an additional voxel branch. In our experiments, UNetFlow is the only method that achieves a relative number of self-intersecting faces (SIF) below 0.1% on all organs and both imaging modalities, cf. Tables 2 and 5, while having the best segmentation Dice score on three out of four shapes on the CT data, cf. Table 2. This indicates that UNetFlow attains the best trade-off between accuracy and topological correctness across different shapes and imaging modalities. The end-to-end training of UNetFlow further offers the opportunity to benefit from pre-training on currently emerging large corpora of annotated data; this is standard practice for conventional image segmentation but has not yet been exploited for biomedical mesh reconstruction. Its simplicity makes UNetFlow straightforward to implement and to use, especially in comparison to alternative approaches that combine different neural architectures (CNN + MLP for NMF, CNN + GNN for V2M/V2C) or stack multiple UNets (CF).
Another potential that previous works have overlooked is the relation between the voxel and the mesh output, e.g., obtained by V2C [13] and UNetFlow. In Table 4, we demonstrated that a simple surface registration via (NR)ICP of the predicted meshes to the learned voxel segmentation improves the average (ASSD) and worst-case (HD99) surface distance by up to 2 mm (spleen HD99 of UNetFlow with no registration and NRICP). However, in the case of NRICP, this might come at the cost of additional self-intersections, which are avoided by the rigid nature of standard ICP, still yielding a consistent improvement in ASSD for UNetFlow on all organs. Therefore, we recommend applying NRICP only if accuracy is more critical than the manifoldness of the shapes and ICP otherwise.
We based our implementation of all methods on the original repositories and thoroughly tuned critical parameters, such as the weighting between individual loss terms. This was necessary since we found slight misspecification of these parameters to cause unreasonable results, and none of the evaluated methods, except V2M, was initially developed for abdominal shapes. However, finding the best set of parameters is time-consuming and computationally costly. Following the same procedure for all methods for a fair comparison, we limited the search to the liver shape. Nonetheless, our evaluation might be biased toward

The labels we used for training and evaluation in our experiments were obtained from manually annotated voxel masks. However, as seen from Figure 1, this is not optimal due to the limited resolution of the voxels and prevalent staircase artifacts. Unfortunately, in contrast to neuroimaging, where established and extensively validated software frameworks for automated mesh extraction exist [26], there is no such tool for the abdomen, and the creation of "manual mesh annotations" is hardly possible.
In summary, we have transferred recent template-based surface extraction methods to abdominal multi-organ segmentation from CT and MRI. We have uncovered the limited generalization ability of these approaches to changes in organ geometry and a severe dependency on the amount of training data. We were able to mitigate these issues with a simpler, yet competitive deep mesh deformation architecture, UNetFlow. We demonstrated generalization across different abdominal organs and two imaging modalities, CT and MRI. We are encouraged by this result to question overly complex neural architectures in favor of more straightforward but similarly effective methods.

Datasets
The two imaging modalities in our experiments are CT and MRI. For CT images, we used the public AbdomenCT-1K dataset [27], which contains 1,000 abdominal CT scans with manual annotations for four organs: kidneys, liver, spleen, and pancreas. After excluding scans with missing organs, we split the CT data into train, validation, and test splits with 666, 167, and 147 scans, respectively. CT images were z-score normalized. For MRI scans, we used Dixon opposed-phase (OPP) images from three different datasets: UKBiobank [23] (37 scans), GNC [24] (16 scans), and KORA [25] (17 scans). The OPP images were pre-processed following the steps from a previous study [28], which involves bias field correction, resampling, cropping, and annotation by an anatomy expert. Our MRI training, validation, and test splits contained 47, 5, and 18 samples, respectively, balanced according to data sources. MRI images were min-max normalized. For all images, we performed affine registration [29] to a randomly selected reference image (Case_00001 from AbdomenCT-1K, excluded from the test set) to ensure spatial alignment (2×2×3 mm³ resolution). Finally, we extracted ground truth meshes for each organ from the label masks using marching cubes [30].
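The two intensity normalizations named above can be sketched as follows. The function names are hypothetical, and computing the statistics per scan is an assumption, since the text does not state whether they are taken per scan or over the whole dataset; the random arrays only stand in for image volumes.

```python
import numpy as np

def zscore_normalize(volume):
    """Shift and scale intensities to zero mean and unit standard deviation."""
    v = np.asarray(volume, dtype=float)
    std = v.std()
    if std == 0:
        raise ValueError("constant volume cannot be z-score normalized")
    return (v - v.mean()) / std

def minmax_normalize(volume):
    """Rescale intensities linearly to the range [0, 1]."""
    v = np.asarray(volume, dtype=float)
    span = v.max() - v.min()
    if span == 0:
        raise ValueError("constant volume cannot be min-max normalized")
    return (v - v.min()) / span

# Synthetic stand-ins for a CT and an MRI volume (assumed intensity ranges).
rng = np.random.default_rng(0)
ct = rng.normal(loc=40.0, scale=300.0, size=(8, 8, 8))
mri = rng.uniform(low=0.0, high=1500.0, size=(8, 8, 8))

ct_norm = zscore_normalize(ct)     # as described for the CT scans
mri_norm = minmax_normalize(mri)   # as described for the MRI scans
```

A plausible reason for the different treatment is that CT intensities are calibrated in Hounsfield units while MRI intensities have no fixed physical scale, but this rationale is not given in the text.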

Voxel space
At the core of UNetFlow, we employ a 3D residual UNet based on the 3D-full-resolution network implemented in nnUNet [5]. This fully-convolutional neural network projects a 3D input scan into a low-resolution latent space. From the latent space, the decoder recovers the original image size step-wise, producing a stack of cuboidal feature maps with increasing resolution. From the final UNet layer, a segmentation is obtained via Softmax activations.
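The last step, turning the final-layer outputs into a label map, can be sketched as a channel-wise softmax followed by an argmax. The tensor layout (classes first) and the number of classes (background plus four organs) are assumptions made for illustration, and the random logits only stand in for network outputs.

```python
import numpy as np

def softmax(logits, axis=0):
    """Numerically stable softmax along the class axis."""
    z = logits - logits.max(axis=axis, keepdims=True)  # subtract max to avoid overflow
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical logits with layout (classes, depth, height, width);
# 5 classes are assumed here: background plus four organs.
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 2, 2, 2))

probs = softmax(logits, axis=0)   # per-voxel class probabilities
labels = probs.argmax(axis=0)     # per-voxel label map
```

Because softmax is monotonic, the argmax of the probabilities equals the argmax of the logits; the probabilities themselves matter mainly for the cross-entropy loss during training.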

Figure 1. Meshes (top, computed with UNetFlow) represent shape characteristics of abdominal organs, i.e., smooth surface morphology and spherical topology, better than voxel maps (bottom), as visualized for MRI, CT, and a 3D surface rendering.

Figure 2. UNetFlow predicts a smooth diffeomorphic deformation of the input template, thereby establishing vertex correspondences between the template and the reconstructed shapes. Colors indicate a unique vertex ID. See the supplementary video for an animation.

Figure 3. Architecture of UNetFlow. The predicted flow fields $\Phi_0, \dots, \Phi_4$ enable a correspondence-preserving deformation of the input template, guided via deep mesh supervision (DMS). The model is trained with a combination of voxel ($\mathcal{L}_{CE}$) and mesh ($\mathcal{L}_{M}$) losses. At test time, the consistency of the output shapes with the segmentation is ensured via surface registration. Rainbow colors indicate a unique vertex ID.

Figure 4. UNetFlow avoids pitfalls of previous template-based organ extraction methods, such as inter-mesh intersections and defective ("wrapped around") pancreas shapes. Shown are axial, coronal, and sagittal views of segmentations (from top to bottom) on a scan from the CT test set and corresponding 3D views of shapes (last row, respectively).

Table 2. Comparison of deep mesh-based surface extraction methods on the CT test set (n = 147). We report Dice scores, average symmetric surface distance (ASSD), and the relative number of self-intersecting faces (SIF). Values are mean ± SD and the best scores are highlighted for each organ.

Table 3. UNetFlow ablation study on the CT validation set (n = 167). We evaluate the impact of the proposed deep mesh supervision (DMS) and voxel branch (VB) in terms of Dice score, average symmetric surface distance (ASSD), and self-intersecting faces (SIF). Values are mean ± SD over all organs in the validation set.

Table 4. Surface accuracy increases through rigid Iterative Closest Point (ICP) and non-rigid ICP (NRICP) registration of the mesh to the voxel output. We report mean values ± SD of average symmetric surface distance (ASSD), 99-percentile Hausdorff distance (HD99), and self-intersecting faces (SIF) on the CT test set (n = 147). * This model uses the UNet encoder and decoder from UNetFlow.

Table 5. Evaluation on the MRI test set (n = 18) with models trained from scratch and, if possible, pre-trained on CT data. We report Dice, average symmetric surface distance (ASSD), and the relative number of self-intersecting faces (SIF). Values are mean ± SD and the best scores are highlighted for each organ.