A lightweight neural network with multiscale feature enhancement for liver CT segmentation

Segmentation of abdominal Computed Tomography (CT) scans is essential for analyzing, diagnosing, and treating visceral organ diseases (e.g., hepatocellular carcinoma). This paper proposes a novel neural network (Res-PAC-UNet) that employs a fixed-width residual UNet backbone and Pyramid Atrous Convolutions, providing a low disk utilization method for precise liver CT segmentation. The proposed network is trained on the Medical Segmentation Decathlon dataset using a modified surface loss function. Additionally, we evaluate its quantitative and qualitative performance; the Res16-PAC-UNet achieves a Dice coefficient of 0.950 ± 0.019 with less than half a million parameters. Alternatively, the Res32-PAC-UNet obtains a Dice coefficient of 0.958 ± 0.015 with an acceptable parameter count of approximately 1.2 million.

The proposed method aims to achieve high liver segmentation accuracy while utilizing fewer network parameters and less disk space. Fundamentally, we make the following contributions:
1. We propose tuned backbones with residual connections and a fixed width that minimize the parameter count of the network while improving gradient flow and segmentation performance (Methodology: Network Architecture).
2. We introduce a Pyramid Atrous Convolution (PAC) module over the skip connections of the encoder to extract multiscale volumetric features, assisting the network in constructing liver masks from CT scans with poor anatomical boundaries and contrast (Methodology: Pyramid Atrous Convolution Module).
3. We modify the surface loss function proposed by Kervadec et al. 32 by incorporating the combo loss, allowing the loss function to quantify the discrepancies between the network prediction and ground truth (Methodology: Loss Functions).
4. We empirically evaluate the impact of the loss functions and PAC module on the segmentation accuracy of the Res-PAC-UNet and other architectures in the UNet family. Furthermore, we compare the segmentation performance of the Res-PAC-UNet with state-of-the-art liver CT segmentation models [28][29][30] (Results and Discussion).
To the best of our knowledge, this work presents significant advantages over the existing work in the literature by emphasizing lower parameter count, smaller model size, and usability of the model without compromising the segmentation accuracy. The remainder of the paper is structured as follows: "Proposed methodology" Section describes our proposed methodology, including the neural network architectures, PAC module, and the loss function. "Experimental setup" Section explains the experimental setup by providing information about the dataset, preprocessing, evaluation metrics, and implementation. "Results and discussion" Section highlights the results of the empirical study and discusses the critical observations and findings. Finally, "Conclusion" Section summarizes our contribution and concludes the paper.

Proposed methodology
Network architecture. Our Res-PAC-UNet architecture (Fig. 1) overcomes the major limitations of the UNet and its variants for liver CT segmentation. Firstly, Res-PAC-UNet has a tuned backbone with constant feature width (K) and residual blocks to minimize the parameter count and the memory footprint of the network while improving the information and gradient flow. The constant feature width (K = 16 or K = 32) prevents the exponential increase in features (from 32 to 256) in the deeper layers of UNet. We have selected K (i.e., 32) based on the initial feature width of the UNet 7. We have also trimmed K to half of the initial width of the UNet (i.e., 16) to understand its impact on segmentation accuracy. Unfortunately, we could not train the Res-PAC-UNet with higher values of K due to limited GPU memory. Secondly, we overcome the problem of redundant features and loss of edge information in deeper feature maps by proposing an intuitive solution of generating features of different scales before transferring them to the decoder. To compute multi-scale volumetric features at different encoder levels, we place PAC modules ("Pyramid atrous convolution module" Section) over the skip connections. We avoid placing the PAC modules at the top skip connection to minimize the GPU memory required by the high resolution of the feature maps. Additionally, the residual blocks replace the convolutional blocks and perform downscaling of the input feature maps by employing strided convolutions. Figure 2 shows the residual block utilized in the tuned backbone. Initial convolutions in the encoder residual blocks operate with a stride of 2 ($s_0 = 2$) to downscale the feature map resolution by half. On the other hand, the decoder employs transpose convolutions (i.e., deconvolution) to upscale the feature map resolution and regular convolutions with a stride of 1 in the residual blocks.
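The effect of a constant feature width on parameter count can be illustrated with simple arithmetic. The sketch below is not the authors' implementation: it assumes 3 × 3 × 3 kernels, two convolutions per encoder level, four encoder levels, and a single input channel, none of which are stated in the text. Only the widths (32 doubling to 256, versus constant 32 or 16) come from the description above.

```python
def conv3d_params(c_in, c_out, k=3):
    """Weights plus biases of one 3D convolution with a k x k x k kernel."""
    return (k ** 3) * c_in * c_out + c_out


def encoder_params(widths, c_in=1, convs_per_level=2):
    """Sum convolution parameters over encoder levels.

    `widths` lists the channel count of each level. The number of levels and
    `convs_per_level` are illustrative assumptions, not values from the paper.
    """
    total = 0
    for width in widths:
        for _ in range(convs_per_level):
            total += conv3d_params(c_in, width)
            c_in = width
    return total


doubling = encoder_params([32, 64, 128, 256])  # UNet-style widening
fixed_32 = encoder_params([32, 32, 32, 32])    # constant width K = 32
fixed_16 = encoder_params([16, 16, 16, 16])    # constant width K = 16

print("doubling widths:", doubling)
print("fixed K = 32   :", fixed_32)
print("fixed K = 16   :", fixed_16)
```

Under these assumptions the doubling encoder is dominated by its deepest level, which is the growth that a fixed K avoids.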
The regular convolutional operation and residual blocks in the proposed backbone can be mathematically expressed as follows: where $x$ is the input feature map, $s$ is the stride of the convolution, $K$ is the number of kernels, $m$ is the dimension of the kernels; $\theta$ contains the weights and biases of all kernels, $f(\cdot)$ is the activation function applied to the result of the convolution, $\circledast_s$ is the strided convolution operation, $w_j$ and $b_j$ are the weight and bias of the $j$th kernel, respectively. Based on this definition of the convolution operation, the residual block can be expressed as: where $c_{i-1}$ and $c_i$ are the input and output of the residual block, respectively. $c_i^1$, $c_i^2$, $c_i^3$ are the outputs of the three convolution operations. $\oplus$ is the element-wise addition operation.
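One form of these expressions that is consistent with the symbol definitions above is sketched below. It is an assumption, not the published equations: in particular, the wiring of the three convolutions (two on the main path, one on the shortcut, with the stride $s_0$ applied to the two convolutions that read $c_{i-1}$) and the parameter sets $\theta_1, \theta_2, \theta_3$ are hypothetical.

```latex
\begin{align}
\mathrm{Conv}(x, s, K, m; \theta) &= \Big\{ f\big(w_j \circledast_s x + b_j\big) \Big\}_{j=1}^{K} \\
c_i^1 &= \mathrm{Conv}(c_{i-1}, s_0, K, m; \theta_1) \\
c_i^2 &= \mathrm{Conv}(c_i^1, 1, K, m; \theta_2) \\
c_i^3 &= \mathrm{Conv}(c_{i-1}, s_0, K, m; \theta_3) \\
c_i &= c_i^2 \oplus c_i^3
\end{align}
```

With $s_0 = 2$ in the encoder, both branches halve the resolution, so the element-wise addition operates on tensors of equal shape.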
Empirical comparison of neural networks. We evaluate our proposed models' segmentation performance and disk utilization by conducting an extensive experimental study with the original UNet as the baseline model. Then, we tune the UNet backbone by shrinking the feature width of the first layer, thereby every successive layer of its encoder, to decrease the overall parameter count. We name the resulting model Tuned-UNet. Subsequently, we add the PAC module to the proposed tuned Res-UNet (Res-PAC-UNet) to measure its performance impact. Finally, we modify the Thin UNet architecture by adding PAC modules to compare its performance with the Res-PAC-UNet architecture. For Res-PAC-UNet and Thin-PAC-UNet, we are limited to constant feature widths of 16 and 32 in the backbones due to memory constraints on the GPU. Furthermore, we include Res-UNet++ 28 , Multi-Res-UNet 29 , DC-UNet 30 , and TMD-UNet 33 architectures proposed for liver CT segmentation in our study to establish a thorough comparison with the literature.
Pyramid atrous convolution module. Zhao et al. 34 propose a Pyramid Scene Parsing (PSP) module to improve semantic segmentation performance by enhancing contextual relationships between the image regions and serving as a global contextual prior. The PSP module extracts multi-scale contextual features by performing max pooling operations at different scales, followed by 1 × 1 convolutions and concatenation. However, the PSP module results in a loss of spatial information due to the pooling operations. Chen et al. 35 overcome this limitation of the PSP module in the DeepLabV3 architecture by replacing the max pooling operation with atrous convolution. The resulting Atrous Spatial Pyramid Pooling (ASPP) utilizes convolutions with different dilation rates to capture varying fields of view in the feature maps, thereby generating multi-scale features and controlling the receptive field of the network. It can be observed that the 3 × 3 convolution in the ASPP module degrades to a 1 × 1 convolution for edge pixels of a feature map due to the large dilation between the convolutional filter's weights. Chen et al. 35 recover the missing edge pixel information by using global average pooling (GAP) followed by upsampling. This paper proposes the PAC module based on ASPP to segment abdominal 3D CT scans. Fundamentally, PAC is a 3D extension of ASPP with lower dilation rate convolutions (i.e., 6, 12, and 18). We drop high dilation rate convolutions (e.g., dilation rate = 24) because the overall size of the kernel becomes similar to the dimension of the feature maps in the deeper layers of the encoder; dropping them also decreases the parameter count of the PAC module. These convolutions may not be helpful because they may capture features from different corners of the CT scan rather than emphasizing the liver region. The lower dilation 3D convolutions allow PAC to extract organ-specific multi-scale volumetric features.
These features allow the network to utilize essential intra- and inter-slice information to differentiate the liver from the background. The PAC module is placed at the deeper skip connections of the tuned Res-UNet backbone to pass multi-scale features from multiple levels of the encoder, thus preventing duplication of low-level features and smoothing of anatomical boundaries. Finally, the decoder utilizes multiscale information when upsampling the feature maps at different stages to construct liver segmentation masks effectively. Figure 2 shows the components of the PAC module. The dilated convolutions used in the PAC module can be mathematically expressed as: where $r$ is the dilation rate, $w_j^r$ and $b_j^r$ are the weight and bias of the dilated $j$th kernel, respectively. The rest of the parameters have the same meaning as in the regular convolution defined previously. Based on this definition of the dilated convolution operation, the PAC module can be given by: where $I$ and $O$ are the input and output of the PAC module, respectively. $pac_i$, $1 \le i \le 5$, are the outputs of the five sub-operations within PAC. ⌢ is the tensor concatenation operation. GAP represents the global average pooling operation. Upsample3D rescales the feature map to the same dimension as the $pac_i$, $1 \le i \le 4$.
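The argument about dropping the highest dilation rate rests on how far a dilated kernel reaches along each axis. A minimal sketch of that arithmetic follows; it uses the standard relation k + (k − 1)(r − 1) for the span of a k-tap kernel with dilation rate r. The helper names and the number of stride-2 stages used in the example call are illustrative assumptions, not values taken from the paper.

```python
def effective_kernel_size(k, r):
    """Span, in voxels along one axis, of a k-tap kernel with dilation rate r."""
    return k + (k - 1) * (r - 1)


def feature_map_size(input_size, num_downscales):
    """Per-axis size after `num_downscales` stride-2 stages (integer division)."""
    return input_size // (2 ** num_downscales)


# Spans of a 3-tap kernel for the rates kept in PAC (6, 12, 18) and the dropped rate (24).
for rate in (6, 12, 18, 24):
    print(f"dilation {rate:2d} -> span {effective_kernel_size(3, rate)} voxels")

# Hypothetical example: in-plane size of a 256-wide input after three stride-2 stages.
print("feature map size:", feature_map_size(256, 3))
```

Comparing the printed spans with the feature-map size at a given encoder level shows how quickly a large dilation rate reaches across, or beyond, the whole map.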

Loss function.
A loss function is an essential component of a neural network training procedure because it effectively quantifies the discrepancies between the ground truth and prediction. For the image segmentation task, the neural network needs to learn the ROIs' area, statistical distribution, and boundaries.
Modified surface loss. Kervadec et al. 32 propose a boundary loss function using the distance metrics of the shape contours for quantifying the anatomical boundary errors. The boundary loss is described using a graph-based optimization for estimating the gradient flow for curve evolution. The components of the boundary loss are the regional softmax probabilities of the pixels in the predicted segmentation mask ($M_\theta$) and the level-set function pre-computed on the ground truth ($\phi_G$).
Here, the boundary loss is computed by multiplying the probabilities in the prediction with the level-set function of the ground truth and integrating the result over all the pixels. Kervadec et al. 32 highlight that the combination of boundary loss with a region-based loss function (surface loss) provides up to 8% performance improvement in Dice coefficient. We modify the surface loss function by replacing the generalized Dice loss with combo loss (sum of Dice loss and focal loss) that emphasizes the ROIs' class and area distribution, aiming to improve the class accuracy metrics. Furthermore, we propose an alternative weight shifting strategy, shifting the weight from 0.99 to 0.25 on the combo loss and from 0.01 to 0.75 on the boundary loss. The initially high weight on the combo loss ensures that the network learns the area and statistical distribution of the liver in the earlier epochs. As the weight shifts towards the boundary loss in later epochs, the network is trained to learn the anatomical boundaries of the liver. To add stability to network training, the suggested weight shifting strategy ensures that the combo loss retains a fair portion of the net weight at the end of the training.
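The weight shifting strategy can be sketched as a per-epoch schedule. The endpoints (0.99 to 0.25 on the combo loss, 0.01 to 0.75 on the boundary loss) are taken from the text; the linear interpolation between them and the function names are assumptions, since the shape of the schedule is not specified.

```python
def loss_weights(epoch, total_epochs, combo_start=0.99, combo_end=0.25):
    """Return (combo_weight, boundary_weight) for a zero-based epoch.

    Linear interpolation between the endpoints is an assumption; only the
    endpoint values come from the paper. The two weights always sum to 1.
    """
    if total_epochs < 2:
        return combo_end, 1.0 - combo_end
    clamped = min(max(epoch, 0), total_epochs - 1)
    t = clamped / (total_epochs - 1)
    combo = combo_start + t * (combo_end - combo_start)
    return combo, 1.0 - combo


def modified_surface_loss(combo_loss, boundary_loss, epoch, total_epochs):
    """Weighted sum of precomputed scalar combo and boundary loss values."""
    combo_w, boundary_w = loss_weights(epoch, total_epochs)
    return combo_w * combo_loss + boundary_w * boundary_loss


for epoch in (0, 75, 149):
    print(epoch, loss_weights(epoch, 150))
```

Keeping the final combo weight at 0.25 rather than letting it decay to zero matches the stated goal of retaining a region-based term until the end of training.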

Experimental setup
Dataset and pre-processing. We employ the liver CT scans provided in the Medical Segmentation Decathlon 36 to train our models. The liver segmentation challenge contains 201 contrast-enhanced CT scans divided into the training (131 scans) and test (70 scans) sets. The spatial dimension of the CT scans is 512 × 512, with the number of slices in the range of (50, 1100). The CT scans belong to patients suffering from HCC and other liver diseases resulting from lung, breast, or colorectal cancers. The liver decathlon dataset was acquired at the IRCAD Hôpitaux Universitaires, Strasbourg, France, and shares a small number of CT scans with the 2017 Liver Tumor Segmentation (LiTS) challenge 37. The ground truths of the CT scans in the test set are undisclosed because of ongoing community challenges. For this reason, we split the original training set of the dataset and performed training with 101 scans and testing with 30 scans. Despite the quality measures taken during the acquisition, the CT scans have poor contrast, overlapping anatomical boundaries, noise, and significant variations in axial resolution, making the image pre-processing, neural network training, and evaluation of liver CT scans challenging. Figure 3 shows the deep learning CT segmentation framework. In the pre-processing stage, we read the file using the Nifti loader and cap the image intensities of all the scans in the range [−500, 500] HU. Next, we perform min-max normalization to rescale the image intensities to [−1, 1]. One significant challenge while designing networks for 3D CT is VRAM consumption. We resize the spatial dimensions of the input scans to 256 × 256 and resample 64 slices from the liver region of each scan to minimize the VRAM footprint of the network. In addition, we replace the tumor label in the ground truth with the liver label for training the networks for liver CT segmentation.
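The intensity capping, normalization, and label merging steps can be sketched in plain Python on flat lists of values. This is an illustration, not the authors' pipeline: it assumes that the min-max normalization uses the clipping bounds (−500 and 500 HU) as the minimum and maximum, and that the ground truth encodes background, liver, and tumor as 0, 1, and 2. Both are assumptions.

```python
HU_MIN, HU_MAX = -500.0, 500.0  # clipping window from the text


def clip_and_normalize(hu_values, lo=HU_MIN, hi=HU_MAX):
    """Clip intensities to [lo, hi] HU, then rescale linearly to [-1, 1].

    Using the clip bounds as the min-max range is an assumption.
    """
    normalized = []
    for value in hu_values:
        clipped = min(max(float(value), lo), hi)
        normalized.append(2.0 * (clipped - lo) / (hi - lo) - 1.0)
    return normalized


def merge_tumor_into_liver(labels, liver=1, tumor=2):
    """Relabel tumor voxels as liver (label values are assumed, not from the paper)."""
    return [liver if label == tumor else label for label in labels]


print(clip_and_normalize([-1000, -500, 0, 500, 2000]))
print(merge_tumor_into_liver([0, 1, 2, 2, 0]))
```

In a real pipeline the same element-wise operations would be applied to whole 3D arrays rather than Python lists.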
We perform data augmentation on the processed CT scans using the volumentations 38 package to reduce overfitting. The augmentations are randomly applied to the scans and include flips (along the X and Y axes) and transpose operations.
Implementation details. To minimize the I/O and computational costs, we pre-process the CT scans and store them in RAM before the training. Additionally, we define the neural networks in Keras and utilize the TensorFlow dataset generator with prefetching to ensure that the neural networks are efficiently fed with augmented scans and the ground truth. The networks are trained for 150 epochs to ensure model convergence (Fig. 4). The Adam optimizer (learning rate = 0.0001) and a batch size of 1 are used for updating the network parameters. We train the Res32-PAC-UNet with three different loss functions to understand their performance impact. The remaining models are trained using the modified surface loss function. The model weights resulting in the highest Dice coefficient on the test set are saved using the Keras callbacks and are utilized during the model evaluation phase.
The models are trained on an HP Z8 workstation with an Intel® Xeon® Silver 4216 CPU with a 2.10 GHz base clock (64 cores) and 128 GB of system memory. The workstation also contains an Nvidia Quadro RTX 5000 GPU with 16 GB of VRAM. With the above implementation, the training procedure for the Res32-PAC-UNet model took approximately 12 h.
Evaluation metrics. We evaluate the segmentation performance of the networks mainly by computing area/volume overlap and class-based accuracy. The metrics are defined as follows, where TP, TN, FP, and FN denote the numbers of true-positive, true-negative, false-positive, and false-negative pixels, respectively. Dice Coefficient (DC) and Symmetric Volume Difference (SVD): DC is a region-based metric described as $\frac{2TP}{2TP + FP + FN}$. Specificity: Specificity measures the ratio of correctly marked negative pixels to the total number of negative pixels in the ground truth. Specificity is expressed as $\frac{TN}{TN + FP}$. Sensitivity: Sensitivity measures the ratio of correctly marked positive pixels to the total number of positive pixels in the ground truth. Sensitivity is calculated as $\frac{TP}{TP + FN}$.
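As a worked illustration of these metrics, the sketch below computes them from two flattened binary masks. The Dice coefficient uses the standard overlap form 2TP / (2TP + FP + FN); the function names and the convention of returning 1.0 when a denominator is zero are choices made for this example, not details from the paper.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, TN, FN over two equally long flattened binary masks."""
    tp = fp = tn = fn = 0
    for p, g in zip(pred, truth):
        if p and g:
            tp += 1
        elif p and not g:
            fp += 1
        elif g:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def dice_coefficient(tp, fp, fn):
    """2TP / (2TP + FP + FN); defined as 1.0 when both masks are empty."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def sensitivity(tp, fn):
    """TP / (TP + FN): fraction of ground-truth positives that were found."""
    return tp / (tp + fn) if (tp + fn) else 1.0


def specificity(tn, fp):
    """TN / (TN + FP): fraction of ground-truth negatives left unmarked."""
    return tn / (tn + fp) if (tn + fp) else 1.0


pred = [1, 1, 0, 0, 1, 0]
truth = [1, 0, 0, 0, 1, 1]
tp, fp, tn, fn = confusion_counts(pred, truth)
print(dice_coefficient(tp, fp, fn), sensitivity(tp, fn), specificity(tn, fp))
```

For the toy masks above, TP = 2, FP = 1, TN = 2, and FN = 1, so all three metrics evaluate to 2/3.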

Results and discussion
Impact of loss functions. Table 1 shows the segmentation performance summary of the Res32-PAC-UNet model for three different loss functions to evaluate their suitability for liver CT segmentation. The results indicate that the use of binary cross-entropy (BCE) and focal loss leads to sub-optimal segmentation performance, suggesting that the statistical distribution of the classes does not provide sufficient information to the network for achieving high segmentation accuracy. The use of region overlaps and class distribution in the modified surface loss offers an acceptable boost to the segmentation performance, highlighting that area/volume overlap information is essential for segmentation tasks. The modified surface loss further maximizes the segmentation accuracy of the network by employing the boundary loss to refine the edges of the predicted masks. Figure 4 shows the 3-point moving average of the DC of the Res32-PAC-UNet model trained using different loss functions for the first 100 epochs. We apply the moving average to smooth out the abrupt changes in the DC curve caused by the stochastic update of network weights. It can be deduced that the modified surface loss provides high initial segmentation accuracy and faster convergence in earlier epochs, attaining an 80% DC in the first five epochs. The modified surface loss allows the Res32-PAC-UNet to achieve the highest segmentation accuracy among the tested loss functions with accelerated convergence. These results can be explained by the dynamic nature of the loss, which shifts the weights from the combo loss to the boundary loss during training. The initial emphasis on the combo loss allows the model to learn the volumetric ROI effectively. In contrast, a definite focus on the boundary loss in the later epochs improves the edge precision of the predicted segmentation masks.
In our empirical study, we train the remaining neural networks with the modified surface loss function because of its emphasis on crucial aspects of segmentation masks and superior segmentation accuracy.
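The smoothing applied to the DC curves can be sketched as a 3-point moving average. The paper does not state whether the window is trailing or centered; the version below assumes a trailing window and averages over the available points for the first two epochs. The DC values in the example are made up for illustration only.

```python
def moving_average(values, window=3):
    """Trailing moving average; early entries average over the points seen so far."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the point that left the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


# Hypothetical per-epoch DC values, not results from the paper.
dc_per_epoch = [0.60, 0.82, 0.78, 0.88, 0.85, 0.91]
print(moving_average(dc_per_epoch))
```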

Impact of PAC module on segmentation performance.
We conduct a comprehensive empirical study to quantify the impact of using the tuned residual UNet backbone and the PAC module. The performance of the UNet model is established as the baseline for segmentation accuracy and parameter count. Table 2 shows the segmentation performance of the proposed models, including the Res-PAC-UNet, Thin-PAC-UNet, and Tuned-UNet. We also note that the Res32-PAC-UNet outperforms the remaining models on all metrics except the specificity. Figure 4 shows the 3-point moving average of the DC for the UNet, Tuned-UNet, Res32-UNet, and Res32-PAC-UNet over the first 50 epochs. The UNet experiences slow learning in the first 20 epochs because of its many parameters. On the other hand, the Tuned-UNet model achieves an 80% DC in 10 epochs due to its tuned feature widths throughout the backbone, requiring less training with a limited training set (101 CT scans). The Res32-PAC-UNet architecture shows the fastest learning on the test set by attaining 80% DC in the first five epochs.
The best-in-class segmentation performance and faster convergence of the Res32-PAC-UNet could be associated with the choice of the tuned residual backbone and the usage of PAC modules. The network's backbone has a fixed width of 32 to avoid the exponential growth of the parameters, minimizing disk utilization. Additionally, it employs residual blocks to improve information and gradient flow, thus allowing the model to learn quickly. The usage of PAC modules over the skip connections prevents the duplication of low-level features from the encoder to the decoder, replacing them with informative multi-scale volumetric features. Altogether, the Res32-PAC-UNet model overcomes the pitfalls of the conventional UNet and its variants for liver CT segmentation and delivers better segmentation performance with lower disk utilization.

Model parameters and storage utilization.
The Res-PAC-UNet architecture provides the desired tradeoff between segmentation accuracy and disk utilization by varying the feature width in the backbone. Table 2 compares the proposed methods and related work using parameter count and model size. The UNet model provides acceptable segmentation performance but contains nearly 22 million parameters, resulting in a model size of 270 MB. The Tuned-UNet decreases the parameter count and storage space by up to 4× by limiting the parameters in the early layers of the encoder. The model experiences an improvement in almost all metrics due to its tuned parameter backbone. Specifically, the DC increases from 91.9% to 95.5%, relative to the UNet. This observation suggests that backbones with restricted parameters may outperform UNet in scenarios with limited data. Interestingly, the Thin16-PAC-UNet and Thin32-PAC-UNet approach the segmentation performance of the Tuned-UNet model with nearly 12× and 4.6× fewer parameters and storage requirements, respectively. The segmentation performance of Thin-PAC-UNets relative to the Tuned-UNet highlights the performance gains due to the PAC module in thin, lightweight architectures.
The Res16-UNet backbone has the lowest parameter count and disk utilization (2.18 MB) while outperforming the baseline UNet model. On the other hand, Res16-PAC-UNet outperforms Thin16-PAC-UNet and matches the performance of Thin32-PAC-UNet in terms of segmentation metrics while having less than half the parameters and model storage requirements. The higher performance of the Res16-PAC-UNet is due to better feature/gradient propagation in the residual backbone relative to the Thin-PAC-UNet architecture. The Res32-PAC-UNet model outperforms all the models in the empirical study while limiting the parameters to 1.2 million and the model size to 15.1 MB. The recently proposed Res-UNet++ 28 architecture has performance closer to the Res32-PAC-UNet, but contains nearly 10× more parameters. DC-UNet outperforms the Multi-Res-UNet, indicating that the dual-channel pathways in the convolution blocks of the backbone can assist the network in improving the segmentation performance. However, the improvement in the performance of DC-UNet comes at the cost of increased network parameters. Finally, analyzing the results of TMD-UNet suggests that 2D convolution-based neural networks with a significant parameter count (i.e., 9.1 million) are unable to capture long-term dependency in the axial direction of the CT scan, thereby limiting the segmentation accuracy. Based on the analysis of segmentation accuracy, parameter count, and model size, we suggest the Res16-PAC-UNet model for machines with disk constraints because it is closer in accuracy to the Tuned-UNet and has a smaller model size (i.e., nearly 11.8× smaller). Alternatively, when the segmentation accuracy is of utmost importance, the Res32-PAC-UNet achieves the best-in-class accuracy with 18× and 4.6× fewer parameters than the UNet and Tuned-UNet models, respectively.

Qualitative analysis of segmentation masks.
Qualitative analysis is also a crucial aspect of evaluating the proposed neural network's performance. Figure 5 provides a qualitative comparison of the segmentation masks by overlaying a predicted segmentation mask (in yellow) on the ground truth (in red). In addition, the liver regions excluded from the ground truth have been highlighted in the CT slices. The CT slices highlight the significant obstacles in liver segmentation; for instance, similar image intensities of the neighboring organs and significant boundary variations between adjacent CT slices. The UNet architecture accurately predicts the segmentation mask by excluding the areas outside the ROI along the boundary and the center of the liver. The Tuned-UNet architecture slightly over-segments the liver at the edges because of its restricted parameter growth in the backbone. The Thin-PAC-UNet models over-segment the excluded regions at the borders and the center of the liver. Similarly, the Res16-UNet and Res16-PAC-UNet models over-segment due to the limited parameters in their backbones. On the other hand, adding the PAC module to the Res32-UNet backbone significantly improves segmentation mask quality, indicating that the fixed-width residual backbone provides more relevant information to the PAC modules relative to Thin-UNet backbones. The Res32-PAC-UNet generates segmentation masks comparable to the UNet while having a fraction of the parameter count.
Inference time and future directions. Table 2 ... nearly half (i.e., 0.266 s). The Res16-UNet backbone attains the lowest inference time of 0.249 s. We can observe that adding PAC modules to the Res-UNet backbones increases the inference times, highlighting one key limitation of the module. To elaborate, Res32-PAC-UNet has a similar inference time to that of UNet while having 18× fewer parameters, suggesting that the network fragmentation and element-wise operations 39 in PAC modules may impact the run time of a network. Nonetheless, the PAC module allows the lightweight neural networks to maximize their segmentation performance while keeping the disk utilization several times lower than that of the UNet and its variants. TMD-UNet and Res-UNet++ take a few seconds to generate a prediction, suggesting that these networks perform heavy computations for predicting liver masks.
In the future, we aim to extend this work to determine whether Res-PAC-UNet maintains the same performance for segmentation of liver tumors, vessels, and other organs (e.g., kidney, spleen, and pancreas) across 3D imaging modalities (i.e., CT and MRI). We would also like to construct Res-PAC-UNet-like architectures using state-of-the-art computer vision findings that can outperform well-known segmentation models while achieving lower disk utilization and inference times. We plan to achieve this by enhancing the segmentation performance using knowledge distillation, and by pruning the networks using the TensorRT framework or applying quantization-aware training. Additionally, we think that it may also be beneficial to design networks that achieve acceptable segmentation performance on the CPU.

Conclusion
In this paper, we propose a novel Res-PAC-UNet architecture that provides a good trade-off between segmentation accuracy and model size. The proposed model employs a tuned fixed-width residual backbone with PAC modules to provide higher segmentation performance with fewer weights and lower disk utilization. The residual backbone restricts the exponential growth rate of the parameters while improving the information and gradient flow, thus assisting the PAC modules present over the skip connections to extract relevant multi-scale volumetric features. The proposed networks are trained with a modified surface loss function to maximize the segmentation performance. Subsequently, we conduct an empirical study to compare the quantitative and qualitative segmentation performance of the models. We have found that the Res16-PAC-UNet contains fewer weights for liver CT segmentation, while the Res32-PAC-UNet maximizes the segmentation performance. Thus, the proposed network gives radiologists the flexibility to choose models as per their requirements.

Data availability
The datasets generated and/or analysed during the current study are available in the Medical Segmentation Decathlon 36 repository.