Fully automated deep learning based auto-contouring of liver segments and spleen on contrast-enhanced CT images

Manual delineation of liver segments on computed tomography (CT) images for primary/secondary liver cancer (LC) patients is time-intensive and prone to inter/intra-observer variability. Therefore, we developed a deep-learning-based model to auto-contour liver segments and spleen on contrast-enhanced CT (CECT) images. We trained two models using a 3D patch-based attention U-Net (M_paU-Net) and the 3D full-resolution configuration of nnU-Net (M_nnU-Net) to determine the best architecture (BA). BA was then used with vessels (M_vess) and spleen (M_seg+spleen) to assess the impact on segment contouring.
Models were trained, validated, and tested on 160 (C_RTTrain), 40 (C_RTVal), 33 (C_LS), 25 (C_CH) and 20 (C_PVE) CECT images of LC patients. M_nnU-Net outperformed M_paU-Net across all segments, with median differences in Dice similarity coefficients (DSC) ranging from 0.03 to 0.05 (p < 0.05).
M_seg+spleen and M_nnU-Net were not statistically different (p > 0.05); however, both were slightly better than M_vess, by DSC up to 0.02. The final model, M_seg+spleen, showed a mean DSC of 0.89, 0.82, 0.88, 0.87, 0.96, and 0.95 for segments 1, 2, 3, 4, 5–8, and spleen, respectively, on the entire test sets. Qualitatively, more than 85% of cases showed a Likert score ≥ 3 on the test sets. Our final model provides clinically acceptable contours of liver segments and spleen which are usable in treatment planning.

Liver cancer is the third most common cause of cancer-related deaths globally and resulted in roughly 700,000 deaths in 2020 [1]. Surgery (resection or lobectomy) is considered the main line of treatment, especially in colorectal liver metastases [2], in which one or more segments or an entire lobe is removed depending upon the extent of the tumor [3,4]. However, the ability to perform liver surgery is largely dependent upon accurate localization of the tumor with respect to the segments and on the volumetric measurement of liver segments, as this allows the clinician to ensure that the patient would have the minimum remnant functional liver volume after the surgery (e.g., 20% in normal liver) [4]. To quantify the functional liver volume, radiologists/technologists perform manual contouring of segments on contrast-enhanced CT (CECT) images following the architecture of vessels, ligaments and organs [5]. However, manual contouring is time-intensive [6] and prone to inter/intra-observer variability [7], which can affect the volumetric measurement and subsequent clinical use. Therefore, automation of liver segment contouring is crucial to evaluate the eligibility of a patient for liver surgery.
Several semi-automatic and automatic segmentation approaches exist, but recent deep learning (DL) based models have outperformed other methods in terms of required time and segmentation accuracy across various organ sites [8]. Recent surveys have reported a plethora of architectures used in medical image segmentation, of which U-Net based architectures are widely used for organ segmentation [9,10]. In particular, 3D U-Net (the 3D extension of U-Net) is of great importance as it offers two major features: (1) training with sparse volumetric data and (2) input of a 3D volume/patch in the training, which allows the architecture to retain more features in contrast to 2D input [11]. Both of these features make 3D U-Net more applicable in 3D organ segmentation and, as a result, several studies [12,13] have reported reasonable accuracy and clinically translatable performance of organ segmentation with 3D U-Net. Currently, nnU-Net is one of the state-of-the-art segmentation frameworks; it utilizes U-Net based architectures (combined or individual 2D and 3D U-Net) to train segmentation models [14] and has shown excellent, clinically translatable performance in abdominal segmentation [15]. In addition, nnU-Net is a self-configuring framework that automatically performs hyperparameter tuning and data augmentation, which promises to result in higher segmentation accuracy [14]. However, a 3D input patch also implies the inclusion of features from irrelevant regions, which involves a large number of trainable parameters and results in an excessive requirement of computational resources. To address such issues, attention-based gating was implemented by Oktay et al. 2018 in the standard 2D U-Net; it uses attention coefficients to identify relevant image features and merges them just before the concatenation operation in the skip-connection phase [16]. Additionally, Attention U-Net showed consistent, significant performance improvements when compared with 3D U-Net [16]. However, since a 3D input patch would also preserve a higher number of relevant features compared to 2D input, it is reasonable to implement the attention mechanism in the multiple skip connections of the standard 3D U-Net and test whether it improves the segmentation accuracy.
Additionally, with regard to model training for segment contouring in patients with primary and metastatic liver disease, the architecture has to face liver-specific anatomical challenges which could result in uncertainties in the demarcation of liver segments. For example, the occlusion of vessels due to tumor could result in distortion of liver contours and liver segments. Both aforementioned issues could be addressed if we can implement localization of vessels during the training. Another important condition is enlargement of the liver and spleen in cancer patients, in which the spleen abuts segments 2 and 3, resulting in incorrect separation of these segments from the spleen. One possible approach to address such issues is training the model with both segments and spleen. Currently, very few DL-based liver segmentation studies have investigated automated segmentation accuracy on CT images of patients with liver tumors. Tian et al. 2019 implemented global and local context U-Nets (GLC-UNet), which first segment the whole liver and then utilize localized vessel-based slice features to segment the Couinaud segments [17]. GLC-UNet achieved a mean segment Dice similarity coefficient (DSC) of 0.92. Additionally, a recent study by Lee et al. 2022 developed two different models to separately contour the liver segments and spleen and achieved a median DSC of around 0.91 across the segments [18].
In this study, our central goal is to develop a fully automated segmentation model that can achieve consistent, robust, expert observer-level accuracy in liver segment contouring to guide liver surgery planning. To achieve this goal, we established three main aims: (1) to determine the best architecture for auto-segmentation of liver segments by comparing the performance of a 3D patch-based attention U-Net (paU-Net) against the gold-standard framework of nnU-Net; (2) to determine whether the addition of vessels and spleen during segmentation training could improve the segmentation of liver segments; and (3) to perform quantitative and qualitative assessment of the model across patients undergoing radiotherapy (RT), general evaluation for liver surgery, and portal vein embolization (PVE), and on CT-based liver pathologies used in various segmentation challenges.

Overall framework
Figure 1 shows the overall workflow of our study, which involves three major blocks. Starting in the architecture selection block (block 1), we investigated the best architecture by comparing the attention 3D U-Net and the 3D full-resolution configuration of nnU-Net. In the uncertainty improvement block (block 2), we investigated whether the addition of vessels and spleen during model training improves the segmentation results while using the best architecture identified from block 1. Lastly, in the model assessment block (block 3), all the models were evaluated on surgery candidates' CT scans, patients who received portal vein embolization, non-contrast CT images, and external CT datasets from various segmentation challenges [19][20][21][22].

Datasets and patient population
The study included two major data groups, namely an internal data group (IDG) and an external data group (EDG). The IDG consisted of contrast-enhanced CT (CECT) scans of patients diagnosed with primary and metastatic liver cancer at our institution. Within the IDG, we have four cohorts. The radiotherapy cohort (C_RT) consisted of 100 patients, each with a radiotherapy planning and a 3-month follow-up CECT image. The surgery cohort (C_LS) included 33 CT scans of patients who were being evaluated for liver surgery. The non-contrast cohort (C_NC) included 20 non-contrast CT (non-CECT) scans of patients whose contrast scans were used in the training. The portal vein embolization cohort (C_PVE) included 20 CT scans of patients undergoing portal vein embolization (PVE) of the liver. All patients from the internal data group were retrospectively enrolled in a Health Insurance Portability and Accountability Act-compliant, institutional review board (IRB) approved study (The University of Texas MD Anderson Cancer Center IRB PA18-0832) with a waiver of informed consent. Use of data was approved by the IRB and all experiments were performed in accordance with relevant guidelines and regulations.

Manual and AI edited segmentations
Ground-truth segmentations of the patient datasets included liver segments 1, 2, 3, 4, 5-8 (combined), spleen, and vessels. Two major approaches were used to contour the liver segments. In the first approach, an in-house nnU-Net model trained on a subset of C_RTTrain was used to contour liver segments on C_RT and C_LS. Afterwards, the model-generated contours were edited or fully recontoured by a radiologist (MA) as needed. In the second approach, liver segments were manually contoured by the radiologist MA on C_PVE and C_CH without any assistance from AI models. Additionally, spleen contours on C_RT and C_CH were first created by an nnU-Net model trained on Task 9 of the Medical Segmentation Decathlon dataset [20] and were manually edited by a radiologist (MA) or students (SR and ACG). On C_PVE, C_LS, and C_NC, the spleen was manually contoured by ACG without using any AI segmentations. Lastly, the reader is referred to Sect. "Uncertainty improvement-impact of vessels and spleen" for the mechanism behind the vessel contours.

Architecture selection
We investigated two variants of 3D U-Net in this study. First, we developed a 3D patch-based U-Net with an attention mechanism based on the standard 3D U-Net [11] and the attention gate [16]. As shown in Fig. 1A and B, in the analysis path, a patch of size 256 × 256 × 24 was input to the network. The network consisted of 4 layers with 2 blocks in each layer. A 3 × 3 × 3 convolution is performed at each block with group normalization and Leaky ReLU, followed by a 2 × 2 × 2 max pooling before transitioning to the next layer. In the decoder, blocks within each layer undergo up-sampling through a 3 × 3 × 3 convolution. A skip-layer with concatenation is implemented, which feeds the feature map from the corresponding block in the encoder to the attention gate. The attention gate suppresses the irrelevant features and noise as per the standard methodology [16]. The gated feature is then concatenated to the transposed block in the decoder. A final 1 × 1 × 1 convolution is performed in the last layer of the decoder path to produce the image with the selected number of classes. Categorical cross-entropy is used as the loss function for validation. To identify the best hyperparameters, we performed multiple trainings (epochs = 1000) using stable and cyclic learning rates (rate = 0.0001) for number of blocks = 2 and 3 and number of filters = 16, 32, 48, and 64. As a result, 16 models were trained.
Second, we investigated the 3D full-resolution configuration of nnU-Net, which is also a patch-based 3D U-Net. nnU-Net automatically generates the segmentation pipeline specific to the dataset through its three major groups of parameters (fixed, rule-based, and empirical), which handle all the preprocessing, training and postprocessing for the datasets [14]. Unlike our in-house architecture, nnU-Net automatically selects the hyperparameters that are suitable for a dataset. Figure 1C shows an example of the nnU-Net architecture that was used to train the model. A patch of size 192 × 192 × 48 with a batch size of 2 is input to the architecture with 5 layers, 2 blocks, and 32 filters. In the encoder, there is a 3 × 3 × 3 convolution followed by instance normalization (IN) and a 2 × 2 × 2 max pooling. In the decoder, blocks undergo up-sampling using the same mechanism as described for the 3D U-Net. Data augmentation was performed automatically as described in the nnU-Net guidelines [14]. A combination of DSC and cross-entropy is used as the loss function.
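The count of 16 paU-Net trainings follows from the search space given above: 2 learning-rate schedules × 2 block counts × 4 filter counts. A minimal sketch of that enumeration is shown below; the dictionary keys and the function name `build_grid` are illustrative and are not taken from the study code.

```python
from itertools import product

# Values stated in the text; the names below are illustrative only.
LR_SCHEDULES = ("stable", "cyclic")
BLOCK_COUNTS = (2, 3)
FILTER_COUNTS = (16, 32, 48, 64)
LEARNING_RATE = 1e-4
EPOCHS = 1000


def build_grid():
    """Return one configuration dict per combination of the searched values."""
    return [
        {
            "lr_schedule": schedule,
            "learning_rate": LEARNING_RATE,
            "epochs": EPOCHS,
            "blocks": blocks,
            "filters": filters,
        }
        for schedule, blocks, filters in product(
            LR_SCHEDULES, BLOCK_COUNTS, FILTER_COUNTS
        )
    ]


grid = build_grid()
print(len(grid))  # 16
```

Each dictionary would correspond to one training run; the configuration with 3 blocks and 64 filters is one of the 16 entries.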
To identify the best architecture, we trained two models, one based on the patch U-Net (M_paU-Net) and one based on nnU-Net (M_nnU-Net), to predict the segmentation of segments 1, 2, 3, 4, and 5-8. Both models were trained with five-fold cross-validation, and the fold predictions were combined using an ensemble approach. For M_paU-Net, the majority vote and STAPLE algorithms from SimpleITK v2.2.1 were implemented to fuse the predictions of the five folds. For M_nnU-Net, the default configuration of nnU-Net (average ensembling) was used [14]. Quantitative and statistical analyses were performed (as per Sect. "Data analysis") to select the best architecture model, M_Best-Architecture.
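As an illustration of the simpler of the two fusion rules, the sketch below applies a voxel-wise majority vote to five fold predictions stored as flat label lists. It is written in plain Python rather than with SimpleITK, STAPLE is not shown, and the tie rule (ties fall back to label 0) is an assumption of this sketch, not a documented setting of the study.

```python
from collections import Counter


def majority_vote(fold_predictions, undecided_label=0):
    """Fuse per-fold label maps (flat sequences of equal length) voxel by voxel.

    A label wins only with strictly more votes than every other label;
    tied voxels receive `undecided_label` (assumed behaviour for this sketch).
    """
    n_voxels = len(fold_predictions[0])
    if any(len(p) != n_voxels for p in fold_predictions):
        raise ValueError("all fold predictions must have the same length")

    fused = []
    for votes in zip(*fold_predictions):
        ranked = Counter(votes).most_common(2)
        tied = len(ranked) == 2 and ranked[0][1] == ranked[1][1]
        fused.append(undecided_label if tied else ranked[0][0])
    return fused


# Five hypothetical fold predictions for four voxels (labels 0-4).
folds = [
    [1, 1, 2, 0],
    [1, 2, 2, 0],
    [1, 2, 3, 0],
    [1, 1, 3, 4],
    [1, 1, 2, 4],
]
print(majority_vote(folds))  # [1, 1, 2, 0]
```

Under this tie rule, voxels on which the folds disagree evenly are left unlabeled, which is one way a voting scheme can produce zero-valued voxels near segment boundaries.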

Uncertainty improvement-impact of vessels and spleen
We investigated if the uncertainties in the definition of liver segment boundaries can be improved by incorporating two additional features in the training.
First, we trained a model (M_vess) using M_Best-Architecture (from block 1) to investigate whether the incorporation of vessels during the training would improve the segmentation of the liver segments. We began by generating liver vessels on C_RT (N = 200) using the liver vessel generation algorithms [23] available in a commercial treatment planning system (RayStation v12.0.110.72, RaySearch Laboratories, Stockholm, Sweden). The binary label map of vessels was added as an extra input channel using the modality function in nnU-Net, and the model was trained to predict the contours of liver segments. Second, we trained a model (M_seg+spleen) using M_Best-Architecture (from block 1) to determine whether the addition of spleen contours during the training would result in improved segmentation of liver segments, especially segments 2 and 3. The training was optimized to predict the contours of the liver segments (with segments 5-8 combined) and spleen.
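To make the "extra input channel" idea concrete, the sketch below assembles a dataset description in the style of nnU-Net v1, where each input channel is declared under the `modality` key and stored in its own file with a four-digit channel suffix. Which nnU-Net version was used, the channel name `vessel_mask`, the label names, and the case identifier are all assumptions for illustration; only a subset of the keys of a real `dataset.json` is shown, and how the binary channel is normalized is not stated in the text.

```python
import json


def build_dataset_description(case_ids, label_names, channel_names):
    """Assemble a minimal nnU-Net v1 style dataset description (subset of keys)."""
    return {
        "modality": {str(i): name for i, name in enumerate(channel_names)},
        "labels": {str(i): name for i, name in enumerate(label_names)},
        "numTraining": len(case_ids),
        "training": [
            {
                "image": f"./imagesTr/{cid}.nii.gz",
                "label": f"./labelsTr/{cid}.nii.gz",
            }
            for cid in case_ids
        ],
    }


def channel_files(case_id, n_channels):
    """Per-channel image file names: <case>_0000.nii.gz, <case>_0001.nii.gz, ..."""
    return [f"{case_id}_{ch:04d}.nii.gz" for ch in range(n_channels)]


# Hypothetical names: channel 0 is the CECT image, channel 1 the vessel mask.
channels = ["CT", "vessel_mask"]
labels = ["background", "segment_1", "segment_2", "segment_3",
          "segment_4", "segment_5-8"]
description = build_dataset_description(["case_001"], labels, channels)

print(json.dumps(description["modality"]))
print(channel_files("case_001", len(channels)))
```

In this layout the network output is unchanged (liver segments only); the vessel mask enters purely as a second image channel.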
Last, we individually compared the performance of the models M_vess and M_seg+spleen with our best architecture model M_Best-Architecture to determine whether the individual features improved the segmentation performance. Additionally, we compared the M_vess and M_seg+spleen models to determine whether one feature would have a greater impact on segmentation. To select a single best model (M_Best-Model), all the model comparisons were performed on the external validation set C_RTVal using the quantitative and qualitative assessments described in Sects. "Quantitative analysis" and "Qualitative analysis". After the optimal model was selected, all models were evaluated on all test sets to determine whether the optimal model ranking held in the test environment.
... architecture as described in Sect. "Manual and AI edited segmentations". All models in our study were trained for 1000 epochs with five-fold cross-validation. All models were evaluated on both the validation set (C_RTVal) and the test sets (C_LS, C_CH, C_PVE). While the main purpose of the validation set was to select the best model, the assessment of the models on the test sets was used to further establish the discrimination among the models' performance.

Training, validation and test set for model creation
Table 2 shows the labels used in the study and data separation for training, validation, and test across the models.

Assessment of the models on patients withheld from training/validation
Assessment of the final model on the liver surgery patients
To assess the accuracy of the models in clinical practice, we retrospectively obtained 33 CT scans of patients for whom the segment volume was assessed to determine eligibility for liver surgery. AI-predicted contours from each model were quantitatively and qualitatively evaluated as per Sects. "Quantitative analysis" and "Qualitative analysis".

Assessment of the models on challenge datasets
This test set was developed by randomly selecting a total of 25 CT images from the cohorts C_IRCAD-01, C_IRCAD-02, C_MID, and C_CHAOS. A radiologist (MA) contoured the liver segments and spleen on each CT. The liver segment and spleen contours generated by all of the models were qualitatively and quantitatively compared with the ground-truth contours.

Assessment of the models on post-portal vein embolization images
This test set was developed by obtaining CT images of 20 patients who received portal vein embolization at our institution. The main purpose of this analysis was to quantify the model's performance in the presence of liver hypertrophy and metallic artifacts. All images included some form of metallic artifact due to the embolization coil. AI-predicted contours from all models were assessed against the ground truth using both quantitative and qualitative analysis.

Perturbation analysis of the model using non-contrast images
Here, we investigated the adaptability of our models to perturbed images by using non-contrast images, which represent one possible clinical scenario. We randomly selected CECT images of 20 patients used in training and then obtained their corresponding pre-contrast CT (i.e., non-CECT) images from the same four-phase liver CT protocol examination. To generate the ground-truth contours of liver segments, we first contoured the whole liver on both the CECT and non-CECT images using our deep-learning-based model [24], and then performed whole-liver-based biomechanical deformable image registration using a previously validated algorithm [25,26]. We used models M_Best-Architecture and M_seg+spleen to predict the liver segments and spleen. M_vess was not assessed because vessels are not visible on non-contrast images. Further, no qualitative analysis was performed due to the absence of vessel information on the images. In addition to the quantitative metrics mentioned in Sect. "Quantitative analysis", mean distance to agreement (MDA) was also evaluated to further quantify the adaptability of our models when presented with perturbation.

Quantitative analysis
The performance of the models was evaluated on the validation set (N = 40), the entire test set (N = 78 in total) and the perturbation set (N = 20) using the Sørensen-Dice similarity coefficient (DSC), average Hausdorff distance (HD_A), 95th percentile Hausdorff distance (HD_95), and percent difference in volume (PDV).
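For orientation, the overlap and volume metrics can be sketched on masks represented as sets of voxel indices. The sign convention for PDV (predicted minus reference, relative to reference) and the handling of two empty masks are assumptions of this sketch; the Hausdorff-based metrics are not shown.

```python
def dice(pred, ref):
    """Sørensen-Dice coefficient between two masks given as voxel index sets."""
    pred, ref = set(pred), set(ref)
    if not pred and not ref:
        return 1.0  # convention chosen here for two empty masks
    return 2.0 * len(pred & ref) / (len(pred) + len(ref))


def percent_volume_difference(pred, ref, voxel_volume_mm3=1.0):
    """Signed volume difference of pred relative to ref, in percent (assumed sign)."""
    v_pred = len(set(pred)) * voxel_volume_mm3
    v_ref = len(set(ref)) * voxel_volume_mm3
    if v_ref == 0:
        raise ValueError("reference volume is zero")
    return 100.0 * (v_pred - v_ref) / v_ref


# Toy example: 4 reference voxels, 5 predicted voxels, 3 in common.
reference = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
predicted = {(0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 1), (1, 1, 1)}

print(round(dice(predicted, reference), 3))             # 0.667
print(percent_volume_difference(predicted, reference))  # 25.0
```

With real image volumes the same quantities would normally be computed on arrays, but the definitions are unchanged.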
For further comparison, we calculated the individual DSC differences (DSC_M1−M2) between the corresponding cases of the models of interest using Eq. (1a) and binned the results in [0.025, 0.05), [0.05, 0.1), and [0.1, 1) under the respective models based on the sign (Eq. (1b)). Lastly, the ratio of the frequency of cases within each bin from the two models of interest was used to evaluate the models (Eq. (2)).
where M_1 and M_2 are the two models of interest and could be any models from {M_paU-Net, M_nnU-Net, M_vess, M_seg+spleen}. N_M1 and N_M2 are the numbers of cases from each model meeting the criteria in Eq. (2). All parameters discussed above were assessed for segmentations corresponding to the models in Table 2.
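The binning procedure can be sketched as follows, under three assumptions that go beyond what the text states explicitly: the per-case difference is DSC(M_1) − DSC(M_2), a case is credited to M_1 when the difference is positive and to M_2 when it is negative, and the ratio for a bin is N_M1 / N_M2. Differences smaller in magnitude than 0.025 are not counted. All function names and the DSC values are hypothetical.

```python
BINS = ((0.025, 0.05), (0.05, 0.1), (0.1, 1.0))  # half-open intervals [lo, hi)


def binned_counts(dsc_m1, dsc_m2):
    """Count paired cases per bin, credited to the model with the higher DSC."""
    counts = {b: {"M1": 0, "M2": 0} for b in BINS}
    for d1, d2 in zip(dsc_m1, dsc_m2):
        diff = d1 - d2
        winner = "M1" if diff > 0 else "M2"
        for lo, hi in BINS:
            if lo <= abs(diff) < hi:
                counts[(lo, hi)][winner] += 1
                break
    return counts


def bin_ratios(counts):
    """Per-bin ratio N_M1 / N_M2; None when M2 has no cases in that bin."""
    return {
        b: (c["M1"] / c["M2"] if c["M2"] else None)
        for b, c in counts.items()
    }


# Hypothetical per-case DSC values for two models on the same six cases.
m1 = [0.92, 0.90, 0.80, 0.95, 0.70, 0.88]
m2 = [0.89, 0.84, 0.95, 0.94, 0.73, 0.80]

counts = binned_counts(m1, m2)
print(counts[(0.05, 0.1)])  # {'M1': 2, 'M2': 0}
print(bin_ratios(counts)[(0.025, 0.05)])  # 1.0
```

A ratio well above 1 in the larger bins indicates that M_1 wins by a clinically noticeable margin more often than M_2 does.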

Qualitative analysis
A unipolar Likert scale survey on a scale of 1-5 was performed by radiologists to evaluate the contours from the various datasets. To avoid observer bias, the assessments were performed by two radiologists who did not participate in delineating any contours in our study. One radiologist (SY) evaluated the contours of all models on C_LS and C_CH. Another radiologist (US) evaluated the contours of all models on C_RTVal and C_PVE. The Likert scoring criteria with the definition of each rating are shown in Table 3.

Intra- and inter-observer analysis
We selected 10 images that were used in our model training. Radiologist MA contoured the segments twice, two weeks apart, and the relative intra-observer variability in DSC was estimated. Additionally, another radiologist, JAMS, contoured the liver segments on the same patients, and the relative inter-observer variability in DSC was calculated with respect to the contours of MA.

Statistical analysis
Wilcoxon signed-rank test was performed to determine if the models were statistically different (p < 0.05).For comparison involving more than 2 models, Bonferroni correction was performed to adjust the p-values.
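The Bonferroni step amounts to multiplying each raw p-value by the number of comparisons and capping the result at 1. The sketch below shows this for three pairwise model comparisons; the raw p-values are invented for illustration and would in practice come from a paired test such as `scipy.stats.wilcoxon`, which is not called here.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of comparisons and cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def significant(p_values, alpha=0.05):
    """Flag comparisons whose Bonferroni-adjusted p-value is below alpha."""
    return [p_adj < alpha for p_adj in bonferroni_adjust(p_values)]


# Hypothetical raw p-values for the three pairwise model comparisons.
raw = {
    "M_nnU-Net vs M_vess": 0.010,
    "M_nnU-Net vs M_seg+spleen": 0.300,
    "M_seg+spleen vs M_vess": 0.030,
}
flags = dict(zip(raw, significant(list(raw.values()))))
print(flags)
# {'M_nnU-Net vs M_vess': True, 'M_nnU-Net vs M_seg+spleen': False, 'M_seg+spleen vs M_vess': False}
```

Note that a raw p-value of 0.03 is below 0.05 on its own but no longer significant after adjustment for three comparisons.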

Selection of best architecture
The best-tuned hyperparameters for paU-Net were obtained for the model with 3 blocks and 64 filters. Among all paU-Net models, this model showed the highest validation DSC of 0.75 and a low difference between training and validation DSC of 0.14.
In the comparison of paU-Net ensembling methods, the majority vote and STAPLE-based contours showed similar overall mean DSCs of 0.86 and 0.87, respectively. However, when we compared the minimum DSC across all segments, STAPLE showed an improvement of 0.052 or 5.2% on average (see Table S1). Additionally, our visual assessment revealed that the majority vote contours had more zero-valued voxels at the boundaries of segments than the STAPLE results (see Fig. S1). Therefore, we selected the STAPLE-based prediction as our final ensembling method for M_paU-Net.
Table 4 shows the volumetric and overlap metric comparison between the results of M_paU-Net and M_nnU-Net. M_paU-Net and M_nnU-Net showed overall mean (average of median) DSCs of 0.87 (0.87) and 0.89 (0.92), respectively, when assessed across all segments. The individual mean DSC values of M_nnU-Net for segments 1, 2, 3, and 4 were greater than those of M_paU-Net by 0.03, 0.04, 0.02, and 0.05, respectively. The ratio of the number of cases meeting the binned differences (Eq. (3)), i.e., f(M_nnU-Net : M_paU-Net), was > 3 for segments 2, 3, and 5-8 and was > 10 for segments 2 and 4 (see Table S2 for details). Additionally, M_nnU-Net demonstrated lower mean and median HD_95 values than M_paU-Net for each segment. The differences in mean and median HD_95 between M_paU-Net and M_nnU-Net were within 1 mm for all segments except segment 4, where the differences were 16.3 mm (mean) and 2.7 mm (median), with M_nnU-Net having superior performance. The PDV comparison showed that differences in mean and median were mostly within ±1.5%, with a few exceptions: segment 1 showed differences of −5.3% and −3.2% for mean and median, respectively, segment 2 showed −5.8% (mean), and segment 4 showed −3.9% (mean), with M_nnU-Net having superior performance in each case. Statistically, the Wilcoxon signed-rank test showed that the performance differences between the models were significant for the DSC values of all segments, with M_nnU-Net having superior performance. Further, except for segments 2 and 5-8 in HD_A and HD_95, all other metrics/segments showed statistical significance in the comparison. Lastly, as per the qualitative assessment (Table 5), 99% of cases from M_nnU-Net received an overall score ≥ 3, whereas 88% of cases from M_paU-Net received an overall score ≥ 3.
Considering the better qualitative and quantitative agreement of M_nnU-Net, we selected nnU-Net as the best architecture, i.e., M_Best-Architecture = M_nnU-Net. Hereafter, M_nnU-Net is also used to represent the best architecture, which is the nnU-Net model trained with segments only.

Impact of vessels and spleen on segment contouring/selection of best model
Tables 5 and 6 show the comparison of models M_nnU-Net, M_vess, and M_seg+spleen using the quantitative and qualitative approaches described in Sects. "Quantitative analysis" and "Qualitative analysis", respectively. For M_vess vs. M_nnU-Net, M_vess showed DSC values of 0.89 (mean) and 0.91 (average of median), which are similar to the mean DSC of 0.89 and average of median DSC of 0.92 of M_nnU-Net. Individual DSC differences (M_vess − M_nnU-Net) were within −0.01 (mean) for segments 2, 3, 4, and 5-8 and −0.02 (median) for segments 2 and 4. All other segments had mean and median DSC differences of 0. f(M_vess : M_nnU-Net) was ≤ 1:3 (0.33) (see Table S3 for details) for all segments except segment 5-8, where the ratio was 1:1. With regard to HD_95, the differences in mean and median values (M_vess − M_nnU-Net) were within ±1 mm for all cases except the mean values of segments 3, 4, and 5-8, where the differences were 2.6 mm, 1.50 mm, and 1.67 mm, respectively. Additionally, the overall differences in the mean and median PDV values were within ±2.5%. Most of the differences were > 0, indicating a reduction in performance for M_vess. Qualitatively, the difference between the percentages of cases of M_vess and M_nnU-Net receiving a score ≥ 3 was within 1% in all segments except segments 2 and 3, where M_nnU-Net led by 3% and 5%, respectively. Overall, the metrics of M_vess were equivalent to or slightly worse than those of M_nnU-Net.
For M_seg+spleen vs. M_nnU-Net, M_seg+spleen showed DSC values of 0.89 (mean) and 0.91 (average of median), which are similar to the mean DSC of 0.89 and average of median DSC of 0.92 of M_nnU-Net. Individual DSC differences (M_seg+spleen − M_nnU-Net) were 0 for all segments except the median of segment 4, where M_nnU-Net exceeded M_seg+spleen by 0.01. f(M_seg+spleen : M_nnU-Net) was negligible or 1:1 (see Table S4 for details). The differences in the mean and median HD_95 of the two models were negligible (range = −0.62 to 0.01 mm). Lastly, the differences in the mean and median PDV of the two models ranged from −1.7% to 0.6%. Qualitatively, the differences between the percentages of cases receiving a score ≥ 3 across the two models were within 1%, except for segment 5-8, where M_seg+spleen led by 5%. Overall, the results from the two models were equivalent.
Lastly, the Wilcoxon signed-rank test showed that M_seg+spleen and M_nnU-Net were not significantly different in their metrics (p > 0.05). Comparison of M_seg+spleen with M_vess showed no significance in most cases, with a few exceptions (see footer of Table 7). Furthermore, M_seg+spleen showed better agreement than M_vess in terms of HD_95. Therefore, in the overall comparison, we establish that M_seg+spleen ∼ M_nnU-Net and M_nnU-Net > M_vess. We selected M_seg+spleen as our best model due to its wider application, as the mean/median DSC of spleen is 0.99.
Table 7 shows the results from the quantitative assessment of our models on the pre-surgery CTs. The mean and average of median DSC values of M_seg+spleen across all segments were 0.91 and 0.92, respectively, and those for spleen were 0.91 and 0.96. Individually, the mean and median DSCs of all segments from M_seg+spleen were ≥ 0.90 except for segment 2, where the median and mean DSCs were 0.86 and 0.85, respectively. With regard to distance metrics, segment 2 from M_seg+spleen showed mean and median HD_95 values of 8.5 mm and 9.4 mm, which were the highest among all segments. The best HD_95 values were obtained for segment 1, with mean and median values of 2.8 mm and 3.2 mm. Additionally, spleen showed a mean HD_95 of 2.2 mm. With regard to the volumetric comparison of M_seg+spleen vs. radiologist ground-truth contours, the overall mean and average of median PDV values across all segments were 8.2% and 5.6%. Likewise, the mean and median PDV for spleen were within 2%. Lastly, stratification of DSC based on the cancer type showed no performance change in segments (±1%), but the spleen of CC (N = 5) cases showed a 2% lower DSC than that of CRM (N = 22) cases.
In comparison to M_seg+spleen, M_paU-Net and M_vess showed poor performance for segment 1, as the mean DSC of M_seg+spleen was greater than those of the other two models by 7% and 6%, respectively. On segments 3 and 4, M_seg+spleen outperformed M_paU-Net by 2% and 5%, respectively. Moreover, the mean DSC value of M_vess was around 5% less than that of the other models on segments 2, 3, and 4, which suggests that the vessel architecture and the boundary between segments 2 and 3 are sensitive to each other. The mean DSCs of the other three models were within 2% of one another. With regard to HD, M_paU-Net showed the largest HD, but all other models showed similar performance.
Qualitatively, for M seg+spleen, 97% of segments received a score ≥ 3, with 69% receiving a score ≥ 4 and 27% receiving a score of 5. Individually, at least 64% of cases for each segment received a score of 4 or more. Segments 1, 4, and 5-8 received higher scores than segment 2, which is highlighted by the lower value of 15% (score of 5) in Table 5. Compared with the other models, contours from M seg+spleen included 14% more cases with a Likert score ≥ 3 than M paU-Net. However, the other two models received scores similar to those of M seg+spleen.
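The paired model comparisons reported in this section rest on the Wilcoxon signed-rank test with Bonferroni adjustment. A minimal sketch of such a comparison is given below, assuming that per-case DSC values from two models are available as paired arrays. The function name, the synthetic DSC values, and the choice of three comparisons are illustrative assumptions and are not taken from the analysis pipeline of this study.

```python
import numpy as np
from scipy.stats import wilcoxon


def paired_wilcoxon_bonferroni(scores_a, scores_b, n_comparisons):
    """Two-sided Wilcoxon signed-rank test on paired per-case scores.

    Returns the test statistic, the raw p-value, and a Bonferroni-adjusted
    p-value (raw p multiplied by the number of comparisons, capped at 1).
    """
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    result = wilcoxon(scores_a, scores_b)
    p_raw = float(result.pvalue)
    p_adj = min(1.0, p_raw * n_comparisons)
    return float(result.statistic), p_raw, p_adj


# Hypothetical per-case DSC values for one segment (40 validation cases).
rng = np.random.default_rng(0)
dsc_model_a = np.clip(rng.normal(0.90, 0.03, size=40), 0.0, 1.0)
dsc_model_b = np.clip(dsc_model_a + rng.normal(0.0, 0.01, size=40), 0.0, 1.0)

# Assumption: three pairwise model comparisons share the Bonferroni factor.
stat, p_raw, p_adj = paired_wilcoxon_bonferroni(dsc_model_a, dsc_model_b, n_comparisons=3)
print(f"statistic={stat:.1f}, raw p={p_raw:.3f}, adjusted p={p_adj:.3f}")
```

The test is applied to paired per-case values because every model is evaluated on the same images, so the comparison is made on within-case differences rather than on two independent samples.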

Assessment of the models on challenge datasets
Tables 6 and 8 show the results from the quantitative and qualitative assessment of all models on the challenge dataset (C CH). With regard to the best model (M seg+spleen), both the overall mean and median DSC values of segments were 0.87. The individual mean and median DSC values were ≥ 0.96 for segment 5-8 and spleen, whereas the mean/median DSC for segments 1, 2, 3, and 4 ranged from 0.80 to 0.88. Segment 2 had the lowest mean and median DSCs of 0.80. For distance metrics, segment 1 and spleen had mean and median HD 95 within 5 mm, which was better than all other segments. The largest mean and median HD 95 values (≥ 10 mm) were observed in segment 2. Lastly, the overall mean and average median PDV were 11% and 9.5% for segments and 2% for spleen. The largest PDVs were observed in segment 2, with a mean/median of 19%. Lastly, since the cancer types of the challenge datasets are not available, we could not perform a stratified DSC analysis. Comparatively, both M nnU-Net and M vess showed mean and median DSC/HD 95/PDV within 1%/1 mm/2% of our best model. On the other hand, M paU-Net showed mean and median HD 95/PDV 5 mm/6% higher than those of the best model. Qualitatively, 100% of cases from M seg+spleen received an overall Likert score ≥ 3, with more than 80% receiving a score ≥ 4. Lower Likert scores were localized to the segment 2 and 3 contours. Regarding the other models, Likert scores of M vess and M Seg showed a trend similar to that of M seg+spleen. In contrast, the percentage of M paU-Net cases receiving a score ≥ 3 was 64%, with only 20% showing an overall score ≥ 4. Lastly, more than 97% of spleen contours from M seg+spleen received a score ≥ 4.

Table 5. Comparison of descriptive statistics from models trained with segments, segments with vessel (color channel), and segments with spleen on the validation set (C RTVal). 1 Data in each cell are organized as row 1 = Median; row 2 = Mean, Standard deviation; row 3 = Max, Min. 1 M nnU-Net = model trained with segments only, M seg+spleen = model trained with segments and spleen as labels, M vess = model trained with segments as label and vessel as color channel. 2a IntraMD = intra-observer mean Dice. 2b InterMD = inter-observer mean Dice. 3 HD = Hausdorff distance (95 = 95th percentile and A = average, in mm). Wilcoxon signed-rank test with Bonferroni adjustment showed p > 0.05 in M nnU-Net vs. M seg+spleen for all. In M seg+spleen vs. M vess, segment 3 showed p < 0.05 in DSC and HD A. In M nnU-Net vs. M vess, segment 3 showed p < 0.05 in HD 95 and segment 4 showed p < 0.05 in PDV and HD 95.
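The overlap and volume metrics reported above (DSC and PDV) can be illustrated on binary masks. The following is a minimal sketch, assuming that PDV is the absolute volume difference expressed as a percentage of the ground-truth volume; the exact sign convention used for the tables is not restated here, so this choice, the function names, and the toy masks are assumptions for illustration only.

```python
import numpy as np


def dice_coefficient(pred, truth):
    """Dice similarity coefficient between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2.0 * np.logical_and(pred, truth).sum() / denom


def percent_difference_in_volume(pred, truth, voxel_volume_mm3=1.0):
    """Absolute volume difference as a percentage of the ground-truth volume.

    Assumes a non-empty ground-truth mask.
    """
    v_pred = np.count_nonzero(pred) * voxel_volume_mm3
    v_truth = np.count_nonzero(truth) * voxel_volume_mm3
    return 100.0 * abs(v_pred - v_truth) / v_truth


# Toy ground truth: a 4 x 4 x 4 cube (64 voxels) in an 8 x 8 x 8 volume.
truth = np.zeros((8, 8, 8), dtype=bool)
truth[2:6, 2:6, 2:6] = True

# Prediction shifted by one voxel: same volume, reduced overlap.
pred_shift = np.zeros_like(truth)
pred_shift[3:7, 2:6, 2:6] = True

# Prediction that contains the truth plus one extra slab (80 voxels).
pred_big = np.zeros_like(truth)
pred_big[2:6, 2:6, 2:7] = True

print(dice_coefficient(pred_shift, truth))              # 0.75
print(percent_difference_in_volume(pred_shift, truth))  # 0.0
print(percent_difference_in_volume(pred_big, truth))    # 25.0
```

The shifted prediction shows why DSC and PDV are reported together: a contour can match the ground-truth volume exactly while still being misplaced.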

Assessment of the models on post-portal vein embolization images
As per Table 9, M seg+spleen showed mean and median DSCs ≥ 0.87 for all segments and spleen except segment 2, where the mean and median DSCs were 0.82 and 0.80. Furthermore, segments 2, 3, 4, and 5-8 showed mean/median HD 95 ≥ 7 mm. Segment 1 and spleen showed mean/median HD 95 within 5 mm. Mean and median PDVs of segments 1, 3, 4, and 5-8 were within 10%, but that of segment 2 was ≥ 15%. The stratified DSC analysis by cancer type showed that CC (N = 3) had a larger DSC than CRM (N = 17) by 2 to 4%. With regard to the other models, all showed mean and median DSCs within 1% of M seg+spleen, with the exception of M paU-Net for segment 4, where DSCs were lower than those of M seg+spleen by 7%. Similar trends were observed in HD 95, with the exception of M paU-Net showing mean HD 95 up to 18% in the case of segment 5-8. Except for M paU-Net, PDVs of all models were within 4% of one another. Mean PDVs of M paU-Net were greater than those of M seg+spleen by 16%.
Qualitatively, at least 90% of cases received a score ≥ 3 and at least 85% received a score ≥ 4 across all models in each segment, with the exception of segments 4 and 5-8. For segments 4 and 5-8, only 5% and 10% of M paU-Net cases received a score ≥ 4, whereas at least 25% of M seg+spleen cases received a score ≥ 4. Additionally, a score ≥ 3 was received by more than 45% of cases of segment 5-8 across all models. Lastly, all cases of spleen received a score of 5. Examples of Likert scores with the specific images are shown in Fig. 2.

Assessment of the model on non-contrast images
As per Table 10, M seg+spleen showed mean and median DSCs ≥ 0.83 across all segments and spleen, with the exception of segments 1 and 2, where the mean DSCs were 0.70 and 0.78, respectively. Further, mean and median HD 95 were ≥ 5 mm across all segments, but the spleen showed HD 95 < 5 mm. Segment 1 showed mean and median PDVs of 18% and 30%, the largest PDV among all segments. Next, the mean MDA ranged from 1.6 to 3.6 mm for segments and was 1.3 mm for the spleen.
M nnU-Net showed performance similar to M seg+spleen across all metrics in all segments. Specifically, the agreement between the models was within 2%, 1.5 mm, 3%, and 0.2 mm in terms of DSC, HD 95, PDV, and MDA, respectively, with M nnU-Net underperforming. On the other hand, M paU-Net showed slightly better performance than M nnU-Net and M seg+spleen for segments 1 and 2. Specifically, the mean DSC of segment 1 from M paU-Net was greater than that of M seg+spleen by 8%. Similarly, the mean DSC of segment 2 from M paU-Net was greater than that of M seg+spleen by 5%. However, this advantage was not observed for segments 1 and 4 in terms of HD 95: M paU-Net showed mean HD 95 greater than that of the other two models by 10 mm and 6 mm for segments 1 and 4, respectively. Likewise, its mean PDV was larger than that of the other two models by 9% for segment 4. Additionally, the MDA of M paU-Net was within 1 mm of that of M seg+spleen for all segments.
No M vess model was trained, and no qualitative evaluation was performed, because there is no vessel information on the non-CECT images.
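The distance metrics used throughout the Results (HD 95 and MDA) can be computed from the surfaces of two binary masks. The sketch below uses one common convention, assumed here for illustration: surface voxels are obtained by erosion, HD 95 is the larger of the two directed 95th-percentile surface distances, and MDA is the average of the two directed mean surface distances. The function names and the toy masks are hypothetical, and the definitions used for the tables may differ in detail.

```python
import numpy as np
from scipy import ndimage


def surface_voxels(mask):
    """Boundary voxels of a binary mask (mask minus its erosion)."""
    mask = np.asarray(mask, dtype=bool)
    return mask & ~ndimage.binary_erosion(mask)


def directed_surface_distances(mask_a, mask_b, spacing):
    """Distances (mm) from each surface voxel of A to the nearest surface voxel of B."""
    surf_a = surface_voxels(mask_a)
    surf_b = surface_voxels(mask_b)
    # The Euclidean distance transform of the complement of B's surface
    # gives, at every voxel, the distance to the closest surface voxel of B.
    dist_to_b = ndimage.distance_transform_edt(~surf_b, sampling=spacing)
    return dist_to_b[surf_a]


def hd95_and_mda(pred, truth, spacing=(1.0, 1.0, 1.0)):
    """Symmetric HD 95 and mean distance to agreement; assumes non-empty masks."""
    d_pred_to_truth = directed_surface_distances(pred, truth, spacing)
    d_truth_to_pred = directed_surface_distances(truth, pred, spacing)
    hd95 = max(np.percentile(d_pred_to_truth, 95), np.percentile(d_truth_to_pred, 95))
    mda = 0.5 * (d_pred_to_truth.mean() + d_truth_to_pred.mean())
    return float(hd95), float(mda)


# Toy example: a 6 x 6 x 6 cube and a copy shifted by 2 voxels along one axis.
truth = np.zeros((16, 16, 16), dtype=bool)
truth[4:10, 4:10, 4:10] = True
pred = np.zeros_like(truth)
pred[6:12, 4:10, 4:10] = True

hd95, mda = hd95_and_mda(pred, truth, spacing=(1.0, 1.0, 1.0))
print(f"HD95 = {hd95:.2f} mm, MDA = {mda:.2f} mm")
```

Because HD 95 is driven by the worst-matching part of the surface while MDA averages over the whole surface, a model can show a large HD 95 and a small MDA on the same contour, which is the pattern described above for M paU-Net.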

Discussion
In this study, we have developed a clinically translatable model that can be used to auto-contour the liver segments and spleen on abdominal CT images. We validated all models on a validation set (C RTVal) of 40 CECT scans of patients with primary and metastatic liver tumors to identify the best model. We also assessed all models on various test sets (N = 78) shown in Fig. 1. First, we demonstrated that the 3D full-resolution architecture of nnU-Net outperformed the 3D attention U-Net (paU-Net) by 2-5% in DSC across all liver segments. We also investigated the impact of adding segmentation of the vessels and spleen to aid in segmenting the liver segments and observed no major performance change between the models. Our final model can segment liver segments 1, 2, 3, 4, and 5-8 and the spleen, with average mean DSCs of 0.89 and 0.99 across liver segments and spleen, respectively. We demonstrated that our model can be used in the clinical environment for surgical planning (mean DSC = 0.91) and for PVE patients (overall Likert score ≥ 4 for 95%). To our knowledge, this is the first study to develop a single model to contour liver segments and spleen that is validated across patients with primary/secondary liver cancers and across both contrast and non-contrast images.
Our final model is applicable in four clinical scenarios. First, the model can be used to auto-contour the segments of liver surgery patients, where it can aid in estimating the volumetric change due to PVE and in overall resection planning, demonstrating an accuracy of 5.6% in overall median volume. Second, the model can be used to auto-contour liver segments in patients undergoing RT for liver cancer, as studies 5,27,28 have highlighted the importance of understanding liver segment regeneration for the optimization of RT plans. Third, the volume estimation from the model can be used in the prediction of cirrhosis and fibrosis, as studies have reported that segment-volume ratios are significant predictors of cirrhosis/fibrosis 18,29. Last, for pathologies leading to hepatosplenomegaly, our model can be used to segment the liver and spleen with higher accuracy in cases where segments 2 and 3 abut the spleen. Once our model is fully translated into the clinic, its use will improve efficiency, as the model can generate all of its structures in 30-75 s per patient. This is very efficient compared with the 90 min required for manual segmentation at our clinic and the up to three minutes required by some semi-automatic segmentation methods 30.
With regard to technical results, our first major observation was in the comparison of STAPLE vs. majority vote, where we hypothesized that STAPLE > majority vote. This was confirmed by visual assessment: for all 40 images in the test set from C RT, majority voting produced at least one slice with more zero-valued pixels at the segment demarcation than STAPLE. The observation was expected because STAPLE assigns the label based on probability values, whereas majority voting in SimpleITK (used in our study) relies on label frequency, which could lead to a large number of undecided pixels. Second, in our architecture selection study, we observed that the nnU-Net architecture was superior, with the paU-Net architecture demonstrating over-segmentation of segment 1 (including volume of segments 4, 5-8, and the inferior vena cava) and under-segmentation of segment 3 (with volume classified as segments 2 and 4). This could be due to fewer data augmentation options in paU-Net than in nnU-Net, which greatly impacted the performance in cases where the vessels defining the segment boundaries deviated from the majority of the training data. Lastly, the paU-Net often failed to accurately contour segment 4, typically failing at the interface of the portal vein. The nnU-Net did not suffer from this uncertainty, and therefore the accuracy improvement for segment 4 was the most significant compared with the paU-Net model.
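The origin of the undecided, zero-valued pixels under majority voting can be illustrated with a small NumPy sketch. It is not the SimpleITK implementation: the function name, the three hypothetical observers, and the use of label 0 for ties are assumptions made for illustration.

```python
import numpy as np


def majority_vote(label_maps, n_labels, undecided_label=0):
    """Fuse label maps by per-pixel majority vote.

    Pixels where two or more labels tie for the highest count are set to
    `undecided_label` (0 here, mirroring the zero-valued pixels described above).
    """
    stack = np.stack(label_maps)                                    # (n_raters, ...)
    votes = np.stack([(stack == k).sum(axis=0) for k in range(n_labels)])
    top_count = votes.max(axis=0)
    winner = votes.argmax(axis=0)
    is_tie = (votes == top_count).sum(axis=0) > 1
    return np.where(is_tie, undecided_label, winner)


# A row of five pixels crossing the boundary between two segments,
# contoured by three hypothetical observers (labels 1, 2, 3 are segments).
observer_1 = np.array([1, 1, 1, 2, 2])
observer_2 = np.array([1, 1, 2, 2, 2])
observer_3 = np.array([1, 1, 3, 2, 2])

fused = majority_vote([observer_1, observer_2, observer_3], n_labels=4)
print(fused)  # [1 1 0 2 2]
```

The middle pixel receives one vote for each of three labels, so no label wins and the pixel is left undecided. Disagreement between observers is concentrated at segment demarcations, which is where such pixels accumulate; a probabilistic fusion such as STAPLE weighs the observers instead of counting votes and therefore still assigns a label there.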
With regard to improvement in uncertainty, the statistical test showed no differences between models when the spleen was added to the best architecture model. Specifically, in C RT, for M vess vs. M nnU-Net, 91% of the cases showed DSC differences within [-0.025, 0.025]. The cases where DSC differences were larger (M nnU-Net > M vess) corresponded to errors in M vess due to over-segmentation of segment 3 into segment 2 in two cases, and over-segmentation and under-segmentation of segments 5-8 over 4 in one case. Similar trends were observed for M vess vs. M seg+spleen, suggesting a preference for M seg+spleen. However, most of the contrast in performance was observed in segments 1, 2, and 4 (Table S5). For M seg+spleen vs. M nnU-Net, we argued that M seg+spleen was similar to M nnU-Net (Table 7 and Table S4); however, we selected M seg+spleen because of its slightly improved performance. Quantitatively, we observed in Table 7 that the descriptive statistics of the results were similar except for segment 5-8 of M seg+spleen, where the minimum DSC and maximum HD 95 improved by 0.05 and ~9 mm upon addition of the spleen. Upon qualitative assessment of those cases, the improvement in M seg+spleen was due to less under-segmentation of segment 5-8 compared with M nnU-Net. Next, our validation set included N = 8/40 cases of segment 2 and 3 hypertrophy. In N = 7/8, there was no difference in segments 2 and 3, i.e., both models showed reasonable segmentation without any over- or under-segmentation. In the remaining case (N = 1/8), M nnU-Net showed under-segmentation of segment 2 next to the spleen, but segmentations from M seg+spleen were improved on the same slices. Although our hypothesis that including the spleen in the model would be better than one without spleen was not supported, because both models showed reasonable performance on cases with segment 2/3 hypertrophy, we still selected the model with spleen as our final model, as this model has wider application and can also be used to estimate the severity of cirrhosis/fibrosis if needed in patients undergoing liver surgery or RT.
M seg+spleen, the final model, showed excellent performance on the C LS patients for all segments except segment 2, where the mean DSC was 0.85 and the mean HD 95 was 9.4 mm. Furthermore, the subjective analysis showed that, except for segments 2/3, more than 70% of all cases received an overall Likert score ≥ 4. This is likely due to the uncertainties in the boundary of segment 2/3. While the uncertainties are primarily attributed to the performance of the model, it is also important to note that the opacification of the veins plays a great role in the ability of the radiologist to evaluate the segmentation. The radiologist (SY) reported that N = 16/33 images were arterial phase images, leading to a reduced confidence level in the evaluation of the contours, as the portal venous branches are not well opacified and localized on arterial phase images. Additionally, the visual assessment showed that 5/33 C LS cases had holes or under-segmentation in segment 5-8 due to photon starvation from metal artifacts of the embolization coil/stent (N = 4/5) and a tumor hole (N = 1/5). Furthermore, another N = 2/33 cases showed holes and under-segmentation in segment 4 due to photon starvation from metal artifacts. Lastly, we observed slightly lower DSC in CC compared with CRM, primarily because of portal hypertension in CC, which could lead to an enlarged spleen and could affect the contour performance. This was supported by a difference of 13 cc between the mean volumes of the two cancer types. Comparatively, since our final model showed better DSC in the case of CC as opposed to HCC by approximately 2%, it could be argued that the severity of underlying disease, which affects liver texture on CT across different cancer types, could also impact the vessels and hence the contours. Therefore, one would expect a DSC performance trend of colorectal metastasis patients > cholangiocarcinoma > hepatocellular carcinoma. However, since the number of cholangiocarcinoma patients in C LS is smaller (5 vs. 22), we cannot state a robust conclusion. In comparison with the other models, our best model outperformed the paU-Net and vessel-based models mostly on segments 1 and 4 and on segments 2, 3, and 4, respectively, but not on segment 5-8. This could be because segment 5-8 is the largest structure, which means it is less sensitive to changes in the vessel structures and includes more features. This requires less optimization in the model, which means that less robust models such as M paU-Net could also show better performance.
Next, in C CH, M seg+spleen showed an overall mean DSC of 0.87, which was smaller than the results observed on the C LS and C RTVal sets. Specifically, poor results were confined to segments 1 through 4. The reason for this observation was uncertainty in the boundaries of the segments in most of the cases. The images in the challenge dataset also include cases with large and multiple tumors, which could potentially lead to vessel occlusion and/or unremarkable opacification of the vessels on CT scans. Further, upon visual assessment, N = 4/25 cases of C CH showed under- and over-segmentation. Specifically, N = 3/4 showed under-segmentation in segments 2, 4, and 5-8 due to tumor and disease, and N = 1/4 showed over-segmentation into the heart.
In C PVE, we observed that the overall mean DSC of segments was 0.87, which is primarily because of poor performance in segment 2. Upon visual assessment, we found that N = 20/20 images showed inconsistency between the segment 2-3 boundary of the ground truth and that of the prediction. The boundary of segments 2 and 3 is dictated by the portal veins in the left liver, and the architecture of those veins exhibits higher variation across the patient population due to disease in the liver. Another reason is segmental hypertrophy, which could result in under- and over-segmentation of specific segments. The volume of segment 2 from our best model in C PVE is 148 ± 73 cc, and the ground-truth volume for normal liver from the CHAOS dataset is 88 ± 36 cc, which supports hypertrophy of segment 2. Next, regarding the effect of metallic artifacts, we found that N = 17/20 patients of C PVE had embolization coils spanning segments 5-8 and 4, mostly localized in segment 5-8. N = 2/17 were unaffected by metal artifacts. However, in the remaining N = 15/17, both segments 4 (N = 3/15) and 5-8 (N = 15/15) showed holes in the contours due to photon starvation arising from metal artifacts. This was expected because our training dataset did not include patients undergoing portal vein embolization. Lastly, the stratified DSC analysis for different cancer types showed that the model performed better on CC (N = 3) patients than on CRM (N = 17) patients by 2-4%, which is not consistent with our observation in C RTVal.
Next, in the perturbation analysis, we observed that M seg+spleen was still better than the other two models in terms of DSC, HD 95, and PDV across all segments except segments 1 and 2. In segments 1 and 2, M paU-Net showed slightly better performance (p < 0.05). However, this was contradicted when we assessed the MDA, which was higher for M paU-Net. Therefore, we attribute the DSC observation for segment 1 mostly to the attention mechanism, given the absence of contrast and randomness in the data. Overall, we argue that our best model could potentially be used on non-contrast images of the same examination in the clinic for segments 3, 4, 5-8 and spleen. For segments 1 and 2, minimal intervention from a radiologist would be required to correct the contours. Lastly, since non-contrast images are rarely used to discriminate tumor types, we did not perform a stratified DSC analysis on non-contrast cases.
Comparing the performance of the final model across the validation and various test sets in Tables 6, 7, 8 and 9, we found that the model performs best on C LS, as evidenced by a higher segment mean DSC (by 2-6%) than in the other cohorts. This could be attributed to the fact that surgery patients have less severe pathologies (e.g., surgery is typically a first-line therapy for smaller tumors) and fewer artifacts than patients undergoing radiotherapy or portal vein embolization or patients in the challenge cohorts. Further, we also observed that the mean segmental DSCs of C RTVal and C LS were slightly better than those of C CH and C PVE. Specifically, while performance in segment 1 is within a mean DSC of 2% across the datasets, segments 2, 3, and 4 showed lower mean DSCs (by up to 6%), which is attributed to the fact that the C CH dataset has larger and more numerous tumors and larger slice thickness. For segment 5-8, C PVE showed a lower mean DSC by 3-5% due to the presence of under-segmentation or holes in segment 5-8 arising from metallic artifacts.
Considering the above analysis, our study has three limitations: (1) segmentation of combined segments 5-8, (2) failure of the model on segments with metal artifacts, and (3) uncertainty in the segment 2 and 3 boundaries. For (1), clinical practice for surgical planning dictated our segmentation selection and the combination of segments 5-8. In addition, in our experience, there is substantial variability in the manual contouring of these segments individually. For (2), the issue can be addressed by manually editing the failed contours in cases with severe photon starvation and by increasing the number of such cases in our training datasets. For the last issue, we could implement post-processing methods to automatically optimize the boundary of segments 2 and 3. In our clinic, the segmental boundaries are separated based on the branching of the portal veins: the regions above the left portal vein branch are segment 2, whereas the regions below the left portal vein branches are segment 3 31. We can use our in-house tool to generate liver vessels on CT scans in the post-processing phase 23 and also implement vessel enhancement and active contour methods, as reviewed by Ciecholewski et al. 2021 32, to further enhance the vessels at the periphery of segments 2 and 3.
Despite the limitations, our model achieves comparable or improved accuracy in comparison with other studies, as shown in Table 11. Tian et al. 2019 reported the mean values across all segments, and our results are in close agreement with theirs. In comparison with Lee et al. 2022, our model demonstrated superior results on C LS in all segments (except segment 2) by 3-30%. For segment 2, our model showed inferior results by up to 6%, which is attributed to the variability in the boundaries of segment 2 and the sensitivity of our model to the vessel architecture. Another reason is the difference in the underlying pathology between the literature and our datasets. Lee et al. 2022 18 assessed their model performance on patients with hepatitis C and cirrhosis; however, our C LS is dominantly CRM and CC patients (see Table 1). The severity of cancer is also known to cause cavernous transformation of the vessels, which also leads to uncertainties in the segment 2 contours.

Conclusion
In this study, we developed and validated, to a clinically acceptable accuracy, a fully automated model that can auto-contour liver segments and spleen on CECT images. We found that implementing the attention mechanism in the 3D U-Net did not improve performance when compared with the 3D full-resolution nnU-Net. We also found that adding segmentation of the vessels and spleen did not have a large impact on the accuracy of segment contours. The model is primarily intended for use with patients undergoing assessment for liver surgery or liver radiotherapy, but it can be used in any clinical scenario where there is a need for segment contouring on CECT. Upon assessing our model on patients undergoing portal vein embolization, we conclude that contouring is significantly impacted by the presence of metallic artifacts, leading to holes in the contours. However, inclusion of such patients in training may improve performance in the future. Lastly, with regard to non-contrast images, we conclude that our final model can contour segments with accuracy sufficient for clinical use with review and possibly moderate intervention from a radiologist.

Figure 1. (A) Overall workflow of the study. (B) Architecture for the 3D patch-based U-Net with attention mechanism. (C) nnU-Net framework, which automatically optimizes the architecture based on the type of datasets. *Quantitative analyses were performed by calculating the Dice similarity coefficient, 95th percentile Hausdorff distance, and percent change in the volume of segments and spleen between AI-predicted and ground-truth contours. Statistical analysis was performed using the Wilcoxon signed-rank test with Bonferroni correction. **All models were assessed on cohorts of Block 3 using both quantitative and qualitative analyses (Figures created using biorender.com).
Our framework includes training, validation, and test sets. As shown in Fig. 1A, C RT was used for training and validation, and C LS, C CH, and C PVE were used as test sets. The C RT datasets were split into training (N = 160) and validation (N = 40) sets by randomizing planning and 3-month follow-up images of patients. The optimization of models during training was performed using cross entropy and Dice as the loss function (see Sect. "Manual and AI edited segmentations" for more details). Hyperparameters were tuned manually and automatically according to

Table 6. Likert scale assessment performed by independent radiologists to assess the usability of contours in the clinic.

Table 7. Comparison of descriptive statistics from models trained with segments, segments with vessel (color channel), and segments with spleen on the liver surgery cohort C LS. 1 Data in each cell are organized as row 1 = Median; row 2 = Mean, Standard deviation; row 3 = Max, Min. 1 M nnU-Net = model trained with segments only, M seg+spleen = model trained with segments and spleen as labels, M vess = model trained with segments as label and vessel as color channel. 2a IntraMD = intra-observer mean Dice. 2b InterMD = inter-observer mean Dice. 3 HD = Hausdorff distance (95 = 95th percentile and A = average, in mm). PDV = percent difference in volume.

Figure 2. Example cases of three different Likert scores (5, 4, and 3) are shown for two different cohorts. Blue arrows highlight the uncertainties in boundaries between the manual and model-predicted contours. For scores 4 and 3 in the images of C PVE, the arrow highlights the hole in segment 5-8 due to metal artifacts. In C PVE, a score of 4 is given when the image has a hole but the segment boundaries follow the vessels. https://doi.org/10.1038/s41598-024-53997-y

Table 1. Characteristics of patients used in this study. ┼ Median (min-max). *"All" means all the cases showed the same values. **CRM: colorectal or other metastasis; CC: cholangiocarcinoma; HCC: hepatocellular carcinoma; mixed: more than one cancer type.

Table 2. Number of CT scans allotted for training, validation, and test sets across different models.

Table 3. Scoring criteria used by radiologists to evaluate the contours for qualitative analysis.

Table 8. Comparison of descriptive statistics from models trained with segments, segments with vessel (color channel), and segments with spleen on challenge cohort C

performance on cases with segment 2/3 hypertrophy we still selected model with spleen as our final model as this model has wider application and can be also used to estimate the severity of cirrhosis/fibrosis if needed in the patient undergoing liver surgery or RT.
Scientific Reports | (2024) 14:4678 | https://doi.org/10.1038/s41598-024-53997-y | www.nature.com/scientificreports/

and also implement vessel enhancements and active contour methods, as reviewed by Ciecholewski et al. 2021 32, to further enhance the vessels at the periphery of segments 2 and 3. Despite the limitations, our model performs with comparable or improved accuracy in comparison with studies as shown in Table 11. Tian et al. 2019 reported the mean values across all segments, and our results are in close agreement with their result. In comparison with Lee et al. 2022, our model demonstrated superior results on C LS in all segments (except segment 2) by 3-30%. For segment 2, our model

Table 11. Comparison of M seg+spleen with studies that developed liver segmentation with or without spleen. *Lee et al. 2022 reported separate results for segments 5 through 8. We averaged the reported median values.