Performance analysis and knowledge-based quality assurance of critical organ auto-segmentation for pediatric craniospinal irradiation

Craniospinal irradiation (CSI) is a vital therapeutic approach for young patients with central nervous system disorders such as medulloblastoma. Accurately outlining the treatment area is particularly time-consuming because several sensitive organs at risk (OARs) can be affected by radiation. This study assessed two methods for automating the segmentation process: an atlas technique and a deep learning neural network approach. Additionally, a novel method was devised to prospectively evaluate the accuracy of automated segmentation as a knowledge-based quality assurance (QA) tool. In a cohort of 100 patients aged 2 to 25 years (median age 8), quantitative overlap and distance metrics were used to determine the most effective approach for practical clinical application. The contours generated by the atlas and neural network methods were compared with ground truth contours approved by a radiation oncologist, using 13 distinct metrics. Furthermore, a QA tool was designed for forthcoming cases based on the baseline dataset of 100 patient cases. The calculated metrics indicated that, in the majority of cases (60.58%), the neural network method was significantly closer to the ground truth. No difference was observed in 31.25% of cases, while the atlas method was closer in 8.17%. The QA tool results showed 100% agreement in 39.4% of instances for the atlas method and in 50.6% of instances for the neural network auto-segmentation. The results indicate that the neural network approach performs better, with significantly closer physical alignment to ground truth contours in the majority of cases. The metrics derived from overlap and distance measurements enabled clinicians to identify the optimal choice for practical clinical application.

Pediatric patients with central nervous system malignancies, such as medulloblastoma, ependymoma, atypical teratoid rhabdoid tumor, and germinoma, are often treated with craniospinal irradiation (CSI) therapy 1. The most common malignant pediatric brain tumor is medulloblastoma, accounting for up to 25% of pediatric brain tumors diagnosed in high-income countries and up to 49% of pediatric brain tumors diagnosed in low and middle-income countries (LMICs). Additionally, medulloblastoma requires CSI treatment for patients older than 3-5 years of age 2. Since CSI treats the entire brain and spinal volumes, treatment planning involves complex treatment fields and optimizations, where the dose to a significant number of normal structures must be considered to determine the risk of side effects. Treatment planning requires contouring these structures and tumor target volumes on simulation computed tomography (CT) scans. Delineating accurate structures is a time-consuming process critical for conformal therapies 3. Due to the substantial number of necessary organ contours, CSI cases require significant time to contour and plan, which can be demanding, especially in LMICs where the radiation therapy capabilities can vary due to limited personnel or resources 3. The necessity of CSI therapy to treat pediatric brain tumors leads to a need for fast and robust contouring methods.
Due to the time required for organ delineation and the importance of accurate contours, auto-segmentation methods have been developed to decrease contouring time and reduce inter-observer variations 3. Atlas techniques have […]
[…] while a positive value denoted that the atlas method was more closely aligned with ground truth. The color bar visualization indicated that darker hues corresponded to greater dissimilarity between the methods, with red signifying the neural network method's closer alignment to ground truth and blue indicating the atlas method's closer alignment to ground truth. Conversely, lighter hues indicated increased similarity between the methods. The instances in which the atlas method displayed accuracy in generating contours closer to ground truth were limited to the chiasm and cochleae, except for the TNR metric. For the remaining organs, the neural network method either matched the performance of the atlas method or exhibited higher accuracy in creating contours that were closer to ground truth. Overall, when comparing the two methods, the neural network demonstrated closer alignment with ground truth in 60.58% of cases, the atlas method was superior in 8.17% of cases, and no significant difference was observed in 31.25% of cases.

Contour comparison by patient
To conduct a more comprehensive comparison between atlas and neural network auto-segmentation outcomes, the metrics were averaged across organs, yielding an average metric for each patient. This approach enables an analysis of the methods' performance in contouring an entire patient. The collective metrics encompassing the entire body are outlined in Table 1. Specifically, the average DSC for all organs was 0.66 ± 0.14 for the atlas method and 0.73 ± 0.04 for the neural network method. Correspondingly, the average PPV for all organs was 0.65 ± 0.14 for the atlas method and 0.79 ± 0.04 for the neural network method. A comparison of the DSC and PPV distributions between the two methods is presented in Fig. 3. The average mean HD for all organs was found to be 6.16 ± 17.21 mm for the atlas method and 1.59 ± 1.08 mm for the neural network method.
For each combination of organ and metric (e.g. Brain + DSC), a paired t-test was conducted to compare the atlas method with the neural network method. The null hypothesis maintained that there was no significant difference between the two methods, with a significance threshold of 0.05. The analysis of p-values indicated that, in 208 comparisons, the neural network method outperformed the atlas method in 137 instances, the atlas method prevailed in 17 instances, and in 54 instances there was no significant difference between the performances of the neural network and atlas methods at the established significance level of 0.05.
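The paired comparison described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the study's analysis code: the function name `paired_t_statistic` and the sample values are hypothetical. It returns the t statistic and degrees of freedom for one organ and metric pair; the p-value would then be read from a Student's t distribution with n − 1 degrees of freedom (for example, `scipy.stats.ttest_rel` reports both values).

```python
import math
from statistics import mean, stdev


def paired_t_statistic(atlas_scores, nn_scores):
    """Return (t, degrees_of_freedom) for paired samples, using nn minus atlas."""
    if len(atlas_scores) != len(nn_scores):
        raise ValueError("paired samples must have equal length")
    diffs = [n - a for a, n in zip(atlas_scores, nn_scores)]
    count = len(diffs)
    if count < 2:
        raise ValueError("need at least two pairs")
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = mean(diffs) / (sd / math.sqrt(count))
    return t, count - 1


# Hypothetical DSC values for one organ in five patients (not study data).
atlas_dsc = [0.60, 0.55, 0.70, 0.65, 0.58]
nn_dsc = [0.72, 0.70, 0.74, 0.75, 0.69]
t_value, dof = paired_t_statistic(atlas_dsc, nn_dsc)
print(f"t = {t_value:.2f}, dof = {dof}")
```

A positive t here means the neural network scored higher on average for that metric; for distance metrics, where lower is better, the sign would be interpreted the other way round.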

Knowledge-based quality assurance tool
The baseline KDEs for each organ are visualized for 100 CSI patients in Fig. 4. The HU distributions were distinct, each varying over a specific range of HU values. The standard KDEs included two standard deviations above and below the average KDE to include 95% of ground truth contours. These KDEs with the standard deviation bounds are shown in Fig. 5.
In total, 160 auto-segmentations (10 test patients × 16 organs) were performed for each of the atlas and neural network methods and compared against KDE baselines from 100 patients. The results showed agreement of over 75% in 65.6% of instances for the atlas method and in 82.5% of instances for the neural network auto-segmentation. Moreover, the atlas method achieved 100% agreement in 39.4% of cases, while the neural network method achieved 100% agreement in 50.6% of cases. Percent agreements for the two methods in 10 test patients and 16 normal organs are presented as heat maps in Fig. 6.

Contour comparison by organ
The neural network method produced contours that were generally closer to the ground truth, except for certain cases like the chiasm and cochleae. This observation aligns with existing literature 9, which indicates that neural networks tend to struggle with small structures and those with low contrast, attributes that characterize the cochleae and chiasm, respectively. Notably, manual contouring is often required for the chiasm in CSI patients due to its absence in CT scans and its visualization primarily in MRI scans. Moreover, the neural network's chiasm contours often took on a rectangular form, in contrast to the more accurate "X" shape of the ground truth contours, potentially contributing to discrepancies. For the chiasm and cochleae, the atlas method tended to overestimate size, resulting in a high TPR as the model captured a significant portion of the actual feature voxels. Conversely, the neural network's contours leaned toward underestimation, leading to lower TPR and other metrics, yet a higher TNR due to fewer false positives. Furthermore, the neural network's cochlea contours exhibited notably better PPV, implying that the model's predictions largely aligned with actual cochlear components. Variations in contouring approaches among radiation oncologists might contribute to these findings. Comparatively, the atlas method yielded higher TNR for the brainstem, esophagus, and eyes. This might arise from the neural network's lack of training on the pediatric population, potentially resulting in size overestimation. This is indicated by higher TPR and PPV for the neural network, signifying accurate prediction of voxels relevant to the target structure.
The paired t-test, conducted at a significance level of 0.05, between the atlas and neural network methods, demonstrated that the neural network significantly outperformed the atlas method in achieving contour accuracy for kidneys and lenses, as supported by most metrics. Among the evaluated 16 organs, the neural network showed statistically significant enhancements in DSC for 9 organs (56.25%) in comparison with the atlas method. For 4 organs (25%), the DSC results were similar between the two methods. On the whole, the neural network presented contours that were either more aligned with the ground truth or showed fewer discrepancies across all metrics assessed, relative to the atlas method.

Contour comparison by patient
Patient-wise average metrics for organs exhibited higher values for the overlap metrics (excluding RI) and lower values for the distance metrics with the neural network method compared to the atlas method, signifying that the neural network's results were consistently closer to the ground truth across almost all metrics. Instances where the atlas method gave the better feature prediction occurred in fewer than 3% of cases, potentially arising from high patient matching facilitating successful and precise deformable registration. Predominantly, however, the atlas method's PPV values were considerably lower, particularly at lower percentiles. While the neural network might not completely anticipate contours with high similarity to ground truth for all patients, its predictions pertaining to organs or structures are highly likely to be accurate due to its elevated PPV. Consequently, this may lead to fewer inadvertent errors and necessitate less editing time compared to the atlas method. Even though the neural network method was not specifically trained on pediatric cases, it still benefits from the entirety of its training patient data. Conversely, the atlas method constructs structures based solely on a single prior patient, because it relies on one-to-one deformable image registration between the best-matched anatomy and the test case. As a result, the neural network method is likely more informed, adaptive, and capable of contour creation, owing to its comprehensive training data. Developing a similar model trained on pediatric CSI patients could potentially yield contours even closer to the ground truth.

Knowledge-based quality assurance tool
One application of knowledge-based quality assurance was demonstrated for the right kidney of one of the 10 test patients (patient #4) in Fig. 7. The comparison of contours between the atlas and neural network auto-segmentation techniques against the reference contour is depicted in Fig. 7a. The objective was to ascertain the percentage agreement of the two methods, leveraging the baseline KDEs derived from 100 CSI patients, as illustrated in Fig. 7b. The disparity in the right kidney's contour resulting from the atlas method was highly pronounced and was quantitatively characterized using the KDE approach, which involved modeling the HU distribution values from the past 100 CSI cases.
The applicability of the in-house knowledge-based QA tool remains independent of the specific type of auto-segmentation method employed. However, a key constraint lies in the fact that the accuracy of KDEs is contingent upon the subjectivity inherent in the ground truth contours endorsed by clinicians. The principal aim of utilizing this tool is to identify significant errors stemming from auto-segmentation processes. A clinical threshold of 100% agreement could be pursued. Organs falling below this threshold are highlighted for closer scrutiny and potential further investigation. This threshold serves as an indicator for the presence of gross errors generated by the auto-segmentation tools. This assessment could be seamlessly incorporated into the review stage of treatment planning, ensuring that any organ segmentations falling below the threshold are promptly flagged for clinician review. The adoption of this tool in clinical practice may encounter several barriers. One significant barrier is resistance to change, common in settings where traditional practices are deeply ingrained. To mitigate this, extensive training and demonstrations of the tool's efficacy in enhancing patient safety and improving workflow efficiency could be conducted. Additionally, integration challenges with existing image manipulation systems may arise, necessitating collaboration with IT professionals to ensure compatibility and ease of use. Further testing is warranted to comprehensively gauge the tool's applicability and ascertain the optimal approach for leveraging it as a safety measure for auto-segmentation models. Our findings indicate that 10 test cases fell within the 2-sigma region when compared to 100 historic cases. This suggests that the distribution of results could vary across different centers, potentially being tighter or more relaxed depending on the consistency with which ground truth contours are executed. To accommodate this variability, the tolerance level of 100% in the QA tool can be adjusted based on several factors, including the number of normal organs reviewed, the percentage of gross errors identified, and the clinically acceptable deviations from the expected results.

Conclusion
Neural networks hold great promise for automating the segmentation process in radiation oncology. This automation is particularly valuable in complex treatment planning scenarios like CSI, where the process is time-consuming and resource-intensive. This quantitative evaluation, comparing a neural network approach to an atlas-based method, revealed that, on the whole, the neural network generated contours that were closer to the ground truth. Notably, the neural network model assessed in this study, trained on adult patients, exhibited respectable performance on pediatric cases as well. However, even greater advancements could potentially arise from employing a neural network specifically trained on pediatric patients.
In the context of safety and quality assurance, a knowledge-based comparison of HU distributions offers a valuable means to scrutinize auto-segmentation outcomes. This approach involves verifying if organ contours align with established clinical standards, effectively identifying any inadvertent inclusion of surrounding tissues within the contour. The integration of a neural network-based auto-segmentation algorithm with such a quality assurance tool has the potential to significantly enhance the efficiency of contouring workflows, not only for CSI patients but across a broader spectrum of cases.

Patient data selection
This study was performed using CT scans retrospectively acquired on a Philips IQon Spectral CT (Philips Healthcare, Cleveland, OH) and commercially available clinical MIM software (MIM, version 7.3.2, MIM Software Inc., Cleveland, OH). The datasets of pediatric patients who underwent CSI treatments were randomly selected from November 2015 to August 2023. Institutional review board approval was obtained prior to the study. Patient ages ranged from 2 to 25 years, with a median age of 8. The patients had CT acquisition using 120 kVp, 1.171 spiral pitch factor, 0.625 mm collimation, a 50 cm field of view, 512 × 512 matrix size, and slice thickness of 1.5 mm based on the CSI protocol.
For each patient, dosimetrists first manually delineated the normal structures with great care, followed by a thorough review of their work. After this initial phase, physicians made any necessary adjustments and provided their final approvals. This step-by-step collaboration between dosimetrists and physicians ensured that the contours of normal organ structures met the high clinical standards required for treatment planning. This process established the ground truth contours for 170 patients, including brain, brainstem, chiasm, left cochlea, right cochlea, esophagus, left eye, right eye, left lens, right lens, left lung, right lung, left kidney, right kidney, left optic nerve, and right optic nerve.
Automated segmentation was applied to the CT scans of the CSI patients through an atlas segmentation software, employing a deformable image registration algorithm known as the VoxAlign Deformation Engine, developed by MIM Software. The atlas algorithm searched an archive containing 170 prior CSI ground truth cases and identified the CT scan that most closely matched the current patient's anatomy. By minimizing intensity disparities between the two images through deformable image registration of the historical patient's CT, the algorithm generated a deformation vector field. This vector field was then employed to modify the contours from the previous patient, aligning them with the anatomical specifics of the current patient 20. The atlas algorithm was applied to 100 CSI test cases to establish the atlas test contours.
The CT scans of these 100 CSI test patients were also auto-segmented using a neural network algorithm (Contour ProtégéAI, MIM Software Inc., Cleveland, OH) to establish the neural network test contours. For each patient, the neural network algorithm was executed, employing a U-Net architecture to convert the input image into a segmentation mask for each specific anatomical structure 21. Both the CT Head and Neck and CT Thorax neural network models (Contour ProtégéAI, MIM Software Inc., Cleveland, OH) were used for each patient. These individual contours were aggregated, resulting in a collection of the 16 listed contours for 100 test patients that were recorded for the purpose of comparison.

Contour comparison
The 100 CSI patients auto-segmented by both the atlas and neural network methods were analyzed using 13 distinct metrics and compared against the ground truth contours. The metrics can be split into two categories: overlap and distance measures. The crisp definitions for all metrics were adapted from a study 22.

Overlap metrics
The segmentations of the ground truth and the test were defined as $S_g$ and $S_t$, respectively. Both segmentations were split into two classes, $S_g = \{S_g^1, S_g^2\}$ and $S_t = \{S_t^1, S_t^2\}$, where the first class ($S_g^1$, $S_t^1$) was the anatomy of interest and the second class ($S_g^2$, $S_t^2$) was the background. An assignment function $f_g^i(x)$ defined whether a point $x$ in a medical image volume was in the feature of interest or the background, where $f_g^i(x) = 1$ if $x \in S_g^i$, and $f_g^i(x) = 0$ otherwise. The assignment function for the test segmentation, $f_t^i(x)$, was defined similarly. If $X = \{x_1, \ldots, x_n\}$ defined the set of all points inside the medical image volume, all points in the image were members of either the feature of interest or the background class, meaning

$$f_g^1(x) + f_g^2(x) = 1 \quad \text{and} \quad f_t^1(x) + f_t^2(x) = 1 \quad \text{for all } x \in X.$$

The common cardinalities describing the overlap of the two segmentations were the four entries of a confusion matrix: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). These were defined by the sum of agreement between the classes of segmentation $i \in S_g$ and $j \in S_t$, calculated as

$$m_{ij} = \sum_{r=1}^{n} f_g^i(x_r)\, f_t^j(x_r),$$

where $TP = m_{11}$, $FP = m_{10}$, $FN = m_{01}$, and $TN = m_{00}$.
True positives described the instances that were part of the feature of interest and were correctly classified as part of the feature of interest by the test segmentation. False positives described the instances that were part of the background and were incorrectly classified as part of the feature of interest by the test segmentation. False negatives described the instances that were part of the feature of interest and were incorrectly classified as background by the test segmentation. True negatives described the instances that were part of the background and were correctly classified as background by the test segmentation.
The overlap measures used these cardinalities. All overlap measures ranged from 0 to 1, where 0 indicated low agreement between ground truth and test segmentations and 1 indicated high agreement between ground truth and test segmentations. The Dice Similarity Coefficient (DSC), commonly used in comparing medical volume segmentations, measured the degree of overlap between two segmentations. It is defined by

$$DSC = \frac{2\,TP}{2\,TP + FP + FN}.$$

The Jaccard index (JAC) measured overlap and can be related to DSC by

$$JAC = \frac{TP}{TP + FP + FN} = \frac{DSC}{2 - DSC}.$$

The true positive rate (TPR), also known as Sensitivity or Recall, measured the proportion of positive cases correctly identified by the test segmentation and is defined by

$$TPR = \frac{TP}{TP + FN}.$$

The true negative rate (TNR), also known as Specificity, measured the proportion of negative cases correctly identified by the test segmentation and is defined by

$$TNR = \frac{TN}{TN + FP}.$$

The positive predictive value (PPV), also known as Precision, represented the proportion of predicted positive cases that were actually positive and is defined by

$$PPV = \frac{TP}{TP + FP}.$$

The Rand index (RI) is commonly used to measure similarity between data clusterings but can also evaluate classifications. The RI described the proportion of correct identifications by the test segmentation and is defined by

$$RI = \frac{TP + TN}{TP + FP + TN + FN}.$$

It was noted during the calculations that, since the entire CT was used as an input for both the ground truth and test segmentations, the number of true negatives (TN) was extremely high, causing the TNR and RI to be very close to 1 (e.g. 0.99998) even if a segmentation had a low TPR and PPV (e.g. below 0.2). To allow the TNR and RI to provide meaningful data, the segmentation comparisons were performed after a bounding box was created. The bounding box was the smallest rectangular volume enclosing both ground truth and test segmentations. Note that this bounding box did not have any effect on TP, FP, or FN since all actual positives and predicted positives were included in the bounding box.
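As an illustration of the overlap measures and the bounding-box step, the following minimal sketch computes the six metrics from two segmentations. It is not the study's code: the function name `overlap_metrics` is hypothetical, each segmentation is assumed to be given as a set of (z, y, x) voxel indices, and the RI is taken here as the fraction of correctly classified voxels inside the bounding box, following the description in the text.

```python
def overlap_metrics(gt, test):
    """Overlap metrics for two segmentations given as sets of (z, y, x) voxel indices."""
    if not gt or not test:
        raise ValueError("both segmentations must be non-empty")
    tp = len(gt & test)   # organ voxels found by the test segmentation
    fp = len(test - gt)   # background voxels labelled as organ
    fn = len(gt - test)   # organ voxels missed by the test segmentation
    # Smallest axis-aligned box enclosing both segmentations.
    union = gt | test
    box_voxels = 1
    for axis in range(3):
        coords = [voxel[axis] for voxel in union]
        box_voxels *= max(coords) - min(coords) + 1
    tn = box_voxels - tp - fp - fn  # background voxels inside the box only
    return {
        "DSC": 2 * tp / (2 * tp + fp + fn),
        "JAC": tp / (tp + fp + fn),
        "TPR": tp / (tp + fn),
        "TNR": tn / (tn + fp) if tn + fp else float("nan"),
        "PPV": tp / (tp + fp),
        # Assumption: RI as the proportion of correctly classified voxels.
        "RI": (tp + tn) / box_voxels,
    }


# Toy example: two-voxel contours sharing one voxel (not study data).
ground_truth = {(0, 0, 0), (0, 0, 1)}
test_contour = {(0, 0, 1), (0, 1, 1)}
metrics = overlap_metrics(ground_truth, test_contour)
print(metrics)
```

Restricting TN to the bounding box leaves TP, FP, and FN unchanged, so only TNR and RI are affected, which matches the remark above.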

Distance metrics
The next set of metrics were the distance metrics, which used the Hausdorff Distance (HD), a measure of the distance between the ground truth and test segmentations. The HD measured the maximum of the distances from each point in both sets of surface points of each volume to its closest point in the other set of surface points and is defined by

$$HD(A, B) = \max\big(h(A, B),\, h(B, A)\big),$$

where $h(A, B)$ is the directed Hausdorff Distance between point set $A$ and point set $B$ described by

$$h(A, B) = \max_{a \in A} \min_{b \in B} \lVert a - b \rVert,$$

where $\lVert a - b \rVert$ is the Euclidean distance between two points. However, the HD was sensitive to outliers, so other metrics were developed to compare the distance. The other distance metrics were calculated from the nearest neighbor distances for all points in $A$ and all points in $B$. This nearest neighbor function, returning a vector of distances describing the minimum distances from set $A$ to $B$, can be written as:

$$d(A, B) = \Big( \min_{b \in B} \lVert a - b \rVert \Big)_{a \in A}.$$

The traditional HD was designated as $HD_{max}$, which can also be written as

$$HD_{max} = \max\big(\max d(A, B),\, \max d(B, A)\big),$$

where $d(A, B)$ and $d(B, A)$ are vectors. The other distance metrics were calculated as follows: The mean distance to agreement (MDA) measured the mean of the nearest neighbor distances from only one set of surface points to the other as described by

$$MDA(A, B) = \operatorname{mean}\, d(A, B) = \frac{1}{|A|} \sum_{a \in A} \min_{b \in B} \lVert a - b \rVert.$$
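The nearest neighbor formulation above can be sketched in a few lines. This is a brute-force illustration under stated assumptions, not the study's implementation: the function names are hypothetical, surface points are assumed to be given as coordinate tuples in millimetres, and the one-sided mean is used for MDA as the text describes. For realistic surface point counts a spatial index (for example `scipy.spatial.cKDTree`) would be a more practical choice than the nested loop.

```python
import math


def nearest_neighbor_distances(a, b):
    """d(A, B): for each point in a, the Euclidean distance to its closest point in b."""
    return [min(math.dist(p, q) for q in b) for p in a]


def hd_max(a, b):
    """Traditional (maximum) Hausdorff distance, symmetric in a and b."""
    return max(max(nearest_neighbor_distances(a, b)),
               max(nearest_neighbor_distances(b, a)))


def mda(a, b):
    """Mean distance to agreement from a to b (one direction only)."""
    distances = nearest_neighbor_distances(a, b)
    return sum(distances) / len(distances)


# Toy surface points in mm (not study data).
surface_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
surface_b = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
print(hd_max(surface_a, surface_b))  # symmetric maximum
print(mda(surface_a, surface_b), mda(surface_b, surface_a))  # direction matters
```

The toy example shows why the maximum is sensitive to outliers: a single distant point in one surface sets the HD, while the mean-based measure is pulled up far less.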

Knowledge-based quality assurance tool
An in-house knowledge-based quality assurance tool was developed by leveraging the distinctive patterns of HU distributions for each individual organ. Sample HU distributions were acquired for each ground truth contour, a total of 1600 HU distributions encompassing 100 patients for 16 organs. The QA tool used kernel density estimation (KDE), which is a statistical technique to estimate the probability density function of a random variable and provide insights into the underlying distribution of data points. In a CT image, the data points were interpreted as voxels, each accompanied by a corresponding HU value. To generate a KDE for each patient's organs, a collection of HU values was extracted. This KDE approach utilized kernel densities to construct a probability density function, effectively capturing the normalized distribution so that the total area under the probability distribution is equal to 1. This methodology permitted a comparison that remained uninfluenced by the organs' varying sizes, which can differ widely within a pediatric population.
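A minimal sketch of the KDE step is shown here. The kernel type and bandwidth used by the tool are not stated in the text, so a Gaussian kernel with a fixed bandwidth of 5 HU is assumed purely for illustration; the function name `gaussian_kde` and the HU values are hypothetical.

```python
import math


def gaussian_kde(hu_values, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of hu_values at each grid point."""
    norm = 1.0 / (len(hu_values) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - v) / bandwidth) ** 2) for v in hu_values)
        for g in grid
    ]


# Hypothetical HU values sampled from one organ contour (not study data).
organ_hu = [30, 35, 40, 45, 50]
hu_grid = list(range(-100, 201))  # 1 HU spacing
density = gaussian_kde(organ_hu, hu_grid, bandwidth=5.0)

# Trapezoidal area under the curve; close to 1 because the density is normalized.
area = sum((density[i] + density[i + 1]) / 2 for i in range(len(density) - 1))
print(round(area, 4))

# Doubling the number of voxels with the same HU values leaves the curve unchanged,
# which is why the comparison does not depend on organ size.
larger_organ = gaussian_kde(organ_hu * 2, hu_grid, bandwidth=5.0)
```

Because every curve is evaluated on the same HU grid, KDEs from different patients can later be averaged point by point.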
To validate the consistency of the HU distributions across multiple patients, KDEs were randomly divided into two groups. One group consisted of KDEs for 80 patients, which were utilized to establish a reference baseline distribution. The second group comprised KDEs for 20 patients, intended for validation purposes.
Subsequently, KDEs for the cohort of 80 reference patients were averaged for each organ. This procedure yielded 16 averaged KDEs, corresponding to the 16 distinct organs. The standard deviation (SD) among these KDEs across the reference patient group was also determined. An agreement value of 0.95 was observed when the ground truths of 20 test patients were compared against the reference KDEs of 80 patients. This correspondence directly conformed to the statistical empirical guideline referred to as the two SDs or two-sigma rule. Therefore, it was logical to anticipate that the newly generated contours would fall within the ±2 SD range that covers 95% of the ground truth HU distributions.
To test the agreement of the newly auto-segmented contours, KDEs were created for two datasets. The first comprised KDEs from 100 patients, utilized to establish a baseline distribution of ground truth contours. The second consisted of KDEs from 10 patients who were not included in the library of auto-segmentation tools. These 10 patients' data were segmented by both atlas and neural network methods for further analysis. The KDEs from 100 patients for each organ were averaged, resulting in the creation of 16 benchmark KDEs corresponding to the 16 organs. These averaged KDEs were employed as the baseline distribution for each respective organ. Upper bounds for the baselines were determined by adding two SDs to the distributions, while lower bounds were established by subtracting two SDs from the distributions. These upper and lower bounds encompassed 95% of the ground truth distributions at each HU value, adhering to the empirical 2-sigma rule.
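The construction of the baseline and its bounds can be sketched as below. This is an illustrative reading rather than the tool's code: it assumes every patient's KDE for a given organ has been evaluated on the same HU grid, uses the sample standard deviation at each grid point, and the function name `baseline_bounds` and the toy curves are hypothetical.

```python
from statistics import mean, stdev


def baseline_bounds(patient_kdes, n_sd=2.0):
    """Average KDE with lower/upper bounds at +/- n_sd standard deviations.

    patient_kdes: list of KDE curves (one per patient) on a shared HU grid.
    """
    if len(patient_kdes) < 2:
        raise ValueError("need at least two patients to estimate an SD")
    columns = list(zip(*patient_kdes))  # one tuple of densities per HU grid point
    average = [mean(col) for col in columns]
    spread = [stdev(col) for col in columns]
    lower = [m - n_sd * s for m, s in zip(average, spread)]
    upper = [m + n_sd * s for m, s in zip(average, spread)]
    return average, lower, upper


# Toy KDE curves for three patients on a three-point HU grid (not study data).
kdes = [
    [0.1, 0.2, 0.3],
    [0.2, 0.2, 0.1],
    [0.3, 0.2, 0.2],
]
baseline, lower_bound, upper_bound = baseline_bounds(kdes)
print(baseline, lower_bound, upper_bound)
```

Where all patients agree at a given HU value the band collapses onto the average, so the band is widest where ground truth contours vary most between patients.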
Test contours from atlas and neural network methods were compared against the baseline distribution using ± 2 SD values as part of the knowledge-based quality assurance procedure. This comparison was performed by creating a KDE for the test contour from its HU distribution. The extent of the test KDE, denoted as $KDE_T$, which falls within the lower and upper bounds of the baseline KDE, represented as $LB_{KDE_B}$ and $UB_{KDE_B}$ respectively, was calculated. This calculation yielded a quantitative measure that gauged the level of agreement, as formulated below.
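The agreement formula itself is not reproduced in this text, so the following sketch shows only one plausible reading and should be treated as an assumption: the percentage of the test KDE's probability mass that lies at HU values where the test curve sits between the lower and upper baseline bounds. The function name `percent_agreement` and the sample curves are hypothetical, and a uniform HU grid is assumed. An unweighted count of grid points inside the band would be an alternative reading.

```python
def percent_agreement(test_kde, lower, upper):
    """Percent of test-KDE mass lying inside the [lower, upper] baseline band.

    Assumption: all three curves are sampled on the same uniform HU grid.
    """
    total = sum(test_kde)
    if total <= 0:
        raise ValueError("test KDE has no mass on the grid")
    inside = sum(
        t for t, lo, hi in zip(test_kde, lower, upper) if lo <= t <= hi
    )
    return 100.0 * inside / total


# Toy curves on a three-point HU grid (not study data).
kde_t = [0.1, 0.5, 0.4]
lb = [0.0, 0.3, 0.1]
ub = [0.2, 0.6, 0.3]
agreement = percent_agreement(kde_t, lb, ub)
print(round(agreement, 1))
```

Under this reading, a contour that includes surrounding tissue shifts part of its HU distribution outside the band and the agreement drops below 100%.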

Figure 1. Heatmap showing the difference between neural network and atlas overlap metrics to determine if the atlas or neural network is closer to ground truth, designated by blue and red, respectively. L and R subscripts mean left and right, respectively. DSC dice similarity coefficient, JAC Jaccard index, TPR true positive rate, TNR true negative rate, RI Rand index, PPV positive predictive value.

Figure 2. Heatmap showing the difference between neural network and atlas distance metrics to determine if the atlas or neural network is closer to ground truth, designated by blue and red, respectively. L and R subscripts mean left and right, respectively. HD Hausdorff distance, MDA mean distance to agreement.

Figure 3. Histograms showing the dice similarity coefficient (DSC) and positive predictive value (PPV) for atlas and neural network methods in 100 patients averaged across all 16 organs.

Figure 4. Average baseline KDE distributions for all 16 organs of 100 CSI patients. L and R subscripts mean left and right, respectively.

Figure 5. Baseline KDE distributions with ± 2 SD (green shaded area) for 16 organs of 100 CSI patients. L and R subscripts mean left and right, respectively.

Figure 6. Percentage agreements of atlas and neural network methods against baseline KDEs for 10 test patients and 16 organs. L and R subscripts mean left and right, respectively.

Table 1. Thirteen quantitative metrics averaged across 16 organs for all 100 patients and compared against ground truth contours for atlas and neural network methods. DSC dice similarity coefficient, JAC Jaccard index, TPR true positive rate, HD Hausdorff distance, MDA mean distance to agreement, TNR true negative rate, RI Rand index, PPV positive predictive value.
https://doi.org/10.1038/s41598-024-55015-7