Effects of interobserver and interdisciplinary segmentation variabilities on CT-based radiomics for pancreatic cancer

Radiomics is a method to mine large numbers of quantitative imaging features and develop predictive models. It has shown exciting promise for improved cancer decision support from early detection to personalized precision treatment, and therefore offers a desirable new direction for pancreatic cancer where the mortality remains high despite the current care and intense research. For radiomics, interobserver segmentation variability and its effect on radiomic feature stability is a crucial consideration. While investigations have been reported for high-contrast cancer sites such as lung cancer, no studies to date have investigated it on CT-based radiomics for pancreatic cancer. With three radiation oncology observers and three radiology observers independently contouring on the contrast CT of 21 pancreatic cancer patients, we conducted the first interobserver segmentation variability study on CT-based radiomics for pancreatic cancer. Moreover, our novel investigation assessed whether there exists an interdisciplinary difference between the two disciplines. For each patient, a consensus tumor volume was generated using the simultaneous truth and performance level expectation algorithm, using the dice similarity coefficient (DSC) to assess each observer’s delineation against the consensus volume. Radiation oncology observers showed a higher average DSC of 0.81 ± 0.06 than the radiology observers at 0.69 ± 0.16 (p = 0.002). On a panel of 1277 radiomic features, the intraclass correlation coefficients (ICC) was calculated for all observers and those of each discipline. Large variations of ICCs were observed for different radiomic features, but ICCs were generally higher for the radiation oncology group than for the radiology group. Applying a threshold of ICC > 0.75 for considering a feature as stable, 448 features (35%) were found stable for the radiation oncology group and 214 features (16%) were stable from the radiology group. Among them, 205 features were found stable for both groups. Our results provide information for interobserver segmentation variability and its effect on CT-based radiomics for pancreatic cancer. An interesting interdisciplinary variability found in this study also introduces new considerations for the deployment of radiomics models.


Results
Segmentation variability. Interobserver and interdisciplinary segmentation variability were observed. Figure 1 illustrates two example cases delineated by all observers with varying degrees of agreement between observers. For the case on the left, a high volume agreement was observed among observers. For the patient on the right, a low agreement was observed. Anecdotally, on this patient, the three radiation oncology observers delineated significantly larger volumes than the three radiology observers. For the patient cohort overall, the radiology observers were also found to contour more conservatively than the radiation oncology observers, yielding a mean volume of. 35.6 ± 15.0 cm 3 versus 54.1 ± 15.9 cm 3 (p < 0.0001).
The dice similarity coefficient (DSC) was calculated per patient to quantify the agreement between each observer contour with the consensus contour. Figure 2 shows the distribution of DSC values with mean and standard deviations within each discipline for each patient CT image. The mean DSC for radiation oncology observers and radiology observers were 0.81 ± 0.06 and 0.69 ± 0.16, respectively. Using the grading scale defined in the color legend of Fig. 2, 6 cases of high agreement (DSC > 0.85) were observed for radiation oncology, with no average DSC values falling below medium agreement (DSC < 0.7), compared to the 4 cases of very low agreement from radiology contours (DSC < 0.5). As shown in the distribution plot in Fig. 2, the radiation oncology group showed both a higher average agreement with the consensus segmentation (indicated by higher average DSCs) and a higher agreement with each other (indicated by the tighter standard deviations), when compared with the radiology group. A paired two-tailed independent T-test calculated a p-value of 0.0038 indicating a statistical significance of the different DSC values between the two disciplines' observers.
Radiomic feature robustness. A total of 1277 features were analyzed for robustness using the intraclass correlation coefficient (ICC) (2,1) calculated both with all 6 observers and separately with the observers of a single discipline. Of these, 143 were extracted from the original image as described in the methods section: 24 first-order, 17 shape, 22 intensity-volume histogram (IVH), and 80 texture features. As examples, the ICC values from radiation oncology contours for first order and shape features are shown in Fig. 3. ICC values from both discipline for all other original features are presented in Supplementary Figures S1-S7.
Comparing the single-discipline observer ICCs, varying degrees of difference were observed for different types of original radiomic features between radiation oncologist observers and radiology observers. For either discipline, relatively low ICCs were observed for some features. Overall, radiomic feature ICCs were lower within the radiology observers than the radiation oncology observers, indicating poorer robustness. Applying ICC ≥ 0.75 as the threshold for features considered robust, 48/143 (34%) original features were robust based on radiation oncology ICCs and 35/143 (24%) were robust based on radiology ICCs.
To observe the overall distribution of ICC values and compare between the two disciplines for all radiomic features, Fig. 4 Tables 1 and 2 for the radiation oncology and the radiology group, respectively. The feature count and number of robust features are reported for each feature class within each image filter type. As with Figs. 3 and 4, varying robustness were observed for different features and for different feature classes. The top three features with the highest ICCs within each image filter category are listed in Tables 3 and 4 for radiation oncology and radiology groups, respectively. Generally, these top robust features tend to come from certain texture categories such as those based on the run length matrix and the size zone matrix.
A Wilcoxon Signed-rank Test was used to measure the statistical difference between radiation oncology and radiology based ICC values for each image filter. All filters indicated a significant difference with p-values < 0.0001.
Feature ranking correlation between two disciplines. To evaluate relative feature robustness assessed by either discipline, we plotted the original feature ranking within each feature type for the two disciplines in Fig. 5. In other words, in Fig. 5a, for first order features, we plot out their robustness ranking based on radiation oncology observers and radiology observers. The feature rankings are similarly plotted out for other feature classes comparing the two disciplines. A perfect agreement between the two disciplines would result in a 45-degree linear regression of R 2 = 1. Texture features included gray-level co-occurrence matrix (GLCM), gray-level run length matrix (GLRLM), gray-level size zone matrix (GLSZM), neighboring gray tone differ-

Discussion
To our knowledge, this is the first study addressing the degree of interobserver and interdisciplinary variation in tumor delineation and its effect on CT-derived radiomic feature stability in pancreatic cancer. Using the consensus contours generated based on six expert observers using the Simultaneous Truth and Performance Level Estimation (STAPLE) algorithm, the DSC values between individual observer's and the consensus contours show considerable interobserver delineation variation for pancreatic cancer. Furthermore, significant interdisciplinary differences are observed between radiation oncology and radiology observer DSCs. These results are also reflected in the large variance of ICC values regardless of image filter and the statistically significant variance between disciplinary-based derived features. Not surprisingly, lower overall ICC values were observed for radiomic features derived from contours with lower DSC values with the consensus volume. This correlation suggests that radiomic feature robustness, and therefore, subsequent radiomic analyses, could be strongly influenced by interobserver delineation variability. But as the radiomic feature showed widely varying degrees of robustness against the interobserver delineation variability, excluding the unstable features in radiomic feature selection appear important for pancreatic cancer model development. While Belli et al. assessed radiomic feature robustness due to interobserver variation in pancreatic cancer, FDG-PET/CT based contouring was used 20 . Thus, no direct comparison can be drawn with our study as the first of its kind to assess feature robustness due to interobserver variation from contrast-enhanced CT imaging. For low-contrast tumors such as pancreatic cancer, inter-observer delineation uncertainty could be quite different based on anatomy (CT imaging) versus on take-up (PET imaging). Therefore, it is not surprising that our study found lower ICCs than what was previously reported in the PET study. In Luisa Belli et al. 's study evaluating interobserver variability for 73 radiomic features from PET/CT images, 47% of radiomic features were considered robust using an ICC cutoff of 0.80. When considering only features from the non-filtered image in our study, 34% (Radiation Oncology) and 24% (Radiology) were considered robust with an ICC cutoff of 0.75. This difference can be attributed to the utilization of PET imaging, which likely improved tumor conspicuity. Comparing our results for pancreatic cancer with previous studies investigating radiomic feature robustness against segmentation variability in other sites such as breast, non-small cell lung cancer, glioblastoma, and liver, the ICCs are also considerably lower [12][13][14][15][16][17][18][19][20][21][22] . In comparison, Pavic et al. found 90% and 56% of radiomic features considered stable with an ICC cutoff of 0.80 despite interobserver variation in non-small cell lung cancer and head and neck squamous www.nature.com/scientificreports/ cell cancer, respectively 12 . The results indicate that although all these studies confirm that contour variability impacts radiomic feature stability and features show varying degrees of robustness, interobserver variability will likely be more dominating among different sources of uncertainty to impact radiomic feature reproducibility for pancreatic cancer than for other high-contrast cancers such as lung cancer. Interestingly, our investigation also revealed a significant interdisciplinary delineation variation. This is to our knowledge the first study to investigate the existence of a specialty difference in segmentation for radiomics. Our results agree with previous report from Nq et al. on radiotherapy target delineation for post-operative head and neck cancer patients 32 that a specialty difference does exist. We observe that the tumor volumes from the radiation oncology group tend to be larger than those from the radiology group. Despite our efforts to standardize the contouring experience, this could have resulted from radiation oncologists' tendency to "not miss the target" in their training despite the common contouring guideline in this study to exclude any uncertain area. In addition, we found higher consistency among the radiation oncology group than among the radiology group, possibly stemming from the practice difference that volume segmentation is a routine activity in radiation oncology, but in radiology volume labeling is more important than whole volume segmentation. Among different classes of original radiomic features, the relative agreement between the two disciplines is better for first-order features Table 1. Statistic summary table of radiation oncologist derived stable features. Total count of first order, shape, gray-level co-occurrence matrix (GLCM), gray-level run length matrix (GLRLM), neighboring gray tone difference matrix (NGTDM), and neighboring gray-level dependence matrix (NGLDM), gray-level size zone matrix (GLSZM), and IVH features with respective robust features counts are listed. Feature definitions and calculation provided by Computational Environment for Radiological Research (CERR) 35 , as recommended by the imaging biomarker standardization initiative (IBSI) 36 . www.nature.com/scientificreports/ and shape features than the texture features. These novel findings are important as vast majority of the current radiomic models are developed based on delineated volumes either from radiology or radiation oncology. Such interdisciplinary delineation uncertainties should be considered when the segmentation is done by a different specialty in model deployment than in model development. Some studies have suggested the use of semiautomatic segmentation to improve radiomic feature robustness against volume delineation [14][15][16] . However, the current semiautomatic segmentation methods' effectiveness in low contrast regions such as that of pancreatic cancer in CT imaging may be more challenging.
There are some limitations to our study to consider. Intraobserver variability was not investigated in our study. The threshold ICC value of 0.75 was selected based on literature investigating radiomic feature robustness due to interobserver variation in other tumor sites [12][13][14][15][16][17][18][19][20][21] . With ICC thresholds between 0.7 and 0.9 seen in the literature, our study chose a threshold of 0.75 on the lower end, adding a small conservative margin to account for the increased difficulty in contouring pancreatic cancer compared to those in the aforementioned studies. The significance of this threshold with respect to its effect on radiomics analysis, however, has not been investigated. Additionally, our study evaluated feature robustness, but its effect on the latter phases of radiomics analysis and how best to address it were not included. Also, this investigation is based solely on CT images while MRI and Table 2. Statistic summary table of radiologist derived stable features. Total count of first order, shape, gray-level co-occurrence matrix (GLCM), gray-level run length matrix (GLRLM), neighboring gray tone difference matrix (NGTDM), and neighboring gray-level dependence matrix (NGLDM), gray-level size zone matrix (GLSZM), and IVH features with respective robust features counts are listed. Feature definitions and calculation provided by Computational Environment for Radiological Research (CERR) 35 , as recommended by the imaging biomarker standardization initiative (IBSI) 36 . www.nature.com/scientificreports/ PET images also provide additional information and are sometimes fused to CT for pancreatic tumor evaluation. However, as CT remains the dominating radiological imaging modality in clinical practice for pancreatic cancer, we chose to focus the current study on CT-based radiomics. In our study, CT scans of limited scanner types and acquisition parameters were selected to isolate the observer variability from variabilities of imaging factors. Similarly, a fixed window/level setting was imposed in our study, ensuring different observers' contours were not affected by varying window/level settings individual observers may choose. Yet by doing so, our findings reflect the results obtained with our specific image acquisition, reconstruction, and display parameters, and may need validation before generalized to other cases. In addition, our radiomic feature results from 5 mm isotropic voxels could be affected by edge and partial volume effects, though the tumor volumes are relatively large with a mean volume of 50.4 cm 3 .
Overall, the results of our investigation contribute to the conversation necessitating more rigorous evaluation of volume reproducibility prior to radiomic feature analysis. Our site-specific findings for pancreatic cancer are important as feature robustness against segmentation uncertainty will likely play a more dominating role in the reproducibility of radiomics for such cancers. The novel discovery on interdisciplinary variations also introduces new considerations for the deployment of radiomics-based predictive models.

Study cohort. Under the approval of the Institutional Review Board of University of Nebraska Medical
Center (IRB#091-01-EP and IRB#127-18-EP), the diagnostic contrast-enhanced CT scans from 21 pancreatic cancer patients were used for this study. The patients whose images were investigated in this study were all enrolled with informed consents in the Rapid Autopsy Pancreas Program at University of Nebraska Medical Center. To date, the program has collected unique tissue specimens within hours of death from over a hundred enrolled pancreatic cancer patients over the past decade. Retrospective analyses in this study were performed in accordance with the relevant guidelines and regulations as approved by the Research Ethics Committee of University of Nebraska Medical Center. From the patients with available contrast-enhanced CT images at the time Table 3. ICC values of top 3 features for radiation oncologist derived contours per filtered image. Feature definitions and calculation provided by Computational Environment for Radiological Research (CERR) 35 , as recommended by the imaging biomarker standardization initiative (IBSI) 36 . www.nature.com/scientificreports/ of diagnosis, we selected the maximum number of patients imaged with the same line of CT scanners from a single vendor and acquired with the same acquisition protocol and slice thickness to minimize the effect of other uncertainty-contributing factors. This resulted in the 21 patients used in the study. These 21 patients included 1 stage IIA, 2 stage IIB, 1 stage III, and 17 stage IV patients.

Image acquisition and volume segmentation.
For the 21 patients included in the study, the image acquisition was performed using one of the following three CT scanner models: Lightspeed VCT, Lightspeed Pro 16, and Lightspeed RT16 (GE Healthcare, Boston, Massachusetts, USA). Patients received ISOVUE injection with bolus triggering arterial phase imaging about 30 s and venous phase about 60 s after injection. A slice thickness of 5 mm was used for all patient acquisitions, and the in-plane resolution was between 0.6 and 0.9 mm. Six expert observers, comprised of three radiation oncologists and three radiologists, each contoured the entire cohort. The three radiation oncologists have 21, 8, and 6 years of experience within their specialty, respectively, and the three radiologists have 19, 7, and 4 years of experience within their specialty, respectively. All observers in the study are experts in assessing/treating pancreatic tumors as their clinical and/or research specialization. For each patient, the tumor was contoured using the iPlan software (Brainlab AG, Feldkirchen, Germany). To standardize the delineation for individual observers, the following segmentation instructions were given: • Tumor: Only contour what you feel certain is the gross tumor. Where it is uncertain, exclude.
• A window width of 400 with level at 50 has been preset for all cases. Do not change the window/level. • Use only the assigned CT for contouring. Do not rely on the help of MR, or other CT data.
• Complete contours independently. www.nature.com/scientificreports/ All contours were visually inspected by a non-observer investigator. In a couple of instances, an individual contour deviated substantially from that of other observers' . The observer was asked to review their contour, with the option to edit if so chose, while still blinded from other observers' contours.
Consensus volume generation and segmentation variability assessment. For each patient, a consensus volume was created using the STAPLE algorithm as a ground truth surrogate 33 . The STAPLE algorithm creates a consensus volume based on the volume delineations from all 6 observers.
The dice similarity coefficient (DSC) was used to quantify the degree of volume overlap between two volumes 34 . The calculation of the DSC value is defined in Eq. 1, where of A and B are the two volumes for which the DSC is to be calculated, and ∩ indicates the intersection of the two volumes. The DSC is between 0 and 1, with 1 indicating two identical contours and 0 indicating two completely different contours.
Because the DSC requires a pair-wise comparison, it was calculated for each observer contour to quantify the overlap with the consensus volume. Image processing and radiomic feature extraction. For this study, a panel of 1277 radiomic features were calculated using an adapted version of CERR implemented in Matlab R2018b. These features consisted of: first order (n = 24), shape (n = 17), texture (n = 80), and intensity-volume histogram (IVH) (n = 22), on the original image and with 9 filters applied: Laplacian of Gaussian (LoG, n = 126), and 8 permutations of 3D wavelets (LLL, HLL, LHL, HHL, LLH, HLH, LHH, and HHH totaling n = 1008). As the shape features stay invariant with image filters, they were excluded in counting LoG and wavelet features. Individual calculated values can be found in the Supplementary Dataset for all features extracted using CERR, which followed the recommenda-(1) DSC = 2|A| ∩ |B| |A|+|B| Figure 5. Ranked feature robustness. Features were ranked based on their relative robustness within each feature class for each discipline to indicate potential trends. A slope of 1 would suggest that the relative stability of features tend to be equivalent for both disciplines. www.nature.com/scientificreports/ tions provided by the image biomarker standardization initiative (IBSI) 36 . Before feature extraction, the voxels were made isotropic by resampling the images to 5 × 5 × 5 mm 3 voxels using sinc interpolation. Images were discretized using a bin width of 25 and the texture matrices were calculated for all three dimensions (resulting in 26 directions, or 13 symmetrical directions) with a voxel offset of 1 for neighboring voxels.
Radiomic feature robustness evaluation. The intraclass correlation coefficient (ICC) was used to quantitatively evaluate the robustness of radiomic features due to interobserver variations. An ICC (2,1) was selected to account for two-way random effects with single measurements when assessing the absolute agreement [37][38][39] . ICC values were also calculated separately based on discipline (radiation oncology vs. radiology) to assess interdisciplinary effects.
Radiomic features with ICC values greater than 0.75 were considered reproducible and robust. All ICC values were calculated using the 'irr' package in RStudio 40 .
Ethics approval. The study was approved by the Institutional Review Board of University of Nebraska Medical Center (IRB#091-01-EP and IRB#127-18-EP).

Conclusion
Volume segmentation variability affects radiomic feature stability for CT-based radiomics studies in pancreatic cancer, as has been shown for other cancer sites. Considerably lower interobserver ICCs were found than for high-contrast cancer sites, suggesting a more dominating role segmentation uncertainty plays in radiomics for pancreatic cancer. A novel interdisciplinary variability is also observed on segmentation, introducing new considerations for the deployment of radiomics-based predictive models.