Resampling and harmonization for mitigation of heterogeneity in image parameters of baseline scans

Our study investigates the effects of heterogeneity in image parameters on the reproducibility of prognostic performance of models built using radiomic biomarkers. We compare the prognostic performance of models derived from the heterogeneity-mitigated features with that of models obtained from raw features, to assess whether reproducibility of prognostic scores improves upon application of our methods. We used two datasets: The Breast I-SPY1 dataset—Baseline DCE-MRI scans of 156 women with locally advanced breast cancer, treated with neoadjuvant chemotherapy, publicly available via The Cancer Imaging Archive (TCIA); The NSCLC IO dataset—Baseline CT scans of 107 patients with stage 4 non-small cell lung cancer (NSCLC), treated with pembrolizumab immunotherapy at our institution. Radiomic features (n = 102) are extracted from the tumor ROIs. We use a variety of resampling and harmonization scenarios to mitigate the heterogeneity in image parameters. The patients were divided into groups based on batch variables. For each group, the radiomic phenotypes are combined with the clinical covariates into a prognostic model. The performance of the groups is assessed using the c-statistic, derived from a Cox proportional hazards model fitted on all patients within a group. The heterogeneity-mitigation scenario (radiomic features, derived from images that have been resampled to minimum voxel spacing, are harmonized using the image acquisition parameters as batch variables) gave models with highest prognostic scores (for e.g., IO dataset; batch variable: high kernel resolution—c-score: 0.66). The prognostic performance of patient groups is not comparable in case of models built using non-heterogeneity mitigated features (for e.g., I-SPY1 dataset; batch variable: small pixel spacing—c-score: 0.54, large pixel spacing—c-score: 0.65). The prognostic performance of patient groups is closer in case of heterogeneity-mitigated scenarios (for e.g., scenario—harmonize by voxel spacing parameters: IO dataset; thin slice—c-score: 0.62, thick slice—c-score: 0.60). Our results indicate that accounting for heterogeneity in image parameters is important to obtain more reproducible prognostic scores, irrespective of image site or modality. For non-heterogeneity mitigated models, the prognostic scores are not comparable across patient groups divided based on batch variables. This study can be a step in the direction of constructing reproducible radiomic biomarkers, thus increasing their application in clinical decision making.

www.nature.com/scientificreports/ (varying across the pairwise scenarios, determined using the CCC metric) was higher for harmonized features (range 15.4-87.9%) when compared to the non-harmonized features (range 8.8 to 85.7%) 21 . The studies discussed above have focused on the reproducibility of the radiomic features in a limited setting (test-retest scenarios, phantom studies and so on). The variation in radiomic features in a test-retest experimental scenario (difference between the scans is in the order of minutes) is not reflective of a test-retest scenario in a clinical setting (difference between the scans is in the order of days), as significant physiological changes in the tumor regions can over time. Similarly, the assessment of the effect of image parameter heterogeneity on the reproducibility of radiomic features extracted from phantom scans is not comparable to the assessment performed on the features extracted from tumor regions. This is because the features extracted from human tissue are expected to encapsulate a wider range of variation, as they are also influenced by biological factors. Further, while the studies have explored the sensitivity of individual radiomic features to image parameter variation, little attention has been given to assessing how the image parameter heterogeneity affects the reproducibility of radiomic biomarkers, and how various heterogeneity-mitigation techniques can be used to improve the robustness of the radiomic signatures.
Our study aims to investigate the effects of individual image parameters and how their heterogeneity affects the reproducibility of prognostic performance of models built using radiomic biomarkers. We have used a variety of resampling and harmonization techniques to mitigate the heterogeneity in the radiomic features. We will compare the prognostic performance of the models derived from the heterogeneity-mitigated features with the performance of the models obtained from the raw, non-heterogeneity mitigated features, and assess whether the reproducibility of the prognostic scores improves upon the application of our methods. We hypothesize that the radiomic biomarkers derived from images with more homogenous imaging parameters will produce models whose prognostic performance is more consistent across the individual parameter categories. Our study includes two databases. The first dataset consists of baseline DCE-MRI scans of 156 women with locally advanced breast cancer, publicly available via The Cancer Imaging Archive (TCIA). The women underwent neoadjuvant chemotherapy with an anthracycline-cyclophosphamide regimen alone or followed by taxane. The second dataset consists of baseline CT scans of 107 patients with stage 4 NSCLC, treated at our institution with first-line pembrolizumab monotherapy or combination therapy. We have included datasets from different organ sites and different image modalities, to see if our hypothesis holds across different sites and modalities.

Materials and methods
Breast I-SPY1 dataset. Study sample and data. The ACRIN 6657/I-SPY1 TRIAL 22 enrolled n = 237 women from May 2002 to March 2006. From this cohort, n = 230 women met the eligibility criteria of being diagnosed with locally advanced breast cancer with primary tumors of stage T3 measuring at least 3 cm in diameter. The pre-operative DCE-MRI images of 222 women were publicly available via The Cancer Imaging Archive (TCIA) 23 . From this TCIA set, 15 women were excluded for our present study, due to incomplete DCE acquisition scans. A subsequent 51 women were also excluded due to either incomplete histopathologic data or recurrence-free survival (RFS) outcome, or missing pre-treatment DCE-MRI scans. This resulted in the inclusion of n = 156 women for this study, with baseline DCE-MRI scans. All women underwent longitudinal DCE-MRI imaging on a 1.5 T field-strength system. Women underwent neoadjuvant chemotherapy with an anthracycline-cyclophosphamide regimen alone or followed by taxane. The demographic information of the patients is included in Supplementary Table S1. NSCLC IO dataset. Study sample and data. This single-center retrospective, observational study was conducted at the Hospital of the University of Pennsylvania between November 2016 and December 2020. The study was approved by the University of Pennsylvania's Institutional Review Board (IRB) committee under a waiver of informed consent. All methods in this study were in accordance with the Declaration of Helsinki. Patients (n = 107) with stage 4 Non-Small Cell Lung Cancer (NSCLC) treated with first-line pembrolizumab based therapy at our institution were identified. The demographic information of the patients is included in Supplementary  Table S3. Preliminary analyses conducted by our group on this dataset can be found here 24 .
Radiomic feature extraction. The 3D tumor volumes were manually segmented by board-certified, fellowshiptrained radiologists using the semi-automated ITK-SNAP software (version 3.6.0) 25 . We have used the Cancer Phenomics Toolkit (CaPTk) 26 , a highly-standardized, user-friendly, open-source software developed at our institution, that conforms to the Imaging Biomarker Standardization Initiative (IBSI) radiomics standardization protocols 27 , for extraction of radiomic features (n = 102) from the tumor regions of interest (ROIs). The radiomic features represent the following eight type of descriptors: (1) Intensity features or first-order statistics (capturing the voxel grey-level intensities within a neighborhood). (2) Histogram-based features (computed using an intensity histogram by discretization of the original intensity distribution. (3) Volumetric features (computed by utilizing the voxel intensities in the ROI and are based on the relationship between discretized intensity and the fraction of the volume containing the least intensity). (4) Morphologic features (describe geometric aspects of a region of interest (ROI), such as area and volume). (5) Gray level run length matrix features (based on quantifying gray level runs as the lengths of consecutive pixels). (6) Neighboring gray tone difference matrix features (rotation-independent features based on gray-level relationships between neighboring voxels and aim to capture the coarseness of the overall texture). (7) Gray level size zone matrix features (grey level size zone matrix (GLSZM) counts the number of groups (or zones) of linked voxels, where voxels are linked if the neighboring voxel has an identical discretized grey level). (8) Local binary pattern features (describe the local texture patterns in an image where the LBP works in a block size of 3 × 3, in which the center pixel is used as a threshold for the neighboring pixel, and the LBP code of a center pixel is generated by encoding the computed threshold value www.nature.com/scientificreports/ into a decimal value). A list of features belonging to each family and their formulae, can be found in Supplementary Tables S13 and S14 respectively.

Radiomic feature harmonization.
ComBat is a harmonization method originally developed for genomics that can correct variation in features due to imaging parameters by using empirical Bayes to estimate location and scale parameters to shift data 28 . While ComBat is fast and easy to use, current implementations of ComBat are only able to harmonize by a single batch effect at a time and are therefore unable to adequately harmonize datasets that are heterogeneous in more than one batch effect. The OPNested ComBat approach used in our study enables harmonization by multiple batch effects by implementing sequential harmonization [29][30][31] . The approach was initialized with the radiomic features as input data and a list of batch variables (Breast I-SPY1: Table 1 and NSCLC IO: Table 2 www.nature.com/scientificreports/ Accounting for heterogeneity in imaging parameters. In our study, we use the following scenarios (summarized in Table 3) to mitigate the heterogeneity in image parameters: 1. The variation in image physical dimensions is addressed by harmonizing the radiomic features using the voxel spacing parameters as the batch variables. This is performed under two scenarios, using offsets of 3 mm (1A) or 5 mm (1B) while extracting the features. Here, offset defines the distance between the center voxel and the neighboring voxels.
We keep a common offset value for feature extraction, since the voxel spacing varies across the images. Thus, a standard offset value (either 3 mm or 5 mm) will ensure the feature extraction is being performed in the same physical dimension. 2. The variation in image acquisition parameters is addressed by harmonizing the radiomic features using contrast enhancement and kernel resolution as the batch variables. This is performed under two scenarios, using offsets of 3 mm (2A) or 5 mm (2B) while extracting the features. 3. The variation in the image physical dimension (voxel spacing) parameters is addressed by performing anisotropic resampling on the images. The images are resampled to the minimum value across each of the voxel spacing parameters [Breast I-SPY1 dataset: (x × y × z-0.7 mm, 0.7 mm, 1.5 mm); NSCLC IO dataset: (x × y × z-0.54 mm, 0.54 mm, 0.8 mm)]. The variation in the contrast enhancement and kernel resolution parameters is addressed by harmonizing the features from the above resampled images, using the image acquisition parameters as batch variables in this scenario. 4. The variation in image physical dimensions and acquisition parameters is addressed by harmonizing the radiomic features using the voxel spacing parameters, contrast enhancement and kernel resolution parameters as the batch variables. This is performed under two scenarios, using offsets of 3 mm (4A) or 5 mm (4B) while extracting the features.
Radiomic phenotype identification. Following heterogeneity-mitigation with each of the scenarios described above, unsupervised hierarchical clustering was performed on the features 32 . An agglomerative approach was used to create a hierarchical clustering of the patients using Euclidean distance between the extracted features and Ward's minimum variance method as the clustering criterion 33 . The optimal number of distinct phenotypes, k, was determined by assessing the stability and significance of each phenotype for each value of k that was considered. The optimal number of stable phenotypes was determined using consensus clustering 34 , where dataset was sub-sampled and cluster arrangements were determined using varying values of k. For each value of k, the proportion that two patients occupied the same phenotype cluster out of the number of times they appeared in the same subsample was determined and stored in a consensus matrix, from which a cumulative distribution function (CDF) was determined. Cluster stability, determined by the area under the CDF curve, was evaluated for each value of k. Statistical significance of the identified, stable phenotypes was evaluated using the SigClust method 35,36 . Here, the significance of the cluster index, defined as the sum of within-cluster sums of squares about the cluster-mean divided by the total sum of squares about the overall mean was tested against a null distribution, simulated using 10,000 samples from a Gaussian distribution fit to the data. The test was performed at each phenotype split to determine statistical significance (p < 0.05). Two optimal radiomic phenotypes were identified in each scenario.  Table S4) include PDL1 expression, ECOG, BMI and smoking status. The prognostic performance of the groups is assessed using the concordance statistic (c-statistic) 37 , derived from a Cox proportional hazards model fitted on the all the patients present within a given group (Table 4: Breast I-SPY1 and Table 5: NSCLC IO). We also wanted to see if the trends in the c-statistics Table 3. A description of the heterogeneity-mitigation scenarios. www.nature.com/scientificreports/ hold when c-scores are derived from a five-fold cross-validated Cox proportional-hazards analysis with 200 iterations. These cross-validated c-scores and 95% confidence intervals (CIs) have been included in the Supplementary File (Table S7: breast I-SPY1 and Table S8: NSCLC IO). A flowchart summarizing the steps involved in comparing the prognostic performance of the models derived from the patient groups is included below (Fig. 1).   www.nature.com/scientificreports/ The comparison is performed for the models containing the radiomic biomarkers derived from the features where the image parameter heterogeneity has been mitigated using various scenarios and the model containing the radiomic biomarker derived from the raw features.

Results
In our analysis, we have compared the performance of the prognostic models in the patient groups divided on the basis of the individual batch variables. The mean prognostic score (central tendency) of the first group's model is compared to the second group's model.
We made the following observations: 1. For the I-SPY1 and IO datasets, scenario 3 (radiomic features, derived from images that have been resampled to the minimum voxel spacing, are harmonized using the image acquisition parameters as batch variables) gave models with consistently high prognostic scores across the batch variable groups (I-SPY1 dataset (batch variable group: large pixel spacing)-c-score: 0.67, IO dataset (batch variable group: high kernel resolution)c-score: 0.66). 2. The prognostic performance of the patient groups divided on the basis of batch variables is not comparable in the case of models built using the raw, non-heterogeneity mitigated features (for instance: I-SPY1 dataset (batch variable group: small pixel spacing)-c-score: 0.54, (batch variable group: large pixel spacing)-cscore: 0.65, IO dataset (batch variable group: low kernel resolution)-c-score: 0.57, (batch variable group: high kernel resolution)-c-score: 0.62). 3. The prognostic performance of the patient groups divided on the batch variables are closer (comparable) in case of pixel spacing for the I-SPY1 dataset (for instance scenario 1A-batch variable group: small pixel spacing-c-score: 0.58, large pixel spacing-c-score: 0.64) and slice thickness for the IO dataset (for instance scenario 3-batch variable group: thin slice-c-score: 0.62, thick slice-c-score: 0.60). The prognostic performance of the models fitted on the entire dataset, for both the raw and heterogeneitymitigated features, has been included in Supplementary Table S11 (for the ISPY1 dataset) and Table S12 (for the IO dataset). 4. The p value of the dendrogram split is more significant in the heatmaps derived from the heterogeneity mitigated features as compared to the non-heterogeneity mitigated features (for instance, in Table 6 (Breast I-SPY1 dataset), the p value of the dendrogram split in the heatmap for the non-heterogeneity mitigated features (patient group: thick slices) is 0.04 and 0.001 in the heatmap for features with heterogeneity mitigated using scenario 4b (harmonize by voxel spacing and image acquisition parameters, offset 5 mm for feature extraction). The p value of the dendrogram splits for the patient groups based on the other batch variables (for the Breast I-SPY1 and NSCLC IO datasets) are included in the Supplementary Table S9. 5. The normalized mutual information (NMI) between phenotypes derived from heterogeneity mitigation scenario 3 (models with the highest prognostic scores) and other heterogeneity mitigation scenarios is higher than the NMI between phenotypes derived from heterogeneity mitigation scenario 3 and those derived from the non-heterogeneity mitigated scenario (for instance, in Table 7 (NSCLC IO dataset), for patients with high kernel resolution scans, the NMI between phenotypes derived from scenario 3 and scenario 1a (harmonize by voxel spacing parameters, offset 3 mm for feature extraction) is 0.38 and the NMI between scenario 3 and those derived from the non-heterogeneity mitigated scenario is 0.003).
The NMI values between phenotypes for the patient groups based on the other batch variables (for the Breast I-SPY1 and NSCLC IO datasets) are included in the Supplementary Table S10. The phenotypes from the radiomic features mitigated using scenario 3 and the non-heterogeneity mitigated features for patients grouped based on their batch variables can be visualized using Fig. 2 [(panel 1a: Breast I-SPY1 dataset-patients with thick slices, heterogeneity mitigated features), (panel 1b: Breast I-SPY1 Table 6. Significance of the cluster dendrogram split for heatmaps built using features subjected to various heterogeneity mitigation and non-mitigation scenarios; for patient groups divided based on batch variables (thick slice: Breast I-SPY1 dataset; high kernel resolution: NSCLC IO dataset). www.nature.com/scientificreports/ Table 7. Normalized mutual information between phenotypes of the best-performing heterogeneitymitigation scenario and other mitigation and non-mitigation scenarios; for patient groups divided based on batch variables (thick slice: Breast I-SPY1 dataset; high kernel resolution: NSCLC IO dataset).

Discussion
The heterogeneous nature of image parameters, as a result of variation in scanner parameters and image acquisition protocols, especially in large-scale retrospective datasets from multi-institutional studies, makes the development of reproducible radiomic biomarkers challenging. The radiomics community has recently started discussing how the robustness of radiomic biomarkers to the heterogeneity in image parameters is essential to improving their acceptance in the clinical community. However, even though previous studies have focused on the reproducibility of radiomic features in a limited setting (test-retest, phantom studies etc.) and have explored the sensitivity of individual radiomic features to image parameter variability, little attention has been given to assessing how this variability affects the reproducibility of radiomic signatures.
Our study assesses several techniques to address the effect of heterogeneity in image parameters on the reproducibility of radiomic biomarkers. We observed that, in case of both the databases, the phenotypes derived from features whose heterogeneity has been mitigated using various scenarios are more similar to each other (higher normalized mutual information (NMI) score). The NMI score is lower between phenotypes derived from heterogeneity-mitigated features and phenotypes derived from the raw features. In the non-heterogeneity mitigated models, the prognostic scores are not comparable across the patient groups divided on the basis of each batch variable. The prognostic performance of the patient groups divided based on the batch variables are closer (comparable) in case of pixel spacing for the Breast I-SPY1 dataset and voxel spacing for the NSCLC IO dataset. We note that, among the various heterogeneity mitigation scenarios, the model containing the radiomic phenotypes derived from scenario 3 (resampling images to minimum voxel spacing and harmonizing for differences in image acquisition parameters) had a higher prognostic performance across most of the patient groups, and thus can be used as a potential starting point for the heterogeneity mitigation component in future studies. Our results also show that the phenotypes obtained using unsupervised hierarchical clustering are more significant (metric-p value of dendrogram split in the heatmap) in the case of heterogeneity mitigated features compared to the raw features.
We note that although the statistical significance of the phenotypes obtained with heterogeneity mitigation is improved, the prognostic performance of the models does not improve substantially. One of the possible reasons for this reduction has been discussed in the paper based on the harmonization method used in our analysis: "A possible explanation is that because imaging parameters were generally associated with outcome as a consequence is study design, the removal of variation associated with those imaging parameters reduced predictive performance" 30 . However, we would like to point out that an improvement in the reproducibility of the radiomic signatures does not necessarily correlate with an improvement in prognostic performance. Although the predictive performance may be moderate for some of our radiomic models, the application of heterogeneitymitigation techniques does make the prognostic scores more comparable across the patient groups divided based on the batch variables, and hence, reproducible. We would like to reinstate here that it is more desirable to have a model with a modest prognostic performance in the training set, but with comparable performance in the test set, as compared to having a model with high performance only in the training set, but the performance does not validate in the test set. Even a radiomic biomarker with a high prognostic performance loses interpretability if it is highly sensitive to changes in image parameters. Reproducibility of the biomarker is key to make it usable in a clinical setting.
Our results indicate that accounting for heterogeneity in image parameters is important to obtain more reproducible prognostic scores, irrespective of the image site or modality. Our study discusses the importance of heterogeneity mitigation in radiomic parameters and why it is important to ensure that the prognostic model is robust to the variation in image acquisition and physical dimensions. It also addresses the question of postprocessing feature standardization, as standardization during the image acquisition stage might not be feasible, especially in large datasets obtained from multi-institutional studies. We hope our study can be a step in the direction of constructing reproducible radiomic biomarkers, thus increasing their application and acceptance in clinical decision-making.

Data availability
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request. www.nature.com/scientificreports/