Comparing effectiveness of image perturbation and test retest imaging in improving radiomic model reliability

Image perturbation is a promising technique to assess radiomic feature repeatability, but whether it can achieve the same effect as test–retest imaging on model reliability is unknown. This study aimed to compare radiomic model reliability based on repeatable features determined by the two methods using four different classifiers. A 191-patient public breast cancer dataset with 71 test–retest scans was used with pre-determined 117 training and 74 testing samples. We collected apparent diffusion coefficient images and manual tumor segmentations for radiomic feature extraction. Random translations, rotations, and contour randomizations were performed on the training images, and intra-class correlation coefficient (ICC) was used to filter high repeatable features. We evaluated model reliability in both internal generalizability and robustness, which were quantified by training and testing AUC and prediction ICC. Higher testing performance was found at higher feature ICC thresholds, but it dropped significantly at ICC = 0.95 for the test–retest model. Similar optimal reliability can be achieved with testing AUC = 0.7–0.8 and prediction ICC > 0.9 at the ICC threshold of 0.9. It is recommended to include feature repeatability analysis using image perturbation in any radiomic study when test–retest is not feasible, but care should be taken when deciding the optimal feature repeatability criteria.


Feature repeatability and predictability
Compared to test-retest, there was a systematic larger feature repeatability on image perturbation.Figure 2a visualizes the distribution of feature ICCs assessed by training perturbation versus test-retest.Among all the 1120 volume-independent radiomic features, only 143 showed lower ICC under image perturbation than test-retest, which can be visualized as scarce scattered points above the diagonal line in Fig. 2a.However, the feature repeatability under image perturbation and test-retest demonstrated a strong correlation with Pearson correlation r = 0.79 (p-value < 0.001).
The feature repeatability agreement between perturbation and test-retest showed a strong dependence on ICC thresholds, as shown in Fig. 2b.In general, the number of commonly repeatable and non-repeatable features between the two ICC measures increased with higher ICC thresholds.Specifically, the number of mutually agreed repeatable features decreased from 621 to 141, 18, and 2 with ICC threshold increased from 0.5 to 0.75, 0.9, and Figure 1.Study workflow.We conducted our study by (1) radiomic feature repeatability assessment by testretest and image perturbation, (2) radiomic model development using high-repeatable features from the two assessments, and (3) internal generalizability and robustness analysis of the two models.Mp and Mtr are the models based on repeatable features assessed by image perturbation and test-retest respectively.0.95, as suggested by the shrinking blue bars in Fig. 2b.In contrast, the number of mutually disagreed repeatable features increased from 151 to 484, 989, and 1072 (green bars).For disagreements between perturbation and test-retest evaluation, very few (< 0.7%) features are repeatable against test-retest variations while unrepeatable against perturbation (red bars), and a considerate amount of features are repeatable against perturbation while unrepeatable under test-retest settings (orange bars).
Only a small portion of all the volume-independent radiomic features demonstrated strong univariate correlation with the prediction outcome with an inclination towards high repeatable features (Fig. 2a).Quantitatively, 111 radiomic features reached statistical significance (p-value < 0.05) when correlating with pCR.With the ICC threshold of 0.5, 11% (n = 71) of the high-repeatable features under test-retest had statistical significance and 10% (n = 93).The percentage increased to 23% at ICC threshold of 0.75 but decreased to 5% (n = 1) and 0% (n = 0) at 0.9 and 0.95 for test-retest.However, a continuous increase to 11% (n = 72), 25% (n = 32), and 27% (n = 12) for perturbation was discovered.The final selected features for model development can be found in the Supplementary Material (Table S1).

Internal generalizability and robustness
An overall trend of increasing internal generalizability and robustness was observed with increasing ICC thresholds.Figure 3 presented the overall trend and comparisons of training and testing AUCs of M p and M tr under varying feature ICC thresholds for the four classifiers.For logistic regression, the testing AUC increased significantly from 0.56 (0.41-0.70) at baseline (ICC threshold = 0) to the maximum of 0.76 (0.64-0.88, p = 0.021) at ICC threshold = 0.9 under perturbation and 0.77 (0.64-0.88, p = 0.018) under test-retest.The same trend can be observed for the rest of the classifiers.On the other hand, both M p and M tr demonstrated steady decreases of the training AUCs under increasing ICC thresholds without statistically significant differences to the baseline.Similarly, the baseline models had the lowest robustness with, for example, prediction ICC = 0.51 (0.45-0.58) on training perturbation, 0.57 (0.49-0.66) on testing perturbation, and 0.45 (0.25-0.62) on test-retest for the logistic regression, as indicated by the lowest bars in Fig. 4. Significant improvement can be already observed when increasing the feature ICC threshold to 0.5 for both M p and M tr .
M tr demonstrated higher internal generalizability and robustness than M p on larger feature ICC filtering thresholds in general.We observed smaller training AUCs and higher testing AUCs of M tr at ICC thresholds of 0.5 and 0.75 for all the four classifiers (Fig. 3).The AUC differences between M p and M tr were kept small with the absolute values below 0.1 (p > 0.05) for logistic regression and gaussian naive bayes while larger differences in testing AUCs were found for SVM and random forest.Under the ICC threshold of 0.75, M tr had a significantly higher prediction ICC on both testing perturbation (e.g.logistic regression, M tr =0.93 (0.91-0.95), M p =0.86 (0.82-0.90)) and test-retest ( M tr =0.87 (0.80-0.92),M p =0.75 (0.63-0.84)), while smaller differences found on training perturbation ( M tr =0.91 (0.89-0.93),M p =0.90 (0.86-0.92)), as demonstrated by Fig. 3. On the other hand, M tr had smaller prediction ICCs under the ICC threshold of 0.5 with significant differences on training and testing perturbation for SVM and gaussian naive bayes.The ICC threshold of 0.9 demonstrated minimum model robustness deviations between M p and M tr , except for SVM under test-retest.
Both M tr internal generalizability and robustness dropped significantly when increasing the ICC threshold from 0.9 to 0.95 for the four classifiers.For example, for logistic regression, the training AUCs of both M p and M tr remained stable, while a much larger decrease of testing AUC to 0.59 (0.45-0.73) was found for M tr at ICC threshold = 0.95.On the contrary, M p had a slightly reduced testing AUC to 0.75 (0.62-0.86).Similar to internal generalizability, the prediction ICC of M tr fell significantly on training perturbation, testing perturbation, and

Discussion
This is the first study that directly compared the reliability of radiomic models based on repeatable radiomic features selected by image perturbation and test-retest imaging using ADC maps derived from a publicly available breast cancer DWI dataset.Model reliability was evaluated in both internal generalizability and robustness, which were quantified by training and testing AUC and probability prediction ICC, respectively.During the experiment, several radiomic models were constructed for comparisons using varying feature repeatability (ICC) thresholds and four different classifiers.We observed systematically lower radiomic feature repeatability assessed by test-retest than perturbation with better binary agreement at higher ICC thresholds.Similar optimal internal generalizability and robustness were achieved by the four classification models based on perturbation ( M p ) and test-retest ( M tr ) at the ICC threshold of 0.9 simultaneously.Slightly lower training AUCs and higher testing AUCs were achieved by M tr than M p at lower ICC thresholds (p > 0.05).In addition, M tr demonstrated significantly higher prediction ICCs on training perturbation, testing perturbation, and test-retest at the ICC threshold of 0.75.Notably, increasing the ICC threshold to 0.95 resulted in significant drops of testing AUC and prediction ICCs for M tr .Our results provide the direct evidence that our perturbation method could replace test-retest method in building a reliable radiomic model with optimal internal generalizability and robustness.The lower radiomic feature repeatability under test-retest could be largely attributed by the larger variations of tumor segmentations.We further evaluated the segmentation similarities by the Dice similarity coefficients (DSC) and Hausdorff distances (HD) with rigid registrations between test and retest images.The tumor segmentations were less similar between test and retest images (Dice = 0.51(± 0.16), HD = 12.47 mm(± 10.95 mm)) than image perturbation (Dice = 0.71(± 0.11), HD = 2.72 mm(± 0.90 mm)).Previous research by Saha et al. has also suggested less stable radiomics features from breast MRI within the tumor volume due to a large inter-reader variability (Dice = 0.60) 20 .They also emphasized the necessity of standardization in breast tumor segmentation through precise instructions or auto-contouring, where Dice can be increased to 0.77.
The slightly reduced model robustness and internal generalizability on M p compared with M tr could be explained by the conservative evaluation of feature repeatability by image perturbation.Compared with test-retest, image perturbation yielded larger numbers of repeatable radiomic features using the same ICC filtering thresholds, and more features were found to be predictive in training under univariate test (Fig. 2a).Thus, the final selected features of M p were more likely to be predictive in training and resulted in a higher training AUC.On the other hand, M p had a slower increase of AUC in testing and prediction ICC in testing perturba- tion and test-retest before ICC threshold reached 0.95, suggesting a slightly lower internal generalizability and testing robustness than M tr .The rather strict evaluation of feature repeatability from test-retest may reduce the variabilities of selected features under the same patient condition and enhanced the probability of true discovery.Furthermore, the reduced number of feature candidates could also contribute to the lower risk of overfitting.www.nature.com/scientificreports/However, the extremely high feature ICC threshold of 0.95 resulted in a much lower internal generalizability and robustness under test-retest.During feature selection, only 5 features remained as repeatable for M tr , and none of them showed significant univariate correlation with pCR in training.Consequently, the final selected features had a minimum probability of being truly predictive, and the constructed model was largely overfitted on training with significantly reduced testing AUC.Meanwhile, the predicted probabilities were confounded within the high-slop region (Fig. 5).Although the selected features and their linear combinations are guaranteed to be highly repeatable (ICC > = 0.95), they could result in larger variations of the prediction values due to the sigmoid transformation.Although the internal generalizability and robustness describe model reliability from two different perspectives, a model built from features with low sensitivity to the prediction target is more likely to have low performances on both due to the previous discussed reasons.Such findings underline the importance of careful selection on repeatable feature criteria when optimizing the final predictive model.A balance between sensitivity and repeatability needs to be achieved depending on the level of data standardization during application.
Despite the different reliabilities of M p and M tr under multiple feature repeatability criteria, they both achieved optimal internal generalizability and high robustness at the ICC threshold of 0.9 with similar metric values.Such observation provides the direct evidence that perturbation could replace test-retest imaging while achieving the similar optimal model performance.It is advised to incorporate radiomic feature repeatability analysis using image perturbation when test-retest is less achievable due to limited medical resources.Nevertheless, the optimal ICC threshold discovered by this study may not be generalizable to other radiomics applications where different image modalities and cancer site were studied and different radiomic features were extracted.
In addition to the comparisons between perturbation and test-retest, we discovered a positive impact of higher feature repeatability on model reliability, as suggested by the increasing testing AUCs and prediction ICCs under higher ICC thresholds.Our results are consistent with the findings by Teng et al. that image perturbation could enhance radiomic model reliability on multiple head-and-neck cancer datasets 17 .A higher model output repeatability is generally guaranteed with increased input repeatability when using a linear logistic regression model, as long as the predictability is ensured.Similar to the comparison between M p and M tr , both the reduced feature variabilities and candidate numbers from higher repeatability thresholds could be the major contributors of the enhanced internal generalizability.
Our study has several limitations that need to be addressed by future investigations.First, only one public dataset was used to conduct this experiment.Further investigations on the applicability of our findings need to be conducted on other image modalities, cancer sites, and radiomic feature categories.Second, previous studies have also suggested the impact of scanning settings and image preprocessing parameters on radiomic feature repeatability [21][22][23] .Therefore, a comprehensive test-retest dataset including different scanners, image acquisition protocols and preprocessing settings is needed to further evaluate the role of perturbation in building a reliable radiomic model.Third, we evaluated model reliability in terms of internal generalizability and robustness without considering external validation performance.Patient data from multiple institutions could be recruited to further enhance our understandings of the impact of feature repeatability on cross-institutional reliability.

Conclusions
We systematically compared the radiomic model reliability, including both internal generalizability and robustness, between using repeatable radiomic features assessed by image perturbation and test-retest imaging.The same optimal reliability can be achieved by image perturbation as test-retest imaging.Higher feature repeatability resulted in higher model reliability in general, but may have an opposite effect at extremely high repeatability threshold.We recommend the radiomic community to include feature repeatability analysis using image www.nature.com/scientificreports/perturbation in any radiomic study when test-retest is not feasible, but care should be taken when deciding the optimal criteria during repeatable feature selection.

Study design
As illustrated in Fig. 1, patients were randomly split into one training and testing set for model development and validation.We assessed model reliability in both internal generalizability and robustness.Internal generalizability was evaluated by its discriminatory power on both the training and testing set.Model robustness was quantified by measuring output probability variability on perturbed training, perturbed testing, and test-retest images, similar to the methodology adopted by Teng et al. 17 .The comparisons were performed under different feature repeatability thresholds to mimic a wide range of selections of repeatability criteria.

Patient data
We retrospectively collected 191 patients from the publicly available BMMR2 challenge dataset 24,25 .It was derived from the ACRIN 6698 trail where female patients with invasive breast cancer were prospectively enrolled from ten institutions between 2012 and 2015 26 .All the patients received neoadjuvant chemotherapy with 12 weeks of paclitaxel (with/without additional experimental agent) followed by 12 weeks of anthracycline before surgery.Diffusion-weighted MRI was performed for each patient before, 3 weeks after, and 12 weeks after the commencement of the chemotherapy.IRB approval is waived due to the solely use of public data.Pretreatment DWI-derived ADC maps and manual tumor segmentations were downloaded from The Cancer Imaging Archive 27 for radiomics model development.Pathologic complete response (pCR), which is the binary outcome assessing the absence of invasive disease in breast and lymph nodes at the time of surgery 28 , was used as the prediction endpoint.We adopted the same train-test split as the BMMR2 challenge with 60% (n = 117) randomly chosen as the training set and the remaining 40% (n = 74) set as the testing set.The same ratios of pCR, hormone receptor (HR), human epidermal growth receptor 2 (HER2) status were controlled for the train-test split.Additionally, we collected the 71 test-retest pre-treatment ADC map pairs scanned within a "coffee-break" from the BMMR2 challenge dataset for comparison.Forty-one test-retest patients overlapped with the primary patient cohort.The tumor volume was manually drawn on dynamic contrast-enhanced MR subtraction images 26 , and migrated to the ADC map.

Radomics feature extraction
A comprehensive set of Radiomics features was extracted from the original and filtered DWI images within the tumor volume.Before feature extraction, all the DWI images were isotropically resampled to the resolution of 1mmx1mmx1mm and discretized to a fixed bin number of 32 for the original and filtered images.Such preprocessing procedure ensures the consistent image resolution and pixel values across patients with reduced noise.We applied three-dimensional Laplacian-of-Gaussian (LoG) image filters with multiple kernel sizes (1, 2, 3, 4, 5 mm) and eight sets of wavelet filters with full combinations of high-/low-pass in three dimensions.Both first-order (n = 18) and texture features (n = 70) were extracted from each preprocessed image, and shape features (n = 14) were extracted from the tumor segmentation.Texture features include calculations from Gray-Level Co-occurrence Matrix (GLCM), Gray Level Size Zone Matrix (GLSZM), Gray Level Run Length Matrix (GLRLM), and Origi Matrix (NGTDM).The definitions and extraction of radiomic features follow the standardization by the image biomarker standardisation initiative 29 .In total, 1316 radiomics features were extracted for each patient.Detailed settings of the image preprocessing and feature extraction parameters are listed in Table 1.All the image preprocessing and feature extraction procedures were performed by the Python package PyRadiomics (version 3.0.1) 30.

Feature repeatability assessment
Radiomics feature repeatability was assessed from both perturbed images and test-retest images for model reliability comparisons, shown in Fig. 1.We performed 40 image perturbations independently for each patient by random combinations of rotations, translations, and contour randomizations, same as the methodology adopted by Teng et al. 17 .Contour randomization was achieved by deforming the original tumor segmentation by a 3-dimensional random displacement field.The algorithm of random displacement field generation is adapted from the methodology proposed by Simard et al. 31 .A random field vector component on each dimension is generated randomly under a uniform distribution between -1 and 1 for each voxel point.All the z-component of the field vectors on the same slice were kept to the same value to mimic the uniform inter-slice contour variations from the slice-by-slice contouring.The field vectors were then normalized on each dimension by the root mean square and scaled by the user-defined intensity value.They were then smoothed by a gaussian filter with user-defined sigma to ensure the continuous change of the random displacement field and avoid sharp changes of the deformed contours.Detailed image perturbation parameters can be found in Table 1.
The same set of radiomics features were extracted from each perturbed or test/retest image with the same preprocessing procedure.One-way, random, absolute, single rater intraclass correlation coefficient (ICC) 32 was calculated for each radiomic feature under both image perturbation and test-retest due to the random choice of perturbation parameters and scanning condition for each patient: where MS R represents the mean square of average perturbation values for patients, MS W is the residual source of variance, which is calculated as the variance of perturbation values averaged across patients, and k is the number of perturbations.The ICC calculation was provided by the Python package Pingouin (version 0.5.2) 33 .

Radiomics model construction
Two types of radiomics models were separately constructed from the repeatable features under image perturbation ( M p ) and test-retest ( M tr ), as shown in Fig. 1. Volume dependent radiomic features were first removed to minimize its confounding effect on the comparison results, as tumor volume is more stable by definition.Radiomic features that had a Pearson correlation r > 0.6 to the tumor mesh volume was removed from subsequent analysis.Repeatable features were determined from the pre-set ICC thresholds of 0, 0.5, 0.75, 0.9, and 0.95.They were further filtered by redundancy and outcome relevancy before model training.We adopted the minimum Redundancy-Maximum Relevance (mRMR) feature selection algorithm to rank the repeatable features based on the redundancy and outcome relevancy 34 .Finally, 5 top-ranked features were selected for model development using four different classifiers, including logistic regression, SVM, random forest, and gaussian naive bayes.They were trained by the scikit-learn package (version 1.3.0) in Python.The majority pCR group (non-event) was randomly down-sampled by 500 times and an ensemble of classifiers were trained.The final prediction probability of each patient was given by the average of the individual model predictions.This easy-ensemble approach could reduce the training bias from the heavily imbalanced outcome 35 .It was implemented by the publicly available python package imbalance-learn (version 0.9.1) 36 .

Model reliability assessment
We assessed radiomics model reliability in both internal generalizability and robustness (Fig. 1).Internal generalizability was assessed by comparing training and testing classification performance evaluated by AUC.Model robustness was assessed by the model prediction repeatability under the setting of both perturbation (training and testing) and test-retest.Probability predictions of either model were generated on the perturbed training, perturbed testing, and test-retest images, and the one-way, random, absolute ICCs were calculated for the prediction repeatability using the same rationale of feature repeatability.Both internal generalizability and robustness were compared between M p and M tr with different ICC threshold settings, as shown in Fig. 1.

Statistical analyses
Each classification performance metric was evaluated under 1000-iteration patient bootstrapping to acquire 95% confidence interval (95CI).Two-way p-values for comparing the classification performance were calculated by permutation test with 1000 iterations using the function "permutation_test" provided by the open-source Python package scipy (version 1.9.1) 37 .The comparison was performed between each pair of models with and without feature repeatability filtering as well as M p and M tr .A p-value < 0.05 was considered significant.The 95CI of the model prediction ICC was evaluated according to the formulas presented by McGraw et al. 32 . Vol

Figure 2 .
Figure 2. (a) Scatter plots showing the repeatability of volume independent features measured by intraclass correlation coefficient (ICC) under test-retest imaging (y-axis) and image perturbation (x-axis).The perturbation method yielded higher ICC values than the test-retest method in general.Furthermore, features that had significant univariate correlations with the outcome, pCR, where colored as orange while the rest as blue.(b) Stacked bar plot displaying the feature repeatability agreement between perturbation and testretest.P + / − indicates the repeatable/unrepeatable feature group by the perturbation method and TR + / − for repeatable/unrepeatable feature group in the test-retest method.

Figure 3 .
Figure 3.Comparison of internal generalizability between models based on repeatable features assessed by image perturbation ( M p , blue) and the test-retest imaging ( M tr , orange) under varying thresholds for logistic regression, SVM, random forest, and gaussian naive bayes classifiers.Training and testing classification performance were quantified by area under the receiver operating characteristic curve (AUC).The error bars indicate 95% confidence intervals acquired from 1000-iteration bootstrapping.

Figure 4 .
Figure 4. Bar plots for comparing robustness between models based on repeatable features assessed by image perturbation ( M p , blue) and the test-retest imaging ( M tr , orange) under varying thresholds for logistic regression, SVM, random forest, and gaussian naive bayes classifiers.Model robustness was evaluated by probability prediction ICC under perturbation or tests-retest.The error bars indicate 95% confidence intervals acquired during ICC calculation.

Figure 5 .
Figure 5. Distributions of the linearly combined feature values and predicted probabilities of the logistic regression models developed from test-retest repeatable features and perturbation repeatable features using the feature intra-class correlation coefficient threshold of 0.95.The predicted probabilities follow the sigmoid mapping of the logistic regression.Samples with ground-truth of non-event are colored by blue and event by orange.Predictions of the test-retest model were aggregated in the high-slop region whereas a wider spread is found for the perturbation model.