Optimal Statistical Incorporation of Independent Feature Stability Information into Radiomics Studies

Conducting side experiments, termed robustness experiments, to identify features that are stable with respect to rescans, annotation variability, or other confounding effects is an important element of radiomics research. However, how to incorporate the findings of these experiments into the model-building process remains an open question. Three different methods for incorporating such prior knowledge into a radiomics modelling process were evaluated: the naïve approach (ignoring feature quality), the most common approach, which removes unstable features, and a novel approach using data augmentation for information transfer (DAFIT). Multiple experiments were conducted using both synthetic data and publicly available real lung imaging patient data. Ignoring the additional information from side experiments resulted in significantly overestimated model performance, i.e. an inflated estimate of the mean area under the curve (AUC) achieved with a model. Removing unstable features improved the performance estimation while slightly decreasing the actual model performance, i.e. the AUC achieved with the model. The proposed approach was superior both in terms of the estimation of the model performance and the actual model performance. Our experiments show that data augmentation can prevent biases in performance estimation and has several advantages over the plain omission of unstable features. The actual gain that can be obtained depends on the quality and applicability of the prior information on the features in the given domain; this will be an important topic of future research.


A0 Experiment on different skew factors
To investigate the impact of skewed noise distributions on the proposed method DAFIT, we repeated the synthetic experiments, this time adding noise drawn from a skew-normal distribution instead of normally distributed noise. The noise was generated using the "skewnorm" distribution from "scipy.stats", and we tested two different values for the skew factor, namely two and four. We did not change the noise model used for DAFIT, i.e. we still assumed unskewed, normally distributed noise. The experiments were repeated multiple times to account for their inherent randomness.
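As a minimal sketch of this noise model, the following draws skew-normal samples for the two skew factors. The experiments above use scipy.stats.skewnorm; this standard-library-only version reproduces the same standard skew-normal distribution via the classical construction X = delta*|U0| + sqrt(1 - delta^2)*U1 (the function name and seed are our own choices, not from the experiments):

```python
import math
import random

def skew_normal_sample(alpha, rng):
    """Draw one sample from a standard skew-normal distribution with shape
    parameter alpha. Uses the construction X = delta*|U0| + sqrt(1-delta^2)*U1
    with delta = alpha / sqrt(1 + alpha^2), where U0, U1 are i.i.d. standard
    normals; alpha = 0 recovers the ordinary normal distribution."""
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    u0 = rng.gauss(0.0, 1.0)
    u1 = rng.gauss(0.0, 1.0)
    return delta * abs(u0) + math.sqrt(1.0 - delta * delta) * u1

rng = random.Random(42)
means = {}
for alpha in (2, 4):  # the two skew factors tested in the experiments
    noise = [skew_normal_sample(alpha, rng) for _ in range(10_000)]
    means[alpha] = sum(noise) / len(noise)
    print(f"skew factor {alpha}: sample mean {means[alpha]:.3f}")
```

Note that, unlike unskewed noise, this distribution has a non-zero mean (delta * sqrt(2/pi)), which is one way a mismatch with the assumed noise model can arise.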
The results are shown in the figure below. No clear trend towards better or worse predictions due to the skewing can be seen; however, the resulting classifiers appear to be slightly better when trained on the skewed data.

A1 Result listed for different confounding variables
Mean absolute estimation errors (AEE) and mean performance measured with the area under the curve (AUC) for different strategies and classifiers on a real-world dataset. The absolute estimation error is defined as the absolute difference between the performance estimated with five-fold cross-validation and the minimum performance obtained on any left-out test set. The error bars give the 95% confidence interval based on bootstrapping. Note that the y-axis of the AUC plots has an offset to show the relevant areas.
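The AEE and the bootstrapped confidence interval described in this caption could be computed as follows; the AUC values below are hypothetical placeholders for illustration, not results from the study, and the function names are our own:

```python
import random
import statistics

def absolute_estimation_error(cv_auc, test_aucs):
    """AEE: absolute difference between the cross-validated performance
    estimate and the minimum performance on any left-out test set."""
    return abs(cv_auc - min(test_aucs))

def bootstrap_ci(values, n_boot=10_000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`:
    resample with replacement, take the mean of each resample, and read off
    the (1-level)/2 and (1+level)/2 quantiles of the resampled means."""
    rng = random.Random(seed)
    boot_means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    lo = boot_means[int((1 - level) / 2 * n_boot)]
    hi = boot_means[int((1 + level) / 2 * n_boot)]
    return lo, hi

# Hypothetical example values (not taken from the paper):
cv_auc = 0.82                    # five-fold cross-validation estimate
test_aucs = [0.74, 0.78, 0.71]   # AUCs on the left-out test sets
aee = absolute_estimation_error(cv_auc, test_aucs)
lo, hi = bootstrap_ci(test_aucs)
print(f"AEE = {aee:.2f}, 95% CI of mean test AUC = [{lo:.3f}, {hi:.3f}]")
```

Using the minimum over the left-out test sets makes the AEE a pessimistic gap measure: it captures the worst case a cross-validated estimate can hide.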

A2 Result listed for different targets
Mean absolute estimation errors (AEE) and mean performance measured with the area under the curve (AUC) for different strategies and classifiers on a real-world dataset. The absolute estimation error is defined as the absolute difference between the performance estimated with five-fold cross-validation and the minimum performance obtained on any left-out test set. The error bars give the 95% confidence interval based on bootstrapping. Note that the y-axis of the AUC plots has an offset to show the relevant areas.

A3 Grid of absolute estimation error w.r.t. confounding effect and target
Mean absolute estimation errors (AEE) for different strategies and classifiers on a real-world dataset. The absolute estimation error is defined as the absolute difference between the performance estimated with five-fold cross-validation and the minimum performance obtained on any left-out test set. The error bars give the 95% confidence interval based on bootstrapping.

A4 Grid of area under curve w.r.t. confounding effect and target
Mean performance measured with the area under the curve (AUC) for different strategies and classifiers on a real-world dataset. The error bars give the 95% confidence interval based on bootstrapping. Note that the y-axis of the AUC plots has an offset to show the relevant areas.

A6 Description of targets:
A slightly more detailed description of the classification targets. These targets were chosen because they are publicly available in conjunction with the LIDC-IDRI dataset.
Subtlety: How difficult is the lesion to detect?
Calcification: What is the pattern of calcification, if present?
Sphericity: Describes the shape of the nodule in terms of roundness.
Margin: Rates how well-defined the margin is.
Lobulation: The degree of lobulation, ranging from none to marked.
Spiculation: The extent of spiculation.
Malignancy: A subjective assessment of how likely the nodule is to be malignant, assuming the scan originated from a 60-year-old male smoker.
The information about the classification targets is taken from the publication describing the creation process of the LIDC-IDRI dataset.

A7 List of calculated features
Here is the full list of all features that are calculated. The name of each feature follows a fixed scheme: the first part gives the family of the feature, for example "Volumetric_Features". The second part, separated by a double colon "::", gives the parameter, if applicable. The third part, after another double colon, gives the actual feature name, for example "Voxel_Volume".
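A feature name in this scheme can be split back into its parts as sketched below. The helper name is our own, and we assume the parameter part is simply omitted (rather than left empty) when it is not applicable:

```python
def parse_feature_name(name):
    """Split a feature name of the form 'Family::Parameter::Feature' or
    'Family::Feature' into (family, parameter_or_None, feature)."""
    parts = name.split("::")
    if len(parts) == 3:
        family, parameter, feature = parts
    elif len(parts) == 2:   # assumed: no parameter part for this feature
        family, feature = parts
        parameter = None
    else:
        raise ValueError(f"unexpected feature name: {name!r}")
    return family, parameter, feature

print(parse_feature_name("Volumetric_Features::Voxel_Volume"))
```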
A more detailed description of each feature can be found in the documentation of the corresponding feature family class in MITK Phenotyping: http://mitk.org/wiki/Phenotyping#Documentation_and_Help