Radiomics feature stability of open-source software evaluated on apparent diffusion coefficient maps in head and neck cancer

Radiomics is a promising technique for discovering image based biomarkers of therapy response in cancer. Reproducibility of radiomics features is a known issue that is addressed by the image biomarker standardisation initiative (IBSI), but it remains challenging to interpret previously published radiomics signatures. This study investigates the reproducibility of radiomics features calculated with two widely used radiomics software packages (IBEX, MaZda) in comparison to an IBSI compliant software package (PyRadiomics). Intensity histogram, shape and textural features were extracted from 334 diffusion weighted magnetic resonance images of 59 head and neck cancer (HNC) patients from the PREDICT-HN observational radiotherapy study. Based on name and linear correlation, PyRadiomics shares 83 features with IBEX and 49 features with MaZda, a sub-set of well correlated features are considered reproducible (IBEX: 15 features, MaZda: 18 features). We explore the impact of including non-reproducible radiomics features in a HNC radiotherapy response model. It is possible to classify equivalent patient groups using radiomic features from either software, but only when restricting the model to reliable features using a correlation threshold method. This is relevant for clinical biomarker validation trials as it provides a framework to assess the reproducibility of reported radiomic signatures from existing trials.

www.nature.com/scientificreports/ contrast enhanced imaging (DCE-MRI). This study focuses on radiomics features calculated on apparent diffusion coefficient (ADC) maps derived from DW-MRI images. The apparent diffusion coefficient has been correlated to cellularity in many tumour types 17 and has been linked to cell proliferation in head and neck squamous cell carcinoma 18 . During radiotherapy, early changes in ADC have been linked to treatment response outcomes for multiple tumour types, making it a potential candidate for biological image-guided adaptive radiotherapy on MRI guided radiotherapy systems 19 . Developing a radiomics model is often considered as a series of discrete tasks each with its own challenges 5 , with the variability of each task known to impact model performance [20][21][22] . In MRI studies, these effects have been investigated with regard to image acquisition [23][24][25][26][27] , region of interest segmentation 23,[28][29][30] , image pre-processing 28,29,31,32 , feature extraction [33][34][35] and feature reduction combined with classifier training 13,36,37 . To address the known variability issues of features extracted with different software [33][34][35] the image biomarker standardisation initiative (IBSI) 38 has proposed a set of feature extraction guidelines. In head and neck cancer, MRI studies have reported feature extraction with software such as MazDa 24,[39][40][41] , IBEX 42 and in-house solutions based on MATLAB 37,[42][43][44][45][46][47][48][49] , none of which adhear to the IBSI guidelines.
Radiomic analysis generates hundreds of features, making feature reduction a crucial step to prevent overfitting when developing a radiomics model. Validation studies 50,51 select a small set of features based on previously reported radiomic signatures. The IBSI guidelines should mitigate known feature reproducibility issues [33][34][35] in future studies, but feature uncertainty remains a problem when interpreting previously reported radiomic signatures. This study investigates the correlation between features generated with open-source radiomic software packages (IBEX 52 and MaZda 53 ) used in many published studies against an open-source tool (PyRadiomics 54 ) which follows the IBSI guidelines. We then explore the impact of non-reproducible radiomics features on a HNC radiotherapy response model using through therapy ADC radiomics features. Our comparison focuses on DW-MRI of head and neck cancer but provides general confidence on which previously reported radiomics features can be reproduced with software that adheres to the IBSI guidelines.

Results
Variation in radiomic features. Radiomics features were extracted from 334 apparent diffusion coefficient maps from the prospective PREDICT-HN study 59 (Fig. 1) Figure 2) shows high correlation between intensity histogram features with a range of correlation between features in the shape and texture classes. A summary of correlation between features shared with PyRadiomics ( Fig. 3) shows that on average IBEX extracts more highly correlated shape features, with MaZda extracting more highly correlated first order, GLCM and GLRM features. Features with a high Pearson's coefficient ( r > 0.901 ) were considered reproducible between software packages. IBEX had 15 reproducible features

Discussion
Existing studies of variability in radiomics feature extraction 33-35 explore a range of image modalities and feature extraction software. A study of mammograms and HNC computed tomography images 33 also extracted IBEX and MaZda features, but compared them against one another and with two in-house software packages. One feature extraction study of HNC patients was less comparable as it analysed PET images 35 and compared two in-house radiomics software packages. A study of HNC patients who had both CT and MRI imaging 34 also extracted PyRadiomics features but compared them to features extracted with Moddicom 55 and the radiomics extension to CERR 56 . Whilst that study 34 also extracted features from MRI images of HNC patients, the features were from T2 weighted images, whereas our study is the first, to our knowledge, to investigate feature extraction stability on ADC maps from diffusion weighted MRI. www.nature.com/scientificreports/ Our study observed a similar trend of feature extraction reproducibility to existing studies; intensity histogram features have high reproducibility across feature extraction software, with shape and textural features being more software package dependent. The study of mammograms and HNC CT images 33 demonstrated high intra-class correlation (ICC) of intensity histogram features and low ICC of GLCM features, though this was partially attributed to the use of default GLCM extraction settings for each software. The study of HNC patients with both CT and MRI images 34 showed that CT features had high Spearman correlation of intensity histogram, shape and GLCM features and lower ICC for GLRLM and grey-level size zone matrix (GLSZM) features. The T2 weighted MRI features compared in that study had high Spearman correlation for intensity histogram and shape features between PyRadiomics and CERR and only shape features highly correlated between PyRadiomics and Moddicom; the lack of correlation with intensity histogram features was attributed to Moddicom performing an image intensity correction on the T2 weighted MRI. In our study NGTDM features were extracted with IBEX and had a mixture of well correlated (1/5) and uncorrelated (4/5) features, the previously mentioned PET study 35 also extracted NGTDM features but only a percentage of grouped textural features above an ICC threshold was reported. The similarity of our findings with those previously reported may indicate that variability in radiomics features due to the extraction software can be considered independent of the imaging modality.
Two radiomics feature reproducibility studies 34,35 have investigated the impact of feature extraction variability on HNC model performance. The joint MRI and CT imaging study 34 performed hierarchical clustering of patients for each feature class separately (intensity histogram, GLCM, etc.) and observed consistent clustering of clinical variables (TN category, GTV volume) for radiomics feature classes with a high reproducibility. The PET study 35 demonstrated that models of local tumour control built on features from two different radiomics packages can stratify patients into very similar low or high risk recurrence groups, with the model features being highly correlated between the two software packages (ICC > 0.9). These studies and our results demonstrate that it is possible to generate equivalent radiomics models with different feature extraction software, but that care must be taken to ensure that features are highly correlated between the software packages. Additionally, our study extends this observation to patient classification based on radiomics features prior to treatment, at multiple time points during treatment and post-treatment.
A limitation of this and previous studies [33][34][35] is the ambiguity in matching features between different software, firstly identifying shared features and secondly matching the feature extraction settings which often differ between software package. Name similarity does not guarantee equation similarity, for example the shape feature "SurfaceVolumeRatio" is calculated with a voxel based volume in IBEX and a mesh based volume in PyRadiomics. Equation similarity can be difficult to establish due to different notations, such as the formulation of the shape feature "SurfaceArea" between PyRadiomics and IBEX (Supplementary Table 6) which appear quite different but have a high correlation (0.998). We observed that equation similarity does not guarantee strong feature correlation, even with well-matched extraction settings, for a number of IBEX GLCM features such as "AutoCorrelation", "MaximumProbability" and "JointEnergy" (Supplementary Table 7). Alternatively, minor differences in equations with non-ideally matched extraction settings showed moderate correlation, such as MaZda GLCM features "Dif-ferenceVariance" and "InverseDifferenceMoment" (Supplementary Table 7). The naming issue can be avoided in future for a sub-set of features that have a unique identifier as defined in the IBSI guidelines 38 . To assist with consistent naming and feature extraction settings an ontology based radiomics workflow has been proposed 57 .
It is challenging to match the feature extraction settings between PyRadiomics, IBEX and MaZda. We were unable to define a reduced intensity range in MaZda for GLCM and GLRLM features, which may have negatively www.nature.com/scientificreports/ affected the feature correlation results, though we observed higher correlations in these feature classes from MaZda than with IBEX that had a matched intensity range and bin width. Similarly, we were unable to define identical directions for GLCM and GLRLM feature extraction across the software packages, averaging over different directions may have negatively affected our feature correlation results. A newer version of MaZda (qmazda) was selected for this study as it supports batch feature extraction, from the software documentation it was inferred that there was minimal change to the feature extraction code from earlier MaZda versions. The region of interest was calculated with a Plastimatch derived label map for PyRadiomics and MaZda and was calculated directly from the DICOM structure file (RTSTRUCT) by IBEX, which may have introduced variability in the image region used for feature extraction. We explored radiomics features calculated on the original ADC map to avoid additional variability from image filtration (Wavelet, Laplacian or Gaussian, etc.) between software implementations. Image filtration reproducibility between software using MRI images may be worth investigating as differences in wavelet filtered features from PET images have been previously reported 35 . Whilst our investigation of feature variability on patient modelling demonstrated general classification differences between software, it is challenging to quantify a clinical impact as the patient groups are not correlated with a clinical outcome. There is some evidence that unsupervised feature clustering is correlated with clinical outcomes in HNC 34 . Magnetic resonance imaging offers insight into disease related anatomical and functional changes and is ideal for radiomics analysis as multiple MRI sequences can provide complimentary information 58 . This study reports the variability of radiomics features extracted from ADC maps, that are a relatively quantitative image, generating features not significantly influenced by scanner manufacturer or magnetic field strength 26 ; radiomics features extracted from T1 and T2 weighted images are less reproducible across scanner and not recommended at this point for multicentre trials 24 . The increasing uptake of MRI for simulation and therapy in radiotherapy departments offers an unprecedented opportunity to characterise tumour response and personalise patient treatment. This study provides information on how feature extraction software can impact the reproducibility of a radiomics workflow, which we should endeavour to optimise in order to accelerate discovery through reproducible research and data sharing.

Conclusion
This work highlights feature and model reproducibility issues due to different radiomic analysis software. We propose a correlation threshold method to select reproducible features and demonstrate that the identified features from both software generate an equivalent model. This is relevant for the selection of radiomic features in clinical biomarker validation trials as it provides a framework to assess the reproducibility of radiomic signatures from existing studies.

Material and methods
Study cohort. The imaging data used in this study was collected as part of the prospective PREDICT-HN study 59 (Fig. 1). The trial imaged 59 patients with head and neck squamous cell carcinoma who were treated with curative intent radiotherapy, patient details are summarised in Table 1. Trial participants were imaged prior to radiotherapy, weekly during radiotherapy (following fraction 5, 10, 15, 20, 25 and 30) and two to three months post-radiotherapy. Imaging data was acquired for all patients prior to radiotherapy (n = 59), with a lower number of images acquired during radiotherapy over the first three weeks (n = 40), at weeks four and five (n = 39), at week six (n = 38) and post-radiotherapy (n = 39).
Ethics approval and consent to participate. The PREDICT-HN study 59  Variation in patient modelling. To demonstrate the potential impact of incorporating non-reproducible features in a radiomics model we used unsupervised learning to identify two groups of patients, based on radiomic features at pre-treatment, throughout radiotherapy and post-radiotherapy. Separate radiomics models were generated based on PyRadiomics and IBEX/MaZda features, first using all features and then with the sub-set of reproducible features. Reproducible features were selected as those with a high Pearson's correlation coefficient ( r > 0.901 ) as calculated for the analysis of variation in radiomics features, the correlation threshold was determined as per the sensitivity analysis described below. Features that contained an invalid number (i.e. infinity) for any patient in the modelling cohort were excluded. Patients (n = 36) with image data for all time points were grouped with SciPy 66,67 using Ward's minimum variance clustering method 68  www.nature.com/scientificreports/ features (z-score standardisation) with an automatic minimum clustering threshold to generate no more than two clusters. To test the sensitivity of clustering to the selected correlation threshold, we performed the clustering over a range of thresholds ( r > 0.0 to r > 0.999 , with an increment of 0.001) and measured the clustering similarity as the percentage of patients clustered by IBEX or MaZda into the same groups as PyRadiomics. We define clustering similarity as, where A PyRad and B PyRad are patient groups from clustering with PyRadiomics features, A i and B i are patient groups from clustering with either IBEX or MaZda features, set intersection is denoted with ∩ , set union is denoted with ∪ and the number of patients in a set is denoted with |A| . Due to the possibility that unsupervised clustering can return similar groups but in a different order, the similarity metric was calculated as, www.nature.com/scientificreports/