Intensity harmonization techniques influence radiomics features and radiomics-based predictions in sarcoma patients

Intensity harmonization techniques (IHT) are mandatory to homogenize multicentric MRIs before any quantitative analysis because signal intensities (SI) do not have standardized units. Radiomics combine quantification of tumors’ radiological phenotype with machine-learning to improve predictive models, such as metastastic-relapse-free survival (MFS) for sarcoma patients. We post-processed the initial T2-weighted-imaging of 70 sarcoma patients by using 5 IHTs and extracting 45 radiomics features (RFs), namely: classical standardization (IHTstd), standardization per adipose tissue SIs (IHTfat), histogram-matching with a patient histogram (IHTHM.1), with the average histogram of the population (IHTHM.All) and plus ComBat method (IHTHM.All.C), which provided 5 radiomics datasets in addition to the original radiomics dataset without IHT (No-IHT). We found that using IHTs significantly influenced all RFs values (p-values: < 0.0001–0.02). Unsupervised clustering performed on each radiomics dataset showed that only clusters from the No-IHT, IHTstd, IHTHM.All, and IHTHM.All.C datasets significantly correlated with MFS in multivariate Cox models (p = 0.02, 0.007, 0.004 and 0.02, respectively). We built radiomics-based supervised models to predict metastatic relapse at 2-years with a training set of 50 patients. The models performances varied markedly depending on the IHT in the validation set (range of AUROC from 0.688 with IHTstd to 0.823 with IHTHM.1). Hence, the use of intensity harmonization and the related technique should be carefully detailed in radiomics post-processing pipelines as it can profoundly affect the reproducibility of analyses.

technical details regarding homogeneization of SIs are frequently missing in materials and methods and, even when performed, assessment of the optimal setting for the MRI dataset of interest is often lacking.
Some intensity harmonization techniques (IHTs) have been proposed in the neuroimaging literature to enable robust analysis of structural and diffusion MRIs across different radiological centers and longitudinally, but most cannot be transposed to sarcomas because of the heterogeneity of tissues surrounding sarcomas, which are ubiquitous tumors. Available IHTs regarding non-brain MRI are scarce. The most frequently encountered are global scaling (e.g. where SIs values are centered by removing the mean and scaled to unit variance, or transformed to range between 0 and 1), ratio with SIs of a healthy tissue that is not affected by the disease (for instance adipose tissue or muscle in musculoskeletal imaging), or histogram-matching (HM, where the intensity histograms are transformed to match a reference intensity histogram) [16][17][18] . In addition, Orlhac et al. have recently shown that ComBat harmonization method, which was initially described in genomics to remove batch effect, could correct non biological differences related to the type of scanners 19 . Though the authors focused on CT-scanner, ComBat may help reduce unwanted variations in MRI-based radiomics datasets as well.
Thus, our aim was to investigate how the IHT could influence MRI-based radiomics analyses in a uniformlytreated cohort of soft-tissue sarcomas (STS) patients with which the presence of intra-tumor heterogeneity on initial T2-weighted-imaging (-WI) has been previously correlated with metastatic-relapse free survival (MFS) 4,6,20 . To do so, to comprehensively assess the impact of IHT on radiomics analyses, we investigated its influence on: (i) the RFs values; (ii) the prognostic value of radiomics-based unsupervised classifications; and (iii) the performances of supervised classifiers to predict early metastatic relapses.

Methods
Study population. This study was approved by the local Research Ethics Committee of Bergonié Insitute (Bordeaux, France) according to good clinical practices and applicable laws and regulations. All methods were performed in accordance with the relevant guidelines and regulations. The need for written informed consent was waived because of its retrospective nature.
Patients were consecutively recruited as they fulfilled the following inclusion criteria: newly-diagnosed, nonmetastatic (according to chest CT-scan), histologically-proven high-grade STS of trunk wall or extremities (n = 163), treated with 4-6 cycles of anthracycline-based neoadjuvant chemotherapy and curative surgery at our sarcoma reference center from June 2006 to November 2016 (n = 133), available baseline MRI (n = 95) with axial spin-echo T2-WI without artefacts (n = 72), and available clinical and radiological follow-ups for at least 2 years after the surgery (n = 70). Follow-ups consisted in a clinical examination and chest radiograph every 3 months for 2 years, every 6 months for 5 years and annually until 10 years after surgery, which were complemented by chest CT-scans and MRIs in case of doubtful findings. All relapses were histopathologically confirmed. MFS was defined as the time since curative surgery to metastatic relapse.  [12/70, 17.1%]) with adjustment of coils, field-of-view and matrix depending on tumor size, location and depth. Regarding T2-WI, the range of repetition and echo times were 2,400-4,500 ms and 70-130 ms, respectively. Slice thickness ranged from 3 to 5 mm. The protocol also systematically included 2D or 3D T1-WI after intra-venous gadolinium-chelates injection (with or without fat-suppression).
First, T2-WIs were converted to nifti format. Voxel size resampling (with b-spline interpolator) and N4 bias field correction were applied to obtain a common spatial resolution of 1 × 1 × 4 mm 3 and to correct non-uniform intensities 22 .
Second, a senior radiologist (A.C., with 4 years of experience in sarcoma imaging) manually segmented the whole tumor volume, slice-by-slice, by using LIFEx freeware (version 5.10, Inserm, Orsay, France, www.lifex soft.org) 23 . The radiologist had access to all the other MRI sequences to adjust the boundary of the segmentation if needed. The volumes of interests were all validated by a second senior radiologist (M.K., with 28 years of experience in sarcoma imaging).
Third, 4 IHTs were applied in parallel to the whole imaging dataset in order to harmonize the SIs of the T2-WI, providing 4 harmonized datasets, i.e.: (1) IHT fat , which consisted in dividing all the SIs of a given T2-WI by the mean SI of adipose tissue on that T2-WI, as follows: where x, y and z are the coordinates of a voxel. To do so, the first senior radiologist segmented a volume of at least 10 cm 3 of pure normally-appearing adipose tissue on each T2-WI in order to extract the mean SI per patient. (2) IHT std , which consisted in normalizing the SIs of a T2-WI according to the minimum and maximum of all voxels included in this T2-WI, as follows:  www.nature.com/scientificreports/ enabling the extraction of 5 datasets of 45 3-D RFs by using LIFEx software. SIs were previously discretized into 128 fixed bins. Thirteen histogram-based and 32 s-order texture features from grey-level co-occurrence matrix (GLCM, n = 7-with a 1-voxel distance to neighbors), grey-level run length matrix (GLRLM, n = 11), neighborhood grey-level different matrix (NGLDM, n = 3) and grey-level zone length matrix (GLZLM, n = 11) were calculated (details are giving in Supplementary Data 3).
ComBat compensation. We applied the ComBat-Harmonization function in R (https ://githu b.com/forti n1/ComBa tHarm oniza tion) to the 45 RFs that were extracted from the IHT HM.All dataset with a non-parametric setting in order to remove unwanted noise due to technical variations between the 3 MR-systems of the study while preserving biological variability, and notably when there are only a few patients per site 19,24,25 . ComBat-Harmonization is classically applied at the end of the postprocessing pipeline, herein, after the extraction of RFs obtained with the IHT that was hypothesized to be the more relevant and realistic among the 5 IHTs (namely IHT HM.All ). This data-driven method identifies the protocol effect assuming that the value of each feature, RF, measured in a volume-of-interest, (x,y,z), with an imaging protocol, i, can be written as: RF i,(x,y,z) = α + γ i + δ i × ε i,(x,y,z) (in which α is the average value for features y ij ; γ i is an additive protocol effect and δ i is a multiplicative protocol effect affected by an error term ε ij ). The compensations consists in estimating the model parameters α, γ i and δ i , and by using a maximum likelihood approach on the basis of the set of available observations: , in which α , γ i and δ i are estimators of α, γ i and δ i . Parametric and nonparametric forms of ComBat-Harmonization have been developed. The non-parametric form does not assume law followed by the parameters and has been used in the present study.
The resulting RFs were labelled IHT HM.All.C . In total, six paired datasets of 45 RFs were obtained, namely: No-IHT, IHT fat , IHT std , IHT HM.1 , IHT HM.All and IHT HM.All.C .

Statistical analysis.
Statistical analysis was performed with R. All tests were two-tailed. A p-value of less than 0.05 was deemed significant. A 3-steps approach was performed to evaluate the impact of IHTs on each aspect of radiomics studies ( Fig. 1): (1) Per-RF analysis: RFs were all normalized in order to range between 0 and 1 and to facilitate direct comparisons. For each RF, the influence of the IHT was evaluated with one-way repeated-measures ANOVA.
Post-Hoc comparisons were assessed with Tukey test and Bonferroni corrections. Intraclass correlation coefficients (ICC) were estimated for each RF, with a 2-way random model, agreement between raters and 6 raters ("irr" package). (2) Unsupervised analysis: A hierarchical clustering analysis with the Ward method was applied on each of the 6 subsets of RFs. RFs were centered and scaled by mean beforehand and the Euclidean distance between each pair of patients was computed. Visual inspection of the silhouette plot enabled to select 2 clusters of patients for each harmonization technique. We calculated the Baker's gamma coefficient between each pair of dendrograms (dendextend" package), and the Kappa index between each pair of clustering results, which enabled the quantification of their divergence depending on the IHT 26 .
The correlations between MFS and the clusters yielded by the models were assessed with Kaplan-Meier analysis and multivariable Cox models-after adjustment to the classical confounding covariables for sarcomas, i.e.: the longest baseline diameter (< vs. ≥ 10 cm), performance status (0 vs. 1-2), histological type (undifferentiated sarcomas vs. other), number of chemotherapy cycles (4 vs. 5-6), chemotherapy type (anthracycline-ifosfamide vs doxorubicine), adjuvant radiotherapy, surgical margins (R0 vs. R1-R2) and histological response (goods vs. poor responder to chemotherapy with a cut-off of 10% viable cells on post-chemotherapy surgical specimen). Prognostic performances of the 6 multivariate models were evaluated and compared through concordance-indices, which estimate the models's ability to provide a reliable ranking of the survival times based on the individual risk scores.
(3) Supervised analysis: The same supervised machine-learning approach was applied to the 6 datasets of RFs in order to predict the occurrence of a metastatic relapse within 2 years after curative surgery by using the "caret" and "glmnet" packages 27,28 . The total population of 70 patients with available clinical and radiological follow-up was randomly subdivided into one training cohort of 50 patients and one testing cohort of 20 patients with the same proportion of metastatic relapses by using the createDataPartition function. The training cohort was used to train a binomial logistic regression with combination of least absolute shrinkage and selection operator (LASSO) and ridge penalizations (elasticnet-LR). This algorithm consists of reducing the number and the importance of explanatory variables in order to optimize the performances of the classification model. The coefficients of the less contributive variables are shrunken towards 0 (: ridge regression) or even set to 0 (: LASSO). The amount of ridge and LASSO penalization was investigated by using a manual grid search with two hyperparamètres: α (mixing percentage) and λ (regularization parameter) and tenfold cross validation, repeated 5 times. The same partitioning of patients was used for the 6 datasets. The same clinical and pathological covariables as in the unsupervised analysis were included, in addition to the same 3 shape RFs (volume, compacity and sphericity-which are independent from the IHT).
The performances of supervised models were evaluated through cross-validated accuracy and area under the ROC curves (AUROC) with 95% confidence interval (95%CI). To do so, we extracted the 5 × 10 = 50 estimations of the accuracy and AUROC from the 50 distinct test sub-cohorts of 5 patients from the training cohort, and we  Unsupervised analysis. All 6 unsupervised classifications achieved were different. Table 3 shows the correlation matrices for Kappa indices and Baker coefficients. The pair of clustering with the highest positive correlation was obtained with IHT HM.All versus IHT HM.All.C (Kappa = 0.75, Baker coefficient = 0.55). The lowest correlated pair was obtained with No-IHT versus IHT HM.1 (Kappa = 0.18, Baker coefficient = 0.05). Both correlated dendrograms are displayed in Fig. 3.
Regarding the prognostic value of the clusters, our univariate analysis showed that significantly different survivals were found with the clusters obtained with the IHT HM.All radiomics dataset (Log-rang p-value = 0.03) but not with the other IHTs. Kaplan Meier curves for the 6 clustering analyses are given in Fig. 4.
To assess the prognostic values in presence of confounding variables, we elaborated multivariate models demonstrating that the clusters obtained with RFs from the No-IHT, IHT std , IHT HM.All and IHT HM.All.C were independently associated with MFS in the multivariate modeling (p = 0.02, 0.007, 0.004 and 0.02, respectively- Table 4  The final hyperparameters and performances of the classification models are given in Table 5. The best performances in repeated cross-validation were found with the models based on the RFs from the IHTHM.

Discussion
The post-processing of medical images to perform radiomics studies is mandatory to ensure the comparability of multicentric datasets but it can result in additional bias that may alter the performances of predictive models and preclude the reproducibility of MRI-based radiomics signatures. Because structural MRIs are acquired in arbitrary units, the intensity harmonization is crucial to enable the comparability of examinations acquired with different MR-systems, coils, and acquisition parameters. We found that all 45 textural features widely used in the literature were significantly influenced by IHT. Furthermore, depending on the IHC used, the results of unsupervised and supervised analyses based on RFs and their clinical correlations were dramatically changed. In addition, using an inappropriate IHT could decrease the performances of radiomics-based predictive models as it was highlighted by the comparative analysis with the models built with the No-IHT imaging dataset. Our results concur with previous studies that found a significant influence of other post-processing steps on the absolute values of RFs (such as voxel size standardization, gray-levels discretization or manual segmentation) in addition to pre-processing steps (such as magnetic field strength, manufacturers, coils, acquisition parameters or filters). Recently, Scalco et al. found that the IHT for T2-WI had a significant impact on the reproducibility of RFs and on the inter-observer reproducibility of RFs that were extracted from pelvic organs from two MRIs separated by months 29 . These findings have been also applied to other IHTs such as variants of HM and a homemade method taking into account the SIs of organs of interest, the prostate, but the authors focused on the  Table 5. Accuracy and area under the ROC curves (AUROC) of the supervised models in repeated cross validation (training cohort) and in the testing/validation independent cohort, depending on the 5 intensity harmonization techniques (IHTs) or the lack of IHT (named No-IHT). Results are giving with 95% confidence interval.  30 . To our knowledge, this study is the first to demonstrate the dramatic impact of IHTs on RF-based predictions. Moreover, in a recent review of MRI-based sarcoma radiomics studies, we found that 17 out 31 (54.8%) did not mention the method used for making comparable the SIs of MRI dataset (under review). It should be emphasized that the current Image Biomarker Standarisation Initiative and Radiomics Quality Score lack of precise guidelines regarding IHT for MRI 31 .

Intensity harmonization technique
Previous studies have already emphasized the influence of IHT on segmentation and tissue classification tasks but they mostly involved brain MRI for inflammatory or degenerative diseases, and not specifically study their influence on radiomics analyses 24,25,32,33 . Moreover, the methods proposed in these studies were not readily transposable to non-brain imaging and/or not available in open source language (for instance, DeepHarmony) 34 .
In this study, we focused the analyses on techniques previously used in the body-imaging radiomics literature (i.e. scaling, histogram-matching or ComBat-Harmonization) but further studies should consider translating other popular intensity harmonization algorithms to body MRI. The RAVEL algorithm, which aims at estimating a voxel-specific unwanted variation by using a control region (i.e. brain cerebro-spinal fluid), may be particularly promising if applied to body-MR, with the possible use of healthy adipose tissues as control in the setting of soft tissue sarcomas for example 24,25 . Alternatively, instead of a post-processing intensity harmonization, the harmonization of SIs could be achieved since the acquisition step, through the use of standardized T1-mapping or T2-mapping sequence. However, thousands of MRIs have already been stored and, logically, the radiological community expects to pool and include these images in retrospective radiomics studies.
None of the IHTs used in this study demonstrated an unequivocal superiority compared to the others. This observation lets us hypothesize that the "best" technique is not universal but may actually vary depending on the dataset and the study objectives. Our present data does not allow us to validate this hypothesis, as it would require additional datasets to test if the same IHT constantly provides the best models whatever the disease and the outcome. While the unsupervised analysis highlighted the prognostic value of clusters elaborated with RFs from the IHT std , IHT HM.All and IHT HM.All.C datasets, the supervised analysis emphasized on the other hand the prognostic value of other models elaborated with RFs from the IHT fat and IHT HM.1 in the testing cohort. It is worth noting that our supervised models showed moderately higher performances in the validation cohort than in the training cohort (range of differences: 0.03-0.13). Although this finding suggests that the models were not overfitted, it also indicates that the training could have been premature (despite the use of repeated cross-validation and exhaustive grid search) and that a sampling bias could have occurred during the data partitioning in our rather small study population (despite the fact that the splits were obtained randomly and were well-balanced regarding the outcome).
Importantly, our unsupervised analysis revealed that using an inappropriate IHT could even lead to a total loss of relevant information from the radiomics data. Indeed, the concordance indices of the reference model (which was elaborated with clinical and radiological variables alone) and the model relying on IHT HM.1 were equivalent, www.nature.com/scientificreports/ which stresses the lack of prognostic value of the corresponding clusters. Similarly, although the lowest AUROC was reached with the No-IHT dataset in cross-validation, the performances of this supervised model were not markedly different from those obtained with some of the IHTs in the two cohorts (especially the IHT std ). These findings also suggest that radiomics studies should investigate all the available IHTs in an exploratory subset of the cohort, as well as no use of IHT, and subsequently select the one that optimizes the predictions. For instance, the extraction of RFs according to various voxel sizes and/or numbers of gray levels is commonly performed in radiomics studies. By analogy, one could consider extracting the RFs according to different IHTs and select the most robust and predictive RFs at univariable level. Hence, the intensity harmonization techniques could be considered as a "hyperparameter" of the post-processing pipeline. Interestingly, IHT HM.All.C yielded moderately good performances in both unsupervised and supervised analyses (with similar results in training and testing cohorts), which suggests that this method may provide the more realistic radiomics data in the setting of our study. It should be emphasized that the co-variable arguments given to the ComBat function may/might be incomplete in the setting of sarcomas. In any case, the clinical outcome of the study should not be included among the ComBat covariables because it should not depend on the MR-system or acquisition parameters of the sequences. A distinctive feature of sarcomas over other cancers is their anatomical ubiquity, hence, requiring adjusting several other acquisition parameters depending on the tumor location (for instance thoracic wall, thigh or wrist). Further studies should investigate the best co-variables for ComBat for non-brain MRI. In addition, ComBat could have been used with the No-IHT, IHT fat , IHT std , IHT HM.1 radiomics features. We purposely decided to limit the application of ComBat to only one dataset (IHT HM.All ) to avoid multiplying the post-hoc analyses, performances measurements, or superposing ROC curves, while our current results already enables us to stress the strong impact of IHT on radiomics-features and radiomics-based classifications and predicitions.
Our results also deepened that intra-tumoral heterogeneous SIs on T2-WI is predictive of MFS in a quantitative manner and other studies have also correlated this parameter with overall and/or metastatic-relapse free survivals in STS patients with relatively close and similar performances to ours 6,7,20 . Indeed, Peeken et al. used an equivalent of IHTstd and applied ComBat to correct for multicenter effect. They also provided the sarcoma histological type as a biological covariable (which slightly improved the performances) 6 . Their best model relied on radiomics features from Fat Sat T2 weighted imaging and showed a concordance-index of 0.74 in the validation cohort. On the other hand, Spraker et al. did not explicitly use an intensity harmonization technique, neither ComBat 7 . Interestingly, their best clinical and radiological prognostic models for the overall survival showed a concordance-index of 0.78 in the validation cohort.
Our study has limits. First, the study population was relatively small although this is the largest study investigating IHT and radiomics. It should be noted sarcoma radiological studies rarely exceed our population number. Second, we focused this proof-of-concept methodological study on T2-WI sequences but further investigations should be performed on other MRI sequences, such as T1-WI, contrast-enhanced T1-WI, DCE-MRI and diffusion imaging. We purposely chose this sequence because it is commonly reported as the most informative morphological sequence for sarcomas 8,20 . Third, our study design could be criticized. Indeed, judging which of the IHTs is the best by using the performances of predictive models (AUROC or concordance-index) as judgment criteria can only be valid if the intrinsic prognostic value of MRI-based radiomics features is certain. In this case, lowering these performances with a particular IHT would mean that this IHT caused noise and inappropriate deviation in the data. However, as already stated, prior studies converged towards same results regarding the relationship between MRI-based radiomics features, heterogeneity on T2-WI and outcomes of sarcoma patients 6,7,20,35 . Alternative study designs could have been proposed in the absence of such relationship, (i) either by using a phantom made of compartments with various degrees of heterogeneity, (ii) or by using MRIs of healthy volunteers covering organs with different textures and investigating which IHT enables the best radiomics-based classification of these organs (by analogy with the study by Orlhac et al.) 19 . Fourth, other shape and textural RFs than the 48 features used in this study can be encountered in the literature. Yet, we purposely decided to limit our investigations to this set of RFs, which are proposed by the LIFEx freeware, as they follow the definitions of the Imaging Biomarker Standardization Initiative 23,31 . Furthermore, adding more potential radiomics predictors in our multivariate analyses would have increased the multidimensionality of our dataset and the risk of overfitted results regarding the limited number of patients.
To conclude, through the example of sarcomas, our study highlights that the IHT can directly influence the values of MRI-based RFs, subsequently leading to dramatical changes in the predictions of both unsupervised and supervised models. Therefore, IHTs need to be deepened regarding non-brain MRI and should be carefully explored and detailed when building radiomics models to ensure the robustness and reproducibility of radiomics signatures.

Data availability
The datasets generated during and/or analyzed during the current study are not publicly available due to the clinical and confidential nature of the material but can be made available from the corresponding author on reasonable request.