Efficacy of MRI data harmonization in the age of machine learning: a multicenter study across 36 datasets

Pooling publicly available MRI data from multiple sites makes it possible to assemble extensive groups of subjects, increase statistical power, and promote data reuse with machine learning techniques. Harmonization of multicenter data is necessary to reduce the confounding effects of non-biological sources of variability in the data. However, when applied to the entire dataset before machine learning, harmonization leads to data leakage, because information from outside the training set may affect model building and potentially result in falsely overestimated performance. We propose 1) a measurement of the efficacy of data harmonization; and 2) the harmonizer transformer, i.e., an implementation of the ComBat harmonization that can be encapsulated among the preprocessing steps of a machine learning pipeline, avoiding data leakage by design. We tested these tools using brain T1-weighted MRI data from 1740 healthy subjects acquired at 36 sites. After harmonization, the site effect was removed or reduced, and we demonstrated the data leakage effect in predicting individual age from MRI data, highlighting that introducing the harmonizer transformer into a machine learning pipeline avoids data leakage by design.

Data leakage occurs when information from outside the training set is used to create the model, potentially leading to falsely overestimated performance.
To the best of our knowledge, harmonization techniques for neuroimaging data have been applied without attention to avoiding data leakage, and this effect has not been quantified. In addition, although the Python package neuroHarmonize 2 and the R code provided by Radua and colleagues 3 include functions that estimate the harmonization model on the training data and apply it separately to the test data, they were not conceived to be executed within a machine learning pipeline, i.e., an end-to-end framework that orchestrates the flow of data into a machine learning model, speeds up the development and testing of machine learning systems, and natively avoids data leakage.
For these reasons, in this study, we propose 1) a measurement of the efficacy of data harmonization in reducing the site effect, based on the performance of a machine learning classifier in predicting the imaging site, and 2) the harmonizer transformer, which encapsulates the harmonization among the preprocessing steps of a machine learning pipeline.

We quantified the overlap between the age distributions of the single-center datasets using the Bhattacharyya coefficient (BC) 6. The BC is 0 when there is no overlap and 1 when the overlap is complete. In our study, n is the number of single-center datasets grouped in the meta-dataset covering the above-mentioned age ranges and may differ across meta-datasets. We constructed the CHILDHOOD meta-dataset, containing 11 single-center datasets whose subjects' ages vary between 5 and 13 years and whose age distributions have BC = 0.71. The ADOLESCENCE meta-dataset includes 9 single-center datasets whose subjects' ages range from 11 to 20 years, with an age-distribution overlap of 0.45. Finally, the ADULTHOOD meta-dataset consists of all data from subjects aged between 18 and 87 years (12 single-center datasets), whose age distributions have BC = 0. A detailed description of the composition of each meta-dataset and the corresponding age distributions are shown in Table 2 and Figure 1, respectively. In addition, we merged all single-center datasets into a meta-dataset, called LIFESPAN, that covers the entire age range (5-87 years). In this meta-dataset, composed of 36 imaging sites, the single-center age distributions have a null overlap (Figure 1).
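For reference, the pairwise Bhattacharyya coefficient, together with a common generalization to n discrete distributions that is consistent with the 0-1 range stated above (the paper's exact multi-dataset formula is an assumption here), reads:

BC(p, q) = Σ_x √(p(x) q(x)),    BC(p_1, ..., p_n) = Σ_x (∏_{i=1}^{n} p_i(x))^{1/n}

where each p_i is the (binned) age distribution of one single-center dataset; for n = 2 the product form reduces to the pairwise coefficient.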

MR image processing
For each brain MR T1-weighted image, we performed a cortical reconstruction and a volumetric segmentation (see details in section 2.2.1). In this work, we analyzed cerebral structures only, and we extracted neuroimaging features from various regions of the cerebral cortex: the entire cerebral cortex, the left/right hemispheres separately, and the left/right frontal, temporal, parietal, and occipital lobes. In particular, for each region, we computed the average cortical thickness (CT) and the fractal dimension (FD) (see details in section 2.2.2). The harmonization process of the neuroimaging features is described in section 2.3, and the statistical and machine learning analyses in section 2.4.

Cortical reconstruction and volumetric segmentation
We used the FreeSurfer package to perform completely automated cortical reconstruction and volumetric segmentation of each subject's structural T1-weighted scan. We used version 7.1.1, except in a few cases: (i) for T1-weighted images belonging to ICBM and NKI2 datasets, we used FreeSurfer version 5.3, and (ii) for the ABIDEI datasets, we used the FreeSurfer version 5.1 outputs previously made available online by Cameron and colleagues 48 (http://preprocessed-connectomes-project.org/abide/index.html). Even though different FreeSurfer versions may affect neuroimaging variables 49-53 , such variability is considered part of the site variability and handled by the harmonization procedure. Indeed, all subjects in each center have been processed with the same version of FreeSurfer. FreeSurfer is extensively documented (see Ref. 54 for a review) and publicly accessible (http://surfer.nmr.mgh.harvard.edu/). In addition to the standard FreeSurfer outputs, we performed a parcellation of the cortical lobes using the mri_annotation2label tool with the --lobesStrict option.
All FreeSurfer outputs used in this study were visually inspected for quality assurance by two experienced radiologists (M.M. and C.T., with 35 and 30 years of experience, respectively) following an improved version of the ENIGMA Cortical Quality Control Protocol 2.0 (http://enigma.ini.usc.edu/protocols/imaging-protocols/). Firstly, we created an HTML file for each single-center dataset showing, for each subject, the segmentation of the cortical regions overlaid on the T1-weighted images. Then, we scrolled through the HTML file to visually identify gross segmentation errors in any cortical region. For each single-center dataset, we estimated the statistical outliers for CT features (see section 2.2.2 for details about CT feature extraction), defined as any data point below or above the mean by more than 2.698 standard deviations. For each subject, we carefully inspected the cortical segmentations whose feature values were labeled as statistical outliers to assess whether the outlier reflected an actual segmentation error; in that case, the subject was excluded from further analyses.
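A minimal sketch of this outlier flagging, assuming a hypothetical DataFrame ct_features with one row per subject and one column per CT feature of a single-center dataset:

import pandas as pd

def flag_outliers(df: pd.DataFrame, n_sd: float = 2.698) -> pd.DataFrame:
    """True where a value lies more than n_sd standard deviations
    from the within-dataset mean of its feature."""
    z = (df - df.mean()) / df.std()
    return z.abs() > n_sd

outliers = flag_outliers(ct_features)
subjects_to_inspect = outliers.any(axis=1)  # candidates for visual review

The 2.698 threshold corresponds to the classical boxplot whisker limit (third quartile plus 1.5 times the interquartile range) under a Gaussian assumption.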

Extraction of cortical thickness and fractal dimension features
For each subject, using FreeSurfer tools, we computed the average CT of each cortical region as the average distance measured from each vertex of the gray/white boundary surface to the pial surface 55 .
The FD is a numerical representation of shape complexity 56 . The FD is normally a fractional value and is considered a dimension because it gives a measure of the space-filling capacity of a structure. To compute the FD, we used the fractalbrain toolkit 59 , which performs the 3D box-counting algorithm 60 , adopting an automated selection of the fractal scaling window 59 , a crucial step for establishing the FD of non-ideal fractals 59,61 .
Briefly, we overlapped a grid composed of 3D cubes of different sizes s (where s = 2^k voxels, and k = 0, 1, ..., 8) onto the segmentation and recorded the number of cubes N(s) needed to fully enclose the structure for each size. This process was repeated with 20 uniformly distributed random offsets to prevent the systematic influence of the grid placement, and the relative box counts were averaged to obtain a single N(s) value 62,63 . For a fractal object, the data points of the number of cubes N(s) vs. size s in the log-log plane can be modeled through a linear regression within a range of spatial scales called the fractal scaling window. The fractalbrain toolkit automatically selects the optimal fractal scaling window by searching for the interval of spatial scales that provides the best linear fit, as measured by the rounded coefficient of determination adjusted for the number of data points (adjusted R^2). If multiple intervals have the same rounded adjusted R^2, the widest interval (i.e., the one that contains the most data points in the log-log plot) is selected 59 . The FD of the brain structure is then estimated as the slope (in absolute value) of the linear regression model fitted within the automatically selected fractal scaling window. As an example, in Figure 2, we report a log-log plot of the 3D box-counting algorithm optimized for the automatic selection of the best fractal scaling window of the cerebral cortex of one subject.
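A simplified sketch of this procedure, under stated simplifications (this is an illustration, not the fractalbrain implementation: the 20 random grid offsets are omitted, a binary 3D mask is assumed, and rounding is to two decimals):

import numpy as np

def box_count(mask, k_max=8):
    """Number of cubes of side s = 2**k covering the nonzero voxels."""
    sizes, counts = [], []
    for k in range(k_max + 1):
        s = 2 ** k
        pad = [(0, (-d) % s) for d in mask.shape]  # pad to multiples of s
        m = np.pad(mask, pad)
        # pool s x s x s blocks and test each block for occupancy
        b = m.reshape(m.shape[0] // s, s, m.shape[1] // s, s,
                      m.shape[2] // s, s).any(axis=(1, 3, 5))
        sizes.append(s)
        counts.append(b.sum())
    return np.array(sizes), np.array(counts)

def fractal_dimension(mask, min_window=3):
    """FD as |slope| of the best linear window in the log-log plane."""
    sizes, counts = box_count(mask)
    x, y = np.log(sizes), np.log(counts)
    best = (-np.inf, 0, None)  # (rounded adjusted R^2, width, slope)
    n_pts = len(x)
    for i in range(n_pts):
        for j in range(i + min_window, n_pts + 1):
            xs, ys = x[i:j], y[i:j]
            slope, intercept = np.polyfit(xs, ys, 1)
            ss_res = ((ys - (slope * xs + intercept)) ** 2).sum()
            ss_tot = ((ys - ys.mean()) ** 2).sum()
            if ss_tot == 0:
                continue
            n = j - i
            r2_adj = round(1 - (ss_res / ss_tot) * (n - 1) / (n - 2), 2)
            # ties on rounded adjusted R^2 are broken by the widest window
            if (r2_adj, n) > (best[0], best[1]):
                best = (r2_adj, n, slope)
    return abs(best[2])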

Harmonization of brain cortical features
We harmonized cortical features using ComBat, a model that builds upon the statistical harmonization technique proposed by Johnson and colleagues 35 for location and scale (L/S) adjustments to the data while preserving between-subject biological variability. Briefly, let y_{ijf} be the value of feature f for participant j in single-center dataset i, for a total of k single-center datasets, n participants, and V features. Let X be the n × q matrix of biological covariates of interest, and Z be the n × k matrix of single-center labels. The ComBat harmonization model can be written as follows:

y_{ijf} = α_f + φ_f(X_{ij}) + Z_{ij} γ_f + δ_{if} ε_{ijf}

where α_f is the overall mean of feature f, φ_f(X_{ij}) denotes the variation of y_{ijf} captured by the biologically relevant covariates X, γ_f is the one-dimensional array of the k coefficients associated with the single-center labels for feature f, δ_{if} is the multiplicative site-effect term, and the residual terms ε_{ijf} are assumed to have mean 0. In our study, age and sex 71 were included in the harmonization process as sources of inter-subject biological variability. Finally, since it is not evident that the site effect affects all MRI-derived measures in the same way 3 , we performed a separate harmonization for each feature group of the same type (i.e., CT and FD).
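For completeness, following Johnson and colleagues 35 , the harmonized values are then recovered from empirical Bayes estimates of the site parameters (a standard ComBat step, not spelled out above):

y^{ComBat}_{ijf} = (y_{ijf} - α̂_f - φ_f(X_{ij}) - γ̂*_{if}) / δ̂*_{if} + α̂_f + φ_f(X_{ij})

where γ̂*_{if} and δ̂*_{if} are the empirical Bayes estimates of the additive and multiplicative site effects for site i and feature f.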

The harmonizer transformer
The increased sample size due to the pooling of data acquired in various centers facilitates the application of machine learning techniques. For training and testing machine learning models, a proper validation scheme that handles data splitting must be chosen (Figure 3). This choice is crucial to avoid data leakage by ensuring that the entire workflow (preprocessing and model-building steps) is constructed on training data and evaluated on test data never seen during the learning phase. Indeed, data leakage in the training process may yield falsely high performance in the test set (see, e.g., Ref. 72 and Ref. 73 ). Especially in medicine and healthcare, where relatively small datasets are usually available, the straightforward hold-out validation scheme is rarely applied. In contrast, cross-validation (CV) and its nested version (nested CV), for hyperparameter optimization of the entire workflow [74][75][76] , are frequently preferred. Also, repeated CVs or repeated nested CVs are suggested for improving the reproducibility of the entire machine learning system 75 . In all these validation schemes, several training and test procedures are carried out on different data splits, underscoring the need for a compact code structure to avoid errors that may lead to data leakage. In this view, machine learning pipelines are a solution because they orchestrate all the processing steps in a short, easier-to-read, and easier-to-maintain code structure (Figure 3). A pipeline represents the entire data workflow, combining all transformation steps (e.g., data cleaning, data imputation, data scaling, and general data preprocessing) and machine learning model training. It is essential for automating an end-to-end training/test process without any form of data leakage, and it improves reproducibility, ease of deployment, and code reuse, especially when complex validation schemes are needed.
In the Scikit-learn library, a popular, open-source, well-documented, and easy-to-learn machine learning package that implements a vast number of machine learning algorithms, a pipeline is a chain of "transformers" and a final "estimator" acting as a single object. The transformers are modules that apply preprocessing to the data, whereas estimators are modules that fit a model based on training data and are capable of inferring some properties on new data (https://scikit-learn.org/stable/developers/develop.html). In particular, transformers are classes with a "fit" method, which learns model parameters (e.g., mean and standard deviation for data standardization) from a training set, and a "transform" method, which applies this transformation model to any data. For example, for data standardization (transforming data to have zero mean and unit standard deviation), the mean μ must be subtracted from the data, and the result must be divided by the standard deviation σ.
Importantly, these parameters must first be computed on the training set. In the test set, or any validation set, the same transformation must be applied using the same two parameters μ and σ computed on the training set. Basically, the "fit" method calculates the parameters (e.g., μ and σ in our case) and saves them internally, whereas the "transform" method applies the transformation (using the saved parameters) to any particular set of data.
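For instance, with Scikit-learn's StandardScaler:

from sklearn.preprocessing import StandardScaler

# The fit/transform contract illustrated with data standardization:
# mu and sigma are learned on the training set and reused on the test set.
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learns mu and sigma
X_test_std = scaler.transform(X_test)        # reuses the training mu, sigma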
For these reasons, in this study, we propose the harmonizer, a Scikit-learn Python transformer that encapsulates the neuroHarmonize procedure among the preprocessing steps of a machine learning pipeline. Basically, the "fit" method of the harmonizer transformer learns the neuroHarmonize model parameters from a training set and saves them internally, whereas the "transform" method applies the neuroHarmonize model, previously learned on the training data, e.g., to unseen data. The source code of the harmonizer transformer is publicly available in a GitHub repository at https://github.com/Imaging-AI-for-Health-virtual-lab/harmonizer.
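A minimal sketch of such a transformer, assuming the neuroHarmonize functions harmonizationLearn and harmonizationApply and a column layout chosen here for illustration (the actual implementation in the repository may differ):

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from neuroHarmonize import harmonizationLearn, harmonizationApply


class Harmonizer(BaseEstimator, TransformerMixin):
    """Scikit-learn transformer wrapping neuroHarmonize.

    Assumes X = [site, age, sex, feature_1, ..., feature_V], so that the
    covariates travel with the samples through any CV split.
    """

    def __init__(self, n_covars=3):
        self.n_covars = n_covars  # number of leading covariate columns

    def _split(self, X):
        X = np.asarray(X)
        covars = pd.DataFrame(X[:, :self.n_covars],
                              columns=["SITE", "AGE", "SEX"])
        features = X[:, self.n_covars:].astype(float)
        return covars, features

    def fit(self, X, y=None):
        covars, features = self._split(X)
        # ComBat parameters are estimated on the training data only
        self.model_, _ = harmonizationLearn(features, covars)
        return self

    def transform(self, X):
        covars, features = self._split(X)
        # Apply the previously learned model: no refitting, hence no leakage
        return harmonizationApply(features, covars, self.model_)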
In the following, we included the harmonizer transformer in a pipeline to learn the harmonization procedure parameters on the training data only and apply the harmonization procedure (with parameters learned in the training set) to the test data. This prevented data leakage in the harmonization procedure independently of the chosen validation scheme.
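Continuing the sketch above, a hypothetical pipeline for site prediction could look as follows (X carries the covariate columns expected by the Harmonizer sketch; y_site holds the imaging-site labels):

from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, StratifiedKFold
from xgboost import XGBClassifier

# Within cross_val_score, the harmonizer is refitted on each training fold
# and applied to the held-out fold, so harmonization never sees test data.
pipe = Pipeline([("harmonize", Harmonizer()),
                 ("clf", XGBClassifier())])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y_site, cv=cv,
                         scoring="balanced_accuracy")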

Statistical and machine learning analyses
We performed the statistical and machine learning analyses described in the following paragraphs for each feature group of the same type (i.e., CT and FD) and each meta-dataset (i.e., CHILDHOOD, ADOLESCENCE, ADULTHOOD, and LIFESPAN).

Visualization and quantification of site effect
We first performed a series of analyses of increasing complexity to explore the actual existence of a site effect in the data. For each region-feature pair, we qualitatively showed the site effect on raw data through boxplots, using the site as the independent variable and each region-feature pair as the dependent variable. Quantitatively, the site effect was measured by analysis of covariance (ANCOVA), a general linear model that blends analysis of variance (ANOVA) and linear regression. ANCOVA evaluates whether the means of a dependent variable are equal across levels of a categorical independent variable while statistically controlling for the effects of other variables that are not of primary interest, known as covariates or nuisance variables. In this study, we set the single-center dataset as the independent variable, age and sex as covariates, and each region-feature pair as the dependent variable.
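As an illustration, the ANCOVA for one region-feature pair could be run with statsmodels (a sketch; the DataFrame df and its column names site, age, sex, and feature are assumptions):

import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Site as categorical factor, age and sex as covariates
model = smf.ols("feature ~ C(site) + age + C(sex)", data=df).fit()
table = anova_lm(model, typ=2)

# Partial eta squared for the site factor (effect size used in the paper)
ss_site = table.loc["C(site)", "sum_sq"]
ss_resid = table.loc["Residual", "sum_sq"]
partial_eta2 = ss_site / (ss_site + ss_resid)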
Additionally, to further investigate the site effect on raw data and to measure the success of ComBat harmonization, we predicted the imaging site from the neuroimaging features, grouped by feature type, namely CT and FD. Specifically, we used the supervised eXtreme Gradient Boosting (XGBoost) method (version 0.90, with default hyperparameters for a classification task), a scalable end-to-end tree-boosting system widely used to achieve state-of-the-art performance in many recent machine learning challenges 77 . Using N = 100 repetitions of a stratified 5-fold CV, we estimated the median balanced accuracy. The statistical significance of the prediction performance was determined via permutation analysis.
Thus, for each feature group, 5000 new models were created using random permutations of the target labels (i.e., the imaging site), such that the explanatory neuroimaging variables were dissociated from their corresponding imaging site, to simulate the null distribution of the performance measure against which the observed value was tested 78 . The median balanced accuracy was considered significantly different from the chance level when the p-value computed using the permutation test was < 0.05. Additionally, we calculated the average confusion matrix over repetitions to graphically evaluate the goodness of the prediction. The same imaging site prediction was performed on raw data (i.e., without harmonization), to confirm the existence of the site effect, and on harmonized data (with neuroHarmonize and the harmonizer transformer), to investigate whether the site effect was reduced or removed.
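Scikit-learn's permutation_test_score offers a close analogue of this procedure (a sketch reusing the pipeline pipe from above; the paper compares the median balanced accuracy over 100 CV repetitions against the permutation null, whereas this helper uses the mean cross-validated score):

from sklearn.model_selection import permutation_test_score, StratifiedKFold

# 5000 label permutations, balanced accuracy as the performance measure
score, perm_scores, p_value = permutation_test_score(
    pipe, X, y_site,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    n_permutations=5000,
    scoring="balanced_accuracy",
)
# p_value < 0.05: site prediction better than chance (site effect present)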
We propose to measure the efficacy of harmonization in reducing or removing the site effect through a two-step assessment. First, we evaluated whether the site prediction after the harmonization process was not significantly different from a random prediction by comparing the median balanced accuracy over repetitions with the distribution of balanced accuracies estimated using the permutation test with 5000 permutations (the default value in the randomise tool of FSL, the FMRIB Software Library, for non-parametric permutation inference on neuroimaging data 79 ). Considering, for example, a significance threshold of 0.05 in the permutation test, in the case of complete removal of the site effect, the site prediction will not be different from that of a random model (i.e., p-value ≥ 0.05). Second, in the case of a permutation test p-value < 0.05, we compared the balanced accuracy obtained by predicting the site without and with the harmonization procedure. In particular, we assessed the site effect reduction by ensuring that the median balanced accuracy obtained predicting the imaging site with harmonized data was significantly lower than that estimated with raw data through the non-parametric one-sided Wilcoxon signed-rank test, with a significance threshold of 0.05 80 . The source code for evaluating the effectiveness of harmonization using the harmonizer transformer is publicly available in a GitHub repository at https://github.com/Imaging-AI-for-Health-virtual-lab/harmonizer.
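The reduction test of the second step could be run as follows (a sketch; acc_harmonized and acc_raw are hypothetical arrays, one balanced accuracy per CV repetition, paired by the shared data splits):

from scipy.stats import wilcoxon

# One-sided paired test: is the accuracy with harmonized data
# consistently lower than with raw data?
stat, p_value = wilcoxon(acc_harmonized, acc_raw, alternative="less")
if p_value < 0.05:
    print("Site effect reduced: harmonized accuracy significantly lower.")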
To estimate the effect of data leakage in the prediction of the imaging site caused by performing the harmonization on all data before splitting into training and test sets, we tested whether the balanced accuracies obtained using neuroHarmonize on all data before any split were consistently lower than those estimated using the harmonizer transformer in the above-mentioned stratified CV scheme. Since the same data set splits were applied for both CT and FD, the comparison was carried out through a paired test, i.e., the non-parametric one-sided Wilcoxon signed-rank test with a significance threshold of 0.05 80 .

Associations with age
While it is essential to show that a harmonization method successfully reduces a possible site effect, it is equally crucial to verify that it preserves the biological variability in the data.
Indeed, a harmonization method that removes both site and biological effects has no utility.
One of the most influential sources of biological variability in the neuroimaging features of healthy subjects is undoubtedly chronological age. Throughout the lifespan, the brain structure changes because of a complex interplay between multiple maturational and neurodegenerative processes. Such processes could yield large spatial and temporal variations in the brain 65,81,82 .
For these reasons, we attempted to predict individual age from neuroimaging features through an XGBoost model (version 0.90, with default hyperparameters for a regression task) 77 . We estimated the median (over repetitions) mean absolute error (MAE) using N = 100 repetitions of a 5-fold CV. Age prediction was performed on harmonized data using both neuroHarmonize and the harmonizer transformer in the CV pipeline. To estimate the effect of data leakage in the age prediction caused by performing the harmonization on all data before splitting into training and test sets, we compared the MAE values obtained using neuroHarmonize on all data before any split and the harmonizer transformer in the above-mentioned CV scheme. In particular, since the same data set splits were applied for both CT and FD, we assessed whether the median MAE using neuroHarmonize on all data before any split was consistently lower than that estimated using the harmonizer transformer through a paired test, i.e., the non-parametric one-sided Wilcoxon signed-rank test with a significance threshold of 0.05 80 .
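A sketch of this age-prediction experiment, reusing the Harmonizer sketch above (X, age, and the repetition scheme are illustrative names, not the authors' code):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, KFold
from xgboost import XGBRegressor

age_pipe = Pipeline([("harmonize", Harmonizer()),
                     ("reg", XGBRegressor())])

maes = []
for seed in range(100):  # N = 100 repetitions of a 5-fold CV
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(age_pipe, X, age, cv=cv,
                             scoring="neg_mean_absolute_error")
    maes.append(-scores.mean())

median_mae = np.median(maes)  # median MAE over repetitions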
Moreover, before and after the harmonization procedure, we qualitatively visualized the site effect on the relationship between age and each region-feature pair through scatterplots (with age as the independent variable and the region-feature pair as the dependent variable).

Simulation experiments
The harmonizer transformer prevents data leakage in the harmonization procedure in any machine learning pipeline, independently of the chosen validation scheme. In contrast, when harmonization is applied before data splitting, data leakage is present, and its severity depends on the specific context and the extent of the leakage. In neuroimaging, the magnitude and impact of the data leakage effect are still underexplored. Therefore, we performed simulation experiments (with known site effects) and computational tests to assess the data leakage effect when the harmonization process is performed before the training-test data splitting.

CT and FD data simulation settings
Let y_{ij} be the simulated feature value for single-center dataset i and participant j, for a total of k single-center datasets, n_i participants for each center, and V features. In this study, we simulated CT and FD data for k = 3, 10, 36 single-centers. Each single-center dataset provided the same number of participants (i.e., n_i = n), with n assuming the values 25, 50, 100, and 250. In total, we performed 24 experiments, i.e., we simulated 24 different multicenter datasets (12 for the CT features and 12 for the FD measures).
Each y_{ij} was generated using the model proposed by Johnson and colleagues 35 and recently used for simulating neuroimaging features by Chen and collaborators 83 :

y_{ij} = α + β x_{ij} + γ_i + δ_i ε_{ij}

where α is the average value of the feature in the single-center ICBM dataset, β is the effect of age on the feature, estimated through a linear regression between actual age and feature values in the single-center ICBM dataset, x_{ij} is a simulated age variable drawn from a uniform distribution X ~ uniform(20, 90), and ε_{ij} is the residual error term. The mean site effect γ_i was drawn from a normal distribution with zero mean and standard deviation equal to 0.1, while the variance site effect δ_i^2 was drawn from a center-specific inverse gamma distribution with chosen parameters. For our simulations, we distinguished the site-specific location factors by assuming independent and identically distributed (i.i.d.) normal distributions, and the scaling factors using the parameters described as follows. We set the inverse gamma shape, for each center, to {46, 51, 56} when k = 3, to {40, 42, ..., 58} when k = 10, and to {10, 12, ..., 70} when k = 36. In all cases, the inverse gamma scale was set to 50.
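A sketch of this generative process (the alpha and beta values below are placeholders; in the study they were estimated from the ICBM dataset, and standard normal residuals are assumed):

import numpy as np

rng = np.random.default_rng(0)

def simulate_feature(alpha, beta, k, n, ig_shapes, ig_scale=50.0):
    """Simulate one feature for k sites with n participants each."""
    sites, ages, ys = [], [], []
    for i in range(k):
        age = rng.uniform(20, 90, size=n)            # simulated age x_ij
        gamma = rng.normal(0.0, 0.1)                 # additive site effect
        # InvGamma(shape, scale) draw via scale / Gamma(shape, 1)
        delta2 = ig_scale / rng.gamma(ig_shapes[i])  # variance site effect
        eps = rng.normal(size=n)
        y = alpha + beta * age + gamma + np.sqrt(delta2) * eps
        sites.append(np.full(n, i)); ages.append(age); ys.append(y)
    return (np.concatenate(sites), np.concatenate(ages), np.concatenate(ys))

# e.g., k = 3 sites with inverse gamma shapes {46, 51, 56}, n = 50 each
sites, ages, values = simulate_feature(alpha=2.5, beta=-0.01, k=3, n=50,
                                       ig_shapes=[46, 51, 56])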

Measuring the effect of data leakage
To measure the effect of data leakage, after an external hold-out (Figure 4), we first computed the performance of a site prediction classifier trained using a) the harmonizer transformer within the machine learning pipeline (internal not-leaked test set) and b) harmonizing all data with neuroHarmonize before imaging site prediction (internal leaked test set). Second, we compared these performances with those observed on an external test set never used for harmonization and training (Figure 4). In the absence of data leakage, the performance on the internal and external test sets should be similar and not significantly different. When data leakage is present, the performance on the internal test set is overly optimistic (i.e., significantly better than that on the external test set). In detail, for each experiment, we performed the following steps.
External hold-out. We randomly split the data into two parts, i.e., a data set used for harmonization and model building (internal set) and an external test set never used for harmonization or training (Figure 4). We repeated each experiment (i.e., all these steps) 100 times with random data splits and computed the average balanced accuracy across the 100 repetitions. In every repetition, the same data set splits (external or in CV) were applied for both CT and FD. Thus, comparisons among their performances were handled with paired statistical tests.
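A sketch contrasting the two internal protocols (hypothetical variable names: feat_int and covars_int are the features and covariates of the internal set, X_int additionally carries the covariate columns expected by the Harmonizer sketch, and the external test set is assumed to be held out beforehand):

from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
from neuroHarmonize import harmonizationLearn

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# a) not leaked: harmonization is refitted inside each training fold
pipe = Pipeline([("harmonize", Harmonizer()), ("clf", XGBClassifier())])
acc_not_leaked = cross_val_score(pipe, X_int, y_site_int, cv=cv,
                                 scoring="balanced_accuracy")

# b) leaked: all internal data harmonized once, before any split
_, feat_int_harm = harmonizationLearn(feat_int, covars_int)
acc_leaked = cross_val_score(XGBClassifier(), feat_int_harm, y_site_int,
                             cv=cv, scoring="balanced_accuracy")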

Measuring the effect of data leakage in simulated data
The results were similar for both CT and FD simulated features. The performances obtained on the leaked internal test set were overly optimistic, i.e., significantly better than those obtained on the external test set, indicating the presence of data leakage. In contrast, the average balanced accuracies recorded on the not-leaked internal test set were not statistically different from those of the external test set.
Moreover, as the number of samples available in each single-center dataset decreases, the effect of data leakage increases (Tables 3 and 4 for CT and FD, respectively).
This phenomenon is even more evident in Figure 5, where we report the difference between the average balanced accuracy obtained on the external test set and that obtained on the internal test sets vs. the number of participants in each single-center site, for CT and FD, respectively. When data leakage is present (dashed lines in Figure 5), the difference between the average balanced accuracy on the external test set and that on the internal leaked test set always differs significantly from zero (Bonferroni-adjusted p-values < 10^-9 and < 10^-19 for CT and FD, respectively) and increases as the number of participants in each single-center dataset decreases. This result has a profound impact because most neuroimaging studies (with in vivo data) have single-center datasets with between 25 and 100 subjects.
Conversely, when data leakage is not present (solid lines in Figure 5), the difference between the average balanced accuracy on the external test set and that on the internal not-leaked test set was approximately zero and remained constant as the number of participants in each single-center dataset changed.

Visualization and quantification of the site effect in in vivo data
Quality control of FreeSurfer's outputs resulted in removing 47 subjects based on the overall low quality of cortical reconstruction or segmentation errors in any region. All brain regions of the remaining 1740 subjects had both CT and FD features. Thus, we were able to analyze the site effect, the harmonization adjustments, and age prediction on the same subjects for the CT and FD feature groups. The demographic characteristics of the subjects included in the study after quality control are reported in Table 5.
The boxplots in Figures 6 and 7 qualitatively show the site effect on the raw CT and FD features, respectively. The same result was measured quantitatively using the ANCOVA analysis. Indeed, all CT and FD features were significantly different across the single-center datasets (Table 6), but the site effect, measured by the partial η^2, differed between the two feature sets. In the CHILDHOOD meta-dataset, for example, each cortical region showed a higher partial η^2 for FD than for CT, suggesting that, in childhood, acquisition characteristics have a greater impact on the structural complexity measure, i.e., FD, than on the cortical thickness. On the other hand, in the ADOLESCENCE meta-dataset, the frontal and temporal lobes (bilaterally), along with the entire structure, show a lower partial η^2 for FD than for CT, whereas the parietal and occipital lobes (bilaterally) have a higher partial η^2 for FD than for CT. Finally, in the ADULTHOOD meta-dataset, only the occipital and temporal lobes (bilaterally) have a lower partial η^2 for FD than for CT.

Harmonization efficacy
To assess whether most of the variation in the data was still associated with the site after harmonization, we predicted the imaging site using neuroimaging features grouped by feature type (i.e., CT and FD). Figures 8 and 9 report the average confusion matrices (over 100 repetitions) for CT and FD features, respectively. When predicting the site using the raw data, the main diagonal of the confusion matrix is prominent (i.e., the predicted site is usually the actual site) for both feature groups and each meta-dataset (Figures 8 and 9). On the other hand, when the site is predicted using harmonized data (through neuroHarmonize or the harmonizer transformer), the main diagonal of the confusion matrix is weak. The confusion matrices show a vertical pattern, indicating that the predicted site is often the same site regardless of the actual site (Figures 8 and 9). Moreover, the confusion matrix obtained using the harmonizer within the machine learning pipeline is similar to that obtained by harmonizing all the data with neuroHarmonize before imaging site prediction. This result suggests that the action of the harmonizer resembles that of neuroHarmonize, although the model is built on training data only and then applied to test data. The confusion matrices for CT and FD features in the LIFESPAN meta-dataset are shown in Figure 10.

Table 7 reports the median balanced accuracies (over 100 repetitions) of imaging site prediction, and the efficacy of the harmonization is shown in Table 8. Specifically, we report the pair (permutation test p-value, one-sided Wilcoxon signed-rank test p-value) to statistically assess the removal or the reduction of the site effect, respectively. As expected, the median balanced accuracy of site prediction using the raw data was significantly different from the chance level (permutation test p-value < 0.05 for all data), and thus an actual imaging site effect was present in the raw data. After harmonization with neuroHarmonize or the harmonizer transformer, the site effect was removed (permutation test p-value ≥ 0.05 in Table 8) or only reduced (permutation test p-value < 0.05, but with median balanced accuracy reduced on harmonized data, as statistically measured by the one-sided Wilcoxon signed-rank test p-value < 0.05 in Table 8). Specifically, by performing harmonization using neuroHarmonize on all data, we observed that site effect removal seemed to be ensured in all analyses except the imaging site predictions using FD features in the ADOLESCENCE and ADULTHOOD meta-datasets (permutation test p-values equal to 0.0188 and 0.0002, respectively, in Table 8). We found the same behavior when predicting the imaging site using CT and FD features in the LIFESPAN meta-dataset (permutation test p-value equal to 0.0002 in Table 8). In the latter cases, although significantly different from a random prediction, the balanced accuracies were significantly lower than those obtained using the original data (one-sided Wilcoxon signed-rank test p-values < 0.001 in Table 8), indicating a site effect reduction. When applying the harmonizer transformer to the data (within the CV), we observed the actual efficacy of the harmonization, without introducing data leakage as in the previous case. Indeed, we confirmed a complete removal of the site effect only for imaging site prediction using CT features in the ADULTHOOD meta-dataset (permutation test p-value equal to 0.1064 in Table 8).
In all the other cases, the imaging site prediction was significantly different from the chance level (permutation test p-values < 0.05 in Table 8), but the balanced accuracies were significantly lower than those obtained using the original data (one-sided Wilcoxon signed-rank test p-values < 0.001 in Table 8). Thus, the site effect removal measured using data harmonized before the splitting into training and test sets was a clear sign of data leakage, even in in vivo data.

Age prediction
Table 9 reports the median MAE values (over 100 repetitions) of the age prediction model. Overall, the MAE values of age prediction using data harmonized with neuroHarmonize before the splitting into training and test sets are significantly lower than those obtained using data harmonized with the harmonizer within the CV (one-sided Wilcoxon signed-rank p-values < 0.001 for all cases, except for CT features in the CHILDHOOD meta-dataset; see Table 9). In line with the results of the simulations, the data leakage introduced by harmonizing all the data at once leads to overly optimistic performance.
Finally, in Figures 11 and 12, we report the age-dependent trends of the average CT and FD of the cerebral cortex, without harmonization and harmonized with the harmonizer transformer, respectively. In line with previous literature concerning features such as CT and volumes 2,5 , in this study the harmonized average CT and FD values also showed less variability than that observed on raw data.

Discussion
In this study, we introduced the harmonizer transformer, which encapsulates the data harmonization procedure among the preprocessing steps of a machine learning pipeline to avoid data leakage. To this end, we explored the ComBat harmonization of CT and FD features extracted from brain T1-weighted MRI data of 1740 healthy subjects aged 5-87 years acquired at 36 sites, as well as from simulated data. We measured the efficacy of the harmonization process in reducing or removing the unwanted site effect through a two-step assessment comparing the performance in imaging site prediction using harmonized data with that of 1) a random prediction and 2) a prediction using non-harmonized data. Finally, we confirmed how data leakage related to harmonization performed before data splitting leads to overestimating performance in simulated and in vivo data.
Using simulated data, we showed that the data leakage effect introduced by performing the harmonization before data splitting is clearly evident and worsens when the single-center dataset size is small and comparable with the sizes of the most common in vivo neuroimaging studies. In these simulated experiments, we paid particular attention to comparing the different harmonization and machine learning approaches under the same conditions, i.e., the same data splits and the same number of subjects for harmonization (for this reason, we used 80% of the data set for fitting the neuroHarmonize model; indeed, with the harmonizer approach, the harmonization was computed on the training folds of a 5-fold CV, i.e., using 80% of the samples).
We chose the ComBat harmonization method due to its widespread use in the scientific community 7,12,16-34 and its implementation in the neuroHarmonize package, which enables the specification of covariates with generic non-linear effects 2 . The efficacy of ComBat and its variants has been evaluated by comparing their performance with other harmonization techniques 3,5,6 and by simulating site effects using single-center data 2 .
However, various harmonization techniques can be used for features extracted from MRI images. One such method is residuals harmonization, which employs a global scaling procedure to account for the influence of each site using a pair of parameters (offset and scale). These parameters can be estimated through a linear regression model or a more sophisticated approach that considers non-linearities 5 . Global scaling was initially introduced to harmonize images directly 6 .

It is important to note that this study was the first in which the efficacy of the harmonization procedure of neuroimaging data has been evaluated by also comparing the accuracy of the imaging site prediction to the chance level. Indeed, previous works have consistently shown a decrease in the accuracy of the imaging site prediction after harmonization, but without applying a significance test, and thus it was not known whether the site effect was removed or only reduced [see, e.g., Ref. 2 and Ref. 5 ]. As hypothesized, there was a real imaging site effect on the raw data (permutation test p-value < 0.05 for all data). The site effect was either eliminated or only reduced after data harmonization with neuroHarmonize or the harmonizer transformer. Specifically, the difference between the efficacy of harmonization obtained by applying neuroHarmonize on all data and that obtained with the harmonizer within the CV was expected because, in the former case, data leakage is present, leading to a falsely overestimated performance, i.e., a permutation test p-value ≥ 0.05 and a lower median balanced accuracy (Tables 7 and 8). Looking at the permutation test p-values for imaging site prediction using data harmonized with neuroHarmonize (which were harmonized before splitting into training and test sets), it can be observed that the efficacy of harmonization worsened as the overlap of the age distributions in the multicenter meta-datasets decreased.

The goodness of age prediction using the data harmonized with neuroHarmonize before the splitting into training and test sets is falsely increased compared with the use of data harmonized with the harmonizer within the CV. Indeed, the median MAE values obtained in predicting age using data harmonized with neuroHarmonize before splitting into training and test sets were significantly lower than those estimated using data harmonized with the harmonizer within the CV (one-sided Wilcoxon signed-rank p-values < 0.001 for all cases, except for CT features in the CHILDHOOD meta-dataset; see Table 9). These results confirm how data leakage related to data harmonization before splitting into training and test sets leads to performance overestimation even for in vivo data and underline the importance of encapsulating the data harmonization procedure among the preprocessing steps of a machine learning pipeline.
In previous single-center studies, we observed that the computation of the FD using the box-counting algorithm with the automated selection of the optimal fractal scaling window implemented in fractalbrain best predicted chronological age in two datasets of healthy children and adults among various FD approaches and more conventional features, such as CT and the gyrification index 59 . In this large multicenter study, we confirmed the remarkable ability of the FD of the cerebral cortex to predict individual age better than the average CT. In the LIFESPAN meta-dataset, for example, the error in age prediction using CT features (MAE = 7.55 years) was reduced by more than 25% using FD features.

This study has some limitations. Firstly, we evaluated a single harmonization method (ComBat, as implemented in neuroHarmonize); comparisons with other harmonization techniques are needed to identify the optimal approach for specific research questions and data sets.
Secondly, we showed and measured the data leakage effect using simulated and in vivo data of the CT and FD of the cerebral cortex only. Various other morphological and functional MRI-derived features might be considered. However, the focus of the study was mainly to measure the efficacy of the harmonization and to show a possible detrimental effect of harmonizing the entire dataset before machine learning analysis, and this effect is not specific to the features considered.
Lastly, for site/age prediction, we adopted an XGBoost decision tree with default parameters. It is well known that classification/regression performances may be affected by the values of the hyperparameters, and proper hyperparameter optimization, e.g., through a nested CV, could be adopted. However, this procedure was not feasible in our study because of the relatively small size of the data in many centers, an undesired but common scenario in many publicly available datasets. Thus, although this choice was arbitrary, we feel that using the same hyperparameters for both the neuroHarmonize and harmonizer transformer data was reasonable.
In conclusion, we showed that introducing the harmonizer transformer, which encapsulates the harmonization procedure among the preprocessing steps of a machine learning pipeline, avoids data leakage by design. Using in vivo data, after ComBat harmonization, the site effect was completely removed or reduced while preserving the biological variability.
We therefore suggest that future multicenter imaging studies include the data harmonization method in their machine learning pipelines and measure the efficacy of the harmonization process.

Data Availability
The brain MR T1-weighted images that support the findings of this study are available from the following online repositories:
- Autism Brain Imaging Data Exchange (ABIDE): http://fcon_1000.projects.nitrc.org/indi/abide/
- Information eXtraction from Images (IXI) study: https://brain-development.org/ixi-dataset/

The neuroimaging features that support the findings of this study are freely available on a Zenodo repository 100 .

Code Availability
The source code of the efficacy measurement and harmonizer transformer are publicly available in a GitHub repository at https://github.com/Imaging-AI-for-Health-virtual-lab/harmonizer.

Displaced figure and table captions (fragments):
Figure 4 (fragment): ... a) using the harmonizer transformer within the machine learning pipeline and b) harmonizing all data with neuroHarmonize before imaging site prediction (internal leaked test set); these performances were then compared with that observed on an external test set never used for harmonization and training.
Figures 11 and 12 (fragment): ... and LIFESPAN meta-datasets, without and with harmonization using the harmonizer transformer; in the latter case, only the first CV among the 100 repetitions was considered, plotting, for each subject, the harmonized value obtained in the fold in which the subject was included in the test set.
Tables 3 and 4 (notes): k is the number of single-center datasets, each containing n participants; CT: cortical thickness; FD: fractal dimension. *One-tailed paired t-test, Bonferroni-adjusted p-value < 10^-9 (CT) and < 10^-19 (FD).
Table 7: Site prediction results. The median and the interquartile range of the balanced accuracy over 100 repetitions of the 5-fold CV are reported. Bold values highlight significant falsely overestimated performance due to data leakage (the median balanced accuracy in imaging site prediction using data harmonized with neuroHarmonize is lower, i.e., seemingly better harmonization, than that estimated using data harmonized with the harmonizer transformer within the CV; one-sided Wilcoxon signed-rank test p-values < 0.001 for all the analyses).
Table 8 (note): The permutation test p-value indicates whether the site effect has been removed (a p-value greater than 0.05 means that the imaging site prediction is not different from a random prediction); the one-sided Wilcoxon signed-rank test p-value indicates whether the site effect has been reduced (a p-value less than 0.05 means that predicting the imaging site from harmonized features yields a balanced accuracy significantly lower than that estimated using raw data).