Introduction

In recent years there has been an increasing trend toward data sharing in neuroimaging research communities, leading to a rising number of public neuroimaging databases and collaborative multicenter initiatives1,2,3,4. Indeed, pooling MRI data from multiple sites provides an opportunity to assemble more extensive and diverse groups of subjects2,3,5,6, increase statistical power3,7,8,9,10, and study rare disorders and subtle effects11,12. However, a major drawback of combining neuroimaging data across sites is the introduction of confounding effects due to non-biological variability in the data, typically related to image acquisition hardware and protocol. Properties of MRI such as scanner field strength, radiofrequency coil type, gradient coil characteristics, image reconstruction algorithm, and non-standardized acquisition protocol parameters can introduce unwanted technical variability, which is also reflected in MRI-derived features13,14,15.

The harmonization of multicenter data, defined as the application of mathematical and statistical methods to reduce unwanted site variability while maintaining the biological content, is therefore necessary to ensure the success of cooperative analyses. Currently, among the harmonization methods for tabular data available to the neuroimaging scientific community, ComBat is one of the most widely used7,12,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34. The ComBat model was first introduced in gene expression analysis as a batch-effect correction tool to remove unwanted variation associated with the site and preserve biological associations in the data35. In general, ComBat applies to situations where multiple features of the same type are measured for each participant, e.g., expression levels of different genes or imaging-derived metrics from different voxels or anatomical regions. The success of ComBat and its derivatives has been assessed by comparison with other harmonization techniques3,5,6 and through simulations of the site effect from single-center data2. Previous literature has primarily focused on assessing the maintenance of biological variability in harmonized data2,3,5. However, less effort has been put into quantitatively measuring the efficacy of harmonization in removing the unwanted site effect.

Moreover, the pooling of multicenter data and the consequent availability of large sample sizes pave the way for data reuse with machine and deep learning techniques17,19,22,23,25. In the case of multicenter data, harmonization is thus added to conventional data preprocessing steps, such as data cleaning and imputation, feature extraction, and feature reduction. As with other preprocessing procedures, the harmonization parameters should be optimized on training data only and subsequently applied to test data. This approach avoids data leakage, which happens when information from outside the training set is used to create the model, potentially leading to falsely overestimated performance. Crucially, this aspect has sometimes been overlooked in previous applications of ComBat, where the entire data sample was harmonized before being split into the training and test sets used for machine or deep learning2,5,17,19,22,23,25,36,37,38,39,40,41.

To the best of our knowledge, harmonization techniques for neuroimaging data have been applied without attention to avoiding data leakage, and this effect has not been quantified. In addition, although the Python package neuroHarmonize2 and the R code provided by Radua and colleagues3 include functions that estimate the harmonization model on the training data and apply it separately to the test data, they have not been conceived to be executed within a machine learning pipeline, i.e., an end-to-end framework that orchestrates the flow of data into a machine learning model, speeds up the development and testing of machine learning systems, and natively avoids data leakage by design.

For these reasons, in this study, we propose 1) a measurement of the efficacy of data harmonization in reducing the site effect by the performance of a machine learning classifier trained to identify the imaging site, and 2) a ComBat implementation using a harmonizer transformer, i.e., a method that, combined with a classifier/regressor, forms a composite estimator to be used in a machine learning pipeline, thus simplifying data analysis and avoiding data leakage by design (the source code of the efficacy measurement and the harmonizer transformer is publicly available in a GitHub repository at https://github.com/Imaging-AI-for-Health-virtual-lab/harmonizer). First, we showed and measured the effect of data leakage when harmonization is performed before data splitting, using simulated neuroimaging data with a known site effect. Then, we estimated the efficacy of data harmonization in reducing the site effect using the harmonizer transformer on brain T1-weighted MRI data from 1787 healthy subjects aged 5–87 years acquired at 36 imaging sites. The morphological features of cortical thickness (CT) and fractal dimension (FD), a descriptor of the structural complexity of objects with self-similarity properties42, were extracted to characterize brain morphology. To the best of our knowledge, this is the first time that measures of brain structural complexity, such as FD, have been studied on such a large, multicenter, and harmonized data sample. Finally, we investigated age prediction using neuroimaging variables harmonized on the entire dataset before machine learning and using the harmonizer transformer, to estimate the effect of data leakage in in vivo data.

Methods

MRI datasets

We gathered brain MR T1-weighted images of 1787 healthy subjects aged 5–87 years belonging to 36 single-center datasets from various studies. These include the Autism Brain Imaging Data Exchange (ABIDE) (https://fcon_1000.projects.nitrc.org/indi/abide/) first and second initiatives (ABIDE I and ABIDE II, respectively)43,44, the Information eXtraction from Images (IXI) study (https://brain-development.org/ixi-dataset/), the 1000 Functional Connectomes Project (FCP) (https://fcon_1000.projects.nitrc.org/fcpClassic/FcpTable.html), and the Consortium for Reliability and Reproducibility (CoRR) (https://fcon_1000.projects.nitrc.org/indi/CoRR/html/index.html). From each study, we drew several specific datasets of brain MR T1-weighted images acquired in the same place with the same scanner and acquisition protocol (see Table 1). The ABIDE I and ABIDE II initiatives contributed 17 datasets, which we named with the initiative prefix (ABIDEI or ABIDEII) followed by the name of the institution that collected the images (e.g., ABIDEI-CALTECH and ABIDEII-BNI_1). For the institution names, we used the same nomenclature as reported online45, with the following exceptions: (i) we merged LEUVEN_1 and LEUVEN_2 data into ABIDEI-LEUVEN, UCLA_1 and UCLA_2 data into ABIDEI-UCLA, and UM_1 and UM_2 data into ABIDEI-UM, because the acquisition parameters were the same; (ii) we split the data from ABIDEII-KKI_1 into ABIDEII-KKI_8ch and ABIDEII-KKI_32ch, because the acquisitions were performed using an 8-channel or a 32-channel phased-array head coil, respectively. The IXI study provided three different datasets, named with the prefix IXI followed by the name of the institution that collected the images (e.g., IXI-Guys). From the 1000 FCP and CoRR studies, we used the International Consortium for Brain Mapping (ICBM) and the Nathan Kline Institute - Rockland Sample Pediatric Multimodal Imaging Test-Retest Sample (NKI2) datasets, respectively.

Table 1 Scanning parameters for each single-center dataset.

In each single-center dataset, baseline MRI scans of typically developing and aging brains (one per subject) with available age and sex information were included. The absence of a recognized neurological or psychiatric disorder diagnosis was used to define normal development and aging. The leading institutions at each site where the MR images were collected had obtained informed consent from all participants and were authorized by the local Ethics Committees. Table S1 shows the general characteristics of each single-center dataset. In this study, we grouped the single-center datasets into three multicenter meta-datasets based on age and the amount of overlap between age distributions. We considered the following age ranges: childhood (5–13 years), adolescence (11–20 years), and adulthood (18–87 years). We measured the overlap between age distributions with the n-distribution Bhattacharyya coefficient (BC)46, an extension of the 2-distribution BC47. The BC is 0 when there is no overlap and 1 when the overlap is complete. Here, n is the number of single-center datasets grouped into the meta-dataset covering each of the above-mentioned age ranges and may differ between meta-datasets. We thus constructed the CHILDHOOD meta-dataset, containing 11 single-center datasets whose subjects' ages range from 5 to 13 years and whose age distributions have a BC of 0.71. The ADOLESCENCE meta-dataset includes 9 single-center datasets whose subjects' ages range from 11 to 20 years and whose age distributions have a BC of 0.45. Finally, the ADULTHOOD meta-dataset consists of all data belonging to subjects aged between 18 and 87 years (12 single-center datasets), whose age distributions have a BC of 0. A detailed description of the composition of each meta-dataset and their age distributions are shown in Table 2 and Fig. 1, respectively. In addition, we merged all single-center datasets into a meta-dataset, called LIFESPAN, that covers the entire age range (5–87 years). In this meta-dataset, composed of 36 imaging sites, the single-center age distributions have a null overlap (Fig. 1).
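For illustration, a histogram-based estimate of the n-distribution BC can be sketched as follows (a minimal sketch using the geometric-mean-per-bin generalization of the 2-distribution BC; the bin count and range handling are our own illustrative choices and may differ from the estimator of ref. 46):

```python
import numpy as np

def bhattacharyya_n(samples, bins=20, value_range=None):
    """n-distribution Bhattacharyya coefficient (illustrative sketch).

    For each histogram bin, take the geometric mean of the n normalized
    bin probabilities and sum over bins: the result is 0 for fully
    disjoint distributions and 1 for complete overlap.
    """
    n = len(samples)
    samples = [np.asarray(s, dtype=float) for s in samples]
    if value_range is None:
        value_range = (min(s.min() for s in samples),
                       max(s.max() for s in samples))
    probs = np.array([np.histogram(s, bins=bins, range=value_range)[0] / len(s)
                      for s in samples])
    return float(np.sum(np.prod(probs, axis=0) ** (1.0 / n)))

# e.g., the age overlap of the datasets grouped into a meta-dataset:
# bc = bhattacharyya_n([ages[site == s] for s in np.unique(site)])
```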

Table 2 Description of the demographic characteristics of each meta-dataset.
Fig. 1
figure 1

Age distributions. Age distributions of participants for CHILDHOOD, ADOLESCENCE, ADULTHOOD, and LIFESPAN meta-datasets, grouped by single-center dataset and sorted by median age.

MR image processing

For each brain MR T1-weighted image, we performed cortical reconstruction and volumetric segmentation. In this work, we analyzed cerebral structures only, and we extracted neuroimaging features from various regions of the cerebral cortex: the entire cerebral cortex, the left/right hemispheres separately, and the left/right frontal, temporal, parietal, and occipital lobes. In particular, for each region, we computed the average cortical thickness (CT) and the fractal dimension (FD).

Cortical reconstruction and volumetric segmentation

We used the FreeSurfer package to perform completely automated cortical reconstruction and volumetric segmentation of each subject's structural T1-weighted scan. We used version 7.1.1, except in a few cases: (i) for T1-weighted images belonging to the ICBM and NKI2 datasets, we used FreeSurfer version 5.3, and (ii) for the ABIDEI datasets, we used the FreeSurfer version 5.1 outputs previously made available online by Cameron and colleagues48 (http://preprocessed-connectomes-project.org/abide/index.html). Even though different FreeSurfer versions may affect neuroimaging variables49,50,51,52,53, such variability is considered part of the site variability and is handled by the harmonization procedure. Indeed, all subjects in each center were processed with the same version of FreeSurfer. FreeSurfer is extensively documented (see ref. 54 for a review) and publicly accessible (http://surfer.nmr.mgh.harvard.edu/). In addition to the standard FreeSurfer outputs, we performed a parcellation of the cortical lobes using the mri_annotation2label tool with the --lobesStrict option.

All FreeSurfer outputs used in this study were visually inspected for quality assurance by two experienced radiologists (M.M. and C.T., with 35 and 30 years of experience, respectively), following an improved version of the ENIGMA Cortical Quality Control Protocol 2.0 (http://enigma.ini.usc.edu/protocols/imaging-protocols/). First, we created an HTML file for each single-center dataset showing, for each subject, the segmentation of the cortical regions overlaid on the T1-weighted images. Then, we scrolled through the HTML file to visually identify gross segmentation errors in any cortical region. For each single-center dataset, we estimated the statistical outliers for CT features, defined as any data point more than 2.698 standard deviations below or above the mean. For each subject, we carefully inspected the cortical segmentations whose feature values were labeled as statistical outliers to assess whether the outlier was an actual segmentation error. If so, the subject was excluded from further analyses.
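The outlier screening rule can be expressed compactly; the following is a minimal sketch (the function name is ours), where the 2.698 standard deviation threshold corresponds, for Gaussian data, to the classical boxplot whisker limit of 1.5 × IQR beyond the quartiles:

```python
import numpy as np

def flag_ct_outliers(values, n_sd=2.698):
    """Flag data points more than n_sd standard deviations from the mean.

    For normally distributed data, 2.698 SD matches the boxplot whisker
    limit of 1.5 x IQR beyond the first/third quartiles.
    """
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std(ddof=1)
    return np.abs(z) > n_sd
```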

Extraction of cortical thickness and fractal dimension features

For each subject, using FreeSurfer tools, we computed the average CT of each cortical region as the average distance measured from each vertex of the gray/white boundary surface to the pial surface55.

The FD is a numerical representation of shape complexity56. The FD is normally a fractional value and is considered a dimension because it gives a measure of space-filling capacity57. An FD value between 2 and 3 is typical of a complex and heavily folded 2-D surface buried in a 3-D region, such as the human cerebral cortex. The FD is a very compact measure of shape complexity, combining cortical thickness, sulcal depth, and folding area into a single numeric value58,59. In this study, the fractal analysis was carried out using the fractalbrain toolkit version 1.1 (freely available at https://github.com/chiaramarzi/fractalbrain-toolkit), described in detail in Marzi et al.59. The fractalbrain toolkit processes FreeSurfer outputs directly, computing the FD of various regions of the cerebral cortex: the entire cerebral cortex, the left/right hemispheres separately, and the left/right frontal, temporal, parietal, and occipital lobes. Fractalbrain performs the 3D box-counting algorithm60, adopting an automated selection of the fractal scaling window59 – a crucial step for establishing the FD of non-ideal fractals59,61.

Briefly, we overlaid a grid composed of 3D cubes of different sizes s (where s = 2^k voxels, and k = 0, 1, …, 8) onto the segmentation and recorded the number of cubes N(s) needed to fully enclose the structure for each size. This process was repeated with 20 uniformly distributed random offsets to prevent any systematic influence of the grid placement, and the resulting box counts were averaged to obtain a single N(s) value62,63. For a fractal object, the data points of the number of cubes N(s) vs. size s in the log-log plane can be modeled through a linear regression within a range of spatial scales called the fractal scaling window. Fractalbrain automatically selects the optimal fractal scaling window by searching for the interval of spatial scales that provides the best linear fit, as measured by the rounded coefficient of determination adjusted for the number of data points (R2adj). If multiple intervals have the same rounded R2adj, the widest interval (i.e., the one that contains the most data points in the log-log plot) is selected59. The FD of the brain structure is then estimated as the slope (in absolute value) of the linear regression model within the automatically selected fractal scaling window. As an example, in Fig. 2, we report a log-log plot of the 3D box-counting algorithm optimized for the automatic selection of the best fractal scaling window of the cerebral cortex of one subject.
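The core of the box-counting procedure can be sketched as follows (an illustrative sketch only: fractalbrain additionally performs the automated scaling-window selection described above, which is omitted here in favor of a fit across all scales):

```python
import numpy as np

def box_count(mask, n_offsets=20, seed=0):
    """Average 3D box count N(s) for sizes s = 2**k, k = 0..8.

    `mask` is a 3D boolean array of the segmented structure; for each
    size, the grid is shifted by random offsets and the counts averaged.
    """
    rng = np.random.default_rng(seed)
    coords = np.argwhere(mask)                    # voxel coordinates
    sizes = 2 ** np.arange(9)                     # 1, 2, 4, ..., 256
    counts = []
    for s in sizes:
        per_offset = []
        for _ in range(n_offsets):
            offset = rng.integers(0, s, size=3)
            boxes = (coords + offset) // s        # box index of each voxel
            per_offset.append(len(np.unique(boxes, axis=0)))
        counts.append(np.mean(per_offset))
    return sizes, np.array(counts)

def fractal_dimension(sizes, counts):
    """FD as the absolute slope of log N(s) vs. log s (no window search)."""
    slope, _ = np.polyfit(np.log(sizes), np.log(counts), 1)
    return abs(slope)
```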

Fig. 2
figure 2

3D box-counting for computation of the FD. An example of the 3D box-counting algorithm that uses an automated selection of the fractal scaling window through the fractalbrain toolkit59. N(s) is the average number of 3D cubes of side s needed to fully enclose the brain structure computed using 20 uniformly distributed random offsets to the grid origin. The regression line within the optimal fractal scaling window, whose slope (sign changed) is the FD, is depicted in red.

Harmonization of brain cortical features

We harmonized cortical features using ComBat, a model that builds on the statistical harmonization technique proposed by Johnson and colleagues35 for location and scale (L/S) adjustments to the data while preserving between-subject biological variability. Briefly, let yijf be the value of the neuroimaging feature f for participant j in the single-center dataset i, for a total of k single-center datasets, n participants, and V features. Further, let X be the n × p matrix of biological covariates of interest, and Z be the n × k matrix of single-center labels. The ComBat harmonization model can be written as follows:

$${y}_{ijf}={f}_{f}\left({X}_{ij}\right)+{Z}_{ij}{\vartheta }_{f}+{\delta }_{if}{\varepsilon }_{ijf}$$
(1)

where ff (Xij) denotes the variation of yijf captured by the biologically relevant covariates Xij, \({\vartheta }_{f}\) is the one-dimensional array of the k coefficients associated with the single-center labels Zij for the feature f. We assume that the residual terms εijf have mean 0. The parameters δif describe the multiplicative site effect of the i-th site on the feature f, i.e., the scale (S) adjustment, while the location (L) parameter for the i-th site on the feature f, is represented by γif (the empirical Bayes estimates of the term \({Z}_{ij}{\vartheta }_{f}\)). Consistent with the ComBat model notation used in Fortin et al. (2017), the harmonized \({y}_{ijf}^{* }\) become:

$${y}_{ijf}^{* }=\frac{{y}_{ijf}-{f}_{f}\left({X}_{ij}\right)-{\gamma }_{if}}{{\delta }_{if}}+{f}_{f}\left({X}_{ij}\right)$$
(2)

In this study, we used the ComBat model implemented in the neuroHarmonize v. 2.1.0 package (freely available at https://github.com/rpomponio/neuroHarmonize) – an open-source and easy-to-use Python module2. In particular, neuroHarmonize extends the neuroCombat package5,6 with the possibility of specifying covariates with generic nonlinear effects on the neuroimaging features to harmonize. Specifically, the ff (Xij) term in Eq. (1) is a Generalized Additive Model (GAM) function of the specified covariates2. Indeed, MRI-derived features are known to be influenced by demographic factors, such as age2,3,5,59,64,65,66,67,68,69,70 and sex71. In our study, these variables were included in the harmonization process as sources of inter-subject biological variability. Finally, since it is not evident that the site effect affects all MRI-derived measures in the same way3, we performed a separate harmonization for each group of features of the same type (i.e., CT and FD).
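For concreteness, the neuroHarmonize calls corresponding to this setup look like the following sketch (variable names such as site_labels are illustrative placeholders; covariates other than SITE must be numeric, so sex is encoded as 0/1):

```python
import pandas as pd
from neuroHarmonize import harmonizationLearn, harmonizationApply

# `features` is an (n_subjects x n_features) array holding one feature
# type at a time (all CT regions, or all FD regions).
covars = pd.DataFrame({"SITE": site_labels, "AGE": age, "SEX": sex_binary})

# Learn the location/scale site parameters of Eq. (1); AGE enters the
# model as a nonlinear (GAM) smooth term.
model, features_adj = harmonizationLearn(features, covars,
                                         smooth_terms=["AGE"])

# Apply the learned model to new data from the same sites.
new_features_adj = harmonizationApply(new_features, new_covars, model)
```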

The harmonizer transformer

The increased sample size due to the pooling of data acquired at various centers naturally facilitates the application of machine learning techniques. For training and testing machine learning models, a proper validation scheme that handles data splitting must be chosen (Fig. 3). This choice is crucial to avoid data leakage by ensuring that the entire workflow (preprocessing and model-building steps) is constructed on training data and evaluated on test data never seen during the learning phase. Indeed, data leakage in the training process may produce falsely high performance in the test set (see, e.g., ref. 72 and ref. 73). Especially in medicine and healthcare, where relatively small datasets are usually available, the straightforward hold-out validation scheme is rarely applied. Instead, cross-validation (CV) and its nested version (nested CV) for hyperparameter optimization of the entire workflow74,75,76 are frequently preferred. Repeated CVs or repeated nested CVs are also suggested for improving the reproducibility of the entire machine learning system75. In all these validation schemes, several training and test procedures are carried out on different data splits, underscoring the need for a compact code structure to avoid errors that may lead to data leakage. In this view, machine learning pipelines are a solution because they orchestrate all the processing steps in a short, easier-to-read, and easier-to-maintain code structure (Fig. 3). A pipeline represents the entire data workflow, combining all transformation steps (e.g., data cleaning, data imputation, data scaling, and general data preprocessing) and machine learning model training. It is essential for automating an end-to-end training/test process without any form of data leakage and for improving reproducibility, ease of deployment, and code reuse, especially when complex validation schemes are needed.

Fig. 3
figure 3

Machine learning pipeline. A pipeline represents the entire data workflow, combining all transformation steps and machine learning model training. It is essential to automate an end-to-end training/test process without any form of data leakage and improve reproducibility, ease of deployment, and code reuse, especially when complex validation schemes are needed.

In the Scikit-learn library, a popular, open-source, well-documented, and easy-to-learn machine learning package that implements a vast number of machine learning algorithms, a pipeline is a chain of "transformers" and a final "estimator" acting as a single object. Transformers are modules that apply preprocessing to the data, whereas estimators are modules that fit a model on training data and can infer properties of new data (https://scikit-learn.org/stable/developers/develop.html). In particular, transformers are classes with a "fit" method, which learns model parameters (e.g., mean and standard deviation for data standardization) from a training set, and a "transform" method, which applies this transformation model to any data. For example, for data standardization (transforming data to have zero mean and unit standard deviation), the mean μ must be subtracted from the data, and the result must be divided by the standard deviation σ. Crucially, μ and σ must be computed on the training set only. In the test set, or any validation set, the same transformation must be applied using the same two parameters μ and σ computed for centering the training set. Basically, the "fit" method calculates the parameters (e.g., μ and σ in our case) and saves them internally, whereas the "transform" method applies the transformation (using the saved parameters) to any particular set of data.
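As a didactic example of this fit/transform contract, a standardization transformer could be written as follows (Scikit-learn already provides this as StandardScaler; the re-implementation only makes the mechanics explicit):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class Standardizer(BaseEstimator, TransformerMixin):
    """Didactic equivalent of sklearn.preprocessing.StandardScaler."""

    def fit(self, X, y=None):
        # Learn mu and sigma from the training set only.
        self.mu_ = np.mean(X, axis=0)
        self.sigma_ = np.std(X, axis=0)
        return self

    def transform(self, X):
        # Reuse the parameters saved during fit on any data set.
        return (np.asarray(X) - self.mu_) / self.sigma_
```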

For these reasons, in this study, we propose the harmonizer – a Scikit-learn Python transformer that encapsulates the neuroHarmonize procedure among the preprocessing steps of a machine learning pipeline. The "fit" method of the harmonizer transformer learns the neuroHarmonize model parameters from a training set and saves them internally, whereas the "transform" method applies the neuroHarmonize model, previously learned on the training set, e.g., to unseen data. The source code of the harmonizer transformer is publicly available in a GitHub repository at https://github.com/Imaging-AI-for-Health-virtual-lab/harmonizer.
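In simplified form, the harmonizer can be sketched as follows (a minimal sketch, not the released implementation, which is available in the repository above; it assumes the first columns of X carry the integer-encoded site label and the covariates, so that they travel with the features through cross-validation splits):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from neuroHarmonize import harmonizationLearn, harmonizationApply

class Harmonizer(BaseEstimator, TransformerMixin):
    """Simplified sketch of the harmonizer transformer."""

    def __init__(self, covar_names=("SITE", "AGE", "SEX"),
                 smooth_terms=("AGE",)):
        self.covar_names = covar_names
        self.smooth_terms = smooth_terms

    def _split(self, X):
        # First len(covar_names) columns: covariates; the rest: features.
        n = len(self.covar_names)
        covars = pd.DataFrame(X[:, :n], columns=list(self.covar_names))
        return covars, X[:, n:]

    def fit(self, X, y=None):
        covars, feats = self._split(X)
        # Learn site location/scale parameters on the training rows only.
        self.model_, _ = harmonizationLearn(
            feats, covars, smooth_terms=list(self.smooth_terms))
        return self

    def transform(self, X):
        covars, feats = self._split(X)
        # Apply the saved model to any (training, validation, or test) rows.
        return harmonizationApply(feats, covars, self.model_)
```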

In the following, we included the harmonizer transformer in a pipeline to learn the harmonization procedure parameters on the training data only and apply the harmonization procedure (with parameters obtained in the training set) to the test data. This prevented data leakage by design in the harmonization procedure independently of the chosen validation scheme.
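Assuming the sketch above, including the harmonizer in a cross-validated pipeline takes a few lines; the harmonization parameters are then re-estimated on the training folds at every CV split, so the test folds never leak into them:

```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

pipe = Pipeline([("harmonizer", Harmonizer()),
                 ("clf", XGBClassifier())])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, site_labels, cv=cv,
                         scoring="balanced_accuracy")
```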

Statistical and machine learning analyses

We performed the statistical and machine learning analyses described in the following paragraphs for each feature group of the same type (i.e., CT and FD) and each meta-dataset (i.e., CHILDHOOD, ADOLESCENCE, ADULTHOOD, and LIFESPAN).

Visualization and quantification of site effect

We first performed a series of analyses of increasing complexity to explore the actual existence of a site effect in the data. For each region-feature pair, we qualitatively showed the site effect on raw data through boxplots, using the site as the independent variable and each region-feature pair as the dependent variable. Quantitatively, the site effect was measured by analysis of covariance (ANCOVA) – a general linear model that blends analysis of variance (ANOVA) and linear regression. ANCOVA evaluates whether the means of a dependent variable are equal across levels of a categorical independent variable while statistically controlling for the effects of other variables that are not of primary interest, known as covariates or nuisance variables. In this study, we set the single-center dataset as the independent variable; age, age×age, and sex as covariates; and each region-feature pair as the dependent variable.

Additionally, to further investigate the site effect on raw data and to measure the success of ComBat harmonization, we predicted the imaging site from the neuroimaging features, grouped by feature type, namely CT and FD. Specifically, we used the supervised eXtreme Gradient Boosting (XGBoost) method (version 0.90 with default hyperparameters for a classification task), a scalable end-to-end tree-boosting system widely used to achieve state-of-the-art performance on many recent machine learning challenges77. Using N = 100 repetitions of a stratified 5-fold CV, we estimated the median balanced accuracy. The statistical significance of the prediction performance was determined via permutation analysis. Thus, for each feature group, 5000 new models were created using random permutations of the target labels (i.e., the imaging site), such that the explanatory neuroimaging variables were dissociated from their corresponding imaging site, to simulate the null distribution of the performance measure against which the observed value was tested78. Since the single-center datasets in this study covered different age groups, the random permutation of the target labels was performed within groups of subjects of similar age79, categorized into five-year intervals. The 5-year value was selected to be small enough to discern age differences while being large enough to avoid an excessive reduction in the number of potential permutations within each age group.
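A minimal sketch of the age-grouped label permutation (the function name is ours) used to build each of the 5000 null models:

```python
import numpy as np

def permute_sites_within_age_groups(site_labels, age, rng, bin_width=5):
    """Shuffle site labels only among subjects in the same 5-year age bin,
    so the null distribution preserves the age-site structure."""
    site_labels = np.asarray(site_labels).copy()
    bins = (np.asarray(age) // bin_width).astype(int)
    for b in np.unique(bins):
        idx = np.flatnonzero(bins == b)
        site_labels[idx] = site_labels[rng.permutation(idx)]
    return site_labels
```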

Median balanced accuracy was considered significantly different from the chance level when the p-value computed using permutation tests was < 0.05. Additionally, we calculated the average confusion matrix over repetitions to graphically evaluate the goodness of prediction. The same imaging site prediction was performed on raw data (i.e., without harmonization) to confirm the existence of the site effect and on harmonized data (with neuroHarmonize and Harmonizer transformer) to investigate if the site effect was reduced or removed.

We propose to measure the efficacy of harmonization in reducing or removing the site effect through a two-step assessment. First, we evaluated whether the site prediction after the harmonization process was not significantly different from a random prediction by comparing the median balanced accuracy over repetitions with the distribution of balanced accuracies estimated using the permutation test with 5000 permutations (the default value in the randomise tool of FSL – the FMRIB Software Library – for non-parametric permutation inference on neuroimaging data80). Considering, for example, a significance threshold of 0.05 in the permutation test, in the case of complete removal of the site effect, the site prediction will not differ from that of a random model (i.e., p-value ≥ 0.05). Second, in the case of a permutation test p-value < 0.05, we compared the balanced accuracy obtained by predicting the site without and with the harmonization procedure. In particular, we assessed the site effect reduction by verifying that the median balanced accuracy obtained predicting the imaging site with harmonized data was significantly lower than that estimated with raw data through the non-parametric one-sided Wilcoxon signed-rank test, with a significance threshold of 0.0581. The source code for evaluating the effectiveness of harmonization using the harmonizer transformer is publicly available in a GitHub repository at https://github.com/Imaging-AI-for-Health-virtual-lab/harmonizer.
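The two-step decision rule can be summarized in code as follows (an illustrative sketch; inputs are the paired balanced accuracies over CV repetitions and the permutation-test null distribution):

```python
import numpy as np
from scipy.stats import wilcoxon

def harmonization_efficacy(bacc_raw, bacc_harm, null_bacc, alpha=0.05):
    """Two-step assessment sketch. bacc_raw/bacc_harm: paired balanced
    accuracies over CV repetitions (same splits); null_bacc: the 5000
    permutation-test accuracies obtained on harmonized data."""
    median_harm = np.median(bacc_harm)
    # Step 1: is site prediction on harmonized data better than chance?
    p_perm = (np.sum(np.asarray(null_bacc) >= median_harm) + 1) \
             / (len(null_bacc) + 1)
    if p_perm >= alpha:
        return "site effect removed", p_perm, None
    # Step 2: is accuracy at least significantly lower than on raw data?
    p_wil = wilcoxon(bacc_harm, bacc_raw, alternative="less").pvalue
    verdict = "site effect reduced" if p_wil < alpha else "site effect persists"
    return verdict, p_perm, p_wil
```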

To estimate the effect of data leakage on the prediction of the imaging site caused by performing the harmonization on all data before splitting into training and test sets, we tested whether the balanced accuracies obtained using neuroHarmonize on all data before any split were consistently lower than those estimated using the harmonizer transformer in the above-mentioned stratified CV scheme. Since the same data splits were applied for both CT and FD, the comparison was carried out through a paired test, i.e., the non-parametric one-sided Wilcoxon signed-rank test with a significance threshold of 0.0581.

Associations with age

While it is essential to show that a harmonization method successfully reduces a possible site effect, it is equally crucial to verify that it preserves the biological variability in the data. Indeed, a harmonization method that removes both site and biological effects has no utility. One of the most influential sources of biological variability in the neuroimaging features of healthy subjects is undoubtedly chronological age. Throughout the lifespan, the brain structure changes because of a complex interplay between multiple maturational and neurodegenerative processes. Such processes can yield large spatial and temporal variations in the brain65,82,83.

For these reasons, we attempted to predict individual age from neuroimaging features through an XGBoost model (version 0.90 with default hyperparameters for a regression task)77. We estimated the median (over repetitions) mean absolute error (MAE) using N = 100 repetitions of a 5-fold CV. Age prediction was performed on harmonized data using both neuroHarmonize and the harmonizer transformer in the CV pipeline. To estimate the effect of data leakage in the age prediction caused by performing the harmonization on all data before splitting into training and test sets, we compared the MAE values obtained using neuroHarmonize on all data before any split and the harmonizer transformer in the above-mentioned CV scheme. In particular, since the same data set splits were applied for both CT and FD, we assessed whether the median MAE using neuroHarmonize on all data before any split was consistently lower than that estimated using the harmonizer transformer through a paired test, i.e., the non-parametric one-sided Wilcoxon signed-rank test with a significance threshold of 0.0581.

Moreover, before and after the harmonization procedure, for each region-feature pair, we qualitatively visualized the site effect on the relationship between age and each region-feature pair through scatterplots (with age as the independent variable and each region-feature pair as the dependent variable).

Simulation experiments

The harmonizer transformer prevents data leakage by design in the harmonization procedure in any machine learning pipeline, independently of the chosen validation scheme. In contrast, when harmonization is applied before data splitting, data leakage is present, and its severity depends on the specific context and the extent of the leakage. In neuroimaging, the extent and impact of the data leakage effect are still underexplored. Therefore, we performed simulation experiments (with known site effects) and computational tests to assess the data leakage effect when the harmonization process is performed before the training-test data splitting.

CT and FD data simulation settings

Let yijf be the value of the simulated feature f for the single-center dataset i and participant j, for a total of k single-center datasets, ni participants for each center, and V features. In this study, we simulated CT and FD data for k = 3, 10, 36 single centers. Each single-center dataset provided the same number of participants (i.e., ni = n), with n assuming the values 25, 50, 100, 250. In total, we ran 24 experiments, i.e., we simulated 24 different multicenter datasets (12 for the CT features and 12 for the FD measures).

Each yijf was generated based on the model proposed by Johnson and colleagues35 and recently used for neuroimaging features’ simulation by Chen and collaborators84:

$${y}_{ijf}={{\rm{\alpha }}}_{f}+{{\rm{\beta }}}_{f1}{x}_{ij}+{{\rm{\beta }}}_{f2}{x}_{ij}^{2}+{{\rm{\gamma }}}_{if}+{{\rm{\delta }}}_{if}{{\rm{\varepsilon }}}_{ijf}$$
(3)

where αf is the average value of the feature f in the single-center ICBM dataset, βf1 = −0.0009 and βf2 = −0.00005 are the linear and quadratic effects of age on the feature f, respectively, and xij is a simulated age variable drawn from a uniform distribution X ~ uniform([20,90]). Considering the nature of our investigation, which examines how CT and FD relate to age, it is reasonable to assume that the relationship is no more than quadratic59,85. The mean site effect γif was drawn from a normal distribution with zero mean and standard deviation equal to 0.1, while the variance site effect δif was drawn from a center-specific inverse gamma distribution with chosen parameters. For our simulations, we distinguished the site-specific location factors by assuming independent and identically distributed (i.i.d.) normal distributions, and the scaling factors by using the following parameters: we set the inverse gamma shape, for each center, as {46, 51, 56} when k = 3, as {40, 42, …, 58} when k = 10, and as {10, 12, …, 40, 41, …, 50, 52, …, 70} when k = 36. In all cases, the inverse gamma scale was set to 50.
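A generator for Eq. (3) can be sketched as follows (a minimal sketch: eps_sd is an assumed residual standard deviation, since the model only requires the residuals to have zero mean, and alpha must be set to the ICBM average of the feature being simulated):

```python
import numpy as np

def simulate_feature(alpha, shapes, n_per_site, rng,
                     beta1=-0.0009, beta2=-0.00005,
                     gamma_sd=0.1, ig_scale=50.0, eps_sd=0.01):
    """Simulate one feature for k = len(shapes) sites under Eq. (3)."""
    k = len(shapes)
    age = rng.uniform(20, 90, size=(k, n_per_site))
    gamma = rng.normal(0.0, gamma_sd, size=(k, 1))      # additive site effect
    # Inverse gamma draw via 1 / Gamma(shape, scale = 1 / ig_scale).
    delta = 1.0 / rng.gamma(np.asarray(shapes), 1.0 / ig_scale)
    eps = rng.normal(0.0, eps_sd, size=(k, n_per_site))
    y = alpha + beta1 * age + beta2 * age**2 + gamma + delta[:, None] * eps
    sites = np.repeat(np.arange(k), n_per_site)
    return age.ravel(), sites, y.ravel()

# e.g., k = 3 sites with 100 participants each (alpha chosen arbitrarily):
# age, site, y = simulate_feature(2.5, [46, 51, 56], 100,
#                                 np.random.default_rng(0))
```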

Measuring the effect of data leakage

We measured the effect of data leakage for the site and age prediction tasks independently. Hereinafter, we will refer generically to performance, indicating the balanced accuracy for the site prediction task and the MAE for the age prediction task. To measure the effect of data leakage, after an external hold-out (Fig. 4), we first computed the performance of an imaging site/age prediction estimator trained using (a) the harmonizer transformer within the machine learning pipeline (internal not leaked test set) and (b) harmonizing all data with neuroHarmonize before the actual prediction (internal leaked test set). Second, we compared these performances with that observed on an external test set never used for harmonization and training (Fig. 4). In the absence of data leakage, the performance in the internal and external test sets should be similar and not significantly different. When data leakage is present, the performance in the internal test set is overly optimistic (i.e., significantly better than that on the external test set). In detail, for each experiment, we performed the following steps.

Fig. 4
figure 4

Overview of the analysis of simulated data for each experiment. After an external hold-out, we computed the performance of an imaging site/age prediction estimator trained using (a) the harmonizer transformer within the machine learning pipeline (internal not leaked test set) and (b) harmonizing all data with neuroHarmonize before imaging site/age prediction (internal leaked test set). Then, we compared these performances with that observed on an external test set never used for harmonization and training.

External hold-out

We randomly split the data into two parts, i.e., a data set containing 50% of the samples and an external test set with the other 50% of the instances.

Imaging site/age prediction estimator training and test on the external test set

We fitted a harmonization model with neuroHarmonize using age as a covariate with a nonlinear relationship with the individual MRI-derived features. To fit the harmonization model, we used the same number of instances adopted for the other two approaches (see the next analyses), i.e., 80% of the samples of the data set, randomly chosen. Then, we applied the harmonization model to the data set and the external test set. Finally, we trained an XGBoost model (with version 0.90 default hyperparameters) to predict the imaging site/age and tested it on the harmonized external test set.

Imaging site/age prediction estimator training and test using harmonizer transformer within the machine learning pipeline (not leaked internal test set)

We trained and tested a pipeline containing the harmonizer transformer and an XGBoost estimator (with version 0.90 default hyperparameters) on the data set to predict the imaging site/age through a stratified 10-times repeated 5-fold CV. Thus, we trained the pipeline in the training sets of each iteration of the CV and considered the performance within the test sets of the CV.

Imaging site/age prediction estimator training and test harmonizing all data with neuroHarmonize before imaging site/age prediction (leaked internal test set)

We trained and tested a pipeline containing an XGBoost estimator (with version 0.90 default hyperparameters) on the harmonized dataset to predict the imaging site/age through a stratified 10-times repeated 5-fold CV. Thus, we trained the pipeline in the training sets of each iteration of the CV and considered the performance metric within the test sets of the CV.

For each task, i.e., imaging site and age prediction, we repeated each experiment (i.e., all the steps above) 100 times with random data splits and computed the average performance across the 100 repetitions. Finally, we compared the average performance across the 100 repetitions of each internal test set (leaked and not leaked) with that of the external test set. When data leakage is present, the performance in the internal test set is better than that in the external test set: a lower balanced accuracy for the imaging site prediction (because harmonization appears more effective at hiding the site) and a lower MAE for the age prediction. To assess whether the average performance of each internal test set was lower than that of the external test set, we conducted a one-tailed t-test, applying Bonferroni correction for multiple comparisons. This statistical analysis allowed us to evaluate the significance of any differences observed between the average performance of the internal and external test sets.

In addition, we calculated, for each internal test set, the Cohen’s d effect size to estimate the magnitude of the differences between performance distributions’ means. Specifically, we used the following Cohen’s d formula: \({\rm{d}}=\frac{\overline{{x}_{e}}-\overline{{x}_{i}}}{s}\) where \(\overline{{x}_{e}}\) is the average performance in the external test set, \(\overline{{x}_{i}}\) is the average performance in the internal test set, and s is the standard deviation of the difference between performance obtained in the external test set and that achieved in the internal test set.
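Equivalently, in code (a one-line computation on the paired performance arrays):

```python
import numpy as np

def cohens_d_paired(perf_external, perf_internal):
    """Cohen's d as defined above: mean of the paired differences between
    external and internal test-set performance, divided by the standard
    deviation of those differences."""
    diff = np.asarray(perf_external) - np.asarray(perf_internal)
    return diff.mean() / diff.std(ddof=1)
```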

Results

Measuring the effect of data leakage in simulated data

Regarding the imaging site prediction, the results were similar for both CT (Table 3) and FD (Table 4) simulated features. The performances obtained on the leaked internal test set were overly optimistic, i.e., significantly better than those obtained on the external test set, indicating the presence of data leakage. In contrast, the average balanced accuracies recorded on the not leaked internal test set were not statistically different from those of the external test set (except in one case – see details in Table 4).

Moreover, as the number of samples available in each single-center dataset decreases, the effect of data leakage increases (Tables 3, 4 for CT and FD, respectively). This phenomenon is even more evident in Fig. 5, where we report the difference between the average balanced accuracy obtained in the external test set and that gained in the internal test sets vs. the number of participants in each single-center site for CT and FD, respectively. When data leakage is present (dashed lines in Fig. 5), the difference between the average balanced accuracy in the external test set and that in the internal leaked test set always differs significantly from zero (Bonferroni adjusted p-values < 10−9 and < 10−10 for CT and FD, respectively) and increases as the number of participants in each single-center dataset decreases. This result has a profound impact because most in vivo neuroimaging studies have single-center datasets with between 25 and 100 subjects. Conversely, when data leakage is not present (solid lines in Fig. 5), the difference between the average balanced accuracy in the external test set and that in the internal not leaked test set was approximately zero and remained constant as the number of participants in each single-center dataset changed.

Table 3 Imaging site prediction results with CT simulated data.
Table 4 Imaging site prediction results with FD simulated data.
Fig. 5
figure 5

Imaging site prediction results with CT and FD simulated data. We report the difference between the average balanced accuracy obtained in the external test set and that gained in the internal test sets (dotted line for the leaked internal test set and solid line for the not leaked internal test set) and Cohen's d effect size vs. the number of participants per single-center dataset n. The cross marker indicates a significant difference between balanced accuracy distributions (one-tailed paired t-test Bonferroni adjusted p-values < 10−9 and < 10−10 for CT and FD, respectively). The colors and line types in Cohen's d plots are consistent with those employed in the other plots.

Data leakage was also observed in the age prediction task for both CT and FD features. Similarly to the site prediction task, the performance on the leaked internal test set appears overly optimistic (Tables 5, 6 for CT and FD, respectively), and the impact of data leakage becomes more pronounced as the number of samples in each single-center dataset decreases (Fig. 6).

Table 5 Age prediction results with CT simulated data.
Table 6 Age prediction results with FD simulated data.
Fig. 6
figure 6

Age prediction results with CT and FD simulated data. We report the difference between the average MAE obtained in the external test set and that gained in the internal test sets (dotted line for the leaked internal test set and solid line for the not leaked internal test set) and Cohen's d effect size vs. the number of participants per single-center dataset n. The cross marker indicates a significant difference between MAE distributions (see Tables 5, 6 for details). The colors and line types in Cohen's d plots are consistent with those employed in the other plots.

Visualization and quantification of the site effect in in vivo data

Quality control of FreeSurfer's outputs resulted in the removal of 47 subjects based on the overall low quality of the cortical reconstruction or segmentation errors in any region. All brain regions of the remaining 1740 subjects had both CT and FD features. Thus, we were able to analyze the site effect, the harmonization adjustments, and age prediction on the same subjects for the CT and FD groups of features. The demographic characteristics of the subjects included in the study after quality control are reported in Table 7.

Table 7 Demographic characteristics of the subjects remaining after quality control and who entered into the analyses.

The boxplots in Figs. 7, 8 summarize the distribution of the average CT and FD of the cerebral cortex at each imaging site. Notably, the site effect differs between the two features. For example, in the CHILDHOOD meta-dataset, the ABIDEII-KKI_32ch, ABIDEII-KKI_8ch, and ABIDEII-NYU_1 single-center datasets show the lowest average CT values, while subjects from the ABIDEI-STANFORD dataset have the lowest FD values. Also in the ADOLESCENCE meta-dataset, the site effect behaves differently for CT and FD features: for example, ABIDEI-TCD_1 shows the lowest values of CT, while ABIDEI-LEUVEN shows the lowest values of FD. In the ADULTHOOD meta-dataset, ABIDEI-SBL has the lowest mean CT values, whereas ABIDEII-BNI_1 has the lowest FD values.

Fig. 7
figure 7

Boxplot of the average CT of the cerebral cortex. The boxplots of the average CT of the cerebral cortex without harmonization are shown for the CHILDHOOD, ADOLESCENCE, ADULTHOOD, and LIFESPAN meta-datasets.

Fig. 8
figure 8

Boxplot of the average FD of the cerebral cortex. The boxplots of the FD of the cerebral cortex without harmonization are shown for the CHILDHOOD, ADOLESCENCE, ADULTHOOD, and LIFESPAN meta-datasets.

The same result was measured quantitatively using the ANCOVA analysis. Indeed, all CT and FD features were significantly different across the single-center datasets (Table 8), but the site effect, measured by the partial η2, differed between the two feature sets. In the CHILDHOOD meta-dataset, for example, each cortical region showed a higher partial η2 for FD than for CT, suggesting that, in childhood, acquisition characteristics have a greater impact on the structural complexity measure (FD) than on cortical thickness. On the other hand, in the ADOLESCENCE meta-dataset, the frontal and temporal lobes (bilaterally), along with the entire structure, show lower partial η2 for FD than for CT, whereas the parietal and occipital lobes (bilaterally) have higher partial η2 for FD than for CT. Finally, in the ADULTHOOD meta-dataset, only the occipital and temporal lobes (bilaterally) have lower partial η2 for FD than for CT.

Table 8 ANCOVA results on raw data.

Harmonization efficacy

To assess whether most of the variation in the data was still associated with the site after harmonization, we predicted the imaging site using neuroimaging features grouped by feature type (i.e., CT and FD). Figures 9, 10 report the average confusion matrices (over 100 repetitions) for CT and FD features, respectively. When predicting the site using the raw data, the main diagonal of the confusion matrix is prominent (i.e., the predicted site is usually the actual site) for both feature groups and each meta-dataset (Figs. 9, 10). On the other hand, when the prediction of the site is performed using harmonized data (through neuroHarmonize or the harmonizer transformer), the main diagonal of the confusion matrix is much weaker. The confusion matrices show a vertical pattern, indicating that the predicted site is often the same site regardless of the actual site (Figs. 9, 10). Moreover, the confusion matrix obtained using the harmonizer within the machine learning pipeline appears similar to that obtained by harmonizing all the data with neuroHarmonize before imaging site prediction. This result suggests that the action of the harmonizer resembles that of neuroHarmonize, although the model is built on training data only and then applied to test data. The confusion matrices for CT and FD features in the LIFESPAN meta-dataset are shown in Fig. 11.

Fig. 9
figure 9

Confusion matrices of site prediction using CT features. Each confusion matrix was normalized for the number of subjects belonging to each site. In this way, the sum of the matrix cells of each row gives 1. The confusion matrix obtained using the harmonizer within the machine learning pipeline seems similar to that obtained by harmonizing all the data with neuroHarmonize before imaging site prediction, even though the model is built on training data only and then applied to test data.

Fig. 10
figure 10

Confusion matrices of site prediction using FD features. Each confusion matrix was normalized for the number of subjects belonging to each site. In this way, the sum of the matrix cells of each row gives 1. The confusion matrix obtained using the harmonizer within the machine learning pipeline seems similar to that obtained by harmonizing all the data with neuroHarmonize before imaging site prediction, even though the model is built on training data only and then applied to test data.

Fig. 11
figure 11

Confusion matrices of site prediction using CT and FD features in the LIFESPAN meta-dataset. Each confusion matrix was normalized for the number of subjects belonging to each site. In this way, the sum of the matrix cells of each row gives 1. The confusion matrix obtained using the harmonizer within the machine learning pipeline seems similar to that obtained by harmonizing all the data with neuroHarmonize before imaging site prediction, even though the model is built on training data only and then applied to test data.

Table 9 reports the median balanced accuracies (over 100 repetitions) of imaging site prediction, and the efficacy of the harmonization is shown in Table 10. Specifically, we report the pair (age-group permutation test p-value, one-sided Wilcoxon signed-rank test p-value) to statistically assess the removal or the reduction of the site effect, respectively. As expected, the median balanced accuracy of site prediction using the raw data was significantly different from the chance level (age-group permutation test p-value < 0.05 for all data), and thus an actual imaging site effect was present in the raw data. After harmonization, with neuroHarmonize or the harmonizer transformer, the site effect was either removed (age-group permutation test p-value ≥ 0.05 in Table 10) or only reduced (age-group permutation test p-value < 0.05, but with the median balanced accuracy reduced on harmonized data, as statistically measured by the one-sided Wilcoxon signed-rank test p-value < 0.05 in Table 10). Specifically, when performing harmonization using neuroHarmonize on all data, the site effect appeared to be removed in all analyses except for the imaging site predictions using FD features in the ADOLESCENCE and ADULTHOOD meta-datasets (age-group permutation test p-values equal to 0.0188 and 0.0002, respectively, in Table 10). We found the same behavior when predicting the imaging site using CT and FD features in the LIFESPAN meta-dataset (age-group permutation test p-value equal to 0.0002 in Table 10). In the latter cases, although significantly different from a random prediction, the balanced accuracies were significantly lower than those obtained using the original data (one-sided Wilcoxon signed-rank test p-values < 0.001 in Table 10), which indicates a site effect reduction. When applying the harmonizer transformer to the data (within the CV), we observed the actual efficacy of the harmonization, without the data leakage introduced in the previous case. Indeed, we confirmed a complete removal of the site effect only for the imaging site prediction using CT features in the ADULTHOOD meta-dataset (age-group permutation test p-value equal to 0.1064 in Table 10). In all the other cases, the imaging site prediction was significantly different from the chance level (age-group permutation test p-values < 0.05 in Table 10), but the balanced accuracies were significantly lower than those obtained using the original data (one-sided Wilcoxon signed-rank test p-values < 0.001 in Table 10). Thus, the apparent site effect removal measured using data harmonized before the splitting into training and test sets was a clear sign of data leakage even in in vivo data.

Table 9 Site prediction results.
Table 10 Harmonization efficacy.

Age prediction

Table 11 reports the median MAE values (over 100 repetitions) of the age prediction model. Overall, MAE values of age prediction using data harmonized with neuroHarmonize before the splitting into training and test sets are significantly lower than those obtained using data harmonized with the harmonizer within the CV (one-sided Wilcoxon signed-rank p-values < 0.001 for all the cases, except for CT features in the CHILDHOOD meta-dataset, see Table 11). In line with the results of simulations, the data leakage introduced by harmonizing the data all at once leads to an overly optimistic performance.

Table 11 Age prediction results.

Finally, in Figs. 12, 13, we report the age-dependent trends of the average CT and FD of the cerebral cortex, respectively, both without harmonization and harmonized with the harmonizer transformer. In line with previous literature on features such as CT and volumes2,5, the harmonized average CT and FD values in this study showed less variability than the raw data.

Fig. 12
figure 12

Scatterplot of the average CT of the cerebral cortex vs. age. The plot of the average CT of the cerebral cortex vs. age is shown for the CHILDHOOD, ADOLESCENCE, ADULTHOOD, and LIFESPAN meta-datasets without and with harmonization using the harmonizer transformer. In the latter case, we considered only the first CV among the 100 repetitions. Specifically, for each subject, we plotted the harmonized value obtained in the fold when the subject was included in the test set.

Fig. 13
figure 13

Scatterplot of the FD of the cerebral cortex vs. age. The plot of the FD of the cerebral cortex vs. age is shown for the CHILDHOOD, ADOLESCENCE, ADULTHOOD, and LIFESPAN meta-datasets without and with harmonization using the harmonizer transformer. In the latter case, we considered only the first CV among the 100 repetitions. Specifically, for each subject, we plotted the harmonized value obtained in the fold when the subject was included in the test set.

Discussion

In this study, we introduced the harmonizer transformer, which encapsulates the data harmonization procedure among the preprocessing steps of a machine learning pipeline to avoid data leakage by design. To this end, we explored the ComBat harmonization of CT and FD features extracted from brain T1-weighted MRI data of 1740 healthy subjects aged 5–87 years acquired at 36 sites, as well as from simulated data. We measured the efficacy of the harmonization process in reducing or removing the unwanted site effect through a two-step assessment comparing the performance in imaging site prediction using harmonized data with that of 1) a random prediction and 2) a prediction using non-harmonized data. Finally, we confirmed that data leakage related to harmonization performed before data splitting leads to overestimating performance in both simulated and in vivo data.

Using simulated data, we showed that the data leakage effect introduced by performing the harmonization before data splitting is clearly evident and worsens when the single-center dataset size is small, i.e., comparable with that of most in vivo neuroimaging studies. In these simulated experiments, we paid particular attention to comparing the different harmonization and machine learning approaches under the same conditions, i.e., the same data splits and the same number of subjects for harmonization (for this reason, we adopted 80% of the data set size for fitting the neuroHarmonize model; indeed, with the harmonizer approach, the harmonization was computed in the training folds of a 5-fold CV, i.e., using 80% of the samples).

We chose the ComBat harmonization method due to its widespread use in the scientific community7,12,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34 and its implementation in the neuroHarmonize package, which enables the specification of covariates with generic non-linear effects2. The efficacy of ComBat and its variants has been evaluated by comparing their performance with other harmonization techniques3,5,6 and by simulating site effects using single-center data2. However, various harmonization techniques can be used for features extracted from MRI images. One such method is the residuals harmonization, which employs a global scaling procedure to account for the influence of each site using a pair of parameters (offset and scale). These parameters can be estimated through a linear regression model or a more sophisticated approach that considers non-linearities5. Global scaling was initially introduced to harmonize images directly6. The adjusted residuals harmonization, an advancement of the residuals harmonization, integrates biological covariates (such as age, sex, and diseases) into the linear regression model, facilitating the removal of unwanted site effects while maintaining biological variability5. Lastly, the Correcting Covariance Batch Effects (CovBat) method is a recent variant of the ComBat method that aims to address site effects in the mean, variance, and covariance of the neuroimaging features84.

It is important to note that this is the first study in which the efficacy of the harmonization procedure for neuroimaging data has been evaluated by also comparing the accuracy of imaging site prediction to the chance level. Indeed, previous works have consistently shown a decrease in the accuracy of imaging site prediction after harmonization, but without applying a significance test, and thus it was not known whether the site effect was removed or only reduced (see, e.g., ref. 2 and ref. 5). As hypothesized, there was a real imaging site effect in the raw data (age-group permutation test p-value < 0.05 for all data). The site effect was either eliminated or only reduced after data harmonization with neuroHarmonize or the harmonizer transformer. Specifically, the difference between the efficacy of harmonization when applying neuroHarmonize on all data and when applying the harmonizer within the CV was expected because, in the former case, data leakage is present, leading to a falsely overestimated performance, i.e., an age-group permutation test p-value ≥ 0.05 and a lower median balanced accuracy (Tables 9, 10). The complete removal of the imaging site effect measured using the data harmonized with neuroHarmonize was therefore only apparent. Indeed, using the harmonizer within the CV, the imaging site effect was completely removed only for CT features in the ADULTHOOD meta-dataset. In line with the results of the simulations, we noted that the median balanced accuracies obtained by performing site prediction with data harmonized using neuroHarmonize were significantly lower than those observed using the harmonizer transformer within the CV (one-sided Wilcoxon signed-rank p-values < 0.001 for all the analyses). The differences found in the median balanced accuracy of imaging site prediction between the harmonizer transformer and neuroHarmonize emphasize the importance of introducing the harmonizer transformer into a machine learning pipeline to avoid data leakage, a source of bias in prediction results. Notably, the procedure used to measure data leakage on the simulated data (i.e., comparing the performance of imaging site prediction between the internal test set of the CV and the external test set) was not viable for the in vivo data due to the limited sample size in several centers (fewer than 20 subjects).

Looking at the age-group permutation test p-values for imaging site prediction using data harmonized with neuroHarmonize (i.e., data harmonized before splitting into training and test sets), the efficacy of harmonization worsened as the overlap of the age distributions in the multicenter meta-datasets decreased (Table 10). Specifically, for CT features, the age-group permutation test p-value was 0.5023 in the CHILDHOOD meta-dataset, which exhibits a good overlap of age distributions (BC = 0.71), but dropped to 0.0002 in the LIFESPAN meta-dataset, which exhibits BC = 0. Similar behavior was observed for FD features. These results on in vivo data are in line with the simulations performed by Pomponio and colleagues2, which suggested that age-disjoint studies would be challenging to harmonize in the presence of nonlinear age effects. In contrast, the efficacy of the harmonization performed within the CV using the harmonizer transformer does not appear to be closely linked to the degree of overlap of the age distributions in the multicenter meta-datasets. This may be explained by the fact that, in each fold of the CV, the harmonizer transformer handles only the training data, randomly drawn from the whole meta-dataset, so the actual BC values may vary across folds.
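The Bhattacharyya coefficient used here to quantify distribution overlap is defined over histogram bins as BC = Σ_i √(p_i q_i), ranging from 1 (identical histograms) to 0 (disjoint supports). A short sketch follows, assuming ages as NumPy arrays; the bin count is an arbitrary illustrative choice.

```python
import numpy as np

def bhattacharyya_coefficient(ages_a, ages_b, n_bins=20):
    """Overlap of two age distributions: BC = sum_i sqrt(p_i * q_i)."""
    lo = min(ages_a.min(), ages_b.min())
    hi = max(ages_a.max(), ages_b.max())
    bins = np.linspace(lo, hi, n_bins + 1)
    # normalized histograms over a common binning
    p, _ = np.histogram(ages_a, bins=bins)
    q, _ = np.histogram(ages_b, bins=bins)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(np.sqrt(p * q)))
```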

Age prediction using data harmonized with neuroHarmonize before splitting into training and test sets appears falsely improved compared with data harmonized with the harmonizer within the CV. Indeed, the median MAE values obtained when predicting age with data harmonized with neuroHarmonize before splitting were significantly lower than those estimated using data harmonized with the harmonizer within the CV (one-sided Wilcoxon signed-rank p-values < 0.001 for all cases, except for CT features in the CHILDHOOD meta-dataset; see Table 10). These results confirm that data leakage caused by harmonizing data before splitting them into training and test sets leads to performance overestimation even for in vivo data, and they underline the importance of encapsulating the data harmonization procedure among the preprocessing steps of a machine learning pipeline.
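Conceptually, avoiding this leakage amounts to making harmonization a fit/transform step of the pipeline, so that ComBat parameters are estimated on the training folds only. The sketch below illustrates the idea by wrapping the public neuroHarmonize functions in a scikit-learn transformer; it is a simplified rendition, not our released code, and it assumes the covariates and the integer site label are appended as the last columns of the feature matrix so that cross-validation splits them together with the imaging features (and that every site appears in the training folds).

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from neuroHarmonize import harmonizationLearn, harmonizationApply

class LeakFreeHarmonizer(BaseEstimator, TransformerMixin):
    """Sketch of a leakage-free ComBat step for scikit-learn pipelines.

    Assumes the last columns of X hold the biological covariates
    (covar_names) followed by the site label, so that CV splits them
    together with the imaging features.
    """

    def __init__(self, covar_names=("AGE",)):
        self.covar_names = covar_names

    def _split(self, X):
        n_extra = len(self.covar_names) + 1
        feats = X[:, :-n_extra]
        covars = pd.DataFrame(X[:, -n_extra:],
                              columns=[*self.covar_names, "SITE"])
        return feats, covars

    def fit(self, X, y=None):
        feats, covars = self._split(X)
        # ComBat parameters are estimated on the training folds only
        self.model_, _ = harmonizationLearn(feats, covars)
        return self

    def transform(self, X):
        feats, covars = self._split(X)
        # the learned parameters are applied to unseen (test) data
        return harmonizationApply(feats, covars, self.model_)
```

Placed at the head of a scikit-learn Pipeline (e.g., followed by an XGBoost estimator) and evaluated with cross_val_score, such a transformer refits ComBat in every fold; harmonizing the full dataset before the split is exactly what it prevents.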

In previous single-center studies, we observed that the FD computed with the box-counting algorithm and the automated selection of the optimal fractal scaling window implemented in fractalbrain predicted chronological age in two datasets of healthy children and adults better than various other FD approaches and more conventional features, such as CT and the gyrification index59. In this large multicenter study, we confirmed that the FD of the cerebral cortex predicts individual age better than the average CT. In the LIFESPAN meta-dataset, for example, the error in age prediction using CT features (MAE = 7.55 years) was reduced by more than 25% when using FD features (MAE = 5.60 years), in line with previous literature59,68. This result further confirms that FD conveys information additional to that provided by other conventional structural features58,59,67,68,86,87,88,89,90,91,92,93,94,95,96,97,98,99.
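For readers unfamiliar with box counting, the generic algorithm covers the structure with boxes of decreasing size and estimates the FD as the slope of log(box count) versus log(1/box size). The sketch below is a bare-bones illustration on a 3D binary mask, without the automated fractal scaling window selection that distinguishes fractalbrain59; box sizes are arbitrary example values.

```python
import numpy as np

def box_counting_fd(mask, sizes=(2, 4, 8, 16, 32)):
    """Illustrative box-counting fractal dimension of a 3D binary mask."""
    counts = []
    for s in sizes:
        # trim so each dimension is a multiple of the box size
        trimmed = mask[: mask.shape[0] // s * s,
                       : mask.shape[1] // s * s,
                       : mask.shape[2] // s * s]
        # group voxels into s x s x s boxes and count the occupied ones
        boxes = trimmed.reshape(trimmed.shape[0] // s, s,
                                trimmed.shape[1] // s, s,
                                trimmed.shape[2] // s, s)
        counts.append((boxes.sum(axis=(1, 3, 5)) > 0).sum())
    # FD is the slope of log(count) vs log(1/size): N(s) ~ (1/s)^FD
    slope, _ = np.polyfit(np.log(1.0 / np.array(sizes)), np.log(counts), 1)
    return slope
```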

This study has some limitations. Firstly, to show the utility of encapsulating the data harmonization procedure among the preprocessing steps of a machine learning pipeline to avoid data leakage, we used only the ComBat harmonization method. However, other harmonization techniques are available and could be similarly effective, including the recent CovBat model, which adds harmonization of the covariance between sites84. Future research may compare the performance of different harmonization methods to identify the optimal approach for specific research questions and datasets.

Secondly, we showed and measured the data leakage effect using simulated and in vivo data of CT and FD of the cerebral cortex only. Various other morphological and functional MRI-derived features might be considered. However, the main focus of the study was to measure the efficacy of the harmonization and to show the possible detrimental effect of harmonizing the entire dataset before machine learning analysis, and this effect does not depend on the specific features considered.

Lastly, for site/age prediction, we adopted an XGBoost decision-tree model with default hyperparameters. It is well known that classification/regression performance may be affected by hyperparameter values, and proper hyperparameter optimization, e.g., through a nested CV, could be adopted. However, this procedure was not feasible in our study because of the relatively small sample size in many centers, an undesired but common scenario in many publicly available datasets. Thus, though this choice was arbitrary, we believe that using the same hyperparameters for both the neuroHarmonize and harmonizer transformer data was reasonable.
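For completeness, had the sample sizes allowed it, a nested CV for hyperparameter tuning would follow the standard recipe sketched below (grid values and data are synthetic stand-ins; in practice the harmonizer transformer would precede the regressor inside a Pipeline so that harmonization is refitted within every outer and inner fold).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 68))      # stand-in feature matrix
y = rng.uniform(5, 85, size=200)    # stand-in ages (years)

# inner loop tunes hyperparameters; outer loop estimates generalization
param_grid = {"max_depth": [3, 5, 7], "n_estimators": [100, 300]}
inner = KFold(n_splits=3, shuffle=True, random_state=0)
outer = KFold(n_splits=5, shuffle=True, random_state=0)

tuned = GridSearchCV(XGBRegressor(), param_grid, cv=inner,
                     scoring="neg_mean_absolute_error")
scores = cross_val_score(tuned, X, y, cv=outer,
                         scoring="neg_mean_absolute_error")
print(f"nested-CV MAE: {-scores.mean():.2f} years")
```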

In conclusion, we showed that introducing the harmonizer transformer, which encapsulates the harmonization procedure among the preprocessing steps of a machine learning pipeline, avoids data leakage. Using in vivo data, after ComBat harmonization, the site effect was completely removed or reduced while preserving the biological variability. We therefore suggest that future multicenter imaging studies include the data harmonization method in their machine learning pipelines and measure the efficacy of the harmonization process.