Introduction

In recent years there has been an increasing trend toward data sharing in neuroimaging research communities, leading to a rising number of public neuroimaging databases and collaborative multicenter initiatives1,2,3,4. Indeed, pooling MRI data from multiple sites provides an opportunity to assemble more extensive and diverse groups of subjects2,3,5,6, increase statistical power3,7,8,9,10, and study rare disorders and subtle effects11,12. However, a major drawback of combining neuroimaging data across sites is the introduction of confounding effects due to non-biological variability in the data, typically related to image acquisition hardware and protocol. Properties of MRI such as scanner field strength, radiofrequency coil type, gradient coil characteristics, image reconstruction algorithm, and non-standardized acquisition protocol parameters can introduce unwanted technical variability, which is also reflected in MRI-derived features13,14,15.

The harmonization of multicenter data, defined as the application of mathematical and statistical methods to reduce unwanted site variability while maintaining the biological content, is therefore necessary to ensure the success of cooperative analyses. Currently, among the harmonization methods for tabular data available to the neuroimaging scientific community, ComBat is one of the most widely used7,12,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34. The ComBat model was first introduced in gene expression analysis as a batch-effect correction tool to remove unwanted variation associated with the site and preserve biological associations in the data35. In general, ComBat applies to situations where multiple features of the same type are measured for each participant, e.g., expression levels of different genes or imaging-derived metrics from different voxels or anatomical regions. The success of ComBat and its derivatives has been assessed by comparison with other harmonization techniques3,5,6 and through simulations of the site effect from single-center data2. Previous literature has primarily focused on assessing the maintenance of biological variability in harmonized data2,3,5. However, less effort has been put into quantitatively measuring the efficacy of harmonization in removing the unwanted site effect.

Moreover, the pooling of multicenter data and the consequent availability of large sample sizes pave the way for data reuse with machine and deep learning techniques17,19,22,23,25. In the case of multicenter data, harmonization is thus added to conventional data preprocessing steps, such as data cleaning and imputation, feature extraction, and feature reduction. As with other preprocessing procedures, the harmonization parameters should be optimized on training data only and subsequently applied to test data. This approach avoids data leakage, which happens when information from outside the training set is used to create the model, potentially leading to falsely overestimated performance. Crucially, this aspect has sometimes been overlooked in previous applications of ComBat, where the entire data sample was harmonized before being split into the training and test sets used for machine or deep learning2,5,17,19,22,23,25,36,37,38,39,40,41.

To the best of our knowledge, harmonization techniques for neuroimaging data have been applied without attention to avoiding data leakage, and this effect has not been quantified. In addition, although the Python package neuroHarmonize2 and the R code provided by Radua and colleagues3 include functions that estimate the harmonization model on the training data and apply it separately to the test data, they have not been conceived to be executed within a machine learning pipeline, i.e., an end-to-end framework that orchestrates the flow of data into a machine learning model, speeds up the development and testing of machine learning systems, and natively avoids data leakage by design.

For these reasons, in this study, we propose 1) a measurement of the efficacy of data harmonization in reducing the site effect by the performance of a machine learning classifier trained to identify the imaging site, and 2) a ComBat implementation using a harmonizer transformer, i.e., a method that, combined with a classifier/regressor, forms a composite estimator to be used in a machine learning pipeline, thus simplifying data analysis and avoiding data leakage by design (the source code of the efficacy measurement and the harmonizer transformer is publicly available in a GitHub repository at https://github.com/Imaging-AI-for-Health-virtual-lab/harmonizer). First, we showed and measured the effect of data leakage when harmonization is performed before data splitting, using simulated neuroimaging data with a known site effect. Then, we estimated the efficacy of data harmonization in reducing the site effect using the harmonizer transformer on brain T1-weighted MRI data from 1787 healthy subjects aged 5–87 years acquired at 36 imaging sites. The morphological features of cortical thickness (CT) and fractal dimension (FD), a descriptor of the structural complexity of objects with self-similarity properties42, were extracted to characterize brain morphology. To the best of our knowledge, this is the first time that measures of brain structural complexity, such as FD, have been studied on such a large, multicenter, and harmonized data sample. Finally, we investigated age prediction using neuroimaging variables harmonized on the entire dataset before machine learning and using the harmonizer transformer, to estimate the effect of data leakage in in vivo data.

Methods

MRI datasets

We gathered brain MR T1-weighted images of 1787 healthy subjects aged 5–87 years belonging to 36 single-center datasets from various studies. These include the Autism Brain Imaging Data Exchange (ABIDE) (https://fcon_1000.projects.nitrc.org/indi/abide/) first and second initiatives (ABIDE I and ABIDE II, respectively)43,44, the Information eXtraction from Images (IXI) study (https://brain-development.org/ixi-dataset/), the 1000 Functional Connectomes Project (FCP) (https://fcon_1000.projects.nitrc.org/fcpClassic/FcpTable.html), and the Consortium for Reliability and Reproducibility (CoRR) (https://fcon_1000.projects.nitrc.org/indi/CoRR/html/index.html). From each study, we drew several specific datasets of brain MR T1-weighted images acquired in the same place with the same scanner and acquisition protocol (see Table 1). The ABIDE I and ABIDE II initiatives contributed 17 datasets, which we named with the initiative prefix (ABIDEI or ABIDEII) followed by the name of the institution that collected the images (e.g., ABIDEI-CALTECH and ABIDEII-BNI_1). For the institution names, we used the same nomenclature as reported online45, with the following exceptions: (i) we merged LEUVEN_1 and LEUVEN_2 data into ABIDEI-LEUVEN, UCLA_1 and UCLA_2 data into ABIDEI-UCLA, and UM_1 and UM_2 data into ABIDEI-UM, because the acquisition parameters were the same; (ii) we split the data from ABIDEII-KKI_1 into ABIDEII-KKI_8ch and ABIDEII-KKI_32ch, because the acquisitions were performed using an 8-channel or a 32-channel phased-array head coil, respectively. The IXI study provided three different datasets, named with the prefix IXI followed by the name of the institution that collected the images (e.g., IXI-Guys). From the 1000 FCP and CoRR studies, we used the International Consortium for Brain Mapping (ICBM) and the Nathan Kline Institute - Rockland Sample Pediatric Multimodal Imaging Test-Retest Sample (NKI2) datasets, respectively.

Table 1 Scanning parameters for each single-center dataset.

In each single-center dataset, baseline MRI scans of typically developing and aging brains (one per subject) with available age and sex information were included. The absence of a recognized neurological or psychiatric disorder diagnosis was used to define normal development and aging. The leading institutions at each site where the MR images were collected had obtained informed consent from all participants and were authorized by the local Ethics Committees. Table S1 shows the general characteristics of each single-center dataset. In this study, we grouped the single-center datasets into three multicenter meta-datasets based on age and the amount of overlap between age distributions. We considered the following age ranges: childhood (5–13 years), adolescence (11–20 years), and adulthood (18–87 years). We measured the overlap between age distributions with the n-distribution Bhattacharyya coefficient (BC)46, an extension of the 2-distribution BC47. The BC is 0 when there is no overlap and 1 when the overlap is complete. Here, n is the number of single-center datasets grouped into the meta-dataset covering each of the above-mentioned age ranges and may differ between meta-datasets. We thus constructed the CHILDHOOD meta-dataset, containing 11 single-center datasets whose subjects' ages range from 5 to 13 years and whose age distributions have a BC of 0.71. The ADOLESCENCE meta-dataset includes 9 single-center datasets whose subjects' ages range from 11 to 20 years and whose age distributions have a BC of 0.45. Finally, the ADULTHOOD meta-dataset consists of all data belonging to subjects aged between 18 and 87 years (12 single-center datasets), whose age distributions have a BC of 0. A detailed description of the composition of each meta-dataset and their age distributions are shown in Table 2 and Fig. 1, respectively. In addition, we merged all single-center datasets into a meta-dataset, called LIFESPAN, that covers the entire age range (5–87 years). In this meta-dataset, composed of 36 imaging sites, the single-center age distributions have a null overlap (Fig. 1).
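For illustration, a histogram-based estimate of the n-distribution BC can be sketched as follows (a minimal sketch using the geometric-mean-per-bin generalization of the 2-distribution BC; the bin count and range handling are our own illustrative choices and may differ from the estimator of ref. 46):

```python
import numpy as np

def bhattacharyya_n(samples, bins=20, value_range=None):
    """n-distribution Bhattacharyya coefficient (illustrative sketch).

    For each histogram bin, take the geometric mean of the n normalized
    bin probabilities and sum over bins: the result is 0 for fully
    disjoint distributions and 1 for complete overlap.
    """
    n = len(samples)
    samples = [np.asarray(s, dtype=float) for s in samples]
    if value_range is None:
        value_range = (min(s.min() for s in samples),
                       max(s.max() for s in samples))
    probs = np.array([np.histogram(s, bins=bins, range=value_range)[0] / len(s)
                      for s in samples])
    return float(np.sum(np.prod(probs, axis=0) ** (1.0 / n)))

# e.g., the age overlap of the datasets grouped into a meta-dataset:
# bc = bhattacharyya_n([ages[site == s] for s in np.unique(site)])
```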

Table 2 Description of the demographic characteristics of each meta-dataset.
Fig. 1
figure 1

Age distributions. Age distributions of participants for CHILDHOOD, ADOLESCENCE, ADULTHOOD, and LIFESPAN meta-datasets, grouped by single-center dataset and sorted by median age.

MR image processing

For each brain MR T1-weighted image, we performed cortical reconstruction and volumetric segmentation. In this work, we analyzed cerebral structures only, and we extracted neuroimaging features from various regions of the cerebral cortex: the entire cerebral cortex, the left/right hemispheres separately, and the left/right frontal, temporal, parietal, and occipital lobes. In particular, for each region, we computed the average cortical thickness (CT) and the fractal dimension (FD).

Cortical reconstruction and volumetric segmentation

We used the FreeSurfer package to perform completely automated cortical reconstruction and volumetric segmentation of each subject's structural T1-weighted scan. We used version 7.1.1, except in a few cases: (i) for T1-weighted images belonging to the ICBM and NKI2 datasets, we used FreeSurfer version 5.3, and (ii) for the ABIDEI datasets, we used the FreeSurfer version 5.1 outputs previously made available online by Cameron and colleagues48 (http://preprocessed-connectomes-project.org/abide/index.html). Even though different FreeSurfer versions may affect neuroimaging variables49,50,51,52,53, such variability is considered part of the site variability and is handled by the harmonization procedure. Indeed, all subjects in each center were processed with the same version of FreeSurfer. FreeSurfer is extensively documented (see ref. 54 for a review) and publicly accessible (http://surfer.nmr.mgh.harvard.edu/). In addition to the standard FreeSurfer outputs, we performed a parcellation of the cortical lobes using the mri_annotation2label tool with the --lobesStrict option.

All FreeSurfer outputs used in this study were visually inspected for quality assurance by two experienced radiologists (M.M. and C.T., with 35 and 30 years of experience, respectively), following an improved version of the ENIGMA Cortical Quality Control Protocol 2.0 (http://enigma.ini.usc.edu/protocols/imaging-protocols/). First, we created an HTML file for each single-center dataset showing, for each subject, the segmentation of the cortical regions overlaid on the T1-weighted images. Then, we scrolled through the HTML file to visually identify gross segmentation errors in any cortical region. For each single-center dataset, we estimated the statistical outliers for CT features, defined as any data point more than 2.698 standard deviations below or above the mean. For each subject, we carefully inspected the cortical segmentations whose feature values were labeled as statistical outliers to assess whether the outlier was an actual segmentation error. If so, the subject was excluded from further analyses.
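The outlier screening rule can be expressed compactly; the following is a minimal sketch (the function name is ours), where the 2.698 standard deviation threshold corresponds, for Gaussian data, to the classical boxplot whisker limit of 1.5 × IQR beyond the quartiles:

```python
import numpy as np

def flag_ct_outliers(values, n_sd=2.698):
    """Flag data points more than n_sd standard deviations from the mean.

    For normally distributed data, 2.698 SD matches the boxplot whisker
    limit of 1.5 x IQR beyond the first/third quartiles.
    """
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std(ddof=1)
    return np.abs(z) > n_sd
```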

Extraction of cortical thickness and fractal dimension features

For each subject, using FreeSurfer tools, we computed the average CT of each cortical region as the average distance measured from each vertex of the gray/white boundary surface to the pial surface55.

The FD is a numerical representation of shape complexity56. The FD is normally a fractional value and is considered a dimension because it gives a measure of space-filling capacity57. An FD value between 2 and 3 is typical of a complex and heavily folded 2-D surface buried in a 3-D region, such as the human cerebral cortex. The FD is a very compact measure of shape complexity, combining cortical thickness, sulcal depth, and folding area into a single numeric value58,59. In this study, the fractal analysis was carried out using the fractalbrain toolkit version 1.1 (freely available at https://github.com/chiaramarzi/fractalbrain-toolkit), described in detail in Marzi et al.59. The fractalbrain toolkit processes FreeSurfer outputs directly, computing the FD of various regions of the cerebral cortex: the entire cerebral cortex, the left/right hemispheres separately, and the left/right frontal, temporal, parietal, and occipital lobes. Fractalbrain performs the 3D box-counting algorithm60, adopting an automated selection of the fractal scaling window59 – a crucial step for establishing the FD of non-ideal fractals59,61.

Briefly, we overlaid a grid composed of 3D cubes of different sizes s (where s = 2^k voxels, and k = 0, 1, …, 8) onto the segmentation and recorded the number of cubes N(s) needed to fully enclose the structure for each size. This process was repeated with 20 uniformly distributed random offsets to prevent any systematic influence of the grid placement, and the resulting box counts were averaged to obtain a single N(s) value62,63. For a fractal object, the data points of the number of cubes N(s) vs. size s in the log-log plane can be modeled through a linear regression within a range of spatial scales called the fractal scaling window. Fractalbrain automatically selects the optimal fractal scaling window by searching for the interval of spatial scales that provides the best linear fit, as measured by the rounded coefficient of determination adjusted for the number of data points (R2adj). If multiple intervals have the same rounded R2adj, the widest interval (i.e., the one that contains the most data points in the log-log plot) is selected59. The FD of the brain structure is then estimated as the slope (in absolute value) of the linear regression model within the automatically selected fractal scaling window. As an example, in Fig. 2, we report a log-log plot of the 3D box-counting algorithm optimized for the automatic selection of the best fractal scaling window of the cerebral cortex of one subject.
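The core of the box-counting procedure can be sketched as follows (an illustrative sketch only: fractalbrain additionally performs the automated scaling-window selection described above, which is omitted here in favor of a fit across all scales):

```python
import numpy as np

def box_count(mask, n_offsets=20, seed=0):
    """Average 3D box count N(s) for sizes s = 2**k, k = 0..8.

    `mask` is a 3D boolean array of the segmented structure; for each
    size, the grid is shifted by random offsets and the counts averaged.
    """
    rng = np.random.default_rng(seed)
    coords = np.argwhere(mask)                    # voxel coordinates
    sizes = 2 ** np.arange(9)                     # 1, 2, 4, ..., 256
    counts = []
    for s in sizes:
        per_offset = []
        for _ in range(n_offsets):
            offset = rng.integers(0, s, size=3)
            boxes = (coords + offset) // s        # box index of each voxel
            per_offset.append(len(np.unique(boxes, axis=0)))
        counts.append(np.mean(per_offset))
    return sizes, np.array(counts)

def fractal_dimension(sizes, counts):
    """FD as the absolute slope of log N(s) vs. log s (no window search)."""
    slope, _ = np.polyfit(np.log(sizes), np.log(counts), 1)
    return abs(slope)
```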

Fig. 2
figure 2

3D box-counting for computation of the FD. An example of the 3D box-counting algorithm that uses an automated selection of the fractal scaling window through the fractalbrain toolkit59. N(s) is the average number of 3D cubes of side s needed to fully enclose the brain structure computed using 20 uniformly distributed random offsets to the grid origin. The regression line within the optimal fractal scaling window, whose slope (sign changed) is the FD, is depicted in red.

Harmonization of brain cortical features

We harmonized cortical features using ComBat, a model that builds on the statistical harmonization technique proposed by Johnson and colleagues35 for location and scale (L/S) adjustments to the data while preserving between-subject biological variability. Briefly, let yijf be the value of the neuroimaging feature f for participant j in the single-center dataset i, for a total of k single-center datasets, n participants, and V features. Further, let X be the n × p matrix of biological covariates of interest, and Z be the n × k matrix of single-center labels. The ComBat harmonization model can be written as follows:

$${y}_{ijf}={f}_{f}\left({X}_{ij}\right)+{Z}_{ij}{\vartheta }_{f}+{\delta }_{if}{\varepsilon }_{ijf}$$
(1)

where ff (Xij) denotes the variation of yijf captured by the biologically relevant covariates Xij, \({\vartheta }_{f}\) is the one-dimensional array of the k coefficients associated with the single-center labels Zij for the feature f. We assume that the residual terms εijf have mean 0. The parameters δif describe the multiplicative site effect of the i-th site on the feature f, i.e., the scale (S) adjustment, while the location (L) parameter for the i-th site on the feature f, is represented by γif (the empirical Bayes estimates of the term \({Z}_{ij}{\vartheta }_{f}\)). Consistent with the ComBat model notation used in Fortin et al. (2017), the harmonized \({y}_{ijf}^{* }\) become:

$${y}_{ijf}^{* }=\frac{{y}_{ijf}-{f}_{f}\left({X}_{ij}\right)-{\gamma }_{if}}{{\delta }_{if}}+{f}_{f}\left({X}_{ij}\right)$$
(2)

In this study, we used the ComBat model implemented in the neuroHarmonize v. 2.1.0 package (freely available at https://github.com/rpomponio/neuroHarmonize) – an open-source and easy-to-use Python module2. In particular, neuroHarmonize extends the neuroCombat package5,6 with the possibility of specifying covariates with generic nonlinear effects on the neuroimaging features to harmonize. Specifically, the ff (Xij) term in Eq. (1) is a Generalized Additive Model (GAM) function of the specified covariates2. Indeed, MRI-derived features are known to be influenced by demographic factors, such as age2,3,5,59,64,65,66,67,68,69,70 and sex71. In our study, these variables were included in the harmonization process as sources of inter-subject biological variability. Finally, since it is not evident that the site effect affects all MRI-derived measures in the same way3, we performed a separate harmonization for each group of features of the same type (i.e., CT and FD).
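For concreteness, the neuroHarmonize calls corresponding to this setup look like the following sketch (variable names such as site_labels are illustrative placeholders; covariates other than SITE must be numeric, so sex is encoded as 0/1):

```python
import pandas as pd
from neuroHarmonize import harmonizationLearn, harmonizationApply

# `features` is an (n_subjects x n_features) array holding one feature
# type at a time (all CT regions, or all FD regions).
covars = pd.DataFrame({"SITE": site_labels, "AGE": age, "SEX": sex_binary})

# Learn the location/scale site parameters of Eq. (1); AGE enters the
# model as a nonlinear (GAM) smooth term.
model, features_adj = harmonizationLearn(features, covars,
                                         smooth_terms=["AGE"])

# Apply the learned model to new data from the same sites.
new_features_adj = harmonizationApply(new_features, new_covars, model)
```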

The harmonizer transformer

The increased sample size due to the pooling of data acquired at various centers naturally facilitates the application of machine learning techniques. For training and testing machine learning models, a proper validation scheme that handles data splitting must be chosen (Fig. 3). This choice is crucial to avoid data leakage by ensuring that the entire workflow (preprocessing and model-building steps) is constructed on training data and evaluated on test data never seen during the learning phase. Indeed, data leakage in the training process may produce falsely high performance in the test set (see, e.g., ref. 72 and ref. 73). Especially in medicine and healthcare, where relatively small datasets are usually available, the straightforward hold-out validation scheme is rarely applied. Instead, cross-validation (CV) and its nested version (nested CV) for hyperparameter optimization of the entire workflow74,75,76 are frequently preferred. Repeated CVs or repeated nested CVs are also suggested for improving the reproducibility of the entire machine learning system75. In all these validation schemes, several training and test procedures are carried out on different data splits, underscoring the need for a compact code structure to avoid errors that may lead to data leakage. In this view, machine learning pipelines are a solution because they orchestrate all the processing steps in a short, easier-to-read, and easier-to-maintain code structure (Fig. 3). A pipeline represents the entire data workflow, combining all transformation steps (e.g., data cleaning, data imputation, data scaling, and general data preprocessing) and machine learning model training. It is essential for automating an end-to-end training/test process without any form of data leakage and for improving reproducibility, ease of deployment, and code reuse, especially when complex validation schemes are needed.

Fig. 3
figure 3

Machine learning pipeline. A pipeline represents the entire data workflow, combining all transformation steps and machine learning model training. It is essential to automate an end-to-end training/test process without any form of data leakage and improve reproducibility, ease of deployment, and code reuse, especially when complex validation schemes are needed.

In the Scikit-learn library, a popular, open-source, well-documented, and easy-to-learn machine learning package that implements a vast number of machine learning algorithms, a pipeline is a chain of "transformers" and a final "estimator" acting as a single object. Transformers are modules that apply preprocessing to the data, whereas estimators are modules that fit a model on training data and can infer properties of new data (https://scikit-learn.org/stable/developers/develop.html). In particular, transformers are classes with a "fit" method, which learns model parameters (e.g., mean and standard deviation for data standardization) from a training set, and a "transform" method, which applies this transformation model to any data. For example, for data standardization (transforming data to have zero mean and unit standard deviation), the mean μ must be subtracted from the data, and the result must be divided by the standard deviation σ. Crucially, μ and σ must be computed on the training set only. In the test set, or any validation set, the same transformation must be applied using the same two parameters μ and σ computed for centering the training set. Basically, the "fit" method calculates the parameters (e.g., μ and σ in our case) and saves them internally, whereas the "transform" method applies the transformation (using the saved parameters) to any particular set of data.
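As a didactic example of this fit/transform contract, a standardization transformer could be written as follows (Scikit-learn already provides this as StandardScaler; the re-implementation only makes the mechanics explicit):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class Standardizer(BaseEstimator, TransformerMixin):
    """Didactic equivalent of sklearn.preprocessing.StandardScaler."""

    def fit(self, X, y=None):
        # Learn mu and sigma from the training set only.
        self.mu_ = np.mean(X, axis=0)
        self.sigma_ = np.std(X, axis=0)
        return self

    def transform(self, X):
        # Reuse the parameters saved during fit on any data set.
        return (np.asarray(X) - self.mu_) / self.sigma_
```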

For these reasons, in this study, we propose the harmonizer – a Scikit-learn Python transformer that encapsulates the neuroHarmonize procedure among the preprocessing steps of a machine learning pipeline. The "fit" method of the harmonizer transformer learns the neuroHarmonize model parameters from a training set and saves them internally, whereas the "transform" method applies the neuroHarmonize model, previously learned on the training set, e.g., to unseen data. The source code of the harmonizer transformer is publicly available in a GitHub repository at https://github.com/Imaging-AI-for-Health-virtual-lab/harmonizer.
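In simplified form, the harmonizer can be sketched as follows (a minimal sketch, not the released implementation, which is available in the repository above; it assumes the first columns of X carry the integer-encoded site label and the covariates, so that they travel with the features through cross-validation splits):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from neuroHarmonize import harmonizationLearn, harmonizationApply

class Harmonizer(BaseEstimator, TransformerMixin):
    """Simplified sketch of the harmonizer transformer."""

    def __init__(self, covar_names=("SITE", "AGE", "SEX"),
                 smooth_terms=("AGE",)):
        self.covar_names = covar_names
        self.smooth_terms = smooth_terms

    def _split(self, X):
        # First len(covar_names) columns: covariates; the rest: features.
        n = len(self.covar_names)
        covars = pd.DataFrame(X[:, :n], columns=list(self.covar_names))
        return covars, X[:, n:]

    def fit(self, X, y=None):
        covars, feats = self._split(X)
        # Learn site location/scale parameters on the training rows only.
        self.model_, _ = harmonizationLearn(
            feats, covars, smooth_terms=list(self.smooth_terms))
        return self

    def transform(self, X):
        covars, feats = self._split(X)
        # Apply the saved model to any (training, validation, or test) rows.
        return harmonizationApply(feats, covars, self.model_)
```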

In the following, we included the harmonizer transformer in a pipeline to learn the harmonization procedure parameters on the training data only and apply the harmonization procedure (with parameters obtained in the training set) to the test data. This prevented data leakage by design in the harmonization procedure independently of the chosen validation scheme.
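Assuming the sketch above, including the harmonizer in a cross-validated pipeline takes a few lines; the harmonization parameters are then re-estimated on the training folds at every CV split, so the test folds never leak into them:

```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

pipe = Pipeline([("harmonizer", Harmonizer()),
                 ("clf", XGBClassifier())])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, site_labels, cv=cv,
                         scoring="balanced_accuracy")
```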

Statistical and machine learning analyses

We performed the statistical and machine learning analyses described in the following paragraphs for each feature group of the same type (i.e., CT and FD) and each meta-dataset (i.e., CHILDHOOD, ADOLESCENCE, ADULTHOOD, and LIFESPAN).

Visualization and quantification of site effect

We first performed a series of analyses of increasing complexity to explore the actual existence of a site effect in the data. For each region-feature pair, we qualitatively showed the site effect on raw data through boxplots, using the site as the independent variable and each region-feature pair as the dependent variable. Quantitatively, the site effect was measured by analysis of covariance (ANCOVA) – a general linear model that blends analysis of variance (ANOVA) and linear regression. ANCOVA evaluates whether the means of a dependent variable are equal across levels of a categorical independent variable while statistically controlling for the effects of other variables that are not of primary interest, known as covariates or nuisance variables. In this study, we set the single-center dataset as the independent variable; age, age×age, and sex as covariates; and each region-feature pair as the dependent variable.

Additionally, to further investigate the site effect on raw data and to measure the success of ComBat harmonization, we predicted the imaging site from the neuroimaging features, grouped by feature type, namely CT and FD. Specifically, we used the supervised eXtreme Gradient Boosting (XGBoost) method (version 0.90 with default hyperparameters for a classification task), a scalable end-to-end tree-boosting system widely used to achieve state-of-the-art performance on many recent machine learning challenges77. Using N = 100 repetitions of a stratified 5-fold CV, we estimated the median balanced accuracy. The statistical significance of the prediction performance was determined via permutation analysis. Thus, for each feature group, 5000 new models were created using random permutations of the target labels (i.e., the imaging site), such that the explanatory neuroimaging variables were dissociated from their corresponding imaging site, to simulate the null distribution of the performance measure against which the observed value was tested78. Since the single-center datasets in this study covered different age groups, the random permutation of the target labels was performed within groups of subjects of similar age79, categorized into five-year intervals. The 5-year value was selected to be small enough to discern age differences while being large enough to avoid an excessive reduction in the number of potential permutations within each age group.
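A minimal sketch of the age-grouped label permutation (the function name is ours) used to build each of the 5000 null models:

```python
import numpy as np

def permute_sites_within_age_groups(site_labels, age, rng, bin_width=5):
    """Shuffle site labels only among subjects in the same 5-year age bin,
    so the null distribution preserves the age-site structure."""
    site_labels = np.asarray(site_labels).copy()
    bins = (np.asarray(age) // bin_width).astype(int)
    for b in np.unique(bins):
        idx = np.flatnonzero(bins == b)
        site_labels[idx] = site_labels[rng.permutation(idx)]
    return site_labels
```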

Median balanced accuracy was considered significantly different from the chance level when the p-value computed using permutation tests was < 0.05. Additionally, we calculated the average confusion matrix over repetitions to graphically evaluate the goodness of prediction. The same imaging site prediction was performed on raw data (i.e., without harmonization) to confirm the existence of the site effect and on harmonized data (with neuroHarmonize and Harmonizer transformer) to investigate if the site effect was reduced or removed.

We propose to measure the efficacy of harmonization in reducing or removing the site effect through a two-step assessment. First, we evaluated whether the site prediction after the harmonization process was not significantly different from a random prediction by comparing the median balanced accuracy over repetitions with the distribution of balanced accuracies estimated using the permutation test with 5000 permutations (the default value in the randomise tool of FSL – the FMRIB Software Library – for non-parametric permutation inference on neuroimaging data80). Considering, for example, a significance threshold of 0.05 in the permutation test, in the case of complete removal of the site effect, the site prediction will not differ from that of a random model (i.e., p-value ≥ 0.05). Second, in the case of a permutation test p-value < 0.05, we compared the balanced accuracy obtained by predicting the site without and with the harmonization procedure. In particular, we assessed the site effect reduction by verifying that the median balanced accuracy obtained predicting the imaging site with harmonized data was significantly lower than that estimated with raw data through the non-parametric one-sided Wilcoxon signed-rank test, with a significance threshold of 0.0581. The source code for evaluating the effectiveness of harmonization using the harmonizer transformer is publicly available in a GitHub repository at https://github.com/Imaging-AI-for-Health-virtual-lab/harmonizer.
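The two-step decision rule can be summarized in code as follows (an illustrative sketch; inputs are the paired balanced accuracies over CV repetitions and the permutation-test null distribution):

```python
import numpy as np
from scipy.stats import wilcoxon

def harmonization_efficacy(bacc_raw, bacc_harm, null_bacc, alpha=0.05):
    """Two-step assessment sketch. bacc_raw/bacc_harm: paired balanced
    accuracies over CV repetitions (same splits); null_bacc: the 5000
    permutation-test accuracies obtained on harmonized data."""
    median_harm = np.median(bacc_harm)
    # Step 1: is site prediction on harmonized data better than chance?
    p_perm = (np.sum(np.asarray(null_bacc) >= median_harm) + 1) \
             / (len(null_bacc) + 1)
    if p_perm >= alpha:
        return "site effect removed", p_perm, None
    # Step 2: is accuracy at least significantly lower than on raw data?
    p_wil = wilcoxon(bacc_harm, bacc_raw, alternative="less").pvalue
    verdict = "site effect reduced" if p_wil < alpha else "site effect persists"
    return verdict, p_perm, p_wil
```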

To estimate the effect of data leakage on the prediction of the imaging site caused by performing the harmonization on all data before splitting into training and test sets, we tested whether the balanced accuracies obtained using neuroHarmonize on all data before any split were consistently lower than those estimated using the harmonizer transformer in the above-mentioned stratified CV scheme. Since the same data splits were applied for both CT and FD, the comparison was carried out through a paired test, i.e., the non-parametric one-sided Wilcoxon signed-rank test with a significance threshold of 0.0581.

Associations with age

While it is essential to show that a harmonization method successfully reduces a possible site effect, it is equally crucial to verify that it preserves the biological variability in the data. Indeed, a harmonization method that removes both site and biological effects has no utility. One of the most influential sources of biological variability in the neuroimaging features of healthy subjects is undoubtedly chronological age. Throughout the lifespan, the brain structure changes because of a complex interplay between multiple maturational and neurodegenerative processes. Such processes can yield large spatial and temporal variations in the brain65,82,83.

For these reasons, we attempted to predict individual age from neuroimaging features through an XGBoost model (version 0.90 with default hyperparameters for a regression task)77. We estimated the median (over repetitions) mean absolute error (MAE) using N = 100 repetitions of a 5-fold CV. Age prediction was performed on harmonized data using both neuroHarmonize and the harmonizer transformer in the CV pipeline. To estimate the effect of data leakage in the age prediction caused by performing the harmonization on all data before splitting into training and test sets, we compared the MAE values obtained using neuroHarmonize on all data before any split and the harmonizer transformer in the above-mentioned CV scheme. In particular, since the same data set splits were applied for both CT and FD, we assessed whether the median MAE using neuroHarmonize on all data before any split was consistently lower than that estimated using the harmonizer transformer through a paired test, i.e., the non-parametric one-sided Wilcoxon signed-rank test with a significance threshold of 0.0581.

Moreover, before and after the harmonization procedure, for each region-feature pair, we qualitatively visualized the site effect on the relationship between age and each region-feature pair through scatterplots (with age as the independent variable and each region-feature pair as the dependent variable).

Simulation experiments

The harmonizer transformer prevents data leakage by design in the harmonization procedure in any machine learning pipeline, independently of the chosen validation scheme. In contrast, when harmonization is applied before data splitting, data leakage is present, and its severity depends on the specific context and the extent of the leakage. In neuroimaging, the extent and impact of the data leakage effect are still underexplored. Therefore, we performed simulation experiments (with known site effects) and computational tests to assess the data leakage effect when the harmonization process is performed before the training-test data splitting.

CT and FD data simulation settings

Let yijf be the value of the simulated feature f for the single-center dataset i and participant j, for a total of k single-center datasets, ni participants for each center, and V features. In this study, we simulated CT and FD data for k = 3, 10, 36 single centers. Each single-center dataset provided the same number of participants (i.e., ni = n), with n assuming the values 25, 50, 100, 250. In total, we ran 24 experiments, i.e., we simulated 24 different multicenter datasets (12 for the CT features and 12 for the FD measures).

Each yijf was generated based on the model proposed by Johnson and colleagues35 and recently used for neuroimaging features’ simulation by Chen and collaborators84:

$${y}_{ijf}={{\rm{\alpha }}}_{f}+{{\rm{\beta }}}_{f1}{x}_{ij}+{{\rm{\beta }}}_{f2}{x}_{ij}^{2}+{{\rm{\gamma }}}_{if}+{{\rm{\delta }}}_{if}{{\rm{\varepsilon }}}_{ijf}$$
(3)

where αf is the average value of the feature f in the single-center ICBM dataset, βf1 = −0.0009 and βf2 = −0.00005 are the linear and quadratic effects of age on the feature f, respectively, and xij is a simulated age variable drawn from a uniform distribution X ~ uniform([20,90]). Considering the nature of our investigation, which examines how CT and FD relate to age, it is reasonable to assume that the relationship is no more than quadratic59,85. The mean site effect γif was drawn from a normal distribution with zero mean and standard deviation equal to 0.1, while the variance site effect δif was drawn from a center-specific inverse gamma distribution with chosen parameters. For our simulations, we distinguished the site-specific location factors by assuming independent and identically distributed (i.i.d.) normal distributions, and the scaling factors by using the following parameters: we set the inverse gamma shape, for each center, as {46, 51, 56} when k = 3, as {40, 42, …, 58} when k = 10, and as {10, 12, …, 40, 41, …, 50, 52, …, 70} when k = 36. In all cases, the inverse gamma scale was set to 50.
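A generator for Eq. (3) can be sketched as follows (a minimal sketch: eps_sd is an assumed residual standard deviation, since the model only requires the residuals to have zero mean, and alpha must be set to the ICBM average of the feature being simulated):

```python
import numpy as np

def simulate_feature(alpha, shapes, n_per_site, rng,
                     beta1=-0.0009, beta2=-0.00005,
                     gamma_sd=0.1, ig_scale=50.0, eps_sd=0.01):
    """Simulate one feature for k = len(shapes) sites under Eq. (3)."""
    k = len(shapes)
    age = rng.uniform(20, 90, size=(k, n_per_site))
    gamma = rng.normal(0.0, gamma_sd, size=(k, 1))      # additive site effect
    # Inverse gamma draw via 1 / Gamma(shape, scale = 1 / ig_scale).
    delta = 1.0 / rng.gamma(np.asarray(shapes), 1.0 / ig_scale)
    eps = rng.normal(0.0, eps_sd, size=(k, n_per_site))
    y = alpha + beta1 * age + beta2 * age**2 + gamma + delta[:, None] * eps
    sites = np.repeat(np.arange(k), n_per_site)
    return age.ravel(), sites, y.ravel()

# e.g., k = 3 sites with 100 participants each (alpha chosen arbitrarily):
# age, site, y = simulate_feature(2.5, [46, 51, 56], 100,
#                                 np.random.default_rng(0))
```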

Measuring the effect of data leakage

We measured the effect of data leakage for the site and age prediction tasks independently. Hereinafter, we will refer generically to performance, indicating the balanced accuracy for the site prediction task and the MAE for the age prediction task. To measure the effect of data leakage, after an external hold-out (Fig. 4), we first computed the performance of an imaging site/age prediction estimator trained using (a) the harmonizer transformer within the machine learning pipeline (internal not leaked test set) and (b) harmonizing all data with neuroHarmonize before the actual prediction (internal leaked test set). Second, we compared these performances with that observed on an external test set never used for harmonization and training (Fig. 4). In the absence of data leakage, the performance in the internal and external test sets should be similar and not significantly different. When data leakage is present, the performance in the internal test set is overly optimistic (i.e., significantly better than that on the external test set). In detail, for each experiment, we performed the following steps.

Fig. 4
figure 4

Overview of the analysis of simulated data for each experiment. After an external hold-out, we computed the performance of an imaging site/age prediction estimator trained using (a) the harmonizer transformer within the machine learning pipeline (internal not leaked test set) and (b) harmonizing all data with neuroHarmonize before imaging site/age prediction (internal leaked test set). Then, we compared these performances with that observed on an external test set never used for harmonization and training.

External hold-out

We randomly split the data into two parts, i.e., a data set containing 50% of the samples and an external test set with the other 50% of the instances.

Imaging site/age prediction estimator training and test on the external test set

We fitted a harmonization model with neuroHarmonize using age as a covariate with a nonlinear relationship with the individual MRI-derived features. To fit the harmonization model, we used the same number of instances adopted for the other two approaches (see the next analyses), i.e., 80% of the samples of the data set, randomly chosen. Then, we applied the harmonization model to the data set and the external test set. Finally, we trained an XGBoost model (with version 0.90 default hyperparameters) to predict the imaging site/age and tested it on the harmonized external test set.

Imaging site/age prediction estimator training and test using harmonizer transformer within the machine learning pipeline (not leaked internal test set)

We trained and tested a pipeline containing the harmonizer transformer and an XGBoost estimator (with version 0.90 default hyperparameters) on the data set to predict the imaging site/age through a stratified 10-times repeated 5-fold CV. Thus, we trained the pipeline in the training sets of each iteration of the CV and considered the performance within the test sets of the CV.

Imaging site/age prediction estimator training and test harmonizing all data with neuroHarmonize before imaging site/age prediction (leaked internal test set)

We trained and tested a pipeline containing an XGBoost estimator (with version 0.90 default hyperparameters) on the harmonized dataset to predict the imaging site/age through a stratified 10-times repeated 5-fold CV. Thus, we trained the pipeline in the training sets of each iteration of the CV and considered the performance metric within the test sets of the CV.

For each task, i.e., imaging site and age prediction, we repeated each experiment (i.e., all the steps above) 100 times with random data splits and computed the average performance across the 100 repetitions. Finally, we compared the average performance across the 100 repetitions of each internal test set (leaked and not leaked) with that of the external test set. When data leakage is present, the performance in the internal test set is better than that in the external test set: a lower balanced accuracy for the imaging site prediction (because harmonization appears more effective at hiding the site) and a lower MAE for the age prediction. To assess whether the average performance of each internal test set was lower than that of the external test set, we conducted a one-tailed t-test, applying Bonferroni correction for multiple comparisons. This statistical analysis allowed us to evaluate the significance of any differences observed between the average performance of the internal and external test sets.

In addition, we calculated, for each internal test set, the Cohen’s d effect size to estimate the magnitude of the differences between performance distributions’ means. Specifically, we used the following Cohen’s d formula: \({\rm{d}}=\frac{\overline{{x}_{e}}-\overline{{x}_{i}}}{s}\) where \(\overline{{x}_{e}}\) is the average performance in the external test set, \(\overline{{x}_{i}}\) is the average performance in the internal test set, and s is the standard deviation of the difference between performance obtained in the external test set and that achieved in the internal test set.
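Equivalently, in code (a one-line computation on the paired performance arrays):

```python
import numpy as np

def cohens_d_paired(perf_external, perf_internal):
    """Cohen's d as defined above: mean of the paired differences between
    external and internal test-set performance, divided by the standard
    deviation of those differences."""
    diff = np.asarray(perf_external) - np.asarray(perf_internal)
    return diff.mean() / diff.std(ddof=1)
```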

Results

Measuring the effect of data leakage in simulated data

Regarding the imaging site prediction, the results were similar for both CT (Table 3) and FD (Table 4) simulated features. The performances obtained on the leaked internal test set were overly optimistic, i.e., significantly better than those obtained on the external test set, indicating the presence of data leakage. In contrast, the average balanced accuracies recorded on the not leaked internal test set were not statistically different from those of the external test set (except in one case – see details in Table 4).

Moreover, as the number of samples available in each single-center dataset decreases, the effect of data leakage increases (Tables 3, 4 for CT and FD, respectively). This phenomenon is even more evident in Fig. 5, where we report the difference between the average balanced accuracy obtained in the external test set and that gained in the internal test sets vs. the number of participants in each single-center site for CT and FD, respectively. When data leakage is present (dashed lines in Fig. 5), the difference between the average balanced accuracy in the external test set and that in the internal leaked test set always differs significantly from zero (Bonferroni adjusted p-values < 10−9 and < 10−10 for CT and FD, respectively) and increases as the number of participants in each single-center dataset decreases. This result has a profound impact because most in vivo neuroimaging studies have single-center datasets with between 25 and 100 subjects. Conversely, when data leakage is not present (solid lines in Fig. 5), the difference between the average balanced accuracy in the external test set and that in the internal not leaked test set was approximately zero and remained constant as the number of participants in each single-center dataset changed.

Table 3 Imaging site prediction results with CT simulated data.
Table 4 Imaging site prediction results with FD simulated data.
Fig. 5
figure 5

Imaging site prediction results with CT and FD simulated data. We report the difference between the average balanced accuracy obtained in the external test set and that gained in the internal test sets (dotted line for the leaked internal test set and solid line for the not leaked internal test set) and Cohen's d effect size vs. the number of participants per single-center dataset n. The cross marker indicates a significant difference between balanced accuracy distributions (one-tailed paired t-test Bonferroni adjusted p-values < 10−9 and < 10−10 for CT and FD, respectively). The colors and line types in Cohen's d plots are consistent with those employed in the other plots.

Data leakage was also observed in the age prediction task for both CT and FD features. Similarly to the site prediction task, the performance on the leaked internal test set appears overly optimistic (Tables 5, 6 for CT and FD, respectively), and the impact of data leakage becomes more pronounced as the number of samples in each single-center dataset decreases (Fig. 6).

Table 5 Age prediction results with CT simulated data.
Table 6 Age prediction results with FD simulated data.
Fig. 6
figure 6

Age prediction results with CT and FD simulated data. We report the difference between the average MAE obtained in the external test set and that gained in the internal test sets (dotted line for the leaked internal test set and solid line for the not leaked internal test set) and Cohen's d effect size vs. the number of participants per single-center dataset n. The cross marker indicates a significant difference between MAE distributions (see Tables 5, 6 for details). The colors and line types in Cohen's d plots are consistent with those employed in the other plots.

Visualization and quantification of the site effect in in vivo data

Quality control of FreeSurfer's outputs resulted in the removal of 47 subjects based on the overall low quality of the cortical reconstruction or segmentation errors in any region. All brain regions of the remaining 1740 subjects had both CT and FD features. Thus, we were able to analyze the site effect, the harmonization adjustments, and age prediction on the same subjects for the CT and FD groups of features. The demographic characteristics of the subjects included in the study after quality control are reported in Table 7.

Table 7 Demographic characteristics of the subjects remaining after quality control and who entered into the analyses.

The boxplots in Figs. 7, 8 summarize the distribution of the average CT and FD of the cerebral cortex at each imaging site. Notably, the site effect differs between the two features. For example, in the CHILDHOOD meta-dataset, the ABIDEII-KKI_32ch, ABIDEII-KKI_8ch, and ABIDEII-NYU_1 single-center datasets show the lowest average CT values, while subjects from the ABIDEI-STANFORD dataset have the lowest FD values. Also in the ADOLESCENCE meta-dataset, the site effect behaves differently for CT and FD features: for example, ABIDEI-TCD_1 shows the lowest values of CT, while ABIDEI-LEUVEN shows the lowest values of FD. In the ADULTHOOD meta-dataset, ABIDEI-SBL has the lowest mean CT values, whereas ABIDEII-BNI_1 has the lowest FD values.

Fig. 7
figure 7

Boxplot of the average CT of the cerebral cortex. The boxplots of the average CT of the cerebral cortex without harmonization are shown for the CHILDHOOD, ADOLESCENCE, ADULTHOOD, and LIFESPAN meta-datasets.

Fig. 8
figure 8

Boxplot of the average FD of the cerebral cortex. The boxplots of the FD of the cerebral cortex without harmonization are shown for the CHILDHOOD, ADOLESCENCE, ADULTHOOD, and LIFESPAN meta-datasets.

The same result was measured quantitatively using the ANCOVA analysis. Indeed, all CT and FD features were significantly different across the single-center datasets (Table 8), but the site effect, measured by the partial η2, differed between the two feature sets. In the CHILDHOOD meta-dataset, for example, each cortical region showed a higher partial η2 for FD than for CT, suggesting that, in childhood, acquisition characteristics have a greater impact on the structural complexity measure (FD) than on cortical thickness. On the other hand, in the ADOLESCENCE meta-dataset, the frontal and temporal lobes (bilaterally), along with the entire structure, show lower partial η2 for FD than for CT, whereas the parietal and occipital lobes (bilaterally) have higher partial η2 for FD than for CT. Finally, in the ADULTHOOD meta-dataset, only the occipital and temporal lobes (bilaterally) have lower partial η2 for FD than for CT.

Table 8 ANCOVA results on raw data.

Harmonization efficacy

To assess whether most of the variation in the data was still associated with the site after harmonization, we predicted the imaging site using neuroimaging features grouped by feature type (i.e., CT and FD). Figures 9, 10 report the average confusion matrices (over 100 repetitions) for CT and FD features, respectively. When predicting the site using the raw data, the main diagonal of the confusion matrix is prominent (i.e., the predicted site is usually the actual site) for both feature groups and each meta-dataset (Figs. 9, 10). On the other hand, when the prediction of the site is performed using harmonized data (through neuroHarmonize or the harmonizer transformer), the main diagonal of the confusion matrix is much weaker. The confusion matrices show a vertical pattern, indicating that the predicted site is often the same site regardless of the actual site (Figs. 9, 10). Moreover, the confusion matrix obtained using the harmonizer within the machine learning pipeline appears similar to that obtained by harmonizing all the data with neuroHarmonize before imaging site prediction. This result suggests that the action of the harmonizer resembles that of neuroHarmonize, although the model is built on training data only and then applied to test data. The confusion matrices for CT and FD features in the LIFESPAN meta-dataset are shown in Fig. 11.

Fig. 9
figure 9

Confusion matrices of site prediction using CT features. Each confusion matrix was normalized for the number of subjects belonging to each site. In this way, the sum of the matrix cells of each row gives 1. The confusion matrix obtained using the harmonizer within the machine learning pipeline seems similar to that obtained by harmonizing all the data with neuroHarmonize before imaging site prediction, even though the model is built on training data only and then applied to test data.

Fig. 10
figure 10

Confusion matrices of site prediction using FD features. Each confusion matrix was normalized for the number of subjects belonging to each site. In this way, the sum of the matrix cells of each row gives 1. The confusion matrix obtained using the harmonizer within the machine learning pipeline seems similar to that obtained by harmonizing all the data with neuroHarmonize before imaging site prediction, even though the model is built on training data only and then applied to test data.

Fig. 11
figure 11

Confusion matrices of site prediction using CT and FD features in the LIFESPAN meta-dataset. Each confusion matrix was normalized for the number of subjects belonging to each site. In this way, the sum of the matrix cells of each row gives 1. The confusion matrix obtained using the harmonizer within the machine learning pipeline seems similar to that obtained by harmonizing all the data with neuroHarmonize before imaging site prediction, even though the model is built on training data only and then applied to test data.

Table 9 reports the median balanced accuracies (over 100 repetitions) of imaging site prediction, and the efficacy of the harmonization is shown in Table 10. Specifically, we report the pair (age-group permutation test p-value, one-sided Wilcoxon signed-rank test p-value) to statistically assess the removal or the reduction of the site effect, respectively. As expected, the median balanced accuracy of site prediction using the raw data was significantly different from the chance level (age-group permutation test p-value < 0.05 for all data), and thus an actual imaging site effect was present in the raw data. After harmonization, with neuroHarmonize or the harmonizer transformer, the site effect was either removed (age-group permutation test p-value ≥ 0.05 in Table 10) or only reduced (age-group permutation test p-value < 0.05, but with the median balanced accuracy reduced on harmonized data, as statistically measured by the one-sided Wilcoxon signed-rank test p-value < 0.05 in Table 10). Specifically, when performing harmonization using neuroHarmonize on all data, the site effect appeared to be removed in all analyses except for the imaging site predictions using FD features in the ADOLESCENCE and ADULTHOOD meta-datasets (age-group permutation test p-values equal to 0.0188 and 0.0002, respectively, in Table 10). We found the same behavior when predicting the imaging site using CT and FD features in the LIFESPAN meta-dataset (age-group permutation test p-value equal to 0.0002 in Table 10). In the latter cases, although significantly different from a random prediction, the balanced accuracies were significantly lower than those obtained using the original data (one-sided Wilcoxon signed-rank test p-values < 0.001 in Table 10), which indicates a site effect reduction. When applying the harmonizer transformer to the data (within the CV), we observed the actual efficacy of the harmonization, without the data leakage introduced in the previous case. Indeed, we confirmed a complete removal of the site effect only for the imaging site prediction using CT features in the ADULTHOOD meta-dataset (age-group permutation test p-value equal to 0.1064 in Table 10). In all the other cases, the imaging site prediction was significantly different from the chance level (age-group permutation test p-values < 0.05 in Table 10), but the balanced accuracies were significantly lower than those obtained using the original data (one-sided Wilcoxon signed-rank test p-values < 0.001 in Table 10). Thus, the apparent site effect removal measured using data harmonized before the splitting into training and test sets was a clear sign of data leakage even in in vivo data.

Table 9 Site prediction results.
Table 10 Harmonization efficacy.

Age prediction

Table 11 reports the median MAE values (over 100 repetitions) of the age prediction model. Overall, MAE values of age prediction using data harmonized with neuroHarmonize before the splitting into training and test sets are significantly lower than those obtained using data harmonized with the harmonizer within the CV (one-sided Wilcoxon signed-rank p-values < 0.001 for all the cases, except for CT features in the CHILDHOOD meta-dataset, see Table 11). In line with the results of simulations, the data leakage introduced by harmonizing the data all at once leads to an overly optimistic performance.

Table 11 Age prediction results.

Finally, in Figs. 12, 13, we report the age-dependent trends of the average CT and FD of the cerebral cortex, respectively, both without harmonization and harmonized with the harmonizer transformer. In line with previous literature on features such as CT and volumes2,5, the harmonized average CT and FD values in this study showed less variability than the raw data.

Fig. 12
figure 12

Scatterplot of the average CT of the cerebral cortex vs. age. The plot of the average CT of the cerebral cortex vs. age is shown for the CHILDHOOD, ADOLESCENCE, ADULTHOOD, and LIFESPAN meta-datasets without and with harmonization using the harmonizer transformer. In the latter case, we considered only the first CV among the 100 repetitions. Specifically, for each subject, we plotted the harmonized value obtained in the fold when the subject was included in the test set.

Fig. 13
figure 13

Scatterplot of the FD of the cerebral cortex vs. age. The plot of the FD of the cerebral cortex vs. age is shown for the CHILDHOOD, ADOLESCENCE, ADULTHOOD, and LIFESPAN meta-datasets without and with harmonization using the harmonizer transformer. In the latter case, we considered only the first CV among the 100 repetitions. Specifically, for each subject, we plotted the harmonized value obtained in the fold when the subject was included in the test set.

Discussion

In this study, we introduced the harmonizer transformer, which encapsulates the data harmonization procedure among the preprocessing steps of a machine learning pipeline to avoid data leakage by design. To this end, we explored the ComBat harmonization of CT and FD features extracted from brain T1-weighted MRI data of 1740 healthy subjects aged 5–87 years acquired at 36 sites, as well as from simulated data. We measured the efficacy of the harmonization process in reducing or removing the unwanted site effect through a two-step assessment comparing the performance in imaging site prediction using harmonized data with that of 1) a random prediction and 2) a prediction using non-harmonized data. Finally, we confirmed that data leakage related to harmonization performed before data splitting leads to overestimating performance in both simulated and in vivo data.

Using simulated data, we showed that the data leakage effect introduced by performing the harmonization before data splitting is clearly evident and worsens when the single-center dataset size is small, i.e., comparable with that of most in vivo neuroimaging studies. In these simulated experiments, we paid particular attention to comparing the different harmonization and machine learning approaches under the same conditions, i.e., the same data splits and the same number of subjects for harmonization (for this reason, we adopted 80% of the data set size for fitting the neuroHarmonize model; indeed, with the harmonizer approach, the harmonization was computed in the training folds of a 5-fold CV, i.e., using 80% of the samples).

We chose the ComBat harmonization method due to its widespread use in the scientific community7,12,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34 and its implementation in the neuroHarmonize package, which enables the specification of covariates with generic non-linear effects2. The efficacy of ComBat and its variants has been evaluated by comparing their performance with other harmonization techniques3,5,6 and by simulating site effects using single-center data2. However, various harmonization techniques can be used for features extracted from MRI images. One such method is the residuals harmonization, which employs a global scaling procedure to account for the influence of each site using a pair of parameters (offset and scale). These parameters can be estimated through a linear regression model or a more sophisticated approach that considers non-linearities5. Global scaling was initially introduced to harmonize images directly6. The adjusted residuals harmonization, an advancement of the residuals harmonization, integrates biological covariates (such as age, sex, and diseases) into the linear regression model, facilitating the removal of unwanted site effects while maintaining biological variability5. Lastly, the Correcting Covariance Batch Effects (CovBat) method is a recent variant of the ComBat method that aims to address site effects in the mean, variance, and covariance of the neuroimaging features84.

It is important to note that this is the first study in which the efficacy of the harmonization procedure for neuroimaging data has been evaluated by also comparing the accuracy of imaging site prediction to the chance level. Indeed, previous works have consistently shown a decrease in the accuracy of imaging site prediction after harmonization, but without applying a significance test, and thus it was not known whether the site effect was removed or only reduced (see, e.g., ref. 2 and ref. 5). As hypothesized, there was a real imaging site effect in the raw data (age-group permutation test p-value < 0.05 for all data). The site effect was either eliminated or only reduced after data harmonization with neuroHarmonize or the harmonizer transformer. Specifically, the difference between the efficacy of harmonization when applying neuroHarmonize on all data and when applying the harmonizer within the CV was expected because, in the former case, data leakage is present, leading to a falsely overestimated performance, i.e., an age-group permutation test p-value ≥ 0.05 and a lower median balanced accuracy (Tables 9, 10). The complete removal of the imaging site effect measured using the data harmonized with neuroHarmonize was therefore only apparent. Indeed, using the harmonizer within the CV, the imaging site effect was completely removed only for CT features in the ADULTHOOD meta-dataset. In line with the results of the simulations, we noted that the median balanced accuracies obtained by performing site prediction with data harmonized using neuroHarmonize were significantly lower than those observed using the harmonizer transformer within the CV (one-sided Wilcoxon signed-rank p-values < 0.001 for all the analyses). The differences found in the median balanced accuracy of imaging site prediction between the harmonizer transformer and neuroHarmonize emphasize the importance of introducing the harmonizer transformer into a machine learning pipeline to avoid data leakage, a source of bias in prediction results. Notably, the procedure used to measure data leakage on the simulated data (i.e., comparing the performance of imaging site prediction between the internal test set of the CV and the external test set) was not viable for the in vivo data due to the limited sample size in several centers (fewer than 20 subjects).

Looking at the age-group permutation test p-values for imaging site prediction using data harmonized with neuroHarmonize (i.e., data harmonized before splitting into training and test sets), the efficacy of harmonization worsened as the overlap of the age distributions in the multicenter meta-datasets decreased (Table 10). Specifically, for CT features, the age-group permutation test p-value was 0.5023 in the CHILDHOOD meta-dataset, which exhibits a good overlap of age distributions (BC = 0.71), but dropped to 0.0002 in the LIFESPAN meta-dataset, which exhibits BC = 0. Similar behavior was observed for FD features. These results on in vivo data are in line with the simulations performed by Pomponio and colleagues2, which suggested that age-disjoint studies would be challenging to harmonize in the presence of nonlinear age effects. In contrast, the efficacy of the harmonization performed within the CV using the harmonizer transformer does not appear to be closely linked to the degree of overlap of the age distributions in the multicenter meta-datasets. This may be explained by the fact that, in each fold of the CV, the harmonizer transformer handles only the training data, randomly drawn from the whole meta-dataset, so the actual BC values may vary across folds.
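The Bhattacharyya coefficient used here to quantify distribution overlap is defined over histogram bins as BC = Σ_i √(p_i q_i), ranging from 1 (identical histograms) to 0 (disjoint supports). A short sketch follows, assuming ages as NumPy arrays; the bin count is an arbitrary illustrative choice.

```python
import numpy as np

def bhattacharyya_coefficient(ages_a, ages_b, n_bins=20):
    """Overlap of two age distributions: BC = sum_i sqrt(p_i * q_i)."""
    lo = min(ages_a.min(), ages_b.min())
    hi = max(ages_a.max(), ages_b.max())
    bins = np.linspace(lo, hi, n_bins + 1)
    # normalized histograms over a common binning
    p, _ = np.histogram(ages_a, bins=bins)
    q, _ = np.histogram(ages_b, bins=bins)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(np.sqrt(p * q)))
```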

Age prediction using data harmonized with neuroHarmonize before splitting into training and test sets appears falsely improved compared with data harmonized with the harmonizer within the CV. Indeed, the median MAE values obtained when predicting age with data harmonized with neuroHarmonize before splitting were significantly lower than those estimated using data harmonized with the harmonizer within the CV (one-sided Wilcoxon signed-rank p-values < 0.001 for all cases, except for CT features in the CHILDHOOD meta-dataset; see Table 10). These results confirm that data leakage caused by harmonizing data before splitting them into training and test sets leads to performance overestimation even for in vivo data, and they underline the importance of encapsulating the data harmonization procedure among the preprocessing steps of a machine learning pipeline.
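Conceptually, avoiding this leakage amounts to making harmonization a fit/transform step of the pipeline, so that ComBat parameters are estimated on the training folds only. The sketch below illustrates the idea by wrapping the public neuroHarmonize functions in a scikit-learn transformer; it is a simplified rendition, not our released code, and it assumes the covariates and the integer site label are appended as the last columns of the feature matrix so that cross-validation splits them together with the imaging features (and that every site appears in the training folds).

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from neuroHarmonize import harmonizationLearn, harmonizationApply

class LeakFreeHarmonizer(BaseEstimator, TransformerMixin):
    """Sketch of a leakage-free ComBat step for scikit-learn pipelines.

    Assumes the last columns of X hold the biological covariates
    (covar_names) followed by the site label, so that CV splits them
    together with the imaging features.
    """

    def __init__(self, covar_names=("AGE",)):
        self.covar_names = covar_names

    def _split(self, X):
        n_extra = len(self.covar_names) + 1
        feats = X[:, :-n_extra]
        covars = pd.DataFrame(X[:, -n_extra:],
                              columns=[*self.covar_names, "SITE"])
        return feats, covars

    def fit(self, X, y=None):
        feats, covars = self._split(X)
        # ComBat parameters are estimated on the training folds only
        self.model_, _ = harmonizationLearn(feats, covars)
        return self

    def transform(self, X):
        feats, covars = self._split(X)
        # the learned parameters are applied to unseen (test) data
        return harmonizationApply(feats, covars, self.model_)
```

Placed at the head of a scikit-learn Pipeline (e.g., followed by an XGBoost estimator) and evaluated with cross_val_score, such a transformer refits ComBat in every fold; harmonizing the full dataset before the split is exactly what it prevents.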

In previous single-center studies, we observed that the FD computed with the box-counting algorithm and the automated selection of the optimal fractal scaling window implemented in fractalbrain predicted chronological age in two datasets of healthy children and adults better than various other FD approaches and more conventional features, such as CT and the gyrification index59. In this large multicenter study, we confirmed that the FD of the cerebral cortex predicts individual age better than the average CT. In the LIFESPAN meta-dataset, for example, the error in age prediction using CT features (MAE = 7.55 years) was reduced by more than 25% when using FD features (MAE = 5.60 years), in line with previous literature59,68. This result further confirms that FD conveys information additional to that provided by other conventional structural features58,59,67,68,86,87,88,89,90,91,92,93,94,95,96,97,98,99.
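For readers unfamiliar with box counting, the generic algorithm covers the structure with boxes of decreasing size and estimates the FD as the slope of log(box count) versus log(1/box size). The sketch below is a bare-bones illustration on a 3D binary mask, without the automated fractal scaling window selection that distinguishes fractalbrain59; box sizes are arbitrary example values.

```python
import numpy as np

def box_counting_fd(mask, sizes=(2, 4, 8, 16, 32)):
    """Illustrative box-counting fractal dimension of a 3D binary mask."""
    counts = []
    for s in sizes:
        # trim so each dimension is a multiple of the box size
        trimmed = mask[: mask.shape[0] // s * s,
                       : mask.shape[1] // s * s,
                       : mask.shape[2] // s * s]
        # group voxels into s x s x s boxes and count the occupied ones
        boxes = trimmed.reshape(trimmed.shape[0] // s, s,
                                trimmed.shape[1] // s, s,
                                trimmed.shape[2] // s, s)
        counts.append((boxes.sum(axis=(1, 3, 5)) > 0).sum())
    # FD is the slope of log(count) vs log(1/size): N(s) ~ (1/s)^FD
    slope, _ = np.polyfit(np.log(1.0 / np.array(sizes)), np.log(counts), 1)
    return slope
```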

This study has some limitations. Firstly, to show the utility of encapsulating the data harmonization procedure among the preprocessing steps of a machine learning pipeline to avoid data leakage, we used only the ComBat harmonization method. However, other harmonization techniques are available and could be similarly effective, including the recent CovBat model, which adds harmonization of the covariance between sites84. Future research may compare the performance of different harmonization methods to identify the optimal approach for specific research questions and datasets.

Secondly, we showed and measured the data leakage effect using simulated and in vivo data of CT and FD of the cerebral cortex only. Various other morphological and functional MRI-derived features might be considered. However, the main focus of the study was to measure the efficacy of the harmonization and to show the possible detrimental effect of harmonizing the entire dataset before machine learning analysis, and this effect does not depend on the specific features considered.

Lastly, for site/age prediction, we adopted an XGBoost decision-tree model with default hyperparameters. It is well known that classification/regression performance may be affected by hyperparameter values, and proper hyperparameter optimization, e.g., through a nested CV, could be adopted. However, this procedure was not feasible in our study because of the relatively small sample size in many centers, an undesired but common scenario in many publicly available datasets. Thus, though this choice was arbitrary, we believe that using the same hyperparameters for both the neuroHarmonize and harmonizer transformer data was reasonable.
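For completeness, had the sample sizes allowed it, a nested CV for hyperparameter tuning would follow the standard recipe sketched below (grid values and data are synthetic stand-ins; in practice the harmonizer transformer would precede the regressor inside a Pipeline so that harmonization is refitted within every outer and inner fold).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 68))      # stand-in feature matrix
y = rng.uniform(5, 85, size=200)    # stand-in ages (years)

# inner loop tunes hyperparameters; outer loop estimates generalization
param_grid = {"max_depth": [3, 5, 7], "n_estimators": [100, 300]}
inner = KFold(n_splits=3, shuffle=True, random_state=0)
outer = KFold(n_splits=5, shuffle=True, random_state=0)

tuned = GridSearchCV(XGBRegressor(), param_grid, cv=inner,
                     scoring="neg_mean_absolute_error")
scores = cross_val_score(tuned, X, y, cv=outer,
                         scoring="neg_mean_absolute_error")
print(f"nested-CV MAE: {-scores.mean():.2f} years")
```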

In conclusion, we showed that introducing the harmonizer transformer, which encapsulates the harmonization procedure among the preprocessing steps of a machine learning pipeline, avoids data leakage. Using in vivo data, after ComBat harmonization, the site effect was completely removed or reduced while preserving the biological variability. We therefore suggest that future multicenter imaging studies include the data harmonization method in their machine learning pipelines and measure the efficacy of the harmonization process.