Multi-site benchmark classification of major depressive disorder using machine learning on cortical and subcortical measures

Machine learning (ML) techniques have gained popularity in the neuroimaging field due to their potential for classifying neuropsychiatric disorders. However, the diagnostic predictive power of the existing algorithms has been limited by small sample sizes, lack of representativeness, data leakage, and/or overfitting. Here, we overcome these limitations with the largest multi-site sample size to date (N = 5365) to provide a generalizable ML classification benchmark of major depressive disorder (MDD) using shallow linear and non-linear models. Leveraging brain measures from standardized ENIGMA analysis pipelines in FreeSurfer, we were able to classify MDD versus healthy controls (HC) with a balanced accuracy of around 62%. But after harmonizing the data, e.g., using ComBat, the balanced accuracy dropped to approximately 52%. Accuracy results close to random chance levels were also observed in stratified groups according to age of onset, antidepressant use, number of episodes and sex. Future studies incorporating higher dimensional brain imaging/phenotype features, and/or using more advanced machine and deep learning methods may yield more encouraging prospects.


Introduction
Major depressive disorder (MDD) is a psychiatric disorder with great impact on society, with a lifetime prevalence of 14% 1 , often resulting in reduced quality of life 2 and increased risk of suicide for those affected 3 . Considering the possibility of treatment resistance 4 and accelerated brain aging 5 , early recognition and implementation of effective treatments are critical.
Unfortunately, there are no reliable biomarkers to date to diagnose MDD, or to predict its highly variable natural progression or response to treatment 6 . Until now, the diagnosis of MDD has relied exclusively on self-reported symptoms in clinical interviews, which, despite great efforts, present a risk of misdiagnosis due to subjectivity and the limited specificity of some symptoms, especially in the early stages of mental disorders. Furthermore, comorbid conditions such as substance use disorders, anxiety spectrum disorders 7 , and other mental and somatic diseases 8 may contribute to the difficulty of correctly diagnosing and treating MDD.
With modern neuroimaging techniques such as magnetic resonance imaging (MRI), it has become possible to investigate cortical and subcortical brain alterations associated with MDD with high spatial resolution. Numerous studies reveal structural brain differences in MDD compared to healthy controls (HC) [9][10][11][12][13] , with patients presenting, on average, smaller hippocampal volumes as well as lower cortical thickness in the insula, temporal lobes, and orbitofrontal areas. However, inference at the group level and small effect sizes preclude clinical application. Analytic tools such as machine learning (ML) that allow multivariate combinations of brain features and enable inference at the individual level may result in better discrimination between MDD patients and HC, thereby potentially providing clinically relevant biomarkers for MDD.
Current literature shows MRI-based MDD classification accuracies ranging from 53% to 91% 14,15 with inconsistencies regarding which regions are the most informative for the classification.
This lack of consensus in the literature raises concerns regarding the generalizability of the classification methods and their related findings. A major contributor to the high variability in classification performance is sample size 16,17 . Specifically, smaller test sets tend to show more extreme results in both directions 16 , whereas studies with larger test sets tend to converge to an accuracy of around 60% 17 . In the presence of publication bias, which favors the reporting of overestimations, the published literature can quickly accumulate inflated results 18 . Further, overestimations in the neuroimaging field 19,20 may also be driven by data leakage, which refers to the use of test data in any part of the training process.
Other factors contributing to inconsistencies in results include the heterogeneity of samples in relation to demographic and clinical characteristics, factors that play a significant role both in MDD and in the general population 5,21,22 . As large representative samples within a single cohort is difficult (e.g., due to financial cost, access to patient population, etc.), there is a growing interest in performing multi-site mega-analyses to address these issues.
ENIGMA MDD is a large-scale worldwide consortium, which curates and applies standardized analysis protocols to MRI and clinical/demographic data of MDD patients and HC from 52 independent sites from 17 countries across 6 continents (for review, 23 ). Such large-scale approaches with global representation are necessary for identifying brain alterations associated with MDD that are realistic, reliable, and generalizable 24 . Therefore, we consider data from different international cohorts included in ENIGMA MDD a powerful and efficient resource to benchmark the robustness of the most commonly used ML algorithms in MDD-related neuroimaging studies (for review, 14 ). Such algorithms include support vector machines (SVM), logistic regression with least absolute shrinkage and selection operator (LASSO) and ridge regularization, elastic net, and random forests. An additional advantage of ENIGMA MDD is that the inclusion of thousands of participants allows stratification by several important factors related to cortical and subcortical brain alterations in MDD, such as sex, age, age of MDD onset, and antidepressant use. However, unifying multi-site data presents challenges.
The global group differences between cohorts, referred to here as a site effect, may arise from different MR acquisition equipment and acquisition protocols 25 , and/or demographic and clinical factors 26,27 . Ignoring the site effect may lead to the construction of suboptimal, less generalizable classification models 28 . Along these lines, a commonly used strategy to mitigate the site effect is to apply a harmonization technique such as ComBat 29 . Adopted from genomic studies, NeuroComBat estimates and statistically corrects for (harmonizes) differences in location (mean) and scale (variance) across different cohorts, while preserving or perhaps even enhancing the effect size of the variables of interest [30][31][32] . There are only a few studies attempting multi-site MDD classification using structural brain metrics 17,33,34 ; however, site effects were not addressed in their analyses.
The main goal of this study was to establish a benchmark for classification of MDD vs HC based on structural cortical and subcortical brain measures in the largest sample to date. We applied the most commonly used ML algorithms in MDD neuroimaging studies: SVM with linear and rbf kernels with and without feature selection (PCA, t-test), logistic regression with LASSO/ridge regularization, elastic net, and random forests. The model's performance is estimated via balanced accuracy, area under the receiver operating characteristic (AUC), sensitivity and specificity. We hypothesized that all models would be able to classify MDD vs HC with balanced accuracy higher than random chance, based on provided brain measures. We pooled preprocessed structural data from ENIGMA MDD participants, including 5,365 subjects (2,288 MDD and 3,077 HC) from 30 cohorts worldwide. As we were equally interested in general structural brain alterations in MDD as well as the generalizability of classification performance in sites unseen in the training phase, the data were split according to two strategies.
First, age and sex (Splitting by Age/Sex) were evenly distributed across all cross-validation (CV) folds, where each fold is used as a test set once and the remaining folds are used as the training set iteratively. Second, the sites (Splitting by Site) were kept whole across CV folds, so the algorithms were trained and tested on different sets of cohorts, resulting in large between-sample heterogeneity of training and test sets and potentially lower classification performance 35 , especially if large site effects are present. Because MDD is a highly heterogeneous diagnosis, and previous work from ENIGMA MDD 10,11 has identified distinct alterations in different clinical subgroups, we also stratified MDD based on sex, age of onset, antidepressant use, and number of depressive episodes to investigate whether classification accuracy could be improved in more homogeneous subgroups. Additionally, we investigated which brain areas were most relevant to classification performance.
In summary, we expected that (1) All models would correctly classify MDD above chance level, (2) Splitting by Site would yield lower performance versus Splitting by Age/Sex, (3) Application of ComBat would improve classification performances for all models, and (4) Stratified analyses according to demographic and clinical characteristics would yield higher classification performance. We also explored the impact of other approaches to remove site effects (ComBat-GAM 36 and CovBat 37 ) from structural brain measures prior to feeding these measures into the classification models.

Participant Sample
A total of 5,365 participants, 2,288 patients with MDD and 3,077 healthy controls, from 30 cohorts participating in the ENIGMA MDD working group, were included in the analyses.
Information on sample characteristics and inclusion/exclusion criteria for each cohort can be found in Supplementary Table 1. Subjects with fewer than 75% of the combined cortical and subcortical features and/or missing demographic/clinical information required for a particular analysis were excluded from that analysis. Missing cortical and subcortical features for the remaining subjects (2% of all data) were imputed using multiple linear regression with the age and sex of all subjects (regardless of diagnosis) as predictors, estimated for each cohort separately. All participating sites reported approval from their institutional review boards and local ethics committees and obtained written informed consent from all participants.
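The per-cohort imputation step described above can be sketched as follows. This is a minimal illustration, not the actual ENIGMA code; the function name and data layout are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def impute_features_per_cohort(X, age, sex, cohort):
    """Impute missing feature values (NaN) within each cohort using a
    linear regression on age and sex, fitted on all subjects of that
    cohort regardless of diagnosis (illustrative sketch)."""
    X = X.copy()
    covars = np.column_stack([age, sex])
    for c in np.unique(cohort):
        rows = cohort == c
        for j in range(X.shape[1]):
            miss = rows & np.isnan(X[:, j])
            obs = rows & ~np.isnan(X[:, j])
            if miss.any() and obs.sum() > 2:
                model = LinearRegression().fit(covars[obs], X[obs, j])
                X[miss, j] = model.predict(covars[miss])
    return X
```

Fitting the regressors separately per cohort, as described above, avoids importing site differences into the imputed values.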

Brain Imaging Processing
Structural T1-weighted 3D brain MRI scans of participating subjects were acquired from each site and preprocessed according to the rigorously validated ENIGMA Consortium protocols (http://enigma.ini.usc.edu/protocols/imaging-protocols/). Information on the MRI scanners and acquisition protocols used for each cohort can be found in Supplementary

Data Splitting into Cross-Validation Folds
We applied two different strategies to split the data into training and test sets: Splitting by Age/Sex and Splitting by Site. For both strategies, the data were split into 10 folds, 9 of which were used for training while the remaining fold was used as a test set. This was repeated iteratively until each fold had been used once as a test set, thus performing 10-fold CV. We investigated the general differences in brain volumes that characterize MDD by using the Splitting by Age/Sex strategy. In this way, the age and sex distributions as well as the number of subjects were balanced between folds to mitigate the effect of these factors on the classification performance. However, it should be noted that with each site represented in every CV fold, any potential site effects in this strategy would be diluted between the folds, which does not represent a realistic clinical scenario, where a classification model likely has to generalize to unseen sites. Therefore, we used a second strategy, Splitting by Site, which yields more realistic metrics of classification performance for unseen sites. Under this strategy, every site was present in only one fold, so the model was always trained and tested on different sets of sites, and sites were distributed across folds to balance the number of subjects in every fold as closely as possible. In this scenario, potential site-specific confounders (e.g., different MR scanners/acquisition protocols, demographic and clinical differences, etc.) were not equally distributed between the training and test sets, allowing a fair evaluation of generalizability from one cohort to another. Finally, to assess the performance estimates for each site, we explored leave-site-out CVs.
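The two splitting strategies map naturally onto scikit-learn cross-validation iterators; a rough sketch on synthetic data follows. Variable names are illustrative, and GroupKFold only approximates the subject-count balancing described above.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold

rng = np.random.default_rng(42)
n = 200
site = np.arange(n) % 10                  # 10 hypothetical cohorts
diagnosis = rng.integers(0, 2, size=n)    # 0 = HC, 1 = MDD
X = rng.normal(size=(n, 5))               # placeholder brain features

# Splitting by Age/Sex: sketched here as stratification on diagnosis
# (the study additionally balanced age and sex across folds)
cv_mixed = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
mixed_folds = list(cv_mixed.split(X, diagnosis))

# Splitting by Site: every site appears in exactly one fold, so each
# test fold contains only sites unseen during training
cv_site = GroupKFold(n_splits=10)
site_folds = list(cv_site.split(X, diagnosis, groups=site))
```

The group-wise iterator guarantees that no site contributes subjects to both the training and test side of any split, which is the property that makes Splitting by Site a test of cross-cohort generalization.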

Analysis Pipeline
After distributing the data into CV folds according to the splitting strategies, 9 folds were used for training, while the remaining fold was held out as a test set ( Figure 3). CV folds were residualized normatively, partialling out the linear effects of age, sex and ICV from all cortical and subcortical features. In this step, the age, sex and ICV regressors were estimated on the HC from the training CV folds and applied to the MDD training data and to all data in the test set. After normalizing all features to a mean of zero and a standard deviation of one, based on distributions estimated from the training set, the training and test folds were used for training and performance estimation, respectively. Additionally, class weighting was applied to mitigate class imbalance in the training set. Models' hyperparameters were estimated on the training data via nested 10-fold cross-validation using grid search (random splits, for both Splitting by Site and Splitting by Age/Sex) before performance was measured on the test data, avoiding data leakage through the choice of hyperparameters. The list of tuned hyperparameters can be found in Supplementary Table 3. We evaluated the performance of SVM with linear kernel, SVM with rbf kernel, logistic regression with LASSO regularization, logistic regression with ridge regularization, elastic net, and random forests, using balanced accuracy, sensitivity, specificity and AUC as performance metrics.
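The leakage-free sequence of normative residualization, training-set scaling, class weighting, and nested grid search can be sketched as follows. This is a simplified, self-contained illustration on synthetic data, not the study's actual code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))                        # synthetic brain features
covars = np.column_stack([rng.uniform(18, 80, 300),   # age
                          rng.integers(0, 2, 300),    # sex
                          rng.normal(1500, 150, 300)])  # ICV
y = rng.integers(0, 2, 300)                           # 0 = HC, 1 = MDD
train, test = np.arange(0, 240), np.arange(240, 300)

# 1. residualize normatively: age/sex/ICV regressors fitted on training HC only,
#    then applied to all training and test data
hc = train[y[train] == 0]
reg = LinearRegression().fit(covars[hc], X[hc])
Xr_train = X[train] - reg.predict(covars[train])
Xr_test = X[test] - reg.predict(covars[test])

# 2. z-score with training-set statistics only (no leakage into the test fold)
mu, sd = Xr_train.mean(0), Xr_train.std(0)
Xz_train, Xz_test = (Xr_train - mu) / sd, (Xr_test - mu) / sd

# 3. nested grid search for hyperparameters; class_weight='balanced'
#    reweights classes by their inverse frequency to mitigate class imbalance
grid = GridSearchCV(SVC(kernel='linear', class_weight='balanced'),
                    {'C': [0.01, 0.1, 1, 10]},
                    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
                    scoring='balanced_accuracy')
grid.fit(Xz_train, y[train])
test_score = grid.score(Xz_test, y[test])
```

Because every statistic (regression coefficients, means, standard deviations, hyperparameters) is estimated on the training folds alone, the held-out fold contributes nothing to model fitting, which is the leakage safeguard described above.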
As regularization serves as an in-built feature selection mechanism, we additionally applied feature selection via PCA and t-test for the SVM algorithms. For model-level assessment 70 , all models were also trained on subsets of features, i.e., only cortical surface areas, only cortical thicknesses, or only subcortical volumes. Lastly, we investigated which features contributed most to the classification performance by examining the decision-making of the most successful model. As no performance differences across models were found, we reported the weights of the SVM with linear kernel as the representative classifier. These weights correspond to the classification performance of the Splitting by Age/Sex strategy, as all sites are used for weight estimation. To assess confidence intervals of the feature weights, we performed a 599-iteration bootstrap 71 on the whole data set.
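A percentile bootstrap of the linear SVM feature weights could look like the sketch below. This is illustrative only: `LinearSVC` stands in for the SVM with linear kernel, the function name is hypothetical, and the paper's bootstrap used 599 iterations on the full data set.

```python
import numpy as np
from sklearn.svm import LinearSVC

def bootstrap_svm_weights(X, y, n_boot=599, seed=0):
    """Percentile bootstrap of linear SVM feature weights
    (illustrative sketch of a 599-iteration bootstrap)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    weights = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, n)           # resample subjects with replacement
        while len(np.unique(y[idx])) < 2:     # make sure both classes are drawn
            idx = rng.integers(0, n, n)
        clf = LinearSVC(class_weight='balanced', max_iter=5000)
        weights[b] = clf.fit(X[idx], y[idx]).coef_[0]
    lo, hi = np.percentile(weights, [2.5, 97.5], axis=0)
    return weights.mean(axis=0), lo, hi
```

Features whose bootstrap interval excludes zero (and, in the paper's convention, whose weight exceeds ±0.1) would be the ones flagged as informative.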
Further analyses were performed by stratifying the data according to demographic and clinical categories, including sex, age of onset (<21 years old vs >21 years old), antidepressant use (yes/no at time of scan), and number of depressive episodes (first episode vs recurrent episodes).
The subjects with missing information on these factors were not included in these stratification analyses, while they were still considered for the main analysis.
All steps from CV-fold creation to classification were repeated with feature-specific harmonization of site effects via ComBat. Variance explained by age, sex and ICV was preserved in the cortical and subcortical features during the harmonization step. The harmonized folds were then residualized normatively, with all subsequent steps identical to the analysis without harmonization. Furthermore, we compared ComBat with two of its extensions: ComBat-GAM and CovBat.
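For intuition, the location/scale adjustment at the heart of ComBat can be sketched as below. This is a deliberately simplified stand-in: it omits ComBat's empirical Bayes shrinkage of the site parameters, and the function name and interface are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def harmonize_location_scale(X, site, covars):
    """Simplified ComBat-style harmonization (no empirical Bayes step):
    remove per-site shifts in the mean and variance of the residuals
    after accounting for covariates (e.g., age, sex, ICV), then add the
    covariate effects back so their variance is preserved."""
    reg = LinearRegression().fit(covars, X)
    fitted = reg.predict(covars)
    resid = X - fitted
    pooled_sd = resid.std(axis=0)
    out = np.empty_like(resid)
    for s in np.unique(site):
        rows = site == s
        r = resid[rows]
        out[rows] = (r - r.mean(axis=0)) / r.std(axis=0) * pooled_sd
    return out + fitted
```

In the actual pipeline the harmonization parameters would, like the residualization regressors, be estimated on the training folds and applied to the held-out fold.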
More detailed description of ComBat, ComBat-GAM and CovBat as well as their implementation for both splitting strategies can be found in Supplementary section "Harmonization methods".
We used Python (version 3.8.8) to perform all calculations. All classification models and feature selection methods were imported from sklearn library (version 1.

Participants and Data Splitting
From 5,572 participants, 207 were excluded due to less than 75% of combined cortical and subcortical features being provided, resulting in 5,365 subjects (2,288 MDD and 3,077 HC) used in the analysis.

Full Data Set Analysis
The classification performance of all models was similar and is presented in Table 3. When sites were evenly distributed across all CV folds (Splitting by Age/Sex), the highest balanced accuracy of 0.639 was achieved by the SVM with rbf kernel trained on all cortical and subcortical features. The application of ComBat harmonization dropped the performance of all models to close to chance level. This pattern of lower classification performance when ComBat was applied was also observed across the other classification metrics (see Supplementary Tables 5-7). Specificity was found to be up to 10% higher than sensitivity, possibly related to imbalances in the MDD-to-HC ratio and their effect on the classification. For the Splitting by Site strategy, classification performance did not change significantly whether or not ComBat harmonization was performed. Balanced accuracy was close to random chance, indicating that the models were not able to differentiate MDD subjects from HC; the application of ComBat did not result in higher classification accuracies.
(Figure caption: To discriminate MDD from healthy controls (HC), we investigate the importance of the structural brain features by examining the corresponding feature weights for the regional cortical surface areas, cortical thicknesses and subcortical volumes. The dashed lines represent ±0.1 as thresholds of feature weights, above which informative features stand out.)

Data Stratification
Next, we investigated the classification performance of models trained and tested on data stratified by demographic and clinical characteristics. The general pattern of the highest accuracy being achieved by the Splitting by Age/Sex strategy without ComBat, with a significant drop in accuracy when ComBat was applied, was observed in all stratified analyses (below). In the Splitting by Site strategy, the classification performance did not change significantly when ComBat was applied. Information on the feature weights may be found in Supplementary Figures 1-4.

Males vs females
The sample comprised 2,131 male and 3,227 female subjects (7 male participants from the Episca cohort were not considered as we could not split them into 10 CV folds). In the Splitting by Age/Sex strategy without the harmonization step, the highest balanced accuracy of 0.632 was achieved when models were trained and tested on males, compared to a maximum of 0.585 for females. When ComBat was applied, the accuracy dropped to 0.530 for males and to 0.529 for females, showing minimal differences in classification results between males and females. For Splitting by Site, the accuracy did not change depending on the use of ComBat for either males (0.513 to 0.506) or females (0.519 to 0.517). Nevertheless, different brain regions were found to be important for classification in the subgroups. In general, more features were found to be important for classification in males than in females; this is especially noticeable for the regional surface areas (Supplementary Figure 1).

Age of onset
For Splitting by Age/Sex, when only patients first diagnosed in adolescence were included in the analysis, yielding 3,794 subjects in total, an accuracy of 0.626 was achieved, compared to 0.623 when patients first diagnosed in adulthood were analyzed. These accuracies dropped to 0.548 and 0.521, respectively, when ComBat was applied. In the Splitting by Site strategy, the balanced accuracy did not change substantially for either subgroup: 0.541 to 0.544 for the adolescent-onset group and 0.546 to 0.518 for the adult-onset group, highlighting the absence of significant differences between these groups (Supplementary Figure 2).

Antidepressant use vs antidepressant free (at the time of MR scan)
Both subgroups showed a drop in balanced accuracy when ComBat was applied. In the case of Splitting by Age/Sex, it decreased from 0.564 to 0.529 for the antidepressant-free subgroup (4,408 subjects) and from 0.716 to 0.534 for the antidepressant subgroup (3,988 subjects).
When Splitting by Site, the balanced accuracy did not change significantly for either subgroup when ComBat was used: for the antidepressant-free subgroup it decreased from 0.564 to 0.528, while for the antidepressant subgroup it dropped from 0.560 to 0.483 (Supplementary Figure 3).

First episode vs recurrent episodes
Similarly, a drop in accuracy with ComBat was observed when the data set was stratified based on the number of depressive episodes. In Splitting by Age/Sex, the balanced accuracy for the first-episode subgroup dropped from 0.559 to 0.518 when ComBat was applied. For individuals with more than one episode, the balanced accuracy decreased from 0.644 to 0.520 with ComBat. In the Splitting by Site strategy, the algorithms' performance was not majorly affected by ComBat in the single-episode subgroup, changing from 0.482 to 0.512 in balanced accuracy, with an insignificant drop from 0.521 to 0.505 for the recurrent-episodes subgroup (Supplementary Figure 4).

Discussion
In this work, we benchmarked ML performance on the largest multi-site data set to date, using regional cortical and subcortical structural information for the task of discriminating patients with MDD from HC. We applied the most commonly used ML algorithms to 152 features of 5,365 subjects from the ENIGMA MDD working group. To investigate brain characteristics common to MDD, as well as realistic classification metrics for unseen sites, we used two different data splitting approaches. Balanced accuracy was up to 63% when data was split into folds according to the Splitting by Age/Sex strategy, and up to 51% when split according to the Splitting by Site strategy. The harmonization of the data via ComBat evened out the classification performance across both splitting strategies, yielding a balanced accuracy of up to 52%. This implies that the initial differences in performance were due to site effects, most likely stemming from differences in MRI acquisition across sites. Lastly, the data set was stratified based on demographic and clinical factors, but we found only minor differences in classification performance between subgroups.

Data Splitting and Site Effect
Splitting of the data plays an important role in formulating, testing and validating hypotheses 38 . The hypothesis that demographic differences are a major source of the site effect was not supported: regressing out these sources of demographic information did not significantly change the level of classification when predicting site belonging. According to our results, a major source of the site effect comes from the different scanner models and acquisition protocols, since we achieved the highest accuracy when attempting to classify scanner type (see Suppl. "Harmonization methods").
In addition to scanning differences, demographic and diagnosis distributions differed across the sites. Therefore, we explored whether balancing the sample in terms of age and sex distributions would lead to higher classification performance. However, balancing age/sex distributions across sites did not improve classification performance in Splitting by Site (balanced accuracy 52.6%/50.7% without/with ComBat); thus, balancing age and sex did not contribute to better performance. As the MDD/HC ratio also varied across sites, site affiliation could influence the main MDD vs HC task. Therefore, we additionally explored whether the classification performance would drop to random level when the MDD/HC proportion was equalized before splitting the data according to Splitting by Age/Sex. Indeed, we observed a substantial drop in balanced accuracy from 61% to 53% with a 1:1 MDD-to-HC ratio, confirming our assumption that site affiliation was incorporated into the diagnostic classification.
Building on this, ComBat was able to remove the site effect, as all classification models could no longer differentiate between sites after its application. Subsequently, there were no differences between classification results across splitting approaches, with balanced accuracy around 0.52. Such a low accuracy, close to random chance, is consistent with another large-sample study based on two cohorts 17 . In that study, self-reported current depression was speculated to be a reason for the low accuracy, but this is unlikely to explain our classification results, which were based on clinical diagnoses. Moreover, the similar classification levels in our study and theirs support the notion that a more balanced ratio between classes is not a main factor behind the low power of discrimination.
Similar to ComBat, other more sophisticated harmonization methods such as ComBat-GAM and CovBat were able to remove site effect, but did not improve the balanced accuracy in Splitting by Site strategy. We cannot exclude the possibility that ComBat-like harmonization tools may overcorrect the data and remove weaker group differences of interest 40 . Hence, encouraging such evaluations in large data sets as well as implementing new methods to be tested 41,42 on both the group and the single subject prediction level could be of great benefit for the imaging community.

Machine Learning Performance
In our study, the selection of classification algorithms was guided by their frequent appearance in prior neuroimaging studies as well as their low computational complexity.
According to previous studies 14,17 , SVM is the most commonly and successfully used algorithm in such analyses. We also tested other commonly used ML algorithms: logistic regression with LASSO, logistic ridge regression and elastic net logistic regression 14,43,44 . Given that these regularized logistic regression models already have an in-built feature selection procedure, we also included feature selection algorithms such as the two-sample t-test and PCA 45-47 for a fair comparison with SVM. No single algorithm achieved significantly higher classification performance than the others, with balanced accuracy up to 64% when the data were split by age/sex and up to 53% when split by site. A similar trend was observed for AUC.
Such low performance for the Splitting by Site strategy may result from the simplicity of the models, since highly non-linear interactions between different cortical and subcortical regions are not considered, with the exception of the SVM with rbf kernel. In general, specificity was up to 5% higher than sensitivity, possibly a result of the imbalanced MDD/HC data sets, even though the impact of both classes was weighted by their ratio during training.
Considering such low balanced accuracy, future studies could apply more sophisticated classification methods such as convolutional neural networks 48 , which are able to detect non-linear interactions between all the features as well as to consider the spatial information of the given features. As demonstrated previously on both real and simulated data 49 , regressing out covariates can lead to lower classification performance; one could therefore use importance weighting instead. Another option would be to include other data modalities such as vertex-wise cortical and subcortical maps 50,51 or even voxel-wise T1 images to capture more fine-grained changes, which are also present in the shapes of subcortical structures 52 , or diffusion MRI. A recent resting-state fMRI multi-site study by Qin 53 reported an accuracy of 81.5%. Thus, integration of structural and functional data modalities may result in even higher classification performances.

Predictive Brain Regions
Our results do not support the hypothesis that MDD can be discriminated from healthy controls by regional structural features: once site effects were removed, classification performance was close to chance level. Nevertheless, when investigating the most discriminative regions, even after ComBat, we found an overlap with previously reported MDD-related regions.
Multiple cortical and subcortical regions were found to be among the most discriminative between MDD and HC. Most of the cortical regions were identified in previous ENIGMA MDD work 10 , which overlaps with our study's set of cohorts. Shape differences in the left temporal gyrus were reported previously in a younger population with MDD 54 . Left postcentral gyrus and right cuneus surface area were associated with the severity of depressive symptoms, while left superior frontal gyrus, bilateral lingual gyrus and left entorhinal cortical thickness were decreased in the MDD group 10,55 . In a previous study, MDD subjects exhibited reduced cortical volume compared to HC 56 . Differences in the orbitofrontal cortex between MDD and HC were also previously identified 10 . Overall, the effect sizes for case-control differences in these studies were small, which is in line with our current results showing low classification accuracies for these structural brain measures. Additionally, we found increased thickness of the left caudal middle frontal gyrus, right pars triangularis, right superior parietal cortex and right temporal pole in the MDD group, which was not previously reported. All subcortical volumes identified as informative for classification became uninformative after ComBat was applied, suggesting that either previously identified alterations in subcortical regions 11 cannot be directly used as MDD predictors at an individual level, or that ComBat removed differences relevant for classification.
One possible reason is that subcortical volumes tend to exhibit complex associations with age; linear age regression might therefore be an overly simplistic representation of aging trajectories in both the ComBat and residualization steps. While some of the regions were also found to be predictive in the previous mega-analysis of MDD vs HC from Stolicyn and colleagues 17 , it is difficult to draw a consistent conclusion, as they highlight regions based on their selection frequency by a decision tree model, without reporting the direction of the modulation.
When models were trained and tested only on subsets of features in Splitting by Age/Sex, cortical thicknesses and subcortical volumes yielded higher balanced accuracy than cortical surface areas, which is consistent with the previous ENIGMA MDD meta-analysis, owing to an overlap of study cohorts. Yet with the data harmonization step, no distinct subgroup of features provided more discriminative information. Accordingly, we observed more changes in the weights of cortical thicknesses and subcortical volumes after applying ComBat.
One possibility is that differences are more pronounced in thickness than surface area, which is in line with previous findings from univariate approaches 57 . Another possibility is that differences in scanners and acquisition protocols may affect thickness features more strongly than surface areas, in line with previous works 58 . This is a very pertinent topic to be further investigated using multi-cohort mega-analyses on volumetric measures, particularly when the site effect is systematically considered.
Importantly, identified features correspond to the Splitting by Age/Sex strategy as the SVM model was trained on the whole data set with entirely mixed cohorts. While these regions were found to be informative according to the SVM with linear kernel, this model (and every other considered model) failed to differentiate MDD from HC on an individual level, thus one has to be cautious when interpreting these results. Structural alterations in myelination, gray matter, and curvature were found to be associated with MDD-associated genes (Li et al., 2021).
Furthermore, a small-sample study revealed MDD-related alterations in sulcal depth 59 , while topology-based MDD classification on white matter achieved accuracies of up to 76% 60 . Thus, performance could potentially be improved by integrating morphological shape features with white matter characteristics, such as sulcal depth and curvature, and myelination density, as this led to improved performance when classifying sex and autism 61 .

Data stratification
When the data set was stratified, we found substantial differences in balanced accuracies between the groups only for the Splitting by Age/Sex strategy without the harmonization step, yet these results were strongly influenced by the site effect.

Conclusion
We benchmarked the classification of MDD vs HC using the most commonly used ML algorithms applied to regional surface area, cortical thickness, and subcortical volume features in the largest multi-site global data set to date. We systematically addressed the questions of general MDD characteristics and the power to classify unseen sites by splitting the data according to two parallel strategies, which were complemented by ComBat harmonization. A classification accuracy of up to 63% was achieved when all cohorts were present in the test set, which decreased to 52% after ComBat harmonization. Here we have shown that the most commonly used ML algorithms may not be able to differentiate MDD from HC on the single-subject level using only structural morphometric brain data, even when trained on data from thousands of participants. Furthermore, performance was not higher in stratified, clinically and demographically more homogeneous groups. Additional work is required to examine whether more sophisticated algorithms, such as deep learning, can achieve higher predictive power, or whether other MRI modalities such as task-based or resting-state fMRI can provide more discriminative information for successful MDD classification.

Data availability
Authors are not allowed to share the data of participating sites with third parties inside or outside the ENIGMA MDD consortium. Some sites may provide data upon request.

Supplementary Materials

SCID interview
Psychiatric inpatients and tinnitus patients with MDD or a disorder of the depressive spectrum (including adjustment disorders, as noted in the data table); psychiatrically healthy controls were recruited from the community and from tinnitus patients. MDD subjects: presence of Axis-I disorders other than MDD or adjustment disorders. Control subjects: no Axis-I diagnosis, no medication use. Exclusion criteria for all subjects included a history of neurological disease (e.g. tumour, head trauma, epilepsy), untreated internal medical conditions, and intellectual and/or developmental disability. Only German native speakers were allowed to participate.

SCAN interview
Community-based or outpatient sample; none were inpatients. MDD subjects: fewer than two depressive episodes of at least moderate severity; did not meet DSM-IV diagnostic criteria for recurrent major depressive disorder. Control group participants were clinically interviewed to ensure they had never experienced depressive symptoms. Exclusion criteria for all participants included contraindications to MRI; other exclusion criteria were a diagnosis of neurological disorder, head injury leading to loss of consciousness, or conditions known to affect brain structure or function (including alcohol or substance misuse), ascertained during clinical interview. Potential participants were also excluded if they or a first-degree relative had ever fulfilled criteria for mania, hypomania, schizophrenia or mood-incongruent psychosis.
Contraindications to MRI, diagnosis of neurological disorder, head injury leading to loss of consciousness or conditions known to affect brain structure or function (including alcohol or substance misuse), or if they or a first-degree relative had ever fulfilled criteria for mania, hypomania, schizophrenia or mood-incongruent psychosis.

MINI
Older adults, aged above 55, with severe depression admitted to be treated with ECT. Exclusion criteria for all participants included: 1) use of pharmacotherapeutics for treating psychiatric conditions within the past 6 months; 2) misuse of drugs within two months prior to MRI scanning; 3) two or more alcoholic drinks per week within the previous month (as assessed by the Customary Drinking and Drug Use Record, CDDR) (Brown et al., 1998); 4) a full-scale IQ score of less than 75 (as assessed by the Wechsler Abbreviated Scale of Intelligence, WASI) (Wechsler, 1999); 5) contraindications for MRI, including ferromagnetic implants and claustrophobia; 6) pregnancy or the possibility of pregnancy; 7) left-handedness; 8) prepubertal status (assessed as Tanner stages 1 or 2) (Tanner, 1962); 9) inability to understand and comply with procedures; 10) neurological disorder (including meningitis, migraine, or HIV); 11) head trauma; 12) learning disability; 13) serious health problems; and 14) complicated or premature birth (i.e., birth before 33 weeks of gestation). The MDD group was subject to the additional exclusion criterion of a primary psychiatric diagnosis other than MDD. The HCL group was subject to the additional exclusion criteria of: 1) history of mood or psychotic disorders in a first- or second-degree relative (as assessed by the Family Interview for Genetics, FIGS) (Maxwell, 1992); and 2) current or lifetime DSM-IV-TR Axis-I psychiatric disorder.

SHIP M-CIDI interview
Population-based longitudinal cohort study. MDD subjects: presence of Axis-I disorders other than MDD, anxiety disorders, conversion, somatization and eating disorders. Control subjects: no lifetime diagnosis of depression, no antidepressants, and severity index = 0. All subjects: we removed subjects with medical conditions (e.g. a history of cerebral tumor, stroke, Parkinson's disease, multiple sclerosis, epilepsy, hydrocephalus, enlarged ventricles, pathological lesions) or for technical reasons (e.g. severe movement artifacts or inhomogeneity of the magnetic field).

M-CIDI interview
Population-based longitudinal cohort study. MDD subjects: no special exclusion criteria. Control subjects: no lifetime diagnosis of depression, no antidepressants, and severity index = 0. All subjects: we removed subjects due to medical conditions (e.g. a history of cerebral tumor, stroke, Parkinson's disease, multiple sclerosis, epilepsy, hydrocephalus, enlarged ventricles, pathological lesions) or due to technical reasons (e.g. severe movement artifacts or inhomogeneity of the magnetic field).

San Raffaele Milano OSR

SCID interview adult MDD depressed inpatients
Other diagnoses on Axis I; pregnancy; history of epilepsy; major medical and neurological disorders; absence of a history of drug or alcohol dependency or abuse within the last six months; inflammation-related symptoms, including fever and infectious or inflammatory disease; uncontrolled systemic disease; uncontrolled metabolic disease or other significant uncontrolled somatic disorder known to affect mood; somatic medications known to affect mood or the immune system, such as corticosteroids, non-steroidal anti-inflammatory drugs and statins.

Singapore

SCID interview
Inclusion criteria: 1) DSM-IV diagnosis of MDD (patients); 2) age 21-65; 3) English speaking; 4) provision of informed written consent. Exclusion criteria: 1) history of significant head injury; 2) neurological diseases such as epilepsy or cerebrovascular accident; 3) impaired thyroid function; 4) steroid use; 5) DSM-IV alcohol or substance use or dependence; 6) contraindications to MRI (e.g. pacemaker, orbital foreign body, recent surgery/procedure with metallic devices/implants deployed), screened using the standard MRI Request Form from NNI; 7) pregnancy; 8) claustrophobia.

SoCAT

SCID interview
Inclusion criteria: DSM-IV diagnosis of MDD; age 18-65; right-handed; currently depressed or remitted. Control subjects: any history of psychiatric disorder. Exclusion criteria: 1) history of significant head injury; 2) neurological diseases such as epilepsy or cerebrovascular accident; 3) other diagnoses on Axis I disorders; 4)

Stanford FAA

SCID interview
Community-based DSM-diagnosed sample. MDD subjects: presence of Axis-I disorders other than MDD, anxiety and eating disorders. Control subjects: control individuals did not meet diagnostic criteria for any current psychiatric disorder. Both groups: alcohol/substance abuse or dependence within six months prior to MRI scanning; history of head trauma with loss of consciousness > 5 min; aneurysm; any neurological or metabolic disorders requiring ongoing medication or that may affect the central nervous system (including thyroid disease, diabetes, epilepsy or other seizures, or multiple sclerosis); MRI contraindications; or poor-quality MRI data (e.g., extreme movement).

SCID interview
Community-based DSM-diagnosed sample. MDD subjects: presence of Axis-I disorders other than MDD, anxiety and eating disorders. Control subjects: control individuals did not meet diagnostic criteria for any current psychiatric disorder.

Aggregate

Table 9: Balanced accuracy of support vector machines (SVM) with linear kernel, without feature selection, trained and validated with leave-one-site-out cross-validation.
More extreme values of balanced accuracy are obtained for cohorts containing no healthy subjects. Note that ComBat brings these values closer to the average across all cohorts.

Harmonization methods
We harmonized individual cortical and subcortical features with the well-established statistical harmonization algorithm ComBat 3 . Its purpose is to adjust the location (mean) and scale (variance) (L/S) of all features of the data collected from different cohorts while preserving the influence of biologically significant factors of interest in the features.
Additionally, it is assumed that the site effect is independent across cortical and subcortical features. Additive and multiplicative site effects are assumed to be probabilistic and are estimated with a Bayesian approach; in short, empirical priors are placed over the site-specific means and variances 4 . Subsequently, the cortical and subcortical features are standardized to a mean of 0 and a standard deviation of 1, while the site effect is removed. ComBat assumes that the data y_ijk for ROI k, site i and subject j can be represented by the following model:

y_ijk = α_k + X_ij β_k + γ_ik + δ_ik ε_ijk (1)

where α_k is the overall ROI value, X_ij is the row of the design matrix containing the site affiliation and controlled covariates of participant j in site i (in our case age, sex and ICV), β_k is the vector of regression coefficients corresponding to X_ij, γ_ik and δ_ik correspond to the additive and multiplicative site effects, and ε_ijk is an error term assumed to follow a normal distribution with mean zero and variance σ_k². After parameter estimation in the model above, the standardized, site-adjusted data y*_ijk can be calculated as follows:

y*_ijk = (σ̂_k / δ̂_ik) · (z_ijk − γ̂_ik) + α̂_k + X_ij β̂_k,  with  z_ijk = (y_ijk − α̂_k − X_ij β̂_k) / σ̂_k

where α̂_k, β̂_k, γ̂_ik and δ̂_ik are the estimated ComBat parameters and σ̂_k is the estimated residual standard deviation.
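To make the location/scale idea concrete, the following is a minimal Python/NumPy sketch of an L/S adjustment. It is not the actual ComBat implementation: it uses plain per-site method-of-moments estimates of the additive and multiplicative effects instead of ComBat's empirical-Bayes priors, pools the covariate fit across sites, and the function name ls_harmonize is our own.

```python
import numpy as np

def ls_harmonize(Y, X, site):
    """Toy location/scale (L/S) harmonization in the spirit of ComBat.

    Y    : (n_subjects, n_features) ROI values
    X    : (n_subjects, n_covariates) biological covariates (e.g. age, sex, ICV)
    site : (n_subjects,) integer site labels

    Simplification vs real ComBat: site effects are plain per-site
    mean/std estimates, with no empirical-Bayes shrinkage.
    """
    X1 = np.column_stack([np.ones(len(Y)), X])      # intercept column plays the role of alpha_k
    beta, *_ = np.linalg.lstsq(X1, Y, rcond=None)   # pooled covariate fit (alpha_k, beta_k)
    resid = Y - X1 @ beta                           # remove alpha_k + X beta_k
    Y_adj = np.empty_like(resid, dtype=float)
    for s in np.unique(site):
        m = site == s
        gamma = resid[m].mean(axis=0)               # additive site effect estimate
        delta = resid[m].std(axis=0)                # multiplicative site effect estimate
        Y_adj[m] = (resid[m] - gamma) / np.where(delta > 0, delta, 1.0)
    return Y_adj + X1 @ beta                        # restore the biological covariate signal
```

The empirical-Bayes shrinkage of γ and δ across features is what distinguishes real ComBat from this naive per-site standardization, and it matters most for small sites.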
All parameter estimations, including the estimates of α̂_k, β̂_k, γ̂_ik and δ̂_ik, should be computed only on the training set, i.e. 9 CV folds, to avoid non-independence of the training and test data, also known as data leakage. After parameter estimation and training of the ML algorithm were complete, the calculated parameters were used to adjust the test data, and the performance of the trained classification algorithm was measured on the test set, represented by the remaining CV fold. These parameters can be used directly to adjust data from unseen subjects in the test set only if these subjects belong to the same cohorts as in the training set. This scenario corresponds to the Splitting by Age/Sex strategy, as every CV fold contains subjects from all cohorts.
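The fit-on-train / apply-to-test discipline described above can be sketched as a small fit/transform wrapper. This is an illustrative simplification, not the study's harmonization model: it only does per-site standardization, and the class name SiteScaler is our own.

```python
import numpy as np

class SiteScaler:
    """Minimal fit/transform wrapper illustrating leak-free harmonization.

    Per-site location/scale parameters are estimated with fit() on the
    training folds only; transform() then applies those frozen parameters
    to held-out subjects from the *same* sites (Splitting by Age/Sex).
    """

    def fit(self, Y, site):
        # estimate per-site mean and std on the training folds only
        self.params_ = {s: (Y[site == s].mean(axis=0), Y[site == s].std(axis=0))
                        for s in np.unique(site)}
        return self

    def transform(self, Y, site):
        # apply the frozen training-fold parameters to held-out data
        out = np.empty_like(Y, dtype=float)
        for s, (mu, sd) in self.params_.items():
            m = site == s
            out[m] = (Y[m] - mu) / np.where(sd > 0, sd, 1.0)
        return out
```

Because transform() reuses the frozen training-fold parameters, no statistics from the held-out fold influence the adjustment, which is exactly the data-leakage safeguard described above.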
In the Splitting by Site strategy, subjects from one cohort are included in only one CV fold, so directly applying the estimated ComBat parameters to the test set is imprudent. Here we adapted the reference batch adjustment approach 5 , which consists of fixing a reference site while the other sites are adjusted to the mean and variance of the reference site according to the following model:

y_ijk = α_rk + X_ij β_rk + γ_irk + δ_irk ε_ijk

where α_rk, β_rk correspond to the coefficients estimated on the reference site r, and γ_irk, δ_irk represent the additive and multiplicative differences between site i and the reference site r. In our case, the test set was adjusted to a unified batch constructed by integrating all cohorts from the training set and adjusting them to a common mean and variance with ComBat, which allowed us to harmonize unseen cohorts without data leakage from the training set to the test set (Supplementary Figure 5). This framework was additionally extended to preserve non-linear covariate effects by substituting the linear term X_ij β_k with a Generalized Additive Model (ComBat-GAM) 6 , allowing non-linear age trends to be preserved during the harmonization step. Furthermore, we considered a CovBat model, which assumes an additional covariance site effect alongside the mean and variance corrections 7 . We tested ComBat's ability to remove the site effect from the full data by training an SVM with a linear kernel on the cortical and subcortical features to predict the site. This has mostly been tested in cases where the number of sites was below 10 6,8,9 , so it is relevant in the context of our current analysis. We used the Splitting by Age/Sex strategy, since predicting site information for the test folds requires this information to be present in the training folds, which would not be possible in the Splitting by Site strategy. The balanced accuracy was 0.854 before applying ComBat, without correction for age, sex and ICV.
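The site-prediction sanity check can be illustrated on synthetic data. The sketch below (assuming scikit-learn, with made-up feature dimensions and site counts) shows how a linear SVM detects additive site offsets almost perfectly before harmonization, and drops toward chance level after a crude per-site standardization stands in for ComBat.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(42)
n_per_site, n_sites, n_feat = 60, 4, 20
site = np.repeat(np.arange(n_sites), n_per_site)

# synthetic "ROI" features with an additive site offset baked in
offsets = rng.normal(scale=2.0, size=(n_sites, n_feat))
X = rng.normal(size=(n_sites * n_per_site, n_feat)) + offsets[site]

# before harmonization: site affiliation is trivially predictable
raw_acc = cross_val_score(LinearSVC(), X, site, cv=5).mean()

# crude "harmonization": remove each site's mean and rescale by its std
X_h = X.copy()
for s in range(n_sites):
    m = site == s
    X_h[m] = (X_h[m] - X_h[m].mean(axis=0)) / X_h[m].std(axis=0)

# after harmonization: accuracy falls toward chance (1 / n_sites)
harm_acc = cross_val_score(LinearSVC(), X_h, site, cv=5).mean()
print(raw_acc, harm_acc)
```

The same pattern appears in the real data: near-perfect site prediction before harmonization and near-chance prediction afterwards, although here the "harmonization" is only a per-site z-scoring rather than ComBat.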
Such high performance confirmed the expected strong site effect present in the data, which may interfere with the main MDD vs HC classification task. The classification balanced accuracy dropped substantially to 0.031 after applying ComBat, indicating substantial removal of site-related information from the cortical and subcortical features. Such a low accuracy comes from the fact that, with ComBat, site-related information was removed from the data, resulting in the SVM always predicting SHIP_T0, the biggest cohort. By assessing the confusion matrices, we could see that, without the harmonization step, the classification algorithm was able to predict the site affiliation of a subject successfully, except for sites coming from the same research group, e.g., MPIP (two cohorts), and in the case of the SHIP_T0-SHIP_S2 and Melbourne-MoralDilemma pairs. As an illustration, an example of a feature being harmonized via ComBat can be seen in Supplementary Figure 6.
To further investigate whether the differences across sites were mainly driven by irregular age and sex distributions, we repeated the classification task while also regressing age and sex out of the features. This resulted in balanced accuracies of 0.816 and 0.031 for predicting site without and with ComBat harmonization, respectively. By comparing the results with and without this residualization step, we can infer that the classification performance in differentiating sites is only minimally rooted in differences in the age and sex distributions between sites. To see whether the site effect was due to differences in scanners and scan acquisition protocols between cohorts, we trained an SVM to predict the scanner type from the cortical and subcortical features. The resulting accuracy was 0.875, even higher than for site prediction alone. This hints at the site effect being primarily caused by differences in acquisition equipment across sites.
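The residualization step, regressing age and sex out of each feature before predicting the site, amounts to a least-squares projection; a minimal sketch follows (the function name residualize is our own).

```python
import numpy as np

def residualize(Y, C):
    """Regress the covariates C (e.g. age, sex) out of the features Y.

    Fits Y = [1, C] @ B by ordinary least squares and returns the
    residuals, which are orthogonal to the covariates by construction.
    """
    C1 = np.column_stack([np.ones(len(C)), C])  # add intercept column
    B, *_ = np.linalg.lstsq(C1, Y, rcond=None)
    return Y - C1 @ B
```

If site prediction stays accurate on such residuals, as it did here (0.816), the site effect cannot be explained by linear age and sex distribution differences alone.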
Supplementary Figure 6: An example of site effect removal by ComBat for left pars opercularis thickness. Color corresponds to the site affiliation. While the differences between sites are reduced, the remaining differences correspond to age- and sex-related differences between cohorts (middle and bottom).