Systematic misestimation of machine learning performance in neuroimaging studies of depression

We currently observe a disconcerting phenomenon in machine learning studies in psychiatry: while we would expect larger samples to yield better results due to the availability of more data, larger machine learning studies consistently show much weaker performance than the numerous small-scale studies. Here, we systematically investigated this effect, focusing on one of the most heavily studied questions in the field, namely the classification of patients suffering from Major Depressive Disorder (MDD) and healthy controls based on neuroimaging data. Drawing upon structural MRI data from a balanced sample of N = 1868 MDD patients and healthy controls from our recent international Predictive Analytics Competition (PAC), we first trained and tested a classification model on the full dataset, which yielded an accuracy of 61%. Next, we mimicked the process by which researchers would draw samples of various sizes (N = 4 to N = 150) from the population and showed a strong risk of misestimation. Specifically, for small sample sizes (N = 20), we observed accuracies of up to 95%; for medium sample sizes (N = 100), accuracies of up to 75% were still found. Importantly, further investigation showed that sufficiently large test sets effectively protect against performance misestimation, whereas larger datasets per se do not. While these results question the validity of a substantial part of the current literature, we outline the relatively low-cost remedy of larger test sets, which is readily available in most cases.
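To make the sub-sampling experiment concrete, the following minimal sketch (not the authors' exact pipeline; the synthetic data, feature count, split ratio, effect size, and number of repetitions are all hypothetical) illustrates how repeatedly drawing small samples from a fixed population inflates the best accuracy a single small study can report:

# Illustrative sketch only (not the authors' pipeline): synthetic data,
# feature count, split ratio, and effect size are hypothetical.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)

# Hypothetical population of 1868 subjects with a weak class signal,
# mimicking a "true" accuracy of roughly 60%.
n_pop, n_features = 1868, 100
y_pop = np.repeat([0, 1], n_pop // 2)
X_pop = rng.normal(size=(n_pop, n_features))
X_pop[y_pop == 1, 0] += 0.4  # small separation on one feature

def reported_accuracy(n_sample):
    """Draw a balanced sub-sample of n_sample subjects, split it 80/20,
    and return the accuracy a single study would report."""
    idx = np.concatenate([
        rng.choice(np.where(y_pop == c)[0], n_sample // 2, replace=False)
        for c in (0, 1)
    ])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_pop[idx], y_pop[idx], test_size=0.2,
        stratify=y_pop[idx], random_state=0)
    model = SVC(kernel="linear", C=1.0).fit(X_tr, y_tr)
    return balanced_accuracy_score(y_te, model.predict(X_te))

for n in (20, 100, 500):
    accs = [reported_accuracy(n) for _ in range(200)]
    print(f"N = {n:3d}: best of 200 'studies' = {max(accs):.0%}")

Because each small sample yields only a noisy estimate, the maximum across many such "studies" drifts far above the population-level accuracy, whereas estimates from larger samples stay close to it.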


Appendix B. Predictive Analytics Competition 2018
The medical machine learning lab of Prof. Dr. Tim Hahn invited teams from all over the world to develop a model distinguishing patients suffering from Major Depressive Disorder (MDD) from healthy individuals based on structural Magnetic Resonance Imaging (sMRI) data. This competition was called the Predictive Analytics Competition 2018 (PAC 2018).

Data
The data for this competition, comprising sMRI datasets of N = 2,240 subjects with and without MDD, was provided by the Institute of Translational Psychiatry, Münster. The images were preprocessed in advance with the SPM toolbox CAT12 (Matlab 9.0 / SPM12 rev. 6685 / CAT12 v.1184) and quality-checked. In addition to the diagnosis, participants received age, gender, total intracranial volume (TIV), and scanner site, so that these could be considered as covariates.
The data was split into a training and a test set in advance (Table B.2). The test set was held back and only used in the final step to designate the winner. The "PAC Award Winner 2018" was announced in a ceremony held at the annual meeting of the Organization for Human Brain Mapping (OHBM) in Singapore on 21 June 2018.

Results
In total, 49 teams with at least 170 participants registered. The winner was determined by the highest balanced accuracy score on the held-back test set. The team with the highest score was "paranoidandroid" (Table B.3), which won the PAC 2018 Award.

Appendix D. Influence of the scanner distribution
Scanner-site distributions were not well balanced in the PAC sample. This imbalance was even stronger in our randomly drawn sub-samples. To determine the influence of the different scanner sites on model accuracy, we took the same methodological approach that we used for the "train sample size effect analysis" and determined the scanner distribution for each set. The imbalance of the scanner distribution in each set was quantified using the Gini index. We then calculated Spearman's rho between the scanner distribution and the accuracy on the hold-out test set (N = 300). We show that the scanner distribution has a statistically significant influence on test set accuracy but explains only 0.79% of the variance.
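As an illustration, the following sketch reproduces the shape of this analysis under two assumptions: the Gini coefficient as one common formulation of the Gini index, and hypothetical scanner-site counts and accuracies in place of the real sub-samples:

# Minimal sketch of the Appendix D analysis; the site counts and
# accuracy values below are hypothetical placeholders.
import numpy as np
from scipy.stats import spearmanr

def gini_coefficient(counts):
    """Gini coefficient of the scanner-site counts:
    0 = all sites equally represented, values near 1 = one site dominates."""
    x = np.sort(np.asarray(counts, dtype=float))
    n = x.size
    cum = np.cumsum(x)
    # Standard formula based on the Lorenz curve of the sorted counts
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

# Hypothetical scanner-site counts and hold-out accuracies for a few
# randomly drawn sub-samples.
site_counts = [[10, 9, 11], [25, 3, 2], [14, 8, 8], [28, 1, 1]]
accuracies = [0.60, 0.68, 0.61, 0.70]

ginis = [gini_coefficient(c) for c in site_counts]
rho, p_value = spearmanr(ginis, accuracies)
print(f"Spearman's rho = {rho:.2f}, p = {p_value:.3f}")
# In the spirit of the appendix, rho squared gives the share of
# variance explained by the scanner distribution.
print(f"Variance explained = {rho**2:.2%}")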

Appendix E. Adjustment of SVM regularization based on sample size
The appropriate regularization of a linear SVM depends on the absolute number of outliers in a sample. To exclude the possibility that the use of default hyperparameters (a constant value of C = 1) caused the effect that we observed, we adjusted the SVM regularization (the hyperparameter C) based on the size of each analyzed sub-sample. In this adjustment, the sample size N is divided by a constant k. To approximate the default parameter C = 1 on average, we set k = 75. For example, C = 100/75 ≈ 1.33 for a sample size of N = 100. The adjusted regularization did not deliver results as good as those we observed with C = 1 (see Figures E.1 and E.2). Therefore, we conclude that the default value of C = 1 across changing values of N did not increase the probability of misestimating accuracy as sample size decreased.
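A minimal sketch of this adjustment, assuming scikit-learn's SVC as the linear SVM; only the rule C = N/k and the constant k = 75 come from the text, the function name is illustrative:

# Sample-size-dependent regularization: C = N / k with k = 75.
from sklearn.svm import SVC

K = 75  # chosen so that C is approximately 1 for the average sub-sample size

def adjusted_linear_svm(n_sample, k=K):
    """Linear SVM whose regularization scales with sample size: C = N / k."""
    return SVC(kernel="linear", C=n_sample / k)

model = adjusted_linear_svm(100)
print(round(model.C, 2))  # 1.33 for N = 100, as in the example above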

Appendix F. Alternative Machine Configurations
To exclude the possibility that the observed effects can be traced back solely to our specific configuration, we tested other common configurations. Each configuration consists of a preprocessing step and a classifier. For preprocessing, we combined a method for feature-space reduction with a method for feature selection. The feature space was reduced with a Principal Component Analysis (PCA), of which only a certain number of the first components was retained. Afterwards, for feature selection, an ANOVA was calculated and a specific number of features, beginning with the highest F-values, was selected. The exact preprocessing configurations are listed in Table F.10. For the classification, we chose three specific machines (a sketch of how these configurations can be assembled follows below):
1. An SVM with a linear kernel and default parameter (C = 1.0)
2. An SVM with an RBF kernel and default parameters (C = 1.0; γ = 1/n_features)
3. A Random Forest with default parameter (n_estimators = 100)
In combination with the preprocessing, this results in 48 configurations. Since the observed effect is limited to the test set size, we repeated our analyses of the test set only for each of these configurations.
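The following sketch shows how such a grid of configurations could be assembled with scikit-learn pipelines; the three classifiers and their parameters follow the list above, while the PCA and ANOVA grids are hypothetical stand-ins for the values in Table F.10:

# Sketch of how the 48 configurations could be assembled; only the
# three classifiers and their parameters come from the text.
from sklearn.base import clone
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

pca_components = [None, 10, 50, 100]  # None = no PCA (hypothetical grid)
kbest_features = [None, 2, 5, 10]     # None = no ANOVA selection (hypothetical;
                                      # k may not exceed the retained components)
classifiers = {
    "svm_linear": SVC(kernel="linear", C=1.0),
    "svm_rbf": SVC(kernel="rbf", C=1.0, gamma="auto"),  # gamma = 1 / n_features
    "random_forest": RandomForestClassifier(n_estimators=100),
}

pipelines = []
for n_comp in pca_components:
    for k in kbest_features:
        for clf_name, clf in classifiers.items():
            steps = []
            if n_comp is not None:
                steps.append(("pca", PCA(n_components=n_comp)))
            if k is not None:
                steps.append(("anova", SelectKBest(f_classif, k=k)))
            # clone() so each pipeline owns an independent estimator
            steps.append((clf_name, clone(clf)))
            pipelines.append(Pipeline(steps))

print(len(pipelines))  # 4 * 4 * 3 = 48 configurations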

Appendix F.1. Results
The results are comparable to those of the originally used configuration, an SVM with a linear kernel and no preprocessing (see Figures F.3 to F.50 and Tables F.11 to F.58). The results thus underline the general validity of the findings. One outlier can be found for a specific configuration, a PCA with 10 components combined with the SVM with an RBF kernel (see Figure F.22 and Table F.30). This result can be explained by overfitting: the machine constantly returns a single class as its prediction.