Data leakage inflates prediction performance in connectome-based machine learning models

Predictive modeling is a central technique in neuroimaging to identify brain-behavior relationships and test their generalizability to unseen data. However, data leakage undermines the validity of predictive models by breaching the separation between training and test data. Leakage is always an incorrect practice but still pervasive in machine learning. Understanding its effects on neuroimaging predictive models can inform how leakage affects existing literature. Here, we investigate the effects of five forms of leakage–involving feature selection, covariate correction, and dependence between subjects–on functional and structural connectome-based machine learning models across four datasets and three phenotypes. Leakage via feature selection and repeated subjects drastically inflates prediction performance, whereas other forms of leakage have minor effects. Furthermore, small datasets exacerbate the effects of leakage. Overall, our results illustrate the variable effects of leakage and underscore the importance of avoiding data leakage to improve the validity and reproducibility of predictive modeling.


S3. Family leakage analysis
Beyond the three original phenotypes and models in this study, we considered several additional phenotypes from the Child Behavioral Checklist (CBCL) [56], as well as one additional model (Random Forest). The phenotypes were the Anxiety and Depression CBCL Syndrome Scale Raw Score (Anx/Dep), Aggressive CBCL Syndrome Scale Raw Score (Aggression), Internal CBCL Syndrome Scale Raw Score (Internal), and External CBCL Syndrome Scale Raw Score (External). The Random Forest was included in case the other models (ridge regression, SVR, CPM) had a memorization capacity too low to reveal the effects of leakage. For the Random Forest [27], 10 estimators were used, and a grid search was performed over the maximum depth (3, 5, 7, 9).

We found that, across all models and phenotypes, the median performance with twin leakage was greater than the median gold standard performance (Figure S2). The effects of leakage were greater for ridge regression and SVR than for CPM and the Random Forest. Although not tested, family leakage may become increasingly impactful for complex models, such as neural networks, that are trained with larger datasets and more participants per family.

We compared the similarity of each phenotype between twins with the corresponding increase in prediction performance in a leaky vs. non-leaky pipeline. As a metric of similarity, we took the ratio of the mean absolute error of the phenotype between each twin pair to the mean absolute error between the participant and all non-twin participants, and this ratio was averaged across all participants. Following this description, the MAE ratio for participant p can be written as:

$$\text{MAE ratio}_p = \frac{\lvert y_p - y_{t(p)} \rvert}{\frac{1}{\lvert N_p \rvert} \sum_{q \in N_p} \lvert y_p - y_q \rvert}$$

where y_p is the phenotype of participant p, t(p) is the twin of p, and N_p is the set of all participants other than p and t(p). Thus, a value closer to 0 reflects greater similarity of that phenotype between twins. We chose to use phenotype similarity instead of literature estimates of heritability because a phenotype such as age is not necessarily "heritable," but it is identical (or nearly identical due to different interview dates) for twins. Similarly, the CBCL measures were determined by a parent questionnaire and thus may reflect the tendencies of a parent in answering questions rather than explicit heritability of a trait.
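The similarity metric described above can be illustrated with a minimal sketch. It assumes one phenotype score per participant and exactly one twin per participant. The function name `mae_ratio`, the dictionary layout, and the toy values are hypothetical and are not taken from the study's code.

```python
from statistics import mean


def mae_ratio(phenotype, twin_of):
    """Average twin-vs-non-twin absolute-error ratio (illustrative sketch).

    phenotype: dict mapping participant id -> phenotype score
    twin_of:   dict mapping participant id -> id of that participant's twin
    Assumes every participant has at least one non-twin participant whose
    score differs from their own; otherwise the denominator would be zero.
    """
    ratios = []
    for p, twin in twin_of.items():
        # Absolute phenotype difference within the twin pair.
        twin_diff = abs(phenotype[p] - phenotype[twin])
        # Mean absolute difference to every participant who is not p or p's twin.
        others = [q for q in phenotype if q not in (p, twin)]
        other_diff = mean(abs(phenotype[p] - phenotype[q]) for q in others)
        ratios.append(twin_diff / other_diff)
    # Average the per-participant ratios across all participants.
    return mean(ratios)


# Toy data (made up): two twin pairs with differing scores.
scores = {"a1": 1.0, "a2": 3.0, "b1": 5.0, "b2": 9.0}
twins = {"a1": "a2", "a2": "a1", "b1": "b2", "b2": "b1"}
ratio = mae_ratio(scores, twins)

# Twins with identical scores (e.g., age) give a ratio of 0.
identical = {"a1": 10.0, "a2": 10.0, "b1": 14.0, "b2": 14.0}
ratio_identical = mae_ratio(identical, twins)
```

A ratio near 0 means twins are far more alike on the phenotype than unrelated participants are, while a ratio near 1 means twins are no more alike than any other pair.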
The most similar phenotypes did not necessarily show greater leakage effects (Figure S3). This result could point toward the limited memorization capacity of the model, given the study design. For example, if there were many members per family, the model may more easily "memorize" the signature of family members, and the effects of leakage may be greater. Furthermore, we performed a simulation that altered the percentage of one-individual families in the dataset. To do this, we started with only families with multiple individuals, and then we added in random fractions of the participants without family members. In general, the effects of leakage increased as the fraction of participants coming from families with multiple members increased (Figure S4). However, the effects were still relatively small.

Supplementary Figure 7. Similarity of coefficients across all pipelines, averaged over 100 random seeds, related to

Supplementary Figure 2. Comparison of prediction performance between the gold standard and twin leakage in the ABCD twin subset, related to Figure 6. Boxplot elements were defined as follows: the center line is the median across 100 random iterations; box limits are the upper and lower quartiles; whiskers are 1.5x the interquartile range; points are outliers. Four models (ridge regression, SVR: support vector regression, CPM: connectome-based predictive modeling, Random Forest) and seven phenotypes were included. In all cases, the median performance was higher for twin leakage compared to the gold standard. AP: Attention CBCL Syndrome Scale Raw Score; MR: WISC-V Matrix Reasoning Total Raw Score; Anx/Dep: Anxiety and Depression CBCL Syndrome Scale Raw Score; Aggression: Aggressive CBCL Syndrome Scale Raw Score; Internal: Internal CBCL Syndrome Scale Raw Score; External: External CBCL Syndrome Scale Raw Score.

Supplementary Figure 3. Comparison of prediction performance with twin leakage (y-axis) and the gold standard (x-axis) colored by phenotype similarity, related to Figure 6. An MAE ratio closer to zero entails greater similarity between twins. The shape of each point indicates the phenotype. AP: Attention CBCL Syndrome Scale Raw Score; MR: WISC-V Matrix Reasoning Total Raw Score; Anx/Dep: Anxiety and Depression CBCL Syndrome Scale Raw Score; Aggression: Aggressive CBCL Syndrome Scale Raw Score; Internal: Internal CBCL Syndrome Scale Raw Score; External: External CBCL Syndrome Scale Raw Score.

Figure 8. For each pair of the 13 pipelines, we computed the correlation between their coefficients. Missing values (i.e., no site information in PNC) are shown in black. ABCD: Adolescent Brain Cognitive Development; HBN: Healthy Brain Network; HCPD: Human Connectome Project Development; PNC: Philadelphia Neurodevelopmental Cohort.

Table 1. Summary of the leakage types used in this study and their mapping to those used by