Brain–phenotype models fail for individuals who defy sample stereotypes

Individual differences in brain functional organization track a range of traits, symptoms and behaviours1–12. So far, work modelling linear brain–phenotype relationships has assumed that a single such relationship generalizes across all individuals, but models do not work equally well in all participants13,14. A better understanding of in whom models fail and why is crucial to revealing robust, useful and unbiased brain–phenotype relationships. To this end, here we related brain activity to phenotype using predictive models—trained and tested on independent data to ensure generalizability15—and examined model failure. We applied this data-driven approach to a range of neurocognitive measures in a new, clinically and demographically heterogeneous dataset, with the results replicated in two independent, publicly available datasets16,17. Across all three datasets, we find that models reflect not unitary cognitive constructs, but rather neurocognitive scores intertwined with sociodemographic and clinical covariates; that is, models reflect stereotypical profiles, and fail when applied to individuals who defy them. Model failure is reliable, phenotype specific and generalizable across datasets. Together, these results highlight the pitfalls of a one-size-fits-all modelling approach and the effect of biased phenotypic measures18–20 on the interpretation and utility of resulting brain–phenotype models. We present a framework to address these issues so that such models may reveal the neural circuits that underlie specific phenotypes and ultimately identify individualized neural targets for clinical intervention.


Supplementary Figures
Supplementary Figure 1. Similarity (rank correlation) of misclassification frequency (MF; averaged across in-scanner conditions) derived from classification analyses of the given Yale phenotypic measure using raw and motion-regressed functional connectivity (FC). Diagonal: mean = 0.78, s.d. = 0.12.

Supplementary Figure 2. Relationships between each pair of covariates in the Yale dataset. Individual variable distributions on the main diagonal. For pairwise relationships, two continuous covariates presented as a scatterplot, one continuous and one categorical as a boxplot, and two categorical as a faceted bar plot. Boxplot line and hinges represent median and quartiles, respectively; whiskers extend to most extreme non-outliers. Lines on scatterplots reflect smoothed conditional means and their 95% CI. Relationships between variables significantly related to MF enclosed in boxes. For each significant (P < 0.05, via two-tailed Spearman correlation and Mann-Whitney U test) such relationship, we re-tested the relationship between each variable in the pair with MF while controlling for the other, either via partial Spearman correlation (…)

… large sample-based, broadly generalizable models50,51, and subtype-specific models that do not require such large samples and that capture meaningfully distinct neural representations of the modeled phenotype. Further, this framework can be extended to consideration of multiple phenotypes, via either joint modeling of multiple phenotypic measures (as in ref. 52) or post-hoc intersection of phenotype-specific groups to yield more nuanced subtypes. Training models in subgroups, however, may fail to yield useful brain–phenotype relationships if the phenotypic measure itself is biased53 (see Causes and implications of model failure).
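The control analysis described above (re-testing a covariate's relationship with MF while holding the other covariate in the pair fixed, via partial Spearman correlation) can be sketched in plain NumPy: rank-transform all three variables, residualize the two ranks of interest on the ranks of the control variable, and correlate the residuals. This is an illustrative reimplementation, not the authors' code; variable names are hypothetical.

```python
import numpy as np

def rankdata(x):
    # average (mid-) ranks, handling ties
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x))
    sx = x[order]
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and sx[j + 1] == sx[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def pearson(a, b):
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

def partial_spearman(x, y, z):
    # Spearman correlation of x and y controlling for z:
    # correlate the residuals of each set of ranks after
    # regressing out the ranks of the control variable z
    rx, ry, rz = rankdata(x), rankdata(y), rankdata(z)
    Z = np.column_stack([np.ones(len(z)), rz])
    res_x = rx - Z @ np.linalg.lstsq(Z, rx, rcond=None)[0]
    res_y = ry - Z @ np.linalg.lstsq(Z, ry, rcond=None)[0]
    return pearson(res_x, res_y)
```

On toy data in which two variables are related only through a shared third variable, the plain Spearman correlation is large while the partial correlation collapses toward zero, mirroring the logic of the control analysis.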

Covariate-outcome relationships may be varied and complex
As discussed in the main text, relationships between covariates and the outcome of interest can take many forms. A phenotypic outcome may be related to a covariate, but this shared variance need not overlap with the variance shared between outcome and brain. Alternatively, it is possible that a given outcome-covariate relationship holds across the entirety of the sample, or that a sample is homogeneous in the given domain (e.g., age); in these cases, the covariate would not influence misclassification frequency.
Finally, if the outcome-covariate shared variance does overlap with the outcome-brain shared variance54, it need not do so in the same manner in all participants.
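The distinction drawn here, between outcome-covariate shared variance that does and does not overlap with outcome-brain shared variance, can be made concrete with a toy simulation. All quantities below are synthetic and purely illustrative: when the covariate's contribution to the outcome is independent of the brain signal, removing it sharpens the brain-outcome relationship; when the covariate carries the same variance as the brain signal, removing it weakens that relationship.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
brain = rng.normal(size=n)  # stand-in for a brain-derived predictor

# Case A: covariate relates to the outcome through variance
# that is independent of the brain signal (no overlap)
cov_a = rng.normal(size=n)
outcome_a = brain + cov_a

# Case B: covariate relates to the outcome through the same
# variance the brain signal carries (full overlap)
cov_b = brain + 0.5 * rng.normal(size=n)
outcome_b = brain + 0.5 * rng.normal(size=n)

def corr(a, b):
    return float(np.corrcoef(a, b)[0, 1])

def residualize(y, x):
    # remove the linear effect of x from y
    X = np.column_stack([np.ones(len(x)), x])
    return y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

# Removing a non-overlapping covariate strengthens the brain-outcome link
r_raw_a = corr(brain, outcome_a)
r_adj_a = corr(brain, residualize(outcome_a, cov_a))

# Removing an overlapping covariate weakens the brain-outcome link
r_raw_b = corr(brain, outcome_b)
r_adj_b = corr(brain, residualize(outcome_b, cov_b))
```

In Case A the residualized correlation approaches 1, while in Case B it drops well below the raw correlation, illustrating why the same covariate-adjustment step can either help or hurt a brain-phenotype model.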

Additional limitations and future directions
In this work, sample constraints permitted us to identify the existence of stereotypical profiles and the implications of profile inconsistency (i.e., misclassification), but prevented comprehensive, precise profile characterization. In particular, we use the terms "white" and "racialized groups" (see Methods for definition) not to ignore the differences among ethnic, immigrant, cultural, and racial groups in the United States, but rather to address sample size limitations (see Extended Data Fig. 7 for more specific racial and ethnic breakdowns of each sample). While this dichotomy is frequently used in research, we recognize that it is arbitrary, and that race is a crude, often misleading proxy55 for social and economic disparities (e.g., early-life experiences56, disparate educational quality57, socioeconomic status58, discrimination and perceived discrimination59, experiences of segregation and racism60, social status61, and neighborhood disadvantage62). Further, the limitations of using "white" as the de facto reference group63,64 and of measuring education quantity rather than quality57 have been well documented. We encourage researchers to incorporate these considerations into study design, as well as data collection and analysis (see Causes and implications of model failure: Limitations and future directions and Extended Data Fig. 6).

Supplementary Figure 3.
Relationships between each pair of covariates in the UCLA dataset, presented as in Supplementary Fig. 2. If only one covariate in a pair remained significantly related to MF after control, that covariate's name is reported (e.g., Rx for Hopkins/Rx indicates that medication status remains significantly associated with low-scorer and high-scorer MF after controlling for symptom severity, Supplementary …).

Table 1.
In-scanner tasks and corresponding RDoC domains and constructs.

Table 2.
Measures used in the post-scan behavioral battery, with corresponding RDoC domains and constructs where relevant.

Table 3.
Demographic and clinical information for the Yale, UCLA, and HCP datasets. m, mean; s.d., standard deviation. For categorical variables, the number of participants per group is reported. Note that different measures (and thus scales) were in many cases used across datasets (see main text).

Table 4.
Mean classification accuracy (averaged across 100 iterations) for each phenotypic measure using FC calculated from all in-scanner conditions in the Yale dataset.

Table 5.
Mean classification accuracy (averaged across 100 iterations) for each phenotypic measure using FC calculated from all in-scanner conditions in the UCLA dataset. Number of classified individuals (i.e., high or low, non-outlier score) for each measure, followed by size of training sample (range given subsampling after holding out test data) in parentheses. BART, balloon analog risk task; PAMe, paired associates memory task, encoding; PAMr, paired associates memory task, retrieval; SCAP, spatial working memory capacity task; TS, task switching. Significance determined via one-tailed permutation testing; significant P values (P < 0.05) in parentheses, all FDR adjusted (24 comparisons).

Table 6.
Mean classification accuracy (averaged across 1000 iterations, given 10-fold analysis) for each phenotypic measure using FC calculated from all in-scanner conditions in the HCP dataset. Number of classified individuals (i.e., high or low score) for each measure, followed by size of training sample (range given subsampling after holding out test data) in parentheses. Significance determined via one-tailed permutation testing; P values in parentheses, all FDR adjusted (20 comparisons).

Table 7.
Mean classification accuracy, averaged across iterations and in-scanner conditions for each phenotypic measure in the Yale dataset, using motion-regressed (MoR) and raw FC for classification. SS, Symbol Search. Note that, as expected given that motion, a proxy for performance, has been regressed from the brain data, classification performance is overall lower in the MoR case than in the raw case, further highlighting that phenotypic measures reflect constellations of covariates. This correction is complicated, however, by the group-specific relationships between motion and phenotype (see Causes and implications of model failure).
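Motion regression of the kind referenced for Table 7 is commonly implemented by residualizing each FC edge on a per-participant motion summary (e.g., mean framewise displacement) across the sample. The sketch below assumes that generic approach; the authors' exact pipeline may differ, and the function and variable names are hypothetical.

```python
import numpy as np

def regress_motion(fc, motion):
    """Regress a per-participant motion summary out of every FC edge.

    fc:     (n_participants, n_edges) functional-connectivity matrix
    motion: (n_participants,) motion summary, e.g. mean framewise displacement

    Returns motion-regressed FC with the same shape, in which every
    edge is linearly uncorrelated with the motion summary.
    """
    X = np.column_stack([np.ones(len(motion)), motion])
    beta, *_ = np.linalg.lstsq(X, fc, rcond=None)  # fit all edges at once
    return fc - X @ beta
```

Because least-squares residuals are orthogonal to the regressors, each edge of the returned matrix has (up to numerical precision) zero correlation with motion, which is the intended effect of the correction discussed in the legend.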

Table 8.
FDR-adjusted (90 tests) P values for above-chance performance in cross-dataset analyses, calculated via one-tailed comparison to the distribution of accuracy across 100 iterations of classification of permuted labels (respecting family-related limits on exchangeability in the HCP dataset25,26). We note that correction for multiple comparisons is complicated by the non-independence of the tests (i.e., within dataset, Test: CCP and MCP are subsets of Test: All), but present FDR-adjusted P values as more conservative estimates of significance than uncorrected P values. All, whole sample; CCP, correctly classified participants within sample; MCP, misclassified participants within sample; Voc, vocabulary. Yale/UCLA and Yale/HCP indicate the pair of datasets used for the presented analyses (e.g., "Yale/UCLA, Yale train" refers to the analysis in which Yale data were used to train the model, which was subsequently tested on UCLA data).

… in medians between groups X and Y; RG, racialized groups; W, white; NRx, not taking psychiatric medication; Rx, taking psychiatric medication; NDx, no diagnosis via interview; Dx, one or more diagnoses via interview. For clarity and concision, test statistics are only reported for P ≤ 0.1 for a given measure (results with P > 0.1 in gray). *Significant (P < 0.05) terms in regression of mean (low/high) MF on the subset of covariates significantly related to MF in the given dataset (as in Fig. 4a). *Significant (P < 0.05) terms in regression of mean phenotypic score on the subset of covariates significantly related to MF (as in Fig. 4b). n in each analysis: Yale correct pairwise = 111–129, Yale misclassified pairwise = 105–123, Yale correct regression = 128, Yale misclassified regression = 122, UCLA correct pairwise = 139, UCLA misclassified pairwise = 92, UCLA correct regression = 139, UCLA misclassified regression = 92, HCP correct pairwise = 492–503, HCP misclassified pairwise = 264–271, HCP correct regression = 492, HCP misclassified regression = 264.
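The one-tailed permutation tests and FDR adjustment used throughout these tables can be sketched generically as follows. The permutation P value is the proportion of null (permuted-label) accuracies at least as large as the observed accuracy, with a +1 correction so P is never exactly zero, and the FDR adjustment is standard Benjamini-Hochberg. This is an illustrative implementation, not the authors' code.

```python
import numpy as np

def permutation_pvalue(observed, null_accuracies):
    # One-tailed P value: fraction of null accuracies >= observed,
    # with +1 in numerator and denominator so P is never exactly 0
    null_accuracies = np.asarray(null_accuracies)
    return (1 + np.sum(null_accuracies >= observed)) / (1 + len(null_accuracies))

def fdr_bh(pvals):
    # Benjamini-Hochberg FDR-adjusted P values
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # enforce monotonicity from the largest P value downward
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adj = np.empty(m)
    adj[order] = np.clip(scaled, 0, 1)
    return adj
```

For example, an observed accuracy exceeding all 99 permuted-label accuracies yields P = (1 + 0) / (1 + 99) = 0.01, and the BH step then adjusts such P values jointly across the family of tests (90, 24 or 20 comparisons in the tables above).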