Main

MRI data (such as cortical thickness or resting-state functional connectivity (RSFC)) are increasingly being used for the ambitious task of relating individual differences in brain structure and function to typical variation in complex psychological phenotypes (for example, cognitive ability and psychopathology). To clearly distinguish such BWAS from other neuroimaging research, we formally define them as ‘studies of the associations between common inter-individual variability in human brain structure/function and cognition or psychiatric symptomatology’. Classically univariate, BWAS have recently been facilitated by more powerful, but more difficult to interpret, multivariate prediction techniques (for example, support vector regression (SVR) and canonical correlation analysis (CCA)). BWAS hold great promise for predicting and reducing psychiatric disease burden and advancing our understanding of the cognitive abilities that underlie humanity’s intellectual feats. However, obtaining MRI data remains expensive (approximately US$1,000 per hour), resulting in small-sample BWAS findings that have not been replicated7,8,9,10.

Factors that have contributed to poor reproducibility of population-based research in psychology11, genomics12 and medicine13, such as methodological variability14, data mining for significant results15, overfitting16, confirmation and publication biases17, and inadequate statistical power5, probably also affect BWAS. Researchers are starting to address replication failures by standardizing analyses, pre-registering hypotheses, publishing null results and sharing data and code18. Nevertheless, there have been concerns that reliance on relatively small samples (the median sample size (n) in openneuro.org studies as of September 2021 is 23) may also be contributing to BWAS replication failures5,19,20,21. Small studies are most vulnerable to sampling variability, the random variation of an association across population subsamples. Sampling variability decreases and associations stabilize with increasing sample sizes19,22, at a rate proportional to 1/√n. Thus, if true brain-wide associations were smaller than previously assumed (for example, bivariate linear correlation r = 0.2–0.8), larger samples would be required to accurately measure them19,20. Other population-based sciences aiming to robustly characterize relatively small effects—such as epidemiology and genomics (that is, genome-wide association studies (GWAS))—have steadily increased sample sizes12 from below 100 to over 1,000,000.
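For intuition, a standard back-of-the-envelope approximation (our illustration; not part of the original analyses) makes the 1/√n scaling explicit. After the Fisher transformation $z = \operatorname{arctanh}(r)$, the sampling standard error of a correlation is approximately

$$\mathrm{SE}(z) \approx \frac{1}{\sqrt{n-3}},$$

so confidence intervals around brain-wide associations shrink in proportion to $1/\sqrt{n}$. At n = 25, for example, the 99% confidence half-width is roughly $2.58/\sqrt{22} \approx 0.55$ in Fisher-z units (≈0.50 after back-transformation), in line with the empirical r ± 0.52 interval reported below.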

Recently, neuroimaging consortia have collected samples orders of magnitude larger than before (for example, the Adolescent Brain Cognitive Development23 (ABCD) study, n = 11,874; Human Connectome Project24 (HCP), n = 1,200; and UK Biobank25 (UKB), n = 35,735), enabling accurate estimation of BWAS effect sizes. Beginning with the ABCD Study and using the HCP and UKB data for verification, we performed billions of univariate and multivariate analyses to evaluate BWAS effect sizes and reproducibility as a function of sample size, using sample sizes from small (n = 25) to large (n = 32,572).

Precise BWAS require large samples

BWAS relate population variability in brain features (for example, RSFC between two brain regions (edge)) and behavioural phenotypes (for example, cognitive ability). To estimate brain-wide associations in ABCD data, we correlated widely used cortical thickness and RSFC metrics with 41 measures indexing demographics, cognition and mental health (Supplementary Table 1). Brain-wide associations were estimated across multiple levels of anatomical resolution in both structural (cortical vertices, regions of interest (ROI) and networks) and functional (connections (edges), principal components and networks) MRI data (Fig. 1). To ameliorate the effects of nuisance variables such as head motion, we applied strict denoising strategies (n = 3,928; >8 min of RSFC data after frame censoring at a filtered framewise displacement (filtered-FD) < 0.08 mm; Methods, ‘DCANBOLDproc preprocessing’). Repeat analyses using less rigorous motion censoring, which retained a larger subset of the full ABCD sample (n = 9,753), produced a similar BWAS effect size distribution (Supplementary Fig. 1).

Fig. 1: Effect sizes and sampling variability of univariate brain-wide associations.
figure 1

ABCD Study sample data (n = 3,928). a, b, Effect sizes were estimated using standard correlations (bivariate linear r). Brain-wide association histograms (normalized to per panel maximum bin) of cortical thickness with cognitive ability (left, green) and psychopathology (right, purple) at all levels of analysis (vertex, ROI and network; for separated levels of analysis see Supplementary Fig. 2a, b) (a), and RSFC with cognitive ability (left, green) and psychopathology (right, purple) at all levels of analysis (edge, network and component) (b). c, d, The largest brain-wide associations (ROI, top 10%) for cortical thickness with cognitive ability (left, green) and psychopathology (right, purple) (c), and RSFC with cognitive ability (left, green) and psychopathology (right, purple) (d). e, f, Sampling variability (1,000 resamples per sample size in logarithmically spaced bins: n = 25, 33, 50, 70, 100, 135, 200, 265, 375, 525, 725, 1,000, 1,430, 2,000, 2,800 and 3,928 (3,604 for cortical thickness)) of the largest brain-wide association for each brain–behavioural phenotype pair, for cortical thickness with cognitive ability (left, green) and psychopathology (right, purple) (e), and RSFC with cognitive ability (left, green) and psychopathology (right, purple) (f). Solid lines represent the mean across 1,000 resamples. Shading represents the minimum to maximum correlation range across subsamples, for a given sample size. Grey dashed line represents the 95% confidence interval and the black dashed line represents the 99% confidence interval. g, h, Examples of two n = 25 subsamples, in which inaccurate default mode network (DMN) correlations were observed for cortical thickness with cognitive ability (left, green) and psychopathology (right, purple) (g), and RSFC with cognitive ability (left, green) and psychopathology (right, purple) (h). Black dashed line denotes linear fit from full sample.

Source data

BWAS analyses frequently link a single brain feature to a single behavioural phenotype. In Fig. 1a, b, we show the distributions of such univariate associations between cortical thickness and RSFC and two extensively studied phenotypes, cognitive ability (NIH Toolbox total score) and psychopathology (Child Behavior Checklist (CBCL) total score; Methods, ‘Psychological and demographic data’; Supplementary Table 1; Supplementary Fig. 2 for non-overlapping histograms). In the full, rigorously denoised ABCD sample (n = 3,928), across all brain-wide associations, the median univariate effect size (|r|) was 0.01 (Extended Data Fig. 1). The largest 1% of all possible brain-wide associations (around 11 million associations in total) exceeded |r| = 0.06 (Fig. 1a, b). The top 10% largest associations were distributed across sensorimotor and association cortex (Fig. 1c, d). Across all univariate brain-wide associations, the largest correlation that replicated out-of-sample was |r| = 0.16. Sociodemographic covariate adjustment decreased effect sizes, especially for the strongest associations (top 1% Δr = −0.014; Extended Data Fig. 2).

Smaller brain-wide association studies have reported larger univariate correlations (r > 0.2) than the largest effects we measured in much larger samples. To resolve this apparent contradiction, we simulated the effects of independent research groups using samples of varying sizes to estimate the same brain–phenotype association. For the strongest univariate brain-wide associations, we charted sampling variability as a function of sample size (Fig. 1e, f; n = 25–3,928). At n = 25, the 99% confidence interval for univariate associations was r ± 0.52, documenting that BWAS effects can be strongly inflated by chance. In larger samples (n = 1,964 in each split half), the top 1% largest BWAS effects were still inflated by r = 0.07 (78%), on average (Supplementary Fig. 3). At n = 25, two independent population subsamples can reach the opposite conclusion about the same brain–behaviour association (for example, Fig. 1g, h), solely owing to sampling variability. See Supplementary Figs. 4–6 for sampling variability as a function of sample size for all brain metrics and behavioural phenotypes.

Task fMRI data have also been correlated with cognitive phenotypes. Recent studies have suggested that treating task fMRI data similarly to RSFC and combining the two modalities could strengthen BWAS effects slightly26. Therefore, we also estimated univariate associations for combined task and rest functional connectivity in ABCD Study data27, which produced the same distribution of association strengths (top 1% |r| > 0.06) as RSFC. The HCP collected a wide variety of fMRI tasks, enabling us to compute all brain-wide associations between 86 task activation contrasts and 39 behavioural measures. The distributions of BWAS effect sizes for classical task fMRI activations and RSFC were closely matched (Extended Data Fig. 3, Supplementary Discussion).

Low measurement reliability can attenuate the observed correlation between two variables. Within-person measurement reliability for the exemplar behavioural phenotypes (NIH Toolbox28, r = 0.90; CBCL29, r = 0.94) and imaging measures (cortical thickness30, r > 0.96; RSFC: ABCD, r = 0.48; HCP, r = 0.79; UKB, r = 0.39; Extended Data Fig. 4) is moderate to high. Whereas behavioural (NIH Toolbox, CBCL) and cortical thickness measures are already close to their reliability ceiling, further improvements in RSFC measurement reliability could theoretically increase effect sizes slightly (Supplementary Fig. 7, Supplementary Discussion). Theoretical maximum BWAS effect sizes are unlikely to be reached owing to fundamental biological limits on the strength of the true association and/or the limitations of behavioural phenotyping and MRI physics.
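The classical attenuation formula (Spearman) makes this ceiling explicit; we include it here as an illustrative aid rather than as part of the original analyses:

$$r_{\mathrm{observed}} = r_{\mathrm{true}} \sqrt{\rho_{xx}\,\rho_{yy}},$$

where $\rho_{xx}$ and $\rho_{yy}$ are the reliabilities of the brain and behavioural measures. With, for example, ABCD RSFC reliability of 0.48 and NIH Toolbox reliability of 0.90, an observed r = 0.06 corresponds to a disattenuated $r_{\mathrm{true}} \approx 0.06/\sqrt{0.48 \times 0.90} \approx 0.09$, illustrating why improved reliability could increase effect sizes only modestly.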

Effect sizes replicate across datasets

Since the ABCD Study data (n = 11,874; age range: 9–10 years; 20 min of RSFC collected) were from a 21-site paediatric cohort (multiple scanner types), we sought to replicate BWAS effect sizes in single-site, single-scanner-type adult data. Thus, we used the HCP dataset, which contains the most data per participant among large studies (n = 1,200; age range: 22–35 years; single scanner; 60 min of RSFC collected), and the UKB dataset, which has the largest sample size but less RSFC data per participant (n = 35,735; age range: 40–69 years; single scanner type; 6 min of RSFC collected), to verify univariate BWAS effect size distributions. All three datasets contain RSFC and cognitive ability data. To control for sample size effects, the ABCD and UKB datasets were subsampled to match the HCP (n = 900, strict denoising). Across the three size-matched datasets we found similar effect size distributions for associations between RSFC and cognitive ability (Fig. 2; top 1% at n = 900: ABCD, |r| > 0.11; HCP, |r| > 0.12; UKB, |r| > 0.09; Extended Data Fig. 5; see Supplementary Fig. 8 for all ABCD/HCP cognitive measures).

Fig. 2: Effect sizes of brain-wide associations are consistent across the largest neuroimaging study samples.
figure 2

Univariate BWAS effect sizes from correlations (linear bivariate r) between fluid intelligence and edge-wise RSFC are shown for HCP, ABCD and UKB study samples. The ABCD (n = 3,928) and UKB (n = 32,572) datasets were subsampled (with replacement) 100 times to match the HCP sample size (n = 900), revealing consistent effect sizes (medians: HCP |r| = 0.03, ABCD |r| = 0.03, UKB |r| = 0.02). See Extended Data Fig. 5 for UKB resampling to both ABCD and HCP sample sizes.

Source data

To account for potential multi-site effects, we directly compared sampling variability between the HCP (single site) and ABCD datasets (Extended Data Fig. 6a), and between a single ABCD site (n = 603) and the 20 remaining sites (Extended Data Fig. 6b). Sampling variability was equivalent for single- and multi-site samples, underscoring the effectiveness of the ABCD Study’s cross-site harmonization efforts23. The generalizability of the univariate BWAS effect size distribution (Fig. 2, Extended Data Figs. 5, 6) across age (9–69 years), sites, scanner types and pulse sequences suggests that it is universal to BWAS with current technologies and methods.

Statistical errors limit reproducibility

Statistical error rates depend on effect sizes and significance testing thresholds. To quantify how the pairing of smaller than expected effect sizes and sampling variability (that is, random variation of an association across population subsamples) affects BWAS reproducibility, we used non-parametric bootstrapping19 to generate smaller BWAS subsamples and characterized the relationship between statistical errors and sample size across significance thresholds (P < 0.05 to P < 10−7; Fig. 3; Supplementary Fig. 9 for UKB). We verified the results with analytic statistical power estimations31 (Supplementary Fig. 10).

Fig. 3: Statistical errors and reproducibility of univariate brain-wide associations.
figure 3

Data from ABCD Study sample (n = 3,928; see Supplementary Fig. 9 for UKB). a, False negative rates (relative to full sample; see Methods, ‘False positives, false negatives and power’) for correlations (bivariate linear r) between psychological phenotypes and brain features (cortical thickness: vertex-wise; RSFC: edge-wise) as a function of sample size and two-tailed P value (P value thresholding was identical in full sample and subsamples; same P values used in c–f). b, Magnitude error rates for three levels of effect size inflation (50%, 100% and 200%) as a function of sample size and statistical threshold (P < 0.05 and P < 10−7 (Bonferroni-corrected); P value threshold was the same in full sample and subsamples). c, Sign error rates reported as the percentage of subsamples with the opposite sign as the full sample, as a function of sample size and P value. d, Statistical power of subsamples relative to full sample (same sign, both significant) as a function of sample size and P value. e, Probability of replicating (same sign, both significant) a univariate brain–phenotype association out-of-sample across P values (note: data end at n ≈ 2,000, as the replication sample is half of the full sample). Replication rates follow the square of power. f, False positive rates of subsamples relative to the full sample as a function of sample size and P value.

Source data

Statistical errors were pervasive across BWAS sample sizes. Even for samples as large as 1,000, false negative rates (Fig. 3a) were very high (75–100%) and half of the statistically significant associations were inflated by at least 100% (Fig. 3b). More lenient statistical thresholding reduces false negatives and effect size inflation, but increases the rate of sign errors (Fig. 3c). Statistical power (1 − false negative rate), which indexes the probability of detecting a significant effect, remained low even for relatively large sample sizes: the maximum statistical power was 0.68 at n = 3,928 (Fig. 3d).

Given the high statistical error rates and low power of univariate BWAS in typically sized samples, we quantified the probability that a significant univariate association would replicate in a size-matched replication dataset (Fig. 3e; P from 10−7 to 0.05). In keeping with common practice, we defined successful replication as passing the same statistical threshold in sample and out of sample. At the largest split half sample size (n = 1,964), 25% of univariate BWAS replications succeeded with a threshold of P < 0.05. At sample sizes more typical for BWAS (n < 500), replication rates were around 5% (Fig. 3e).

Paradoxically, correcting for multiple comparisons reduces the probability of successfully replicating univariate BWAS effects (Fig. 3d, e). More stringent statistical thresholding reduces false positive rates (Fig. 3f) but increases false negative rates (Fig. 3a), thus lowering statistical power (Fig. 3d, Extended Data Fig. 7). In underpowered BWAS, stricter statistical thresholds select for very large correlations, which are the most likely to be inflated due to sampling variability (Fig. 1e, f). With Bonferroni multiple-comparisons correction (P < 10−7), a sample size of 9,500 was required to be 80% powered for detecting the top 1% largest (r > 0.06) BWAS effects (Supplementary Fig. 10a), compared with a sample size of 2,200 for uncorrected P < 0.05 (Supplementary Fig. 10b).
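These analytic sample size requirements can be reproduced, approximately, with the standard Fisher z-method. The following minimal Python sketch (ours, not the code behind Supplementary Fig. 10) recovers requirements of the same order as those reported above:

```python
# Illustrative sketch (ours): approximate the sample size needed to detect a
# correlation r with a two-tailed test, via the Fisher z-approximation.
import numpy as np
from scipy import stats

def n_required(r, alpha, power=0.80):
    z_r = np.arctanh(r)                      # Fisher-transformed effect size
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # critical value of the test
    z_beta = stats.norm.ppf(power)           # quantile for the target power
    return int(np.ceil(((z_alpha + z_beta) / z_r) ** 2 + 3))

# Top 1% univariate BWAS effect size (r ~ 0.06):
print(n_required(0.06, alpha=0.05))   # ~2,200 at uncorrected P < 0.05
print(n_required(0.06, alpha=1e-7))   # ~11,000 at P < 10^-7, the same order
                                      # as the ~9,500 reported above
```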

Multivariate BWAS reproducibility

Multivariate methods use weighted brain patterns to predict a single behavioural phenotype (SVR; for example, cognitive ability), or combinations of multiple phenotypes (CCA; for example, all NIH Toolbox subscales). To examine multivariate brain-wide associations as a function of sample size, we trained SVR (Supplementary Figs. 11–13) and CCA (Supplementary Figs. 14, 15) models on discovery set data (in-sample; including nested cross-validation (SVR) and principal component analysis (PCA) dimensionality reduction (SVR and CCA); Methods, ‘Multivariate out-of-sample replication’) and subsequently tested their generalization to the replication set using standard out-of-sample estimates of SVR (rpred) and CCA (rCV1) association strength (Fig. 4). Sampling variability was assessed by generating 100 bootstrapped subsamples for each sample size. Multivariate out-of-sample associations were tested for statistical significance using non-parametric null distributions (>99% confidence interval).

Fig. 4: Multivariate brain-wide associations.
figure 4

a–f, In-sample brain–behavioural phenotype associations as a function of out-of-sample associations and sample size. Mean multivariate brain–behavioural phenotype associations across 100 bootstrap samples at n = 200 (red dots) and for the full sample (black dots). Grey dashed lines represent the significance threshold for out-of-sample correlations (>99% confidence interval of permutations), determined on the full sample (see Methods, ‘Multivariate out-of-sample replication’). Data are from the ABCD Study; full sample sizes: cortical thickness n = 1,814; RSFC n = 1,964. a, b, For SVR, out-of-sample association strength is reported as the correlation between predicted and observed phenotype scores (rpred) using models trained on the discovery set. SVR of cortical thickness (a) and RSFC (b), with cognitive ability (green, left) and psychopathology (purple, right). c, d, For CCA, out-of-sample association strength is reported as the correlation of phenotypic and brain scores in the first canonical variate pair (rCV1) when discovery set weights are applied to the replication set. CCA of cortical thickness (c) and RSFC (d), with all NIH Toolbox (green, left) and CBCL (purple, right) subscales. e, Differences between out-of-sample (SVR: rpred; CCA: rCV1) and corresponding in-sample associations by multivariate method (left), imaging modality (middle) and behavioural phenotype (right); normalized to per-panel maximum. On average, out-of-sample associations (mean r = 0.17) were smaller (∆r = −0.29; 63% reduction) than in-sample associations (mean r = 0.46), similar to replication effect size reductions in cancer biology49 and psychology50. f, SVR out-of-sample association (rpred) as a function of univariate effect size (r; top 1% for each phenotype) across the 41 phenotypes (bivariate linear r = 0.79, orange line).

Source data

Across multivariate methods (SVR and CCA), imaging modalities (cortical thickness and RSFC) and behavioural phenotypes (cognitive ability and psychopathology), small discovery samples typical for neuroimaging generated variable, inflated in-sample associations that frequently did not pass statistical significance thresholds (Fig. 4a–d). Increasing sample sizes into the thousands yielded moderate statistical replication, with reduced variability and smaller differences between in-sample and out-of-sample associations. On average, RSFC (versus cortical thickness) and cognitive (versus psychopathology) measures provided stronger out-of-sample associations (Fig. 4a–d) that were closer to in-sample estimates (Fig. 4e). Narrowing the definition of replication to detecting statistical significance in out-of-sample data did not alleviate the need for large sample sizes (Supplementary Table 2).

Multivariate out-of-sample associations were stronger than univariate associations, particularly at large sample sizes (for example, maximum RSFC–crystallized intelligence association: SVR rpred = 0.39, univariate r = 0.16). Even at the largest sample sizes (n ≈ 2,000), multivariate in-sample associations remained inflated on average (in-sample to out-of-sample: Δr = −0.29; Fig. 4e, Supplementary Fig. 16; see Extended Data Fig. 8 for univariate) and feature weights were variable (Supplementary Fig. 13). Out-of-sample replication was maximized by using a relatively low-dimensional feature space (Supplementary Figs. 11, 12, 14, 15), reaffirming that brain-wide associations are represented in widely distributed circuitry, consistent with univariate BWAS (Fig. 1c, d). Across behavioural phenotypes, multivariate out-of-sample associations were robustly linked to univariate effect sizes (r = 0.79, P < 0.001; Fig. 4f).

The underpowered BWAS paradox

At smaller sample sizes, the largest, most inflated BWAS effects are most likely to be statistically significant and therefore, paradoxically, the most likely to be published5,21,32. Typically, BWAS have been sufficiently powered to detect statistical significance only for inflated associations (Fig. 3d). High sampling variability in smaller samples frequently generates strong associations by chance19 (Fig. 1e, f). Stricter in-sample statistical thresholding (that is, multiple-comparison correction)—which is common in neuroimaging—lowers BWAS power, thus trapping us deeper in the paradox by selecting for even more inflated effects (Fig. 3). When attempting to replicate inflated BWAS associations, regression to the mean (actual effect size) makes non-significance (that is, replication failure) the most likely outcome (Figs. 3, 4, Extended Data Fig. 8). Bias in favour of significant, larger BWAS effects has limited the publication of null results, perpetuating inflated effect sizes that form the basis for subsequent power analyses and meta-analyses.

Importance of small-sample neuroimaging

There is no one-size-fits-all solution for neuroimaging studies; minimum sample size requirements depend on the study design. Neuroimaging-only studies are typically adequately powered at small sample sizes. For example, central tendencies of human functional brain organization among groups can be accurately represented by averaging within small samples (that is, n = 25; Supplementary Fig. 17). Precise individual-specific RSFC and fMRI activation brain maps can be generated by repeatedly sampling the same individual33. Small samples have also provided blueprints for reducing MRI artefacts34, increasing the amount of usable data35.

Using non-BWAS approaches, many fundamental links between the human brain and behaviour have been uncovered and replicated in small neuroimaging samples36. Within-person designs (for example, longitudinal37), studies with induced effects (for example, lesions38 or tasks39), or both (for example, interventions40) frequently have increased measurement reliability and effect sizes. For rarer clinical conditions, amassing large samples is impossible. In many cases, within-person, induced-effects approaches are not only cost-effective, but also most relevant to clinical care. Thus, small-sample neuroimaging will always be critical for studying the human brain.

Importance of large samples for BWAS

Large neuroimaging consortium data (ABCD, HCP and UKB) have revealed that small BWAS effects and population sampling variability routinely result in inflated, irreproducible brain–phenotype associations until sample sizes reach well into the thousands (Extended Data Fig. 9). Therefore, BWAS should use datasets with at least thousands of high-quality samples processed with standardized pipelines14. Additional consideration should be given to potential confounding effects and interpretations of statistical significance41.

The recovery of genomics from its reproducibility crisis has set a valuable example for BWAS12. Early candidate-gene studies were underpowered, and many associations between common genetic variants and psychiatric phenotypes could not be replicated42. In response, GWAS consortia have grown genomic samples into the millions43, taken advantage of specialized study designs (for example, twins) and methodological innovations (for example, polygenic risk scores), and set strict data standards. Fortunately, BWAS findings can achieve reproducibility in relatively smaller samples than GWAS, owing to larger effect sizes.

Reproducibly linking brain and behaviour

All brain–behaviour studies will benefit from technological advances that generate higher quality brain and behavioural data with greater efficiency, such as real-time quality control35, multi-band multi-echo44 sequences and thermal denoising for fMRI45, as well as deep behavioural phenotyping with ecological momentary assessment46 and passive sensing.

As with GWAS47, funding agencies should boost the aggregation of BWAS-appropriate datasets through mandatory sharing policies. Even for large datasets collected and processed identically, in-sample associations are stronger than out-of-sample replications (Fig. 4e, Extended Data Fig. 8); therefore, reporting both in-sample and out-of-sample effect sizes should be a requirement for publication and funding. BWAS may also benefit from focusing data collection on the most robust brain–phenotype associations (for example, functional versus structural and direct behavioural versus questionnaire).

The brain, in contrast to the genome, is expected to change over time and can be manipulated ethically. For greater effect sizes and statistical power, neuroscience should focus on within-participant study designs over cross-sectional study designs, and on interventional (therapy, medications, brain stimulation and surgery) over observational study designs. Rather than associating pre-defined psychological constructs and brain features48, data-driven, combined brain–behaviour phenotypes will further advance our understanding of cognition and mental health. Altogether, our prospects for linking neuroimaging markers to complex human behaviours are better than ever.

Methods

ABCD Study sample

This project used the baseline ABCD BIDS (Brain Imaging Data Structure) data consisting of RSFC data from 10,259 participants released through the ABCD-BIDS Community Collection51 (ABCD collection 3165; https://github.com/ABCD-STUDY/nda-abcd-collection-3165) and demographic and behavioural data from 11,572 participants aged 9–10 years from the ABCD 2.0 release52. The ABCD Study obtained centralized institutional review board (IRB) approval from the University of California, San Diego. Each of the 21 sites also obtained local IRB approval. Ethical regulations were followed during data collection and analysis. Parents or caregivers provided written informed consent, and children gave written assent.

In addition to data from the ABCD 2.0 release, we used the ABCD reproducible matched samples51 (ARMS), available in ABCD collection 3165, that divided individuals from the full behavioural sample (n = 11,572) into discovery (n = 5,786) and replication (n = 5,786) sets, which were matched across 9 variables: site location, age, sex, ethnicity, grade, highest level of parental education, handedness, combined family income, and prior exposure to anaesthesia. Family members (that is, sibling pairs, twins and triplets) were kept together in the same set and the two sets were matched to include equal numbers of single participants and family members. These split ARMS datasets were used for replicability analyses.

Head motion can systematically bias neuroimaging studies53. However, these systematic biases can be addressed through rigorous head motion correction. Therefore, we used strict inclusion criteria with regard to head motion. Specifically, inclusion criteria for the current project (see Casey et al.23 for broader ABCD inclusion criteria) consisted of at least 600 frames (8 min) of low-motion54 (filtered-FD < 0.08 mm) RSFC data. Our final dataset consisted of RSFC data from a total of n = 3,928 youth across the discovery (n = 1,964) and replication (n = 1,964) sets. The final discovery and replication sets did not differ in mean framewise displacement (difference in means = 0.002, t = 0.60, P = 0.55) or total frames included (difference in means = 6.4, t = 0.94, P = 0.35). The participant lists for ARMS samples can be found in the ABCD-BIDS Community Collection (ABCD collection 3165) for community use51.

ABCD MRI acquisition

Imaging was performed at 21 sites in the United States, harmonized across Siemens Prisma, Philips and GE 3T scanners. Details on image acquisition can be found in ref. 23. Twenty minutes (4 × 5 min runs) of eyes-open resting-state blood oxygenation level dependent (BOLD) data were acquired to ensure at least 8 min of low-motion data. All resting-state fMRI scans used a gradient-echo echo planar imaging (EPI) sequence (repetition time = 800 ms, echo time = 30 ms, flip angle = 90°, voxel size = 2.4 mm isotropic, 60 slices). Head motion was monitored using framewise integrated real-time MRI monitoring (FIRMM) software at many of the Siemens sites35.

ABCD-BIDS processing overview

ABCD and UKB MRI data processing was completed with the freely available ABCD-BIDS pipeline51 (https://github.com/DCAN-Labs/abcd-hcp-pipeline). Data were downloaded and converted to the BIDS format using ABCD-Dicom2BIDS (https://github.com/DCAN-Labs/abcd-dicom2bids). Only data that passed the fast-track quality control (QC; tagged prior to ABCD release 2.0) were processed (also see release notes: https://collection3165.readthedocs.io/en/stable/). The ABCD-BIDS pipeline is a modification of the original HCP pipeline55. In brief, this MRI data-processing pipeline comprises seven stages. (1) PreFreesurfer normalizes anatomical data. This normalization entails brain extraction, denoising and then bias field correction on anatomical T1- and/or T2-weighted data. The ABCD-HCP pipeline includes two additional modifications to improve output image quality: ANTs56 DenoiseImage models scanner noise as a Rician distribution and attempts to remove such noise from the T1 and T2 anatomical images, and ANTs N4BiasFieldCorrection attempts to smooth relative image histograms in different parts of the brain, improving bias field correction. (2) FreeSurfer57 constructs cortical surfaces from the normalized anatomical data. This stage performs anatomical segmentation, white–grey and grey–CSF cortical surface construction, and surface registration to a standard surface template. Surfaces are refined using the T2-weighted anatomical data. Mid-thickness surfaces, which represent the average of the white–grey and grey–CSF surfaces, are generated here. (3) PostFreesurfer converts prior outputs into an HCP-compatible format (that is, CIFTI), transforms the volumes to a standard volume template space using ANTs nonlinear registration, and transforms the surfaces to the standard surface space via spherical registration. (4) The Vol (volume) stage corrects for functional distortions via reverse-phase-encoded spin-echo images. All resting-state runs underwent intensity normalization to a whole-brain mode value of 1,000, within-run correction for head movement, and functional data registration to the standard template. The atlas transformation was computed by registering the mean intensity image from each BOLD session to the high-resolution T1 image and then applying the anatomical registration to the BOLD image. This atlas transformation, mean field distortion correction, and resampling to 3 mm isotropic atlas space were combined into a single interpolation using the FSL58 applywarp tool. (5) The Surf (surface) stage projects the normalized functional data onto the template surfaces, as described below. (6) DCANBOLDproc, an fMRI and fcMRI preprocessing stage that we added, is also described below. (7) Last, an executive summary is provided for easy participant-level QC across all processed data.

fMRI surface processing

The BOLD fMRI volumetric data were sampled to each participant’s original mid-thickness left- and right-hemisphere surfaces, constrained by the grey-matter ribbon. Once sampled to the surface, time courses were deformed and resampled from the individual’s original surface to the 32k fs_LR surface in a single step. This resampling allows point-to-point comparison across individuals registered to this surface space. These surfaces were then combined with volumetric subcortical and cerebellar data into the CIFTI format using Connectome Workbench59, creating full-brain time courses excluding non-grey-matter tissue. Finally, the resting-state time courses were smoothed with a 2 mm full-width-at-half-maximum kernel applied to geodesic distances on surface data and Euclidean distances on volumetric data.

DCANBOLDproc preprocessing

Additional BOLD preprocessing steps were executed to reduce spurious variance unlikely to reflect neuronal activity34. First, a respiratory filter was used to improve the framewise displacement estimates calculated in the Vol stage54. Second, temporal masks were created to flag motion-contaminated frames using the improved framewise displacement estimates53. Frames with a filtered-FD > 0.3 mm were flagged as motion-contaminated for nuisance regression only. After computing the temporal masks for high-motion frame censoring, the data were processed with the following steps: (1) demeaning and detrending; (2) interpolation across censored frames using least-squares spectral estimation of the values at censored frames, so that the continuous data could be (3) denoised via a general linear model (GLM) with whole-brain, ventricular and white-matter signal regressors, as well as their derivatives. Denoised data were then passed through (4) a band-pass filter (0.008 Hz < f < 0.1 Hz) without re-introducing nuisance signals60 or contaminating frames near high-motion frames.
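These steps can be summarized in a minimal sketch (ours, for illustration only; not the released DCANBOLDproc code). Toy arrays stand in for the BOLD time courses, nuisance regressors and filtered-FD traces, and the censored-frame interpolation of step (2) is noted but omitted for brevity:

```python
# Illustrative sketch (ours) of censoring-aware nuisance regression and
# band-pass filtering.
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
T, V = 600, 1000                        # toy sizes: frames x grayordinates
bold = rng.standard_normal((T, V))      # stand-in BOLD time courses
nuisance = rng.standard_normal((T, 6))  # stand-in mean signals + derivatives
fd = rng.random(T) * 0.5                # stand-in filtered-FD per frame (mm)
TR = 0.8                                # ABCD repetition time (s)

keep = fd <= 0.3                        # frames retained for the nuisance GLM
bold = signal.detrend(bold, axis=0)     # (1) demean and detrend
# (3) fit nuisance regressors on low-motion frames only, remove everywhere
beta, *_ = np.linalg.lstsq(nuisance[keep], bold[keep], rcond=None)
bold -= nuisance @ beta
# (4) band-pass 0.008-0.1 Hz; a full implementation would first perform
# step (2), interpolating across censored frames by least-squares spectral
# estimation, so that motion artefacts do not spread through the filter
b, a = signal.butter(2, [0.008, 0.1], btype="bandpass", fs=1 / TR)
bold = signal.filtfilt(b, a, bold, axis=0)
```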

Generation of RSFC matrices

ABCD RSFC data consist of 4 × 5 min runs. For each participant with full brain coverage, all available RSFC data were concatenated and high-motion frames (filtered-FD > 0.08 mm) were censored. The timeseries of BOLD activity for each ROI was correlated with that of every other ROI (333 cortical ROIs from Gordon et al.61; 61 subcortical ROIs from Seitzman et al.62), forming a 394 × 394 correlation matrix, which was subsequently Fisher z-transformed. For network-level analyses, correlations were averaged across previously defined canonical functional networks61. Inter-individual difference connectome-wide spatial components, which are not bound by network boundaries63,64, were computed by performing PCA on a matrix composed of all ROI × ROI pairs (edges) from each participant.
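As a minimal illustration of this step (ours, not the original code; a toy array stands in for one participant's censored, concatenated ROI time courses):

```python
# Illustrative sketch (ours) of RSFC matrix generation for one participant.
import numpy as np

rng = np.random.default_rng(0)
ts = rng.standard_normal((600, 394))  # toy: frames x 394 ROI time courses

r = np.corrcoef(ts.T)                 # 394 x 394 ROI-ROI correlation matrix
np.fill_diagonal(r, 0)                # zero the unit diagonal before arctanh
z = np.arctanh(r)                     # Fisher z-transform

iu = np.triu_indices(394, k=1)        # upper triangle: 77,421 unique edges
edges = z[iu]                         # stacking one such row per participant
print(edges.shape)                    # gives the matrix used for edge-wise
                                      # BWAS and for PCA across participants
```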

Generation of cortical thickness metrics

For each participant, cortical thickness was extracted from 59,412 cortical vertices. For ROI-level analyses, cortical thickness was averaged within each cortical parcel61 (n = 333). For network-level analyses, cortical thickness was averaged within each cortical network61 (n = 13). Inter-individual spatial components were computed by performing PCA on a matrix composed of all cortical vertices from each participant.

Psychological and demographic data

The ABCD Study population is well-characterized with hundreds of demographic, physical, cognitive, and mental health variables65. The current project examined the associations between 41 of these variables (Supplementary Table 1) and brain structure (cortical thickness) and function (RSFC). Psychological and demographic variables were selected to reflect the primary domains of interest, cognition (individual subscales and composite scores from the NIH Toolbox) and mental health (individual subscales and composite scores from the CBCL), as well as demographic and physical variables relevant to development (for example, age) and health (for example, body mass index).

Psychological and demographic covariates

The primary goal of this project was to study how the pairing of brain–phenotype effect sizes and sampling variability (random variation across samples, as opposed to systematic variation threatening causal inference66) can account for widespread replication failures. Hence, our results focus on bivariate associations (correlation) and standard multivariate models linking brain structure and function to psychological and demographic variables without covariate adjustment. However, we did examine the influence of sociodemographic covariates standardly used in ABCD analyses (race, gender, parental marital status, parental income, Hispanic versus non-Hispanic ethnicity, family and data collection site) on BWAS effect sizes, noting that they generally decrease effect sizes, particularly for the largest BWAS effects (see Extended Data Fig. 2). Furthermore, the ABCD subsamples (ARMS; see above) we used for replication analyses are matched for salient demographic factors (site location, family composition, age, sex, ethnicity, grade, highest level of parental education, handedness, combined family income and prior exposure to anaesthesia; see above). Also, where possible, ABCD-distributed age-corrected scores were used, given that (1) there are well-established age-related changes in these measures and (2) age-corrected scores improved normality for many measures (for example, CBCL syndrome scales and broadband factors).

Capture of psychological and demographic data

The ABCD Data Analysis and Informatics Center (DAIC) has released an online tool called DEAP (Data Exploration and Analysis Portal), which can be accessed at https://deap.nimhda.org/. In this Article, we introduce an additional tool called ABCDE (ABCD Boolean Capture Data Explorer, developed by B.P.K.), which we have used for preparation of the data herein. ABCDE complements DEAP by allowing for finer-grained control of data extraction on the researcher’s own computer rather than through a web portal. The source code and documentation can be accessed at https://gitlab.com/DosenbachGreene/abcde.

Univariate brain–behavioural phenotype correlations

For each brain measure at a given level of organization, we correlated the brain measures (structure: cortical thickness; function: RSFC) with each psychological variable. Cognitive ability (total composite score on the NIH Toolbox) and psychopathology (total score on the CBCL) are presented in the main text; all others are included in Extended Data Fig. 1. Correlations between brain and phenotypes were generated for RSFC at the edge level (ROI–ROI pairs (n = 77,421)), network level (average of RSFC within and between each network (n = 105)) and component level (principal component weights (n = 100)). To extract components representing inter-individual differences, we vectorized each participant’s RSFC matrix, concatenated the vectorized matrices and then performed PCA (Matlab’s pca.m function). Correlations between brain and phenotypes were generated for cortical thickness at the vertex level (n = 59,412), ROI level (n = 333) and network level (n = 13). Repeat analyses employing less rigorous motion censoring, and thus retaining a larger subset of the full ABCD sample (n = 9,753), replicated the effect sizes (top 1% largest effects: |r| > 0.06).
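For illustration, the edge-level mass-univariate step reduces to correlating each column of a participants × edges matrix with a phenotype vector; a minimal sketch (ours, with toy data; 10,000 edges stand in for the 77,421 in the real data):

```python
# Illustrative sketch (ours) of mass-univariate BWAS: Pearson r between
# every edge and a phenotype across participants.
import numpy as np

rng = np.random.default_rng(0)
N, E = 500, 10_000
edges = rng.standard_normal((N, E))   # toy participants x edges matrix
y = rng.standard_normal(N)            # toy phenotype (e.g. NIH Toolbox total)

def mass_univariate_r(X, y):
    """Pearson r between each column of X and y."""
    Xz = (X - X.mean(0)) / X.std(0)
    yz = (y - y.mean()) / y.std()
    return Xz.T @ yz / len(y)

r_edges = mass_univariate_r(edges, y)
print(np.percentile(np.abs(r_edges), 99))   # top 1% effect-size cutoff
```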

Resampling procedures

To examine the distribution of correlations for iteratively larger sample sizes, we randomly selected participants with replacement from the full sample (n = 3,928, post denoising) at logarithmically spaced sample sizes (16 intervals: n = 25, 33, 50, 70, 100, 135, 200, 265, 375, 525, 725, 1,000, 1,430, 2,000, 2,800 and 3,928). For cortical thickness data, the same sampling bins were used, with the exception of the final bin (full sample), which contained n = 3,604 participants. At each sample size, we randomly sampled participants 1,000 times, resulting in 16,000 brain–psychological phenotype resamplings for each brain–phenotype correlation. For multivariate approaches, 100 bootstrap samples were computed across the logarithmically spaced sample sizes (16 intervals: n = 25, 33, 45, 60, 80, 100, 145, 200, 256, 350, 460, 615, 825, 1,100, 1,475 and 1,964 (1,814 for cortical thickness)). We note that the number of iterations was reduced for multivariate methods (100 iterations) owing to their high computational costs. In addition, the multivariate analyses were primarily focused on mean estimates, rather than the full distribution. We also performed sensitivity analyses to quantify sampling variability using data from only singletons (that is, no sibling and/or twin pairs), which was nearly identical to sampling variability in the full sample (including siblings and/or twins; Extended Data Fig. 10; Δr = 0.0005). For highlighting the effects of sampling variability (Fig. 1e, f), we extracted the brain–phenotype correlation with the largest effect size for each imaging modality (cortical thickness and RSFC) and exemplar phenotype (cognitive ability and psychopathology). The sampling variability (range of possible correlations, 99% confidence interval and 95% confidence interval) at each sampling interval for correlations of RSFC and cortical thickness with cognitive ability and psychopathology is presented in the main text (Fig. 1e, f); correlations between brain measures and other behaviours can be found in Supplementary Figs. 4, 5.
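A minimal sketch of this resampling procedure (ours, not the original code; toy vectors stand in for one brain feature and one phenotype):

```python
# Illustrative sketch (ours): bootstrap one brain-phenotype correlation at
# logarithmically spaced sample sizes and report the min-max sampling range.
import numpy as np

rng = np.random.default_rng(0)
N = 3928
x = rng.standard_normal(N)              # toy: strongest brain feature
y = 0.1 * x + rng.standard_normal(N)    # toy: phenotype, weak true effect

sizes = [25, 33, 50, 70, 100, 135, 200, 265, 375,
         525, 725, 1000, 1430, 2000, 2800, 3928]
for n in sizes:
    rs = []
    for _ in range(1000):               # 1,000 resamples per bin
        idx = rng.choice(N, size=n, replace=True)
        rs.append(np.corrcoef(x[idx], y[idx])[0, 1])
    print(n, round(min(rs), 2), round(max(rs), 2))
```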

Sampling variability examples with a sample size of 25

Using the outputs from the resampling procedures above, we used the 1,000 resamplings with n = 25 to examine the correlation between the DMN and cognitive ability (total composite score on the NIH Toolbox), as well as the DMN and psychopathology (total problem score on the CBCL), for both cortical thickness and RSFC. To demonstrate how sampling variability affects correlations, the 1,000 resamples were ranked by effect size. Subsequently, we selected two samples from the top 10 samples (in terms of effect size): one with a significant positive association and one with a significant negative association.

ABCD task data

Data from three in-scanner fMRI tasks (n-back, stop signal and monetary incentive delay) were concatenated to the 4 × 5 min resting-state runs (rest + task) to determine whether additional data affected the effect size estimates. After data were concatenated across the 4 conditions (rest + 3 task states), correlation matrices were generated and correlated with psychological phenotypes as detailed above (see ‘Univariate brain–behavioural phenotype correlations’). Task events were not regressed67. Data processing steps for task data were the same as for RSFC, including the removal of frames with a filtered-FD > 0.08 mm.

Correlations between behavioural phenotypes

To examine the range of sampling variability as a function of sample size between 41 psychological and demographic measures (Supplementary Fig. 6), we randomly selected participants with replacement from the full behavioural sample (n = 11,572) at logarithmically spaced sample sizes (9 intervals: n = 25, 50, 100, 200, 500, 1,000, 2,000, 4,000 and 9,000). At each interval, we randomly sampled participants 1,000 times, resulting in 9,000 behaviour–behaviour phenotype correlation resamplings for each association. For each association between behavioural phenotypes, we quantified sampling variability at each sampling bin as the range of correlations observed through this resampling procedure.

False positives, false negatives and power

False negative (Fig. 3a) and false positive (Fig. 3f) rates were derived through resampling (see ‘Resampling procedures’) for all vertex-wise (cortical thickness) and edge-wise (RSFC) brain-wide associations. For each sample size bin (16 total), we randomly sampled with replacement n individuals (1,000 subsamples) and computed the brain–behavioural phenotype correlation and associated P value. A correlation was deemed significant if it passed a threshold (P value range: <0.05 to <10−7 (Bonferroni-corrected across the 77,421 ROI–ROI pairs)) in the full sample (cortical thickness n = 3,604, RSFC n = 3,928). At each sample size, if a correlation in the full sample was not significant, we determined the percentage of studies that resulted in a false positive significant correlation across a broad range of P values (0.05 to 10−7). Conversely, if a correlation in the full sample was significant (P < 0.05 to 10−7), we determined the percentage of studies that resulted in a false negative non-significant correlation across the same range of P values. Statistical power (Fig. 3d) was calculated as 1 − false negative rate.
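A minimal sketch of this error-rate estimation for a single association (ours, not the original code; toy data with a weak true effect):

```python
# Illustrative sketch (ours): error rates for one association, with the
# full sample serving as ground truth.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 3928
x = rng.standard_normal(N)               # toy brain feature
y = 0.06 * x + rng.standard_normal(N)    # toy phenotype, weak true effect

def error_rates(x, y, n, alpha=0.05, n_resamples=1000):
    p_full = stats.pearsonr(x, y)[1]     # full-sample "ground truth"
    sig = []
    for _ in range(n_resamples):
        idx = rng.choice(len(x), size=n, replace=True)
        sig.append(stats.pearsonr(x[idx], y[idx])[1] < alpha)
    rate = np.mean(sig)
    if p_full < alpha:                   # significant in the full sample:
        return {"power": rate, "false_negative": 1 - rate}
    return {"false_positive": rate}      # otherwise, subsample hits are FPs

print(error_rates(x, y, n=200))
```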

BWAS correlation inflation

For each univariate brain-wide association in the full sample (cortical thickness n = 3,604; RSFC n = 3,928) at the vertex/edge level, we determined whether or not a correlation was significant (using two-tailed P < 0.05 (uncorrected) and P < 10−7 (Bonferroni corrected for multiple comparisons) thresholds). Then, for each significant correlation in the full sample, we extracted all of the significant correlations (P < 0.05 and 10−7) observed across 1,000 subsamples at each sample size bin. Of these significant correlations in subsamples at each sample size bin, we determined the percentage that were inflated, relative to the full sample effect size, across varying magnitudes (50%, 100% and 200%; Fig. 3b).
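A compact illustration of this inflation calculation for a single association (ours, not the original code; toy data with a weak true effect):

```python
# Illustrative sketch (ours): among subsample correlations that reach
# significance, the fraction inflated by >=100% versus the full sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 3928
x = rng.standard_normal(N)
y = 0.06 * x + rng.standard_normal(N)    # toy weak true effect

r_full = stats.pearsonr(x, y)[0]         # full-sample reference effect
sig_rs = []
for _ in range(1000):
    idx = rng.choice(N, size=200, replace=True)
    r, p = stats.pearsonr(x[idx], y[idx])
    if p < 0.05:
        sig_rs.append(r)
inflated = np.mean(np.abs(sig_rs) >= 2 * np.abs(r_full))
print(f"{100 * inflated:.0f}% of significant subsample rs inflated >= 100%")
```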

BWAS sign errors

Each brain-wide association was extracted from the full sample as a reference. Across the 1,000 subsamples within a sampling bin, we determined the percentage of correlations that had the opposite correlation sign as the correlation sign in the full sample, thresholding the subsamples at the same P values as all other analyses of statistical errors (P < 10−7 to 0.05).

Univariate BWAS replication

Replication is commonly defined as detecting a significant association (for example, P < 0.05) that was deemed significant (P < 0.05) in a previous sample (Fig. 3e). To determine the probability of replicating a brain–phenotype association in new data (out-of-sample) at a given sample size, we correlated every brain feature (RSFC edge, cortical thickness vertex) with each behavioural phenotype in 1,000 bootstrapped samples across sample sizes (same sampling bins as listed under ‘Resampling procedures’). For each behavioural phenotype, sample size (n = 25, 33, 45, 60, 80, 100, 145, 200, 256, 350, 460, 615, 825, 1,100, 1,475 and 1,964 (1,814 for cortical thickness); note: data end at n ≈ 2,000, as the replication sample is half of the full sample) and bootstrapped subsample, we first determined the brain–behavioural phenotype associations that were significant (at P < 10−7 to 0.05) in the discovery (in-sample) dataset. Next, we extracted the same brain features from the replication (out-of-sample) dataset and quantified the percentage of associations that were also significant in the replication dataset. Note that, to mirror the process of replicating existing effects, we used the number of significant associations identified in the discovery sample as the total number of features that could be replicated (as opposed to the total number of brain features regardless of discovery sample significance). For example, if all significant BWAS in the discovery sample were also significant in the replication sample, the probability of replication would be 100%.
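A minimal sketch of this replication-rate estimation (ours, not the original code; toy data in which a small set of edges carries a weak true effect, and replication follows the Fig. 3e definition of same sign and both significant):

```python
# Illustrative sketch (ours): out-of-sample replication rate at sample size n.
import numpy as np
from scipy import stats

def corr_cols(X, y):
    """Pearson r between each column of X and y."""
    Xz = (X - X.mean(0)) / X.std(0)
    yz = (y - y.mean()) / y.std()
    return Xz.T @ yz / len(y)

def r_to_p(r, n):
    """Two-tailed P value for a Pearson correlation."""
    t = r * np.sqrt((n - 2) / (1 - r ** 2))
    return 2 * stats.t.sf(np.abs(t), df=n - 2)

def replication_rate(disc, y_disc, repl, y_repl, n, alpha=0.05, rng=None):
    rng = rng or np.random.default_rng(0)
    i = rng.choice(len(y_disc), size=n, replace=True)
    j = rng.choice(len(y_repl), size=n, replace=True)
    r_d = corr_cols(disc[i], y_disc[i])
    r_r = corr_cols(repl[j], y_repl[j])
    hits = r_to_p(r_d, n) < alpha             # significant in discovery
    if not hits.any():
        return np.nan
    same_sign = np.sign(r_r[hits]) == np.sign(r_d[hits])
    # replicated = significant again, with the same sign, out of sample
    return np.mean((r_to_p(r_r[hits], n) < alpha) & same_sign)

def toy_sample(rng, n=1964, e=5000):
    g = rng.standard_normal(n)
    X = rng.standard_normal((n, e))
    X[:, :100] += 0.15 * g[:, None]           # weak true effects
    return X, g + 3.0 * rng.standard_normal(n)

rng = np.random.default_rng(1)
disc, y_disc = toy_sample(rng)
repl, y_repl = toy_sample(rng)
print(replication_rate(disc, y_disc, repl, y_repl, n=500, rng=rng))
```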

Effect sizes in HCP replication

We used data from n = 900 individuals from the HCP 1200 Subjects Data Release (aged 22–35 years). All HCP participants provided informed consent. A customized Siemens Skyra 3T MRI scanner and a custom 32-channel head matrix coil were used to obtain high-resolution T1-weighted (MP-RAGE, TR = 2.4 s, 0.7 mm isotropic voxels) and BOLD contrast-sensitive (gradient-echo EPI, multiband factor 8, TR = 0.72 s, 2 mm isotropic voxels) images from each participant. The HCP used sequences with left-to-right (LR) and right-to-left (RL) phase encoding, with a single RL and LR run on each day for two consecutive days, for a total of four runs68. MRI data were preprocessed as previously described62. All HCP data are available at https://db.humanconnectome.org/.

As with the ABCD data, we extracted the timeseries from a total of 394 cortical and subcortical ROIs, then correlated and Fisher z-transformed them. Data from the NIH Toolbox were correlated with each edge of the RSFC correlation matrix across participants. Across all NIH Toolbox subscales, the tails of the distributions of the resulting brain–behavioural phenotype correlations were compared to 100 subsampled ABCD brain–behavioural phenotype correlations (n = 877, matching the HCP sample size). In Supplementary Fig. 8, we show the distributions of brain–behavioural phenotype correlations for ABCD and HCP data, for each NIH Toolbox subscale.

Effect sizes in UKB replication

We used pre-processed resting-state data from n = 32,572 individuals from the January 2020 UKB release69, processed with the same pipeline as the ABCD data. All UKB participants provided informed consent. For a complete description of study flow and imaging protocols, see Littlejohns et al.70. The UKB collects a measure of fluid intelligence, which we correlated with RSFC, mirroring the ABCD and HCP samples. For Fig. 2, we drew 100 subsamples of n = 900 from the ABCD and UKB datasets to match the HCP sample size for the associations between RSFC and fluid intelligence. We subsequently determined the threshold to reach the top 1% strongest RSFC associations with fluid intelligence in each of the three datasets.

Sampling variability in HCP replication

To quantify the degree of sampling variability in single-site, single-scanner HCP data compared with multi-site, multi-scanner ABCD data, we subsampled ABCD RSFC data to match HCP sample sizes (n = 877; denoised, with complete behavioural data across all NIH Toolbox subscales). For each dataset, we carried out resampling, as detailed under ‘Resampling procedures’ (12 intervals: n = 25, 33, 50, 70, 100, 135, 200, 265, 375, 525, 725 and 875), across all NIH Toolbox subscales. The range of correlations and the 95% confidence interval observable in each sampling bin are shown in Extended Data Fig. 6a for both HCP and ABCD data.

Sampling variability for single-site ABCD versus multi-site ABCD

We directly compared single-site ABCD data (site 16; n = 603) with multi-site ABCD data (n = 3,325; 20 sites, excluding site 16) using 1,000 bootstrapped samples at 10 sample size intervals: n = 25, 33, 50, 70, 100, 135, 200, 265, 375 and 525. For this analysis, we used the associations between RSFC and all NIH Toolbox subscales (Extended Data Fig. 6b). The range of correlations and the 95% confidence interval in each resampling bin are shown in Extended Data Fig. 6b for single-site and multi-site ABCD data.

BWAS effect sizes in task activation versus RSFC in HCP

We estimated the effect sizes between task activations (86 total contrasts, see Supplementary Table 3) and behavioural phenotypes71 (39 total, see Supplementary Table 4) across three levels of analysis: vertices, ROIs and networks (n = 844). In these same individuals, we estimated the effect sizes between RSFC and the same phenotypes across three levels of analysis: edges, principal components, and networks. To compare the resulting effect size distributions (for example, Extended Data Fig. 3), we determined the top 1% strongest effect sizes, as well as the maximum correlation (absolute value).

Multivariate out-of-sample replication

For multivariate out-of-sample replication, we used SVR and CCA. SVR with a linear kernel was performed using the e1071 package in the R environment (version 3.5.2) to predict primary phenotypes (psychopathology and cognitive ability) and other demographic and psychological phenotypes (Supplementary Figs. 11, 12) from individual differences in either RSFC or cortical thickness. One hundred bootstrap samples (sampling with replacement) were generated for each sample size. Hyperparameter tuning was examined in (1) split halves of the full discovery sample for multiple cognitive (NIH Toolbox) and psychopathology (CBCL) scales and (2) tenfold cross-validation within the full discovery sample for primary phenotypes (psychopathology and cognitive ability; Supplementary Figs. 11, 12). Hyperparameter tuning did not appreciably change out-of-sample prediction estimates in the replication sample (for example, average out-of-sample correlation difference between tuned and non-tuned models: RSFC = −0.006, cortical thickness = 0.014; Supplementary Figs. 11, 12). Figure 4a, b uses default hyperparameters and PCA dimensionality reduction (with a threshold of 50% variance explained in the discovery set, for each sample size) prior to SVR, given that this procedure balanced out-of-sample prediction and model complexity for nearly all model types (Supplementary Figs. 11, 12). Replication set data were not used to estimate principal components; rather, replication set data were projected into component space via independently estimated loading matrices for each subsample of the discovery set, to prevent bias. An alternative strategy of univariate feature ranking was also examined, in which SVR models were trained on the 5,000, 10,000 or 15,000 vertices (cortical thickness) or edges (RSFC) with the highest bivariate correlation to the variable of interest in the training dataset, but this approach resulted in lower out-of-sample prediction (Supplementary Figs. 11, 12). Out-of-sample association strength is reported as the correlation between predicted and observed phenotypic scores (rpred; using models trained on the discovery set). Significance thresholds for out-of-sample replication (rpred) were estimated via permutation testing (1,000 iterations) with models trained on the full discovery set (RSFC: n = 1,964; cortical thickness: n = 1,814) and tested on the full replication set.
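A minimal sketch of this scheme (ours, using scikit-learn in Python rather than R's e1071; toy arrays stand in for the discovery and replication data):

```python
# Illustrative sketch (ours) of the SVR replication scheme: PCA fit on the
# discovery data only, default SVR hyperparameters, r_pred on held-out data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVR

def svr_rpred(X_disc, y_disc, X_repl, y_repl):
    pca = PCA(n_components=0.5).fit(X_disc)     # keep ~50% explained variance
    svr = SVR(kernel="linear").fit(pca.transform(X_disc), y_disc)
    y_hat = svr.predict(pca.transform(X_repl))  # replication set projected
    return np.corrcoef(y_hat, y_repl)[0, 1]     # through discovery loadings

# toy demonstration: weak signal on the first 50 of 2,000 features
rng = np.random.default_rng(0)
def toy(n=500, p=2000):
    g = rng.standard_normal(n)
    X = rng.standard_normal((n, p))
    X[:, :50] += 0.3 * g[:, None]
    return X, g + rng.standard_normal(n)

X_d, y_d = toy()
X_r, y_r = toy()
print(svr_rpred(X_d, y_d, X_r, y_r))            # out-of-sample r_pred
```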

CCA was performed using Matlab’s (R2019a) canoncorr.m function for joint associations of the NIH Toolbox and CBCL with individual differences in either RSFC or cortical thickness. Equivalent bootstrapping and subsampling of the in-sample discovery set were tested and applied to the out-of-sample replication set, as in the SVR analyses. To model sampling variability across sample sizes, 100 bootstrap samples (sampling with replacement) were generated for each sample size. As with SVR, Fig. 4c, d used PCA dimensionality reduction (threshold of 20% variance explained in the in-sample discovery set, for each sample size) prior to CCA, given that this maximized the out-of-sample correlation (rCV1; Supplementary Figs. 14, 15). CCA models were fit on iteratively larger subsamples of the in-sample discovery dataset. The first canonical variate (CV1) weights were extracted and applied to the full out-of-sample brain and behavioural data. This resulted in the out-of-sample correlation (rCV1) between multivariate brain and behavioural data. Significance thresholds for out-of-sample replication were estimated via permutation testing (1,000 iterations) with models trained on the full ABCD discovery set (RSFC: n = 1,964; cortical thickness: n = 1,814) and tested on the full replication set.
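A corresponding minimal sketch (ours; scikit-learn's CCA, an iterative PLS-style implementation, stands in for Matlab's canoncorr.m, and toy arrays stand in for the brain and subscale data):

```python
# Illustrative sketch (ours) of the CCA replication scheme: PCA and CV1
# weights estimated on discovery data, applied fixed to replication data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import CCA

def cca_rcv1(X_disc, Y_disc, X_repl, Y_repl):
    pca = PCA(n_components=0.2).fit(X_disc)      # ~20% explained variance
    cca = CCA(n_components=1).fit(pca.transform(X_disc), Y_disc)
    u, v = cca.transform(pca.transform(X_repl), Y_repl)
    return np.corrcoef(u[:, 0], v[:, 0])[0, 1]   # out-of-sample r_CV1

# toy demonstration: brain matrix plus ten behavioural subscales
rng = np.random.default_rng(0)
def toy(n=500, p=2000, q=10):
    g = rng.standard_normal(n)
    X = rng.standard_normal((n, p))
    X[:, :50] += 0.3 * g[:, None]
    Y = 0.5 * g[:, None] + rng.standard_normal((n, q))
    return X, Y

X_d, Y_d = toy()
X_r, Y_r = toy()
print(cca_rcv1(X_d, Y_d, X_r, Y_r))
```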

Towards a new era of BWAS

In Extended Data Fig. 9, sampling variability, statistical errors (false positives, false negatives, inflation and sign errors), and out-of-sample multivariate associations (rpred, rCV1) were plotted as a function of sample size (y-axis: 0–1 for sampling variability (r), 0–100% for statistical errors (cumulative sum across all four error types), 0–100% for out-of-sample associations). To account for differences between in-sample and out-of-sample multivariate associations, out-of-sample multivariate associations were normalized by the mean in-sample (discovery) correlation at the full sample size. All three curves (sampling variability, statistical errors, and out-of-sample association) were based on the largest univariate and multivariate brain-wide association (RSFC with cognitive ability).

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.