Magnetic resonance imaging (MRI) has transformed our understanding of the human brain through well-replicated mapping of abilities to specific structures (for example, lesion studies) and functions1,2,3 (for example, task functional MRI (fMRI)). Mental health research and care have yet to realize similar advances from MRI. A primary challenge has been replicating associations between inter-individual differences in brain structure or function and complex cognitive or mental health phenotypes (brain-wide association studies (BWAS)). Such BWAS have typically relied on sample sizes appropriate for classical brain mapping4 (the median neuroimaging study sample size is about 25), but potentially too small for capturing reproducible brain–behavioural phenotype associations5,6. Here we used three of the largest neuroimaging datasets currently available—with a total sample size of around 50,000 individuals—to quantify BWAS effect sizes and reproducibility as a function of sample size. BWAS associations were smaller than previously thought, resulting in statistically underpowered studies, inflated effect sizes and replication failures at typical sample sizes. As sample sizes grew into the thousands, replication rates began to improve and effect size inflation decreased. More robust BWAS effects were detected for functional MRI (versus structural), cognitive tests (versus mental health questionnaires) and multivariate methods (versus univariate). Smaller than expected brain–phenotype associations and variability across population subsamples can explain widespread BWAS replication failures. In contrast to non-BWAS approaches with larger effects (for example, lesions, interventions and within-person), BWAS reproducibility requires samples with thousands of individuals.
MRI data (such as cortical thickness or resting-state functional connectivity (RSFC)) are increasingly being used for the ambitious task of relating individual differences in brain structure and function to typical variation in complex psychological phenotypes (for example, cognitive ability and psychopathology). To clearly distinguish such BWAS from other neuroimaging research, we formally define them as ‘studies of the associations between common inter-individual variability in human brain structure/function and cognition or psychiatric symptomatology’. Classically univariate, BWAS have recently been facilitated by more powerful, but more difficult to interpret multivariate prediction techniques (for example, support vector regression (SVR) and canonical correlation analysis (CCA)). BWAS hold great promise for predicting and reducing psychiatric disease burden and advancing our understanding of the cognitive abilities that underlie humanity’s intellectual feats. However, obtaining MRI data remains expensive (approximately US$1,000 per hour), resulting in small-sample BWAS findings that have not been replicated7,8,9,10.
Factors that have contributed to poor reproducibility of population-based research in psychology11, genomics12 and medicine13, such as methodological variability14, data mining for significant results15, overfitting16, confirmation and publication biases17, and inadequate statistical power5, probably also affect BWAS. Researchers are starting to address replication failures by standardizing analyses, pre-registering hypotheses, publishing null results and sharing data and code18. Nevertheless, there have been concerns that reliance on relatively small samples (the median sample size (n) in openneuro.org studies as of September 2021 is 23) may also be contributing to BWAS replication failures5,19,20,21. Small studies are most vulnerable to sampling variability, the random variation of an association across population subsamples. Sampling variability decreases and associations stabilize with increasing sample sizes19,22, with variability shrinking in proportion to 1/√n. Thus, if true brain-wide associations were smaller than previously assumed (for example, bivariate linear correlation r = 0.2–0.8), larger samples would be required to accurately measure them19,20. Other population-based sciences aiming to robustly characterize relatively small effects, such as epidemiology and genomics (that is, genome-wide association studies (GWAS)), have steadily increased sample sizes12 from below 100 to over 1,000,000.
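The √n scaling of sampling variability described above can be illustrated with a small simulation. The following is a minimal sketch on synthetic data, not part of the original analyses; the true correlation (0.1), the two sample sizes and the number of simulated studies are arbitrary assumptions chosen for illustration.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def sampling_sd(true_r, n, n_studies, rng):
    """SD of the sample correlation across simulated studies of size n."""
    estimates = []
    for _ in range(n_studies):
        x = [rng.gauss(0, 1) for _ in range(n)]
        # y mixes x with independent noise so that corr(x, y) equals true_r
        y = [true_r * xi + math.sqrt(1 - true_r ** 2) * rng.gauss(0, 1) for xi in x]
        estimates.append(pearson_r(x, y))
    mean = sum(estimates) / n_studies
    return math.sqrt(sum((e - mean) ** 2 for e in estimates) / (n_studies - 1))

rng = random.Random(0)  # fixed seed for repeatability
sd_small = sampling_sd(true_r=0.1, n=25, n_studies=400, rng=rng)
sd_large = sampling_sd(true_r=0.1, n=400, n_studies=400, rng=rng)

print(f"SD of r at n = 25:  {sd_small:.3f}")
print(f"SD of r at n = 400: {sd_large:.3f}")
print(f"ratio: {sd_small / sd_large:.2f} (sqrt(400 / 25) = 4)")
```

A 16-fold increase in sample size should reduce the spread of the estimated correlation by roughly a factor of four, which is why small effects require disproportionately large samples.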
Recently, neuroimaging consortia have collected samples orders of magnitude larger than before (for example, the Adolescent Brain Cognitive Development23 (ABCD) study, n = 11,874; Human Connectome Project24 (HCP), n = 1,200; and UK Biobank25 (UKB), n = 35,735), enabling accurate estimation of BWAS effect sizes. Beginning with the ABCD Study and using the HCP and UKB data for verification, we performed billions of univariate and multivariate analyses to evaluate BWAS effect sizes and reproducibility as a function of sample size, using sample sizes from small (n = 25) to large (n = 32,572).
Precise BWAS require large samples
BWAS relate population variability in brain features (for example, RSFC between two brain regions (edge)) and behavioural phenotypes (for example, cognitive ability). To estimate brain-wide associations in ABCD data, we correlated widely used cortical thickness and RSFC metrics with 41 measures indexing demographics, cognition and mental health (Supplementary Table 1). Brain-wide associations were estimated across multiple levels of anatomical resolution in both structural (cortical vertices, regions of interest (ROI) and networks) and functional (connections (edges), principal components and networks) MRI data (Fig. 1). To ameliorate the effects of nuisance variables such as head motion, we applied strict denoising strategies (n = 3,928; >8 min; RSFC data post frame censoring at a filtered framewise displacement (filtered-FD) < 0.08 mm; Methods, ‘DCANBOLDproc preprocessing’). Repeat analyses using less rigorous motion censoring that retained a larger subset of the full ABCD sample (n = 9,753), produced a similar BWAS effect size distribution (Supplementary Fig. 1).
BWAS analyses frequently link a single brain feature to a single behavioural phenotype. In Fig. 1a, b, we show the distributions of such univariate associations between cortical thickness and RSFC and two extensively studied phenotypes, cognitive ability (NIH Toolbox total score) and psychopathology (child behaviour checklist (CBCL) total score; Methods, ‘Psychological and demographic data’; Supplementary Table 1; Supplementary Fig. 2 for non-overlapping histograms). In the full, rigorously denoised ABCD sample (n = 3,928), across all brain-wide associations, the median univariate effect size (|r|) was 0.01 (Extended Data Fig. 1). The top 1% largest of all possible brain-wide associations (around 11 million total associations) reached a |r| value greater than 0.06 (Fig. 1a, b). The top 10% largest associations were distributed across sensorimotor and association cortex (Fig. 1c, d). Across all univariate brain-wide associations, the largest correlation that replicated out-of-sample was |r| = 0.16. Sociodemographic covariate adjustment resulted in decreased effect sizes, especially for the strongest associations (top 1% Δr = −0.014; Extended Data Fig. 2).
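A univariate brain-wide association analysis of this kind amounts to correlating every brain feature with one phenotype and summarizing the distribution of |r|. The sketch below assumes synthetic noise data and an arbitrary feature count (1,000); it does not reproduce the ABCD analyses, and the function name `univariate_bwas` is hypothetical.

```python
import numpy as np

def univariate_bwas(brain, phenotype):
    """Pearson r between each brain feature (column) and one phenotype."""
    bz = (brain - brain.mean(axis=0)) / brain.std(axis=0)
    pz = (phenotype - phenotype.mean()) / phenotype.std()
    # Mean of the product of z-scores equals the Pearson correlation
    return bz.T @ pz / len(pz)

rng = np.random.default_rng(0)
n_participants = 3928   # sample size from the text
n_features = 1000       # arbitrary; stands in for edges or vertices
brain = rng.normal(size=(n_participants, n_features))
phenotype = rng.normal(size=n_participants)  # pure noise: true r is zero

r = univariate_bwas(brain, phenotype)
abs_r = np.abs(r)

print("median |r|:", float(np.median(abs_r)))
print("top 1% threshold on |r|:", float(np.quantile(abs_r, 0.99)))
```

Because the phenotype here is unrelated to the simulated features, the printed values only indicate how large |r| can be from sampling noise alone at this sample size.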
Smaller brain-wide association studies have reported larger univariate correlations (r > 0.2) than the largest effects we measured in much larger samples. To resolve this apparent contradiction, we simulated the effects of independent research groups using samples of varying sizes to estimate the same brain–phenotype association. For the strongest univariate brain-wide associations, we charted sampling variability as a function of sample size (Fig. 1e, f, n = 25–3,928). At n = 25, the 99% confidence interval for univariate associations was r ± 0.52, documenting that BWAS effects can be strongly inflated by chance. In larger samples (n = 1,964 in each split half), the top 1% largest BWAS effects were still inflated by r = 0.07 (78%), on average (Supplementary Fig. 3). At n = 25, two independent population subsamples can reach the opposite conclusion about the same brain–behaviour association (for example, Fig. 1g, h), solely owing to sampling variability. See Supplementary Figs. 4–6 for sampling variability by sample size plots for all brain metrics and behavioural phenotypes.
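The width of the confidence interval at n = 25 can be approximated analytically. The sketch below uses the Fisher z approximation, which is an assumption of this example and not the resampling procedure used for Fig. 1e, f; it yields a 99% half-width of about 0.50 for a near-zero correlation at n = 25, in line with the empirical r ± 0.52.

```python
import math
from statistics import NormalDist

def r_ci_halfwidth(n, confidence=0.99, r=0.0):
    """Approximate half-width of the confidence interval for a correlation.

    Uses the Fisher z-transform, whose standard error is 1 / sqrt(n - 3).
    """
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    se = 1 / math.sqrt(n - 3)
    z = math.atanh(r)
    lower = math.tanh(z - z_crit * se)
    upper = math.tanh(z + z_crit * se)
    return (upper - lower) / 2

for n in (25, 100, 1000, 3928):
    print(f"n = {n:>5}: r +/- {r_ci_halfwidth(n):.3f}")
```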
Task fMRI data have also been correlated with cognitive phenotypes. Recent studies have suggested that treating task fMRI data similar to RSFC and combining the two modalities could strengthen BWAS effects slightly26. Therefore, we also estimated univariate BWAS associations for combined task and rest functional connectivity in ABCD Study data27, which produced the same distribution of association strengths (top 1% |r| > 0.06) as RSFC. The HCP collected a wide variety of fMRI tasks, enabling us to compute all brain-wide associations between 86 task activation contrasts and 39 behavioural measures. The distributions of BWAS effect sizes for classical task fMRI activations and RSFC were closely matched (Extended Data Fig. 3, Supplementary Discussion).
Low measurement reliability can attenuate the observed correlation between two variables. Within-person measurement reliabilities for the exemplar behavioural phenotypes (NIH Toolbox28, r = 0.90; CBCL29, r = 0.94) and imaging measures (cortical thickness30, r > 0.96; RSFC: ABCD, r = 0.48; HCP, r = 0.79; UKB, r = 0.39; Extended Data Fig. 4) are moderate to high. Whereas behavioural (NIH Toolbox, CBCL) and cortical thickness measures are already close to their reliability ceiling, further improvements in RSFC measurement reliability could theoretically increase effect sizes slightly (Supplementary Fig. 7, Supplementary Discussion). Theoretical maximum BWAS effect sizes are unlikely to be reached owing to fundamental biological limits on the strength of the true association and/or the limitations of behavioural phenotyping and MRI physics.
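The attenuating effect of measurement reliability can be made concrete with the classical Spearman attenuation formula, in which the observed correlation equals the true correlation multiplied by the square root of the product of the two reliabilities. The sketch below assumes that the test-retest correlations quoted above can be used directly as reliability coefficients; it is illustrative and does not reproduce Supplementary Fig. 7.

```python
import math

def attenuated_r(true_r, reliability_x, reliability_y):
    """Observed correlation expected under Spearman's attenuation formula."""
    return true_r * math.sqrt(reliability_x * reliability_y)

def disattenuated_r(observed_r, reliability_x, reliability_y):
    """Correlation corrected for unreliability in both measures."""
    return observed_r / math.sqrt(reliability_x * reliability_y)

# Reliability values quoted in the text
rsfc_reliability = {"ABCD": 0.48, "HCP": 0.79, "UKB": 0.39}
nih_toolbox_reliability = 0.90

for dataset, rel in rsfc_reliability.items():
    # Largest observable r if the true association were perfect (r = 1)
    ceiling = attenuated_r(1.0, rel, nih_toolbox_reliability)
    # What an observed r of 0.06 would correspond to without measurement error
    corrected = disattenuated_r(0.06, rel, nih_toolbox_reliability)
    print(f"{dataset}: ceiling r = {ceiling:.2f}, r = 0.06 corrects to {corrected:.3f}")
```

Under these assumptions, correcting for unreliability raises a small observed correlation only modestly, consistent with the statement that reliability improvements could increase effect sizes slightly.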
Effect sizes replicate across datasets
Since the ABCD Study data (n = 11,874; age range: 9–10 years; 20 min, RSFC collected) were from a 21-site paediatric cohort (multiple scanner types), we sought to replicate BWAS effect sizes in single-site, single-scanner-type adult data. Thus, we used the HCP dataset which contains the most data per participant among large studies (n = 1,200; age range: 22–35 years; single scanner; 60 min, RSFC collected), and the UKB dataset which has the largest sample size, but less RSFC data per participant (n = 35,735; age range: 40–69 years; single scanner type; 6 min, RSFC collected), to verify univariate BWAS effect size distributions. All three datasets overlapped in containing RSFC and cognitive ability data. To control for sample size effects, the ABCD and UKB datasets were subsampled to match the HCP (n = 900, strict denoising). Across the three size-matched datasets we found similar effect size distributions for associations between RSFC and cognitive ability (Fig. 2; top 1% at n = 900 ABCD, |r| > 0.11; HCP, |r| > 0.12; UKB, |r| > 0.09; Extended Data Fig. 5; see Supplementary Fig. 8 for all ABCD/HCP cognitive measures).
To account for potential multi-site effects, we directly compared sampling variability between the HCP (single site) and ABCD datasets (Extended Data Fig. 6a), and between a single ABCD site (n = 603) and the 20 remaining sites (Extended Data Fig. 6b). Sampling variability was equivalent for single- and multi-site samples, underscoring the effectiveness of the ABCD Study’s cross-site harmonization efforts23. The generalizability of the univariate BWAS effect size distribution (Fig. 2, Extended Data Figs. 5, 6) across age (9–69 years), sites, scanner types and pulse sequences suggests that it is universal to BWAS with current technologies and methods.
Statistical errors limit reproducibility
Statistical error rates depend on effect sizes and significance testing thresholds. To quantify how the pairing of smaller than expected effect sizes and sampling variability (that is, random variation of an association across population subsamples) affects BWAS reproducibility, we used non-parametric bootstrapping19 to generate smaller BWAS subsamples and characterized the relationship between statistical errors and sample size across significance thresholds (P < 0.05 to P < 10⁻⁷; Fig. 3, Supplementary Fig. 9 for UKB) and verified the results with analytic statistical power estimations31 (Supplementary Fig. 10).
Statistical errors were pervasive across BWAS sample sizes. Even for samples as large as 1,000, false negative rates (Fig. 3a) were very high (75–100%) and half of the statistically significant associations were inflated by at least 100% (Fig. 3b). More lenient statistical thresholding reduces false negatives and effect size inflation, but increases the rate of sign errors (Fig. 3c). Statistical power (1 − false negative rate), which indexes the probability of detecting a significant effect, remained low even for relatively large sample sizes: maximum statistical power 0.68 for n = 3,928 (Fig. 3d).
Given the high statistical error rates and low power of univariate BWAS in typically sized samples, we quantified the probability that a significant univariate association would replicate in a size-matched replication dataset (Fig. 3e; P from 10⁻⁷ to 0.05). In keeping with common practice, we defined successful replication as passing the same statistical threshold in sample and out of sample. At the largest split half sample size (n = 1,964), 25% of univariate BWAS replications succeeded with a threshold of P < 0.05. At sample sizes more typical for BWAS (n < 500), replication rates were around 5% (Fig. 3e).
Paradoxically, correcting for multiple comparisons reduces the probability of successfully replicating univariate BWAS effects (Fig. 3d, e). More stringent statistical thresholding reduces false positive rates (Fig. 3f) but increases false negative rates (Fig. 3a), thus lowering statistical power (Fig. 3d, Extended Data Fig. 7). In underpowered BWAS, stricter statistical thresholds select for very large correlations, which are the most likely to be inflated due to sampling variability (Fig. 1e, f). With Bonferroni multiple-comparisons correction (P < 10⁻⁷), a sample size of 9,500 was required to be 80% powered for detecting the top 1% largest (r > 0.06) BWAS effects (Supplementary Fig. 10a), compared with a sample size of 2,200 for uncorrected P < 0.05 (Supplementary Fig. 10b).
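The dependence of the required sample size on the significance threshold can be approximated with a textbook power calculation. The sketch below uses a two-sided Fisher z approximation, which is an assumption of this example and not necessarily the method of ref. 31. For r = 0.06 it gives a required n close to 2,200 at P < 0.05 and on the order of 10,000 at P < 10⁻⁷; exact values depend on the power method used.

```python
import math
from statistics import NormalDist

def n_for_power(r, alpha, power=0.80):
    """Sample size needed to detect correlation r (two-sided test).

    Fisher z approximation: atanh(r_hat) is roughly normal with
    standard error 1 / sqrt(n - 3).
    """
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)   # critical value for the threshold
    z_power = norm.inv_cdf(power)           # quantile for the target power
    return math.ceil(((z_alpha + z_power) / math.atanh(r)) ** 2 + 3)

n_uncorrected = n_for_power(r=0.06, alpha=0.05)
n_bonferroni = n_for_power(r=0.06, alpha=1e-7)

print("n for 80% power at P < 0.05:", n_uncorrected)
print("n for 80% power at P < 1e-7:", n_bonferroni)
```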
Multivariate BWAS reproducibility
Multivariate methods use weighted brain patterns to predict a single behavioural phenotype (SVR; for example, cognitive ability), or combinations of multiple phenotypes (CCA; for example, all NIH Toolbox subscales). To examine multivariate brain-wide associations as a function of sample size, we trained SVR (Supplementary Figs. 11–13) and CCA (Supplementary Figs. 14, 15) models on discovery set data (in-sample; including nested cross-validation (SVR) and principal component analysis (PCA) dimensionality reduction (SVR and CCA); Methods, ‘Multivariate out-of-sample replication’) and subsequently tested their generalization to the replication set using standard out-of-sample estimates of SVR (rpred) and CCA (rCV1) association strength (Fig. 4). Sampling variability was assessed by generating bootstrapped subsamples (n = 100) for each sample size. Multivariate out-of-sample associations were tested for statistical significance using nonparametric null distributions (>99% confidence interval).
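The distinction between in-sample and out-of-sample association strength can be sketched with a simplified pipeline. The example below swaps in PCA followed by ordinary least-squares regression for the SVR and CCA models used in the study, and runs on synthetic data; the feature count, number of components, weight scale and sample sizes are all arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_components = 200, 20
# Weak, widely distributed true effects (assumed for illustration)
true_weights = rng.normal(scale=0.02, size=n_features)

def simulate(n):
    """Synthetic brain features and a phenotype weakly related to them."""
    X = rng.normal(size=(n, n_features))
    y = X @ true_weights + rng.normal(size=n)
    return X, y

def fit_pca_regression(X, y, k):
    """PCA dimensionality reduction followed by least-squares regression."""
    mean_x, mean_y = X.mean(axis=0), y.mean()
    _, _, vt = np.linalg.svd(X - mean_x, full_matrices=False)
    components = vt[:k].T
    beta, *_ = np.linalg.lstsq((X - mean_x) @ components, y - mean_y, rcond=None)
    return mean_x, mean_y, components, beta

def predict(model, X):
    mean_x, mean_y, components, beta = model
    return (X - mean_x) @ components @ beta + mean_y

def corr(a, b):
    return float(np.corrcoef(a, b)[0, 1])

X_rep, y_rep = simulate(2000)  # held-out replication set
gaps = {}
for n_disc in (50, 2000):
    X_disc, y_disc = simulate(n_disc)
    model = fit_pca_regression(X_disc, y_disc, n_components)
    r_in = corr(predict(model, X_disc), y_disc)   # in-sample association
    r_out = corr(predict(model, X_rep), y_rep)    # out-of-sample association
    gaps[n_disc] = r_in - r_out
    print(f"n = {n_disc}: in-sample r = {r_in:.2f}, out-of-sample r = {r_out:.2f}")
```

In this toy setting the gap between in-sample and out-of-sample correlation is large for the small discovery sample and shrinks for the large one, which is the pattern the replication analyses are designed to measure.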
Across multivariate methods (SVR and CCA), imaging modalities (cortical thickness and RSFC), and behavioural phenotypes (cognitive ability and psychopathology), small discovery samples typical for neuroimaging generated variable, inflated in-sample associations that frequently did not pass statistical significance thresholds (Fig. 4a–d). Increasing sample sizes to thousands of participants provided moderate statistical replication with reduced variability and smaller differences between in-sample and out-of-sample associations. On average, RSFC (versus cortical thickness) and cognitive (versus psychopathology) measures provided stronger out-of-sample associations (Fig. 4a–d) that were closer to in-sample estimates (Fig. 4e). Narrowing the definition of replication to detecting statistical significance in out-of-sample data did not alleviate the need for large sample sizes (Supplementary Table 2).
Multivariate out-of-sample associations were stronger than univariate associations, particularly at large sample sizes (for example, maximum RSFC–crystallized intelligence association: SVR rpred = 0.39, univariate r = 0.16). Even at the largest sample sizes (n ≈ 2,000), multivariate in-sample associations remained inflated on average (in-sample to out-of-sample: Δr = −0.29; Fig. 4e, Supplementary Fig. 16; see Extended Data Fig. 8 for univariate) and feature weights were variable (Supplementary Fig. 13). Out-of-sample replication was maximized by using a relatively low-dimensional feature space (Supplementary Figs. 11, 12, 14, 15), reaffirming that brain-wide associations are represented in widely distributed circuitry, consistent with univariate BWAS (Fig. 1c, d). Across behavioural phenotypes, multivariate out-of-sample associations were robustly linked to univariate effect sizes (r = 0.79, P < 0.001; Fig. 4f).
The underpowered BWAS paradox
At smaller sample sizes, the largest, most inflated BWAS effects are most likely to be statistically significant and therefore, paradoxically, the most likely to be published5,21,32. Typically, BWAS have been sufficiently powered to only detect statistical significance for inflated associations (Fig. 3d). High sampling variability in smaller samples frequently generates strong associations by chance19 (Fig. 1e, f). Stricter in-sample statistical thresholding (that is, multiple-comparison correction), which is common in neuroimaging, lowers BWAS power, thus trapping us deeper in the paradox by selecting for even more inflated effects (Fig. 3). When attempting to replicate inflated BWAS associations, regression to the mean (actual effect size) makes non-significance (that is, replication failure) the most likely outcome (Figs. 3, 4, Extended Data Fig. 8). Bias in favour of significant, larger BWAS effects has limited the publication of null results, perpetuating inflated effect sizes that form the basis for subsequent power analyses and meta-analyses.
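The selection mechanism behind this paradox can be illustrated by simulating many studies of the same small effect and keeping only those that reach significance. The sketch below draws study estimates from the Fisher z sampling distribution instead of resampling real data; the true effect (r = 0.06), sample sizes and thresholds are assumptions chosen for illustration.

```python
import math
import random
from statistics import NormalDist

def significant_estimates(true_r, n, alpha, n_studies, rng):
    """Simulated correlation estimates from studies that pass the threshold."""
    se = 1 / math.sqrt(n - 3)                      # SE of the Fisher z estimate
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    kept = []
    for _ in range(n_studies):
        z_hat = rng.gauss(math.atanh(true_r), se)  # one study's estimate
        if abs(z_hat) / se > z_crit:               # "published" only if significant
            kept.append(math.tanh(z_hat))
    return kept

rng = random.Random(0)
true_r = 0.06
mean_abs = {}
for n in (25, 500, 2000):
    for alpha in (0.05, 1e-3):
        sig = significant_estimates(true_r, n, alpha, 20000, rng)
        mean_abs[(n, alpha)] = sum(abs(r) for r in sig) / len(sig)
        print(f"n = {n:>4}, alpha = {alpha}: {len(sig)} significant, "
              f"mean |r| = {mean_abs[(n, alpha)]:.3f}")
```

In this simulation the significant estimates overstate the true effect at every sample size, and the overstatement grows both as samples shrink and as the threshold becomes stricter.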
Importance of small-sample neuroimaging
There is no one-size-fits-all solution for neuroimaging studies; minimum sample size requirements depend on the study design. Neuroimaging-only studies are typically adequately powered at small sample sizes. For example, central tendencies of human functional brain organization among groups can be accurately represented by averaging within small samples (that is, n = 25; Supplementary Fig. 17). Precise individual-specific RSFC and fMRI activation brain maps can be generated by repeatedly sampling the same individual33. Small samples have also provided blueprints for reducing MRI artefacts34, increasing the amount of usable data35.
Using non-BWAS approaches, many fundamental links between the human brain and behaviour have been uncovered and replicated in small neuroimaging samples36. Within-person designs (for example, longitudinal37), studies with induced effects (for example, lesions38 or tasks39), or both (for example, interventions40) frequently have increased measurement reliability and effect sizes. For rarer clinical conditions, amassing large samples is impossible. In many cases, within-person, induced-effects approaches are not only cost-effective, but also most relevant to clinical care. Thus, small-sample neuroimaging will always be critical for studying the human brain.
Importance of large samples for BWAS
Large neuroimaging consortium data (ABCD, HCP and UKB) have revealed that small BWAS effects and population sampling variability routinely result in inflated, irreproducible brain–phenotype associations until sample sizes reach well into the thousands (Extended Data Fig. 9). Therefore, BWAS should use datasets with at least thousands of high-quality, standardly processed samples14. Additional consideration should be given to potential confounding effects and interpretations of statistical significance41.
The recovery of genomics from its reproducibility crisis has set a valuable example for BWAS12. Early candidate-gene studies were underpowered and many associations between common genetic variants and psychiatric phenotypes could not be replicated42. In response, GWAS consortia have grown genomic samples into the millions43 and taken advantage of specialized study designs (for example, twins) and methodological innovations (for example, polygenic risk scores) and set strict data standards. Fortunately, BWAS findings can achieve reproducibility in relatively smaller samples than GWAS, owing to larger effect sizes.
Reproducibly linking brain and behaviour
All brain–behaviour studies will benefit from technological advances that generate higher quality brain and behavioural data with greater efficiency, such as real-time quality control35, multi-band multi-echo44 sequences and thermal denoising for fMRI45, as well as deep behavioural phenotyping with ecological momentary assessment46 and passive sensing.
As with GWAS47, funding agencies should boost the aggregation of BWAS-appropriate datasets through mandatory sharing policies. Even for large datasets collected and processed identically, in-sample associations are stronger than out-of-sample replications (Fig. 4e, Extended Data Fig. 8); therefore, reporting both in-sample and out-of-sample effect sizes should be a requirement for publication and funding. BWAS may also benefit from focusing data collection on the most robust brain–phenotype associations (for example, functional versus structural and direct behavioural versus questionnaire).
The brain, in contrast to the genome, is expected to change over time and can be manipulated ethically. For greater effect sizes and statistical power, neuroscience should focus on within-participant study designs over cross-sectional study designs, and on interventional (therapy, medications, brain stimulation and surgery) over observational study designs. Rather than associating pre-defined psychological constructs and brain features48, data-driven, combined brain–behaviour phenotypes will further advance our understanding of cognition and mental health. Altogether, our prospects for linking neuroimaging markers to complex human behaviours are better than ever.
ABCD Study sample
This project used the baseline ABCD BIDS (Brain Imaging Data Structure) data consisting of RSFC data from 10,259 participants released through the ABCD-BIDS Community Collection51 (ABCD collection 3165; https://github.com/ABCD-STUDY/nda-abcd-collection-3165) and demographic and behavioural data from 11,572 participants aged 9–10 years from the ABCD 2.0 release52. The ABCD Study obtained centralized institutional review board (IRB) approval from the University of California, San Diego. Each of the 21 sites also obtained local IRB approval. Ethical regulations were followed during data collection and analysis. Parents or caregivers provided written informed consent, and children gave written assent.
In addition to data from the ABCD 2.0 release, we used the ABCD reproducible matched samples51 (ARMS), available in ABCD collection 3165, that divided individuals from the full behavioural sample (n = 11,572) into discovery (n = 5,786) and replication (n = 5,786) sets, which were matched across 9 variables: site location, age, sex, ethnicity, grade, highest level of parental education, handedness, combined family income, and prior exposure to anaesthesia. Family members (that is, sibling pairs, twins and triplets) were kept together in the same set and the two sets were matched to include equal numbers of single participants and family members. These split ARMS datasets were used for replicability analyses.
Head motion can systematically bias neuroimaging studies53. However, these systematic biases can be addressed through rigorous head motion correction. Therefore, we used strict inclusion criteria with regard to head motion. Specifically, inclusion criteria for the current project (see Casey et al.23 for broader ABCD inclusion criteria) consisted of at least 600 frames (8 min) of low-motion54 (filtered-FD < 0.08 mm) RSFC data. Our final dataset consisted of RSFC data from a total of n = 3,928 youth across the discovery (n = 1,964) and replication (n = 1,964) sets. The final discovery and replication sets did not differ in mean framewise displacement (difference in means = 0.002, t = 0.60, P = 0.55) or total frames included (difference in means = 6.4, t = 0.94, P = 0.35). The participant lists for ARMS samples can be found in the ABCD-BIDS Community Collection (ABCD collection 3165) for community use51.
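The frame-count inclusion criterion can be expressed in a few lines. The sketch below is a minimal illustration with made-up framewise displacement traces; the function names are hypothetical, and the repetition time of 0.8 s is the value reported for the ABCD resting-state acquisition, so that 600 frames correspond to 8 min.

```python
TR_SECONDS = 0.8         # repetition time of the ABCD resting-state scans
FD_THRESHOLD_MM = 0.08   # filtered framewise displacement cut-off
MIN_FRAMES = 600         # minimum number of low-motion frames

def count_low_motion_frames(filtered_fd):
    """Number of frames whose filtered FD is below the threshold."""
    return sum(fd < FD_THRESHOLD_MM for fd in filtered_fd)

def passes_inclusion(filtered_fd):
    """True if the participant retains at least MIN_FRAMES low-motion frames."""
    return count_low_motion_frames(filtered_fd) >= MIN_FRAMES

def usable_minutes(filtered_fd):
    """Minutes of low-motion data retained."""
    return count_low_motion_frames(filtered_fd) * TR_SECONDS / 60

# Two made-up participants, each with 1,500 frames (20 min at TR = 0.8 s)
participant_a = [0.05] * 700 + [0.20] * 800    # 700 low-motion frames
participant_b = [0.05] * 500 + [0.20] * 1000   # 500 low-motion frames

for name, fd in (("A", participant_a), ("B", participant_b)):
    print(name, passes_inclusion(fd), round(usable_minutes(fd), 2), "min")
```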
ABCD MRI acquisition
Imaging was performed at 21 sites in the United States, harmonized across Siemens Prisma, Philips and GE 3T scanners. Details on image acquisition can be found in ref. 23. Twenty minutes (4 × 5 min runs) of eyes-open resting-state blood oxygenation level dependent (BOLD) data were acquired to ensure at least 8 min of low-motion data. All resting-state fMRI scans used a gradient-echo echo planar imaging (EPI) sequence (repetition time = 800 ms, echo time = 30 ms, flip angle = 90°, voxel size = 2.4 mm³, 60 slices). Head motion was monitored using framewise integrated real-time MRI monitoring (FIRMM) software at many of the Siemens sites35.
ABCD-BIDS processing overview
ABCD and UKB MRI data processing was completed with the freely available ABCD-BIDS pipeline51 (https://github.com/DCAN-Labs/abcd-hcp-pipeline). Data were downloaded and converted to the BIDS format using ABCD-Dicom2BIDS (https://github.com/DCAN-Labs/abcd-dicom2bids). Only data that passed the fast-track quality control (QC; tagged prior to ABCD release 2.0) were processed (also see release notes: https://collection3165.readthedocs.io/en/stable/). The ABCD-BIDS pipeline is a modification of the original HCP pipeline55. In brief, this MRI data-processing pipeline comprises six stages. (1) PreFreesurfer normalizes anatomical data. This normalization entails brain extraction, denoising, and then bias field correction on anatomical T1 and/or T2 weighted data. The ABCD-HCP pipeline includes two additional modifications to improve output image quality. ANTs56 DenoiseImage models scanner noise as a Rician distribution and attempts to remove such noise from the T1 and T2 anatomical images. Additionally, ANTs N4BiasFieldCorrection attempts to smooth relative image histograms in different parts of the brain and improves bias field correction. (2) FreeSurfer57 constructs cortical surfaces from the normalized anatomical data. This stage performs anatomical segmentation, white–grey and grey–CSF cortical surface construction, and surface registration to a standard surface template. Surfaces are refined using the T2 weighted anatomical data. Mid-thickness surfaces, which represent the average of white–grey and grey–CSF surfaces, are generated here. (3) PostFreesurfer converts prior outputs into an HCP-compatible format (that is, CIFTIs) and transforms the volumes to a standard volume template space using ANTs nonlinear registration, and the surfaces to the standard surface space via spherical registration. (4) The Vol (volume) stage corrects for functional distortions via reverse-phase encoding spin-echo images. 
All resting-state runs underwent intensity normalization to a whole-brain-mode value of 1,000, within-run correction for head movement, and functional data registration to the standard template. Atlas transformation was computed by registering the mean intensity image from each BOLD session to the high resolution T1 image, and then applying the anatomical registration to the BOLD image. This atlas transformation, mean field distortion correction, and resampling to 3 mm³ atlas space were combined into a single interpolation using the FSL58 applywarp tool. (5) The Surf (surface) stage projects the normalized functional data onto the template surfaces, as described below. (6) We have added an fMRI and fcMRI preprocessing stage, DCANBOLDproc, also described below. (7) Last, an executive summary is provided for easy participant-level QC across all processed data.
fMRI surface processing
The BOLD fMRI volumetric data were sampled to each participant’s original mid-thickness left and right-hemisphere surfaces constrained by the grey-matter ribbon. Once sampled to the surface, time courses were deformed and resampled from the individual’s original surface to the 32 k fs_LR surface in a single step. This resampling allows point-to-point comparison between each individual registered to this surface space. These surfaces were then combined with volumetric subcortical and cerebellar data into the CIFTI format using Connectome Workbench59, creating full brain time courses excluding non-grey matter tissue. Finally, the resting-state time courses were smoothed with a 2 mm full-width-half-maximum kernel applied to geodesic distances on surface data and euclidean distances on volumetric data.
DCANBOLDproc preprocessing
Additional BOLD preprocessing steps were executed to reduce spurious variance unlikely to reflect neuronal activity34. First, a respiratory filter was used to improve framewise displacement estimates calculated in the Vol stage54. Second, temporal masks were created to flag motion-contaminated frames using the improved framewise displacement estimates53. Frames with a filtered-FD > 0.3 mm were flagged as motion-contaminated for nuisance regression only. After computing the temporal masks for high motion frame censoring, the data were processed with the following steps: (1) demeaning and detrending, (2) interpolation across censored frames using least squares spectral estimation of the values at censored frames so that continuous data can be (3) denoised via a GLM with whole brain, ventricular, and white matter signal regressors, as well as their derivatives. Denoised data were then passed through (4) a band-pass filter (0.008 Hz < f < 0.1 Hz) without re-introducing nuisance signals60 or contaminating frames near high-motion frames.
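The nuisance regression step can be sketched as an ordinary least-squares fit. The example below is a simplified illustration on synthetic data and is not the DCANBOLDproc implementation: it covers detrending and regression of three nuisance signals with their derivatives, with betas estimated from frames at or below the 0.3 mm threshold, and it omits the spectral interpolation and band-pass filtering steps. All function and variable names are hypothetical.

```python
import numpy as np

def nuisance_design(signals):
    """Design matrix: intercept, linear trend, nuisance signals, derivatives."""
    n_frames = signals.shape[0]
    # First temporal derivative, padded with a zero row to keep the length
    derivs = np.vstack([np.zeros((1, signals.shape[1])), np.diff(signals, axis=0)])
    trend = np.linspace(-1, 1, n_frames)[:, None]
    return np.hstack([np.ones((n_frames, 1)), trend, signals, derivs])

def regress_nuisance(bold, signals, keep):
    """Estimate betas on retained frames only, then remove the fit everywhere."""
    design = nuisance_design(signals)
    betas, *_ = np.linalg.lstsq(design[keep], bold[keep], rcond=None)
    return bold - design @ betas

rng = np.random.default_rng(0)
n_frames, n_rois = 300, 5
# Columns stand in for whole-brain, ventricular and white matter signals
signals = rng.normal(size=(n_frames, 3))
bold = signals @ rng.normal(size=(3, n_rois)) + rng.normal(size=(n_frames, n_rois))
filtered_fd = rng.uniform(0.0, 0.4, size=n_frames)

keep = filtered_fd <= 0.3   # frames above 0.3 mm are excluded from the fit
cleaned = regress_nuisance(bold, signals, keep)

print("variance before:", float(bold[keep].var()))
print("variance after: ", float(cleaned[keep].var()))
```

Estimating the betas only from retained frames prevents high-motion frames from driving the nuisance fit, while the fitted model is still removed from every frame.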
Generation of RSFC matrices
ABCD RSFC data consist of 4 × 5 min runs. For each participant with full brain coverage, all available RSFC data were concatenated and high motion frames (filtered-FD > 0.08 mm) were censored. The timeseries of BOLD activity for each ROI was correlated to that of every other ROI (333 cortical ROIs from Gordon et al.61; 61 subcortical ROIs from Seitzman et al.62), forming a 394 × 394 correlation matrix, which was subsequently Fisher z-transformed. For network level analyses, correlations were averaged across previously defined canonical functional networks61. Inter-individual difference connectome-wide spatial components, which are not bound by network boundaries63,64, were computed by performing PCA on a matrix composed of all ROI × ROI pairs (edges) from each participant.
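The construction of an RSFC matrix from ROI timeseries can be sketched as follows. The example uses random timeseries and a made-up framewise displacement trace, so only the shapes and the sequence of operations (censoring, correlation, Fisher z-transform, edge extraction) are meaningful; the function name is hypothetical.

```python
import numpy as np

def rsfc_matrix(timeseries, filtered_fd, fd_threshold=0.08):
    """Fisher z-transformed ROI x ROI correlation matrix from censored data.

    timeseries: array of shape (n_frames, n_rois)
    filtered_fd: array of shape (n_frames,), in mm
    """
    keep = filtered_fd <= fd_threshold    # censor frames with FD above threshold
    r = np.corrcoef(timeseries[keep].T)   # ROI x ROI Pearson correlations
    np.fill_diagonal(r, 0.0)              # avoid arctanh(1) = inf on the diagonal
    return np.arctanh(r)                  # Fisher z-transform

rng = np.random.default_rng(0)
n_frames, n_rois = 1500, 394              # 20 min at TR = 0.8 s; 333 + 61 ROIs
timeseries = rng.normal(size=(n_frames, n_rois))
filtered_fd = rng.uniform(0.0, 0.16, size=n_frames)

z = rsfc_matrix(timeseries, filtered_fd)
# Unique edges are the upper triangle, excluding the diagonal
edges = z[np.triu_indices(n_rois, k=1)]

print("matrix shape:", z.shape)
print("unique edges:", edges.size)
```

With 394 ROIs there are 394 × 393 / 2 = 77,421 unique edges per participant, which form one row of the participant × edge matrix used for edge-level and PCA analyses.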
Generation of cortical thickness metrics
For each participant, cortical thickness was extracted from 59,412 cortical vertices. For ROI level matrices, cortical thickness was averaged within each cortical parcel61 (n = 333). For network level matrices, cortical thickness was averaged within each cortical network61 (n = 13). Inter-individual spatial components were computed by performing PCA on a matrix composed of all cortical vertices from each participant.
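Averaging vertex-wise thickness within parcels is a grouped mean. The sketch below assumes a hypothetical label array that assigns every vertex to one of 333 parcels and uses random thickness values; it illustrates the operation only and does not use the Gordon parcellation itself.

```python
import numpy as np

def parcel_means(vertex_values, parcel_labels, n_parcels):
    """Mean of vertex values within each parcel (labels run from 0 to n_parcels - 1)."""
    sums = np.bincount(parcel_labels, weights=vertex_values, minlength=n_parcels)
    counts = np.bincount(parcel_labels, minlength=n_parcels)
    return sums / counts

rng = np.random.default_rng(0)
n_vertices, n_parcels = 59412, 333
# Hypothetical labelling in which every parcel receives at least one vertex
labels = rng.permutation(np.arange(n_vertices) % n_parcels)
# Made-up thickness values in mm
thickness = rng.normal(loc=2.5, scale=0.3, size=n_vertices)

roi_thickness = parcel_means(thickness, labels, n_parcels)
print("ROI-level vector shape:", roi_thickness.shape)
```

The same function could be reused for network-level averages by passing network labels and the number of networks in place of the parcel arguments.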
Psychological and demographic data
The ABCD Study population is well-characterized with hundreds of demographic, physical, cognitive, and mental health variables65. The current project examined the associations between 41 of these variables (Supplementary Table 1) and brain structure (cortical thickness) and function (RSFC). Psychological and demographic variables were selected to reflect the primary domains of interest, cognition (individual subscales and composite scores from the NIH Toolbox) and mental health (individual subscales and composite scores from the CBCL), as well as demographic and physical variables relevant to development (for example, age) and health (for example, body mass index).
Psychological and demographic covariates
The primary goal of this project was to study how the pairing of brain–phenotype effect sizes and sampling variability (random variation across samples, as opposed to systematic variation threatening causal inference66) can account for widespread replication failures. Hence, our results focus on bivariate associations (correlation) and standard multivariate models linking brain structure and function to psychological and demographic variables without covariate adjustment. However, we did examine the influence of sociodemographic covariates standardly used in ABCD analyses (race, gender, parental marital status, parental income, Hispanic versus non-Hispanic ethnicity, family and data collection site) on BWAS effect sizes, noting that they generally decrease effect sizes, particularly for the largest BWAS effects (see Extended Data Fig. 2). Furthermore, the ABCD subsamples (ARMS; see above) we used for replication analyses are matched for salient demographic factors (site location, family composition, age, sex, ethnicity, grade, highest level of parental education, handedness, combined family income and prior exposure to anaesthesia). Also, where possible, ABCD-distributed age-corrected scores were used, given (1) well-established age-related changes in these measures and (2) age-corrected scores improved normality for many measures (for example, CBCL syndrome scales and broadband factors).
Capture of psychological and demographic data
The ABCD Data Analysis and Informatics Center (DAIC) has released an online tool called DEAP (Data Exploration and Analysis Portal), which can be accessed at https://deap.nimhda.org/. In this Article, we introduce an additional tool called ABCDE (ABCD Boolean Capture Data Explorer, developed by B.P.K.), which we have used for preparation of the data herein. ABCDE complements DEAP by allowing for finer-grained control of data extraction on the researcher’s own computer rather than through a web portal. The source code and documentation can be accessed at https://gitlab.com/DosenbachGreene/abcde.
Univariate brain–behavioural phenotype correlations
For each brain measure at a given level of organization, we correlated the brain measures (structure: cortical thickness; function: RSFC) with each psychological variable. Cognitive ability (total composite score on the NIH Toolbox) and psychopathology (total score on the CBCL) are presented in the main text; all others are included in Extended Data Fig. 1. Correlations between brain and phenotypes were generated for RSFC at the edge level (ROI–ROI pair (n = 77,421)), network level (average of RSFC within and between each network (n = 105)) and component level (principal component weights (n = 100)). To extract components representing inter-individual differences, we vectorized each participant’s RSFC matrix, concatenated the vectorized matrices and then performed PCA (Matlab’s pca.m function). Correlations between brain and phenotypes were generated for cortical thickness at the vertex level (n = 59,412), ROI level (n = 333) and network level (n = 13). Repeat analyses employing less rigorous motion censoring and thus retaining a larger subset of the full ABCD sample (n = 9,753) replicated the effect sizes (top 1% largest effects: |r| > 0.06).
To examine the distribution of correlations for iteratively larger sample sizes, we randomly selected participants with replacement from the full sample (n = 3,928, post denoising) at logarithmically spaced sample sizes (16 intervals: n = 25, 33, 50, 70, 100, 135, 200, 265, 375, 525, 725, 1,000, 1,430, 2,000, 2,800 and 3,928). For cortical thickness data, the full sample contained the same sampling bins, with the exception of the final bin (full sample), which contained n = 3,604 participants. At each sample size, we randomly sampled participants 1,000 times, resulting in 16,000 brain–psychological phenotype resamplings for each brain–phenotype correlation. For multivariate approaches, 100 bootstrap samples were computed across the logarithmically spaced sample sizes (16 intervals: n = 25, 33, 45, 60, 80, 100, 145, 200, 256, 350, 460, 615, 825, 1,100, 1,475 and 1,964 (1,814 for cortical thickness)). We note that the iterations were reduced for multivariate methods (100 iterations) owing to their high computational costs. In addition, the multivariate analyses were primarily focused on mean estimates, rather than the full distribution. We also performed sensitivity analyses to quantify sampling variability using data from only singletons (that is, no sibling and/or twin pairs), which was nearly identical to sampling variability in the full sample (which included siblings and/or twins; Extended Data Fig. 10; Δr = 0.0005). For highlighting the effects of sampling variability (Fig. 1e, f), we extracted the brain–phenotype correlation with the largest effect size for each imaging modality (cortical thickness and RSFC) and exemplar phenotype (cognitive ability and psychopathology). The sampling variability (range of possible correlations, 99% confidence interval and 95% confidence interval) at each sampling interval for correlations between RSFC and cortical thickness with cognitive ability and psychopathology is presented in the main text (Fig. 1e, f); correlations between brain measures and other behaviours can be found in Supplementary Figs. 4, 5.
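The resampling procedure can be sketched in Python as follows. This is an illustration on simulated data, not the analysis code: the simulated population (true r = 0.1), the function names, the percentile-based interval and the reduced number of resamples and bins (chosen to keep the example fast) are all assumptions.

```python
import math
import random

# Sample sizes listed in the text for the univariate RSFC analyses.
SAMPLE_SIZES = [25, 33, 50, 70, 100, 135, 200, 265, 375, 525, 725,
                1000, 1430, 2000, 2800, 3928]


def pearson(x, y):
    """Pearson correlation; returns None if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def bootstrap_correlations(brain, phenotype, n, n_resamples, rng):
    """Correlations from n_resamples draws of n participants, with replacement."""
    rs = []
    while len(rs) < n_resamples:
        idx = [rng.randrange(len(brain)) for _ in range(n)]
        r = pearson([brain[i] for i in idx], [phenotype[i] for i in idx])
        if r is not None:  # redraw degenerate resamples
            rs.append(r)
    return rs


def central_interval(values, level):
    """Percentile interval covering the central `level` share of values."""
    s = sorted(values)
    last = len(s) - 1
    lo = s[int((1 - level) / 2 * last)]
    hi = s[min(last, math.ceil((1 + level) / 2 * last))]
    return lo, hi


# Simulated population standing in for one brain feature and one phenotype.
rng = random.Random(0)
N = 3928
brain = [rng.gauss(0, 1) for _ in range(N)]
phenotype = [0.1 * b + math.sqrt(1 - 0.1 ** 2) * rng.gauss(0, 1) for b in brain]

summary = {}
for n in (25, 200, 2000):  # three of the 16 bins
    rs = bootstrap_correlations(brain, phenotype, n, 200, rng)
    summary[n] = {"range": (min(rs), max(rs)),
                  "ci95": central_interval(rs, 0.95)}
```

In this toy setting the range of observed correlations narrows markedly as n increases, which is the quantity plotted as sampling variability.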
Sampling variability examples with a sample size of 25
Using the outputs from the resampling procedures above, we used the 1,000 resamplings with n = 25 to examine the correlation between the DMN and cognitive ability (total composite score on the NIH Toolbox), as well as the DMN and psychopathology (total problem score on the CBCL), for both cortical thickness and RSFC. To demonstrate how sampling variability affects correlations, the 1,000 resamples were ranked by effect size. Subsequently, we selected two samples from the top 10 samples (in terms of effect size); one with a significant positive association and one with a significant negative association.
ABCD task data
Data from three in-scanner fMRI tasks (n-back, stop signal, monetary incentive delay) were concatenated to the 4 × 5 min resting-state runs (rest + task) to determine whether additional data affected the effect size estimates. After data were concatenated across the 4 conditions (rest + 3 task states), correlation matrices were generated and correlated with psychological phenotypes as detailed above, under univariate brain–behavioural phenotype correlations. Task events were not regressed67. Data processing steps for task data were the same as RSFC, including the removal of frames with a filtered-FD > 0.08 mm.
Correlations between behavioural phenotypes
To examine the range of sampling variability as a function of sample size between 41 psychological and demographic measures (Supplementary Fig. 6), we randomly selected participants with replacement from the full behavioural sample (n = 11,572) at logarithmically spaced sample sizes (9 intervals: n = 25, 50, 100, 200, 500, 1,000, 2,000, 4,000 and 9,000). At each interval, we randomly sampled participants 1,000 times, resulting in 9,000 behaviour–behaviour phenotype correlation resamplings for each association. For each association between behavioural phenotypes, we quantified sampling variability at each sampling bin as the range of correlations observed through this resampling procedure.
False positives, false negatives and power
False negative (Fig. 3a) and false positive (Fig. 3f) rates were derived through resampling (see ‘Resampling procedures’) for all edge-wise brain-wide associations. For each sample size bin (16 total), we randomly sampled with replacement n individuals (1,000 subsamples) and computed the brain–behavioural phenotype correlation and associated P value. A correlation was deemed significant if it passed a threshold (P value range: <0.05 to <10⁻⁷ (Bonferroni-corrected) across 77,421 ROI–ROI pairs) in the full sample (cortical thickness n = 3,604, RSFC n = 3,928). At each sample size, if a correlation in the full sample was not significant, we determined the percentage of studies that resulted in a false positive significant correlation across a broad range of P values (0.05 to 10⁻⁷). Conversely, if a correlation in the full sample was significant (P < 0.05 to 10⁻⁷), we determined the percentage of studies that resulted in a false negative non-significant correlation across a broad range of P values (10⁻⁷ to 0.05). Statistical power (Fig. 3d) was calculated as 1 − false negative rate.
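A minimal Python sketch of these definitions is given below. It is illustrative only: the function names and toy P values are hypothetical, pooling subsamples across features is an assumption, and the P value helper uses the Fisher z normal approximation, which is not necessarily the test used in the analyses.

```python
import math


def p_from_r(r, n):
    """Two-tailed P value for a correlation r at sample size n
    (Fisher z normal approximation; an assumption of this sketch)."""
    r = max(min(r, 1 - 1e-12), -1 + 1e-12)  # keep atanh finite
    z = abs(math.atanh(r)) * math.sqrt(n - 3)
    return math.erfc(z / math.sqrt(2))


def error_rates(full_significant, subsample_pvalues, alpha):
    """full_significant: one bool per brain feature (significant in full sample).
    subsample_pvalues: per feature, the P values across subsamples at one n."""
    fn = fp = n_true = n_null = 0
    for sig, pvals in zip(full_significant, subsample_pvalues):
        hits = sum(p < alpha for p in pvals)
        if sig:   # effect present in the full sample: misses are false negatives
            n_true += len(pvals)
            fn += len(pvals) - hits
        else:     # no effect in the full sample: hits are false positives
            n_null += len(pvals)
            fp += hits
    fnr = fn / n_true if n_true else None
    fpr = fp / n_null if n_null else None
    power = None if fnr is None else 1 - fnr
    return {"false_negative_rate": fnr, "false_positive_rate": fpr,
            "power": power}


# Toy example: two features, four subsamples each, threshold P < 0.05.
rates = error_rates(
    full_significant=[True, False],
    subsample_pvalues=[[0.01, 0.20, 0.03, 0.60], [0.04, 0.50, 0.70, 0.90]],
    alpha=0.05,
)
```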
BWAS correlation inflation
For each univariate brain-wide association in the full sample (cortical thickness n = 3,604; RSFC n = 3,928) at the vertex/edge level, we determined whether or not a correlation was significant (using two-tailed P < 0.05 (uncorrected) and P < 10⁻⁷ (Bonferroni corrected for multiple comparisons) thresholds). Then, for each significant correlation in the full sample, we extracted all of the significant correlations (P < 0.05 and 10⁻⁷) observed across 1,000 subsamples at each sample size bin. Of these significant correlations in subsamples at each sample size bin, we determined the percentage that were inflated, relative to the full sample effect size, across varying magnitudes (50%, 100% and 200%; Fig. 3b).
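The inflation calculation can be sketched as follows. The operational definition used here, that a subsample correlation is inflated by a magnitude m when |r_sub| ≥ (1 + m) × |r_full|, is an assumption of this illustration, as are the function name and toy values.

```python
def inflation_rates(full_r, significant_sub_rs, magnitudes=(0.5, 1.0, 2.0)):
    """Share of significant subsample correlations whose absolute value exceeds
    the full-sample effect by at least 50%, 100% or 200% (assumed definition)."""
    if not significant_sub_rs:
        return None
    rates = {}
    for m in magnitudes:
        inflated = sum(abs(r) >= (1 + m) * abs(full_r)
                       for r in significant_sub_rs)
        rates[m] = inflated / len(significant_sub_rs)
    return rates


# Toy example: full-sample r = 0.1 and five significant subsample correlations.
rates = inflation_rates(0.1, [0.12, 0.16, 0.21, 0.35, 0.31])
```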
BWAS sign errors
Each brain-wide association was extracted from the full sample as a reference. Across the 1,000 subsamples within a sampling bin, we determined the percentage of correlations that had the opposite correlation sign as the correlation sign in the full sample, thresholding the subsamples at the same P values as all other analyses of statistical errors (P < 10⁻⁷ to 0.05).
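A short sketch of the sign error rate is given below. Taking the significant subsample correlations as the denominator is an assumption of this illustration; the function name and toy values are hypothetical.

```python
def sign_error_rate(full_r, sub_rs, sub_ps, alpha):
    """Share of significant subsample correlations whose sign is opposite
    to the sign of the full-sample correlation."""
    significant = [r for r, p in zip(sub_rs, sub_ps) if p < alpha]
    if not significant:
        return None
    flipped = sum((r > 0) != (full_r > 0) for r in significant)
    return flipped / len(significant)


# Toy example: a weak positive full-sample effect and five small subsamples.
rate = sign_error_rate(
    full_r=0.05,
    sub_rs=[0.40, -0.45, 0.10, -0.50, 0.42],
    sub_ps=[0.04, 0.02, 0.60, 0.01, 0.03],
    alpha=0.05,
)
```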
Univariate BWAS replication
Replication is commonly defined as detecting a significant association (for example, P < 0.05) that was deemed significant (P < 0.05) in a previous sample (Fig. 3e). To determine the probability of replicating a brain–phenotype association in new (out-of-sample) data at a given sample size, we correlated every brain feature (RSFC edge, cortical thickness vertex) with each behavioural phenotype in 1,000 bootstrapped samples across sample sizes (same sampling bins as listed under ‘Resampling procedures’). For each behavioural phenotype, sample size (n = 25, 33, 45, 60, 80, 100, 145, 200, 256, 350, 460, 615, 825, 1,100, 1,475 and 1,964 (1,814 for cortical thickness); note: data end at n ≈ 2,000 as the replication sample is half of the full sample), and bootstrapped subsample, we first determined the brain–behavioural phenotype associations that were significant (at P < 10⁻⁷ to 0.05) in the discovery (in-sample) dataset. Next, we extracted the same brain features from the replication (out-of-sample) dataset and quantified the percentage of associations that were also significant in the replication dataset. Note, to mirror a process of replicating existing effects, we used the number of identified significant associations in the discovery sample as the total number of features that could be replicated (as opposed to the total number of brain features regardless of discovery sample significance). For example, if all significant BWAS in the discovery sample were also significant in the replication sample, the probability of replication would be 100%.
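The replication percentage defined above can be written compactly. The following Python sketch uses hypothetical names and toy P values and only illustrates the choice of denominator (features significant in the discovery sample).

```python
def replication_rate(discovery_p, replication_p, alpha):
    """Share of features significant in the discovery sample that are
    also significant in the replication sample."""
    discovered = [i for i, p in enumerate(discovery_p) if p < alpha]
    if not discovered:
        return None
    replicated = sum(replication_p[i] < alpha for i in discovered)
    return replicated / len(discovered)


# Toy example with five brain features.
discovery_p = [0.01, 0.20, 0.03, 0.04, 0.70]
replication_p = [0.02, 0.01, 0.30, 0.04, 0.90]
rate = replication_rate(discovery_p, replication_p, alpha=0.05)
```

Here three features are significant in the discovery sample and two of them replicate; the second feature, significant only in the replication sample, does not enter the calculation.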
Effect sizes in HCP replication
We used data from n = 900 individuals from the HCP 1,200 Subject Data Release (aged 22–35 years). All HCP participants provided informed consent. A custom Siemens SKYRA 3.0T MRI scanner and a custom 32-channel head matrix coil were used to obtain high-resolution T1-weighted (MP-RAGE, TR = 2.4 s, 0.7 mm³ voxels) and BOLD contrast sensitive (gradient-echo EPI, multiband factor 8, TR = 0.72 s, 2 mm³ voxels) images from each participant. The HCP used sequences with left-to-right (LR) and right-to-left (RL) phase encoding, with a single RL and LR run on each day for two consecutive days for a total of four runs68. MRI data were preprocessed as previously described62. All HCP data are available at https://db.humanconnectome.org/.
Similar to the ABCD data, we extracted the timeseries from a total of 394 cortical and subcortical ROIs, correlated and Fisher z-transformed them. Data from the NIH Toolbox were correlated with each edge of the RSFC correlation matrix across participants. Across all NIH Toolbox subscales, the tails of the distributions of the resulting brain–behavioural phenotype correlations were compared to 100 subsampled ABCD brain–behavioural phenotype correlations (n = 877, matching HCP sample size). In Supplementary Fig. 8, we show the distributions of brain–behavioural phenotype correlations for ABCD and HCP data, for each NIH Toolbox subscale.
Effect sizes in UKB replication
We used pre-processed resting-state data from n = 32,572 individuals from the January 2020 UKB release69, processed with the same processing pipeline as the ABCD data. All UKB participants provided informed consent. For a complete description of study flow and imaging protocols, see Littlejohns et al.70. The UKB collects measures of fluid intelligence, which we used to correlate with RSFC, mimicking ABCD and HCP samples. For Fig. 2, we used 100 subsamples of n = 900 from each of the ABCD and UKB datasets to match the sample size of HCP for the associations between RSFC and fluid intelligence (n = 900). We subsequently determined the threshold to reach the top 1% strongest RSFC associations with fluid intelligence in each of the three datasets.
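One way to compute such a threshold is sketched below in Python. The exact percentile convention is an assumption: here the threshold is the smallest |r| among the top 1% of associations ranked by absolute value, and the function name and toy correlations are hypothetical.

```python
import math


def top_fraction_threshold(correlations, fraction=0.01):
    """Smallest absolute correlation among the strongest `fraction` of effects."""
    magnitudes = sorted((abs(r) for r in correlations), reverse=True)
    k = max(1, math.ceil(fraction * len(magnitudes)))
    return magnitudes[k - 1]


# Toy example: 200 correlations evenly spaced from -0.100 to 0.099.
rs = [i / 1000 for i in range(-100, 100)]
threshold = top_fraction_threshold(rs)  # top 1% = 2 strongest effects
```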
Sampling variability in HCP replication
To quantify the degree of sampling variability in single site, single scanner HCP data compared to multi-site, multi-scanner ABCD data, we subsampled ABCD RSFC data to match HCP sample sizes (n = 877, denoised and complete behavioural data across all NIH Toolbox subscales). For each dataset, we carried out resampling, as detailed under ‘Resampling procedures’ (12 intervals: n = 25, 33, 50, 70, 100, 135, 200, 265, 375, 525, 725 and 875), across all NIH Toolbox subscales. The range of correlations and 95% confidence interval observable in each sampling bin are shown in Extended Data Fig. 6a for both HCP and ABCD data.
Sampling variability for single-site ABCD versus multi-site ABCD
We directly compared single site ABCD data (site 16; n = 603) with multi-site ABCD data (n = 3,325, 20 sites; site 16 was excluded) using 1,000 bootstrapped samples at 10 sample size intervals: n = 25, 33, 50, 70, 100, 135, 200, 265, 375 and 525. For this analysis, we used the associations between RSFC and all NIH Toolbox subscales (Extended Data Fig. 6b). The range of correlations and 95% confidence interval in each resampling bin are shown in Extended Data Fig. 6b for single-site and multi-site ABCD data.
BWAS effect sizes in task activation versus RSFC in HCP
We estimated the effect sizes between task activations (86 total contrasts, see Supplementary Table 3) and behavioural phenotypes71 (39 total, see Supplementary Table 4) across three levels of analysis: vertices, ROIs and networks (n = 844). In these same individuals, we estimated the effect sizes between RSFC and the same phenotypes across three levels of analysis: edges, principal components, and networks. To compare the resulting effect size distributions (for example, Extended Data Fig. 3), we determined the top 1% strongest effect sizes, as well as the maximum correlation (absolute value).
Multivariate out-of-sample replication
For multivariate out-of-sample replication, we used SVR and CCA. SVR with a linear kernel was performed using the e1071 package in the R environment (version 3.5.2) to predict primary phenotypes (psychopathology and cognitive ability) and other demographics and psychological phenotypes (Supplementary Figs. 11, 12) from individual differences in either RSFC or cortical thickness. One hundred bootstrap samples (sampling with replacement) were generated for each sample size. Hyperparameter tuning was examined in (1) split halves of the full discovery sample for multiple cognitive (NIH Toolbox) and psychopathology (CBCL) scales and (2) tenfold cross-validation within the full discovery sample for primary phenotypes (psychopathology and cognitive ability; Supplementary Figs. 11, 12). Hyperparameter tuning did not appreciably change out-of-sample prediction estimates to the replication sample (for example, average out-of-sample correlation difference between tuned and non-tuned models: RSFC = −0.006, cortical thickness = 0.014; Supplementary Figs. 11, 12). Figure 4a, b use default hyperparameters and PCA dimensionality reduction (with a threshold of 50% variance explained in the discovery set, for each sample size) prior to SVR, given that this procedure balanced out-of-sample prediction and model complexity for nearly all model types (Supplementary Figs. 11, 12). Replication set data were not used to estimate principal components, but rather replication set data were projected into component space via independently estimated loading matrices for each subsample of the discovery set to prevent bias. An alternative strategy of univariate feature ranking was also examined, where SVR models were trained on the 5,000, 10,000 or 15,000 vertices (cortical thickness) or edges (RSFC) with the highest bivariate correlation to the variable of interest in the training dataset, but this approach resulted in lower out-of-sample prediction (Supplementary Figs. 11, 12). 
Out-of-sample association strength is reported as the correlation between predicted and observed phenotypic scores (rpred; using models trained on the discovery set). Significance thresholds for out-of-sample replication (rpred) were estimated via permutation testing (1,000 iterations) with models trained on the full discovery set (RSFC: n = 1,964; cortical thickness: n = 1,814) and tested on the full replication set.
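The logic of a permutation-derived threshold for rpred can be illustrated with a simplified Python sketch. It is not the procedure used here in detail: the sketch holds the predictions fixed and shuffles the observed replication scores, uses a one-sided 95th percentile, and runs on simulated data with fewer permutations. Those choices, the function names and the simulated values are all assumptions.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_threshold(predicted, observed, n_perm=1000, alpha=0.05, seed=0):
    """Null (1 - alpha) quantile of r_pred obtained by shuffling observed scores."""
    rng = random.Random(seed)
    shuffled = list(observed)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the pairing between prediction and score
        null.append(pearson(predicted, shuffled))
    null.sort()
    k = min(n_perm, max(1, round((1 - alpha) * n_perm)))
    return null[k - 1]


# Simulated predicted and observed scores for 300 replication participants.
rng = random.Random(1)
observed = [rng.gauss(0, 1) for _ in range(300)]
predicted = [0.5 * o + math.sqrt(1 - 0.5 ** 2) * rng.gauss(0, 1)
             for o in observed]

r_pred = pearson(predicted, observed)
threshold = permutation_threshold(predicted, observed, n_perm=200)
is_significant = r_pred > threshold
```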
CCA was performed using Matlab’s (2019A) canoncorr.m function for joint associations of the NIH Toolbox and CBCL with individual differences in either RSFC or cortical thickness. Equivalent bootstrapping and subsampling of the in-sample discovery set were tested and applied to the out-of-sample replication set, as in the SVR analyses. To model sampling variability across sample sizes, 100 bootstrap (sampling with replacement) samples were generated for each sample size. As with SVR, Fig. 4c, d used PCA dimensionality reduction (threshold of 20% variance explained in the in-sample discovery set, for each sample size) prior to CCA given that this maximized out-of-sample correlation (rCV1; Supplementary Figs. 14, 15). CCA models were fit on iteratively larger subsamples of the in-sample discovery dataset. The first canonical vector (CV1) weights were extracted and applied to the full out-of-sample brain and behaviour data. This resulted in the out-of-sample correlation (rCV1) between multivariate brain and behaviour data. Significance thresholds for out-of-sample replication were estimated via permutation testing (1,000 iterations) with models trained on the full ABCD discovery set (RSFC: n = 1,964; cortical thickness: n = 1,814) and tested on the full replication set.
Towards a new era of BWAS
In Extended Data Fig. 9, sampling variability, statistical errors (false positives, false negatives, inflation and sign errors), and out-of-sample multivariate associations (rpred, rCV1) were plotted as a function of sample size (y-axis: 0–1 for sampling variability (r), 0–100% for statistical errors (cumulative sum across all four error types), 0–100% for out-of-sample associations). To account for differences between in-sample and out-of-sample multivariate associations, out-of-sample multivariate associations were normalized by the mean in-sample (discovery) correlation at the full sample size. All three curves (sampling variability, statistical errors, and out-of-sample association) were based on the largest univariate and multivariate brain-wide association (RSFC with cognitive ability).
Further information on research design is available in the Nature Research Reporting Summary linked to this paper.
Participant level data from all datasets (ABCD, HCP, UKB) are openly available pursuant to individual consortium-level data access rules. The ABCD data repository grows and changes over time (https://nda.nih.gov/abcd). The ABCD data used in this report came from ABCD collection 3165 and the Annual Release 2.0 (https://doi.org/10.15154/1503209). The UK Biobank is a large-scale biomedical database and research resource containing genetic, lifestyle and health information from half a million UK participants (www.ukbiobank.ac.uk). UK Biobank’s database, which includes blood samples, heart and brain scans and genetic data of the 500,000 volunteer participants, is globally accessible to approved researchers who are undertaking health-related research that is in the public interest. Data were provided, in part, by the Human Connectome Project, WU-Minn Consortium (principal investigators: D. Van Essen and K. Ugurbil; 1U54MH091657) funded by the 16 NIH institutes and centers that support the NIH Blueprint for Neuroscience Research; and by the McDonnell Center for Systems Neuroscience at Washington University. Some data used in the present study are available for download from the Human Connectome Project (www.humanconnectome.org). Users must agree to data use terms for the HCP before being allowed access to the data and ConnectomeDB; details are provided at https://www.humanconnectome.org/study/hcp-young-adult/data-use-terms. Source data are provided with this paper.
Analysis code specific to this study can be found at https://gitlab.com/DosenbachGreene/bwas. Code for processing ABCD and UKB data can be found at https://github.com/DCAN-Labs/abcd-hcp-pipeline. MRI data analysis code can be found at https://github.com/ABCD-STUDY/nda-abcd-collection-3165. FIRMM software is available at https://firmm.readthedocs.io/en/latest/release_notes/ (the ABCD Study used version 3.0.14). The MuMIn R package (version 1.43.17) is available at https://cran.r-project.org/web/packages/MuMIn/index.html.
Raichle, M. E. et al. A default mode of brain function. Proc. Natl Acad. Sci. USA 98, 676–682 (2001).
Wagner, A. D. et al. Building memories: remembering and forgetting of verbal experiences as predicted by brain activity. Science 281, 1188–1191 (1998).
Buckner, R. L. et al. Detection of cortical activation during averaged single trials of a cognitive task using functional magnetic resonance imaging. Proc. Natl Acad. Sci. USA 93, 14878–14883 (1996).
Szucs, D. & Ioannidis, J. P. Sample size evolution in neuroimaging research: an evaluation of highly-cited studies (1990-2012) and of latest practices (2017-2018) in high-impact journals. Neuroimage 221, 117164 (2020).
Button, K. S. et al. Power failure: why small sample size undermines the reliability of neuroscience. Nat. Rev. Neurosci. 14, 365–376 (2013).
Yarkoni, T. Big correlations in little studies: inflated fMRI correlations reflect low statistical power—Commentary on Vul et al. (2009). Perspect. Psychol. Sci. 4, 294–298 (2009).
Poldrack, R. A. et al. Scanning the horizon: towards transparent and reproducible neuroimaging research. Nat. Rev. Neurosci. 18, 115–126 (2017).
Masouleh, S. K., Eickhoff, S. B., Hoffstaedter, F. & Genon, S., Alzheimer’s Disease Neuroimaging Initiative. Empirical examination of the replicability of associations between brain structure and psychological variables. eLife 8, e43464 (2019).
Dinga, R. et al. Evaluating the evidence for biotypes of depression: Methodological replication and extension of Drysdale et al. (2017). Neuroimage Clin 22, 101796 (2019).
Boekel, W. et al. A purely confirmatory replication study of structural brain–behavior correlations. Cortex 66, 115–133 (2015).
Nosek, B. A., Cohoon, J., Kidwell, M. & Spies, J. R. Estimating the reproducibility of psychological science. Preprint at https://doi.org/10.31219/osf.io/447b3 (2016).
Visscher, P. M. et al. 10 years of GWAS discovery: biology, function, and translation. Am. J. Hum. Genet. 101, 5–22 (2017).
Begley, C. G. & Ellis, L. M. Drug development: raise standards for preclinical cancer research. Nature 483, 531–533 (2012).
Botvinik-Nezer, R. et al. Variability in the analysis of a single neuroimaging dataset by many teams. Nature 582, 84–88 (2020).
Nuzzo, R. Scientific method: Statistical errors. Nature 506, 150–152 (2014).
Hawkins, D. M. The problem of overfitting. J. Chem. Inf. Comput. Sci. 44, 1–12 (2004).
Bishop, D. How scientists can stop fooling themselves over statistics. Nature 584, 9 (2020).
Munafò, M. R. et al. A manifesto for reproducible science. Nat. Hum. Behav. 1, 0021 (2017).
Schönbrodt, F. D. & Perugini, M. At what sample size do correlations stabilize? J. Res. Pers. 47, 609–612 (2013).
Varoquaux, G. Cross-validation failure: small sample sizes lead to large error bars. Neuroimage 180, 68–77 (2018).
Yarkoni, T. Big correlations in little studies: Inflated fMRI correlations reflect low statistical power—Commentary on Vul et al. (2009). Perspect. Psychol. Sci. 4, 294–298 (2009).
Button, K. S. et al. Confidence and precision increase with high statistical power. Nat. Rev. Neurosci. 14, 585–586 (2013).
Casey, B. J. et al. The Adolescent Brain Cognitive Development (ABCD) study: imaging acquisition across 21 sites. Dev. Cogn. Neurosci. 32, 43–54 (2018).
Van Essen, D. C. et al. The WU-Minn Human Connectome Project: an overview. NeuroImage 80, 62–79 (2013).
Sudlow, C. et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).
Greene, A. S., Gao, S., Scheinost, D. & Constable, R. T. Task-induced brain state manipulation improves prediction of individual traits. Nat. Commun. 9, 2807 (2018).
Chaarani, B. et al. Baseline brain function in the preadolescents of the ABCD Study. Nat. Neurosci. 24, 1176–1186 (2021).
Heaton, R. K. et al. Reliability and validity of composite scores from the NIH Toolbox Cognition Battery in adults. J. Int. Neuropsychol. Soc. 20, 588–598 (2014).
Achenbach, T. M. & Rescorla, L. Manual for the ASEBA School-age Forms & Profiles: An Integrated System of Multi-informant Assessment (ASEBA, 2001).
Kharabian Masouleh, S. et al. Influence of processing pipeline on cortical thickness measurement. Cereb. Cortex 30, 5014–5027 (2020).
Erdfelder, E., Faul, F. & Buchner, A. GPOWER: A general power analysis program. Behav. Res. Methods Instrum. Comput. 28, 1–11 (1996).
Ioannidis, J. P. A., Munafò, M. R., Fusar-Poli, P., Nosek, B. A. & David, S. P. Publication and other reporting biases in cognitive sciences: detection, prevalence, and prevention. Trends Cogn. Sci. 18, 235–241 (2014).
Gordon, E. M. et al. Precision functional mapping of individual human brains. Neuron 95, 791–807.e7 (2017).
Ciric, R. et al. Benchmarking of participant-level confound regression strategies for the control of motion artifact in studies of functional connectivity. NeuroImage 154, 174–187 (2017).
Dosenbach, N. U. F. et al. Real-time motion analytics during brain MRI improve data quality and reduce costs. Neuroimage 161, 80–93 (2017).
Kanwisher, N., Stanley, D. & Harris, A. The fusiform face area is selective for faces not animals. NeuroReport 10, 183–187 (1999).
Pritschet, L. et al. Functional reorganization of brain networks across the human menstrual cycle. Neuroimage 220, 117091 (2020).
Laumann, T. O. et al. Brain network reorganisation in an adolescent after bilateral perinatal strokes. Lancet Neurol. 20, 255–256 (2021).
Kay, K. N., Naselaris, T., Prenger, R. J. & Gallant, J. L. Identifying natural images from human brain activity. Nature 452, 352–355 (2008).
Newbold, D. J. et al. Plasticity and spontaneous activity pulses in disused human brain circuits. Neuron 107, 580–589.e6 (2020).
Smith, S. M. & Nichols, T. E. Statistical challenges in ‘big data’ human neuroimaging. Neuron 97, 263–268 (2018).
Border, R. et al. No support for historical candidate gene or candidate gene-by-interaction hypotheses for major depression across multiple large samples. Am. J. Psychiatry 176, 376–387 (2019).
Lee, J. J. et al. Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals. Nat. Genet. 50, 1112–1121 (2018).
Kundu, P., Inati, S. J., Evans, J. W., Luh, W.-M. & Bandettini, P. A. Differentiating BOLD and non-BOLD signals in fMRI time series using multi-echo EPI. Neuroimage 60, 1759–1770 (2012).
Vizioli, L. et al. Lowering the thermal noise barrier in functional brain mapping with magnetic resonance imaging. Nat. Commun. 12, 5181 (2021).
Shiffman, S., Stone, A. A. & Hufford, M. R. Ecological momentary assessment. Ann. Rev. Clin. Psychol. 4, 1–32 (2008).
NOT-OD-07-088: policy for sharing of data obtained in NIH supported or conducted genome-wide association studies (GWAS). National Institutes of Health https://grants.nih.gov/grants/guide/notice-files/NOT-OD-07-088.html (2007).
Buzsáki, G. The brain–cognitive behavior problem: a retrospective. eNeuro 7, ENEURO.0069–20.2020 (2020).
Errington, T. M., Denis, A., Perfito, N., Iorns, E. & Nosek, B. A. Challenges for assessing replicability in preclinical cancer biology. eLife 10, e67995 (2021).
Patil, P., Peng, R. D. & Leek, J. T. What should researchers expect when they replicate studies? A statistical view of replicability in psychological science. Perspect. Psychol. Sci. 11, 539–544 (2016).
Feczko, E. et al. Adolescent Brain Cognitive Development (ABCD) community MRI collection and utilities. Preprint at https://doi.org/10.1101/2021.07.09.451638 (2021).
Volkow, N. D. et al. The conception of the ABCD Study: from substance use to a broad NIH collaboration. Dev. Cogn. Neurosci. 32, 4–7 (2018).
Power, J. D., Barnes, K. A., Snyder, A. Z., Schlaggar, B. L. & Petersen, S. E. Spurious but systematic correlations in functional connectivity MRI networks arise from subject motion. Neuroimage 59, 2142–2154 (2012).
Fair, D. A. et al. Correction of respiratory artifacts in MRI head motion estimates. Neuroimage 208, 116400 (2020).
Glasser, M. F. et al. The minimal preprocessing pipelines for the Human Connectome Project. Neuroimage 80, 105–124 (2013).
Avants, B. B. et al. A reproducible evaluation of ANTs similarity metric performance in brain image registration. Neuroimage 54, 2033–2044 (2011).
Dale, A. M., Fischl, B. & Sereno, M. I. Cortical surface-based analysis. NeuroImage 9, 179–194 (1999).
Jenkinson, M., Beckmann, C. F., Behrens, T. E. J., Woolrich, M. W. & Smith, S. M. FSL. Neuroimage 62, 782–790 (2012).
Marcus, D. S. et al. Informatics and data mining tools and strategies for the human connectome project. Front. Neuroinform. 5, 4 (2011).
Hallquist, M. N., Hwang, K. & Luna, B. The nuisance of nuisance regression: spectral misspecification in a common approach to resting-state fMRI preprocessing reintroduces noise and obscures functional connectivity. Neuroimage 82, 208–225 (2013).
Gordon, E. M. et al. Generation and evaluation of a cortical area parcellation from resting-state correlations. Cereb. Cortex 26, 288–303 (2016).
Seitzman, B. A. et al. A set of functionally-defined brain regions with improved representation of the subcortex and cerebellum. Neuroimage 206, 116290 (2020).
Marek, S. et al. Identifying reproducible individual differences in childhood functional brain networks: an ABCD Study. Dev. Cogn. Neurosci. 40, 100706 (2019).
Smith, S. M. et al. A positive-negative mode of population covariation links brain connectivity, demographics and behavior. Nat. Neurosci. 18, 1565–1567 (2015).
Barch, D. M. et al. Demographic, physical and mental health assessments in the adolescent brain and cognitive development study: rationale and description. Dev. Cogn. Neurosci. 32, 55–66 (2018).
Rothman, K. Modern Epidemiology (Lippincott Williams & Wilkins, 2016).
Gratton, C. et al. Functional brain networks are dominated by stable group and individual factors, not cognitive or daily variation. Neuron 98, 439–452.e5 (2018).
Van Essen, D. C. et al. The Human Connectome Project: a data acquisition perspective. Neuroimage 62, 2222–2231 (2012).
Miller, K. L. et al. Multimodal population brain imaging in the UK Biobank prospective epidemiological study. Nat. Neurosci. 19, 1523–1536 (2016).
Littlejohns, T. J. et al. The UK Biobank imaging enhancement of 100,000 participants: rationale, data collection, management and future directions. Nat. Commun. 11, 2624 (2020).
Barch, D. M. et al. Function in the human connectome: task-fMRI and individual differences in behavior. Neuroimage 80, 169–189 (2013).
We thank A. M. Dale, T. L. Jernigan, W. Zhao, C. Makowski, C. C. Fan and C. Palmer for their thoughtful comments on the manuscript. This work was supported by NIH grants MH100019 (S.M., N.A.S., A.M. and T.O.L.), MH121518 (S.M.), DA007261 (D.F.M.), NS090978 (B.P.K.), DA007261 (A.S.H.), MH125023 (M.R.D.), NS110332 (D.J.N.), NS115672 (A.Z.), MH112473 (T.O.L.), MH104592 (D.J.G.), AA02969 (S.M.M.), DA041148 (D.A.F.), DA04112 (D.A.F.), MH115357 (D.A.F.), MH096773 (D.A.F. and N.U.F.D.), MH122066 (D.A.F. and N.U.F.D.), MH121276 (D.A.F. and N.U.F.D.), MH124567 (D.A.F. and N.U.F.D.), NS088590 (N.U.F.D.), and the Andrew Mellon Predoctoral Fellowship (B.T.-C.), Lynne and Andrew Redleaf Foundation (D.A.F.), Kiwanis Neuroscience Research Foundation (N.U.F.D.) and the Jacobs Foundation grant 2016121703 (N.U.F.D.). ABCD Study: data used in the preparation of this article, in part, were obtained from the ABCD Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children aged 9–10 and follow them over 10 years into early adulthood. The ABCD Study is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041022, U01DA041028, U01DA041048, U01DA041089, U01DA041106, U01DA041117, U01DA041120, U01DA041134, U01DA041148, U01DA041156, U01DA041174, U24DA041123, U24DA041147, U01DA041093 and U01DA041025. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/scientists/workgroups/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.
HCP Study: data were provided, in part, by the Human Connectome Project, WU-Minn Consortium (U54 MH091657) funded by the 16 NIH Institutes and Centers that support the NIH Blueprint for Neuroscience Research; and by the McDonnell Center for Systems Neuroscience at Washington University. UK Biobank study: this research has been conducted, in part, using data from UK Biobank (www.ukbiobank.ac.uk). UK Biobank is generously supported by its founding funders the Wellcome Trust and UK Medical Research Council, as well as the Department of Health, Scottish Government, the Northwest Regional Development Agency, British Heart Foundation and Cancer Research UK. XSEDE and Pittsburgh Supercomputing Center: this work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. Specifically, it used the Bridges system, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC TG-IBN200009). MIDB, NGDR and MSI: this work used the storage and computational resources provided by the Masonic Institute for the Developing Brain (MIDB), the Neuroimaging Genomics Data Resource (NGDR) and the Minnesota Supercomputing Institute (MSI). NGDR is supported by the University of Minnesota Informatics Institute through the MnDRIVE initiative in coordination with the College of Liberal Arts, Medical School and College of Education and Human Development at the University of Minnesota. Daenerys NCCR: this work used the storage and computational resources provided by the Daenerys Neuroimaging Community Computing Resource (NCCR).
The Daenerys NCCR is supported by the McDonnell Center for Systems Neuroscience at Washington University, the Intellectual and Developmental Disabilities Research Center (IDDRC; P50 HD103525) at Washington University School of Medicine and the Institute of Clinical and Translational Sciences (ICTS; UL1 TR002345) at Washington University School of Medicine.
E.A.E., D.A.F. and N.U.F.D. have a financial interest in NOUS Imaging Inc. and may financially benefit if the company is successful in marketing FIRMM motion-monitoring software products. O.M.-D., E.A.E., A.N.V., D.A.F. and N.U.F.D. may receive royalty income based on FIRMM technology developed at Washington University School of Medicine and Oregon Health and Sciences University and licensed to NOUS Imaging Inc. D.A.F. and N.U.F.D. are co-founders of NOUS Imaging Inc. and E.A.E. is a former employee of NOUS Imaging. These potential conflicts of interest have been reviewed and are managed by Washington University School of Medicine, Oregon Health and Sciences University and the University of Minnesota. The other authors declare no competing interests.
Peer review information
Nature thanks Marcus Munafo, Russell Poldrack and other, anonymous, reviewers for their contribution to the peer review of this work. Peer review reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data figures and tables
Extended Data Fig. 1 Distributions of brain-wide association effect sizes by imaging modality and behavioral phenotype.
Histograms of all (a) cortical thickness and (b) resting-state functional connectivity (RSFC) associations, with demographic, cognitive, and mental health/personality variables. Correlations (r; linear bivariate) between brain measures and behavioral phenotypes were computed at multiple levels of scale (cortical thickness: vertices, regions of interest (ROIs), networks; RSFC: ROI-ROI pairs (edges), principal components, networks). The ordering of subgraphs follows the ordering of measures in the legend. All data shown are from the ABCD Study (n = 3,928).
Extended Data Fig. 2 Impact of sociodemographic covariates on brain-wide association effect sizes.
The influence of sociodemographic covariates (race, gender, parental marital status, parental income, Hispanic versus non-Hispanic ethnicity, family, data collection site) on BWAS (brain-wide association studies) effect sizes was examined in the ABCD Study dataset (n = 3,587 with complete cases for this analysis) through the model comparison strategy developed by the ABCD Data Analysis and Informatics Core and used in the Data Exploration and Analysis Portal (deap.nimhda.org). The percentages of variance explained by fixed effects in multilevel models (pseudo-R2) were calculated with the MuMIn package in R (1.43.17) and square root transformed to approximate an absolute-value BWAS correlation (|r|). The estimated BWAS effect sizes (|r|) prior to covariate adjustment are plotted on the x-axis and those after sociodemographic covariate adjustment on the y-axis. Values below the identity line indicate a reduction in effect size after covariate adjustment; values above it indicate an increase in effect size. BWAS models with and without covariate adjustment always included cognitive ability or psychopathology as the outcome variable and nested random effects of family and data collection site, in order to maximize comparability for subsequent fixed effects model comparisons. BWAS effect sizes without covariate adjustment were taken from models that only included these random effects, the brain feature of interest (cortical thickness [vertex]/RSFC [edge]) as a single fixed effect, and the psychological phenotype (cognitive ability/psychopathology). BWAS effect sizes with covariate adjustment estimated the unique, covariate-adjusted effect linking the brain feature of interest to the psychological phenotype by comparing a model with sociodemographic fixed effects but no brain feature fixed effect to one with both the sociodemographic fixed effects and the brain feature. The difference in pseudo-R2 (subsequently transformed to |r|) represents the additional fixed-effect variance the brain feature explained beyond the sociodemographic covariates.
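The pseudo-R2 to |r| transformation described in this legend can be illustrated with a minimal Python sketch. The function name and the numeric values are hypothetical and chosen only for illustration; the pseudo-R2 values themselves were computed with MuMIn in R and are not reproduced here. Clipping small negative differences to zero is an assumption of this sketch, not a step stated in the legend.

```python
import math

def delta_r2_to_abs_r(r2_covariates_only: float, r2_with_brain: float) -> float:
    """Approximate |r| for a brain feature from the gain in fixed-effect pseudo-R2.

    r2_covariates_only: pseudo-R2 of the model with sociodemographic fixed effects only.
    r2_with_brain: pseudo-R2 of the model that also includes the brain feature.
    """
    delta = r2_with_brain - r2_covariates_only
    # Assumption: tiny negative differences (estimation noise) are clipped to zero.
    return math.sqrt(max(delta, 0.0))

# Hypothetical example: the brain feature adds 0.25% of explained variance.
abs_r = delta_r2_to_abs_r(0.2000, 0.2025)
print(f"approximate covariate-adjusted |r| = {abs_r:.3f}")
```

Because |r| is the square root of the added variance, even a brain feature that explains a fraction of a percent of additional variance corresponds to a correlation of a few hundredths.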
Extended Data Fig. 3 Brain-wide association effect sizes derived from functional MRI (fMRI) task activations are similar to resting-state functional connectivity (RSFC).
(a) Cognitive ability (NIH Toolbox total composite score) plotted as a function of dorsal attention network working memory task activation (z). Note that this correlation with fMRI task activation (r = 0.34) is much larger than the largest replicated univariate effect size for RSFC. (b) Cognitive ability plotted as a function of working memory task accuracy. Individual differences in cognitive ability (phenotype of interest) are strongly correlated with individual differences in working memory (r = 0.54). Thus, task-specific effects (behavioral performance) confound links between brain function and the phenotype of interest (e.g. cognitive ability). (c) Residualizing the behavioral phenotype of interest (cognitive ability) with respect to individual differences in working memory task accuracy (task-specific effect) produces an association between task fMRI and cognitive ability (r = 0.14) similar to (d) the association between dorsal attention network RSFC and cognitive ability (r = 0.11). Data shown are from the HCP Study (n = 844).
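The residualization step in panel (c) can be sketched as follows. This is an illustration on synthetic data, not the analysis code used for the figure: the generative coefficients, the `direct` latent variable and the helper names `pearson_r` and `residualize` are all assumptions introduced here, and simple ordinary least squares with an intercept is assumed for the residualization.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def residualize(y, x):
    """Residuals of y after ordinary least-squares regression on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

# Synthetic data (hypothetical): task accuracy drives both activation and cognition,
# and a smaller "direct" component links activation and cognition beyond accuracy.
rng = random.Random(0)
n = 844
accuracy = [rng.gauss(0, 1) for _ in range(n)]
direct = [rng.gauss(0, 1) for _ in range(n)]
activation = [0.5 * a + 0.3 * d + 0.8 * rng.gauss(0, 1) for a, d in zip(accuracy, direct)]
cognition = [0.7 * a + 0.3 * d + 0.6 * rng.gauss(0, 1) for a, d in zip(accuracy, direct)]

r_raw = pearson_r(activation, cognition)
cognition_resid = residualize(cognition, accuracy)
r_resid = pearson_r(activation, cognition_resid)
print(f"raw r = {r_raw:.2f}, r after residualizing task accuracy = {r_resid:.2f}")
```

In this toy setting the raw brain-phenotype correlation is dominated by variance shared with task performance, so removing that variance from the phenotype leaves a smaller association, qualitatively matching the contrast between panels (a) and (c).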
Extended Data Fig. 4 Split-half reliability of resting-state functional connectivity (RSFC) in HCP, ABCD and UKB study samples.
Distribution of within-person split-half reliability33 of ROI (333 cortical ROIs from Gordon et al.61) connectivity matrices derived from RSFC data. The UKB data contain a single 6 min. resting-state run; the ABCD Study collected 4 x 5 min. runs (20 min. total), and the HCP collected 4 x 15 min. runs of resting-state data (60 min. total).
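One way to compute a within-person split-half reliability of a connectivity matrix is sketched below on synthetic time series. The split used here (first half versus second half of a single concatenated time series), the six-ROI toy parcellation and the helper names are assumptions made for illustration; the procedure actually used follows ref. 33 and is not reproduced here.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_edges(timeseries):
    """Upper-triangle ROI-ROI correlations (edges) from a list of per-ROI time series."""
    n_rois = len(timeseries)
    return [
        pearson_r(timeseries[i], timeseries[j])
        for i in range(n_rois)
        for j in range(i + 1, n_rois)
    ]

# Synthetic participant (hypothetical): 6 ROIs in two "networks" of 3 ROIs each.
rng = random.Random(1)
n_rois, n_timepoints = 6, 400
latent = [[rng.gauss(0, 1) for _ in range(n_timepoints)] for _ in range(2)]
timeseries = [
    [latent[roi // 3][t] + rng.gauss(0, 1) for t in range(n_timepoints)]
    for roi in range(n_rois)
]

# Split the scan in half, build one connectivity vector per half, then correlate them.
half = n_timepoints // 2
edges_first = correlation_edges([roi_ts[:half] for roi_ts in timeseries])
edges_second = correlation_edges([roi_ts[half:] for roi_ts in timeseries])
reliability = pearson_r(edges_first, edges_second)
print(f"split-half reliability across {len(edges_first)} edges: {reliability:.2f}")
```

Shortening `n_timepoints` in this sketch makes each half noisier and tends to lower the reliability estimate, which is the reason scan duration per participant is reported alongside the distributions.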
Extended Data Fig. 5 Effect size distributions for HCP, ABCD, UKB studies and expected sampling variability.
To determine whether smaller effect sizes in larger samples can be explained by the expected reduction of sampling variability, we estimated sampling variability (grey) for the full range of BWAS (brain-wide association studies) effect sizes observed in UKB (edge-wise resting-state functional connectivity [RSFC]; cognitive ability) as a function of sample size (x-axis). As in our primary ABCD analyses, UKB effects were resampled using a bootstrap procedure (1,000 iterations per edge). The actual distributions of the HCP, ABCD, and UKB BWAS effect sizes were then visualized relative to the expected sampling variability in UKB across sample sizes (grey). Consistent with an inflation of BWAS effect sizes due to sampling variability, relatively larger BWAS effect sizes in HCP (n = 900) and ABCD (n = 3,928) align with effect sizes in subsamples of the UKB data at corresponding sample sizes.
Extended Data Fig. 6 Comparison of single- and multi-site BWAS (brain-wide association studies) sampling variability.
(a) Sampling variability of resting-state functional connectivity (RSFC) associations with the NIH Toolbox subscales in equally-sized samples (n = 877) from HCP (grey) and ABCD (red). Effect sizes (center of error bands) were matched across datasets (r = 0.06) to isolate sampling variability for a given effect. (b) Sampling variability of RSFC associations with the NIH Toolbox subscales in a single-site ABCD sample (site 16; n = 603; teal) and every other ABCD site (n = 3,325; red). Effect sizes (center of error bands) were matched across datasets (r = 0.06).
Extended Data Fig. 7 Relationship between power and statistical threshold.
Statistical power (1 – false negative rate) is plotted as a function of the P value (two-tailed; < 0.05, < 10−2, < 10−3, < 10−4, < 10−5, < 10−6, < 10−7) used for significance testing in the denoised ABCD Study sample (n = 3,928). P < 0.05 represents an uncorrected threshold, whereas P < 10−7 represents a Bonferroni correction. More stringent control for multiple comparisons decreases power and increases sample size requirements.
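The trade-off between threshold stringency and power can be approximated analytically. The sketch below uses the Fisher z approximation for a two-tailed test of a correlation, which is a substitute for, not a reproduction of, the resampling-based power estimates plotted in the figure. The true effect size of r = 0.06 is an illustrative assumption, and `power_for_correlation` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def power_for_correlation(r: float, n: int, alpha: float) -> float:
    """Approximate two-tailed power to detect a true correlation r with n participants."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # critical value for the chosen threshold
    mu = math.atanh(r) * math.sqrt(n - 3)        # expected Fisher z test statistic
    # Probability that the test statistic falls beyond either critical value.
    return (1 - std_normal.cdf(z_crit - mu)) + std_normal.cdf(-z_crit - mu)

n = 3928                      # size of the ABCD Study sample named in the legend
true_r = 0.06                 # assumed effect size, for illustration only
thresholds = [0.05, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7]
powers = [power_for_correlation(true_r, n, alpha) for alpha in thresholds]
for alpha, power in zip(thresholds, powers):
    print(f"P < {alpha:g}: power = {power:.3f}")
```

Under these assumptions power is high at the uncorrected threshold and falls steeply as the threshold approaches the most stringent level, which is the pattern the figure describes.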
Extended Data Fig. 8 Inflation of univariate BWAS (brain-wide association studies) effect sizes (top 1% largest) by imaging modality and behavioral phenotype.
Better out-of-sample replication is indexed by a smaller difference between the discovery and replication dataset effect sizes (right side of histogram). Negative values indicate that an association was inflated in the discovery dataset relative to what was observed in the replication dataset. Out-of-sample reductions in effect sizes greater than 100% reflect sign errors. The leftward shift for cortical thickness relative to resting-state functional connectivity (RSFC), and for psychopathology relative to cognitive ability, indicates worse univariate BWAS reproducibility.
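The quantities in this legend can be made concrete with a short sketch. The formula below, the percentage change from the discovery to the replication effect size relative to the discovery effect size, is an assumed operationalization that is consistent with the legend (negative values indicate inflation; reductions beyond 100% indicate a sign flip), and the function names and example correlations are hypothetical.

```python
def percent_change(r_discovery: float, r_replication: float) -> float:
    """Percentage change in effect size from discovery to replication.

    Dividing by the signed discovery effect keeps the interpretation the same
    for positive and negative associations: negative output means the
    replication effect was weaker than (or opposite to) the discovery effect.
    """
    return 100.0 * (r_replication - r_discovery) / r_discovery

def is_sign_error(r_discovery: float, r_replication: float) -> bool:
    """A reduction of more than 100% means the replication effect changed sign."""
    return percent_change(r_discovery, r_replication) < -100.0

# Hypothetical discovery/replication pairs.
examples = [(0.20, 0.10), (0.20, -0.05), (-0.20, -0.10), (0.10, 0.12)]
for r_disc, r_rep in examples:
    change = percent_change(r_disc, r_rep)
    print(f"discovery r = {r_disc:+.2f}, replication r = {r_rep:+.2f}: "
          f"{change:+.0f}% (sign error: {is_sign_error(r_disc, r_rep)})")
```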
Extended Data Fig. 9 Influence of sample size on the robustness of brain-wide associations.
Trajectories of sampling variability (99% confidence interval; orange), statistical error rates (cumulative sum of false negatives, false positives, magnitude errors, sign errors; yellow), and support vector regression (SVR) out-of-sample association strength (as % of full in-sample association; dark red) as a function of sample size. Sample size (n ~ 4,000) represents a full sample (discovery + replication datasets of ~2,000 each). Data shown are from the ABCD Study.
Extended Data Fig. 10 Sampling variability is nearly identical when considering singletons vs. all participants.
Data were from the ABCD Study sample. Sampling variability (y-axis) as a function of sample size (x-axis; n = 25, 35, 45, 60, 80, 100, 145, 200, 256, 350, 460, 615, 825, 1,100, 1,475, 2,000) for all participants (black) and singletons only (twins and siblings excluded; green). Sampling variability was quantified as the difference between the upper and lower bounds of the 95% confidence interval across 1,000 bootstraps (resampled with replacement) across all 77,421 resting-state functional connectivity (RSFC; edges) associations with cognitive ability. The effect size magnitudes were likewise nearly identical in size-matched resamples (singletons-only [n = 2,528]: median |r| = 0.017; siblings-included [n = 2,528]: median |r| = 0.020).
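The sampling variability measure described in this legend, the width of the 95% confidence interval of a correlation across bootstrap resamples, can be sketched for a single synthetic association as follows. The synthetic population, the true correlation of 0.1, the reduced number of bootstraps (200 rather than 1,000, to keep the example fast) and the helper names are assumptions made for illustration.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def bootstrap_ci_width(x, y, n, n_boot, rng):
    """Width of the 95% interval of r across n_boot resamples of size n (with replacement)."""
    rs = []
    for _ in range(n_boot):
        idx = rng.choices(range(len(x)), k=n)
        rs.append(pearson_r([x[i] for i in idx], [y[i] for i in idx]))
    rs.sort()
    lower = rs[int(0.025 * n_boot)]
    upper = rs[int(0.975 * n_boot) - 1]
    return upper - lower

# Synthetic "full sample" (hypothetical) with a small true brain-phenotype correlation.
rng = random.Random(2)
population_size, true_r = 2000, 0.1
brain = [rng.gauss(0, 1) for _ in range(population_size)]
phenotype = [true_r * b + math.sqrt(1 - true_r ** 2) * rng.gauss(0, 1) for b in brain]

widths = {n: bootstrap_ci_width(brain, phenotype, n, n_boot=200, rng=rng)
          for n in (25, 100, 400)}
for n, width in widths.items():
    print(f"n = {n}: 95% CI width of r = {width:.2f}")
```

Repeating this calculation for every edge and for each subsample (all participants versus singletons only) would give curves of the kind compared in the figure.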
The file contains Supplementary Discussion, Supplementary Figs. 1–17 and Supplementary Tables 1–4.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Marek, S., Tervo-Clemmens, B., Calabro, F.J. et al. Reproducible brain-wide association studies require thousands of individuals. Nature 603, 654–660 (2022). https://doi.org/10.1038/s41586-022-04492-9