Numerous neuroimaging studies have investigated the neural basis of interindividual differences, but the replicability of brain–phenotype associations remains largely unknown. We used the UK Biobank neuroimaging dataset (N = 37,447) to examine associations with six variables related to physical and mental health (age, body mass index, intelligence, memory, neuroticism and alcohol consumption) and assessed how the replicability of brain–phenotype associations improves with increasing sample size. Age may require only 300 individuals to provide highly replicable associations, but other phenotypes required 1,500 to 3,900 individuals. The required sample size showed a negative power law relation with the estimated effect size. When only the upper and lower quarters were compared, the minimally required sample sizes for imaging decreased by 15–75%. Our findings demonstrate that large-scale neuroimaging data are required for replicable brain–phenotype associations, that this requirement can be mitigated by preselection of individuals and that small-scale studies may have reported false positive findings.
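As a rough illustration of the negative power law relation between required sample size and effect size, the sketch below fits N = a·|r|^(−b) by least squares on log-log axes. The function name `fit_power_law`, the constants 30 and 2, and the synthetic data points are assumptions chosen for illustration only; they are not estimates from this study, and the fitting procedure is not necessarily the one the authors used.

```python
import math


def fit_power_law(effect_sizes, sample_sizes):
    """Fit N = a * |r|**(-b) by ordinary least squares on log-log axes.

    Returns (a, b). A positive b means the required sample size falls
    as the absolute effect size grows (a negative power law relation).
    """
    xs = [math.log(abs(r)) for r in effect_sizes]
    ys = [math.log(n) for n in sample_sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic illustration only: points generated from N = 30 * |r|**-2,
# an arbitrary curve that is NOT taken from the article.
effects = [0.05, 0.1, 0.2, 0.3]
sizes = [30 * r ** -2 for r in effects]
a, b = fit_power_law(effects, sizes)
print(f"a = {a:.2f}, b = {b:.2f}")
```

Fitting in log space turns the power law into a straight line, so the exponent can be read directly from the slope.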
Open Access articles citing this article.
Brain-cognition relationships in late-life depression: a systematic review of structural magnetic resonance imaging studies
Translational Psychiatry Open Access 19 August 2023
This research used data from the UK Biobank resource (https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=100). Access to UK Biobank data requires the submission and approval of a research project by the UK Biobank committee. The DK atlas was used to parcellate the human cortex into 66 regions for structural brain measures (https://surfer.nmr.mgh.harvard.edu/fswiki/CorticalParcellation). Twenty-one functional networks were used to estimate functional brain measures (https://www.fmrib.ox.ac.uk/ukbiobank/group_means/rfMRI_ICA_d25_good_nodes.html).
The Python code (Jupyter notebooks) used for the statistical analyses is available on GitHub: https://github.com/deeppsych/Replicability-ukbb.
This research has been conducted using the UK Biobank Resource under application no. 30091. A.A. and K.J.H.V. are supported by the Foundation Volksbond Rotterdam. This work was supported by a China Scholarship Council (CSC) grant to S.L. G.v.W. has received research funding from Philips for an unrelated project. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
The authors declare no competing interests.
Peer review information
Nature Human Behaviour thanks Deanna Barch, Omid Kardan and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended Data Fig. 1 The largest absolute effect sizes of brain–phenotype associations for each imaging modality and phenotype.
The upper text shows the specific brain measures with the largest absolute effect sizes across brain–phenotype associations. Cortical surface area, CSA; cortical thickness, CT; functional connectivity, FC.
Extended Data Fig. 2 The proportion of brain features with regional replicability of larger than 0.75.
(a) Cortical surface area (CSA); (b) Cortical thickness (CT); (c) Functional connectivity (FC).
Extended Data Fig. 3 The relationship between observed and predicted sample sizes for added phenotypes.
The x axis represents the true minimally required sample sizes to obtain replicable brain associations; the y axis represents the required sample sizes predicted from the effect sizes of brain associations with the added phenotypes.
Extended Data Fig. 4 Differences of minimally required sample size to reach 75% replication probability between two-sample t-test using median splits and correlation analysis.
Positive values indicate lower minimally required sample sizes for the two-sample t-test, whereas negative values indicate higher minimally required sample sizes. Cross symbol ‘×’ indicates that the minimally required sample sizes are missing for the two-sample t-test or Spearman’s correlation analysis, because the regional replicability of all brain measures did not reach 0.75 at any sample size. Zero ‘0’ indicates that the minimally required sample sizes are the same between two-sample t-test and Spearman’s correlation analysis.
Extended Data Fig. 5 The minimally required sample sizes with preselection procedures at 10%, 20%, 25%, 30%, 40%, and 50% (median) for six representative phenotypes, when using a significance threshold of p < 0.05 uncorrected.
(a) Age; (b) Body mass index (BMI); (c) Fluid intelligence; (d) Numeric memory; (e) Neuroticism; (f) Alcohol consumption. Missing estimates (dots) could be related to two situations: first, good replicability is not obtained even at the largest sample size; second, the specific preselection is not conducted because of the distribution of the variable. Cortical surface area (CSA), cortical thickness (CT), and functional connectivity (FC).
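The preselection procedure referred to in these captions can be sketched as follows: keep only the individuals in the lowest and highest fraction of a phenotype, then compare a brain measure between the two groups. The helper names `preselect_extremes` and `welch_t`, the use of Welch's variant of the two-sample t statistic, and the simulated data are assumptions for illustration; the article's notebooks may implement this differently, and no P value is computed here.

```python
import random
import statistics


def preselect_extremes(phenotype, brain, fraction=0.25):
    """Return brain values for the lowest and highest `fraction` of the phenotype.

    fraction=0.25 compares the lower and upper quarters; fraction=0.5
    corresponds to a median split.
    """
    if not 0 < fraction <= 0.5:
        raise ValueError("fraction must be in (0, 0.5]")
    order = sorted(range(len(phenotype)), key=lambda i: phenotype[i])
    k = int(len(order) * fraction)
    if k < 2:
        raise ValueError("need at least two individuals per group")
    low = [brain[i] for i in order[:k]]
    high = [brain[i] for i in order[-k:]]
    return low, high


def welch_t(group_a, group_b):
    """Welch's two-sample t statistic (unequal variances assumed)."""
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    se = (var_a / len(group_a) + var_b / len(group_b)) ** 0.5
    return (statistics.mean(group_a) - statistics.mean(group_b)) / se


# Simulated data for illustration only (not UK Biobank data).
rng = random.Random(0)
phenotype = [rng.gauss(0, 1) for _ in range(2000)]
brain = [0.1 * p + rng.gauss(0, 1) for p in phenotype]

low, high = preselect_extremes(phenotype, brain, fraction=0.25)
print("t (upper vs lower quarter):", welch_t(high, low))
```

Selecting the extremes discards the middle of the distribution, so fewer individuals need imaging, at the cost of requiring phenotype data on a larger pool to select from.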
Extended Data Fig. 6 Proportional decrease in the minimally required sample size to reach good replicability for a two-sample t-test (by sample selection) relative to correlation analysis, for added phenotypes.
(a) Cortical surface area (CSA); (b) Cortical thickness (CT); (c) Functional connectivity (FC). Empty bars indicate that the minimally required sample sizes are missing for the two-sample t-test or Spearman’s correlation analysis, because the regional replicability of all brain measures did not reach 0.75 at any sample size. Coloured bars indicate that the minimally required sample size decreased in a two-sample t-test (by sample selection) compared to correlation analysis.
Extended Data Fig. 7 The minimally required sample sizes with preselection procedures at 10%, 20%, 25%, 30%, 40%, and 50% for additional phenotypes.
Missing estimates (dots) could be related to two factors: first, good replicability is not obtained even at the largest sample size; second, the specific preselection is not conducted because of the distribution of the variable. Cortical surface area (CSA), cortical thickness (CT), and functional connectivity (FC).
Extended Data Fig. 8 Improvement of the replicability of feature selection with increasing sample size at the thresholds ranging from 5% to 25%.
(a) shows the Jaccard index for cortical surface area (CSA) at different thresholds. (b) shows the Jaccard index for cortical thickness (CT) at different thresholds. (c) shows the Jaccard index for functional connectivity (FC) at different thresholds.
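For readers unfamiliar with the Jaccard index used in this figure, the sketch below shows one way it could quantify the replicability of feature selection: select the top fraction of brain features in two independent samples and measure the overlap of the two selections. Ranking features by absolute effect size, the helper names `top_features` and `jaccard`, and the toy effect sizes are assumptions for illustration, not the authors' exact procedure.

```python
def top_features(effect_sizes, threshold=0.05):
    """Indices of the top `threshold` fraction of features, ranked by |effect size|."""
    k = max(1, round(len(effect_sizes) * threshold))
    ranked = sorted(
        range(len(effect_sizes)),
        key=lambda i: abs(effect_sizes[i]),
        reverse=True,
    )
    return set(ranked[:k])


def jaccard(set_a, set_b):
    """Jaccard index: size of the intersection divided by size of the union."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


# Toy effect sizes for ten features in two hypothetical samples (illustration only).
sample_a = [0.30, -0.25, 0.02, 0.01, 0.20, 0.00, 0.03, -0.04, 0.05, 0.01]
sample_b = [0.28, 0.03, 0.01, -0.22, 0.19, 0.02, 0.00, 0.04, -0.01, 0.02]

selected_a = top_features(sample_a, threshold=0.2)
selected_b = top_features(sample_b, threshold=0.2)
print("Jaccard index:", jaccard(selected_a, selected_b))
```

An index of 1 means both samples select exactly the same features, and 0 means the selections do not overlap at all.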
Extended Data Fig. 9 Improvement of replicability for partial least squares (PLS) regression analysis with increasing sample size.
(a), (b), and (c) show the proportion of decreased correlations of PLS1 with the phenotypes for cortical surface area (CSA), cortical thickness (CT), and functional connectivity (FC); (d), (e), and (f) show the intraclass correlation coefficients (ICCs) of PLS weights. The dotted lines indicate good and moderate replicability levels (0.75 and 0.5).
About this article
Cite this article
Liu, S., Abdellaoui, A., Verweij, K.J.H. et al. Replicable brain–phenotype associations require large-scale neuroimaging data. Nat Hum Behav 7, 1344–1356 (2023). https://doi.org/10.1038/s41562-023-01642-5