Replicable brain–phenotype associations require large-scale neuroimaging data

Abstract

Numerous neuroimaging studies have investigated the neural basis of interindividual differences, but the replicability of brain–phenotype associations remains largely unknown. We used the UK Biobank neuroimaging dataset (N = 37,447) to examine associations with six variables related to physical and mental health (age, body mass index, intelligence, memory, neuroticism and alcohol consumption) and assessed how the replicability of brain–phenotype associations improves with increasing sample size. Age may require only 300 individuals to provide highly replicable associations, but other phenotypes required 1,500 to 3,900 individuals. The required sample size showed a negative power law relation with the estimated effect size. When only the upper and lower quarters were compared, the minimally required sample sizes for imaging decreased by 15–75%. Our findings demonstrate that large-scale neuroimaging data are required for replicable brain–phenotype associations, that this requirement can be mitigated by preselection of individuals and that small-scale studies may have reported false positive findings.
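The negative power law relation between effect size and required sample size can be illustrated with a least-squares fit on log-log axes. The following is a minimal sketch and not the authors' analysis code: the function name `fit_power_law` is hypothetical, and the numbers are synthetic values generated from N = 10 × |r|^(-2) only so that the fit has a known answer.

```python
import math

def fit_power_law(effect_sizes, sample_sizes):
    """Fit N = a * |r|**(-b) by ordinary least squares on log-log axes.

    Returns (a, b). Assumes effect sizes are nonzero and sample sizes positive.
    """
    xs = [math.log(abs(r)) for r in effect_sizes]
    ys = [math.log(n) for n in sample_sizes]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # On log-log axes the slope is -b and the intercept is log(a).
    return math.exp(intercept), -slope

# Synthetic, hypothetical values (NOT results from the paper).
effects = [0.05, 0.10, 0.20, 0.40]
required_n = [10 / r ** 2 for r in effects]
a, b = fit_power_law(effects, required_n)
print(f"a = {a:.2f}, b = {b:.2f}")
```

A positive exponent b means that halving the effect size multiplies the required sample size by 2^b, which is why small associations demand disproportionately large cohorts.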

Fig. 1: Overview of the analytical steps used in this study to estimate replicability.
Fig. 2: Improvement of global replicability with increasing sample size.
Fig. 3: Improvement of regional replicability with increasing sample size for CSA.
Fig. 4: Improvement of regional replicability with increasing sample size for CT.
Fig. 5: Improvement of regional replicability with increasing sample size for FC.
Fig. 6: Relationship between the largest absolute effect sizes derived from the full sample and the minimally required sample sizes needed to achieve a regional replicability of 0.75.
Fig. 7: Proportional decrease in the minimally required sample size to reach good replicability for a two-sample t-test (comparing the first and fourth quarters) relative to correlation analysis.
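The preselection idea behind Fig. 7 (comparing only the first and fourth quarters of a phenotype instead of correlating across everyone) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the helper names are hypothetical, Welch's unequal-variance form of the t statistic is an assumption (the exact test variant is not given here), ties in the phenotype are not handled specially, and the data are made up.

```python
import math
import statistics

def split_outer_quarters(phenotype, brain):
    """Return brain values for the lowest and highest quarter of phenotype scores."""
    order = sorted(range(len(phenotype)), key=lambda i: phenotype[i])
    k = len(order) // 4  # group size; needs at least 8 individuals so k >= 2
    lower = [brain[i] for i in order[:k]]
    upper = [brain[i] for i in order[-k:]]
    return lower, upper

def welch_t(a, b):
    """Welch's two-sample t statistic for mean(b) - mean(a)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(b) - statistics.mean(a)) / se

# Made-up toy data: eight individuals, brain measure equal to the phenotype.
phenotype = [3, 0, 7, 1, 5, 2, 6, 4]
brain = [3.0, 0.0, 7.0, 1.0, 5.0, 2.0, 6.0, 4.0]
lower, upper = split_outer_quarters(phenotype, brain)
t_stat = welch_t(lower, upper)
print(lower, upper, round(t_stat, 3))
```

Only half of the sorted individuals enter the comparison, so fewer participants need imaging, at the cost of discarding the middle of the phenotype distribution.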

Data availability

This research used data from the UK Biobank resource (https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=100). Access to UK Biobank data requires the submission and approval of a research project by the UK Biobank committee. The Desikan–Killiany (DK) atlas was used to parcellate the human cortex into 66 regions for structural brain measures (https://surfer.nmr.mgh.harvard.edu/fswiki/CorticalParcellation). Twenty-one functional networks were used to estimate functional brain measures (https://www.fmrib.ox.ac.uk/ukbiobank/group_means/rfMRI_ICA_d25_good_nodes.html).

Code availability

The Python code (Jupyter notebooks) used for the statistical analyses is available on GitHub: https://github.com/deeppsych/Replicability-ukbb.

Acknowledgements

This research has been conducted using the UK Biobank Resource under application no. 30091. A.A. and K.J.H.V. are supported by the Foundation Volksbond Rotterdam. This work was supported by a China Scholarship Council (CSC) grant to S.L. G.v.W. has received research funding from Philips for an unrelated project. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

Contributions

S.L., A.A., K.J.H.V. and G.v.W. conceived and designed the study. S.L. analysed the data and wrote the manuscript. A.A., K.J.H.V. and G.v.W. provided significant feedback on the data analysis and the revision of the manuscript. A.A., K.J.H.V. and G.v.W. jointly supervised the work.

Corresponding authors

Correspondence to Shu Liu or Guido A. van Wingen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Human Behaviour thanks Deanna Barch, Omid Kardan and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 The largest absolute effect sizes of brain–phenotype associations for each imaging modality and phenotype.

The upper text shows the specific brain measures that have the largest absolute effect sizes across brain–phenotype associations. CSA, cortical surface area; CT, cortical thickness; FC, functional connectivity.

Extended Data Fig. 2 The proportion of brain features with regional replicability of larger than 0.75.

(a) Cortical surface area (CSA); (b) Cortical thickness (CT); (c) Functional connectivity (FC).

Extended Data Fig. 3 The relationship between observed and predicted sample sizes for added phenotypes.

The x axis represents the true minimally required sample sizes to obtain replicable brain associations; the y axis represents the required sample sizes predicted from the effect sizes of brain associations with the added phenotypes.

Extended Data Fig. 4 Differences of minimally required sample size to reach 75% replication probability between two-sample t-test using median splits and correlation analysis.

Positive values indicate lower minimally required sample sizes for the two-sample t-test, whereas negative values indicate higher minimally required sample sizes. Cross symbol ‘×’ indicates that the minimally required sample sizes are missing for the two-sample t-test or Spearman’s correlation analysis, because the regional replicability of all brain measures did not reach 0.75 at any sample size. Zero ‘0’ indicates that the minimally required sample sizes are the same between two-sample t-test and Spearman’s correlation analysis.

Extended Data Fig. 5 The minimally required sample sizes with preselection procedures at 10%, 20%, 25%, 30%, 40%, and 50% (median) for six representative phenotypes, when using a significance threshold of p < 0.05 uncorrected.

(a) Age; (b) Body mass index (BMI); (c) Fluid intelligence; (d) Numeric memory; (e) Neuroticism; (f) Alcohol consumption. Missing estimates (dots) could be related to two situations: first, good replicability is not obtained even at the largest sample size; second, the specific preselection is not conducted because of the distribution of the variable. Cortical surface area (CSA), cortical thickness (CT), and functional connectivity (FC).

Extended Data Fig. 6 Proportional decrease in the minimally required sample size to reach good replicability for a two-sample t-test (by sample selection) relative to correlation analysis for additional phenotypes.

(a) Cortical surface area (CSA); (b) Cortical thickness (CT); (c) Functional connectivity (FC). Empty bars indicate that the minimally required sample sizes are missing for the two-sample t-test or Spearman’s correlation analysis, because the regional replicability of all brain measures did not reach 0.75 at any sample size. Coloured bars indicate that the minimally required sample size decreased in a two-sample t-test (by sample selection) compared to correlation analysis.

Extended Data Fig. 7 The minimally required sample sizes with preselection procedures at 10%, 20%, 25%, 30%, 40%, and 50% for additional phenotypes.

Missing estimates (dots) could be related to two factors: first, good replicability is not obtained even at the largest sample size; second, the specific preselection is not conducted because of the distribution of the variable. Cortical surface area (CSA), cortical thickness (CT), and functional connectivity (FC).

Extended Data Fig. 8 Improvement of the replicability of feature selection with increasing sample size at the thresholds ranging from 5% to 25%.

(a) shows the Jaccard index for cortical surface area (CSA) at different thresholds. (b) shows the Jaccard index for cortical thickness (CT) at different thresholds. (c) shows the Jaccard index for functional connectivity (FC) at different thresholds.
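The Jaccard index used for feature-selection replicability measures the overlap between the sets of top-ranked features found in two samples. The sketch below is only an illustration: the helper names `top_fraction` and `jaccard`, the ranking by absolute effect size, and the toy effect sizes are assumptions rather than the authors' implementation.

```python
def top_fraction(effects, fraction):
    """Indices of the features with the largest absolute effect sizes."""
    k = max(1, round(len(effects) * fraction))
    ranked = sorted(range(len(effects)), key=lambda i: abs(effects[i]), reverse=True)
    return set(ranked[:k])

def jaccard(a, b):
    """Jaccard index |A & B| / |A | B|; defined here as 1.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Made-up effect sizes for eight features in two independent samples.
effects_a = [0.30, -0.25, 0.05, 0.01, 0.20, 0.02, 0.03, 0.04]
effects_b = [0.28, 0.02, 0.06, 0.01, -0.22, 0.24, 0.03, 0.04]

selected_a = top_fraction(effects_a, 0.25)  # top 25% of features
selected_b = top_fraction(effects_b, 0.25)
overlap = jaccard(selected_a, selected_b)
print(selected_a, selected_b, overlap)
```

A value of 1 means both samples select exactly the same features, and 0 means the selections do not overlap at all.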

Extended Data Fig. 9 Improvement of replicability for partial least squares (PLS) regression analysis with increasing sample size.

(a), (b), and (c) show the proportion of decreased correlations of PLS1 with the phenotypes for cortical surface area (CSA), cortical thickness (CT), and functional connectivity (FC); (d), (e), and (f) show the intraclass correlation coefficients (ICCs) of PLS weights. The dotted lines indicate good and moderate replicability levels (0.75 and 0.5).
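An intraclass correlation coefficient summarizes how consistently the same features receive the same values across repeated samples. The sketch below computes one common form, ICC(2,1) (two-way random effects, absolute agreement, single measurement). Which ICC form the study used is not stated here, so this choice is an assumption, as are the function name `icc2_1` and the toy weights.

```python
def icc2_1(ratings):
    """ICC(2,1) from a table with one row per feature and one column per replicate."""
    n = len(ratings)        # number of features (rows)
    k = len(ratings[0])     # number of replicates (columns)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Made-up weights for three features estimated in two subsamples.
weights = [[1, 1], [2, 2], [3, 3]]
print(icc2_1(weights))  # identical columns give perfect agreement
```

Under this form, a systematic shift between replicates lowers the coefficient even when the rank order of features is preserved, because absolute agreement is being assessed.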

Supplementary information

Supplementary Information

Supplementary Results and Figs. 1–26.

Reporting Summary

Peer Review File

Supplementary Tables

Supplementary Tables 1–19.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Liu, S., Abdellaoui, A., Verweij, K.J.H. et al. Replicable brain–phenotype associations require large-scale neuroimaging data. Nat Hum Behav 7, 1344–1356 (2023). https://doi.org/10.1038/s41562-023-01642-5

  • DOI: https://doi.org/10.1038/s41562-023-01642-5
