Towards algorithmic analytics for large-scale datasets

Abstract

The traditional goal of quantitative analytics is to find simple, transparent models that generate explainable insights. In recent years, large-scale data acquisition enabled, for instance, by brain scanning and genomic profiling with microarray-type techniques, has prompted a wave of statistical inventions and innovative applications. Here we review some of the main trends in learning from ‘big data’ and provide examples from imaging neuroscience. Our main messages are that modern analysis approaches (1) tame complex data with parameter regularization and dimensionality-reduction strategies, (2) are increasingly backed up by empirical model validations rather than justified by mathematical proofs, (3) will compare against and build on open data and consortium repositories, and (4) often embrace more elaborate, less interpretable models to maximize prediction accuracy.
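
As a rough illustration of messages (1) and (2), the sketch below fits a ridge-regularized linear model to data with more features than samples and chooses the penalty strength by k-fold cross-validation, that is, by empirical validation on held-out data rather than by a closed-form guarantee. It is a minimal sketch and not the authors' method: the synthetic data, the helper names `ridge_fit` and `cv_mse`, and the grid of `alpha` values are all assumptions made for illustration.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge estimate: solve (X^T X + alpha * I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


def cv_mse(X, y, alpha, n_folds=5, seed=0):
    """Average mean squared prediction error over held-out folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for test_idx in folds:
        # Train on every sample that is not in the current held-out fold.
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        w = ridge_fit(X[train_mask], y[train_mask], alpha)
        errors.append(np.mean((X[test_idx] @ w - y[test_idx]) ** 2))
    return float(np.mean(errors))


# Hypothetical synthetic data: 60 samples, 200 features, 5 of them informative.
rng = np.random.default_rng(42)
X = rng.standard_normal((60, 200))
true_w = np.zeros(200)
true_w[:5] = 2.0
y = X @ true_w + rng.standard_normal(60)

# Illustrative penalty grid; the best value is picked purely by held-out error.
alphas = [1e-3, 1e-1, 1e1, 1e3]
scores = {alpha: cv_mse(X, y, alpha) for alpha in alphas}
best_alpha = min(scores, key=scores.get)
```

Here the penalty `alpha` shrinks the coefficient vector towards zero, which keeps the fit well-posed when features outnumber samples, and the cross-validated error stands in for a mathematical proof as the evidence that a given setting generalizes.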

Fig. 1
Fig. 2: Strongest population mode that links intra-network connectivity patterns and inter-network connectivity patterns.
Fig. 3: Relevance of population associations between six brain-imaging modalities and thousands of behavioural phenotypes.

Author information

Corresponding author

Correspondence to Danilo Bzdok.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cite this article

Bzdok, D., Nichols, T.E. & Smith, S.M. Towards algorithmic analytics for large-scale datasets. Nat Mach Intell 1, 296–306 (2019). https://doi.org/10.1038/s42256-019-0069-5
