A statistical mechanics framework for Bayesian deep neural networks beyond the infinite-width limit

Abstract

Despite the practical success of deep neural networks, a comprehensive theoretical framework that can predict practically relevant scores, such as the test accuracy, from knowledge of the training data is currently lacking. Huge simplifications arise in the infinite-width limit, in which the number of units N_ℓ in each hidden layer (ℓ = 1, …, L, where L is the depth of the network) far exceeds the number P of training examples. This idealization, however, blatantly departs from the reality of deep learning practice. Here we use the toolset of statistical mechanics to overcome these limitations and derive an approximate partition function for fully connected deep neural architectures, which encodes information on the trained models. The computation holds in the thermodynamic limit, where both N_ℓ and P are large and their ratio α_ℓ = P/N_ℓ is finite. This advance allows us to obtain: (1) a closed formula for the generalization error associated with a regression task in a one-hidden-layer network with finite α_1; (2) an approximate expression of the partition function for deep architectures (via an effective action that depends on a finite number of order parameters); and (3) a link between deep neural networks in the proportional asymptotic limit and Student’s t-processes.
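
The link in point (3) admits a simple self-contained illustration: a Gaussian process whose overall variance is drawn from an inverse-gamma prior has, once that variance is integrated out, Student’s t finite-dimensional marginals, which is the defining property of a Student’s t-process. The sketch below is only a numerical illustration of that standard scale-mixture construction, not the derivation or code of this paper; the squared-exponential kernel, the input grid and the degrees of freedom ν are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def rbf_kernel(x, y, ell=1.0):
        # Squared-exponential kernel; any positive-definite kernel would work here.
        return np.exp(-0.5 * (x[:, None] - y[None, :]) ** 2 / ell**2)

    x = np.linspace(-3.0, 3.0, 50)
    K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))  # small jitter for numerical stability

    nu = 5.0   # degrees of freedom of the resulting t-process (assumed value)
    draws = []
    for _ in range(5):
        # sigma2 ~ InvGamma(nu/2, nu/2): inverse of a Gamma(shape=nu/2, scale=2/nu) variable.
        sigma2 = 1.0 / rng.gamma(shape=nu / 2.0, scale=2.0 / nu)
        # Conditionally Gaussian sample with rescaled covariance; marginally Student's t.
        draws.append(rng.multivariate_normal(np.zeros(len(x)), sigma2 * K))

    print(np.array(draws).shape)  # (5, 50): five heavy-tailed function samples on the grid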

Fig. 1: Learning curves of 1HL networks.
Fig. 2: Experiments with deep networks L > 1.
Fig. 3: Universality behaviour of random data and order parameter as a function of depth L.

Data availability

The CIFAR10 (ref. 72) and MNIST (ref. 73) datasets that we used for all our experiments are publicly available online, respectively, at https://www.cs.toronto.edu/~kriz/cifar.html and http://yann.lecun.com/exdb/mnist/.
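
For convenience, both datasets can also be fetched programmatically. The snippet below is a minimal sketch that assumes the torchvision package; it is one common way to obtain the same public data and is not the loading pipeline used for the experiments reported here.

    from torchvision import datasets

    # Download the public datasets to a local folder (the path is arbitrary).
    cifar10 = datasets.CIFAR10(root="./data", train=True, download=True)
    mnist = datasets.MNIST(root="./data", train=True, download=True)

    print(len(cifar10), len(mnist))  # 50000 and 60000 training images, respectively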

Code availability

The code used to perform experiments, compute theory predictions and analyse data is available at: https://github.com/rpacelli/FC_deep_bayesian_networks (ref. 74).

References

  1. Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016).

  2. Engel, A. & Van den Broeck, C. Statistical Mechanics of Learning (Cambridge Univ. Press, 2001).

  3. Seroussi, I., Naveh, G. & Ringel, Z. Separation of scales and a thermodynamic description of feature learning in some CNNs. Nat. Commun. 14, 908 (2023).

  4. Wakhloo, A. J., Sussman, T. J. & Chung, S. Linear classification of neural manifolds with correlated variability. Phys. Rev. Lett. 131, 027301 (2023).

  5. Li, Q. & Sompolinsky, H. Statistical mechanics of deep linear neural networks: the backpropagating kernel renormalization. Phys. Rev. X 11, 031059 (2021).

  6. Baldassi, C. et al. Learning through atypical phase transitions in overparameterized neural networks. Phys. Rev. E 106, 014116 (2022).

  7. Canatar, A., Bordelon, B. & Pehlevan, C. Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks. Nat. Commun. 12, 2914 (2021).

  8. Mozeika, A., Li, B. & Saad, D. Space of functions computed by deep-layered machines. Phys. Rev. Lett. 125, 168301 (2020).

  9. Goldt, S., Mézard, M., Krzakala, F. & Zdeborová, L. Modeling the influence of data structure on learning in neural networks: the hidden manifold model. Phys. Rev. X 10, 041044 (2020).

  10. Bahri, Y. et al. Statistical mechanics of deep learning. Ann. Rev. Condensed Matter Phys. 11, 501–528 (2020).

  11. Li, B. & Saad, D. Exploring the function space of deep-learning machines. Phys. Rev. Lett. 120, 248301 (2018).

  12. Neal, R. M. in Bayesian Learning for Neural Networks 29–53 (Springer, 1996).

  13. Williams, C. Computing with infinite networks. In Proc. 9th International Conference on Neural Information Processing Systems (eds Jordan, M. I. & Petsche, T.) 295–301 (MIT Press, 1996).

  14. de G. Matthews, A. G., Hron, J., Rowland, M., Turner, R. E. & Ghahramani, Z. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations (ICLR, 2018).

  15. Lee, J. et al. Deep neural networks as Gaussian processes. In International Conference on Learning Representations (ICLR, 2018).

  16. Garriga-Alonso, A., Rasmussen, C. E. & Aitchison, L. Deep convolutional networks as shallow Gaussian processes. In International Conference on Learning Representations (ICLR, 2019).

  17. Novak, R. et al. Bayesian deep convolutional networks with many channels are Gaussian processes. In International Conference on Learning Representations (ICLR, 2019).

  18. Jacot, A., Gabriel, F. & Hongler, C. Neural tangent kernel: convergence and generalization in neural networks. In Proc. 32nd International Conference on Neural Information Processing Systems (eds Bengio, S. et al.) 8580–8589 (Curran Associates, 2018).

  19. Chizat, L., Oyallon, E. & Bach, F. On lazy training in differentiable programming. In Proc. 33rd International Conference on Neural Information Processing Systems (eds Wallach, H. et al.) 2937–2947 (Curran Associates, 2019).

  20. Lee, J. et al. Wide neural networks of any depth evolve as linear models under gradient descent. In Proc. 33rd International Conference on Neural Information Processing Systems (eds Wallach, H. et al.) 8572–8583 (Curran Associates, 2019).

  21. Cortes, C. & Vapnik, V. Support-vector networks. Mach. Learn. 20, 273–297 (1995).

  22. Bordelon, B., Canatar, A. & Pehlevan, C. Spectrum dependent learning curves in kernel regression and wide neural networks. In Proc. 37th International Conference on Machine Learning (eds Daumé, H. III & Singh, A.) 1024–1034 (PMLR, 2020).

  23. Dietrich, R., Opper, M. & Sompolinsky, H. Statistical mechanics of support vector networks. Phys. Rev. Lett. 82, 2975–2978 (1999).

  24. Seleznova, M. & Kutyniok, G. Neural tangent kernel beyond the infinite-width limit: effects of depth and initialization. Preprint at https://arxiv.org/abs/2202.00553 (2022).

  25. Vyas, N., Bansal, Y. & Nakkiran, P. Limitations of the NTK for understanding generalization in deep learning. Preprint at https://arxiv.org/abs/2206.10012 (2022).

  26. Antognini, J. M. Finite size corrections for neural network Gaussian processes. Preprint at https://arxiv.org/abs/1908.10030 (2019).

  27. Yaida, S. Non-Gaussian processes and neural networks at finite widths. In Proc. 1st Mathematical and Scientific Machine Learning Conference (eds Lu, J. & Ward, R.) 165–192 (PMLR, 2020).

  28. Hanin, B. Random fully connected neural networks as perturbatively solvable hierarchies. Preprint at https://arxiv.org/abs/2204.01058 (2023).

  29. Zavatone-Veth, J. & Pehlevan, C. Exact marginal prior distributions of finite Bayesian neural networks. In Advances in Neural Information Processing Systems (eds Vaughan, J. et al.) 3364–3375 (Curran Associates, 2021).

  30. Bengio, Y. & Delalleau, O. in Algorithmic Learning Theory (eds Kivinen, J. et al.) 18–36 (Springer, 2011).

  31. Bartlett, P. L., Harvey, N., Liaw, C. & Mehrabian, A. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. J. Mach. Learn. Res. 20, 2285–2301 (2019).

  32. Rotondo, P., Lagomarsino, M. C. & Gherardi, M. Counting the learnable functions of geometrically structured data. Phys. Rev. Res. 2, 023169 (2020).

  33. Rotondo, P., Pastore, M. & Gherardi, M. Beyond the storage capacity: data-driven satisfiability transition. Phys. Rev. Lett. 125, 120601 (2020).

  34. Pastore, M., Rotondo, P., Erba, V. & Gherardi, M. Statistical learning theory of structured data. Phys. Rev. E 102, 032119 (2020).

  35. Gherardi, M. Solvable model for the linear separability of structured data. Entropy 23, 305 (2021).

  36. Pastore, M. Critical properties of the SAT/UNSAT transitions in the classification problem of structured data. J. Stat. Mech. 2021, 113301 (2021).

  37. Aguirre-López, F., Pastore, M. & Franz, S. Satisfiability transition in asymmetric neural networks. J. Phys. A 55, 305001 (2022).

  38. Saxe, A. M., McClelland, J. L. & Ganguli, S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In Proc. 2nd International Conference on Learning Representations (eds Bengio, Y. & LeCun, Y.) (ICLR, 2014).

  39. Saxe, A. M., McClelland, J. L. & Ganguli, S. A mathematical theory of semantic development in deep neural networks. Proc. Natl Acad. Sci. USA 116, 11537–11546 (2019).

  40. Zavatone-Veth, J., Canatar, A., Ruben, B. & Pehlevan, C. Asymptotics of representation learning in finite Bayesian neural networks. In Advances in Neural Information Processing Systems (eds Vaughan, J. W. et al.) 24765–24777 (Curran Associates, 2021).

  41. Naveh, G. & Ringel, Z. A self-consistent theory of Gaussian processes captures feature learning effects in finite CNNs. In Advances in Neural Information Processing Systems (eds Vaughan, J. W. et al.) 21352–21364 (Curran Associates, 2021).

  42. Zavatone-Veth, J. A., Tong, W. L. & Pehlevan, C. Contrasting random and learned features in deep Bayesian linear regression. Phys. Rev. E 105, 064118 (2022).

  43. Bardet, J.-M. & Surgailis, D. Moment bounds and central limit theorems for Gaussian subordinated arrays. J. Multivariate Anal. 114, 457–473 (2013).

  44. Nourdin, I., Peccati, G. & Podolskij, M. Quantitative Breuer–Major theorems. Stochastic Process. Appl. 121, 793–812 (2011).

  45. Breuer, P. & Major, P. Central limit theorems for non-linear functionals of Gaussian fields. J. Multivariate Anal. 13, 425–441 (1983).

  46. Gerace, F., Loureiro, B., Krzakala, F., Mezard, M. & Zdeborova, L. Generalisation error in learning with random features and the hidden manifold model. In Proc. 37th International Conference on Machine Learning (eds Daumé, H. III & Singh, A.) 3452–3462 (PMLR, 2020).

  47. Loureiro, B. et al. Learning curves of generic features maps for realistic datasets with a teacher-student model. J. Stat. Mech. https://doi.org/10.1088/1742-5468/ac9825 (2021).

  48. Goldt, S. et al. The Gaussian equivalence of generative models for learning with shallow neural networks. In Proc. 2nd Mathematical and Scientific Machine Learning Conference (eds Bruna, J. et al.) 426–471 (PMLR, 2022).

  49. Dobriban, E. & Wager, S. High-dimensional asymptotics of prediction: ridge regression and classification. Ann. Stat. 46, 247–279 (2018).

  50. Mei, S. & Montanari, A. The generalization error of random features regression: precise asymptotics and the double descent curve. Commun. Pure Appl. Math. 75, 667–766 (2019).

  51. Ghorbani, B., Mei, S., Misiakiewicz, T. & Montanari, A. Linearized two-layers neural networks in high dimension. Ann. Stat. 49, 1029–1054 (2021).

  52. Ariosto, S., Pacelli, R., Ginelli, F., Gherardi, M. & Rotondo, P. Universal mean-field upper bound for the generalization gap of deep neural networks. Phys. Rev. E 105, 064309 (2022).

  53. Shah, A., Wilson, A. & Ghahramani, Z. Student-t processes as alternatives to Gaussian processes. In Proc. 17th International Conference on Artificial Intelligence and Statistics (eds Kaski, S. & Corander, J.) 877–885 (PMLR, 2014).

  54. Zavatone-Veth, J. A., Canatar, A., Ruben, B. S. & Pehlevan, C. Asymptotics of representation learning in finite Bayesian neural networks. J. Stat. Mech. 2022, 114008 (2022).

  55. Hanin, B. & Zlokapa, A. Bayesian interpolation with deep linear networks. Proc. Natl Acad. Sci. USA 120, e2301345120 (2023).

  56. Coolen, A. C. C., Sheikh, M., Mozeika, A., Aguirre-López, F. & Antenucci, F. Replica analysis of overfitting in generalized linear regression models. J. Phys. A 53, 365001 (2020).

  57. Mozeika, A., Sheikh, M., Aguirre-López, F., Antenucci, F. & Coolen, A. C. C. Exact results on high-dimensional linear regression via statistical physics. Phys. Rev. E 103, 042142 (2021).

  58. Uchiyama, Y., Oka, H. & Nono, A. Student’s t-process regression on the space of probability density functions. Proc. ISCIE International Symposium on Stochastic Systems Theory and its Applications 2021, 1–5 (2021).

  59. Lee, H., Yun, E., Yang, H. & Lee, J. Scale mixtures of neural network Gaussian processes. In International Conference on Learning Representations (ICLR, 2022).

  60. Aitchison, L. Why bigger is not always better: on finite and infinite neural networks. In Proc. 37th International Conference on Machine Learning (eds Daumé, H. III & Singh, A.) 156–164 (PMLR, 2020).

  61. Zavatone-Veth, J. A. & Pehlevan, C. Depth induces scale-averaging in overparameterized linear Bayesian neural networks. In 2021 55th Asilomar Conference on Signals, Systems, and Computers 600–607 (IEEE, 2021).

  62. Yang, A. X., Robeyns, M., Milsom, E., Schoots, N. & Aitchison, L. A theory of representation learning in deep neural networks gives a deep generalisation of kernel methods. Preprint at https://arxiv.org/abs/2108.13097 (2023).

  63. Cho, Y. & Saul, L. Kernel methods for deep learning. In Advances in Neural Information Processing Systems (eds Bengio, Y. et al.) Vol. 22 (Curran Associates, 2009).

  64. Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J. & Ganguli, S. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems (eds Garnett, R. et al.) Vol. 29 (Curran Associates, 2016).

  65. Yang, G. & Schoenholz, S. Mean field residual networks: on the edge of chaos. In Advances in Neural Information Processing Systems (eds Guyon, I. et al.) Vol. 30 (Curran Associates, 2017).

  66. Tracey, B. D. & Wolpert, D. Upgrading from Gaussian processes to Student’s-t processes. In 2018 AIAA Non-Deterministic Approaches Conference 1659 (2018).

  67. Roberts, D. A., Yaida, S. & Hanin, B. The Principles of Deep Learning Theory (Cambridge Univ. Press, 2022).

  68. Gerace, F., Krzakala, F., Loureiro, B., Stephan, L. & Zdeborová, L. Gaussian universality of linear classifiers with random labels in high-dimension. Preprint at https://arxiv.org/abs/2205.13303 (2022).

  69. Cui, H., Krzakala, F. & Zdeborová, L. Bayes-optimal learning of deep random networks of extensive-width. In Proc. 40th International Conference on Machine Learning (eds Krause, A. et al.) 6468–6521 (PMLR, 2023).

  70. Lee, J. et al. Finite versus infinite neural networks: an empirical study. In Advances in Neural Information Processing Systems (eds Lin, H. et al.) 15156–15172 (Curran Associates, 2020).

  71. Pang, G., Yang, L. & Karniadakis, G. E. Neural-net-induced Gaussian process regression for function approximation and PDE solution. J. Comput. Phys. 384, 270–288 (2019).

  72. Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images (Univ. Toronto, 2012).

  73. LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998).

  74. Pacelli, R. rpacelli/FC_deep_bayesian_networks: FC_deep_bayesian_networks (2023).

Acknowledgements

M.P. has been supported by a grant from the Simons Foundation (grant no. 454941, S. Franz). P.R. acknowledges funding from the Fellini program under the H2020-MSCA-COFUND action, grant agreement no. 754496, INFN (IT) and from #NEXTGENERATIONEU (NGEU), National Recovery and Resilience Plan (NRRP), project MNESYS (PE0000006) ‘A Multiscale integrated approach to the study of the nervous system in health and disease’ (DN. 1553 11.10.2022). We would like to thank S. Franz, L. Molinari, F. Aguirre-López, R. Burioni, A. Vezzani, R. Aiudi, F. Bassetti, B. Bassetti, P. Baglioni and the Computing Sciences group at Bocconi University in Milan for discussions and suggestions.

Author information

Contributions

P.R., S.A. and M.P. performed the analytical calculations, supported by F.G., M.G. and R.P. Numerical experiments, data analysis and data visualization were carried out by R.P. All the authors contributed to discussing and interpreting the results and to writing and editing the paper. S.A. and R.P. contributed equally to the work.

Corresponding author

Correspondence to P. Rotondo.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks the anonymous reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary text.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Pacelli, R., Ariosto, S., Pastore, M. et al. A statistical mechanics framework for Bayesian deep neural networks beyond the infinite-width limit. Nat Mach Intell 5, 1497–1507 (2023). https://doi.org/10.1038/s42256-023-00767-6

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1038/s42256-023-00767-6
