A statistical mechanics framework for Bayesian deep neural networks beyond the infinite-width limit

Abstract

Despite the practical success of deep neural networks, a comprehensive theoretical framework that can predict practically relevant scores, such as the test accuracy, from knowledge of the training data is currently lacking. Huge simplifications arise in the infinite-width limit, in which the number of units N_ℓ in each hidden layer (ℓ = 1, …, L, where L is the depth of the network) far exceeds the number P of training examples. This idealization, however, blatantly departs from the reality of deep learning practice. Here we use the toolset of statistical mechanics to overcome these limitations and derive an approximate partition function for fully connected deep neural architectures, which encodes information on the trained models. The computation holds in the thermodynamic limit, where both N_ℓ and P are large and their ratio α_ℓ = P/N_ℓ is finite. This advance allows us to obtain: (1) a closed formula for the generalization error associated with a regression task in a one-hidden-layer network with finite α_1; (2) an approximate expression of the partition function for deep architectures (via an effective action that depends on a finite number of order parameters); and (3) a link between deep neural networks in the proportional asymptotic limit and Student’s t-processes.
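
The link in point (3) admits a simple self-contained illustration: a Gaussian process whose overall variance is drawn from an inverse-gamma prior has, once that variance is integrated out, Student’s t finite-dimensional marginals, which is the defining property of a Student’s t-process. The sketch below is only a numerical illustration of that standard scale-mixture construction, not the derivation or code of this paper; the squared-exponential kernel, the input grid and the degrees of freedom ν are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def rbf_kernel(x, y, ell=1.0):
        # Squared-exponential kernel; any positive-definite kernel would work here.
        return np.exp(-0.5 * (x[:, None] - y[None, :]) ** 2 / ell**2)

    x = np.linspace(-3.0, 3.0, 50)
    K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))  # small jitter for numerical stability

    nu = 5.0   # degrees of freedom of the resulting t-process (assumed value)
    draws = []
    for _ in range(5):
        # sigma2 ~ InvGamma(nu/2, nu/2): inverse of a Gamma(shape=nu/2, scale=2/nu) variable.
        sigma2 = 1.0 / rng.gamma(shape=nu / 2.0, scale=2.0 / nu)
        # Conditionally Gaussian sample with rescaled covariance; marginally Student's t.
        draws.append(rng.multivariate_normal(np.zeros(len(x)), sigma2 * K))

    print(np.array(draws).shape)  # (5, 50): five heavy-tailed function samples on the grid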

Fig. 1: Learning curves of 1HL networks.
Fig. 2: Experiments with deep networks L > 1.
Fig. 3: Universality behaviour of random data and order parameter as a function of depth L.

Data availability

The CIFAR10 (ref. 72) and MNIST (ref. 73) datasets that we used for all our experiments are publicly available online, respectively, at https://www.cs.toronto.edu/~kriz/cifar.html and http://yann.lecun.com/exdb/mnist/.
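
For convenience, both datasets can also be fetched programmatically. The snippet below is a minimal sketch that assumes the torchvision package; it is one common way to obtain the same public data and is not the loading pipeline used for the experiments reported here.

    from torchvision import datasets

    # Download the public datasets to a local folder (the path is arbitrary).
    cifar10 = datasets.CIFAR10(root="./data", train=True, download=True)
    mnist = datasets.MNIST(root="./data", train=True, download=True)

    print(len(cifar10), len(mnist))  # 50000 and 60000 training images, respectively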

Code availability

The code used to perform experiments, compute theory predictions and analyse data is available at: https://github.com/rpacelli/FC_deep_bayesian_networks (ref. 74).

References

  1. Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016).

  2. Engel, A. & Van den Broeck, C. Statistical Mechanics of Learning (Cambridge Univ. Press, 2001).

  3. Seroussi, I., Naveh, G. & Ringel, Z. Separation of scales and a thermodynamic description of feature learning in some CNNs. Nat. Commun. 14, 908 (2023).

  4. Wakhloo, A. J., Sussman, T. J. & Chung, S. Linear classification of neural manifolds with correlated variability. Phys. Rev. Lett. 131, 027301 (2023).

  5. Li, Q. & Sompolinsky, H. Statistical mechanics of deep linear neural networks: the backpropagating kernel renormalization. Phys. Rev. X 11, 031059 (2021).

  6. Baldassi, C. et al. Learning through atypical phase transitions in overparameterized neural networks. Phys. Rev. E 106, 014116 (2022).

  7. Canatar, A., Bordelon, B. & Pehlevan, C. Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks. Nat. Commun. 12, 2914 (2021).

  8. Mozeika, A., Li, B. & Saad, D. Space of functions computed by deep-layered machines. Phys. Rev. Lett. 125, 168301 (2020).

  9. Goldt, S., Mézard, M., Krzakala, F. & Zdeborová, L. Modeling the influence of data structure on learning in neural networks: the hidden manifold model. Phys. Rev. X 10, 041044 (2020).

  10. Bahri, Y. et al. Statistical mechanics of deep learning. Ann. Rev. Condensed Matter Phys. 11, 501–528 (2020).

  11. Li, B. & Saad, D. Exploring the function space of deep-learning machines. Phys. Rev. Lett. 120, 248301 (2018).

  12. Neal, R. M. in Bayesian Learning for Neural Networks 29–53 (Springer, 1996).

  13. Williams, C. Computing with infinite networks. In Proc. 9th International Conference on Neural Information Processing Systems (eds Jordan, M. I. & Petsche, T.) 295–301 (MIT Press, 1996).

  14. de G. Matthews, A. G., Hron, J., Rowland, M., Turner, R. E. & Ghahramani, Z. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations (ICLR, 2018).

  15. Lee, J. et al. Deep neural networks as Gaussian processes. In International Conference on Learning Representations (ICLR, 2018).

  16. Garriga-Alonso, A., Rasmussen, C. E. & Aitchison, L. Deep convolutional networks as shallow Gaussian processes. In International Conference on Learning Representations (ICLR, 2019).

  17. Novak, R. et al. Bayesian deep convolutional networks with many channels are Gaussian processes. In International Conference on Learning Representations (ICLR, 2019).

  18. Jacot, A., Gabriel, F. & Hongler, C. Neural tangent kernel: convergence and generalization in neural networks. In Proc. 32nd International Conference on Neural Information Processing Systems (eds Bengio, S. et al.) 8580–8589 (Curran Associates, 2018).

  19. Chizat, L., Oyallon, E. & Bach, F. On lazy training in differentiable programming. In Proc. 33rd International Conference on Neural Information Processing Systems (eds Wallach, H. et al.) 2937–2947 (Curran Associates, 2019).

  20. Lee, J. et al. Wide neural networks of any depth evolve as linear models under gradient descent. In Proc. 33rd International Conference on Neural Information Processing Systems (eds Wallach, H. et al.) 8572–8583 (Curran Associates, 2019).

  21. Cortes, C. & Vapnik, V. Support-vector networks. Mach. Learn. 20, 273–297 (1995).

  22. Bordelon, B., Canatar, A. & Pehlevan, C. Spectrum dependent learning curves in kernel regression and wide neural networks. In Proc. 37th International Conference on Machine Learning (eds Daumé, H. III & Singh, A.) 1024–1034 (PMLR, 2020).

  23. Dietrich, R., Opper, M. & Sompolinsky, H. Statistical mechanics of support vector networks. Phys. Rev. Lett. 82, 2975–2978 (1999).

  24. Seleznova, M. & Kutyniok, G. Neural tangent kernel beyond the infinite-width limit: effects of depth and initialization. Preprint at https://arxiv.org/abs/2202.00553 (2022).

  25. Vyas, N., Bansal, Y. & Nakkiran, P. Limitations of the NTK for understanding generalization in deep learning. Preprint at https://arxiv.org/abs/2206.10012 (2022).

  26. Antognini, J. M. Finite size corrections for neural network Gaussian processes. Preprint at https://arxiv.org/abs/1908.10030 (2019).

  27. Yaida, S. Non-Gaussian processes and neural networks at finite widths. In Proc. 1st Mathematical and Scientific Machine Learning Conference (eds Lu, J. & Ward, R.) 165–192 (PMLR, 2020).

  28. Hanin, B. Random fully connected neural networks as perturbatively solvable hierarchies. Preprint at https://arxiv.org/abs/2204.01058 (2023).

  29. Zavatone-Veth, J. & Pehlevan, C. Exact marginal prior distributions of finite Bayesian neural networks. In Advances in Neural Information Processing Systems (eds Vaughan, J. et al.) 3364–3375 (Curran Associates, 2021).

  30. Bengio, Y. & Delalleau, O. in Algorithmic Learning Theory (eds Kivinen, J. et al.) 18–36 (Springer, 2011).

  31. Bartlett, P. L., Harvey, N., Liaw, C. & Mehrabian, A. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. J. Mach. Learn. Res. 20, 2285–2301 (2019).

  32. Rotondo, P., Lagomarsino, M. C. & Gherardi, M. Counting the learnable functions of geometrically structured data. Phys. Rev. Res. 2, 023169 (2020).

  33. Rotondo, P., Pastore, M. & Gherardi, M. Beyond the storage capacity: data-driven satisfiability transition. Phys. Rev. Lett. 125, 120601 (2020).

  34. Pastore, M., Rotondo, P., Erba, V. & Gherardi, M. Statistical learning theory of structured data. Phys. Rev. E 102, 032119 (2020).

  35. Gherardi, M. Solvable model for the linear separability of structured data. Entropy 23, 305 (2021).

  36. Pastore, M. Critical properties of the SAT/UNSAT transitions in the classification problem of structured data. J. Stat. Mech. 2021, 113301 (2021).

  37. Aguirre-López, F., Pastore, M. & Franz, S. Satisfiability transition in asymmetric neural networks. J. Phys. A 55, 305001 (2022).

  38. Saxe, A. M., McClelland, J. L. & Ganguli, S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In Proc. 2nd International Conference on Learning Representations (eds Bengio, Y. & LeCun, Y.) (ICLR, 2014).

  39. Saxe, A. M., McClelland, J. L. & Ganguli, S. A mathematical theory of semantic development in deep neural networks. Proc. Natl Acad. Sci. USA 116, 11537–11546 (2019).

  40. Zavatone-Veth, J., Canatar, A., Ruben, B. & Pehlevan, C. Asymptotics of representation learning in finite Bayesian neural networks. In Advances in Neural Information Processing Systems (eds Vaughan, J. W. et al.) 24765–24777 (Curran Associates, 2021).

  41. Naveh, G. & Ringel, Z. A self-consistent theory of Gaussian processes captures feature learning effects in finite CNNs. In Advances in Neural Information Processing Systems (eds Vaughan, J. W. et al.) 21352–21364 (Curran Associates, 2021).

  42. Zavatone-Veth, J. A., Tong, W. L. & Pehlevan, C. Contrasting random and learned features in deep Bayesian linear regression. Phys. Rev. E 105, 064118 (2022).

  43. Bardet, J.-M. & Surgailis, D. Moment bounds and central limit theorems for Gaussian subordinated arrays. J. Multivariate Anal. 114, 457–473 (2013).

  44. Nourdin, I., Peccati, G. & Podolskij, M. Quantitative Breuer–Major theorems. Stochastic Process. Appl. 121, 793–812 (2011).

  45. Breuer, P. & Major, P. Central limit theorems for non-linear functionals of Gaussian fields. J. Multivariate Anal. 13, 425–441 (1983).

  46. Gerace, F., Loureiro, B., Krzakala, F., Mezard, M. & Zdeborova, L. Generalisation error in learning with random features and the hidden manifold model. In Proc. 37th International Conference on Machine Learning (eds Daumé, H. III & Singh, A.) 3452–3462 (PMLR, 2020).

  47. Loureiro, B. et al. Learning curves of generic features maps for realistic datasets with a teacher-student model. J. Stat. Mech. https://doi.org/10.1088/1742-5468/ac9825 (2021).

  48. Goldt, S. et al. The Gaussian equivalence of generative models for learning with shallow neural networks. In Proc. 2nd Mathematical and Scientific Machine Learning Conference (eds Bruna, J. et al.) 426–471 (PMLR, 2022).

  49. Dobriban, E. & Wager, S. High-dimensional asymptotics of prediction: ridge regression and classification. Ann. Stat. 46, 247–279 (2018).

  50. Mei, S. & Montanari, A. The generalization error of random features regression: precise asymptotics and the double descent curve. Commun. Pure Appl. Math. 75, 667–766 (2019).

  51. Ghorbani, B., Mei, S., Misiakiewicz, T. & Montanari, A. Linearized two-layers neural networks in high dimension. Ann. Stat. 49, 1029–1054 (2021).

  52. Ariosto, S., Pacelli, R., Ginelli, F., Gherardi, M. & Rotondo, P. Universal mean-field upper bound for the generalization gap of deep neural networks. Phys. Rev. E 105, 064309 (2022).

  53. Shah, A., Wilson, A. & Ghahramani, Z. Student-t processes as alternatives to Gaussian processes. In Proc. 17th International Conference on Artificial Intelligence and Statistics (eds Kaski, S. & Corander, J.) 877–885 (PMLR, 2014).

  54. Zavatone-Veth, J. A., Canatar, A., Ruben, B. S. & Pehlevan, C. Asymptotics of representation learning in finite Bayesian neural networks. J. Stat. Mech. 2022, 114008 (2022).

  55. Hanin, B. & Zlokapa, A. Bayesian interpolation with deep linear networks. Proc. Natl Acad. Sci. USA 120, e2301345120 (2023).

  56. Coolen, A. C. C., Sheikh, M., Mozeika, A., Aguirre-López, F. & Antenucci, F. Replica analysis of overfitting in generalized linear regression models. J. Phys. A 53, 365001 (2020).

  57. Mozeika, A., Sheikh, M., Aguirre-López, F., Antenucci, F. & Coolen, A. C. C. Exact results on high-dimensional linear regression via statistical physics. Phys. Rev. E 103, 042142 (2021).

  58. Uchiyama, Y., Oka, H. & Nono, A. Student’s t-process regression on the space of probability density functions. Proc. ISCIE International Symposium on Stochastic Systems Theory and its Applications 2021, 1–5 (2021).

  59. Lee, H., Yun, E., Yang, H. & Lee, J. Scale mixtures of neural network Gaussian processes. In International Conference on Learning Representations (ICLR, 2022).

  60. Aitchison, L. Why bigger is not always better: on finite and infinite neural networks. In Proc. 37th International Conference on Machine Learning (eds Daumé, H. III & Singh, A.) 156–164 (PMLR, 2020).

  61. Zavatone-Veth, J. A. & Pehlevan, C. Depth induces scale-averaging in overparameterized linear Bayesian neural networks. In 2021 55th Asilomar Conference on Signals, Systems, and Computers 600–607 (IEEE, 2021).

  62. Yang, A. X., Robeyns, M., Milsom, E., Schoots, N. & Aitchison, L. A theory of representation learning in deep neural networks gives a deep generalisation of kernel methods. Preprint at https://arxiv.org/abs/2108.13097 (2023).

  63. Cho, Y. & Saul, L. Kernel methods for deep learning. In Advances in Neural Information Processing Systems (eds Bengio, Y. et al.) Vol. 22 (Curran Associates, 2009).

  64. Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J. & Ganguli, S. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems (eds Garnett, R. et al.) Vol. 29 (Curran Associates, 2016).

  65. Yang, G. & Schoenholz, S. Mean field residual networks: on the edge of chaos. In Advances in Neural Information Processing Systems (eds Guyon, I. et al.) Vol. 30 (Curran Associates, 2017).

  66. Tracey, B. D. & Wolpert, D. Upgrading from Gaussian processes to Student’s-t processes. In 2018 AIAA Non-Deterministic Approaches Conference 1659 (2018).

  67. Roberts, D. A., Yaida, S. & Hanin, B. The Principles of Deep Learning Theory (Cambridge Univ. Press, 2022).

  68. Gerace, F., Krzakala, F., Loureiro, B., Stephan, L. & Zdeborová, L. Gaussian universality of linear classifiers with random labels in high-dimension. Preprint at https://arxiv.org/abs/2205.13303 (2022).

  69. Cui, H., Krzakala, F. & Zdeborová, L. Bayes-optimal learning of deep random networks of extensive-width. In Proc. 40th International Conference on Machine Learning (eds Krause, A. et al.) 6468–6521 (PMLR, 2023).

  70. Lee, J. et al. Finite versus infinite neural networks: an empirical study. In Advances in Neural Information Processing Systems (eds Lin, H. et al.) 15156–15172 (Curran Associates, 2020).

  71. Pang, G., Yang, L. & Karniadakis, G. E. Neural-net-induced Gaussian process regression for function approximation and PDE solution. J. Comput. Phys. 384, 270–288 (2019).

  72. Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images (Univ. Toronto, 2012).

  73. LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998).

  74. Pacelli, R. rpacelli/FC_deep_bayesian_networks: FC_deep_bayesian_networks (2023).

Acknowledgements

M.P. has been supported by a grant from the Simons Foundation (grant no. 454941, S. Franz). P.R. acknowledges funding from the Fellini program under the H2020-MSCA-COFUND action, grant agreement no. 754496, INFN (IT) and from #NEXTGENERATIONEU (NGEU), National Recovery and Resilience Plan (NRRP), project MNESYS (PE0000006) ‘A Multiscale integrated approach to the study of the nervous system in health and disease’ (DN. 1553 11.10.2022). We would like to thank S. Franz, L. Molinari, F. Aguirre-López, R. Burioni, A. Vezzani, R. Aiudi, F. Bassetti, B. Bassetti, P. Baglioni and the Computing Sciences group at Bocconi University in Milan for discussions and suggestions.

Author information

Contributions

P.R., S.A. and M.P. performed the analytical calculations, supported by F.G., M.G. and R.P. Numerical experiments, data analysis and data visualization were carried out by R.P. All the authors contributed to discussing and interpreting the results and to writing and editing the paper. S.A. and R.P. contributed equally to the work.

Corresponding author

Correspondence to P. Rotondo.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks the anonymous reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary text.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Pacelli, R., Ariosto, S., Pastore, M. et al. A statistical mechanics framework for Bayesian deep neural networks beyond the infinite-width limit. Nat Mach Intell 5, 1497–1507 (2023). https://doi.org/10.1038/s42256-023-00767-6

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1038/s42256-023-00767-6
