Abstract
Despite the practical success of deep neural networks, a comprehensive theoretical framework that can predict practically relevant scores, such as the test accuracy, from knowledge of the training data is currently lacking. Huge simplifications arise in the infinite-width limit, in which the number of units Nℓ in each hidden layer (ℓ = 1, …, L, where L is the depth of the network) far exceeds the number P of training examples. This idealization, however, blatantly departs from the reality of deep learning practice. Here we use the toolset of statistical mechanics to overcome these limitations and derive an approximate partition function for fully connected deep neural architectures, which encodes information on the trained models. The computation holds in the thermodynamic limit, where both Nℓ and P are large and their ratio αℓ = P/Nℓ is finite. This advance allows us to obtain: (1) a closed formula for the generalization error associated with a regression task in a one-hidden layer network with finite α1; (2) an approximate expression of the partition function for deep architectures (via an effective action that depends on a finite number of order parameters); and (3) a link between deep neural networks in the proportional asymptotic limit and Student’s t-processes.
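Claim (3) of the abstract links deep networks in the proportional limit to Student's t-processes. As a purely illustrative sketch of the distributional difference at play (not the paper's derivation), the following numpy snippet contrasts samples from a Gaussian process with samples from a Student's t-process built the standard way, by rescaling Gaussian draws with an inverse-chi-squared mixing variable; the kernel, length scale and degrees of freedom `nu` are arbitrary choices for the demo.

```python
import numpy as np

def rbf_kernel(x, ell=1.0):
    # Squared-exponential covariance matrix on 1-D inputs.
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def sample_gp(K, n_samples, rng):
    # Zero-mean Gaussian-process draws with covariance K (jitter for stability).
    L = np.linalg.cholesky(K + 1e-8 * np.eye(len(K)))
    return rng.standard_normal((n_samples, len(K))) @ L.T

def sample_tp(K, nu, n_samples, rng):
    # Zero-mean Student's t-process draws: each Gaussian sample is divided by
    # an independent sqrt(chi^2_nu / nu) mixing variable, which fattens the
    # tails; as nu -> infinity the Gaussian process is recovered.
    g = sample_gp(K, n_samples, rng)
    u = rng.chisquare(nu, size=(n_samples, 1)) / nu
    return g / np.sqrt(u)

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 50)
K = rbf_kernel(x)
gp = sample_gp(K, 10_000, rng)
tp = sample_tp(K, nu=3.0, n_samples=10_000, rng=rng)
# The t-process marginals show visibly heavier tails (larger empirical
# kurtosis) than the Gaussian marginals, whose kurtosis is near 3.
```

The scale-mixture construction used here is the textbook route to t-processes; the paper's contribution is showing that finite-width Bayesian networks in the proportional regime induce such non-Gaussian process priors, not this sampling recipe itself.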
Data availability
The CIFAR-10 (ref. 72) and MNIST (ref. 73) datasets that we used for all our experiments are publicly available online, respectively, at https://www.cs.toronto.edu/~kriz/cifar.html and http://yann.lecun.com/exdb/mnist/.
Code availability
The code used to perform experiments, compute theory predictions and analyse data is available at: https://github.com/rpacelli/FC_deep_bayesian_networks (ref. 74).
References
Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016).
Engel, A. & Van den Broeck, C. Statistical Mechanics of Learning (Cambridge Univ. Press, 2001).
Seroussi, I., Naveh, G. & Ringel, Z. Separation of scales and a thermodynamic description of feature learning in some CNNs. Nat. Commun. 14, 908 (2023).
Wakhloo, A. J., Sussman, T. J. & Chung, S. Linear classification of neural manifolds with correlated variability. Phys. Rev. Lett. 131, 027301 (2023).
Li, Q. & Sompolinsky, H. Statistical mechanics of deep linear neural networks: the backpropagating kernel renormalization. Phys. Rev. X 11, 031059 (2021).
Baldassi, C. et al. Learning through atypical phase transitions in overparameterized neural networks. Phys. Rev. E 106, 014116 (2022).
Canatar, A., Bordelon, B. & Pehlevan, C. Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks. Nat. Commun. 12, 2914 (2021).
Mozeika, A., Li, B. & Saad, D. Space of functions computed by deep-layered machines. Phys. Rev. Lett. 125, 168301 (2020).
Goldt, S., Mézard, M., Krzakala, F. & Zdeborová, L. Modeling the influence of data structure on learning in neural networks: the hidden manifold model. Phys. Rev. X 10, 041044 (2020).
Bahri, Y. et al. Statistical mechanics of deep learning. Ann. Rev. Condensed Matter Phys. 11, 501–528 (2020).
Li, B. & Saad, D. Exploring the function space of deep-learning machines. Phys. Rev. Lett. 120, 248301 (2018).
Neal, R. M. in Bayesian Learning for Neural Networks 29–53 (Springer, 1996).
Williams, C. Computing with infinite networks. In Proc. 9th International Conference on Neural Information Processing Systems (eds Jordan, M. I. & Petsche, T.) 295–301 (MIT Press, 1996).
de G. Matthews, A. G., Hron, J., Rowland, M., Turner, R. E. & Ghahramani, Z. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations (ICLR, 2018).
Lee, J. et al. Deep neural networks as Gaussian processes. In International Conference on Learning Representations (ICLR, 2018).
Garriga-Alonso, A., Rasmussen, C. E. & Aitchison, L. Deep convolutional networks as shallow Gaussian processes. In International Conference on Learning Representations (ICLR, 2019).
Novak, R. et al. Bayesian deep convolutional networks with many channels are Gaussian processes. In International Conference on Learning Representations (ICLR, 2019).
Jacot, A., Gabriel, F. & Hongler, C. Neural tangent kernel: convergence and generalization in neural networks. In Proc. 32nd International Conference on Neural Information Processing Systems (eds Bengio, S. et al.) 8580–8589 (Curran Associates, 2018).
Chizat, L., Oyallon, E. & Bach, F. On lazy training in differentiable programming. In Proc. 33rd International Conference on Neural Information Processing Systems (eds Wallach, H. et al.) 2937–2947 (Curran Associates, 2019).
Lee, J. et al. Wide neural networks of any depth evolve as linear models under gradient descent. In Proc. 33rd International Conference on Neural Information Processing Systems (eds Wallach, H. et al.) 8572–8583 (Curran Associates, 2019).
Cortes, C. & Vapnik, V. Support-vector networks. Mach. Learn. 20, 273–297 (1995).
Bordelon, B., Canatar, A. & Pehlevan, C. Spectrum dependent learning curves in kernel regression and wide neural networks. In Proc. 37th International Conference on Machine Learning (eds Daumé, H. III & Singh, A.) 1024–1034 (PMLR, 2020).
Dietrich, R., Opper, M. & Sompolinsky, H. Statistical mechanics of support vector networks. Phys. Rev. Lett. 82, 2975–2978 (1999).
Seleznova, M. & Kutyniok, G. Neural tangent kernel beyond the infinite-width limit: effects of depth and initialization. Preprint at https://arxiv.org/abs/2202.00553 (2022).
Vyas, N., Bansal, Y. & Preetum, N. Limitations of the NTK for understanding generalization in deep learning. Preprint at https://arxiv.org/abs/2206.10012 (2022).
Antognini, J. M. Finite size corrections for neural network Gaussian processes. Preprint at https://arxiv.org/abs/1908.10030 (2019).
Yaida, S. Non-Gaussian processes and neural networks at finite widths. In Proc. 1st Mathematical and Scientific Machine Learning Conference (eds Lu, J. & Ward, R.) 165–192 (PMLR, 2020).
Hanin, B. Random fully connected neural networks as perturbatively solvable hierarchies. Preprint at https://arxiv.org/abs/2204.01058 (2023).
Zavatone-Veth, J. & Pehlevan, C. Exact marginal prior distributions of finite Bayesian neural networks. In Advances in Neural Information Processing Systems (eds Vaughan, J. et al.) 3364–3375 (Curran Associates, 2021).
Bengio, Y. & Delalleau, O. in Algorithmic Learning Theory (eds Kivinen, J. et al.) 18–36 (Springer, 2011).
Bartlett, P. L., Harvey, N., Liaw, C. & Mehrabian, A. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. J. Mach. Learn. Res. 20, 2285–2301 (2019).
Rotondo, P., Lagomarsino, M. C. & Gherardi, M. Counting the learnable functions of geometrically structured data. Phys. Rev. Res. 2, 023169 (2020).
Rotondo, P., Pastore, M. & Gherardi, M. Beyond the storage capacity: data-driven satisfiability transition. Phys. Rev. Lett. 125, 120601 (2020).
Pastore, M., Rotondo, P., Erba, V. & Gherardi, M. Statistical learning theory of structured data. Phys. Rev. E 102, 032119 (2020).
Gherardi, M. Solvable model for the linear separability of structured data. Entropy 23, 305 (2021).
Pastore, M. Critical properties of the SAT/UNSAT transitions in the classification problem of structured data. J. Stat. Mech. 2021, 113301 (2021).
Aguirre-López, F., Pastore, M. & Franz, S. Satisfiability transition in asymmetric neural networks. J. Phys. A 55, 305001 (2022).
Saxe, A. M., McClelland, J. L. & Ganguli, S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In Proc. 2nd International Conference on Learning Representations (eds Bengio, Y. & LeCun, Y.) (ICLR, 2014).
Saxe, A. M., McClelland, J. L. & Ganguli, S. A mathematical theory of semantic development in deep neural networks. Proc. Natl Acad. Sci. USA 116, 11537–11546 (2019).
Zavatone-Veth, J., Canatar, A., Ruben, B. & Pehlevan, C. Asymptotics of representation learning in finite Bayesian neural networks. In Advances in Neural Information Processing Systems (eds Vaughan, J. W. et al.) 24765–24777 (Curran Associates, 2021).
Naveh, G. & Ringel, Z. A self consistent theory of Gaussian processes captures feature learning effects in finite CNNs. In Advances in Neural Information Processing Systems (eds Vaughan, J. W. et al.) 21352–21364 (Curran Associates, 2021).
Zavatone-Veth, J. A., Tong, W. L. & Pehlevan, C. Contrasting random and learned features in deep Bayesian linear regression. Phys. Rev. E 105, 064118 (2022).
Bardet, J.-M. & Surgailis, D. Moment bounds and central limit theorems for Gaussian subordinated arrays. J. Multivariate Anal. 114, 457–473 (2013).
Nourdin, I., Peccati, G. & Podolskij, M. Quantitative Breuer–Major theorems. Stochastic Process. Appl. 121, 793–812 (2011).
Breuer, P. & Major, P. Central limit theorems for non-linear functionals of Gaussian fields. J. Multivariate Anal. 13, 425–441 (1983).
Gerace, F., Loureiro, B., Krzakala, F., Mezard, M. & Zdeborova, L. Generalisation error in learning with random features and the hidden manifold model. In Proc. 37th International Conference on Machine Learning (eds Daumé, H. III & Singh, A.) 3452–3462 (PMLR, 2020).
Loureiro, B. et al. Learning curves of generic features maps for realistic datasets with a teacher-student model. J. Stat. Mech. https://doi.org/10.1088/1742-5468/ac9825 (2021).
Goldt, S. et al. The Gaussian equivalence of generative models for learning with shallow neural networks. In Proc. 2nd Mathematical and Scientific Machine Learning Conference (eds Bruna, J. et al.) 426–471 (PMLR, 2022).
Dobriban, E. & Wager, S. High-dimensional asymptotics of prediction: ridge regression and classification. Ann. Stat. 46, 247–279 (2018).
Mei, S. & Montanari, A. The generalization error of random features regression: precise asymptotics and the double descent curve. Commun. Pure Appl. Math. 75, 667–766 (2019).
Ghorbani, B., Mei, S., Misiakiewicz, T. & Montanari, A. Linearized two-layers neural networks in high dimension. Ann. Stat. 49, 1029–1054 (2021).
Ariosto, S., Pacelli, R., Ginelli, F., Gherardi, M. & Rotondo, P. Universal mean-field upper bound for the generalization gap of deep neural networks. Phys. Rev. E 105, 064309 (2022).
Shah, A., Wilson, A. & Ghahramani, Z. Student-t processes as alternatives to Gaussian processes. In Proc. 17th International Conference on Artificial Intelligence and Statistics (eds Kaski, S. & Corander, J.) 877–885 (PMLR, 2014).
Zavatone-Veth, J. A., Canatar, A., Ruben, B. S. & Pehlevan, C. Asymptotics of representation learning in finite Bayesian neural networks. J. Stat. Mech. 2022, 114008 (2022).
Hanin, B. & Zlokapa, A. Bayesian interpolation with deep linear networks. Proc. Natl Acad. Sci. USA 120, e2301345120 (2023).
Coolen, A. C. C., Sheikh, M., Mozeika, A., Aguirre-López, F. & Antenucci, F. Replica analysis of overfitting in generalized linear regression models. J. Phys. A 53, 365001 (2020).
Mozeika, A., Sheikh, M., Aguirre-López, F., Antenucci, F. & Coolen, A. C. C. Exact results on high-dimensional linear regression via statistical physics. Phys. Rev. E 103, 042142 (2021).
Uchiyama, Y., Oka, H. & Nono, A. Student’s t-process regression on the space of probability density functions. Proc. ISCIE International Symposium on Stochastic Systems Theory and its Applications 2021, 1–5 (2021).
Lee, H., Yun, E., Yang, H. & Lee, J. Scale mixtures of neural network Gaussian processes. In International Conference on Learning Representations (ICLR, 2022).
Aitchison, L. Why bigger is not always better: on finite and infinite neural networks. In Proc. 37th International Conference on Machine Learning (eds Daumé, H. III & Singh, A.) 156–164 (PMLR, 2020).
Zavatone-Veth, J. A. & Pehlevan, C. Depth induces scale-averaging in overparameterized linear Bayesian neural networks. In 2021 55th Asilomar Conference on Signals, Systems, and Computers 600–607 (IEEE, 2021).
Yang, A. X., Robeyns, M., Milsom, E., Schoots, N. & Aitchison, L. A theory of representation learning in deep neural networks gives a deep generalisation of kernel methods. Preprint at https://arxiv.org/abs/2108.13097 (2023).
Cho, Y. & Saul, L. Kernel methods for deep learning. In Advances in Neural Information Processing Systems (eds Bengio, Y. et al.) Vol. 22 (Curran Associates, 2009).
Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J. & Ganguli, S. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems (eds Garnett, R. et al.) Vol. 29 (Curran Associates, 2016).
Yang, G. & Schoenholz, S. Mean field residual networks: on the edge of chaos. In Advances in Neural Information Processing Systems (eds Guyon, I. et al.) Vol. 30 (Curran Associates, 2017).
Tracey, B. D. & Wolpert, D. Upgrading from Gaussian processes to Student’s-t processes. In 2018 AIAA Non-Deterministic Approaches Conference 1659 (2018).
Roberts, D. A., Yaida, S. & Hanin, B. The Principles of Deep Learning Theory (Cambridge Univ. Press, 2022).
Gerace, F., Krzakala, F., Loureiro, B., Stephan, L. & Zdeborová, L. Gaussian universality of linear classifiers with random labels in high-dimension. Preprint at https://arxiv.org/abs/2205.13303 (2022).
Cui, H., Krzakala, F. & Zdeborová, L. Bayes-optimal learning of deep random networks of extensive-width. In Proc. 40th International Conference on Machine Learning (eds Krause, A. et al.) 6468–6521 (PMLR, 2023).
Lee, J. et al. Finite versus infinite neural networks: an empirical study. In Advances in Neural Information Processing Systems (eds Lin, H. et al.) 15156–15172 (Curran Associates, 2020).
Pang, G., Yang, L. & Karniadakis, G. E. Neural-net-induced Gaussian process regression for function approximation and PDE solution. J. Comput. Phys. 384, 270–288 (2019).
Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images (Univ. Toronto, 2012).
Lecun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998).
Pacelli, R. rpacelli/FC_deep_bayesian_networks. GitHub https://github.com/rpacelli/FC_deep_bayesian_networks (2023).
Acknowledgements
M.P. has been supported by a grant from the Simons Foundation (grant no. 454941, S. Franz). P.R. acknowledges funding from the Fellini program under the H2020-MSCA-COFUND action, grant agreement no. 754496, INFN (IT) and from #NEXTGENERATIONEU (NGEU), National Recovery and Resilience Plan (NRRP), project MNESYS (PE0000006) ‘A Multiscale integrated approach to the study of the nervous system in health and disease’ (DN. 1553 11.10.2022). We would like to thank S. Franz, L. Molinari, F. Aguirre-López, R. Burioni, A. Vezzani, R. Aiudi, F. Bassetti, B. Bassetti, P. Baglioni and the Computing Sciences group at Bocconi University in Milan for discussions and suggestions.
Author information
Contributions
P.R., S.A. and M.P. performed the analytical calculations, supported by F.G., M.G. and R.P. Numerical experiments, data analysis and data visualization were carried out by R.P. All the authors contributed to discussing and interpreting the results and to writing and editing the paper. S.A. and R.P. contributed equally to the work.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review information
Nature Machine Intelligence thanks the anonymous reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary text.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Pacelli, R., Ariosto, S., Pastore, M. et al. A statistical mechanics framework for Bayesian deep neural networks beyond the infinite-width limit. Nat Mach Intell 5, 1497–1507 (2023). https://doi.org/10.1038/s42256-023-00767-6
This article is cited by
- Inversion dynamics of class manifolds in deep learning reveals tradeoffs underlying generalization. Nature Machine Intelligence (2024).