
  • Perspective

Fitting elephants in modern machine learning by statistically consistent interpolation

A preprint version of the article is available at arXiv.

Abstract

Textbook wisdom advocates for smooth function fits and implies that interpolation of noisy data should lead to poor generalization. A related heuristic is that fitting parameters should be fewer than measurements (Occam’s razor). Surprisingly, contemporary machine learning approaches, such as deep nets, generalize well despite interpolating noisy data. This may be understood via statistically consistent interpolation (SCI), that is, data interpolation techniques that generalize optimally for big data. Here, we elucidate SCI using the weighted interpolating nearest neighbours (wiNN) algorithm, which adds singular weight functions to k nearest neighbours. This shows that data interpolation can be a valid machine learning strategy for big data. SCI clarifies the relation between two ways of modelling natural phenomena: the rationalist approach (strong priors) of theoretical physics with few parameters, and the empiricist (weak priors) approach of modern machine learning with more parameters than data. SCI shows that the purely empirical approach can successfully predict. However, data interpolation does not provide theoretical insights, and the training data requirements may be prohibitive. Complex animal brains lie between these extremes, with many parameters but modest training data, and with prior structure encoded in species-specific mesoscale circuitry. Thus, modern machine learning provides a distinct epistemological approach that differs from both physical theories and animal brains.
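To make the wiNN idea concrete, the following is a minimal sketch, not the authors' implementation: it assumes a Nadaraya–Watson-style average over the k nearest neighbours in which the weight diverges as the distance to a training point goes to zero, here taken (as an illustrative assumption) to be the power law d^(-alpha). The function name winn_predict, the values of k and alpha, and the toy data are all hypothetical choices for illustration.

import numpy as np

def winn_predict(x_train, y_train, x_query, k=5, alpha=2.0, eps=1e-12):
    # Singularly weighted k-nearest-neighbour regression: a weighted average
    # over the k nearest neighbours whose weights blow up as the distance to
    # a training point shrinks, so the fit interpolates the training samples.
    x_train = np.atleast_2d(x_train)
    x_query = np.atleast_2d(x_query)
    y_train = np.asarray(y_train)
    preds = np.empty(len(x_query))
    for j, x in enumerate(x_query):
        dist = np.linalg.norm(x_train - x, axis=1)
        nn = np.argsort(dist)[:k]          # indices of the k nearest neighbours
        d = dist[nn]
        if d[0] < eps:                     # query coincides with a training point:
            preds[j] = y_train[nn[0]]      # the estimate returns the stored label
        else:
            w = d ** (-alpha)              # singular weight function
            preds[j] = np.dot(w, y_train[nn]) / np.sum(w)
    return preds

# Toy usage: noisy 1D linear data; the fitted curve passes through every
# noisy sample yet tracks the underlying trend between them.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(50, 1))
y = 2.0 * x[:, 0] + 0.1 * rng.standard_normal(50)
x_grid = np.linspace(0.0, 1.0, 200)[:, None]
y_hat = winn_predict(x, y, x_grid, k=5, alpha=2.0)

At the training points the estimate reproduces the (noisy) labels exactly, while between them it behaves like an ordinary weighted nearest-neighbour average; the particular exponent and neighbourhood size above are illustrative rather than the tuned choices analysed in the paper.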


Fig. 1: The wiNN algorithm applied to linear regression.
Fig. 2: Classification using wiNN, illustrated in 2D.
Fig. 3: SCI placed in context.
Fig. 4: Data-driven ML as a ‘third epistemology’.



Acknowledgements

This work was supported by the Crick–Clay Professorship (CSHL) and the H. N. Mahabala Chair Professorship (IIT Madras).

Author information

Corresponding author

Correspondence to Partha P. Mitra.

Ethics declarations

Competing interests

The author declares no competing interests.

Additional information

Peer review information Nature Machine Intelligence thanks Samet Oymak and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Mitra, P.P. Fitting elephants in modern machine learning by statistically consistent interpolation. Nat Mach Intell 3, 378–386 (2021). https://doi.org/10.1038/s42256-021-00345-8

