
  • Perspective

Fitting elephants in modern machine learning by statistically consistent interpolation

A preprint version of the article is available at arXiv.

Abstract

Textbook wisdom advocates for smooth function fits and implies that interpolation of noisy data should lead to poor generalization. A related heuristic is that fitting parameters should be fewer than measurements (Occam’s razor). Surprisingly, contemporary machine learning approaches, such as deep nets, generalize well despite interpolating noisy data. This may be understood via statistically consistent interpolation (SCI), that is, data interpolation techniques that generalize optimally for big data. Here, we elucidate SCI using the weighted interpolating nearest neighbours (wiNN) algorithm, which adds singular weight functions to k nearest neighbours. This shows that data interpolation can be a valid machine learning strategy for big data. SCI clarifies the relation between two ways of modelling natural phenomena: the rationalist approach (strong priors) of theoretical physics with few parameters, and the empiricist (weak priors) approach of modern machine learning with more parameters than data. SCI shows that the purely empirical approach can successfully predict. However, data interpolation does not provide theoretical insights, and the training data requirements may be prohibitive. Complex animal brains lie between these extremes, with many parameters but modest training data, and with prior structure encoded in species-specific mesoscale circuitry. Thus, modern machine learning provides a distinct epistemological approach that differs from both physical theories and animal brains.
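To make the wiNN idea concrete, the following is a minimal sketch, not the authors' implementation: it assumes a Nadaraya–Watson-style average over the k nearest neighbours in which the weight diverges as the distance to a training point goes to zero, here taken (as an illustrative assumption) to be the power law d^(-alpha). The function name winn_predict, the values of k and alpha, and the toy data are all hypothetical choices for illustration.

import numpy as np

def winn_predict(x_train, y_train, x_query, k=5, alpha=2.0, eps=1e-12):
    # Singularly weighted k-nearest-neighbour regression: a weighted average
    # over the k nearest neighbours whose weights blow up as the distance to
    # a training point shrinks, so the fit interpolates the training samples.
    x_train = np.atleast_2d(x_train)
    x_query = np.atleast_2d(x_query)
    y_train = np.asarray(y_train)
    preds = np.empty(len(x_query))
    for j, x in enumerate(x_query):
        dist = np.linalg.norm(x_train - x, axis=1)
        nn = np.argsort(dist)[:k]          # indices of the k nearest neighbours
        d = dist[nn]
        if d[0] < eps:                     # query coincides with a training point:
            preds[j] = y_train[nn[0]]      # the estimate returns the stored label
        else:
            w = d ** (-alpha)              # singular weight function
            preds[j] = np.dot(w, y_train[nn]) / np.sum(w)
    return preds

# Toy usage: noisy 1D linear data; the fitted curve passes through every
# noisy sample yet tracks the underlying trend between them.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(50, 1))
y = 2.0 * x[:, 0] + 0.1 * rng.standard_normal(50)
x_grid = np.linspace(0.0, 1.0, 200)[:, None]
y_hat = winn_predict(x, y, x_grid, k=5, alpha=2.0)

At the training points the estimate reproduces the (noisy) labels exactly, while between them it behaves like an ordinary weighted nearest-neighbour average; the particular exponent and neighbourhood size above are illustrative rather than the tuned choices analysed in the paper.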


Fig. 1: The wiNN algorithm applied to linear regression.
Fig. 2: Classification using wiNN, illustrated in 2D.
Fig. 3: SCI placed in context.
Fig. 4: Data-driven ML as a ‘third epistemology’.



Acknowledgements

This work was supported by the Crick–Clay Professorship (CSHL) and the H. N. Mahabala Chair Professorship (IIT Madras).

Author information

Corresponding author

Correspondence to Partha P. Mitra.

Ethics declarations

Competing interests

The author declares no competing interests.

Additional information

Peer review information Nature Machine Intelligence thanks Samet Oymak and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Mitra, P.P. Fitting elephants in modern machine learning by statistically consistent interpolation. Nat Mach Intell 3, 378–386 (2021). https://doi.org/10.1038/s42256-021-00345-8

