Moving beyond generalization to accurate interpretation of flexible models

Abstract

Machine learning optimizes flexible models to predict data. In scientific applications, there is a rising interest in interpreting these flexible models to derive hypotheses from data. However, it is unknown whether good data prediction guarantees the accurate interpretation of flexible models. Here, we test this connection using a flexible, yet intrinsically interpretable framework for modelling neural dynamics. We find that many models discovered during optimization predict data equally well, yet they fail to match the correct hypothesis. We develop an alternative approach that identifies models with correct interpretation by comparing model features across data samples to separate true features from noise. We illustrate our findings using recordings of spiking activity from the visual cortex of monkeys performing a fixation task. Our results reveal that good predictions cannot substitute for accurate interpretation of flexible models and offer a principled approach to identify models with correct interpretation.

A preprint version of the article is available at bioRxiv.

Data availability

All synthetic data reported in this paper can be reproduced using the source code. Electrophysiological data were gathered by N.A. Steinmetz and T. Moore as described in ref. 35, and are archived at the Stanford Neuroscience Institute server at Stanford University. The synthetic and neurophysiological data are available from the corresponding author upon request.

Code availability

The source code to reproduce the results of this study is freely available on GitHub (https://github.com/engellab/neuralflow, https://doi.org/10.5281/zenodo.4010952).

References

1. Neyman, J. & Pearson, E. S. On the problem of the most efficient tests of statistical hypotheses. Philos. Trans. R. Soc. Lond. A 231, 289–337 (1933).

2. Elsayed, G. F. & Cunningham, J. P. Structure in neural population recordings: an expected byproduct of simpler phenomena? Nat. Neurosci. 20, 1310–1318 (2017).

3. Szucs, D. & Ioannidis, J. P. A. When null hypothesis significance testing is unsuitable for research: a reassessment. Front. Hum. Neurosci. 11, 390 (2017).

4. Chandrasekaran, C. et al. Brittleness in model selection analysis of single neuron firing rates. Preprint at https://doi.org/10.1101/430710 (2018).

5. Burnham, K. P. & Anderson, D. R. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach (Springer Science & Business Media, 2007).

6. Bollimunta, A., Totten, D. & Ditterich, J. Neural dynamics of choice: single-trial analysis of decision-related activity in parietal cortex. J. Neurosci. 32, 12684–12701 (2012).

7. Churchland, A. K. et al. Variance as a signature of neural computations during decision making. Neuron 69, 818–831 (2011).

8. Latimer, K. W., Yates, J. L., Meister, M. L., Huk, A. C. & Pillow, J. W. Single-trial spike trains in parietal cortex reveal discrete steps during decision-making. Science 349, 184–187 (2015).

9. Zoltowski, D. M., Latimer, K. W., Yates, J. L., Huk, A. C. & Pillow, J. W. Discrete stepping and nonlinear ramping dynamics underlie spiking responses of LIP neurons during decision-making. Neuron 102, 1249–1258 (2019).

10. Durstewitz, D., Koppe, G. & Toutounji, H. Computational models as statistical tools. Curr. Opin. Behav. Sci. 11, 93–99 (2016).

11. Linderman, S. W. & Gershman, S. J. Using computational theory to constrain statistical models of neural data. Curr. Opin. Neurobiol. 46, 14–24 (2017).

12. Pandarinath, C. et al. Inferring single-trial neural population dynamics using sequential auto-encoders. Nat. Methods 15, 805–815 (2018).

13. Shrikumar, A., Greenside, P., Shcherbina, A. & Kundaje, A. Not just a black box: learning important features through propagating activation differences. Preprint at https://arxiv.org/abs/1605.01713 (2016).

14. Yamins, D. L. K. et al. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proc. Natl Acad. Sci. USA 111, 8619–8624 (2014).

15. Pospisil, D. A. & Pasupathy, A. ‘Artiphysiology’ reveals V4-like shape tuning in a deep network trained for image classification. eLife 7, e38242 (2018).

16. Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).

17. Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015).

18. Brunton, B. W., Botvinick, M. M. & Brody, C. D. Rats and humans can optimally accumulate evidence for decision-making. Science 340, 95–98 (2013).

19. Sussillo, D. & Barak, O. Opening the black box: low-dimensional dynamics in high-dimensional recurrent neural networks. Neural Comput. 25, 626–649 (2013).

20. Belkin, M., Hsu, D., Ma, S. & Mandal, S. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proc. Natl Acad. Sci. USA 116, 15849–15854 (2019).

21. Haas, K. R., Yang, H. & Chu, J.-W. Expectation-maximization of the potential of mean force and diffusion coefficient in Langevin dynamics from single molecule FRET data photon by photon. J. Phys. Chem. B 117, 15591–15605 (2013).

22. Duncker, L., Bohner, G., Boussard, J. & Sahani, M. Learning interpretable continuous-time models of latent stochastic dynamical systems. Preprint at https://arxiv.org/abs/1902.04420 (2019).

23. Amarasingham, A., Geman, S. & Harrison, M. T. Ambiguity and nonidentifiability in the statistical analysis of neural codes. Proc. Natl Acad. Sci. USA 112, 6455–6460 (2015).

24. Bishop, C. M. Pattern Recognition and Machine Learning (Springer, 2006).

25. Yan, H. et al. Nonequilibrium landscape theory of neural networks. Proc. Natl Acad. Sci. USA 110, E4185–E4194 (2013).

26. Bottou, L., Curtis, F. E. & Nocedal, J. Optimization methods for large-scale machine learning. SIAM Rev. 60, 223–311 (2018).

27. Hastie, T., Tibshirani, R., Friedman, J. & Franklin, J. The Elements of Statistical Learning: Data Mining, Inference and Prediction (Springer, 2005).

28. Zhang, C., Bengio, S., Hardt, M., Recht, B. & Vinyals, O. Understanding deep learning requires rethinking generalization. Preprint at https://arxiv.org/abs/1611.03530 (2016).

29. Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M. & Tang, P. T. P. On large-batch training for deep learning: generalization gap and sharp minima. Preprint at https://arxiv.org/abs/1609.04836 (2016).

30. Ilyas, A. et al. Adversarial examples are not bugs, they are features. In Advances in Neural Information Processing Systems 32 125–136 (Curran Associates, 2019).

31. Cawley, G. C. & Talbot, N. L. C. On over-fitting in model selection and subsequent selection bias in performance evaluation. J. Mach. Learn. Res. 11, 2079–2107 (2010).

32. Prechelt, L. in Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science Vol. 7700 (eds Montavon, G., Orr, G. B. & Müller, K. R.) 53–67 (Springer, 2012).

33. Haas, K. R., Yang, H. & Chu, J.-W. Trajectory entropy of continuous stochastic processes at equilibrium. J. Phys. Chem. Lett. 5, 999–1003 (2014).

34. Kalimeris, D. et al. SGD on neural networks learns functions of increasing complexity. In Advances in Neural Information Processing Systems 32 3496–3506 (Curran Associates, 2019).

35. Engel, T. A. et al. Selective modulation of cortical state during spatial attention. Science 354, 1140–1144 (2016).

36. Schmidt, M. & Lipson, H. Distilling free-form natural laws from experimental data. Science 324, 81–85 (2009).

37. Daniels, B. C. & Nemenman, I. Automated adaptive inference of phenomenological dynamical models. Nat. Commun. 6, 8133 (2015).

38. Brunton, S. L., Proctor, J. L. & Kutz, J. N. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proc. Natl Acad. Sci. USA 113, 3932–3937 (2016).

39. Boninsegna, L., Nüske, F. & Clementi, C. Sparse learning of stochastic dynamical equations. J. Chem. Phys. 148, 241723 (2018).

40. Rudy, S. H., Kutz, J. N. & Brunton, S. L. Deep learning of dynamics and signal-noise decomposition with time-stepping constraints. J. Comput. Phys. 396, 483–506 (2019).

41. Zhao, Y. & Park, I. M. Variational joint filtering. Preprint at https://arxiv.org/abs/1707.09049v4 (2017).

42. Schwalger, T., Deger, M. & Gerstner, W. Towards a theory of cortical columns: from spiking neurons to interacting neural populations of finite size. PLoS Comput. Biol. 13, e1005507 (2017).

43. Hennequin, G., Ahmadian, Y., Rubin, D. B., Lengyel, M. & Miller, K. D. The dynamical regime of sensory cortex: stable dynamics around a single stimulus-tuned attractor account for patterns of noise variability. Neuron 98, 846–860 (2018).

44. Holcman, D. & Tsodyks, M. The emergence of up and down states in cortical networks. PLoS Comput. Biol. 2, e23 (2006).

45. Jercog, D. et al. UP–DOWN cortical dynamics reflect state transitions in a bistable network. eLife 6, e22425 (2017).

46. Levenstein, D., Buzsáki, G. & Rinzel, J. NREM sleep in the rodent neocortex and hippocampus reflects excitable dynamics. Nat. Commun. 10, 2478 (2019).

47. Recanatesi, S., Pereira, U., Murakami, M., Mainen, Z. F. & Mazzucato, L. Metastable attractors explain the variable timing of stable behavioral action sequences. Preprint at https://doi.org/10.1101/2020.01.24.919217 (2020).

48. Cunningham, J. P. & Yu, B. M. Dimensionality reduction for large-scale neural recordings. Nat. Neurosci. 17, 1500–1509 (2014).

49. Williamson, R. C., Doiron, B., Smith, M. A. & Yu, B. M. Bridging large-scale neuronal recordings and large-scale network models using dimensionality reduction. Curr. Opin. Neurobiol. 55, 40–47 (2019).

50. Murray, J. D. et al. Stable population coding for working memory coexists with heterogeneous neural dynamics in prefrontal cortex. Proc. Natl Acad. Sci. USA 114, 394–399 (2017).

51. Elsayed, G. F., Lara, A. H., Kaufman, M. T., Churchland, M. M. & Cunningham, J. P. Reorganization between preparatory and movement population responses in motor cortex. Nat. Commun. 7, 13239 (2016).

52. Risken, H. The Fokker–Planck Equation (Springer, 1996).

Acknowledgements

This work was supported by NIH grant no. R01 EB026949 and the Swartz Foundation. We thank G. Angeris for help in the early stages of the project, K. Haas for useful discussions and P. Koo, J. Jansen, A. Siepel, J. Kinney and T. Janowitz for their thoughtful comments on the manuscript. We thank N.A. Steinmetz and T. Moore for sharing the electrophysiological data, which are presented in ref. 35 and are archived at the Stanford Neuroscience Institute server at Stanford University.

Author information

Contributions

M.G. and T.A.E. designed the study, performed research, discussed the findings and wrote the paper.

Corresponding author

Correspondence to Tatiana A. Engel.

Ethics declarations

Competing interests

The authors declare no competing interests.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Overfitting does not occur with infinite training data.

a, A series of models produced by gradient descent when the same finite data set is used throughout the optimization. Substantial overfitting is observed. b, Same as a, but with spikes resampled on each gradient-descent iteration from a fixed latent trajectory (the same latent trajectory as in a). Overfitting is still observed. c, Histogram of the latent trajectory (normalized as a probability density) and the discovered equilibrium probability density (at iteration 100,000) from the simulation in b. The overfitted model contains features that are not present in the latent trajectory. d, Same as c, but for a different latent trajectory. The spurious features in c and d are different. e, Same as c, but with both the latent trajectory and the spikes resampled on each gradient-descent iteration. No signs of overfitting are observed. f, After 100,000 iterations, the inferred potential for the simulation in e still perfectly matches the ground truth. These simulations confirm that overfitting emerges largely due to Poisson noise (compare a and b), although noise in the latent trajectory also contributes to overfitting (compare b and f). No overfitting occurs when both the spikes and the latent trajectory are resampled on each gradient-descent iteration. In all panels, the training data contain ~10,000 spikes generated from a double-well potential.
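The three data regimes compared in this caption (fixed data, spikes resampled from a fixed latent trajectory, and both resampled) can be illustrated with a minimal sketch. It is not the authors' simulation code: the potential shape, the firing-rate function, the time step and all function names are assumptions chosen for illustration, and the latent dynamics are integrated with a plain Euler-Maruyama scheme.

```python
import math
import random

def potential_gradient(x: float) -> float:
    """Derivative of an assumed double-well potential Phi(x) = (x**2 - 1)**2."""
    return 4.0 * x * (x * x - 1.0)

def simulate_latent(n_steps: int, dt: float = 0.01, seed: int = 0) -> list:
    """Euler-Maruyama integration of dx = -Phi'(x) dt + sqrt(2 dt) * xi."""
    rng = random.Random(seed)
    x, path = 0.0, []
    for _ in range(n_steps):
        x += -potential_gradient(x) * dt + math.sqrt(2.0 * dt) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path

def firing_rate(x: float) -> float:
    """Hypothetical firing-rate function (spikes per second), kept non-negative."""
    return max(0.0, 20.0 * (x + 2.0))

def poisson_sample(rng: random.Random, lam: float) -> int:
    """Knuth's multiplication method; adequate for the small means used here."""
    limit, count, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count

def sample_spikes(latent: list, dt: float = 0.01, seed: int = 0) -> list:
    """Draw one Poisson spike count per time bin, conditioned on the latent path."""
    rng = random.Random(seed)
    return [poisson_sample(rng, firing_rate(x) * dt) for x in latent]

# Regime a: one latent path and one spike realization, reused on every iteration.
latent = simulate_latent(n_steps=2000, seed=1)
fixed_spikes = sample_spikes(latent, seed=10)

# Regime b: the same latent path, fresh Poisson noise on each iteration.
resampled_spikes = [sample_spikes(latent, seed=100 + it) for it in range(3)]

# Regime e: fresh latent path and fresh spikes on each iteration.
fully_resampled = []
for it in range(3):
    new_latent = simulate_latent(n_steps=2000, seed=200 + it)
    fully_resampled.append(sample_spikes(new_latent, seed=300 + it))
```

In regime b, any structure specific to the single latent path is still shared across iterations, which is consistent with the caption's observation that resampling spikes alone does not remove overfitting.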

Extended Data Fig. 2 A feedforward neural network exhibits generalization plateaux and overfitting in model selection.

We trained a shallow feedforward neural network (1 hidden layer with 100 neurons; 301 parameters in total) on a regression problem. The noisy data set of 1,000 samples was generated from a linear model y = x + 0.2ξ, where $$\xi \sim {\mathcal{N}}(0,1)$$. The network was trained using the MATLAB Deep Learning Toolbox, which runs stochastic gradient descent with early-stopping regularization. We initialized all parameters (weights and biases) from a normal distribution with zero mean and variance 0.01. We verified that our results did not depend on a particular realization of the initial parameters. a, Network architecture. b, Example data set (dots) along with the linear ground-truth model (line). c, Training and validated mean squared errors (MSEs) over the optimization epochs. The long plateau in the validated MSE indicates that many models generalize equally well. The arrow indicates the minimum of the validated MSE, that is, the model with the best generalization. d, The model with the best generalization (red line, corresponds to the arrow in c) contains spurious features not present in the linear ground-truth model (grey line). e, Same as b, but for a different data sample from the same ground-truth model. f, Same as c, but for the data sample in e. g, Same as d, but for the data sample in e. The model with the best generalization (green line, corresponds to the arrow in f) closely matches the ground-truth model (grey line).
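As a quick check of the numbers in this caption, the sketch below counts the parameters of a fully connected network with one scalar input, 100 hidden units and one scalar output, and draws a data set from y = x + 0.2ξ. The function names are hypothetical, and the range of x is not stated in the caption, so a uniform draw on [-1, 1] is assumed here.

```python
import math
import random

def count_parameters(n_inputs: int, n_hidden: int, n_outputs: int) -> int:
    """Weights and biases of a fully connected network with one hidden layer."""
    hidden = n_inputs * n_hidden + n_hidden    # input-to-hidden weights + hidden biases
    output = n_hidden * n_outputs + n_outputs  # hidden-to-output weights + output biases
    return hidden + output

def make_linear_dataset(n_samples: int, noise_scale: float = 0.2, seed: int = 0):
    """Draw (x, y) pairs with y = x + noise_scale * xi, where xi ~ N(0, 1).

    The input range is an assumption: x is drawn uniformly from [-1, 1].
    """
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
    ys = [x + noise_scale * rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys

n_params = count_parameters(n_inputs=1, n_hidden=100, n_outputs=1)

xs, ys = make_linear_dataset(1000)
# Spread of the residuals around the ground-truth line y = x.
residuals = [y - x for x, y in zip(xs, ys)]
residual_std = math.sqrt(sum(r * r for r in residuals) / len(residuals))
```

With 100 input weights, 100 hidden biases, 100 output weights and one output bias, `n_params` equals the 301 parameters quoted above, so the network has far more freedom than the two-parameter linear ground truth.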

Extended Data Fig. 3 Results of model selection using the best generalization and feature consistency methods for large data amount.

Simulations are shown for ten independent data samples each with 200,000 spikes generated from the same ground-truth model (triple-well potential). Each coloured line represents one simulation. Details of the model selection procedures are provided in Supplementary Note 1.7. a, Model selection based on the best generalization. Validated negative log-likelihood (upper panel) achieves the minimum (dots) at different gradient-descent iterations on different data samples. Fitted potentials (lower panel) selected at the minimum of the validated negative log-likelihood are consistent across data samples and with the ground-truth model for the large data amount. b, Model selection based on feature consistency. KL divergence (upper panel) between potentials discovered from two data halves at each level of feature complexity. Models are selected at the feature complexity $${{\mathcal{M}}}^{* }$$ (dots) where KL divergence exceeds the threshold $${D}_{{\rm{KL}}}^{{\rm{thresh}}}$$ (dashed line). The selected potentials (lower panel) are consistent across data samples and with the ground-truth model. c, Same as b, but for models selected at a larger feature complexity $${\mathcal{M}}>{{\mathcal{M}}}^{* }$$ (dots). These models differ between the two data halves and are inconsistent across data samples. Only five simulations are shown for clarity.
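A KL divergence between two fitted potentials can be sketched by comparing the equilibrium densities they imply. The example below assumes the relation p(x) ∝ exp(−Φ(x)) on a bounded interval and a simple Riemann sum on a uniform grid; the exact definition used in the paper is given in its Supplementary Note 1.7, so the functions and example potentials here are illustrative assumptions only.

```python
import math

def equilibrium_density(potential, grid):
    """Density proportional to exp(-Phi(x)), normalized on a uniform grid (assumed relation)."""
    dx = grid[1] - grid[0]
    weights = [math.exp(-potential(x)) for x in grid]
    norm = sum(weights) * dx
    return [w / norm for w in weights]

def kl_divergence(p, q, dx):
    """Riemann-sum approximation of D_KL(p || q) for densities sampled on the same grid."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0) * dx

# Uniform grid on [-1, 1]; the interval and resolution are arbitrary choices.
n_points = 201
grid = [-1.0 + 2.0 * i / (n_points - 1) for i in range(n_points)]
dx = grid[1] - grid[0]

def phi_half_1(x):
    """Hypothetical potential fitted to the first data half (double well)."""
    return 4.0 * (x * x - 0.25) ** 2

def phi_half_2(x):
    """Hypothetical potential fitted to the second data half (slightly tilted)."""
    return 4.0 * (x * x - 0.25) ** 2 + 0.1 * x

p1 = equilibrium_density(phi_half_1, grid)
p2 = equilibrium_density(phi_half_2, grid)

kl_same = kl_divergence(p1, p1, dx)  # identical models
kl_diff = kl_divergence(p1, p2, dx)  # models that disagree slightly
```

Identical potentials give zero divergence, and any disagreement between the two halves gives a positive value that can be compared against a threshold.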

Extended Data Fig. 4 Results of model selection using the best generalization and feature consistency methods for moderate data amount.

Simulations are shown for ten independent data samples each with 20,000 spikes generated from the same ground-truth model (double-well potential). Each coloured line represents one simulation. Details of the model selection procedures are provided in Supplementary Note 1.7. a, Model selection based on the best generalization. Validated negative log-likelihood (upper panel) achieves the minimum (dots) at different gradient-descent iterations on different data samples. Fitted potentials (lower panel) selected at the minimum of the validated negative log-likelihood are inconsistent across data samples, and many of them exhibit spurious features. b, Model selection based on feature consistency. KL divergence (upper panel) between potentials discovered from two data halves at each level of feature complexity. Models are selected at the feature complexity $${{\mathcal{M}}}^{* }$$ (dots) where KL divergence exceeds the threshold $${D}_{{\rm{KL}}}^{{\rm{thresh}}}$$ (dashed line). The selected potentials (lower panel) are consistent across data samples and with the ground-truth model. c, Same as b, but for models selected at a larger feature complexity $${\mathcal{M}}>{{\mathcal{M}}}^{* }$$ (dots). These models differ between the two data halves and are inconsistent across data samples. Only five simulations are shown for clarity.

Extended Data Fig. 5 Model selection using feature consistency method for complex potential shapes.

Simulations are shown for five independent data samples generated from the same ground-truth model. Each coloured line represents one simulation. The KL divergence (left column) between models discovered from two halves of each data sample at different levels of feature complexity. Models are selected at the feature complexity $${{\mathcal{M}}}^{* }$$ (coloured dots) where the KL divergence exceeds the threshold $${D}_{{\rm{KL}}}^{{\rm{thresh}}}$$ (dashed line). The selected models (right column) are consistent across data samples and with the ground truth when the data amount is sufficient (a,b). Underfitting can occur for low data amounts (c,d). a, The ground truth is a triple-well potential with the shape inferred for the example V4 channel (Fig. 6e in the main text). Each data sample contains roughly 30,000 spikes. $${D}_{{\rm{KL}}}^{{\rm{thresh}}}=0.01$$. b, The ground truth is a complex four-well potential. Each data sample contains roughly 400,000 spikes. $${D}_{{\rm{KL}}}^{{\rm{thresh}}}=0.01$$. All five KL curves exceed the KL threshold. The sharp rise of the KL divergence is not yet apparent for this number of gradient-descent iterations. c, The same ground-truth potential as in b. Each sample of synthetic data contains roughly 200,000 spikes. $${D}_{{\rm{KL}}}^{{\rm{thresh}}}=0.01$$. Some of the selected models are underfitted. d, The same data as in c but with a higher KL threshold $${D}_{{\rm{KL}}}^{{\rm{thresh}}}=0.02$$. Increasing $${D}_{{\rm{KL}}}^{{\rm{thresh}}}$$ reduces underfitting of these complex dynamics, resulting in more correct outcomes, but it also increases the probability of overfitting for simple ground-truth dynamics.
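The effect of the threshold on the selected complexity can be sketched with a toy KL curve. The rule implemented below takes the caption literally and returns the first feature complexity at which the KL divergence exceeds the threshold; whether the authors select this point or the one just before it is specified in Supplementary Note 1.7, so treat this rule, the function name and the numbers in the example curve as assumptions.

```python
def select_complexity(complexities, kl_values, threshold):
    """Return the first complexity whose KL divergence exceeds the threshold.

    Returns None when the curve never crosses the threshold.
    """
    for complexity, kl in zip(complexities, kl_values):
        if kl > threshold:
            return complexity
    return None

# Hypothetical KL curve that rises with feature complexity.
complexities = [1, 2, 3, 4, 5, 6, 7, 8]
kl_curve = [0.001, 0.002, 0.004, 0.008, 0.012, 0.018, 0.030, 0.080]

selected_low = select_complexity(complexities, kl_curve, threshold=0.01)
selected_high = select_complexity(complexities, kl_curve, threshold=0.02)
never_crossed = select_complexity(complexities, kl_curve, threshold=1.0)
```

On this toy curve, raising the threshold from 0.01 to 0.02 moves the selection from complexity 5 to complexity 7, which mirrors the trade-off described in panel d: a higher threshold admits more complex models, reducing underfitting at the cost of a greater risk of overfitting.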

Extended Data Fig. 6 Discovering models of neural dynamics from neurophysiological recordings.

Potentials at $${{\mathcal{M}}}^{* }$$ identified independently from two halves of the data, $${{\mathscr{D}}}_{1}$$ (red) and $${{\mathscr{D}}}_{2}$$ (blue), for the other 15 channels in the recording (potentials for the example channel 6 are shown in Fig. 6e in the main text). Error bars shade the area between two potentials discovered from the halves of $${{\mathscr{D}}}_{1}$$ (red) and $${{\mathscr{D}}}_{2}$$ (blue). Details of this analysis are provided in the main text.

Extended Data Fig. 7 Underfitting due to small data amount for the potential shape inferred from V4 data.

We generated ten synthetic data samples from the triple-well potential with the same shape as inferred for the example V4 channel (Fig. 6e in the main text). Each sample of synthetic data contained roughly 5,000 spikes (which roughly corresponds to the amount of real data for that channel). We divided each data sample into two halves, performed gradient-descent optimization on each half, and selected the potential at $${{\mathcal{M}}}^{* }$$ using our feature-consistency method with $${D}_{{\rm{KL}}}^{{\rm{thresh}}}=0.01$$ (same as for V4 data). The selected potentials (blue) are shown along with the ground truth (grey) in separate panels for each simulation. Error bars shade the area between potentials discovered from two data halves. Owing to the small amount of data, underfitting is observed in roughly half of the simulations. The pattern of underfitted potential shapes across simulations resembles the pattern of potential shapes observed across V4 channels (cf. Extended Data Fig. 6).

Extended Data Fig. 8 Discovering dynamics from spikes generated by microscopic simulations of a recurrent excitatory-inhibitory network model.

a, Schematic of a recurrent balanced network model with two excitatory (E1 and E2) and one inhibitory (I) populations (ref. 42). Each excitatory population consists of 400 neurons, and the inhibitory population consists of 200 neurons. The spiking neurons are simulated with the generalized integrate-and-fire (GIF) model. Details of the model architecture and simulations are described in ref. 42. The model exhibits winner-take-all dynamics, whereby two excitatory populations compete via the common inhibitory population. The model’s activity alternates between two attractor states, where either E1 or E2 has higher firing rates (lower spike raster). Spike trains of ten example neurons from E1 (red) and E2 (blue) are shown. This microscopic recurrent network exhibits metastable transitions between two attractors, which is the ground truth known from the theoretical analysis of the model (ref. 42). We analysed a data set generated by microscopic simulations of this model, which contained 100 s of spiking activity of 20 neurons from the population E1. We divided the data set into two halves, performed gradient-descent optimization on each half, and selected the potential at $${{\mathcal{M}}}^{* }$$ using our feature-consistency method (same procedures as in all other simulations). b, KL divergence between models discovered from two data halves at different levels of feature complexity. The model is selected at the feature complexity $${{\mathcal{M}}}^{* }$$ (dot) where KL divergence exceeds the threshold $${D}_{{\rm{KL}}}^{{\rm{thresh}}}=0.01$$ (dashed line). c, The selected potential exhibits two attractor wells, in agreement with the ground-truth dynamics for this network model.

Supplementary information

Supplementary Notes 1.1–1.7, 2.1–2.3, Tables 1–3 and Figs. 1–5.

Genkin, M., Engel, T.A. Moving beyond generalization to accurate interpretation of flexible models. Nat Mach Intell 2, 674–683 (2020). https://doi.org/10.1038/s42256-020-00242-6
