Moving beyond generalization to accurate interpretation of flexible models

Genkin, Mikhail; Engel, Tatiana A.

doi:10.1038/s42256-020-00242-6

Published: 26 October 2020


Nature Machine Intelligence volume 2, pages 674–683 (2020)

A preprint version of the article is available at bioRxiv.

Abstract

Machine learning optimizes flexible models to predict data. In scientific applications, there is a rising interest in interpreting these flexible models to derive hypotheses from data. However, it is unknown whether good data prediction guarantees the accurate interpretation of flexible models. Here, we test this connection using a flexible, yet intrinsically interpretable framework for modelling neural dynamics. We find that many models discovered during optimization predict data equally well, yet they fail to match the correct hypothesis. We develop an alternative approach that identifies models with correct interpretation by comparing model features across data samples to separate true features from noise. We illustrate our findings using recordings of spiking activity from the visual cortex of monkeys performing a fixation task. Our results reveal that good predictions cannot substitute for accurate interpretation of flexible models and offer a principled approach to identify models with correct interpretation.

**Fig. 1: Deriving theories from data.**

**Fig. 2: A flexible and intrinsically interpretable framework for modelling neural dynamics.**

**Fig. 3: Gradient-descent optimization produces a series of models with different interpretations.**

**Fig. 4: Flexible models optimized for data prediction cannot be interpreted reliably.**

**Fig. 5: Identifying models with correct interpretation.**

**Fig. 6: Discovering interpretable models of neural dynamics from neurophysiological recordings.**

Data availability

All synthetic data reported in this paper can be reproduced using the source code. Electrophysiological data were gathered by N.A. Steinmetz and T. Moore as described in ref. ³⁵, and are archived at the Stanford Neuroscience Institute server at Stanford University. The synthetic and neurophysiological data are available from the corresponding author upon request.

Code availability

The source code to reproduce the results of this study is freely available on GitHub (https://github.com/engellab/neuralflow, https://doi.org/10.5281/zenodo.4010952).

References

1. Neyman, J. & Pearson, E. S. On the problem of the most efficient tests of statistical hypotheses. Philos. Trans. R. Soc. Lond. A 231, 289–337 (1933).
2. Elsayed, G. F. & Cunningham, J. P. Structure in neural population recordings: an expected byproduct of simpler phenomena? Nat. Neurosci. 20, 1310–1318 (2017).
3. Szucs, D. & Ioannidis, J. P. A. When null hypothesis significance testing is unsuitable for research: a reassessment. Front. Hum. Neurosci. 11, 390 (2017).
4. Chandrasekaran, C. et al. Brittleness in model selection analysis of single neuron firing rates. Preprint at https://doi.org/10.1101/430710 (2018).
5. Burnham, K. P. & Anderson, D. R. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach (Springer Science & Business Media, 2007).
6. Bollimunta, A., Totten, D. & Ditterich, J. Neural dynamics of choice: single-trial analysis of decision-related activity in parietal cortex. J. Neurosci. 32, 12684–12701 (2012).
7. Churchland, A. K. et al. Variance as a signature of neural computations during decision making. Neuron 69, 818–831 (2011).
8. Latimer, K. W., Yates, J. L., Meister, M. L., Huk, A. C. & Pillow, J. W. Single-trial spike trains in parietal cortex reveal discrete steps during decision-making. Science 349, 184–187 (2015).
9. Zoltowski, D. M., Latimer, K. W., Yates, J. L., Huk, A. C. & Pillow, J. W. Discrete stepping and nonlinear ramping dynamics underlie spiking responses of LIP neurons during decision-making. Neuron 102, 1249–1258 (2019).
10. Durstewitz, D., Koppe, G. & Toutounji, H. Computational models as statistical tools. Curr. Opin. Behav. Sci. 11, 93–99 (2016).
11. Linderman, S. W. & Gershman, S. J. Using computational theory to constrain statistical models of neural data. Curr. Opin. Neurobiol. 46, 14–24 (2017).
12. Pandarinath, C. et al. Inferring single-trial neural population dynamics using sequential auto-encoders. Nat. Methods 15, 805–815 (2018).
13. Shrikumar, A., Greenside, P., Shcherbina, A. & Kundaje, A. Not just a black box: learning important features through propagating activation differences. Preprint at https://arxiv.org/abs/1605.01713 (2016).
14. Yamins, D. L. K. et al. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proc. Natl Acad. Sci. USA 111, 8619–8624 (2014).
15. Pospisil, D. A. & Pasupathy, A. ‘Artiphysiology’ reveals V4-like shape tuning in a deep network trained for image classification. eLife 7, e38242 (2018).
16. Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).
17. Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015).
18. Brunton, B. W., Botvinick, M. M. & Brody, C. D. Rats and humans can optimally accumulate evidence for decision-making. Science 340, 95–98 (2013).
19. Sussillo, D. & Barak, O. Opening the black box: low-dimensional dynamics in high-dimensional recurrent neural networks. Neural Comput. 25, 626–649 (2013).
20. Belkin, M., Hsu, D., Ma, S. & Mandal, S. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proc. Natl Acad. Sci. USA 116, 15849–15854 (2019).
21. Haas, K. R., Yang, H. & Chu, J.-W. Expectation-maximization of the potential of mean force and diffusion coefficient in Langevin dynamics from single molecule FRET data photon by photon. J. Phys. Chem. B 117, 15591–15605 (2013).
22. Duncker, L., Bohner, G., Boussard, J. & Sahani, M. Learning interpretable continuous-time models of latent stochastic dynamical systems. Preprint at https://arxiv.org/abs/1902.04420 (2019).
23. Amarasingham, A., Geman, S. & Harrison, M. T. Ambiguity and nonidentifiability in the statistical analysis of neural codes. Proc. Natl Acad. Sci. USA 112, 6455–6460 (2015).
24. Bishop, C. M. Pattern Recognition and Machine Learning (Springer, 2006).
25. Yan, H. et al. Nonequilibrium landscape theory of neural networks. Proc. Natl Acad. Sci. USA 110, E4185–E4194 (2013).
26. Bottou, L., Curtis, F. E. & Nocedal, J. Optimization methods for large-scale machine learning. SIAM Rev. 60, 223–311 (2018).
27. Hastie, T., Tibshirani, R., Friedman, J. & Franklin, J. The Elements of Statistical Learning: Data Mining, Inference and Prediction (Springer, 2005).
28. Zhang, C., Bengio, S., Hardt, M., Recht, B. & Vinyals, O. Understanding deep learning requires rethinking generalization. Preprint at https://arxiv.org/abs/1611.03530 (2016).
29. Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M. & Tang, P. T. P. On large-batch training for deep learning: generalization gap and sharp minima. Preprint at https://arxiv.org/abs/1609.04836 (2016).
30. Ilyas, A. et al. Adversarial examples are not bugs, they are features. In Advances in Neural Information Processing Systems 32 125–136 (Curran Associates, 2019).
31. Cawley, G. C. & Talbot, N. L. C. On over-fitting in model selection and subsequent selection bias in performance evaluation. J. Mach. Learn. Res. 11, 2079–2107 (2010).
32. Prechelt, L. in Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science Vol. 7700 (eds Montavon, G., Orr, G. B. & Müller, K. R.) 53–67 (Springer, 2012).
33. Haas, K. R., Yang, H. & Chu, J.-W. Trajectory entropy of continuous stochastic processes at equilibrium. J. Phys. Chem. Lett. 5, 999–1003 (2014).
34. Kalimeris, D. et al. SGD on neural networks learns functions of increasing complexity. In Advances in Neural Information Processing Systems 32 3496–3506 (Curran Associates, 2019).
35. Engel, T. A. et al. Selective modulation of cortical state during spatial attention. Science 354, 1140–1144 (2016).
36. Schmidt, M. & Lipson, H. Distilling free-form natural laws from experimental data. Science 324, 81–85 (2009).
37. Daniels, B. C. & Nemenman, I. Automated adaptive inference of phenomenological dynamical models. Nat. Commun. 6, 8133 (2015).
38. Brunton, S. L., Proctor, J. L. & Kutz, J. N. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proc. Natl Acad. Sci. USA 113, 3932–3937 (2016).
39. Boninsegna, L., Nüske, F. & Clementi, C. Sparse learning of stochastic dynamical equations. J. Chem. Phys. 148, 241723 (2018).
40. Rudy, S. H., Kutz, J. N. & Brunton, S. L. Deep learning of dynamics and signal-noise decomposition with time-stepping constraints. J. Comput. Phys. 396, 483–506 (2019).
41. Zhao, Y. & Park, I. M. Variational joint filtering. Preprint at https://arxiv.org/abs/1707.09049v4 (2017).
42. Schwalger, T., Deger, M. & Gerstner, W. Towards a theory of cortical columns: from spiking neurons to interacting neural populations of finite size. PLoS Comput. Biol. 13, e1005507 (2017).
43. Hennequin, G., Ahmadian, Y., Rubin, D. B., Lengyel, M. & Miller, K. D. The dynamical regime of sensory cortex: stable dynamics around a single stimulus-tuned attractor account for patterns of noise variability. Neuron 98, 846–860 (2018).
44. Holcman, D. & Tsodyks, M. The emergence of up and down states in cortical networks. PLoS Comput. Biol. 2, e23 (2006).
45. Jercog, D. et al. UP–DOWN cortical dynamics reflect state transitions in a bistable network. eLife 6, e22425 (2017).
46. Levenstein, D., Buzsáki, G. & Rinzel, J. NREM sleep in the rodent neocortex and hippocampus reflects excitable dynamics. Nat. Commun. 10, 2478 (2019).
47. Recanatesi, S., Pereira, U., Murakami, M., Mainen, Z. F. & Mazzucato, L. Metastable attractors explain the variable timing of stable behavioral action sequences. Preprint at https://doi.org/10.1101/2020.01.24.919217 (2020).
48. Cunningham, J. P. & Yu, B. M. Dimensionality reduction for large-scale neural recordings. Nat. Neurosci. 17, 1500–1509 (2014).
49. Williamson, R. C., Doiron, B., Smith, M. A. & Yu, B. M. Bridging large-scale neuronal recordings and large-scale network models using dimensionality reduction. Curr. Opin. Neurobiol. 55, 40–47 (2019).
50. Murray, J. D. et al. Stable population coding for working memory coexists with heterogeneous neural dynamics in prefrontal cortex. Proc. Natl Acad. Sci. USA 114, 394–399 (2017).
51. Elsayed, G. F., Lara, A. H., Kaufman, M. T., Churchland, M. M. & Cunningham, J. P. Reorganization between preparatory and movement population responses in motor cortex. Nat. Commun. 7, 13239 (2016).
52. Risken, H. The Fokker–Planck Equation (Springer, 1996).


Acknowledgements

This work was supported by NIH grant no. R01 EB026949 and the Swartz Foundation. We thank G. Angeris for help in the early stages of the project, K. Haas for useful discussions and P. Koo, J. Jansen, A. Siepel, J. Kinney and T. Janowitz for their thoughtful comments on the manuscript. We thank N.A. Steinmetz and T. Moore for sharing the electrophysiological data, which are presented in ref. ³⁵ and are archived at the Stanford Neuroscience Institute server at Stanford University.

Author information

Authors and Affiliations

Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
Mikhail Genkin & Tatiana A. Engel

Authors

Mikhail Genkin
Tatiana A. Engel

Contributions

M.G. and T.A.E. designed the study, performed research, discussed the findings and wrote the paper.

Corresponding author

Correspondence to Tatiana A. Engel.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Overfitting does not occur with infinite training data.

a, A series of models produced by gradient descent when the same finite set of data is used throughout the optimization. Substantial overfitting is observed. b, Same as a, but optimization is performed with spikes resampled on each gradient-descent iteration from a fixed latent trajectory (the same latent trajectory as in a). Overfitting is still observed. c, Histogram of the latent trajectory (normalized as a probability density) and the discovered equilibrium probability density (at iteration 100,000) from the simulation in b. The overfitted model contains features that are not present in the latent trajectory. d, Same as c, but for a different latent trajectory. The spurious features in c and d are different. e, Same as c, but with both the latent trajectory and the spikes resampled on each gradient-descent iteration. No signs of overfitting are observed. f, After 100,000 iterations, the inferred potential for the simulation in e still perfectly matches the ground truth. These simulations confirm that overfitting emerges largely due to Poisson noise (compare a and b), although noise in the latent trajectory also contributes to overfitting (compare b and f). No overfitting occurs when both spikes and latent trajectory are resampled on each gradient-descent iteration. In all panels, the training data contain ~10,000 spikes generated from a double-well potential.
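The two resampling schemes in this caption can be illustrated with a minimal sketch. It is not the authors' code (that lives in the neuralflow repository): the potential shape `Phi(x) = x**4 - 2*x**2`, the Euler–Maruyama integrator, the exponential rate function and all parameter values are assumptions chosen only for illustration.

```python
import numpy as np

def double_well_force(x):
    """Force -dPhi/dx for an assumed double-well potential Phi(x) = x**4 - 2*x**2."""
    return -(4.0 * x**3 - 4.0 * x)

def simulate_latent(n_steps, dt, rng, diffusion=1.0, x0=0.0):
    """Euler-Maruyama integration of Langevin dynamics in the assumed potential."""
    x = np.empty(n_steps)
    x[0] = x0
    kicks = rng.normal(size=n_steps - 1) * np.sqrt(2.0 * diffusion * dt)
    for t in range(n_steps - 1):
        x[t + 1] = x[t] + diffusion * double_well_force(x[t]) * dt + kicks[t]
    return x

def sample_spike_counts(latent, dt, rng, base_rate=50.0):
    """Poisson spike counts per time bin; the rate function is a hypothetical choice."""
    rate = base_rate * np.exp(0.5 * latent)
    return rng.poisson(rate * dt)

rng = np.random.default_rng(0)
dt, n_steps = 1e-3, 20000

# Scheme of panel b: one fixed latent trajectory, fresh Poisson noise per draw.
latent = simulate_latent(n_steps, dt, rng)
spikes_a = sample_spike_counts(latent, dt, rng)
spikes_b = sample_spike_counts(latent, dt, rng)

# Scheme of panel e: a new latent trajectory and new spikes per draw.
latent_new = simulate_latent(n_steps, dt, rng)
spikes_c = sample_spike_counts(latent_new, dt, rng)
```

In the fixed-latent scheme only the Poisson noise changes between draws, so any structure in the sampled trajectory itself is shared by every draw. Resampling both removes that shared structure.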

Extended Data Fig. 2 A feedforward neural network exhibits generalization plateaux and overfitting in model selection.

We trained a shallow feedforward neural network (one hidden layer with 100 neurons; 301 parameters in total) on a regression problem. The noisy data set of 1,000 samples was generated from a linear model y = x + 0.2ξ, where \(\xi \sim \mathcal{N}(0,1)\). The network was trained using the MATLAB Deep Learning Toolbox, which runs stochastic gradient descent with early-stopping regularization. We initialized all parameters (weights and biases) from the normal distribution with zero mean and variance 0.01. We verified that our results did not depend on a particular realization of the initial parameters. a, Network architecture. b, Example data set (dots) along with the linear ground-truth model (line). c, Training and validated mean squared errors (MSEs) over the optimization epochs. The long plateau in the validated MSE indicates that many models generalize equally well. The arrow indicates the minimum of the validated MSE, that is, the model with the best generalization. d, The model with the best generalization (red line, corresponds to the arrow in c) contains spurious features not present in the linear ground-truth model (grey line). e, Same as b, but for a different data sample from the same ground-truth model. f, Same as c, but for the data sample in e. g, Same as d, but for the data sample in e. The model with the best generalization (green line, corresponds to the arrow in f) closely matches the ground-truth model (grey line).
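A rough NumPy analogue of this experiment is sketched below. It is not the MATLAB setup used in the paper. The data model, sample count, layer size and initialization variance come from the caption; the tanh activation, input range, 80/20 split, full-batch updates, learning rate and epoch count are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data from the caption: y = x + 0.2*xi with xi ~ N(0, 1), 1,000 samples.
# The input range [-1, 1] and the 80/20 train/validation split are assumptions.
n_samples = 1000
x = rng.uniform(-1.0, 1.0, size=(n_samples, 1))
y = x + 0.2 * rng.normal(size=(n_samples, 1))
x_train, y_train = x[:800], y[:800]
x_val, y_val = x[800:], y[800:]

# One hidden layer of 100 units; all parameters ~ N(0, variance 0.01), so std 0.1.
n_hidden = 100
params = {
    "W1": rng.normal(0.0, 0.1, size=(1, n_hidden)),
    "b1": rng.normal(0.0, 0.1, size=(n_hidden,)),
    "W2": rng.normal(0.0, 0.1, size=(n_hidden, 1)),
    "b2": rng.normal(0.0, 0.1, size=(1,)),
}
n_params = sum(p.size for p in params.values())  # 100 + 100 + 100 + 1

def forward(p, inputs):
    """Return hidden activations and network output (tanh is an assumed activation)."""
    hidden = np.tanh(inputs @ p["W1"] + p["b1"])
    return hidden, hidden @ p["W2"] + p["b2"]

def mse(p, inputs, targets):
    return float(np.mean((forward(p, inputs)[1] - targets) ** 2))

learning_rate = 0.05  # assumed value
train_curve, val_curve = [], []
for epoch in range(500):
    hidden, pred = forward(params, x_train)
    err = 2.0 * (pred - y_train) / len(x_train)          # dMSE/dprediction
    grad_hidden = (err @ params["W2"].T) * (1.0 - hidden**2)
    grads = {
        "W2": hidden.T @ err,
        "b2": err.sum(axis=0),
        "W1": x_train.T @ grad_hidden,
        "b1": grad_hidden.sum(axis=0),
    }
    for name in params:
        params[name] -= learning_rate * grads[name]
    train_curve.append(mse(params, x_train, y_train))
    val_curve.append(mse(params, x_val, y_val))

# "Best generalization" model selection: the epoch with the lowest validated MSE.
best_epoch = int(np.argmin(val_curve))
```

The parameter count reproduces the 301 quoted in the caption. Plotting `val_curve` and rerunning with different seeds is one way to look for a plateau like the one described in panel c, though this short run is not guaranteed to show the same behaviour.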

Extended Data Fig. 3 Results of model selection using the best generalization and feature consistency methods for large data amount.

Simulations are shown for ten independent data samples, each with 200,000 spikes generated from the same ground-truth model (triple-well potential). Each coloured line represents one simulation. Details of the model selection procedures are provided in Supplementary Note 1.7. a, Model selection based on the best generalization. The validated negative log-likelihood (upper panel) achieves its minimum (dots) at different gradient-descent iterations on different data samples. With this large amount of data, the fitted potentials (lower panel) selected at the minimum of the validated negative log-likelihood are consistent across data samples and with the ground-truth model. b, Model selection based on feature consistency. KL divergence (upper panel) between potentials discovered from two data halves at each level of feature complexity. Models are selected at the feature complexity \(\mathcal{M}^{*}\) (dots) where the KL divergence exceeds the threshold \(D_{\mathrm{KL}}^{\mathrm{thresh}}\) (dashed line). The selected potentials (lower panel) are consistent across data samples and with the ground-truth model. c, Same as b, but for models selected at a larger feature complexity \(\mathcal{M}>\mathcal{M}^{*}\) (dots). These models differ between the two data halves and are inconsistent across data samples. Only five simulations are shown for clarity.
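The feature-consistency criterion in panel b can be sketched as follows. This is only an illustration under stated assumptions: the exact procedure is in Supplementary Note 1.7, which is not reproduced here. The sketch assumes the equilibrium density is proportional to exp(-Φ(x)), uses synthetic toy potentials in place of fitted ones, and reports the first complexity level at which the KL divergence crosses the threshold. The helper names are hypothetical.

```python
import numpy as np

def equilibrium_density(potential, grid):
    """Normalized density proportional to exp(-Phi(x)) on a uniform grid (assumed relation)."""
    weights = np.exp(-(potential - potential.min()))
    return weights / (weights.sum() * (grid[1] - grid[0]))

def kl_divergence(p, q, grid):
    """Discretized KL divergence between two densities on the same uniform grid."""
    dx = grid[1] - grid[0]
    return float(np.sum(p * np.log(p / q)) * dx)

def first_crossing(kl_curve, threshold):
    """Index of the first complexity level where KL exceeds the threshold, or None."""
    above = np.flatnonzero(np.asarray(kl_curve) > threshold)
    return int(above[0]) if above.size else None

# Toy stand-ins for potentials fitted on two data halves: a shared double well
# plus a wiggle whose amplitude grows with the complexity level and whose
# phase differs between halves, mimicking features fitted to noise.
grid = np.linspace(-1.0, 1.0, 201)
shared = 4.0 * (grid**4 - grid**2)
amplitudes = np.linspace(0.0, 0.5, 11)

kl_curve = []
for a in amplitudes:
    phi_half1 = shared + a * np.sin(7.0 * np.pi * grid)
    phi_half2 = shared + a * np.cos(7.0 * np.pi * grid)
    p1 = equilibrium_density(phi_half1, grid)
    p2 = equilibrium_density(phi_half2, grid)
    kl_curve.append(kl_divergence(p1, p2, grid))

threshold = 0.01  # same value as the thresholds quoted in later captions
m_star_index = first_crossing(kl_curve, threshold)
```

Features shared by both halves leave the KL divergence near zero, while features fitted to noise in only one half push it up. Whether the selected model sits at the crossing level or just below it is a detail this sketch does not settle.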

Extended Data Fig. 4 Results of model selection using the best generalization and feature consistency methods for moderate data amount.

Simulations are shown for ten independent data samples, each with 20,000 spikes generated from the same ground-truth model (double-well potential). Each coloured line represents one simulation. Details of the model selection procedures are provided in Supplementary Note 1.7. a, Model selection based on the best generalization. The validated negative log-likelihood (upper panel) achieves its minimum (dots) at different gradient-descent iterations on different data samples. Fitted potentials (lower panel) selected at the minimum of the validated negative log-likelihood are inconsistent across data samples, and many of them exhibit spurious features. b, Model selection based on feature consistency. KL divergence (upper panel) between potentials discovered from two data halves at each level of feature complexity. Models are selected at the feature complexity \(\mathcal{M}^{*}\) (dots) where the KL divergence exceeds the threshold \(D_{\mathrm{KL}}^{\mathrm{thresh}}\) (dashed line). The selected potentials (lower panel) are consistent across data samples and with the ground-truth model. c, Same as b, but for models selected at a larger feature complexity \(\mathcal{M}>\mathcal{M}^{*}\) (dots). These models differ between the two data halves and are inconsistent across data samples. Only five simulations are shown for clarity.

Extended Data Fig. 5 Model selection using feature consistency method for complex potential shapes.

Simulations are shown for five independent data samples generated from the same ground-truth model. Each coloured line represents one simulation. The KL divergence (left column) between models discovered from two halves of each data sample at different levels of feature complexity. Models are selected at the feature complexity \(\mathcal{M}^{*}\) (coloured dots) where the KL divergence exceeds the threshold \(D_{\mathrm{KL}}^{\mathrm{thresh}}\) (dashed line). The selected models (right column) are consistent across data samples and with the ground truth when the data amount is sufficient (a,b). Underfitting can occur for low data amounts (c,d). a, The ground truth is a triple-well potential with the shape inferred for the example V4 channel (Fig. 6e in the main text). Each data sample contains roughly 30,000 spikes. \(D_{\mathrm{KL}}^{\mathrm{thresh}}=0.01\). b, The ground truth is a complex four-well potential. Each data sample contains roughly 400,000 spikes. \(D_{\mathrm{KL}}^{\mathrm{thresh}}=0.01\). All five KL curves exceed the KL threshold. The sharp rise of the KL divergence is not yet apparent for this number of gradient-descent iterations. c, The same ground-truth potential as in b. Each sample of synthetic data contains roughly 200,000 spikes. \(D_{\mathrm{KL}}^{\mathrm{thresh}}=0.01\). Some of the selected models are underfitted. d, The same data as in c but with a higher KL threshold \(D_{\mathrm{KL}}^{\mathrm{thresh}}=0.02\). Increasing \(D_{\mathrm{KL}}^{\mathrm{thresh}}\) reduces underfitting of these complex dynamics, resulting in more correct outcomes, but it also increases the probability of overfitting for simple ground-truth dynamics.

Extended Data Fig. 6 Discovering models of neural dynamics from neurophysiological recordings.

Potentials at \(\mathcal{M}^{*}\) identified independently from two halves of the data, \(\mathscr{D}_{1}\) (red) and \(\mathscr{D}_{2}\) (blue), for the other 15 channels in the recording (potentials for the example channel 6 are shown in Fig. 6e in the main text). Error bars shade the area between two potentials discovered from the halves of \(\mathscr{D}_{1}\) (red) and \(\mathscr{D}_{2}\) (blue). Details of this analysis are provided in the main text.

Extended Data Fig. 7 Underfitting due to small data amount for the potential shape inferred from V4 data.

We generated ten synthetic data samples from the triple-well potential with the same shape as inferred for the example V4 channel (Fig. 6e in the main text). Each sample of synthetic data contained roughly 5,000 spikes (which roughly corresponds to the amount of real data for that channel). We divided each data sample into two halves, performed gradient-descent optimization on each half, and selected the potential at \(\mathcal{M}^{*}\) using our feature-consistency method with \(D_{\mathrm{KL}}^{\mathrm{thresh}}=0.01\) (same as for V4 data). The selected potentials (blue) are shown along with the ground truth (grey) in separate panels for each simulation. Error bars shade the area between potentials discovered from two data halves. Owing to the small amount of data, underfitting is observed in roughly half of the simulations. The pattern of underfitted potential shapes across simulations resembles the pattern of potential shapes observed across V4 channels (cf. Extended Data Fig. 6).

Extended Data Fig. 8 Discovering dynamics from spikes generated by microscopic simulations of a recurrent excitatory-inhibitory network model.

a, Schematic of a recurrent balanced network model with two excitatory populations (E₁ and E₂) and one inhibitory population (I)⁴². Each excitatory population consists of 400 neurons, and the inhibitory population consists of 200 neurons. The spiking neurons are simulated with the generalized integrate-and-fire (GIF) model. Details of the model architecture and simulations are described in ref. ⁴². The model exhibits winner-take-all dynamics, whereby two excitatory populations compete via the common inhibitory population. The model’s activity alternates between two attractor states, where either E₁ or E₂ has higher firing rates (lower spike raster). Spike trains of ten example neurons from E₁ (red) and E₂ (blue) are shown. This microscopic recurrent network exhibits metastable transitions between two attractors, which is the ground truth known from the theoretical analysis of the model⁴². We analysed a data set generated by microscopic simulations of this model, which contained 100 s of spiking activity of 20 neurons from the population E₁. We divided the data set into two halves, performed gradient-descent optimization on each half, and selected the potential at \(\mathcal{M}^{*}\) using our feature-consistency method (same procedures as in all other simulations). b, KL divergence between models discovered from two data halves at different levels of feature complexity. The model is selected at the feature complexity \(\mathcal{M}^{*}\) (dot) where the KL divergence exceeds the threshold \(D_{\mathrm{KL}}^{\mathrm{thresh}}=0.01\) (dashed line). c, The selected potential exhibits two attractor wells, in agreement with the ground-truth dynamics for this network model.

Supplementary information

Supplementary Information

Supplementary Notes 1.1–1.7, 2.1–2.3, Tables 1–3 and Figs. 1–5.

Reporting Summary

About this article

Cite this article

Genkin, M., Engel, T.A. Moving beyond generalization to accurate interpretation of flexible models. Nat Mach Intell 2, 674–683 (2020). https://doi.org/10.1038/s42256-020-00242-6


Received: 20 November 2019
Accepted: 25 September 2020
Published: 26 October 2020
Issue Date: November 2020
DOI: https://doi.org/10.1038/s42256-020-00242-6

This article is cited by

A unifying perspective on neural manifolds and circuits for cognition
- Christopher Langdon
- Mikhail Genkin
- Tatiana A. Engel
Nature Reviews Neuroscience (2023)
A flexible Bayesian framework for unbiased estimation of timescales
- Roxana Zeraati
- Tatiana A. Engel
- Anna Levina
Nature Computational Science (2022)
Learning non-stationary Langevin dynamics from stochastic observations of latent trajectories
- Mikhail Genkin
- Owen Hughes
- Tatiana A. Engel
Nature Communications (2021)