Abstract
Generalization of time series prediction remains an important open issue in machine learning; earlier methods have either large generalization errors or local minima. Here, we develop an analytically solvable, unsupervised learning scheme that extracts the most informative components for predicting future inputs, which we call predictive principal component analysis (PredPCA). Our scheme can effectively remove unpredictable noise and minimize test prediction error through convex optimization. Mathematical analyses demonstrate that, provided with sufficient training samples and sufficiently high-dimensional observations, PredPCA can asymptotically identify hidden states, system parameters and dimensionalities of canonical nonlinear generative processes, with a global convergence guarantee. We demonstrate the performance of PredPCA using sequential visual inputs comprising handwritten digits, rotating three-dimensional objects and natural scenes. It reliably estimates distinct hidden states and predicts future outcomes of previously unseen test input data, based exclusively on noisy observations. The simple architecture and low computational cost of PredPCA are highly desirable for neuromorphic hardware.
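The following is a minimal sketch, in NumPy, of the scheme described in the abstract: past observations are used to form a least-squares (convex) prediction of the next input, and PCA is then applied to the predicted inputs to extract their most informative components. This is our own illustration, not the authors' MATLAB implementation; the function and variable names (predpca_sketch, k_p, reg) are assumptions for exposition.

```python
# Minimal sketch of predictive PCA: least-squares prediction of future inputs
# from past observations, followed by PCA on the predicted inputs.
import numpy as np

def predpca_sketch(s, k_p=10, n_components=10, reg=1e-6):
    """s: (dim, T) observation sequence (assumed zero-mean). Returns encoder and state estimates."""
    dim, T = s.shape
    # Basis of k_p past observations, phi_t = [s_t; s_{t-1}; ...; s_{t-k_p+1}]
    phi = np.vstack([np.roll(s, k, axis=1) for k in range(k_p)])[:, k_p:T-1]
    target = s[:, k_p+1:T]                      # one-step-future inputs s_{t+1}
    # Least-squares (convex) estimate of the optimal linear prediction of s_{t+1} from phi_t
    Q = (target @ phi.T) @ np.linalg.inv(phi @ phi.T + reg * np.eye(dim * k_p))
    s_pred = Q @ phi                            # predicted inputs s_{t+1|t}
    # PCA on the predicted inputs: keep the components carrying predictable variance
    eigval, eigvec = np.linalg.eigh(s_pred @ s_pred.T / s_pred.shape[1])
    W = eigvec[:, ::-1][:, :n_components].T     # top principal directions as the encoder
    return W, W @ s_pred                        # encoder and hidden-state estimates
```

Because both steps, ridge-regularized least squares and an eigendecomposition, have closed-form solutions, the sketch involves no iterative non-convex optimization, consistent with the global convergence property stated above.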
Data availability
Image data used in this work are available in the MNIST dataset33 (http://yann.lecun.com/exdb/mnist/index.html, for Fig. 2), the ALOI dataset36 (http://aloi.science.uva.nl, for Fig. 3), and the BDD100K dataset37 (https://bdd-data.berkeley.edu, for Fig. 4). Figures 2–4 were generated by applying our scripts (see 'Code availability' below) to these image data.
Code availability
MATLAB scripts used in this work are available at https://github.com/takuyaisomura/predpca or https://doi.org/10.5281/zenodo.4362249. The scripts are covered under the GNU General Public License v3.0.
Change history
05 May 2021
A Correction to this paper has been published: https://doi.org/10.1038/s42256-021-00352-9
References
Rao, R. P. & Ballard, D. H. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nat. Neurosci. 2, 79–87 (1999).
Rao, R. P. & Sejnowski, T. J. Predictive sequence learning in recurrent neocortical circuits. Adv. Neural Info. Process. Syst. 12, 164–170 (2000).
Friston, K. A theory of cortical responses. Phil. Trans. R. Soc. Lond. B 360, 815–836 (2005).
Srivastava, N., Mansimov, E. & Salakhudinov, R. Unsupervised learning of video representations using LSTMs. In Int. Conf. Machine Learning 843−852 (ML Research Press, 2015).
Mathieu, M., Couprie, C. & LeCun, Y. Deep multi-scale video prediction beyond mean square error. Preprint at https://arxiv.org/abs/1511.05440 (2015).
Lotter, W., Kreiman, G. & Cox, D. Deep predictive coding networks for video prediction and unsupervised learning. Preprint at https://arxiv.org/abs/1605.08104 (2016).
Hurvich, C. M. & Tsai, C. L. Regression and time series model selection in small samples. Biometrika 76, 297–307 (1989).
Hurvich, C. M. & Tsai, C. L. A corrected Akaike information criterion for vector autoregressive model selection. J. Time Series Anal. 14, 271–279 (1993).
Cunningham, J. P. & Ghahramani, Z. Linear dimensionality reduction: survey, insights, and generalizations. J. Mach. Learn. Res. 16, 2859–2900 (2015).
Hinton, G. E. & Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science 313, 504–507 (2006).
Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Preprint at https://arxiv.org/abs/1312.6114 (2013).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
Wehmeyer, C. & Noé, F. Time-lagged autoencoders: deep learning of slow collective variables for molecular kinetics. J. Chem. Phys. 148, 241703 (2018).
Pérez-Hernández, G., Paul, F., Giorgino, T., De Fabritiis, G. & Noé, F. Identification of slow molecular order parameters for Markov model construction. J. Chem. Phys. 139, 015102 (2013).
Klus, S. et al. Data-driven model reduction and transfer operator approximation. J. Nonlinear Sci. 28, 985–1010 (2018).
Kalman, R. E. A new approach to linear filtering and prediction problems. J. Basic Eng. 82, 35–45 (1960).
Julier, S. J. & Uhlmann, J. K. New extension of the Kalman filter to nonlinear systems. In Signal Processing, Sensor Fusion, And Target Recognition VI Vol. 3068, 182−193 (International Society for Optics and Photonics, 1997).
Friston, K. J., Trujillo-Barreto, N. & Daunizeau, J. DEM: A variational treatment of dynamic systems. NeuroImage 41, 849–885 (2008).
Akaike, H. A new look at the statistical model identification. IEEE Trans. Automat. Contr. 19, 716–723 (1974).
Murata, N., Yoshizawa, S. & Amari, S. I. Network information criterion—determining the number of hidden units for an artificial neural network model. IEEE Trans. Neural Netw. 5, 865–872 (1994).
Schwarz, G. Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978).
Vapnik, V. Principles of risk minimization for learning theory. Adv. Neural Info. Process. Syst. 4, 831–838 (1992).
Arlot, S. & Celisse, A. A survey of cross-validation procedures for model selection. Stat. Surv. 4, 40–79 (2010).
Comon, P. & Jutten, C. (eds) Handbook of Blind Source Separation: Independent Component Analysis And Applications (Academic Press, 2010).
Ljung, L. System Identification: Theory for the User 2nd edn (Prentice-Hall, 1999).
Schoukens, J. & Ljung, L. Nonlinear system identification: a user-oriented roadmap. Preprint at https://arxiv.org/abs/1902.00683 (2019).
Akaike, H. Prediction and entropy. In Selected Papers of Hirotugu Akaike 387−410 (Springer, 1985).
Oja, E. Neural networks, principal components, and subspaces. Int. J. Neural Syst. 1, 61–68 (1989).
Xu, L. Least mean square error reconstruction principle for self-organizing neural-nets. Neural Netw. 6, 627–648 (1993).
Chen, T., Hua, Y. & Yan, W. Y. Global convergence of Oja’s subspace algorithm for principal component extraction. IEEE Trans. Neural Netw. 9, 58–67 (1998).
Bell, A. J. & Sejnowski, T. J. An information-maximization approach to blind separation and blind deconvolution. Neural Comput. 7, 1129–1159 (1995).
Amari, S. I., Cichocki, A. & Yang, H. H. A new learning algorithm for blind signal separation. Adv. Neural Info. Process. Syst. 8, 757–763 (1996).
LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998).
Isomura, T. & Toyoizumi, T. On the achievability of blind source separation for high-dimensional nonlinear source mixtures. Preprint at https://arxiv.org/abs/1808.00668 (2018).
Dimigen, O. Optimizing the ICA-based removal of ocular EEG artifacts from free viewing experiments. Neuroimage 207, 116117 (2020).
Geusebroek, J. M., Burghouts, G. J. & Smeulders, A. W. The Amsterdam library of object images. Int. J. Comput. Vis. 61, 103–112 (2005).
Yu, F. et al. BDD100K: a diverse driving video database with scalable annotation tooling. Preprint at https://arxiv.org/abs/1805.04687 (2018).
Schrödinger, E. What Is Life? The Physical Aspect of the Living Cell and Mind (Cambridge Univ. Press, 1944).
Palmer, S. E., Marre, O., Berry, M. J. & Bialek, W. Predictive information in a sensory population. Proc. Natl Acad. Sci. USA 112, 6908–6913 (2015).
Friston, K., Kilner, J. & Harrison, L. A free energy principle for the brain. J. Physiol. Paris 100, 70–87 (2006).
Oymak, S., Fabian, Z., Li, M. & Soltanolkotabi, M. Generalization guarantees for neural networks via harnessing the low-rank structure of the Jacobian. Preprint at https://arxiv.org/abs/1906.05392 (2019).
Suzuki, T. et al. Spectral-pruning: compressing deep neural network via spectral analysis. Preprint at https://arxiv.org/abs/1808.08558 (2018).
Neftci, E. Data and power efficient intelligence with neuromorphic learning machines. iScience 5, 52–68 (2018).
Fouda, M., Neftci, E., Eltawil, A. M. & Kurdahi, F. Independent component analysis using RRAMs. IEEE Trans. Nanotech. 18, 611–615 (2018).
Lee, T. W., Girolami, M., Bell, A. J. & Sejnowski, T. J. A unifying information-theoretic framework for independent component analysis. Comput. Math. Appl. 39, 1–21 (2000).
Isomura, T. & Toyoizumi, T. A local learning rule for independent component analysis. Sci. Rep. 6, 28073 (2016).
Isomura, T. & Toyoizumi, T. Error-gated Hebbian rule: a local learning rule for principal and independent component analysis. Sci. Rep. 8, 1835 (2018).
Dayan, P., Hinton, G. E., Neal, R. M. & Zemel, R. S. The Helmholtz machine. Neural Comput. 7, 889–904 (1995).
Frémaux, N. & Gerstner, W. Neuromodulated spike-timing-dependent plasticity, and theory of three-factor learning rules. Front. Neural Circuits 9, 85 (2016).
Kuśmierz, Ł., Isomura, T. & Toyoizumi, T. Learning with three factors: modulating Hebbian plasticity with errors. Curr. Opin. Neurobiol. 46, 170–177 (2017).
Zhu, B., Jiao, J. & Tse, D. Deconstructing generative adversarial networks. IEEE Trans. Inf. Theory 66, 7155–7179 (2020).
Lusch, B., Kutz, J. N. & Brunton, S. L. Deep learning for universal linear embeddings of nonlinear dynamics. Nat. Commun. 9, 4950 (2018).
Isomura, T. & Toyoizumi, T. Multi-context blind source separation by error-gated Hebbian rule. Sci. Rep. 9, 7127 (2019).
Hornik, K., Stinchcombe, M. & White, H. Multilayer feedforward networks are universal approximators. Neural Netw. 2, 359–366 (1989).
Barron, A. R. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inf. Theory 39, 930–945 (1993).
Rahimi, A. & Recht, B. Uniform approximation of functions with random bases. In Proc. 46th Ann. Allerton Conf. on Communication, Control, and Computing 555−561 (2008).
Rahimi, A. & Recht, B. Weighted sums of random kitchen sinks: replacing minimization with randomization in learning. Adv. Neural Info. Process. Syst. 21, 1313–1320 (2008).
Hyvärinen, A. & Pajunen, P. Nonlinear independent component analysis: existence and uniqueness results. Neural Netw. 12, 429–439 (1999).
Jutten, C. & Karhunen, J. Advances in blind source separation (BSS) and independent component analysis (ICA) for nonlinear mixtures. Int. J. Neural Syst. 14, 267–292 (2004).
Koopman, B. O. Hamiltonian systems and transformation in Hilbert space. Proc. Natl Acad. Sci. USA 17, 315–318 (1931).
Ljung, L. Asymptotic behavior of the extended Kalman filter as a parameter estimator for linear systems. IEEE Trans. Automat. Contr. 24, 36–50 (1979).
Acknowledgements
We are grateful to S.-I. Amari for discussions. This work was supported by RIKEN Center for Brain Science (T.I. and T.T.), Brain/MINDS from AMED under grant number JP20dm020700 (T.T.), and JSPS KAKENHI under grant number JP18H05432 (T.T.). The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Author information
Authors and Affiliations
Contributions
T.I. conceived and designed PredPCA, performed the mathematical analyses and simulations, and wrote the manuscript. T.T. supervised T.I. from the early stage of this work, confirmed the rigour of the mathematical analyses and wrote the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Machine Intelligence thanks the anonymous reviewers for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Supplementary results of PredPCA with handwritten digit images.
a, The transition mapping estimated using PredPCA, \({\mathbf{B}} \in {\Bbb R}^{10 \times 10}\), accurately matches the true transition mapping \(B \in {\Bbb R}^{10 \times 10}\) that generates the ascending order sequence. Elements of \(x_{t+1|t}\) are permuted and sign-flipped for visualization purposes. b, The same holds for the nonlinear dynamics. The estimated mapping from \(x_{t|t-1} \otimes x_{t-1|t-2}\) to \(x_{t+1|t}\), \({\tilde{\mathbf B}} \in {\Bbb R}^{10 \times 100}\), was obtained using the outcomes of PredPCA and accurately matches the true mapping of the Fibonacci sequence, \(\tilde B \in {\Bbb R}^{10 \times 100}\). Here, \(\otimes\) denotes the Kronecker product. These results indicate that PredPCA can identify the transition rules underlying both linear and nonlinear dynamics without observing the true hidden states \(x_t\). c, Prediction error in the absence of random replacement and/or monochrome inversion of digit images, as a counterpart of Fig. 2d. PredPCA's outcomes are retained with or without these distortions, and the relevant encoders comprise up to 10 dimensions owing to the construction of the input data, highlighting the robustness of PredPCA to various types of large noise. In particular, in the presence of monochrome inversion, irrespective of random replacement of digits, \(N_u = 10\) provides the global minimum of both equations (6) and (7). Conversely, in the absence of monochrome inversion, \(N_u = 9\) provides their global minimum because, in this case, the 10-dimensional hidden state representation becomes redundant: without monochrome inversion, the true hidden states take only 10 distinct positions in the 10-dimensional coordinate system, which can be fully expressed in 9 dimensions. Remarkably, PredPCA could detect this difference. Note that monochrome inversion corresponds to the first principal component (PC1) of PredPCA, because whether the next image is a ‘black digit on white background’ or a ‘white digit on black background’ is the most predictable feature, as monochrome inversion rarely occurs. Thus, the relatively large prediction error in the absence of monochrome inversion is due to the lack of PC1. d, PredPCA's performance improves as the number of past observations used for prediction (\(K_p\)) increases, up to a finite optimum. Left panel: error in categorizing digits, which converges to near zero as \(K_p\) increases (refer to Fig. 2b). Middle panel: parameter estimation error (refer to Fig. 2c). Right panel: test prediction error (refer to Fig. 2d). The blue line is the optimal test prediction error computed via supervised learning. The red line indicates the theoretical value computed using equation (7), for which \(K_p = 10\) (green line) gives the minimum, matching the empirical observations (black circles). These observations imply that predicting single-time-step future outcomes (\(s_{t+1}\)) using multi-time-step past observations (\(\phi_t\)) is key to reducing these errors. Note that an extension of PredPCA to multi-time-step prediction while retaining its accuracy is provided in Methods section ‘Derivation of PredPCA'. c and d were obtained with 20 different realizations of digit sequences.
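The nonlinear transition mapping estimation in panel b can be illustrated with the following hedged sketch: the PredPCA state estimates \(x_{t+1|t}\) are regressed onto the Kronecker product \(x_{t|t-1} \otimes x_{t-1|t-2}\) by least squares. The variable names and ridge term are our own assumptions, not taken from the paper.

```python
# Hedged sketch: regress x_{t+1|t} onto the Kronecker product x_{t|t-1} (x) x_{t-1|t-2}.
import numpy as np

def estimate_kron_transition(x_pred, reg=1e-6):
    """x_pred: (n, T) sequence of PredPCA state estimates. Returns an (n, n*n) mapping."""
    n, T = x_pred.shape
    # Column-wise Kronecker product of the current and one-step-delayed state estimates
    kron = np.einsum('it,jt->ijt', x_pred[:, 1:T-1], x_pred[:, 0:T-2]).reshape(n * n, T - 2)
    target = x_pred[:, 2:T]                     # next-step state estimates x_{t+1|t}
    # Least-squares estimate of the transition mapping (10 x 100 when n = 10)
    return (target @ kron.T) @ np.linalg.inv(kron @ kron.T + reg * np.eye(n * n))
```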
Extended Data Fig. 2 Comparison with related methods.
The errors in estimating system parameters (left and middle panels, as a counterpart of Fig. 2c) and in predicting one-step future inputs in the test ascending sequence (right panels, refer to Fig. 2d) are shown. a, Performance of linear TAE. Although it estimates matrix A with high accuracy, it fails to estimate the other parameters, because linear TAE (the same as PredPCA with \(\phi_t = s_t\)) does not effectively filter out observation noise. Moreover, linear TAE yields a larger test prediction error even relative to PredPCA with \(\phi_t = s_t\), owing to the difference in their cost functions: PredPCA (even with \(\phi_t = s_t\)) preferentially extracts the components most important for predicting high-variance signals, and thereby attains the global minimum of the squared error in predicting the non-normalized target signal (under the constraint \(\phi_t = s_t\)), whereas linear TAE minimizes the error in predicting a normalized target signal (see Methods section ‘Filtering out observation noise' for more details). For reference, the blue and red lines in the right panel represent the optimal test prediction error computed via supervised learning and that of PredPCA with \(\phi_t = s_t\), respectively. The results were obtained with 20 different realizations of digit sequences. b, Performance of SSM based on the Kalman filter. SSM also tends to fail at system identification, depending on initial conditions and training history, which leads to a relatively large prediction error. In the left panel, lines and shaded areas indicate the median and the 25th to 75th percentile range, respectively. The results were obtained with 100 different realizations of digit sequences.
Extended Data Fig. 3 Accuracy of long-term predictions.
PredPCA and SSM can both yield generative models that predict an arbitrary future. However, SSM can fail to identify system parameters depending on initial conditions and training history, leading to the failure of long-term predictions even when a winner-takes-all operation is provided. a, Outcomes of PredPCA enable long-term prediction via greedy prediction based on iterative winner-takes-all operations, regardless of the training dataset. Each row indicates a prediction based on a different realization of the training sequence. A transition mapping from \(x_{t|t-1}\) to \(x_{t+1|t}\) is assumed. b, The long-term prediction is successful even when a transition mapping from \(x_{t|t-1} \otimes x_{t-1|t-2}\) to \(x_{t+1|t}\) is assumed, indicating the minimal influence of the assumed model structure (that is, prior knowledge). c, PredPCA can also predict Fibonacci sequences in the long term, regardless of the training dataset. d, Model selection to determine the optimal number of step backs. Here, the standard AIC was used for model selection. We considered four models based on four types of polynomial basis functions: \(x_{t|t-1}\), \(x_{t|t-1} \otimes x_{t-1|t-2}\), \(x_{t|t-1} \otimes x_{t-1|t-2} \otimes x_{t-2|t-3}\), and \(x_{t|t-1} \otimes x_{t-1|t-2} \otimes x_{t-2|t-3} \otimes x_{t-3|t-4}\). The state in the next time period, \(x_{t+1|t}\), was predicted based on these four types of bases, followed by a winner-takes-all operation to conduct the greedy prediction, and their AICs were compared. Left panel: to explain the ascending order sequences, a mapping from \(x_{t|t-1}\) to \(x_{t+1|t}\) was the best among the four models. Right panel: to explain the Fibonacci sequences, a mapping from \(x_{t|t-1} \otimes x_{t-1|t-2}\) to \(x_{t+1|t}\) was significantly better than the other three models. Here, a pairwise t test was applied based on 10 different realizations. Error bars indicate the standard deviation. e, In contrast, SSM based on the Kalman filter tends to fail at iterative prediction, depending on the initial conditions of state and parameter values and on training history, even though it uses the winner-takes-all operation, owing to its relatively large state and parameter estimation errors. System identification using SSM is severely harmed by the nonlinear interaction between state and parameter estimation, which yields local minima or spurious solutions (Extended Data Fig. 2b); consequently, SSM exhibits an approximately 6% categorization error (Fig. 2b). These inaccuracies undermine iterative predictions using SSM even when states are de-noised at each step using a winner-takes-all operation.
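The greedy prediction procedure referred to above can be summarized with the following sketch (our illustration): at each step, the estimated transition mapping produces a one-step prediction, which is de-noised by a winner-takes-all operation before being fed back. The one-hot state assumption follows the digit-sequence setting of Fig. 2; the function name is hypothetical.

```python
# Greedy long-term prediction with an iterative winner-takes-all step.
import numpy as np

def greedy_predict(B, x0, n_steps=20):
    """B: (n, n) estimated transition mapping; x0: (n,) initial (one-hot) state."""
    x = x0.astype(float).copy()
    winners = []
    for _ in range(n_steps):
        x = B @ x                               # one-step linear prediction
        winner = int(np.argmax(x))              # winner-takes-all de-noising
        x = np.zeros_like(x)
        x[winner] = 1.0                         # feed back the de-noised state
        winners.append(winner)
    return winners                              # predicted sequence of categories
```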
Extended Data Fig. 4 Instability of features extracted by TAE and SSM.
This figure is a counterpart of Fig. 3b. TAE and SSM do not guarantee the global convergence of their outcomes; as a result, their extracted features are sensitive to the initial conditions, the order of supplying mini-batches, and the level of observation noise. The features extracted in six trials are shown; the last three are outcomes trained with large observation noise. The same training dataset was used for all trials. However, because the initial parameter values for TAE and SSM and the order of supplying mini-batches were varied, different features were extracted. The difference in the observation noise level also altered the outcomes. These results imply that the features extracted by TAE and SSM are unreliable, and further highlight the benefit of the global convergence guarantee of PredPCA.
Extended Data Fig. 5 Feature extraction of driving car movies.
a, PC1–PC3 of the categorical features (that is, \({\bar{\mathbf x}}_t\)), representing the brightness and the vertical and lateral symmetries of scenes. b, PC1 of the dynamical features (that is, \(\Delta x_{t+3|t}\)), representing lateral motion. Although a and b were obtained using PredPCA with grouping of the data, the extracted features accurately matched those obtained using PredPCA without the six sub-groups (Fig. 4b,c). This implies that PredPCA offers reliable identification of relevant features even when data grouping is used. c, The 100 major categorical features (\({\bar{\mathbf x}}_t\)), representing different categories of scenes. d, The 100 major dynamical features (\(\Delta x_{t+3|t}\)), responding to motions at different positions on the screen. The white areas indicate the receptive field of each encoder. c and d were obtained using PredPCA and ICA without the six sub-groups. Similar to Fig. 3b, these images visualize linear mappings from each independent component to the observation.
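As a rough sketch of how panels c and d could be produced, the following code applies ICA to the PredPCA-encoded states and maps each independent component back to the observation space for display. scikit-learn's FastICA and the pseudoinverse back-projection through the encoder are our substitutions and assumptions, not the paper's implementation.

```python
# Visualize linear mappings from independent components to the observation space.
import numpy as np
from sklearn.decomposition import FastICA

def ica_feature_maps(x_states, W_encoder, n_components=100):
    """x_states: (n, T) encoded states; W_encoder: (n, dim) PredPCA encoder (n >= n_components)."""
    ica = FastICA(n_components=n_components, max_iter=1000, random_state=0)
    ica.fit(x_states.T)                          # independent components of the encoded states
    # Linear mapping from each independent component to the observation:
    # ICA mixing columns back-projected through the encoder's pseudoinverse
    back_projection = np.linalg.pinv(W_encoder)  # (dim, n)
    return back_projection @ ica.mixing_         # (dim, n_components), one image per component
```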
Supplementary information
Supplementary Information
Supplementary Video Legends 1–3, Figs. 1 and 2, discussion, Methods 1–6, and refs.
Rights and permissions
About this article
Cite this article
Isomura, T., Toyoizumi, T. Dimensionality reduction to maximize prediction generalization capability. Nat Mach Intell 3, 434–446 (2021). https://doi.org/10.1038/s42256-021-00306-1