Dimensionality reduction to maximize prediction generalization capability

A Publisher Correction to this article was published on 05 May 2021

This article has been updated


Generalization of time series prediction remains an important open issue in machine learning; earlier methods have either large generalization errors or local minima. Here, we develop an analytically solvable, unsupervised learning scheme that extracts the most informative components for predicting future inputs, which we call predictive principal component analysis (PredPCA). Our scheme can effectively remove unpredictable noise and minimize test prediction error through convex optimization. Mathematical analyses demonstrate that, provided with sufficient training samples and sufficiently high-dimensional observations, PredPCA can asymptotically identify hidden states, system parameters and dimensionalities of canonical nonlinear generative processes, with a global convergence guarantee. We demonstrate the performance of PredPCA using sequential visual inputs comprising handwritten digits, rotating three-dimensional objects and natural scenes. It reliably estimates distinct hidden states and predicts future outcomes of previously unseen test input data, based exclusively on noisy observations. The simple architecture and low computational cost of PredPCA are highly desirable for neuromorphic hardware.
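The abstract describes a scheme that is analytically solvable through convex optimization. One plausible reading, sketched below in NumPy, is a two-step procedure: a closed-form least-squares prediction of the next input from a window of past inputs, followed by PCA of the predicted inputs. This is an illustrative sketch under stated assumptions, not the authors' MATLAB implementation. The function name `predpca_sketch`, the parameters `k_p`, `n_u` and `ridge`, and the toy rotating-state data are all assumptions introduced here.

```python
import numpy as np

def predpca_sketch(s, k_p=2, n_u=2, ridge=1e-6):
    """Two-step sketch: least-squares prediction, then PCA of the predictions.

    s is a (T, N) array of zero-mean observations. All names are hypothetical.
    """
    T, N = s.shape
    # Basis phi_t stacks the k_p most recent observations s_t, ..., s_{t-k_p+1}.
    phi = np.hstack([s[k_p - 1 - k : T - 1 - k] for k in range(k_p)])
    target = s[k_p:]  # s_{t+1} aligned with phi_t
    # Step 1: closed-form (ridge-stabilized) least-squares map from phi_t to s_{t+1}.
    gram = phi.T @ phi + ridge * np.eye(phi.shape[1])
    Q = np.linalg.solve(gram, phi.T @ target).T  # shape (N, N * k_p)
    pred = phi @ Q.T  # predicted next inputs
    # Step 2: PCA of the predicted inputs; unpredictable noise contributes little here.
    cov = pred.T @ pred / len(pred)
    evals, evecs = np.linalg.eigh(cov)
    top = np.argsort(evals)[::-1][:n_u]
    W = evecs[:, top].T  # rows are the leading predictive components
    return W, Q, evals[top]

# Toy data (assumption): a 2-D rotating hidden state seen through a random
# 20-D linear mixing with unit-variance observation noise.
rng = np.random.default_rng(0)
T, N, theta = 5000, 20, 0.3
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
x = np.zeros((T, 2))
x[0] = [1.0, 0.0]
for t in range(1, T):
    x[t] = rot @ x[t - 1]
A = rng.standard_normal((N, 2))
s = x @ A.T + rng.standard_normal((T, N))
s -= s.mean(axis=0)

W, Q, evals = predpca_sketch(s, k_p=2, n_u=2)
# Fraction of the mixing matrix's energy that lies in the extracted subspace.
captured = np.linalg.norm(W @ A) ** 2 / np.linalg.norm(A) ** 2
print(f"fraction of mixing-matrix energy captured: {captured:.3f}")
```

Both steps have closed-form solutions (a linear solve and an eigendecomposition), so no iterative training or initialization is involved in this sketch.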

Fig. 1: Five different prediction model structures.
Fig. 2: PredPCA of handwritten digit sequences.
Fig. 3: PredPCA-based de-noising, hidden state extraction and subsequent input prediction of videos of rotating 3D objects.
Fig. 4: PredPCA of natural scene videos.

Data availability

Image data used in this work are available in the MNIST dataset (ref. 33; for Fig. 2), the ALOI dataset (ref. 36; for Fig. 3) and the BDD100K dataset (ref. 37; for Fig. 4). Figures 2–4 are generated by applying our scripts (see below) to these image data.

Code availability

MATLAB scripts used in this work are available online. The scripts are covered under the GNU General Public License v3.0.

References

  1. Rao, R. P. & Ballard, D. H. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nat. Neurosci. 2, 79–87 (1999).

  2. Rao, R. P. & Sejnowski, T. J. Predictive sequence learning in recurrent neocortical circuits. Adv. Neural Info. Proc. Syst. 12, 164–170 (2000).

  3. Friston, K. A theory of cortical responses. Phil. Trans. R. Soc. Lond. B 360, 815–836 (2005).

  4. Srivastava, N., Mansimov, E. & Salakhudinov, R. Unsupervised learning of video representations using LSTMs. In Int. Conf. Machine Learning 843−852 (ML Research Press, 2015).

  5. Mathieu, M., Couprie, C. & LeCun, Y. Deep multi-scale video prediction beyond mean square error. Preprint at (2015).

  6. Lotter, W., Kreiman, G. & Cox, D. Deep predictive coding networks for video prediction and unsupervised learning. Preprint at (2016).

  7. Hurvich, C. M. & Tsai, C. L. Regression and time series model selection in small samples. Biometrika 76, 297–307 (1989).

  8. Hurvich, C. M. & Tsai, C. L. A corrected Akaike information criterion for vector autoregressive model selection. J. Time Series Anal. 14, 271–279 (1993).

  9. Cunningham, J. P. & Ghahramani, Z. Linear dimensionality reduction: survey, insights, and generalizations. J. Mach. Learn. Res. 16, 2859–2900 (2015).

  10. Hinton, G. E. & Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science 313, 504–507 (2006).

  11. Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Preprint at (2013).

  12. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).

  13. Wehmeyer, C. & Noé, F. Time-lagged autoencoders: deep learning of slow collective variables for molecular kinetics. J. Chem. Phys. 148, 241703 (2018).

  14. Pérez-Hernández, G., Paul, F., Giorgino, T., De Fabritiis, G. & Noé, F. Identification of slow molecular order parameters for Markov model construction. J. Chem. Phys. 139, 015102 (2013).

  15. Klus, S. et al. Data-driven model reduction and transfer operator approximation. J. Nonlinear Sci. 28, 985–1010 (2018).

  16. Kalman, R. E. A new approach to linear filtering and prediction problems. J. Basic Eng. 82, 35–45 (1960).

  17. Julier, S. J. & Uhlmann, J. K. New extension of the Kalman filter to nonlinear systems. In Signal Processing, Sensor Fusion, And Target Recognition VI Vol. 3068, 182−193 (International Society for Optics and Photonics, 1997).

  18. Friston, K. J., Trujillo-Barreto, N. & Daunizeau, J. DEM: A variational treatment of dynamic systems. NeuroImage 41, 849–885 (2008).

  19. Akaike, H. A new look at the statistical model identification. IEEE Trans. Automat. Contr. 19, 716–723 (1974).

  20. Murata, N., Yoshizawa, S. & Amari, S. I. Network information criterion—determining the number of hidden units for an artificial neural network model. IEEE Trans. Neural Netw. 5, 865–872 (1994).

  21. Schwarz, G. Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978).

  22. Vapnik, V. Principles of risk minimization for learning theory. Adv. Neural Info. Proc. Syst. 4, 831–838 (1992).

  23. Arlot, S. & Celisse, A. A survey of cross-validation procedures for model selection. Stat. Surv. 4, 40–79 (2010).

  24. Comon, P. & Jutten, C. (eds) Handbook of Blind Source Separation: Independent Component Analysis And Applications (Academic Press, 2010).

  25. Ljung, L. System Identification: Theory for the User 2nd edn (Prentice-Hall, 1999).

  26. Schoukens, J. & Ljung, L. Nonlinear system identification: a user-oriented roadmap. Preprint at (2019).

  27. Akaike, H. Prediction and entropy. In Selected Papers of Hirotugu Akaike 387−410 (Springer, 1985).

  28. Oja, E. Neural networks, principal components, and subspaces. Int. J. Neural Syst. 1, 61–68 (1989).

  29. Xu, L. Least mean square error reconstruction principle for self-organizing neural-nets. Neural Netw. 6, 627–648 (1993).

  30. Chen, T., Hua, Y. & Yan, W. Y. Global convergence of Oja’s subspace algorithm for principal component extraction. IEEE Trans. Neural Netw. 9, 58–67 (1998).

  31. Bell, A. J. & Sejnowski, T. J. An information-maximization approach to blind separation and blind deconvolution. Neural Comput. 7, 1129–1159 (1995).

  32. Amari, S. I., Cichocki, A. & Yang, H. H. A new learning algorithm for blind signal separation. Adv. Neural Info. Proc. Syst. 8, 757–763 (1996).

  33. LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998).

  34. Isomura, T. & Toyoizumi, T. On the achievability of blind source separation for high-dimensional nonlinear source mixtures. Preprint at (2018).

  35. Dimigen, O. Optimizing the ICA-based removal of ocular EEG artifacts from free viewing experiments. Neuroimage 207, 116117 (2020).

  36. Geusebroek, J. M., Burghouts, G. J. & Smeulders, A. W. The Amsterdam library of object images. Int. J. Comput. Vis. 61, 103–112 (2005).

  37. Yu, F. et al. BDD100K: a diverse driving video database with scalable annotation tooling. Preprint at (2018).

  38. Schrödinger, E. What Is Life? The Physical Aspect of the Living Cell and Mind (Cambridge Univ. Press, 1944).

  39. Palmer, S. E., Marre, O., Berry, M. J. & Bialek, W. Predictive information in a sensory population. Proc. Natl Acad. Sci. USA 112, 6908–6913 (2015).

  40. Friston, K., Kilner, J. & Harrison, L. A free energy principle for the brain. J. Physiol. Paris 100, 70–87 (2006).

  41. Oymak, S., Fabian, Z., Li, M. & Soltanolkotabi, M. Generalization guarantees for neural networks via harnessing the low-rank structure of the Jacobian. Preprint at (2019).

  42. Suzuki, T. et al. Spectral-pruning: compressing deep neural network via spectral analysis. Preprint at (2018).

  43. Neftci, E. Data and power efficient intelligence with neuromorphic learning machines. iScience 5, 52–68 (2018).

  44. Fouda, M., Neftci, E., Eltawil, A. M. & Kurdahi, F. Independent component analysis using RRAMs. IEEE Trans. Nanotech. 18, 611–615 (2018).

  45. Lee, T. W., Girolami, M., Bell, A. J. & Sejnowski, T. J. A unifying information-theoretic framework for independent component analysis. Comput. Math. Appl. 39, 1–21 (2000).

  46. Isomura, T. & Toyoizumi, T. A local learning rule for independent component analysis. Sci. Rep. 6, 28073 (2016).

  47. Isomura, T. & Toyoizumi, T. Error-gated Hebbian rule: a local learning rule for principal and independent component analysis. Sci. Rep. 8, 1835 (2018).

  48. Dayan, P., Hinton, G. E., Neal, R. M. & Zemel, R. S. The Helmholtz machine. Neural Comput. 7, 889–904 (1995).

  49. Frémaux, N. & Gerstner, W. Neuromodulated spike-timing-dependent plasticity, and theory of three-factor learning rules. Front. Neural Circuits 9, 85 (2016).

  50. Kuśmierz, Ł., Isomura, T. & Toyoizumi, T. Learning with three factors: modulating Hebbian plasticity with errors. Curr. Opin. Neurobiol. 46, 170–177 (2017).

  51. Zhu, B., Jiao, J. & Tse, D. Deconstructing generative adversarial networks. IEEE Trans. Inf. Theory 66, 7155–7179 (2020).

  52. Lusch, B., Kutz, J. N. & Brunton, S. L. Deep learning for universal linear embeddings of nonlinear dynamics. Nat. Commun. 9, 4950 (2018).

  53. Isomura, T. & Toyoizumi, T. Multi-context blind source separation by error-gated Hebbian rule. Sci. Rep. 9, 7127 (2019).

  54. Hornik, K., Stinchcombe, M. & White, H. Multilayer feedforward networks are universal approximators. Neural Netw. 2, 359–366 (1989).

  55. Barron, A. R. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Info. Theory 39, 930–945 (1993).

  56. Rahimi, A. & Recht, B. Uniform approximation of functions with random bases. In Proc. 46th Ann. Allerton Conf. on Communication, Control, and Computing 555−561 (2008).

  57. Rahimi, A. & Recht, B. Weighted sums of random kitchen sinks: replacing minimization with randomization in learning. Adv. Neural Info. Process. Syst. 21, 1313–1320 (2008).

  58. Hyvärinen, A. & Pajunen, P. Nonlinear independent component analysis: existence and uniqueness results. Neural Netw. 12, 429–439 (1999).

  59. Jutten, C. & Karhunen, J. Advances in blind source separation (BSS) and independent component analysis (ICA) for nonlinear mixtures. Int. J. Neural Syst. 14, 267–292 (2004).

  60. Koopman, B. O. Hamiltonian systems and transformation in Hilbert space. Proc. Natl Acad. Sci. USA 17, 315–318 (1931).

  61. Ljung, L. Asymptotic behavior of the extended Kalman filter as a parameter estimator for linear systems. IEEE Trans. Automat. Contr. 24, 36–50 (1979).

Acknowledgements

We are grateful to S.-I. Amari for discussions. This work was supported by RIKEN Center for Brain Science (T.I. and T.T.), Brain/MINDS from AMED under grant number JP20dm020700 (T.T.), and JSPS KAKENHI under grant number JP18H05432 (T.T.). The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Author information

Contributions

T.I. conceived and designed PredPCA, performed the mathematical analyses and simulations, and wrote the manuscript. T.T. supervised T.I. from the early stage of this work, confirmed the rigour of the mathematical analyses and wrote the manuscript.

Corresponding authors

Correspondence to Takuya Isomura or Taro Toyoizumi.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Machine Intelligence thanks the anonymous reviewers for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Supplementary results of PredPCA with handwritten digit images.

a, Transition mapping estimated using PredPCA, \(\mathbf{B} \in \mathbb{R}^{10 \times 10}\), accurately matches the true transition mapping \(B \in \mathbb{R}^{10 \times 10}\) that generates the ascending-order sequence. Elements of \(\mathbf{x}_{t+1|t}\) are permuted and sign-flipped for visualization purposes. b, This is also the case for the nonlinear dynamics. The estimated mapping from \(\mathbf{x}_{t|t-1} \otimes \mathbf{x}_{t-1|t-2}\) to \(\mathbf{x}_{t+1|t}\), \(\tilde{\mathbf{B}} \in \mathbb{R}^{10 \times 100}\), was obtained using the outcomes of PredPCA and accurately matches the true mapping of the Fibonacci sequence, \(\tilde{B} \in \mathbb{R}^{10 \times 100}\). Here, \(\otimes\) indicates the Kronecker product. These results indicate that PredPCA identifies the transition rules underlying the linear and nonlinear dynamics without observing the true hidden states \(x_t\). c, Prediction error in the absence of random replacement and/or monochrome inversion of digit images, as a counterpart of Fig. 2d. PredPCA’s outcomes are retained with or without those distortions, and relevant encoders comprise up to 10 dimensions owing to the construction of the input data, highlighting the robustness of PredPCA to various types of large noise. In particular, in the presence of monochrome inversion, irrespective of random replacement of digits, \(N_u = 10\) provides the global minimum of both equations (6) and (7). Conversely, in the absence of monochrome inversion, \(N_u = 9\) provides their global minimum because, in this case, the 10-dimensional hidden state representation becomes redundant: without monochrome inversion, the true hidden states take only 10 different positions in the 10-dimensional coordinate system, which can be fully expressed in a 9-dimensional coordinate system. Remarkably, PredPCA could detect this difference. Note that monochrome inversion corresponds to the first principal component (PC1) of PredPCA. This is because whether the next image is a ‘black digit on white background’ or a ‘white digit on black background’ is the most predictable feature, as monochrome inversion rarely occurs. Thus, the relatively large prediction error in the absence of monochrome inversion is due to the lack of PC1. d, PredPCA increases its performance as the number of past observations used for prediction (\(K_p\)) increases until reaching its finite optimum. Left panel: error in categorizing digits, which converges to near zero as \(K_p\) increases (refer to Fig. 2b). Middle panel: parameter estimation error (refer to Fig. 2c). Right panel: test prediction error (refer to Fig. 2d). The blue line is the optimal test prediction error computed via supervised learning. The red line indicates the theoretical value computed using equation (7), wherein \(K_p = 10\) (green line) gives its minimum, which matches empirical observations (black circles). These observations imply that predicting single-time-step future outcomes (\(s_{t+1}\)) using multi-time-step past observations (\(\phi_t\)) is key to reducing those errors. Note that an extension of PredPCA for multi-time-step prediction while retaining its accuracy is provided in Methods section ‘Derivation of PredPCA’. c and d are obtained with 20 different realizations of digit sequences.
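The redundancy argument in panel c can be checked numerically. The sketch below assumes, purely for illustration, that the ten hidden states are one-hot vectors in ten dimensions. After mean subtraction (as in PCA), these ten points span only nine dimensions, so nine principal components reconstruct them exactly. The variable names are hypothetical.

```python
import numpy as np

# Assumption for illustration: the ten hidden states are one-hot vectors.
states = np.eye(10)
centered = states - states.mean(axis=0)  # PCA operates on mean-subtracted data

# Ten points always fit in a 9-dimensional affine subspace,
# so the centered data matrix has rank 9.
rank = np.linalg.matrix_rank(centered, tol=1e-10)

# Keep nine principal components and reconstruct the centered states.
U, sing, Vt = np.linalg.svd(centered, full_matrices=False)
recon9 = (U[:, :9] * sing[:9]) @ Vt[:9]
max_err = np.abs(recon9 - centered).max()

print("rank of centered states:", rank)
print("max reconstruction error with 9 components:", max_err)
```

The tenth direction carries no variance in this toy setting, which is consistent with the caption's statement that a 9-dimensional coordinate suffices when monochrome inversion is absent.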

Extended Data Fig. 2 Comparison with related methods.

The errors in estimating system parameters (left and middle panels, as a counterpart of Fig. 2c) and in predicting one-step future inputs in the test ascending sequence (right panels, refer to Fig. 2d) are shown. a, Performance of linear TAE. Although it estimates matrix A with high accuracy, it fails to estimate other parameters, because linear TAE (as is the case for PredPCA with \(\phi_t = s_t\)) does not effectively filter out observation noise. Moreover, linear TAE yields a larger test prediction error even relative to PredPCA with \(\phi_t = s_t\), owing to the difference in their cost functions. This is because PredPCA (even with \(\phi_t = s_t\)) preferentially extracts the components most important for predicting high-variance signals, and thereby provides the global minimum of the squared error in predicting the non-normalized target signal (under the constraint of \(\phi_t = s_t\)), while linear TAE minimizes the squared error in predicting a normalized target signal (see Methods section ‘Filtering out observation noise’ for more details). For reference, the blue and red lines in the right panel represent the optimal test prediction error computed via supervised learning and that of PredPCA with \(\phi_t = s_t\), respectively. The results are obtained with 20 different realizations of digit sequences. b, Performance of SSM based on the Kalman filter. SSM also tends to fail at system identification depending on initial conditions and training history, which leads to a relatively larger prediction error. In the left panel, lines and shaded areas indicate the median and the 25th to 75th percentile area, respectively. The results are obtained with 100 different realizations of digit sequences.

Extended Data Fig. 3 Accuracy of long-term predictions.

PredPCA and SSM can both yield generative models to predict an arbitrary future. However, SSM can fail to identify system parameters depending on initial conditions and training history, leading to the failure of long-term predictions even if provided with a winner-takes-all operation. a, Outcomes of PredPCA offer long-term prediction via greedy prediction based on iterative winner-takes-all operations, regardless of the training dataset. Each row indicates a prediction based on a different realization of the training sequence. A transition mapping from \(\mathbf{x}_{t|t-1}\) to \(\mathbf{x}_{t+1|t}\) is assumed. b, The long-term prediction is successful even if a transition mapping from \(\mathbf{x}_{t|t-1} \otimes \mathbf{x}_{t-1|t-2}\) to \(\mathbf{x}_{t+1|t}\) is assumed, indicating the minimal influence of the assumed model structure (that is, prior knowledge). c, PredPCA can also predict Fibonacci sequences in the long term, regardless of the training dataset. d, Model selection to determine the optimal number of steps back. Here, the standard AIC was used for model selection. We considered the following four models based on four types of polynomial basis functions: \(\mathbf{x}_{t|t-1}\), \(\mathbf{x}_{t|t-1} \otimes \mathbf{x}_{t-1|t-2}\), \(\mathbf{x}_{t|t-1} \otimes \mathbf{x}_{t-1|t-2} \otimes \mathbf{x}_{t-2|t-3}\) and \(\mathbf{x}_{t|t-1} \otimes \mathbf{x}_{t-1|t-2} \otimes \mathbf{x}_{t-2|t-3} \otimes \mathbf{x}_{t-3|t-4}\). The state in the next time period, \(\mathbf{x}_{t+1|t}\), was predicted based on these four types of bases, followed by a winner-takes-all operation to conduct the greedy prediction, and their AICs were compared. Left panel: to explain the ascending-order sequences, a mapping from \(\mathbf{x}_{t|t-1}\) to \(\mathbf{x}_{t+1|t}\) was the best among these four models. Right panel: to explain the Fibonacci sequences, a mapping from \(\mathbf{x}_{t|t-1} \otimes \mathbf{x}_{t-1|t-2}\) to \(\mathbf{x}_{t+1|t}\) was significantly better than the other three models. Here, the pairwise t-test was applied based on 10 different realizations. Error bars indicate the standard deviation. e, In contrast, SSM based on the Kalman filter tends to fail at iterative prediction depending on the initial conditions of state and parameter values and on training history, even though it uses the winner-takes-all operation, owing to its relatively large state and parameter estimation errors. System identification using SSM is severely harmed by nonlinear interaction between state and parameter estimations, which yields local minima or spurious solutions (Extended Data Fig. 2b); consequently, SSM exhibits an approximately 6% categorization error (Fig. 2b). These inaccuracies undermine iterative predictions using SSM even when states are de-noised in each step using a winner-takes-all operation.
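The greedy long-term prediction described in this caption can be illustrated with a toy example. The sketch below builds a 10 × 100 transition mapping acting on the Kronecker product of the current and previous one-hot states, then iterates it with a winner-takes-all (argmax) step. It assumes that the digit-valued Fibonacci rule is taken modulo 10, which is a guess at the construction rather than something stated here. The names `B_tilde`, `one_hot` and `greedy_rollout` and the noise level are hypothetical.

```python
import numpy as np

N = 10  # number of digit classes

def one_hot(d):
    v = np.zeros(N)
    v[d] = 1.0
    return v

# Assumed toy rule: next digit = (current digit + previous digit) mod 10.
# Column a * N + b matches the nonzero entry of kron(one_hot(a), one_hot(b)).
B_tilde = np.zeros((N, N * N))
for a in range(N):      # current digit
    for b in range(N):  # previous digit
        B_tilde[(a + b) % N, a * N + b] = 1.0

def greedy_rollout(B, d_curr, d_prev, steps, noise_std=0.0, rng=None):
    """Iterate the mapping, de-noising each step with a winner-takes-all choice."""
    if rng is None:
        rng = np.random.default_rng(0)
    x_curr, x_prev = one_hot(d_curr), one_hot(d_prev)
    digits = []
    for _ in range(steps):
        basis = np.kron(x_curr, x_prev)  # 100-dimensional polynomial basis
        scores = B @ basis + noise_std * rng.standard_normal(N)
        winner = int(np.argmax(scores))  # winner-takes-all
        digits.append(winner)
        x_prev, x_curr = x_curr, one_hot(winner)
    return digits

# Ground truth for comparison, starting from the digits (1, 1).
truth, a, b = [], 1, 1
for _ in range(30):
    c = (a + b) % N
    truth.append(c)
    b, a = a, c

predicted = greedy_rollout(B_tilde, 1, 1, 30, noise_std=0.1,
                           rng=np.random.default_rng(0))
print(predicted[:5], predicted == truth)
```

Because each step snaps the noisy score vector back to a one-hot state, small per-step errors do not accumulate over the rollout in this toy setting. This is the role the caption attributes to the winner-takes-all operation.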

Extended Data Fig. 4 Instability of features extracted by TAE and SSM.

This figure is a counterpart of Fig. 3b. TAE and SSM do not guarantee the global convergence of their outcomes, and as a result their extracted features are sensitive to the initial conditions, order of supplying mini batches, and level of observation noise. The extracted features in six trials are shown; the last three are outcomes trained with a large noise. The same training dataset was used for all trials. However, as initial parameter values for TAE and SSM and order of supplying mini batches were varied, different features were extracted. The difference in the observation noise level also altered their outcomes. These results imply the unreliability of features extracted by TAE and SSM, and further highlight the benefit of the global convergence guarantee of PredPCA.

Extended Data Fig. 5 Feature extraction of driving car movies.

a, PC1–PC3 of the categorical features (that is, \(\bar{\mathbf{x}}_t\)) representing the brightness and vertical and lateral symmetries of scenes. b, PC1 of the dynamical features (that is, \(\Delta\mathbf{x}_{t+3|t}\)) representing the lateral motion. Although a and b were obtained using PredPCA with grouping of the data, these extracted features accurately matched those obtained using PredPCA without the six sub-groups (Fig. 4b,c). This implies that PredPCA offers reliable identification of relevant features, even when using the data grouping. c, 100 major categorical features (\(\bar{\mathbf{x}}_t\)) representing different categories of scenes. d, 100 major dynamical features (\(\Delta\mathbf{x}_{t+3|t}\)) responding to motions at different positions of the screen. The white areas indicate the receptive field of each encoder. c and d were obtained using PredPCA and ICA without the six sub-groups. Similar to Fig. 3b, these images visualize linear mappings from each independent component to the observation.

Supplementary information

Supplementary Information

Supplementary Video Legends 1–3, Figs. 1 and 2, discussion, Methods 1–6, and refs.

Reporting Summary

Supplementary Video 1

Supplementary Video 2

Supplementary Video 3

About this article

Cite this article

Isomura, T., Toyoizumi, T. Dimensionality reduction to maximize prediction generalization capability. Nat Mach Intell 3, 434–446 (2021).
