
Complex sequential understanding through the awareness of spatial and temporal concepts

Abstract

Understanding sequential information is a fundamental task for artificial intelligence. Current neural networks attempt to learn spatial and temporal information as a whole, limiting their abilities to represent large-scale spatial representations over long-range sequences. Here, we introduce a new modelling strategy—‘semi-coupled structure’ (SCS)—which consists of deep neural networks that decouple the complex spatial and temporal concepts during learning. SCS can learn to implicitly separate input information into independent parts and process these parts separately. Experiments demonstrate that SCS can successfully sequentially annotate the outline of an object in images and perform video action recognition. As an example of sequence-to-sequence problems, SCS can predict future meteorological radar echo images based on observed images. Taken together, our results demonstrate that SCS has the capacity to improve the performance of long short-term memory (LSTM)-like models on large-scale sequential tasks.
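
As a purely illustrative sketch of the decoupling idea described in the abstract (not the authors' released SCS implementation; the class name, tensor shapes and update rules below are assumptions), one can keep a per-frame spatial pathway and a recurrently carried temporal state in separate tensors and let them interact only through a light coupling term:

```python
import torch
import torch.nn as nn


class SemiCoupledCell(nn.Module):
    """Toy cell with separate spatial and temporal pathways (illustrative only)."""

    def __init__(self, in_channels: int, hidden_channels: int):
        super().__init__()
        # Spatial pathway: per-frame features, no recurrence of its own.
        self.spatial = nn.Conv2d(in_channels, hidden_channels, kernel_size=3, padding=1)
        # Temporal pathway: gated update of a state carried across frames.
        self.temporal_gates = nn.Conv2d(hidden_channels * 2, hidden_channels * 2,
                                        kernel_size=3, padding=1)
        # Light coupling that lets spatial evidence nudge the temporal update.
        self.couple = nn.Conv2d(hidden_channels, hidden_channels, kernel_size=1)

    def forward(self, frame, h_prev):
        s = torch.relu(self.spatial(frame))          # spatial features of this frame
        update, cand = self.temporal_gates(torch.cat([s, h_prev], dim=1)).chunk(2, dim=1)
        z = torch.sigmoid(update)                    # how much temporal state to refresh
        h_new = (1 - z) * h_prev + z * torch.tanh(cand + self.couple(s))
        return s, h_new                              # spatial and temporal parts stay separate


# Usage: iterate over a short clip of 3-channel 32x32 frames.
cell = SemiCoupledCell(in_channels=3, hidden_channels=16)
h = torch.zeros(1, 16, 32, 32)
for _ in range(8):
    s, h = cell(torch.randn(1, 3, 32, 32), h)
```

The only point of this sketch is that the spatial and temporal representations are maintained separately and mixed as weakly as possible; the actual architecture is defined by the released code listed under Code availability.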


Fig. 1: Overview of SCS.
Fig. 2: Example experiments for SCS.
Fig. 3: Outline annotation and precipitation forecasting experiments.
Fig. 4: Expected number of back-propagation chains.
Fig. 5: Ablation study of the action recognition task.


Data availability

The datasets used to train and evaluate SCS are publicly available: UCF-101, https://www.crcv.ucf.edu/data/UCF101.php; HMDB-51, https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/; Kinetics, https://deepmind.com/research/open-source/kinetics; CityScapes, https://www.cityscapes-dataset.com/; comma.ai, https://archive.org/details/comma-dataset; LiVi-Set, http://www.dbehavior.net/. The REEC-2018 dataset is a private dataset and is available from the corresponding author upon reasonable request.

Code availability

A public version of the experiment codes is available at https://doi.org/10.5281/zenodo.3679134.


Acknowledgements

This work was supported in part by the National Key R&D Program of China (no. 2017YFA0700800), the Shanghai Qi Zhi Institute, the National Natural Science Foundation of China (grant 61772332), SHEITC (2018-RGZN-02046) and the Shanghai Science and Technology Committee.

Author information


Contributions

B.P. and C.L. conceived the idea. B.P., K.Z. and C.L. designed the experiments. B.P., K.Z., H.C., J.T. and M.Y. carried out programming, adjustment and data analysis. B.P. and C.L. wrote the manuscript. B.P., J.T., M.Y. and all other authors contributed to the results analysis and commented on the manuscript.

Corresponding author

Correspondence to Cewu Lu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1

a, Detailed stand-alone SCS structures for the action recognition task; there are residual connections between all layer blocks. b, Performance (IoU, %) on the Cityscapes validation set. Note that "Polyg-LSTM" denotes the original Polygon-RNN structure with a ConvLSTM cell, "Polyg-GRU" denotes Polygon-RNN with a GRU cell and "Polyg-SCS" denotes Polygon-RNN with our semi-coupled structure. c, Autonomous-driving performance of SCS and the baselines (CNN, CNN+LSTM) on the comma.ai and LiVi-Set validation sets. Note that λ denotes the angle threshold, p denotes the initial probability of stopping back-propagation in STSGD and "length" denotes the number of observed frames. d, Accuracy of \({\mathcal{T}}^{1}\) and \({\mathcal{T}}^{2}\) on LiVi-Set; comparing their performance indicates how important temporal information is under different road conditions. e, Performance on the REEC-2018 validation set. Note that p denotes the initial probability of stopping back-propagation in STSGD.

Extended Data Fig. 2

We provide the hyper-parameters used to train both the baselines and our SCS. Note that we adopt STSGD for action recognition and outline annotation, whereas for autonomous driving and precipitation forecasting we adopt ASTSGD.
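
The captions above describe p as the initial probability of stopping back-propagation in STSGD. A minimal sketch of that idea, assuming a PyTorch-style unrolled loop (the helper name and the placement of the cut are illustrative assumptions, not the released code): detaching the carried state at a time step with probability p truncates the gradient chain there, shortening the expected back-propagation length (compare Fig. 4). The ASTSGD variant mentioned above is not sketched here.

```python
import random

import torch
import torch.nn as nn


def maybe_stop_gradient(state: torch.Tensor, p: float) -> torch.Tensor:
    """With probability p, cut the back-propagation chain at this time step."""
    return state.detach() if random.random() < p else state


# Usage: unroll a small recurrent cell for 16 steps, truncating gradients at random.
cell = nn.GRUCell(input_size=8, hidden_size=8)
h = torch.zeros(1, 8)
for _ in range(16):
    h = maybe_stop_gradient(h, p=0.2)   # expected chain length shrinks as p grows
    h = cell(torch.randn(1, 8), h)
loss = h.pow(2).mean()
loss.backward()                         # gradients stop at the most recent detach point
```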

Source data

Source Data Fig. 4

Exact q value data


About this article


Cite this article

Pang, B., Zha, K., Cao, H. et al. Complex sequential understanding through the awareness of spatial and temporal concepts. Nat Mach Intell 2, 245–253 (2020). https://doi.org/10.1038/s42256-020-0168-3

