# Complex sequential understanding through the awareness of spatial and temporal concepts

## Abstract

Understanding sequential information is a fundamental task for artificial intelligence. Current neural networks attempt to learn spatial and temporal information as a whole, which limits their ability to capture large-scale spatial representations over long-range sequences. Here, we introduce a new modelling strategy, the ‘semi-coupled structure’ (SCS), which consists of deep neural networks that decouple the complex spatial and temporal concepts during learning. SCS can learn to implicitly separate input information into independent parts and process these parts separately. Experiments demonstrate that SCS can sequentially annotate the outline of an object in images and perform video action recognition. As an example of a sequence-to-sequence problem, SCS can predict future meteorological radar echo images from observed images. Taken together, our results demonstrate that SCS can improve the performance of long short-term memory (LSTM)-like models on large-scale sequential tasks.

## Data availability

The datasets used to train and evaluate SCS are publicly available: UCF-101, https://www.crcv.ucf.edu/data/UCF101.php; HMDB-51, https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/; Kinetics, https://deepmind.com/research/open-source/kinetics; CityScapes, https://www.cityscapes-dataset.com/; comma.ai, https://archive.org/details/comma-dataset; LiVi-Set, http://www.dbehavior.net/. The REEC-2018 dataset is a private dataset and is available from the corresponding author upon reasonable request.

## Code availability

A public version of the experiment code is available at https://doi.org/10.5281/zenodo.3679134.

## Acknowledgements

This work is supported in part by the National Key R&D Program of China (no. 2017YFA0700800), Shanghai Qi Zhi Institute, the National Natural Science Foundation of China (grant 61772332), SHEITC (2018-RGZN-02046) and Shanghai Science and Technology Committee.

## Author information

### Contributions

B.P. and C.L. conceived the idea. B.P., K.Z. and C.L. designed the experiments. B.P., K.Z., H.C., J.T. and M.Y. carried out programming, adjustment and data analysis. B.P. and C.L. wrote the manuscript. B.P., J.T., M.Y. and all other authors contributed to the results analysis and commented on the manuscript.

### Corresponding author

Correspondence to Cewu Lu.

## Ethics declarations

### Competing interests

The authors declare no competing interests.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Extended data

### Extended Data Fig. 1

a, Detailed stand-alone SCS structures for the action recognition task. There are residual connections between all layer blocks. b, Performance (IoU in %) on the Cityscapes validation set. “Polyg-LSTM” denotes the original Polygon-RNN structure with a ConvLSTM cell, “Poly-GRU” denotes Polygon-RNN with a GRU cell, and “Polyg-SCS” denotes Polygon-RNN with our semi-coupled structure. c, Auto-driving performance of SCS and the baselines (CNN, CNN+LSTM) on the comma.ai and LiVi-Set validation sets. “$$\lambda$$” denotes the angle threshold, “p” denotes the initial probability of stopping back-propagation in STSGD, and “length” denotes the number of observed frames. d, Accuracy of $${\mathcal{T}}^{1}$$ and $${\mathcal{T}}^{2}$$ on LiVi-Set. Comparing their performance shows how important temporal information is under different road conditions. e, Performance on the REEC-2018 validation set. “p” denotes the initial probability of stopping back-propagation in STSGD.
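For readers unfamiliar with the two metrics named in this caption, the following minimal Python sketch shows how an IoU percentage and a threshold-based accuracy are commonly computed. It is an illustration under stated assumptions, not the authors' evaluation code: the function names `iou_percent` and `accuracy_within_threshold` are hypothetical, and treating a steering prediction as correct when its absolute error is at most the threshold is an assumed reading of the angle threshold.

```python
def iou_percent(pred_mask, true_mask):
    """Intersection over union of two binary masks (nested lists), in percent."""
    inter = 0
    union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 100.0 if union == 0 else 100.0 * inter / union


def accuracy_within_threshold(pred_angles, true_angles, threshold):
    """Fraction of predictions whose absolute error is at most `threshold`.

    Assumption: this is one plausible reading of an angle-threshold
    accuracy; the source does not spell out the exact rule.
    """
    hits = sum(1 for p, t in zip(pred_angles, true_angles)
               if abs(p - t) <= threshold)
    return hits / len(true_angles)


# Toy 2x3 masks: 3 overlapping pixels, 4 pixels in the union.
pred = [[0, 1, 1],
        [0, 1, 1]]
true = [[0, 0, 1],
        [0, 1, 1]]
print(iou_percent(pred, true))  # 75.0

# Toy angles: errors are 1, 0, 6 and 1, so three of four fall within 2.
print(accuracy_within_threshold([1.0, 5.0, -2.0, 0.0],
                                [0.0, 5.0, 4.0, 1.0],
                                threshold=2.0))  # 0.75
```

Under this reading, a larger threshold can only raise the reported accuracy, so comparisons between models are meaningful only at the same threshold.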

### Extended Data Fig. 2

Hyper-parameters used to train the baselines and SCS. We adopt STSGD for action recognition and outline annotation, and ASTSGD for auto driving and precipitation forecasting.
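The captions describe “p” as the initial probability of stopping back-propagation in STSGD. As a rough intuition only, the sketch below samples how far gradients would travel back through a sequence if back-propagation were stopped independently at each earlier time step with a fixed probability. This is a generic stochastic-truncation illustration and is not the authors' STSGD or ASTSGD; the function name `sample_backprop_depth`, the fixed (rather than scheduled) stop probability, and the independence of the per-step draws are all assumptions.

```python
import random


def sample_backprop_depth(seq_len, p_stop, rng):
    """Sample how many time steps receive gradients, counted from the last step.

    Assumption: at every earlier step, back-propagation stops
    independently with probability `p_stop`.
    """
    depth = 1  # the final time step always receives a gradient
    while depth < seq_len and rng.random() >= p_stop:
        depth += 1
    return depth


rng = random.Random(0)  # seeded so the run is repeatable
depths = [sample_backprop_depth(seq_len=16, p_stop=0.25, rng=rng)
          for _ in range(10_000)]

# A larger p_stop shortens the average gradient path and lowers training cost.
print(sum(depths) / len(depths))
```

In a real training loop the sampled depth would decide where the recurrent state is detached from the computation graph; with `p_stop=0.0` the sketch reduces to full back-propagation through time.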

## Source data

### Source Data Fig. 4

Exact q value data
