Complex sequential understanding through the awareness of spatial and temporal concepts


Understanding sequential information is a fundamental task for artificial intelligence. Current neural networks attempt to learn spatial and temporal information as a whole, which limits their ability to handle large-scale spatial representations over long-range sequences. Here, we introduce a new modelling strategy, the ‘semi-coupled structure’ (SCS), which consists of deep neural networks that decouple the complex spatial and temporal concepts during learning. SCS can learn to implicitly separate input information into independent parts and process these parts separately. Experiments demonstrate that SCS can sequentially annotate the outline of an object in images and perform video action recognition. As an example of a sequence-to-sequence problem, SCS can predict future meteorological radar echo images from observed images. Taken together, our results demonstrate that SCS can improve the performance of long short-term memory (LSTM)-like models on large-scale sequential tasks.

Fig. 1: Overview of SCS.
Fig. 2: Example experiments for SCS.
Fig. 3: Outline annotation and precipitation forecasting experiments.
Fig. 4: Expectation numbers of back-propagation chains.
Fig. 5: Ablation study of the action recognition task.

Data availability

The datasets used to train and evaluate SCS are publicly available: UCF-101, HMDB-51, Kinetics, CityScapes and LiVi-Set. The REEC-2018 dataset is a private dataset and is available from the corresponding author upon reasonable request.

Code availability

A public version of the experiment codes is available at


Acknowledgements

This work is supported in part by the National Key R&D Program of China (no. 2017YFA0700800), Shanghai Qi Zhi Institute, the National Natural Science Foundation of China (grant 61772332), SHEITC (2018-RGZN-02046) and Shanghai Science and Technology Committee.

Author information

Authors and Affiliations



B.P. and C.L. conceived the idea. B.P., K.Z. and C.L. designed the experiments. B.P., K.Z., H.C., J.T. and M.Y. carried out programming, adjustment and data analysis. B.P. and C.L. wrote the manuscript. B.P., J.T., M.Y. and all other authors contributed to the results analysis and commented on the manuscript.

Corresponding author

Correspondence to Cewu Lu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Extended data

Extended Data Fig. 1

a, Detailed stand-alone SCS structures for the action recognition task. There are residual connections between the layer blocks. b, Performance (IoU in %) on the Cityscapes validation set. Note that “Polyg-LSTM” denotes the original Polygon-RNN structure with a ConvLSTM cell, “Poly-GRU” denotes Polygon-RNN with a GRU cell, and “Polyg-SCS” denotes Polygon-RNN with our Semi-Coupled Structure. c, Auto-driving performance of SCS and baselines (CNN, CNN+LSTM) on the and LiVi-Set validation set. Note that “\(\lambda\)” denotes the angle threshold, “p” denotes the initial probability of stopping the back-propagation in STSGD and “length” denotes the number of observed frames. d, Accuracy of \({\mathcal{T}}^{1}\) and \({\mathcal{T}}^{2}\) on LiVi-Set. Comparing their performance indicates the importance of temporal information under different road conditions. e, Performance on the REEC-2018 validation set. Note that “p” denotes the initial probability of stopping the back-propagation in STSGD.
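
For readers unfamiliar with the IoU metric reported in panel b, the following is a minimal sketch of the generic intersection-over-union definition for two binary masks, expressed as a percentage. It is not the authors' evaluation code: the function name `mask_iou`, the nested-list mask format and the handling of two empty masks are assumptions made for illustration, and the exact protocol used on Cityscapes (for example, how scores are averaged over instances or classes) is not stated here.

```python
def mask_iou(pred, target):
    """Intersection over union of two equally sized binary masks, in percent.

    `pred` and `target` are nested lists of 0/1 values (hypothetical format).
    """
    intersection = 0
    union = 0
    for row_pred, row_target in zip(pred, target):
        for p, t in zip(row_pred, row_target):
            if p and t:
                intersection += 1  # pixel is foreground in both masks
            if p or t:
                union += 1         # pixel is foreground in at least one mask
    if union == 0:
        # Both masks are empty; treating this as perfect agreement is an
        # assumed convention, not something taken from the paper.
        return 100.0
    return 100.0 * intersection / union


# Two 2x2 squares that overlap in one column: 2 shared pixels, 6 in the union.
predicted = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]
ground_truth = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
print(round(mask_iou(predicted, ground_truth), 2))
```

In this example the masks share 2 pixels out of 6 in their union, so the score is one third, or roughly 33.33%.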

Extended Data Fig. 2

We provide the hyper-parameters used to train both the baselines and our SCS. Note that we adopt STSGD for action recognition and outline annotation, whereas for auto-driving and precipitation forecasting we adopt ASTSGD.

Source Data Fig. 4

Exact q value data

