Spatial linear transformer and temporal convolution network for traffic flow prediction

Accurately obtaining information about the future traffic flow of all roads in a transportation network is essential for traffic management and control applications. To address the challenges of acquiring dynamic global spatial correlations between transportation links and of modeling temporal dependencies in multi-step prediction, we propose a spatial linear transformer and temporal convolution network (SLTTCN). The model uses spatial linear transformers to aggregate the spatial information of the traffic flow and a bidirectional temporal convolution network to capture its temporal dependency. The spatial linear transformer effectively reduces the complexity of computation and storage while capturing spatial dependence, and the temporal convolution network with bidirectional and gate fusion mechanisms avoids the problems of gradient vanishing and high computational cost caused by long time intervals during model training. We conducted extensive experiments on two publicly available large-scale traffic data sets and compared SLTTCN with other baselines. Numerical results show that SLTTCN achieves the best predictive performance under various error measurements. We also performed an attention visualization analysis of the spatial linear transformer, verifying its effectiveness in capturing dynamic global spatial dependency.

coefficients of multi-dimensional features of raw data and obtain more accurate prediction results. However, the above studies still have two issues to overcome in temporal and spatial dependency modeling.
For spatial dependency, the problem is that these studies explore spatial dependency under the assumption of a predefined graph structure, which is determined directly by spatial connectivity. In some situations, however, spatial connectivity cannot correctly reflect the real dependency between two nodes. The graph attention network (GAT) theory emerged to solve this problem. GAT 14,15 constructs dynamic spatial dependency by computing the dependency from the traffic state of the object node to the related node, and it has become a popular research direction.
For temporal dependency, the problem is that despite the wide usage of RNNs and their variants 16,17 in capturing temporal dependencies, they still suffer from drawbacks such as time-consuming computation, complex gating mechanisms, and slow response to dynamic changes. In addition, it is inconvenient to capture long-term temporal dependency by stacking CNN layers, whose receptive field size grows only linearly with the number of layers 18,19. The problem faced by the transformer structure is that as the time series grows, the computational complexity of its internal self-attention mechanism increases significantly, which greatly degrades the performance of the transformer on long sequence data.
To address the above two limitations, we propose a new framework called the Spatial Linear Transformer with Temporal Convolution Network (SLTTCN). With regard to spatial dependency, we propose a spatial linear transformer network (SLTN), which dynamically captures spatial dependency by taking the real-time traffic state, the connectivity among nodes, and the traffic flow directions into consideration. The classical transformer 20 has the disadvantage of high memory and computational complexity. Therefore, to make the SLTN work efficiently, we optimize the original form of the self-attention mechanism into a linear form to reduce its resource demand. With regard to temporal dependency, inspired by the temporal convolution network (TCN) 21, we capture temporal dependency by stacking dilated causal convolution layers, whose receptive field size grows exponentially with the number of stacked layers. In addition, we introduce a bidirectional and gate-fusion mechanism into the TCN: it mines the temporal dependency both from past to future and from future to past, then fuses the two kinds of dependency to express the complex temporal dependency better. The main contributions are summarized as follows:
• We construct a novel framework named the Spatial Linear Transformer with Temporal Convolution Network (SLTTCN) to dynamically model the temporal and spatial dependency of traffic flow.
• We propose the spatial linear transformer network (SLTN) to dynamically capture spatial dependency, and to improve the efficiency of the network, we change the structure of the SLTN to reduce its memory and computational complexity.
• We propose a temporal convolution network with a bidirectional and gate fusion mechanism to capture temporal dependency, taking advantage of steady gradients, parallel computing, and a simple structure.
The structure of the paper is as follows. In section "Introduction" we introduce the motivation for traffic prediction, the issues in modeling temporal and spatial dependency, the related works, and the framework of our model. In section "Methodology", we formulate traffic prediction as a temporal-spatial graph prediction problem, describe the framework of SLTTCN, and analyze its components. In section "Experiment", we conduct experiments with our model on real-world traffic state data and compare it with other state-of-the-art methods. In section "Conclusion", we draw conclusions and look forward to future work.

Methodology
In this section, we first give the mathematical definition of the problem analyzed in this paper. Second, we introduce the two main blocks of our framework, the spatial linear transformer network and the temporal convolution network. Last, we summarize the structure of our framework.

Problem definition and notations
A traffic network can be regarded as a graph $G = (V, E, A)$, where $V = \{v_1, v_2, v_3, \ldots, v_N\}$ is the collection of all vertices representing the sensors, and $E = \{e_{i,j}\}$ is the collection of all edges representing the connectivity among these sensors. The adjacency matrix $A = \{a_{i,j}\} \in \mathbb{R}^{d_N \times d_N}$ represents the connectivity, where $A_{i,j} = 1$ if $e_{i,j}$ exists and $A_{i,j} = 0$ otherwise. Traffic prediction is a temporal-spatial problem. Its aim is to use $X_F = \{x_{t-T+1}, x_{t-T+2}, \ldots, x_t\} \in \mathbb{R}^{d_N \times d_T}$, the input features over the past $T$ time slices of the traffic state at time $t$ from $N$ sensors, to predict the future traffic states $\hat{Y} = \{x_{t+1}, x_{t+2}, \ldots, x_{t+T}\} \in \mathbb{R}^{d_N \times d_T}$. It can be formulated as

$$\hat{Y} = F(X_F),$$

where $F$ is the mapping relationship, that is, the model we need to learn.
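As a minimal illustration of this problem setup (the sizes `N` and `T` are hypothetical, and `F` is only a placeholder standing in for the learned SLTTCN model):

```python
import numpy as np

# Hypothetical sizes: N sensors, T historical time slices.
N, T = 307, 12

# Adjacency matrix A: A[i, j] = 1 if edge e_ij exists, else 0.
A = np.zeros((N, N))
A[0, 1] = A[1, 0] = 1  # example: sensors 0 and 1 are connected

# Input X_F: traffic states over the past T slices for all N sensors.
X_F = np.random.rand(N, T)

# Placeholder for the learned model F mapping X_F to the next T slices.
def F(x):
    return np.zeros_like(x)

Y_hat = F(X_F)  # predicted future states, shape (N, T)
assert Y_hat.shape == (N, T)
```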

Spatial linear transformer network
In this section, we propose the spatial linear transformer network, which is composed of two parts: the position embedding and the linear self-attention layer. The position embedding incorporates position information into each node's features. The linear self-attention layer captures dynamic spatial dependencies that evolve over time. The structure is shown in Fig. 2.

Position embedding
Given the original input information $X_F \in \mathbb{R}^{d_N \times d_F}$, a plain spatial transformer would only apply feed-forward operations, ignoring the spatial position information of each node. In fact, position information occupies an important place in modeling spatial correlation. Therefore, we first obtain the position embedding $S_M \in \mathbb{R}^{d_N \times d_F}$ and merge it into the original input information $X_F$, that is, $X_S = X_F + S_M$.
The graph embedding theory ProNE 22 can help us obtain the embedding of each node to capture the spatial properties of the graph; the process is shown in Fig. 3.
We regard each edge as a node-context pair, so the edge set forms a node-context pair set $D = E$. Given a node $v_i$, let $\vec{c}_j, \vec{v}_i \in \mathbb{R}^{d_F}$ respectively denote the embedding of the context node $v_j$ and the embedding of the current node $v_i$; the inner product $\vec{c}_j^{\,T} \vec{v}_i$ then represents the similarity between the embeddings of context $v_j$ and node $v_i$. We further define the occurrence probability matrix $P \in \mathbb{R}^{d_N \times d_N}$, where the occurrence probability of context $v_j$ is $\hat{p}_{i,j} = \sigma(\vec{c}_j^{\,T} \vec{v}_i)$ and $\sigma(\cdot)$ is the sigmoid function.
Here the context only considers directly connected vertices, so the objective can be expressed as a sum of log losses over all edges:

$$L = -\sum_{(v_i, v_j) \in D} p_{i,j} \ln \sigma(\vec{c}_j^{\,T} \vec{v}_i),$$

where $p_{i,j} = a_{i,j} / d_{i,i}$ and $d_{i,i} = \sum_j a_{i,j}$. Here $p_{i,j}$ is the normalized weight of $(v_i, v_j)$, which can also be regarded as the normalized weight of edge $e_{i,j}$ in $D$.
There is a trivial solution ($\vec{c}_j = \vec{v}_i$ and $\hat{p}_{i,j} = 1$) to the above loss function. The trivial solution means that only positive edges exist and no negative edges, that is, an edge exists between every pair of nodes, which is unreasonable. For each observed pair of vertices $(v_i, v_j)$, the context $v_j$ may also come from negative samples $P_{D,j}$, so we update the loss function as

$$L = -\sum_{(v_i, v_j) \in D} p_{i,j} \left[ \ln \sigma(\vec{c}_j^{\,T} \vec{v}_i) + \lambda\, P_{D,j} \ln \sigma(-\vec{c}_j^{\,T} \vec{v}_i) \right],$$

where $\lambda$ is the negative sample ratio and the negative sample distribution $P_{D,j}$ with context node $v_j$ over the node-context pair set can be defined as

$$P_{D,j} \propto \left( \sum_{i:(v_i, v_j) \in D} p_{i,j} \right)^{\alpha},$$

where $\alpha$ equals 1 or 0.75 23.
To minimize the loss function, a sufficient condition is that its partial derivative with respect to $\vec{c}_j^{\,T} \vec{v}_i$ equals zero. Therefore, by solving $\frac{\partial L}{\partial (\vec{c}_j^{\,T} \vec{v}_i)} = 0$, we get

$$\vec{c}_j^{\,T} \vec{v}_i = \ln p_{i,j} - \ln(\lambda P_{D,j}).$$

Therefore, we further define a new similarity matrix $M \in \mathbb{R}^{d_N \times d_N}$ whose elements are

$$M_{i,j} = \ln p_{i,j} - \ln(\lambda P_{D,j}).$$

Hence, the problem of distributional similarity-based network embedding is transformed into matrix factorization. We use the truncated singular value decomposition (TSVD), that is, $M \approx U_M \Sigma_M V_M^T$. Note that $\Sigma_M$ is the diagonal matrix composed of the largest $d_F$ singular values arranged in descending order, and $U_M \in \mathbb{R}^{d_N \times d_F}$ and $V_M \in \mathbb{R}^{d_N \times d_F}$ are matrices whose orthogonal column vectors are the singular vectors corresponding to the selected singular values. Finally, our embedding matrix is $S_M = U_M \sqrt{\Sigma_M}$, where each row of $S_M$ represents the embedding of the corresponding node.
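The embedding step above can be sketched roughly as follows, on a small random graph with hypothetical sizes; the exact edge weighting, clipping, and sparse solvers used in the original ProNE implementation may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_F = 30, 8                           # hypothetical small graph

# Random symmetric adjacency matrix with ~20% edge density.
A = (rng.random((N, N)) < 0.2).astype(float)
A = np.maximum(A, A.T)
np.fill_diagonal(A, 0)

d = A.sum(axis=1, keepdims=True) + 1e-12
P = A / d                                # p_ij = a_ij / d_ii

lam, alpha = 1.0, 0.75                   # negative-sample ratio and exponent
P_D = A.sum(axis=0) ** alpha + 1e-12
P_D = P_D / P_D.sum()

# Similarity matrix M_ij = ln p_ij - ln(lambda * P_D_j), on edges only.
M = np.log(np.where(P > 0, P, 1.0)) - np.log(lam * P_D)[None, :]
M = np.where(A > 0, np.maximum(M, 0), 0.0)

# Truncated SVD: keep the d_F largest singular values
# (numpy returns them already in descending order).
U, s, Vt = np.linalg.svd(M)
S_M = U[:, :d_F] * np.sqrt(s[:d_F])      # node embedding matrix

assert S_M.shape == (N, d_F)
```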

Linear self-attention layer
The linear self-attention layer computes, for every position, a weighted average of the features of all other positions, with weights proportional to the similarity score $S_S \in \mathbb{R}^{d_N \times d_N}$ between these features. Formally, the input sequence $X_S \in \mathbb{R}^{d_N \times d_F}$ containing position embedding information is projected into three feature matrices, the query $Q \in \mathbb{R}^{d_N \times d_K}$, key $K \in \mathbb{R}^{d_N \times d_K}$, and value $V \in \mathbb{R}^{d_N \times d_V}$, computed as

$$Q = X_S W_Q, \quad K = X_S W_K, \quad V = X_S W_V,$$

where $W_Q \in \mathbb{R}^{d_F \times d_K}$, $W_K \in \mathbb{R}^{d_F \times d_K}$, and $W_V \in \mathbb{R}^{d_F \times d_V}$ are learnable projection matrices. As with the similarity between node embedding and context embedding, the similarity score $S_S \in \mathbb{R}^{d_N \times d_N}$ is the inner product of $Q$ and $K$ after the softmax function is applied to each respectively.
Concretely, the similarity matrix is

$$S_S = \text{softmax}_{\text{row}}(Q)\, \text{softmax}_{\text{col}}(K)^T.$$

The softmax function is the exponential normalization function that makes $Q$ and $K^T$ non-negative and normalized, so their product $S_S$ is also non-negative and row-normalized. The output feature $Y_S$ of the self-attention layer, denoted $\text{Attention}(Q, K, V)$, is then calculated as

$$Y_S = \text{Attention}(Q, K, V) = S_S V.$$

To reduce the memory and computation resource demand, we adopt the linear self-attention mechanism, a kind of efficient self-attention mechanism 24. Because we apply the softmax function to $Q$ and $K$ separately, we can switch the computation order, which is more efficient. In detail, as shown in Fig. 4, we first compute $\text{softmax}_{\text{col}}(K)^T V$ and then multiply by $\text{softmax}_{\text{row}}(Q)$:

$$Y_S = \text{softmax}_{\text{row}}(Q) \left( \text{softmax}_{\text{col}}(K)^T V \right).$$

Hence the linear self-attention mechanism reduces the resource requirement by changing the computation order, since the intermediate product has size $d_K \times d_V$ rather than $d_N \times d_N$. To stabilize the training process and prevent overfitting, we further adopt the multi-head attention mechanism. Specifically, rather than performing a single-head attention operation that computes the query, key, and value once, multi-head attention projects $X_S$ in parallel into the queries, keys, and values of $n_A$ single-head attention operations, then concatenates their outputs and projects them again to obtain the final result. The multi-head attention $H$ is computed as

$$H = \text{Concat}(H_1, \ldots, H_{n_A}) W_O,$$

with the $i$-th head $H_i$ computed as

$$H_i = \text{Attention}\big(X_S W_Q^{(i)}, X_S W_K^{(i)}, X_S W_V^{(i)}\big),$$

where $W_Q^{(i)}, W_K^{(i)} \in \mathbb{R}^{d_F \times d_K}$, $W_V^{(i)} \in \mathbb{R}^{d_F \times d_V}$, and $W_O \in \mathbb{R}^{n_A d_V \times d_F}$ are learnable parameters.
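The key identity behind the linear form — that applying the softmax to $Q$ and $K$ separately makes the two computation orders give the same result by associativity — can be verified numerically. The sizes below are illustrative only:

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, d_k = 207, 16                     # hypothetical sizes
X_S = rng.standard_normal((N, d_k))
W_Q = rng.standard_normal((d_k, d_k))
W_K = rng.standard_normal((d_k, d_k))
W_V = rng.standard_normal((d_k, d_k))

Q, K, V = X_S @ W_Q, X_S @ W_K, X_S @ W_V

# Quadratic order: build the (N x N) similarity matrix, O(N^2 d) cost.
Y_quad = (softmax(Q, axis=1) @ softmax(K, axis=0).T) @ V

# Linear order: compute softmax(K)^T V first, O(N d^2) cost.
Y_lin = softmax(Q, axis=1) @ (softmax(K, axis=0).T @ V)

assert np.allclose(Y_quad, Y_lin)    # same output, cheaper order
```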

Bidirectional temporal convolution network
The bidirectional temporal convolution layer is composed of sequential and reverse-sequential temporal convolution layers, shown concretely in Fig. 5; both contain three convolution operations: the one-dimensional full convolution, the causal and anti-causal convolutions, and the dilated convolution. More specifically, we define the input features $X_F \in \mathbb{R}^{d_N \times d_T}$, where the $i$-th element $X_{F(i)} \in \mathbb{R}^{d_T}$ is a one-dimensional sequence containing $T$ time steps. We define the filter collection $F_C \in \mathbb{R}^{d_N \times d_C}$, where the $j$-th filter $F_{C(j)} \in \mathbb{R}^{d_C}$ has kernel size $d_C$. Then the causal convolution output $Y_C \in \mathbb{R}^{d_N \times d_T}$ and the anti-causal convolution output $Y_A \in \mathbb{R}^{d_N \times d_T}$ at the $t$-th time step can be formulated as

$$Y_{C(i)}(t) = \sum_{k=0}^{d_C - 1} F_{C(i)}(k)\, X_{F(i)}(t - d_f \cdot k), \qquad Y_{A(i)}(t) = \sum_{k=0}^{d_C - 1} F_{C(i)}(k)\, X_{F(i)}(t + d_f \cdot k),$$

where $d_f$ is the dilation factor and out-of-range inputs are zero-padded.
The dilated convolution samples the input features at a fixed interval, and $d_f$ controls the sampling ratio. To make the captured long-term dependency grow exponentially, we stack the dilated convolution layers with $d_f$ in ascending order.
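A minimal sketch of a dilated causal convolution and of how the receptive field grows when layers with increasing dilation factors are stacked (single channel, unit filter weights, and zero padding are assumed for illustration):

```python
import numpy as np

def dilated_causal_conv(x, f, d_f):
    """y[t] = sum_k f[k] * x[t - d_f * k], with zero padding for t < 0."""
    y = np.zeros_like(x, dtype=float)
    for t in range(len(x)):
        for k in range(len(f)):
            j = t - d_f * k
            if j >= 0:
                y[t] += f[k] * x[j]
    return y

x = np.arange(8, dtype=float)
f = np.ones(3)                            # kernel size 3, unit weights

y1 = dilated_causal_conv(x, f, d_f=1)     # receptive field: 3 steps
y2 = dilated_causal_conv(y1, f, d_f=2)    # stacked: receptive field 7

# Each output only sees the present and past of its input:
assert y1[3] == x[3] + x[2] + x[1]
# Stacking with dilations 1, 2 gives 1 + (3-1)*1 + (3-1)*2 = 7 steps.
assert y2.shape == x.shape
```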
After obtaining the output features, namely the causal convolution output $Y_C \in \mathbb{R}^{d_N \times d_T}$ and the anti-causal convolution output $Y_A \in \mathbb{R}^{d_N \times d_T}$ of the bidirectional temporal convolution layers, we adopt a gate fusion mechanism to fuse them:

$$Y_T = \text{ReLU}\big(G \odot Y_C + (1 - G) \odot Y_A\big),$$

with the gate $G$ computed as

$$G = \sigma(W_{TC}\, Y_C + W_{TF}\, Y_A + b_G),$$

where $W_{TC} \in \mathbb{R}^{d_N \times d_N}$ and $W_{TF} \in \mathbb{R}^{d_N \times d_N}$ are weight matrices and $b_G \in \mathbb{R}^{d_T}$ is the bias. $\sigma(\cdot)$ is the sigmoid function, $\text{ReLU}(\cdot)$ is the rectified linear unit used as the activation function, and $\odot$ is the element-wise product operation. The gate $G$ fuses the two kinds of input information and controls their mixing ratio.
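One plausible reading of the gate fusion step can be sketched as follows, with random weights and hypothetical sizes (the exact placement of the ReLU relative to the gated sum is our assumption):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
N, T = 5, 12                          # hypothetical sizes
Y_C = rng.standard_normal((N, T))     # causal (past-to-future) output
Y_A = rng.standard_normal((N, T))     # anti-causal output
W_TC = rng.standard_normal((N, N))
W_TF = rng.standard_normal((N, N))
b_G = rng.standard_normal(T)

# Gate G in (0, 1) controls the mixing ratio of the two directions.
G = sigmoid(W_TC @ Y_C + W_TF @ Y_A + b_G)
Y_T = np.maximum(G * Y_C + (1 - G) * Y_A, 0)   # ReLU after gated fusion

assert Y_T.shape == (N, T)
```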

Model structure
The model structure of SLTTCN is shown in Fig. 6. It consists of the spatial linear transformer block (SLT), the bidirectional temporal convolution block (Bi-TCN) with the gate fusion mechanism (GF), and the residual block (Res). We adopt fully connected (FC) layers to first transform the input historical traffic data $X_F \in \mathbb{R}^{d_N \times d_T}$ into the hidden state $H_F^{(1)} \in \mathbb{R}^{d_N \times d_F}$ and finally transform it back into the traffic volume prediction $Y_P \in \mathbb{R}^{d_N \times d_T}$. The reason is that the hidden state $H_F^{(2)} \in \mathbb{R}^{d_N \times d_F}$ in the spatial linear transformer block feeds the multi-head attention mechanism, which requires that the dimension $d_F$ be divisible by the number of heads $n_A$. We adopt the residual block 27 for stable and efficient training. We adopt the mean absolute error (MAE) as the loss function to train the model, formulated as

$$L(\theta_L) = \frac{1}{T} \sum_{i=1}^{T} \left| Y_{Pi} - \hat{Y}_i \right|,$$

where $\theta_L$ denotes all of the parameters in our model, and $Y_{Pi}$ and $\hat{Y}_i$ respectively denote the prediction and the ground truth at the $i$-th time step.

Experiment Datasets
We conducted experiments on two highway data sets to validate our model: PeMSD4 and PeMSD8. We adopt the traffic flow data processing methods from 25. First, we exclude some redundant detectors to guarantee that the distance between any two adjacent detectors is longer than 3.5 miles. Second, we aggregate the raw traffic data every 5 min, so the traffic data contains 288 timestamps per day. Then, we use the linear interpolation method to fill in missing data and apply the zero-mean normalization operation $x' = x - \text{mean}(x)$ to the input data.
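The preprocessing pipeline described above — aggregation to 5-minute slices, linear interpolation of missing values, and zero-mean normalization — can be sketched as follows (the 30-second raw sampling interval is an assumption for illustration; the real PeMS feeds may differ):

```python
import numpy as np

# Hypothetical raw stream: one reading every 30 s for one day.
raw = np.random.rand(2880)

# Aggregate to 5-minute slices: 288 timestamps per day.
agg = raw.reshape(288, 10).sum(axis=1)

# Mark one slice missing, then fill it by linear interpolation.
agg[5] = np.nan
idx = np.arange(len(agg))
mask = np.isnan(agg)
agg[mask] = np.interp(idx[mask], idx[~mask], agg[~mask])

# Zero-mean normalization: x' = x - mean(x).
x_norm = agg - agg.mean()
assert abs(x_norm.mean()) < 1e-6
```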

Metrics
We adopt the mean absolute error (MAE) and root mean squared error (RMSE) to evaluate our model's performance.
Let $Y_P = \{y_{P1}, y_{P2}, \ldots, y_{PQ}\} \in \mathbb{R}^{d_Q}$ denote the prediction data and $\hat{Y} = \{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_Q\} \in \mathbb{R}^{d_Q}$ the ground truth, where $Q$ is the size of the data set. The MAE and RMSE are calculated as

$$\text{MAE} = \frac{1}{Q} \sum_{i=1}^{Q} \left| y_{Pi} - \hat{y}_i \right|, \qquad \text{RMSE} = \sqrt{\frac{1}{Q} \sum_{i=1}^{Q} \left( y_{Pi} - \hat{y}_i \right)^2}.$$
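The two metrics translate directly into code; the toy arrays below are illustrative only:

```python
import numpy as np

def mae(y_pred, y_true):
    # Mean absolute error: average of |y_Pi - y_hat_i|.
    return np.mean(np.abs(y_pred - y_true))

def rmse(y_pred, y_true):
    # Root mean squared error: sqrt of the mean squared difference.
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

y_pred = np.array([3.0, 5.0, 2.0, 7.0])
y_true = np.array([2.0, 5.0, 4.0, 6.0])

assert mae(y_pred, y_true) == 1.0            # (1 + 0 + 2 + 1) / 4
assert rmse(y_pred, y_true) == np.sqrt(1.5)  # sqrt((1 + 0 + 4 + 1) / 4)
```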

Settings
For the HA model, we adopt the average value of the last 12 time steps to forecast the next time step. The specific parameters of the other models are set according to their original references. For our model, we set the input and output dimensions of each attention head in the spatial linear transformer network to 12 and the number of attention heads to 4. For the bidirectional temporal convolution network, we set the input and output dimensions of each layer to the number of nodes, and the number of layers and the convolution kernel size to 3.

Prediction performance comparison with baselines
We compare the performance of our model and the baselines within a one-hour prediction window, which contains 12 time slices. The results at horizons 3, 6, and 12 on PeMSD4 and PeMSD8 are shown in Table 1. We can clearly observe from the results that the traditional HA model never performs well, because it is limited in modeling traffic data with complex and nonlinear characteristics. In addition, the deep learning models outperform the traditional models due to their ability to model nonlinear data.
As shown in Table 1, our method achieves better performance on most metrics on all datasets. This is because, compared to the GNN models (STGCN, DCRNN, ASTGCN, GWNet, MTGNN, AGCRN, and PDFormer), our proposed Spatial Linear Transformer network (SLT) uses the self-attention mechanism to calculate the attention weights of each node. Specifically, the weight of a self-loop is determined by the feature vector itself, while the weight of an interconnection is calculated from the attention score. To balance the weights between self-loops and interconnections, we introduce position encoding in SLT to incorporate distance information between nodes into the calculation of attention weights, thereby capturing the spatial relationships between different nodes. The SLT considers the relationships between nodes from a global perspective, rather than just local neighboring nodes. Therefore, SLT is better able to capture long-range dependencies: even when the distance between two nodes is large, it can establish effective connections. In addition, the bidirectional TCN we employ excels at modeling temporal dependencies. The prediction pattern of Bi-TCN (bidirectional temporal convolution network) is direct multi-step prediction, which avoids the accuracy degradation that recursive methods may suffer as the number of time steps increases.
In our spatial linear transformer network, we introduce spatial positional encoding to differentiate the positional information of different nodes. This allows our model to simultaneously consider the similarity between node features and the distance information between nodes. In contrast, traditional GNN models often rely only on neighboring node information and handle positional relationships between nodes more weakly. Lastly, SLT has a linear structure that enables parameter sharing, which reduces the number of parameters to be trained and improves the computational efficiency of the model.
In the following sections, we also demonstrate the advantages of SLTTCN in terms of computational cost.

Ablation study
To evaluate the effectiveness of each part of SLTTCN, we conduct ablation studies with three variants of our model. Figure 7 reveals the importance of each module to our model's performance. The SLT and Bi-TCN modules respectively model the spatial and temporal dependencies, which are crucial for traffic modeling. Additionally, the performance degradation observed when removing the Res module indicates that the residual connections we adopt effectively alleviate the gradient vanishing problem and enhance the network's capacity to learn nonlinear functions.

Model analysis
Firstly, to better demonstrate the significance of the global dynamic spatial dependency (GDSD) mechanism of SLTTCN, we compare it with two other attention-based spatial models: the local dynamic spatial dependency (LDSD) mechanism adopted by GAT, and the global static spatial dependency (GSSD) mechanism proposed by Graph WaveNet. Their performances are shown in Table 2. We conclude that GDSD and GSSD outperform LDSD due to their global dependency, which can gather non-local node information, and that GDSD outperforms GSSD since its dependency evolves with time. To examine the three dependency mechanisms, we visualize the three dependencies of the first 30 nodes on the PeMSD4 data set; the result is shown in Fig. 8. It is obvious that for GDSD, the spatial dependency is distributed over every node and changes over time. For GSSD, the spatial dependency is also distributed over every node but is fixed. For LDSD, the spatial dependency is distributed over only some nodes and is sparse, because the dependency is computed only between the source node and directly connected object nodes, not indirectly connected ones. Secondly, to demonstrate the computational efficiency of our attention mechanism-based SLTN and fused gate mechanism-based Bi-TCN, we compared the time costs of the main baseline models with SLTTCN on the PeMSD4 dataset. We record the average time per epoch of training and the total time for validation. The results are presented in Table 3. We can see that HA is the fastest because it has the simplest structure among all models. STGCN is efficient during training due to its fully convolutional structure, while GWNet is slower due to its combination of diffusion graph convolution and a self-adaptive adjacency matrix. DCRNN adopts an iterative approach instead of directly generating all predictions, which increases training time. Meanwhile, AGCRN has a significantly increased number of parameters for learning node-specific patterns, which slows down computation. ASTGCN is slower because it combines the spectral graph convolution operation with the attention mechanism, and its running time is also affected by the stack of spatial-temporal blocks. The delay module of PDFormer significantly increases its computational cost. Thanks to the linear structure of the spatial transformer and the parallel computation of the TCN, our model is the most efficient.

Conclusion
In this paper, we propose a novel deep learning model for traffic flow forecasting called SLTTCN. It captures spatial dependency with the spatial linear transformer, a dynamic graph convolution operation that includes graph embedding theory for learning the embedding of each node. It also captures temporal dependency with the bidirectional temporal convolution, which constructs the temporal relationships among multi-step data from past to future and vice versa. The performance and computational efficiency of SLTTCN on the PeMSD4 and PeMSD8 data sets outperform all the baselines we chose. In the future, we will explore more methods for traffic forecasting. Regarding the data, there are further external factors that also influence traffic flow, such as weather, social events, and traffic policy. Regarding the model structure, we consider mining the edge relationships of the traffic network and combining them with the node feature domain in future studies.

Figure 1 .
Figure 1. The observations and predictions of traffic states, where differently colored nodes represent different areas and the lines connecting the areas represent the traffic states evolving with time.

Figure 2 .
Figure 2. The structure of the spatial linear transformer; all notations are described in detail later in the paper.

Figure 3 .
Figure 3. The process of ProNE, where we first obtain the similarity matrix $M$ from the initial graph information, and then apply the truncated singular value decomposition (TSVD) to $M$ to obtain the node embedding matrix $S_M$.

Figure 4 .
Figure 4. The comparison of the traditional self-attention mechanism and the efficient self-attention mechanism, where the symbol shown denotes matrix multiplication.
Figure 5. The Bi-TCN layer, which is composed of sequential and reverse-sequential temporal convolution layers with convolution kernel size 3 and an exponentially growing dilation factor $d_f$.

Table 1 .
The results of different methods on PeMSD4 and PeMSD8. The best performance at each horizon is highlighted in bold.

Table 2 .
The comparison of different spatial dependency mechanisms on PeMSD4.

Table 3 .
The comparison of computation time on the PeMSD4 dataset.