Introduction

5G technology has recently gained popularity worldwide for its faster transfer speed, broader bandwidth, reliability, and security. 5G can achieve a theoretical peak speed 20× that of 4G with lower latency, enabling applications such as online gaming, HD streaming services, and video conferencing1,2,3. The development of 5G is changing the world at an incredible pace and fostering emerging industries such as telemedicine, autonomous driving, and extended reality4,5,6. These and other industries are estimated to bring a 1000-fold increase in network traffic, requiring additional capacity to accommodate the growing services and applications7. Nevertheless, 5G infrastructure, such as board cards and routers, must be deployed and managed under strict cost considerations8,9. Operators therefore often adopt a distributed architecture to avoid massive back-to-back devices and links among fragmented networks10,11,12,13. As shown in Fig. 1a, the emerging metropolitan router acts as the hub linking urban access routers, where services can be accessed and integrated effectively. However, the construction cycle of 5G devices requires about three months for scheduling, procurement, and deployment. Planning new infrastructure therefore requires accurate network traffic forecasts months ahead to anticipate when capacity utilization will surpass the preset threshold, beyond which overloaded capacity may ultimately lead to performance problems. Another issue is the resource excess caused by building coarse-grained 5G infrastructure. To mitigate these hazards, operators formulate network expansion schemes months ahead based on long-term network traffic prediction, which facilitates long-period planning for upgrading and scaling the network infrastructure and prepares it for the next planning period14,15,16,17.

Fig. 1: Schematic illustration for the workflow of Diviner.

a We collect the data from MAR–MER links. The orange cylinder depicts the metropolitan emerging routers (MER), and the pale blue cylinder depicts the metropolitan accessing routers (MAR). b Illustration of the 2D → 3D transformation process. Specifically, given a time series of network traffic data spanning K days, we construct a time series matrix \(\widetilde{\mathbf{X}}=[\tilde{\mathbf{x}}_1\ \tilde{\mathbf{x}}_2\ \ldots\ \tilde{\mathbf{x}}_K]\), where each \(\tilde{\mathbf{x}}_i\) represents the traffic data for a single day of length T. The resulting 3D plot displays time steps across days, time steps within each day, and inbits traffic along the x, y, and z axes, respectively, with the inbits traffic standardized. The blue line in the 2D plot and the side near the origin of the pale red plane in the 3D plot represent historical network traffic, while the yellowish line in the 2D plot and the side far from the origin of the pale red plane in the 3D plot represent the future network traffic to predict. c The overall workflow of the proposed Diviner. The blue solid line indicates the data stream direction. Both the encoder and decoder blocks of Diviner contain a smoothing filter attention mechanism (yellowish block), a difference attention module (pale purple block), a residual structure (pale green block), and a feed-forward layer (gray block). Finally, a one-step convolution generator (magenta block) converts the dynamic decoding into a sequence-generating procedure.

In industry, a common practice is to estimate the potential growth rate of network traffic by analyzing historical traffic data18. However, this approach cannot scale to predict the demand for new services and is less than satisfactory for long-term forecasting. Prediction-based methods have been introduced to solve this dilemma by exploring the potential dependencies involved in historical network traffic, which provide both a constraint and a source for assessing future traffic volume. Network planners can harness these dependencies to extrapolate sufficiently long traffic forecasts and develop sustainable expansion schemes and mitigation strategies. The key to this task is obtaining an accurate long-term network traffic prediction. However, directly extending the prediction horizon of existing methods is ineffective for long-term forecasting, since these methods suffer severe performance degeneration: the long-term prediction horizon exposes the non-stationarity of time series. This inherent non-stationarity of real-world time series data is caused by multi-scale temporal variations, random perturbations, and outliers, which present various challenges, summarized as follows. (a) Multi-scale temporal variations. Multi-scale (daily/weekly/monthly/yearly) variations throughout long-term time series indicate multi-scale non-stationary latent patterns within the time series, which should be taken into account comprehensively in the model design; the seasonal component, for example, merely presents variations at particular scales. (b) Random factors. Random perturbations and outliers interfere with the discovery of stable regularities, which demands higher robustness from prediction models. (c) Data distribution shift. The non-stationarity of the time series inevitably results in a dataset shift problem, with the input data distribution varying over time. Figure 1b illustrates these challenges.

Next, we review the shortcomings of existing methods in addressing non-stationarity. Existing time series prediction methods generally fall into two categories: conventional models and deep learning models. Most conventional models, such as ARIMA19,20 and Holt-Winters21,22,23,24,25, are built with some insight into the time series but implemented linearly, causing problems for modeling non-stationary time series. Furthermore, these models rely on manually tuned parameters to fit the time series, which impedes their application in large-scale prediction scenarios. Although Prophet26 uses nonlinear modules and interpretable parameters to address these problems, its hand-crafted nonlinear modules struggle to model non-stationary time series, whose complex patterns make it inefficient to embed diverse factors in hand-crafted functions. This dilemma has boosted the development of deep learning methods. Deep learning models can utilize multiple layers to represent latent features at a higher and more abstract level27, enabling us to recognize deep latent patterns in non-stationary time series. Recurrent neural networks (RNNs) and Transformer networks are the two main deep learning forecasting frameworks. RNN-based models28,29,30,31,32,33,34 feature a feedback loop that allows models to memorize historical data and process variable-length sequences as inputs and outputs, which calculates the cumulative dependency between time steps. Nevertheless, such indirect modeling of temporal dependencies cannot disentangle information from different scales within historical data and thus fails to capture multi-scale variations within non-stationary time series. Transformer-based models35,36,37 solve this problem using a global self-attention mechanism rather than feedback loops. Doing so enhances the network's ability to capture longer dependencies and interactions within series data and thus brings exciting progress in various time series applications38. For more efficient long-term time series processing, some studies39,40,41 turn the self-attention mechanism into a sparse version. However, despite their promising long-term forecasting results, the specific characteristics of time series are not taken into account in their modeling process, and the varying distributions of non-stationary time series deteriorate their predictive performance. Recent research attempts to incorporate time series decomposition into deep learning models42,43,44,45,46,47. Although their results are encouraging and bring more interpretable and reasonable predictions, their limited decomposition, e.g., trend-seasonal decomposition, disregards the correlation between components and merely presents the variation of time series at particular scales.

In this work, we incorporate deep stationary processes into neural networks to achieve precise long-term 5G network traffic forecasts, where stochastic process theories can guarantee the prediction of stationary events48,49,50. Specifically, as shown in Fig. 1c, we develop a deep learning model, Diviner, that incorporates stationary processes into a well-designed hierarchical structure and models non-stationary time series with multi-scale stable features. To validate its effectiveness, we collect an extensive network port traffic (NPT) dataset from the intelligent metropolitan network delivering 5G services of China Unicom and compare the proposed model with numerous state-of-the-art methods over multiple applications. We make two distinct research contributions to time series forecasting: (1) We explore an avenue to solve the challenges presented in long-term time series prediction by modeling non-stationarity in the deep learning paradigm. This line is more universal and effective than previous works that incorporate temporal decomposition, whose limited decomposition merely presents the temporal variation at particular scales. (2) We develop a deep learning framework with a well-designed hierarchical structure to model the multi-scale stable regularities within non-stationary time series. In contrast to previous methods employing various modules in the same layer, we perform a dynamic scale transformation between different layers and model stable temporal dependencies in the corresponding layer. This hierarchical deep stationary process synchronizes with the cascading feature embedding of deep neural networks, which enables us to capture complex regularities contained in long-term histories and achieve precise long-term network traffic forecasting. Our experiments demonstrate that robustness and predictive accuracy improve significantly as we consider more factors concerning non-stationarity, which provides an avenue to improve the long-term forecasting ability of deep learning methods. We also show that modeling non-stationarity can help discover nonlinear latent regularities within network traffic and achieve quality long-term 5G network traffic forecasts for up to three months. Furthermore, we expand our solution to the climate, control, electricity, economic, energy, and transportation fields, demonstrating the applicability of this solution to multiple predictive scenarios and its valuable potential for solving broader engineering problems.

Results

Diviner with deep stationary processes

In this section, we introduce our proposed deep learning model, Diviner, which tackles the non-stationarity of long-term time series prediction with deep stationary processes, capturing multi-scale stable features and modeling multi-scale stable regularities to achieve long-term time series prediction.

Smoothing filter attention mechanism as a scale converter

As shown in Fig. 2a, the smoothing filter attention mechanism adjusts the feature scale and enables Diviner to model time series at different scales and access the multi-scale variation features within non-stationary time series. We build this component based on Nadaraya–Watson regression51,52, a classical algorithm for non-parametric regression. Given the sample space \(\Omega=\{(x_i,y_i)\mid 1\le i\le n,\ x_i\in\mathbb{R},\ y_i\in\mathbb{R}\}\), window size h, and kernel function \(K(\cdot)\), the Nadaraya–Watson regression has the following expression:

$$\hat{y}=\sum_{i=1}^{n}K\left(\frac{x-x_i}{h}\right)y_i\Big/\sum_{j=1}^{n}K\left(\frac{x-x_j}{h}\right),$$
(1)

where the kernel function \(K(\cdot)\) is subject to \(\int_{-\infty}^{\infty}K(x)\,dx=1\) and n, x, y denote the sample size, independent variable, and dependent variable, respectively.
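To make Eq. (1) concrete, here is a minimal NumPy sketch of Nadaraya–Watson regression with a Gaussian kernel; the kernel choice, window size, and toy data are our assumptions for illustration, not the learnable kernel used by Diviner.

```python
import numpy as np

def nadaraya_watson(x_query, x_samples, y_samples, h=1.0):
    """Estimate y at x_query as a kernel-weighted average of the samples (Eq. (1))."""
    # Gaussian kernel K(u) = exp(-u^2 / 2) / sqrt(2*pi), which integrates to 1.
    u = (x_query - x_samples) / h
    weights = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return np.sum(weights * y_samples) / np.sum(weights)

# Toy usage: recover a smooth trend from noisy observations.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = np.sin(x) + 0.3 * rng.standard_normal(x.size)
y_hat = np.array([nadaraya_watson(xq, x, y, h=0.5) for xq in x])
```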

Fig. 2: Illustration of the structure of smoothing filter attention mechanism and difference attention module.

a This panel displays the smoothing filter attention mechanism, which involves computing adaptive weights \(K(\boldsymbol{\xi}_i,\boldsymbol{\xi}_j)\) (orange block) and employing a self-masked structure (gray block with dashed lines) to filter out outliers, where \(\boldsymbol{\xi}_i\) denotes the ith embedded time series period (yellow block). The adaptive weights serve to adjust the feature scale of the input series and obtain the scale-transformed period embedding \(\mathbf{h}_i\) (pink block). b This diagram illustrates the difference attention module. The Matrix-Difference Transformation (pale blue block) subtracts adjacent columns of a matrix to obtain the shifted query, key, and value items (ΔQ, ΔK, and ΔV). Then, an autoregressive multi-head self-attention is performed (in the pale blue background) to capture the correlation of the time series across different time steps, resulting in \(\widetilde{\mathbf{V}}_s^{(i)}\) for the ith attention head. Here, \(\mathbf{Q}_s^{(i)}\), \(\mathbf{K}_s^{(i)}\), \(\mathbf{V}_s^{(i)}\), and \(\widetilde{\mathbf{V}}_s^{(i)}\) represent the query, key, value, and result items, respectively. The \(\mathrm{SoftMax}\) is applied to the scaled dot-product between the query and key vectors to obtain attention weights (pale yellow block). The formula for the \(\mathrm{SoftMax}\) function is \(\mathrm{SoftMax}(\mathbf{k}_i)=e^{\mathbf{k}_i}/\sum_{j=1}^{n}e^{\mathbf{k}_j}\), where \(\mathbf{k}_i\) is the ith element of the input vector and n is the length of the input vector. Lastly, the Matrix-CumSum operation (light orange block) accumulates the shifted features using the ConCat operation, and \(\mathbf{W}_s\) denotes the learnable aggregation parameters.

The Nadaraya–Watson regression estimates the regression value \(\hat{y}\) using a locally weighted average, where the weight of a sample \((x_i,y_i)\), \(K(\frac{x-x_i}{h})/\sum_{j=1}^{n}K(\frac{x-x_j}{h})\), decays with the distance of \(x_i\) from x. Consequently, the estimate relies primarily on samples in the vicinity of x. This process implies the basic notion of scale transformation, where adjacent samples get closer on a larger visual scale. Inspired by this idea, we reformulate the Nadaraya–Watson regression from the perspective of scale transformation and incorporate it into the attention structure to design a learnable scale adjustment unit. Concretely, we introduce the smoothing filter attention mechanism with a learnable kernel function and a self-masked operation, where the former shrinks (or magnifies) variations for adaptive feature-scale adjustment and the latter eliminates outliers. To ease understanding, we consider the 1D time series case here; the high-dimensional case can be extrapolated easily (shown mathematically in Section "Methods"). Given the time step \(t_i\), we estimate its regression value \(\hat{y}_i\) with an adaptively weighted average of the values \(\{y_t\mid t\ne t_i\}\), \(\hat{y}_i=\sum_{j\ne i}\alpha_j y_j\), where the adaptive weights α are obtained by a learnable kernel function f. The punctured window \(\{t_j\mid t_j\ne t_i\}\) of size n − 1 realizes our self-masked operation, and \(f(y_i,y)_{w_i}=\exp(w_i(y_i-y)^2)\), \(\alpha_i=f(y_i,y)_{w_i}/\sum_{j\ne i}f(y_j,y)_{w_i}\). The adaptive weights vary with the inner variation \(\{(y_i-y)^2\mid t_i\ne t\}\) (decreased or increased), which adjusts (shrinking or magnifying) the distance between points across time steps and achieves an adaptive feature-scale transformation. Specifically, a minor variation gets further shrunk at a large feature scale and magnified at a small feature scale, and vice versa. Concerning random components, the global attention serves as an average smoothing method that helps filter small perturbations. As for outliers, their large margin against regular items leads to minor weights, which eliminates their interference; in particular, when the sample \((t_i,y_i)\) is itself an outlier, this structure brushes it aside. Thus, the smoothing filter attention mechanism filters out random components and dynamically adjusts feature scales. In this way, we can dynamically transform the non-stationary time series according to different scales, giving a comprehensive view of the time series.
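As a minimal sketch of the 1D case just described (not the actual Diviner implementation), the following NumPy code applies the self-masked, adaptively weighted average with a fixed kernel parameter w; in Diviner, w is learnable.

```python
import numpy as np

def smoothing_filter_1d(y, w):
    """Self-masked, adaptively weighted smoothing of a 1D series (illustrative sketch).

    y : (n,) observed series; w : (n,) per-step kernel parameters (learnable in Diviner,
    fixed here for illustration). Returns the scale-adjusted estimate y_hat.
    """
    n = y.shape[0]
    y_hat = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i                  # self-masked (punctured) window
        f = np.exp(w[i] * (y[i] - y[mask]) ** 2)  # kernel f(y_i, y)_{w_i}
        alpha = f / f.sum()                       # adaptive weights
        y_hat[i] = np.sum(alpha * y[mask])        # weighted average excluding y_i itself
    return y_hat

# Usage: a negative w makes the weight decay with squared distance, so the
# far-away value 8.0 barely influences (and is itself smoothed toward) its neighbors.
y = np.array([1.0, 1.1, 0.9, 8.0, 1.05, 0.95])
print(smoothing_filter_1d(y, w=np.full(6, -1.0)))
```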

Difference attention module to discover stable regularities

The difference attention module calculates the internal connections among stable shifted features to discover stable regularities within the non-stationary time series and thereby overcomes the interference of uneven distributions. Concretely, as shown in Fig. 2b, this module places the difference and CumSum operations at the two ends of the self-attention mechanism35, which interconnects the shifts across time steps to capture internal connections within non-stationary time series. The difference operation separates the shifts from the long-term trends, where a shift refers to the minor difference in the trend between adjacent time steps. Since trends cause the data distribution to change over time, the difference operation makes the time series stable, varying around a fixed mean level with minor distribution shifts. Subsequently, we use a self-attention mechanism to interconnect the shifts, which captures the temporal dependencies within the time series variation. Lastly, we employ a CumSum operation to accumulate the shifted features and generate a non-stationary time series conforming to the discovered regularities.

Modeling and generating non-stationary time series in Diviner framework

The smoothing filter attention mechanism filters out random components and dynamically adjusts the feature scale. Subsequently, the difference attention module calculates internal connections and captures the stable regularity within the time series at the corresponding scale. By cascading these two modules, one Diviner block can discover stable regularities within non-stationary time series at one scale. We then stack Diviner blocks in a multilayer structure to form multi-scale transformation layers and capture multi-scale stable features from non-stationary time series. This multilayer structure is organized in an encoder-decoder architecture with asymmetric input lengths for efficient data utilization. The encoder takes a long historical series to embed trends, and the decoder receives a relatively short time series. With the cross-attention between the encoder and decoder, we can pair the latest time series with pertinent variation patterns from the long historical series and make inferences about future trends, improving calculation efficiency and reducing redundant historical information. The point is that the latest time series is more conducive to anticipating the immediate future than the remote-past time series, since the correlation across time steps generally degrades with the length of the interval53,54,55,56,57. Additionally, we design a generator to obtain the prediction results in one step and thereby avoid dynamic cumulative error problems39. The generator is built with a ConvNet sharing parameters across time steps, based on the linear projection generator39,58,59, which saves hardware resources. These techniques enable deep learning methods to model non-stationary time series with multi-scale stable features and produce forecasting results in a generative paradigm, which is an attempt to tackle long-term time series prediction problems.

Performance of the 5G network traffic forecasting

To validate the effectiveness of the proposed techniques, we collect an extensive NPT dataset from China Unicom. The NPT dataset includes data recorded every 15 minutes for the whole of 2021 from three groups of real-world metropolitan network traffic ports {NPT-1, NPT-2, NPT-3}, where the sub-datasets contain {18, 5, 5} ports, respectively. We split them chronologically with a 9:1 proportion for training and testing. In addition, we prepare 16 network ports for parameter searching. The main difficulties lie in the explicit shift of the distribution and numerous outliers. This section elaborates on a comprehensive comparison of our model with prediction-based and growth-rate-based models in 5G network traffic forecasting.

Experiment 1

We first compare Diviner with other time series prediction-based methods, which we denote as Baselines-T for clarity. Baselines-T include the traditional models ARIMA19,20 and Prophet26; the classic machine learning model LSTMa60; and the deep learning-based models Transformer35, Informer39, Autoformer42, and NBeats61. These models are required to predict the entire network traffic series {1, 3, 7, 14, 30} days ahead, aligned with the {96, 288, 672, 1344, 2880} prediction spans in Table 1, with inbits as the target feature. In terms of evaluation, although the MSE, MAE, and MASE errors generally grow with the prediction interval, the degradation rate varies between models. Therefore, we introduce an exponential velocity indicator to measure the rate of accuracy degradation. Specifically, given time spans [t1, t2] and the corresponding MSE, MAE, and MASE errors, we have the following:

$${\mathrm{dMSE}}_{t_1}^{t_2}=\left(\sqrt[t_2-t_1]{{\mathrm{MSE}}_{t_2}/{\mathrm{MSE}}_{t_1}}-1\right)\times 100\%,$$
(2)
$${\mathrm{dMAE}}_{t_1}^{t_2}=\left(\sqrt[t_2-t_1]{{\mathrm{MAE}}_{t_2}/{\mathrm{MAE}}_{t_1}}-1\right)\times 100\%,$$
(3)
$${\mathrm{dMASE}}_{t_1}^{t_2}=\left(\sqrt[t_2-t_1]{{\mathrm{MASE}}_{t_2}/{\mathrm{MASE}}_{t_1}}-1\right)\times 100\%,$$
(4)

where \({\mathrm{dMSE}}_{t_1}^{t_2},{\mathrm{dMAE}}_{t_1}^{t_2},{\mathrm{dMASE}}_{t_1}^{t_2}\in\mathbb{R}\). Given the close experimental results among {NPT-1, NPT-2, NPT-3}, we focus mainly on the NPT-1 dataset; the experimental results are summarized in Table 1. Although the NPT dataset contains numerous outliers and frequent oscillations, Diviner achieves a 38.58% average MSE reduction (0.451 → 0.277) and a 20.86% average MAE reduction (0.465 → 0.368) over the prior art. In terms of scalability to different prediction spans, Diviner has a much lower \({\mathrm{dMSE}}_{1}^{30}\) (4.014% → 0.750%) and \({\mathrm{dMAE}}_{1}^{30}\) (2.343% → 0.474%) than the prior art, exhibiting only slight performance degradation and a substantial improvement in predictive robustness as the prediction horizon lengthens. The degradation rates and predictive performance of all baseline approaches are provided in Supplementary Table S1 owing to space limitations.
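For reference, the degradation indicator of Eqs. (2)-(4) amounts to the (t2 − t1)-th root of the error ratio minus one; a small helper (function name and example values are ours) reads as follows.

```python
def degradation_rate(err_t1: float, err_t2: float, t1: float, t2: float) -> float:
    """Exponential degradation rate in % between horizons t1 and t2 (Eqs. (2)-(4)).

    err_t1 and err_t2 are the same error metric (MSE, MAE, or MASE) at the two horizons.
    """
    return ((err_t2 / err_t1) ** (1.0 / (t2 - t1)) - 1.0) * 100.0

# Hypothetical reading: an error growing from 0.25 at 1 day to 0.31 at 30 days
# corresponds to an average degradation of roughly 0.74 % per day.
print(degradation_rate(0.25, 0.31, t1=1, t2=30))
```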

Table 1 Time-series forecasting results on the 5G traffic network dataset.

The experiments on NPT-2 and NPT-3 shown in Supplementary Data 1 reproduce the above results: Diviner supports accurate long-term network traffic prediction and exceeds the current art in both accuracy and robustness by a large margin. In addition, sorting the comprehensive performance (measured by the average MASE error) of the baselines built on the Transformer framework yields: Diviner > Autoformer > Transformer > Informer. This order aligns with the non-stationary factors considered in these models and verifies our proposal that incorporating non-stationarity promotes neural networks' adaptive ability to model time series, and that modeling multi-scale non-stationarity further breaks through the ceiling of prediction ability for deep learning models.

Experiment 2

The second experiment compares Diviner with two industrial methods that predict the capacity utilization of inbits and outbits from historical growth rates. The experiment shares the same network port traffic data as Experiment 1, but the split ratio is changed to 3:1 chronologically for a longer prediction horizon. Furthermore, we use a long construction cycle of {30, 60, 90} days (aligned with {2880, 5760, 8640} time steps) to ensure the validity of such growth-rate-based methods under the law of large numbers. Here we first define capacity utilization mathematically:

Given a fixed bandwidth \(B\in\mathbb{R}\) and the traffic flow of the kth construction cycle \(\widetilde{\mathbf{X}}(k)=\left[\tilde{\mathbf{x}}_{kC+1}\ \tilde{\mathbf{x}}_{kC+2}\ \ldots\ \tilde{\mathbf{x}}_{(k+1)C}\right]\), \(\widetilde{\mathbf{X}}(k)\in\mathbb{R}^{T\times C}\), where \(\tilde{\mathbf{x}}_i\in\mathbb{R}^{T}\) is a column vector of length T representing the time series per day and C denotes the number of days in one construction cycle. Then the capacity utilization (CU) of the kth construction cycle is defined as follows:

$$\mathrm{CU}(k)=\frac{\|\widetilde{\mathbf{X}}(k)\|_{m1}}{BCT},$$
(5)

where \(\mathrm{CU}(k)\in\mathbb{R}\). As the definition shows, capacity utilization is directly related to network traffic, so a precise network traffic prediction leads to a quality prediction of capacity utilization. We compare the proposed predictive method with two moving-average growth-rate methods commonly used in industry: an additive method and a multiplicative one. For clarity, we denote the additive method as Baseline-A and the multiplicative method as Baseline-M. Baseline-A calculates an additive growth rate from the difference between adjacent construction cycles. Given the capacity utilization of the last two construction cycles, CU(k − 1) and CU(k − 2), we have the following:

$$\widehat{\mathrm{CU}}_{A}(k)=2\,\mathrm{CU}(k-1)-\mathrm{CU}(k-2).$$
(6)

Baseline-M calculates a multiplicative growth rate with the quotient of adjacent construction cycles. Given the capacity utilization of the last two construction cycles CU(k − 1), CU(k − 2), we have the following:

$$\widehat{\mathrm{CU}}_{M}(k)=\frac{\mathrm{CU}(k-1)}{\mathrm{CU}(k-2)}\,\mathrm{CU}(k-1).$$
(7)

Different from the above two baselines, we calculate the capacity utilization of the network with the network traffic forecast. Given the network traffic of the last K construction cycles \(\widetilde{\mathbf{X}}=\left[\tilde{\mathbf{x}}_{(k-K)C+1}\ \ldots\ \tilde{\mathbf{x}}_{(k-K+1)C}\ \ldots\ \tilde{\mathbf{x}}_{(k-1)C}\ \ldots\ \tilde{\mathbf{x}}_{kC}\right]\), we have the following:

$$\widetilde{\mathbf{X}}(k)=\mathrm{Diviner}(\widetilde{\mathbf{X}}),$$
(8)
$$\widehat{\mathrm{CU}}_{D}(k)=\frac{\|\widetilde{\mathbf{X}}(k)\|_{m1}}{BCT}.$$
(9)
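For reference, a short sketch of Eqs. (5)-(7); variable names and the example numbers are ours, and the entrywise l1 norm is assumed for \(\|\cdot\|_{m1}\).

```python
import numpy as np

def capacity_utilization(traffic: np.ndarray, bandwidth: float) -> float:
    """CU = ||X(k)||_{m1} / (B*C*T) for one cycle of traffic with shape (T, C) (Eq. (5))."""
    T, C = traffic.shape
    return np.abs(traffic).sum() / (bandwidth * C * T)

def baseline_additive(cu_prev1: float, cu_prev2: float) -> float:
    """Baseline-A: extrapolate with the additive growth of the last two cycles (Eq. (6))."""
    return 2.0 * cu_prev1 - cu_prev2

def baseline_multiplicative(cu_prev1: float, cu_prev2: float) -> float:
    """Baseline-M: extrapolate with the multiplicative growth of the last two cycles (Eq. (7))."""
    return cu_prev1 / cu_prev2 * cu_prev1

# Hypothetical usage: CU grew from 0.40 to 0.44 over the last two cycles.
print(baseline_additive(0.44, 0.40))        # 0.48
print(baseline_multiplicative(0.44, 0.40))  # 0.484

# The Diviner-based forecast (Eqs. (8)-(9)) applies the same CU formula to the
# predicted traffic matrix for cycle k, e.g. capacity_utilization(X_pred, bandwidth).
```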

We summarize the experimental results in Table 2. Given the close experimental results among {NPT-1, NPT-2, NPT-3}, we focus mainly on the NPT-1 dataset, which has the most network traffic ports. Diviner achieves a substantial 31.67% MAE reduction (0.846 → 0.578) on inbits and a 24.25% MAE reduction (0.944 → 0.715) on outbits over Baseline-A. An intuitive explanation is that the growth-rate-based methods extract particular historical features but lack adaptability. We notice that Baseline-A performs much better than Baseline-M, with 0.045× the average inbits MAE and 0.074× the average outbits MAE. This result suggests that network traffic tends to increase linearly rather than exponentially. Nevertheless, inherent multi-scale variations remain in the network traffic series, so Diviner still exceeds Baseline-A, suggesting the necessity of applying deep learning models such as Diviner to discover nonlinear latent regularities within network traffic.

Table 2 Long-term (1–3 months) capacity utilization forecasting results on the NPT dataset.

Analyzing the results of these two experiments jointly, we find that Diviner exhibits a relatively low degradation rate for a 90-day prediction, \({\mathrm{dMASE}}_{1}^{90}=1.034\%\). In contrast, the degradation rate of the prior art reaches \({\mathrm{dMASE}}_{1}^{30}=2.343\%\) for a three-times shorter prediction horizon of 30 days. Furthermore, considering the diverse network traffic patterns in the provided datasets (about 50 ports), the proposed method can deal with a wide range of non-stationary time series, validating its applicability without modification. These experiments demonstrate Diviner's success in providing quality long-term network traffic forecasting and extending the effective prediction span of deep learning models to up to three months.

Application on other real-world datasets

We validate our method on benchmark datasets for weather (WTH), electricity transformer temperature (ETT), electricity (ECL), and exchange (Exchange). We summarize the experimental results in Table 3. We follow the standard protocol and divide each dataset into training, validation, and test sets in chronological order with a proportion of 7:1:2 unless otherwise specified. Owing to space limitations, the complete experimental results are shown in Supplementary Data 2.

Table 3 Time-series forecasting results on other real-world datasets.

Weather temperature prediction

The WTH dataset42 records 21 meteorological indicators for Jena in 2020, including air temperature and humidity, with WetBulbFarenheit as the target. This dataset is finely quantified to the 10-min level, which means 144 steps per day and 4320 steps per month, challenging the models' capacity to process long sequences. Among all baselines, NBeats and Informer have the lowest errors in terms of the MSE and MAE metrics, respectively. However, we notice a contrast between these two models when extending the prediction span. Informer degrades precipitously when the prediction span increases from 2016 to 4032 steps (MAE: 0.417 → 0.853), whereas NBeats gains a performance improvement (MAE: 0.635 → 0.434). We attribute this to a trade-off between pursuing context and texture. Informer has an advantage in texture in the short-term case but struggles to capture the context dependency of the series, considering that the length of the input history series should extend in pace with the prediction span, and vice versa. Diviner achieves a remarkable 29.30% average MAE reduction (0.488 → 0.345) and a 41.54% average MSE reduction (0.491 → 0.287) over both Informer and NBeats. Additionally, Diviner attains low degradation rates of \({\mathrm{dMSE}}_{1}^{30}=0.439\%\) and \({\mathrm{dMAE}}_{1}^{30}=0.167\%\), showing its ability to harness the historical information within the time series. The predictive performance and degradation rates of all baseline approaches are provided in Supplementary Table S2. Our model can synthesize context and texture to balance both short-term and long-term cases, ensuring accurate and robust long-term prediction.

Electricity transformer temperature prediction

The ETT dataset contains two years of data with six power load features from two counties in China, and the oil temperature is our target. The training/validation/test split is 12/4/4 months39. The ETT dataset is divided into separate datasets at the 1-hour level {ETTh1, ETTh2} and the 15-minute level (ETTm1). Therefore, we can study the performance of the models under different granularities, where the prediction steps {96, 288, 672} of ETTm1 align with the prediction steps {24, 48, 168} of ETTh1. Our experiments show that Diviner achieves the best performance in both cases. In the hour-level case, Diviner outperforms the baselines, with Autoformer the closest (MSE: 0.110 → 0.082, MAE: 0.247 → 0.216). When the hour-level granularity turns into the minute-level case, Diviner outperforms Autoformer by a large margin (MSE: 0.092 → 0.064, MAE: 0.239 → 0.194). The predictive performance of all baseline approaches at both granularities, and the change between them, is provided in Supplementary Table S3. These results demonstrate the capacity of Diviner to process time series of different granularities. Furthermore, granularity is also a manifestation of scale, so these results also demonstrate that modeling multi-scale features is conducive to dealing with time series of different granularities.

Consumer electricity consumption prediction

The ECL dataset records the two-year electricity consumption of 321 clients, converted into hour-level consumption owing to missing data, with MT-320 as the target feature62. We predict time horizons of {7, 14, 30, 40} days, aligned with {168, 336, 720, 960} prediction steps ahead. We analyze the experimental results according to the prediction span (below 360 steps as short-term prediction, above 360 as long-term prediction). NBeats achieves the best forecasting performance for short-term electricity consumption prediction, while Diviner surpasses it in the long-term case. The short-term and long-term performance of all approaches is provided in Supplementary Table S4. Statistically, the proposed method outperforms the best baseline (NBeats) by decreasing MSE by 17.43% (0.367 → 0.303) and MAE by 15.14% (0.482 → 0.409) at 720 steps ahead, and MSE by 6.56% (0.457 → 0.427) and MAE by 9.44% (0.540 → 0.489) at 960 steps ahead. We attribute this to scalability: different models converge to similar performance in the short-term case, but their differences emerge when the prediction span becomes longer.

Gold price prediction

The Exchange dataset contains five years of daily closing prices of a troy ounce of gold in the US, recorded from 2016 to 2021. Due to the high-frequency fluctuation of the market price, the predictive goal is to forecast its general trend reasonably (https://www.lbma.org.uk). To this end, we perform long-term predictions of {10, 20, 30, 60} days. The experimental results show apparent performance degradation for most baseline models. Given a history of 90 days, only Autoformer and Diviner can predict with MAE and MSE errors lower than 1 when the prediction span is 60 days. Diviner still outperforms the other methods with a 38.94% average MSE reduction (0.588 → 0.359) and a 22.73% average MAE reduction (0.607 → 0.469), achieving the best forecast performance. The predictive performance of all baseline approaches is provided in Supplementary Table S5. These results indicate the adaptability of Diviner to the rapid evolution of financial markets and its reasonable extrapolation, considering that financial systems are generally difficult to predict.

Solar energy production prediction

The Solar dataset contains 10-minute-level solar power production data for one year (2006) from 137 PV plants in Alabama, with PV-136 as the target feature (http://www.nrel.gov). Given that the amount of solar energy produced daily is generally stable, conducting a super-long-term prediction is unnecessary. Therefore, we set the prediction horizon to {1, 2, 5, 6} days, aligned with {144, 288, 720, 864} prediction steps ahead. Furthermore, this characteristic of solar energy means that its production series tends to be stationary, and the comparison of predictive performance between different models on this dataset therefore reflects their basic series modeling abilities. Concretely, since the MASE error can be used to assess a model's performance across different series, we calculate and sort each model's average MASE error under the different prediction horizon settings to measure time series modeling ability (provided in Supplementary Table S6). The results are as follows: Diviner > NBeats > Transformer > Autoformer > Informer > LSTM, where Diviner surpasses all Transformer-based models among the selected baselines. Given that the series data are not markedly non-stationary, the advantages of Autoformer's modeling of time series non-stationarity are not apparent, whereas capturing stable long- and short-term dependencies remains effective.

Road occupancy rate prediction

The Traffic dataset contains hourly road occupancy rates over two years (2015–2016) collected from 862 sensors on San Francisco Bay Area freeways by the California Department of Transportation, with sensor-861 as the target feature (http://pems.dot.ca.gov). The prediction horizon is set to {7, 14, 30, 40} days, aligned with {168, 336, 720, 960} prediction steps ahead. Since the road occupancy rate tends to have a weekly cycle, we use this dataset to compare different networks' ability to model temporal cycles. In the comparison, we focus on two groups of deep learning models: group-1 takes the non-stationary specialization of time series into account (Diviner, Autoformer), and group-2 does not employ any time-series-specific components (Transformer, Informer, LSTMa). We find that group-1 gains a significant performance improvement over group-2, which suggests the necessity of modeling non-stationarity. The proposed Diviner model achieves a 27.64% MAE reduction (0.604 → 0.437) over the Transformer model when forecasting 30-day road occupancy rates. We then conduct an intra-group comparison for group-1, where Diviner still gains an average 35.37% MAE reduction (0.523 → 0.338) over Autoformer. The predictive performance of all approaches is provided in Supplementary Table S7. We attribute this to Diviner's multi-scale modeling of non-stationarity, while the trend-seasonal decomposition of Autoformer merely reflects time series variation at particular scales. These experimental results demonstrate that Diviner is competent in predicting time series data with cycles.

Discussion

We study the long-term 5G network traffic prediction problem by modeling non-stationarity with deep learning techniques. Although some early literature63,64,65 argues that a probabilistic traffic forecast under uncertainty is more suitable for varying network traffic than a concrete forecast produced by time series models, probabilistic and concrete traffic forecasts share the same historical information in essence. Moreover, the development of time series forecasting techniques in recent years has witnessed a series of works employing them for practical applications such as bandwidth management14,15, resource allocation16, and resource provisioning17, where time series prediction-based methods can provide detailed network traffic forecasts. However, existing time series forecasting methods suffer severe performance degeneration because the long-term prediction horizon exposes the non-stationarity of time series, which raises several challenges: (a) multi-scale temporal variations, (b) random factors, and (c) data distribution shift.

Therefore, this paper tackles the problem of achieving precise long-term predictions for non-stationary time series. We start from the fundamental property of time series non-stationarity and introduce deep stationary processes into a neural network, which models multi-scale stable regularities within non-stationary time series. We argue that capturing stable features is a recipe for generating non-stationary forecasts conforming to historical regularities: the stable features enable networks to restrict the latent space of the time series, which deals with the varying distribution problem. Extensive experiments on network traffic prediction and other real-world scenarios demonstrate its advances over existing prediction-based models. Its advantages are summarized as follows. (a) Diviner brings a salient improvement in both long- and short-term prediction and achieves state-of-the-art performance. (b) Diviner performs robustly regardless of the choice of prediction span and granularity, showing great potential for long-term forecasting. (c) Diviner maintains strong generalization across various fields. The performance of most baselines degrades precipitously in one area or another, whereas our model distinguishes itself through consistent performance on each benchmark.

This work explores an avenue to obtain detailed and precise long-term 5G network traffic forecasts, which can be used to estimate when network traffic might overflow the capacity and helps operators formulate network construction schemes months in advance. Furthermore, Diviner generates long-term network traffic forecasts at the minute level, facilitating broader applications in resource provisioning, allocation, and monitoring. Decision-makers can harness long-term predictions to allocate and optimize network resources. Another practical application is an automatic network status monitoring system, which raises an alarm when real network traffic exceeds a permitted range around the predictions. This system supports targeted port-level early warning and assists workers in troubleshooting in time, which can bring substantial efficiency improvements considering the tens of millions of network ports running online. Beyond 5G networks, we have expanded our solution to broader engineering fields such as electricity, climate, control, economics, energy, and transportation. Predicting oil temperature can help prevent the transformer from overheating, which affects the insulation life of the transformer, and thereby ensures proper operation66,67. In addition, long-term meteorological prediction helps with crop selection and seeding in agriculture. As such, we can discover unnoticed regularities within historical series data, which might bring opportunities to traditional industries.

One limitation of our proposed model is that it suffers from critical transitions in data patterns. We attribute this to external factors, whose information is generally not included in the measured data53,55,68. Our method is helpful for discovering intrinsic regularities within the time series but cannot predict patterns not previously recorded in the real world. Alternatively, dynamic network methods69,70,71 can be used to detect such critical transitions in the time series53. Furthermore, the performance of Diviner might be similar to that of other deep learning models when given only a short history or in the short-term prediction case: the former contains insufficient information to be exploited, and the latter offers little room for differences in scalability to emerge, whereas the advantages of our model become apparent in long-term forecasting scenarios.

Methods

Preliminaries

We denote the original form of the time-series data as \(\mathbf{X}=[x_1\ x_2\ \ldots\ x_n]\), \(x_i\in\mathbb{R}\). The original time series data X is reshaped to a matrix form as \(\widetilde{\mathbf{X}}=[\tilde{\mathbf{x}}_1\ \tilde{\mathbf{x}}_2\ \ldots\ \tilde{\mathbf{x}}_K]\), where \(\tilde{\mathbf{x}}_i\) is a vector of length T with the time series data per day/week/month/year, K denotes the number of days/weeks/months/years, and \(\tilde{\mathbf{x}}_i\in\mathbb{R}^{T}\). After that, we can represent the seasonal pattern as \(\tilde{\mathbf{x}}_i\) and use its variation between adjacent time steps to model trends, shown as follows:

$$\begin{aligned}\tilde{\mathbf{x}}_{t_2}&=\tilde{\mathbf{x}}_{t_1}+\sum_{t=t_1}^{t_2-1}\Delta\tilde{\mathbf{s}}_t,\\ \Delta\tilde{\mathbf{s}}_t&=\tilde{\mathbf{x}}_{t+1}-\tilde{\mathbf{x}}_t,\end{aligned}$$
(10)

where \(\Delta\tilde{\mathbf{s}}_t\in\mathbb{R}^{T}\) denotes the change of the seasonal pattern. The shift reflects the variation between small time steps, but when such variation (shift) accumulates over a sufficiently long period, the trend emerges; it can be obtained as \(\sum_{t=t_1}^{t_2-1}\Delta\tilde{\mathbf{s}}_t\). Therefore, we can model trends by capturing the long- and short-range dependencies of shifts across different time steps.

Next, we introduce a smoothing filter attention mechanism to construct multi-scale transformation layers. A difference attention module is then mounted to capture and interconnect the shifts at the corresponding scale. These mechanisms enable Diviner to capture multi-scale variations in non-stationary time series; the mathematical description is given below.

Diviner input layer

Given the time series data X, we transform X into \(\widetilde{\mathbf{X}}=[\tilde{\mathbf{x}}_1\ \tilde{\mathbf{x}}_2\ \ldots\ \tilde{\mathbf{x}}_K]\), where \(\tilde{\mathbf{x}}_i\in\mathbb{R}^{T}\) is a vector of length T with the time series data per day (seasonal), K denotes the number of days, and \(\widetilde{\mathbf{X}}\in\mathbb{R}^{T\times K}\). We then construct the dual input for Diviner. Since Diviner adopts an encoder-decoder architecture, we construct \(\mathbf{X}_{en}^{in}\) for the encoder and \(\mathbf{X}_{de}^{in}\) for the decoder, where \(\mathbf{X}_{en}^{in}=[\tilde{\mathbf{x}}_1\ \tilde{\mathbf{x}}_2\ \ldots\ \tilde{\mathbf{x}}_K]\), \(\mathbf{X}_{de}^{in}=[\tilde{\mathbf{x}}_{K-K_{de}+1}\ \tilde{\mathbf{x}}_{K-K_{de}+2}\ \ldots\ \tilde{\mathbf{x}}_K]\), \(\mathbf{X}_{en}^{in}\in\mathbb{R}^{T\times K}\), and \(\mathbf{X}_{de}^{in}\in\mathbb{R}^{T\times K_{de}}\). That is, \(\mathbf{X}_{en}^{in}\) takes all elements of \(\widetilde{\mathbf{X}}\), while \(\mathbf{X}_{de}^{in}\) takes only the latest \(K_{de}\) elements. A fully connected layer applied to \(\mathbf{X}_{en}^{in}\) and \(\mathbf{X}_{de}^{in}\) then yields \(\mathbf{E}_{en}^{in}\) and \(\mathbf{E}_{de}^{in}\), where \(\mathbf{E}_{en}^{in}\in\mathbb{R}^{d_m\times K}\), \(\mathbf{E}_{de}^{in}\in\mathbb{R}^{d_m\times K_{de}}\), and \(d_m\) denotes the model dimension.
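A PyTorch-style sketch of the dual-input construction and day-wise embedding described above; the module name, the row-major (K, d_m) layout (the transpose of the text's column convention), and the sizes in the usage example are our assumptions.

```python
import torch
import torch.nn as nn

class DivinerInputLayer(nn.Module):
    """Reshape a 1D series into daily segments and embed each day into d_m dimensions."""

    def __init__(self, steps_per_day: int, d_model: int, k_decoder: int):
        super().__init__()
        self.T, self.K_de = steps_per_day, k_decoder
        self.embed = nn.Linear(steps_per_day, d_model)   # fully connected layer per day

    def forward(self, x: torch.Tensor):
        # x: (K * T,) raw series -> X_tilde: (K, T), one row per day.
        X = x.reshape(-1, self.T)
        E_en = self.embed(X)                 # encoder input: all K days      -> (K, d_m)
        E_de = self.embed(X[-self.K_de:])    # decoder input: latest K_de days -> (K_de, d_m)
        return E_en, E_de

# Usage with 15-minute sampling (96 steps/day), 90 days of history, hypothetical sizes.
layer = DivinerInputLayer(steps_per_day=96, d_model=512, k_decoder=28)
E_en, E_de = layer(torch.randn(90 * 96))
```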

Smoothing filter attention mechanism

Inspired by the Nadaraya-Watson regression51,52, which brings adjacent points closer together, we introduce the smoothing filter attention mechanism with a learnable kernel function and a self-masked architecture, where the former brings similar items closer to filter out the random component and adjust the non-stationary data into stable features, and the latter reduces outliers. The smoothing filter attention mechanism operates on the input \(\mathbf{E}=[\boldsymbol{\xi}_1\ \boldsymbol{\xi}_2\ \ldots\ \boldsymbol{\xi}_{K_{in}}]\), where \(\boldsymbol{\xi}_i\in\mathbb{R}^{d_m}\) and E is the general reference to the input of each layer, with \(K_{in}=K\) for the encoder and \(K_{in}=K_{de}\) for the decoder. Specifically, \(\mathbf{E}_{en}^{in}\) and \(\mathbf{E}_{de}^{in}\) are, respectively, the inputs of the first encoder and decoder layers. The calculation process is as follows:

$$\boldsymbol{\eta}_i=\frac{\sum_{j\ne i}\mathrm{K}(\boldsymbol{\xi}_i,\boldsymbol{\xi}_j)\odot\boldsymbol{\xi}_j}{\sum_{j\ne i}\mathrm{K}(\boldsymbol{\xi}_i,\boldsymbol{\xi}_j)},$$
(11)
$$\mathrm{K}(\boldsymbol{\xi}_i,\boldsymbol{\xi}_j)=\exp\!\left(\mathbf{w}_i\odot(\boldsymbol{\xi}_i-\boldsymbol{\xi}_j)^2\right),$$
(12)

where \(\mathbf{w}_i\in\mathbb{R}^{d_m}, i\in[1,K_{in}]\) denotes the learnable parameters, \(\odot\) denotes the element-wise product, and \((\cdot)^2\) denotes the element-wise square (the square of a vector here is taken element-wise). To simplify the representation, we denote the smoothing filter attention mechanism as Smoothing-Filter(E) and its output as \(\mathbf{H}_s\). Before introducing our difference attention module, we first define the difference operation on a matrix and its inverse operation, CumSum.
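Before that, a PyTorch sketch of Eqs. (11)-(12); the per-position learnable kernel parameters and the self-mask follow the description above, whereas the initialization, the row-major layout (rows as periods), and the numerical safeguard are our assumptions.

```python
import torch
import torch.nn as nn

class SmoothingFilterAttention(nn.Module):
    """Self-masked, learnable-kernel smoothing over period embeddings (Eqs. (11)-(12))."""

    def __init__(self, k_in: int, d_model: int):
        super().__init__()
        # One kernel parameter vector w_i per position, initialized negative so the
        # kernel decays with the squared difference (an assumption of this sketch).
        self.w = nn.Parameter(-torch.ones(k_in, d_model))

    def forward(self, E: torch.Tensor) -> torch.Tensor:
        # E: (K_in, d_m). Pairwise element-wise squared differences: (K_in, K_in, d_m).
        diff2 = (E.unsqueeze(1) - E.unsqueeze(0)) ** 2
        K = torch.exp(self.w.unsqueeze(1) * diff2)                 # K(xi_i, xi_j), element-wise
        # Self-masked operation: zero out j == i so each position ignores itself.
        mask = 1.0 - torch.eye(E.size(0), device=E.device).unsqueeze(-1)
        K = K * mask
        H = (K * E.unsqueeze(0)).sum(dim=1) / K.sum(dim=1).clamp_min(1e-8)
        return H                                                   # (K_in, d_m) scale-adjusted features

# Usage on hypothetical encoder embeddings (K_in = 90 days, d_m = 512).
H_s = SmoothingFilterAttention(k_in=90, d_model=512)(torch.randn(90, 512))
```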

Difference and CumSum operation

Given a matrix \(\mathbf{M}\in\mathbb{R}^{m\times n}\), \(\mathbf{M}=[\mathbf{m}_1\ \mathbf{m}_2\ \ldots\ \mathbf{m}_n]\), the difference of M is defined as:

$$\Delta\mathbf{M}=\left[\Delta\mathbf{m}_1\ \Delta\mathbf{m}_2\ \ldots\ \Delta\mathbf{m}_n\right],$$
(13)

where \(\Delta\mathbf{m}_i=\mathbf{m}_{i+1}-\mathbf{m}_i\), \(\Delta\mathbf{m}_i\in\mathbb{R}^{m}\), \(i\in[1,n)\), and we pad \(\Delta\mathbf{m}_n\) with \(\Delta\mathbf{m}_{n-1}\) to keep a fixed length before and after the difference operation. The CumSum operation Σ applied to M is defined as:

$$\Sigma\mathbf{M}=\left[\Sigma\mathbf{m}_1\ \Sigma\mathbf{m}_2\ \ldots\ \Sigma\mathbf{m}_n\right],$$
(14)

where \(\Sigma\mathbf{m}_i=\sum_{j=1}^{i}\mathbf{m}_j\), \(\Sigma\mathbf{m}_i\in\mathbb{R}^{m}\). The differential attention module, intuitively, can be seen as an attention mechanism plugged between these two operations, mathematically described as follows.
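Before detailing that module, the two operations just defined can be written in a few lines; this sketch assumes a column-major (m, n) layout and illustrates that CumSum inverts the difference up to the first column.

```python
import torch

def difference(M: torch.Tensor) -> torch.Tensor:
    """Column-wise difference with the last column padded (Eq. (13)); M is (m, n)."""
    d = M[:, 1:] - M[:, :-1]                 # Delta m_i = m_{i+1} - m_i
    return torch.cat([d, d[:, -1:]], dim=1)  # pad Delta m_n with Delta m_{n-1}

def cumsum(M: torch.Tensor) -> torch.Tensor:
    """Column-wise CumSum (Eq. (14)); inverse of the difference up to the first column."""
    return torch.cumsum(M, dim=1)

# Sanity check on a random matrix: cumsum(difference(M)) recovers M shifted by its
# first column, once the padded last column is dropped.
M = torch.randn(4, 6)
recon = cumsum(difference(M))[:, :-1]
assert torch.allclose(recon, (M - M[:, :1])[:, 1:], atol=1e-6)
```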

Differential attention module

The input of this module involves three elements: Q, K, and V. The triple (Q, K, V) differs between the encoder and decoder: it is \((\mathbf{H}_s^{en},\mathbf{H}_s^{en},\mathbf{H}_s^{en})\) for the encoder and \((\mathbf{H}_s^{de},\mathbf{E}_{en}^{out},\mathbf{E}_{en}^{out})\) for the decoder, where \(\mathbf{E}_{en}^{out}\) is the embedded result of the final encoder block (assigned in the pseudo-code), \(\mathbf{H}_s^{en}\in\mathbb{R}^{d_m\times K}\), \(\mathbf{H}_s^{de}\in\mathbb{R}^{d_m\times K_{de}}\), and \(\mathbf{E}_{en}^{out}\in\mathbb{R}^{d_m\times K}\).

$$\mathbf{Q}_s^{(i)},\mathbf{K}_s^{(i)},\mathbf{V}_s^{(i)}=\mathbf{W}_q^{(i)}\Delta\mathbf{Q}+\mathbf{b}_q^{(i)},\ \mathbf{W}_k^{(i)}\Delta\mathbf{K}+\mathbf{b}_k^{(i)},\ \mathbf{W}_v^{(i)}\Delta\mathbf{V}+\mathbf{b}_v^{(i)},$$
(15)
$$\widetilde{\mathbf{V}}_s^{(i)}=\mathbf{V}_s^{(i)}\cdot\mathrm{SoftMax}\left(\frac{\mathbf{Q}_s^{(i)\top}\cdot\mathbf{K}_s^{(i)}}{\sqrt{d_m}}\right),$$
(16)
$$\mathbf{D}=\Sigma\left(\mathbf{W}_s\left[\widetilde{\mathbf{V}}_s^{(1)\top}\ \widetilde{\mathbf{V}}_s^{(2)\top}\ \ldots\ \widetilde{\mathbf{V}}_s^{(h)\top}\right]^{\top}\right),$$
(17)

where \(\mathbf{W}_q^{(i)}\in\mathbb{R}^{d_a\times d_m}\), \(\mathbf{W}_k^{(i)}\in\mathbb{R}^{d_a\times d_m}\), \(\mathbf{W}_v^{(i)}\in\mathbb{R}^{d_a\times d_m}\), \(\mathbf{W}_s\in\mathbb{R}^{d_m\times hd_a}\), \(\mathbf{D}\in\mathbb{R}^{d_m\times K}\), \(i\in[1,h]\), and h denotes the number of parallel attention heads. \([\,\cdot\,]\) denotes matrix concatenation, \(\widetilde{\mathbf{V}}_s^{(i)}\) denotes the deep shift, and D denotes the deep trend. We denote the differential attention module as Differential-attention(Q, K, V) to ease representation.
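A condensed single-head PyTorch sketch of Eqs. (15)-(17); the multi-head case concatenates h such results before W_s. The row-major layout, the helper name, and the head dimension are our assumptions.

```python
import torch
import torch.nn as nn

def diff_time(X: torch.Tensor) -> torch.Tensor:
    """Difference along the time (row) axis, padding the last row (cf. Eq. (13))."""
    d = X[1:] - X[:-1]
    return torch.cat([d, d[-1:]], dim=0)

class DifferentialAttention(nn.Module):
    """Single-head sketch of the differential attention module (Eqs. (15)-(17))."""

    def __init__(self, d_model: int, d_attn: int):
        super().__init__()
        self.Wq = nn.Linear(d_model, d_attn)               # W_q, b_q
        self.Wk = nn.Linear(d_model, d_attn)               # W_k, b_k
        self.Wv = nn.Linear(d_model, d_attn)               # W_v, b_v
        self.Ws = nn.Linear(d_attn, d_model, bias=False)   # aggregation W_s
        self.scale = d_model ** 0.5                        # sqrt(d_m), as in Eq. (16)

    def forward(self, Q, K, V):
        # Inputs are (L, d_m) sequences (rows = time steps; transpose of the text's layout).
        dq, dk, dv = self.Wq(diff_time(Q)), self.Wk(diff_time(K)), self.Wv(diff_time(V))  # Eq. (15)
        attn = torch.softmax(dq @ dk.T / self.scale, dim=-1)                              # Eq. (16)
        shifted = attn @ dv                                # one head of the deep shifts
        return torch.cumsum(self.Ws(shifted), dim=0)       # Eq. (17): CumSum gives the deep trend D

# Usage: self-attention on smoothed encoder features H_s (Q = K = V = H_s).
H_s = torch.randn(90, 512)
D = DifferentialAttention(d_model=512, d_attn=64)(H_s, H_s, H_s)
```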

Convolution generator

The final output of Diviner is calculated through convolutional layers, called the one-step generator, which takes the output of the final decoder layer \(\mathbf{E}_{de}^{out}\) as its input:

$$\mathbf{R}_{predict}=\mathrm{ConvNet}(\mathbf{E}_{de}^{out}),$$
(18)

where \(\mathbf{R}_{predict}\in\mathbb{R}^{d_m\times K_r}\) and \(\mathbf{E}_{de}^{out}\in\mathbb{R}^{d_m\times K_{de}}\). ConvNet is a multilayer fully convolutional network whose input and output channels are the decoder input length \(K_{de}\) and the prediction length \(K_r\), respectively.
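A sketch of the one-step generator: a 1D convolution stack whose channels index decoder and prediction positions, mapping K_de input channels to K_r output channels in a single pass; the layer count, hidden width, and activation are our assumptions.

```python
import torch
import torch.nn as nn

class ConvGenerator(nn.Module):
    """One-step generator: map (K_de, d_m) decoder features to (K_r, d_m) predictions."""

    def __init__(self, k_decoder: int, k_predict: int, hidden: int = 128):
        super().__init__()
        # Channels index decoder/prediction positions; the convolution shares parameters
        # across the d_m feature axis, so no step-by-step (dynamic) decoding is needed.
        self.net = nn.Sequential(
            nn.Conv1d(k_decoder, hidden, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv1d(hidden, k_predict, kernel_size=3, padding=1),
        )

    def forward(self, E_de_out: torch.Tensor) -> torch.Tensor:
        # E_de_out: (K_de, d_m) -> add batch dim -> (1, K_de, d_m) -> (1, K_r, d_m).
        return self.net(E_de_out.unsqueeze(0)).squeeze(0)

# Usage: predict K_r = 90 future days from K_de = 28 decoded days (hypothetical sizes).
R = ConvGenerator(k_decoder=28, k_predict=90)(torch.randn(28, 512))
```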

Pseudo-code of Diviner

For the convenience of reproduction, we summarize the framework of Diviner in the following pseudo-code:
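The authors' pseudo-code is not reproduced in this excerpt. As an illustrative stand-in (our reconstruction from the descriptions above, not the released implementation), the following Python-style pseudo-code strings the components together; the exact placement of residual connections and normalization is an assumption.

```python
# Illustrative pseudo-code of the Diviner forward pass (our reconstruction from the
# descriptions above, not the authors' released code). N and M are the numbers of
# encoder and decoder blocks; the components are those sketched earlier in this section.

def diviner_forward(x, input_layer, enc_blocks, dec_blocks, generator):
    E_en, E_de = input_layer(x)                     # dual input: all K days / latest K_de days

    for smooth, diff_attn, ffn in enc_blocks:       # encoder: N stacked Diviner blocks
        H = smooth(E_en)                            # smoothing filter attention (scale converter)
        E_en = ffn(E_en + diff_attn(H, H, H))       # difference attention + residual + feed-forward

    for smooth, self_attn, cross_attn, ffn in dec_blocks:   # decoder: M stacked blocks
        H = smooth(E_de)
        E_de = E_de + self_attn(H, H, H)            # difference self-attention on decoder input
        E_de = ffn(E_de + cross_attn(E_de, E_en, E_en))     # cross-attention to encoder output

    return generator(E_de)                          # one-step convolution generator
```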