Introduction

5G technology has recently gained popularity worldwide for its faster transfer speed, broader bandwidth, reliability, and security. 5G can achieve a theoretical peak speed 20× that of 4G with lower latency, enabling applications such as online gaming, HD streaming services, and video conferencing1,2,3. The development of 5G is changing the world at an incredible pace and fostering emerging industries such as telemedicine, autonomous driving, and extended reality4,5,6. These and other industries are estimated to bring a 1000-fold increase in network traffic, requiring additional capacity to accommodate the growing services and applications7. Nevertheless, 5G infrastructure, such as board cards and routers, must be deployed and managed under strict cost considerations8,9. Operators therefore often adopt a distributed architecture to avoid massive back-to-back devices and links among fragmented networks10,11,12,13. As shown in Fig. 1a, the emerging metropolitan router acts as the hub linking urban access routers, where services can be accessed and integrated effectively. However, the construction cycle of 5G devices requires about three months for scheduling, procurement, and deployment. Planning new infrastructure therefore requires accurate network traffic forecasts months ahead to anticipate when capacity utilization will surpass the preset threshold, beyond which overloaded capacity may ultimately lead to performance problems. Another issue is the resource excess caused by building coarse-grained 5G infrastructure. To mitigate these hazards, operators formulate network expansion schemes months ahead based on long-term network traffic prediction, which facilitates long-period planning for upgrading and scaling the network infrastructure and prepares it for the next planning period14,15,16,17.

Fig. 1: Schematic illustration for the workflow of Diviner.

a We collect the data from MAR–MER links. The orange cylinder depicts the metropolitan emerging routers (MER), and the pale blue cylinder depicts the metropolitan accessing routers (MAR). b Illustration of the 2D → 3D transformation process. Specifically, given a time series of network traffic data spanning K days, we construct a time series matrix \(\widetilde{\mathbf{X}}=[\tilde{\mathbf{x}}_1\ \tilde{\mathbf{x}}_2\ \ldots\ \tilde{\mathbf{x}}_K]\), where each \(\tilde{\mathbf{x}}_i\) represents the traffic data for a single day of length T. The resulting 3D plot displays time steps across days, time steps within each day, and inbits traffic along the x, y, and z axes, respectively, with the inbits traffic standardized. The blue line in the 2D plot and the side near the origin of the pale red plane in the 3D plot represent historical network traffic, while the yellowish line in the 2D plot and the side far from the origin of the pale red plane in the 3D plot represent the future network traffic to predict. c The overall workflow of the proposed Diviner. The blue solid line indicates the data stream direction. Both the encoder and decoder blocks of Diviner contain a smoothing filter attention mechanism (yellowish block), a difference attention module (pale purple block), a residual structure (pale green block), and a feed-forward layer (gray block). Finally, a one-step convolution generator (magenta block) converts the dynamic decoding into a sequence-generating procedure.

In industry, a common practice is to estimate the potential growth rate of network traffic by analyzing historical traffic data18. However, this approach cannot scale to predict the demand for new services and is less than satisfactory for long-term forecasting. Prediction-based methods have been introduced to solve this dilemma by exploring the potential dependencies involved in historical network traffic, which provide both a constraint and a source for assessing future traffic volume. Network planners can harness these dependencies to extrapolate sufficiently long traffic forecasts and develop sustainable expansion schemes and mitigation strategies. The key to this task is obtaining an accurate long-term network traffic prediction. However, directly extending the prediction horizon of existing methods is ineffective for long-term forecasting, since these methods suffer severe performance degeneration: the long-term prediction horizon exposes the non-stationarity of time series. This inherent non-stationarity of real-world time series data is caused by multi-scale temporal variations, random perturbations, and outliers, which present various challenges, summarized as follows. (a) Multi-scale temporal variations. Multi-scale (daily/weekly/monthly/yearly) variations throughout long-term time series indicate multi-scale non-stationary latent patterns within the time series, which should be taken into account comprehensively in the model design; the seasonal component, for example, merely presents variations at particular scales. (b) Random factors. Random perturbations and outliers interfere with the discovery of stable regularities, which demands higher robustness from prediction models. (c) Data distribution shift. The non-stationarity of the time series inevitably results in a dataset shift problem, with the input data distribution varying over time. Figure 1b illustrates these challenges.

Next, we review the shortcomings of existing methods in addressing non-stationarity. Existing time series prediction methods generally fall into two categories: conventional models and deep learning models. Most conventional models, such as ARIMA19,20 and Holt-Winters21,22,23,24,25, are built with some insight into the time series but implemented linearly, causing problems for modeling non-stationary time series. Furthermore, these models rely on manually tuned parameters to fit the time series, which impedes their application in large-scale prediction scenarios. Although Prophet26 uses nonlinear modules and interpretable parameters to address these problems, its hand-crafted nonlinear modules struggle to model non-stationary time series, whose complex patterns make it inefficient to embed diverse factors in hand-crafted functions. This dilemma has boosted the development of deep learning methods. Deep learning models can utilize multiple layers to represent latent features at a higher and more abstract level27, enabling us to recognize deep latent patterns in non-stationary time series. Recurrent neural networks (RNNs) and Transformer networks are the two main deep learning forecasting frameworks. RNN-based models28,29,30,31,32,33,34 feature a feedback loop that allows models to memorize historical data and process variable-length sequences as inputs and outputs, which calculates the cumulative dependency between time steps. Nevertheless, such indirect modeling of temporal dependencies cannot disentangle information from different scales within historical data and thus fails to capture multi-scale variations within non-stationary time series. Transformer-based models35,36,37 solve this problem using a global self-attention mechanism rather than feedback loops. Doing so enhances the network's ability to capture longer dependencies and interactions within series data and thus brings exciting progress in various time series applications38. For more efficient long-term time series processing, some studies39,40,41 turn the self-attention mechanism into a sparse version. However, despite their promising long-term forecasting results, the specific characteristics of time series are not taken into account in their modeling process, and the varying distributions of non-stationary time series deteriorate their predictive performance. Recent research attempts to incorporate time series decomposition into deep learning models42,43,44,45,46,47. Although their results are encouraging and bring more interpretable and reasonable predictions, their limited decomposition, e.g., trend-seasonal decomposition, disregards the correlation between components and merely presents the variation of time series at particular scales.

In this work, we incorporate deep stationary processes into neural networks to achieve precise long-term 5G network traffic forecasts, where stochastic process theories can guarantee the prediction of stationary events48,49,50. Specifically, as shown in Fig. 1c, we develop a deep learning model, Diviner, that incorporates stationary processes into a well-designed hierarchical structure and models non-stationary time series with multi-scale stable features. To validate its effectiveness, we collect an extensive network port traffic (NPT) dataset from the intelligent metropolitan network delivering 5G services of China Unicom and compare the proposed model with numerous state-of-the-art methods over multiple applications. We make two distinct research contributions to time series forecasting: (1) We explore an avenue to solve the challenges presented in long-term time series prediction by modeling non-stationarity in the deep learning paradigm. This line is more universal and effective than previous works that incorporate temporal decomposition, whose limited decomposition merely presents the temporal variation at particular scales. (2) We develop a deep learning framework with a well-designed hierarchical structure to model the multi-scale stable regularities within non-stationary time series. In contrast to previous methods employing various modules in the same layer, we perform a dynamic scale transformation between different layers and model stable temporal dependencies in the corresponding layer. This hierarchical deep stationary process synchronizes with the cascading feature embedding of deep neural networks, which enables us to capture complex regularities contained in long-term histories and achieve precise long-term network traffic forecasting. Our experiments demonstrate that robustness and predictive accuracy improve significantly as we consider more factors concerning non-stationarity, which provides an avenue to improve the long-term forecasting ability of deep learning methods. We also show that modeling non-stationarity can help discover nonlinear latent regularities within network traffic and achieve quality long-term 5G network traffic forecasts for up to three months. Furthermore, we expand our solution to the climate, control, electricity, economic, energy, and transportation fields, demonstrating the applicability of this solution to multiple predictive scenarios and its valuable potential for solving broader engineering problems.

Results

Diviner with deep stationary processes

In this section, we introduce our proposed deep learning model, Diviner, which tackles the non-stationarity of long-term time series prediction with deep stationary processes, capturing multi-scale stable features and modeling multi-scale stable regularities to achieve long-term time series prediction.

Smoothing filter attention mechanism as a scale converter

As shown in Fig. 2a, the smoothing filter attention mechanism adjusts the feature scale and enables Diviner to model time series at different scales and access the multi-scale variation features within non-stationary time series. We build this component based on Nadaraya–Watson regression51,52, a classical algorithm for non-parametric regression. Given the sample space \(\Omega=\{(x_i,y_i)\mid 1\le i\le n,\ x_i\in\mathbb{R},\ y_i\in\mathbb{R}\}\), window size h, and kernel function \(K(\cdot)\), the Nadaraya–Watson regression has the following expression:

$$\hat{y}=\sum_{i=1}^{n}K\left(\frac{x-x_i}{h}\right)y_i\Big/\sum_{j=1}^{n}K\left(\frac{x-x_j}{h}\right),$$
(1)

where the kernel function \(K(\cdot)\) is subject to \(\int_{-\infty}^{\infty}K(x)\,dx=1\) and n, x, y denote the sample size, independent variable, and dependent variable, respectively.
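To make Eq. (1) concrete, here is a minimal NumPy sketch of Nadaraya–Watson regression with a Gaussian kernel; the kernel choice, window size, and toy data are our assumptions for illustration, not the learnable kernel used by Diviner.

```python
import numpy as np

def nadaraya_watson(x_query, x_samples, y_samples, h=1.0):
    """Estimate y at x_query as a kernel-weighted average of the samples (Eq. (1))."""
    # Gaussian kernel K(u) = exp(-u^2 / 2) / sqrt(2*pi), which integrates to 1.
    u = (x_query - x_samples) / h
    weights = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return np.sum(weights * y_samples) / np.sum(weights)

# Toy usage: recover a smooth trend from noisy observations.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = np.sin(x) + 0.3 * rng.standard_normal(x.size)
y_hat = np.array([nadaraya_watson(xq, x, y, h=0.5) for xq in x])
```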

Fig. 2: Illustration of the structure of smoothing filter attention mechanism and difference attention module.

a This panel displays the smoothing filter attention mechanism, which involves computing adaptive weights \(K(\boldsymbol{\xi}_i,\boldsymbol{\xi}_j)\) (orange block) and employing a self-masked structure (gray block with dashed lines) to filter out outliers, where \(\boldsymbol{\xi}_i\) denotes the ith embedded time series period (yellow block). The adaptive weights serve to adjust the feature scale of the input series and obtain the scale-transformed period embedding \(\mathbf{h}_i\) (pink block). b This diagram illustrates the difference attention module. The Matrix-Difference Transformation (pale blue block) subtracts adjacent columns of a matrix to obtain the shifted query, key, and value items (ΔQ, ΔK, and ΔV). Then, an autoregressive multi-head self-attention is performed (in the pale blue background) to capture the correlation of the time series across different time steps, resulting in \(\widetilde{\mathbf{V}}_s^{(i)}\) for the ith attention head. Here, \(\mathbf{Q}_s^{(i)}\), \(\mathbf{K}_s^{(i)}\), \(\mathbf{V}_s^{(i)}\), and \(\widetilde{\mathbf{V}}_s^{(i)}\) represent the query, key, value, and result items, respectively. The \(\mathrm{SoftMax}\) is applied to the scaled dot-product between the query and key vectors to obtain attention weights (pale yellow block). The formula for the \(\mathrm{SoftMax}\) function is \(\mathrm{SoftMax}(\mathbf{k}_i)=e^{\mathbf{k}_i}/\sum_{j=1}^{n}e^{\mathbf{k}_j}\), where \(\mathbf{k}_i\) is the ith element of the input vector and n is the length of the input vector. Lastly, the Matrix-CumSum operation (light orange block) accumulates the shifted features using the ConCat operation, and \(\mathbf{W}_s\) denotes the learnable aggregation parameters.

The Nadaraya–Watson regression estimates the regression value \(\hat{y}\) using a locally weighted average, where the weight of a sample \((x_i,y_i)\), \(K(\frac{x-x_i}{h})/\sum_{j=1}^{n}K(\frac{x-x_j}{h})\), decays with the distance of \(x_i\) from x. Consequently, the estimate relies primarily on samples in the vicinity of x. This process implies the basic notion of scale transformation, where adjacent samples get closer on a larger visual scale. Inspired by this idea, we reformulate the Nadaraya–Watson regression from the perspective of scale transformation and incorporate it into the attention structure to design a learnable scale adjustment unit. Concretely, we introduce the smoothing filter attention mechanism with a learnable kernel function and a self-masked operation, where the former shrinks (or magnifies) variations for adaptive feature-scale adjustment and the latter eliminates outliers. To ease understanding, we consider the 1D time series case here; the high-dimensional case can be extrapolated easily (shown mathematically in Section "Methods"). Given the time step \(t_i\), we estimate its regression value \(\hat{y}_i\) with an adaptively weighted average of the values \(\{y_t\mid t\ne t_i\}\), \(\hat{y}_i=\sum_{j\ne i}\alpha_j y_j\), where the adaptive weights α are obtained by a learnable kernel function f. The punctured window \(\{t_j\mid t_j\ne t_i\}\) of size n − 1 realizes our self-masked operation, and \(f(y_i,y)_{w_i}=\exp(w_i(y_i-y)^2)\), \(\alpha_i=f(y_i,y)_{w_i}/\sum_{j\ne i}f(y_j,y)_{w_i}\). The adaptive weights vary with the inner variation \(\{(y_i-y)^2\mid t_i\ne t\}\) (decreased or increased), which adjusts (shrinking or magnifying) the distance between points across time steps and achieves an adaptive feature-scale transformation. Specifically, a minor variation gets further shrunk at a large feature scale and magnified at a small feature scale, and vice versa. Concerning random components, the global attention serves as an average smoothing method that helps filter small perturbations. As for outliers, their large margin against regular items leads to minor weights, which eliminates their interference; in particular, when the sample \((t_i,y_i)\) is itself an outlier, this structure brushes it aside. Thus, the smoothing filter attention mechanism filters out random components and dynamically adjusts feature scales. In this way, we can dynamically transform the non-stationary time series according to different scales, giving a comprehensive view of the time series.
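As a minimal sketch of the 1D case just described (not the actual Diviner implementation), the following NumPy code applies the self-masked, adaptively weighted average with a fixed kernel parameter w; in Diviner, w is learnable.

```python
import numpy as np

def smoothing_filter_1d(y, w):
    """Self-masked, adaptively weighted smoothing of a 1D series (illustrative sketch).

    y : (n,) observed series; w : (n,) per-step kernel parameters (learnable in Diviner,
    fixed here for illustration). Returns the scale-adjusted estimate y_hat.
    """
    n = y.shape[0]
    y_hat = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i                  # self-masked (punctured) window
        f = np.exp(w[i] * (y[i] - y[mask]) ** 2)  # kernel f(y_i, y)_{w_i}
        alpha = f / f.sum()                       # adaptive weights
        y_hat[i] = np.sum(alpha * y[mask])        # weighted average excluding y_i itself
    return y_hat

# Usage: a negative w makes the weight decay with squared distance, so the
# far-away value 8.0 barely influences (and is itself smoothed toward) its neighbors.
y = np.array([1.0, 1.1, 0.9, 8.0, 1.05, 0.95])
print(smoothing_filter_1d(y, w=np.full(6, -1.0)))
```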

Difference attention module to discover stable regularities

The difference attention module calculates the internal connections among stable shifted features to discover stable regularities within the non-stationary time series and thereby overcomes the interference of uneven distributions. Concretely, as shown in Fig. 2b, this module places the difference and CumSum operations at the two ends of the self-attention mechanism35, which interconnects the shifts across time steps to capture internal connections within non-stationary time series. The difference operation separates the shifts from the long-term trends, where a shift refers to the minor difference in the trend between adjacent time steps. Since trends cause the data distribution to change over time, the difference operation makes the time series stable, varying around a fixed mean level with minor distribution shifts. Subsequently, we use a self-attention mechanism to interconnect the shifts, which captures the temporal dependencies within the time series variation. Lastly, we employ a CumSum operation to accumulate the shifted features and generate a non-stationary time series conforming to the discovered regularities.

Modeling and generating non-stationary time series in Diviner framework

The smoothing filter attention mechanism filters out random components and dynamically adjusts the feature scale. Subsequently, the difference attention module calculates internal connections and captures the stable regularity within the time series at the corresponding scale. By cascading these two modules, one Diviner block can discover stable regularities within non-stationary time series at one scale. We then stack Diviner blocks in a multilayer structure to form multi-scale transformation layers and capture multi-scale stable features from non-stationary time series. This multilayer structure is organized in an encoder-decoder architecture with asymmetric input lengths for efficient data utilization. The encoder takes a long historical series to embed trends, and the decoder receives a relatively short time series. With the cross-attention between the encoder and decoder, we can pair the latest time series with pertinent variation patterns from the long historical series and make inferences about future trends, improving calculation efficiency and reducing redundant historical information. The point is that the latest time series is more conducive to anticipating the immediate future than the remote-past time series, since the correlation across time steps generally degrades with the length of the interval53,54,55,56,57. Additionally, we design a generator to obtain the prediction results in one step and thereby avoid dynamic cumulative error problems39. The generator is built with a ConvNet sharing parameters across time steps, based on the linear projection generator39,58,59, which saves hardware resources. These techniques enable deep learning methods to model non-stationary time series with multi-scale stable features and produce forecasting results in a generative paradigm, which is an attempt to tackle long-term time series prediction problems.

Performance of the 5G network traffic forecasting

To validate the effectiveness of the proposed techniques, we collect an extensive NPT dataset from China Unicom. The NPT dataset includes data recorded every 15 minutes for the whole of 2021 from three groups of real-world metropolitan network traffic ports {NPT-1, NPT-2, NPT-3}, where the sub-datasets contain {18, 5, 5} ports, respectively. We split them chronologically with a 9:1 proportion for training and testing. In addition, we prepare 16 network ports for parameter searching. The main difficulties lie in the explicit shift of the distribution and numerous outliers. This section elaborates on a comprehensive comparison of our model with prediction-based and growth-rate-based models in 5G network traffic forecasting.

Experiment 1

We first compare Diviner with other time series prediction-based methods, which we denote as Baselines-T for clarity. Baselines-T include the traditional models ARIMA19,20 and Prophet26; the classic machine learning model LSTMa60; and the deep learning-based models Transformer35, Informer39, Autoformer42, and NBeats61. These models are required to predict the entire network traffic series {1, 3, 7, 14, 30} days ahead, aligned with the {96, 288, 672, 1344, 2880} prediction spans in Table 1, with inbits as the target feature. In terms of evaluation, although the MSE, MAE, and MASE errors generally grow with the prediction interval, the degradation rate varies between models. Therefore, we introduce an exponential velocity indicator to measure the rate of accuracy degradation. Specifically, given time spans [t1, t2] and the corresponding MSE, MAE, and MASE errors, we have the following:

$${\mathrm{dMSE}}_{t_1}^{t_2}=\left(\sqrt[t_2-t_1]{{\mathrm{MSE}}_{t_2}/{\mathrm{MSE}}_{t_1}}-1\right)\times 100\%,$$
(2)
$${\mathrm{dMAE}}_{t_1}^{t_2}=\left(\sqrt[t_2-t_1]{{\mathrm{MAE}}_{t_2}/{\mathrm{MAE}}_{t_1}}-1\right)\times 100\%,$$
(3)
$${\mathrm{dMASE}}_{t_1}^{t_2}=\left(\sqrt[t_2-t_1]{{\mathrm{MASE}}_{t_2}/{\mathrm{MASE}}_{t_1}}-1\right)\times 100\%,$$
(4)

where \({\mathrm{dMSE}}_{t_1}^{t_2},{\mathrm{dMAE}}_{t_1}^{t_2},{\mathrm{dMASE}}_{t_1}^{t_2}\in\mathbb{R}\). Given the close experimental results among {NPT-1, NPT-2, NPT-3}, we focus mainly on the NPT-1 dataset; the experimental results are summarized in Table 1. Although the NPT dataset contains numerous outliers and frequent oscillations, Diviner achieves a 38.58% average MSE reduction (0.451 → 0.277) and a 20.86% average MAE reduction (0.465 → 0.368) over the prior art. In terms of scalability to different prediction spans, Diviner has a much lower \({\mathrm{dMSE}}_{1}^{30}\) (4.014% → 0.750%) and \({\mathrm{dMAE}}_{1}^{30}\) (2.343% → 0.474%) than the prior art, exhibiting only slight performance degradation and a substantial improvement in predictive robustness as the prediction horizon lengthens. The degradation rates and predictive performance of all baseline approaches are provided in Supplementary Table S1 owing to space limitations.
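For reference, the degradation indicator of Eqs. (2)-(4) amounts to the (t2 − t1)-th root of the error ratio minus one; a small helper (function name and example values are ours) reads as follows.

```python
def degradation_rate(err_t1: float, err_t2: float, t1: float, t2: float) -> float:
    """Exponential degradation rate in % between horizons t1 and t2 (Eqs. (2)-(4)).

    err_t1 and err_t2 are the same error metric (MSE, MAE, or MASE) at the two horizons.
    """
    return ((err_t2 / err_t1) ** (1.0 / (t2 - t1)) - 1.0) * 100.0

# Hypothetical reading: an error growing from 0.25 at 1 day to 0.31 at 30 days
# corresponds to an average degradation of roughly 0.74 % per day.
print(degradation_rate(0.25, 0.31, t1=1, t2=30))
```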

Table 1 Time-series forecasting results on the 5G traffic network dataset.

The experiments on NPT-2 and NPT-3 shown in Supplementary Data 1 reproduce the above results: Diviner supports accurate long-term network traffic prediction and exceeds the current art in both accuracy and robustness by a large margin. In addition, sorting the comprehensive performance (measured by the average MASE error) of the baselines built on the Transformer framework yields: Diviner > Autoformer > Transformer > Informer. This order aligns with the non-stationary factors considered in these models and verifies our proposal that incorporating non-stationarity promotes neural networks' adaptive ability to model time series, and that modeling multi-scale non-stationarity further breaks through the ceiling of prediction ability for deep learning models.

Experiment 2

The second experiment compares Diviner with two industrial methods that predict the capacity utilization of inbits and outbits from historical growth rates. The experiment shares the same network port traffic data as Experiment 1, but the split ratio is changed to 3:1 chronologically for a longer prediction horizon. Furthermore, we use a long construction cycle of {30, 60, 90} days (aligned with {2880, 5760, 8640} time steps) to ensure the validity of such growth-rate-based methods under the law of large numbers. Here we first define capacity utilization mathematically:

Given a fixed bandwidth \(B\in\mathbb{R}\) and the traffic flow of the kth construction cycle \(\widetilde{\mathbf{X}}(k)=\left[\tilde{\mathbf{x}}_{kC+1}\ \tilde{\mathbf{x}}_{kC+2}\ \ldots\ \tilde{\mathbf{x}}_{(k+1)C}\right]\), \(\widetilde{\mathbf{X}}(k)\in\mathbb{R}^{T\times C}\), where \(\tilde{\mathbf{x}}_i\in\mathbb{R}^{T}\) is a column vector of length T representing the time series per day and C denotes the number of days in one construction cycle. Then the capacity utilization (CU) of the kth construction cycle is defined as follows:

$$\mathrm{CU}(k)=\frac{\|\widetilde{\mathbf{X}}(k)\|_{m1}}{BCT},$$
(5)

where \(\mathrm{CU}(k)\in\mathbb{R}\). As the definition shows, capacity utilization is directly related to network traffic, so a precise network traffic prediction leads to a quality prediction of capacity utilization. We compare the proposed predictive method with two moving-average growth-rate methods commonly used in industry: an additive method and a multiplicative one. For clarity, we denote the additive method as Baseline-A and the multiplicative method as Baseline-M. Baseline-A calculates an additive growth rate from the difference between adjacent construction cycles. Given the capacity utilization of the last two construction cycles, CU(k − 1) and CU(k − 2), we have the following:

$$\widehat{\mathrm{CU}}_{A}(k)=2\,\mathrm{CU}(k-1)-\mathrm{CU}(k-2).$$
(6)

Baseline-M calculates a multiplicative growth rate with the quotient of adjacent construction cycles. Given the capacity utilization of the last two construction cycles CU(k − 1), CU(k − 2), we have the following:

$$\widehat{\mathrm{CU}}_{M}(k)=\frac{\mathrm{CU}(k-1)}{\mathrm{CU}(k-2)}\,\mathrm{CU}(k-1).$$
(7)

Different from the above two baselines, we calculate the capacity utilization of the network with the network traffic forecast. Given the network traffic of the last K construction cycles \(\widetilde{\mathbf{X}}=\left[\tilde{\mathbf{x}}_{(k-K)C+1}\ \ldots\ \tilde{\mathbf{x}}_{(k-K+1)C}\ \ldots\ \tilde{\mathbf{x}}_{(k-1)C}\ \ldots\ \tilde{\mathbf{x}}_{kC}\right]\), we have the following:

$$\widetilde{\mathbf{X}}(k)=\mathrm{Diviner}(\widetilde{\mathbf{X}}),$$
(8)
$$\widehat{\mathrm{CU}}_{D}(k)=\frac{\|\widetilde{\mathbf{X}}(k)\|_{m1}}{BCT}.$$
(9)
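For reference, a short sketch of Eqs. (5)-(7); variable names and the example numbers are ours, and the entrywise l1 norm is assumed for \(\|\cdot\|_{m1}\).

```python
import numpy as np

def capacity_utilization(traffic: np.ndarray, bandwidth: float) -> float:
    """CU = ||X(k)||_{m1} / (B*C*T) for one cycle of traffic with shape (T, C) (Eq. (5))."""
    T, C = traffic.shape
    return np.abs(traffic).sum() / (bandwidth * C * T)

def baseline_additive(cu_prev1: float, cu_prev2: float) -> float:
    """Baseline-A: extrapolate with the additive growth of the last two cycles (Eq. (6))."""
    return 2.0 * cu_prev1 - cu_prev2

def baseline_multiplicative(cu_prev1: float, cu_prev2: float) -> float:
    """Baseline-M: extrapolate with the multiplicative growth of the last two cycles (Eq. (7))."""
    return cu_prev1 / cu_prev2 * cu_prev1

# Hypothetical usage: CU grew from 0.40 to 0.44 over the last two cycles.
print(baseline_additive(0.44, 0.40))        # 0.48
print(baseline_multiplicative(0.44, 0.40))  # 0.484

# The Diviner-based forecast (Eqs. (8)-(9)) applies the same CU formula to the
# predicted traffic matrix for cycle k, e.g. capacity_utilization(X_pred, bandwidth).
```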

We summarize the experimental results in Table 2. Given the close experimental results among {NPT-1, NPT-2, NPT-3}, we focus mainly on the NPT-1 dataset, which has the most network traffic ports. Diviner achieves a substantial 31.67% MAE reduction (0.846 → 0.578) on inbits and a 24.25% MAE reduction (0.944 → 0.715) on outbits over Baseline-A. An intuitive explanation is that the growth-rate-based methods extract particular historical features but lack adaptability. We notice that Baseline-A performs much better than Baseline-M, with 0.045× the average inbits MAE and 0.074× the average outbits MAE. This result suggests that network traffic tends to increase linearly rather than exponentially. Nevertheless, inherent multi-scale variations remain in the network traffic series, so Diviner still exceeds Baseline-A, suggesting the necessity of applying deep learning models such as Diviner to discover nonlinear latent regularities within network traffic.

Table 2 Long-term (1–3 months) capacity utilization forecasting results on the NPT dataset.

Analyzing the results of these two experiments jointly, we find that Diviner exhibits a relatively low degradation rate for a 90-day prediction, \({\mathrm{dMASE}}_{1}^{90}=1.034\%\). In contrast, the degradation rate of the prior art reaches \({\mathrm{dMASE}}_{1}^{30}=2.343\%\) for a three-times shorter prediction horizon of 30 days. Furthermore, considering the diverse network traffic patterns in the provided datasets (about 50 ports), the proposed method can deal with a wide range of non-stationary time series, validating its applicability without modification. These experiments demonstrate Diviner's success in providing quality long-term network traffic forecasting and extending the effective prediction span of deep learning models to up to three months.

Application on other real-world datasets

We validate our method on benchmark datasets for weather (WTH), electricity transformer temperature (ETT), electricity (ECL), and exchange (Exchange). We summarize the experimental results in Table 3. We follow the standard protocol and divide each dataset into training, validation, and test sets in chronological order with a proportion of 7:1:2 unless otherwise specified. Owing to space limitations, the complete experimental results are shown in Supplementary Data 2.

Table 3 Time-series forecasting results on other real-world datasets.

Weather temperature prediction

The WTH dataset42 records 21 meteorological indicators for Jena in 2020, including air temperature and humidity, with WetBulbFarenheit as the target. This dataset is finely quantified to the 10-min level, which means 144 steps per day and 4320 steps per month, challenging the models' capacity to process long sequences. Among all baselines, NBeats and Informer have the lowest errors in terms of the MSE and MAE metrics, respectively. However, we notice a contrast between these two models when extending the prediction span. Informer degrades precipitously when the prediction span increases from 2016 to 4032 steps (MAE: 0.417 → 0.853), whereas NBeats gains a performance improvement (MAE: 0.635 → 0.434). We attribute this to a trade-off between pursuing context and texture. Informer has an advantage in texture in the short-term case but struggles to capture the context dependency of the series, considering that the length of the input history series should extend in pace with the prediction span, and vice versa. Diviner achieves a remarkable 29.30% average MAE reduction (0.488 → 0.345) and a 41.54% average MSE reduction (0.491 → 0.287) over both Informer and NBeats. Additionally, Diviner attains low degradation rates of \({\mathrm{dMSE}}_{1}^{30}=0.439\%\) and \({\mathrm{dMAE}}_{1}^{30}=0.167\%\), showing its ability to harness the historical information within the time series. The predictive performance and degradation rates of all baseline approaches are provided in Supplementary Table S2. Our model can synthesize context and texture to balance both short-term and long-term cases, ensuring accurate and robust long-term prediction.

Electricity transformer temperature prediction

The ETT dataset contains two years of data with six power load features from two counties in China, and the oil temperature is our target. The training/validation/test split is 12/4/4 months39. The ETT dataset is divided into separate datasets at the 1-hour level {ETTh1, ETTh2} and the 15-minute level (ETTm1). Therefore, we can study the performance of the models under different granularities, where the prediction steps {96, 288, 672} of ETTm1 align with the prediction steps {24, 48, 168} of ETTh1. Our experiments show that Diviner achieves the best performance in both cases. In the hour-level case, Diviner outperforms the baselines, with Autoformer the closest (MSE: 0.110 → 0.082, MAE: 0.247 → 0.216). When the hour-level granularity turns into the minute-level case, Diviner outperforms Autoformer by a large margin (MSE: 0.092 → 0.064, MAE: 0.239 → 0.194). The predictive performance of all baseline approaches at both granularities, and the change between them, is provided in Supplementary Table S3. These results demonstrate the capacity of Diviner to process time series of different granularities. Furthermore, granularity is also a manifestation of scale, so these results also demonstrate that modeling multi-scale features is conducive to dealing with time series of different granularities.

Consumer electricity consumption prediction

The ECL dataset records the two-year electricity consumption of 321 clients, converted into hour-level consumption owing to missing data, with MT-320 as the target feature62. We predict time horizons of {7, 14, 30, 40} days, aligned with {168, 336, 720, 960} prediction steps ahead. We analyze the experimental results according to the prediction span (below 360 steps as short-term prediction, above 360 as long-term prediction). NBeats achieves the best forecasting performance for short-term electricity consumption prediction, while Diviner surpasses it in the long-term case. The short-term and long-term performance of all approaches is provided in Supplementary Table S4. Statistically, the proposed method outperforms the best baseline (NBeats) by decreasing MSE by 17.43% (0.367 → 0.303) and MAE by 15.14% (0.482 → 0.409) at 720 steps ahead, and MSE by 6.56% (0.457 → 0.427) and MAE by 9.44% (0.540 → 0.489) at 960 steps ahead. We attribute this to scalability: different models converge to similar performance in the short-term case, but their differences emerge when the prediction span becomes longer.

Gold price prediction

The Exchange dataset contains five years of daily closing prices of a troy ounce of gold in the US, recorded from 2016 to 2021. Due to the high-frequency fluctuation of the market price, the predictive goal is to forecast its general trend reasonably (https://www.lbma.org.uk). To this end, we perform long-term predictions of {10, 20, 30, 60} days. The experimental results show apparent performance degradation for most baseline models. Given a history of 90 days, only Autoformer and Diviner can predict with MAE and MSE errors lower than 1 when the prediction span is 60 days. Diviner still outperforms the other methods with a 38.94% average MSE reduction (0.588 → 0.359) and a 22.73% average MAE reduction (0.607 → 0.469), achieving the best forecast performance. The predictive performance of all baseline approaches is provided in Supplementary Table S5. These results indicate the adaptability of Diviner to the rapid evolution of financial markets and its reasonable extrapolation, considering that financial systems are generally difficult to predict.

Solar energy production prediction

The Solar dataset contains 10-minute-level solar power production data for one year (2006) from 137 PV plants in Alabama, with PV-136 as the target feature (http://www.nrel.gov). Given that the amount of solar energy produced daily is generally stable, conducting a super-long-term prediction is unnecessary. Therefore, we set the prediction horizon to {1, 2, 5, 6} days, aligned with {144, 288, 720, 864} prediction steps ahead. Furthermore, this characteristic of solar energy means that its production series tends to be stationary, and the comparison of predictive performance between different models on this dataset therefore reflects their basic series modeling abilities. Concretely, since the MASE error can be used to assess a model's performance across different series, we calculate and sort each model's average MASE error under the different prediction horizon settings to measure time series modeling ability (provided in Supplementary Table S6). The results are as follows: Diviner > NBeats > Transformer > Autoformer > Informer > LSTM, where Diviner surpasses all Transformer-based models among the selected baselines. Given that the series data are not markedly non-stationary, the advantages of Autoformer's modeling of time series non-stationarity are not apparent, whereas capturing stable long- and short-term dependencies remains effective.

Road occupancy rate prediction

The Traffic dataset contains hourly road occupancy rates over two years (2015–2016) collected from 862 sensors on San Francisco Bay Area freeways by the California Department of Transportation, with sensor-861 as the target feature (http://pems.dot.ca.gov). The prediction horizon is set to {7, 14, 30, 40} days, aligned with {168, 336, 720, 960} prediction steps ahead. Since the road occupancy rate tends to have a weekly cycle, we use this dataset to compare different networks' ability to model temporal cycles. In the comparison, we focus on two groups of deep learning models: group-1 takes the non-stationary specialization of time series into account (Diviner, Autoformer), and group-2 does not employ any time-series-specific components (Transformer, Informer, LSTMa). We find that group-1 gains a significant performance improvement over group-2, which suggests the necessity of modeling non-stationarity. The proposed Diviner model achieves a 27.64% MAE reduction (0.604 → 0.437) over the Transformer model when forecasting 30-day road occupancy rates. We then conduct an intra-group comparison for group-1, where Diviner still gains an average 35.37% MAE reduction (0.523 → 0.338) over Autoformer. The predictive performance of all approaches is provided in Supplementary Table S7. We attribute this to Diviner's multi-scale modeling of non-stationarity, while the trend-seasonal decomposition of Autoformer merely reflects time series variation at particular scales. These experimental results demonstrate that Diviner is competent in predicting time series data with cycles.

Discussion

We study the long-term 5G network traffic prediction problem by modeling non-stationarity with deep learning techniques. Although some early literature63,64,65 argues that a probabilistic traffic forecast under uncertainty is more suitable for varying network traffic than a concrete forecast produced by time series models, probabilistic and concrete traffic forecasts share the same historical information in essence. Moreover, the development of time series forecasting techniques in recent years has witnessed a series of works employing them for practical applications such as bandwidth management14,15, resource allocation16, and resource provisioning17, where time series prediction-based methods can provide detailed network traffic forecasts. However, existing time series forecasting methods suffer severe performance degeneration because the long-term prediction horizon exposes the non-stationarity of time series, which raises several challenges: (a) multi-scale temporal variations, (b) random factors, and (c) data distribution shift.

Therefore, this paper tackles the problem of achieving precise long-term predictions for non-stationary time series. We start from the fundamental property of time series non-stationarity and introduce deep stationary processes into a neural network, which models multi-scale stable regularities within non-stationary time series. We argue that capturing stable features is a recipe for generating non-stationary forecasts conforming to historical regularities: the stable features enable networks to restrict the latent space of the time series, which deals with the varying distribution problem. Extensive experiments on network traffic prediction and other real-world scenarios demonstrate its advances over existing prediction-based models. Its advantages are summarized as follows. (a) Diviner brings a salient improvement in both long- and short-term prediction and achieves state-of-the-art performance. (b) Diviner performs robustly regardless of the choice of prediction span and granularity, showing great potential for long-term forecasting. (c) Diviner maintains strong generalization across various fields. The performance of most baselines degrades precipitously in one area or another, whereas our model distinguishes itself through consistent performance on each benchmark.

This work explores an avenue to obtain detailed and precise long-term 5G network traffic forecasts, which can be used to estimate when network traffic might overflow the capacity and helps operators formulate network construction schemes months in advance. Furthermore, Diviner generates long-term network traffic forecasts at the minute level, facilitating broader applications in resource provisioning, allocation, and monitoring. Decision-makers can harness long-term predictions to allocate and optimize network resources. Another practical application is an automatic network status monitoring system, which raises an alarm when real network traffic exceeds a permitted range around the predictions. This system supports targeted port-level early warning and assists workers in troubleshooting in time, which can bring substantial efficiency improvements considering the tens of millions of network ports running online. Beyond 5G networks, we have expanded our solution to broader engineering fields such as electricity, climate, control, economics, energy, and transportation. Predicting oil temperature can help prevent the transformer from overheating, which affects the insulation life of the transformer, and thereby ensures proper operation66,67. In addition, long-term meteorological prediction helps with crop selection and seeding in agriculture. As such, we can discover unnoticed regularities within historical series data, which might bring opportunities to traditional industries.

One limitation of our proposed model is that it suffers from critical transitions in data patterns. We attribute this to external factors, whose information is generally not included in the measured data53,55,68. Our method is helpful for discovering intrinsic regularities within the time series but cannot predict patterns not previously recorded in the real world. Alternatively, dynamic network methods69,70,71 can be used to detect such critical transitions in the time series53. Furthermore, the performance of Diviner might be similar to that of other deep learning models when given only a short history or in the short-term prediction case: the former contains insufficient information to be exploited, and the latter offers little room for differences in scalability to emerge, whereas the advantages of our model become apparent in long-term forecasting scenarios.

Methods

Preliminaries

We denote the original form of the time-series data as \(\mathbf{X}=[x_1\ x_2\ \ldots\ x_n]\), \(x_i\in\mathbb{R}\). The original time series data X is reshaped to a matrix form as \(\widetilde{\mathbf{X}}=[\tilde{\mathbf{x}}_1\ \tilde{\mathbf{x}}_2\ \ldots\ \tilde{\mathbf{x}}_K]\), where \(\tilde{\mathbf{x}}_i\) is a vector of length T with the time series data per day/week/month/year, K denotes the number of days/weeks/months/years, and \(\tilde{\mathbf{x}}_i\in\mathbb{R}^{T}\). After that, we can represent the seasonal pattern as \(\tilde{\mathbf{x}}_i\) and use its variation between adjacent time steps to model trends, shown as follows:

$$\begin{aligned}\tilde{\mathbf{x}}_{t_2}&=\tilde{\mathbf{x}}_{t_1}+\sum_{t=t_1}^{t_2-1}\Delta\tilde{\mathbf{s}}_t,\\ \Delta\tilde{\mathbf{s}}_t&=\tilde{\mathbf{x}}_{t+1}-\tilde{\mathbf{x}}_t,\end{aligned}$$
(10)

where \(\Delta\tilde{\mathbf{s}}_t\in\mathbb{R}^{T}\) denotes the change of the seasonal pattern. The shift reflects the variation between small time steps, but when such variation (shift) accumulates over a sufficiently long period, the trend emerges; it can be obtained as \(\sum_{t=t_1}^{t_2-1}\Delta\tilde{\mathbf{s}}_t\). Therefore, we can model trends by capturing the long- and short-range dependencies of shifts across different time steps.

Next, we introduce a smoothing filter attention mechanism to construct multi-scale transformation layers. A difference attention module is then mounted to capture and interconnect the shifts at the corresponding scale. These mechanisms enable Diviner to capture multi-scale variations in non-stationary time series; the mathematical description is given below.

Diviner input layer

Given the time series data X, we transform X into \(\widetilde{\mathbf{X}}=[\tilde{\mathbf{x}}_1\ \tilde{\mathbf{x}}_2\ \ldots\ \tilde{\mathbf{x}}_K]\), where \(\tilde{\mathbf{x}}_i\in\mathbb{R}^{T}\) is a vector of length T with the time series data per day (seasonal), K denotes the number of days, and \(\widetilde{\mathbf{X}}\in\mathbb{R}^{T\times K}\). We then construct the dual input for Diviner. Since Diviner adopts an encoder-decoder architecture, we construct \(\mathbf{X}_{en}^{in}\) for the encoder and \(\mathbf{X}_{de}^{in}\) for the decoder, where \(\mathbf{X}_{en}^{in}=[\tilde{\mathbf{x}}_1\ \tilde{\mathbf{x}}_2\ \ldots\ \tilde{\mathbf{x}}_K]\), \(\mathbf{X}_{de}^{in}=[\tilde{\mathbf{x}}_{K-K_{de}+1}\ \tilde{\mathbf{x}}_{K-K_{de}+2}\ \ldots\ \tilde{\mathbf{x}}_K]\), \(\mathbf{X}_{en}^{in}\in\mathbb{R}^{T\times K}\), and \(\mathbf{X}_{de}^{in}\in\mathbb{R}^{T\times K_{de}}\). That is, \(\mathbf{X}_{en}^{in}\) takes all elements of \(\widetilde{\mathbf{X}}\), while \(\mathbf{X}_{de}^{in}\) takes only the latest \(K_{de}\) elements. A fully connected layer applied to \(\mathbf{X}_{en}^{in}\) and \(\mathbf{X}_{de}^{in}\) then yields \(\mathbf{E}_{en}^{in}\) and \(\mathbf{E}_{de}^{in}\), where \(\mathbf{E}_{en}^{in}\in\mathbb{R}^{d_m\times K}\), \(\mathbf{E}_{de}^{in}\in\mathbb{R}^{d_m\times K_{de}}\), and \(d_m\) denotes the model dimension.
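A PyTorch-style sketch of the dual-input construction and day-wise embedding described above; the module name, the row-major (K, d_m) layout (the transpose of the text's column convention), and the sizes in the usage example are our assumptions.

```python
import torch
import torch.nn as nn

class DivinerInputLayer(nn.Module):
    """Reshape a 1D series into daily segments and embed each day into d_m dimensions."""

    def __init__(self, steps_per_day: int, d_model: int, k_decoder: int):
        super().__init__()
        self.T, self.K_de = steps_per_day, k_decoder
        self.embed = nn.Linear(steps_per_day, d_model)   # fully connected layer per day

    def forward(self, x: torch.Tensor):
        # x: (K * T,) raw series -> X_tilde: (K, T), one row per day.
        X = x.reshape(-1, self.T)
        E_en = self.embed(X)                 # encoder input: all K days      -> (K, d_m)
        E_de = self.embed(X[-self.K_de:])    # decoder input: latest K_de days -> (K_de, d_m)
        return E_en, E_de

# Usage with 15-minute sampling (96 steps/day), 90 days of history, hypothetical sizes.
layer = DivinerInputLayer(steps_per_day=96, d_model=512, k_decoder=28)
E_en, E_de = layer(torch.randn(90 * 96))
```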

Smoothing filter attention mechanism

Inspired by the Nadaraya-Watson regression51,52, which brings adjacent points closer together, we introduce the smoothing filter attention mechanism with a learnable kernel function and a self-masked architecture, where the former brings similar items closer to filter out the random component and adjust the non-stationary data into stable features, and the latter reduces outliers. The smoothing filter attention mechanism operates on the input \(\mathbf{E}=[\boldsymbol{\xi}_1\ \boldsymbol{\xi}_2\ \ldots\ \boldsymbol{\xi}_{K_{in}}]\), where \(\boldsymbol{\xi}_i\in\mathbb{R}^{d_m}\) and E is the general reference to the input of each layer, with \(K_{in}=K\) for the encoder and \(K_{in}=K_{de}\) for the decoder. Specifically, \(\mathbf{E}_{en}^{in}\) and \(\mathbf{E}_{de}^{in}\) are, respectively, the inputs of the first encoder and decoder layers. The calculation process is as follows:

$$\boldsymbol{\eta}_i=\frac{\sum_{j\ne i}\mathrm{K}(\boldsymbol{\xi}_i,\boldsymbol{\xi}_j)\odot\boldsymbol{\xi}_j}{\sum_{j\ne i}\mathrm{K}(\boldsymbol{\xi}_i,\boldsymbol{\xi}_j)},$$
(11)
$$\mathrm{K}(\boldsymbol{\xi}_i,\boldsymbol{\xi}_j)=\exp\!\left(\mathbf{w}_i\odot(\boldsymbol{\xi}_i-\boldsymbol{\xi}_j)^2\right),$$
(12)

where \(\mathbf{w}_i\in\mathbb{R}^{d_m}, i\in[1,K_{in}]\) denotes the learnable parameters, \(\odot\) denotes the element-wise product, and \((\cdot)^2\) denotes the element-wise square (the square of a vector here is taken element-wise). To simplify the representation, we denote the smoothing filter attention mechanism as Smoothing-Filter(E) and its output as \(\mathbf{H}_s\). Before introducing our difference attention module, we first define the difference operation on a matrix and its inverse operation, CumSum.
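Before that, a PyTorch sketch of Eqs. (11)-(12); the per-position learnable kernel parameters and the self-mask follow the description above, whereas the initialization, the row-major layout (rows as periods), and the numerical safeguard are our assumptions.

```python
import torch
import torch.nn as nn

class SmoothingFilterAttention(nn.Module):
    """Self-masked, learnable-kernel smoothing over period embeddings (Eqs. (11)-(12))."""

    def __init__(self, k_in: int, d_model: int):
        super().__init__()
        # One kernel parameter vector w_i per position, initialized negative so the
        # kernel decays with the squared difference (an assumption of this sketch).
        self.w = nn.Parameter(-torch.ones(k_in, d_model))

    def forward(self, E: torch.Tensor) -> torch.Tensor:
        # E: (K_in, d_m). Pairwise element-wise squared differences: (K_in, K_in, d_m).
        diff2 = (E.unsqueeze(1) - E.unsqueeze(0)) ** 2
        K = torch.exp(self.w.unsqueeze(1) * diff2)                 # K(xi_i, xi_j), element-wise
        # Self-masked operation: zero out j == i so each position ignores itself.
        mask = 1.0 - torch.eye(E.size(0), device=E.device).unsqueeze(-1)
        K = K * mask
        H = (K * E.unsqueeze(0)).sum(dim=1) / K.sum(dim=1).clamp_min(1e-8)
        return H                                                   # (K_in, d_m) scale-adjusted features

# Usage on hypothetical encoder embeddings (K_in = 90 days, d_m = 512).
H_s = SmoothingFilterAttention(k_in=90, d_model=512)(torch.randn(90, 512))
```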

Difference and CumSum operation

Given a matrix \(\mathbf{M}\in\mathbb{R}^{m\times n}\), \(\mathbf{M}=[\mathbf{m}_1\ \mathbf{m}_2\ \ldots\ \mathbf{m}_n]\), the difference of M is defined as:

$$\Delta\mathbf{M}=\left[\Delta\mathbf{m}_1\ \Delta\mathbf{m}_2\ \ldots\ \Delta\mathbf{m}_n\right],$$
(13)

where \(\Delta\mathbf{m}_i=\mathbf{m}_{i+1}-\mathbf{m}_i\), \(\Delta\mathbf{m}_i\in\mathbb{R}^{m}\), \(i\in[1,n)\), and we pad \(\Delta\mathbf{m}_n\) with \(\Delta\mathbf{m}_{n-1}\) to keep a fixed length before and after the difference operation. The CumSum operation Σ applied to M is defined as:

$$\Sigma\mathbf{M}=\left[\Sigma\mathbf{m}_1\ \Sigma\mathbf{m}_2\ \ldots\ \Sigma\mathbf{m}_n\right],$$
(14)

where \(\Sigma\mathbf{m}_i=\sum_{j=1}^{i}\mathbf{m}_j\), \(\Sigma\mathbf{m}_i\in\mathbb{R}^{m}\). The differential attention module, intuitively, can be seen as an attention mechanism plugged between these two operations, mathematically described as follows.
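Before detailing that module, the two operations just defined can be written in a few lines; this sketch assumes a column-major (m, n) layout and illustrates that CumSum inverts the difference up to the first column.

```python
import torch

def difference(M: torch.Tensor) -> torch.Tensor:
    """Column-wise difference with the last column padded (Eq. (13)); M is (m, n)."""
    d = M[:, 1:] - M[:, :-1]                 # Delta m_i = m_{i+1} - m_i
    return torch.cat([d, d[:, -1:]], dim=1)  # pad Delta m_n with Delta m_{n-1}

def cumsum(M: torch.Tensor) -> torch.Tensor:
    """Column-wise CumSum (Eq. (14)); inverse of the difference up to the first column."""
    return torch.cumsum(M, dim=1)

# Sanity check on a random matrix: cumsum(difference(M)) recovers M shifted by its
# first column, once the padded last column is dropped.
M = torch.randn(4, 6)
recon = cumsum(difference(M))[:, :-1]
assert torch.allclose(recon, (M - M[:, :1])[:, 1:], atol=1e-6)
```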

Differential attention module

The input of this module involves three elements: Q, K, and V. The triple (Q, K, V) differs between the encoder and decoder: it is \((\mathbf{H}_s^{en},\mathbf{H}_s^{en},\mathbf{H}_s^{en})\) for the encoder and \((\mathbf{H}_s^{de},\mathbf{E}_{en}^{out},\mathbf{E}_{en}^{out})\) for the decoder, where \(\mathbf{E}_{en}^{out}\) is the embedded result of the final encoder block (assigned in the pseudo-code), \(\mathbf{H}_s^{en}\in\mathbb{R}^{d_m\times K}\), \(\mathbf{H}_s^{de}\in\mathbb{R}^{d_m\times K_{de}}\), and \(\mathbf{E}_{en}^{out}\in\mathbb{R}^{d_m\times K}\).

$$\mathbf{Q}_s^{(i)},\mathbf{K}_s^{(i)},\mathbf{V}_s^{(i)}=\mathbf{W}_q^{(i)}\Delta\mathbf{Q}+\mathbf{b}_q^{(i)},\ \mathbf{W}_k^{(i)}\Delta\mathbf{K}+\mathbf{b}_k^{(i)},\ \mathbf{W}_v^{(i)}\Delta\mathbf{V}+\mathbf{b}_v^{(i)},$$
(15)
$$\widetilde{\mathbf{V}}_s^{(i)}=\mathbf{V}_s^{(i)}\cdot\mathrm{SoftMax}\left(\frac{\mathbf{Q}_s^{(i)\top}\cdot\mathbf{K}_s^{(i)}}{\sqrt{d_m}}\right),$$
(16)
$$\mathbf{D}=\Sigma\left(\mathbf{W}_s\left[\widetilde{\mathbf{V}}_s^{(1)\top}\ \widetilde{\mathbf{V}}_s^{(2)\top}\ \ldots\ \widetilde{\mathbf{V}}_s^{(h)\top}\right]^{\top}\right),$$
(17)

where \(\mathbf{W}_q^{(i)}\in\mathbb{R}^{d_a\times d_m}\), \(\mathbf{W}_k^{(i)}\in\mathbb{R}^{d_a\times d_m}\), \(\mathbf{W}_v^{(i)}\in\mathbb{R}^{d_a\times d_m}\), \(\mathbf{W}_s\in\mathbb{R}^{d_m\times hd_a}\), \(\mathbf{D}\in\mathbb{R}^{d_m\times K}\), \(i\in[1,h]\), and h denotes the number of parallel attention heads. \([\,\cdot\,]\) denotes matrix concatenation, \(\widetilde{\mathbf{V}}_s^{(i)}\) denotes the deep shift, and D denotes the deep trend. We denote the differential attention module as Differential-attention(Q, K, V) to ease representation.
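A condensed single-head PyTorch sketch of Eqs. (15)-(17); the multi-head case concatenates h such results before W_s. The row-major layout, the helper name, and the head dimension are our assumptions.

```python
import torch
import torch.nn as nn

def diff_time(X: torch.Tensor) -> torch.Tensor:
    """Difference along the time (row) axis, padding the last row (cf. Eq. (13))."""
    d = X[1:] - X[:-1]
    return torch.cat([d, d[-1:]], dim=0)

class DifferentialAttention(nn.Module):
    """Single-head sketch of the differential attention module (Eqs. (15)-(17))."""

    def __init__(self, d_model: int, d_attn: int):
        super().__init__()
        self.Wq = nn.Linear(d_model, d_attn)               # W_q, b_q
        self.Wk = nn.Linear(d_model, d_attn)               # W_k, b_k
        self.Wv = nn.Linear(d_model, d_attn)               # W_v, b_v
        self.Ws = nn.Linear(d_attn, d_model, bias=False)   # aggregation W_s
        self.scale = d_model ** 0.5                        # sqrt(d_m), as in Eq. (16)

    def forward(self, Q, K, V):
        # Inputs are (L, d_m) sequences (rows = time steps; transpose of the text's layout).
        dq, dk, dv = self.Wq(diff_time(Q)), self.Wk(diff_time(K)), self.Wv(diff_time(V))  # Eq. (15)
        attn = torch.softmax(dq @ dk.T / self.scale, dim=-1)                              # Eq. (16)
        shifted = attn @ dv                                # one head of the deep shifts
        return torch.cumsum(self.Ws(shifted), dim=0)       # Eq. (17): CumSum gives the deep trend D

# Usage: self-attention on smoothed encoder features H_s (Q = K = V = H_s).
H_s = torch.randn(90, 512)
D = DifferentialAttention(d_model=512, d_attn=64)(H_s, H_s, H_s)
```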

Convolution generator

The final output of Diviner is calculated through convolutional layers, called the one-step generator, which takes the output of the final decoder layer \(\mathbf{E}_{de}^{out}\) as its input:

$$\mathbf{R}_{predict}=\mathrm{ConvNet}(\mathbf{E}_{de}^{out}),$$
(18)

where \(\mathbf{R}_{predict}\in\mathbb{R}^{d_m\times K_r}\) and \(\mathbf{E}_{de}^{out}\in\mathbb{R}^{d_m\times K_{de}}\). ConvNet is a multilayer fully convolutional network whose input and output channels are the decoder input length \(K_{de}\) and the prediction length \(K_r\), respectively.
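A sketch of the one-step generator: a 1D convolution stack whose channels index decoder and prediction positions, mapping K_de input channels to K_r output channels in a single pass; the layer count, hidden width, and activation are our assumptions.

```python
import torch
import torch.nn as nn

class ConvGenerator(nn.Module):
    """One-step generator: map (K_de, d_m) decoder features to (K_r, d_m) predictions."""

    def __init__(self, k_decoder: int, k_predict: int, hidden: int = 128):
        super().__init__()
        # Channels index decoder/prediction positions; the convolution shares parameters
        # across the d_m feature axis, so no step-by-step (dynamic) decoding is needed.
        self.net = nn.Sequential(
            nn.Conv1d(k_decoder, hidden, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv1d(hidden, k_predict, kernel_size=3, padding=1),
        )

    def forward(self, E_de_out: torch.Tensor) -> torch.Tensor:
        # E_de_out: (K_de, d_m) -> add batch dim -> (1, K_de, d_m) -> (1, K_r, d_m).
        return self.net(E_de_out.unsqueeze(0)).squeeze(0)

# Usage: predict K_r = 90 future days from K_de = 28 decoded days (hypothetical sizes).
R = ConvGenerator(k_decoder=28, k_predict=90)(torch.randn(28, 512))
```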

Pseudo-code of Diviner

For the convenience of reproduction, we summarize the framework of Diviner in the following pseudo-code:
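The authors' pseudo-code is not reproduced in this excerpt. As an illustrative stand-in (our reconstruction from the descriptions above, not the released implementation), the following Python-style pseudo-code strings the components together; the exact placement of residual connections and normalization is an assumption.

```python
# Illustrative pseudo-code of the Diviner forward pass (our reconstruction from the
# descriptions above, not the authors' released code). N and M are the numbers of
# encoder and decoder blocks; the components are those sketched earlier in this section.

def diviner_forward(x, input_layer, enc_blocks, dec_blocks, generator):
    E_en, E_de = input_layer(x)                     # dual input: all K days / latest K_de days

    for smooth, diff_attn, ffn in enc_blocks:       # encoder: N stacked Diviner blocks
        H = smooth(E_en)                            # smoothing filter attention (scale converter)
        E_en = ffn(E_en + diff_attn(H, H, H))       # difference attention + residual + feed-forward

    for smooth, self_attn, cross_attn, ffn in dec_blocks:   # decoder: M stacked blocks
        H = smooth(E_de)
        E_de = E_de + self_attn(H, H, H)            # difference self-attention on decoder input
        E_de = ffn(E_de + cross_attn(E_de, E_en, E_en))     # cross-attention to encoder output

    return generator(E_de)                          # one-step convolution generator
```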