Introduction

The River Thames (UK) plays a pivotal role in sustaining the city of London, supplying most of the drinking water for its residents. However, the ongoing sewage crisis has sparked deep concerns over its potential effects on the ecosystem (endangered aquatic life), human health (contamination of drinking water), and the economy (decreased recreational activities). When millions of tonnes of sewage enter the Thames, the water quality dynamics are profoundly disturbed. For example, the levels of dissolved oxygen (DO), a key water quality indicator essential for many aquatic organisms and reflective of river metabolic pulses, can drop substantially. Of particular interest are extreme events: if DO falls below a certain threshold level, which differs between organisms, the consequences can be disastrous for aquatic life1. Thus, maintaining healthy water conditions in the River Thames has become a pressing and escalating challenge. As researchers, we aim to shed light on the complex spatio-temporal dynamics of water quality, specifically DO, to potentially assist policymakers and environmental agencies in effectively managing river conditions and maintaining healthy aquatic environments. The findings in our paper, to be presented in detail in the following sections, could provide tools for accurately forecasting DO concentrations, which would help alert stakeholders when forecasted DO values fall into ranges that endanger aquatic life. Our findings could also help to identify critical time windows where interventions, such as oxygen injection or controlled water treatment releases, should be implemented, alongside enhanced monitoring at sites with more extreme variations. Future studies may extend our methodology to other observables, such as electrical conductivity, in river systems outside the UK as well. Our approach in the following is based on two main pillars: superstatistics and machine learning.

Superstatistical methods, originating from turbulence research2,3,4, offer a powerful and effective approach to describing the dynamics of complex nonequilibrium systems, assuming the existence of well-separated time scales. These methods have been applied to a wide range of fields, such as high energy scattering processes5,6,7, Ising systems8, cosmic ray physics9,10, plasma physics11,12, ultracold gases13, non-Gaussian diffusion processes in small complex systems14,15, power grid frequency16, bacterial DNA17, financial time series18,19 and traffic delays20,21. Recent years have also witnessed an increase in their application in the environmental sciences, including wind statistics22, air pollution statistics23,24, rainfall statistics25, sea surface currents26, earthquake swarms27,28 and water quality time series29. The dynamics of river water quality indicators, which involve a complex interplay of physical, chemical, and biological processes, can be viewed as a superposition of various dynamics across different time scales. These include day-night cycles, seasonal temperature fluctuations, long-term trends due to global warming, and pollution events, making this a suitable subject for the superstatistical approach, which requires complex dynamics on well-separated time scales. Water quality time series (such as DO concentrations and electrical conductivity) as measured in the River Chess have been previously shown to exhibit superstatistical features29. However, the extent to which the water quality dynamics in other, much bigger rivers can be explained using similar techniques based on superstatistics has remained unclear so far, as have the environmental insights and forecasts that can be systematically derived from the superstatistical approach. In this paper we fill this gap and present a systematic analysis of new data from the tidal section of the River Thames.

Machine learning, as a complementary approach to statistical analysis, has been extensively utilized to predict water quality time series in various rivers, aiding in the understanding of the fluctuating dynamics of water quality. For example, Generalized Additive Models and gradient boosting methods have been employed to predict electrical conductivity in the River Chess30. In the Sakarya Basin in Turkey, models such as deep learning, support vector machine regression (SVMR), Gaussian process regression, and artificial neural networks (ANNs) were used to estimate dissolved oxygen levels, with ANNs proving to be the most effective31. ANNs were also shown to be the most effective in predicting suspended sediment concentration in the Thames River, Ontario32. Additionally, SVMR, decision tree regression and extra tree regression were used to predict the water quality index (WQI) of the Lam Tsuen River in Hong Kong33. A long short-term memory (LSTM) model has been effective in predicting DO levels across watersheds in the U.S.34. In the context of the River Thames, existing literature includes neural network applications for chlorophyll estimation and Random Forest methods for predicting nutrient concentrations in two tributaries of the River Thames35,36, applications of Random Forest and Gradient Boosting models in estimating nitrate concentrations in surface waters37, and neural network applications to river flow forecasting38. The state-of-the-art technique, the Transformer, driven by the attention mechanism39, excels in sequence forecasting but has rarely been utilized in the context of water quality indicator dynamics. While one study40 applied a Transformer model to forecast the river WQI, which combines various water quality parameters into an overall assessment of water quality, forecasting such an index differs from predicting individual indicators, which offer detailed insights into specific indicator dynamics and thereby enable targeted investigations.

Our research, to be described in detail in the following sections, innovates in the analysis and forecasting of spatio-temporal DO dynamics within river systems. We introduce a novel multiplicative detrending method for the tidal River Thames, which proves to be more effective in handling complex tidal variations. After detrending, the superstatistical analysis reveals that DO fluctuations exhibit q-Gaussian distributions with power-law tails. By systematically extracting the optimally fitted superstatistical parameters, we uncover important correlations between these parameters and geographical factors, particularly proximity to the sea, which represents a novel insight. Our descriptive statistical parameters are extracted from the analyzed time series using the standard methods of superstatistics41. In terms of machine learning, we demonstrate the robustness of the Light Gradient Boosting Machine (LGBM) model for predicting DO and use SHAP values to identify temperature, pH, and time of year as dominant predictors of DO fluctuations. We use the Informer model, a novel Transformer architecture that has rarely been applied in the study of water quality dynamics. Our results demonstrate the Informer’s superior performance, achieving the lowest errors for long-term forecasts. Moreover, the ProbSparse self-attention mechanism of the Informer model dynamically focuses on the most informative parts of a long input sequence, inherently capturing long-range dependencies and temporal dynamics. This capability addresses areas where classical machine learning models often face limitations without extensive data preprocessing. For a detailed explanation, see the ‘Forecasting models’ subsection. When forecasting 48 quarter-hour time steps of DO, the model identifies crucial time windows in the input data, specifically during morning to early afternoon and late evening to early morning of the most recent time steps, as well as between the 16th and 26th quarter-hour of the preceding half-day, which contributed to its superior performance. This identification provides novel insights into the life cycles of DO dynamics. Overall, our research not only advances the theoretical understanding of DO dynamics in diverse river settings but also enhances practical capabilities in monitoring and forecasting water quality, which is crucial for environmental management and policy-making.

Our key questions aim at a deeper understanding of the complex system interactions and processes relevant for the River Thames: Can DO concentrations in the river be explained via superstatistics, and can we comprehensively account for extreme DO events arising from environmental processes and human activities? How do different water quality indicators affect the prediction of DO? How can we construct a powerful approach for effectively forecasting water quality indicators? These questions will be dealt with in the following, using data from the different measuring stations positioned along the tidal stretch.

This paper is structured as follows: First, we describe our data sets from nine water quality monitoring sites along the River Thames. We then employ statistical and detrending methods to isolate the DO fluctuations from the trend. Next, we apply a superstatistical approach to these fluctuations and uncover the relevant superstatistical parameters as a function of the sites’ distances to the sea. Following this, we utilize regression-based machine learning techniques to predict same-time DO levels, highlighting an interpretation of the model’s prediction mechanism. This sheds light on the most important features influencing the prediction. We then employ state-of-the-art machine learning methodologies, including Informer as a Transformer variant, and deduce the method with the best forecasting performance for DO. Lastly, we explore the self-attention mechanism underlying the Informer model, emphasizing attention weights to explain which inputs contribute to its superior performance.

The data available

In this paper, we utilise data from nine monitoring sites along a 45 km stretch of the River Thames, sourced from Meteor Communications’ telemetry system42. This system connects to multi-parameter water quality sondes within a sensor web network across the River Thames, jointly managed by the Environment Agency (EA) and Meteor Communications. The sites span the Upper Tideway, Central Tideway, and Estuary regions of the Thames. This stretch marks the tidal portion of the river, which is significant for understanding the complex interactions and variability caused by tidal movements, relevant for ecosystems and pollution concentrations alike. This segment of the river also passes through central London.

To provide a geographical overview of our analysed data, we display all the data sites’ locations on a map in Fig. 1. The nine monitoring sites stretch from West to East along the River Thames as follows: Thames Brentford Barge (TBB) and Thames Kew Barge (TKB) in the Upper Tideway region, Thames Chiswick Pier (TChP), Thames Hammersmith (TH), Thames Putney (TPut), Thames Cadogan Pier (TCaP), Thames Barrier Gardens Pier (TBGP), Thames Erith Barge (TEB), and Thames Purfleet (TPur) in the Central Tideway and Estuary regions. The time series measured span from 01/12/2017 to 01/12/2022 and have a temporal resolution of 15 minutes, except for the rainfall feature. To maintain the integrity of all nine data sets, each underwent some checks and data processing prior to being incorporated into the analysis. Details of the data processing methods are provided in the ‘Methods’ section.

Fig. 1

Top: A map of the nine available water quality monitoring sites (red markers) along the River Thames. The geographical visualisations were generated using the Folium Python library43 with map data sourced from OpenStreetMap44. We show the trajectory (a) and PDF (b) of the DO concentration at TCaP over a time span of five years. We apply additive seasonal (c), additive EMD (d), multiplicative seasonal (e) and multiplicative EMD (f) detrending methods. A filtering frequency of \(f = 6\) h or dropping \(m = 3\) modes (orange) captures the oscillating trend while preserving the short-term fluctuations.

Examples of DO concentration time series from site TCaP are shown in Fig. 1a. We observe strong seasonality in the trajectory, with summer exhibiting lower DO levels compared to winter. This can be attributed to the temperature-dependent solubility of DO, as cold water holds more DO than warm water does. Site TCaP also experiences seasonal spikes during early summer. This phenomenon likely reflects a period of algal blooms fueled by increased sunlight, which elevates photosynthetic activity and thereby raises DO levels. In Fig. 1b, the probability density function (PDF) of DO shows clear deviations from Gaussianity, in agreement with previous results found for smaller rivers29. A large portion of the observed variability arises not only from seasonal cycles but also from daily cycles, which must be accounted for before we apply our statistical analysis. In the following section, we will describe how to separate the fluctuations from the trend.

Detrending

Rather than modeling the complete distribution, including its daily and seasonal variations, our approach here is centered on describing the fluctuations around the mean. The mean can be time-dependent and is denoted as ‘trend’ in the following. To initiate the detrending procedure, we may decompose the measured time series in an additive way as

$$\begin{aligned} y_t = T_t + F_t+ R_t. \end{aligned}$$
(1)

Here \(y_t\) is the time series, \(T_t\) is the trend, \(F_t\) are the fluctuations evolving on a much faster time scale, and \(R_t\) is the remainder, all evaluated at time t. Alternatively, one may also consider a multiplicative model, in which case we decompose the time series as

$$\begin{aligned} y_t = {\hat{T}}_t\times {\hat{F}}_t \times {\hat{R}}_t. \end{aligned}$$
(2)

To effectively retrieve the fluctuations, we employ seasonal decomposition and empirical mode decomposition (EMD). Seasonal decomposition separates the trend within the data by applying a moving average with a specific filtering frequency, denoted as f. The difference between this moving average and the original data is characterised as the fluctuations. Alternatively, EMD decomposes the full trajectory into its ordered intrinsic mode functions (IMFs), ranging from slowly changing to highly oscillating modes, as follows:

$$\begin{aligned} y_t = \sum _{i=1}^{N}{IMF_i(t)}+r_N(t), \end{aligned}$$
(3)

where N is the number of modes, and \(r_N(t)\) represents the residue corresponding to N modes. The trend component is identified by summing over the first few IMFs:

$$\begin{aligned} T_t = \sum _{i=1}^{N-m}{IMF_i(t)}, \end{aligned}$$
(4)

where the decision to omit m modes to define the trend is somewhat arbitrary and has to be made by the user. After removing the IMFs that constitute the trend, we consider the remaining detrended data as representing the fluctuations:

$$\begin{aligned} F_t = \sum _{i=N-m+1}^{N}{IMF_i(t)}+r_N(t). \end{aligned}$$
(5)

Given the additive nature of EMD, when applying a multiplicative decomposition, a logarithmic transformation is applied to Eq. (2) to obtain

$$\begin{aligned} \log (y_t) = \log ({\hat{T}}_t) + \log ({\hat{F}}_t)+ \log ({\hat{R}}_t) \end{aligned}$$
(6)

which reduces the problem to an additive one. For this we have to assume that all functions \({\hat{T}}_t,{\hat{F}}_t,{\hat{R}}_t\) are positive so that the logarithm is well-defined.

Which detrending decomposition method is most suitable in the context of water quality dynamics? We wish to have a description of the fluctuations in terms of a simple stochastic process that has a q-Gaussian probability density (see next section for more details on the definition of q-Gaussians in the superstatistical context). The relevance of q-Gaussians was noticed in previous work for data from the River Chess30. Hence, for each measuring station, we computed the log-likelihood for a best fit of the observed PDF of the fluctuations by a q-Gaussian probability distribution. The result is shown in Fig. 2 for the different distances to the sea and detrending methods. One immediately noticeable observation is that, regardless of whether the chosen method is seasonal or EMD, the multiplicative method generally produces a slightly better fit with q-Gaussians than the additive decomposition, as indicated by higher log-likelihood values. Additionally, coastal conditions are more challenging for the additive methods, while the multiplicative methods remain relatively consistent regardless of the distance to the sea. Ultimately, the multiplicative EMD method appears to be the most effective detrending method, exhibiting the highest log-likelihoods with only one exception. Our approach advances beyond the additive methods previously applied to the River Chess30. We introduce a novel multiplicative EMD method to address the pronounced tidal periodic variations observed in the DO time series relevant for the River Thames. Our results confirm that this approach effectively extracts q-Gaussian fluctuation distributions, demonstrating its suitability for complex DO time series that vary significantly in amplitude and frequency.

Fig. 2

Scatter plot showing the relationship between log-likelihood values for different detrending methods and the distance to sea for various sites. The log-likelihood values are evaluated based on the best fit to a q-Gaussian distribution. The graph demonstrates the superiority of the multiplicative methods over the additive ones, with the multiplicative EMD method standing out as the most effective approach in nearly all cases.

Note that the detrended data still include some remaining level of noise, which we choose not to remove. The remainder component, or noise, usually refers to random, unexplained variations in the data that are not part of the trend or of the fluctuations. One might argue that removing the remainder from fluctuations would yield clearer insights into the inherent dynamics of DO. However, our analysis aims to understand the behavior of realistic data, including natural variability and extreme events. Hence, we do not isolate unexplained noise from the detrended data, considering both as integral parts of the overall fluctuations in DO.

The results of different detrending methods, using as an example DO levels measured at the site TCaP over a one-week period, are illustrated in Fig. 1c–f. For the seasonal decomposition (c,e), we choose a filter frequency of \(f=6\) h, and for EMD (d,f), we drop \(m=3\) modes. A filter frequency of \(f=12\) h (green curves) aligns with the periodicity of DO and hence would provide a nearly constant trend function without oscillations. With the choice of \(f=6\) h, however, the trend (orange curve) better reflects the periodic average variations in the data. For this choice we get a meaningful description of fluctuations on a short time scale that are added to the oscillating trend. Similarly, for EMD, we opt for dropping \(m=3\) modes (orange curves) over the smoother trend (green curves) to extract the short-timescale fluctuations. With the data now detrended, we continue to explore the fluctuation statistics using a superstatistical approach.

Superstatistical analysis

The DO fluctuations are the result of the complex interplay of physical, chemical, and biological processes in the River Thames’s ecosystem, and are within the realm where superstatistical approaches are suitable for an effective description. The core concept of superstatistics3,4 is that a complex, heavy-tailed probability distribution describing the statistics of a long time series is produced by a superposition of PDFs of many shorter time series, each with a non-heavy-tailed distribution. As shown previously in29, the marginal DO fluctuation distribution can indeed be seen as an aggregation of many simple Gaussian distributions of fluctuating variance, leading us to expect that Tsallis q-Gaussian distributions are effectively relevant for this problem. The q-Gaussian distribution45 is a probability distribution that generalises the standard Gaussian distribution. It maximizes Tsallis entropy subject to suitable constraints. The q-Gaussian distribution emerges naturally within the superstatistical framework, particularly in systems that exhibit fluctuations in an inverse variance parameter \({\hat{\beta }}\) on a larger time scale. \({\hat{\beta }}\) can in principle follow any kind of distribution. If the PDF of \({\hat{\beta }}\) is sharply peaked, then this in good approximation leads to a q-Gaussian distribution when integrating out the \({\hat{\beta }}\) fluctuations3,46. This statement is exact if \({\hat{\beta }}\) follows a \(\chi ^2\)-distribution, as described by

$$\begin{aligned} f({\hat{\beta }}) = \frac{1}{\Gamma (\frac{n}{2})} \left( \frac{n}{2{\hat{\beta }}_0}\right) ^{\frac{n}{2}} {\hat{\beta }}^{\frac{n}{2} - 1} e^{-\frac{n{\hat{\beta }}}{2{\hat{\beta }}_0}}, \end{aligned}$$
(7)

where n denotes the degrees of freedom and \({\hat{\beta }}_0\) represents the mean of \({\hat{\beta }}\). In this case the marginal distribution

$$\begin{aligned} p(x)=\int _0^\infty f({\hat{\beta }}) p(x|{\hat{\beta }}) d{\hat{\beta }} \end{aligned}$$
(8)

is exactly a q-Gaussian if \(p(x|{\hat{\beta }})\) is an ordinary Gaussian with inverse variance parameter \({\hat{\beta }}\). The q-Gaussian PDF, which is well motivated by superstatistics, is given by the formula

$$\begin{aligned} p(x) = \frac{1}{C_q} [ 1 + (q-1) \beta (x - \mu )^2 ]^{\frac{1}{1-q}}, \end{aligned}$$
(9)

where \(C_q\) is a normalisation constant, \(\mu \) is a shift parameter, q is a shape parameter, also known as the entropic index, and \(\beta \) is a scale parameter that is proportional to the expectation \(\langle {\hat{\beta }} \rangle \). The shape parameter \(1<q<3\) controls the degree of non-extensivity of the system and the distribution reduces to the Gaussian distribution when \(q=1\). Different from the standard Gaussian distribution, the q-Gaussian distribution depends on two parameters, the scale parameter \(\beta \) and the shape parameter q, making it quite an effective tool for fitting power-law tails describing extreme events.
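To make the superstatistical construction concrete, the following sketch (our own illustration, not the authors’ code) draws \({\hat{\beta }}\) from the \(\chi ^2\)-distribution of Eq. (7), samples Gaussian data conditional on \({\hat{\beta }}\) as in Eq. (8), and recovers q and \(\beta \) by maximum likelihood under Eq. (9). For this exactly solvable case, \(q = (n+3)/(n+1)\).

```python
# Synthesise superstatistical data per Eqs. (7)-(8) and fit Eq. (9) by MLE.
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def qgauss_logpdf(x, q, beta):
    # log of Eq. (9) with mu = 0; for 1 < q < 3 the normalisation is
    # C_q = sqrt(pi/((q-1) beta)) * Gamma((3-q)/(2(q-1))) / Gamma(1/(q-1)).
    log_cq = (0.5 * np.log(np.pi / ((q - 1) * beta))
              + gammaln((3 - q) / (2 * (q - 1))) - gammaln(1 / (q - 1)))
    return np.log1p((q - 1) * beta * x**2) / (1 - q) - log_cq

def fit_qgaussian(x):
    """Return the MLE (q, beta) and the maximised log-likelihood."""
    nll = lambda p: -np.sum(qgauss_logpdf(x, *p))
    res = minimize(nll, x0=[1.5, 1.0 / np.var(x)],
                   bounds=[(1.01, 2.99), (1e-8, None)])
    return res.x, -res.fun

rng = np.random.default_rng(0)
n_dof, beta0 = 4, 1.0
beta_hat = rng.gamma(n_dof / 2, scale=2 * beta0 / n_dof, size=200_000)  # Eq. (7)
x = rng.normal(0.0, 1.0 / np.sqrt(beta_hat))     # Gaussian given beta_hat, Eq. (8)
(q_fit, beta_fit), loglik = fit_qgaussian(x)
print(q_fit, (n_dof + 3) / (n_dof + 1))          # fitted q vs exact q = 7/5
```

A routine of this kind, applied to the detrended DO fluctuations of each site, produces the log-likelihoods compared in Fig. 2 and the (q, \(\beta \)) pairs shown in Fig. 3.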

Fig. 3

Top: PDFs of oxygen fluctuations obtained via additive seasonal (a), additive EMD (b), multiplicative seasonal (c) and multiplicative EMD (d) detrending methods, for the example of the site TCaP. Regardless of the methods used, detrending leads to non-Gaussian distributions, which can be approximated by q-Gaussian distributions (purple). Bottom: For each site along the River Thames, we plot the scale parameter \(\beta \) and the shape parameter q from the q-Gaussian fitting against its distance to the Thames Estuary/sea.

The measured DO fluctuation distributions (blue) follow a q-Gaussian distribution (purple) in good approximation, as illustrated in Fig. 3a–d for the example of the site TCaP. The PDFs are well approximated by the q-Gaussian fits regardless of which detrending method is applied. Note that these are the distributions of the deviation of the DO signal from the systematic trend, not the signal itself. We optimized the fitting parameters via maximum likelihood estimation (MLE) for all nine sites.

By employing the optimal q-Gaussian fittings, we determine the scale parameter \(\beta \) and the shape parameter q for each site. Recall that \(\beta \) adjusts the width of the distribution whereas q determines the tail behavior of the distribution. To identify underlying spatial patterns in the variability of the relevant parameters, we plot each of these parameters against their respective distances to the Thames Estuary in Fig. 3(bottom). The first general observation is the non-Gaussianity of DO fluctuations for all nine measuring sites, as indicated by \(q>1\). We confirm that the distributions of DO fluctuations exhibit a heavy-tailed characteristic in the more complex environment of the River Thames, compared to the relatively smaller River Chess. The majority of these q values fall within the 5/3 to 2 range, which suggests the presence of infinite variance in many of these q-Gaussian distributions, as the second moment diverges for \(q>5/3\). Intriguingly, we observe a linearly decreasing trend in \(\beta \) versus distance to the sea, regardless of the detrending method used. This observation suggests that DO dynamics are influenced by spatial location and significantly impacted by geographical conditions, particularly proximity to the sea and associated seawater. There is greater variability or dispersion (smaller \(\beta \), i.e., broader distributions) in DO values for sites further away from the sea. This pattern may be due to the diminishing influence of seawater, which tends to stabilize oxygen levels closer to the coast. Factors such as decreased water mixing, reduced impact of tidal forces, and changes in salinity fronts, stratification, and sediment oxygen demand may also have a more pronounced influence inland.

Generally, the \(\beta \) values from multiplicative detrending are bigger in magnitude than those deduced from additive decompositions. This suggests that the multiplicative method may be more sensitive to capturing complex or nonlinear variability in DO levels. This observation is consistent with our previously described findings in the section on Detrending. To understand the behavior and characteristics of extreme DO events more effectively, we formulate regression models in the next section.

Regression analysis

Understanding the statistical properties of extreme DO events, as explored in the previous section, is pivotal for developing robust predictive models. Using explainable machine learning techniques, we investigate and quantify the relationships between DO and other measurements of water quality. Central to this effort is the identification of variables that significantly impact DO levels, which is essential for enhancing water quality management. Here, we have applied a gradient-boosted tree method, specifically the Light Gradient Boosting Machine (LightGBM or LGBM)47, to the time-series data from TPut and TKB as examples, generating same-time predictions of the DO levels given a series of environmental and time features (see the Methods section for more details). SHAP values48,49, approximating Shapley values, are then used to infer which features have the greatest effect on the generated regression models for each site.

All of the error results are displayed below in the form of symmetric mean absolute percentage error (SMAPE), computed as follows:

$$\begin{aligned} \text {SMAPE} (y_i, {\hat{y}}_i) = \frac{1}{n} \sum _{i=1}^{n} \frac{|y_{i} - {\hat{y}}_{i}|}{|y_{i}|+|{\hat{y}}_{i}|}, \end{aligned}$$
(10)

where n is the number of samples in the test set and \(y_i, {\hat{y}}_i\) represent the real-time DO measurements and the LGBM predictions, respectively, for each corresponding entry i within the test dataset. As can be seen in Table 1, the LGBM generally exhibits low prediction errors, as measured by SMAPE. Site TBGP displays the highest SMAPE, which may be attributable to its proximity to the sea, leading to more fluctuating and complex DO levels in the water. TH, TChP and TBB exhibit relatively higher SMAPE values, likely due to the presence of the Mogden Sewage Treatment Works upstream of the sites, resulting in more complex, changing patterns of DO levels heavily influenced by human activities.
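For reference, Eq. (10) translates directly into code (our transcription; note that this convention omits the factor of 2 and the percentage scaling found in some textbook definitions of SMAPE):

```python
# SMAPE as defined in Eq. (10); y_true and y_pred are numpy arrays.
import numpy as np

def smape(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred) / (np.abs(y_true) + np.abs(y_pred)))
```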

Table 1 SMAPE values of the LGBM models compared to the values of the XGBoost and linear regression baseline models, for the nine Thames river sites.
Fig. 4

TPut (top) and TKB (bottom) data plotted against the same-time predictions by LGBM, all in chronological order. The SHAP values are plotted on the right-hand side for each site, showing the calculated feature importances.

Although there is substantial variability in the very short-time-scale fluctuations in DO, Fig. 4 shows that the generated regression models follow the peaks and troughs of DO variation over longer time spans, demonstrating that these models utilize the full range of possible and realistic DO values and do not stick only to mean values. The excellent performance is attributable to several factors. LGBM utilizes leaf-wise (best-first) tree growth, which can lead to deeper and potentially more complex trees compared to the level-wise approach used by XGBoost. LGBM also discretizes continuous feature values into bins to speed up training and reduce memory usage. As a result, LGBM is generally faster and more memory-efficient, particularly for large datasets, due to its histogram-based approach and optimizations like GOSS and EFB (see the Methods section).

We trained the model to understand how the different water quality indicators interact and work together. Figure 4 shows the calculated feature importances of the featured models. Negative SHAP values indicate that a feature leads to lower DO predictions, whereas positive SHAP values indicate that a feature pushes the prediction towards higher DO levels. The colour represents the feature value, ranging from high (red) to low (blue). We can see that the order and correlation of feature importance is very similar for both sites, with the top three features being Temperature, pH and sine of the Year (encoding the time of the year). The other sites show similar results, with 4 out of the 7 other sites having the same order for the top three features. This suggests that for future research, temperature and pH sensors should be prioritised, whether for analysis or for predicting DO. Temperature is the most important feature at 6 of those 7 sites and ranked second at the remaining one. Temperature and DO have a negative correlation, i.e. higher temperatures reduce the predicted DO, while lower temperatures increase it. Meanwhile, pH, the second most impactful factor, exhibits a positive correlation with the predicted DO values. Specifically, alkaline waters are associated with an increase in the predicted DO values, whereas acidity tends to decrease the DO prediction. This is consistent with the fact that high-pH water contains low levels of hydrogen ions, meaning more oxygen is available, and vice versa. Among the selected calendar features, the Year (and its sine encoding) clearly takes precedence, suggesting that the most predictable fluctuations of the DO concentration occur over an annual cycle.

The feature ranking deduced here for DO differs from a previous feature importance analysis of electrical conductivity in the River Chess30, where temperature was not as important, and River Level, pH, and Month were the most significant features. This can be explained by the fact that DO is highly dependent on water temperature, and these annual fluctuations are the primary control on DO levels. Furthermore, pH varies with the photosynthetic activity of algae, providing a biological control on water quality. During daylight, photosynthesis increases oxygen production, reduces carbon dioxide (and thus carbonic acid) levels, and raises alkalinity. Conversely, at night, increased respiration raises carbon dioxide levels, lowers DO, and increases water acidity. Additionally, monthly and yearly periodicities were prioritized when predicting electrical conductivity and DO, respectively. This distinction highlights that although both rivers are subject to temporal factors, the specific impacts of these factors (monthly and yearly periodicities) on water quality indicators are distinct. The ‘month’ feature indicates that electrical conductivity is more affected by monthly patterns due to factors such as heavy rainfall or dry spells. Conversely, the ‘time of the year’ feature implies that DO levels are more influenced by annual patterns, likely due to temperature variations and related photosynthetic activities of aquatic plants. Another possible reason for the differences is the greater influence of anthropogenic activities, such as sewage pollution, in the River Thames compared to the more remotely located (small) River Chess.

Time series forecasting

In the previous section, we focused on the analysis and interaction of different quantities (e.g. DO and temperature) at the same time. Now, we move to the question of whether and how an important water quality indicator, such as DO, can be forecasted for the next hours or even days. In this section, we briefly review and apply the recently developed Informer model50, a state-of-the-art deep learning model that can be employed to predict a wide range of long-sequence time series, including DO. To our knowledge, this marks one of the first applications of this model to sequential DO forecasting, aimed at achieving precise forecasts for multiple time steps.

The core of the Informer model is a stacked transformer architecture, which includes an encoder and a decoder. The encoder captures dependencies in the input data, transforming the input sequence into a set of high-level features. Subsequently, the decoder takes the encoded sequence and generates the output sequence. In addition to the Informer, we compare various forecasting models, including baseline approaches such as “Last”, “Repeat”, and “Linear”, as well as neural network architectures like Sequential Dense Network (“Dense”)51, Convolutional Neural Network (“Conv”)52, and Long Short-Term Memory (“LSTM”)53. Details regarding the architecture and hyperparameter settings of these models are provided in the Methods section.

To evaluate the performance of a model, we compute its mean absolute error (MAE) and SMAPE for forecasts \(t \in \{1, 12, 24, 48\}\) quarter-hours into the future based on the test sets. They are computed as follows:

$$\begin{aligned} MAE_t (y_i, {\hat{y}}_i) = \frac{1}{n} \sum _{i=1}^{n} |y_{i,t} - {\hat{y}}_{i,t}|, \end{aligned}$$
(11)
$$\begin{aligned} SMAPE_t (y_i, {\hat{y}}_i) = \frac{1}{n} \sum _{i=1}^{n} \frac{|y_{i,t} - {\hat{y}}_{i,t}|}{|y_{i,t}|+|{\hat{y}}_{i,t}|}, \end{aligned}$$
(12)

where n is the number of samples in the test set and \(y_i, {\hat{y}}_i\) represent the actual DO measurements and the DO predictions, respectively, for each prediction horizon t and corresponding entry i within the test dataset. The metrics were computed for five iterations and then averaged to evaluate each model’s predictive performance. The analysis was performed on the TBGP data set, which comprises fourteen covariates. The results are presented in Table 2.

Table 2 MAE and SMAPE results for sequence time-series forecasting across various horizons (“Pred”) and input lengths (“Input”) on the TBGP data set using different methods.

Comparing the seven models, the MAE and SMAPE averaged over all horizons and outputs are lowest for the Informer, except when predicting a single time step. This demonstrates the superior predictive capability of the Informer, particularly for long prediction horizons, compared to the other models. Notably, the performance of the Informer in predicting 12, 24 and 48 time steps (3, 6 and 12 h, respectively) improves with an increased input sequence, as indicated by the decreasing MAE and SMAPE. This can be attributed to the novel ProbSparse self-attention mechanism of the Informer model, which dynamically focuses on the most informative parts of a long input sequence and effectively captures long-range dependencies and temporal dynamics. The Repeat model appears to be the second-best for forecast horizons of \(t \in \{12, 24, 48\}\). This can be attributed to the half-day periodicity of DO, which the Repeat model utilizes, aligning well with short-term dynamics in the aquatic environment and thereby enhancing its forecasting ability. Example windows for the Informer’s forecasting results for \(t\in \{12, 48\}\) are illustrated in Fig. 5 (top). The Informer model exhibits capabilities in forecasting future values of DO for both shorter (12) and longer (48) time steps, with the predicted values generally aligning with the actual values.

Above, we demonstrated the high-performance capabilities of the Informer. However, which inputs are particularly important remains unknown so far. We will next explore the self-attention mechanism that underlies these models, aiming to identify the elements driving their forecasting power. The self-attention mechanism, which assigns different weights to different inputs for the prediction (“giving attention to the important contributors”), is detailed in the Methods section.

Fig. 5

Top: Example windows for the Informer’s forecast of \(t\in \{12, 48\}\) with input lengths of 48 and 192 time steps, respectively. The blue line represents the DO input at every time step. While the model processes all features, this visualization only displays the DO. The green dots depict the desired prediction values. The purple crosses represent the forecast of DO made by the Informer model in a single shot. Bottom: A visualization of the attention weights, represented through a heatmap, when predicting 48 future time steps from 192 past steps. The x-axis and y-axis correspond to the keys and queries, respectively, associated with each element in the sequence. Lighter colors correspond to higher weights.

To identify which parts of the input are important for the prediction, we illustrate the attention weights in scenarios involving the maximum input and output lengths, i.e., 192 input time steps (48 h) to predict 48 time steps (12 h) in Fig. 5. This approach offers insights into the internal workings of the model’s self-attention mechanism. Note that the last few columns on the right contain numerous high attention weights. The information at positions 175 to 180, representing the recent time steps in the input sequence, was highly influential during two distinct periods: morning to early afternoon and late evening to early morning. Each of these periods spans 6 h. This observation suggests that our model has effectively incorporated half-daily patterns while creating new representations. Furthermore, the high attention weights in columns 160 to 170 indicate that the model paid significant attention to the keys from the 16th to the 26th time step of the last half-day while processing the queries. This aligns with the Repeat model’s good performance in forecasting results. The identified high-risk periods, specifically from morning to early afternoon and from late evening to early morning of the most recent time steps in the input, as well as the 16th to 26th time steps of the last half-day, are crucial. When combined with accurate long-term forecasting, these insights could guide and refine environmental governance strategies. Such proactive measures in water management may include controlled treatment releases and oxygen injection, particularly when forecasting low DO levels.

Discussion

Our research extends previous results, which were obtained for a comparatively small river, the River Chess29, to a much larger river, the tidal stretch of the River Thames, covering nine sites over a distance of 45 km. The study for the River Chess29 utilized the additive EMD method for detrending, with fluctuations following a q-Gaussian distribution. In contrast, the time series from the River Thames exhibit more pronounced tidal periodic variations, varying significantly in frequency and amplitude. To address this difference, we employed a multiplicative detrending method for the River Thames, which proved to be more effective, as demonstrated by higher log-likelihood values for the q-Gaussian fits than those obtained with the additive approach previously used for the River Chess. After detrending, the fluctuation statistics are, in very good approximation, q-Gaussian and thus exhibit power-law tails. This pattern is confirmed across all sites along the tidal Thames. Moreover, our analysis did not only focus on individual sites but also systematically compared the best-fitting superstatistical parameters across all sites. This comparative analysis helped to derive correlations between these parameters and the distance to the sea, highlighting how geographical conditions, particularly proximity to the sea and associated seawater influence, affect the tail behaviors of the PDFs of DO.

In the context of machine learning, previous research for the River Chess30 demonstrated the effectiveness of the LGBM in predicting temperature and electrical conductivity. We expanded the use of LGBM to include dissolved oxygen in the River Thames, emphasizing the robustness of the boosted tree approach across different water quality indicators and rivers. Our feature importance analysis revealed differences in the importance of, and (inter)dependencies between, individual indicators compared to the River Chess30, where temperature was less significant than river level, pH, and month. Although the Transformer model has recently been applied to forecast the WQI of the Bhavani River, which combines various water quality parameters to provide an overview of water quality40, we aimed to forecast individual indicators such as DO. This approach offers detailed insights into specific aspects of water quality, enabling targeted investigations. Furthermore, we demonstrated that the Informer model is an effective tool for accurate long-term forecasts of DO up to half a day, achieving the lowest SMAPE and MAE. The Informer’s embedded ProbSparse self-attention mechanism highlighted specific time periods (morning to early afternoon and late evening to early morning in the most recent inputs, as well as the 16th to 26th quarter-hour of the previous half-day) as particularly influential in the forecasting process, offering a significant and novel insight.

Our data analysis so far is limited by the relatively small number of real-time measuring sites along the River Thames. Further data sets from more sites located at each river stretch could help to provide a more systematic picture of the overall statistics, allowing us to judge which PDF shapes are generic and which observed shapes are specific to a given measuring site. Currently, the resolution discrepancy between our rainfall data and other environmental parameters has limited its utility in predictive models. High-resolution rainfall data in future studies could refine our analysis, potentially increasing the accuracy of our forecasts.

Our research may provide policymakers and other stakeholders with insights and effective forecasting tools to monitor and manage the health of aquatic ecosystems. The Informer model could be integrated into a monitoring system to alert environmental agencies when forecasted values indicate potential risks, such as low oxygen levels which could harm aquatic life. Additionally, our analysis identified critical periods, specifically from morning to early afternoon and from late evening to early morning, which, along with key factors like temperature, pH, and time of year, can inform and update environmental regulations and policies. For instance, implementing measures such as oxygen injection or controlled water treatment releases during these high-risk periods could effectively enhance environmental protection efforts for the River Thames. Furthermore, the correlation analysis between the superstatistical parameters and distance to the sea reveals that inland sites, which are more susceptible to greater variability or dispersion in DO values, could be prioritized for enhanced monitoring. These sites could be equipped with more precise sensor technology to better reflect changes in river conditions. In this context, the method developed in our study provides a valuable tool for researchers and policymakers in modeling the tail behaviors (extreme events) of environmental indicators.

Our methods developed here could pave the way for broader exploration and application. Possible extensions include different environments ranging from coastal to estuarine, as well as rivers in countries other than the UK. Future studies may extend this methodology to other observables, such as electrical conductivity, building on the insights gained from our targeted examination of DO, and may examine cross-correlations between different water quality indicators. While our focus has been on the statistical properties of time series data, future work will also consider the specific geographical and environmental conditions of each river site. This approach will enable more targeted actions and scientific investigations to maintain and preserve the health of river ecosystems. For comparative analysis of machine learning models, additional statistical methods could be explored to aid in choosing the best-performing model. In addition to comparing performance metrics in DO estimation, recent research in the Sakarya Basin, Turkey31 utilized Taylor and violin diagrams to provide a more comprehensive statistical comparison of model performances. Taylor diagrams, error boxplots, and violin graphs could serve as additional tools to further enhance the visualization of model performance comparisons. Moreover, the Kruskal-Wallis test could provide a statistical basis for evaluating and validating the performance of the different models in future studies. Forecasting methods such as the Temporal Fusion Transformer54 and N-HiTS55 have the potential to enhance predictions. Probabilistic Transformer56 and Natural Gradient Boosting57, on the other hand, allow for forecasts and probabilistic predictions, respectively, with uncertainty estimates. However, these methods require higher computational cost and a different experimental design, which were beyond the resources and scope of this paper. We prioritized a first proof-of-concept application and evaluation of the existing models discussed above to establish an accurate DO prediction technique. Future projects could explore these methodologies to further investigate the impact of uncertainty on ML model predictions.

Conclusion

This study introduces a novel approach to analyzing and forecasting the spatio-temporal dynamics of DO in the River Thames by applying superstatistical methods and machine learning models. We introduced a multiplicative detrending method that effectively separates trends from rapidly fluctuating DO levels, leading to more accurate q-Gaussian fittings. The fluctuations were observed to follow a q-Gaussian distribution and exhibit power-law tails, confirming previous results obtained for much smaller rivers. The superstatistical parameters \(\beta \) and q of the q-Gaussians revealed that proximity to the sea and associated seawater significantly influences DO fluctuations. Another key contribution is demonstrating the robustness of the LGBM model for real-time DO predictions and the superior long-term forecasting capabilities of the Informer model. Interpretation of the LGBM model using SHAP values consistently identified temperature, pH, and time of the year as dominant predictors of the DO fluctuations across different sites. Additionally, an exploration of the Informer model’s self-attention mechanism revealed that inputs from morning to early afternoon and from late evening to early morning of the most recent time steps, as well as from the 16th to the 26th quarter-hour of the previous half-day, primarily drive its superior performance.

Our study is limited by the small number of real-time measuring sites and the resolution discrepancy between rainfall data and other environmental parameters. Future research should address these limitations by incorporating more extensive data. Furthermore, future studies may extend our methodology to other observables, such as electrical conductivity, in different river systems outside the UK, and conduct closer examinations of geographical and site-specific extreme events at each location. While our current paper provides a proof-of-concept approach to DO forecasting, future studies should explore advanced models like the Temporal Fusion Transformer and Natural Gradient Boosting to enhance accuracy and account for uncertainty. Our findings suggest that the Informer model could be integrated into monitoring systems to provide early warnings for low DO levels, identifying critical time windows when targeted environmental interventions could be implemented. The q-Gaussian method for analysing extreme events and tail behaviour of PDFs can also be adapted for different contexts, offering valuable quantitative tools for identifying sites that exhibit more extreme variations than others and that may thus require enhanced monitoring.

Methods

The data set considered

We utilise water quality data from nine monitoring sites along the River Thames from the Meteor Communications telemetry system42. The Environment Agency (EA), in collaboration with Meteor Communications, operates a sensor web network across the River Thames. This network features multi-parameter water quality sondes connected to Meteor’s telemetry system, transmitting measurement data from the nine sites in real time to a central database, accessible via a web host. Water quality indicators were measured using real-time sensors from 01/12/2017 to 01/12/2022 with a temporal resolution of 15 min. We use the following water quality indicators:

  1. Dissolved oxygen (DO) refers to the concentration of oxygen gas incorporated in water, vital for aquatic life, measured in milligrams per liter (mg/L).

  2. Temperature (Temp) is the local water temperature, measured in degrees Celsius (\(^\circ \)C).

  3. Electrical conductivity denotes the ability of the water to conduct an electric current, measured in microsiemens per centimeter (\(\upmu \)S/cm).

  4. pH (pH) indicates the acidity or alkalinity of a solution, using the pH scale ranging from 0 to 14.

  5. Ammonium (AMMONIUM) refers to the concentration of ammonium ions, a measure of nitrogen pollution in water, measured in milligrams per liter (mg/L).

  6. Turbidity (TURBIDITY) refers to the cloudiness or haziness of a fluid caused by large numbers of individual particles, measured in nephelometric turbidity units (NTU).

Additionally, we extract rainfall data from the CEDA Archive58,59 to take into account the impact of rainfall on water levels in our predictions and forecasts. We pair each site’s data set with daily/hourly rainfall data from the closest currently running rainfall monitoring station. For example, we extract daily rainfall data from the site “DEPTFORD P STA”, which is the closest currently running rainfall monitoring station to the site TBGP (4.5 km aerial distance). We paired the daily rainfall data with the 15-minute measurements taken on the same day to incorporate its potential impact as a predictive feature in the forecast, as sketched below.
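A minimal pandas sketch of this pairing, with illustrative variable names (df holds one site’s 15-minute records with a DatetimeIndex; rain_daily is a Series of daily rainfall totals indexed by date):

```python
# Broadcast each day's rainfall total onto that day's quarter-hour records.
import pandas as pd

df['date'] = df.index.normalize()                 # midnight timestamp of each day
rain = rain_daily.rename_axis('date').rename('rainfall')
df = df.join(rain, on='date').drop(columns='date')
```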

Data processing

The data sets retrieved contain several problems, which we rectify in the following way:

  1. The DO and DO-MGL parameters are not considered; instead, we use the DOO-MGL parameter, measured with an optical optode, for more reliable data. ‘DOO-MGL’ is equivalent to the commonly known ‘DO’ in this paper.

  2. DOO-MGL measurements exceeding 25 mg/L are deemed faulty and discarded, since, according to Wetzel60, dissolved oxygen in most natural waters typically ranges between 8 and 14 mg/L under standard pressure. While this can increase based on temperature, salinity, and pressure interactions, a concentration as high as 25 mg/L is generally unlikely under standard conditions.

  3. We remove ‘salinity’ measurements, as they duplicate the electrical conductivity.

  4. The sites experienced unusual fluctuations in DOO-MGL in August 2022 due to the artificial injection of oxygen into the River Thames61. To eliminate bias from human interaction, we have excluded these measurements from our analysis.

  5. We remove non-positive measurements for the ‘COND’ (electrical conductivity), ‘PH’, ‘AMMONIUM’, ‘Turbidity’ and ‘DOO-MGL’ indicators.

  6. We remove large spikes and sudden dropouts due to probe faults or human activities, such as the oxygen pumped into the river from a boat in August 2022.

  7. Electrical conductivity measurements recorded in mS/cm are converted to \(\upmu \)S/cm.

  8. We remove data outages, which were caused by sonde changes.

We measure the aerial distances between the sites and the Thames Estuary in kilometers (dist_to_sea) to analyse the effect of the sea distance on the results. The estuary of the Thames is the transitional area where the River Thames’s fresh waters mix with the saltwater of the North Sea.

Other covariates include time and calendar features: the hour of the day, the day of the week, the month of the year (all encoded as values in \([-0.5, 0.5]\)), the time of the half-day and the time of the year (both sine- and cosine-encoded). The latter two signals are extracted specifically to manage both half-daily and yearly periodicity. We identify these as the two most critical frequencies, determined by extracting features using the Fast Fourier Transform, see Fig. 6. In total, this amounts to fourteen covariates. All the models employ a multivariate feature set to predict a single output feature, DO, across all output time steps.
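A sketch of these time encodings with illustrative column names (t is the 15-minute DatetimeIndex of one site’s records):

```python
# Linear calendar features in [-0.5, 0.5] plus sine/cosine pairs for the
# two dominant FFT periods (half-day and year, cf. Fig. 6).
import numpy as np
import pandas as pd

def time_features(t: pd.DatetimeIndex) -> pd.DataFrame:
    sec = t.hour * 3600 + t.minute * 60 + t.second        # seconds into the day
    half_day, year = 12 * 3600, 365.2425 * 24 * 3600      # periods in seconds
    sec_of_year = (t.dayofyear - 1) * 86400 + sec
    return pd.DataFrame({
        'hour': t.hour / 23 - 0.5,
        'day_of_week': t.dayofweek / 6 - 0.5,
        'month': (t.month - 1) / 11 - 0.5,
        'half_day_sin': np.sin(2 * np.pi * sec / half_day),
        'half_day_cos': np.cos(2 * np.pi * sec / half_day),
        'year_sin': np.sin(2 * np.pi * sec_of_year / year),
        'year_cos': np.cos(2 * np.pi * sec_of_year / year),
    }, index=t)
```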

Fig. 6

By performing a Fast Fourier Transform, we can determine the important frequencies. Note the intuitive peaks at frequencies near 1/year and 1/half day.

We use an (80%, 20%) split for the training and test sets in the regression analysis, and a (70%, 20%, 10%) split for the training, validation, and test sets in the time series forecasting. Note that the data is not randomly shuffled before splitting, for two reasons. Firstly, it ensures that chopping the data into windows of consecutive samples is still possible. Secondly, it ensures that the results obtained from validation and testing are more realistic, as the models are evaluated on data collected after they were trained. Normalising the data is the next step; the mean and standard deviation are calculated based on the training set only. Thus, the models are not exposed to data from the validation and test sets.
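In code, the chronological split and normalisation amount to the following sketch (shown for the 70%/20%/10% forecasting split; df is a site’s feature DataFrame in time order):

```python
# Chronological split: no shuffling, so windows of consecutive samples
# remain possible and evaluation uses data collected after training.
n = len(df)
train = df[: int(0.7 * n)]
val = df[int(0.7 * n): int(0.9 * n)]
test = df[int(0.9 * n):]

# Normalisation statistics come from the training set only, so the
# models never "see" the validation or test data.
mean, std = train.mean(), train.std()
train, val, test = [(s - mean) / std for s in (train, val, test)]
```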

Detrending methods

We implement seasonal decomposition using the python statsmodels.tsa.seasonal package62, and EMD using the PyEMD package63. We choose a filtering frequency of \(f=6\) h and vary the number of modes dropped based on each site’s trajectories, as illustrated in the code.

Regression models

A Light Gradient Boosting Machine (LGBM)47 was used as the interpretable regression model for the river data. This is a gradient-boosting method which grows gradient-boosted trees leaf by leaf rather than level by level and uses Gradient-Based One-Sided Sampling (GOSS) and Exclusive Feature Bundling (EFB) to allow its histogram-based decision tree learning algorithm to run faster, see Fig. 7.

Fig. 7

Illustration of the LGBM model. The decision tree is built leaf by leaf, as opposed to other methods, such as XGBoost’s gradient-boosted trees, which build level by level.

To tune the hyperparameters of the LGBM model, the Optuna tuning software64 was used (through the Verstack wrapper), which applies a stepwise algorithm to tune the hyperparameters.

To interpret the results of the LGBM models, SHAP (SHapley Additive exPlanations) values48,49 were computed on a random sample of 500 observations. SHAP is a method from coalitional game theory that tells us how to fairly distribute the “payout” (the prediction by the model) between the “players” (the features in the model) in a “coalition” (the model). The explanation itself is \(g(z')=\phi _0+\sum _{j=1}^M\phi _jz_j' \) where g is the explanation model, \(z'\in \{0,1\}^M \) is the coalition vector, M is the maximum coalition size and \(\phi _j\in {\mathbb {R}} \) is the feature attribution for feature j, i.e. the Shapley values.
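The pipeline can be sketched as follows (a minimal sketch, assuming a feature matrix X and a same-time DO target y; the actual hyperparameters come from the Optuna search rather than the defaults used here):

```python
# Same-time LGBM regression of DO plus SHAP explanation, cf. Fig. 4.
import lightgbm as lgb
import shap

n_train = int(0.8 * len(X))                 # 80%/20% chronological split
X_tr, X_te = X[:n_train], X[n_train:]
y_tr, y_te = y[:n_train], y[n_train:]

model = lgb.LGBMRegressor()                 # defaults; tune via Optuna in practice
model.fit(X_tr, y_tr)

explainer = shap.TreeExplainer(model)
sample = X_te.sample(500, random_state=0)   # random sample of 500 observations
shap_values = explainer.shap_values(sample)
shap.summary_plot(shap_values, sample)      # beeswarm of feature importances
```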

Forecasting models

The employed forecasting models are as follows:

  1. Last: A simple baseline is to repeat the last input time step for the required number of output time steps.

  2. Repeat: The second baseline method involves repeating the same length of initial data from the previous half-day as the prediction length. This assumes that the patterns from the start of the previous half-day will be similar in the upcoming period, given the half-day periodicity observed in our data.

  3. Linear: A linear model is applied along the last axis of the data, changing its format from [batch, time, inputs] to [batch, time, units]. It works independently on each item across both the batch and time axes. The model aims to predict the output time steps based on a single input time step using a linear projection.

  4. Sequential Dense Network (Dense): The model selects the last time step from the input, reshaping it from [batch, time, features] to [batch, 1, features]. It then passes the data through a dense layer of 512 units with ‘ReLU’ (Rectified Linear Unit) activation28, maintaining the shape as [batch, 1, dense units]. Finally, it reshapes the output to match the desired number of steps and features. It is capable of capturing more complex patterns in the data than a linear model.

  5. Convolutional Neural Network (Conv)65: The model first slices the last three time steps from the input, changing the shape from [batch, time, features] to [batch, 3, features]. It then applies a 1D convolutional layer with 256 units and ‘ReLU’ activation, reshaping the data to [batch, 1, conv units]. A dense layer and a reshape layer adjust the data to [batch, out steps, 1], yielding a convolutional model capable of producing a prediction based on that sequence.

  6. Long Short-Term Memory (LSTM)53: An LSTM network, a recurrent neural network, processes a time series in a sequential manner and maintains an internal state across time steps. It starts with an LSTM layer containing 32 units, transforming the data shape from [batch, time, features] to [batch, lstm units]. After experimentation with various numbers of layers and units per layer, we found this model architecture provided the best performance. If historical inputs are relevant for the predictions, the LSTM model can be trained to exploit them effectively. In this specific application, the model accumulates internal state information for a half-day before generating a forecast for the subsequent time steps. A minimal sketch of this architecture is given after this list.
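As an example of these architectures, here is a minimal Keras sketch of the LSTM baseline under the shapes described above; the layer size follows the text, while the loss and optimiser settings are illustrative.

```python
# One-shot multi-step LSTM: [batch, time, features] -> [batch, OUT_STEPS, 1].
import tensorflow as tf

OUT_STEPS = 48       # forecast horizon in quarter-hours
NUM_FEATURES = 14    # multivariate inputs; DO is the single forecast target

lstm_model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(None, NUM_FEATURES)),  # -> [batch, 32]
    tf.keras.layers.Dense(OUT_STEPS),        # all output steps in a single shot
    tf.keras.layers.Reshape([OUT_STEPS, 1]),
])
lstm_model.compile(loss='mae', optimizer='adam')
```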

Additionally, we use the Informer, a cutting-edge machine learning methodology for numerous sequence-oriented tasks. Zhou et al.50 introduced a unique “ProbSparse self-attention” mechanism that targets the main influential factors while maintaining a sparse attention distribution. This mechanism significantly reduces computational complexity by probabilistically selecting the most important elements of the input sequence to pay attention to. An overview of the mechanism behind the Informer can be found in Fig. 8. Left: The encoder takes in 14-dimensional input sequences over spans of 48, 96, or 192 time steps. ProbSparse self-attention selects the active queries. Trapezoids represent the self-attention distillation operation, which reduces the network size by extracting the dominant attention. Right: The decoder acquires a start token of the preceding 48 time steps of input, fills the prediction elements with zeros, and examines the structure of the attention feature map. In a single shot, it generates univariate output predictions for horizons of 1, 12, 24, and 48 time steps.

Fig. 8

Informer model architecture.

The data fed into the model is structured with the shape [batch, time, features]. Here, we generate batches of 32 windows from the training, evaluation, and test data. The order of the windows is shuffled before they are batched, which in turn means that the order of the data within each batch is also random. In each batch, the model makes a set of independent one-shot predictions on future time steps. The term ‘time’ represents the sequence length per sample. We use values from half a day (i.e., 48 time steps), a full day, or two days as input, with forecasting horizons of 1, 12, 24, and 48 quarter-hours. The term ‘features’ refers to the number of attributes each time step possesses. The output shape of the model is [batch, time, 1], which reflects the prediction of only one feature. All the models employ a multivariate feature set to predict a single output feature, DO, across all output time steps.
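A sketch of this windowing using TensorFlow utilities, with illustrative parameter names (DO is assumed to be the first feature column):

```python
# Build shuffled batches of 32 windows from an array [time, features].
import tensorflow as tf

def make_windows(data, input_len=192, out_steps=48, batch_size=32):
    ds = tf.keras.utils.timeseries_dataset_from_array(
        data, targets=None,
        sequence_length=input_len + out_steps,
        batch_size=batch_size, shuffle=True)   # shuffles window order only
    # Split each window into inputs (all features) and DO-only labels.
    def split(window):
        return window[:, :input_len, :], window[:, input_len:, :1]
    return ds.map(split)
```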

We utilise vectors representing the previous 48 time steps (equivalent to half a day) as input to the encoder, and vectors representing the previous 24 time steps for the start token length of the decoder. The prediction sequence lengths are 1, 12, 24, or 48 time steps. The architecture includes three encoder and three decoder layers, with each attention module having eight heads. The model dimension is set to 512. The specific hyperparameters of the model are provided in the code. Each model undergoes training utilising the MAE as the loss function. Our proposed methods are optimised using the Adam optimiser66. The learning rate is set to 0.0001 and is subsequently halved with each increment in the epoch number. We employ early stopping to attain the minimal generalization error on the validation data, with a patience of 3 epochs. In the evaluation, the model with the lowest validation error is saved.

In the Informer model, the encoder takes in the complete input sequence and generates a corresponding sequence of representations. Each input requires three distinct representations, known as the key (K), query (Q), and value (V). Each representation is a vector encoding information about a specific timestep in the input, considering the context of the entire sequence. The decoder receives these representations to generate the output sequence. To derive these representations, every input is multiplied with a distinct set of weights assigned for K, Q, and V. These weight matrices represent trainable parameters of the model, which are optimised during training to minimise the loss function. Subsequently, the dot product of the query vector and every key vector in the sequence is computed to generate attention scores. These scores determine the level of attention each element should receive. These results are subjected to a softmax function, converting the scores into probabilities, known as attention weights. The final output is a composite of these interactions and corresponding attention scores. The ProbSparse Self-attention can be computed as follows:

$$\begin{aligned} \text {Attention}(Q, K, V) = \text {softmax}({\overline{Q}}K^T/\sqrt{d})V, \end{aligned}$$
(13)

where d represents the dimensionality of the queries and keys, while \({\overline{Q}}\) is the sparse query matrix containing the top-u queries. This configuration allows each key to attend only to the ‘u’ most dominant queries. The aim is to select the most “active” queries, those that result in high attention weights. The softmax function ensures the weights sum to 1 and are each between 0 and 1.
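As a toy illustration of Eq. (13) (a simplification, not Informer’s actual implementation), the following numpy sketch evaluates exact attention only for the top-u most “active” queries, scored by the max-minus-mean measurement proposed in the Informer paper, while lazy queries receive the mean of V, as in the encoder’s self-attention:

```python
# Toy ProbSparse self-attention; Q, K, V have shape [L, d].
import numpy as np

def probsparse_attention(Q, K, V, u):
    L, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                           # scaled dot products
    sparsity = scores.max(axis=1) - scores.mean(axis=1)     # query activity proxy
    top_u = np.argsort(sparsity)[-u:]                       # u most active queries
    out = np.tile(V.mean(axis=0), (L, 1))                   # lazy queries: mean of V
    active = scores[top_u]                                  # [u, L]
    w = np.exp(active - active.max(axis=1, keepdims=True))  # numerically stable
    w /= w.sum(axis=1, keepdims=True)                       # softmax rows sum to 1
    out[top_u] = w @ V                                      # exact attention for top-u
    return out
```

In the full model, u grows only logarithmically with the sequence length, which is what reduces the quadratic cost of vanilla self-attention to \(O(L \log L)\).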