Wavelet gated multiformer for groundwater time series forecasting

Developing accurate models for groundwater control is paramount for planning and managing life-sustaining water resources drawn from aquifer reservoirs. Significant progress has been made toward designing and employing deep-forecasting models to tackle the challenge of multivariate time-series forecasting. However, most such models were originally designed for natural language processing and computer vision tasks. We propose the Wavelet Gated Multiformer, which combines the strength of a vanilla Transformer with the Wavelet Crossformer, which employs inner wavelet cross-correlation blocks. The self-attention mechanism (Transformer) computes the relationship between inner time-series points, while the cross-correlation finds trending periodicity patterns. The multi-headed encoder is channeled through a mixing gate (linear combination) of sub-encoders (Transformer and Wavelet Crossformer) that output trending signatures to the decoder. This process improved the model's predictive capabilities, reducing Mean Absolute Error by 31.26% compared to the second-best performing transformer-like models evaluated. We have also used Multifractal Detrended Cross-Correlation Heatmaps (MF-DCCHM) to extract cyclical trends from pairs of stations across multifractal regimes by denoising the pairs of signals with Daubechies wavelets. Our dataset was obtained from a network of eight wells for groundwater monitoring in Brazilian aquifers, six rainfall stations, eleven river flow stations, and three weather stations with atmospheric pressure, temperature, and humidity sensors.


MF-DCCHM for groundwater level pattern recognition.
We have employed the Multifractal Detrended Cross-Correlation Heatmaps (details in the "Methods" section), a technique based on detrended cross-correlation coefficients that can map the relationships between pairs of fluctuations across different multifractal regimes. We have used sliding boxes with sizes of up to 5% of the entire series for local analysis. After the main computations, the images are generated from tensors and plotted using Python 66 and the matplotlib package 67. The plots are finally post-processed with Inkscape 68 to generate the diagrams and heatmaps. These heatmaps can uncover non-explicit cyclic patterns between signals obtained from a combination of sensors under specific constraints. Figure 2 represents the Multifractal Detrended Cross-Correlation Heatmaps, with the average intensity of the cross-correlation coefficients for each day plotted on top of the maps. This average can provide a peculiar view of positive and negative trends over multiple regimes, considering all possible windows and sliding boxes. The importance of the previous assumption relies on the possibility of uncovering and inferring cyclical patterns across multiple regimes for the time series of groundwater level measured from well sensors as compared to other regional well data, local pressure, humidity data, river levels, or weather attributes such as the local rainfall acquired through sensors from different stations. Our dataset is composed of daily time series from multiple sensors ranging from 1 January 2016 to 2019: (i) eight data series (W1-W8) representing the groundwater levels from the sensor inside the well logs, (ii) six data series measuring the rainfall (R1-R6) from weather stations, (iii) eleven data series (RI1-RI11) measuring the river levels, (iv) three data series representing the atmospheric pressure (P1-P3) in different locations, (v) three data series for the local temperature (T1-T3), and (vi) three data series for the local humidity (H1-H3).
Figure 2a represents the cross-correlation between two wells, W1 and W8. The x-axis represents the temporal variation (daily time series), while the vertical axis symbolizes the variation of the scale ν over the interval 10 ≤ ν ≤ 60 with a fixed-size sliding box. The colors correspond to the variation in the intensity of σ, representing the cross-correlation coefficients between wells W1 and W8. The computed detrended cross-correlation coefficients lie in the interval −1 ≤ σ ≤ 1, as shown in the sidebar of each heatmap. On top, the vertical axis describes the average σ (cross-correlation coefficient) over the entire map scope. This chart is crucial since we can focus on identifying cyclical trends due to oscillations and uncover the trend for the entire period (averaged persistence or anti-persistence). These time series will have strong or weak detrended cross-correlated characteristics depending on the color and uniformity, given a vertical range for all scales ν.
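As an illustration of the top panel described above, the per-day average of σ over all scales can be sketched as follows; the σ tensor here is synthetic stand-in data, not the paper's computed coefficients:

```python
import numpy as np

# Synthetic stand-in for the sigma(nu, t) tensor: rows are scales
# nu in [10, 60], columns are days; real values come from the MF-DCCHM.
rng = np.random.default_rng(0)
nus = np.arange(10, 61)
days = 200
sigma = np.clip(rng.normal(0.2, 0.4, size=(nus.size, days)), -1.0, 1.0)

# Top panel of each heatmap: average sigma over all scales, per day.
daily_mean = sigma.mean(axis=0)

# The sign of the overall mean hints at averaged persistence (> 0)
# or anti-persistence (< 0) for the whole period.
overall_trend = daily_mean.mean()
```

The daily average collapses the vertical (scale) axis, which is what makes the oscillating persistence/anti-persistence cycles visible at a glance.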
The MF-DCCHM for the pair of wells W1 and W8 shown in Fig. 2a displays periodicity for σ > 0 and ν > 25, which represents approximately 6 months, taking P_cp ≈ (2016.7; 25) as a starting point, where P_cp denotes the values with positive cross-correlation coefficients. Additionally, we have identified a cyclical pattern for σ < 0 and ν > 30, representing approximately 8 months, where the reference value is P_cn ≈ (2017.1; 30) for the negative cross-correlation coefficients. On top of the heatmap, the averaged detrended cross-correlation coefficients (vertical direction) show a clear cyclical pattern due to the oscillations over the entire period. Therefore, the groundwater level in these two wells (W1 and W8) follows a similar trend of fluctuations for the period with positive coefficients, whereas the fluctuations have an inverse relationship for negative coefficients. Furthermore, we have found residual cross-correlation coefficients exhibiting weak correlations, hovering around σ ≈ 0.
We have also obtained similar results for all combinations of pairs among the eight wells distributed in the region. Our results indicate that the closer the wells are to each other, the greater the probability that they follow the same fluid volume fluctuations during specific periods. Figure 2b shows the MF-DCCHM for wells W1 and W6, which have a pattern similar to the pair of wells shown in Fig. 2a (W1 and W8) in the time interval between January and August of 2016. However, we have detected anomalies from August 2016 onward. Figure 2b shows a uniform green color from 2016 to the end of 2019, with coefficients σ ≈ 0 for any P ≈ (t > 2016.7; ν), where t is the temporal variable. After conducting a detailed analysis of the conditions regarding all the wells, the anomaly in well W6 suggests a massive influx of water due to regular groundwater pumping from a nearby well (around 800 m distant, according to observations in the field).
Figure 2c shows the MF-DCCHM for river two (RI2) and river ten (RI10). Our results indicate direct and inverse proportionality trends related to the fluid volume in different sensors across the rivers for different periods. By analyzing the MF-DCCHM of the eleven data series (sensors), we have discovered a standard direct proportionality signature (positive σ) between the fluctuations in the rivers' fluid level every 3 months, a characteristic of the region. We have also observed a periodic abnormal pattern in the MF-DCCHM associated with the rainfall data series. Figure 2d shows the MF-DCCHM for stations R1 and R4, with an atypical pattern between February and March. These periodic intervals coincided with the four bands shown on the map, indicating the non-occurrence of rain in the period.
We have also examined the MF-DCCHM of nine data series from pressure, temperature, and humidity sensors collected in three locations. We have constructed the maps for all pairs of signals. Figure 2e represents the cross-correlation mapping between the temperature and humidity series from stations T1 and H1. The map contains cyclical patterns composed of bands in red where σ > 0.8 for the entire length of the series and 10 ≤ ν ≤ 60, showing a substantial proportionality across all regimes. We have also found the same behavior when comparing temperature at T1 with humidity at H1, and rain at R1 with humidity at H1. These highly cross-correlated cyclical patterns indicate a signature that the air can saturate under high relative humidity. At a particular temperature, the air is unable to hold water content, leading to the formation of clouds and precipitation. The temperature at which the air becomes saturated (unable to hold moisture) is also known as the dew point. However, we have noticed a weakening of the detrended cross-correlation coefficients affecting the direct proportionality when considering different locations, which can be characterized as a regional effect. Figure 2f shows the MF-DCCHM for temperature and humidity at stations T1 and H1. The average period for direct and inverse proportionality events is approximately 6 months, with attenuation for different distances.
Figure 3a shows the cross-correlations for the well W1 and rainfall R4 data series. In addition to the periodicity reported for σ < 0 and σ > 0, the maps show bands composed of the absence of rain with patterns of 0.5 cycle/year, considering σ ≈ 0 and 10 ≤ ν ≤ 60. Figure 3b represents the MF-DCCHM for the pair well W1 and river RI10. This result shows an intrinsic relationship in the cross-correlation between the river and well levels. The closer the well is to the river, the greater the cross-correlation between them, since interconnection through regional channels can provide different delays of influx and outflux of water, raising groundwater levels at different rates. However, factors such as rainfall intensity in the region considerably impact the affinity between the measurement levels from wells and river stations.
Figure 3d-f were obtained from the well and river pair of variables, including a low-pass filter with Daubechies 4 wavelets. In Fig. 3e, f, we have used filtering thresholds of 0.35 and 0.70, respectively. After removing white noise, we can observe an increase in explicit uniform bands across the maps, with stronger positive and negative cross-correlation periodicities. Therefore, by reducing high-frequency fluctuations, the signal carries the seasonal trend characteristics of low frequencies, which can be relevant for determining the cycles. The filtering helps uncover significant trends extending across multiple regimes, since extracting white noise generates more uniform distributions along the vertical axes. It is crucial to highlight the influence of high-frequency fluctuations on pattern recognition and the sensitivity of the MF-DCCHM for capturing seasonal trends after filtering the noise.
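A minimal sketch of wavelet shrinkage filtering of this kind, using a one-level Haar transform in place of the paper's Daubechies 4 for simplicity (the threshold value 0.35 mirrors the one quoted above, and the signal is synthetic):

```python
import numpy as np

def soft(c, t):
    # Soft thresholding: shrink coefficients toward zero by t.
    return np.sign(c) * np.maximum(np.abs(c) - t, 0.0)

def haar_denoise(x, threshold):
    # One-level orthonormal Haar transform (a stand-in for Daubechies 4).
    a = (x[0::2] + x[1::2]) / np.sqrt(2)   # approximation (low frequencies)
    d = (x[0::2] - x[1::2]) / np.sqrt(2)   # detail (high frequencies)
    d = soft(d, threshold)                 # suppress white-noise details
    y = np.empty_like(x, dtype=float)      # inverse transform
    y[0::2] = (a + d) / np.sqrt(2)
    y[1::2] = (a - d) / np.sqrt(2)
    return y

t = np.linspace(0, 4 * np.pi, 512)
noisy = np.sin(t) + 0.3 * np.random.default_rng(1).normal(size=t.size)
denoised = haar_denoise(noisy, threshold=0.35)
```

Shrinking the high-frequency (detail) coefficients while leaving the low-pass band intact is exactly why the filtered MF-DCCHM maps show more uniform bands: the seasonal low-frequency content survives while the white noise is attenuated.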
Figures S8 and S9 (Supplementary Material) show the DFA0 and DFA1 analyses for all the series in Figs. 2 and 3. The vertical axis represents the fluctuation, the horizontal axis represents the window size, and the vertical bar on the side shows the symbol and color associated with each attribute. In Figs. 2 and 3, the MF-DCCHM captured signatures of two regimes. We have also obtained similar curves for the DCCA0 and DCCA1 analyses. These regimes can be described by the multiscale exponents shown in Tables S5-S13. These results show the autocorrelation and cross-correlation exponents with indications of negatively correlated (0.0 < α < 0.5), close to uncorrelated (α ≈ 0.5), and positively correlated (0.5 < α < 1.0) patterns. Tables 1 and 2 also show an exponent of α > 1.0 for well W8 in the second regime, represented by region II. For the DFA1, we have found river RI35 in region I and river RI70 in both regions I and II. We have also obtained a multifractal exponent of α > 1.0. The occurrence of α > 1.0 69 can also be associated with levels of high-frequency noise in the linear trend. Therefore, we have employed the Daubechies 4 low-pass filter and managed to amplify critical cyclical signatures, denoise the fluctuations, and, in the case of well W8, reduce the multifractal exponents to less than one. That is a highly significant result, since we have shown that the MF-DCCHM is very sensitive to high-frequency noise, and denoising with specific thresholds can assist in uncovering low-frequency trends.

Groundwater well-log deep-forecasting.
We have also predicted the most probable time series of a given prediction length from an input signal and a multivariate dataset obtained from multiple stations. We have proposed the Wavelet Gated Multiformer (details in the "Methods" section), which introduces a wavelet decomposition block for multivariate time-series forecasting inside the encoder. This approach simultaneously employs the advantages of
past information from multiple sub-encoders through a mixing gate ⊕, as shown in Fig. 4. This procedure carries a combination of Transformer-based techniques, since the output from the encoder has information regarding the relationship between scattered points (point-wise dependency discovery) from the self-attention family 50,57, aggregation of selected points by dot-product, and accumulation of similar sub-series from different periods through the time-delay block (Wavelet Crossformer). Our approach can also be extended for application in other fields, since the deep model provides higher accuracy for some well logs.
Figure 4 shows the signal decomposed into trend-cyclical (long-term progression) and seasonal parts to understand the temporal patterns during forecasting. The Wavelet Crossformer progressively uses a decomposition block, similar to the Autoformer, to pull the long-term stationary trend from intermediate predictions. However, instead of a moving average, we denoise the signal with a low-pass wavelet filter based on Daubechies 4 to concentrate on the long-term trends. Removing high-frequency noise has proven to be an essential factor for trend feature extraction. We have performed multiple experiments with eight wells by training on sequences of 180 time steps (days) to predict the next 30 and 60 time steps (days). The results are summarized in Table 3. The proposed Wavelet Gated Multiformer was also compared against the vanilla Transformer 45, Autoformer 56, and Informer 57 using two different metrics: (i) Mean Squared Error (MSE) and (ii) Mean Absolute Error (MAE). The Wavelet Gated Multiformer, which takes advantage of the Cross-Correlation Blocks, was considerably ahead of the other transformers in wells W1 and W8. However, we have obtained mixed performance for the other six wells, without a dominant architecture concerning the metrics MSE and MAE (Fig. 5).
Comparing the predictions of these three transformer-like architectures can also shed light on their capabilities. Figure 6 shows the predictions for well W1, with the Wavelet Gated Crossformer providing better performance (MSE and MAE) for an input length of 180 days and prediction lengths of 30 days (left) and 60 days (right). In contrast, the Transformer's prediction falls entirely out of the signal trend, and the Autoformer fails to stabilize the signal over the curve. The Wavelet Gated Crossformer has shown higher variance, which can be induced by the selected Daubechies 4 wavelet, which provides the decomposed signal in trend and seasonal parts. Figure 7 shows the predictions for well W4, where the Wavelet Gated Crossformer had the worst performance compared to the other transformers. However, it still captures the general trend in its prediction for both the 30-day and 60-day prediction lengths.
Figure 4a highlights the schematics of our encoder, composed of N parallel layers of multiple sub-encoders. The output of each sub-encoder is then mixed through a linear combination represented by the mixing gate ⊕. The Mixing Gate combines the trending output of the Transformer, which embeds information concerning the relationship between scattered points (point-wise dependency discovery) from the self-attention family 50,57 and the aggregation of selected points by dot-product, with the accumulation of similar sub-series from different periods through the time-delay block (Wavelet Crossformer). Figure 4d clearly shows the weighted characteristics of both sub-encoders after the Mixing Gate. Figure 4e, f show the signal decomposed into trend-cyclical (long-term progression) and seasonal parts to understand the temporal patterns. Our proposed encoder and its operations are detailed in the "Methods" section. Our Decoder, shown in Fig. 4g and detailed in the "Methods" section, has M layers and two different inputs, a cyclical trend and a seasonal (residual) part derived from the encoder input. From Fig. 4e, the cyclical trend part, one can see that the signal has a lower frequency (long-term tendencies) compared to the signals in Fig. 4f, where fast oscillations are dominant. Our Decoder also includes wavelet denoising through the Wavelet Decomposition Block. The Decoder extracts the trend from the hidden variables and allows the Wavelet Gated Multiformer to refine the trend prediction by denoising the signals with wavelet filters for period-based pattern recognition inside the Wavelet Auto-Correlation block.
The cross-correlation block from the Wavelet Gated Crossformer significantly improves the removal of high-frequency noise, as seen in Fig. 5. Before the effective cross-correlation, both Q and K, as seen in Fig. 5b, c, are filtered with Wavelet Shrinkage Denoising using the Daubechies 4 wavelet with a soft threshold for coefficient estimation. Figure 5d shows the effect of wavelet filtering on two random batches with Q tensor samples. The high-frequency noise content is removed from the original signal under a specific threshold. The same outcomes are observed for K in Fig. 5f. Figure 5e, g, with the signal in the frequency domain, also support the enhancement in quality provided by wavelet filtering. Finally, the outcomes of the QK cross-correlation with and without wavelet denoising are compared in Fig. 5h, with explicit attenuation of white noise in the former. This process provides effective results once these signals are cross-correlated, and better estimations for the TopK values used in the time-delay aggregation of the value component V.
The deep-forecasting models were trained with a multivariate dataset of rain, climate (atmospheric pressure, temperature, and humidity sensors), river, and well-log data that comprises 4 years of continuous daily measurements from 34 sensors, with a total of 49708 data points. We have calibrated the dataset with standardization (mean equal to zero and standard deviation equal to one). The input has a sequence length of 180 and prediction lengths of 30 and 60 days. We have split the dataset with 70% for training, 10% for validation, and 20% for testing purposes. We have also fine-tuned and adjusted the following parameters for the validation: (i) input sequence length of 180, (ii) batch size of 32, (iii) encoder input and output size of 34, and (iv) 15 training epochs. For each
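The standardization and chronological 70/10/20 split described above can be sketched as follows; the array here is random stand-in data with the same shape conventions (daily rows, 34 sensor columns), not the actual dataset:

```python
import numpy as np

# Stand-in for ~4 years of daily readings from 34 sensors.
rng = np.random.default_rng(2)
data = rng.normal(50.0, 7.0, size=(1462, 34))

# Standardize each sensor channel: zero mean, unit standard deviation.
scaled = (data - data.mean(axis=0)) / data.std(axis=0)

# Chronological split: 70% training, 10% validation, 20% testing.
n = len(scaled)
n_train, n_val = int(0.7 * n), int(0.1 * n)
train, val, test = (scaled[:n_train],
                    scaled[n_train:n_train + n_val],
                    scaled[n_train + n_val:])
```

Keeping the split chronological (rather than shuffled) avoids leaking future values into the training set, which matters for forecasting benchmarks.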

Discussion
We have explored the MF-DCCHM to discover zero-order (around the mean) and first-order (around a first-degree polynomial) cyclical patterns and trends among all pairs of signals in our dataset. These results showed the local and global behavior related to the dependence between all pairs of physical quantities analyzed. The main result can be verified in the heatmaps of Figs. 2 and 3, with the attenuation of variance after filtering the trend, employing a sliding window, and obtaining, for all scales, all intensity levels related to the correlation coefficients. Therefore, the heatmap was built with much more information compared with previous works 1,2. Furthermore, the heatmaps provided seasonal trends and cycles of 90 and 180 days, guiding our decision to use the deep-forecasting models with 30- and 60-day prediction lengths. It is essential to highlight that applying the Daubechies (db4) wavelet filters can reduce the noise and increase the magnitude of low-frequency signals to uncover cyclical trends. We have also shown that using the low-pass wavelet-based filters allows one to denoise the fluctuations and, in the case of well W8, to reduce the multifractal exponents to less than one. We have also found highly cross-correlated cyclical patterns indicating a signature compatible with a weather anomaly. In this case, the air can saturate under high relative humidity, and at a particular temperature (the dew point) the air cannot hold water content, leading to clouds and further precipitation. Furthermore, we have also noticed a weakening of the detrended cross-correlation coefficients when considering different locations. Figure 2f shows the MF-DCCHM for temperature and humidity at stations T1 and H1. The average attenuates for different distances; however, the period for events is approximately 6 months.
Regarding the multivariate time-series deep-model forecasting approach, Fig. 4 shows our proposed Wavelet Gated Crossformer, which takes advantage of multiple sub-encoders with mixing gates. Our model carries information regarding all parallel sub-encoders, and it can be generalized to multiple sub-encoders and decoders, possibly extrapolating the capacities of any single transformer-like architecture. Additionally, wavelet decomposition blocks provide another route to separate signals into trend and seasonal parts, as the moving average from the Autoformer suffers from border effects at the beginning and end of the curve. Wavelet Shrinkage Denoising can overcome this limitation in capturing trend signals by extracting the wavelet coefficients associated with higher frequencies. The proposed architecture also implements a new Wavelet Cross-Correlation Block that employs wavelets to denoise the signal before the inner QK cross-correlation. The improved signal quality propagates through the cross-correlation and provides better estimations for the time-delay aggregation. The Wavelet Gated Multiformer has provided better results in two of the eight wells when compared to the Autoformer and Transformer, reducing Mean Absolute Error (MAE) by 31.26% compared to the second-best performing transformer-like models evaluated. Nevertheless, considering only prediction trends, our approach efficiently captures long-term patterns and has potential for application in other fields.
Deep learning has been successfully used to uncover subtle and hidden patterns in time-series data. Deep-learning models have been used to understand traffic flow and congestion (time series) and have shown promise in time-series forecasting. The most recent deep-forecasting models can outperform statistical methods for large multivariate datasets. Transformers were designed to perform natural language processing (NLP) tasks, and several transformer architectures target long-term forecasting, such as FEDformer, PatchTST, Informer, Autoformer, and the Transformer. Our article provides a new deep-forecasting model to predict groundwater level from a multivariate dataset which includes rain, climate (atmospheric pressure, temperature, and humidity sensors), river, and well data and comprises 4 years of continuous daily measurements from 34 sensors (or 49708 data points). However, performance is a key limitation of the Wavelet Gated Multiformer because of its parallel encoder structures and the additional wavelet shrinkage denoising. Future works could focus on computational costs, benefit from the new approach, and explore combinations of different transformer-like models. Furthermore, research prospects in this field could investigate the potential of the Wavelet Gated Multiformer for time-series forecasting using open-source benchmarks such as the Electricity Transformer Temperature (ETT), a crucial indicator in long-term electric power deployment.

Encoder.
The encoder mixes the outputs of its sub-encoders, where χ_j^l = A_en^(l,2) represents the outcome of the l-th Transformer layer. The final operation in the mixing gate, χ_en^l = χ_i^l ⊕ χ_j^l, can be expressed as a linear combination, where G_en is the Encoder output trend, composed of a linear combination of parallel sub-encoders. The model is not restricted to only two sub-encoders, as it can be generalized to a linear combination of multiple encoders.
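A minimal numerical sketch of the gate as a linear combination over parallel sub-encoder outputs; the equal weights and tensor shapes are illustrative assumptions, not the values used by the model:

```python
import numpy as np

def mixing_gate(sub_outputs, weights):
    # Linear combination of parallel sub-encoder outputs (the gate).
    # Generalizes to any number of sub-encoders.
    return sum(w * x for w, x in zip(weights, sub_outputs))

# Toy tensors standing in for two sub-encoder outputs
# (batch x sequence length x channels).
chi_i = np.full((32, 180, 34), 0.5)   # e.g. Wavelet Crossformer branch
chi_j = np.full((32, 180, 34), 1.5)   # e.g. Transformer branch
chi_en = mixing_gate([chi_i, chi_j], weights=[0.5, 0.5])
```

Because the gate is a plain weighted sum, adding a third or fourth sub-encoder only means extending the two lists, which is the generalization the text refers to.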
Decoder. In this model, the decoder input is divided in two: a seasonal part χ_des ∈ R^((I/2+O)×d) and a cyclical trend part χ_det ∈ R^((I/2+O)×d), each of which is formulated by a specific rule. χ_des is the concatenation of the latter half of the seasonal part of the encoder's input with O zeros, and χ_det is the concatenation of the latter half of the trend part of the encoder's input with O values representing the average of the encoder's input χ_en. One can put it formally as follows, where χ_ens, χ_ent ∈ R^((I/2)×d) are the seasonal and trend parts of χ_en, respectively, and χ_0, χ_Mean ∈ R^(O×d) represent placeholders filled with zeros and the mean of χ_en, respectively.
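The construction of χ_des and χ_det can be sketched as follows; the encoder input here is random stand-in data, and the seasonal/trend halves are taken as identical for brevity (in the model they come from the wavelet decomposition block):

```python
import numpy as np

I, O, d = 180, 30, 34            # input length, prediction length, channels
chi_en = np.random.default_rng(3).normal(size=(I, d))

# Latter halves of the (here, undecomposed) encoder input.
chi_ens = chi_en[I // 2:]        # seasonal part placeholder
chi_ent = chi_en[I // 2:]        # trend part placeholder

# Seasonal decoder input: pad with O zeros.
chi_des = np.concatenate([chi_ens, np.zeros((O, d))])
# Trend decoder input: pad with O copies of the encoder-input mean.
chi_det = np.concatenate([chi_ent, np.tile(chi_en.mean(axis=0), (O, 1))])
```

The zero padding leaves the seasonal slots free for the decoder to fill, while the mean padding gives the trend branch a sensible starting level for the O future steps.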
Figure 4 shows the decoder with two components: the trend-cyclical part accumulated over the decoder and a seasonal part stacked in a series of blocks. We have introduced wavelet denoising in the Wavelet Decomposition Block to replace the original Series Decomposition block from the Autoformer. Inside the decoder, information from the encoder is integrated as K and V in one of its autocorrelation blocks with wavelet denoising. The decoder extracts the trend from the hidden variables and allows the Wavelet Gated Multiformer to refine the trend prediction by denoising the signals with wavelet filters for period-based pattern recognition inside the Wavelet Cross-Correlation block. The decoder has M layers and receives the input χ_en^N from the encoder and past information from the decoder, such that the l-th decoder layer can be represented by χ_de^l = Decoder(χ_de^(l−1), χ_en^N), where χ_de^l = S_de^(l,3), l ∈ {1, ..., M}, is the output of the l-th decoder layer. χ_de^0 is embedded from χ_des and T_de^0 = χ_det. Therefore, the decoder's prediction output is represented by the sum of two decomposed components, W_S * χ_de^M + T_de^M, where W_S indicates the projection of the seasonal component χ_de^M to the proper output dimension and M is the number of decoder layers. The variables S_de^(l,i) and T_de^(l,i) characterize the seasonal and trend-cyclical components for the first, second, and third wavelet-series decomposition blocks, respectively. W^(l,i) represents the projection for the trend T_de^(l,i), where i = {1, 2, 3} is directly associated with the first, second, and third decoder's inner wavelet-series decomposition block.
Wavelet decomposition block. Wavelet shrinkage denoising. Wavelets form a basis in the space of square-integrable functions and, like other approaches such as Fourier series, local polynomials, splines, and kernels, can be used to represent unknown functions. One advantage of wavelets over the other approaches is that they are localized in time, a property that substantially decreases the computational cost of representing the functions. The Wavelet Transform (WT) coefficients of a signal x(t) are obtained by projecting x(t) onto the mother wavelet ψ(t) scaled by a factor a and translated by a factor b. The continuous representation has infinitely many coefficients a and b mapped through the Continuous Wavelet Transform (CWT). However, due to the intense computational cost, the CWT is only feasible when considering discrete signals with limited coefficients. Instead, most applications use the Discrete Wavelet Transform (DWT). In the DWT, the efficiency comes from increasing the a and b factors in powers of two. Some of the decomposed wavelet coefficients correspond to signal averages, while others are associated with details of the original signal. The DWT coefficients are filtered with a threshold to denoise or recover the partial trend. Additionally, we apply an inverse wavelet transform to restore the filtered signal to the time domain, a process known as wavelet shrinkage denoising. Based on the methodology developed by Donoho and Johnstone 72, we have employed a soft threshold to estimate the wavelet coefficients of the trending signal.
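A sketch of the soft-threshold step with the Donoho-Johnstone universal threshold; the detail coefficients below are illustrative values, and the MAD-based noise estimate is a standard choice rather than one stated in the paper:

```python
import numpy as np

def soft_threshold(coeffs, lam):
    # Donoho-Johnstone soft thresholding: shrink toward zero by lam.
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - lam, 0.0)

# Illustrative finest-level detail coefficients.
detail = np.array([0.1, -0.4, 0.05, 0.8, -0.02, 0.3])

# Universal threshold lam = sigma * sqrt(2 ln n), with the noise level
# sigma estimated from the median absolute deviation of the details.
sigma_hat = np.median(np.abs(detail)) / 0.6745
lam = sigma_hat * np.sqrt(2.0 * np.log(detail.size))
shrunk = soft_threshold(detail, lam)
```

Unlike hard thresholding, soft thresholding also shrinks the surviving coefficients, which avoids discontinuities in the reconstructed trend.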

Wavelet decomposition.
We have used the decomposition to cluster the signal into two subgroups: a long-trend pattern with low frequency and a seasonal part with higher frequency and more fluctuations. Unlike the Autoformer, which computes an average of a padded signal, we have proposed a low-pass filter, where S and T denote the seasonal and trend-cyclical signals, respectively, and the soft threshold 72 is employed to filter the wavelet coefficients of the trending signal. For the sake of simplicity, the decomposition blocks were denoted as S, T = Ŵ(χ) in the previous sections.
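The interface S, T = Ŵ(χ) can be sketched as follows; a crude two-point averaging filter (the Haar approximation) stands in for the Daubechies 4 low-pass step, and the signal is synthetic:

```python
import numpy as np

def wavelet_decompose(x):
    # Sketch of S, T = W(x): T is a low-pass (trend-cyclical) signal and
    # S is the high-frequency residual. A pairwise average stands in
    # for the paper's wavelet low-pass filter.
    a = (x[0::2] + x[1::2]) / 2.0   # pairwise local averages
    trend = np.repeat(a, 2)         # upsample back to the input length
    seasonal = x - trend            # seasonal part carries fast oscillations
    return seasonal, trend

x = (np.sin(np.linspace(0.0, 2.0 * np.pi, 128))
     + 0.1 * np.random.default_rng(4).normal(size=128))
S, T = wavelet_decompose(x)
```

Whatever low-pass filter is used, the defining property of the block is that the two outputs reconstruct the input exactly: S + T = χ.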
Wavelet cross-correlation block. The Wavelet Cross-Correlation Block is an extension of the autocorrelation block introduced by the Autoformer architecture 56. Figure 5 shows that, in the Wavelet Cross-Correlation Block, the components Q and K are denoised through wavelet shrinkage before the autocorrelation computation.
The cross-correlation is computed in three steps for better computational efficiency: (i) a Fast Fourier Transform, (ii) multiplication in the frequency domain, and (iii) an inverse Fast Fourier Transform. The Wavelet Cross-Correlation Block also discovers period-based dependencies by aggregating sub-series with the time-delay mechanism.
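These three steps can be sketched with NumPy's FFT; the series here are synthetic, and the resulting correlation is circular:

```python
import numpy as np

def fft_cross_correlation(q, k):
    # (i) FFT both series, (ii) multiply Q by the conjugate of K in the
    # frequency domain, (iii) inverse FFT back to the lag domain.
    Qf = np.fft.rfft(q)
    Kf = np.fft.rfft(k)
    return np.fft.irfft(Qf * np.conj(Kf), n=len(q))

rng = np.random.default_rng(5)
base = rng.normal(size=256)
delayed = np.roll(base, 7)                  # base delayed by 7 steps
r = fft_cross_correlation(delayed, base)
best_lag = int(np.argmax(r))                # recovers the 7-step delay
```

Computing the correlation this way costs O(L log L) instead of the O(L^2) of a direct lag-by-lag sum, which is the efficiency gain the three-step recipe buys.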
The time-delay aggregation is responsible for rolling the series based on the selected time delays τ_1, ..., τ_n. The block aligns similar sub-series and subsequently aggregates them through softmax-normalized confidences. For a series X of length L and a single head, we can replace the self-attention mechanism, considering that the Q, K, and V values are obtained through the following operations, where argmax Topk(·) represents the arguments of the Topk cross-correlations, R_(Q,K) is the cross-correlation, and Roll(V, τ_i) represents the operation where V is shifted left by τ_i, during which elements shifted beyond the first position are re-introduced at the last position. Furthermore, K and V come from the encoder output χ_en^N, which is resized to length O, and Q comes from the previous block of the decoder. Additionally, the multi-head version includes Q_i, K_i, and V_i for i = {1, ..., h}, where h represents the number of heads. The i-th head contains Q_i, K_i, V_i ∈ R
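Under these definitions, the time-delay aggregation for a single head can be sketched as follows; the correlation vector and V are toy inputs, and Roll shifts left with wrap-around as described:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def time_delay_aggregation(corr, v, k=3):
    # Select the arguments of the Top-k cross-correlations, roll V left by
    # each selected delay (elements wrapped past the first position go to
    # the last), and aggregate with softmax-normalized confidences.
    top = np.argsort(corr)[-k:]
    weights = softmax(corr[top])
    return sum(w * np.roll(v, -int(tau)) for w, tau in zip(weights, top))

corr = np.array([0.0, 0.0, 10.0, 0.0, 0.0, 0.0, 0.0, 0.0])  # lag 2 dominates
v = np.arange(8.0)
agg = time_delay_aggregation(corr, v)       # ~ v rolled left by 2
```

Because the dominant lag receives nearly all of the softmax mass, the output is essentially V aligned to that lag, which is how the block replaces point-wise attention with period-based alignment.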

Multifractal detrended cross-correlation heatmaps. The Multifractal Detrended Cross-Correlation Heatmaps were computed strictly according to Paulo et al. 62. The procedure first obtains the long-range correlation through Detrended Fluctuation Analysis (DFA) 69,73. Additionally, we have estimated the Detrended Cross-Correlation Analysis (DCCA) 74 for long-range cross-correlation memories. Finally, we compute the average over the fluctuations F_X^2 for all m-th sliding boxes of different sizes ν over the entire series, where f_X^2(m, ν) represents the variance and covariance of the fluctuations, M_ν is the number of windows inside the entire series, and X characterizes the methods DFA and DCCA.
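A compact sketch of one heatmap cell, the detrended cross-correlation coefficient σ for a single box size ν with first-degree detrending (DFA1/DCCA1-style); the input series are synthetic:

```python
import numpy as np

def dcca_coefficient(x, y, nu, order=1):
    # Detrended cross-correlation coefficient for box size nu:
    # detrended covariance normalized by the DFA fluctuations of each series.
    X, Y = np.cumsum(x - x.mean()), np.cumsum(y - y.mean())   # profiles
    n_boxes = len(X) // nu
    t = np.arange(nu)
    f2x = f2y = f2xy = 0.0
    for m in range(n_boxes):
        xs, ys = X[m * nu:(m + 1) * nu], Y[m * nu:(m + 1) * nu]
        rx = xs - np.polyval(np.polyfit(t, xs, order), t)     # detrend box
        ry = ys - np.polyval(np.polyfit(t, ys, order), t)
        f2x += (rx ** 2).mean()
        f2y += (ry ** 2).mean()
        f2xy += (rx * ry).mean()
    return f2xy / np.sqrt(f2x * f2y)

rng = np.random.default_rng(6)
x = rng.normal(size=500)
y = x + 0.1 * rng.normal(size=500)     # strongly coupled pair
sigma = dcca_coefficient(x, y, nu=20)  # close to +1
```

Sweeping nu over 10-60 and sliding the boxes along the series, then stacking the resulting σ values, yields the (ν, t) tensor rendered as a heatmap.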
The computed variance and covariance can be characterized by a power law, F_X ≈ ν^α, where ν represents the box size. The scaling factor α was obtained by a linear fit of log(F_X) versus log(ν), where the slope α corresponds to the cross-correlation exponent. Therefore, we can classify anti-persistent and persistent patterns based on the multifractal scaling factor α. The trend tends to reverse shortly for 0 < α < 0.5 (anti-persistence). White noise in the integrated series yields uncorrelated behavior for α ≈ 0.5. The integrated signal continues the previous trend for 0.5 < α < 1.0 (persistence).
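The classification above rests on fitting the power law F_X ≈ ν^α on log-log axes; a minimal sketch with a synthetic fluctuation function:

```python
import numpy as np

# Synthetic fluctuation function obeying F(nu) = c * nu^alpha exactly.
nus = np.array([10.0, 15.0, 20.0, 30.0, 40.0, 60.0])
F = 2.0 * nus ** 0.7

# alpha is the slope of log F against log nu.
alpha, log_c = np.polyfit(np.log(nus), np.log(F), 1)

# 0.5 < alpha < 1.0: persistent (trend-continuing); 0 < alpha < 0.5:
# anti-persistent; alpha ~ 0.5: uncorrelated (white-noise-like).
```

On real fluctuation functions the points scatter around the line, so the fit is usually restricted to the scaling range where the log-log plot is straight.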

Figure 1. Map diagram with the stations and wells represented in green (W1-W8), river stations in red, rainfall stations in yellow, and weather stations (atmospheric pressure, humidity, and temperature) in blue. See Table S1 (Supplementary Material) for information on latitude and longitude. The map diagram was generated with ArcGIS 10.8 76 and post-processed with Inkscape 68 (Open Source Software licensed under the GPL). The dataset is publicly available and can be obtained from the Geoscience system of the Geological Survey of Brazil (GeoSGB) 77. The license is Creative Commons Attribution NonCommercial 4.0 International.

Figure 4. Wavelet Gated Multiformer with (a) Encoder, (b) random batch output signal from the Wavelet Crossformer and (c) the Transformer, and (d) the signal after the mixing gate. The Encoder's output: (e) wavelet cyclical trend, (f) seasonal trend, and (g) Decoder with inner cross-correlation blocks.
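The mixing gate and the trend/seasonal split shown in Fig. 4 can be illustrated with a minimal NumPy sketch; the gate weight g (a learnable parameter in the model) and the moving-average kernel size are illustrative assumptions, and the function names are ours:

```python
import numpy as np

def mixing_gate(h_transformer, h_crossformer, g=0.5):
    """Mixing Gate: linear combination of the two sub-encoder outputs.
    g stands in for the learnable gate weight of the model."""
    return g * h_transformer + (1.0 - g) * h_crossformer

def series_decomposition(x, kernel=25):
    """Split a series into a trend-cyclical part (moving average, i.e. a crude
    low-pass filter) and a seasonal part (the residual), Autoformer-style."""
    pad = kernel // 2
    xp = np.pad(x, (pad, kernel - 1 - pad), mode="edge")
    trend = np.convolve(xp, np.ones(kernel) / kernel, mode="valid")
    seasonal = x - trend
    return trend, seasonal
```

By construction, trend + seasonal reconstructs the input exactly, so the decomposition loses no information while separating the slow progression from the fast oscillations.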

Figure 5. Wavelet Cross-correlation Block with (a) the layer's internal operations and self-attention input for (b) Queries (Q) and (c) Keys (K). Random batch tensor before and after denoising with the wavelet db4 filter for (d, e) Queries and (f, g) Keys, respectively. (h) A random batch for the Q-K cross-correlation using an average filter and wavelet db4.
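The wavelet shrinkage denoising applied to Q and K in Fig. 5 can be sketched as follows. For brevity this sketch uses a one-level Haar transform with a universal soft threshold in place of the db4 filter used in the model (a full implementation would use, e.g., PyWavelets); the function names are ours:

```python
import numpy as np

def soft_threshold(c, thr):
    """Soft-thresholding operator used in wavelet shrinkage."""
    return np.sign(c) * np.maximum(np.abs(c) - thr, 0.0)

def haar_shrinkage_denoise(x):
    """One-level Haar wavelet shrinkage with a universal soft threshold
    (illustrative stand-in for the db4 filter used in the model)."""
    n = len(x) - len(x) % 2                      # even length for pairing
    a = (x[0:n:2] + x[1:n:2]) / np.sqrt(2.0)     # approximation (low-pass)
    d = (x[0:n:2] - x[1:n:2]) / np.sqrt(2.0)     # detail (high-pass)
    sigma = np.median(np.abs(d)) / 0.6745        # noise level from detail coeffs
    thr = sigma * np.sqrt(2.0 * np.log(n))       # universal threshold
    d = soft_threshold(d, thr)
    y = np.empty(n)
    y[0::2] = (a + d) / np.sqrt(2.0)             # inverse Haar transform
    y[1::2] = (a - d) / np.sqrt(2.0)
    return y
```

Soft thresholding shrinks the small, noise-dominated detail coefficients to zero while preserving the low-frequency approximation, which is why the filtered Q and K yield cleaner cross-correlation estimates.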

Figure 4b-d show the output of the Wavelet Crossformer sub-encoder, the output of the Transformer sub-encoder, and the result of exploiting past information from both sub-encoders through the Mixing Gate ⊕, respectively. The Wavelet Crossformer uses progressive decomposition blocks, similar to the Autoformer, to handle long-term stationary trends, applying a low-pass wavelet filter based on Daubechies 4 (db4) to denoise the output signals and extract long-term trends. High-frequency signal denoising has proven to be an essential step for feature extraction inside the blocks. Furthermore, the Mixing Gate combines the trending output of the Transformer, which embeds the relationships between scattered points (point-wise dependency discovery) from the self-attention family 50,57 with aggregation of selected points by dot-product, with the accumulation of similar sub-series from different periods through the time delay block (Wavelet Crossformer). Figure 4d clearly shows the weighted characteristics of both sub-encoders after the Mixing Gate. Figure 4e,f show the signal decomposed into trend-cyclical (long-term progression) and seasonal parts, which helps to understand the temporal patterns. Our proposed encoder and its operations are detailed in the "Methods" section. Our Decoder, shown in Fig. 4g and detailed in the "Methods" section, has M layers and two different inputs, a trend-cyclical part and a seasonal (residual) part, both derived from the encoder input. From Fig. 4e, the trend-cyclical part, one can see that the signal has a lower frequency (long-term tendencies) compared to the signals in Fig. 4f, where fast oscillations are dominant. Our Decoder also includes wavelet denoising through the Wavelet Decomposition Block. The Decoder extracts the trend from the hidden variables and allows the Wavelet Gated Multiformer to refine the trend prediction by denoising the signals with wavelet filters for period-based pattern recognition inside the Wavelet Auto-Correlation block. The cross-correlation block from the Wavelet Gated Crossformer significantly improves the removal of high-frequency noise, as seen in Fig. 5. Before the cross-correlation is computed, both Q and K, as seen in Fig. 5b,c, are filtered with Wavelet Shrinkage Denoising using the Daubechies 4 wavelet with a soft threshold for coefficient estimation. Figure 5d shows the effect of wavelet filtering on two random batches of Q tensor samples: the high-frequency noise content below a specific threshold is removed from the original signal. The same outcome is observed for K in Fig. 5f. Figure 5e,g, with the signal in the frequency domain, also support the enhancement in quality provided by wavelet filtering. Finally, the outcomes of the Q-K cross-correlation with and without wavelet denoising are compared in Fig. 5h, with clear attenuation of white noise in the former. This process is effective because the denoised signals, once cross-correlated, yield better estimates of the Topk values used in the time-delay aggregation of the value component V.

The deep forecasting models were trained with a multivariate dataset of rainfall, climate (atmospheric pressure, temperature, and humidity sensors), river, and well-log data comprising 4 years of continuous daily measurements from 34 sensors, with a total of 49,708 data points. We have calibrated the dataset with standardization (mean equal to zero and standard deviation equal to one). The input has a sequence length of 180 and a prediction length of 30 and 60 days. We have split the dataset into 70% for training, 10% for validation, and 20% for testing. We have also fine-tuned and adjusted the following parameters for validation: (i) input sequence length of 180, (ii) batch size of 32, (iii) encoder input and output size of 34, and (iv) 15 training epochs. For each

Figure 6. Predictions of 30 and 60 days for groundwater well P1 with (a, d) Wavelet Gated Multiformer, (b, e) Transformer, and (c, f) Autoformer, with an input size of 180 days, respectively.

Figure 7. Predictions of 30 and 60 days for groundwater well P4 with (a, d) Wavelet Gated Multiformer, (b, e) Transformer, and (c, f) Autoformer, with an input size of 180 days, respectively.

Table 3. Time series forecasting results for wells W1-W8 with predictive lengths of 30 and 60 days.