Abstract
This paper focuses on day-ahead electricity load forecasting for substations of the distribution network in France; the corresponding problem therefore lies between the instability of a single consumer's load and the stability of a countrywide total demand. Moreover, this problem requires forecasting the loads of over one thousand substations; consequently, it belongs to the field of multiple time series forecasting. To that end, the paper applies an adaptive methodology that provided excellent results at a national scale; the idea is to combine generalized additive models with state-space representations. However, extending this methodology to the prediction of over a thousand time series raises a computational issue. It is solved by developing a frugal variant that reduces the number of estimated parameters: forecasting models are estimated only for a few time series and transfer learning is achieved by relying on aggregation of experts. This approach yields a reduction of computational needs and their associated emissions. Several variants are built, corresponding to different levels of parameter transfer, to find the best trade-off between accuracy and frugality. The selected method achieves competitive results compared to individual models. Finally, the paper highlights the interpretability of the models, which is important for operational applications.
Introduction
Electricity consumption forecasting is essential for numerous activities: managing the electricity network, investment and production planning, trading on the electricity markets, and reducing power wastage. To suit these activities, forecasts are performed at different horizons, from short-term (hours, days) to long-term (years), and at different scales, from individual to national. New variability in the electricity load has recently emerged due to several factors: new decentralized production units (like solar panels), new actors following the opening of the French electricity market, new uses (like electric mobility), and the COVID-19 pandemic. These factors bring essential changes in electricity consumption, and forecasting models must be updated to take them into account.
Numerous forecasting approaches have been proposed to forecast electricity consumption, and the Global Energy Forecasting Competitions (GEFCom) provide an overview^{1,2,3}. These methods include classical time series approaches such as autoregressive integrated moving-average (ARIMA)^{4,5} and exponential smoothing^{6}, which have been used to forecast at the very short term (hours ahead), as well as other statistical and machine learning approaches such as Gradient Boosting^{7,8} and Neural Networks^{9,10} that provide better forecasts using exogenous variables, especially for larger horizons. Indeed, since electricity demand stems from human activity, it can be predicted using explanatory variables, the most important ones being weather and calendar data. In particular, Generalized Additive Models (GAMs)^{11} have been widely applied to forecast electricity consumption^{12,13,14,15}, yielding good results.
More recently, a state-space approach has been investigated to adapt GAMs over time^{16}; this online model allows GAMs to adapt to new variabilities, such as the COVID-19 pandemic, and thereby improve forecasts^{17,18}. Finally, the variety of models motivates predicting the demand with a combination of forecasters, yielding a better final prediction than any individual model; this is the aggregation of experts^{19}, which has also shown good results in load forecasting^{20}. Aggregation of experts is an online method that combines predictions given by experts in a weighted sum whose weights evolve over time.
This paper investigates day-ahead electricity load forecasting on the distribution network in France. The electricity consumption is measured at about 2,200 substations located at the frontier between the high-voltage grid and the distribution network. More precisely, data from 1344 of them are accessible, enabling local-level forecasting. Forecasting multiple time series is an important task in many industries, such as retail, car sharing, and electricity management. Several competitions have tackled this problem, such as the ASHRAE competition on forecasting multiple buildings' energy consumption^{21} and the Makridakis competitions^{22,23}, leading to interesting findings. One of the most important is the value of cross-learning from multiple time series to improve accuracy. As underlined in the findings of these competitions, forecasting multiple time series is usually tackled using two kinds of approaches^{24}: local methods that estimate one model independently for each time series, and global methods that fit a unique model jointly for all of them. Local methods have been used the most for these problems in the past, but global approaches have been shown to perform just as well, especially when model complexity increases^{25}. In particular, an extensive comparison between the two approaches has been conducted on forecasting buildings' consumption^{26}. Computationally, the two approaches face different costs and issues^{26}: local approaches require time to train all the models but are easily parallelized; however, producing forecasts from the trained models requires saving, maintaining, and updating a large number of models, which can be a hindrance. On the other hand, global approaches are harder to parallelize but easier to maintain, since only one model is required in the end. Global approaches have the added advantage of being usable for forecasting new time series for which no data are available.
This paper builds on a local approach^{13} while aiming to alleviate the computational and maintenance constraints associated with it. To that end, a trade-off must be found between accuracy on the one hand and the number of parameters and computational complexity on the other.
Computational efficiency is a long-standing research issue in optimization^{27}; the objective is to get a desired estimation in the shortest amount of time, or the best estimation in a fixed amount of time. An interesting subfield is online optimization^{19,28}, where the objective is to use streaming data to update a model in a recursive and efficient manner. While computational time and costs remain important incentives for efficient machine learning methods, other motivations have gained attention in recent years: energy consumption for environmental goals^{29}, sparsity to improve interpretability^{30}, and restricted use of data to respect privacy or to avoid costly data collection or transmission^{31,32}. This trade-off between performance and efficiency has previously been referred to as frugal machine learning^{32}. In the case study of this paper, the original forecasting method learns an individual model for each time series, so its computational time and energy consumption grow linearly with the number of time series considered. Reducing the time of the learning process has operational advantages; it also paves the way to an application to even more local consumption (down to the extreme of individual households), leading to many more time series, in which case the learning process would become infeasible.
To reduce the computational burden, this paper adopts a transfer learning point of view, represented with diagrams in Fig. 1. Individual models are trained on a few time series, relying on GAMs and their adaptive variant based on state-space models. Then, instead of training an individual model for each time series, these models learned on a small fraction of the time series are transferred to all of them, using aggregation of experts as the transfer tool. Precisely, the base models are applied to each time series even if they were not trained on it, and aggregating these weak forecasters yields a procedure able to scale to a large number of time series. Several aggregations are built involving different kinds of models, where transfer learning occurs at multiple levels.
The paper discusses the computational cost of each aggregation type and focuses on their frugal nature. Indeed, fitting individual models to a large number of time series is very costly, while aggregation of experts is not. It is shown that the presented methods compete with individual GAMs, even during the first French lockdown due to COVID-19, while achieving low computational cost. Transfer learning through aggregation of experts has also been studied in a hierarchical setting, using regional load data as sources and national load data as target^{33}.
Our contribution may be summarized as follows: building on local GAM models, we propose to use adaptive state-space models and transfer learning based on aggregation of experts to alleviate the constraints associated with local approaches (mainly maintenance and computing costs), leading to a frugal model that performs better than the individual models, with the added benefit of being usable on new time series with minimal historical depth. To that end, we first propose different ways to transfer a model learned on one time series to the others. Then, we introduce a way to use several models learned on several time series and combine them to forecast each time series. Finally, we apply the approach to the electricity load of French substations of the distribution network.
Following this introduction, the “Methodology” section establishes the methods used in this paper. The “Data and model presentation” section introduces the data and the models applied. The results of the forecasting tasks are analysed in the “Experiments” section. Finally, the “Conclusion” section summarises the paper and proposes future work.
Methodology
This section first characterises the forecasting problem in the transfer learning framework. Then, the principles of the aggregation of experts are detailed along with its use for transfer learning. Finally, the different models considered in the paper, and the resulting aggregation methods, are introduced.
Transfer learning context
The transfer learning approach taken in this paper can be expressed using the definitions and vocabulary used for transfer learning methods^{34}. First, it is essential to provide the definition of domain and task. A domain \({\mathscr {D}} = \{ {\mathscr {X}}, P(X) \}\) is composed of a feature space \({\mathscr {X}}\) and a marginal distribution P(X) of an instance set \(X = \{ x \mid x_i \in {\mathscr {X}}, i = 1, \dots , D \}\). A task \({\mathscr {T}} = \{ {\mathscr {Y}}, f \}\) is composed of a label space \({\mathscr {Y}}\) and a decision function f. Given \(m_S \in {\mathbb {N}}^+\) source domains and tasks and a target domain and task, transfer learning is the use of the knowledge within the source domains and tasks to improve the performance of the target decision function.
Each substation represents one target whose decision function differs from the other substations' decision functions. Thus, there are \(m_T = 1344\) transfer learning tasks, each corresponding to one forecasting task. This results in a multi-task transfer learning situation. On the other hand, several sources are used to perform the transfer, \(m_S > 1\), resulting in a multi-source transfer learning method.
The investigated forecasting problem may be characterised as homogeneous and inductive in the transfer learning terminology. Indeed, a homogeneous transfer learning scenario occurs when the source and target feature spaces are equal, as well as the source and target label spaces. The explanatory variables (calendar and meteorological variables and past electricity load) are equally available for all substations and thus lie in the same feature space. The mutual label space is \({\mathbb {R}}_+\), corresponding to the electricity measurement range. Inductive transfer learning occurs when labeled data are available in the target domain to induce the target decision functions, which is the present case.
Transfer learning using aggregation of experts
A natural idea to apply transfer learning in this context could be to determine a subsample of \(m_S\) data sets, build one individual model per data set, and select the best among the \(m_S\) models for each of the \(m_T\) substations to forecast. Since the information on the best model is unavailable, one way to estimate it is through the aggregation of experts.
In the aggregation of experts context^{19}, the aim is to predict a bounded sequence of observations \(y_1,\dots ,y_n \in [0,B]\) (B is unknown) using E forecasting models called experts. For each time step \(t \in \{ 1,\dots ,n \}\), they provide E forecasts \((\hat{y}_t^e)_{e=1}^E\) of the observation \(y_t\). The aggregation \(\hat{y}_t = \sum _{e=1}^E \hat{p}_{e,t} \hat{y}_t^e\) is then computed, where the weights \((\hat{p}_{e,t})_{e=1}^E\) are updated online according to the past performances of each expert. Forecast error is measured with a convex loss function \(\ell _t(y_t, \cdot )\). The goal is to minimise the so-called regret \(R_T = \frac{1}{T} \sum _{t=1}^T \ell _t(y_t, \hat{y}_t) - \frac{1}{T} \sum _{t=1}^T \ell _t(y_t, \hat{y}_t^\star )\), where \(\hat{y}_t^\star \) is given by an oracle model which can use unavailable information to build a forecast difficult to beat. The regret is the difference between the error suffered by the aggregation and the error of the oracle. The latter can be the best fixed convex combination of all the experts or the best fixed expert (constant over time). The algorithm used in the present work is the ML-Poly algorithm^{35}, successfully applied to electricity load forecasting^{36} and implemented in the R package opera^{37}. This algorithm tracks the best expert or the best convex combination of experts by giving more weight to experts that generate low regret. This makes the algorithm particularly interesting, as no parameter tuning is needed.
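The weight update can be illustrated with a simplified polynomial-potential aggregation in the spirit of ML-Poly under squared loss. This is an illustrative sketch only; the full algorithm, with its exact per-expert learning rates, is implemented in the opera package.

```python
import numpy as np

def ml_poly(y, experts):
    """Simplified ML-Poly-style aggregation under squared loss.

    y: (T,) observations; experts: (T, E) expert forecasts.
    Returns the aggregated forecasts and the final weight vector.
    """
    T, E = experts.shape
    R = np.zeros(E)  # cumulative regrets of each expert against the aggregation
    S = np.zeros(E)  # sums of squared instantaneous regrets (drive the learning rates)
    preds = np.empty(T)
    for t in range(T):
        w = np.maximum(R, 0.0) / (1.0 + S)      # polynomial-potential weights
        p = w / w.sum() if w.sum() > 0 else np.full(E, 1.0 / E)
        preds[t] = p @ experts[t]
        # instantaneous regret: loss of the mixture minus loss of each expert
        r = (y[t] - preds[t]) ** 2 - (y[t] - experts[t]) ** 2
        R += r
        S += r ** 2
    w = np.maximum(R, 0.0) / (1.0 + S)
    p_final = w / w.sum() if w.sum() > 0 else np.full(E, 1.0 / E)
    return preds, p_final
```

With one perfect expert and one biased expert, the weights concentrate on the perfect one after a few observations, which is the tracking behaviour described above.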
Transfer learning through aggregation of experts can be considered a parameter-based transfer learning method. Indeed, the aim is to produce multiple source learner models, introduced in the next section, and to combine them to create specific target learners.
Adaptive experts
This section details the individual models that will be aggregated. They are GAMs adapted by Kalman filtering, as proposed in^{17}. The section considers the forecasting of a single time series.
Generalized additive models
Generalized additive models (GAMs)^{11} assume that the response variable \(y_t\) is expressed as the following sum:
\(y_t = \beta _0 + \sum _{d=1}^{D} f_d(x_{t,d}) + \varepsilon _t\)

where \(\beta _0\) is an intercept, \(\varepsilon _t\) is the model error at time t, \((x_{t,d})_{d=1}^D\) are the D explanatory variables available at time t and \((f_d)_{d=1}^D\) are linear or nonlinear smooth functions called GAM effects. An effect \(f_d\) is expressed as a projection
\(f_d(x) = \sum _{k=1}^{m_d} \beta _{d,k} B_{d,k}(x)\)

where \((B_{d,k})_{k=1}^{m_d}\) is a spline basis of dimension \(m_d\) and \((\beta _{d,k})_{k=1}^{m_d}\) are the corresponding coefficients, estimated by a ridge regression where the following criterion is minimized:
\(\sum _{t} \Big ( y_t - \beta _0 - \sum _{d=1}^{D} f_d(x_{t,d}) \Big )^2 + \sum _{d=1}^{D} \lambda _d \int f_d''(x)^2 \, dx\)

where the \(\lambda _d\) are smoothing parameters. The penalty term controls the second derivatives \(f_d''\) to force the effects to be smooth. \(C_1\) denotes the computational cost of GAM estimation.
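The penalized estimation above can be sketched with a toy one-dimensional effect. This is an illustrative Python sketch, not the paper's mgcv implementation: it uses a truncated-power spline basis and a plain ridge penalty \(\lambda \Vert \beta \Vert ^2\) instead of mgcv's integrated squared second derivative, to keep the code short.

```python
import numpy as np

def truncated_cubic_basis(x, knots):
    # Toy spline basis: 1, x, x^2, x^3 plus truncated cubic terms (x - k)_+^3.
    cols = [np.ones_like(x), x, x ** 2, x ** 3]
    cols += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

def fit_effect(x, y, knots, lam=1e-6):
    # Penalized least squares: ||y - B beta||^2 + lam * ||beta||^2.
    # (mgcv penalizes the integrated squared second derivative instead.)
    B = truncated_cubic_basis(x, knots)
    beta = np.linalg.solve(B.T @ B + lam * np.eye(B.shape[1]), B.T @ y)
    return B, beta

# Recover a smooth effect from noisy observations.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 400))
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.normal(size=400)
B, beta = fit_effect(x, y, knots=np.linspace(0.1, 0.9, 9))
fitted = B @ beta
```

The penalty keeps the fitted effect smooth even where the data are noisy; increasing `lam` shrinks the spline towards a simpler shape.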
Adaptation by Kalman filter
To adapt a GAM, a multiplicative correction of the GAM effects vector \(f(x_t) = (1, \overline{f}_1(x_{t,1}), \dots , \overline{f}_D(x_{t,D}))^\top \) is applied, where \(\overline{f}_d\) is a normalized version of \(f_d\), obtained by subtracting the mean on the train set and dividing by the standard deviation.
The adaptation is obtained assuming that a statespace property is satisfied. Precisely, a vector \(\theta _t\) called state is estimated under the assumption that
\(y_t = \theta _t^\top f(x_t) + \varepsilon _t\,, \qquad \theta _{t+1} = \theta _t + \eta _t\,,\)

where \((\varepsilon _t)\) and \((\eta _t)\) are Gaussian white noises of respective variance / covariance \(\sigma ^2\) and Q.
Starting from a Gaussian prior and assuming the variances \(\sigma ^2\) and Q are known, the Kalman filter^{38} achieves the estimation of the state \(\theta _t\). This is a Bayesian method where at each step, the state posterior distribution is obtained as a Gaussian distribution \(\theta _t\mid (x_s,y_s)_{s<t}\sim {\mathscr {N}}(\hat{\theta }_t,P_t)\).
The Kalman filter depends on the initial parameters \(\hat{\theta }_1, P_1\) (the prior) and on the variances \(\sigma ^2\) and Q. These hyperparameters are chosen by maximizing the likelihood on the training set, using an iterative greedy procedure described in^{16}, chapter 5. This choice of hyperparameters is referred to as the dynamic setting. Its computational cost is noted \(C_2\) and can be large; it is detailed in the experimental study. Thanks to transfer learning, the costly estimation of these hyperparameters is avoided for each time series.
Another interesting setting, called static, is when \(Q = 0\), \(\sigma ^2=1\), \(P_1 = I\) and \(\hat{\theta }_1 = 0\). In that case, the state equation becomes \(\theta _{t+1}=\theta _t\) and the estimate \(\hat{\theta }_t\) is equivalent to the ridge forecaster:

\(\hat{\theta }_t = \Big ( I + \sum _{s<t} f(x_s) f(x_s)^\top \Big )^{-1} \sum _{s<t} y_s f(x_s)\,.\)
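The filter recursion and the static-setting equivalence can be checked with a minimal sketch. This is illustrative Python only (the paper uses the R package viking); the state-space model is the one above, \(y_t = \theta _t^\top f(x_t) + \varepsilon _t\), \(\theta _{t+1} = \theta _t + \eta _t\).

```python
import numpy as np

def kalman_filter(F, y, sigma2, Q, theta0=None, P0=None):
    """Kalman filter for y_t = f_t' theta_t + eps_t, theta_{t+1} = theta_t + eta_t.

    F: (T, d) array whose rows are the effect vectors f(x_t).
    Returns the one-step-ahead forecasts and the final state mean.
    """
    T, d = F.shape
    theta = np.zeros(d) if theta0 is None else theta0.copy()
    P = np.eye(d) if P0 is None else P0.copy()
    preds = np.empty(T)
    for t in range(T):
        f = F[t]
        preds[t] = f @ theta                   # one-step-ahead forecast
        s = f @ P @ f + sigma2                 # innovation variance
        k = P @ f / s                          # Kalman gain
        theta = theta + k * (y[t] - preds[t])  # state mean update
        P = P - np.outer(k, P @ f) + Q         # covariance update, then drift Q
    return preds, theta
```

In the static setting (Q = 0, sigma2 = 1, P1 = I, theta1 = 0), the final state mean coincides with the closed-form ridge estimator, as stated in the text.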
Models considered
As previously mentioned, the estimation of the GAM and of the Kalman variances is computationally costly when applied to 1344 time series. On the other hand, the computational time due to the Kalman updates can be neglected. The proposed method reduces the number of GAMs and sets of Kalman variances that are estimated.
Hybrid experts \({\mathscr {M}}_{i,j,k}\) are thus defined, where the different parts of the forecaster are trained on different time series: the GAM is trained on data set i, the Kalman variances are optimized (in the dynamic setting) on data set j, and the resulting adapted model is applied to data set k, where i, j, and k range in \(\{1,\dots ,m_T\}\). In particular, \({\mathscr {M}}_{i,\emptyset ,k}\) denotes a non-adapted GAM and \({\mathscr {M}}_{i,0,k}\) an adapted GAM in the static setting, where no variances are optimized.
There are three basic models without transfer learning, where the models are optimized on the data of the same substation they forecast: \({\mathscr {M}}_{k,k,k}\), \({\mathscr {M}}_{k,0,k}\), and \({\mathscr {M}}_{k,\emptyset ,k}\). This is the traditional machine learning scenario. Three models involving transfer learning are considered. The simplest transfer is the GAM transfer: a source data set is used to train the GAM, and the resulting GAM is applied to a different target data set. It corresponds to the model \({\mathscr {M}}_{i,\emptyset ,k},\ i \ne k\). An immediate improvement of the previous model is the adaptation of the transferred GAM with a Kalman filter optimized on the same data set used to train the GAM. It results in the models \({\mathscr {M}}_{i,i,k}, \ i \ne k\). The adaptation step helps the GAM transfer and improves on the basic GAM. Finally, in the Kalman filter transfer case, the data set to forecast is used to train a GAM, which is adapted by a Kalman filter optimized on another data set. These are the models \({\mathscr {M}}_{k,j,k},\ j \ne k\).
Aggregation models
As previously said, the estimation of Kalman variances in the dynamic setting is computationally costly (e.g. training of \({\mathscr {M}}_{k,k,k}\) model), and the goal is to avoid individual training for all time series. Moreover, although the computational cost of GAM effects estimation is smaller, it is desirable to train only a few of them to reduce the cost and the number of model parameters. To do so, three aggregation methods are built based on the three previous models where transfer learning occurs, provided in the “Models considered” section. AGG GAM TL is obtained from the aggregation of \(n_1~{\mathscr {M}}_{i,\emptyset ,k}\) models, AGG GAMKalman TL from \(n_2~ {\mathscr {M}}_{i,i,k}\) models, and AGG Kalman TL from \(n_3~ {\mathscr {M}}_{k,j,k}\) models. The data sets used to train GAMs and Kalman filters in each aggregation method are randomly chosen among the 1344 data sets.
Table 1 details the three aggregation methods and the two individual adapted GAMs. It specifies whether a transfer is involved, the type of model used, and the computational costs corresponding to GAM and Kalman variances estimation. Since applying a GAM, a Kalman filter, or an aggregation is computationally cheap, these costs are neglected. For each aggregation method, the parameter \(n_i\) is the unique hyperparameter of the method and corresponds to its number of sources \(m_S\). It must be as small as possible compared to \(m_T = 1344\) to enforce the frugality of the method.
Data and model presentation
Firstly, the French local electricity consumption and the explanatory variables are introduced. The models are then detailed from a practical point of view: definition of the GAM formula, training of the Kalman filter, and construction of the aggregation methods. To guarantee data confidentiality, substations' identities are not provided, and the electricity loads represented in the different figures are normalized by the average load.
Presentation of the data
Electricity load data
The data are provided by Enedis, the operator in charge of the electric power distribution in France. The data are composed of 1344 time series, each of which is the electricity consumption of one substation represented in blue in Fig. 2.
They cover metropolitan France and thus reflect its local electricity consumption. Forecasting each substation's consumption involves \(m_T = 1344\) forecasting tasks. The data are available from June 1st, 2014, to December 31st, 2021, with a 30-minute temporal resolution. Although most of the 1344 time series present classical temporal and meteorological patterns, some exhibit counterintuitive and contrary variations. Figure 3 compares one substation with typical behavior to two substations with unusual behaviors.
An operational constraint on data availability is fixed: the load of the previous day, labeled \(D-1\), becomes available on day D for forecasting day \(D+1\). Given the temporal resolution of the data, 48 forecasts are made for the 48 instants of the day. The forecast for an instant is made once the load at the same instant of the previous day is available, independently of the instant. In other words, the forecasting strategy is sequential, or online. The choice of forecasting context is crucial, as it impacts the quality of the forecasts.
Explanatory variables
For explanatory variables, meteorological and calendar variables are chosen, a common practice in electricity load forecasting. Weather variables are obtained from Météo-France, the French public institution of meteorology and climatology. They consist of temperature, cloudiness, and wind measurements from 27 weather stations. Temperature is expressed in degrees Celsius, cloudiness is the amount of cloud cover measured in octas, and wind is measured 10 meters above the ground and expressed in meters per second. The weather stations are unequally spread over metropolitan France and are represented in red in Fig. 2. Weather data are available over the same period as the electricity consumption, with a 3-hour temporal resolution. These data are brought to a 30-minute temporal resolution by linear interpolation. Temperature is highly correlated with electricity load, with a different impact for cold and hot regions, as shown in Fig. 4. The two patterns are due to the use of electrical heating and air-conditioning systems in France, which consume large amounts of electricity. To generate localized datasets, each substation is paired with its nearest weather station, creating substation-specific datasets with local weather data.
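The two preprocessing steps described above (pairing each substation with its nearest station and upsampling the 3-hour weather readings to 30 minutes) can be sketched as follows. This is an illustrative Python sketch under simplifying assumptions: plain Euclidean distance on the coordinates stands in for a proper great-circle distance, and the interpolation grid is hypothetical.

```python
import numpy as np

def nearest_station(sub_coords, station_coords):
    # Pair each substation with its geographically closest weather station.
    # sub_coords: (N, 2), station_coords: (M, 2); returns (N,) station indices.
    d = np.linalg.norm(sub_coords[:, None, :] - station_coords[None, :, :], axis=2)
    return d.argmin(axis=1)

def interp_30min(values_3h):
    # Upsample 3-hour readings to a 30-minute grid (6 steps per reading)
    # by linear interpolation, as done for the weather data.
    t3 = np.arange(len(values_3h)) * 6
    t30 = np.arange((len(values_3h) - 1) * 6 + 1)
    return np.interp(t30, t3, values_3h)
```

For example, a pair of readings `[0.0, 6.0]` three hours apart yields the 30-minute series `[0, 1, 2, 3, 4, 5, 6]`.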
An aspect to consider within the forecasting context is the availability of weather variables during the prediction process. In operational conditions, the weather variables used in the models are forecasted since future weather is not yet available. On the other hand, this article operates under the assumption that, on day D, the weather variables for day \(D+1\) are already available. While this setup is clearly unfeasible operationally, it holds significance for two main reasons. First, in the context of dayahead forecasting, the weather inputs would be weather forecasts of the next day, which can be assumed to be accurate and thus close to the actual weather of the next day. Second, the objective is to disentangle the forecasting error induced by the quality of the weather forecast, an external factor beyond our control, from the error linked to the forecasting model itself. In a similar setting, weather data availability has been discussed with a highlight on the importance of a good temperature forecast in operational conditions^{13}.
Calendar variables gather indicators of school holidays by region, bank holidays, and working days, as well as the instant of the day, the time of year, and the day of the week. French regions are divided into three zones, each with its own school holiday calendar. This factor is taken into account when constructing our models.
Time segmentation
The data ranges from June 2014 to December 2021. Training and validation of the models are performed on the data up to December 31st, 2019, and test is done afterwards. The test set includes COVID19 observations and three lockdown periods in France. The second and third lockdowns are considered as normal periods because the electricity consumption strongly varies only during the first lockdown, see Fig. 5. Three validation periods are thus chosen: 2020 out of the first lockdown, the first lockdown (from March 16th, 2020, to May 11th, 2020), and 2021.
Generalized additive model formula
To build a GAM, one must determine its formula: which explanatory variables and spline bases to use. To do so, 2018 is used as a test period, and the different tests have been carried out with a forward-backward heuristic. The assumption is made that a single GAM formula can be used for all target forecasting tasks: although the substations' behaviors differ, they can be explained by the same explanatory variables. Finally, one GAM is fitted per instant of the day; that is, 48 GAMs are optimized for each forecasting task. The following unique GAM formula is obtained:
where at each time step t,

- \(y_t\) is the electricity load for the considered instant,
- \(DayType_t\) is a categorical variable indicating the type of the day, with five categories: Monday, Tuesday to Thursday, Friday, Saturday, and Sunday,
- \(BankHoliday_t\) is a binary variable indicating whether day t is a bank holiday or a school holiday, depending on the region of the relevant substation,
- \(ToY_t\) is the time of year, whose value grows linearly from 0 on the 1st of January at midnight to 1 on the 31st of December at 23h30,
- \(WorkingDay_t\) is a binary variable indicating whether day t is a working day, i.e., not a weekend day or a bank holiday,
- \(Trend_t\) is the index of the current observation,
- \(Temp_t\) is the temperature measured at the closest weather station,
- \(Temp95_t\) and \(Temp99_t\) are exponentially smoothed versions of the \(Temp_t\) variable with factors \(\alpha = 0.95\) and 0.99; e.g., for \(\alpha = 0.95\) at a given time step t, \(Temp95_t = \alpha \, Temp95_{t-1} + (1 - \alpha ) Temp_t\),
- \(TempMin_t\) and \(TempMax_t\) are the minimal and maximal values of \(Temp_t\) over the current day,
- \(Load2D_t\) and \(Load1W_t\) are the loads two days before and one week before,
- \(\varepsilon _t\) is a Gaussian noise with zero mean and constant variance.
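The smoothed temperature covariates above (\(Temp95_t\), \(Temp99_t\)) follow the stated recursion and can be computed in a few lines. This is an illustrative Python sketch (the paper's pipeline is in R); the initialization `out[0] = temp[0]` is an assumption, as the text does not specify it.

```python
import numpy as np

def exp_smooth(temp, alpha):
    # Temp95 / Temp99-style covariate: s_t = alpha * s_{t-1} + (1 - alpha) * temp_t.
    out = np.empty(len(temp), dtype=float)
    out[0] = temp[0]  # assumed initialization (not specified in the text)
    for t in range(1, len(temp)):
        out[t] = alpha * out[t - 1] + (1.0 - alpha) * temp[t]
    return out
```

With \(\alpha = 0.95\), the smoothed series reacts slowly to temperature changes, which captures the thermal inertia of buildings.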
This model is a short-term model in terms of the availability of past electricity consumption; it is thus called ST GAM. A model called MT GAM is also considered: the same model without the \(Load2D_t\) and \(Load1W_t\) effects.
The case where \(\varepsilon _t\) is an autocorrelated error term is also examined. In that case, an ARIMA (autoregressive integrated moving average) model is chosen by selecting the best model according to the AIC criterion in the ARIMA(p,d,q) family. Correcting the residuals of the mid-term and short-term GAMs with an ARIMA model achieves the same performance; therefore, the more straightforward mid-term formula is chosen. This can be explained by the redundancy between correcting autocorrelated residuals with an ARIMA model and the linear lag terms in the short-term GAM. The resulting model is called MT GAM + ARIMA.
Thin plate spline bases with low dimensions represent all the effects except \(f_1\) and \(f_2\). Indeed, the time of year has a cyclic impact (see Fig. 4 (a)); therefore, a cyclic cubic spline basis of dimension 20 is employed. GAM effects estimation is done using the Generalized Cross Validation criterion and takes a few tens of seconds in practice. Thus, it is computationally reasonable to train individual GAMs for many forecasting tasks. Finally, the GAMs are trained using R and the package mgcv^{39}.
Kalman filtering
This section examines the Kalman filtering adaptation of the previous short-term GAM. In the static setting, one can apply one individual adapted GAM to each forecasting task, as no Kalman variances estimation is necessary. This model is called GAM + Kalman Static. It is a traditional machine learning context where the model is optimized on the same data set it forecasts.
Execution time is essential in the dynamic setting. It is an unavoidable operational constraint, and the decision is made to compute Kalman variances on a subsample of the 1344 substations. This subsample is chosen to represent the variety of forecasting difficulty. First, the 1344 substations are sorted according to the performance (NMAE; see the “Experiments” section) of their short-term GAM predictions for 2018. The 44 substations with the worst performances are put aside, as they would not provide interesting experts. This allows dividing the remaining 1300 substations into 13 groups of 100 substations each. A sample of 6 substations is then randomly drawn from each group of 100. Ultimately, a subsample of 78 representative substations is obtained, reflecting the forecasting difficulty. The parallel computation of the 78 corresponding sets of Kalman variances on a 36-core virtual machine on 2014–2018 data takes about 1.6 days. The computation of 1344 individual Kalman variances would thus last about 28 days in a similar setting. The parameters of this sampling method have not been optimized, for two reasons. First, it would require a significant amount of computational time and associated emissions, which contradicts the frugal philosophy of the article. Second, it is assumed that the sampling method has a very low impact on the final quality of the forecast compared to other model parameters, such as the number of experts used in the aggregations.
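The stratified subsampling procedure just described can be sketched as follows. This is an illustrative Python sketch of the sampling logic only; function and parameter names are hypothetical, and the seed handling is an assumption.

```python
import numpy as np

def representative_sample(nmae, drop_worst=44, n_groups=13, per_group=6, seed=0):
    """Stratified subsample of substations by forecasting difficulty.

    nmae: (N,) per-substation GAM validation error; higher means harder to forecast.
    Returns drop_worst-excluded, difficulty-stratified substation indices.
    """
    rng = np.random.default_rng(seed)
    order = np.argsort(nmae)                  # easiest substations first
    kept = order[: len(order) - drop_worst]   # put aside the worst performers
    groups = np.array_split(kept, n_groups)   # groups of similar difficulty
    picks = [rng.choice(g, size=per_group, replace=False) for g in groups]
    return np.concatenate(picks)
```

With the paper's values (1344 substations, 44 dropped, 13 groups of 100, 6 draws per group), this yields the 78 representative substations used to compute the Kalman variances.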
Two sets of Kalman variances are calculated on the same substations subsample: one on 2014–2018 data and the other on 2014–2019 data. The first set is used to set the aggregation hyperparameters \(n_i\) with 2019 as the validation set, and the second set is used to forecast 2020–2021 as the test set. Thus, there are at most 78 GAMs adapted by their individual Kalman variances. In that case, the corresponding model is called GAM + Kalman Dynamic, and it is shown in the Experiments section that the aggregation methods achieve equivalent performances.
Kalman filters estimation and application are achieved using R and the package viking^{40}.
Aggregation
A reminder on the three aggregation methods can be found in the Aggregation models section and Table 1. This section focuses on their application.
For each aggregation method, the unique hyperparameter is \(n_i\), the number of experts in the aggregation, which is also the number of sources in the transfer learning task. As stated previously, \(m_S\) should be as small as possible compared to \(m_T = 1344\). A grid search is performed with 2019 data as the validation set: 10 forecasts of 182 representative substations are carried out for different values of \(n_i\), and the 10 corresponding medians are then computed. The 182 representative substations are obtained in the same way as the 78 substations used to optimize the Kalman variances. The results are presented in Fig. 6. As the number of experts grows, the aggregation becomes increasingly robust to the randomness of the substation selection used to compute GAM effects and Kalman variances. Moreover, there is an elbow phenomenon after \(n_1 = n_2 = 9\) for AGG GAM TL and AGG GAMKalman TL, and after \(n_3 = 6\) for AGG Kalman TL. These values of \(n_i\) are very small compared to \(m_T = 1344\).
Experiments
In this section, the models are analyzed through visualization plots, and their performances on the substation data sets are compared using a common metric.
Model dynamics
Aggregation methods are first analyzed at the experts’ scale and then at the GAM effects’ scale.
At the experts’ scale, one might want to know which expert is most important in the mixture and when. To do so, the distribution of the weights of each expert is studied over all the forecasts produced for each substation: the higher the weight, the more important the corresponding expert is in the forecast. One expert can be important for specific observations and not for others. As an example, Fig. 7 represents boxplots of the weights of the 1344 AGG GAM-Kalman TL aggregations at three interesting instants of the day where the consumption behaviors are very different. From this representation, it can be deduced that experts 3, 4, 5, and 6 are equally essential at the three instants of the day, with medians near the uniform weight of 1/9. On the other hand, the other experts’ weights vary more according to the instant of the day: at each instant, each of these experts is of low importance at least once (weight close to 0) and of great importance at least once (weight close to 0.5 or larger).
Another way of analysing the importance of experts is to focus on each forecast to see which experts are used, and when, in an aggregation. Figure 8 displays the evolution of the weights of the AGG GAM-Kalman TL aggregation for two specific substations at midday. The same conclusions as before are observed: some experts have very little impact on the aggregation, while others are important. The new information these figures provide is the evolution of the experts’ importance over the course of the aggregation. For example, in Fig. 8a, Exp4’s weight increases and becomes of paramount importance, exceeding 1/2 for the last observations. In Fig. 8b, Exp7 vanishes rapidly while the impact of Exp1 increases. Exp5 and Exp9 contribute very little to the first observations and quickly disappear from the aggregation.
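The evolving weights shown in Fig. 8 come from an online aggregation rule that rewards experts with small past losses. The paper relies on the opera R package, whose default algorithms differ; the sketch below only illustrates the general mechanism with the classical exponentially weighted average forecaster (the learning rate `eta` and the square loss are assumptions for this illustration):

```python
import numpy as np

def ewa_weights(expert_preds, y, eta=0.5):
    """Exponentially weighted average aggregation: each expert's weight
    decays exponentially with its instantaneous square loss, and weights
    are renormalized to sum to 1 at every step."""
    T, E = expert_preds.shape
    w = np.full(E, 1.0 / E)        # start from uniform weights
    history = np.empty((T, E))
    for t in range(T):
        history[t] = w
        loss = (expert_preds[t] - y[t]) ** 2
        w = w * np.exp(-eta * loss)
        w /= w.sum()
    return history

# Expert 0 tracks the target; expert 1 is biased: expert 0's weight grows.
rng = np.random.default_rng(1)
y = rng.normal(size=200)
preds = np.column_stack([y + 0.1 * rng.normal(size=200), y + 1.0])
hist = ewa_weights(preds, y)
print(hist[0], hist[-1])
```

As in the figures, an expert whose forecasts deteriorate sees its weight vanish from the mixture, while a consistently accurate expert becomes dominant.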
At the GAM effects’ scale, the evolution of the state coefficients produced by Kalman filtering is visualized by plotting the D curves corresponding to the D GAM effects. Figure 9a provides an example with the evolution of the state coefficients of an adapted GAM \({\mathscr {M}}_{k,k,k}\) during the first lockdown. It can be observed that the coefficients associated with Bias and \(f_2(ToY)\), the effect modeling the time of year during working days, are more significant than the others and evolve rapidly. The bias coefficients evolve to balance the gap between the source load used to train the GAM and the target load, which can be of different scales. Finally, some state coefficients evolve significantly, like those of BankHoliday or DayType, while others seem time-invariant.
To extend this visual diagnosis tool to the case of AGG Kalman TL, one can express the aggregation as a GAM adapted by hybrid state coefficients \((\widetilde{\theta }_{t,d})_{d=1}^D\):

$$\hat{y}_t = \sum _{e=1}^E \hat{p}_{e,t} \sum _{d=1}^D \hat{\theta }_{t,d,e} f_d(x_t) = \sum _{d=1}^D \widetilde{\theta }_{t,d} f_d(x_t),$$
where, at time t, \(\widetilde{\theta }_{t,d} = \sum _{e=1}^E \hat{p}_{e,t} \hat{\theta }_{t,d,e}\) denotes the hybrid state coefficient, E the number of experts, D the number of GAM effects, \(\hat{p}_{e,t}\) the weight associated with expert e, and \(\hat{\theta }_{t,d,e}\) the adaptation coefficient associated with the GAM effect \(f_d\) of expert e. This new expression gathers information on both adaptation and aggregation and improves the interpretation of the whole model. Figure 9b represents the evolution of these hybrid state coefficients during the first lockdown for the same target forecasting task as in Fig. 9a. One can therefore compare the Kalman coefficients \((\hat{\theta }_{t,d,e})_{d=1}^D\) of an individual model \({\mathscr {M}}_{k,k,k}\) with the hybrid state coefficients \((\widetilde{\theta }_{t,d})_{d=1}^D\) calculated from the coefficients of the \(n_3\) \({\mathscr {M}}_{k,j,k}\) models. The two types of coefficients differ in amplitude and dynamics. The bias coefficients are still important compared to the others and evolve rapidly, but this is not the case for the \(f_2\) coefficients.
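Computing the hybrid coefficients amounts to a per-time-step weighted average of the per-expert Kalman coefficients; a NumPy sketch (the shapes below are arbitrary, chosen only for illustration):

```python
import numpy as np

# Hybrid state coefficients: theta_tilde[t, d] = sum_e p[t, e] * theta[t, d, e]
T, D, E = 48, 5, 9                        # time steps, GAM effects, experts
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(E), size=T)     # aggregation weights, rows sum to 1
theta = rng.normal(size=(T, D, E))        # per-expert Kalman coefficients

theta_tilde = np.einsum('te,tde->td', p, theta)
print(theta_tilde.shape)  # (48, 5)
```

Plotting the D columns of `theta_tilde` over time reproduces a diagnostic plot like Fig. 9b.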
When considering any aggregation method, the entire mixture can be expressed as a GAM using a linear combination of GAM effects. The new coefficients of the spline basis are composed of the basic coefficients of the spline basis, the adaptation coefficients, and the aggregation weights.
Numerical results
To compute the numerical performances of the models, the normalized mean absolute error (NMAE) is used. It is defined for a time series \((y_t)_t\), a forecast \(({\hat{y}}_t)_t\) and a test set \({\mathscr {T}}\) by:

$$\mathrm {NMAE}\left( y, \hat{y}\right) = \frac{\frac{1}{|{\mathscr {T}}|} \sum _{t \in {\mathscr {T}}} |y_t - \hat{y}_t|}{\left| \frac{1}{|{\mathscr {T}}|} \sum _{t \in {\mathscr {T}}} y_t \right| }$$
It is the mean absolute error between the ground truth and the forecast, normalized by the absolute mean of the time series. This metric allows comparing forecasts of different scales, which is necessary in this case study because an individual NMAE is computed for each time series. The NMAE is chosen rather than the mean absolute percentage error because the time series are sometimes very close to 0. In practice, the NMAE is reported as a percentage by multiplying it by 100.
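A minimal implementation of this metric (an illustrative Python sketch; the study's code is in R, and the sample values are made up):

```python
import numpy as np

def nmae(y, y_hat):
    """Normalized mean absolute error: the MAE between ground truth and
    forecast, divided by the absolute mean of the series (multiply by 100
    to report it as a percentage)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean(np.abs(y - y_hat)) / np.abs(np.mean(y))

y     = [100.0, 120.0, 80.0, 100.0]
y_hat = [ 90.0, 130.0, 85.0, 105.0]
print(100 * nmae(y, y_hat))  # 7.5
```

Unlike the MAPE, the denominator is the mean of the whole series, so observations close to 0 do not blow up the score.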
The performances of the individual and aggregation models displayed in Table 2 are first compared. The improvements in performance are referenced with respect to the three respective validation periods.
It can be noted that the information on the past electricity load added to the GAM formula is of great importance, with a significant improvement between ST GAM and MT GAM (1.84%, 2.83%, and 3.13% for the medians). There is also a gain when the MT GAM residuals are modeled with an ARIMA model, with an improvement between MT GAM and MT GAM + ARIMA (0.77%, 1.46%, and 1.70% for the medians). GAM + Kalman Static provides slightly weaker performances than MT GAM + ARIMA. The latter is thus the individual benchmark to beat.
Concerning the aggregation methods, AGG GAM TL shows the worst performances. However, this model remains interesting, as its performances are close to those of ST GAM while its computational cost is much lower. On the other hand, the two other aggregation methods show excellent and close performances. AGG Kalman TL is better than AGG GAM-Kalman TL outside of the first lockdown (improvement in the median of 0.03% and 0.07%), while the opposite holds for the first lockdown (decrease of 0.16%). The gap between these last two models is not significant. As AGG GAM-Kalman TL is less complex than AGG Kalman TL, it is considered the best compromise between forecasting performance and computational cost. Compared to MT GAM + ARIMA, the best individual model, its median performances improve by 0.62%, 1.54%, and 0.84%.
Figure 10 shows the evolution of the median errors over the months. Three models are considered: the initial model, the best individual model, and the best aggregation. It indicates that, except for January 2020, the final model is consistently better than the benchmark.
As explained in the “Transfer learning using aggregation of experts” section, aggregation algorithms aim to beat two oracles: the best fixed expert and the best fixed convex combination of experts. They are called Expert Oracle and Convex Oracle, respectively, and their performances are computed for AGG GAM-Kalman TL. As displayed in Table 2, the aggregation competes with the best fixed expert and approaches the best fixed combination of experts. The same holds for the other two aggregation methods.
Concerning the test periods, the 2020 forecasts worsen during the first French lockdown, with a downgrade of 2.18% between the two MT GAM + ARIMA median performances. AGG GAM-Kalman TL and AGG Kalman TL present lower downgrades, 1.26% and 1.45% in the median, respectively: the online nature of Kalman filtering and of the aggregation of experts allows adaptation to the extreme change in electricity consumption. On the other hand, ST GAM and MT GAM performances worsen in 2021, because these models are not updated while the consumption behavior evolves. The downgrades are significant: 1.37% and 2.36% in the median, respectively. The performance gaps between 2020 outside the first lockdown and 2021 are tighter for adaptive models, especially for the two best aggregation methods: 0.22% and 0.18% in the median, respectively. Once again, this shows that the aggregation methods adapt well to the evolution of the data distribution.
Table 3 provides the performances of 69 forecasting tasks. Precisely, they correspond to the 78 substations used to optimize the sets of Kalman variances, minus the 9 substations providing the experts involved in the three aggregations. This allows comparing the aggregation methods with GAM + Kalman Dynamic. As previously said, GAM + Kalman Dynamic achieves excellent performances in forecasting the national electricity load; it is thus a competitive model and the best among the individual models. However, it requires a large amount of time to estimate. It is remarkable that AGG GAM-Kalman TL and AGG Kalman TL perform similarly while being much less computationally costly.
Finally, the distribution of the 1344 NMAE scores is depicted in Fig. 11a and their geographical distribution in Fig. 11b. It is interesting to notice that the best-forecast substations are close to urban areas like Paris, Marseille, Lille, or Bordeaux, whereas the worst-forecast ones are located in rural areas.
Conclusion
In this paper, a frugal method is proposed to forecast multiple electricity loads. Estimating one individual model per time series is avoided, resulting in a scalable methodology based on the aggregation of experts that requires limited computational resources. The chosen experts are GAMs adapted by Kalman filtering, which have performed very well for electricity consumption forecasting. The aggregation methods allow the transfer of GAMs and Kalman filters separately or simultaneously. This paper demonstrates that they provide good forecasts compared to competitive models, especially during the first French lockdown. Moreover, they require neither human intervention nor expertise and are thus simple to use. The method is frugal in terms of parameter estimation, and its computational cost has been discussed extensively. Finally, it benefits from the interpretability of GAMs and of the aggregation of experts.
There are various ways to extend the work presented in this paper. First, the random selection of the time series used to train the GAMs and Kalman variances can be improved: clusters of substations may be identified according to some characteristics (geography, weather, type of consumers) and one representative selected per cluster. A second avenue of improvement is the inclusion of new explanatory variables at the same local scale: geotracking and communication data reflect human behavior and are therefore helpful for electricity consumption forecasting^{33}.
This paper focused on learning models on single time series and then transferring them to other ones; it may be possible to transfer models trained on several substations. One could easily train a GAM jointly on several time series (by concatenating the data sets); however, it is not trivial to learn the Kalman hyperparameters using several substations, due to the sequential structure of state-space models.
Finally, the computational complexity of the method introduced essentially depends on the estimation of a fixed number of models; the objective in doing so is to obtain a very scalable forecasting method. One could therefore apply the method to a larger data set, for instance electricity consumption at a finer granularity.
Data availability
The data that support the findings of this study are available from Enedis but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of Enedis.
References
Hong, T., Pinson, P. & Fan, S. Global energy forecasting competition 2012. Int. J. Forecast. 30, 357–363. https://doi.org/10.1016/j.ijforecast.2013.07.001 (2014).
Hong, T. et al. Probabilistic energy forecasting: Global energy forecasting competition 2014 and beyond. Int. J. Forecast. 32, 896–913. https://doi.org/10.1016/j.ijforecast.2016.02.001 (2016).
Hong, T., Xie, J. & Black, J. Global energy forecasting competition 2017: Hierarchical probabilistic load forecasting. Int. J. Forecast. 35, 1389–1399. https://doi.org/10.1016/j.ijforecast.2019.02.006 (2019).
Huang, S.-J. & Shih, K.-R. Short-term load forecasting via ARMA model identification including non-Gaussian process considerations. IEEE Trans. Power Syst. 18, 673–679. https://doi.org/10.1109/TPWRS.2003.811010 (2003).
Chodakowska, E., Nazarko, J. & Nazarko, Ł. ARIMA models in electrical load forecasting and their robustness to noise. Energies. https://doi.org/10.3390/en14237952 (2021).
Jalil, N., Ahmad, M. & Mohamed, N. Electricity load demand forecasting using exponential smoothing methods. World Appl. Sci. J. 22, 1540–1543. https://doi.org/10.5829/idosi.wasj.2013.22.11.2891 (2013).
Aguilar Madrid, E. & Antonio, N. Short-term electricity load forecasting with machine learning. Information 12, 50 (2021).
Lloyd, J. R. GEFCom 2012 hierarchical load forecasting: Gradient boosting machines and Gaussian processes. Int. J. Forecast. 30, 369–374 (2014).
Park, D. C., ElSharkawi, M., Marks, R., Atlas, L. & Damborg, M. Electric load forecasting using an artificial neural network. IEEE Trans. Power Syst. 6, 442–449 (1991).
Ryu, S., Noh, J. & Kim, H. Deep neural network based demand side short term load forecasting. Energies 10, 3 (2016).
Wood, S. N. Generalized Additive Models: An Introduction with R 2nd edn. (Chapman and Hall, 2017).
Pierrot, A. & Goude, Y. Short-term electricity load forecasting with generalized additive models. Proceedings of ISAP power 2011 (2011).
Goude, Y., Nedellec, R. & Kong, N. Local short and middle term electricity load forecasting with semiparametric additive models. IEEE Trans. Smart Grid 5, 440–446. https://doi.org/10.1109/TSG.2013.2278425 (2014).
Fasiolo, M., Wood, S. N., Zaffran, M., Nedellec, R. & Goude, Y. Fast calibrated additive quantile regression. J. Am. Stat. Assoc. 116, 1402–1412. https://doi.org/10.1080/01621459.2020.1725521 (2021).
Fan, S. & Hyndman, R. J. Forecasting electricity demand in australian national electricity market. In 2012 IEEE Power and Energy Society General Meeting, 1–4 (IEEE, 2012).
de Vilmarest, J. Modèles espaceétat pour la prévision de séries temporelles. Application aux marchés électriques. Ph.D. thesis, Sorbonne Université (2022).
Obst, D., de Vilmarest, J. & Goude, Y. Adaptive methods for short-term electricity load forecasting during Covid-19 lockdown in France. IEEE Trans. Power Syst. PP, 1. https://doi.org/10.1109/TPWRS.2021.3067551 (2021).
de Vilmarest, J. & Goude, Y. Statespace models for online postcovid electricity load forecasting competition. IEEE Open Access J. Power Energy 9, 192–201. https://doi.org/10.1109/OAJPE.2022.3141883 (2022).
CesaBianchi, N. & Lugosi, G. Prediction, Learning, and Games (Cambridge University Press, 2006).
Gaillard, P. & Goude, Y. Forecasting electricity consumption by aggregating experts; how to design a good set of experts. In Modeling and Stochastic Learning for Forecasting in High Dimensions (eds Antoniadis, A. et al.) 95–115 (Springer, 2015). https://doi.org/10.1007/978-3-319-18732-7_6.
Miller, C. et al. The ASHRAE great energy predictor III competition: Overview and results. Sci. Technol. Built Environ. 26, 1427–1447. https://doi.org/10.1080/23744731.2020.1795514 (2020).
Makridakis, S., Spiliotis, E. & Assimakopoulos, V. The M4 competition: Results, findings, conclusion and way forward. Int. J. Forecast. 34, 802–808. https://doi.org/10.1016/j.ijforecast.2018.06.001 (2018).
Makridakis, S., Spiliotis, E. & Assimakopoulos, V. Predicting/hypothesizing the findings of the M5 competition. Int. J. Forecast. 38, 1337–1345. https://doi.org/10.1016/j.ijforecast.2021.09.014 (2022).
Januschowski, T. et al. Criteria for classifying forecasting methods. Int. J. Forecast. 36, 167–177. https://doi.org/10.48550/arXiv.2212.03523 (2020).
Buonanno, A. et al. Global vs. local models for shortterm electricity demand prediction in a residential/lodging scenario. Energies 15, 2037. https://doi.org/10.3390/en15062037 (2022).
Montero-Manso, P. & Hyndman, R. J. Principles and algorithms for forecasting groups of time series: Locality and globality. Int. J. Forecast. 37, 1632–1653. arXiv:2008.00444 (2021).
Bottou, L. & Bousquet, O. The tradeoffs of large scale learning. Advances in neural information processing systems 20 (2007).
Hazan, E. et al. Introduction to online convex optimization. Found. Trends Optim. 2, 157–325. arXiv:1909.05207 (2016).
GarcíaMartín, E., Rodrigues, C. F., Riley, G. & Grahn, H. Estimation of energy consumption in machine learning. J. Parallel Distrib. Comput. 134, 75–88. https://doi.org/10.1016/j.jpdc.2019.07.007 (2019).
Carvalho, D. V., Pereira, E. M. & Cardoso, J. S. Machine learning interpretability: A survey on methods and metrics. Electronics 8, 832. https://doi.org/10.3390/electronics8080832 (2019).
Kaissis, G. A., Makowski, M. R., Rückert, D. & Braren, R. F. Secure, privacy-preserving and federated machine learning in medical imaging. Nat. Mach. Intell. 2, 305–311. https://doi.org/10.1038/s42256-020-0186-1 (2020).
Evchenko, M., Vanschoren, J., Hoos, H. H., Schoenauer, M. & Sebag, M. Frugal machine learning. arXiv preprint arXiv:2111.03731 (2021).
Gaucher, S., Goude, Y. & Antoniadis, A. Hierarchical transfer learning with applications for electricity load forecasting. arXiv preprint arXiv:2111.08512 (2021).
Zhuang, F. et al. A comprehensive survey on transfer learning. Proceedings of the IEEE PP, 1–34. https://doi.org/10.1109/JPROC.2020.3004555 (2020).
Gaillard, P., Stoltz, G. & van Erven, T. A secondorder bound with excess losses. In Proceedings of The 27th Conference on Learning Theory, vol. 35 of Proceedings of Machine Learning Research (eds. Balcan, M. F., Feldman, V. & Szepesvári, C.) 176–196 (PMLR, Barcelona, 2014).
Gaillard, P., Goude, Y. & Nedellec, R. Additive models and robust aggregation for gefcom2014 probabilistic electric load and electricity price forecasting. Int. J. Forecast. 32, 1038–1050 (2016).
Gaillard, P. & Goude, Y. opera: Online prediction by expert aggregation. https://CRAN.R-project.org/package=opera, R package version (2016).
Kalman, R. E. A new approach to linear filtering and prediction problems. J. Basic Eng. 82, 35–45. https://doi.org/10.1115/1.3662552 (1960).
Wood, S. Package ‘mgcv’. https://cran.r-project.org/package=mgcv, R package version 1, 729 (2015).
de Vilmarest, J. Viking: State-space models inference by Kalman or Viking. https://cran.r-project.org/package=viking, R package version (2022).
Acknowledgements
We thank Enedis for the dataset of the French electricity demand.
Author information
Contributions
J.d.V. developed the viking package. G.L. did the study during an internship supervised by J.d.V. and B.H. All authors reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Lambert, G., Hamrouche, B. & de Vilmarest, J. Frugal day-ahead forecasting of multiple local electricity loads by aggregating adaptive models. Sci. Rep. 13, 15784 (2023). https://doi.org/10.1038/s41598-023-42488-1