A time-reversed model selection approach to time series forecasting

In this paper, we introduce a novel model selection approach to time series forecasting. For linear stationary processes, such as AR processes, the direction of time is independent of the model parameters. By combining theoretical principles of time-reversibility in time series with conventional modeling approaches such as information criteria, we construct a criterion that employs the backwards prediction (backcast) as a proxy for the forecast. Hereby, we aim to adopt a theoretically grounded, data-driven approach to model selection. The novel criterion is named the backwards validated information criterion (BVIC). The BVIC identifies suitable models by trading off a measure of goodness-of-fit against a model's ability to predict backwards. We test the performance of the BVIC by conducting experiments on synthetic and real data. In each experiment, the BVIC is examined in contrast to conventionally employed criteria. Our experimental results suggest that the BVIC has performance comparable to conventional information criteria. Specifically, in most of the experiments performed, we did not find statistically significant differences between the forecast error of the BVIC under certain parameterizations and that of the different information criteria. Nonetheless, it is worth emphasizing that the guarantees of the BVIC are established by design: the model order penalization term is grounded in the mathematical properties of time-reversible time series and in a finite-sample assessment of the data. In particular, the penalization term is replaced by a weighted trade-off between functional dimensions pertaining to forecasting. That said, we observed that the BVIC recovered the true order of the underlying process more accurately than the other criteria, which rely on a static penalization of the model order.
Lastly, leveraging the latter property, we assess the model order (or memory) of time series pertaining to epileptic seizures recorded via electrocorticography. Our results provide converging evidence that the order of the model increases during epileptic events.


Scientific Reports | (2022) 12:10912 | https://doi.org/10.1038/s41598-022-15120-x

In practice, time series are never stationary in the strict sense, but weaker forms of stationarity exist (e.g., wide-sense stationarity) and may be used to analyse time series that satisfy this assumption. In some cases, such as with an (intracranial) electroencephalography recording6, the stationarity assumption holds only for limited periods of time. Therefore, the data available for estimation is constrained to the duration of stationarity. As a result, when the number of samples is small, there is a high probability that methods such as the AIC over-fit7.
In this paper, we focus on autoregressive processes and present a novel model selection technique that employs the time-reversibility properties of autoregressive models to validate forecasting ability by evaluating the backwards prediction. Since past data is accessible, the technique aims to minimize the prediction error and uncertainty of models on the backward prediction. Theoretically, we show that backward prediction achieves the same objective as forward prediction.
As such, the new method will be referred to as the backwards validated information criterion (BVIC). To perform model selection and determine the model parameters, we formulate an optimization problem that explicitly weighs three features: (a) regression, (b) generalization, and (c) uncertainty. The first component, regression, enables us to evaluate the goodness-of-fit of a specified model with respect to the observed data, and can be quantified through maximum likelihood estimation or least-squares regression. Secondly, generalization represents the ability of a model to predict outside the sample of given observed data, and can be seen as analogous to a measure of prediction accuracy. Lastly, the uncertainty feature of the criterion can be considered a level of precision exhibited by a prediction. That said, we may quantify the features of generalization and uncertainty through the metrics of mean squared prediction error and theoretical prediction variance, respectively.
Consequently, to test the novel model selection criterion, we conduct thorough Monte Carlo simulations using experiments with both real and synthetic data. Here, the first two experiments consist of generating synthetic time series via specified autoregressive models with different model orders and sets of parameters. The third experiment assesses the different criteria on intracranial electroencephalographic data. In each experiment, we assess the quality of the BVIC, and of other conventional information criteria, on the basis of a selection of performance metrics. Additionally, we evaluate the effects of noise on model selection by varying the noise variance throughout the experiments with synthetic data.

Methods
Consider an autoregressive process of order p, or simply AR(p), described by5

$$X_t = \sum_{i=1}^{p} \theta_i X_{t-i} + W_t, \quad (1)$$

where {X_t ∈ R : t ∈ N} is a stochastic process with parameters {θ_i ∈ R : i = 1, ..., p}. The process noise {W_t ∈ R : t ∈ N} is an independent and identically distributed (i.i.d.) Gaussian white noise (WN) sequence with variance σ² ∈ R₊. The linear prediction X̂ⁿ_{n+h} denotes the h-step ahead predictor using the last n measurements, described as8

$$\hat{X}^n_{n+h} = \left(\theta^{(h)}_n\right)^{\intercal} X, \quad (2)$$

where X = (X_n, X_{n−1}, ..., X_1)^⊺ ∈ Rⁿ and θ^{(h)}_n ∈ Rⁿ for forecast horizon h ∈ N. Note that the predictor is described by at most n parameters; in practice, one would consider an estimate with j non-zero parameters.
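The AR(p) recursion and its linear predictor can be sketched numerically. The snippet below is an illustrative simulation; the function name and burn-in choice are ours, not from the paper.

```python
import numpy as np

def simulate_ar(theta, n, sigma=1.0, burn_in=500, seed=None):
    """Simulate X_t = sum_i theta[i] * X_{t-i} + W_t with i.i.d.
    Gaussian white noise W_t of standard deviation sigma."""
    rng = np.random.default_rng(seed)
    p = len(theta)
    x = np.zeros(n + burn_in)
    w = rng.normal(0.0, sigma, size=n + burn_in)
    for t in range(p, n + burn_in):
        # Most recent value first, matching X = (X_{t-1}, ..., X_{t-p}).
        x[t] = np.dot(theta, x[t - p:t][::-1]) + w[t]
    return x[burn_in:]  # drop the transient so the series is near-stationary

# One-step linear prediction for a stationary AR(2):
x = simulate_ar([0.5, -0.3], n=1000, seed=0)
x_hat = 0.5 * x[-1] - 0.3 * x[-2]
```
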
We are interested in finding a model that follows the linear structure in (2) that best represents the true process as described in (1), according to some predefined metric. Selection of models is concerned with finding a parametric model for a specified objective, such as minimal error between forecasted and measured data. In order to find the predictor that obtains minimum mean squared error we use Theorem 1-see S1 Supplementary Information.
Ideally, we find the optimal linear predictor by minimizing the mean squared error over a specified prediction horizon as follows:

$$\min_{f} \; \frac{1}{h_2 - h_1 + 1} \sum_{i=h_1}^{h_2} \mathbb{E}\left[\left(X_{n+i} - f^i_j(X)\right)^2\right], \quad (3)$$

for some h1, h2 ∈ N with h2 > h1. Subsequently, it readily follows that (3) is minimized when

$$f^i_j(X) = \mathbb{E}\left[X_{n+i} \mid X\right], \quad i \in \{h_1, \ldots, h_2\}.$$

Notice that it is only possible to determine f^i_j(X) when the probability distribution of X_{n+i} is known, which is not the case. Alternatively, if measured values of x_{n+i} ∈ R are known, we may use them to construct an empirical distribution. In other words, measured data can be used to fit a model. However, if we consider the objective to find a model for forecasting, future values are not at our disposal. Thus, instead of using future values to obtain an empirical distribution, past (i.e., available) data must be used to find this function. Model selection for forecasting is concerned with finding models that approximate the probability distribution of the underlying process.

Model selection.
In the case of linear processes, specifically an AR(p) process considered in (1), model selection can be divided into two components: (1) order selection and (2) parameter estimation.
Order selection and parameter estimation can be performed simultaneously. For a set of candidate model orders M = {1, ..., m}, parameters are estimated for each model order j ∈ M. Subsequently, candidate models are assessed based on a metric that describes the goodness-of-fit of a model with respect to the data. Often, the log-likelihood function is used as a measure of goodness-of-fit, where a greater log-likelihood is associated with a better fit1. Usually, the maximum likelihood (ML) estimate or the least squares (LS) estimate may be used to approximate the log-likelihood9. At the same time, these methods serve as estimators of the parameters based on the data5.
In practice, when observations are limited, it is difficult to precisely capture the underlying structure of a process. For instance, when the ML estimator is used to estimate the parameters of a model, the measure of goodness-of-fit is biased and the comparison of different models is not fair. Essentially, the bias is caused by reusing the same data for estimation and for evaluation, resulting in a preference for complex models (i.e., with very high values of p), that is, over-fitting1. In other words, models with maximum log-likelihood fit the observed data very well, but tend to display poor predictive ability.
Typically, to address the problem of over-fitting, model selection is conducted using information criteria. Candidate models are assessed as a function of the maximum likelihood and an additional penalization (or regularization) term. Often, these information criteria penalize the number of parameters to prevent over-fitting. For instance, the Akaike information criterion (AIC)2 is described by

$$\mathrm{AIC}(j) = -2\,\ell(\theta_j) + 2j,$$

for an estimated model with parameters θ_j, with j ∈ M representing the number of model parameters. For small samples, the AIC still has a high probability of over-fitting7. Therefore, the corrected version of the AIC (AICc) is often considered in such cases. The AICc is formulated as

$$\mathrm{AICc}(j) = \mathrm{AIC}(j) + \frac{2j(j+1)}{n - j - 1},$$

where n ∈ N represents the number of observations. A third method for model selection is the Bayesian information criterion (BIC)3, described by

$$\mathrm{BIC}(j) = -2\,\ell(\theta_j) + j \ln n.$$
That said, while information criteria are commonly used to reduce over-fitting, their properties are only meaningful when the number of samples approaches infinity. For instance, the AIC is asymptotically efficient10, meaning that, as n → ∞, the AIC chooses a model f(X) → E[Y | X]. However, for n < ∞, the AIC offers no guarantees on the ability of a specified model to generalize outside of its sample relative to any other model.
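For reference, the three criteria above reduce to a few lines of code; the helper name below is ours, and the selected order is simply the minimizer over candidate orders.

```python
import numpy as np

def info_criteria(loglik, j, n):
    """AIC, AICc and BIC for a model with j parameters and n observations."""
    aic = -2.0 * loglik + 2.0 * j
    aicc = aic + (2.0 * j * (j + 1)) / (n - j - 1)  # small-sample correction
    bic = -2.0 * loglik + j * np.log(n)
    return aic, aicc, bic

# Example evaluation for one candidate order j:
aic, aicc, bic = info_criteria(loglik=-120.0, j=5, n=100)
```
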
Let us now introduce a different situation. Consider the process described in (1). Instead of trying to find the optimal linear predictor for the h-step ahead prediction, we want to find the optimal linear predictor for the h-step back prediction. The predictor can be formulated as8

$$\hat{X}^n_{1-h} = \left(\theta^{(h)}_n\right)^{\intercal} X_B,$$

where X_B = (X_1, X_2, ..., X_n)^⊺ ∈ Rⁿ is a reversed version of X, and θ^{(h)}_n ∈ Rⁿ for backcast horizon h ∈ N. We present a similar argument as in (3) to minimize the mean squared error of the backwards prediction:

$$\min_{f} \; \frac{1}{h_2 - h_1 + 1} \sum_{i=h_1}^{h_2} \mathbb{E}\left[\left(X_{1-i} - f^i_{j,B}(X_B)\right)^2\right].$$

Consequently, it follows (see S5 Supplementary Information for details) that this objective is minimized when

$$f^i_{j,B}(X_B) = \mathbb{E}\left[X_{1-i} \mid X_B\right], \quad i \in \{h_1, \ldots, h_2\}.$$

Again, we can approximate f^i_{j,B}(X_B) if we have an approximation of the distribution of X_{1−i}. However, in contrast to the forward prediction, we can now construct an empirical distribution on the basis of past data. Therefore, we are able to find f^i_{j,B}(X_B) for all i ∈ {h1, ..., h2}. In order to further leverage the argument presented in this section, we present an illustrative example of three separate validation schemes, of which the first is in-sample and the other two are out-of-sample. The illustration is given in Fig. 1. The three cases are explained as follows, according to an arbitrary set of observations x_k, where k ∈ [t_0, t]: (A) For in-sample model validation (i.e., regression), the same data is used for training as for validation. This method does not allow for out-of-sample validation because no observations x_k, for k > n, exist. It is advantageous in the sense that more data can be used for training, but it has a high risk of over-fitting. Therefore, it often does not generalize well, making it unsuitable for forecasting. (B) In the second case, we split the observations such that it is possible to perform an out-of-sample validation scheme using a forecast. This is possible because the set of observations {x_k : k > n} is non-empty.
However, the ensuing model is trained and validated to forecast from time instance n, and not t. Forecasting more than h2 steps ahead is generally bad practice (depending on h2), as the uncertainty of the forecast grows too large. Alternatively, one cannot simply assume that the model obtained using only the points x_k for k = 1, ..., n is applicable to the entire set of observations (i.e., x_k, k = t_0, ..., t). In fact, this would make it an in-sample model, and the same argument as in A holds. Therefore, we argue that this approach is not suitable for forecasting. (C) In the final case, we split the data opposite to B, with the validation set before the training set. Suppose that we temporarily reverse the time axis (i.e., mirror the plot from C). We can present a similar argument as in B, and perform out-of-sample validation on the backcast, because the set {x_k : k < 1} is non-empty. However, in contrast to B, we can now use properties of time-reversibility (to be discussed in the next section) to show that the set of parameters (i.e., the model) is entirely independent of the direction of time. Therefore, we can apply the mirrored model without suffering from a 'time gap' between n and t, resulting in an out-of-sample model that can forecast from the present time, t.
In the next section, we argue that, from a theoretical perspective, $f^i_{j,B}(X_B) = f^i_j(X)$.

Reversibility of time series.
A time series is time-reversible when the sequence of random variables {X_t, ..., X_{t+h}} has the same joint probability distribution as the sequence {X_{t+h}, ..., X_t}11. Standard ARMA models driven by Gaussian noise are time-reversible12. Therefore, ARMA models are independent of the direction in which time progresses. In the remainder of this paper, and without loss of generality, we focus on autoregressive (AR) processes. That said, let us consider a case where we want to compare the one-step ahead prediction with the one-step backward prediction making use of n measurements. The linear one-step ahead prediction is given by13

$$\hat{X}_{n+1} = \phi_n^{\intercal} X,$$

with φ_n ∈ Rⁿ. The linear prediction for the one-step backcast is described by

$$\hat{X}_0 = \theta_n^{\intercal} X_B,$$

with θ_n ∈ Rⁿ. In this case, the vector containing the states is X = (X_n, X_{n−1}, ..., X_1)^⊺. Here, X_B denotes the time-reversed version of X, formulated as X_B = JX, where J ∈ R^{n×n} is the anti-diagonal identity (exchange) matrix. Next, the Yule-Walker equations (or prediction equations) can be used to determine the parameters of both predictors as a function of the autocovariance function of X_t. The Yule-Walker equations are described as follows13:

$$\Gamma_n \phi_n = \gamma_n$$

for the forward prediction, and

$$\Gamma_n \theta_n = \gamma_n$$

for the backward prediction. Their corresponding variances are given by

$$\sigma_f^2 = \gamma(0) - \phi_n^{\intercal} \gamma_n$$

for the forward prediction, and

$$\sigma_b^2 = \gamma(0) - \theta_n^{\intercal} \gamma_n$$

for the backward prediction. In these equations, Γ_n ∈ R^{n×n} is the autocovariance matrix with entries [Γ_n]_{ij} = γ(i − j), and γ_n = (γ(1), ..., γ(n))^⊺ is the vector of autocovariances up to n lags. Both Γ_n and γ_n consist of values from the autocovariance function of {X_t}, which is symmetric. This means that γ_X(k) = γ_{X_r}(k), where X_r denotes the time-reversed process. Therefore, the function is independent of the direction of time in which the time series is ordered. Hence, we have that φ_n = θ_n.
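The identity φ_n = θ_n can also be checked empirically: fitting one linear predictor on past values and one on future values of a simulated Gaussian AR series yields, up to sampling error, the same coefficients. A minimal sketch, with our own variable names and an illustrative AR(1) with θ = 0.6:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
w = rng.normal(size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.6 * x[t - 1] + w[t]  # Gaussian AR(1)

p = 3
# Forward least squares: regress x_t on (x_{t-1}, ..., x_{t-p}).
Xf = np.column_stack([x[p - 1 - k:n - 1 - k] for k in range(p)])
phi_fwd = np.linalg.lstsq(Xf, x[p:], rcond=None)[0]
# Backward least squares: regress x_t on (x_{t+1}, ..., x_{t+p}).
Xb = np.column_stack([x[1 + k:n - p + 1 + k] for k in range(p)])
phi_bwd = np.linalg.lstsq(Xb, x[:n - p], rcond=None)[0]
# Both estimates are close to (0.6, 0, 0): the model ignores the
# direction of time, as the Yule-Walker argument predicts.
```
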
Example. To further illustrate the property of time-reversibility, consider an AR(1) process described by

$$X_t = \theta X_{t-1} + W_t.$$

If the Yule-Walker equations are applied for the forward prediction, we obtain

$$\phi_1 = \frac{\gamma(1)}{\gamma(0)}.$$

Alternatively, the autocovariance γ(k) is found by taking the expectation of the process with a shifted version of itself, which yields

$$\gamma(k) = \theta^{|k|}\, \gamma(0).$$

Now, using these results, and the fact that the forward and backward parameter vectors are the same, we can conclude that

$$\phi_1 = \theta_1 = \theta.$$

This result suggests that the model for the one-step ahead prediction is identical to the model for the one-step backward prediction.
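In code, the AR(1) example amounts to a two-line closed-form check (illustrative values, variable names ours):

```python
# Closed-form check of the AR(1) example.
theta, sigma2 = 0.7, 1.0
gamma0 = sigma2 / (1 - theta**2)   # stationary variance of the AR(1)
gamma1 = theta * gamma0            # gamma(k) = theta^{|k|} * gamma(0)
phi_forward = gamma1 / gamma0      # one-step Yule-Walker solution
phi_backward = gamma1 / gamma0     # identical system, since gamma(k) = gamma(-k)
```
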
In summary, the above derivation is readily applicable to AR(p) models of any order, and also to different prediction horizons. Thus, $f^i_{j,B}(X_B) = f^i_j(X)$. As such, we can leverage previous data to assess the quality of the backcast, which serves as a proxy for the forecasting capabilities. In the next section, we systematically leverage this insight to introduce a novel information criterion that builds upon these ideas.

Backwards validated information criterion
We propose a novel information criterion called the backwards validated information criterion (BVIC), not to be confused with the BIC (i.e., Bayesian information criterion). The BVIC is designed to estimate the order of an AR(p) process. Therefore, we consider a range of candidate autoregressive model orders up to a specified maximum order m ∈ N, where a single candidate model order is defined as j ∈ M, with M = {1, 2, ..., m}. The BVIC takes the normalized three-term form

$$\mathrm{BVIC}(j) = \frac{-\ell(\theta_j)}{-\ell(\theta_m)} + \beta \, \frac{\mathrm{err}(\theta^{\mathrm{YW}}_j)}{\mathrm{err}(\theta^{\mathrm{YW}}_m)} + \gamma \, \frac{\mathrm{var}(\theta^{\mathrm{YW}}_j)}{\mathrm{var}(\theta^{\mathrm{YW}}_m)},$$

where β, γ ≥ 0. The set of parameters for a specified model j ∈ M is described either by θ_j ∈ R^j, θ_m ∈ R^m, denoting the parameters obtained through ML estimation, or by θ^YW_j ∈ R^j, θ^YW_m ∈ R^m, denoting the parameters obtained through Yule-Walker estimation. The denominators of the BVIC, which depend on m, act as a normalization of the criterion; this ensures that the components are weighted more equally and makes the choice of β and γ more intuitive. Moreover, the criterion contains three functions that depend on θ_j: (i) the log-likelihood function, denoted by ℓ(·); (ii) the mean squared backcast error over the horizon interval [h1, h2], denoted by err(·); and (iii) the mean variance of the backcasted values, denoted by var(·); see S1 Supplementary Information for the details of the derivation of the BVIC. That said, we emphasize that the BVIC is to be used only as an index criterion. In other words, the BVIC gives a quantitative assessment of each model order (i.e., it performs model identification) and plays no role in parameter estimation. Parameters are estimated using the Yule-Walker equations with the order selected by the BVIC.
Consider a set of observations X = {x_t : t = 1, ..., n} ∈ Rⁿ, with number of samples n ∈ N, and backcasting horizon interval h1, h2 ∈ N. The h-step linear backwards prediction can be denoted as $\hat{X}^j_{1-h} = \theta_j^{\intercal} X_{1:j}$, with θ_j, X_{1:j} ∈ R^j. Consequently, each of the discussed components of the BVIC can be written as a function of the respective estimated parameters. In particular, the mean squared backcast error takes the form

$$\mathrm{err}(\theta^{\mathrm{YW}}_j) = \frac{1}{h_2 - h_1 + 1} \sum_{h=h_1}^{h_2} \left(x_{1-h} - \hat{X}^j_{1-h}\right)^2,$$

the log-likelihood ℓ(θ_j) is that of the data under the fitted AR(j) model, and var(θ^YW_j) is the theoretical backcast variance averaged over the same horizon interval.
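Putting the pieces together, a BVIC evaluation can be sketched as below. The exact normalization is derived in S1 Supplementary Information; here we only assume the normalized three-term trade-off described in the text, and all function names are ours.

```python
import numpy as np

def bvic_index(loglik, err_b, var_b, loglik_m, err_m, var_m, beta=1.0, gamma=1.0):
    """Assumed normalized form of the BVIC for candidate order j: each
    term is divided by its order-m counterpart, and the backcast error
    and backcast variance terms are weighted by beta and gamma."""
    return (-loglik / abs(loglik_m)
            + beta * err_b / err_m
            + gamma * var_b / var_m)

# Model selection picks the candidate order with the smallest index
# (illustrative numbers for two candidate orders):
scores = [bvic_index(ll, e, v, -150.0, 2.0, 1.5)
          for ll, e, v in [(-140.0, 1.0, 1.0), (-130.0, 1.8, 1.4)]]
best = int(np.argmin(scores))
```
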
Interestingly, the BVIC can be divided into four dimensions: (a) regression, (b) generalization, (c) uncertainty, and (d) forecasting. Each dimension illustrates a functionality of the criterion with regards to time series analysis. The four dimensions are depicted in four quadrants in Fig. 2. Each quadrant contains the functionality of the BVIC depending on the selection of the parameters β and γ. For instance, if β = γ = 0, the BVIC reduces to the ML estimate. When γ = 0 and β ≫ 1, the BVIC selects models with the smallest out-of-sample error (on the backcast), which intuitively corresponds to generalization. When both β, γ ≫ 1, the focus of the BVIC is to minimize the out-of-sample error along with the theoretical variance (uncertainty) of the backcast, thereby also minimizing these quantities for the forecast.
That said, recall that a forecast consists of both a point estimate and a variance estimate; both play a key role in forecasting, relating, simply speaking, to the accuracy and precision of the predictions. Thus, the different dimensions associated with the parameters of the BVIC give a principled approach to model selection, contrasting with the classical information criteria previously discussed (i.e., AIC, AICc, and BIC). Specifically, this is achieved by replacing the penalization of the order with a regularization term on the backward validation metric and by adding parameters that introduce adaptability into the model selection problem. In other words, the adaptability allows users of the BVIC to select their preferred forecasting goal.
Moreover, one must also select a forecast horizon interval h1, h2 when using the BVIC. In general, the selection of a look-ahead interval (not necessarily consecutive) depends on the application. For instance, to use the BVIC to validate a model, one can use the one-step ahead prediction. On the other hand, for applications where one wants to forecast further into the future, such as the prediction of epileptic seizures, one may consider a larger forecast horizon, or even a concentrated range of steps into the future.

Simulations
In what follows, we perform Monte Carlo simulations using three experiments with both synthetic and real data. Specifically, the first and second experiments consist of generating data according to specified autoregressive models and assessing the model order obtained through different information criteria, as well as the goodness-of-fit. Next, we assess the quality of the results when considering intracranial electroencephalographic (i.e., electrocorticographic, or ECoG for short) data from epileptic patients undergoing a seizure.

Data description.
Firstly, for the synthetic data, we generate an autoregressive process of order p ∈ N with parameters θ ∈ R^p as described in (1). Subsequently, similar to6, we add noise to the synthetic data. First, the realization is standardized such that the mean and variance are equal to zero and one, respectively. Second, the measurement noise is added as

$$Y_t = X_t + V_t,$$

where {V_t} is an i.i.d. zero-mean Gaussian sequence with variance δ, from which we obtain Y = {Y_t : t = 1, 2, ..., N} ∈ R^N, containing N ∈ N measurements. Note that N describes the entire generated sample size, whereas n denotes the effective sample size that may be used for estimation. Secondly, the real data consist of ECoG recordings14. We look at a range of channels from three different patients from two different locations. The first and second datasets were acquired in two separate patient studies at the Hospital of the University of Pennsylvania, Philadelphia, where the ECoG signals were recorded at a sampling frequency of 512 Hz. The third dataset was recorded at a frequency of 500 Hz, and is from a patient study at the Mayo Clinic in Rochester, Minnesota.
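The standardize-then-add-noise step for the synthetic data can be sketched as follows, assuming (as the reported SNR values in Experiment 1 suggest) that δ denotes the measurement-noise variance; the function name is ours.

```python
import numpy as np

def add_measurement_noise(x, delta, seed=None):
    """Standardize a realization to zero mean and unit variance, then add
    i.i.d. Gaussian measurement noise with variance delta (assumed model)."""
    rng = np.random.default_rng(seed)
    z = (x - x.mean()) / x.std()
    return z + rng.normal(0.0, np.sqrt(delta), size=z.shape)

# With unit signal variance, SNR = 10*log10(1/delta): delta = 0.1 gives 10 dB.
y = add_measurement_noise(np.sin(np.linspace(0, 20, 1000)), delta=0.1, seed=0)
```
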
Seizures are marked by clinical experts15, and the seizure-onset time and location are defined by the so-called earliest electrographic change (EEC) and the unequivocal electrographic onset (UEO)16, where we consider the period between the EEC and the UEO to be the pre-ictal phase, i.e., the phase between a normal (interictal) state and a seizing (ictal) state.
We extract univariate time series blocks from channels in which seizures were identified. Each block is associated with one of three states of the brain: (i) interictal, (ii) pre-ictal, or (iii) ictal. Subsequently, two pre-processing steps were performed on the data. First, the common reference was removed from all recorded data. Thereafter, each recording was filtered through a 60 Hz notch filter to remove line noise present in the recordings. Both steps were also performed in15, where the same database is used.

Experimental setup.
A specific realization of Y is referred to as a window (of data collected over a period of time). A single window is denoted by w q ∈ R N , with q = 1, 2, . . . , n w . In each experiment, we generate a collection of n w ∈ N windows.
Each window is split into three parts. Firstly, a training set T q := {w q,t : t = n 0 , . . . , n} . Secondly, a validation set V q := {w q,t : t = 1, . . . , n 0 − 1} . Thirdly, a test set T * q := {w q,t : t = n + 1, . . . , N} . See Fig. 3 for an illustration of the split. For the conventional criteria that use in-sample validation, the training and validation set are combined.
We set h 1 = 1 to test the BVIC over the entire horizon. This means that V q , T * q ∈ R h 2 . The size of the combined training and validation set corresponds to n.
The respective sizes of the training and validation sets depend on the true order of the autoregressive process (which is known) and the forecasting horizon h2 ∈ N. Specifically, we choose n = 2(p + h2) and S_{T*} = h2. As a result, when p = h2, we have a training and testing split of n = 4h2 (80%) and |T*_q| = h2 (20%). Subsequently, to ensure that the windows have sufficient samples, the window size is chosen to be N = S_T + 2S_{T*}. Finally, the windows are always normalized (z-scored) to facilitate a fair evaluation.
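Under these choices (h1 = 1, validation and test sets of size h2, and n = 2(p + h2)), the window split can be sketched as below; the helper name is ours.

```python
import numpy as np

def split_window(w, p, h2):
    """Split a window into validation (first h2 samples, used for the
    backcast), training, and test (last h2 samples) sets, following the
    sizes described in the text: n = 2*(p + h2) samples for training
    plus validation, and h2 samples for testing."""
    n = 2 * (p + h2)
    assert len(w) >= n + h2, "window too short for this split"
    validation = w[:h2]       # out-of-sample backcast validation
    training = w[h2:n]
    test = w[n:n + h2]        # out-of-sample forecast evaluation
    return training, validation, test

train, val, test = split_window(np.arange(13.0), p=2, h2=3)
```
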

Metrics.
The results from the Monte Carlo simulations are evaluated based on three metrics. The first two metrics are based on the ℓ2-loss of the forecast (i.e., the mean squared error over the forecast horizon), computed per window as

$$L_{q,m} = \frac{1}{h_2} \sum_{h=1}^{h_2} \left(x_{n+h} - \hat{X}^n_{n+h}\right)^2,$$

where q ∈ {1, 2, ..., n_w} is the index of the window and m ∈ {1, 2, ..., n_m} is the index of the Monte Carlo simulation. The metrics MSE and VAR are calculated by respectively taking the mean and variance over all windows and simulations:

$$\mathrm{MSE} = \frac{1}{n_m n_w} \sum_{m=1}^{n_m} \sum_{q=1}^{n_w} L_{q,m}, \qquad \mathrm{VAR} = \frac{1}{n_m n_w} \sum_{m=1}^{n_m} \sum_{q=1}^{n_w} \left(L_{q,m} - \mathrm{MSE}\right)^2.$$

Additionally, we include the prediction uncertainty of the forecast as an evaluation metric by taking the average theoretical variance over the forecast horizon; subsequently, its mean over all simulations is computed.

Experiments.

Experiment 1.
In this experiment, we test the ability of the models determined using the different information criteria to forecast different time series. We evaluate four different synthetic AR(5) models by conducting Monte Carlo simulations for the BVIC and the benchmark criteria. Since the autoregressive model is a discrete linear filter, the parameters of the model can be determined when the poles (or roots) of the system are known13. For the autoregressive process to be stationary, the poles of the system need to lie inside the unit circle. The location of the poles affects the frequency behaviour and exponential decay of the time-domain signal. For instance, a larger phase angle of a complex conjugate pole pair results in a higher frequency of the time-domain signal; one of the sets considered here, for example, results in a frequency of ω_0 = 1.46 rad/s. Essentially, the dominant pole(s) of the system (i.e., the poles that lie nearest to the unit circle) determine the majority of this behaviour. Therefore, we define four sets of dominant poles that display different response behaviour to assess the different information criteria.
A large pole magnitude leads to slow exponential decay, while a large phase angle induces high-frequency sinusoidal behaviour. Along with the benchmark criteria, we further consider the BVIC with three different sets of parameters to evaluate the penalization effect each term has on the forecasting performance. The sets are as follows: (i) (β, γ) = (1, 1), (ii) (β, γ) = (5, 1), and (iii) (β, γ) = (1, 5). Finally, we conduct the Monte Carlo simulations for three different values of the noise parameter, δ. Specifically, we chose δ ∈ {0, 0.1, 0.316}, which corresponds to signal-to-noise ratios of SNR = ∞ dB (no noise), 10 dB, and 5 dB, respectively. In Table 1, we summarize the results from Experiment 1.
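The MSE and VAR metrics used throughout these experiments reduce to simple aggregations over windows and Monte Carlo runs; a sketch with illustrative numbers (names ours):

```python
import numpy as np

def forecast_loss(x_true, x_pred):
    """Per-window l2-loss: mean squared error over the forecast horizon."""
    return np.mean((np.asarray(x_true) - np.asarray(x_pred)) ** 2)

# Aggregate the per-window losses L_{q,m} over all windows and runs.
losses = np.array([[0.8, 1.2],   # simulation 1, windows 1..2 (illustrative)
                   [1.0, 1.0]])  # simulation 2, windows 1..2
MSE = losses.mean()  # mean over all windows and simulations
VAR = losses.var()   # spread of the loss across windows and simulations
```
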

Experiment 2.
The objective is to assess the performance of the BVIC on a range of synthetic autoregressive time series of order p ∈ {10, 20, 30, 40, 50}. We conduct a Monte Carlo study in which we generate 100 windows with characteristics similar to those set in Experiment 1. Each window is generated by a set of randomly generated poles. For window w_q, with q = 1, ..., n_w, complex conjugate poles are generated by randomizing a phase angle Φ_q and a magnitude M_q, where 0 ≤ Φ_q ≤ π and 0.5 ≤ M_q < 1. We define a set of complex conjugate poles with real and imaginary parts described by α_q = M_q cos Φ_q and β_q = M_q sin Φ_q, respectively. To include the possibility of having real-valued poles, there is a 50% chance that Φ_q is either 0 or π. For a detailed description of how the poles are generated, see Algorithm 1.
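The randomized pole scheme can be sketched as follows. Our version expands the drawn poles into AR coefficients via the characteristic polynomial; the function name is ours, and Algorithm 1 in the paper remains the authoritative description.

```python
import numpy as np

def random_ar_from_poles(p, seed=None):
    """Draw p/2 complex-conjugate pole pairs with magnitude in [0.5, 1)
    and phase in [0, pi] (with a 50% chance of a real pole, phase 0 or pi),
    then expand to AR parameters theta via the characteristic polynomial
    z^p - theta_1 z^{p-1} - ... - theta_p."""
    rng = np.random.default_rng(seed)
    poles = []
    for _ in range(p // 2):
        M = rng.uniform(0.5, 1.0)
        phi = rng.choice([0.0, np.pi]) if rng.random() < 0.5 else rng.uniform(0.0, np.pi)
        poles += [M * np.exp(1j * phi), M * np.exp(-1j * phi)]
    coeffs = np.poly(poles)          # monic polynomial with the drawn roots
    return -np.real(coeffs[1:])      # theta_i = -coefficient of z^{p-i}

theta = random_ar_from_poles(10, seed=3)
```

Since all pole magnitudes are strictly below one, every generated AR process is stationary by construction.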
That said, regarding the remaining input parameters, we look at two different sets of hyperparameters for the BVIC: (i) (β, γ) = (1, 1), and (ii) (β, γ) = (5, 1). Simply speaking, the first puts equal weight on the uncertainty and generalization terms, whereas the second puts additional weight on generalization. Assuming that autoregressive models of higher orders are also capable of forecasting over longer horizons, we initially evaluate the performance of each of the criteria over a forecast horizon h2 = p, i.e., h2 ∈ {10, 20, 30, 40, 50} for each of the previously mentioned orders p, respectively. Additionally, to analyse the effects of the forecast horizon h2, we conducted experiments where instead h2 = ⌈p/4⌉, i.e., h2 ∈ {3, 5, 8, 10, 13}. Furthermore, to prevent a possible lack of observations for training of the BVIC, we increased the sample size, thereby decreasing the prediction error. The number of observations used to train models is given by S_T = 4.5(p + h2). Thus, for p = h2, the size of the training set becomes S_T = 9h2. A further split of S_T into training and validation for the BVIC gives training, validation, and test ratios of 0.8, 0.1, and 0.1, respectively. The results of Experiment 2 are summarized in Table 2.
Moreover, to give a clearer image of which orders are being selected by the criteria, a graphical representation of the distributions is shown in Fig. 5. Here, we have plotted the histograms of all orders that were selected across all simulations by the BVIC, with two different sets of hyperparameters, and by the AICc.

Experiment 3.
Hereafter, we aim to test the predictability of ECoG data during epileptic events. As such, we analyse the ability of the previously discussed criteria to forecast sections of data corresponding to seizures and non-seizures. Similarly to the previous experiments, sections are extracted from the time series and subsequently split up into windows. These windows have a total of N = 1000 samples, such that we can effectively use 800 (80%) samples for training, 100 (10%) for validation, and 100 (10%) samples for testing. The choice of N comes from a sensitivity analysis that we present in detail in S4 Supplementary Information. Next, we performed statistical tests to assess the stationarity of the ECoG recordings used in this experiment. We found that a sample size of N = 1000 results in sufficient evidence for stationarity in the majority of the considered windows.
Differences in scaling are found in recordings from different patients and between the ictal and interictal phases of a single patient. Therefore, to facilitate a fair comparison among the different data, we use the mean absolute scaled error (MASE)17 as a metric to compare the results from this experiment. The advantage of using the MASE over metrics such as the MSE is that the former is scale-independent. The mean absolute scaled error is formulated as

$$\mathrm{MASE} = \frac{\dfrac{1}{h_2}\sum_{h=1}^{h_2} \left|x_{n+h} - \hat{X}^n_{n+h}\right|}{\dfrac{1}{n-1}\sum_{t=2}^{n} \left|x_t - x_{t-1}\right|}.$$

Simply speaking, the MASE is constructed by dividing the mean absolute error (MAE) of the forecast by the average error of the naïve forecast computed in-sample. Thus, for a single forecast horizon, a MASE of less than one implies that the forecast is better than the average in-sample naïve forecast. Furthermore, we establish that the performance of the BVIC in comparison to other information criteria is similar, without any statistically significant difference in the majority of the simulations (see details in S3 Supplementary Information). Therefore, for this experiment we consider only the BVIC with β = 1 and γ = 1 to assess its ability to forecast electrocorticography data.
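The MASE computation can be sketched in a few lines (function name ours):

```python
import numpy as np

def mase(x_true, x_pred, x_train):
    """Mean absolute scaled error: forecast MAE divided by the in-sample
    MAE of the one-step naive (random-walk) forecast."""
    mae = np.mean(np.abs(np.asarray(x_true) - np.asarray(x_pred)))
    naive_mae = np.mean(np.abs(np.diff(np.asarray(x_train))))
    return mae / naive_mae

# A value below 1 means the forecast beats the in-sample naive forecast.
score = mase([4.0, 5.0], [4.5, 4.5], [0.0, 1.0, 2.0, 3.0])
```
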
The results of the experiment are plotted in Fig. 6. Here, we plot the average MASE over the channels in which a seizure was identified, for single forecast horizons ranging from 1 to 100 steps into the future. Figure 6a,c,e contain a sample forecast in red, while Fig. 6b,d,f depict in blue the average MASE with the variance among different channels. It is worth noticing that the red shading in Fig. 6a,c,e indicates a prediction interval, i.e., the estimated interval in which the forecasted observation lies with 95% certainty, whereas the blue shading in Fig. 6b,d,f simply shows the interval in which 95% of the computed values are contained (i.e., ±1.96σ).

Discussion
We introduced a principled analysis of an information criterion that utilizes theoretical principles of time-reversibility in time series to assemble a finite-sample, data-driven approach to model selection. The proposed criterion eliminates the static penalization of the model order and replaces it with a backward validation scheme that can be tuned to trade off between uncertainty, regression, generalization, and forecasting.
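A minimal sketch of the backward-validation idea follows. It is not the full BVIC (which also weighs uncertainty and generalization terms); `fit_ar`, `backward_mse`, and `select_order` are hypothetical helpers that score each candidate order by how well the model fitted on the training block predicts a held-out block backwards in time:

```python
import numpy as np

def fit_ar(x, p):
    """Least-squares AR(p) fit: x[t] ≈ a[0]*x[t-1] + ... + a[p-1]*x[t-p]."""
    n = len(x)
    X = np.column_stack([x[p - 1 - k: n - 1 - k] for k in range(p)])
    a, *_ = np.linalg.lstsq(X, x[p:], rcond=None)
    return a

def backward_mse(x, a):
    """One-step prediction MSE on the time-reversed series: for a stationary
    AR process the reversed series has the same second-order description,
    so the forward coefficients are reused on x[::-1]."""
    xr, p, n = x[::-1], len(a), len(x)
    preds = np.column_stack([xr[p - 1 - k: n - 1 - k] for k in range(p)]) @ a
    return float(np.mean((xr[p:] - preds) ** 2))

def select_order(x_train, x_val, p_max=10):
    """Pick the order whose training fit best backcasts the validation block."""
    scores = {p: backward_mse(x_val, fit_ar(x_train, p))
              for p in range(1, p_max + 1)}
    return min(scores, key=scores.get)
```

In the full criterion this backcast score enters a weighted trade-off rather than being minimized on its own.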
Information criteria performance. Experiment 1 explores pedagogical examples to capture the behavior of the different information criteria when different pole locations and signal-to-noise ratios of the time series are considered. It is possible to notice that these impact the performance of the BVIC relative to the other information criteria. For instance, from Case 1 we notice that for systems with a large real-valued pole (z_i = 0.9), the BVIC predicts with larger error compared to the other criteria. On the other hand, when all poles have small absolute values (|z| < 0.6), such as in Case 4, we observe a relative decrease in the prediction error of the BVIC, especially when the signal-to-noise ratio is larger than zero. Moreover, we find that the BVIC selects larger orders than the other criteria, on average. Here, we noticed that the BVIC shows a certain consistency over all the simulations. For β = 1, the BVIC finds approximately the true order of 5, whereas for β = 5 the average order selected is roughly 3.6. On the contrary, the AIC, BIC, and AICc all have much larger variance in their average selected orders. Thus, in contrast with the three other criteria, the BVIC is more consistent in selecting the order, independently of the location of the poles and the variance of the measurement noise.
In Experiment 2, we provide converging evidence that the different information criteria perform in a similar fashion to the BVIC. Nonetheless, it is important to emphasize that the BVIC provides a principled method that relies on finite samples and offers a trade-off between uncertainty, regression, generalization, and forecasting, in contrast with other information criteria where the penalization term is fixed to satisfy asymptotic properties10. Specifically, based on statistical tests, namely the one-way analysis of variance and the Kruskal-Wallis test, we found that none of the obtained distributions were significantly different between the criteria at a significance level of 0.05.
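The statistical comparison can be reproduced along these lines using SciPy; the per-criterion forecast-error samples below are synthetic placeholders, not our simulation results:

```python
import numpy as np
from scipy import stats

# Four groups of forecast errors, one per criterion (placeholder data).
rng = np.random.default_rng(1)
errors = {c: rng.normal(1.0, 0.2, size=200)
          for c in ["BVIC", "AIC", "BIC", "AICc"]}

f_stat, p_anova = stats.f_oneway(*errors.values())  # one-way ANOVA
h_stat, p_kw = stats.kruskal(*errors.values())      # Kruskal-Wallis
different = (p_anova < 0.05) and (p_kw < 0.05)      # reject at the 5% level?
```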

Model order selection in autoregressive models with the BVIC.
It is interesting to notice how the BVIC is able to capture a different range of orders across the different synthetically generated data (see Fig. 5). Remarkably, we also notice that the BVIC selects orders that are, on average, closer to the true order of the synthetic process (see Tables 1 and 2), providing evidence that the BVIC may be a preferable method when it comes to estimating the true order of an autoregressive model. Lastly, the BVIC is also able to adapt to a changing forecast horizon, where a shorter forecast horizon means that the BVIC selects, on average, lower orders. On the other hand, the selection of the orders by the remaining information criteria is not influenced by the forecast horizon.
Given that the BVIC selects model orders that are, on average, closer to the true order of the model, in Experiment 3 we tested the ability of the BVIC to assess the memory order (i.e., the statistically significant dependency on previous realizations of the time series) in the context of seizure prediction6.
Furthermore, there is long-reported evidence that the memory order increases during the ictal state18,19. Implicitly, an increase in memory would also indicate an increase in the number of steps for which we can forecast ahead. In Experiment 3, we collected converging evidence towards the latter points, as the predictability increases during the pre-ictal and ictal states compared with the interictal state.
Nonetheless, it is worth reporting that this is not always the case, as can be seen in Fig. 6f, where there is no significant decrease in error between the different states. The reason for this irregularity would require further study; however, there are a few possible causes that may contribute to this outcome. First of all, we provided evidence for the stationarity of the three recordings that were used in the experiment (see S4 Supplementary Information). Nevertheless, the recording from patient study 016 showed the weakest evidence for stationarity. Thus, a certain amount of non-stationarity in the data could account for the differing results seen in Fig. 6f. Secondly, following up on the previous argument, we must consider that it might not be possible to predict seizures with a single framework, due to the variety of mechanisms that underlie an epileptic seizure. Finally, considering the fact that the average MASE obtained for study 016 is similar for all states, it could be that the electrodes that were identified as seizure electrodes were not actually placed in the location of the seizure, or that the entire recording was incorrectly classified as a seizure20.

Extensions and future work. Whereas we have focused on univariate autoregressive models, it would be interesting to extend the BVIC to a multivariate setting. This may prove beneficial when the underlying dynamics captured by a multivariate time series have spatial dependencies. For instance, this is the case for the ECoG recordings explored in Experiment 3, where there is evidence that a seizure propagates through the brain and reveals itself across different channels over time. Additionally, it would be worthwhile to expand the proposed information criterion to handle both moving-average and fractionally integrated models, which are known to be able to handle long-term memory13. Furthermore, the BVIC is a data-driven method and can therefore be used in different contexts.
For instance, it would be interesting to find an equivalent of the BVIC that determines the minimum number of parameters for deep learning models that are capable of guaranteeing forecasting capabilities. Lastly, it would be interesting to see whether a recursive method for online implementation is possible, such that the criterion becomes dynamic in nature.

Data availability
All the data and algorithms used to produce the plots in the manuscript are available at https://github.com/maxsibeijn/BVIC_MatlabFiles. All patient data was obtained from the IEEG Portal (IEEG.ORG). The IEEG Portal is a collaborative initiative between the National Institute of Neurological Disorders and Stroke, the Mayo Clinic, and the Hospital of the University of Pennsylvania. All data that was used is publicly available. These organizations follow strict guidelines and regulations for their methods; their experimental protocols are approved, and informed consent is required from all subjects.