Forecasting high-dimensional dynamics exploiting suboptimal embeddings

Delay embedding—a method for reconstructing dynamical systems by delay coordinates—is widely used to forecast nonlinear time series as a model-free approach. When multivariate time series are observed, several existing frameworks can be applied to yield a single forecast combining multiple forecasts derived from various embeddings. However, the performance of these frameworks is not always satisfactory because they randomly select embeddings or use brute force and do not consider the diversity of the embeddings to combine. Herein, we develop a forecasting framework that overcomes these existing problems. The framework exploits various “suboptimal embeddings” obtained by minimizing the in-sample error via combinatorial optimization. The framework achieves the best results among existing frameworks for sample toy datasets and a real-world flood dataset. We show that the framework is applicable to a wide range of data lengths and dimensions. Therefore, the framework can be applied to various fields such as neuroscience, ecology, finance, fluid dynamics, weather, and disaster prevention.


Introduction
Forecasting a future system state is an important task in various fields such as neuroscience, ecology, finance, fluid dynamics, weather, and disaster prevention.If accurate system equations are unknown but a time series is observed, we can use modelfree forecasting approaches.Although there are various model-free methods, if the time series is assumed to be nonlinear and deterministic, one of the most common approaches is delay embedding.According to embedding theorems 1,2 , we can reconstruct the underlying dynamics by delay embedding (see Methods for details).These theorems ensure that the map from the original attractor to the reconstructed attractor has a one-to-one correspondence.These embedding theorems have also been extended to multivariate data and nonuniform embeddings 3 .
When multivariate time series are observed, various embeddings can be obtained on the basis of the extended embedding theorem 3 .Ye and Sugihara 4 exploited this property; they developed multiview embedding (MVE), which combines multiple forecasts based on the top-performing multiple embeddings scored by the in-sample error.MVE can yield a more accurate forecast than the single best embedding, especially for short time series 4 .Although MVE works well with low-dimensional data, it cannot be simply applied to high-dimensional data.This is because MVE requires forecasts derived from all possible embeddings, whose total number combinatorially increases with the number of variables.Although we can approximate MVE by randomly chosen embeddings if the total number of possible embeddings is beyond the computable one 5 , it is difficult to find high-performance embeddings for high-dimensional data.Recently, Ma et al. 6 presented an outstanding framework, randomly distributed embedding (RDE), to tackle these high-dimensional data.Their key idea is to combine forecasts yielded by randomly generated "nondelay embeddings" of target variables.These "nondelay embeddings" can also reconstruct the original state space for some cases and drastically reduce the possible number of embeddings.Ma et al. 6 also showed that small embedding dimensions worked fine, even for high-dimensional dynamics, and successfully forecasted short-term high-dimensional data using the RDE framework.Although RDE showed outstanding results, there is room for improvement for some specific tasks.For example, "nondelay embedding" may overlook important pieces of information if delayed observations carry such pieces of information, e.g., the upstream river height for flood forecasting.A random distribution may also be a problem for the case where only partial variables are valid system variables; in this case, most of the random embeddings are invalid and not suitable to forecast.In addition, several important hyperparameters-the embedding dimension and the number of embeddings to combine-must be manually or empirically selected.Although ensembles tend to yield better results with significant diversity among its members 7,8 , RDE only aggregates forecasts for a fixed embedding dimension with a fixed number of embeddings to combine, regardless of their forecast performance.
In this study, we propose another forecasting framework that overcomes the disadvantages of MVE and RDE.Our key idea is to prepare diverse "suboptimal embeddings" via combinatorial optimization.We combine the optimal number of these embeddings to maximize the performance of the combined forecast.In contrast to the existing frameworks, the suitable embedding dimensions and the number of embeddings to combine are automatically determined through this procedure.Our proposed framework achieves better forecast performance than the existing frameworks for high-dimensional toy models and a real-world flood dataset.

Proposed forecasting framework
Our proposed framework yields a single forecast to combine multiple ones, which are obtained by multiple embeddings.These embeddings are obtained through two steps.First, we obtain suboptimal embeddings, which are suboptimal solutions to minimize the forecast error.The aim of this procedure is to yield diverse embeddings to improve the performance of the combined forecast because such diversity is important for the performance of the ensemble 7,8 .Second, we make the in-sample forecasts based on all suboptimal embeddings following MVE and RDE but pick the optimal number of embeddings to minimize the error of the combined forecasts.A schematic of the procedure is illustrated in Figure 1.We describe each step in detail in the following sub-subsections.We prepare a pool of suboptimal embeddings in the first step.We solve K combinatorial optimization problems to obtain various embeddings in this step.Next, we pick kp embeddings to minimize the error of the combined forecast in the second step.We combine the forecasts obtained by the kp embeddings to test the time series.

First step: Preparing suboptimal embeddings
In our framework, we first obtain diverse suboptimal embeddings to minimize the in-sample forecast error.With the observed time series y(t) ∈ R n , a set of possible embeddings E , and a corresponding delay vector v e (t) ∈ R E with e ∈ E , we can obtain the p-steps-ahead forecast at time t by a map ψ e as follows: We can obtain such maps by suitable fitting algorithms, e.g., the conventional method of analogues 9 , as in MVE, and Gaussian process regression 10 , as in RDE.Throughout this paper, we apply the same algorithm, a variation of the method of analogues (see Supplementary Information), to all forecasting frameworks including the proposed one.This is because we need to compare the performance of the frameworks on a fair basis that is not affected by the performance of each fitting algorithm.
To obtain high-performance embeddings in an efficient manner, we solve combinatorial optimization problems instead of random selection or using brute force.We split the set of training time indices T train = {t | t < 0} into K datasets.Then, we solve the following optimization problem for each K ∈ {1, 2, ..., K} to minimize the error: where T K train is the K th set of training time indices and • is an appropriate norm.We use the L 2 norm throughout this paper.Note that ŷe f is computed in an in-sample manner; we use all of T train , except for the current time, as its library.When we consider embeddings up to l lags to embed the latest observation of the target variable, the number of possible embeddings is 2 nl−1 .If nl is sufficiently large, it is almost impossible to compute all possible embeddings because of combinatorial explosion.
Here, we take a straightforward approach to minimize the forecast error: an evolution strategy that interprets an embedding as a one-dimensional binary series 11,12 .See Methods for details.Throughout this paper, we applied the (µ + λ ) evolution strategy 13 .
In the process of each optimization, we store not only the best embedding but also the "Hall of Fame," which preserves the best individuals in each generation.From the Hall of Fame, we select the best M solutions that satisfy the condition that the Hamming distance d(•, •) is larger than or equal to a certain threshold θ as follows: where e i is the ith embedding of the sorted Hall of Fame.Namely, we sort the Hall of Fame by the fitness (the objective function of equation 2) and pick the best embeddings in order from the sorted Hall of Fame to satisfy equation (3).This condition prunes similar embeddings and keeps diverse embeddings for the next step.To apply this procedure, we obtain K • M suboptimal embeddings in total.See also the first part of Algorithm 1 in Methods.
The procedure of the K times optimizations yields diverse but high-performance embeddings.If a dataset contains a large number of variables, it is extremely difficult to obtain the exact solution by evolution strategies or any other metaheuristic algorithms.These algorithms find suboptimal solutions for most cases, and these solutions can be easily changed by slight changes in the optimization conditions.The procedure rather exploits this property; the K times optimization can find K sets of diverse suboptimal solutions.Instead of minimizing the whole in-sample error with K different conditions, we minimize each K split dataset and utilize the history of solutions to significantly reduce the computational time.We numerically show the effect of K times optimization in Supplementary Information.
To create diverse embeddings and improve the performance of the combined forecast, we can optionally consider L finite impulse response (FIR) filters for each suboptimal embedding; namely, K • L • M embeddings are obtained in total.These filtered embeddings are justified by the Filtered Delay Embedding Prevalence Theorem 2 , and a study has shown that the combination of various forecasts via linear filters improves the forecast result in some cases 14 .Although we can compute suboptimal embeddings for each filter, we simply consider L filters based on the obtained K • M embeddings.Empirically, the optimization of the embeddings for each filter does not realize a significant improvement in performance to justify the additional computational cost.See Methods for the details of forecasting filtered time series.In this paper, we set the filter coefficients as h i (0) = 1, h i (1) = ρ ∀i for all numerical examples.

Second step: Combining forecasts to minimize the in-sample error
Next, we pick the optimal number of embeddings from the K • L • M embeddings to minimize the error of the combined forecast.We compute in-sample forecasts by the embeddings for all of T train , as done in the MVE or RDE aggregation schemes.We determine the optimal number of embeddings to integrate forecasts, which is neither the square root of the possible number of embeddings (as MVE does) nor the embedding dimension (as the RDE aggregation scheme does).Here, we define a tuple of the forecast indices sorted by the in-sample error for step p as I p , and the ith best embedding is written as I p (i).Note that I p is separately computed for each forecast horizon because accurate embeddings may differ according to the forecast horizons, as discussed later.We obtain kp to minimize the combined forecast as follows: We can obtain kp in a brute-force manner with a very small computational cost because the kth combined forecast Y k (t + p|t) can be recursively computed as follows: where We compute Y k and the corresponding error for all k and pick the optimal value to minimize the squared error.Note that kp can be different for each p; empirically, kp tends to be larger for a longer step.
For unobserved samples t ≥ 0, the forecast ŷ f (t + p|t) is computed as follows: Note that only kp forecasts need to be computed rather than K • L • M forecasts.We summarize the whole algorithm in Algorithm 1 in Methods.

Toy models
We demonstrate our framework with toy models.We first tested with the 10-dimensional Lorenz'96I model 15 .The model is described as 10-dimensional differential equations, and we used five of their system variables and five random walks as a dataset-namely, Lorenz'96I's x 0 , x 1 , x 2 , x 3 , x 4 and random walks x 5 , x 6 , x 7 , x 8 , x 9 .We forecasted x 0 up to 10 steps.We compared the proposed framework with existing frameworks-namely, randomly distributed embedding with an aggregation scheme (RDE), multiview embedding (MVE), a variation of MVE (state-dependent weighting 5 (SDW)), and the single best embedding via the (µ + λ )-ES algorithm.See Supplementary Information for the detailed conditions.We first tested the performance with the data length of 4000 for the database and 500 for the evaluation.Figure 2(a) shows that the proposed forecast achieved the best performance among the existing frameworks up to 10 steps ahead with the length of 4000.
We conducted a sensitivity analysis with the data length.We set the data length, i.e., the size of the training dataset, to {100, 200, 500, 1000, 2000, 4000} and tested the performance with 500 samples for all cases.Figure 2(b) shows that the proposed five-steps-ahead forecast yielded the minimum error among the existing frameworks for all the data lengths.
We also checked the robustness against observational noise.We added Gaussian observational noise scaled by the standard deviation of the target variable and tested the performance with a scale range of {0.0, 0.1, 0.2, 0.3}.We set the training data length as 500 and evaluated with 500 samples.Figure 2(c) shows that a lower noise level results in better performance for the proposed framework compared with the other frameworks.Although the proposed framework achieved the best performance up to 10% noise, the performance was degraded by high-level noise and is the second best for 30% noise.
We profiled which variables were embedded for the combined forecasts for the case where the data length is 4000 without noise.We summed kp embeddings for each variable, where 1 means embedded and 0 means not embedded.Then, we computed the proportion of embedding of each variable, normalizing the sum to one for each p. Figure 3(a) shows that random walks x 5 , x 6 , x 7 , x 8 , and x 9 were seldom embedded with the proposed framework.This result suggests that our proposed framework can select the appropriate variables.The results also show that the proportion of embedding of the target variable x 0 is large for short-term prediction and small for longer-term prediction.This result is consistent with that of an existing study 16 .
We also profiled the filter coefficient ρ and the embedding dimension E of the proposed combined forecasts.We averaged ρ and E over the kp combined members for each step.Figure 3(b) shows that smaller dimensions are selected for shorter steps, and larger dimensions are selected for longer predictions.The figure also shows that smaller and larger values of ρ are selected for shorter and longer steps, respectively.Because we employ high-pass filtering (ρ ≤ 0) with this model, a smaller ρ focuses on a smaller time period and vice versa.We investigated the number of combined forecasts and the forecast errors to check whether the proposed integration scheme works properly.Although the proposed framework determines kp on the basis of the in-sample error, kp also minimizes the test error in this example (see Figure 3(c)).
We also demonstrate our framework with the Kuramoto-Sivashinsky equations 17,18 .We generated 20-dimensional time series: the values of the first 10 grids of the Kuramoto-Sivashinsky equations x 0 , x 1 , ..., x 9 and 10 random walks x 10 , x 11 , ..., x 19 .We forecasted x 0 up to 10 steps using the same parameter values, data length, and noise scales as used in the Lorenz'96I example.See Supplementary Information for the detailed conditions.
Despite the increases in the variables, Figures 4(a) and (b) show that the proposed forecast yielded the minimum error among the existing frameworks, as in the Lorenz'96I example.Regarding the noise sensitivity, a lower noise level results in better performance for the proposed framework compared to the other frameworks.The proposed framework achieved the best performance for a wide data range and relatively small observational noise.
Finally, we tested the sensitivity to the number of variables.We changed the dimensions of Lorenz'96I to {10, 20, 40, 80, 160, 320}, and we substituted a half of each dataset with random walks.We set the training data length to 500 and evaluated with 500 samples.For all ranges of variables, the proposed framework achieved the best performance (Figure 5(a)).We also tested the case where all variables are available and the datasets do not contain random walks; this is the case in which delay embedding is not needed because all system variables are observed.In this case, the performance difference is negligible for all frameworks (except the single best embedding), and RDE achieved the best score for the case where the number of variables is 320.Although this condition is too ideal because the observation function is an identity map, we can expect RDE to perform well for such cases that include many informative variables.

Flood dataset
We tested the proposed framework with real-world data, namely the flood forecasting competition dataset at "Artificial Neural Network Experiment (ANNEX 2005/2006)" 19 .This dataset contains nine variables: the river stages of the target site and three upstream sites and five rain gauges for three periods.We forecasted 6, 12, 18, and 24 h ahead of the target river stage using all nine variables.See Supplementary Information for the detailed conditions.We also compared our framework with four other conventional machine-learning methods used in an existing study 5 : a recurrent neural network with long short-term memory (LSTM) 20 , support vector regression (SVR) 21 , and random forest regression 22 .
The results in Table 1 show that the proposed framework yielded the best root-mean-square error (RMSE) through 12-24 h ahead, and the differences compared to the other methods were expanded over the forecast horizon.Although the 6 h (one step)-ahead forecast was not the best, the difference compared to the best score was negligible.The dataset contains only 1460 points, and long-term memory is not required for this task; hence, LSTM did not perform well.Although the test data include the largest river stage on 1995-02-01, Figure 6(a) shows that the proposed framework properly forecasted the inexperienced river stage.
We compared the combined forecast with its individual members obtained with the (µ + λ )-ES algorithm.Regarding the individual members, the order of the in-sample score was different from the order of the test (see Figure 6(b)).In contrast, the combined forecast achieved the minimum error by far for the both the in-sample and test errors.This forecast stability is one of the largest advantages of combining forecasts.We also profiled ρ and E averaged over the kp combined members for each step.The results in Figure 6(c) suggest the same trends as the toy models, even for the real-world data.Specifically, smaller dimensions and smaller values of ρ were selected for shorter steps.

Discussion
Our numerical experiments suggest that our proposed framework works well for a wide range of the data length, including short time series such as 100.In contrast, the framework did not appear to work fine with very noisy data, and MVE achieved the best performance, as shown in the noise sensitivity analyses (Figures 2(c) and 4(c)).This suggests that noise prevents the framework from selecting optimal embeddings on the basis of the squared errors, and it results in obtaining invalid embeddings.MVE can still be a good option for very noisy data.Note that the performance of RDE was not satisfactory in some of our numerical experiments, but this is because the RDE framework was originally proposed for extremely high-dimensional data, e.g., hundreds of variables.As shown in Figure 5(b), we can expect RDE to perform well for cases that include many informative variables, and delay embedding is not needed.
As shown in the numerical experiments, the proposed framework worked well for high-dimensional dynamics and datasets containing high-dimensional variables.On the other hand, it is not suitable to apply it to relatively low-dimensional data.We tested the framework with low-dimensional datasets, and the results suggest that the proposed framework does not always yield the best performance, especially for low-dimensional datasets such as the Rössler equations 23 (see Supplementary Information).

6/12
(a) This is because it is easy to find suitable embeddings to forecast in a brute-force manner, especially for low-dimensional datasets, while it becomes combinatorially difficult as the number of variables increases.In addition, the aim of K times optimization is to yield diverse embeddings (see the effect of K times optimization in Supplementary Information).However, if a dataset contains a small number of variables, almost the exact solutions are obtained for every optimization.
The proposed method combines forecasts via the simple average, optimizing the number of forecasts.The simple average works fine for most cases because suboptimal embeddings can equally forecast well thanks to solving the combinatorial optimization problem, and the test performance order is not always the same as the corresponding training data, especially for real-world data (see Figure 6(b)).If suboptimal embeddings show quite different forecast performances, the weighted average based on the error distribution 6 may work better.We can also apply the expert advice framework 24 if better forecasts are assumed to be changed over time.
We need to select the values of several parameters to apply the proposed framework, i.e., the number of split datasets K, the number of the suboptimal embeddings for each batch M, the criteria of the Hamming distance θ , the number of filters L, and their components.We can control the variations in forecasts with these parameters.In order to set K, we consider the number of samples in each divided dataset; empirically, the framework works well with more than 50 samples for each divided dataset.Samples that are too small prevent the framework from finding good embeddings, but samples that are too large reduce the diversity of the embeddings.Although M can be any large integer (up to the total number of individuals of the evolution strategy), M ≤ 5 is sufficient for most cases.Note that the value of M that is too large is overkill and unnecessarily increases the computational cost.The criteria for the Hamming distance θ directly controls the similarity of suboptimal embeddings.We set θ = 3 in this study, and it worked fine for all cases.Note that conventional MVE and RDE guarantee that the Hamming distance among all embeddings is larger than two (i.e., θ = 2) for a fixed embedding dimension.In practice, appropriate filters depend on the prediction steps.If we predict shorter time periods, high-pass filters will work fine; if we predict longer time periods, low frequencies are needed to obtain fine forecasts, as the numerical experiments suggest.Although we treat high-pass filters in this study, low-pass filters may work for extremely noisy data.We need to understand the effects of these parameters and set the appropriate values within reason, P = KLM, which is usually much smaller than the total number of embeddings evaluated in the first step.
Our numerical experiments suggest that the appropriate embeddings vary with the prediction steps.The results show that a smaller E with a smaller ρ is preferable for a short-term forecast, and a larger E with a larger ρ is preferable for a longer-term forecast.These results are consistent with those of an existing study 16 and with the fact that the longer-step forecasts need more complex map from the current inputs to the forecasts.Therefore, more information is needed to express the complex map.
Although we applied the (µ + λ )-ES algorithm to obtain the suboptimal embeddings, it is worth considering other methods to solve the combinatorial optimization problem.We can apply other evolution algorithms, e.g., simulated annealing 25 and ant colony optimization 26 .Any algorithm may work as long as it can store the suboptimal embeddings through its optimization procedure.Testing and evaluating other algorithms are open problems.
Throughout this study, we used a variation of a local nonlinear prediction method to forecast the target variable.One of the most significant advantages of these local prediction methods is that we can compute in-sample forecasts very easily with a small computational cost; all we have to do is to exclude the current query from the nearest neighbors.In addition, owing to recent progress in approximate neighboring search algorithms (e.g., Refs.27, 28), these algorithms are fast enough to apply the evolution algorithms.Moreover, it is also suitable to obtain suboptimal embeddings through the forecast errors because local nonlinear prediction methods are sensitive to the false nearest neighbors derived from invalid features.However, we can apply other regression methods, e.g., Gaussian process regression 10 , as done in RDE 6 , for further improvements.Note that these methods incur a very large computational cost to yield in-sample forecasts, as mentioned in Ref. 6.To reduce the computational cost, we can apply these methods to only the process in the second step and can approximate the in-sample error by the k-fold cross-validation error.

Optimizing embedding by an evolution strategy
One can optimize embedding to solve combinatorial optimization problems to treat embeddings as binary series 11,12 .Here, we solve equation ( 2) by a combinatorial optimization problem.We treat embeddings as binary series, i.e., 1 meaning embedded and 0 meaning not embedded.For example, with n = 3 and l = 2, the embedding [y 1 (t), y 2 (t), y 3 (t − 1)] can be expressed by the nl = 6 binary code e = [1, 0, 1, 0, 0, 1]: the first element 1 corresponds to y 1 (t) to embed, the second element 0 corresponds to y 1 (t − 1) to embed, the third element 1 corresponds to y 2 (t) to embed, and so on.The optimization of a binary series is a typical combinatorial optimization problem.In this paper, we apply the (µ + λ ) evolution strategy 13 .The parameter µ denotes the number of parents, λ denotes the number of offspring, and + denotes plus selection, which means that the next parents are selected from both the previous parents and offspring.Precisely, the algorithm yields λ new offspring from the top-performing µ individuals selected from the previous parents and offspring.The algorithm is an elitist method since the algorithm retains the best individuals unless they are replaced by superior individuals.The algorithm tends to converge fast, but it is likely to become trapped at local optima.Therefore, the algorithm is suitable for our proposed framework.However, it is worth considering the application of other existing methods [29][30][31] to obtain suboptimal embeddings.

Forecasting through FIR filters
We apply FIR filters to the original time series to obtain various embeddings and forecasts.We apply the following linear transformation for each variable: where h i 's are filter coefficients with N elements, h i (0) = 1 ∀i, and µ i and σ i are the coefficients for standardization.After forecasting z f (t), we restore ẑ f (t) for the original time series.For p = 1, we restore ŷ f (t + p|t) using ẑ f (t + p|t) as follows: For p ≥ 2, we obtain the p-steps-ahead forecast ŷ f (t + p|t) by substituting forecasts for the unobserved y f (t ≥ t + 1) as follows: We can restore ŷ f (t + p|t) for any p to apply equation ( 12) recursively.We can apply FIR filters such as the moving average filter to suppress noise.In contrast, an existing study 14 focuses on the filtered delay coordinates constructed by blending the derivatives of time series and the original one to search the nearest neighbors.Although these are a type of high-pass FIR filters, such filters improve the prediction accuracy in less noisy time series.

Overall algorithm of the proposed framework
We denote a multivariate time series as y(t) ∈ R n and the target variable to be forecasted as y f .The proposed forecasting framework is presented in Algorithm 1 in terms of the user-defined number of split training datasets K, the number of top-performing embeddings chosen for each split M, the linear filters, and the number of filters L. for each e h in H do  We computed y(x,t) spatially periodic boundary conditions in the interval [0, L], where L = 22 and the number of uniform grids is 128.We generated 20-dimensional time series: the values of the first 10 grids of the Kuramoto-Sivashinsky equations x 0 , x 1 , ..., x 9 with a sampling time of 1.0 and 10 random walks x 10 , x 11 , ..., x 19 .We set E = 5 and considered up to five lags for the existing schemes (RDE, MVE, and SDW).We forecasted x 0 up to 10 steps using the same parameter values, data length, and noise scales as those in the Lorenz'96I example.

Lorenz'63 equations
The Lorenz'63 equations 3 are expressed as three-dimensional differential equations as follows: We generated three-dimensional time series: Lorenz'63's x and y and a random-walk series.We set the integration step to 0.001 and recorded every 100 points with p = 10, b = 8/3, and r = 28.Note that the initial transients were disregarded.We forecasted x up to 10 steps.We set K = 10, M = 3, θ = 3, and h i (0) = 1, h i (1) = ρ ∀i for ρ ∈ {0.0, −0.2, −0.4,−0.6, −0.8, −1.0}.For the (µ + λ )-ES algorithm, we set µ = 50, λ = 100, the number of generations to 10, and the number of populations to 100.We compared the proposed framework with MVE and SDW with E = 4 up to five lags and the single-best embedding via the (µ + λ )-ES algorithm to minimize the total in-sample error within the whole training dataset.Note that we did not carry out calculations with RDE because we were not able to prepare a sufficient number of "nondelay embeddings" for this type of low-dimensional data.

Rössler equations
The Rössler equations 4 are expressed as three-dimensional differential equations as follows: We generated three-dimensional time series: Rössler x and y and a random-walk series.We set the integration step to 0.001 and recorded every 500 points with a = 0.36, b = 0.4, and c = 4.5.Note that the initial transients were disregarded.We forecasted x up to 10 steps with the same conditions as those of the Lorenz'63 equations.

Figure 1 .
Figure1.Schematic of the proposed forecasting procedure.We prepare a pool of suboptimal embeddings in the first step.We solve K combinatorial optimization problems to obtain various embeddings in this step.Next, we pick kp embeddings to minimize the error of the combined forecast in the second step.We combine the forecasts obtained by the kp embeddings to test the time series.

Figure 2 .
Figure 2. Forecast performance with the Lorenz'96I model: comparisons of the performance (a) up to 10 steps ahead with the fixed data length (4000) without noise, (b) with different values of the data length, and (c) with different scales of observational noise.We computed the RMSE of the five-steps-ahead forecasts with randomly distributed embedding (RDE), multiview embedding (MVE), state-dependent weighting (SDW), single-best embedding based on the (µ + λ )-ES algorithm (SBE), and the proposed framework.These tests were carried out with 20 datasets generated with different random initial conditions and noise.The median, upper quartile, and lower quartile are shown.

Figure 3 .
Figure 3. Forecast profile of the Lorenz'96I model with the data length of 4000.Panel (a) shows the proportion of embedding of the proposed forecasts.The color indicates the proportion of embedding for each variable averaged over the number of combined forecasts for each step.Note that for each prediction step, the sum of the proportions of all variables is one.Variables x 0 , ..., x 4 are the variables of the 10-dimensional Lorenz'96I equations, and the other variables are random walks.Panel (b) shows the averaged filter coefficient ρ and embedding dimension E over the combined forecasts for each step.Note that the averaged dimensions can be a decimal fraction since they are averaged over the number of combined forecasts for each step.Panel (c) shows the relation between the number of combined forecasts and the errors.The solid line shows the in-sample error, and the dashed line shows the test error.The vertical solid line shows the selected number of combined forecasts k p to minimize the in-sample error.

Figure 4 .
Figure 4. Forecast performance with the Kuramoto-Sivashinsky equations: comparisons of the performance (a) up to 10 steps ahead with the fixed data length (4000) without noise, (b) with different values of the data length, and (c) with different scales of observational noise.We computed the RMSE of the five-steps-ahead forecasts with randomly distributed embedding (RDE), multiview embedding (MVE), state-dependent weighting (SDW), single-best embedding based on the (µ + λ )-ES algorithm (SBE), and the proposed framework.These tests were carried out with 20 datasets generated with different random initial conditions and noise.The median, upper quartile, and lower quartile are shown.

Figure 5 .
Figure 5. Forecast performance with the Lorenz'96I model for various numbers of variables: cases where (a) a half of the variables are substituted with random walks and (b) all variables are available.We computed the RMSE of the five-steps-ahead forecasts with randomly distributed embedding (RDE), multiview embedding (MVE), state-dependent weighting (SDW), single-best embedding based on the (µ + λ )-ES algorithm (SBE), and the proposed framework.These tests were carried out with 20 datasets generated with different random initial conditions and noise.The median, upper quartile, and lower quartile are shown.

17 In 9 EFigure 6 .
Figure 6.Forecast results for the flood dataset.Panel (a) shows a comparison of the ground truth and the proposed 24-h-ahead forecast.The proposed forecast did not underestimate the maximum river stage, which is the maximum value of the whole dataset.Panel (b) shows a comparison of the in-sample and test errors of the ensemble members.The results demonstrate the difficulty of selecting the best forecast because the best in-sample forecast does not always perform the best.In contrast, the combined forecast performed the best by far for both training and test data.Panel (c) shows the filter coefficient ρ and embedding dimension E averaged over the combined forecasts for each step.

Figure 2 .
Figure 2. Forecast performance for low-dimensional datasets: RSMEs for (a) the Lorenz'63 dataset, (b) the Rössler dataset, and (c) the six-dimensional Lorenz'96I dataset.We compared the performance of multiview embedding (MVE), state-dependent weighting (SDW), and single-best embedding based on the (µ + λ )-ES algorithm (SBE).These tests were carried out with 20 datasets generated with different random initial conditions and noise.The median, upper quartile, and lower quartile are shown.
: end for 21: I p (i) := Sort S by the in-sample error of the pth step 22: for each k in the indices of I p do 13: if d(e h , e s ) ≥ θ ∀e s ∈ S then Append e h to S with L filters 14: if M • L embeddings are appended to S then break f (t + p|t) ∀t ∈ T train by map ψ e 19: Evaluate the in-sample error of ŷe f (t + p|t) 20