Introduction

The prediction of future values in a complex time series is a major interest for scientists1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17, with applications to various scientific fields18,19,20. There are many natural phenomena that require a prediction algorithm for answering important questions, such as estimating future population variations, predicting the orbits of astronomical objects and predicting the occurrence of seismic waves. The prediction of twenty-first century global temperature rise would be a valuable resource for policy makers and planners6. Population projections may be used to predict species extinction before they reach a crisis point5. Moreover, prediction is an ongoing and pressing problem in the forecasting of economic time series8. In the medical sciences, there are also many applications for which an efficient prediction algorithm could save lives. A large number of time series obtained from monitoring the human body can be used as a basis for the decision-making process to treat or prevent grave diseases such as epilepsy or Alzheimer's disease1,2,3.

It has been shown that data generated by such natural phenomena often behave chaotically11. Although chaotic behaviours are deterministic, their complex properties make it difficult to distinguish them from random behaviour. Chaotic behaviours are known to be strongly dependent on initial conditions; small changes in initial conditions can possibly lead to immense changes in subsequent time steps and are particularly difficult to predict. Because the exact conditions for many natural phenomena are not known and the properties of a chaotic time series are very complex, it is difficult to model these systems.

Most of the existing methods for complex time series prediction are based on modelling the time series to predict future values, although there are other types of methods such as agent-based simulations that model the system generating the time series21. Model-based approaches can be classified into two main domains: linear modelling such as in ARIMA (autoregressive integrated moving average)18 and nonlinear modelling such as in MLP (multi-layer perceptron)22 and GARCH (generalised autoregressive conditional heteroskedasticity)23 (for details, see Supplementary Information, section 1). However, other studies have concluded that there is no clear evidence in favour of nonlinear models over linear models in terms of forecasting performance7. Regardless, there is no robust procedure that can produce an accurate model for chaotic time series. For all of these methods, the prediction error increases dramatically with the number of time points predicted7,9,21. Therefore, most of the existing methods focus on very short-term predictions to obtain a reasonable level of accuracy. None of the existing methods demonstrate an acceptable level of accuracy for long-term prediction12. For example, for financial time series predictions, most methods can predict only one step ahead, which is not very helpful for acting against a financial crisis before it occurs10,12,21. To address this deficiency in existing methods, we propose a novel approach to making long-term time series predictions (see Methods), GenericPred24, with applications to financial time series, medical diagnosis and global temperature prediction.

Results

For the DJIA (Dow Jones Industrial Average) time series between 1993 and 2001, when markets were stable with no major changes and no financial crisis, our proposed GenericPred method predicts the trend with very high accuracy, whereas the predictions based on other methods strongly and rapidly diverge from the real data (see Fig. 1a and Supplementary Table 1). GenericPred is the only method able to discover the decreasing trend corresponding to the financial crisis, while the three other methods predicted a growth in the stock market (see Fig. 1b and Supplementary Table 1). Although the 2009 financial crisis data were used for training in the third time period, when the recession was in the middle of the considered range, the GenericPred method successfully discovered the general trends for the next 500 days, with a particularly high accuracy for the first 300 days, effectively predicting the increase in the stock market (see Fig. 1c and Supplementary Table 1). The other three methods failed to predict the trend.

Figure 1
figure 1

Dow Jones Industrial Average stock market index prediction.

We examined the DJIA time series with respect to the daily closing values of the DJIA for three time periods: (a) September 1993-September 2001 for the prediction of DJIA values before the 2009 financial crisis, (b) July 2001-July 2009 for the prediction of the financial crisis in 2009 and (c) August 2004- August 2012 for the prediction of DJIA values after the financial crisis in 2009. For each time series, 1500 time steps (approximately 6 years) were analysed to predict the next 500 time steps (approximately two years).

The GenericPred method's prediction errors were significantly lower than the other methods for all three periods with respect to both short-range and long-range predictions (see Supplementary Table 1). Moreover, the GenericPred predictions were more stable with a constant lower standard deviation, regardless of whether the target data lie before the recession, during the recession, or after the recession. The prediction error for the first 200 time steps is especially smaller than that of the other methods. For the three considered periods, GenericPred also significantly outperformed L-FABS21 and the MLP22 model for short-term predictions (see Supplementary Table 1). GenericPred is also stable given that it has a low sensitivity to the starting point of the prediction (see Supplementary Figure 2).

For the EEG time series, the P&H chaoticity values25 were predicted using GenericPred (see Supplementary Figures 2 and 3) on a constant-length (20 minutes) sliding window (the window moves every 20 seconds) of the five EEG time series for all the patients. During a seizure, a peak in the P&H values obtained from the EEG time series appeared. Based on the analysis of all 21 patients, a threshold for the prediction of seizures (P&H value equal to 2.4) was determined based on patient data (see Supplementary Information). Using this threshold, the GenericPred method can predict the epileptic seizure with 100% sensitivity and specificity up to 17 minutes in advance (see Table 1). The same results have been obtained by considering the data of any five electrodes independently. This represents a considerable improvement over one of the best current predictive methods, which only achieves accuracy levels of 73% sensitivity and 67% specificity for 10 patients within a 1–10 minute range26.

Table 1 Sensitivity and specificity of epileptic seizure prediction for 21 patients for different lengths of prediction. For each patient, one positive and 10 negative samples were constructed. The positive sample contains one epileptic seizure event and the 10 negative samples are seizure-free. Therefore, there are 21 positive and 210 negative samples in total that were used to compute the specificity and the sensitivity accuracy levels

For the global temperature anomalies prediction, unlike ARIMA, GenericPred accurately predicted the increasing trend in the last 30 years (Fig. 2a). Moreover, most of the successive peaks and depressions were predicted with a high degree of precision. The mean square error for GenericPred is 0.64, whereas for ARIMA, it is 1.6. The GARCH and VAR methods were unable to make predictions due to insufficient data.

Figure 2
figure 2

Predicting the annual records of global temperature anomaly (a) for 30 years (1983–2013) and (b) until the end of the 21st century (2014–2100).

The existing dedicated forecasting models for global temperature anomalies predict that, as the world consumes more fossil fuel, greenhouse gas concentrations will continue to increase and the Earth's average surface temperature will continue to rise27. Based on recent predictions, average surface temperatures could rise between 2°C and 6°C by the end of the 21st century6,28. We predicted the global temperature anomaly until the end of the 21st century (2014–2100) (Fig. 2b). GenericPred predicted an average anomaly of 2.5°C for the years 2085–2100, which is in accordance with the predictions of the dedicated models.

Our approach demonstrates a significant gain in accuracy over traditional methods with respect to predicting different DJIA time series for both short- and long-term predictions. Its ability to predict the evolutionary trend of the stock market is vastly superior to the predictive ability of existing methods. Further, the ability of our GenericPred method to predict epileptic seizures could be a major breakthrough in terms of managing epilepsy, providing, for the first time, a robust method for predicting an epileptic seizure long before its occurrence. Finally, the success of GenericPred's predictions of the trend of increasing global temperature shows that this method has the potential to be a powerful tool for controlling the global warming phenomenon. Another advantage of our approach is that it does not rely on a complex model of the original time series and is therefore highly general and highly computationally efficient.

This method provides a first step towards accurate and comprehensive time series long-term predictions. Although, with respect to long-term predictions, it is impossible to predict exact values, GenericPred's performance shows great potential for predicting time series' trends.

Methods

Several researchers emphasise the potential of market predictions to improve important financial decisions29, from helping businesses make sounder investment decisions to helping governments make more efficient fiscal and monetary policy decisions8. These time series are amongst the most complex time series because of the number of parameters involved. Our results are compared with respect to long-term predictions with ARIMA, GARCH and VAR30, which are the most widely used and most efficient methods for making long-term time series predictions. We also compared our results for short-term predictions with those obtained by two existing methods: the Learning Financial Agent Based Simulator (L-FABS)21 and the MLP model22.

For the first period, we considered the DJIA (Dow Jones Industrial Average) time series between 1993 and 2001, when markets were stable with no major changes and no financial crisis. In the second period considered, the US stock market peaked in October 2007, but by March 2009, the Dow Jones average had reached its minimum, which reflects the most serious effects of a financial crisis. In the third period (August 2004–August 2012), the recession was in the middle of the considered range.

Another important application of time series predictions is in medical science. Approximately 1% of the world population suffers from epilepsy31. Epileptic seizures are the result of unusual and irregular neuronal activity in the brain32,33. Many recent methods have been proposed for predicting epileptic seizure26,34,35 but none of them as shown their ability to perform accurate predictions more than 10 minutes in advance on a large number of patients. To evaluate the performance of our new method for predicting epileptic seizures, we examined the EEG time series measured by five electrodes, generating five different time series, for 21 patients. For each EEG time series, the exact time of the seizure is known.

Predicting the monthly records of global temperature anomalies is currently one of the most pressing and controversial environmental concerns36. As a third experiment, we used the global temperature anomaly data from 1880 to 1983 to train for the prediction of global temperatures during 1983–2013. Global temperature anomaly data come from the Global Historical Climatology Network-Monthly (GHCN-M) data set and International Comprehensive Ocean-Atmosphere Data Set (ICOADS), which have data from 1880 to the present. These two datasets are blended into a single product to produce the combined global land and ocean temperature anomalies.

Our new method for complex time series prediction is based on the concepts of chaos theory and an optimisation process. The general idea is to extract a unique characteristic from an existing time series that somehow represents the behaviour of the time series and to subsequently generate successive new values that continue the time series, each value minimising the difference between the characteristic of the new time series and the initial one. The details of the GenericPred method for long-term time series prediction are as follows. We consider a time series SN:

A nonlinear measure V() is computed on SN The fractal dimension37 and the Lyapunov exponent38 are examples of such nonlinear measures that return a single value for a time series. A possible mapping may be required, forming a new time series , for different applications as follows:

otherwise, , where 0 < L < N is the size of a sliding window used to compute the local level of chaos measured by V(). Therefore, when the mapping is applied, the new considered time series corresponds to the variation in time of the local non-linear measure in the initial time series SN.

We consider as a reference value that will be used for predicting the next k values of the time series:

The parameter σ of a normal distribution N(yi, σ2) is estimated by computing the variation between every two consecutive values (yi to yi+1) of the time series . This distribution represents the probability distribution P(yi|yi−1) (see Fig. 3). Several data sets have been considered to determine that a normal distribution is a good approximation of the real distribution. However, the same method has been applied using other distributions without significant degradation is the prediction.

Figure 3
figure 3

Successive steps of the GenericPred method for time series prediction.

For predicting yN+i, a set Pos(yN+i) of Nrand random values are generated following the distribution N(yN+i−1, σ2) (Fig. 3):

Nrand is a parameter that can impact the quality of the prediction because having more values will increase the chance of finding an optimal value. However, no significant improvement was observed for the data considered when Nrand was greater than 10. For this reason, we chose 10 as the value of Nrand for each experiment. yN+i is then computed by selecting the that makes the new nonlinear measure the closest to :

The value is chosen to make as close as possible to .

The important point is that the reference value is always , which is the calculated nonlinear measure from the original time series. Therefore, the GenericPred method uses two basic rules:

R1: Always endeavour to keep the value of a nonlinear measure as steady as possible during prediction (Fig. 3).

R2: The new value must be chosen from a set of potential values generated from a probability distribution.

The prediction has to be pursued one step at a time because the predicted value in the current step is needed for determining the valid range of change for the next step. For those problems for which a binary prediction (‘yes’ or ‘no’) is required, (e.g., the epileptic seizure prediction), a threshold t is computed from the learning data. Whenever the value yN+i is greater than the threshold t, the prediction is positive. For example, yes there is an epileptic seizure at time N + i if yN+i > t; otherwise, there will be no seizure at time N + i.

Classical model-based prediction approaches consider a unique value for the next step, whereas in the GenericPred method, several points are considered simultaneously. Our method is also able to constantly adjust the information regarding the current time series, whereas classical predictive methods apply the model without taking into account the concordance between the original time series and the predicted ones. Technically, any nonlinear measure could be used for the time series characterisation. However, here, we used the P&H method25 because it has been shown that this method can efficiently discriminate between different types of nonlinear behaviour39.