Predictability limit of partially observed systems

Applications from finance to epidemiology and cyber-security require accurate forecasts of dynamic phenomena, which are often only partially observed. We demonstrate that a system’s predictability degrades as a function of temporal sampling, regardless of the adopted forecasting model. We quantify the loss of predictability due to sampling, and show that it cannot be recovered by using external signals. We validate the generality of our theoretical findings in real-world partially observed systems representing infectious disease outbreaks, online discussions, and software development projects. On a variety of prediction tasks—forecasting new infections, the popularity of topics in online discussions, or interest in cryptocurrency projects—predictability irrecoverably decays as a function of sampling, unveiling predictability limits in partially observed systems.


Proof of Corollary 1
Proof. Autocorrelation is defined as the Pearson correlation between values of the signal at different times, i.e.,
\[ R(i,j) = \frac{\mathrm{Cov}(Y_i, Y_j)}{\sqrt{\mathrm{Var}(Y_i)\,\mathrm{Var}(Y_j)}}. \]
This yields the following expression for the autocorrelation of the time series Y:
\[ R(i,j) = \frac{p^2\,\mathrm{Cov}(X_i, X_j)}{\sqrt{\big(p^2\,\mathrm{Var}(X_i) + p(1-p)\,E[X_i]\big)\big(p^2\,\mathrm{Var}(X_j) + p(1-p)\,E[X_j]\big)}}. \]
The last equality comes from replacing Equation S.1 in the numerator and Equation S.2 in the denominator. Finally, we assume the ground truth process is stationary, i.e., the process has a time-independent variance (Var(X_i) = Var(X_j) ∀ i, j) and mean (E[X_i] = E[X_j] ∀ i, j), yielding the desired result:
\[ R(i,j) = \frac{p\,\mathrm{Cov}(X_i, X_j)}{p\,\mathrm{Var}(X) + (1-p)\,E[X]}. \quad (S.4) \]

Proof of Corollary 2
Proof. Based on Corollary 1, the autocorrelation of the sampled signal Y between times i and j, which is a function of the sampling rate p, is
\[ R(p) = \frac{p\,\mathrm{Cov}(X_i, X_j)}{p\,\mathrm{Var}(X) + (1-p)\,E[X]}. \]
The autocorrelation lies in the range [−1, 1], and we want to prove that its magnitude increases as a function of p. Hence, we next show that \( \frac{d}{dp} R^2(p) \ge 0 \) for all 0 ≤ p ≤ 1:
\[ \frac{d}{dp} R^2(p) = \frac{2p\,\mathrm{Cov}(X_i, X_j)^2\, E[X]}{\big(p\,\mathrm{Var}(X) + (1-p)\,E[X]\big)^3}, \]
where both the numerator and the denominator are trivially positive for all values in 0 ≤ p ≤ 1 given that E[X] ≥ 0, and the result follows.
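The closed form above can be checked numerically. The following Python sketch (our own illustration, with an assumed AR(1)-like count process and hypothetical helper names) binomially samples a synthetic count series at several rates and compares the empirical lag-1 autocorrelation with Equation S.4; it also makes the monotone growth with p stated in Corollary 2 visible.

```python
import numpy as np

rng = np.random.default_rng(0)

def ar1_counts(n, alpha=0.8, base=50.0):
    """Generate a strictly positive AR(1)-like count series (assumed ground truth X)."""
    x = np.empty(n)
    x[0] = base
    for t in range(1, n):
        x[t] = base * (1 - alpha) + alpha * x[t - 1] + rng.normal(0, 5)
    return np.maximum(np.round(x), 0).astype(int)

def lag1_autocorr(y):
    """Empirical lag-1 Pearson autocorrelation."""
    return np.corrcoef(y[:-1], y[1:])[0, 1]

x = ar1_counts(50_000)
cov_x = np.cov(x[:-1], x[1:])[0, 1]
var_x, mean_x = x.var(), x.mean()

for p in (0.1, 0.3, 0.6, 0.9):
    y = rng.binomial(x, p)                                    # sampled signal Y ~ B(X, p)
    empirical = lag1_autocorr(y)
    predicted = p * cov_x / (p * var_x + (1 - p) * mean_x)    # Equation S.4
    print(f"p={p:.1f}  empirical R={empirical:.3f}  closed form={predicted:.3f}")
```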

Proof of Corollary 3
Proof. We compute the covariance between the sampled signal Y and the external signal S using E[Y | X] = pX and the fact that the sampling is independent of S given X:
\[ \mathrm{Cov}(Y, S) = E[YS] - E[Y]\,E[S] = p\,E[XS] - p\,E[X]\,E[S] = p\,\mathrm{Cov}(X, S). \quad (S.7) \]
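A quick Monte Carlo check of Equation S.7, with an assumed toy pair (X, S) in which X depends linearly on S (the constants are arbitrary choices made for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy setup: external signal S, ground truth counts X correlated with S.
s = rng.normal(0, 1, 100_000)
x = np.maximum(np.round(30 + 10 * s + rng.normal(0, 3, s.size)), 0).astype(int)

for p in (0.2, 0.5, 0.8):
    y = rng.binomial(x, p)                      # sampled signal Y ~ B(X, p)
    print(f"p={p:.1f}  Cov(Y,S)={np.cov(y, s)[0, 1]:.2f}  "
          f"p*Cov(X,S)={p * np.cov(x, s)[0, 1]:.2f}")
```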

Permutation Entropy Criterion
Entropy measures the uncertainty of a random variable, which intuitively serves as an indicator of the predictability of a stochastic event. In statistical physics, entropy characterizes the number of possible microscopic states of a system: the more microscopic states a system admits, the more chaotic it is, and the harder it becomes to predict its behavior. Specifically, the Shannon entropy of a random variable x with distribution p is defined as
\[ H(x) = -\sum_{x} p(x)\,\log p(x). \]
The entropy rate, on the other hand, measures the growth rate of the entropy with respect to the length of a process (in our setting, the time series). The entropy rate of a time series X_t is defined as
\[ \lim_{t \to \infty} \frac{H(X_1, \cdots, X_t)}{t}, \]
where H(X_1, · · · , X_t) is the joint entropy of the random variables X_1, · · · , X_t. The higher this rate, the less predictable the time series X_t. While the direct computation of this rate is difficult in practice, a quantity that approximates the entropy rate is the permutation entropy, which captures the complexity of a time series through the statistics of its subsequences using ordinal analysis. This complexity can be interpreted as the diversity of the trends among the subsequences of a certain length (1,2). Therefore, the higher the permutation entropy, the more distinct trends exist in the time series, which renders its prediction more difficult.
Permutation entropy has been used in various fields to characterize the predictability of the time series of interest (2,3). Interestingly, this quantity has also been used as a forensic tool to inspect and identify potential corruption in source data (4).
Definition 1 (Permutation Entropy). Given a time series {x_t}_{t=1}^N, let S_d be the collection of all d! permutations π of order d. For each π ∈ S_d, determine the relative frequency of that permutation occurring in {x_t}_{t=1}^N:
\[ P(\pi) = \frac{1}{N - d + 1} \sum_{t=1}^{N-d+1} \delta(\pi_t, \pi), \]
where P(π) quantifies the probability of the ordinal pattern π, π_t denotes the ordinal pattern of the subsequence (x_t, · · · , x_{t+d−1}), and δ(a, b) equals 1 if a = b and 0 otherwise. The permutation entropy of order d ≥ 2 is defined as
\[ H_P(d) = -\sum_{\pi \in S_d} P(\pi)\,\log P(\pi). \]
The ordinal pattern encodes the relative magnitudes of successive time series values. As an example, if x_1 = 3, x_2 = 6, x_3 = 1, then the ordinal pattern of the subsequence {x_1, x_2, x_3} is the permutation induced by the ordering x_3 < x_1 < x_2. Besides the order d, the more general definition of permutation entropy (see main text Equation 1) has one more parameter: the temporal delay τ. The ordinal pattern can be defined in the same way with respect to the subsequence x_t, x_{t+τ}, x_{t+2τ}, · · · , x_{t+(d−1)τ}, which gives the permutation entropy H_P(d, τ).
To lessen the effect of noise on the ordinal pattern of the signal, a weight is attached to each subsequence so as to reflect the importance of ordinal changes of large amplitude. For a subsequence of length/order d consisting of the time series values x_{t+1} to x_{t+d}, the weight is taken to be its empirical variance,
\[ w_t = \frac{1}{d} \sum_{k=1}^{d} \big(x_{t+k} - \bar{x}_t\big)^2, \]
where \bar{x}_t is the mean of the subsequence. As a result, the weighted frequency of a permutation is defined as
\[ P_w(\pi) = \frac{\sum_t w_t\,\delta(\pi_t, \pi)}{\sum_t w_t}, \]
and the weighted permutation entropy is defined as
\[ H_{WP}(d) = -\frac{1}{\log(d!)} \sum_{\pi \in S_d} P_w(\pi)\,\log P_w(\pi). \]
Here we normalize the weighted permutation entropy by the log-number of possible permutations, i.e., log(d!). Thus, the weighted permutation entropy takes values between 0 and 1.
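For concreteness, the following is a minimal Python sketch of the normalized weighted permutation entropy described above; the function name is ours, and the variance-based weight follows the standard weighted permutation entropy construction.

```python
import math
from collections import defaultdict

import numpy as np

def weighted_permutation_entropy(x, d=3, tau=1):
    """Normalized weighted permutation entropy of order d and delay tau.

    Each length-d subsequence is mapped to its ordinal pattern (via argsort)
    and weighted by its variance, so that large-amplitude fluctuations
    contribute more to the pattern frequencies.
    """
    x = np.asarray(x, dtype=float)
    n = len(x) - (d - 1) * tau
    if n <= 0:
        raise ValueError("time series too short for the chosen d and tau")

    weights = defaultdict(float)
    total = 0.0
    for t in range(n):
        window = x[t : t + d * tau : tau]
        pattern = tuple(np.argsort(window))      # ordinal pattern of the window
        w = window.var()                         # weight: subsequence variance
        weights[pattern] += w
        total += w

    if total == 0.0:                             # constant series: a single pattern
        return 0.0
    probs = np.array([w / total for w in weights.values() if w > 0])
    h = -np.sum(probs * np.log(probs))
    return h / math.log(math.factorial(d))       # normalize by log(d!)

# Example: white noise is close to maximal entropy, a monotone ramp is near zero.
rng = np.random.default_rng(0)
print(weighted_permutation_entropy(rng.normal(size=5000), d=3))
print(weighted_permutation_entropy(np.arange(5000, dtype=float), d=3))
```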

Synthetic Data Experiments
Here, we validate our findings on synthetically generated time series data. We show that predictability diminishes as data is lost to sampling. We first consider an idealized scenario, where X represents an autoregressive process, from which events are sampled at random to create the observed sampled signal Y .

Synthetic Time Series Generation
External Signal First, we generate an external signal S. To ensure autocorrelation, we generate it using the autoregressive integrated moving average (ARIMA) model,
\[ \alpha(L)\, S_t = \beta(L)\, \varepsilon_t, \]
where L denotes the lag operator (L S_t = S_{t−1}) and ε_t is white noise. The ARIMA coefficients satisfy the stationarity conditions, so that the external signal S is second-order stationary. The stationarity conditions require that all roots of the polynomials α(x) = 1 + α_1 x + · · · + α_k x^k and β(x) = 1 + β_1 x + · · · + β_l x^l satisfy |x| > 1; i.e., all roots of these two polynomials are located outside the unit disk. We enforce second-order stationarity by determining the roots of α(x) and β(x) first and then solving for the corresponding regression coefficients {α_i} and {β_j}, which specify the model.
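A sketch of this root-based construction in Python, assuming real roots drawn uniformly from an illustrative range and using statsmodels' arma_generate_sample to simulate the series (the ar and ma arguments play the role of the lag polynomials α and β):

```python
import numpy as np
from numpy.polynomial import polynomial as P
from statsmodels.tsa.arima_process import arma_generate_sample

rng = np.random.default_rng(0)

def lag_polynomial(order, low=1.1, high=2.0):
    """Lag polynomial 1 + c_1 x + ... + c_order x^order with all roots outside
    the unit disk, built by drawing real roots of magnitude in [low, high]."""
    roots = rng.uniform(low, high, order) * rng.choice([-1.0, 1.0], order)
    coeffs = P.polyfromroots(roots)   # increasing powers; constant term is prod(-r_i)
    return coeffs / coeffs[0]         # rescale so the zero-lag term is 1

# External signal S: an ARMA(3, 2) series of length 365, as in the experiments below.
alpha = lag_polynomial(3)             # plays the role of alpha(x)
beta = lag_polynomial(2)              # plays the role of beta(x)
S = arma_generate_sample(ar=alpha, ma=beta, nsample=365, burnin=200)
```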

Ground Truth Signal
We generate the ground truth signal X in a similar manner, except that the generating model includes a term for the external signal S. This ensures that the ground truth and the external signals are correlated. Specifically, we assume the ground truth signal is defined as
\[ \alpha(L)\, X_t = \gamma\, S_t + \beta(L)\, \varepsilon_t, \]
where γ sets the strength of the coupling to the external signal. When generating the ground truth time series, we require that the autocorrelation of the ground truth signal is strong enough to be distinguishable from random white noise.
Observed signal We sample the ground truth data X to obtain the time series of observed events, Y. The sampling rate p is the probability of sampling an event. Because the count process is described by an ARIMA model, which inevitably yields real-valued rather than integer-valued counts, the fractional part of the count X is treated as a separate instance: if it is sampled, its original (fractional) value is added to the sampled data. In other words, each instance of Y obeys the Binomial distribution B(X, p). Due to the stochastic nature of sampling, we generate ten different samples Y from the ground truth signal at the same sampling rate p. In each experiment, the sampled signal Y is split into a training and a testing set, with the training data used to fit a predictor Ŷ that forecasts the test data. The accuracy of the predictor for a given sampling rate is then averaged over the ten experiments.
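A minimal sampling routine in Python, following the scheme just described (the function name is illustrative; the handling of the fractional part is our reading of the description above):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_counts(x, p):
    """Binomial sampling of a real-valued count series: the integer part of X_t
    is thinned as B(floor(X_t), p), and the fractional remainder is treated as
    one extra instance kept with probability p, contributing its original value."""
    x = np.maximum(np.asarray(x, dtype=float), 0.0)
    integer_part = np.floor(x).astype(int)
    fractional_part = x - integer_part
    y = rng.binomial(integer_part, p).astype(float)
    y += np.where(rng.random(x.shape) < p, fractional_part, 0.0)
    return y

# Ten independent samples at the same sampling rate, as in the experiments.
X = np.abs(rng.normal(50, 10, 365))
samples = [sample_counts(X, p=0.3) for _ in range(10)]
```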
Model Training Training an ARIMA model consists of two steps. First, a grid search is performed to find the best hyper-parameters (k, l), where k is the order of the autoregressive component and l is the order of the moving-average component. For each input signal, we search the grid for the set of parameters yielding the lowest AIC score. Next, the coefficients of the fixed-order ARIMA model are fitted to the data.
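A sketch of the grid search in Python using statsmodels; the search bounds and the handling of failed fits are our own illustrative choices.

```python
import itertools
import warnings

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def best_arima_order(y, max_k=5, max_l=5, exog=None):
    """Grid search over ARIMA(k, 0, l) orders, keeping the fit with the lowest AIC."""
    best = (np.inf, None, None)
    for k, l in itertools.product(range(max_k + 1), range(max_l + 1)):
        if k == 0 and l == 0:
            continue
        try:
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")
                fit = ARIMA(y, exog=exog, order=(k, 0, l)).fit()
            if fit.aic < best[0]:
                best = (fit.aic, (k, 0, l), fit)
        except (ValueError, np.linalg.LinAlgError):
            continue                      # skip orders that fail to converge
    return best[1], best[2]               # best order and the corresponding fit
```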
After the best ARIMA order parameters are determined, we perform step-by-step prediction over a specified time range. At each prediction step, the data from the previous step is appended to the known data as new input, and the model is retrained to obtain updated parameters.
Prediction In order to compare predictions at different sampling rates, we use the normalized root-mean-square error (NRMSE) to measure how accurately we predict the observable Y. Given the predicted values ŷ_t for the time series Y, the NRMSE is defined as
\[ \mathrm{NRMSE} = \frac{1}{\bar{y}} \sqrt{\frac{1}{T} \sum_{t=1}^{T} (\hat{y}_t - y_t)^2}, \]
where \bar{y} is the sample mean of the observed values and T is the number of predicted time steps.
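The step-by-step (walk-forward) forecasting and the NRMSE above can be sketched as follows; the function names are ours, and refitting the model at every step follows the procedure just described.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def nrmse(y_true, y_pred):
    """Root-mean-square error normalized by the sample mean of the observed series."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(np.mean((y_pred - y_true) ** 2)) / np.mean(y_true)

def walk_forward_forecast(y, order, n_test):
    """One-step-ahead forecasts over the test range, refitting the fixed-order
    ARIMA model after each step as new data becomes available."""
    history = list(y[:-n_test])
    preds = []
    for t in range(len(y) - n_test, len(y)):
        fit = ARIMA(np.asarray(history), order=order).fit()
        preds.append(fit.forecast(steps=1)[0])   # predict the next observation
        history.append(y[t])                     # then reveal it and refit
    return np.array(preds)
```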

Numeric Experiments
First, we illustrate all our theoretical claims using one instance of an ARIMA process generated with an external signal. Second, we present aggregated results of the prediction task using multiple randomly generated ARIMA time series. For the first set of results on synthetic data, we generated an external signal S with an ARIMA model of order (3,0,2) and length 365, representing a full year of event counts.
Meanwhile, the ground truth signal X follows an ARIMA model of order (5,0,1). Figure S.1 shows the ground truth and the sampled signals. A simple visual inspection of the plot shows that many of the temporal patterns present in the unsampled data seem to have disappeared in the sampled signal.
Predictors We train three predictors for Y, each of which can be used with or without an external signal. The predictors are:
Poisson predictor assumes that events in Y are generated independently of each other at some rate. This predictor estimates the Poisson intensity as the average of the counts of all available past data.
Ŷ uses the sampled signal to predict the observable Y, i.e., the ARIMA parameters are fitted to Y and then used to predict its future values.
X̂ uses the unsampled signal to predict the observable Y, i.e., the ARIMA parameters are fitted to X and then used to predict the future values of Y given its past values.
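A minimal sketch of the Poisson baseline described above: the intensity estimate is simply the mean of all past counts, which is also the one-step-ahead forecast.

```python
import numpy as np

def poisson_baseline(history):
    """Poisson baseline: the forecast for the next step is the estimated intensity,
    i.e., the mean of all available past counts."""
    return float(np.mean(history))

# Usage: one-step-ahead predictions over an illustrative test range.
y = np.random.default_rng(0).poisson(20, 365).astype(float)
preds = [poisson_baseline(y[:t]) for t in range(265, 365)]
```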
Prediction Accuracy Figure S.2 shows normalized prediction errors (NRMSE) as a function of the sampling rate, demonstrating the nonlinear decrease of the prediction error with sampling rate: as the sampling rate decreases, the prediction error grows. We study the performance of the predictor Ŷ, which is trained on the history of the observed signal Y, as is often done in practice. The performance of the predictor X̂, trained on the full signal X, is almost always better (lower NRMSE) than that of the predictor Ŷ (the plots show the difference between predictors on a log scale). This phenomenon reveals that sampling degrades prediction accuracy. Moreover, our results suggest that the sampled process's increased noise and low autocorrelation obscure the underlying dynamics, making them harder to describe with an ARIMA model. Using an informative external signal S in prediction helps recover some of the lost information, shrinking the gap between the X̂ and Ŷ predictors as well as the overall prediction error.
When little information is lost (i.e., at high sampling rates), the predictors X̂ and Ŷ outperform the Poisson predictor, since they are able to leverage the autocorrelation of the signal through the ARIMA model. In addition, by comparing the gaps between predictors at high sampling rates (note the log scale), we see that adding the external signal makes the Poisson predictor less competitive than the other predictors. On the other hand, the Poisson predictor performs almost as well as Ŷ and X̂ when much of the information is lost (i.e., at low sampling rates). This indicates that Poisson predictors are strong baselines for the observable Y at low sampling rates. Figure S.3 shows that the autocorrelation of the sampled signal increases with the sampling rate, consistent with Corollary 2. The figure shows that sampling at a low rate quickly destroys the innate autocorrelation of the signal, fundamentally altering its properties and rendering the prediction task harder. We also see that the correlation between the sampled ground truth signal Y and the external signal gradually increases, in agreement with Corollary 3. As a consequence of this loss of information, the complexity of the sampled signal grows. Figure S.4 shows that the weighted permutation entropy decreases at high sampling rates, when more of the signal is retained. This illustrates the loss of predictability of the process at low sampling rates. The trends in the figure imply that even when little of the signal is filtered out, its predictability significantly degrades.

Decrease in Mutual Information
The loss of predictability cannot be offset by an informative external signal. Even if the external signal is highly correlated with the ground truth signal, sampling reduces its utility in prediction tasks. Figure S.5 shows the decay of the mutual information between the external and observed signals at low sampling rates: at these rates, the informative external signal no longer reduces the uncertainty of the observed signal.
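As an illustration of how this mutual information can be estimated, the following sketch uses scikit-learn's k-nearest-neighbor estimator (one possible estimator choice; the function name and the number of resamples are assumptions made for illustration):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)

def sampled_mutual_information(x, s, p, n_samples=100):
    """Median mutual information between the external signal S and binomially
    sampled versions of the count series X at rate p."""
    mis = []
    for _ in range(n_samples):
        y = rng.binomial(x, p).astype(float)
        mi = mutual_info_regression(y.reshape(-1, 1), s, random_state=0)[0]
        mis.append(mi)
    return float(np.median(mis))
```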

Nonstationarity of Prediction Errors
Another quantity that characterizes the impact of sampling on the predictability of a time series is the covariance of the prediction error at different times, as the prediction error intuitively reflects how well one can predict the event count. We show that sampling the time series introduces autocorrelation into the prediction error and renders it dependent on the evolution of the counting process, with errors growing larger or smaller depending on the type of process. Next, we provide an example to show how sampling can induce a correlation between the variances of predictions at different times.
Proposition S2. Consider an autoregressive (AR) process X_t = αX_{t−1} + ε_t, with ε_t white noise, and let Y_t | X_t ∼ B(X_t, p). The conditional variance at the next step is given by
\[ \mathrm{Var}(Y_t \mid X_t) = \alpha\, \mathrm{Var}(Y_{t-1} \mid X_{t-1}) + p(1-p)\, \varepsilon_t. \]
Proof. The information filter can be described by a Binomial distribution, Y ∼ B(X, p). Hence, Var(Y_{t−1} | X_{t−1}) = X_{t−1} p(1 − p), using the fact that Y_{t−1} | X_{t−1} is a sum of X_{t−1} independent Bernoulli random variables. Then, the variance at the next step is
\[ \mathrm{Var}(Y_t \mid X_t) = X_t\, p(1-p) = (\alpha X_{t-1} + \varepsilon_t)\, p(1-p) = \alpha\, \mathrm{Var}(Y_{t-1} \mid X_{t-1}) + p(1-p)\, \varepsilon_t. \]
Proposition S2 shows that the conditional variances of the sampled process Y are related by the exact same AR model that generated the process X; in other words, the conditional variance of the sampled process is autocorrelated. Notice that this is not true for the unsampled data, since Var(X_t | X_{t−1}) = Var(ε_t) = σ². Proposition S2 thus shows that sampling a time series may introduce autoregressive conditional heteroskedasticity (ARCH) into the time series. This is usually tested by analyzing the residuals of the model. We use Engle's Lagrange Multiplier test to demonstrate the appearance of ARCH effects in the residual signal. Figure S.6 shows that sampling does introduce ARCH effects into the residuals of the predictions.
We have applied Engle's Lagrange Multiplier (ELM) test (6) to the residual Ŷ − Y, evaluated at times t, t − 1, · · · , t − L, where L = 100 is the maximum time lag we consider, to examine the existence of ARCH behavior. Under the null hypothesis that there is no ARCH effect, the test statistic used in the ELM test asymptotically follows a χ²(L) distribution. When the p-value is below the significance level α = 0.05, the ELM test allows us to reject the null hypothesis with 95% confidence.
From Figure S.6, we see that for the original time series (top-left plot) there is no heteroskedasticity: the p-values are above the significance threshold of 0.05 for most of the time lags under consideration. In the case q = 0, the ELM test therefore does not provide enough evidence to reject the hypothesis that no ARCH effect exists. However, when we filter the original signal, as in the cases q = 0.1, 0.5, and 0.8, the p-values for time lags from 6 to 40 are mostly below the significance level, which, by the ELM test, strongly suggests that the null hypothesis be rejected. In other words, the ELM test indicates with 95% confidence that ARCH effects exist in the residuals of the predictions on the filtered time series.
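A sketch of how this test can be run in Python, using the het_arch implementation of Engle's test from statsmodels (the wrapper function and the per-lag loop are our own illustrative choices):

```python
import numpy as np
from statsmodels.stats.diagnostic import het_arch

def arch_pvalues(residuals, max_lag=100):
    """Engle's Lagrange Multiplier test for ARCH effects in the prediction
    residuals, evaluated at every lag up to max_lag; returns {lag: p-value}."""
    residuals = np.asarray(residuals, float)
    return {lag: het_arch(residuals, nlags=lag)[1] for lag in range(1, max_lag + 1)}

# Reject the no-ARCH null at a given lag when its p-value falls below 0.05.
```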

Generalizability
So far, we have methodically explored one example in which we validated our theoretical results and provided new insights into the unpredictability of sampled time series. Here we show the average loss of predictability across many randomly generated ground truth time series.
We generate multiple external signal-ground truth signal pairs and apply the ARIMA and Poisson models to these data sets to test whether the tendencies observed previously (i.e., the nonlinear decrease of NRMSE with respect to the sampling rate) also appear in signals of different ARIMA orders. We observe in Figure S.7 that: 1. the nonlinear decrease of NRMSE with increasing sampling rate persists; this general behavior is partially due to the normalization by the sample mean when we evaluate the NRMSE; 2. the autoregressive model outperforms the non-autoregressive Poisson model at higher sampling rates, but this difference fades at lower sampling rates as a consequence of sampling.
We further postulate that the NRMSE decreases with increasing sampling rate for any combination of ARIMA orders of the external signal-ground truth pair. When predicting Y with the external signal S, we observe a very similar NRMSE-sampling rate relationship to that in Figure S.7. Despite a noticeable difference at the high end of the sampling rate spectrum, the decreasing tendency, the rate of decrease, and the magnitude of the NRMSE are by and large the same as in the case without S.
In the general ARIMA setting, we thus observe a prediction error that decreases with the sampling rate, and an advantage of the autoregressive model over the Poisson model that shrinks as the sampling rate becomes smaller.
We also demonstrated that the decoupling of the external signal from the filtered signal is a prevalent phenomenon in the generic ARIMA setting. We observe in Figure S.8 that, for all ground truth-external signal pairs, the lower the sampling rate, the smaller the mutual information. We therefore postulate that this positive correlation between sampling rate and mutual information is always present, implying that knowledge of the external signal cannot help recover the predictability of the ground truth signal.

Figure S.17: Decay of mutual information between the popularity of cryptocurrency repositories and their prices for different sampling rates. For each cryptocurrency and each sampling rate, we obtained 100 samples and calculated the median mutual information (7) between the price and the popularity of related Github repositories. The solid line represents the median mutual information for each coin; the shaded region marks the interquartile range for each coin. For each cryptocurrency, we report the total number of activity events (watches, forks, and creates); the total number of repositories where these events happened; the total number of users responsible for the activity; the average number of events per day; the average number of events per repository; and the standard deviations (SD).