Introduction

Success in many scientific fields centers on prediction. From the early history of celestial mechanics we know that predicting how planetary objects move stimulated the birth of physics. Today, predicting neuronal spiking drives advances in theoretical neuroscience. Outside the sciences, prediction is quite useful as well—predicting stock prices fuels the finance industry and predicting English text fuels social media companies. Recent advances in prediction and generation are so impressive (e.g., GPT-4) that one is left with the impression that time series prediction is a nearly solved problem. As we will show using randomness- and correlation-calibrated data sources, this hopeful state of affairs could not be further from the truth.

Recurrent neural networks1, of which reservoir computers are a prominent and somewhat recent example2, have risen to become one of the major tools for prediction. From mathematics’ rather prosaic perspective, recurrent neural networks are simply input-dependent dynamical systems. Since input signals affect a learning system’s behavior, over time the system builds up a “memory trace” of the input history. This memory trace can then be used to predict future inputs.

There are broad guidelines for how to build recurrent neural networks1 and reservoir computers that are good predictors2. For instance, a linearized analysis shows that one wants to be at the edge of instability3. However, a theory of how these recurrent neural networks work optimally is lacking; though see Ref. 4. Recently, a new architecture for prediction was introduced, called a “next-generation reservoir computer”, whose memory trace intriguingly includes only the last few timesteps of the input, yet which demonstrates low prediction error at small computational cost5.

The general impression from these and many additional reports is that these recurrent neural networks have conquered natural stimuli, including language6, video7, and even climate data8. They have certainly maximized performance on toy tasks9,10 that test long memory. This noted, it is unknown how far they are from optimal performance on the tasks of most importance, such as prediction of language, video, and climate. What is needed is a calibration of how far they are from nearly-perfect prediction. This suggests developing a suite of complex processes for which the minimal achievable probability of error in prediction is known.

In the service of this goal, the following adopts the perspective that calibration is needed to understand the limitations inherent in the architecture of the next-generation reservoir computers and to understand how well state-of-the-art recurrent neural networks (including next-generation reservoir computers) perform on tasks for which optimal prediction strategies are known. This calibration is provided by time series data generated by a special type of hidden Markov model specialized for prediction called \(\epsilon\)-machines. We find, surprisingly perhaps, that large random multi-state \(\epsilon\)-machines are an excellent source of complex prediction tasks with which to probe the performance limits of recurrent neural networks.

More to the point, benchmarking on these data demonstrates that reasonably-sized next-generation reservoir computers are inherently performance limited: they suffer at least a \(\sim 60\%\) increase in error probability above and beyond optimal for “typical” \(\epsilon\)-machine tasks, even with a reasonable amount of memory. A key aspect of the calibration is that the optimalities are derived analytically from the \(\epsilon\)-machine data generators, providing an objective ground truth. This increase in error probability above and beyond the optimal rises to \(10^5\%\) if interesting11,12 stimuli are used. Altogether, we find that state-of-the-art recurrent neural networks fail to perform well when predicting the high-complexity time series generated by large \(\epsilon\)-machines. In this way, next-generation reservoir computers are fundamentally limited. Perhaps more surprisingly, a more powerful recurrent neural network9 also incurs an increase in error probability above and beyond the minimum of roughly \(50\%\) on these new prediction benchmarks.

Section “Background” reviews reservoir computers, recurrent neural networks, and \(\epsilon\)-machines. Section “Prediction error bounds” derives a lower bound on the average rate of prediction errors. Section “Results” describes a new set of complex prediction tasks and surveys the performance of a variety of recurrent neural networks on these tasks. Section “Conclusion” draws out the key lessons and proposes new calibration strategies for neural network architectures. Such objective diagnostics should enable significant improvements in recurrent neural networks.

Background

Section “Complex processes and \(\epsilon\)-machines” describes \(\epsilon\)-machines and Section “Recurrent neural networks” lays out the setup of the typical recurrent neural network (RNN) and reservoir computer (RC).

Complex processes and \(\epsilon\)-machines

Each stationary stochastic process is uniquely represented by a predictive model called an \(\epsilon\)-machine. This one-to-one association is particularly noteworthy as it gives explicit structure to the space of all such processes. One can either explore the space of stationary processes or, equivalently, the space of all \(\epsilon\)-machines. This is made all the more operational, since \(\epsilon\)-machines can be efficiently enumerated13.

In information theory, \(\epsilon\)-machines are viewed as process generators and described as minimal unifilar hidden Markov chains (HMC). In computation theory they are viewed as process recognizers and described as minimal probabilistic deterministic automata (PDA)14,15. Briefly, an \(\epsilon\)-machine has hidden states \(\sigma \in \mathcal {S}\), referred to as causal states, and generates a process by emitting symbols \(x\in \mathcal {A}\) over a sequence of state-to-state transitions. For purposes of neural-network comparison in the following, we explore binary-valued processes, so that \(\mathcal {A}=\{0,1\}\). \(\epsilon\)-Machines are unifilar or “probabilistic deterministic” models since each transition probability \(p(\sigma '|x,\sigma )\) from state \(\sigma\) to state \(\sigma '\), given emitted symbol x, is singly supported. More simply, there is at most a single destination state. In computation theory this is a deterministic transition in the sense that the model reads in symbols that uniquely determine the successor state. That said, these models are probabilistic as process generators: given that one is in state \(\sigma\), a number of symbols x can be emitted, each with emission probability \(p(x|\sigma )\). In this way, these models represent stochastic languages—a set of output strings, each occurring with some probability.

While every stationary process has an \(\epsilon\)-machine presentation, that presentation is usually not finite. An example is shown in Fig. 116. The finite HMC on the top is nonunifilar since starting in state A and emitting a 0 does not uniquely determine to which state one transits—either A or B. The HMC on the bottom is unifilar, since in every state, knowing the emitted symbol uniquely determines the next state. Note that the \(\epsilon\)-machine for the process generated by the finite nonunifilar HMC has an infinite number of causal states. Also, note that the process has infinite Markov order: if one sees a past of all 0s, one has not “synchronized” to the \(\epsilon\)-machine’s internal hidden state17, meaning that one does not know which hidden state of the \(\epsilon\)-machine one is in. Therefore, there is not a complete one-to-one correspondence between sequences of observed symbols and chains of hidden states. That said, with each observed symbol the \(\epsilon\)-machine presentation brings one closer to such a correspondence—indeed, as close as is possible.

Figure 1

At top, we see a nonunifilar hidden Markov model that is not an \(\epsilon\)-machine because, when in state A, knowing that you have emitted a 0 does not uniquely determine to which state one has transitioned. At bottom, we see the corresponding \(\epsilon\)-machine, for which in every state, knowing the emitted symbol uniquely determines the next state. For this \(\epsilon\)-machine, we have \(F(n)= \begin{cases} (1-p)(1-q)(p^n-q^n)/(p-q) & p\ne q ~, \\ (1-p)^2\, n\, p^{n-1} & p=q ~, \end{cases}\) and \(w(n)=\sum _{m=n}^{\infty } F(m)\)16. Note that both hidden Markov models generate an identical infinite-order Markov process: if one sees a past of all 0s, one has not “synchronized” to the internal hidden state of the \(\epsilon\)-machine. Therefore, there is not a complete one-to-one correspondence between sequences of observed symbols and hidden states.

In a way, a nonunifilar HMC is little more than a process generator18 for which the equivalent \(\epsilon\)-machine presentation has an infinite number of causal states. In another sense, \(\epsilon\)-machines are a very special type of HMC generator since the \(\epsilon\)-machine’s causal states actually represent clusters of pasts that have the same conditional probability distribution over futures14. As a result, the causal states and so \(\epsilon\)-machines are predictive.

Consider observing a process generated by a particular \(\epsilon\)-machine and becoming synchronized, so that you know the hidden state. (Synchronization happens with probability 1, though not for every realization17, as the nonunifilar HMC example just illustrated.) Then you can build a prediction algorithm based on the known hidden state. The result, in fact, is the best possible prediction algorithm. Moreover, it is simple: when synchronized to hidden state \(\sigma\), you predict the symbol \(\arg \max _x p(x|\sigma )\).

This has one key consequence in our calibrating neural networks: the minimal attainable time-averaged probability \(P_\text {e}^\text {min}\) of error in predicting the next symbol can be explicitly calculated as:

$$\begin{aligned} P_\text {e}^\text {min}= \sum _{\sigma } \left[ 1 - \max _x p(x|\sigma ) \right] p(\sigma ) ~. \end{aligned}$$
(1)

(The following considers binary alphabets, so that \(1 - \max _x p(x|\sigma ) = \min _x p(x|\sigma )\).) We are also able to calculate the Shannon entropy rate \(h_\mu\) directly from the \(\epsilon\)-machine14 via:

$$\begin{aligned} h_\mu&= H[X_0|\overleftarrow{X}_0] \nonumber \\&= - \sum _{\sigma } p(\sigma ) \sum _x p(x|\sigma ) \log p(x|\sigma )~. \end{aligned}$$
(2)

In contrast, determining \(h_\mu\) for processes generated by nonunifilar HMCs was, until recently, intractable. The key advance was solving Blackwell’s integral equation for these processes19,20.
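To make these calculations concrete, the following minimal Python sketch evaluates Eqs. (1) and (2) (in nats) for an \(\epsilon\)-machine specified by its emission probabilities \(p(x|\sigma )\) and its unifilar transition table. The array layout (`emit_probs`, `next_state`) and the small two-state example are illustrative conventions adopted here, not taken from the surveys reported below.

```python
import numpy as np

def stationary_distribution(emit_probs, next_state):
    """Stationary distribution p(sigma) of the causal-state Markov chain.

    emit_probs[s, x] : p(x | sigma = s)
    next_state[s, x] : unifilar successor state for (sigma = s, symbol = x)
    """
    n = emit_probs.shape[0]
    T = np.zeros((n, n))                       # state-to-state transition matrix
    for s in range(n):
        for x in range(emit_probs.shape[1]):
            T[s, next_state[s, x]] += emit_probs[s, x]
    evals, evecs = np.linalg.eig(T.T)          # left eigenvector for eigenvalue 1
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return pi / pi.sum()

def optimal_error_and_entropy_rate(emit_probs, next_state):
    """Eq. (1) and Eq. (2): P_e^min and h_mu (in nats) from the epsilon-machine."""
    pi = stationary_distribution(emit_probs, next_state)
    pe_min = np.sum(pi * (1.0 - emit_probs.max(axis=1)))
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(emit_probs > 0, emit_probs * np.log(emit_probs), 0.0)
    h_mu = -np.sum(pi * plogp.sum(axis=1))
    return pe_min, h_mu

# Illustrative two-state unifilar machine (numbers chosen only for the example).
emit = np.array([[0.6, 0.4],     # state 0: p(0|s) = 0.6, p(1|s) = 0.4
                 [0.2, 0.8]])    # state 1
nxt = np.array([[0, 1],          # unifilar transitions
                [1, 0]])
print(optimal_error_and_entropy_rate(emit, nxt))   # approximately (0.333, 0.615)
```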

Recurrent neural networks

Let \(s_t\in \mathbb {R}^d\) be the state of the learning system—perhaps a sensor—and let \(x_t\in \mathbb {R}^N\) be a time-varying N-dimensional input, both at time t. Discrete-time recurrent neural networks (RNN) are input-driven dynamical systems of the form:

$$\begin{aligned} s_{t+1} = f_{\theta }(s_t,x_t) \end{aligned}$$
(3)

where \(f_{\theta }(\cdot ,\cdot )\) is a function of both sensor state \(s_t\) and input \(x_t\) with parameters \(\theta\). See Fig. 2. These parameters are weights that govern how \(s_t\) and \(x_t\) affect future sensor states \(s_{t+1}, s_{t+2}, \ldots\). Alternative RNN architectures result from different choices of f; see below. Long Short-Term Memory Units (LSTMs) and Gated Recurrent Units (GRUs) are common choices when optimizing prediction of the input. For simplicity, the following considers scalar time series: \(N = 1\).

Generally, RNNs are hard to train, both in terms of required data sample size and compute resources (memory and time)21. RCs2,22, also known as echo state networks23 and liquid state machines24,25, were introduced to address these challenges.

Figure 2

A recurrent neural network (RNN) for which the future state of the recurrent node depends on its previous state and the current input. The present state of the recurrent node is then used to make a prediction.

RCs involve two components. The first is a reservoir—an input-dependent dynamical system with high dimensionality d as in Eq. (3). And, the second is a readout layer \(\hat{u}\)—a simple function of the hidden reservoir state. Here, the readout layer typically employs logistic regression:

$$\begin{aligned} P(\hat{u}|s) = \frac{e^{a_{\hat{u}}^{\top }s+b_{\hat{u}}}}{\sum _{\hat{u}'}e^{a_{\hat{u}'}^{\top }s+b_{\hat{u}'}}}~, \end{aligned}$$

with regression parameters \(a_{\hat{u}}\) and \(b_{\hat{u}}\). To model binary-valued processes, our focus here, we have:

$$\begin{aligned} P(\hat{u}=1|s) = \frac{e^{a^{\top }s+b}}{1+e^{a^{\top }s+b}}. \end{aligned}$$

The regression parameters are easily trained and can include regularization if desired. Note that while s serves as the input to the logistic regression, one can move to a nonlinear readout by also feeding in the second-order terms \(ss^{\top }\).
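As an illustration of this readout, the sketch below fits the logistic-regression probabilities to a record of reservoir states and next symbols using scikit-learn. The helper names and the upper-triangle construction of the second-order terms are conventions assumed here for the sketch, not a prescribed implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def quadratic_features(states):
    """Append the pairwise products from s s^T (upper triangle) to each state."""
    iu = np.triu_indices(states.shape[1])
    quad = np.einsum("ti,tj->tij", states, states)[:, iu[0], iu[1]]
    return np.hstack([states, quad])

def fit_readout(states, targets, nonlinear=False):
    """Fit the readout P(u=1|s) = sigmoid(a^T s + b) by logistic regression.

    states  : (T, d) array of reservoir states s_t
    targets : (T,)   array of the binary symbols to be predicted, x_{t+1}
    """
    feats = quadratic_features(states) if nonlinear else states
    return LogisticRegression(max_iter=1000).fit(feats, targets)

# The prediction-error rate on held-out data then estimates P_e for this readout:
#   feats    = quadratic_features(test_states) if nonlinear else test_states
#   P_e_hat  = np.mean(readout.predict(feats) != test_targets)
```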

The following compares several types of RNNs: ‘typical’ RCs, ‘next generation’ RCs, and LSTMs.

‘Typical’ RCs

In the following, as a model of typical RCs, a subset of RC nodes are updated linearly, while others are updated according to a \(\tanh (\cdot )\) activation function. Let \(s = \begin{pmatrix} s^{\text {nl}} \\ s^{\text {l}} \end{pmatrix}\). We have:

$$\begin{aligned} s^{\text {nl}}_{t+1} = \tanh \bigl ( W^{\text {nl}} s^{\text {nl}}_t+v^{\text {nl}} x_t + b^{\text {nl}} \bigr )~, \end{aligned}$$

and

$$\begin{aligned} s^{\text {l}}_{t+1} = W^{\text {l}} s^{\text {l}}_t + v^{\text {l}} x_t + b^{\text {l}} , \end{aligned}$$

where \(v^\text {l,nl}\) controls how strongly the input affects the state, \(b^{\text {l,nl}}\) is a bias term, and \(W^\text {l,nl}\) are the weight matrices.

The weight matrices \(W^\text {l,nl}\) are chosen as follows: the zero entries follow the sparsity pattern of a small-world network with density 0.1 and rewiring probability \(\beta =0.1\); the nonzero entries are drawn from the standard normal distribution; and the matrices are rescaled to a spectral radius of \(\sim 0.99\) to guarantee the RC fading-memory condition23. We tried different recipes for choosing which nodes are connected (small-world networks with varying \(\beta\) and density), the distribution from which the weights are drawn (normal versus uniform), and whether or not a bias term is included. These variations had virtually no effect on the results. The only choice that clearly made a difference was whether the readout was linear (logistic regression) or nonlinear (logistic regression on a concatenation of s and \(ss^{\top }\)). These two cases are clearly demarcated in the appropriate figure. No bias term was used in the figure shown.
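A minimal sketch of this construction follows, assuming a Watts–Strogatz small-world graph (via networkx) for the sparsity pattern and the rescale-to-spectral-radius recipe just described. The mapping from density to the neighbor count k, the seed handling, and the omission of bias terms are assumptions of the sketch rather than the exact recipe used to produce the figures.

```python
import numpy as np
import networkx as nx

def make_reservoir_weights(n, density=0.1, beta=0.1, spectral_radius=0.99, seed=0):
    """Sparse small-world weight matrix rescaled to the target spectral radius."""
    rng = np.random.default_rng(seed)
    k = max(2, int(density * n))                        # ~density fraction of neighbors
    g = nx.watts_strogatz_graph(n, k, beta, seed=seed)  # small-world sparsity pattern
    W = nx.to_numpy_array(g) * rng.standard_normal((n, n))
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    return W

def run_reservoir(W_nl, W_l, v_nl, v_l, inputs):
    """Drive the mixed tanh/linear reservoir with a binary input sequence (no bias)."""
    s_nl = np.zeros(W_nl.shape[0])
    s_l = np.zeros(W_l.shape[0])
    states = []
    for x in inputs:
        s_nl = np.tanh(W_nl @ s_nl + v_nl * x)          # nonlinear nodes
        s_l = W_l @ s_l + v_l * x                       # linear nodes
        states.append(np.concatenate([s_nl, s_l]))
    return np.array(states)
```

The collected states can then be passed, together with the next symbols, to the logistic readout sketched above.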

‘Next generation’ RCs

Next-generation RCs employ a simple reservoir that tracks a finite window of input history together with a more complex readout layer5, to improve accuracy beyond that afforded by a standard RC’s universal approximation property. The reservoir records inputs from the last m timesteps and then uses a readout layer consisting of polynomials of arbitrary order. Technically, next-generation RCs are a subset of general RCs, in that a reservoir can be made into a shift register that records the last m timesteps. As introduced in Ref. 5, next-generation RCs solve a regression task, but they can easily be modified to solve classification tasks. The following simply takes second-order polynomial combinations of the last m timesteps and uses those as features for the logistic regression layer. In other words, let \(s_t = (x_t,x_{t-1},\ldots,x_{t-m+1})\), a column vector, be the state of the reservoir; then we use \(s_t\) and the entries of \(s_t s_t^{\top }\) as input to the logistic regression.
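A sketch of this feature construction follows: for each time t it builds the window \(s_t\) of the last m inputs together with the upper triangle of \(s_t s_t^{\top }\), which can then be passed to a logistic readout such as the one sketched earlier. The function name and window bookkeeping are illustrative conventions.

```python
import numpy as np

def ngrc_features(x, m):
    """Next-generation RC features: the last m inputs plus all pairwise products.

    x : 1-D array of binary inputs.  The row built at time t is used to predict
    x[t+1], so the matching target array is x[m:].
    """
    rows = []
    iu = np.triu_indices(m)
    for t in range(m - 1, len(x) - 1):
        s = x[t - m + 1 : t + 1][::-1]       # s_t = (x_t, x_{t-1}, ..., x_{t-m+1})
        quad = np.outer(s, s)[iu]            # second-order monomials from s_t s_t^T
        rows.append(np.concatenate([s, quad]))
    return np.array(rows)

# Example pairing with the logistic readout sketched earlier:
#   feats   = ngrc_features(x, m=10)
#   targets = x[10:]
#   readout = fit_readout(feats, targets)
```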

LSTMs

In contrast, long short-term memory networks (LSTMs)9 optimize \(f_{\theta }\) both for ease of training and for retaining memory. There, s is a combination of several hidden states, and the update equations for the network are given in Ref. 9. An LSTM’s essential components are linearly-updated memory cells, which make training easier and avoid exploding or vanishing gradients, and a forget gate, which may improve performance by allowing the network to access a range of timescales4.
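For concreteness, a minimal PyTorch sketch of an LSTM next-symbol predictor trained by backpropagation through time is shown below. The hidden size, optimizer, and training loop are generic assumptions for illustration, not the exact configuration used in the experiments reported here.

```python
import torch
import torch.nn as nn

class NextSymbolLSTM(nn.Module):
    """Minimal LSTM next-symbol predictor: logits for x_{t+1} from inputs x_{<=t}."""
    def __init__(self, hidden_size=110):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.readout = nn.Linear(hidden_size, 1)

    def forward(self, x):                      # x: (batch, T, 1) binary inputs as floats
        h, _ = self.lstm(x)
        return self.readout(h).squeeze(-1)     # (batch, T) logits for the next symbol

def train(model, x, epochs=50, lr=1e-3):
    """x: (batch, T, 1) float tensor; trains to predict x[:, t+1] from x[:, :t+1]."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        logits = model(x[:, :-1, :])           # predictions for steps 1..T-1
        loss = loss_fn(logits, x[:, 1:, 0])    # targets are the shifted inputs
        opt.zero_grad()
        loss.backward()                        # backpropagation through time
        opt.step()
    return model
```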

Prediction error bounds

No matter the RNN, the conditional entropy of the next input symbol \(X_t\) given the learning system’s state \(S_t\),

$$\begin{aligned} H[X_0|S_0] = - \sum _{s_t} p(s_t) \sum _{x_t} p(x_t|s_t) \log p(x_t|s_t) ~, \end{aligned}$$

places a fundamental upper bound on the RNN prediction performance through Fano’s inequality:

$$\begin{aligned} H[X_0|S_0] \le H_\text {b}\left( P_\text {e}\right) + P_\text {e}\log \left( |\mathcal {A}|-1\right) ~. \end{aligned}$$

In this, \(P_\text {e}\) is the time-averaged probability of making an error in predicting the next symbol \(x_t\) from the RNN’s state \(s_t\), and \(H_\text {b}\) is the binary entropy function. We have also invoked stationarity of the time series to remove the dependence on t in the steady-state operation of the RNN. In particular, for a binary process, where \(|\mathcal {A}| = 2\):

$$\begin{aligned} P_\text {e}\ge H_\text {b}^{-1}[ H(X_0 | S_0)] ~, \end{aligned}$$

where \(H_\text {b}^{-1}\), defined on the domain [0, 1], is the inverse of \(H_\text {b}\) on its monotonically increasing domain [0, 1/2].
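Since \(H_\text {b}\) has no elementary closed-form inverse, the bound is conveniently evaluated numerically. The sketch below inverts the binary entropy (taken in nats, for consistency with Eq. (2)) by bisection on [0, 1/2]; the function names are illustrative.

```python
import numpy as np

def binary_entropy(p):
    """H_b(p) in nats."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

def fano_error_lower_bound(h_cond, tol=1e-12):
    """Invert H_b on [0, 1/2] by bisection: smallest P_e consistent with H[X_0|S_0]."""
    h_cond = min(h_cond, np.log(2))     # binary conditional entropy cannot exceed ln 2
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_entropy(mid) < h_cond:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Example use: excess error when a predictor retains H[X_0|S_0] = h_mu(m) nats,
# relative to the minimal error P_e^min computed from the epsilon-machine:
#   pe_bound     = fano_error_lower_bound(h_mu_m)
#   pct_increase = 100 * max(pe_bound - pe_min, 0) / pe_min
```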

In other words, the measure of RNN performance is given by a function of \(H[X_0|S_0]\) that lower bounds \(P_\text {e}\), coupled with the minimal attainable probability of error calculable directly from the \(\epsilon\)-machine, as described in Section “Background”. The lower the model’s conditional entropy, the better the prediction performance. For any RNN, due to the Markov chain \(S_0\rightarrow \overleftarrow{X}_0\rightarrow X_0\), this conditional entropy cannot be lower than \(h_\mu\)—the entropy rate:

$$\begin{aligned} H[X_0|S_0]&\ge H[X_0|\overleftarrow{X}_0] \\&= h_\mu ~. \end{aligned}$$

Notably, the next-generation RC takes into account only the last m timesteps, so that:

$$\begin{aligned} H[X_0|S_0]&= H[X_0|\overleftarrow{X}_0^m] \\&= h_{\mu }(m), \end{aligned}$$

where the myopic entropy rate \(h_\mu (m) \ge h_\mu\) is discussed at length in Ref. 26.

Results

We are now ready to calibrate RNN and RC performance on the task of time-series prediction. First, we survey the performance of RCs when predicting a random sample of typical complex stochastic processes. Second, we explore RC performance on an “interesting” complex process—one from the family of memoryful renewal processes—hidden semi-Markov processes with infinite Markov order. Third and finally, we compare the prediction performance of RCs, next-generation RCs, and LSTM RNNs on a large suite of complex stochastic processes.

Limits of next-generation RCs predicting “typical” processes

We construct exemplars of “typical” complex processes by sampling the space of \(\epsilon\)-machines as follows:

  • An arbitrarily large number of candidate states is chosen for the HMC stochastic process generator. This parallels the fact that most processes have an infinite number of causal states15,27;

  • For each (\(\sigma\), x) pair, a labeled transition \(\sigma \xrightarrow { \, x \, } \sigma '\) is randomly generated, with the destination state \(\sigma '\) chosen from the uniform distribution over candidate states;

  • Symbol emission probabilities \(p(x|\sigma )\) are randomly generated from a Dirichlet distribution with uniform concentration parameter \(\alpha\);

  • We retain the largest recurrent component of this construction as our sample \(\epsilon\)-machine.

Numerically, we find that approximately \(20\%\) of the candidate states become transient in the constructed directed network; these are trimmed from the final \(\epsilon\)-machine. The fraction of transient states clusters tightly around \(20\%\) as the number of candidates grows large. (Note that this is a topological feature, independent of \(\alpha\).) Moreover, the candidate network typically has a single recurrent component. Accordingly, the resulting causal states typically number about \(80\%\) of the candidate states as the number of candidate states grows large.
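A minimal sketch of this sampling procedure is given below. It returns the sampled machine in the same (`emit_probs`, `next_state`) format used earlier and identifies recurrent components as strongly connected components with no outgoing transitions, using networkx; retaining the largest such component follows the construction above, while the helper names and bookkeeping are illustrative.

```python
import numpy as np
import networkx as nx

def random_epsilon_machine(n_candidates, alpha=1.0, seed=0):
    """Sample a random binary unifilar HMC as in the construction described above.

    Returns (emit_probs, next_state) restricted to the largest closed recurrent
    component of the candidate transition graph.
    """
    rng = np.random.default_rng(seed)
    next_state = rng.integers(n_candidates, size=(n_candidates, 2))   # sigma' for (sigma, x)
    emit_probs = rng.dirichlet([alpha, alpha], size=n_candidates)     # p(x | sigma)

    g = nx.DiGraph()
    g.add_nodes_from(range(n_candidates))
    for s in range(n_candidates):
        for x in (0, 1):
            g.add_edge(s, next_state[s, x])

    # Recurrent components: strongly connected components that no transition leaves.
    closed = [c for c in nx.strongly_connected_components(g)
              if all(next_state[s, x] in c for s in c for x in (0, 1))]
    keep = sorted(max(closed, key=len))
    index = {s: i for i, s in enumerate(keep)}
    emit = emit_probs[keep]
    nxt = np.array([[index[next_state[s, x]] for x in (0, 1)] for s in keep])
    return emit, nxt
```

The output can be passed directly to `optimal_error_and_entropy_rate` from the earlier sketch to obtain the ground-truth \(P_\text {e}^\text {min}\) and \(h_\mu\) for each sampled task.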

The construction results in a finite-state unifilar HMC or, equivalently, a presentation that generates a process with a finite number of causal states. Interestingly, though, the process generated is usually infinite-order Markov28. This can be seen from the mixed-state presentation that describes optimal prediction20,26, whose transient states of uncertainty generically maintain nonzero probability even after arbitrarily long observation times. (This is typical even when the mixed-state presentation has a finite number of transient states. Adding a further challenge to the task of prediction, though, the mixed-state presentation typically has infinitely many transient states.)

An expression for the myopic entropy rate \(h_\mu (m)\) was developed in Ref. 26 that allows one to compute \(h_\mu (m)\) exactly from the generating \(\epsilon\)-machine’s mixed-state presentation. However, for binary-valued processes it was more straightforward to explicitly enumerate possible length-m futures. Note, though, that this is impractical for the trajectory lengths used here if the emitted-symbol alphabet is larger than two. Figure 3(top) shows \(h_\mu (m)\) as a function of m for \(\alpha = 1\). Figure 3(bottom) shows the percentage increase in the \(P_\text {e}\) lower bound for next-generation RCs above and beyond the minimal \(P_\text {e}^\text {min}\), using the prediction-error lower bound given by Fano’s inequality in Section “Prediction error bounds”.
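For a binary alphabet, the enumeration can be organized by writing \(h_\mu (m)\) as a difference of block entropies, \(h_\mu (m) = H(m+1) - H(m)\), with word probabilities obtained from the \(\epsilon\)-machine via unifilarity. The sketch below (reusing the stationary distribution from the earlier sketch) is a direct, unoptimized implementation whose cost is exponential in m, as noted above.

```python
import numpy as np
from itertools import product

def word_probability(word, emit_probs, next_state, pi):
    """p(w) = sum_sigma pi(sigma) p(w | sigma), using unifilarity."""
    p_total = 0.0
    for s0, p0 in enumerate(pi):
        p, s = p0, s0
        for x in word:
            p *= emit_probs[s, x]
            if p == 0.0:
                break
            s = next_state[s, x]
        p_total += p
    return p_total

def block_entropy(L, emit_probs, next_state, pi):
    """H(L) = -sum_w p(w) log p(w) over all binary words of length L (in nats)."""
    H = 0.0
    for word in product((0, 1), repeat=L):
        p = word_probability(word, emit_probs, next_state, pi)
        if p > 0:
            H -= p * np.log(p)
    return H

def myopic_entropy_rate(m, emit_probs, next_state, pi):
    """h_mu(m): uncertainty in the next symbol given the last m symbols."""
    # pi = stationary_distribution(emit_probs, next_state), from the earlier sketch.
    return (block_entropy(m + 1, emit_probs, next_state, pi)
            - block_entropy(m, emit_probs, next_state, pi))
```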

Across this family of stochastic processes, typical values of the myopic entropy rate \(h_\mu (m)\) and the entropy rate \(h_\mu\) exhibit a concentration of measure as the number of causal states grows large, with values clustering around 1/2 nat (not shown here). Typical values of the percentage increase in \(P_\text {e}\) above and beyond the minimal \(P_\text {e}^\text {min}\) show a similar concentration of measure, and the minimum probability of error \(P_\text {e}^\text {min}\) clusters around 1/4 (not shown here), reminiscent of the process-survey results reported in Ref. 29.

A quick plausibility argument, using the formulae in Section “Background”, suggests that there is a genuine concentration of measure for these two quantities. Roughly speaking, when the \(\epsilon\)-machine generator has a large number of causal states, the transitions from any particular state have little effect on the stationary state distribution \(p(\sigma )\). Hence, \(h_\mu\) and \(P_\text {e}^\text {min}\) are roughly sums of N i.i.d. random variables. The Central Limit Theorem then dictates that, for concentration parameter \(\alpha =1\), \(h_\mu\) estimates should cluster around 1/2 nat and \(P_\text {e}^\text {min}\) should cluster around 1/4. In contrast, \(H[X_0]\) has the larger expected value of \(\ln (2)\) nats, which becomes typical as the number of causal states grows large. The gradual decay of uncertainty from \(\ln (2)\) to 1/2 nat per symbol can only be achieved by predictors that (at least implicitly) synchronize to the latent state of the source by distinguishing long histories.
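These per-state expectations are easy to check numerically: for \(\alpha = 1\) the emission distribution (p, 1-p) of a state has p uniform on [0, 1], whose expected entropy is 1/2 nat and whose expected minimum is 1/4. A short Monte Carlo sketch of the two per-state averages, which the argument above transfers to \(h_\mu\) and \(P_\text {e}^\text {min}\), follows.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random(1_000_000)                            # alpha = 1: p is uniform on [0, 1)

entropy = -p * np.log(p) - (1 - p) * np.log1p(-p)    # per-state symbol entropy in nats
min_prob = np.minimum(p, 1 - p)                      # per-state minimal error contribution

print(entropy.mean())    # ~0.5 nat, matching the h_mu concentration value
print(min_prob.mean())   # ~0.25, matching the P_e^min concentration value
```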

Figure 3

(Top) Finite-length entropy rate \(h_\mu (m)\) in nats for typical random unifilar HMCs constructed with 30 (blue), 300 (orange), and 3000 (green) candidate states, as a function of the number of input timesteps m. (Bottom) Increase of the lower bound on the probability \(P_\text {e}\) of error from Fano’s inequality, above and beyond \(P_\text {e}^\text {min}\), for the same random unifilar HMCs as a function of the number of input timesteps m. Since occasionally the lower bound on this quantity fell below \(0\%\), the maximum of \(0\%\) and the quantity is used. \(90\%\) confidence intervals are shown on both graphs.

These typical processes are surprisingly non-Markovian, exhibiting infinite-range correlation. A process’ degree of non-Markovianity is reflected in how long it takes for \(h_\mu (m)\) to converge to \(h_\mu\): how large must m be to synchronize? Even after observing \(m = 15\) symbols, these processes (with a finite but large number of causal states) are still \(\sim 0.2\) nats away from synchronization. This convergence failure contributes to a minimal probability of error that cannot be circumvented no matter the cleverness in choosing the RC nonlinear readout function.

Limits of next-generation RCs predicting an “interesting” process

References11,12 define complex and thus “interesting” processes as those that have infinite mutual information between past and future—the so-called “predictive information” or “excess entropy”. The timescales of predictability are revealed through the growth of \(I_\text {pred}(m)\) as longer length-m blocks of history and future are taken into account. The predictive information is:

$$\begin{aligned} I_\text {pred}(m) = \sum _{l=0}^m \left[ h_\mu (l) - h_\mu \right] ~. \end{aligned}$$

And so, its growth rate is:

$$\begin{aligned} I_\text {pred}(m+1) - I_\text {pred}(m)&= \!\! \sum _{l=0}^{m+1} \!\! \left[ h_\mu (l)-h_\mu \right] - \!\! \sum _{l=0}^m \!\! \left[ h_\mu (l)-h_\mu \right] \\&= h_\mu (m+1)-h_\mu ~. \end{aligned}$$

That is:

$$\begin{aligned} h_\mu (m+1)&= h_\mu + I_\text {pred}(m+1) - I_\text {pred}(m)~. \end{aligned}$$

The gap between \(h_\mu (m+1)\) and \(h_\mu\) quantifies the excess uncertainty in the next observable, due to observation of only a finite-length past. This is governed by \(I_\text {pred}(m+1) - I_\text {pred}(m)\) in discrete-time processes or, analogously, by \(d I_\text {pred}(t) / dt\) in continuous-time processes.

What constitutes an acceptable increase in uncertainty above and beyond \(h_{\mu }\)? The intuition for this follows from inverting Fano’s inequality to determine the additional conditional entropy implied by a substantial increase in the probability of error.

To illustrate this, we turn to an interesting process that has a very slow gain in predictive information—the discrete-time renewal process shown in Fig. 1(Bottom), with survival function:

$$\begin{aligned} w(n) = \begin{cases} 1 & n = 0 ~, \\ n^{-\beta } & n\ge 1 ~. \end{cases} \end{aligned}$$

Discrete- and continuous-time renewal processes are encountered broadly—in the physical, chemical, biological, and social sciences and in engineering—as sequences of discrete events consisting of an event type and an event duration or magnitude. An example critical to infrastructure design occurs in the geophysics of crustal plate tectonics, where the event types are major earthquakes tagged with duration time, time between their occurrence, and an approximate or continuous Richter magnitude30. Another example is seen in the history of reversals of the earth’s geomagnetic field31. In physical chemistry they appear in single-molecule spectroscopy which reveals molecular dynamics as hops between conformational states that persist for randomly distributed durations32,33. A familiar example from neuroscience is found in the spike trains generated by neurons that consist of spike-no-spike event types separated by interspike intervals34. Finally, a growing set of renewal processes appear in the quantitative social sciences, in which human communication events and their durations are monitored as signals of emergent coordination or competition35.

At \(\beta =1\), this discrete-time renewal process has \(I_\text {pred}(m) \sim \log \log m\)36. The minimal achievable probability of error is \(P_\text {e}^\text {min}= 0.0001\). Due to an additional \(\sim 0.1\) nats from not using an infinite-order memory trace and instead using only the last \(m = 5\) symbols, the probability-of-error lower bound jumps to 0.02. This is a percentage increase in the probability of error of \(10^4\%\) at \(m = 11\) timesteps—about two and a half orders of magnitude worse than that of a typical complex process. We emphasize that these are fundamental bounds that no amount of cleverness can circumvent. Whatever nonlinear readout function is chosen for a next-generation RC, the process’ inherent complexity demands an infinite-order memory trace for relatively good prediction.
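To generate sample paths from this process for benchmarking, one can draw interevent counts from \(F(n) = w(n) - w(n+1)\) and encode each count as a run of 0s terminated by a single 1. The sketch below assumes this encoding and imposes a finite cutoff on the heavy-tailed interevent distribution; both are practical assumptions of the sketch rather than part of the process definition.

```python
import numpy as np

def sample_fractal_renewal(n_events, beta=1.0, n_max=10_000, seed=0):
    """Sample a binary sequence from the renewal process with survival w(n) = n**-beta.

    Interevent counts follow F(n) = w(n) - w(n+1) for n >= 1, truncated at n_max
    (the beta = 1 tail is too heavy to sample without a cutoff).  Each event is
    encoded as n zeros followed by a single 1, an encoding assumed here.
    """
    rng = np.random.default_rng(seed)
    n = np.arange(1, n_max + 1)
    w = n.astype(float) ** -beta
    F = w - np.append(w[1:], 0.0)          # truncation: remaining tail mass sits at n_max
    F /= F.sum()
    counts = rng.choice(n, size=n_events, p=F)
    pieces = [np.append(np.zeros(c, dtype=int), 1) for c in counts]
    return np.concatenate(pieces)
```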

Figure 4

Predicting a discrete-time fractal renewal process with infinite excess entropy: (Top) Percentage increase in the lower bound for the probability of error \(P_\text {e}\) above and beyond the minimum using Fano’s inequality as a function of time steps m. (Bottom) Percentage increase in the lower bound for the probability of error \(P_\text {e}\) above and beyond the minimum using Fano’s inequality as a function of time steps m for a process such as that in Ref. 36.

References 37,38 constructed an HMC that ergodically39 generated \(I_\text {pred}(m) \sim \log m\). For this process:

$$\begin{aligned} h_{\mu }(m+1) \approx h_{\mu } + \frac{1}{m}~. \end{aligned}$$

Consider a process with a “typical” entropy rate of 0.5 nats. We can invert Fano’s inequality—which is not necessarily tight—to find a lower bound on the probability of error attainable with an infinite memory trace. Taking this lower bound as the baseline, the bound on the percentage increase of the probability of error above and beyond \(P_\text {e}^\text {min}\) decays to \(10\%\) only when the RC uses more than \(m = 1000\) symbols. See Fig. 4(bottom).

RCs, next-generation RCs, and state-of-the-art RNNs predicting highly non-Markovian processes

Knowing that there are fundamental limits to the next-generation RC’s ability to predict processes forces the question: how well do next-generation RCs actually do at predicting these processes when using second-order polynomial readout? Moreover, do more traditional RCs and state-of-the-art RNNs do any better?

In all experiments, we are careful to hold the number of input nodes to the readout constant for a fair comparison.

We now compare typical RCs with linear readout, typical RCs with nonlinear readout (second-order polynomial), and LSTMs to next-generation RCs on prediction tasks generated by the large \(\epsilon\)-machines of Section “Limits of next-generation RCs predicting “typical” processes”. Although RCs with nonlinear readout and many more nodes outperform next-generation RCs, Fig. 5 shows that when the number of readout nodes is held constant, next-generation RCs are indeed the best RC possible. This is expected from Ref. 5. LSTMs beat all reservoir computers, however, as one can see from the red violin plot of Fig. 5 settling primarily on the lowest possible values of \((P_e-P_e^{\min })/P_e^{\min }\times 100\%\). This is somewhat expected since LSTMs optimize both the reservoir and readout, although the fact that they do is a testament to the success of backpropagation through time (BPTT)40.

Figure 5

Percentage increase in the probability of error of trained next generation RCs (green), trained RCs with linear readout (orange), trained RCs with nonlinear readout (blue), and trained LSTMs (red) above and beyond \(P_\text {e}^\text {min}\) for 100 \(\epsilon\)-machines with 300 candidate states. The next-generation RC has 10 timesteps as input; the typical RC with nonlinear readout has 10 nodes with 5 linear nodes; the typical RC with linear readout has 110 nodes with 10 linear nodes; and the LSTM has 110 nodes. The number of nodes has been chosen so that the number of readout nodes is equivalent across machines. Note that these nearly saturate the lower bound provided by Fano’s inequality.

Figure 5’s surprise is that all RNNs perform quite poorly, incurring at least a \(\sim 50\%\) increase in the probability of error above and beyond optimal, as can be seen from the surprisingly large values on the y-axis; for the next-generation RC this is achieved at \(m=10\). This nearly saturates the lower bound on this percentage increase in the probability of error placed by Fano’s inequality.

Conclusion

The striking advances made by RNNs in predicting a very wide range of systems—from language to climate—have not been accompanied by comparably careful assessments of how much structure they fail to predict. Here, we introduced and illustrated such a calibration.

We addressed the task of leveraging past inputs to forecast future inputs, for any stochastic process. We showed that \(P_\text {e}^\text {min}\)—the minimal time-averaged probability of incorrectly guessing the next input, minimized over all possible strategies that can operate on historical input—can be directly calculated from a data source’s generating \(\epsilon\)-machine. This provides a benchmark for all possible prediction algorithms. We compared this optimal predictive performance with a lower bound on various RNNs’ \(P_\text {e}\)—the actual time-averaged probability of incorrectly guessing the next input, given the state of the model. We found that so-called next-generation RCs are fundamentally limited in their performance. And we showed that this cannot be improved on via clever readout nonlinearities.

In our comparison of various prediction models, we tested next-generation RCs with highly-correlated inputs that are challenging to predict. This input data was generated from large \(\epsilon\)-machines. The \(\epsilon\)-machine is itself the optimal prediction algorithm, and the minimal probability of error for these data is known in closed form. Our extensive surveys showed, surprisingly, that models ranging from RCs with linear readout to next-generation RCs of reasonable size to LSTMs all have a probability of prediction error that is \(\sim 50\%\) greater than the theoretical minimum.

The fact that simple large random \(\epsilon\)-machines generate such challenging stimuli might be a surprise. Recently, though, it was reported that tractable \(\epsilon\)-machines can lead to “interesting” processes11,12. We showed that these processes provide even more of a challenge for next-generation RCs.

At first, it may seem that this new calibration is of limited use, both theoretically and practically. For instance, it is perhaps not surprising that RCs, NGRCs, and maybe even LSTMs perform poorly on highly non-Markovian processes such as the ones used here. However, with N nodes one can find N predictive features that potentially reach far back into the past, even though one might naively think that the N features correspond to the last N time points. A second objection is that the processes used here are not of general interest, since large random \(\epsilon\)-machines do not match the structure of real-world signals. However, one can manufacture \(\epsilon\)-machines that do have the structure of real-world signals, as any real-world signal can be represented by an \(\epsilon\)-machine. Then, the calibration here can improve an RC’s or RNN’s ability to predict real-world signals. This potential research program extends even to nonstationary real-world data. Both natural language and natural data from the physical world can be understood as stochastic processes that, in principle, have some \(\epsilon\)-machine representation. While we focused on stationary processes in this manuscript, nonstationary processes can be accommodated via simple adaptations of our methods, in which the unifilar HMC of the process would have a unique start state and possible absorbing states.

Finally, next-generation RCs—that do indeed outperform typical RCs with the same number of readout nodes—are fundamentally limited in prediction performance by the nature of their limited memory traces. We suggest that effort should be expended to optimize standard RCs that do not suffer from the same fundamental limitations—so that memory becomes properly incorporated and typical performance improves.