Reservoir Computing Beyond Memory-Nonlinearity Trade-off

Reservoir computing is a brain-inspired machine learning framework that employs a signal-driven dynamical system, in particular harnessing common-signal-induced synchronization, which is a widely observed nonlinear phenomenon. A basic understanding of the working principle of reservoir computing can be expected to shed light on how information is stored and processed in nonlinear dynamical systems, potentially leading to progress in a broad range of nonlinear sciences. As a first step toward this goal, from the viewpoint of nonlinear physics and information theory, we study the memory-nonlinearity trade-off uncovered by Dambre et al. (2012). Focusing on a variational equation, we clarify a dynamical mechanism behind the trade-off, which illustrates why nonlinear dynamics in general degrades the memory stored in a dynamical system. Moreover, based on the trade-off, we propose a mixture reservoir endowed with both linear and nonlinear dynamics and show that it improves the performance of information processing. Interestingly, for some tasks, significant improvements are observed by adding a small amount of linear dynamics to the nonlinear dynamical system. By employing the echo state network model, the effect of the mixture reservoir is numerically verified for a simple function approximation task and for more complex tasks.

at best for a reservoir subject to saturating nonlinearity [21] (see ref. 22 for the relation between the two memory capacities, MC and $J_{\mathrm{tot}}$). The best memory lifetime achieved by a nonlinear network reported so far is O(N/log N), i.e., nearly extensive, as rigorously estimated by Toyoizumi, where the nonlinearity is harnessed for error correction [23]. In summary, extensive memory capacity can be realized by a linear reservoir, and memory capacity seems to be degraded by introducing nonlinearity into the reservoir dynamics.
Previous studies suggest that nonlinear dynamics might degrade the memory capacity; however, nonlinear dynamics is clearly important for RC. For instance, the so-called linearly inseparable problem [24], which often appears in practical tasks, cannot be solved without a nonlinear transformation of the input signal. In other words, the nonlinear dynamics of the reservoir is essential for general information processing. Therefore, it can be expected that there exists a trade-off between linearity and nonlinearity in reservoir dynamics, which are required for memory capacity and for general information processing, respectively. In their seminal paper [25], Dambre et al. introduced a computational capacity of a dynamical system, which is a natural generalization of the linear memory capacity to the nonlinear one, by employing a complete orthonormal basis of a function space. Importantly, by using the computational capacity, they suggested that there exists a universal memory-nonlinearity trade-off relation and, moreover, demonstrated it numerically for some dynamical systems with different types of nonlinearity [25]. In addition, other numerical studies have concluded that linear nodes are effective for linear memory capacity and that a linear-like reservoir becomes optimal for a task requiring longer memory [26-28].
In the present work, we introduce a simple task with controllable memory and nonlinearity and clearly demonstrate the memory-nonlinearity trade-off on this task, using the echo state network (random network) model as a simple reservoir. Moreover, focusing on the variational equation from the viewpoint of information theory, we give a theoretical interpretation that reveals a dynamical mechanism illustrating how nonlinear dynamics degrades memory, as observed in the previous studies [25-27]. The theoretical interpretation implies that the trade-off is indeed universal in the sense that the memory degradation occurs independently of the form of the nonlinearity of the dynamical system.
What sort of dynamical system is preferable for a reservoir that realizes a universal (nonlinear) transformation of the input signal and possesses the appropriate memory capacity? Pioneering works in this direction have sought the answer: Butcher et al. [29-31] introduced RC with random static projection (R²SP) and Extreme Learning Machines with a time delay, based on the discussion of the trade-off, and reported that these architectures improve performance for some tasks compared with the standard echo state network model. The trade-off suggests that the coexistence of linearity and nonlinearity in RC will improve its performance. Indeed, Vinckier et al. [15] introduced linear optical dynamics on a photonic chip with a nonlinear readout and showed that it possesses a remarkably high (total) memory capacity and, interestingly, exhibits high performance on complex tasks.
Here, we consider the coexistence of linearity and nonlinearity in RC in a different way. Namely, we propose a novel reservoir structure endowed with both linear and nonlinear activation functions, which is referred to as a mixture reservoir. We show that introducing the mixture reservoir improves the performance of information processing for a variety of simple tasks. Interestingly, for some tasks, significant improvements are observed by adding a small amount of linear dynamics to the nonlinear dynamical system. Finally, we verify the effect of the mixture reservoir for more practical and complex tasks: time series forecasting of the Santa Fe Laser data set [32] and the NARMA task.

Results
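Formulation. We here consider the echo state network model, which uses a random recurrent neural network as a reservoir. Its time evolution is given by $x_i(t+1) = \phi[a_i(t)]$ $(i = 1, \ldots, N)$ (2), where $x_i(t)$ is the state of the $i$-th node, $a_i(t)$ is the total input to that node, which collects the reservoir state through the connection matrix $J_{ij}$ together with the input signal $s(t)$, and $g, \varepsilon \in \mathbb{R}$ are control parameters. The function $\phi$ is a so-called activation function. We use N = 100 and φ[a] = a or φ[a] = tanh a in the numerical experiments. Elements $J_{ij}$ of the connection matrix are independently and identically drawn from the Gaussian distribution with mean zero and variance 1/N. In the RC framework, we consider the linear readout $\hat{y}(t) = \sum_{i=1}^{N} w_i x_i(t)$, where $w = \{w_i\}$ is a set of readout weights. The goal of RC, in general, is to approximate a functional relation between the input sequence $\{s(t)\}$ and a target output $y(t)$ by the readout $\hat{y}(t)$, with the optimal weights $w^*$ chosen to minimize the normalized mean-square error $E(w) = \langle (y(t) - \hat{y}(t))^2 \rangle / \langle y(t)^2 \rangle$, where the brackets represent the time average. To evaluate the performance of RC, we use the generalization error $E(w^*)$ throughout this paper; its relation to the capacity C of the dynamical system defined by Dambre et al. [25] is $C = 1 - E(w^*)$. In the present formulation, the reservoir has two parameters, (g, ε), so the error depends on them: $E(w^* | g, \varepsilon)$. Hereafter, the error $\mathcal{E}$ represents the minimum value $\min_{(g, \varepsilon) \in P} E(w^* | g, \varepsilon)$ over the parameter region $P = \{(g, \varepsilon) \mid g \in [0.1, 3.0],\ \varepsilon \in [0.2, 6.0]\}$. The minimum value of the error is obtained numerically by calculating the error in the parameter region P discretely with step sizes Δg = 0.1 and Δε = 0.2. We confirm in the Supplemental Information that the main results of this paper are insensitive to the choice of the parameter space P and the step sizes.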
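The formulation above can be illustrated with a minimal Python sketch. It rests on assumptions that are not confirmed by the text: the pre-activation is taken to be a_i(t) = g Σ_j J_ij x_j(t) + ε s(t), the readout is trained by least squares with a tiny ridge term added only for numerical stability, and the values of g, ε, the washout length, and the toy target (reproducing the current input) are illustrative choices.

```python
import numpy as np

def run_reservoir(s, J, g, eps, phi=np.tanh, washout=100):
    """Drive the reservoir with the signal s and return the states after a washout.

    ASSUMPTION: the pre-activation is a_i(t) = g * sum_j J_ij x_j(t) + eps * s(t);
    the exact form of equation (2) is not preserved in the text.
    """
    x = np.zeros(J.shape[0])
    states = np.empty((len(s), J.shape[0]))
    for t, s_t in enumerate(s):
        x = phi(g * (J @ x) + eps * s_t)
        states[t] = x                    # state after the input s[t] has been applied
    return states[washout:]

def train_readout(X, y, ridge=1e-8):
    """Least-squares readout weights w* (small ridge term for numerical stability)."""
    return np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ y)

def normalized_error(X, y, w):
    """E(w) = <(y - y_hat)^2> / <y^2>, where <.> is the time average."""
    return np.mean((y - X @ w) ** 2) / np.mean(y ** 2)

rng = np.random.default_rng(0)
N, T, washout = 100, 3000, 100
J = rng.normal(0.0, np.sqrt(1.0 / N), size=(N, N))   # i.i.d. Gaussian, variance 1/N
s = rng.uniform(-1.0, 1.0, size=T)
X = run_reservoir(s, J, g=0.9, eps=1.0, washout=washout)
y = s[washout:]                                      # toy target: the current input
half = len(y) // 2
w_star = train_readout(X[:half], y[:half])           # fit on the first half
err = normalized_error(X[half:], y[half:], w_star)   # evaluate on the second half
```

Scanning `g` and `eps` over a grid and keeping the smallest `err` would correspond to the minimization over the parameter region P described above.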
Common-signal-induced synchronization. When employing a signal-driven dynamical system x(t + 1) = T(x(t), s(t)) as a reservoir, there is at least one necessary condition: the dynamical system has to exhibit common-signal-induced synchronization. Let us consider two different initial states $x(t_0)$ and $\hat{x}(t_0)$ ($\neq x(t_0)$; see Fig. 1(a)). If these two states converge to the same state asymptotically under the action of the same dynamical system T and the common signal, i.e., $\|x(t) - \hat{x}(t)\| \to 0$ as $t \to \infty$, the signal-driven dynamical system T is said to exhibit common-signal-induced synchronization. This condition is also referred to as the echo state property [6] or consistency [33]. It means that, once the transient is discarded, the asymptotic state $x(t)$ ($t \gg t_0$) depends not on the initial condition $x(t_0)$ but only on the sequence of the input signal $\{s(t)\}$. If the dynamical system (reservoir) does not satisfy this condition, different results will be obtained from the same input, depending on the initial condition of the reservoir.
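Common-signal-induced synchronization can be checked numerically by driving two copies of the same reservoir, started from different initial states, with one common input sequence. The following is a minimal sketch; the update rule x(t+1) = tanh(g J x(t) + ε s(t)) and the parameter values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 100, 300
g, eps = 0.9, 1.0                        # illustrative values, not taken from the paper
J = rng.normal(0.0, np.sqrt(1.0 / N), size=(N, N))
s = rng.uniform(-1.0, 1.0, size=T)       # the common input signal

def step(x, s_t):
    # ASSUMED update rule: x(t+1) = tanh(g * J x(t) + eps * s(t))
    return np.tanh(g * (J @ x) + eps * s_t)

x = rng.uniform(-1.0, 1.0, size=N)       # initial state x(t0)
x_hat = rng.uniform(-1.0, 1.0, size=N)   # a different initial state
dist = np.empty(T)
for t in range(T):
    x, x_hat = step(x, s[t]), step(x_hat, s[t])
    dist[t] = np.linalg.norm(x - x_hat)  # distance between the two orbits
```

When the reservoir synchronizes, `dist` decays toward zero, so the late-time state is determined by the input sequence alone.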
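A key quantity determining whether the reservoir satisfies this condition is the conditional Lyapunov exponent λ({s(t)}) for a given signal sequence $\{s(t)\}_{t \ge t_0}$. Let $\delta(t)$ be an infinitesimal perturbation of the state $x(t)$. Then, the time evolution of the perturbation δ(t) is described by the variational equation $\delta(t+1) = D_x T(x(t), s(t))\, \delta(t)$, where $D_x T$ denotes the Jacobian matrix of T with respect to x.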
In the above formulation, the variational equation of the dynamical system (2) is as follows: $\delta_i(t+1) = \phi'[a_i(t)] \sum_{j=1}^{N} \frac{\partial a_i(t)}{\partial x_j(t)}\, \delta_j(t)$. It is known [34] that, in the deterministic case (i.e. ε = 0), the origin is a stable fixed point when g < 1 and chaotic behavior appears when g > 1. Moreover, it is also known that, the conditional Lyapunov
Memory-nonlinearity trade-off. First, we introduce a simple function approximation task. Although practical tasks such as time series forecasting are important, it is difficult to recognize in such complex tasks how an input signal should be transformed in a nonlinear way and how much memory capacity is required. Therefore, for a basic understanding of RC, we first study the simple function approximation task, which allows us to control separately the degree of the nonlinearity and the memory required in the task.
The simple function approximation task requires the computation y(t) = f(s(t − τ)), where f is a nonlinear function such as f(x) = sin x, tan x, or x(1 − x²), and s(t − τ) is the input signal τ steps before. For all results shown in this paper for the simple approximation tasks, the input signal s(t) is independently and identically drawn from the uniform distribution $\mathcal{U}(-1, 1)$. In Fig. 2, we show results in the case of y(t) = f(s(t − τ)) = sin(νs(t − τ)), where (τ, ν) are task parameters that control, respectively, the 'depth' of the required memory and the 'strength' of the required nonlinearity.
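The task data described above can be generated with a few lines of Python. This is only a sketch: the function name `make_task`, the sequence length, and the example values of τ and ν are hypothetical choices for illustration.

```python
import numpy as np

def make_task(T, tau, nu, rng):
    """Return the input s(t) ~ U(-1, 1) and the target y(t) = sin(nu * s(t - tau))."""
    s = rng.uniform(-1.0, 1.0, size=T)
    y = np.full(T, np.nan)               # y(t) is undefined for t < tau
    y[tau:] = np.sin(nu * s[:T - tau])
    return s, y

rng = np.random.default_rng(2)
s, y = make_task(T=1000, tau=3, nu=2.0, rng=rng)
```

Increasing `tau` asks the reservoir to remember inputs from further in the past, while increasing `nu` makes the required transformation more strongly nonlinear.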
We compare the linear function φ[a] = a and the nonlinear function φ[a] = tanh a to study the roles of the linearity and nonlinearity of the activation function in RC. We refer to the reservoir employing φ[a] = a (φ[a] = tanh a) as a linear (nonlinear) reservoir. As described in detail in the Supplemental Information, the linear reservoir can be interpreted as the ε → 0 limit of the nonlinear reservoir. Figure 2(a) shows a diagram summarizing the results of the direct comparison of the two activation functions in the task parameter space. For given parameters (τ, ν), the linear reservoir is judged superior if its error, $\mathcal{E}_{L}$, is lower than that of the nonlinear reservoir, $\mathcal{E}_{NL}$, i.e., $\mathcal{E}_{L} < \mathcal{E}_{NL}$, and the nonlinear reservoir is judged superior in the opposite case. These results indicate that, if the task requires 'strong' nonlinear transformation with 'short' memory, the nonlinear reservoir outperforms the linear one. If the task requires 'long' memory with 'weak' nonlinear transformation ($\log \nu < -1.0$, $\tau \gtrsim 4$), the linear reservoir outperforms the nonlinear one. The linear dynamics is suitable for tasks requiring memory, although it cannot perform nonlinear transformation. On the other hand, the nonlinear dynamics is suitable for tasks requiring nonlinear transformation, although the nonlinearity of the dynamics seems to degrade the linear memory capacity. In this sense, the above direct comparison clearly shows the memory-nonlinearity trade-off, which is consistent with previous studies [25, 27].
Why nonlinear dynamics degrades memory. The nonlinearity of the dynamics seems to degrade memory.
We show that this can be understood by employing the variational equation from the viewpoint of information theory. First, we introduce two sequences of the input signals, $\{s(t)\}_t$ and $\{\hat{s}(t)\}_t$, and assume that they are the same except at one time step, at which they differ by Δ, where Δ represents a small difference in the two sequences (see Fig. 1(a)). For simplicity, let us consider the case of N = 1 (see Supplementary Information for the general dimensional case (N ≥ 1)). The difference in the input signal Δ leads to a difference in states; the state driven by the input sequence $\{\hat{s}(t)\}_t$ is described by $\hat{x}(t_0 + k) = x(t_0 + k) + \delta_k$, where $\delta_k$ represents the difference between the two orbits $x(t_0 + k)$ and $\hat{x}(t_0 + k)$. Let us regard the ability to reconstruct the initial difference $\delta_0$ from the later difference $\delta_n$ as memory. If there exists some relation between $\delta_0$ and $\delta_n$ (e.g., they are functionally dependent on each other), we could reconstruct the initial difference $\delta_0$ from $\delta_n$. In other words, it is potentially possible to read out some information about the past difference in the input sequences from the present reservoir state. In that case, the reservoir can be interpreted as storing memory. On the other hand, if there is no relation between $\delta_0$ and $\delta_n$ (e.g., they are independent of each other), we cannot reconstruct the initial difference $\delta_0$ from the later difference $\delta_n$. In other words, we cannot read out any information about the past difference in the input sequences from the present reservoir state. In that case, the reservoir can be interpreted as forgetting memory.
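The relation between $\delta_0$ and $\delta_n$ is given by the variational equation, $\delta_n = \left[ \prod_{k=0}^{n-1} \partial_x T(x(t_0 + k), s(t_0 + k)) \right] \delta_0$. For linear dynamics, the product term is a constant that does not depend on the orbit. For nonlinear dynamics, in contrast, the product term depends on the states along the orbit and is a kind of 'noise' in view of preserving the information of $\delta_0$, because the product term does not correlate with $\delta_0$. Hence, the product term due to the nonlinearity always weakens the relation between $\delta_0$ and $\delta_n$, implying that introducing nonlinearity degrades memory. In brief, it can be interpreted that the nonlinear dynamics degrades memory, while the linear dynamics preserves it.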
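The role of the product term as 'noise' can be illustrated numerically for N = 1. The sketch below is not the analysis of the paper: it assumes the scalar map x(t+1) = φ(g x(t) + ε s(t)) with illustrative parameter values, draws a fresh input sequence in every trial, and uses the Pearson correlation between δ_0 and δ_n across trials as a crude proxy for how strongly the two are related.

```python
import numpy as np

def perturbation_gain(phi, phi_prime, g, eps, n, trials, rng):
    """Product of derivatives over n steps of x(t+1) = phi(g*x(t) + eps*s(t)).

    ASSUMED scalar (N = 1) map. Each trial uses its own input sequence and
    initial state; to first order, delta_n = gain * delta_0.
    """
    x = rng.uniform(-1.0, 1.0, size=trials)
    gain = np.ones(trials)
    for _ in range(n):
        a = g * x + eps * rng.uniform(-1.0, 1.0, size=trials)
        gain *= g * phi_prime(a)         # derivative of the map with respect to x
        x = phi(a)
    return gain

rng = np.random.default_rng(3)
trials, n, g, eps = 20000, 8, 0.9, 1.0
delta0 = rng.normal(0.0, 1e-3, size=trials)   # small random initial differences

gain_lin = perturbation_gain(lambda a: a, lambda a: np.ones_like(a),
                             g, eps, n, trials, rng)
gain_nl = perturbation_gain(np.tanh, lambda a: 1.0 - np.tanh(a) ** 2,
                            g, eps, n, trials, rng)

corr_lin = np.corrcoef(delta0, gain_lin * delta0)[0, 1]   # gain is the constant g**n
corr_nl = np.corrcoef(delta0, gain_nl * delta0)[0, 1]     # gain fluctuates with the orbit
```

In the linear case the gain is the same constant in every trial, so δ_n is an exact multiple of δ_0; in the tanh case the gain varies from orbit to orbit, which weakens the correlation.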
To study the above statement more quantitatively, we measure the strength of the relation using the mutual information $I(\delta_0; \delta_n)$, for which an inequality follows simply from the fundamental inequality (data-processing inequality) in information theory. Here, $\delta_0$ denotes the random perturbation at the initial point $x_0$ in the state space of the signal-driven dynamical system $x_{k+1} = T(x_k, s_k)$. Let us consider that $\delta_0$ is drawn from $p(\delta_0)$ independently of the initial point $x_0$. We write the perturbation vector at $x_n$ as $\delta_n$. The mutual information is defined through the differential entropy $h_{x_0}$ of the joint density $p(\delta_0, \delta_n)$, and the inequality holds for each $x_0$ in the state space, i.e., nonlinearity degrades memory.
Note that the above argument is general in two senses. First, it does not assume any particular functional form of the map T defining the dynamical system x(t + 1) = T(x(t), s(t)). Therefore, we conclude that introducing any form of nonlinearity in the reservoir dynamics degrades memory, which suggests a positive resolution of the 'Jaeger conjecture' [20]: linear networks are always optimal for large memory capacity. Second, the above argument does not assume the linear readout, which is specific to RC. Thus, the above statement, that nonlinearity degrades memory, holds for general signal-driven dynamical systems.
Beyond the trade-off. We showed the memory-nonlinearity trade-off in our numerical experiment, and gave the dynamical mechanism behind the trade-off. With this trade-off, it is natural to use both linear and nonlinear activation functions with an expectation of storing memory by linear dynamics and realizing general transformation by nonlinear dynamics. We show numerically that a reservoir endowed with both linear and nonlinear activation functions, hereafter referred to as a mixture reservoir, is superior to the linear or nonlinear reservoir. Here the effect of the mixture reservoir is demonstrated for the simple function approximation task y(t) = sin (νs(t − τ)).
We extend the standard echo state network as follows: $x_i(t+1) = a_i(t)$ for $i \in V_L$ and $x_i(t+1) = \tanh a_i(t)$ for $i \in V_{NL}$, where $a_i(t)$ is the same as in equation (2). $V_L$ is the index set of the nodes with the linear activation function (linear nodes), $V_L = \{1, \ldots, N_p\}$, and $V_{NL}$ is that of the nodes with the nonlinear activation function (nonlinear nodes), $V_{NL} = \{N_p + 1, \ldots, N\}$. Let p be the 'mixture rate' of the linear and nonlinear reservoir: $p = 1 - N_p/N$, i.e., p = 0 (resp. p = 1) means that all of the activation functions are linear (resp. nonlinear), and 0 < p < 1 means that the reservoir consists of both linear and nonlinear activation functions. Here, we again use the random matrix $J_{ij}$ as the connection matrix, and therefore, the mixture reservoir we introduce is a network of linear and nonlinear nodes that are randomly coupled. Throughout this paper, for each fixed mixture rate p, the error $\mathcal{E}$ is obtained with the optimal parameter (g, ε) in the parameter space P as described in the Formulation.
Figure 3 shows the approximation error $\mathcal{E}$ versus the mixture rate p for some tasks. As an example of a task requiring nonlinear transformation, the error in the case of log ν = 0 is depicted in Fig. 3(a) with τ = 1, 3, 5. As seen in the above results (Fig. 2), the nonlinear reservoir outperforms the linear one for this task, and, correspondingly, the error $\mathcal{E}$ at p = 1 is less than that at p = 0 in Fig. 3(a) (see Supplementary Information for the enlarged view of these figures). Furthermore, the error $\mathcal{E}$ at p ∈ (0, 1) is less than in those two cases, i.e., the mixture reservoir outperforms both the linear and nonlinear reservoirs. Note that the errors of the mixture reservoir at p = 0.1 are considerably less than those of the linear reservoir (p = 0.0). Moreover, for the case of τ = 1, the errors of the mixture reservoir at p = 0.8 are considerably less than those of the nonlinear reservoir (p = 1.0). It is interesting that introducing only a few nonlinear (linear) nodes into the linear (nonlinear) reservoir improves its performance remarkably.
For other tasks as well, qualitatively similar remarkable improvements can be found (see Fig. 3(b,c)).
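One update step of such a mixture reservoir can be sketched as follows. The pre-activation a = g J x + ε s and the parameter values are assumptions for illustration, and the helper names `mixture_step` and `n_linear_from_rate` are hypothetical; only the split into linear nodes (V_L) and tanh nodes (V_NL) and the relation p = 1 − N_p/N are taken from the text.

```python
import numpy as np

def mixture_step(x, s_t, J, g, eps, n_linear):
    """One update of a mixture reservoir.

    The first n_linear nodes (the set V_L) use phi[a] = a; the remaining
    nodes (V_NL) use phi[a] = tanh a.
    ASSUMPTION: the pre-activation is a = g * J x + eps * s.
    """
    a = g * (J @ x) + eps * s_t
    x_next = np.tanh(a)
    x_next[:n_linear] = a[:n_linear]     # linear nodes keep the raw pre-activation
    return x_next

def n_linear_from_rate(p, N):
    """Mixture rate p = 1 - N_p / N, hence N_p = (1 - p) * N linear nodes."""
    return int(round((1.0 - p) * N))

rng = np.random.default_rng(4)
N, p = 100, 0.75
J = rng.normal(0.0, np.sqrt(1.0 / N), size=(N, N))
n_lin = n_linear_from_rate(p, N)         # number of linear nodes for this mixture rate
x = np.zeros(N)
for s_t in rng.uniform(-1.0, 1.0, size=200):
    x = mixture_step(x, s_t, J, g=0.5, eps=0.5, n_linear=n_lin)
```

Setting `p = 0` or `p = 1` recovers the purely linear or purely nonlinear reservoir, so the same routine can be used to scan the mixture rate.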
An optimal mixture rate depends on the task, i.e., $p_{\mathrm{opt}}(\nu, \tau) = \arg\min_p \mathcal{E}_p(\nu, \tau)$, where $\mathcal{E}_p(\nu, \tau)$ denotes the error with a mixture rate p for a given task (ν, τ). To study this dependency, we show the optimal mixture rates in the diagram in Fig. 4(a). As in the diagram in Fig. 2(a), for a set of given task parameters (ν, τ), we indicate the optimal mixture rate $p_{\mathrm{opt}}(\nu, \tau)$ with different symbols, where the minimal value is numerically found in the set p ∈ {0.00, 0.05, 0.15, 0.25, 0.50, 0.75, 1.00}. The crosses again represent a draw, i.e. $\min_p \mathcal{E}_p(\nu, \tau) / \max_p \mathcal{E}_p(\nu, \tau) \in (0.95, 1.00]$. As in Fig. 2(b,c), the error $\mathcal{E}$, time series, and function approximation plots are shown in Fig. 4(b,c).
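The selection of the optimal mixture rate and the 'draw' criterion can be written compactly. In the sketch below the function name and the error values are hypothetical and serve only to illustrate the rule; they are not results from the paper.

```python
import numpy as np

def optimal_mixture_rate(rates, errors, draw_ratio=0.95):
    """Return (p_opt, is_draw) for errors measured at the given mixture rates.

    A draw is declared when min(errors) / max(errors) lies in (draw_ratio, 1].
    """
    errors = np.asarray(errors, dtype=float)
    p_opt = rates[int(np.argmin(errors))]
    is_draw = bool(errors.min() / errors.max() > draw_ratio)
    return p_opt, is_draw

rates = [0.00, 0.05, 0.15, 0.25, 0.50, 0.75, 1.00]
# Hypothetical error values, for illustration only (not results from the paper).
errors = [0.90, 0.40, 0.30, 0.20, 0.25, 0.35, 0.60]
p_opt, is_draw = optimal_mixture_rate(rates, errors)
```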
The diagram indicates that the optimal mixture rate varies gradually with the task and, importantly, that the mixture reservoir (0 < p < 1) outperforms the linear and nonlinear reservoirs (p = 0, 1) over a broad region in the task parameter space.
More complex tasks. The simple function approximation task y(t) = sin(νs(t − τ)) allows us to explicitly decompose the degree of the nonlinearity and the memory required for the task. However, practically important tasks such as time series prediction are much more complicated than the tasks employed above. Here, we study the effect of introducing the mixture reservoir for two more practical tasks: time series forecasting of the Santa Fe Laser data set [32, 36] and the NARMA task. These tasks are frequently used in RC studies [26-28, 36] to assess the performance of the reservoir.
The Santa Fe Laser data set is a time series {y(t)} obtained from chaotic laser experiments. Given the past data (…, y(t − 2), y(t − 1), y(t)), the task is to predict future values y(t + k) (k ≥ 1), which is referred to as a k-step ahead prediction. We show the prediction errors for k = 1, 2, 3, 4, 5 versus the mixture rate p (∈ [0.25, 1.0]) in Fig. 5(a), where the error with p = 0 is not depicted because of its large value, i.e., the linear reservoir does not work at all for this task. The time series of the original data y(t) and the predicted data ŷ(t) are depicted in Fig. 5(b). For each k, introducing the mixture reservoir suppresses the error. Let us define the error suppression rate as $R := \mathcal{E}(p_{\mathrm{opt}}) / \mathcal{E}(p = 1)$ for a fixed task. Then, the error suppression rate R attains its minimum for the three-step ahead prediction (R ≈ 0.5), while R > 0.5 for the one-step and five-step ahead predictions. The optimal mixture rate $p_{\mathrm{opt}}$ depends on k in this task as well, and, interestingly, $p_{\mathrm{opt}}(k)$ decreases with increasing k, i.e., to accomplish a prediction of a more distant future, the reservoir needs more linear dynamics.
The NARMA task is to emulate a signal-driven dynamical system with a highly nonlinear auto-regressive moving average as follows: $y(t+1) = \alpha y(t) + \beta y(t) \sum_{i=0}^{m-1} y(t-i) + \gamma s(t-m+1) s(t) + \delta$, where α = 0.3, β = 0.05, γ = 1.5, and δ = 0.1. Note that the parameter m simultaneously changes both the required memory and the nonlinearity. The signal s(t) is independently and identically drawn from the uniform distribution $\mathcal{U}[0, 0.5]$ and drives both the NARMA system and the reservoir. Figure 6(a) shows the error in the emulation of the NARMA system with parameters m = 1, 2, 5, 10 by the mixture reservoir with p ∈ [0.25, 1.0]. Correspondingly, typical time series are depicted in Fig. 6(b). For the task m = 10, the errors with several mixture rates p are almost the same (or the mixture reservoir with p = 0.5 is slightly better than the others). However, for the tasks m = 1, 2, 5, the error is clearly reduced by introducing the mixture reservoir. Furthermore, the smaller the parameter m is, the more effective the mixture reservoir becomes, e.g., the error suppression rate R ≈ 0.5 when m = 5 and, moreover, R ≈ 0.008 when m = 1.
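A target sequence for the NARMA task can be generated as in the following sketch. It assumes the widely used NARMA recursion y(t+1) = α y(t) + β y(t) Σ_{i=0}^{m−1} y(t−i) + γ s(t−m+1) s(t) + δ with the parameter values quoted above; the zero initial condition and the sequence length are illustrative choices.

```python
import numpy as np

def narma(s, m, alpha=0.3, beta=0.05, gamma=1.5, delta=0.1):
    """Generate a NARMA-m target from the input sequence s.

    ASSUMED recursion (the widely used NARMA form):
      y(t+1) = alpha*y(t) + beta*y(t)*sum_{i=0}^{m-1} y(t-i)
               + gamma*s(t-m+1)*s(t) + delta
    The first m values of y are left at zero as an initial condition.
    """
    y = np.zeros(len(s))
    for t in range(m - 1, len(s) - 1):
        y[t + 1] = (alpha * y[t]
                    + beta * y[t] * np.sum(y[t - m + 1:t + 1])
                    + gamma * s[t - m + 1] * s[t]
                    + delta)
    return y

rng = np.random.default_rng(5)
s = rng.uniform(0.0, 0.5, size=2000)     # input drawn from U[0, 0.5]
y1 = narma(s, m=1)                       # NARMA1 target
y5 = narma(s, m=5)                       # NARMA5 target
```

The same input `s` would be fed to the reservoir, and the readout would be trained to reproduce `y1` or `y5`.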

Discussion
In the present work, we numerically demonstrated the memory-nonlinearity trade-off for the echo state network model. Namely, the linear dynamics is suitable for storing memory but useless for nonlinear transformation, while the nonlinear dynamics is suitable for nonlinear transformation but degrades memory.
We have uncovered the dynamical mechanism behind the memory-nonlinearity trade-off, using the variational equation from the viewpoint of information theory. The mechanism describes how the nonlinear dynamics degrades memory and the linear dynamics preserves it. In terms of information theory, storing memory with the nonlinear (resp. linear) dynamics corresponds to transferring a message through a noisy (resp. noiseless) communication channel. The above theoretical interpretation assumes neither the functional form of the nonlinearity in the reservoir dynamics nor the linear readout. Hence, we conclude that, as a property of general signal-driven dynamical systems, introducing nonlinearity in the dynamics always degrades memory (Jaeger conjecture [20]). On the basis of the memory-nonlinearity trade-off, we proposed the mixture reservoir, which is endowed with both linear and nonlinear dynamics. We numerically showed that it reduces function approximation errors effectively. Moreover, the observation shows that adding 'a pinch of linearity' considerably improves the performance of the nonlinear reservoir. This conclusion may be valuable for physical implementation of RC, since nonlinear dynamical systems are often used for the reservoir. While the effects of adding both 'a pinch of linearity' and 'a pinch of nonlinearity' on the RC performance are numerically observed for some tasks, the magnitudes of the effects may depend on how much nonlinearity or memory the task requires. For instance, in Fig. 3(a), adding 'a pinch of linearity' is not effective for τ = 5. This can be interpreted as follows: the task with τ = 5 requires 'deep' memory, and thus a large amount of linearity needs to be added.
Finally, we verified the effect of the mixture reservoir in more practical and complex tasks: time series forecasting of the Santa Fe Laser data set [32] and the NARMA task. It is interesting to note that the optimal mixture rate changes depending on the task: $p_{\mathrm{opt}} \approx 0.9$ in the Santa Fe time series forecasting task, whereas $p_{\mathrm{opt}} \approx 0.5$ in the NARMA task. It may be interesting to compare the performance improvement obtained by introducing the mixture reservoir with that obtained by simply increasing the number of nodes, as reported by Rodan & Tino [36]. For the 1-step ahead prediction of the Santa Fe data set, the comparison suggests a conjecture: replacing a small number of nonlinear nodes with linear nodes improves RC performance as effectively as doubling the number of nonlinear nodes. See the Supplementary Information for a detailed comparative argument.
As future work, it is important to study the universality of the memory-nonlinearity trade-off and the effect of the mixture reservoir, i.e., to see whether the results presented in this paper hold in other reservoirs, e.g. with different network topology, and for other tasks. Theoretically, it would be interesting to clarify the relationships between the quantities relating to the memory, i.e. the (maximal) conditional Lyapunov exponent, the linear memory capacity MC [20], and the mutual information $I(\delta_0; \delta_n)$. These relationships could provide a strategy for determining the reservoir parameters that optimize performance. To quantify the memory capacity of the mixture reservoir, it may be interesting to study the mutual information in the case of the mixture reservoir and how it changes with the mixture rate p. Moreover, an important piece of future work is to compare the mixture reservoir with other methods such as RC with random static projection (R²SP) [29-31]. One application of the idea of the mixture reservoir is to add an auxiliary linear feedback to the implementation of RC with delay feedback (i.e., adding linear virtual nodes), which could improve its performance remarkably.
In this work, we found that one of the characteristics of dynamical systems suitable for RC is the coexistence of both linear and nonlinear dynamics. This is a step toward uncovering a guiding principle of reservoir design for high-performance information processing, which is expected to provide an answer to the question stated in the introduction: for a given task, what characteristics of a dynamical system are crucial for information processing? Once revealed, such a guiding principle will enrich our knowledge of computer science, deepen our understanding of brain functions, and contribute to extending dynamical system theory.
[Figure caption, NARMA task time series] The upper panels show the time series of the target y(t) (i.e., the NARMA system) and the answer ŷ(t) (i.e., the emulated value), corresponding to the red line and blue dots respectively. The left two panels are the time series for the NARMA1 task (m = 1), with the mixture rate p = 0.5 (left) and p = 1.0 (right). The right two panels are the time series for the NARMA10 task (m = 10), with the mixture rate p = 0.5 (left) and p = 1.0 (right). The lower panels show the error values corresponding to the upper panels.