Echo State Networks with Self-Normalizing Activations on the Hyper-Sphere

Among the various architectures of Recurrent Neural Networks, Echo State Networks (ESNs) emerged due to their simplified and inexpensive training procedure. These networks are known to be sensitive to the setting of hyper-parameters, which critically affect their behavior. Results show that their performance is usually maximized in a narrow region of hyper-parameter space called edge of criticality. Finding such a region requires searching hyper-parameter space in a sensible way: hyper-parameter configurations marginally outside such a region might yield networks exhibiting fully developed chaos, hence producing unreliable computations. The performance gain due to optimizing hyper-parameters can be studied by considering the memory–nonlinearity trade-off, i.e., the fact that increasing the nonlinear behavior of the network degrades its ability to remember past inputs, and vice-versa. In this paper, we propose a model of ESNs that eliminates critical dependence on hyper-parameters, resulting in networks that provably cannot enter a chaotic regime and, at the same time, exhibit nonlinear behavior in phase space characterized by a large memory of past inputs, comparable to that of linear networks. Our contribution is supported by experiments corroborating our theoretical findings, showing that the proposed model displays dynamics that are rich enough to approximate many common nonlinear systems used for benchmarking.


Introduction
Although the use of Recurrent Neural Networks (RNNs) in machine learning is rapidly growing, also as effective building blocks for deep learning architectures, a comprehensive understanding of their working principles is still missing 1,2. Of particular relevance are Echo State Networks (ESNs), introduced by Jaeger 3 and independently by Maass et al. 4 under the name of Liquid State Machine (LSM), which stand out among RNNs due to their training simplicity. The basic idea behind ESNs is to create a randomly connected recurrent network, called reservoir, and feed it with a signal so that the network encodes the underlying dynamics in its internal states. The desired, task-dependent output is then generated by a readout layer (usually linear) trained to match the states with the desired outputs. Despite the simplified training protocol, ESNs are universal function approximators 5 and have been shown to be effective in many relevant tasks 6-12.
These networks are known to be sensitive to the setting of hyper-parameters like the Spectral Radius (SR), the input scaling and the sparseness degree 3, which critically affect their behaviour and, hence, the performance at task. Fine tuning of hyper-parameters requires cross-validation or ad-hoc criteria for selecting the best-performing configuration. Experimental evidence and some theoretical results show that ESN performance is usually maximised in a very narrow region of hyper-parameter space called the Edge of Chaos (EoC) 13-20. However, beyond such a region ESNs behave chaotically, resulting in useless and unreliable computations. At the same time, it is anything but trivial to configure the hyper-parameters so that the network lies on the EoC while still guaranteeing non-chaotic behaviour. A very important property for ESNs is the Echo State Property (ESP), which basically asserts that their behaviour should depend only on the signal driving the network, regardless of initial conditions 21. Despite being at the foundation of theoretical results 5, the ESP in its original formulation raises some issues, mainly because it does not account for multi-stability and is not tightly linked to properties of the specific input signal driving the network 21-23.
In this context, the analysis of the memory capacity (as measured by the ability of the network to reconstruct or remember past inputs) of input-driven systems plays a fundamental role in the study of ESNs 24-27. In particular, it is known that ESNs are characterized by a memory-nonlinearity trade-off 28-30, in the sense that introducing nonlinear dynamics in the network degrades memory capacity. Moreover, it has been recently shown that optimizing memory capacity does not necessarily lead to networks with higher prediction performance 31.
arXiv:1903.11691v1 [cs.NE] 27 Mar 2019

In this paper, we propose an ESN model that eliminates critical dependence on hyper-parameters, resulting in models that cannot enter a chaotic regime. In addition to this major outcome, such networks exhibit nonlinear behaviour in phase space characterised by a large memory of past inputs: the proposed model generates dynamics that are rich enough to approximate nonlinear systems typically used as benchmarks. Our contribution is based on a nonlinear activation function that normalizes neuron activations on a hyper-sphere. We show that the spectral radius of the reservoir, which is the most important hyper-parameter for controlling the ESN behaviour, plays a marginal role in influencing the stability of the proposed model, although it has an impact on the capability of the network to memorize past inputs. Our theoretical analysis demonstrates that this property derives from the impossibility for the system to display chaotic behaviour: in fact, the maximum Lyapunov exponent is always null. An interpretation of this very important outcome is that the network always operates on the EoC, regardless of the setting chosen for its hyper-parameters.

Echo state networks
An ESN without output feedback connections is defined as:

a_k = W x_{k−1} + W_in s_k,   (1a)
x_k = φ(a_k),   (1b)
y_k = W_out x_k,   (1c)

where x_k ∈ R^N is the system state at time-step k, N is the number of hidden neurons composing the network reservoir, and W ∈ R^{N×N} is the connection matrix of the reservoir. The signal value at time k, s_k ∈ R^{N_in}, is processed by the input-to-reservoir matrix W_in ∈ R^{N×N_in}. The activation function φ usually takes the form of the hyperbolic tangent function, for which the network is a universal function approximator 5. Also linear networks (i.e., when φ is the identity) are commonly studied, both for the proven impact on applications and the very interesting results that can be derived in closed form 24,25,27,31,32. Other activation functions have been proposed in the neural networks literature, including those that renormalize the activation values on a closed surface 33 and those based on non-parametric estimation via composition of kernel functions 34. The output y ∈ R^{N_out} is generated by the matrix W_out, whose weights are learned: generally by ridge regression or lasso 3,35, but also with online training mechanisms 36. In ESNs, the training phase does not affect the dynamics of the system, which are de facto task-independent. It follows that once a suitable reservoir rich in dynamics is generated, the same network may serve to perform different tasks.

Self-normalizing activation function
Here, we propose a new model for ESNs characterized by the use of a particular self-normalizing activation function that provides important features to the resulting network. Notably, the proposed activation function allows the network to exhibit nonlinear behaviours and, at the same time, provides memory properties similar to those observed for linear networks. This superior memory capacity is linked to the fact that the network never displays chaotic behaviour: we will show that the maximum Lyapunov exponent is always zero, implying that the network operates on the EoC. The proposed activation function guarantees that the SR of the reservoir matrix (whose value is used as a control parameter) can vary in a wide range without affecting the network stability.

The proposed self-normalizing activation function is

a_k = W x_{k−1} + W_in s_k,   (2a)
x_k = r a_k / ||a_k||,   (2b)

and leaves the readout (1c) unaltered. The normalization in Eq. (2b) projects the network state a_k onto an (N − 1)-dimensional hyper-sphere S_r^{N−1} := {p ∈ R^N : ||p|| = r} of radius r. Fig. 2 illustrates the normalization operator applied to the state. Note that the operation (2b) is not element-wise like most activation functions: its effect is global, meaning that a neuron's activation value depends on the values of all other neurons.
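One update of (2a)-(2b) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code; the random matrices are placeholders for a tuned reservoir.

```python
import numpy as np

rng = np.random.default_rng(0)
N, N_in, r = 100, 3, 1.0

# Illustrative random reservoir and input matrices.
W = rng.normal(size=(N, N))
W_in = rng.normal(size=(N, N_in))

def spherical_step(x, s, W, W_in, r=1.0):
    """One update of the proposed model: linear step (2a), then projection (2b)."""
    a = W @ x + W_in @ s              # Eq. (2a)
    return r * a / np.linalg.norm(a)  # Eq. (2b): project onto the hyper-sphere of radius r

x = spherical_step(np.ones(N) / np.sqrt(N), rng.normal(size=N_in), W, W_in, r)
# After every step the state lies exactly on the hyper-sphere of radius r.
print(np.linalg.norm(x))
```

Note that, unlike an element-wise nonlinearity, the division by ||a_k|| couples every neuron to all the others.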

Universal function approximation property
The fact that recurrent neural networks are universal approximators has been proven 37,38, but only recently Grigoryeva and Ortega 5 proved that ESNs share the same property. Here, we show that the universal function approximation property also holds for the proposed ESNs (2). To this end, we define a squashing function as a map f : R → [−1, 1] that is non-decreasing and saturating, i.e., lim_{x→±∞} f(x) = ±1. We note that σ_i(x) := x_i/||x|| can be interpreted as a squashing function for each i-th component. In the same work 5, the authors show that an ESN in the form (1) is a universal function approximator provided it satisfies the contractivity condition ||σ(x) − σ(y)|| ≤ ||x − y|| for all x and y, where σ is a squashing function. Jaeger and Haas 3 showed that this condition is sufficient to grant the ESP, implying that such ESNs are universal approximators.
We prove in the Methods section that (2) grants the contractivity condition, provided that:

σ_min(W) ≥ 1 + s_max/r,   (3)

where σ_min(W) is the smallest singular value of matrix W and s_max denotes the largest norm associated to an input. Eq. (3) is a sufficient yet not necessary condition that may be understood as requiring that the input is never strong enough to counteract the expansive dynamics of the system, which would otherwise bring the network state inside the hyper-sphere of radius r. In fact, unless the signal is explicitly designed to violate such a condition, it will very likely not bring the system inside the hyper-sphere as long as the norm of W is large enough compared to the signal variance.
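The condition can be checked numerically. The sketch below (an assumption-laden illustration, not the authors' experiment) builds W as a scaled orthogonal matrix, so that its smallest singular value is known exactly, and verifies that the linear step (2a) never maps a state on the sphere to a point inside it:

```python
import numpy as np

rng = np.random.default_rng(1)
N, r, s_max = 50, 1.0, 0.5

# c times an orthogonal matrix has all singular values equal to c,
# so sigma_min(W) = c is known by construction.
Q, _ = np.linalg.qr(rng.normal(size=(N, N)))
c = 1.0 + s_max / r + 0.1   # satisfies condition (3): sigma_min(W) >= 1 + s_max / r
W = c * Q

# For any state on the sphere and any input with norm <= s_max,
# the linear step (2a) lands outside the sphere of radius r.
for _ in range(100):
    x = rng.normal(size=N)
    x = r * x / np.linalg.norm(x)       # ||x|| = r
    u = rng.normal(size=N)
    u = s_max * u / np.linalg.norm(u)   # worst-case input norm
    a = W @ x + u
    assert np.linalg.norm(a) >= r       # never pulled inside the sphere
```

Here ||W x + u|| ≥ σ_min(W) r − s_max = 1.1 r, so the projection in (2b) always contracts back onto the sphere from outside.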
Figure 2. Example of the behavior of the proposed model in a 2-dimensional scenario. The blue lines represent the linear update step of Eq. (2a), while the dashed lines denote the projection of Eq. (2b). The red lines represent the actual steps performed by the system. Note that condition (3) accounts for the fact that the linear step must never bring the system state inside the hyper-sphere.

Network state dynamics: the autonomous case
We now discuss the network state dynamics in the autonomous case, i.e., in the absence of input. This allows us to prove why the network cannot be chaotic.
From now on, we assume r = 1, as this does not affect the dynamics provided that condition (3) is satisfied. From (2), the system state dynamics in the autonomous case reads:

x_k = W x_{k−1} / ||W x_{k−1}||.   (4)

At time-step n the system state is given by

x_n = W x_{n−1} / ||W x_{n−1}||.   (5)

By iterating this procedure, one obtains:

x_n = W^n x_0 / ||W^n x_0||,   (6)

where x_0 is the initial state. This implies that, for the autonomous case, a system evolving for n time-steps coincides with one updating the state by performing n matrix multiplications and projecting the outcome only at the end.
It is worth commenting that this holds also if matrix W changes over time. In fact, let W_n := W(n) be the matrix W at time n. Then, the evolution of the dynamical system reads:

x_n = (W_n W_{n−1} ··· W_1 x_0) / ||W_n W_{n−1} ··· W_1 x_0||.   (7)

Furthermore, note that a system described by matrix W and a system characterized by W' = aW, with a > 0, produce identical dynamics. In turn, this implies that the SR of the matrix does not alter the dynamics in the autonomous case.
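Both claims above are easy to verify numerically. The following sketch (with an arbitrary random W) checks that n normalized iterations coincide with a single projection of W^n x_0, and that rescaling W leaves the trajectory unchanged:

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 80, 25
W = rng.normal(size=(N, N)) / np.sqrt(N)
x0 = rng.normal(size=N)
x0 /= np.linalg.norm(x0)

def project(v):
    return v / np.linalg.norm(v)

# Iterate the autonomous map x_k = W x_{k-1} / ||W x_{k-1}|| for n steps ...
x = x0.copy()
for _ in range(n):
    x = project(W @ x)

# ... which coincides with projecting W^n x_0 only once, as in Eq. (6).
x_direct = project(np.linalg.matrix_power(W, n) @ x0)
print(np.allclose(x, x_direct))  # True

# Rescaling W (hence its spectral radius) leaves the trajectory unchanged.
y = x0.copy()
for _ in range(n):
    y = project((7.3 * W) @ y)
print(np.allclose(x, y))  # True
```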

Edge of chaos
When tuning the hyper-parameters of ESNs, one usually tries to bring the system close to the EoC, since it is in that region that performance is optimal 14. This can be explained by the fact that, when operating in that regime, the system develops rich dynamics without denoting chaotic behaviour.
Here, we show that the proposed recurrent model (2) cannot enter a chaotic regime. Notably, we prove that, when the number of neurons in the network is large, the maximum (local) Lyapunov exponent cannot be positive, hence ruling out the possibility of chaotic behaviour. To this end, we determine the Jacobian matrix of (2b) and then show that, since its spectral radius tends to 1, the maximum Local Lyapunov Exponent (LLE) must be null. The Jacobian matrix of (2b) reads:

J(x) = (1/||a||) (I − a a^T/||a||²) W,  with a = W x,   (8)

where the time index k is omitted to ease the notation. We observe that, asymptotically for large networks (N → ∞), we have a_i/||a|| → 0 for each i, meaning that the Jacobian matrix reduces to J(x) = W/||W x||. As we are considering the case with r = 1, we know that ||x|| = 1. This allows us to approximate the norm ||W x|| with the spectral radius ρ of W:

J(x) ≈ W/ρ.   (9)

Under this approximation (9), the largest eigenvalue of J must be 1 in absolute value, as the SR ρ = ρ(W) is the largest absolute value among the eigenvalues of W. We thus characterize the global behaviour of (2b) by considering the maximum LLE 14, which is defined as:

Λ := lim_{n→∞} (1/n) Σ_{k=1}^n log ρ_k,   (10)

where ρ_k is the spectral radius of the Jacobian at time-step k. Since ρ_k → 1, Eq. (10) implies that Λ → 0 as n → ∞, hence proving our claim.
In order to demonstrate that Λ = 0 holds also for networks with a finite number N of neurons in the recurrent layer, we numerically compute the maximum LLE by considering the Jacobian in (8). Fig. 3a shows the average value of the maximum LLE, with the related standard deviation, obtained for different values of the SR. Results show that the LLE is not significantly different from zero. In Fig. 3b, instead, we show the Lyapunov spectrum of a network with N = 100 neurons in the recurrent layer, obtained for different SR values. Again, our results show that the maximum LLE of (2b) is zero for finite-size networks, regardless of the values chosen for the SR.
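A finite-size estimate can be obtained with a standard Benettin-style tangent-vector propagation; the sketch below (our own illustration, not the authors' procedure) propagates a tangent vector through the exact Jacobian (8) of the autonomous map and accumulates the average log-growth. The key point is that the estimate is never significantly positive:

```python
import numpy as np

rng = np.random.default_rng(3)
N, n_steps = 100, 3000
W = rng.normal(size=(N, N)) / np.sqrt(N)  # the scale of W is irrelevant here

x = rng.normal(size=N); x /= np.linalg.norm(x)
delta = rng.normal(size=N); delta /= np.linalg.norm(delta)

lle = 0.0
for _ in range(n_steps):
    a = W @ x
    na = np.linalg.norm(a)
    a_hat = a / na
    # Apply the Jacobian of Eq. (8): (I - a_hat a_hat^T) W / ||a||.
    delta = (W @ delta - a_hat * (a_hat @ (W @ delta))) / na
    g = np.linalg.norm(delta)
    lle += np.log(g)
    delta /= g        # Benettin renormalization of the tangent vector
    x = a_hat

lle /= n_steps
print(lle)  # never significantly positive, consistent with Lambda = 0
```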

Network state dynamics: input-driven case
Let us define the "effective input" contributing to the neuron activation as u_k := W_in s_k. Accordingly, Eq. (2a) takes on the following form:

a_k = W x_{k−1} + u_k,   (11)

where u_k operates as a time-dependent bias vector. Let us define the normalization factor as:

N_k := ||a_k|| = ||W x_{k−1} + u_k||.   (12)

Then, as shown in the Methods section, the state at time-step n can be written as:

x_n = W^n x_0 / (N_1 N_2 ··· N_n) + Σ_{k=1}^n M_k u_k,   (13)

where

M_k := W^{n−k} / (N_k N_{k+1} ··· N_n).   (14)

By looking at (13), it is possible to note that each u_k is multiplied by a time-dependent matrix, i.e., the network's final state x_n is obtained as a linear combination of all previous inputs.
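The closed form (13) can be checked against a direct iteration of the map. A minimal sketch (random matrices, small inputs chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
N, n = 60, 15
W = rng.normal(size=(N, N)) / np.sqrt(N)
u = rng.normal(scale=0.1, size=(n + 1, N))   # u[1..n] are the effective inputs
x0 = rng.normal(size=N); x0 /= np.linalg.norm(x0)

# Iterate the input-driven map, recording the normalization factors N_k = ||a_k||.
x = x0.copy()
Ns = [None]          # 1-based indexing: Ns[k] = N_k
for k in range(1, n + 1):
    a = W @ x + u[k]
    Ns.append(np.linalg.norm(a))
    x = a / Ns[k]

# Closed form (13): x_n = W^n x_0 / prod(N_1..N_n) + sum_k W^(n-k) u_k / prod(N_k..N_n).
x_closed = np.linalg.matrix_power(W, n) @ x0 / np.prod(Ns[1:])
for k in range(1, n + 1):
    x_closed += np.linalg.matrix_power(W, n - k) @ u[k] / np.prod(Ns[k:])
print(np.allclose(x, x_closed))  # True
```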

Memory of past inputs
Here, we analyze the ability of the proposed model (2) to retain memory of past inputs in the state.In order to perform such analysis, we define a measure of network's memory that quantifies the impact of past inputs on current state x n .Our proposal shares similarities with a memory measure called Fisher memory, first proposed by Ganguli et al. 27 and then further developed in 24,32 .However, our measure can be easily applied also to non-linear networks like the one (2) proposed here, justifying the analysis discussed in the following.
Considering one time-step in the past, we have:

x_n = (W x_{n−1} + u_n) / N_n.   (15)

All past history of the signal is encoded in x_{n−1}. Note that (15) keeps the same form for every n. Going backward in time one more step, we obtain:

x_n = W² x_{n−2}/(N_{n−1} N_n) + W u_{n−1}/(N_{n−1} N_n) + u_n/N_n.   (16)

As ||x_n|| = ||x_{n−1}|| = ||x_{n−2}|| = 1, we see that there is a recursive structure in this procedure, where each incoming input is summed to the current state and their sum is then normalized. This is a key feature of the proposed architecture: it guarantees that each input will have an impact on the state, since the normalization prevents the norm of the activations from becoming too large or too small. We now express this idea in a more formal way. The norm of (14) can be written as

||M_k|| = ||W^{n−k}|| / (N_k ··· N_n) ≈ ρ^{n−k} / (N_k ··· N_n),   (17)

where the approximation holds thanks to Gelfand's formula¹. If the input is null, we expect each N_l to be of the order of ρ, as ||x_l|| = 1. So we write:

N_l = ρ + δ_l,   (18)

where δ_l denotes the impact of the l-th input on the l-th normalization factor. Its presence is due to the fact that, without any input, N_l would be approximately equal to ρ, while the input modifies the state and so the normalization value changes accordingly. The value δ_l := N_l − ρ is introduced to account for this fact.
If we assume that each input produces a similar effect on the state (i.e., δ_l = δ for every l), we finally obtain:

||M_k|| ≈ ρ^{n−k} / (ρ + δ)^{n−k+1}.   (19)

We note that such an assumption is reasonable for inputs that are symmetrically distributed around a mean value with relatively small variance, e.g., for Gaussian or uniformly distributed inputs (as demonstrated by our results). However, our argument might not hold for all types of inputs, and a more detailed analysis is left as future research.
We define the memory of the network M_α(k|n) as the influence of the input at time k on the network state at time n. More formally, having defined α := ρ/δ, from (19) it follows:

M_α(k|n) = (ρ/(ρ + δ))^{n−k} = (1 − 1/(α + 1))^{n−k}.   (20)

Eq. (20) goes to 0 (i.e., no impact of the input on the states) as α → 0 and tends to 1 for α → ∞, implying that for an infinitely large SR the network perfectly remembers its past inputs, regardless of how far in the past they occurred. Note that (20) does not depend on k and n individually, but only on their difference (the elapsed time): the larger the difference, the lower the value of M_α(k|n), meaning that inputs far in the past have less impact on the current state.
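The qualitative properties of (20) are easy to probe numerically. A short sketch (using the memory function in the form reconstructed above):

```python
import numpy as np

def memory(alpha, lag):
    """Memory M_alpha(k|n) of an input `lag = n - k` steps in the past, Eq. (20)."""
    return (1.0 - 1.0 / (alpha + 1.0)) ** lag

lags = np.arange(0, 50)
# Larger alpha = rho/delta means slower forgetting at every lag ...
assert np.all(memory(10.0, lags) >= memory(1.0, lags))
# ... the memory decays monotonically with the elapsed time ...
m = memory(5.0, lags)
assert np.all(np.diff(m) <= 0)
# ... and the two limits of Eq. (20): no memory as alpha -> 0, perfect memory as alpha -> inf.
print(memory(1e-9, 10))   # ~0
print(memory(1e9, 10))    # ~1
```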

Memory loss
By using (20), we define the memory loss between states x_n and x_m of the signal at time-step k (with m > n > k and m = n + τ) as follows:

ΔM(k|m, n) := M_α(k|n) − M_α(k|m) = (1 − 1/(α + 1))^{n−k} [1 − (1 − 1/(α + 1))^τ].   (21)

For very large values of α, we have ΔM(k|m, n) → 0, meaning that in our model (2b) larger spectral radii eliminate memory losses of past inputs. Now, we want to assess whether inputs in the far past have a higher or lower impact on the state than recent inputs. To this end, we select n > k > l and define k = n − a and l = n − b, leading to b > a > 0 and b = a + δ. Define:

ΔM(n − a, n − b|n) := M_α(n − a|n) − M_α(n − b|n).   (22)

We have that lim_{δ→∞} ΔM(n − a, n − a − δ|n) = (1 − 1/(α + 1))^a, showing that the impact of an input infinitely far in the past is not null compared with one that is only a < ∞ steps behind the current state. This implies that the proposed network is able to preserve in the network state (partial) information from all inputs observed so far.
¹Gelfand's formula 39 states that for any matrix norm, ρ(A) = lim_{k→∞} ||A^k||^{1/k}.

Linear networks
In order to assess the memory of linear models, we perform the same analysis for linear RNNs (i.e., x_n = a_n) by using the definitions given in the previous section. In this case, an expression for the memory can be obtained straightforwardly and reads:

M_ρ(k|n) = ρ^{n−k}.   (23)

It is worth noting that there is no dependency on δ and, therefore, on the input. Just like before, the memory loss of the signal at time-step k between states x_n and x_m is:

ΔM(k|m, n) = ρ^{n−k} (1 − ρ^τ),   (24)

where we set m > n > k and m = n + τ. As before, we also discuss the case of two different inputs observed before time-step n.
By selecting time instances n > k > l, with k = n − a and l = n − b, we have b > a > 0, allowing us to write b = a + δ. As such:

ΔM(n − a, n − b|n) = ρ^a (1 − ρ^δ).   (25)

We see that in both (24) and (25) the memory loss tends to zero as the spectral radius tends to one, which is the critical point for linear systems. So, according to our analysis, linear networks should maximize their ability to remember past inputs when their SR is close to one and, moreover, their memory should be the same regardless of the particular signal they are dealing with. We will see in the next section that both these claims are confirmed by our simulations.

Hyperbolic tangent
We now extend the analysis to standard ESNs, i.e., those using a tanh activation function. Define σ := tanh (applied element-wise).
Then, for generic scalars a and b, when |b| ≪ |a| the following approximation holds:

σ(a + b) ≈ σ(a) + σ(b).   (26)

When a is the state and b is the input, our approximation can be understood as a small-input approximation 40. Then, the state-update reads:

x_n ≈ σ(W x_{n−1}) + σ(u_n).   (27)

Thus, applying the same argument used in the previous cases, it is possible to write:

x_n ≈ Σ_{k=1}^n σ_{n−k}(u_k),   (28)

where we defined:

σ_j(u) := σ(W^j u) ≈ σ(ρ^j u).   (29)

We see that, differently from the previous cases, the final state x_n is a sum of nonlinear functions of the signal u_k. Each signal element u_k is encoded in the network state by first multiplying it by ρ (raised to the elapsed time) and then passing the outcome through the nonlinearity σ(·). This implies that, for networks equipped with the hyperbolic tangent function, larger spectral radii favour the forgetting of inputs (as we operate in the nonlinear region of σ(·)). On the other hand, for small spectral radii the network operates in the linear regime of the hyperbolic tangent and behaves like in the linear case.
A plot depicting the decay of the memory of input at time-step k for all three cases described above is shown in Fig. 4. The trends demonstrate that, in all cases, the decay is consistent with our theoretical predictions.

Performance on memory tasks
We now perform experiments to assess the ability of the proposed model to reproduce inputs seen in the past, meaning that the desired output at time-step k is y_k = u_{k−τ}, where τ ranges from 0 to 100. We compare our model with a linear network and a network with hyperbolic tangent (tanh) activation functions. We use fixed, but reasonable, hyper-parameters for all networks, since in this case we are only interested in analyzing the network behavior on different tasks. In particular, we selected hyper-parameters that worked well in all cases taken into account; in preliminary experiments, we noted that different values did not result in substantial (qualitative) changes of the results. The number of neurons N is fixed to 1000 for all models. For linear and nonlinear networks, the input scaling (a constant scaling factor of the input signal) is fixed to 1 and the SR equals ρ = 0.95. For the proposed model (2), the input scaling is chosen to be 0.01, while the SR is ρ = 15. For the sake of simplicity, in what follows we refer to ESNs resulting from the use of (2) as "spherical reservoirs".
To evaluate the performance of the network, we use the accuracy metric defined as γ = max{1 − NRMSE, 0}, where the Normalized Root Mean Squared Error (NRMSE) is defined as:

NRMSE = √( ⟨(ŷ_k − y_k)²⟩ / ⟨(y_k − ⟨y_k⟩)²⟩ ).   (30)

Here, ŷ_k denotes the computed output and ⟨·⟩ represents the time average over all time-steps taken into account. In the following, we report the accuracy γ while varying τ for various benchmark tasks. Each point in the plots is computed as the average over 20 runs with different network initializations; the shaded area represents the standard deviation. All the considered signals were normalized to have unit variance.
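The metric is a one-liner; the sketch below uses the NRMSE in the form reconstructed in Eq. (30) (variance-normalized), which makes predicting the target mean score exactly zero and a perfect reconstruction score exactly one:

```python
import numpy as np

def accuracy(y_true, y_pred):
    """gamma = max{1 - NRMSE, 0}, with the NRMSE normalized by the target's variance."""
    nrmse = np.sqrt(np.mean((y_pred - y_true) ** 2) / np.var(y_true))
    return max(1.0 - nrmse, 0.0)

y = np.sin(0.1 * np.arange(500))
print(accuracy(y, y))                  # 1.0: perfect reconstruction
print(accuracy(y, np.zeros_like(y)))   # 0.0: a constant predictor scores zero
```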

White noise
In this task, the network is fed with white noise uniformly distributed in the [−1, 1] interval. Results are shown in Fig. 5a. We note that networks using the spherical reservoir achieve a performance comparable with linear networks, while tanh networks do not correctly reconstruct the signal when the lag τ exceeds 20. It is worth highlighting the similarity of the plots shown here with our theoretical predictions about the memory (Eqs. (20), (23), and (28)) shown in Fig. 4.
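A scaled-down version of this experiment fits in a few lines. The sketch below (our own illustration with a smaller reservoir than in the paper, so exact numbers will differ) drives a spherical reservoir with uniform noise and trains a ridge-regression readout to recall the input a few steps back:

```python
import numpy as np

rng = np.random.default_rng(6)
N, T, tau, washout = 300, 2000, 5, 100
W = rng.normal(size=(N, N))
W *= 15.0 / max(abs(np.linalg.eigvals(W)))   # SR = 15, as in the paper's setup
w_in = rng.normal(size=N) * 0.01             # input scaling 0.01
u = rng.uniform(-1.0, 1.0, size=T)

# Collect states of the spherical reservoir (Eqs. 2a-2b with r = 1).
X = np.zeros((T, N))
x = np.ones(N) / np.sqrt(N)
for k in range(T):
    a = W @ x + w_in * u[k]
    x = a / np.linalg.norm(a)
    X[k] = x

# Ridge-regression readout trained to output u_{k - tau}.
Xw, yw = X[washout:], u[washout - tau:T - tau]
reg = 1e-6
W_out = np.linalg.solve(Xw.T @ Xw + reg * np.eye(N), Xw.T @ yw)
y_hat = Xw @ W_out
nrmse = np.sqrt(np.mean((y_hat - yw) ** 2) / np.var(yw))
gamma = max(1.0 - nrmse, 0.0)
print(gamma)   # recall accuracy (training fit); clearly above chance level
```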

Multiple superimposed oscillators
The network is fed with the Multiple Superimposed Oscillators (MSO) signal with 3 incommensurable frequencies:

s_k = sin(0.2k) + sin(0.311k) + sin(0.42k).   (31)

Results are shown in Fig. 5b. This task is difficult because of the multiple time scales characterizing the input 3. We note that the performance of the linear and the spherical reservoirs is again similar, and both outperform the network using the hyperbolic tangent. The accuracy peak at τ ≈ 60 is due to the fact that the autocorrelation of the signal reaches its maximum value at that time-step.

Lorenz series
The Lorenz system is a popular mathematical model, originally developed to describe atmospheric convection:

ẋ = σ(y − x),
ẏ = x(ρ − z) − y,   (32)
ż = xy − βz.

It is well known that this system is chaotic when σ = 10, β = 8/3 and ρ = 28. We simulated the system with these parameters and then fed the network with the x coordinate only. Results are shown in Fig. 5c. Also in this case, while the accuracy of the spherical and linear networks does not seem to be affected by the lag, the performance of networks using the tanh activation dramatically decreases for large lags. This stresses the fact that nonlinear networks are significantly penalized when they are requested to memorize past inputs.
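Generating the input series used here is straightforward; a minimal sketch with a fourth-order Runge-Kutta integrator and the standard chaotic parameters (step size and initial condition are our own illustrative choices):

```python
import numpy as np

def lorenz_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """One RK4 step of the Lorenz system (32) with the standard chaotic parameters."""
    def f(s):
        x, y, z = s
        return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])
    k1 = f(state); k2 = f(state + 0.5 * dt * k1)
    k3 = f(state + 0.5 * dt * k2); k4 = f(state + dt * k3)
    return state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

s = np.array([1.0, 1.0, 1.0])
xs = []
for _ in range(5000):
    s = lorenz_step(s)
    xs.append(s[0])          # only the x coordinate is fed to the network
xs = np.array(xs)
print(xs.min(), xs.max())    # the trajectory stays on the bounded attractor
```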

Santa Fe laser dataset
The Santa Fe laser dataset is a chaotic time series obtained from laser experiments. The results are shown in Fig. 5e. Also in this case the hyperbolic tangent networks do not manage to remember the signal, while the other systems show the usual behaviour.

Mackey-Glass system
The Mackey-Glass system is given by the following delayed differential equation:

ẋ(t) = α x(t − λ) / (1 + x(t − λ)^10) − β x(t).   (33)

In our experiments, we simulated the system using standard parameters, that is, λ = 17, α = 0.2 and β = 0.1. Results are shown in Fig. 5d. Note that in this case the performance of the network with spherical reservoir is comparable with the one obtained using the hyperbolic tangent, and both of them are outperformed by the linear networks.
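A simple way to simulate the delayed equation is forward Euler with a buffered history; the sketch below uses the exponent 10, the customary choice for this benchmark (the exponent was lost in the equation above and is an assumption here), with step size and initial history chosen for illustration:

```python
import numpy as np

# Euler integration with dt = 0.1; the delay lambda = 17 corresponds to 170 steps.
alpha, beta, lam, dt = 0.2, 0.1, 17.0, 0.1
delay = int(lam / dt)
n_steps = 20000

x = np.full(delay + n_steps, 1.2)   # constant history as initial condition
for t in range(delay, delay + n_steps - 1):
    x_d = x[t - delay]              # the delayed value x(t - lambda)
    x[t + 1] = x[t] + dt * (alpha * x_d / (1.0 + x_d ** 10) - beta * x[t])
series = x[delay:]
print(series.min(), series.max())   # bounded, aperiodic oscillations
```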

Memory/non-linearity trade-off
Here, we evaluate the capability of the network to deal with tasks characterized by tunable memory and non-linearity features 30 .
The network is fed with a signal whose samples u_k are univariate random variables drawn from a uniform distribution in [−1, 1]. The network is then trained to reconstruct the desired output y_k = sin(ν · u_{k−τ}). We see that τ accounts for the memory required to correctly solve the task, while ν controls the amount of non-linearity involved in the computation. For each configuration of τ and ν chosen in suitable ranges, we run a grid search over the admissible values of SR and input scaling; notably, we considered 20 equally-spaced values for the SR and for the input scaling (the search ranges for each architecture are reported below). In Figures 7, 8, and 9, we show the NRMSE for the task described above for different values of ν and τ, together with the ranges of the input scaling factor and the spectral radius which performed best, for the hyperbolic tangent, linear and spherical activation functions, respectively. The results of our simulations agree with those reported in 30 and, most importantly, confirm our theoretical prediction: the proposed model possesses a memory of past inputs that is comparable with the one of linear networks but, at the same time, it is also able to perform nonlinear computations. This explains why the proposed model attains the best performance when the task requires both memory and nonlinearity, i.e., when both τ and ν are large. Predictions obtained for a specific instance of this task requiring both features are given in Fig. 6, showing how the proposed model outperforms the competitors.

Figure 9. Performance of the proposed model. We see that the network performs reasonably well on all the tasks, displaying only a weak dependency on the required memory. Moreover, the spectral radius does not seem to play any role in the network performance. The choice of the scaling factor displays patterns similar to the hyperbolic tangent case.
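Generating one instance of this task is a two-liner; the helper below (an illustrative sketch, with the name `make_task` our own) makes explicit how τ and ν enter the target:

```python
import numpy as np

def make_task(T, nu, tau, rng):
    """Input u_k ~ U[-1, 1]; target y_k = sin(nu * u_{k - tau})."""
    u = rng.uniform(-1.0, 1.0, size=T)
    y = np.sin(nu * np.roll(u, tau))
    y[:tau] = 0.0          # targets without a valid past input are zeroed out
    return u, y

rng = np.random.default_rng(7)
u, y = make_task(1000, nu=2.5, tau=10, rng=rng)
# tau controls the required memory, nu the amount of nonlinearity.
print(np.allclose(y[10:], np.sin(2.5 * u[:-10])))  # True
```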

Discussion
In this work, we studied the properties of ESNs equipped with an activation function that projects the network state at each time-step onto a hyper-sphere. The proposed activation function is global, contrarily to most activation functions used in the literature, meaning that the activation value of a neuron depends on the activations of all other neurons in the recurrent layer. Here, we first proved that the proposed model is a universal approximator, just like regular ESNs, and gave sufficient conditions supporting this claim. Our theoretical analysis shows that the behaviour of the resulting network is never chaotic, regardless of the setting of the main hyper-parameters affecting its behaviour in phase space. This claim was supported both analytically and experimentally by showing that the maximum Lyapunov exponent is always zero, implying that the proposed model always operates on the EoC. This leads to networks which are (globally) stable for any hyper-parameter configuration, regardless of the particular input signal driving the network, hence solving stability problems that might affect ESNs.
The proposed activation function allows the model to display a memory capacity comparable with the one of linear ESNs, but its intrinsic nonlinearity makes it capable of dealing with tasks for which rich dynamics are required. To support this claim, we developed a novel theoretical framework to analyze the memory of the network. Taking inspiration from the Fisher memory proposed by Ganguli et al. 27, we focused on quantifying how much past inputs are encoded in future network states. The developed theoretical framework allowed us to account for the nonlinearity of the proposed model and predicted a memory capacity comparable with the one of linear ESNs. This theoretical prediction was validated experimentally by studying the behaviour of the memory loss using white noise as input, as well as well-known benchmarks used in the literature. Finally, we experimentally verified that the proposed model offers an optimal balance of nonlinearity and memory, showing that it outperforms both linear ESNs and ESNs with tanh activation functions in those tasks where both features are required at the same time.

Contractivity condition
The universal approximation property, as exhaustively discussed in 5, can be proved for an ESN of the form (1b) provided that it has the ESP. To prove the ESP, it is sufficient to show that the network satisfies the contractivity condition, i.e., that for each x and y in the domain of the activation function σ it holds that:

||σ(x) − σ(y)|| ≤ ||x − y||.   (34)

For the proposed activation σ(x) = r x/||x||, we have:

||σ(x) − σ(y)||² = r² (2 − 2 (x • y)/(||x|| ||y||)),   (35)

where x • y is the scalar product between the two vectors. Moreover,

||x − y||² = ||x||² + ||y||² − 2 x • y ≥ 2 ||x|| ||y|| − 2 x • y,   (36)

where the inequality ||x||² + ||y||² ≥ 2||x|| ||y|| (equivalently, a/b + b/a ≥ 2, with a = ||x|| and b = ||y||) follows from the fact that (a − b)² = a² + b² − 2ab ≥ 0 for all a, b > 0. Now, we assume ||x||, ||y|| > r and observe that:

||x − y||² ≥ 2 ||x|| ||y|| (1 − (x • y)/(||x|| ||y||)) ≥ 2 r² (1 − (x • y)/(||x|| ||y||)) = ||σ(x) − σ(y)||²,

proving the contractivity condition (34). We see that the only condition needed is ||x||, ||y|| > r, which means that the linear part of the update (2a) must map states outside the hyper-sphere of radius r. Finally, by applying properties of norms, we observe that:

||W x + W_in s|| ≥ ||W x|| − ||W_in s|| ≥ σ_min(W) ||x|| − s_max = σ_min(W) r − s_max,   (37)

and asking this quantity to be larger than r results in condition (3).
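The contraction property of the projection for points outside the sphere can be spot-checked numerically (a sketch with random points, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(8)
r = 1.0

def sigma(v, r=1.0):
    """The spherical projection sigma(x) = r x / ||x||."""
    return r * v / np.linalg.norm(v)

# For pairs of states outside the sphere of radius r, the projection is a contraction.
for _ in range(1000):
    x = rng.normal(size=20); x *= (r + rng.uniform(0.0, 5.0)) / np.linalg.norm(x)
    y = rng.normal(size=20); y *= (r + rng.uniform(0.0, 5.0)) / np.linalg.norm(y)
    assert np.linalg.norm(sigma(x) - sigma(y)) <= np.linalg.norm(x - y) + 1e-12
```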

Input-driven dynamics
Here we show how to derive (13). Consider a dynamical system evolving according to (2); explicitly expanding the first steps from the initial state x_0, we obtain:

x_1 = (W x_0 + u_1)/N_1,
x_2 = (W x_1 + u_2)/N_2 = W² x_0/(N_1 N_2) + W u_1/(N_1 N_2) + u_2/N_2,

and, iterating this expansion up to time-step n, the pattern yields (13).

Figure 1 .
Figure 1. Schematic representation of an ESN with a 4-dimensional input, a monodimensional output and N = 7 neurons in the reservoir. Fixed connections are drawn using continuous lines, while dashed lines represent connections that are learned.

Figure 3 .
Figure 3. Panel (a) shows LLEs for different values of the SR. Each point in the plot represents the mean of 10 different realizations, using a network with N = 500 neurons. In panel (b), examples of the Lyapunov spectrum for different networks are plotted. Note that the largest exponent is always zero.

Figure 4 .
Figure 4. Decay of the memory in the cases associated with expressions (20), (23), and (28). In all cases, the decay is consistent with the predictions.

Figure 5 .
Figure 5. Results of the experiments on memory for different benchmarks. Panel (a) displays the white noise memorization task, (b) the MSO, (c) the x-coordinate of the Lorenz system, (d) the Mackey-Glass series and (e) the Santa Fe laser dataset. As described in the legend (f), different line types account for results obtained on training and test data. The shaded areas represent the standard deviations, computed using 20 different realizations for each point.
For networks using the hyperbolic tangent, the SR varies in [0.2, 3]; in [0.2, 1.5] for linear networks; and in [0.2, 10] for networks with spherical reservoir. The input scaling always ranges in [0.01, 2]. We then choose the hyper-parameter combination that minimizes (30) on a training set of length L_train = 500 and assess the error on a test set with L_test = 200.

Figure 6 .
Figure 6. Comparison of the network predictions on the memory-nonlinearity task for ν = 2.5 and τ = 10. The hyper-parameters of the networks are the same used to generate Fig. 5. Here the accuracy values are γ_tanh = 0.12, γ_spherical = 0.63 and γ_linear = 0.61.

Figure 7 .Figure 8 .Figure 9 .
Figure 7. Results for the hyperbolic tangent activation function. The network performs as expected: the error grows with the memory required to solve the task. The choice of the spectral radius displays a pattern where larger SRs are preferred when more memory is required. The scaling factor tends to be small for almost every configuration.