A training algorithm for networks of high-variability reservoirs

Physical reservoir computing approaches have gained increased attention in recent years due to their potential for low-energy, high-performance computing. Despite recent successes, there are bounds to what one can achieve simply by making physical reservoirs larger. Therefore, we argue that a switch from single-reservoir computing to multi-reservoir and even deep physical reservoir computing is desirable. Given that error backpropagation cannot be used directly to train a large class of multi-reservoir systems, we propose an alternative framework that combines the power of backpropagation with the speed and simplicity of classic training algorithms. In this work, we report our findings from an experiment conducted to evaluate the general feasibility of our approach. We train a network of 3 Echo State Networks to perform the well-known NARMA-10 task, using intermediate targets derived through backpropagation. Our results indicate that our proposed method is well suited to train multi-reservoir systems in an efficient way.

1 Details on the used training algorithm

1.1 Used models and architecture
As already introduced in the main text, we use Echo State Networks (ESNs) to evaluate our proposed training algorithm, defined by the equations

$$x(n+1) = (1 - a)\,x(n) + a\,f\big(h(n)\big), \qquad h(n) = W_{res}\,x(n) + W_{in}\,v(n) + b_{res},$$

where n is the timestep, x(n) is the state vector of the ESN's nodes, v(n) is a vector holding the input to each reservoir node at timestep n, and W_in and W_res are the matrices holding the fixed weights of the input and hidden layers respectively. b_res is a fixed bias vector. f(h) is a nonlinear function which is computed elementwise on the hidden-layer activation vector h. a is the leaking decay rate, which determines the response of the system to various input timescales. The output layer of the reservoir is defined on top of the reservoir state vector x(n) as

$$\hat{y}(n) = w_{out}^{T}\,x(n) + b_{out},$$

where w_out is the trainable readout weight vector and b_out is a trainable bias term. For all conducted experiments, we initialise W_in, W_res and b_res by sampling from a uniform distribution with range [−0.5, 0.5], after which the matrices are scaled and sparsified as dictated by the corresponding reservoir hyperparameters: spectral radius, input scaling, bias scaling, as well as input and connection sparsity.
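To make these equations concrete, the following minimal sketch implements the state update and readout of a single leaky ESN in NumPy, assuming f = tanh; the sizes, the leak rate, and the omission of spectral-radius scaling and sparsification are simplifications for illustration, and all variable names are ours rather than taken from the experiments' code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and leak rate (not the values used in the experiments).
n_nodes, n_inputs, a = 100, 1, 0.7

# Fixed weights sampled uniformly from [-0.5, 0.5]; spectral-radius scaling,
# input/bias scaling and sparsification are omitted here for brevity.
W_res = rng.uniform(-0.5, 0.5, (n_nodes, n_nodes))
W_in = rng.uniform(-0.5, 0.5, (n_nodes, n_inputs))
b_res = rng.uniform(-0.5, 0.5, n_nodes)

def esn_step(x, v_n):
    """One leaky-integrator update of the reservoir state x given input v_n."""
    h = W_res @ x + W_in @ v_n + b_res      # hidden-layer activation
    return (1 - a) * x + a * np.tanh(h)     # f = tanh applied elementwise

def readout(x, w_out, b_out):
    """Linear output layer on top of the reservoir state."""
    return w_out @ x + b_out
```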

1.2 Training ESNs with linear regression
Classically, ESNs are trained using L2-regularised linear regression [1], where we find w_out such that ŷ(n) approximates the desired signal y(n). We do so by collecting a time-trace of the ESN state vector x(n) as we present an input signal to the reservoir, and arranging the collected state vectors in the rows of a reservoir state matrix X:

$$X = \begin{bmatrix} x(1)^{T} & 1 \\ x(2)^{T} & 1 \\ \vdots & \vdots \\ x(N)^{T} & 1 \end{bmatrix}.$$

Note the incorporation of a column vector of ones as the rightmost column of X. This allows us to find w_out and b_out jointly by means of the L2-regularised Moore-Penrose pseudoinverse:

$$\begin{bmatrix} w_{out} \\ b_{out} \end{bmatrix} = \big(X^{T}X + \alpha \tilde{I}\big)^{-1} X^{T} y,$$

where Ĩ is the identity matrix whose bottom-right diagonal element has been set to 0, i.e.

$$\tilde{I} = \mathrm{diag}(1, 1, \dots, 1, 0).$$
Using Ĩ instead of I here has the advantage that regularization is imposed on w_out, but not on b_out. α is the corresponding regularization parameter controlling the extent to which the model is allowed to fit the training data precisely. It needs to be optimised individually for every training run (see also Section 2).
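As a minimal illustration of this closed-form readout training, the sketch below appends the column of ones to the collected states and solves the regularised normal equations with the modified identity matrix Ĩ; the function and variable names are ours, not from the original implementation.

```python
import numpy as np

def train_readout(states, y, alpha):
    """Ridge-regression readout training.

    states : array of shape (T, n_nodes) holding the collected state vectors x(n)
    y      : desired signal of shape (T,)
    alpha  : regularisation strength
    Returns (w_out, b_out); the bias term is excluded from regularisation.
    """
    T = states.shape[0]
    X = np.hstack([states, np.ones((T, 1))])   # rightmost column of ones
    I_tilde = np.eye(X.shape[1])
    I_tilde[-1, -1] = 0.0                      # do not regularise b_out
    w = np.linalg.solve(X.T @ X + alpha * I_tilde, X.T @ y)
    return w[:-1], w[-1]

def predict(states, w_out, b_out):
    """Evaluate the trained readout on a state trace."""
    return states @ w_out + b_out
```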

1.3 Training multi-ESN networks with linear regression
In order to train a multi-ESN architecture as shown in Figure 1, we need 3 ESNs, which we label ESN 1 to ESN 3. We refer to the input signal, system matrix, output vector and target signal of ESN i as v_i(n), W_res^i, ŷ_i(n) and y_i(n) respectively. When training multi-reservoir architectures using linear regression (assuming that we have suitable desired signals for all intermediate outputs, be they analytical or trained), we proceed as follows. We first train ESN 1 using the original input vector v(n) as well as its assigned target signal y_1(n), as shown in Section 1.2, and obtain the trained output of ESN 1 as ŷ_1(n).
For each subsequent ESN, we feed the output of the previously trained ESN (as well as the original input in the case of ESN 2) as input to the next stage, after which the training process is performed again using target signal y_{i+1}(n). Using such an incremental approach, the resulting trained system is more robust to small errors made by ESNs closer to the input, as opposed to an approach where all ESNs are trained using the desired signal of the previous reservoir stage as input.
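The sketch below walks through this incremental procedure end to end for the three-ESN chain, with ESN 2 receiving the original input alongside the trained output of ESN 1 and ESN 3 receiving only the trained output of ESN 2; the reservoir construction, the input sequence and the target signals d1, d2, y are placeholders, so the listing illustrates the training order rather than the actual experiments.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_esn(n_in, n_nodes=100, rho=0.9, in_scale=0.5, a=0.7):
    """Create one set of fixed random ESN weights (illustrative settings only)."""
    W_res = rng.uniform(-0.5, 0.5, (n_nodes, n_nodes))
    W_res *= rho / np.max(np.abs(np.linalg.eigvals(W_res)))
    W_in = rng.uniform(-0.5, 0.5, (n_nodes, n_in)) * in_scale
    b_res = rng.uniform(-0.5, 0.5, n_nodes)
    return W_res, W_in, b_res, a

def run_esn(esn, v):
    """State trace of shape (T, n_nodes) for an input sequence v of shape (T, n_in)."""
    W_res, W_in, b_res, a = esn
    x, states = np.zeros(len(b_res)), []
    for v_n in v:
        x = (1 - a) * x + a * np.tanh(W_res @ x + W_in @ v_n + b_res)
        states.append(x)
    return np.array(states)

def ridge_readout(states, y, alpha=1e-6):
    """Train w_out, b_out by ridge regression and return the readout function."""
    X = np.hstack([states, np.ones((len(states), 1))])
    I_tilde = np.eye(X.shape[1])
    I_tilde[-1, -1] = 0.0
    w = np.linalg.solve(X.T @ X + alpha * I_tilde, X.T @ y)
    return lambda s: s @ w[:-1] + w[-1]

# Placeholder signals: v is the original input, d1/d2 are the intermediate
# desired signals (analytical or derived via backpropagation), y the final target.
T = 1000
v = rng.uniform(0.0, 0.5, (T, 1))
d1, d2, y = rng.normal(size=T), rng.normal(size=T), rng.normal(size=T)

# Stage 1: ESN 1 sees the original input and is trained towards d1.
esn1 = make_esn(n_in=1)
s1 = run_esn(esn1, v)
y1 = ridge_readout(s1, d1)(s1)

# Stage 2: ESN 2 sees the original input together with the trained output of ESN 1.
esn2 = make_esn(n_in=2)
s2 = run_esn(esn2, np.column_stack([v[:, 0], y1]))
y2 = ridge_readout(s2, d2)(s2)

# Stage 3: ESN 3 sees only the trained output of ESN 2 and is trained towards y.
esn3 = make_esn(n_in=1)
s3 = run_esn(esn3, y2[:, None])
y3 = ridge_readout(s3, y)(s3)
```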

1.4 Training with backpropagation
We train our proposed multi-reservoir architecture using stochastic gradient descent (SGD) with backpropagation to minimise the mean squared error e_MSE between the prediction of ESN 3, ŷ_3(n), and the desired output signal y(n).
We do so by iteratively updating the readout vectors w_out^1, w_out^2 and w_out^3 of ESN 1, 2 and 3 respectively, as we present minibatches of the input sequence v(n) to the networks.
We use the adaptive moment estimation (Adam) [2] learning rule, an extension of stochastic gradient descent (SGD), as follows. For every minibatch presented to the network, we update the output weight vector w_out^i of ESN i as

$$w_{out}^{i} \leftarrow w_{out}^{i} - \eta\left(\frac{\hat{m}_n}{\sqrt{\hat{v}_n} + \epsilon} + \lambda\,w_{out}^{i}\right),$$

where ε is a small constant ensuring numerical stability. The hyperparameter η is used to control the step size towards the estimated minimum and is called the learning rate. λ is another hyperparameter that regularises the learning process and is referred to as the weight decay. m̂_n and v̂_n are the bias-corrected first and second moment gradient estimates

$$\hat{m}_n = \frac{m_n}{1 - \beta_1^{\,n}}, \qquad \hat{v}_n = \frac{v_n}{1 - \beta_2^{\,n}},$$

where

$$m_n = \beta_1\,m_{n-1} + (1 - \beta_1)\,g_i, \qquad v_n = \beta_2\,v_{n-1} + (1 - \beta_2)\,(g_i \circ g_i),$$

where ∘ denotes the Hadamard product. The normalised gradient vector g_i is computed as

$$g_i = \frac{\nabla_i}{\lVert \nabla_i \rVert_2},$$

where ‖·‖_2 is the 2-norm and ∇_i is the gradient of e_MSE between ŷ_3(n) and y(n) with respect to the weight vector w_out^i, as found by backpropagation through time (BPTT) [3]. As we implement our experiments in PyTorch [4], the step of gradient computation is performed automatically for us by the framework. β_1 and β_2 are hyperparameters controlling how strongly the current gradient is incorporated into the first and second order moment gradient estimates. For more details on Adam, refer to [2]; for the chosen hyperparameter values, refer to Section 2.
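A hedged sketch of how one such update step can be wired up in PyTorch: the three readout tensors and their sizes are placeholders, the prediction y3_hat is assumed to have been computed from them through the unrolled multi-ESN forward pass, and torch.optim.Adam applies the weight decay internally at step time.

```python
import torch

# Placeholder trainable readouts for the three ESNs (100 nodes each); in the
# real setup these belong to the multi-ESN model that produces y3_hat.
readouts = [torch.zeros(100, requires_grad=True) for _ in range(3)]
optimizer = torch.optim.Adam(readouts, lr=1e-3, weight_decay=1e-5,
                             betas=(0.9, 0.999))

def training_step(y3_hat, y_target):
    """One minibatch update: MSE loss, BPTT gradients, per-tensor 2-norm
    gradient normalisation, then an Adam step."""
    loss = torch.nn.functional.mse_loss(y3_hat, y_target)
    optimizer.zero_grad()
    loss.backward()                          # gradients via BPTT
    for w in readouts:
        if w.grad is not None:               # normalise each gradient vector
            norm = w.grad.norm(p=2)
            if norm > 0:
                w.grad.div_(norm)
    optimizer.step()
    return loss.item()
```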
To additionally improve convergence while training, we use an approach inspired by batch normalisation [5], where we shift and scale the input of each ESN. If we denote the input to ESN i as v_i(n), then we compute the normalised ESN input v_i^norm(n) to be fed to ESN i as

$$v^{norm}_{i}(n) = \gamma_i \circ v_i(n) + \delta_i,$$

where γ_i and δ_i are parameters to be optimised by means of the same gradient updates as the readout weights, using the gradients ∂e_MSE/∂γ_i and ∂e_MSE/∂δ_i respectively. Therefore, this setup learns a suitable normalization for intermediate reservoir signals. When transferring our training approach to physical reservoirs, one could work with modulators and adders to adjust the input signal of each reservoir correspondingly.
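A minimal sketch of this input adjustment as a small PyTorch module with a learnable elementwise scale γ and shift δ; the module name, the initial values and the shape convention are assumptions for illustration.

```python
import torch

class InputScaleShift(torch.nn.Module):
    """Learnable elementwise scale (gamma) and shift (delta) applied to the
    input signal of a downstream ESN."""
    def __init__(self, n_channels):
        super().__init__()
        self.gamma = torch.nn.Parameter(torch.ones(n_channels))
        self.delta = torch.nn.Parameter(torch.zeros(n_channels))

    def forward(self, v):
        # v has shape (..., n_channels); gamma and delta broadcast over it.
        return self.gamma * v + self.delta

# gamma and delta are registered as module parameters, so they can simply be
# added to the optimiser's parameter list and receive the same Adam updates
# as the readout weights.
```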

1.5 Deriving targets for linear regression
After having trained a multi-ESN architecture using backpropagation, we extract the full output sequences of the intermediate outputs of ESN 1 and ESN 2, which can then be set as the desired signals d_1 and d_2 for a new set of ESNs with different random weights, such that

$$d_1(n) = \hat{y}^{\,bprop}_{1}(n) \qquad \text{and} \qquad d_2(n) = \hat{y}^{\,bprop}_{2}(n),$$

where ŷ_i^bprop denotes the output of ESN i trained by means of backpropagation as shown in Section 1.4. That way, using our derived target signals, we can train our given multi-reservoir architecture as shown in Section 1.3.
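In code, this target derivation is simply a recording pass over the trained network with gradients disabled; the sketch below uses two small linear modules as stand-ins for the backpropagation-trained ESN 1 and ESN 2, so every name and shape here is illustrative rather than taken from the actual setup.

```python
import torch

# Stand-ins for the BPTT-trained stages mapping the input sequence to the
# intermediate outputs; in the real setup these are the trained ESN 1 and ESN 2.
esn1_bprop = torch.nn.Linear(1, 1)
esn2_bprop = torch.nn.Linear(2, 1)

def derive_targets(v):
    """Record the intermediate outputs of the backpropagation-trained network;
    they become the desired signals d1, d2 for a fresh set of random ESNs."""
    with torch.no_grad():                          # recording only, no gradients
        y1 = esn1_bprop(v)                         # stands in for y1_bprop
        y2 = esn2_bprop(torch.cat([v, y1], -1))    # stands in for y2_bprop
    return y1.squeeze(-1), y2.squeeze(-1)

d1, d2 = derive_targets(torch.rand(1000, 1))       # placeholder input sequence
```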

2 Used hyperparameters

2.1 ESN
For both the monolithic ESN as well as the multi-reservoir architecture with analytical targets, the validation set was used to find the hyperparameters that are intrinsic to the reservoir: spectral radius, input scaling, bias scaling and leak rate, as well as the sparsity of the reservoir input and connection matrices. To tune the hyperparameters of our Echo State Networks, we have used the Python package hyperopt [6], which attempts to find good hyperparameter configurations for computational models through the usage of surrogate modeling. Essentially, hyperopt samples from the hyperparameter space of the model's error function by computing the model errors of preselected points, i.e. hyperparameter sets. Based on these obtained errors, it attempts to approximate the model error function by means of a surrogate model. This surrogate model of the model error surface can then be used to find a good set of hyperparameters. In order to account for varying reservoir weights, when sampling a single point in hyperparameter space, we supply the surrogate model with the average model error over 5 different random sets of reservoir weights. In general, the more points are sampled from the hyperparameter space, the more accurate the approximation, and the better the found set of hyperparameters. For our multi-ESN network, we sample 10000 times from the hyperparameter space for each of our 100-node ESNs. For the 300-node monolithic baseline ESN, we use the same overall number of sampling operations and sample 30000 times to find a good set of hyperparameters. Table 1 lists the found hyperparameter values.
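A hedged sketch of how such a hyperopt search over the intrinsic ESN hyperparameters can be set up; the search ranges, the dummy objective standing in for the actual ESN training and validation, and the reduced number of evaluations are all placeholders rather than the settings used in the experiments.

```python
import numpy as np
from hyperopt import fmin, tpe, hp

# Illustrative search space over the intrinsic reservoir hyperparameters.
space = {
    "spectral_radius": hp.uniform("spectral_radius", 0.1, 1.5),
    "input_scaling":   hp.uniform("input_scaling", 0.01, 2.0),
    "bias_scaling":    hp.uniform("bias_scaling", 0.0, 1.0),
    "leak_rate":       hp.uniform("leak_rate", 0.1, 1.0),
    "input_sparsity":  hp.uniform("input_sparsity", 0.0, 0.9),
    "conn_sparsity":   hp.uniform("conn_sparsity", 0.0, 0.9),
}

def evaluate_reservoir(params, seed):
    """Dummy stand-in: build an ESN from `params` with weights seeded by `seed`,
    train its readout and return the validation error."""
    rng = np.random.default_rng(seed)
    return abs(params["spectral_radius"] - 0.9) + 0.01 * rng.random()

def objective(params):
    # Average the validation error over 5 different random sets of reservoir weights.
    return float(np.mean([evaluate_reservoir(params, seed) for seed in range(5)]))

# The experiments use 10000 evaluations per 100-node ESN; kept small here.
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=100)
print(best)
```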

2.2 Linear regression
The regularisation strength α was tuned individually for each reservoir using 5-fold cross-validation on the training set. Possible values for α were evaluated on a logarithmic scale between $10^{2}$ and $10^{-13}$.
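As an illustration of this selection procedure, the following sketch scans α over a logarithmic grid with 5-fold cross-validation on precollected reservoir states; the state matrix, the target signal and the number of grid points are placeholders.

```python
import numpy as np
from sklearn.model_selection import KFold

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution; the appended bias column stays unregularised."""
    I_tilde = np.eye(X.shape[1])
    I_tilde[-1, -1] = 0.0
    return np.linalg.solve(X.T @ X + alpha * I_tilde, X.T @ y)

# Placeholder data: collected reservoir states with an appended column of ones.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(1000, 100)), np.ones((1000, 1))])
y = rng.normal(size=1000)

alphas = np.logspace(2, -13, num=16)        # 10^2 ... 10^-13
cv = KFold(n_splits=5)
scores = []
for alpha in alphas:
    fold_errors = []
    for train_idx, val_idx in cv.split(X):
        w = ridge_fit(X[train_idx], y[train_idx], alpha)
        fold_errors.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
    scores.append(np.mean(fold_errors))
best_alpha = alphas[int(np.argmin(scores))]
print(best_alpha)
```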

2.3 SGD with Backpropagation
We have tuned the minibatch size, the training time in epochs, the learning rate η, as well as the weight decay λ. For the moment decay rates β_1 and β_2, PyTorch default values have been used. We have initialised all readout weights using orthogonal initialization [7]. We found that, even though the Adam learning rule is used, our results improved when dividing η by 2 after half of the total number of epochs. Furthermore, we have found that the best results are achieved when not shuffling minibatches, but rather presenting them in the order dictated by the input sequence. This makes sense, since presenting minibatches in their intended order allows the network to utilise its memory of the input beyond the current minibatch. We applied a simple form of early stopping: every 10 epochs during training, we recorded a snapshot of the current network weights as well as the validation score they obtained. After 120 epochs, we chose the weight set with the lowest validation score. Finally, while we have experimented with the addition of artificial gradient noise [8], we found that the best results were achieved without adding any additional noise. Table 2 lists all found and set parameters for our backpropagation approach.
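The sketch below shows how the learning-rate halving and the snapshot-based early stopping described above can be combined in a PyTorch training loop; the model, the minibatch iterator and the validation routine are placeholders to be supplied by the surrounding experiment code.

```python
import copy
import torch

def train(model, train_batches, validate, epochs=120, lr=1e-3, weight_decay=1e-5):
    """`model` maps an input minibatch to a prediction, `train_batches()` yields
    (input, target) pairs in their original temporal order (no shuffling), and
    `validate(model)` returns a validation error for the current weights."""
    optim = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    snapshots = []
    for epoch in range(epochs):
        if epoch == epochs // 2:                       # divide eta by 2 halfway
            for group in optim.param_groups:
                group["lr"] *= 0.5
        for v, y in train_batches():
            loss = torch.nn.functional.mse_loss(model(v), y)
            optim.zero_grad()
            loss.backward()
            optim.step()
        if (epoch + 1) % 10 == 0:                      # snapshot every 10 epochs
            snapshots.append((validate(model), copy.deepcopy(model.state_dict())))
    # Early stopping: restore the snapshot with the lowest validation error.
    best_error, best_state = min(snapshots, key=lambda s: s[0])
    model.load_state_dict(best_state)
    return best_error
```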