Model-size reduction for reservoir computing by concatenating internal states through time

Reservoir computing (RC) is a machine learning algorithm that can learn complex time series from data very rapidly based on the use of high-dimensional dynamical systems, such as random networks of neurons, called “reservoirs.” To implement RC in edge computing, it is highly important to reduce the amount of computational resources that RC requires. In this study, we propose methods that reduce the size of the reservoir by inputting the past or drifting states of the reservoir to the output layer at the current time step. To elucidate the mechanism of model-size reduction, the proposed methods are analyzed based on information processing capacity proposed by Dambre et al. (Sci Rep 2:514, 2012). In addition, we evaluate the effectiveness of the proposed methods on time-series prediction tasks: the generalized Hénon-map and NARMA. On these tasks, we found that the proposed methods were able to reduce the size of the reservoir up to one tenth without a substantial increase in regression error.

Efficiently processing time-series data is important for various tasks, such as time-series forecasting, anomaly detection, natural language processing, and system control.Recently, machine-learning approaches for these tasks have attracted much attention of researchers and engineers because they not only require little domain knowledge but also often perform better than traditional approaches.In particular, machine-learning models that employ recurrent neural networks, such as long short-term memory, have achieved great success in natural language processing and speech recognition [1], and their fields of applications continue to expand.However, the standard learning algorithms for recurrent neural networks, which include backpropagation through time [2] and its variants [3], require large computational resources.These computational burdens often hinder real-world applications, especially when computing is performed near end users.Such computing is called "edge computing," and characterized by limited computational power and limited battery capacity.
Reservoir computing (RC) is a machine-learning algorithm that aims to reduce the computational resources required for predicting time series without reducing accuracy.As shown in Fig. 1, a typical RC consists of three parts: an input layer, a "reservoir" layer where neurons are randomly connected, and an output layer [4,5].Because only the weights between the reservoir layer and the output layer are trained while the other weights remain fixed, the learning process of RC is much faster than that of backpropagation Figure 1: Typical RC architecture.The reservoir layer consists of randomly connected neurons.The connections between the input and reservoir layer W in and connections within the reservoir layer W res are fixed (solid arrows), whereas the output weights W out are trained (dashed arrows).
through time [6,7,8].Therefore, RC is expected to be a lightweight machine-learning algorithm that enables machine learning in edge computing [9].
To develop the applications of RC in edge computing, its hardware implementation must be improved to enhance its computational speed and energy efficiency.For realizing such efficient hardware implementation, variants of RC models, some of which employ delay-feedback systems [21], simple network topologies such as ring-topology and delay lines [22,23,24], and billiard systems [25], have been proposed.Efficient hardware based on these variants have been implemented using field programmable integrated circuits (FPGAs) [26,27,28,29].Moreover, numerous types of implementation employing physical systems, such as photonics [30,31,32], spintronics [33], mechanical oscillators [34], and analog integrated electronic circuits [35,36], have been demonstrated [37].Although these implementations have exhibited the superiority of RC in computational speed and energy efficiency, the maximum size of the reservoir is limited by the physical size of the hardware.
In this study, we propose three methods that reduce the size of the reservoir without any performance impairment.The three methods share the concept that the number of the effective dimension of the reservoir is increased by allowing additional connections from the reservoir layer at multiple time steps to the output layer at the current time step.We analyze the mechanism of the proposed methods based on the information processing capacity (IPC) proposed by Dambre et al. [38].We also demonstrate how the proposed methods reduce the size of the reservoir in the generalized Hénon-map and NARMA tasks.

RC framework
In this section, we introduce the standard definition of RC and our proposed methods.

Definition of RC
In the mathematical representation of RC, four vector variables are defined as follows: u(t) ∈ R N in for the inputs, x(t) ∈ R N res for the states of the reservoir, y(t) ∈ R N out for the outputs, and and y tc (t) ∈ R N out for the teaching signals.The constants N in , N res , and N out are the dimensions of the inputs, states of the reservoir, and outputs, respectively.The updates of the reservoir states are given by where W in ∈ R N res ×N in is a weight matrix representing the connections from the neurons in the input layer to those in the reservoir layer.Its elements are independently drawn from uniform distribution U (−ρ in , ρ in ), where ρ in is a positive constant.Another weight matrix W res ∈ R N res ×N res represents the connections among the neurons in the reservoir layer.Its elements are initialized by drawing values from uniform distribution U (−1, 1) and subsequently divided by a positive value to ensure that the spectral radius of W res is ρ res .Note that elements in matrices W in and W res are fixed to the initialized values.The outputs are obtained by where W out ∈ R N out ×N res is a weight matrix representing connections from the neurons in the reservoir layer to those in the output layer.The output weight matrix W out is trained in the offline learning process of RC by using the pseudoinverse (see Materials and Methods sections).

Proposed methods
We propose three methods that modify the connections between the reservoir and output layers.We call these three methods (i) delay-state concatenation, (ii) drift-state concatenation, and (iii) delay-state concatenation with transient states.These methods share the idea that the number of the effective dimension of the reservoir is increased by allowing additional connections from the reservoir layer at multiple time steps to the output layer at the current time step.For the delay-state concatenation and delay-state concatenation with transient states, additional connections are formed from the past states of the reservoir layer to the current output layer, as illustrated in Figs. 2 A and D. On the other hand, for the drift-state concatenation, additional connections are formed from newly introduced states of the reservoir, called drifting states, to the current output layer, as illustrated in Fig. 2 B. The drifting states are obtained by updating the current states of the reservoir layer without input signals.In the following section, we mathematically formulate these three proposed methods.
First, we formulate the delay-state concatenation with concatenated states of the reservoir given by Note that x(t) is a column vector and the number of neurons in the reservoir does not change.A positive integer Q represents the unit of delays.Another positive integer P represents the number of past states that are concatenated to the current states.The outputs are obtained with the concatenated states x(t) and the corresponding output-weight matrix Ŵ out as follows: Here, x(t) and Ŵ out are defined in R (P +1)N res and R N out ×(P +1)N res , respectively.Figure 2 A shows a schematic of this method illustrating the prediction of y(3) when Q = 1 and P = 2.One can see that there are additional connections from the past states of the reservoirs x(1) and x(2) to output y(3), as indicated by red dashed arrows.From a different point of view, this model can be illustrated using the concatenated states of the reservoir x(t), as in Fig. 2 C, where the reservoir consists of three identical smaller reservoirs, each with a different time delay of 0, 1, and 2 from the inputs.Evidently, the effective dimension of the concatenated reservoir is three times larger than that of the original reservoir.Because the learning performance is enhanced by using a larger reservoir [7], the proposed method should be able to increase the computing capability without needing to add neurons in the reservoir.Second, we formulate drift-state concatenation, as illustrated in Fig. 2 B by introducing the drifting states of the reservoir given by where x drift (t ; t) represents the drifting states of the reservoir and t is the time step after the current time step t.Using the drifting states, we redefine the concatenated states of the reservoir and the corresponding output matrix as follows: . . .
Here, W drift is a matrix representing the connections within the reservoir to obtain the drifting states, and its elements are drawn from uniform distribution U (−1, 1) divided by a positive value to ensure that the spectral radius of W drift is ρ drift .Third, delay-state concatenation with transient states introduces transient states to the delay-state concatenation, as illustrated in Fig. 2 D where the states of the reservoir update twice (so it has one transient state) during the inputs and the outputs update once (see Materials and Methods section).
Although we have shown that the proposed methods can increase the effective dimension of the reservoir without adding neurons, one potential drawback of the methods is the cost of the memory required to store the past reservoir states.Against expectation, however, delay-state concatenation is very memory efficient, as explained below.
To carry out RC with delay-state concatenation, the terms are computed and stored in memory at time step t.The dimensions of these vectors are all N out .The vector W out i x(t) must be stored until they are used for calculating the outputs at time steps t + iQ.Therefore, the total memory cost is obtained as follows: It should be noted that this method increases the effective dimension of the reservoir state by a factor of (P +1), but the number of the neurons in the reservoir is not increased.If the number of neurons is increased by (P + 1) times, the memory cost to store the reservoir states becomes (P + 1)N res .For moderate numbers of P and Q (typically less than 5), the memory cost for the proposed method given by Eq. 11 is much less than (P + 1)N res because N out N res for typical RC applications.The memory required to store the weights within the reservoir is also reduced from (P N res ) 2 + P N out to (N res ) 2 + P N res N out in the same way given that the reservoir is fully connected.Therefore, delay-state concatenation can increase the dimensions of the reservoir more efficiently than by simply increasing the number of neurons in the reservoir.

Quantitative analysis based on IPC
Before benchmarking the proposed RC, we quantitatively analyze the learning capacity of the RC to elucidate how the proposed methods work.
The memory capacity (MC) is a performance measure commonly used in the RC research community [39].The MC represents how precisely the RC system can reproduce the past inputs.A number of studies have shown that the MC is theoretically bounded by the number of neurons in the reservoir, and the MC can reach this bound in some situations [40,41,23,24].Boedecker et al. [42] evaluated the MC at the edge of chaos, which is a region in the model parameter space where RC is stable but near to unstable.Farka et al. [43] have evaluated the MC for various model parameters.However, the MC does not evaluate how well RC processes information in a nonlinear way.Because many tasks in the real world targeted by RC are nonlinear problems, the MC is not a suitable measure for analyzing the proposed methods in this sense.Therefore, to elucidate the mechanism of the methods proposed in this study, we employed another criterion [38] called the information processing capacity (IPC), which handles nonlinear tasks.
The IPC is a measure that integrates both memory and information processing performance.By employing an orthogonal basis set that spans the Hilbert space, one IPC can be obtained from one corresponding basis.The IPC can be interpreted as a quantity that represents not only how well the network can memorize past inputs but also how precisely the network can convert inputs into the target outputs in a nonlinear manner given the basis set.Dambre et al. [38] showed that the total IPC C total , which is a sum of all IPCs, is identical to the number of neurons in the reservoir, provided (I) the inputs are independent and identically distributed (i.i.d.), (II) the fading memory condition is satisfied, and (III) all the neurons are linearly independent (see Theorem 7 and its proof in [38]).By analyzing the IPCs, one can obtain a large amount of information about how RC processes input data.For example, the degree of nonlinearity of the information processing carried out in RC can be analyzed by calculating multi-order IPCs.The kth-order IPC C kth is defined as the sum of the IPCs corresponding to the subset of a basis with kth-order nonlinearity.Based on the IPC, informative results such as the memory-nonlinearity tradeoff have been obtained [38].Therefore, using the IPC, we can analyze how RC stores and processes information in the reservoir as well as how the proposed methods affect the way information is processed.
We calculated the IPCs for the standard RC and for RC with the proposed methods (see Materials and Methods section).Figure 3 shows the IPCs when N res = 12 and N res = 24 for various values of ρ in and ρ res .Note that only odd-order IPCs were observed because of the symmetry.In each setting, we calculated the IPCs for the standard RC (left columns), those for the RC with delay-state concatenation with P = Q = 1 (center columns), and those for the RC with drift-state concatenation with P = Q = 1 (right columns).One can find that the value of C total /(P + 1) almost reaches the number of neurons N res in reservoir except when ρ res = 1.05.This result indicates that the proposed methods actually increase the total IPC by (P + 1) times.The observed lower values of the total IPCs for the case of ρ res = 1.05 can be attributed to the failure-of-fading-memory condition.In the RC research community, it is well known that the dynamics of RC is more likely to be chaotic when ρ res increases (typically occurring when ρ res is larger than 1) [44], and this corresponds to the failure of fading memory.For all cases, as ρ in increases, the third-order IPC C 3rd and the fifth-order IPC C 5th tend to increase, which reflects the increase in nonlinearity in the reservoir [38] given the selected basis set.Note that as ρ in increases, the total IPC C total tends to decrease, which is attributed to increases in the importance of higher-order IPCs (e.g., seventh-order and ninth-order IPCs).
In Fig. 4 (A), we show the IPCs for delay-state concatenation with P = 1 for several values of the unit of delay Q.As Q increases, the first-order IPC increases as well.This result may be trivial because RC with large Q can access the past states of the reservoir, rendering the reproduction of the past inputs easy.To investigate the effects of the unit of delay Q on IPCs, we decomposed the kth-order IPC C kth into components in terms of their delay such as C kth 0 , C kth 1 , and C kth 2 , which correspond to a different subset of the basis.Figure 4 (b) shows the distributions of the delay components for four values of Q under the same experimental conditions.As the values of Q increase, the distribution tends to shift to the right (larger delays) for each order of IPC.This fact indicates that, as demonstrated in the subsequent section, one can tune RC models by adjusting the value of Q according to the delay structure of the target tasks.
Next, for various values of P , we show the IPCs in Fig. 5 (A) for delay-state concatenation, drift-state concatenation, and delay-state concatenation with one transient state.For all three proposed methods, the contributions of higher-order IPCs tend to be dominant in the total IPCs as P increases.In Fig. 5 (B), we show the delay structures of the IPCs.Note that to clarify how the distributions of the delay structure change as the value of P changes, we used the normalized IPC C nth τ /(P + 1) τ C nth τ .The top panels in Fig. 5 (B) show that, as P increases, the distribution of the delay structure of the IPCs for delay-state concatenation tends to shift to the right (larger delays).Conversely, in the middle panels, the IPCs for drift-state concatenation do not change significantly.These results may be explained as follows: the increase in P for delay-state concatenation increases the memory of past inputs because of the additional connections from the past states of the reservoir, whereas the increase in P for drift-state concatenation does not increase the memory of the past inputs because drifting states are obtained from the current states of the reservoir.The bottom panels of this figure show that the distribution of IPCs for delay-state concatenation with one transient state tends to shift to larger delays as P increases.However, the delays in this distribution are smaller than the delays in the distribution obtained using delay-state concatenation.This difference may stem from the fact that the information of past states is more likely to be thrown away in delay-state concatenation with one transient state because the RC model in this case carries out nonlinear transformation twice for each input (see Fig. 2 D).
Here, we present a short summary of the above experiments.We have numerically shown that the total IPCs divided by P +1 are almost independent of the values of Q and P , which is consistent with the theory in Ref. [38].Furthermore, we have found that the importance among IPC components and the delay structure of IPCs can be modified by selecting the values of Q and P .These findings indicate that the learning performance on real-world tasks may be enhanced by selecting appropriate values of Q and P adjusted to a target task with a specific temporal structure.

Effectiveness on Complex Data
Although we have shown that the proposed methods can increase the IPCs efficiently, the conditions assumed above are not always guaranteed in real-world applications; for example, inputs may not be drawn from i.i.d.data, and neurons in the reservoir may not be linearly independent.Therefore, the IPCs are just a guide that help us understand the mechanisms of the proposed methods.In this section, to evaluate the effectiveness of the proposed methods on complex data, we applied them to two prediction tasks: generalized Hénon-map tasks and NARMA tasks (see Eq. 24 and Eq. 25 in Materials and Methods section).In the following experiments, the dimensions of the reservoir are approximately given by an integer N * , and the actual number of neurons in the reservoir is given by N res = N * /(P + 1) , where m := max{q ∈ Z|q ≤ m}.We optimized the model parameters, ρ in , ρ res , and ρ drift with Bayesian optimization [45,46].
We first investigated the effects of the value of Q in the delay-state concatenation method.Figure 6 (a) shows the normalized mean-squared errors (NMSEs) for the sixth-order and eighth-order Hénon-map tasks for various values of Q with P = 1.As the value of Q increases, the NMSE first decreases, but abruptly increases when Q is larger than 4 in the sixth-order Hénon-map and larger than 6 in the eighth-order Hénon-map.Considering the fact that to predict the output, the mth-order Hénon-map has two informative inputs at the mth and (m − 1)th previous steps, these increases in performance are reasonable because a RC system with the appropriate values of Q can possess the information needed from past inputs.We also note that similar results were obtained recently in [47].In contrast, the NMSE monotonically increases as the value of Q increases for NARMA5 and NARMA10, as shown in Fig. 6 (b).These results imply that simply adjusting the value of Q is not effective for tasks such as NARMA5 and NARMA10 with complicated temporal structures.
We next investigated the effects of the value of P for the NARMA tasks in which adjusting the value of Q was not effective.Figure 7 shows the NMSE against the values of P for the NARMA5 task (the top panel) and NARMA10 task (the bottom panel) for three settings of N * .Note that N * is fixed as P is varied to ensure that the size of reservoir N res reduces to approximately 1/(P + 1) times as P varies.We found that the NMSEs were almost constant up to specific values of P .For example, for the case of the NARMA10 task with N * = 300, the NMSE was almost constant up to P = 5, indicating that the number of neurons in the reservoir can be reduced from 300 to 50 without impairing performance.
Finally, we compared the proposed methods, delay-state concatenation with and without one transient state, and drift-state concatenation on the NARMA10 task. Figure 8 shows the NMSE as functions of P values with N * = 200 and 300 for the proposed methods.We found that the NMSEs for delay-state concatenation with one transient state and drift-state concatenation were lower than those for delay-state concatenation when P was larger than 7.For the NARMA10 task, inputs more than 10 steps in the past are not very informative for predicting one step forward.The lack of information in the past steps may explain the increase in NMSE for delay-state concatenation when P is larger than 7.The observed lower NMSEs for delay-state concatenation with one transient state and drift-state concatenation are also reasonable because (I) the former method uses the past states of the reservoir, which contain more recent input information for the same value of P (see Fig. 2 D), and (II) the latter method uses the states of the reservoir, which contain current input information.

Discussion
In this study, we proposed three methods to reduce the size of an RC reservoir without impairing performance.
To elucidate the mechanism of the proposed methods, we analyzed the IPC.We found that the value of the total IPCs almost reaches N res (P + 1) using the proposed methods, whereas the importance of their components (the first-, third-, and fifth-order IPCs) changes drastically.We also found that the delay structures of the IPCs depend on the value of Q and P .To investigate the applicability of the proposed methods on complex data, we presented the experimental results on generalized Hénon-map and NARMA tasks.We found that when the target task has a relatively simple temporal structure, as demonstrated with the Hénon-map tasks, selecting an appropriate value of Q enhances the performance substantially.In contrast, when the target task contains complex temporal structure, as demonstrated in the NARMA tasks, adjusting the value of Q does not enhance the performance.However, in those cases, we found that increasing the value of P can reduce the size of the reservoir without impairing performance.We have demonstrated that the number of neurons in the reservoir can be reduced by up to one tenth in the NMARMA10 task.
Here, we briefly note the relationship between our work and the most relevant previous work [47].In [47], the authors proposed a method that is similar to delay-state concatenation.Their proposed model corresponds to the case when the number of additional connected past states is one (i.e., P = 1).They observed that performance enhancement depends on the value of Q, as we showed in this paper.However, to the best of our knowledge, the dependence of the performance on the value of P has not been reported.In addition, the other two proposed methods, drift-state concatenation and delay-state concatenation with transient states, are introduced for the first time in this paper.Moreover, the authors of [47] explained the mechanism of their proposed method in terms of the delayed embedding theorem [48].In contrast, we have provided a more intuitive explanation based on the IPC [38].
Because the proposed methods do not assume a specific topology for the reservoir, they can readily be implemented in FPGAs and physical reservoir systems, such as photonic reservoirs [37].Therefore, the proposed methods could be an important set of techniques that facilitates the introduction of RC in edge computing.

Training output weights
The training procedures are the same as those for standard RC models [7].The output weights are trained by minimizing Adding a regularization term Ŵ out 2 2 did not improve the performance in our case.

Delay-state concatenation with transient states
We inserted transient states in the RC system with delay-state concatenation as follows: where N tran is the number of inserted transient states.

IPC
Following the same procedure given in [38], the IPCs are calculated as follows: The total IPC is defined as where C total is the total IPC and C({d i }) is the IPC for a basis represented with a list {d i } = {d 0 , d 1 , . . .}.
The list represents an orthogonal basis in Hilbert space.We employed the following Legendre polynomials as the orthogonal basis: where P di (•) is the d i th-order Legendre polynomial and u(t) is drawn from uniform distribution on [−1, 1].
A constant τ max is the maximum delay, which must be large enough to converge the calculation.In our simulations, we set τ max to 50 for ρ in = 0.1 and to 25, otherwise.Then, the IPC C({d i }) can be calculated as where x(t) := 1 T T t=1 x(t).We set the simulation steps T to 10 6 in all experiments.To avoid overestimation, the value of C({d i }) was set to zero when the value is less than threshold of 7N res (P + 1) × 10 −5 .
We define the kth-order IPCs as The kth-order IPC was decomposed into components corresponding to a subset of a basis whose maximum delay is τ as follows:

Dataset
The mth-order generalized Hénon-map [49] is given by y tc (t) = 1.76 − y tc (t − m + 1) 2 − 0.1y tc (t − m) + σ(t) (m ≥ 2), (24) where σ is Gaussian noise with zero mean and standard deviation of 0.05.The inputs and outputs of the RC are the time series of an n-dimensional generalized Hénon-map.The task is to predict one step forward y(t + 1) with past inputs y(t), y(t − 1), . . . .The NARMA time series is obtained with a nonlinear auto-regressive moving average as follows: y tc (t) = 0.3y tc (t − 1) + 0.05y tc (t − 1) where s(t) is drawn from uniform distribution of [0, 0.5].The NARMA5 and NARMA10 time series correspond to the case when m = 5 and m = 10, respectively.The inputs of the RC are s(t).The task is to predict y(t) from the inputs s(t).
For both tasks, we used 2,000 steps as a training dataset and used 3,000 steps as a test dataset.We removed the first 200 steps (free run) both during the training and test phases to avoid the effects of the initial conditions in the reservoirs [7].We evaluated the performance based on the normalized mean-squared error (NMSE) during the test phase following: where x(t) = 1 T T t=1 x(t).We averaged the NMSEs over 10 trials.For each iteration, the dataset and connection matrix of the reservoir were generated using their corresponding probabilistic distributions.

Figure 2 :
Figure 2: Schematic of the proposed methods.A. Delay-state concatenation when the number of additional connections P is two and the unit of delay Q is one.B. Drift-state concatenation when P is two and Q is one.C. Another view of delay-state concatenation.The reservoir consists of three identical dynamical systems and delay lines.The added dynamical systems have +1 delay lines and +2 delay lines, respectively.D. Delay-state concatenation with one transient state when P is two and Q is one.

Figure 3 :
Figure 3: Information processing capacities (IPCs) of standard RC and RC with the proposed methods for various values of input weight strength ρ in and spectral radius ρ res .The left and right panels show the results for N res = 12 and N res = 24 , respectively.The dashed horizontal line within each subgraph represents the value of N res .For each value of ρ in , the left column presents the standard RC with P = 0, the center column presents delay-state concatenation with P = Q = 1, and the right column presents drift-state concatenation with P = Q = 1.

Figure 4 :
Figure 4: (A) IPCs of the RC with delay-state concatenation for various values of Q. (B) Delay structures of the IPCs for various values of Q. From top to bottom, the first-, third-, and fifth-order IPCs are shown.The parameters are set as follows: N res = 24, ρ in = 0.9, ρ res = 0.95, and P = 1.

Figure 5 :
Figure 5: (A) IPCs with the proposed methods for various values of P .Left columns: IPCs for delay-state concatenation, center columns: IPCs for drift-state concatenation, and right columns: IPCs for delaystate concatenation with one transient state.(B) Comparison of the delay structures of the IPCs for the three proposed methods.Top panels (i): delay structures for delay-state concatenation, middle panels (ii): delay structures for drift-state concatenation, and bottom panels (iii): delay structures for delay-state concatenation with one transient state.To highlight the difference in the distribution of the delay structures, the IPCs for each order are normalized.From left to right, the first-, third-, and fifth-order IPCs are shown.The parameters are set as follows: N res = 24, ρ in = 0.9, ρ res = 0.95, and Q = 1.

Figure 6 :
Figure 6: (A) NMSE for a sixth-order Hénon-map task (the upper panel) and for an eighth-order Hénon-map task (the lower panel) for various values of Q with P = 1.(B) NMSE for NARMA5 (the upper panel) task and for NARMA10 (the lower panel) task for various values of Q with P = 1.Error bars show the standard deviation of the results of 10 trials.

Figure 7 :
Figure 7: NMSEs for the NARMA5 (the top panel) and NARMA10 (the bottom panel) tasks with delaystate concatenation as functions of P for various values of N * .Error bars show the standard deviation of the results of 10 trials.

Figure 8 :
Figure 8: Results of regression performance on the NARMA10 task for various values of P with N * = 200 (the top panel) and N * = 300 (the bottom panel) for (A) delay-state concatenation, (B) drift-state concatenation, and (C) delay-state concatenation with one transient state.Error bars show the standard deviation of the results of 10 trials.