Learning Reservoir Dynamics with Temporal Self-Modulation

Reservoir computing (RC) can efficiently process time-series data by transferring the input signal to a randomly connected recurrent neural network (RNN), which is referred to as a reservoir. The high-dimensional representation of time-series data in the reservoir significantly simplifies subsequent learning tasks. Although this simple architecture allows fast learning and facile physical implementation, the learning performance is inferior to that of other state-of-the-art RNN models. In this paper, to improve the learning ability of RC, we propose self-modulated RC (SM-RC), which extends RC by adding a self-modulation mechanism. The self-modulation mechanism is realized with two gating variables: an input gate and a reservoir gate. The input gate modulates the input signal, and the reservoir gate modulates the dynamical properties of the reservoir. We demonstrated that SM-RC can perform attention tasks where input information is retained or discarded depending on the input signal. We also found that a chaotic state emerged as a result of learning in SM-RC. This indicates that the self-modulation mechanism provides RC with qualitatively different information-processing capabilities. Furthermore, SM-RC outperformed RC in NARMA and Lorenz model tasks. In particular, SM-RC achieved a higher prediction accuracy than RC with a reservoir 10 times larger in the Lorenz model tasks. Because the SM-RC architecture requires only two additional gates, it is as physically implementable as RC, providing a new direction for realizing edge AI.


Introduction
Vast amounts of data are generated and observed in the form of time series in the real world. Efficiently processing these time-series data is important in real-world applications such as forecasting the renewable energy supply and monitoring sensor data in factories. In the past decade, data-driven methods based on deep learning have progressed significantly and have successfully linked data prediction and analysis to social values, and they are becoming increasingly important [1,2]. However, the computational load of data-driven methods results in considerable energy consumption, limiting their applicability [3,4]. In particular, to perform prediction and analysis near the location where the data are generated, which is called edge AI, a high energy efficiency is required [5].
Reservoir computing (RC) is attracting attention as a candidate for edge AI because it achieves high prediction performance and a high energy efficiency. The RC model consists of an input layer, a reservoir layer, and an output layer [6,7]. The reservoir layer is typically a recurrent neural network (RNN) with fixed random weights. Because only the output layer is usually trained in RC, the training process is faster than those of other RNN models such as long short-term memory (LSTM) [8] and gated recurrent units (GRUs) [9]. In addition, because the reservoir layer can be configured with various dynamical systems [10], its high energy efficiency has been demonstrated through physical implementations [11,12].
However, as only the output layer is trained in RC, it is unclear whether it can achieve prediction accuracy comparable to that of other state-of-the-art approaches in real-world applications [13]. To improve the computational power of RC, various RC architectures and methods have been proposed [14]. Recent proposals include structures that combine convolutional neural networks [15], parallel reservoirs [16,17], multilayer (deep) RC [18], methods that use information from past reservoir layers [19], and regularization methods that combine autoencoders [20]. These studies indicated that the performance can be improved by using an appropriate reservoir structure. However, because only the output layer is trained in these methods, the performance improvement is limited.
One promising approach for improving the flexibility of information processing in RC is to temporally vary the dynamical properties of the reservoir layer to adapt to the input signal. An architecture that feeds the output back to the reservoir can realize this [21,22]. Sussillo and Abbott proposed the FORCE learning method for stably training an RC model with a feedback architecture and reported that the network can learn various types of autonomous dynamics [21]. This architecture has been extended to spiking neural networks [23] and applied to reinforcement-learning tasks [24]. However, although these feedback connections are thought to control the dynamics according to tasks [22], their control is limited because the connections are random.
Research has also been conducted to acquire task-dependent dynamics in the reservoir layer by training not only the output layer but also the reservoir layer. Intrinsic plasticity [25] is a method for making the outputs of neurons closer to a desired distribution, and Hebbian or anti-Hebbian rules [26] allow control of correlations between neurons. These methods increased the prediction accuracy and memory capacity [27]. Laje and Buonomano proposed innate training, which realizes a long-term memory function by training some connections in the reservoir layer to construct an attractor that is stable for a certain period of time [28]. Inoue et al. used this method to construct chaotic itinerancy [29]. The aforementioned studies demonstrated that it is possible to perform tasks that are difficult to achieve with conventional RC models by training the reservoir layer. However, in all these methods, the properties of the trained reservoir layer were static (e.g., network connections fixed in time), limiting the diversity of the dynamics.
Recently, the attention mechanism has been considered an effective method for realizing information processing adapted to the input signal. In deep learning, the introduction of the attention mechanism was one of the breakthrough techniques proposed in recent years [30,31]. This mechanism allows efficient learning through the selection and processing of important information and has impacted various research fields, such as natural language processing [32,33]. In neuroscience, attention is considered an important factor for realizing cognitive function, and neuromodulation is known as a neural mechanism closely related to attention [34]. Neuromodulation is caused by neurons in brain areas such as the basal forebrain releasing neuromodulators such as acetylcholine into various brain areas to modulate the activity of neurons therein [34,35,36]. The attention mechanism can increase the efficiency of information processing. However, in RC, the input signal is uniformly transferred to the reservoir layer and converted into high-dimensional features; thus, there is no attention mechanism.
In this paper, we propose self-modulated RC (SM-RC), an architecture that incorporates the advantages of the above feedback structure, reservoir-layer learning, and the attention mechanism. SM-RC has trainable gates that can dynamically modulate the strength of input signals and the dynamical properties of the reservoir layer. Thus, it is possible to learn reservoir dynamics adapted to the input signal, allowing information processing such as attention and significantly improving the learning performance in a wide range of tasks. Importantly, the gate structure has at most two variables and is controlled by feedback from the reservoir layer; thus, we expect SM-RC to be as hardware-implementable as conventional RC. In the following, we focus on the learning performance and model characteristics of SM-RC. In particular, we compare the prediction performance with that of conventional RC through simple attention tasks, NARMA tasks, and Lorenz model tasks. We also discuss prospects for hardware implementation.

Results
Figure 1 shows the SM-RC architecture. In addition to the input layer, reservoir layer, and output layer, the proposed architecture has an input gate g_in that modulates the input signal and a reservoir gate g_res that changes the dynamical properties of the reservoir layer, both of which are controlled by feedback from the reservoir layer. In conventional RC, the input signal is directly transferred to the reservoir layer, and the output is obtained from the reservoir layer. The reservoir layer is typically constructed with an RNN having fixed weights, but various dynamical systems can be used as reservoirs, which can be implemented with electronic circuits, optical systems, and other physical systems. SM-RC inherits the basic architecture of conventional RC, including the reservoir layer and output layer. However, the input signal is modulated by the input gate g_in before being transferred to the reservoir. Additionally, the reservoir dynamics are modulated with the reservoir gate g_res. Both gates are controlled by feedback from the reservoir layer. In Figure 1, the black solid arrows and green dotted arrows represent the randomly initialized fixed connections and trainable connections, respectively, and the red gradient arrow represents the function that changes the dynamical properties of the reservoir.
These properties significantly extend the functionality of conventional RC, wherein input information is "uniformly" transferred to the reservoir layer. In addition, the information of an input signal stored in the reservoir layer tends to decay over time, which is related to the echo-state property and fading memory [37,10]. These properties of conventional RC do not pose a significant problem for relatively simple tasks, but they severely limit the regression performance for tasks that involve capturing long or complex temporal dependencies [19]. SM-RC overcomes these shortcomings of conventional RC by introducing the self-modulation mechanism.
In this study, for simplicity, we construct SM-RC on the basis of the echo-state network (ESN) [6,38], which is the most commonly used form of RC. The ESN is a discrete-time dynamical system, and the reservoir layer consists of an RNN with fixed weights. In addition, we consider an input gate that dynamically changes the input strength and a reservoir gate that dynamically changes the internal connection strength in the reservoir layer. In this case, the input u_res to the reservoir layer and the spectral radius ρ_res of the reservoir layer are time-modulated as follows:

u_res(t) = g_in(t) u(t),    (1)

ρ_res(t) = g_res(t) ρ̄_res,    (2)

where u(t) is the original input vector and ρ̄_res is the spectral radius of the inner connection matrix of the unmodulated reservoir layer. Note that g_in and g_res are trained scalar gates. From the function of these gates, we can intuitively understand how SM-RC extends RC. For example, the input gate can realize attention by sending important input signals to the reservoir layer and discarding other information, and the reservoir gate can change the memory retention characteristics depending on the input signal.
To evaluate the learning performance of SM-RC from various viewpoints, we compared it with that of conventional RC for three tasks with different characteristics: simple attention tasks, NARMA tasks, and Lorenz model tasks. In all the experiments, the number of neurons in the SM-RC reservoir layer N_res was set to 100, and the SM-RC hyperparameters were fixed to the same values across tasks. In contrast, conventional RC used reservoir layers with various numbers of neurons, and its hyperparameters were optimized. Details of the models and their learning procedure are presented in Materials and Methods.

Simple attention tasks
To investigate the self-modulation mechanism of SM-RC, we evaluated the learning performance for a simple attention task. In this task, the input signal was u(t) = 1 in the time interval [250, 259], and Gaussian noise with a mean of 0 and standard deviation of σ_in was added at other timesteps. The model was trained to output 1 in the time interval [290, 291] and output 0 at other timesteps. To cancel the effect of the initial values x_i(0) = 0, the first 200 steps were discarded (free run), and common jitter noise was added to the input and output time windows (see Materials and Methods for details). This task required a memory retention function that retained the input signal for a long time while reducing the influence of Gaussian noise when no informative input signals were provided.
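A data-generation sketch for this task follows directly from the description above; the pulse windows and the jitter shared by the input and output windows follow the text, while the total sequence length is our assumption:

```python
import numpy as np

def make_attention_sample(rng, T=400, sigma_in=0.1):
    """One sequence of the simple attention task: a 10-step input pulse on a
    Gaussian-noise background and a 2-step target pulse, with a common jitter
    shifting both windows. T is an assumed total length."""
    jitter = int(rng.integers(-2, 3))        # uniform over {-2, ..., 2}
    u = rng.normal(0.0, sigma_in, T)         # background Gaussian noise
    u[250 + jitter : 260 + jitter] = 1.0     # input pulse, [250, 259] + jitter
    y = np.zeros(T)
    y[290 + jitter : 292 + jitter] = 1.0     # target pulse, [290, 291] + jitter
    return u, y

rng = np.random.default_rng(1)
train_data = [make_attention_sample(rng, sigma_in=0.3) for _ in range(100)]
```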
Figure 2 shows the learning results for the simple attention task. In the case of conventional RC (Fig. 2A), for σ_in = 0.1, the regression of the output pulse was poor. Although it has been reported that RC has a long-term memory function [39], memory retention is difficult when noise is continuously added to the reservoir layer. In contrast, for SM-RC (Fig. 2B), even when the noise was intensified, i.e., σ_in = 0.3, the regression of the output pulse was performed accurately. Figure 2C presents a comparison of the regression performance between conventional RC and SM-RC. SM-RC significantly outperformed conventional RC for various input noise intensities σ_in. For the same input noise intensity, SM-RC with N_res = 100 outperformed conventional RC with a reservoir at least 10 times larger.
By examining the time evolution of the input gate g_in(t) and the reservoir gate g_res(t) (the figure shows the modulated spectral radius ρ_res(t)) (Fig. 2B), we can intuitively understand the factors contributing to successful regression in the case of SM-RC. The input gate autonomously took large values during the time period when the input pulse arrived, efficiently feeding information into the reservoir layer. After the input pulse disappeared, the input gate took small values to prevent the input noise from corrupting the information stored in the reservoir layer. In contrast, the modulated spectral radius ρ_res(t) took large values after the input pulse disappeared. Because the fading memory condition is not met when the spectral radius is sufficiently large [10,38], the reservoir layer can retain the information for a long period.
To further study the dynamics of SM-RC, we performed a sensitivity analysis of the reservoir layer. This was done by adding a perturbation to the reservoir states and examining the evolution of the perturbation two steps forward (t_p = 2). The results for other values of t_p are presented elsewhere (Appendix, Fig. 6). When the perturbation grows after t_p steps, the sensitivity has a positive value; otherwise, it has a negative value (see Materials and Methods for details). In Fig. 2A, the sensitivity of the conventional RC model exhibits only negative values during the simulation period. This indicates that the reservoir was in stable states, which is consistent with the fading memory condition that usually holds for conventional RC [10]. In contrast, for SM-RC (Fig. 2B), the sensitivity changed significantly over time and occasionally exhibited positive values. The positive sensitivity indicates that the reservoir was in chaotic states. Usually, chaotic states make regression difficult because the input-output relationship becomes sensitive to the initial conditions. However, in the case of SM-RC, shutting the input gate avoided the problem related to the initial conditions in this attention task. We confirmed that the effects of the Gaussian noise on the reservoir states were negligible after the input pulse and before the output pulse. The fact that positive sensitivity appeared only after the input pulse and before the output pulse indicates that SM-RC successfully learned complex reservoir dynamics adapted to the input signal.
To investigate the individual characteristics of the gates, we evaluated the regression performance when one or both gates did not evolve over time (see Materials and Methods for details). Figure 2D shows the results. For all input noise intensities, the performance was the best when both gates were time-evolved (dynamic gates), followed by the case where only the reservoir gate was modulated (static input gate) and then the case where only the input gate was modulated (static reservoir gate). The performance was the worst when both gates were static (static gates). This indicates that the observed performance improvement was due not to the optimization of the spectral radius and input intensity but to their temporal modulation. In addition, the fact that the performance was the best when both gates were time-modulated indicates that the input gate and reservoir gate were cooperatively modulated, as intuitively explained above.
Time-series prediction: NARMA and Lorenz model

SM-RC improves the prediction performance even when there is no apparently salient time series to which attention should be paid, as in the case of the simple attention tasks. To demonstrate this, we evaluated the performance of time-series prediction using NARMA and the Lorenz model. The NARMA time series is obtained with a nonlinear autoregressive moving average model, which is often used to evaluate the learning performance of RC [40,19,41]. In particular, we use the NARMA5 and NARMA10 time series, which have different internal dynamics. The Lorenz model is a chaotic dynamical system that is also used for RC performance evaluation owing to its difficulty of prediction [13,20,42,43,44]. For the Lorenz model tasks, Gaussian noise with different standard deviations σ_in was added to the input to approximate the situation of real-world data. The aforementioned tasks involve input signals with different characteristics: a uniform random number for the NARMA tasks and a chaotic time series for the Lorenz model tasks (see Materials and Methods).
Figure 3A shows the time evolution of the conventional RC and SM-RC models after they were trained on the NARMA10 task. Because the input signal for the task was uniform noise, the reservoir states evolved accordingly in both models. In contrast to the case of the simple attention task, the gates in the SM-RC model did not change significantly over time, which reflected the characteristics of the input signal. The prediction performance of SM-RC was better than that of conventional RC, indicating that the self-modulation mechanism is effective even for uniform input signals. Figure 3B shows the time evolution of the conventional RC and SM-RC models after they were trained on the Lorenz model task. For both models, the time evolution of the reservoir states reflected the input chaotic time series. Although it was not apparent in the time evolution of the reservoir states, we observed that the gates of the SM-RC model evolved to adapt to the input chaotic time series. The dynamics of SM-RC for various task conditions, i.e., σ_in and N_forward, are presented elsewhere (Appendix, Fig. 8). Because of the dynamic behavior of the gates, the prediction performance of SM-RC was significantly better than that of conventional RC.
Figure 4 shows a comparison of the prediction performance of conventional RC and SM-RC for the NARMA10 (Fig. 4A) and Lorenz model tasks (Fig. 4B). In the top panels of Figs. 4A and B, the prediction error of conventional RC is plotted as a function of the reservoir size. The horizontal lines indicate the SM-RC prediction error. The number of neurons in the reservoir layer of the SM-RC models was fixed to 100. For the Lorenz model tasks (Fig. 4B), we present the results for different values of the following two parameters: the input noise σ_in and the number of prediction steps N_forward. As shown, SM-RC achieved significantly better prediction performance than conventional RC. For the NARMA tasks, SM-RC exhibited prediction performance comparable to that of conventional RC with a reservoir three times larger. For the Lorenz model tasks, SM-RC exhibited prediction performance comparable to that of conventional RC with a reservoir 10 times larger.
The bottom panels of Figs. 4A and B show the prediction performance of SM-RC when one or both gates did not evolve over time. For the NARMA and Lorenz model tasks, the performance was the best when both gates evolved, followed by the case where only the reservoir gate was modulated and then the case where only the input gate was modulated. The performance was the worst when both gates were static. These results are similar to those for the simple attention task. As before, they indicate that the observed performance improvement was due not to the optimization of the spectral radius and input intensity but to their temporal modulation.
Finally, to investigate the scalability of SM-RC, we evaluated the prediction performance for the Lorenz model task with an increasing reservoir size. Figure 5A shows the learning curves (MSEs with respect to the learning epoch). From top to bottom, the panels show the results for SM-RC with different reservoir sizes (N_res = 100, 200, ..., 500) when the input noise was 0.01, 0.03, and 0.1. The learning curve became spiky as the reservoir size increased, possibly because of the difficulties in RNN learning (Appendix, Fig. 7). Nevertheless, the training of SM-RC was successful. Figure 5B shows the resulting performance of SM-RC for different reservoir sizes. The MSEs decreased monotonically as the reservoir size increased, indicating that SM-RC is scalable with regard to the learning performance.

Discussion
We proposed SM-RC, which can alter the dynamical properties of the reservoir layer by introducing gate structures. We found that the SM-RC architecture achieved better learning performance than conventional RC for the simple attention, NARMA, and Lorenz model tasks. In particular, the improvement in the learning performance for the simple attention task and the Lorenz model task was remarkable. In the case of non-uniform input signals, a mechanism that adapts information processing to the input signal is considered to be effective. Because most real-world time-series data are non-uniform, the proposed architecture is expected to be effective for real-world data.

In the simple attention task, we observed local chaotic dynamics in SM-RC. Chaotic dynamics are of interest in deep-learning research because they increase the expressiveness of learning models [45,46]. Recently, Inoue et al. observed transient chaos in transformer models [47]. It is expected that SM-RC utilized the high expressivity of chaotic dynamics for encoding the input signal and outputting the pulse. However, the underlying learning mechanism requires further investigation. We expect that detailed dynamical-systems analysis [10,48,49,22] will produce ideas for improving SM-RC.

SM-RC also provides insights into the mechanisms of the nervous system. Neural networks with random connections, as used in RC, are attracting attention as models of the brain [7,48,50,51]. The proposed SM-RC model can be thought of as incorporating the mechanism of neuromodulation into RC, because both systems globally and temporally modulate the activity levels of neurons [34,35,36]. For example, the neuromodulator acetylcholine is delivered from the basal forebrain to brain areas, increasing the neuronal activity therein [34]. SM-RC thus realizes an attention mechanism that can be related to neuromodulation [34]. We expect that the resemblance between SM-RC and the neuromodulation mechanism can be studied by adopting biologically plausible characteristics such as spike-based communications, synaptic weights and time constants, and neural geometries.
Efficient hardware implementation is important for real-world deployment of SM-RC [11,12]. Because the input gate g_in only directly modulates the input intensity, it is not difficult to implement. Additionally, we believe that the reservoir gate g_res can be implemented in a manner suited to the hardware mechanism. For example, in analog circuit implementations, connections between neurons are represented by current magnitudes [52,53]. If the connection weights are implemented with digital memory, it is sufficient to provide a mechanism for modulating the digital values. In the case of analog memory implemented with three-terminal memory devices such as floating-gate devices, the reservoir-layer dynamics can be modulated by modulating the gate voltage of all or part of the field-effect transistors of the memory. In addition, delay feedback systems [54] can build a reservoir layer using a single nonlinear node, which is often used in photonic systems [55] and electronic integrated circuits [56]. In these systems, the entire reservoir can be modulated simply by changing the single node; thus, it is relatively easy to implement the reservoir gate. This hardware-friendly aspect of SM-RC is what motivates us to use SM-RC for edge AI rather than other state-of-the-art RNN models such as LSTM [8] and GRUs [9].

We believe that it is possible to train SM-RC implemented in physical systems using various learning techniques. In this paper, we trained the SM-RC model via backpropagation through time (BPTT) [57]. In BPTT, detailed information regarding the reservoir-layer dynamics is required, but in physical and analog systems, it is often difficult to capture the internal dynamics in detail. Efforts to train such hardware have progressed in recent years. For example, Wright et al. successfully trained a black-boxed system by capturing the internal dynamics using deep learning [58]. In addition, error backpropagation approximations, such as feedback alignment [59] and its variants [60,61,62,63], are suitable for training models implemented in physical systems. It has been reported that BPTT can be approximated so that RNNs can be trained in a biologically plausible manner [62,64]. Nakajima et al. extended these methods and successfully trained multilayer physical RC [65]. By applying those methods to SM-RC, it is expected that SM-RC can be trained in analog and physical systems. We believe that SM-RC represents a new paradigm for physical RC.

Model
The time evolution of the ESN-type SM-RC model can be expressed as

x(t + 1) = tanh( g_in(t) W_in u(t) + g_res(t) W_res x(t) + ξ1 ),    (3)

y(t) = W_out x(t),

where x(t) ∈ R^N_res and y(t) ∈ R represent the internal state and output of the reservoir layer, respectively. W_in ∈ R^(N_res × N_in) and W_res ∈ R^(N_res × N_res) are matrices representing the connection weights from the input to the reservoir and the inner connection weights of the reservoir, respectively. N_in and N_res represent the input dimension and reservoir dimension, respectively. Each element of W_in is initialized by randomly sampling from the uniform distribution on [−ρ_in, ρ_in]. Each element of W_res is randomly sampled from the uniform distribution on [−1, 1]; then, the matrix is multiplied by a constant so that the "initial" spectral radius becomes ρ̄_res.
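The weight initialization described above can be sketched in a few lines of NumPy (function and variable names are ours):

```python
import numpy as np

def init_weights(N_in, N_res, rho_in, rho_res_bar, seed=0):
    """W_in uniform in [-rho_in, rho_in]; W_res uniform in [-1, 1],
    rescaled so that its spectral radius equals rho_res_bar."""
    rng = np.random.default_rng(seed)
    W_in = rng.uniform(-rho_in, rho_in, (N_res, N_in))
    W_res = rng.uniform(-1.0, 1.0, (N_res, N_res))
    W_res *= rho_res_bar / np.max(np.abs(np.linalg.eigvals(W_res)))
    return W_in, W_res

W_in, W_res = init_weights(N_in=1, N_res=100, rho_in=0.12, rho_res_bar=0.9)
```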
1 is a vector of 1s, and ξ ∈ R controls the magnitude of the bias term of the neurons in the reservoir layer [42]. g_in(t) ∈ R is the input gate that modulates the input intensity and is controlled by feedback from the reservoir layer via the weights W_fb^in. Similarly, g_res(t) ∈ R is the reservoir gate that modulates the connection strength of the reservoir and is controlled by feedback from the reservoir layer via the weights W_fb^res:

g_in(t) = f( W_fb^in x(t) + b_fb^in ),    g_res(t) = f( W_fb^res x(t) + b_fb^res ).    (4)

The output function f of the gates is a non-negative function. In this study, the output value was limited to (0, 2) to stabilize the learning process:

f(a) = 2 / (1 + exp(−a)).

The input gate temporally modulates the intensity of the input by multiplying it by a value within the range (0, 2), allowing selection of useful input signals. In contrast, the reservoir gate temporally modulates the spectral radius ρ_res(t) within (0, 2ρ̄_res), allowing the information stored in the reservoir layer to be retained or discarded. The time evolution of conventional RC is obtained by replacing both g_in and g_res with 1.
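The update and gate equations above can be sketched as follows; the scaled logistic for f matches the stated (0, 2) range, and all weight values in the toy usage are placeholders:

```python
import numpy as np

def f_gate(a):
    # Gate output function limited to (0, 2); a scaled logistic is assumed.
    return 2.0 / (1.0 + np.exp(-a))

def sm_rc_step(x, u, W_in, W_res, Wfb_in, bfb_in, Wfb_res, bfb_res, xi=0.0):
    """One SM-RC update: scalar gates computed from the current reservoir
    state modulate the input term and the recurrent term."""
    g_in = f_gate(Wfb_in @ x + bfb_in)      # input gate in (0, 2)
    g_res = f_gate(Wfb_res @ x + bfb_res)   # reservoir gate in (0, 2)
    x_next = np.tanh(g_in * (W_in @ u) + g_res * (W_res @ x) + xi)
    return x_next, g_in, g_res

# toy usage with random placeholder weights
rng = np.random.default_rng(2)
N_in, N_res = 1, 20
x, g_in, g_res = sm_rc_step(
    rng.normal(size=N_res), rng.normal(size=N_in),
    rng.normal(size=(N_res, N_in)), rng.normal(size=(N_res, N_res)),
    rng.normal(size=N_res), 0.0, rng.normal(size=N_res), 0.0)
```

Setting both gates identically to 1 recovers the conventional RC update.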
W_in and W_res are fixed weights, as in the conventional RC framework. In SM-RC, in addition to the output weights W_out, the feedback weights (W_fb^in, b_fb^in, W_fb^res, b_fb^res) are trained. In this study, the output weights were trained using the pseudo-inverse method, and the feedback weights were trained using BPTT [57] (see below).

Learning procedure
All the tasks performed in this study involved predicting the target signal y_t(t) using the input time series {u(0), u(1), ..., u(t)} obtained by time t. RC usually inputs only u(t) to the reservoir layer at time t (Eq. 3). In the case of conventional RC, training of the output weights was performed via the pseudo-inverse method to minimize the squared error between the output and the target signal [38]:

L = (1/T) Σ_t ( y(t) − y_t(t) )²,    (8)

where T represents the number of timesteps of the training data. When there are multiple training sequences, T is the product of the number of timesteps per sequence and the number of sequences. To avoid the influence of the initial state, the first 200 steps in the simulation were excluded from the training data (free run). In this study, adding an L2 regularization term on W_out (ridge regression) did not improve the performance.
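The pseudo-inverse fit of the output weights amounts to an ordinary least-squares solve over the collected reservoir states, e.g.:

```python
import numpy as np

def fit_readout(X, Y):
    """Readout weights minimizing ||X w - Y||^2 via the pseudo-inverse.
    X: (T, N_res) reservoir states after the free run; Y: (T,) targets."""
    return np.linalg.pinv(X) @ Y

# toy check: recover a known readout from noiseless synthetic states
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 20))
w_true = rng.normal(size=20)
w_hat = fit_readout(X, X @ w_true)
```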
In the case of SM-RC, we used the same loss function (Eq. 8) and trained the weights via the gradient descent method. In each epoch, the weights were updated via full-batch training, and this was repeated for 10^4 epochs. In each full-batch training process, the output layer was trained using the pseudo-inverse method as in conventional RC, and then the gates were trained using BPTT. In BPTT, the weights were updated using the Adam optimizer with a learning rate of 10^−3 [66]. All implementations and simulations were performed using the PyTorch framework [67]. It is also possible to uniformly train all the learning parameters, including the output weights, via BPTT. However, in this case, the convergence of the learning was slow and the learning process became unstable, degrading the learning performance. By learning the output layer in advance, the backpropagating error signal is reduced, which may lead to stable learning (Appendix, Fig. 7).
The hyperparameters of the conventional RC model (ρ_in, ρ_res, and ξ) were optimized via Bayesian optimization [68,69]. The hyperparameters of the SM-RC model were set as ρ_in = 0.12 and ρ̄_res = 0.9. The bias term was set as ξ = 0 for the simple attention tasks and NARMA tasks and ξ = 0.2 for the Lorenz model tasks. We observed that SM-RC training was somewhat unstable (Appendix, Fig. 7). Therefore, in the performance evaluation of SM-RC, we considered the best results among training runs with 50 different initial weights.
In this study, we also investigated the case where a single gate or both gates did not change over time. This was done by fixing all the elements of the feedback weights W_fb^in(res) to zero and training only the bias b_fb^in(res), so that the corresponding gate took a constant (but still optimized) value over time.

In the simple attention task, the input signal was 1 in the time interval [250 + t_i^jitter, 259 + t_i^jitter] and Gaussian noise at other timesteps, and the target signal was 1 in the time interval [290 + t_i^jitter, 291 + t_i^jitter] and 0 at other timesteps. t_i^jitter represents the jitter noise; one of the integers {−2, −1, 0, 1, 2} was sampled uniformly at random. The jitter noise was introduced to prevent the learning models from generating output pulses using the initial value (x_i(0) = 0) instead of using the input signal. We used 100 sequences obtained as described above as training data and 100 other sequences as test data.
NARMA is a nonlinear autoregressive moving average model given by

y_tc(t) = 0.3 y_tc(t − 1) + 0.05 y_tc(t − 1) Σ_{i=1}^{m} y_tc(t − i) + 1.5 s(t − m) s(t − 1) + 0.1,

where s(t) ∈ R is uniformly sampled from the interval [0, 0.5]. The NARMA5 and NARMA10 time series were obtained with m = 5 and m = 10, respectively. The task was to predict y_tc(t) using s(t) as the input. To eliminate the influence of the initial value, the first 200 steps were discarded, and 2000 steps were generated as training data. We also generated test data of 2000 steps using the same method. We used only a single sequence for the training dataset and a single sequence for the test dataset.

The Lorenz model evolves over time according to the following differential equations:

dx/dt = σ(y − x),    dy/dt = x(ρ − z) − y,    dz/dt = xy − βz,

with the standard parameters σ = 10, ρ = 28, and β = 8/3. Discrete time-series data were obtained from these differential equations by using the Euler method with a timestep of ∆t = 0.01. To eliminate the influence of the initial value, the first 1000 steps were discarded, and the time series of the subsequent 2000 steps was used. 100 training sequences and 100 different test sequences were generated in the same way from different initial values. The task was to predict z(t + N_forward ∆t) with x(t) as the input. N_forward ∈ {10, 20, 30} represents the number of steps forward to predict. x(t) and z(t) were normalized to a mean of 0 and variance of 1 for the training and test data, respectively. We also added Gaussian noise with a mean of 0 and standard deviation σ_in ∈ {0.01, 0.03, 0.1} to the input x(t) in the training data.
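The two data generators can be sketched as follows, assuming the standard NARMA coefficients and Lorenz parameters (σ = 10, ρ = 28, β = 8/3); seeds and initial conditions are illustrative:

```python
import numpy as np

def narma(T, m=10, seed=0):
    """NARMA-m series driven by s(t) ~ U[0, 0.5] (standard benchmark form)."""
    rng = np.random.default_rng(seed)
    s = rng.uniform(0.0, 0.5, T)
    y = np.zeros(T)
    for t in range(m, T):
        y[t] = (0.3 * y[t - 1]
                + 0.05 * y[t - 1] * np.sum(y[t - m : t])
                + 1.5 * s[t - m] * s[t - 1]
                + 0.1)
    return s, y

def lorenz(T, dt=0.01, x0=(1.0, 1.0, 1.0)):
    """Euler-integrated Lorenz system with the standard parameters."""
    state = np.array(x0, dtype=float)
    out = np.zeros((T, 3))
    for t in range(T):
        x, y, z = state
        state = state + dt * np.array([10.0 * (y - x),
                                       x * (28.0 - z) - y,
                                       x * y - (8.0 / 3.0) * z])
        out[t] = state
    return out
```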

Sensitivity analysis
We evaluated the sensitivity of the learning models in the simple attention task. The sensitivity indicates how much a perturbation to the reservoir state is magnified in t_p timesteps. For input data without jitter noise, SM-RC was operated, and the states of the reservoir layer were taken as the reference states {x_base(0), x_base(1), . . . , x_base(t), . . .}. Then, a perturbation p_j ∈ R^{N_res} was applied to the reservoir state x_base(t) ∈ R^{N_res} at time t, and the perturbed reservoir state x_pj(t + t_p) ∈ R^{N_res} at time t + t_p was obtained using Eq. 3. Each perturbation vector was obtained by sampling a vector contained in the unit sphere using the Metropolis–Hastings algorithm, followed by rescaling so that the obtained vector satisfied ||p_j|| = ε. We estimated the sensitivity using the reference and perturbed states as follows:

s(t) = (1/N_p) Σ_{j=1}^{N_p} ||x_pj(t + t_p) − x_base(t + t_p)|| / ε,

where N_p represents the number of perturbation vectors. Similarly, we estimated the maximum sensitivity:

s_max(t) = max_{1≤j≤N_p} ||x_pj(t + t_p) − x_base(t + t_p)|| / ε.

In the simulation, we used t_p = 2, ε = 10^−8, and N_p = 200. The simulation results for other values of t_p are presented elsewhere (Appendix, Fig. 6). The perturbation to the gates g_res, g_in was applied indirectly in accordance with Eq. 4 because they are not independent (state) variables: the gates are determined by the state of the reservoir layer. Therefore, in the case of SM-RC, the perturbation is applied only to the state of the neurons in the reservoir layer. Note that the maximum sensitivity is also known as the maximum local Lyapunov exponent.
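A minimal sketch of this perturbation-based estimate, assuming a generic tanh reservoir update in place of the paper's Eq. 3, and drawing perturbations uniformly on the sphere of radius ε instead of via Metropolis–Hastings:

```python
import numpy as np

def reservoir_step(x, W, w_in, u):
    """Generic tanh reservoir update, standing in for Eq. 3 of the text."""
    return np.tanh(W @ x + w_in * u)

def sensitivity(x0, W, w_in, u_seq, t_p=2, eps=1e-8, n_p=200, seed=0):
    """Mean and maximum expansion of eps-sized perturbations after t_p steps."""
    rng = np.random.default_rng(seed)
    x_base = x0.copy()
    for u in u_seq[:t_p]:                      # reference trajectory
        x_base = reservoir_step(x_base, W, w_in, u)
    growth = np.empty(n_p)
    for j in range(n_p):
        p = rng.normal(size=x0.size)
        p *= eps / np.linalg.norm(p)           # rescale so that ||p|| = eps
        x_pert = x0 + p
        for u in u_seq[:t_p]:                  # perturbed trajectory
            x_pert = reservoir_step(x_pert, W, w_in, u)
        growth[j] = np.linalg.norm(x_pert - x_base) / eps
    return growth.mean(), growth.max()
```

The maximum over the N_p perturbation directions approximates the largest local expansion rate, which is why the text identifies it with the maximum local Lyapunov exponent.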
This instability can be attributed to the fact that the RNN was indirectly trained through the gates. Figure 7 shows the MSE for each learning epoch in 50 different trials. The learning task was the Lorentz model task with N_forward = 20 and σ_in = 0.03. Figure 7A shows the case where all the trainable weights, including those of the output layer, were trained via the gradient descent method. In this case, the learning process was unstable, resulting in poor performance. This implies that SM-RC training suffers from the same problems as the RNN training described above. Figure 7B shows the case where the weights of the output layer were trained via the pseudo-inverse method and the weights related to the gates were learned via the gradient method. In this case, the learning process was stabilized, and the performance was significantly improved. By training the output layer in advance, the loss is reduced; consequently, the backpropagated error signal is suppressed, which is thought to lead to the stabilization. However, because the resulting performance still varied among trials, we evaluated the learning performance of SM-RC by repeating the training procedure with different initial weights. Gradient clipping [71], which is known as a method for stabilizing learning, did not improve the performance.
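The readout-first step described above can be sketched in closed form: given a matrix of collected reservoir states and the corresponding teacher signals, the output weights follow from the pseudo-inverse. The shapes and the function name are illustrative assumptions; the subsequent gradient updates of the gate weights are omitted.

```python
import numpy as np

def fit_readout(X, Y):
    """Closed-form readout training: W_out = Y X^+ (Moore-Penrose pseudo-inverse).

    X : (N_res, T) matrix of collected reservoir states
    Y : (N_out, T) matrix of teacher signals
    Returns W_out of shape (N_out, N_res), the least-squares readout."""
    return Y @ np.linalg.pinv(X)
```

Fixing W_out this way before gradient training keeps the initial loss small, which is consistent with the stabilization effect described in the text.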

A.3 Time evolution of SM-RC for Lorentz model tasks
Figure 8 shows the time evolution of the SM-RC models trained on the Lorentz model tasks under different task conditions (σ_in, N_forward). In all cases, the gates were modulated to adapt to the input signal. The gate behaviors were complex and depended on the task conditions, indicating that the self-modulation mechanism can flexibly adapt the reservoir dynamics to each task.

Figure 2: Simulation results for simple attention tasks. A. Dynamics of a conventional RC model after training with N_res = 100 and σ_in = 0.1. From the top, the time evolution of the input signal, reservoir states, output, and sensitivity is shown. B. Dynamics of the SM-RC model after training with N_res = 100 and σ_in = 0.3. From the top, the time evolution of the input signal, reservoir states, outputs, gates, and sensitivity is shown. In the gate panel, the blue solid line represents the input gate, and the orange dashed line represents the modulated spectral radius ρ_res(t) obtained using Eq. 2. The standard deviations of these gate values were obtained using 100 different input signals without jitter noise. In A and B, in the sensitivity panel, the blue solid line indicates the mean sensitivity of the reservoir layer, and the orange dashed line indicates the maximum sensitivity of the reservoir layer. The standard deviations of the sensitivity values were obtained using 100 different input signals without jitter noise. For the output layer, the solid line indicates the predicted output, and the dashed line indicates the teacher signal. In the panels displaying the reservoir states, only 20 reservoir neurons are shown. C. Comparison of regression errors (mean squared errors, MSEs) between conventional RC models and SM-RC models. The regression errors for the conventional RC models are plotted as a function of the reservoir size for different input noise intensities (σ_in). The standard deviations were obtained over 10 trials. The regression errors for the SM-RC models are represented by blue dashed horizontal lines for an input noise intensity of 0.1. For the SM-RC models, the reservoir size was fixed to 100. D. Regression errors for the SM-RC models when one or both gates are made static.

Figure 3: Time evolution of the reservoir dynamics for (A) NARMA10 tasks with N_res = 100 and (B) Lorentz model tasks with N_res = 100, σ_in = 0.03, and N_forward = 20. In A and B, the left figures show the inputs, reservoir states, and outputs (from top to bottom) for the conventional RC models, and the right figures show the inputs, reservoir states, outputs, and gates (from top to bottom) for the SM-RC models. For the gates, the solid red lines represent the input gates, and the dashed blue lines indicate the modulated spectral radius. For the output layer, the solid line indicates the predicted output, and the dashed line indicates the teacher signal. In the panels displaying the reservoir states, only 20 reservoir neurons are shown.

Figure 4: Performance evaluation for the (A) NARMA tasks and (B) Lorentz model tasks. In A and B, the top figures show comparisons of the MSEs for the conventional RC and SM-RC models. In each figure, the MSEs for the conventional RC models are plotted; the horizontal axis indicates the reservoir size of the conventional RC model, and the standard deviations were obtained over 10 trials. The horizontal lines indicate the MSEs for the SM-RC models, whose reservoir size was fixed to 100. The bottom figures show comparisons of the MSEs for the SM-RC models when some gates were temporally fixed. In B, the results for N_forward = 10, 20, and 30 are presented in the left, center, and right figures, respectively, together with the results for different values of the input noise σ_in.

Figure 5: Regression performance of SM-RC models with different reservoir sizes for the Lorentz model tasks (N_forward = 30). A. MSEs at each training step (epoch) for σ_in = 0.01, 0.03, and 0.1, from top to bottom. B. MSEs for SM-RC models with different reservoir sizes.

Datasets. In the simple attention task, the ith input training datum had a value of 1 in the input time interval [250 + t_i^jitter, 259 + t_i^jitter] and a value sampled from a Gaussian distribution with a mean of 0 and a standard deviation of σ_in ∈ {0.1, 0.2, 0.3} at other timesteps. The ith output training datum had a value of 1 in the time interval [290 + t_i^jitter, 291 + t_i^jitter] and 0 at other timesteps.
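A minimal sketch of this dataset generation in Python. The total sequence length T = 400 is an assumption (the text fixes only the pulse intervals); the pulse positions, background noise, and jitter follow the description above.

```python
import numpy as np

def attention_datum(sigma_in, T=400, rng=None):
    """One input/target pair for the simple attention task.

    T = 400 is an assumed total length; intervals follow the text."""
    if rng is None:
        rng = np.random.default_rng()
    j = int(rng.integers(-2, 3))               # jitter in {-2, -1, 0, 1, 2}
    u = rng.normal(0.0, sigma_in, size=T)      # Gaussian background noise
    u[250 + j:260 + j] = 1.0                   # input pulse over [250+j, 259+j]
    y = np.zeros(T)
    y[290 + j:292 + j] = 1.0                   # target pulse over [290+j, 291+j]
    return u, y
```

Calling this 100 times with fresh random states yields the training set, and another 100 calls yield the test set, as described in the text.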

Figure 6: Results of the sensitivity analysis for the simple attention task. A. Results for the case of RC with N_res = 100 and σ_in = 0.1. From the top, the time evolution of the input signal, reservoir states, output, and sensitivity is shown. B. Results for the case of SM-RC with N_res = 100 and σ_in = 0.3. From the top, the time evolution of the input signal, reservoir states, outputs, gates, and sensitivity is shown. In A and B, the sensitivity is analyzed using different timesteps t_p, at which the magnitudes of the expansion of the perturbations were measured. For the output layer, the solid line indicates the predicted output, and the dashed line indicates the teacher signal. In the panels displaying the reservoir states, only 20 reservoir neurons are shown.

Figure 7: MSE at each learning epoch over 50 trials for the Lorentz model task with N_forward = 20 and σ_in = 0.03. A. All trainable weights, including those of the output layer, trained via gradient descent. B. Output layer trained via the pseudo-inverse method and gate-related weights via gradient descent.

Figure 8: Time evolution of SM-RC for the Lorentz model tasks with various input noises σ_in and numbers of prediction steps N_forward. For the output layer, the solid lines indicate the predicted outputs, and the dashed lines indicate the teacher signals. For the gates, the solid red lines indicate the input gates, and the blue dashed lines indicate the modulated spectral radius. In the panels displaying the reservoir states, only 20 reservoir neurons are shown.