Universal scaling between wave speed and size enables nanoscale high-performance reservoir computing based on propagating spin-waves

Neuromorphic computing using spin waves is promising for high-speed nanoscale devices, but high performance has not yet been realized. Here we show, using micromagnetic simulations and a simplified theory based on response functions, that spin-wave physical reservoir computing can be miniaturized down to the nanoscale while keeping computational power comparable with other state-of-the-art systems. We also show that the scaling of system size with the propagation speed of spin waves plays a key role in achieving high performance at the nanoscale.


Introduction
Non-local magnetization dynamics in a nanomagnet, spin waves, can be used for processing information in an energy-efficient manner, since spin waves carry information in a magnetic material without Ohmic losses (1). The wavelength of a spin wave can be as small as the nanometer scale, and spin-wave frequencies reach several GHz to THz, which are promising properties for nanoscale, high-speed devices. Recently, neuromorphic computing using spintronics technology has attracted great attention for the development of future low-power artificial intelligence (2). Spin waves can be created by various means, such as magnetic fields, spin-transfer torque, spin-orbit torque, and voltage-induced changes in magnetic anisotropy, and can be detected by the magnetoresistance effect (3). Therefore, neuromorphic computing using spin waves offers a potential route to realizable devices.
Reservoir computing (RC) is a promising neuromorphic computation framework. RC is a variant of recurrent neural networks (RNNs) and has a single layer, referred to as a reservoir, that transforms an input signal into an output (4). In contrast with conventional RNNs, RC does not update the weights in the reservoir. Therefore, by replacing the reservoir of an artificial neural network with a physical system, for example, magnetization dynamics, we may realize a neural network device that performs various tasks, such as time-series prediction (4, 5), short-term memory (6, 7), pattern recognition, and pattern generation. Several physical RC systems have been proposed: spintronic oscillators (8, 9), optics (10), photonics (11, 12), fluids, soft robots, and others (see reviews (13-15)). Among these systems, spintronic RC has the advantage of potentially realizing nanoscale devices operating at GHz frequencies with low power consumption, which may outperform conventional electronic computers in the future. So far, spintronic RC has been considered using spin-torque oscillators (8, 9), magnetic skyrmions (16), and spin waves in garnet thin films (17-19). However, the current performance of spintronic RC remains poor compared with the Echo State Network (ESN) (6, 7), an idealized RC system. The biggest issue is our lack of understanding of how to achieve high performance in RC systems.
To achieve high performance, the reservoir must have a large number of degrees of freedom, N. In practice, however, it is difficult to increase the number of physical nodes, N_p, because doing so requires more wiring for multiple inputs. In this respect, wave-based computation in continuum media has attractive features. The dynamics in continuum media have large, possibly infinite, degrees of freedom. In fact, several wave-based computation schemes have been proposed (20, 21). The challenge is to combine the advantages of wave-based computation and RC to achieve high-performance computing of time-series data. For spin-wave-based RC, so far, the large degrees of freedom have been extracted only by using a large number of input and/or output nodes (19, 22).
Here, to propose a realisable spin wave RC, we use an alternative route; we extract the information from the continuum media using a small number of physical nodes.
Along this direction, the use of N_v virtual nodes for dynamics with delay was proposed to increase N (23). This idea was applied to optical fibres with a long delay line (11) and to a network of oscillators with delay (24). Nevertheless, the mechanism of high performance remains elusive, and no unified understanding has been reached. Increasing N = N_p N_v through N_v does not necessarily improve performance. In fact, RC based on spin-torque oscillators struggles with insufficient performance both in experiments (9) and in simulations (25). Photonic RC requires a large device size due to the long delay line (11, 12).
In this work, we show that nanoscale, high-speed RC based on spin-wave propagation with a small number of inputs can achieve performance comparable with the ESN and other state-of-the-art RC systems. More importantly, using a simple theoretical model, we clarify the mechanism behind the high performance of spin-wave RC. We show that a scaling relation between wave speed and system size makes the virtual nodes effective.

Reservoir computing using wave propagation
The basic task of RC is to transform an input signal U_n into an output Y_n for the discrete steps n = 1, 2, ..., T at times t_n. For example, for speech recognition, the input is an acoustic wave, and the output is the word corresponding to the sound. Each word is determined not only by the instantaneous input but also by its past history. Therefore, the output is, in general, a function of all the past inputs, Y_n = g({U_m}_{m=1}^n), as in Fig. 1(a). RC can also be used for time-series prediction by setting the output as Y_n = U_{n+1} (4). In this case, the state at the next time step is predicted from all the past data; namely, the effect of delay is included. The performance of the input-output transformation g can be characterized by how much past information g retains and how much nonlinear transformation g performs. We will discuss how the former is expressed by the memory capacity (MC) (26), whereas the latter is measured by the information processing capacity (IPC) (27).
We propose physical computing based on a propagating wave (see Fig. 1(b,c)). A time series of an input signal U_n is transformed into an output signal Y_n (Fig. 1(a)). As we will discuss below, this transformation requires large linear and nonlinear memories; for example, to predict Y_n, we need to memorize the information of U_{n−2} and U_{n−1}. The input signal is injected at the first input node and propagates in the device to the output node, taking a time τ_1, as in Fig. 1(b). The output may then carry past information at t_n − τ_1, corresponding to the step n − m_1. The output may also receive information from another input at a different time, corresponding to the step n − m_2. The two pieces of information are mixed and transformed as U_{n−m_1} U_{n−m_2}, either by a nonlinear readout or by nonlinear dynamics of the reservoir (see also Sec. B in Supplementary Information). We will demonstrate that wave propagation can indeed enhance the memory capacity and the learning performance of the input-output relationship.

Before explaining our learning strategy, we discuss how to achieve accurate learning of the input-output relationship Y_n = g({U_m}_{m=1}^n) from data. The output may depend on the whole input sequence {U_m}_{m=1}^n = (U_1, ..., U_n). Even when both U_n and Y_n are one-variable time-series data, the input-output relationship g(·) may be a T-variable polynomial, where T is the length of the time series. Formally, g(·) can be expanded in a polynomial (Volterra) series in the delayed inputs. Therefore, even for a linear input-output relationship, we need T coefficients in g(·), and as the degree of the polynomials increases, the number of coefficients increases exponentially. This observation implies that a large amount of data is required to estimate the input-output relationship. Nevertheless, we may expect a dimensional reduction of g(·) due to its possible dependence only on times close to t and on lower powers.
Still, our physical computers should have degrees of freedom N ≫ 1, if not exponentially large.
The reservoir computing framework is used to handle time-series data of the input U and the output Y (6). In this framework, the input-output relationship is learned through the reservoir dynamics X(t), which in our case is the magnetization at the detectors. The reservoir state at time t_n is driven by the input at the nth step as X(t_n) = f(X(t_{n−1}), U_n) (1), with a nonlinear (or possibly linear) function f(·). The output is approximated by the readout operator ψ(·) as Y_n ≈ ψ(X(t_n)). Our study uses the nonlinear readout ψ(X(t)) = W_1 X(t) + W_2 X^2(t) (5, 28). The weight matrices W_1 and W_2 are estimated from the data of the reservoir dynamics X(t) and the true output Y_n, where X(t) is obtained from (1). With the nonlinear readout, RC with linear dynamics can achieve a nonlinear transformation, as in Fig. 1(b). We stress that the system also works with a linear readout when the RC has nonlinear dynamics. We discuss this case in Sec. B.
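The role of the quadratic term in the readout can be sketched numerically. The following Python/NumPy snippet is a toy illustration with random data (not the actual magnetization dynamics): it trains both a linear readout and the augmented readout ψ(X) = W_1 X + W_2 X^2 by the pseudo-inverse, and only the augmented readout captures a quadratic target.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy reservoir trajectory: N nodes over T steps (a stand-in for the
# sampled magnetization signals; all names and data are illustrative).
N, T = 20, 2000
X = rng.standard_normal((N, T))

# Target with a quadratic component, so a purely linear readout must fail.
Y = X[0] + 0.5 * X[1] ** 2 + 0.1 * rng.standard_normal(T)

# Linear readout W1 X versus nonlinear readout psi(X) = W1 X + W2 X^2,
# both trained in one shot with the pseudo-inverse.
W_lin = Y @ np.linalg.pinv(X)
X_aug = np.vstack([X, X ** 2])
W_aug = Y @ np.linalg.pinv(X_aug)

err_lin = float(np.sqrt(np.mean((Y - W_lin @ X) ** 2) / np.var(Y)))
err_aug = float(np.sqrt(np.mean((Y - W_aug @ X_aug) ** 2) / np.var(Y)))
```

Stacking X and its component-wise square doubles the readout dimension, which is why the reservoir dimension is counted as 2 N_p N_v in the main text.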

Spin wave reservoir computing
We consider a magnetic device consisting of a thin rectangular film with cylindrical injectors (see Fig. 1(c)). The size of the device is L × L × D. Under a uniform external magnetic field, the magnetization is along the z direction. Electric current is injected at the N_p injectors, which have radius a and the same height as the device. The spin torque due to the current drives the magnetization m(x, t) and launches propagating spin waves, as schematically shown in Fig. 1(c). The actual demonstration of spin-wave reservoir computing is shown in Fig. 2. We demonstrate the spin-wave RC using two methods: micromagnetic simulations and a theoretical model using a response function.
In the micromagnetic simulations, we analyze the Landau-Lifshitz-Gilbert (LLG) equation with the effective magnetic field H_eff = H_ext + H_demag + H_exch, consisting of the external field, demagnetization, and exchange interaction (see Theoretical analysis using response function in Methods). The spin waves are driven by the Slonczewski spin-transfer torque (29). The driving term is proportional to the DC current j(t) at the nanocontact. We inject a DC current proportional to the input time series U with a pre-processing filter. From the resulting spatially inhomogeneous magnetization m(x, t), we measure the averaged magnetization m_i(t) at the ith nanocontact. We use the method of time multiplexing with N_v virtual nodes (23). We choose the x-component of the magnetization, m_x,i, as the reservoir state (see (14) in Methods for its concrete form). For the output transformation, we use both m_x,i and its square. Therefore, the dimension of our reservoir is 2 N_p N_v. The nonlinear output transformation can enhance the nonlinear transformation in the reservoir (5), and it has been shown that even with linear reservoir dynamics, RC can learn any nonlinearity (28, 30). In Sec. B of the Supplementary Information, we also discuss the linear readout, but including the z-component of the magnetization, X = (m_x, m_z). In this case, m_z plays a role similar to m_x^2. The performance of the RC is measured by three tasks: MC, IPC, and NARMA10. The weights in the readout are trained from the reservoir variable X and the output Y (Fig. 2(b), see also Methods).
To understand the mechanism of the high learning performance achieved by spin-wave propagation, we also consider a simplified model using the response function of the spin-wave dynamics. By linearizing the magnetization around m = (0, 0, 1) without inputs, we may express the linear response of the magnetization at the ith readout, m_i = m_x,i + i m_y,i, to the input in terms of response functions (see Methods). Here, U^(j)(t) is the input time series at the jth nanocontact. The response function has a self part G_ii, for which the input and readout nanocontacts are the same, and a propagation part G_ij, for which the distance between the input and readout nanocontacts is |R_i − R_j|. We use the quadratic nonlinear readout, whose response function G^(2) is built from products of the linear response functions. The same structure as (4) appears when we use second-order perturbation theory in the input (see Methods). In general, we may include cubic and higher-order terms in the input. This expansion leads to the Volterra series of the output in terms of the input time series and suggests how the spin-wave RC works (see Sec. A.1 in Supplementary Information for more details). Once the magnetization at each nanocontact is computed, we may estimate MC and IPC.
Figure 3 shows the results of the three tasks. When the time scale of the virtual node θ is small and the damping is small, the performance of the spin-wave RC is high. As Fig. 3(a) shows, we achieve MC ≈ 60 and IPC ≈ 60. Accordingly, we achieve a small error in the NARMA10 task. To confirm that the high MC and IPC are due to spin-wave propagation, we perform micromagnetic simulations with damping layers between the nodes (Fig. 4(a)). The damping layers inhibit spin-wave propagation. The result in Fig. 4(b) shows that the memory capacity is substantially lower than that without damping, particularly when θ is small. The NARMA10 task shows a larger error (Fig. 4(d)). When θ is small, the suppression is less effective; this may be due to incomplete suppression of the wave propagation.
We also analyze the theoretical model with the response function, neglecting the interaction between different physical nodes, namely, setting G_ij = 0 for i ≠ j. In this case, information transmission between two physical nodes is not allowed. We obtain smaller MC and IPC than in the system with wave propagation, supporting our claim (see Fig. 4(c)).
Our spin-wave RC also works for the prediction of time-series data. In (5), the functional relationship between the state at t + Δt and the states before t is learned by the ESN. The trained ESN can estimate the state at t + Δt from the past states and, therefore, can predict the dynamics without the data. In (5), prediction of chaotic time-series data was demonstrated. Figure 5 shows the prediction by our spin-wave RC for the Lorenz model. We demonstrate that the RC achieves short-time prediction and, more importantly, reconstructs the chaotic attractor.

Scaling of system size and wave speed
To clarify the mechanism of the high performance of our spin-wave RC, we investigate the MC and IPC of systems with different characteristic length scales L and different wave propagation speeds v. The characteristic length scale is controlled by the radius of the circle on which the inputs are located (see Fig. 2(a)). We use our theoretical model with the response function to compute MC and IPC in the parameter space (v, R). This calculation is feasible because the computational cost of our model is much lower than that of micromagnetic simulations. We also consider a further simplified model in which the response function is replaced by a Gaussian function centered at the propagation delay, where R_ij is the distance between the ith and jth physical nodes and w is the width of the function.
Even in this simplified model, we obtain MC ≈ 40 and IPC ≈ 60, and the performance again becomes maximal when L ∝ v (Fig. 6(c,d)). From this result, the origin of the optimal ratio between length and speed becomes clearer: when L ≪ v, the response functions for different R_ij overlap, so different physical nodes cannot carry information at different delay times. On the other hand, when L ≫ v, the characteristic delay time L/v exceeds the maximum delay time used to compute MC and IPC, or exceeds the total length of the time series. Note that we set the maximum delay time to 100 steps, which is much longer than the value necessary for the NARMA10 task.
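The overlap argument can be illustrated with a toy calculation. The sketch below (Python/NumPy; the Gaussian kernel shape, node count, and parameter values are illustrative assumptions, not the paper's) places Gaussian responses at the propagation delays R_ij/v and computes their mean pairwise overlap: for L ≪ v the responses nearly coincide, so the nodes are redundant, while for sufficiently large L/v they separate.

```python
import numpy as np

def gaussian_response(t, delay, w=1.0):
    # Toy stand-in for G_ij(t): a Gaussian centered at the delay R_ij / v.
    return np.exp(-(t - delay) ** 2 / (2 * w ** 2))

def mean_overlap(L, v, n_nodes=5, w=1.0):
    """Mean pairwise overlap of node responses placed at the propagation
    delays of nodes spaced L / n_nodes apart (normalized inner products)."""
    t = np.linspace(0.0, 50.0, 2000)
    delays = [i * (L / n_nodes) / v for i in range(n_nodes)]
    g = np.array([gaussian_response(t, d, w) for d in delays])
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    pairs = [g[i] @ g[j] for i in range(n_nodes) for j in range(i + 1, n_nodes)]
    return float(np.mean(pairs))

overlap_dense = mean_overlap(L=1.0, v=10.0)     # L << v: nodes nearly redundant
overlap_sparse = mean_overlap(L=100.0, v=10.0)  # delays well separated
```

Large overlap means the physical nodes report essentially the same delayed information, which is the regime where adding nodes (or virtual nodes) stops helping.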
This result suggests a universal scaling between the size of a wave-propagation-based RC system and its wave speed. Our spin-wave system has a characteristic length L ∼ 500 nm and a speed v ∼ 200 m s^{−1}. In fact, the reported photonic RC has a characteristic optical-fibre length scale close to the scaling shown in Fig. 7.

Discussion
Figure 7 shows reservoir computers reported in the literature with multiple nodes, plotted as a function of the length between nodes L and the product of wave speed and delay time vτ_0, for both photonic and spintronic RC. For the spintronic RC, the dipole interaction is considered for wave propagation, whose speed is proportional to both the saturation magnetization and the thickness of the line, with a ratio L/(vτ_0) ∼ 1. Therefore, the photonic RC requires a larger system size, as long as the delay time of the input τ_0 = N_v θ is of the same order (τ_0 = 0.3-3 ns in our spin-wave RC). As can be seen in Fig. 6, if one wants to reduce the length between physical nodes, one must reduce the wave speed or the delay time; otherwise, the information is too dense, and the reservoir cannot memorize many degrees of freedom (see Fig. 6(e)). Reducing the delay time is challenging, since experimental demonstrations of photonic reservoirs have already used short delays close to the instrumental limit. Reducing the wave speed in photonic systems is also challenging.
On the other hand, the speed of propagating spin waves is much lower than the speed of light and can be tuned by the configuration, thickness, and material parameters. If one reduces the wave speed or the delay time past the broad line in Fig. 7, the information becomes sparse and cannot be used efficiently (see Fig. 6(e)). Therefore, there is an optimal condition for high-performance RC.
The performance is comparable with other state-of-the-art techniques, which are summarized in Fig. 8. For example, for spintronic RC, MC ≈ 30 (19) and NRMSE ≈ 0.2 (22) in the NARMA10 task were obtained using N_p ≈ 100 physical nodes. Spintronic RC with one physical node but 10^1-10^2 virtual nodes does not show high performance; the MC is less than 10 (the bottom-left points in Fig. 8). This fact suggests that spintronic RC has so far been unable to use virtual nodes effectively. On the other hand, for photonic RC, comparable performance is achieved using N_v ≈ 50 virtual nodes but only one physical node. As discussed, however, photonic RC requires mm-scale system sizes. Our system achieves comparable performance using 10 physical nodes, and its size is reduced to the nanoscale while keeping a computational speed of 2-50 GHz. We also demonstrate that the spin-wave RC can perform time-series prediction and attractor reconstruction for chaotic data. To our knowledge, this has not been done in nanoscale systems.
Our results of micromagnetic simulations suggest that our system can be physically implemented.

[Figure 8 caption] Open blue symbols are values reported using photonic RC, while solid red symbols are values reported using spintronic RC. MC and NRMSE for the NARMA10 task are taken from Refs. (9, 19, 22, 36, 37) for spintronic RC and Refs. (32-34, 38, 39) for photonic RC.
Nanoscale propagating spin waves in a ferromagnetic thin film, excited by spin-transfer torque through nanometer-scale electrical contacts, have been observed (44-46). Patterning of multiple electrical nanocontacts onto magnetic thin films was demonstrated in mutually synchronized spin-torque oscillators (46). In addition to the excitation of propagating spin waves in a magnetic thin film, the non-local magnetization dynamics can be detected by the tunnel magnetoresistance effect at each electrical contact, as schematically shown in Fig. 1(c); this effect is widely used in the development of spintronic memories and spin-torque oscillators. Moreover, virtual nodes are used effectively in our system by matching the speed of the propagating spin waves to the distance between physical nodes; thus, high-performance reservoir computing can be achieved with a small number of physical nodes, in contrast to the many physical nodes used in previous reports. This work provides a way to realize nanoscale high-performance reservoir computing based on propagating spin waves in a ferromagnetic thin film.
There is an interesting connection between our study and the recently proposed next-generation RC (28, 47), in which the linear ESN is identified with the NVAR (nonlinear vector autoregression) method to estimate a dynamical equation from data. Our formula of the response function (3) results in the linear input-output relationship with delay, Y_{n+1} = a_n U_n + a_{n−1} U_{n−1} + ... (see Sec. A in Supplementary Information). More generally, with the nonlinear readout or with higher-order response functions, we obtain the input-output relationship with delay and nonlinearity (see Supplementary Information). These input-output relations are nothing but the Volterra series of the output as a function of the input with delay and nonlinearity (48). The coefficients of the expansion are associated with the response function. Therefore, the performance of RC reduces to the number of independent components of the matrix of response functions, which can be evaluated by how much delay the response functions between two nodes cover without overlap. These results would be helpful for the design of networks of physical nodes.
We should note that the polynomial basis of the input-output relation in this study originates from spin-wave excitation around the stationary state m_z = 1. When the input data have a hierarchical structure, another basis may be more efficient than the polynomial expansion. Another setup of magnetic systems may lead to a different basis. We believe that our study provides a simple but clear intuition for the mechanism of high-performance RC, which can guide the exploration of other setups for more practical applications of physical RC.

Micromagnetic simulations
We analyze the LLG equation using the micromagnetic simulator mumax3 (49). The LLG equation for the magnetization M(x, t) is solved with the effective magnetic field H_eff = H_ext + H_ms + H_exch, where H_ext is the external magnetic field, H_ms is the magnetostatic interaction, and H_exch is the exchange interaction with exchange parameter A_ex.
The size of our system is L = 1000 nm and D = 4 nm. The number of mesh points is 200 in the x and y directions and 1 in the z direction. We consider the Co2MnSi Heusler alloy, a ferromagnet with low Gilbert damping and high spin polarization, with the parameters A_ex = 23.5 pJ/m, M_s = 1000 kA/m, and α = 5 × 10^{-4} (40-43). An out-of-plane magnetic field µ_0 H_0 = 1.5 T is applied so that the magnetization points out of plane. The spin-polarized current is included via the Slonczewski model (29) with polarization parameter P = 1 and spin-torque asymmetry parameter λ = 1; here ℏ is the reduced Planck constant and e is the electron charge. The fixed-layer magnetization is uniform, m_f = e_x. We use absorbing boundary layers for the spin waves to ensure that the magnetization vanishes at the boundary of the system (50).
We set the initial magnetization as m = e z .
The reference time scale of this system is τ_0 = 1/(γµ_0 M_s) ≈ 5 ps, where γ is the gyromagnetic ratio, µ_0 is the vacuum permeability, and M_s is the saturation magnetization. The reference length scale is the exchange length l_0 ≈ 5 nm. The relevant parameters are the Gilbert damping α, the time scale of the input time series θ, and the characteristic distance between the input nodes R_0.
The injectors and detectors of spin are cylindrical nanocontacts embedded in the film, with radius a and height D. We set a = 20 nm unless otherwise stated. The input time series is uniform random noise U_n ∈ U(0, 0.5). The injected current density is set as j(t_n) = 2 j_c U_n with j_c = 2 × 10^{-4}/(πa^2) A/m^2. Given an input time series of length T, we apply the current for a duration θ and then update the current at the next step. The same input current with different filters is injected for different virtual nodes (see Learning with reservoir computing). The total simulation time is, therefore, T θ N_v.

Learning with reservoir computing
Our RC architecture consists of reservoir state variables and a readout. In our spin-wave RC, the reservoir state is chosen as the x-component of the magnetization, m_x,i(t_n), for the physical-node indices i = 1, 2, ..., N_p. Here, N_p is the number of physical nodes, and each m_x,i(t_n) is a T-dimensional row vector with n = 1, 2, ..., T. We use a time-multiplexed network of virtual nodes in RC (23), with N_v virtual nodes at time interval θ.
The expanded reservoir state is expressed by the N_p N_v × T matrix X (see Fig. 2), where t_{n,k} = ((n − 1)N_v − (k − 1))θ for the virtual-node indices k = 1, 2, ..., N_v. The total number of rows is N = N_p N_v. We use the nonlinear readout by augmenting the reservoir state with X(t) ∘ X(t), the Hadamard (component-wise) product of X(t) with itself. The readout weight W is trained from the data of the output Y(t) as W = Y X^†, where X^† is the pseudo-inverse of X.
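The construction of the expanded reservoir matrix and the pseudo-inverse training can be written compactly. A minimal sketch (Python/NumPy, with random stand-in signals; the exact sampling convention here is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
Np, Nv, T = 4, 8, 100

# Stand-in for m_x at each physical node, sampled once per virtual-node
# slot, i.e. Nv samples per input step (random data for illustration).
fine = rng.standard_normal((Np, T * Nv))

# Expanded (Np * Nv) x T reservoir matrix: row i * Nv + k holds the signal
# of physical node i sampled at the kth virtual-node time of each step.
X = fine.reshape(Np, T, Nv).transpose(0, 2, 1).reshape(Np * Nv, T)

# Nonlinear augmentation with the Hadamard (component-wise) square,
# then one-shot training W = Y X^+ with the pseudo-inverse.
X_aug = np.vstack([X, X * X])
Y = rng.standard_normal(T)
W = Y @ np.linalg.pinv(X_aug)
```

The augmented matrix has 2 N_p N_v rows, matching the reservoir dimension quoted in the main text.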
In the time-multiplexing approach, the input time series U = (U_1, U_2, ..., U_T) ∈ R^T is translated into the piecewise-constant time series Ũ(t) = U_n with t = (n − 1)N_v θ + s, for n = 1, ..., T and s ∈ [0, N_v θ) (see Fig. 2(a)). This means that the same input is held during the time period τ_0 = N_v θ. To exploit both physical and virtual nodes, the actual input J_i(t) at the ith physical node is Ũ(t) multiplied by a τ_0-periodic random binary filter B_i(t). Here, B_i(t) ∈ {0, 1} is piecewise constant over each interval θ. At each physical node, we use a different realization of the binary filter, as in Fig. 2(a).
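The masked, time-multiplexed input can be sketched as follows (Python/NumPy; toy sizes, with the {0, 1} Bernoulli filter described above):

```python
import numpy as np

rng = np.random.default_rng(5)
T, Np, Nv = 6, 3, 4          # toy sizes (illustrative)
U = rng.uniform(0.0, 0.5, T)

# One tau0-periodic {0, 1} mask per physical node, piecewise constant
# over each virtual-node slot of duration theta.
B = rng.integers(0, 2, size=(Np, Nv))

# Masked input J_i(t) sampled at the Nv slots of every input step:
# J[i, n * Nv + k] = B[i, k] * U[n], i.e. U_n is held for tau0 = Nv * theta.
J = (B[:, None, :] * U[None, :, None]).reshape(Np, T * Nv)
```

Different mask realizations at different physical nodes are what make the responses of the nodes linearly independent, so that the virtual nodes add useful dimensions.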
Unless otherwise stated, we use 1000 steps of the input time series as burn-in. After these steps, we use 5000 steps for training and 5000 steps for testing in the MC, IPC, and NARMA10 tasks.

NARMA task
The NARMA10 task is based on the discrete dynamical equation Y_{n+1} = α Y_n + β Y_n Σ_{i=0}^{9} Y_{n−i} + γ U_{n−9} U_n + δ. Here, U_n is an input drawn from the uniform random distribution U(0, 0.5), and Y_n is the output.
We choose the parameters α = 0.3, β = 0.05, γ = 1.5, and δ = 0.1. In RC, U_n is used as the input and Y_n as the target output. The performance of the NARMA10 task is measured by the deviation of the estimated time series Ŷ = W · X from the true output Y, quantified by the normalized root-mean-square error (NRMSE). Performance on the task is high when NRMSE ≈ 0. For the ESN, NRMSE ≈ 0.4 for N = 50 and NRMSE ≈ 0.2 for N = 200 were reported (51). The number of nodes N = 200 was used for speech recognition with a word error rate of ≈ 0.02 (51) and for time-series prediction of spatiotemporal chaos (5). Therefore, NRMSE ≈ 0.2 is considered reasonably high performance for practical applications. We stress that we use the same order of nodes (virtual and physical), N = 128, to achieve NRMSE ≈ 0.2.
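For reference, the NARMA10 recursion and the NRMSE score can be implemented in a few lines (Python/NumPy sketch; note that the recursion can diverge for unlucky input sequences, a known property of this benchmark):

```python
import numpy as np

def narma10(u):
    """NARMA10 recursion with the parameters of the text:
    Y_{n+1} = 0.3 Y_n + 0.05 Y_n * sum_{i=0}^{9} Y_{n-i}
              + 1.5 U_{n-9} U_n + 0.1."""
    y = np.zeros(u.size)
    for n in range(9, u.size - 1):
        y[n + 1] = (0.3 * y[n]
                    + 0.05 * y[n] * y[n - 9:n + 1].sum()
                    + 1.5 * u[n - 9] * u[n]
                    + 0.1)
    return y

def nrmse(y_true, y_hat):
    """Normalized root-mean-square error used to score the task."""
    return float(np.sqrt(np.mean((y_true - y_hat) ** 2) / np.var(y_true)))

rng = np.random.default_rng(0)
u = rng.uniform(0.0, 0.5, 2000)   # U_n ~ U(0, 0.5) as in the text
y = narma10(u)
```

The ten-step sum and the product U_{n−9} U_n are what make the task demand both linear memory and second-order nonlinearity from the reservoir.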

Memory capacity and information processing capacity
Memory capacity (MC) is a measure of the short-term memory of RC, introduced in (6). For an input U_n of random time series drawn from the uniform distribution, the network is trained to produce the output Y_n = U_{n−k}. For each delay k, the capacity MC_k is computed from the squared correlation between the target and the trained output; this quantity decays as the delay k increases, and MC is defined as the sum MC = Σ_{k=1}^{k_max} MC_k. Here, k_max is the maximum delay, which we set to k_max = 100 in this study. The advantage of MC is that when the input is independent and identically distributed (i.i.d.) and the output function is linear, MC is bounded by N, the number of internal nodes.
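The definition of MC can be made concrete with a toy reservoir. In the sketch below (Python/NumPy; illustrative, not the spin-wave reservoir), the reservoir is an ideal 10-tap shift register of the input, so the summed squared correlations should approach the bound N = 10:

```python
import numpy as np

def memory_capacity(X, u, k_max=20):
    """MC = sum over delays k of the squared correlation between the
    k-step-delayed input and its best linear reconstruction from X."""
    T = u.size
    mc = 0.0
    for k in range(1, k_max + 1):
        target = u[: T - k]                  # U_{n-k}
        states = X[:, k:]                    # reservoir aligned with it
        w = target @ np.linalg.pinv(states)
        r = np.corrcoef(target, w @ states)[0, 1]
        mc += r ** 2
    return float(mc)

# Idealized reservoir: a 10-tap shift register of the input, so each row
# stores one exact delay and MC should be close to the bound N = 10.
rng = np.random.default_rng(2)
u = rng.uniform(0.0, 0.5, 3000)
X = np.stack([np.roll(u, i + 1) for i in range(10)])
mc = memory_capacity(X, u)
```

A physical reservoir with correlated node signals falls below this bound, which is why the degree of overlap between node responses matters.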
Information processing capacity (IPC) is a nonlinear generalization of MC (27). In this task, the output is set to a product of Legendre polynomials of delayed inputs, where d_k is a non-negative integer and P_{d_k}(x) is the Legendre polynomial of order d_k. The capacity of each combination of delays and orders is computed as for MC, and the jth-order IPC is obtained by summing the capacities over all combinations with Σ_i d_i = j. When j = 1, the IPC is equivalent to MC, because P_0(x) = 1 and P_1(x) = x. In this case, Y_n = U_{n−k} for d_i = 1 when i = k and d_i = 0 otherwise; the sum over all possible delays k is nothing but MC. When j > 1, IPC captures all the nonlinear transformations and delays up to the jth polynomial order. For example, when j = 2, the output contains products of the form U_{n−k_1} U_{n−k_2} and second-order polynomials P_2(U_{n−k}). In this study, we focus on j = 2 because second-order nonlinearity is essential for the NARMA10 task (see Sec. A in Supplementary Information).
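A minimal second-order IPC calculation might look as follows (Python/NumPy sketch; for simplicity the input is rescaled to [−1, 1], where the Legendre polynomials are orthogonal, and the toy reservoir contains exactly one second-order feature):

```python
import numpy as np
from numpy.polynomial import legendre

def capacity(X, target):
    # Squared correlation between the target and its best linear readout of X.
    w = target @ np.linalg.pinv(X)
    return float(np.corrcoef(target, w @ X)[0, 1] ** 2)

rng = np.random.default_rng(3)
T = 3000
u = rng.uniform(-1.0, 1.0, T)      # rescaled input: Legendre-orthogonal range

# Toy reservoir holding two delays and one second-order (cross) feature.
u1, u2 = np.roll(u, 1), np.roll(u, 2)
X = np.stack([u1, u2, u1 * u2])

P2 = lambda x: legendre.legval(x, [0, 0, 1])   # P_2(x) = (3x^2 - 1) / 2

# Second-order IPC: sum capacities of all degree-2 targets up to delay 3.
ipc2 = 0.0
for k1 in range(1, 4):
    for k2 in range(k1 + 1, 4):
        ipc2 += capacity(X, np.roll(u, k1) * np.roll(u, k2))   # (d=1, d=1)
    ipc2 += capacity(X, P2(np.roll(u, k1)))                    # d = 2
```

Only the U_{n−1} U_{n−2} target is representable by this toy reservoir, so the second-order IPC ends up close to one: each independent nonlinear feature of the reservoir contributes one unit of capacity.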
The relevance of MC and IPC becomes clear by considering the Volterra series of the input-output relation. Instead of a plain polynomial basis, we may use an orthonormal basis such as the Legendre polynomials. Each term in (25) is characterized by the non-negative indices (k_1, k_2, ..., k_n). Therefore, the terms corresponding to j = Σ_i k_i = 1 in Y_n carry information on linear terms with time delay. Similarly, the terms corresponding to j = Σ_i k_i = 2 carry information on second-order nonlinearity with time delay. In this view, the estimation of the output Y(t) is nothing but the estimation of the coefficients β_{k_1,k_2,...,k_n}. In RC, the readout of the reservoir state at the ith node (either physical or virtual) can also be expanded as a Volterra series. Therefore, MC and IPC essentially measure the reconstruction of the coefficient vector through the matrix M associated with the readout weights. If all the reservoir states are independent, we may reconstruct N components of β_{k_1,k_2,...,k_n}. In realistic cases, the reservoir states are not independent, and therefore we can estimate only fewer than N components.

Prediction of chaotic time-series data

Following (5), we perform prediction of time-series data from the Lorenz model. The model is a three-variable system (A_1(t), A_2(t), A_3(t)) whose parameters are chosen such that it exhibits chaotic dynamics. As for the other tasks, we apply different masks of binary noise, B_i(t) ∈ {−1, 1}, to different physical nodes. Because the input time series is three-dimensional, we use three independent masks for A_1, A_2, and A_3; therefore, l ∈ {1, 2, 3}. The input for the ith physical node is the masked signal, normalized so that its range becomes [0, 0.5], and applied as an input current. Once the input is prepared, we compute the magnetization dynamics for each physical and virtual node, as in the NARMA10 task. We note
that here we use a binary mask of {−1, 1} instead of the {0, 1} mask used for the other tasks. We found that the {0, 1} mask does not work for the prediction of the Lorenz model, possibly because of the symmetry of the model.
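The input preparation for this task can be sketched as follows (Python/NumPy; the standard Lorenz-63 parameters σ = 10, ρ = 28, β = 8/3, the RK4 integrator, and the per-channel mask are our illustrative assumptions):

```python
import numpy as np

def lorenz63(T, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """RK4 integration of the (assumed standard) Lorenz-63 system."""
    def f(s):
        x, y, z = s
        return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])
    s = np.array([1.0, 1.0, 1.05])
    out = np.empty((T, 3))
    for n in range(T):
        k1 = f(s); k2 = f(s + 0.5 * dt * k1)
        k3 = f(s + 0.5 * dt * k2); k4 = f(s + dt * k3)
        s = s + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
        out[n] = s
    return out

A = lorenz63(5000)

# Normalize each channel to [0, 0.5] before the current injection, and
# apply a +/-1 binary mask per channel (l = 1, 2, 3) for one physical node.
A01 = 0.5 * (A - A.min(axis=0)) / (A.max(axis=0) - A.min(axis=0))
rng = np.random.default_rng(6)
mask = rng.choice([-1, 1], size=3)
masked = A01 * mask
```

Each physical node would receive its own mask realization, exactly as in the NARMA10 setup.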

Theoretical analysis using response function
We consider the Landau-Lifshitz-Gilbert equation (31) for the magnetization field m(x, t). We normalize both the magnetization and the effective field by the saturation magnetization as m = M/M_s and h_eff = H_eff/M_s. This normalization applies to all the fields, including the external and anisotropy fields. We also normalize the current density as σ(x, t) = J(x, t)/j_0, with the current density J(x) and the unit of current density j_0 = 4 M_s^2 e π a^2 D µ_0 P. We apply the current density at the nanocontact through the characteristic function χ_a(x), where χ_a(x) = 1 when |x| ≤ a and χ_a(x) = 0 otherwise.
We expand the solution of (31) around the uniform magnetization m(x, t) = (0, 0, 1) without current injection. Here, m_0(x, t) = (0, 0, 1), and ǫ ≪ 1 is a small parameter corresponding to the magnitude of the input σ(x, t). The first-order term corresponds to the linear response of the magnetization to the input σ, whereas the higher-order terms describe nonlinear responses; for example, m^(2)(x, t) ∼ σ(x_1, t_1)σ(x_2, t_2). Because our input is driven by the spin torque with the fixed-layer magnetization in the x-direction, m_f = e_x, only m_x and m_y appear in the first-order term O(ǫ). Deviations of m_z from m_z = 1 appear at O(ǫ^2). Therefore, for the first-order term m^(1), we may define the complex magnetization m = m_x + i m_y. We will show that the magnetization is expressed by the response function G_ij(t): the input at the jth physical node affects the magnetization at the ith physical node through a convolution with G_ij. The input for the jth physical node is σ_j(t) = 2 j_c B_j(t) Ũ_j(t); different physical nodes have different masks, as discussed in Learning with reservoir computing in Methods. When the wave propagation is dominated by the exchange interaction, the response function takes one form for the same node (i = j) and another for different nodes; when it is dominated by the dipole interaction, the corresponding forms are modified. We have N_p cylindrical inputs with radius a, with the ith input located at R_i. We are interested in the magnetization at the input, m_i(t) = m(x = R_i, t). For the same node, |R_i − R_j| = 0, and we may compute the integral explicitly as in (36). When ka ≪ 1, we may assume J_1(ka) ≈ ka/2 and finally arrive at (37).
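The convolution structure of the linear response can be illustrated numerically. In the sketch below (Python/NumPy), a toy causal, damped oscillatory kernel stands in for G_ij; the true kernel depends on the exchange or dipole dispersion derived here.

```python
import numpy as np

# Discretized version of m_i(t) = sum_j \int G_ij(t - t') sigma_j(t') dt',
# with a toy causal, damped oscillatory kernel standing in for the true
# spin-wave response function (whose form depends on the dispersion).
dt = 0.01
t = np.arange(0.0, 20.0, dt)

def G(tau, delay, omega=5.0, alpha=0.05):
    """Causal toy kernel: no response before the propagation delay R_ij / v."""
    g = np.exp((1j - alpha) * omega * (tau - delay))
    g[tau < delay] = 0.0
    return g

sigma_j = np.where((t > 1.0) & (t < 1.5), 1.0, 0.0)   # input pulse at node j
delay = 2.0                                           # toy value of R_ij / v
m_i = dt * np.convolve(sigma_j, G(t, delay), mode="full")[: t.size]
```

The response at node i only starts one propagation delay after the pulse, which is exactly the mechanism by which distant nodes supply delayed copies of the input.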
When the thickness d of the material is thin, the dispersion relation takes the form below. Assuming k ≪ β_h, the linearized operator then becomes as given, leading to (38) and (39).
nine-step-previous data and second-order nonlinearity. We discuss these properties with two methods. The first method is based on the extended Dynamic Mode Decomposition (DMD) (52) and the higher-order DMD (53). The second method is a regression of the input-output relationship.
We discuss the details of the two methods below. Our results are consistent with previous studies; the requirement of memory was discussed in (54), and the second-order nonlinear terms with a time delay in (55).
The NARMA10 task is based on the discrete-time difference equation Y_{n+1} = αY_n + βY_n (Σ_{i=0}^{9} Y_{n−i}) + γU_{n−9}U_n + δ. Here, U_n is an input at the time step n taken from the uniform random distribution U(0, 0.5), and Y_n is an output. We choose the parameters as α = 0.3, β = 0.05, γ = 1.5, and δ = 0.1.
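These parameter values coincide with the standard NARMA10 benchmark, so the target series can be generated as in this minimal sketch (function and variable names are ours, assuming the standard NARMA10 form):

```python
import numpy as np

def narma10(U, alpha=0.3, beta=0.05, gamma=1.5, delta=0.1):
    """Generate the NARMA10 target series Y from an input series U."""
    Y = np.zeros(len(U))
    for n in range(9, len(U) - 1):
        Y[n + 1] = (alpha * Y[n]
                    + beta * Y[n] * np.sum(Y[n - 9:n + 1])  # sum over the last 10 steps
                    + gamma * U[n - 9] * U[n]
                    + delta)
    return Y

rng = np.random.default_rng(0)
U = rng.uniform(0.0, 0.5, 1000)  # input drawn from U(0, 0.5)
Y = narma10(U)
```

The first ten outputs are left at zero, since the recursion needs ten past values before it is defined.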
In the first method, we estimate the transition matrix A from the state variable. We may extend the state variable to contain delayed data and polynomials of the output with time delay. Including the delay terms follows the higher-order DMD (53), while the polynomial nonlinear terms are used as a polynomial dictionary in the extended DMD (52). Here, the extended state variable contains all the combinations of the second-order terms with time delay, Y_{n−i_1}Y_{n−i_2}, with integers i_1 and i_2 satisfying 0 ≤ i_1 ≤ i_2 ≤ n − 1. We may straightforwardly include higher-order terms in powers. In the NARMA10 task, the output Y_{n+1} is also affected by the input U_n.
Therefore, the extended DMD is generalized to include the control as in (56), and we may express the state variable accordingly. Note that U_n includes the input and its polynomials with a time delay, as in (55). Similar to the first method, we estimate G by least squares. The estimated Ĝ gives us information on which time delays and nonlinearities dominate the state variable. The results of the two estimation methods are shown in Fig. 9. Both approaches suggest that a memory of ≈ 10 steps is enough to achieve high performance, and further memory does not improve the error. The second-order nonlinear terms yield a reasonably small NRMSE of ≈ 0.01. Including the third-order nonlinearity improves the error, but there is a sign of overfitting at longer delays because the number of state variables is too large. It should also be noted that even with only the linear terms, the NRMSE is ≈ 0.35. This result implies that although NRMSE ≈ 0.35 is often considered good performance, the nonlinearity of the data is not learned at an error of this order.
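The least-squares estimation of a transition operator with control (DMD with control) can be sketched as follows. The names are ours, and the example recovers a known linear system rather than reservoir data:

```python
import numpy as np

def estimate_transition(X, U):
    """Fit X[:, 1:] ~ [A, B] @ [X[:, :-1]; U[:, :-1]] by least squares.

    X : (n_state, T) snapshots of the state variable
    U : (n_input, T) control inputs (may include delayed/polynomial terms)
    Returns the stacked operator G_hat = [A, B].
    """
    Omega = np.vstack([X[:, :-1], U[:, :-1]])  # stacked state and control snapshots
    G_hat, *_ = np.linalg.lstsq(Omega.T, X[:, 1:].T, rcond=None)
    return G_hat.T                             # shape (n_state, n_state + n_input)

# recover a known linear system x_{n+1} = A x_n + B u_n
rng = np.random.default_rng(1)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[1.0], [0.5]])
T = 200
U = rng.normal(size=(1, T))
X = np.zeros((2, T))
for n in range(T - 1):
    X[:, n + 1] = A @ X[:, n] + B @ U[:, n]

G_hat = estimate_transition(X, U)
```

Inspecting the columns of Ĝ that multiply delayed or polynomial entries of U is what reveals which terms dominate the dynamics.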
A.1 The MC and IPC tasks as Volterra series for linear and nonlinear readout
In (3) and (4) in the main text, we show that the magnetization at the input region is expressed by the response function. The magnetization at the time t_n corresponding to the input U_n at the nth step is expressed as (60), where the coefficients a_n can be computed from the response function. We first consider the linear case and later generalize the expression to the nonlinear case. Because we use virtual nodes, the input U_n at the step n continues during the time period t ∈ [t_n, t_{n+1}), discretized by N_v steps as (t_{n,1}, t_{n,2}, . . ., t_{n,N_v}), and is multiplied by the filter of the binary noise (see Fig. 2 and Methods in the main text). Therefore, the magnetization is expressed by the response functions as in (61), where σ_i(t_n) ∝ U_n is the non-dimensionalized current injection at the time t_n at the ith physical node. Therefore, (61) results in the expression (60). Our input is taken from a uniform random distribution. Therefore, the inner product of the reservoir state, which is nothing but the magnetization, and the (delayed) input to learn MC can be computed explicitly. Similarly, the variance of the magnetization is equal to the variance of the input with the coefficient associated with m(t_n).
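Because the input is i.i.d. uniform, the MC of a linear readout reduces to summed squared correlations between the trained output and delayed inputs. A generic estimator, not specific to the spin-wave reservoir (the washout length is our choice):

```python
import numpy as np

def memory_capacity(X, u, k_max, washout=50):
    """MC = sum_k r^2(y_n, u_{n-k}) with a linear readout y = w . x trained per delay k.

    X : (n_state, T) reservoir states
    u : (T,) input series
    """
    mc = 0.0
    for k in range(1, k_max + 1):
        target = u[washout - k:len(u) - k]          # delayed input u_{n-k}
        states = X[:, washout:]                     # reservoir states at step n
        w, *_ = np.linalg.lstsq(states.T, target, rcond=None)
        y = states.T @ w
        r = np.corrcoef(y, target)[0, 1]
        mc += r ** 2
    return mc

# toy reservoir that stores exact 1- and 2-step delayed copies of the input
rng = np.random.default_rng(2)
u = rng.uniform(size=1000)
X = np.vstack([np.roll(u, 1), np.roll(u, 2)])
mc = memory_capacity(X, u, k_max=5)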
We may express the MC and IPC tasks in a matrix form. Here, S is the matrix associated with the original input, and S̄ is the delayed one. The output weight is denoted by W, and W_in is the matrix associated with the mask of binary noise. The goal of the MC and IPC tasks is to approximate the delayed input S̄ by the reservoir states G • S.
Here, the reservoir states are expressed by the response function G and the input denoted by S, where T is the number of time-series steps and K is the total length of the delay that we consider.
The ith row shows the (i − 1)-step delayed time series. The input S ∈ R^{T N_v × T} to compute the reservoir states is expressed as follows. Note that σ(t_n) ∝ U_n up to a constant. Due to time multiplexing, each row is repeated N_v times, and then the time series is delayed in the next row. After multiplying by the input filter W_in, the input is fed into the response function. The input filter W_in ∈ R^{T N_v × T} is a stack of constant row vectors of length T. The N_v different realizations of the row vectors are taken from binary noise, and the resulting N_v × T matrix is repeated T times in the row direction. This input is multiplied by the coefficients of the Volterra series. Equation (63) implies that by choosing the appropriate W, we can obtain a canonical form of G. If the canonical form has an N × N identity matrix in the left part of W • G, then the reservoir reproduces the time series up to the (N − 1)-step delay. This means that the rank of the matrix G, or the number of independent rows, is the maximum number of steps of the delay. This is consistent with the known fact that MC is bounded by the number of independent components of the reservoir variables (6).
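The time-multiplexed, masked input described above, with each U_n held for N_v virtual-node steps and multiplied by a repeating binary mask, can be sketched as follows (the ±1 mask values are an assumption; names are ours):

```python
import numpy as np

def build_masked_input(U, Nv, rng):
    """Hold each input U_n for Nv virtual-node steps and multiply by a binary mask.

    Returns a (Nv * T,) series: the held input times a mask that repeats every step.
    """
    T = len(U)
    held = np.repeat(U, Nv)                  # each U_n repeated Nv times (time multiplexing)
    B = rng.choice([-1.0, 1.0], size=Nv)     # one binary mask value per virtual node
    mask = np.tile(B, T)                     # the same mask repeats for every input step
    return held * mask

rng = np.random.default_rng(3)
U = rng.uniform(0.0, 0.5, size=10)
s = build_masked_input(U, Nv=8, rng=rng)
```

Stacking one such series per physical node, each with an independent mask realization, gives the input that is fed into the response function.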
Next, we extend the Volterra series of the magnetization to include nonlinear terms, and the magnetization is expressed as below. The delayed input S̄ is rewritten accordingly: the matrix S̄ contains all the nonlinear combinations of the input series (U_n, U_{n+1}, · · · ). Accordingly, we should modify S and also G to include the nonlinear response functions. Note that to guarantee orthogonality, Legendre polynomials (or other orthogonal polynomials) should be used instead of simple powers. Nevertheless, up to the second order of nonlinearity, which is relevant for the performance of NARMA10 (see Sec. A), the difference is only in the constant terms (P_2(x) = x^2 − 1/2). Because we subtract the mean value of the time series of all the inputs, outputs, and reservoir states, these constant terms do not change our conclusion. With nonlinear terms, (66) is extended as G = (G_lin, G_nonl). Still, the rank of the matrix remains N at most. This is the reason why the total sum of IPC, including all the linear and nonlinear delays, is bounded by the number of independent reservoir variables. When G_nonl = 0, the reservoir can memorize only the linear delay terms, but MC can be maximized to N. On the other hand, when G_nonl ≠ 0, it is possible that MC is less than N, but the reservoir may have finite IPC.
When the readout is nonlinear, we use the reservoir state variable as follows, where • is the Hadamard product. If M is linear in the input, G has the structure shown below. In this case, rank(G) = rank(G_lin) + rank(G_nonl).

B Learning with multiple variables
In the main text, we use only m_x for the readout, as in (13)–(15). The readout is nonlinear and has the information of both m_x and m_x^2. In this section, we consider a linear readout but use both m_x and m_z for the output in the micromagnetic simulations. We begin with the linear readout with m_x only. The results of the MC and IPC tasks are shown in Fig. 10(a,b). We obtain a performance for the MC task similar to the result in the main text (Fig. 3). On the other hand, the performance for the IPC task in Fig. 10(a) is significantly poorer than the result in Fig. 3(a). This result demonstrates that the linear readout with m_x only does not learn the nonlinearity effectively. Note that in the theoretical model with the response function, the IPC is exactly zero when we use the linear readout with m_x only. The discrepancy arises from the expansion (33) around m_0 = (0, 0, 1) in the main text. Strictly speaking, the expansion should be made around m_0 under the constant input σ averaged over time at the input nanocontact.
This reference state is inhomogeneous in space and is hard to compute analytically. Due to this effect, m_x in the micromagnetic simulations contains small nonlinearity.
Next, we consider the linear readout with m_x and m_z. As seen in Fig. 10(c,d), m_z carries nonlinear information and enhances the IPC and the learning performance of NARMA10 compared with the linear readout with m_x only (Fig. 10(a,b)). The performance is IPC ≈ 60 under α = 5 × 10^−4, which is comparable to the results in the main text (Fig. 3(a,c)), where the readout is (m_x, m_x^2). Also, high performance for the NARMA10 task, NRMSE ≈ 0.2, can be obtained using the variables (m_x, m_z). These results show that adding m_z to the readout has a similar effect to adding m_x^2.
This result suggests that m^(2) contains only the z-component and is slaved by m^(1), which has no z-component. Because |m| = 1, m_z can be computed from the first-order terms as m_z ≈ 1 − (m_x^2 + m_y^2)/2. Because m_x and m_y carry similar information, m_z in the readout has a similar effect to m_x^2 in the readout.
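The effect of the readout feature choice can be illustrated generically: a linear readout cannot fit a quadratic target, while adding a quadratic feature can. A ridge-regression sketch on synthetic data (names and the regularization value are ours, not the paper's setup):

```python
import numpy as np

def train_readout(features, target, alpha=1e-6):
    """Ridge-regression readout: solve (F F^T + alpha I) w = F target."""
    F = features                                       # (n_features, T)
    A = F @ F.T + alpha * np.eye(F.shape[0])
    return np.linalg.solve(A, F @ target)

def nrmse(pred, y):
    return np.sqrt(np.mean((pred - y) ** 2)) / np.std(y)

rng = np.random.default_rng(4)
mx = rng.normal(size=2000)
target = mx ** 2                                       # a purely quadratic target

w_lin = train_readout(mx[None, :], target)             # linear readout: m_x only
w_nl = train_readout(np.vstack([mx, mx ** 2]), target) # nonlinear readout: (m_x, m_x^2)

err_lin = nrmse(w_lin @ mx[None, :], target)
err_nl = nrmse(w_nl @ np.vstack([mx, mx ** 2]), target)
```

The linear readout leaves an NRMSE of order one, while the readout containing the quadratic feature fits the target almost exactly, mirroring the role played by m_z (or m_x^2) above.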

E.2 Exchange interaction
In the main text, we use the dipole interaction to compute the response function, as in (38) and (39). In this section, we show the results using the exchange interaction, as in (36) and (37).
Figure 12 shows the results.

Figure 1:
Figure 1: Illustration of physical reservoir computing and a reservoir based on a propagating spin-wave network. (a) Schematic illustration of output-function prediction using time-series data. The output signal Y is transformed from past information of the input signal U. (b) Schematic illustration of reservoir computing with multiple physical nodes. The output signal at physical node A contains past input signals at other physical nodes, which are memorized by the reservoir. (c) Schematic illustration of reservoir computing based on a propagating spin-wave. The propagating spin-wave in a ferromagnetic thin film (m ∥ e_z) is excited by spin-transfer torque at multiple physical nodes with a reference magnetic layer (m_f = e_x). The x-component of the magnetization is detected by the magnetoresistance effect at each physical node.

Figure 2:
Figure 2: Dimensions of the spin-wave reservoir and prediction of the NARMA10 task. (a) Input signals U are multiplied by the binary mask B_i(t) and transformed into the injected current j(t) = 2j_c Ũ_i(t) for the ith physical node. The current is injected into each physical node within a cylindrical region to apply spin-transfer torque and excite spin-waves. Higher-damping regions at the edges of the rectangle are set to avoid reflection of spin-waves. (b) Prediction of the NARMA10 task. The x-component of the magnetization at each physical and virtual node is collected, and the output weights are trained by linear regression.

Figure 3:
Figure 3: Effect of the virtual node distance on the performance of spin-wave reservoir computing obtained with 8 physical nodes and 8 virtual nodes. Memory capacity (MC) and information processing capacity (IPC) obtained by (a) micromagnetic simulation and (b) the response-function method, plotted as a function of the virtual node distance θ with different damping parameters α. (c) Normalized root mean square error (NRMSE) for the NARMA10 task, plotted as a function of θ with different α.

Figure 6(a,b) shows that both MC and IPC have a maximum when L ∝ v. To obtain a deeper understanding of this result, we perform the same analyses for a further simplified model.

Figure 6:
Figure 6: Scaling between the characteristic size and the propagating wave speed obtained by the response-function method. MC (a,c) and IPC (b,d) as a function of the characteristic length scale between physical nodes R and the speed of wave propagation v. The results with the response function for the dipole interaction (a,b) and for the Gaussian function (5) (c,d) are shown. (e) Schematic illustration of the response function and its relation to wave propagation between physical nodes. When the wave speed is too fast, all the response functions overlap (dense regime), whereas the response functions cannot cover the time windows when the wave speed is too slow (sparse regime).

Figure 7:
Figure 7: Reports of reservoir computing using multiple nodes, plotted as a function of the length between nodes and the characteristic wave speed (v) times the delay time (τ_0), for photonic systems (open symbols) and spintronic systems (solid symbols). The size of the symbols corresponds to the memory capacity, which is taken from the literature (12, 19, 22, 32–35) and this work. The gray scale represents the memory capacity evaluated by using the response-function method [Eq. (5)].

Figure 8:
Figure 8: Reservoir computing performance compared across different systems. (a) Reported memory capacity (MC) plotted as a function of the number of physical nodes N_p. (b) Normalized root mean square error (NRMSE) for the NARMA10 task, plotted as a function of N_p. Open blue symbols are values reported for photonic RC, while solid red symbols are values reported for spintronic RC. MC and NRMSE for the NARMA10 task are taken from Refs. (9, 19, 22, 36, 37) for spintronic RC and Refs. (32–34, 38, 39) for photonic RC.

Figure 9:
Figure 9: (A–C) The estimation based on the extended DMD; (D–F) the estimation based on the Volterra series. The dictionary in each case is (A,D) first-order (linear) delay terms, (B,E) up to second-order delay terms, and (C,F) up to third-order delay terms.

Fig. 11 shows
Fig. 11 shows the N_v and N_p dependencies of MC, IPC, and NRMSE for the NARMA10 task. As N_v and N_p increase, MC and IPC increase, and the NARMA10 prediction accordingly improves. MC and NRMSE for NARMA10 with different N_p at fixed N_v = 8 are compared with other reservoirs in Fig. 8 in the main text.

Table 1:
Report of photonic RC with different length scales used in Fig. 7 in the main text. Note: the speed of light, v = 3 × 10^8 m/s, is used.

Table 2:
Report of spin reservoirs with different length scales used in Fig. 7 in the main text. Note: v is calculated based on the magnetostatic spin wave using Eq. (74).

E Other data

E.1 N_v and N_p dependence of performance