Forecasting the outcome of spintronic experiments with Neural Ordinary Differential Equations

Deep learning has an increasing impact on research, allowing, for example, the discovery of novel materials. Until now, however, these artificial intelligence techniques have fallen short of discovering the full differential equation of an experimental physical system. Here we show that a dynamical neural network, trained on a minimal amount of data, can predict the behavior of spintronic devices with high accuracy and extremely efficient simulation times, compared to the micromagnetic simulations that are usually employed to model them. For this purpose, we re-frame the formalism of Neural Ordinary Differential Equations to the constraints of spintronics: few measured outputs, multiple inputs and internal parameters. We demonstrate with Neural Ordinary Differential Equations an acceleration factor of over 200 compared to micromagnetic simulations for a complex problem – the simulation of a reservoir computer made of magnetic skyrmions (20 minutes compared to three days). In a second realization, we show that we can predict the noisy response of experimental spintronic nano-oscillators to varying inputs after training Neural Ordinary Differential Equations on five milliseconds of their measured response to a different set of inputs. Neural Ordinary Differential Equations can therefore constitute a disruptive tool for developing spintronic applications, complementing micromagnetic simulations, which are time-consuming and cannot fit experiments when noise or imperfections are present. Our approach can also be generalized to other electronic devices involving dynamics.


Supplementary Note 1: Training performance for the skyrmion systems, using voltage as input
This note provides additional results on the training of Neural ODEs for the one-skyrmion system without grain inhomogeneity and the multi-skyrmion system with grain inhomogeneity, using voltage as input. Fig. S1a shows the normalized random sine voltage input used during training. Figs. S1b-c show the training output of ∆m_z predicted by the Neural ODE (orange dashed curve) and the micromagnetic simulation output (blue curve), for the one-skyrmion and the multi-skyrmion system, respectively. Excellent agreement is seen. Fig. S2 shows the training loss (MSE) of the one-skyrmion system without grain inhomogeneity as a function of iterations, for (a minimal training sketch follows the list):
• a different numbers of units N_h in the hidden layers. We observe that, in general, the greater N_h is, the faster the training loss converges.
• b different sampling intervals ∆t of the initially observed trajectory y_1 (the unit p is equal to 2.5 ps, as in the main text). These results show that an accurate Neural ODE model can be trained even if the original continuous time series is downsampled, up to the point at which neighboring states of the downsampled trajectory lose their correlations.
• c different dimensions k of the Neural ODE (with k − 1 being the number of delays). The Neural ODE model can be successfully trained as long as k ≥ 2. Generally, the higher the dimension, the longer the model takes to train.
• d different choices of the training optimization algorithm. Both the ADAM and the stochastic gradient descent (SGD) method ensure quick convergence of the loss function.
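To make this setup concrete, the following minimal sketch (our illustration, not the authors' code) trains a delay-embedded Neural ODE on a scalar observable. It assumes the torchdiffeq package; the names f_theta, DrivenODE and the drive u(t) are hypothetical stand-ins for the quantities described above.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint

k, N_h = 4, 50   # embedding dimension and hidden-layer width (cf. Fig. S2)

# f_theta approximates the vector field of the k-dimensional delay-embedded
# state; the extra input dimension carries the (normalized) voltage drive u(t).
f_theta = nn.Sequential(
    nn.Linear(k + 1, N_h), nn.Tanh(),
    nn.Linear(N_h, N_h), nn.Tanh(),
    nn.Linear(N_h, k),
)

class DrivenODE(nn.Module):
    def __init__(self, net, u):
        super().__init__()
        self.net, self.u = net, u          # u: callable t -> tensor of shape (1,)
    def forward(self, t, y):
        return self.net(torch.cat([y, self.u(t)]))

def delay_embed(series, k):
    """Stack (y(t), y(t - dt), ..., y(t - (k-1) dt)) into k-dimensional states."""
    T = len(series) - k + 1
    return torch.stack([series[k - 1 - i : k - 1 - i + T] for i in range(k)], dim=1)

# One training iteration (data tensors assumed given):
#   y_emb = delay_embed(dmz_series, k)            # (T, k) embedded trajectory
#   t_grid = torch.linspace(0.0, T_total, len(y_emb))
#   pred = odeint(DrivenODE(f_theta, u), y_emb[0], t_grid)
#   loss = ((pred - y_emb) ** 2).mean()           # MSE, as in Fig. S2
#   loss.backward(); optimizer.step()
```

The state here is the delay embedding of the single measured output, while the input enters the network as an extra coordinate, matching the few-outputs, multiple-inputs constraint discussed in the abstract.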
Supplementary Figure 1. Training performance of Neural ODEs for the skyrmion systems using voltage as input. a Normalized random sine voltage input, in the form of ∆K_u, with amplitude ranging from -2×10^5 J/m^3 to 2×10^5 J/m^3 (corresponding to -1 to 1 in the graph). Here, the variation of the PMA (∆K_u) depends linearly on the applied voltage, because of the VCMA effect. Comparisons of the training output of ∆m_z predicted by the Neural ODE (orange dashed curve) and the micromagnetic simulation output (blue curve) for the one-skyrmion system without grain inhomogeneity (b) and the multi-skyrmion system with grain inhomogeneity (c).

Supplementary Note 2: Training Neural ODE models using incomplete noisy data of skyrmion dynamics
This note investigates the training of Neural ODEs on noisy data. The strategy for treating noisy dynamical systems derives from an idea underlying time-delay embedding for state-space reconstruction: increasing the dimension of the reconstructed space typically reduces the distortion of the noise distribution and the amplification of the noise level when the delay embedding is transformed back to the original state space (ref. 1). In general, the prediction errors depend on both the noise amplification and the estimation error, which itself depends on the approximation method, i.e., the model. For most good approximation schemes, the estimation error approaches zero in the limit of a large number of data points, so the prediction errors in this limit are entirely due to the noise. Therefore, by increasing the dimension of the delay vector, a more precise model can be learned.
We use the simulation data of the one-skyrmion system as a demonstration. We obtain noisy training data by adding artificial random Gaussian noise with mean µ = 0 and standard deviation σ_noi = 0.1, 0.5 and 1 to the normalized training data (∆m_z), as shown in Fig. S3a; a short sketch of this procedure is given below. Fig. S3c presents the training loss (MSE) as a function of iterations when training Neural ODEs of dimension k = 2, 4, 8 and 12 on these noisy data. These results differ from those obtained for the noiseless system (see Fig. S2c), where k ≥ 2 guarantees a sufficiently good model: here, the larger k is, the better the model. On the other hand, training a Neural ODE with a large k (e.g., k = 12) does not always ensure stable convergence, even at low noise, because the delayed dynamics becomes a stiff problem in the high-dimensional space; it is therefore necessary to choose an appropriate k for training. We further remark that the fitted standard deviation σ_err of the training error distribution of the trained Neural ODEs, computed against the true (noiseless) trajectories, gradually approaches the standard deviation σ_noi of the added noise as k is increased (see Fig. S3b for σ_noi = 0.1). Fig. S3d shows the testing results of the Neural ODE for k = 2 and k = 12 at different noise levels. Here, the mean square error (MSE) shown in the title of each graph in Fig. S3d is computed between the testing trajectory and the true trajectory without noise. The MSE approaches zero as k increases, which indicates a well-trained model, especially at low noise (e.g., for σ_noi = 0.1). As the noise level grows, increasing k still yields more precise predictions with a smaller MSE, but it becomes more difficult to drive the error to zero with respect to the true trajectory, even for relatively large k. A total of n = 15,000 data points is used for training and for testing, respectively. The Neural ODEs used in this note have N_h = 50 units per hidden layer.
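As an illustration of how the noisy datasets of this note can be generated and compared to the clean signal, here is a short sketch; the sinusoidal stand-in signal is our assumption, not the simulated ∆m_z.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 15_000)           # n = 15,000 points, as in this note
y_clean = np.sin(2 * np.pi * 5 * t)         # stand-in for the normalized Delta m_z

for sigma_noi in (0.1, 0.5, 1.0):
    y_noisy = y_clean + rng.normal(0.0, sigma_noi, size=y_clean.shape)
    # A well-trained Neural ODE recovers the clean trajectory, so its residual
    # against the noisy data has a std sigma_err that approaches sigma_noi.
    sigma_err = (y_noisy - y_clean).std()
    print(f"sigma_noi = {sigma_noi:.1f} -> sigma_err ≈ {sigma_err:.3f}")
```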
Still, the effect of noise is very complicated in many real-world applications; the demonstration presented here nevertheless shows that better predictions are possible when only limited information about the noise is known. Many other approaches can help distinguish the real system dynamics from noise-contaminated dynamics, for example, combining different filters with the reconstruction scheme.

Supplementary Note 3: Comparisons between the time-delayed method and the time-derivative method
This note compares the training performance of the time-delay method and the time-derivative method for modeling the Neural ODE. The training data y(t) is taken from the one-skyrmion system as a demonstration. We consider the observed data s(t) as the superposition of a noise n(t) and the clean data y(t), so that s(t) = y(t) + n(t). The noise is assumed to be generated from a Gaussian distribution with zero mean and standard deviation σ_noi = 0.5. We use the information at three consecutive time points (t − ∆t, t, t + ∆t). For the time-delay method, we consider the vector of variables s(t) = (s(t − ∆t), s(t), s(t + ∆t)). For the time-derivative method, the vector s(t) = (s(t), ṡ(t)), with ṡ(t) = (s(t + ∆t) − s(t − ∆t))/(2∆t), is employed as the training variables. The variance of the signal y(t) extracted from the simulations is σ²_sig = E[y(t)²] − E[y(t)]² = 1.22; the noise-to-signal ratio is therefore σ_noi/σ_sig = 0.45. By contrast, if we approximate the real signal derivative as ẏ(t) = (y(t + ∆t) − y(t − ∆t))/(2∆t), the noise-to-signal ratio of the first-order derivative ṡ(t) is σ_noi,der/σ_sig,der = 3.22, which is much larger than that of the original signal s(t).
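The following numerical check (our illustration, with a smooth stand-in signal rather than the skyrmion data) reproduces this amplification: the central-difference derivative raises the noise-to-signal ratio by orders of magnitude when adjacent samples are strongly correlated.

```python
import numpy as np

rng = np.random.default_rng(1)
dt = 1e-3
t = np.arange(0.0, 10.0, dt)
y = np.sin(2 * np.pi * t)                   # smooth signal: adjacent samples correlated
n = rng.normal(0.0, 0.5, size=t.shape)      # white noise, sigma_noi = 0.5 as above
s = y + n

def central_diff(x, dt):
    return (x[2:] - x[:-2]) / (2 * dt)

print("raw        noise/signal:", n.std() / y.std())    # ≈ 0.7
print("derivative noise/signal:",
      central_diff(n, dt).std() / central_diff(y, dt).std())   # much larger
```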
Figs. S4a and b show the training loss (mean square error, MSE) as a function of iterations for σ_noi = 0 (without noise) and σ_noi = 0.5, respectively. Both methods show similar training performance with respect to accuracy when noise is absent; however, the time-derivative method performs worse than the time-delay method when noise is present. This is expected, because the noise is amplified in the calculation of the first-order derivative, which makes convergence difficult.
Supplementary Figure 4. Comparison between the time-delay and time-derivative methods in terms of training performance. Training loss (mean square error, MSE) as a function of iterations for a σ_noi = 0 (without noise) and b σ_noi = 0.5 (with noise), for the one-skyrmion system.

Supplementary Note 4: Mackey-Glass time-series prediction task evaluation
This note provides additional results on the skyrmion-based Mackey-Glass time-series prediction task. To perform this task, we send the preprocessed input data into the reservoir, simulated either with the trained Neural ODE or with Mumax (as a control). With N_r = 50 virtual nodes in the reservoir, the total number of input values, including both the training and testing sets, is 50 × 10,000. Each value is fed into the reservoir for a duration t_step = 2p (p = 2.5 ps). The output trajectories are compared in Figs. S5a and b for the one-skyrmion system without grain inhomogeneity and the multi-skyrmion system with grain inhomogeneity, respectively. Using the trained model, we evaluated the reservoir performance for different N_r values, as shown in Fig. S5c. These results show that the prediction capability of the reservoir is not improved by increasing the number of virtual nodes. On the other hand, Fig. S5d shows that the prediction accuracy can be improved by increasing the step time t_step.
To improve the reservoir performance further, we also propose a new method for output reconstruction, sketched in the code after the figure caption below. Instead of using only the reservoir's output from the current step, we concatenate the response matrix A, at each column, with the output states from previous steps to form a new matrix A* ∈ R^(N_r·(1+n_p))×T_r, where n_p is the number of previous steps' states. In this way, the information available for training is enriched by the contribution of the previous reservoir states, and better prediction accuracy can be achieved, as shown in Fig. S5e.
Supplementary Figure 5. Mackey-Glass time-series prediction with the skyrmion reservoir. Output trajectory comparisons are presented for the one-skyrmion system without grain inhomogeneity in a and for the multi-skyrmion system with grain inhomogeneity in b. Prediction accuracy in terms of NRMSE as a function of the predicted horizontal steps for different c N_r values, d time durations t_step for each preprocessed input value fed into the reservoir, and e numbers of considered previous steps' states n_p. f Selected training set (orange) and testing set (green, dashed) for Mackey-Glass predictions for H = 20 and n_p = 16 (red dot in Fig. S5e), compared to the true trajectory (blue) of the Mackey-Glass time series, using the one-skyrmion system as reservoir, simulated by Mumax and by the Neural ODE.
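A sketch of the concatenated-state readout described above (our illustration, not the authors' code); the function names and the ridge regularization are assumptions.

```python
import numpy as np

def augment_states(A, n_p):
    """A: (N_r, T) reservoir response matrix. Returns A* of shape (N_r*(1+n_p), T)."""
    blocks = [A]
    for p in range(1, n_p + 1):
        shifted = np.zeros_like(A)
        shifted[:, p:] = A[:, :-p]          # reservoir states from p steps earlier
        blocks.append(shifted)
    return np.concatenate(blocks, axis=0)

def train_readout(A_star, target, reg=1e-6):
    """Linear readout weights W solving W @ A_star ≈ target (ridge regression)."""
    G = A_star @ A_star.T + reg * np.eye(A_star.shape[0])
    return np.linalg.solve(G, A_star @ target)   # length N_r*(1+n_p)

# Usage with random stand-in data (N_r = 50, n_p = 16, as in Fig. S5e):
A = np.random.randn(50, 5000)
y = np.random.randn(5000)
W = train_readout(augment_states(A, n_p=16), y)
prediction = W @ augment_states(A, n_p=16)
```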

Supplementary Note 5: Modeling of the skyrmion system with electric current as input
The main article focuses on the voltage control of skyrmion-based devices. Skyrmions can also be moved by an electric current through current-induced spin-orbit torques (see Fig. S6a). Skyrmions are easily deformed when moving in a confined nanodisk, which makes this situation quite complex. Neural ODEs are highly attractive for modeling it, and we demonstrate here that the trajectory of a single skyrmion (the position of the skyrmion core, x_c, y_c), as well as the resulting mean magnetization of the device, can be modeled using a Neural ODE.
The training dataset is again obtained from Mumax simulations. We use as input a random sine electric current, with a current-density amplitude ranging from -1×10^11 to 1×10^11 A/m^2 and a frequency f = 0.5 GHz (see Fig. S6c). The output trajectory of the skyrmion is recorded every p = 20 ps. A three-layer neural network with 100 units in each hidden layer is trained using 40,000 position data points. The final training output and its comparison with the Mumax results are shown in Fig. S6c. To further infer the magnetization, another three-layer neural network M_θ^out with 50 units in each hidden layer was trained to map the skyrmion positions, given as input to the network, to the variation of the mean perpendicular magnetization ∆m_z (see Fig. S6b).
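As an illustration, a readout network with the layer sizes quoted above could look as follows (PyTorch sketch; everything beyond the layer sizes is an assumption):

```python
import torch
import torch.nn as nn

# Readout network mapping the skyrmion core position (x_c, y_c) to Delta m_z,
# with two hidden layers of 50 units each, as quoted in the text.
M_out = nn.Sequential(
    nn.Linear(2, 50), nn.Tanh(),
    nn.Linear(50, 50), nn.Tanh(),
    nn.Linear(50, 1),
)

# Supervised training against Mumax targets (tensors assumed given):
#   loss = nn.functional.mse_loss(M_out(positions), dmz_targets)
#   loss.backward(); optimizer.step()
```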
The trained Neural ODE is then evaluated on the Mackey-Glass time-series prediction task, as in the voltage-control case. The preprocessed Mackey-Glass series, with N_r = 50, is sent in the form of an electric current into Mumax and into the Neural ODE, respectively, with t_step = 8p. We obtained 410,000 data points of final output trajectory (corresponding to 8,200 points of the Mackey-Glass series) after a simulation time of more than 40 days for Mumax and 80 minutes for the Neural ODE. The corresponding magnetization ∆m_z is then calculated by sending the output of the Neural ODE into M_θ^out. The MSE of ∆m_z over the whole 410,000 data points computed by the Neural ODE and M_θ^out, compared to the Mumax simulation, is as low as 7.09×10^-5 (see Fig. S7a). The Mackey-Glass prediction error (NRMSE) as a function of the predicted horizontal step, for Mumax and for the Neural ODE, is shown in Fig. S7b. Fig. S7c shows selected comparative prediction results for H = 1 and H = 25 (red dots in b) obtained by Mumax and by the Neural ODE model, respectively. Excellent agreement is again seen. Here, we used the first T_r = 5,000 points for training and the rest for testing.

Supplementary Figure 6. Modeling of the skyrmion system with electric current as input. a Schematic of the nanodisk, where the skyrmion exists in the magnetic layer (blue) and the electric current applied through the heavy-metal layer (grey) is used as an external input. b Topology of the neural network used for obtaining the mean perpendicular magnetization ∆m_z from the skyrmion core position (x_c, y_c). c Random sine electric current input, with amplitude ranging from -1×10^11 to 1×10^11 A/m^2 and frequency f = 0.5 GHz, into the skyrmion system, and the corresponding output trajectory of the skyrmion position (x_c, y_c) calculated by Mumax and by the trained Neural ODE, as well as ∆m_z obtained from the Neural ODE results and from Mumax.

Supplementary Note 6: Modeling abrupt switching and oscillations of a spin-valve nanopillar by Neural ODE
In this Note, we apply our method to a situation where significantly different types of behavior may occur in the same system, depending on the input parameters. We simulate a nanostructure that may either switch or generate sustained oscillations according to the value of the applied external magnetic field.
The training data is obtained from micromagnetic simulations of a spin-valve nanopillar, under the influence of a spin current polarized out of plane and an out-of-plane magnetic field H(t) applied along the z direction. The magnetization of the pinned layer is fixed along the negative z axis. The diameter and the thickness of the nanopillar are 80 nm and 2 nm, respectively. The key parameters used in the simulations are: saturation magnetization M_s = 1.2 MA/m, exchange stiffness A = 20 pJ/m, uniaxial anisotropy K_u = 0.5 MJ/m^3 and damping constant α = 0.1. In the simulations, a constant direct electric current is applied with a current density of 10^11 A/m^2. In the training dataset, artificial sinusoidal magnetic-field waveforms with a variety of random amplitudes are applied sequentially as input, as shown in Fig. S8a. Under these artificial training conditions, the structure exhibits relatively complex and difficult-to-interpret behaviors, which nevertheless allow training a Neural ODE that is valid in all regimes.
For this purpose, we use a three-dimensional state vector containing the information (m_x, m_y, m_z); a minimal sketch of such a model is given below. A total of 500 ns of dynamics is used to train the system. The training results for m_z (dashed orange curve) and m_x (dashed olive curve), shown in Fig. S8a, demonstrate excellent agreement with the micromagnetic simulations.
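A minimal sketch of such a field-driven, three-component Neural ODE (our illustration under assumed conventions: time in ns, a sinusoidal stand-in field H(t), and the torchdiffeq package):

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint

# The network predicts dm/dt from the state (m_x, m_y, m_z) and the field H(t).
net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 3))

class SpinValveODE(nn.Module):
    def __init__(self, net, H):
        super().__init__()
        self.net, self.H = net, H
    def forward(self, t, m):
        return self.net(torch.cat([m, self.H(t)]))

H = lambda t: 0.5 * torch.sin(2 * torch.pi * t).reshape(1)   # stand-in field (T), 1 GHz
m0 = torch.tensor([0.0, 0.0, -1.0])                          # magnetization along -z
t_grid = torch.linspace(0.0, 5.0, 200)                       # 5 ns window
traj = odeint(SpinValveODE(net, H), m0, t_grid)              # shape (200, 3)
```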
In the test dataset, we apply a time-varying magnetic field for three different initial conditions of the magnetization, covering a wide range of situations in which the system is expected to exhibit either switching or sustained oscillations (see also the caption of Fig. S9).
• In the third scenario, no external field is applied initially and only the constant electric current is present. In this situation, the system exhibits self-sustained magnetization oscillations. A decreasing or increasing field is then applied, up to a constant value, to switch the magnetization (e.g., µ_0H = -0.5 T, 0.7 T), as shown in the right column of Fig. S9a for different constant values.
We can see that on this test dataset the results of the Neural ODE and of the micromagnetic simulations are again in excellent agreement. The Neural ODE is able to predict all the various behaviors, equivalently to the micromagnetic simulations.
For a clearer comparison, we further extract the oscillation frequency at different magnetic fields from the results of the Neural ODE and of the micromagnetic simulations, as shown in Fig. S9b. This example demonstrates that the Neural ODE is able to describe different dynamical behaviors (switching and sustained oscillations) under different input parameters.
Supplementary Figure 9. Switching and oscillations of the spin-valve nanopillar. a In the middle column, the magnetization is initially aligned along the negative z direction by applying an external field of µ_0H = -0.5 T; a subsequently increasing field is then applied up to a constant value, to switch the magnetization (µ_0H = 0.7 T) or to generate sustained oscillations (µ_0H = 0.5, 0.2, -0.2 T). In the right column, no external field is applied initially and only the constant electric current is present; the system exhibits self-sustained magnetization oscillations. A decreasing or increasing field is then applied up to a constant value to switch the magnetization (µ_0H = -0.5, 0.7 T). The blue and red curves show the mean output magnetization of the x component (m_x) and z component (m_z), respectively, from the micromagnetic simulations. The dashed orange (m_z) and dashed olive (m_x) curves are the corresponding trained results of the Neural ODE. b Oscillation frequency F at different magnetic fields obtained from the results of the Neural ODE (orange stars) and of the micromagnetic simulations (blue curve). The red crosses represent the switching behavior at the corresponding amplitudes of the magnetic field, obtained by both the Neural ODE and the micromagnetic simulations.
Supplementary Figure 10. Prediction of experimental results using Neural ODEs (spectrogram method). a Training output trajectory of the voltage ∆V_out predicted by the Neural ODE (dashed orange), with the corresponding preprocessed input ∆V_in, in comparison with the experimental measurement (blue), for k = 2. This training set is taken from the first utterance of the first speaker. A three-layer neural network f_θ with 100 units in each hidden layer is trained. b Training loss (MSE) of the Neural ODE with k = 1, 2, 3, 4 as a function of iterations. c Selected response output trajectories predicted by the trained Neural ODE with the corresponding preprocessed input, in comparison with the experimental output, for digit nine of the seventh utterance of the fifth speaker. d Left to right: error distribution (green shading) and fitted Gaussian distribution (dashed curve) extracted by computing the difference between the output predicted by the Neural ODE and the experimental measurement; Gaussian noise (purple shading) added to the preprocessed input of the Neural ODE; and the corresponding noise distribution (green shading), with fitted pdf (black dashed curve), in the predicted output trajectory solved by the Neural ODE. σ_in was adjusted so that σ_out = 1.03 mV is close to σ_err = 0.97 mV. e Spoken-digit recognition rate on the testing set as a function of the number of utterances N used for training. The solid curve, blue dashed curve and orange dashed curve are the experimental result, the Neural ODE result with noise considered, and the Neural ODE result without any noise, respectively.

Supplementary Note 8: Mathematical derivations
A Converting a system of ODEs into a single higher-order ODE in one variable
Motivations. Our idea originates in the insight that a set of first-order differential equations in multiple variables can be converted into a single higher-order differential equation in one variable. For example, consider the linear case with one observed variable y_1 and a single hidden variable ỹ_2:
ẏ_1 = a y_1 + b ỹ_2,
dỹ_2/dt = c y_1 + d ỹ_2.
By substituting ỹ_2 and its derivative, calculated from the first ODE, into the second one, a second-order ODE in the variable y_1 can be derived:
ÿ_1 = (a + d) ẏ_1 + (bc − ad) y_1,
in which ỹ_2 no longer appears. This equation is equivalent to the following system of ODEs in terms of y_1 and y_2 = ẏ_1:
ẏ_1 = y_2,
ẏ_2 = (a + d) y_2 + (bc − ad) y_1.
This simple derivation suggests that an appropriate way to train a Neural ODE for a system of m internal variables, of which only one variable y_1 is accessible, is to train a Neural ODE whose state vector y is composed of y_1 and its derivatives up to order m − 1.
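This elimination can be verified symbolically; the following sympy sketch (our illustration) reproduces the second-order ODE above.

```python
import sympy as sp

t = sp.symbols("t")
a, b, c, d = sp.symbols("a b c d")
y1 = sp.Function("y1")(t)

# From y1' = a*y1 + b*y2, express the hidden variable: y2 = (y1' - a*y1) / b
y2 = (y1.diff(t) - a * y1) / b
# Substitute into y2' = c*y1 + d*y2; the residual (times b) is the left-hand
# side of the eliminated ODE: y1'' - (a + d)*y1' - (b*c - a*d)*y1 = 0.
residual = sp.simplify(y2.diff(t) - (c * y1 + d * y2))
print(sp.expand(residual * b))
```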
Theorem 1 (linear systems). Let A be an n × n matrix, and let p_A(λ) be the characteristic polynomial of A. Let y(t) = (y_1(t), . . . , y_n(t))^T be a solution of the system of ODEs ẏ = Ay. Then each coordinate function y_j(t) satisfies the nth-order differential equation p_A(D) y_j = 0, where D = d/dt.
Proof: Rewrite the differential equation in operator notation as Dy = Ay, where Dy denotes the differentiation of the vector y by d/dt and Ay denotes the multiplication of the vector y by the matrix A. By differentiating both sides of the ODEs, we obtain D²y = A Dy = A²y and, by induction, D^k y = A^k y for all k. Hence p(D)y = p(A)y for any polynomial p(λ), by linearity. The Cayley-Hamilton theorem states that p_A(A) = 0. Therefore p_A(D)y = 0 and so, for each coordinate, p_A(D) y_j = 0.
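A quick numerical illustration of the theorem for a concrete 2 × 2 matrix (our sketch; numpy and scipy assumed):

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[0.0, 1.0], [-2.0, -3.0]])
# Characteristic polynomial: p_A(l) = l**2 + 3*l + 2; Cayley-Hamilton => p_A(A) = 0
assert np.allclose(A @ A + 3 * A + 2 * np.eye(2), 0)

# y(t) = expm(A t) y0 solves y' = A y; check p_A(D) y_1 = y1'' + 3 y1' + 2 y1 = 0
y0, dt = np.array([1.0, 0.0]), 1e-3
ts = np.arange(0.0, 1.0, dt)
y1 = np.array([(expm(A * s) @ y0)[0] for s in ts])
residual = np.gradient(np.gradient(y1, dt), dt) + 3 * np.gradient(y1, dt) + 2 * y1
print(np.abs(residual[2:-2]).max())   # ≈ 0 up to finite-difference error
```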
Differential elimination procedure (nonlinear systems). The procedure converting a system of ODEs into a single higher-order ODE in one variable has been mathematically proven for nonlinear systems; we refer the reader to refs. 2-4 for further details. Here we give the general idea of the procedure followed in these references. We consider a system of nonlinear ODEs of the form ẋ = f(x), together with an output equation:
ẋ_1 = f_1(x_1, x_2, . . . , x_n)
ẋ_2 = f_2(x_1, x_2, . . . , x_n)
. . .
ẋ_n = f_n(x_1, x_2, . . . , x_n)
y = x_j
Here x = (x_1, x_2, . . . , x_n) is an n-dimensional state variable, and the output y can be any component x_j of the state variable. We assume that f(x) is a vector of rational (polynomial) functions of the state variables, a reasonable assumption in most applications. Our aim is to obtain a single ODE in the variable y.
The procedure can be implemented by taking a sufficient number of derivatives of the system, followed by the computation of a Gröbner basis of the new system (refs. 2-4). In general, for the elimination to work, the number of equations must be strictly greater than the number of unknowns (refs. 2, 3). First, from the output equation, we obtain ẏ = ẋ_j and ÿ = ẍ_j. This, however, introduces the second derivative of x_j, and thus the corresponding state-variable equation must be differentiated:
ẍ_j = (d/dt) f_j(x_1, . . . , x_n) = Σ_{k=1}^{n} (∂f_j/∂x_k) ẋ_k.
Now, at the second step, we have n + 4 equations and 2n + 1 unknowns (x_1, x_2, . . . , x_n, ẋ_1, ẋ_2, . . . , ẋ_n, ẍ_j). So, for n > 2, the number of unknowns is greater than or equal to the number of equations. Next, at step three, we take an additional derivative of the output equation, which makes the third derivative of the state variable appear. Thus we take the corresponding derivative of the state variable x_j and add the following equations to the system:
y^(3) = x_j^(3),
x_j^(3) = (d²/dt²) f_j(x_1, . . . , x_n),
ẍ_k = (d/dt) f_k(x_1, . . . , x_n), for k = 1, . . . , n and k ≠ j.
In doing this, we add 2 + (n − 1) = n + 1 equations and 1 + (n − 1) = n unknowns, so that we have 2n + 5 equations and 3n + 1 unknowns. For n > 3, the number of unknowns is therefore greater than or equal to the number of equations. Differentiation of the output equation and of the corresponding state-variable equations is continued until the number of equations exceeds the number of unknowns.
Lemma 1 Let the system consist of n state variables. The numbers of equations and unknowns become equal when n − 1 derivatives of the output equation have been taken.
Proof: Initially, there are n + 1 equations and 2n unknowns. After one derivative of the output equation, there are n + 2 equations and 2n unknowns, i.e., n − 2 more unknowns than equations. In each step thereafter, we add one more equation than unknowns, so at the (n − 1)th step (the (n − 1)th-order derivative of the output), there are exactly as many equations as unknowns. This proves the Lemma.
Corollary to Lemma 1 At the n-th step, there is one more equation than unknown.
This suggests that an ODE of the same (or possibly lower) differential order as the number of state variables can always be obtained. In other words, since there is one more equation than unknown at step n, an ODE of differential order n, or lower, can be obtained.
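The bookkeeping of Lemma 1 and its corollary can be summarized in a few lines (illustrative sketch):

```python
def deficit(n, step):
    """Unknowns minus equations after `step` derivatives of the output equation."""
    if step == 0:
        return 2 * n - (n + 1)          # n + 1 equations, 2n unknowns initially
    return (n - 2) - (step - 1)         # one net equation is gained per further step

n = 5
for step in range(n + 1):
    print(f"{step} derivative(s): unknowns - equations = {deficit(n, step)}")
# The deficit reaches 0 at step n - 1 (Lemma 1) and -1 at step n (Corollary).
```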
Remark: If a time-dependent input e(t) is added to the system, we use the same elimination procedure. Then, after taking the second derivative of the output y, we have the following n + 4 equations:
y = x_j,
ẏ = ẋ_j,
ÿ = ẍ_j,
ẍ_j = (d/dt) f_j(x_1, x_2, . . . , x_n, e),
ẋ_1 = f_1(x_1, x_2, . . . , x_n, e),
. . .
ẋ_n = f_n(x_1, x_2, . . . , x_n, e),
in which the first derivative of e is included (through the equation for ẍ_j). Thus, after taking the nth-order derivative of the output y at the nth step, the first- to (n − 1)th-order derivatives of e are included, and a single nth-order ODE in the variable y can be obtained. This explains why the time-delayed states of the input e(t) should be included when training the neural network.

B Noise level for higher-order derivatives
We now suppose that the observed data s(t) = y(t) + n(t) is a stationary random process (ref. 5), recorded with measurement noise n(t), identically distributed with variance σ²_noi, zero mean, and a normalized autocorrelation function c_noi(τ) = E[n(t)n(t − τ)]/σ²_noi. Let y(t) be the clean variable, with variance σ²_sig and normalized autocorrelation function c_sig(τ) = (E[y(t)y(t − τ)] − E[y(t)]²)/σ²_sig. We assume that the data is recorded at a high sampling rate (small interval ∆t), such that successive observations are strongly correlated. If we take the first derivative through the central-difference form
ṡ(t) = (s(t + ∆t) − s(t − ∆t))/(2∆t),
the real signal derivative is, neglecting terms of higher than second order in ∆t,
ẏ(t) = (y(t + ∆t) − y(t − ∆t))/(2∆t).

Therefore the corresponding variance of the derivative signal is
σ²_sig,der = σ²_sig (1 − c_sig(2∆t))/(2∆t²),
and the corresponding noise of the derivative signal, ṡ(t) − ẏ(t) = (n(t + ∆t) − n(t − ∆t))/(2∆t), has variance
σ²_noi,der = σ²_noi (1 − c_noi(2∆t))/(2∆t²).
Thus, the noise-to-signal ratio of the first derivative is
σ_noi,der/σ_sig,der = (σ_noi/σ_sig) √[(1 − c_noi(2∆t))/(1 − c_sig(2∆t))].
For white noise, the autocorrelation function is an impulse at lag 0, so that c_noi(2∆t) is close to zero, whereas c_sig(2∆t) is close to one when adjacent samples of the signal are strongly correlated. Therefore, the noise-to-signal ratio of the derivative (σ_noi,der/σ_sig,der) can be much larger than that of the original signal (σ_noi/σ_sig) whenever the sampled adjacent signal is strongly correlated, which is a common situation in most applications. This consideration explains our choice, in the main paper, of using delayed signals instead of derivatives as state variables of Neural ODEs for systems with incomplete information on the dynamics.
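The noise part of this formula is easy to verify numerically for white noise (illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
dt, sigma_noi = 1e-3, 0.5
n = rng.normal(0.0, sigma_noi, size=1_000_000)      # white measurement noise
n_der = (n[2:] - n[:-2]) / (2 * dt)                 # central-difference "derivative"

predicted = sigma_noi**2 / (2 * dt**2)              # formula with c_noi(2*dt) = 0
print(n_der.var(), predicted)                       # the two values agree closely
```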

Supplementary Note 7: Alternative interpretation based on the embedding theorem
An alternative interpretation of our method can be obtained through the "embedding theorem" for state-space reconstruction of dynamical systems; it provides additional insight into our technique. Let us assume that the system can be described by an ensemble of physical variables represented by a vector u, and that the variable we can measure (e.g., a voltage) is given by a measurement function s(u). Let us also assume that the underlying ODE describing the physics of the system is du/dt = F(u, t). From the point of view of the embedding theorem, our technique corresponds to changing the physical variables u to new variables y. We have y_1 = s(u(t)) and y_2 = y_1(t + ∆t_d) = s(u(t) + ∫_t^{t+∆t_d} F(u, t′) dt′). Defining the transformation G by G(u(t)) = u(t) + ∫_t^{t+∆t_d} F(u, t′) dt′, we have y_2 = s(G(u(t))) and, for all i, y_i = y_1(t + (i − 1)∆t_d) = s(G^{i−1}(u(t))). A minimal construction of this delay map is sketched below.
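A minimal construction of the delay map from a recorded scalar series (our illustration; the sinusoidal series is a stand-in for the measured signal):

```python
import numpy as np

def delay_map(series, k, lag):
    """k-dimensional delay vectors y_i = y_1(t + (i-1)*lag) from a scalar series."""
    T = len(series) - (k - 1) * lag
    return np.stack([series[i * lag : i * lag + T] for i in range(k)], axis=1)

s = np.sin(np.linspace(0.0, 30.0, 3000))   # stand-in measured series s(u(t))
Y = delay_map(s, k=4, lag=10)              # shape (2970, 4)
```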
According to the embedding theorem (refs. 5-8), for almost every smooth measurement function s and positive value of ∆t_d, such a delay map y(t) is a smooth, one-to-one coordinate transformation with a smooth inverse, provided its dimension k is at least twice the capacity dimension of the dynamical system (also called the fractal or box-counting dimension). This consideration provides an alternative justification for the choice of these coordinates in the main paper.