Introduction

In broad terms there are two types of method used to train neural networks, divided according to whether or not they explicitly evaluate gradients of the loss function. Gradient-based methods include the backpropagation algorithm1,2,3,4,5,6,7,8. The non-gradient-based methods (sometimes called “black box” methods) include stochastic processes in which changes to a neural network are proposed and accepted with certain probabilities, and encompass Monte Carlo9,10 and genetic algorithms11,12,13. Both gradient-based and non-gradient-based methods have been used to train neural networks for a variety of applications, and, where comparison exists, perform similarly well14,15,16,17. For instance, recent numerical work shows that stochastic evolutionary strategies applied to neural networks are competitive with explicit gradient-based methods on hard reinforcement-learning problems16.

Gradient-based and non-gradient-based strategies differ in implementation and are sometimes thought of as entirely different approaches18. Here, we show that the two sets of methods have a fundamental connection. We demonstrate analytically an equivalence between the dynamics of neural-network training under conditioned stochastic mutations, and under gradient descent. This connection follows from one identified in the 1990s between the overdamped Langevin dynamics and Metropolis Monte Carlo dynamics of a particle in an external potential19,20. In the limit of small Monte Carlo trial moves, the two dynamics are equivalent. Similarly, we show here that a single copy of a neural network (a single individual) exposed to parameter mutations that are accepted probabilistically is equivalent, in the limit of small mutation size, to gradient descent on the loss function in the presence of Gaussian white noise. The details of the resulting dynamics depend on the details of the acceptance criterion, and encompass both standard and clipped gradient descent. Such a mutation scheme corresponds to the simple limit of the set of processes called “neuroevolution”13,16,21,22,23. This connection demonstrates explicitly that optimization without access to gradients can, nonetheless, enact noisy gradient descent on the loss function.

In simple gradient descent, equivalent to noise-free overdamped Langevin dynamics, the parameters (weights and biases) x of a neural network evolve with training time according to the prescription \(\dot{{{{{{\bf{x}}}}}}}=-\alpha \nabla U({{{{{\bf{x}}}}}})\). Here, α is a learning rate, and \(\nabla U({{{{{\bf{x}}}}}})\) is the gradient of the loss function U(x) with respect to the network parameters. Now consider a simple neuroevolution scheme in which we propose a mutation x → x + ϵ of all neural-network parameters, where ϵ is a set of independent Gaussian random numbers of zero mean and variance σ2. Let us accept the proposal with the Metropolis probability \(\min \left(1,{{{{{{\rm{e}}}}}}}^{-\beta {{\Delta }}U}\right)\). Here, β is a reciprocal temperature, and ΔU is the change of the loss function under the proposal. This is a Metropolis Monte Carlo algorithm, a Markovian dynamics that constitutes a form of importance sampling, and a common choice in the physics literature9,10,24. In physical systems, β is inversely proportional to the physical temperature, and we consider finite values of β in order to make contact with that literature. However, in the context of training a neural network it is interesting to consider the zero-temperature limit β = ∞, where mutations are accepted only if the loss does not increase. That regime is not normally considered in particle-based simulations.

Our main results can be summarized as follows. When β ≠ ∞ the weights of the network evolve, to leading order in σ, as \(\dot{{{{{{\bf{x}}}}}}}=-(\beta {\sigma }^{2}/2)\nabla U({{{{{\bf{x}}}}}})\) plus Gaussian white noise. Averaged over independent realizations of the learning process, this form of neuroevolution is therefore equivalent to simple gradient descent, with learning rate βσ2/2. In the limit β = ∞, where mutations are accepted only if the loss function does not increase, weights under neuroevolution evolve instead as \(\dot{{{{{{\bf{x}}}}}}}=-(\sigma /\sqrt{2\pi })| \nabla U({{{{{\bf{x}}}}}}){| }^{-1}\nabla U({{{{{\bf{x}}}}}})\) plus Gaussian white noise, which corresponds to clipped gradient descent on U(x)25. Note that conditioning the acceptance of neural-network parameter mutations on the change of the loss function for a single copy of that network is sufficient to enact gradient descent: a population of individuals is not required.

In this paper, we use the term “neuroevolution” to refer to a sequence of mutation steps applied to the parameters of a single copy of a neural network and accepted probabilistically. In general, neuroevolutionary algorithms encompass a broader variety of processes, including mutations of populations of communicating neural networks16 and mutations of network topologies21,26,27. Similarly, the set of procedures for particles that can be described as “Monte Carlo algorithms” is large, and ranges from local moves of single particles—roughly equivalent to the procedure used here—to nonlocal moves and moves of collections of particles24,28,29,30,31. The dynamics of those collective-move Monte Carlo algorithms and of the more complicated neuroevolutionary schemes21,26,27 do not correspond to simple gradient descent. Here, we demonstrate a correspondence between one member of this set of algorithms and gradient descent, the implication being that, given any potentially complicated set of neuroevolutionary methods, it is enough to add a simple mutation-acceptance protocol in order to ensure that gradient descent is also approximated. The neuroevolution-gradient descent correspondence is similar to the proofs that neural networks with enough hidden nodes can represent any smooth function32: it does not necessarily suggest how to solve a given problem, but provides understanding of the limits and capacity of the tool and its relation to other methods of learning.

Our work provides a rigorous connection between gradient descent and what is arguably the simplest form of neuroevolution. It complements studies that demonstrate a numerical similarity between gradient-based methods and population-based evolutionary methods16,17, and studies that show analytically that the gradients approximated by those methods are, under certain conditions, equivalent to the finite-difference gradient33,34,35.

The paper is structured as follows. We summarize the neuroevolution-gradient descent correspondence in section “Results”, and derive it in section “Methods”. Our derivation uses ideas developed in ref. 20 to treat physical particles, and applies them to neural networks: we consider a different set-up (in effect, we work with a single particle in a high-dimensional space, rather than with many particles in three-dimensional space) and different proposal rates, and we consider the limit β = ∞, which is rarely considered in the physics literature but is natural in the context of a neural network. We can associate the state x of the neural network with the position of a particle in a high-dimensional space, and the loss function U(x) with an external potential. The result is a rewriting of the correspondence between Langevin dynamics and Monte Carlo dynamics as a correspondence between the simplest forms of gradient descent and neuroevolution. Just as the Langevin-Monte Carlo correspondence provides a basis for understanding why Monte Carlo simulations of particles can approximate real dynamics31,36,37,38,39,40,41, so the neuroevolution-gradient descent correspondence shows how we can effectively perform gradient descent on the loss function without explicit calculation of gradients. The correspondence holds exactly only in the limit of vanishing mutation scale, but we use numerics to show in section “Numerical illustration of the neuroevolution-gradient descent correspondence” that it can be observed for neuroevolution done with finite mutations and gradient descent enacted with a finite timestep. We conclude in section “Conclusions”.

Results

In this section, we summarize the main analytic results of this paper. These results are derived in section “Methods”.

Consider a neural network with N parameters (weights and biases) x = {x1, …, xN}, and a loss U(x) that is a deterministic function of the network parameters. The form of the network does not enter the proof, and so the result applies to neural networks of any architecture (we shall illustrate this point numerically by considering both deep and shallow nets). The loss function may also depend upon other parameters, such as a set of training data, as in supervised learning, or a set of actions and states, as in reinforcement learning; the correspondence we shall describe applies regardless.

Gradient descent

Under the simplest form of gradient descent, the parameters xi of the network evolve according to numerical integration of

$$\frac{{{{{{\rm{d}}}}}}{x}_{i}}{{{{{{\rm{d}}}}}}t}=-\alpha \frac{\partial U({{{{{\bf{x}}}}}})}{\partial {x}_{i}}.$$
(1)

Here, time t measures the progress of training, and α is the learning rate3,4,5,6,7.
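
In practice, (1) is integrated numerically with a finite timestep. A minimal Euler-integration sketch in Python follows; the callable `grad_U`, standing for ∇U(x), and the other names here are our own illustrative choices, not part of the prescription itself.

```python
import numpy as np

def gradient_descent(x, grad_U, alpha, n_steps):
    """Euler integration of Eq. (1): x <- x - alpha * grad U(x) at each step."""
    x = np.array(x, dtype=float)
    for _ in range(n_steps):
        x = x - alpha * grad_U(x)
    return x
```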

Neuroevolution

Now consider training the network by neuroevolution, defined by the following Monte Carlo protocol (a minimal code sketch of these steps follows the list).

  1. 1.

    Initialize the neural-network parameters x and calculate the loss function U(x). Set time t = 0.

  2. 2.

    Propose a change (or “mutation”) of each neural-network parameter by an independent Gaussian random number of zero mean and variance σ2, so that

    $${{{{{\bf{x}}}}}}\to {{{{{\bf{x}}}}}}+{{{{{\boldsymbol{\epsilon }}}}}},$$
    (2)

    where ϵ = {ϵ1, …, ϵN} and \({\epsilon }_{i} \sim {{{{{\mathcal{N}}}}}}(0,{\sigma }^{2})\).

  3. 3.

    Accept the mutation with the Metropolis probability \(\min \left(1,{{{{{{\rm{e}}}}}}}^{-\beta [U({{{{{\bf{x}}}}}}+{{{{{\boldsymbol{\epsilon }}}}}})-U({{{{{\bf{x}}}}}})]}\right)\), and otherwise reject it. In the latter case we return to the original neural network. The parameter β can be considered to be a reciprocal evolutionary temperature.

  4. 4.

    Increment time t → t + 1, and return to step 2.
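
As a concrete illustration, a minimal Python/NumPy sketch of steps 1–4 might read as follows. The callable `loss`, the parameter array `x`, and all other names are our own stand-ins for U(x) and the network parameters; they are not prescribed by the protocol itself.

```python
import numpy as np

def neuroevolution(x, loss, sigma, beta, n_steps, rng=None):
    """Minimal sketch of the neuroevolution protocol, steps 1-4.

    x       : 1D array of network parameters (weights and biases)
    loss    : callable returning the scalar loss U(x)
    sigma   : mutation scale (standard deviation of each Gaussian mutation)
    beta    : reciprocal evolutionary temperature; np.inf accepts a mutation
              only if the loss does not increase
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x, dtype=float)
    U = loss(x)                                         # step 1: initial loss
    for _ in range(n_steps):                            # step 4: increment time, repeat
        eps = rng.normal(0.0, sigma, size=x.shape)      # step 2: propose a mutation
        U_new = loss(x + eps)
        dU = U_new - U
        if dU <= 0.0:
            accept = True                               # step 3: downhill moves always accepted
        elif np.isinf(beta):
            accept = False                              # beta = infinity: reject uphill moves
        else:
            accept = rng.random() < np.exp(-beta * dU)  # Metropolis probability
        if accept:
            x, U = x + eps, U_new                       # otherwise keep the original network
    return x
```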

For finite β, and in the limit of small mutation scale σ, the parameters of the neural network evolve under this procedure according to the Langevin equation

$$\frac{{{{{{\rm{d}}}}}}{x}_{i}}{{{{{{\rm{d}}}}}}t}=-\frac{\beta {\sigma }^{2}}{2}\frac{\partial U({{{{{\bf{x}}}}}})}{\partial {x}_{i}}+{\xi }_{i}(t),$$
(3)

where ξ is a Gaussian white noise with zero mean and variance σ2:

$$\left\langle {\xi }_{i}(t)\right\rangle =0,\;\;\;\; \left\langle {\xi }_{i}(t){\xi }_{j}(t^{\prime} )\right\rangle ={\sigma }^{2}{\delta }_{ij}\delta (t-t^{\prime} ).$$
(4)

Eq. (3) describes an evolution of the neural-network parameters xi that is equivalent to gradient descent with learning rate α = βσ2/2 in the presence of Gaussian white noise. Averaging over independent stochastic trajectories of the learning process (starting from identical initial conditions) gives

$$\frac{{{{{{\rm{d}}}}}}\left\langle {x}_{i}\right\rangle }{{{{{{\rm{d}}}}}}t}=-\frac{\beta {\sigma }^{2}}{2}\frac{\partial U({{{{{\bf{x}}}}}})}{\partial {x}_{i}},$$
(5)

which has the same form as the gradient descent equation (1). Thus, when averaged over many independent realizations of the learning process, the neuroevolution procedure 1–4, with finite β, is equivalent in the limit of small mutation scale to gradient descent on the loss function.

In the case β = ∞, where mutations are only accepted if the loss function does not increase, the parameters of the network evolve according to the Langevin equation

$$\frac{{{{{{\rm{d}}}}}}{x}_{i}}{{{{{{\rm{d}}}}}}t}=-\frac{\sigma }{\sqrt{2\pi }}\frac{1}{| \nabla U({{{{{\bf{x}}}}}})| }\frac{\partial U({{{{{\bf{x}}}}}})}{\partial {x}_{i}}+{\eta }_{i}(t),$$
(6)

where η is a Gaussian white noise with zero mean and variance σ2/2:

$$\left\langle {\eta }_{i}(t)\right\rangle =0,\;\;\;\; \left\langle {\eta }_{i}(t){\eta }_{j}(t^{\prime} )\right\rangle =\frac{{\sigma }^{2}}{2}{\delta }_{ij}\delta (t-t^{\prime} ).$$
(7)

The form (6) is different to (3), because the gradient in the first term is normalized by the factor \(| \nabla U({{{{{\bf{x}}}}}})| =\sqrt{\mathop{\sum }\nolimits_{i = 1}^{N}{(\partial U({{{{{\bf{x}}}}}})/\partial {x}_{i})}^{2}}\), which serves as an effective coordinate-dependent rescaling (or vector normalization) of the timestep, but (6) nonetheless describes a form of gradient descent on the loss function U(x). Note that the drift term in (6) is of lower order in σ than the diffusion term (which is not the case for finite β). In the limit of small σ, (6) describes an effective process in which uphill moves in loss cannot be made, consistent with the stochastic process from which it is derived.

Averaged over independent realizations of the learning process, (6) reads

$$\frac{{{{{{\rm{d}}}}}}\left\langle {x}_{i}\right\rangle }{{{{{{\rm{d}}}}}}t}=-\frac{\sigma }{\sqrt{2\pi }}\frac{1}{| \nabla U({{{{{\bf{x}}}}}})| }\frac{\partial U({{{{{\bf{x}}}}}})}{\partial {x}_{i}}.$$
(8)

The results (3) and (6) show that training a network by making random mutations of its parameters is, in the limit of small mutations, equivalent to noisy gradient descent on the loss function.

Writing \(\dot{U}({{{{{\bf{x}}}}}})=\dot{{{{{{\bf{x}}}}}}}\cdot \nabla U({{{{{\bf{x}}}}}})\), using (3) and (6), and averaging over noise shows the evolution of the mean loss function under neuroevolution to obey, in the limit of small σ,

$$\left\langle \dot{U}({{{\bf{x}}}})\right\rangle =\left\{\begin{array}{ll}-\frac{\beta {\sigma }^{2}}{2}{(\nabla U({{{\bf{x}}}}))}^{2} & {{{\rm{if}}}}\, \beta \,\ne\, \infty \hfill\\ -\frac{\sigma }{\sqrt{2\pi }} | \nabla U({{{\bf{x}}}})| & {{{\rm{if}}}}\, \beta =\infty \end{array}\right.,$$
(9)

equivalent to evolution under the noise-free forms of gradient descent (5) and (8).

In section “Numerical illustration of the neuroevolution-gradient descent correspondence” we illustrate numerically the correspondence described here, and show that it can be observed numerically for non-vanishing mutations and finite integration steps. In section “Methods” we detail the derivation of the correspondence.

Discussion

Numerical illustration of the neuroevolution-gradient descent correspondence

In this section, we demonstrate the neuroevolution-gradient descent correspondence numerically. We consider single-layer neural networks for the cases of infinite and finite β, and a deep network for the case of infinite β.

Shallow net, β = ∞

In order to observe correspondence numerically, the neuroevolution mutation scale σ must be small enough that correction terms neglected in the expansion leading to (25) and (55) are small. The required range of σ is difficult to know in advance, but straightforward to determine empirically: below the relevant value of σ, the results of neuroevolution will be statistically similar when scaled in the manner described below.

We consider a simple supervised-learning problem in which we train a neural network to express the function \({f}_{0}(\theta )=\sin (2\pi \theta )\) on the interval θ ∈ [0, 1). We calculated the loss using K = 1000 points on the interval,

$$U({{{{{\bf{x}}}}}})=\frac{1}{K}\mathop{\sum }\limits_{j=0}^{K-1}{\left[{f}_{{{{{{\bf{x}}}}}}}(j/K)-{f}_{0}(j/K)\right]}^{2},$$
(10)

where

$${f}_{{{{{{\bf{x}}}}}}}(\theta )=\mathop{\sum }\limits_{i=0}^{M-1}{x}_{3i+1}\tanh ({x}_{3i+2}\theta +{x}_{3i+3})$$
(11)

is the output of a single-layer neural network with one input node, one output node, M = 30 hidden nodes, and N = 3M parameters xi. These parameters are initially chosen to be Gaussian random numbers with zero mean and variance \({\sigma }_{0}^{2}=1{0}^{-4}\). The correspondence is insensitive to the choice of initial conditions, and we shall show that it holds for different choices of initial network.
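
A minimal NumPy sketch of the loss (10) and the single-layer net (11) is given below; the parameter layout follows Eq. (11), with parameters stored in triples (output weight, input weight, bias) per hidden node, and names such as `net_output` and `loss` are our own.

```python
import numpy as np

M, K = 30, 1000                          # hidden nodes, grid points
theta = np.arange(K) / K                 # theta_j = j/K on [0, 1)
f0 = np.sin(2 * np.pi * theta)           # target function

def net_output(x, theta):
    """Eq. (11): f_x(theta) for the single-hidden-layer tanh net.
    x is laid out in triples (x_{3i+1}, x_{3i+2}, x_{3i+3})."""
    w_out, w_in, b = x[0::3], x[1::3], x[2::3]
    return np.tanh(np.outer(theta, w_in) + b) @ w_out

def loss(x):
    """Eq. (10): mean-squared error over the K grid points."""
    return np.mean((net_output(x, theta) - f0) ** 2)

rng = np.random.default_rng(1)
x0 = rng.normal(0.0, 1e-2, size=3 * M)   # initial parameters, variance 1e-4
```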

We performed gradient descent with learning rate α = 10−5. We chose the learning rate arbitrarily, and verified that the results of gradient-descent simulations were unchanged upon changing the learning rate by a factor of 10 or 1/10. We used Euler integration of the noise-free version of Eq. (6), updating all weights xi at each timestep tgd = 1, 2, … according to the prescription

$${x}_{i}({t}_{{{{{{\rm{gd}}}}}}}+1)={x}_{i}({t}_{{{{{{\rm{gd}}}}}}})-\frac{\alpha }{| \nabla U({{{{{\bf{x}}}}}})| }\frac{\partial U({{{{{\bf{x}}}}}})}{\partial {x}_{i}},$$
(12)

where

$$\frac{\partial U({{{{{\bf{x}}}}}})}{\partial {x}_{i}}=\frac{2}{K}\mathop{\sum }\limits_{j=0}^{K-1}\left[{f}_{{{{{{\bf{x}}}}}}}(j/K)-{f}_{0}(j/K)\right]\frac{\partial {f}_{{{{{{\bf{x}}}}}}}(j/K)}{\partial {x}_{i}},$$
(13)

and

$$\frac{\partial {f}_{{{{\bf{x}}}}}(\theta )}{\partial {x}_{i}}=\left\{\begin{array}{ll}{{{\rm{tanh}}}} (\theta {x}_{i+1}+{x}_{i+2})& {{{\rm{if}}}}\, i\ , {{{\rm{mod}}}}\, 3 = 1\\ \theta {x}_{i-1}\, {{{\rm{sech}}}}^{2}(\theta {x}_{i}+{x}_{i+1}) & {{{\rm{if}}}}\, i\, {{{\rm{mod}}}}\, 3 = 2\hfill\\ {x}_{i-2}\, {{{\rm{sech}}}}^{2}(\theta {x}_{i-1}+{x}_{i}) & {{{\rm{if}}}}\, i\, {{{\rm{mod}}}}\, 3 = 0.\hfill\end{array}\right.$$
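
For reference, a sketch of the gradient (13) and the normalized update (12) in the same NumPy setting, reusing `net_output`, `theta`, `f0`, and `K` from the sketch above (the gradient is written out by hand rather than taken from an autodiff library):

```python
def grad_loss(x):
    """Eq. (13): analytic gradient of the loss (10) for the net (11)."""
    w_out, w_in, b = x[0::3], x[1::3], x[2::3]
    pre = np.outer(theta, w_in) + b                 # (K, M) pre-activations
    sech2 = 1.0 / np.cosh(pre) ** 2
    err = net_output(x, theta) - f0                 # (K,) residuals
    g = np.empty_like(x)
    g[0::3] = 2.0 / K * err @ np.tanh(pre)                      # wrt x_{3i+1}
    g[1::3] = 2.0 / K * err @ (theta[:, None] * w_out * sech2)  # wrt x_{3i+2}
    g[2::3] = 2.0 / K * err @ (w_out * sech2)                   # wrt x_{3i+3}
    return g

alpha = 1e-5                                        # learning rate used in the text

def gd_step(x):
    """One step of Eq. (12): normalized (clipped) gradient descent."""
    g = grad_loss(x)
    return x - alpha * g / np.linalg.norm(g)
```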

We did neuroevolution following the Monte Carlo procedure described in section “Neuroevolution”, in the limit β = ∞, i.e., we accepted only moves that did not increase the loss function. We chose the mutation scale

$$\sigma =\lambda \alpha \sqrt{2\pi },$$
(14)

where λ is a parameter. According to (6) and (12), this prescription sets the neuroevolution timescale tevol to be a factor λ times that of the gradient-descent timescale. Thus, one neuroevolution step corresponds to λ integration steps of the gradient descent procedure. In figures, we compare gradient descent with neuroevolution as a function of common (scaled) time t = αtgd = αλtevol.
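
Putting the previous sketches together, the comparison might be organized as follows (a hedged sketch reusing `neuroevolution`, `loss`, `x0`, `alpha`, and `gd_step` from above; the final scaled time and the number of trajectories are illustrative and much smaller than those used for the figures):

```python
lam, T = 0.1, 0.01                          # mutation-scale parameter and final scaled time
sigma = lam * alpha * np.sqrt(2 * np.pi)    # Eq. (14)
n_gd = int(T / alpha)                       # gradient-descent steps:  t = alpha * t_gd
n_evol = int(T / (alpha * lam))             # neuroevolution steps:    t = alpha * lam * t_evol

x_gd = x0.copy()                            # gradient descent, Eq. (12)
for _ in range(n_gd):
    x_gd = gd_step(x_gd)

# neuroevolution at beta = infinity, averaged over a few independent trajectories
x_evol = np.mean([neuroevolution(x0, loss, sigma, np.inf, n_evol,
                                 rng=np.random.default_rng(seed))
                  for seed in range(5)], axis=0)
```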

In Fig. 1(a) we show the evolution of four individual weights under neuroevolution (using mutation scale λ = 1/10) and gradient descent (individual weights can be matched across the two methods because they have the same initial values under both). The correspondence predicted analytically can be seen numerically: individual neuroevolution trajectories (gray) fluctuate around the gradient descent result (black), and when averaged over individual trajectories the results of neuroevolution (green) approximate those of gradient descent. In Fig. 1(b) we show the individual and averaged values of the weights of neuro-evolved networks at time t = 10 compared to those of gradient descent. In general, the weights generated by averaging over neuroevolution trajectories approximate those of gradient descent, with some discrepancy seen in the values of the largest weights. In Fig. 1(c) we show the loss under neuroevolution and gradient descent. As predicted by (9), averaged neuroevolution and gradient descent are equivalent.

Fig. 1: Numerical illustration of the equivalence of gradient descent and neuroevolution in the case β = ∞.
figure 1

a Evolution with time of 4 of the 90 weights of the neural network (11) under the gradient descent equation Eq. (12) (black: “gradient” stands for “gradient descent”), and under neuroevolution with mutation scale λ = 1/10 [see (14)]. Here and in subsequent panels we show 25 independent neuroevolution trajectories (gray, marked “evolution”) and the average of 1000 independent trajectories (green, marked “\(\left\langle {{{{{\rm{evolution}}}}}}\right\rangle \)”). b All weights at time t = 10 under the two methods. c Loss as a function of time.

In Supplementary Fig. 1 we show similar quantities using a different choice of initial neural network; the correspondence between neuroevolution and gradient descent is again apparent.

In Fig. 2(a) we show the time evolution of a single weight of the network under gradient descent and neuroevolution, the latter for three sizes of mutation step σ. As σ increases, the size of fluctuations of individual trajectories about the mean increases, as predicted by (6). As a result, more trajectories are required to estimate the average, and for a fixed number of trajectories (as used here) the estimated average becomes less precise. In addition, as σ increases, the assumptions underlying the correspondence derivation eventually break down, in which case the neuroevolution average will not converge to the gradient descent result even as more trajectories are sampled.

Fig. 2: The smaller the neuroevolution mutation scale, the closer the neuroevolution-gradient descent correspondence.
figure 2

a Time evolution of a single neural-network weight, for three different neuroevolution mutation scales [see (14)]. b The quantity Δ(t), Eq. (15), is a measure of the difference between networks evolved under gradient descent and neuroevolution.

In Fig. 2(b) we show the mean-squared difference of the parameter vector of the model under gradient descent and neuroevolution,

$${{\Delta }}(t)\equiv \frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}{\left({x}_{i}^{{{{{{\rm{gradient}}}}}}}(t)-\left\langle {x}_{i}^{{{{{{\rm{evolution}}}}}}}(t)\right\rangle \right)}^{2}.$$
(15)

Here, N is the number of network parameters; \({x}_{i}^{{{{{{\rm{gradient}}}}}}}(t)\) is the time evolution of neural-network parameter i under the gradient descent equation Eq. (12); and \(\left\langle {x}_{i}^{{{{{{\rm{evolution}}}}}}}(t)\right\rangle \) is the mean value of neural-network parameter i over the ensemble of neuroevolution trajectories. The smaller the neuroevolution step size, the smaller is Δ(t), and the closer the neuroevolution-gradient descent correspondence.

In Supplementary Fig. 2 we show the evolution with time of the loss for different mutation scales (the left-hand plot is a reproduction of Fig. 1(c)). The trend shown is similar to that of the weights in Fig. 2.

Deep net, β = ∞

One feature of the correspondence derivation is that the architecture of the neural network does not appear. As long as the loss U(x) is a deterministic function of the neural-network parameters x, correspondence between gradient descent and neuroevolution will be observed if the mutation scale is small enough (what constitutes “small enough” likely depends on the neural-network architecture, as well as on the problem under study; the required mutation scale can be determined empirically, even without access to gradient-descent results, because when correspondence holds the results of neuroevolution simulations are statistically similar when scaled as we have described).

To demonstrate invariance to architecture we repeat the previous comparison, now using a deep neural network (we train the net to reproduce the target function \({f}_{0}(\theta )=\sin (\pi \theta )\) on the interval θ ∈ [ − 1, 1]). The network has 8 fully-connected hidden layers, each 32 nodes wide, and 7489 total parameters. As before we use tanh activations on the hidden nodes, and have one input node and one output.

Results are shown in Fig. 3. In panel (a) we show the evolution of two parameters of the network, under gradient descent and for neuroevolution with step-size parameter λ = 1/10 [see (14)]. As for the shallow net, the correspondence is apparent. Neuroevolution averages (green lines) are taken over 100 trajectories. In panel (b) we show the loss, for gradient descent and two different neuroevolution step-size parameters. As expected, the correspondence is more precise for smaller λ. As before, for large enough λ the correspondence breaks down: see Supplementary Fig. 3.

Fig. 3: As Fig. 1, now for a deep neural network with eight hidden layers.
figure 3

Panel a shows 2 of the 7489 parameters of the network. Panel b shows the loss, for two different neuroevolution step-size parameters [see (14)].

In Fig. 4 we show all parameters of the deep net at training time t = 5, under the two dynamics. We show the results of gradient descent in black, and independent neuroevolution trajectories in gray. As predicted analytically, the neuroevolution results fall either side of the gradient-descent result, and the network constructed by averaging over independent neuroevolution trajectories (green) is essentially identical to the network produced by gradient descent.

Fig. 4: Numerical illustration of correspondence for a deep net.
figure 4

We show all 7489 parameters xi of the deep net of Fig. 3 at training time t = 5. As predicted analytically, the neural network created by averaging (green symbols) over a noninteracting population of neuroevolutionary processes (gray symbols) is essentially the same network as that produced by gradient descent (black symbols). For clarity, weights are ordered by their final values under gradient descent.

In Fig. 5, we illustrate the dynamics of learning and the scale of the loss function by showing a comparison between the target function \({f}_{0}(\theta )=\sin (\pi \theta )\) and the net function fx(θ). We show the latter at three different training times, for gradient descent and neuroevolution trajectories.

Fig. 5: Illustration of the learning dynamics.
figure 5

Loss versus time and comparison of the target function \({f}_{0}(\theta )=\sin (\pi \theta )\) and the net function fx(θ) at three different training times, for the learning dynamics of Fig. 3 (λ = 1/10 for the neuroevolution results).

Shallow net, finite β

In this section, we illustrate the gradient descent-neuroevolution correspondence for finite β. We consider the same supervised-learning problem as before, and set the network width to 256 nodes. We did neuroevolution using the Metropolis acceptance criterion with reciprocal temperature β = 103. This choice is arbitrary, but is representative of a wide range of finite values of β. Finite-temperature simulations are common in particle-based systems24. Here, temperature has no particular physical significance, but comparing simulations done at finite and infinite β makes the point that different choices of neuroevolution acceptance rate result in a dynamics equivalent to different gradient-descent protocols.

We did gradient descent using the integration scheme

$${x}_{i}({t}_{{{{{{\rm{gd}}}}}}}+1)={x}_{i}({t}_{{{{{{\rm{gd}}}}}}})-\alpha \frac{\partial U({{{{{\bf{x}}}}}})}{\partial {x}_{i}},$$
(16)

where α = 10−4 is the learning rate. Comparing (3) and (16), we set the neuroevolution mutation scale to be

$$\sigma =\lambda \sqrt{2\alpha /\beta },$$
(17)

where λ is a parameter. Thus, one neuroevolution step corresponds to λ integration steps of the gradient descent procedure. In figures, we compare gradient descent with neuroevolution as a function of common (scaled) time t = αtgd = αλtevol.

Results are shown in Fig. 6, for step-size parameter λ = 1. In panel (a) we show the evolution with time of two of the weights of the network. The noise associated with neuroevolution at this value of β is considerable: individual trajectories (gray lines) bear little resemblance to the gradient-descent result (black line). However, the population average (green line, average of n = 1000 trajectories) shows the expected correspondence. The correspondence is less precise than that shown previously, because we use a larger effective step size and each trajectory is much noisier than its infinite-β counterpart.

Fig. 6: Illustration of the neuroevolution-gradient descent correspondence for the case of finite β.
figure 6

Panel a shows two parameters of the network, panel b shows the loss, and panel c shows the parameter Δ, Eq. (15), a measure of the difference between the average network produced by neuroevolution and the network produced by gradient descent.

In Fig. 6(b) we show the loss, with line colors corresponding to the quantities of panel (a). In addition, we show the loss of the average network produced by neuroevolution (red line), \(U(\left\langle {{{{{\bf{x}}}}}}\right\rangle )\), which, if correspondence holds, should be equal to \(\left\langle U({{{{{\bf{x}}}}}})\right\rangle \) (green line). The initial fast relaxation of the loss (the boxed region) shows a difference between gradient descent and averaged neuroevolution results; doing neuroevolution for smaller step-size parameter λ = 1/10 (inset) reduces this difference, as expected.

In panel (c) we show the parameter Δ, Eq. (15), a measure of the difference between the average network \(\left\langle x\right\rangle \) produced by neuroevolution and the network produced by gradient descent, as a function of n, the number of trajectories included in the average. If correspondence holds, this quantity should vanish in the limit of large n; the observed trend is consistent with this behavior.

In Supplementary Fig. 4, we compare a gradient-descent trajectory with a set of neuroevolution trajectories, periodically resetting the latter to the gradient-descent solution. The periodic resetting tests the correspondence for a range of initial conditions. The correspondence between gradient descent and the averaged neuroevolution trajectory is approximate (averages were taken over 152 trajectories, fewer than the 1000 used for Fig. 6) but apparent.

Conclusions

We have shown analytically that training a neural network by neuroevolution of its weights is equivalent, in the limit of small mutation scale, to noisy gradient descent on the loss function. Conditioning neuroevolution on the Metropolis acceptance criterion at finite evolutionary temperature is equivalent to a noisy version of simple gradient descent, while at infinite reciprocal evolutionary temperature the procedure is equivalent to clipped gradient descent on the loss function. Averaged over noise, the evolutionary procedures correspond to forms of gradient descent on the loss function. This correspondence is described by Equations (3), (5), (6), and (8).

Correspondence in the sense described above means that each neural-network parameter evolves the same way as a function of time under the two dynamics. Correspondence implies that the convergence properties of the two methods are the same (see e.g., Fig. 3(b)) and that the neural networks produced by the two methods are the same (see e.g., Fig. 4). The generalization properties of those networks will then also be the same.

The correspondence is formally exact only in the limit of zero mutation scale, and holds approximately for small but finite mutations. It will fail when the assumptions underlying the derivation are violated, such as when the terms neglected in (22) and (23) are not small, or when the passage from (28) to (29) is not valid because the change βU(x + ϵ) − βU(x) is not small. It is straightforward to determine empirically where correspondence holds, even without access to gradient-descent results: the results of neuroevolution, with time scaled as described, will be statistically similar when the mutation size is small enough. The time duration for which the correspondence holds increases with decreasing mutation scale (see e.g., Fig. 2(b)). We have shown here that the correspondence can be observed for a range of mutation scales, and for different neural-net architectures.

More generally, several dynamical regimes are contained within the neuroevolution master equation (19), according to the scale σ of mutations: for vanishing σ, neuroevolution is formally noisy gradient descent on the loss function; for small but nonvanishing σ it approximates noisy gradient descent enacted by explicit integration with a finite timestep; for larger σ it enacts a dynamics different to gradient descent, but one that can still learn; and for sufficiently large σ the steps taken are too large for learning to occur on accessible timescales. An indication of these various regimes can be seen in Fig. 2 and Supplementary Fig. 2.

Separate from the question of its precise temporal evolution, the master equation (19) has a well-defined stationary distribution ρ0(x). Requiring the brackets on the right-hand side of (19) to vanish ensures that P(x, t) → ρ0(x) becomes independent of time. Inserting (20) into (19) and requiring normalization of ρ0(x) reveals the stationary distribution to be the Boltzmann one, \({\rho }_{0}({{{{{\bf{x}}}}}})={{{{{{\rm{e}}}}}}}^{-\beta U({{{{{\bf{x}}}}}})}/\int {{{{{\rm{d}}}}}}{{{{{\bf{x}}}}}}^{\prime} {{{{{{\rm{e}}}}}}}^{-\beta U({{{{{\bf{x}}}}}}^{\prime} )}\). For finite β the neuroevolution procedure is ergodic, and this distribution will be sampled given sufficiently long simulation time. For β → ∞ we have \({\rho }_{0}({{{{{\bf{x}}}}}})\to \delta \left(U({{{{{\bf{x}}}}}})-{U}_{0}\right)\), where U0 is the global energy minimum; in this case the system is not ergodic (moves uphill in U(x) are not allowed) and there is no guarantee of reaching this minimum.
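
To make the stationarity statement explicit, a short check (our own rearrangement, using p(ϵ) = p( − ϵ) from (21)) is that the rates (20) obey detailed balance with respect to the Boltzmann weight,

$${W}_{{{{{{\boldsymbol{\epsilon }}}}}}}({{{{{\bf{x}}}}}})\,{{{{{{\rm{e}}}}}}}^{-\beta U({{{{{\bf{x}}}}}})}=p({{{{{\boldsymbol{\epsilon }}}}}})\min \left({{{{{{\rm{e}}}}}}}^{-\beta U({{{{{\bf{x}}}}}})},{{{{{{\rm{e}}}}}}}^{-\beta U({{{{{\bf{x}}}}}}+{{{{{\boldsymbol{\epsilon }}}}}})}\right)={W}_{-{{{{{\boldsymbol{\epsilon }}}}}}}({{{{{\bf{x}}}}}}+{{{{{\boldsymbol{\epsilon }}}}}})\,{{{{{{\rm{e}}}}}}}^{-\beta U({{{{{\bf{x}}}}}}+{{{{{\boldsymbol{\epsilon }}}}}})},$$

so that, with P(x, t) = ρ0(x), the gain term of (19) equals \({\rho }_{0}({{{{{\bf{x}}}}}}){W}_{-{{{{{\boldsymbol{\epsilon }}}}}}}({{{{{\bf{x}}}}}})\) and the right-hand side vanishes upon the change of variables ϵ → − ϵ.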

We have focused on the simple limit of the set of neuroevolution algorithms, namely a non-interacting population of neural networks that experience sequential probabilistic mutations of their parameters. We have illustrated the correspondence at the level of population averages, Equations (5) and (8). However, no communication between individuals is required, and each individually observes the correspondence defined by Equations (3) and (6).

Our results are also relevant to population-based genetic algorithms in which members of the population are periodically reset to the identities of the individuals with lowest loss values11,12,13. For instance, when correspondence holds, individuals in the neuroevolution populations considered in this paper have an averaged loss equal to that of the corresponding gradient descent algorithm. Therefore, some individuals must have loss less than that of the corresponding gradient descent algorithm (see e.g. Fig. 1(a), and Fig. 3(b) for the case λ = 1/10). This observation indicates the potential for such methods to be competitive with gradient-descent algorithms.

The neuroevolution-gradient descent correspondence we have identified follows from that between the overdamped Langevin dynamics and Metropolis Monte Carlo dynamics of a particle in an external potential19,20. Our work therefore adds to the existing set of connections between machine learning and statistical mechanics42,43, and continues a trend in machine learning of making use of old results: the stochastic and deterministic algorithms considered here come from the 1950s9,10 and 1970s1,2,3,4,5,6, and are connected by ideas developed in the 1990s19,20.

Methods

Derivation of the neuroevolution-gradient descent correspondence

We start by considering the quantity P(x, t), the probability that a neural network has the set of parameters x at time t under a given stochastic protocol. The time evolution of this quantity is governed by the master equation44,45, which in generic form reads

$${\partial }_{t}P({{{{{\bf{x}}}}}},t)= \int {{{{{\rm{d}}}}}}{{{{{\bf{x}}}}}}^{\prime} \left[P({{{{{\bf{x}}}}}}^{\prime} ,t){W}_{{{{{{\bf{x}}}}}}-{{{{{\bf{x}}}}}}^{\prime} }({{{{{\bf{x}}}}}}^{\prime} )\right.\\ -\left.P({{{{{\bf{x}}}}}},t){W}_{{{{{{\bf{x}}}}}}^{\prime} -{{{{{\bf{x}}}}}}}({{{{{\bf{x}}}}}})\right].$$
(18)

The two terms in (18) describe, respectively, gain and loss of the probability P(x, t) (note that the probability to have some set of parameters is conserved, i.e., ∫dxP(x, t) = 1). The symbol \({W}_{{{{{{\bf{x}}}}}}^{\prime} -{{{{{\bf{x}}}}}}}({{{{{\bf{x}}}}}})\) (sometimes written \(W({{{{{\bf{x}}}}}} \to {{{{{\bf{x}}}}}}^{\prime})\)) quantifies the probability of moving from the set of parameters x to the set of parameters \({{{{{\bf{x}}}}}}+({{{{{\bf{x}}}}}}^{\prime} -{{{{{\bf{x}}}}}})={{{{{\bf{x}}}}}}^{\prime} \), and encodes the details of the stochastic protocol. For the neuroevolution procedure defined in section “Neuroevolution” we write (18) as

$${\partial }_{t}P({{{{{\bf{x}}}}}},t)= \int {{{{{\rm{d}}}}}}{{{{{\boldsymbol{\epsilon }}}}}}\left[P({{{\bf{x}}}}-{{{{{\boldsymbol{\epsilon }}}}}}, t){W_{{{{{\boldsymbol{\epsilon }}}}}}}({{{\bf{x}}}}-{{{{{\boldsymbol{\epsilon }}}}}})\right.\\ -\left.P({{{{{\bf{x}}}}}},t){W}_{{{{{{\boldsymbol{\epsilon }}}}}}}({{{{{\bf{x}}}}}})\right].$$
(19)

Here, ϵ denotes the set of random numbers (the “mutation”) by which the neural-network parameters are changed; the integral \(\int {{{{{\rm{d}}}}}}{{{{{\boldsymbol{\epsilon }}}}}}=\int\nolimits_{-\infty }^{\infty }{{{{{\rm{d}}}}}}{\epsilon }_{1}\cdots \int\nolimits_{-\infty }^{\infty }{{{{{\rm{d}}}}}}{\epsilon }_{N}\) runs over all possible choices of mutations; and

$${W}_{{{{{{\boldsymbol{\epsilon }}}}}}}({{{{{\bf{x}}}}}})=p({{{{{\boldsymbol{\epsilon }}}}}})\min \left(1,{{{{{{\rm{e}}}}}}}^{-\beta [U({{{{{\bf{x}}}}}}+{{{{{\boldsymbol{\epsilon }}}}}})-U({{{{{\bf{x}}}}}})]}\right)$$
(20)

is the rate for going from the set of parameters x to the set of parameters x + ϵ. Eq. (20) contains two factors. The first,

$$p({{{{{\boldsymbol{\epsilon }}}}}})=\mathop{\prod }\limits_{i=1}^{N}p({\epsilon }_{i})\;\;\;\; {{{{{\rm{with}}}}}}\;\;\;\; p({\epsilon }_{i})=\frac{1}{\sqrt{2\pi {\sigma }^{2}}}{{{{{{\rm{e}}}}}}}^{-\frac{{\epsilon }_{i}^{2}}{2{\sigma }^{2}}},$$
(21)

quantifies the probability of proposing a set of Gaussian random numbers ϵ. The second factor, the Metropolis “min” function in (20), quantifies the probability of accepting the proposed move from x to x + ϵ; recall that U(x) is the loss function.

We can pass from the master equation (19) to a Fokker-Planck equation by assuming a small mutation scale σ, and expanding the terms in (19) to second order in σ44,45. Thus

$$P({{{\bf{x}}}}-{{{{{\boldsymbol{\epsilon }}}}}}, t) \approx \left(1-{{{{{\boldsymbol{\epsilon }}}}}}\cdot \nabla +\frac{1}{2}{({{{{{\boldsymbol{\epsilon }}}}}}\cdot \nabla )}^{2}\right)P({{{{{\bf{x}}}}}},t)$$
(22)

and

$${W}_{{{{{{\boldsymbol{\epsilon }}}}}}}({{{{{\bf{x}}}}}}-{{{{{\boldsymbol{\epsilon }}}}}})\approx \left(1-{{{{{\boldsymbol{\epsilon }}}}}}\cdot \nabla +\frac{1}{2}{({{{{{\boldsymbol{\epsilon }}}}}}\cdot \nabla )}^{2}\right){W}_{{{{{{\boldsymbol{\epsilon }}}}}}}({{{{{\bf{x}}}}}}),$$
(23)

where \({{{{{\boldsymbol{\epsilon }}}}}}\cdot \nabla =\mathop{\sum }\nolimits_{i = 1}^{N}{\epsilon }_{i}{\partial }_{i}\) (note that \({\partial }_{i}\equiv \frac{\partial }{\partial {x}_{i}}\)). Collecting terms resulting from the expansion gives

$${\partial }_{t}P({{{{{\bf{x}}}}}},t)\approx -\int {{{{{\rm{d}}}}}}{{{{{\boldsymbol{\epsilon }}}}}}({{{{{\boldsymbol{\epsilon }}}}}}\cdot \nabla )P({{{{{\bf{x}}}}}},t){W}_{{{{{{\boldsymbol{\epsilon }}}}}}}({{{{{\bf{x}}}}}})\\ +\frac{1}{2}\int {{{{{\rm{d}}}}}}{{{{{\boldsymbol{\epsilon }}}}}}{({{{{{\boldsymbol{\epsilon }}}}}}\cdot \nabla )}^{2}P({{{{{\bf{x}}}}}},t){W}_{{{{{{\boldsymbol{\epsilon }}}}}}}({{{{{\bf{x}}}}}}).$$
(24)

Taking the integrals inside the sums, (24) reads

$${\partial }_{t}P({{{{{\bf{x}}}}}},t)\approx -\mathop{\sum }\limits_{i=1}^{N}\frac{\partial }{\partial {x}_{i}}\left({A}_{i}({{{{{\bf{x}}}}}})P({{{{{\bf{x}}}}}},t)\right)\\ +\frac{1}{2}\mathop{\sum }\limits_{i,j=1}^{N}\frac{{\partial }^{2}}{\partial {x}_{i}\partial {x}_{j}}\left({B}_{ij}({{{{{\bf{x}}}}}})P({{{{{\bf{x}}}}}},t)\right),$$
(25)

where

$${A}_{i}({{{{{\bf{x}}}}}})\equiv \int {{{{{\rm{d}}}}}}{{{{{\boldsymbol{\epsilon }}}}}}\ {\epsilon }_{i}{W}_{{{{{{\boldsymbol{\epsilon }}}}}}}({{{{{\bf{x}}}}}}),$$
(26)

and

$${B}_{ij}({{{{{\bf{x}}}}}})\equiv \int {{{{{\rm{d}}}}}}{{{{{\boldsymbol{\epsilon }}}}}}\ {\epsilon }_{i}{\epsilon }_{j}{W}_{{{{{{\boldsymbol{\epsilon }}}}}}}({{{{{\bf{x}}}}}}).$$
(27)

What remains is to calculate (26) and (27), which we do in different ways depending on the value of the evolutionary reciprocal temperature β.
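
Before specializing to particular values of β, note that (26) and (27) can also be estimated numerically by sampling proposals; the following sketch (our own check, with `loss` an assumed callable returning U(x)) can be compared against the closed forms (34), (39), (51), and (54) derived below.

```python
import numpy as np

def drift_diffusion(x, loss, sigma, beta, n_samples=50_000, seed=3):
    """Monte Carlo estimates of A_i(x), Eq. (26), and B_ii(x), Eq. (27)."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, sigma, size=(n_samples, x.size))
    dU = np.array([loss(x + e) for e in eps]) - loss(x)
    if np.isinf(beta):
        acc = (dU <= 0.0).astype(float)            # acceptance probability, beta = infinity
    else:
        acc = np.exp(-beta * np.maximum(dU, 0.0))  # Metropolis acceptance probability
    A = np.mean(eps * acc[:, None], axis=0)        # compare with Eq. (34) or Eq. (51)
    B = np.mean(eps ** 2 * acc[:, None], axis=0)   # compare with Eq. (39) or Eq. (54)
    return A, B
```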

The case of finite β

First we consider finite β, in which case we can evaluate (26) and (27) using the results of Refs. 19,20 (making small changes in order to account for differences in proposal rates between those papers and ours).

Eq. (26) can be evaluated as follows (writing U(x) = U for brevity):

$${A}_{i}({{{{{\bf{x}}}}}})=\int {{{{{\rm{d}}}}}}{{{{{\boldsymbol{\epsilon }}}}}}\ p({{{{{\boldsymbol{\epsilon }}}}}}){\epsilon }_{i}\min \left(1,{{{{{{\rm{e}}}}}}}^{-\beta [U({{{{{\bf{x}}}}}}+{{{{{\boldsymbol{\epsilon }}}}}})-U({{{{{\bf{x}}}}}})]}\right)$$
(28)
$$\approx \int {{{{{\rm{d}}}}}}{{{{{\boldsymbol{\epsilon }}}}}}\ p({{{{{\boldsymbol{\epsilon }}}}}}){\epsilon }_{i}\min \left(1,1-\beta {{{{{\boldsymbol{\epsilon }}}}}}\cdot \nabla U\right)$$
(29)
$$\;\;\;= \int {{{{{\rm{d}}}}}}{{{{{\boldsymbol{\epsilon }}}}}}\ p({{{{{\boldsymbol{\epsilon }}}}}}){\epsilon }_{i}{{\Theta }}(-{{{{{\boldsymbol{\epsilon }}}}}}\cdot \nabla U)\\ +\int {{{{{\rm{d}}}}}}{{{{{\boldsymbol{\epsilon }}}}}}\ p({{{{{\boldsymbol{\epsilon }}}}}}){\epsilon }_{i}(1-\beta {{{{{\boldsymbol{\epsilon }}}}}}\cdot \nabla U){{\Theta }}({{{{{\boldsymbol{\epsilon }}}}}}\cdot \nabla U)$$
(30)
$$= \int {{{{{\rm{d}}}}}}{{{{{\boldsymbol{\epsilon }}}}}}\ p({{{{{\boldsymbol{\epsilon }}}}}}){\epsilon }_{i}\\ -\beta \int {{{{{\rm{d}}}}}}{{{{{\boldsymbol{\epsilon }}}}}}\ p({{{{{\boldsymbol{\epsilon }}}}}}){{\Theta }}({{{{{\boldsymbol{\epsilon }}}}}}\cdot \nabla U)\mathop{\sum}\limits_{j}{\epsilon }_{i}{\epsilon }_{j}{\partial }_{j}U$$
(31)
$$=-\beta \int {{{{{\rm{d}}}}}}{{{{{\boldsymbol{\epsilon }}}}}}\ p({{{{{\boldsymbol{\epsilon }}}}}}){{\Theta }}({{{{{\boldsymbol{\epsilon }}}}}}\cdot \nabla U)\mathop{\sum}\limits_{j}{\epsilon }_{i}{\epsilon }_{j}{\partial }_{j}U$$
(32)
$$=-\frac{\beta }{2}\int {{{{{\rm{d}}}}}}{{{{{\boldsymbol{\epsilon }}}}}}\ p({{{{{\boldsymbol{\epsilon }}}}}})\mathop{\sum}\limits_{j}{\epsilon }_{i}{\epsilon }_{j}{\partial }_{j}U$$
(33)
$$=-\frac{\beta {\sigma }^{2}}{2}{\partial }_{i}U.\hskip50pt$$
(34)

In these expressions Θ(x) = 1 if x ≥ 0 and is zero otherwise. In going from (28) to (29) we have assumed that βϵ ⋅ ∇U(x) is small. This condition cannot be met for β = ∞; that case is treated later. In going from (30) to (31) we have used the result Θ(x) + Θ( − x) = 1; the first integral in (31) then vanishes by symmetry. The second integral, shown in (32), can be turned into (33) using the symmetry arguments given in Ref. 20, which we motivate as follows. Upon a change of sign of the integration variables, ϵ → − ϵ, the value of the integral in (32) is unchanged and it is brought to a form that is identical except for a change of sign of the argument of the Θ function. Adding the two forms of the integral removes the Θ functions, giving the form shown in (33), and dividing by 2 restores the value of the original integral. (33) can be evaluated using standard results of Gaussian integrals.

Eq. (27) can be evaluated in a similar way:

$${B}_{ij}({{{{{\bf{x}}}}}})=\int {{{{{\rm{d}}}}}}{{{{{\boldsymbol{\epsilon }}}}}}\ p({{{{{\boldsymbol{\epsilon }}}}}}){\epsilon }_{i}{\epsilon }_{j}\min \left(1,{{{{{{\rm{e}}}}}}}^{-\beta [U({{{{{\bf{x}}}}}}+{{{{{\boldsymbol{\epsilon }}}}}})-U({{{{{\bf{x}}}}}})]}\right)$$
(35)
$$\,\,\,\,\,\,\,\,\approx \int {{{{{\rm{d}}}}}}{{{{{\boldsymbol{\epsilon }}}}}}\ p({{{{{\boldsymbol{\epsilon }}}}}}){\epsilon }_{i}{\epsilon }_{j}\min \left(1,1-\beta {{{{{\boldsymbol{\epsilon }}}}}}\cdot \nabla U\right)$$
(36)
$$\hskip10pt =\int {{{{{\rm{d}}}}}}{{{{{\boldsymbol{\epsilon }}}}}}\ p({{{{{\boldsymbol{\epsilon }}}}}}){\epsilon }_{i}{\epsilon }_{j}{{\Theta }}(-{{{{{\boldsymbol{\epsilon }}}}}}\cdot \nabla U)\hskip-10pt\\ \;\;\;\;+\int {{{{{\rm{d}}}}}}{{{{{\boldsymbol{\epsilon }}}}}}\ p({{{{{\boldsymbol{\epsilon }}}}}}){\epsilon }_{i}{\epsilon }_{j}(1-\beta {{{{{\boldsymbol{\epsilon }}}}}}\cdot \nabla U){{\Theta }}({{{{{\boldsymbol{\epsilon }}}}}}\cdot \nabla U)\hskip-21pt$$
(37)
$$\approx \int {{{{{\rm{d}}}}}}{{{{{\boldsymbol{\epsilon }}}}}}\ p({{{{{\boldsymbol{\epsilon }}}}}}){\epsilon }_{i}{\epsilon }_{j}\hskip12pt$$
(38)
$$={\sigma }^{2}{\delta }_{ij}.\hskip105pt$$
(39)

The ≈ sign in (38) indicates that we have omitted terms of order σ3.

Inserting (34) and (39) into (25) gives us, to second order in σ, the Fokker-Planck equation

$$\frac{\partial P({{{{{\bf{x}}}}}},t)}{\partial t} \approx -\mathop{\sum }\limits_{i=1}^{N}\frac{\partial }{\partial {x}_{i}}\left(-\frac{\beta {\sigma }^{2}}{2}\frac{\partial U({{{{{\bf{x}}}}}})}{\partial {x}_{i}}P({{{{{\bf{x}}}}}},t)\right)\\ +\frac{1}{2}\mathop{\sum }\limits_{i=1}^{N}\frac{{\partial }^{2}}{\partial {x}_{i}^{2}}\left({\sigma }^{2}P({{{{{\bf{x}}}}}},t)\right).$$
(40)

This equation is equivalent (the diffusion term is independent of x, and so the choice of stochastic calculus is unimportant) to the N Langevin equations44,45

$$\frac{{{{{{\rm{d}}}}}}{x}_{i}}{{{{{{\rm{d}}}}}}t}=-\frac{\beta {\sigma }^{2}}{2}\frac{\partial U({{{{{\bf{x}}}}}})}{\partial {x}_{i}}+{\xi }_{i}(t)\;\;\;\; \forall i,$$
(41)

where ξ is a Gaussian white noise with zero mean and variance σ2:

$$\left\langle {\xi }_{i}(t)\right\rangle =0,\;\;\;\; \left\langle {\xi }_{i}(t){\xi }_{j}(t^{\prime} )\right\rangle ={\sigma }^{2}{\delta }_{ij}\delta (t-t^{\prime} ).$$
(42)

Eq. (41) describes an evolution of the neural-network parameters xi that is equivalent to gradient descent with learning rate α = βσ2/2, plus Gaussian white noise. Averaging over independent stochastic trajectories of the learning process (starting from identical initial conditions) gives

$$\frac{{{{{{\rm{d}}}}}}\left\langle {x}_{i}\right\rangle }{{{{{{\rm{d}}}}}}t}=-\frac{\beta {\sigma }^{2}}{2}\frac{\partial U({{{{{\bf{x}}}}}})}{\partial {x}_{i}},$$
(43)

which is equivalent to simple gradient descent on the loss function.

The case β = ∞

When β = ∞ we only accept mutations that do not increase the loss function. To treat this case we return to (26) and take the limit β → ∞:

$${A}_{i}({{{{{\bf{x}}}}}})=\int {{{{{\rm{d}}}}}}{{{{{\boldsymbol{\epsilon }}}}}}\ p({{{{{\boldsymbol{\epsilon }}}}}}){\epsilon }_{i}{{\Theta }}(-{{{{{\boldsymbol{\epsilon }}}}}}\cdot \nabla U({{{{{\bf{x}}}}}})).$$
(44)

We can make progress by introducing the integral form of the Θ function (see e.g., ref. 46),

$${{\Theta }}(-{{{{{\boldsymbol{\epsilon }}}}}}\cdot \nabla U({{{{{\bf{x}}}}}})) =\int\nolimits_{-\infty }^{0}{{{{{\rm{d}}}}}}z\ \delta (z-{{{{{\boldsymbol{\epsilon }}}}}}\cdot \nabla U({{{{{\bf{x}}}}}}))\\ =\int\nolimits_{-\infty }^{0}\frac{{{{{{\rm{d}}}}}}z}{2\pi }\int\nolimits_{-\infty }^{\infty }{{{{{\rm{d}}}}}}\omega \ {{{{{{\rm{e}}}}}}}^{{{{{{\rm{i}}}}}}\omega (z-{{{{{\boldsymbol{\epsilon }}}}}}\cdot \nabla U({{{{{\bf{x}}}}}}))}.$$
(45)

Then (44) reads

$${A}_{i}({{{{{\bf{x}}}}}})=\int\nolimits_{-\infty }^{0}\frac{{{{{{\rm{d}}}}}}z}{2\pi }\int\nolimits_{-\infty }^{\infty }{{{{{\rm{d}}}}}}\omega \ {{{{{{\rm{e}}}}}}}^{{{{{{\rm{i}}}}}}\omega z}\ {G}_{i}^{(1)}\mathop{\prod}\limits_{j\ne i}{G}_{j}^{(0)},$$
(46)

where the symbols

$${G}_{j}^{(n)}\equiv \frac{1}{\sqrt{2\pi {\sigma }^{2}}}\int\nolimits_{-\infty }^{\infty }{{{{{\rm{d}}}}}}{\epsilon }_{j}\ {\epsilon }_{j}^{n}\ {{{{{{\rm{e}}}}}}}^{-{\epsilon }_{j}^{2}/(2{\sigma }^{2})-{{{{{\rm{i}}}}}}\omega {\epsilon }_{j}{\partial }_{j}U}$$
(47)

are standard Gaussian integrals. Upon evaluating them as

$${G}_{j}^{(0)}={{{{{{\rm{e}}}}}}}^{-\frac{1}{2}{\sigma }^{2}{\omega }^{2}{({\partial }_{j}U)}^{2}}$$
(48)

and

$${G}_{i}^{(1)}=-{{{{{\rm{i}}}}}}\omega {\sigma }^{2}({\partial }_{i}U){{{{{{\rm{e}}}}}}}^{-\frac{1}{2}{\sigma }^{2}{\omega }^{2}{({\partial }_{i}U)}^{2}},$$
(49)

(46) reads

$${A}_{i}({{{{{\bf{x}}}}}}) =-\int\nolimits_{-\infty }^{0}\frac{{{{{{\rm{d}}}}}}z}{2\pi }\int\nolimits_{-\infty }^{\infty }{{{{{\rm{d}}}}}}\omega \ {{{{{\rm{i}}}}}}{\sigma }^{2}\omega ({\partial }_{i}U){{{{{{\rm{e}}}}}}}^{-\frac{1}{2}{\omega }^{2}{\sigma }^{2}| \nabla U{| }^{2}+{{{{{\rm{i}}}}}}\omega z}\\ =\int\nolimits_{-\infty }^{0}\frac{{{{{{\rm{d}}}}}}z}{2\pi }\sqrt{2\pi }\frac{{\partial }_{i}U}{\sigma | \nabla U{| }^{3}}z{{{{{{\rm{e}}}}}}}^{-{z}^{2}/(2{\sigma }^{2}| \nabla U{| }^{2})}$$
(50)
$$=-\frac{\sigma }{\sqrt{2\pi }}\frac{{\partial }_{i}U({{{{{\bf{x}}}}}})}{| \nabla U({{{{{\bf{x}}}}}})| }.\hskip50pt$$
(51)

The form (51) is similar to (34) in that it involves the derivative of the loss function U(x) with respect to xi, but contains an additional normalization term, \(| \nabla U({{{{{\bf{x}}}}}})| \). This term is sometimes introduced as a form of regularization in gradient-based methods25. Here, the form emerges naturally from the acceptance criterion under which any move is accepted if it does not increase the loss function: as a result, the length of the step taken does not depend strongly on the size of the gradient.

In the limit β → ∞, (27) reads

$${B}_{ij}({{{{{\bf{x}}}}}})=\int {{{{{\rm{d}}}}}}{{{{{\boldsymbol{\epsilon }}}}}}\ p({{{{{\boldsymbol{\epsilon }}}}}}){\epsilon }_{i}{\epsilon }_{j}{{\Theta }}(-{{{{{\boldsymbol{\epsilon }}}}}}\cdot \nabla U({{{{{\bf{x}}}}}}))$$
(52)
$$=\frac{1}{2}\int {{{{{\rm{d}}}}}}{{{{{\boldsymbol{\epsilon }}}}}}\ p({{{{{\boldsymbol{\epsilon }}}}}}){\epsilon }_{i}{\epsilon }_{j}$$
(53)
$$=\frac{1}{2}{\sigma }^{2}{\delta }_{ij},\hskip32pt$$
(54)

upon applying the symmetry arguments used to evaluate (32). Eq. (54) is half the value of the corresponding term for the case β ≠ ∞, Eq. (39), because one term corresponds to Brownian motion in unrestricted space, the other to Brownian motion on a half-space.

Inserting (51) and (54) into (25) gives a Fokker–Planck equation equivalent to the N Langevin equations

$$\frac{{{{{{\rm{d}}}}}}{x}_{i}}{{{{{{\rm{d}}}}}}t}=-\frac{\sigma }{\sqrt{2\pi }}\frac{1}{| \nabla U({{{{{\bf{x}}}}}})| }\frac{\partial U({{{{{\bf{x}}}}}})}{\partial {x}_{i}}+{\eta }_{i}(t)\;\;\;\; \forall i,$$
(55)

where η is a Gaussian white noise with zero mean and variance σ2/2:

$$\left\langle {\eta }_{i}(t)\right\rangle =0,\;\;\;\; \left\langle {\eta }_{i}(t){\eta }_{j}(t^{\prime} )\right\rangle =\frac{1}{2}{\sigma }^{2}{\delta }_{ij}\delta (t-t^{\prime} ).$$
(56)

As an aside, we briefly consider the case of non-isotropic mutations, for which the Gaussian random update applied to parameter i has its own variance \({\sigma }_{i}^{2}\), i.e., \({\epsilon }_{i} \sim {{{{{\mathcal{N}}}}}}(0,{\sigma }_{i}^{2})\) in step 2 of the procedure described in section “Neuroevolution”. In this case the derivation above is modified to have σ replaced by σi in (21). In the case of finite β the equations (41) and (42) retain their form with the replacement σ → σi. In the case of infinite β, (56) retains its form with the replacement σ → σi and (55) reads

$$\frac{{{{{{\rm{d}}}}}}{x}_{i}}{{{{{{\rm{d}}}}}}t}=-\frac{{\sigma }_{i}}{\sqrt{2\pi }}\frac{1}{| \tilde{\nabla }U({{{{{\bf{x}}}}}})| }{\tilde{\partial }}_{i}U({{{{{\bf{x}}}}}})+{\eta }_{i}(t),$$
(57)

with \({\tilde{\partial }}_{i}\equiv {\sigma }_{i}\frac{\partial }{\partial {x}_{i}}\) and \(| \tilde{\nabla }U| \equiv \sqrt{\mathop{\sum }\nolimits_{j = 1}^{N}{\left({\tilde{\partial }}_{j}U\right)}^{2}}\). Non-isotropic mutations are used in covariance matrix adaptation strategies47. Those schemes also evolve the step size parameter dynamically, and to model this more general case one must make σ a dynamical variable of the master equation, with update rules appropriate to the algorithm of interest.