Correspondence between neuroevolution and gradient descent

We show analytically that training a neural network by conditioned stochastic mutation or neuroevolution of its weights is equivalent, in the limit of small mutations, to gradient descent on the loss function in the presence of Gaussian white noise. Averaged over independent realizations of the learning process, neuroevolution is equivalent to gradient descent on the loss function. We use numerical simulation to show that this correspondence can be observed for finite mutations, for shallow and deep neural networks. Our results provide a connection between two families of neural-network training methods that are usually considered to be fundamentally different.


I. INTRODUCTION
In broad terms there are two types of method used to train neural networks, divided according to whether or not they explicitly evaluate gradients of the loss function. Gradient-based methods include the backpropagation algorithm [1][2][3][4][5][6][7][8]. The non-gradient-based methods (sometimes called "black box" methods) include stochastic processes in which changes to a neural network are proposed and accepted with certain probabilities, and encompass Monte Carlo [9,10] and genetic algorithms [11][12][13]. Both gradient-based and non-gradient-based methods have been used to train neural networks for a variety of applications, and, where comparison exists, perform similarly well [14][15][16][17]. For instance, recent numerical work shows that stochastic evolutionary strategies applied to neural networks are competitive with explicit gradient-based methods on hard reinforcement-learning problems [16].
Gradient-based and non-gradient-based strategies are different in implementation and are sometimes thought of as entirely different approaches [18]. Here we show that the two sets of methods have a fundamental connection. We demonstrate analytically an equivalence between the dynamics of neural-network training under conditioned stochastic mutations and under gradient descent. This connection follows from one identified in the 1990s between the overdamped Langevin dynamics and Metropolis Monte Carlo dynamics of a particle in an external potential [19,20]: in the limit of small Monte Carlo trial moves, the two dynamics are equivalent. Similarly, we show here that a single copy of a neural network (a single individual) exposed to parameter mutations that are accepted probabilistically is equivalent, in the limit of small mutation size, to gradient descent on the loss function in the presence of Gaussian white noise. The details of the resulting dynamics depend on the details of the acceptance criterion, and encompass both standard and clipped gradient descent. Such a mutation scheme corresponds to the simple limit of the set of processes called "neuroevolution" [13,16,21-23]. This connection demonstrates explicitly that optimization without access to gradients can, nonetheless, enact noisy gradient descent on the loss function.
In simple gradient descent, equivalent to noise-free overdamped Langevin dynamics, the parameters (weights and biases) x of a neural network evolve with training time according to the prescription ẋ = −α∇U(x). Here α is a learning rate, and ∇U(x) is the gradient of a loss function U(x) with respect to the network parameters. Now consider a simple neuroevolution scheme in which we propose a mutation x → x + ε of all neural-network parameters, where ε is a set of independent Gaussian random numbers of zero mean and variance σ². Let us accept the proposal with the Metropolis probability min(1, e^{−βΔU}). Here β is a reciprocal temperature, and ΔU is the change of the loss function under the proposal. This is a Metropolis Monte Carlo algorithm, a Markovian dynamics that constitutes a form of importance sampling, and a common choice in the physics literature [9,10,24]. In physical systems, β is inversely proportional to the physical temperature, and we consider finite values of β in order to make contact with that literature. However, in the context of training a neural network it is interesting to consider the zero-temperature limit β = ∞, where mutations are accepted only if the loss does not increase. That regime is not normally considered in particle-based simulations.
Our main results can be summarized as follows. For finite β the weights of the network evolve, to leading order in σ, as ẋ = −(βσ²/2)∇U(x) plus Gaussian white noise. Averaged over independent realizations of the learning process, this form of neuroevolution is therefore equivalent to simple gradient descent, with learning rate βσ²/2. In the limit β = ∞, where mutations are accepted only if the loss function does not increase, weights under neuroevolution evolve instead as ẋ = −(σ/√(2π)) |∇U(x)|⁻¹ ∇U(x) plus Gaussian white noise, which corresponds to clipped gradient descent on U(x) [25]. Note that conditioning the acceptance of neural-network parameter mutations on the change of the loss function for a single copy of that network is sufficient to enact gradient descent: a population of individuals is not required.
In this paper we use the term "neuroevolution" to refer to a sequence of mutation steps applied to the parameters of a single copy of a neural network and accepted probabilistically. In general, neuroevolutionary algorithms encompass a broader variety of processes, including mutations of populations of communicating neural networks [16] and mutations of network topologies [21,26,27]. Similarly, the set of procedures for particles that can be described as "Monte Carlo algorithms" is large, and ranges from local moves of single particles (roughly equivalent to the procedure used here) to nonlocal moves and moves of collections of particles [24,28-31]. The dynamics of those collective-move Monte Carlo algorithms and of the more complicated neuroevolutionary schemes [21,26,27] do not correspond to simple gradient descent. Here, we demonstrate a correspondence between one member of this set of algorithms and gradient descent, the implication being that, given any potentially complicated set of neuroevolutionary methods, it is enough to add a simple mutation-acceptance protocol in order to ensure that gradient descent is also approximated. The neuroevolution-gradient descent correspondence is similar to the proofs that neural networks with enough hidden nodes can represent any smooth function [32]: it does not necessarily suggest how to solve a given problem, but it provides understanding of the limits and capacity of the tool and of its relation to other methods of learning.
Our work provides a rigorous connection between gradient descent and what is arguably the simplest form of neuroevolution. It complements studies that demonstrate a numerical similarity between gradient-based methods and population-based evolutionary methods [16,17], and studies that show analytically that the gradients approximated by those methods are, under certain conditions, equivalent to the finite-difference gradient [33-35].
The paper is structured as follows. We summarize the neuroevolution-gradient descent correspondence in Section II, and derive it in Section III. Our derivation uses ideas developed in Ref. [20] to treat physical particles, and applies them to neural networks: we consider a different set-up [36] and proposal rates, and we consider the limit β = ∞ that is rarely considered in the physics literature but is natural in the context of a neural network. We can associate the state x of the neural network with the position of a particle in a high-dimensional space, and the loss function U (x) with an external potential. The result is a rewriting of the correspondence between Langevin dynamics and Monte Carlo dynamics as a correspondence between the simplest forms of gradient descent and neuroevolution. Just as the Langevin-Monte Carlo correspondence provides a basis for understanding why Monte Carlo simulations of particles can approximate real dynamics [31,[37][38][39][40][41][42], so the neuroevolution-gradient descent correspondence shows how we can effectively perform gradient descent on the loss function without explicit calculation of gradients. The correspondence holds exactly only in the limit of vanishing mutation scale, but we use numerics to show in Section IV that it can be observed for neuroevolution done with finite mutations and gradient descent enacted with a finite timestep. We conclude in Section V.

II. SUMMARY OF MAIN RESULTS
In this section we summarize the main analytic results of this paper. These results are derived in Section III.
Consider a neural network with N parameters (weights and biases) x = {x 1 , . . . , x N }, and a loss U (x) that is a deterministic function of the network parameters. The form of the network does not enter the proof, and so the result applies to neural networks of any architecture (we shall illustrate this point numerically by considering both deep and shallow nets). The loss function may also depend upon other parameters, such as a set of training data, as in supervised learning, or a set of actions and states, as in reinforcement learning; the correspondence we shall describe applies regardless.

A. Gradient descent
Under the simplest form of gradient descent, the parameters x_i of the network evolve according to numerical integration of

ẋ_i = −α ∂U(x)/∂x_i.   (1)

Here time t measures the progress of training, and α is the learning rate [3-7].

B. Neuroevolution
Now consider training the network by neuroevolution, defined by the following Monte Carlo protocol.
1. Initialize the neural-network parameters x and calculate the loss function U(x). Set time t = 0.

2. Propose a mutation x → x + ε of all parameters, where ε = {ε_1, . . . , ε_N} is a set of independent Gaussian random numbers of zero mean and variance σ².

3. Accept the mutation with the Metropolis probability min(1, e^{−β[U(x+ε)−U(x)]}), and otherwise reject it. In the latter case we return to the original neural network. The parameter β can be considered to be a reciprocal evolutionary temperature.

4. Increment time t → t + 1, and return to step 2.
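The four steps above can be sketched in a few lines of code. This is a minimal illustration, not the networks studied in the paper: the quadratic loss and the parameter settings are illustrative stand-ins.

```python
import numpy as np

def neuroevolve(x0, loss, sigma=0.01, beta=np.inf, steps=1000, rng=None):
    """Single-individual neuroevolution: propose Gaussian mutations of all
    parameters, accept with the Metropolis probability min(1, e^{-beta*dU})."""
    rng = np.random.default_rng() if rng is None else rng
    x, U = x0.copy(), loss(x0)
    for _ in range(steps):
        eps = rng.normal(0.0, sigma, size=x.shape)   # step 2: propose a mutation
        dU = loss(x + eps) - U
        # step 3: beta = inf accepts only moves that do not increase the loss
        if dU <= 0 or (np.isfinite(beta) and rng.random() < np.exp(-beta * dU)):
            x, U = x + eps, U + dU
    return x, U

# toy quadratic loss standing in for a network's loss function U(x)
quadratic = lambda x: 0.5 * np.sum(x**2)
x_final, U_final = neuroevolve(np.ones(10), quadratic, sigma=0.05,
                               beta=np.inf, steps=2000,
                               rng=np.random.default_rng(0))
```

At β = ∞ the loss is non-increasing by construction, so the final loss is at most the initial one; for the quadratic stand-in it decays toward a floor set by the mutation scale.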
For finite β, and in the limit of small mutation scale σ, the parameters of the neural network evolve under this procedure according to the Langevin equation

ẋ_i = −(βσ²/2) ∂U(x)/∂x_i + ξ_i(t),   (3)

where ξ is a Gaussian white noise with zero mean and variance σ²:

⟨ξ_i(t)⟩ = 0,  ⟨ξ_i(t)ξ_j(t′)⟩ = σ² δ_ij δ(t − t′).   (4)

Eq. (3) describes an evolution of the neural-network parameters x_i that is equivalent to gradient descent with learning rate α = βσ²/2 in the presence of Gaussian white noise. Averaging over independent stochastic trajectories of the learning process (starting from identical initial conditions) gives

⟨ẋ_i⟩ = −(βσ²/2) ⟨∂U(x)/∂x_i⟩,   (5)

which has the same form as the gradient-descent equation (1). Thus, when averaged over many independent realizations of the learning process, the neuroevolution procedure 1-4, with finite β, is equivalent in the limit of small mutation scale to gradient descent on the loss function.
In the case β = ∞, where mutations are only accepted if the loss function does not increase, the parameters of the network evolve according to the Langevin equation

ẋ_i = −(σ/√(2π)) |∇U(x)|⁻¹ ∂U(x)/∂x_i + η_i(t),   (6)

where η is a Gaussian white noise with zero mean and variance σ²/2:

⟨η_i(t)⟩ = 0,  ⟨η_i(t)η_j(t′)⟩ = (σ²/2) δ_ij δ(t − t′).   (7)

The form (6) is different to (3), because the gradient in the first term is normalized by the factor

|∇U(x)| = [Σ_j (∂U(x)/∂x_j)²]^{1/2},

which serves as an effective coordinate-dependent rescaling (or vector normalization) of the timestep, but (6) nonetheless describes a form of gradient descent on the loss function U(x). Note that the drift term in (6) is of lower order in σ than the diffusion term (which is not the case for finite β). In the limit of small σ, (6) describes an effective process in which uphill moves in loss cannot be made, consistent with the stochastic process from which it is derived.
Averaged over independent realizations of the learning process, (6) reads

⟨ẋ_i⟩ = −(σ/√(2π)) ⟨|∇U(x)|⁻¹ ∂U(x)/∂x_i⟩.   (8)

The results (3) and (6) show that training a network by making random mutations of its parameters is, in the limit of small mutations, equivalent to noisy gradient descent on the loss function. Writing U̇(x) = ẋ · ∇U(x), using (3) and (6), and averaging over noise shows the evolution of the mean loss function under neuroevolution to obey, in the limit of small σ,

⟨U̇(x)⟩ = −(βσ²/2) ⟨|∇U(x)|²⟩   (9)

for finite β, and

⟨U̇(x)⟩ = −(σ/√(2π)) ⟨|∇U(x)|⟩   (10)

for β = ∞, equivalent to evolution under the noise-free forms of gradient descent (5) and (8).
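The drift predicted by Eq. (3) can be checked directly by averaging the accepted mutation over many proposals. In the sketch below the quadratic loss is an illustrative stand-in (so ∇U(x) = x), and the Metropolis acceptance probability is used as a weight rather than sampled, which reduces the variance of the estimate.

```python
import numpy as np

# Monte Carlo check of the finite-beta drift: for small sigma the mean
# parameter change per neuroevolution step is -(beta*sigma^2/2) * grad U(x).
rng = np.random.default_rng(0)
x = np.array([0.5, -1.0])                      # current parameters
beta, sigma, M = 10.0, 0.005, 1_000_000

eps = rng.normal(0.0, sigma, size=(M, x.size))            # proposed mutations
dU = 0.5*np.sum((x + eps)**2, axis=1) - 0.5*np.sum(x**2)  # change in loss
w = np.minimum(1.0, np.exp(-beta * dU))                   # acceptance probability
drift = (eps * w[:, None]).mean(axis=0)                   # mean change per step

predicted = -(beta * sigma**2 / 2) * x                    # drift term of Eq. (3)
```

With these (small) values of βσ|∇U| the sampled drift agrees with the prediction to within statistical error, and points downhill in the loss.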
In the following section we derive the correspondence described here; in Section IV we show that it can be observed numerically for non-vanishing mutations and finite integration steps.

III. DERIVATION OF THE NEUROEVOLUTION-GRADIENT DESCENT CORRESPONDENCE
We start by considering the quantity P(x, t), the probability that a neural network has the set of parameters x at time t under a given stochastic protocol. The time evolution of this quantity is governed by the master equation [43,44], which in generic form reads

∂P(x, t)/∂t = ∫dx′ [W_{x−x′}(x′) P(x′, t) − W_{x′−x}(x) P(x, t)].   (11)

The two terms in (11) describe, respectively, gain and loss of the probability P(x, t) (note that the probability to have any set of parameters is conserved, i.e. ∫dx P(x, t) = 1). The symbol W_{x′−x}(x) (sometimes written W(x → x′)) quantifies the probability of moving from the set of parameters x to the set of parameters x + (x′ − x) = x′, and encodes the details of the stochastic protocol. For the neuroevolution procedure defined in Section II B we write (11) as

∂P(x, t)/∂t = ∫dε [W_ε(x − ε) P(x − ε, t) − W_ε(x) P(x, t)], with W_ε(x) = p(ε) min(1, e^{−β[U(x+ε)−U(x)]}).   (12)

Here ε denotes the set of random numbers (the "mutation") by which the neural-network parameters are changed; the integral ∫dε ≡ Π_{i=1}^{N} ∫dε_i; and W_ε(x) is the rate for going from the set of parameters x to the set of parameters x + ε. Eq. (12) contains two factors. The first,

p(ε) = Π_{i=1}^{N} (2πσ²)^{−1/2} e^{−ε_i²/(2σ²)},   (13)

quantifies the probability of proposing a set of Gaussian random numbers ε. The second factor, the Metropolis "min" function in (12), quantifies the probability of accepting the proposed move from x to x + ε; recall that U(x) is the loss function. We can pass from the master equation (11) to a Fokker-Planck equation by assuming a small mutation scale σ, and expanding the terms in (11) to second order in σ [43,44]. Thus

W_ε(x − ε) ≈ W_ε(x) − Σ_i ε_i ∂W_ε(x)/∂x_i + (1/2) Σ_{i,j} ε_i ε_j ∂²W_ε(x)/∂x_i∂x_j   (14)

and

P(x − ε, t) ≈ P(x, t) − Σ_i ε_i ∂P(x, t)/∂x_i + (1/2) Σ_{i,j} ε_i ε_j ∂²P(x, t)/∂x_i∂x_j.   (15)

Collecting terms resulting from the expansion gives

∂P(x, t)/∂t = −Σ_i ∂/∂x_i [P(x, t) ∫dε ε_i W_ε(x)] + (1/2) Σ_{i,j} ∂²/∂x_i∂x_j [P(x, t) ∫dε ε_i ε_j W_ε(x)].   (16)

Taking the integrals inside the sums, and noting that the off-diagonal second moments (i ≠ j) contribute only at higher order in σ, (16) reads

∂P(x, t)/∂t = −Σ_i ∂/∂x_i [A_i(x) P(x, t)] + (1/2) Σ_i ∂²/∂x_i² [B_i(x) P(x, t)],   (17)

where

A_i(x) = ∫dε ε_i W_ε(x)   (18)

and

B_i(x) = ∫dε ε_i² W_ε(x).   (19)

What remains is to calculate (18) and (19), which we do in different ways depending on the value of the evolutionary reciprocal temperature β.

A. Finite β
In this section we consider finite β, in which case we can evaluate (18) and (19) using the results of Refs. [19,20] (making small changes in order to account for differences in proposal rates between those papers and ours).
Eq. (18) can be evaluated as follows (writing U(x) = U for brevity):

A_i(x) = ∫dε p(ε) ε_i min(1, e^{−βε·∇U})   (20)
≈ ∫dε p(ε) ε_i [Θ(−ε·∇U) + Θ(ε·∇U)(1 − βε·∇U)]   (21)
= ∫dε p(ε) ε_i [Θ(−ε·∇U) + Θ(ε·∇U)] − β∫dε p(ε) ε_i (ε·∇U) Θ(ε·∇U)   (22)
= ∫dε p(ε) ε_i − β∫dε p(ε) ε_i (ε·∇U) Θ(ε·∇U).   (23)

In these expressions Θ(x) = 1 if x ≥ 0 and is zero otherwise. In going from (20) to (21) we have assumed that βε·∇U(x) is small. This condition cannot be met for β = ∞; that case is treated in Section III B. In going from (22) to (23) we have used the result Θ(x) + Θ(−x) = 1; the first integral in (23) then vanishes by symmetry. The second integral,

A_i(x) = −β∫dε p(ε) ε_i (ε·∇U) Θ(ε·∇U),   (24)

can be turned into

A_i(x) = −(β/2)∫dε p(ε) ε_i (ε·∇U)   (25)

using the symmetry arguments given in Ref. [20], which we motivate as follows. Upon a change of sign of the integration variables, ε → −ε, the value of the integral in (24) is unchanged and it is brought to a form that is identical except for a change of sign of the argument of the Θ function. Adding the two forms of the integral removes the Θ functions, giving the form shown in (25), and dividing by 2 restores the value of the original integral. (25) can be evaluated using standard results of Gaussian integrals, giving

A_i(x) = −(βσ²/2) ∂U(x)/∂x_i.   (26)

Eq. (19) can be evaluated in a similar way:

B_i(x) = ∫dε p(ε) ε_i² min(1, e^{−βε·∇U})   (27)
≈ ∫dε p(ε) ε_i² [Θ(−ε·∇U) + Θ(ε·∇U)(1 − βε·∇U)]   (28)
= ∫dε p(ε) ε_i² − β∫dε p(ε) ε_i² (ε·∇U) Θ(ε·∇U)   (29)
≈ ∫dε p(ε) ε_i²   (30)
= σ².   (31)

The ≈ sign in (30) indicates that we have omitted terms of order σ³.
Inserting (26) and (31) into (17) gives us, to second order in σ, the Fokker-Planck equation

∂P(x, t)/∂t = (βσ²/2) Σ_i ∂/∂x_i [∂U(x)/∂x_i P(x, t)] + (σ²/2) Σ_i ∂²P(x, t)/∂x_i².   (32)

This equation is equivalent [45] to the N Langevin equations [43,44]

ẋ_i = −(βσ²/2) ∂U(x)/∂x_i + ξ_i(t),   (33)

where ξ is a Gaussian white noise with zero mean and variance σ²:

⟨ξ_i(t)⟩ = 0,  ⟨ξ_i(t)ξ_j(t′)⟩ = σ² δ_ij δ(t − t′).   (34)

Eq. (33) describes an evolution of the neural-network parameters x_i that is equivalent to gradient descent with learning rate α = βσ²/2, plus Gaussian white noise. Averaging over independent stochastic trajectories of the learning process (starting from identical initial conditions) gives

⟨ẋ_i⟩ = −(βσ²/2) ⟨∂U(x)/∂x_i⟩,   (35)

which is equivalent to simple gradient descent on the loss function.
B. The case β = ∞

When β = ∞ we only accept mutations that do not increase the loss function. To treat this case we return to (18) and take the limit β → ∞:

A_i(x) = ∫dε p(ε) ε_i Θ(−ε·∇U(x)).   (36)

We can make progress by introducing the integral form of the Θ function (see e.g. [46]),

Θ(z) = lim_{κ→0⁺} (2πi)⁻¹ ∫_{−∞}^{∞} dω (ω − iκ)⁻¹ e^{iωz}.   (37)

Then (36) reads

A_i(x) = lim_{κ→0⁺} (2πi)⁻¹ ∫_{−∞}^{∞} dω (ω − iκ)⁻¹ J_i(ω) Π_{j≠i} K_j(ω),   (38)

where the symbols

K_j(ω) = ∫dε_j (2πσ²)^{−1/2} e^{−ε_j²/(2σ²)} e^{−iωε_j ∂U/∂x_j}   (39)

and

J_i(ω) = ∫dε_i (2πσ²)^{−1/2} e^{−ε_i²/(2σ²)} ε_i e^{−iωε_i ∂U/∂x_i}   (40)

are standard Gaussian integrals. Upon evaluating them as

K_j(ω) = e^{−ω²σ²(∂U/∂x_j)²/2}   (41)

and

J_i(ω) = −iωσ² (∂U/∂x_i) e^{−ω²σ²(∂U/∂x_i)²/2},   (42)

the remaining Gaussian integral over ω gives

A_i(x) = −(σ/√(2π)) |∇U(x)|⁻¹ ∂U(x)/∂x_i.   (43)

The form (43) is similar to (26) in that it involves the derivative of the loss function U(x) with respect to x_i, but contains an additional normalization term, |∇U|. This term is sometimes introduced as a form of regularization in gradient-based methods [25]. Here, the form emerges naturally from the acceptance criterion that sees any move accepted if the move does not increase the loss function: as a result, the length of the step taken does not depend strongly on the size of the gradient.
In the limit β → ∞, (19) reads

B_i(x) = ∫dε p(ε) ε_i² Θ(−ε·∇U(x))   (44)
= (1/2) ∫dε p(ε) ε_i²   (45)
= σ²/2,   (46)

upon applying the symmetry arguments used to evaluate (24). Eq. (46) is half the value of the corresponding term for the case of finite β, Eq. (31), because one term corresponds to Brownian motion in unrestricted space, the other to Brownian motion on a half-space.
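Both coefficients for the β = ∞ case can be verified by direct sampling: accept only loss-non-increasing mutations and measure the first and second moments of the accepted step. The quadratic loss below is an illustrative stand-in (so ∇U(x) = x).

```python
import numpy as np

# Monte Carlo check of the beta = infinity coefficients: the drift should
# follow the normalized gradient, -(sigma/sqrt(2*pi)) * grad U / |grad U|,
# and the diffusion coefficient should equal sigma^2/2, cf. Eqs. (43), (46).
rng = np.random.default_rng(1)
x = np.array([1.0, -2.0])
sigma, M = 0.01, 400_000

eps = rng.normal(0.0, sigma, size=(M, x.size))
dU = 0.5*np.sum((x + eps)**2, axis=1) - 0.5*np.sum(x**2)
accept = dU <= 0.0                                  # accept only downhill moves

A = (eps * accept[:, None]).mean(axis=0)            # sampled drift coefficient
B = (eps**2 * accept[:, None]).mean(axis=0)         # sampled diffusion coefficient

grad = x                                            # grad U for the quadratic
A_pred = -(sigma / np.sqrt(2*np.pi)) * grad / np.linalg.norm(grad)
B_pred = sigma**2 / 2
```

The sampled drift has length σ/√(2π) regardless of |∇U|, illustrating the "clipped" character of the β = ∞ dynamics.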
Inserting (43) and (46) into (17) gives a Fokker-Planck equation equivalent to the N Langevin equations

ẋ_i = −(σ/√(2π)) |∇U(x)|⁻¹ ∂U(x)/∂x_i + η_i(t),   (47)

where η is a Gaussian white noise with zero mean and variance σ²/2:

⟨η_i(t)⟩ = 0,  ⟨η_i(t)η_j(t′)⟩ = (σ²/2) δ_ij δ(t − t′).   (48)

As an aside, we briefly consider the case of nonisotropic mutations, for which the Gaussian random update applied to parameter i has its own variance σ_i², i.e. ε_i ∼ N(0, σ_i²) in step 2 of the procedure described in Section II B. In this case the derivation above is modified to have σ replaced by σ_i in (13). In the case of finite β the equations (33) and (34) retain their form with the replacement σ → σ_i. In the case of infinite β, (48) retains its form with the replacement σ → σ_i and (47) reads

ẋ_i = −(σ_i²/√(2π)) [Σ_j σ_j² (∂U(x)/∂x_j)²]^{−1/2} ∂U(x)/∂x_i + η_i(t).   (49)

Nonisotropic mutations are used in covariance matrix adaptation strategies [47]. Those schemes also evolve the step-size parameter dynamically, and to model this more general case one must make σ a dynamical variable of the master equation, with update rules appropriate to the algorithm of interest.

IV. NUMERICAL ILLUSTRATION OF THE NEUROEVOLUTION-GRADIENT DESCENT CORRESPONDENCE
In this section we demonstrate the neuroevolution-gradient descent correspondence numerically. We consider single-layer neural networks for the cases of infinite and finite β, and a deep network for the case of infinite β.
A. Shallow net, β = ∞

In order to observe the correspondence numerically, the neuroevolution mutation scale σ must be small enough that the correction terms neglected in the expansion leading to (17) and (47) are small. The required range of σ is difficult to know in advance, but is straightforward to determine empirically: below the relevant value of σ, the results of neuroevolution will be statistically similar when scaled in the manner described below.
We consider a simple supervised-learning problem in which we train a neural network to express the function f_0(θ) = sin(2πθ) on the interval θ ∈ [0, 1). We calculated the loss

U(x) = K⁻¹ Σ_{k=1}^{K} [f_x(θ_k) − f_0(θ_k)]²,   (50)

using K = 1000 points θ_k on the interval, where f_x(θ) is the output of a single-layer neural network with one input node, one output node, M = 30 tanh hidden nodes, and N = 3M parameters x_i. These parameters are initially chosen to be Gaussian random numbers with zero mean and variance σ_0² = 10⁻⁴. The correspondence is insensitive to the choice of initial conditions, and we shall show that it holds for different choices of initial network.
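The setup can be sketched as follows. The tanh hidden layer and the mean-squared form of the loss are assumptions consistent with the description above, and the packing of the N = 3M parameters into the vector x is a choice of this sketch.

```python
import numpy as np

M, K = 30, 1000                                # hidden nodes; grid points
theta = np.linspace(0.0, 1.0, K, endpoint=False)
f0 = np.sin(2*np.pi*theta)                     # target function on [0, 1)

def f_x(x, theta):
    """Single-layer net: one input -> M tanh hidden nodes -> one linear output.
    x packs the N = 3M parameters as (input weights, hidden biases, output weights)."""
    w1, b1, w2 = np.split(x, 3)
    return np.tanh(np.outer(theta, w1) + b1) @ w2

def loss(x):
    """Mean-squared error over the K grid points (assumed form of U(x))."""
    return np.mean((f_x(x, theta) - f0)**2)

rng = np.random.default_rng(2)
x0 = rng.normal(0.0, 1e-2, size=3*M)           # variance sigma_0^2 = 10^-4
```

With this small initialization the net function is close to zero, so the initial loss is close to the mean of sin²(2πθ), i.e. about 0.5.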
We performed gradient descent with learning rate α = 10⁻⁵. We chose the learning rate arbitrarily, and verified that the results of gradient-descent simulations were unchanged upon changing the learning rate by a factor of 10 or 1/10. We used Euler integration of the noise-free version of Eq. (6), updating all weights x_i at each timestep t_gd = 1, 2, . . . according to the prescription

x_i(t_gd + 1) = x_i(t_gd) − α |∇U(x)|⁻¹ ∂U(x)/∂x_i,   (52)

where

|∇U(x)| = [Σ_j (∂U(x)/∂x_j)²]^{1/2}   (53)

and the gradient is evaluated at x(t_gd). We did neuroevolution following the Monte Carlo procedure described in Section II B, in the limit β = ∞, i.e. we accepted only moves that did not increase the loss function. We chose the mutation scale

σ = √(2π) α λ,   (54)

where λ is a parameter. According to (6) and (52), this prescription sets the neuroevolution timescale t_evol to be a factor λ times that of the gradient-descent timescale. Thus one neuroevolution step corresponds to λ integration steps of the gradient-descent procedure. In figures, we compare gradient descent with neuroevolution as a function of the common (scaled) time t = αt_gd = αλt_evol. In Fig. 1(a) we show the evolution of four individual weights under neuroevolution (using mutation scale λ = 1/10) and gradient descent (weights are distinguishable because they always have the same initial values). The correspondence predicted analytically can be seen numerically: individual neuroevolution trajectories (gray) fluctuate around the gradient-descent result (black), and when averaged over individual trajectories the results of neuroevolution (green) approximate those of gradient descent. In Fig. 1(b) we show the individual and averaged values of the weights of neuro-evolved networks at time t = 10 compared to those of gradient descent. In general, the weights generated by averaging over neuroevolution trajectories approximate those of gradient descent, with some discrepancy seen in the values of the largest weights. In Fig. 1(c) we show the loss under neuroevolution and gradient descent. As predicted by (9), averaged neuroevolution and gradient descent are equivalent.
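The update and the step-size matching can be sketched as follows. The matching rule σ = √(2π)αλ is the choice obtained by equating the drift of Eq. (6), per neuroevolution step, to λ explicit-integration steps of size α; it is a sketch of the scaling described above, not a unique prescription.

```python
import numpy as np

def clipped_gd_step(x, grad, alpha=1e-5):
    """One Euler step of the noise-free Eq. (6): the gradient is normalized,
    so every step has length alpha regardless of the size of grad U."""
    return x - alpha * grad / np.linalg.norm(grad)

def mutation_scale(alpha, lam):
    """Neuroevolution mutation scale matched so that one mutation step
    corresponds to lam gradient-descent steps: sigma = sqrt(2*pi)*alpha*lam."""
    return np.sqrt(2.0 * np.pi) * alpha * lam
```

For instance, with α = 10⁻⁵ and λ = 1/10 this gives σ ≈ 2.5 × 10⁻⁶, and every clipped-gradient step has Euclidean length exactly α.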
In Fig. S1 we show similar quantities using a different choice of initial neural network; the correspondence between neuroevolution and gradient descent is again apparent.

In Fig. 2(a) we show the time evolution of a single weight of the network under gradient descent and neuroevolution, the latter for three sizes of mutation step σ. As σ increases, the size of the fluctuations of individual trajectories about the mean increases, as predicted by (6). As a result, more trajectories are required to estimate the average, and for a fixed number of trajectories (as used here) the estimated average becomes less precise. In addition, as σ increases, the assumptions underlying the correspondence derivation eventually break down, in which case the neuroevolution average will not converge to the gradient-descent result even as more trajectories are sampled.
In Fig. 2(b) we show the mean-squared difference of the parameter vector of the model under gradient descent and neuroevolution,

Δ(t) = N⁻¹ Σ_{i=1}^{N} [x_i^gradient(t) − x̄_i^evolution(t)]².   (55)

Here N is the number of network parameters; x_i^gradient(t) is the time evolution of neural-network parameter i under the gradient-descent equation Eq. (52); and x̄_i^evolution(t) is the mean value of neural-network parameter i over the ensemble of neuroevolution trajectories. The smaller the neuroevolution step size, the smaller is Δ(t), and the closer the neuroevolution-gradient descent correspondence.
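A minimal sketch of this comparison metric, the mean-squared per-parameter difference between the two dynamics:

```python
import numpy as np

def delta(x_gradient, x_evolution_mean):
    """Mean-squared per-parameter difference between gradient-descent
    parameters and trajectory-averaged neuroevolution parameters."""
    d = np.asarray(x_gradient, dtype=float) - np.asarray(x_evolution_mean, dtype=float)
    return float(np.mean(d**2))
```

Here `x_evolution_mean` would be obtained by averaging each parameter over the ensemble of neuroevolution trajectories at fixed scaled time t.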
In Fig. S2 we show the evolution with time of the loss for different mutation scales (the left-hand plot is a reproduction of Fig. 1(c)). The trend shown is similar to that of the weights in Fig. 2.

B. Deep net, β = ∞

One feature of the correspondence derivation is that the architecture of the neural network does not appear. As long as the loss U(x) is a deterministic function of the neural-network parameters x, correspondence between gradient descent and neuroevolution will be observed if the mutation scale is small enough. (It is likely that what constitutes "small enough" does depend on neural-network architecture, as well as on the problem under study. The required mutation scale can be determined empirically, even without access to gradient-descent results: when correspondence holds, the results of neuroevolution simulations will be statistically similar, when scaled as we have described.) To demonstrate invariance to architecture we repeat the comparison of Section IV A, now using a deep neural network (we train the net to reproduce the target function f_0(θ) = sin(πθ) on the interval θ ∈ [−1, 1]). The network has 8 fully-connected hidden layers, each 32 nodes wide, and 7489 total parameters. As before we use tanh activations on the hidden nodes, and have one input node and one output node.
In Fig. 4 we show all parameters of the deep net at training time t = 5, under the two dynamics. We show the results of gradient descent in black, and independent neuroevolution trajectories in gray. As predicted analytically, the neuroevolution results fall on either side of the gradient-descent result, and the network constructed by averaging over independent neuroevolution trajectories (green) is essentially identical to the network produced by gradient descent.
In Fig. 5 we illustrate the dynamics of learning and the scale of the loss function by showing a comparison between the target function f_0(θ) = sin(2πθ) and the net function f_x(θ). We show the latter at three different training times, for gradient descent and neuroevolution trajectories.

C. Shallow net, finite β
In this section we illustrate the gradient descent-neuroevolution correspondence for finite β. We consider the same supervised-learning problem as in Section IV A, and set the network width to 256 nodes. We did neuroevolution with the Metropolis acceptance rate with reciprocal temperature parameter β = 10³. This choice is arbitrary, but is representative of a wide range of finite values of β. Finite-temperature simulations are common in particle-based systems [24]. Here, temperature has no particular physical significance, but comparing simulations done at finite and infinite β makes the point that different choices of neuroevolution acceptance rate result in dynamics equivalent to different gradient-descent protocols.
We did gradient descent using the integration scheme

x_i(t_gd + 1) = x_i(t_gd) − α ∂U(x)/∂x_i,   (56)

where α = 10⁻⁴ is the learning rate. Comparing (3) and (56), we set the neuroevolution mutation scale to be

σ = (2αλ/β)^{1/2},   (57)

where λ is a parameter. Thus one neuroevolution step corresponds to λ integration steps of the gradient-descent procedure. In figures, we compare gradient descent with neuroevolution as a function of the common (scaled) time t = αt_gd = αλt_evol. Results are shown in Fig. 6, for step-size parameter λ = 1. In panel (a) we show the evolution with time of two of the weights of the network. The noise associated with neuroevolution at this value of β is considerable: individual trajectories (gray lines) bear little resemblance to the gradient-descent result (black line). However, the population average (green line, an average of n = 1000 trajectories) shows the expected correspondence. The correspondence is less precise than that shown in Section IV A because we use a larger effective step size and each trajectory is much noisier than its infinite-β counterpart.
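The finite-β scale matching can be sketched as follows; the formula follows from equating the drift coefficient βσ²/2 of Eq. (3), per neuroevolution step, to λ gradient-descent steps of learning rate α.

```python
import numpy as np

def mutation_scale_finite_beta(alpha, beta, lam):
    """Mutation scale for finite beta, from beta*sigma^2/2 = lam*alpha."""
    return np.sqrt(2.0 * alpha * lam / beta)

sigma = mutation_scale_finite_beta(alpha=1e-4, beta=1e3, lam=1.0)
```

With the values used here this gives σ ≈ 4.5 × 10⁻⁴, so that the effective learning rate βσ²/2 recovers αλ.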
In Fig. 6(b) we show the loss, with line colors corresponding to the quantities of panel (a). In addition, we show the loss of the average network produced by neuroevolution (red line), U(⟨x⟩), which, if correspondence holds, should be equal to ⟨U(x)⟩ (green line). The initial fast relaxation of the loss (the boxed region) shows a difference between gradient descent and averaged neuroevolution results; doing neuroevolution with smaller step-size parameter λ = 1/10 (inset) reduces this difference, as expected.
In panel (c) we show the parameter Δ, Eq. (55), a measure of the difference between the average network ⟨x⟩ produced by neuroevolution and the network produced by gradient descent, as a function of n, the number of trajectories included in the average. If correspondence holds, this quantity should vanish in the limit of large n; the observed trend is consistent with this behavior.
In Fig. S4 we compare a gradient-descent trajectory with a set of neuroevolution trajectories, periodically resetting the latter to the gradient-descent solution. The periodic resetting tests the correspondence for a range of initial conditions. The correspondence between gradient descent and the averaged neuroevolution trajectory is approximate (averages were taken over 152 trajectories, fewer than in Fig. 6) but apparent.

V. CONCLUSIONS
We have shown analytically that training a neural network by neuroevolution of its weights is equivalent, in the limit of small mutation scale, to noisy gradient descent on the loss function. Conditioning neuroevolution on the Metropolis acceptance criterion at finite evolutionary temperature is equivalent to a noisy version of simple gradient descent, while at infinite reciprocal evolutionary temperature the procedure is equivalent to clipped gradient descent on the loss function. Averaged over noise, the evolutionary procedures correspond to forms of gradient descent on the loss function. This correspondence is described by Equations (3), (5), (6) and (8).
Correspondence in the sense described above means that each neural-network parameter evolves in the same way as a function of time under the two dynamics. Correspondence implies that the convergence properties of the two methods are the same (see e.g. Fig. 3(b)) and that the neural networks produced by the two methods are the same (see e.g. Fig. 4). The generalization properties of those networks will then also be the same.
The correspondence is formally exact only in the limit of zero mutation scale, and holds approximately for small but finite mutations. It will fail when the assumptions underlying the derivation are violated, such as when the terms neglected in (14) and (15) are not small, or when the passage from (20) to (21) is not valid because the change β[U(x + ε) − U(x)] is not small. It is straightforward to determine empirically where correspondence holds, even without access to gradient-descent results: the results of neuroevolution, with time scaled as described, will be statistically similar when the mutation size is small enough. The time duration for which the correspondence holds increases with decreasing mutation scale (see e.g. Fig. 2(b)). We have shown here that the correspondence can be observed for a range of mutation scales, and for different neural-net architectures.
More generally, several dynamical regimes are contained within the neuroevolution master equation (11), according to the scale σ of mutations [48]: for vanishing σ, neuroevolution is formally noisy gradient descent on the loss function; for small but nonvanishing σ it approximates noisy gradient descent enacted by explicit integration with a finite timestep; for larger σ it enacts a dynamics different from gradient descent, but one that can still learn; and for sufficiently large σ the steps taken are too large for learning to occur on accessible timescales. An indication of these various regimes can be seen in Figs. 2 and S2. Separate from the question of its precise temporal evolution, the master equation (11) has a well-defined stationary distribution ρ₀(x). Requiring the brackets on the right-hand side of (11) to vanish ensures that P(x, t) → ρ₀(x) becomes independent of time. Inserting (12) into (11) and requiring normalization of ρ₀(x) reveals the stationary distribution to be the Boltzmann one, ρ₀(x) = e^{−βU(x)}/∫dx′ e^{−βU(x′)}. For finite β the neuroevolution procedure is ergodic, and this distribution will be sampled given sufficiently long simulation time. For β → ∞ we have ρ₀(x) → δ(U(x) − U₀), where U₀ is the global minimum of the loss; in this case the system is not ergodic (moves uphill in U(x) are not allowed) and there is no guarantee of reaching this minimum.
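The Boltzmann stationary distribution can be checked with a one-dimensional stand-in loss U(x) = x²/2 at finite β. Note that the mutation scale need not be small for the stationary distribution to be Boltzmann (smallness of σ is needed only for the correspondence of the dynamics), so a moderate σ can be used; the sampled variance should approach 1/β.

```python
import numpy as np

# Metropolis sampling of the 1D quadratic loss U(x) = x^2/2: for any mutation
# scale sigma the stationary distribution is exp(-beta*U)/Z, a Gaussian
# with variance 1/beta.
rng = np.random.default_rng(3)
beta, sigma, n_steps = 2.0, 1.0, 200_000
x, samples = 0.0, np.empty(n_steps)
for t in range(n_steps):
    xp = x + rng.normal(0.0, sigma)                    # proposed mutation
    if rng.random() < np.exp(-beta * (xp**2 - x**2) / 2):
        x = xp                                         # Metropolis acceptance
    samples[t] = x
var = samples[5000:].var()                             # discard burn-in
```

For β = 2 the sampled variance should be close to 0.5, consistent with the Boltzmann form quoted above.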
We have focused on the simple limit of the set of neuroevolution algorithms, namely a non-interacting population of neural networks that experience sequential probabilistic mutations of their parameters. We have illustrated the correspondence at the level of population averages, Equations (5) and (8). However, no communication between individuals is required, and each individual separately obeys the correspondence defined by Equations (3) and (6).
Our results are also relevant to population-based genetic algorithms in which members of the population are periodically reset to the identities of the individuals with lowest loss values [11][12][13]. For instance, when correspondence holds, individuals in the neuroevolution populations considered in this paper have an averaged loss equal to that of the corresponding gradient descent algorithm. Therefore, some individuals must have loss less than that of the corresponding gradient descent algorithm (see e.g. Fig. 1(a), and Fig. 3(b) for the case λ = 1/10). This observation indicates the potential for such methods to be competitive with gradient-descent algorithms.
The neuroevolution-gradient descent correspondence we have identified follows from that between the overdamped Langevin dynamics and Metropolis Monte Carlo dynamics of a particle in an external potential [19,20]. Our work therefore adds to the existing set of connections between machine learning and statistical mechanics [49,50], and continues a trend in machine learning of making use of old results: the stochastic and deterministic algorithms considered here come from the 1950s [9,10] and 1970s [1][2][3][4][5][6], and are connected by ideas developed in the 1990s [19,20].