CoolMomentum: a method for stochastic optimization by Langevin dynamics with simulated annealing

Deep learning applications require global optimization of non-convex objective functions, which have multiple local minima. The same problem often arises in physical simulations and may be resolved by the methods of Langevin dynamics with Simulated Annealing, a well-established approach to minimization of many-particle potentials. This analogy provides useful insights for non-convex stochastic optimization in machine learning. Here we find that integration of the discretized Langevin equation gives a coordinate updating rule equivalent to the well-known Momentum optimization algorithm. As our main result, we show that a gradual decrease of the momentum coefficient from an initial value close to unity down to zero is equivalent to the application of Simulated Annealing or, in physical terms, slow cooling. Making use of this observation, we propose CoolMomentum, a new stochastic optimization method. Applying CoolMomentum to the optimization of ResNet-20 on the CIFAR-10 dataset and EfficientNet-B0 on ImageNet, we demonstrate that it is able to achieve high accuracies.


Introduction
Machine learning applications have grown rapidly in recent years. Training of a machine learning model is performed by finding values of its parameters that optimize an objective function. Usually the number of parameters is large and the training dataset is massive. First-order stochastic optimization methods have proved to be the most appropriate in this case. To reduce computational costs, the gradient of the objective function with respect to the model parameters is computed on relatively small subsets of the training data, called mini-batches. The resulting value is an unbiased stochastic estimator of the true gradient and is used in stochastic gradient descent (SGD) methods.
Most theoretical works focus on convex optimization [17,14], but optimization of nonconvex objective functions is usually what is required. Empirically, several optimization algorithms, e.g. SGD with momentum, Adagrad, RMSProp, Adadelta and Adam, have been shown to be efficient for training artificial neural networks and optimizing nonconvex objective functions [2]. In the nonconvex setting, the objective function has multiple local minima, and the efficient algorithms rely on "hill climbing" heuristics. Currently, there is a large gap between mathematical theory and the heuristic stochastic optimization methods popular in machine learning.
There is a useful connection between multivariate optimization and statistical mechanics. In statistical mechanics the hill climbing heuristics is related to passing through energy barriers. Local energy minima are typical for molecular systems. Based on the detailed analogy between multivariate optimization and annealing in molecular systems, the Simulated Annealing method was proposed [8]. This nature-inspired optimization method takes its name and inspiration from annealing (slow cooling) in metallurgy and computational physics. Simulation of annealing can be used to find an approximation of the global minimum of a function U(x) of many variables. In statistical mechanics, this function is known as the potential energy U(x) of a molecular system. In order to apply Simulated Annealing, one needs a method for sampling from the Gibbs-Boltzmann distribution

    w_n = Z^{-1} exp(−U_n/T),   (1)

where T is a parameter called temperature and Z is a normalizing constant, Z = Σ_n exp(−U_n/T). The Gibbs distribution w_n gives the probability to find a system x in a state n with energy U_n = U(x). The mean of any quantity f(x) may be calculated by means of the Gibbs distribution, using the formula ⟨f⟩ = Σ_n w_n f_n. The Gibbs distribution is one of the most important formulas in statistical physics [9].
Classical methods for simulation of molecular systems are Markov chain Monte Carlo (MCMC), molecular dynamics (MD) and Langevin dynamics (LD). MD, LD and MCMC all lead to equilibrium averaged distributions in the limit of infinite time or number of steps. If simulation is performed at a constant temperature T, these methods may be used to generate samples from the distribution of Eq. (1). Simulated Annealing can be used with any of these methods, but instead of performing the simulation at a constant temperature T, the temperature should be decreased slowly. By performing the simulation first at high temperature and then gradually decreasing the temperature, states close to the global minimum of U(x) may be found. MCMC, MD and LD have different application areas. MD and LD are based on a numerical integration of the classical equation of motion. They simulate the dynamics of systems, based on the values of the gradient dU(x)/dx, which has to be computed at every step. MCMC does not require gradient information; only U(x) values are required to compute the Metropolis acceptance probability. MCMC methods may overcome energy barriers more efficiently, but they require special MCMC proposals, and there are no equally efficient proposals for different systems. If the values of dU(x)/dx are available, then MD and LD are the more straightforward methods.
The adaptation of MCMC and LD for optimization is a promising research direction [11]. MCMC methods are widely used in machine learning, but applications of Langevin dynamics to machine learning are only starting to appear [20,21,10]. In this paper, we propose to adapt the methods of molecular and Langevin dynamics to the problems of nonconvex optimization that appear in machine learning.

Molecular and Langevin Dynamics
Molecular and Langevin dynamics were proposed for simulation of molecular systems by integration of the classical equations of motion to generate a trajectory of the system of particles. Both methods operate with the classical equations of motion of N particles with coordinates x = (x_1, x_2, ..., x_N), velocities v = dx/dt and accelerations a = d²x/dt². Newton's equation of motion for a conservative system is given by

    m d²x/dt² = f(x) = −dU(x)/dx,   (2)

where m is the mass of the particles, f(x) is known as the force, and U(x) is the potential energy. The kinetic energy is given by

    E_k = m v²/2.   (3)

There are several integration schemes based on discretization of the differential equation Eq. (2), the Verlet and Velocity-Verlet algorithms being the most popular among them [16].
In conservative systems, described by Eq. (2), the sum of the potential and kinetic energies is conserved: E_k + U = const. The mean doubled kinetic energy per dimension per particle is a parameter called temperature. Here and below ⟨f⟩ = (1/t) ∫_0^t f(t′) dt′ denotes averaging over time or iterations. Often it is desirable to perform simulations at a given temperature, so that

    ⟨m v_i²⟩ = T,   (4)

where T is the desired temperature, a parameter of the simulation. In physical simulations, an algorithm or a rule which controls the temperature is conventionally called a thermostat.
If the molecules under consideration are allowed to exchange their kinetic energy with a medium, then their total energy is no longer conserved. In Langevin dynamics, two forces are added to the conservative force to account for the energy exchange with the medium: a friction force proportional to the velocity, with a friction coefficient γ ≥ 0, and a thermal white noise. These two forces play the role of the thermostat in LD. Explicitly, the Langevin dynamics may be described by the following equations [1,19,18,16]:

    m d²x/dt² = f(x) − mγ dx/dt + R(t),   (6)

where R(t) is a random uncorrelated force with zero mean and a temperature-dependent magnitude:

    ⟨R(t) R(t′)⟩ = 2mγT δ(t − t′),   (7)

δ(t − t′) being the Dirac delta function.
The magnitude of the friction γ determines the relative strength of the dissipation forces with respect to the conservative force f (x). If γ = 0, one only has conservative forces without energy dissipation and Eq. (6) reduces to Eq. (2).
Several discretization schemes for the Langevin equation have been proposed, e.g. a generalization of the Velocity-Verlet integrator to Langevin dynamics by Vanden-Eijnden and Ciccotti [19].
In the high friction limit, the acceleration term in the LHS of Eq. (6) may be neglected and one has

    mγ dx/dt = f(x) + R(t).   (8)

It is known as the overdamped Langevin equation. Its first-order integrator was proposed by Ermak and McCammon [16]:

    x_{t+∆t} = x_t + f(x_t) ∆t/(mγ) + ξ sqrt(2T∆t/(mγ)),   (9)

where ξ is a random Gaussian noise with zero mean and unit variance. The last term in the RHS of Eq. (9) results from the integral of the random force (7), ∫_0^{∆t} R(t′) dt′, known as the Wiener process. From Eq. (9) one can see that γ enters the denominators, which would result in infinitely large update steps if the friction γ were close to zero. Therefore, this integrator is appropriate for essentially high friction values only.
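As an illustration, the first-order integrator of Eq. (9) is simple to implement. The sketch below (our own illustration, not code from the cited works) samples a one-dimensional harmonic potential U(x) = x²/2 at a fixed temperature; after equilibration the positions should be distributed with variance close to T.

```python
import numpy as np

def ermak_mccammon_step(x, force, gamma, T, dt, rng, m=1.0):
    """One step of the first-order overdamped Langevin integrator, Eq. (9):
    x <- x + f(x)*dt/(m*gamma) + xi*sqrt(2*T*dt/(m*gamma))."""
    xi = rng.normal(size=np.shape(x))
    return x + force(x) * dt / (m * gamma) + xi * np.sqrt(2.0 * T * dt / (m * gamma))

# Sample a harmonic potential U(x) = x^2/2, force f(x) = -x, for an ensemble
# of particles; the stationary distribution should have variance close to T.
rng = np.random.default_rng(0)
T, gamma, dt = 0.5, 1.0, 0.01
x = np.zeros(1000)                # ensemble of 1000 particles
for _ in range(2000):             # long enough to reach equilibrium
    x = ermak_mccammon_step(x, lambda y: -y, gamma, T, dt, rng)
print(x.var())                    # should be close to T = 0.5
```

Note that the step size dt/(m·γ) plays exactly the role of a learning rate, which is the correspondence exploited in the next section.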

Optimization by Simulated Annealing for Machine Learning
Simulated Annealing (SA) is a well-established optimization technique for locating the global minimum of U(x) without getting trapped in local minima. Though originally SA was proposed as an extension of MCMC [8], it can be considered as an extension of either MCMC or molecular/Langevin dynamics (see Ch. 12.5 of Schlick [16]). In this paper we propose to adapt these methods to the problem of optimization in machine learning, which requires minimization of a function based on the values of its gradients. For instance, this function may be regarded as a loss, and the values of the gradient dU/dx may be computed by backpropagation.
To get an idea of the basics of Simulated Annealing, one can reason as follows. Consider a heavy ball moving in a one-dimensional potential well with multiple minima separated by barriers. The deepest of the minima is the global one; the others are local. Let the initial mean kinetic energy of the ball be high enough to overcome any energy barrier, so that the ball passes through all the minima on its quasiperiodic trajectory. According to Eq. (4), high kinetic energy corresponds to high temperature. Suppose now that the temperature (mean kinetic energy) is gradually decreased. This process has to be slow enough to ensure that the characteristic cooling time is much longer than the characteristic time of the quasiperiodic motion. In the course of this cooling, each higher-lying local minimum eventually becomes inaccessible as soon as the mean kinetic energy becomes less than the height of its energy barrier. Finally, when the mean kinetic energy becomes less than the barrier between the global and the first local minimum, the ball becomes localized in the global minimum. This consideration generalizes readily to multiple dimensions. Therefore, if the values of dU(x)/dx are available, Simulated Annealing in combination with molecular dynamics is a well-established method for locating the global minimum of a multivariate function U(x). It has proved to be particularly efficient for nonconvex functions. The value of the constant m may be selected arbitrarily; for simplicity we set m = 1 throughout. SA may be implemented using e.g. the Velocity-Verlet integrator and one of the thermostats [16]. The beauty of the SA described above is that it has theoretical guarantees to converge to the global minimum of a nonconvex function [3]. However, the convergence is guaranteed only in the limit of very slow cooling. In practice, the efficiency of SA depends on the annealing schedule, which has to be specified by the user.
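The heavy-ball picture above can be sketched in a few lines. The toy example below (our own illustration; the potential and the linear cooling schedule are arbitrary choices) anneals a single particle in a one-dimensional double well whose global minimum lies near x = −1, by integrating the Langevin equation (6) with a slowly decaying temperature.

```python
import numpy as np

def U(x):                                   # double well; global minimum near x = -1
    return (x**2 - 1.0)**2 + 0.3 * x

def grad_U(x):
    return 4.0 * x * (x**2 - 1.0) + 0.3

def anneal(x0=1.0, steps=20000, dt=0.01, gamma=1.0, T0=2.0, seed=0):
    """Langevin dynamics, Eq. (6) with m = 1, under a linear cooling schedule."""
    rng = np.random.default_rng(seed)
    x, v = x0, 0.0
    for n in range(steps):
        T = T0 * (1.0 - n / steps)                           # T goes slowly from T0 to 0
        R = rng.normal(0.0, np.sqrt(2.0 * gamma * T / dt))   # discretized noise, Eq. (7)
        v += (-grad_U(x) - gamma * v + R) * dt
        x += v * dt
    return x

x_final = anneal()
print(x_final, U(x_final))   # by the end the particle has settled inside a well
```

Starting from the shallower well at x0 = 1, the hot particle hops freely over the barrier; as T decreases it spends ever more time in the deeper basin and finally freezes there (for an unlucky seed or too fast a schedule it may freeze in the local minimum instead, which is exactly the slow-cooling caveat discussed above).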
If the training data is large, then it is computationally expensive to compute the loss and its gradient on the full training set. In this case stochastic optimization proves to be the only appropriate approach. In stochastic optimization, the values of the loss and its gradient are estimated approximately, on small subsets of the training data, called minibatches. If these minibatches are selected randomly from the training data, then the estimated values of the loss Û(x) and its gradient dÛ/dx are Monte Carlo approximations of their exact values. Stochastic Gradient Descent is the simplest optimization method and is the method of choice for many applications. Formally it may be written as

    x_{n+1} = x_n − lr · dÛ(x_n)/dx.   (10)

In Eq. (10) the constant lr is known as the learning rate, and dÛ/dx is the stochastic gradient. This equation can be compared with Eq. (9). Besides the thermal noise, there are only two differences between these equations: (I) f(x) = −dU/dx in (9) is the exact gradient, while dÛ/dx in (10) is the stochastic gradient, and (II) the discrete time variable t in Eq. (9) is replaced by the iteration number n, so that lr = ∆t/(mγ).
Though the Monte Carlo approximation dÛ/dx is a good unbiased approximation, it is still an approximation and contains noise. One can write

    dÛ/dx = dU/dx + R,   (11)

where R is an uncorrelated random noise with zero mean. If the size of the minibatch is large, or the gradient dÛ/dx is computed on the full training data set, then dÛ/dx = dU/dx and R = 0. In this case molecular dynamics in combination with simulated annealing is a well-established method for global optimization [16]. On the other hand, if the batch size is small, then the random noise R may be large. In this case Langevin dynamics in combination with simulated annealing may be adapted for global optimization [16].
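The noise decomposition of the stochastic gradient is easy to verify numerically. In the toy check below (our own illustration) the loss is a mean squared deviation over a synthetic 1-D dataset; the minibatch gradients scatter around the full-data gradient with zero-mean noise whose magnitude shrinks as the batch size grows.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(3.0, 1.0, size=10_000)       # synthetic 1-D "training set"

def full_grad(x):
    """Exact gradient of U(x) = mean over the data of (x - d)^2 / 2."""
    return x - data.mean()

def minibatch_grad(x, batch_size=32):
    """Stochastic gradient computed on a random minibatch."""
    batch = rng.choice(data, size=batch_size, replace=False)
    return x - batch.mean()

x = 0.0
g = np.array([minibatch_grad(x) for _ in range(2000)])
noise = g - full_grad(x)
print(noise.mean(), noise.std())   # near-zero mean; std scales as 1/sqrt(batch_size)
```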

Relation of the Langevin equation with Momentum optimizer
Setting m = 1 in the Langevin equation (6), defining the stochastic force f̂ = f + R and expressing the time derivatives in finite differences, one obtains:

    (x_{t+∆t} − 2x_t + x_{t−∆t})/∆t² = f̂(x_t) − γ (x_{t+∆t} − x_{t−∆t})/(2∆t).   (12)

From this it is straightforward to obtain the coordinate updating formula

    x_{t+∆t} = x_t + ρ (x_t − x_{t−∆t}) + lr · f̂(x_t),   (13)

with

    ρ = (1 − γ∆t/2)/(1 + γ∆t/2)   (14)

and

    lr = ∆t²/(1 + γ∆t/2) = (1 + ρ) ∆t²/2.   (15)
Eq. (13) is nothing else but the famous Momentum optimization algorithm [15], with ρ being the momentum coefficient and lr the learning rate constant.
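The mapping between the physical parameters (γ, ∆t) and the optimizer hyperparameters (ρ, lr) of Eqs. (14)-(15), together with its inverse (17), can be written down directly; in the sketch below (function names are ours, the numbers are arbitrary illustrative values):

```python
def langevin_to_momentum(gamma, dt):
    """Eqs. (14)-(15): friction gamma and time step dt -> momentum rho, learning rate lr."""
    rho = (1.0 - gamma * dt / 2.0) / (1.0 + gamma * dt / 2.0)
    lr = (1.0 + rho) * dt**2 / 2.0
    return rho, lr

def friction_from_momentum(rho, dt):
    """Inverse map, Eq. (17): gamma * dt = 2 * (1 - rho) / (1 + rho)."""
    return 2.0 * (1.0 - rho) / ((1.0 + rho) * dt)

rho, lr = langevin_to_momentum(gamma=0.5, dt=0.1)
print(rho, lr)                                # gamma = 0 would give rho = 1 (no dissipation)
print(friction_from_momentum(rho, dt=0.1))    # recovers gamma ~ 0.5
```

The round trip is exact algebraically, which is a quick consistency check of Eqs. (14) and (17).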
Due to the change to discrete variables and m = 1, Eq. (7) becomes:

    ⟨R_n²⟩ ∆t = 2γT.   (16)

Using Eq. (14) to obtain

    γ∆t = 2(1 − ρ)/(1 + ρ),   (17)

one can rewrite Eq. (16) as:

    ⟨R_n²⟩ ∆t² = 4T (1 − ρ)/(1 + ρ).   (18)

For many machine learning applications the optimal ρ value is in the range from 0.5 to 0.99. If ρ = 0, then Eq. (13) becomes equivalent to Eq. (10), the Langevin dynamics becomes overdamped, and the Momentum optimizer reduces to SGD.

Algorithm
In order to apply Simulated Annealing for optimization, one needs a thermostat to control the temperature. In addition, a temperature schedule (or cooling strategy) has to be specified by the user. From Eq. (16) one can see that, for ⟨R_n²⟩ ∆t = const, the product of the temperature and the friction coefficient stays constant: 2γT = const. Therefore, to decrease the temperature, one can increase the friction coefficient. Taking into account that the temperature does not enter Eq. (13) explicitly, while the friction coefficient does so via the momentum coefficient, Eq. (14), and the learning rate, Eq. (15), the advantage of this approach becomes evident. From Eqs. (14) and (17) one can see that ρ decreases from unity to zero as γ increases from zero to its maximal value 2/∆t, which corresponds to the overdamped regime. Therefore, in the application of the cooling strategy it is convenient to switch from increasing γ to decreasing ρ. The decreasing ρ schedule has to be specified by the user, and different ρ schedules may be used. A possible ρ schedule is given by

    ρ_n = 1 − (1 − ρ_0)/α^n.   (19)

If α = 1 then ρ_n = ρ_0, and if α < 1 then ρ_n is a decreasing function of n. In the Momentum optimizer the ρ_n value should be in the range from 0 to 1. Let S be the number of steps (usually S = number of epochs · steps per epoch). Then the algorithm we propose may be presented as the pseudocode given in Table 1.
Table 1: Algorithm "CoolMomentum"

    Require: x_0 (initial parameter vector)
    Require: ∆t (time step)
    Require: ρ_0 (initial ρ value; a good default is ρ_0 = 0.99)
    Require: α (cooling rate)
    Initialization: ∆x_0 = 0 (initialize the update vector)
    for n = 0..(S − 1) do:                     (loop over S iterations)
        f̂(x_n) = −dÛ/dx                       (compute the stochastic gradient)
        ρ_n = max(0, 1 − (1 − ρ_0)/α^n)       (slowly decrease ρ towards zero)
        lr_n = ∆t² · (1 + ρ_n)/2              (recalculate the learning rate)
        ∆x_{n+1} = ρ_n ∆x_n + f̂(x_n) · lr_n   (Momentum updating step)
        x_{n+1} = x_n + ∆x_{n+1}
    end do
    return x_S                                 (resulting parameters)

Compared with the classical Momentum optimizer, described by Eq. (13), this algorithm requires one additional hyperparameter α, which we call the "cooling rate". Every additional hyperparameter may be painful for machine learning applications. However, a good α value is easily computed. In Simulated Annealing the temperature should be slowly decreased down to some minimal value, and therefore the ρ value should be slowly decreased until ρ = 0. Setting ρ_S = 0 in Eq. (19), one obtains:

    α = (1 − ρ_0)^{1/S}.   (20)
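A minimal, framework-free sketch of Table 1 in plain Python/NumPy (function and variable names are ours) might look as follows; for the demonstration the "stochastic gradient" is just the exact gradient of a toy quadratic loss.

```python
import numpy as np

def coolmomentum(grad, x0, steps, dt2=1e-3, rho0=0.99):
    """CoolMomentum (Table 1): Momentum whose coefficient rho is annealed from
    rho0 down to 0 with cooling rate alpha = (1 - rho0)**(1/steps), Eq. (20)."""
    alpha = (1.0 - rho0) ** (1.0 / steps)
    x = np.asarray(x0, dtype=float)
    dx = np.zeros_like(x)
    for n in range(steps):
        rho = max(0.0, 1.0 - (1.0 - rho0) / alpha**n)   # cooling schedule, Eq. (19)
        lr = dt2 * (1.0 + rho) / 2.0                     # learning rate, Eq. (15)
        dx = rho * dx - lr * grad(x)                     # Momentum step, Eq. (13)
        x = x + dx
    return x

# Minimize the toy loss U(x) = |x - 3|^2 (exact gradient, no minibatch noise)
x_min = coolmomentum(grad=lambda x: 2.0 * (x - 3.0), x0=np.zeros(2), steps=2000)
print(x_min)   # should end up close to (3, 3)
```

Note that by construction ρ reaches exactly zero at the last step, so the final iterations are plain SGD with learning rate ∆t²/2, the overdamped regime of the Langevin picture.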

Evaluation
To evaluate our optimization method, we first study a multiclass classification problem. We trained a deep residual neural network [4], ResNet-20, on the CIFAR-10 dataset, with 50000 training images and 10000 testing ones, using the Adam [7], Momentum and CoolMomentum optimizers. This model has a complicated architecture and more than 270k trainable parameters, and is therefore a good model for checking the performance of optimization methods. In order to check the performance of the optimization methods on ResNet-20, for each epoch we compute the training loss on the training data set (50000 images) and the testing loss on the testing data set (10000 images), and compare the optimization results in Fig. 1 and Fig. 2, respectively. In Fig. 1a and Fig. 2a we present the training and testing loss, respectively, computed with the final parameter values x_S of each epoch. In Fig. 1b and Fig. 2b we calculated the training and testing loss with the parameter values x̄ = (1/S) Σ_{n=1}^S x_n, averaged over the latest epoch. This procedure is known as Polyak-Ruppert averaging [7,13].
To be sure that Simulated Annealing is applied properly, i.e. that the temperature is decreased slowly, one needs a method to calculate the temperature directly during the optimization process. This may be done by using Eq. (4), setting m = 1 and changing to discrete variables, to obtain:

    T · ∆t² = (1/(Size · S)) Σ_{n=1}^{S} Σ_{i=1}^{Size} (∆x_{i,n})²,   (21)

where Size is the number of trainable parameters of the model and S is the number of iterations per epoch.
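In practice Eq. (21) amounts to accumulating the mean squared parameter update over an epoch. A possible sketch (our own helper, not code from the experiments):

```python
import numpy as np

def rescaled_temperature(updates):
    """Rescaled temperature T * dt^2 of Eq. (21): the mean of (delta x)^2 over
    all Size parameters and all S iterations of an epoch (m = 1, v = dx/dt)."""
    updates = np.asarray(updates)      # shape (S, Size): per-step parameter updates
    return float(np.mean(updates**2))

# Example: parameter updates shrinking between epochs imply a lower temperature
hot  = rescaled_temperature(np.full((100, 10), 0.2))
cold = rescaled_temperature(np.full((100, 10), 0.02))
print(hot, cold)   # the second set of updates is ~100x "colder"
```

Working with T·∆t² rather than T itself avoids needing the time step, which is convenient when comparing against optimizers such as Adam for which ∆t is not defined.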
In Fig. 3 we present the values of the rescaled temperature T · ∆t², calculated with Eq. (21), for all three optimizers being compared. We choose to calculate the rescaled temperature instead of the ordinary one because the actual value of the time step ∆t is unavailable for Adam. From Fig. 3 one can see that during the first epoch the temperature drops significantly for all three optimizers, but only in the case of CoolMomentum does it continue to decrease over further epochs, while for Adam and Momentum it stays near its minimal value. Therefore, CoolMomentum performs optimization in the Simulated Annealing regime: by slowly decreasing the temperature it samples the states of the Gibbs distribution (1), which approach the global minimum of the loss function. On the contrary, Adam and Momentum drop the temperature immediately down to its minimum. In metallurgy and physical simulations this cooling regime is called quenching. It produces a variety of non-equilibrium disordered structures, including different glasses. Similarly to physical systems, in this regime the trained model becomes caught in a local minimum of the loss function and continues to walk there, because the temperature is too low to overcome the local barrier.
During the first epochs the training and testing losses produced by CoolMomentum are higher than those of Momentum and Adam. Indeed, during the first epochs CoolMomentum yields temperature values significantly higher than Adam and Momentum do, and at high temperatures the Gibbs distribution (1) is less able to distinguish between states with high and low values of the loss function.
One can see that the loss values produced by CoolMomentum become smaller than those of Adam and Momentum starting from about the 40th epoch. From Fig. 1 it is clear that the optimization method that we propose significantly outperforms both Adam and Momentum.
The testing loss depends strongly on the regularization, the size of the training data, etc., and is given in Fig. 2. By comparing Fig. 2a and Fig. 2b, one can see that Polyak-Ruppert averaging decreases the testing loss. New methods to decrease the testing loss based on weight averaging were proposed recently [6,22], and these methods may benefit from the optimization method that we propose. We also trained LSTMs [5] for language modeling on the Penn Treebank dataset. Following Zhang et al. [22], we used the model setup of Merity et al. [12] and made use of their publicly available code. Merity et al. [12] used SGD with learning rate 30, and we used CoolMomentum with time step ∆t² = 0.1 and α calculated from Eq. (20) with ρ_0 = 0.99. The results are compared in Fig. 4. The model was trained for 500 epochs on a GTX 1080 Ti GPU. One can see that in this case CoolMomentum also outperforms SGD.

Conclusions
We have explored the relations between Langevin dynamics and the stochastic optimization methods popular in machine learning. The relation of underdamped Langevin dynamics to the Momentum optimizer was studied recently [10]. In this paper we combine Langevin dynamics with Simulated Annealing. To apply Simulated Annealing, the temperature should be decreased slowly down to some minimal value. This may be done by decreasing the learning rate: indeed, from Eq. (16) one can see that when the time step ∆t is decreased, the temperature T decreases proportionally. The drawback of this approach is that, when the learning rate is substantially decreased, the learning process almost stops. Alternatively, we propose to adapt Simulated Annealing by slowly decreasing the momentum coefficient of the Momentum optimizer, and propose a decreasing schedule for the values of this coefficient. In our scheme, at the minimal temperature the momentum coefficient becomes zero and the Langevin dynamics becomes overdamped. The performance of the proposed optimizer is demonstrated on two popular neural networks, ResNet and LSTM. The obtained results indicate that the combination of Langevin dynamics with Simulated Annealing is an efficient approach to gradient-based optimization of stochastic objective functions.