Abstract
We numerically test an optimization method for deep neural networks (DNNs) that uses quantum fluctuations, inspired by quantum annealing. For efficient optimization, our method exploits the quantum tunneling effect to pass through potential barriers. The path-integral formulation of the DNN optimization generates an attracting force that simulates the quantum tunneling effect. In the standard quantum annealing method, the quantum fluctuations vanish at the last stage of optimization. In this study, we propose a learning protocol that keeps the quantum fluctuation strength at a finite value to obtain higher generalization performance, which is a type of robustness. We demonstrate the performance of our method using two well-known open datasets: the MNIST dataset and the Olivetti face dataset. Although computational costs prevent us from testing our method on large datasets with high-dimensional data, the results show that our method can enhance generalization performance by introducing a finite value of quantum fluctuations.
Introduction
Data-driven approaches are being widely adopted in many science and engineering fields. The key technology is machine learning, which is supported by successful examples of the use of deep neural networks (DNNs)^{1}. Deep neural networks have achieved state-of-the-art results in a wide variety of tasks, including computer vision, natural language processing, and reinforcement learning^{2}. The revolutionary event in which artificial intelligence bested a human at a game of Go exemplifies the potential power of machine learning. In DNNs, iterated linear and nonlinear transformations construct a pattern-recognition system that acts as a feature extractor, mapping the raw data (such as the pixel values of natural image data) into a nontrivial internal representation or feature vector. The extracted features enable us to classify the different patterns in the input data.
To promote DNN technology, various researchers have developed learning algorithms that provide faster results and better performance. The algorithms for optimizing DNNs are based on stochastic gradient descent^{3,4,5}, which partitions a large dataset into several batches and approximates the gradient of the cost function. The standard choice among the various algorithms stemming from the stochastic gradient method is the adaptive moment estimation (Adam) algorithm^{6}. This algorithm is designed to efficiently escape the saddle points that often appear in the cost functions of DNNs. In practice, however, the learning of DNNs suffers from local minima with different generalization performance, which results from the shape of the DNN cost functions. A sharp minimizer has poorer generalization performance than a wide, flat minimizer. It is thus important to design a learning algorithm that finds a better solution by escaping from both the saddle points and the local minima. A recent study^{7} showed that the batch size is closely related to the generalization performance, which is characterized by the shape of the local minima. The authors experimentally demonstrated that the large-batch stochastic gradient method and its variants tend to converge to sharp minimizers with poor generalization performance. Small-batch stochastic gradient descent, on the other hand, is likely to fall into wider minimizers, in which the DNNs have high generalization performance. The batch size is closely related to the magnitude of the stochastic noise during learning. In other words, the injection of stochastic noise can be the basis of an efficient learning algorithm that converges to wider local minima. In addition, an analytical study on discrete-weight networks revealed subdominant solutions with relatively higher generalization performance than the exponentially dominant (typical) solutions, which deviate from the ground truth^{8,9}.
The subdominant solutions are algorithmically reachable by considering the effect of entropy. The method proposed in the literature^{10} computes the local entropy by injecting stochastic noise and updates the weights to take the DNN to wider local minima with better generalization performance.
The gradient descent algorithm is closely related to classical dynamics in physics, and its stochastic version is connected with Langevin dynamics, which models classical stochastic dynamics in various fields of nature. In the present study, we test the optimization of DNNs using quantum fluctuations as employed in quantum annealing (QA). Quantum annealing is being developed as a generic solver for optimization problems. This scheme was originally proposed as an algorithm that used numerical computations to optimize cost functions with discrete variables^{11}. The theoretical aspects of QA are well known. Its basic concept is derived from the quantum adiabatic theorem^{12,13,14}, and a successful experimental implementation of QA was realized using present-day technology^{15,16,17,18}. Since then, QA has developed rapidly and has attracted much attention. Several protocols based on QA do not adhere to adiabatic quantum computation or maintain the system in the ground state; rather, they employ a nonadiabatic counterpart^{19,20,21,22}. In addition, some studies have used more sophisticated quantum effects^{23,24,25}. Although the original proposal for QA was designed for optimization problems with discrete variables, as described in the form of a spin-glass Hamiltonian^{11}, the concept of QA can be generalized to a wider range of optimization problems, even those with continuous values. Most practical optimization problems, including those in machine learning, use continuous variables. One typical instance is the optimization problem for DNNs. Below, we apply the concept of QA to the DNN optimization problem. A previous study^{26} used the replica method, a sophisticated tool in statistical mechanics, to assess the potential efficiency of using quantum fluctuations to deal with nonconvex cost functions.
Although the analysis in the previous study discussed the learning of a discrete-weight neural network (binary variables, as in the Ising model), the essential features are not expected to differ for continuous-variable neural networks. As discussed in the previous study, the generalization performance attained by optimization with quantum fluctuations can be better than that attained without them. In the present study, we perform practical tests of the optimization of DNNs with quantum fluctuations and discuss its efficiency. Because the computational cost of simulating quantum dynamics is prohibitive, as shown below, our tests are restricted to relatively shallow networks. However, our approach is straightforward to apply to deeper networks.
The paper is organized as follows: The second section describes our method for optimizing DNNs. The following section demonstrates the method using three simple tasks. The last section discusses the feasibility of our method.
Methods
Quantum annealing for continuous variables
The optimization problem is interpreted as the minimization of the energy function (potential energy) V(w) in the context of physics. We address the optimization of the weights of DNNs below. The weights are denoted by \({\bf{w}}\in {{\mathbb{R}}}^{N}\). The standard gradient descent is given as the equation of motion for the overdamped system
where t is the update step. In the context of physics, this is regarded as a dynamical system in a low-temperature region. Considering the thermal effect characterized by the temperature T, the weights fluctuate following the Gibbs–Boltzmann distribution as
where Z is the partition function that acts as a normalization constant. In this case, instead of the equation of motion, a dynamical system with Langevin dynamics is adequate for description of the weights following the Gibbs–Boltzmann distribution as
This procedure is known as the stochastic gradient Langevin method^{27}, in which the learning rate decreases in the same manner as in simulated annealing (SA)^{28}. In QA, we introduce quantum fluctuations in addition to the energy function in the extremely low temperature limit T → 0 (β → ∞). We consider the following time-dependent Hamiltonian:
where \(\hat{{\bf{w}}}\) denotes the degrees of freedom and \(\hat{{\bf{p}}}\) represents the momentum, which satisfies the commutation relation \([\hat{{\bf{w}}},\,\hat{{\bf{p}}}]=i\hbar \). In addition, ρ(t) represents the mass of the weights and increases from 0 to ∞ over the course of the QA process. Following the ideas of quantum mechanics, the fluctuations of the weights are characterized by the following density matrix, rather than directly by a distribution function:
where \(Z={\rm{Tr}}(\exp (-\beta \hat{H}(t)))\). To specify the probability distribution of the realized configuration of the weights, we compute the matrix elements as
where \(\hat{{\bf{w}}}|{\bf{w}}\rangle ={\bf{w}}|{\bf{w}}\rangle \). However, the computation of the density matrix is intractable in general. We therefore employ the Suzuki–Trotter decomposition to reduce the operators to c-numbers by introducing M copies^{29} and obtain the following path-integral representation, as shown in the Appendix:
where \(\int {\mathscr{D}}{\bf{w}}={\prod }_{k=1}^{M-1}\int d{{\bf{w}}}_{k}\), M is the Trotter number, and k is the index of the replicated system. The boundary condition is set to w_{0} = w_{M} = w. The numerical implementation of the Suzuki–Trotter decomposition approximates the distribution function (7) by setting M to a finite number. For instance, in the quantum Monte Carlo simulation^{30}, the configuration of the degrees of freedom is sampled using the distribution function
in which the inverse temperature is taken to be β → ∞ with β/M being finite. In other words, the quantum Monte Carlo simulation deals with many replicated realizations or paths w_{ k }(t) with index k (imaginary time) following Langevin dynamics as
where T_{q} = M/β. The many DNN realizations interact with each other through the elastic term, which represents the quantum effect. The elastic term drives the DNN realizations toward a single condensed solution w^{*} when ρ(t) takes a relatively large value. By the boundary condition w_{0} = w_{M}, we have w^{*} = w. For simplicity, let us first consider the case with a large ρ(t). The path-integral formulation allows fluctuations around w^{*}. In other words, the action in the exponential function in P(w) has two terms: one is the cost function, which is what we originally want to optimize, and the other is the degree of condensation of the realizations. As shown in the Appendix, we find that w_{k} − w follows a Gaussian distribution with covariance βV_{kk′}(t). Thus, the approximated distribution function for a large ρ(t) is reduced to
Here, we set the minimizer of the (logarithm of) the distribution function in order to make analysis simpler.
where Mγ is a constant for maintaining this inequality. The minimizer on the right-hand side is the cost function appearing in the entropy stochastic gradient descent (ESGD) algorithm, which captures the wider local minima^{9}. To obtain the most probable weights w, we take the derivative of the minimizer of log P(w) with respect to w and obtain the following update equation:
where \(\langle \cdots \rangle \) takes the average of w′ in the integrand of (11). The average is directly intractable and is instead estimated by the following Langevin dynamics:
In the ESGD algorithm, γ(t) is a decreasing value, which vanishes at the completion of optimization. The time dependence of γ(t) is closely related to ρ(t), as described in the Appendix. In standard QA, we gradually increase ρ(t), and γ(t) increases accordingly. Thus, the ESGD algorithm is essentially different from the standard QA procedure. As stated by its authors, a “reverse annealing” method is considered in the literature^{9}.
Reverse annealing is now implemented in the current D-Wave machine and shows better optimization performance. A similar approach for increasing the performance is to search by inducing quantum fluctuations^{31}. In these cases, reverse annealing amounts to inducing quantum fluctuations, namely ρ(0) = ρ(T) = 0 while ρ(t) > 0.
Finite-value quantum annealing
As described in previous studies^{9,26}, there is a useful algorithm that exploits an entropic effect around a single condensed solution. This algorithm elucidates one aspect of the quantum effect, namely the entropy effect. In our study, we instead perform the direct optimization of the cost function that appears in the exponent of the probability distribution (8),
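The replicated Langevin dynamics described in this subsection can be illustrated with a small toy sketch. It is not the authors' implementation: the double-well potential, the scalar weight per replica, the periodic ring of replicas, and the names `grad_V`, `replica_langevin_step`, and `spread` are all assumptions made for illustration. Each replica follows the gradient of its own potential, is pulled toward its two neighbours by an elastic force of strength ρ, and receives Gaussian noise at temperature T_q.

```python
import math
import random


def grad_V(w):
    """Gradient of a toy double-well potential V(w) = (w**2 - 1)**2 (illustrative choice)."""
    return 4.0 * w * (w * w - 1.0)


def replica_langevin_step(ws, rho, eta, T_q, rng):
    """One Langevin update of M scalar replicas coupled on a periodic ring.

    Replica k feels -grad V(w_k), the elastic force
    -rho * (2 w_k - w_{k+1} - w_{k-1}), and Gaussian noise of variance 2 * eta * T_q.
    """
    M = len(ws)
    new_ws = []
    for k in range(M):
        elastic = 2.0 * ws[k] - ws[(k + 1) % M] - ws[(k - 1) % M]
        noise = math.sqrt(2.0 * eta * T_q) * rng.gauss(0.0, 1.0)
        new_ws.append(ws[k] - eta * (grad_V(ws[k]) + rho * elastic) + noise)
    return new_ws


def spread(ws):
    """Variance of the replicas around their mean; small values mean a condensed solution."""
    mean = sum(ws) / len(ws)
    return sum((w - mean) ** 2 for w in ws) / len(ws)
```

Iterating `replica_langevin_step` with a larger `rho` makes `spread` shrink faster, which mirrors the condensation of the replicas onto a single solution w^{*} discussed above; setting `T_q = 0` removes the noise and leaves coupled gradient descent.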
which involves nontrivial quantum tunneling stemming from nonperturbative effects. Here we assume β/M = 1 because we take β → ∞ and M → ∞. Thus, we must deal with M replicated systems for optimizing the DNNs. In this sense, our procedure is not reasonable for optimizing DNNs in practical applications. However, our trial may stimulate motivation for possible applications of the quantum computation. We report several simple DNN optimization tests to provide future perspectives in machine learning with respect to the quantum mechanics described below.
From this point forward, we do not focus on cases with a large ρ(t). We consider directly optimizing the cost function (8), but T → 0 in order to obtain only the quantum effect for simplicity, as
In addition, we consider a finite-value quantum annealing, in which the quantum fluctuation remains at the final stage of optimization. In standard QA, we gradually increase ρ(t) to obtain a single realization among many replicas. However, as discussed later, a moderate ρ(t) value is beneficial for obtaining improved generalization performance. When we do not consider the “quality” of the solution, the standard QA is one of the best choices. The theoretical assurance that the ideal QA reaches the optimal solution with the lowest cost function value is well established on the basis of the adiabatic theorem^{12}. However, as in the case of DNN optimization, the quality of the solution is measured on a different scale from the cost function itself, namely the generalization performance. Therefore, the standard QA method is not necessarily the best choice for the optimization of DNNs. For this reason, we inject a finite quantum fluctuation value to attain better generalization performance.
Here, we provide a simple schematic picture of how the finite-value QA attains improved generalization performance. For simplicity, we assume that a DNN loss function has two local minima: a sharp local minimum and a wide local minimum. Both minima have the same depth, as shown in Fig. 1.
In other words, the first term in the cost function (14) takes the same value in the two local minima. Let us here consider the solution favored by the standard QA. In standard QA, we increase ρ(t) to a very large value. When the optimization is successfully performed without entrapment in any saddle points or trivial local minima, we compare the two representative local minima of the cost function (14). When most of the realizations of the M-replicated DNNs are condensed in the sharp local minimum, the cost function (14) takes a smaller value than in the case of the wide local minimum. Thus, the successful result of the standard QA is absorbed into the sharp local minimum. In this sense, standard QA is not suitable for the optimization of DNNs. Instead, in finite-value QA, the final value of ρ(t) is set to be finite. Then, depending on the final value of ρ(t), the resultant solution is allowed to be absorbed into the wider local minimum of the loss function. In a previous study^{9}, γ(t) (similar to ρ(t)) is referred to as the scoping coefficient and is gradually decreased.
The remaining problem is that, in general, an adequate strength for the quantum fluctuation is not known a priori. We propose an adaptive approach for tuning the value of ρ(t) in the next subsection.
Quantum Adam
We hereafter take the loss function \(L({\mathscr{D}}|{\bf{w}})\) for a training dataset \({\mathscr{D}}\) as the energy function. The loss function measures the discrepancy between the ground truth labels t and the output y predicted by the network. The gradient of the loss function is computed using the backpropagation method^{32}. We here employ the stochastic gradient descent method by dividing the training dataset into M minibatches as \(\{{{\mathscr{D}}}_{1},\,{{\mathscr{D}}}_{2},\,\cdots ,\,{{\mathscr{D}}}_{M}\}\). This makes it convenient to process a large amount of training data and mitigates the computational cost of the gradient. We then distribute the minibatches to the Trotter slices k. Following the standard prescription of the Suzuki–Trotter decomposition, we should use the same energy function on each Trotter slice. However, to introduce stochastic ingredients over the M-replicated DNNs for efficient learning, we employ the loss function \(L({{\mathscr{D}}}_{k}|{{\bf{w}}}_{k})\) on each Trotter slice k. Thus, we divide the training dataset into M minibatches, where M is the number of Trotter slices. We then sweep all the minibatches over each Trotter slice in an epoch. The minibatches are randomly shuffled in each epoch.
We here assume that our procedure is employed in practice in a parallel computing environment. In current machine learning practice, parallel computing is sometimes employed for learning on very large datasets. As in our case, the elastic term \(\rho {\Vert {{\bf{w}}}_{k}-{{\bf{w}}}^{\ast }\Vert }_{2}^{2}\) has been used in parallel computing environments^{33}. Another study^{34} prepared a master holding w and updated it by summing over the gradients obtained by slaves holding w_{k}.
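The distribution of minibatches over Trotter slices described above can be sketched as follows. The text only states that all minibatches are swept over each slice within an epoch and reshuffled every epoch; the cyclic-shift scheme used here, and the helper names `make_minibatches` and `epoch_schedule`, are assumptions of this sketch rather than the authors' procedure.

```python
import random


def make_minibatches(n_samples, M, rng):
    """Shuffle sample indices and split them into M minibatches of nearly equal size."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[k::M] for k in range(M)]


def epoch_schedule(M, rng):
    """Yield, for each of the M steps of one epoch, the minibatch index used by each slice.

    The yielded list `assignment` satisfies: slice k trains on minibatch assignment[k].
    A cyclic shift of a random permutation guarantees that, within the epoch, every
    slice visits every minibatch exactly once and no two slices share a minibatch
    at the same step.
    """
    perm = list(range(M))
    rng.shuffle(perm)
    for step in range(M):
        yield [perm[(k + step) % M] for k in range(M)]
```

Calling `make_minibatches` and `epoch_schedule` once per epoch with the same `random.Random` instance gives a fresh shuffle each epoch, in the spirit of the random reshuffling mentioned in the text.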
We now address the remaining problem of determining the magnitude of the coefficient ρ(t) of the elastic term. We exploit the idea of the Adam method, which is often implemented in DNN optimization^{6}, to adaptively change the coefficient. It accelerates the update when the gradient tends to shrink around the saddle point. In Adam, instead of the standard gradient descent method (1),
where \(\tilde{{\bf{m}}}(t)={\bf{m}}(t)/(1-{\beta }_{1}^{t})\), \(\tilde{{\bf{v}}}(t)={\bf{v}}(t)/(1-{\beta }_{2}^{t})\), and
Here, g(t) is the gradient of the loss function. The hyperparameters β_{1} and β_{2} are chosen a priori. The quantity ε avoids accidental division by zero. The product \(\odot \) and the division between vectors are performed in a component-wise manner. During update iterations, the magnitude of the gradient becomes small around a saddle point. Then, v(t) becomes a vector with small-valued elements. The coefficient \(\eta /(\sqrt{\tilde{{\bf{v}}}(t)}+\varepsilon )\) of the effective gradient \(\tilde{{\bf{m}}}(t)\) is then increased. The updates are thus performed efficiently, even around the saddle point. This is a rough sketch of the learning acceleration provided by Adam.
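As a concrete reference for the description above, a single Adam update can be written in a few lines. This is a minimal sketch of the standard Adam rule with plain Python lists; the function name `adam_step` and the default values of `eta` and `eps` are assumptions (the text only fixes β_{1} = 0.9 and β_{2} = 0.999 later on).

```python
import math


def adam_step(w, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return updated (w, m, v) after one Adam step at iteration t >= 1.

    All vectors are plain lists and every operation is component-wise.
    """
    # Exponential moving averages of the gradient and of its square.
    m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
    # Bias-corrected estimates m~(t) and v~(t).
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
    v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
    # Effective step: eta / (sqrt(v~) + eps) times m~.
    w = [wi - eta * mh / (math.sqrt(vh) + eps)
         for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v
```

Because the step is normalized by \(\sqrt{\tilde{{\bf{v}}}(t)}\), a small but persistent gradient near a saddle point still produces a step of size close to `eta`, which is the acceleration mechanism described in the text.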
For tuning ρ(t), we employ a technique similar to one in Adam, in which the coefficient of the effective gradient is adaptively changed as follows:
where \({\tilde{{\bf{m}}}}_{k}(t)\) and \({\tilde{{\bf{v}}}}_{k}(t)\) are obtained in the same manner as in Adam, and \({\tilde{{\bf{m}}}}_{k}^{q}(t)={{\bf{m}}}_{k}^{q}(t)/(1-{\alpha }_{1}^{t})\), \({\tilde{{\bf{v}}}}_{k}^{q}(t)={{\bf{v}}}_{k}^{q}(t)/(1-{\alpha }_{2}^{t})\) and
Here, \({{\bf{g}}}_{k}^{q}(t)=2{{\bf{w}}}_{k}(t)-{{\bf{w}}}_{k+1}(t)-{{\bf{w}}}_{k-1}(t)\). As in Adam, the hyperparameters α_{1} and α_{2} are set a priori. The above update rule adequately tunes the elastic term. That is, the coefficient is effectively tuned as \(\rho (t)\to \rho /(\sqrt{{\tilde{{\bf{v}}}}_{k}^{q}(t)}+\varepsilon )\).
Following the standard QA, the weights are randomly initialized in order to search for good candidates for the optimal solution over a relatively wide range. In other words, in the initial stage of optimization, the weights associated with the different Trotter slices deviate. Owing to the elastic term, the discrepancies between Trotter slices begin to lessen after several iterations. In other words, the tunneling effect gradually decays, and the effective coefficient \(\rho /(\sqrt{{\tilde{{\bf{v}}}}_{k}^{q}(t)}+\varepsilon )\) then increases to enhance the tunneling effect again. Therefore, the above update rule efficiently induces the tunneling effect without directly tuning the value of the mass ρ. We call the above update rule “quantum Adam” in the sense that we add the quantum effects stemming from \({{\bf{g}}}_{k}^{q}(t)\) while tuning the contribution of the effect during the learning. We emphasize that other gradient methods developed for machine learning, including AdaGrad^{35}, AdaDelta^{36}, RMSprop^{37}, and the Sum of Functions Optimizer^{38}, can be implemented in conjunction with the quantum effect in the same manner.
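The quantum Adam update described above can be sketched as follows. This is a hedged illustration, not the authors' code: the names `init_state` and `quantum_adam_step`, the periodic treatment of the neighbours k ± 1, the default values of `eta` and `eps`, and in particular the choice to multiply the elastic term by the learning rate `eta` are assumptions of this sketch. The elastic gradient is \({{\bf{g}}}_{k}^{q}=2{{\bf{w}}}_{k}-{{\bf{w}}}_{k+1}-{{\bf{w}}}_{k-1}\), and it receives its own Adam-style moment estimates with hyperparameters α_{1} and α_{2}.

```python
import math


def _adam_direction(g, m, v, t, b1, b2, eps):
    """Update moment estimates with gradient g; return (normalized direction, m, v)."""
    m = [b1 * mi + (1.0 - b1) * gi for mi, gi in zip(m, g)]
    v = [b2 * vi + (1.0 - b2) * gi * gi for vi, gi in zip(v, g)]
    direction = [(mi / (1.0 - b1 ** t)) / (math.sqrt(vi / (1.0 - b2 ** t)) + eps)
                 for mi, vi in zip(m, v)]
    return direction, m, v


def init_state(M, n):
    """Zero-initialized moment estimates for M replicas with n weights each."""
    return [{key: [0.0] * n for key in ("m", "v", "mq", "vq")} for _ in range(M)]


def quantum_adam_step(ws, grads, state, t, eta=1e-3, rho=1.0,
                      beta=(0.9, 0.999), alpha=(0.9, 0.999), eps=1e-8):
    """One update of all M replicas at iteration t >= 1.

    ws[k], grads[k]: weights and loss gradient of Trotter slice k (lists of floats).
    state[k]: moment estimates for slice k, updated in place.
    """
    M = len(ws)
    new_ws = []
    for k in range(M):
        nxt, prv = ws[(k + 1) % M], ws[(k - 1) % M]
        # Elastic (quantum) gradient g^q_k = 2 w_k - w_{k+1} - w_{k-1}.
        gq = [2.0 * a - b - c for a, b, c in zip(ws[k], nxt, prv)]
        s = state[k]
        d, s["m"], s["v"] = _adam_direction(grads[k], s["m"], s["v"], t, beta[0], beta[1], eps)
        dq, s["mq"], s["vq"] = _adam_direction(gq, s["mq"], s["vq"], t, alpha[0], alpha[1], eps)
        # Assumption: the learning rate eta scales both the loss and the elastic terms.
        new_ws.append([w - eta * (di + rho * dqi) for w, di, dqi in zip(ws[k], d, dq)])
    return new_ws
```

Since the elastic direction is divided by \(\sqrt{{\tilde{{\bf{v}}}}_{k}^{q}(t)}+\varepsilon \), the pull between neighbouring replicas is effectively strengthened once their discrepancy has become small, which corresponds to the adaptive tuning of ρ described in the text.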
In the following section, we demonstrate the effectiveness of quantum Adam by testing it against two datasets: the MNIST handwritten digit dataset^{39} and the Olivetti face image dataset^{40}; both are open datasets often used in benchmark tests for machine learning.
Results
In this section, we demonstrate the application of quantum Adam to DNNs by using well-known open datasets. Although the datasets used in the experiments contain data that are relatively easy to analyze, high computational costs are incurred when implementing the M-replicated DNNs required for quantum Adam. In this sense, the present study is simply a proof of concept.
For simplicity, we used ReLU as the activation function in the middle layers in all experiments. We used cross entropy as the cost function for classification and the mean-squared error for autoencoding in the results shown below. The weights are initialized with i.i.d. Gaussian samples with zero mean and standard deviation \(\sqrt{1/{N}_{l}}\), where N_{l} is the number of inputs for each layer l. We use the standard choice of α_{1} = β_{1} = 0.9 and α_{2} = β_{2} = 0.999. We set common initial conditions and performed M independent classical (standard) Adam runs and the quantum Adam test for comparison. We then assessed the generalization performance in terms of the average and minimum/maximum of the loss function/accuracy.
The first task was to classify the MNIST 8 × 8-pixel images of handwritten digits. We constructed an all-to-all single-layer neural network (NN) for classifying the handwritten digits. Figure 2 shows the accuracy on test data for classical and quantum Adam. We trained the NN by feeding it 500 data items and setting M = 500. We then measured the accuracy using 1297 data items. In this case, we set the coefficient ρ = 2.0. Both the average and the maximum accuracy indicate that quantum Adam is superior to classical Adam.
The second task was to construct an autoencoder, which recovers the original input as the output, using the MNIST 8 × 8-pixel images of handwritten digits. To encode the handwritten digits, we constructed two convolution layers with a filter size of three and an output of six channels. The middle layer has 96 nodes in this case. To decode the images, we constructed two deconvolution layers in an inverse manner. Figure 3 shows the loss function for the test data with classical and quantum Adam. We trained the NN by feeding it 100 data items and setting M = 100. We then measured the loss function for 1697 data items to determine the generalization performance. In this case, we set the coefficient ρ = 1.0. Both the average and the minimum of the loss function in the replicated systems indicate that quantum Adam is superior to classical Adam. However, this result might be accidental, as there were no significant improvements in the mean-squared error in several experiments.
The third task was to classify the Olivetti 64 × 64-pixel images of human faces. We constructed an all-to-all three-layer (4096-2048-1024-40) NN for classifying face images. Figure 4 shows the accuracy on the test data for classical and quantum Adam. We trained the NN by feeding it 200 data items and setting M = 40. We then determined the accuracy using 200 data items. In this case, we set the constant ρ = 1.0 and performed batch normalization at each layer. Both the average and the maximum accuracy indicate that quantum Adam is superior to classical Adam in the last stage of learning.
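The weight initialization stated above can be written out explicitly. The following is a small sketch under the assumption that "deviation" means standard deviation; the function name `init_layer` and the row-major list-of-lists layout are illustrative choices, not taken from the source.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Return an n_out x n_in weight matrix with i.i.d. Gaussian entries.

    Each entry has zero mean and standard deviation sqrt(1 / n_in),
    where n_in is the number of inputs to the layer (N_l in the text).
    """
    std = math.sqrt(1.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
```

To give the M replicas and the M independent classical runs common initial conditions, one option is to create each run's weights from `random.Random(seed)` with matching seeds.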
Discussion
We proposed quantum Adam, formulated through a path-integral representation, for the optimization of DNNs. The proposed algorithm generates an elastic term between different realizations of DNNs and can find a better solution, in terms of generalization performance, than classical Adam. The point is to control the quantum fluctuation by adaptively changing the coefficient and to favor wide, flat local minima by means of the entropy effect, as discussed in previous studies^{9,26}. In the present study, we directly optimize the M-replicated DNNs while dealing with the nonperturbative effect, which allows the quantum tunneling effect. Although relatively small datasets are used, we demonstrate better generalization performance by considering optimization with a finite quantum fluctuation strength. In this sense, our method does not conform to the standard QA method. The ideal QA might not be the best choice of learning algorithm for DNNs because the resultant solutions are absorbed into a sharp minimum. With recent developments in the manufacturing of microdevices, QA has been successfully implemented with superconducting qubits in the so-called quantum annealer. Several experiments have shown that the resultant solutions seem to fall into wide local minima^{41}. However, this is due to the freezing phenomenon in the quantum annealer, which is a problem particular to the quantum device. The resultant solutions are closely related to low-energy states with a certain value of quantum fluctuation, as pointed out in the literature^{42}. In other words, the output from the present version of the quantum annealer follows the Gibbs–Boltzmann distribution with a certain value of quantum fluctuations. In this sense, QA as performed in real experiments can be a choice of learning algorithm.
In addition, the current version of the quantum annealer, the D-Wave 2000Q, implements two optimization techniques that manipulate the value of the quantum fluctuation, namely quenching and reverse annealing. These two techniques will be available for efficiently attaining better generalization performance in real experiments, as discussed in the literature^{26}.
In the present study, we manipulate the optimization in classical computers. In addition, we select the strength of the quantum fluctuation by employing adaptive change inspired by the Adam method. The potential performance of quantum Adam emerges in cases with many Trotter numbers that correspond to the number of minibatches. When we use a small number of minibatches, quantum Adam does not work well. This is because most of the DNNs fall into the sharp minimizers. In addition, the ρ value should be tuned adequately. When we select a ρ value that is too high, the searching range will be narrow, whereas a ρ value that is too small will not lead to a condensed solution. We tested three different tasks to assess the performance of quantum Adam in comparison to classical Adam. The results demonstrate that quantum Adam can provide fairly good performance. We emphasize that the most important feature of quantum Adam should be its generalization performance. In machine learning, the purpose of improvements in learning is nothing more than enhancing generalization performance with limited epochs and computational resources. In quantum Adam, the elastic term aggregates DNNs while learning. This effect might work to prevent sudden falls into the valley. In other words, when most of the DNNs are in the wide minimizer, the others do not tend to fall into the sharp minimizer; this can lead to improved generalization performance.
In quantum Adam, we use M-replicated DNNs. In a sense, this seems excessive. However, when we process a large amount of data, we distribute each batch to a number of processors or GPUs and establish a consensus to obtain DNNs with high generalization performance. Our present method is too computationally expensive to implement in the ordinary environments used in a wide range of research efforts, although it might be useful for learning large datasets in parallel computing environments. In this sense, our algorithm might be helpful even on classical computers. In future research, we shall test quantum Adam in a parallel computing environment with a large dataset comprising high-dimensional components, and propose a simplified algorithm by elucidating the most significant part of the quantum fluctuations, as in previous studies^{9,26}.
We remark on the time complexity of quantum Adam. The standard assessment of the time complexity of QA can be performed by estimating the energy gap in the time-dependent Hamiltonian. In our case, through the Suzuki–Trotter decomposition, the problem is reduced to an optimization problem for a cost function with continuous variables. Regarding the rate of convergence to a minimum in the feasible set, the classical Adam method has a convergence rate of \(O(1/\sqrt{T})\), as shown in the literature^{6}. We believe that a similar analysis can also be performed for quantum Adam. In addition, we emphasize that the most important feature of quantum Adam is its generalization performance. In this sense, the present study opens a new aspect of QA: not pursuing the minimum of the cost function, but pursuing a different optimality measured by an indicator other than the cost function itself.
Finally, in the present study, we demonstrate the potential power of quantum fluctuations, as exploited in QA. Optimization with quantum fluctuations promotes the “quality” of the solution. The performance of an optimization solver is normally evaluated by the cost function itself. In particular, the performance of QA has been discussed in terms of the decrease in the cost function. However, robustness of the solution can be attained by optimizing the cost function in conjunction with the local entropy, as discussed in the literature^{9,26}. Optimization with quantum fluctuations can automatically lead to robustness of the solution, as discussed in the present study. In the context of machine learning, the generalization performance is the robustness of the solution. In the future, a deeper understanding of quantum fluctuations should promote various approaches in machine learning and beyond.
Path integral representation
Using the Suzuki–Trotter decomposition, we formulate the path-integral representation. Let us start from the following expression of the Suzuki–Trotter decomposition:
We insert the summation over the complete sets \(\int d{{\bf{w}}}_{k}|{{\bf{w}}}_{k}\rangle \langle {{\bf{w}}}_{k}|\) and \(\int d{{\bf{p}}}_{k}|{{\bf{p}}}_{k}\rangle \langle {{\bf{p}}}_{k}|\), where \(\hat{{\bf{w}}}|{{\bf{w}}}_{k}\rangle ={{\bf{w}}}_{k}|{{\bf{w}}}_{k}\rangle \) and \(\hat{{\bf{p}}}|{{\bf{p}}}_{k}\rangle ={{\bf{p}}}_{k}|{{\bf{p}}}_{k}\rangle \). Then we obtain
This expression can be reduced to
where we have used
Manipulation of the Gaussian integral with respect to p_{ k } yields
Strong limit of ρ(t)
First we consider the Fourier transformation on the discrepancy from the center of weights w^{*} as
where a_{r} = a_{M−r}^{*} because w_{k} is a real vector. Then the elastic term is diagonalized as
where we have used \({\sum }_{k=0}^{M-1}{{\rm{e}}}^{i2\pi kr/M}=M\delta (r)\). When \(\rho (t)\gg 1\), the exponentiated elastic term is reduced to
We find that \(a_{r}\) follows a Gaussian distribution. We then perform the inverse Fourier transformation and obtain
In the limit \(M\to \infty \), we use \(2\pi r/M=x\) and \(2\pi /M=dx\)
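The diagonalization step can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: it takes a single scalar weight per replica, uses the convention \({w}_{k}-{w}^{\ast }={\sum }_{r}{a}_{r}{{\rm{e}}}^{i2\pi kr/M}\), and chooses \({w}^{\ast }\) as the replica mean. The function names are hypothetical. With this convention the elastic sum \({\sum }_{k}{({w}_{k}-{w}_{k+1})}^{2}\) equals \(M{\sum }_{r}2(1-\cos (2\pi r/M))|{a}_{r}{|}^{2}\), which the code compares against the direct sum.

```python
import cmath
import math
import random


def dft_coefficients(x):
    """Return a_r such that x_k = sum_r a_r * exp(i*2*pi*k*r/M)."""
    M = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * math.pi * k * r / M) for k in range(M)) / M
        for r in range(M)
    ]


def elastic_direct(w):
    """Elastic sum over replicas with periodic boundary w_M = w_0."""
    M = len(w)
    return sum((w[k] - w[(k + 1) % M]) ** 2 for k in range(M))


def elastic_fourier(w, center):
    """Same quantity computed from the Fourier modes of w_k - center."""
    M = len(w)
    a = dft_coefficients([wk - center for wk in w])
    return M * sum(
        2.0 * (1.0 - math.cos(2.0 * math.pi * r / M)) * abs(a[r]) ** 2
        for r in range(M)
    )


random.seed(0)
M = 8  # number of Trotter slices (replicas); small value for illustration
w = [random.gauss(0.0, 1.0) for _ in range(M)]
center = sum(w) / M  # assumed choice of w^*: the replica mean

direct = elastic_direct(w)
fourier = elastic_fourier(w, center)
a = dft_coefficients([wk - center for wk in w])

print(direct, fourier)
```

The two printed values should agree to floating-point precision. Because the r = 0 mode carries a vanishing coefficient, the elastic term does not depend on the choice of center, which is why only the modes with r ≠ 0 are constrained when ρ(t) is large.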
References
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016).
Robbins, H. & Monro, S. A stochastic approximation method. Ann. Math. Statist. 22, 400–407 (1951).
Bottou, L. Online algorithms and stochastic approximations. In Saad, D. (ed.) Online Learning and Neural Networks (Cambridge University Press, Cambridge, UK, 1998). Revised, oct 2012.
Sutskever, I., Martens, J., Dahl, G. & Hinton, G. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning - Volume 28, ICML'13, III–1139–III–1147 (JMLR.org, 2013).
Kingma, D. P. & Ba, J. Adam: A Method for Stochastic Optimization. In the 3rd International Conference for Learning Representations (ICLR), 2015 (2015).
Shirish Keskar, N., Mudigere, D., Nocedal, J., Smelyanskiy, M. & Tang, P. T. P. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. ArXiv e-prints (2016).
Baldassi, C., Ingrosso, A., Lucibello, C., Saglietti, L. & Zecchina, R. Subdominant dense clusters allow for simple learning and high computational performance in neural networks with discrete synapses. Phys. Rev. Lett. 115, 128101 (2015).
Baldassi, C. et al. Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes. Proceedings of the National Academy of Sciences 113, E7655–E7662 (2016).
Chaudhari, P. et al. Entropy-SGD: Biasing Gradient Descent Into Wide Valleys. ArXiv e-prints (2016).
Kadowaki, T. & Nishimori, H. Quantum annealing in the transverse Ising model. Phys. Rev. E 58, 5355–5363, https://doi.org/10.1103/PhysRevE.58.5355 (1998).
Suzuki, S. & Okada, M. Residual energies after slow quantum annealing. Journal of the Physical Society of Japan 74, 1649–1652, https://doi.org/10.1143/JPSJ.74.1649 (2005).
Morita, S. & Nishimori, H. Mathematical foundation of quantum annealing. Journal of Mathematical Physics 49 https://doi.org/10.1063/1.2995837 (2008).
Ohzeki, M. & Nishimori, H. Quantum annealing: An introduction and new developments. Journal of Computational and Theoretical Nanoscience 8, 963–971, https://doi.org/10.1166/jctn.2011.1776 (2011).
Johnson, M. W. et al. A scalable control system for a superconducting adiabatic quantum optimization processor. Superconductor Science and Technology 23, 065004 (2010).
Berkley, A. J. et al. A scalable readout system for a superconducting adiabatic quantum optimization system. Superconductor Science and Technology 23, 105014 (2010).
Harris, R. et al. Experimental investigation of an eight-qubit unit cell in a superconducting optimization processor. Phys. Rev. B 82, 024511, https://doi.org/10.1103/PhysRevB.82.024511 (2010).
Bunyk, P. I. et al. Architectural considerations in the design of a superconducting quantum annealing processor. IEEE Transactions on Applied Superconductivity 24, 1–10, https://doi.org/10.1109/TASC.2014.2318294 (2014).
Ohzeki, M. Quantum annealing with the Jarzynski equality. Phys. Rev. Lett. 105, 050401, https://doi.org/10.1103/PhysRevLett.105.050401 (2010).
Ohzeki, M., Nishimori, H. & Katsuda, H. Nonequilibrium work on spin glasses in longitudinal and transverse fields. J. Phys. Soc. Jpn. 80, 084002, https://doi.org/10.1143/JPSJ.80.084002 (2011).
Ohzeki, M. & Nishimori, H. Nonequilibrium work performed in quantum annealing. Journal of Physics: Conference Series 302, 012047 (2011).
Somma, R. D., Nagaj, D. & Kieferová, M. Quantum speedup by quantum annealing. Phys. Rev. Lett. 109, 050501 (2012).
Seki, Y. & Nishimori, H. Quantum annealing with antiferromagnetic fluctuations. Phys. Rev. E 85, 051112, https://doi.org/10.1103/PhysRevE.85.051112 (2012).
Nishimori, H. & Takada, K. Exponential enhancement of the efficiency of quantum annealing by non-stoquastic Hamiltonians. Frontiers in ICT 4, 2 (2017).
Ohzeki, M. Quantum Monte Carlo simulation of a particular class of non-stoquastic Hamiltonians in quantum annealing. Scientific Reports 7, 41186 (2017).
Baldassi, C. & Zecchina, R. Efficiency of quantum vs. classical annealing in nonconvex learning problems. Proceedings of the National Academy of Sciences 115, 1457–1462, https://doi.org/10.1073/pnas.1711456115 (2018).
Welling, M. & Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, ICML'11, 681–688 (Omnipress, USA, 2011).
Kirkpatrick, S., Gelatt, C. D. & Vecchi, M. P. Optimization by simulated annealing. Science 220, 671–680, https://doi.org/10.1126/science.220.4598.671 (1983).
Hatano, N. Localization in non-Hermitian quantum mechanics and flux-line pinning in superconductors. Physica A: Statistical Mechanics and its Applications 254, 317–331 (1998).
Suzuki, M. Relationship between d-dimensional quantal spin systems and (d + 1)-dimensional Ising systems: Equivalence, critical exponents and systematic approximants of the partition function and spin correlations. Progress of Theoretical Physics 56, 1454–1469, https://doi.org/10.1143/PTP.56.1454 (1976).
Perdomo-Ortiz, A., Dickson, N., Drew-Brook, M., Rose, G. & Aspuru-Guzik, A. Finding low-energy conformations of lattice protein models by quantum annealing. Scientific Reports 2, 571 (2012).
Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning representations by back-propagating errors. Nature 323, 533–536 (1986).
Zhang, S., Choromanska, A. & LeCun, Y. Deep learning with elastic averaging SGD. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS'15, 685–693 (MIT Press, Cambridge, MA, USA, 2015).
Li, M., Andersen, D. G., Smola, A. & Yu, K. Communication efficient distributed machine learning with the parameter server. In Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS'14, 19–27 (MIT Press, Cambridge, MA, USA, 2014).
Duchi, J., Hazan, E. & Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011).
Zeiler, M. D. Adadelta: An adaptive learning rate method. CoRR abs/1212.5701 (2012).
Tieleman, T. & Hinton, G. Lecture 6.5 - RMSProp. COURSERA: Neural Networks for Machine Learning (2012).
Sohl-Dickstein, J., Poole, B. & Ganguli, S. Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods. In Xing, E. P. & Jebara, T. (eds) Proceedings of the 31st International Conference on Machine Learning, vol. 32 of Proceedings of Machine Learning Research, 604–612 (PMLR, Beijing, China, 2014).
LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278–2324, https://doi.org/10.1109/5.726791 (1998).
Samaria, F. S. & Harter, A. C. Parameterisation of a stochastic model for human face identification. In Proceedings of 1994 IEEE Workshop on Applications of Computer Vision, 138–142 (1994).
Johnson, M. W. et al. Quantum annealing with manufactured spins. Nature 473, 194 (2011).
Amin, M. H. Searching for quantum speedup in quasistatic quantum annealers. Phys. Rev. A 92, 052323 (2015).
Acknowledgements
The authors would like to thank Shu Tanaka and Muneki Yasuda for many fruitful discussions that contributed to the work. The present work is financially supported by MEXT KAKENHI Grant Nos. 15H03699, 16K13849, and 16H04382, and by JST START.
Author information
Contributions
M.O. conceived and conducted the experiment and analyzed the results. S.O. tested the previous version of the optimization method. M.T. discussed possible applications of our method to industry. S.T. directed the project and investigated the possible design of our method. All authors discussed the details of the results and reviewed the manuscript.
Ethics declarations
Competing Interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Ohzeki, M., Okada, S., Terabe, M. et al. Optimization of neural networks via finite-value quantum fluctuations. Sci Rep 8, 9950 (2018). https://doi.org/10.1038/s41598-018-28212-4
DOI: https://doi.org/10.1038/s41598-018-28212-4