Introduction

The data-driven approach is being widely adopted in many fields of science and engineering. Its key technology is machine learning, which is supported by successful applications of deep neural networks (DNNs)1. Deep neural networks have achieved state-of-the-art results in a wide variety of tasks, including computer vision, natural language processing, and reinforcement learning2. The revolutionary event in which artificial intelligence bested a human at the game of Go exemplifies the potential power of machine learning. In DNNs, iterated linear and nonlinear transformations construct a pattern-recognition system that maps the raw data (such as the pixel values of natural images) into a nontrivial internal representation, or feature vector. The extracted features enable us to classify the different patterns in the input data.

To advance DNN technology, various researchers have developed learning algorithms that provide faster results and better performance. The algorithms for optimizing DNNs are based on stochastic gradient descent3,4,5, which partitions a large dataset into several batches and approximates the gradient of the cost function. The standard choice among the variants of the stochastic gradient method is adaptive moment estimation (Adam)6. This algorithm is designed to efficiently escape the saddle points that often appear in the cost functions of DNNs. In practice, however, the learning of DNNs suffers from local minima with different generalization performance, a consequence of the shape of the DNN cost function: a sharp minimizer has poorer generalization performance than a wide, flat one. It is thus important to design a learning algorithm that finds a better solution by escaping from both saddle points and poor local minima. A recent study7 showed that the batch size is closely related to the generalization performance, which is characterized by the shape of the local minima. Its authors experimentally demonstrated that the large-batch stochastic gradient method and its variants tend to converge to sharp minimizers with poor generalization performance. Small-batch stochastic gradient descent, on the other hand, is likely to fall into wider minimizers, in which the DNNs have high generalization performance. The batch size controls the magnitude of the stochastic noise during learning; in other words, the injection of stochastic noise can be the basis of an efficient learning algorithm that converges into wider local minima. In addition, an analytical study on discrete-weight networks revealed subdominant solutions with higher generalization performance than the exponentially dominant (typical) solutions, which deviate from the ground truth8,9. The subdominant solutions become algorithmically reachable by considering the effect of entropy. As proposed in the literature10, one computes the local entropy by injecting stochastic noise and updates the weights so as to take the DNN to wider local minima with better generalization performance.

The gradient descent algorithm is closely related to classical dynamics in physics, and its stochastic version has a connection with Langevin dynamics, which models classical stochastic dynamics in various fields of nature. In the present study, we test the optimization of DNNs using quantum fluctuations, as employed in quantum annealing (QA). Quantum annealing has been developed as a generic solver for optimization problems. It was originally proposed as a numerical algorithm for optimizing cost functions with discrete variables11. The theoretical aspects of QA are well understood: its basic concept is derived from the quantum adiabatic theorem12,13,14, and successful experimental implementations of QA have been realized with present-day technology15,16,17,18. Since then, QA has developed rapidly and has attracted much attention. Several protocols based on QA do not adhere to adiabatic quantum computation, i.e., they do not keep the system in the ground state; rather, they employ a nonadiabatic counterpart19,20,21,22. In addition, some studies have exploited more sophisticated quantum effects23,24,25. Although the original proposal for QA was designed for optimization problems with discrete variables, described in the form of a spin-glass Hamiltonian11, the concept of QA can be generalized to a wider range of optimization problems, even those with continuous variables. Most practical optimization problems, including those in machine learning, involve continuous variables; one typical instance is the optimization of DNNs. Below, we apply the concept of QA to the DNN optimization problem. A previous study26 assessed the potential efficiency of quantum fluctuations in coping with non-convex cost functions by means of the replica method, a sophisticated tool of statistical mechanics. Although that analysis treated the learning of a discrete-weight neural network (binary variables, as in the Ising model), the essential features are expected to carry over to continuous-variable neural networks. As discussed in that study, the generalization performance attained by optimization with quantum fluctuations can be better than that without them. In the present study, we perform practical tests, namely the optimization of DNNs with quantum fluctuations, and discuss its efficiency. Because the computational cost of simulating quantum dynamics is prohibitive, as shown below, our tests are restricted to relatively shallow networks. However, our approach applies straightforwardly to deeper networks.

The paper is organized as follows: The second section describes our method for optimizing DNNs. The following section demonstrates the method using three simple tasks. The last section discusses the feasibility of our method.

Methods

Quantum annealing for continuous variables

The optimization problem is interpreted as the minimization of the energy function (potential energy) V(w) in the context of physics. We address the optimization of the weights of DNNs below. The weights are denoted by \({\bf{w}}\in {{\mathbb{R}}}^{N}\). The standard gradient descent is given as the equation of motion for the overdamped system

$${\bf{w}}(t+1)={\bf{w}}(t)-\eta \frac{\partial }{\partial {\bf{w}}}V({\bf{w}}),$$
(1)

where t is the update step. This is regarded as a dynamical system in the low-temperature region in the context of physics. Considering the thermal effect characterized by the temperature T, the weights fluctuate following the Gibbs–Boltzmann distribution as

$$P({\bf{w}})=\frac{1}{Z}\exp (-\beta V({\bf{w}})),$$
(2)

where Z is the partition function that acts as a normalization constant. In this case, instead of the equation of motion, a dynamical system with Langevin dynamics is adequate for description of the weights following the Gibbs–Boltzmann distribution as

$${\bf{w}}(t+1)={\bf{w}}(t)-\eta \frac{\partial }{\partial {\bf{w}}}V({\bf{w}})+\sqrt{2T\eta }\,N(0,\,1)\mathrm{.}$$
(3)

This is the procedure known as the stochastic gradient Langevin method27, in which the learning rate decreases in the same manner as in simulated annealing (SA)28. In QA, we introduce quantum fluctuations in addition to the energy function at extremely low temperature T → 0 (β → ∞). We consider the following time-dependent Hamiltonian:

$$\hat{H}(t)=V(\hat{{\bf{w}}})+\frac{1}{2\rho (t)}{\hat{{\bf{p}}}}^{2}$$
(4)

where \(\hat{{\bf{w}}}\) denotes the degrees of freedom and \(\hat{{\bf{p}}}\) represents the momentum, which satisfies the commutation relation \([\hat{{\bf{w}}},\,\hat{{\bf{p}}}]=i\hslash \mathrm{.}\) In addition, ρ(t) represents the mass of the weights and increases from 0 to ∞ over the course of the QA process. Following the ideas of quantum mechanics, the weights fluctuate as characterized by the density matrix, instead of directly by a distribution function; it is defined as

$$\hat{\rho }=\frac{1}{Z}\exp (\,-\,\beta \hat{H}(t))$$
(5)

where \(Z={\rm{Tr}}(\exp (\,-\,\,\beta \hat{H}(t)))\). To specify the probability distribution of the realized configuration of the weights, we compute the matrix elements as

$$P({\bf{w}})=\langle {\bf{w}}|\hat{\rho }|{\bf{w}}\rangle ,$$
(6)

where \(\hat{{\bf{w}}}|{\bf{w}}\rangle ={\bf{w}}|{\bf{w}}\rangle \). However, the computation of the density matrix is intractable in general. We therefore employ the Suzuki–Trotter decomposition, which reduces the operators to c-numbers by introducing M copies29, and obtain the following path-integral representation, as shown in the Appendix:

$$P({\bf{w}})=\mathop{\mathrm{lim}}\limits_{M\to \infty }\int {\mathscr{D}}{\bf{w}}\,\exp (-\sum _{k=1}^{M}(\frac{\beta }{M}V({{\bf{w}}}_{k})+\frac{M\rho (t)}{2\beta }{\Vert {{\bf{w}}}_{k}-{{\bf{w}}}_{k-1}\Vert }_{2}^{2})),$$
(7)

where \(\int {\mathscr{D}}{\bf{w}}={\prod }_{k=1}^{M-1}\int d{{\bf{w}}}_{k}\), M is the Trotter number, and k is the index of the replicated system. The boundary condition is set to \({{\bf{w}}}_{0}={{\bf{w}}}_{M}={\bf{w}}\). The numerical implementation of the Suzuki–Trotter decomposition is established as an approximation of the distribution function (7) with a finite M. For instance, in the quantum Monte Carlo simulation30, the configuration of the degrees of freedom is sampled using the distribution function as

$$P({{\bf{w}}}_{1},\,{{\bf{w}}}_{2},\,\cdots ,\,{{\bf{w}}}_{M})=\prod _{k\mathrm{=1}}^{M}\exp (-\frac{\beta }{M}V({{\bf{w}}}_{k})-\frac{M\rho (t)}{2\beta }{\Vert {{\bf{w}}}_{k}-{{\bf{w}}}_{k-1}\Vert }_{2}^{2}),$$
(8)

in which the inverse temperature is taken as β → ∞ with β/M kept finite. In other words, the quantum Monte Carlo simulation deals with many replicated realizations, or paths \({{\bf{w}}}_{k}(t)\) with index k (imaginary time), following Langevin dynamics as

$${{\bf{w}}}_{k}(t+1)={{\bf{w}}}_{k}(t)-\eta \frac{\partial }{\partial {{\bf{w}}}_{k}}V({{\bf{w}}}_{k}(t))-\eta {T}_{q}^{2}\rho (t)(2{{\bf{w}}}_{k}(t)-{{\bf{w}}}_{k-1}(t)-{{\bf{w}}}_{k+1}(t))+\sqrt{2{T}_{q}\eta }\,N(0,\,1),$$
(9)

where \({T}_{q}=M/\beta \). One may recognize that many DNN realizations interact with each other through the elastic term, which represents the quantum effect. The elastic term urges the many DNN realizations into a single condensed solution w* when ρ(t) takes a relatively large value. By the boundary condition \({{\bf{w}}}_{0}={{\bf{w}}}_{M}\), we have \({{\bf{w}}}^{\ast }={\bf{w}}\). For simplicity, let us first consider the case of large ρ(t). The path-integral formulation allows fluctuations around w*. In other words, the action in the exponent of P(w) has two terms: one is the cost function, which is what we originally want to optimize, and the other is the degree of condensation of the realizations. As shown in the Appendix, we find that \({{\bf{w}}}_{k}-{\bf{w}}\) follows a Gaussian distribution with covariance \(\beta {V}_{kk^{\prime} }(t)\). Thus, for large ρ(t), the approximated distribution function is reduced to

$$P({\bf{w}})\approx \int {\mathscr{D}}{{\bf{w}}}_{k}\,\exp (-\frac{\beta }{M}\sum _{k}V({{\bf{w}}}_{k}))\exp (-\frac{\beta }{2}\sum _{k,k^{\prime} }({{\bf{w}}}_{k}-{\bf{w}}){V}_{kk^{\prime} }(t)({{\bf{w}}}_{k^{\prime} }-{\bf{w}}))\mathrm{.}$$
(10)

To simplify the analysis, we bound the logarithm of the distribution function from below as

$$\mathrm{log}\,P({\bf{w}})\ge M\,\mathrm{log}\,\int d{{\bf{w}}}^{\prime} \exp (-\frac{\gamma (t)}{2{T}_{q}}{({{\bf{w}}}^{\prime} -{\bf{w}})}^{2}-\frac{1}{{T}_{q}}V({{\bf{w}}}^{\prime} )),$$
(11)

where γ(t) is a constant chosen to maintain this inequality. The right-hand side is the cost function appearing in the entropy stochastic gradient descent (E-SGD) algorithm, which captures the wider local minima9. To obtain the most probable weights w, we take the derivative of the right-hand side of (11) with respect to w and obtain the following update equation:

$${\bf{w}}(t+1)={\bf{w}}(t)-\eta \gamma (t)({\bf{w}}(t)-\langle {{\bf{w}}}^{\prime} \rangle ),$$
(12)

where \(\langle \cdots \rangle \) denotes the average of w′ over the integrand of (11). The average is intractable to compute directly and is instead estimated by the following Langevin dynamics:

$${{\bf{w}}}^{\prime} (s+1)={{\bf{w}}}^{\prime} (s)-\eta \{\frac{\partial }{\partial {{\bf{w}}}^{\prime} }V({{\bf{w}}}^{\prime} (s))+\gamma (t)({{\bf{w}}}^{\prime} (s)-{\bf{w}}(t))\}+\sqrt{2{T}_{q}\eta }\,N(0,\,1)\mathrm{.}$$
(13)

In the E-SGD algorithm, γ(t) is a decreasing function that vanishes at the completion of optimization. The time dependence of γ(t) is closely related to that of ρ(t), as described in the Appendix. In standard QA, we gradually increase ρ(t), and γ(t) then increases accordingly. Thus, the E-SGD algorithm is essentially different from the standard QA procedure; as its authors stated, it is rather the "reverse annealing" method that is considered in the literature9.
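
To make the two-level structure of Eqs. (12) and (13) concrete, the following is a minimal sketch of one E-SGD-like step in Python/NumPy. It is an illustration under our own assumptions, not a reference implementation: the gradient function `grad_V`, the inner-loop length, and the default values are hypothetical choices.

```python
import numpy as np

def esgd_step(w, grad_V, gamma, Tq, eta=1e-3, n_inner=20, rng=None):
    """One outer step of an E-SGD-like update, Eqs. (12)-(13).

    The inner Langevin loop samples w' under the local potential
    V(w') + (gamma/2)||w' - w||^2 at temperature Tq and accumulates
    the running average <w'>; the outer step then moves w toward it.
    """
    rng = rng or np.random.default_rng(0)
    w_prime = w.copy()
    w_avg = np.zeros_like(w)
    for s in range(n_inner):
        # Eq. (13): Langevin dynamics in the inner potential
        drift = grad_V(w_prime) + gamma * (w_prime - w)
        w_prime = (w_prime - eta * drift
                   + np.sqrt(2.0 * Tq * eta) * rng.standard_normal(w.shape))
        w_avg += w_prime
    w_avg /= n_inner
    # Eq. (12): outer update along gamma * (w - <w'>)
    return w - eta * gamma * (w - w_avg)
```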

Reverse annealing is implemented in the current D-Wave machine and shows improved optimization performance. A similar approach for increasing the performance is to search by temporarily inducing the quantum fluctuation31. In these cases, reverse annealing amounts to an induction of the quantum fluctuation, namely ρ(0) = ρ(T) = 0 while ρ(t) > 0 at intermediate times; a simple schedule with this property is sketched below.
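
For illustration, one possible schedule satisfying ρ(0) = ρ(T) = 0 with ρ(t) > 0 in between is a smooth sine-squared bump; the functional form is our own choice and is not prescribed by the literature31.

```python
import numpy as np

def reverse_annealing_rho(t, T, rho_max=1.0):
    """Hypothetical reverse-annealing schedule: rho vanishes at the
    start and end of the run and rises to rho_max in the middle."""
    return rho_max * np.sin(np.pi * t / T) ** 2
```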

Finite-value quantum annealing

As described in previous studies9,26, there is a useful algorithm exploiting an entropic effect around a single condensed solution. This algorithm elucidates one aspect of the quantum effect, namely the entropy effect. In our study, we instead directly optimize the cost function that appears in the exponent of the probability distribution (8), i.e.,

$$C({{\bf{w}}}_{1},\,{{\bf{w}}}_{2},\,\cdots ,\,{{\bf{w}}}_{M})=\sum _{k=1}^{M}V({{\bf{w}}}_{k})+\sum _{k=1}^{M}\frac{\rho (t)}{2}{\Vert {{\bf{w}}}_{k}-{{\bf{w}}}_{k-1}\Vert }_{2}^{2},$$
(14)

which involves nontrivial quantum tunneling stemming from non-perturbative effects. Here, we assume β/M = 1 because we take β → ∞ and M → ∞. Thus, we must deal with M replicated systems when optimizing the DNNs; in this sense, our procedure is not yet practical for DNN optimization in real applications. However, our trial may motivate possible applications of quantum computation. We report several simple DNN optimization tests below to provide future perspectives on machine learning in relation to quantum mechanics.
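
For reference, a minimal sketch of the replicated cost (14) in Python/NumPy reads as follows; it assumes a scalar loss function V and represents the M Trotter slices as a list of weight vectors, both of which are our own choices.

```python
import numpy as np

def replicated_cost(ws, V, rho_t):
    """Replicated cost function, Eq. (14): the losses of the M Trotter
    slices plus the elastic coupling between neighboring slices.
    ws is a list of M weight vectors with periodic boundary w_0 = w_M."""
    M = len(ws)
    potential = sum(V(ws[k]) for k in range(M))
    # ws[k - 1] wraps around at k = 0, implementing the boundary condition
    elastic = 0.5 * rho_t * sum(
        np.sum((ws[k] - ws[k - 1]) ** 2) for k in range(M))
    return potential + elastic
```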

From this point forward, we do not focus on the case of large ρ(t). We consider directly optimizing the cost function (14), in the limit T → 0 so as to retain only the quantum effect for simplicity, as

$${{\bf{w}}}_{k}(t+1)={{\bf{w}}}_{k}(t)-\eta \frac{\partial }{\partial {{\bf{w}}}_{k}}V({{\bf{w}}}_{k}(t))-\eta \rho (t)(2{{\bf{w}}}_{k}(t)-{{\bf{w}}}_{k-1}(t)-{{\bf{w}}}_{k+1}(t))\mathrm{.}$$
(15)

In addition, we consider a finite-value quantum annealing, in which the quantum fluctuation remains finite at the final stage of optimization. In standard QA, we gradually increase ρ(t) to obtain a single realization among the many replicas. However, as discussed later, a moderate ρ(t) value is beneficial for obtaining improved generalization performance. When we do not consider the "quality" of the solution, standard QA is one of the best choices; the theoretical assurance that ideal QA reaches the optimal solution with the lowest cost-function value is well established on the basis of the adiabatic theorem12. However, in DNN optimization, the quality of the solution is measured on a different scale than the cost function itself, namely the generalization performance. Therefore, the standard QA method is not necessarily the best choice for optimizing DNNs. We therefore inject a finite amount of quantum fluctuation to attain better generalization performance; a sketch of the update (15) is given below.
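
Under the same assumptions as the cost sketch above (a gradient function `grad_V` and a list of M weight vectors), one sweep of Eq. (15) can be sketched as follows.

```python
import numpy as np

def finite_qa_step(ws, grad_V, rho_t, eta=1e-3):
    """One sweep of Eq. (15): gradient descent on the replicated cost
    with the elastic term and without thermal noise (the T -> 0 limit).
    ws is a list of M weight vectors with periodic boundaries."""
    M = len(ws)
    new_ws = []
    for k in range(M):
        # discrete Laplacian along the imaginary-time (Trotter) direction
        elastic = 2.0 * ws[k] - ws[k - 1] - ws[(k + 1) % M]
        new_ws.append(ws[k] - eta * (grad_V(ws[k]) + rho_t * elastic))
    return new_ws
```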

Here, we provide a simple schematic picture of how finite-value QA attains improved generalization performance. For simplicity, we assume that a DNN loss function has two local minima, a sharp one and a wide one, of the same depth, as shown in Fig. 1.

Figure 1

Schematic pictures of two local minima and quantum effects.

In other words, the first term of the cost function (14) takes the same value at the two local minima. Let us consider which solution is favored in standard QA, where we increase ρ(t) to a very large value. When the optimization proceeds without entrapment in saddle points or trivial local minima, we can compare the two representative local minima of the cost function (14). When most of the realizations of the M-replicated DNNs condense into the sharp local minimum, the cost function (14) takes a smaller value than in the case of the wide local minimum. Thus, a successful run of standard QA is absorbed into the sharp local minimum; in this sense, standard QA is not suitable for optimizing DNNs. Instead, in finite-value QA, the final value of ρ(t) is set to be finite. Then, depending on this final value, the resultant solution is allowed to be absorbed into the wider local minimum of the loss function. In a previous study9, γ(t) (analogous to ρ(t)) is referred to as the scoping coefficient and is gradually decreased.

The remaining problem is that, in general, we do not know an adequate strength of the quantum fluctuation a priori. We propose an adaptive approach for tuning the value of ρ(t) in the next subsection.

Quantum Adam

We hereafter take the loss function \(L({\mathscr{D}}|{\bf{w}})\) for a training dataset \({\mathscr{D}}\) as the energy function. The loss function measures the discrepancy between the ground-truth labels t and the output y predicted by the network. The gradient of the loss function is computed using the back-propagation method32. We employ the stochastic gradient descent method by dividing the training dataset into M minibatches \(\{{{\mathscr{D}}}_{1},\,{{\mathscr{D}}}_{2},\,\cdots ,\,{{\mathscr{D}}}_{M}\}\), where M is the number of Trotter slices. This is convenient for processing a large amount of training data and mitigates the computational cost of the gradient. We then distribute one minibatch to each Trotter slice k. Following the standard prescription of the Suzuki–Trotter decomposition, we should use the same energy function on each Trotter slice. However, to induce stochastic ingredients over the M-replicated DNNs and thereby perform efficient learning, we employ the loss function \(L({{\mathscr{D}}}_{k}|{{\bf{w}}}_{k})\) on each Trotter slice k. We sweep all the minibatches over each Trotter slice in an epoch, and the minibatches are randomly shuffled in each epoch, as sketched below.
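
As an illustration of this bookkeeping, the following hypothetical helper shuffles the training set and assigns one minibatch per Trotter slice at the beginning of each epoch; the function name and the array representation are our own.

```python
import numpy as np

def minibatches_per_slice(X, y, M, rng):
    """Shuffle the training set (X, y) and split it into M minibatches,
    one per Trotter slice k; calling this once per epoch implements the
    random reshuffling described in the text."""
    idx = rng.permutation(len(X))
    return list(zip(np.array_split(X[idx], M), np.array_split(y[idx], M)))
```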

We assume here that our procedure would, in practice, be employed in a parallel computing environment. In the current machine learning landscape, parallel computing is sometimes employed for learning from very large datasets. An elastic term of the form \(\rho {\Vert {{\bf{w}}}_{k}-{{\bf{w}}}^{\ast }\Vert }_{2}^{2}\), as in our case, has been used in parallel computing environments33. Another study prepared a master with w and updated it by summing over the gradients obtained by slaves with \({{\bf{w}}}_{k}\)34.

We now address the remaining problem of determining the magnitude of the coefficient ρ(t) of the elastic term. To adaptively change the coefficient, we exploit the idea of the Adam method, which is often used in DNN optimization6; it accelerates the update when the gradient shrinks around a saddle point. In Adam, instead of the standard gradient descent method (1), the update rule is

$${\bf{w}}(t+1)={\bf{w}}(t)-\frac{\eta }{\sqrt{\tilde{{\bf{v}}}(t)}+\varepsilon }\tilde{{\bf{m}}}(t),$$
(16)

where \(\tilde{{\bf{m}}}(t)={\bf{m}}(t)/(1-{\beta }_{1}^{t})\), \(\tilde{{\bf{v}}}(t)={\bf{v}}(t)/(1-{\beta }_{2}^{t})\), and

$${\bf{m}}(t)={\beta }_{1}{\bf{m}}(t-1)+(1-{\beta }_{1})\,{\bf{g}}(t)$$
(17)
$${\bf{v}}(t)={\beta }_{2}{\bf{v}}(t-1)+(1-{\beta }_{2})\,{\bf{g}}(t)\odot {\bf{g}}(t)\mathrm{.}$$
(18)

Here, g(t) is the gradient of the loss function. The hyperparameters β1 and β2 are chosen a priori. The quantity ε avoids accidental division by zero. The product \(\odot \) and the division between vectors are performed component-wise. During the update iterations, the magnitude of the gradient becomes small around a saddle point; v(t) then becomes a vector with small-valued elements, and the coefficient \(\eta /(\sqrt{\tilde{{\bf{v}}}(t)}+\varepsilon )\) of the effective gradient \(\tilde{{\bf{m}}}(t)\) increases. The updates thus proceed efficiently even around the saddle point. This is a rough sketch of the learning acceleration provided by Adam.
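
For concreteness, one classical Adam step, Eqs. (16)-(18), can be sketched in Python/NumPy as follows; the function signature and the default values other than β1 and β2 are our own choices.

```python
import numpy as np

def adam_step(w, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One classical Adam update; the step counter t starts at 1."""
    m = beta1 * m + (1.0 - beta1) * g             # Eq. (17)
    v = beta2 * v + (1.0 - beta2) * g * g         # Eq. (18)
    m_hat = m / (1.0 - beta1 ** t)                # bias corrections
    v_hat = v / (1.0 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)  # Eq. (16)
    return w, m, v
```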

For tuning ρ(t), we employ a technique similar to one in Adam, in which the coefficient of the effective gradient is adaptively changed as follows:

$${{\bf{w}}}_{k}(t+1)={{\bf{w}}}_{k}(t)-\frac{\eta }{\sqrt{{\tilde{{\bf{v}}}}_{k}(t)}+\varepsilon }{\tilde{{\bf{m}}}}_{k}(t)-\frac{\eta \rho }{\sqrt{{\tilde{{\bf{v}}}}_{k}^{q}(t)}+\varepsilon }{\tilde{{\bf{m}}}}_{k}^{q}(t),$$
(19)

where \({\tilde{{\bf{m}}}}_{k}(t)\) and \({\tilde{{\bf{v}}}}_{k}(t)\) are obtained in the same manner as in Adam, \({\tilde{{\bf{m}}}}_{k}^{q}(t)={{\bf{m}}}_{k}^{q}(t)/(1-{\alpha }_{1}^{t})\), \({\tilde{{\bf{v}}}}_{k}^{q}(t)={{\bf{v}}}_{k}^{q}(t)/(1-{\alpha }_{2}^{t})\), and

$${{\bf{m}}}_{k}^{q}(t)={\alpha }_{1}{{\bf{m}}}_{k}^{q}(t-1)+(1-{\alpha }_{1}){{\bf{g}}}_{k}^{q}(t)$$
(20)
$${{\bf{v}}}_{k}^{q}(t)={\alpha }_{2}{{\bf{v}}}_{k}^{q}(t-1)+(1-{\alpha }_{2}){{\bf{g}}}_{k}^{q}(t)\odot {{\bf{g}}}_{k}^{q}(t)\mathrm{.}$$
(21)

Here, \({{\bf{g}}}_{k}^{q}(t)=2{{\bf{w}}}_{k}(t)-{{\bf{w}}}_{k+1}(t)-{{\bf{w}}}_{k-1}(t)\). As in Adam, the hyperparameters α1 and α2 are set a priori. The above update rule adequately tunes the elastic term: the coefficient is effectively tuned as \(\rho (t)\to \rho /(\sqrt{{\tilde{{\bf{v}}}}_{k}^{q}(t)}+\varepsilon )\).

Following standard QA, the weights are randomly initialized in order to search for good candidate solutions over a relatively wide range. In other words, in the initial stage of optimization, the weights associated with the different Trotter slices deviate from one another. Owing to the elastic term, the discrepancies between Trotter slices begin to lessen after several iterations; in other words, the tunneling effect gradually decays, and the effective coefficient \(\rho /(\sqrt{{\tilde{{\bf{v}}}}_{k}^{q}(t)}+\varepsilon )\) then increases to enhance the tunneling effect again. Therefore, the above update rule efficiently induces the tunneling effect without directly tuning the value of the mass ρ. We call this update rule "quantum Adam" in the sense that we add the quantum effects stemming from \({{\bf{g}}}_{k}^{q}(t)\) while tuning their contribution during learning; a sketch of the full update is given below. We emphasize that other gradient methods developed for machine learning, including AdaGrad35, AdaDelta36, RMSprop37, and the Sum of Functions Optimizer38, can incorporate the quantum effect in the same manner.
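
The following is a compact sketch of the resulting quantum Adam update, Eqs. (19)-(21). It is a schematic implementation under our own design choices (a class holding per-slice moments and periodic boundaries over the slices), not the original code; `grads[k]` is assumed to be the back-propagated gradient of \(L({{\mathscr{D}}}_{k}|{{\bf{w}}}_{k})\).

```python
import numpy as np

class QuantumAdam:
    """Sketch of the quantum Adam update for M Trotter slices. Each
    slice k keeps Adam moments for its loss gradient (m, v) and for
    the elastic 'quantum' gradient g_q (mq, vq)."""

    def __init__(self, M, shape, eta=1e-3, rho=1.0,
                 beta1=0.9, beta2=0.999, alpha1=0.9, alpha2=0.999, eps=1e-8):
        self.eta, self.rho, self.eps = eta, rho, eps
        self.b1, self.b2, self.a1, self.a2 = beta1, beta2, alpha1, alpha2
        self.m = [np.zeros(shape) for _ in range(M)]
        self.v = [np.zeros(shape) for _ in range(M)]
        self.mq = [np.zeros(shape) for _ in range(M)]
        self.vq = [np.zeros(shape) for _ in range(M)]
        self.t = 0

    def step(self, ws, grads):
        """ws: list of M weight arrays; grads[k]: loss gradient on slice k."""
        self.t += 1
        M, new_ws = len(ws), []
        for k in range(M):
            gq = 2.0 * ws[k] - ws[(k + 1) % M] - ws[k - 1]   # elastic gradient
            self.m[k] = self.b1 * self.m[k] + (1 - self.b1) * grads[k]
            self.v[k] = self.b2 * self.v[k] + (1 - self.b2) * grads[k] ** 2
            self.mq[k] = self.a1 * self.mq[k] + (1 - self.a1) * gq       # Eq. (20)
            self.vq[k] = self.a2 * self.vq[k] + (1 - self.a2) * gq ** 2  # Eq. (21)
            m_hat = self.m[k] / (1 - self.b1 ** self.t)
            v_hat = self.v[k] / (1 - self.b2 ** self.t)
            mq_hat = self.mq[k] / (1 - self.a1 ** self.t)
            vq_hat = self.vq[k] / (1 - self.a2 ** self.t)
            new_ws.append(ws[k]
                          - self.eta * m_hat / (np.sqrt(v_hat) + self.eps)
                          - self.eta * self.rho * mq_hat
                          / (np.sqrt(vq_hat) + self.eps))                # Eq. (19)
        return new_ws
```

In an actual run, `step` would be called once per minibatch sweep, with `grads` recomputed by back-propagation on each slice's minibatch.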

In the following section, we demonstrate the effectiveness of quantum Adam by testing it against two datasets: the MNIST handwritten digit dataset39 and the Olivetti face image dataset40; both are open datasets often used in benchmark tests for machine learning.

Results

In this section, we demonstrate the application of quantum Adam to DNNs using well-known open datasets. Although the datasets used in the experiments are relatively easy to analyze, implementing the M-replicated DNNs required for quantum Adam incurs high computational costs. In this sense, the present study is a proof of concept.

For simplicity, we used ReLU as the activation function in the middle layers in all experiments. We used the cross entropy as the cost function for classification and the mean-squared error for auto-encoding in the results shown below. The weights are initialized with i.i.d. Gaussian samples with zero mean and standard deviation \(\sqrt{1/{N}_{l}}\), where N l is the number of inputs of each layer l (see the sketch below). We use the standard choice of α1 = β1 = 0.9 and α2 = β2 = 0.999. For comparison, we set common initial conditions and performed M independent runs of classical (standard) Adam alongside quantum Adam. We then assessed the generalization performance in terms of the average and the minimum/maximum of the loss function/accuracy.
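
The weight initialization just described amounts to the following two-line sketch; the helper name is hypothetical.

```python
import numpy as np

def init_layer(n_in, n_out, rng):
    """I.i.d. Gaussian initialization with zero mean and standard
    deviation sqrt(1/n_in) for a layer with n_in inputs."""
    return rng.standard_normal((n_in, n_out)) / np.sqrt(n_in)
```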

The first task was to classify the MNIST 8 × 8-pixel images of handwritten digits. We constructed an all-to-all single-layer neural network (NN) for classifying the handwritten digits. Figure 2 shows the accuracy on the test data for classical and quantum Adam. We trained the NN by feeding it 500 data items and setting M = 500. We then measured the accuracy using 1297 data items. In this case, we set the coefficient ρ = 2.0. Both the average and the maximum accuracy confirm that quantum Adam is superior to classical Adam.

Figure 2

Accuracy for test data (red and dashed curves: classical Adam, blue and solid curves: quantum Adam) in single-layer NN for MNIST. All results from the M-replicated systems are indicated by light-colored curves. The bold curves denote the average, and the thin curves represent the maximum in the replicated NNs. The horizontal axis represents the epoch, and the vertical axis represents the accuracy of the test data.

The second task was to construct an autoencoder, which recovers the original input as the output, using the MNIST 8 × 8-pixel images of handwritten digits. To encode the handwritten digits, we constructed two convolution layers with a filter size of three and an output of six channels; the middle layer has 96 nodes in this case. To decode the images, we constructed two deconvolution layers in the inverse manner. Figure 3 shows the loss function on the test data for classical and quantum Adam. We trained the NN by feeding it 100 data items and setting M = 100. We then measured the loss function on 1697 data items to determine the generalization performance. In this case, we set the coefficient ρ = 1.0. Both the average and the minimum of the loss function in the replicated systems confirm that quantum Adam is superior to classical Adam. However, this result might be accidental, as several experiments showed no significant improvement in terms of the mean-squared error.

Figure 3

Loss function for test data in an auto encoder using MNIST. All results from the replicated systems are indicated by light-colored curves. The bold and thin curves indicate the average and the minimum in replicated NNs. The horizontal axis represents the epoch, and the vertical axis represents the loss function of the test data. The inset shows an enlarged view of the average loss functions during 800–1000 epochs.

The third task was to classify the Olivetti 64 × 64-pixel images of human faces. We constructed an all-to-all three-layer (4096-2048-1024-40) NN for classifying the face images. Figure 4 shows the accuracy on the test data for classical and quantum Adam. We trained the NN by feeding it 200 data items and setting M = 40. We then measured the accuracy using 200 data items. In this case, we set the constant ρ = 1.0 and performed batch normalization at each layer. Both the average and the maximum accuracy show that quantum Adam is superior to classical Adam in the last stage of learning.

Figure 4

Accuracy for test data for classification of Olivetti face images. The same curves as those in Fig. 2 are used. The horizontal axis represents the epoch, and the vertical axis represents the accuracy of the test data.

Discussion

We proposed quantum Adam, formulated through a path-integral representation, for the optimization of DNNs. The proposed algorithm generates an elastic term between different realizations of DNNs and can find a solution that is better in terms of generalization performance than that found by classical Adam. The point is to control the quantum fluctuation by adaptively changing its coefficient and to induce wide, flat local minima by means of the entropy effect, as discussed in previous studies9,26. In the present study, we directly optimize the M-replicated DNNs while retaining the non-perturbative contribution, which allows the quantum tunneling effect. Although relatively small datasets are used, we demonstrate better generalization performance by optimizing with a finite quantum fluctuation strength. In this sense, our method does not conform to the standard QA procedure; indeed, ideal QA might not be the best choice of learning algorithm for DNNs because the resultant solutions are absorbed into sharp minima. With recent developments in microdevice manufacturing, QA has been successfully implemented in superconducting qubits, in the so-called quantum annealer. Several experiments have shown that the resultant solutions seem to fall into wide local minima41. However, this is due to the freezing phenomenon in the quantum annealer, a problem particular to the quantum device. The resultant solutions are closely related to low-energy states at a certain value of the quantum fluctuation, as pointed out in the literature42. In other words, the output of the present version of the quantum annealer follows the Gibbs–Boltzmann distribution with a certain value of the quantum fluctuation. In this sense, QA as performed in real experiments can be a viable choice of learning algorithm. In addition, the current version of the quantum annealer, the D-Wave 2000Q, implements two optimization techniques that manipulate the value of the quantum fluctuation, namely quenching and reverse annealing. These two techniques should make it possible to attain better generalization performance efficiently in real experiments, as discussed in the literature26.

In the present study, we performed the optimization on classical computers. In addition, we selected the strength of the quantum fluctuation by the adaptive change inspired by the Adam method. The potential performance of quantum Adam emerges for a large Trotter number, which corresponds to the number of minibatches. When we use a small number of minibatches, quantum Adam does not work well, because most of the DNNs then fall into sharp minimizers. In addition, the value of ρ must be tuned adequately: a ρ value that is too large narrows the search range, whereas one that is too small does not lead to a condensed solution. We tested three different tasks to assess the performance of quantum Adam in comparison with classical Adam, and the results demonstrate that quantum Adam can provide fairly good performance. We emphasize that the most important feature of quantum Adam is its generalization performance. In machine learning, the purpose of improving learning is nothing other than enhancing the generalization performance within limited epochs and computational resources. In quantum Adam, the elastic term aggregates the DNNs during learning. This effect may work to prevent sudden falls into sharp valleys; in other words, when most of the DNNs are in a wide minimizer, the others do not tend to fall into the sharp minimizer, which can lead to improved generalization performance.

Quantum Adam uses M-replicated DNNs, which may seem excessive. However, when processing very large datasets, one distributes the batches over a number of processors or GPUs and establishes a consensus to obtain DNNs with high generalization performance. Our present method is too computationally expensive to implement in the ordinary environments used in a wide range of research efforts, although it might be useful for learning large datasets in parallel computing environments; in this sense, our algorithm may be helpful even on classical computers. In future research, we shall test quantum Adam in a parallel computing environment with a large dataset comprising high-dimensional components, and propose a further simplified algorithm by elucidating the most significant part of the quantum fluctuations, as in previous studies9,26.

We remark on the time complexity of quantum Adam. The standard assessment of the time complexity of QA is performed by estimating the energy gap of the time-dependent Hamiltonian. In our case, through the Suzuki–Trotter decomposition, the problem is reduced to an optimization problem for a cost function with continuous variables. In terms of the rate of convergence to a minimum in the feasible set, the classical Adam method has a convergence rate of \(O(1/\sqrt{T})\), as shown in the literature6, and we believe that a similar analysis can be performed for quantum Adam. In addition, we emphasize that the most important feature of quantum Adam is its generalization performance. In this sense, the present study opens a new aspect of QA: pursuing not the minimum of the cost function itself, but a different notion of optimality measured by a different indicator.

Finally, in the present study, we demonstrated a potential power of quantum fluctuations as employed in QA: they promote the "quality" of the solution obtained by optimization. The performance of an optimization solver is usually evaluated by the cost function itself; in particular, the performance of QA has been discussed in terms of the decrease of the cost function. However, robustness of the solution can be attained by optimizing the cost function in conjunction with the local entropy, as discussed in the literature9,26. Optimization with quantum fluctuations automatically and potentially leads to such robustness, as discussed in the present study. In the context of machine learning, the generalization performance is precisely the robustness of the solution. In the future, a deeper understanding of quantum fluctuations should promote various approaches in machine learning and beyond.

Path integral representation

Using the Suzuki–Trotter decomposition, we formulate the path-integral representation. Let us start with the following expression of the Suzuki–Trotter decomposition:

$$Z={\rm{Tr}}\{\exp (-\beta V(\hat{{\bf{w}}})-\frac{\beta {\hat{{\bf{p}}}}^{2}}{2\rho })\}={\rm{Tr}}\{\prod _{k=1}^{M}\exp (-\frac{\beta }{M}V(\hat{{\bf{w}}}))\exp (-\frac{\beta {\hat{{\bf{p}}}}^{2}}{2\rho M})\}.$$
(22)

We insert the complete sets \(\int d{{\bf{w}}}_{k}|{{\bf{w}}}_{k}\rangle \langle {{\bf{w}}}_{k}|\) and \(\int d{{\bf{p}}}_{k}|{{\bf{p}}}_{k}\rangle \langle {{\bf{p}}}_{k}|\), where \(\hat{{\bf{w}}}|{{\bf{w}}}_{k}\rangle ={{\bf{w}}}_{k}|{{\bf{w}}}_{k}\rangle \) and \(\hat{{\bf{p}}}|{{\bf{p}}}_{k}\rangle ={{\bf{p}}}_{k}|{{\bf{p}}}_{k}\rangle \). Then we obtain

$$Z=\int d{{\bf{w}}}_{0}\langle {{\bf{w}}}_{0}|\int {\mathscr{D}}{\bf{w}}{\mathscr{D}}{\bf{p}}\prod _{k=1}^{M}\{\exp (-\frac{\beta }{M}V(\hat{{\bf{w}}}))|{{\bf{w}}}_{k}\rangle \langle {{\bf{w}}}_{k}|\exp (-\frac{\beta {\hat{{\bf{p}}}}^{2}}{2\rho M})|{{\bf{p}}}_{k}\rangle \langle {{\bf{p}}}_{k}|\}|{{\bf{w}}}_{0}\rangle \mathrm{.}$$
(23)

This expression can be reduced to

$$Z\propto \int d{{\bf{w}}}_{0}\int {\mathscr{D}}{\bf{w}}{\mathscr{D}}{\bf{p}}\prod _{k\mathrm{=1}}^{M}\{\exp (-\frac{\beta }{M}V({{\bf{w}}}_{k}))\exp (i{{\bf{p}}}_{k}({{\bf{w}}}_{k}-{{\bf{w}}}_{k-1}))\exp (-\frac{\beta {{\bf{p}}}_{k}^{2}}{2\rho M})\}$$
(24)

where we have used

$$\langle {{\bf{w}}}_{k^{\prime} }|{{\bf{p}}}_{k}\rangle \propto \exp (i{{\bf{p}}}_{k}\cdot {{\bf{w}}}_{k^{\prime} })\mathrm{.}$$
(25)

Manipulating the Gaussian integral with respect to \({{\bf{p}}}_{k}\) yields

$$Z\propto \int d{{\bf{w}}}_{0}\int {\mathscr{D}}{\bf{w}}\prod _{k=1}^{M}\exp (-\frac{\beta }{M}V({{\bf{w}}}_{k})-\frac{M\rho }{2\beta }{\Vert {{\bf{w}}}_{k}-{{\bf{w}}}_{k-1}\Vert }_{2}^{2})\mathrm{.}$$
(26)

Strong limit of ρ(t)

First, we consider the Fourier transformation of the discrepancy from the center of the weights w*:

$${{\bf{w}}}_{k}={{\bf{w}}}^{\ast }+\frac{1}{\sqrt{M}}\sum _{r=0}^{M-1}{{\bf{a}}}_{r}{{\rm{e}}}^{i2\pi kr/M},$$
(27)

where \({{\bf{a}}}_{r}^{\ast }={{\bf{a}}}_{M-r}\) because \({{\bf{w}}}_{k}\) is a real vector. The elastic term is then diagonalized as

$$\sum _{k=1}^{M}{\Vert {{\bf{w}}}_{k}-{{\bf{w}}}_{k-1}\Vert }_{2}^{2}=2\sum _{r=1}^{[M/2]}{{\bf{a}}}_{r}{{\bf{a}}}_{M-r}(1-\,\cos (\frac{2\pi r}{M})),$$
(28)

where we have used \({\sum }_{k=0}^{M-1}{{\rm{e}}}^{i2\pi kr/M}=M{\delta }_{r,0}\). When \(\rho (t)\gg 1\), the exponentiated elastic term is reduced to

$$\prod _{k=1}^{M}\exp (-\frac{M\rho (t)}{2\beta }{\Vert {{\bf{w}}}_{k}-{{\bf{w}}}_{k-1}\Vert }_{2}^{2})=\prod _{r=1}^{[M/2]}\exp (-\frac{M\rho (t)}{\beta }{{\bf{a}}}_{r}{{\bf{a}}}_{M-r}(1-\,\cos (\frac{2\pi r}{M})))\mathrm{.}$$
(29)

We find that \({{\bf{a}}}_{r}\) follows a Gaussian distribution. We then perform the inverse Fourier transformation and attain

$$\prod _{k=1}^{M}\exp (-\,\frac{M\rho (t)}{2\beta }{\Vert {{\bf{w}}}_{k}-{{\bf{w}}}_{k-1}\Vert }_{2}^{2})=\exp (-\frac{\beta }{2}\sum _{k,k^{\prime} }({{\bf{w}}}_{k}-{\bf{w}}){V}_{kk^{\prime} }({{\bf{w}}}_{k^{\prime} }-{\bf{w}}))\mathrm{.}$$
(30)

In the limit M → ∞, we set x = 2πr/M and dx = 2π/M, and obtain

$$\frac{1}{\beta }{V}_{kk^{\prime} }^{-1}=\sum _{r}\frac{\beta }{2M\rho (1-\,\cos (\frac{2\pi r}{M}))}{{\rm{e}}}^{i2\pi (k-k^{\prime} )r/M}=\frac{\beta }{2\rho }{\int }_{0}^{2\pi }\frac{dx}{2\pi }\frac{{{\rm{e}}}^{i(k-k^{\prime} )x}}{1-\,\cos \,x}\mathrm{.}$$
(31)