Breaking limitation of quantum annealer in solving optimization problems under constraints

Quantum annealing is a generic solver for optimization problems that uses fictitious quantum fluctuations. The most groundbreaking progress in the field of quantum annealing is its hardware implementation, i.e., the so-called quantum annealer, using artificial spins. However, the connectivity between the artificial spins is sparse and limited to a special network known as the chimera graph. Several embedding techniques have been proposed, but they drastically reduce the number of logical spins, which represent the optimization problem to be solved. In particular, an optimization problem including fully or even partly connected spins suffers from a small embeddable size on the chimera graph. In the present study, we propose an alternative approach for solving a large-scale optimization problem on the chimera graph via a well-known method in statistical mechanics called the Hubbard-Stratonovich transformation or its variants. The proposed method can deal with a fully connected Ising model without embedding it on the chimera graph and leads to nontrivial results of the optimization problem. We tested the proposed method on a number partition problem, on problems involving solving linear equations, and on the traffic flow optimization problem in the Sendai and Kyoto cities of Japan.


Problem Setting
In QA, we formulate the optimization problem as the target Hamiltonian. In addition to the target Hamiltonian, we employ the driver Hamiltonian, which generates the quantum fluctuations driving the system. We consider the case with the target Hamiltonian f(q), where q = (q_1, q_2, ..., q_N) and q_i is a binary variable taking 0 or 1. The binary variables can be written in terms of the z-components of the Pauli matrices representing the logical qubits, namely the Ising variables, as q_i = (1 + σ_i)/2. Against the longitudinal Ising spin variables, the introduction of a transverse field generates superposition to search for the ground state efficiently and solve various optimization problems. In practical use of QA, several sets of the variables must satisfy constraints. As is often the case, squared terms expressing the constraints appear; this is called the penalty method in the context of optimization problems. We assume here that the target Hamiltonian consists of several summations of squared terms representing the constraints, F_i(q) = C_i for all i, together with the other terms f_0(q), as
$$ f(q) = f_0(q) + \sum_i \lambda_i \left( F_i(q) - C_i \right)^2, $$
where λ_i is a predetermined parameter set to a relatively large value because the squared terms express the constraints on the Ising variables. The squared terms yield fully connected interactions among the Ising variables because F_i(q) often consists of a summation over several elements of q.
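As a concrete illustration of the penalty method described above, the following sketch builds H(q) = f_0(q) + Σ_i λ_i (F_i(q) − C_i)² for a toy instance (all field values and the constraint are illustrative, not from the paper) and checks by brute force that a sufficiently large λ enforces the constraint:

```python
import itertools

def penalty_hamiltonian(f0, constraints, lam):
    """H(q) = f0(q) + sum_i lam_i * (F_i(q) - C_i)^2.

    f0: callable on a binary tuple q; constraints: list of (F_i, C_i) pairs.
    """
    def H(q):
        return f0(q) + sum(l * (F(q) - C) ** 2
                           for l, (F, C) in zip(lam, constraints))
    return H

# Toy instance: minimize sum_i h_i q_i subject to sum_i q_i = 2.
h = [0.3, -0.1, 0.8, -0.5]
f0 = lambda q: sum(hi * qi for hi, qi in zip(h, q))
H = penalty_hamiltonian(f0, [(lambda q: sum(q), 2)], lam=[10.0])

# Brute force over all 2^4 configurations: the penalty forces sum(q) = 2,
# so the two most negative fields are selected.
best = min(itertools.product((0, 1), repeat=4), key=H)
print(best)  # → (0, 1, 0, 1)
```

Note that the penalty couples every pair of variables inside the constraint, which is exactly the fully connected interaction discussed in the text.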
We introduce several examples appearing in QA. In a practical application of the quantum annealer to the reduction of traffic flow 28, the cost function takes the form
$$ f(q) = \lambda \sum_i \left( \sum_\mu q_{\mu,i} - 1 \right)^2 + \sum_e \left( \sum_{\mu,i} S_{e,\mu,i}\, q_{\mu,i} \right)^2, \qquad (2) $$
where q_{μ,i} = (1 + σ_{μ,i})/2 represents the selection of the μ-th route for the i-th car, and S_{e,μ,i} denotes the occupation of the road segment e by the μ-th route of the i-th car. The first squared term enforces the selection of a single route per car; to avoid traffic congestion, another squared term is implemented as the second term. Furthermore, the cost function for inferring the N-dimensional original signal q^{(0)} from an output consisting of the M linear combinations y_k = Σ_i a_{k,i} q_i^{(0)}, as in wireless communications and signal processing, is written as
$$ f(q) = \sum_{k=1}^{M} \left( y_k - \sum_{i=1}^{N} a_{k,i}\, q_i \right)^2. \qquad (4) $$
One of the fascinating applications of QA is Q-Boost 52, which selects relevant weak classifiers to gain performance by combining them. The squared terms stem from the penalty method for the constraints as
$$ f(q) = \sum_\mu \left( \sum_k q_k\, u_k(x_\mu) - y_\mu \right)^2, $$
where u_k(x_μ) denotes the weak classifier, x_μ is the data vector, and y_μ is its label. If the number of selected classifiers is set to K, the additional squared term
$$ \left( \sum_k q_k - K \right)^2 $$
is employed. In addition, the cost function itself is often expressed in squared form. One example is the number partition problem, given as
$$ f(\sigma) = \left( \sum_i n_i \sigma_i \right)^2, \qquad (7) $$
where n_i denotes a component of the numbers to be divided into two groups. One group is assigned σ_i = +1 and the other σ_i = −1. The summation Σ_i n_i σ_i is desired to be zero so that the components are divided into two groups with equal sums. As exemplified above, to solve various optimization problems on the quantum annealer through the formulation of the quadratic unconstrained binary optimization (QUBO) problem, we often implement squared terms. However, the squared terms result in fully connected interactions among the Ising variables included in them.
The fully connected interactions prevent efficient computation in the current version of the quantum annealer because of its limited connectivity between artificial spins. For example, we have to embed the original optimization problem with fully connected interactions into a sparse graph, known as the chimera graph, on the superconducting chip in the case of the D-Wave 2000Q. Then, unfortunately, although the D-Wave 2000Q has over 2000 active physical qubits, the number of logical qubits that can be dealt with is reduced to about 60 in the worst case. This is one of the crucial bottlenecks of the current version of the quantum annealer. A different type of special-purpose hardware implementing the Ising model, namely CMOS annealing, has the same bottleneck 47, whereas the Fujitsu digital annealer is free from the problem of connectivity 48. To avoid the difficulty of the squared terms, a driver Hamiltonian different from the transverse field was proposed in previous studies 53,54. The authors succeeded in enhancing the performance of QA, but the current version of the quantum annealer cannot employ their method. Below, we mitigate this difficulty by tackling the optimization problem with squared terms using a standard method in statistical mechanics. The method proposed below is available on the current version of the quantum annealer.

Reduction of Squared Terms
A well-known technique for reducing the squared terms to linear ones is the Hubbard-Stratonovich transformation or its variants 45,46. First, we take the partition function expressing the equilibrium state governed by our target Hamiltonian as
$$ Z = \sum_{q} \exp\left( -\beta f(q) \right), $$
where β is the inverse temperature. Here, we perform the Hubbard-Stratonovich transformation of each squared term, introducing an auxiliary continuous variable ν_k per constraint. The resulting partition function is
$$ Z \propto \int \prod_k \mathrm{d}\nu_k \sum_q \exp\left( -\beta \left[ f_0(q) - \sum_k \nu_k \left( F_k(q) - C_k \right) + \sum_k \frac{\nu_k^2}{4\lambda_k} \right] \right). $$
We obtain an effective Ising model with linear terms in the constraints and continuous variables ν = (ν_1, ν_2, ⋯), namely the Lagrange multipliers. The effective Hamiltonian is
$$ H_{\rm eff}(q; \nu) = f_0(q) - \sum_k \nu_k \left( F_k(q) - C_k \right), $$
where the term quadratic in ν_k vanishes in the limit λ_k → ∞. The remaining problem is the minimization of the effective Hamiltonian instead of the original Hamiltonian. This is the same technique for dealing with constraints in optimization problems as the Lagrange multiplier method, whereas the original formulation employing squared terms is the penalty method. In the penalty method, we have to take relatively large values of the coefficients λ_k to enforce the constraints. However, large coefficients become obstacles for the current version of the quantum annealer because of its limited range and resolution of the coefficients. In our formulation, we can take the limit λ_k → ∞ in a straightforward way: instead of a large coefficient, the adaptive change of the multipliers ν_k maintains the constraints. As a "dual" problem, the effective Hamiltonian for ν can be obtained as
$$ H_{\rm eff}(\nu) = \sum_k \nu_k C_k - \frac{1}{\beta} \log Z(\nu), $$
where Z(ν) is the effective partition function defined as
$$ Z(\nu) = \sum_q \exp\left( -\beta \left[ f_0(q) - \sum_k \nu_k F_k(q) \right] \right). $$
Obviously, the effective Hamiltonian for ν is highly nontrivial. In other words, the complexity of the original optimization problem remains even in the dual problem with continuous variables. The extremum of the effective Hamiltonian gives the saddle point of the integrand in the partition function when we take the limit β → ∞.
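The dual structure above can be checked numerically on a toy instance; here exhaustive enumeration of the 2^N configurations stands in for the annealer, f_0 is a linear field term with illustrative values, and there is a single cardinality constraint F(q) = Σ_i q_i = C:

```python
import itertools, math

h = [0.3, -0.1, 0.8, -0.5]      # illustrative fields for f0(q) = sum h_i q_i
C, beta = 2, 50.0
confs = list(itertools.product((0, 1), repeat=len(h)))
f0 = lambda q: sum(hi * qi for hi, qi in zip(h, q))
F = lambda q: sum(q)

def G(nu):
    # Dual function G(nu) = nu*C - (1/beta) * log Z(nu), with
    # Z(nu) = sum_q exp(-beta * (f0(q) - nu * F(q))).
    logZ = math.log(sum(math.exp(-beta * (f0(q) - nu * F(q))) for q in confs))
    return nu * C - logZ / beta

# Crude grid search for the maximizer of the concave dual (log Z is convex).
nu_star = max((i / 100 - 1 for i in range(200)), key=G)

# At large beta, G(nu_star) approaches the constrained minimum of f0.
best = min((q for q in confs if F(q) == C), key=f0)
print(nu_star, G(nu_star), f0(best))
```

For this instance the dual value at the maximizer agrees with the constrained minimum f_0 = −0.6 up to thermal corrections of order 1/β, illustrating that the complexity of the original problem is hidden inside the evaluation of log Z(ν).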
The saddle-point equation is given as
$$ \langle F_k(q) \rangle_{\nu} = C_k, $$
where the bracket denotes the expectation with respect to the probability distribution of q conditioned on the value of ν. Notice that, in general, the free energy of so-called spin-glass models, as discussed in the context of optimization problems, has many local minima. Thus, the saddle point is not unique. This is a consequence of the non-monotonic behavior of ⟨F_k(q)⟩ as a function of ν in spin-glass models. We emphasize that the complexity of finding the ground state remains even with our method. In this sense, our method is strongly dependent on the form of f_0(q).
Our remaining problem is to attain the saddle point by gradually changing the value of ν. To find the saddle point, one may utilize the steepest ascent method. We take β → ∞, and the expectation value is then evaluated by the Ising spin configuration in the ground state. Thus, the sampling of the spin configuration can be performed by use of the D-Wave 2000Q. In particular, a practical optimization problem has a nontrivial cost function f_0(q); then the computation of the expectation value ⟨F_k(q)⟩ is hard, and the special-purpose machine is valuable for mitigating this difficulty. Notice that our technique is not restricted to the use of the D-Wave 2000Q. It is also helpful for the Fujitsu digital annealer and the CMOS annealing chip to enhance the precision with which the constraints are satisfied.
Instead of the direct manipulation of the Hubbard-Stratonovich transformation, we may consider the variational free energy, namely the Gibbs free energy. Let us consider the target Hamiltonian without the squared terms for the constraints, namely f_0(q). Then, the Gibbs-Boltzmann distribution is given as
$$ P(q) = \frac{1}{Z} \exp\left( -\beta f_0(q) \right). $$
We introduce the Kullback-Leibler divergence to measure the distance between the trial distribution Q and P as
$$ D_{\rm KL}(Q \| P) = \sum_q Q(q) \log \frac{Q(q)}{P(q)}. $$
Here, we consider the trial distribution Q with minimum distance from P under the condition that the expectations satisfy the constraints
$$ \langle F_k(q) \rangle_Q = C_k \quad \forall k. $$
The minimization of the KL divergence under these constraints yields the Gibbs free energy. We introduce the Lagrange multipliers ν for solving this minimization problem under the constraints. The minimizer, depending on the Lagrange multipliers ν, can be attained in a straightforward way as
$$ Q(q; \nu) = \frac{1}{Z(\nu)} \exp\left( -\beta \left[ f_0(q) - \sum_k \nu_k F_k(q) \right] \right). $$
Then, the Gibbs free energy is written as
$$ G(\nu) = \sum_k \nu_k C_k - \frac{1}{\beta} \log Z(\nu). $$
This corresponds to the extremization of the effective Hamiltonian for ν. The Gibbs free energy is the starting point of the Plefka expansion, which establishes a systematic way to solve the Ising spin-glass model beyond the level of mean-field theory; there, one considers the weak-interaction limit for computing the logarithmic term. In contrast, we have an efficient sampler for estimating the expectation of the Ising spin-glass model, such as the D-Wave 2000Q. Therefore, we do not need any approximation to proceed with our formulation for solving the optimization problem under several constraints. When F_k(q) consists of a summation over several binary variables, all we have to do is apply longitudinal magnetic fields to realize the resultant effective Hamiltonian. Thus, we construct a simple algorithm to solve the optimization problem as follows: • Initialize the Lagrange multipliers ν_{t=0}.
• Compute the gradient and update the Lagrange multipliers as
$$ \nu_k^{t+1} = \nu_k^{t} + \eta \left( C_k - \langle F_k(q) \rangle_{Q_t} \right), \qquad (20) $$
where η is the step width of the gradient method to achieve the maximization and Q_t is the trial distribution with ν = ν_t. Let us briefly describe the above procedure for the case with a constraint on the simple summation, namely F_k(q) = Σ_i q_i. The initial condition, for instance, is set so that there are no biases on the system.
The multiplier ν_k increases when C_k > ⟨F_k(q)⟩_{Q_t} and vice versa. The convergence depends on the update rate η. One may utilize a line search for an optimal choice of η to attain convergent behavior efficiently. To satisfy the constraints, we need to estimate the expectation of the Ising spin glass with the Hamiltonian f_0(q) − Σ_k ν_k F_k(q), in which the squared terms for the constraints are absent. This is much easier to implement on the D-Wave 2000Q and the CMOS annealing chip with the finite connectivity of their graphs. Notice that several optimization problems are written only in terms of the squared terms, namely f_0(q) = 0. Then, the effective Hamiltonian consists only of linear terms, namely local magnetic fields. In that case, it is not necessary to use a special-purpose device such as the D-Wave 2000Q to generate samples of a nontrivial Hamiltonian. From this perspective, our technique is most valuable for the case with a nontrivial f_0(q).
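The update loop above can be sketched as follows. Since this toy example takes f_0 linear, the ground state of the effective Hamiltonian is available in closed form without any sampler, and a diminishing step width stands in for the line search mentioned in the text; the instance data are illustrative:

```python
import random

random.seed(0)
N, K = 10, 3
h = [random.uniform(-1, 1) for _ in range(N)]   # f0(q) = sum_i h_i q_i
F = lambda q: sum(q)                            # constraint: F(q) = K

def ground_state(nu):
    # Effective Hamiltonian f0(q) - nu*F(q) = sum_i (h_i - nu) q_i is linear,
    # so each spin minimizes its own term independently (beta -> infinity).
    return tuple(1 if hi - nu < 0 else 0 for hi in h)

nu = 0.0
for t in range(200):
    q = ground_state(nu)
    # nu_{t+1} = nu_t + eta_t * (C - <F(q)>), with a diminishing step width.
    nu += 0.1 / (1 + t) * (K - F(q))

q = ground_state(nu)
print(sorted(i for i, qi in enumerate(q) if qi))  # → [3, 5, 7]
```

Once the multiplier settles between the K-th and (K+1)-th smallest fields, the constraint holds exactly and the K smallest h_i are selected, which is the optimal solution.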

Experiments
We test our method on various problems. The first experiment is the selection of the K-minimum set of N random values. The original cost function is written as
$$ f(q) = \sum_{i=1}^{N} h_i q_i + \lambda \left( \sum_{i=1}^{N} q_i - K \right)^2, \qquad (21) $$
where h_i takes a random value following the uniform distribution. We set N = 2000 and K = 5. The squared term in Eq. (21) often appears in applications of the quantum annealer to optimization problems under constraints. The standard approach for solving optimization problems such as Eq. (21) using the D-Wave 2000Q is to embed them on the chimera graph, which allows up to 64 logical variables. In contrast, our technique can deal with 2000 logical variables directly. In this case, because the effective Hamiltonian contains only linear terms, we do not necessarily need sampling from the special-purpose devices; this is just a test for validating our technique. In addition, the first term in Eq. (21) is very simple, and the exact solution is attained in a straightforward manner, so we can check the validity of our technique. The initial condition is set as ν_0 = 0. We choose the step width for the steepest-ascent update by a line search in all the cases shown below. As shown in Fig. 1, we confirm that our technique can select the K-minimum set from the N random variables and reach the optimal solutions. We plot the residual energy, which is the difference between the cost function and its minimum value.
The second experiment is performed on the number partition problem in Eq. (7). The effective Hamiltonian is
$$ H_{\rm eff}(\sigma; \nu) = -\nu \sum_i n_i \sigma_i. \qquad (22) $$
In Fig. 2, we plot the cost function (7), not the value of the effective Hamiltonian (22), at each step of our method. In addition, the lower panel shows the multiplier ν_t. We employ the steepest ascent to attain the saddle point. The saddle point is expected to be around ν = 0 for the number partition problem, because even a tiny magnetic field forces all the spins to point in the positive or negative direction, whereas the optimal solution is randomly oriented. The initial condition is set to ν_0 = 0.2. The result confirms that the optimal solution can be obtained by our technique.
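For reference, the number-partition cost of Eq. (7) can be checked by brute force on a small instance (the numbers below are illustrative, not from the paper's experiments):

```python
import itertools

# Illustrative instance of the number partition problem.
n = [4, 5, 6, 7, 8]
cost = lambda s: sum(ni * si for ni, si in zip(n, s)) ** 2  # Eq.-(7)-type cost
best = min(itertools.product((-1, 1), repeat=len(n)), key=cost)
diff = sum(ni * si for ni, si in zip(n, best))
print(best, diff)  # a perfect partition exists: {4, 5, 6} vs {7, 8}, diff = 0
```

This also illustrates the degeneracy remarked on above: any uniform magnetic field −ν Σ_i n_i σ_i breaks the balanced partitions, so the saddle point must sit near ν = 0.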
The third example is solving linear equations; in other words, the inference of the N-dimensional input from the M linear combinations in Eq. (4). The effective Hamiltonian is
$$ H_{\rm eff}(q; \nu) = -\sum_{k=1}^{M} \nu_k \sum_{i=1}^{N} a_{k,i}\, q_i, $$
with the constraints Σ_i a_{k,i} q_i = y_k. When the mean squared error (MSE) and the residual energy become zero, perfect reconstruction is realized.
The fourth example is essentially the same problem as the previous one; however, the original input has the two-dimensional structure shown in Fig. 4. This problem typically emerges in compressed sensing, where an original input is reconstructed from an insufficient number of outputs using its sparsity or structure as prior information. Let us take an input q^0 as in Fig. 4: two-dimensionally structured data in which all the non-zero components are connected to each other. For α = 0.6, where the number of outputs is too small to recover the original input, we employ the following Hamiltonian to infer the original input:
$$ H(q) = -\sum_{\langle i,j \rangle} q_i q_j + \sum_{k=1}^{M} \lambda_k \left( y_k - \sum_{i=1}^{N} a_{k,i}\, q_i \right)^2, \qquad (24) $$
where the first (ferromagnetic) term, taken over neighboring sites ⟨i,j⟩ on the two-dimensional lattice, promotes the connected structure. In this case, we regard the first term as f_0(q). Then, sampling using the D-Wave 2000Q is efficient for evaluating the expectation value in the update Eq. (20). As shown in Fig. 5, we demonstrate that our method solves the optimization problem written in Eq. (24) even for an insufficient number of outputs, M/N < 0.633.
The fifth example is the traffic flow optimization problem. Following the previous study 28, we extract the route data from OpenStreetMap via osmnx 56. We prepare candidate routes for each car by the shortest path and its variants. We then assign the binary variables q_{μ,i} to each car i and its route μ. Each car selects a single route by satisfying the constraints in Eq. (2). Instead of applying our reduction to the constraints, we may decompose the quadratic congestion term in f_0(q) as follows:
$$ f_0(q) = \sum_e \left( \sum_{\mu,i} S_{e,\mu,i}\, q_{\mu,i} \right)^2 \;\longrightarrow\; \sum_{\mu,i} h_{\mu,i}\, q_{\mu,i}, $$
up to terms independent of q. Then, the effective Hamiltonian is reduced to an Ising model in local magnetic fields,
$$ H_{\rm eff}(q; \nu) = \sum_{\mu,i} h_{\mu,i}\, q_{\mu,i}, \qquad h_{\mu,i} = \sum_e \nu_e S_{\mu,i,e}. $$
Owing to the reduction of f_0(q) instead of the constraints, notice that we can easily attain the expectation values under the effective Hamiltonian without any sampling method: under the single-route constraint, each car independently selects the route minimizing its field, namely
$$ \langle q_{\mu,i} \rangle = \begin{cases} 1 & \mu = \mathop{\rm arg\,min}_{\mu'} h_{\mu',i}, \\ 0 & \text{otherwise}. \end{cases} \qquad (27) $$
Then, the update equation for each ν leads to a reasonable solution of the traffic flow optimization problem. However, the original optimization problem has many local minima. We may instead sample the binary variables following the effective Hamiltonian while tuning the Lagrange multipliers ν. Below, we compare (i) the deterministic way using the expectation (27), (ii) classical sampling following the Gibbs-Boltzmann distribution, and (iii) sampling by the D-Wave 2000Q. The deterministic way quickly converges to local minima of the cost function. In the context of statistical mechanics, the deterministic way corresponds to the level of mean-field analysis; in this sense, it is a crude way to find an approximate solution. The other two methods go beyond the mean-field level because sample fluctuations occur. We utilize the classical sampling following the Gibbs-Boltzmann distribution and the sampling by the D-Wave 2000Q to select the route μ for each car i depending on the value of h_{μ,i}, while the outputs satisfy the constraints.
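The deterministic way can be sketched as follows on synthetic data. The segment loads play the role of the multipliers ν_e, and each sweep reassigns every car to the route with the smallest field h_{μ,i}; all sizes and the segment-occupation data are illustrative stand-ins for the OpenStreetMap routes used in the paper:

```python
import random

random.seed(1)
n_cars, n_routes, n_segments = 30, 3, 40
# uses[(mu, i)]: the road segments occupied by route mu of car i
uses = {(mu, i): set(random.sample(range(n_segments), 8))
        for i in range(n_cars) for mu in range(n_routes)}

def load_of(choice):
    load = [0] * n_segments
    for i, mu in enumerate(choice):
        for e in uses[(mu, i)]:
            load[e] += 1
    return load

choice = [0] * n_cars          # start from route 0 for every car
load = load_of(choice)
for _ in range(10):            # a few deterministic sweeps
    for i in range(n_cars):
        for e in uses[(choice[i], i)]:   # remove car i's own contribution
            load[e] -= 1
        # field h_{mu,i} = sum_e nu_e S_{mu,i,e}, with nu_e played by the
        # current segment loads; the car takes the route of smallest field
        choice[i] = min(range(n_routes),
                        key=lambda mu: sum(load[e] for e in uses[(mu, i)]))
        for e in uses[(choice[i], i)]:
            load[e] += 1

cost = sum(l * l for l in load)                    # congestion sum_e load_e^2
baseline = sum(l * l for l in load_of([0] * n_cars))
print(cost, baseline)          # the sweeps do not increase the congestion
```

Because every candidate route here occupies the same number of segments, each reassignment is an exact coordinate-descent step on the quadratic congestion cost, which is why this greedy procedure converges quickly to a local minimum, as the text notes.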
The essential difference between the two procedures lies in the intermediate dynamics. The classical sampling is based on hopping between the feasible solutions satisfying the constraints. In contrast, the sampling by the D-Wave 2000Q is driven by the quantum tunneling effect. The difference between the two sampling methods appears in the performance of the resulting solutions: as far as our observations in this problem setting are concerned, the latter leads to slightly better solutions than the former.
The number of cars is set to 350, with 3 candidate routes for each car. The candidate routes are extracted from actual maps. When we implement the optimization problem straightforwardly, the system contains 1050 spins with fully connected interactions, which cannot be solved directly by the D-Wave 2000Q. As a reference, we give the solution from the Fujitsu digital annealer, because the original optimization problem is difficult to implement directly on the D-Wave 2000Q. We tune λ to attain the best solution from the Fujitsu digital annealer.
As shown in Fig. 6, we obtain lower-energy solutions using the D-Wave 2000Q than with the deterministic way and the classical sampling. The results shown in Fig. 6 satisfy the constraint of selecting a single route for each car, because in this case we do not apply our reduction to the quadratic term representing the constraints. If we instead use our method to find solutions satisfying the constraints, a longer time is needed to attain feasible solutions.
We test our method on traffic flow optimization in Sendai city. Sendai city and nearby areas suffered a tsunami disaster after the big earthquake in 2011; an optimal solution would provide appropriate information for evacuation while avoiding traffic jams. The attained solutions are plotted in Fig. 7. For comparison, we also plot the shortest-path policy, in which each car selects the shortest path between its starting point and destination. As a reference, the resulting cost function is 920697 by the deterministic way, 848671 by the classical sampling, and 830309 by our method with the D-Wave 2000Q, whereas the shortest-path policy results in 1050159.
In addition, we also tested our method in Kyoto city, as shown in Fig. 8. In this case, the attained cost function is 1602847 by the deterministic way, 1288513 by the classical sampling, and 1284577 by our method with the D-Wave 2000Q, whereas the shortest-path policy results in 1782220.
As demonstrated above, we solved the traffic-flow optimization problem exceeding the directly embeddable size on the D-Wave 2000Q. The precision of the results is at essentially the same level as that of the Fujitsu digital annealer, which can directly solve the optimization problem with a large number of binary variables.
To investigate how many steps our method typically takes, we ran it 1000 times each on the number partition problem (7) and on the inference of the N-dimensional input (4). Because we chose these examples, we know the ground state a priori. As shown in Fig. 9, all the runs converge to the ground state using our method, typically within several dozen iterations.

[Figure 6 caption] Energy at each step of our method for optimizing the traffic flow in Sendai city. The same symbols are used as in Fig. 2 except for the lines: the red line denotes the result obtained by the deterministic way after a few steps, the green line the minimum value obtained by the classical sampling during 10 steps, and the black line the reference result attained by direct use of the Fujitsu digital annealer.
As the last example, we take a simple problem with double constraints, as is often seen in several practical optimization problems. The original cost function is written as
$$ f(q) = \sum_{i,t} h_{i,t}\, q_{i,t} + \lambda \sum_i \left( \sum_t q_{i,t} - 1 \right)^2 + \lambda \sum_t \left( \sum_i q_{i,t} - 1 \right)^2, \qquad (28) $$
where the h_{i,t} are randomly generated values, L is the linear size of the system, and the number of spins is N = L × L. This is a simplified version of double-constraint problems such as the traveling salesman problem, in which an agent visits each city i at exactly one time t. To enforce this rule, the cost function includes the double constraints as the second and third terms of Eq. (28). If we naively use the D-Wave 2000Q to solve this problem, the number of spins is limited to N = 64 and thus the number of cities to L = 8. We instead consider the random-field Ising model with the double constraints in Eq. (28) to confirm the advantage of our method for satisfying such hard constraints. We implement the following effective Hamiltonian,
$$ H_{\rm eff}(q; \nu, \nu') = \sum_{i,t} h_{i,t}\, q_{i,t} - \sum_i \nu_i \sum_t q_{i,t} - \sum_t \nu'_t \sum_i q_{i,t}, \qquad (29) $$
and can then deal with a number of spins that drastically increases, up to L = 45, namely N = 2025. As shown in Fig. 10, we test our method to find the ground state of the original cost function (28).
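The role of the double-constraint multipliers can be seen on a tiny instance. Because the linearized Hamiltonian Σ_{i,t} h_{i,t} q_{i,t} − Σ_i ν_i Σ_t q_{i,t} − Σ_t ν'_t Σ_i q_{i,t} contains no interactions, its ground state is simply q_{i,t} = 1 iff h_{i,t} − ν_i − ν'_t < 0. The 3×3 fields and the multipliers below are hypothetical values chosen by hand (a stand-in for the gradient updates) to show that suitable multipliers make the ground state satisfy both constraints at once:

```python
# Illustrative 3x3 random-field instance (hand-picked values).
h = [[0.3, -0.2, 0.5],
     [-0.4, 0.1, -0.6],
     [0.2, -0.3, -0.1]]
nu = [0.315, -0.35, 0.2]    # row multipliers (hypothetical)
mu = [-0.02, -0.51, -0.28]  # column multipliers (hypothetical)

# Ground state of the linear effective Hamiltonian: threshold each field.
q = [[1 if h[i][t] - nu[i] - mu[t] < 0 else 0 for t in range(3)]
     for i in range(3)]
row_sums = [sum(q[i]) for i in range(3)]
col_sums = [sum(q[i][t] for i in range(3)) for t in range(3)]
cost = sum(h[i][t] * q[i][t] for i in range(3) for t in range(3))
print(q, row_sums, col_sums, cost)
# Both constraints hold (every row and column sum equals 1), and the
# selected cells form the minimum-cost assignment for this instance.
```

For these multipliers the ground state is the permutation selecting cells (0,1), (1,0), (2,2), with cost −0.7, which is the optimal assignment of this instance; the gradient updates of the multipliers search for exactly such a point.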

Summary
We have proposed a technique to transform various optimization problems with squared terms into ones with only linear terms using the Hubbard-Stratonovich transformation. The squared terms hamper efficient computation when special-purpose hardware such as the D-Wave 2000Q is used to solve the optimization problems. Our method mitigates the difficulty of dealing with the squared terms: instead of direct manipulation, we iteratively solve an optimization problem with linear terms plus the nontrivial terms. We took various examples to test our method. The first is the selection of K variables under a random field, the second the number partition problem, the third the solution of linear equations, and the fourth the reconstruction of structured data. These are optimization problems whose goal is to find feasible solutions satisfying the constraints. Although our method can attain feasible solutions, it can take a long time to converge to them because a number of Lagrange multipliers need to be tuned; shortening this convergence is important for the practical application of our method. Apart from these four examples, the fifth is an application of our method to finding a lower-energy solution satisfying the constraints. A part of the original optimization problem is reduced to linear terms, which become local fields. In this case, we do not treat the constraints by our method, because it is easy to satisfy them under only the local fields; our method then leads to feasible solutions with lower energy. To attain feasible solutions, we proposed three methods: the first is the deterministic way of finding local minima; the second is classical sampling that jumps between feasible solutions; the third is sampling by the D-Wave 2000Q over binary variables that do not necessarily satisfy the constraints, so that the range of the search for optimal solutions is wide.
Thus, the third method efficiently finds better solutions than the first and second ones.
In addition, these applications generalize the straightforward application of our approach to a target Hamiltonian without any constraints. Setting f_0(q) = 0 and C = 0, the target Hamiltonian is written as
$$ f(q) = q^{\rm T} A q = q^{\rm T} U \Lambda U^{\rm T} q = \sum_{k=1}^{N} \lambda_k \left( u_k^{\rm T} q \right)^2, $$
where A is the QUBO matrix, Λ is the diagonal matrix of the eigenvalues λ_k (k = 1, 2, ⋯, N) of A, U is the orthogonal matrix diagonalizing A, and u_k denotes the k-th column of U. Each squared term can then be linearized by a multiplier ν_k as above. In this case, the saddle-point equation reads
$$ \nu_k = 2 \lambda_k \left\langle u_k^{\rm T} q \right\rangle_{\nu}. $$
Then, the Taylor expansion of the Gibbs free energy with respect to A leads to the Plefka expansion, and the expansion up to second order yields a saddle-point equation corresponding to the TAP equation. In this sense, our approach is a generalization of the mean-field analysis.
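The identity q^T A q = Σ_k λ_k (u_k^T q)² underlying this generalization can be verified numerically; the 2×2 symmetric matrix below has a closed-form spectrum and is purely illustrative:

```python
import math, itertools

# Symmetric QUBO matrix with known spectrum: eigenvalues 3 and 1,
# eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2).
A = [[2.0, 1.0], [1.0, 2.0]]
lams = [3.0, 1.0]
s = 1 / math.sqrt(2)
U = [[s, s], [s, -s]]                  # rows are the eigenvectors u_k

for q in itertools.product((0, 1), repeat=2):
    direct = sum(A[i][j] * q[i] * q[j] for i in range(2) for j in range(2))
    spectral = sum(lam * sum(u[i] * q[i] for i in range(2)) ** 2
                   for lam, u in zip(lams, U))
    assert abs(direct - spectral) < 1e-12
print("q^T A q matches sum_k lambda_k (u_k^T q)^2")
```

Each term (u_k^T q)² is a squared linear form, so the same Hubbard-Stratonovich linearization applies to it, eigenmode by eigenmode.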
In general, we may solve the optimization problem by changing the Lagrange multipliers iteratively. On the D-Wave 2000Q, fully connected interactions can be dealt with for up to 64 binary variables; using our method, however, we can solve QUBO problems, including the Sherrington-Kirkpatrick model, a typical problem in spin-glass theory, with up to 2048 variables. Sampling under fixed local fields is an easy task. However, the sampling with local fields that change according to the Lagrange multipliers has a history, and this history crucially affects the performance of the resulting solutions. We need another ingredient to mitigate the effect of this history, as proposed in the TAP equation, in order to solve the Ising spin-glass problem more efficiently beyond the naive mean-field theory. Possibly, the quantum tunneling effect might remove the effect of the history; as far as our experience is concerned, we find better solutions from the D-Wave 2000Q than from the classical way of sampling. This will be detailed in a future study.
As pointed out in the previous section, the performance of our method strongly depends on the form of f_0(q). All the cases tested in the present study have simple forms: f_0(q) = 0 or linear combinations. In the traveling salesman problem, the interaction term is f_0(q) = Σ_t Σ_{i,j} d_{i,j} q_{i,t} q_{j,t+1}, where d_{i,j} is the distance between different cities i and j. Because d_{i,j} > 0, the Griffiths inequality might not hold in this case. In other words, ⟨F_k(q)⟩ is not necessarily a monotonically increasing function of ν. Therefore, the current version of our method might not be capable of efficiently reaching the ground state for typical hard optimization problems.
We again emphasize that the original optimization problems solved in our study, which have fully connected interactions, cannot be embedded on the D-Wave 2000Q. In this sense, our method is a step toward handling more difficult tasks on the D-Wave 2000Q by reducing the squared terms that generate the fully connected interactions. We thereby reveal the potential not only of the D-Wave 2000Q but also of the CMOS annealing chip: with our method, they no longer suffer from the embedding of the optimization problem on the sparse graph imposed by the limitation of each piece of hardware. In addition, our method makes it possible to deal with four-body interactions: by reducing the four-body interactions to squared terms of two-body interactions via diagonalization, we obtain an effective two-body interacting system. In this sense, our method reveals the capability to solve a wide range of Ising models using special-purpose hardware. Moreover, our method is not restricted to solving optimization problems. Because our technique is based on statistical mechanics, we can utilize it to perform efficient sampling at low temperatures. We can thus find the hidden potential of special-purpose hardware not only for solving optimization problems but also for Boltzmann machine learning.