Introduction

Quantum artificial intelligence and quantum machine learning are emerging fields that leverage quantum information processing to address certain types of problems that are intractable in the realm of classical computing1,2,3. There are several models for the physical realization of quantum computers4. Among the quantum computing models, adiabatic quantum computers are currently more readily available at user sites due to recent advancements in commercializing programmable quantum annealers by D-Wave Systems5.

Quantum annealing is a meta-heuristic that (instead of thermal fluctuations) employs adjustable quantum fluctuations into a problem6,7,8,9,10,11. Thermal annealing (a.k.a. simulated annealing or classical annealing) can be very ineffective, compared to quantum annealing, because: (1) the landscape of the given Hamiltonian can be too glassy and there exist high energy barriers around local minimums that can trap the system for a very long time; and (2) a classical (or nonquantum) system can only assume one configuration at a time while the number of configurations in discrete optimization problems (i.e., combinatorial optimization problems) can grow exponentially with the number of variables9. Quantum annealing can bypass very high energy barriers, when they are narrow enough, which can address the ergodicity problem to some extent9,12,13,14,15. In addition, at some stage of annealing, quantum annealing can see the whole landscape simultaneously that can provide much faster relaxation to the ground state of the given Hamiltonian9.

Quantum annealers are a type of adiabatic quantum computer that provides a hardware implementation for finding the minimum energy configuration of Hamiltonians whose ground states represent optimum solutions of the original problems of interest. The D-Wave quantum annealer is a programmable Ising processing unit (IPU) that can find the minimum of the (stoquastic) Ising Hamiltonian or its equivalent quadratic unconstrained binary optimization (QUBO) form5,16. More precisely, the D-Wave quantum annealer receives coefficients of an Ising Hamiltonian (here h and J) as an executable quantum machine instruction (QMI) and returns the vector z that minimizes the following quadratic energy function:

$${E}_{{\rm{Ising}}}({\bf{z}})=\mathop{\sum }\limits_{i=1}^{N}\,{{\bf{h}}}_{i}{{\bf{z}}}_{i}+\mathop{\sum }\limits_{i=1}^{N}\,\mathop{\sum }\limits_{j=i+1}^{N}\,{J}_{ij}{{\bf{z}}}_{i}{{\bf{z}}}_{j},$$
(1)

where N denotes the number of quantum bits (qubits) and zi {−1, +1}. To solve a problem on a D-Wave quantum annealer, therefore, one needs to define an Ising Hamiltonian, shown in Eq. (1), whose ground state represents a solution for the original problem of interest16,17.

The current generation of D-Wave quantum annealers (i.e., the Chimera architecture) includes more than 2,000 qubits and about 6,000 couplers, while the next generation (the Pegasus topology of the Advantage) will include more than 5,000 qubits and about 40,000 couplers18. Recent studies have revealed the potential of quantum annealers (namely the D-Wave quantum processors) to address certain classes of real-world problems that are intractable in the realm of classical computing19— including, but not limited to, planning20, scheduling21,22, discrete optimization problems23, constraint satisfaction problems24, Boolean satisfiability problem (SAT)25,26, matrix factorization27, cryptography28, fault detection and system diagnosis29, compressive sensing30,31, control of automated vehicles32 and protein folding33. In addition, by sampling from high-dimensional probability distributions, one can use the D-Wave quantum annealers for many applications in artificial intelligence, machine learning and signal processing19,34,35.

Beside all aforementioned applications, the D-Wave quantum annealer architecture has limitations that not only restrict the process of mapping problems into an executable QMI (namely the sparse connectivity of the Chimera topology) but also lower the quality of results—i.e., the energy value of resulting samples (attained by the quantum annealer) is higher than the ground state of the given Ising Hamiltonian. In other words, casting a problem to an Ising Hamiltonian that represents the solution of the problem in its ground state does not guarantee that executing the corresponding QMI on a quantum annealer—like the D-Wave quantum processing unit (QPU)—will attain a global optimum.

For a given QMI, the D-Wave QPU draws samples from a problem-dependent pseudo-Boltzmann distribution at cryogenic temperatures19. The energy values of samples from the D-Wave QPU follow a Gaussian distribution. Thus, when we increase the number of reads/samples, we expect that the average parameter in the corresponding Gaussian distribution to approach the ground state energy of the corresponding Ising Hamiltonian—i.e., the probability of finding the global minimum approaches one. There are several drawbacks, nevertheless, that prevent quantum annealers from attaining a global minimum—including, but not limited to, confined anneal time, coefficients’ range and precision limitations, noise, and decoherence. From a quantum computing perspective, an adiabatic quantum computer needs to search for the ground state of a non-stoquastic Hamiltonian in order to be universal (which would make them equivalent to gate models); nevertheless, the D-Wave QPU samples from the ground state(s) of an Ising Hamiltonian which is stoquastic11,16,36.

Since coupling every qubit to every other qubit in a quantum annealer is impractical, the D-Wave QPU has a sparse structure/architecture—so-called Chimera topology. Hence, we entangle multiple qubits to represent virtual qubits with higher connectivity. Chaining physical qubits substantially reduces the capacity of QPUs—e.g., 2,048 qubits in the Chimera architecture is equivalent to a clique of size 64. It is possible to implicitly leverage the capacity of the current D-Wave QPUs37, albeit executing multiple QMIs. In addition, virtual qubits are vulnerable to breaking—the longer the chains, the higher the probability they break during the annealing process. Although we can remediate broken chains by applying postprocessing methods on classical computers (e.g., voting among the physical qubits on a chain), some chains break because they represent a state with lower energy.

The required anneal time in a quantum annealer to keep the process adiabatic has a reverse exponential relation to the energy gap between the ground state (global minimum) and the first excited state (a state right above the global minimum)9,11. In the current generation of the D-Wave QPU, hi {−2, +2} and Jij {−1, +1}. Thus, one needs to scale the resulting Ising model, Eq. (1), by dividing all coefficients with a large-enough positive number to satisfy the QPU hardware constraints. Although scaling coefficients does not alter the ground state of the corresponding Ising model, it reduces the energy gap between the ground and the first excited states. As a result, the required annealing time can quickly exceed the maximum possible anneal time on a physical quantum annealer (for example 2,000 microseconds on the D-Wave QPUs) and makes the process diabatic, which exponentially reduces the probability of getting to the ground state11,16.

The current generation of the D-Wave QPUs uses 8–9 bits for representing coefficients in Eq. (1). Hence, the D-Wave QPU truncates coefficients of a QMI prior to putting qubits in their superposition, which can result in the Ising model having a different ground state—compared to the original QMI. Consequently, the D-Wave QPU may solve a different problem whose result is either infeasible or less accurate than the original problem of interest38,39. Applying preprocessing techniques40 and classical postprocessing heuristics41 can remarkably enhance the performance of the D-Wave QPU; however, from a problem-solving viewpoint, the D-Wave quantum annealer cannot guarantee to achieve a global optimum.

In this study, we view quantum annealers from two different perspectives simultaneously: (1) a meta-heuristic for solving discrete optimization problems that can find very high-quality solutions in near-constant time; and (2) a physical process that naturally draws samples from a problem-dependent Boltzmann distribution at cryogenic temperatures. Unlike most current research in quantum artificial intelligence that applies quantum computing models to hard AI problems, in this paper, we explore how we might apply AI techniques to improve quantum information processing.

Learning automata (LA)42 are adaptive decision-making models (i.e., type of reinforcement learning43) that try to maximize the accumulative reward when they are interacting with stochastic environments. In a similar manner to reinforcement learning, LA use Markov decision processes for representing the automaton-environment structure42,43,44. In a learning automaton, an (intelligent) agent has a set of r actions (denoted by α = {α1, α2, …, αr}) and each action has a corresponding probability (denoted by pi and ∑pi = 1). At each episode, the agent takes (applies) the action αi (according to p), and (correspondingly) the stochastic environment returns its feedback β that specifies the performance evaluation of the action αi. The agent uses this feedback to learn from the environment and aims to take optimal actions over time. For β [0, 1] —so-called S-Type learning automata—in episode t, if the agent takes the action αi and receives the feedback βt, we can update p as follows:

$${{\bf{p}}}_{j}^{t+1}=\{\begin{array}{cc}{{\bf{p}}}_{j}^{t}-{\theta }_{2}(1-{\beta }^{t}){{\bf{p}}}_{j}^{t}+{\theta }_{1}{\beta }^{t}(1-{{\bf{p}}}_{j}), & i=j;\\ {{\bf{p}}}_{j}^{t}+{\theta }_{2}(1-{\beta }^{t})\left(\frac{1}{r-1}-{{\bf{p}}}_{i}^{t}\right)-{\theta }_{1}{\beta }^{t}{{\bf{p}}}_{j}, & i\ne j,\end{array}$$
(2)

where β = 0 represents the lowest action performance and β = 1 represents the highest action performance, θ1, θ2 [0, 1] are learning factors, and i, j {1, 2, …, r}42.

This paper presents the Reinforcement Quantum Annealing (RQA) scheme that leverages the idea of learning automata to iteratively improve the quality of results, attained by the quantum annealers, and implicitly address the limitations of physical quantum annealers. RQA views quantum annealing as an atomic process and it does not offer to modify/alter the annealing of quantum effects. In fact, each iteration of RQA includes a complete cycle of problem-solving with quantum annealers. At each iteration of RQA, we re-cast the given problem to a new Ising Hamiltonian, according to samples/results from the previous iteration, and annealing of quantum effects is identical in all iterations. From a problem-solving perspective, RQA searches the space of Hamiltonians of a given problem, to find the optimal one, rather than (repeatedly) exploring the Hilbert space of a given Hamiltonian. As a proof-of-concept, we first introduce a novel approach for casting the Boolean satisfiability problem (SAT)45 to Ising Hamiltonians and then demonstrate that adopting the proposed RQA scheme results in notably better solutions.

Results

In this section, we aim to evaluate the performance of RQA scheme on solving benchmark SAT instances, and compare it with recent software and hardware enhancements to the quantum annealers. For every SAT instance, we used the number of unsatisfied clauses as the metric for performance comparisons. In this study, we used Z3 (from Microsoft Research) as a framework for symbolic computing implementations46 and we executed each QMI on the D-Wave 2000Q quantum annealer, located at Burnaby, British Columbia.

For every SAT instance, we used inequalities (11), (17), (14) and (15) to represent the given SAT instance as a system of inequalities. Afterward, we solved problem (16) for casting the SAT to an executable QMI on a D-Wave quantum processor. Solving problem (16) will result in an Ising Hamiltonian which is not necessarily compatible with the D-Wave hardware graph (Chimera topology for the current generation). Therefore, we applied the minor-embedding heuristic47 for embedding the problem to the physical lattice of qubits on a D-Wave QPU. To avoid the impact of chaining physical qubits in our evaluations, we employed fixed embeddings of cliques in all instances—i.e., we used the pre-defined embeddings of cliques for the chimera architecture.

Recent studies have revealed that using spin-reversal transforms (a.k.a. gauge transforms)—i.e., flipping the qubits randomly without altering the ground state of the original Ising Hamiltonian—can reduce analog errors of the quantum annealers40. Thus, as a preprocessing technique, we applied spin-reversal transforms prior to submitting the QMIs to the physical QPU. We also put a delay between measurements to reduce the sample-to-sample correlation, albeit longer run-time.

To remediate possible broken chains in the resulting raw samples from the D-Wave QPU, we performed voting among the physical qubits of chains. After unembedding samples (i.e., representing variables in the original problem domain), we applied the multi-qubit correction (MQC) heuristic41 which has demonstrated a significant ability to improve the probability of finding the global minimum, attained by the D-Wave QPU. Finally, we performed a local search heuristic, so-called single-qubit correction (SQC), to construct the final solution of the given SAT41.

Experiment A: factoring pseudo-prime numbers

In number theory, the problem of integer factoring refers to decomposing a composite integer number into the product of smaller integers, and prime factorization restricts these factors to prime numbers. Although there are debates on the class (or complexity) of this problem, there is no known efficient (non-quantum) algorithm for factoring numbers in polynomial-time48.

In this study, we use the problem of prime factorization as a benchmark to evaluate the performance of RQA. It is worth noting that our research objective in this paper is not to set a new record for quantum factorized integers, which for the current generation of the D-Wave quantum annealers is 1,005,97328. Indeed, since the security of the modern public-key cryptography systems (like RSA) mainly relies on the difficulty of factoring very large pseudo-prime numbers28,49, we relied on the difficulty of prime factorization problem for generating benchmark SAT instances. Let f(x1, x2) be a Boolean function as follows:

$${\bf{q}}=f({{\bf{x}}}_{1},{{\bf{x}}}_{2})={{\bf{x}}}_{1}\times {{\bf{x}}}_{2},$$
(3)

where \({{\bf{x}}}_{1}\in {\mathrm{\{0,}\mathrm{1\}}}^{{n}_{1}}\) and \({{\bf{x}}}_{2}\in {\mathrm{\{0,}\mathrm{1\}}}^{{n}_{2}}\) are integer-valued numbers in binary representation (here, x1, x2 ≥ 2), and the multiply operator is in binary base—each element of the vector q is a Boolean function of x1 and x2. Assume that \(\hat{{\bf{q}}}\) is a pseudo-prime integer number in binary base (i.e., \(\hat{{\bf{q}}}\) has two prime factors. We can map the problem of factoring \(\hat{{\bf{q}}}\) to SAT as follows

$$g={f}_{{\rm{SAT}}}({\bf{q}},\hat{{\bf{q}}})=\underset{i=1}{\overset{n}{\wedge }}\neg ({{\bf{q}}}_{i}\oplus {\hat{{\bf{q}}}}_{i}),$$
(4)

where n = n1 + n2 denotes the length of q. We can look at the process of generating SAT instances from a reverse-engineering viewpoint. To this end, we generated pseudo-prime numbers via multiplying two prime numbers, and represented them in binary base (denoted by \(\hat{{\bf{q}}}\)). For each instance, we then used the Eq. (4) to map the factorization of \(\hat{{\bf{q}}}\) to a satisfiable Boolean formula. Since g is a Boolean expression of x, we applied the Tseitin transformation50 to represent g in conjunctive normal form (CNF)45. We also performed pre-processing techniques, namely “ctx-solver-simplify“, “recover-01“, “propagate-values“ and “reduce-args” tactics from Z351. Note that applying the Tseitin transformation can increase the size of g linearly, due to defining auxiliary variables. Since the capacity of the current D-Wave 2000Q quantum processors is limited to a complete graph of size 63, we eliminated SAT instances (in CNF) with more than 63 Boolean variables which resulted in 136 satisfiable SAT instances.

Figure 1 illustrates results—minimum (circles), maximum (triangles), average and variance of the number of unsatisfied clauses—for solving these 136 satisfiable SAT instances, and compares the performance of the proposed RQA scheme with quantum annealing (QA) and quantum annealing with multiple post-quantum processes (SMQC). To enhance the standard quantum annealing technique, we used two spin-reversal-transforms40, as well as the delay between measurements to reduce the inter-sample correlation. In the second method (SMQC), we first used the multi-qubit correction (MQC) method41, in problem variable level—which is the state-of-the-art technique in the realm of post-quantum correction for quantum annealers—and then applied a local search to maximize the quality of results, attained by the SMQC arrangement.

Figure 1
figure 1

Experiment results for solving 136 satisfiable SAT instances (with at most 63 Boolean variables) for factoring pseudo-prime numbers with quantum annealing (QA), quantum annealing with classical post-processing (SMQC) and reinforcement quantum annealing (RQA).

To update the influence factors of clauses in RQA, Eq. (10), we used θ1 = 0.1 and θ1 = 0. Learning automata generally require a notable number of episodes to converge to an optimal (or sub-optimal) policies. In this experiment, nevertheless, the agent terminates the process after at most T = 10 episodes (due to QPU time limitations) or finding a solution that satisfies all clauses. Hence, we formed a hall-of-fame—a set of final solutions from all episodes—and applied MQC (followed by SQC) on them to obtain the ultimate solution of RQA. Our empirical observations showed that this technique can implicitly address the limited number of allowed episodes in RQA. Note that RQA utilizes at most the same number of samples as QA and SMQC. In other words, the pure (quantum) annealing time of RQA was at most equal to QA and SMQC.

Experiment B: uniform random 3-SAT with phase transitions

Sampling from the phase transition region of uniform Random 3-SAT is a common practice for generating benchmark SAT (and MAX-SAT) problems52,53,54,55. In this experiment, as our second study case, we used the satisfiable benchmark test-set of uniform random 3-SAT with phase transitions56. Considering the capacity of the current generation of the D-Wave quantum annealers—we can embed a clique of size at most 63 on chimera architecture—we employed the test-set with 50 Boolean variables.

Figure 2 demonstrates results—minimum (circles), maximum (triangles), average and variance of the number of unsatisfied clauses—for solving the first 100 instances from the benchmark test-set, and (similar to the previous experiment) compares the performance of the proposed RQA scheme with quantum annealing (QA) and quantum annealing with multiple post-quantum processes (SMQC). The setting for this experiment was identical to the previous experiment, except the number of instances (136 vs. 100) and the number of variables (variant vs. 50).

Figure 2
figure 2

Experiment results for solving 100 satisfiable uniform random 3-SAT instances with phase transitions—each SAT instance contains 50 Boolean variable—using quantum annealing (QA), quantum annealing with classical post-processing (SMQC) and reinforcement quantum annealing (RQA).

Experiment C: run-time evaluation

In this experiment, we aim to evaluate the average run-time of RQA scheme and compare it with QA and SMQC approaches. We implemented all experiments in Python 3.7.4, and executed them on a 64-bit Windows 10 based system with 32 GB RAM and Intel Xeon processor at 3.00 GHz. Figure 3 shows the average run-time of solving 100 SAT instances for QA, SMQC and RQA methods on a D-Wave 2000Q quantum processor. Note that, in this experiment, we did not include the computation time for finding the embedding of the QMI on a working graph—we used the pre-defined embeddings of cliques for the chimera architecture. For all instances, we used two spin-reversal transforms and we also enabled the inter-sample delay between samples’ reads.

Figure 3
figure 3

Average run-time for solving 100 SAT instances with quantum annealing (QA), quantum annealing with classical post-processing (SMQC) and reinforcement quantum annealing (RQA) approaches on a D-Wave 2000Q quantum processing unit.

Discussion

In this study, we introduced a novel scheme—called reinforcement quantum annealing (RQA)—that leverages reinforcement learning (more specifically learning automata) to enhance the quality of results, attained by the quantum annealers. RQA has an iterative scheme that is independent of the architecture of the annealer. In other words, we look at the annealing process (here the quantum annealing) as a black box (or an atomic instruction). Hence, one can use RQA on top of any (quantum) annealing process. It is worth noting that our initial evaluations on applying RQA on classical (thermal) annealing did not show notable improvement. As an example, Eq. (1) represents the Ising Hamiltonian, which is stoquastic, that the current generations of the D-Wave quantum annealers sample from its ground state. When the next generations of quantum annealers are able to explore the landscape of a non-stoquastic Hamiltonian, we expect that one will be able to introduce RQA to supplement the new quantum annealer.

Ramezanpour (2018) has proposed to improve the simulated quantum annealing algorithm by adding reinforcement to the standard quantum annealing algorithm57. Simulated quantum annealing is an iterative algorithm (similar to simulated annealing) that is implemented and run on classical computers. On the other hand, quantum annealers are physical, single-instruction, quantum processing units where the entire annealing process is atomic. Thus, we cannot modify or adjust the annealing process after starting the annealing (i.e., putting quantum bits on their superposition). in RQA, similar to the standard reinforcement learning scheme, iterations emulate the interactions between an agent and its environment. Ramezanpour’s method, however, has an adaptive optimization scheme in which iterations simulate one quantum annealing process on a classical computer and is not applicable to physical quantum annealers.

Note that RQA does not offer to modify/adjust the annealing of quantum effects. In fact, we look at the quantum annealing process as a black box (i.e., an atomic process). We first cast the original problem of interest to an Ising Hamiltonian and then employ a quantum annealer to sample from its ground state(s). Afterward, at each iteration of RQA, we look at the results (samples) of the previous iteration and adjust the penalty of unsatisfied constraints. We then re-cast the problem (according to our knowledge from previous iterations/annealing) and (again) employ a quantum annealer to sample from the ground state of the new Ising Hamiltonian. From a problem-solving viewpoint, Herr et al. (2017) showed how to optimize the schedule of the annealing process. However, we offer to optimize the process of casting a given problem to the spin glass problem—albeit running multiple quantum annealing processes—without changing the schedule. In other words, in all experiments, we used a fixed annealing schedule and the annealing time was 20 microseconds in all executions.

Moreover, RQA aims to implicitly address limitations of physical quantum annealers that might not be a limitation in simulated quantum annealing. As an example, if the energy gap between the ground and first excited states is reduced close to the end of the annealing cycle, according to the Anderson localization limitation58, RQA may also require exponentially large annealing time. As a proof-of-concept, we proposed a novel method for casting SAT to Ising Hamiltonians and then demonstrated that applying the proposed RQA scheme (on a D-Wave quantum annealer) results in notably better solutions. It is worth highlighting that, however, the proposed approach (i.e., hybridization of reinforcement learning and quantum annealing) is applicable to a vast range of classic AI problems like constraint satisfaction, planning, and scheduling.

We applied the proposed RQA scheme on two different SAT problem sets, and compared its performance with quantum annealing (QA) and quantum annealing with post-quantum error corrections (so-called SMQC) which is the state-of-the-art in the realm of quantum annealing41,59,60. The first problem set includes 136 satisfiable SAT instances which represent factoring pseudo-prime numbers that have at most 63 Boolean variables, in CNF representation. Besides the length of the given composite numbers, the difficulty of integer factoring also depends on the properties of integer numbers. The hardest instances of this problem are factoring pseudo-prime numbers (product of two prime numbers) whose factors have the same size (in binary base). SAT instances in experiment A are not the hardest cases of prime-factoring— restricting the SAT instances in experiment A to a composite of the same size prime factors resulted in only eight problems. It is worth noting that our main objective in this experiment was not to address the prime factorization nor to use quantum annealers for solving SAT or MAX-SAT problems.

Figure 1 illustrates results—minimum, maximum, average and variance of a number of unsatisfied clauses—of solving these 136 satisfiable SAT instances for 100, 500, 1,000, 5,000 and 10,000 samples. In RQA, increasing the number of samples from 100 to 10,000 reduces the average number of unsatisfied clauses from 1.57 to 1.40. Similarly, in QA and SMQC, the average number of unsatisfied clauses is reduced from 9.41 and 3.55 to 6.85 and 3.17, respectively. Although increasing the number of samples in all three methods reduces the average number of unsatisfied clauses, RQA with 100 samples outperforms both QA and SMQC approaches in all arrangements (even with 10,000 samples). It is worth highlighting that QA was not able to satisfy all clauses of any of these 136 SAT instances, even when we requested 10,000 samples, while both SMQC and RQA methods were able to find a satisfying solution for at least one of the instances in all cases. From a robustness viewpoint, increasing the number of samples from 100 to 10,000 lowers the variance and range (the difference between maximum and minimum) of QA, SMQC and RQA from 4.28, 2.60 and 0.86 to 3.21, 1.73 and 0.70, respectively. Therefore, RQA demonstrated better robustness (i.e., higher reproducibility rate), compared to both QA and SMQC approaches.

In the second case study, shown in Fig. 2, we used the first 100 SAT instances of the satisfiable benchmark test-set of uniform random 3-SAT with phase transitions56. Similar to Figs. 1 and 2 demonstrates that RQA with 100 samples outperforms both QA and SMQC approaches in all settings. More specifically, increasing the number of samples from 100 to 10,000 in QA, SMQC and RQA decreases the average number of unsatisfied clauses from 8.02, 1.65 and 0.79 to 6.39, 1.28 and 0.68, respectively. Also, the variances are reduced from 6.52, 0.53 and 0.31 to 4.43, 0.42 and 0.20, respectively. Thus, the minimum number of unsatisfied clauses in RQA for all settings is zero while SMQC needed at least 5,000 samples to satisfy all clauses of at least one SAT instance.

Note that RQA is an iterative scheme—RQA executes multiple QMIs for a given problem; hence, in all experiments, we restricted the total number of samples in RQA not to exceed the sample size of QA and SMQC. As an example, when QA and SMQC requested 1,000 samples, RQA (with T = 10 iterations) asked for only 100 samples in each iteration.

Problem-solving with a D-Wave quantum processor, in practice, requires some preprocesses (e.g., embedding and gauge transforms) and post-processes (like error correction and broken-chain remediation) that extends the total run-time from 20—2000 microseconds (pure annealing time) to some seconds and even minutes. Run-time in many pre/post-processing techniques mainly depends on the number of samples. It is a common practice in applying quantum annealers to request a few thousand (at most 10,000 per QMI) samples to increase the probability of finding the global minimum. Figure 3 demonstrates that increasing the number of samples increases the total run-time of RQA at a significantly lower rate (compared to QA and more specifically SMQC)—since a request of 10 times fewer samples in each iteration of RQA has less pre/post-processing overhead. As an illustration, 100 samples (10 samples in each iteration of RQA), increases the total run-time 7.6× and 6.8× more than QA and SMQC, respectively. However, increasing the number of samples to 10,000, (1,000 samples in each iteration of RQA) only increases the total run time by 4.4× and 1.7×, compared to QA and SMQC, respectively. In both experiments A and B, RQA with 100 reads (i.e., repeating the QA 100 times) was able to find a sample with lower energy, compared to QA and SMQC with 10,000 samples. On the other hand, the total computation time (quantum annealing plus classical postprocessing) of RQA with 100 reads is close to SMQC with 10,000 reads. As a conclusion, RQA with 100 reads utilizes less quantum annealing time, finds samples with lower energy and has the same total run-time, compared to SMQC with 10,000 reads.

Methods

Assume that the given problem ∏ that we aim to solve on a quantum annealer contains a finite set of constraints (components), denoted by πi for i {1, 2, …, M}, over the same variables as follows:

$$\Pi \,:=\{{\pi }_{i},{\pi }_{2},\ldots ,{\pi }_{M}\},$$
(5)

where M indicates the number of constraints and our ultimate objective is to find a solution that addresses (satisfies) all constraints. Let Hi be an Ising Hamiltonian whose ground state represents a solution for πi and H be the corresponding Ising Hamiltonian of ∏ that all are acting on the same spins (variables). In addition, let \({E}_{0}^{{H}_{\Pi }}\) be the ground state energy of H and \({E}_{0}^{{H}_{i}}\) be ground state energy of corresponding Hi. If there exists z that puts Hi (i, i {1, 2, …, M}) in their ground states (i.e., satisfies all constraints in ∏), then z also puts H in its ground state61 —in other words,

$${E}_{0}^{{H}_{\Pi }}={E}_{0}^{{H}_{1}}+{E}_{0}^{{H}_{2}}+\ldots +{E}_{0}^{{H}_{M}}.$$
(6)

Hence, we can represent H as follows:

$${H}_{\Pi }={H}_{1}+{H}_{2}+\ldots +{H}_{M}.$$
(7)

This setting appears in a vast range of problem formulations—including, but not limited to SAT26, constraint satisfaction problems62,63, planning and scheduling20,21,22, fault detection and diagnosis24,29, and compressive sensing30,64—specifically when we adopt the idea of penalty methods for casting problems of interest to the spin glass problem.

Theorem 1. For any problem in class NP, there are infinite different Ising models whose ground states are all identical to the solution of the original problem.

Proof. According to Cook—Levin theorem, we can reduce any NP problem to finding the ground state of Hamiltonians (which is also in the class NP) in polynomial-time9,65. Multiplying all coefficients of the Ising model by a positive non-zero real number will result in a new spin glass problem whose ground state will be identical to the original Ising model. Since the number of positive real numbers are infinite, we can generate infinite different Ising models whose ground states represent the solution for the original problem of interest.

According to Theorem 1, there are an infinite number of different Ising Hamiltonians whose ground states all represent the solution of the original problem of interest—nevertheless owing to the range and precision limitations on the D-Wave QPUs, we have a finite number of different Ising models for a given problem. In theory, these different Ising models are equivalent to each other—i.e., an adiabatic annealing process always attains the ground state which is identical for all corresponding Ising models of a problem. In practice, however, each of these (theoretically) equivalent Ising models are analogous to a pseudo-Boltzman distribution whose parameters are different. Consequently, when we minimize the corresponding QMIs with a physical quantum annealer (like the D-Wave QPU), the probability of finding the global minimum for different Ising Hamiltonians of a given problem varies from zero to one. As an example, an annealing process on a D-Wave QPU may become diabatic because the required anneal time exceeds the maximum possible anneal time (2,000 microseconds), which can substantially reduce the probability of finding the global minimum. Note that for a given Ising Hamiltonian, we cannot estimate the probability of finding the ground state prior to executing the corresponding QMI.

We introduce the reinforcement quantum annealing (RQA) scheme, in which an intelligent agent interacts with a quantum annealer, as the stochastic environment of a learning automaton. RQA searches the space of Hamiltonians to iteratively find a better model for the given problem of interest that sampling from its ground state(s), by a quantum annealer, results in a better distribution—i.e., the probability of finding the global optimum is increased over the time. It is worth highlighting that RQA does not offer to alter/modify the annealing of quantum effects—i.e., all quantum annealing processes have an identical schedule. To this end, we extend Eq. (7) as follows:

$${\tilde{H}}_{\Pi }={\tilde{H}}_{1}+{\tilde{H}}_{2}+\ldots +{\tilde{H}}_{M},$$
(8)

such that

$${\tilde{H}}_{i}=\chi ({H}_{i},{\rho }_{i}),$$
(9)

where \({\rho }_{i}\in {\mathbb{R}}\) denotes the impact (or influence) factor of πi (or Hi) and χ is a function that maps the input Hamiltonian to a different Hamiltonian which satisfies:

  • any z that puts Hi in its ground state also puts \({\tilde{H}}_{i}\) in its ground state, and vice versa;

  • if \({\rho }_{i}^{1} < {\rho }_{i}^{2}\) then \(\chi ({H}_{i},{\rho }_{i}^{1})\ge \chi ({H}_{i},{\rho }_{i}^{2})\).

We extend learning automata to allow the agent to take multiple actions in each episode. Let \({\hat{\alpha }}^{t}\subset \alpha \) denotes the set (list) of actions that the agents takes in episode t. We can extend the Eq. (2) as follows:

$${{\bf{p}}}_{j}^{t+1}=\{\begin{array}{cc}{{\bf{p}}}_{j}^{t}-{\theta }_{2}(1-{\beta }^{t}){{\bf{p}}}_{j}^{t}+{\theta }_{1}{\beta }^{t}(1-{{\bf{p}}}_{j}), & {\alpha }_{j}\in {\hat{\alpha }}^{t};\\ {{\bf{p}}}_{j}^{t}-{\theta }_{2}(1-{\beta }^{t})\left(\frac{1}{r-\hat{r}}-{\hat{p}}_{\hat{\alpha }}^{t}\right)-\frac{{\theta }_{1}{\beta }^{t}}{r-\hat{r}}(\hat{r}-{\hat{p}}_{\hat{\alpha }}^{t}), & {\alpha }_{j}\notin {\hat{\alpha }}^{t},\end{array}$$
(10)

where \(\hat{r}=|{\hat{\alpha }}^{t}|\) and,

$${\hat{p}}_{\hat{\alpha }}^{t}=\sum \,{{\bf{p}}}_{i},\,{\rm{for}}\,{\alpha }_{i}\in {\hat{\alpha }}^{t}\mathrm{}.$$

Finally, we leverage multi-task learning automata—let \({\rho }_{i}={{\bf{p}}}_{i}\) and M = r—to propose the RQA scheme. RQA is an iterative process that we can start it with a uniform distribution of influence factors—i.e., \(\rho ={\left\{\frac{1}{M}\right\}}^{M}\). In each iteration, the agent applies Eq. (8) and submits the corresponding QMI to a quantum annealer. After performing the necessary post-processing methods (like remediating broken-chains and applying post-quantum error correction heuristics), the agent estimates β according to the number of satisfied constraints (πi) and employs Eq. (10) to update the influence factor ρ.

Proof of concept: RQA for solving SAT instances

For a given Boolean formula f(x1, x2, …, xn), the problem of Boolean satisfiability (SAT) determines whether a constant replacement of values (“True” or “False”) for all Boolean variables can interpret f as “True”45. From a complexity perspective, SAT is NP-complete, and we can reduce all problems of class NP to SAT in polynomial-time65. The Boolean formula f is in conjunctive normal form (CNF) if it is a conjunction (“AND”) of clauses (i.e., \(f({\bf{x}})={C}_{1}\wedge {C}_{2}\wedge \ldots \wedge {C}_{M}\)), where each clause is a disjunction (“OR”) of literals (a Boolean variable or its negation)—\({C}_{i}={{\bf{l}}}_{i}\vee {{\bf{l}}}_{j}\ldots \vee {{\bf{l}}}_{k}\). The maximum satisfiability problem (MAX-SAT) is an NP-hard extension of the SAT problem that aims to maximize the number of satisfying clauses45,66.

In this section, we adopt the idea of penalty methods for casting SAT to Ising Hamiltonians and show that adopting the proposed RQA scheme can notably improve the probability of finding the global optimum. It is important to highlight that our objective in this study is not to employ quantum annealers for addressing the NP-complete problem of SAT. Moreover, the proposed heuristic is not guaranteed to solve all SAT instances, even if we have access to an ideal quantum annealer. In the same manner, RQA does not offer to bypass the Anderson localization limitation of the adiabatic quantum optimization58.

In this mapping, we aim to find coefficients of Eq. (1) such that the ground state of the resulting Ising Hamiltonian represents the satisfying solution of the original SAT instance. In this formulation, zi represents the Boolean variable xi, and we interpret −1 and +1 as “False” and “True”, respectively. For a clause with k literals, there are 2k different possibilities among which, we can distinguish the only state that makes the clause to be false—called infeasible state. Hence, we represent each clause of the given SAT instance with two inequalities as:

$${D}_{i}\le {E}_{{\rm{infeasible}}},$$
(11)

and,

$${D}_{i}\ge \sum \,{E}_{{\rm{feasible}}},$$
(12)

where \({D}_{i}\in {\mathbb{R}}\) is the boundary variable corresponding to the clause Ci, AND Efeasible and Einfeasible represent the contribution of the feasible and infeasible states in the ultimate energy function, respectively. For \({C}_{i}={{\bf{x}}}_{1}\vee \neg {{\bf{x}}}_{4}\vee {{\bf{x}}}_{9}\), as an example, Eq. (11) reduces to:

$${D}_{i}\le -{{\bf{h}}}_{1}+{{\bf{h}}}_{4}-{{\bf{h}}}_{9}-{J}_{1,4}+{J}_{1,9}-{J}_{4,9},$$

and we can represent Eq. (12) as follows:

$$\begin{array}{rcl}{D}_{i} & \ge & -{{\bf{h}}}_{1}-{{\bf{h}}}_{4}-{{\bf{h}}}_{9}+{J}_{1,4}+{J}_{1,9}+{J}_{4,9}\\ & & -{{\bf{h}}}_{1}-{{\bf{h}}}_{4}+{{\bf{h}}}_{9}+{J}_{1,4}-{J}_{1,9}-{J}_{4,9}\\ & & -{{\bf{h}}}_{1}+{{\bf{h}}}_{4}+{{\bf{h}}}_{9}-{J}_{1,4}-{J}_{1,9}+{J}_{4,9}\\ & & +{{\bf{h}}}_{1}-{{\bf{h}}}_{4}-{{\bf{h}}}_{9}-{J}_{1,4}-{J}_{1,9}+{J}_{4,9}\\ & & +{{\bf{h}}}_{1}-{{\bf{h}}}_{4}+{{\bf{h}}}_{9}-{J}_{1,4}+{J}_{1,9}-{J}_{4,9}\\ & & +{{\bf{h}}}_{1}+{{\bf{h}}}_{4}-{{\bf{h}}}_{9}+{J}_{1,4}-{J}_{1,9}-{J}_{4,9}\\ & & +{{\bf{h}}}_{1}+{{\bf{h}}}_{4}+{{\bf{h}}}_{9}+{J}_{1,4}+{J}_{1,9}+{J}_{4,9}.\end{array}$$

A clause with k literals includes 2k − 1 different feasible states so the size of Eq. (12) grows exponentially with k.

Theorem 2. Sum of the energy values for all possible states in every Ising model is zero.

Proof. Let Z denotes the set of all possible states in Eq. (1). Because spins in the Ising model (here zI) take their values from {−1, +1}, Z is a closed set under the complement operation (i.e., z, −zZ), and |Z| = 2N. Accordingly, sum of the energy values for all possible states in the Ising model is:

$$\begin{array}{ccc}\sum _{{\bf{z}}\in Z}\,{E}_{{\rm{Ising}}}({\bf{z}}) & = & \sum _{{\bf{z}}\in Z}\,(\mathop{\sum }\limits_{i=1}^{N}\,{{\bf{h}}}_{i}{{\bf{z}}}_{i}+\mathop{\sum }\limits_{i=1}^{N}\,\mathop{\sum }\limits_{j=i+1}^{N}\,{J}_{ij}{{\bf{z}}}_{i}{{\bf{z}}}_{j})=\mathop{\sum }\limits_{i=1}^{N}\,\left(\frac{{2}^{N}}{2}{{\bf{h}}}_{i}+\frac{{2}^{N}}{2}(-{{\bf{h}}}_{i})\right)\\ & & +\mathop{\sum }\limits_{i=1}^{N}\,\mathop{\sum }\limits_{j=i+1}^{N}\left(\frac{{2}^{N}}{2}{J}_{ij}+\frac{{2}^{N}}{2}(-{J}_{ij})\right)=0.\end{array}$$

According to Theorem 2, we can rewrite Eq. (12) as follows:

$${D}_{i}\ge -{E}_{{\rm{infeasible}}},$$

that obeys:

$$0\le {D}_{i}.$$
(13)

Note that clauses in CNF representation are connected with the “AND” operator. Hence, after representing each clause of the SAT with two inequalities, Eqs. (11) and (13), we aggregate the resulting sub-systems of inequalities to form a larger system of inequalities. After representing the given SAT instance with M clauses with a system of two million inequalities, we represent the D-Wave hardware restrictions through embedding:

$$-2\le {{\bf{h}}}_{i}\le +2,$$
(14)

and,

$$-1\le {J}_{ij}\le +1,$$
(15)

where i, j {1, 2, …, N} and i < j. Finally, we solve the following objective function:

$$\mathop{{\rm{\max }}}\limits_{{\bf{h}},J,D}\,\mathop{\sum }\limits_{i=1}^{M}\,{D}_{i},$$
(16)

to obtain coefficients of the Ising Hamiltonian, shown in Eq. (1), that is executable by a D-Wave QPU. Note that inequalities (11), (13), (14) and (15) are linear. Thus, the objective function in Eq. (16) is tractable by linear programming and convex optimization techniques. Considering that biases and couplers on a D-Wave QPU are bounded, Eqs. (14) and (15), problem (16) will always converge.

To adopt the proposed RQA scheme, we rewrite the inequality (13) as follows:

$${\rho }_{i}\le {D}_{i},$$
(17)

where ρi denotes the influence factor of the clause Ci. Here, we define ρ as:

$${\rho }_{i}=\frac{1}{M}-{{\bf{p}}}_{i},$$
(18)

where pi is the corresponding probability of constraint πi (here the clause Ci) in Eq. (10). Note that when \({{\bf{p}}}_{i}=\frac{1}{M}\), the inequalities (13) and (17) are identical.

The architecture of the proposed agent contains the following components:

  • Φt—set of unsatisfied clauses in episode t;

  • ρt—tuple of M influence factors in episode t;

  • QMIt—action of the agent in episode t, the Ising Hamiltonian for solving the given SAT instance (according to Φt and ρt);

  • z—perception of the agent from the stochastic environment, resulting sample(s) from executing the QMIt on a quantum annealer.

For a given Boolean formula in CNF, the agent initializes its internal state as:

$${\Phi }^{0}=\varnothing ,$$
$${{\bf{p}}}^{0}={\left\{\frac{1}{M}\right\}}^{M}.$$

In each episode, the agent forms a system of inequalities with Eqs. (11) and (17), and embeds Eqs. (14) and (15). Afterward, the agent solves problem (16), and submits the resulting Ising Hamiltonian (QMIt) to a D-Wave QPU. The environment (here the D-Wave QPU) draws sample(s) from the corresponding pseudo-Boltzmann distribution, and returns the resulting sample(s). The episode ends with updating the internal state of the agent as follows:

$${\Phi }^{{\rm{t}}}={\rm{set}}\,{\rm{of}}\,{\rm{unsatisfied}}\,{\rm{clauses}}\,{\rm{with}}\,{{\bf{z}}}^{t},$$
$${\beta }^{t}=1-\frac{|{\Phi }^{t}|}{M},$$

and (finally) updating probabilities with (10). In RQA, the action of the agent in episode t depends on ρt−1; therefore, Markov property holds here67.