Reinforcement Quantum Annealing: A Hybrid Quantum Learning Automata.

We introduce the notion of reinforcement quantum annealing (RQA) scheme in which an intelligent agent searches in the space of Hamiltonians and interacts with a quantum annealer that plays the stochastic environment role of learning automata. At each iteration of RQA, after analyzing results (samples) from the previous iteration, the agent adjusts the penalty of unsatisfied constraints and re-casts the given problem to a new Ising Hamiltonian. As a proof-of-concept, we propose a novel approach for casting the problem of Boolean satisfiability (SAT) to Ising Hamiltonians and show how to apply the RQA for increasing the probability of finding the global optimum. Our experimental results on two different benchmark SAT problems (namely factoring pseudo-prime numbers and random SAT with phase transitions), using a D-Wave 2000Q quantum processor, demonstrated that RQA finds notably better solutions with fewer samples, compared to the best-known techniques in the realm of quantum annealing.

where N denotes the number of quantum bits (qubits) and z i ∈ {−1, +1}. To solve a problem on a D-Wave quantum annealer, therefore, one needs to define an Ising Hamiltonian, shown in Eq. (1), whose ground state represents a solution for the original problem of interest 16,17 .
The current generation of D-Wave quantum annealers (i.e., the Chimera architecture) includes more than 2,000 qubits and about 6,000 couplers, while the next generation (the Pegasus topology of the Advantage) will include more than 5,000 qubits and about 40,000 couplers 18 . Recent studies have revealed the potential of quantum annealers (namely the D-Wave quantum processors) to address certain classes of real-world problems that are intractable in the realm of classical computing 19 -including, but not limited to, planning 20 , scheduling 21,22 , discrete optimization problems 23 , constraint satisfaction problems 24 , Boolean satisfiability problem (SAT) 25,26 , matrix factorization 27 , cryptography 28 , fault detection and system diagnosis 29 , compressive sensing 30,31 , control of automated vehicles 32 and protein folding 33 . In addition, by sampling from high-dimensional probability distributions, one can use the D-Wave quantum annealers for many applications in artificial intelligence, machine learning and signal processing 19,34,35 .
Beside all aforementioned applications, the D-Wave quantum annealer architecture has limitations that not only restrict the process of mapping problems into an executable QMI (namely the sparse connectivity of the Chimera topology) but also lower the quality of results-i.e., the energy value of resulting samples (attained by the quantum annealer) is higher than the ground state of the given Ising Hamiltonian. In other words, casting a problem to an Ising Hamiltonian that represents the solution of the problem in its ground state does not guarantee that executing the corresponding QMI on a quantum annealer-like the D-Wave quantum processing unit (QPU)-will attain a global optimum.
For a given QMI, the D-Wave QPU draws samples from a problem-dependent pseudo-Boltzmann distribution at cryogenic temperatures 19 . The energy values of samples from the D-Wave QPU follow a Gaussian distribution. Thus, when we increase the number of reads/samples, we expect that the average parameter in the corresponding Gaussian distribution to approach the ground state energy of the corresponding Ising Hamiltonian-i.e., the probability of finding the global minimum approaches one. There are several drawbacks, nevertheless, that prevent quantum annealers from attaining a global minimum-including, but not limited to, confined anneal time, coefficients' range and precision limitations, noise, and decoherence. From a quantum computing perspective, an adiabatic quantum computer needs to search for the ground state of a non-stoquastic Hamiltonian in order to be universal (which would make them equivalent to gate models); nevertheless, the D-Wave QPU samples from the ground state(s) of an Ising Hamiltonian which is stoquastic 11,16,36 .
Since coupling every qubit to every other qubit in a quantum annealer is impractical, the D-Wave QPU has a sparse structure/architecture-so-called Chimera topology. Hence, we entangle multiple qubits to represent virtual qubits with higher connectivity. Chaining physical qubits substantially reduces the capacity of QPUs-e.g., 2,048 qubits in the Chimera architecture is equivalent to a clique of size 64. It is possible to implicitly leverage the capacity of the current D-Wave QPUs 37 , albeit executing multiple QMIs. In addition, virtual qubits are vulnerable to breaking-the longer the chains, the higher the probability they break during the annealing process. Although we can remediate broken chains by applying postprocessing methods on classical computers (e.g., voting among the physical qubits on a chain), some chains break because they represent a state with lower energy.
The required anneal time in a quantum annealer to keep the process adiabatic has a reverse exponential relation to the energy gap between the ground state (global minimum) and the first excited state (a state right above the global minimum) 9,11 . In the current generation of the D-Wave QPU, h i ∈ {−2, +2} and J ij ∈ {−1, +1}. Thus, one needs to scale the resulting Ising model, Eq. (1), by dividing all coefficients with a large-enough positive number to satisfy the QPU hardware constraints. Although scaling coefficients does not alter the ground state of the corresponding Ising model, it reduces the energy gap between the ground and the first excited states. As a result, the required annealing time can quickly exceed the maximum possible anneal time on a physical quantum annealer (for example 2,000 microseconds on the D-Wave QPUs) and makes the process diabatic, which exponentially reduces the probability of getting to the ground state 11,16 .
The current generation of the D-Wave QPUs uses 8-9 bits for representing coefficients in Eq. (1). Hence, the D-Wave QPU truncates coefficients of a QMI prior to putting qubits in their superposition, which can result in the Ising model having a different ground state-compared to the original QMI. Consequently, the D-Wave QPU may solve a different problem whose result is either infeasible or less accurate than the original problem of interest 38,39 . Applying preprocessing techniques 40 and classical postprocessing heuristics 41 can remarkably enhance the performance of the D-Wave QPU; however, from a problem-solving viewpoint, the D-Wave quantum annealer cannot guarantee to achieve a global optimum.
In this study, we view quantum annealers from two different perspectives simultaneously: (1) a meta-heuristic for solving discrete optimization problems that can find very high-quality solutions in near-constant time; and (2) a physical process that naturally draws samples from a problem-dependent Boltzmann distribution at cryogenic temperatures. Unlike most current research in quantum artificial intelligence that applies quantum computing models to hard AI problems, in this paper, we explore how we might apply AI techniques to improve quantum information processing.
Learning automata (LA) 42 are adaptive decision-making models (i.e., type of reinforcement learning 43 ) that try to maximize the accumulative reward when they are interacting with stochastic environments. In a similar manner to reinforcement learning, LA use Markov decision processes for representing the automaton-environment structure [42][43][44] . In a learning automaton, an (intelligent) agent has a set of r actions (denoted by α = {α 1 , α 2 , …, α r }) and each action has a corresponding probability (denoted by p i and ∑p i = 1). At each episode, the agent takes (applies) the action α i (according to p), and (correspondingly) the stochastic environment returns its feedback β that specifies the performance evaluation of the action α i . The agent uses this feedback to learn from the environment and aims to take optimal actions over time. For β ∈ [0, 1] -so-called S-Type learning automata-in episode t, if the agent takes the action α i and receives the feedback β t , we can update p as follows: (2020) 10 (1 ) where β = 0 represents the lowest action performance and β = 1 represents the highest action performance, θ 1 , θ 2 ∈ [0, 1] are learning factors, and i, j ∈ {1, 2, …, r} 42 . This paper presents the Reinforcement Quantum Annealing (RQA) scheme that leverages the idea of learning automata to iteratively improve the quality of results, attained by the quantum annealers, and implicitly address the limitations of physical quantum annealers. RQA views quantum annealing as an atomic process and it does not offer to modify/alter the annealing of quantum effects. In fact, each iteration of RQA includes a complete cycle of problem-solving with quantum annealers. At each iteration of RQA, we re-cast the given problem to a new Ising Hamiltonian, according to samples/results from the previous iteration, and annealing of quantum effects is identical in all iterations. From a problem-solving perspective, RQA searches the space of Hamiltonians of a given problem, to find the optimal one, rather than (repeatedly) exploring the Hilbert space of a given Hamiltonian. As a proof-of-concept, we first introduce a novel approach for casting the Boolean satisfiability problem (SAT) 45 to Ising Hamiltonians and then demonstrate that adopting the proposed RQA scheme results in notably better solutions.

Results
In this section, we aim to evaluate the performance of RQA scheme on solving benchmark SAT instances, and compare it with recent software and hardware enhancements to the quantum annealers. For every SAT instance, we used the number of unsatisfied clauses as the metric for performance comparisons. In this study, we used Z3 (from Microsoft Research) as a framework for symbolic computing implementations 46 and we executed each QMI on the D-Wave 2000Q quantum annealer, located at Burnaby, British Columbia.
For every SAT instance, we used inequalities (11), (17), (14) and (15) to represent the given SAT instance as a system of inequalities. Afterward, we solved problem (16) for casting the SAT to an executable QMI on a D-Wave quantum processor. Solving problem (16) will result in an Ising Hamiltonian which is not necessarily compatible with the D-Wave hardware graph (Chimera topology for the current generation). Therefore, we applied the minor-embedding heuristic 47 for embedding the problem to the physical lattice of qubits on a D-Wave QPU. To avoid the impact of chaining physical qubits in our evaluations, we employed fixed embeddings of cliques in all instances-i.e., we used the pre-defined embeddings of cliques for the chimera architecture.
Recent studies have revealed that using spin-reversal transforms (a.k.a. gauge transforms)-i.e., flipping the qubits randomly without altering the ground state of the original Ising Hamiltonian-can reduce analog errors of the quantum annealers 40 . Thus, as a preprocessing technique, we applied spin-reversal transforms prior to submitting the QMIs to the physical QPU. We also put a delay between measurements to reduce the sample-to-sample correlation, albeit longer run-time.
To remediate possible broken chains in the resulting raw samples from the D-Wave QPU, we performed voting among the physical qubits of chains. After unembedding samples (i.e., representing variables in the original problem domain), we applied the multi-qubit correction (MQC) heuristic 41 which has demonstrated a significant ability to improve the probability of finding the global minimum, attained by the D-Wave QPU. Finally, we performed a local search heuristic, so-called single-qubit correction (SQC), to construct the final solution of the given SAT 41 .
Experiment A: factoring pseudo-prime numbers. In number theory, the problem of integer factoring refers to decomposing a composite integer number into the product of smaller integers, and prime factorization restricts these factors to prime numbers. Although there are debates on the class (or complexity) of this problem, there is no known efficient (non-quantum) algorithm for factoring numbers in polynomial-time 48 .
In this study, we use the problem of prime factorization as a benchmark to evaluate the performance of RQA. It is worth noting that our research objective in this paper is not to set a new record for quantum factorized integers, which for the current generation of the D-Wave quantum annealers is 1,005,973 28 . Indeed, since the security of the modern public-key cryptography systems (like RSA) mainly relies on the difficulty of factoring very large pseudo-prime numbers 28,49 , we relied on the difficulty of prime factorization problem for generating benchmark SAT instances. Let f(x 1 , x 2 ) be a Boolean function as follows: are integer-valued numbers in binary representation (here, x 1 , x 2 ≥ 2), and the multiply operator is in binary base-each element of the vector q is a Boolean function of x 1 and x 2 . Assume that q is a pseudo-prime integer number in binary base (i.e., q has two prime factors. We can map the problem of factoring q to SAT as follows where n = n 1 + n 2 denotes the length of q. We can look at the process of generating SAT instances from a reverse-engineering viewpoint. To this end, we generated pseudo-prime numbers via multiplying two prime numbers, and represented them in binary base (denoted by q). For each instance, we then used the Eq. (4) to map the (2020) 10:7952 | https://doi.org/10.1038/s41598-020-64078-1 www.nature.com/scientificreports www.nature.com/scientificreports/ factorization of q to a satisfiable Boolean formula. Since g is a Boolean expression of x, we applied the Tseitin transformation 50 to represent g in conjunctive normal form (CNF) 45 . We also performed pre-processing techniques, namely "ctx-solver-simplify", "recover-01", "propagate-values" and "reduce-args" tactics from Z3 51 . Note that applying the Tseitin transformation can increase the size of g linearly, due to defining auxiliary variables. Since the capacity of the current D-Wave 2000Q quantum processors is limited to a complete graph of size 63, we eliminated SAT instances (in CNF) with more than 63 Boolean variables which resulted in 136 satisfiable SAT instances. Figure 1 illustrates results-minimum (circles), maximum (triangles), average and variance of the number of unsatisfied clauses-for solving these 136 satisfiable SAT instances, and compares the performance of the proposed RQA scheme with quantum annealing (QA) and quantum annealing with multiple post-quantum processes (SMQC). To enhance the standard quantum annealing technique, we used two spin-reversal-transforms 40 , as well as the delay between measurements to reduce the inter-sample correlation. In the second method (SMQC), we first used the multi-qubit correction (MQC) method 41 , in problem variable level-which is the state-of-the-art technique in the realm of post-quantum correction for quantum annealers-and then applied a local search to maximize the quality of results, attained by the SMQC arrangement.
To update the influence factors of clauses in RQA, Eq. (10), we used θ 1 = 0.1 and θ 1 = 0. Learning automata generally require a notable number of episodes to converge to an optimal (or sub-optimal) policies. In this experiment, nevertheless, the agent terminates the process after at most T = 10 episodes (due to QPU time limitations) or finding a solution that satisfies all clauses. Hence, we formed a hall-of-fame-a set of final solutions from all episodes-and applied MQC (followed by SQC) on them to obtain the ultimate solution of RQA. Our empirical observations showed that this technique can implicitly address the limited number of allowed episodes in RQA. Note that RQA utilizes at most the same number of samples as QA and SMQC. In other words, the pure (quantum) annealing time of RQA was at most equal to QA and SMQC. Experiment B: uniform random 3-SAT with phase transitions. Sampling from the phase transition region of uniform Random 3-SAT is a common practice for generating benchmark SAT (and MAX-SAT) problems [52][53][54][55] . In this experiment, as our second study case, we used the satisfiable benchmark test-set of uniform random 3-SAT with phase transitions 56 . Considering the capacity of the current generation of the D-Wave quantum annealers-we can embed a clique of size at most 63 on chimera architecture-we employed the test-set with 50 Boolean variables. Figure 2 demonstrates results-minimum (circles), maximum (triangles), average and variance of the number of unsatisfied clauses-for solving the first 100 instances from the benchmark test-set, and (similar to the previous experiment) compares the performance of the proposed RQA scheme with quantum annealing (QA) and quantum annealing with multiple post-quantum processes (SMQC). The setting for this experiment was identical to the previous experiment, except the number of instances (136 vs. 100) and the number of variables (variant vs. 50).

Experiment C: run-time evaluation.
In this experiment, we aim to evaluate the average run-time of RQA scheme and compare it with QA and SMQC approaches. We implemented all experiments in Python 3.7.4, and executed them on a 64-bit Windows 10 based system with 32 GB RAM and Intel Xeon processor at 3.00 GHz. Figure 3 shows the average run-time of solving 100 SAT instances for QA, SMQC and RQA methods on a D-Wave 2000Q quantum processor. Note that, in this experiment, we did not include the computation time for finding the embedding of the QMI on a working graph-we used the pre-defined embeddings of cliques for the chimera architecture. For all instances, we used two spin-reversal transforms and we also enabled the inter-sample delay between samples' reads.

Discussion
In this study, we introduced a novel scheme-called reinforcement quantum annealing (RQA)-that leverages reinforcement learning (more specifically learning automata) to enhance the quality of results, attained by the quantum annealers. RQA has an iterative scheme that is independent of the architecture of the annealer. In other  www.nature.com/scientificreports www.nature.com/scientificreports/ words, we look at the annealing process (here the quantum annealing) as a black box (or an atomic instruction). Hence, one can use RQA on top of any (quantum) annealing process. It is worth noting that our initial evaluations on applying RQA on classical (thermal) annealing did not show notable improvement. As an example, Eq. (1) represents the Ising Hamiltonian, which is stoquastic, that the current generations of the D-Wave quantum annealers sample from its ground state. When the next generations of quantum annealers are able to explore the landscape of a non-stoquastic Hamiltonian, we expect that one will be able to introduce RQA to supplement the new quantum annealer. Ramezanpour (2018) has proposed to improve the simulated quantum annealing algorithm by adding reinforcement to the standard quantum annealing algorithm 57 . Simulated quantum annealing is an iterative algorithm (similar to simulated annealing) that is implemented and run on classical computers. On the other hand, quantum annealers are physical, single-instruction, quantum processing units where the entire annealing process is atomic. Thus, we cannot modify or adjust the annealing process after starting the annealing (i.e., putting quantum bits on their superposition). in RQA, similar to the standard reinforcement learning scheme, iterations emulate the interactions between an agent and its environment. Ramezanpour's method, however, has an adaptive optimization scheme in which iterations simulate one quantum annealing process on a classical computer and is not applicable to physical quantum annealers.
Note that RQA does not offer to modify/adjust the annealing of quantum effects. In fact, we look at the quantum annealing process as a black box (i.e., an atomic process). We first cast the original problem of interest to an Ising Hamiltonian and then employ a quantum annealer to sample from its ground state(s). Afterward, at each iteration of RQA, we look at the results (samples) of the previous iteration and adjust the penalty of unsatisfied constraints. We then re-cast the problem (according to our knowledge from previous iterations/annealing) and (again) employ a quantum annealer to sample from the ground state of the new Ising Hamiltonian. From a problem-solving viewpoint, Herr et al. (2017) showed how to optimize the schedule of the annealing process. However, we offer to optimize the process of casting a given problem to the spin glass problem-albeit running multiple quantum annealing processes-without changing the schedule. In other words, in all experiments, we used a fixed annealing schedule and the annealing time was 20 microseconds in all executions.
Moreover, RQA aims to implicitly address limitations of physical quantum annealers that might not be a limitation in simulated quantum annealing. As an example, if the energy gap between the ground and first excited  www.nature.com/scientificreports www.nature.com/scientificreports/ states is reduced close to the end of the annealing cycle, according to the Anderson localization limitation 58 , RQA may also require exponentially large annealing time. As a proof-of-concept, we proposed a novel method for casting SAT to Ising Hamiltonians and then demonstrated that applying the proposed RQA scheme (on a D-Wave quantum annealer) results in notably better solutions. It is worth highlighting that, however, the proposed approach (i.e., hybridization of reinforcement learning and quantum annealing) is applicable to a vast range of classic AI problems like constraint satisfaction, planning, and scheduling.
We applied the proposed RQA scheme on two different SAT problem sets, and compared its performance with quantum annealing (QA) and quantum annealing with post-quantum error corrections (so-called SMQC) which is the state-of-the-art in the realm of quantum annealing 41,59,60 . The first problem set includes 136 satisfiable SAT instances which represent factoring pseudo-prime numbers that have at most 63 Boolean variables, in CNF representation. Besides the length of the given composite numbers, the difficulty of integer factoring also depends on the properties of integer numbers. The hardest instances of this problem are factoring pseudo-prime numbers (product of two prime numbers) whose factors have the same size (in binary base). SAT instances in experiment A are not the hardest cases of prime-factoring-restricting the SAT instances in experiment A to a composite of the same size prime factors resulted in only eight problems. It is worth noting that our main objective in this experiment was not to address the prime factorization nor to use quantum annealers for solving SAT or MAX-SAT problems. Figure 1 illustrates results-minimum, maximum, average and variance of a number of unsatisfied clausesof solving these 136 satisfiable SAT instances for 100, 500, 1,000, 5,000 and 10,000 samples. In RQA, increasing the number of samples from 100 to 10,000 reduces the average number of unsatisfied clauses from 1.57 to 1.40. Similarly, in QA and SMQC, the average number of unsatisfied clauses is reduced from 9.41 and 3.55 to 6.85 and 3.17, respectively. Although increasing the number of samples in all three methods reduces the average number of unsatisfied clauses, RQA with 100 samples outperforms both QA and SMQC approaches in all arrangements (even with 10,000 samples). It is worth highlighting that QA was not able to satisfy all clauses of any of these 136 SAT instances, even when we requested 10,000 samples, while both SMQC and RQA methods were able to find a satisfying solution for at least one of the instances in all cases. From a robustness viewpoint, increasing the number of samples from 100 to 10,000 lowers the variance and range (the difference between maximum and minimum) of QA, SMQC and RQA from 4.28, 2.60 and 0.86 to 3.21, 1.73 and 0.70, respectively. Therefore, RQA demonstrated better robustness (i.e., higher reproducibility rate), compared to both QA and SMQC approaches.
In the second case study, shown in Fig. 2, we used the first 100 SAT instances of the satisfiable benchmark test-set of uniform random 3-SAT with phase transitions 56 . Similar to Figs. 1 and 2 demonstrates that RQA with 100 samples outperforms both QA and SMQC approaches in all settings. More specifically, increasing the number of samples from 100 to 10,000 in QA, SMQC and RQA decreases the average number of unsatisfied clauses from 8.02, 1.65 and 0.79 to 6.39, 1.28 and 0.68, respectively. Also, the variances are reduced from 6.52, 0.53 and 0.31 to 4.43, 0.42 and 0.20, respectively. Thus, the minimum number of unsatisfied clauses in RQA for all settings is zero while SMQC needed at least 5,000 samples to satisfy all clauses of at least one SAT instance.
Note that RQA is an iterative scheme-RQA executes multiple QMIs for a given problem; hence, in all experiments, we restricted the total number of samples in RQA not to exceed the sample size of QA and SMQC. As an example, when QA and SMQC requested 1,000 samples, RQA (with T = 10 iterations) asked for only 100 samples in each iteration.
Problem-solving with a D-Wave quantum processor, in practice, requires some preprocesses (e.g., embedding and gauge transforms) and post-processes (like error correction and broken-chain remediation) that extends the total run-time from 20-2000 microseconds (pure annealing time) to some seconds and even minutes. Run-time in many pre/post-processing techniques mainly depends on the number of samples. It is a common practice in applying quantum annealers to request a few thousand (at most 10,000 per QMI) samples to increase the probability of finding the global minimum. Figure 3 demonstrates that increasing the number of samples increases the total run-time of RQA at a significantly lower rate (compared to QA and more specifically SMQC)-since a request of 10 times fewer samples in each iteration of RQA has less pre/post-processing overhead. As an illustration, 100 samples (10 samples in each iteration of RQA), increases the total run-time 7.6× and 6.8× more than QA and SMQC, respectively. However, increasing the number of samples to 10,000, (1,000 samples in each iteration of RQA) only increases the total run time by 4.4× and 1.7×, compared to QA and SMQC, respectively. In both experiments A and B, RQA with 100 reads (i.e., repeating the QA 100 times) was able to find a sample with lower energy, compared to QA and SMQC with 10,000 samples. On the other hand, the total computation time (quantum annealing plus classical postprocessing) of RQA with 100 reads is close to SMQC with 10,000 reads. As a conclusion, RQA with 100 reads utilizes less quantum annealing time, finds samples with lower energy and has the same total run-time, compared to SMQC with 10,000 reads.

Methods
Assume that the given problem ∏ that we aim to solve on a quantum annealer contains a finite set of constraints (components), denoted by π i for i ∈ {1, 2, …, M}, over the same variables as follows: This setting appears in a vast range of problem formulations-including, but not limited to SAT 26 , constraint satisfaction problems 62,63 , planning and scheduling [20][21][22] , fault detection and diagnosis 24,29 , and compressive sensing 30,64 -specifically when we adopt the idea of penalty methods for casting problems of interest to the spin glass problem.
Theorem 1. For any problem in class NP, there are infinite different Ising models whose ground states are all identical to the solution of the original problem.
Proof. According to Cook-Levin theorem, we can reduce any NP problem to finding the ground state of Hamiltonians (which is also in the class NP) in polynomial-time 9,65 . Multiplying all coefficients of the Ising model by a positive non-zero real number will result in a new spin glass problem whose ground state will be identical to the original Ising model. Since the number of positive real numbers are infinite, we can generate infinite different Ising models whose ground states represent the solution for the original problem of interest.
According to Theorem 1, there are an infinite number of different Ising Hamiltonians whose ground states all represent the solution of the original problem of interest-nevertheless owing to the range and precision limitations on the D-Wave QPUs, we have a finite number of different Ising models for a given problem. In theory, these different Ising models are equivalent to each other-i.e., an adiabatic annealing process always attains the ground state which is identical for all corresponding Ising models of a problem. In practice, however, each of these (theoretically) equivalent Ising models are analogous to a pseudo-Boltzman distribution whose parameters are different. Consequently, when we minimize the corresponding QMIs with a physical quantum annealer (like the D-Wave QPU), the probability of finding the global minimum for different Ising Hamiltonians of a given problem varies from zero to one. As an example, an annealing process on a D-Wave QPU may become diabatic because the required anneal time exceeds the maximum possible anneal time (2,000 microseconds), which can substantially reduce the probability of finding the global minimum. Note that for a given Ising Hamiltonian, we cannot estimate the probability of finding the ground state prior to executing the corresponding QMI.
We introduce the reinforcement quantum annealing (RQA) scheme, in which an intelligent agent interacts with a quantum annealer, as the stochastic environment of a learning automaton. RQA searches the space of Hamiltonians to iteratively find a better model for the given problem of interest that sampling from its ground state(s), by a quantum annealer, results in a better distribution-i.e., the probability of finding the global optimum is increased over the time. It is worth highlighting that RQA does not offer to alter/modify the annealing of quantum effects-i.e., all quantum annealing processes have an identical schedule. To this end, we extend Eq. (7) as follows: www.nature.com/scientificreports www.nature.com/scientificreports/ error correction heuristics), the agent estimates β according to the number of satisfied constraints (π i ) and employs Eq. (10) to update the influence factor ρ. proof of concept: RQA for solving SAt instances. For a given Boolean formula f(x 1 , x 2 , …, x n ), the problem of Boolean satisfiability (SAT) determines whether a constant replacement of values ("True" or "False") for all Boolean variables can interpret f as "True" 45 . From a complexity perspective, SAT is NP-complete, and we can reduce all problems of class NP to SAT in polynomial-time 65 . The Boolean formula f is in conjunctive normal form (CNF) if it is a conjunction ("AND") of clauses (i.e., ), where each clause is a disjunction ("OR") of literals (a Boolean variable or its negation)-= ∨ … ∨ C l l l i i j k . The maximum satisfiability problem (MAX-SAT) is an NP-hard extension of the SAT problem that aims to maximize the number of satisfying clauses 45,66 .
In this section, we adopt the idea of penalty methods for casting SAT to Ising Hamiltonians and show that adopting the proposed RQA scheme can notably improve the probability of finding the global optimum. It is important to highlight that our objective in this study is not to employ quantum annealers for addressing the NP-complete problem of SAT. Moreover, the proposed heuristic is not guaranteed to solve all SAT instances, even if we have access to an ideal quantum annealer. In the same manner, RQA does not offer to bypass the Anderson localization limitation of the adiabatic quantum optimization 58 .
In this mapping, we aim to find coefficients of Eq.
(1) such that the ground state of the resulting Ising Hamiltonian represents the satisfying solution of the original SAT instance. In this formulation, z i represents the Boolean variable x i , and we interpret −1 and +1 as "False" and "True", respectively. For a clause with k literals, there are 2 k different possibilities among which, we can distinguish the only state that makes the clause to be false-called infeasible state. Hence, we represent each clause of the given SAT instance with two inequalities as: i infeasible and, is the boundary variable corresponding to the clause C i , AND E feasible and E infeasible represent the contribution of the feasible and infeasible states in the ultimate energy function, respectively. For = ∨ ¬ ∨ C x x x i 1 4 9 , as an example, Eq. (11) reduces to: According to Theorem 2, we can rewrite Eq. (12) as follows: i infeasible that obeys: i Note that clauses in CNF representation are connected with the "AND" operator. Hence, after representing each clause of the SAT with two inequalities, Eqs. (11) and (13), we aggregate the resulting sub-systems of