A hybrid classical-quantum approach to speed up Q-learning

We introduce a classical-quantum hybrid approach to computation, allowing for a quadratic performance improvement in the decision process of a learning agent. Using the paradigm of quantum accelerators, we introduce a routine that runs on a quantum computer and allows for the encoding of probability distributions. This quantum routine is then employed, in a reinforcement learning set-up, to encode the distributions that drive action choices. Our routine is well-suited to the case of a large, although finite, number of actions and can be employed in any scenario where a probability distribution with a large support is needed. We describe the routine and assess its performance in terms of computational complexity, required quantum resources, and accuracy. Finally, we design an algorithm showing how to exploit it in the context of Q-learning.


I. INTRODUCTION
Quantum algorithms can produce statistical patterns that are not easy to obtain on a classical computer; in turn, they may help recognize patterns that are difficult to identify classically. To pursue this basic idea, a huge research effort is underway to speed up machine learning routines by exploiting unique quantum properties, such as coherence and entanglement, ultimately due to the superposition principle [1,2].
Within the realm of machine learning, the Reinforcement Learning (RL) paradigm has gained huge attention in the last two decades [3,4], as, in a wide range of application scenarios, it allows modeling an agent that is able to learn and improve its behavior through rewards and penalties received from a not fully known environment. The agent typically chooses the action to perform by sampling a probability distribution that mirrors the expected returns associated with each of the actions that can be taken in a given situation (state). The estimated returns and the corresponding probability distribution over the actions need to be updated at every step.
In this paper, we provide a novel algorithm for updating a probability distribution on a quantum register, which does not require knowledge of the probabilities of all the admissible actions, as assumed in other works [14,15]. Instead, the actions are clustered into a predetermined number of subsets (classes), each associated with a range delimited by a minimum and a maximum value of the expected reward. The cardinality of each class is evaluated in due course, through a procedure built upon well-known quantum routines, i.e., the quantum oracle and quantum counting [16,17]. Once this information is obtained, a classical procedure is run to assign a probability to each subset, in accordance with any desired distribution, while the elements within the same class are taken to be equally likely. This allows one to tune probabilities, e.g., to assign a large one to the actions included in the range with the maximum value of the expected reward. The probability distribution can also be changed dynamically, in order to enforce exploration at a first stage (allowing the choice of actions that have a low value) and exploitation at a second stage (restricting the search to actions having a high value). The quantum computing algorithm presented here allows re-evaluating the values of all the actions that are admissible in a given state in a single parallel step, which is only possible thanks to quantum superposition.
Besides the RL scenario, for which our approach is explicitly tailored, the main advantageous features of this algorithm could also be exploited in several other contexts where one needs to sample from a distribution, ranging from swarm intelligence algorithms (such as Particle Swarm Optimization and Ant Colony Optimization [18]) to Cloud architectures (where the objective is to find an efficient assignment of virtual machines to physical servers, a problem that is known to be NP-hard [19]). After presenting the algorithm in detail in Sec. II (while postponing to the appendix a formal evaluation of its performance in terms of computational complexity and maximum approximation error), we will focus on its use in the RL setting in Sec. III, to show that it is, in fact, tailored to the needs of RL with a finite number of actions/states of the environment. We finally draw some concluding remarks in Sec. IV.

II. PREPARING A QUANTUM PROBABILITY DISTRIBUTION
We now present the main building blocks of our approach to encode a user-defined probability distribution into a Quantum Register (QR). This procedure may prove useful in every application where one needs to extract the value of a random variable.
Let us assume that we have a random variable whose discrete domain includes J different values, {x_j : j = 1, . . ., J}, which we map into the basis states of a J-dimensional Hilbert space. Our goal is to prepare a quantum state for which the measurement probabilities in this basis reproduce the probability distribution of the random variable:

|ψ⟩ = Σ_{j=1}^{J} √(p_{x_j}) |x_j⟩.

The algorithm starts by initializing the QR as

|Ψ_0⟩ = |φ⟩ ⊗ |1⟩_a,

where |φ⟩ = (1/√J) Σ_{k=1}^{J} |x_k⟩ is the equal superposition of the basis states {|x_k⟩}, while an ancillary qubit is set to the state |1⟩_a. In our approach, the final state is prepared by encoding the probabilities sequentially, which requires J − 1 steps. At the i-th step of the algorithm (1 ≤ i < J), Grover's iterations [20] are used to set the amplitude of the |x_i⟩ basis state to a_i = √(p_{x_i}). In particular, we apply a conditional Grover operator

Ĝ_i^c = Ĝ_i ⊗ Π(1)_a + I ⊗ Π(0)_a,

where Ĝ_i = R Ô_i and Π(y)_a = |y⟩⟨y|_a is the projector onto the state |y⟩_a of the ancilla (y = 0, 1). It forces the Grover unitary, Ĝ_i, to act only on the component of the QR state tied to the ('unticked') state |1⟩_a of the ancillary qubit. The operator R = 2|φ⟩⟨φ| − I is the reflection with respect to the uniform superposition state, whereas the operator Ô_i = I − 2|x_i⟩⟨x_i| is built so as to flip the sign of the state |x_i⟩ and leave all the other states unaltered. The Grover operator is applied until the amplitude of |x_i⟩ approximates a_i to the desired precision (see Appendix A). The state of the ancilla is not modified during the execution of Grover's algorithm.
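As a numerical illustration of the Grover iterations described above, the following sketch (our own, not part of the paper's routine) applies G = R·Ô_i to a classical statevector and checks the standard amplitude law sin((2t+1)θ), which is what one monitors when stopping once the marked amplitude approximates a_i = √p_{x_i}; all names and parameters are illustrative.

```python
import numpy as np

def grover_step(state, marked):
    """Apply the oracle O_i (phase flip on `marked`), then R = 2|phi><phi| - I."""
    state = state.copy()
    state[marked] *= -1.0                  # oracle: flip the sign of |x_i>
    return 2.0 * state.mean() - state      # reflection about the mean

N = 64
marked = 5
state = np.full(N, 1.0 / np.sqrt(N))       # uniform superposition |phi>
theta = np.arcsin(1.0 / np.sqrt(N))        # rotation angle per iteration

for t in range(1, 4):
    state = grover_step(state, marked)
    # after t iterations the marked amplitude is sin((2t+1)*theta);
    # one stops at the t whose amplitude best approximates sqrt(p_i)
    assert np.isclose(state[marked], np.sin((2 * t + 1) * theta))
```

In the paper's scheme the ancilla restricts this rotation to the 'unticked' branch; the sketch omits the ancilla and shows only the amplitude dynamics on the register.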
The i-th step ends by ensuring that the amplitude of |x_i⟩ is not modified anymore during the next steps. To this end, we 'tick' this component by tying it to the state |0⟩_a. This is obtained by applying the operator

Ĉ_i = |x_i⟩⟨x_i| ⊗ X + (I − |x_i⟩⟨x_i|) ⊗ I_a,

whose net effect is

|x_i⟩ ⊗ |y⟩_a → |x_i⟩ ⊗ |y ⊕ 1⟩_a,

where X is the NOT gate, with 1 ≤ i ≤ J and y ∈ {0, 1}.
After step i, the state of the system is

|Ψ_i⟩ = Σ_{k=1}^{i} a_k |x_k⟩ ⊗ |0⟩_a + b_i |β_i⟩ ⊗ |1⟩_a.

The state |β_i⟩ has, in general, a non-zero overlap with all of the basis states, including those whose amplitudes have been updated previously. This is due to the action of the reflection operator R, which outputs a superposition of all the J basis states. However, this does not preclude us from extracting the value of the probability of the random variable correctly, thanks to the ancillary qubit. Indeed, at the end of the last step, the ancilla is first measured in the logical basis. If the outcome of the measurement is 0, then we can proceed to measure the rest of the QR, obtaining one of the first J − 1 values of the random variable with the assigned probability distribution. Otherwise, the output of the algorithm is set to x_J, since the probability of getting 1 from the measurement of the ancilla is p_{x_J}, due to the normalization condition (p_{x_J} = 1 − Σ_{j=1}^{J−1} p_{x_j}). The procedure can be generalized to encode a random distribution for which N elements, with N > J, are divided into J sub-intervals. In this case, a probability is not assigned separately to every single element but, rather, collectively to each of the J sub-intervals, while the elements belonging to the same sub-interval are assigned equal probabilities. This is useful to approximate a random distribution where the number of elements, N, is very large. In this case, the QR operates on an N-dimensional Hilbert space, and Grover's operators are used to amplify more than one state at each of the J − 1 steps. For example, in the simplest case, we can think of having a set of J Grover operators, each of which acts on N/J basis states. In the general case, though, each Grover operator could amplify a different, and a priori unknown, number of basis states. This case will be discussed in the following, where the algorithm is exploited in the context of RL.

III. IMPROVING REINFORCEMENT LEARNING
We will now show how the algorithm introduced above can be exploited in the context of RL and, specifically, in the Q-learning cycle. Figure 1 provides a sketch emphasizing the part of the cycle that is involved in our algorithm. Our objective, in this context, is to update the action probabilities, as will be clarified in the rest of this section. An RL algorithm can be described in terms of an abstract agent interacting with an environment. The agent can be in one of the states belonging to a given set S, and is allowed to perform actions picked from a set A_s, which, in general, depends on s. For each state s ∈ S, the agent chooses one of the allowed actions, according to a given policy. After the action is taken, the agent receives a reward r from the environment and its state changes to s' ∈ S. The reward is used by the agent to understand whether the action has been useful to approach the goal, and thus to learn how to adapt and improve its behavior. In short, the higher the value of the reward, the better the choice of the action a for that particular state s. In principle, this behavior, or policy, should consist of a rule that determines the best action for any possible state. An RL algorithm aims at finding the optimal policy, which maximizes the overall reward, i.e., the sum of the rewards obtained after each action. However, if the rewards are not fully known in advance, the agent needs to act on the basis of an estimate of their values.
Among the various approaches designed to this end, the so-called Q-learning algorithm [4] adopts the Temporal Difference (TD) method to update the Q(s, a) value, i.e., an estimate of how profitable the choice of the action a is when the agent is in the state s.
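A minimal sketch of the classical tabular TD update the text refers to; the learning rate, discount factor, and toy transition below are our own illustrative assumptions.

```python
from collections import defaultdict

def td_update(Q, s, a, r, s_next, actions_next, alpha=0.5, gamma=0.9):
    """TD rule: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max((Q[(s_next, a2)] for a2 in actions_next), default=0.0)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

Q = defaultdict(float)            # unseen (state, action) pairs start at 0
Q[("s1", "left")] = 1.0           # hypothetical prior estimate
new_val = td_update(Q, "s0", "right", r=1.0, s_next="s1", actions_next=["left"])
# 0 + 0.5 * (1.0 + 0.9 * 1.0 - 0) = 0.95
assert abs(new_val - 0.95) < 1e-12
```

In the hybrid scheme of Sec. III these tabulated values are replaced by the parametrized approximator Q*_θ(s, a).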
When choosing an action, at a given step of the algorithm, two competing needs have to be balanced: exploring all the possible actions, and exploiting the actions with the greatest values of Q(s, a). As is common, we resort to a compromise between exploration and exploitation by choosing the new action through a random distribution, defined so that the probability of choosing the action a in the state s mirrors Q(s, a). For example, one could adopt a Boltzmann-like distribution P(a|s) = e^{Q(s,a)/T}/Z, where Z is a normalizing factor, while the parameter T can vary during the learning process (with a large T in the beginning, in order to favor exploration, and a lower T once some experience about the environment has been acquired, in order to exploit this knowledge and give more chances to actions with a higher reward).
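The Boltzmann-like distribution above can be sketched in a few lines; the Q-values and temperatures are illustrative, chosen only to show the exploration/exploitation effect of T.

```python
import numpy as np

def boltzmann(q_values, T):
    """P(a|s) = exp(Q(s,a)/T) / Z, with Z the normalizing factor."""
    w = np.exp(np.asarray(q_values, dtype=float) / T)
    return w / w.sum()

q = [1.0, 2.0, 3.0]
p_hot = boltzmann(q, T=100.0)   # large T: nearly uniform -> exploration
p_cold = boltzmann(q, T=0.1)    # small T: concentrated on argmax -> exploitation

assert np.isclose(p_hot.sum(), 1.0)
assert p_cold.argmax() == 2 and p_cold[2] > 0.99
```

Annealing T from large to small over the training reproduces the exploration-to-exploitation schedule described in the text.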
A severe bottleneck in the performance of a TD training algorithm arises when the number of actions and/or states is large. For example, in the game of chess the number of states is ∼ 10^120, and it is, in fact, impossible to deal with the consequent huge number of Q(s, a) values. A workaround is to use a function Q*_θ(s, a) that approximates the values of Q(s, a) obtained by the TD rule and whose properties depend upon a (small) set of free parameters θ that are updated during the training. This approach has shown its effectiveness in different classical settings, as, for example, in Deep Q-Learning [21-25], where the Q*_θ(s, a) function is implemented by means of a neural network whose parameters are updated in accordance with the experience of the agent.
In the quantum scenario, this approach turns out to be even more effective. Indeed, it is possible to build a parameter-dependent quantum circuit that implements Q*_θ(s, a), an approach that has been adopted by recent studies on near-term quantum devices [26-30]. This circuit allows us to evaluate the function Q*_θ(s, a) in a completely quantum-parallel fashion, i.e., in one shot for all the admissible actions in a given state. With this approach, it is possible to obtain a quantum advantage in the process of building the probability distribution for the actions, using the algorithm presented in Sec. II. To achieve a significant quantum speed-up, and to reduce the number of required quantum resources, thus making our algorithm suitable for near-term NISQ processing units, we do not assign a probability to every action but, rather, aggregate actions into classes (i.e., subsets) according to their probabilities, as explained in the following. Let us consider the minimum (m) and maximum (M) of the Q*_θ(s, a) values, and let us divide the interval [m, M] into J (non-overlapping, but not necessarily equal) sub-intervals I_j, with 1 ≤ j ≤ J. For a given state s, we include the action a ∈ A_s in the class C_j if Q*_θ(s, a) ∈ I_j. The probability of each sub-interval, then, will be determined by the sum of the Q*-values of the corresponding actions, Σ_{a∈C_j} Q*_θ(s, a). All of the actions in C_j will then be considered equally probable.
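The class construction C_j = {a ∈ A_s : Q*_θ(s, a) ∈ I_j} can be sketched classically as follows; equal-width sub-intervals and the toy Q-values are our simplifying assumptions (the text allows unequal I_j).

```python
import numpy as np

def build_classes(q_values, J):
    """Split [m, M] into J equal sub-intervals and group action indices by bin."""
    q = np.asarray(q_values, dtype=float)
    edges = np.linspace(q.min(), q.max(), J + 1)       # I_1, ..., I_J boundaries
    # bin index j such that edges[j] <= q < edges[j+1]; fold M into the last bin
    idx = np.clip(np.digitize(q, edges) - 1, 0, J - 1)
    return {j: np.flatnonzero(idx == j).tolist() for j in range(J)}

classes = build_classes([0.1, 0.4, 0.45, 0.9, 1.0], J=2)
# m = 0.1, M = 1.0, edges at 0.1, 0.55, 1.0
assert classes == {0: [0, 1, 2], 1: [3, 4]}
```

Within each class the actions are then treated as equally probable, as stated above.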
In this way, our algorithm requires only J − 1 steps, each of which is devoted to amplifying the actions belonging to one of the J − 1 classes (while the J-th probability is obtained by normalization). Furthermore, we can also take advantage of the aggregation while encoding the probability distributions onto the QR: in this case, indeed, we can use predetermined Grover oracles, each devoted to amplifying the logical states corresponding to the actions belonging to a given C_j.
In order to insert the distribution-update algorithm into the Q-learning procedure, we need two QRs, A and I, encoding the actions and the sub-intervals, respectively. These registers need log_2(max_s |A_s|) and log_2(J) qubits, respectively. Let us consider the class C_j = {a ∈ A_s : Q*_θ(s, a) ∈ I_j}. Our goal is to assign to each action a ∈ C_j a probability p_j, based on the sub-interval j, using the algorithm presented in Sec. II. The distribution-building process starts by preparing the following uniform superposition:

|Ψ⟩ = (1/√|A_s|) Σ_{a∈A_s} |a⟩_A ⊗ |0⟩_I,

where |0⟩_I is the initial state of the register I. In order to apply our algorithm, we need J − 1 oracles Ô_j, one for each given I_j. To obtain these oracles, it is first necessary to define an operator Ĵ that records the sub-interval j_a to which the value Q*_θ(s, a) belongs. Its action creates correlations between the two registers by changing the initial state of the I-register as follows:

Ĵ |a⟩_A ⊗ |0⟩_I = |a⟩_A ⊗ |j_a⟩_I.

To complete the construction of the oracles, we need to execute two unitaries: i) the operator Ô'_j = I_A ⊗ (I_I − 2|j⟩⟨j|_I), which flips the phase of the state |j⟩ of the I-register; and ii) the operator Ĵ†, which disentangles the two registers. The effective oracle operator entering the algorithm described in the previous section is then defined as Ô_j = Ĵ† Ô'_j Ĵ, its net effect being

Ô_j |a⟩_A ⊗ |0⟩_I = (−1)^{δ_{j_a, j}} |a⟩_A ⊗ |0⟩_I,

i.e., the sign of |a⟩_A is flipped if, and only if, a ∈ C_j. Eventually, we apply the reflection about the average, R, on the register A, thus completing an iteration of the Grover operator.
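The conjugation structure Ĵ† Ô'_j Ĵ can be verified numerically on a toy system; this is our own check (4 actions, a single interval-label qubit, labels written into the I-register by an XOR, so that Ĵ is a self-inverse permutation), not the paper's circuit.

```python
import numpy as np

nA, nI = 4, 2                 # 4 actions, binary interval label (J = 2)
labels = [0, 0, 1, 1]         # hypothetical j_a for each action a

# J: |a>|i> -> |a>|i XOR j_a>  (a permutation matrix, so J^dag = J here)
J = np.zeros((nA * nI, nA * nI))
for a in range(nA):
    for i in range(nI):
        J[a * nI + (i ^ labels[a]), a * nI + i] = 1.0

Oj = np.kron(np.eye(nA), np.diag([1.0, -1.0]))   # flips the phase of |j = 1>_I
oracle = J.T @ Oj @ J                            # effective oracle J^dag O'_j J

psi = np.kron(np.full(nA, 0.5), [1.0, 0.0])      # uniform over |a>, I-register in |0>
out = oracle @ psi
signs = out[::nI] / psi[::nI]
# only the actions with Q*-value in I_1 (labels 1) acquire a minus sign
assert np.allclose(signs, [1.0, 1.0, -1.0, -1.0])
```

The I-register returns to |0⟩, confirming that Ĵ† disentangles the two registers after the phase kickback.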
If the cardinality of each C_j is not fixed from the beginning, it needs to be computed in order to evaluate the correct number of Grover iterations to be executed (see App. A 1 for details). This number of actions can be obtained, for each C_j, by running the quantum counting algorithm associated with Ô_j (before its action). It is then possible to apply the algorithm of Sec. II in order to build the desired probability distribution.
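A back-of-the-envelope estimate (our sketch, not the Appendix A formula) of how the count K returned by quantum counting fixes the number of Grover iterations needed to bring the total amplitude of a class of K marked actions, out of N, up to a target √p_j:

```python
import numpy as np

def grover_iterations(N, K, target_prob):
    """Iterations t with sin((2t+1)*theta) ~ sqrt(target_prob), sin(theta)=sqrt(K/N)."""
    theta = np.arcsin(np.sqrt(K / N))            # rotation angle per iteration
    target_angle = np.arcsin(np.sqrt(target_prob))
    return max(0, round((target_angle / theta - 1.0) / 2.0))

# with K = 1 of N = 1024 and target probability ~ 1, this recovers the
# familiar ~ (pi/4) * sqrt(N) scaling of full amplitude amplification
t = grover_iterations(1024, 1, 1.0)
assert t == round(np.pi / 4 * np.sqrt(1024))
```

Since the target probability p_j is generally well below 1, the actual iteration count stays below this bound, consistent with the complexity discussion in the text.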
We stress that our procedure works for any assignment rule of the probabilities to sub-intervals, and the Boltzmann distribution mentioned above is just one example providing a good balance between exploration and exploitation [31,32].
After the quantum state of the A-register is obtained, the agent will choose the action by measuring its state. Then, according to the response of the environment, it will update the θ values classically, thus changing the behaviour of the operator Ĵ.
The full process, with the various steps we have described, is summarized in the box below, where classical and quantum operations are denoted by (C) and (Q), respectively.

Scheme of the hybrid algorithm
Initialize θ and start from state s (C)
Execute the cycle:
• Build the sub-intervals, the classes, and the quantum circuits for the oracles Ô_j = Ĵ† Ô'_j Ĵ (C)
• Use quantum counting on Ô_j to compute the number of actions belonging to each sub-interval (Q)
• Compute the number of Grover iterations for each class (C)
• Build the probability distribution for the admissible actions in s (Q)
• Measure, obtain an action, and execute it (C)
• Get the new state s' and the reward r, and update θ accordingly (C)

Before concluding, we determine the advantage that can be obtained with our quantum algorithm. Indeed, to build the probability distribution classically, for a given state s, the number of calls of the function Q*_θ(s, a) increases asymptotically as O(|A_s|). Conversely, with our quantum protocol, the number of calls of Ĵ, and therefore of Q*_θ(s, a), is asymptotically O(√|A_s|). Our algorithm is devised for the case |A_s| ≫ 1, which also corresponds to a good precision (see Appendix A 1 for details). Finally, we point out that, in the process of defining the sub-intervals, it is possible to compute the maximum (M) and the minimum (m) of Q*_θ(s, a) before running our algorithm, through quantum routines that do not increase the complexity of our procedure [33,34].
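The cycle in the box above can be mocked end-to-end classically; in this sketch of ours, every (Q) step is replaced by an exact classical stand-in (exhaustive counting instead of quantum counting, direct sampling instead of measurement), and the environment, θ-table, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_star(theta, s, actions):
    """Stand-in for the parametrized circuit Q*_theta(s, a)."""
    return np.array([theta[s, a] for a in actions])

def hybrid_step(theta, s, actions, J=2, T=1.0):
    q = q_star(theta, s, actions)
    edges = np.linspace(q.min(), q.max(), J + 1)        # sub-intervals (C)
    idx = np.clip(np.digitize(q, edges) - 1, 0, J - 1)  # classes C_j (C)
    counts = np.bincount(idx, minlength=J)              # "quantum counting" (Q)
    class_w = np.exp([q[idx == j].sum() / T for j in range(J)])
    class_p = class_w / class_w.sum()                   # probability per class (C)
    # equal probability within a class -> per-action distribution (Q)
    p = np.array([class_p[j] / counts[j] for j in idx])
    p /= p.sum()
    return actions[rng.choice(len(actions), p=p)]       # "measurement" (C)

theta = np.zeros((1, 3))
theta[0, 2] = 5.0                # one action with a much larger Q*-value
a = hybrid_step(theta, 0, [0, 1, 2])
assert a in (0, 1, 2)
```

On quantum hardware the counting and distribution-building steps are the ones that gain the O(√|A_s|) call scaling discussed above.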

IV. CONCLUSIONS
In our work, we presented an algorithm, based on Grover's, to encode a probability distribution onto a quantum register with a quadratic speed-up. This algorithm can find several useful applications in the context of hybrid classical-quantum workflows. In this spirit, we have shown how it can be exploited in the training of the Q-learning strategy. We have shown that this gives rise to a quadratic quantum speed-up of the RL algorithm, obtained by the inclusion of our quantum subroutine in the action-selection stage of the RL workflow. This effectively enables a trade-off between exploration and exploitation, thanks to the intrinsic randomness embodied by the extraction from a QR of the action to be performed and, also, to the possibility of dynamically changing the relationship between the actions and their values (and, thus, their relative probabilities). The latter, in particular, would become too burdensome in a fully classical setting.
Finally, we stress once again that, with our procedure, we can use Grover oracles that are given once and for all, provided that i) the minimum and maximum of the range of action values, and ii) the number of intervals into which this range is divided, are specified.

The optimal number of Grover iterations for a given step i is upper bounded by a quantity that, to leading order in our working assumptions (N ≫ 1), scales as (π/4)√N. From this expression, we can conclude that the initial conditions of the state can only reduce the optimal number of calls of the Grover operators, because we are dealing only with positive amplitudes and we have K(0), L(0) ≥ 0. It is worth noting that, in our case, we want to set the amplitude of each action not to its maximum value but to the value determined by a probability distribution; therefore, the number of Grover iterations is typically much lower than the upper bound given above. As explained in Sec. III, the number of times the Grover procedure has to be executed is equal to the number of sub-intervals (J) chosen, so the total complexity is O(J√N); taking into account that, for a given state s, N = |A_s| and that J is a fixed constant, the complexity is O(√|A_s|). Another important property to analyse is the precision of the algorithm, by which we mean the sensitivity of the increment of the probability of one action after one iteration (from t to t + 1).

FIG. 1. Representation of a Q-learning cycle, where the quantum advantage is brought (on the left part) by the algorithm that encodes and updates action probabilities on a QR.