Abstract
As the field of artificial intelligence advances, the demand for algorithms that can learn quickly and efficiently increases. An important paradigm within artificial intelligence is reinforcement learning^{1}, where decision-making entities called agents interact with environments and learn by updating their behaviour on the basis of the obtained feedback. The crucial question for practical applications is how fast agents learn^{2}. Although various studies have made use of quantum mechanics to speed up the agent’s decision-making process^{3,4}, a reduction in learning time has not yet been demonstrated. Here we present a reinforcement learning experiment in which the learning process of an agent is sped up by using a quantum communication channel with the environment. We further show that combining this scenario with classical communication enables the evaluation of this improvement and allows optimal control of the learning progress. We implement this learning protocol on a compact and fully tunable integrated nanophotonic processor. The device interfaces with telecommunication-wavelength photons and features a fast active-feedback mechanism, demonstrating the agent’s systematic quantum advantage in a setup that could readily be integrated within future large-scale quantum communication networks.
Main
Rapid advances in the field of machine learning and in general artificial intelligence (AI) are paving the way towards intelligent algorithms and automation. An important paradigm within AI is reinforcement learning (RL), where decision-making entities called ‘agents’ interact with an environment, ‘learning’ to achieve a goal via feedback^{1}. Whenever the agent performs well (that is, makes the right decision), the environment rewards its behaviour, and the agent uses this information to progressively increase the likelihood of accomplishing its task. In this sense, an agent ‘learns’ by ‘reinforcement’. RL has applications in many sectors, from robotics^{5,6} to the healthcare domain^{7}, to brain-like computing simulation^{8} and neural network implementations^{6,9}. In addition, the celebrated AlphaGo algorithm^{10}, which is able to beat even the most skilled human players at the game of Go, employs RL.
At the same time, quantum technologies have experienced remarkable progress^{11}. At the heart of quantum mechanics lies the superposition principle, dictating that even the simplest, two-dimensional quantum system is described by a continuum of infinitely many possible choices via a state vector \(|\psi\rangle = \alpha|0\rangle + \beta|1\rangle\) with the complex numbers α and β satisfying \(|\alpha|^{2}+|\beta|^{2}=1\), while only two possible states, \(|0\rangle\) and \(|1\rangle\), exist classically. RL algorithms inspired by quantum mechanics^{12,13} have been successfully applied to problems in quantum information processing, for example, decoding of errors^{14,15,16}, quantum feedback^{17}, adaptive code design^{18}, quantum-state reconstruction^{19} and even the design of quantum experiments^{20,21}. Conversely, quantum technologies have enabled quadratically faster decision-making processes for RL agents via the quantization of their internal hardware^{3,4,22,23}.
In all of these applications, agent and environment interact entirely classically. Here we consider a novel RL setting where they can also interact quantumly, formally via a quantum channel^{2}. We therefore introduce a quantum-enhanced hybrid agent capable of quantum as well as classical information transfer. This makes it possible to achieve and quantify a quantum speedup in the agent’s learning time with respect to RL based solely on classical interaction.
We realize this protocol using a fully programmable nanophotonic processor interfaced with photons at telecommunication wavelengths. The setup enables the implementation of active-feedback mechanisms, thus proving suitable for demonstrations of RL algorithms. Moreover, such photonic platforms hold the potential of integrating RL quantum speedups in future quantum networks owing to the photons’ telecommunication wavelengths. A long-standing goal in the development of quantum communication lies in establishing a form of ‘quantum internet’^{24,25}, a highly interconnected network able to distribute and manipulate quantum states via optical links. We therefore envisage AI and RL to play important roles in future quantum networks, including a potential quantum internet, much in the same way that AI forms an integral part of the internet today.
Quantum-enhanced RL
The conceptual idea of RL is shown in Fig. 1a. A decision-making entity called an agent interacts with an environment by receiving perceptual input (‘percepts’) s_{i}, and outputting specific ‘actions’ a_{i} accordingly, at a certain time step i. Different ‘rewards’ r issued by the environment for correct combinations of percepts and actions incentivize agents to improve their decision-making, and thus to learn^{1}.
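This percept–action–reward loop can be sketched in a few lines of Python. The toy environment, the single rewarded action label and the reinforcement constant below are illustrative assumptions, not details of the experiment:

```python
import random

def run_epoch(policy, env_reward, rng):
    """One epoch: sample an action from the policy, query the environment
    for a reward, and reinforce the chosen action if it was rewarded."""
    actions = list(policy)
    action = rng.choices(actions, weights=[policy[a] for a in actions])[0]
    reward = env_reward(action)       # environment issues reward in {0, 1}
    policy[action] += 2 * reward      # reinforce rewarded behaviour
    return reward

# Toy task: action 42 out of 100 is the single rewarded one (initial
# winning probability 0.01).
rng = random.Random(0)
policy = {a: 1.0 for a in range(100)}
for _ in range(500):
    run_epoch(policy, lambda a: int(a == 42), rng)
```

Each reward raises the weight of the rewarded action, so sampling it becomes progressively more likely — the essence of learning by reinforcement.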
Although RL has already been shown to be amenable to quantum enhancements, the interaction has so far been restricted exclusively to classical communication, meaning that signals can only be composed from a fixed, discrete alphabet. For signals carried by quantum systems (for example, the single photons considered here) this corresponds to a fixed preferred basis, for example, ‘vertical’ or ‘horizontal’ photon polarization, as shown in Fig. 1b.
In general, it has been shown that granting agents access to quantum hardware (while still considering classical communication) does not reduce the learning time, although it allows actions to be output quadratically faster^{3,4}. To achieve reductions in learning times, quantum communication becomes necessary.
We therefore consider an environment and a quantum-enhanced hybrid agent with access to internal quantum (as well as classical) hardware interacting by exchanging quantum states \(|a_{i}\rangle\), \(|s_{i}\rangle\) and \(|r\rangle\), representing actions a_{i}, percepts s_{i} and rewards r, respectively. Such agents may behave ‘classically’, that is, use a classical channel, or ‘quantumly’, meaning that communication is no longer limited to a fixed preferred basis, but allows for exchanges of arbitrary superpositions via a quantum channel, as shown in Fig. 1c. In general, agents react to (sequences of) percepts \(|s_{i-1}\rangle\) with (sequences of) actions \(|a_{i}\rangle\) according to a policy \(\pi(a_{i}|s_{i-1})\) that is updated during the learning process via classical control.
Within this framework, we focus on so-called^{2} deterministic strictly epochal (DSE) learning scenarios (elsewhere called episodic rather than epochal^{1}). Here ‘epochs’ consist of strings of percepts s = (s_{0}, ..., s_{L−1}) with fixed s_{0}, actions a = (a_{1}, ..., a_{L}) of fixed length L, and a final reward r, and both s = s(a) and r = r(a) are completely determined by a. Therefore, no explicit representation of the percepts is required in our experiment (Methods). A non-trivial feature of the DSE scenario is that the effective behaviour of the environment can be modelled via a unitary U_{E} (ref. ^{2}) on the action and reward registers A and R as
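The display equation labelled (1) did not survive extraction; from the description of U_{E} as a generalized controlled-NOT acting on registers A and R, it can be reconstructed (up to notation) as:

```latex
U_{\mathrm{E}}\,|\mathbf{a}\rangle_{\mathrm{A}}\,|r\rangle_{\mathrm{R}}
  = |\mathbf{a}\rangle_{\mathrm{A}}\,|r \oplus r(\mathbf{a})\rangle_{\mathrm{R}},
\qquad (1)
```

with ⊕ denoting addition modulo 2, so that the reward qubit is flipped exactly for rewarded action sequences.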
U_{E} is similar to a generalized controlled-NOT gate such that, in the case of rewarded action sequences (r(a) > 0), the reward state is flipped. U_{E} can therefore be used to perform a quantum search for such sequences.
A hybrid agent can choose between quantum and classical behaviour in each epoch. In classical epochs, the agent prepares the state \(|{\bf{a}}\rangle_{\rm{A}}|0\rangle_{\rm{R}}\), where a is determined by sampling from a classical probability distribution p(a) determined by its policy π. With a winning probability
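The display equation for the winning probability is missing here; consistent with ε = sin²(ξ) as used throughout the rest of the text, it is presumably:

```latex
\varepsilon \;=\; \sin^{2}(\xi) \;=\; \sum_{\{\mathbf{a}\,|\,r(\mathbf{a})>0\}} p(\mathbf{a}),
```

that is, the total probability weight the policy assigns to rewarded action sequences.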
with ξ ∈ [0, 2π], the agent receives a reward and updates its policy according to a rule, presented in equation (4), based on projective simulation^{26} (see also Methods). In quantum epochs, the following steps are performed:
(1) The agent prepares the state \(|\psi\rangle_{\rm{A}}|-\rangle_{\rm{R}}\), with \(|\psi\rangle_{\rm{A}}={\sum }_{{\bf{a}}}\sqrt{p({\bf{a}})}\,|{\bf{a}}\rangle_{\rm{A}}=\)\(\cos (\xi )\,|\ell\rangle_{\rm{A}}+\,\sin (\xi )\,|w\rangle_{\rm{A}}\), and sends it to the environment. \(|w\rangle_{\rm{A}}\) and \(|\ell\rangle_{\rm{A}}\) are superpositions of all winning (rewarded) and losing (non-rewarded) action sequences, respectively, and \(|-\rangle_{\rm{R}}=(|0\rangle_{\rm{R}}-|1\rangle_{\rm{R}})/\sqrt{2}\).
(2) The environment applies U_{E} from equation (1) to \(|\psi\rangle_{\rm{A}}|-\rangle_{\rm{R}}\), flipping the sign of the winning state:
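The display equation labelled (3) is missing; by the standard phase-kickback argument on the \(|-\rangle\) reward qubit it can be reconstructed as:

```latex
U_{\mathrm{E}}\,|\psi\rangle_{\mathrm{A}}|-\rangle_{\mathrm{R}}
  = \big(\cos(\xi)\,|\ell\rangle_{\mathrm{A}} - \sin(\xi)\,|w\rangle_{\mathrm{A}}\big)\,|-\rangle_{\mathrm{R}},
\qquad (3)
```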
and returns the resulting state to the agent.
(3) The agent performs a reflection \({U}_{{\rm{R}}}=2|\psi\rangle\langle\psi|_{\rm{A}}-{{\mathbb{1}}}_{{\rm{A}}}\) over the initial state \(|\psi\rangle_{\rm{A}}\).
The last step leads to amplitude amplification similar to Grover’s algorithm^{27} and thus to an increased probability sin^{2}(3ξ) (ref. ^{28}) to find rewarded action sequences (Methods). In our experiment, the hybrid agent performs a single query (and thus a single step of amplitude amplification) during a quantum epoch. However, the general framework allows for multiple steps of amplitude amplification in consecutive quantum epochs.
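The effect of a single quantum epoch can be checked numerically in the two-dimensional subspace spanned by \(|\ell\rangle\) and \(|w\rangle\); a minimal numpy sketch, with the oracle and the reflection written as 2 × 2 matrices:

```python
import numpy as np

def one_grover_epoch(eps):
    """Winning probability after one oracle call plus one reflection,
    acting in the span of the losing |l> and winning |w> components."""
    xi = np.arcsin(np.sqrt(eps))
    psi = np.array([np.cos(xi), np.sin(xi)])        # cos(xi)|l> + sin(xi)|w>
    oracle = np.diag([1.0, -1.0])                   # flips the sign of |w>
    reflect = 2.0 * np.outer(psi, psi) - np.eye(2)  # U_R = 2|psi><psi| - 1
    return (reflect @ oracle @ psi)[1] ** 2

# For eps = 0.01 this gives sin^2(3*xi) ~ 0.0876, an almost ninefold boost.
print(one_grover_epoch(0.01))
```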
While quantum epochs lead to an increased winning probability, they do not reveal the reward (or corresponding percept sequence, in general). The reward can be determined only via classical test epochs, where the obtained action sequence is used as input. Thus, the hybrid agent alternates between quantum and classical test epochs, updating its policy every time a reward is obtained after a test epoch. Such agents accomplish their task of finding winning action sequences faster, and hence learn faster than entirely classical agents. This approach allows us to quantify the speedup in learning time, which is not possible in the general setting discussed in ref. ^{2}. The learning speedup manifests in a reduced average learning time \({\langle T\rangle }_{{\rm{Q}}}\), that is, the average number of epochs necessary to achieve a certain winning probability P_{L}. In general, a quadratic improvement can be achieved if the maximal number of coherent interactions between agent and environment scales with the problem size (Methods).
Experimental implementation
Quantized RL protocols can be compactly realized using state-of-the-art photonic technology^{29}. Nowadays, integrated photonic platforms hold the advantage of providing scalable architectures where many elementary components can be accommodated on small devices^{30}. Here we use a programmable nanophotonic processor comprising 26 waveguides fabricated to form 88 Mach–Zehnder interferometers (MZIs). An MZI is equipped with two configurable phase shifters as shown in Fig. 2a, b, and acts as a tunable beam splitter. Information is spatially encoded onto two orthogonal modes \(|0\rangle \) = (1, 0)^{T} and \(|1\rangle \) = (0, 1)^{T}, which constitute the computational basis (T indicates the transpose).
As illustrated in Fig. 2c, pairs of single photons are generated (at telecommunication wavelengths) from a single-photon source. One photon is coupled into a waveguide and then detected by single-photon detectors D1, D2 or D3, while the other one is sent to D0 for heralding (that is, clicks in detectors D1, D2 or D3 are registered in coincidence with clicks in D0). The detectors are superconducting nanowires with efficiencies of up to roughly 90% (see Methods for experimental details). The processor is divided into three regions, where the first and last are assigned to the agent, and the middle region to the environment, to carry out, in quantum epochs, steps 1–3 listed above. The agent is further equipped with a classical control mechanism (a feedback loop) that updates its learning policy.
In our experiment, we represent the winning and losing action states \(|w\rangle_{\rm{A}}\) and \(|\ell\rangle_{\rm{A}}\) by a single qubit via \(|1\rangle_{\rm{A}}=|w\rangle_{\rm{A}}\) and \(|0\rangle_{\rm{A}}=|\ell\rangle_{\rm{A}}\), and use another qubit to encode the reward \((|0\rangle_{\rm{R}},|1\rangle_{\rm{R}})\). This results in a four-level system, where each level is a waveguide in our processor, as shown in Fig. 3. The winning probability for the agent is initially set to ε = sin^{2}(ξ) = 0.01, representing a single rewarded action sequence out of 100. After a single photon is coupled into the mode \(|0_{\rm{A}}0_{\rm{R}}\rangle \), the agent creates the state \(|\psi\rangle_{\rm{A}}=(\cos ({\xi })\,|0\rangle_{\rm{A}}+\,\sin ({\xi })\,|1\rangle_{\rm{A}})\,|0\rangle_{\rm{R}}\) by applying a unitary U_{P}. Next, it can decide to play classically or quantum mechanically.
In a classical strategy, the environment flips the reward qubit only if the action qubit is in the winning state via U_{E} (Fig. 3a). Next, the photon is coupled out and detected in either D1 or D2 with probability cos^{2}(ξ) and sin^{2}(ξ), respectively. If D2 is triggered (that is, the agent has been rewarded), a feedback mechanism updates the policy π by updating the winning probability ε_{j} after having obtained j rewards as
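The update rule labelled (4) is missing from the extracted text. From the Methods (projective-simulation weights initialized to h_{w} = 1 and h_{ℓ} = 99, losing weights never updated, and h_{w} increased by λ per reward), it can be reconstructed as:

```latex
\varepsilon_{j} \;=\; \frac{1+\lambda j}{100+\lambda j},
\qquad (4)
```

which with λ = 2 gives the sequence ε_{0} = 0.01, ε_{1} = 3/102, ε_{2} = 5/104, and so on.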
π is related to p(a), and therefore to ε, via equation (5) (Methods).
In a quantum strategy, after the reward qubit is rotated to \(|-\rangle_{\rm{R}}\) via two operators U_{H0} and U_{H1}, the environment acts as an oracle via U_{E} as in equation (3). Subsequently, the agent reverses the effect of U_{H0} and U_{H1}, and performs the reflection U_{R} (Fig. 3b). Measuring in the computational basis of the action register then leads to the detection of a rewarded action sequence with increased probability sin^{2}(3ξ) in D3. For practical reasons, the classical test epoch is implemented only in software (Methods). The update rule remains the same as in the classical case.
In general, any Groverlike algorithm faces a drop in amplitude amplification after the optimal point is reached. As different agents will reach this optimal point in different epochs, one can identify the probability ε = 0.396 until which it is beneficial for all agents to use a quantum strategy, as they will observe more rewards than in the classical strategy on average (Methods). When this probability is surpassed, it is advantageous to switch to an entirely classical strategy. This combined strategy thus avoids the typical amplitude amplification drop without introducing additional overheads in terms of experimental resources.
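The switching point quoted above can be re-derived as a consistency check (not part of the original text): setting the quantum reward rate sin²(3ξ)/2 equal to the classical rate sin²(ξ), and using sin(3ξ) = sin(ξ)(3 − 4 sin²ξ), gives (3 − 4ε)² = 2:

```python
import math

# eta_C = eps; eta_Q = eps * (3 - 4*eps)**2 / 2 (reward spread over the
# two epochs of a quantum round). Equality (3 - 4*eps)**2 = 2 gives:
eps_cross = (3 - math.sqrt(2)) / 4
print(round(eps_cross, 3))  # 0.396

def eta_c(eps):
    return eps

def eta_q(eps):
    return eps * (3 - 4 * eps) ** 2 / 2
```

Just below the crossover the quantum strategy yields more reward per epoch; just above it the classical strategy wins, matching the switching rule described in the text.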
Results
At the end of each classical epoch, we record outcomes 1 and 0 for the rewarded and nonrewarded behaviour, respectively, obtaining a binary sequence whose length equals the number of played epochs in the classical learning strategy and half of the number of played epochs in the quantum strategy (as here two epochs, quantum and classical test, are needed to obtain the reward). For a fair comparison between these scenarios, in the quantum strategy, the reward is distributed (that is, averaged) over the quantum and classical test epochs. The reward is then averaged over different independent agents. Figure 4 shows this average reward η for the different learning strategies.
The theoretical data are simulated for n = 10,000 agents and the experimental data obtained from n = 165. Figure 4a visualizes the quantum improvement originating from the use of amplitude amplification compared with a purely classical strategy. For completeness, the comparison between not distributing and distributing the reward over two epochs in the quantum strategy is shown in Fig. 4b, c.
When ε = 0.396 is reached, η for the quantum strategy starts decreasing, as visible in Fig. 4a. Our setup allows the agents to choose the favourable strategy by switching from quantum to classical when the latter becomes advantageous. This combined strategy outperforms the purely classical scenario, as shown in Fig. 4d. As previously discussed, a certain winning probability P_{L} has to be defined to quantify the learning time. Choosing P_{L} = 0.37 (note, however, that any probability below ε = 0.396 can be chosen), the learning time \(\langle T\rangle \) for P_{L} decreases from \({\langle T\rangle }_{{\rm{C}}}\) = 270 in the classical strategy to \({\langle T\rangle }_{{\rm{Q}}}\) = 100 in the combined strategy. This implies a reduction of 63%, which agrees well with the theoretical values \({T}_{{\rm{C}}}^{{\rm{theory}}}\) = 293 and \({T}_{{\rm{Q}}}^{{\rm{theory}}}\) = 97, accounting for small experimental imperfections.
In general, hybrid agents can experience a quadratic speedup in their learning time if arbitrary numbers of coherent Grover iterations can be performed, even if the number of rewarded actions is unknown^{31}.
Conclusions
We have demonstrated an RL protocol where an agent can boost its performance by allowing quantum communication with the environment. This enables a quantum speedup in its learning time and optimal control of the learning process. Emerging photonic technology provides the advantages of compactness, tunability and lowloss communication, thus proving suitable for RL algorithms where activefeedback mechanisms, even over long distances, need to be implemented. Future scaledup implementations of our protocol rely on a linearly increasing number of waveguides with the action space size when considering action sequences of length L = 1, and the use of just a single photon. In this case, a learning task with N different actions requires a processor with 2N modes, while 3N + 1 or maximally \(\frac{{\rm{\pi }}}{4}\sqrt{N}(3N+1)\) gates are needed for a single or multiple rounds of amplitude amplification, respectively. In general, multiple photons will be required to deal with a combinatorially big space of action sequences of arbitrary length L. We envision our protocol to aid specifically in problems where frequent search is needed, for example, network routing problems, where, for instance, tens of qubits, waveguides and detectors would be employed to represent search spaces of 10^{4} elements. In general, the development of superconducting detectors, ondemand singlephoton sources^{32} or the largescale integration of artificial atoms within photonic circuits^{33} suggest substantial steps towards scalable multiphoton applications. Although photonic architectures are particularly suitable for such learning algorithms, our theoretical background is applicable to different platforms, for example, trapped ions or superconducting qubits. Here future realizations can feature the implementation of agent and environment as spatially separated systems, and a light–matter quantum interface for coherent exchange between them^{24,34}.
Methods
Quantum enhancement in RL agents
Here we present an explicit method for combining a classical agent with quantum amplitude amplification. Introducing a feedback loop between the classical policy update and quantum amplitude amplification, we are able to determine the achievable improvements in sample complexity, and thus in learning time. In addition, the final policy of our agent has properties similar to those of the underlying classical agent, leading to comparable behaviour, as discussed in more detail in A.H. et al. (manuscript in preparation), which is dedicated to the theoretical background of the hybrid agent.
In the following, we focus on simple DSE environments, where the interaction between the agent and the environment is structured into epochs. Each epoch starts with the same percept s_{0}, and at each time step i an action–percept pair (a_{i}, s_{i}) is exchanged. Many interesting environments are epochal, for example, in applications of RL to quantum physics^{21,35,36,37} or popular problems such as playing Go^{10}. At the end of each epoch, after L action–percept pairs are communicated, the agent receives a reward r ∈ {0, 1}. The rules of the game are deterministic and time independent, such that performing a specific action a_{i} after receiving a percept s_{i−1} always leads to the same following percept s_{i}.
The behaviour of an agent is determined by its policy, described by the probability π(a_{i}|s_{i−1}) to perform the action a_{i} given the percept s_{i−1}. In deterministic settings, the percept s_{i} is completely determined by all previously performed actions a_{1}, ⋅⋅⋅, a_{i} such that π(a_{i}|s_{i−1}) = π(a_{i}|a_{1}, ⋅⋅⋅, a_{i−1}). Thus, the behaviour of the agent within one epoch is described by action sequences a = (a_{1}, ..., a_{L}) and their corresponding probabilities
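The display equation for the sequence probability is missing here; from the chain of conditional policies just described it is presumably:

```latex
p(\mathbf{a}) \;=\; \prod_{i=1}^{L} \pi(a_{i}\,|\,a_{1},\cdots,a_{i-1}).
```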
Our learning agent uses a policy based on projective simulation^{26}, where each action sequence a is associated with a weight factor h(a) initialized to h = 1. Its policy is defined via the probability distribution
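The display equation defining the projective-simulation policy, referenced in the main text as equation (5), appears to have been lost; the standard form consistent with the weight factors h(a) is:

```latex
p(\mathbf{a}) \;=\; \frac{h(\mathbf{a})}{\sum_{\mathbf{a}'} h(\mathbf{a}')}.
\qquad (5)
```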
In our experiment, the initial winning probability ε is given by
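The missing display equation for ε can be reconstructed from the definitions above as the total policy weight on rewarded sequences:

```latex
\varepsilon \;=\; \sum_{\{\mathbf{a}\,|\,r(\mathbf{a})>0\}} p(\mathbf{a})
```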
and is set to 1/100. If the agent has chosen the sequence a, it updates the corresponding weight factor via
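The weight-update equation is missing here; the standard projective-simulation rule consistent with the following sentence is:

```latex
h(\mathbf{a}) \;\rightarrow\; h(\mathbf{a}) + \lambda\, r(\mathbf{a}),
```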
where λ = 2 in our experiment and r(a) = 1 (0) if a is rewarded (non-rewarded). Thus, the winning probability after the agent has found j rewards is given by equation (4). In general, the update method for quantum-enhanced agents is not limited to projective simulation and can be used to enhance any classical learning scenario, provided that p(a) exists and that the update rule is solely based on the observed rewards. We generalize the given learning problem to the quantum domain by encoding different action sequences a into orthogonal quantum states \(|{\bf{a}}\rangle \) defining our computational basis. In addition, we create a fair unitary oracular variant of the environment^{2}, whose effective behaviour on the action register can be described by \({\tilde{U}}_{{\rm{E}}}\) as
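The display equation for the oracle is missing; for a phase oracle marking rewarded sequences it reads:

```latex
\tilde{U}_{\mathrm{E}}\,|\mathbf{a}\rangle \;=\; (-1)^{\,r(\mathbf{a})}\,|\mathbf{a}\rangle.
```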
The unitary oracle \({\tilde{U}}_{{\rm{E}}}\) can be used to perform, for instance, a Grover search or amplitude amplification for rewarded action sequences by performing Grover iterations
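The display equation for the Grover iteration, referenced later as the unitary (10), is missing; it can be reconstructed as the oracle followed by the reflection over the initial state:

```latex
U_{\mathrm{G}} \;=\; U_{\mathrm{R}}\,\tilde{U}_{\mathrm{E}},
\qquad
U_{\mathrm{R}} \;=\; 2|\psi\rangle\langle\psi| - \mathbb{1},
\qquad (10)
```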
on an initial state \(|\psi\rangle \). A quantum-enhanced agent with access to \({\tilde{U}}_{{\rm{E}}}\) can thus find rewarded action sequences faster than a corresponding classical agent defined by the same initial policy \(\pi(a_{i}|s_{i-1})\) and update rules.
In general, the optimal number k of Grover iterations \({U}_{{\rm{G}}}^{k}|\psi\rangle \) depends on the winning probability ε via \(k\propto 1/\sqrt{\varepsilon }\) (ref. ^{27}). In the following, we assume that ε is known at least to a good approximation. This is, for instance, possible if the number of rewarded action sequences is known. However, a similar agent can also be developed if ε is unknown by adapting methods from ref. ^{31}, as described in A.H. et al. (manuscript in preparation).
Description of the agent
A quantumenhanced hybrid agent is constructed via the following steps:
(1) Given the classical probability distribution p(a), determine the winning probability ε = sin^{2}(ξ) based on the current policy and prepare the quantum state in the action register:
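The display equation for the prepared state, referenced later as the initial state (12), is missing; consistent with the main text it is:

```latex
|\psi\rangle_{\mathrm{A}}
 \;=\; \sum_{\mathbf{a}} \sqrt{p(\mathbf{a})}\,|\mathbf{a}\rangle_{\mathrm{A}}
 \;=\; \cos(\xi)\,|\ell\rangle_{\mathrm{A}} + \sin(\xi)\,|w\rangle_{\mathrm{A}}.
\qquad (12)
```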
Here the quantum states
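The definitions of the two states are missing from the extracted text; the normalized losing and winning superpositions consistent with equation (12) are:

```latex
|\ell\rangle = \frac{1}{\cos(\xi)} \sum_{\{\mathbf{a}|r(\mathbf{a})=0\}} \sqrt{p(\mathbf{a})}\,|\mathbf{a}\rangle,
\qquad
|w\rangle = \frac{1}{\sin(\xi)} \sum_{\{\mathbf{a}|r(\mathbf{a})>0\}} \sqrt{p(\mathbf{a})}\,|\mathbf{a}\rangle
```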
contain all losing and winning components, respectively. In our experiment, we identify \(|\ell\rangle \widehat{=}|0\rangle \) and \(|w\rangle \widehat{=}|1\rangle \). The task assigned to the agent is to (learn to) perform the winning sequences \(|w\rangle \) via policy update. This translates to a maximization of the obtained reward.
(2) Apply the optimal number \(k(\sqrt{\varepsilon })\) of Grover iterations leading to
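The display equation for the amplified state is missing; from the rotation picture derived in the next section it is:

```latex
|\psi'\rangle \;=\; U_{\mathrm{G}}^{\,k}\,|\psi\rangle
 \;=\; \cos[(2k+1)\xi]\,|\ell\rangle + \sin[(2k+1)\xi]\,|w\rangle,
```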
and perform a measurement in the computational basis on \(|\psi'\rangle \) to determine a test action sequence a.
(3) Play one classical epoch by using the test sequence a determined in step 2 and obtain the corresponding percept sequence s(a) and the reward r(a).
(4) Update ε, and thus the classical policy π, using the rule in equation (4).
There exists a limit P on ε determining whether it is more advantageous for the agent to perform k Grover iterations with \(k(\sqrt{\varepsilon })\ge 1\) or sample directly from p(a) (therefore \(k(\sqrt{\varepsilon })=0\)) to determine a. In the latter case, the agent would interact only classically (as in step 3) with the environment.
After each epoch, a classical agent receives a reward with probability sin^{2}(ξ). A quantum-enhanced agent can instead use one epoch to either perform one Grover iteration (step 2) or to determine the reward of a given test sequence a (step 3). After k Grover iterations, the winning probability is sin^{2}[(2k + 1)ξ] (ref. ^{28}; see next section). Thus, for k = 1, the agent receives a reward after every second epoch with probability sin^{2}(3ξ). Therefore, we define the expected average reward of an agent playing a classical strategy as η_{C} = sin^{2}(ξ) and of an agent playing a quantum strategy with k = 1 as η_{Q} = sin^{2}(3ξ)/2. For ε < P, η_{Q} > η_{C}, meaning that the quantum strategy proves advantageous over the classical case. However, as soon as η_{Q} = η_{C} (at P = 0.396), a classical agent starts outperforming a quantum-enhanced agent that still performs Grover iterations.
Determining the winning probability ε exactly as in the example presented here is not always possible. In general, additional information such as the number of possible solutions and model building helps to perform this task. Note that a P smaller than 0.396 should be chosen if ε can only be estimated up to some range. To circumvent this problem, methods like Grover search with unknown reward probability^{31} or fixedpoint search^{38} can be used to determine whether and how many steps of amplitude amplification should be performed (A.H. et al., manuscript in preparation).
Enhancement of the winning probability
After a quantum epoch, the amplitude sin(ξ) of the winning state \(|w\rangle \) increases to sin(3ξ). Here we derive this result. The projectors onto the winning and losing subspaces are given by \({\Pr }_{w}={\sum }_{\{{\bf{a}}|r({\bf{a}}) > 0\}}|{\bf{a}}\rangle \langle {\bf{a}}|\) and \({\Pr }_{{\ell }}={\sum }_{\{{\bf{a}}|r({\bf{a}})=0\}}|{\bf{a}}\rangle \langle {\bf{a}}|\), respectively, which are orthogonal and sum to the identity. Therefore, the initial state (12) can be decomposed into a normalized winning \(|w\rangle \propto {\Pr }_{w}|\psi\rangle \) and losing \(|\ell\rangle \propto {\Pr }_{{\ell }}|\psi\rangle \) component, and the unitary (10) implementing one Grover iteration can be written as
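The display equation is missing; writing the oracle in terms of the winning projector, the Grover iteration takes the form:

```latex
U_{\mathrm{G}} \;=\; \big(2|\psi\rangle\langle\psi| - \mathbb{1}\big)\big(\mathbb{1} - 2\Pr\nolimits_{w}\big).
```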
Now, let us investigate the effect of U_{G} on an arbitrary real superposition
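The display equation defining this superposition is missing; it is presumably:

```latex
|s\rangle \;=\; \cos(\alpha)\,|\ell\rangle + \sin(\alpha)\,|w\rangle
```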
in the plane spanned by \(|w\rangle \) and \(|\ell\rangle \). Using the trigonometric addition theorems, the application of one Grover iteration to \(|s\rangle \)
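The display equation for the result of this application is missing; carrying out the algebra (the oracle maps α to −α, and the reflection over \(|\psi\rangle\) maps −α to α + 2ξ) gives:

```latex
U_{\mathrm{G}}\,|s\rangle \;=\; \cos(\alpha+2\xi)\,|\ell\rangle + \sin(\alpha+2\xi)\,|w\rangle
```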
can be identified as a rotation of 2ξ in the plane. Assuming α = ξ, we therefore find the amplitude of \(w\rangle \) to be sin(3ξ), and thus obtain a winning probability sin^{2}(3ξ). Implementing k Grover iterations leads to sin^{2}[(2k + 1)ξ].
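This rotation picture can be verified numerically; a small sketch that composes k explicit 2ξ rotations and compares the result against sin²[(2k + 1)ξ]:

```python
import numpy as np

def amplified_prob(eps, k):
    """Winning probability after k Grover iterations, each implemented
    as a rotation by 2*xi in the plane spanned by |l> and |w>."""
    xi = np.arcsin(np.sqrt(eps))
    c, s = np.cos(2 * xi), np.sin(2 * xi)
    rot = np.array([[c, -s], [s, c]])            # one Grover iteration
    state = np.array([np.cos(xi), np.sin(xi)])   # initial state at angle xi
    for _ in range(k):
        state = rot @ state
    return state[1] ** 2                         # |<w|state>|^2

xi = np.arcsin(np.sqrt(0.01))
for k in range(4):
    assert abs(amplified_prob(0.01, k) - np.sin((2 * k + 1) * xi) ** 2) < 1e-12
```

For ε = 0.25 (ξ = π/6), a single iteration already rotates the state onto \(|w\rangle\) exactly, since 3ξ = π/2.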
Learning time
We define the learning time T as the number of epochs an agent needs on average to reach a certain winning probability P_{L}. The hybrid agent can reach P with fewer epochs on average than its classical counterpart. However, once both reach P, they need on average the same number of epochs to reach P + Δ_{P}, where Δ_{P} is a certain value satisfying 0 ≤ Δ_{P} < 1 − P. Therefore, we choose P_{L} ≤ P to quantify the achievable improvement of a hybrid agent compared with its classical counterpart. In our experiment, we choose P_{L} = 0.37 to define the learning time.
Let \({l}_{J}=\{{{\bf{a}}}_{1},\cdots ,{{\bf{a}}}_{J}\}\) be a timeordered list of all the rewarded action sequences an agent has found until it reaches P_{L}. Note that the actual policy π_{j}, and thus p_{j}, of our agents depend only on the list l_{j} of observed rewarded action sequences, and this is independent of whether they have found them via classical sampling or quantum amplitude amplification. As a result, a classical agent and its quantumenhanced hybrid version are described by the same policy π(l_{j}) and behave similarly if they have found the same rewarded action sequences. However, the hybrid agent finds them faster.
In general, the actual policy and overall winning probability might depend on the rewarded action sequences that have been found. Thus, the number J of observed rewarded action sequences necessary to learn might vary. However, this is not the case for the experiment reported here. In our case, the learning time can be determined via
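The display equation here is missing; given the later cross-references it is presumably equation (19), the sum of the waiting times between consecutive rewards:

```latex
\langle T(J)\rangle \;=\; \sum_{j=0}^{J-1} \langle t_{j}\rangle,
\qquad (19)
```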
where t_{j} determines the number of epochs necessary to find the next rewarded sequence a_{j+1} after it has observed j rewards. For a purely classical agent, the average time is given by
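The missing display equation, presumably (20), follows from the geometric waiting time at success probability ε_{j}:

```latex
\langle t_{j}\rangle_{\mathrm{C}} \;=\; \frac{1}{\varepsilon_{j}},
\qquad (20)
```

which, summed over j with ε_{j} from equation (4), reproduces the classical prediction of 293 epochs quoted in the main text.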
This time is quadratically reduced to
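The display equation for the quantum waiting time is missing; consistent with the Grover scaling it is:

```latex
\langle t_{j}\rangle_{\mathrm{Q}} \;=\; \frac{\alpha}{\sqrt{\varepsilon_{j}}}
```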
for the hybrid agent. Here α is a parameter depending only on the number of epochs needed to create one oracle query \({\tilde{U}}_{{\rm{E}}}\) (ref. ^{2}) and on whether ε_{j} is known. In the case considered here, we find α = π/4. As a consequence, the average learning time for the hybrid agent is given by
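The missing display equation is the Cauchy–Schwarz bound referred to in the following sentence; it can be reconstructed as:

```latex
\langle T(J)\rangle_{\mathrm{Q}}
 \;=\; \alpha \sum_{j=0}^{J-1} \frac{1}{\sqrt{\varepsilon_{j}}}
 \;\le\; \alpha \sqrt{J \sum_{j=0}^{J-1} \frac{1}{\varepsilon_{j}}}
 \;=\; \alpha \sqrt{J\,\langle T(J)\rangle_{\mathrm{C}}},
```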
where we used the Cauchy–Schwarz inequality and equations (19) and (20). The classical learning time typically scales with \({\langle T\rangle }_{{\rm{C}}}\propto {A}^{K}\) for a learning problem with episode length K and the choice between A different actions in each step. The number J depends on the specific policy update and sometimes also on the list l_{J} of observed rewarded action sequences. For an agent sticking with the first rewarded action sequence, we would find J = 1. However, typical learning agents are more explorative, and common scalings are J ∝ K such that we find
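The display equation for this scaling is missing; inserting J ∝ K and \(\langle T\rangle_{\mathrm{C}} \propto A^{K}\) into the Cauchy–Schwarz bound gives:

```latex
\langle T\rangle_{\mathrm{Q}} \;\propto\; \sqrt{K\,\langle T\rangle_{\mathrm{C}}} \;\propto\; \sqrt{K\,A^{K}}
```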
for these cases. This is equivalent to a quasiquadratic speedup in the learning time if arbitrary numbers of Grover iterations can be performed.
In more general settings, there exist several possible l_{J} with different length J such that the learning time \(\langle T(J)\rangle \) needs to be averaged over all possible l_{J}, which again leads to a quadratic speedup in learning time (A.H. et al., manuscript in preparation).
Limited coherent evolutions
In general, all nearterm quantum devices allow for coherent evolution only for a limited time and are thus limited to a maximal number of Grover iterations. For winning probabilities ε = sin^{2}(ξ) with (2k + 1)ξ ≤ π/2, performing k Grover iterations leads to the highest probability of finding a rewarded action.
Again, we assume that the actual policy of an agent depends on only the number of observed rewards an agent has found. As a consequence, the average time a hybrid agent limited to k Grover iterations needs to achieve the winning probability P < sin^{2}[π/(4k + 2)] is given by
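The display equation here is missing; counting α₀ epochs per oracle query for each of the k iterations plus one classical test epoch, with success probability sin²[(2k + 1)ξ_{j}] per attempt, it can be reconstructed as:

```latex
\langle T(J)\rangle_{k}
 \;=\; \sum_{j=0}^{J-1} \frac{\alpha_{0}k + 1}{\sin^{2}[(2k+1)\xi_{j}]},
\qquad \varepsilon_{j} = \sin^{2}(\xi_{j}),
```

with α₀ = k = 1 this expression reproduces the quantum learning-time prediction of roughly 97 epochs quoted in the main text.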
with α_{0} determining the number of epochs necessary to create one oracle query \({\tilde{U}}_{{\rm{E}}}\). For α_{0} = 1, k ≫ 1 and (2k + 1)ξ_{J} ≪ π/2, we can approximate the learning time for the hybrid agent via
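The display equation for this approximation is missing; with sin²[(2k + 1)ξ_{j}] ≈ (2k + 1)²ε_{j} and the stated limits it reads:

```latex
\langle T(J)\rangle_{k} \;\approx\; \sum_{j=0}^{J-1} \frac{1}{4k\,\varepsilon_{j}}
 \;=\; \frac{\langle T(J)\rangle_{\mathrm{C}}}{4k},
```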
where we used sin(x) ≈ x for x ≪ 1. In general, it can be shown (A.H. et al., manuscript in preparation) that the winning probability P_{k} = sin^{2}[π/(4k + 2)] can be reached by a hybrid agent limited to k Grover iterations in a time
where γ is a factor depending on the specific setting.
In our case, equation (24) can be used to compute the lower bound for the average quantum learning time, with α_{0} = k = 1. For the classical strategy, equation (20), together with equation (19), is used. Thus, given P_{L} = 0.37, the predictions for the learning time in our experiment are \({T}_{{\rm{Q}}}^{{\rm{theory}}}\) = 97 and \({T}_{{\rm{C}}}^{{\rm{theory}}}\) = 293.
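Both predictions can be reproduced numerically under the stated assumptions (update rule ε_{j} = (1 + 2j)/(100 + 2j); 1/ε_{j} epochs per reward classically; two epochs per trial with success probability sin²(3ξ_{j}) in the combined strategy). Small rounding differences against the quoted 97 remain:

```python
import math

LAM, P_L = 2, 0.37

def eps_j(j):
    """Winning probability after j rewards (projective-simulation update
    with initial weights h_w = 1, h_l = 99 and lambda = 2)."""
    return (1 + LAM * j) / (100 + LAM * j)

# Number of rewards needed before the winning probability reaches P_L.
J = next(j for j in range(200) if eps_j(j) >= P_L)

# Classical strategy: on average 1/eps_j epochs per new reward.
T_C = sum(1 / eps_j(j) for j in range(J))

# Combined strategy with k = 1: two epochs (quantum + test) per trial,
# success probability sin^2(3 * xi_j) per trial.
T_Q = sum(2 / math.sin(3 * math.asin(math.sqrt(eps_j(j)))) ** 2
          for j in range(J))

print(J, round(T_C), round(T_Q))
```

The classical sum rounds to 293, and the combined-strategy sum lands within a couple of epochs of the quoted 97, consistent with the speedup reported in the Results.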
Experimental details
A continuous-wave laser (Coherent Mira HP) is used to pump a single-photon source producing photon pairs in the telecommunication-wavelength band. The laser light has a central wavelength of 789.5 nm and pumps the single-photon source at a power of approximately 100 mW. The source is a periodically poled KTiOPO_{4} nonlinear crystal placed in a Sagnac interferometer^{39,40}, where the emission of single photons occurs via a type-II spontaneous parametric down-conversion process. The crystal (produced by Raicol) is 30 mm long, set to a temperature of 25 °C, has a poling period of 46.15 μm and is quasi-phase matched for degenerate emission of photons at 1,570 nm when pumping with coherent laser light at 785 nm. As the processor is calibrated for a wavelength of 1,580 nm, we shift the wavelength of the laser light to 789.75 nm to produce one photon at 1,580 nm (that is then coupled into the processor) and another one at 1,579 nm (the heralding photon).
The processor is a silicon-on-insulator type, designed by the Quantum Photonics Laboratory at the Massachusetts Institute of Technology (MIT)^{30}. Each programmable unit on the device acts as a tunable beam splitter implementing the unitary
where θ and ϕ are the internal and external phases shown in Fig. 2a, b, set via thermo-optical phase shifters controlled by a voltage supply. The achievable precision for the phase settings is better than 250 μrad. The bandwidth of the phase shifters is around 130 kHz. The waveguides, spatially separated from one another by 25.4 μm, are designed to admit only one linear polarization. The high refractive-index contrast between the silicon and the silica (the insulator) allows for waveguides with very small bend radii (less than 15 μm), thus enabling high component density (in our case 88 MZIs) on small areas (in our case, 4.9 × 2.4 mm). Given the small dimensions, the in- and out-coupling is realized with the help of Si_{3}N_{4}–SiO_{2} waveguide arrays (produced by Lionix International) that shrink (and enlarge) the 10-μm mode of the optical fibres to match the 2-μm mode size of the waveguides in the processor. The total input–output loss is around 7 dB. The processor is stabilized at a temperature of 28 °C and calibrated at 1,580 nm for optimal performance. To reduce the black-body radiation emitted as the phase shifters heat up when voltage is applied, wavelength-division multiplexers with a transmission peak centred at 1,571 nm and a bandwidth of 13 nm are placed before the photons are sent to the detectors. In our processor, two external phase shifters in the implemented circuits were not responding to the supplied voltage. These defects were accounted for by employing an optimization procedure.
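The 2 × 2 transfer matrix of such a Mach–Zehnder unit can be sketched numerically. The convention below (external phase ϕ on one input arm, a balanced 50:50 directional coupler, internal phase θ, and a second coupler) is a common one for this processor architecture and is stated here as an assumption, not as the exact form used on the device:

```python
import numpy as np

def beam_splitter() -> np.ndarray:
    """Balanced 50:50 directional coupler."""
    return np.array([[1, 1j], [1j, 1]]) / np.sqrt(2)

def mzi(theta: float, phi: float) -> np.ndarray:
    """Tunable beam splitter (illustrative convention):
    external phase phi, coupler, internal phase theta, coupler."""
    ext = np.diag([np.exp(1j * phi), 1])      # external phase shifter
    inner = np.diag([np.exp(1j * theta), 1])  # internal phase shifter
    return beam_splitter() @ inner @ beam_splitter() @ ext

U = mzi(0.3, 1.2)
# Product of unitaries, so U is unitary; the internal phase theta
# sets the splitting ratio, |U[0, 0]| = sin(theta / 2).
print(np.allclose(U.conj().T @ U, np.eye(2)))
```

With this convention, θ tunes the effective reflectivity of the unit while ϕ sets the relative phase between the output modes, which is the role the two phases play in the circuit above.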
The single-photon detectors are multi-element superconducting nanowires (produced by Photon Spot) with efficiencies of up to 90% in the telecommunication-wavelength band. They have a dark-count rate of about 100 counts per second, low timing jitter (hundreds of picoseconds) and a reset time of <100 ns (ref. ^{41}). Coincidence events are those detection events registered in D0 and at the output of the processor that fall into a temporal window of 1.3 ns (the coincidence window), and are found using a time-tagging module (TTM). In more detail, to record coincidences and then update the agent’s policy accordingly, the following steps are performed in the classical/quantum strategy (after initially setting \({h}_{w}={\sum }_{\{{\bf{a}}|r({\bf{a}}) > 0\}}h({\bf{a}})=1\) and \({h}_{{\ell }}={\sum }_{\{{\bf{a}}|r({\bf{a}})=0\}}h({\bf{a}})=99\) such that ε = 0.01):
(1) The TTM records the time tags for photons in D0, D1 and D2/D3.
(2) A Python script converts the time tags into arrival times and iterates through them until it finds a coincidence event between either D0 and D1, or D0 and D2/D3.
(3) If a coincidence event between D0 and D1 (or D0 and D2/D3) is found first, a 0 (1) is sent to another Python script, running on a different computer, that controls the MZIs’ phase shifters. If a 0 is sent, the ratio \(\varepsilon ={h}_{w}/({h}_{{\ell }}+{h}_{w})\) remains unchanged. If a 1 is sent, h_{w} is updated as h_{w} + 2, following the update in equation (4). In the quantum strategy, this step also includes the implementation of a classical test epoch.
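The coincidence search and policy update of steps (1)–(3) can be sketched as follows. The 1.3-ns window, the initial values h_w = 1 and h_ℓ = 99, and the h_w → h_w + 2 update follow the description above; the time-tag format and function names are illustrative:

```python
COINCIDENCE_WINDOW_NS = 1.3  # temporal window for a valid coincidence

def find_first_coincidence(herald_tags, reward_tags, no_reward_tags):
    """Scan arrival times (ns) from D0 (herald) against the reward and
    no-reward detectors; return 1 for the first coincidence with a
    reward detector, 0 for one with the no-reward detector."""
    events = sorted(
        [(t, 1) for t in reward_tags] + [(t, 0) for t in no_reward_tags]
    )
    for t0 in sorted(herald_tags):
        for t, outcome in events:
            if abs(t - t0) <= COINCIDENCE_WINDOW_NS:
                return outcome
    return None  # no coincidence found in this batch

def update_policy(h_w, h_l, outcome):
    """A reward (outcome 1) increments h_w by 2; otherwise h_w stays."""
    if outcome == 1:
        h_w += 2
    epsilon = h_w / (h_w + h_l)
    return h_w, epsilon

h_w, h_l = 1, 99  # initial h-values, so epsilon = 0.01
outcome = find_first_coincidence([100.0], [100.5], [250.0])
h_w, epsilon = update_policy(h_w, h_l, outcome)
print(h_w, round(epsilon, 4))  # 3 0.0294
```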
Implementing classical test epochs on hardware would require ‘testing’ the measured action state, that is, using the measured action sequence as input and making the environment act via U_{E}, thus leading to detection of the reward state \(|0\rangle_{\rm{R}}\) or \(|1\rangle_{\rm{R}}\). However, since this simple circuit works in very close agreement with theoretical predictions (its visibility exceeds 0.99), this part has been implemented in software only.
The update rate is about 1 Hz for both the classical and quantum epochs, and could in principle be increased up to the phase shifters’ bandwidth.
Data availability
All the datasets used in the current work are available on Zenodo at https://doi.org/10.5281/zenodo.4327211.
References
Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (MIT Press, 1998).
Dunjko, V., Taylor, J. M. & Briegel, H. J. Quantum-enhanced machine learning. Phys. Rev. Lett. 117, 130501 (2016).
Paparo, G. D., Dunjko, V., Makmal, A., Martín-Delgado, M. A. & Briegel, H. J. Quantum speedup for active learning agents. Phys. Rev. X 4, 031002 (2014).
Sriarunothai, T. et al. Speeding-up the decision making of a learning agent using an ion trap quantum processor. Quantum Sci. Technol. 4, 015014 (2019).
Johannink, T. et al. Residual reinforcement learning for robot control. In 2019 International Conference on Robotics and Automation (ICRA) 6023–6029 (IEEE, 2019).
Tjandra, A., Sakti, S. & Nakamura, S. Sequence-to-sequence ASR optimization via reinforcement learning. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 5829–5833 (IEEE, 2018).
Komorowski, M., Celi, L. A., Badawi, O., Gordon, A. C. & Faisal, A. A. The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nat. Med. 24, 1716–1720 (2018).
Thakur, C. S. et al. Large-scale neuromorphic spiking array processors: a quest to mimic the brain. Front. Neurosci. 12, 891 (2018).
Steinbrecher, G. R., Olson, J. P., Englund, D. & Carolan, J. Quantum optical neural networks. npj Quantum Inf. 5, 60 (2019).
Silver, D. et al. Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017).
Arute, F. et al. Quantum supremacy using a programmable superconducting processor. Nature 574, 505–510 (2019).
Dong, D., Chen, C., Li, H. & Tarn, T.-J. Quantum reinforcement learning. IEEE Trans. Syst. Man Cybern. B 38, 1207–1220 (2008).
Dunjko, V. & Briegel, H. J. Machine learning & artificial intelligence in the quantum domain: a review of recent progress. Rep. Prog. Phys. 81, 074001 (2018).
Baireuther, P., O’Brien, T. E., Tarasinski, B. & Beenakker, C. W. J. Machine-learning-assisted correction of correlated qubit errors in a topological code. Quantum 2, 48 (2018).
Breuckmann, N. P. & Ni, X. Scalable neural network decoders for higher dimensional quantum codes. Quantum 2, 68 (2018).
Chamberland, C. & Ronagh, P. Deep neural decoders for near-term fault-tolerant experiments. Quantum Sci. Technol. 3, 044002 (2018).
Fösel, T., Tighineanu, P., Weiss, T. & Marquardt, F. Reinforcement learning with neural networks for quantum feedback. Phys. Rev. X 8, 031084 (2018).
Poulsen Nautrup, H., Delfosse, N., Dunjko, V., Briegel, H. J. & Friis, N. Optimizing quantum error correction codes with reinforcement learning. Quantum 3, 215 (2019).
Yu, S. et al. Reconstruction of a photonic qubit state with reinforcement learning. Adv. Quantum Technol. 2, 1800074 (2019).
Krenn, M., Malik, M., Fickler, R., Lapkiewicz, R. & Zeilinger, A. Automated search for new quantum experiments. Phys. Rev. Lett. 116, 090405 (2016).
Melnikov, A. A. et al. Active learning machine learns to create new quantum experiments. Proc. Natl Acad. Sci. USA 115, 1221–1226 (2018).
Dunjko, V., Friis, N. & Briegel, H. J. Quantum-enhanced deliberation of learning agents using trapped ions. New J. Phys. 17, 023006 (2015).
Jerbi, S., Poulsen Nautrup, H., Trenkwalder, L. M., Briegel, H. J. & Dunjko, V. A framework for deep energybased reinforcement learning with quantum speedup. Preprint at https://arxiv.org/abs/1910.12760 (2019).
Kimble, H. J. The quantum internet. Nature 453, 1023–1030 (2008).
Cacciapuoti, A. S. et al. Quantum internet: networking challenges in distributed quantum computing. IEEE Netw. 34, 137–143 (2020).
Briegel, H. J. & De las Cuevas, G. Projective simulation for artificial intelligence. Sci. Rep. 2, 400 (2012).
Grover, L. K. Quantum mechanics helps in searching for a needle in a haystack. Phys. Rev. Lett. 79, 325–328 (1997).
Nielsen, M. A. & Chuang, I. L. Quantum Computation and Quantum Information (Cambridge Univ. Press, 2000).
Flamini, F. et al. Photonic architecture for reinforcement learning. New J. Phys. 22, 045002 (2020).
Harris, N. C. et al. Quantum transport simulations in a programmable nanophotonic processor. Nat. Photon. 11, 447–452 (2017).
Boyer, M., Brassard, G., Høyer, P. & Tapp, A. Tight bounds on quantum searching. Fortschr. Phys. 46, 493–505 (1998).
Senellart, P., Solomon, G. & White, A. High-performance semiconductor quantum-dot single-photon sources. Nat. Nanotechnol. 12, 1026–1039 (2017).
Wan, N. H. et al. Large-scale integration of artificial atoms in hybrid photonic circuits. Nature 583, 226–231 (2020).
Northup, T. E. & Blatt, R. Quantum information transfer using photons. Nat. Photon. 8, 356–363 (2014).
Denil, M. et al. Learning to perform physics experiments via deep reinforcement learning. In Proc. International Conference on Learning Representations (2017).
Bukov, M. et al. Reinforcement learning in different phases of quantum control. Phys. Rev. X 8, 031086 (2018).
Poulsen Nautrup, H. et al. Operationally meaningful representations of physical systems in neural networks. Preprint at https://arxiv.org/abs/2001.00593 (2020).
Yoder, T. J., Low, G. H. & Chuang, I. L. Fixed-point quantum search with an optimal number of queries. Phys. Rev. Lett. 113, 210501 (2014).
Kim, T., Fiorentino, M. & Wong, F. N. C. Phase-stable source of polarization-entangled photons using a polarization Sagnac interferometer. Phys. Rev. A 73, 012316 (2006).
Saggio, V. et al. Experimental few-copy multipartite entanglement detection. Nat. Phys. 15, 935–940 (2019).
Marsili, F. et al. Detecting single infrared photons with 93% system efficiency. Nat. Photon. 7, 210–214 (2013).
Acknowledgements
We thank L. A. Rozema, I. Alonso Calafell and P. Jenke for help with the detectors. A.H. acknowledges support from the Austrian Science Fund (FWF) through the project P 30937-N27. V.D. acknowledges support from the Dutch Research Council (NWO/OCW), as part of the Quantum Software Consortium programme (project number 024.003.037). N.F. acknowledges support from the Austrian Science Fund (FWF) through the project P 31339-N27. H.J.B. acknowledges support from the Austrian Science Fund (FWF) through SFB BeyondC F7102, the Ministerium für Wissenschaft, Forschung, und Kunst Baden-Württemberg (Az. 33-7533-30-10/41/1) and the Volkswagen Foundation (Az. 97721). P.W. acknowledges support from the research platform TURIS, the European Commission through ErBeStA (no. 800942), HiPhoP (no. 731473), UNIQORN (no. 820474), EPIQUS (no. 899368) and AppQInfo (no. 956071), from the Austrian Science Fund (FWF) through CoQuS (W1210-N25), BeyondC (F 7113) and Research Group (FG 5), and from Red Bull GmbH. The MIT portion of the work was supported in part by AFOSR award FA9550-16-1-0391 and NTT Research.
Author information
Contributions
V.S. and B.E.A. implemented the experiment and performed data analysis. A.H., V.D., N.F., S.W. and H.J.B. developed the theoretical idea. T.S. and P.S. provided help with the experimental implementation. N.C.H., M.H. and D.E. designed the nanophotonic processor. V.S., S.W. and P.W. supervised the project. All the authors contributed to writing the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature thanks Vojtěch Havlíček, Lucas Lamata and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Saggio, V., Asenbeck, B.E., Hamann, A. et al. Experimental quantum speedup in reinforcement learning agents. Nature 591, 229–233 (2021). https://doi.org/10.1038/s41586021032427