Experimental quantum speed-up in reinforcement learning agents

Saggio, V.; Asenbeck, B. E.; Hamann, A.; Strömberg, T.; Schiansky, P.; Dunjko, V.; Friis, N.; Harris, N. C.; Hochberg, M.; Englund, D.; Wölk, S.; Briegel, H. J.; Walther, P.

doi:10.1038/s41586-021-03242-7

Article
Published: 10 March 2021

Experimental quantum speed-up in reinforcement learning agents

Nature volume 591, pages 229–233 (2021)Cite this article

17k Accesses
90 Citations
209 Altmetric
Metrics details

Subjects

Abstract

As the field of artificial intelligence advances, the demand for algorithms that can learn quickly and efficiently increases. An important paradigm within artificial intelligence is reinforcement learning¹, where decision-making entities called agents interact with environments and learn by updating their behaviour on the basis of the obtained feedback. The crucial question for practical applications is how fast agents learn². Although various studies have made use of quantum mechanics to speed up the agent’s decision-making process^3,4, a reduction in learning time has not yet been demonstrated. Here we present a reinforcement learning experiment in which the learning process of an agent is sped up by using a quantum communication channel with the environment. We further show that combining this scenario with classical communication enables the evaluation of this improvement and allows optimal control of the learning progress. We implement this learning protocol on a compact and fully tunable integrated nanophotonic processor. The device interfaces with telecommunication-wavelength photons and features a fast active-feedback mechanism, demonstrating the agent’s systematic quantum advantage in a setup that could readily be integrated within future large-scale quantum communication networks.

You have full access to this article via your institution.

Download PDF

Realizing a deep reinforcement learning agent for real-time quantum feedback

Article Open access 06 November 2023

Coherent transport of quantum states by deep reinforcement learning

Article Open access 20 June 2019

Generalizable control for quantum parameter estimation through reinforcement learning

Article Open access 04 October 2019

Main

Rapid advances in the field of machine learning and in general artificial intelligence (AI) are paving the way towards intelligent algorithms and automation. An important paradigm within AI is reinforcement learning (RL), where decision-making entities called ‘agents’ interact with an environment, ‘learning’ to achieve a goal via feedback¹. Whenever the agent performs well (that is, makes the right decision), the environment rewards its behaviour, and the agent uses this information to progressively increase the likelihood of accomplishing its task. In this sense, an agent ‘learns’ by ‘reinforcement’. RL has applications in many sectors, from robotics^5,6 to the healthcare domain⁷, to brain-like computing simulation⁸ and neural network implementations^6,9. In addition, the celebrated AlphaGo algorithm¹⁰, which is able to beat even the most skilled human players at the game of Go, employs RL.

At the same time, quantum technologies have experienced remarkable progress¹¹. At the heart of quantum mechanics lies the superposition principle, dictating that even the simplest, two-dimensional quantum system is described by a continuum of infinitely many possible choices via a state vector $|\psi \rangle =\alpha |{\rm{0}}\rangle +\beta |1\rangle $ with the complex numbers α and β satisfying ${|\alpha |}^{2}+{|\beta |}^{2}=1$, while only two possible states, $|0\rangle $ and $|1\rangle $, exist classically. Advantageous RL algorithms^12,13 inspired by quantum mechanics have been successful in aiding problems in quantum information processing, for example, decoding of errors^14,15,16, quantum feedback¹⁷, adaptive code-design¹⁸, quantum-state reconstruction¹⁹ and even the design of quantum experiments^20,21. Conversely, quantum technologies have enabled quadratically faster decision-making processes for RL agents via the quantization of their internal hardware^3,4,22,23.

In all of these applications, agent and environment interact entirely classically. Here we consider a novel RL setting where they can also interact quantumly, formally via a quantum channel². We therefore introduce a quantum-enhanced hybrid agent capable of quantum as well as classical information transfer. This makes it possible to achieve and quantify a quantum speed-up in the agent’s learning time with respect to RL based solely on classical interaction.

We realize this protocol using a fully programmable nanophotonic processor interfaced with photons at telecommunication wavelengths. The setup enables the implementation of active-feedback mechanisms, thus proving suitable for demonstrations of RL algorithms. Moreover, such photonic platforms hold the potential of integrating RL quantum speed-ups in future quantum networks owing to the photons’ telecommunication wavelengths. A long-standing goal in the development of quantum communication lies in establishing a form of ‘quantum internet’^24,25, a highly interconnected network able to distribute and manipulate quantum states via optical links. We therefore envisage AI and RL to play important roles in future quantum networks, including a potential quantum internet, much in the same way that AI forms integral part of the internet today.

Quantum-enhanced RL

The conceptual idea of RL is shown in Fig. 1a. A decision-making entity called an agent interacts with an environment by receiving perceptual input (‘percepts’) s_i, and outputting specific ‘actions’ a_i accordingly, at a certain time step i. Different ‘rewards’ r issued by the environment for correct combinations of percepts and actions incentivize agents to improve their decision-making, and thus to learn¹.

**Fig. 1: Schematic of a learning agent.**

Although RL has already been shown amenable to quantum enhancements, the interaction has so far been restricted exclusively to classical communication, meaning that signals can only be composed from a fixed, discrete alphabet. For signals carried by quantum systems (for example, single photons considered here) this corresponds to a fixed preferred basis, for example, ‘vertical’ or ‘horizontal’ photon polarization, as shown in Fig. 1b.

In general, it has been shown that granting agents access to quantum hardware (while still considering classical communication) does not reduce the learning time, although it allows actions to be output quadratically faster^3,4. To achieve reductions in learning times, quantum communication becomes necessary.

We therefore consider an environment and a quantum-enhanced hybrid agent with access to internal quantum (as well as classical) hardware interacting by exchanging quantum states $|{a}_{i}\rangle $, $|{s}_{i}\rangle $ and $|r\rangle $, representing actions a_i, percepts s_i and rewards r, respectively. Such agents may behave ‘classically’, that is, use a classical channel, or ‘quantumly’, meaning that communication is no longer limited to a fixed preferred basis, but allows for exchanges of arbitrary superpositions via a quantum channel, as shown in Fig. 1c. In general, agents react to (sequences of) percepts $|{s}_{i-1}\rangle $ with (sequences of) actions $|{a}_{i}\rangle $ according to a policy ${\rm{\pi }}({a}_{i}|{s}_{i-1})$ that is updated during the learning process via classical control.

Within this framework, we focus on so-called² deterministic strictly epochal (DSE) learning scenarios, also called episodic instead of epochal¹. Here ‘epochs’ consist of strings of percepts s = (s₀, ..., s_L−1) with fixed s₀, actions a = (a₁, ..., a_L) of fixed length L, and a final reward r, and both s = s(a) and r = r(a) are completely determined by a. Therefore, no explicit representation of the percepts is required in our experiment (Methods). A non-trivial feature of the DSE scenario is that the effective behaviour of the environment can be modelled via a unitary U_E (ref. ²) on the action and reward registers A and R as

$${U}_{{\rm{E}}}{|{\bf{a}}\rangle }_{{\rm{A}}}{|0\rangle }_{{\rm{R}}}=\{\begin{array}{c}{|{\bf{a}}\rangle }_{{\rm{A}}}{|1\rangle }_{{\rm{R}}}\,\,{\rm{i}}{\rm{f}}\,\,r({\bf{a}}) > 0\\ {|{\bf{a}}\rangle }_{{\rm{A}}}{|0\rangle }_{{\rm{R}}}\,\,{\rm{i}}{\rm{f}}\,\,r({\bf{a}})=0\end{array}\cdot $$

(1)

U_E is similar to a generalized controlled-NOT gate such that in case of rewarded action sequences (r(a) > 0), the reward state is flipped. U_E can therefore be used to perform a quantum search for such sequences.

A hybrid agent can choose between quantum and classical behaviour in each epoch. In classical epochs, the agent prepares the state ${|{\bf{a}}\rangle }_{{\rm{A}}}{|0\rangle }_{{\rm{R}}}$, where a is determined by sampling from a classical probability distribution p(a) determined by its policy π. With a winning probability

$$\varepsilon ={\sin }^{2}(\xi )=\sum _{\{{\bf{a}}|r({\bf{a}}) > 0\}}p({\bf{a}}),$$

(2)

with ξ ∈ [0, 2π], the agent receives a reward and updates its policy according to a rule, presented in equation (4), based on projective simulation²⁶ (see also Methods). In quantum epochs, the following steps are performed:

(1) The agent prepares the state ${|\psi \rangle }_{{\rm{A}}}{|-\rangle }_{{\rm{R}}}$, with ${|\psi \rangle }_{{\rm{A}}}={{\sum }_{{\bf{a}}}\sqrt{p({\bf{a}})}|{\bf{a}}\rangle }_{{\rm{A}}}=$$\cos (\xi ){|{\ell }\rangle }_{{\rm{A}}}+\,\sin (\xi ){|w\rangle }_{{\rm{A}}}$, and sends it to the environment. ${|w\rangle }_{{\rm{A}}}$ and ${|{\ell }\rangle }_{{\rm{A}}}$ are superpositions of all winning (rewarded) and losing (non-rewarded) action sequences, respectively, and ${|-\rangle }_{{\rm{R}}}=({|0\rangle }_{{\rm{R}}}-{|1\rangle }_{{\rm{R}}})/\sqrt{2}$.

(2) The environment applies U_E from equation (1) to ${|\psi \rangle }_{{\rm{A}}}{|-\rangle }_{{\rm{R}}}$, flipping the sign of the winning state:

$${U}_{{\rm{E}}}{|\psi \rangle }_{{\rm{A}}}{|-\rangle }_{{\rm{R}}}=[\cos (\xi ){|\ell \rangle }_{{\rm{A}}}-\,\sin (\xi ){|w\rangle }_{{\rm{A}}}]{|-\rangle }_{{\rm{R}}},$$

(3)

and returns the resulting state to the agent.

(3) The agent performs a reflection ${U}_{{\rm{R}}}=2|\psi \rangle {\langle \psi |}_{{\rm{A}}}-{{\mathbb{1}}}_{{\rm{A}}}$ over the initial state ${|\psi \rangle }_{{\rm{A}}}$.

The last step leads to amplitude amplification similar to Grover’s algorithm²⁷ and thus to an increased probability sin²(3ξ) (ref. ²⁸) to find rewarded action sequences (Methods). In our experiment, the hybrid agent performs a single query (and thus a single step of amplitude amplification) during a quantum epoch. However, the general framework allows for multiple steps of amplitude amplification in consecutive quantum epochs.

While quantum epochs lead to an increased winning probability, they do not reveal the reward (or corresponding percept sequence, in general). The reward can be determined only via classical test epochs, where the obtained action sequence is used as input. Thus, the hybrid agent alternates between quantum and classical test epochs, updating its policy every time a reward is obtained after a test epoch. Such agents accomplish their task of finding winning action sequences faster, and hence learn faster than entirely classical agents. This approach allows us to quantify the speed-up in learning time, which is not possible in the general setting discussed in ref. ². The learning speed-up manifests in a reduced average learning time ${\langle T\rangle }_{{\rm{Q}}}$, that is, the average number of epochs necessary to achieve a certain winning probability P_L. In general, a quadratic improvement can be achieved if the maximal number of coherent interactions between agent and environment scales with the problem size (Methods).

Experimental implementation

Quantized RL protocols can be compactly realized using state-of-the-art photonic technology²⁹. Nowadays, integrated photonic platforms hold the advantage of providing scalable architectures where many elementary components can be accommodated on small devices³⁰. Here we use a programmable nanophotonic processor comprising 26 waveguides fabricated to form 88 Mach–Zehnder interferometers (MZIs). An MZI is equipped with two configurable phase shifters as shown in Fig. 2a, b, and acts as a tunable beam splitter. Information is spatially encoded onto two orthogonal modes $|0\rangle $ = (1, 0)^T and $|1\rangle $ = (0, 1)^T, which constitute the computational basis (T indicates the transpose).

As illustrated in Fig. 2c, pairs of single photons are generated (at telecommunication wavelengths) from a single-photon source. One photon is coupled into a waveguide and then detected by single-photon detectors D1, D2 or D3, while the other one is sent to D0 for heralding (that is, clicks in detectors D1, D2 or D3 are registered in coincidence with clicks in D0). The detectors are superconducting nanowires with efficiencies of up to roughly 90% (see Methods for experimental details). The processor is divided into three regions, where the first and last are assigned to the agent, and the middle region to the environment, to carry out, in quantum epochs, steps 1–3 listed above. The agent is further equipped with a classical control mechanism (a feedback loop) that updates its learning policy.

In our experiment, we represent the winning and losing action states ${|w\rangle }_{{\rm{A}}}$ and ${|{\ell }\rangle }_{{\rm{A}}}$ by a single qubit via ${|1\rangle }_{{\rm{A}}}={|w\rangle }_{{\rm{A}}}$ and ${|0\rangle }_{{\rm{A}}}={|{\ell }\rangle }_{{\rm{A}}}$, and use another qubit to encode the reward $({|0\rangle }_{{\rm{R}}},{|1\rangle }_{{\rm{R}}})$. This results in a four-level system, where each level is a waveguide in our processor, as shown in Fig. 3. The winning probability for the agent is initially set to ε = sin²(ξ) = 0.01, representing a single rewarded action sequence out of 100. After a single photon is coupled into the mode $|{0}_{{\rm{A}}}{0}_{{\rm{R}}}\rangle $, the agent creates the state ${|\psi \rangle }_{{\rm{A}}}=(\cos ({\xi }){|0\rangle }_{{\rm{A}}}+\,\sin ({\xi }){|1\rangle }_{{\rm{A}}}){|0\rangle }_{{\rm{R}}}$ by applying a unitary U_P. Next, it can decide to play classically or quantum mechanically.

In a classical strategy, the environment flips the reward qubit only if the action qubit is in the winning state via U_E (Fig. 3a). Next, the photon is coupled out and detected in either D1 or D2 with probability cos²(ξ) and sin²(ξ), respectively. If D2 is triggered (that is, the agent has been rewarded), a feedback mechanism updates the policy π by updating the winning probability ε_j after having obtained j rewards as

$${\varepsilon }_{j}=\frac{1+2j}{100+2j}.$$

(4)

π is related to p(a), and therefore to ε, via equation (5) (Methods).

In a quantum strategy, after the reward qubit is rotated to ${|-\rangle }_{{\rm{R}}}$ via two operators U_H0 and U_H1, the environment acts as an oracle via U_E as in equation (3). Consecutively, the agent reverses the effect of U_H0 and U_H1, and performs the reflection U_R (Fig. 3b). Measuring in the computational basis of the action register then leads to the detection of a rewarded action sequence with increased probability sin²(3ξ) in D3. For practical reasons, the classical test epoch is implemented only in software (Methods). The update rule remains the same as in the classical case.

In general, any Grover-like algorithm faces a drop in amplitude amplification after the optimal point is reached. As different agents will reach this optimal point in different epochs, one can identify the probability ε = 0.396 until which it is beneficial for all agents to use a quantum strategy, as they will observe more rewards than in the classical strategy on average (Methods). When this probability is surpassed, it is advantageous to switch to an entirely classical strategy. This combined strategy thus avoids the typical amplitude amplification drop without introducing additional overheads in terms of experimental resources.

Results

At the end of each classical epoch, we record outcomes 1 and 0 for the rewarded and non-rewarded behaviour, respectively, obtaining a binary sequence whose length equals the number of played epochs in the classical learning strategy and half of the number of played epochs in the quantum strategy (as here two epochs, quantum and classical test, are needed to obtain the reward). For a fair comparison between these scenarios, in the quantum strategy, the reward is distributed (that is, averaged) over the quantum and classical test epochs. The reward is then averaged over different independent agents. Figure 4 shows this average reward η for the different learning strategies.

**Fig. 4: Behaviour of the average reward η for different learning strategies.**

The theoretical data are simulated for n = 10,000 agents and the experimental data obtained from n = 165. Figure 4a visualizes the quantum improvement originating from the use of amplitude amplification compared with a purely classical strategy. For completeness, the comparison between not distributing and distributing the reward over two epochs in the quantum strategy is shown in Fig. 4b, c.

When ε = 0.396, η for the quantum strategy starts decreasing, as visible in Fig. 4a. Our setup allows the agents to choose the favourable strategy by switching from quantum to classical when the latter becomes advantageous. This combined strategy outperforms the purely classical scenario, as shown in Fig. 4d. As previously discussed, a certain winning probability P_L has to be defined to quantify the learning time. Choosing P_L = 0.37 (note however, that any probability below ε = 0.396 can be chosen), the learning time $\langle T\rangle $ for P_L decreases from ${\langle T\rangle }_{{\rm{C}}}$ = 270 in the classical strategy to ${\langle T\rangle }_{{\rm{Q}}}$ = 100 in the combined strategy. This implies a reduction of 63%, which fits well to the theoretical values ${T}_{{\rm{C}}}^{{\rm{theory}}}$ = 293 and ${T}_{{\rm{Q}}}^{{\rm{theory}}}$ = 97, accounting for small experimental imperfections.

In general, hybrid agents can experience a quadratic speed-up in their learning time if arbitrary numbers of coherent Grover iterations can be performed, even if the number of rewarded actions is unknown³¹.

Conclusions

We have demonstrated an RL protocol where an agent can boost its performance by allowing quantum communication with the environment. This enables a quantum speed-up in its learning time and optimal control of the learning process. Emerging photonic technology provides the advantages of compactness, tunability and low-loss communication, thus proving suitable for RL algorithms where active-feedback mechanisms, even over long distances, need to be implemented. Future scaled-up implementations of our protocol rely on a linearly increasing number of waveguides with the action space size when considering action sequences of length L = 1, and the use of just a single photon. In this case, a learning task with N different actions requires a processor with 2N modes, while 3N + 1 or maximally $\frac{{\rm{\pi }}}{4}\sqrt{N}(3N+1)$ gates are needed for a single or multiple rounds of amplitude amplification, respectively. In general, multiple photons will be required to deal with a combinatorially big space of action sequences of arbitrary length L. We envision our protocol to aid specifically in problems where frequent search is needed, for example, network routing problems, where, for instance, tens of qubits, waveguides and detectors would be employed to represent search spaces of 10⁴ elements. In general, the development of superconducting detectors, on-demand single-photon sources³² or the large-scale integration of artificial atoms within photonic circuits³³ suggest substantial steps towards scalable multiphoton applications. Although photonic architectures are particularly suitable for such learning algorithms, our theoretical background is applicable to different platforms, for example, trapped ions or superconducting qubits. Here future realizations can feature the implementation of agent and environment as spatially separated systems, and a light–matter quantum interface for coherent exchange between them^24,34.

Methods

Quantum enhancement in RL agents

Here we present an explicit method for combining a classical agent with quantum amplitude amplification. Introducing a feedback loop between classical policy update and quantum amplitude amplification, we are able to determine achievable improvements in sample complexity, and thus in learning time. In addition, the final policy of our agent has properties similar to those of the underlying classical agent, leading to a comparable behaviour as discussed in more detail in A.H. et al. (manuscript in preparation; which is dedicated to discussing the theoretical background of the hybrid agent more specifically).

In the following, we focus on simple DSE environments, where the interaction between the agent and the environment is structured into epochs. Each epoch starts with the same percept s₀, and at each time step i an action–percept pair (a_i, s_i) is exchanged. Many interesting environments are epochal, for example, in applications of RL to quantum physics^21,35,36,37 or popular problems such as playing Go¹⁰. At the end of each epoch, after L action–percept pairs are communicated, the agent receives a reward r ∈ {0, 1}. The rules of the game are deterministic and time independent, such that performing a specific action a_i after receiving a percept s_i−1 always leads to the same following percept s_i.

The behaviour of an agent is determined by its policy described by the probability π(a_i|s_i−1) to perform the action a_i given the percept s_i−1. In deterministic settings, the percept s_i is completely determined by all previously performed actions a₁, ⋅⋅⋅, a_i such that π(a_i|s_i−1) = π(a_i|a₁, ⋅⋅⋅, a_i−1). Thus, the behaviour of the agent within one epoch is described by action sequences a = (a₁, ..., a_L) and their corresponding probabilities

$$p({\bf{a}})=\mathop{\prod }\limits_{i=1}^{L}{\rm{\pi }}({a}_{i}|{a}_{1},\cdots ,{a}_{i-1}).$$

(5)

Our learning agent uses a policy based on projective simulation²⁶, where each action sequence a is associated with a weight factor h(a) initialized to h = 1. Its policy is defined via the probability distribution

$$p({\bf{a}})=\frac{h({\bf{a}})}{\sum _{{\bf{a}}{\prime} }h({\bf{a}}{\prime} )}.$$

(6)

In our experiment, the initial winning probability ε is given by

$$\varepsilon =\sum _{\{{\bf{a}}|r({\bf{a}}) > 0\}}p({\bf{a}})=\frac{\sum _{\{{\bf{a}}|r({\bf{a}}) > 0\}}h({\bf{a}})}{\sum _{\{{\bf{a}}\}}h({\bf{a}})}$$

(7)

and is set to 1/100. If the agent has chosen the sequence a, it updates the corresponding weight factor via

$$h({\bf{a}})\to h({\bf{a}})+\lambda r({\bf{a}}),$$

(8)

where λ = 2 in our experiment and r(a) = 1 (0) if a is rewarded (non-rewarded). Thus, the winning probability after the agent has found j rewards is given by equation (4). In general, the update method for quantum-enhanced agents is not limited to projective simulation and can be used to enhance any classical learning scenario, provided that p(a) exists and that the update rule is solely based on the observed rewards. We generalize the given learning problem to the quantum domain by encoding different action sequences a into orthogonal quantum states $|{\bf{a}}\rangle $ defining our computational basis. In addition, we create a fair unitary oracular variant of the environment², whose effective behaviour on the action register can be described by ${\tilde{U}}_{{\rm{E}}}$ as

$${\tilde{U}}_{{\rm{E}}}|{\bf{a}}\rangle =\{\begin{array}{ll}-|{\bf{a}}\rangle & {\rm{if}}\,r({\bf{a}}) > 0\\ |{\bf{a}}\rangle & {\rm{if}}\,r({\bf{a}})=0\end{array}.$$

(9)

The unitary oracle ${\tilde{U}}_{{\rm{E}}}$ can be used to perform, for instance, a Grover search or amplitude amplification for rewarded action sequences by performing Grover iterations

$${U}_{{\rm{G}}}=(2|\psi \rangle \langle \psi |-{\mathbb{1}}){\mathop{U}\limits^{ \sim }}_{{\rm{E}}}$$

(10)

on an initial state $|\psi \rangle $. A quantum-enhanced agent with access to ${\tilde{U}}_{{\rm{E}}}$ can thus find rewarded action sequences faster than a corresponding classical agent defined by the same initial policy π(a_i|s_i−1) and update rules.

In general, the optimal number k of Grover iterations ${U}_{{\rm{G}}}^{k}|\psi \rangle $ depends on the winning probability ε via $k\propto 1/\sqrt{\varepsilon }$ (ref. ²⁷). In the following, we assume that ε is known at least to a good approximation. This is, for instance, possible if the number of rewarded action sequences is known. However, a similar agent can also be developed if ε is unknown by adapting methods from ref. ³¹ as described in A.H. et al. (manuscript in preparation).

Description of the agent

A quantum-enhanced hybrid agent is constructed via the following steps:

(1) Given the classical probability distribution p(a), determine the winning probability ε = sin²(ξ) based on the current policy and prepare the quantum state in the action register:

$$|\psi \rangle =\sum _{\{{\bf{a}}\}}\sqrt{p({\bf{a}})}|{\bf{a}}\rangle $$

(11)

$$=\,\cos (\xi )|\ell \rangle +\,\sin (\xi )|w\rangle .$$

(12)

Here the quantum states

$$|\ell \rangle \propto \sum _{\{{\bf{a}}|r({\bf{a}})=0\}}\sqrt{p({\bf{a}})}|{\bf{a}}\rangle ,$$

(13)

$$|w\rangle \propto \sum _{\{{\bf{a}}|r({\bf{a}}) > 0\}}\sqrt{p({\bf{a}})}|{\bf{a}}\rangle ,$$

(14)

contain all losing and winning components, respectively. In our experiment, we identify $|{\ell }\rangle \widehat{=}|0\rangle $ and $|w\rangle \widehat{=}|1\rangle $. The task assigned to the agent is to (learn to) perform the winning sequences $|w\rangle $ via policy update. This translates to a maximization of the obtained reward.

(2) Apply the optimal number $k(\sqrt{\varepsilon })$ of Grover iterations leading to

$$|\psi {}^{{\prime} }\rangle ={U}_{{\rm{G}}}^{k}|\psi \rangle ,$$

(15)

and perform a measurement in the computational basis on $|\psi {}^{{\prime} }\rangle $ to determine a test action sequence a.

(3) Play one classical epoch by using the test sequence a determined in step 2 and obtain the corresponding percept sequence s(a) and the reward r(a).

(4) Update ε, and thus the classical policy π, using the rule in equation (4).

There exists a limit P on ε determining whether it is more advantageous for the agent to perform k Grover iterations with $k(\sqrt{\varepsilon })\ge 1$ or sample directly from p(a) (therefore $k(\sqrt{\varepsilon })=0$) to determine a. In the latter case, the agent would interact only classically (as in step 3) with the environment.

After each epoch, a classic agent receives a reward with probability sin²(ξ). A quantum-enhanced agent can instead use one epoch to either perform one Grover iteration (step 2) or to determine the reward of a given test sequence a (step 3). After k Grover iterations, the winning probability is sin²[(2k + 1)ξ] (ref. ²⁸; see next section). Thus, for k = 1, the agent receives a reward after every second epoch with probability sin²(3ξ). Therefore, we define the expected average reward of an agent playing a classical strategy as η_C = sin²(ξ) and of an agent playing a quantum strategy with k = 1 as η_Q = sin²(3ξ)/2. For ε < P, η_Q > η_C, meaning that the quantum strategy proves advantageous over the classical case. However, as soon as η_Q = η_C (at P = 0.396), a classical agent starts outperforming a quantum-enhanced agent that still performs Grover iterations.

Determining the winning probability ε exactly as in the example presented here is not always possible. In general, additional information such as the number of possible solutions and model building helps to perform this task. Note that a P smaller than 0.396 should be chosen if ε can only be estimated up to some range. To circumvent this problem, methods like Grover search with unknown reward probability³¹ or fixed-point search³⁸ can be used to determine whether and how many steps of amplitude amplification should be performed (A.H. et al., manuscript in preparation).

Enhancement of the winning probability

After a quantum epoch, the amplitude sin(ξ) of the winning state $|w\rangle $ increases to sin(3ξ). Here we derive this result. The projections onto the winning and losing subspaces are given by ${\Pr }_{w}={\sum }_{\{{\bf{a}}|r({\bf{a}}) > 0\}}|{\bf{a}}\rangle \langle {\bf{a}}|$ and ${\Pr }_{{\ell }}={\sum }_{\{{\bf{a}}|r({\bf{a}})=0\}}|{\bf{a}}\rangle \langle {\bf{a}}|$, respectively, which are orthogonal and sum to identity. Therefore, the initial state (12) can be decomposed into a normalized winning $|w\rangle \propto {\Pr }_{w}|\psi \rangle $ and losing $|{\ell }\rangle \propto {\Pr }_{{\ell }}|\psi \rangle $ component, and the unitary (10) implementing one Grover iteration can be written as

$${U}_{{\rm{G}}}=(2|\psi \rangle \langle \psi |-{\mathbb{1}})({Pr}_{\ell }-{Pr}_{w}).$$

(16)

Now, let us investigate the effect of U_G on an arbitrary real superposition

$$|s\rangle =\,\sin (\alpha )|w\rangle +\,\cos (\alpha )|{\ell }\rangle $$

(17)

in the plane spanned by $|w\rangle $ and $|{\ell }\rangle $. Using the trigonometric addition theorems, the application of one Grover iteration to $|s\rangle $

$${U}_{{\rm{G}}}|s\rangle =\,\sin (\alpha +2\xi )|w\rangle +\,\cos (\alpha +2\xi )|{\ell }\rangle $$

(18)

can be identified as a rotation of 2ξ in the plane. Assuming α = ξ, we therefore find the amplitude of $|w\rangle $ to be sin(3ξ), and thus obtain a winning probability sin²(3ξ). Implementing k Grover iterations leads to sin²[(2k + 1)ξ].

Learning time

We define the learning time T as the number of epochs an agent needs on average to reach a certain winning probability P_L. The hybrid agent can reach P with fewer epochs on average than its classical counterpart. However, once both reach P, they need on average the same number of epochs to reach P = Δ_P where Δ_P is a certain value satisfying 0 ≤ Δ_P < 1 − P. Therefore, we choose P_L ≤ P to quantify the achievable improvement of a hybrid agent compared with its classical counterpart. In our experiment, we choose P_L = 0.37 to define the learning time.

Let ${l}_{J}=\{{{\bf{a}}}_{1},\cdots ,{{\bf{a}}}_{J}\}$ be a time-ordered list of all the rewarded action sequences an agent has found until it reaches P_L. Note that the actual policy π_j, and thus p_j, of our agents depend only on the list l_j of observed rewarded action sequences, and this is independent of whether they have found them via classical sampling or quantum amplitude amplification. As a result, a classical agent and its quantum-enhanced hybrid version are described by the same policy π(l_j) and behave similarly if they have found the same rewarded action sequences. However, the hybrid agent finds them faster.

In general, the actual policy and overall winning probability might depend on the rewarded action sequences that have been found. Thus, the number J of observed rewarded action sequences necessary to learn might vary. However, this is not the case for the experiment reported here. In our case, the learning time can be determined via

$$T(J)=\mathop{\sum }\limits_{j=0}^{J-1}{t}_{j},$$

(19)

where t_j determines the number of epochs necessary to find the next rewarded sequence a_j+1 after it has observed j rewards. For a purely classical agent, the average time is given by

$${\langle {t}_{j}\rangle }_{{\rm{C}}}=\frac{1}{{\varepsilon }_{j}}.$$

(20)

This time is quadratically reduced to

$${\langle {t}_{j}\rangle }_{{\rm{Q}}}=\frac{\alpha }{\sqrt{{\varepsilon }_{j}}}$$

(21)

for the hybrid agent. Here α is a parameter depending only on the number of epochs needed to create one oracle query ${\tilde{U}}_{{\rm{E}}}$ (ref. ²) and on whether ε_j is known. In the case considered here, we find α = π/4. As a consequence, the average learning time for the hybrid agent is given by

$${\langle T(J)\rangle }_{{\rm{Q}}}=\mathop{\sum }\limits_{j=0}^{J-1}\frac{\alpha }{\sqrt{{\varepsilon }_{j}}}\,\le \,\alpha \sqrt{J}\sqrt{{\langle T(J)\rangle }_{{\rm{C}}}},$$

(22)

where we used the Cauchy–Schwarz inequality and equations (19) and (20). The classical learning time typically scales with ${\langle T\rangle }_{{\rm{C}}}\propto {A}^{K}$ for a learning problem with episode length K and the choice between A different actions in each step. The number J depends on the specific policy update and sometimes also on the list l_J of observed rewarded action sequences. For an agent sticking with the first rewarded action sequence, we would find J = 1. However, typical learning agents are more explorative, and common scalings are J ∝ K such that we find

$${\langle T(J)\rangle }_{{\rm{Q}}}\propto \sqrt{\log ({\langle T(J)\rangle }_{{\rm{C}}})}\sqrt{{\langle T(J)\rangle }_{{\rm{C}}}}$$

(23)

for these cases. This is equivalent to a quasi-quadratic speed-up in the learning time if arbitrary numbers of Grover iterations can be performed.

In more general settings, there exist several possible l_J with different length J such that the learning time $\langle T(J)\rangle $ needs to be averaged over all possible l_J, which again leads to a quadratic speed-up in learning time (A.H. et al., manuscript in preparation).

Limited coherent evolutions

In general, all near-term quantum devices allow for coherent evolution only for a limited time and are thus limited to a maximal number of Grover iterations. For winning probabilities ε = sin²(ξ) with (2k + 1)ξ ≤ π/2, performing k Grover iterations leads to the highest probability of finding a rewarded action.

Again, we assume that the actual policy of an agent depends on only the number of observed rewards an agent has found. As a consequence, the average time a hybrid agent limited to k Grover iterations needs to achieve the winning probability P < sin²[π/(4k + 2)] is given by

$${\langle T(J,k)\rangle }_{{\rm{Q}}}=\mathop{\sum }\limits_{j=0}^{J-1}\frac{{\alpha }_{0}k+1}{{\sin }^{2}[(2k+1){\xi }_{j}]},$$

(24)

with α₀ determining the number of epochs necessary to create one oracle query ${\tilde{U}}_{{\rm{E}}}$. For α₀ = 1, k ≫ 1 and (2k + 1)ξ_J ≪ π/2, we can approximate the learning time for the hybrid agent via

$${\langle T(J,k)\rangle }_{{\rm{Q}}}\approx \mathop{\sum }\limits_{j=0}^{J-1}\frac{1}{4k{\varepsilon }_{j}}=\frac{{\langle T(J)\rangle }_{{\rm{C}}}}{4k},$$

(25)

where we used sin(x) ≈ x for x ≪ 1. In general, it can be shown (A.H. et al., manuscript in preparation) that the winning probability P_k = sin²[π/(4k + 2)] can be reached by a hybrid agent limited to k Grover iterations in a time

$${\langle T(k)\rangle }_{{\rm{Q}}}\le \gamma \frac{{\langle T\rangle }_{{\rm{C}}}}{k},$$

(26)

where γ is a factor depending on the specific setting.

In our case, equation (24) can be used to compute the lower bound for the average quantum learning time, with α₀ = k = 1. For the classical strategy, equation (20), together with equation (19), is used. Thus, given P_L = 0.37, the predictions for the learning time in our experiment are ${T}_{{\rm{Q}}}^{{\rm{theory}}}$ = 97 and ${T}_{{\rm{C}}}^{{\rm{theory}}}$ = 293.

Experimental details

A continuous-wave laser (Coherent Mira HP) is used to pump a single-photon source producing photon pairs in the telecommunication-wavelength band. The laser light has a central wavelength of 789.5 nm and pumps the single-photon source at a power of approximately 100 mW. The source is a periodically poled KTiOPO₄ nonlinear crystal placed in a Sagnac interferometer^39,40, where the emission of single photons occurs via a type-II spontaneous parametric down-conversion process. The crystal (produced by Raicol) is 30 mm long, set to a temperature of 25 °C, has a poling period of 46.15 μm and is quasi-phase matched for degenerate emission of photons at 1,570 nm when pumping with coherent laser light at 785 nm. As the processor is calibrated for a wavelength of 1,580 nm, we shift the wavelength of the laser light to 789.75 nm to produce one photon at 1,580 nm (that is then coupled into the processor) and another one at 1,579 nm (the heralding photon).

The processor is a silicon-on-insulator type, designed by the Quantum Photonics Laboratory at the Massachusetts Institute of Technology (MIT)³⁰. Each programmable unit on the device acts as a tunable beam splitter implementing the unitary

$${U}_{\theta ,\varphi }=\left(\begin{array}{cc}{{\rm{e}}}^{{\rm{i}}\varphi }\sin \frac{\theta }{2} & {{\rm{e}}}^{{\rm{i}}\varphi }\,\cos \frac{\theta }{2}\\ \cos \frac{\theta }{2} & -\sin \frac{\theta }{2}\end{array}\right),$$

(27)

where θ and ϕ are the internal and external phases shown in Fig. 2a, b, set via thermo-optical phase shifters controlled by a voltage supply. The achievable precision for phase settings is higher than 250 μrad. The bandwidth of the phase shifters is around 130 kHz. The waveguides, spatially separated from one another by 25.4 μm, are designed to admit one linear polarization only. The high contrast in refractive index between the silicon and silica (the insulator) allows for waveguides with very small bend radius (less than 15 μm), thus enabling high component density (in our case 88 MZIs) on small areas (in our case, 4.9 × 2.4 mm). Given the small dimensions, the in-(out-)coupling is realized with the help of Si₃N₄−SiO₂ waveguide arrays (produced by Lionix International), that shrink (and enlarge) the 10-μm optical fibres’ mode to match the 2-μm mode size of the waveguides in the processor. The total input–output loss is around 7 dB. The processor is stabilized to a temperature of 28 °C and calibrated at 1,580 nm for optimal performance. To reduce the black-body radiation emission due to the heating of the phase shifters when voltage is applied, wavelength division multiplexers with a transmission peak centred at 1,571 nm and bandwidth of 13 nm are used before the photons are sent to the detectors. In our processor, two external phase shifters in the implemented circuits were not responding to the supplied voltage. These defects were accounted for by employing an optimization procedure.

The single-photon detectors are multi-element superconducting nanowires (produced by Photon Spot) with efficiencies up to 90% in the telecommunication-wavelength band. They have a dark count rate of about 100 counts per second, low timing jitter (hundreds of picoseconds) and a reset time <100 ns (ref. ⁴¹). Coincidence events are those detection events registered in D0 and at the output of the processor that fall into a temporal window of 1.3 ns (the coincidence window), and are found using a time tagging module (TTM). In more detail, to record coincidences and then update the agent’s policy accordingly, the following steps are performed in the classical/quantum strategy (after initially setting ${h}_{w}={\sum }_{\{{\bf{a}}|r({\bf{a}}) > 0\}}h({\bf{a}})=1$ and ${h}_{{\ell }}={\sum }_{\{{\bf{a}}|r({\bf{a}})=0\}}h({\bf{a}})=99$ such that ε = 0.01).

(1) The TTM records the time tags for photons in D0, D1 and D2/D3.

(2) A Python script converts the time tags into arrival times, and it iterates through them until it finds a coincidence event between either D0 and D1, or D0 and D2/D3.

(3) If a coincidence event between D0 and D1 (or D0 and D2/D3) is first found, a 0 (1) is sent to another Python script controlling the MZIs’ phase shifters operating on a different computer. If a 0 is sent, the ratio $\varepsilon ={h}_{w}/({h}_{{\ell }}+{h}_{w})$ remains unchanged. If a 1 is sent, h_w is updated as h_w + 2, which follows the update in equation (4). In the quantum strategy, this step also includes the implementation of a classical test epoch.

Implementing classical test epochs on hardware would require ‘testing’ the measured action state, that is, using the measured action sequence as input and making the environment act via U_E, thus leading to detection of a reward ${|0\rangle }_{{\rm{R}}}$or ${|1\rangle }_{{\rm{R}}}$. However, since this simple circuit works in very close agreement with theoretical predictions (its visibility exceeds 0.99), this part has been implemented in software only.

The update rate is about 1 Hz for both the classical and quantum epochs, and can be reduced up to the phase shifters’ bandwidth.

Data availability

All the datasets used in the current work are available on Zenodo at https://doi.org/10.5281/zenodo.4327211.

References

Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (MIT Press, 1998).
Dunjko, V., Taylor, J. M. & Briegel, H. J. Quantum-enhanced machine learning. Phys. Rev. Lett. 117, 130501 (2016).
Article ADS MathSciNet Google Scholar
Paparo, G. D., Dunjiko, V., Makmal, A., Martin-Delgrado, M. A. & Briegel, H. J. Quantum speedup for active learning agents. Phys. Rev. X4, 031002 (2014).
Google Scholar
Sriarunothai, T. et al. Speeding-up the decision making of a learning agent using an ion trap quantum processor. Quantum Sci. Technol. 4, 015014 (2019).
Article ADS Google Scholar
Johannink, T. et al. Residual reinforcement learning for robot control. In 2019 International Conference on Robotics and Automation (ICRA) 6023–6029 (IEEE, 2019).
Tjandra, A., Sakti, S. & Nakamura, S. Sequence-to-aequence ASR optimization via reinforcement learning. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 5829–5833 (IEEE, 2018).
Komorowski, M., Celi, L. A., Badawi, O., Gordon, A. C. & Faisal A. A. The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nat. Med. 24, 1716–1720 (2018).
Article CAS Google Scholar
Thakur, C. S. et al. Large-scale neuromorphic spiking array processors: a quest to mimic the brain. Front. Neurosci. 12, 891 (2018).
Article Google Scholar
Steinbrecher, G. R., Olson, J. P., Englund, D. & Carolan, J. Quantum optical neural networks. npj Quantum Inf. 5, 60 (2019).
Article ADS Google Scholar
Silver, D. et al. Mastering the game of Go without human knowledge. Nature550, 354–359 (2017).
Article ADS CAS Google Scholar
Arute, F. et al. Quantum supremacy using a programmable superconducting processor. Nature574, 505–510 (2019).
Article ADS CAS Google Scholar
Dong. D., Chen, C., Li, H. & Tarn, T.-J. Quantum reinforcement learning. IEEE Trans. Syst. Man Cybern. B38, 1207–1220 (2008).
Article Google Scholar
Dunjko, V. & Briegel, H. J. Machine learning & artificial intelligence in the quantum domain: a review of recent progress. Rep. Prog. Phys. 81, 074001 (2018).
Article ADS MathSciNet Google Scholar
Baireuther, P., O’Brien, T. E., Tarasinski, B. & Beenakker, C. W. J. Machine-learning-assisted correction of correlated qubit errors in a topological code. Quantum2, 48 (2018).
Article Google Scholar
Breuckmann, N. P. & Ni, X. Scalable neural network decoders for higher dimensional quantum codes. Quantum2, 68–92 (2018).
Article Google Scholar
Chamberland, C. & Ronagh, P. Deep neural decoders for near term fault-tolerant experiments. Quant. Sci. Technol. 3, 044002 (2018).
Article ADS Google Scholar
Fösel, T., Tighineanu, P., Weiss, T. & Marquardt, F. Reinforcement learning with neural networks for quantum feedback. Phys. Rev. X8, 031084 (2018).
Google Scholar
Poulsen Nautrup, H., Delfosse, N., Dunjko, V., Briegel, H. J. & Friis, N. Optimizing quantum error correction codes with reinforcement learning. Quantum3, 215 (2019).
Article Google Scholar
Yu, S. et al. Reconstruction of a photonic qubit state with reinforcement learning. Adv. Quantum Technol. 2, 1800074 (2019).
Article Google Scholar
Krenn, M., Malik, M., Fickler, R., Lapkiewicz, R. & Zeilinger, A. Automated search for new quantum experiments. Phys. Rev. Lett. 116, 090405 (2016).
Article ADS Google Scholar
Melnikov, A. A. et al. Active learning machine learns to create new quantum experiments. Proc. Natl Acad. Sci. USA115, 1221–1226 (2018).
Article ADS CAS Google Scholar
Dunjko, V., Friis, N. & Briegel, H. J. Quantum-enhanced deliberation of learning agents using trapped ions. New J. Phys. 17, 023006 (2015).
Article ADS Google Scholar
Jerbi, S., Poulsen Nautrup, H., Trenkwalder, L. M., Briegel, H. J. & Dunjko, V. A framework for deep energy-based reinforcement learning with quantum speed-up. Preprint at https://arxiv.org/abs/1910.12760 (2019).
Kimble, H. J. The quantum internet. Nature453, 1023–1030 (2008).
Article ADS CAS Google Scholar
Cacciapuoti, A. S. et al. Quantum internet: networking challenges in distributed quantum computing. IEEE Netw. 34, 137–143 (2020).
Article Google Scholar
Briegel, H. J. & De las Cuevas, G. Projective simulation for artificial intelligence. Sci. Rep. 2, 400 (2012).
Article Google Scholar
Grover, L. K. Quantum mechanics helps in searching for a needle in a haystack. Phys. Rev. Lett. 79, 325–328 (1997).
Article ADS CAS Google Scholar
Nielsen, M. A. & Chuang, I. L. Quantum Computation and Quantum Information (Cambridge Univ. Press, 2000).
Flamini, F. et al. Photonic architecture for reinforcement learning. New. J. Phys. 22, 045002 (2020).
Article ADS MathSciNet Google Scholar
Harris, N. C. et al. Quantum transport simulations in a programmable nanophotonic processor. Nat. Photon. 11, 447–452 (2017).
Article ADS CAS Google Scholar
Boyer, M., Brassard, G., Hoyer, P. & Tappa, A. Tight bounds on quantum searching. Fortschr. Phys. 46, 493–505 (1998).
Article Google Scholar
Senellart, P., Solomon, G. & White, A. High-performance semiconductor quantum-dot single-photon sources. Nat. Nanotechnol. 12, 1026–1039 (2017).
Article ADS CAS Google Scholar
Wan, N. H. et al. Large-scale integration of artificial atoms in hybrid photonic circuits. Nature583, 226–231 (2020).
Article ADS CAS Google Scholar
Northup, T. E. & Blatt, R. Quantum information transfer using photons. Nat. Photon. 8, 356–363 (2014).
Article ADS CAS Google Scholar
Denil, M. et al. Learning to perform physics experiments via deep reinforcement learning. Proc. Int. Conf. on Learning Representations (2017).
Bukov, M. et al. Reinforcement learning in different phases of quantum control. Phys. Rev. X8, 031086 (2018).
CAS Google Scholar
Poulsen Nautrup, H. et al. Operationally meaningful representations of physical systems in neural networks. Preprint at https://arxiv.org/abs/2001.00593 (2020).
Yoder, T. J., Low, G. H. & Chuang, I. L. Fixed-point quantum search with an optimal number of queries. Phys. Rev. Lett. 113, 210501 (2014).
Article ADS Google Scholar
Kim, T., Fiorentino, M. & Wong, F. N. C. Phase-stable source of polarization-entangled photons using a polarization Sagnac interferometer. Phys. Rev. A73, 012316 (2006).
Article ADS Google Scholar
Saggio, V. et al. Experimental few-copy multipartite entanglement detection. Nat. Phys. 15, 935–940 (2019).
Article CAS Google Scholar
Marsili, F. et al. Detecting single infrared photons with 93% system efficiency. Nat. Photon. 7, 210–214 (2013).
Article ADS CAS Google Scholar

Download references

Acknowledgements

We thank L. A. Rozema, I. Alonso Calafell and P. Jenke for help with the detectors. A.H. acknowledges support from the Austrian Science Fund (FWF) through the project P 30937-N27. V.D. acknowledges support from the Dutch Research Council (NWO/OCW), as part of the Quantum Software Consortium programme (project number 024.003.037). N.F. acknowledges support from the Austrian Science Fund (FWF) through the project P 31339-N27. H.J.B. acknowledges support from the Austrian Science Fund (FWF) through SFB BeyondC F7102, the Ministerium für Wissenschaft, Forschung, und Kunst Baden-Württemberg (Az. 33-7533-30-10/41/1) and the Volkswagen Foundation (Az. 97721). P.W. acknowledges support from the research platform TURIS, the European Commission through ErBeStA (no. 800942), HiPhoP (no. 731473), UNIQORN (no. 820474), EPIQUS (no. 899368), and AppQInfo (no. 956071), from the Austrian Science Fund (FWF) through CoQuS (W1210-N25), BeyondC (F 7113) and Research Group (FG 5), and Red Bull GmbH. The MIT portion of the work was supported in part by AFOSR award FA9550-16-1-0391 and NTT Research.

Author information

Authors and Affiliations

University of Vienna, Faculty of Physics, Vienna Center for Quantum Science and Technology (VCQ), Vienna, Austria
V. Saggio, B. E. Asenbeck, T. Strömberg, P. Schiansky & P. Walther
Institut für Theoretische Physik, Universität Innsbruck, Innsbruck, Austria
A. Hamann, S. Wölk & H. J. Briegel
LIACS, Leiden University, Leiden, The Netherlands
V. Dunjko
Institute for Quantum Optics and Quantum Information - IQOQI Vienna, Austrian Academy of Sciences, Vienna, Austria
N. Friis
Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, MA, USA
N. C. Harris & D. Englund
Nokia Corporation, New York, NY, USA
M. Hochberg
Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR), Institut für Quantentechnologien, Ulm, Germany
S. Wölk
Fachbereich Philosophie, Universität Konstanz, Konstanz, Germany
H. J. Briegel
Christian Doppler Laboratory for Photonic Quantum Computer, Faculty of Physics, University of Vienna, Vienna, Austria
P. Walther

Authors

V. Saggio
View author publications
You can also search for this author in PubMed Google Scholar
B. E. Asenbeck
View author publications
You can also search for this author in PubMed Google Scholar
A. Hamann
View author publications
You can also search for this author in PubMed Google Scholar
T. Strömberg
View author publications
You can also search for this author in PubMed Google Scholar
P. Schiansky
View author publications
You can also search for this author in PubMed Google Scholar
V. Dunjko
View author publications
You can also search for this author in PubMed Google Scholar
N. Friis
View author publications
You can also search for this author in PubMed Google Scholar
N. C. Harris
View author publications
You can also search for this author in PubMed Google Scholar
M. Hochberg
View author publications
You can also search for this author in PubMed Google Scholar
D. Englund
View author publications
You can also search for this author in PubMed Google Scholar
S. Wölk
View author publications
You can also search for this author in PubMed Google Scholar
H. J. Briegel
View author publications
You can also search for this author in PubMed Google Scholar
P. Walther
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

V.S. and B.E.A. implemented the experiment and performed data analysis. A.H., V.D., N.F., S.W. and H.J.B. developed the theoretical idea. T.S. and P.S. provided help with the experimental implementation. N.C.H., M.H. and D.E. designed the nanophotonic processor. V.S., S.W. and P.W. supervised the project. All the authors contributed to writing the paper.

Corresponding authors

Correspondence to V. Saggio or P. Walther.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review informationNature thanks Vojtěch Havlíček, Lucas Lamata and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Peer Review File

Rights and permissions

Reprints and permissions

About this article

Cite this article

Saggio, V., Asenbeck, B.E., Hamann, A. et al. Experimental quantum speed-up in reinforcement learning agents. Nature 591, 229–233 (2021). https://doi.org/10.1038/s41586-021-03242-7

Download citation

Received: 12 August 2020
Accepted: 15 January 2021
Published: 10 March 2021
Issue Date: 11 March 2021
DOI: https://doi.org/10.1038/s41586-021-03242-7

This article is cited by

Theoretical guarantees for permutation-equivariant quantum neural networks
- Louis Schatzki
- Martín Larocca
- M. Cerezo
npj Quantum Information (2024)
High-efficiency reinforcement learning with hybrid architecture photonic integrated circuit
- Xuan-Kun Li
- Jian-Xu Ma
- Xian-Min Jin
Nature Communications (2024)
Automated quantum software engineering
- Aritra Sarkar
Automated Software Engineering (2024)
Self-correcting quantum many-body control using reinforcement learning with tensor networks
- Friederike Metz
- Marin Bukov
Nature Machine Intelligence (2023)
QAmplifyNet: pushing the boundaries of supply chain backorder prediction using interpretable hybrid quantum-classical neural network
- Md Abrar Jahin
- Md Sakib Hossain Shovon
- Yuichi Okuyama
Scientific Reports (2023)

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.