Rapid advances in the field of machine learning and in general artificial intelligence (AI) are paving the way towards intelligent algorithms and automation. An important paradigm within AI is reinforcement learning (RL), where decision-making entities called ‘agents’ interact with an environment, ‘learning’ to achieve a goal via feedback1. Whenever the agent performs well (that is, makes the right decision), the environment rewards its behaviour, and the agent uses this information to progressively increase the likelihood of accomplishing its task. In this sense, an agent ‘learns’ by ‘reinforcement’. RL has applications in many sectors, from robotics5,6 to the healthcare domain7, to brain-like computing simulation8 and neural network implementations6,9. In addition, the celebrated AlphaGo algorithm10, which is able to beat even the most skilled human players at the game of Go, employs RL.

At the same time, quantum technologies have experienced remarkable progress11. At the heart of quantum mechanics lies the superposition principle, dictating that even the simplest, two-dimensional quantum system is described by a continuum of infinitely many possible choices via a state vector \(|\psi \rangle =\alpha |{\rm{0}}\rangle +\beta |1\rangle \) with the complex numbers α and β satisfying \({|\alpha |}^{2}+{|\beta |}^{2}=1\), while only two possible states, \(|0\rangle \) and \(|1\rangle \), exist classically. Advantageous RL algorithms12,13 inspired by quantum mechanics have been successful in aiding problems in quantum information processing, for example, decoding of errors14,15,16, quantum feedback17, adaptive code-design18, quantum-state reconstruction19 and even the design of quantum experiments20,21. Conversely, quantum technologies have enabled quadratically faster decision-making processes for RL agents via the quantization of their internal hardware3,4,22,23.

In all of these applications, agent and environment interact entirely classically. Here we consider a novel RL setting where they can also interact quantumly, formally via a quantum channel2. We therefore introduce a quantum-enhanced hybrid agent capable of quantum as well as classical information transfer. This makes it possible to achieve and quantify a quantum speed-up in the agent’s learning time with respect to RL based solely on classical interaction.

We realize this protocol using a fully programmable nanophotonic processor interfaced with photons at telecommunication wavelengths. The setup enables the implementation of active-feedback mechanisms, thus proving suitable for demonstrations of RL algorithms. Moreover, such photonic platforms hold the potential of integrating RL quantum speed-ups in future quantum networks owing to the photons’ telecommunication wavelengths. A long-standing goal in the development of quantum communication lies in establishing a form of ‘quantum internet’24,25, a highly interconnected network able to distribute and manipulate quantum states via optical links. We therefore envisage AI and RL to play important roles in future quantum networks, including a potential quantum internet, much in the same way that AI forms integral part of the internet today.

Quantum-enhanced RL

The conceptual idea of RL is shown in Fig. 1a. A decision-making entity called an agent interacts with an environment by receiving perceptual input (‘percepts’) si, and outputting specific ‘actions’ ai accordingly, at a certain time step i. Different ‘rewards’ r issued by the environment for correct combinations of percepts and actions incentivize agents to improve their decision-making, and thus to learn1.

Fig. 1: Schematic of a learning agent.
figure 1

a, An agent interacts with an environment by receiving perceptual input si and outputting actions ai. When the correct ai is chosen, the environment issues a reward r that the agent uses to enhance its performance in the next round of interaction. b, Agent and environment interacting classically, that is, using a classical channel, where communication is only possible via a fixed preferred basis (for example, vertical or horizontal photon polarization). c, Agent and environment interacting via a quantum channel, where arbitrary superposition states are exchanged.

Although RL has already been shown amenable to quantum enhancements, the interaction has so far been restricted exclusively to classical communication, meaning that signals can only be composed from a fixed, discrete alphabet. For signals carried by quantum systems (for example, single photons considered here) this corresponds to a fixed preferred basis, for example, ‘vertical’ or ‘horizontal’ photon polarization, as shown in Fig. 1b.

In general, it has been shown that granting agents access to quantum hardware (while still considering classical communication) does not reduce the learning time, although it allows actions to be output quadratically faster3,4. To achieve reductions in learning times, quantum communication becomes necessary.

We therefore consider an environment and a quantum-enhanced hybrid agent with access to internal quantum (as well as classical) hardware interacting by exchanging quantum states  \(|{a}_{i}\rangle \), \(|{s}_{i}\rangle \)  and  \(|r\rangle \), representing actions ai, percepts si and rewards r, respectively. Such agents may behave ‘classically’, that is, use a classical channel, or ‘quantumly’, meaning that communication is no longer limited to a fixed preferred basis, but allows for exchanges of arbitrary superpositions via a quantum channel, as shown in Fig. 1c. In general, agents react to (sequences of) percepts \(|{s}_{i-1}\rangle \) with (sequences of) actions \(|{a}_{i}\rangle \) according to a policy \({\rm{\pi }}({a}_{i}|{s}_{i-1})\) that is updated during the learning process via classical control.

Within this framework, we focus on so-called2 deterministic strictly epochal (DSE) learning scenarios, also called episodic instead of epochal1. Here ‘epochs’ consist of strings of percepts s = (s0, ..., sL−1) with fixed s0, actions a = (a1, ..., aL) of fixed length L, and a final reward r, and both s = s(a) and r = r(a) are completely determined by a. Therefore, no explicit representation of the percepts is required in our experiment (Methods). A non-trivial feature of the DSE scenario is that the effective behaviour of the environment can be modelled via a unitary UE (ref. 2) on the action and reward registers A and R as

$${U}_{{\rm{E}}}{|{\bf{a}}\rangle }_{{\rm{A}}}{|0\rangle }_{{\rm{R}}}=\{\begin{array}{c}{|{\bf{a}}\rangle }_{{\rm{A}}}{|1\rangle }_{{\rm{R}}}\,\,{\rm{i}}{\rm{f}}\,\,r({\bf{a}}) > 0\\ {|{\bf{a}}\rangle }_{{\rm{A}}}{|0\rangle }_{{\rm{R}}}\,\,{\rm{i}}{\rm{f}}\,\,r({\bf{a}})=0\end{array}\cdot $$

UE is similar to a generalized controlled-NOT gate such that in case of rewarded action sequences (r(a) > 0), the reward state is flipped. UE can therefore be used to perform a quantum search for such sequences.

A hybrid agent can choose between quantum and classical behaviour in each epoch. In classical epochs, the agent prepares the state \({|{\bf{a}}\rangle }_{{\rm{A}}}{|0\rangle }_{{\rm{R}}}\), where a is determined by sampling from a classical probability distribution p(a) determined by its policy π. With a winning probability

$$\varepsilon ={\sin }^{2}(\xi )=\sum _{\{{\bf{a}}|r({\bf{a}}) > 0\}}p({\bf{a}}),$$

with ξ [0, 2π], the agent receives a reward and updates its policy according to a rule, presented in equation (4), based on projective simulation26 (see also Methods). In quantum epochs, the following steps are performed:

(1) The agent prepares the state \({|\psi \rangle }_{{\rm{A}}}{|-\rangle }_{{\rm{R}}}\), with \({|\psi \rangle }_{{\rm{A}}}={{\sum }_{{\bf{a}}}\sqrt{p({\bf{a}})}|{\bf{a}}\rangle }_{{\rm{A}}}=\)\(\cos (\xi ){|{\ell }\rangle }_{{\rm{A}}}+\,\sin (\xi ){|w\rangle }_{{\rm{A}}}\), and sends it to the environment. \({|w\rangle }_{{\rm{A}}}\) and \({|{\ell }\rangle }_{{\rm{A}}}\) are superpositions of all winning (rewarded) and losing (non-rewarded) action sequences, respectively, and \({|-\rangle }_{{\rm{R}}}=({|0\rangle }_{{\rm{R}}}-{|1\rangle }_{{\rm{R}}})/\sqrt{2}\).

(2) The environment applies UE from equation (1) to \({|\psi \rangle }_{{\rm{A}}}{|-\rangle }_{{\rm{R}}}\), flipping the sign of the winning state:

$${U}_{{\rm{E}}}{|\psi \rangle }_{{\rm{A}}}{|-\rangle }_{{\rm{R}}}=[\cos (\xi ){|\ell \rangle }_{{\rm{A}}}-\,\sin (\xi ){|w\rangle }_{{\rm{A}}}]{|-\rangle }_{{\rm{R}}},$$

and returns the resulting state to the agent.

(3) The agent performs a reflection \({U}_{{\rm{R}}}=2|\psi \rangle {\langle \psi |}_{{\rm{A}}}-{{\mathbb{1}}}_{{\rm{A}}}\) over the initial state \({|\psi \rangle }_{{\rm{A}}}\).

The last step leads to amplitude amplification similar to Grover’s algorithm27 and thus to an increased probability sin2(3ξ) (ref. 28) to find rewarded action sequences (Methods). In our experiment, the hybrid agent performs a single query (and thus a single step of amplitude amplification) during a quantum epoch. However, the general framework allows for multiple steps of amplitude amplification in consecutive quantum epochs.

While quantum epochs lead to an increased winning probability, they do not reveal the reward (or corresponding percept sequence, in general). The reward can be determined only via classical test epochs, where the obtained action sequence is used as input. Thus, the hybrid agent alternates between quantum and classical test epochs, updating its policy every time a reward is obtained after a test epoch. Such agents accomplish their task of finding winning action sequences faster, and hence learn faster than entirely classical agents. This approach allows us to quantify the speed-up in learning time, which is not possible in the general setting discussed in ref. 2. The learning speed-up manifests in a reduced average learning time \({\langle T\rangle }_{{\rm{Q}}}\), that is, the average number of epochs necessary to achieve a certain winning probability PL. In general, a quadratic improvement can be achieved if the maximal number of coherent interactions between agent and environment scales with the problem size (Methods).

Experimental implementation

Quantized RL protocols can be compactly realized using state-of-the-art photonic technology29. Nowadays, integrated photonic platforms hold the advantage of providing scalable architectures where many elementary components can be accommodated on small devices30. Here we use a programmable nanophotonic processor comprising 26 waveguides fabricated to form 88 Mach–Zehnder interferometers (MZIs). An MZI is equipped with two configurable phase shifters as shown in Fig. 2a, b, and acts as a tunable beam splitter. Information is spatially encoded onto two orthogonal modes \(|0\rangle \) = (1, 0)T and \(|1\rangle \) = (0, 1)T, which constitute the computational basis (T indicates the transpose).

Fig. 2: Experimental setup.
figure 2

a, Single programmable unit consisting of an MZI equipped with two fully tunable phase shifters, one internal allowing for a scan of the output distribution over θ [0, 2π], and one external dictating the relative phase ϕ [0, 2π] between the two output modes. This makes the MZI act as a fully tunable beam splitter and allows for coherent implementation of sequences of quantum gates. b, Image of a single MZI in the processor. The third phase shifter in the bottom arm of the interferometer is not used. c, Overview of the setup. A single-photon source generates single-photon pairs at telecommunication wavelength. One photon is sent to a single-photon detector D0, while the other one is coupled into the processor and undergoes the desired computation. It is then detected, in coincidence with the photon in D0, either in detector D1 or in D2/D3 after the agent plays the classical/quantum strategy (see Fig. 3 for more details). The coincidence events are recorded with a custom-made TTM. Different areas of the processor are assigned to either the agent or the environment, which can perform a Grover-like quantum search to look for rewarded action sequences in quantum epochs. The bottom part of the figure represents the Grover quantum circuit, where H indicates Hadamard gates creating quantum superpositions and n represents the number of target qubits. The agent has access to a classical control that updates its policy.

As illustrated in Fig. 2c, pairs of single photons are generated (at telecommunication wavelengths) from a single-photon source. One photon is coupled into a waveguide and then detected by single-photon detectors D1, D2 or D3, while the other one is sent to D0 for heralding (that is, clicks in detectors D1, D2 or D3 are registered in coincidence with clicks in D0). The detectors are superconducting nanowires with efficiencies of up to roughly 90% (see Methods for experimental details). The processor is divided into three regions, where the first and last are assigned to the agent, and the middle region to the environment, to carry out, in quantum epochs, steps 1–3 listed above. The agent is further equipped with a classical control mechanism (a feedback loop) that updates its learning policy.

In our experiment, we represent the winning and losing action states \({|w\rangle }_{{\rm{A}}}\) and \({|{\ell }\rangle }_{{\rm{A}}}\) by a single qubit via \({|1\rangle }_{{\rm{A}}}={|w\rangle }_{{\rm{A}}}\)  and  \({|0\rangle }_{{\rm{A}}}={|{\ell }\rangle }_{{\rm{A}}}\), and use another qubit to encode the reward \(({|0\rangle }_{{\rm{R}}},{|1\rangle }_{{\rm{R}}})\). This results in a four-level system, where each level is a waveguide in our processor, as shown in Fig. 3. The winning probability for the agent is initially set to ε = sin2(ξ) = 0.01, representing a single rewarded action sequence out of 100. After a single photon is coupled into the mode \(|{0}_{{\rm{A}}}{0}_{{\rm{R}}}\rangle \), the agent creates the state \({|\psi \rangle }_{{\rm{A}}}=(\cos ({\xi }){|0\rangle }_{{\rm{A}}}+\,\sin ({\xi }){|1\rangle }_{{\rm{A}}}){|0\rangle }_{{\rm{R}}}\) by applying a unitary UP. Next, it can decide to play classically or quantum mechanically.

Fig. 3: Circuit implementation.
figure 3

a, b, One photon is coupled into the \(|{0}_{{\rm{A}}}{0}_{{\rm{R}}}\rangle \) waveguide and undergoes different operators depending on whether a classical (a) or a quantum (b) epoch is implemented. The waveguides highlighted in yellow show the photon’s possible paths. Identity gates are represented by straight waveguides. Only the part of the processor needed for the computation is illustrated.

In a classical strategy, the environment flips the reward qubit only if the action qubit is in the winning state via UE (Fig. 3a). Next, the photon is coupled out and detected in either D1 or D2 with probability cos2(ξ) and sin2(ξ), respectively. If D2 is triggered (that is, the agent has been rewarded), a feedback mechanism updates the policy π by updating the winning probability εj after having obtained j rewards as

$${\varepsilon }_{j}=\frac{1+2j}{100+2j}.$$

π is related to p(a), and therefore to ε, via equation (5) (Methods).

In a quantum strategy, after the reward qubit is rotated to \({|-\rangle }_{{\rm{R}}}\) via two operators UH0 and UH1, the environment acts as an oracle via UE as in equation (3). Consecutively, the agent reverses the effect of UH0 and UH1, and performs the reflection UR (Fig. 3b). Measuring in the computational basis of the action register then leads to the detection of a rewarded action sequence with increased probability sin2(3ξ) in D3. For practical reasons, the classical test epoch is implemented only in software (Methods). The update rule remains the same as in the classical case.

In general, any Grover-like algorithm faces a drop in amplitude amplification after the optimal point is reached. As different agents will reach this optimal point in different epochs, one can identify the probability ε = 0.396 until which it is beneficial for all agents to use a quantum strategy, as they will observe more rewards than in the classical strategy on average (Methods). When this probability is surpassed, it is advantageous to switch to an entirely classical strategy. This combined strategy thus avoids the typical amplitude amplification drop without introducing additional overheads in terms of experimental resources.


At the end of each classical epoch, we record outcomes 1 and 0 for the rewarded and non-rewarded behaviour, respectively, obtaining a binary sequence whose length equals the number of played epochs in the classical learning strategy and half of the number of played epochs in the quantum strategy (as here two epochs, quantum and classical test, are needed to obtain the reward). For a fair comparison between these scenarios, in the quantum strategy, the reward is distributed (that is, averaged) over the quantum and classical test epochs. The reward is then averaged over different independent agents. Figure 4 shows this average reward η for the different learning strategies.

Fig. 4: Behaviour of the average reward η for different learning strategies.
figure 4

The solid line represents the theoretical data simulated with n = 10,000 agents, while the dots represent the experimental data measured with n = 165 agents. The shaded regions indicate the errors associated with each single data point. a, η of agents playing a quantum (blue) or classical (orange) strategy. b, c, η accounting for rewards obtained only every second epoch in the quantum strategy (b), compared with the case where the reward is distributed over the two epochs needed to acquire it (c). The error bars represent the standard errors. d, Comparison between the classical (orange) and combined (green) strategy, where an advantage over the classical strategy is visible. Here the agents stop the quantum strategy at their best performance (at ε = 0.396) and continue playing classically. The inset shows the point where agents playing the quantum strategy reach the winning probability PL = 0.37, after \({\langle T\rangle }_{{\rm{Q}}}\) = 100 epochs.

The theoretical data are simulated for n = 10,000 agents and the experimental data obtained from n = 165. Figure 4a visualizes the quantum improvement originating from the use of amplitude amplification compared with a purely classical strategy. For completeness, the comparison between not distributing and distributing the reward over two epochs in the quantum strategy is shown in Fig. 4b, c.

When ε = 0.396, η for the quantum strategy starts decreasing, as visible in Fig. 4a. Our setup allows the agents to choose the favourable strategy by switching from quantum to classical when the latter becomes advantageous. This combined strategy outperforms the purely classical scenario, as shown in Fig. 4d. As previously discussed, a certain winning probability PL has to be defined to quantify the learning time. Choosing PL = 0.37 (note however, that any probability below ε = 0.396 can be chosen), the learning time \(\langle T\rangle \) for PL decreases from \({\langle T\rangle }_{{\rm{C}}}\) = 270 in the classical strategy to \({\langle T\rangle }_{{\rm{Q}}}\) = 100 in the combined strategy. This implies a reduction of 63%, which fits well to the theoretical values \({T}_{{\rm{C}}}^{{\rm{theory}}}\) = 293 and \({T}_{{\rm{Q}}}^{{\rm{theory}}}\) = 97, accounting for small experimental imperfections.

In general, hybrid agents can experience a quadratic speed-up in their learning time if arbitrary numbers of coherent Grover iterations can be performed, even if the number of rewarded actions is unknown31.


We have demonstrated an RL protocol where an agent can boost its performance by allowing quantum communication with the environment. This enables a quantum speed-up in its learning time and optimal control of the learning process. Emerging photonic technology provides the advantages of compactness, tunability and low-loss communication, thus proving suitable for RL algorithms where active-feedback mechanisms, even over long distances, need to be implemented. Future scaled-up implementations of our protocol rely on a linearly increasing number of waveguides with the action space size when considering action sequences of length L = 1, and the use of just a single photon. In this case, a learning task with N different actions requires a processor with 2N modes, while 3N + 1 or maximally \(\frac{{\rm{\pi }}}{4}\sqrt{N}(3N+1)\) gates are needed for a single or multiple rounds of amplitude amplification, respectively. In general, multiple photons will be required to deal with a combinatorially big space of action sequences of arbitrary length L. We envision our protocol to aid specifically in problems where frequent search is needed, for example, network routing problems, where, for instance, tens of qubits, waveguides and detectors would be employed to represent search spaces of 104 elements. In general, the development of superconducting detectors, on-demand single-photon sources32 or the large-scale integration of artificial atoms within photonic circuits33 suggest substantial steps towards scalable multiphoton applications. Although photonic architectures are particularly suitable for such learning algorithms, our theoretical background is applicable to different platforms, for example, trapped ions or superconducting qubits. Here future realizations can feature the implementation of agent and environment as spatially separated systems, and a light–matter quantum interface for coherent exchange between them24,34.


Quantum enhancement in RL agents

Here we present an explicit method for combining a classical agent with quantum amplitude amplification. Introducing a feedback loop between classical policy update and quantum amplitude amplification, we are able to determine achievable improvements in sample complexity, and thus in learning time. In addition, the final policy of our agent has properties similar to those of the underlying classical agent, leading to a comparable behaviour as discussed in more detail in A.H. et al. (manuscript in preparation; which is dedicated to discussing the theoretical background of the hybrid agent more specifically).

In the following, we focus on simple DSE environments, where the interaction between the agent and the environment is structured into epochs. Each epoch starts with the same percept s0, and at each time step i an action–percept pair (ai, si) is exchanged. Many interesting environments are epochal, for example, in applications of RL to quantum physics21,35,36,37 or popular problems such as playing Go10. At the end of each epoch, after L action–percept pairs are communicated, the agent receives a reward r {0, 1}. The rules of the game are deterministic and time independent, such that performing a specific action ai after receiving a percept si−1 always leads to the same following percept si.

The behaviour of an agent is determined by its policy described by the probability π(ai|si−1) to perform the action ai given the percept si−1. In deterministic settings, the percept si is completely determined by all previously performed actions a1, , ai such that π(ai|si−1) = π(ai|a1, , ai−1). Thus, the behaviour of the agent within one epoch is described by action sequences a = (a1, ..., aL) and their corresponding probabilities

$$p({\bf{a}})=\mathop{\prod }\limits_{i=1}^{L}{\rm{\pi }}({a}_{i}|{a}_{1},\cdots ,{a}_{i-1}).$$

Our learning agent uses a policy based on projective simulation26, where each action sequence a is associated with a weight factor h(a) initialized to h = 1. Its policy is defined via the probability distribution

$$p({\bf{a}})=\frac{h({\bf{a}})}{\sum _{{\bf{a}}{\prime} }h({\bf{a}}{\prime} )}.$$

In our experiment, the initial winning probability ε is given by

$$\varepsilon =\sum _{\{{\bf{a}}|r({\bf{a}}) > 0\}}p({\bf{a}})=\frac{\sum _{\{{\bf{a}}|r({\bf{a}}) > 0\}}h({\bf{a}})}{\sum _{\{{\bf{a}}\}}h({\bf{a}})}$$

and is set to 1/100. If the agent has chosen the sequence a, it updates the corresponding weight factor via

$$h({\bf{a}})\to h({\bf{a}})+\lambda r({\bf{a}}),$$

where λ = 2 in our experiment and r(a) = 1 (0) if a is rewarded (non-rewarded). Thus, the winning probability after the agent has found j rewards is given by equation (4). In general, the update method for quantum-enhanced agents is not limited to projective simulation and can be used to enhance any classical learning scenario, provided that p(a) exists and that the update rule is solely based on the observed rewards. We generalize the given learning problem to the quantum domain by encoding different action sequences a into orthogonal quantum states \(|{\bf{a}}\rangle \) defining our computational basis. In addition, we create a fair unitary oracular variant of the environment2, whose effective behaviour on the action register can be described by \({\tilde{U}}_{{\rm{E}}}\) as

$${\tilde{U}}_{{\rm{E}}}|{\bf{a}}\rangle =\{\begin{array}{ll}-|{\bf{a}}\rangle & {\rm{if}}\,r({\bf{a}}) > 0\\ |{\bf{a}}\rangle & {\rm{if}}\,r({\bf{a}})=0\end{array}.$$

The unitary oracle \({\tilde{U}}_{{\rm{E}}}\) can be used to perform, for instance, a Grover search or amplitude amplification for rewarded action sequences by performing Grover iterations

$${U}_{{\rm{G}}}=(2|\psi \rangle \langle \psi |-{\mathbb{1}}){\mathop{U}\limits^{ \sim }}_{{\rm{E}}}$$

on an initial state \(|\psi \rangle \). A quantum-enhanced agent with access to \({\tilde{U}}_{{\rm{E}}}\) can thus find rewarded action sequences faster than a corresponding classical agent defined by the same initial policy π(ai|si−1) and update rules.

In general, the optimal number k of Grover iterations \({U}_{{\rm{G}}}^{k}|\psi \rangle \) depends on the winning probability ε via \(k\propto 1/\sqrt{\varepsilon }\) (ref. 27). In the following, we assume that ε is known at least to a good approximation. This is, for instance, possible if the number of rewarded action sequences is known. However, a similar agent can also be developed if ε is unknown by adapting methods from ref. 31 as described in A.H. et al. (manuscript in preparation).

Description of the agent

A quantum-enhanced hybrid agent is constructed via the following steps:

(1) Given the classical probability distribution p(a), determine the winning probability ε = sin2(ξ) based on the current policy and prepare the quantum state in the action register:

$$|\psi \rangle =\sum _{\{{\bf{a}}\}}\sqrt{p({\bf{a}})}|{\bf{a}}\rangle $$
$$=\,\cos (\xi )|\ell \rangle +\,\sin (\xi )|w\rangle .$$

Here the quantum states

$$|\ell \rangle \propto \sum _{\{{\bf{a}}|r({\bf{a}})=0\}}\sqrt{p({\bf{a}})}|{\bf{a}}\rangle ,$$
$$|w\rangle \propto \sum _{\{{\bf{a}}|r({\bf{a}}) > 0\}}\sqrt{p({\bf{a}})}|{\bf{a}}\rangle ,$$

contain all losing and winning components, respectively. In our experiment, we identify \(|{\ell }\rangle \widehat{=}|0\rangle \) and \(|w\rangle \widehat{=}|1\rangle \). The task assigned to the agent is to (learn to) perform the winning sequences \(|w\rangle \) via policy update. This translates to a maximization of the obtained reward.

(2) Apply the optimal number \(k(\sqrt{\varepsilon })\) of Grover iterations leading to

$$|\psi {}^{{\prime} }\rangle ={U}_{{\rm{G}}}^{k}|\psi \rangle ,$$

and perform a measurement in the computational basis on \(|\psi {}^{{\prime} }\rangle \) to determine a test action sequence a.

(3) Play one classical epoch by using the test sequence a determined in step 2 and obtain the corresponding percept sequence s(a) and the reward r(a).

(4) Update ε, and thus the classical policy π, using the rule in equation (4).

There exists a limit P on ε determining whether it is more advantageous for the agent to perform k Grover iterations with \(k(\sqrt{\varepsilon })\ge 1\) or sample directly from p(a) (therefore \(k(\sqrt{\varepsilon })=0\)) to determine a. In the latter case, the agent would interact only classically (as in step 3) with the environment.

After each epoch, a classic agent receives a reward with probability sin2(ξ). A quantum-enhanced agent can instead use one epoch to either perform one Grover iteration (step 2) or to determine the reward of a given test sequence a (step 3). After k Grover iterations, the winning probability is sin2[(2k + 1)ξ] (ref. 28; see next section). Thus, for k = 1, the agent receives a reward after every second epoch with probability sin2(3ξ). Therefore, we define the expected average reward of an agent playing a classical strategy as ηC = sin2(ξ) and of an agent playing a quantum strategy with k = 1 as ηQ = sin2(3ξ)/2. For ε < P, ηQ > ηC, meaning that the quantum strategy proves advantageous over the classical case. However, as soon as ηQ = ηC (at P = 0.396), a classical agent starts outperforming a quantum-enhanced agent that still performs Grover iterations.

Determining the winning probability ε exactly as in the example presented here is not always possible. In general, additional information such as the number of possible solutions and model building helps to perform this task. Note that a P smaller than 0.396 should be chosen if ε can only be estimated up to some range. To circumvent this problem, methods like Grover search with unknown reward probability31 or fixed-point search38 can be used to determine whether and how many steps of amplitude amplification should be performed (A.H. et al., manuscript in preparation).

Enhancement of the winning probability

After a quantum epoch, the amplitude sin(ξ) of the winning state \(|w\rangle \) increases to sin(3ξ). Here we derive this result. The projections onto the winning and losing subspaces are given by \({\Pr }_{w}={\sum }_{\{{\bf{a}}|r({\bf{a}}) > 0\}}|{\bf{a}}\rangle \langle {\bf{a}}|\) and \({\Pr }_{{\ell }}={\sum }_{\{{\bf{a}}|r({\bf{a}})=0\}}|{\bf{a}}\rangle \langle {\bf{a}}|\), respectively, which are orthogonal and sum to identity. Therefore, the initial state (12) can be decomposed into a normalized winning \(|w\rangle \propto {\Pr }_{w}|\psi \rangle \) and losing \(|{\ell }\rangle \propto {\Pr }_{{\ell }}|\psi \rangle \) component, and the unitary (10) implementing one Grover iteration can be written as

$${U}_{{\rm{G}}}=(2|\psi \rangle \langle \psi |-{\mathbb{1}})({Pr}_{\ell }-{Pr}_{w}).$$

Now, let us investigate the effect of UG on an arbitrary real superposition

$$|s\rangle =\,\sin (\alpha )|w\rangle +\,\cos (\alpha )|{\ell }\rangle $$

in the plane spanned by \(|w\rangle \) and \(|{\ell }\rangle \). Using the trigonometric addition theorems, the application of one Grover iteration to \(|s\rangle \)

$${U}_{{\rm{G}}}|s\rangle =\,\sin (\alpha +2\xi )|w\rangle +\,\cos (\alpha +2\xi )|{\ell }\rangle $$

can be identified as a rotation of 2ξ in the plane. Assuming α = ξ, we therefore find the amplitude of \(|w\rangle \) to be sin(3ξ), and thus obtain a winning probability sin2(3ξ). Implementing k Grover iterations leads to sin2[(2k + 1)ξ].

Learning time

We define the learning time T as the number of epochs an agent needs on average to reach a certain winning probability PL. The hybrid agent can reach P with fewer epochs on average than its classical counterpart. However, once both reach P, they need on average the same number of epochs to reach P = ΔP where ΔP is a certain value satisfying 0 ≤ ΔP < 1 − P. Therefore, we choose PL ≤ P to quantify the achievable improvement of a hybrid agent compared with its classical counterpart. In our experiment, we choose PL = 0.37 to define the learning time.

Let \({l}_{J}=\{{{\bf{a}}}_{1},\cdots ,{{\bf{a}}}_{J}\}\) be a time-ordered list of all the rewarded action sequences an agent has found until it reaches PL. Note that the actual policy πj, and thus pj, of our agents depend only on the list lj of observed rewarded action sequences, and this is independent of whether they have found them via classical sampling or quantum amplitude amplification. As a result, a classical agent and its quantum-enhanced hybrid version are described by the same policy π(lj) and behave similarly if they have found the same rewarded action sequences. However, the hybrid agent finds them faster.

In general, the actual policy and overall winning probability might depend on the rewarded action sequences that have been found. Thus, the number J of observed rewarded action sequences necessary to learn might vary. However, this is not the case for the experiment reported here. In our case, the learning time can be determined via

$$T(J)=\mathop{\sum }\limits_{j=0}^{J-1}{t}_{j},$$

where tj determines the number of epochs necessary to find the next rewarded sequence aj+1 after it has observed j rewards. For a purely classical agent, the average time is given by

$${\langle {t}_{j}\rangle }_{{\rm{C}}}=\frac{1}{{\varepsilon }_{j}}.$$

This time is quadratically reduced to

$${\langle {t}_{j}\rangle }_{{\rm{Q}}}=\frac{\alpha }{\sqrt{{\varepsilon }_{j}}}$$

for the hybrid agent. Here α is a parameter depending only on the number of epochs needed to create one oracle query \({\tilde{U}}_{{\rm{E}}}\) (ref. 2) and on whether εj is known. In the case considered here, we find α = π/4. As a consequence, the average learning time for the hybrid agent is given by

$${\langle T(J)\rangle }_{{\rm{Q}}}=\mathop{\sum }\limits_{j=0}^{J-1}\frac{\alpha }{\sqrt{{\varepsilon }_{j}}}\,\le \,\alpha \sqrt{J}\sqrt{{\langle T(J)\rangle }_{{\rm{C}}}},$$

where we used the Cauchy–Schwarz inequality and equations (19) and (20). The classical learning time typically scales with \({\langle T\rangle }_{{\rm{C}}}\propto {A}^{K}\) for a learning problem with episode length K and the choice between A different actions in each step. The number J depends on the specific policy update and sometimes also on the list lJ of observed rewarded action sequences. For an agent sticking with the first rewarded action sequence, we would find J = 1. However, typical learning agents are more explorative, and common scalings are J K such that we find

$${\langle T(J)\rangle }_{{\rm{Q}}}\propto \sqrt{\log ({\langle T(J)\rangle }_{{\rm{C}}})}\sqrt{{\langle T(J)\rangle }_{{\rm{C}}}}$$

for these cases. This is equivalent to a quasi-quadratic speed-up in the learning time if arbitrary numbers of Grover iterations can be performed.

In more general settings, there exist several possible lJ with different length J such that the learning time \(\langle T(J)\rangle \) needs to be averaged over all possible lJ, which again leads to a quadratic speed-up in learning time (A.H. et al., manuscript in preparation).

Limited coherent evolutions

In general, all near-term quantum devices allow for coherent evolution only for a limited time and are thus limited to a maximal number of Grover iterations. For winning probabilities ε = sin2(ξ) with (2k + 1)ξ ≤ π/2, performing k Grover iterations leads to the highest probability of finding a rewarded action.

Again, we assume that the actual policy of an agent depends on only the number of observed rewards an agent has found. As a consequence, the average time a hybrid agent limited to k Grover iterations needs to achieve the winning probability P < sin2[π/(4k + 2)] is given by

$${\langle T(J,k)\rangle }_{{\rm{Q}}}=\mathop{\sum }\limits_{j=0}^{J-1}\frac{{\alpha }_{0}k+1}{{\sin }^{2}[(2k+1){\xi }_{j}]},$$

with α0 determining the number of epochs necessary to create one oracle query \({\tilde{U}}_{{\rm{E}}}\). For α0 = 1, k 1 and (2k + 1)ξJ π/2, we can approximate the learning time for the hybrid agent via

$${\langle T(J,k)\rangle }_{{\rm{Q}}}\approx \mathop{\sum }\limits_{j=0}^{J-1}\frac{1}{4k{\varepsilon }_{j}}=\frac{{\langle T(J)\rangle }_{{\rm{C}}}}{4k},$$

where we used sin(x) ≈ x for x 1. In general, it can be shown (A.H. et al., manuscript in preparation) that the winning probability Pk = sin2[π/(4k + 2)] can be reached by a hybrid agent limited to k Grover iterations in a time

$${\langle T(k)\rangle }_{{\rm{Q}}}\le \gamma \frac{{\langle T\rangle }_{{\rm{C}}}}{k},$$

where γ is a factor depending on the specific setting.

In our case, equation (24) can be used to compute the lower bound for the average quantum learning time, with α0 = k = 1. For the classical strategy, equation (20), together with equation (19), is used. Thus, given PL = 0.37, the predictions for the learning time in our experiment are \({T}_{{\rm{Q}}}^{{\rm{theory}}}\) = 97 and \({T}_{{\rm{C}}}^{{\rm{theory}}}\) = 293.

Experimental details

A continuous-wave laser (Coherent Mira HP) is used to pump a single-photon source producing photon pairs in the telecommunication-wavelength band. The laser light has a central wavelength of 789.5 nm and pumps the single-photon source at a power of approximately 100 mW. The source is a periodically poled KTiOPO4 nonlinear crystal placed in a Sagnac interferometer39,40, where the emission of single photons occurs via a type-II spontaneous parametric down-conversion process. The crystal (produced by Raicol) is 30 mm long, set to a temperature of 25 °C, has a poling period of 46.15 μm and is quasi-phase matched for degenerate emission of photons at 1,570 nm when pumping with coherent laser light at 785 nm. As the processor is calibrated for a wavelength of 1,580 nm, we shift the wavelength of the laser light to 789.75 nm to produce one photon at 1,580 nm (that is then coupled into the processor) and another one at 1,579 nm (the heralding photon).

The processor is a silicon-on-insulator type, designed by the Quantum Photonics Laboratory at the Massachusetts Institute of Technology (MIT)30. Each programmable unit on the device acts as a tunable beam splitter implementing the unitary

$${U}_{\theta ,\varphi }=\left(\begin{array}{cc}{{\rm{e}}}^{{\rm{i}}\varphi }\sin \frac{\theta }{2} & {{\rm{e}}}^{{\rm{i}}\varphi }\,\cos \frac{\theta }{2}\\ \cos \frac{\theta }{2} & -\sin \frac{\theta }{2}\end{array}\right),$$

where θ and ϕ are the internal and external phases shown in Fig. 2a, b, set via thermo-optical phase shifters controlled by a voltage supply. The achievable precision for phase settings is higher than 250 μrad. The bandwidth of the phase shifters is around 130 kHz. The waveguides, spatially separated from one another by 25.4 μm, are designed to admit one linear polarization only. The high contrast in refractive index between the silicon and silica (the insulator) allows for waveguides with very small bend radius (less than 15 μm), thus enabling high component density (in our case 88 MZIs) on small areas (in our case, 4.9 × 2.4 mm). Given the small dimensions, the in-(out-)coupling is realized with the help of Si3N4−SiO2 waveguide arrays (produced by Lionix International), that shrink (and enlarge) the 10-μm optical fibres’ mode to match the 2-μm mode size of the waveguides in the processor. The total input–output loss is around 7 dB. The processor is stabilized to a temperature of 28 °C and calibrated at 1,580 nm for optimal performance. To reduce the black-body radiation emission due to the heating of the phase shifters when voltage is applied, wavelength division multiplexers with a transmission peak centred at 1,571 nm and bandwidth of 13 nm are used before the photons are sent to the detectors. In our processor, two external phase shifters in the implemented circuits were not responding to the supplied voltage. These defects were accounted for by employing an optimization procedure.

The single-photon detectors are multi-element superconducting nanowires (produced by Photon Spot) with efficiencies up to 90% in the telecommunication-wavelength band. They have a dark count rate of about 100 counts per second, low timing jitter (hundreds of picoseconds) and a reset time <100 ns (ref. 41). Coincidence events are those detection events registered in D0 and at the output of the processor that fall into a temporal window of 1.3 ns (the coincidence window), and are found using a time tagging module (TTM). In more detail, to record coincidences and then update the agent’s policy accordingly, the following steps are performed in the classical/quantum strategy (after initially setting  \({h}_{w}={\sum }_{\{{\bf{a}}|r({\bf{a}}) > 0\}}h({\bf{a}})=1\)  and  \({h}_{{\ell }}={\sum }_{\{{\bf{a}}|r({\bf{a}})=0\}}h({\bf{a}})=99\) such that ε = 0.01).

(1) The TTM records the time tags for photons in D0, D1 and D2/D3.

(2) A Python script converts the time tags into arrival times, and it iterates through them until it finds a coincidence event between either D0 and D1, or D0 and D2/D3.

(3) If a coincidence event between D0 and D1 (or D0 and D2/D3) is first found, a 0 (1) is sent to another Python script controlling the MZIs’ phase shifters operating on a different computer. If a 0 is sent, the ratio \(\varepsilon ={h}_{w}/({h}_{{\ell }}+{h}_{w})\) remains unchanged. If a 1 is sent, hw is updated as hw + 2, which follows the update in equation (4). In the quantum strategy, this step also includes the implementation of a classical test epoch.

Implementing classical test epochs on hardware would require ‘testing’ the measured action state, that is, using the measured action sequence as input and making the environment act via UE, thus leading to detection of a reward \({|0\rangle }_{{\rm{R}}}\)or \({|1\rangle }_{{\rm{R}}}\). However, since this simple circuit works in very close agreement with theoretical predictions (its visibility exceeds 0.99), this part has been implemented in software only.

The update rate is about 1 Hz for both the classical and quantum epochs, and can be reduced up to the phase shifters’ bandwidth.