Abstract
Quantum information processing devices need to be robust and stable against external noise and internal imperfections to ensure correct operation. In a setting of measurementbased quantum computation, we explore how an intelligent agent endowed with a projective simulator can act as controller to adapt measurement directions to an external stray field of unknown magnitude in a fixed direction. We assess the agent’s learning behavior in static and timevarying fields and explore composition strategies in the projective simulator to improve the agent’s performance. We demonstrate the applicability by correcting for stray fields in a measurementbased algorithm for Grover’s search. Thereby, we lay out a path for adaptive controllers based on intelligent agents for quantum information tasks.
Introduction
When building devices for quantum information processing one has to take changing environment conditions and device imperfections into account. It is therefore necessary to include adaptive mechanisms that characterize and calibrate the device from within. Furthermore, it is desirable for these devices to obtain a certain degree of autonomy in maintaining their functional state despite detrimental environment influences, in particular, when they are assembled to a larger quantum information processing infrastructure. In the attempt to miniaturize current implementations of quantum devices, we will reach the point where these devices will be of microscopic scale and require short reaction times. For such microscopic systems we can no longer assume that their internal controllers are fullfledged universal computers that can carry out arbitrary programs. Instead, controllers will be small physical systems that are specialized for their respective purpose with a program that emerges from the controller’s analog dynamics.
In this paper we explore the applicability of a controller in form of an intelligent learning agent that has access to a projective simulator^{1,2,3,4}. Within this agent framework, the aim is to demonstrate adaptive calibration and compensation strategies against stray external fields when carrying out quantum information tasks. The agent shall thereby implement a simple form of adaptive error avoidance and implicit parameter estimation.
Algorithms from machine learning have been used to find strategies for parameter estimation and optimal strategies for parameter estimation are known for specific cases, see e.g.^{5,6,7,8,9,10}. Here, however, we focus on strategies that arise naturally from the adaptive dynamics of the underlying physical system, for which we choose a projective simulator. The projective simulator is a platform that has been proposed as a physical model for reinforcement learning^{11,12} and it effectively reproduces input—output—reward correlations from an internal adaptive stochastic process. With the restriction to this particular system, one cannot hope for the best possible strategy to emerge while keeping the rules governing the dynamics reasonably simple and computational overhead low. Both requirements are necessary to allow for an actual physical realization. As an additional feature, the projective simulator offers a natural route to quantization^{1} and thereby a way to intelligent agents that benefit from internal quantum dynamics, as demonstrated in the reflective quantum projective simulator^{13,14}. Agent quantization is not explored further in the present work as we focus on the application of a classical agent to quantum information processing first. For recent comprehensive reviews in the domain of quantum physics and artificial intelligence or machine learning see Ref. 15,16.
As illustration of our method of adaptive quantum information processing we study Grover’s quantum search algorithm^{17,18} in the paradigm of measurementbased quantum computation. Grover’s algorithm provides a fast way to find a marked item in an unsorted database with N elements. In particular, it provides a quadratic speedup with database lookups over a search by means of a classical computer with O(N) lookups. First proofofprinciple implementations of Grover’s algorithm with nuclear magnetic resonance techniques^{19,20} and entangled photons^{21} employed the circuit model of quantum computation, where individual unitary quantum logic gates are applied to a register of qubits to process information.
Measurementbased quantum computation (MBQC)^{22,23} is a different paradigm of quantum computation, where the computation is carried out by measuring single qubits of an initially highly entangled resource state^{24}. The first experimental demonstration of MBQC in a system of entangled photons^{25,26} (and with trapped ions^{27}) also demonstrated the Grover algorithm in its smallest realization with a database of 4 entries (2qubits) by using a 4qubit cluster state as computation resource.
As preparation for the full measurementbased algorithm we first study a basic setting. We situate a quantum system, a single qubit, in an unknown external magnetic field. An artificial agent, the controller, is endowed with a projective simulator and the ability to measure the quantum system and thereby prepare quantum states. We hardwire the learning process, i.e., the update rule in the reinforcement learning process of the projective simulator, such that the agent effectively carries out the following tasks: (i) Adapt measurement directions to changes of the external magnetic field and dynamically improve the sensing resolution. (ii) Learn to adapt simultaneously for multiple measurement directions needed for general MBQCalgorithms in a feedback scheme. (iii) Carry out a quantum information task, the Grover algorithm^{17,18} in the setting of measurementbased quantum computation, with unknown stray magnetic fields. This provides a completely workedout example, starting from the physical system that generates the actions of an adaptive “intelligent” agent, here a projective simulator, to a controller tailored to a specific quantum information task, e.g. measurementbased Grover’s search algorithm.
Results
First, we describe an approach that allows the projective simulator to effectively obtain a notion of the strength of an external magnetic field and hence carry out a primitive form of parameter estimation. However, there is a conceptual difference between our approach and parameter estimation. After the agent has learned, the information on the strength of the magnetic field will not be available as a number that the agent gives as an output. Instead, this information is only indirectly incorporated into the dynamics and decision patterns of the agent and it can be exploited to do certain things that are adapted to the external field. Therefore, we will analyze the learning process of the agent from two different perspectives: From an operational perspective we characterize how well the agent adapts its actions to the external field and from an informational perspective we quantify how much of the information about the external field is really contained in the parameters that define the dynamics of the agent.
We start with a detailed description of the setting, that is, of the agent and its interaction with the measurement apparatus, the dynamics of the projective simulator and an analysis of the learning process.
Agent and Projective Simulator Dynamics
In the present setting the magnetic field direction is promised to be fixed along the zaxis and the agent needs to estimate its strength B. The following steps are visualized in Fig. 1. The agent starts by preparing a single qubit in the state , which in the presence of the field evolves according to the Hamiltonian H = ℏωσ_{z}/2, where the frequency ω is proportional to the magnetic field strength B. After some fixed time interval Δt, the initial state has evolved into
up to a global phase, with φ = ωΔt. Estimating the field strength B amounts to estimating φ〉 and obtaining information about the angle φ between this state and the initial state in the equatorial plane of the Bloch sphere. In a linear optics setup^{28}, the unknown angle φ would correspond to an unknown phase shifter in the beam line.
The agent measures the qubit in the unknown state φ〉 in various directions and incorporates the measurement outcomes to change its choice of measurement directions. The measurements applied by the agent are in general described by POVMs^{29}. For simplicity, we will restrict our analysis to projective measurements. We shall comment on the general case at the end of the paper.
The challenge is to effectively realize a probability distribution for the unknown angle φ without explicitly performing computations and analyzing the measurement data. Rather it should emerge dynamically as the result of a feedback loop by reinforcing certain actions on the quantum system. Therefore, we choose an approach where the internals of the agent are wired such that it tries to optimize the direction of a measurement. In the optimal case φ〉 is the +1 eigenstate of this measurement. Qubit observables whose eigenstates with eigenvalues ±1 lie in the equator of the Bloch sphere are given by
where α〉 is of the form (1) with angle α. Both eigenstates lie on opposite sides of the equator. The probability to obtain the measurement outcome ±1 is
that is, the closer the angles α and φ the higher is the probability to obtain the +1 measurement outcome. To simplify notation we often consider the projector onto the +1 eigenstate
instead of the observable Ô_{α} and measurements of the projector with outcomes 1 and 0. For qubits, measuring gives the same statistics of measurement outcomes and resulting states as measuring the observable Ô_{α} because there is a unique state orthogonal to α〉.
The projective simulator inside the agent employs an adaptive stochastic process that is modeled by a random walk of an excitation in a network of socalled “clips”^{1}. For now the clip network takes the form of a directed weighted graph depicted in Fig. 2. The random walk starts at the only “percept clip”, which is excited by an internal trigger of the agent with a time interval Δt after the qubit has been prepared (cf. Fig. 1). The excitation propagates in the network according to the weights of the links that connect the percept clip to the action clips. Once the excitation reaches an action clip, the corresponding action is performed and the process inside the projective simulator is finished. A single action is a measurement of a certain at the qubit. If the measurement outcome is +1 it is fed back as reward to the agent to reenforce and strengthen the link between the percept clip and the last action clip. The process is repeated for the next measurement. The probabilities to select certain measurements, however, change as a result of previous measurement outcomes. This makes measurements with angles α closer to φ more likely. These probabilities in effect represent a coarsegrained, discrete probability distribution over angles φ.
In detail, each link in the clip network carries a weight h_{α}. The probability to jump from the percept clip “*” to the action clip corresponding to is given by the normalized weight of all edges from the percept clip, that is,
At the beginning of the learning process all weights are initialized with h_{α}(0)= 1. After the measurement of in the nth round, the measurement outcome (0 or 1) is rescaled by a factor λ and fed back into the projective simulator as a reward λ_{n} to the transition with weight h_{α}. Regardless of whether or not a transition has been taken, all weights are damped by a small amount with rate γ. After the nth round, in which α was the measurement angle, all weights are changed according to the following update rule:
As a result, the projective simulator converges to a state (set of hvalues) that increases the chances of obtaining +1 measurement outcomes and thereby increases the probability to measure in directions close to φ. From the perspective of the projective simulator only an outcome +1 denotes success because the action that led to this outcome will be reinforced. This “subjective” success probability is
An action that leads to a reward (measurement result +1) is also the correct action from an operational point of view. The transition probabilities p_{α} provide an internal representation of a discretized probability distribution for the angle φ. The change of p_{s} as a function of the number of rounds (measurements on the quibt) is depicted in Fig. 3 for several examples of φ. The results in Fig. 3 show that the agents learns to obtain rewards more often and thus obtains information about the state φ〉 and thereby about B.
Learning Curve Analysis
In our example we start with 4 projectors at angles every π/2, which corresponds to the projectors onto the eigenstates of the observables that are given by the Pauli matrices and . If φ = 0, measurements of will always give outcome +1 and hence be rewarded. The two adjacent projectors at α = π/2 and 3π/2 are rewarded in half of the measurements and measurements in the direction α = π are never rewarded. In this situation the projective simulator builds a strong link to , somewhat less strong links to and and leaves the link for at its initial value. The coarsegrained discrete probability distribution for φ is consequently peaked at φ = 0 and—within statistical fluctuations—symmetric around this direction (Fig. 3 top left inset). If φ is between two of the projectors, say φ ≈ π/4, measurements of and will only be rewarded with only 85% probability and measurements of the opposite projectors with 15% probability. The distribution of measurement probabilities will also be symmetric around the direction π/4 but less pronounced as shown by a broader distribution in Fig. 3 (bottom left inset). A broad distribution for measuring in the direction α results in a lower success probability for angles that have a large distance to all projectors, e.g., α = π/4.
At this point a smaller damping rate γ and a larger multiplier of the rewards λ both lead to a larger value of rewarded transitions in the steadystate and hence to a larger success probability and a probability distribution that is more peaked. At the same time increasing both λ and γ speedsup the learning process leading to learning curves with a steeper initial rise. Note, however, that extremal cases with too large rewards or too weak damping favor situations in which the agent prefers actions that just by luck led to a reward in the past although they are not highly rewarded on average. Unlearning such an initial “misunderstanding” and building a probability distribution that reflects the actual probabilities of being rewarded may take a long time. This aspect leads to larger fluctuations in the success probability of an ensemble of agents and a slower final convergence.
Asymptotic Success Probability
For the asymptotes of the success probability we can find a firstorder approximation by assuming a steady state of the transition probabilities and the respective hvalues. The resulting steadystate success probability is . When coarsegraining over many measurements the time average of the reward for each action is given by and the steadystate probability to measure in direction α is . With these assumptions the update rule (6) turns into a set of coupled equations for the steadystate values ,
in which the loss terms given by the damping γ and the gain terms given by the timeaveraged reward are in equilibrium. This set of nonlinear equations can be solved numerically and yields a very good approximation for the ensemble average as seen in Fig. 3. The asymptotic value obtained in this approximation only depends on the ratio λ/γ.
Time Average vs. Ensemble Average
The fluctuations of the probabilities to choose certain actions (see insets in Fig. 3) show that even after 1000 iterations of the update rule (6) not all agents have converged to a single state (Fig. 4). Many steady states occur if there is a whole manifold that is rewarded equally, that is, when two or more actions have the same expected reward. For example, for φ = π/4 both actions and have equal chances of being rewarded and thus the there is no preference of either action as long as one of them is carried out. Actions that have the same expected reward span a subspace for which the sum of the probabilities for doing these actions is approximately constant in the steady state, however, their relative ratio is not. The states of the whole ensemble of agents fills this degenerate subspace of action probabilities. The ensemble average yields an approximation of φ.
Although the ensemble has learned, i.e., the success probability has converged, the dynamics of each individual is not necessarily converged to a single state where it remains. In the course of time the state of a single agent explores the whole degenerate reward manifold while keeping the success probability constant as we numerically illustrate in Fig. 5. We find that the time average of a single agent for long times equals the ensemble average because the state of the single agent assumes all the different steady states that an ensemble produced after short time as in Fig. 4. Hence, to obtain an ensemble average a snapshot after a relatively short time is sufficient. However, if there is a degenerate space in the reward scheme, the state of a single agent at a fixed time gives only an imprecise estimate of φ and even its time average does when considered only for a short time. A larger damping parameter γ and higher rewards λ facilitate a faster exploration of the degenerate reward manifold and thus provide a better time average for a single agent for shorter times. For φ = π/4 in Fig. 5(right), the agent selects either α = 0 or π/2 for an extended time and then suddenly switches between these equally rewarded choices. This jumping behavior occurs for large reward and damping, whereas for smaller values (left) also equal probabilities occur for longer durations.
Comparison to State Tomography and Bayesian Analysis
The way that the agent uses the rewards to change its actions to do measurements more often along angles that are close to φ, is a way of representing information about φ. We regard this probability distribution of actions along the discrete set of angles as a probability distribution of φ^{30,31} and compare it to standard computational analysis procedures employed in state and parameter estimation. By its actions and the returned rewards the agent effectively samples the reward distribution p(+1φ, α). The same data, namely the measurement direction and outcome, however, can also be used in a Bayesian update rule to explicitly build a probability distribution p(φ), or the data can be used to reconstruct the state φ〉 via state tomography. We compare the angular distribution of actions that the agent maintains to the angular distribution that a Bayesian update would produce and also to a simple state tomography by estimating expectation values from the same measurement data.
A simple form of state tomography can be done by calculating the expectation values and from the measurement results of the four projective measurements. Together with the initial assumption , these expectation values give an approximation of the state’s Bloch vector. Our four measurement directions α = 0, π/2, π, 3π/2 give the same measurements as the Pauli matrices with expectation values
where expectation values of the observables can be related to those of the projectors by
For a total of measurements, of which M_{α} are done in direction α, with individual measurement outcomes r_{α,m}= ±1 for observable Ô_{α}, the expectation values can be approximated with the mean
The resulting Bloch vector with coordinates provides an angle with the xaxis and thereby an estimate of φ.
In a Bayesian analysis, we update an initially flat prior distribution p(φ)= 1/(2π) with the information obtained from each measurement. After each measurement, the distribution is updated with result r_{m}∈{−1,+1} for measurement in direction α_{m}, e.g., for the first update
where we include the knowledge of quantum mechanics and the statistics of measurement outcomes for the underlying system with p(r_{1}φ) given by (3). After M measurements the resulting probability distribution is
with normalization For an efficient update and a compact representation of the conditioned probability distribution we expand it in a Fourier series, which has at most M higher harmonics and construct an recursive update rule for the expansion coefficients following the approach in Ref. 8 for parameter estimation with a single fixed observable but variable time delays. For our choice of measurement directions, with α being a multiple of π/2, the Fourier expansion generally contains sin and cos terms. The recursive update rules for the expansion coefficients are given in the Methods section.
To compare the estimates of φ by these three approaches, we fix the angle φ and let 10 agents with a projective simulator do 1500 measurements each and according to the dynamics arising from using the projective simulator. After these 1500 measurement each agent has built a probability distribution of actions p_{α}, we take the mean of each distribution as the estimate of φ. Fig. 6 shows these estimates as the blue data points. Because the probability distributions for φ that we obtain from the p_{α} have support only on 4 angles, which are uniformly and discretely spaced on the circle, each distribution has a large variance. We average the distributions of all 10 agents and give the mean and circular standard deviation of the resulting distribution as the blue error bar in Fig. 6 for comparison. Clearly, when φ is close to one of the possible choices of α, the projective simulator captures φ accurately, but for values of φ = π/8 or π/4 the estimates are biased towards one of the α as in Fig. 4. For φ = π/4 the angular means are widely spread and their distribution has a large variance, which is reminiscent of the distributions given in the insets in Fig. 3 and the distribution of means of a large ensemble in Fig. 4.
The estimates for φ obtained from the expectation values of the Pauli matrices, i.e., the simple state tomography, are calculated from the same measurement record for each agent and are given by the orange data points in Fig. 6. For the Bayesian update scheme, we construct the conditional probability distributions for φ, again from the same measurement record that the projective simulator generated. All of the resulting distributions assume an approximate Gaussian shape with a narrow peak (σ ≈ π/100). The means are given as red data points in Fig. 6. Both approaches can estimate φ correctly within the error bars. Surprisingly, for φ along one of the α, the estimates from the expectation values spread more than in the other two approaches. The reason is that in these cases the projective simulator samples most of measurements along a single direction and only few for the other observable, which causes a rather large uncertainty in one of the coordinates.
Although state tomography and Bayesian estimation perform generally equally good or better than the projective simulator, the big conceptual difference between these approaches is that very little knowledge of quantum physics and measurement statistics is build into the projective simulator. The projective simulator does not assume that the rewards originate from measurement probabilities of a quantum state and, therefore, it is “model free”. The update rule causes a learning dynamics that drive the agent to measure more often into directions that give a +1 measurement outcome and thereby implicitly align measurement directions with φ. Even when no optimal measurement direction is available the agent learns how to deal with a system such that reward is most likely to occur. In principle, it could even adapt to artificial situations, where measurements along the xaxis always give a +1 outcome and measurements along y always give the outcome −1, something which cannot be explained by measuring a qubit in a defined fixed state. Therefore, it is not surprising that methods that make use of additional information, namely measurement probabilities predicted by quantum physics, can extract more information about φ from the measurement results. Given that an agent with a projective simulator lacks this additional information it does comparably well, and, conceivably, it can be improved by changing the update rule to incorporate more knowledge about the underlying quantum physics. For example, a positive measurement result and reward in one direction can be combined with a negative reward into the opposite direction, or, for each measurement result the reward is distributed according to how close all potential actions are to the rewarded one.
Adapting to Changing Fields and Improving Resolution
An important feature of the projective simulator is its ability to forget and thus to adapt to a changed situation. This ability distinguishes the present setting from schemes of parameter estimation, where the unknown parameter is assumed to be constant. For example, for a changing parameter standard Bayesian updating cannot be applied because past information needs to be disregarded and only recent information should be considered for estimating the current parameter. The projective simulator, in contrast, keeps track of an integrated average of past rewards for each action and is endowed with an element, the damping quantified by γ, to forget these rewards. The agent has the ability to completely change its behavior regardless of what has been rewarded earlier and irrespective of its earlier state. We shall consider two of such relearning scenarios in the following. We analyze the relearning by means of two quantities, the asymptotic success probability and the learning time it takes the agent to adapt.
Relearning After a Switched Field
Changes of B result in a different φ and require the agent to adapt and relearn. For a single sudden change in B the angle φ changes only once at a certain time to a new angle φ′. Depending on the values of φ and φ′ the agents shows a rich landscape of relearning patterns as illustrated in Fig. 7.
After the switch the asymptotic success probability is always that of the new φ′ and may lie above or below the success probability of the old φ (Fig. 7 left). The change in is illustrated in Fig. 7 (right top). The sudden drop or increase in success right after the change of the angle and the time to reach a success probability depends strongly on the relation of the two angles and how much of the internal state (hvalues) needs to be changed to reach the new state. These effects in the relearning time appear in addition to the known effects of changing the reward scaling λ and damping rate γ.^{1,2} A summary of the relearning times and change in asymptotic efficiencies is given in Fig. 7 (right).
Timedependent Fields
An important feature for applications is the agent’s ability to adapt its actions to slowly changing external fields. An agent’s state is the result of a dynamical equilibrium between rewarded actions in the past and forgetting this information on a time scale given by γ. Therefore, the speed at which an agent can adapt is limited by the speed with which it can modify its internal state. The agent can adapt to a change in the reward landscape caused by a changing field as long as it has enough time to sample the modified reward landscape and modify its internal state accordingly, which depends on λ and the timescale given by γ.
Figure 8 shows two examples. The first example (left) is a setting with a fast oscillating field, i.e., one with φ(n) ~ cos(ωn) as a function of the measurement round n, where only the time average is learned because the agent effectively takes samples from the entire reward landscape. The state vector converges to angle φ = 0 and reaches almost unit length. The second example (right) shows a setting with a linearly increasing magnetic field, giving rise to φ(n) ~ n. As φ moves anticlockwise on the unit circle as a function of the measurement round, the agent can keep up as quantified by with a state trajectory that also moves counterclockwise albeit with a slight delay and the length of the state vector is longer, i.e., the field is learned better, for a slower rate of change.
Initial Choice of Measurement Directions and Composition
The choice of projectors that the agent can measure affects the agent’s success in two ways: On one hand it fixes the available angles and thereby ability to measure the correct angle. A finer grained sample of measurement angles is beneficial because it will contain an angle that is closer to the actual angle and allow for almost perfect measurements. It also avoids efficiency minima due to the coarsegraining as they appear in Fig. 3 (top, right). A finer resolution of measurement angles, however, will introduce many angles that are almost equally successful and are hard to distinguish by their average reward. On the other hand, the choice of measurement angles fixes the discrete support on which the probability distribution for φ can be built, which contains the information on the angle φ. A drawback of a coarsegrained support is the arising large variance in the distribution. A finegrained support, however, needs lots of sampling to evaluate each individual point in the distribution.
As there are advantages and disadvantages to the number of measurement directions, we ask if there is an optimal number of fixed projectors. In order to distinguish directions on a circle, at least three directions are needed. (For just two directions that are equally successful, it is not possible to decide which of the two angles between those two directions is the correct one.) We have calculated the asymptotic success probability for an agent that has access to 2N_{α} measurement angles that are equidistantly spaced on the unit circle, i.e., it can measure N_{α} observables of the form (2) with two eigenstates each that are opposite on the equator of the Bloch sphere. The angular dependence of the success probability in Fig. 9 (inset) shows maxima at angles that are measurement directions and minima between two neighboring angles. As more observables are added, the worst cases in between two neighboring projectors improve, but at the same time also the optimal cases decrease because the optimal angle is not chosen as frequently due to slightly better neighboring angles that are also rewarded more often. The success probability averaged over all angles first increases and then decreases as summarized in Fig. 9. In the limit of large N_{α} the success probability converges to 50% because of the constant damping γ. In the example case with λ = 1 and γ = 1/100 we can give these recommendations: For optimizing the best case, the number of 2 projectors is optimal, for the best worst case success probability, 8 projectors are best and for the best average success probability 6 projectors are the best initial choice.
A strategy to mitigate the decrease in overall efficiency for a more refined angular resolution is composition, which is one of the original features of projective simulation^{1,2}. With composition the projective simulator is endowed with the ability to generate new clips based on the composition of already existing ones. For parameter estimation the projective simulator can insert new clips with new measurement directions only where additional resolution is needed.
The composition mechanism is an additional dynamical element in the projective simulator. Based on the state of the projective simulator it is triggered and inserts a new clip based on existing ones. These new elements, i.e., the trigger mechanism, constructing the new clip and how the new clip is inserted into the network must be specified and leave room for arbitrarily complicated rules. We will restrict to the simplest mechanisms, which will also draw some intuition from actual conceivable physical dynamics.
Bisecting Composition
The first composition mechanism simply operates by bisection and refining the resolution in the relevant regions. After the agent has learned with its initial set of projectors, the two actions clips with the largest hvalues are selected and used to compose a new clip between the two. In situations with angles φ = π/4 or π/8 the action clips with α = 0 and π/2 will have the largest hvalues and give rise to the creation of a new clip with α = π/4, which improves the resolution of the discretization in the first quadrant. The success probability before and after one such composition is depicted in the inset in Fig. 9 as light blue and red curve, respectively. For the angle φ = π/4 the success probability is increased from a minimum of 83.4% to a maximum with 96.2% without adding unnecessary projectors in the remaining quadrants, which would lower to 93.2% of the curve with 8 projectors. When always adding a single additional angle in the middle of the quadrant in which φ lies, worst case scenarios for appear only for angles like π/8 and 3π/8 with 93.32%, which still is a slight improvement over the coarse graining with only 4 projectors (93.26%). For φ = π/8 the composition at α = π/4 is helpful but suboptimal. For angles at the projectors, e.g. φ = 0, an additional composition is harmful and decreases from 97.1% to 96.1%. A single composition that doubles the angular resolution in one quadrant is qualitatively similar to 8 initial measurement angles, but with a higher success probability. A second composition step that adds another projector with an angle of odd multiples of π/8, effectively reproduces the resolution of 16 initial angles but only in one octant of the unit circle. It improves the worst cases at the cost of a slightly reduced overall success probability. Even more bisections will further increase the angular resolution but reduce the overall success to the point that they are counterproductive. Although a bisecting composition is very simple approach, it provides the advantage of a larger number in initial projectors while avoiding a large penalty in overall efficiency due to a large action space with the same parameters.
Composition with the Glow Mechanism
The second mechanism departs from the strict bisection strategy of the first mechanism. The agent reaches an optimal success probability if it can measure along the direction α = φ. The bisection strategy only approximates φ and sometimes introduces unnecessarily many angles, e.g., for φ = π/8 the additional angle α = π/4 has to be built first. We overcome this disadvantage by a better use of the information provided by the measurement results to estimate which new projector angle should be inserted as addition action. We employ a variant of the “edge glow mechanism”^{2} to compose a single new action clip in the following way. We assign a second degree of freedom to each edge called “glow” and denote it by g_{α}. Instead of updating the hvalues with the reward according to (6), we first accumulate rewards in the g_{α} according to the following update rule:
with initial values g_{α}(0) = 0. The change in the update rule for h effectively amounts to setting λ = 0 and γ = 1. The behavior of the agent remains unchanged as the hvalues remain at their initial values h_{α}(0)= 1, i.e., the agent measures equally often in all available directions. However, since there is no bias in the frequency of available measurement direction, the accumulated rewards in the respective g_{α} provide a measure of the average reward for each direction. Once the agent sampled enough measurement results, e.g., when the first g_{α} surpasses the threshold g_{thresh}= 500, a new action clip is composed and inserted into the projective simulator. The new measurement direction is composed from all α and weighted by the g_{α}:
and we set the new .
In order to prevent that a direction is inserted that is already present, the agent first checks that is sufficiently different from all already existing α, e.g., by inserting only if it differs from α by more than 1/10 circular standard deviations of the circular distribution given by the g_{α}. If is too close to one α, the h_{α} of this α is instead strengthened and set equal to the sum of all g_{α} and no new clip is inserted. After the composition, we continue with the usual update rule for the hvalues.
The learning curves for the this form of glow composition are shown in Fig. 10. Starting with 4 angles and a g_{thresh}= 500, at least 2000 measurements have to be done on average before the first composition can occur. This threshold can be decreased, leading to a faster composition, albeit at worse statistics, which result in inaccuracies of the composed angles. Inaccurate compositions, however, impact the success probability only to a small extent because it decreases with the cosine of the angular difference between φ and the composed angle.
In the direction of an existing angle, e.g., φ = 0 or π/2, the first amplifications of the respective α occurs starting with 2000 measurements. For the direction φ = π/4, more measurement need to be done on average to reach g_{0} or g_{π/2}= 500 because these direction are not rewarded with certainty and composition occurs on average later, with φ = π/4 being one of the four latest instances. After the composition the success probability jumps from 50% to about 99%. Since the newly set hvalue for the best measurement direction is larger than the steadystate value for our choice of λ = 1 and γ = 1/100, the success probability decreases slightly to approach from above.
In an ensemble of 1000 agents only 4 compose an angle when φ = 0 or π/2, whereas all do a composition for φ = π/8 or π/4. In our numerical experiment, the distribution of composed angles is sharply peaked around φ with a σ ≈ π/100.
By using the glow mechanism to obtain an effective average reward for each measurement direction and then composing a mean angle from the reward distribution, the agent effectively creates a weighted sum of directions. It thereby embodies a method similar to the estimation of expectation values done in state tomography.
Adapting multiple measurement directions
So far we have demonstrated how an agent equipped with a suitable projective simulator can align a single measurement direction, e.g., for the state +〉, with an initially unknown state φ〉, which emerged from +〉 due to a magnetic field. Since one of the aims is to employ the agent as a means to carry out measurementbased quantum computation (MBQC)^{22} in an unknown external field, all measurement directions that are required to run a specific algorithm in MBQC need to be adapted to this unknown stray field. We therefore need to extend the projective simulator to learn several measurement directions, which shall be given as the respective input. Ultimately, the agent would translate the measurement directions necessary for the algorithm to the reference frame that rotates due to the magnetic field.
We modify the inital agent setup depicted in Fig. 1 in the following way. The step that prepares the defined initial state +〉 is removed and the qubit is simply left in the state that is prepared by the previous measurement. The projective simulator now receives as an input not just a trigger event, which activated the *clip, but now it receives the previous measurement direction and the obtained measurement result as a percept. The initial state and percept can be chosen arbitrarily, e.g., at random, as they do not matter in the subsequent feedback loop.
For the new scheme, we also extend the clip network of the projective simulator to 8 percept clips, which represent all combinations of previous measurement direction and obtained reward, as depicted in Fig. 11. Effectively, the extended clip network consists of 8 copies of the previous simple clip network, which are activated according to the actions and results of the previous time step. The agent enters a feedback cycle, where measured directions and outcomes are fed back to the agent. The information about which state preparation method was used is available to the agent as percept and thereby it indirectly receives a hint about which state has been prepared. Given each prepared state, which then evolves to acquire an additional shift in the angle by φ, the agent learns which measurement direction most likely matches this rotated initial state. To give an example, consider the test qubit in the initial state +〉 ≡α=0〉, which evolves into φ〉. The agent measures this state, say along α = π/2 and obtains result 1. It thereby prepares the test qubit in state π/2〉, which again evolves for time Δt into π/2+φ〉 for the next measurement. This next measurement is chosen according to the hvalues of edges originating from the percept clip “π/2, 1” to each of the four actions.
The clip network is now much larger than before and the agent needs more measurements to update all the connections until the hvalues converge into those of the steady state. Naively, we can expect an 8fold increase, however, since the agent converges to a state in which measurements that give outcomes l are preferred, learning the right measurements for a outcome0 preparation is delayed. This learning behavior is shown in Fig. 12, where outcomel preparations converge early and outcome0 preparations later, which in turn also delays the overall convergence. Naturally, the training of the whole network is faster in situations where the 0 outcomes occur more often, e.g., for φ = π/4, or in situations that lead to different measurement directions, e.g., φ = π/2.
As the agent encounters situations with different percepts, the number of time steps in between two successive activations of the same percept is now increased on average. This leads to a qualitative and quantitative change in the learning curves as compared to the previous simple agent with only one percept. The number of times that the damping reduces the hvalue of each edge would increase and lead to a reduced efficiency, because the agent forgets too quickly in between rewards. To maintain high hvalues for rewarded transitions we could adjust γ to a lower value, but we choose to simply restrict the application of the update rule and the application of the damping in particular, to a subgraph of the clip network, namely, only those edges that are connected to the activated percept clip. Thereby we maintain the quantitative behavior of the simple clip network used in the previous sections.
Percepts give the preparation procedure of the test qubit and thereby effectively encode information about which state has been prepared. A closer inspection reveals that each state is represented twice because is can be prepared in two ways, e.g., +〉 can be prepared by a measurement of with outcome 1 or by with outcome 0. Preparation procedures that result in the same prepared states are highlighted with the same color of the percept clip in Figs. 11 and 12. This redundancy increases the learning times because the same behavior has to be learned twice. The clip network could be optimized with an additional intermediate layer that first maps preparation methods to states, which may be learned first without a stray field and then the prepared states to best measurement directions in a stray field.
Once the agent has adapted its measurement directions to the unknown external field with a test qubit, it can be used as a translator between intended measurement directions and their corresponding directions in the rotated reference frame. This application of a trained agent works as follows. After a training period, we fix all the hvalues. Instead on the test qubit, the agent now acts on the qubit that needs to be measured along a certain direction according to a MBQC scheme, for example. We then excite a percept of the agent that corresponds to the direction of the intended measurement direction in zero field. The agent then chooses most likely the measurement direction that corresponds to this measurement in the rotated frame, i.e., the measurement that takes the field into account.
Measurementbased Grover Algorithm
We first briefly repeat the MBQC variant of the Grover search algorithm for a database with 4 elements^{25} and adapt it to our notation and use of projective measurements . The initial resource state is a cluster state of 4 qubits in ring form, i.e., starting from the state +〉^{⊗}^{4} we apply a controlled phase gate between the qubit pairs 12, 23, 34, 41 and obtain
A database with 4 entries (i.e., with elements 00, 01, 10 and 11) only requires a single Grover step to find the marked element. The algorithm starts by doing this one necessary query to the database and thereby marks the database entry that is to be found. A measurement of the projectors or on qubits 1 and 4 realizes the specific database, where each pair of measurement directions 00, 0π, π0 and ππ corresponds to marking the database element 00, 01, 10 and 11, respectively. For each of the two measurements of or both measurement results r_{1,4}= 0 or 1 appear with probability 1/2. Therefore, the results alone do not allow us to infer the measurement directions and thereby the marked element. In the problem setting of the algorithm the choice of measurement directions is hidden. Only from the measurement results of qubits 1 and 4 and from the measurements done on the remaining two qubits, we should infer the marked element. On the remaining qubits we therefore measure the observable , whose measurement outcome depends on the measurement directions on qubits 1 and 4 and is correlated to the previous two outcomes. Finally, the calculation of (r_{1}r_{3}, r_{2}r_{4}), i.e., addition of the measurement outcomes modulo 2, reveals the two bits of the marked element with certainty. Although, at the present point the MBQC version of Grover’s algorithm appears to merely uncover (anti)correlations between measurement directions, there is an explicit mapping between the quantum circuit of Gover’s algorithm on one hand and the circuit for creating and measuring the cluster state on the other^{25}.
If the initial state is placed in an unknown external field pointing along the zdirection, the state Ψ_{0}〉 is transformed into U⊗U⊗U⊗UΨ_{0}〉 with the local unitary rotations . If we recall that U+〉= φ〉 it is straightforward to see that the measurement protocol of the Grover algorithm will no longer give the correct marked element because the external field effectively shifts the measurement directions by the angle −φ with respect to the original measurement directions. As a result the probability to identify the correct marked element, i.e., the success probability of the algorithm, is periodically modulated by φ and a straightforward calculation gives
For φ being a multiple of π the algorithm works perfectly because the local rotations align the qubits’ reference frames again with the xaxis. The Grover algorithm is invariant under the inversion of all measurement directions (i.e., changing all directions 0 to π and vice versa). In the worst case, for φ being an odd multiple of π/2, the chance of identifying the right element is 1/4, as the measurements of the intended Grover search actually reveal no useful information because they are unbiased with respect to the required measurement direction. In Fig. 13, the analytic results for identifying the correct marked element 00 in the rotated state match the trials with 1000 agents that simply measure all 4 qubits the direction α = 0 and then try to identify the marked element from the obtained measurement results.
For testing the agent with a projective simulator we restrict to a realization with a single marked element, namely 00, which can be implemented with measurements along the xaxis, all in the direction α = 0. The agent first learns with a test qubit exposed to the external field and adapts to the field strength. We then fix these obtained hvalues and use the agent without update rule to carry out the four measurements on the cluster state, one after the other, according to its available measurement directions and internal probabilities.
The first example is an agent that has only 4 fixed measurement directions available (α being a multiple of π/2), which we first train to achieve optimal success probability with the test qubit. The optimal performance is reached in the limit γ→0, which amounts to p_{α}= 1 for the single α that is closest to φ and all others zero. The light blue curve in Fig. 13 shows the fraction of the agent ensemble that identifies the marked element with this projective simulator correctly. The Grover search is recovered perfectly for fields with φ being a multiple of π/2, which can be matched exactly by the available measurement directions.
The second example is an agent that first learns with a test qubit in the external field with composition according to the glow mechanism. That is, after 2000 measurements on average, the agent composes a new measurement direction or strengthens an existing one that matches φ. The hvalues after the composition remain fixed and the agent measures the cluster state according to the available measurement directions and probabilities. The dark blue curve in Fig. 13 illustrates that an ensemble of this kind of agent is highly successful in doing the Grover search for all angles φ. The shortfall from a perfect performance (the average success probability is 99.0% with a standard deviation of 0.3%) originates in the slight deviations of the composed angle from φ and the nonzero probability to chose the remaining nonoptimal measurement directions.
Discussion
We presented an autonomous adaptive system that is able to perform quantum information processing in changing environments. The controller is a learning agent endowed with a projective simulator that adapts measurement directions in a setup of measurementbased quantum computation by reinforcement learning. Our approach thus combines elements from embodied artificial intelligence with the purpose of carrying out robust quantum information processing.
In an exemplary setup of adapting measurement directions to an unknown stray magnetic field in a fixed direction, we have characterized the learning process of the projective simulator and its adaption to timevarying fields using numerical studies. We found that an agent using projective simulation is able to adapt to such unknown stray fields. We provide analytical estimates of its success probability in limiting cases of the nonlinear learning process. In our scenario the agent may adapt the measurement direction by drawing from a initially provided set of fixed measurement directions. We have characterized the performance of the agent for different sets of available measurement directions and we explored composition mechanisms to create new and better measurement directions on the fly, together with the corresponding internal structure in the projective simulator. Strategies with composed measurement directions surpass strategies with fixed sets of directions in both learning speed and resulting efficiency. As a demonstration of adaptive quantum information processing, the agent successfully carries out a measurementbased version of Grover’s search algorithm in the presence of a detrimental unknown external magnetic field.
The present approach can be readily extended and improved in several directions as indicated in the respective sections in the paper. First and foremost, the agent effectively develops and embodies rules to cope and operate with quantum mechanical systems, which are seeded by the specific form the update rule together with the reward scheme and the composition mechanisms. Both of these elements start from simple primitives, e.g. “prefer a specific measurement if it more likely results in a +1 measurement outcome” for the update rule and give rise to a sensible and sufficient behavior in our problem setting. Both can be improved by effectively incorporating more information about the quantum mechanical nature of the underlying problem domain, however, at the expense of more complicated update and composition rules. Errors or imperfect measurements can be straightforwardly incorporated into the present scheme by using POVMs instead of projective measurement, or by adding a classical noise, e.g. bit flips, to the measurement outcomes. Such errors lead to a diluted information about which measurement directions are correct and give +1 measurement outcomes. In the presence of errors, spurious rewards appear for wrong measurement directions and the average reward for correct measurement directions is reduced. Both effectively diminish the contrast in the reward landscape, which is equivalent to a lower reward scaling factor λ. We expect that the agent will still be able to learn in such situations, but it will take longer to do so and reach a lower asymptotic success probability. The latter can partly be recovered by adjusting λ and γ, however, an increase of the learning time over a noiseless scenario will remain.
The longterm goal of this investigation is to develop integrated and autonomous schemes for measurementbased quantum information processing that can adapt to changing environments. In our scheme, learning is not realized by feedback from some external macroscopic sensor, e.g. a magnetometer, but it uses only information drawn from measurements on qubits, which are also the operations that drive the processing of the quantum information. In this sense our approach is related to recent work on intelligent quantum error correction^{32}.
The approach that we have presented in the present paper can be generalized and integrated into a scheme of universal measurementbased quantum computation, where measurements of stabilizer operators of a cluster state are used both for the correction of errors on the resource state and, at the same time, for the adaption of measurement directions that drive the quantum computation. This will be reported elsewhere.
We note that the projective simulator does not assume that rewards originate from measurement probabilities of a quantum state and, therefore, it is “model free”. This also opens the path to study foundational questions such as, to what extent can a machine effectively learn the rules of quantum mechanics through simple reinforcement processes.
Methods
Recursion Relations for Bayesian Updating
The angular probability distribution for φ given the M measurement outcomes r_{m}= ±1 in directions α_{m}, which are multiples of π/2, is given by
with normalization and can be expanded into a Fourier sum
where the normalization is solely contained in the coefficient c_{M}(0). Updating this probability distribution with the next measurement result r_{M+1} amounts to multiplication with the factor [1+r_{M1}cos(φ−α_{M+1})]/2, which we again expand into a Fourier sum. Comparing the coefficients we obtain the following recursion relations for the c_{M+1}(q) and s_{M+1}(q):
where for q > M we set c_{M}(q)= s_{M}(q)= 0. The starting distribution is the flat prior p(φ)= 1/(2π) with c_{0}(0) = 1/π.
The advantage of the Fourier representation is that circular moments of the probability distribution can be straightforwardly calculated:
The first circular mosment R gives rise to the mean angle and the circular standard deviation ^{30,31}.
Additional Information
How to cite this article: Tiersch, M. et al. Adaptive quantum computation in changing environments using projective simulation. Sci. Rep. 5, 12874; doi: 10.1038/srep12874 (2015).
References
Briegel, H. J. & De las Cuevas, G. Projective simulation for artificial intelligence. Sci. Rep. 2, 400 (2012).
Mautner, J., Makmal, A., Manzano, D., Tiersch, M. & Briegel, H. J. Projective simulation for classical learning agents: A comprehensive investigation. New Gen. Comp. 33, 69–114 (2015).
Melnikov, A. A., Makmal, A. & Briegel, H. J. Projective simulation applied to the gridworld and the mountaincar problem. Preprint arXiv:1405.5459 [cs.AI] (2014).
Melnikov, A. A., Makmal, A., Dunjko, V. & Briegel, H. J. Projective simulation with generalization. Preprint arXiv:1504.02247 [cs.AI] (2015).
Hentschel, A. & Sanders, B. C. Machine Learning for Precise Quantum Measurement. Phys. Rev. Lett. 104, 063603 (2010).
Hentschel, A. & Sanders, B. C. Efficient Algorithm for Optimizing Adaptive Quantum Metrology Processes. Phys. Rev. Lett. 107, 233601 (2011).
Lovett, N. B., Crosnier, C., PerarnauLlobet, M. & Sanders, B. C. Differential evolution for manyparticle adaptive quantum metrology. Phys. Rev. Lett. 110, 220501 (2013).
Sergeevich, A., Chandran, A., Combes, J., Bartlett, S. D. & Wiseman, H. M. Characterization of a qubit Hamiltonian using adaptive measurements in a fixed basis. Phys. Rev. A 84, 052315 (2011).
Granade, C. E., Ferrie, C., Wiebe, N. & Cory, D. G. Robust online Hamiltonian learning. New J. Phys. 14, 103013 (2012).
Hayes, A. J. F. & Berry, D. W. Swarm optimization for adaptive phase measurements with low visibility. Phys. Rev. A 89, 013838 (2014).
Sutton, R. S. & Barto, A. G. Reinforcement learning: An introduction 1st edn (MIT Press, 1998).
Russel, S. J. & Norvig, P. Artificial intelligence—A modern approach 2nd edn (Prentice Hall, 2003).
Paparo, G. D., Dunjko, V., Makmal, A., MartinDelgado, M. A. & Briegel, H. J. Quantum speedup for active learning agents. Phys. Rev. X 4, 031002 (2014).
Dunjko, V., Friis, N. & Briegel, H. J. Quantumenhanced deliberation of learning agents using trapped ions. New J. Phys. 17, 023006 (2015).
Wittek, P. Quantum Machine Learning: What Quantum Computing Means to Data Mining. (Academic Press, 2014).
Schuld, M., Sinayskiy, I. & Petruccione, F. An introduction to quantum machine learning. Preprint arXiv:1409.3097 [quantph] (2014).
Grover, L. K. A fast quantum mechanical algorithm for database search in Proceedings of the 28th Annual Symposium on the Theory of Computing, 212–219 (ACM Press, 1996).
Grover, L. K. Quantum mechanics helps in search for a needle in a haystack. Phys. Rev. Lett. 79, 325–328 (1997).
Chuang, I. L., Gershenfeld, N. & Kubinec, M. Experimental implementation of fast quantum searching. Phys. Rev. Lett. 80, 3408–3411 (1998).
Jones, J. A., Mosca, M. & Hansen, R. H. Implementation of a quantum search algorithm on a quantum computer. Nature 393, 344–346 (1998).
Kwiat, P. G., Mitchell, J. R., Schwindt, P. D. D. & White, A. G. Grover’s search algorithm: An optical approach. J. Mod. Opt. 47, 257–266 (2000).
Raussendorf, R. & Briegel, H. J. A oneway quantum computer. Phys. Rev. Lett. 86, 5188–5191 (2001).
Raussendorf, R., Browne, D. E. & Briegel, H. J. Measurementbased quantum computation on cluster states. Phys. Rev. A 68, 022312 (2003).
Briegel, H. J. & Raussendorf, R. Persistent entanglement in arrays of interacting particles, Phys. Rev. Lett. 86, 910–913 (2001).
Walther, P. et al. Experimental oneway quantum computing. Nature 434, 169–176 (2005).
Prevedel R. et al. Highspeed linear optics quantum computing using active feedforward. Nature 445, 65–69 (2007).
Lanyon, B. P. et al. Measurementbased quantum computation with trapped ions. Phys. Rev. Lett. 111, 210501 (2013).
Knill, E., Laflamme, R. & Milburn, G. J. A scheme for efficient quantum computation with linear optics. Nature 409, 46–52 (2001).
Helstrom, C. W. Quantum detection and estimation theory. (Academic Press, 1976).
Fisher, N. I., Lewis, T. & Embleton, B. J. J. Statistical analysis of spherical data. (Cambridge University Press, 1987).
Mardia, K. V. & Jupp, P. E. Directional Statistics. Wiley series in probability and statistics (John Wiley & Sons Ltd, 2000).
Combes, C. et al. Insitu characterization of quantum devices with error correction. Preprint arXiv:1405.5656 [quantph] (2014).
Acknowledgements
We thank Wolfgang Dür for initial discussions on this topic and Vedran Dunjko for comments on the manuscript. We acknowledge support from the Austrian Science Fund (FWF) through the SFB FoQuS: F4012 and the Templeton World Charity Foundation grant TWCF0078/AB46.
Author information
Authors and Affiliations
Contributions
M.T. and H.J.B. designed research. M.T. and E.J.G. performed computations and simulations. M.T. prepared main manuscript text. All authors discussed and reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Rights and permissions
This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
About this article
Cite this article
Tiersch, M., Ganahl, E. & Briegel, H. Adaptive quantum computation in changing environments using projective simulation. Sci Rep 5, 12874 (2015). https://doi.org/10.1038/srep12874
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/srep12874
This article is cited by

How a Minimal Learning Agent can Infer the Existence of Unobserved Variables in a Complex Environment
Minds and Machines (2023)

An Efficient and Novel SemiQuantum Deterministic Secure Quantum Communication Protocol
International Journal of Theoretical Physics (2022)

A projective simulation scheme for partially observable multiagent systems
Quantum Machine Intelligence (2021)

Threeparty semiquantum protocol for deterministic secure quantum dialogue based on GHZ states
Quantum Information Processing (2021)

Controlled Deterministic Secure SemiQuantum Communication
International Journal of Theoretical Physics (2021)
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.