Quantum Logic Gate Synthesis as a Markov Decision Process

Reinforcement learning has witnessed recent applications to a variety of tasks in quantum programming. The underlying assumption is that those tasks could be modeled as Markov Decision Processes (MDPs). Here, we investigate the feasibility of this assumption by exploring its consequences for two fundamental tasks in quantum programming: state preparation and gate compilation. By forming discrete MDPs, focusing exclusively on the single-qubit case (both with and without noise), we solve for the optimal policy exactly through policy iteration. We find optimal paths that correspond to the shortest possible sequence of gates to prepare a state, or compile a gate, up to some target accuracy. As an example, we find sequences of $H$ and $T$ gates with length as small as $11$ producing $\sim 99\%$ fidelity for states of the form $(HT)^{n} |0\rangle$ with values as large as $n=10^{10}$. In the presence of gate noise, we demonstrate how the optimal policy adapts to the effects of noisy gates in order to achieve a higher state fidelity. Our work shows that one can meaningfully impose a discrete, stochastic and Markovian nature to a continuous, deterministic and non-Markovian quantum evolution, and provides theoretical insight into why reinforcement learning may be successfully used to find optimally short gate sequences in quantum programming.


Introduction
Recent years have seen dramatic advances in the field of artificial intelligence [1] and machine learning [2,3]. A long-term goal is to create agents that can carry out complicated tasks in an autonomous manner, relatively free of human input. One of the approaches that has gained popularity in this regard is reinforcement learning. This can be thought of as a rather broad set of techniques that aim to solve some task based on a reward mechanism [4]. Formally, reinforcement learning models the interaction of an agent with its environment as a Markov Decision Process (MDP). In many practical situations, the agent may have limited access to the environment, whose dynamics can be quite complicated. In all such situations, the goal of reinforcement learning is to learn or estimate the optimal policy, which specifies the (conditional) probabilities of performing actions given that the agent finds itself in some particular state. On the other hand, in fairly simple environments such as the textbook grid-world scenario [4], the dynamics can be fairly simple to learn. Moreover, the state and action spaces are finite and small, allowing for simple tabular methods instead of more complicated methods that would, for example, necessitate the use of artificial neural networks [3]. In particular, one can use the dynamic programming method of policy iteration to solve for the optimal policy exactly [5].
(M. Sohaib Alam: sohaib.alam@nasa.gov)
In recent times, reinforcement learning has met with success in a variety of quantum programming tasks, such as error correction [6], combinatorial optimization problems [7], as well as state preparation [8][9][10][11][12] and gate design [13,14] in the context of noisy control. Here, we investigate the question of state preparation and gate compilation in the context of abstract logic gates, and ask whether reinforcement learning could be successfully applied to learn the optimal gate sequences to prepare some given quantum state, or compile a specified quantum gate. Instead of exploring the efficacy of any one particular reinforcement learning method, we investigate whether it is even feasible to model these tasks as MDPs. By discretizing state and action spaces in this context, we circumvent questions and challenges involving convergence rates, reward sparsity, and hyperparameter optimization that typically show up in reinforcement learning scenarios. Instead, the discretization allows us to exactly solve for and study quite explicitly the properties of the optimal policy itself. This allows us to test whether we can recover optimally short programs using reinforcement learning techniques in quantum programming situations where we already have well-established notions of what those optimally short programs, or circuits, should look like.
There have been numerous previous studies of the general problem of quantum compilation, including but not limited to the Solovay-Kitaev algorithm [15], quantum Shannon decomposition [16], approximate compilation [17,18], as well as optimal circuit synthesis [19][20][21]. Here, we aim to show that optimally short circuits can be found by solving discrete MDPs, and that these circuits agree with independently calculated shortest possible gate sequences for the same tasks. Since the initial posting of this work, numerous works have continued to explore the interface between classical reinforcement learning and quantum computing. These include finding optimal parameters in variational quantum circuits [22][23][24], quantum versions of reinforcement learning and related methods [25][26][27][28][29], Bell tests [30], as well as quantum control [31][32][33][34], and state engineering and gate compilation [35][36][37][38][39][40], the subject of this paper.
In such studies, reinforcement learning is employed as an approximate solver of some underlying MDP. This raises the important question of how, and under what conditions, the underlying MDP can be solved exactly, and what quality of solution this yields. Naturally, such MDPs can only be solved exactly for relatively small problem sizes. Our paper explores the answer to this question in the context of single-qubit state preparation and gate compilation, and demonstrates the effects of native gate choice, coordinate representation, discretization, as well as noise.
The organization of this paper is as follows. We first briefly review the formalism of MDPs. We then investigate the problem of single-qubit state preparation using a discretized version of the continuous {RZ, RY} gates, as well as the discrete gateset {I, H, S, T}. We then study this problem in the context of noisy quantum channels. Finally, we consider the application to the problem of single-qubit compilation into the {H, T} gateset, and show, among other things, that learning the MDP can be highly sensitive to the choice of coordinates for the unitaries.

Brief Review of MDPs
Markov Decision Processes (MDPs) provide a convenient framing of problems involving an agent interacting with an environment. At discrete time steps t, an agent receives a representation of the environment's state $s_t \in S$, takes an action $a_t \in A$, and then receives a scalar reward $r_{t+1} \in \mathbb{R}$. The policy of the agent, describing the conditional probability $\pi(a|s)$ of taking action a given the state s, is independent of the environment's state at previous time steps and therefore satisfies the Markov property. The discounted return that an agent receives from the environment after time step t is defined as
$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$,
where $\gamma \in [0, 1)$ is the discount factor. The goal of the agent is then to find the optimal policy $\pi^*(a|s)$ that maximizes the state-value function (henceforth, "value function" for brevity), defined as the expectation value of the discounted return received from starting in state $s_t \in S$ and thereafter following the policy $\pi(a|s)$, and expressed as
$V_\pi(s) = \mathbb{E}_\pi\left[G_t \,\middle|\, s_t = s\right]$.
More formally then, the optimal policy $\pi^*$ satisfies the inequality $V_{\pi^*}(s) \geq V_\pi(s)$ for all $s \in S$ and all policies $\pi$. For finite MDPs, there always exists a deterministic optimal policy, which is not necessarily unique. The value function for the optimal policy is then defined as the optimal value function $V_{\pi^*}$. The value function satisfies a recursive relationship known as the Bellman equation,
$V_\pi(s) = \sum_{a} \pi(a|s) \sum_{s', r} p(s', r \,|\, s, a) \left[r + \gamma V_\pi(s')\right]$,
relating the value of the current state to that of its possible successor states following the policy $\pi$. Note that the conditional probability $p(s', r|s, a)$ of finding state $s'$ and receiving reward r having performed action a in state s specifies the environment dynamics, and also satisfies the Markov property. This equation can be turned into an iterative procedure known as iterative policy evaluation, which converges to the fixed point $V_k = V_\pi$ in the $k \to \infty$ limit, and can be used to obtain the value function corresponding to a given policy $\pi$. In practice, we define convergence as $|V_{k+1} - V_k| < \epsilon$ for some sufficiently small $\epsilon$. Having found the value function, we can then
ask whether the policy that produced this value function can be further improved. To do so, we need the state-action value function $Q_\pi(s, a)$, defined as the expected return from carrying out action a in state s and thereafter following the policy $\pi$, i.e.
$Q_\pi(s, a) = \sum_{s', r} p(s', r \,|\, s, a) \left[r + \gamma V_\pi(s')\right]. \quad (2)$
According to the policy improvement theorem, given deterministic policies $\pi$ and $\pi'$, if $Q_\pi(s, \pi'(s)) \geq V_\pi(s)$ for all $s \in S$, then $V_{\pi'}(s) \geq V_\pi(s)$, where $\pi'(s) = a$ (and in general $\pi'(s) \neq \pi(s)$). In other words, having found the state-value function corresponding to some policy, we can then improve upon that policy by iterating through the action space A while maintaining the next-step state-value functions on the right-hand side of Eq. (2), to find a better policy than the current one (greedy policy improvement).
We can then alternate between policy evaluation and policy improvement in a process known as policy iteration to obtain the optimal policy [4]. Schematically, this process involves evaluating the value function for some given policy up to some small convergence factor, followed by the improvement of the policy that produced this value function. The process terminates when the improved policy stops differing from the policy in the previous iteration. Of course, this procedure to identify the optimal policy for an MDP relies on the finiteness of state and action spaces. As we will see below, by discretizing the space of 1-qubit states (i.e. the surface and interior of the Bloch sphere corresponding to pure and mixed states), as well as identifying a finite gate set, we create an MDP with the goal of state preparation, for which optimal policies in the form of optimal (i.e. shortest) quantum circuits may be found through this method.
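As a concrete illustration, tabular policy iteration of the kind used here can be sketched in a few lines of Python. This is a minimal sketch with our own naming conventions (transition tensor `P[a, s, s']`, reward `R[s']` granted upon entering state `s'`), not the paper's exact implementation:

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.8, tol=1e-8):
    """Iteratively evaluate V_pi for a deterministic policy.
    P[a, s, s'] are transition probabilities; R[s'] is the reward
    received upon entering state s'."""
    n_states = P.shape[1]
    V = np.zeros(n_states)
    while True:
        V_new = np.array([P[policy[s], s] @ (R + gamma * V)
                          for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_iteration(P, R, gamma=0.8):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        V = policy_evaluation(P, R, policy, gamma)
        # Greedy improvement: pick argmax_a Q(s, a) in every state
        Q = np.einsum('asj,j->sa', P, R + gamma * V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```

On a toy three-state chain with a "stay" and an "advance" action and reward only in the last state, this recovers the expected value $V(t) = (1-\gamma)^{-1} = 5$ for the absorbing target at $\gamma = 0.8$.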
We note that one could view state evolution under unitary operations, or left multiplication of unitaries by other unitaries, as deterministic processes. These could be thought of as trivially forming a Markov Decision Process where the probabilities $p(s'|s, a)$ have a $\delta$-function support on some (point-like) state $s'$. Once we impose discretization, this underlying determinism implies that the dynamics of the discrete states are, strictly speaking, non-Markovian, i.e. the conditional probability of landing in some discrete state $s'$ depends not just on the previous discrete state and action, but also on all the previous states and actions, since the underlying continuous/point-like state evolves deterministically. However, we shall see below that with sufficient care, both the tasks of state preparation and gate compilation can be modeled and solved as MDPs even with discretized state spaces.

Preparation of single-qubit states
In this section, we will discuss the preparation of single-qubit states as an MDP. In particular, we will focus on preparing a discrete version of the $|1\rangle$ state. We will do so using two different gate sets, a discretized version of the continuous RZ and RY gates, and the set of naturally discrete gates I, H, S and T, and describe probabilistic shuffling within discrete states to arrive at optimal quantum programs via optimal policies. We will also consider states of the form $(HT)^n|0\rangle$.

State and Action Spaces
We apply a fairly simple scheme for the discretization of the space of pure 1-qubit states. As is well known, this space has a one-to-one correspondence with points on a 2-sphere, commonly known as the Bloch sphere. With $\theta \in [0, \pi]$ denoting the polar angle and $\phi \in [0, 2\pi)$ denoting the azimuthal angle, an arbitrary pure 1-qubit state can be represented as
$|\psi\rangle = \cos(\theta/2)\,|0\rangle + e^{i\phi}\sin(\theta/2)\,|1\rangle$.
The discretization we adopt here is as follows. First, we fix some small number $\epsilon = \pi/k$ for some positive integer k. Next, we identify polar caps around the north ($\theta = 0$) and south ($\theta = \pi$) poles. The northern polar cap is identified as the set of all 1-qubit (pure) states for which $\theta < \epsilon$, regardless of the value of $\phi$. Similarly, the southern polar cap is identified as the set of all 1-qubit (pure) states for which $\theta > \pi - \epsilon$, independent of $\phi$. Apart from these special regions, the set of points $n\epsilon \leq \theta \leq (n+1)\epsilon$ and $m\epsilon \leq \phi \leq (m+1)\epsilon$ for integers $1 \leq n \leq k-2$ and $0 \leq m \leq 2k-1$ are identified as the same region. The polar caps thus correspond to $n = 0$ and $n = k-1$, respectively. We identify every region (n, m) as a "state" in the MDP. As a result of this identification, elements of the space of 1-qubit pure states are mapped onto a discrete set of states such that the 1-qubit states can now only be identified up to some threshold fidelity. For instance, the $|0\rangle$ state is identified with the northern polar cap with fidelity $\cos^2\left(\frac{\pi}{2k}\right)$. Similarly, the $|1\rangle$ state is identified with the southern polar cap with fidelity $\sin^2\left(\frac{(k-1)\pi}{2k}\right) = \cos^2\left(\frac{\pi}{2k}\right)$. In other words, if we were to try and obtain these states using this scheme, we would only be able to obtain them up to these fidelities.
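The mapping from Bloch angles to a discrete MDP state (n, m) described above can be sketched as follows (function and variable names are ours, not the paper's):

```python
import numpy as np

def discretize(theta, phi, k=16):
    """Map a pure state's Bloch angles to a discrete MDP state (n, m).
    eps = pi/k sets the angular resolution; the polar caps n = 0 and
    n = k - 1 absorb all values of phi (we store m = 0 there)."""
    eps = np.pi / k
    n = min(int(theta // eps), k - 1)
    if n in (0, k - 1):            # polar caps: phi is irrelevant
        return (n, 0)
    m = int(phi // eps) % (2 * k)  # 2k azimuthal bins of width eps
    return (n, m)
```

For example, any state with $\theta < \pi/16$ lands in the northern cap (0, 0) regardless of $\phi$, consistent with identifying $|0\rangle$ only up to fidelity $\cos^2(\pi/2k)$.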
Having identified a finite state space S composed of discrete regions of the Bloch sphere, we next identify single-qubit unitary operations, or gates, as the action space A. There are some natural single-qubit gate sets that are already discrete, such as {H, T}. Others, such as the continuous rotation gates {RZ, RY}, require discretization similar to that of the continuous state space of the Bloch sphere. We discretize the continuous gates $RZ(\beta)$ and $RY(\gamma)$ by discretizing the angles $\beta, \gamma \in [0, 2\pi]$. The resolution $\delta = \pi/l$ must be sufficiently smaller than that of the state space, $\epsilon = \pi/k$, so that all states $s \in S$ are accessible from all others via the discretized gateset $a \in A$. In practice, a ratio of $\epsilon/\delta \sim O(10)$ is usually sufficient, although the larger this ratio, the better the optimal circuits we would find.
Without loss of generality, and for illustrative purposes, we identify the discrete state corresponding to the $|1\rangle$ state (hereafter referred to as the "discrete $|1\rangle$ state") as the target state of our MDP. To prepare the $|1\rangle$ state starting from any pure 1-qubit state using the gates RZ and RY, it is well known that we require at most a single RZ rotation followed by a single RY rotation. For states lying along the great circle through the x and z axes, we need only a single RY rotation. As a test of this discretized procedure, we investigate whether solving this MDP reproduces such optimally short gate sequences. We also consider the gateset {I, H, T} below, where we include the identity gate to allow the goal state to "do nothing" and remain where it is. For simplicity and illustrative purposes, we also include the $S = T^2$ gate in the case of single-qubit state preparation.

Reward Structure and Environment Dynamics
An obvious guess for a reward would be the fidelity $|\langle\phi|\psi\rangle|^2$ between the target state $|\psi\rangle$ and the prepared state $|\phi\rangle$. However, here we consider an even simpler reward structure, assigning +1 to the target state and 0 to all other states. This allows us to directly relate the length of optimal programs to the value function corresponding to the optimal policy, as we show below.
To finish our specification of the MDP, we also estimate the environment dynamics $p(s', r|s, a)$. Since our reward structure assigns a unique reward r to every state $s \in S$, these conditional probabilities reduce to simply $p(s'|s, a)$. The discretization of the Bloch sphere implies that the action of a quantum gate a on a discrete state $s = (n, m)$ maps this state to other states $s' = (n', m')$ according to a transition probability distribution $p(s'|s, a)$. This non-determinism of the effect of the actions occurs because the discrete states are themselves composed of entire families of continuous quantum states, which are mapped deterministically to other continuous quantum states. However, continuous states from the same discrete state region can land in different discrete final-state regions. A simple way to estimate these probabilities is to uniformly sample points on the 2-sphere, determine which discrete state they land in, then perform each of the actions to determine the state resulting from this action. We sample uniformly across the Bloch sphere by sampling $u, v \sim U[0, 1]$, then setting $\theta = \cos^{-1}(2u - 1)$ and $\phi = 2\pi v$. Although other means of estimating these probabilities exist, we find that this simple method works well in practice for the particular problem of single-qubit state preparation.
Figure 1: Optimal values for various states on the Bloch sphere using the discrete RZ and RY gates, with a discount factor $\gamma = 0.8$. The color of a state corresponds to its optimal value function $V_{\pi^*}$, where lighter colors indicate a larger value. The states colored in green are exactly those whose optimal circuits to prepare the discrete $|1\rangle$ state consist of a single RY rotation, while those in blue are exactly the ones whose optimal circuits consist of an RZ rotation followed by an RY rotation.
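The Monte-Carlo estimate of $p(s'|s, a)$ just described can be sketched as follows. This is a simplified illustration with our own naming; gates are passed as 2x2 unitary matrices:

```python
import numpy as np
from collections import Counter, defaultdict

def estimate_dynamics(gates, k=16, n_samples=20000, seed=0):
    """Estimate p(s' | s, a) by uniformly sampling pure states on the
    Bloch sphere, bucketing them into discrete states (n, m), applying
    each gate, and bucketing the result."""
    rng = np.random.default_rng(seed)
    eps = np.pi / k

    def bucket(psi):
        # Recover Bloch angles from amplitudes (global phase fixed so
        # that the |0> amplitude is real and non-negative).
        theta = 2 * np.arccos(min(1.0, abs(psi[0])))
        phi = (np.angle(psi[1]) - np.angle(psi[0])) % (2 * np.pi)
        n = min(int(theta // eps), k - 1)
        return (n, 0) if n in (0, k - 1) else (n, int(phi // eps) % (2 * k))

    counts = defaultdict(Counter)
    for _ in range(n_samples):
        u, v = rng.random(2)
        theta, phi = np.arccos(2 * u - 1), 2 * np.pi * v   # uniform on S^2
        psi = np.array([np.cos(theta / 2), np.exp(1j * phi) * np.sin(theta / 2)])
        for a, U in enumerate(gates):
            counts[(bucket(psi), a)][bucket(U @ psi)] += 1

    # Normalize counts into probability distributions p(s' | s, a)
    return {key: {s2: c / sum(ctr.values()) for s2, c in ctr.items()}
            for key, ctr in counts.items()}
```

A quick sanity check: with the identity gate as the only action, every discrete state should map to itself with probability one.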
Note that for the target state $|1\rangle$, the optimal policy is to just apply the identity, i.e. RZ(0) or RY(0). This action will keep this state in the target state, while yielding +1 reward at every time step. This yields an infinite series $V(t) = \sum_{k=0}^{\infty} \gamma^k$, where $V(s) := V_{\pi^*}(s)$ and t is the target state, which we can trivially sum to obtain $(1-\gamma)^{-1}$. This is the highest value of any state on the discretized Bloch sphere. For $\gamma = 0.8$, we obtain $V(t) = 5.0$. For some generic state $s \in S$, we can show that with our reward structure, the optimal value function is given by
$V(s) = \sum_{k=1}^{\infty} \gamma^{k-1} \left(P^k\right)_{t,s}, \quad (4)$
where the elements of the matrix P are given by $P_{s',s} = p(s'|s, \pi^*(s))$. From Eq. (4), it immediately follows that $V(s) \leq V(t)$ for all $s \in S$. The Markov chain produced by the optimal policy has an absorbing state given by the target state, and for some large enough number of steps, all (discrete) states land in this absorbing state. Indeed, the smallest K for which the Markovian process converges to a steady state, such that
$\left(P^{K+1}\right)_{s',s} = \left(P^{K}\right)_{s',s} \quad (5)$
for all $s, s' \in S$, provides an upper bound for the length of the gate sequence that leads from any one discrete state s to the target discrete state t. Thus, for the target state itself, $K = 0$. Since $(P^k)_{s',s} \leq 1$, for states $s_1$ that are one gate removed from the target state, we have $V(s_1) \leq V(t)$, and more generally $V(s_{k+1}) \leq V(s_k)$. This intimately relates the length of the optimal program to the optimal value function.
The optimal value landscapes for the two gatesets are shown in Figs. 1 and 2. Note that while in the case of the discretized {RZ, RY} gates we have a distinguished ring of states along the equator around the y-axis that are only a single gate application away from the target state, we have no such continuous patch on the Bloch sphere for the {I, H, S, T} gateset, even though there may be individual (continuous) states that are only a single gate application away from the target state, e.g. $H|1\rangle$ for the target state $|1\rangle$. This shows that states which are nearby on the Bloch sphere need not share similar optimal paths to the target state, given such a gateset.

Optimal State Preparation Sequences
Using policy iteration allows for finding the optimal policy in an MDP. The optimal policy dictates the best action to perform in a given state. We can chain the actions drawn from the optimal policy together to find an optimal sequence of actions, or gates, to reach the target state. In our case, the actions are composed of unitary operations, which deterministically evolve a quantum state (note that we consider noise below, in which case the unitary gates are replaced by non-unitary quantum channels). However, due to the discretization, this is no longer true in our MDP, where the states evolve according to the non-trivial probabilities $p(s'|s, a)$. The optimal policy is learned with respect to these stochastic dynamics, and not with respect to the underlying deterministic dynamics. In other words, we are imposing a Markovian structure on essentially non-Markovian dynamics. Therefore, if we simply start with some specific quantum state, and apply a sequence of actions drawn from the optimal policy of the discrete states that the evolving quantum states belong to, we might not necessarily find ourselves in the target (discrete) state. For instance, the optimal policy in any one discrete state may be to apply the Hadamard gate, and for a subset of quantum states within that discrete state, this may lead to another discrete state for which the optimal policy is again the Hadamard gate. In such a case, the evolution would be stuck in a loop.
To circumvent this issue, in principle one may allow "shuffling" of the quantum states within a particular discrete state before evolving them under the optimal policy. However, this may increase the length of the gate sequence and, moreover, lead to poorer bounds on the fidelity, since in general
$U_N U_s^{(N)} \cdots U_1 U_s^{(1)} |\psi\rangle \neq U_N \cdots U_1 |\psi\rangle$,
where the "shuffling" transformations are given by $U_s^{(i)}: |\psi\rangle \to |\tilde\psi\rangle$ such that $|\psi\rangle \sim |\tilde\psi\rangle$ belong to the same discrete state, while the $U_i$ specify (unitary) actions sampled from the optimal policy. On the other hand, without such "shuffling", the fidelity between the final states of sequences that differ only in their starting states is the same as the fidelity between those starting states, i.e.
$|\langle \psi_i | U^\dagger U | \psi_i' \rangle|^2 = |\langle \psi_i | \psi_i' \rangle|^2$,
where $|\psi_i\rangle$ and $|\psi_i'\rangle$ are two different initial pure states that belong to the same initial discrete state, and $U = \prod_i U_i$ is the product of the gates drawn from the optimal policy.
To avoid such shuffling while still producing convergent paths, we sample several paths that lead from the starting state and terminate in the target (discrete) state, discarding sequences that are longer than some acceptable value, e.g. the length K defined by Eq. (5), and report the one with the smallest length as the optimal (i.e. shortest) program. Schematically, this can be described in pseudo-code as in Algorithm 1. This algorithm can be used to generate optimal programs for any given (approximately universal) single-qubit gateset. In our experiments, we found M to be 2 for the discrete {RZ, RY} gateset and 88 for the {I, H, S, T} gateset, and took K to be 100.
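The core of Algorithm 1, rolling out the stochastic discrete dynamics under the optimal policy several times and keeping the shortest rollout that reaches the target, can be sketched as follows (names and data structures are ours):

```python
import numpy as np

def sample_optimal_program(policy, dynamics, start, target, rng,
                           max_len=100, n_trials=50):
    """Roll out the stochastic discrete dynamics under the (optimal)
    policy n_trials times; keep the shortest action sequence that
    reaches the target within max_len steps."""
    best = None
    for _ in range(n_trials):
        s, seq = start, []
        for _ in range(max_len):
            if s == target:
                break
            a = policy[s]
            nxt = dynamics[(s, a)]          # dict: s' -> probability
            states, probs = zip(*nxt.items())
            s = states[rng.choice(len(states), p=probs)]
            seq.append(a)
        if s == target and (best is None or len(seq) < len(best)):
            best = seq
    return best
```

On deterministic toy dynamics this simply returns the unique path; with genuinely stochastic $p(s'|s,a)$, different rollouts can yield sequences of different lengths, which is why several trials are sampled.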

Discrete RZ and RY gateset
In the case of discrete RZ and RY gates, we find what we would expect: at most a single RZ rotation followed by a single RY rotation to get from anywhere on the Bloch sphere to the (discrete) $|1\rangle$ state. For (discrete) states lying along the equatorial ring around the y-axis, we need only apply a single RY rotation. Empirically, we choose a state resolution of $\epsilon = \pi/16$, so that we find sequences generating the pure $|1\rangle$ state from various discrete states across the Bloch sphere with $\cos^2(\pi/32) \sim 99\%$ fidelity. The optimal programs we find via the preceding procedure for this gateset have lengths of either 1 or 2.

Discrete {I, H, T } gateset
We can also use the procedure described above to obtain approximations to the states $(HT)^n|0\rangle$ for integers $n \geq 1$. The unitary HT can be thought of as a rotation by an angle $\theta = 2\arccos\left(\frac{\cos(7\pi/8)}{\sqrt{2}}\right)$ about an axis $\hat{n} = (n_x, n_y, n_z)$. The angle $\theta$ has a continued fraction representation (written in the Gaussian notation) which is infinite, and thus the angle $\theta$ is irrational; here $\lfloor x \rfloor$ denotes the floor operation, $x \to n$, where $n \in \mathbb{Z}$ is the largest integer with $n \leq x$. The states $(HT)^n|0\rangle$ lie along an equatorial ring about the axis $\hat{n}$, and no two states $(HT)^n|0\rangle$ and $(HT)^m|0\rangle$ are equal for $n \neq m$. Increasing the value of n corresponds to preparing states from among a finite set of states that span the equatorial ring about the axis of rotation. We choose to investigate state preparation up to $n = 10^{10}$. Although, as their form makes explicit, these states can be reproduced exactly using n many H and T gates, using our procedure they can only be obtained up to some fidelity controlled by the discretization, as described above. The advantage is that we can obtain good approximations to these states with far fewer gates than n. This is illustrated in Table 1, where short gate sequences (between 3 and 17 gates) reproduce states of the form $(HT)^n|0\rangle$ for very large values of n.
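As a quick numerical check (our own sketch; the branch and sign conventions for the angle may differ from the closed-form expression above), one can extract the rotation angle of HT and verify that the iterates $(HT)^n|0\rangle$ are pairwise distinct:

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
T = np.diag([1, np.exp(1j * np.pi / 4)])

def rotation_angle(U):
    """Rotation angle of a 2x2 unitary, viewed as an SU(2) element
    (global phase removed via the determinant; angle recovered up to
    a sign/branch ambiguity)."""
    U_su2 = U / np.sqrt(np.linalg.det(U) + 0j)
    c = np.real(np.trace(U_su2)) / 2        # cos(theta/2), up to sign
    return 2 * np.arccos(np.clip(abs(c), 0.0, 1.0))

theta = rotation_angle(H @ T)
# Since theta is an irrational angle, (HT)^n |0> never revisits a
# previously visited state for distinct n.
```

Iterating `(H @ T)` on $|0\rangle$ and checking pairwise overlaps confirms the distinctness claim numerically for small n.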

Noisy state preparation
Reinforcement learning has previously shown success when applied in the presence of noise [13,14]. Indeed, the ability to learn the effects of a noise channel has apparent practical use when applied to the current generation of noisy quantum computers. These devices are often plagued by errors that severely limit the depth of quantum circuits that can be executed. As full error correction procedures are too resource intensive to be implemented on current hardware, error mitigation methods have been developed to decrease the effect of noise [41][42][43][44]. However, there are also pre-processing error mitigation schemes that aim to modify the input circuit in order to reduce the impact of noise. Examples are quantum optimal control methods and dynamical decoupling [45][46][47]. Such techniques attempt to prepare a desired quantum state on a noisy device using circuits (or sequences of pulses) that are different from the ones that would be optimal in the absence of noise. This idea is immediately applicable in our MDP framework, as we now demonstrate.

State and Action Spaces
In the presence of noise, the quantum state becomes mixed and is described by a density matrix, which for a single qubit can generally be written as
$\rho = \frac{1}{2}\left(I + \vec{r}\cdot\vec{\sigma}\right)$.
Here, $\vec{r} = (r_x, r_y, r_z)$ are real coefficients called the Bloch vector, which in spherical coordinates reads
$\vec{r} = (r \sin\theta\cos\phi,\; r\sin\theta\sin\phi,\; r\cos\theta), \quad (10)$
where $r \equiv |\vec{r}| \in [0, 1]$, $\theta \in [0, \pi]$, and $\phi \in [0, 2\pi)$.
We perform the state discretization analogously to the previous section, but now need to discretize states within the full Bloch ball. To this end, we fix $\epsilon = \pi/k$ and $\delta = 1/k$ for some positive integer k. Now the set of points $n\epsilon \leq \theta \leq (n+1)\epsilon$, $m\epsilon \leq \phi \leq (m+1)\epsilon$, and $l\delta \leq r \leq (l+1)\delta$ for integers $1 \leq n \leq k-2$, $0 \leq m \leq 2k-1$, and $0 \leq l \leq k-1$ constitute the same discrete state $s = (n, m, l)$ in the MDP. As before, the polar regions $n = 0, k-1$ are special, as these regions are independent of $\phi$, i.e. they are described by the set of integers $s = (n, m = 0, l)$. This discretization corresponds to nesting concentric spheres and setting the discrete MDP states s to be the 3-dimensional regions between them.
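Extending the earlier angular bucketing to the full Bloch ball adds one radial index; a minimal sketch (our own naming) could look like:

```python
import numpy as np

def discretize_ball(r, theta, phi, k=8):
    """Map a Bloch vector (r, theta, phi) inside the Bloch ball to a
    discrete MDP state (n, m, l). eps = pi/k and delta = 1/k; the
    polar regions n = 0 and n = k-1 absorb all values of phi."""
    eps, delta = np.pi / k, 1.0 / k
    n = min(int(theta // eps), k - 1)
    m = 0 if n in (0, k - 1) else int(phi // eps) % (2 * k)
    l = min(int(r // delta), k - 1)   # radial (purity) shell index
    return (n, m, l)
```

Pure states sit in the outermost shell $l = k - 1$; noise that shrinks the Bloch vector moves a state inward to smaller l.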
Let us now introduce the action space A in the presence of noise. We model noisy gates using a composition of a unitary gate U and a noisy quantum channel described by a set of Kraus operators $\{A_i\}$ satisfying $\sum_i A_i^\dagger A_i = I$. Application of a noisy quantum channel can shrink the magnitude r of the Bloch vector as the state becomes more mixed. Evolution under a unitary gate U followed by this noisy channel results in
$\rho \to \sum_i A_i\, U \rho\, U^\dagger A_i^\dagger$.
We here again consider the discrete gateset $U \in \{I, H, T\}$. Once we specify the type of noise via a set of Kraus operators, its sole effect on our description of the MDP is to change the transition probability distributions $p(s'|s, a)$. While noise can change the optimal policies, we may nevertheless solve for the optimal policies using the exact same procedure that we used in the noiseless case. In the following, we compare the shortest gate sequences found by an agent trained using the noisy transition probabilities p to those found by an agent lacking knowledge of the noise channel. The noise observed in current quantum computers is to a good approximation described by amplitude damping and dephasing channels. The amplitude damping channel is described by the two Kraus operators
$A_0 = \begin{pmatrix} 1 & 0 \\ 0 & \sqrt{1-\gamma} \end{pmatrix}, \qquad A_1 = \begin{pmatrix} 0 & \sqrt{\gamma} \\ 0 & 0 \end{pmatrix},$
with $0 \leq \gamma \leq 1$. Physically, we can interpret this channel as causing a qubit in the $|1\rangle$ state to decay to the $|0\rangle$ state with probability $\gamma$. In current quantum computing devices, the relaxation time, $T_1$, describes the timescale of such decay processes. For a given $T_1$ time and a characteristic gate execution time $\tau_g$, we parametrize
$\gamma = 1 - e^{-\tau_g/T_1}. \quad (14)$
Note that application of the amplitude damping channel also leads to dephasing of the off-diagonal elements of the density matrix in the Z basis, with timescale $T_2 = 2T_1$.
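The composition of a unitary gate followed by a Kraus channel can be sketched directly on density matrices (a minimal numpy sketch; function names are ours, and the Kraus operators are the standard amplitude damping ones with $\gamma = 1 - e^{-\tau_g/T_1}$):

```python
import numpy as np

def apply_noisy_gate(rho, U, kraus_ops):
    """rho -> sum_i A_i U rho U^dag A_i^dag: unitary followed by noise."""
    rho_u = U @ rho @ U.conj().T
    return sum(A @ rho_u @ A.conj().T for A in kraus_ops)

def amplitude_damping(t1, tau_g):
    """Kraus operators for amplitude damping with gamma = 1 - exp(-tau_g/T1)."""
    gamma = 1 - np.exp(-tau_g / t1)
    A0 = np.array([[1, 0], [0, np.sqrt(1 - gamma)]])
    A1 = np.array([[0, np.sqrt(gamma)], [0, 0]])
    return [A0, A1]
```

Applied to $|1\rangle\langle 1|$ with a trivial gate, the channel transfers population $\gamma$ to $|0\rangle\langle 0|$ while preserving the trace, matching the physical decay interpretation above.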
The dephasing channel takes the form $\mathcal{E}(\rho) = (1-p)\rho + p\, Z\rho Z$ and is described by the Kraus operators $A_0 = \sqrt{1-p}\, I$ and $A_1 = \sqrt{p}\, Z$. This channel leads to pure phase damping of the off-diagonal terms of the density matrix in the Z basis, and is characterized by a dephasing time, $T_2$. We use PyQuil [48] (specifically the function damping_after_dephasing from pyquil.noise) to construct a noise channel consisting of a composition of these two noise maps by specifying the $T_1$ and $T_2$ times as well as the gate duration $\tau_g$. The amplitude damping parameter $\gamma$ is set by $T_1$ using Eq. (14). Since amplitude damping also results in phase damping of the off-diagonal terms of the density matrix (in the Z basis), the dephasing time $T_2$ is bounded above by $2T_1$. We thus parametrize the dephasing channel parameter (describing any additional pure dephasing) as
$p = \frac{1}{2}\left[1 - e^{-2\tau_g\left(1/T_2 - 1/(2T_1)\right)}\right].$
The dephasing channel thus describes dephasing leading to $T_2 < 2T_1$, and acts trivially if $T_2$ is at its upper bound $T_2 = 2T_1$. In the following, we consider $T_1 = T_2$, such that the dephasing channel acts non-trivially on the quantum state.
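A sketch of the dephasing Kraus operators under this parametrization (the exact form of p here is our reading of the parametrization, chosen so that the channel is trivial at the bound $T_2 = 2T_1$):

```python
import numpy as np

def dephasing_kraus(t1, t2, tau_g):
    """Kraus operators for the extra pure dephasing beyond what
    amplitude damping already provides. The parametrization of p is
    an assumption: it vanishes exactly when T2 = 2*T1."""
    p = 0.5 * (1 - np.exp(-2 * tau_g * (1 / t2 - 1 / (2 * t1))))
    A0 = np.sqrt(1 - p) * np.eye(2)
    A1 = np.sqrt(p) * np.diag([1.0, -1.0])   # sqrt(p) * Z
    return [A0, A1]
```

At $T_2 = 2T_1$ the second Kraus operator vanishes and the channel is the identity; for $T_2 < 2T_1$ it damps the off-diagonal elements of $\rho$ in the Z basis.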

Reward Structure and Environment Dynamics
For the reward structure of our noisy state preparation, we take the purity of the state into account when calculating the reward. This accounts for the fact that there may be no gate sequence that results in a state with a high enough purity to land in the pure goal state. As such, assigning a reward of +1 to the pure target state and 0 to all other states can lead to poor convergence. We assign the reward as follows: the pure target state $\rho_{\mathrm{target}}$ is one of the MDP states $(n_{\mathrm{target}}, m_{\mathrm{target}}, k-1)$, where $r = 1$ and thus $l = k-1$. Consider a state $\rho'$ in the MDP state $(n', m', l')$. If $n' = n_{\mathrm{target}}$ and $m' = m_{\mathrm{target}}$, we assign a reward of $l'/k$. Otherwise, we assign a reward of 0. With this construction, we reward reaching a state in the target direction (i.e. with correct angles $\theta, \phi$) using a reward proportional to the purity of the state. We thus also reward gate sequences that do not end up in the pure goal state, while still encouraging states with higher purity. One can expect the optimal value function for a noisy evolution to take on smaller values due to the fact that the rewards are smaller. Indeed, it can be easily verified that for a simplified noise model consisting of only a depolarizing quantum channel $\mathcal{E}(\rho) = (1-p)\rho + \frac{p}{3}(X\rho X + Y\rho Y + Z\rho Z)$, the resulting optimal value function is simply uniformly shrunk compared to the optimal value function of a noiseless MDP. Since the change is uniform across all values, the optimal policy is unchanged from the noiseless setting. This is no longer the case for realistic noise models such as those described by amplitude damping and dephasing channels, in which case we rederive the optimal policy using policy iteration.
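The reward assignment just described amounts to a few lines (a sketch with hypothetical names; `state` is the discrete triple (n, m, l) and `target_dir` the target's angular indices):

```python
def reward(state, target_dir, k):
    """Reward l'/k if the discrete state points in the target direction,
    i.e. (n', m') == (n_target, m_target); 0 otherwise. The radial
    index l' in {0, ..., k-1} labels the purity shell, so purer states
    in the right direction earn proportionally more."""
    n, m, l = state
    n_t, m_t = target_dir
    return l / k if (n, m) == (n_t, m_t) else 0.0
```

For example, a maximally pure state in the target direction earns (k-1)/k, while a partially mixed state in the same direction still earns a smaller, nonzero reward.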
This requires updating the conditional probability distributions $p(s'|s, a)$ by performing Monte-Carlo simulations as before: drawing random continuous states from within a given discrete state, applying the deterministic noisy gates, and recording the resulting discrete states. We now find transitions to states $s'$ with lower purity than the initial state s. Note that the randomness of the probability distribution p arises solely from the randomly sampled continuous quantum states to which we apply the noisy gate actions; the randomness due to noise is fully captured within the mixed-state density matrix description of quantum states.

Optimal Noisy State Preparation Sequences
We now consider the task of approximating states of the form $(HT)^n|0\rangle$ for $n \gg 1$, starting from the state $|0\rangle$, using the gate set {I, H, T} in the presence of noise. We use the MDP formulation with transition probabilities $p(s'|s, a)$ obtained in the presence of amplitude damping and dephasing noise channels. We find the optimal policy using policy iteration, which yields optimal gate sequences via Algorithm 1.
In Table 2, we present results up to n = 10^10, including the shortest gate sequences found by the optimal policies of the noisy and noiseless MDPs. We also compare the final state fidelities produced by these optimal circuits. The fidelities F listed in the table are found by applying the optimal gate sequences for a given n to the exact state |0⟩. Since the resulting states are mixed, we calculate the fidelity between the target state σ and the state ρ resulting from an optimal gate sequence as F(ρ, σ) = (tr √(√σ ρ √σ))². We list the gate sequences found by both the noiseless and the noisy MDPs. We choose such strong noise values (see Table 2) in order to highlight the difference in gate sequences (and resulting fidelities) produced by the optimal policies π* for noisy and noiseless MDPs. We expect this result to be generic and robust when considering MDPs for multiple qubits, where two-qubit gate errors are expected to lead to more pronounced noise effects.
The results in Table 2 demonstrate that even in the presence of (strong) noise, the noisy MDP is able to provide short gate sequences that approximate the target state reasonably well. Importantly, for all values of n shown (except for n = 10^4), the optimal policy π*_noisy of the noisy MDP yields a gate sequence that results in a higher fidelity than the gate sequence obtained from π*_noiseless of the noiseless MDP (if applied in the presence of noise). This shows that noise can be mitigated by adapting the gate sequence according to the noise experienced by the qubit. Solving for the optimal policy of a noisy MDP is a convenient approach to finding such adapted quantum circuits.
In Fig. 3, we compare the gate sequences and fidelities obtained from the optimal policies of noisy and noiseless MDPs for a fixed value of n = 10^7 as a function of T1 = T2. We observe that the noisy MDP outperforms the noiseless MDP for all noise strengths. This indicates that by learning the noise channel, the agent can adapt to the noise and find gate sequences that yield higher fidelities in that channel. Note that if we applied the gate sequences found by the noisy MDP in a noiseless setting, they would yield lower fidelities than the gate sequences produced by the optimal policy of a noiseless MDP. Based on these results, we conclude that dynamic programming and reinforcement learning methods provide a powerful and generic way to perform pre-processing error mitigation by identifying optimal gate sequences for qubit state preparation in the presence of noise. Future work should be directed towards exploring these approaches for two and more coupled qubits.

Compilation of single-qubit gates
In the previous sections, we considered an agent-environment interaction in which we identified Hilbert space as the state space, and the space of SU(2) gates as the action space. Shifting our attention to the problem of quantum gate compilation, we now identify both the state and action spaces with the space of SU(2) matrices, where for convenience we ignore an overall U(1) phase from the true group of single-qubit gates U(2). We first consider an appropriate coordinate system to use, and discuss why quaternions are better suited to this task than Euler angles. We focus exclusively on the gate set {I, H, T}, and modify the reward structure slightly so that we now have to work with the probabilities p(s′, r|s, a) instead of the simpler p(s′|s, a) as in the previous section. We present empirical results for a few randomly chosen (special) unitaries.

Coordinate system
We consider the gate set {I, H, T}. We include the identity in our gate set since we would like the target state to possess the highest value, and have the agent do nothing in the target state under the optimal policy. Because we would like to remain in the space of SU(2) matrices, we define H = RY(π/2)RZ(π), which differs from the usual definition by an overall factor of i, and T = RZ(π/4). Note that owing to our alternative gate definitions, we have H² = T⁸ = −1 ≠ 1, so that we may obtain up to 3 and 15 consecutive applications of H and T, respectively, in the optimal program. Next, we choose an appropriate coordinate system. One choice is to parametrize an arbitrary U ∈ SU(2) using the ZYZ-Euler angle decomposition U = RZ(α) RY(β) RZ(γ) for some angles α, β and γ. Note that for β = 0, we have a continuous degeneracy of choices in α and γ to specify some RZ(δ) with α + γ = δ. However, the transformations above will conventionally fix this to α = γ = δ/2. Under the action of T, i.e. T : U → U′ = T U = RZ(α′)RY(β′)RZ(γ′), or equivalently T : (α, β, γ) → (α′, β′, γ′), the ZYZ-coordinates transform rather simply as α′ = α + π/4, β′ = β, γ′ = γ. Under a similar action of H, however, the coordinates transform nontrivially. The matrix entries, on which these parameters depend, transform in such a way that the map is not volume-preserving: the determinant of the Jacobian J of the transformation from (α, β, γ) to (α′, β′, γ′) under the action of H diverges for values of α and β such that cos(α) sin(β) = ±1. This implies that for such pathological values, a unit hypercube in the discretized (α, β, γ) space gets mapped to a region that covers indefinitely many unit hypercubes in the discretized (α′, β′, γ′) space. In turn, this means that a single state s gets mapped to an unbounded number of possible states s′, causing p(s′|s, a = H) to be arbitrarily small. This may prevent the agent from recognizing an optimal path to valuable states, since even
if the quantity (r + γV^π(s′)) is particularly large for some states s′, it gets multiplied by the negligible factor p(s′|s, a = H), and therefore makes a very small contribution to an update rule such as Eq. (2). These problems can be overcome by switching to quaternions as our coordinate system. Unlike the ZYZ-Euler angles, the space of quaternions is in one-to-one correspondence with SU(2). Given some U ∈ SU(2) as in Eq. (18), the corresponding quaternion is given simply as q = (a, b, c, d). Under the action of T, as under the action of H, the components of q transform linearly, by left-multiplication with the unit quaternion representing the gate, and det(J^(T)) = det(J^(H)) = 1 for the Jacobians associated with both transformations, so that these operations are volume-preserving on this coordinate system. In turn, this implies that a hypercube with unit volume in the discretized quaternionic space gets mapped to a region with unit volume.
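The volume-preservation claim can be checked numerically. The sketch below assumes the quaternion convention U = aI − i(bX + cY + dZ), under which matrix multiplication corresponds to quaternion (Hamilton) multiplication, so the Jacobian of the action U → gU is the quaternion left-multiplication matrix L(q_g):

```python
import numpy as np

def rz(t):
    return np.array([[np.exp(-1j * t / 2), 0], [0, np.exp(1j * t / 2)]])

def ry(t):
    return np.array([[np.cos(t / 2), -np.sin(t / 2)],
                     [np.sin(t / 2),  np.cos(t / 2)]], dtype=complex)

def to_quat(U):
    """Quaternion (a, b, c, d) of U in SU(2), assuming U = aI - i(bX + cY + dZ)."""
    return np.array([U[0, 0].real, -U[0, 1].imag, -U[0, 1].real, -U[0, 0].imag])

def left_mult_matrix(q):
    """4x4 matrix L(q) with quat(g U) = L(quat(g)) @ quat(U): the Jacobian of
    the action U -> gU in quaternion coordinates."""
    w, x, y, z = q
    return np.array([[w, -x, -y, -z],
                     [x,  w, -z,  y],
                     [y,  z,  w, -x],
                     [z, -y,  x,  w]])

H = ry(np.pi / 2) @ rz(np.pi)   # the paper's SU(2) Hadamard
T = rz(np.pi / 4)

for g in (H, T):
    L = left_mult_matrix(to_quat(g))
    assert abs(np.linalg.det(L) - 1) < 1e-12            # volume-preserving
    U = ry(0.3) @ rz(1.1)                               # arbitrary test unitary
    assert np.allclose(L @ to_quat(U), to_quat(g @ U))  # the action is linear
```

For a unit quaternion q, det L(q) = |q|⁴ = 1, which is the coordinate-free statement of the volume preservation discussed above.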
For the purposes of the learning agent, this means that the total number of states that can result from acting with either T or H is bounded above. Suppose we choose our discretization such that the grid spacing along each of the 4 axes of quaternionic space is the same. Then, since a d-dimensional hypercube can intersect at most 2^d equal-volume hypercubes, a state s can be mapped to at most 16 possible states s′. While this is certainly better than the pathological case we noted previously using the ZYZ-Euler angles, one could ask if it is possible to do better and design a coordinate system such that a state gets mapped to at most one other state.
One possible approach to making the environment dynamics completely deterministic is to consider a discretization q = (n₁∆, n₂∆, n₃∆, n₄∆) where n₁, n₂, n₃, n₄ ∈ Z, choose ∆ such that the transformed quaternion can also be described similarly as q′ = (n₁′∆, n₂′∆, n₃′∆, n₄′∆), and try to ensure that n₁′, n₂′, n₃′, n₄′ are also integers. Essentially this would mean that corners of hypercubes map to corners of hypercubes, so that discretized states map uniquely to other discretized states. However, consider the transformation under H, Eq. (23). For this transformation, the requirement amounts to n₁′ = k/√2 for some k ∈ Z (and similarly for the other components). This implies that n₁′ cannot be an integer, and so the map given by this gate set over this discretized coordinate system cannot be made deterministic in this manner. Nevertheless, we find that our construction is sufficient to solve the MDP that we have set up.
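This obstruction is easy to verify numerically. In the sketch below, the left-multiplication matrix of H (worked out by hand in the convention U = aI − i(bX + cY + dZ), and to be treated as an assumption of that convention) maps grid corners to points whose components are irrational multiples of ∆:

```python
import numpy as np

# Quaternion left-multiplication matrix of H = RY(pi/2) RZ(pi), whose unit
# quaternion is (0, s, 0, s) with s = 1/sqrt(2) in the convention above.
s = np.sqrt(0.5)
L_H = np.array([[0, -s,  0, -s],
                [s,  0, -s,  0],
                [0,  s,  0, -s],
                [s,  0,  s,  0]])

delta = 0.1
n = np.array([1, 2, 3, 4])        # integer grid indices of a corner q = n * delta
image = L_H @ (n * delta)         # the corner after applying H

# Every nonzero component of the image is an integer multiple of 1/sqrt(2),
# which is irrational, so the image cannot lie on the grid n' * delta.
ratios = image / delta
assert not np.allclose(ratios, np.round(ratios))
```

Here the components of the image are (−6, −2, −2, 4)/√2 times ∆, none of which is an integer multiple of ∆, matching the k/√2 argument above.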

Reward Structure and Environment Dynamics
Some natural measures of overlap between two unitaries include the Hilbert-Schmidt inner product tr(U†V) and, since we work with quaternions, the quaternion distance |q − q′|. However, neither does the Hilbert-Schmidt inner product monotonically increase, nor does the quaternion distance monotonically decrease, along the shortest {H, T} gate sequence. As an example, consider the target quaternion q′ = [−0.52514, −0.38217, 0.72416, 0.23187] from Table (3), with shortest compilation sequence HTTTTHTHHH (read right to left) satisfying |q − q′| < 0.3, where q is the quaternion prepared via the sequence. After the first H application, |q − q′| ∼ 1.34, which drops after the second H application to |q − q′| ∼ 0.97, and then rises again after the third H application to |q − q′| ∼ 1.49, before eventually falling below the threshold error. Similarly, the Hilbert-Schmidt inner product starts at ∼ 0.21, rises to ∼ 1.05, then falls to ∼ −0.21 before eventually becoming ∼ 1.96. On the other hand, we showed previously how assigning a reward of +1 to some target state, and 0 to all other states, made it possible to relate the optimal value function to the length of the optimal path.
Instead of specifying a reward of +1 in some target state and 0 in every other state, however, we now assign a reward of +1 whenever the underlying unitary has evolved to within an ε-net approximation of the target unitary. Since we work with quaternions, we specify this as obtaining a reward of +1 whenever the evolved quaternion q satisfies |q − q′| < ε for some ε > 0, where q′ is the target quaternion, and 0 otherwise. We note that the Euclidean distance between two vectors (a, b, c, d) and (a + ∆_bin, b + ∆_bin, c + ∆_bin, d + ∆_bin) equals 2∆_bin; however, these two vectors cannot both represent quaternions, since only one of them can have unit norm. Nevertheless, this sets the size of the discrete states, and we require that ε be comparable to this scale, setting ε = 2∆_bin in practice. This requirement comes from the fact that, in general, the ε-net could cover more than one state, so that we now need to estimate the probabilities p(s′, r|s, a), in contrast to the scenario where a state uniquely specifies the reward. Demanding that ε ∼ ∆_bin ensures that p(s′, r = 1|s, a) does not become negligibly small.
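The reward criterion can be sketched as follows (an illustration; the grid spacing `delta_bin` is a hypothetical value):

```python
import numpy as np

def compile_reward(q, q_target, eps):
    """+1 when the evolved quaternion lies inside the eps-net of the target."""
    return 1.0 if np.linalg.norm(np.asarray(q) - np.asarray(q_target)) < eps else 0.0

# Tie eps to the discretization scale, as in the text: eps = 2 * Delta_bin.
delta_bin = 0.15
eps = 2 * delta_bin

# Two grid-adjacent 4-vectors differing by Delta_bin in every component are
# exactly 2 * Delta_bin apart, so eps is comparable to one discrete cell.
v = np.zeros(4)
assert np.isclose(np.linalg.norm((v + delta_bin) - v), 2 * delta_bin)
```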
We could estimate the dynamics by uniformly randomly sampling quaternions, tracking which discrete state each sampled quaternion belongs to, evolving them under the actions, and tracking the resulting discrete state and reward, just as we did in the previous section. Here, however, we estimate the environment dynamics by simply rolling out gate sequences. Each rollout starts from the identity gate, then successively applies either an H or a T gate with equal probability until some fixed number K of actions have been performed. The probabilities for the identity action p(s′, r|s, a = I) are estimated by recording that (s′, a = I) leads to (s′, r) at each step at which we encounter (s′, r) while performing some other action a ≠ I. The number of actions per rollout K is set by the desired accuracy, which the Solovay-Kitaev theorem informs us is O(polylog(1/ε)) [15], and in our case has an upper bound given by Eq. 5. Estimating the environment dynamics in this manner is similar in spirit to off-policy learning in typical reinforcement learning algorithms, such as Q-learning [4].
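A sketch of this rollout-based estimation follows (the quaternion helpers and the encoding of the identity action as a third action index are assumptions, not the paper's code):

```python
import numpy as np
from collections import defaultdict

def rz(t):
    return np.array([[np.exp(-1j * t / 2), 0], [0, np.exp(1j * t / 2)]])

def ry(t):
    return np.array([[np.cos(t / 2), -np.sin(t / 2)],
                     [np.sin(t / 2),  np.cos(t / 2)]], dtype=complex)

H, T = ry(np.pi / 2) @ rz(np.pi), rz(np.pi / 4)

def to_quat(U):
    # Convention U = aI - i(bX + cY + dZ), so q = (a, b, c, d).
    return np.array([U[0, 0].real, -U[0, 1].imag, -U[0, 1].real, -U[0, 0].imag])

def discretize(q, delta):
    return tuple(np.floor(q / delta).astype(int))

def estimate_dynamics(q_target, eps, delta, n_rollouts=1000, K=50, seed=0):
    """Estimate p(s', r | s, a) by rolling out random {H, T} sequences from
    the identity.  Actions: 0 -> H, 1 -> T, 2 -> I (self-loop recorded
    whenever a state-reward pair is encountered)."""
    rng = np.random.default_rng(seed)
    counts = defaultdict(lambda: defaultdict(int))
    for _ in range(n_rollouts):
        U = np.eye(2, dtype=complex)
        for _ in range(K):
            a = rng.integers(2)                    # uniform draw between H and T
            s = discretize(to_quat(U), delta)
            U = (H if a == 0 else T) @ U
            q = to_quat(U)
            s_next = discretize(q, delta)
            r = 1.0 if np.linalg.norm(q - q_target) < eps else 0.0
            counts[(s, a)][(s_next, r)] += 1
            counts[(s_next, 2)][(s_next, r)] += 1  # identity action self-loop
    return {sa: {sr: c / sum(d.values()) for sr, c in d.items()}
            for sa, d in counts.items()}
```

Each entry of the returned dictionary is a normalized empirical distribution over (s′, r) pairs for a given (s, a), which is exactly the object policy iteration consumes.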

Optimal Gate Compilation Sequences
Solving the constructed MDP through policy iteration, we arrive at the optimal policy just as before. We then chain the optimal policy's actions together to form optimal gate compilation sequences, accounting for the fact that while the dynamics of our constructed MDP are stochastic, the underlying evolution of the unitary states is deterministic. The procedure we use for starting with the identity gate and terminating, with some accuracy, at the target state is outlined in pseudo-code in Algorithm 2, where the length of the longest sequence K is dictated by Eq. 5; in our experiments we took 100 rollouts.
Algorithm 2 Optimal Gate Compilation Sequence
The accuracy with which we obtain the minimum-length action sequence in Algorithm (2) need not necessarily satisfy the bound set by the reward criterion, r = 1 for |q − q′| < ε, for reasons similar to the shuffling discussed in the context of state preparation above. This is why we require Algorithm (2) to report the minimum-length action sequence that also satisfies the precision bound. In practice, we found this to be a typically unnecessary requirement: even when the precision bound was not satisfied, the precision did not stray far from the bound. It should be emphasized that due to the shuffling effect, there is no a priori guarantee that the optimal sequence returned by Algorithm (2) even exists, since the precision bound is not guaranteed to be met, and the only bound we can safely set is |q − q′| ≲ ∆_bin k, where k is the number of actions in the sequence that prepares q. In practice, however, we find the algorithm works quite well, producing optimal sequences that correspond to the shortest possible gate sequences preparing the target quaternions q′.
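A hedged sketch of the procedure follows. The policy interface (a callable returning a distribution over action labels) and the gate encoding as quaternion left-multiplication matrices are assumptions; the paper's Algorithm 2 additionally enforces the precision bound discussed above:

```python
import numpy as np

s2, c8, s8 = np.sqrt(0.5), np.cos(np.pi / 8), np.sin(np.pi / 8)
# Quaternion left-multiplication matrices for H and T (convention as above).
L_H = np.array([[0, -s2, 0, -s2], [s2, 0, -s2, 0],
                [0, s2, 0, -s2], [s2, 0, s2, 0]])
L_T = np.array([[c8, 0, 0, -s8], [0, c8, -s8, 0],
                [0, s8, c8, 0], [s8, 0, 0, c8]])
GATES = {'H': L_H, 'T': L_T}

def compile_sequence(policy, q_target, eps, delta, K, n_rollouts=100, seed=0):
    """Sketch of Algorithm 2: roll out the (possibly stochastic) optimal
    policy from the identity, track the *deterministic* quaternion evolution,
    and return the shortest action sequence landing within eps of the target.
    `policy(s)` returns a dict {action: probability} for discrete state s."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_rollouts):
        q, seq = np.array([1.0, 0, 0, 0]), []      # quaternion of the identity
        for _ in range(K):
            s = tuple(np.floor(q / delta).astype(int))
            dist = policy(s)
            a = rng.choice(list(dist), p=list(dist.values()))
            if a == 'I':                           # identity: policy says stop
                break
            q = GATES[a] @ q                       # deterministic evolution
            seq.append(a)
            if np.linalg.norm(q - q_target) < eps: # reached the eps-net
                if best is None or len(seq) < len(best):
                    best = list(seq)
                break
    return best

# Toy usage: a policy that always applies H reaches H's quaternion in one step.
q_H = L_H @ np.array([1.0, 0, 0, 0])
assert compile_sequence(lambda s: {'H': 1.0}, q_H, 0.3, 0.15, K=10) == ['H']
```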
To benchmark the compilation sequences found by this procedure, we find the shortest gate sequences compiling to some specified precision using a brute-force search, which yields the smallest gate sequence satisfying |q − q′| < ε for some ε > 0 with the smallest value of |q − q′|, where q is the prepared quaternion and q′ is the target quaternion. This brute-force procedure is described in pseudo-code in Algorithm 3.
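A sketch of this brute-force search follows. Note that sequences here are built in application order (left to right), whereas the paper's tables read right to left; the quaternion left-multiplication matrices are the same assumed encoding as above:

```python
from itertools import product
import numpy as np

s2, c8, s8 = np.sqrt(0.5), np.cos(np.pi / 8), np.sin(np.pi / 8)
L_H = np.array([[0, -s2, 0, -s2], [s2, 0, -s2, 0],
                [0, s2, 0, -s2], [s2, 0, s2, 0]])
L_T = np.array([[c8, 0, 0, -s8], [0, c8, -s8, 0],
                [0, s8, c8, 0], [s8, 0, 0, c8]])
GATES = {'H': L_H, 'T': L_T}

def brute_force_shortest(q_target, eps, max_len=12):
    """Sketch of Algorithm 3: enumerate {H, T} sequences by increasing length
    and return the shortest one whose prepared quaternion lies within eps of
    the target, breaking ties by the smallest quaternion distance."""
    q0 = np.array([1.0, 0, 0, 0])                  # quaternion of the identity
    for length in range(max_len + 1):
        hits = []
        for seq in product('HT', repeat=length):
            q = q0
            for a in seq:                          # apply gates in order
                q = GATES[a] @ q
            d = np.linalg.norm(q - q_target)
            if d < eps:
                hits.append((d, seq))
        if hits:                                   # shortest length reached
            return ''.join(min(hits)[1])
    return None
```

Since the search proceeds by increasing length, the first length at which any sequence falls within the ε-net is returned, matching the "smallest sequence, then smallest distance" criterion of Algorithm 3.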

Algorithm 3 Shortest Gate Compilation Sequence
As an experiment, we drew 30 Haar-random SU(2) matrices and found their compilation sequences from Algorithms (2) and (3). We set ε = 2∆_bin = 0.3, and estimated the environment dynamics using 1000 rollouts, each rollout being 50 actions long, with each action a uniform draw between H and T. The findings are presented in Table (3), where the sequences are to be read right to left. We find that although the two approaches sometimes yield different sequences, the two sequences agree in their length and produce quaternions that fall within ε of the target quaternion. We expect in general that the two approaches will produce comparable sequence lengths and target fidelities, though not necessarily equal ones.

Conclusions
We have shown that the tasks of single-qubit state preparation and gate compilation can be modeled as finite MDPs, yielding optimally short gate sequences to prepare states or compile gates up to some desired accuracy. These optimal sequences were found to be comparable with independently calculated shortest gate sequences for the same tasks, often agreeing with them exactly. Additionally, we investigated state preparation in the presence of amplitude damping and dephasing noise channels. We found that an agent can learn information about the noise and yield noise-adapted optimal gate sequences that result in a higher fidelity with the target state. This work therefore provides strong evidence that more complicated quantum programming tasks can also be successfully modeled as MDPs. In scenarios where the state or action spaces grow too large for dynamic programming to be applicable, or where the environment dynamics cannot be accurately learned in the simple manner described above, it is highly promising to apply reinforcement learning to find optimally short circuits for particular tasks. Future work should be directed towards using dynamic programming and reinforcement learning methods for noiseless and noisy state preparation and gate compilation for several coupled qubits.
We provide the required programs for qubit state preparation as open-source software, and we make the corresponding raw data of our results openly accessible [49].
where in the first equality we have simply used the fact that the reward is 1 in the target state t and 0 in every other state; in the fourth equality we have expanded V(s′) using the same fact; in the sixth equality we have used the fact that Σ_{s″,s′} p(s″|s′)p(s′|s) = Σ_{s″} p(s″|s) and introduced the notation (P^k)_{s′,s} = Σ_{s₁,...,s_{k−1}} p(s′|s_{k−1}) · · · p(s₁|s); and finally, in the last equality, we have recursively expanded V(s′), just as in the preceding steps, a total of K times. Noting that we can carry out this recursive expansion arbitrarily many times, in the limit K → ∞ we arrive at the expression above, which is valid for the value function corresponding to an arbitrary policy. Specializing to the optimal policy, for which V(t) = (1 − γ)^{−1}, we find precisely Eq. 4.

Figure 1 :
Figure 1: Optimal values for various states on the Bloch sphere using the discrete RZ and RY gates, with a discount factor γ = 0.8. The color of a state corresponds to its optimal value function V_π*, where lighter colors indicate a larger value. Those colored in green are exactly the states whose optimal circuits to prepare the discrete |1⟩ state consist of a single RY rotation, while those in blue are exactly the ones whose optimal circuits consist of an RZ rotation followed by an RY rotation.

Figure 2 :
Figure 2: Optimal value landscape across the Bloch sphere using the set of gates {I, H, S, T}, with a discount factor γ = 0.95. The color of a state corresponds to its optimal value function V_π*, where darker colors indicate a larger value. States distributed around the equator of the Bloch sphere are especially advantageous to start from in order to reach the target |1⟩ state, as their optimal circuits consist of short sequences of S and H gates.

Any single-qubit density matrix can be written as ρ = (I + r · σ)/2, where r is the Bloch vector and σ = (X, Y, Z) is the vector of Pauli matrices. Since density matrices are positive semi-definite, it holds that |r| ≤ 1. If ρ is a pure state, then |r| = 1; otherwise |r| < 1. Pure states can thus be interpreted as points on the surface of the Bloch sphere, whereas mixed states correspond to points within the Bloch sphere. The maximally mixed state ρ = I/2 corresponds to the origin. To find r, one can calculate the expectation values of each Pauli operator: r = (Tr(ρX), Tr(ρY), Tr(ρZ)).
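As a small illustration, the Bloch vector can be computed directly from these expectation values:

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.diag([1.0, -1.0]).astype(complex)

def bloch_vector(rho):
    """Bloch vector r of a single-qubit density matrix:
    r = (Tr(rho X), Tr(rho Y), Tr(rho Z))."""
    return np.real([np.trace(rho @ P) for P in (X, Y, Z)])

# |0><0| sits at the north pole of the Bloch sphere;
# the maximally mixed state I/2 sits at the origin.
r_zero = bloch_vector(np.diag([1.0, 0.0]).astype(complex))
r_mixed = bloch_vector(np.eye(2) / 2)
```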

Figure 3 :
Figure 3: Fidelity F of the state σ prepared using optimal gate sequences with target state (HT)^n |0⟩ for fixed n = 10^7, as a function of noise strength T1 = T2. The shortest gate sequences (indicated in the figure) are produced by the optimal policies π*_noisy (orange) and π*_noiseless (blue) of the noisy and noiseless MDPs, respectively. The noisy policy gives gate sequences that differ from the noiseless case and consistently yield higher fidelities. The optimal noisy gate sequence is HTHTHTH for all times T1 = T2 ≥ 60 µs. We fix the gate time to τg = 200 ns when generating the Kraus operators as defined by Eqs. (14) and (16). For each value of T1, T2, we generate the transition probabilities p(s′|s, a) according to the corresponding noise map and use policy iteration to find the optimal policy. The fidelity is then calculated by applying the gate sequences found by both the noisy and noiseless MDPs to |0⟩ in the error channel for that specific value of T1, T2. The point at infinity represents the noiseless case, and corresponds to the transition probabilities learned by the noiseless MDP.

Table 1 :
Gate sequences obtained from the optimal policy to approximately produce target states |ψ_target⟩ = (HT)^n |0⟩. The optimal policy and fidelity are calculated for the noiseless case. The fidelity is defined as F = |⟨ψ_target|ψ⟩|, where |ψ⟩ is obtained from application of the shown gate sequences to the state |0⟩. The sequences are to be read from right to left.
Methods such as zero-noise extrapolation (ZNE), Clifford data regression (CDR), and probabilistic error cancellation (PEC) involve post-processing of circuit results (with and without making use of knowledge about the underlying type of noise).

Table 2 :
Shortest gate sequences and noisy fidelities F produced by the optimal policies π* of the noiseless MDP (columns 2 and 3) and the noisy MDP (columns 4 and 5). The gate sequences should be read right to left. The noise is characterized by T1 = T2 = 1 µs and the gate time is set to τg = 200 ns. While this corresponds to a noise level stronger than in current-day NISQ hardware, where T1, T2 ≈ 100 µs, these parameters yield sufficiently strong noise to highlight differences in the optimal gate sequences. The fidelities (with the target state) are obtained by preparing mixed states using the shown gate sequences applied to |0⟩ in the presence of noise. Note that the states generated by (HT)^n |0⟩ for n = 3, 4 have an overlap fidelity of 0.99. This is also the case for n = 7, 10. This explains the similarity of the gate sequences found for these cases.
The transition probabilities p(s′|s, a) of the noiseless MDP are constructed from exact unitary gates, while those of the noisy MDP are generated from noisy gates subject to combined amplitude damping and dephasing error channels. We set the relaxation and dephasing times to T1 = T2 = 1 µs and the gate time to τg = 200 ns. While the value for τg is typical of present-day superconducting NISQ hardware, the values of T1, T2 are about two orders of magnitude shorter than typical values on today's NISQ hardware, where T1, T2 ≈ 100 µs.