Quasiprobability decompositions with reduced sampling overhead

Quantum error-mitigation techniques can reduce noise on current quantum hardware without the need for fault-tolerant quantum error correction. For instance, the quasiprobability method simulates a noise-free quantum computer using a noisy one, with the caveat of only producing the correct expected values of observables. The cost of this error mitigation technique manifests as a sampling overhead which scales exponentially in the number of corrected gates. In this work, we present an algorithm based on mathematical optimization that aims to choose the quasiprobability decomposition in a noise-aware manner. This directly leads to a significantly lower basis of the sampling overhead compared to existing approaches. A key element of the novel algorithm is a robust quasiprobability method that allows for a tradeoff between an approximation error and the sampling overhead via semidefinite programming.


Introduction
Quantum computing holds the promise for significant advancements in various fields such as quantum chemistry, material science, optimization, and machine learning.Unlike classical computers, quantum computers suffer from a significant amount of noise which accumulates over the execution of a quantum algorithm and thus limits experiments to shallow quantum circuits.The theory of quantum error correction and fault-tolerant quantum computing solves this problem.The celebrated threshold theorem ensures that given the noise level of the physical gates is below some constant threshold value, arbitrarily long calculations are possible at arbitrarily low error rates [3].The cost for a fault-tolerant implementation of a given circuit is a polylogarithmic overhead that is tame when speaking of asymptotics, but currently available quantum hardware does not fulfil the requirements for quantum error correction [20] and it remains a major challenge to achieve this goal.
It is thus a natural question if it is possible to mitigate noise on quantum hardware without the need of full quantum error correction.Multiple schemes have been proposed [16,14,23,18] that aim to reduce the effect of noise while also being significantly easier to implement than quantum error correction.All of these methods come with a certain drawback that prohibits them from achieving large-scale fault tolerant quantum computation.The term quantum error mitigation (QEM) is often used to group these methods.The hope is to use error mitigation techniques to demonstrate a quantum advantage on a useful task with near-term devices.
One specific method in the family of quantum error mitigation techniques is the quasiprobability method [23] (also known as probabilistic error cancellation).The central idea is to decompose the ideal (noise-free) quantum circuit into a quasiprobabilistic mixture of noisy circuits, called quasiprobability decomposition (QPD), that one can implement on a given hardware.These noisy circuits are randomly sampled and run.By correctly post-processing the measurement outcome, one can obtain an unbiased estimator for the expectation values of the ideal circuit.The quasiprobability method exhibits a simulation overhead in terms of an additional sampling cost.More precisely, the number of shots required to simulate a circuit scales as O(γ 2|C| ) where γ ≥ 1 is the so-called sampling overhead of the QPD and |C| denotes the number of corrected gates in a given circuit C.This quantity encapsulates how strongly the method has to compensate for the noise in the quantum system.Notably, γ goes to 1 in the limit where the noise vanishes, making the simulation overhead disappear.This exponential cost restricts the quasiprobability method to shallow quantum circuits.
By the above reasoning, it is evident that one wants to find QPDs that exhibit the smallest possible γ-factor.The arguably most difficult part of finding a suitable QPD is how to choose the noisy quantum circuits, called decomposition set, into which we decompose the ideal quantum circuit.It has been realized [23] that the optimal QPD can be expressed as a linear program, under the assumption that the decomposition set is already fixed.Furthermore, a concrete decomposition set was proposed in [10] that suffices to decompose any circuit under the constraint of arbitrary noise that is not too strong.However, this decomposition set does not necessarily produce the optimal QPD and indeed we demonstrate that there exist decomposition sets that yield a lower sampling overhead.This motivated the study of improved decomposition sets for the quasiprobability method.In [22] the authors prove analytical bounds on the best possible sampling overhead under the assumption that all unitary and state preparation operations exhibit the same noise.
In this work we introduce a novel method, which we call Stinespring algorithm, for finding efficient decomposition sets for single-qubit and two-qubit gates 1 .Based on mathematical optimization techniques for convex and non-convex problems this new iterative algorithm takes into account the hardware noise.As a building block for the Stinespring algorithm, we introduce a robust quasiprobability method called approximate quasiprobability decomposition, which is interesting on its own.Instead of perfectly simulating a certain circuit, we allow for a small approximation error.This enables us to make a tradeoff between the approximation quality and the sampling overhead of the quasiprobability method.We will illustrate our results with simulations showing that the new methods significantly reduce the γ-factor of the QPD compared to existing approaches.

Quasiprobability method
Consider a linear operator F ∈ L(L(A), L(B)), where L(A, B) denotes the set of linear maps from Hilbert spaces A to B and L(A) := L(A, A).A QPD of F is a finite set of tuples {(a i , E i )} i with a i ∈ R and completely positive maps E i , such that for all ρ ∈ L(A) We call {E i } i the decomposition set of the QPD.The quasiprobability method requires a QPD {(a i , E i )} i where the maps E i correspond to operations that can be implemented on the quantum hardware, e.g. they might correspond to the channel representing a noisy quantum gate.In practice, these could be obtained by doing tomography and/or by using prior knowledge of the experimental noise.Note that the E i may not need to be trace-preserving as non-trace-preserving maps can be simulated using measurements and postselection.
The quasiprobability method allows us to simulate a noiseless execution of F using the noisy operations E i .In contrast to quantum error correction, this technique is very hardware-friendly as we do not encode our quantum information in a larger space, so a logical qubit still corresponds to a physical qubit.However, the quasiprobability method is only able to systematically remove the bias from the expectation values and cannot extend the coherence times of the computation.This manifests in a bad asymptotic scaling emphasizing that this technique is most favorable for near-term circuits.
Typically one is interested in the case where F is a unitary operation, i.e.F = U for some unitary matrix U , where we use the notation U(ρ) := U ρU † .This corresponds to simulating an ideal quantum gate U using noisy operations.But sometimes it can also be interesting to consider F which are nonphysical, for example F could be the inverse of a noise operator that occurred on a certain quantum system.By simulating the inverse, one can revert the effect of the noise on average.Typically the inverse of interesting noise models, such as depolarizing noise, are not completely positive maps.From now on we will restrict ourselves to the case F is unitary, though all results also hold in a more generalized setting.

Quasiprobability sampling
Consider a gate described by a unitary U and a QPD {(a i , E i )} i of the unitary channel U. Suppose that we are interested in the expectation value of a projective measurement described by a Hermitian operator O, which would be performed after the gate U , i.e. we would like to obtain tr[O U(ρ)], given a certain input state ρ.The linearity of the trace together with (1) implies where sgn(•) denotes the sign function.Via (2) we have introduced the γ-factor γ := i |a i |.The right-hand side of (2) naturally gives us a way to construct an unbiased Monte Carlo estimator for tr[O U(ρ)], while only having access to the operations E i of the noisy hardware, as depicted in Figure 1.We choose a random number i with probability |a i |/γ and then perform the operation E i followed by the measurement O.The measurement outcome is weighted by the coefficient γsgn(a i ) which gives the output of the estimator.By sampling this estimator many times and considering the average value, we can obtain an arbitrarily precise estimate of the true expectation value tr In practice the E i are imperfect estimates of the true underlying quantum channels, produced by tomography.Unfortunately, tomography fundamentally cannot give us an arbitrarily precise estimate of the true channels, due to state preparation and measurement (SPAM) errors.An erroneous knowledge of the E i will give erroneous coefficients a i in the QPD, which again translate into an error in our estimator of the expectation value.It was shown that this problem can be circumvented by using gate set tomography (GST), as it still allows us to obtain an unbiased Monte Carlo estimator [10], even though the exact E i are unknown.For simplicity, we will assume from now on that we are able to perform exact tomography of the quantum channel E i .
The quasiprobability estimator constructed above does generally not exhibit the identical statistics as the ideal measurement outcome of O, it only has the same expectation value.In fact, the variance of the estimator increases with γ and one requires O(γ2 ) more shots to estimate tr[O U(ρ)] to a target accuracy compared to the case where U is implemented exactly.This is a direct consequence of Hoeffding's inequality.
The quasiprobability method can be applied to large quantum circuits consisting of many different gates, by obtaining a QPD of each quantum gate individually, and then combining them together into one large QPD of the whole circuit.The sampling of the total quasiprobability estimator can still be done efficiently: Consider a circuit consisting of a sequence of m unitary gates U m • • • • • U 1 followed by a measurement described by the observable O.For each operation U k a QPD {(a i )} with γ-factor γ k is given.Our estimator starts by sampling a random number i 1 according to the probabilities {|a (1) i |/γ 1 } and executing the operation i1 .In a second step we sample a random number i 2 according to the probabilities {|a i2 .This procedure is continued for all i = 3, . . ., m while keeping track of all indices i 1 , . . ., i m sampled along the way.At the very end we measure the observable O on the system.The estimator then outputs the outcome of that measurement multiplied by sgn( We see that the combined γ-factor scales in a multiplicative way as γ total = m k=1 γ k .Therefore the sampling overhead of the total circuit scales exponentially in the circuit size.The Monte Carlo sampler for multiple error mitigated quantum gates only remains an unbiased estimator under the assumption that the noise is localized and Markovian.In looser terms this means that the noise on any quantum gate must be uncorrelated with other noise and independent on what operations were performed previously on the circuit.Similarly to previous works we will assume for simplicity that this assumption holds [23,10].Some more recent research has shown that cross-correlations can be tackled by using variants of the quasiprobability method which do not rely on tomography to find the optimal quasiprobability coefficients [21].

Finding quasiprobability decompositions
For a given quantum hardware, can we be certain that there exists a QPD of U into operations that the hardware can implement?This question may seem difficult, as its answer depends on the exact details of the capabilities of the hardware, namely what kind of quantum operations it can implement and at what fidelity it does so.Fortunately the answer to this question is positive, as long as as the noise is not too strong [10].
Consider the simplest case where U is a single-qubit gate.Table 1 lists 16 single-qubit operations that can be realized by the successive execution of the Hadamard gate H, the phase gate S, and/or a measurement-postselection operation P 0 = |0 0|.The measurement-postselection operation can be simulated in the context of the quasiprobability sampling estimator by performing a measurement in the computational basis and aborting if the outcome is 1, where aborting means that the estimator outputs 0. Therefore any quantum computer able to implement Clifford gates and measurements in the computational basis can also implement these operations, at least approximately.It can be verified that the mentioned set of 16 operations forms a basis of the (real) space of 1-qubit Hermitian-preserving operations 2 and hence we call it standard basis.It can be shown that the standard basis is robust to small errors in the sense that it remains a basis of the space of Hermitian-preserving operations if the individual elements suffer from a small amount of noise [10].
Consider the setting where we have a fixed decomposition set {E i } i and we would like to find the optimal (in terms of lowest possible γ-factor) quasiprobability coefficients a i such that (1) is fulfilled.If our decomposition set consists of linearly independent operations, which is e.g. the case for the standard basis, then this problem corresponds to solving a set of linear equations.But in practice one is often confronted with the more general case where there is an infinity of possible decompositions, so we require some method to pick out the best one of them.This optimization problem can be written [ and hence can be solved efficiently [23].We note that the number of equality constraints in (3) is exponential in the number of qubits involved in the untiary U.This exponential cost implies that the LP can only be solved for few-qubit gates, typically 1-and 2-qubit gates.

Approximate quasiprobability decomposition
In this section we present our first result, which is a relaxation of the QPD where we allow for an error in the decomposition.In mathematical terms, we require the condition (1) to hold approximately, i.e., An erroneous QPD leads to an error in the Monte Carlo estimator.Giving up the exactness of the quasiprobability method allows us to find a better decomposition in terms of the γ-factor.More precisely the approximate QPD results in a tradeoff between the sampling overhead and the error in the method.

Semidefinite programming relaxation using the diamond norm
In a first step we have to quantify the error in the approximate QPD given by (4).A natural candidate is to use the diamond norm as it has a strong operational interpretation.In addition, the diamond norm fits very naturally in our mathematical optimization setting, as it has been shown to be expressible as a semidefinite program (SDP) [24].
Theorem 3.1 (SDP for diamond norm [24]).Let G ∈ L(L(A), L(B)) and denote its Choi matrix by Λ G .Then which is a SDP. 3he dual formulation of the SDP ( 5) is given by where • ∞ denotes the spectral norm.Suppose we have a fixed decomposition set {E i } i and we would like to find the best possible approximate QPD of an operation U into that decomposition set.More precisely, we give a certain γ-factor budget, denoted γ budget , such that the QPD may not have a γ-factor higher than that.The corresponding optimization problem is By inserting the SDP from ( 6) into ( 7) the problem can be rewritten as where Λ U and Λ Ei are the Choi matrices of U and E i , respectively.Luckily ( 8) is still a SDP, which allows us to efficiently solve our optimization problem.The approximation U approx := i a i E i of U obtained from the optimization problem in (7) is not guaranteed to be a physical map.In practice, this implies that we might simulate the execution of a non-physical map using the quasiprobability method.It is possible to enforce complete positivity and/or trace-preservingness of U approx into the optimization problem by adding a positivity/partial-trace constraint on the Choi matrix Λ Uapprox of U approx .
We note that the idea of considering approximate QPDs has been considered in some form or another in other works.For instance, in [21] the quasiprobability coefficients of a complete circuit are optimized all at once in order to minimize the error of the computation.This optimisation problem is very difficult and does not come with the same properties and guarantees as a SDP.In [7] the authors utilize the quasiprobability method to simulate noise with reduced strength in order to be used in conjunction with the error extrapolation method.The used QPD is exact and not approximate, but the target channel F is chosen to be noisy instead of an ideal unitary channel.

Tradeoff curves
One can solve the SDP (7) for different values of γ budget to obtain a relation between the diamond norm error and the γ-factor.This function, which we call tradeoff curve, encapsulates the tradeoff between the approximation error and the sampling overhead.Example tradeoff curves for three gates are computed using a specific noise model and depicted in Figure 2. As expected, if the γ-factor budget  Tradeoff curves for the three quantum gates Ry (with an arbitrary angle since the noise model is the same for all rotation angles), CNOT and SWAP using the standard basis as decomposition set.The noisy channels of the three gates and the standard basis are extracted from a noise model in Qiskit [1] which approximates the noise on the ibmq melbourne device.More precisely, the noise model consists of a combination of depolarizing and thermal relaxation errors where the noise parameters are inferred from physical quantities measured on the hardware (e.g.T 1 , T 2 , gate times, qubit frequency, etc...).The CNOT gates exhibit an error rate of around 2% and the single-qubit gates exhibit an error rate of around 0.1%.The red dashed line represents the error (in diamond norm) of the reference noisy channel, i.e. when the gate is implemented as-is without QEM.We use the SDP solver in the MOSEK [17] software package through the CVXPY [9, 2] modelling language.
is larger than the optimal γ-factor of the non-approximate QPD, the error becomes zero.Similarly, when the γ-factor budget is exactly 1, then one does not gain any advantage over implementing the gate as-is without QEM.The more interesting regime is in between these two values of γ budget : One clearly sees that we can still significantly reduce the error of a gate, without having to pay the full γ-factor necessary for a non-approximate QPD.For the SWAP gate, the exact QPD requires a γ-factor of 2.21 to completely correct the gate.However, if we only pay a γ-factor of 1.21, we can still reduce the error by 67%.The saved costs in terms of the sampling overhead are substantial, since the number of shots scales as γ 2|C| , where |C| is the number of gates.
If one applies the approximate quasiprobability method to a circuit with multiple gates, a new degree of freedom emerges, which is not present in the original formulation of the quasiprobability method: How much γ-factor budget do we give to every individual gate?Assume we have a budget γ total for the whole circuit, how do we distribute that budget optimally across the whole circuit?More concretely, given N gates we have to find individual budgets γ budget,i ≥ 0 for the i-th gate where i = 1, . . ., N such that N i=1 γ budget,i = γ total .This is discussed in more detail in Appendix A.

Stinespring algorithm
For a fixed decomposition set an optimal QPD can be found by solving the LP (3).To further reduce the γ-factor it would be advantageous to also find the optimal decomposition set for a given hardware noise.This task is non-trivial for at least two reasons: 1.It is difficult to optimize over the space of all possible operations that a given quantum hardware can perform.In general this space is large and might vary significantly from one machine to another. 4.Performing tomography to characterize the operations E i is expensive.Hence, we want to limit the number of required uses of tomography.
In this section, we will introduce a novel method, called the Stinespring algorithm, that is able to deal with both issues.While the Stinespring algorithm is generally not able to produce the optimal decomposition set, it can be seen as an empirical approximation thereof.We demonstrate with simulations that the obtained decomposition set exhibits a significantly reduced γ-factor compared to the standard basis.The algorithm relies on a fundamental result in quantum information theory, stating that any quantum channel can be expressed as a unitary evolution on some extended Hilbert space.
Theorem 4.1 (Stinespring dilation).Consider a trace-preserving completelty positive (TPCP) map E ∈ TPCP(A, A).There exists a Hilbert space R and an isometry V ∈ L(A, A ⊗ R) such that for all density matrices ρ Furthermore, there exists an isometry with dim(R) ≤ r, where r is the rank of the quantum channel defined by r := rank(Λ E ) for Λ E the Choi matrix of E.
Any isometry V can be extended (generally non-uniquely) to a unitary U ∈ U(A ⊗ R, A ⊗ R) such that U acts equivalently to V on the space of states of the form The core idea of the Stinespring algorithm is to find the set of general quantum channels {E i } that allows the realization of a QPD with lowest possible γ-factors.These quantum channels are then approximately implemented on the quantum hardware by use of a Stinespring dilation. 5However, due to the significant hardware noise, the realized operation will only be an approximation of the desired quantum channel.The approximation error of the dilation is then accounted for in an iterative manner.
In practice, we will restrict the Stinespring algorithm to generate decomposition sets for errormitigation of 1-qubit and 2-qubit gates, as the involved computation and tomography quickly become expensive for larger gates.This does not restrict the applicability of the method, as any multi-qubit unitary can be decomposed in single and two-qubit gates.

Variational unitary approximation
To optimize the approximate execution of a quantum channel via the Stinespring dilation, we make use of two central observations: 1.Because the choice of the unitary U extending the isometry V is not unique we can try to choose the U which requires the least amount of two-qubit gates to implement.
2. Since the hardware is noisy, it may not be advantageous to implement a circuit that represents U exactly.It might be better to use a circuit that approximates U and requires less two-qubit gates and thus suffers less from noise on the hardware.
In this section, we propose a method that makes use of both of these observations, which we call variational unitary approximation.We use a variational circuit to implement the dilation unitary and fit its parameters in order to approximate the Stinespring isometry V as well as possible.Denote by θ the tuple of all variational parameters and U Var (θ) the unitary represented by the variational form.Furthermore we denote by V Var (θ) the submatrix of U Var (θ) restricted on the subspace where the input ancilla qubits are fixed to the zero state.We want to choose our parameters θ such that we minimize the difference between V Var (θ) and V , where this optimization is done with a classical computer.The variational unitary approximation allows us to freely choose the expressiveness of our variational circuit.More concretely, we can tune how many two-qubit gates it shall contain and therefore influence the tradeoff between the approximation error and the hardware noise error.
Variational unitary approximations are studied in various settings such as [13], where the interested reader can find more information.
In the following we are going to look at a concrete example to illustrate the gains from the variational unitary approximation.Consider an arbitrary 2-qubit quantum channel with rank 2 6 that we want to approximate with a Stinespring dilation.Let U be a 3-qubit unitary realizing a Stinespring dilation of the given channel.The exact representation of U typically requires around 100 CNOT gates (assuming linear qubit connectivity), which is significantly more than what current hardware can reasonably handle.We replace this ideal circuit with a RyRz variational circuit depicted in Figure 3, where m denotes the depth of the variational form, the total number of parameters is 6(m + 1) and the the total number of CNOT gates is 2m.We use the gradient-based BFGS algorithm implemented in SciPy [12] to minimize the objective V − V Var (θ) 2 . 7As initial guess for the parameters we use uniformly random numbers.The objective is non-convex and in practice the optimization algorithm finds a different local minimum depending on the provided initial guess.To obtain a good result in practice, we observed that it is enough to repeat the optimization 5 times for different initial values and then take best result.

|0
Ry (θ 1 ) Rz(θ 4 ) For purpose of numerical demonstration, the above procedure is performed for different depths of the variational form on a Haar-random 3-qubit unitary U .By using the same noise model as in Section 3.2, we can estimate how well our method allows us to approximate the 2-qubit quantum channel ρ → tr 3 [U ρU † ] where tr 3 stands for tracing out the third qubit.More precisely, we can compute the diamond norm error between the obtained 2-qubit quantum channel and the ideal 2-qubit quantum channel.This error encapsulates the approximation error of the variational form combined with the hardware noise.The results can be seen in Figure 4.The variational unitary approximation technique allows us to significantly reduce the error to roughly one quarter of its reference value.One can also clearly see that there is a sweet spot in the variational circuit depth.This makes sense intuitively when considering the tradeoff mentioned previously: if the depth is too short, then U is not well represented by the variational form and if the depth is too long, then the hardware noise takes over and we start to mostly sample noise.The optimum is reached at a variational circuit depth of 6.

Channel decomposition
Consider a trace-preserving linear operator F ∈ L(L(A), L(B)) of which we want to find an optimal QPD.In this section we will generalize the LP (3) by not only optimizing over the quasiprobability coefficients a i , but also over the decomposition set {E i } i .The resulting QPD will not be directly usable for the quasiprobability method, as the obtained decomposition will generally not be implementable by a quantum computer in practice.To highlight this, we will denote the decomposition set obtained by this optimisation procedure by{G i } i instead of {E i } i throughout this section.We call the resulting Figure 4: Numerical results from the variational unitary approximation of a Haar-random 3-qubit unitary U using a RyRz variational form.This plot depicts the diamond norm error of the induced 2-qubit quantum channel for different depths of the variational form.The depicted error contains the effects of the hardware noise and of the approximation error of the variational unitary.The noise model used is the same as in Section 3.2.The dashed red reference line corresponds to the error obtained when U is (exactly) decomposed in single and two-qubit gates (without the variational unitary approximation) and then executed on the hardware.
QPD the channel decomposition of F. We note that this decomposition was independently derived in [19,11].The channel decomposition will be an important step towards the realization of the Stinespring algorithm.
We denote the Choi matrix of F by Λ F and our goal is now to construct a finite set of Choi matrices {Λ Gi } i which correspond to a decomposition set {G i } i .The Choi representation is very convenient, as it allows us to formulate the TPCP-condition on the decomposition basis in a straightforward way: Λ Gi ≥ 0 and tr 2 [Λ Gi ] = 1  2 n 1 for all i where tr 2 stands for the partial trace over the ancillary Hilbert space of the Choi-Jamiolkowski isomorphism and n is the number of qubits involved in F. The optimization problem at hand can be expressed as By grouping all positive and negative quasiprobability coefficients together one can see that choosing N = 2 is enough to reach the optimum.By performing the substitution Λ± := a ± Λ G ± we can transform the problem into a SDP which can be solved efficiently The channel decomposition asserts that we can find TPCP maps G + , G − and a + , a − ≥ 0 such that F = a + G + − a − G − with optimal γ-factor γ = a + + a − .For the sake of the Stinespring algorithm, we would like to approximate the G ± using a Stinespring dilation.This will only work reasonably well when the number of required ancilla qubits is not too large.We can enforce this by adding an additional constraint rank(Λ ± Gi ) ≤ r for some r ∈ N into the SDP (9), i.e., where Λ± Gi := a ± i Λ ± Gi .Note that with the rank constraint we do not have the guarantee anymore that we can decompose F into just two channels, so generally it is not evident how small we can choose the number of positive and negative channels, n pos and n neg , respectively, while still finding the optimum.Because SDPs with rank constraints are NP-hard, we have to resort to heuristics to find a solution.One common approach is due to Burer-Monteiro [5,6].Suppose that we want to solve a given SDP with a rank constraint rank(C) ≤ r for some positive-semidefinite n × n matrix C. The main idea is to parametrize C = X † X for some r × n complex matrix X and then optimize over the matrix elements of X.The positive-semidefiniteness and rank constraint of C are automatically enforced from the construction.Unfortunately, the objective and the constraints generally contain quadratic terms of X, so the problem becomes a quadratically-constrained quadratic program.Still, recent research has demonstrated that the objective landscape of these problem tends to behave nicely and that using local optimization methods can provably lead to the global optimum under some assumptions [8,4].We defer further technical details to Appendix B.

The algorithm
The result in Section 4.2 allows us to quasiprobabilistically decompose any TPCP map F, which could for example be our desired unitary U, into rank r quantum channels that can be approximated with log 2 r ancilla qubits.By choosing r small enough and by making use of the variational unitary approximation, we can ensure that this approximation is reasonably good.Still, there is an important step missing before we can practically use this result.The quasiprobability method requires a QPD where the elements of the decomposition set correspond to channels describing noisy operations which the actual quantum hardware can execute.However, due to the noisy nature of the hardware, we can only execute an approximation of the channels G i obtained from the channel decomposition.We have to take into account the inaccuracy of the implemented Stinespring dilation when constructing our decomposition set.This will be achieved by the use of an iterative algorithm.
Assume that we have access to a noise oracle G → N (G) which gives us the actual quantum channel executed by the hardware when we implement the channel G using a Stinespring dilation.In general, since we are considering single and two-qubit gates, this oracle can be realized by simply performing tomography of the noisy realisations of the G i on the hardware.Assume that we found a channel decomposition using the rank-constrained optimization (10).If we were to implement F with the quasiprobability method by using the Stinespring approximation of the involved channels G ± i , an error δ would occur where We can iteratively repeat our procedure, but this time decomposing δ instead of F. During each point of the iteration we store the N (G ± i ) into our decomposition set.The δ can be obtained by performing an approximate QPD of the target unitary channel U into the decomposition set.At some point our decomposition set will be large enough such that the resulting approximation error is smaller than a desired threshold.A conceptual visualization of the procedure is depicted in Figure 5 and the exact algorithm can be found in Algorithm 1.
Step 2: Realize G i using Stinespring dilation and use noise oracle to find error caused by hardware noise Step 3: Determine error δ Step 4: Jump to Step 1 with δ as target operation repeat {(a j , D j )} j := compute approximate QPD coefficients of F using decomposition set D; ∆ := get diamond norm error of approximate QPD {(a j , D j )} j ; if ∆ < ∆ threshold then break; end δ := get the remaining error of the approximate QPD {(a j , D j )} j ; {G i } i := perform rank-constrained channel decomposition of δ into a set of channels; {V i } i := get Stinespring dilation isometries of the channels {G i } i ; {C i } i := get variational unitary approximation circuits of the isometries It is crucial that the Stinespring dilation allows for a reasonably accurate approximation of a desired quantum channel.In other words, the blue and green arrows in Figure 5 have to be sufficiently close.In the simulations presented in Section 4.4, we observed that this was only the case with the inclusion of the rank constraint in the channel decomposition.The variational unitary approximation technique significantly improves the approximation further.

Simulation results
We next show the performance of the Stinespring algorithm via simulation results on the gates Ry, CNOT, and SWAP, that we already analyzed in Section 3.2.For all three gates we enforce a rankconstraint r = 2 during the channel decomposition.This corresponds to allowing for at most one ancilla qubit during the Stinespring dilation.For the two-qubit gates we additionally make use of the variational unitary approximation with a 3-qubit RyRz variational form of depth 6 with linear qubit connectivity.The noise oracle is obtained from the noise model that was already used in Section 3.2, which aims to approximate the noise on the ibmq melbourne device.Using a noise model instead of a full tomography significantly speeds up the simulation.We use a threshold of ∆ threshold = 10 −7 to stop the iterative procedure.During each iteration of the Stinespring algorithm, we store the diamond norm error of the current approximate QPD (denoted ∆ in Algorithm 1). Figure 6 shows how this error decreases during the Stinespring algorithm.It is clearly visible that this decrease is exponential.As discussed in detail in Appendix B, each iteration of the Stinespring algorithm applied on the 2-qubit gates extends the decomposition set by 16 new operations.By considering that we need around 6-7 steps to reach the desired threshold, this implies that the produced decomposition set is significantly smaller than the standard basis (around 70-80 elements instead of 256).This result is remarkable and indicates that the Stinespring algorithm really does find a decomposition set that is well adapted to the hardware noise.As a reminder, the 256 elements in the standard basis were needed to completely span the space of Hermitian-preserving operators.The decomposition set produced by the Stinespring algorithm spans a significantly smaller space, as it is tailored to only represent one specific operation.
Table 2 denotes the γ-factors obtained by finding the optimal QPD using the decomposition set produced by the Stinespring algorithm for the three gates in question.We see that a significant improvement can be observed compared to using the decomposition set consisting of the standard basis and the respective noisy gate.As previously discussed, the sampling overhead of the quasiprobability method scales exponentially in the number of gates, where the γ-factor forms the basis of that exponential cost.Therefore any reduction in the γ-factor is significant and allows us to implement considerably larger quantum circuits.
Consider for instance that we have a circuit consisting of gates with equal sampling overhead γ = 1 + κ and we wish to fix the total sampling overhead Γ = (1 + κ) 2ngates .Then the number of gates n gates that one can error mitigate without exceeding the total sampling overhead Γ is given by 1 2 ln Γ ln 1+κ which is ≈ 1 2 ln Γ κ for κ ≈ 0. So a reduction of κ by a factor 2, which is roughly what we achieved for the Ry and CNOT gates, corresponds to doubling the number of gates that we can error mitigate.
Finally, we repeat the experiment on the CNOT gate for two additional noise models.We again consider noise models from Qiskit which approximate the noise on the ibmq sydney and ibmq mumbai hardware platforms.These two chips are newer and more advanced than ibmq melbourne and therefore exhibit lower noise.The results are depicted in table 3 and one sees that the Stinespring algorithm again provides a decomposition set with a significantly reduced γ-factor.
We do not have a guarantee that the Stinespring algorithm always converges.Such a proof would require a formalization of the heuristics that were made during the construction of the algorithm.In Appendix C we summarize these heuristics in detail.3: Simulation results of the Stinespring algorithm applied on the CNOT gate using noise models from different IBMQ hardware backends.The obtained γ-factor is compared to the value obtained when the respective gates are decomposed into the standard basis.

Discussion
The quasiprobability method is an error mitigation technique that allows us to suppress errors on small-scale devices without the need of implementing universal fault-tolerant quantum computing.The cost is a sampling overhead scaling as O(γ 2|C| ), where γ ≥ 1 is the γ-factor and |C| denotes the number of gates in the circuit that need to be error mitigated.We presented two novel methods to reduce the γ-factor.Since the sampling overhead scales exponentially in the number of gates with the γ-factor being the basis of the exponent, reducing γ is crucial.
In this work, we presented two improvements of the QPD method that both allow to reduce the overhead by minimizing the γ-parameter.The first contribution, i.e., the approximate QPD, allows us to elegantly generalize the quasiprobability method in an approximate setting, where one can make a tradeoff between an approximation error and the sampling overhead.Since in many practical applications it is very difficult to obtain exact estimates of the hardware noise, the quasiprobability method will anyways incur a significant error.In this setting it might be possible to reduce the sampling overhead without significantly increasing the error by making use of our approximate quasiprobability method.
Whereas the approximate QPD approach is fully analytical and mathematically rigorous, the second improvement (the Stinespring algorithm) relies on some numerical heuristics that we tested on various practical problems and verified its usefulness via simulation results.Nevertheless it would be important to prove the convergence or even the convergence rate of the Stinespring algorithm.We leave this for future work.Our results indicate that this approach could have the potential to significantly reduce the sampling overhead compared to existing methods, as can be seen from a simple calculation.Suppose we have a quantum circuit where we want to correct as many SWAP gates as possible with a total sampling overhead that is not larger than 10 3 .8Table 2 shows that via the previous QPD approach we can correct 4 SWAP gates, whereas our improved QPD method allows us to correct up to 16 SWAP gates. 9From an experimental point of view this technique also has the advantage that, in contrary to the standard basis, no measurements have to be executed on the data qubits.
Due to its inherent exponential scaling, neither the QPD nor other error mitigation techniques are believed to be able to replace the need for error correction for complicated quantum computations.However error mitigation techniques may be able to smoothen the transition into the fault-tolerant quantum computing era.Hence as a next step for future work it would be interesting to see how the improved QPD techniques presented in this work can be combined with error correction ideas.

A Optimal resource distribution
Section 3 motivates the question of how to distribute a given γ-factor budget on different individual gates.Before we continue, we have to clarify what objective function we seek to optimize when we talk about an optimal budget distribution.The overall goal is to minimize the error (in terms of the diamond norm) of the total circuit.Denote by F approx,i the channel produced by the approximate QPD of the i-th quantum gate and U i the unitary corresponding to the i-th gate, where The computation of this quantity is intractable, as it would require to simulate the complete noisy circuit.Therefore, we need to find a proxy that is computable from local quantities.
From the triangle inequality and the fact that the diamond norm is a contraction under completely positive maps, one immediately obtains This curve shows the optimal γ-factor budget distribution across a circuit consisting of a single Ry and CNOT gate.The total γ-factor budget γ total is varied between 1 and 1.191, which is the minimal value necessary to obtain perfect QPD for both gates at the same time.The red dashed lines denote the necessary γ-factor to decompose the respective gate perfectly.The black line denotes γ total and therefore corresponds to the multiplication of the blue and yellow curves.
for any completely positive maps E, F, E , F where E, E are trace-preserving.If we wish to apply this bound to our setting in (11), we have to guarantee that the F approx,i are TPCP.This can be achieved by inserting an additional constraint into the SDP of the approximate QPD.Using the notation ε i (•) to denote the tradeoff curve of the i-th gate, we thus obtain the optimization problem Problem ( 12) is generally non-convex, as we have not enough guarantees on the shape of the tradeoff curves.Still in practice we observe that the objective is convex in a broad region around the optimum, so we expect numerical heuristics based on gradient descent to work well.
We finish this section with a small demonstration on a simple circuit that consists of a Ry and a CNOT gate.The tradeoff curves are computed exactly as in Section 3.2, with the sole difference that we now include a TPCP-constraint for the approximate QPD.We evaluate the tradeoff curves at discrete points and use linear interpolation in order to extrapolate this data to a full function that can be used in the optimization routine.We solve the optimization problem (12) for different values of γ total using a black box solver based on the BFGS algorithm implemented in the SciPy [12] software package.The results are depicted in Figure 7. Interestingly, the optimal strategy for budget distribution is nontrivial.In the regime with a small budget 1 ≤ γ total 1.02, it is optimal to give most of the γ-factor budget solely to the CNOT gate.In a transition regime 1.02 γ total 1.03 both gates obtain a significant amount of γ-factor budget.In the upper regime 1.03 γ total the Ry gate is perfectly decomposed and the remaining budget is then given to the CNOT gate.So whether one should prioritize the CNOT or the Ry gate in the budgeting strongly depends on the value of γ total .

B Details on rank-constrained channel decomposition
In our case the Λ ± Gi do not show up in the objective function, so we actually have a quadraticallyconstrained linear program, i.e. a problem of the form where we grouped all variables into a large vector x, f is a linear function and g is a quadratic map that encapsulates all constraints.Furthermore, we know the optimum f * of the objective function without the rank constraint.Let's make an assumption for the moment being: Assumption B.1.The rank-constrained optimization problem (10) reaches the same minimal γ-factor as the original SDP without the rank constraint.
This assumption does obviously not hold for r = 1, as it would imply that we could decompose any F ∈ TP(A, A) into unitary operations.We are only concerned with the case r ≥ 2 in practice, so let us assume for the moment that the assumption is correct.This allows us to reformulate (13) as We say that our optimization only succeeds if it finds a solution of problem ( 14) which minimizes the quadric objective g(x) 2 2 to zero.The reason why we swap objective and constraint is because we observed linear constraints to be easier to deal with numerically than quadratic (and possibly nonconvex) constraints.
To numerically solve problem (14) we made use of the trust-region algorithm for constrained optimization which is included in SciPy [12].We fixed r = 2, which corresponds to allowing at most one ancilla qubit in the Stinespring dilation.In the single-qubit case 10 we could consistently find a good solution, whereas in the two-qubit case we were often confronted with convergence problems.It is not exactly clear whether these problem stemmed from the non-existence of a low-rank solution (i.e.Assumption B.1 being wrong) or from numerical issues with the Burer-Monteiro method.In any case, we observed that the convergence could be significantly improved by allowing some error in the linear constraint, i.e., where we used a value of ε = 0.2.In other words, we allow for a sub-optimality in the γ-factor by at most 20%.In principle one could try to find the smallest admissible ε by usage of bisection.We used n pos = n neg = 2 for single-qubit quantum channels and n pos = n neg = 8 for two-qubit quantum channels.
We also observed a good initialization for problem (14) to be of major importance.In the rest of the section we focus on this aspect.The task at hand is to solve following optimization problem We restrict ourselves to the case of r = 2, which corresponds to allowing a single ancilla qubit in the Stinespring dilation.In order to solve this problem with a local method, an initial guess a ±,0 i , X ±,0 i for the parameters is required.We present a heuristic to find such initial values which seems to work well in our experience.We start off by computing the channel difference decomposition F = a + G + − a − G − of our target operation.We consider the spectral decomposition of the Choi matrices of G where λ ± i denote the eigenvalues and u ± i denote the corresponding eigenvectors.We chose the indices such that the λ ± i are ordered non-increasingly in i. Equation ( 15) is a decomposition of G ± into rank-1 operations.Since we want to find a decomposition into rank-2 operations, we group these rank-1 matrices into pairs We choose our initial guess as a ±,0 Therefore the only condition that is not yet fulfilled is the trace-preservingness of the Λ± Gi .

C Heuristics for Stinespring algorithm
It is natural to ask whether we can prove the convergence of the Stinespring algorithm presented in Section 4 or derive bounds on the approximation error.Unfortunately, it seems difficult to even prove convergence, as several heuristics were used.In this section we aim to provide a clear overview of these heuristics.
Existence of a low-rank channel decomposition.It is not immediately clear if Assumption B.1 holds in practice or not.The fact that we observed a value of ε > 0 to be necessary for good convergence could be an indicator that it does indeed not hold.Furthermore it is not clear what minimal values of n pos , n neg are required to reach this minimum.
Burer-Monteiro convergence.In recent years, there has been an impressive progress in mathematically demonstrating the convergence and convergence speed of the Burer-Monteiro heuristic [4,8].However, it is not clear if they can be applied to our setting.
Quality of the Stinespring dilation approximation.Even under the assumption that there exists a solution of problem (10) and that we can find it efficiently, the convergence speed of the Stinespring algorithm strongly depends on how well we can approximate an arbitrary quantum channel with the Stinespring dilation.In technical terms, we require the noise oracle N to be as closed to the identity as possible.It would be interesting for future work to prove the convergence speed and resulting γ-factor depending on some kind of fidelity measure of N .

D Simulation results for depolarizing noise
In case the gate errors consist of purely depolarizing noise, the optimal QPD to realize the inverse map is analytically known and its corresponding decomposition set only contains noisy Pauli operations [22].So for practical considerations, there is not much use in applying the Stinespring algorithm to this specific type of noise.However, it can still be interesting to run the Stinespring algorithm and compare the quality of the obtained QPD to the ideal one.Indeed in fig.8, we report that the Stinespring algorithm is able to recover a QPD that is almost as good as the optimal one.The value of γ − 1 only differs only by roughly 3% to 4% from the optimal value.The simulations are performed for different values of the depolarizing error rate p of two-qubit gates, where the depolarizing rate of single-qubit gates is assumed to be 0.1p.The depolarizing rate p for two-qubit gates is varied.The depolarizing gate of single-qubit gates is assumed to be 0.1p.The γ-factor of the optimal QPD is computed using the channel difference decomposition introduced in Section 4.2.

E Comparison with standard basis QPD
The goal of this section is to provide further illustrations in order to provide some insight into how the decomposition set obtained from the Stinespring algorithm outperforms the decomposition set consisting of the standard basis.For this purpose, we consider the simulation results obtained in Section 4.4 for the CNOT gate using the noise model of ibmq melbourne.Figure 9 (a) depicts the absolute value of the quasiprobability coefficients of the QPD corresponding to the decomposition set obtained from the Stinespring algorithm.The colors indicate in which iteration of the Stinespring algorithm a specific element was added to the decomposition set.It can be seen that the quasiprobability coefficients from later rounds have significantly lower contribution to the γ-factor than the earlier ones.Similarly, Figure 9 (b) depicts the magnitude of the quasiprobability coefficients ordered in descending order when the ideal gate is decomposed into the decomposition set consisting of the standard basis as well as the noisy realisation of the gate.Due to the larger number of operations which have a large quasiprobability coefficient, the resulting γ-factor is higher.Table 4 depicts the 13 operations in the decomposition set which exhibit the largest magnitude of their quasiprobability weights.

Figure 1 :
Figure 1: Graphical representation of the quasiprobability method.The ideal quantum gate U is randomly replaced by a quantum channel E i that the hardware can implement.The output of the measurement O must be weighted according to the sampled operation.

Figure 2 :
Figure2: Tradeoff curves for the three quantum gates Ry (with an arbitrary angle since the noise model is the same for all rotation angles), CNOT and SWAP using the standard basis as decomposition set.The noisy channels of the three gates and the standard basis are extracted from a noise model in Qiskit[1] which approximates the noise on the ibmq melbourne device.More precisely, the noise model consists of a combination of depolarizing and thermal relaxation errors where the noise parameters are inferred from physical quantities measured on the hardware (e.g.T 1 , T 2 , gate times, qubit frequency, etc...).The CNOT gates exhibit an error rate of around 2% and the single-qubit gates exhibit an error rate of around 0.1%.The red dashed line represents the error (in diamond norm) of the reference noisy channel, i.e. when the gate is implemented as-is without QEM.We use the SDP solver in the MOSEK[17] software package through the CVXPY[9,2] modelling language.

Figure 3 :
Figure 3: Variational circuit used to approximate the the Stinespring unitary.

Figure 5 :Algorithm 1 :
Figure 5: Graphical visualization of the iterative process involved in the Stinespring algorithm.

Figure 6 :
Figure 6: Evolution of the approximation error in each step of the Stinespring algorithm.Three different runs of the algorithm on three different gates Ry, CNOT and SWAP are depicted.The hardware noise is estimated using a noise model approximating the ibmq melbourne device.The horizontal grey line denotes the threshold of 10 −7 at which the algorithm stops.
Figure7: This curve shows the optimal γ-factor budget distribution across a circuit consisting of a single Ry and CNOT gate.The total γ-factor budget γ total is varied between 1 and 1.191, which is the minimal value necessary to obtain perfect QPD for both gates at the same time.The red dashed lines denote the necessary γ-factor to decompose the respective gate perfectly.The black line denotes γ total and therefore corresponds to the multiplication of the blue and yellow curves.

i:
= tr[Y i ] and X ±,0 i := Yi tr[Yi] .If n pos = n neg < 1 2 4n this implies that we only consider the 2n pos largest eigenvalues and discard the others.As already mentioned, in our implementation we chose n pos = n neg = 1 2 4 n .11This way Λ by the initial guess.Furthermore we also already fulfill that the γ-factor is

Figure 8 :
Figure8: Comparison of γ-factor obtained from Stinespring algorithm with optimal γ-factor.The depolarizing rate p for two-qubit gates is varied.The depolarizing gate of single-qubit gates is assumed to be 0.1p.The γ-factor of the optimal QPD is computed using the channel difference decomposition introduced in Section 4.2.

Table 1 :
16basis operations constituting the standard basis, which was introduced in [10] using the notation [U ](•) = U (•)U † .All these operations can be realized using Hadamard gates H, phase gates S and measurement-postselection operations P 0 .
in terms of a linear program (LP) min ai∈R i

Table 2 :
Simulation results of the Stinespring algorithm applied on the three quantum gates Ry, CNOT, and SWAP.The obtained γ-factor is compared to the value obtained when the respective gates are decomposed into the standard basis.