Introduction

Recent advances in quantum technologies have resulted in the availability of noisy intermediate-scale quantum (NISQ) devices, promising advantages of quantum information processing by controlling tens to hundreds of qubits1,2. However, inevitable noise remains a critical roadblock for their practical use; every gate has a chance of error, and the continuing accumulation of these errors will eventually destroy any potential quantum advantage. While quantum error correction provides an in-principle means to suppress such errors indefinitely, it involves measuring error syndromes and making adaptive corrections. In contrast, NISQ devices often cannot adaptively execute quantum operations.

This technological hurdle has motivated the study of quantum error mitigation, resulting in a diverse collection of alternative techniques (e.g., zero-noise error extrapolation3,4,5,6,7,8, probabilistic error cancellation3,9,10,11,12,13, and virtual distillation14,15,16,17,18,19). All share in common that they avoid adaptive operations. Instead, error-mitigation algorithms suppress errors by sampling available noisy devices many times and classically post-processing the measurement outcomes. Such techniques generally have drastically reduced technological requirements, providing potential near-term solutions for suppressing errors in other NISQ algorithms (e.g., variational algorithms for estimating the ground state energy in quantum chemistry20,21,22,23).

The performance of these algorithms is typically analyzed on a case-by-case basis. While this is crucial for understanding the value of a particular methodology in a specific practical context, it leaves open a fundamental question: What is the ultimate potential of quantum error mitigation? The motivation to answer this question parallels the development of heat engines. There, Carnot’s theorem allows us to understand the ultimate efficiency of all possible heat engines24, allowing us to know what is physically forbidden and enabling a universal means to understand what specific engines have the greatest room for potential improvement.

Here, we initiate a research program toward characterizing the ultimate limits of quantum error mitigation. We propose a framework to formally define error mitigation as any strategy that requires no adaptive quantum operations (see Fig. 1). We introduce the maximum estimator spread as a universal benchmark for error-mitigation performance: a quantity that tells us how many extra runs of a NISQ device are needed to guarantee that outputs are within some desired accuracy threshold. We then derive fundamental lower bounds for this spread that no current or yet-undiscovered error-mitigation strategy can violate. Our bounds are expressed in terms of the reduction in the distinguishability of quantum states caused by noise, providing an operational understanding of the cost of error mitigation.

Fig. 1: Quantum error mitigation.

A A major goal of many near-term algorithms is to estimate the expectation value of some observable A, when acting on the output ψ of some idealized computation U applied to some input ψin. B However, noise prevents the exact synthesis of ψ. Quantum error-mitigation protocols help estimate the true expectation value \(\langle A\rangle ={{{\rm{Tr}}}}(A\psi )\) without using the adaptive quantum operations necessary in general error correction. This is done by (1) using available NISQ devices to synthesize N distorted quantum states \({\{{{{{\mathcal{E}}}}}_{n}(\psi )\}}_{n = 1}^{N}\) and (2) applying some physical process \({{{\mathcal{P}}}}\) to these distorted states to produce a random variable EA that approximates 〈A〉. This procedure can then be repeated over M rounds to draw M samples of EA, whose mean is used to estimate 〈A〉. C We can characterize the efficacy of such a protocol by (1) its spread ΔeA, the difference between the maximum and minimum possible values of EA, and (2) its bias \({b}_{A}(\psi )=\left\langle {E}_{A}\right\rangle -\langle A\rangle\). Here we derive ultimate lower bounds on ΔeA for each given bias that no such error-mitigation protocol can go below, as well as tighter bounds when \({{{\mathcal{P}}}}\) is restricted to coherent interactions over only Q noisy devices at a time. This then tells us how many times \({{{\mathcal{P}}}}\) must be executed to estimate 〈A〉 within some desired accuracy and failure probability.

We then illustrate two immediate consequences of our general bounds. The first is in the context of mitigating local depolarizing noise in variational quantum circuits20,25. We show that the maximum estimator spread grows exponentially with circuit depth for general error-mitigation protocols, confirming a suspicion that the well-known exponentially growing estimation error observed in several existing error-mitigation techniques3,26 is a consequence of a fundamental obstacle shared by general error-mitigation strategies. Our second study shows that probabilistic error cancellation, a prominent method of error mitigation, minimizes the maximum estimator spread when mitigating local dephasing noise acting on an arbitrary number of qubits. These results showcase how our bounds can help rule out error-mitigation performance targets that are unphysical, and identify which methods are already near-optimal.

Results

Framework

Our framework begins by introducing a formal definition of error mitigation. Consider an ideal computation described by (1) application of some circuit U to some input ψin and (2) measurement of the output state ψ in some arbitrary observable A (see Fig. 1A). In realistic situations, however, there is noise, such that we only have access to NISQ devices capable of preparing certain distorted states \({{{\mathcal{E}}}}(\psi )\). The aim is then to retrieve the desired output data specified by \(\langle A\rangle ={{{\rm{Tr}}}}(A\psi )\). Here, we assume \(-{\mathbb{I}}/2\le A\le {\mathbb{I}}/2\) without loss of generality. This is because any observable O can be shifted and rescaled to some A satisfying this condition, from which full information of O can be recovered. For instance, if we are interested in a non-identity Pauli operator P, which has eigenvalues ±1, we instead consider an observable A = P/2. Note also that while ψ is pure in many practically relevant instances, our analysis applies equally when ψ is mixed.

We consider NISQ devices with no capacity to execute adaptive quantum operations. That is, they cannot enact different quantum operations conditioned on a measurement outcome. We then refer to an algorithm aimed to estimate 〈A〉 under such constrained devices as an error-mitigation strategy. Each error-mitigation strategy involves sampling NISQ devices configured in N settings for some integer N. Denote the states generated by these configurations by \({{{{\mathcal{E}}}}}_{1}(\psi ),\ldots ,{{{{\mathcal{E}}}}}_{N}(\psi )\), with effective noise channels \({\{{{{{\mathcal{E}}}}}_{i}\}}_{i = 1}^{N}\), where these effective noise channels can be different from each other in general. The effective noise channel is a non-adaptive operation that connects an ideal state to a distorted state and may be different from the actual noise channel that happens in the NISQ device. Nevertheless, one can always find such an effective noise channel given the descriptions of the actual noise channels and the idealized circuit U. The strategy then further describes some physical process \({{{\mathcal{P}}}}\)—which is independent of either the input ψin or the ideal output ψ—that takes these distorted states as input and outputs some classical estimate random variable EA of \({{{\rm{Tr}}}}(A\psi )\) (see Fig. 1B). The aim is to generate EA such that its expected value 〈EA〉 is close to \({{{\rm{Tr}}}}(A\psi )\). Each round of the protocol involves generating a sample of EA. M rounds of this procedure then enable us to generate M samples of EA, whose mean is used to estimate \({{{\rm{Tr}}}}(A\psi )\).

Each error-mitigation strategy can then be entirely described by its choice of \({{{\mathcal{P}}}}\) and \({\{{{{{\mathcal{E}}}}}_{i}\}}_{i = 1}^{N}\). Our most fundamental bounds pertain to all possible choices. However, we can often make these bounds tighter in situations where further practical limitations constrain how many distorted states \({{{\mathcal{P}}}}\) can coherently interact. Error-mitigation protocols under such constraints typically select N = KQ to be a multiple of Q, such that the N distorted states are divided into K clusters, each containing Q distorted states. We label these as \({\{{{{{\mathcal{E}}}}}_{q}^{(k)}(\psi )\}}_{q = 1,k = 1}^{Q,K}\) for convenience. \({{{\mathcal{P}}}}\) is then constrained to represent (1) local measurement procedures M(k) that can coherently interact distorted states within the kth cluster (i.e., \({\{{{{{\mathcal{E}}}}}_{q}^{(k)}(\psi )\}}_{q = 1}^{Q}\)) to produce some classical interim outputs i(k) and (2) a classical post-processing function eA that transforms the interim outputs \({\{{i}^{(k)}\}}_{k = 1}^{K}\) into a sample of EA.

We name such a protocol as (Q, K)-error mitigation, and refer to the generation of each i(k) as an experiment. Each round of a (Q, K)-error mitigation protocol thus contains K experiments on systems of up to Q distorted states. We also summarize the above procedure in Fig. 2 and give a formal mathematical definition in Methods. Figure 3 and accompanying captions discuss how several prominent error-mitigation methods fit into this framework.

Fig. 2: Schematic of a (Q, K)-error mitigation protocol.

A (Q, K)-error mitigation protocol is motivated when practical considerations limit to Q the maximum number of distorted states that our mitigation process \({{{\mathcal{P}}}}\) can coherently interact. A general approach then divides these into K = N/Q groups of size Q. To estimate 〈A〉 of some ideal output state ψ, each round of (Q, K)-mitigation involves first using available NISQ devices to generate Q copies of each distorted state \({{{{\mathcal{E}}}}}_{q}^{(k)}(\psi )\), for each of k = 1, …, K. These distorted states are then grouped together as inputs into K experiments, where each group consists of a single copy of each \({{{{\mathcal{E}}}}}_{q}^{(k)}(\psi )\). The kth experiment then involves applying some general (possibly entangling) POVM \(\{{M}_{{i}^{(k)}}^{(k)}\}\) on the kth grouping, resulting in measurement outcome i(k). Classical computing is then deployed to produce an estimate eA(i(1), … , i(K)) whose average after M rounds of the above process is used to estimate \({{{\rm{Tr}}}}(A\psi )\). Note that there can be additional quantum operations before the POVM measurements \(\{{M}_{{i}^{(k)}}^{(k)}\}\), but these can be absorbed into the description of the POVMs without loss of generality.

Fig. 3: Error-mitigation protocols.

Our framework encompasses all commonly used error-mitigation protocols, a sample of which we outline here. A Probabilistic error cancellation3 assumes we can act coherently on only a single distorted state each round, where it seeks to undo a given noise map \({{{\mathcal{E}}}}\) by applying a suitable stochastic operation \({{{\mathcal{B}}}}\). Thus it corresponds to the case of Q = K = 1. B Rth order noise extrapolation3,4 assumes the capacity to synthesize R + 1 NISQ devices whose outputs represent distortions of ψ at various noise strengths. It then uses individual measurements of an observable A on these distorted states to estimate the observable expectation value in the zero-noise limit. Thus it is an example where Q = 1 and K = R + 1. C Meanwhile, R-copy virtual distillation14,15 involves running an available NISQ device R times to synthesize R copies of a distorted state \({{{\mathcal{E}}}}(\psi )\). Coherent interaction \({{{\mathcal{D}}}}\) over these copies followed by a suitable measurement MA then enables improved estimation of 〈A〉. Thus it is an example where K = 1 and Q = R. In the main text and Methods, we provide a detailed account of each protocol and how it fits within our framework.

Several comments on our error-mitigation framework are in order. We first note that, for a given set of noisy circuits that result in effective noise channels \({\{{{{{\mathcal{E}}}}}_{i}\}}_{i = 1}^{N}\), our framework assumes that an additional process \({{{\mathcal{P}}}}\) is applied after the noisy circuits and does not include processes within the initial noisy circuits. Our framework thus excludes error correction, which employs adaptive processes integrated into noisy circuits. This allows our framework to differentiate error mitigation from error correction and makes it useful for investigating the limitations imposed specifically on the former.

One might think that this would overly restrict the scope of error mitigation, which could also use some processes in noisy circuits. This can be avoided by considering that such processes are already integrated into the description of effective noise channels \({\{{{{{\mathcal{E}}}}}_{i}\}}_{i = 1}^{N}\). In other words, the effective noise channel can be considered as a map that connects an ideal state to a distorted state affected by not only a noise channel but also non-adaptive processes accessible to a given near-term device; the error-mitigation process \({{{\mathcal{P}}}}\) is then an additional process that follows them. This is manifested in the Rth order noise extrapolation in Fig. 3B, in which R + 1 different noise levels realized on a near-term device are represented by the set \({\{{{{{\mathcal{E}}}}}_{i}\}}_{i = 1}^{R+1}\) of effective noise channels.

More broadly, taking appropriate effective noise channels allows our framework to include error-mitigation protocols that employ modified circuits. Namely, if \({\{{{{{\mathcal{N}}}}}_{i}\}}_{i = 1}^{N}\) are the noisy circuits that an error-mitigation protocol employs and \({{{\mathcal{U}}}}\) is the ideal circuit, then such an error-mitigation strategy is encompassed in our framework with \({{{{\mathcal{E}}}}}_{i}={{{{\mathcal{N}}}}}_{i}\circ {{{{\mathcal{U}}}}}^{{\dagger} }\). This, for instance, includes the conventional strategy of probabilistic error cancellation applied to a noisy circuit, in which a probabilistic operation is applied after every noisy gate.

We also remark that our framework leaves open how to choose the round number M and the sample number N = KQ per round for a given shot budget; if the total shot budget is T, one is free to choose any N and M such that T = NM. As we describe shortly, our results in Theorem 1 and Corollary 2 are concerned with the number of rounds M, and they apply to any choice of shot allocation. However, our results become most informative when M is chosen as large (equivalently, N as small) as possible. The strategies in Fig. 3 admit small values of N that do not scale with the total shot budget, representing examples for which our results give fruitful insights into their round number M. On the other hand, some strategies that employ highly nonlinear computation on the measurement outcomes (e.g., exponential noise extrapolation11, subspace expansion27) require a large N, in which case our results on the round number M can have a large gap from the actual sampling cost.

Our framework also allows one to assume some pre-knowledge prior to the error-mitigation process. For instance, this includes information about the underlying noise or some pre-computation that the error-mitigation process can use in its strategy. The results in Theorem 1 and Corollary 2 then give information about the round number M given such pre-knowledge. Since the process of obtaining the pre-knowledge itself may be considered a part of the error-mitigation process, there are many possible divisions between the pre-computation and the error-mitigation process. Our results apply to any choice of pre-knowledge, and this can be flexibly chosen depending on one’s interest. For instance, R-copy virtual distillation can be considered as an (R, 1)-error mitigation (that is, N = R) as in Fig. 3C under the pre-knowledge of an eigenvalue of the noisy state, which is one of the settings discussed in ref. 14 (see also Methods). This pre-knowledge allows for a small choice of N, making the estimation of the round number M by our method insightful. Another example is Clifford Data Regression28, which can employ a linear regression based on a pattern learned from a training set. By considering the first learning step as the pre-computation, our results provide a meaningful bound for the sampling cost in the latter stage, in which the output from the circuit of interest is compared to the model estimated from the training set.

Up to the flexibility described above, our framework encompasses a broad class of error-mitigation strategies proposed so far3,4,11,14,15,27,28,29,30,31,32.
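To make the (Q, K) bookkeeping concrete, the following minimal sketch treats Rth order noise extrapolation as a (1, R + 1) protocol. It assumes Richardson-type extrapolation, which is one common choice and is not spelled out in the text above; the noise-scaling factors `levels` and the polynomial `toy_expectation` are hypothetical values chosen only for illustration. The estimator is a linear combination of the K = R + 1 single-shot outcomes, so for outcomes in [−1/2, 1/2] its spread is the sum of the absolute values of the coefficients.

```python
import numpy as np

def richardson_coefficients(noise_levels):
    """Coefficients beta_i with sum(beta_i) = 1 and sum(beta_i * lam_i**k) = 0
    for k = 1..R, i.e. Lagrange interpolation evaluated at zero noise."""
    lam = np.asarray(noise_levels, dtype=float)
    betas = []
    for i in range(len(lam)):
        others = np.delete(lam, i)
        betas.append(np.prod(others / (others - lam[i])))
    return np.array(betas)

# Hypothetical noise-scaling factors for R = 2 (so K = R + 1 = 3 experiments).
levels = np.array([1.0, 2.0, 3.0])
beta = richardson_coefficients(levels)

def toy_expectation(lam):
    """Toy noisy expectation value, exactly quadratic in the noise level.
    The zero-noise value is 0.4 (an assumption for illustration)."""
    return 0.4 - 0.1 * lam + 0.02 * lam ** 2

# Post-processing e_A: linear combination of the K experiment results.
estimate = float(np.dot(beta, toy_expectation(levels)))

# With single-shot outcomes in [-1/2, 1/2], max(e_A) - min(e_A) = sum |beta_i|.
spread = float(np.sum(np.abs(beta)))
print(beta, estimate, spread)
```

In this toy setting the extrapolation recovers the zero-noise value exactly because the model is polynomial of degree R; for real devices the residual would appear as bias.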

Quantifying performance

The performance of an error-mitigation protocol is determined by how well the random variable EA governing each estimate aligns with \({{{\rm{Tr}}}}(A\psi )\). We can characterize this by (1) its bias, representing how close 〈EA〉 is to the ideal expectation value \({{{\rm{Tr}}}}(A\psi )\) and (2) its spread, representing the amount of intrinsic randomness within EA.

A protocol’s bias quantifies the absolute minimum error with which it can estimate \({{{\rm{Tr}}}}(A\psi )\), given no restrictions on how many rounds it can run (i.e., samples of EA it can draw). Mathematically, this is represented by the difference \({b}_{A}(\psi )=\langle {E}_{A}\rangle -{{{\rm{Tr}}}}(A\psi )\). Since the error-mitigation strategy should work for an arbitrary state ψ and observable A, we can introduce the maximum bias:

$${b}_{\max }:=\mathop{\max }\limits_{-{\mathbb{I}}/2\le A\le {\mathbb{I}}/2}\mathop{\max }\limits_{\psi }\,|\langle {E}_{A}\rangle -{{{\rm{Tr}}}}(A\psi )|$$
(1)

to bound the bias of an error-mitigation protocol in estimating expectation values over all output states and observables of interest. Hereafter, we will also assume \({b}_{\max }\le 1/2\), as this condition must be satisfied for any meaningful error-mitigation protocol. This is because a maximum bias of 1/2 can always be achieved by the trivial “error-mitigation” protocol that outputs eA = 0 regardless of ψ or A.

Of course, having \({b}_{\max }=0\) still does not guarantee an effective error-mitigation protocol. Each sample of EA will also deviate from \({{{\rm{Tr}}}}(A\psi )\) due to intrinsic random error. The greater this randomness, the more samples we need from EA to ensure that the mean of our samples is a reliable estimate of its true expectation value 〈EA〉. The relation is formalized by Hoeffding’s inequality33. Namely, suppose \({\{{x}_{i}\}}_{i = 1}^{M}\) are M samples of a random variable X with \({x}_{i}\in [a,b]\). Then the number M of samples that ensures an estimation error \(| \left\langle X\right\rangle -{\sum }_{i}{x}_{i}/M| \,<\, \delta\) with probability 1 − ε is given by \(\frac{| a-b{| }^{2}}{2{\delta }^{2}}\log (2/\varepsilon )\propto | a-b{| }^{2}\). In our context, the latter quantity corresponds to the maximum spread in the outcomes of the estimator function eA defined by:

$${{\Delta }}{e}_{\max }:=\mathop{\max }\limits_{-{\mathbb{I}}/2\le A\le {\mathbb{I}}/2}{{\Delta }}{e}_{A},$$
(2)

where ΔeA is the difference between the maximum and minimum possible values that EA can take, i.e., \({{\Delta }}{e}_{A}:={e}_{A,\max }-{e}_{A,\min }\) where \({e}_{A,\max }:=\mathop{\max }\nolimits_{{i}^{(1)}\ldots {i}^{(K)}}{e}_{A}({i}^{(1)}\ldots {i}^{(K)})\) and \({e}_{A,\min }:=\mathop{\min }\nolimits_{{i}^{(1)}\ldots {i}^{(K)}}{e}_{A}({i}^{(1)}\ldots {i}^{(K)})\).

\({{\Delta }}{e}_{\max }\) thus directly relates to the sampling cost of an error-mitigation protocol. An error-mitigation protocol whose estimates have maximum spread \({{\Delta }}{e}_{\max }\) must draw \({{{\mathcal{O}}}}({{\Delta }}{e}_{\max }^{2}\log (1/\varepsilon )/{\delta }^{2})\) samples of EA to ensure that its estimate of 〈EA〉 has accuracy δ and failure rate ε. Therefore, we may think of \({{\Delta }}{e}_{\max }\) as a measure of computational cost or feasibility. Its exponential scaling with respect to the circuit depth, for example, would imply eventual intractability in mitigating associated errors in a class of non-shallow circuits.
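As a rough numerical illustration of this scaling, the sketch below evaluates the Hoeffding sample count quoted above for a given spread, accuracy δ, and failure rate ε. The function name `hoeffding_rounds` and the example numbers are illustrative choices, not part of the protocol definition.

```python
import math

def hoeffding_rounds(spread, delta, eps):
    """Rounds M from Hoeffding's inequality: spread^2 * log(2/eps) / (2 * delta^2),
    rounded up to an integer. `spread` plays the role of |a - b|."""
    return math.ceil(spread ** 2 * math.log(2.0 / eps) / (2.0 * delta ** 2))

# Illustrative numbers: accuracy 0.01 and failure rate 5%.
rounds_unit_spread = hoeffding_rounds(spread=1.0, delta=0.01, eps=0.05)
rounds_double_spread = hoeffding_rounds(spread=2.0, delta=0.01, eps=0.05)

# Doubling the spread roughly quadruples the number of rounds.
print(rounds_unit_spread, rounds_double_spread)
```

The quadratic dependence on the spread is what turns an exponentially growing spread into an exponentially growing sampling cost.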

We note that if the variance of EA happens to be small, the actual sampling cost required to achieve the accuracy δ and failure rate ε can be smaller than the estimate based on the maximum spread. In this sense, \({{\Delta }}{e}_{\max }\) quantifies the round number M that one would practically use in the worst-case scenario. However, knowing the variance of EA beforehand is a formidable task in general, and the worst-case estimate gives a useful benchmark to assess the feasibility of a given error-mitigation strategy in such situations.

Fundamental limits

Our main contribution is to establish a universal lower bound on \({{\Delta }}{e}_{\max }\). Our bound then determines the minimum number of times an error-mitigation method must sample EA (and thus the number of times we invoke a NISQ device) to estimate 〈A〉 within some tolerable error.

To state the bound formally, we utilize measures of state distinguishability. Consider the scenario where Alice prepares a quantum state in either ρ or σ with equal probability and challenges Bob to guess which is prepared. The trace distance \({D}_{{{{\rm{tr}}}}}(\rho ,\sigma )=\frac{1}{2}\parallel \rho -\sigma {\parallel }_{1}\) (where \(\parallel \cdot {\parallel }_{1}\) is the trace norm) then represents the quantity such that Bob’s optimal probability of guessing correctly is \(\frac{1}{2}(1+{D}_{{{{\rm{tr}}}}}(\rho ,\sigma ))\). When ρ and σ describe states on K-partite systems \({S}_{1}\otimes \cdots \otimes {S}_{K}\), we can also consider the setting in which Bob is constrained to local measurements, resulting in the optimal guessing probability \(\frac{1}{2}(1+{D}_{{{{\rm{LM}}}}}(\rho ,\sigma ))\) where DLM is the local distinguishability measure34 (see also Methods). In our setting, we identify each local subsystem Sk with a system corresponding to the kth experiment in Fig. 2. We are then in a position to state our main result:

Theorem 1 Consider an arbitrary (Q, K)-mitigation protocol with maximum bias \({b}_{\max }\). Then, its maximum spread \({{\Delta }}{e}_{\max }\) is lower bounded by:

$$\begin{array}{*{20}{l}}{{\Delta }}{e}_{\max }\ge \mathop {\max }\limits_{\psi ,\phi } \frac{{D}_{{{{\rm{tr}}}}}(\psi ,\,\phi )\,-\,2{b}_{\max }}{{D}_{{{{\rm{LM}}}}}\left({\tilde{\psi }}_{Q}^{(K)},\,{\tilde{\phi }}_{Q}^{(K)}\right)}\end{array}$$
(3)

where \({\tilde{\psi }}_{Q}^{(K)}:={\otimes }_{k = 1}^{K}{\otimes }_{q = 1}^{Q}\left[{{{{\mathcal{E}}}}}_{q}^{(k)}(\psi )\right]\) and \({\tilde{\phi }}_{Q}^{(K)}:={\otimes }_{k = 1}^{K}{\otimes }_{q = 1}^{Q}\left[{{{{\mathcal{E}}}}}_{q}^{(k)}(\phi )\right]\) are distorted states corresponding to the QK copies of some ideal outputs ψ and ϕ, and \({{{{\mathcal{E}}}}}_{q}^{(k)}\) is the effective noise channel for the qth input in the kth experiment.

Combining this with Hoeffding’s inequality leads to the following bound on the sampling cost.

Corollary 2 Consider an arbitrary (Q, K)-mitigation protocol with maximum bias \({b}_{\max }\). Then, an estimation error of \({b}_{\max }+\delta\) is realized with probability 1 − ε when the number of samples M satisfies:

$$\begin{array}{*{20}{l}}M&\ge \frac{{{\Delta }}{e}_{\max }^{2}\log (2/\varepsilon )}{2{\delta }^{2}}\\&\ge {\left[\mathop {\max }\limits_{\psi ,\phi }\frac{{D}_{{{{\rm{tr}}}}}(\psi ,\,\phi )\,-\,2{b}_{\max }}{{D}_{{{{\rm{LM}}}}}\left({\tilde{\psi }}_{Q}^{(K)},\,{\tilde{\phi }}_{Q}^{(K)}\right)}\right]}^{2}\frac{\log (2/\varepsilon )}{2{\delta }^{2}}\end{array}$$
(4)

where \({\tilde{\psi }}_{Q}^{(K)}:={\otimes }_{k = 1}^{K}{\otimes }_{q = 1}^{Q}\left[{{{{\mathcal{E}}}}}_{q}^{(k)}(\psi )\right]\) and \({\tilde{\phi }}_{Q}^{(K)}:={\otimes }_{k = 1}^{K}{\otimes }_{q = 1}^{Q}\left[{{{{\mathcal{E}}}}}_{q}^{(k)}(\phi )\right]\).

Theorem 1 and Corollary 2 offer two qualitative insights. The first is the potential trade-off between sampling cost and systematic error—we may reduce the sampling cost by increasing tolerance for bias. The second is a direct relation between sampling cost and distinguishability—the more a noise source degrades distinguishability between states, the more costly the error is to mitigate.

The intuition behind this relation rests on the observation that the error-mitigation process is a quantum channel. Thus, any error-mitigation procedure must obey data-processing inequalities for distinguishability. On the other hand, error mitigation aims to improve our ability to estimate expectation values of various observables, which would enhance our ability to distinguish between noisy states. The combination of these observations then implies that distinguishability places a fundamental constraint on required sampling costs to mitigate error. For details of the associated proof, see Methods.

Observe that our bound involves the local distinguishability DLM(ρ, σ) rather than the standard trace distance \({D}_{{{{\rm{tr}}}}}(\rho ,\sigma )\). This is due to the constraint we placed on \({{{\mathcal{P}}}}\), which limits it to coherently interacting the outputs of a finite number of NISQ devices, reflecting the hybrid nature of quantum error mitigation, which uses quantum and classical resources in tandem. Notably, these quantities coincide for the most powerful NISQ devices (those allowing coherent interactions between all N noisy initial states). This case then corresponds to the most fundamental bound:

$${{\Delta }}{e}_{\max }\ge \mathop {\max }\limits_{\psi ,\phi}\frac{{D}_{{{{\rm{tr}}}}}(\psi ,\phi )-2{b}_{\max }}{{D}_{{{{\rm{tr}}}}}\left({\tilde{\psi }}_{Q}^{(K)},{\tilde{\phi }}_{Q}^{(K)}\right)},$$
(5)

which represents the ultimate performance limits of all (Q, K) error-mitigation protocols that coherently operate on N = QK distorted states each round.
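The following minimal sketch evaluates the right-hand side of Eq. (5) for one fixed pair of ideal states rather than the full maximization; any fixed pair still gives a valid, if possibly loose, lower bound. It assumes single-qubit states, the same depolarizing channel on every copy, and hypothetical helper names (`trace_distance`, `depolarize`, `spread_bound`) introduced only for this example.

```python
from functools import reduce
import numpy as np

def trace_distance(rho, sigma):
    """D_tr = (1/2) * trace norm, from the eigenvalues of the Hermitian difference."""
    return 0.5 * float(np.sum(np.abs(np.linalg.eigvalsh(rho - sigma))))

def depolarize(rho, eps):
    """Single-qubit depolarizing channel: (1 - eps) * rho + eps * I / 2."""
    return (1.0 - eps) * rho + eps * np.eye(2) / 2.0

def copies(rho, n):
    """Tensor product of n identical copies of rho."""
    return reduce(np.kron, [rho] * n)

def spread_bound(psi, phi, eps, n_copies, b_max=0.0):
    """Right-hand side of Eq. (5) for a fixed pair (psi, phi), assuming every
    effective noise channel is the same depolarizing channel (an assumption)."""
    noisy_psi = copies(depolarize(psi, eps), n_copies)
    noisy_phi = copies(depolarize(phi, eps), n_copies)
    return (trace_distance(psi, phi) - 2.0 * b_max) / trace_distance(noisy_psi, noisy_phi)

psi = np.diag([1.0, 0.0])  # |0><0|
phi = np.diag([0.0, 1.0])  # |1><1|
bound_one_copy = spread_bound(psi, phi, eps=0.2, n_copies=1)
bound_three_copies = spread_bound(psi, phi, eps=0.2, n_copies=3)
print(bound_one_copy, bound_three_copies)
```

In this example the bound loosens as more noisy copies are processed coherently, since the tensor-product states become easier to distinguish.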

We also remark that our framework can give tighter bounds when available error-mitigation methods involve specific states and observables (see Eq. (36)).

Alternative bounds

While the bounds derived above in terms of distinguishability have a clear operational meaning, their evaluation in realistic settings can face two significant hurdles. (1) It involves evaluating the distinguishability between two quantum states whose dimensions scale exponentially with KQ, making the evaluation costly for protocols that require many NISQ samples per round. (2) It requires that we have tomographic knowledge of the effective noise channels \({{{{\mathcal{E}}}}}_{q}^{(k)}\).

One potential means around this is to identify bounds on the distinguishability measures that alleviate such hurdles. For example, since \({D}_{{{{\rm{tr}}}}}(\rho ,\sigma )\le \sqrt{1-F(\rho ,\sigma )}\) for any pair of states ρ and σ where \(F(\rho ,\sigma ):={\left({{{\rm{Tr}}}}\sqrt{{\sigma }^{1/2}\rho {\sigma }^{1/2}}\right)}^{2}\) is the (squared) fidelity35, this, together with Eq. (5), implies:

$${{\Delta }}{e}_{\max }\ge \mathop{\max }\limits_{\begin{array}{c}\psi ,\phi \end{array}}\frac{{D}_{{{{\rm{tr}}}}}(\psi ,\phi )-2{b}_{\max }}{\sqrt{1-\mathop{\prod }\nolimits_{q = 1}^{Q}\mathop{\prod }\nolimits_{k = 1}^{K}F\left({{{{\mathcal{E}}}}}_{q}^{(k)}(\psi ),{{{{\mathcal{E}}}}}_{q}^{(k)}(\phi )\right)}}.$$
(6)

This form only involves the computation of the trace distance and fidelity of single-copy states, both of which can be computed by semidefinite programming36.
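As a sketch of how Eq. (6) can be evaluated from single-copy quantities, the code below computes the squared fidelity with an eigendecomposition-based matrix square root and multiplies it over the noisy copies. The helper names and the depolarized single-qubit example states are assumptions made for illustration; the ideal trace distance is supplied directly rather than maximized over.

```python
import numpy as np

def psd_sqrt(m):
    """Square root of a positive semidefinite matrix via eigendecomposition."""
    m = (m + m.conj().T) / 2.0           # symmetrize against rounding errors
    vals, vecs = np.linalg.eigh(m)
    vals = np.clip(vals, 0.0, None)      # remove tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.conj().T

def fidelity(rho, sigma):
    """Squared fidelity F = (Tr sqrt(sigma^{1/2} rho sigma^{1/2}))^2."""
    s = psd_sqrt(sigma)
    return float(np.real(np.trace(psd_sqrt(s @ rho @ s))) ** 2)

def fidelity_spread_bound(noisy_pairs, d_ideal, b_max=0.0):
    """Right-hand side of Eq. (6) for a fixed ideal pair with trace distance
    d_ideal; noisy_pairs lists (E(psi), E(phi)) for each of the QK copies."""
    product = np.prod([fidelity(r, s) for r, s in noisy_pairs])
    return (d_ideal - 2.0 * b_max) / float(np.sqrt(1.0 - product))

# Example: |0> and |1> after single-qubit depolarizing noise of strength 0.2.
noisy_psi = np.diag([0.9, 0.1])
noisy_phi = np.diag([0.1, 0.9])
bound_single = fidelity_spread_bound([(noisy_psi, noisy_phi)], d_ideal=1.0)
bound_triple = fidelity_spread_bound([(noisy_psi, noisy_phi)] * 3, d_ideal=1.0)
print(bound_single, bound_triple)
```

Only 2 × 2 matrices are ever diagonalized here, regardless of the number of copies, which is the practical advantage of Eq. (6) over Eq. (5).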

Meanwhile, the need for tomographic knowledge of \({{{{\mathcal{E}}}}}_{q}^{(k)}\) can be mitigated by using subfidelity37:

$$E(\rho ,\sigma ):={{{\rm{Tr}}}}(\rho \sigma )+\sqrt{2\left[{\left\{{{{\rm{Tr}}}}(\rho \sigma )\right\}}^{2}-{{{\rm{Tr}}}}(\rho \sigma \rho \sigma )\right]}.$$
(7)

The subfidelity bounds the fidelity from below, E(ρ, σ) ≤ F(ρ, σ), so replacing F with E in Eq. (6) also yields a lower bound on the maximum spread:

$${{\Delta }}{e}_{\max }\ge \mathop{\max }\limits_{\begin{array}{c}\psi ,\phi \end{array}}\frac{{D}_{{{{\rm{tr}}}}}(\psi ,\phi )-2{b}_{\max }}{\sqrt{1-\mathop{\prod }\nolimits_{q = 1}^{Q}\mathop{\prod }\nolimits_{k = 1}^{K}E\left({{{{\mathcal{E}}}}}_{q}^{(k)}(\psi ),{{{{\mathcal{E}}}}}_{q}^{(k)}(\phi )\right)}}.$$
(8)

The subfidelity between two unknown states can be measured by a quantum computer using a circuit of constant depth38,39 (see also Methods). This obviates the need for tomographic data, while its low depth means that the noise in this process is typically much smaller than the noise in our circuits of interest. We remark that, instead of using the subfidelity, one could use an alternative quantity that lower bounds the fidelity and can be estimated by NISQ devices, e.g., truncated fidelity40. Such techniques could enable benchmarking protocols that allow us to rule out a candidate NISQ device should our bounds suggest that its error profile is too adverse to support any viable means of error mitigation.

In addition, the maxima on the right-hand sides of (6) and (8) do not need to be evaluated exactly; any choice of states ψ and ϕ provides a valid lower bound for the maximum spread. While these alternative bounds may not be as tight, they still serve as universal lower bounds that can put non-trivial constraints on the error-mitigation performance (see Remark 2 in Supplementary Note 1 and Supplementary Note 3).
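The subfidelity of Eq. (7) only needs traces of products of the two states, which the sketch below computes classically and compares against the fidelity on randomly generated two-qubit density matrices. The random-state construction, the seed, and the helper names are assumptions chosen for illustration; on hardware the traces would instead be estimated by measurement.

```python
import numpy as np

def psd_sqrt(m):
    """Square root of a positive semidefinite matrix via eigendecomposition."""
    m = (m + m.conj().T) / 2.0
    vals, vecs = np.linalg.eigh(m)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.conj().T

def fidelity(rho, sigma):
    """Squared fidelity F = (Tr sqrt(sigma^{1/2} rho sigma^{1/2}))^2."""
    s = psd_sqrt(sigma)
    return float(np.real(np.trace(psd_sqrt(s @ rho @ s))) ** 2)

def subfidelity(rho, sigma):
    """E = Tr(rho sigma) + sqrt(2 [ (Tr rho sigma)^2 - Tr(rho sigma rho sigma) ])."""
    rs = rho @ sigma
    t1 = float(np.real(np.trace(rs)))
    t2 = float(np.real(np.trace(rs @ rs)))
    return t1 + float(np.sqrt(max(2.0 * (t1 ** 2 - t2), 0.0)))

def random_density_matrix(dim, rng):
    """Random full-rank state G G^dagger / Tr(G G^dagger) (illustrative only)."""
    g = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    rho = g @ g.conj().T
    return rho / np.real(np.trace(rho))

rng = np.random.default_rng(0)
pairs = [(random_density_matrix(4, rng), random_density_matrix(4, rng)) for _ in range(5)]
gaps = [fidelity(r, s) - subfidelity(r, s) for r, s in pairs]

# Single-copy version of Eq. (8) for depolarized |0> and |1> (noise strength 0.2).
noisy_psi, noisy_phi = np.diag([0.9, 0.1]), np.diag([0.1, 0.9])
bound_subfidelity = 1.0 / np.sqrt(1.0 - subfidelity(noisy_psi, noisy_phi))
print(gaps, bound_subfidelity)
```

Each entry of `gaps` is non-negative up to rounding, consistent with the subfidelity bounding the fidelity from below.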

Error-mitigating layered circuits

Quantitatively, the above bounds enable us to determine the ultimate performance limits of error mitigation given a particular set of imperfect quantum devices specified by error channels \(\{{{{{\mathcal{E}}}}}_{q}^{(k)}\}\). We now illustrate how this enables the identification of sampling overheads when performing error mitigation on a common class of NISQ algorithms: layered circuits used extensively in variational quantum eigensolvers41. Variational algorithms typically assume a quantum circuit consisting of multiple layers of unitary gates \({\{{U}_{l}\}}_{l = 1}^{L}\) acting on an n-qubit system. Indeed, being designed with NISQ applications in mind, they are key candidates for benchmarking error-mitigation protocols7,42,43.

In particular, consider a local depolarizing noise25,44, in which the depolarizing channel \({{{{\mathcal{D}}}}}_{\epsilon }(\rho ):=(1-\epsilon )\rho +\epsilon {\mathbb{I}}/2\) acts on each qubit. A general approach to mitigate this error is to employ a (Q, K)-mitigation protocol for some Q and K, in which the kth experiment involves depolarizing noise with noise strength ϵk (Fig. 4).

Fig. 4: Noise mitigation in layered circuits.

Layered circuits are used extensively in variational algorithms for NISQ devices. They involve repeated layers of gates, each consisting of some unitary Ul. A standard noise model for such circuits involves the action of local depolarizing noise \({{{{\mathcal{D}}}}}_{\epsilon }\) on each qubit during each layer of the circuit. The kth experiment in a general (Q, K)-protocol involves running this circuit Q times to produce a distorted state \({\otimes }_{q = 1}^{Q}{{{{\mathcal{E}}}}}_{q}^{(k)}(\psi )\) with some noise strength ϵk, which may vary across experiments. The protocol then measures each \({\otimes }_{q = 1}^{Q}{{{{\mathcal{E}}}}}_{q}^{(k)}(\psi )\) for k = 1, … , K and outputs an estimate EA through classical post-processing of the measurement results.

Taking \(U={U}_{L}\cdots {U}_{2}{U}_{1}\) in Fig. 2 and applying Theorem 1 to this setting, we obtain the following bound (see Supplementary Note 1 for the proof).

Theorem 3 For an arbitrary (Q, K)-error mitigation with maximum bias \({b}_{\max }\) applied to n-qubit circuits with L-layer unitaries under local depolarizing noise, the maximum spread is lower bounded as:

$${{\Delta }}{e}_{\max }\ge \frac{1-2{b}_{\max }}{\sqrt{2\ln 2}\sqrt{nQ}\,K}{\left(\frac{1}{1-{\epsilon }_{\min }}\right)}^{L},$$
(9)

where \({\epsilon }_{\min }:=\mathop{\min }\nolimits_{k}{\epsilon }_{k}\) is the minimum noise strength among K experiments.

Theorem 3 suggests that error-mitigation strategies encompassed in our framework will use exponentially many samples with respect to the circuit depth L. This validates our intuition that information should quickly get degraded due to the sequential noise effects, incurring exponential overhead to remove the accumulated noise effect.
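To get a feel for the numbers, the sketch below evaluates the right-hand side of Eq. (9) at a few depths and converts it into a lower bound on the number of rounds via the Hoeffding expression in Corollary 2. The parameter values (10 qubits, noise strength 0.05, Q = K = 1, accuracy 0.01, failure rate 5%) are hypothetical and chosen only for illustration.

```python
import math

def theorem3_spread_bound(n, depth, q, k, eps_min, b_max=0.0):
    """Right-hand side of Eq. (9): lower bound on the maximum spread."""
    prefactor = (1.0 - 2.0 * b_max) / (
        math.sqrt(2.0 * math.log(2.0)) * math.sqrt(n * q) * k
    )
    return prefactor * (1.0 / (1.0 - eps_min)) ** depth

def rounds_lower_bound(spread_bound, delta, eps_fail):
    """Hoeffding-type round count implied by a spread lower bound."""
    return spread_bound ** 2 * math.log(2.0 / eps_fail) / (2.0 * delta ** 2)

# Hypothetical setting: 10 qubits, unbiased (1, 1)-protocol, noise strength 0.05.
bounds = {
    depth: theorem3_spread_bound(n=10, depth=depth, q=1, k=1, eps_min=0.05)
    for depth in (10, 50, 100)
}
rounds = {d: rounds_lower_bound(b, delta=0.01, eps_fail=0.05) for d, b in bounds.items()}

for depth in bounds:
    print(depth, bounds[depth], rounds[depth])
```

Because the round count scales with the square of the spread, each additional layer multiplies this estimate by a factor of 1/(1 − ϵmin)².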

We also remark that, although we here focus on the exponential growth of the maximum spread with respect to the circuit depth L for the sake of generality, one can expect that the maximum spread grows exponentially with the total gate number nQKL rather than just the layer number L in many practical cases.
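To get a feel for the scaling in (9), the bound can be evaluated numerically. The following is a minimal sketch that simply transcribes the right-hand side of (9) as printed above; the function name `theorem3_lower_bound` and the parameter values are illustrative assumptions, not taken from the text.

```python
import math


def theorem3_lower_bound(n, Q, K, L, eps_min, b_max=0.0):
    """Right-hand side of Eq. (9), transcribed term by term.

    n: number of qubits, Q: copies per experiment, K: number of experiments,
    L: number of layers, eps_min: smallest depolarizing strength,
    b_max: maximum bias of the protocol.
    """
    if not 0.0 <= eps_min < 1.0:
        raise ValueError("eps_min must lie in [0, 1)")
    prefactor = (1.0 - 2.0 * b_max) / (
        math.sqrt(2.0 * math.log(2.0)) * math.sqrt(n * Q) * K
    )
    return prefactor * (1.0 / (1.0 - eps_min)) ** L


# Hypothetical parameters: 10 qubits, an unbiased (1, 1)-protocol, 1% noise.
for depth in (10, 100, 1000):
    print(depth, theorem3_lower_bound(n=10, Q=1, K=1, L=depth, eps_min=0.01))
```

Each additional layer multiplies the bound by 1/(1 − ϵmin), which is the exponential depth dependence discussed above.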

Protocol benchmarking

Theorems 1 and 3 place strategy-independent bounds on the maximum spread for each Q and K and available noise channels \({{{{\mathcal{E}}}}}_{q}^{(k)}\), enabling us to identify the ultimate potential of error mitigation under various noise settings and operational constraints. Comparing this limit with that achieved by specific known methods of error mitigation then provides a valuable benchmark, helping us assess their optimality and quantify the potential room for improvement. We illustrate this here by considering probabilistic error cancellation3, while we discuss how our framework can be applied to other prominent error-mitigation protocols in Methods.

Probabilistic error cancellation is an error-mitigation protocol that produces an estimate of \({{{\rm{Tr}}}}(A\psi )\) using a distorted state \({{{\mathcal{E}}}}(\psi )\) each round (see Fig. 3A). It is therefore a (1, 1)-protocol, i.e., Q = K = 1. Here, we assume that the description of the noise channels is given as pre-knowledge, in which case the estimator becomes unbiased, i.e., \({b}_{\max }=0\). Probabilistic error cancellation operates by identifying a complete basis of processes \({\{{{{{\mathcal{B}}}}}_{j}\}}_{j}\) such that \({{{{\mathcal{E}}}}}^{-1}={\sum }_{j}{c}_{j}{{{{\mathcal{B}}}}}_{j}\) for some set of real (but possibly negative) numbers \({\{{c}_{j}\}}_{j}\). Setting \(\gamma :={\sum }_{j}| {c}_{j}|\), the protocol then (1) applies \({{{{\mathcal{B}}}}}_{j}\) to the noisy state \({{{\mathcal{E}}}}(\psi )\) with probability \({p}_{j}=| {c}_{j}| /\gamma\), (2) measures A to get outcome aj, and (3) multiplies each outcome by \(\gamma \,{{{\rm{sgn}}}}({c}_{j})\) and takes the average. The absolute values ensure that the pj form a valid probability distribution even when some cj are negative.

In the context of our framework, we can introduce a quantum operation \({{{\mathcal{B}}}}\) that represents first initializing a classical register to a state j with probability pj and applying \({{{{\mathcal{B}}}}}_{j}\) to \({{{\mathcal{E}}}}(\psi )\) conditioned on j. Meanwhile, MA represents an A-measurement of the resulting quantum system combined with a measurement of the register, resulting in the outcome pair (aj, j). Taking \({e}_{A}^{{{{\rm{PEC}}}}}\left(({a}_{j},j)\right)=\gamma\, {{{\rm{sgn}}}}({c}_{j}){a}_{j}\), we see that the maximum spread of this estimator is given by:

$${{\Delta }}{e}_{\max }^{{{{\rm{PEC}}}}}=\gamma ,$$
(10)

a well-studied quantity that is already associated with the sampling overhead of probabilistic error cancellation3.
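The three steps of probabilistic error cancellation can be sketched for a toy case. The example below assumes single-qubit dephasing noise, the observable A = X/2, and the two-element basis {identity, conjugation by Z}; these choices, the function names, and the tracking of the state through its ⟨X⟩ value alone are illustrative assumptions. Here γ is taken as the sum of the absolute values |cj| so that the sampling weights form a probability distribution.

```python
import random


def pec_dephasing_coefficients(eps):
    """Quasi-probability decomposition of the inverse single-qubit dephasing channel.

    Basis: {identity, conjugation by Z}. Valid for 0 <= eps < 1/2.
    """
    c = [(1.0 - eps) / (1.0 - 2.0 * eps), -eps / (1.0 - 2.0 * eps)]
    gamma = sum(abs(cj) for cj in c)
    probs = [abs(cj) / gamma for cj in c]
    return c, gamma, probs


def pec_estimate(x_ideal, eps, shots, rng):
    """Estimate Tr(A psi) for A = X/2, where x_ideal is <X> of the ideal state."""
    c, gamma, probs = pec_dephasing_coefficients(eps)
    x_noisy = (1.0 - 2.0 * eps) * x_ideal  # dephasing shrinks <X>
    total = 0.0
    for _ in range(shots):
        # (1) sample which basis operation B_j to apply
        j = 0 if rng.random() < probs[0] else 1
        x_after = x_noisy if j == 0 else -x_noisy  # Z-conjugation flips <X>
        # (2) measure A = X/2: outcome +1/2 with probability (1 + <X>)/2
        a = 0.5 if rng.random() < (1.0 + x_after) / 2.0 else -0.5
        # (3) reweight the outcome by gamma * sgn(c_j)
        sign = 1.0 if c[j] >= 0 else -1.0
        total += gamma * sign * a
    return total / shots


rng = random.Random(0)
print(pec_estimate(x_ideal=1.0, eps=0.1, shots=20000, rng=rng))  # close to 0.5
```

Every reweighted outcome lies between −γ/2 and γ/2, so the range of this estimator is γ, in line with (10).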

The optimal sampling cost γopt is then achieved by minimizing such γ over all feasible \({\{{{{{\mathcal{B}}}}}_{j}\}}_{j}\)45. Once computed for a specific noise channel \({{{\mathcal{E}}}}\), we can compare it to the lower bounds in Theorem 1 to determine if there is possible room for improvement.

Let us now consider local dephasing noise on an n-qubit system, where the dephasing noise \({{{{\mathcal{Z}}}}}_{\epsilon }(\rho ):=(1-\epsilon )\rho +\epsilon Z\rho Z\) acts on each qubit. We find that the optimal cost is obtained as:

$${\gamma }_{{{{\rm{opt}}}}}={{\Delta }}{e}_{\max }^{{{{\rm{PEC}}}}}=\frac{1}{{(1-2\epsilon )}^{n}}.$$
(11)

This can be compared to the bound for \({{\Delta }}{e}_{\max }\) from Theorem 1 that applies to every mitigation protocol with Q = K = 1. Note that, since K = 1, \({D}_{{{{\rm{LM}}}}}={D}_{{{{\rm{tr}}}}}\). We then get:

$$\mathop{\max }\limits_{\psi ,\phi }\frac{{D}_{{{{\rm{tr}}}}}(\psi ,\phi )}{{D}_{{{{\rm{tr}}}}}({{{{\mathcal{Z}}}}}_{\epsilon }(\psi ),{{{{\mathcal{Z}}}}}_{\epsilon }(\phi ))}\ge \frac{1}{{(1-2\epsilon )}^{n}}.$$
(12)

Detailed computation to obtain (11) and (12) can be found in Supplementary Note 2. Remarkably, the two quantities—the maximum spread for the probabilistic error cancellation and the lower bound for arbitrary unbiased mitigation strategies with Q = K = 1—exactly coincide. This shows that probabilistic error cancellation achieves the ultimate performance limit of unbiased (1, 1)-protocols for correcting local dephasing noise for an arbitrary qubit number n.
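As a quick sanity check that the ratio on the left-hand side of (12) can reach this value, consider the following sketch. The choice of GHZ-type states \(\left|{\psi }_{\pm }\right\rangle :=({\left|0\right\rangle }^{\otimes n}\pm {\left|1\right\rangle }^{\otimes n})/\sqrt{2}\) is an illustrative assumption made here, and the derivation in Supplementary Note 2 may proceed differently. Writing \({\psi }_{\pm }\) for the corresponding density operators, we have:

$${\psi }_{+}-{\psi }_{-}={\left(\left|0\right\rangle \left\langle 1\right|\right)}^{\otimes n}+{\left(\left|1\right\rangle \left\langle 0\right|\right)}^{\otimes n},\qquad {D}_{{{{\rm{tr}}}}}({\psi }_{+},{\psi }_{-})=1.$$

Each single-qubit channel \({{{{\mathcal{Z}}}}}_{\epsilon }\) maps \(\left|0\right\rangle \left\langle 1\right|\) to \((1-2\epsilon )\left|0\right\rangle \left\langle 1\right|\). Writing the local dephasing on all n qubits as \({{{{\mathcal{Z}}}}}_{\epsilon }^{\otimes n}\), we get for \(0\le \epsilon < 1/2\):

$${D}_{{{{\rm{tr}}}}}\left({{{{\mathcal{Z}}}}}_{\epsilon }^{\otimes n}({\psi }_{+}),{{{{\mathcal{Z}}}}}_{\epsilon }^{\otimes n}({\psi }_{-})\right)={(1-2\epsilon )}^{n}\,{D}_{{{{\rm{tr}}}}}({\psi }_{+},{\psi }_{-})={(1-2\epsilon )}^{n},$$

so the ratio for this pair of states equals \(1/{(1-2\epsilon )}^{n}\).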

We can also consider the d-dimensional depolarizing noise \({{{{\mathcal{D}}}}}_{\epsilon }^{d}(\rho )=(1-\epsilon )\rho +\epsilon {\mathbb{I}}/d\). The bound from Theorem 1 for this noise is obtained as:

$$\mathop{\max }\limits_{\psi ,\phi }\frac{{D}_{{{{\rm{tr}}}}}(\psi ,\phi )}{{D}_{{{{\rm{tr}}}}}({{{{\mathcal{D}}}}}_{\epsilon }^{d}(\psi ),{{{{\mathcal{D}}}}}_{\epsilon }^{d}(\phi ))}=\frac{1}{1-\epsilon },$$
(13)

which is slightly lower than \({{\Delta }}{e}_{\max }^{{{{\rm{PEC}}}}}=\frac{1+(1-2/{d}^{2})\epsilon }{1-\epsilon }\)45,46,47, with the difference being O(ϵ). This suggests that probabilistic error cancellation is nearly optimal for this noise model, while still leaving open the possibility that a better protocol exists.
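The size of this gap can be tabulated directly from the two closed-form expressions. The sketch below only evaluates the formulas quoted above; the function names and the sample values of ϵ and d are illustrative assumptions.

```python
def depolarizing_lower_bound(eps):
    """Right-hand side of Eq. (13)."""
    return 1.0 / (1.0 - eps)


def depolarizing_pec_spread(eps, d):
    """Maximum spread of probabilistic error cancellation quoted in the text."""
    return (1.0 + (1.0 - 2.0 / d**2) * eps) / (1.0 - eps)


d = 2  # single qubit; other dimensions can be substituted
for eps in (0.001, 0.01, 0.1):
    bound = depolarizing_lower_bound(eps)
    pec = depolarizing_pec_spread(eps, d)
    # The gap equals (1 - 2/d^2) * eps / (1 - eps), i.e. it is linear in eps to leading order.
    print(eps, bound, pec, pec - bound)
```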

We can also apply similar techniques to study the performance of other prominent error-mitigation protocols. Here, we plot the estimator spread for probabilistic error cancellation, virtual distillation, and noise extrapolation, together with their corresponding lower bounds, for local dephasing noise (Fig. 5) and global depolarizing noise (Fig. 6). We note that, for virtual distillation and extrapolation, we evaluated (36), which allows us to bound ΔeA in (2) for a specific observable A of interest. We provide details for the evaluation of these values in Supplementary Note 2. We observe that both of these protocols perform close to the optimal limits in the low-error regime. In the high-error regime, their performance can diverge significantly from our lower bounds, depending on the underlying noise models and mitigation strategies. We emphasize that such divergences are expected because of the high generality of our lower bounds. Narrowing the gaps between the fundamental lower bounds and the achievable maximum spread, e.g., by finding more examples such as probabilistic error cancellation for local dephasing noise, will be a natural direction for future work.

Fig. 5: The estimator spreads to mitigate local dephasing noise on a 50-qubit system.

Solid green curve: \({{\Delta }}{e}_{\max }\) for probabilistic error cancellation and the lower bound for unbiased (1, 1)-mitigation protocols, which coincide as explained in the main text. Brown curve: ΔeA with \(A=\frac{1}{2}{\otimes }_{i = 1}^{n}{X}_{i}\) for 2-copy virtual distillation with GHZ state inputs and the lower bound for (2, 1)-mitigation protocols with the same bias, which coincide as explained in Supplementary Note 2. Triangles and rectangles: ΔeA with \(A=\frac{1}{2}{\otimes }_{i = 1}^{n}{X}_{i}\) for 11th order noise extrapolation with GHZ state inputs (triangles) and a lower bound for (1, 12)-mitigation protocols with the same bias (rectangles).

Fig. 6: The estimator spreads to mitigate global depolarizing noise on a 50-qubit system.

Green curves: \({{\Delta }}{e}_{\max }\) for probabilistic error cancellation (dashed) and the lower bound for unbiased (1, 1)-mitigation protocols (solid). Brown curves: ΔeA with \(A=\frac{1}{2}{\otimes }_{i = 1}^{n}{X}_{i}\) for 2-copy virtual distillation with GHZ state inputs (dashed) and the lower bound for (2, 1)-mitigation protocols with the same bias (solid). Triangles and rectangles: ΔeA with \(A=\frac{1}{2}{\otimes }_{i = 1}^{n}{X}_{i}\) for 1st order noise extrapolation (triangles) and a lower bound for (1, 2)-mitigation protocols with the same bias (rectangles).

Discussion

Our work aimed to identify the ultimate performance limits of quantum error mitigation, a large class of techniques designed to estimate the outputs of ideal quantum circuits by post-processing measurement data from imperfect counterparts. This involved identifying a universal performance measure, applicable to any such error-mitigation protocol, that captures how many extra executions of available NISQ devices the protocol uses to ensure that its estimates are sufficiently close with some required probability of success. We then derived ultimate performance limits that pertain to all such error-mitigation methods. The significance of our bounds parallels that of various fundamental converse bounds in quantum information (e.g., quantum communication48,49,50 and thermodynamics51,52,53), representing the ultimate performance limits that quantum error-mitigation protocols can never surpass. In particular, our bounds demonstrate that probabilistic error cancellation is optimal in the maximum spread for mitigating local dephasing noise among all unbiased error-mitigation protocols that involve no coherent interactions between multiple copies of distorted states, and imply that the exponential growth of the maximum spread when mitigating noise in layered circuits is an unavoidable feature shared by general error-mitigation protocols.

We note that our performance bounds have focused on the scaling of M, representing how many rounds an error-mitigation protocol should be run to get a reliable estimate of some observable 〈A〉. Although this analysis is sufficient for many present methods of error mitigation, it is possible to also improve estimates of 〈A〉 by scaling the number of distorted outputs we process in a single round (e.g., extrapolation11 and subspace expansion27). While our framework in Fig. 1 encompasses such methodologies—and as such all bounds on estimation error apply—full understanding of the performance of such protocols would involve further investigation on how estimation error scales with respect to N or K. This then presents a natural direction for future research.

Our results also offer potential insights into several related fields. Non-Markovian dynamics have shown promise in decreasing sampling costs in error mitigation54. Since non-Markovianity is known to be deeply related to the trace distance55, our newly established relations between trace distance and quantum error mitigation hint at promising relations between the two fields. The second direction is to relate our general framework of quantum error mitigation to the established theory of quantum error correction. Quantum error correction concerns algorithms that prevent degrading the trace distance between suitably encoded logical states, while our results indicate that less reduction in trace distance can enable smaller error mitigation costs. Thus, our work provides a toolkit for identifying fundamental bounds in the transition from error mitigation to error correction as we proceed from NISQ devices toward scalable quantum computing. This then complements presently active research in error suppression that combines the two techniques56,57,58,59. Beyond error suppression, quantum protocols in many diverse settings also share the structure of classical post-processing of quantum measurements—from quantum metrology and illumination to hypothesis testing and stochastic analysis60,61,62,63,64. Our framework—suitably extended—could thus identify new performance bounds in each of these settings.

Methods

Formal definition of (Q, K)-error mitigation

Here, we give a formal definition of (Q, K)-error mitigation as a quantum operation. Since POVM measurements in different experiments are independent of each other, the whole measurement process can be represented as a tensor product of each POVM. Then, the classical post-processing following the measurement is a classical-classical channel such that the expected value of the output will serve as an estimate of the desired expectation value. We can then formalize an error-mitigation process as a concatenation of these two maps.

Definition 4 ((Q, K)-error mitigation). For an arbitrary observable A satisfying \(-{\mathbb{I}}/2\le A\le {\mathbb{I}}/2\), a (Q, K)-mitigation protocol—involving Q inputs and K experiments—is a concatenation of quantum-classical channel ΛA and classical-classical channel \({\hat{e}}_{A}\) as \({\hat{e}}_{A}\circ {{{\Lambda }}}_{A}\). Here, ΛA has a form:

$${{{\Lambda }}}_{A}(\cdot )=\mathop{\sum}\limits_{{{{\bf{i}}}}}{{{\rm{Tr}}}}(\cdot \,{M}_{{i}^{(1)}}^{(1)}\otimes \cdots \otimes {M}_{{i}^{(K)}}^{(K)})\,\left|{{{\bf{i}}}}\right\rangle \,\left\langle {{{\bf{i}}}}\right|$$
(14)

where \(\{{M}_{{i}^{(k)}}^{(k)}\}\) is the POVM for the kth experiment acting on Q copies of n-qubit noisy states, and i := i(1)… i(K) denotes a collection of measurement outcomes with \(\left|{{{\bf{i}}}}\right\rangle =\left|{i}^{(1)}\ldots {i}^{(K)}\right\rangle\) being a classical state acting on K subsystems. The channel \({\hat{e}}_{A}\) implements a K-input classical function eA such that:

$$\mathop{\sum}\limits_{{{{\bf{i}}}}}{p}_{{{{\bf{i}}}}}{e}_{A}({{{\bf{i}}}})={{{\rm{Tr}}}}(A\psi )+{b}_{A}(\psi )$$
(15)

for some function bA(ψ) called bias, and:

$${p}_{{{{\bf{i}}}}}:=\mathop{\prod }\limits_{k=1}^{K}{{{\rm{Tr}}}}[{{{{\mathcal{E}}}}}_{1}^{(k)}(\psi )\otimes \cdots \otimes {{{{\mathcal{E}}}}}_{Q}^{(k)}(\psi )\,{M}_{{i}^{(k)}}^{(k)}]$$
(16)

is the probability of getting outcomes i = i(1)… i(K) for the input noisy states \({\{{{{{\mathcal{E}}}}}_{q}^{(k)}(\psi )\}}_{q = 1,k = 1}^{Q,K}\).

Proof of Theorem 1—The intuition behind Theorem 1 lies in the intimate relation between the effect of error mitigation and distinguishability of quantum states. Recall that the goal of quantum error mitigation is to estimate the expectation value of an arbitrary observable A for an arbitrary ideal state ψ only using the noisy state \({{{\mathcal{E}}}}(\psi )\). Although \({{{\rm{Tr}}}}(A{{{\mathcal{E}}}}(\psi ))\) can deviate from \({{{\rm{Tr}}}}(A\psi )\), error mitigation correctly allows us to estimate \({{{\rm{Tr}}}}(A\psi )\), which appears to have eliminated noise effects. Since each error-mitigation strategy should also work for another state ϕ, it should be able to remove the noise and estimate \({{{\rm{Tr}}}}(A\phi )\) out of \({{{\rm{Tr}}}}(A{{{\mathcal{E}}}}(\phi ))\). Does this “removal” of noise imply that error mitigation can help distinguish \({{{\mathcal{E}}}}(\psi )\) and \({{{\mathcal{E}}}}(\phi )\)?

The subtlety of this question can be seen by looking at how quantum error mitigation works. The estimation of \({{{\rm{Tr}}}}(A{{{\mathcal{E}}}}(\psi ))\) without error mitigation is carried out by making a measurement with respect to the eigenbasis of \(A={\sum }_{a}a\left|a\right\rangle \,\left\langle a\right|\), which produces a probability distribution \(p(a| {{{\mathcal{E}}}}(\psi ),A)\) over possible outcomes {a}. Because of the noise, the expectation value of this distribution is shifted from \({{{\rm{Tr}}}}(A\psi )\). Similarly, the same measurement for a state \({{{\mathcal{E}}}}(\phi )\) produces a probability distribution \(p(a| {{{\mathcal{E}}}}(\phi ),A)\), whose expectation value may also be shifted from \({{{\rm{Tr}}}}(A\phi )\). An error-mitigation protocol applies additional operations, measurements and classical post-processing to produce other probability distributions \({p}_{{{{\rm{EM}}}}}(a| {{{\mathcal{E}}}}(\psi ),A)\) and \({p}_{{{{\rm{EM}}}}}(a| {{{\mathcal{E}}}}(\phi ),A)\) whose expectation values get closer to the original ones. As a result, although the expectation values of the two error-mitigated distributions get separated from each other, they also get broader, which may increase the overlap between the two distributions, possibly making it even harder to distinguish two distributions (see Fig. 7).

Fig. 7: Error mitigation and distinguishability.

The top schematic illustrates the probability distribution of an observable A for two noisy states \({{{\mathcal{E}}}}(\psi )\) and \({{{\mathcal{E}}}}(\phi )\). The expectation values are shifted from the true values due to the noise effects. As in the bottom schematic, error mitigation converts them to other distributions whose expectation values are closer to the true values than the initial noisy distributions are. However, the converted distributions get broader, and the overlap between two distributions increases in general.

One can see that this intuition that error mitigation does not increase the distinguishability is indeed right by looking at the whole error-mitigation process as a quantum channel. Then, the data-processing inequality implies that the distinguishability between any two states should not be increased by the application of quantum channels. This motivates us to rather use this observation as a basis to put a lower bound for the necessary overhead.

Let us recall that the trace distance admits the following form:

$$\begin{array}{ll}{D}_{{{{\rm{tr}}}}}(\rho ,\sigma )=\frac{1}{2}\parallel \rho -\sigma {\parallel }_{1}\\ \qquad\qquad\,=\mathop{\max }\limits_{0\le M\le {\mathbb{I}}}{{{\rm{Tr}}}}\left[M(\rho -\sigma )\right],\end{array}$$
(17)

and similarly the local distinguishability measure can be written as34:

$$\begin{array}{ll}{D}_{{{{\rm{LM}}}}}(\rho ,\sigma )=\mathop{\max }\limits_{\{{M}_{i}\}\in {{{\rm{LM}}}}}\frac{1}{2}\parallel {{{\mathcal{M}}}}(\rho )-{{{\mathcal{M}}}}(\sigma ){\parallel }_{1}\\ \qquad\qquad\;\;=\mathop{\max }\limits_{\{M,{\mathbb{I}}-M\}\in {{{{\rm{LM}}}}}_{2}}{{{\rm{Tr}}}}[M(\rho -\sigma )]\end{array}$$
(18)

where LM is the set of POVMs that take the form \({M}_{{i}^{(1)}}^{(1)}\otimes \cdots \otimes {M}_{{i}^{(K)}}^{(K)}\), where \({M}_{{i}^{(k)}}^{(k)}\) represents some POVM local to system Sk, and LM2 is the set of two-outcome measurements realized by local measurements together with classical post-processing. The second forms of the above measures show that they quantify how well two states can be distinguished by accessible quantum measurements. By definition, it is clear that:

$${D}_{{{{\rm{tr}}}}}(\rho ,\sigma )\ge {D}_{{{{\rm{LM}}}}}(\rho ,\sigma )$$
(19)

for all states ρ and σ, and the inequality often becomes strict65,66.

The local distinguishability measure satisfies the data-processing inequality under all local measurement channels. Namely, for all states ρ and σ defined on a composite system \({\otimes }_{k = 1}^{K}{S}_{k}\), and for an arbitrary quantum-classical channel \({{\Lambda }}(\cdot )={\sum }_{i}{{{\rm{Tr}}}}\left(\,\cdot \,{M}_{{i}^{(1)}}^{(1)}\otimes \cdots \otimes {M}_{{i}^{(K)}}^{(K)}\right)\left|{\bf i}\right\rangle \left\langle {\bf i}\right|\),

$$\begin{array}{ll}{D}_{{{{\rm{LM}}}}}({{\Lambda }}(\rho ),{{\Lambda }}(\sigma ))=\mathop{\max }\limits_{{{{\mathcal{M}}}}\in {{{\rm{LM}}}}}\frac{1}{2}\parallel {{{\mathcal{M}}}}\circ {{\Lambda }}(\rho )-{{{\mathcal{M}}}}\circ {{\Lambda }}(\sigma ){\parallel }_{1}\\ \qquad\qquad\qquad\qquad\le \mathop{\max }\limits_{{{{\mathcal{M}}}}\in {{{\rm{LM}}}}}\frac{1}{2}\parallel {{{\mathcal{M}}}}(\rho )-{{{\mathcal{M}}}}(\sigma ){\parallel }_{1}\\ \qquad\qquad\qquad\qquad={D}_{{{{\rm{LM}}}}}(\rho ,\sigma )\end{array}$$
(20)

where in the inequality we used that the set of local measurement channels is closed under concatenation.

Let us define:

$$\begin{array}{l}{\tilde{\psi }}_{Q}^{(K)}:={\otimes }_{k = 1}^{K}{\otimes }_{q = 1}^{Q}\left[{{{{\mathcal{E}}}}}_{q}^{(k)}(\psi )\right],\\ {\tilde{\phi }}_{Q}^{(K)}:={\otimes }_{k = 1}^{K}{\otimes }_{q = 1}^{Q}\left[{{{{\mathcal{E}}}}}_{q}^{(k)}(\phi )\right].\end{array}$$
(21)

Since the channel ΛA in Definition 4 is a local measurement channel, we employ (20) to get:

$$\begin{array}{rcl}{D}_{{{{\rm{LM}}}}}\left({\tilde{\psi }}_{Q}^{(K)},{\tilde{\phi }}_{Q}^{(K)}\right)&\ge &{D}_{{{{\rm{LM}}}}}\left({{{\Lambda }}}_{A}\left({\tilde{\psi }}_{Q}^{(K)}\right),{{{\Lambda }}}_{A}\left({\tilde{\phi }}_{Q}^{(K)}\right)\right)\\ &=&{D}_{{{{\rm{LM}}}}}(\hat{p},\hat{q})\end{array}$$
(22)

where:

$$\hat{p}=\mathop{\sum}\limits_{{{{\bf{i}}}}}{p}_{{{{\bf{i}}}}}\,\left|{{{\bf{i}}}}\right\rangle \,\left\langle {{{\bf{i}}}}\right|,\quad \hat{q}=\mathop{\sum}\limits_{{{{\bf{i}}}}}{q}_{{{{\bf{i}}}}}\left|{{{\bf{i}}}}\right\rangle \,\left\langle {{{\bf{i}}}}\right|$$
(23)

and pi and qi are classical distributions defined in (16) for ψ and ϕ respectively, which satisfy:

$$\begin{array}{l}\mathop{\sum}\limits_{{{{\bf{i}}}}}{p}_{{{{\bf{i}}}}}{e}_{A}({{{\bf{i}}}})={{{\rm{Tr}}}}(A\psi )+{b}_{A}(\psi ),\\ \mathop{\sum}\limits_{{{{\bf{i}}}}}{q}_{{{{\bf{i}}}}}{e}_{A}({{{\bf{i}}}})={{{\rm{Tr}}}}(A\phi )+{b}_{A}(\phi ).\end{array}$$
(24)

When \(\hat{p}\) and \(\hat{q}\) are tensor products of classical states, i.e., \(\hat{p}={\hat{p}}^{(1)}\otimes \cdots \otimes {\hat{p}}^{(K)}\) and \(\hat{q}={\hat{q}}^{(1)}\otimes \cdots \otimes {\hat{q}}^{(K)}\), it holds that:

$${D}_{{{{\rm{LM}}}}}(\hat{p},\hat{q})={D}_{{{{\rm{tr}}}}}(\hat{p},\hat{q}).$$
(25)

This can be seen as follows. Let \({M}^{\star }\) be the optimal POVM element achieving the trace distance in (17). Then, we get:

$$\begin{array}{rcl}{D}_{{{{\rm{tr}}}}}(\hat{p},\hat{q})&=&{{{\rm{Tr}}}}[{M}^{\star }(\hat{p}-\hat{q})]\\ &=&{{{\rm{Tr}}}}[{{\Delta }}({M}^{\star })(\hat{p}-\hat{q})]\end{array}$$
(26)

where:

$${{\Delta }}(\cdot ):=\mathop{\sum}\limits_{{{{\bf{i}}}}}\left|{{{\bf{i}}}}\right\rangle \,\left\langle {{{\bf{i}}}}\right|\cdot \left|{{{\bf{i}}}}\right\rangle \,\left\langle {{{\bf{i}}}}\right|$$
(27)

is a classical dephasing channel. The effective POVM element \({{\Delta }}({M}^{\star })\) has the form:

$${{\Delta }}({M}^{\star })=\mathop{\sum}\limits_{{{{\bf{i}}}}}\langle {{{\bf{i}}}}| {M}^{\star }| {{{\bf{i}}}}\rangle \left|{{{\bf{i}}}}\right\rangle \,\left\langle {{{\bf{i}}}}\right|.$$
(28)

Since each \(\left|{{{\bf{i}}}}\right\rangle \,\left\langle {{{\bf{i}}}}\right|\) is a local POVM element and \(0\le \langle {{{\bf{i}}}}| {M}^{\star }| {{{\bf{i}}}}\rangle \le 1\) because \(0\le {M}^{\star }\le {\mathbb{I}}\), the two-outcome measurement \(\{{{\Delta }}({M}^{\star }),{\mathbb{I}}-{{\Delta }}({M}^{\star })\}\) can be realized by a local measurement and classical post-processing, and thus belongs to LM2. This, together with (18), implies \({D}_{{{{\rm{tr}}}}}(\hat{p},\hat{q})\le {D}_{{{{\rm{LM}}}}}(\hat{p},\hat{q})\), and combining this with (19) gives (25).

Combining (22) and (25) gives:

$${D}_{{{{\rm{tr}}}}}(\hat{p},\hat{q})\le {D}_{{{{\rm{LM}}}}}\left({\tilde{\psi }}_{Q}^{(K)},{\tilde{\phi }}_{Q}^{(K)}\right).$$
(29)

We now connect (29) to the expression (24) of the expectation value and bias. Let us first suppose \({{{\rm{Tr}}}}(A\psi )+{b}_{A}(\psi )\ge {{{\rm{Tr}}}}(A\phi )+{b}_{A}(\phi )\). Let \({{{{\mathcal{I}}}}}^{\star }:=\left\{\left.{{{\bf{i}}}}\ \right|\ {p}_{{{{\bf{i}}}}}-{q}_{{{{\bf{i}}}}}\ge 0\right\}\) and let \({\overline{{{{\mathcal{I}}}}}}^{\star }\) be the complement set. Let us also define \(A^{\prime} =A+{\mathbb{I}}/2\), which satisfies \(0\le A^{\prime} \le {\mathbb{I}}\) due to \(-{\mathbb{I}}/2\le A\le {\mathbb{I}}/2\). Then, we get:

$$\begin{array}{l}{{{\rm{Tr}}}}[A^{\prime} (\psi -\phi )]+{b}_{A}(\psi )-{b}_{A}(\phi )\\ ={{{\rm{Tr}}}}[(A+{\mathbb{I}}/2)(\psi -\phi )]+{b}_{A}(\psi )-{b}_{A}(\phi )\\ ={{{\rm{Tr}}}}[A(\psi -\phi )]+{b}_{A}(\psi )-{b}_{A}(\phi )\\ =\mathop{\sum}\limits_{{{{\bf{i}}}}}({p}_{{{{\bf{i}}}}}-{q}_{{{{\bf{i}}}}}){e}_{A}({{{\bf{i}}}})\\ \le \mathop{\sum}\limits_{{{{\bf{i}}}}\in {{{{\mathcal{I}}}}}^{\star }}({p}_{{{{\bf{i}}}}}-{q}_{{{{\bf{i}}}}}){e}_{A,\max }+\mathop{\sum}\limits_{{{{\bf{i}}}}\in {\overline{{{{\mathcal{I}}}}}}^{\star }}({p}_{{{{\bf{i}}}}}-{q}_{{{{\bf{i}}}}}){e}_{A,\min }\\ ={D}_{{{{\rm{tr}}}}}(\hat{p},\hat{q})({e}_{A,\max }-{e}_{A,\min })\end{array}$$
(30)

where the second equality follows from \({{{\rm{Tr}}}}(\psi -\phi )=0\), the third equality uses (24), the inequality uses the maximum and minimum estimator values:

$${e}_{A,\max }:=\mathop{\max }\limits_{{{{\bf{i}}}}}{e}_{A}({{{\bf{i}}}}),\quad {e}_{A,\min }:=\mathop{\min }\limits_{{{{\bf{i}}}}}{e}_{A}({{{\bf{i}}}}),$$
(31)

and in the last line we used that:

$$\mathop{\sum}\limits_{{{{\bf{i}}}}\in {\overline{{{{\mathcal{I}}}}}}^{\star }}({p}_{{{{\bf{i}}}}}-{q}_{{{{\bf{i}}}}})=-\mathop{\sum}\limits_{{{{\bf{i}}}}\in {{{{\mathcal{I}}}}}^{\star }}({p}_{{{{\bf{i}}}}}-{q}_{{{{\bf{i}}}}})$$
(32)

and that the trace distance reduces to the total variation distance:

$${D}_{{{{\rm{tr}}}}}(\hat{p},\hat{q})=\mathop{\sum}\limits_{i:{p}_{i}-{q}_{i}\ge 0}({p}_{i}-{q}_{i})$$
(33)

for all classical states \(\hat{p}={\sum }_{i}{p}_{i}\left|i\right\rangle \,\left\langle i\right|\) and \(\hat{q}={\sum }_{i}{q}_{i}\left|i\right\rangle \,\left\langle i\right|\). Combining (29) and (30), we get:

$${e}_{A,\max }-{e}_{A,\min }\ge \frac{{{{\rm{Tr}}}}[A^{\prime} (\psi -\phi )]+{b}_{A}(\psi )-{b}_{A}(\phi )}{{D}_{{{{\rm{LM}}}}}\left({\tilde{\psi }}_{Q}^{(K)},{\tilde{\phi }}_{Q}^{(K)}\right)}.$$
(34)

On the other hand, if \({{{\rm{Tr}}}}(A\psi )+{b}_{A}(\psi )\le {{{\rm{Tr}}}}(A\phi )+{b}_{A}(\phi )\), we flip the role of ψ and ϕ to get:

$${e}_{A,\max }-{e}_{A,\min }\ge -\frac{{{{\rm{Tr}}}}[A^{\prime} (\psi -\phi )]+{b}_{A}(\psi )-{b}_{A}(\phi )}{{D}_{{{{\rm{LM}}}}}\left({\tilde{\psi }}_{Q}^{(K)},{\tilde{\phi }}_{Q}^{(K)}\right)}.$$
(35)

Defining \({{\Delta }}{e}_{A}:={e}_{A,\max }-{e}_{A,\min }\), these two can be summarized as:

$${{\Delta }}{e}_{A}\ge \frac{\left|{{{\rm{Tr}}}}[A^{\prime} (\psi -\phi )]+{b}_{A}(\psi )-{b}_{A}(\phi )\right|}{{D}_{{{{\rm{LM}}}}}\left({\tilde{\psi }}_{Q}^{(K)},{\tilde{\phi }}_{Q}^{(K)}\right)}.$$
(36)
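The step in (30) that drives these bounds, namely that the difference of two expected estimator values is at most the total variation distance times the estimator range, can be checked numerically. The following is a minimal sketch on randomly drawn classical distributions; the helper names, the sampled value range, and the tolerance are illustrative assumptions.

```python
import random


def total_variation(p, q):
    """Eq. (33): sum of (p_i - q_i) over outcomes with p_i >= q_i."""
    return sum(pi - qi for pi, qi in zip(p, q) if pi >= qi)


def random_distribution(size, rng):
    """Draw a random probability vector of the given size."""
    weights = [rng.random() for _ in range(size)]
    norm = sum(weights)
    return [w / norm for w in weights]


def check_spread_inequality(size, trials, rng):
    """Check sum_i (p_i - q_i) e(i) <= D_tr(p, q) (e_max - e_min) on random inputs."""
    for _ in range(trials):
        p = random_distribution(size, rng)
        q = random_distribution(size, rng)
        e = [rng.uniform(-3.0, 3.0) for _ in range(size)]  # arbitrary estimator values
        lhs = sum((pi - qi) * ei for pi, qi, ei in zip(p, q, e))
        rhs = total_variation(p, q) * (max(e) - min(e))
        if lhs > rhs + 1e-12:  # small slack for floating-point rounding
            return False
    return True


print(check_spread_inequality(size=8, trials=1000, rng=random.Random(0)))
```

Rearranging this inequality and then using (29) to replace the classical distance by the local distinguishability of the noisy inputs is what yields (34) to (36).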

Optimizing over A, ϕ, and ψ on both sides, we reach:

$$\begin{array}{*{20}{l}}{{\Delta }}{e}_{\max }&\ge\mathop{\max }\limits_{\psi ,\phi}\mathop{\max}\limits_{-{\mathbb{I}}/2\le A\le {\mathbb{I}}/2}\frac{| {{{\rm{Tr}}}}[A^{\prime} (\psi \,-\,\phi )]\,+\,{b}_{A}(\psi )\,-\,{b}_{A}(\phi )| }{{D}_{{{{\rm{LM}}}}}\left({\tilde{\psi }}_{Q}^{(K)},\,{\tilde{\phi }}_{Q}^{(K)}\right)}\\&=\mathop{\max }\limits_{\psi ,\phi}\mathop{\max}\limits_{-{\mathbb{I}}/2\le A\le {\mathbb{I}}/2}\frac{{{{\rm{Tr}}}}[A^{\prime} (\psi \,-\,\phi )]\,+\,{b}_{A}(\psi )\,-\,{b}_{A}(\phi )}{{D}_{{{{\rm{LM}}}}}\left({\tilde{\psi }}_{Q}^{(K)},\,{\tilde{\phi }}_{Q}^{(K)}\right)}\\ &\ge \mathop{\max }\limits_{\psi ,\phi}\frac{{D}_{{{{\rm{tr}}}}}(\psi ,\,\phi )\,+\,{b}_{{A}^{\star }}(\psi )\,-\,{b}_{{A}^{\star }}(\phi )}{{D}_{{{{\rm{LM}}}}}\left({\tilde{\psi }}_{Q}^{(K)},\,{\tilde{\phi }}_{Q}^{(K)}\right)}\\ &\ge \mathop{\max }\limits_{\psi ,\phi}\frac{{D}_{{{{\rm{tr}}}}}(\psi ,\,\phi )\,-\,2{b}_{\max }}{{D}_{{{{\rm{LM}}}}}\left({\tilde{\psi }}_{Q}^{(K)},\,{\tilde{\phi }}_{Q}^{(K)}\right)}\end{array}$$
(37)

where in the second line we used that we can always take the numerator positive by appropriately flipping ψ and ϕ, in the third line we fixed \({A^{\prime} }^{\star }={A}^{\star }+{\mathbb{I}}/2\) to the one that achieves the trace distance \({{{\rm{Tr}}}}[{A^{\prime} }^{\star }(\psi -\phi )]={D}_{{{{\rm{tr}}}}}(\psi ,\phi )\) as in (17), and in the fourth line we used the definition of \({b}_{\max }\). □

Measuring subfidelity

To estimate the subfidelity (7) for n-qubit states ρ and σ, it suffices to measure the two quantities, \({{{\rm{Tr}}}}(\rho \sigma )\) and \({{{\rm{Tr}}}}(\rho \sigma \rho \sigma )\), which can be measured by a quantum computer38,39. For readers’ convenience, here we summarize several methods that can measure the subfidelity and see that the measurement can be done by a constant-depth quantum circuit.

Let us begin with \({{{\rm{Tr}}}}(\rho \sigma )\). Note that \({{{\rm{Tr}}}}(\rho \sigma )={{{\rm{Tr}}}}(S\,\rho \otimes \sigma )\) where S is the n-qubit SWAP operator defined by \(S\left|\psi \right\rangle \otimes \left|\phi \right\rangle =\left|\phi \right\rangle \otimes \left|\psi \right\rangle\) with \(\left|\psi \right\rangle\) and \(\left|\phi \right\rangle\) being arbitrary n-qubit pure states. This can be measured by the well-known SWAP test38, which uses one ancillary qubit and an n-qubit SWAP gate controlled on the ancillary qubit. Since the n-qubit SWAP gate can be realized by swapping individual qubits, the SWAP test runs with n uses of qubit SWAP gates controlled on the ancillary qubit, giving a circuit depth of n.

One can significantly reduce the circuit depth by employing the destructive SWAP test67. Note that \({{{\rm{Tr}}}}(\rho \sigma )={{{\rm{Tr}}}}({S}_{2}^{\otimes n}\rho \otimes \sigma )\) where \({S}_{2}:=\mathop{\sum }\nolimits_{i,j = 0}^{1}\left|ij\right\rangle \,\left\langle ji\right|\) is the qubit SWAP operator. This is obtained by measuring \(\rho \otimes \sigma\) with respect to the eigenbasis of \({S}_{2}^{\otimes n}\), which is just a tensor product of the eigenbasis of S2. Therefore, such a measurement can be accomplished by individually measuring a pair of qubits from ρ and σ with respect to the eigenbasis of S2, for which one can use, e.g., a Bell measurement. These measurements can run in parallel and thus need only a constant-depth circuit with respect to n (in fact, depth 2) that involves n two-qubit gates.
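The classical post-processing of the destructive SWAP test can be sketched as follows. Each qubit pair yields either the singlet outcome (SWAP eigenvalue −1) or a triplet outcome (eigenvalue +1), and the per-shot value is the product of these eigenvalues. The example assumes product states described by lists of Bloch vectors, for which the pairs are independent and the singlet probability of a pair is (1 − Tr(ρkσk))/2; this restriction, the function names, and the sample states are illustrative assumptions, and the average is taken exactly over all outcome patterns rather than by sampling.

```python
import itertools


def qubit_overlap(r, s):
    """Tr(rho sigma) for single-qubit states with Bloch vectors r and s."""
    return 0.5 * (1.0 + sum(ri * si for ri, si in zip(r, s)))


def swap_eigenvalue(outcomes):
    """Per-shot value: -1 for every singlet outcome, +1 for every triplet outcome."""
    value = 1
    for is_singlet in outcomes:
        value *= -1 if is_singlet else 1
    return value


def expected_overlap_from_bell_outcomes(bloch_rho, bloch_sigma):
    """Average the per-shot value over all Bell-outcome patterns (product states only)."""
    p_singlet = [0.5 * (1.0 - qubit_overlap(r, s))
                 for r, s in zip(bloch_rho, bloch_sigma)]
    total = 0.0
    for outcomes in itertools.product([False, True], repeat=len(p_singlet)):
        prob = 1.0
        for is_singlet, p in zip(outcomes, p_singlet):
            prob *= p if is_singlet else 1.0 - p
        total += prob * swap_eigenvalue(outcomes)
    return total


# Hypothetical 3-qubit product states, given as lists of Bloch vectors.
rho = [(0.0, 0.0, 1.0), (0.6, 0.0, 0.0), (0.0, 0.5, 0.5)]
sigma = [(0.0, 0.0, 0.8), (1.0, 0.0, 0.0), (0.0, 0.0, -1.0)]
print(expected_overlap_from_bell_outcomes(rho, sigma))
```

For these inputs the averaged value agrees with the product of the single-qubit overlaps, 0.9 × 0.8 × 0.25 = 0.18, up to floating-point rounding.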

We remark that, at this point, we have already obtained a valid lower bound of \({{\Delta }}{e}_{\max }\) because the second term in (7) is positive, only improving the lower bound. Nevertheless, evaluating the second term, which involves \({{{\rm{Tr}}}}(\rho \sigma \rho \sigma )\), can significantly improve the bound particularly when ρ and σ are highly noisy and their purity is small.

\({{{\rm{Tr}}}}(\rho \sigma \rho \sigma )\) can be measured by a strategy similar to the one for \({{{\rm{Tr}}}}(\rho \sigma )\), using two copies each of ρ and σ. Instead of the SWAP operator S, consider the CYCLE operator C defined as \(C\left({\otimes }_{i = 1}^{4}\left|{\psi }_{i}\right\rangle \right)={\otimes }_{i = 1}^{4}\left|{\psi }_{i+1}\right\rangle\) where \(\left|{\psi }_{i}\right\rangle ,i=1,2,3,4\) is an arbitrary n-qubit pure state with \(\left|{\psi }_{5}\right\rangle :=\left|{\psi }_{1}\right\rangle\). Then, it is straightforward to check that \({{{\rm{Tr}}}}(\rho \sigma \rho \sigma )={{{\rm{Tr}}}}(C\,\rho \otimes \sigma \otimes \rho \otimes \sigma )\). This can be measured by a generalization of the SWAP test where the CYCLE gate C is controlled on the single ancillary qubit. Similarly to the case of SWAP, the CYCLE gate C can be decomposed into \(C={C}_{2}^{\otimes n}\) where the kth C2 gate (for any k = 1, … , n) acts on the four-qubit state that consists of the kth qubits of ρ, σ, ρ, and σ. Since C2 can be realized by three SWAP gates, one can measure \({{{\rm{Tr}}}}(\rho \sigma \rho \sigma )\) with 3n uses of qubit-SWAP gates controlled on the ancillary qubit, giving a circuit depth of 3n.

Similarly to the case of \({{{\rm{Tr}}}}(\rho \sigma )\), we can realize a significant reduction in the circuit depth by making the measurement destructive. All we have to do is to measure individual four-qubit states that each C2 gate acts on with respect to the eigenbasis of C2. Since the measurement of each C2 can be run in parallel and each measurement circuit has a depth independent of n, this results in a constant-depth circuit that measures \(C={C}_{2}^{\otimes n}\).

We note the apparent similarity between the construction above and the circuit used in virtual distillation14,15. In particular, the strategy of destructive measurement was extensively discussed in ref. 15. It is interesting to see that a construction that is highly relevant to a specific error-mitigation protocol provides a bound applicable to a general class of error-mitigation protocols.

Applications to other error-mitigation protocols

Here, we discuss how our framework can be applied to two other prominent error-mitigation protocols, noise extrapolation and virtual distillation.

Extrapolation methods3,4 are used in scenarios where there is no clear analytical noise model. These strategies consider a family of noise channels \({\{{{{{\mathcal{N}}}}}_{\xi }\}}_{\xi }\), where ξ corresponds to the noise strength. The assumption here is that the description of \({{{{\mathcal{N}}}}}_{\xi }\) is unknown, but we have the ability to “boost” ξ such that \(\xi \ge \tilde{\xi }\) where \(\tilde{\xi }\) is the noise strength present in some given noisy circuit. The idea is that by studying how the expectation value of an observable depends on ξ, we can extrapolate what its value would be if ξ = 0. In particular, the Rth order Richardson extrapolation method works as follows. Let us take constants \({\{{\gamma }_{r}\}}_{r = 0}^{R}\) and \({\{{c}_{r}\}}_{r = 0}^{R}\) with \(1={c}_{0} \,<\, {c}_{1} < \cdots <\, {c}_{R}\le 1/\tilde{\xi }\) such that:

$$\mathop{\sum }\limits_{r=0}^{R}{\gamma }_{r}=1,\,\,\mathop{\sum }\limits_{r=0}^{R}{\gamma }_{r}{c}_{r}^{t}=0\quad t=1,\ldots ,R.$$
(38)

Using these constants, one can show that:

$$\mathop{\sum }\limits_{r=0}^{R}{\gamma }_{r}{{{\rm{Tr}}}}[A{{{{\mathcal{N}}}}}_{{c}_{r}\tilde{\xi }}(\psi )]={{{\rm{Tr}}}}(A\psi )+{b}_{A}(\psi )$$
(39)

where \({b}_{A}(\psi )={{{\mathcal{O}}}}({\tilde{\xi }}^{R+1})\). This allows us to estimate the true expectation value using noisy states under multiple noise levels, as long as \(\tilde{\xi }\) is sufficiently small.
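As a concrete illustration of (38) and (39), the sketch below solves the linear system (38) for the coefficients γr and applies them to a toy noise model. The exponential decay `noisy_expectation`, the ideal value 0.4, and the noise strength 0.05 are assumptions chosen purely for this example; they are not taken from any specific device or from the analysis above.

```python
import numpy as np

def richardson_coefficients(c):
    """Solve (38): sum_r gamma_r = 1 and sum_r gamma_r c_r^t = 0 for t = 1..R."""
    c = np.asarray(c, dtype=float)
    R = len(c) - 1
    # Row t of the matrix holds (c_0^t, ..., c_R^t) for t = 0..R.
    vandermonde = np.vander(c, N=R + 1, increasing=True).T
    rhs = np.zeros(R + 1)
    rhs[0] = 1.0
    return np.linalg.solve(vandermonde, rhs)

# Toy model (assumption): the noisy expectation value decays as exp(-xi).
ideal_value = 0.4
def noisy_expectation(xi):
    return ideal_value * np.exp(-xi)

xi_tilde = 0.05            # hypothetical base noise strength
c = [1.0, 2.0, 3.0]        # boost factors c_0 < c_1 < c_2, so R = 2
gamma = richardson_coefficients(c)

raw = noisy_expectation(xi_tilde)
mitigated = sum(g * noisy_expectation(cr * xi_tilde) for g, cr in zip(gamma, c))

print(gamma, raw, mitigated)
```

For this toy model the remaining bias of the extrapolated value is of third order in the base noise strength, in line with (39) for R = 2.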

Richardson extrapolation is an instance of (1, R + 1)-error mitigation. In particular, we have:

$${{{{\mathcal{E}}}}}^{(k)}={{{{\mathcal{N}}}}}_{{c}_{k-1}\tilde{\xi }}\quad k=1,\ldots ,R+1$$
(40)

in Definition 4. For an observable \(A={\sum }_{a}a{{{\Pi }}}_{a}\), where \({{{\Pi }}}_{a}\) is the projector corresponding to measuring outcome a, the POVMs \({\{{M}_{{a}^{(k)}}^{(k)}\}}_{k = 1}^{R+1}\) and classical estimator function eA take the forms:

$${M}_{{a}^{(k)}}^{(k)}={{{\Pi }}}_{{a}^{(k)}}\quad k=1,\ldots ,R+1,$$
(41)
$${e}_{A}({a}^{(1)},\ldots ,{a}^{(R+1)})=\mathop{\sum }\limits_{k=1}^{R+1}{\gamma }_{k-1}{a}^{(k)},$$
(42)

where \({\{{\gamma }_{k}\}}_{k = 0}^{R}\) are the constants determined by (38). One can easily check that plugging the above expressions into the form of Definition 4 leads to (39).

Because of the constraint \(-{\mathbb{I}}/2\le A\le {\mathbb{I}}/2\), every eigenvalue a satisfies −1/2 ≤ a ≤ 1/2. This implies that:

$$\begin{array}{l}{e}_{A,\max }\le \frac{1}{2}\mathop{\sum}\limits_{r:{\gamma }_{r}\ge 0}{\gamma }_{r}-\frac{1}{2}\mathop{\sum}\limits_{r:{\gamma }_{r} < 0}{\gamma }_{r}\\ \qquad\quad=\frac{1}{2}\mathop{\sum }\limits_{r=0}^{R}| {\gamma }_{r}| \end{array}$$
(43)

and:

$$\begin{array}{l}{e}_{A,\min }\ge -\frac{1}{2}\mathop{\sum}\limits_{r:{\gamma }_{r}\ge 0}{\gamma }_{r}+\frac{1}{2}\mathop{\sum}\limits_{r:{\gamma }_{r} < 0}{\gamma }_{r}\\ \qquad\quad=-\frac{1}{2}\mathop{\sum }\limits_{r=0}^{R}| {\gamma }_{r}| ,\end{array}$$
(44)

leading to \({{\Delta }}{e}_{\max }\le \mathop{\sum }\nolimits_{r = 0}^{R}| {\gamma }_{r}|\). On the other hand, any observable A having ±1/2 eigenvalues saturates this inequality. Therefore, we get the exact expression of the maximum spread for the extrapolation method as:

$${{\Delta }}{e}_{\max }^{{{{\rm{EX}}}}}=\mathop{\sum }\limits_{r=0}^{R}| {\gamma }_{r}| .$$
(45)
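The maximum spread (45) can be confirmed by brute force for small R. The sketch below assumes equally spaced boost factors cr = r + 1 (an illustrative choice that requires the base noise strength to be small enough for cR to be admissible) and uses the Lagrange-interpolation form of the coefficients, which satisfies (38). It then enumerates every outcome string with eigenvalues ±1/2 and compares the resulting spread of the estimator (42) with the sum of the absolute values of the coefficients.

```python
import itertools
import numpy as np

def lagrange_weights_at_zero(c):
    """gamma_r = prod_{s != r} c_s / (c_s - c_r); these weights satisfy (38)."""
    c = np.asarray(c, dtype=float)
    return np.array([
        np.prod([c[s] / (c[s] - c[r]) for s in range(len(c)) if s != r])
        for r in range(len(c))
    ])

def brute_force_spread(gamma):
    """Spread of e_A over all outcome strings with a^(k) in {-1/2, +1/2}."""
    values = [float(np.dot(gamma, outcomes))
              for outcomes in itertools.product((-0.5, 0.5), repeat=len(gamma))]
    return max(values) - min(values)

spreads = {}
for R in range(1, 5):
    c = np.arange(1, R + 2)          # c_r = r + 1 (assumption for illustration)
    gamma = lagrange_weights_at_zero(c)
    spreads[R] = (brute_force_spread(gamma), float(np.abs(gamma).sum()))

for R, (brute, formula) in spreads.items():
    print(R, brute, formula)
```

With these boost factors the spread grows quickly with the extrapolation order R, which illustrates the sampling cost associated with higher-order extrapolation.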

Next, we discuss virtual distillation14,15, which is an example of (Q, 1)-error mitigation. Let ψ be an ideal pure output state from a quantum circuit. We consider a scenario where the noise in the circuit acts as an effective noise channel \({{{\mathcal{E}}}}\) that brings the ideal state to a noisy state of the form:

$${{{\mathcal{E}}}}(\psi )=\lambda \psi +\mathop{\sum }\limits_{k=2}^{d}{\lambda }_{k}{\psi }_{k}$$
(46)

for a certain \({\{{\lambda }_{k}\}}_{k = 1}^{d}\), where d is the dimension of the system and \({\{{\psi }_{k}\}}_{k = 1}^{d}\) forms an orthonormal basis with ψ1 := ψ. We also assume that λ is known in advance. This form reflects the intuition that, as long as the noise is sufficiently small, the dominant eigenvector should be close to the ideal state ψ. For a more detailed analysis of the form of this spectrum, we refer readers to ref. 68.

The Q-copy virtual distillation algorithm aims to estimate \({{{\rm{Tr}}}}(W\psi )\) for a unitary observable W satisfying \({W}^{2}={\mathbb{I}}\) (e.g., Pauli operators) by using Q copies of \({{{\mathcal{E}}}}(\psi )\). The mitigation circuit consists of a controlled permutation and the unitary W, followed by a measurement of the control qubit in the Hadamard basis. The probability of getting outcome 0 (projecting onto \(\left|+\right\rangle \,\left\langle +\right|\)) is:

$$\begin{array}{l}{p}_{0}=\frac{1}{2}\left(1+{{{\rm{Tr}}}}\left[W{{{\mathcal{E}}}}{(\psi )}^{Q}\right]\right)\\ \qquad=\frac{1}{2}\left[1+{\lambda }^{Q}{{{\rm{Tr}}}}(W\psi )+\mathop{\sum }\limits_{k=2}^{d}{\lambda }_{k}^{Q}{{{\rm{Tr}}}}(W{\psi }_{k})\right].\end{array}$$
(47)

This implies that:

$$(2{p}_{0}-1){\lambda }^{-Q}={{{\rm{Tr}}}}(W\psi )+\mathop{\sum }\limits_{k=2}^{d}{\left(\frac{{\lambda }_{k}}{\lambda }\right)}^{Q}{{{\rm{Tr}}}}(W{\psi }_{k}),$$
(48)

providing a way of estimating \({{{\rm{Tr}}}}(W\psi )\) with the bias \(| \mathop{\sum }\nolimits_{k = 2}^{d}{({\lambda }_{k}/\lambda )}^{Q}{{{\rm{Tr}}}}(W{\psi }_{k})| \le \mathop{\sum }\nolimits_{k = 2}^{d}{({\lambda }_{k}/\lambda )}^{Q}\).
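Relations (47) and (48) and the bias bound can be checked numerically on a small example. The sketch below is an illustration under stated assumptions: the two-qubit spectrum, the random eigenbasis, and the choice W = Z ⊗ I are hypothetical, and p0 is evaluated directly from the formula (47) rather than by simulating the controlled-permutation circuit.

```python
import numpy as np

rng = np.random.default_rng(1)
d, Q = 4, 2

# Hypothetical spectrum: lam[0] is the dominant eigenvalue lambda.
lam = np.array([0.85, 0.08, 0.05, 0.02])

# Random orthonormal eigenbasis; column 0 plays the role of the ideal state psi.
g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
basis, _ = np.linalg.qr(g)
projectors = [np.outer(basis[:, k], basis[:, k].conj()) for k in range(d)]
noisy_state = sum(l * p for l, p in zip(lam, projectors))

# Pauli observable W = Z (x) I, which satisfies W^2 = identity.
W = np.kron(np.diag([1.0, -1.0]), np.eye(2))

# Outcome-0 probability from (47) and the rescaled estimate from (48).
p0 = 0.5 * (1 + np.trace(W @ np.linalg.matrix_power(noisy_state, Q)).real)
estimate = (2 * p0 - 1) * lam[0] ** (-Q)

ideal = np.trace(W @ projectors[0]).real
bias = estimate - ideal
predicted_bias = sum((lam[k] / lam[0]) ** Q * np.trace(W @ projectors[k]).real
                     for k in range(1, d))
bound = sum((lam[k] / lam[0]) ** Q for k in range(1, d))

print(bias, predicted_bias, bound)
```

Increasing Q in this sketch shrinks the bias bound geometrically, at the cost of the larger rescaling factor that multiplies the measured quantity.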

We can see that this protocol fits into our framework with K = 1 and \({{{{\mathcal{E}}}}}_{q}={{{\mathcal{E}}}}\) for q = 1, … , Q as follows. For an arbitrary observable A, we can always find a decomposition with respect to the Pauli operators {Pi} as:

$$A=\mathop{\sum}\limits_{i}{c}_{i}{P}_{i}$$
(49)

for some set of real numbers {ci}. We now apply the virtual distillation circuit for Pi with probability \(| {c}_{i}| /{\sum }_{j}| {c}_{j}|\) and, similarly to the case of probabilistic error cancellation, employ an estimator function defined as:

$$\begin{array}{l}{e}_{A}(i0):=\gamma {{{\rm{sgn}}}}({c}_{i}){\lambda }^{-Q}\\ {e}_{A}(i1):=-\gamma {{{\rm{sgn}}}}({c}_{i}){\lambda }^{-Q}\end{array}$$
(50)

with \(\gamma :={\sum }_{i}| {c}_{i}|\), where we treat i as a part of the measurement outcome. Then, we get:

$$\mathop{\sum}\limits_{i}\left[{p}_{i0}\,{e}_{A}(i0)+{p}_{i1}{e}_{A}(i1)\right]={{{\rm{Tr}}}}(A\psi )+{b}_{A}(\psi )$$
(51)

where pi0 is the probability (47) with W = Pi multiplied by \(| {c}_{i}| /{\sum }_{j}| {c}_{j}|\), pi1 is the corresponding probability of outcome 1 multiplied by the same factor, and \({b}_{A}(\psi ):=\mathop{\sum }\nolimits_{k = 2}^{d}{({\lambda }_{k}/\lambda )}^{Q}{{{\rm{Tr}}}}(A{\psi }_{k})\). Optimizing over observables \(-{\mathbb{I}}/2\le A\le {\mathbb{I}}/2\), we have:

$${{\Delta }}{e}_{\max }^{{{{\rm{V}}\; {\rm{D}}}}}=\max \left\{\left.2{\lambda }^{-Q}\mathop {\sum}\limits_{i}| {c}_{i}| \ \right|\ -{\mathbb{I}}/2\le \mathop {\sum}\limits_{i}{c}_{i}{P}_{i}\le {\mathbb{I}}/2\right\}$$
(52)

and:

$${b}_{\max }^{{{{\rm{V\; D}}}}}=\mathop{\sum }\limits_{k=2}^{d}\frac{1}{2}{\left(\frac{{\lambda }_{k}}{\lambda }\right)}^{Q}.$$
(53)
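The estimator (50) and the identity (51) can also be verified numerically. The sketch below assumes that Pi is sampled with probability |ci|/∑j|cj| and that γ = ∑i|ci|, which is the normalization consistent with (52). The Pauli terms, coefficients, and spectrum are hypothetical choices for illustration, and the outcome probabilities are again taken from the formula (47) rather than from a circuit simulation.

```python
import numpy as np

rng = np.random.default_rng(3)
d, Q = 4, 2
I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.diag([1.0, -1.0])

# Hypothetical observable A = sum_i c_i P_i; sum_i |c_i| <= 1/2 ensures -I/2 <= A <= I/2.
paulis = [np.kron(X, I2), np.kron(Z, Z), np.kron(I2, X)]
coeffs = np.array([0.2, -0.15, 0.1])
A = sum(c * P for c, P in zip(coeffs, paulis))

# Hypothetical noisy state with dominant eigenvalue lam[0] and ideal state in column 0.
lam = np.array([0.9, 0.06, 0.03, 0.01])
g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
basis, _ = np.linalg.qr(g)
projectors = [np.outer(basis[:, k], basis[:, k].conj()) for k in range(d)]
noisy_power = np.linalg.matrix_power(sum(l * p for l, p in zip(lam, projectors)), Q)

gamma = np.abs(coeffs).sum()
sampling_probs = np.abs(coeffs) / gamma

mean_estimate = 0.0
estimator_values = []
for c_i, q_i, P in zip(coeffs, sampling_probs, paulis):
    p0 = 0.5 * (1 + np.trace(P @ noisy_power).real)      # (47) with W = P_i
    e0 = gamma * np.sign(c_i) * lam[0] ** (-Q)           # e_A(i0) in (50)
    e1 = -e0                                             # e_A(i1) in (50)
    mean_estimate += q_i * (p0 * e0 + (1 - p0) * e1)
    estimator_values += [e0, e1]

ideal = np.trace(A @ projectors[0]).real
bias = sum((lam[k] / lam[0]) ** Q * np.trace(A @ projectors[k]).real
           for k in range(1, d))
spread = max(estimator_values) - min(estimator_values)

print(mean_estimate, ideal + bias, spread, 2 * gamma * lam[0] ** (-Q))
```

The last two printed values agree because the estimator takes only the two values ±γλ^(−Q) when coefficients of both signs are present, matching the quantity maximized in (52).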

Note added to proof

During the completion of our manuscript, we became aware of an independent work by Wang et al.69, which showed a result related to our Theorem 3 on the exponential scaling of the maximum estimator spread.