Error statistics and scalability of quantum error mitigation formulas

Quantum computing promises advantages over classical computing in many problems. Nevertheless, noise in quantum devices prevents most quantum algorithms from achieving the quantum advantage. Quantum error mitigation provides a variety of protocols to handle such noise using minimal qubit resources. While some of those protocols have been implemented in experiments for a few qubits, it remains unclear whether error mitigation will be effective in quantum circuits with tens to hundreds of qubits. In this paper, we apply statistics principles to quantum error mitigation and analyse the scaling behaviour of its intrinsic error. We find that the error increases linearly O(ϵN) with the gate number N before mitigation and sublinearly O(ϵ′Nγ)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$O({\epsilon }^{{\prime} }{N}^{\gamma })$$\end{document} after mitigation, where γ ≈ 0.5, ϵ is the error rate of a quantum gate, and ϵ′\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\epsilon }^{{\prime} }$$\end{document} is a protocol-dependent factor. The N\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sqrt{N}$$\end{document} scaling is a consequence of the law of large numbers, and it indicates that error mitigation can suppress the error by a larger factor in larger circuits. We propose the importance Clifford sampling as a key technique for error mitigation in large circuits to obtain this result.

Quantum computing promises advantages over classical computing in many problems. Nevertheless, noise in quantum devices prevents most quantum algorithms from achieving the quantum advantage. Quantum error mitigation provides a variety of protocols to handle such noise using minimal qubit resources . While some of those protocols have been implemented in experiments for a few qubits, it remains unclear whether error mitigation will be effective in quantum circuits with tens to hundreds of qubits. In this paper, we apply statistics principles to quantum error mitigation and analyse the scaling behaviour of its intrinsic error. We find that the error increases linearly O( N ) with the gate number N before mitigation and sub-linearly O( N γ ) after mitigation, where γ ≈ 0.5, is the error rate of a quantum gate, and is a protocol-dependent factor. The √ N scaling is a consequence of the law of large numbers, and it indicates that error mitigation can suppress the error by a larger factor in larger circuits. We propose the importance Clifford sampling as a key technique for error mitigation in large circuits to obtain this result.

I. INTRODUCTION
With the recent progress that quantum computers can have more than half a hundred qubits [1,2], it is widely accepted that we are in the era of noisy intermediate-scale quantum (NISQ) technologies [3]. A prominent feature of NISQ technologies is the potential for surpassing all classical computers in certain tasks, yet they cannot realize full quantum error correction and achieve fault tolerance due to noise and the limited number of physical qubits. Under the assumption of realistic noise models, the qubit overhead is thousands of physical qubits per logical qubit to reduce the chance of a logical error to the negligible level [4,5]. This requirement of quantum error correction is considerably beyond today's technologies.
Nevertheless, we can still perform computation tasks with NISQ devices. Protocols proposed recently allow us to bypass quantum error correction, which are termed quantum error mitigation [6][7][8][9][10][11][12][13][14][15][16][17]. Unlike error correction preserving the logical quantum state, error mitigation aims at recovering the error-free measurement outcome without physically preparing the error-free state. It can extract the correct computation result from a noisy device as long as the physical quantum state is not excessively damaged by the error accumulation [18]. For example, if the state becomes the maximally mixed state due to noise, there is nothing we can do to extract any useful information about the noise-free state. Recently, quantum algorithms using shallow circuits have been developed to minimise error accumulation. Quantum simulation algorithms based on variational, Lanczos and Monte Carlo methods are promising examples of such algorithms [19][20][21][22][23]. Although shallow-circuit algorithms and error mitigation protocols have been successful in proof-of-principle experiments [12,[24][25][26][27][28][29][30], it remains unexplored how they will perform as we venture into the * yli@gscaep.ac.cn regime of useful applications, where the computation involves more than half a hundred qubits and the device noise permits error mitigation but not yet error correction.
In this work, we address how the computation error after mitigation scales with the circuit size. In many quantum algorithms, we use quantum circuits to evaluate the expected values of observables. For example, the Hamiltonian is evaluated in the variational quantum eigensolver [20]. Because of noise, an actual quantum computer produces a biased expected value, and the bias usually increases with the circuit size due to the error accumulation. Among the error mitigation protocols, probabilistic error cancellation can completely remove the bias under ideal conditions [7,8]. Under realistic conditions, however, all protocols leave a residual bias in the computation result. This residual bias depends on the protocol and circuit depth.
To draw a conclusion regardless of the protocol, we utilise a general formalism of error mitigation. In this formalism, we recover the observable in the error-free circuit using an error mitigation formula, which is a function of observables directly measured with noisy circuits. Many such formulas are inspired by our knowledge of quantum physics, such as error extrapolation [6,7,31,32], probabilistic error cancellation [7,8] and virtual distillation [13,14,[33][34][35]. Throughout this work, when a concrete error mitigation formula is needed for analysis, we take the three aforementioned protocols as examples. An alternative way to construct the formula is optimising a parameterised function with data of selected training circuits [36,37]. We find that the optimisation can suppress the scaling of the residual bias with respect to the circuit size.
For optimisation-based error mitigation protocols, we propose the importance Clifford sampling (ICS) as an efficient and scalable method to generate training circuits. Other than being practically useful in its own right, ICS lends us a tool to analyze the residual bias in the computation result. With its help, we show that the global de- polarising model with circuit-dependent fluctuation is an effective phenomenological error model, which describes the impact of realistic error models. Using this phenomenological model, we analyse the scaling behaviour of the residual bias. We find that the bias in the computation result after an optimised error mitigation process increases in proportion to √ N , where N is the gate number. In contrast, the bias is usually proportional to N without error mitigation. Because error mitigation can suppress the error by a factor increasing with the circuit size, it is a feasible technique for large circuits.
The Results section is organised as follows. After introducing the general formalism of error mitigation, we discuss the error scaling in the mitigation protocols using the global depolarising model, which will be validated subsequently as the effective phenomenological error model. Then we propose the ICS protocol, followed by a description of the important training circuits, the algorithms to generate them and an analysis of the sampling cost. We introduce the phenomenological error model and show that the fluctuation of the effective depolarising rate follows the √ N scaling, which is numerically verified. Finally, we show the same scaling relation between the bias and the gate number in error extrapolation, probabilistic error cancellation and virtual distillation.

A. Error mitigation formula
First, we introduce the notations. In quantum computing, a quantum circuit consists of quantum gates. Let U j be the unitary operator of the j-th gate. The circuit with N gates realises the transformation U = U N · · · U 2 U 1 . Given the initial state of n qubits |0 ⊗n and observable Q, the expected value in the error-free circuit is f C = Tr[Q[U ](|0 0| ⊗n )], where [U ](•) = U • U † . Here we use C = (U 1 , . . . , U N , Q) to denote the circuit with the observable specified. If the circuit is noisy, the transformation is inexact, and we use the completely-positive map E to denote the erroneous transformation. The expected value becomes y C = Tr[QE(|0 0| ⊗n )]. Then, y C − f C is the bias without error mitigation. Note that the error in the actual computing also depends on the statistical error due to finite measurement shots.
The general form of error mitigation formulas reads where y C is the result of the circuit C after error mitigation, C 1 , C 2 , . . . are circuits generated from the primitive circuit C, and λ's denote parameters determined via error mitigation protocols. See Fig. 1. In quantum computing, we evaluate y Ci using the noisy quantum computer and calculate the error-mitigated value y C according to the formula. The bias after error mitigation is y C − f C . Next, we show how some specific error mitigation protocols fit into the general form. Many error mitigation protocols have been proposed. See Ref. [17] for a review. In this work, we take three protocols as examples: error extrapolation, probabilistic error cancellation and virtual distillation. These protocols are applicable to any quantum algorithm evaluating expected values and can largely reduce the error. We give a minimal description here and leave a more detailed overview to Appendix A.
In error extrapolation using a polynomial fitting function [7,31] , the error mitigation formula is where C i is the primitive circuit with noise increased by a factor of r i , and coefficients q i are determined by noise amplification factors (i.e. r i ). For example, for the linear extrapolation with r 1 = 1 and r 2 = 2, the formula is In probabilistic error cancellation, the completelypositive map of the error-free circuit is expressed as a linear combination of erroneous maps, i.e.
where q i are quasi-probabilities, and E i is the map of a noisy circuit C i . Here C i is generated by, for example, replacing or adding some gates in the primitive circuit C. We can work out the quasi-probability decomposition with gate set tomography data [8] or in a learning manner [36]. Given the decomposition, the error mitigation formula is the same as Eq. (2), but coefficients and circuits are different from error extrapolation. In virtual distillation, k copies of the erroneous state ρ are used to evaluate the observable in a distilled state without physically preparing it. Given the primitive circuit C that prepares the state ρ, the circuit C 1 is to evaluate y C1 = Tr(Qρ k ), and the circuit C 2 is to evaluate y C2 = Tr(ρ k ). Then the error mitigation formula reads It is similar in related protocols, e.g. verified phase estimation [34] and dual-state purification [35].

B. Bias in the global depolarising model
Before considering realistic error models, we take the global depolarising model as an example to discuss the bias in error mitigation formulas. In this section, we show that, if the error mitigation protocols are perfectly implemented, probabilistic error cancellation and learningbased error mitigation can reduce the bias to zero, while linear extrapolation and virtual distillation with two copies can reduce the bias from O(N ) to O(N 2 2 ), where N is the gate number and is the depolarising rate per gate. In the section of "Phenomenological error model" we will show that the global depolarising model successfully captures the influence of realistic noise and can be used as a phenomenological model.
In the global depolarising model, the j-th gate with error is described by the map G j = (1 − )[U j ] + D acting on the whole input state, where is the gate depolarising rate, D(•) = Tr(•)ρ m is the depolarising map, and ρ m = 1 1/2 n is the maximally mixed state. Without loss of generality, we assume that the observable is a traceless operator, and we have y The bias increases linearly with the gate number when N is significantly smaller than −1 . In the limit of large N , the bias approaches a finite value if the observable is bounded.
We take linear extrapolation as an example of error extrapolation. We can construct two noisy circuits using original gates and double-noise gates, respectively. Let G j = (1 − 2 )[U j ] + 2 D be the gate with the doubled depolarising rate, two circuits labelled by i = 1, 2 produce expected values y Ci = Tr[QE i (|0 0| ⊗n )], where E 1 = G N · · · G 2 G 1 and E 2 = G N · · · G 2 G 1 . Then, Eq. (3) leads to the error-mitigated expected value We can find that the bias in the linear extrapolation formula increases quadratically with the gate number because the linear extrapolation eliminates the first-order contribution of errors.
In probabilistic error cancellation, we take the quasiprobability decomposition of each gate as This decomposition means that we can correct the error by stochastically replacing the original gate G j with the depolarising map D according to a quasi-probability distribution. The decomposition formula of the entire circuit reads where E 1 = G N · · · G 2 G 1 corresponding to the primitive circuit, E 2 = G N · · · G 2 D in which the first gate is replaced, and so on. Then the error mitigation formula is Here, we have used that y Ci = 0 if any gate is replaced with D. Therefore, the residual bias is zero.
Lastly, we consider virtual distillation. The final state of N gates with the depolarising error is where t = 1 − (1 − ) N . Take the second-order virtual distillation (i.e. k = 2) as an example, the error-mitigated expected value is Therefore, the bias in the second-order virtual distillation increases quadratically with the gate number, which is the natural consequence of the second-order distillation formalism. So far we have been considering ideal conditions. Under realistic conditions, imperfections in the implementation cause an additional contribution to the bias. For example, zero-bias probabilistic error cancellation requires exact knowledge about the depolarising rate. If the depolarising rate is thought to be instead of its actual value and we work out the error mitigation formula with , we have y C = (1− ) N /(1− ) N f C . Then, the bias of the error mitigation formula is O(( − )N ), which is finite and increases linearly with the gate number. It is similar for error extrapolation, in which the bias scales linearly if the noise is not increased exactly as designed.
Next, we analyse the bias in learning-based error mitigation. The optimisation of an ansatz function is a flexible approach for working out a proper error mitigation formula. Various ansatz functions have been proposed [36][37][38]. In this work, we consider a general framework of this approach and focus on the scaling of the bias with respect to the gate number.
One way to compose an ansatz function is by modifying a specific-form formula. Taking the linear error extrapolation as an example, we parameterise the formula as We determine λ by minimising the bias for a set of circuits, which are called training circuits. To evaluate the bias, the error-free expected value must be known. This condition limits the choice of training circuits. We can use only one training circuit T and the corresponding data (y T 1 , y T 2 , f T ) to determine λ for the ansatz considered here. The bias of the training circuit is minimised at For the global depolarising model, the optimal parameter is λ If we take λ = λ * in the error mitigation formula, the bias is zero for all circuits with the same gate number N . Therefore, the linear error extrapolation becomes bias-free after the optimisation. It is similar for other error mitigation protocols. For probabilistic error cancellation, we can take the depolarising rate in Eq. (9) as the variational parameter, assuming the actual depolarising rate is unknown. We can find the optimal value of with data of a training circuit, and the optimal value must be the actual depolarising rate. Then, the error mitigation formula taking the optimal parameter is bias-free for all circuits. For virtual distillation, we can choose the ansatz y C = λ According to Eq. (11), the bias is zero when λ cancels the factor before f C .
We have seen that the learning-based approach can reduce the bias in error mitigation. According to the global depolarising model, the bias is zero in all examples. We get this perfect result because the global depolarising model is free of fluctuation, i.e. errors of all gates have the same impact on the expected value. The impact is a factor of 1 − . Without the fluctuation, there are many simple error mitigation formulas that can simultaneously and completely correct the bias for all circuits.
In error models with fluctuation, the optimised error mitigation formula has a finite bias, and the bias increases with the gate number. Usually, errors are localised in many actual quantum computing systems, e.g. superconducting qubits and trapped ions. The error associated with a gate only affects qubits at the location of the gate (rather than the entire quantum register as in the global depolarising model). The contribution of an error to the bias depends on its location and the circuit. For example, if the observable is the Pauli operator X of qubit-1, errors localised on qubit-2 do not affect the observable; A phase-flip error before the measurement changes the sign of X but preserves the sign if we modify the circuit by inserting a Hadamard gate before the measurement. The fluctuation of error contributions causes a finite bias, i.e. the error mitigation formula cannot simultaneously compensate for all errors for all circuits. Assuming we can successfully compensate for the average contribution of errors, the residual bias is due to the fluctuation across different circuits. We find that in a large class of error mitigation formulas, the fluctuationcaused bias is proportional to √ N . Later, we will show that the global depolarising model with fluctuation is an effective phenomenological model to characterise the impact of errors in realistic error models, see Fig. 2.

C. Importance Clifford sampling
In this section, we address the question of how to efficiently sample large training circuits by proposing sampling algorithms whose resource costs scale linearly with the circuit size. These training circuits are Clifford circuits sharing the same circuit frame as the original noisy circuit, for which the ideal measurements take non-zero expected values.
A classical computer can efficiently simulate Clifford circuits, in which all gates are Clifford gates. Because the error-free expected value f C of a Clifford circuit is computable [39,40], we can take them as training circuits. However, not every Clifford circuit is suitable. We take Eq. (13) as an example. If the training circuit T has a zero expected value, i.e. f T = 0, erroneous expected values are all zero, i.e. y T 1 = y T 2 = 0. In this case, we cannot use the equation to determine the optimal parameter. Therefore, to find the optimal parameter, we need a training circuit T whose expected value is nonzero.
It is general that some training circuits are more important than others in the learning-based approach. To optimise the error mitigation formula, we need a measure of its overall performance in various circuits. We take the mean squared error (MSE) as an example, which reads where g(C) R ≡ 1 |R| C∈R g(C) is the average of the real-valued circuit function g(C) over the circuit set R. Importance sampling is a crucial technique in statistics, in which the probability of a sample is proportional to the magnitude of its value, i.e. (y C −f C ) 2 in MSE. According to importance sampling, we prefer training circuits with a larger bias over those with a smaller bias. The largerbias circuits, i.e. error-sensitive circuits, can provide more information about noise in the circuit.
The question of sampling training circuits has two parts. The first part is how to efficiently generate an error-sensitive circuit. The second part is how to draw samples according to a distribution. We address the first part in the "Circuit generation" section and the second part in the "Circuit frame" and "Sampling algorithms" sections.

D. Circuit generation
There are different approaches of generating an errorsensitive circuit. For example, we can randomly select a circuit and calculate the expected value, and we take it as a training circuit only if the expected value is nonzero. This approach works only when the circuit size is small because circuits with a nonzero expected value are rare in large Clifford circuits. An approach usually used in randomised benchmarking is reversing the transformation by adding an additional unitary at the end of the circuit [41]. We will not take this approach because the additional unitary may significantly increase the total gate number in multi-qubit circuits. We want to generate training circuits with a specific gate number, such that the error mitigation formula is optimised for circuits with the same gate number.
In the following, we focus on the case that the observable Q is a Pauli operator. In the standard model of quantum computing, qubits at the end of the circuit are measured in the computation basis, i.e. the Pauli operator Z is measured. One can adjust the measurement basis by inserting gates before the measurement. For example, by inserting single-qubit Clifford gates before the measurement, we can measure any Pauli operator. For a general observable, a way to evaluate its expected value is by expressing it as a linear combination of Pauli operators and computing the expected value of each term.
The expected value of a Pauli operator in a Clifford circuit takes three values 0 and ±1. We can reexpress the error-free expected value as f C = Tr(Q U |0 0| ⊗n ), where Q U = U † QU is the effective observable. When U is Clifford, Q U is a Pauli operator. Let P i = I, X, Y, Z be the single-qubit Pauli operator on qubit-i, Error-sensitive circuit generation. We compose an error-sensitive circuit with two sections U0 and U , as shown in (a). U is a Clifford operator. The observable is a Pauli operator , e.g. Q = Z ⊗ I ⊗ · · · ⊗ I. U and the observable is equivalent to an effective observable Q U = X ⊗ I ⊗ · · · ⊗ Z, as shown in (b). Gates in U0 are taken from the group of single-qubit Clifford gates. We choose the gates such that all non-identity Pauli operators in Q U are mapped to ±Z, as shown in (c). P 2 ⊗ · · · ⊗ P n . Then f C = ± n i=1 0|P i |0 . If any singlequbit Pauli operator P i is X or Y , the expected value is zero. If all P i are I or Z, f C = ±1, and the sign is the same as Q U . For a randomly generated Clifford circuit, it is likely that some single-qubit Pauli operators contained in Q U are X or Y , i.e. f C = 0.
We can deterministically generate an error-sensitive circuit as follows. The setup is shown in Fig. 3. The overall unitary transformation of the circuit is U = U U 0 , where U 0 = R 1 ⊗R 2 ⊗· · ·⊗R n is one layer of single-qubit gates, and R i is the gate on qubit-i. First, given the gate number, we generate a random Clifford circuit, which realises the unitary U . If U 0 = 1 1, the effective observable is Q U = ±P 1 ⊗ P 2 ⊗ · · · ⊗ P n . Given Q and U , we can efficiently work out this expression of Q U on a classical computer. Second, we determine single-qubit gates in U 0 : we take a Clifford R i satisfying R † i P i R i = ±Z, I. For the final circuit U = U U 0 , single-qubit Pauli operators in its effective observable Q U are either I or Z. Then, the expected value is f C = ±1.

E. Circuit frame
In the learning-based error mitigation, we aim at an optimised error mitigation formula that works for a set of circuits, including training circuits and circuits useful in some computation tasks. Choosing the target circuit set is important. When the circuit set is larger, it is harder to find a formula suitable for every circuit. Therefore, we want to be focusing on a circuit set relevant to some tasks to minimise bias. A way to construct a task-relevant circuit set is by taking circuits with the same pattern of multi-qubit Clifford gates, see Fig. 4. This pattern is called the circuit frame. In many quantum computing systems, such as superconducting qubits and trapped ions, the error rates of single-qubit gates are much lower than multi-qubit gates. Errors occurring in a circuit are mainly determined by multi-qubit gates. Therefore, all the circuits with the same frame have approximately the same errors, and we are able to correct them using the same error mitigation formula.
In the fixed-frame circuit set, single-qubit gates are variables. As shown in Fig. 4, the frame includes the qubit initialisation, multi-qubit Clifford gates and measurement. Fixing these operations, we change singlequbit gates to generate the circuit set. We call each variable single-qubit gate a slot. In Ref. [36], a setup with slots after each multi-qubit gate is proposed. Here we reduce the slot number to minimise the circuit set. We only take locations of single-qubit non-Clifford gates in the task circuit as slots and add two layers of slots after the initialisation and before the measurement, respectively. The reason is that a sequence of Clifford gates not interrupted by any non-Clifford gate can be treated Algorithm 1 Generation of error-sensitive circuits.
as one multi-qubit Clifford gate.
The minimised slots have sufficient degrees of freedom for implementing Pauli twirling and probabilistic error cancellation for general error models. A Pauli error is an unwanted Pauli transformation stochastically occurring in the circuit. In Pauli twirling, we convert general errors into Pauli errors by randomly applying Pauli gates before and after each Clifford gate. We can correct a Pauli error by applying a Pauli gate to undo the error. Relevant discussions can be found in Ref. [36].
With the frame determined, a circuit depends on the choice of single-qubit gates. Let C = (U 1 , . . . , U N , Q) be a circuit (with two layers of single-qubit gates after the initialisation and before the measurement, respectively).
The corresponding frame is F = (. . . , U i , • k , . . . , U j , • q , . . . , Q), where U i is a gate on the frame, and • k denotes a slot on qubit-k. In other words, F is the same as C except that gates in slots are replaced with • k . Formally, if S = {i 1 , i 2 , . . . } are labels of slots and K = {k i1 , k i2 , . . . } are corresponding qubits, the frame is To generate training circuits of the fixed frame, we can randomly draw the gate on each slot from the 24 single-qubit Clifford gates. Because the frame is formed of Clifford gates, the entire circuit constructed in this way is Clifford. It is likely that such a random circuit has a zero expected value. We can work out a circuit with a non-zero expected value by adjusting the first-layer gates, i.e. gates after the initialisation, as described in in the previous section. We give details of this procedure in Algorithm 1.

F. Sampling algorithms
We give two algorithms for sampling error-sensitive Clifford circuits in Algorithms 2 and 3. For clarity, we use the following notations in the algorithms. F is the Algorithm 2 Non-uniform importance Clifford sampling.
1: Input F . 2: for t = 1 to NT do 3: for i = n + 1 to NR do 4: Choose a random Ri from C1.

10:
Accept and set Ct = C if u ≤ A.
circuit frame, Q is the observable, n is the qubit number, N R is the slot number, and N T is the sample number. C 1 is the single-qubit Clifford group with 24 elements.
to denote an ordered set of single-qubit Clifford gates, and R 1 , R 2 , . . . , R n are gates in the first-layer slots. w(C) is the weight of the Clifford circuit C: Q U = U † QU = ±P 1 ⊗ P 2 ⊗ · · · ⊗ P n is a tensor product of Pauli operators, then w(C) is the number of non-identity Pauli operators in the product, i.e.
where δ I,Pi = 1 if P i = I, and δ I,Pi = 0 otherwise. In Algorithm 3, we employ the Metropolis-Hasting algorithm to realise a uniform distribution of error-sensitive circuits, which requires a conditional distribution g(R |R) for suggesting a candidate sample. For example, we can take the conditional distribution as follows: we update gates in some randomly selected slots with newly generated random gates and keep gates in other slots unchanged.
There is a relation between Clifford sampling and unitary sampling which allows us to estimate the bias distribution in general unitary circuits using Clifford circuits. We use C to denote the set of Clifford circuits and U to denote the set of all unitary circuits with the same frame. For a frame with N R slots, the total number of Clifford circuits is |C| = 24 N R , i.e. each slot takes one of 24 single-qubit Clifford gates. In U, each slot can take any single-qubit unitary. When errors are independent of the choice of single-qubit gates, MSEs are the same for the two circuit sets, i.e. L U = L C [42]. Because the set C is large, we need to use the Monte Carlo method to evaluate L C .
There is a similar relation between ICS and unitary sampling. Error-sensitive circuits are a subset of all Clifford circuits, denoted by C ES . According to Algorithm 1, for all 24 single-qubit Clifford gates, which contributes a factor of 24; If P i = I, R † i P i R i = ±Z for 8 single-qubit Clifford gates, which contributes a factor of 8. The number of differentR's is 24 N R −n , then the total number of error-sensitive circuits is where C j are circuits with differentR's. In a Clifford circuit, a Pauli error either preserves the Pauli observable or flips its sign. As a result, non-sensitive Clifford circuits do not respond to Pauli errors, i.e y C = f C if f C = 0. Therefore, for Pauli error models, where η ≡ |C ES |/|C| is the proportion of error-sensitive circuits in all Clifford circuits. The distribution of error-sensitive circuits from Algorithm 2 is non-uniform. Because we uniformly choose slot gates inR, the probability of an error-sensitive circuit C is Therefore, the probability of C is proportional to 3 w(C) . If we use Algorithm 2 to sample circuits, we can evaluate L C ES according to where the expected value is taken over the distribution P nu (C). We can generate a uniform distribution of errorsensitive circuits as shown in Algorithm 3. In the uniform distribution, the probability of an error-sensitive circuit where the expected value is taken over the distribution P u (C). By changing the formula of the acceptance probability, we can use the same algorithm to generate other distributions of error-sensitive circuits.
We now summarise the algorithms and analyse their classical-computing costs. Algorithm 1 is used to generate an error-sensitive circuit. Provided with an observable Q and a frame with n qubits and N two-qubit gates, Algorithm 1 includes operations that conjugate Q (line 3) via O(N ) Clifford gates and a conditioned random selection for the single-qubit gates in the first layer (line 5 to 8). The time cost of the conjugating operations is O(nN ) according to the efficient simulation algorithm for Clifford gates [39], and the time cost of selecting gates in the first layer is O(n). Thus, the cost of Algorithm 1 is O(nN ). Algorithm 2 and Algorithm 3 are used to sample error-sensitive circuits according to the non-uniform distribution P nu (C) and uniform distribution P u (C), respectively. To generate N T circuits, the costs for both algorithms are O (N T nN ), because the elementary building block of both algorithms is nothing but the circuit generation given in Algorithm 1, which is repeated for N T times. The numerical result in Appendix C demonstrates that the number of error-sensitive circuits N T required to perform learning-based error mitigation does not increase (as far as we have observed) with either the number of gates or the number of qubits. Overall, the cost scales linearly with the number of qubits and the number of gates. Noting that the sampling algorithms assume that two-qubit gates are Clifford and errors are independent of single-qubit gates. We give discussion in Appendix D about the implementation of the algorithms when the assumptions are not satisfied.

G. Phenomenological error model
In this section, we introduce the phenomenological error model which quantifies the bias caused by realistic errors in a circuit. Then, we show that the phenomenological error model can be effectively represented by a global depolarising model with fluctuation, and the fluctuation is O(1/ √ N ) times smaller than the depolarising rate. This result suggests that, if we are able to use error mitigation to cancel the impact of the effective global depolarising error, we can reduce the bias caused by realistic errors by a factor of O(1/ √ N ). Before introducing our phenomenological error model, we give a brief overview of realistic error models. Consider a quantum gate with the unitary operator U i , the error-free output state of the gate is [U i ]ρ i , where ρ i is the input state. When the gate is imperfect, we can always express the output state with error as N i [U i ]ρ i (assuming the noisy circuit is a Markov process), where the completely positive map N i describes the effect of noise associated with the gate. In the global depolarising model, In realistic error models, N i is usually caused by local processes, such as dephasing, dissipation and imperfections in the coherent evolution. If the gate acts on qubit-1 and qubit-2, the noise mainly affects these two qubits. Taking a Pauli error model as an example, the noise map reads where We call this particular Pauli error model the gate depolarising model, in which probabilities of Pauli errors are the same. We can rewrite this summation-form error model into the product form where p /15. In the product form, the noise map is a product of 15 independent maps, and we call each of them a Pauli error channel.
The global depolarising model with fluctuation can characterise the impact of realistic errors in large circuits. Given a circuit C, the error-free final state is If we allow C to be any value (rather than limited in the interval [0, 1]), this error model is a general phenomenological error model. Given any f C and y C , the corresponding depolarising rate is C = 1 − y C /f C . Note that the bias is C f C , which is always finite even when f C = 0 and C is infinite.
We write the circuit-dependent depolarising rate as two terms, the average and fluctuation, i.e.
is the average depolarising rate with the weight f 2 C , and δ C is the circuit-dependent fluctuation. We characterise the fluctuation with the weighted standard deviation The key result is that ∆ increases with the gate number as O(N γ ), and γ ≈ 0.5, see Fig. 2.
In the rest part of this section, we show theoretically that the standard deviation ∆ is proportional to √ N using a Pauli error model. In the next two sections, we introduce an error mitigation protocol inspired by the phenomenological error model , then we verify the scaling behaviour in numerical simulations of the gate depolarising model, composite error models involving Pauli, amplitude damping and coherent errors, and a model with single-qubit-gate dependent errors. The √ N scaling is observed in all the error models.
We focus on Pauli errors to analyse the fluctuation in the phenomenological error model. For general errors, we can use Pauli twirling to convert them into Pauli errors. If error mitigation is concatenated with error correction, logical errors after correction are mainly Pauli errors [43]. Suppose errors are independent of single-qubit gates, we have the following relations, where U, C and C ES are circuit sets with the same frame.
In the above equations, the first equal sign follows because the Clifford group is a unitary-2 design [42,44], and therefore • U = • C holds if • is a polynomial of degree two in the gate unitaries. The second equal sign is a consequence of f C = 0 when C / ∈ C ES and η = |C ES |/|C|. Using f C = ±1 for error-sensitive circuits, we can obtain These relations allow us to study 0 and ∆ with errorsensitive circuits. For simplicity, we consider an error model where twoqubit gates are the dominant sources of errors in actual quantum computing devices. We assume that the initialisation, single-qubit gates and measurement are perfect. In a two-qubit gate, we assume that the probability of Pauli errors are the same, i.e. the gate depolarising model. We use N to denote the number of two-qubit gates.
The effect of local Pauli errors is equivalent to that of global depolarising errors in error-sensitive circuits. The unitary transformation of a circuit with N gates is U = U N · · · U 1 . If a Pauli error σ occurs after the i-th gate, the transformation becomes U = U N · · · U i+1 σU i · · · U 1 = σ C U , where σ C = U N · · · U i+1 σU † i+1 · · · U † N is the Pauli error propagated to the end of the circuit. Because gates are Clifford, σ C is also a Pauli operator, i.e. any Pauli error in the circuit is equivalent to a Pauli error at the end of the circuit. If the probability of the Pauli error is p, i.e. the error channel is (1 − p) [ ] is the corresponding error channel at the end of the circuit. We use the binary number t k (C) to denote whether the k-th error channel affect the observable, i.e. t k (C) = 0 if σ k,C and Q are commutative, and t k (C) = 1 otherwise. Then, the expected value is changed to The equivalent depolarising rate is The average depolarising rate is proportional to the gate number, and the standard deviation is proportional to the square root of the gate number. We can understand this phenomenon as follows. If we choose the circuit randomly from the circuit set, each error channel is switched on and off randomly, i.e. each t k takes a random value. Under the assumption that t k are independent and identically distributed random variables, the distribution of C is binomial. Let P be the probability of t k = 1 and neglect O(p 2 ) terms, the average depolarising rate is 0 2pM P , and the standard deviation is ∆ 2pM P (1 − P ). Note that M is proportional to the gate number.
In large circuits, the global depolarising model with the depolarising rate 0 is an approximate phenomenological error model. When we sample circuits composed of noisy gates, the circuit plays the role of a sampler, i.e. the impact of each gate error is a random variable dependent on the circuit configuration. In a certain regime, the total impact is the summation of individual gate errors. When the gate number is larger, the number of random variables in the summation is larger. According to the law of large numbers, the relative standard deviation of the summation decreases with the number of random variables, i.e.
where M ∝ N ∼ N . Therefore, C is in the vicinity of 0 with a high probability in large circuits.
The analysis above has shown that local gate errors can be represented by a fluctuating global depolarising error, and the ratio of the fluctuation ∆ to the depolarising rate 0 is in proportion to 1/ √ N . This result will be verified by the numerical simulations in the next two sections. We will show that, if the effective global depolarising error is removed by error mitigation, the remaining error (caused by the fluctuation) scales with the gate number as 1/ √ N . In addition, we numerically illustrate the error propagation model used in the above analysis. We show that the overall effect of propagated gate errors will become close to the global depolarising error and the relative difference between them decreases as 1/ √ N . We leave the numerical result of error propagation to Appendix B.
The analysis in this section assumes a small total error rate pM . Under this assumption, we can neglect contributions from the second order in Eq. (31). In the section of "Numerical results of the scaling behaviour", we Distributions of the bias for six-qubit periodiccycling circuits with 72 two-qubit gates under the gate depolarising noise. The error rate per gate is 0.001. Before error mitigation, the bias distribution of unitary circuits (the blue histogram) has a shape similar to the Gaussian distribution, and the bias distribution of error-sensitive circuits (the orange histogram) is concentrated at two values. When we mitigate errors according to the average depolarising rate 0, we move the two peaks to the centre, and the residual bias is determined by the width of the two peaks. Because of the equivalence between the importance Clifford sampling and unitary sampling, the bias of unitary circuits is significantly reduced after error mitigation (the red histogram).
randomly take total error rates from about 0.003 to 0.3, and we observe the √ N scaling behaviour. We remark that a modest total error rate is a general requirement of quantum error mitigation [45,46]. Unlike quantum error correction, which actively detects and corrects errors in the circuit, most quantum error mitigation protocols correct the result by post-processing the noisy experimental data. When the total error rate is high, i.e. the fidelity approaches zero, the raw data lose the information about the correct quantum state, from which post-processing cannot recover the information. For example, in probabilistic error cancellation, the sampling overhead is exponential in the number of gates given a constant error rate per gate [7,8].

H. Error mitigation according to the phenomenological error model
According to the phenomenological error model, the effective depolarising rate in large circuits is 0 with a small fluctuation. We can mitigate errors by compensating the effect of 0 . We use the root mean square error (RMSE) as the measure of the overall accuracy of an error mitigation formula in a circuit set. Before error mitigation, RMSE of unitary circuits with the same frame is increases linearly with the gate number. Using the error which increases sublinearly with the gate number. Because 0 = 1 − y C f C C ES , we can measure 0 (and ∆) by uniformly sampling error-sensitive circuits. Actually, because the fluctuation is small, we can even takê 0 = 1−y C f C for one randomly generated error-sensitive circuit C ∈ C ES , and it is likely that the error mitigation formula still works. This phenomenological-error-model inspired (PEMI) error mitigation protocol is illustrated in Fig. 5.
Similar protocols that mitigate errors according to the global depolarising model have been proposed in Refs. [37,47,48]. In these protocols, the effective depolarising rate is measured in different ways. Before considering general error mitigation formulas, we take the PEMI protocol as an example to verify the phenomenological error model, because the bias of this protocol is directly related to the fluctuation.
In the PEMI protocol, we can further reduce RMSE by optimising the error mitigation formula. If we take RMSE after mitigation is reduced to

I. Numerical results of the scaling behaviour
In this section, we numerically test the PEMI error mitigation formula and verify the scaling behaviour of 0 and ∆. Results of other error mitigation formulas will be given in the next section.
To demonstrate the scaling behaviour, we generate three families of circuits. In periodic-cycling circuits, two-qubit gates are arranged according to a fixed pattern, and we increase the circuit depth by repeating the pattern. Therefore, periodic-cycling circuits are deterministic. In linear-network circuits, two-qubit gates only act on the nearest neighbouring qubits on a one-dimensional qubit array, and we randomly place two-qubit gates in the circuit. In all-to-all-network circuits, two-qubit gates are also arranged randomly but they can act on any pair of qubits.
We use three types of error models in our numerical calculations: the gate depolarising model with a randomly selected error rate, randomly generated composite error models and a model with single-qubit-gate dependent errors. The gate depolarising model is used to derive the phenomenological error model, but the conclusion holds for other error models. The composite error model involves gate depolarising, dephasing, amplitude damping and coherent errors, which are the typical error sources in actual devices. We generate different composite error In the numerical simulation, we randomly generate a circuit frame with n qubits and N two-qubit gates, and we randomly take the error rate per gate . We generate 1000 Clifford circuits according to Algorithm 2 to estimate the phenomenological error rate and then generate 1000 random unitary circuits to compute L and L . models by randomly choosing the weight of each component and observe the same scaling behaviour as the gate depolarising model. The equivalence between Clifford sampling and unitary sampling is also used in deriving the phenomenological error model, which is under the condition that errors are single-qubit-gate independent. In the numerical result, we find that the conclusion on the scaling behaviour holds even if errors are single-qubitgate dependent. See the Methods section for details of numerical calculations.
By compensating the average depolarising rate, we can reduce RMSE from According to the discussion in the section of "Phenomenological error model", 0 ∝ N and ∆ ∝ √ N . Therefore, RMSE is reduced in error mitigation by a factor of ∆/ 0 ∝ 1/ √ N . We verify these scaling behaviours by applying the error mitigation formula in Eq. (33) to randomly generated circuits with up to ten qubits and more than a thousand twoqubit gates. To implement the formula, 0 and ∆ are measured by sampling error-sensitive circuits. RMSEs before and after error mitigation √ L and √ L are calculated and plotted in Figs. 6 and 7. For the model with single-qubit-gate dependent errors, we directly calculate and plot 0 and ∆ in Fig. 8. We can find that numerical results are consistent with scaling behaviours predicted by the phenomenological error model. In addition, we perform experiments on IBM quantum computers [49] and observe good agreement between the numerical and experimental results. We include the experimental results in Appendix F.
In Fig. 7, the error suppression ratio L/L for allto-all-network circuits meets L/L = a √ N and a is a positive number independent of the qubit number. How-ever, in Fig. 6, we find that a for linear-network circuits decreases with the qubit number. The difference between all-to-all-network and linear-network circuits is that two-qubit gates in linear-network circuits are shortrange, thus it requires more gates for the error on one qubit to propagate across the circuit network.
The error suppression ratio L/L are obtained via averaging random unitary circuits, which usually have nearzero expected values. However, in common quantum applications such as variational quantum eigensolver, the expected value is far from zero, which is atypical for random unitary circuits. Thus, we come to ask the question of whether the average suppression ratio of random unitary circuits is also the error suppression ratio of these atypical circuits. To answer this question, we numerically investigate the dependence of the error suppression ratio on the error-free expectation. The numerical result is illustrated in Appendix E, and the answer is which demonstrates that the average error suppression ratio can be applied to these atypical circuits.
We note that the √ N scaling of error-mitigated result relies on a modest total error rate. This condition is essential for quantum error mitigation methods to work properly [45,46] and is considered as a general requirement of NISQ computation [3]. For each data point in Figs. 6 and 7, we randomly choose the error rate per gate such that the total error rate N is in the interval about 0.003 to 0.3. with the composite model. In the numerical simulation, we randomly generate a circuit frame with n qubits and N two-qubit gates, and we randomly take the error rate per gate . We generate 1000 Clifford circuits according to Algorithm 2 to estimate the phenomenological error rate and then generate 1000 random unitary circuits to compute L and L .

J. Error scaling in optimised error mitigation formulas
In this section, we utilise the phenomenological error model to show that one can suppress the scaling of the residual bias in a learning-based manner. For imperfect error extrapolation and probabilistic error cancellation, the error scaling after the optimisation is ∝ √ N . The imperfections are due to the imperfect control of noise in error extrapolation and inaccurate knowledge of the error model in probabilistic error cancellation. For virtual distillation, the result is similar.
First, we analyse the error scaling of error extrapolation. An error mitigation formula usually involves multiple circuits. For each of them, we can effectively characterise the impact of noise using our phenomenological error model. Taking the linear error extrapolation as an example, the two circuits C 1 and C 2 are the same as the primitive circuit C, but the noise level is doubled in C 2 . In the phenomenological error model of the circuit C i , the average depolarising rate is i , the rate fluctuation is δ C,i , and the standard deviation is ∆ i . Because C 1 and C 2 are the same circuit, their fluctuations are correlated: Suppose effective depolarising rates are approximately proportional to the noise level, we have 2 2 1 and δ C,2 = 2δ C, 1 . Therefore, the fluctuation-caused bias depends on the covariance matrix K i,j ≡ η −1 δ C,i δ C,j f 2 C U . For the linear extrapolation formula in Eq. (12), RMSE after mitigation depends on average depolarising rates i and the covariance matrix K, i.e.
where E = (1 − 1 , 1 − 2 ) T and Λ = (λ, 1 − λ) T . Taking λ = 2 /( 2 − 1 ), we can remove the contribution of average depolarising rates, and RMSE becomes ( Here, we have used that K is positive semi-definite, ∆ 2 1 and ∆ 2 2 are diagonal elements of K, and Λ √ 5 does not change significantly with the gate number. Note that this upper bound holds even if the noise is not increased as designed, and we can further reduce RMSE by optimising the parameter λ. In Fig. 9, we plot RMSE before and after error mitigation. In the optimised error mitigation formula, we take λ = 2 /( 2 − 1 ). The numerical result is consistent with the scaling behaviour predicted by the phenomenological error model.

Theorem 1.
Consider the general extrapolation formula in Eq. (2), let i , δ C,i and ∆ i be the average depolarising rate, rate fluctuation and standard deviation of the circuit C i , respectively, then The proof is straightforward. Let Λ = (q 1 , q 2 , . . .) T , the expression of RMSE is the same as Eq. (35). We can prove the theorem by taking Λ = E/ E 2 .
Second, we investigate the error scaling of probabilistic error cancellation. In probabilistic error cancellation, we reconstruct the transformation of the ideal circuit as a linear combination of transformations of noisy circuits. A practical way is decomposing each ideal gate in the circuit as a linear combination of noisy gates. In general, we can work out the decomposition as follows. If U i is the unitary operator of the ideal gate, the completely-positive map of the noisy gate is N i [U ]. We can cancel the noise by applying an inverse noise N −1 i = k q i,k E i,k after the noisy gate, and the overall effective gate is Here, E i,k are some noisy gates, i.e. we insert the gate E i,k after the gate N i [U ] with the quasi-probability q i,k .
i , the error in the gate is completely removed; otherwise, effective noise in the gate is N −1 i N i . We consider a Pauli error model with gate depolarising errors and dephasing errors as an example. For a twoqubit gate on qubit-1 and qubit-2, the noise map is where . Suppose our knowledge about the noise map is inaccurate and we correct the error according to the gate depolarising model, we have When λ = −16 d /(15 − 16 d ) and z = 0, we can correct all errors in the gate; otherwise, the effective gate has a finite error rate. We can suppress the error scaling in imperfect probabilistic error cancellation by optimisation. For an error mitigation formula worked out according to an inaccurate error model, we can treat it as having a virtual quantum computer, in which the error model is given by N −1 i N i . Then, we can describe the error in this virtual machine using the phenomenological error model and reduce the bias using the PEMI protocol. We can use the formula y C = (1− 0 )y C , where 0 and y C are respectively the average depolarising rate and expected value in the virtual machine. Then the residual bias of y C is determined by the standard deviation ∆ of the virtual machine. Actually, it is not necessary to modify the formula to suppress the error scaling. For example, we can take λ in Eq. (37) as a variational parameter and optimise it in ICS. The numerical result in Fig. 9 shows that RMSE of probabilistic error cancellation with the optimised λ scales as ∝ √ N . Third, we investigate the error scaling of virtual distillation. The virtual distillation formula is nonlinear unlike error extrapolation and cancellation. For a general error mitigation formula, suppose the truncation on the Taylor expansion is valid, we have where a i = 1 − i . In Eq.  can compensate average depolarising rates by a factor. In the numerical simulation, we determine the factor by taking the original virtual distillation formula y C = y C1 /y C2 as a virtual machine and concatenating it with the PEMI protocol according to the formula y C = (1− 0 )y C , where 0 is the average depolarising rate of y C . We find that RMSE of the optimised formula scales as N α and α < 1/2 as shown in Fig. 10. The remaining error after virtual distillation changes from the coherent mismatch [14] to decoherence er-ror when the gate number increases. With the errormitigation formula y C = Tr(Qρ 2 )/Tr(ρ 2 ), the decoherence error is reduced from N (gate number times error rate per gate) to (N ) 2 , while the coherent mismatch is not suppressed, about which we give a short introduction in Appendix A.3. Because the remaining decoherence error increases quadratically with the gate number, the coherent mismatch is the dominant component in the remaining error when the gate number is small, and the decoherence error is the dominant component when the gate number is large. This change in the type of error could explain the bifurcation in Fig. 10, and the result suggests that the optimisation protocol can further reduce the remaining decoherence error but not the coherent mismatch.
In the numerical simulations, we have taken into account imperfect implementations in probabilistic error cancellation and error extrapolation. Assuming the implementation is perfect, probabilistic error cancellation can reduce RMSE to zero, and error extrapolation can reduce RMSE to a much lower level. Note that perfect implementation requires the exact knowledge of the error model or exact control of the error model. In virtual distillation, we have only taken into account errors in those gates that prepare the state ρ and neglected errors in those gates that implement virtual distillation, e.g. the controlled-swaps in Ref. [14].

III. DISCUSSION
In this work, we show that the residual bias in the computation result after error mitigation scales with the gate number N as O( N γ ) if the error mitigation formula is optimised. Here, γ ≈ 0.5, and is a parameter depending on the error rate of quantum gates and the error mitigation formula. In contrast, the bias in the computation result before error mitigation scales linearly with N . The two scaling relations lead to a somewhat surprising result: We can suppress the computation error by a larger factor in larger circuits.
In the analysis, we introduce a phenomenological error model characterising errors as the global depolarisation with fluctuation, which captures the impact of realistic noise on the computation result. For the optimisation of an error mitigation formula, we propose ICS as an efficient method of generating training circuits, where only those Clifford circuits sensitive to Pauli errors are selected. The optimised formula removes the average contribution of noise and leaves the fluctuation proportional to √ N . We verify this result with the numerical simulation of various circuits, error models and error mitigation formulas, from which we observe that the scaling behaviour is universal.
Despite the encouraging scaling of bias in error mitigation, we point out that the circuit size is still limited by the quality of quantum devices. On a quantum device with a finite error rate per gate, the bias increases with the circuit size. Although the bias scaling after error mitigation is advantageous in comparison with the linear error accumulation before mitigation, at certain circuit sizes the computation result becomes sufficiently random that error mitigation cannot faithfully recover the information. Therefore, the efficacy of error mitigation is conditional on the quality of the quantum device. In general, the minimum requirement for error mitigation to take effect is a non-zero fidelity between the errorfree and erroneous circuits, and the performance is better with higher fidelity. Beyond this, the impact of the unmitigated error rate on the accuracy of the mitigated result depends on the mitigation method. In probabilistic error cancellation, for example, the variance in calculating the expectation value of the result increases with the error rate. Another example is that, after the virtual distillation using two copies, the bias in the expectation value scales quadratically with the error rate. Once the device can implement the circuit with sufficiently high fidelity (which is not necessarily close to one but we take a fidelity of 0.9 as an example), error mitigation can improve the computation result to a much higher accuracy (equivalent to quantum computing with fidelity of 0.99 if the error is reduced by a factor of ten).
In scalable quantum computers, we can adopt quantum error correction to increase the fidelity of logical qubits. Protocols concatenating error correction with error mitigation have been proposed recently [50][51][52]. Fault-tolerant devices will enable the implementation of much deeper circuits than NISQ hardware. Our result of the scaling behaviours suggests that error mitigation can perform even better in the fault-tolerant regime than in the NISQ regime.

A. Circuits
We use three families of circuits: periodic-cycling circuits, linear-network circuits and all-to-all-network circuits.
Periodic-cycling circuits. The qubit array has n qubits, and n is even. All qubits are initialised in the state |0 . After initialisation, a layer of single-qubit gates is placed, see Fig. 11(a). The circuit pattern is periodic, and each period has two layers of two-qubit gates. In the first layer, a controlled-Z gate is applied on qubit-(2i − 1) and qubit-(2i), where i = 1, 2, . . . , n/2. In the second layer, a controlled-Z gate is applied on qubit-(2i − 1) and qubit-(2i − 2), and qubit-0 and qubit-n are the same qubits. After each two-qubit gate, a single-qubit gate is applied to each of the two qubits. The observable O is Z of the first qubit. All single-qubit gates are taken as slots in the corresponding circuit frame.
Linear-network circuits. Except for the pattern of two-qubit gates and observable, the setup is the same as periodic-cycling circuits. All two-qubit gates are controlled-Z gates. For each of them, we randomly generate an integer i ∈ [1, n] and apply the two-qubit gate on qubit-(i − 1) and qubit-i, see Fig. 11(b). The observable is O = P 1 ⊗ P 2 ⊗ · · · ⊗ P n , where P = I, Z is taken randomly.
All-to-all-network circuits. It is similar to linearnetwork circuits. For each of the two-qubit gates, we randomly generate two different integers i, j ∈ [1, n] and apply the two-qubit gate on qubit-i and qubit-j, see Fig. 11(c).

B. Error models
Several error models are used in the numerical simulations.
Gate depolarising model. The model is given in Eq. (20), and only two-qubit gates have errors. This model is used to generated data shown in Figs. 2, 5 and 6. In Figs. 2 and 5, we take = 0.001. In Figs. 6, for each data point, we randomly generate a circuit (and the corresponding circuit frame) and an error rate. For a circuit with N two-qubit gates, we generate a random real number η ∈ [−2.5, −0.5], and we take = 10 η /N as the error rate per gate. Notice that 10 η is the total error rate.
Composite error model. Only two-qubit gates have errors. For a two-qubit gate U , the gate with errors is where N is the gate depolarising error in Eq. (20) with the error rate d , is the dephasing error on qubit-i, R i,P = e −i θ i,P 2 P is a single-qubit rotation on qubit-i, and is the amplitude damping on qubit-i. This model is used to generate data shown in Fig. 7 (c) and (d). For each data point, we randomly generate the error model parameters as follows. For a circuit with N two-qubit gates, we generate a random real number η ∈ [−2.5, −0.5], and we take = 10 η /N as the error rate per gate. Then, we take d = (1+0.2κ d ) /9, i,z = (1+0.2κ i,z ) /9, θ i,P = κ i,P /9 and i,a = (1 + 0.2κ i,a ) /6. Each κ is taken randomly in the interval [−1, 1].
Gate-dependent error model. In this model, both single-qubit and two-qubit gates have errors. The error model is the gate depolarising model. For two-qubit gates, the noise map is given by Eq. (20). For a singlequbit gate R, the gate with error is S[R], where and s = 0.1π −1 arccos |Tr(R)| 2 . This model is used to generate data shown in Fig. 8, and we estimate 0 and ∆ using 10000 unitary circuits in U.
Gate depolarising and dephasing model. The model is given in Eq. (36), and only two-qubit gates have errors. This model is used to generate data shown in Figs. 9 and 10. In the numerical simulation, we approximate the error model with Z 2 Z 1 N for simplicity in coding, which only causes a small difference and will not change the conclusion.
The above error models take into consideration kinds of physical noise processes and are able to simulate noises in realistic quantum devices. The depolarising error N and dephasing error Z simulates the relaxation process and the dephasing process [53,54], which are the main contributions to noise in realistic quantum devices. Amplitude damping A refers to the infidelity caused by energy dissipation. Random rotations R refer to coherent errors caused by imperfect controls. This composite model takes into consideration all the above realistic imperfections and it was demonstrated in Ref. [42] that the composite model can produce error distributions resembling that in experiments on a superconducting quantum processor. The single-qubit-gate dependent error model S is the single-qubit depolarising error with an error rate depending on the gate parameters. This error model takes into consideration the realistic situation that gate errors increase with the gate time. Additionally, we will make a direct comparison between the experimental results and simulation results in Appendix F and show that experimental results are consistent with simulation results.

C. Error mitigation protocols
We verified the scaling behaviour by simulating various error mitigation protocols. The formula in Eq. (33) is used to generate data shown in Fig. 5, 6 and 7. The PEMI protocol in Fig. 9 is y C = (1 − 0 ) −1 y C . In optimised error extrapolation, we take λ = 2 /( 2 − 1 ). In optimised probabilistic error cancellation, we take λ = −16 d /(15 − 16 d ) − 2 z : We have searched for the optimal λ using ICS data and found that the numerical optimal value is close to it. In optimised virtual distillation in Fig. 10, the formula is y C = (1 − 0 ) −1 y C . To implement optimised error mitigation formulas, we estimate 0 , ∆, 1 , 2 or 0 using 1000 error-sensitive circuits, according to Algorithm 2. Then, we generate 1000 unitary circuits with the same frame to estimate RMSE.

DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request.

CODE AVAILABILITY
The codes that support the findings of this study are available from the corresponding author upon reasonable request. effective model in error mitigation. Our final result is on the bias scaling of error mitigation formulas.

AUTHOR CONTRIBUTIONS
DYQ, YC and YL together conceived the ideas. YL developed the theory. DYQ and YL performed the numerical simulation. DYQ and YC implement the experiment. DYQ, YC and YL prepared the manuscript.

COMPETING INTERESTS
The authors declare no competing interests.

Appendix A: Error mitigation
Quantum noise is the main obstacle preventing us from implementing desired quantum computations. Error mitigation refers to the recently-proposed techniques to handle noises with low quantum expenses, including error extrapolation, probabilistic error cancellation and virtual distillation etc. We here give a brief introduction to some of those techniques.

Error extrapolation
Suppose the erroneous expected value can be expressed as the series expansion where the coefficient a k is independent of i if we assume that the generator of noise operation is independent of the noise amplification factor r i . We can infer the errorfree expectation f C by taking Eq. (A1) as the polynomial fitting function and extrapolating to the zero error limit r = 0. Taking Richardson extrapolation [7] as an example, the error extrapolation formula reads where the coefficients are determined by and the remaining error is |y C − f C | = O( m+1 ). Take r 1 = 1, r 2 = 2 and m = 1, it becomes the linear extrapolation as in Eq. (3) of the main text.

Probabilistic error cancellation
As we provided in Eq. (4) of the main text, the errorfree map in probabilistic error cancellation is expressed as where q i are quasi-probabilities and E i is the map of a noisy circuit C i . If there is only local gate error, e.g. there is no cross-talk between gates, probabilistic error cancellation can be conducted in a gate-wise manner. Suppose the erroneous map of the j-th gate is and p k is already determined using gate-set tomography, we can realise the error-free gate via where E i,j = [σ i ]M j is realised by inserting Pauli gate [σ i ] in front of the j-th gate and q i,j is the solution to Then the formula for error mitigated circuit is

Virtual distillation
The objective of virtual distillation is to obtain the expectation of an observable in the purified state ρ = ρ k /Tr(ρ k ). Suppose the spectrum decomposition of the erroneous state is ρ = where λ 0 is the largest eigenvalue. ρ exponentially gets close to |ψ 0 ψ 0 | as k increases, and error can be completely removed in the limit of k = ∞ if |ψ 0 ψ 0 | is the true error-free state ρ 0 . However, |ψ 0 ψ 0 | could deviate from ρ 0 even when the error is completely incoherent, thus virtual distillation usually has an additional error caused by the coherent mismatch 1 − Tr(ρ 0 |ψ 0 ψ 0 |) [14].

Appendix B: Error propagation
In the "Phenomenological error model" section, the analysis shows that the impact of realistic errors can be described as fluctuating global depolarising error. Suppose 0 is the effective depolarising rate and ∆ is the fluctuation, we have concluded that ∆/ 0 ∼ 1/ √ N where N is the number of gates. Here, we provide an alternative approach to justify the fluctuating global depolarising error model. We consider a Clifford circuit where N P is the number of errors propagated to ±P ⊗ · · · . If N X = N Y = N Z , the overall propagated error is exactly the global depolarising error. We now numerically demonstrate that N P N ≈ 1 4 ∀ P ∈ {I, X, Y, Z} when the gate number is large, and the expectation of The result is shown in Fig. 12, which verifies our result in Eq. (32) of the main text.

Appendix C: Verification of feasibility
We have proposed optimising the error mitigation formula via ICS. Here, we verify that the optimisation via ICS is feasible with increasing numbers of gates or qubits. Explicitly, since we have already analytically show that  the cost is O(N T nN ) for ICS to sample N T error-sensitive Clifford circuits with n qubits and N gates in the "Sampling algorithms" section, we here show that N T required for optimising the error mitigation formula is finite and does not increase with either the number of qubits or the number of gates. Suppose y C (λ) is the error mitigation formula parameterised by λ, the optimal error mitigation formula is y C (λ u ), where λ u minimises MSE over unitary circuits, i.e. λ u = argmin λ L U (λ), and L U (λ) = (y C (λ)−f C ) 2 C∈U . We aim to find λ u via ICS. Suppose the optimised parameter we find via ICS is λ c . If the number of unitary circuits and Clifford circuits are both infinite, it holds that λ c = λ u and L U (λ c ) = L U (λ u ). If the number of circuits N T is finite, we should expect that λ c = λ u and L U (λ c ) ≤ L U (λ u ) (since λ u always min- imises L U ). We demonstrate in Fig. 13 that, with a finite N T , L U (λ c )/L U (λ u ) is always finite and close to 1 for both algorithms proposed in the "Sampling algorithms" section, which verifies that it is feasible to implement optimised error mitigation utilising ICS.

Appendix D: Assumptions about the circuits
Having proposed algorithms for generating errorsensitive circuits with different distributions, we can use these circuits to determine the variational parameters in an error mitigation formula. The equivalence between Clifford sampling and unitary sampling is under the condition that errors are single-qubit-gate independent. When errors are weakly single-qubit-gate dependent, we can use hybrid sampling to estimate L U [42]. Clifford-dominant circuits are circuits with a few non-Clifford gates, and they can also be efficiently simulated on a classical computer. Hybrid sampling uses both Clifford and Clifford-dominant circuits. The equivalence between ICS and unitary sampling holds for Pauli error models. When there are errors other than Pauli errors occurring in the circuit, non-sensitive Clifford circuits with f C = 0 may also respond to errors. In this case, we can sample some non-sensitive circuits, maybe with a smaller probability than error-sensitive circuits according to the principle of importance sampling, to estimate L U . The dependence of errors before and after error mitigation on the error-free computation result. |f C | is the absolute error-free computation result that ranges from 0 to 1. and are respectively the errors before and after error mitigation. Here, we take the PEMI formula in error mitigation. We generate 1000 error-sensitive circuits to determine the parameter in the formula and perform error mitigation on 1000 random unitary circuits. Each data point in the plots is obtained via averaging ten random unitary circuits with |f C | values in the vicinity. The horizontal line in the lower panel represents the value of / obtained from random error-sensitive Clifford circuits, whose expectation equals the value of / obtained from random unitary circuits. The error model is the gate depolarising model, and the error rate per gate is 2 × 10 −4 .
Our methods rely on the two-qubit gates in the circuit being Clifford gates and can be generalised to situations when two-qubit gates are non-Clifford in practice. We take the Molmer-Sorensen (MS) gate [56] M S(θ) = exp(iθZ 1 Z 2 ) as an example to illustrate the generalisations. The methods can be generalized as long as we can transpile the circuit to a two-qubit Clifford circuit. For example, instead of directly implementing the MS gate at the physical level, one can implement it via M S(θ) = (CNOT) exp(iθZ 2 )(CNOT) where CNOT gate is realised with MS gate and single-qubit gates. Although the transpiled gate sequence introduces extra errors since we used two MS gates to realize one, it may be worthwhile considering the significant error reduction using our method. Without explicitly implementing the transpilation above, there are other approaches to apply the method depending on the practical condition. If the error is independent of θ, we can randomly choose θ from {0, π/2, π, 3π/2} such that the training circuit is Clifford. If the error weakly depends on θ, we can employ the hybrid sampling mentioned above, i.e. we allow a few θ to be arbitrary and sample the others from {0, π/2, π, 3π/2}. In some restricted situations, the method can be applicable regardless of the dependence of error on θ. For example, in trotterised circuits or unitary coupled cluster circuits, there are many repeated blocks of the form where S i is a single-qubit unitary gate and P i is a Pauli operator on the i-th qubit. If the noise only depends on the length of pulse manipulating qubits, i.e. the absolute value of θ, one can let θ = −θ * and S i = I for the block right before a block where θ = θ * , which makes the circuit Clifford.

Appendix E: Dependence on the error-free computation result
In the "Error scaling in optimised error mitigation formulas" section, we have demonstrated that the optimised error mitigation can significantly reduce RMSE. Such a result is obtained with completely random unitary circuits, in which the typical outcome is close to zero. It is natural to ask whether the error suppression ratio obtained via averaging random unitary circuits still applies to atypical but useful circuits such as the circuits in variational quantum eigensolver and quantum approximation optimisation algorithms. Here, we numerically investigate the dependence of the error suppression ratio on the error-free expected value. The result is shown in Fig. 14. We can find that the error suppression ratio for the atypical unitary circuits (that the error-free outcome is far from zero) is close to that obtained via averaging random unitary circuits. From the figure, we can also see that the error suppression ratio is relatively small when the error-free outcome is close to zero; This is not problematic since the absolute error is usually small when the error-free outcome is small.
Because of the low probability of successfully finding a circuit with near-one error-free outcome via postselecting completely random unitary circuits, we take a trick to overcome this problem. We first randomly generate error-sensitive Clifford circuits using ICS, and then we alter single-qubit gates with small-angle random rotations, such that random unitary circuits with near-one outcomes can be efficiently generated.

Appendix F: Revisiting the scalability and validation on quantum processors
Our theoretical analysis and numerical simulations are mainly based on the gate depolarising model, which predicts that the error suppression ratio L/L increases with the gate number as ∝ √ N . Here, we experimentally verify this on six IBM quantum computers. In all experiments, we observe that the error suppresion ratio L/L increases in the regime of low total error rate, and decreases when the gate number is too large, which is consistent with the numerical result of the gate depolarising model. We note that the decrease is expected, since it occurs when the condition of modest total error rate is vioalted (see the "Phenomenological error model" section for the discussion of this condition) and can be explained by the gate depolarising error model.
In the experiments, we take the PEMI protocol in error mitigation. The experiments are performed on six IBM Quantum open-access devices [49] with various numbers of qubits and numbers of CNOT gates. We perform two-qubit experiments on ibmq lima and ibm oslo, threequbit experiments on ibmq belem and ibmq quito, a fivequbit experiment on ibmq manila and a seven-qubit experiment on ibm nairobi. The number of CNOT gates ranges from four to a hundred. For each pair of qubit number and gate number, we randomly generate 50 errorsensitive Clifford circuits via Algorithm 2 to determine the optimal parameter in the PEMI protocol and generate 50 unitary circuits with error-free outcomes larger than 0.5 to compute the RMSEs √ L and √ L . Each circuit is measured for 100000 shots. For two-, threeand five-qubit experiments, the circuit frame is the same as that in Fig. 11(a), without the two-qubit gate acting on the first and last qubits because there is no direct connection between these two qubits on the devices. For the seven-qubit experiment on ibm nairobi, the circuit frame is depicted in Fig. 15(a) due to the qubit network shown in Fig. 15(b). The qubit network only allows up to two CNOT gates implemented in parallel (which may cause significant idle-operation errors compared with other experiments). The experimental results are shown in Figs. 16, 17 and 18.
As a comparison to the experiments, we simulate the same circuits numerically with the gate depolarising error model. In addition to gate errors, measurement errors are also taken into account. We take the measurement error rate provided by IBM Quantum servers. In gate operations, we assume that only CNOT gates have errors, i.e. the error rate per CNOT gate effectively includes errors in single-qubit gates and idle operations. Notice that the gate depolarising model is different from the actual error model of the devices. Therefore, instead of taking the CNOT-gate error rate provided by IBM Quantum servers, we determine the error rate by fitting the numerical result to the experimental result. Notice that in fitting, we only use the experimental result of Clifford circuits. Specifically, blue squares in the left panels in With the error rates, we repeat each experiment on the numerical simulator ten times: For one of them, we take the same random circuits used in the experiment, and the result is represented by orange triangles in the right panels in Figs. 16, 17 and 18; and for the other nine simulations, we regenerate random circuits, and the results are represented by translucent orange triangles. The disparity among the orange curves suggests moderate fluctuation due to the finite number of random circuits.
We can find that the experimental behaviour of the RMSE ratio L/L is consistent with the numerical re-sult. The main result of this work is that L/L increases with the gate number as ∝ √ N under the condition that the total error rate is modest. Given a constant error rate per gate, a large gate number violates the condition and causes a vanishing fidelity. In both the experimental and numerical results, L/L increases with N then decreases when N is too large. In each experiment, the turning points occur at similar gate numbers in the experiment and numerical simulation. Notice that L/L is the RMSE ratio for general unitary circuits rather than Clifford circuits, therefore, the consistent behaviour is not due to the fitting (The fitting is implemented for Clifford circuits to work out the error rate per gate). Actually, the absolute values of the RMSE ratio are different in the experiment and numerical simulation, though their trends are the same. This difference is reasonable because the gate depolarising model is different from the actual error models of devices.  L exp,U are RMSEs of general unitary circuits without and with error mitigation, respectively, obtained in experiments. In (b) and (d), the error suppression ratios L/L are plotted, which are computed using general unitary circuits. Axes on the left and right sides correspond to experimental and numerical results, respectively. The translucent curves show numerical simulations with different randomly generated circuits. In the seven-qubit experiment, we use the same method as in Appendix E to generate random unitary circuits.