## Introduction

The scaling of quantum computers is limited by quantum decoherence1,2,3,4. State-of-the-art quantum computers that consist of fewer than 100 qubits5,6 are of interest for noisy intermediate-scale quantum (NISQ) applications, where qubits are used without error correction7. Therefore, classically simulating imperfect and noisy circuits is vital for the development and design of NISQ-era algorithms, as well as for characterizing errors in quantum hardware8. Existing classical simulators have typically focused on emulating noiseless circuits; in this case, simulations of quantum circuits with more than 50 qubits have been demonstrated9,10,11. There are several techniques for speeding up simulations of certain types of circuits12,13,14,15,16 or algorithms17,18,19,20. High-performance implementations have been developed for simulations of noisy circuits21,22,23,24, as well as lightweight open-source tools such as the density-matrix simulators in Qiskit25 and Cirq26. These simulators are based on evolving full density matrices (FDMs), which are prohibitively expensive for large numbers of qubits.

In an open quantum system, decoherence can be modeled as interactions with a large environment27. One can determine the properties of an open quantum system with the Monte Carlo wave-function method, in which a system is decomposed into an ensemble of pure states that evolve individually and are then averaged28,29,30,31,32. On the other hand, the dynamics can also be characterized by the Lindblad equation33,34, which describes decoherence well in various quantum hardware architectures35,36,37. To reduce the complexity of the Lindblad equation, one can project the quantum states onto a lower-dimensional basis using filtering theory and simulate the states more efficiently with reasonable accuracy38,39.

In this work, we combine the ideas of the pure state decomposition28,29 and the low-dimensional basis projection39 to efficiently simulate noisy quantum circuits. Compressed representations are commonly applied to classical simulation of quantum systems with high symmetry or low entanglement. Applications range from compressed sensing for quantum state tomography40,41, limiting bond dimensions of tensor networks14,42,43, and low-rank factorization of Hamiltonians44 to efficient representations of states15,45,46,47. In our case, we find that the von Neumann entropy of a density matrix is often small when the noise level is low, implying that it is possible to model the density matrix using a matrix of lower rank with minimal information loss. We achieve this by iteratively projecting onto a subspace of the eigenbasis, and evolving only a small ensemble of pure states.

In the following, we present a complete and explicit algorithm, which decomposes a mixed density matrix into a low-rank matrix akin to an ensemble of pure states, applies gate and Kraus operators to this low-rank matrix, and computes the output density matrix and probability distribution. The procedure involves iterative compression of the density matrix to maintain a numerically compact form with minimal error. As an example, Fig. 1a shows a six-qubit density matrix after a quantum circuit that solves a Grover’s search problem for finding computational basis states with Hamming weight ≤2. The same circuit is simulated with depolarizing noise of strength p = 0.33% by an exact method and by our low-rank method. In this example, we use a low-rank representation that has only 20% of the full rank. Figure 1b, c shows that the low-rank method simulates noise with high accuracy. More extensive benchmarking and detailed descriptions of the performance and accuracy of the method are in “Implementation and benchmarking.”

In fact, we show that it is possible to evolve and compute quantities of interest (e.g., expectation values or qubit probabilities) of a (2N × 2N) density matrix without ever forming a matrix of size (2N × 2N), where N is the number of qubits. This reduction in the required memory may allow for the study of systems that would otherwise be prohibitively large.

The algorithm is then assessed by a sequence of randomized benchmarking under various types and strengths of noise channels to test its practical speedup and error. We show that the algorithm performs consistently in random circuits and in structured circuits for quantum algorithms such as Grover’s search, with a speedup of more than two orders of magnitude and with a small error (around 0.01%) in the probability distribution associated with the final output density matrix. Furthermore, as N becomes larger and approaches the range for which classical simulations become difficult, the advantage of this algorithm over the standard method of FDM evolution continues to increase.

## Results

We present an algorithm that simulates noisy circuits using a low-rank representation of density matrices. The algorithm consists of two parts, low-rank evolution and eigenvalue truncation, which are covered in “Low-rank evolution” and “Eigenvalue truncation” below. In “Kraus operator decomposition”, an iterative procedure consisting of these two parts is introduced. Then, in “The LRET algorithm” we explain how to sample the associated probability distribution without explicitly forming the FDM.

### Low-rank evolution

A coherent quantum system can be represented by either a state vector $$\left|\psi \right\rangle$$ or a density matrix $$\rho =\left|\psi \right\rangle \left\langle \psi \right|$$. $$\left|\psi \right\rangle$$ has dimensions (2N × 1), while ρ has dimensions (2N × 2N). Because of the substantial difference in the sizes of $$\left|\psi \right\rangle$$ and ρ, most classical simulations of coherent quantum systems work directly with the state vector representation. Unfortunately, the state vector representation does not directly allow for the presence of decoherence. Instead, the evolution of a quantum system in a decoherent noise channel can only be described by a density matrix ρ, which is generally more computationally demanding to simulate. The evolution of a quantum state in a noisy quantum circuit is described by refs. 48,49,50

$${\rho }^{(d+1)}=\mathop{\sum}\limits_{\alpha }{p}_{\alpha }{K}_{\alpha }{\rho }^{(d)}{K}_{\alpha }^{\dagger }$$
(1)

where Kα are Kraus matrices, pα are the corresponding probabilities, and ρ(d) and ρ(d + 1) are the density matrices before and after a noise channel. This Kraus-based noise model can encapsulate many different types of decoherence channels, such as bit flip and depolarizing, and is therefore an extremely useful tool for modeling the operation of NISQ-era quantum algorithms in the presence of decoherence noise on real devices. However, most current implementations of this Kraus-based decoherence model explicitly work with a (2N × 2N) density matrix. Our approach exploits the fact that for realistically small noise levels (p ≲ 0.01), the von Neumann entropy of the density matrix usually remains low, raising the possibility of working with an approximate, but accurate, low-rank representation of the density matrix. This observation is supported by an entropic analysis in the supplementary information.
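As a concrete illustration of Eq. (1), the following minimal NumPy sketch (illustrative only, not the authors' implementation) applies a single-qubit depolarizing channel to a full density matrix:

```python
import numpy as np

# Single-qubit depolarizing channel via Eq. (1):
# rho' = sum_a p_a K_a rho K_a^dagger, with K_a in {I, X, Y, Z}.
I = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.diag([1.0 + 0j, -1.0])

def depolarize_fdm(rho, p):
    """Apply depolarizing noise of strength p to a one-qubit density matrix."""
    kraus = [(1 - p, I), (p / 3, X), (p / 3, Y), (p / 3, Z)]
    return sum(pa * K @ rho @ K.conj().T for pa, K in kraus)

rho0 = np.array([[1, 0], [0, 0]], dtype=complex)  # |0><0|
rho1 = depolarize_fdm(rho0, 0.03)                 # trace-preserving channel
```

Even this small example exposes the cost driver: the channel acts on the full (2N × 2N) matrix, which the low-rank approach below avoids.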

While the possibility of a low-entropy density-matrix evolution is tantalizing, many pragmatic details remain to be resolved about how to efficiently identify and exploit this rank structure while avoiding the formation and manipulation of any density-matrix-sized quantities in the rank identification process. Here and in “Eigenvalue truncation” we describe an algorithm that accomplishes this for density matrices that exhibit the desired low-rank structure. A density matrix, ρ, can always be decomposed as an outer product of the $$L\in {{\mathbb{C}}}^{{2}^{N}\times V}$$ matrix,

$$\rho \equiv L{L}^{\dagger },$$
(2)

for some $$V\in {\mathbb{N}}$$. While the choice of L is not unique, in general, it is possible to find L with V equal to the rank of the density matrix using decomposition methods such as singular value decomposition. For density matrices with rank smaller than 2N, this form most compactly represents the state with the minimal column dimension. Using the decomposition in Eq. (2), we can evolve the density matrix by updating L without evaluating ρ(d) explicitly as in Eq. (1). For a gate operation

$$\begin{array}{ll}{\rho }^{(d+1)}={{\mathcal{G}}}^{(d)}{\rho }^{(d)}&={G}^{(d)}{\rho }^{(d)}{G}^{(d)\dagger }\\ &={G}^{(d)}{L}^{(d)}{L}^{(d)\dagger }{G}^{(d)\dagger }\\ &={L}^{(d+1)}{L}^{(d+1)\dagger }\end{array}$$
(3)

where L(d + 1) ≡ G(d)L(d), $${{\mathcal{G}}}^{(d)}$$ is a gate operation, and G(d) is its corresponding gate matrix. Likewise, for a Kraus operator,

$$\begin{array}{lll}{\rho }^{(d+1)}&=&{{\mathcal{K}}}^{(d)}{\rho }^{(d)}=\mathop{\sum }\limits_{\alpha =1}^{A}{p}_{\alpha }{K}_{\alpha }^{(d)}{\rho }^{(d)}{K}_{\alpha }^{(d)\dagger }\\ &=&\mathop{\sum }\limits_{\alpha =1}^{A}{p}_{\alpha }{K}_{\alpha }^{(d)}{L}^{(d)}{L}^{(d)\dagger }{K}_{\alpha }^{(d)\dagger }\\ &=&\mathop{\sum }\limits_{\alpha =1}^{A}{J}_{\alpha }^{(d+1)}{J}_{\alpha }^{(d+1)\dagger }={L}^{(d+1)}{L}^{(d+1)\dagger }\end{array}$$
(4)

where $${J}_{\alpha }^{(d+1)}\equiv \sqrt{{p}_{\alpha }}{K}_{\alpha }^{(d)}{L}^{(d)}$$, $${{\mathcal{K}}}^{(d)}$$ is a Kraus operator, $${K}_{\alpha }^{(d)}$$ and pα are its corresponding Kraus matrices and probability factors, and A is the number of Kraus matrices in the operation. L(d + 1) is formed by concatenating $${J}_{\alpha }^{(d+1)}$$ as columns, i.e., $${L}^{(d+1)}\equiv [{J}_{1}^{(d+1)},{J}_{2}^{(d+1)},...,{J}_{A}^{(d+1)}]$$. Note that, due to the concatenation, the number of columns of L changes after each noise operation; each column in L(d) will evolve to A columns in L(d + 1). For example, if the dimension of L(d) is 2N × 3 and A = 2, then L(d + 1) has dimension 2N × 6. The computational complexity of updating L is $${\mathcal{O}}(V{2}^{N})$$.
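The low-rank updates of Eqs. (3) and (4) can be sketched in a few lines of NumPy (a toy one-qubit bit-flip example; the helper names `apply_gate` and `apply_kraus` are ours, not from the paper):

```python
import numpy as np

def apply_gate(L, G):
    # Eq. (3): rho = L L^dag evolves by replacing L with G L.
    return G @ L

def apply_kraus(L, kraus):
    # Eq. (4): each column of L spawns one column per Kraus matrix;
    # `kraus` is a list of (p_alpha, K_alpha) pairs.
    return np.hstack([np.sqrt(pa) * K @ L for pa, K in kraus])

# One qubit in the pure state |0>: L has shape (2, 1).
L = np.array([[1.0 + 0j], [0.0]])
p = 0.03
X = np.array([[0, 1], [1, 0]], dtype=complex)
kraus = [(1 - p, np.eye(2, dtype=complex)), (p, X)]  # bit-flip channel

L = apply_kraus(L, kraus)        # column count doubles: (2, 1) -> (2, 2)
rho = L @ L.conj().T             # diag(1 - p, p)
L_flipped = apply_gate(L, X)     # a subsequent X gate swaps the populations
```

Note that the gate update leaves the number of columns unchanged, while each Kraus operation multiplies it by A, which is exactly why the truncation step of the next subsection is needed.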

### Eigenvalue truncation

The number of Jα vectors in layer d grows to A times the number of Jα vectors in layer d − 1. For a system starting from a pure state, this number is $${A}^{d}$$ at the dth layer. This scaling makes tracking all Jα vectors computationally intractable over time if left unchecked. Fortunately, in practice, when the noise level is small, the number of columns corresponding to significant eigenvalues of L(d) is often found to grow only polynomially with the system size. An eigenvalue truncation procedure is used to project the density matrix into a lower-rank representation that retains only the highest contributing columns, akin to quantum filtering in simulating open quantum systems38. We truncate the eigenvectors whose eigenvalues are negligible by

$${\rho }^{(d)}={U}^{(d)}{{{\Lambda }}}^{(d)}{U}^{(d)\dagger }\simeq {\tilde{U}}^{(d)}{\tilde{{{\Lambda }}}}^{(d)}{\tilde{U}}^{(d)\dagger }$$
(5)

where U(d), Λ(d) are the eigenvectors and eigenvalues of ρ(d), and from which we define $${\tilde{U}}^{(d)}$$, $${\tilde{{{\Lambda }}}}^{(d)}$$ as approximations in which the unimportant eigenvalues and eigenvectors are truncated. The truncation is based on a threshold ϵ. The descending-ordered eigenvalues are picked one by one until they sum to 1 − ϵ. The remaining eigenvalues, which sum to ϵ, are discarded along with their associated eigenvectors. Although more sophisticated ways of truncation exist51,52, we use this simple cutoff criterion to better control the error introduced by the procedure. Furthermore, this truncation is the optimal scheme for preserving the trace and the 2-norm of a matrix, a result known as the Eckart–Young theorem53.
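The cutoff criterion described above can be sketched as follows (a hypothetical `truncate_eigenvalues` helper; the eigenpairs are assumed to be given):

```python
import numpy as np

def truncate_eigenvalues(eigvals, eigvecs, eps):
    """Keep descending-ordered eigenvalues until they sum to >= 1 - eps;
    discard the rest together with their eigenvectors (Eq. (5))."""
    order = np.argsort(eigvals)[::-1]
    lam, U = eigvals[order], eigvecs[:, order]
    # Index of the first partial sum reaching 1 - eps (inclusive cut).
    k = int(np.searchsorted(np.cumsum(lam), 1.0 - eps)) + 1
    return lam[:k], U[:, :k]

# Toy spectrum (exact binary fractions to keep the arithmetic clean).
lam = np.array([0.5, 0.25, 0.125, 0.0625, 0.0625])
U = np.eye(5)
lam_t, U_t = truncate_eigenvalues(lam, U, eps=0.1)  # keeps 4 of 5 pairs
```

With eps = 0.1, the cumulative sums 0.5, 0.75, 0.875, 0.9375, 1.0 first reach 1 − ϵ = 0.9 at the fourth eigenvalue, so the fifth eigenpair is discarded.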

The representation above is a useful method to retain only the most important eigenvectors in Kraus noise models. However, the approach appears to have the computational problem that it involves the eigendecomposition of a (2N × 2N) matrix. Solving an eigenvalue problem of a (2N × 2N) matrix has a complexity of $${\mathcal{O}}({({2}^{N})}^{3})$$, which is very expensive and would overwhelm the benefit of low-rank simulation. However, using the result from theorem 1 in the Supplementary information, we can efficiently compute the eigenvectors and eigenvalues without explicitly constructing the density matrix. The complexity of the eigenvalue problem is instead $${\mathcal{O}}({(AV)}^{2}{2}^{N})$$ where V < 2N is the number of columns of L in Eq. (2), and A is the number of Kraus matrices comprising the Kraus operator. Although the complexity still contains 2N, it is an improvement by an exponential factor over $${\mathcal{O}}({({2}^{N})}^{3})$$.
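The idea behind this reduction can be sketched as follows (toy sizes, and not the authors' code; it relies on the standard Gram-matrix identity that the nonzero eigenvalues of LL† equal those of the small matrix L†L, with eigenvectors recovered as LW/√λ):

```python
import numpy as np

rng = np.random.default_rng(0)
N, V = 6, 4                        # a 2^N x V low-rank factor (toy sizes)
L = rng.normal(size=(2**N, V)) + 1j * rng.normal(size=(2**N, V))

# Eigenpairs of rho = L L^dag from the (V x V) Gram matrix L^dag L:
# each eigenpair (lam_i, w_i) of L^dag L gives an eigenpair of rho with
# eigenvector U[:, i] = L @ w_i / sqrt(lam_i).
M = L.conj().T @ L                 # small matrix; rho is never formed here
lam, W = np.linalg.eigh(M)
U = L @ W / np.sqrt(lam)           # columns: unit-norm eigenvectors of rho

# For these toy sizes we can afford to form rho and verify directly.
rho = L @ L.conj().T
```

Forming M costs $${\mathcal{O}}({V}^{2}{2}^{N})$$ and its eigendecomposition $${\mathcal{O}}({V}^{3})$$, which is the source of the $${\mathcal{O}}({(AV)}^{2}{2}^{N})$$ scaling quoted above.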

### Kraus operator decomposition

We model noisy quantum channels with single-qubit Kraus operators. To model noise induced by gate operations, one may apply Kraus operators following each gate. If the circuit is sparse in terms of gates, correspondingly, there are only a few Kraus operators per layer. In this case, AV stays small and eigenvalue truncation can be done relatively efficiently. However, consider a dense noisy circuit with a depolarizing Kraus operator acting on every qubit at each time-step (Fig. 2). The depolarizing Kraus operator comprises four matrices50, and therefore $$A={4}^{N}$$, making eigenvalue truncation intractable with complexity $${\mathcal{O}}({({4}^{N}V)}^{2}{2}^{N})$$. This can be resolved by decomposing the Kraus operator into several groups

$${{\mathcal{K}}}^{(d)}=\mathop{\prod }\limits_{\beta =1}^{B}{{\mathcal{K}}}_{\beta }^{(d)}.$$
(6)

This decomposition is possible because noise channels in a quantum computer can be well-described by a combination of one and two-qubit Kraus operators5. In the example in Fig. 2, instead of an eigenvalue truncation after each whole Kraus layer, a truncation is applied after each $${{\mathcal{K}}}_{\beta }^{(d)}$$ in order to prevent A from getting too large. In other words, the evolution $${\rho }^{(d+1)}\to {{\mathcal{K}}}^{(d)}{\rho }^{(d)}$$ is approximated by

$${\rho }^{(d+1)}={\mathcal{E}}{{\mathcal{K}}}_{B}^{(d)}{\mathcal{E}}{{\mathcal{K}}}_{B-1}^{(d)}\cdots {\mathcal{E}}{{\mathcal{K}}}_{2}^{(d)}{\mathcal{E}}{{\mathcal{K}}}_{1}^{(d)}{\rho }^{(d)}$$
(7)

where $${\mathcal{E}}$$ is the eigenvalue truncation operation. Because a Kraus operator is decomposed into B operations, each decomposed operation has only $$\bar{A}={4}^{\frac{N}{B}}$$ Kraus matrices, denoted as $${K}_{\beta ,\alpha }^{(d)}$$ with subscript α running from 1 to $$\bar{A}$$. Each truncation has complexity $${\mathcal{O}}({({4}^{\frac{N}{B}}V)}^{2}{2}^{N})$$ and there are B truncation steps, so the total complexity is $${\mathcal{O}}(B{({4}^{\frac{N}{B}}V)}^{2}{2}^{N})$$. We choose B such that $$M=\frac{N}{B}$$ is constant and the complexity becomes $${\mathcal{O}}(N\frac{{({4}^{M}V)}^{2}}{M}{2}^{N})$$. We define the “intermediate rank” VI as

$${V}_{I}^{2}\equiv N\frac{{({4}^{M}V)}^{2}}{M}.$$
(8)

This quantity will be important to estimate the conditions for which low-rank simulation is faster than FDM simulation in “Implementation and benchmarking”.

We focus on the simulation of NISQ-era circuits, which are shallow in terms of circuit depth. However, note that for very deep circuits, VI can grow larger than 2N. In this case, there is no benefit of doing low-rank evolution, and we switch back to FDM evolution. The algorithm for low-rank noise simulation is summarized in Algorithm 1.

### The LRET algorithm

“Low-rank evolution”, “Eigenvalue truncation”, and “Kraus operator decomposition” together describe the algorithm for obtaining the final low-rank representation, L(D). We refer to this algorithm as low-rank simulation with eigenvalue truncation (LRET). The time complexity of LRET is $${\mathcal{O}}({V}^{2}{2}^{N})$$, while the complexity of an FDM simulation is $${\mathcal{O}}({2}^{2N})$$. The LRET algorithm thus provides an improvement by an exponential factor over the FDM method. The space complexity of LRET is $${\mathcal{O}}(V{2}^{N})$$, while that of the FDM method is $${\mathcal{O}}({2}^{2N})$$. The reduced memory requirement may allow for simulations of larger circuits than would be possible otherwise. The full procedure of LRET is summarized in Algorithm 1. The concatenation and the eigenvalue truncation in the algorithm are described in “Low-rank evolution” and “Eigenvalue truncation”, respectively. Note that although we use the $$\left|000...\right\rangle$$ state as the initial state (as shown in Algorithm 1) throughout this article, the algorithm works with any fiducial state. Also note that the density matrix, probability distribution, and expectation value in Algorithm 1 are optional and are included as examples of user-specified outputs.

Once we have the final low-rank representation, L(D), we can construct the density matrix using Eq. (2). However, this FDM quantity is rarely needed in standard practice. For example, one may want to simulate the behavior of quantum hardware where the only information we obtain is from measurements in a fixed computational basis, which sample the probability mass function, Prob(x), that is defined by the underlying density matrix. In this case, low-rank simulation gains an additional speedup as the probability distribution is simply

$$\,\text{Prob}\,(x)=\mathop{\sum }\limits_{v=1}^{V}{L}_{x,v}^{(D)}{L}_{v,x}^{(D)\dagger }.$$
(9)

where the subscripts x and v run over the computational-basis dimension and the column dimension of the matrix, respectively. The measurement count for each state is then sampled from this distribution. Note that, if the goal of a circuit simulation is to observe and count the measurement outputs, a density matrix is not formed at any point of the simulation as long as the intermediate rank is smaller than 2N. Similarly, we can evaluate observables in low-rank form using $$O=\,\text{Tr}\,(\rho {\mathcal{O}})=\,\text{Tr}\,(L{L}^{\dagger }{\mathcal{O}})=\,\text{Tr}\,({L}^{\dagger }{\mathcal{O}}L)$$ where O is the expectation value of an observable $${\mathcal{O}}$$.
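A minimal sketch of Eq. (9) in NumPy (the `probabilities` helper is ours): Prob(x) is just the squared norm of row x of L, so the density matrix is never formed.

```python
import numpy as np

def probabilities(L):
    # Eq. (9): Prob(x) = sum_v |L[x, v]|^2 -- the squared row norms of L,
    # i.e., the diagonal of rho = L L^dag without constructing rho.
    return np.einsum('xv,xv->x', L, L.conj()).real

# Two-qubit example: an equal mixture of |00> and |11>, rank-2 factor.
L = np.zeros((4, 2), dtype=complex)
L[0, 0] = L[3, 1] = 1 / np.sqrt(2)
probs = probabilities(L)          # [0.5, 0, 0, 0.5]
```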

Algorithm 1 Low-rank simulation with eigenvalue truncation

L(0) = [1, 0, 0,...,0]

for d ← 1 to D do

L(d) ← G(d)L(d−1)

for β ← 1 to B do

$${L}^{(d)}\leftarrow {\text{Concatenate}}_{\alpha }(\sqrt{{p}_{\alpha }^{(d)}}{K}_{\beta ,\alpha }^{(d)}{L}^{(d)})$$

L(d) ← Eigenvalue Truncationϵ(L(d))

end for

end for

Compute quantities of user’s choice:

Density Matrix ρ(D) ← L(D)L(D)†

$$\,\text{Prob}\,(x)\leftarrow {\sum }_{v}{L}_{x,v}^{(D)}{L}_{v,x}^{(D)\dagger }$$

$$\,\text{Expectation}\leftarrow \text{Tr}\,({L}^{(D)\dagger }{\mathcal{O}}{L}^{(D)})$$
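Putting the pieces together, Algorithm 1 can be sketched end-to-end as follows (a toy two-qubit example with Hadamard gates and a bit-flip channel; the structure follows the pseudocode above, but all names and sizes are illustrative, not the authors' implementation):

```python
import numpy as np

def truncate(L, eps):
    # Eigenvalue truncation via the small Gram matrix L^dag L; the kept
    # low-rank factor is simply L @ W[:, :k] for the top-k eigenvectors W.
    lam, W = np.linalg.eigh(L.conj().T @ L)
    order = np.argsort(lam)[::-1]
    lam, W = lam[order], W[:, order]
    k = int(np.searchsorted(np.cumsum(lam), (1.0 - eps) * lam.sum())) + 1
    return L @ W[:, :k]

def kron_op(op, qubit, n):
    # Embed a single-qubit operator at position `qubit` in an n-qubit register.
    mats = [np.eye(2, dtype=complex)] * n
    mats[qubit] = op
    out = mats[0]
    for m in mats[1:]:
        out = np.kron(out, m)
    return out

N, p, eps, depth = 2, 0.01, 1e-6, 4
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
X = np.array([[0, 1], [1, 0]], dtype=complex)

L = np.zeros((2**N, 1), dtype=complex)
L[0, 0] = 1.0                                   # |00...0>
for d in range(depth):
    L = kron_op(H, d % N, N) @ L                # gate layer
    for q in range(N):                          # bit-flip Kraus on each qubit
        Kq = kron_op(X, q, N)
        L = np.hstack([np.sqrt(1 - p) * L, np.sqrt(p) * Kq @ L])
        L = truncate(L, eps)                    # keep the rank small

probs = np.einsum('xv,xv->x', L, L.conj()).real  # user-chosen output
```

Here the truncation is applied after each per-qubit Kraus group, mirroring the Kraus operator decomposition of Eq. (7); without it, the column count would grow as 2 per channel application.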

### Implementation and benchmarking

In the following sections, we discuss the performance of the algorithm for low-rank noise simulation using our implementation in an in-house quantum circuit simulator built in Python. In our simulator, one can specify one of two options for a noisy simulation: FDM simulation or LRET as described in Algorithm 1. We benchmark the two simulation methods in three scenarios: randomized benchmarking, state preparation for quantum chemistry, and Grover’s search algorithm. For time benchmarking, we use Cirq 0.5.0, a widely used open-source FDM simulator, to show that our implementation of the FDM method is reasonably optimized and serves as a good baseline for comparison. All of the benchmarking was executed on an AWS c5.12xlarge instance.

The general result is that the LRET method is two orders of magnitude faster than the FDM method with a trade-off of ~0.01% error. The error is measured by the variational distance54 between the output density matrices from the LRET method (ρLRET) and from an exact method (ρexact), such as FDM. Because this quantity depends on the noise level, we define a more appropriate measure for error benchmarking

$$\,\text{Distortion}\,\equiv \frac{T({\rho }_{\text{LRET}},{\rho }_{\text{exact}})}{T({\rho }_{\text{exact}},{\rho }_{\text{noiseless}})}$$
(10)

where ρnoiseless is the density matrix from the simulation of the same circuit without noise, and T is the variational distance between the probability distributions defined by the two density matrices in the computational basis, i.e., $$T({\rho }^{A},{\rho }^{B})={\sum }_{i}| {\rho }_{ii}^{A}-{\rho }_{ii}^{B}|$$. To aid in a qualitative understanding of the distortion measure, it is useful to note that the distance between ρLRET and ρexact captures the error or information loss incurred by the eigenvalue truncation procedure. The denominator scales this value relative to the change induced by the noise channel to ρnoiseless. For example, when the output error is 0.01% and the change induced by noise is 0.1%, the distortion is 10%.
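The distortion of Eq. (10) is straightforward to compute from the diagonals of the three density matrices; the following sketch (helper names are ours) uses toy numbers matching the example above, with a 0.01% output error against a 0.1% noise-induced change:

```python
import numpy as np

def variational_distance(rho_a, rho_b):
    # T(rho^A, rho^B) = sum_i |rho^A_ii - rho^B_ii| in the computational basis.
    return np.abs(np.diag(rho_a) - np.diag(rho_b)).sum()

def distortion(rho_lret, rho_exact, rho_noiseless):
    # Eq. (10): truncation error relative to the change induced by the noise.
    return (variational_distance(rho_lret, rho_exact)
            / variational_distance(rho_exact, rho_noiseless))

rho_noiseless = np.diag([1.0, 0.0])
rho_exact = np.diag([0.9995, 0.0005])     # T(exact, noiseless) = 0.001
rho_lret = np.diag([0.99955, 0.00045])    # T(lret, exact) = 0.0001
d = distortion(rho_lret, rho_exact, rho_noiseless)  # -> 10%
```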

### Randomized benchmarking: depolarizing noise

Randomized benchmarking is a standard tool used to evaluate the performance of quantum hardware55,56,57. We use the idea to benchmark time and error metrics for the two simulation methods on an ensemble of randomly generated circuits. The circuits are generated from random choices of common gates, including X, Y, Z, S, T, RX, RY, RZ, SWAP, CZ, and CNOT. “Randomized benchmarking: circuit sparsity and connectivity” below discusses the different types of circuits we use for benchmarking. If not explicitly stated otherwise, the random circuits are dense circuits, i.e., circuits in which a gate acts on each qubit at each time-step, where one- and two-qubit gates appear with equal probability and the two-qubit gates connect adjacent qubits. Inspired by the fact that noise is well-described by a set of Kraus operators whose dimension is independent of the circuit size5, all Kraus operators act on single qubits in the benchmarking, as illustrated in Fig. 2.

To understand the conditions for which the LRET method gains a speedup against the FDM method, we inspect the rank evolution of density matrices evolved using LRET. The rank and the intermediate rank (as defined in Eq. (8)) directly influence the computational complexity and the speed of the LRET algorithm. In the following sections, we first benchmark the time, error, and rank for simulations under different noise channels. Then, we assess performance on a variety of differently characterized random circuits.

In this section, we benchmark dense circuits with the depolarizing noise channel, which is defined as $$\rho \to (1-p)\rho +\frac{p}{3}X\rho X+\frac{p}{3}Y\rho Y+\frac{p}{3}Z\rho Z$$. It has been shown that noise in quantum hardware, on average, behaves like depolarizing noise58,59, making it a good description of realistic noise channels.

Figure 3a shows that, while our FDM method and Cirq take a similar amount of time to run for 13-qubit circuits with p = 0.1% under depolarizing noise, the LRET method is much faster than both. In shallow circuits, LRET is 200× faster than FDM (Fig. 3b). Even for higher depth circuits, LRET remains roughly 100× faster. This can be understood by considering the size of the numerical representation these methods are keeping track of. While the FDM method evolves a 2N × 2N density matrix, LRET only keeps track of a 2N × V representation of a density matrix. Effective use of the LRET algorithm amounts to choosing the truncation threshold so as to best manage the trade-off between the speed of the simulation and the error in the simulation results. This trade-off is characterized in Fig. 4.

Although V is always smaller than 2N at low depth for N > 2 (Fig. 3c), the conditions for a speedup are determined by the intermediate rank VI, defined in Eq. (8); the LRET method is faster if VI < 2N. As shown in Fig. 3d, e, there is a significant speedup only when N > 7 for the circuit depths considered herein. Since VI increases approximately polynomially and 2N increases exponentially in N, the range of depths for which LRET has an advantage will increase even more as the number of qubits increases. Critically, LRET has an advantage precisely in the range where classical simulations of FDMs begin to become burdensome. Furthermore, the space of circuit sizes in which LRET provides a significant advantage also characterizes the circuits of the early NISQ era, with a few tens of qubits, shallow-depth circuits, and noise strengths p < 0.01 (ref. 7).

The LRET method gains a speedup by truncating the negligible components of a density matrix. Although the truncation in each step is small, over time the discrepancy from the exact methods, like FDM, can build up. Here, we benchmark the error induced by the eigenvalue truncation in the LRET method.

Figure 4a shows the distortion as a function of the number of qubits (N), depth of the circuit (D), and eigenvalue truncation threshold (ϵ). While the distortion depends on N and D, ϵ is the most impactful. There is a general trend that the error starts to increase rapidly for values of ϵ larger than 10−4, so we take ϵ = 10−4 as a reasonable choice. The speedup for different ϵ is plotted in Fig. 3f, and shows the trade-off between accuracy and speedup; the speedup decreases when we increase accuracy (decrease distortion). From Fig. 4b, we can see that the error is <8% for all the N and D considered herein. By slicing out the number of qubits and circuit depths axes in Fig. 4b, we can see that the distortion grows roughly linearly with N and D (Fig. 4c, d).

We now examine how the noise strength affects the performance of the LRET method. All benchmarking in this section uses dense circuits with depolarizing noise channels of various strengths. In Fig. 5a, the speedup of the LRET method against the FDM method degrades as the noise strength grows. From p = 0.1% to p = 1%, the speedup drops from the order of 100× to 10× at D = 12. This degradation is due to the higher-order terms in noise, which scale super-linearly in p. Although the truncation threshold ϵ is scaled linearly with the noise strength by fixing the ratio ϵ/p = 0.1 in this benchmarking, more higher-order terms must be retained to meet the truncation threshold. This results in a larger V and VI (Fig. 5b), and thus a longer computational time.

Figure 5c shows that the distortion as a function of ϵ has a universal shape regardless of the circuit size and/or noise strength p. The magnitude of the distortion is relatively insensitive to circuit size. The noise strength p proportionally shifts the curves along the ϵ axis (i.e., when p and ϵ are scaled by the same factor, the error stays in a similar range).

### Randomized benchmarking: other noise channels

We now consider noise simulations under bit flip and amplitude damping channels for dense circuits with p = 0.1% and ϵ = 10−4. Bit flip can be represented by ρ → (1 − p)ρ + pXρX in the operator-sum formalism. In other words, the quantum state is rotated about the x-axis with probability p. The bit flip channel is a special case of anisotropic noise. The results for the bit flip channel generalize to other types of anisotropic noise in randomized benchmarking, such as phase flip and all other channels described by ρ → (1 − p)ρ + pUρU† where U is any (2 × 2) unitary matrix. The amplitude damping channel dissipates the energy of a qubit toward its lower energy basis state, usually denoted $$\left|0\right\rangle$$. We use the operator-sum formalism ρ → E0ρE0† + E1ρE1†, where E0 and E1 are Kraus operators for amplitude damping, defined as $${E}_{0}=\left[\begin{array}{ll}1&0\\ 0&\sqrt{1-p}\end{array}\right]$$and $${E}_{1}=\left[\begin{array}{ll}0&\sqrt{p}\\ 0&0\end{array}\right].$$
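A quick numerical check of the amplitude damping Kraus matrices defined above (a sketch verifying the completeness relation ΣaEa†Ea = I and the decay of the excited state):

```python
import numpy as np

p = 0.1                                          # damping probability
E0 = np.array([[1, 0], [0, np.sqrt(1 - p)]], dtype=complex)
E1 = np.array([[0, np.sqrt(p)], [0, 0]], dtype=complex)

def amplitude_damp(rho):
    # rho -> E0 rho E0^dag + E1 rho E1^dag
    return E0 @ rho @ E0.conj().T + E1 @ rho @ E1.conj().T

rho_excited = np.array([[0, 0], [0, 1]], dtype=complex)   # |1><1|
rho_out = amplitude_damp(rho_excited)            # diag(p, 1 - p)
```

The channel moves population toward the preferred $$\left|0\right\rangle$$ state, which is the mechanism behind the slower rank growth noted for amplitude damping below.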

Figure 6a shows that LRET is significantly faster than FDM for all noise types. This is especially true for the amplitude damping channel, where LRET completes in <3 s while FDM takes more than 25 min at depth = 12. The speedups of the depolarizing and bit flip channels are about 100×, while the speedup for amplitude damping is almost 1000× (Fig. 6b). This is related to the slower increase of the intermediate rank (Fig. 6c), due to the fact that amplitude damping has a preferred state, the $$\left|0\right\rangle$$ state, regardless of the details of the qubit state.

The error benchmarking for the bit flip channel, Fig. 7a–d, is very similar to that of the depolarizing channel, except that bit flip is more tolerant to ϵ when the circuit size is small. At ϵ = 10−3, the distortion is ~50% in the bit flip channel while it is ~100% in the depolarizing channel. When ϵ = 10−4, distortion is reasonably small for all N and D considered herein, so we take 10−4 as a recommended choice of ϵ.

In contrast to depolarizing and bit flip channels, under amplitude damping the distortion saturates at lower values when ϵ is large (Fig. 7e). This is possibly because amplitude damping prefers the ground state, favoring the LRET method, which keeps only a few important components of a quantum state. When ϵ = 10−4, the distortion is smaller than 4% for all circuits considered in Fig. 7f. We see that the distortion grows slowly but linearly with circuit depth (Fig. 7g, h).

### Randomized benchmarking: circuit sparsity and connectivity

A quantum circuit can be characterized by the sparsity and connectivity of its gates. In all of the benchmarking above, the random circuits are dense, i.e., for all time steps a gate acts on each qubit (e.g., the circuit in Fig. 2a), and the connections are local, which means that two-qubit gates only connect adjacent qubits. Below, we consider other types of circuits. The first is dense and global, and the second is sparse and local. In sparse circuits, each qubit does not always interact with a gate at every time-step, and Kraus operators are only inserted after gates (i.e., the noise is as sparse as the gates). In a globally connected circuit, two-qubit gates can connect any pair of qubits in the circuit. In this section, we use the bit flip channel for benchmarking the LRET method on all the aforementioned circuit types with p = 0.1% and ϵ = 10−4.

In terms of time-cost, simulating different circuit types goes from harder to easier as circuits go from dense to sparse and from global to local. In Fig. 8a, one can see that the LRET method retains its speedup for all circuit types. The time difference between the sparse and the dense is because there are about twice as many gates in dense circuits as in sparse circuits. The time difference between the global and the local is because the set of all fixed-depth globally connected circuits spans a larger Hilbert space. Therefore, the rank of dense-global circuits grows slightly faster (Fig. 8c) and the simulations are slightly slower (Fig. 8a). Interestingly, in Fig. 8a, one can observe that in FDM simulation, due to the number of gates and the memory allocation, the time-cost grows at a similar rate as that of LRET with increasing circuit depth. Thus, LRET is consistently ~100× faster than FDM (Fig. 8b).

### State preparation for quantum chemistry

Quantum simulation is one of the most promising application areas for NISQ devices7. Algorithms have been developed to solve physical and optimization problems through quantum simulation approaches60,61,62. Here, we use our low-rank noise simulator to run a circuit that generates generalized-amplitude W states63 and Dicke states64. Although states of this kind are not hard to simulate classically, they are commonly used as subroutines for quantum information processing65,66,67,68,69 and thus are of high interest for simulations. Ordinarily the parameters of the circuit are initialized according to the solution of a configuration interaction singles chemistry problem, but for benchmarking purposes, we set the parameters randomly.

Unlike most of the circuits used in the randomized benchmarking above, the circuit in Fig. 9a is sparse. Two ways to model noise channels are (1) placing Kraus operators only after each gate, or (2) placing them on each qubit at every time-step. We call the former sparse noise and the latter dense noise. We use a depolarizing noise channel with p = 0.1% and ϵ = 10−4. Figure 9b, d shows that the LRET method has at least a 10× speedup, and more than 100× when N is larger. The distortion caused by LRET is about 5% in sparse noise and 15% in dense noise (Fig. 9c, e). For the 13-qubit circuits under sparse or dense noise, the rank of the final density matrix in LRET is just 0.4% or 1% of the full rank, respectively. The disparity arises because the rank of a density matrix grows faster in a circuit with dense noise. The quantum states produced by this state preparation are highly entangled. The fact that the low-rank method gains an order of magnitude speedup demonstrates its utility when applied to practical algorithms.

### Grover’s Search algorithm and amplitude amplification

Amplitude amplification is a generalization of Grover’s quantum search algorithm70,71,72. The goal is to find an input x with f(x) = 1, where f(x) = 1 if x is a solution and f(x) = 0 otherwise. If x is a solution we say that x is good. Let pgood be the probability of randomly sampling the search space, $${\mathcal{X}}$$, and finding a good input. A classical search algorithm is expected to sample from the input space on the order of $$\frac{1}{{p}_{\text{good}}}$$ times to find a solution; using amplitude amplification, one can expect to find a solution with only $${\mathcal{O}}(\sqrt{\frac{1}{{p}_{\text{good}}}})$$ samples, a quadratic speedup over the classical case70,73.

In the algorithm, qubits are initialized to a uniform superposition over the entire search space, where each basis state corresponds to an element, $$x\in {\mathcal{X}}$$. Next, a number of unitaries, known as Grover Iterates, act on the initialized state and boost the amplitudes of states that correspond to good solutions. The number of Grover iterates to apply is given by $$\lfloor \frac{\pi }{4}\sqrt{\frac{1}{{p}_{\text{good}}}}\rfloor$$. A full measurement of the resulting circuit yields states corresponding to good solutions with high probability.
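As a small worked example of the iterate count, assuming a hypothetical pgood = 1/1024: a classical search expects on the order of 1024 samples, while amplitude amplification needs ⌊π/4 · 32⌋ = 25 Grover iterates.

```python
import math

def grover_iterations(p_good):
    """Number of Grover iterates: floor(pi/4 * sqrt(1/p_good))."""
    return math.floor(math.pi / 4 * math.sqrt(1 / p_good))

p_good = 1 / 1024
classical_samples = 1 / p_good          # expected ~1024 random samples
iterates = grover_iterations(p_good)    # floor(pi/4 * 32) = 25
```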

In our implementation, we define the function f such that f(x) = 1 (i.e., x is a good input) when the binary string x has Hamming weight HW(x) ≤ 2:

$$f(x)=\left\{\begin{array}{ll}1&{\mathrm{if}}\ {\mathrm{HW}}(x)\le 2\\ 0&{\rm{otherwise}}\end{array}\right.$$
(11)
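A direct sketch of this oracle and the resulting pgood for a 13-qubit search space (plain Python, illustrative only):

```python
from math import comb

def f(x):
    """Oracle of Eq. (11): 1 iff the Hamming weight of x is at most 2."""
    return 1 if bin(x).count("1") <= 2 else 0

n = 13
good = sum(f(x) for x in range(2**n))
# Counting directly: C(13,0) + C(13,1) + C(13,2) = 1 + 13 + 78 = 92
p_good = good / 2**n
```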

We run amplitude amplification on circuits ranging from 9 to 13 qubits under depolarizing noise with p = 0.1% and $$\epsilon ={10}^{-4}$$ and compare the LRET and FDM methods (Fig. 10). In the 13-qubit circuit, the rank of the final density matrix from LRET is 0.5% of the full rank with a trade-off of 3.7% distortion. The similarity of the measurement results from both methods demonstrates the accuracy of LRET; in other words, any information loss from eigenvalue truncation is insignificant when sampling from the resulting density matrix. Time benchmarking of the two methods illustrates the speedup provided by LRET, which continues to improve as the number of qubits increases.

We note that it is not the intent of this study to most accurately predict the results of running this experiment on a particular hardware specification. When running on quantum hardware, gates must be decomposed into the set of gates native to the particular hardware, whereas, here, we model each of the Grover iterates as a single unitary. Rather, the aim of this study is to show that LRET retains its accuracy and computational advantage not only for the random circuits used for benchmarking, but also for circuits that may have legitimate applications and which exhibit more structure than the randomized circuits.

## Discussion

In this work we have demonstrated a method to simulate the evolution of mixed quantum states in noise channels that is efficient in practice. Iterative compression of the density matrices enables us to take advantage of low-rank evolution throughout the simulation of a noisy circuit with minimal error. Provided that the noise level of the channel is sufficiently small (<0.1–1%, depending on the other simulation parameters), the density matrices are found to be well-approximated by low-rank matrices. We provide an entropy argument in the supplementary information to support this finding. Under the low-noise assumption, our results show that the algorithm provides orders of magnitude of speedup while causing only a small error, on the order of $${10}^{-4}$$, in the probability distributions associated with the output density matrix. The performance in speed and in distortion is robust across different circuit structures and for varying levels of entanglement, since we make no assumption on the symmetry or the entanglement of the circuits.
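The compression step can be sketched as follows: eigendecompose the density matrix, drop the eigenvalues below a threshold ϵ, and renormalize so the trace stays 1. This is a generic NumPy illustration under our own naming, not the paper's implementation, with distortion measured here as the total-variation distance between the measurement distributions of the exact and truncated states.

```python
import numpy as np

def truncate_density_matrix(rho, eps=1e-4):
    """Project rho onto the eigenvectors whose eigenvalues exceed eps.

    Returns a low-rank factor L with rho ~= L @ L.conj().T, renormalized
    so that the truncated state keeps unit trace.
    """
    vals, vecs = np.linalg.eigh(rho)     # real eigenvalues, ascending
    keep = vals > eps
    vals, vecs = vals[keep], vecs[:, keep]
    vals = vals / vals.sum()             # restore unit trace
    return vecs * np.sqrt(vals)          # scale columns: L L^dag = sum_i v_i v_i^dag lambda_i

def distortion(rho, L):
    """Total-variation distance between the diagonal (computational-basis
    measurement) distributions of the exact and truncated states."""
    p_exact = np.real(np.diag(rho))
    p_trunc = np.sum(np.abs(L)**2, axis=1)
    return 0.5 * np.abs(p_exact - p_trunc).sum()
```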

Our approach is based on linear-algebraic primitives, namely, matrix multiplication and eigendecomposition. Both are common routines in GPU libraries, such as GEMM in cuBLAS and the eigensolvers in MAGMA. Since the eigendecomposition is performed on a comparatively small matrix, most of the run-time is devoted to matrix multiplications. It is fairly straightforward to use GPUs to accelerate these matrix multiplications, without the need for parallel reduction strategies. Therefore, one may utilize GPUs to speed up our algorithm even further, and potentially simulate in a regime that would otherwise be intractable.
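The reason the eigendecomposition stays cheap is that for ρ = LL† with a 2^N × r factor L (r ≪ 2^N), the nonzero spectrum of ρ equals that of the small r × r Gram matrix L†L, and the corresponding eigenvectors are recovered by one tall matrix multiplication. A minimal sketch of this standard trick, in our own code rather than the paper's implementation:

```python
import numpy as np

def eigh_via_gram(L):
    """Eigendecomposition of rho = L L^dagger via the small Gram matrix.

    The O((2^N)^2) eigenproblem is never formed: only the r x r matrix
    L^dagger L is diagonalized, and the eigenvectors of rho (for the
    nonzero eigenvalues) are rebuilt as L @ W / sqrt(lambda).
    """
    gram = L.conj().T @ L                      # r x r Gram matrix
    vals, W = np.linalg.eigh(gram)             # small eigenproblem, ascending
    nonzero = vals > 1e-12
    V = (L @ W[:, nonzero]) / np.sqrt(vals[nonzero])
    return vals[nonzero], V                    # eigenpairs of L L^dagger
```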

While our attention rests on the column space of density matrices, further speedups can be achieved by optimizing the representation with respect to the computational basis14,74,75.