Abstract
We develop an algorithmic framework for contracting tensor networks and demonstrate its power by classically simulating quantum computation of sizes previously deemed out of reach. Our main contribution, index slicing, is a method that efficiently parallelizes the contraction by breaking it down into much smaller and identically structured subtasks, which can then be executed in parallel without dependencies. We benchmark our algorithm on a class of random quantum circuits, achieving a more than 10^{5}-fold acceleration over the original estimate of the simulation cost. We then demonstrate applications of the simulation framework for aiding the development of quantum algorithms and quantum error correction. As tensor networks are widely used in computational science, our simulation framework may find further applications.
Main
Quantum computers offer the promise of exponential speedups over classical computers. Consequently, as quantum technologies grow in scale, there will be an inevitable crossing point after which nascent quantum processors will overtake behemoth classical computing systems in performing specialized tasks. The term quantum supremacy was coined to describe this watershed moment^{1}, which we refer to in this paper as quantum superiority. Recent advances in quantum computing hardware have resulted in quantum processors with more than 50 qubits, and there have been multiple claims that quantum superiority has been achieved on different quantum devices, including superconducting systems^{2} and photonic quantum devices^{3}. These milestones mark the start of the era of noisy intermediate-scale quantum devices^{4}.
With quantum devices increasing in size and precision, classical simulation of the corresponding quantum systems also becomes increasingly challenging. Classical simulation plays an indispensable role in understanding and designing quantum devices, as it is often, if not always, the only means to validate and benchmark existing quantum devices. Although there have already been numerous efforts in designing and implementing efficient classical simulators^{5,6,7}, there is always a push to simulate larger quantum devices. The reason is twofold. First, simulating large quantum systems helps reduce the finite-size effect observed in certain experiments with smaller quantum systems, which allows us to more confidently project the performance of large quantum systems for which classical simulation is definitively out of reach. Second, claims of quantum superiority are based on the assumption that classical simulation cannot achieve a task that is easily achievable using quantum devices. Such claims are not sound without an extensive effort to push the boundaries of classical simulability.
In this paper we propose a highly optimized framework for classically simulating intermediate- to large-scale quantum computations, represented as tensor networks. Tensor network contraction has been one of the prominent choices for simulating quantum computation due to its high flexibility and expressive power; however, exact contraction of general tensor networks is a computationally hard problem with respect to the problem size: there are tensor networks for which an exact simulation would take an exponential amount of time under well-established computational complexity assumptions^{8,9}. Nevertheless, exact tensor network contraction is indispensable in cases in which a numerically accurate approximation ansatz has not yet been established, suffers from practical inefficiency (for example, repeated singular value decomposition) or outputs poor-quality results. In reality, most tensor network instances of interest are far from the worst-case scenario, and the contraction efficiency can be improved by orders of magnitude over the naive approach by optimizing the contraction procedure. This is the focus of our paper.
In addition to developing and conglomerating several technical optimizations for tensor network contraction, the main technical contribution of our paper is a framework to parallelize tensor network contraction called index slicing. Index slicing decomposes a tensor network contraction task into many subtasks that have identical shapes and can be executed in an embarrassingly parallel way, that is, there is no dependency or communication required between the execution of the subtasks. Such an algorithm can be readily deployed on modern computational clusters and experimental evidence shows that such parallelization introduces little overhead to the total running time. As tensor networks are ubiquitous in quantum information science (with applications including benchmarking quantum devices^{2}, probing quantum many-body systems^{10,11,12,13} and decoding quantum error-correcting codes^{14,15,16,17}), our simulator represents a useful tool to aid in the development of quantum technologies.
One major challenge of index slicing is controlling the overhead introduced to the total running time. This overhead is usually not noticeable when the number of indices that need to be sliced is very small, but it can quickly grow out of control for a larger number of sliced indices. Multiple works have been dedicated to addressing this problem^{18,19,20,21}. In this paper we develop a heuristic algorithm to minimize the overhead by interleaving the search for the best index to slice with local optimization of the contraction order. To further improve performance, we focus the local optimizations on the tensor network contraction steps that take the most time and space, allowing more rounds of local optimization.
As a benchmarking example, we test our algorithm on the simulation task that prompted the quantum-superiority claim made in Arute and coworkers^{2}. This task—called Sycamore random circuit sampling—is to output bitstrings distributed according to the measurements of random quantum circuits that are designed to be executed on Google’s recent 53-qubit Sycamore device. This task is considered relatively easy on the Sycamore quantum chip, while at the same time infeasibly hard for any classical computational device. It is estimated that such a sampling task takes about 200 s on the Sycamore quantum chip, but would take over 10,000 years for the then-best supercomputer Summit^{2}. By experimenting on a subsample of the task on the GPU cluster at Alibaba, we show that our embarrassingly parallel tensor network contraction algorithm can indeed be carried out without extra cost, and that it can complete the random circuit sampling task within 20 days on a Summit-comparable cluster. Furthermore, to demonstrate the usefulness and broad capabilities of the tensor-network-based simulation framework, we apply it to the study of both near-term quantum algorithms and fault-tolerant quantum computing. The two examples we studied are the quantum approximate optimization algorithm (QAOA) as a candidate for graph isomorphism discovery, and the performance of the Surface-17 code in a quantum memory experiment under noise models that include stray ZZ interactions between neighbouring qubits. In both cases the simulation tasks go slightly beyond quantum circuits, but they fall easily within the grasp of our simulation framework, indicating the flexibility of our framework in the area of quantum computing.
Results
Efficient contraction of tensor networks
The tensor network is a well-studied framework for expressing multilinear functions over multidimensional arrays called tensors, and it is extensively used in multiple areas including quantum physics^{22,23}, machine learning^{24,25} and quantum computation^{26,27}. A tensor network can be formulated as a multi-hypergraph. Each node of a tensor network is associated with a tensor, and each of its connecting edges corresponds to one dimension of the tensor. A hyperedge in a tensor network can connect multiple tensor nodes, indicating that the corresponding dimensions are identified. Furthermore, a hyperedge is either closed or open. The computational task associated with a tensor network—called the tensor network contraction—is to compute an output tensor given the values of the tensor nodes and the hypergraph structure. Each dimension of the output tensor corresponds to an open edge in the tensor network. A particular entry of the output tensor corresponds to an assignment of the open edges, and its value equals the summation, over all possible assignments of the closed edges, of the product of the corresponding entries of the input tensors.
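As a minimal illustration of these definitions, consider a toy "triangle" network of three 2 × 2 matrices sharing pairwise indices (an example of our own, not one drawn from the experiments in this paper). Its contraction can be written directly with numpy.einsum:

```python
import numpy as np

# Triangle network A[i,j], B[j,k], C[k,i]: when all three edges i, j, k
# are closed, the output is a scalar (here equal to tr(ABC)).
rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal((2, 2)) for _ in range(3))

# Sum, over every assignment of the closed edges, of the product of the
# corresponding tensor entries.
scalar = np.einsum('ij,jk,ki->', A, B, C)

# Leaving edge i open instead yields a rank-1 output tensor indexed by i.
vector = np.einsum('ij,jk,ki->i', A, B, C)

assert np.isclose(scalar, np.trace(A @ B @ C))
assert np.isclose(vector.sum(), scalar)
```

The einsum subscripts play exactly the role of the hyperedges: repeated labels are identified dimensions, and labels absent from the output are closed edges that are summed over.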
Sequential pairwise contraction
One common method for exactly contracting tensor networks is sequential pairwise contraction. In each step, two tensors from the tensor network are selected and merged together according to their shared indices, in a way similar to matrix multiplication. This reduces the number of tensor nodes in the tensor network by one. By repeatedly applying the pairwise contraction, one is left with a single tensor at the end, which is the result of the tensor network contraction. A cleverly chosen order of pairwise contraction can often reduce the total time complexity of tensor network contraction by several orders of magnitude, making it one of the most time-efficient tensor network contraction algorithms. Over the years many heuristics have been proposed to find efficient sequential pairwise contraction orders^{26,27,28,29}; however, sequential pairwise contraction suffers from intrinsic sequentiality and a sometimes inevitable lower bound on space complexity; both greatly affect the scalability of such algorithms.
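The pairwise procedure can be sketched on a toy triangle network of three 2 × 2 matrices A[i,j], B[j,k], C[k,i] (an illustrative example of ours, not a circuit tensor network):

```python
import numpy as np

rng = np.random.default_rng(1)
A, B, C = (rng.standard_normal((2, 2)) for _ in range(3))

# Step 1: merge A and B over their shared index j, as in matrix
# multiplication; the network now has two tensors, AB[i,k] and C[k,i].
AB = np.einsum('ij,jk->ik', A, B)
# Step 2: merge AB and C over the shared indices i and k, leaving a scalar.
result = np.einsum('ik,ki->', AB, C)

# Any pairwise order yields the same value; only the cost differs. With
# skewed index dimensions the gap between orders can span many orders of
# magnitude, which is why contraction order finding matters.
assert np.isclose(result, np.einsum('ij,jk,ki->', A, B, C))
```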
Index slicing
To remedy these two problems, we propose a parallelization framework called index slicing that aims to divide a tensor network contraction task into many subtasks with identical tensor network structures, such that the subtasks can be executed in parallel, each with a space complexity small enough to fit into a single computational unit. Index slicing starts by selecting a subset of the hypergraph indices. Each subtask then corresponds to a partial sum in which the assignment of the selected (sliced) indices is fixed; each subtask is itself a tensor network. The subtask tensor networks often have simpler structures, allowing them to be contracted via sequential pairwise contraction with reduced space complexity. A contraction of the whole tensor network is then reduced to summing up the partial results obtained from the contractions of the individual subtasks.
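Index slicing can be sketched on a toy triangle network A[i,j] B[j,k] C[k,i] of 2 × 2 matrices (our own example), slicing the single binary index j:

```python
import numpy as np

rng = np.random.default_rng(2)
A, B, C = (rng.standard_normal((2, 2)) for _ in range(3))

full = np.einsum('ij,jk,ki->', A, B, C)

# Each subtask fixes j = v, which removes that edge from the network; the
# subtasks are identically structured and share no data dependencies, so
# they can run embarrassingly parallel on separate devices.
partials = [np.einsum('i,k,ki->', A[:, v], B[v, :], C) for v in range(2)]

# Summing the partial results over all assignments of j recovers the
# full contraction.
assert np.isclose(sum(partials), full)
```

Note how slicing j shrinks the largest tensor touched in each subtask, which is precisely the space-complexity reduction exploited at scale.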
In practice, memory constraints are far more rigid than running time constraints, and many modern computer architectures allow for massive parallelism. Index slicing effectively makes use of the latter while coping with the former, thereby improving on previous tensor network contraction algorithms both in efficiency and scalability. Of course, selecting the best subset of indices to slice over is a nondeterministic polynomial-time hard (NP-hard) optimization problem, for which no known algorithm is guaranteed to find an optimal solution in polynomial time. Furthermore, it is often intertwined with the other NP-hard optimization problem of finding the best sequential contraction order; however, with the heuristics discussed in the Methods, index slicing can be carried out with extremely low parallelization overhead while reducing the space complexity to that of a single computational node. Figure 1 illustrates the idea of tensor networks, contraction orders, index slicing and a flowchart briefing our heuristic strategies for finding good contraction orders and indices to slice.
Unless specifically indicated, all indices in the tensor network run through {0, 1}, a common assumption for qubit-based quantum computation. Throughout the results we report the complexity of contracting different tensor networks with a pair of numbers: the first, called the computation cost, is the base-ten logarithm of the total number of floating point operations (FLOPs), serving as a measure of the time complexity. The second, the number of subtasks, is exponential in the number of sliced indices, and serves as a measure of concurrency for the contraction algorithm. Moreover, we measure the quality of the index slicing by the slicing overhead, that is, the ratio of the total time complexity after and before performing index slicing. Figure 2 summarizes the contraction costs and slicing overheads for various tensor networks studied in this paper.
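The following sketch shows how these three quantities relate on the toy triangle network, under a deliberately crude cost model of our own (one operation per assignment of the indices touched by a pairwise step); the function name and cost model are illustrative, not those used by ACQDP:

```python
import math

def contraction_flops(steps):
    """Crude cost model: each pairwise step costs the product of the
    dimensions of all distinct indices it touches."""
    return sum(math.prod(dims) for dims in steps)

# Triangle network A[i,j] B[j,k] C[k,i], every index of dimension d = 2,
# contracted as (A*B)[i,k] and then with C. Step 1 touches i, j, k;
# step 2 touches i, k.
d = 2
unsliced = contraction_flops([(d, d, d), (d, d)])     # 12 operations

# Slicing index j fixes its value in each subtask, so each of the d
# subtasks loses the j loop in step 1.
per_subtask = contraction_flops([(d, d), (d, d)])     # 8 operations
sliced_total = d * per_subtask                        # 16 operations

computation_cost = math.log10(sliced_total)   # base-ten log of total FLOPs
num_subtasks = d ** 1                         # exponential in #sliced indices
slicing_overhead = sliced_total / unsliced    # 4/3 here: slicing is not free
```

On this tiny example slicing increases the total work by a third; the heuristics described below aim to keep this overhead close to one on real instances.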
Classical simulation of Sycamore random circuit sampling
We first benchmark the open-source implementation of our simulation framework (the Alibaba Cloud Quantum Development Platform, ACQDP) with a family of circuits called the Sycamore random circuits, which were originally proposed to demonstrate quantum superiority^{2}. It was claimed that when the number of layers m in the circuit is 20, a certain sampling task could be efficiently performed on the existing Sycamore quantum chip in about 200 s, whereas a comparable task would take Summit—one of the most powerful supercomputers in the world—at least 10,000 years.
We measure the performance of various simulation frameworks based on tensor network contraction using the contraction cost and an extrapolated running time, the latter based on actually running some of the many structurally identical subtasks created by index slicing. The ACQDP achieves an exceptionally low contraction cost—up to 10^{6} times lower than qFlex^{21} and up to 1,000 times lower than Cotengra^{28}; however, the FLOPS efficiency of ACQDP is also considerably lower than that of Cotengra and qFlex. This is probably due to the involvement of many general matrix–matrix products with small matrices during the computation. Overall, ACQDP forecasts a running time of less than 20 days for the m = 20 random circuit sampling task on a Summit-comparable computing cluster, a speedup of more than five orders of magnitude compared with the best classical algorithms reported by Arute and colleagues^{2}, and a speedup of more than two orders of magnitude compared with other state-of-the-art simulators.
The contraction cost, FLOPS efficiency, extrapolated runtime and comparisons with other leading simulators are all illustrated in Fig. 3. For all of the tensor-network-based simulators (qFlex, Cotengra and ACQDP), a batch of amplitudes is computed using open tensor network contraction. Due to randomness in the ACQDP, we ran ten independent order-finding experiments for each number of cycles m (the statistical results are presented in Fig. 2a). The orders found show good concentration in time complexity, and we take the best orders found for the comparisons reported in Fig. 3. The projected running time of the hybrid Schrödinger–Feynman algorithm reported in ref. ^{2} is estimated from a different architecture than Summit, and so its FLOPS efficiency is not shown.
A very recent result by Pan and colleagues^{30} claimed to have completed the Sycamore random circuit sampling task by producing one million distinct but correlated bitstrings within five days using a small GPU cluster, achieving a linear cross-entropy benchmarking fidelity (XEB) value of ~0.739. This was made possible by combining hierarchical partitioning and dynamic index slicing of tensor networks with their newly developed heuristics. Although their work verifiably passes the linear XEB test, it essentially completes a different task from what we describe as unbiased-noise approximate (UNA) sampling. A detailed discussion of the definition of the random circuit sampling task is presented in Supplementary Section 3C.
QAOA for graph isomorphism discovery
We investigate a potential application of the QAOA, which is to determine whether two graphs are isomorphic by checking whether their QAOA energy functions are equal^{31}. It is not clear whether this method can distinguish between all pairs of non-isomorphic graphs (for a sufficiently large number of QAOA layers p) or whether the energy gap would be noticeable. Here we study these questions by using ACQDP to classically compute the QAOA energies associated with various graphs.
By classically computing the QAOA energies, we can separate all non-isomorphic 3-regular graphs up to size 18, all strongly regular graphs up to size 26, and several hard graph pairs, including the Miyazaki and Praust graphs of size 20 and the Cai–Fürer–Immerman graphs of size 40. These findings and the theoretical results in Szegedy^{31} lead us to believe that QAOA energies give a full characterization of isomorphism classes, unlike many quantum-walk-based distinguishers considered earlier^{32,33,34}. Table 1 provides the pairs or classes of graphs, as well as the QAOA depth p that distinguishes them.
Our simulator can cope with the Cai–Fürer–Immerman graphs of size 40, a well-known pair of hard instances (see Fig. 4). This instance is hard even for the QAOA, as the two graphs cannot be distinguished until the QAOA depth reaches six. Figure 2c shows the contraction costs of the QAOA instances. It is worth noting that most of the slicing induces extremely low slicing overhead: the average overhead introduced by one slice over all of the p = 5 and p = 6 tensor networks is 0.2% and 3.5%, respectively.
Simulating surface codes with crosstalk errors
We simulate a quantum memory experiment on a surface code with 17 qubits—Surface-17 for short—in the presence of a practical noise model^{35} and a ZZ crosstalk model (see Methods), which was not considered before in this context. The performance of the surface code is measured by the Pauli transfer matrix (PTM) on the logical qubit.
Effects of crosstalk for 2 + 1 rounds of syndrome extraction
To compute the logical PTM under the optimal decoder, one needs to compute all of the PTMs corresponding to quantum operations for each assignment of the syndrome bits. As mentioned before, our tensor-network-based approach can deal with up to 2 + 1 rounds of syndrome extraction, as this yields an output tensor of size 2^{28}. Note that this is also the depth proposed for a near-term fault-tolerance demonstration^{36}.
We report the logical channels for 2 + 1 rounds of syndrome extraction, with and without the presence of crosstalk (Fig. 5). We can infer from these results that the effect of ZZ crosstalk on the logical channel is concentrated on the logical coherent Z-rotation and the stochastic phase-flip error, each ~10^{−3} in magnitude. Compared with other error sources, ZZ crosstalk introduces a minimal amount of logical error and thus does not present itself as a main error source for the quantum memory experiment.
Results with variants in code and gate scheduling
We test the performance of slight variants of the above gate scheduling on Surface-17 to validate the robustness of our findings. We first switch the Z-stabilizer syndrome extractions and the X-stabilizer ones in each cycle to see whether considerable changes in the logical errors can be observed. We then test the recently proposed XZZX code^{37}, which aims to balance the bias between X and Z Pauli errors. The corresponding PTMs are listed in Fig. 5.
For the experiment switching the X and Z syndrome extractions, it can be observed that the X and Z portions of the logical errors are also switched. This agrees with the observation that such an error-correction circuit is similar to one in which a transversal Hadamard gate is applied to all of the physical qubits, resulting in a logical Hadamard gate that interchanges the X and Z Pauli errors. On the other hand, merely changing the code to its XZZX variant does not balance the logical noise. This is probably because the majority of the error occurs during the syndrome extraction routines in this particular experiment, not between them. Although XZZX variants are good at averaging out biases occurring outside of the error-correction cycles, they do not make much difference during the actual syndrome extraction routines.
Tensor network contraction cost
Figure 2b shows the contraction cost of the tensor networks corresponding to the experiments on different variants of the Surface-17. One particularly interesting phenomenon is that the slicing overhead of the surface code simulation is below one in some cases. This indicates that the initial contraction tree found by the hypergraph-decomposition framework is suboptimal, which could be due to the fact that the tensor networks considered here have many open edges, unlike the instances in the random circuit sampling problem or the QAOA experiments. Whether such a phenomenon exists in other tensor networks with many open edges, and whether index slicing can be applied to aid contraction order finding in a more general setting, is worth further investigation. We leave this to future work.
Discussion
The index slicing framework proposed in our paper establishes an interpolation between the sequential pairwise contraction and the Feynman path integral algorithm, which correspond to the cases in which no index is sliced and in which all indices are sliced, respectively. For a tensor network with m indices associated with a hypergraph with tree width t and contraction width c, the sequential pairwise contraction achieves a time complexity of O^{*}(2^{t}), whereas the space complexity is lower bounded by Ω^{*}(2^{c}) (although not necessarily simultaneously achievable). The Feynman path integral, on the other hand, has a space complexity of O(m), yet the time complexity is Ω^{*}(2^{m}); however, how the slicing-incorporated idea interpolates between the aforementioned two extreme points requires further investigation. It also remains open whether there exists a tensor network contraction algorithm that achieves both the relatively low time complexity of O^{*}(2^{t}) and the space complexity of O(m), and it does not seem likely that a slicing-incorporated sequential pairwise contraction could achieve this limit.
On the practical side, there are still many ways to improve the performance of the tensor network contraction algorithm, potentially by several orders of magnitude. A better contraction order—together with index slicing—might be found through algorithmic refinements; however, it is known^{8} that exact contraction of general tensor networks is a #Phard problem, a computational complexity category for which no known efficient (subexponential time) algorithms exist. It is therefore worth investigating approximation proposals such as tensor network contraction based on matrix product states^{13,22,38,39}, matrix product operators^{40,41,42} and other such ansatzes. Such approximation proposals do not necessarily suffer from #Phardness and could offer a big leap in simulability assuming the ansatzes are good. We leave investigation and design of efficient and relatively accurate approximate tensor network contraction methods to future work.
It has been noted that the good contraction schemes found by our algorithm, usually with relatively low time complexity, might not necessarily perform well on modern computational architectures. In fact, the 20-cycle random quantum circuit simulation exhibits a very low FLOPS efficiency (~15%). This is probably due to the fact that, despite being efficient on paper, many of the contraction schemes found by our algorithm involve matrix multiplications of skewed shapes, which cannot be processed as efficiently as matrix multiplications of square matrices. A computational cost function tailored to better reflect modern architecture design could be useful for guiding better contraction schemes in practice. We leave this to future work.
For these reasons, we expect that further improvements on both the algorithmic and engineering considerations can considerably reduce the overall simulation costs we report. Given the ubiquity of tensor networks in quantum information science and the efficiency of our simulator, we believe that it could provide a valuable tool in the development of quantum information technologies while helping to define the quantum superiority frontier.
Methods
Tensor networks contraction algorithms
Framework for tensor network contraction
We use index-slicing-incorporated sequential pairwise contraction to contract tensor networks. Finding the optimal contraction scheme (that is, a subset of indices to slice over and a sequential pairwise contraction order for the subtasks, which are identical in structure) is NP-hard; however, for the large tensor network instances presented in this paper, a preprocessing heuristic that finds near-optimal contraction schemes is often worthwhile: the resulting savings in the time and space complexities of the actual contraction considerably dwarf the relatively short extra time spent on such preprocessing.
Finding an optimal contraction order and finding indices to slice are two strongly coupled optimization problems. On the one hand, good index slicing is often based on an existing contraction order, as the changes in time and space complexities caused by slicing then become manifest. On the other hand, the optimal contraction order also depends on the slicing, as the hypergraph structure changes when edges are removed. We therefore propose a two-phase contraction scheme-finding heuristic: in the first phase, we find a good contraction order for the unsliced tensor network; in the second phase, we look for indices to slice, interleaving index slicing with local reconfigurations of the contraction order to keep it near-optimal given the already sliced indices.
Initial contraction order finding
Disregarding the relative ordering between steps without data dependence, a contraction order can be regarded as a binary tree, where the leaf nodes correspond to the input tensors and the root to the final output tensor (recall the example in Fig. 1). We apply a slight augmentation of the hypergraph-decomposition-based contraction tree construction method of Gray and Kourtis^{28}. Such an algorithm constructs a contraction tree top-down, by first decomposing the hypergraph into two or more components. The components are then regarded as tensor networks that are to be contracted individually and then contracted together; in other words, such a hypergraph decomposition fixes the top layers of the contraction tree. The subgraphs are dealt with in a recursive manner, until the number of nodes in a given subgraph is small enough to allow efficient subtree constructions.
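The recursive construction can be sketched as follows. This is a structural sketch only: the placeholder split_in_two stands in for a real hypergraph partitioner (such as KaHyPar, used in our implementation), and the function names are ours:

```python
# Top-down contraction-tree construction by recursive decomposition.

def split_in_two(nodes):
    # Hypothetical stand-in for a balanced hypergraph partitioner;
    # a real partitioner would minimize the cut between the components.
    half = len(nodes) // 2
    return nodes[:half], nodes[half:]

def build_tree(nodes, cutoff=2):
    """Return a nested-tuple contraction tree over the given tensor ids."""
    if len(nodes) <= cutoff:
        # Small subproblem: any (for example, greedy) subtree will do.
        return nodes[0] if len(nodes) == 1 else tuple(nodes)
    left, right = split_in_two(nodes)
    # The decomposition fixes the top of the tree: contract each
    # component internally, then contract the two results together.
    return (build_tree(left, cutoff), build_tree(right, cutoff))

tree = build_tree(list(range(6)))   # ((0, (1, 2)), (3, (4, 5)))
```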
A hypergraph decomposition algorithm takes two parameters (K, ϵ) and a hypergraph, and outputs K disjoint components of the hypergraph whose size differences are controlled by the parameter ϵ. We observe a difference between the top-layer decomposition (where the tensor network is usually closed or contains few open edges) and the subsequent layers (where there are many open edges, mostly connecting to other components). For this reason, we use the parameter combination (K, ϵ) for the top layer and \((2,\epsilon ^{\prime} )\) for subsequent layers. We then perform optimizations over the three parameters \((K,\epsilon ,\epsilon ^{\prime} )\) to obtain a satisfactory initial contraction tree. We use the covariance matrix adaptation evolution strategy algorithm^{43} for parameter optimization and the KaHyPar package for hypergraph decomposition^{44}. The cutoff size for the hypergraph decomposition is set to 25; contraction trees on hypergraphs with fewer nodes are constructed greedily using built-in functionalities in the opt_einsum package^{45}.
Index slicing and local optimization
After finding the initial contraction order, one way of selecting the indices to slice over is by greedily picking the index that decreases the space complexity the most or introduces the least time complexity overhead. In this work we interleave the greedy approach with a series of local reordering of the contraction tree that ensures a more robust slicing. In particular, we apply the following heuristics:

The first one is a general local optimization method: take a connected subgraph of the contraction tree, which represents a series of contraction steps with multiple intermediate outcomes as input and a single output. Such a series of contraction steps represents a tensor network contraction of its own and can be optimized by reconfiguring the internal contraction tree connections. If the chosen subgraph is small enough, the optimal configuration can be found with a brute-force approach. Repeatedly choosing small connected subgraphs of the contraction tree and optimizing over them can greatly reduce the overall contraction cost. To accelerate this process we focus on subgraphs containing many high-cost intermediate steps, where local optimization is likely to reduce the contraction cost the most. In our experiments, we take subgraphs of size up to 14 to perform local optimizations on.
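The brute-force primitive underlying this heuristic can be sketched as an exhaustive search over pairwise orders for a small group of tensors. The sketch below is our own simplification: every index is binary, each tensor is represented as a frozenset of its index labels, and the function names are illustrative:

```python
from itertools import combinations

def best_order(tensors, open_idx=frozenset()):
    """Minimal total cost of contracting `tensors` down to one tensor,
    by exhaustive search. One merge costs 2**(#distinct indices touched)."""
    if len(tensors) == 1:
        return 0
    best = None
    for i, j in combinations(range(len(tensors)), 2):
        rest = [t for k, t in enumerate(tensors) if k not in (i, j)]
        # Indices still needed elsewhere (or open) survive the merge;
        # the others are summed out.
        needed = open_idx.union(*rest) if rest else open_idx
        merged = (tensors[i] | tensors[j]) & needed
        cost = 2 ** len(tensors[i] | tensors[j]) \
            + best_order(rest + [merged], open_idx)
        best = cost if best is None else min(best, cost)
    return best

# Closed triangle network A{i,j}, B{j,k}, C{k,i}: every order costs 8 + 4.
tri = [frozenset('ij'), frozenset('jk'), frozenset('ki')]
assert best_order(tri) == 12
```

The search is exponential in the group size, which is why it is only applied to small subgraphs (up to 14 nodes in our experiments).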

The second one is designed specifically for index slicing. In a contraction tree, the nodes in which a particular index appears form a subtree. The overhead induced by slicing a particular index is determined by the total cost of the corresponding subtree, which in turn depends almost entirely on the overlap of the subtree with the highest-cost nodes. The more high-cost nodes of a contraction tree involve a particular index, the less overhead is incurred when slicing that index. One can therefore slightly tweak the contraction tree by commuting different high-cost contraction steps to maximize the utility of a single index. This increases the overall unsliced cost (assuming that the original contraction tree is locally optimal), but at the same time reduces the slicing overhead by increasing the utility of the particular index. Enumerating over several promising index candidates helps find a good one, especially when an obvious choice is absent.
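The greedy selection step behind both heuristics can be sketched with a toy cost model of our own: slicing a binary index doubles the number of subtasks but halves the cost of every contraction step that involves that index, so the overhead is lowest for indices appearing in the highest-cost steps. The function names and step data are illustrative:

```python
def slicing_overhead(steps, e):
    """steps: list of (cost, index_set). Ratio of total cost after/before
    slicing the binary index e (toy model: 2 subtasks, halved e-steps)."""
    before = sum(c for c, _ in steps)
    after = 2 * sum(c / 2 if e in idx else c for c, idx in steps)
    return after / before

def pick_index(steps, candidates):
    # Greedy choice: the candidate with the smallest overhead.
    return min(candidates, key=lambda e: slicing_overhead(steps, e))

steps = [(1024, {'a', 'b'}), (1024, {'a', 'c'}), (16, {'b', 'c'})]
# 'a' appears in both high-cost steps, so it is the cheapest index to
# slice (overhead ~1.008 versus ~1.50 for 'b' or 'c').
assert pick_index(steps, ['a', 'b', 'c']) == 'a'
```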
Runtime modification of the contraction scheme
When executing sequential pairwise contraction on a GPU, we apply the following runtime-specific modifications to the obtained contraction schemes. These modifications do not alter the theoretical contraction cost by much, but usually enable much more efficient execution.

Most nodes in the contraction tree represent very small portions of the overall time complexity; however, they involve many small tensors, the transmission of which to the GPU would incur considerable overhead. This motivates us to precompute these small steps on a CPU before executing the slicing, and to deploy only the heavy computational steps of each individual task on the GPU. The partial results for the low-cost steps are shared by all subtasks and only need to be computed once. In practice, this considerably reduces the communication cost between the GPU and the CPU and helps save a small portion of the computational cost. We regard any intermediate step resulting in an intermediate tensor of rank at most 23 (before slicing) as a low-cost step, and execute these steps before slicing.

After the precomputation removes the repeated low-cost steps, the computation performed on the GPU is typically a sequential absorption of small tensors into one large tensor, or two large tensors merged together near the end. In either case, a contraction tree with locally optimal contraction costs typically suffers from a large skewness in dimensions during matrix multiplication. On Nvidia Tesla V100 GPUs, matrix multiplication with dimensions M × N and N × K is much more efficient when the dimensions M, N and K are all multiples of 32; however, a typical small tensor is often shaped 4 × 4, 8 × 8 or 16 × 16. To overcome this, we slightly tweak the contraction order in the following way: whenever a large tensor is to be contracted with several small tensors consecutively, we instead contract the small tensors together first and then contract this intermediate result with the large tensor, thereby ensuring that the inefficiency caused by skewness does not occur whenever the large tensor is involved in the contraction. This increases the runtime contraction cost, but decreases the actual running time by making use of the efficient kernel functions of the Nvidia Tesla V100. This is a somewhat ad hoc solution to the low GPU efficiency induced by small tensor dimensions; we hope that more systematic approaches can be explored to increase the GPU efficiency.
Sycamore random circuits
The Sycamore random quantum circuits used to benchmark ACQDP are introduced in Arute et al.^{2} and are available from the public Dryad repository^{46}. Each Sycamore random circuit is parameterized by a single parameter m, has 53 qubits arranged in a diagonal square grid pattern reflecting the qubit layout of the Sycamore quantum processor, and is generated randomly from some simple rules. Namely, a Sycamore random circuit is composed of m cycles, each consisting of a single-qubit gate layer and a two-qubit gate layer, and concludes with an extra single-qubit gate layer preceding measurement in the computational basis. In the first single-qubit gate layer, single-qubit gates are chosen for each individual qubit independently and uniformly at random from \(\{\sqrt{X},\sqrt{Y},\sqrt{W}\}\), where
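these denote π/2 rotations about the X, Y and W = (X + Y)/√2 axes; in the standard conventions of ref. ^{2},

\[
\sqrt{X}=\frac{1}{\sqrt{2}}\begin{pmatrix}1&-i\\-i&1\end{pmatrix},\qquad
\sqrt{Y}=\frac{1}{\sqrt{2}}\begin{pmatrix}1&-1\\1&1\end{pmatrix},\qquad
\sqrt{W}=\frac{1}{\sqrt{2}}\begin{pmatrix}1&-\sqrt{i}\\\sqrt{-i}&1\end{pmatrix}.
\]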
In each successive single-qubit gate layer, single-qubit gates are chosen for each individual qubit uniformly at random from the subset of \(\{\sqrt{X},\sqrt{Y},\sqrt{W}\}\) that excludes the single-qubit gate applied to that qubit in the previous cycle. In each two-qubit gate layer, two-qubit gates are applied to about one-quarter of all pairs of adjacent qubits in the qubit layout, in a regular pattern, such that at most one two-qubit gate is applied to each qubit. There are four different patterns, labeled A, B, C and D in ref. ^{2}, and the eight-cycle pattern A, B, C, D, C, D, A, B is repeated over all the two-qubit gate layers. Two-qubit gates are decomposed into four Z-rotations determined by the cycle index and
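an fSim gate, whose standard form^{2} is

\[
\mathrm{fSim}(\theta,\phi)=\begin{pmatrix}1&0&0&0\\0&\cos\theta&-i\sin\theta&0\\0&-i\sin\theta&\cos\theta&0\\0&0&0&e^{-i\phi}\end{pmatrix},
\]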
where the parameters θ and ϕ are determined by the qubit pairing.
The random circuit sampling task
A quantum circuit U naturally defines a distribution \({{{{\mathcal{D}}}}}_{U}\) over bitstrings when all qubits are measured in the computational basis after executing the circuit on the all-zero state: \({{{{\mathcal{D}}}}}_{U}(x):=| \langle x|U|0\rangle |^{2}\). Ideally, a quantum device executing U would sample from the distribution \({{{{\mathcal{D}}}}}_{U}\) exactly, but in practice many sources of hardware error cause the actual distribution to deviate from the ideal one. The linear cross-entropy benchmark (XEB) was used to measure the closeness of the output distribution to the ideal distribution^{2}. It is defined as 2^{n}〈p_{I}(x)〉 − 1, where n is the number of qubits, p_{I}(x) is the probability of x in the ideal distribution, and the expectation is taken over the output distribution. The XEB is 0 when the output distribution is uniform, and 1 when the output distribution is ideal and follows Porter–Thomas statistics. Numerical evidence suggested that the aforementioned random quantum circuits had achieved an XEB of approximately 0.2%; however, simulating these circuits was estimated to be infeasible, so this could not be directly verified.
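The definition above translates directly into a short estimator (our own illustration of the formula, not the code used in the paper):

```python
import numpy as np

# Linear XEB = 2^n * mean(p_I(x)) - 1, averaged over the sampled bitstrings x.
def linear_xeb(n, samples, ideal_probs):
    """samples: iterable of bitstring indices; ideal_probs: length-2^n array."""
    p = np.asarray(ideal_probs, dtype=float)
    return (2 ** n) * float(np.mean(p[list(samples)])) - 1.0

n = 5
# A uniform output distribution gives XEB = 0.
uniform = np.full(2 ** n, 1 / 2 ** n)
assert abs(linear_xeb(n, range(2 ** n), uniform)) < 1e-12
# A sampler that always returns a bitstring of ideal probability 1 gives 2^n - 1.
delta = np.zeros(2 ** n)
delta[3] = 1.0
assert linear_xeb(n, [3] * 10, delta) == 2 ** n - 1
```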
Meanwhile, in this paper we require the classical simulation algorithm to satisfy a stronger criterion of approximate sampling, namely (ϵ, 0.2%)-unbiased noise approximate sampling, where ϵ is negligible (a rough estimate shows that ϵ < 6.4 × 10^{−31} in our algorithm). The definition of unbiased noise approximate sampling is as follows:
Definition 1 ((ϵ, F)-UNA sampling)
For a quantum circuit U acting as a unitary on n qubits, the task of F-UNA sampling is to generate independent and identically distributed samples from the distribution
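given, consistent with the mixing procedure described below, by

\[
\mathcal{D}_U^{(F)} := F\,\mathcal{D}_U + (1-F)\,\mathcal{U}_n,
\]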
where \({{{{\mathcal{U}}}}}_{n}\) denotes the uniform distribution over {0, 1}^{n}. Moreover, (ϵ, F)-UNA sampling generates independent and identically distributed samples from a distribution that is ϵ-close to \({{{{\mathcal{D}}}}}_{U}^{(F)}\) in total variation distance.
We discuss further why this stronger criterion is used in Supplementary Section 3C. Meanwhile, we note that if we can achieve (ϵ, 1)-UNA sampling in average time T, then there is a trivial method to achieve (ϵF, F)-UNA sampling in average time FT: yield a genuine sample with probability F and a uniformly random bitstring otherwise. We adopt this method in our experiments, generating near-perfect samples from \({{{{\mathcal{D}}}}}_{U}\) and multiplying the final running-time estimate by a factor of F = 0.2%.
Frugal rejection sampling
We adopt a previously proposed framework^{2,6} to reduce (near-perfect) sampling from \({{{{\mathcal{D}}}}}_{U}\) to the computation of probability amplitudes of individual bitstrings or small batches of bitstrings. This framework assumes that the output distribution of a random quantum circuit is a randomly permuted Porter–Thomas distribution. Under this assumption, there is a small number M (M ≈ 10 for 53-qubit circuits) such that bitstrings x with probability p_{I}(x) > M/N (where N is the number of all possible bitstrings; N = 2^{53} in this case) do not contribute much to the overall distribution, which naturally gives rise to a frugal rejection sampling algorithm that on average needs to compute only M individual probability amplitudes to generate one sample from \({{{{\mathcal{D}}}}}_{U}\).
The overhead of frugal rejection sampling can be decreased further by computing a small batch of amplitudes of related bitstrings at a time, which for tensor-network-based methods can be done at almost no extra cost compared with computing a single amplitude. We cannot generate multiple samples from a single batch, because doing so would introduce unwanted correlations between samples, violating the independent and identically distributed requirement of UNA sampling; however, if the first randomly chosen bitstring in a batch is rejected, we can try other bitstrings in the same batch until one is accepted. With a batch of 2^{6} = 64 bitstrings, the probability that one of them is accepted is close to 1, lowering the overhead of frugal rejection from about 10× to 1×. This may introduce some further deviation from the ideal distribution \({{{{\mathcal{D}}}}}_{U}\), but the error is negligible assuming that the correlation between amplitudes in the same batch is negligible.
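The basic (unbatched) frugal rejection step can be sketched as follows (a toy version on a classically known distribution; the function name `frugal_sample` and the test sizes are our own choices, and the real algorithm queries a tensor-network amplitude oracle instead of an array):

```python
import numpy as np

def frugal_sample(probs, M, rng):
    """Draw a uniform bitstring x, then accept it with probability
    p(x) * N / M (capped at 1). Bitstrings with p(x) > M/N are rare under
    the Porter-Thomas assumption, so the bias from capping is negligible."""
    N = len(probs)
    while True:
        x = rng.integers(N)
        if rng.random() < min(1.0, probs[x] * N / M):
            return x

rng = np.random.default_rng(0)
N, M = 256, 10
probs = rng.exponential(size=N)    # a Porter-Thomas-like target distribution
probs /= probs.sum()

counts = np.bincount([frugal_sample(probs, M, rng) for _ in range(50_000)],
                     minlength=N)
# Empirical frequencies should track the target distribution closely.
assert np.abs(counts / counts.sum() - probs).sum() < 0.15
```

On average the loop computes about M amplitudes per accepted sample, matching the overhead quoted above.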
QAOA for graph isomorphism discovery
The QAOA was first developed by Farhi, Goldstone and Gutmann^{47} to solve combinatorial optimization problems. For a combinatorial optimization problem of the form \(C:{\{0,1\}}^{n}\to {\mathbb{R}}\) that can be decomposed as a sum of local clauses \(C=\mathop{\sum }\nolimits_{i = 1}^{m}{C}_{i}\), each acting only on a small number of bits, QAOA works by regarding the objective function C as a local Hamiltonian \(\hat{C}={\sum }_{x}C(x)\left|x\right\rangle \left\langle x\right|=\mathop{\sum }\nolimits_{j = 1}^{m}{\hat{C}}_{j}\), and taking the ansatz that the state
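takes the standard form^{47}

\[
\left|\overrightarrow{\gamma},\overrightarrow{\beta}\right\rangle
= e^{-i\beta_p\hat{B}}\,e^{-i\gamma_p\hat{C}}\cdots e^{-i\beta_1\hat{B}}\,e^{-i\gamma_1\hat{C}}\left|+\right\rangle^{\otimes n},
\]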
defined by the mixing operator \(\hat{B}=\mathop{\sum }\nolimits_{i = 1}^{n}{X}_{i}\) and the angle sequences \(\overrightarrow{\gamma },\overrightarrow{\beta }\in {{\mathbb{R}}}^{p}\) approaches an eigenstate of \(\hat{C}\) with either minimum or maximum eigenvalue when the parameters \(\overrightarrow{\gamma },\overrightarrow{\beta }\) are carefully chosen, even for a small QAOA depth p. As both \(\hat{B}\) and \(\hat{C}\) are sums of commuting local terms, the state \(\left|\overrightarrow{\gamma },\overrightarrow{\beta }\right\rangle \) can be readily prepared using a quantum circuit.
The QAOA energy function with p layers is defined as
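Consistent with the description in the next sentence, this is the expectation value

\[
F_p(\overrightarrow{\gamma},\overrightarrow{\beta})
:= \left\langle\overrightarrow{\gamma},\overrightarrow{\beta}\right|\hat{C}\left|\overrightarrow{\gamma},\overrightarrow{\beta}\right\rangle;
\]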
that is, the expectation value of the objective function C(Z), where the random string Z comes from measuring the quantum state \(\left|\overrightarrow{\gamma },\overrightarrow{\beta }\right\rangle \) in the computational basis.
To use the QAOA for graph isomorphism discovery, consider the MaxCut problem on a graph G = (V, E), with the simple objective function C(x) = ∑_{(u, v)∈E}∣x_{u} − x_{v}∣, where x ∈ {0, 1}^{∣V∣}. Obviously, the QAOA energy function \({F}_{p}(\overrightarrow{\gamma },\overrightarrow{\beta })\) for the MaxCut problem depends not on the ordering of the vertices in V but only on the structure of G. Two isomorphic graphs therefore always give the same value of \({F}_{p}(\overrightarrow{\gamma },\overrightarrow{\beta })\), no matter how \(\overrightarrow{\gamma }\) and \(\overrightarrow{\beta }\) are chosen. On the other hand, it is conjectured that for two non-isomorphic graphs and sufficiently large p, the values of \({F}_{p}(\overrightarrow{\gamma },\overrightarrow{\beta })\) differ with probability 1 for \(\overrightarrow{\gamma },\overrightarrow{\beta }\) chosen uniformly from [0, 2π]^{2p} (ref. ^{31}); thus, evaluating \({F}_{p}(\overrightarrow{\gamma },\overrightarrow{\beta })\) for two graphs G_{1} and G_{2} with randomly chosen \(\overrightarrow{\gamma },\overrightarrow{\beta }\) can either reveal that G_{1} and G_{2} are non-isomorphic or give strong evidence that they are isomorphic. The larger p is, the stronger this evidence.
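The invariance under relabeling can be checked directly on tiny graphs with a brute-force statevector evaluation of the MaxCut energy (our own illustration; the real simulations use tensor-network contraction, and the helper names here are hypothetical):

```python
import numpy as np

def maxcut_values(n, edges):
    # C(x) for every bitstring x, with bit k of the index holding x_k.
    vals = np.zeros(2 ** n)
    for idx in range(2 ** n):
        bits = [(idx >> k) & 1 for k in range(n)]
        vals[idx] = sum(bits[u] != bits[v] for u, v in edges)
    return vals

def qaoa_energy(n, edges, gammas, betas):
    # Brute-force statevector evaluation of F_p for MaxCut.
    C = maxcut_values(n, edges)
    state = np.full(2 ** n, 1 / np.sqrt(2 ** n), dtype=complex)  # |+>^n
    X = np.array([[0, 1], [1, 0]], dtype=complex)
    for g, b in zip(gammas, betas):
        state = np.exp(-1j * g * C) * state              # e^{-i gamma C}
        U = np.cos(b) * np.eye(2) - 1j * np.sin(b) * X   # e^{-i beta X}
        psi = state.reshape([2] * n)
        for q in range(n):
            ax = n - 1 - q                               # bit q <-> axis n-1-q
            psi = np.moveaxis(np.tensordot(U, np.moveaxis(psi, ax, 0),
                                           axes=(1, 0)), 0, ax)
        state = psi.reshape(-1)
    return float(np.real(np.vdot(state, C * state)))

# Isomorphic graphs give identical energies for any choice of angles.
path = [(0, 1), (1, 2)]
relabeled = [(2, 0), (0, 1)]       # apply the permutation 0->2, 1->0, 2->1
rng = np.random.default_rng(0)
g, b = rng.uniform(0, 2 * np.pi, 3), rng.uniform(0, 2 * np.pi, 3)
assert np.isclose(qaoa_energy(3, path, g, b), qaoa_energy(3, relabeled, g, b))
# At gamma = beta = 0 the state is uniform, so the energy is |E|/2.
assert np.isclose(qaoa_energy(3, path, [0.0], [0.0]), 1.0)
```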
As the QAOA energy function can be written as a sum of the energy values of all clauses, the above proposal is essentially a way of using the QAOA to characterize local neighborhoods of vertices in a graph. This locality also translates into ease of computation with tensor-network-based methods: to compute the energy value of a clause, the tensor network to be evaluated corresponds to only part of the QAOA circuit, namely the light cone of that clause (see Supplementary Section 4C).
Surface-17
Surface-17 (ref. ^{35}) is a surface code involving 17 qubits: nine data qubits, four X-ancilla qubits for X-stabilizer measurements, and four Z-ancilla qubits for Z-stabilizer measurements. The nine data qubits are arranged in a 3 × 3 grid, and the ancilla qubits all lie on another square lattice diagonally displaced from the data-qubit lattice, such that each ancilla qubit is diagonally adjacent either to four data qubits or to two data qubits on the border of the 3 × 3 grid. During normal operation of the surface code, two-qubit gates are applied only between adjacent pairs of one data qubit and one ancilla qubit (never two data qubits or two ancilla qubits). See Supplementary Fig. 7 for a diagram of the qubit layout.
Error model
The error model we use is based on the one in O’Brien et al.^{35}, which includes idling, gate-specific and measurement errors. We choose not to include the gate-specific error of CZ gates (modeled as quasi-static flux noise), as its quasi-static, coherent nature enables various techniques to compensate for it. We do introduce a model for crosstalk error caused by stray two-qubit ZZ interactions^{48}. See Supplementary Section 5B^{49}.
Logical memory experiment
In this paper, we study only how well the surface code preserves the value of a logical qubit (as opposed to how to initialize, apply a gate to, or measure the logical qubit). To detect and correct qubit errors that may occur even while idling, stabilizer measurements (also known as error syndrome extraction) need to be performed constantly. Our syndrome extraction circuits are based on those in O’Brien et al.^{35}, containing only R_{y}(±π/2) gates, CZ gates and computational-basis measurements. We also study variants of the syndrome extraction circuit in which different stabilizers are measured by adding or removing some R_{y}(±π/2) gates.
We consider an idealized experiment that ignores errors during initialization and the final measurement. Starting from any single-qubit state, we first encode it into Surface-17 with an ideal (noiseless) encoding circuit, then perform k rounds of noisy syndrome extraction plus one round of noiseless syndrome extraction (which we sometimes write simply as ‘k + 1 rounds of syndrome extraction’). The final round of noiseless syndrome extraction projects the physical state back into the code space, and allows us to map the final state back to a single-qubit logical state with a Pauli correction indicated by the optimal decoder. This entire process can be described by a well-defined logical channel \({{{\mathcal{C}}}}\) on the single logical qubit, which contains information on the kinds and magnitudes of all logical errors incurred by this process.
PTMs
We describe a single-qubit quantum channel by its Pauli transfer matrix (PTM), a 4 × 4 real matrix indicating how the channel modifies the expectation values of the Pauli operators. The PTM for a channel \({{{\mathcal{C}}}}\) is defined as
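the matrix with the standard entries

\[
(R_{\mathcal{C}})_{ij}=\frac{1}{2}\,\mathrm{Tr}\!\left[\sigma_i\,\mathcal{C}(\sigma_j)\right],\qquad i,j\in\{0,1,2,3\},
\]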
where σ_{0}, σ_{1}, σ_{2}, σ_{3} = I, X, Y, Z, respectively. Note that the first row of any PTM corresponding to a trace-preserving map is always (1, 0, 0, 0), since a physical quantum channel does not change the expectation value of I, regardless of the expectation values of X, Y and Z.
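The definition can be evaluated directly from a Kraus representation (our own helper as a sketch, not code from the paper; the depolarizing-channel example is an assumption chosen for illustration):

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
PAULIS = [I2, X, Y, Z]

def ptm(kraus):
    """PTM entries R[i, j] = (1/2) Tr[sigma_i C(sigma_j)] for a channel C
    given by its Kraus operators."""
    R = np.zeros((4, 4))
    for j, sj in enumerate(PAULIS):
        out = sum(K @ sj @ K.conj().T for K in kraus)   # C(sigma_j)
        for i, si in enumerate(PAULIS):
            R[i, j] = 0.5 * np.real(np.trace(si @ out))
    return R

# Depolarizing channel with probability p: rho -> (1 - p) rho + p I/2.
p = 0.1
kraus = [np.sqrt(1 - 3 * p / 4) * I2] + [np.sqrt(p / 4) * P for P in (X, Y, Z)]
R = ptm(kraus)
assert np.allclose(R[0], [1, 0, 0, 0])                  # trace preservation
assert np.allclose(np.diag(R), [1, 1 - p, 1 - p, 1 - p])
```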
Optimal decoder
During normal operation of a stabilizer code, any errors detected are usually not corrected with physical gates. Instead, conceptually, virtual Pauli gates are applied to some of the code qubits, implemented by adjusting the results of subsequent stabilizer measurements on those qubits. Any Pauli gate on any number of code qubits can be implemented this way as long as the only operations applied to the code qubits are Clifford gates and stabilizer measurements.
Accordingly, our decoder attempts to correct errors using only Pauli gates on code qubits, depending on the error syndromes measured. It is implemented in two steps: first, based only on the final round of noiseless error syndromes, a trivial decoder uses any number of Pauli gates to map the code qubits back into the code space. Second, based on all (noisy and noiseless) error syndromes, a logical Pauli gate, one of I, X, Y and Z, is applied to the logical qubit in order to maximize the fidelity of the logical channel. For the circuit sizes considered in this paper (no more than 2 + 1 rounds of syndrome extraction), tensor-network-based simulation enables us to compute the optimal decoder exactly. See Supplementary Section 5D.
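The second decoding step can be sketched at the PTM level (our own toy illustration; here we score candidates by the process fidelity Tr(R_P R)/4 with the identity, which is one common choice and not necessarily the exact figure of merit used in the paper):

```python
import numpy as np

# PTMs of the four single-qubit Pauli channels (conjugation by I, X, Y, Z).
PAULI_PTMS = {
    "I": np.diag([1.0, 1.0, 1.0, 1.0]),
    "X": np.diag([1.0, 1.0, -1.0, -1.0]),
    "Y": np.diag([1.0, -1.0, 1.0, -1.0]),
    "Z": np.diag([1.0, -1.0, -1.0, 1.0]),
}

def best_correction(R):
    """Pick the logical Pauli whose application maximizes the process
    fidelity of the corrected channel R_P @ R with the identity."""
    return max(PAULI_PTMS, key=lambda P: np.trace(PAULI_PTMS[P] @ R) / 4)

# A logical channel that is exactly a logical Z error is fixed by applying Z,
# while the identity channel needs no correction.
assert best_correction(np.diag([1.0, -1.0, -1.0, 1.0])) == "Z"
assert best_correction(np.eye(4)) == "I"
```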
Data availability
All data^{51} used to create the figures in the main text and in the Supplementary Information can be found at https://doi.org/10.5061/dryad.nk98sf7t8. Contraction orders were derived using the order-finding scheme in the ACQDP package. Detailed information about our cluster architecture and order-finding parameters can be found in Supplementary Section 2. Source data are provided with this paper.
Code availability
ACQDP is publicly available at https://github.com/alibaba/acqdp. The specific version of the ACQDP package and the scripts^{50} that reproduce all of the results reported in this paper can be found at https://doi.org/10.24433/CO.4349832.v3.
References
Preskill, J. Quantum computing and the entanglement frontier. Preprint at https://arxiv.org/abs/1203.5813 (2012).
Arute, F. et al. Quantum supremacy using a programmable superconducting processor. Nature 574, 505–510 (2019).
Zhong, H.S. et al. Quantum computational advantage using photons. Science 370, 1460–1463 (2020).
Preskill, J. Quantum computing in the NISQ era and beyond. Quantum 2, 79 (2018).
Steiger, D. S., Häner, T. & Troyer, M. ProjectQ: an open source software framework for quantum computing. Quantum 2, 49 (2018).
Boixo, S. et al. Characterizing quantum supremacy in nearterm devices. Nat. Phys. 14, 595–600 (2018).
Pednault, E. et al. Breaking the 49-qubit barrier in the simulation of quantum circuits. Preprint at https://arxiv.org/abs/1710.05867 (2017).
Biamonte, J. D., Morton, J. & Turner, J. Tensor network contractions for #SAT. J. Stat. Phys. 160, 1389–1404 (2015).
Huang, C., Newman, M. & Szegedy, M. Explicit lower bounds on strong quantum simulation. IEEE Trans. Inf. Theory 66, 5585–5600 (2020).
White, S. R. Density matrix formulation for quantum renormalization groups. Phys. Rev. Lett. 69, 2863 (1992).
Vidal, G. Efficient classical simulation of slightly entangled quantum computations. Phys. Rev. Lett. 91, 147902 (2003).
Vidal, G. Classical simulation of infinite-size quantum lattice systems in one spatial dimension. Phys. Rev. Lett. 98, 070201 (2007).
Schollwöck, U. The densitymatrix renormalization group. Rev. Modern Phys. 77, 259 (2005).
Bravyi, S., Suchara, M. & Vargo, A. Efficient algorithms for maximum likelihood decoding in the surface code. Phys. Rev. A 90, 032326 (2014).
Ferris, A. J. & Poulin, D. Tensor networks and quantum error correction. Phys. Rev. Lett. 113, 030501 (2014).
Chubb, C. T. & Flammia, S. T. Statistical mechanical models for quantum codes with correlated noise. Ann. Henri Poincaré D 8, 269–321 (2021).
Darmawan, A. S. & Poulin, D. Lineartime general decoding algorithm for the surface code. Phys. Rev. E 97, 051302 (2018).
Dudek, J. M. & Vardi, M. Y. Parallel weighted model counting with tensor networks. Preprint at https://arxiv.org/abs/2006.15512 (2020).
Schutski, R., Khakhulin, T., Oseledets, I. & Kolmakov, D. Simple heuristics for efficient parallel tensor contraction and quantum circuit simulation. Phys. Rev. A 102, 062614 (2020).
Lykov, D., Schutski, R., Galda, A., Vinokur, V. & Alexeev, Y. Tensor network quantum simulator with step-dependent parallelization. Preprint at https://arxiv.org/abs/2012.02430 (2020).
Villalonga, B. et al. A flexible high-performance simulator for verifying and benchmarking quantum circuits implemented on real hardware. npj Quantum Inf. 5, 1–16 (2019).
Orús, R. A practical introduction to tensor networks: matrix product states and projected entangled pair states. Ann. Phys. 349, 117–158 (2014).
Vidal, G. Class of quantum manybody states that can be efficiently simulated. Phys. Rev. Lett. 101, 110501 (2008).
Han, Z.Y., Wang, J., Fan, H., Wang, L. & Zhang, P. Unsupervised generative modeling using matrix product states. Phys. Rev. X 8, 031012 (2018).
Gao, X., Zhang, Z.Y. & Duan, L.M. A quantum machine learning algorithm based on generative models. Sci. Adv. 4, eaat9004 (2018).
Markov, I. L. & Shi, Y. Simulating quantum computation by contracting tensor networks. SIAM J. Comput. 38, 963–981 (2008).
Boixo, S., Isakov, S. V., Smelyanskiy, V. N. & Neven, H. Simulation of lowdepth quantum circuits as complex undirected graphical models. Preprint at https://arxiv.org/abs/1712.05384 (2017).
Gray, J. & Kourtis, S. Hyperoptimized tensor network contraction. Quantum 5, 410 (2021).
Schutski, R., Lykov, D. & Oseledets, I. Adaptive algorithm for quantum circuit simulation. Phys. Rev. A 101, 042335 (2020).
Pan, F. & Zhang, P. Simulating the Sycamore quantum supremacy circuits. Preprint at https://arxiv.org/abs/2103.03074 (2021).
Szegedy, M. What do QAOA energies reveal about graphs? Preprint at https://arxiv.org/abs/1912.12277 (2019).
Wang, H., Wu, J., Yang, X. & Yi, X. A graph isomorphism algorithm using signatures computed via quantum walk search model. J. Phys. A 48, 115302 (2015).
Emms, D., Severini, S., Wilson, R. C. & Hancock, E. R. Coined quantum walks lift the cospectrality of graphs and trees. Pattern Recognit. 42, 1988–2002 (2009).
Mahasinghe, A., Izaac, J. A., Wang, J. B. & Wijerathna, J. K. Phase-modified CTQW unable to distinguish strongly regular graphs efficiently. J. Phys. A 48, 265301 (2015).
O’Brien, T. E., Tarasinski, B. & DiCarlo, L. Density-matrix simulation of small surface codes under current and projected experimental noise. npj Quantum Inf. 3, 1–8 (2017).
Trout, C. J. et al. Simulating the performance of a distance-3 surface code in a linear ion trap. New J. Phys. 20, 043038 (2018).
Ataides, J. P. B., Tuckett, D. K., Bartlett, S. D., Flammia, S. T. & Brown, B. J. The XZZX surface code. Nat. Commun. 12, 1–12 (2021).
Zhou, Y., Stoudenmire, E. M. & Waintal, X. What limits the simulation of quantum computers? Phys. Rev. X 10, 041038 (2020).
Verstraete, F., Murg, V. & Cirac, J. I. Matrix product states, projected entangled pair states, and variational renormalization group methods for quantum spin systems. Adv. Phys. 57, 143–224 (2008).
Pirvu, B., Murg, V., Cirac, J. I. & Verstraete, F. Matrix product operator representations. New J. Phys. 12, 025012 (2010).
Verstraete, F., García-Ripoll, J. J. & Cirac, J. I. Matrix product density operators: simulation of finite-temperature and dissipative systems. Phys. Rev. Lett. 93, 207204 (2004).
Noh, K., Jiang, L. & Fefferman, B. Efficient classical simulation of noisy random quantum circuits in one dimension. Quantum 4, 318 (2020).
Hansen, N., Akimoto, Y. & Baudis, P. CMAES/pycma on Github (Zenodo, 2019); https://doi.org/10.5281/zenodo.2559634
Schlag, S. HighQuality Hypergraph Partitioning. PhD thesis, Karlsruhe Institute of Technology (2020).
Smith, D. G. A. & Gray, J. opt_einsum: a Python package for optimizing contraction order for einsum-like expressions. J. Open Source Software 3, 753 (2018).
Martinis, J. M. et al. Quantum Supremacy Using a Programmable Superconducting Processor (Dryad, 2021); https://doi.org/10.5061/dryad.k6t1rj8
Farhi, E., Goldstone, J. & Gutmann, S. A quantum approximate optimization algorithm. Preprint at https://arxiv.org/abs/1411.4028 (2014).
DiCarlo, L. et al. Demonstration of two-qubit algorithms with a superconducting quantum processor. Nature 460, 240–244 (2009).
Huang, C. et al. Efficient Parallelization of Tensor Network Contractions for Simulating Quantum Computation (Dryad, 2021); https://doi.org/10.5061/dryad.nk98sf7t8
Huang, C., Zhang, F. & Chen, J. An opensource simulatordriven development tool for quantum computing. Code Ocean https://doi.org/10.24433/CO.4349832.v2 (2021).
Meringer, M. Fast generation of regular graphs and construction of cages. J. Graph Theory 30, 137–146 (1999).
Brouwer, A. E. Paulus-Rozenfeld Graphs https://www.win.tue.nl/~aeb/graphs/Paulus.html
Acknowledgements
We would like to thank our colleagues from various teams in Alibaba Cloud Intelligence and the Search Technology Division for supporting us in the numerical experiments presented in this paper. We also thank X. Jiang and his team, Y. Yan and his team, and J. Zhang and his local data center team for helping us with the computing facilities. We thank J. Zhu and his team, and X. Long and his team for their technical support on massive Elastic Computing Service (ECS) initialization and GPU optimization. Finally, we are particularly grateful to J. Gray for his helpful comments on this manuscript. Part of the work by M.N. and F.Z. was supported through an internship at Alibaba Group USA. M.N. was otherwise supported by ARO MURI (grant no. W911NF1610349). F.Z. was otherwise supported in part by the US NSF under award no. 1717523.
Author information
Contributions
C.H. and F.Z. contributed equally to this work. F.Z. and J. Chen proposed the index slicing technique. F.Z. wrote the original ACQDP v.1 code; C.H. devised new improvements including multilayered hypergraph decomposition, local reconfiguration and runtime hardware-specific recompilation for ACQDP v.2. F.Z. and C.H. profiled the program. F.Z. and J. Chen benchmarked ACQDP v.1.1 with large-scale computational facilities. C.H. and J. Chen benchmarked ACQDP v.2 on GPU nodes for Sycamore instances. J. Cai, Z.T., J.W., H.X., H.Y. and B.Y. prepared the large-scale computational facilities used. X.G. deployed the large-scale simulation based on Alibaba Cloud Function Compute. C.H. and J. Chen performed an analysis of the data. M.S. designed the experiment of QAOA simulation. C.H. implemented the QAOA simulation module. D.D., T.W., F.W., G.Z., H.K., H.Z. and C.D. contributed to the error model studied in the Surface-17 simulation. X.N., C.H. and F.Z. implemented the Surface-17 simulation module. C.H., F.Z., M.N., M.S., Y.S. and J. Chen contributed to discussions regarding the design and implementation of the platform. C.H., F.Z., M.N., D.D., Y.S. and J. Chen wrote the paper with contributions from all authors. Y.S. conceptualized and co-directed this study with J. Chen. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
Alibaba Group Holding Ltd has filed patent application US20190332731A1 for inventors J. Chen, F.Z., Y.S., C.H. and M.N., as well as provisional patent applications 62/957,442 (C.H., F.Z. and J. Chen), 63/015,178 (C.H. and J. Chen) and 63/015,116 (C.H. and J. Chen).
Additional information
Peer review information Nature Computational Science thanks Roman Schutski and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Handling editor: Jie Pan, in collaboration with the Nature Computational Science team.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Figs. 1–8, Discussion and Supplementary Tables 1 and 2.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Huang, C., Zhang, F., Newman, M. et al. Efficient parallelization of tensor network contraction for simulating quantum computation. Nat. Comput. Sci. 1, 578–587 (2021). https://doi.org/10.1038/s43588-021-00119-7