A flexible high-performance simulator for verifying and benchmarking quantum circuits implemented on real hardware

Here we present qFlex, a flexible tensor network-based quantum circuit simulator. qFlex can compute both the exact amplitudes, essential for the verification of the quantum hardware, as well as low-fidelity amplitudes, to mimic sampling from Noisy Intermediate-Scale Quantum (NISQ) devices. In this work, we focus on random quantum circuits (RQCs) in the range of sizes expected for supremacy experiments. Fidelity f simulations are performed at a cost that is 1/f lower than perfect fidelity ones. We also present a technique to eliminate the overhead introduced by rejection sampling in most tensor network approaches. We benchmark the simulation of square lattices and Google’s Bristlecone QPU. Our analysis is supported by extensive simulations on NASA HPC clusters Pleiades and Electra. For our most computationally demanding simulation, the two clusters combined reached a peak of 20 Peta Floating Point Operations per Second (PFLOPS) (single precision), i.e., 64% of their maximum achievable performance, which represents the largest numerical computation in terms of sustained FLOPs and the number of nodes utilized ever run on NASA HPC clusters. Finally, we introduce a novel multithreaded, cache-efficient tensor index permutation algorithm of general application.

Quantum circuit simulation plays a dual role in demonstrating quantum supremacy.
First, it establishes a classical computational bar that quantum computation must pass to demonstrate supremacy.Indeed, formal complexity proofs related to quantum supremacy are asymptotic, and therefore assume an arbitrarily large number of qubits [11][12][13][14][15][16][17][18][19][20][21].This is only possible with a fault tolerant quantum computer [13,16,[22][23][24][25][26][27][28], and therefore a near term practical demonstration of quantum supremacy must rely on a careful comparison with highly optimized classical algorithms on state-of-the-art supercomputers.Second, it also provides verification that the quantum hardware is indeed performing as expected up to the limits of classical computational capabilities.
The leading near-term proposal for a quantum supremacy experiment on NISQ devices is based on the sampling of bit-strings from a random quantum circuit (RQC) [13,17,19,21].Indeed, under reasonable assumptions, sampling from large RQCs is classically unfeasible [11,13,14,16,17,19,21].Further, these quantum circuits appear to become difficult to simulate at relatively small sizes and within error tolerances that are expected to be implementable on early NISQ hardware [13].Here, we present a flexible simulator that both raises the bar for quantum supremacy demonstrations and provides expanded verification of quantum hardware through sampling.
This verification can be done through calculating the cross entropy difference between experimentally obtained samples and the output distribution of an ideal circuit, as proposed in Boixo et al. [13].Once sufficiently large quantum circuits can be run with reasonable fidelity on quantum hardware to be beyond the ability to compute samples classically, the cross entropy can no longer be calculated since there is no means to obtain the output distribution of an ideal circuit.Close correspondence between experiments, numerics, and theory up to that point, for a variety of circuits with combinations of fewer qubits, shallower depth, or simpler-to-simulate circuits (e.g., more Clifford gates) or architectures (see end of Sec.III A 1) of the same size, suggest by extrapolation that the hardware is performing correctly and has achieved practical quantum supremacy against the best state-of-the-art algorithms.
Here, we propose a flexible RQC simulator to classically simulate quantum circuits that were beyond reach for previous approaches, including the simulation of the Google Bristlecone QPU.By design, our simulator is "blind" to the randomness in the choice of single-qubit gates of the RQCs.Therefore, it presents no fluctuations in performance from one RQC to another.Moreover, by expanding on a technique introduced in [29], including introducing fine-grained "cuts" that enable us to judiciously balance memory requirements with number of independent computations that can be done in parallel, our simulator can output 1/f amplitudes with a target fidelity f at the same computational cost to compute a single perfect-fidelity amplitude; furthermore, we present an alternative technique to simulate RQC sampling with target fidelity f with the same speedup factor of 1/f .In the last few years, many different simulators have been proposed, either based on the direct evolution of the quantum wave-function [13,[29][30][31][32][33][34][35], Clifford + T gate sets [36], and tensor network contraction [37][38][39][40].Tensor network contraction based simulators have been particularly successful Figure 1.Sub-lattices of interest of the full Bristlecone-72 (bottom right), ordered by increasing hardness for a given depth.Note that Bristlecone-72 (entire lattice) is not harder to simulate than Bristlecone-70, since the two corner tensors can be contracted trivially at a negligible cost (see Section III).Note also that Bristlecone-64 is similar in hardness to Bristlecone-48, and substantially easier to simulate than Bristlecone-60, as is discussed in Sections III and VI.We identify a family of sub-lattices of Bristlecone, namely Bristlecone-24, -30, -40, -48, -60 and -70, that are hard to simulate classically, while keeping the number of qubits as low as possible.
in simulating RQCs for sizes close to the quantum supremacy regime.Some recent simulators exploited [35,39,40] weaknesses in the design of the RQCs presented in [13], and even introduced small changes in the circuits that make them significantly easier to simulate.These designs have been revised (see Section II) to remove these weaknesses [29].It is also important to note that the quantum supremacy computational task of interest consists of producing a sample of bit-strings within some variational distance of the output distribution defined by a quantum circuit [13,17,19,21].This is very different from computing a single output amplitude, as done in Ref. [40] (see Sec. IV).Among the proposed classical approaches, it is worth mentioning Markov et al.'s simulator [29].Their method is based on splitting I × J grids of qubits in halves, which are then independently simulated [39].To make the simulator more competitive, Markov et al. introduce checkpoint states and reuse them for different branches of a tree where internal nodes represent Schmidt decompositions of cross-gates and leaves represent simulation results for each tree path.The number of independent circuits to simulate is exponential in the number of projected CZ-gates that cross from one half to the other.As part of their study, the authors propose for the first time a technique to "match" the target fidelity f of the NISQ device, which actually reduces the classical computation cost by a factor f .By matching the fidelity of a realistic quantum hardware (f = 0.51%), Markov et al. [37] were able to simulate 7 × 7 and 7 × 8 grids with depth 1 + 40 + 1 by numerically computing 10 6 amplitudes in respectively 582,000 hours and 1,407,000 hours on single cores.However, the algorithm in [29] becomes less efficient than our algorithm for grids beyond 8 × 8 qubits because of memory requirements.Moreover, it is not well suited for the simulation of the Google Bristlecone QPU.Indeed, as we show here, the Google Bristlecone QPU implements circuit topologies with a large diameter, which increases the run time exponentially.In both cases, one could mitigate the memory requirements by either using distributed memory protocols like MPI, or by partitioning the RQCs in more sub-circuits.However, the aforementioned approaches introduce a non-negligible slow-down that make them unpractical (see Section C for more details).
To summarize, our tensor network based simulator relies on four different points of strength: Robustness.RQCs are mapped onto regular tensor networks, where each tensor corresponds to a block of the circuit enclosing several gates; consequently, 2D grids of qubits, including the Bristlecone architecture, are mapped onto 2D grids of tensors.Since the blocking operation removes any randomness in the resulting tensor network topology (the only randomness left is in the tensor entries themselves), our simulator is robust against fluctuations from RQC to RQC and to changes of the rules to generate RQCs.
Flexibility.By computing an appropriate fraction of "paths", it is possible to control the "fidelity" of the simulated RQCs, as first introduced in Ref. [29].Therefore, our simulator can output 1/f amplitudes with target fidelity f with the same computational cost to compute one perfect amplitude, for almost any f .This property is very important to "mimic" the sampling from NISQ devices.Scalability.By carefully choosing which cuts to apply to the RQCs, we are able to control the maximum size of tensors seen during tensor contraction.Thanks to the regularity of the resulting tensor network, together with a better memory management and a novel cache-efficient tensor index permutation routine, we are able to simulate circuits of as many as 72 qubits and realistic circuit depths on NISQ architectures such as Bristlecone.
Performance.To the best of our knowledge, our tensor contraction engine is optimized beyond all the existing CPU-based alternatives for contracting the RQCs with the largest number of qubits studied in this work.
Our analyses are supported by extensive simulations on Pleiades (27th in the November 2018 TOP500 list) and Electra (33rd in the November 2018 TOP500 list) supercomputers hosted at NASA Ames Research Center.
In total, we used over 3.2 million core-hours and ran six different numerical simulations (see Fig. 1  For the most computationally demanding simulation we have run, namely sampling from a 60-qubit sub-lattice of Bristlecone, the two systems combined reached a peak of 20 PFLOPS (single precision), that is 64% of their maximum achievable performance, while running on about 90% of the nodes.To date, this is the largest computation run on NASA HPC clusters in terms of peak PFLOPS and number of nodes used.All Bristlecone simulation data are publicly available [41] and we plan to open source our simulator in the near future [42].This paper is structured as follows.In Section II we review the rules for generating the revised RQCs [29], which are based on the constraints of the quantum hardware, while attempting to make classical simulations hard.The hardness of the revised RQCs motivates in part our simulator's approach, which is explained in Section III, where both conceptual and implementation details are discussed.In Section IV we discuss two methods to classically sample from an RQC mimicking the fidelity f of the output of a real device, while achieving a speedup in performance of a factor of 1/f (see Ref. [29]); in addition, we present a method to speedup the classical sampling by a factor of about 10× that, under reasonable assumptions, is well suited to tensor network based simulators.We also discuss the implications of classically sampling from a non fully-thermalized RQC.Section V discusses the hardness of simulating RQCs implemented on the Bristlecone QPU as compared to those implemented on square grids of qubits.Our results are presented in Section VI, with an emphasis on our ability to both simulate the computational task run on the quantum computer, as well as to compute perfect fidelity amplitudes for the verification of the experiments.Finally, in Section VII we summarize our conclusions.

II. REVISED SET OF RANDOM QUANTUM CIRCUITS
In this section, we review the prescription to generate RQCs proposed originally by Google [13], and its revised version [29].This prescription can be used to generate RQCs for 2D square grids, including the Bristlecone architecture (which is a diamond shaped subset of a 2D square grid).The circuit files used for the numerical simulations in this paper are publicly available in [43].
Given a circuit depth and circuit topology of n qubits, Google's RQCs [13,29] are an ensemble of quantum circuits acting on a Hilbert space of dimension N = 2 n .The computational task consists of sampling bit-strings as defined by the final output.
Due to the limitation of the current technology and the constraints imposed by the quantum hardware, circuits are randomly generated using the following prescription: (1) Apply a first layer of Hadamard (H) gates to all the qubits.
(2) After the initial layer (1), subsequent layers of two-qubit gates are applied.There are 8 different layers, which are cycled through in a consistent order (see Fig. 2).
(3) Within these layers, for each qubit that is not being acted upon by a two-qubit gate in the current layer, and such that a two-qubit gate was acting on it in the previous layer, randomly apply (with equal probability) a gate in the set {X 1/2 , Y 1/2 }.(4) Within these layers, for each qubit that is not being acted upon by a two-qubit gate in the current layer, and was acted upon by a gate in the set {X 1/2 , Y 1/2 , H} in the previous layer, apply a T gate.
(5) Apply a final layer of H gates to all the qubits.
The depth of a circuit will be expressed as 1 + t + 1, where the prefix and suffix of 1 explicitly denote the presence of an initial and a final layer of Hadamard gates.
For our simulations, as was done in prior RQC works, we use the CZ gate as our two-qubit gate.One of the differences between the original prescription [13] and this new prescription [29] for the generation of RQCs is that we now avoid placing T gates after CZ gates.If a T gate follows a CZ gate, this structure can be exploited to effectively reduce the computational cost to simulate the RQCs, as was done in [35,39,40].The revised RQC formulation ensures that each T gate is preceded by a {X 1/2 , Y 1/2 , H} gate, which foils this exploit.In addition, the layers of two-qubit gates have been reordered, in order to avoid consecutive "horizontal" or "vertical" layers, which is known to make simulations easier.Finally, it is important to keep the final layer of H gates, as otherwise multiple two-qubit gates at the end of the circuit can be simplified away, making the simulation easier [13].
Replacing CZ gates with iSWAP gates is known to make the circuits yet harder to simulate.More precisely, an RQC of depth 1+t+1 with CZ gates is equivalent, in terms of simulation cost, to an RQC of depth 1 + t/2 + 1 with iSWAPs.In future work, we will benchmark our approach on these circuits as well.

III. OVERVIEW OF THE SIMULATOR
A given quantum circuit can always be represented as a tensor network, where one-qubit gates are rank-2 tensors (tensors of 2 indexes with dimension 2 each), two-qubit gates are rank-4 tensors (tensors of 4 indexes with dimension 2 each), and in general n-qubit gates are rank-2n tensors.The computational and memory cost for the contraction of such networks is exponential with the number of open indexes and, for large enough circuits, the network contraction is unpractical; nonetheless, it is always possible to specify input and output configurations in the computational basis through rank-1 Kronecker deltas over all qubits, which can vastly simplify the complexity of the tensor network.This representation of quantum circuits gives rise to an efficient simulation technique, first introduced in Ref. [37], where the contraction of the network gives amplitudes of the circuit at specified input and output configurations.
Our approach allows the calculation of amplitudes of RQCs through the contraction of their corresponding tensor networks, as discussed above, but with an essential first step, which we now describe.One of the characteristics of the layers of CZ gates shown in Fig. 2 is that it takes 8 cycles for each qubit to share one, and only one, CZ gate with each of its neighbors.This property holds for all subsets of a 2D square grid, including the Bristlecone architecture.Therefore, it is possible to contract every 8 layers of the tensor network corresponding to an RQC of the form described in Section II onto an I × J two-dimensional grid of tensors, where I and J are the dimensions of the grid of qubits.While in this work we assume that the number of layers is a multiple of 8, our simulator can be trivially used for RQCs with a depth that is not a multiple of 8.The bond dimensions between each tensor and its neighbors are the Schmidt rank of a CZ gate, which (as for any diagonal two-qubit gate) is equal to 2 (note that for iSWAP the Schmidt rank is equal to 4, thus effectively doubling the depth of the circuit as compared to the CZ case).After contracting each group of 8 layers in the time direction onto a single, denser layer of tensors, the RQC is mapped onto an I × J × K three-dimensional grid of tensors of indexes of bond dimension 2, as shown in Fig. 3, where K = t/8, and 1 + t + 1 is the depth of the circuit (see Section II).Note that the initial (final) layer of Hadamard gates, as well as the input (resp.output) delta tensors, can be trivially contracted with the initial (resp.final) cycle of 8 layers of gates.At this point, the randomness of the RQCs appears only in the entries of the tensors in the tensor network, but not in its layout, which is largely regular, and whose contraction complexity is therefore independent of the particular RQC instance at hand.This approach contrasts with those taken in Refs.[35,38,40], which propose simulators that either benefit from an approach tailored for each random instance of an RQC, or take advantage of the particular layout of the CZ layers.
The contraction of the resulting 3D tensor network described above (see Fig. 3) in order to compute the amplitude corresponding to specified initial and final bit-strings is described in the following Section III A.

A. Contraction of the 3D tensor network
In this section, we describe the contraction procedure followed for the computation of single perfect-fidelity output amplitudes for the 3D grid of tensors described in the previous section.
Starting from the 3D grid of tensors of Fig. 3, we first contract each vertical (K direction) column of tensors onto a single tensor of at most 4 indexes of dimension 2 K each (see left panel of Fig. 4).Note that for the circuit sizes and depths we simulate, K is always smaller than I and J, and so this contraction is always feasible in terms of memory, fast, and preferable to a contraction in either the direction of I or J.This results in a 2D grid of tensors of size I × J, where all indexes have dimension 2 K (see right panel of Fig. 4).Note that contracting in the time direction first is done at a neglibigle cost, and reduces the number of high complexity contractions to only the ones left in the resulting 2D grid.While we have focused so far on the steps leading to the 2D square grid tensor network of Fig. 4, it is easy to see that the Bristlecone topology (see Bristlecone-72 in Fig. 1) is a sub-lattice of a square grid or qubits, and so all considerations discussed up to this point are applicable.Even though Bristlecone has 72 qubits, the top-left and bottomright qubits of the network can be contracted trivially with their respective only neighbor, adding no complexity to our classical simulation of RQCs.For this reason, without loss of generality, we "turn off" those two qubits from the Bristlecone lattice, and work only with the resulting sub-lattice, which we call Bristlecone-70 (see Fig. 1).For the remainder of this section, we will focus on Bristlecone-70 and other sub-lattices of Bristlecone (see sublattices considered in Fig. 1), and we will refer back to square grids of qubits in later sections.
From Fig. 1, it is easy to see that it is not possible to contract the Bristlecone-70 tensor network without generating tensors of rank 11, where each index has dimension 2 K .For a circuit of depth 1 + 32 + 1 and K = 4, the dimension of the largest tensors is 2 11×4 , which needs over 140 TB of memory to be stored using single precision floating point complex numbers, far beyond the RAM of a typical HPC cluster node (between 32 GB and 512 GB).Therefore, to avoid the memory bottleneck, we decompose the contraction of the Bristlecone-70 tensor network into independent contractions of several easier-to-compute sub-networks.Each subnetwork is obtained by applying specific "cuts", as is described below.
Given a tensor network with n tensors and a set of indexes to contract {i l } l=1,... , i1,i2,... T1 T2 . . .T n , we define a cut over index i k as the explicit decomposition of the contraction into i k {i l } l=1,... −{i k } T 1 T 2 . . .T n .This implies the contraction of dim(i k ) many tensor networks of lower complexity, namely each of the {i l } l=1,... −{i k } T 1 T 2 . . .networks, where tensors involving index i k decrease their rank by 1, fixing the value of i k to the particular value given by the term in i k .This procedure, equivalent to the ones used in Refs.[29,35,39,40], reduces the complexity of the resulting tensor network contractions to computationally manageable tasks (in terms of both memory and time), at the expense of creating exponentially many contractions.The resulting tensor networks can be contracted independently, which results in a computation that is embarrasingly parallelizable.It is possible to make more than one cut on a tensor network, in which case i k refers to a multi-index; the contribution to the final sum of each of the contractions (each of the values of the multi-index cut) is called a "path", and the final value of the contraction is the sum of all path contributions.
For the Bristlecone-70 example with depth (1 + 32 + 1), making four cuts, as shown in Fig. 5 (top row), decreases the size of the maximum tensor stored during the contraction from 2 11×4 to 2 7×4 entries, at the price of 2 4×4 contractions to be computed.At the same time, the choice of cuts aims at lowering the number of high complexity contractions needed per path, as well as lowering the number of largest tensors held simultaneously in memory.Note that for Bristlecone-60, tensors A and B are both equally large, and that the number of high complexity contractions is larger than for a single path of Bristlecone-70.
After making these cuts, the contraction of each path is carried out in the following way (see Fig. 5): first, we contract all tensors within region A onto a tensor of rank 7 (tensor A); we do the same for tensor B; then tensors A and B are contracted onto a rank-6 tensor, AB; finally, tensor C is contracted, which does not depend on the particular path at hand, followed by the contraction of AB with C onto a scalar.In Fig. 5 (bottom row) we depict the corresponding A, B, and C regions for the sub-lattices of Bristlecone we use in our simulations, as well as the cuts needed to contract the resulting tensor networks using the described method, in particular for Bristlecone-48, -60, and -64.Note that that Bristlecone-48 and -64 need both two cuts of depth 4, making them similar to each other in complexity, while Bristlecone-60 needs three cuts, making it substantially harder to simulate.
We identify a family of sub-lattices of Bristlecone, namely Bristlecone-24, -30, -40, -48, -60 and -70, that are hard to simulate classically, while keeping the number of qubits and gates as low as possible.Indeed, the fidelity of a quantum computer decreases with the number of qubits and gates involved in the experiment [13], and so finding classically hard sub-lattices with a small number of qubits is essential for quantum supremacy experiments.It is interesting to observe that Bristlecone-64 is an example of a misleadingly large lattice that is easy to simulate classically (see Section VI for our numerical results).
Note that the rules for generating RQCs cycle over the layers of two-qubit gates depicted in Fig. 2. In the case that the cycles or the layers are perturbed, our simulator can be trivially adapted.In particular: 1) if the layers are applied in a different order, but the number of two-qubit gates between all pairs of neighbors is the same, then the 2D grid tensor network of Fig. 4 still holds, and the contraction method can be applied as described; 2) if there is a higher count of two-qubit gates between some pairs of neighbors than between others, then the corresponding anisotropy in the bond dimensions of the 2D tensor network can be exploited through different cuts.

B. Implementation of the simulator
We implemented our tensor network contraction engine for CPU-based supercomputers using C ++ .
We have planned to release our tensor contraction engine in the near future [42].During the optimization, we were able to identify two clear bottlenecks in the implementation of the contractions: the matrix multiplication required for each index (or multi-index) contraction, and the reordering of tensors in memory needed to pass the multiplication to a matrix multiplier in the appropriate storage order (in particular, we always use row-major storage).In addition, to avoid time-consuming allocations during the runs, we immediately allocate large enough memory to be reused as scratch space in the reordering of tensors and other operations.

Matrix multiplications with Intel R MKL
For the multiplication of two large matrices that are not distributed over several computational nodes, Intel's MKL library is arguably the best performing library on Intel CPU-based architectures.We therefore leave this essential part of the contraction of tensor networks to MKL's efficient, hand-optimized implementation of the BLAS matrix multiplication functions.

Cache-efficient index permutations
The permutation of the indexes necessary as a preparatory step for efficient matrix multiplications can be very costly for large tensors, since it involves the reordering of virtually all entries of the tensors in memory; similar issues have been an area of study in other contexts [44][45][46].In this section we describe our novel cache-efficient implementation of the permutation of tensor indexes.
Let A i0,...,i k be a tensor with k indexes.In our implementation, we follow a row-major storage for tensors, a natural generalization of matrix rowmajor storage to an arbitrary number of indexes.In the tensor network literature, a permutation of indexes formally does not induce any change in tensor A. However, given a storage prescription (e.g., row-major), we will consider that a permutation of indexes induces the corresponding reordering of the tensor entries in memory.A naive implementation of this reordering routine will result in an extensive number of cache misses, with poor performance.
We implement the reorderings in a cacheefficient way by designing two reordering routines that apply to two special index permutations.Let us divide a tensor's indexes into a left and a right group: A i 0 , . . ., i j i j+1 , . . ., i k .If a permutation involves only indexes in the left (right) group, then the permutation is called a left (resp.right) move.Let γ be the number of indexes in the right group.We will denote left (resp.right) moves with γ indexes in the right group by Lγ (resp.Rγ).The importance of these moves is that they are both cache-efficient for a wide range of values of γ and that an arbitrary permutation of the indexes of a tensor can be decomposed into a small number of left and right moves, as will be explained later in this section.Let d γ be the dimension of all γ right indexes together.Then left moves involve the reordering across groups of d γ entries of the tensor, where each group of d γ entries is contiguous in memory and is moved as a whole, without any reordering within itself, therefore largely reducing the number of cache misses in the routine.On the other hand, right moves involve reordering within all of d γ entries that are contiguous in memory, but involves no reordering across groups, hence greatly reducing the number of cache misses, since all reorderings take place in small contiguous chunks of memory.Fig. 6 shows the efficiency of Rγ and Lγ as compared to a naive (but still optimized) implementation of the reordering that is comparable in performance to python's numpy implementation.A further advantage of the left and right moves is that they can be parallelized over multiple threads and remain cache-efficient in each of the threads.This allows for a very efficient use of the computation resources, while the naive implementation does not benefit from multi-threading.
Let us introduce the decomposition of an arbitrary permutation into left and right moves through an example.Let A abcdef g be a tensor with 7 indexes of dimension d each.Let abcdef g → cf eadgb be the index permutation we wish to perform.Furthermore, let us assume that it is known that L2 and R4 are cache-efficient.Let us also divide the list of 7 indexes of this example in three groups: the last two (indexes 6 and 7), the next group of two indexes from the right (indexes 4 and 5), and the remaining three indexes on the left (1, 2, and 3).We now proceed as follows.First, we apply an L2 move that places all indexes in the left and middle groups that need to end up in the rightmost group in the middle group; in our case this is index b, and the L2 we have in mind is abc|de |f g; note that if the middle group is at least as big as the rightmost group, then it is always possible to do this.Second, we apply an R4 move that places all indexes that need to end up in the rightmost group in their final positions; in our case, that is cae| bd|f g R4 → cae| f d|gb R4 ; note that, if the first move was successful, then this one can always be done.Finally, we take a In practice, we find that (beyond the above example, where µ = 2 and ν = 4) for tensors with binary indexes, µ = 5 and ν = 10 are good choices for our processors (see Fig. 6).If the tensor indexes are not binary, this approach can be generalized: if all indexes have a dimension that is a power of 2, then mapping the reordering onto one involving explicitly binary indexes is trivial; in the case where indexes are not all powers of 2, then different values of µ and ν could be found, or decompositions more general than Lµ − Rν − Lµ could be thought of.In our case, we find good results for the L5 − R10 − L5 decomposition.Note also that in many cases a single R or a single L move is sufficient, and sometimes a combination of only two of them is enough, which can accelerate contractions by a large factor.
We apply a further optimization to our index permutation routines.A reordering of tensor entries in memory (either a general one or some of Rγ or Lγ moves) involves two procedures: generating a map between the old and the new positions of each entry, which has size equal to the dimension of all indexes involved, and applying the map to actually move the entries in memory.The generation of the map takes a large part of the computation time, and so storing maps that have already been used in a look-up table (memoization), in order to reuse them in future reorderings, is a desirable technique to use.While the size of such maps might make this approach impractical in general, for left and right moves memoization becomes feasible, since the size of the maps is now exponentially smaller than in the general case due to left and right moves only involving a subset of indexes.In the contraction of regular tensor networks we work with maps reappear often, and so memoization proves very useful.
The implementation of the decomposition of general permutations of indexes into left and right moves, with all the details discussed above, give us speedups in the contractions that range from 5% in single-threaded contractions that are dominated by matrix multiplications, to over a 50% in multithreaded contractions that are dominated by reorderings.

IV. FAST CLASSICAL SAMPLING OF BIT-STRINGS FROM LOW FIDELITY RQCS
While the computation of perfect fidelity amplitudes of output bit-strings of RQCs is needed for the verification of quantum supremacy experiments [13], classically simulating sampling from low fidelity RQCs is essential in order to benchmark the performance of classical supercomputers in carrying out the same task that the low fidelity quantum computer performs.In Section IV A we describe two methods to mimic the fidelity f of the output wave-function of the quantum computer with our simulator, providing a speedup of a factor of 1/f to the simulation as compared to the computation of exact amplitudes [29].In Section IV B we describe a way to reduce the computational cost of the sampling procedure on tensor contraction type simulators by a factor of almost 10×, under reasonable assumptions.Finally, in Section IV C we discuss the implications of sampling from a Porter-Thomas distribution that has not fully converged.

A. Simulating low fidelity RQCs
Here, we describe two methods to reduce the computational cost of classically sampling from an RQC given a target fidelity.

Summing a fraction of the paths
This method, presented in Ref. [29], exploits the fact that, for RQCs, the decomposition of the output wave-function of a circuit into paths |ψ = p∈{paths} |ψ p (see Sec.III A 1) leads to terms |ψ p that have similar norm and that are almost orthogonal to each other.For this reason, summing only over a fraction f of the paths, one obtains a wave-function | ψ with norm ψ| ψ = f .Moreover, | ψ has fidelity f as compared to |ψ , that is: Therefore, amplitudes of a fidelity f wave-function can be computed at a cost that is only a fraction f of that of the perfect fidelity case.We find empirically that, while the different contributions |ψ p fulfill the orthogonality requirement (with a negligible overlap; e.g., in the Bristlecone-60 simulation, the mutual fidelity between pairs out of 4096 paths is about 10 −6 ), there is some non negligible variation in their norms (see Section VI and Fig. 7), and thus the fidelity achieved by |ψ p is equal to: which is in general different than (#paths) −1 .If an extensive subset of paths is summed over, then the variations on the norm and the fidelity are suppressed, and the target fidelity is achieved.This was the case in Ref. [29].However, in this work we aim at minimizing the number of cuts on the circuits, and so low fidelity simulations involve a small number of paths (between 1 and 21 in the cases simulated).In this case, some "unlucky" randomly selected paths might contribute with a fidelity that is below the target, while others might achieve a higher fidelity than expected.Finally, the low fidelity probability amplitudes reported in Ref. [29], obtained using the method described above, follow a Porter-Thomas distribution as expected for perfect fidelity amplitudes.Again, this is presumably true only in the case For some simulations, the depth is not sufficient to fully converge to a Porter-Thomas distribution.Furthermore, summing a small number of paths of low fidelity might lead to worse convergence than expected for a particular depth (see Section IV A 1).
when a large number of paths is considered.In our case, we find distributions that have not fully converged to a Porter-Thomas, but rather have a larger tail (see Section VI and Fig. 7).We attribute this phenomenon to the cuts in the circuit acting as removed gates between qubits, thus increasing the effective diameter of the circuit, which needs higher depth to fully thermalize.We discuss the implications of these tails for the sampling procedure in Section IV C.

Fraction of perfect fidelity amplitudes
There exists a second method to simulate sampling from the output wave-function |ψ with a target fidelity f that avoids summing over a fraction of paths.
The output density matrix of a random quantum circuit with fidelity f can be written as [13] This means that to produce a sample with fidelity f we can sample from the exact wave-function |ψ with probability f or produce a random bit-string with probability 1 − f .The sample from the exact wave-function can be simulated by calculating the required number of amplitudes with perfect fidelity.
Note that the method presented in this section involves the computation of the same number of paths as the one described in Section IV A 1 for a given f , circuit topology, circuit depth, and set of cuts.However, this second method is more robust in achieving a target fidelity.Note that by this argument the 6000 amplitudes of [Run5] are equivalent to 1.2M amplitudes at 0.5% fidelity.

B. Fast sampling technique
While 10 6 sampled amplitudes are necessary for cross entropy verification of the sampling task [13], the frugal rejection sampling proposed in Ref. [29] needs the numerical computation of 10×10 6 = 10 7 amplitudes in order to carry out the correct sampling on a classical supercomputer.This is due to the acceptance of 1/M amplitudes (on average) of the rejection sampling, where M = 10 when sampling from a given Porter-Thomas distribution with statistical distance of the order of 10 −4 (negligible).
In this section, we propose a method to effectively remove the 10× overhead in the sampling procedure for tensor network based simulators, which normally compute one amplitude at a time.For the sake of clarity, we tailor the proposed fast sampling technique to the Bristlecone architecture.However, it can be straightforwardly generalized to different architectures (see Appendix A).Given the two regions of the Bristlecone (and sub-lattices) AB and C of Fig. 5, and the contraction proposed (see Section III A 1), the construction of tensor C and the subsequent contraction with AB are computationally efficient tasks done in a small amount of time as compared to the full computation of the particular path.This implies that one can compute, for a given output bit-string on AB, s AB , a set of the 2 12 amplitudes generated by the concatenation of s AB with all possible s C bit-strings on C at a small overhead cost per amplitude.We call this set of amplitudes a "batch", we denote its size by N C , and each of the (concatenated) bitstrings by s ABC .In practice, we find that for the Bristlecone-64 and -60 with depth (1+32+1), the computation of a batch of 30 amplitudes is only around 10% more expensive than the computation of a single amplitude, while for the Bristlecone-48 and -70 with depth (1+32+1), the computation of a batch of 256 amplitudes is around 15% more expensive than the computation of a single amplitude, instead of a theoretical overhead of 30× and 256×, respectively.
The sampling procedure we propose is a modification of the frugal rejection sampling presented in Ref. [29] and proceeds as follows.First, we choose slightly over 10 6 (see below) random bitstrings on AB, s AB .For each s AB , we choose N C bit-strings on C, s C , at random (without repetition).We then compute the probability amplitudes corresponding to all s ABC bit-strings on all (slightly over) 10 6 batches.We now shuffle each batch of bit-strings.For each batch, we proceed onto the bit-strings in the order given by the suffle; we accept a bit-string s ABC with probability min [1, p(s ABC )N/M ], where p(s ABC ) is the probability amplitude of s ABC , and N is the dimension of the Hilbert space; once a bit-string is accepted, or the bit-strings of the batch have been exhausted without acceptance, we proceed to the next batch.By accepting at most one bit-string per batch we avoid introducing spurious correlations in the final sample of bit-strings.
Given an M and a batch size N C , the probability that a bit-string is accepted from a batch is (on average) 1 − (1 − 1/M ) N C .For M = 10 and N C = 30, the probability of acceptance in a batch is 95.76%, and one would need to compute amplitudes for 1.045 × 10 6 batches in order to sample 10 6 bit-strings; for M = 10 and N C = 60, the probability goes up to 99.82%, and one only needs 1.002×10 6 batches; for M = 10 and N C = 256, the probability of acceptance is virtually 100%, and 1.00 × 10 6 batches are sufficient.There is an optimal point, given by the overhead in computing batches of different sizes N C and the probability of accepting a bit-string on a batch given N C , that minimizes the runtime of the algorithm.
There is a crucial condition for this sampling algorithm to work, namely the absence of correlations between the probability amplitudes of the bit-strings {s ABC } for fixed s AB , so that they are indistinguishable from probability amplitudes taken from random bit-strings over ABC.We expect this condition to be satisfied for chaotic systems that have converged to a Porter-Thomas distribution.In order to test this, we perform the following test: for Bristlecone-24, we choose 1000 bit-strings over AB (s AB ) at random and for each of them we generate a batch of size N C = 32, where we constrain the bit-strings {s C } to be the same across batches.We now compute the Pearson correlation coefficient between the two sets of 1000 amplitudes gotten for each pair of bit-strings in C, and we do this for all 32 × 31/2 pairs.If Hamming distance is plotted with a solid line.We can see that, for depth (1+24+1) (top), the system has not thermalized, and the correlation decreases with Hamming distance; for depth (1+32+1) (middle), correlations approach zero, and become Hamming distance independent (on subsystem C).Bottom: we compare the distribution of Pearson coefficients, obtained as described above, to the distribution of Pearson coefficients obtained (numerically) from probability amplitudes with the same sample size as in the simulations above, drawn from a Porter-Thomas distribution.At large enough depth the system is expected to thermalize and the two distributions match, meaning that the probability amplitudes obtained by varying bit-strings only on subsystem C are uncorrelated.
the probability amplitudes of each batch are really uncorrelated to each other, we expect the correlation coefficient to vanish.We show the coefficient as a function of Hamming distance between the pairs in Fig. 8 (top two panels).We can see that, for depth (1+24+1) (top) there is a small but non negligible correlation, which in fact de-creases on average with Hamming distance.For depth (1+32+1) (middle), the correlation is Hamming distance independent and approaches zero.In the bottom panel of Fig. 8 we compare the distribution of Pearson coefficients obtained for both depths analyzed to that one obtained from pairs of sets of size 1000 sampled from a Porter-Thomas distribution.While a fairer comparison would involve sampling from the distribution of the output wave-function of the RQC, which might differ from the Porter-Thomas in the absence of convergence, we still see a clear tendency of the distributions to match for longer depth, i.e., closer to convergence.

C. Sampling from a non fully-thermalized
Porter-Thomas distribution In Ref. [29] an error of the order of 10 −4 is computed for a frugal rejection sampling with M = 10, assuming Porter-Thomas statistics.When the distribution has not converged to a Porter-Thomas, but rather has a larger tail, we expect the error to increase.We can estimate the error in sampling numerically for the cases simulated here as the sum of the probability amplitudes larger than M/N with N being the dimension of the Hilbert space, multiplied by N and divided by the number of amplitudes computed.For M = 10, we estimate an error = 9.3×10 −4 for [Run1], = 1.0×10 −2 for [Run2], and = 2.5×10 −3 for [Run6], respectively.If instead we consider M = 15, this lowers the error to = 1.3 × 10 −5 for [Run1], = 1.0 × 10 −3 for [Run2], and = 1.15 × 10 −4 for [Run6], respectively.Increasing M , in order to reduce the error in the frugal sampling, implies a lower acceptance rate in the fast sampling, which is resolved by increasing the size of the batches N C , which is done at a small cost.

V. SIMULATION OF THE BRISTLECONE LATTICE AS COMPARED TO RECTANGULAR GRIDS
The diamond shape of the topology of Bristlecone and its hard sub-lattices (see Fig. 1: Bristlecone-24, -30, -40, -48, -60, and -70) makes them particularly hard to simulate classically when compared to rectangular grids of the same (or smaller) number of qubits.Indeed, these lattices are subsets of large rectangular grids, from which they inherit their diameter; e.g., Bristlecone-70 is a sublattice of a 10 × 11 grid.When cutting the lattice (see Section III A 1), one has to apply several cuts in order to decrease the maximum size of the tensors in the contraction to manageable sizes; in the case of Bristlecone-70 and depth (1+32+1), four cuts are needed in order to have tensors in the contraction of at most dimension 2 7×4 , while for a rectangular 8 × 9 lattice (with 72 qubits) only 3 cuts are needed.Note that the computational cost scales with the dimension of the indexes cut, i.e., exponentially with the number of cuts.
The same applies to a simulator based on a full split of the circuit into two parts, as in Refs.[12,29,39].For instance, the number of CZ gates for RQCs with depth (1+32+1) which are cut when splitting Bristlecone-60 in two halves is equal to 40.In comparison, 8 × 8 grids of qubits with the same depth have only 32 CZ gates cut.See Section C for more details.
As was discussed in Section III, identifying topologies that are hard to simulate classically, but that minimize the number of qubits involved, increases the chances of success success of quantum supremacy experiments, due to the decrease of the overall fidelity of the quantum computer with the number of gates and qubits [13].For this reason, we find that Bristlecone is a good setup for quantum supremacy experiments.

VI. RESULTS
In this section we review the performance and the numerical results obtained by running our simulations [Run1-6] on the NASA HPC clusters Pleiades and Electra.
In the time of exclusive access to large portions of the NASA HPC clusters, we were able to run for over 3.2 million core-hours.Although most of the computation ran on varying portions of the supercomputers, for a period of time we were able to reach the peak of 20 PFLOPS (single precision), that corresponds to 64% of the maximum achievable performance for Pleiades and Electra combined.For a comparison, the peak for the LINPACK benchmark is 23 PFLOPS (single precision, projected), which is only 15% larger than the peak we obtained with our simulator.This is to date the largest simulation (in terms of number of nodes and FLOPS rate) run on the NASA Ames Research Center HPC clusters.This is not a surprise since both LINPACK and our simulation do the majority of work in MKL routines (dgemm or cgemm and similar), in our case due in part to the fact that our cache-efficient memory reordering routines lower the tensor indexes permutation bottleneck to a minimum.Fig. 9 reports the distribution of the runtimes for a single instance of each of the six simulations [Run1-6] for both Pleiades and Electra.Interest-ingly, we observe a split in the distribution of runtimes (see [47] for further details).For our simulations run on Pleiades, we used all the four available node architectures: • 2016 Broadwell (bro) nodes: Intel Xeon E5-2680v4, 28 cores, 128GB per node.
For the Electra system, we used its two available node architectures: • 1152 Broadwell (bro) nodes: same as above.
Note that the Skylake nodes at Electra form a much smaller machine than Pleiades, but substantially more efficient, both time and energy-wise.
In Table I we report runtime, memory footprint, and number of cores (threads) used for all six cases run on NASA Pleiades and Electra HPC clusters.As we describe in Section III, instances (which involve a certain number of paths given a cut prescription, as well as a batch size N c , as introduced in Section IV B) can be collected for a large number of low fidelity amplitudes or for a smaller number of high fidelity amplitudes at the same computation cost.Note that, after we run our simulations on Pleiades and Electra we have identified for Bristlecone-48 and -70 a better contraction procedure ([Run2b] and [Run3b], respectively).This new contraction is is about twice as fast as the one used in [Run2-3], which was similar in approach to the contraction used for Bristlecone-60 (see Section B for more details); we include the estimated benchmark of these new contractions as well.All the numerical data gathered during the simulations [Run1-6], including all the amplitudes, are publicly available [41].
In Table II we estimate the effective runtime needed for the computation of 10 6 amplitudes with a target fidelity close to 0.5% on a single core, for different node types.As one can see, the Bristlecone-60 sub-lattice is almost 10× harder to simulate than the Bristlecone-64 sub-lattice, while Bristlecone-64 is only 2× harder than Bristlecone-48.
In the following, we report the (estimated) runtime and energy consumption for both the  I for more details.For clarity, all distributions have been normalized so that their maxima are all at the same height.The nodes used on NASA HPC clusters Pleiades and Electra are: Broadwell (bro), Intel Xeon E5-2680v4 ; Haswell (has), Intel Xeon E5-2680v3; Ivy Bridge (ivy), Intel Xeon E5-2680v2; Sandy Bridge (san), Intel Xeon E5-2670; Skylake (sky), 2 × 20-core Intel Xeon Gold 6148 processors per node.Bottom: Same distribution as above, but the runtimes are multiplied by the number of cores per job on a single node, to provide a fairer comparison.As one can see, Skylake nodes provide generally the best performance, and belong to Electra, an energy efficient HPC cluster.The split of runtimes into groups is discussed in [47].
tasks of verification and sampling for rectangular grids of qubits, up to 8 × 9, as well as the full Bristlecone-70 layout.
The estimation is obtained by computing a small percentage of the calculations required for the full task.We would like to stress that our simulator's runtimes are independent of any particular RQC instance and,  I. Number of paths per instance, size of batches of amplitudes (see Section IV B), number of cores (threads) used per instance, memory footprint, number of instances fit in a node, and runtime per instance for all six cases run and for all five node types used on NASA Pleiades and Electra HPC clusters.We report single instances of a run, where an instance corresponds to the computation of a number of paths given a cut prescription, and the computation of a batch of NC amplitudes corresponding to output bit-strings chosen at random over subsystem C (see Section IV B).Note that for [Run3] NC = 512, and so computing NC amplitudes takes about three times the time of computing a single one.However, this is strongly mitigated with the contraction used for [Run3b].Note also that for [Run6] we ran 17 jobs per Skylake node, instead of 19, as a conservative strategy to stay well below the total memory available on these nodes and hence avoid any unwanted crash in our largest simulation.Instances can be collected for a large number of low fidelity amplitudes or for a smaller number of high fidelity amplitudes at the same computational cost.*[Run2b] and [Run3b] refer to the benchmark of the contraction procedure introduced in Section III A 1 for Bristlecone-48 and Bristlecone-70, respectively; [Run1] and [Run3] were run were run using a less performing procedure, similar to the one used for Bristlecone-60 (see Section B).Table II.Estimated effective runtimes on a single core for the computation of 10 6 amplitudes with a target fidelity of about 0.5% for the Bristlecone sub-lattices (see Fig. 1 for nomenclature).This is an estimate of the computational cost for the completion of the RQC sampling task.The estimate is based on the runtimes for single instances presented in therefore, our estimations are quite robust.

Runtime
Table III shows the (estimated) performance (runtimes and energy consumption) of our simulator in computing perfect fidelity amplitudes of output bit-strings of an RQC (rectangular lattices and Bristlecone-70), for both Pleiades and Electra.Runtimes are estimated assuming that fractions of the jobs are assigned to each group of nodes of the same type in a way that they all finish simultaneously, thus reducing the total real time of the run.The power consumption of Pleiades is 5MW, and a constant power consumption per core, regardless of the node type, is assumed for our estimations.For Electra, the 2304 Skylake nodes have an overall power consumption of 1.2MW, while the 1152 Broadwell nodes have an overall power consumption of 0.44MW.
Classically sampling bit-strings from the output state of an RQC involves the computation of a large number (approximately one million) of lowfidelity (about 0.5%) probability amplitudes, as better described in Section IV A. Table IV shows the (estimated) performance of our simulator in this task, with runtimes and energy consumption requirements on the two HPC clusters, Pleiades and Electra.
Finally, we compare our approach to the two leading previously existing simulators of RQCs, introduced in Ref. [40] (Alibaba) and Ref. [29] (MFIB) (see also Table V).
Compared to Ref. [40], our simulator is between 3.6× and 100× slower (see [49] for details), depending on the case.However, it is important to stress that Ref. [40] reports the computational cost to simulate a class of RQCs which is much easier to simulate than the class of RQCs reported in Ref. [13].Indeed, Chen et al. fail to include the final layer of Hadamards in RQCs and use more T gates at the beginning of the circuit.For these reasons, we estimate that such class is about 1000× easier to simulate than the new prescription of RQCs we actually simulated.The computational cost of simulating a circuit using Alibaba's simulator scales as 2 TW , where TW is the treewidth of the undirected graphical model of the circuit [38].We show in Fig. 10 the treewidths of the circuits simulated in Ref. [40], the old prescription of the circuits [13] (with and without the final layer of Hadamards), and the revised prescription, for RQCs on a 7 × 7 × (1 + 40 + 1) square grid.Note that the circuits simulated in Ref. [40] are (on average) 1000× easier or more than the revised ones.Note also that the revised RQCs have no variation in treewidth from one  [40] with depth (1+40) (note that no layer of Hadamards is added at the end of the circuit); [Ali 41] Ref. [40] with depth (1+41); [v1 no H] old prescription of the RQCs [13] without the final layer of Hadamards and depth (1+41); [v1] old prescription of the RQCs [13] with the final layer of Hadamards and depth (1+40+1); [v2] revised prescription of the RQCs [29] with depth (1+40+1).Note that in all cases the tree width of the RQCs is substantially larger than that ones simulated in Ref. [40], making the simulations about 2 13 × or 2 14 × harder (on average).Moreover, fluctuations in the treewidth for the revised prescriptions of RQCs are completely absent.The upper bounds were obtained by running quickbb [48] with settings --time 60 --min-fill-ordering.
instance to another.Finally, it is worth noting that Ref. [40] reports runtimes corresponding to the 80 percentile best results, excluding the worst runtimes.On the contrary, our runtimes have little fluctuations and are RQC independent.
Compared to Ref. [29], our simulator is 7× less efficient to compute 10 6 amplitudes with fidelity 0.51% for 7×7 grids of qubits with depth 1+40+1, using the new prescription of RQCs.However, it is important to note that the runtime of MFIB's simulator and our simulator scale in different ways.Indeed, MFIB's approach has the advantage to compute a large number of amplitudes with a small cost overhead.On the contrary, our approach performs much better in the computation of a smaller subset of amplitudes; both methods use comparable resources when computing about 10 5 amplitudes of a 7 × 7 × (1 + 40 + 1) RQC.Note also that MFIB's approach is limited by memory usage, and it scales unfavorably compared to our simulator for circuits with a large number of qubits (e.g., beyond 8 × 8 rectangular grids), with a large diameter (e.g., Bristlecone-60 and -70), or both.For instance, Bristlecone-70 would require 825GB per node, which is currently unavailable for most of HPC clusters.To mitigate the memory requirements, one could either partition the RQCs in more sub-circuits, or use distributed memory protocols like MPI.However, both approaches introduce a non-negligible slow-down that make them unpractical (see Section C for more details).

VII. CONCLUSIONS
In this work, we introduced a flexible simulator, based on tensor contraction, to compute both exact and noisy (with a given target fidelity [29]) amplitudes of the output wave-function of a quantum circuit.While the simulator is general and can be used for a wide range of circuit topologies, it is well optimized for quantum circuits with a regular design, including rectangular grids of qubits and the Google Bristlecone QPU.To test the performance of our simulator, we focused on the benchmark of random quantum circuits (RQCs) presented in Refs.[13,29] for both the 2-D grids (7 × 7, 8 × 8 and 8 × 9) and the Google Bristlecone QPU (24, 48, 60, 64, and 70 qubits).Compared to some existing methods [35,39,40], our approach is more robust and performs well on the redesigned, harder class of RQCs.While other benchmarks exploit [35], and sometimes introduce [39,40], weaknesses in particular ensembles of random quantum circuits that affect their reported performance sig-nificantly, our runtimes are directly determined by the number of full lattices of two-qubit gates at a given depth (see Fig. 4).
Our performance analyses are supported by extensive simulations on Pleiades (24th in the November 2018 TOP500 list) and Electra (43rd in the November 2018 TOP500 list) supercomputers hosted at NASA Ames Research Center.To our knowledge, ours is the first classical simulator able to compute exact amplitudes for the benchmark of RQCs using the full Google Bristlecone QPU with depth 1+32+1 in less than (f • 4200) hours on a single core, with f the target fidelity.This corresponds to 210 hours in Pleiades or 264 hours in Electra for 10 6 amplitudes with fidelity close to 0.5,% a computation needed to perform the RQC sampling task.All our data are publicly available to use [41].
At first sight, compared to Alibaba's simulator [40], our simulator is between 3.6× and 100× slower, depending on the case.However, Alibaba's simulator heavily exploits the structure of RQCs and its performance widely varies from one RQC instance to another.Indeed, Ref. [40] reports only runtimes corresponding to the 80th percentile best results, excluding the worst runtime.In contrast, our runtimes have little variation in performance between instanes and are independent of RQC class.Moreover, Ref. [40]  Table V.Estimated runtime and energy cost consumption to compute the specified number of amplitudes for our simulator on a single processor of the Skylake nodes portion of the NASA Electra system, compared to Ref. [29] (MFIB).The energy cost for the MFIB simulations is estimated assuming the same power consumption per core as the Skylake nodes.In Ref. [29], the authors a number of cores P = 625 × 16 = 10 simulations; this is due to the ability of modern Intel processors to "up-clock" their CPUs in favorable conditions (known as Dynamic Frequency Scaling), thus consuming a similar amount of energy and achieving a similar performance as in the case where there are no idle cores.Note that MFIB's approach has the advantage to compute a large number of of amplitudes with a small cost overhead.On the contrary our approach performs much better in the computation of a smaller subset of amplitudes; both methods use comparable resources when computing about 10 5 amplitudes.The MFIB algorithm becomes less efficient than our algorithm as the number of qubits grows because of memory requirements.
layer of Hadamards and uses fewer non-diagonal gates at the beginning of the circuit which, we estimate, makes the corresponding circuits approximately 1000× easier to simulate.We would like encourage the reporting of benchmarking against the circuit instances publicly available in [43] in order to arrive at meaningful conclusions.Compared to Ref. [29], our simulator is 7× less efficient (on Electra Skylake nodes) to compute 10 6 amplitudes with fidelity 0.51% for 7×7 grids of qubits with depth 1 + 40 + 1.However, compared to Ref. [29] our simulator scales better on grids beyond 8 × 8 and on circuits with a large number of qubits and diameter, including the Bristlecone QPU and its sub-lattices Bristlecone-60 and-70.
In addition, we were able to simulate (i.e., compute over 10 6 amplitudes) RQCs on classically hard sub-lattices of the Bristlecone of up to 60 qubits with depth (1+32+1) and fidelity comparable to the one expected in the experiments (around 0.50%) in effectively well below half a day using both Pleiades and Electra combined.We also discussed the classical hardness in simulating sublattices of Bristlecone as compared to rectangular grids with the same number of qubits.Our theoretical study and numerical analyses show that simulating the Bristlecone architecture is computationally more demanding than rectangular grids with the same number of qubits and we propose a family of sub-lattices of Bristlecone to be used in experiments that make classical simulations hard, while keeping the number of qubits and gates involved as small as possible to increase the overall fidelity.
As a final remark, we will explore using our approach and extensions to simulate different classes of quantum circuits, particularly those with a regular structure, including quantum circuits for algorithms with potential applications to challenging optimization and machine learning problems arising in aeronautics, Earth science, and space exploration, as well as to simulate many-body systems for applications in material science and chemistry.and Electra, during the dedicated downtime.
The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of ODNI, IARPA, AFRL, or the U.S. Government.The U.S. Government is authorized to reproduce and distribute reprints for Governmental purpose notwithstanding any copyright annotation thereon.

Appendix A: Contraction of RQCs on rectangular grids
Here we describe the contraction scheme for I×J grids.To simplify the discussion, we consider 7×7×(1+40+1) RQCs.Nonetheless, the same procedure applies to other rectangular circuits.For the 7 × 7 × (1 + 40 + 1) RQCs, we use two cuts (2 2×5 paths) in the grid (see Fig. A1), in order to divide it into four tensors, A, B, C, and D, of dimension 2 6×5 = 2 30 each.Then, the contraction proceeds as follows.
1) Tensors in region A are contracted onto tensor A, which is path independent; we do the same for tensors in regions pB (partial-B), pC (partial-C), and ppD (partial-partial-D), which are all path independent; for this reason, all A, pB, pC, and ppD are reused over the entire computation.We now iterate over all paths with two nested iterations: the outer one iterates over the right cut, while the inner one iterates over the bottom cut.
2) For each of the 2 5 right paths, tensors in regions B and pD are contracted.
3) Tensors A and B are contracted onto AB; this tensor will be reused over all the inner loop iterations.
4) For each bottom path (given a right path), the tensors on region C and D are contracted.
5) C and D are contracted onto CD.
6) AB and CD are contracted onto a scalar, which is equal to this path's contribution to the amplitude.
From here, the iteration over the inner loop takes us back to step 4), for a different bottom path.After this loop is exhausted, we go back to step 2), for the next right path.After iterating over both loops, we have obtained all path contributions.
Note that there is a large amount of potential reuse of the tensors built at each step.However, taking advantage of already built tensors requires a large amount of memory in order to store all the tensors to be reused.While there is a clear trade-off between tensor reuse and memory usage, in practice we always found the reuse profitable.
Finally, the fast sampling introduced in Section IV B can be also applied here, by using a slightly different contraction than the one presented in Fig. A1.More precisely, in Fig. A2 we present the final steps of the alternative contraction.In 4) and 5) the size of the tensor pC is still small, and we focus, for this particular path, in contracting first D with AB onto ABD.This leaves six qubits free (above pC) for the computation of batches of as much as 2 6 = 64 amplitudes (or more if pC is shrunk further); for each amplitude, contracting the tensors in region C (where pC is reused) and computing the scalar ABCD is left, i.e., steps 6) and 7) of Fig. A2.Once this is done, then we go back to step 5), in order to loop over all bit-strings in the batch.After exhausting this loop, we go back to step 4) to compute the next bottom path.Note that the contraction following this procedure is dominated by the contraction of AB with D; however, the tensor ABD between A and B, and C and D, to iterate over paths between B and C, and those between A and D, lowering the qubit complexity to log 2 [2 α AB +α CD (2 α BC β BC + 2 α AD β AD )], where β BC ≡ 2 n B + 2 n C and β AD ≡ 2 n A + 2 n D .We can see in Fig. C4 that for Bristlecone-60 with depth (1+32+1) the best performance is achieved for a partition in two sub-circuits, as is the case for the rectangular grids considered in Refs.[29,39].For a square grid 8 × 8 × (1 + 32 + 1), the qubit complexity is 65, which is lower than the best complexity found in this section for Bristlecone-60 with depth (1+32+1), even though the 8 × 8 square grid has four more qubits.This suggests that hard Bristlecone sublattices are harder to simulate than square (or rectagular) grids of the same (or smaller) number of qubits.Similar arguments apply to Bristlecone-70.
ets within a node (all the Pleiades and Electra nodes are dual-socket).We enforced the rule that all the jobs running on a particular node had the same number of threads.This could lead to there being a few unused cores.Individual threads were "pinned" to run on individual cores to avoid interference.However, the pinning strategy caused the unused cores to always be the lowest numbered cores (which are all on the first socket), and so fewer jobs ran concurrently on the first socket, causing them to see less contention, and therefore slightly higher performance than jobs that ran on the second socket.Unfavorable thread counts and core counts could also lead to one job per node having it's threads split across the two sockets; this creates yet another source of anomalous timings.

Figure 2 .
Figure2.Layout of two-qubit gates and the corresponding cycle order (from 1 to 8).This layout can be tiled over 2D square grids of arbitrary size.The Bristlecone architecture is a diamond shaped subset of such a 2D grid.For our simulations, we use CZ gates as the two-qubit gate.

Figure 3 .
Figure 3. Left: 3D grid of tensors obtained by contracting 8 consecutive layers of CZ gates, including the single qubit gates.Right: example of a typical block of 8 layers of gates on a single qubit; note that the qubit shares one CZ gate with each of its four neighbors per block.

Figure 4 .
Figure 4. Contraction of the 3D grid of tensors (see Fig. 3) in the time direction to obtain a 2D grid of tensors.

Figure 5 .
Figure 5. Top: sketch of the contraction procedure followed to obtain one path of one amplitude of the Bristlecone-70 with depth (1+32+1).We first make four cuts of dimension 2 4 each, leaving us with 216  paths; for each path, we contract all tensors on region A, and all tensors on region B; then tensors A and B are contracted together; finally, tensor C (which is independent of chosen path, and can in addition be computed very efficiently) is contracted with AB, which obtains the contribution of this path to this particular amplitude.Bottom: corresponding regions A, B, and C for the Bristlecone-24, -48, -60, and -64.Note that both the Bristlecone-48 and the Bristlecone-64 need 2 cuts of dimension 2 4 each, while the Bristlecone-60 needs three of such cuts, making it a factor of 2 4 times harder than Bristlecone-64, even though it has 4 qubits less.

Figure 6 .
Figure 6.Single thread computation times on Broadwell nodes of Pleiades for an arbitrary permutation of the indexes of a tensor of single precision complex entries (and 25 indexes of dimension 2 each) following an optimized, naive implementation of the reordering (green), an arbitrary Lγ move (red), and an arbitrary Rγ move (blue).The optimized, naive approach performs comparably to python's numpy implementation of the reordering.Note that, for a wide range of γ, left and right moves are very efficient.Left inset: zoomed version of the main plot.For γ ∈ [5, 10], both right and left moves are efficient.Right inset: computation times for L5 and R10 (used in practice) as a function of the number of threads used.

Figure 8 .
Figure 8. Top and middle: Pearson coefficient as a function of Hamming distance for pairs generated at random on subsystem C of sub-lattice Bristlecone-24, for samples of size 1000 of random strings on subsystem A + B. All pairs between strings of a set of 32 random strings on subsystem C are considered.The average and standard deviation (error bars) for eachHamming distance is plotted with a solid line.We can see that, for depth (1+24+1) (top), the system has not thermalized, and the correlation decreases with Hamming distance; for depth (1+32+1) (middle), correlations approach zero, and become Hamming distance independent (on subsystem C).Bottom: we compare the distribution of Pearson coefficients, obtained as described above, to the distribution of Pearson coefficients obtained (numerically) from probability amplitudes with the same sample size as in the simulations above, drawn from a Porter-Thomas distribution.At large enough depth the system is expected to thermalize and the two distributions match, meaning that the probability amplitudes obtained by varying bit-strings only on subsystem C are uncorrelated.

Figure 9 .
Figure 9. Top: Distribution of the runtimes for a single instance of each of the six simulations [Run1-6] run on different node architectures.An instance refers to a certain number of paths for a particular number of amplitudes (output bit-strings); see Section III A 1 and TableIfor more details.For clarity, all distributions have been normalized so that their maxima are all at the same height.The nodes used on NASA HPC clusters Pleiades and Electra are: Broadwell (bro), Intel Xeon E5-2680v4 ; Haswell (has), Intel Xeon E5-2680v3; Ivy Bridge (ivy), Intel Xeon E5-2680v2; Sandy Bridge (san), Intel Xeon E5-2670; Skylake (sky), 2 × 20-core Intel Xeon Gold 6148 processors per node.Bottom: Same distribution as above, but the runtimes are multiplied by the number of cores per job on a single node, to provide a fairer comparison.As one can see, Skylake nodes provide generally the best performance, and belong to Electra, an energy efficient HPC cluster.The split of runtimes into groups is discussed in[47].

Figure A1 .
Figure A1.Sketch of the contraction of the 7 × 7 × (1 + 40 + 1) tensor network with two cuts.The names of the tensors at key steps shown in the contraction are referred to in Section A.

Figure A2 .
Figure A2.Alternative contraction for the use of fast sampling (see Section IV B).

Table IV .
Estimated runtimes and energy cost for the computation of 10 6 amplitudes with fidelity close to 0.5% on NASA HPC Pleiades and Electra systems.Note that for the 7 × 7 × (1 + 40 + 1) and 8 × 8 × (1 + 40 + 1) grids, jobs do not fit in Sandy Bridge nodes, due to their memory requirements; for that reason, the portion of Pleiades with Sandy Bridge nodes is not considered, and the energy cost estimations of these two cases do not include those nodes.
fails to include the final 5, since they use 625 nodes of 16 cores (32 vCPUs or hyper-threads) each.For ours, P = 2304 × 40 = 92,160 on the Skylake nodes of Electra (note that we consider 40 cores per node, even though we use only 39 in practice for the 7 × 7 × (1 + 40 + 1) 17= 131,072, while in our simulations on the Skylake nodes of Electra we used P = 2304 × 40 = 92,160 cores (note that we consider 40 cores per node, even though we use only 39 in practice for the 7 × 7 × (1 + 40 + 1) simulations and 36 in the 8 × 8 × (1 + 40 + 40) simulations; this is due to the ability of modern Intel processors to "upclock" their CPUs in favorable conditions (known as Dynamic Frequency Scaling), thus achieving a performance similar to the case where there are no idle cores.