Scaling quantum approximate optimization on near-term hardware

The quantum approximate optimization algorithm (QAOA) is an approach for near-term quantum computers to potentially demonstrate computational advantage in solving combinatorial optimization problems. However, the viability of the QAOA depends on how its performance and resource requirements scale with problem size and complexity for realistic hardware implementations. Here, we quantify scaling of the expected resource requirements by synthesizing optimized circuits for hardware architectures with varying levels of connectivity. Assuming noisy gate operations, we estimate the number of measurements needed to sample the output of the idealized QAOA circuit with high probability. We show that the number of measurements, and hence the total time to solution, grows exponentially with problem size and problem graph degree, as well as with the depth of the QAOA ansatz, the gate infidelities, and the inverse hardware graph degree. These problems may be alleviated by increasing hardware connectivity or by recently proposed modifications to the QAOA that achieve higher performance with fewer circuit layers.


INTRODUCTION
Combinatorial optimization problems are commonly viewed as a potential application for near-term quantum computers to obtain a computational advantage over conventional methods [1]. A common approach to solving these problems uses the quantum approximate optimization algorithm (QAOA) [2], which begins with a "cost" Hamiltonian, typically defined as

C = Σ_{i<j} J_{i,j} Z_i Z_j + Σ_i h_i Z_i,   (1)

with real coefficients J_{i,j} and h_i that encode a quadratic unconstrained binary optimization problem in the eigenspectrum of C [3]. The QAOA prepares a quantum state |γ, β⟩ on n qubits using p layers of unitary operators, where each layer alternates between Hamiltonian evolution under C and under a "mixing" Hamiltonian B = Σ_{i=1}^n X_i composed of independent Pauli-X operators,

|γ, β⟩ = e^{-iβ_p B} e^{-iγ_p C} ⋯ e^{-iβ_1 B} e^{-iγ_1 C} |+⟩^{⊗n}.   (2)

The state is then measured to yield an n-bit binary string z as a candidate solution to the problem. The angles β = (β_1, ..., β_p) and γ = (γ_1, ..., γ_p) are variational parameters chosen to minimize or maximize the expectation value ⟨C⟩ = ⟨γ, β|C|γ, β⟩, depending on whether the optimal solution of C is the minimum or maximum value, respectively. Farhi et al. have argued that the QAOA recovers the ground state of C as p → ∞ [2], but the primary interest in the QAOA is in reaching high performance with a modest number of layers p that could realistically be implemented on a quantum computer. A significant body of theoretical [4-8], computational [9-13], and experimental [14,15] research has focused on understanding QAOA performance at p ≈ 1, mostly on the MaxCut problem with a small number of qubits n, but also for other types of problems [16-18]. These studies have shown some promising results, for example, with the QAOA outperforming the conventional lower bound of the Goemans-Williamson algorithm for MaxCut on some small instances [19,20]. There have also been a variety of proposed modifications to the algorithm to improve performance [21-28] and to solve optimization problems with constraints [29-31]. The results from these and other studies have encouraged research into extending the QAOA to larger and more complex problems.
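As a concrete illustration of Eqs. (1)-(2), the following minimal sketch builds the cost and mixing Hamiltonians for a hypothetical three-qubit instance and evaluates ⟨C⟩ by dense statevector simulation. The couplings, fields, angles, and helper names are arbitrary placeholders of our own choosing, not part of the original study.

import numpy as np
from functools import reduce
from scipy.linalg import expm

# Pauli matrices and a helper to place a single-qubit operator on qubit i
I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.diag([1.0, -1.0])

def kron_at(op, i, n):
    """Embed single-qubit operator `op` on qubit i of an n-qubit register."""
    ops = [I2] * n
    ops[i] = op
    return reduce(np.kron, ops)

def cost_hamiltonian(J, h, n):
    """C = sum_{i<j} J_ij Z_i Z_j + sum_i h_i Z_i, Eq. (1)."""
    C = np.zeros((2**n, 2**n))
    for (i, j), Jij in J.items():
        C += Jij * kron_at(Z, i, n) @ kron_at(Z, j, n)
    for i, hi in enumerate(h):
        C += hi * kron_at(Z, i, n)
    return C

def qaoa_state(C, B, gammas, betas, n):
    """|gamma, beta> of Eq. (2): alternating evolution under C and B from |+>^n."""
    psi = np.ones(2**n) / np.sqrt(2**n)        # |+>^{⊗n}
    for g, b in zip(gammas, betas):
        psi = expm(-1j * g * C) @ psi           # cost layer
        psi = expm(-1j * b * B) @ psi           # mixing layer
    return psi

# Hypothetical 3-qubit triangle instance with p = 1 and placeholder angles
n = 3
J = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.0}
h = [0.0, 0.0, 0.0]
C = cost_hamiltonian(J, h, n)
B = sum(kron_at(X, i, n) for i in range(n))
psi = qaoa_state(C, B, gammas=[0.4], betas=[0.3], n=n)
print("<C> =", np.real(psi.conj() @ C @ psi))

Dense simulation is only practical for small n; it is used here purely to make the structure of the ansatz explicit.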
In contrast to the QAOA studies focused on a small number of variables n, conventional computational methods are capable of handling problem instances with hundreds of variables or more. To assess the usefulness of the QAOA, it will be necessary to scale to larger and more complex instances where it can be directly compared against these methods on practically relevant problems. A recent study suggests that hundreds of qubits are needed to compete in time-to-solution [32], while the theoretical and experimental performance in this context remain important open questions. Theoretical considerations indicate that the number of layers p will need to scale at least as log(n) in some instances, as the locality of the ansatz limits the ability to build the global correlations that are needed for globally optimal solutions [33,34]. Classical algorithms have also been developed that outperform the QAOA at low p [35,36], further suggesting large p may be necessary to compete with conventional methods. To optimize parameters at large n and p, a variety of computational [37,38] and theoretical [39-44] approaches have been developed, and in some cases the theoretical performance has been characterized. With parameter setting strategies at hand, what remains to be seen is how the QAOA will perform in experimental implementations. The prospect of experimentally implementing the QAOA at large n and p raises questions about how quantum computing resources will scale with problem size and complexity, and how noise will influence the behavior of the algorithm.
Here we report on the scaling of resources needed by the QAOA on noisy intermediate-scale quantum (NISQ) devices. We show how features of the combinatorial problem and the target hardware influence the total number of gates and measurements required to reach a specified threshold of accuracy. First, we consider problem features such as the average degree d_G of the graph defining the problem instance, where d_G is related to the number of non-zero terms in the quadratic unconstrained binary optimization problem. While much of the QAOA literature has focused on problems with small d_G, larger d_G arises naturally in constrained combinatorial optimization problems [45,46]. In addition to d_G, the problem size n and the number of QAOA layers p also contribute to the gate counts and hence the resources required to implement the algorithm. It is furthermore important to consider the constraints that arise in current NISQ hardware due to the limited connectivity of the qubit register, which can require costly SWAP gates to transport logical qubits. We show that the interplay between these logical requirements and hardware constraints generates steep scaling in the resources required for high-fidelity implementation of the QAOA as n, p, and d_G increase.
Our approach synthesizes optimized circuit representations of the QAOA for varying problem sets targeting constrained, noisy hardware. We optimize both the number of gates and the overall performance through judicious placement of the logical qubits and injected SWAP gates. Placement and routing are difficult optimization problems, and it is not clear a priori how an ideal QAOA instance expressed as Eq. (2) will map to a given hardware [47-50]. To understand the role of hardware connectivity, we consider hardware register architectures with average hardware degree d_H ranging from the sparsely connected heavy-hexagon lattice in Fig. 1(a) to d_H = 6 for the triangular lattice in Fig. 1(d). We quantify the SWAP gate counts with respect to d_H, d_G, n, and p, and we fit scaling relations to these results. Resource counts also give insight into the scalability of the QAOA in the presence of noise. We define a simple noise model for a quantum state traversing a circuit with gate counts estimated from our resource analysis and use this to quantify the reliability of the QAOA as it scales to larger and more complex problems. Our analysis complements previous theoretical results describing how noise influences the QAOA cost expectation value, trainability, and eigenvectors of the density operator [51-54]. We quantify the number of measurements M that are needed to obtain a single result from the idealized state that would be produced by a noiseless version of the circuit. This characterizes the reliability of the algorithm and the expected time-to-solution T, assuming T ∝ M. The results assess the scalability of the QAOA on noisy near-term hardware and the expected influence of d_H, d_G, n, and p.

Mapping to Hardware
We express the QAOA unitary operators of Eq. (2) in terms of a hardware gate set of Hadamards H, Z-rotations R(θ), and controlled-NOT CNOT, as described in Methods. The gate-to-unitary operator correspondences given there provide the minimal numbers of each type of gate that must be implemented in the algorithm, for example, on fully connected hardware.
It is useful to classify problem instances C in terms of their circuit structure. We define problem graphs G with a vertex for each qubit i and an edge {i, j} for each nonzero J_{i,j} in Eq. (1). Each edge {i, j} requires a set of two-qubit gates CNOT_{i,j} R_j(2J_{i,j}γ_l) CNOT_{i,j}, and the total set of edges defines all two-qubit gates that are needed on fully connected hardware. The specific values of the parameters J_{i,j} ≠ 0, h_i, γ_l, and β_l enter only as rotation angles in the circuit, hence all problem instances with the same problem graph have the same circuits up to choices of these angles. When h_i = 0, a single-qubit gate can be further removed from the circuit, but this does not affect the two-qubit gate structure. We consider all non-isomorphic connected problem graphs with n = 7 qubits to determine how the circuits scale with the average problem graph degree d_G; to determine scaling with the number of qubits we assess 3-regular problem graphs with d_G = 3 at varying n. On fully connected hardware, the numbers of gates of each type are

N_H = (2p + 1) n,   (3)
N_R = p (n d_G/2 + η + n),   (4)
N^fc_CNOT = p n d_G − N_0,   (5)

where η is the number of non-zero h_i in Eq. (1) and N_0 ≤ ⌊n/2⌋ is an instance-dependent number of CNOT gates that can be removed from the first layer of the circuit as they do not affect the initial state [55]; see Supplemental Information Sec. I for details. However, on hardware with limited connectivity, it is often the case that some of the two-qubit gates cannot be implemented by any initial placement of the logical qubits onto the hardware register. For example, a nonplanar problem graph cannot be mapped onto any of the planar registers in Fig. 1. It is therefore necessary to use SWAP gates to shuttle logical qubits around the register during execution of the circuit, to realize connections that are not available to the initial qubit placement. There are many potential circuits that can be created, and these can result in different total numbers of SWAP gates, with up to n² SWAP gates in n circuit layers in the worst case [56,57]. An ideal circuit will minimize the number of gates or the circuit depth, to reduce the negative impacts of noise in the circuit.
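The fully connected gate counts of Eqs. (3)-(5) can be tallied directly from the problem parameters. The short sketch below is an illustrative implementation consistent with the per-layer decomposition described in Methods; the function name and the example values are our own choices for demonstration.

def fully_connected_gate_counts(n, p, d_G, eta, N0=0):
    """Gate counts for a depth-p QAOA circuit on fully connected hardware,
    following Eqs. (3)-(5): n qubits, average problem graph degree d_G,
    eta nonzero linear terms h_i, and N0 first-layer CNOT cancellations."""
    n_edges = n * d_G / 2                  # number of edges in the problem graph
    N_H = (2 * p + 1) * n                  # initial layer plus 2 H per qubit per mixer
    N_R = p * (n_edges + eta + n)          # edge, field, and mixer rotations
    N_CNOT = p * n * d_G - N0              # two CNOTs per edge per layer
    return int(N_H), int(N_R), int(N_CNOT)

# Example: 3-regular graph, n = 60 qubits, p = 20 layers, all h_i nonzero (eta = n)
print(fully_connected_gate_counts(n=60, p=20, d_G=3, eta=60))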
We compute circuits that minimize CNOT gate counts for each register architecture in Fig. 1 using an optimization routine. We optimize single layers of the QAOA algorithm, as additional layers have the same circuit structure apart from differences in the qubit locations due to SWAP gates. These differences can be accounted for by mirroring the circuit implementation of exp(−iγ_l C) in subsequent layers, so that qubits move back and forth between locations from layer to layer. For an n-qubit problem instance, we use register grids of sizes just larger than √n × √n, as we found that further increasing the grid size tended to result in larger optimized circuits. Our optimization procedure uses two nested loops. The inner loop calls the circuit mapping algorithm SABRE [47], which generates a set of random placements of the logical qubits onto the hardware register, then optimizes each placement, ultimately returning the final optimized circuit with the smallest depth. For our circuits, we have found that SABRE sometimes yields sub-optimal placements, as it does not recognize the commutativity of the terms exp(−iγ_l J_{i,j} Z_i Z_j) in Eq. (2), but instead tries to implement these in the order it is given. We therefore define an outer loop that randomly shuffles these commuting terms, to optimize over varying term orderings. This outer loop decreases the number of gates in our optimized circuits compared to a more basic implementation with SABRE only. For each problem graph, we take our final result from these nested loops as the circuit with the fewest CNOT gates. The total number of CNOT gates on hardware with limited connectivity, with N_SWAP SWAP gates, is

N_CNOT = N^fc_CNOT + σ N_SWAP,   (6)

where σ quantifies the average increase in CNOT gates per SWAP gate, beyond the N^fc_CNOT CNOT gates that are needed on fully connected hardware. Each SWAP gate is defined as a product of three CNOT gates, so σ = 3 in the worst case. In better cases, a SWAP_{ij} gate is placed adjacent to a CNOT_{ij} gate in the circuit and CNOT_{ij} CNOT_{ij} = 1 is used to remove a pair of gates. This gives 1 ≤ σ ≤ 3 in our accounting. Further details of the implementation, convergence behavior, and performance can be found in Supplemental Information Sec. I.
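The nested-loop optimization described above was implemented with XACC, Enfield, and SABRE. As a rough illustration of the same structure, the sketch below uses Qiskit's SABRE layout and routing as a stand-in, with an outer loop that shuffles the commuting edge terms and an inner loop over random initial placements; the edge list, coupling map, and loop counts are illustrative, not the authors' code.

import random
from qiskit import QuantumCircuit, transpile
from qiskit.transpiler import CouplingMap

def cost_layer(n, edges, gamma=0.1):
    """One exp(-i*gamma*C) layer: a CNOT-RZ-CNOT trio for each problem-graph edge."""
    qc = QuantumCircuit(n)
    for i, j in edges:
        qc.cx(i, j)
        qc.rz(2 * gamma, j)      # R_j(2*J_ij*gamma) with J_ij = 1 for illustration
        qc.cx(i, j)
    return qc

def optimize_mapping(n, edges, coupling, shuffles=50, seeds=40):
    """Outer loop shuffles the commuting edge terms; inner loop retries SABRE
    placement and routing. Returns the routed circuit with the fewest CNOT gates."""
    cmap = CouplingMap(coupling + [[j, i] for i, j in coupling])   # symmetric map
    best, best_cx = None, None
    for _ in range(shuffles):
        random.shuffle(edges)                       # reorder commuting ZZ terms
        qc = cost_layer(n, edges)
        for seed in range(seeds):                   # random initial placements
            routed = transpile(qc, coupling_map=cmap,
                               basis_gates=["h", "rz", "cx"],
                               layout_method="sabre", routing_method="sabre",
                               seed_transpiler=seed)
            n_cx = routed.count_ops().get("cx", 0)
            if best_cx is None or n_cx < best_cx:
                best, best_cx = routed, n_cx
    return best, best_cx

# Example: a 5-qubit ring problem graph routed onto a linear-chain register
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
chain = [[i, i + 1] for i in range(4)]
circuit, n_cx = optimize_mapping(5, ring, chain, shuffles=5, seeds=5)
print(n_cx)

The default shuffle and seed counts mirror the values used in the Supplemental Information, but any mapper exposing SABRE-style placement and routing could be substituted.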

Scaling with Problem Size and Degree
We next mapped circuits for each of the 853 non-isomorphic connected problem graphs at n = 7 [58]. The results in Fig. 2 show how the number of SWAP gates N_SWAP scales with the average problem graph degree d_G at this n across our hardwares with varying d_H. As d_G increases, so does the number of edges in the graph, and hence the number of two-qubit gates in each layer of the QAOA algorithm. Greater numbers of SWAP gates are needed on average to accommodate these two-qubit gates. Similarly, as the hardware degree d_H increases, a greater number of two-qubit gates are available natively on the hardware, so fewer SWAP gates are needed. The mean numbers of SWAP gates at each d_G and d_H are fit by an empirical linear relation N_SWAP(d_G, d_H) ∼ d_G/d_H, with fit parameters in the figure caption and a root-mean-square error (RMSE) of 0.58 SWAP gates. The small error indicates the empirical relation is successful in providing a unified account of the N_SWAP scaling across problem graphs and hardware architectures at this n.

Next we consider how the number of SWAP gates scales with the size of the problem n. We considered sets of 3-regular graphs with 108 graph instances each at n = 20, 40, and 60 qubits. The 3-regular problem graphs have three non-zero J_{i,j} terms for each qubit i in Eq. (1), and this standardizes d_G = 3 as we scale to larger sizes. Three-regular graphs have also been studied with considerable interest in the QAOA MaxCut literature [2,4,9,10,32,39,41] and in a previous experimental demonstration of the QAOA [14]. They are appealing targets for near-term hardware since most graphs at the same n have higher average degree d_G and are therefore expected to require more noisy two-qubit gates, due to both the increase in the minimal number of CNOT gates in Eq. (5) and the expected increase in SWAP gates following the analysis of Fig. 2.
We computed optimized circuit mappings for these 3-regular instances to obtain the key result pictured in Fig. 3, which relates the number of SWAP gates to the average hardware degree d_H as the problem size n increases. We fit the data with an empirical curve that is based on counting the number of two-qubit terms that cannot be implemented by the initial qubit placement and assuming the number of SWAP gates needed to bring the qubits together for these edge terms increases on average in proportion to the length and width of the hardware grid; see Methods for details. This leads to the empirical relation shown by the solid line in the figure,

N_SWAP(n, d_H) = μ (n − n_0) √n / d_H,   (7)

where μ = 0.73 ± 0.02 is a fit parameter computed through non-linear least squares and ±0.02 is the asymptotic standard error. Here n_0 sets the zero of N_SWAP and represents the maximum problem graph size at which all graphs can be mapped to hardware; for example, for fully connected hardware n_0 = n and N_SWAP(n, d_H) = 0. For the triangle lattice in Fig. 1(d), all 3-vertex problem graphs can be mapped directly onto the lattice but the 4-vertex complete graph cannot be, so n_0 = 3. For the other hardware lattices, n_0 = 2.
We assess the performance of the empirical formula using the RMSE between the average N_SWAP and the empirical N_SWAP(n, d_H). Across all results in Fig. 3, the RMSE is 7.2 SWAP gates. The RMSE is strongly influenced by the outliers for the heavy-hexagon array at n = 40 and n = 60, where the empirical formula is up to 16% smaller than the results. These deviations may be related to the bimodal degree structure of the heavy-hexagon array in Fig. 1(a), which has a mixture of register elements of degrees two and three, unlike the other constant-degree hardwares. Excluding the results for the heavy-hexagon at n = 40 and n = 60 decreases the RMSE to 2.7 SWAP gates. We conclude that the empirical formula gives a good fit to the majority of the data in the figure, apart from the heavy-hexagon at large n, where the formula gives a looser bound to the observed N_SWAP.
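Assuming the form of Eq. (7) as written above, the per-layer SWAP estimate is a one-line function. The sketch below evaluates it for a square-lattice register (d_H = 4, n_0 = 2) at the problem sizes studied here; the function name is our own.

import numpy as np

def n_swap_per_layer(n, d_H, n0, mu=0.73):
    """Empirical per-layer SWAP count of Eq. (7): mu * (n - n0) * sqrt(n) / d_H."""
    return mu * (n - n0) * np.sqrt(n) / d_H

# Predicted per-layer SWAP counts for 3-regular graphs on the square lattice
for n in (20, 40, 60):
    print(n, round(n_swap_per_layer(n, d_H=4, n0=2), 1))

Given observed SWAP counts from optimized mappings, μ could be refit by non-linear least squares, for example with scipy.optimize.curve_fit.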

Noisy Architecture Model and Measurement Count Scaling
We use a simple noise model for our circuits to assess how noise influences the scalability of the QAOA, in terms of the number of measurements M that are needed from a noisy circuit to obtain a single result from the intended noiseless quantum state distribution. This quantifies the reliability of a noisy QAOA circuit in producing the intended output and also characterizes the scaling in the time-to-solution T assuming T ∝ M .
An instance of a QAOA circuit is expressed in terms of a series of gates with ideal unitary evolution operators U_0, U_1, ..., with U_α ∈ {H, R, CNOT} the unitary for the αth gate, acting on an initial state ρ_0 = (|0⟩⟨0|)^{⊗n}. The noisy state produced by the αth gate is expressed using a quantum channel as

ρ_α = (1 − ǫ_α) U_α ρ_{α−1} U_α† + ǫ_α Σ_k K_{α,k} ρ_{α−1} K_{α,k}†,   (8)

where the Kraus operators K_{α,k} give noisy deviations from the intended evolution with probability ǫ_α. The final state of the circuit is [54]

ρ = F_0 ρ_ideal + (1 − F_0) ρ_noise,   (9)

where ρ_ideal = |γ, β⟩⟨γ, β| is the density operator for the intended pure state |γ, β⟩, ρ_noise is a density operator composed of all terms with at least one Kraus operator, and F_0 = ∏_α (1 − ǫ_α) is a lower bound to the state preparation fidelity F = ⟨γ, β|ρ|γ, β⟩ ≥ F_0, with equality when Tr(ρ_ideal ρ_noise) = 0. If we assume constant error rates ǫ_CNOT, ǫ_H, and ǫ_R for each CNOT, H, and R gate, respectively, then

F_0 = (1 − ǫ_CNOT)^{N_CNOT} (1 − ǫ_H)^{N_H} (1 − ǫ_R)^{N_R},   (10)

where the N are the corresponding gate counts.

A noisy implementation of the QAOA will be effective when it can produce measurement results from the intended state distribution ρ_ideal. In the absence of readout errors, a measurement projects the total state ρ onto a computational basis state |z⟩ that is the result of the measurement, with probability P(z) = ⟨z|ρ|z⟩ = F_0 P_ideal(z) + (1 − F_0) P_noise(z). This has a lower bound P(z) ≥ F_0 P_ideal(z) independent of the specific noise process, apart from the values of the error rates ǫ_α that determine F_0. Summed over all |z⟩ in the support S of ρ_ideal, the total probability P = Σ_{|z⟩∈S} P(z) to obtain any result from the ideal state distribution is

P ≥ F_0.   (11)

We use this probability inequality to bound the number of measurements M = log(1 − 𝒫)/log(1 − P) that are needed to obtain a single sample from the distribution of the intended state with probability 𝒫 [16,20],

M ≤ log(1 − 𝒫)/log(1 − F_0).   (12)

It is useful to consider a few examples. In a theoretical best case of the QAOA, the intended state is a single computational basis state |γ, β⟩ = |z_opt⟩ that gives the optimal cost value C(z_opt) = C_opt ∈ ℝ. If we assume that noise does not contribute significantly to the probability for |z_opt⟩, then P ≈ F_0 and M is close to the upper bound. In more generic cases of interest, the intended state has non-zero probability for a variety of approximately optimal states and the goal is to measure any one of these states. In this case M may be smaller than the upper bound, and potentially much smaller if the probability to measure approximately optimal states is significant for the ρ_noise component. Smaller upper bounds for M might then be obtained using information about the noise process and its expected influence in ρ_noise. However, without detailed information about a specific state and noise process we do not have a way to decrease M below the upper bound, which serves as a generic guide for any possible intended QAOA state and noisy evolution of the type in Eqs. (8)-(9).
We assessed the scalability of the number of measurement samples by computing the upper bound for M for 3-regular graphs at varying sizes n and at p = 20 QAOA layers, with a probability 𝒫 = 0.99 to sample from the intended state distribution. We consider 3-regular problem graph instances with gate counts N_H, N_R, and N_CNOT in Eqs. (3), (4), and (6), respectively, assuming all h_i ≠ 0 in Eq. (1) so that η = n. We use N_SWAP computed from the empirical formula of Eq. (7) for each hardware architecture in Fig. 1, σ = 3 as the number of additional CNOT gates per SWAP gate in Eq. (6), in accord with our results at large n from Supplemental Information Sec. I, and we approximate N_0 = 0 since N_0 ≪ N_CNOT when p = 20. The F_0 in M is then computed from Eq. (10) with assumed error rates of ǫ_CNOT = 5 × 10⁻⁵ and ǫ_R = ǫ_H = ǫ_CNOT/10. For comparison, recent advances in transmon qubits have achieved two-qubit gate error rates of 6.4 × 10⁻³ and single-qubit error rates of 3.8 × 10⁻⁴ [59].

Figure 4 shows how this M scales with problem size n. The number of measurements increases exponentially with n at a rate that depends on the hardware degree d_H. The variations in hardware themselves give an exponential divergence in M as the reciprocal hardware degree 1/d_H increases and the hardware becomes less connected (Fig. 4 inset), due to the empirical dependence N_SWAP ∼ 1/d_H from Eq. (7). The hardware dependence is significant at the large n that are required for practical problems. For example, at n = 500 (vertical dotted line), the number of measurement samples is approximately 20 for fully connected hardware but increases by four orders of magnitude going to the least connected hardware (heavy-hexagon, Fig. 1(a)). Here n = 500 exemplifies a nontrivial problem size but is otherwise arbitrary; similar scaling behavior is observed for other large n. Curves similar to Fig. 4 can also be computed at fixed n as the error rates ǫ_α, the number of QAOA layers p, or the problem graph degree d_G increase; see Supplemental Information Sec. III for details.
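The measurement bound can be strung together from the pieces above. The sketch below combines the gate counts of Eqs. (3)-(5), the empirical SWAP count of Eq. (7) (assumed here to be incurred once per layer), the fidelity bound of Eq. (10), and the upper bound of Eq. (12), using the parameter choices stated in the text (ǫ_CNOT = 5 × 10⁻⁵, σ = 3, N_0 ≈ 0, η = n). The function and its defaults are illustrative, not the authors' code.

import numpy as np

def measurement_bound(n, p, d_G, d_H=None, n0=2, eps_cnot=5e-5, sigma=3, target=0.99):
    """Upper bound on the measurements M needed to draw one sample from the ideal
    QAOA distribution with probability `target`, combining the gate counts, the
    per-layer empirical SWAP count, and the fidelity lower bound F0."""
    eps_1q = eps_cnot / 10                       # eps_H = eps_R = eps_CNOT / 10
    N_H = (2 * p + 1) * n
    N_R = p * (n * d_G / 2 + n + n)              # eta = n (all h_i nonzero)
    swaps = 0 if d_H is None else p * 0.73 * (n - n0) * np.sqrt(n) / d_H
    N_CNOT = p * n * d_G + sigma * swaps         # N_0 approximated as 0
    F0 = (1 - eps_cnot)**N_CNOT * (1 - eps_1q)**(N_H + N_R)
    return np.log(1 - target) / np.log(1 - F0)

# Example: 3-regular problems with n = 500 qubits and p = 20 layers
print("fully connected:", measurement_bound(500, 20, 3))        # d_H=None: no SWAPs
print("square lattice :", measurement_bound(500, 20, 3, d_H=4))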

DISCUSSION
Obtaining a quantum computational advantage with the QAOA is expected to require hundreds of qubits or more to compete against conventional methods on practically relevant problems [18,32]. As the QAOA scales to larger and more complex problems, the number of gates to implement the algorithm on fully connected hardware increases with the problem graph degree d_G and number of qubits n. For sparsely connected hardware, additional SWAP gates are needed. We computed optimized circuits to determine how the number of SWAP gates N_SWAP scales with n and d_G on a variety of real and hypothetical hardware architectures with varying levels of connectivity in terms of the hardware degree d_H. The reciprocal hardware degree 1/d_H, average problem graph degree d_G, and number of qubits n were each found to be important scaling factors in the empirical behavior of N_SWAP. Using a simple noise model with gate counts extrapolated from our circuits, we computed the number of measurement samples M from a noisy circuit that are needed to obtain a single measurement from the distribution of an idealized noiseless version of the state with probability 𝒫. This is a measure of the reliability of a noisy circuit in producing the intended outcome. We argued that M increases exponentially with n, d_G, 1/d_H, the number of QAOA layers p, and the gate error rates ǫ_α. Assuming that M is proportional to the time to solution, this corresponds to an exponential time complexity in each of these factors.
We considered n = 500 as an example of a nontrivial problem size to compare the number of measurements across different hardwares. Our results show that the number of measurement samples is 2 × 10³ ≤ M ≤ 5 × 10⁵ at this n and p = 20 for the considered error rates and hardwares. These numbers of measurements should not be difficult to obtain from a quantum computer. However, our parameter choices and problem sets were optimistic in some respects. The assumed error rates were about two orders of magnitude below current state-of-the-art devices [59], and larger error rates exponentially increase the number of measurements. For example, doubling the error rates so that ǫ_CNOT = 10⁻⁴ gives 5 × 10⁵ ≤ M ≤ 5 × 10¹⁰ for our hardwares. We also assumed 3-regular problem graphs, which have been studied with great interest in the QAOA literature. However, many practically relevant problems use denser problem graphs, for example in constrained optimization problems [18,45,46]. For denser graphs the average degree can scale as n, and changes in degree can significantly affect M. For example, using our approach and parameter choices for a 500-qubit problem graph with average degree d_G = 25, we obtain M = 3 × 10⁶ on fully connected hardware. For the sparsely connected hardware we consider, we do not have a precise scaling relation for N_SWAP on d_G = 25 graphs, but if we optimistically use the same relationship N_SWAP(n, d_H) we found for 3-regular graphs, we obtain 2 × 10⁸ ≤ M ≤ 5 × 10¹⁰ at d_G = 25. This ignores any dependence of N_SWAP on d_G, which would be significant if our small-n observation N_SWAP ∼ d_G also holds at large n. A final note is that if more than one measurement is needed from the state with high probability, then this will introduce an additional scaling beyond the M presented here. The numbers of measurements quickly become greater than what can realistically be expected from near-term quantum computers.
We expect the measurement scaling will significantly inhibit the ability to implement the QAOA at scales relevant for quantum advantage. When the QAOA parameters are optimized using measurements from a quantum computer, this optimization will also be greatly inhibited. Parameter optimization has been addressed in some instances using theoretical approaches [9,19,20,37-44], though for generic instances it is unclear if such approaches can be applied. However, even with a good set of parameters the circuit must still be run to obtain the final bitstring solution to the problem, and in our model this requires a number of measurements that quickly becomes prohibitive at scales relevant for quantum advantage. Straightforward attempts to scale the QAOA will face a significant barrier if these scaling problems are not addressed.
Our expectations for performance are based on a general upper bound that is saturated when the noisy and ideal components of the total circuit density operator give distinct measurement results in the computational basis. A vanishing overlap in measurement results is expected when the ideal QAOA circuit prepares a computational basis state, while intermediate superposition states may have non-negligible overlap with the noisy subspace. Further analysis will require details from hardware-specific noise models to determine more precise estimates for how such errors influence M. In addition, there are methods to overcome the measurement count limitations. One approach is to significantly increase hardware connectivity or modify the gate set, for example, using ion-trap quantum computers with globally-entangling Mølmer-Sørensen gates [60] or Rydberg atoms that naturally enforce constraints in some instances of the QAOA [61]. Another approach is to modify the QAOA ansatz. This includes introducing additional parameters within layers of the QAOA [21], modifying the structure of the ansatz [22-25], and modifying the cost function [27], objective function [28], and circuit structure [26]. Such technological and algorithmic advances are likely necessary to reduce the numbers of layers or gates, and hence the accumulated noise, as the QAOA scales to larger sizes.

METHODS
We generated circuits using the XACC quantum programming framework [62,63] to map the unitary quantum operators of Eq. (2) to a gate set of Hadamards H, Z-rotations R(θ) = exp(−i(θ/2)Z), and controlled-NOT CNOT gates. To map these circuits to hardware with limited connectivity, we used the Enfield software library [48] and SABRE algorithm [47] implemented within XACC. Details of the implementation, convergence behavior, and comparison with a lower bound for N_SWAP at small n are described in the Supplemental Information Sec. I.
In terms of our gate set, the unitary operators in Eq. (2) are

exp(−iγ_l C) = ∏_{{i,j}∈G} CNOT_{i,j} R_j(2J_{i,j}γ_l) CNOT_{i,j} ∏_i R_i(2h_i γ_l),
exp(−iβ_l B) = ∏_{i=1}^n H_i R_i(2β_l) H_i,

where the first product runs over the edges {i, j} of the problem graph G.
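These identities are easy to verify numerically; the minimal check below confirms the CNOT-R-CNOT and H-R-H decompositions against direct matrix exponentials for arbitrary test angles.

import numpy as np
from scipy.linalg import expm

# Single-qubit gates and the CNOT with control on qubit 0, target on qubit 1
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
Z = np.diag([1.0, -1.0])
X = np.array([[0.0, 1.0], [1.0, 0.0]])
CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=float)

def R(theta):
    """Z rotation R(theta) = exp(-i theta Z / 2)."""
    return expm(-1j * theta / 2 * Z)

gamma, beta, J = 0.37, 0.81, 1.3            # arbitrary test angles and coupling

# exp(-i gamma J Z0 Z1) = CNOT (I ⊗ R(2 J gamma)) CNOT
lhs = expm(-1j * gamma * J * np.kron(Z, Z))
rhs = CNOT @ np.kron(np.eye(2), R(2 * J * gamma)) @ CNOT
print(np.allclose(lhs, rhs))                # True

# exp(-i beta X) = H R(2 beta) H
print(np.allclose(expm(-1j * beta * X), H @ R(2 * beta) @ H))   # True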

Empirical Formula for 3-regular Graphs
We construct the empirical curve N_SWAP(n, d_H) in Eq. (7) by considering how many two-qubit gates cannot be implemented by the initial mapping of qubits onto the register, along with the average expected behavior for how many SWAP gates are needed to bring qubits together for each of these gates. We begin by separating the edge terms in a mapped problem graph instance into edges s = {s_1, s_2} that are "satisfied" by the initial placement of qubits on the register, in the sense that the two-qubit gates between s_1 and s_2 can be implemented in the initial placement, and edges u = {u_1, u_2} that are "unsatisfied," in the sense that SWAP gates are needed to bring the qubits u_1 and u_2 together to implement their two-qubit gates. Our approach is to express the total number of SWAP gates as N_SWAP = Σ_u N^(u)_SWAP, where N^(u)_SWAP is the number of SWAP gates that are used in the circuit to bring qubits u_1 and u_2 together to implement the two-qubit gates for u.
Some care is needed to define the N^(u)_SWAP so that they give a consistent total N_SWAP. Each SWAP gate moves the locations of two qubits and hence can contribute to two terms N^(u)_SWAP and N^(u′)_SWAP when it moves two qubits that help to satisfy u and u′. Another consideration is that a series of SWAP gates may be implemented before the gates for a given u, while along the way the SWAP gates that are relevant for u may also allow for implementations of two-qubit gates for a variety of other u′, u″, .... We could then assign fractional values to each of the N^(u)_SWAP. A final consideration is that sometimes the circuits will SWAP qubits that are in initially satisfied edges s before the two-qubit gates for those edges are implemented. Although additional SWAP gates are sometimes used in these cases for the satisfied edges s, these SWAP gates are only needed because there were initially unsatisfied edges u which began a series of SWAP gates earlier in the circuit, so it is reasonable to systematically assign the SWAP gates for these s to the N^(u)_SWAP of the unsatisfied edges. With these conventions we write N_SWAP = N_u ⟨N^(u)_SWAP⟩, where N_u is the number of initially unsatisfied edges and ⟨N^(u)_SWAP⟩ = N_SWAP/N_u is the average number of SWAP gates per unsatisfied edge. The N_u is determined solely by the initial placement of qubits onto the register, while the average ⟨N^(u)_SWAP⟩ = N_SWAP/N_u is determined by the routing of the circuit. We argue for the behavior of these terms in determining N_SWAP and the empirical fit curve of Eq. (7).
For each hardware architecture and circuit, we computed the number of two-qubit edge terms N_u that cannot be implemented directly on the hardware with the initial qubit placement. The N_u for each hardware are found to scale as N_u ∼ (n − n_0), where n_0 is a threshold size at which all graphs can be mapped directly to the hardware. The quantity n_0 sets the zero of N_u and hence N_SWAP; for example, on fully connected hardware n_0 = n, so N_u = 0 and no SWAP gates are needed. The rationale for the n dependence is that, on average, the number of unsatisfied edges increases linearly with the total number of edges, E = 3n/2 for the 3-regular graphs. The linear relations N_u ∼ (n − n_0) for each individual hardware are shown in Supplemental Information Sec. I. They can be related to one another by a factor 1/√d_H, giving N_u ≈ ν(n − n_0)/√d_H with a single fit parameter ν.

For the average number of SWAP gates per unsatisfied edge we take ⟨N^(u)_SWAP⟩ ∼ √(n/d_H). We can rationalize the √n dependence by considering how many SWAP gates are needed to bring qubits together to satisfy an edge u, based on the typical distance between qubits on the approximately √n × √n hardware grids with √n ∈ ℕ. We begin by considering uniform random placements of logical qubits along a single dimension of length √n. The probability for the first qubit to be at location i is P_i = 1/√n, the probability for the second qubit to be at any other location j is P_j = 1/(√n − 1), and the average distance between the qubits is

D̄ = Σ_i Σ_{j≠i} P_i P_j |i − j| = (√n + 1)/3.

This scales approximately as √n. If qubits are placed uniformly at random in two dimensions and they move along each dimension separately, for example in the square hardware lattice of Fig. 1(c), then the total distance is twice the distance in a single dimension and this again scales as √n. In reality the qubit placements are optimized instead of uniformly random, but still the length scales as √n in each dimension, and this gives some justification for the appearance of √n in ⟨N^(u)_SWAP⟩. Finally, we need to account for a factor 1/√d_H to obtain the desired relation ⟨N^(u)_SWAP⟩ ∼ √(n/d_H). We rationalize this factor by considering that fewer SWAP gates are needed to move a qubit from one location to another when there are more connections d_H on the register; for example, in the triangle lattice some diagonal movements are allowed on the planar grid and we expect this to decrease the number of SWAP gates that are needed. We incorporate this through a factor ∼ 1/√d_H, such that ⟨N^(u)_SWAP⟩ ∼ √(n/d_H). Combining N_u ∼ (n − n_0)/√d_H with ⟨N^(u)_SWAP⟩ ∼ √(n/d_H) then gives the empirical relation of Eq. (7).


SUPPLEMENTAL INFORMATION

I. CIRCUIT MAPPING IMPLEMENTATION AND CONVERGENCE

We map QAOA problem instances to hardware circuits using the Enfield software library [48] implemented within the XACC programming framework [62,63]. We assessed the performance of the circuit mapping algorithms WPM, CHW, BMT, and SABRE on example problems, finding that SABRE [47] gave superior performance and time-to-solution scaling with problem size. We therefore used SABRE for all of our circuit mappings. We optimized two adjustable parameters of the SABRE algorithm to minimize gate counts for each of our test sets. The first parameter, "iterations", determines how many times SABRE generates random initial placements. Each of these placements is optimized by SABRE, which outputs the placement with the smallest depth as the final result. The second parameter, "lookahead", determines a balance between current and future gates in the objective function at varying steps in the algorithm; see Ref. [47] for details.
Figure 6 shows an example of the convergence behavior for the "iterations" and "lookahead" parameters for a series of five "shuffled" initializations, as described in detail in the next section, for the test set of 3-regular graphs of size n = 20. For each initialization, the results show a clear dependence on the values of "iterations" and "lookahead", which become steady when both of these parameters are at least forty. We set each parameter equal to 40 to compute our final results for this test set. We observed similar behavior for these parameters at n = 40, with results that appeared convergent when iterations and lookahead are equal. For n = 60 we optimize the parameters assuming they are equal, based on our results from n = 20 and n = 40. The parameter values used for each test set are listed in Table I.

A. Improvements to optimized circuit layouts

We improve the SABRE circuit layouts by implementing QAOA-specific cancellations of circuit elements. CNOT gates can be removed when a SWAP gate (SWAP_{ij} = CNOT_{ij} CNOT_{ji} CNOT_{ij}) appears next to a trio of gates for a two-qubit cost term (exp(−iγ_l J_{i,j} Z_i Z_j) = CNOT_{ij} R_j(2J_{i,j}γ_l) CNOT_{ij}). This gives adjacent and identical CNOT_{ij} gates, which can be removed since CNOT_{ij} CNOT_{ij} = 1. Although each SWAP gate is defined by a series of three CNOT gates, the net gate cost from adding a SWAP gate with a cancellation is only σ = 1 additional CNOT gate, since two CNOT gates are removed in the cancellation.
To increase the number of these cancellations, we defined a "shuffling" algorithm that rearranges the commuting edge terms CNOT_{ij} R_j(2J_{i,j}γ_l) CNOT_{ij} in the circuit that we input to SABRE for optimization. We found this rearrangement also reduces the total number of SWAP gates by finding more efficient series of gates for the hardware. The shuffling procedure works in a loop outside SABRE to generate a random ordering of the commuting CNOT_{ij} R_j(2J_{i,j}γ_l) CNOT_{ij} gate trios in the circuit. At each step in the loop, a shuffled QAOA instance is passed to SABRE to determine a final optimized hardware circuit, keeping the circuit with the fewest CNOT gates as the final optimized solution reported in the paper. Figure 7 shows the convergence behavior of SABRE with additional shuffling iterations.
The changes between iterations are very small by about 50 iterations. We find similar behavior for all our test sets, with small changes between subsequent shuffling iterations around 50, hence we use 50 shuffles for all our results. Each of the 50 shuffle iterations is optimized by SABRE over a number of random initial qubit placements given by the "iterations" parameter from Table I to identify a final optimized instance. From the values in the table, this gives 1000-7000 optimizations per graph to identify a single best solution.

Table II. Linear-fit parameters for N_SWAP and circuit depth as functions of the average problem graph degree d_G, with and without shuffling.

            a (shuffle)    b (shuffle)    a (no shuffle)   b (no shuffle)
N_SWAP      1.47 ± 0.07    −2.6 ± 0.3     1.74 ± 0.07      −3.0 ± 0.3
Depth       11.3 ± 0.6     −9 ± 3         11.6 ± 0.3       −4 ± 1

Figure 8 evaluates the effectiveness of the shuffling iterations relative to an equal number of calls to SABRE without shuffling, for the set of non-isomorphic graphs at n = 7 mapped to a square hardware grid. In each figure, the horizontal axis shows the average vertex degree d_G for the graphs, i.e., the average number of non-zero J_{i,j} Z_i Z_j terms per qubit i. In the left and central figures, the shuffling algorithm is successful in decreasing the number of SWAP gates and the circuit depth, as demonstrated by the linear fits with parameters shown in Table II. The rightmost figure shows the number of CNOT gates per SWAP gate σ, which is significantly decreased using the shuffling routine, especially at small d_G. The horizontal dotted lines show the average numbers of CNOT gates per SWAP gate, averaged over all graphs with one or more SWAP gates. The average σ decreases by about 0.7 when the shuffling routine is implemented. Overall, the shuffling routine performs well at reducing the QAOA circuit cost by including QAOA-specific circuit commutativity in the SABRE optimization.

Figure 9 evaluates σ for the sets of 3-regular graphs. These values increase close to σ = 3 as the graph size n increases. We therefore use σ = 3 for 3-regular graphs at large n in our scaling analysis.

A final improvement comes from cancelling the first layer of CNOT gates in the total QAOA circuit: the initial state is H^{⊗n}|0⟩^{⊗n} = |+⟩^{⊗n} and CNOT_{ij}|+⟩^{⊗n} = |+⟩^{⊗n}, hence the first layer of CNOT gates can be removed. This gives the factor N_0 in Eq. (5) of the main text. To systematically search for CNOT gates to cancel in the first layer, we traverse the circuit from left to right and record the operands of each CNOT gate. If both qubit operands have never been seen before, we cancel the CNOT gate. Note there is an optimal ordering of edge terms (Z_i Z_j) to maximize the number of CNOT gates that can be canceled in this way; for example, consecutive disjoint edge terms, such as Z_0 Z_1 and Z_2 Z_3, result in more gate cancellation opportunities than a chained list of terms, such as Z_0 Z_1 and Z_1 Z_2. The procedure allows us to cancel at most ⌊n/2⌋ gates, as noted by Majumdar et al. [55], since we can have at most ⌊n/2⌋ Z_i Z_j terms with disjoint sets of operands i, j.
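The left-to-right scan for first-layer cancellations can be written compactly if the first layer is represented as an ordered list of the leading CNOT operand pairs of each edge trio. This simplified sketch, with names of our own choosing, counts how many such CNOTs cancel.

def cancel_first_layer_cnots(cnot_pairs):
    """Remove CNOT gates whose two operands have not appeared earlier in the
    circuit: such gates act on |+>|+> and leave the initial state unchanged.
    Returns (kept_gates, number_cancelled); at most floor(n/2) can be cancelled."""
    seen, kept, cancelled = set(), [], 0
    for (i, j) in cnot_pairs:
        if i not in seen and j not in seen:
            cancelled += 1                 # both operands fresh: cancel the gate
        else:
            kept.append((i, j))
        seen.update((i, j))
    return kept, cancelled

# Disjoint edge ordering cancels more gates than a chained ordering:
print(cancel_first_layer_cnots([(0, 1), (2, 3), (1, 2)]))   # 2 cancelled
print(cancel_first_layer_cnots([(0, 1), (1, 2), (2, 3)]))   # 1 cancelled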
B. Degree-based SWAP Gate Lower Bound at Small n

We evaluate SABRE performance in comparison with a simple lower bound for the number of SWAP gates for the non-isomorphic graphs at n = 7. The number of SWAP gates can be bounded in terms of the degrees of the hardware connectivity graph and the graph for the cost Hamiltonian, i.e., the set of non-zero J_{i,j} Z_i Z_j terms for the problem instance. Suppose we have a qubit j with degree d_G^(j) > h_max, where h_max is the maximum degree of a register element on the hardware graph. In the terminology of the main paper, h_max equals the hardware degree d_H for the hexagon, square, and triangular hardware lattices, while for the heavy-hexagon it is the maximum number of connections per register element, h_max = 3. If d_G^(j) > h_max, then at least one SWAP will be needed to enable j to interact with additional qubits, since not all of its interactions can be realized directly on the hardware lattice. In one case, the SWAP could change places of a qubit j′ adjacent to j and another qubit j″ that is twice-removed from j, such that j can then interact with j″. This allows for one additional interaction with j. In a second case, the SWAP gate can switch places of j and an adjacent qubit j′, which allows up to h_max − 1 new connections between j and j″, j‴, etc. So the greatest number of new interactions with j that can be enabled by a single SWAP gate is h_max − 1. The minimum number of SWAP gates that must be performed for a qubit j to allow it to interact with all adjacent vertices in the cost graph is then

N^min_SWAP,j = ⌈(d_G^(j) − h_max)/(h_max − 1)⌉   (17)

when d_G^(j) > h_max, and zero otherwise. Each SWAP gate switches the hardware locations of two logical qubits and thus can enable new interactions for two logical qubits; that is, a SWAP_{ij} gate could enable new interactions for both i and j. The minimum total number of SWAP gates is then half the sum of the N^min_SWAP,j for the individual qubits,

N^min_SWAP = ⌈(1/2) Σ_j N^min_SWAP,j⌉.   (18)

Figure 10 compares the observed SWAP gate counts for the n = 7 non-isomorphic graphs to the gate counts from the lower bound. For each of the fixed-degree hardwares (all except heavy-hexagon), the average observed SWAP gate counts are no greater than four above the lower bound. To be clear, the lower bound is not expected to correspond exactly with the results, since it makes the simplistic assumption that every SWAP gate enables the maximum possible number of useful new interactions. Thus we expect the observed gate counts to be higher, and we take these results as suggesting that SABRE is achieving good performance for these graphs. The agreement is somewhat worse for the heavy-hexagon lattice, which has a mixture of vertices of degree two and three in the hardware graph. For this we simply use h_max = 3 in computing the lower bound, which ignores additional SWAPs required by the degree-two vertices, so worse agreement is expected.
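Evaluating the bound of Eqs. (17)-(18) requires only the problem-graph degree sequence and h_max; the helper below is an illustrative implementation of the bound as reconstructed above.

from math import ceil

def swap_lower_bound(problem_degrees, h_max):
    """Degree-based lower bound on the number of SWAP gates, Eqs. (17)-(18):
    each SWAP enables at most h_max - 1 new interactions for a qubit, and each
    SWAP can serve at most two qubits."""
    per_qubit = [ceil(max(d - h_max, 0) / (h_max - 1)) for d in problem_degrees]
    return ceil(sum(per_qubit) / 2)

# Example: the complete graph K7 (every qubit has degree 6) on a square lattice (h_max = 4)
print(swap_lower_bound([6] * 7, h_max=4))   # ceil(7 * ceil(2/3) / 2) = 4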
Figure 11 shows the average number of two-qubit cost terms exp(−iγ_l J_{i,j} Z_i Z_j) = CNOT_{ij} R_j(2J_{i,j}γ_l) CNOT_{ij} for initially "unsatisfied" problem-graph edges that cannot be implemented directly on the hardware in the initial qubit placement, computed from each of the 3-regular graph test sets; see Methods for details. The number of unsatisfied edges on each hardware scales approximately as ∼(n − n_0), with fit parameters in Table III. These curves can be united by introducing a scaling factor 1/√d_H to each curve, giving the final fit in Table III and in Fig. 5 of the main paper.

Table III. Fits to the number of initially unsatisfied edges, N_u ≈ f(n), for each hardware architecture, and the unified fit across hardwares.

hardware     n_0      fit function                  fit parameter
heavy-hex    2        f_hh(n) = ν_hh (n − n_0)      ν_hh = 1.03 ± 0.02
hexagon      2        f_h(n) = ν_h (n − n_0)        ν_h = 0.99 ± 0.03
square       2        f_s(n) = ν_s (n − n_0)        ν_s = 0.91 ± 0.04
triangle     3        f_t(n) = ν_t (n − n_0)        ν_t = 0.71 ± 0.06
all          2 or 3   f(n) = ν (n − n_0)/√d_H       ν = 1.71 ± 0.04

III. MEASUREMENT SCALING WITH ERROR RATES, QAOA LAYERS, AND PROBLEM GRAPH DEGREE

Figure 12 shows that the number of measurement samples M needed to obtain a single measurement from the ideal state distribution increases exponentially with the CNOT gate infidelity ǫ_CNOT on fully connected hardware, with ǫ_H = ǫ_R = ǫ_CNOT/10 as in the main text. Similar scaling can be observed with increasing numbers of QAOA layers p, since F_0 ≈ f_0^p, where f_0 is the fidelity lower bound for a single layer. At small F_0 the logarithm log(1 − F_0) ≈ −F_0 ≈ −f_0^p, so M ≈ −log(1 − 𝒫)/f_0^p, and this diverges exponentially in p. Similarly, increasing d_G increases the number of two-qubit edge terms in the QAOA circuit and the fully connected N^fc_CNOT ∼ d_G. If we further assume the same scaling for N_SWAP as for our n = 7 graphs, then N_SWAP ∼ d_G and the total number of CNOT gates N_CNOT ∼ d_G. These factors appear in exponents in F_0, and this gives an exponential divergence in M with respect to d_G.