Automated optimization of large quantum circuits with continuous parameters

We develop and implement automated methods for optimizing quantum circuits of the size and type expected in quantum computations that outperform classical computers. We show how to handle continuous gate parameters and report a collection of fast algorithms capable of optimizing large-scale quantum circuits. For the suite of benchmarks considered, we obtain substantial reductions in gate counts. In particular, we provide better optimization in significantly less time than previous approaches, while making minimal structural changes so as to preserve the basic layout of the underlying quantum algorithms. Our results help bridge the gap between the computations that can be run on existing hardware and those that are expected to outperform classical computers. A new software tool significantly reduces the size of arbitrary quantum circuits, automatically optimizing the number of gates required for running algorithms. Yunseong Nam and colleagues from the University of Maryland developed a set of subroutines which, given a certain quantum circuit, would remove redundant gates by changing the order of individual or multiple operations and combining them. After a pre-processing phase, the execution of these routines in careful order constitutes a powerful automatized approach for reducing the resources required to implement a given algorithm. The heuristic nature of this optimization makes its computational cost scale well with the size of the circuit, as shown by comparisons for the computation of discrete logarithms and Hamiltonian simulations. This makes it applicable to computations that can be run on existing hardware and might outperform classical computers.


Introduction
Quantum computers have the potential to dramatically outperform classical computers at solving certain problems. Perhaps their best-known application is to the task of factoring integers: whereas the fastest known classical algorithm runs in superpolynomial time [22], Shor's algorithm solves this problem in polynomial time [34], providing an attack on the widely used RSA cryptosystem.
Even before the discovery of Shor's algorithm, quantum computers were proposed for simulating quantum mechanics [11]. By simulating Hamiltonian dynamics, quantum computers can study phenomena in condensed-matter and high-energy physics, quantum chemistry, and materials science. Useful instances of quantum simulation are likely accessible to smaller-scale quantum computers than classically hard instances of the factoring problem.
These and other potential applications [19] have helped motivate significant efforts toward building a scalable quantum computer. Two quantum computing technologies, superconducting circuits [16] and trapped ions [8], have matured sufficiently to enable fully programmable universal devices, albeit currently of modest size. Several groups are actively developing these platforms into larger-scale devices, backed by significant investments from both industry [15,17,24,35] and government [10,12,28]. Thus, it is plausible that quantum computations involving tens or even hundreds of qubits will be carried out in the not-too-distant future [14,20].
Experimental quantum information processing remains a difficult technical challenge, and the resources available for quantum computation will likely continue to be expensive and severely limited for some time.
To make the most out of the available hardware, it is essential to develop implementations of quantum algorithms that are as efficient as possible.
Quantum algorithms are typically expressed in terms of quantum circuits, which describe a computation as a sequence of elementary quantum logic gates acting on qubits (see Section 2 for more details). There are many ways of implementing a given algorithm with an available set of elementary operations, and it is advantageous to find an implementation that uses the fewest resources. While it is imperative to develop algorithms that are efficient in an abstract sense and to implement them with an eye toward practical efficiency, large-scale quantum circuits are likely to have sufficient complexity to benefit from automated optimization.
In this work, we develop software tools for reducing the size of quantum circuits, aiming to improve their performance as much as possible at a scale where manual gate-level optimization is no longer practical. Since global optimization of arbitrary quantum circuits is QMA-hard [18], our goal is more modest: we apply a set of carefully chosen heuristics to reduce the gate counts, often resulting in substantial savings.
We apply our optimization techniques to several types of quantum circuits. Our benchmark circuits include components of quantum algorithms for factoring and computing discrete logarithms, such as the quantum Fourier transform, integer adders, and Galois field multipliers. We also consider circuits for the product formula approach to Hamiltonian simulation [4,23]. In all cases, we focus on circuit sizes likely to be useful in applications that outperform classical computation. Our techniques can help practitioners understand which implementation of an algorithm is most efficient in a given application. We detail our methods in Section 3 and discuss our results in Section 4, before concluding in Section 5.
While there has been considerable previous work on quantum circuit optimization (as detailed in Section 4.3), we are not aware of prior work on automated optimization that has targeted large-scale circuits such as the ones considered here. Moreover, extrapolation of previously reported runtimes suggests it is unlikely that existing quantum circuit optimizers would perform well for such large circuits. We perform direct comparisons by running our software on the same circuits optimized in Ref. [1], showing that our approach typically finds smaller circuits in less time. In addition, to the best of our knowledge, our work is the first to focus on automated optimization of quantum circuits with continuous gate parameters.

Background
A quantum circuit is a sequence of quantum gates acting on a collection of qubits. Quantum circuits are conveniently represented by diagrams in which horizontal wires denote time evolution of qubits, with time propagating from left to right, and boxes (or other symbols joining the wires) represent quantum gates. For example, the diagram

[circuit diagram]

describes a simple three-qubit quantum circuit. We consider a simple set of elementary gates for quantum circuits consisting of the two-qubit controlled-not gate (abbreviated cnot, the leftmost gate in the above circuit), together with the single-qubit not gate, Hadamard gate h, and z-rotation gate r_z(θ). Unitary matrices for these gates take the form

$$\text{not} := \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad h := \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \quad r_z(\theta) := \begin{pmatrix} 1 & 0 \\ 0 & e^{i\theta} \end{pmatrix}, \quad \text{and} \quad \text{cnot} := \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix},$$

where θ ∈ (0, 2π] is the rotation angle. The gates s := r_z(π/2) and t := r_z(π/4) are known as the Phase and t gates, respectively. When the rotation angle is irrelevant, we denote a generic z-rotation by r_z.
While we aim to produce quantum circuits over the set of not, h, r_z, and cnot gates, we consider input circuits that may also include Toffoli gates. The Toffoli gate (the top gate in Figure 5) is described by the mapping |x, y, z⟩ ↦ |x, y, z ⊕ (x ∧ y)⟩ of computational basis states. We also allow Toffoli gates to have negated controls. For example, the Toffoli gate with its top control negated (the middle gate in Figure 5) acts as |x, y, z⟩ ↦ |x, y, z ⊕ (x̄ ∧ y)⟩, and the Toffoli gate with both controls negated (the bottom gate in Figure 5) acts as |x, y, z⟩ ↦ |x, y, z ⊕ (x̄ ∧ ȳ)⟩.
The cost of performing a given quantum circuit depends on the physical system used to implement it. The cost can also vary significantly between a physical-level (unprotected) implementation and a logical-level (fault-tolerant) implementation. At the physical level, a two-qubit gate is typically more expensive to implement than a single-qubit gate [8,16]. We accommodate this by considering the cnot gate count and optimizing the number of cnot gates in our algorithms.
For logical-level fault-tolerant circuits, the so-called Clifford operations (generated by the Hadamard, Phase, and cnot gates) are often relatively easy to implement, whereas non-Clifford operations incur significant overhead [5,30]. Thus we also consider the number of r_z gates in our algorithms and try to optimize their count. In fault-tolerant implementations, r_z gates are approximated over a discrete gate set, typically consisting of Clifford and t gates. Optimal algorithms for producing such approximations are known [21,32]. The number of Clifford+t gates required to approximate a generic r_z gate depends primarily on the desired accuracy rather than the specific angle of rotation, so it is preferable to optimize a circuit before approximating its r_z gates with Clifford+t fault-tolerant circuits.
By minimizing both the cnot and r_z counts, we perform optimizations targeting both physical- and logical-level implementations. One might expect a trade-off between these two goals, and in fact we know of instances where such trade-offs do occur. However, in this paper we only consider optimizations aimed at reducing both the r_z and cnot counts.

Algorithms and implementation
In this section, we describe our optimization algorithms and their implementation. Throughout, we use g to denote the number of gates appearing in a circuit. We begin in Section 3.1 by describing three distinct representations of quantum circuits that we employ. In Section 3.2, we describe a preprocessing step used in all versions of our algorithm. Then, in Section 3.3, we describe several subroutines that form the basic building blocks of our approach. Section 3.4 explains how these subroutines are combined to form our main algorithms. Finally, in Section 3.5, we present two special-purpose optimization techniques that we use to handle particular types of circuits.

Representations of quantum circuits
We use the following three representations of quantum circuits:
• First, we store a circuit as a list of gates to be applied sequentially (a netlist). It is sometimes convenient to specify the circuit in terms of subroutines, which we call blocks. Each block can be iterated any number of times and applied to any subset of the qubits present in the circuit. A representation using blocks can be especially concise since many quantum circuits exhibit a significant amount of repetition. A block is specified as a list of gates and qubit addresses.
We input and output the netlists using the .qc format of [1] and the format produced by the quantum programming language Quipper [13]. Both include the ability to handle blocks.
• Second, we use a directed acyclic graph (DAG) representation. The vertices of the DAG are the gates of the circuit and the edges encode their input/output relationships. The DAG representation has the advantage of making adjacency between gates easy to access.
• Third, we use a generalization of the phase polynomial representation of {cnot, t} circuits [2]. Unlike the netlist and DAG representations, this last representation applies only to circuits consisting entirely of not, cnot, and r_z gates. Such circuits can be concisely expressed as the composition of an affine reversible transformation and a diagonal phase transformation. Let C be a circuit consisting only of not gates, cnot gates, and the gates r_z(θ_1), r_z(θ_2), ..., r_z(θ_ℓ). Then the action of C on the n-qubit basis state |x_1, x_2, ..., x_n⟩ has the form

$$|x_1, x_2, \ldots, x_n\rangle \mapsto e^{i p(x_1, x_2, \ldots, x_n)} \, |h(x_1, x_2, \ldots, x_n)\rangle,$$

where h : {0, 1}^n → {0, 1}^n is an affine reversible function and

$$p(x_1, x_2, \ldots, x_n) = \sum_{i=1}^{\ell} (\theta_i \bmod 2\pi) \cdot f_i(x_1, x_2, \ldots, x_n) \tag{4}$$

is a linear combination of affine Boolean functions f_i : {0, 1}^n → {0, 1} with the coefficients reduced modulo 2π. We call p(x_1, x_2, ..., x_n) the phase polynomial associated with the circuit C. For example, the circuit

[circuit diagram (5)]

can be represented by the mapping |x, y⟩ ↦ e^{ip(x,y)} |x ⊕ y, y⟩, where p(x, y) = θ_1 y + θ_2 (x ⊕ y) + θ_3 x + θ_4 y. (In Ref. [2], the phase polynomial representation is only considered for {cnot, t} circuits, so all θ_i in the expression (4) are integer multiples of π/4 and the functions f_i are linear.)

We can convert between any two of the above three circuit representations in time linear in the number of gates in the circuit. Given a netlist, we can build the corresponding DAG gate-by-gate. Conversely, we can convert a DAG to a netlist by standard topological sorting. To convert between the netlist and phase polynomial representations of {not, cnot, r_z} circuits, we use a straightforward generalization of the algorithm of [2].
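As an illustration of the netlist-to-phase-polynomial conversion, the following minimal Python sketch tracks the affine function carried by each wire and accumulates the rotation angle attached to each function. The tuple-based netlist encoding, the function name `phase_polynomial`, and the particular gate sequence used for the two-qubit example are assumptions made for illustration (the gate sequence was chosen only so that it reproduces the mapping quoted above); the sketch also assumes the convention r_z(θ) = diag(1, e^{iθ}).

```python
import math

def phase_polynomial(n, gates):
    """Phase polynomial of a circuit over not, cnot and rz gates.

    Gates are tuples: ("not", q), ("cnot", control, target), ("rz", q, theta).
    An affine function c XOR (XOR of x_k over the bits k set in mask) is stored
    as the pair (mask, c). Returns (terms, wires), where terms maps each affine
    function to its accumulated angle modulo 2*pi and wires lists the affine
    function carried by each output wire.
    """
    wires = [(1 << k, 0) for k in range(n)]  # wire k initially carries x_k
    terms = {}
    for gate in gates:
        kind = gate[0]
        if kind == "not":
            mask, const = wires[gate[1]]
            wires[gate[1]] = (mask, const ^ 1)
        elif kind == "cnot":
            control, target = gate[1], gate[2]
            wires[target] = (wires[target][0] ^ wires[control][0],
                             wires[target][1] ^ wires[control][1])
        elif kind == "rz":
            f = wires[gate[1]]
            terms[f] = (terms.get(f, 0.0) + gate[2]) % (2 * math.pi)
        else:
            raise ValueError(f"gate {kind!r} has no phase polynomial form")
    return terms, wires

# Hypothetical two-qubit circuit realizing |x, y> -> e^{ip(x,y)} |x XOR y, y>.
X, Y = 0, 1  # x_0 = x has mask 0b01, x_1 = y has mask 0b10
example = [("rz", Y, 0.1), ("cnot", Y, X), ("rz", X, 0.2), ("cnot", Y, X),
           ("rz", X, 0.3), ("rz", Y, 0.4), ("cnot", Y, X)]
terms, wires = phase_polynomial(2, example)
for (mask, const), theta in sorted(terms.items()):
    print(f"mask={mask:02b} const={const} angle={theta:.3f}")
```

In this encoding the two rotations applied to y (the first and the fourth) land on the same dictionary key, which is the situation that rotation merging exploits.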

Preprocessing
Before running our main optimization procedures, we preprocess the circuit to make it more amenable to further optimization. Specifically, the preprocessing applies provided the input circuit consists only of not, cnot, and Toffoli gates (as is the case for the Quipper adders described in Section 4.1 and the t-par circuit benchmarks described in Section 4.3). In this case, we push the not gates as far to the right as possible by commuting them through the controls of Toffoli gates and the targets of Toffoli and cnot gates. When pushing a not gate through a Toffoli gate control, we negate that control (or remove the negation if it was initially negated). If this procedure leads to a pair of adjacent not gates, we remove them from the circuit. If no such cancellation is found, we revert the control negation changes and move the not gate back to its original position.
This not gate propagation leverages two aspects of our optimizer. First, we accept Toffoli gates that may have negated controls and optimize their decomposition into Clifford+t circuits by exploiting freedom in the choice of t/t† polarities (see Section 3.5). Second, since cancellations of not gates simplify the phase polynomial representation (by making some of the functions f_i in the phase polynomial representation (4) linear instead of merely affine), such cancellations make it more likely that Routine 4 and Routine 5 in Section 3.3 will find optimizations (since those routines rely on finding matching terms in the phase polynomial representation).
The complexity of this preprocessing step is O(g) since we simply make a single pass through the circuit.
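The not-gate propagation described above can be sketched in Python as follows. The `Gate` dataclass, the function names, and the classical simulator `apply` are hypothetical helpers introduced only for this illustration; the sketch rescans the circuit for each not gate, so it is quadratic in the worst case rather than the single pass described in the text.

```python
import copy
from dataclasses import dataclass, field

@dataclass
class Gate:
    kind: str          # "not", "cnot" or "toffoli"
    target: int
    controls: dict = field(default_factory=dict)  # qubit -> True if negated

def cancel_nots(gates):
    """Push each not gate to the right and cancel it with a later not gate."""
    gates = copy.deepcopy(list(gates))
    i = 0
    while i < len(gates):
        if gates[i].kind != "not":
            i += 1
            continue
        q = gates[i].target
        flipped, partner = [], None
        for j in range(i + 1, len(gates)):
            h = gates[j]
            if h.kind == "not" and h.target == q:
                partner = j
                break
            if q in h.controls:
                if h.kind != "toffoli":
                    break  # a not gate is not pushed through a cnot control
                h.controls[q] = not h.controls[q]  # toggle the control negation
                flipped.append(h)
            # gates targeting q, or not touching q, commute with the not gate
        if partner is None:
            for h in flipped:  # no cancellation found: revert the negations
                h.controls[q] = not h.controls[q]
            i += 1
        else:
            del gates[partner]
            del gates[i]
    return gates

def apply(gates, bits):
    """Classically simulate a not/cnot/Toffoli circuit on a list of bits."""
    bits = list(bits)
    for g in gates:
        if all(bits[c] != negated for c, negated in g.controls.items()):
            bits[g.target] ^= 1
    return bits

circuit = [Gate("not", 0), Gate("toffoli", 2, {0: False, 1: False}), Gate("not", 0)]
print(cancel_nots(circuit))  # a single Toffoli whose control on qubit 0 is negated
```

Because the circuits handled at this stage are classical reversible circuits, the equivalence of the input and output can be checked exhaustively on small examples with `apply`.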
Figure 1: Hadamard gate reductions. The two rules illustrated on the bottom can be applied even if the middle cnot gate is replaced by a circuit with any number of cnot gates, provided they all share the target of the original cnot.

Optimization subroutines
Our optimization algorithms rely on a variety of subroutines that we now describe. For each of them, we report the worst-case time complexity as a function of the number of gates g in the circuit (for simplicity, we neglect the dependence on the number of qubits and other parameters). We optimize practical performance by carefully ordering and restricting the subroutines, as we discuss further below.

Routine 1: Hadamard gate reduction
Hadamard gates do not participate in phase polynomial optimization (Routine 4 and Routine 5 below) and also tend to hinder gate commutation. Thus, we use the circuit identities pictured in Figure 1 to reduce the Hadamard gate count. Each application of these rules reduces the h count by up to 4. For a given Hadamard gate, we can use the DAG representation to check in constant time whether it is involved in one of these circuit identities. Thus, we can implement this subroutine with complexity O(g) by making a single pass through all Hadamard gates in the circuit.

Routine 2: Single-qubit gate cancellation
Using the DAG representation of a quantum circuit, it is straightforward to determine whether a gate and its inverse are adjacent. If so, both gates can be removed to reduce the gate count. More generally, we can cancel two single-qubit gates U and U† that are separated by a subcircuit A that commutes with U. In general, deciding whether a gate U commutes with a circuit A may be computationally demanding. Instead, we apply a specific set of rules that provide sufficient (but not necessary) conditions for commutation. This approach is fast and appears to discover many commutations that can be exploited to simplify quantum circuits.
Specifically, for each gate U in the circuit, the optimizer searches for possible cancellations with some instance of U†. To do this, we repeatedly check whether U commutes through a set of consecutive gates, as evidenced by one of the patterns in Figure 2. If at some stage we cannot move U to the right by some allowed commutation pattern, then we fail to cancel U with a matched U†, so we restore the initial configuration. Otherwise, we successfully cancel U with some instance of U†.
For each of the g gates U, we check whether it commutes through O(g) subsequent positions. Thus the complexity of the overall gate cancellation rule is O(g^2). We could make the complexity linear in g by only considering commutations through a constant number of subsequent gates, but we do not find this to be necessary in practice.
We also use a slight variation of this subroutine to merge rotation gates, rather than cancel inverses. Specifically, two rotations r_z(θ_1) and r_z(θ_2) can be combined into a single rotation r_z(θ_1 + θ_2) to eliminate one r_z gate.
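A minimal Python sketch of this cancel-or-merge pass for r_z gates is given below. It uses only two commutation rules that hold in general (an r_z gate commutes with any other r_z gate, and with a cnot whose control it acts on), which is a small assumed subset standing in for the patterns of Figure 2; the tuple encoding and function names are likewise illustrative.

```python
import math

TWO_PI = 2 * math.pi

def qubits(gate):
    """Qubits touched by ("rz", q, theta), ("h", q), ("not", q) or ("cnot", c, t)."""
    return gate[1:3] if gate[0] == "cnot" else gate[1:2]

def rz_commutes(q, gate):
    """Sufficient (not necessary) conditions for rz on qubit q to pass gate."""
    if q not in qubits(gate):
        return True
    if gate[0] == "rz":
        return True                      # diagonal gates commute
    return gate[0] == "cnot" and gate[1] == q  # rz on the control of a cnot

def merge_rotations(circuit):
    """Merge each rz with the next rz on the same qubit it can commute to."""
    gates = list(circuit)
    i = 0
    while i < len(gates):
        merged = False
        if gates[i][0] == "rz":
            _, q, theta = gates[i]
            for j in range(i + 1, len(gates)):
                other = gates[j]
                if other[0] == "rz" and other[1] == q:
                    total = (theta + other[2]) % TWO_PI
                    del gates[j]
                    if min(total, TWO_PI - total) < 1e-12:
                        del gates[i]     # the two rotations cancel outright
                    else:
                        gates[i] = ("rz", q, total)
                    merged = True
                    break
                if not rz_commutes(q, other):
                    break                # blocked: leave the gate where it is
        if not merged:
            i += 1
    return gates

demo = [("rz", 0, math.pi / 4), ("cnot", 0, 1), ("rz", 0, math.pi / 4),
        ("h", 0), ("rz", 0, 0.3)]
print(merge_rotations(demo))
```

In the demo, the first two rotations merge across the cnot because they act on its control, while the Hadamard gate blocks any merge with the final rotation.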

Routine 3: Two-qubit gate cancellation
This routine is analogous to Routine 2, except that U is a two-qubit gate, which is always cnot in the circuits we consider. Again its complexity is O(g^2), but may be reduced to O(g) by imposing a maximal size for the subcircuit A.
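The same idea for cnot gates can be sketched as follows. The commutation test uses standard facts (two cnot gates commute unless the control of one is the target of the other; a cnot commutes with a diagonal gate on its control and with a not gate on its target) as an assumed stand-in for the rule set used by the optimizer; names and the gate encoding are illustrative.

```python
def cnot_commutes(control, target, gate):
    """Sufficient conditions for cnot(control, target) to commute with gate."""
    kind = gate[0]
    if kind == "cnot":
        # Shared controls or shared targets are fine; control-on-target is not.
        return control != gate[2] and target != gate[1]
    q = gate[1]
    if q == control:
        return kind == "rz"    # diagonal gate on the control
    if q == target:
        return kind == "not"   # not gate on the target
    return True                # gate acts on an unrelated qubit

def cancel_cnots(circuit):
    """Cancel each cnot with an identical cnot it can be commuted next to."""
    gates = list(circuit)
    i = 0
    while i < len(gates):
        cancelled = False
        if gates[i][0] == "cnot":
            _, c, t = gates[i]
            for j in range(i + 1, len(gates)):
                if gates[j] == ("cnot", c, t):
                    del gates[j]
                    del gates[i]
                    cancelled = True
                    break
                if not cnot_commutes(c, t, gates[j]):
                    break
        if not cancelled:
            i += 1
    return gates

demo = [("cnot", 0, 1), ("rz", 0, 0.3), ("cnot", 0, 2), ("cnot", 0, 1)]
print(cancel_cnots(demo))
```

Capping the inner loop at a fixed number of positions would give the linear-time variant mentioned above, at the cost of missing cancellations across long subcircuits.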

Routine 4: Rotation merging using phase polynomials
Consider a subcircuit consisting of not, cnot, and r_z gates. Observe that if two individual terms of its phase polynomial expression satisfy f_i(x_1, x_2, ..., x_n) = f_j(x_1, x_2, ..., x_n) for some i ≠ j, then the corresponding rotations r_z(θ_i) and r_z(θ_j) can be merged. For example, in the circuit (5), the first and fourth rotations are both applied to the qubit carrying the value y, as evidenced by its phase polynomial representation. Thus (5) is equivalent to the circuit in which the two rotations are combined. In other words, the phase polynomial representation of circuits reveals when two rotations (in this case, r_z(θ_1) and r_z(θ_4)) are applied to the same affine function of the inputs, even if they appear in different parts of the circuit. Then we may combine these rotations into a single rotation, improving the circuit. (We have the flexibility to place the combined rotation at any point in the circuit where the relevant affine function appears. For concreteness, we place it at the first (leftmost) such location.)

[...] do not include the cnot gate in the subcircuit). This exception gives a larger {not, cnot, r_z} subcircuit that remains amenable to phase polynomial representation, as in the following example:

[circuit diagram (8)]

In the example circuit (8), suppose we start our search from the first cnot gate acting on the top (q_1) and middle (q_2) qubits. Traversing q_1 to the left, we find an h gate, where we mark a termination point. Traversing q_1 to the right, we find two cnot gates, one r_z gate, and then an h gate, where we mark a termination point. Observe that neither of the encountered cnot gates joins q_1 or q_2 to the remaining qubit q_3. Next, we repeat the same procedure on q_2 from the original cnot gate. To the left we find an r_z gate and then an h gate, where we mark a termination point. Traversing to the right, we find a cnot acting on q_2 and q_3. This cnot reveals additional connectivity, so we mark an anchor point at the target of this cnot gate. Further to the right on the q_2 wire, we have three more cnot gates (none of which reveals additional connectivity), an r_z gate, and finally an h gate, where we mark a termination point. Next we examine q_3. We start from the aforementioned anchor point. To the left, we find an h gate with no further connections to other qubits, where we mark a termination point. To the right, we immediately find an h gate and mark a termination point.
Having built the subcircuit, we go through the netlist representation and prune it. In this pass, we encounter the fourth cnot gate acting on q_2 and q_3, where we find that the control is within the border but the target is not. In this case we continue according to the exception handling scheme described in the pruning procedure. This ensures that we include the last cnot gate in the {not, cnot, r_z} region, while excluding the fourth cnot gate (as indicated by the dotted border in (8)). Thus we discover that the last r_z gate appearing in the circuit can be relocated to the very beginning of the circuit on the q_2 line, to the right of the leftmost h, enabling a phase-polynomial based r_z merge (see below for details).
Once a valid {not, cnot, r_z} subcircuit is identified, we generate its phase polynomial. For each r_z gate, we determine the associated affine function its phase is applied to and the location in the circuit where it is applied. We then sort the list of recorded affine functions. Finally, we find and merge all r_z gate repetitions, placing the merged r_z at the first location in the subcircuit that computes the desired affine function.
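The merge step for an already identified {not, cnot, r_z} subcircuit might be sketched as below. Two points are assumptions of this illustration rather than a description of the actual implementation: repeated affine functions are found with a dictionary lookup instead of sorting, and the subcircuit is given directly as a tuple-based netlist (subcircuit identification and pruning are not modeled). Rotations applied to a function and to its complement are left unmerged, and a merged angle of zero is not removed.

```python
import math

TWO_PI = 2 * math.pi

def merge_by_phase_polynomial(n, gates):
    """Merge rz gates that act on the same affine function of the inputs.

    Gates are ("not", q), ("cnot", control, target) or ("rz", q, theta).
    The merged rotation stays at the first (leftmost) occurrence.
    """
    wires = [(1 << k, 0) for k in range(n)]  # (mask, constant) per wire
    first_seen = {}                          # affine function -> index in out
    out = []
    for gate in gates:
        kind = gate[0]
        if kind == "rz":
            f = wires[gate[1]]
            if f in first_seen:
                k = first_seen[f]
                _, q, theta = out[k]
                out[k] = ("rz", q, (theta + gate[2]) % TWO_PI)
            else:
                first_seen[f] = len(out)
                out.append(gate)
            continue
        if kind == "not":
            mask, const = wires[gate[1]]
            wires[gate[1]] = (mask, const ^ 1)
        elif kind == "cnot":
            control, target = gate[1], gate[2]
            wires[target] = (wires[target][0] ^ wires[control][0],
                             wires[target][1] ^ wires[control][1])
        else:
            raise ValueError(f"unsupported gate {kind!r}")
        out.append(gate)
    return out

# Hypothetical circuit in which the first and fourth rotations both act on y.
X, Y = 0, 1
circuit = [("rz", Y, 0.1), ("cnot", Y, X), ("rz", X, 0.2), ("cnot", Y, X),
           ("rz", X, 0.3), ("rz", Y, 0.4), ("cnot", Y, X)]
print(merge_by_phase_polynomial(2, circuit))
```

With a hash map the per-subcircuit cost is linear in its size on average; the sorting-based approach described in the text gives the O(g log g) bound used in the complexity analysis.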
This procedure considers O(g) subcircuits, and the cost of processing each of these is dominated by sorting, with complexity O(g log g), giving an overall complexity of O(g^2 log g) for Routine 4. However, in practice the subcircuits are typically smaller when there are more of them to consider, so the true complexity is lower. In addition, when identifying a {not, cnot, r_z} subcircuit, we choose to start with a cnot gate that has not yet been included in any of the previously identified {not, cnot, r_z} subcircuits, so the number of subcircuits can be much smaller than g in practice. If desired, the overall complexity can be lowered to O(g) by limiting the maximal size of the subcircuit.
We now return to the description of optimization subroutines.

Routine 5: Floating r_z gates
In Routine 4, we keep track of the affine functions associated with r_z gates. More generally, we can record all affine functions that occur in the subcircuit and their respective locations, regardless of the presence of r_z gates. Thus we can identify all possible locations where an r_z gate could be placed, not just those locations where r_z gates already appear in the circuit. In this "floating" r_z gate placement picture, we employ three optimization subroutines: two-qubit gate cancellations, gate count preserving rewriting rules, and gate count reducing rewriting rules.
The first of these subroutines is essentially identical to Routine 3, except that r_z gates are now floatable and we focus on a specific identified subcircuit. This approach allows us to place r_z gates to facilitate cancellations by keeping track of all possible r_z gate locations along the way. In particular, if not placing an r_z gate at a particular location will allow two cnot gates to cancel, we simply remove that location from the list of possible locations for the r_z gate while ensuring that the reduced list remains non-empty, and perform the cnot cancellation.
We next apply rewriting rules that preserve the gate count (see Figure 3) in an attempt to find further optimizations. While these replacements do not eliminate gates, they modify the circuit in ways that can enable optimizations elsewhere. The rewriting rules are provided by an external library file, and we identify subcircuits to which they can be applied using the DAG representation. The replacements are applied only if they lead to a reduction in the two-qubit gate count through one more round of the aforementioned two-qubit cancellation subroutine with floatable r_z gates. Note that the rewriting rules are applicable only with certain floating r_z gates at particular locations in a circuit. This subroutine uses floating r_z gates to choose those combinations of r_z gate locations that lead to reduction in the gate count.
The last subroutine applies rewriting rules that reduce the gate count (see Figure 4). These rules are also provided via an external library file. Since these rules reduce the gate count on their own, we always perform the rewriting whenever a suitable pattern is found.
The complexity of this three-step routine is upper bounded by O(g^3) since the number of subcircuits is O(g), and within each subcircuit, the two-qubit cancellation (Routine 3) has complexity O(g^2). The rewriting rules can be applied with complexity O(g) since, as in Routine 1, a single pass through the gates in the circuit suffices. Again, in practice, the number of subcircuits and the subcircuit sizes are typically inversely related, which lowers the observed complexity by about a factor of g. The complexity can also be lowered to O(g^2) by limiting the maximal size of the subcircuit. The complexity can be further lowered to O(g log g) by limiting the maximal size of the subcircuit A in the two-qubit gate cancellation (the sorting could still have complexity O(g log g)).
To illustrate how this optimization works, consider the circuit in equation (7). Observe that r_z(θ_2) may be executed on the top qubit at the end of the circuit, allowing the first two cnot gates to cancel and leading to a circuit that is simplified even further.

General-purpose optimization algorithms
Our optimization algorithms simply apply the subroutines from Section 3.3 in a carefully chosen order. We consider two versions of the optimizer that we call Light and Heavy. The Heavy version applies more subroutines, yielding better optimization results at the cost of a higher runtime. The preprocessing step (see Section 3.2) is used in both Light and Heavy versions of the optimizer.

The Light version of the optimizer applies the optimization subroutines in the order 1, 3, 2, 3, 1, 2, 4, 3, 2. We then repeat this sequence until no further optimization is achieved. We chose this sequence based on the principle that first exposing {cnot, r_z} gates while reducing Hadamard gates (1) allows for greater reduction in the cancellation routines (3, 2, 3), and in particular frees up two-qubit cnot gates to facilitate single-qubit gate reductions and vice versa. Applying the replacement rule (1) may enable more reductions after the first four optimization subroutines. We then look for additional single-qubit gate cancellation and merging (2). This enables faster identification of the {not, cnot, r_z} subcircuit regions to look for further r_z count optimizations (4), after which we check for residual cancellations of the gates (3, 2).

The Heavy version of the optimizer applies the sequence 1, 3, 2, 3, 1, 2, 5.
Similarly, we repeat this sequence until no further optimization is achieved. The first six steps of the Heavy optimization sequence are identical to those of the Light optimizer. The difference is that in the Heavy optimizer, we take advantage of floating r_z gates. This allows us to find locations for the r_z gates that admit better cnot gate reductions, including the use of gate count preserving rewriting rules to expose further gate cancellations and gate count reducing rewriting rules to remove any remaining inefficiency.
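The control flow of the two schedules can be sketched as a simple driver loop. In the sketch below, the schedule tuples are taken from the text, while `run_schedule`, the use of the total gate count as the stopping criterion, and the toy routines in the demo are assumptions made for illustration.

```python
LIGHT = (1, 3, 2, 3, 1, 2, 4, 3, 2)   # routine order of the Light optimizer
HEAVY = (1, 3, 2, 3, 1, 2, 5)         # routine order of the Heavy optimizer

def run_schedule(circuit, routines, schedule):
    """Apply routines[k] for each k in schedule, repeating the whole sequence
    until a full pass no longer lowers the gate count.

    `routines` maps a routine number to a function circuit -> circuit.
    """
    while True:
        before = len(circuit)
        for k in schedule:
            circuit = routines[k](circuit)
        if len(circuit) >= before:
            return circuit

def cancel_adjacent(kind):
    """Toy routine: remove adjacent pairs of the self-inverse gate `kind`."""
    def routine(circuit):
        out = []
        for gate in circuit:
            if gate == kind and out and out[-1] == kind:
                out.pop()
            else:
                out.append(gate)
        return out
    return routine

# Toy demo with two stand-in routines: the h pair only becomes adjacent after
# the x pair is removed, so a second pass over the schedule is needed.
toy = {1: cancel_adjacent("h"), 2: cancel_adjacent("x")}
print(run_schedule(["h", "x", "x", "h"], toy, (1, 2)))
```

The demo shows why the sequence is repeated: a cancellation found late in one pass can expose a cancellation for a routine that ran earlier in the sequence.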

Special-purpose optimizations
In addition to the general-purpose optimization algorithms described above, we employ two specialized optimizations to improve circuits with particular structures.
• LCR optimizer: Some quantum algorithms, such as product formula simulation algorithms, involve repeating a fixed block multiple times. To optimize such a circuit, we first run the optimizer on a single block to obtain its optimized version, O. To find simplifications across multiple blocks, we optimize the circuit O^2 and call the result LR, where L is the maximal prefix of O appearing in the optimized version of O^2. We then optimize O^3. Provided optimizations only occur near the boundaries between blocks, we can remove the prefix L and the suffix R from the optimized version of O^3, and call the remaining circuit C. Assuming we can find such L, C, and R (which is always the case in practice), we can simplify O^t to L C^(t−2) R.
• Toffoli decomposition: Many quantum algorithms are naturally described using Toffoli gates. Our optimizer can handle Toffoli gates with both positive and negative controls. Since we ultimately aim to express circuits over the gate set {not, cnot, h, r_z}, we must decompose the Toffoli gate in terms of these elementary gates. We take advantage of different ways of doing this to improve the quality of optimization.
Specifically, we expand the Toffoli gates in terms of one- and two-qubit gates using the identities shown in Figure 5, keeping in mind that we also obtain the desired Toffoli gate by exchanging t and t† in those circuit decompositions (because the Toffoli gate is self-inverse). Initially, the optimizer leaves the polarity of t/t† gates (i.e., the choice of which gates include the dagger and which do not) in each Toffoli decomposition undetermined. The optimizer symbolically processes the indeterminate t and t† gates by simply moving their locations in a given quantum circuit, keeping track of their relative polarities. The optimization is considered complete when movements of the indeterminate t and t† gates cannot further reduce the gate count. Finally, we choose the polarities of each Toffoli gate (subject to the fixed relationships between them) with the goal of minimizing the t count in the optimized circuit. We perform this minimization in a greedy way, choosing polarities for each Toffoli gate in the order of appearance of the associated t/t† gates in the nearly optimized circuit, so as to reduce the t count as much as possible.
Overall, this polarity selection process takes time O(g). After choosing the polarities, we run Routine 3 and Routine 2, since particular choices of polarities may lead to further cancellations of the cnot gates and single-qubit gates that were otherwise not possible due to the presence of the indeterminate gates blocking the desired commutations.
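Returning to the LCR optimizer described in the first item above, its extraction step can be sketched in Python as follows. The function `lcr`, the list-of-symbols circuit encoding, and the toy optimizer `cancel_inverses` (which stands in for the real optimizer) are all assumptions of this illustration.

```python
def cancel_inverses(circuit):
    """Toy optimizer: cancel adjacent inverse pairs, written as a letter
    followed by the same letter in the other case (for example 'A', 'a')."""
    out = []
    for gate in circuit:
        if out and out[-1] == gate.swapcase():
            out.pop()
        else:
            out.append(gate)
    return out

def lcr(block, optimize):
    """Given an already optimized block O, return (L, C, R) such that the
    optimized O^t is expected to equal L + C * (t - 2) + R."""
    o2 = optimize(block * 2)
    k = 0  # length of the maximal prefix of O that survives in optimized O^2
    while k < min(len(block), len(o2)) and block[k] == o2[k]:
        k += 1
    left, right = o2[:k], o2[k:]
    o3 = optimize(block * 3)
    cut = len(o3) - len(right)
    if cut < len(left) or o3[:len(left)] != left or o3[cut:] != right:
        raise ValueError("optimizations are not confined to block boundaries")
    return left, o3[len(left):cut], right

O = list("abcA")                      # 'A' is the inverse of 'a'
L, C, R = lcr(O, cancel_inverses)
print("".join(L), "".join(C), "".join(R))
t = 5
print(L + C * (t - 2) + R == cancel_inverses(O * t))
```

The benefit of this construction is that the optimizer only ever processes O, O^2 and O^3, regardless of how many times the block is repeated in the full circuit.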

Results
We implemented our optimizer in the Fortran programming language and tested it using three sets of benchmark circuits. All results were obtained using a machine with a 2.9 GHz Intel Core i5 processor and 8 GB of 1867 MHz DDR3 memory, running OS X El Capitan. We considered quantum circuits that include components of Shor's integer factoring algorithm, namely the quantum Fourier transform (QFT) and the integer adders. We also considered circuits for the product formula (PF) approach to Hamiltonian simulation [4]. In both cases, we focused on circuit sizes likely to be useful in applications that outperform classical computation, and ran experiments with different types of adders and product formulas. Finally, we considered a set of benchmark circuits from Ref. [1], consisting of various arithmetic circuits (including a family of Galois field multipliers) and implementations of multiple-control Toffoli gates. Files containing circuits before and after optimization are available at [29].
To check correctness of our optimizer, we verified the functional equivalence (i.e., equality of the corresponding unitary matrices) of various test circuits before and after optimization. Of course, such a test is only feasible for circuits with a small number of qubits. We performed this test for all 8-qubit benchmarks in Table 1 and Table 2, all 10-qubit benchmarks in Table 3, and the following benchmarks from Table 4: Mod 5_4, VBE-Adder_3, CSLA-MUX_3, RC-Adder_6, Mod-Red_21, Mod-Mult_55, Toff-Barenco_3..5, Toff-NC_3..5, GF(2^4)-Mult, and GF(2^5)-Mult.

QFT and adders
The QFT is a fundamental subroutine in quantum computation, appearing in many quantum algorithms with exponential speedup. The standard circuit for the exact n-qubit QFT uses r_z gates, some with angles that are exponentially small in n. It is well known that one can perform a highly accurate approximate QFT by omitting gates with very small rotation angles [7]. We choose to omit rotations by angles at most π/2^13, which ensures sufficient accuracy of the approximate QFT for circuits of the sizes we consider. These small rotations are removed before optimization, so their omission does not contribute to the improvements we report. In Figure 6 (inset) we plot total gate counts for the approximate QFT before and after optimization. We observe a savings ratio greater than 36% for the QFT with 512 or more qubits. The optimization comes entirely from reducing the number of r_z gates, the most expensive resource in a fault-tolerant implementation.
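To give a feel for the effect of the angle cutoff, the following sketch counts the controlled rotations in the textbook QFT layout with and without truncation. It rests on stated assumptions: the QFT is taken in its textbook form, where the rotation between qubits at distance k is a controlled phase of angle π/2^k, and the cutoff is applied to that angle. The benchmark circuits in the text are expressed over r_z and cnot gates, so their exact gate counts are not reproduced by this count; the function name and encoding are illustrative.

```python
def qft_rotations(n, cutoff=None):
    """Controlled rotations of the textbook n-qubit QFT as (control, target, k),
    where the rotation angle is pi / 2**k.

    If cutoff is given, rotations with angle at most pi / 2**cutoff
    (that is, k >= cutoff) are omitted.
    """
    kept = []
    for target in range(n):
        for k in range(1, n - target):
            if cutoff is not None and k >= cutoff:
                continue
            kept.append((target + k, target, k))
    return kept

for n in (16, 64, 256, 1024):
    exact = len(qft_rotations(n))
    approx = len(qft_rotations(n, cutoff=13))
    print(f"n={n:5d}  exact={exact:7d}  approximate={approx:6d}")
```

Under these assumptions the exact count grows quadratically in n, while the truncated count grows only linearly once n exceeds the cutoff distance.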
We consider two types of integer adders: an in-place modulo 2^q adder as implemented in the Quipper library [13] and an in-place adder based on the QFT [9] (hereafter denoted QFA). The QFA circuits use an approximate QFT in which the rotations by angles less than π/2^13 are removed, as described above. Adders are a basic component of Shor's quantum algorithm for integer factoring [30]. We report gate counts before and after optimization for the Quipper adders and the QFAs for circuits acting on 2^L qubits, with L ranging from 4 to 11. Adders with L = 10 are used in Shor's algorithm for factoring 1,024-bit numbers. Recall that the related RSA-1024 challenge remains unsolved [37].
The results of Light optimization of the adder circuits are shown in Table 1 and Figure 6. For the Quipper library adders, we used the standard Light optimizer. For the QFA optimization, we instead used a modified Light optimizer with the sequence of routines 1, 3, 2, 3, 1, 2, omitting the final three routines 4, 3, 2 of the full Light optimizer. We did this because we saw no additional gate savings from those routines in small instances (n ≤ 256).
Observe that the simplified Quipper library adder outperforms the QFA by a wide margin, suggesting that it may be preferred in practice. For the Quipper library adder, we see a reduction in the t gate count by a factor of up to 5.2. We emphasize that this reduction is obtained entirely by automated means, without using any prior knowledge of the circuit structure. Since Shor's integer factoring algorithm is dominated by the cost of modular exponentiation, which in turn relies primarily on integer addition, this optimization reduces the cost of executing the overall factoring algorithm by a factor of more than 5.
We also applied the Heavy optimizer to the QFT and adder circuits. For the QFT and QFA circuits, the Heavy setting does not improve the gate counts. The results of the Heavy optimization for the Quipper adder are shown in Table 2. We find a reduction in the cnot count by a factor of 2.7, compared to a factor of only 1.7 for the Light optimization. Figure 7 illustrates the total cnot counts of the Quipper library adder before optimization, after Light optimization, and after Heavy optimization, showing the reduction in the cnot count by the two types of optimization.

Quantum simulation
The first explicit polynomial-time quantum algorithm for simulating Hamiltonian dynamics was introduced in [23]. This approach was later generalized to higher-order product formulas [4], giving improved asymptotic complexities. We report gate counts before and after optimization for the PF algorithms of orders 1, 2, 4, and 6 (for orders higher than 1, the order of the standard Suzuki construction is even). For concreteness, we implement these algorithms for a one-dimensional Heisenberg model with periodic boundary conditions in a random, site-dependent magnetic field, evolving the system for a time proportional to its size, and choose the algorithm parameters to ensure the Hamiltonian simulation error is at most 10^-3 using known bounds on the error of the product formula approximation. The results of Light optimization of product formula algorithms are reported in Table 3 and illustrated in Figure 8. For these algorithms, we find that Heavy optimization offers no further improvement. The 2nd-, 4th-, and 6th-order algorithms admit a ∼33.3% reduction in the cnot count and a ∼28.5% reduction in the rz count, roughly corresponding to the reductions relevant to physical-level and logical-level implementations, respectively. The 1st-order formula algorithm did not exhibit cnot or rz gate optimization. In all product formula algorithms, the number of Phase and Hadamard gates was reduced significantly, by a factor of roughly 3 to 6.
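For readers unfamiliar with product formulas, the following Python sketch compares first- and second-order formulas numerically on a three-site Heisenberg ring. It is a toy illustration only: the field distribution (uniform on [-1, 1]), the evolution time, the number of steps, and all function names are assumptions, and the error is measured directly rather than chosen from analytic bounds as in the benchmarks above.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

def embed(ops, n):
    """Tensor product with ops = {site: 2x2 matrix} and identity elsewhere."""
    out = np.array([[1]], dtype=complex)
    for q in range(n):
        out = np.kron(out, ops.get(q, I2))
    return out

def heisenberg_terms(n, fields):
    """Terms XX + YY + ZZ on each bond of a ring, plus a field term h_j Z_j."""
    terms = []
    for j in range(n):
        k = (j + 1) % n  # periodic boundary conditions
        for P in (X, Y, Z):
            terms.append(embed({j: P, k: P}, n))
        terms.append(fields[j] * embed({j: Z}, n))
    return terms

def exp_i(h, t):
    """exp(-i t h) for Hermitian h, via eigendecomposition."""
    w, v = np.linalg.eigh(h)
    return (v * np.exp(-1j * t * w)) @ v.conj().T

def pf1(terms, t, r):
    """First-order formula: r repetitions of a product of term exponentials."""
    step = np.eye(terms[0].shape[0], dtype=complex)
    for h in terms:
        step = exp_i(h, t / r) @ step
    return np.linalg.matrix_power(step, r)

def pf2(terms, t, r):
    """Second-order (symmetric) formula: forward sweep, then reverse sweep."""
    dim = terms[0].shape[0]
    fwd = np.eye(dim, dtype=complex)
    for h in terms:
        fwd = exp_i(h, t / (2 * r)) @ fwd
    bwd = np.eye(dim, dtype=complex)
    for h in reversed(terms):
        bwd = exp_i(h, t / (2 * r)) @ bwd
    return np.linalg.matrix_power(bwd @ fwd, r)

def spectral_error(U, V):
    """Spectral norm of U - V."""
    return np.linalg.norm(U - V, ord=2)

n, t, r = 3, 1.0, 100
fields = np.random.default_rng(0).uniform(-1, 1, size=n)  # assumed distribution
terms = heisenberg_terms(n, fields)
exact = exp_i(sum(terms), t)
err1 = spectral_error(pf1(terms, t, r), exact)
err2 = spectral_error(pf2(terms, t, r), exact)
print(err1, err2)
```

At a fixed number of steps the symmetric formula is more accurate, which is why higher-order formulas can reach a target error with fewer repetitions at the price of more exponentials per step.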

Comparison with prior approaches
Quantum circuit optimization is already a well-developed field (see for example [1,27,31,33]). However, to the best of our knowledge, no prior work on circuit optimization has considered large-scale quantum circuits of the kind that could outperform classical computers. For instance, in [1], the complexity of optimizing a g-gate circuit is O(g^3) (sections 6.1 and 7), making optimization of large-scale circuits unrealistic. Reference [27] reports running times ranging from 0.07 to 1.883 seconds for numbers of qubits from n = 10 to 35 and gate counts from 60 to 368, whereas our optimizer ran for a comparable time when optimizing the Quipper adders up to n = 256 with around 23,000 gates, as shown in Table 1. Reference [31] relies on peep-hole optimization using optimal gate libraries. This is expensive, as is evidenced by the runtimes reported in Tables I and II therein, which already exceed 100 seconds for a 20-qubit, 1,000-gate circuit.
To compare our results to those reported previously, we consider t count, cnot count, and a scalar cost metric that accounts for the relative difficulty of performing cnot and t gates in a fault-tolerant implementation. While the t gate is considerably more expensive due to the need for state distillation [5], neglecting the cost of the cnot gates may lead to a significant underestimate if there are many such gates [26]. Roughly speaking, a fault-tolerant t gate may be about 10 to 100 times more expensive to implement than a local, fault-tolerant cnot gate. The true overhead depends on many details, including the fault tolerance scheme, the error model, the size of the computation, architectural restrictions, the extent to which the implementation of the t gate can be optimized, and whether t state production happens offline so its cost can be (partially) discounted; it is beyond the scope of this paper to account for all these factors. For a rough comparison, we suppose that the t gate is 20 times as expensive as a typical cnot gate, and we call the cnot gate count plus 20 times the t gate count the aggregate cost.
We directly compare our results to those reported in [1], which aims to reduce the t count and t depth using techniques based on matroid partitioning. We refer to that approach as t-par. We use our approach to optimize a set of benchmark circuits appearing in that work and compare the results with the t-par optimization, as shown in Table 4.
The benchmark circuits fall into three categories. The first set consists of a selection of arithmetic operations. For these circuits, we obtained better or matching t counts compared to [1] while also obtaining much better cnot counts. Note that we excluded circuit CSLA-MUX_3 from the comparison since we do not believe t-par optimized it correctly (for more detail, see the first footnote in Table 4). To illustrate the advantage of our approach using the aggregate cost metric, observe that we reduced the cost of the RC-Adder_6 circuit from 1,494 to 1,011.
The second set of benchmarks consists of multiple-control Toffoli gates. While our optimizer matched the t count obtained by t-par and substantially reduced the cnot count, neither our optimizer nor [1] could find the best known implementations constructed directly in [25]. This is not surprising, given the very different circuit structure employed in [25].
The third set of benchmarks contains Galois field multiplier circuits. We saw no advantage from the Heavy optimizer over the Light optimizer in the cases we tested, so we did not apply the Heavy optimizer to the four largest instances (the corresponding entries are left blank in Table 4). Our t count again matches that of the t-par optimizer, but our cnot count is much lower, resulting in circuits that are clearly preferable. For example, the optimized GF(2^64) multiplier circuit in [1] uses 180,892 cnot gates, whereas our optimized implementation uses only 24,765 cnot gates; the aggregate cost is thus reduced from 509,852 to 353,725 despite no change in the t count, illustrating the advantage of our approach. This comparison demonstrates that the discrepancy between t count and true cost predicted in theory [26] is manifested in practice. The efficiency of our Light optimizer allowed us to optimize the GF(2^131) and GF(2^163) multiplier quantum circuits, corresponding to instances of the elliptic curve discrete logarithm problem that remain unsolved [6]. Given the reported t-par runtimes [1], an instance of this size appears to be intractable for the t-par optimizer.
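The aggregate cost figures quoted for the GF(2^64) multiplier can be cross-checked against the definition (cnot count plus 20 times the t count) with a few lines of Python. The t count is not quoted in the text, so the sketch infers it from the t-par figures; the function name and the alternative weights of 10 and 100 (the ends of the range mentioned earlier) are illustrative assumptions.

```python
T_WEIGHT = 20  # assumed cost of a t gate relative to a cnot gate, as in the text

def aggregate_cost(cnot_count, t_count, t_weight=T_WEIGHT):
    """Aggregate cost: cnot count plus t_weight times the t count."""
    return cnot_count + t_weight * t_count

# GF(2^64) multiplier figures quoted in the text.
tpar_cnot, tpar_cost = 180_892, 509_852
ours_cnot, ours_cost = 24_765, 353_725

# Infer the (shared) t count from the t-par figures.
t_count, remainder = divmod(tpar_cost - tpar_cnot, T_WEIGHT)
assert remainder == 0

# The same t count must reproduce the aggregate cost of the optimized circuit.
assert aggregate_cost(ours_cnot, t_count) == ours_cost

# Sensitivity of the cost ratio to the assumed t-to-cnot weight.
for w in (10, 20, 100):
    ratio = aggregate_cost(ours_cnot, t_count, w) / aggregate_cost(tpar_cnot, t_count, w)
    print(w, round(ratio, 3))
```

The quoted figures are mutually consistent, and the relative advantage from the lower cnot count shrinks as the assumed t gate weight grows.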

Overall performance
Our numerical optimization results are summarized across Table 1, Table 2, Table 3, and Table 4. These tables contain benchmarks relevant to practical quantum computations that are beyond the reach of classical computers. In Table 1 and Table 2 these are the 1,024- and 2,048-qubit QFT and integer adders used in classically-intractable instances of Shor's factoring algorithm [37]. In Table 3 these include all instances with n ≥ 50, for which direct classical simulation of quantum dynamics is currently infeasible. In Table 4 these are Galois field multipliers over binary fields of sizes 131 and 163, which are relevant to quantum attacks on unsolved Certicom ECC Challenge problems [6]. This illustrates that our optimizer is capable of handling quantum circuits that are sufficiently large to be practically relevant.
Our optimizer can be applied more generally than previous work on circuit optimization. It readily accepts composite gates, such as Toffoli gates (which may have negated controls). It also handles gates with continuous parameters, a useful feature for algorithms that naturally use rz gates, including Hamiltonian simulation and factoring. Many quantum information processing technologies natively support such gates, including both trapped ions [8] and superconducting circuits [16], so our approach may be useful for optimizing physical-level circuits.
Fault-tolerant quantum computations generally rely on a discrete gate set, such as Clifford+t, and optimal Clifford+t implementations of rz gates are already known [21,32]. Nevertheless, the ability to optimize circuits with continuous parameters is also valuable in the fault-tolerant setting. This is because optimizing with respect to a natural continuously-parametrized gate set before compiling into a discrete fault-tolerant set will likely result in smaller final circuits.
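One simple way to see the benefit is that two rz gates on the same qubit combine into a single rz gate with the summed angle, so only one rotation needs to be compiled into the discrete gate set afterwards. The Python sketch below merges such rotations when no intervening gate acts on the qubit. It is a deliberately simplified illustration, not one of the optimizer's routines: the tuple-based circuit format and the function names are hypothetical, angles are reduced modulo 2π (equality up to a global phase), and no commutation of rz gates through other gates is attempted.

```python
import math

def qubits_of(gate):
    """Qubits touched by a gate in the assumed tuple format."""
    if gate[0] == "rz":      # ("rz", qubit, angle)
        return (gate[1],)
    return tuple(gate[1:])   # ("h", qubit) or ("cnot", control, target)

def merge_rz(circuit, tol=1e-12):
    """Merge rz gates on the same qubit that have no gate on that qubit between them."""
    out = []
    for gate in circuit:
        if gate[0] == "rz":
            _, q, theta = gate
            # Skip backwards over gates acting on other qubits only.
            i = len(out) - 1
            while i >= 0 and q not in qubits_of(out[i]):
                i -= 1
            if i >= 0 and out[i][0] == "rz":
                total = math.remainder(out[i][2] + theta, 2 * math.pi)
                if abs(total) < tol:
                    del out[i]               # rotations cancel up to global phase
                else:
                    out[i] = ("rz", q, total)
                continue
        out.append(gate)
    return out

circuit = [("rz", 0, 0.3), ("cnot", 1, 2), ("rz", 0, 0.5),
           ("h", 0), ("rz", 0, 1.0), ("rz", 0, -1.0)]
merged = merge_rz(circuit)
print(sum(g[0] == "rz" for g in circuit), "->", sum(g[0] == "rz" for g in merged))
```

Each rz gate removed at this stage saves an entire Clifford+t approximation sequence in the compiled circuit.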
Finally, unlike previous approaches [1,27,31], our optimizer preserves the structure of the original circuit. In particular, the set of two-qubit interactions used by the optimized circuit is a subset of those used in the original circuit. This holds because neither the preprocessing step nor our optimizations introduce any new two-qubit gates. By keeping the number of interactions under control (in stark contrast to t-par, which dramatically increases the set of interactions used), our optimized implementations are better suited for architectures with limited connectivity. For example, given a layout of the original quantum circuit on hardware with limited connectivity, this property allows one to use the same layout for the optimized circuit.
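The subset property is easy to test mechanically for any pair of circuits. The Python sketch below does so for a hypothetical circuit format in which each gate is a (name, qubits) pair; the format, the function names, and the choice to treat interactions as unordered qubit pairs are assumptions made for illustration.

```python
def interactions(circuit):
    """Set of unordered qubit pairs acted on by two-qubit gates."""
    return {frozenset(qubits) for _, qubits in circuit if len(qubits) == 2}

def preserves_interactions(original, optimized):
    """True if the optimized circuit uses no two-qubit interaction absent from the original."""
    return interactions(optimized) <= interactions(original)

original = [("h", (0,)), ("cnot", (0, 1)), ("rz", (1,)),
            ("cnot", (1, 2)), ("cnot", (0, 1))]
optimized = [("cnot", (0, 1)), ("rz", (1,)), ("cnot", (1, 2))]
rerouted = [("cnot", (0, 2)), ("rz", (1,))]  # introduces a new pair (0, 2)

print(preserves_interactions(original, optimized))
print(preserves_interactions(original, rerouted))
```

A circuit that passes this check can reuse the qubit placement chosen for the original circuit on connectivity-limited hardware.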

Conclusions and future work
In this paper, we studied the problem of optimizing large-scale quantum circuits, namely those appearing in quantum computations that are beyond the reach of classical computers. We developed Light and Heavy optimization algorithms and implemented them in software. Our algorithms are based on a carefully chosen sequence of basic optimizations, yet they achieve substantial reductions in the gate counts, improving over more mathematically sophisticated approaches such as t-par optimization [1]. The simplicity of our approach is reflected in very fast runtimes, especially using the Light version of the optimizer. We expect that further improvements can lead to even greater circuit optimization, as demonstrated by the Heavy version of our optimizer. To further improve the output, one could revise the routines for reducing rz count by implementing more extensive (and thus more computationally demanding) algorithms for composing stages of cnot and rz gates, possibly with some Hadamard gates included. One may also consider incorporating template-based [27] and peep-hole [31] optimizations. It may be worthwhile to expand the set of subcircuit rewriting rules and explore the performance of the approach on other benchmark circuits. Finally, considering the relative cost of different resources (e.g., different types of gates, ancilla qubits) could lead to optimizers that favorably trade off these resources.

Figure 2: Commutation rules. Top: Commuting an rz gate to the right. Bottom: Commuting a cnot gate to the right.

Figure 3: Gate count preserving rewriting rules employed in Routine 5.

Figure 7: Number of cnot gates for Quipper library adders. The points in red/blue/green represent the gate counts in pre-/post-Light/post-Heavy optimization, respectively.

Figure 8: Total gate count for product formula algorithms. The points in red/blue represent gate counts before/after optimization and the symbols square/circle represent gate counts for the 2nd-/4th-order formula, respectively.

Table 2: Heavy optimization of Quipper library adder.

Table 3: Optimization of product formula algorithms, showing the cnot gate count reduction (top) and the rz gate count reduction (bottom). Software runtimes range from 0.004 s (1st-order, n = 10) to 0.137 s (6th-order, n = 100). The Clifford gate reduction ranges from 62.5% for Hadamard and 75% for Phase gates (for the 1st-order formula, independent of n) to 75% for Hadamard and 85% for Phase gates (for the 6th-order formula, again independent of n). The notation "(× 1000)" indicates that the gate counts for the 1st-order formula are in units of thousands (no rounding errors). The notation "(L)" denotes the standard Light optimization.

Table 4: t-par comparison. The names of the algorithms are taken verbatim from Ref. [1], except that we write Toff-Barenco and Toff-NC to denote implementations of multiple-control Toffoli gates from [3] and [30], respectively. The notation "(L)" denotes the standard Light optimization, whereas "(H)" denotes the standard Heavy optimization. The symbol indicates that there was no improvement in the Heavy optimization over the Light optimization.
a [...] whereas it is supposed to perform the mapping |1024⟩ → |1088⟩.
b Note that our software reduced the t count of the original pre-optimization circuit used by t-par to 0. It turned out that the circuit used by t-par is incorrect. In our optimization reported in this table, we used the correct original circuit [36, Figure 5].