Synthesizing efficient circuits for Hamiltonian simulation

We provide an approach for compiling quantum simulation circuits that appear in Trotter, qDRIFT and multi-product formulas to Clifford and non-Clifford operations that can reduce the number of non-Clifford operations. The total number of gates, especially CNOT, reduce in many cases. We show that it is possible to implement an exponentiated sum of commuting Paulis with at most m (controlled)-rotation gates, where m is the number of distinct non-zero eigenvalues (ignoring sign). Thus we can collect mutually commuting Hamiltonian terms into groups satisfying one of several symmetries identified in this work. This allows an inexpensive simulation of the entire group of terms. We further show that the cost can in some cases be reduced by partially allocating Hamiltonian terms to several groups and provide a polynomial time classical algorithm that can greedily allocate the terms to appropriate groupings.


Introduction
One main reason that led Feynman [1] and others to propose the idea of quantum computers was the fact that problems like simulating the dynamics of quantum systems are intractable on a classical computer. Starting from the seminal work of Lloyd [2], much research [3] has been done to develop algorithms for simulating Hamiltonians, culminating in various techniques like product formulas [4,5], quantum walks [6], linear combination of unitaries [7], truncated Taylor series [8], and quantum signal processing [9]. Special techniques have been developed for simulating particular physical systems [10,11,12,13,14,15,16,17], which might find applications in developing new pharmaceuticals, catalysts and materials. Phase estimation can be combined with quantum simulation to find the ground state energy [18] and excited state energies [19,20,21] of the Hamiltonian. This is called the electronic structure problem [14], which is important in chemistry and material science. Research in quantum simulation has also inspired the development of quantum algorithms for various other problems [22,23,24,25,26].
One main challenge for digital quantum simulation is the implementation with efficient circuits that can produce reliable results. Without it, a theoretical exponential speedup may not lead to a useful algorithm if a typical practical application requires an amount of time and memory that is beyond the reach of even a quantum computer. There are a number of factors that can affect the efficiency of a quantum circuit i.e. its running time and error, for example, the number of qubits, depth, gate count, etc. So depending upon the applications or other hardware constraints, one can design algorithms that optimize or reduce the count/depth of one particular type of quantum gate or other resources. For example, there are algorithms that does T-count and T-depth-optimal synthesis [27,28,29] given a unitary or does re-synthesis of a given circuit with reduced T-count, T-depth [30,31,32] or CNOT-count [33,34,35]. The non-Clifford T gate has known constructions in most of the error correction schemes and the cost of fault-tolerantly implementing it exceeds the cost of the Clifford group gates by as much as a factor of hundred or more [36,37,38]. Quantum error correction and fault tolerance is especially significant for large quantum circuits, else the accumulation of errors will make any output highly unreliable and hence useless. The minimum number of T-gates required to implement certain unitaries is a quantifier of difficulty in many algorithms [39,40] that try to classically simulate quantum computation. So, even though alternative fault-tolerance methods such as completely transversal Clifford+T scheme [41] and anyonic quantum computing [42] are also being explored, minimization of the number of T gates in quantum circuits remain an important and widely studied goal. Multi-qubit gates like CNOT introduce more error than single qubit gates, so reducing CNOT gate is important and especially relevant for the noisy intermediate scale quantum (NISQ) computers.
Our contributions : (I) One main result in this paper is Lemma 2.4, which shows that it is possible to implement an exponentiated sum of commuting Paulis with at most m (controlled)-rotation gates, where m is the number of distinct non-zero eigenvalues (ignoring sign). For illustration we consider the Hamiltonian for the Heisenberg model and we show that it is possible to achieve about 50% reduction in the rotation gate cost and for certain underlying graphs this reduction can be about 75%. However, the cost of Toffolis may increase. We have given explicit circuits for 4-qubit and 6-qubit chain (or cycle), where we attempt to reduce both the rotation and Toffoli gate cost.
(II) In most previous works, circuits for individual exponentiated Paulis are synthesized and combined. We show that it is possible to reduce the gate count (not only non-Clifford gates) if we instead consider groups of commuting Paulis. To give some practical demonstration we consider the qDRIFT Hamiltonian simulation algorithm [49]. We call the error introduced due to the algorithm as 'simulation error'. We take the 1-D 4 qubit and 6 qubit Heisenberg Hamiltonians ( Figure 7) and also 4-qubit Hamiltonians for H 2 and LiH (with freezing in the STO-3G basis) (Figure 8), and compare the case where a single Pauli term is selected with the case where a set of commuting Pauli terms is selected for implementation at each iteration of qDRIFT. We observe that the error accumulation is less for multiple terms and also the rotation gate cost is less in this error regime. The number of Toffoli pairs is roughly equal to the number of R z /cR z used, in case of multiple terms. So overall, we have less T-count when implementing multiple commuting Paulis per iteration of qDRIFT. This adds to the motivation of building efficient circuits for such Hamiltonians.
(III) In Section 2.3 we derive explicit quantum circuits for the two-body excitation terms appear-ing in the Coulomb Hamiltonian in quantum chemistry. We mainly use the Clifford+T universal fault-tolerant gate set to implement unitaries. We design efficient circuits for different grouping of commuting Pauli operators. It is evident ( Table 2) that the rotation gate cost depends on the coefficients of the Pauli summands. For some combination of coefficients the circuits derived here are optimal, in the sense, that they have the minimum (i.e. 1) number of cR z gates. Though our focus is on reducing the non-Clifford gate count, but most of the quantum circuits derived here have an overall reduced gate count, including reduction in the 2-qubit gates like CNOT. In Table 1 we have compared the number of gates required to implement one of the Hamiltonians considered in this paper with a previous construction. For the remaining Hamiltonians we did not find any compact previous construction to compare with. In short, our approach can be useful not only in the fault-tolerant regime, but also in the NISQ era.
(IV) In Algorithm 1 we describe a greedy method of grouping into commuting Paulis, but the objective is to optimize the number of non-Clifford gates. There have been a host of work that tackles the question of how to group the commuting Paulis and to the best of our knowledge most (if not all) of them has the objective to reduce the number of measurements required to make an estimation [43]. The latter problem is especially important for variational quantum eigensolvers. The grouping that optimizes the non-Clifford gates may not optimize the number of measurements. In most cases, finding the optimal grouping is difficult. But we can always ask the question that given a grouping (for whatever objective), is it possible to compile efficient circuits. In this case, we can use our techniques (Lemma 2.4) to reduce the gate count. Thus our methods can also be used to design circuits for the measurement problem.
In this paper we use the Jordan-Wigner (JW) transformation [48] to map from the fermionic to the qubit space. And then we group into commuting Paulis. Other transformations like Bravyi-Kitaev and parity transformations [67] can also be used and may be beneficial in circumstances where Clifford operations are costly or inherent quantum error correction is desirable. We focus on Jordan Wigner for two reasons. First, in this paper we focus on the synthesis of efficient quantum circuits for exponentiated commuting Paulis and the techniques hold no matter whichever mapping is considered. Second, previous work has not shown obvious advantages for Bravyi-Kitaev transformations within the domain of fault-tolerant quantum computing.
How we compare the cost of non-Clifford resources : In all the constructions discussed in this paper, two approximately implementable gates are used -R z and controlled R z (cR z ), whose T-count varies inversely with precision or synthesis error. From the results given in [29] and from the implementations performed here until the error 10 −6 , we believe that T-count of cR z can be less than that of R z for most modestly small rotation angles. However, for convenience, we assume these have equal cost and with some abuse of terms, we refer to the T-count of R z /cR z as the '(non-Clifford) rotation gate cost'. The only exactly implementable non-Clifford unitary/gate considered in the constructions is Toffoli with T-count 7 [27] or 4 [45]. For low error regime, the T-count of approximately implementable R z /cR z will dominate, while in high error regime the Tcount of Toffoli may matter, if we use a lot of them. To reduce the T-count of compute-uncompute Toffoli pairs, we can use the temporary logical AND gadget, proposed by Gidney [46]. In fact, in our circuits, we use R z gates controlled on n qubits (n > 1), each of which can be decomposed into (compute-uncompute) pairs of NOT gates controlled on n qubits and a cR Z gate. Each such multi-controlled NOT can be implemented with n − 1 Toffoli. Alternatively, it can be implemented with 4n − 4 T-gates [46]. If we combine compute-uncompute pairs then the overall T-count of the circuit can reduce further, by using logical AND gadget. We must keep in mind that the implementations in [45,46] use classical resources and measurements, and it is not straightforward to argue that it will give advantage, inspite of using less number of T-gates. Alternatively, we can use the construction in [47] that implements an n-controlled NOT gate using 4n − 4 T, 4n − 3 CNOT and n − 1 ancillae qubits. In our paper we have expressed the non-Clifford T-gate cost in terms of the rotation gate cost and the number of Toffoli pairs used.

Related work :
In [50] the authors studied the non-Clifford resource cost required to simulate the chemical process of biological nitrogen fixation by nitrogenase. In [51] the authors developed algorithms to synthesize circuits for the Clifford operators that diagonalize a group of commuting Paulis. The goal was to reduce the two-qubit CNOT gate count because of its low fidelity and limited qubit connectivity of near-term quantum computer architectures. Similar diagonalization algorithm has been used in [52] for efficient simulation of Hamiltonian dynamics. Much work has been done for construction of quantum circuits for the evolution of molecular systems [53,16,54,55,56,57,58,59] and Heisenberg model [60].

Notation
In many places we write G (q) to denote that the gate or operator G acts on qubit q. For multi-qubit gates we write CN OT (c,t) to denote a CNOT with control at qubit c and target at qubit t. For convenience, we have removed the parenthesis in the subscript whenever there is less ambiguity. We write [K] = {1, 2, . . . , K}. We denote the n × n identity matrix by I n or I if dimension is clear from the context. We denote the set of n-qubit unitaries by U n . The size of an n-qubit unitary is N × N where N = 2 n . We have given detail description about the n-qubit Pauli operators (P n ), Clifford group (C n ) and the group (J n ) generated by the Clifford and T gates in Supplementary Note 1 (Appendix A).

Optimizing Trotter-Decompositions
The time evolution of a quantum system, described by a Hamiltonian H is e −iHt . Most often the Hamiltonian H can be decomposed as the sum H = m j=1 α j H j , where each H j is Hermitian. There can be more than one decomposition of H and we select the one such that for each H j the unitary e −iτ H j is efficiently implementable on a quantum computer, for any τ . The goal of the Hamiltonian simulation problem is to find an approximation of e −itH into a sequence of e −iτ H j , up to some desired precision. For example, using the Lie-Trotter formula [5] we have that In the non-asymptotic regime, the Trotter scheme provides a first-order approximation, with the norm of the difference between the exact and approximate time evolution scaling as O(t 2 /k). More advanced higher order schemes [4,3] are also available. Alternatively, a randomized approach called qDRIFT can be used in place of a Trotter formula wherein the quantum state is evolved according to the probabilistic channel Note the error here is also O(t 2 ); however, in this case a single exponential is performed rather than O(m) as would be needed for the comparable Trotter-formula. The cost of such an approach scales as O(|α| 2 1 t 2 / ) for error and does not directly depend on m. The approximation errors arising in the use of product formulas are caused by non-commuting terms in the Hamiltonian. For example, see [61] for a detail exposition on Trotter errors. Given any set of mutually commuting operators P 1 , . . . , P m we have the following.
Thus, the operators are partitioned into mutually commuting subsets. Time evolution for the sum of mutually commuting operators in each such subset is trivial, and the product formulas can be applied to the sum of Hamiltonians formed as the sum of each subset. This approach becomes especially applicable in scenarios where the Hamiltonian can be expressed as a sum of Pauli operators, for which the commutation relations can easily be evaluated. As a specific example, consider the case where H = aZ ⊗ Z ⊗ Z. Since the Hamiltonian is diagonal, e −iaZ⊗Z⊗Zt has computational basis vector |b 1 , b 2 , b 3 and eigenvalues e −i(−1) b 1 ⊕b 2 ⊕b 3 at . Thus the eigenvalues are determined by the parity of the bit strings, which can be computed using CNOT gates. From this reasoning the following quantum circuit will perform the simulation of this Pauli operator exactly.
As every Pauli operator of weight 3 can be diagonalized by Clifford conjugation, this circuit up to an elementary basis transformation, will simulate any weight 3 Pauli Hamiltonian. The exact same strategy of diagonalizing and simulating the Pauli operator in the eigenbasis shows that each exponential of a weight ν Pauli operator Hamiltonian requires 2(ν − 1) CNOT operators and one rotation gate. This strategy is at the heart of most elementary networks for simulating chemistry and spin models [53,62]. The work of [50] provided another way of thinking about these decompositions by showing an explicit method that can diagonalize sums of commuting operators that appear in chemistry simulations by transforming into a simultaneous eigenbasis of such terms. In full generality, such transformations reduce the circuit depth but need not reduce the circuit size. However, we will see here that for some Hamiltonians these transformations can reduce the circuit size as well.
As a motivating example, consider the Hamiltonian H = XX + Y Y + ZZ. This Hamiltonian can be simulated, up to a global phase, by This can be implemented using two Toffoli gates and a single qubit rotation. In contrast, the standard approach from [53,62] would use 3 single qubit rotations and no Toffoli gates. As rotation synthesis often is 10 times more expensive than Toffoli gates [50,27,29], this will almost always be a favorable way of performing the simulation. In contrast, if this symmetry is broken then the Hamiltonian term will be more expensive to simulate. Thus it can be favorable to introduce such symmetries as needed artificially. For example, consider Such a simulation can be performed using two rotation gates rather than the 3 naïvely needed and so it makes sense to compile the Hamiltonian terms this way to reduce the overall complexity. As another example, not all rotations are equally expensive and so we should also combine terms in such a way as to minimize the cost. For example consider the time-evolution operator While the first operation in this Trotterization is not a Clifford operation, it is a simulation of a Hadamard gate for time π/4. As this corresponds to a special angle and since the Hadamard gate can be diagonalized using a constant size H and T circuit, the cost of implementing this first term is O(1) and thus the dominant cost is the remaining rotation. In contrast, if this property were not used then we would have two arbitrary rotations in the Trotterization which would be nearly twice the cost of this simplified approach. These ideas can further be used in concert: remainder terms that arise from inexactly rounding a Hamiltonian evolution to a known cheap simulation can be absorbed into other terms or even other Trotter steps.
We propose an algorithm in Algorithm 1 that exploits this intuition through a greedy decomposition of the Hamiltonian into sums of commuting terms. These mutually commuting terms, or fragments, are chosen such that the ratio of the fraction of the Hamiltonian that is simulated by the term to the cost of the term is maximized. This choice is motivated in part by the fact that the query complexity of a quantum simulation is lower bounded by Ω(|α| 1 t) [68] and thus designing circuits that simulate as large of a fraction of this one-norm as possible per quantum gate operation is a sensible optimization heuristic for our greedy algorithm. Unlike traditional approaches to partitioning the Hamiltonian, our approach allows partial allocation of Hamiltonian terms to multiple commuting sets. Further, the allocation can be negative in our approach. This negative allocation is important because we will see that in some cases the introduction of more Hamiltonian weights on some terms can be more than offset by the reduced costs of simulating the fragment.
The number of optimization steps required for our greedy algorithm is at most O(m 2 ). To see this, assume that the optimal strategy involves µ iterations of the outer loop for µ ∈ Ω(m) and assume that the inner loop optimization requires ν iterations. Since COST≥ 1 it holds that Γ max ≤ j |α j | − j |α j − β j |. Assume that j |α j | − j |α j − β j | < |α | ∞ . In this case, by assumption there exists a trivial solution that outperforms this where the largest term is simulated in isolation at cost 1. Therefore we must have that j |α j | − j |α j − β j | ≥ |α | ∞ . Then from standard norm inequalities we have that |α | ∞ ≥ |α | 1 /m. Thus the one-norm of the vector is given by a first order difference equation of the form |α (j+1) | 1 ≤ (1 − 1/m)|α (j) | 1 . The general solution to this is (1 − 1/m) j |α| 1 which is for j ∈ O(log(1/ )/ log(1/(1 − 1/m))) ∈ O(m log(1/ )). This implies that µ ∈ O(m log(1/ ). Next ν is the maximum number of iterations for the inner loop. Since each iteration continues until the total number of terms remaining is reduced by one we have that ν ∈ O(m). Thus the total number of iteration steps is µν ∈ O(m 2 log(1/ )). This shows

Algorithm 1: Hamiltonian Compilation Using Greedy 1-norm Minimization
Data: H = m j=1 α j P j for distinct Pauli P j and α j ≥ 0, cost function COST : R m → Z + such that COST( x) is the cost of simulating H = j x j P j , where x = (x 1 , . . . , x m ) and COST( x) = 1 if x contains at most 1 non-zero entry. that the algorithm scales polynomially with the number of terms if the optimization process is also efficient. The cost of optimization can vary strongly depending on the continuity / convexity of the objective function and without making further assumptions we cannot assume that the optima over β can be found in polynomial time. If we assume, however, that the optimizer works by considering one of a polynomial number of potential circuits for simulating the terms and then uses linear programming to find the optimal value of β, we have that the optimization problem can be solved in polynomial time on a classical computer. Such a choice corresponds exactly to the discussion in the next sections, where we propose the use of a discrete set of optimization strategies for simulating chemistry that can then be used within Algorithm 1 to greedily find the best possible simulation circuit given these discrete set of optimizations for the value of β chosen.

Truncating Hamiltonian :
We can terminate Algorithm 1 before all terms are allocated i.e. we output {H j = h j i P i : j = 1, . . . , m } such that m j=1 H j =H = H. This leads to truncation errors in our simulation algorithm that will be present even if an algorithm such as qDRIFT is used for the simulation. We show here that if we truncate some terms of the Hamiltonian, then the error incurred is at most twice the error incurred from the complete Hamiltonian simulation by qDRIFT, given that the distance of the truncated and given Hamiltonian is at most square root of the qDRIFT simulation error. We do this because in some cases we may be able to simulate the truncated Hamiltonian with less number of gates.
Suppose we write the given Hamiltonian as follows.
Here each H j is a Hermitian matrix for which an efficient simulation circuit exists. The protocol working with the truncated HamiltonianH, samples each H j independently with probability The error per iteration of qDRIFT, i.e. N , is given by bounding the diamond distance between the channel U N (ρ) corresponding to the Hamiltonian H and the channelẼ(ρ) implemented by the protocol.
Lemma 2.1. The error observed when there are N time-steps taken using a qDRIFT channel, N , as quantified by the diamond distance as a function of the truncation error in the Hamiltonian δ is The proof has been given in Supplementary Method 4 (Appendix E : Lemma E.2). Thus the total error after all repetitions is as follows.
This shows that if δ ∈ O( √ qDRIF T ) then the asymptotic scaling is not impacted by the exclusion of the terms from the Hamiltonian.
Expected cost : Let the cost of implementing the unitary e itw j L j /N be c j . Cost can be defined in many ways, like total number of gates, number of non-Clifford gates like T or Toffoli gate, number of multi-qubit gates like CNOT, etc. In our paper we focus mainly on the number of non-Clifford gates. Let C N be the variable denoting the cost per repetition of our protocol. Then the expected cost and the variance per repetition is as follows.
By Chebyshev's inequality (Supplementary Note 1 : Appendix A) we have the following for some real number k > 0.
The cost per repetition of our protocol is a bounded variable i.e. a ≤ C N ≤ b, for some real numbers a, b. If C is the variable denoting the cost of all repetitions of our protocol, then and since each repetition is independent, making the corresponding cost variables per repetition distributed identically and independently, so we apply Hoeffding's inequality (Supplementary Note 1 : Appendix A) and obtain the following.
Thus with probability at least 1 − c , the cost of all repetitions of the protocol is at most Error in simulation while sampling multiple Paulis : We consider the qDRIFT protocol [49] for simulating Hamiltonians. If H = j h j H j , then in each iteration we sample H j with probability proportional to h j and then simulate it for a short time period. Now H j can be a single Pauli operator or a sum of commuting Paulis, as is achieved in Algorithm 1, to optimize the cost of simulation. Here we derive a bound on the difference in simulation error for these two cases.
Let H j = L j i j =1 P i j -sum over commuting Paulis and M be the total number of Pauli operators. So the Hamiltonian can be written as We assume the most general case where a single Pauli can be shared between multiple commuting groups i.e. H j .
In the first case, a group of commuting Paulis i.e one of the H j is selected independently with probability q j = h j j h j . In the second case, one single Pauli operator P k is sampled independently with probability p k = j h j i h i L i , where in the numerator the sum is over all the commuting Pauli groups in which P k appears. Let λ = j h j and λ = j h j L j . We define the Liouvillian that generates unitaries under Hamiltonian H j and P i j so that λ , that evolves the superoperators L j and L ij for time interval τ = λt N and τ = λ t N respectively. Here we note that for the second channel, for each Pauli P k , we have expanded the sum p k = j h j λ to reflect the commuting groups in which it belongs. Thus M k=1 p k = L j=1 L j i j =1 p j . Then we can prove the following.
Lemma 2.2. The distance between the qDRIFT channel with single and grouped Hamiltonian terms for simulation time t using N time steps obeys The proof has been given in Supplementary Method 4 (Appendix E : Lemma E.1).

Optimized circuits for quantum chemistry
In this section we review quantum algorithms for quantum chemistry and design efficient circuits that are useful for quantum chemistry simulation within the Trotter-Suzuki formalism. The electronic structure problem has emerged as a central application of quantum computers in recent years, with quantum algorithms providing potential exponential speedups relative to the best known classical algorithms [53,50]. The electronic structure problem more specifically is, for a fixed set of positions of the nuclei, find the configuration of electrons that minimizes the total energy for a fixed number of electrons. The properties of materials, molecules and atoms at low temperatures emerge from these energies. In the non-relativistic case, the dynamics of these electrons are governed by the Coulomb Hamiltonian.
where we have used atomic units, r i represent the positions of electrons, R i represent the positions of nuclei, and ζ i are the charges of nuclei.
Following the strategy outlined in [13], we select the second quantization and discretize the Hamiltonian by representing it within some canonical basis such as a Gaussian basis or a planewave basis. Under the above assumptions, the electronic Hamiltonian can be represented in terms of creation and annihilation operators as follows [63,64]. Each spin orbital is assigned a (distinct) qubit where the state |1 corresponds to an occupied orbital and |0 an unoccupied orbital. Specifically, let a † p and a p be the fermionic raising and lowering operators acting on spin-orbital p satisfying the anti-commutation relation where the coefficients h pq , h pqrs are determined by the discrete basis set chosen, and the sums run over the number of discretization elements or basis set for a single particle. From inspection, we can see that the number of terms in Equation 14 is O(N 4 ), where N is the size of the discrete representation. The molecular orbitals are one widely used basis set. These, in turn, can be expressed as linear combinations of atomic basis functions [65,66]. The coefficients of this expansion are obtained by solving the set of Hartree-Fock equations that arise from the variational minimization of the energy using a single determinant wave function. Thus in this representation the location of (indistinguishable) electrons are specified by the occupations of the discrete basis. The Jordan-Wigner [48] or Bravyi-Kitaev [67] transformations are commonly used to convert the fermionic creation and annihilation operators into Pauli operators. For example, within the Jordan-Wigner encoding, a and a † can be written in terms of qubit operators as follows.
) are the qubit creation and annihilation operators respectively. j Z (j) acts as an exchange-phase factor, accounting for the anti-commutation relations of a and a † .
With these tools in place, the second-order Trotter-Suzuki approximation reads Such a simulation can then be performed by substituting in the Pauli representation yielded by the Jordan-Wigner transformation. Higher-order versions of this are also known [68] that can achieve error scaling O(t 2k+1 ); however, we do not focus on such cases here since the optimizations to the operator exponentials that we consider here will apply in all such cases.
Optimizing two-body operator exponentials : The two-body terms are the most common, and often the most significant, contribution to the complexity of a simulation of the Coulomb Hamiltonian in second quantization [10]. In this section we consider the general two-body double excitation terms to reduce this dominant cost for simulation of chemistry, which when expressed using the Jordan-Wigner transformation, can be written as product of X, Y, Z operators as follows [53]. We have removed the parentheses in the subscripts, for convenience. and Note that if a Gaussian orbital basis is chosen then the values of h pqrs are typically real, resulting in H i = 0. We will assume in the remainder of the discussion that such terms are zero and focus our attention on only the real part of the Hamiltonian. If we define h 1 = (h pqrs δ XpXs δ XqXr −h qprs δ XpXr δ XqXs ), h 2 = (h psqr δ XpXr δ XqXs −h spqr δ XpXq δ XrXs ) and h 3 = (h prsq δ XpXq δ XrXs −h prqs δ XpXs δ XqXr ), for distinct p, q, r, s then we have the following [53].
Thus conventionally, the part of the Hamiltonian which can be expressed in the form of Equation 16, are broken down into groups of at most 8 commuting operators that act on the qubits in question. Each term is diagonalized by a Clifford circuit and the evolution is performed based on this, with some R z gates. In [50] the authors diagonalize all 8 terms in the simultaneous eigenbasis and parallelizes all 8 R z gates. This reduces the number of Clifford gates, depth, but comes at the cost of using extra 4 ancillae. Excluding the diagonalizing circuits on both sides they use 32 CNOTs and 8 R Z . Each diagonalizing circuit uses 3 CNOT and 1 H gate. Our goal in this section is to design more efficient quantum circuits for the double excitation terms. In Table 1 we have compared the gate costs of the circuit in [50] with the circuits derived by us in each of the 3 cases considered by us. Fermionic SWAP gates [69,55] can be used to make the orbitals neigboring and hence get rid of the tensor product of Z terms. So from here on, we ignore these terms. Let q 1 , q 2 , q 3 , q 4 be the qubits to which the fermions in the orbitals p, q, r, s are mapped respectively. We follow the technique used in [50]. W = CN OT (1,2) CN OT (1,3) CN OT (1,4) H (1) is the unitary diagonalizing the 8 terms in the simultaneous eigenbasis. We re-write the Hamiltonian H with general coefficients a 0 , . . . , a 7 ⊂ R. Unless mentioned, the leftmost operator acts on qubit q 1 , next ones on q 2 , q 3 and the rightmost on qubit q 4 .
Then following the arguments in [50] we have the following.
The terms in between W and W † add an overall phase φ. We denote the state of the qubits q 1 , . . . , q 4 after application of W by variables x 1 , . . . , x 4 respectively. It is sufficient to analyse the phase when the state is in the standard basis. Consider e −ia 0 ZIIIt -this term contributes a phase of −a 0 t if x 1 = 0 and a 0 t if x 1 = 1. Similarly e ia 1 ZZIIt contributes a phase of a 1 t if x 1 ⊕ x 2 = 0 and vice versa. Thus we can have the following expression for the overall phase.
For different values of a 0 , . . . a 7 we get different value of overall phase and different circuits. We consider the following three cases. It is easy to see that φ . So in all the cases below it is sufficient to calculate the phase while setting x 1 = 0.
Case I : Let a 1 t = a 6 t = −θ and the remaining a 0 t = a 2 t = . . . = a 5 t = a 7 t = θ. Then we can verify that φ = 8θ if Case II : The quantum circuit simulating e −iHt is shown in Figure 1c.     [50]. The gate counts have been given for the 3 different cases considered in this work.
We already remarked that we can ignore the product of Z terms in Equations 16 and 18 by using fermionic SWAP gates. Now if we take two Hamiltonians of the form 19 having some overlapping qubits, then we can get different Hamiltonians by rearranging the commuting Paulis. In the next few subsections we design circuits for the corresponding exponentials of these Hamiltonians. We must keep in mind that in the following subsections P 0 = X and P 1 = Y, i = i + 1 mod 2. Table 2 summarizes the number of non-Clifford gates used to implement the various circuits. All rotation gates with n (> 1) controls, can be decomposed into cR z (single control) and NOT with n controls, each of which can be decomposed into n − 1 Toffolis (as shown in Figure 1b). We have discussed in 'Introduction' about special gadgets that can be used to further reduce the T-count of the circuits. In Figure 1e, 1f we show how Toffolis can be reduced in segments of the circuits. Our circuits have less gates (even the Clifford gates), compared to [50] or the approaches where we synthesize circuit for each exponentiated Pauli and then concatenate them. In fact, we show the dependence of the circuit size or Clifford and non-Clifford gate cost on the coefficients of the commuting Paulis in the Hamiltonian expression.
Overlap on 1 qubit : Previously, we provided an analysis of the circuits for cases where many of the Hamiltonian coefficients are chosen to follow regular patterns and see that the costs of the simulation can be reduced through the use of these techniques. Here we provide a more aggressive strategy wherein we combine multiple commuting terms together and find particular combinations of angles such that the simulation circuits are efficient. The results are summarized in Table 2. We consider the case when there is overlap on 1 qubit. We can have the following sets of commuting Paulis.
Without loss of generality, we assume that the leftmost operator acts on qubit q 1 , next one on q 2 and so on -rightmost one acts on qubit q 7 . We denote a state vector as |Q 1 vQ 2 where Q 1 = |q 1 q 2 q 3 , Q 2 = |q 5 q 6 q 7 and v, q 1 , . . . , q 7 ∈ {0, 1}. We can have the following Hamiltonian terms, expressed as sums of commuting Paulis from the above two sets.  Circuit for simulating e −iH 1y t : Let W 1y be the unitary consisting of the following sequence of gates. The rightmost one is the first to be applied. With a slight abuse of notation we denote CN OT (4,1) CN OT (4,2) CN OT (4,3) by CN OT (4,I) and CN OT (4,5) CN OT (4,6) CN OT (4,7) by CN OT (4,II) .
In the following theorem we show that this is a diagonalizing circuit for the set of Paulis in G 1y .
(c) when the coefficients are as in Equation 18. Thus we have the following.
The state of the qubits q 1 , . . . , q 7 after the application of W 1y is denoted by variables x 1 , . . . , x 7 respectively. We have the following expression for the overall phase incurred between W 1y and W † 1y .
It is easy to check that φ x 4 = −φ x 4 . We consider the following three cases and it is sufficient to check the phase values when x 4 = 0.
Following the conventions and explanations given for Case I we have the following overall phase after the application of W 1y . We can write φ = f 1 (θ 1 ) + f 2 (θ), for two functions f 1 and f 2 . The following can be verified.
3. For any other values of x 1 , x 2 , x 3 , φ = f 2 (θ 2 ) and analogously, for any other values of x 7 , x 6 , x 5 , A quantum circuit simulating e −iH 1y t is shown in Figure 2a.
Case II : Now we consider the case when a 6 t = a 3 t = a 5 t = a 7 t = θ 1 and Here also φ can be written as sum of two functions : φ = f 1 (θ 1 ) + f 2 (θ 2 ). We can make the following observations.
2. If any two of . Similarly, if any two of x 5 , x 6 , x 7 is 1 then A circuit simulating e −iH 1y t in this case, has been shown in Figure 2b.
2. Suppose x i = x j and x k = x i , where i, j, k ∈ {1, 2, 3} and i = j = k. Then flipping the values changes the sign. For example, if φ . Similar phenomenon occurs if i, j, k ∈ {7, 6, 5}, except this time sign of f 2 (g) flips.. So it is sufficient to consider the case when two variables are 1.
A circuit simulating e −iH 1y t in this case, has been shown in Figure 2c.
Circuit for simulating e −iH 1x t : An eigenbasis for the Paulis in G 1x has been given in Lemma B.2 of Supplementary Method 1 (Appendix B). But we are unable to find out (by hand) a unitary (analogous to W 1y ) that diagonalizes the set of commuting Paulis in G 1x , as we did in the previous subsection for G 1y . So we divide the commuting Paulis into two groups of 4-qubit Paulis, i.e. we consider the following two sets.
and the following two Hamiltonians We can use the diagonalizing circuit of [50] and have the following.
It is easy to verify that a non-zero phase φ 1 = −4θ 1 exists if and only if Case II : Now we consider the case when Case III : Next we consider the case where Here, non-zero phase exists if any two of the variable have same value.
Circuits simulating e −iH 1x1 t in Case I, II and III have been shown in Figure 2d, 2e and 2f respectively. Circuits for e −iH 1x2 t are similar. Circuit for e −iH 1x t in each case is obtained by concatenating the corresponding circuits.

Overlap on 2 qubits :
In general, the more options that we have for grouping mutually commuting terms the more effective our compilation strategy will be. While the most natural case to examine is the case where all of the Hamiltonian terms act on disjoint sets of qubits, Hamiltonian terms can commute if they overlap on only two qubits as well. For example, we can have the following sets of commuting Pauli operations (27) Without loss of generality, we assume that the leftmost operator acts on qubit q 1 , next one on q 2 and so on -rightmost one acts on qubit q 6 . We denote a state vector as |Q 1 Q 2 Q 3 where Q 1 = |q 1 q 2 , Q 2 = |q 3 q 4 and Q 3 = |q 5 q 6 are the first, second and third pairs of qubits respectively. We can have the following Hamiltonian terms, expressed as sums of commuting Paulis from the above two sets.
Circuit for simulating e −iH 21 t : As before our simulation strategy involves diagonalizing the Hamiltonian using a Clifford circuit and then build Let W 1 be the unitary consisting of the following sequence of gates. The rightmost one is the first to be applied.
The following theorem shows that this is a diagonalizing circuit for the set of Paulis in G 21 .
Theorem 2.2. For each i, j, k, l ∈ Z 2 , such that P k P l P i P j II, IIP i P j P k P l ∈ G 21 we have the following.
The proof is similar to Theorem 2.1 and has been given in Supplementary Method 2 (Appendix C). Theorem 2.2 then gives us the following. We denote the state of the qubits q 1 , . . . , q 6 after the application of W 1 by the variables x 1 , . . . , x 6 respectively. We have the following expression for the overall phase incurred between W 1 and W † 1 .
It is easy to check that φ x 3 = −φ x 3 . We consider the following cases and it is sufficient to check the phase values when x 3 = 0. Case I (II) : We consider the case when There are no a 1 , a 6 , b 1 , b 6 in the expression of the Hamiltonian. So, for consistency with the previous and following subsection, we can consider this as either Case I or II.
Circuit for simulating e −iH 20 t : An eigenbasis for the Paulis in G 20 has been shown in Lemma C.2 of Supplementary Method 2 (Appendix C). But since we have been unable to derive a diagonalizing circuit, so we divide the commuting Paulis into two groups of 4-qubit Paulis as follows.
G 201 = {P k P l P i P j II : i + j, k + l ≡ 0 mod 2.} G 202 = {IIP i P j P k P l : i + j, k + l ≡ 0 mod 2.} We get the following two Hamiltonians.
Using the diagonalizing circuit of [50] and have the following. where the rightmost gate is the first one to be applied. We denote the state of the qubits q 1 , . . . , q 4 and q 3 , . . . , q 6 after the application of W 01 and W 02 by the variables x 1 , . . . , x 4 and x 3 , . . . , x 6 respectively. We have the following expression for the overall phase incurred between W 01 , W † 01 and between W 02 , W † 02 .
Circuits simulating e −iH 201 t in Case I, II and III have been shown in Figure 3c, 3d and 3e respectively. Circuits for e −iH 202 t are similar. Circuit for e −iH 20 in each of these cases is obtained by concatenating the corresponding circuits.

Overlap on 3 qubits :
Now we consider the case when there is overlap on 3 qubits. We can have the following sets of commuting Paulis.
Without loss of generality, we assume that the leftmost operator acts on qubit q 1 , next one on q 2 and so on -rightmost one acts on qubit q 5 . We denote a state vector as |Q 1 q 2 q 3 q 4 Q 2 where Q 1 = |q 1 , Q 2 = |q 5 and q 1 , . . . , q 5 ∈ {0, 1}. We can have the following Hamiltonian terms, expressed as sums of commuting Paulis from the above two sets.
Circuit for simulating e −iH 3y t : Let W 3y be the unitary consisting of the following sequence of gates. The rightmost one is the first to be applied. With a slight abuse of notation we denote CN OT (c,t 1 ) CN OT (c,t 2 ) CN OT (c,t 3 ) . . . by CN OT (c;t 1 ,t 2 ,t 3 ,...) (multi-target CNOT).
The proof is similar to Theorem 2.1 and has been shown in Supplementary Method 3 (Appendix D). Thus we have the following. We denote the state of the qubits q 1 , . . . , q 5 after the application of W 3y by the variables x 1 , . . . , x 5 respectively. We have the following expression for the overall phase incurred between W 3y and W † 3y .
It is easy to verify that φ x 2 = −φ x 2 . We consider the following cases and it is sufficient to check the phase values when x 2 = 0.
Case I : We consider the case when . So we concentrate on x 1 = x 5 = 0. If x 3 = x 4 = 1 then φ = −4θ 1 and if x 3 = 0, x 4 = 1 then φ = 4θ 2 . A quantum circuit simulating e −iH 3y t has been shown in Figure 4a. If θ 1 = θ 2 then we can have a further reduction of controlled rotation gates, as shown in Figure 4b.
A circuit simulating e −iH 3yt in this case has been shown in Figure 4d. If h 2 = g 1 , h 3 = g 3 , h 1 = g 2 then we can have a simpler circuit, as shown in Figure 4e.
Circuit for simulating e −iH 3x t : The diagonalizing transformation for the Pauli operators in G 3x is shown in Lemma D.2 of Supplementary Method 3 (Appendix D). Since we have been unable to find a diagonalizing circuit, so we divide the commuting Paulis into two groups of 4-qubit Paulis, and have the following two Hamiltonians.
Case I : We consider the case when We have the following phases.
Circuits simulating e −iH 3x1 t in Case I, II and III have been shown in Figure 4f, 4g and 4h respectively. Circuits for e −iH 3x2 t are similar. Circuit for e −iH 3x t in each case is obtained by concatenating the corresponding circuits.

Circuit for arbitrary exponentiated Hamiltonians
Our previous discussion focuses on the case of fermionic simulation within a Jordan-Wigner representation using Hamiltonian terms that are fermionically swapped to be adjacent to each other. While these simulation circuits are among the most important for applications in chemistry, it does not necessarily represent all cases of physical interest let alone chemistry. Here we address this by discussing ways to synthesize circuits for arbitrary exponentiated Hamiltonians in C 2 n ×2 n , expressible as sum of Pauli operators, with an aim to reduce the number of non-Clifford resources. For reasons discussed previously, it is enough to consider a Hamiltonian H expressed as sum of commuting Pauli operators.
In most cases one synthesizes circuit for each e −iα i P i t using a number of CNOT and one R z gate. Thus the number of R z gates required is equal to the number of summands. Here we describe procedure to synthesize circuit for e −iHt i.e. considering multiple summands or Pauli operators. We diagonalize H, for example, by using the algorithms in [51]. In the previous section we have constructed explicit eigenbases for the diagonalization of some specific Hamiltonians. Then we get the following.
With each Q i we associate an n-length vector y i = (y i1 , . . . , y in ) ∈ {0, 1} n such that (y i ) j = y ij = 1 if Q ij = Z, else y ij = 0. Let x 1 , . . . , x n ∈ {0, 1} and |0 = [1, 0] T , |1 = [0, 1] T . The eigenvectors of H are of the form |v = n j=1 |x j and the corresponding eigenvalue is Proof. The summands in H are mutually commuting and so they have a common eigenbasis. Let us first consider 1}, so we have the following.
We can also interpret φ as the overall phase incurred between W and W † . For given values of α i , φ is an n-variable Boolean function where x j are the Boolean variables. We can evaluate a truth table and get all the 2 n values of φ for different values of (x 1 , . . . , x n ) ∈ {0, 1} n . For each distinct non-zero absolute phase value |θ| (ignoring sign), we can have a sub-circuit C |θ| that has only one controlled rotation cR z (2θ) gate. The complete circuit can be obtained by combining these different sub-circuits (one for each |θ| = 0), in between the diagonalizing Clifford circuits W, W † . The ordering of the sub-circuits do not matter. Now we discuss how to synthesize sub-circuit C |θ| , for one such distinct absolute value of φ. Let M θ be the set of binary values for variables x 1 , x 2 , . . . , x n , such that φ computes to θ in Equation 35.
Analogously we can define M −θ . We can also associate M θ and M −θ with S θ and S −θ , the sets of eigenvectors with eigenvalues θ and −θ respectively, as obtained from Lemma 2.3. We define the following operators, which acts on the input vector space and the space of two ancillae -c and r, the latter being initialized to 0.
The circuit If the input vector is in S θ or S −θ then both these operators flip a control ancilla qubit (c). Additionally, if the vector is in S −θ then the second ancilla r is flipped. We apply a cR z (2θ) gate on r, controlled on c. Thus if the input vector is in S −θ then we actually apply cR z (−2θ). The ancillae c, r can be controlled by multi-controlled X gates, that can be further decomposed in terms of Toffoli and CNOT gates [72,73]. For example, let M θ = {(0, 0, 1, 1), (0, 1, 1, 1)} and M −θ = {(1, 1, 0, 0)}. The two Boolean min-terms of M θ can be compressed to have a single term because when x 1 = 0, x 3 = x 4 = 1 then φ = θ, irrespective of the value of x 2 . We call it the 'don't care condition' for x 2 . So, equivalently we can write M θ = {(0, * , 1, 1)}, where * denotes the don't-care condition. In general, algorithms like Karnaugh map [70], ESPRESSO [71] can be used to get compact set of Boolean min-terms. A circuit C |θ| has been shown in Figure 5.
Hence, due to the invariance of the point spectrum of unitarily equivalent operators we have the following. Illustration -Quantum Heisenberg and Quantum Ising model : We consider the problem of designing quantum circuits for simulating the quantum Heisenberg and Ising model with Hamiltonians H H and H I respectively. The Heisenberg Hamiltonian is widely used to study magnetic systems, where the magnetic spins are treated quantum mechanically [74,75,76,60,77]. Let G = (E, V ) be the underlying graph with the vertex and edge set being V and E respectively.
In the above J x , J y , J z are coupling parameters, denoting the exchange interaction between nearest neighbor spins along the X,Y,Z direction respectively. d h , d h is the time amplitude of the external magnetic field along the Z-direction. One set of commuting Paulis are {X (i) X (j) : (i, j) ∈ E}, Let us first consider the set {Z (i) : i ∈ V }. Following the previous discussions and Lemma 2.3, the overall phase incurred or the eigenvalues are as follows.
For x ∈ {0, 1} |V | , one particular assignment of values to the Boolean variables, let T 0 = {i ∈ V : So the number of distinct non-zero eigenvalues or absolute values of φ can be |V |/2 . Implementing each e −id h Z (i) t would require |V | rotation gates. Thus, using Lemma 2.4, we have about 50% reduction in the rotation gate cost. Now, let us consider the other commuting sets. Since H † XH = Z and (HSX) † Y (HSX) = Z, so each of the above sets can be diagonalized and we can focus on the problem of simulating a quantum circuit for the Hamiltonian : H = J (i,j)∈E Z (i) Z (j) , where J is a constant. Our aim is to derive an upper bound on the number of controlled rotations required to simulate e −iHt . Following the previous discussions, the overall phase incurred between the diagonalizing Cliffords W, W † is as follows (Lemma 2.3).
where x i , x j ∈ {0, 1} are variables denoting the state of the qubits after application of W . The quantum circuit has |V | qubits, corresponding to each vertex of G. Let x =∈ {0, 1} |V | denote one particular assignment of values to the variables x 1 , . . . , Let φ x be the value of the phase for this particular assignment.
and N (k) be the set of neighbouring vertices of k in G. Then Now for any assignment |S 1 | can vary from 1, . . . , |E|. So the number of distinct values of φ is at most |E|/2 . And hence we need at most |E|/2 controlled-R z gates in the circuit simulating e −iHt . Had we simulated each e −iJZ (i) Z (j) t , we would have required |E| R z gates. So we can achieve about 50% reduction in the rotation gate cost under the assumption that controlled-R z costs the same to implement as a single R z gate.

G is a cycle :
This is basically the traslationally invariant 1-D spin chain. Let G V 1x be the subgraph induced by V 1x , which is a union of paths. For each path p, let S 1p = k∈p {N (k) \ V 1x } ⊆ S 1 be the set of vertices in this path. Each of the terminal vertices has one neighbour in V \ V 1x . So |S 1p | = 2. Thus if P is the set of all such paths in G V 1x , then Now |P| can vary from 1, . . . , |E|/4 to give |E|/4 distinct values of |φ|. This implies a quantum circuit synthesizing e −iHt will require at most |E|/4 cR z gates. This is about 75% reduction in the cost of rotation gates, compared to synthesizing each e −iJZ (i) Z (j) t .
G is a complete graph : In this case for each k ∈ V 1x , we have N (k) \ V 1x = V 0x . So we have, So there can be |V |/2 distinct values of |φ x | as x 1 varies from 1, . . . , |V |/2 . And hence we require at most |V |/2 cR z gates for simulating e −iHt . If we simulate each e −iJZ (i) Z (j) t then we In Figure 6 we have shown quantum circuits simulating e itθ (i,j)∈E Z (i) Z (j) for some simple graphs G = (V, E). The circuits have been designed to optimize the number of Toffoli gates, as well.

Reducing the number of Toffoli gates :
We discussed before that the T-count from the Toffolis may be a significant factor in high error regime as the logarithmic cost of rotation synthesis may not dominate the additive constant that arises from the Toffoli gates needed. In order to reduce the number of Toffolis we can do the following. We design circuits reducing Toffolis for Hamiltonians over smaller graphs, such as in Figure 6a-6e. Then we decompose a Hamiltonian over a large graph into Hamiltonians over these smaller graphs. For example, consider a 1-D cycle on N points and H z = θ (i,j)∈E Z (i) Z (j) . We break this cycle into smaller chains of length 3 i.e. H z = θ(Z (1) Z (2) + Z (2) Z (3) ) + (Z (3) Z (4) + Z (4) Z (5) ) + . . . = i H zi . We have a quantum circuit that synthesizes each e iH zi t with only one cR z gate (Figure 6b). So to synthesize e iHzt we require approximately N/2 cR z . This is about twice the number of controlled rotations required, had we synthesized without decomposing. But it does not require any extra Toffoli-pairs. We manage to get approximately 50% reduction, compared to synthesizing each summand i.e. e itZ (i) Z (j) . Now consider a large N × N lattice which has N 2 vertices and 2N (N − 1) edges and the Hamiltonian H z = θ (i,j)∈E Z (i) Z (j) . We can decompose this into (N − 1) 2 smaller interior cycles of 4-points and a bigger outer circle with 2N + 2(N − 2) = 4(N − 1) points. From Figure 6a, we know that we can design a circuit simulating the exponentiated Hamiltonian corresponding to each interior cycle with 1 cR z and 1 Toffoli pair. We can further decompose the outer circle (as explained in the previous paragraph) and have a circuit with approximately 2(N − 1) cR z gates. Thus we require ≈ (N − 1) 2 + 2(N − 1) = (N − 1)(N + 1) cR z and (N − 1) 2 Toffoli-pairs. We have discussed before that for general graphs, number of cR z required is ≈ |E|/2 = N (N − 1), so we use ≈ (N − 1) more cR z by decomposing, but the Toffoli cost reduces a lot. Had we synthesized each e itZ (i) Z (j) , we would have used 2N (N − 1) R z . Thus we manage to get a reduction of ≈ (N − 1) 2 in the number of R z /cR z .
In Figure 6e we gave a circuit for simulating e itθ (i,j)∈E Z (i) Z (j) , when the underlying graph is a 6-point cycle. We reduced the Toffoli-pairs by decomposing the graph into smaller cycles.

Application : Simulating with qDRIFT
In this section we consider one simulation algorithm -qDRIFT. We focus on qDRIFT rather than Trotter for our experiments because qDRIFT is easier to analyze numerically. This is because Trotter errors subtly depend on operator ordering. Specifically we consider a Hamiltonian H = L j=1 h j P j and sample Pauli operators to apply in each short time step, as described earlier in the paper and in [49]. We can assume each h j > 0, since the negation affects the angles of the rotation gates. In qDRIFT, in each iteration one Pauli term is sampled up to a total of N samples. The probability of sampling P j is h j / i h i and is then simulated for a short time period. We consider another procedure where we re-write the Hamiltonian as H = L j=1 h j H j , where each H j = i P i , is a sum of commuting Paulis. In each iteration one H j is sampled with probability h j i h i and simulated for a short time period. Then we compare the growth of error, number of R z /cR z , Toffoli gates used in these two procedures -(i) one Pauli sampled in each iteration, (ii) group of commuting Paulis sampled in each iteration.
For our first set of experiments we examine the case of simulation of 4 and 6-qubit Heisenberg models. The coefficients J x , J y , J z , d h have been sampled from a 0 mean normal distribution with variance 1. In Figure 7 we show that we achieve better scaling of R z /cR z when multiple commuting Pauli operators are sampled and evolved in each iteration. In fact, the error also scales well with the number of iterations, i.e. we can achieve the same error in less number of iterations, or in another way, it is possible to achieve much lower error in the same time (iterations) when multiple operators are sampled. We calculate error as: Where E 1 = e iHt ρe −iHt , τ = t · j h j /N and · l 2 is the induced Euclidean norm on matrices and E ρ is the Haar average over input states. We obtain E 2 through averaging M random qDRIFT protocols, where M varies from 100 to 3000 for our purposes. These values are chosen to ensure that the sampling error is small at the scale of the plots generated.
In our experiments ρ is randomly drawn rather than chosen to maximize the diamond distance. As a result, this does not give a tight upper bound on the error quantified by any induced channel norm. Further, all evolution is done using t = 1 and the groupings are hand optimized using counts given in Supplementary Method 5 (Appendix F). The data, tabulated in Figure 7, shows that the number of iterations of the qDRIFT channel needed to simulate the dynamics to bound the error below a particular value, is reduced by a factor of 2.34 and 2.8 through the use of grouping commuting terms for the randomly chosen 4 and 6 qubit Heisenberg Hamiltonian respectively. The number of rotations is found to be reduced by a factor of roughly 2.34 for the 4 qubit ensemble but 1.8 for the 6 qubit case. This suggests that the groupings that we consider, while highly successful at reducing the number of iterations of qDRIFT needed, the number of gates per iteration increases from the 4 to 6 qubit examples. This suggests that further computer aided optimization may be needed in order to see the full benefit of such groupings as we increase the size of models. Similar observations can be made for our second set of experiments where we simulate the Hamiltonian of H 2 and LiH (with freezing in the STO-3G basis). The plots in Figure 8 show that in case of H 2 , the number of iterations of the qDRIFT channel needed to simulate the dynamics to bound the error below a particular value, is reduced by a factor of 4 through the use of grouping commuting terms. For LiH this factor is nearly 2.1. The number of rotations is found to be reduced by a factor of roughly 3.2 for H 2 and 2 for LiH.
For all the experiments that we consider the Toffoli-pair gate count is comparable with the R z /cR z count, so the Toffoli pairs do not contribute significantly to the overall T-count, as compared to the rotation gates. The number of gates depend on the diagonalizing circuits and the grouping into commuting Paulis. In this paper we have shown the set of results for the eigenbasis or grouping that were better among the options considered by us. In Supplementary Method 5 (Appendix F) we have explicitly mentioned the Hamiltonians, the groupings and given a short description of how we obtained the rotation and Toffoli costs.
All plots, code, and data can be found online in our public repository https://github.com/ SNIPRS/hamiltonian. All code was written in Python. Our results were obtained partly with computing resources in the Cedar cluster of Compute Canada. Specifically, our code was run on an Intel(R) Xeon(R) E5-2683 v4 CPU at 2.10 GHz, utilizing 48 cores, up to 12GBs of RAM, and running Gentoo Linux 2.6. For the Heisenberg Hamiltonians, our results were obtained using 12 cores of an Intel(R) Core(TM) i5-12600K CPU at 3.6 GHz running Ubuntu 20.04.4 and up to 32GBs of RAM.

Discussion
In this paper, we have considered the problem of designing efficient quantum circuits for exponential of Hamiltonians that can be expressed as sum of Paulis. In contrast with most previous approaches, we synthesize circuit for a sum of exponentiated commuting Paulis, rather than concatenate circuits for each exponentiated Pauli. These resulting circuits are observed, for some parameter combinations, to require far fewer non-Clifford operations than the standard circuits. We therefore propose an algorithm for greedily compiling a Trotter or qDRIFT simulation into a sequence of such simulations and observe that when multiple rotations are grouped we see at fixed error that a factor of roughly 1.8 − 3.2 fewer rotations are needed to simulate 6 and 4-qubit Heisenberg models, LiH, H 2 . Also, for simulation protocols like qDRIFT, it is possible to achieve a better performance, in the sense that the error accumulated per iteration is less if we sample multiple commuting Paulis. The overall non-Clifford gate cost of the entire protocol is also less.
There are a number of interesting avenues that are revealed by this work. The first is that a more complete set of rules for compiling Hamiltonian terms into sets that can be easily exponentiated reveals the potential for more efficient simulation compilation of Hamiltonians. These replacement rules, once identified, can be used inside a more systematic Hamiltonian compiler package that would allow more substantial optimizations of the Hamiltonian for the given simulation method.  This raises a second issue, while in this work we focus on the case of optimizing Trotter and related simulation methods, similar considerations could be performed for optimizing the prepare and select circuits used in LCU / qubitization simulation algorithms. Such procedures are harder to optimize as the simulation algorithm does not factorize as nicely into independent simulations; however, the importance of these simulation methods makes the development of compilation strategies essential.
Finally, an important avenue hinted at by this work is the possibility that approximate unitary synthesis methods can be combined with quantum simulation routines to further reduce the cost. If fermionic swaps are used, for example, simulation reduces to implementing a series of 4-local Hamiltonians and optimal circuits can be in principle constructed for such Hamiltonians using existing approaches. The computational overheads required for optimal (approximate) synthesis of these unitaries makes this a daunting task; however, if a sufficient lexicon of cheap unitaries are found for such simulation then it will not only lead to lower costs for quantum simulation using Trotter / qDRIFT: it will also unify Hamiltonian compilation with circuit synthesis into a single conceptual framework.

Data Availability statement
All plots and data can be found online in our public repository https://github.com/SNIPRS/ hamiltonian.

Code Availability statement
All code can be found online in our public repository https://github.com/SNIPRS/hamiltonian.
The single-qubit Clifford group C 1 is generated by the Hadamard and phase gates : When n > 1 the n-qubit Clifford group C n is generated by these two gates (acting on any of the n qubits) along with the two-qubit CNOT = |0 0| ⊗ I + |1 1| ⊗ X gate (acting on any pair of qubits). The Clifford group is special because of its relationship to the set of n-qubit Pauli operators. Cliffords map Paulis to Paulis, up to a possible phase of −1, i.e. for any P ∈ P n and any C ∈ C n we have CP C † = (−1) b P for some b ∈ {0, 1} and P ∈ P n .
The group generated by Clifford and T gates : The group J n is generated by the n-qubit Clifford group along with the T gate, where Thus for a single qubit J 1 = H, T and for n > 1 qubits It can be easily verified that J n is a group, since the H and CNOT gates are their own inverses and T −1 = T 7 . Here we note S = T 2 .

Diamond distance :
We use · to denote the operator norm or Schatten-∞ norm, which is equal to the largest singular value of an operator. We use · 1 for the trace norm or Schatten 1-norm, defined as Y 1 = Tr √ Y † Y , which is equal to the sum of the singular values of an operator. Throughout, we use the diamond norm distance as a measure of error between two channels. The diamond distance is denoted where · is the diamond norm where I acts on the same size Hilbert space as P.
The key properties of diamond norm, used in this paper are: 1. Triangle inequality : A ± B ≤ A + B , 2. Sub-multiplicativity : AB ≤ A B and consequently A n ≤ A n .
From the definition of diamond norm, it follows that if we apply the channels E and N to the quantum state σ, we ahve the following.
The trace norm is an important quantity because it bounds the error in expectation values. If M is an operator, then we have the following.
If M is a projection so that this represents a probability, then M = 1. We see error in diamond distance ensures that the measurements statistics are correct up to additive error 2 .
We have used the following inequalities in the paper.
Chebyshev's inequality : Let X be a random variable with finite expected value µ and finite non-zero variance σ 2 . Then for any real number k > 0, Markov's inequality : If X is a non-negative random variable and a > 0, then the probability that X is at least a is as follows.
Hoeffding's inequality : Let X 1 , . . . , X n be independent random variables such that a a ≤ X i ≤ b i almost surely. Let S n = n i=1 X i be the sum of these variables. Then for all t > 0 we have the following.

B Supplementary Method 1
In this section we show how to derive a diagonalizing circuit when overlap is on 1 qubit. We first show an eigenbasis for the Paulis in G 1y and then we prove that a unitary W 1y diagonalizes these Paulis.
Lemma B.1 (Eigenbasis for G 1y ). For the Paulis in G 1y the eigenvectors are of the following form.
Similarly we can prove the following.
It is easy to check the orthogonality and the number of eigenvectors as 2 3 · 2 · 2 3 = 2 7 .
Let W 1y be the unitary consisting of the following sequence of gates. The rightmost one is the first to be applied. With a slight abuse of notation we denote CN OT (4,1) CN OT (4,2) CN OT (4,3) by CN OT (4,I) and CN OT (4,5) CN OT (4,6) CN OT (4,7) by CN OT (4,II) . Theorem B.1. For each i, j, k, l, a, b, c ∈ Z 2 , such that P i P j P k Y III, IIIY P a P b P c ∈ G 1y we have the following.
Proof. We prove the theorem by showing that the two operators have equivalent actions on the eigenstates of the Paulis in G 1y . Let Case -A : First we consider |v y+ . We apply W † 1y first. The state vector evolves as follows.
We have the following after applying the tensor of Z operators. 7) |v y+ 1 = (−1) aq 5 +bq 6 +cq 7 |v y+ 1 Since we only have accumulation of different phase, so for the evolution of the state after applying W † 1y , in both cases it is enough to check the evolution of |v y+ 1 . Since it is the same operators applied in reverse order we are not writing the states explicitly.
Thus in this case the operators on the LHS and RHS have the same eigenvalues for the eigenvector |v y+ . Case -B : By similar analysis we can prove that the operators on both the LHS and RHS have the same eigenvalues for the eigenvector |v y− .
This proves the theorem.
Lemma B.2 (Eigenbasis for G 1x ). For the Paulis in G 1x the eigenvectors are of the following form.
, 1} 3 such that either x 1 = 0 or 2. Specifically we have the following.

C Supplementary Method 2
In this section we show how to derive a diagonalizing circuit when the overlap is on 2 qubits. The following lemma gives an eigenbasis for the Paulis in G 21 and can be proved in a manner similar to Lemma B.1.
Lemma C.1 (Eigenbasis for G 21 ). For the Paulis in G 21 the eigenvectors are of the following form.
Specifically we have the following. P k P l P i P j II |v 10 = ±α(−1) kq 1 +lq 2 |v 10 and IIP i P j P k P l |v 10 = ±α(−1) kq 5 +lq 6 |v 10 P k P l P i P j II |v 11 = ±α(−1) kq 1 +lq 2 +i |v 11 and IIP i P j P k P l |v 11 = ±α(−1) kq 5 +lq 6 +i |v 11 Theorem C.1. For each i, j, k, l ∈ Z 2 , such that P k P l P i P j II, IIP i P j P k P l ∈ G 21 we have the following.
Proof. We prove this theorem by showing that the operators on LHS and RHS have the same eigenvalue-eigenvectors. The eigenvectors for the operators on RHS have been derived in Lemma C.1.
Case -I Consider the eigenvector |v 10 . Let and Case -IA First we consider |v 10+ . We apply W † 1 first. The state vector evolves as follows.
We have the following after applying the tensor of Z operators. We note that k + l = i + j = 1.
Since we only have accumulation of different phase, so for the evolution of the state after applying W 1 , in both cases it is enough to check the evolution of |v 10+ 1 .
Thus we have the following.
Thus in this case the operators on the LHS and RHS have the same eigenvalues for the eigenvector |v 10+ . By similar arguments we can prove that the operators on the LHS and RHS have the same eigenvalues for the eigenvector |v 10− and |v 11− , hence proving the theorem.
Lemma C.2 (Eigenbasis for G 20 ). For the Paulis in G 20 the eigenvectors are of the following form.

D Supplementary Method 3
In this section we show how to derive a diagonalizing circuit when the overlap is on 3 qubits. The following lemma gives an eigenbasis for the Paulis in G 3y and can be proved in a manner similar to Lemma B.1.
Lemma D.1 (Eigenbasis for G 3y ). For the Paulis in G 3y the eigenvectors are of the following form.
In the first procedure, we sample H j independently with probability q j = h j j h j . In the second procedure, in each iteration we select one single Pauli operator P k independently with probability p k = j h j i h i L i , where in the numerator the sum is over all the commuting Pauli groups in which P k appears. Let λ = j h j and λ = j h j L j . The following Liouvillian generating unitaries under Hamiltonian H j and P i j have been defined.
Thus if L = i(Hρ − ρH), then L = L j=1 h j L j = L j=1 h j L j i j =1 L i j . We define two channels E 1 = L j=1 q j e τ L j and E 2 = L j=1 p j L j i j =1 e τ L i j , where p j = h j λ , that evolves the superoperators L j and L ij for time interval τ = λt N and τ = λ t N respectively. Here we note that for the second channel, for each Pauli P k , we have expanded the sum p k = j h j λ to reflect the commuting groups in which it belongs. Thus M k=1 p k = L j=1 L j i j =1 p j .
We have τ λ = τ λ = t N and The error can be expressed as follows.
[Triangle inequality and Sub-multiplicativity] Since L i j are Liouvillian derived from Pauli operators, which are trace-preserving, so L i j ≤ 2.
Thus we have the following.
Truncating Hamiltonian : Here we prove Lemma 2.1 CHECK. We recall that we have the following Hamiltonian.
Here qDRIF T is the error inherent in the qDRIFT protocol which we can bound as follows [49].
To bound we make use of the Liouvillian representation of a unitary channel as follows.
Thus in our case the error per repetition or segment is

F Supplementary Method 5
In this section we give the Hamiltonians considered by us in Section 2. We also give the grouping into commuting Paulis, that gave us the plots in Figures 7 and 8