T-count and T-depth of any multi-qubit unitary

While implementing a quantum algorithm it is crucial to reduce the quantum resources, in order to obtain the desired computational advantage. For most fault-tolerant quantum error-correcting codes the cost of implementing the non-Clifford gate is the highest among all the gates in a universal fault-tolerant gate set. In this paper we design provable algorithm to determine T-count of any $n$-qubit ($n\geq 1$) unitary $W$ of size $2^n\times 2^n$, over the Clifford+T gate set. The space and time complexity of our algorithm are $O\left(2^{2n}\right)$ and $O\left(2^{2n\mathcal{T}_{\epsilon}(W)+4n}\right)$ respectively. $\mathcal{T}_{\epsilon}(W)$ ($\epsilon$-T-count) is the (minimum possible) T-count of an exactly implementable unitary $U$ i.e. $\mathcal{T}(U)$, such that $d(U,W)\leq\epsilon$ and $\mathcal{T}(U)\leq\mathcal{T}(U')$ where $U'$ is any exactly implementable unitary with $d(U',W)\leq\epsilon$. $d(.,.)$ is the global phase invariant distance. Our algorithm can also be used to determine the (minimum possible) T-depth of any multi-qubit unitary and the complexity has exponential dependence on $n$ and $\epsilon$-T-depth. This is the first algorithm that gives T-count or T-depth of any multi-qubit ($n\geq 1$) unitary. For small enough $\epsilon$, we can synthesize the T-count and T-depth-optimal circuits. Our results can be used to determine the minimum count (or depth) of non-Clifford gates required to implement any multi-qubit unitary with a universal gate set consisting of Clifford and non-Clifford gates like Clifford+CS, Clifford+V, etc. To the best of our knowledge, there were no such optimal-synthesis algorithm for arbitrary multi-qubit unitaries in any universal gate set.


Introduction
The vision of Feynman [1] that a quantum computer can be used to overcome the limitations of classical computers, gained momentum with the design of quantum algorithms that outperform their classical counterparts for popular challenging problems like factorization [2,3], searching an unstructured solution space [4]. Quantum circuit is one of the most popular way for describing and implementing quantum algorithms. These consist of a series of elementary operations or gates belonging to a universal set and dictated by the implementing technologies. Like their classical d(U, W ), d(U , W ) ≤ .
T (U ) and T d (U ) are called the -T-count (T (W )) and -T-depth (T d (W )) of W , respectively. The T-count and T-depth-optimal circuits of U are called -T-count-optimal and -T-depthoptimal circuit for W , respectively. In this paper we use the global phase invariant distance (Section 2) as the distance metric and not the operator norm. This is because the global phase invariant distance ignores the global phase and hence avoids unnecessarily long approximating sequences that achieve a specific global phase. This can be the reason for the fact that the bound on T-count of single-qubit Z-rotations is less in [19], which works with this distance, compared to [20,21], that work with operator norm. (More discussions can be found in [22].) This distance is composable [22] and has been used to synthesize unitaries in other models like topological quantum computation [23,24]. It is not hard to see that if = 0 then we get the problem of synthesizing Tcount and T-depth-optimal circuits for exactly implementable unitaries. In this case, both provable [25,26,27] and much more efficient heuristic [26,27] algorithms have been developed (see Table 1 for a comparison). We say that an algorithm is provable if its claimed efficiency and correctness or quality of solution (in this case optimality) can be proved by rigorous arguments. An algorithm is heuristic if either one or both of these factors are conjectured to be true.
Any synthesis algorithm will have complexity at least O(2 n ), the input size. Placing further optimality constraint makes the problem even harder, in fact impractical to synthesize on a PC after a certain value of n. So re-synthesis algorithms have been developed which takes a circuit implementing a unitary and then tries to reduce (not minimize) a certain resource (see for example [28,29,30]). In literature, usually these algorithms do not account for the complexity of synthesizing the circuit and claim a running time of poly(n). A detail study about the relative merit and demerit of synthesis and re-synthesis algorithms, is beyond the scope of this work. But we would like to point out that the importance of designing better (optimal) synthesis algorithms or studying their complexity cannot be undermined, not only for theoretical reasons but also for the various applications they can have. Apart from guaranteeing optimality, they can be used as sub-routines in compiling large unitaries [31,32,22], assess the quality of re-synthesis algorithms, etc. For example, the T-depth-optimal synthesis algorithm of [27] was able to generate optimal circuits for standard unitaries like Fredkin, Peres and Quantum OR, which could not be done by the re-synthesis methods used in [33]. In Section 4 we show that we get much less T-count for widelyused multi-qubit unitaries like controlled rotation, compared to the number of T-gates obtained by compiling them first into single-qubit rotations and then replacing the T-count-optimal circuit of each such single-qubit rotation.
To the best of our knowledge, before this paper there was no algorithm to determine -T-count or -T-depth of arbitrary multi-qubit (n ≥ 1) unitaries, considering any distance metric. Previous algorithms like [19,20,21] synthesize -T-count-optimal circuits for single qubit Z-rotations. In fact, even if we consider other discrete, finite universal gate sets like Clifford+V or Clifford+CS, there are no algorithms that work for arbitrary multi-qubit unitaries and minimize the non-Clifford gate count/depth. Even it is not clear how to modify or generalize the methods introduced in these papers. However, our results not only work for multi-qubit unitaries but can also be applied in these alternate bases, as explained in the next section.

Our contributions
In this paper we give algorithms that can be used to synthesize (provably) -T-count and -T-depthoptimal circuits. We treat arithmetic operations on the entries of a unitary at unit cost and we Op. norm [21] = 1 (Rz) poly(2 n ) poly(log(1/ )) Global Ph.
Inv. Table 1: Complexity of some state-of-the-art optimal synthesis algorithms. The first column specifies the resouce to be optimized. In the fourth column Op. norm and Global Ph. Inv. denote the operator norm and global phase invariant distance respectively. In the sixth column [21] and [19] give optimal results only for 1-qubit R z (θ) gates. In the last two columns c ≥ 2 and k are constants. m and d are the -T-count and -T-depth of the input approximately implementable unitary respectively. The corresponding terms for exactly implementable unitaries ( = 0) are m and d respectively.
do not account for the bit-complexity associated with specifying or manipulating them. Suppose the input n-qubit unitary is W , having size 2 n × 2 n . Then the space complexity of our algorithms, described in Sections 3 and 5, is poly(2 n ). The time complexity has an exponential dependence on T (W ) or T d (W ) while synthesizing -T-count or -T-depth optimal circuit respectively ( Table 1). For the design and analysis of our algorithm the following results (Section 3) have been crucial. Suppose E is a unitary that is close to I in the global phase invariant distance i.e. d(E, I) ≤ . C 0 is an n-qubit Clifford operator. EC 0 behaves almost like Clifford C 0 i.e. it approximately inherits some characteristics from C 0 . First, expanding both EC 0 and C 0 in the Pauli basis, we can see that the absolute value of the coefficients (at each point) are almost equal (amplitude test). Second, if we expand (EC 0 )P (EC 0 ) † in the Pauli basis then the absolute value of the coefficients is almost 1 at P if C 0 P C † 0 = P , and nearly zero at other points (conjugation test). These results may be of independent interest and can be used for resource-optimal synthesis in other bases as described below.
Most discrete universal gate sets consist of Clifford gates and one non-Clifford gate. Consider one such set Clifford+A, where A is a non-Clifford gate and let U A is a unitary exactly implementable by this set. Since usually the cost of fault-tolerantly implementing the non-Clifford gate is higher, so we are required to optimize the A-count or A-depth, which are defined analogous to T-count and T-depth. One of the tricks in many resource-optimal-synthesis algorithms is to find a nice generating set G such that it has finite cardinality and U A can be decomposed as follows.
Each G i ∈ G has a specific property : A-count 1 or A-depth 1. Then we design a search algorithm to search for products of G i such that, we get U A C −1 0 (up to a global phase), or rather U A ( i G i ) −1 is a Clifford. We know that for any discrete finite universal gate set not all unitaries are exactly implementable. Let V A be one such unitary. U A is a unitary such that d(U A , V A ) ≤ and it has the minimum A-count or A-depth among all exactly implementable unitaries within distance of V A . Then we can perform amplitude test (Theorem 3.1) and conjugation test (Theorem 3.3, Corollary 3.1) and get an -A-count-optimal or -A-depth-optimal decomposition of V A . So it will be interesting and useful to find such nice generating set for other bases, as has been found for Clifford+T [25] (T-count), [27] (T-depth) and Clifford+CS [34] (CS-count, only for 2-qubit unitaries). One simple way of constructing G for count-optimality is to write U A as follows.
In the above A (q i ) denotes the A-gate applied on qubit q i . C 0 , . . . , C m , C 0 . . . , C m ∈ C n . So G = {CA (q i ) C † : C ∈ C n , i ∈ [n]}, which is obviously finite since the Clifford group is finite [35,36]. It is possible to get more compact sets by exploiting other algebraic properties (for example, [25]). A generating set for depth-optimality can be constructed by conjugating products of at least n A-gates on distinct qubits by Clifford, as has been done in [27]. From Table 1 we see that for exactly implementable unitaries the provable algorithms like [25,26,27] had an exponential dependence on T-count and T-depth. Significant improvements were achieved in [26,27], where heuristics were designed that led to algorithms with a polynomial dependence on T-count and T-depth. The algorithm in our paper also suffer from exponential dependence on -T-count and -T-depth, which usualy have an inverse dependence on i.e. T , T d ∝ f (1/ ). We conjecture that for approximately implementable unitaries it is not possible to have algorithms with a polynomial dependence on these parameters. Conjecture 1. It is not possible to have -T-count or -T-depth-optimal synthesis algorithms with complexity poly (2 n , T ) or poly (2 n , T d ) respectively.
However, from the improvements in T-count obtained by us (Section 4), we feel it is important to design efficient multi-qubit resource-optimal synthesis algorithms. In many practical quantum algorithms it is not too difficult to decompose a large unitary into smaller ones. We can apply composability rules (for example see [22] for global phase invariant distance) and distribute the errors among these small unitaries, such that the overall error remains within the desired bound. The complexity of resource-optimal synthesis algorithms will determine the maximum size of the component unitaries in a decomposition. The larger the components, the better the resource estimates, as evident from the results in our paper (Section 4).

Related work
Much work has been done to synthesize a circuit for any multi-qubit unitary (without provable optimality on any resource) [16,37,17,18,38,39,20,40,41]. Comparitively little has been done for arbitrary multi-qubit unitaries, when additional constraints are imposed, like minimizing the T-count or T-depth. To the best of our knowledge, all the previous works for approximately implementable unitaries, have been exclusively for single-qubit unitaries, in fact specifically for R z (θ) gates. They work with either operator norm [20,21] or global phase invariant distance [19]. However, considerable amount of work has been done to synthesize T-count and T-depth-optimal circuits for exactly implementable multi-qubit unitaries. These include algorithms with provable optimality like [25,26,27] that employ meet-in-the-middle (MITM) and nested MITM techniques, as well as more efficient heuristic algorithms whose optimality depend on some conjecture [26,27]. A crisp summary of the complexity of some state-of-the-art optimal synthesis algorithms has been given in Table 1.

Organization
Some necessary preliminary definitions and results have been given in Section 2. We describe our algorithm in Section 3 and provide implementation results in Section 4. In Section 5 we discuss about T-depth-optimality. Finally we conclude in Section 6.

Preliminaries
We write [K] = {1, 2, . . . , K}. We denote the n × n identity matrix by I n or I if dimension is clear from the context. We denote the set of n-qubit unitaries by U n . The size of an n-qubit unitary is N × N where N = 2 n . We have given detail description about the n-qubit Pauli operators (P n ), Clifford group (C n ) and the group (J n ) generated by the Clifford and T gates in Appendix A.1 and A.2. In this section we give some additional definitions and results required for the rest of the paper.

The group generated by Clifford and T gates
We observe the following when expanding a Clifford in the Pauli basis. 46]). If C ∈ C n then for each P ∈ P n ∃r P ∈ C, such that C = P ∈Pn r P P . Further, if r P , r P = 0 for any pair of P, P , then |r P | = |r P | = r, for some r ∈ R.
Fact 2.2. Let Q = P ∈Pn q P P be the expansion of a matrix Q in the Pauli basis. Then The proof has been given in Appendix A.1 (Fact A.2) Now consider a Clifford C which has an expansion, as given in Fact 2.1. Let there be M (≤ N 2 ) such non-zero coefficients. Since C is a unitary, so we apply Fact 2.2 and get the following.
A unitary U is exactly implementable if there exists a Clifford+T circuit that implements it (up to some global phase), else it is approximately implementable. Specifically, we say W is -approximately implementable if there exists an exactly implementable unitary U such that d(U, W ) ≤ . The Solovay-Kitaev algorithm [16,17] guarantees that any unitary is -approximately implementable, for arbitrary precision ≥ 0. We denote the set of exactly implementable unitaries by J n . In this paper we use the following distance measure d(., .), which has been used in previous works like [38,19] (qubit based computing), [23,24] (topological quantum computing). Definition 2.1 (Global phase invariant distance). Given two unitaries U, W ∈ U n , we define the global phase invariant distance between them as follows.
Composability of this distance with respect to tensor product and multiplication of unitaries, has been derived in [22]. This implies that if

T-count of circuits
The T-count of a circuit is the number of T-gates in it.

T-count of exactly implementable unitaries
The T-count of an exactly implementable unitary U ∈ J n , denoted by T (U ), is the minimum number of T-gates required to implement it (up to a global phase).
In [25] the authors proved the following decomposition result, by which any exactly implementable unitary over the Clifford+T set can be written as a product of T-count 1 unitaries.
Theorem 2.1 (Proposition 1 in [25] (re-stated)). For any U ∈ J n there exists a phase φ ∈ [0, 2π), Using Fact A.1, given P and Z (q i ) we can compute (circuit for) C i efficiently such that P = A decomposition of U , as in Theorem 2.1, with the minimum number of T-gates is called a T-count-optimal decomposition of U .
-T-count of approximately implementable unitaries We call a T-countoptimal circuit for any such U as the -T-count-optimal circuit for W .
It is not hard to see that the above definitions are very general and can be applied to any unitary W ∈ U n , exactly or approximately implementable. If a unitary is exactly implementable then = 0. In fact, nearly all the following results can be deduced for the special case of exactly implementable unitaries by applying = 0.

An exponential time and polynomial space algorithm
In this section we describe an algorithm for determining the -T-count of an n-qubit unitary W ∈ U n . This algorithm has space and time complexity O 2 2n and O 2 2nT (W )+4n respectively. First we derive some results that will help us design our algorithm.
Let U be an exactly synthesizable unitary such that d(U, W ) ≤ .
Let U = 1 i=t R(P i ) C 0 e iφ for some C 0 ∈ C n and global phase φ. And W = U E. Then from Equation 4 we have The above implies that d(E, I) ≤ . We have We now study some properties of |Tr(EC 0 P 1 )|, which will help us check if we have identified a correct 1 i=t R(P i ). For this, we prove the following theorem.
and 0 ≤ |Tr Proof. Since E is unitary, we can expand it in the Pauli basis as where e P = Tr (EP ) /N and P |e P | 2 = 1 (Fact 2.2). Thus and Since C 0 = P ∈Pn r P P , from Fact 2.1 and Equation 2, we know that |r P | = r = 1 √ M or r P = 0. Then r P EP 1 where P 1 ∈ P n \ {I} and P P 1 = P 1 = I. So From equation 13 we can also obtain the following lower bound. Since r = 1 √ M , we prove the following inequalities.
Basically this theorem says that if E is close to identity then distribution of absolute-valuecoefficients of EC 0 and C 0 in the Pauli basis expansion, is almost similar. In fact, we can have a more general theorem that can be deduced from the calculations in Theorem 3.1.
So we can define two sets S 1 and S 0 as follows.
From our results so far, it follows that for small enough (which is usually the case in nearly all applications) the values in S 1 are nearly equal, while those in S 0 are nearly 0. Let ∆ = max t 1 ∈S 1 ,t 0 ∈S 0 (t 1 − t 0 ). Then to get a positive difference we have the following.
Solving this we obtain the following conditions.
Since usually < 1, so we consider the second inequality. Expanding the term in the square root we obtain Since this function decreases with M and 1 ≤ M ≤ N 2 , so we can say that For all practical purposes, the value of is much smaller than this. So we can easily distinguish the sets S 0 and S 1 . To test if we have guessed the correct U , we can apply the results deduced in the previous section. Specifically we calculate W = W † U and then calculate the set S c = {|Tr(W P )/N | : P ∈ P n }, of coefficients. Then we check if we can distinguish two subsets S 0 and S 1 , as shown in Equation 16 and 17, for some 1 ≤ M ≤ N 2 . Further details have been given in Algorithm 2. Let us call this the amplitude test. After passing this test we have a unitary of the form E † Q where Q is a unitary. This test sort of filters out the approximate values of the coefficients of Q in the Pauli basis (Theorem 3.2). So after passing this test Q will be a unitary with equal or nearly equal amplitudes or coefficients (absolute value) at some points and zero or nearly zero at other points. To ensure Q is a Clifford i.e. W = E † C 0 for some C 0 ∈ C n , we perform the conjugation test (Algorithm 3) for further verification. Theorem 3.3. Let E, Q ∈ U n such that d(E, I) ≤ . P ∈ P n \ {I} such that QP Q † = P α P P , where α P ∈ C. Then for each P ∈ P n ,

Algorithm
Proof. We have the following.
Multiplication by P gives us the following.
(E † QP Q † E)P = |e I | 2 α P I + |e I | 2 P =P α P P P + P =I |eP | 2 α P (±I) + P =I,P =P |eP | 2 α PP PP P + P =P eP eP α P P P P P + P =P ,P =P eP eP α PP PP P So, Given P , P,P , we can haveP PP P = ±I for one particular value ofP .
Tr (E † QP Q † E)P /N ≤ |α P | + P =P |α P | P |eP ||eP | [P =P is such thatP PP P = ±I] Let IPP 0 P = ±I andP 0 P IP = ±I, for some PaulisP 0 ,P 0 ∈ P n \ {I}. Then we can write In Appendix B we show that P =I,P 0 |eP ||eP | ≤ (2 2 − 4 ). We observe that in the above inequality we have taken |e I | ≤ 1, but if |e I | = 1 then |e P | = 0 for any P = I, since P |e P | 2 = 1. To get nonzero values for the sum within bracket |e I | < 1. If we have to maximize |eP 0 | + |eP 0 | given Equation 11, then if we consider an optimization problem with these two variables only, then it is not difficult to see that the maximum occurs if they have the same value. That is |eP Ignoring higher order terms of , we can write the following.
We also have the following lower bound using similar reasoning as above.
And we have the following corollary.
Corollary 3.1. Let C 0 ∈ C n and P ∈ P n such that C 0 P C † 0 =P ∈ P n . If E ∈ U n such that d(E, I) ≤ , then The above theorem and corollary basically says that EC 0 (or E † C 0 ) approximately inherits the conjugation property of C 0 . For each P ∈ P n , if we expand C 0 P C † 0 in the Pauli basis then the absolute value of the coefficients has value 1 at one point, 0 in the rest. If we expand EC 0 P C † 0 E † in the Pauli basis then one of the coefficients (absolute value) will be almost 1 and the rest will be almost 0. From Theorem 3.3 this pattern will not show for at least one Pauli P ∈ P n if we have E † Q, where Q / ∈ C n . If we expand EQP Q † E † or E † QP Q † E in the Pauli basis then the spike in the amplitudes will be in at least two points. Also, we observe that 2 < (1 − 4 2 + 2 4 ) for any ≤ 0.31. Thus there exists a distinguishable gap between the two cases of Corollary 3.1. For all practical purposes is much less than this value.

Synthesizing T-count-optimal circuits
So far we have been able to determine T (W ) for any W ∈ U n . We now describe how we can synthesize -T-count-optimal circuit for W , using the above algorithms. It is easy to see that A DECIDE can return a sequence {U m , . . . , U 1 } of unitaries such that U = 1 i=T (U ) U i C 0 e iφ (for some C 0 ∈ C n ) and T (U ) = T (W ). We can efficiently construct circuits for each U i ∈ {R(P ) : P = I} using Fact A.1. So what remains, is to determine C 0 . Then we can efficiently construct a circuit for it, for example, by using results in [47].
If W = U E then at step 3 of Algorithm 2 we calculate W From Algorithm 2 we can also obtain the following information : (1) set S 1 , as defined in Equation 16, (2) M = |S 1 |. Thus we can calculate r = 1 √ M and from step 4 we can actucally calculate the set S 1 = {(t P , P ) : |t P | = |Tr(W P )/N | ∈ S 1 }. From Equation 13 we can say that for small enough (say << 1 M ) we have We perform the following steps.
1. Calculate a P = t P t I = Tr(W P ) Tr(W ) , where (t P , P ) ∈ S 1 (or equivalently |t P | ∈ S 1 ). We must remember that (t I , I) ∈ S 1 .
We explained that from Equation 13, a P ≈ r P r I . From Fact 2.1 we know that |r P | |r I | = |r P | |r I | = 1. So we adjust the fractions such that their absolute value is 1. For small enough this adjustment is not much and so with a slight abuse we use the same notation for the adjusted values.
2. Select c, d ∈ R such that c 2 + d 2 = r 2 . Let r I = c + di. Then we claim that the Clifford C 0 = r I P :r P =0 a P P is sufficient for our purpose.
It is not hard to see that C 0 = e iφ C 0 for some φ ∈ [0, 2π). Thus if U = 1 i=m U i C 0 , then T (U ) = T (U ) and d(U , W ) ≤ .

Complexity
We first do a worst case analysis of both space and time complexity and then mention some practical considerations.

Time complexity : worst case analysis
We first analyse the time complexity of A CON J . The outer and inner loop at steps 2 and 6, respectively, can run at most N 2 * N 2 = N 4 times, where N = 2 n . At step 7 multiplication of four N × N matrices take O(N 2 ) time and calculating trace takes N time steps. So overall complexity at step 7 is dominated by O(N 2 ). We note that at step 13 we do not need to calculate the product and trace again. In the worst case every loop at steps 2 and 6 are implemented, incurring an overall time complexity of O(N 6 ). Now we analyse the time complexity of A DECIDE , when testing for a particular T-count m. The algorithm loops over all possible products of m unitaries U j , which is R(P ) in case of T-countdecision. Since there can be N 2 − 1 non-identity Paulis P , so this loop happens at most N 2m times. Now in each such loop we do m matrix multiplications at step 2 and 3. This has time complexity O(mN 2 ). At step 4 we make a list of N 2 real numbers. Each is obtained by multiplying two N × N matrices and then taking trace. So time complexity for making this list is O(N 4 ). Sorting this list takes time O(nN 2 ). The inner loop 5-13 happens N 2 times. Each of the list elements is checked and so step 8 has complexity O(N 2 ). Now let within the inner loop the conjugation test is called k 1 times. So the loop 5-13 incurs a complexity O(k 1 · N 6 + (N 2 − k 1 )N 2 ), when k 1 > 0, else it is O(N 4 ). Let k is the number of outer loops (steps [2][3][4][5][6][7][8][9][10][11][12][13][14] for which conjugation test is invoked in the inner loop 5-13 and k 1 is the maximum number of times this test is called within any 5-13 inner loop. Then the overall complexity of The conjugation test is invoked only if a unitary passes the amplitude test. We assume that the occurrence of non-Clifford unitaries with equal amplitude is not so frequent such that kk 1 < N 2m −k. Time complexity -practical considerations : 1. It is not hard to see that if [P 1 , P 2 ] = P 1 P 2 − P 2 P 1 = 0 then R(P 1 )R(P 2 ) = (αI + βP 1 )(αI + βP 2 ) = R(P 2 )R(P 1 ), where α = 1 2 1 + e iπ/4 and β = 1 So in step 2 of Algorithm 2 we need not loop over all m-length products of R(P ). It is easy to check if two n-qubit Paulis commute. There are even number of places where the respective 1-qubit Paulis are non-identity and different. We need not go into actual matrix multiplications. This can speed-up the actual running time by orders of magnitude. For example, for the unitaries considered by us, we obtained a speed-up of 5-10 times. In fact, it may be possible to show that the asymptotic complexity also decreases. One can work out more such symmetries in order to prune the search space.
2. We already made an assumption that the number of non-Cliffords that pass the amplitude test is much less. Even if such a unitary is tested in A CON J , we need not loop over N 4 times. As soon as there are 2 spikes for any outer loop Pauli P out , the program exits (step 10). If a non-spike is "far enough" from 0 then also the program exits (step 14). So in most cases testing a non-Clifford with equal amplitudes take less time. If there is Clifford then all the N 4 loops have to run, but then it implies that A DECIDE has obtained T-count.
3. Most of the matrix multiplications, especially by Paulis are sparse, so here the run-time complexity is less. In step 2 of Algorithm 2 one has to repeatedly multiply a unitary U by R(P ) = αI+βP . Since P is sparse, we can first multiply U by P , then multiply each non-zero off-diagonal element by β and finally add α to the diagonal. This can reduce some practical running time.

Space complexity
The input to our algorithm is a N ×N unitary, with space complexity O(N 2 ). In step 1 of A DECIDE , we can store the single qubit Paulis and calculate R(P ) whenever required. We require O(N 2 ) space to perform matrix multiplication of two N × N matrices. In A DECIDE , we either store a N × N matrix or a list of N 2 real numbers (step 4). Even in A CON J we store one N × N matrix. Hence the overall space complexity is O(N 2 ) ∈ O(2 2n ), without storing R(P ). This increases running time because we have to calculate the n-qubit Paulis and R(P ) repeatedly. But the asymptotic time complexity remains unchanged.
If we store the n-qubit Paulis or R(P ), then we require O(N 4 ) space. This factor dominates and overall space complexity is O(N 4 ) ∈ O(2 4n ). In this approach the actual running time reduces but the asymptotic time complexity remains same.
T (R z (θ)) = 3.067 log 2 (1/ ) − 4.322 (22) Then an upper bound was given by adding the T-count of component unitaries. For example, in Figure 1: cR z (θ) (a) implemented by using two Fredkin gates and one R z (θ) gate [51], where upper qubit is the control, middle qubit is the target and bottom qubit is the ancilla; and (b) implemented by two CNOT and two R z gates. (c) cR k implemented by two Toffoli gates and one R k gate. (d) ccR z (θ) implemented by two Toffoli, one cR z (θ) and one ancilla set to |0 . Figure 1 we have shown two implementations of cR z (θ) gate, that we found in literature. In Figure  1a two Fredkin gates (T-count=7 [26]), one R z (θ) and an extra ancilla [51] is used. In Figure 1b, the implementation uses two R z gates. So upper bound on the T-count of cR z (θ), averaged over θ is as follows.
#T(cR z (θ)) = 3.067 log 2 (1/ ) − 4.322 + 14 = 3.067 log 2 (1/ ) + 9.678 [ Figure 1a] #T(cR z (θ)) = 2 (3.067 log 2 (1/ ) − 4.322) = 6.134 log 2 (1/ ) − 8.644 [ Figure 1b] The first upper bound is better (gives less T-count) for every < 0.016, but the implementation uses an extra ancilla. In Figure 1c and 1d we show an implementation of cR k and ccR z (θ) respectively. The latter circuit can be used to implement ccR k by replacing the cR z (θ) with cR k . Upper bounds on T-count can be deduced in a similar manner for the respective unitaries. We took θ = 2π 2 k and varied k from 2 to 11. We were more interested in synthesizing multi-qubit unitaries, since these were not T-count-optimally synthesized before. It took on an average 48 mins to synthesize a 2-qubit unitary with T-count at most 7; and about 5.7 hours for a 3-qubit unitary with T-count at most 4. We have synthesized only one 2-qubit unitary with T-count 9. We have not synthesized 2 and 3 qubit unitaries with higher T-count because of time constraint. We made the following observations.
1. The T-count of controlled rotation gates reduce, as we increase the number of controls, at least for many of the angles and precision tested by us. This has been shown in Table 2. The average running time has been stated before.
2. The T-count of 2-qubit QFT is equal to the T-count of R 2 and has been shown in Table 3. In this table we have also shown the T-count of 3-qubit QFT for some precision. The running time for these tests has been explicitly mentioned 1 .
3. The T-count of Givens(θ) is similar to cR z (θ), on an average. (So we have not shown it separately.) 4. The T-count of R z (θ) (and hence R k ) agrees with the results given in [19].
The numerical results of this subsection, together with instructions on how to reproduce them, are available online at https://github.com/vsoftco/approx-t.  Table 3: -T-count of 2 and 3 qubit QFT.

Discussion : T-depth-optimal synthesis
The algorithms 1 (A M IN ) and 2 (A DECIDE ) can be used for T-depth-optimal-synthesis of any multiqubit unitary, since we know there is a finite generating set [27] such that any exactly implementable unitary can be written as a product of elements from this set and a Clifford. We first give some definitions.

T-depth of circuits
Suppose the unitary U implemented by a circuit is written as a product U = U m U m−1 . . . U 1 such that each U i can be implemented by a circuit in which all the gates can act in parallel or simultaneously. We say U i has depth 1 and m is the depth of the circuit. The T-depth of a circuit is the number of unitaries U i where the T/T † gate is the only non-Clifford gate and all the T/T † gates can act in parallel 2 .

T-depth of exactly implementable unitaries
The T-depth or min-T-depth of an exactly synthesizable unitary U ∈ J n , denoted by T d (U ), is the minimum T-depth of a Clifford+T circuit that implements it (up to a global phase). In [27] a subset, V n ⊂ { i∈[n] CT (i) C † , C ∈ C n , T ∈ {T, T † , I}}, of T-depth-1 unitaries has been defined. It has been shown that |V n | ≤ n · 2 5.6n and any T-depth-1 unitary U 1 ∈ J n can be written as We call each V i , with T-depth 1, as a (parallel) block and it can be written as product of R(P ) or R † (P ), where P ∈ ±P n . It is possible to multiply consecutive T-depth-1 unitaries to get another T-depth-1 unitary (conditions given in [27]). Thus V n can be regarded as a generating set (modulo Clifford) for the set of T-depth-1 unitaries, and hence for the complete group J n . A decomposition which has the minimum number of T-depth-1 unitaries is called a T-depth-optimal decomposition. A circuit implementing U ∈ J n with the minimum T-depth is called a T-depth-optimal circuit.
-T-depth of approximately implementable unitaries The -T-depth of an approximately implementable unitary W ∈ U n , denoted by T d (W ), is equal to T d (U ), the T-depth of an exactly implementable unitary U ∈ J n such that d(U, W ) ≤ and T d (U ) ≤ T d (U ) for any U ∈ J n and d(U , W ) ≤ . We call a T-count-optimal (or T-depth-optimal) circuit for any such U as the -T-count-optimal (or -T-depth-optimal respectively) circuit for W .

Modification of A M IN
Since the set V n is finite, so it is not hard to see that algorithms A M IN and A DECIDE can be applied to find T-depth-optimal decomposition of any unitary W ∈ U n . Replace step 1 of Algorithm 2 by S ← V n . Suppose W = U E for some exactly implementable unitary U such that T d (W ) = T d (U ) and d(W, U ) ≤ . Then we can decompose U as in Equation 23. If we have guessed the correct V 1 , . . . , V d then after multiplying i V i with W † we are left with EC 0 . Now the amplitude test and conjugation test can be applied to check if we have the correct guess. We have said before that it is possible to multiply consecutive V i such that the product has T-depth 1. In that case the -T-depth is less than d. So to find the minimum possible T-depth we might have to iterate more than T d (W ) times.

Time complexity
The time complexity of conjugation test A CON J is same as before. The analysis of the complexity of A DECIDE is also similar, but here |S| = |V n |, so if we take all possible m -length product at step 2, then the number of iterations for the outer loop 2-14 is at most n2 5.6nm . The complexity of the remaining steps are same, so the overall complexity of A DECIDE is O(n2 5.6nm +4n ). We explained before that it is possible to combine more than one consecutive unitaries V i ∈ V n such that we get one T-depth 1 unitary. Thus this procedure gives a T-depth m ≤ m . We do not know how much is the difference m − m. Alternatively, we can do a pessimistic analysis of A DECIDE . This algorithm is basically an exhaustive search procedure to test for a certain T-depth m. Let in step 2 we make sure that we have a T-depth-m unitaryŨ i.e. it is not possible to combine any further. Basically, this meansŨ is from the set of T-depth-1 unitaries modulo Clifford. Now there can be at most O(4 n 2 ) of these. This is because there can be at most O(4 n 2 ) n-length product of R(P ). This is a naive bound and more explanations can be found in [27]. So this time the outer loop can occur at most O(4 n 2 m ) times. Arguing in the same way, the complexity of A DECIDE is O(2 2n 2 m+4n ), and hence complexity of

Space complexity
In step 1 of A DECIDE we can store |V n | in a symbolic way, for example, for each V i = j R(P j ) ∈ V n , simply store the Paulis in the product. Then we can calculate the necessary matrices whenever necessary by taking products of the corresponding R(P )s. In all other steps we store N ×N matrix, taking at most O(N 2 ) ∈ O(2 2n ) space. Thus space complexity is O(n2 5.6n ). As explained before this approach leads to more running time, without affecting the asymptotic time complexity.
In this paper we do not implement our algorithm to determine -T-depth. For small enough , we can use the procedure chalked out in Section 3.2 to synthesize a T-depth-optimal circuit.

Conclusion
In this paper we have developed results and algorithms that can be used to determine the optimal count or depth of non-Clifford gates required to implement any multi-qubit unitary with the gates of a finite universal gate set, that consists of Clifford and non-Clifford gates like Clifford+V, Clif-ford+CS, etc. Our primary focus has been the Clifford+T set. Given an 2 n × 2 n unitary W , the space complexity of our algorithm is poly(2 n ). The time complexity has an exponential dependence on T (W ) or T d (W ), while determining T-count and T-depth-respectively. For small enough , we show how we can synthesize the -T-count or -T-depth-optimal circuit.
To the best of our knowledge, we give the first algorithm that works for arbitrary multi-qubit unitaries (exactly or approximately implementable) and any universal gate set. We think it is unlikely that we can have an algorithm whose complexity has a polynomial dependence on -Tcount or -T-depth. A complexity theoretic argument to prove or disprove Conjecture 1 will throw theoretical insights into the complexity of these problems. It might be possible to decrease the exponent in the time complexity by applying techniques like meet-in-the-middle [25] or nested meet-in-the-middle [26,27]. It is not hard to see that our algorithm can be parallelized. It will be interesting to investigate if additional tricks can be used. Another interesting question is to find more compact generating sets for other universal gate sets for multi-qubit unitaries. When n > 1 the n-qubit Clifford group C n is generated by these two gates (acting on any of the n qubits) along with the two-qubit CNOT = |0 0| ⊗ I + |1 1| ⊗ X gate (acting on any pair of qubits). The Clifford group is special because of its relationship to the set of n-qubit Pauli operators. Cliffords map Paulis to Paulis, up to a possible phase of −1, i.e. for any P ∈ P n and any C ∈ C n we have CP C † = (−1) b P for some b ∈ {0, 1} and P ∈ P n . In fact, given two Paulis (neither equal to the identity), it is always possible to efficiently find a Clifford which maps one to the other.
Fact A.1 ( [25]). For any P, P ∈ P n \ {I} there exists a Clifford C ∈ C n such that CP C † = P . A circuit for C over the gate set {H, S, CNOT} can be computed efficiently (as a function of n). Proof. If Q = P ∈Pn q P P , then taking trace on both sides and dividing by N we get the first equation.
If Q is unitary then we get the following.
QQ † = I = P q P P P q P P = P |q P | 2 I + P =P q P q P P P Now taking trace on both sides and dividing by N , we get the second equation.

A.2 The group generated by Clifford and T gates
The group J n is generated by the n-qubit Clifford group along with the T gate, where Thus for a single qubit J 1 = H, T and for n > 1 qubits It can be easily verified that J n is a group, since the H and CNOT gates are their own inverses and T −1 = T 7 . Here we note S = T 2 .

B Some additional upper bound
In this section we want to prove the following claim. For eachP , there exists oneP that satisfies a certain condition.
Proof. Let us denote |eP | by x i , where i ∈ 1, . . . , N 2 − 1. IfP is the Pauli that satisfies the given condition withP , then we denote the corresponding variable by x i . We must remember that this convention of indices is adapted to identify the variables that satisfy a condition. Each of the variables x i ≥ 0. From Equation 11 we know that i x 2 i ≤ 2 2 − 4 .
So we upper bound the following optimization problem.
We can use Karush-Kuhn-Tucker (KKT) conditions [54,55,56] to solve the above optimization problem with inequality constraint. For simplicity and without loss of much generality we have used equality in the constraint and follow the Lagrangian method of optimization. Let λ is a Lagrange multiplier. The Lagrange formulations is: To find the optimal points we have the following for each i.
Since this is true for each x i , we can infer that x i = x j , where i = j and hence The claim follows.
C Circuit for Quantum Fourier Transform Figure 2: Quantum Fourier Transform circuit given in [18]