Efficient decomposition methods for controlled-Rn using a single ancillary qubit

We consider decomposition for a controlled-Rn gate with a standard set of universal gates. For this problem, a method exists that uses a single ancillary qubit to reduce the number of gates. In this work, we extend this method to three ends. First, we find a method that can decompose into fewer gates than the best known results in decomposition of controlled-Rn. We also confirm that the proposed method reduces the total number of gates of the quantum Fourier transform. Second, we propose another efficient decomposition that can be mapped to a nearest-neighbor architecture with only local CNOT gates. Finally, we find a method that can minimize the depth to 5 gate steps in a nearest-neighbor architecture with only local CNOT gates.

Due to recent advances in quantum device technology, an arbitrary single-qubit gate or a Z-rotation gate can be implemented with fairly high accuracy, and small quantum algorithms can be tested. However, even with the small gate error rates currently being realized, it is difficult to perform scalable quantum computation directly, since this requires that arbitrarily large computations be implemented. To overcome this problem, fault-tolerant computation is still needed 1 . Therefore, for reliable quantum computation, all quantum operations of a quantum algorithm should be represented by a universal gate set that arises from a fault-tolerant protocol, such as Clifford + T gates 2 .
We consider a standard set of universal gates consisting of Hadamard (denoted H), phase (S), π/8 (T), and controlled-NOT (CNOT) gates. Although it is known that quantum algorithms have much lower computational complexities than classical algorithms for problems such as factoring large integers 3 , when such quantum algorithms are decomposed into CNOT, H, S, and T gates, the result includes a huge number of gates. Thus, the advantages of quantum computing might be nullified. To enhance the benefits of quantum computation, it is important to use an efficient decomposition of quantum algorithms into universal gates. Here, we first consider the decomposition of single-qubit gates and two-qubit gates. Any single-qubit gate can be decomposed in terms of Hadamard gates and Z-rotation gates $R_z(\theta)$ 4,5 , and there are well-known methods to approximate $R_z(\theta)$ efficiently 6-9 . Next, we consider the controlled-$R_n$ gate as the simplest two-qubit gate to be decomposed into a universal set of gates. Controlled-$R_n$ gates represent the fundamental part of the quantum Fourier transform (QFT) and many other quantum algorithms. Thus, controlled-$R_n$ decomposition has a significant impact on the overall decomposition of a quantum algorithm. In this work, we propose efficient controlled-$R_n$ decomposition methods as a technique to help enhance the benefits of quantum computation.

Background
Approximation of $R_n$ gate. An $R_n$ gate is defined as follows:
$$R_n = \begin{pmatrix} 1 & 0 \\ 0 & e^{2\pi i/2^{n}} \end{pmatrix}.$$
The $R_2$ gate is an S gate (or P gate), and the $R_3$ gate is a T gate. The $R_2$ and $R_3$ gates are included in the universal set. However, $R_n$ for n ≥ 4 cannot be exactly decomposed with only a standard set of universal gates 8 . Thus, we should approximate $R_n$ for n ≥ 4 to express it with the standard set.
To approximate the $R_n$ gate, we use the gridsynth method 9 . Given a precision ε > 0, the approximation of an $R_n$ gate is to find an operator U expressible as a product of H, S, T and Pauli operators such that
$$\| R_n - U \| \le \varepsilon,$$
where the norm is the operator norm.
The gridsynth algorithm 9 produces an efficient approximation of an $R_n$ gate in a probabilistic manner. Thus, we estimate the average number of gates it requires. From Table 1, we take the average number of gates for an approximation of an $R_n$ gate to be 127, 253 and 379 for $\varepsilon = 10^{-5}$, $10^{-10}$ and $10^{-15}$, respectively. Note that the average number of gates is independent of the rotation angle.
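The error criterion above can be evaluated numerically for any candidate operator. The following is a minimal sketch, assuming the definition $R_n = \mathrm{diag}(1, e^{2\pi i/2^n})$; the helper names `R` and `operator_norm_error` are hypothetical, and the sketch only measures the error of a given candidate (it does not implement gridsynth and ignores any global-phase convention).

```python
import numpy as np

def R(n):
    """R_n = diag(1, exp(2*pi*i / 2**n)); R(2) is the S gate, R(3) is the T gate."""
    return np.diag([1.0, np.exp(2j * np.pi / 2**n)])

def operator_norm_error(target, candidate):
    """Operator (spectral) norm of target - candidate."""
    return np.linalg.norm(target - candidate, ord=2)

S = np.diag([1.0, 1j])
T = np.diag([1.0, np.exp(1j * np.pi / 4)])

# R_2 and R_3 are exactly in the gate set, so their error is zero.
assert np.allclose(R(2), S) and np.allclose(R(3), T)

# R_4 is not in the gate set: crude one-gate candidates miss by a large margin,
# far from precisions such as 1e-10.
for name, candidate in [("identity", np.eye(2)), ("T", T)]:
    print(name, operator_norm_error(R(4), candidate))
```

A synthesis routine would be accepted at precision ε when `operator_norm_error(R(n), U)` is at most ε for its output `U`.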
Zero ancillary qubit method (Method 1). A controlled-$R_n$ gate is defined as follows:
$$\text{controlled-}R_n = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & e^{2\pi i/2^{n}} \end{pmatrix}.$$
Figure 1 shows the circuit of the controlled-$R_n$ gate with 2 CNOTs, 2 $R_{n+1}$ gates and 1 $R_{n+1}^{\dagger}$ gate. This is a well-known and fundamental method for the decomposition of a controlled-$R_n$ 10 . When we approximate the controlled-$R_n$ with precision $10^{-10}$, the total number of gates is 761 on average from Table 1 (three approximated rotations of 253 gates each, plus 2 CNOTs). Thus, the approximation of one controlled-$R_n$ requires an excessive number of gates.

One ancillary qubit method (Method 2). Figure 2 shows the circuit of the controlled-$R_n$ gate using a single ancillary qubit. The circuit consists of 1 $R_n$, 16 CNOTs, 4 Hs, 8 Ts and 6 $T^{\dagger}$s.
As noted in ref. 8 , one advantage of such a circuit is that it reduces the depth with only a small constant overhead. As mentioned earlier, $R_n$ and $R_{n+1}$ require many gates, depending on the precision. In the case of the precision $10^{-10}$, $R_n$ and $R_{n+1}$ both require approximately 253 gates. Therefore, the approach where a single ancillary qubit is employed appears to be beneficial.
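Both baseline constructions can be checked numerically on small matrices. The sketch below is illustrative only: Figures 1 and 2 are not reproduced here, so the gate ordering used for Method 1 is the textbook ordering that matches the stated gate count (an assumption), and the ancilla-based check uses an abstract Toffoli-style compute/uncompute step (also an assumption about the underlying principle, not the 16-CNOT circuit of Fig. 2). All function names are hypothetical.

```python
import numpy as np

I2 = np.eye(2)
# CNOT with the first qubit as control, basis order |00>, |01>, |10>, |11>.
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)

def R(n):
    return np.diag([1.0, np.exp(2j * np.pi / 2**n)])

def controlled_R(n):
    return np.diag([1.0, 1.0, 1.0, np.exp(2j * np.pi / 2**n)])

def method1(n):
    """2 CNOTs, 2 R_{n+1} and 1 R_{n+1}^dagger (assumed textbook ordering)."""
    r = R(n + 1)
    on_control = np.kron(r, I2)
    on_target = np.kron(I2, r)
    on_target_dg = np.kron(I2, r.conj().T)
    # In a matrix product the rightmost factor acts first.
    return on_control @ CNOT @ on_target_dg @ CNOT @ on_target

def toffoli_onto_ancilla():
    """Permutation |a, b, c> -> |a, b, c XOR (a AND b)>, qubit order (a, b, ancilla)."""
    m = np.zeros((8, 8))
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                src = 4 * a + 2 * b + c
                dst = 4 * a + 2 * b + (c ^ (a & b))
                m[dst, src] = 1
    return m

def ancilla_method(n):
    """Compute a AND b into the ancilla, apply R_n there, then uncompute."""
    tof = toffoli_onto_ancilla()
    r_on_ancilla = np.kron(np.eye(4), R(n))
    return tof @ r_on_ancilla @ tof  # the Toffoli is its own inverse

def restrict_to_ancilla_zero(op):
    """4x4 block acting on the states whose ancilla is |0>."""
    idx = [0, 2, 4, 6]
    return op[np.ix_(idx, idx)]

for n in range(2, 9):
    assert np.allclose(method1(n), controlled_R(n))
    assert np.allclose(restrict_to_ancilla_zero(ancilla_method(n)), controlled_R(n))
```

The second check illustrates why only one $R_n$ appears in the ancilla-based circuits: the rotation acts once on the ancilla, while the remaining gates only compute and uncompute the AND of the two data qubits.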
We note that ref. 11 offers an approach to implementing a controlled-U operation using an ancillary qubit prepared in an eigenstate of U. However, in this paper, we focus only on approaches that use the $|0\rangle$ state as the ancillary qubit. Thus, we have not considered the decomposition of the controlled-$R_n$ gate with the approach of ref. 11 . As future work, we will analyze the decomposition of a controlled-U operation.

Controlled-T decomposition based method (Method 3). The previously known efficient decomposition of a controlled-T is shown in ref. 12 . We can observe that the middle T gate in ref. 12 can be replaced with the $R_n$ gate. In this case, the controlled-$R_n$ gate can be decomposed into 4 Hadamard gates, 2 Phase gates, 12 CNOT gates, 8 T gates, and 1 $R_n$ gate. This result is the best known to date and is the same as in ref. 13 . If we use two ancillary qubits,

Table 1. Average numbers of gates over 10,000 runs for an approximation of $R_n$ with angle $\pi/2^{n-1}$.

Results
In this work, we improve the previous method toward three ends: reducing the total number of gates, achieving an efficient layout and achieving a smaller depth.
Smaller number of total gates (Improvement 1). We propose an improvement whereby the controlled-$R_n$ consists of fewer gates in total while still using only one $R_n$ gate.
Theorem 1. The controlled-$R_n$ gate can be decomposed, with at most one ancillary qubit in the state $|0\rangle$, into one $R_n$, eight CNOTs, four Hs, four Ts and four $T^{\dagger}$s. The proof is given in Section Proofs. The corresponding decomposition is shown in Fig. 3. The advantage of the proposed method is shown in Table 2. The data were estimated by the ScaffCC program 14 . In particular, in the case of a controlled-T, using an ancillary qubit results in an exact decomposition of the controlled-$R_n$ and not an approximation. Thus, the gap between Method 1 and Improvement 1 is even larger. Method 3 is more efficient than Method 2 in the decomposition of a controlled-T. However, it consists of 12 CNOTs, 4 Hs, 1 P, 1 $P^{\dagger}$, 5 Ts and 4 $T^{\dagger}$s. That decomposition includes 27 gates, whereas our decomposition includes only 21 gates. In more detail, the T-count is the same for ref. 12 and our method; the advantage of our method is a reduction by 4 CNOT gates and 2 Phase gates. The reduction of CNOT gates is important since CNOT gates are physically difficult to implement and the controlled-$R_n$ is a building block of larger algorithms rather than a final algorithm 15,16 . Thus, its impact on quantum algorithms will be large. For example, according to the module count analysis of the ScaffCC program 14 for Shor's algorithm, the controlled-T gate is used 641,990,656 times in total. This means that saving 6 gates in the decomposition of the controlled-T gate saves 3,851,943,936 gates in the computation of Shor's algorithm.

Efficient layout (Improvement 2). For practical quantum computing, we should consider the layout of quantum circuits. Since nonlocal two-qubit gate operations are not allowed in general, a long-range CNOT gate is implemented with several adjacent SWAP gates. In the following theorem, we present an efficient decomposition of a controlled-$R_n$ gate without using nonlocal CNOT gates.

Theorem 2. A controlled-$R_n$ gate can be implemented under the nearest-neighbor-interaction-only architecture with at most one ancillary qubit in the state $|0\rangle$ using one $R_n$, twelve adjacent CNOTs, four Hs, four Ts and four $T^{\dagger}$s. The proof is given in Section Proofs. The corresponding circuit is shown in Fig. 4. Let us consider one long-range CNOT gate, where the control qubit is the first qubit and the target qubit is the third qubit. Naively, we can decompose such a CNOT gate into one adjacent CNOT gate and two SWAP gates. Each SWAP gate can be decomposed into three CNOT gates. Thus, the long-range CNOT can be implemented with 7 CNOT gates. More efficiently, the long-range CNOT can be implemented with only 4 CNOT gates 17 . Thus, Method 2 consists of 1 $R_n$, 28 adjacent CNOTs, 4 Hs, 8 Ts and 6 $T^{\dagger}$s, while Improvement 2 consists of 1 $R_n$, 12 adjacent CNOTs, 4 Hs, 4 Ts and 4 $T^{\dagger}$s. Therefore, our method uses 16 fewer CNOT gates, 4 fewer T gates and 2 fewer $T^{\dagger}$ gates.
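The two long-range CNOT constructions mentioned above can be verified with permutation matrices. This is a minimal sketch under assumptions: the helper `cnot` is hypothetical, qubits are indexed 0, 1, 2 along a line, and the particular 4-CNOT ordering used here is a standard identity that may differ from the one given in ref. 17.

```python
import numpy as np

def cnot(control, target, n_qubits=3):
    """Permutation matrix of a CNOT; qubit 0 is the most significant bit."""
    dim = 2**n_qubits
    m = np.zeros((dim, dim))
    for src in range(dim):
        bits = [(src >> (n_qubits - 1 - q)) & 1 for q in range(n_qubits)]
        if bits[control]:
            bits[target] ^= 1
        dst = sum(b << (n_qubits - 1 - q) for q, b in enumerate(bits))
        m[dst, src] = 1
    return m

long_range = cnot(0, 2)  # control on the first qubit, target on the third

# Naive version: SWAP(1,2), adjacent CNOT(0,1), SWAP(1,2); each SWAP is 3 CNOTs.
swap_12 = cnot(1, 2) @ cnot(2, 1) @ cnot(1, 2)
naive_7_cnots = swap_12 @ cnot(0, 1) @ swap_12

# Shorter version: four adjacent CNOTs (rightmost factor acts first).
four_cnots = cnot(0, 1) @ cnot(1, 2) @ cnot(0, 1) @ cnot(1, 2)

assert np.array_equal(naive_7_cnots, long_range)
assert np.array_equal(four_cnots, long_range)
```

The count of 28 adjacent CNOTs quoted for Method 2 is consistent with four of its 16 CNOTs being long-range and each being replaced by four adjacent ones: 12 + 4 × 4 = 28.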

Smaller depth (Improvement 3). The depth of a circuit is the length of its critical path. To ensure an efficient run time on a practical quantum computer, the depth of a circuit should be minimized. For this purpose, we propose a circuit with a smaller depth for the controlled-$R_n$.
Theorem 3. While maintaining an $R_n$-type gate depth of 1, the controlled-$R_n$ can be implemented with at most one ancillary qubit in the state $|0\rangle$ with a depth of 5 gates in a nearest-neighbor architecture with only local CNOT gates. The proof is given in Section Proofs. The corresponding circuit is shown in Fig. 5. Method 2 for the controlled-$R_n$ has a depth of 25, while this circuit only has a depth of 5. Although Method 1 only has a depth of 4, its depth after the approximation of the $R_n$-type gates is nearly twice that of Improvement 3.
We note that, from Fig. 8(a) in ref. 12 , the controlled-S gate can be decomposed with a depth of 5. However, that decomposition uses two long-range CNOTs. Thus, in order to represent the controlled-S gate with only adjacent CNOTs and $R_n$-type gates, the long-range CNOTs must be transformed into several adjacent CNOTs, or the layout of the qubits must be changed. That is, more resources are required than in the method of Fig. 5. According to the module count analysis of the ScaffCC program 14 for Shor's algorithm, the controlled-S gate is used 641,013,760 times in total. This means that reducing the depth of the controlled-S decomposition by one step affects 641,013,760 gate applications in Shor's algorithm.
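The notion of depth used in this comparison can be made concrete with a small scheduler that places each gate in the earliest layer where all of its qubits are free. The sketch below is illustrative only: `circuit_depth` and `depth_after_approximation` are hypothetical helpers, the Method 1 gate order is the assumed textbook ordering (Fig. 1 is not reproduced here), and each approximated rotation is assumed to expand into a sequence of c single-qubit gates on one wire.

```python
def circuit_depth(gates):
    """Length of the critical path; gates is a list of (name, qubits) in circuit order."""
    busy_until = {}  # qubit -> index of the last layer that uses it
    depth = 0
    for _name, qubits in gates:
        layer = 1 + max(busy_until.get(q, 0) for q in qubits)
        for q in qubits:
            busy_until[q] = layer
        depth = max(depth, layer)
    return depth

# Method 1 on (control=0, target=1); the rotation on the control runs in
# parallel with the first rotation on the target.
method1 = [
    ("R_{n+1}", (0,)),
    ("R_{n+1}", (1,)),
    ("CNOT", (0, 1)),
    ("R_{n+1}^dagger", (1,)),
    ("CNOT", (0, 1)),
]
assert circuit_depth(method1) == 4

def depth_after_approximation(rotation_layers, other_layers, c):
    """Depth once every rotation layer is expanded into c sequential gates."""
    return rotation_layers * c + other_layers

c = 253  # average gate count at precision 1e-10 (Table 1)
method1_depth = depth_after_approximation(2, 2, c)       # 2 rotation layers, 2 CNOT layers
improvement3_depth = depth_after_approximation(1, 4, c)  # 1 rotation layer, 4 other layers
print(method1_depth, improvement3_depth, method1_depth / improvement3_depth)
```

Under these assumptions the ratio of the two depths is close to 2, in line with the statement that the depth of Method 1 after approximation is nearly twice that of Improvement 3.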

Efficient decomposition of the quantum Fourier transform
The quantum Fourier transform (QFT) is the key ingredient for quantum factoring and many other quantum algorithms 2 . We denote the total number of gates of the QFT for n qubits by QFT(n). Now, we compare the total number of gates for the QFT obtained with each decomposition method. $\mathrm{QFT}_{M1}(n)$, $\mathrm{QFT}_{M2}(n)$, $\mathrm{QFT}_{M3}(n)$ and $\mathrm{QFT}_{I1}(n)$ denote the total number of gates obtained with Method 1, Method 2, Method 3 and Improvement 1, respectively (Equations (5)-(8)). In these expressions, c is the average number of gates over 10,000 runs for an approximation of $R_n$ with angle $\pi/2^{n-1}$ at the precision listed in Table 1. For example, if the precision is $\varepsilon = 10^{-10}$, then c = 253. Method 3 and Improvement 1 each contain one $R_n$ gate, and Improvement 1 uses six fewer gates per controlled-$R_n$ (20 versus 26 gates besides the $R_n$), so the benefit of Improvement 1 over Method 3 grows with the number of controlled-$R_n$ gates in the QFT and hence with n. In this paper, we only consider the error rate in the approximation of the $R_n$ gate, not the overall error rate in the approximation of the QFT. However, since Method 3 and Improvement 1 use the same number of $R_n$ gates and Improvement 1 has fewer gates overall, the overall error rate in the approximation of the QFT for Improvement 1 is expected to be no greater than that for Method 3. From Table 3 and Equations (5)-(8), it is shown that Improvement 1 is more efficient than Method 1, Method 2 and Method 3.
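The per-gate counts quoted in the Background and Results sections can be tallied for each precision. The following is a minimal sketch using only numbers stated in the text; the dictionary layout and the function `gates_per_controlled_R` are hypothetical, and the tally covers a single controlled-$R_n$ with n ≥ 4 (every rotation approximated), not the full QFT expressions.

```python
# Average gate count c for one approximated rotation, keyed by -log10(precision) (Table 1).
AVG_ROTATION_GATES = {5: 127, 10: 253, 15: 379}

# (number of approximated rotations, number of remaining exact gates) per method.
METHODS = {
    "Method 1": (3, 2),                     # 2 R_{n+1}, 1 R_{n+1}^dagger; 2 CNOTs
    "Method 2": (1, 16 + 4 + 8 + 6),        # 16 CNOT, 4 H, 8 T, 6 T^dagger
    "Method 3": (1, 4 + 2 + 12 + 8),        # 4 H, 2 P, 12 CNOT, 8 T
    "Improvement 1": (1, 8 + 4 + 4 + 4),    # 8 CNOT, 4 H, 4 T, 4 T^dagger
}

def gates_per_controlled_R(method, c):
    """Total gates for one controlled-R_n when each rotation costs c gates."""
    rotations, exact_gates = METHODS[method]
    return rotations * c + exact_gates

for digits, c in AVG_ROTATION_GATES.items():
    row = {name: gates_per_controlled_R(name, c) for name in METHODS}
    print(f"precision 1e-{digits}:", row)

# Cross-checks against figures stated in the text.
assert gates_per_controlled_R("Method 1", 253) == 761
saving = gates_per_controlled_R("Method 3", 253) - gates_per_controlled_R("Improvement 1", 253)
assert saving == 6
assert 641_990_656 * saving == 3_851_943_936  # controlled-T count in Shor's algorithm
```

Because the saving of Improvement 1 over Method 3 does not depend on c, it is the same at every precision; the saving over Method 1 instead grows with c.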

Discussion
We have investigated the decomposition problem for the controlled-$R_n$ gate since it is an important two-qubit gate. A previously proposed method utilizes a single ancillary qubit to reduce the number of gates. In this work, we have extended this method for three purposes: to reduce the number of gates, to find a good mapping for an architecture with only nearest-neighbor interactions, and to minimize the critical path. Specifically, we have confirmed that the proposed method reduces the number of gates for the quantum Fourier transform. As future work, we will consider three issues. First, we need to check whether the proposed methods are optimal. In addition, it would be interesting to investigate how much performance gain is possible for quantum algorithms such as Shor's factoring algorithm, since it heavily uses the quantum Fourier transform. Finally, for more general situations, we need to develop a decomposition method for controlled multi-qubit unitary transforms.

Proofs
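Proof of Theorem 1.
Proof. Let $|\psi\rangle$ be an arbitrary two-qubit state. Then, $|\psi\rangle$ can be represented as
$$|\psi\rangle = \alpha_{00}|00\rangle + \alpha_{01}|01\rangle + \alpha_{10}|10\rangle + \alpha_{11}|11\rangle,$$
where $\alpha_i$ are complex numbers and $\sum_{i=00}^{11} |\alpha_i|^2 = 1$. Let a unitary operator $U$ be the operator denoted by [...] Thus, [...]

Proof of Theorem 2.
Proof. Let $|\psi\rangle$ be an arbitrary two-qubit state. Then, $|\psi\rangle$ can be represented as
$$|\psi\rangle = \alpha_{00}|00\rangle + \alpha_{01}|01\rangle + \alpha_{10}|10\rangle + \alpha_{11}|11\rangle,$$
where $\alpha_i$ are complex numbers and $\sum_{i=00}^{11} |\alpha_i|^2 = 1$. Let a unitary operator $U$ be the operator denoted by [...]

Proof of Theorem 3.
Proof. Let $|\psi\rangle$ be an arbitrary two-qubit state. Then, $|\psi\rangle$ can be represented as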