Introduction

Clifford circuits are ubiquitous in quantum computing. They appeared early in the study of quantum computations. For example, Bernstein-Vazirani algorithm1, which is a Clifford circuit in its entirety, served to establish a separation between quantum and classical computations in the black-box model. In early (as well as modern) experimental work, randomized benchmarking2,3, performed entirely by Clifford circuits, was found to be a powerful tool for establishing the quality of quantum gates/computations. Quantum error correction is based on Clifford circuits, including both logical state encoding (as well as the underlying formalism)4,5 and state purification required to obtain non-Clifford gates on the logical level6. Fault-tolerant and even physical-level circuits (although slightly less so) are frequently considered over Clifford+T and/or Clifford+Rz libraries—the role of the Clifford circuits follows from the library names themselves. Other notable use cases include shadow tomography7, the study of entanglement8, and more.

There are two popular choices to evaluate a circuit cost by a single number: weighted gate count, and weighted depth. The weights depend on whether one works with physical-level or logical-level circuits and the details of physical-level control and error correcting approach, correspondingly. Given that logical-level Clifford circuits can often be implemented transversely, the costs are most likely determined by the physical-level constraints. Both leading quantum technologies, superconducting circuits9,10,11 and trapped ions12,13, offer single-qubit gates at a fraction of the cost of the two-qubit gates. It thus makes the most sense to discard the single-qubit gates and simply count the entangling gates. We furthermore choose to focus on optimizing the depth by entangling gates rather than the gate count. We believe this to be a better choice since the depth most accurately reflects the time to solution, and the purpose of any computation is above all to optimize this figure of merit. Secondly, minimizing depth helps to optimize the fidelity in computations where noise is dominated by decoherence, such as the case in current superconducting circuit devices9,10,11. For quantum technologies where noise may be dominated by a coherent source, such as the trapped ions14, one may too prefer to optimize depth, since the noise due to random over/underrotations may scale sublinearly with the number of gates14 or even as slow as the square root15, whereas the effect of the decoherence is always linear with time. Thus, a smaller decoherence noise channel may cause a larger disturbance in sufficiently deep computations compared to a coherent over/underrotation channel, due to coherent error cancellations.

All quantum computations need to be broken down into single-qubit and two-qubit gates, which are then implemented in hardware. A single-qubit gate can be implemented by directly addressing the respective physical qubit, but the two-qubit gates require addressing a pair of qubits. Depending on the technology and particulars of the controlling apparatus, it may not always be possible. Of the two leading technologies, trapped ions12,13 is generally said to offer all-to-all qubit connectivity. However, this does not remain the case for arbitrary numbers of qubits, and it is possible that keeping all-to-all connectivity beyond 100 qubits can be challenging14. In our work, we address the possibility of unrestricted architectures by considering the task of depth optimization without limiting the pairs of qubits to which an elementary two-qubit gate can apply. In contrast, superconducting circuits technology9,10,11 offered limited connectivity at the offset, including the first 5-qubit IBM quantum processor that premiered in 201616. Current offerings by the companies focusing on superconducting technology include a 2-dimensional square lattice (Google), heavy hexagonal lattice (where each edge of a hexagonal lattice contains a qubit in the middle, IBM), 2-dimensional square lattice with diagonal terms (IBM, presently retired Tokyo architecture), and a square lattice with octagons (Rigetti). What unites these is the ability to find a chain subgraph in each, that is, embed the Linear Nearest Neighbor (LNN) architecture. A chain is frequently a subgraph of many other graphs and lattices. As a result, and due to its universality, we also focus on implementing Clifford circuits over the LNN architecture.

Clifford circuits have been studied quite extensively, leading to numerous efficient implementations addressing a number of metrics of value. A first noteworthy result in this area was the development of an 11-stage layered decomposition -H-CNOT-P-CNOT-P-CNOT-H-P-CNOT-P-CNOT-17, which, together with gate-count asymptotically optimal synthesis algorithm for linear reversible circuits18 settled the asymptotic gate complexity. This asymptotically optimal algorithm can be parallelized19,20 to offer depth upper bound guarantee of \(O(n/\log (n))\), leading to asymptotic depth optimality. However, the leading constant in front of the \(n/\log (n)\) term is as high as 2019, resulting in circuits that become advantageous compared to those featuring a higher asymptotic complexity of O(n) only when the number of qubits exceeds 1,345,000 or more19,21. Indeed21, offers an improvement over19, implying that the performance crossover number 1,345,000 was likely pushed higher by the newer work. In either case, we do not anticipate executing Clifford circuits spanning this many qubits on quantum computers. This suggests that the currently known constructions achieving asymptotic optimality offer theoretical guarantees but are not necessarily practical. We also note that constructions offering asymptotic optimality and exploring ancillary space have also been considered20. However, leading constants are high19.

A more practical approach to implementing Clifford circuits would be to first start with a short layered decomposition -X-Z-P-CNOT-CZ-H-CZ-H-P-22,23, offering only three entangling stages, as opposed to the original five17, and noting that a -CZ- stage is ‘simpler’ than a -CNOT- stage. Then, depending on the qubit connectivity:

  • Over LNN: -X-Z-P-CNOT-CZ-H-CZ-H-P- can be rewritten as -X-P-CNOT\(\widehat{\,{{\mbox{-CZ-}}}\,}\)H\(\widehat{\,{{\mbox{-CZ-}}}\,}\)H-P-, where \(\widehat{\,{{\mbox{-CZ-}}}\,}\) is a -CZ- stage together with qubit reversal. The -CNOT- stage can be implemented in depth at most 5n24, and \(\widehat{\,{{\mbox{-CZ-}}}\,}\) can be implemented in depth 2n + 225, which combined with some local optimizations leads to the depth 9n implementation of arbitrary Clifford circuits;

  • Over all-to-all: both -CZ- and -CNOT- stages are best implemented according to21, resulting in depth \(2n+O({\log }^{2}(n))\) circuits.

Here, we show that a -CNOT-CZ- stage can always be implemented in depth 5n over LNN, leading to a Clifford circuit depth guarantee of no more than 7n − 4. We also show computational evidence that a -CNOT-CZ- stage can be implemented in depth \(n+O({\log }^{2}(n))\) over all-to-all architecture, leading to a likely upper bound of \(1.5n+O({\log }^{2}(n))\) on the depth of Clifford circuits in unrestricted architectures.

The core of technical discussions is the study of quantum circuits composed with Phase (P), CZ, and CNOT gates. These gates can be defined as the unitary matrices or by the transformations they perform over basis states, as follows:

$$\begin{array}{lll} \qquad\qquad P(x):\left\vert x\right\rangle \,\mapsto \,{i}^{x}\left\vert x\right\rangle ,\\ \quad\,\,\,{{{\rm{CZ}}}}(x,y):\left\vert x,y\right\rangle \,\mapsto \,{(-1)}^{xy}\left\vert x,y\right\rangle ,\,{{\mbox{and}}}\,\\ {{{\rm{CNOT}}}}(x,y):\left\vert x,y\right\rangle \,\mapsto \,\left\vert x,x\oplus y\right\rangle .\end{array}$$

Such circuits compute Hadamard-free Clifford transformations and form a finite group. It was shown25 that the circuits with Phase, CZ, and CNOT gates can be computed as a three-stage layered computation -P-CZ-CNOT-, where each layer can be composed using the gates of its specified type, and the layers can come in any order. This layered decomposition offers an efficient way of implementing Hadamard-free circuits. Indeed, -P- layer may consist of the gates ID (identity), P, Z = P2, and P = P3 on each qubit, and thus is trivial to compose. A -CZ- layer can be implemented straightforwardly and optimally by the CZ gates alone (see next paragraph). A -CNOT- layer, also known as linear reversible circuits, has been studied extensively18,20,21,24,26. We note that the two -CZ- layers can be implemented more efficiently by utilizing P and CNOT gates25 and the -CX- layer representing a linear reversible circuit can sometimes be improved by utilizing P and H (Hadamard) gates23. Here, we employ related rules to reduce the depth of the decomposition -P-CZ-CNOT- by mixing the individual layers.

First, recall some basic properties of circuits composed with CZ gates. CZ(x, y) = CZ(y, x), meaning the order of control and target does not matter. A CZ gate is self-inverse; in other words, two neighboring CZ gates operating over the same set of qubits can be removed from the circuit without affecting its functionality. Any two CZ gates commute. This means that any pair of identical CZ gates can be removed and allows to express arbitrary CZ circuit over n qubits by listing a set of at most \(\frac{(n-1)n}{2}\) pairs of qubits to which such gates apply. That is, we can uniquely represent a CZ circuit on n qubits by a \(\frac{(n-1)n}{2}\)-dimensional vector over the binary field \({{\mathbb{F}}}_{2}\,:= \,\{0,1\}\), where each coordinate of the vector indicates the presence of a particular CZ gate after removing all pairs of identical CZ gates, and the composition of two CZ circuits is represented by the modulo-2 addition of the corresponding binary vectors. Furthermore, the vector representations of the CZ circuits span the linear space \({{\mathbb{F}}}_{2}^{(n-1)n/2}\). A linear space has a basis, which will become an important consideration in future discussions. The standard basis is {CZ(i, j)1≤i<jn}.

A CZ circuit can also be implemented by inserting Phase gates into linear reversible circuits, by considering phase polynomials. To illustrate how this works, consider a well-known circuit identity:

On the left-hand side, we have a CZ(x, y) which applies the phase − 1xy = i2xy; and on the right-hand side, we have three Phase gates, \({{{\rm{P}}}}\left\vert x\right\rangle ,{{{\rm{P}}}}\left\vert y\right\rangle ,\) and \({{{{\rm{P}}}}}^{{\dagger} }\left\vert x\oplus y\right\rangle\), which apply phases ix, iy and ixy. Here, we call x and y the primary variables as they denote the primary input states fed to the circuit. Each takes the value of either \(\left\vert 0\right\rangle\), \(\left\vert 1\right\rangle\), or a superposition of the two. The notation \(U\left\vert f\right\rangle\) where f is a linear function, or simply U(f), conveys the application of the unitary gate U to the qubit in the state represented by f. In this case, f = xy. The equality holds due to the mixed arithmetic identity \(2x\cdot y\equiv x+y-x\oplus y\,(\,{{\mathrm{mod}}}\,\,\,\,4)\). Since the primary variables, x and y, are available to experience the application of Phase gates on the input side of the circuit, it shows that the gate CZ(x, y) can be computed via Phase gate insertion by a linear reversible circuit that computes the linear function xy at some point. Furthermore, a linear reversible circuit spanning qubits x1, x2, . . . , xn and computing linear functions xixj for all i, j: 1≤i < j ≤ n offers enough opportunities to insert Phase gates to compute arbitrary CZ gate transformations, since {CZ(i, j)1≤i<jn} is a basis.

Next consider a linear function \(f={x}_{{i}_{1}}\oplus {x}_{{i}_{2}}\oplus ...\oplus {x}_{{i}_{k}}\), spanning a subset of k qubits. It is known and is easy to verify explicitly that up to a layer of Phase gates applied to primary variables on the input side of the circuit, a Phase gate applied to f is equivalent to the CZ circuit \({\prod }_{1\le s < t\le k}{{{\rm{CZ}}}}({x}_{{i}_{s}},{x}_{{i}_{t}})\)25. This observation offers a natural way to treat arbitrary linear functions as CZ circuits (up to Phase gates applied to primary variables) or otherwise vectors in the \(\frac{(n-1)n}{2}\)-dimensional linear space spanning CZ circuits.

Finally, we recall another important identity, that we express using phase polynomials as follows:

$$a+b+c-a\oplus b-a\oplus c-b\oplus c+a\oplus b\oplus c\equiv 0(\,{{\mathrm{mod}}}\,\,4).$$
(1)

Here, each literal corresponds to some linear function of primary variables. A summand with the positive sign expresses the application of the P gate to it, a summand with the negative sign indicates the application of the P gate, and the sum is taken modulo-4 to reflect the phase identity i4 = 1. This seven-term identity says that a circuit implementing the set of seven Phase operations participating as summands computes the identity function. For the purpose of our work, Eq. (1) is a very important identity that establishes linear dependence of linear functions as vectors in the space \({{\mathbb{F}}}_{2}^{(n-1)n/2}\), which should not be confused with their linear dependence under binary addition. For example, {x1x2, x2x3, x1x3} is not a linearly independent set under binary addition, since (x1x2)  (x2x3)  (x1x3) = 0. Meanwhile, the phase polynomial (x1x2) + (x2x3) + (x1x3) computes a non-identity. Applying this identity most frequently comes in the form of rewriting one of seven terms as a combination of the remaining six; it lies at the core of the constructions that follow. In particular, we focus on the discovery of a full CZ basis in the linear functions computed inside a -CNOT- stage as a means for merging a -CZ- stage with the -CNOT- stage.

The rest of the paper is organized as follows. Subsection II A-Subsection II D focus on developing circuits over LNN architecture. They describe a depth-5n implementation of the joint -CNOT-CZ- stage and offer an O(n3) algorithm to construct it. We employ this construction to implement arbitrary Clifford operation in depth 7n − 4, see Subsection II E. Subsection II F-Subsection II H report a heuristic solution to the problem of optimizing circuit depth of the implementation of the -CNOT-CZ- stage in the all-to-all connected architecture. Final remarks and conclusion can be found in Section III.

Results

-CNOT-CZ- circuits over Linear Nearest Neighbor architecture

First, consider the Linear Nearest Neighbor (LNN) architecture, where qubits are arranged in a line and only adjacent qubits are allowed to interact. Our main result can be summarized as follows.

Theorem 1

Any Hadamard-free Clifford transformation over n qubits can be implemented over LNN architecture as a circuit with the two-qubit gate depth of at most 5n.

At first glance, Theorem 1 may be surprising. In the LNN architecture, a -CNOT- layer alone requires a circuit of depth 5n when constructed using the best-known synthesis algorithm24. Since Hadamard-free Clifford transformations can be written as a three-stage computation -P-CZ-CNOT-25, our result implies that the -CZ- layer adjacent to a -CNOT- layer can be implemented “for free” (without increasing the two-qubit gate depth) over the LNN architecture through P gate insertions. As we’ll show in Subsection II E, Theorem 1 also allows us to implement any Clifford operation as a circuit of depth 7n − 4 over LNN, improving the prior best-known upper bound of 9n23. In the remainder of this section, we take a closer look at the -CNOT- synthesis algorithm of24 in Subsection II B before describing an efficient Hadamard-free synthesis algorithm (Subsection II C) that achieves the advertised upper bound in Theorem 1.

-CNOT- synthesis over LNN

The best-known synthesis algorithm for CNOT circuit synthesis over LNN achieves depth 5n by starting with two back-to-back copies of the sorting network, C1 and C2, and modifying them according to a set of rules24. Since a CNOT transformation corresponds to a linear reversible operation, we can represent it as an n × n invertible Boolean matrix M. Application of the CNOT gate performs modulo-2 addition (EXOR, denoted  ) of the matrix column corresponding to the control qubit to the column corresponding to the target qubit of the given CNOT gate. The synthesis of M using CNOT gates can therefore be thought of as the diagonalization of M via column operations. The diagonalization is done in two steps24: first, M is transformed into a northwest-triangular matrix \({M}^{{\prime} }\) using a modification of the CNOT circuit C1 of depth at most 2n, and then, \({M}^{{\prime} }\) is diagonalized by modifying the sorting network C2 in depth at most 3n24. An n × n matrix is called northwest-triangular if all entries below the antidiagonal are 0. The combined circuit C1C2 diagonalizes M and has depth at most 2n + 3n = 5n; therefore, \({({C}_{1}{C}_{2})}^{-1}\) is a depth-5n circuit that implements M.

We focus our attention on C2, the circuit diagonalizing a northwest-triangular matrix, since it has a stable and predictable structure. It is important to note that in the following sections, primary variables refer to the primary input of the underlying CNOT circuit, which is recovered on the right-hand side of C2. In Subsection II C, we prove that C2 always generates enough linear functions to induce any -CZ- circuit via Phase gate insertions. C2 starts as the sorting network24 and uses (n − 1)n/2 “boxes” in n alternating layers, each being either SWAP or SWAP+. The SWAP+ gate is a two-qubit gate that acts on its inputs as follows, \({{{{\rm{SWAP}}}}}^{+}(x,y):\left\vert x,y\right\rangle \mapsto \left\vert y,x\oplus y\right\rangle\). In the circuit notation,

Whether a box within the sorting network is a SWAP or SWAP+ is determined by the corresponding northwest-triangular matrix24. In fact, the correspondence between northwest-triangular matrices and such sorting networks is bijective, which can be established by the counting argument. An example of a sorting network on 6 qubits and the northwest-triangular matrix it diagonalizes is given in Fig. 1.

Fig. 1: Initially, qubit i contains the linear function represented by the ith column of M.
figure 1

Circuit C diagonalizes the matrix M. Outputs of the circuit C are the primary variables.

We next label qubits and boxes of the sorting network and define an ordering of the boxes. Using the labeling and ordering, we make several important observations on the structure of the sorting network that lie at the heart of our algorithm.

First, label each qubit i on the output side of the circuit with the label n + 1 − i in the input side. This is justified by the observation that the sorting network inverts the order of the qubits, and the ith primary variable is recovered and stored in qubit i at the output side of the diagonalization circuit. When a box (either SWAP or SWAP+) acts on the qubits i and i + 1 with labels li and li+1, we exchange the labels of the two qubits: qubit i is relabeled with li+1 and qubit i + 1 with li. As shown in [Ref. 24, Section 7.3], the label of a qubit always equals to the largest index of the primary variable stored in that qubit. For example, in Fig. 1, the first qubit initially holds the value x4x6, corresponding to its label, l1 = 6 = 6 + 1 − 1. It can also be shown that for each pair of labels li, lj, there is exactly one box that exchanges li and lj, and this box always shuffles the larger label downward and the smaller label upward. It follows that we can uniquely label each box with an unordered tuple of two indices li and lj, where li and lj are the labels of the qubits it exchanges. That is, box(li, li+1) (or box(li+1, li)) refers to the box that swaps qubits i and i + 1 and exchanges their labels li and li+1. Whenever can be determined, the smaller label precedes the larger label when writing down the indices that specify a box. An example of this labeling is shown in Fig. 2.

Fig. 2: Example labeling and layer ordering.
figure 2

For n = 7, Pn = {6, 4, 2, 1, 3, 5}, which is the index shared by the six diagonal layers from left to right, with each layer highlighted in a different color.

Due to the above observations, boxes in the same bottom-left to top-right diagonal layers share a common smaller index, as illustrated in Fig. 2. We label each diagonal layer by the index they share, which gives rise to the following sequence,

$${P}_{n}:= \left\{\begin{array}{ll}(n-1,n-3,...,1,2,4,...,n-2)&\,{{\mbox{for even}}}\,n\\ (n-1,n-3,...,2,1,3,...,n-2)&\,{{\mbox{for odd}}}\,n\end{array}\right..$$

This labeling is related to Pj labeling in ref. 25.

Let the boxes be ordered by the smaller of the two indices used to specify it. That is, let box(i, j) be ordered by i if i < j, and by j otherwise. We can compare two distinct boxes, box(i, j) and box(k, l), by comparing where the smaller indices that specify them appear in the sequence Pn. WLOG, let i < j and k < l, we write box(i, j) box(k, l) when i precedes k in Pn, and box(i, j) = box(k, l) when i = k. Intuitively, boxes are ordered by the diagonal layers they belong to, from left to right. As the content of a qubit travels from left to right, it either stays in the same layer (when the qubit’s label is shuffled up) or goes to the next layer (when the its label is shuffled down, with the content of an adjacent qubit added, or is left out of a round of shuffling). The layer ordering gives an important guarantee that will aid us in proofs in the next subsection.

Inducing arbitrary CZ circuits

We now have enough vocabulary to prove Theorem 1. An outline of the proof is as follows: we show that a northwest-triangular matrix diagonalization circuit generates a set of linear functions of primary variables that forms a basis in the CZ circuit space by induction on k, the number of SWAP+ boxes it contains. Then, in Subsection II D we describe an algorithm to find a schedule of Phase gates that induces a CZ circuit in O(n3) time, where n is the number of qubits.

We focus on the northwest-triangular matrix diagonalization circuit C. We first deal with the base case k = 0.

Lemma 1

Let C be a northwest-triangular diagonalization circuit consisting of only SWAP gates. Then, C generates linear functions {xixj1 ≤ i < j ≤ n}, which serves as a basis in the space of CZ circuits.

We omit the proof since it has been addressed in Section I and overall is easy to obtain by noting that the SWAP generates EXOR of the input qubits when implemented by the CNOT gates, and that in the sorting network, each qubit labeled by i meets another qubit labeled by j exactly once for each pair of labels i ≠ j. We further note that this Lemma follows from ref. 25, Theorem 6.

Assume C has k > 0SWAP+ boxes. Consider the leftmost SWAP+ in C in the smallest layer by its order. Let us refer to this box as box(i, j), where i and j are its labels, i < j. If we replace box(i, j) with a SWAP, we obtain another circuit \({C}^{{\prime} }\) with the corresponding matrix \({M}^{{\prime} }\) that has k − 1SWAP+ boxes. By the induction hypothesis, \({C}^{{\prime} }\) contains a CZ basis. An example of C and \({C}^{{\prime} }\) is shown in Fig. 3.

Fig. 3: For simplicity, qubit concatenation refers to (output) qubit label EXORs.
figure 3

The shaded box represents a SWAP+ gate, whereas the unshaded box represents a SWAP gate. Leftmost SWAP+s in the smallest order layer labeled by (2, 7) is enclosed by the red box---this is the one we focus on in the induction. All linear functions generated by the circuit after box(2, 7) in higher layer order are enclosed by the dashed lines; they are identical in C and \({C}^{{\prime} }\). All linear functions generated by boxes can be written as the EXORs of two inputs. The differences between linear functions computed by C and \({C}^{{\prime} }\) are highlighted in red.

We make two observations. First, the subcircuits following box(i, j) compute identical linear functions between C and \({C}^{{\prime} }\). Also, all boxes box(k, l) such that box(k, l) box(i, j) generate the same linear functions in both C and \({C}^{{\prime} }\) (for illustration, see dashed boxes in Fig. 3). Second, the circuit preceding box(i, j) consists of only SWAP gates, and as such it only swaps the linear functions without introducing new ones. For all boxes box(k, l) box(i, j), box(k, l) generates clck in C and \({c}_{l}^{{\prime} }\oplus {c}_{k}^{{\prime} }\) in \({C}^{{\prime} }\), where ci, \({c}_{i}^{{\prime} }\) denotes the initial input stored in the qubit with label i on the left hand side of circuits C and \({C}^{{\prime} }\) respectively. We know from the first observation that the linear function stored in each qubit after layer i must be identical between C and \({C}^{{\prime} }\). Matching the outputs of the boxes on layer i, we can show that all but one initial input of C and \({C}^{{\prime} }\) are identical; that is, \({c}_{k}\,=\,{c}_{k}^{{\prime} }\) for all k ≠ j, and \({c}_{j}^{{\prime} }={c}_{i}\oplus {c}_{j}\) (or equivalently \({c}_{j}={c}_{i}^{{\prime} }\oplus {c}_{j}^{{\prime} }\)). We formalize these observations in the following Lemma:

Lemma 2

Let C be a modification of the sorting network with k > 0SWAP+ boxes, and let \({C}^{{\prime} }\) be the circuit obtained from it by replacing the leftmost and first in layer order SWAP+ box, labeled with (i, j), with a SWAP. Let ci and \({c}_{i}^{{\prime} }\) be the initial inputs stored in the qubit with label i of circuits C and \({C}^{{\prime} }\) respectively. Then:

  1. 1.

    ci differs from \({c}_{i}^{{\prime} }\) in exactly one input. In particular, \({c}_{j}={c}_{j}^{{\prime} }\oplus {c}_{i}^{{\prime} }\), and \({c}_{k}={c}_{k}^{{\prime} }\) for all k ≠ j.

  2. 2.

    At most n − 1 boxes generate different linear functions between C and \({C}^{{\prime} }\). In particular, box(j, k) generates ckcj in C and \({c}_{k}^{{\prime} }\oplus {c}_{j}^{{\prime} }={c}_{k}\oplus {c}_{i}\oplus {c}_{j}\) in \({C}^{{\prime} }\), if and only if box(j, k) box(i, j)

We next show that we can always express the applications of Phase gates to linear functions generated by C that are different from those generated by \({C}^{{\prime} }\) using a constant number of Phase gates applied elsewhere by employing the identity Eq. (1).

Lemma 3

Let C, \({C}^{{\prime} }\) and ci, \({c}_{i}^{{\prime} }\) be defined as above, and box(i, j), where i < j WLOG, denotes the leftmost first in order SWAP+ box in C. For each Phase gate applied inside \({C}^{{\prime} }\), there exists a corresponding schedule of either 1 or 6 Phase gates found in C that computes this phase. We give these schedules explicitly:

  1. 1.

    A P gate applied to the unique \({c}_{j}^{{\prime} }\) in \({C}^{{\prime} }\) such that \({c}_{j}^{{\prime} }\,\ne \,{c}_{j}\) is equivalent to the P gate applied inside box(i, j) in C.

  2. 2.

    ("inversely” to the above) A P gate applied inside box(i, j) in \({C}^{{\prime} }\) is equivalent to the P gate applied to cj in C.

  3. 3.

    For box(j, k) box(i, j), a P gate applied inside box(j, k) in \({C}^{{\prime} }\) is equivalent to a schedule of P gates applied to/in ci, cj, ck, box(i, j), box(i, k), and box(j, k) in C.

  4. 4.

    A P gate applied to any other location in \({C}^{{\prime} }\) is equivalent to the P applied to the same location in C.

Proof

We begin by pointing out that box(i, j), box(j, k), and box(i, k) are located on or before the layer i. This is clearly the case for box(i, j) and box(j, k) by construction. To show that the box(i, k) is located on or before layer i, we consider two cases. Firstly, when i < k, then box(i, k) is in the layer i by the definition. Otherwise, k < i < j, which takes us to the second case. Note that the smaller index, k, is the second parameter in both box(i, k) and box(j, k). In this case, both box(i, k) and box(j, k) belong to the layer k, and both have the same layer order. Since box(j, k) box(i, j), box(i, k) must be located before box(i, j). The linear functions generated by all these boxes are the EXORs of two inputs; that is, box(i, j), box(j, k), and box(i, k) always generate linear functions cicj, cjck, and cick, where ci, cj, and ck are the inputs to the qubits labeled by i, j, and k.

The first two items in the statement of Lemma follow from Lemma 2. First, \({c}_{j}^{{\prime} }={c}_{i}\oplus {c}_{j}\), and cicj is generated by the box(i, j) in C. The linear function generated by box(i, j) in \({C}^{{\prime} }\) is \({c}_{i}^{{\prime} }\oplus {c}_{j}^{{\prime} }={c}_{j}\), and cj is an initial input to C.

In item 3, box(j, k) generates \({c}_{k}^{{\prime} }\oplus {c}_{j}^{{\prime} }\) in \({C}^{{\prime} }\). We can rewrite the application of the Phase gate to this linear function using the identity Eq. (1) as follows:

$${c}_{j}^{{\prime} }\oplus {c}_{k}^{{\prime} }=({c}_{i}\oplus {c}_{j})\oplus {c}_{k}={c}_{j}\oplus {c}_{k}+{c}_{i}\oplus {c}_{k}+{c}_{i}\oplus {c}_{j}-{c}_{i}-{c}_{j}-{c}_{k}(\,{{\mathrm{mod}}}\,\,4)$$

The application of the P gates to the linear function in the circuit \({C}^{{\prime} }\) on the right-hand side of the equation is thus generated by the P and P gates applied to the linear functions ci, cj, ck, box(i, j), box(j, k), and box(i, k) in the circuit C. We note in passing that this involves the introduction of the P gates, which can be rewritten as P = PZ and Pauli-Z can be moved to the side of a Clifford circuit, or better yet by allowing to multiply the identity in Eq. (1) by modulo-4 integers, and thus allowing to treat arbitrary powers of the P gate.

Finally, item 4 also follows from Lemma 2, since all other linear functions generated are equal in C and \({C}^{{\prime} }\).□

Algorithm and runtime analysis

Algorithm 1

InitializeS

  Input: -P-CZ-

  Initialize: S ← n × n zero matrix

  for Pk(i)  -P- do

   S[i, i] ← k

  end for

  for CZ(i, j)  -CZ- do

   \((i,j)\leftarrow (\min (i,j),\max (i,j))\)

   S[i, j] ← 3

   S[i, i] ← (S[i, i] + 1)%4

   S[j, j] ← (S[j, j] + 1)%4

  end for

  Return: S

Finally, we discuss the algorithm that offers the construction of a depth-5n implementation of arbitrary Hadamard-free Clifford circuit in time O(n3). For simplicity, we assume that the -P-CZ-CNOT- decomposition25 and the depth-5n implementation of the -CNOT- layer24 are given. The algorithm starts by loading the boxes (either SWAP or SWAP+) given by the northwest-triangular matrix diagonalization procedure in a data structure that allows fast access to each box, as well as the ability to traverse through boxes in layer order, through InitializeC. Then, we consider the SWAP+-free case and set the Phases according to Lemma 1, by calling the algorithm Algorithm 1, InitializeS. This part has complexity O(n2). The Phase schedule is then updated iteratively, based on the proof of Theorem 1, as shown in the Algorithm 2. This part has complexity O(n3), since adding one SWAP+ box requires updating at most O(n) Phase schedules S[i, j], with each update taking constant effort, and there can be at most \(\frac{(n-1)n}{2}=O({n}^{2})\,\)SWAP+ boxes in the northwest-triangular matrix diagonalization circuit. The overall complexity is thus O(n3).

Algorithm 2

FindPhaseSchedule

 Input: -P-CZ-CNOT-

 C ← InitializeC(-CNOT-)

S ← InitializeS(-P-CZ-)

/* Now, we iteratively add SWAP+ boxes in descending layer order */

/* and update S */

 for Box(i, j) C. boxes (in layer - order) do

  if Box(i, j) is SWAP then

   

/* No corrections needed */

   Continue

  end if

  

/* Case 1 & 2 - Switch the P gates applied to box(i,j) and cj */

  (S[j, j], S[i, j]) ← (S[i, j], S[j, j])

  for {k = 1, k ≤ n, k ← k + 1} do

   if Box(k, j) Box(i, j) then

    

/* Case 3 - Replace the P gates applied to box(k,j) with 6 P (or P) gates */

    pk ← S[k, j]

    for (w1, w2)  {(i, i), (j, j), (k, k)} do

     S[w1, w2] ← (S[w1, w2] + 3pk)%4

    end for

    for (w1, w2)  {(i, j), (i, k)} do

     S[w1, w2] ← (S[w1, w2] + pk)%4

    end for

   end if

  end for

end for

Return: S

Extension to Clifford circuits

The ability to implement -CNOT-CZ- circuits in depth 5n over the LNN architecture coupled with the ability to implement -CZ- circuits in depth 2n + 225 directly implies the ability to implement arbitrary Clifford circuits in depth 5n + 2n + 2 = 7n + 2 by relying on the decomposition -X-Z-P-CX-CZ-H-CZ-H-P-[Ref. 23,Lemma 8]. However, by considering local optimizations, one may reduce this figure to 7n − 4.

Lemma 4

Any Clifford transformation over n qubits can be implemented over the LNN architecture as a circuit with the two-qubit gate depth of at most 7n − 4.

Proof

Recall that the decomposition -X-Z-P\({}_{1}\widehat{\,{{\mbox{-CZ}}}\,{}_{1}\,{{\mbox{-}}}\,}\)CX-H\({}_{1}\widehat{\,{{\mbox{-CZ}}}\,{}_{2}\,{{\mbox{-}}}\,}\)H2-P2-23 (here we changed the order of -CX- and -CZ-, which is allowed, mark all stages with subscripts, and use \(\widehat{\ \ }\) to denote operation together with the qubit reversal) features the stage -H1- with the Hadamard gates applying to all qubits.

We first express \(\widehat{\,{{\mbox{-CZ}}}\,{}_{2}\,{{\mbox{-}}}\,}\) as a depth 2n + 2 circuit according to25. The layer -H1- and the first four layers of the implementation of \(\widehat{\,{{\mbox{-CZ}}}\,{}_{2}\,{{\mbox{-}}}\,}\) can be rewritten by pushing Hadamards through to the right and flipping the controls and targets of the CNOT gates, as follows (illustrated for n = 7),

This is justified since Phase gates need not be applied until after the first depth-4 S stage25. This means that the depth-4 CNOT layer can be merged into the -CX- layer to its left, offering savings of 4 layers in depth compared to the baseline construction of depth 7n + 2.

We next assume that n is odd and show how to reduce the depth by additional 2 layers. To this end, focus on the last three two-qubit gate stages in the decomposition of \(\widehat{\,{{\mbox{-CZ}}}\,{}_{1}\,{{\mbox{-}}}\,}\)CX- into the depth 5n circuit, the layer -H1-, and the first two-qubit gate stage in the now-reduced decomposition of \(\widehat{\,{{\mbox{-CZ}}}\,{}_{2}\,{{\mbox{-}}}\,}\). Note that the gates in these four layers apply only to pairs of qubits (1, 2), (3, 4), …. For each pair of qubits (k, k + 1), the last three two-qubit gate stages in the decomposition of \(\widehat{\,{{\mbox{-CZ}}}\,{}_{1}\,{{\mbox{-}}}\,}\)CX- are either SWAP or SWAP+, with Phase gate or without it; the -H1- is always the layer of two Hadamards, and the first two-qubit gate stage of \(\widehat{\,{{\mbox{-CZ}}}\,{}_{2}\,{{\mbox{-}}}\,}\) is a set of P gates followed by the CNOT(k, k + 1). We group these cases into the following three and show that each can be implemented as a circuit of depth at most 2 (here, a, b, c, d, and e take Boolean values to indicate if the given gate is present, 1, on not, 0).

  1. a.

    Case 1: Last box is SWAP.

  2. b.

    Case 2: Last box is SWAP+ and e = 0.

  3. c.

    Case 3: Last box is SWAP+ and e = 1.

This means that for odd n a Clifford circuit can be implemented in depth 7n − 4 = 7n + 2 − 4 − 2. The case of even n is handled similarly, except to enable the gate cancellations in the routine reducing the depth by 2, we need to modify the reversal network used to implement \(\widehat{\,{{\mbox{-CZ-}}}\,}\) by inverting the controls and targets of the CNOT gates and exchanging odd and even layers, which clearly commute.□

-CNOT-CZ- circuits over all-to-all architecture

We next turn our attention to the all-to-all connected architecture. There are a number of methods available to synthesize a CNOT circuit, targeting various optimization criteria. Here, we consider two methods: MZ synthesis method that achieves the circuit depth \(n+O({\log }^{2}n)\)21, which is currently the best-known method for the number of qubits up to 1,345,000, and the PMH method achieving asymptotically optimal gate count of \(O({n}^{2}/\log (n))\)18.

Our goal is to establish a CZ basis in the CNOT circuits found by these algorithms, and when this is impossible, find the smallest number of CZ gate insertions that allow generating a CZ basis. We note that the basis must contain \(\frac{(n-1)n}{2}\) functions, and thus for the algorithms showing sub-quadratic scaling such as the PMH18, this will eventually become impossible without the addition of CZ gates. However, given the high leading constant19 and our experiments in Fig. 6, we do not anticipate this to happen for reasonably small numbers n. We also note that due to the input-dependant structure of the circuits generated by MZ and PMH algorithms, it is difficult to come up with proofs of performance guarantee. We thus resort to the combination of heuristics and random tests.

A Heuristic Algorithm

Algorithm 3

InsertCZ

 Input: -CNOT-

 Z ← CZ vectors generated by the -CNOT- layer

 B ← basis for Z

 B ← basis for the orthogonal complement of B

 for CZ(xi, xj) B do

  for (i, j) s.t. xili, xjlj and wires i, j are unoccupied in layer l do

   v ← CZ vector corresponding to lilj

   

/* Checking whether v is spanned by B */

   for bB do:

   if bv = 1 then

     v ← vb

    end if

    end for

    if v ≠ 0 then

    l.add(CZ(i, j))

     B. add(v)

     if vi,j = 1 then

      

/* The added CZ generates CZ(xi,xj) as desired */

      Continue

     else

      

/* The added CZ generates a different missing basis vector in B */

      B. remove(v)

     end if

    end if

   end for

   

/* CZ(xi,xj) cannot be inserted without increasing depth */

 -CNOT-.add(CZ(xi, xj))

  end for

Return: -CNOT-

In case there are insufficient linear functions in a given CNOT circuit to form a CZ basis, we can seek to conditionally add CZ gates to generate new linearly independent vectors in the space \({{\mathbb{F}}}_{2}^{(n-1)n/2}\). The algorithm accomplishing it works as follows. First, find the CZ subspace spanned by the linear functions generated, with a basis B, and its orthogonal complement, with the basis B. dim(B) specifies how many linearly independent conditional CZ gates need to be added since each linearly independent CZ gate reduces the dimensionality of B by one. We obtain B and B by performing Gaussian elimination on the CZ vectors generated. These observations allow establishing the exact minimum number of the CZ gates that need to be added and computing them, thus addressing the scenario of the gate count minimization. We next consider depth minimization.

For each layer in the given CNOT circuit, we greedily add the CZ to it, so long as it applies to two unoccupied qubits and reduces the dimensionality of B. If no such opportunity exists, we create a new layer at the end of the circuit and add the necessary CZ to it.

The runtime of the algorithm depends on two steps: performing Gaussian elimination, and checking whether lilj, the EXOR of linear functions on a pair of unoccupied qubits in a given layer, is spanned by B. Gaussian elimination algorithm takes O(n6) time, since CZ vectors have dimension (n − 1)n/2 = O(n2) and the number of linear functions generated is O(n2). It takes time O(n4) to check whether each lilj is spanned by B; however, the number of times this check is invoked depends on the number of missing basis vectors as well as the available spots to go through, which can be O(n4) in the worst case.

Given the performance of the above simple heuristic, we found it to be not worth the time to investigate other approaches to finding a full basis.

Algorithm performance on two synthesis methods

In this section, we test the Algorithm 3 on random CNOT circuits offered by the two synthesis algorithms considered, MZ21 and PMH18.

We sample linear reversible transformations uniformly at random using the method given by [ref. 23, Alg. 4]. We next apply MZ and PMH algorithms to generate two CNOT circuits. We check whether the CNOT circuits generate enough linear functions to form a basis in the CZ space via Gaussian elimination; if not, we apply the heuristic algorithm that conditionally inserts the CZ gates on the available qubits to generate the remaining linear functions. Finally, we compare the circuit depth and gate count before and after the CZ gate insertion.

The results for the depth-optimized MZ synthesis algorithm21 are summarized in Fig. 4. Remarkably, we found that the CNOT circuits synthesized generate a CZ basis in about 30% of cases when the number of qubits is between 25 and 125. When there are basis vectors missing, the insertion of a linearly independent CZ gate via Algorithm 3 most often comes at no increase in the total depth. The average increase in depth is less than 1 for each number of qubits n tested over 10,000 trials.

Fig. 4: Summary of results for the MZ algorithm21.
figure 4

a The probability that MZ implementation of a random linear reversible function contains a CZ basis (blue), and the probability the depth increases (orange) as a function of the number of qubits. b The average absolute increase in depth after building a CZ circuit into the MZ implementation of a random linear reversible function using the heuristic algorithm Algorithm 3, as a function of the number of qubits.

When the number of qubits is higher than ~150, we observed an increased yet very small probability that the CZ gate insertions increase the two-qubit depth of the circuit when using the heuristic Algorithm 3. Furthermore, the probabilities and the average increase in depth oscillate, see Fig. 4. A possible explanation of this phenomenon is twofold. Firstly, the MZ algorithm depth calculation relies on a recursion with multiple integer ceiling operations and a relatively small value of the magnitude of depth (~n). This suggests that the properties of the number n such as its parity may have a large effect on the value of the depth. Secondly, beyond n ~ 150, randomly generated circuits contain less than \(\frac{(n-1)n}{2}\) CNOT gates on average, being the minimum number of linear functions required to form a basis for CZ circuits. This increases the probability of a CZ gate insertion, which in turn increases the probability of depth growth. See Fig. 5 for relevant plots.

Fig. 5
figure 5

Probability that a CZ gate insertion increases depth (orange), the average fraction of missing basis vectors (blue), and the average number of the CNOT gates, divided by \(\frac{(n-1)n}{2}\) minus 1 (green), over circuit implementations of the MZ algorithm on 10,000 random linear reversible functions, as a function of the number of qubits.

For PMH algorithm18, we focus on the minimization of the gate count. The results are summarized in Fig. 6. Although asymptotically optimal, the PMH algorithm generates circuits with significantly higher gate counts for a small number of qubits compared to the MZ synthesis algorithm. As a result, an abundance of linear functions is generated as partial computations, which almost always form a basis in the CZ circuit space. We found that for circuits spanning 45+ qubits, the linear functions generated always form a basis in the CZ circuit space over 1000 random linear reversible operations tried for each number of qubits n up to 100.

Fig. 6: Summary of results for the PMH algorithm18.
figure 6

a Proportion of the CZ vectors generated as a function of the number of qubits. b Gate counts before and after CZ gate insertions as a function of the number of qubits.

Discussion

In this paper, we studied the ability to insert Phase gates into CNOT circuits to induce arbitrary Hadamard-free Clifford transformation generated by the given linear reversible function. The reason this is often possible to accomplish is the application of Phase gates to linear functions computed inside the CNOT circuit generates vectors in the linear space representing CZ circuits, and if sufficiently many independent vectors are found that form the basis in the CZ circuits space, this guarantees the ability to implement any CZ transformations adjacent to such a CNOT transformation. In particular, we showed a depth upper bound of 5n for the implementation of arbitrary Hadamard-free Clifford transformation, matching the best-known upper bound of 5n to generate arbitrary linear reversible transformation over LNN architecture. We also developed a heuristic that offers an average depth increase of less than 1 for all number of inputs n tried (n ≤ 250) to turn a linear reversible circuit chosen uniformly at random into a Hadamard-free Clifford circuit generated by it over all-to-all architecture. In other words, depths of best-known Hadamard-free Clifford and linear reversible computations appear to be essentially equal in at least two scenarios—LNN architecture, and all-to-all architecture with stochastic samples of limited size. This is surprising, given the difference between group sizes—\({2}^{{n}^{2}+O(1)}\) for linear reversible computations vs \({2}^{1.5{n}^{2}+O(1)}\) for Hadamard-free Clifford circuits. Indeed, one would expect the bigger group to demand higher circuit complexity.

We would like to highlight the important role the sorting network (also known as the qubit reversal circuit) composed with SWAP gates plays over the LNN architecture. A modification of the sorting network that removes a subset of the CNOT gates (each SWAP can be implemented by three CNOT gates) and inserts Phase gates implements arbitrary -CZ- transformation up to qubit reversal in depth 2n + 225. A modification of the sorting network that erases a subset of the CNOTs implements an arbitrary qubit swapping circuit in depth no more than 3n24. A modification of the set of two back-to-back sorting networks that removes a subset of the CNOT gates implements arbitrary linear reversible function in depth no more than 5n24. A modification of the set of two back-to-back sorting networks that removes a subset of the CNOT gates and inserts Phase gates implements an arbitrary Hadamard-free Clifford circuit in depth no more than 5n. Finally, a modification of the set of three back-to-back sorting networks with a Hadamard layer between some two of them that removes a subset of the CNOT gates and inserts Phase gates, together with a few local optimizations, implements arbitrary Clifford circuit in depth no more than 7n − 4.