Introduction

Quantum computations are studied for their promise to outperform classical counterparts for certain kinds of computations1. The Clifford group is an important finite subgroup of the full unitary group, which describes the set of all quantum computations. Although Clifford circuits can be simulated classically2,3 in low-degree polynomial time and have a simple structure4 (they admit efficient parametrization and can be computed by linear-depth circuits), the group is most famous for lying at the core of quantum error correction1, which is believed to be necessary for scalable quantum computation. Even within the study of fault tolerance, the Clifford group plays multiple roles. To illustrate, all (standard) encoding circuits are Clifford1, and so are the circuits for state distillation5,6, necessary for fault-tolerant implementation of non-Clifford gates. Clifford circuits lie at the core of randomized benchmarking protocols7,8. Other use cases include shadow tomography9,10, the study of entanglement1,11, and quantum data hiding12. It is perhaps fair to regard the Clifford group as one of the most visible and important subgroups of the group of all quantum computations.

Superconducting circuits and trapped ions are two technological frameworks that produced a stream of (universal prototype) programmable quantum computers, publicly available since the year 2016. Each technology comes in a range of flavors: e.g., superconducting circuits can be based on phase, charge, or flux qubits (or even hybrid kinds), and rely on various qubit coupling mechanisms, and trapped ions can be based on various ion species and rely on different approaches to the two-qubit gates (e.g., stationary vs mobile qubits). However, no matter the specific flavor, all prototype quantum computers based on these two approaches share one property13,14: the two-qubit gate has notably lower fidelity than a single-qubit gate. Thus, to the first degree of approximation, the fidelity of an entire quantum computation depends on the number of two-qubit gates it uses. To make a more subtle point, since the single-qubit gates are most frequently implemented by pulses with real-valued control parameters, the number of two-qubit gates in a circuit upper bounds the number of the single-qubit gates (up to a constant factor), meaning the reduction of the two-qubit gate count likely leads to the reduction in the number of single-qubit gates. We further note that the CNOT gates are available natively (i.e., requiring the minimal number of one two-qubit physical-level interaction) in both superconducting circuits and trapped ions technologies. Finally, recall that the physical-level entangling pulses frequently take the form of XX, ZX, and ZZ, requiring single-qubit corrections to turn those interactions into commonly used CNOT or CZ gates. This means that minimizing single-qubit gate count in an abstract circuit may not directly minimize the number of single-qubit physical pulses, since the single-qubit gates will be reshuffled during technology mapping. This justifies our focus on minimizing the CNOT gate count, selected as the optimization criterion in this paper.

In this paper, we study the problem of optimal synthesis of Clifford circuits. Since the problem of optimal circuit synthesis is hard, we restrict our attention to a small number of qubits, at most 6. The number of Clifford group elements over 6 qubits, \(2.1\times 1{0}^{23}\), is still very large, and we employ a range of techniques to make the search tractable using modern computers. At the core of our approach is a mechanism to break down the set of Clifford unitaries into a set of classes containing unitaries sharing a similar optimal circuit structure, efficient computation of the canonical representative of each class, and efficient manipulation of class members and the database of canonical representatives.

We define the n-qubit Clifford group \({{{{\mathcal{C}}}}}_{n}\) as the group of 2n × 2n symplectic matrices M over the two-element field \({{\mathbb{F}}}_{2}\), \({{{\rm{Sp}}}}(2n,{{\mathbb{F}}}_{2})\,:=\,\{M:\,{M}^{T}{{{\Omega }}}_{n}M\,=\,{{{\Omega }}}_{n}\}\), where \({M}^{T}\) denotes the transpose matrix, \({{{\Omega }}}_{n}\) is the matrix \(\left(\begin{array}{ll}0&{I}_{n}\\ {I}_{n}&0\end{array}\right)\), and \({I}_{n}\) is the n × n identity matrix. Symplectic matrices are equivalent to and alternatively known as the tableaux3. The size of the symplectic group is \(| {{{\rm{Sp}}}}(2n,{{\mathbb{F}}}_{2})| ={2}^{{n}^{2}}\mathop{\prod }\limits_{j=1}^{n}({2}^{2j}-1)\), which for the purpose of this paper implies \(| {{{{\mathcal{C}}}}}_{6}| \,=\,208{,}114{,}637{,}736{,}580{,}743{,}168{,}000\,\approx \,2.1\times 1{0}^{23}\) and gives the size of the search space we are exploring.
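The group-order formula above is easy to evaluate exactly with integer arithmetic. The following minimal sketch (the function name `symplectic_group_order` is ours, chosen for illustration) reproduces the value quoted for n = 6.

```python
from math import prod


def symplectic_group_order(n: int) -> int:
    """Evaluate |Sp(2n, F_2)| = 2^(n^2) * prod_{j=1..n} (2^(2j) - 1)."""
    return 2 ** (n * n) * prod(4 ** j - 1 for j in range(1, n + 1))


# Print the order for n = 1..6; the last line is the 6-qubit search space size.
for n in range(1, 7):
    print(n, symplectic_group_order(n))
```

Multiplying the n = 6 value by 4^6 (one Pauli per qubit) gives the size of the search space with Pauli gates included.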

Tableau representation is particularly useful since it allows one to define quantum gates and circuits directly, without resorting to the standard definitions in quantum information that employ \({2}^{n}\times {2}^{n}\) unitary matrices1. Indeed,

  • the Hadamard gate H on qubit k can be defined as the 2n × 2n identity matrix with swapped columns k and n + k,

  • the Phase gate P on qubit k can be defined as the addition of column k to column n + k in the 2n × 2n identity matrix,

  • the CNOT gate with control qubit k and target j performs simultaneous addition of column k to column j and column n + j to column n + k in the 2n × 2n identity matrix,

and circuits are matrix multiplications. The computational completeness of the {H, P, CNOT} library is readily exposed by the ability to apply Gaussian elimination to obtain arbitrary symplectic matrix as a product of gates. An additional advantage of such a definition of gates and circuits comes from displaying the capacity to implement transformations by Clifford gates efficiently by a computer program.
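The three gate definitions above translate directly into column operations on a binary matrix. The sketch below is an illustration under our own assumptions (0-based qubit indices, matrices stored as lists of rows, function names invented here); it applies a short circuit to the identity tableau and checks the symplectic condition \({M}^{T}{{{\Omega }}}_{n}M={{{\Omega }}}_{n}\).

```python
def identity(n):
    """2n x 2n identity matrix over F_2, stored as a list of rows."""
    return [[int(r == c) for c in range(2 * n)] for r in range(2 * n)]


def apply_h(m, k):
    """Hadamard on qubit k: swap columns k and n + k."""
    n = len(m) // 2
    for row in m:
        row[k], row[n + k] = row[n + k], row[k]
    return m


def apply_p(m, k):
    """Phase gate on qubit k: add column k to column n + k (mod 2)."""
    n = len(m) // 2
    for row in m:
        row[n + k] ^= row[k]
    return m


def apply_cnot(m, k, j):
    """CNOT with control k and target j: add column k to column j
    and column n + j to column n + k (mod 2)."""
    n = len(m) // 2
    for row in m:
        row[j] ^= row[k]
        row[n + k] ^= row[n + j]
    return m


def is_symplectic(m):
    """Check M^T Omega M = Omega over F_2, where Omega pairs index a with a + n."""
    n = len(m) // 2
    size = 2 * n
    for a in range(size):
        for b in range(size):
            s = sum(m[r][a] & m[(r + n) % size][b] for r in range(size)) % 2
            if s != int(b == (a + n) % size):
                return False
    return True


# A small 3-qubit circuit built gate by gate from the identity tableau.
m = identity(3)
apply_h(m, 0)
apply_cnot(m, 0, 1)
apply_p(m, 1)
apply_cnot(m, 1, 2)
print(is_symplectic(m))
```

Because Pauli gates and phases are factored out, applying H, P, or CNOT twice returns the identity tableau in this representation.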

As a side note, we highlight that each element of the Clifford group \({{{{\mathcal{C}}}}}_{n}\) defines an equivalence class of \({2}^{n}\times {2}^{n}\) unitary matrices realizable by the circuits over H, P, and CNOT gates (defined, in turn, via unitary matrices1). A pair of unitary matrices is considered equivalent if they can be mapped to each other by the left (or right) multiplication with single-qubit Pauli gates and overall phase factors. Since we focus on the minimization of the two-qubit gate count, Pauli gates and phase factors can be safely factored out. Had Pauli gates been included in the Clifford group, the search space size for n = 6 would read \(8.5\times 1{0}^{26}\).

Results

6-qubit optimal Clifford circuits

The distribution of the number of equivalence classes across CNOT gate costs is shown in Table 6. For the number of qubits 2 through 5 the most complex function to implement is unique (within the equivalence class definition), and it is equivalent to a cyclic permutation of qubits. For n = 6, the cyclic permutation is one of three such functions; the other two are illustrated in Fig. 1. The small number of equivalence classes for a small number of qubits implies an efficient formula (based on ReduceU) to compute the CNOT cost of a small Clifford unitary.

Fig. 1: All most expensive 6-qubit Clifford unitaries requiring 15 entangling gates (up to left and right multiplication by the single-qubit gates and qubit relabeling).

a left: a compact representation in the form \((U\otimes U)\,{{\mbox{SWAP}}}\,\), right: its optimal implementation. b left: a compact representation in the form \((U^{\prime} \otimes V^{\prime} )\,{{\mbox{SWAP}}}\,\), right: its optimal implementation. Not illustrated is the cyclic SWAP of all 6 qubits, which also requires 15 entangling gates.

We ran a script to calculate the distribution of the number of Clifford group elements across optimal CNOT gate costs. Given the database, it took a few days to collect the data using an HPC system. This computation is highly parallelizable, and the runtime can be reduced significantly with many processors, e.g., GPUs; we have not pursued those reductions. The results are reported in Table 1.

Table 1 The distribution of the number of 6-qubit Clifford unitaries across the entangling gate cost.

We used the database to look for examples of quantum Clifford advantage over classical reversible CNOT circuits, meaning optimal CNOT circuits that can be implemented with fewer entangling gates as a Clifford circuit. We found one such example, illustrated in Fig. 2, that gives a reduction of 14 gates into 12, improving the 8 to 7 reduction seen earlier4\(\left(\,{{\mbox{indeed,}}}\,\frac{14}{12} > \frac{8}{7}\right)\).

Fig. 2: Quantum advantage by Clifford circuits.

An optimal CNOT gate circuit (left) can be implemented with fewer entangling gates as an optimal Clifford circuit (right).

The compiler was benchmarked using both consumer-grade and enterprise-grade systems for a test set with 10,000 elements of the Clifford group \({{{{\mathcal{C}}}}}_{6}\). Each element was generated by a Clifford circuit with 600 randomly chosen gates over the library {H, P, CNOT}. The number of gates was selected to be high enough to yield a close-to-uniform random distribution over the elements of the group \({{{{\mathcal{C}}}}}_{6}\). We observed that such a random test set is dominated by the elements with costs 11 and 12. The compiler runtime reported below is the time required to obtain optimal circuits for all test set elements divided by the size of the test set. We observed a runtime of 0.0009358 s on a laptop with an Intel® i7-1068NG7 2.3 GHz CPU and 16GB RAM with a USB-C-attached consumer-grade SSD. The search relies on the database stored on the SSD and a 2.5GB index in RAM; see the section “Software tricks” for details. The time reported measures hot cache performance; the cold cache performance is 0.003708 s per optimal circuit, on average. The compiler performance improves when the entire database can be stored in RAM. We observed a hot cache runtime of approximately 0.0006274 s on a server with Intel® Xeon® 128-CPU E7-4850 v4 @ 2.10GHz and 6TB RAM. The process of loading the full database into RAM took approximately 2 h.

This performance makes it possible to use our implementation to obtain individual circuits and entire randomized benchmarking schedules in mere seconds using consumer-grade hardware as well as online via a web interface. For demanding applications such as peephole optimization of large circuits, we suggest relying on large-RAM commercial-grade servers and note that it takes roughly half the time to look up the cost without computing the optimal circuit (the procedure that would likely be called most frequently during peephole optimization).

The average runtime of our compiler for random n-qubit Clifford operators with n ≤ 5 is shown in Table 2.

Table 2 Average runtime for optimally compiling n-qubit Clifford operators with the full database of reduced elements loaded into RAM. The runtime was measured on MacBook Pro laptop (early 2015 model) with Intel® i7-5557U 3.1GHz CPU and 16GB RAM.

Optimal 2-designs

Unitary designs15 are probability distributions on the unitary group that reproduce low-order moments of the Haar (uniform) distribution. Of particular interest are unitary designs that can be efficiently implemented by quantum circuits16. Such designs can serve as a substitute for the Haar distribution in certain randomized quantum protocols such as data hiding12, estimating fidelity of quantum operations8,17, and quantum state tomography10. In this section, we leverage the database of reduced Clifford elements to construct optimal unitary designs that have the minimum average cost, subject to the constraint that all elements of the design are Clifford operators.

Let \(U({2}^{n})\) be the group of unitary complex matrices of size \({2}^{n}\times {2}^{n}\). Suppose \({{{\mathcal{D}}}}\,\subseteq \,U({2}^{n})\) is a finite subset and \(\mu :\,{{{\mathcal{D}}}}\,\to \,{{\mathbb{R}}}_{+}\) is a probability distribution on \({{{\mathcal{D}}}}\). The pair \(({{{\mathcal{D}}}},\mu )\) is called a unitary 2-design18 if

$$\mathop{\sum}\limits_{\hat{U}\in {{{\mathcal{D}}}}}\mu (\hat{U})({\hat{U}}^{{\dagger} }\hat{A}\hat{U})\otimes ({\hat{U}}^{{\dagger} }\hat{B}\hat{U})={\int}_{U({2}^{n})}({\hat{U}}^{{\dagger} }\hat{A}\hat{U})\otimes ({\hat{U}}^{{\dagger} }\hat{B}\hat{U}){\rm{d}}U$$
(1)

for any complex matrices \(\hat{A}\) and \(\hat{B}\). Here the tensor product separates two n-qubit registers and the integral in the right-hand side of Eq. (1) is the average over the Haar distribution on the unitary group \(U({2}^{n})\). We reserve the hat notation for complex unitary matrices to avoid confusion with binary symplectic matrices considered in the rest of the paper. Below we choose \({{{\mathcal{D}}}}\) to be the n-qubit Clifford group and construct a probability distribution μ that minimizes the average cost

$$\mathop{\sum}\limits_{\hat{U}\in {{{\mathcal{D}}}}}\mu (\hat{U})\cdot {{{\rm{cost}}}}(\hat{U}),$$
(2)

subject to the constraint that \(({{{\mathcal{D}}}},\mu )\) is a unitary 2-design. Here \({{{\rm{cost}}}}(\hat{U})\) is the minimum number of the CNOT gates required to implement \(\hat{U}\) by a quantum circuit composed of the Hadamard, Phase, and CNOT gates.

Since Pauli operators have zero cost, we can assume without loss of generality that the optimal solution μ is Pauli-invariant, i.e., \(\mu (\hat{U})\,=\,\mu (\hat{U}\hat{O})\) for all n-qubit Pauli operators \(\hat{O}\). As defined earlier, the unitary version of the n-qubit Clifford group is isomorphic to \({{{{\mathcal{C}}}}}_{n}\,\times \,{\{I,X,Y,Z\}}^{n}\). Here we ignore the overall phase factors. Define the probability distribution \(\pi :\,{{{{\mathcal{C}}}}}_{n}\,\to \,{{\mathbb{R}}}_{+}\) such that \(\pi (U)={4}^{n}\mu (U\times P)\) for all \(U\,\in \,{{{{\mathcal{C}}}}}_{n}\) and \(P\,\in \,{\{I,X,Y,Z\}}^{n}\). The distribution π is well-defined whenever μ is Pauli-invariant. In the section “Pauli mixing constraint”, we show that μ is a Clifford 2-design iff π obeys the so-called Pauli mixing constraint16

$${\Pr }_{U \sim \pi }[Ux=y]:=\mathop{\sum}\limits_{U\in {{{{\mathcal{C}}}}}_{n}:\,Ux=y}\,\pi (U)=\frac{1}{{4}^{n}-1}\,\,{{\mbox{for all non-zero vectors}}}\,\,x,y\in {\{0,1\}}^{2n}.$$
(3)

Furthermore, μ has the average cost

$$\mathop{\sum}\limits_{U\in {{{{\mathcal{C}}}}}_{n}}\pi (U)\cdot {{{\rm{cost}}}}(U).$$
(4)

Thus it suffices to minimize the average cost Eq. (4) over variables π(U) ≥ 0 subject to the normalization constraint \({\sum }_{U\in {{{{\mathcal{C}}}}}_{n}}\pi (U)\,=\,1\) and the Pauli mixing constraint, Eq. (3). This gives a linear program with \(| {{{{\mathcal{C}}}}}_{n}|\) variables.
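As a sanity check of the Pauli mixing constraint, Eq. (3), one can verify by brute force that the uniform distribution on a small symplectic group satisfies it. The sketch below is illustrative only: the helper names are ours, the enumeration is exponential in n, and it is intended for n ≤ 2.

```python
from fractions import Fraction
from itertools import product


def symplectic_matrices(n):
    """Brute-force enumeration of Sp(2n, F_2); only feasible for n <= 2."""
    size = 2 * n
    for bits in product((0, 1), repeat=size * size):
        m = [bits[r * size:(r + 1) * size] for r in range(size)]
        if all(
            sum(m[r][a] & m[(r + n) % size][b] for r in range(size)) % 2
            == int(b == (a + n) % size)
            for a in range(size)
            for b in range(size)
        ):
            yield m


def matvec(m, x):
    """Matrix-vector product over F_2."""
    return tuple(sum(mij & xj for mij, xj in zip(row, x)) % 2 for row in m)


def is_pauli_mixing(dist, n):
    """dist is a list of (symplectic matrix, probability) pairs.
    Returns True iff Pr[Ux = y] = 1 / (4^n - 1) for all non-zero x, y."""
    target = Fraction(1, 4 ** n - 1)
    nonzero = [v for v in product((0, 1), repeat=2 * n) if any(v)]
    for x in nonzero:
        hits = {y: Fraction(0) for y in nonzero}
        for m, p in dist:
            hits[matvec(m, x)] += p
        if any(h != target for h in hits.values()):
            return False
    return True


group = list(symplectic_matrices(2))
uniform = [(m, Fraction(1, len(group))) for m in group]
print(len(group), is_pauli_mixing(uniform, 2))
```

The same `is_pauli_mixing` check can be applied to any candidate distribution π given as explicit (matrix, probability) pairs; a distribution supported on a single element fails it.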

The next step is to reduce the number of variables and the number of constraints in the linear program. Suppose π is a Pauli mixing distribution on \({{{{\mathcal{C}}}}}_{n}\), that is, π obeys Eq. (3). Define a symmetrized version of π as follows. First, sample \(U\,\in \,{{{{\mathcal{C}}}}}_{n}\) from the distribution π. Second, sample \(W\,\in \,{S}_{n}\) and \(L,R\,\in \,{{{{\mathcal{C}}}}}_{n}^{0}\) from the uniform distribution on the respective groups. Finally, output \(U^{\prime} \,=\,L{W}^{-1}UWR\). The probability distribution of \(U^{\prime}\) is given by

$$\pi ^{\prime} (U^{\prime} )=\frac{1}{{6}^{2n}n!}\mathop{\sum}\limits_{L,R\in {{{{\mathcal{C}}}}}_{n}^{0}}\,\mathop{\sum}\limits_{W\in {S}_{n}}\pi (W{L}^{-1}U^{\prime} {R}^{-1}{W}^{-1}).$$

Since the cost is invariant under a qubit relabeling and left/right multiplications by the elements of the local subgroup \({{{{\mathcal{C}}}}}_{n}^{0}\), the distributions π and \(\pi ^{\prime}\) have the same average cost. We claim that \(\pi ^{\prime}\) is Pauli mixing. Indeed, pick any non-zero vectors \(x,y\,\in \,{\{0,1\}}^{2n}\), a qubit permutation \(W\,\in \,{S}_{n}\), and local Cliffords \(L,R\,\in \,{{{{\mathcal{C}}}}}_{n}^{0}\). Then

$${\Pr }_{U \sim \pi }[L{W}^{-1}UWRx\,=\,y]={\Pr }_{U \sim \pi }[Ux^{\prime} \,=\,y^{\prime} ]=\frac{1}{{4}^{n}-1},$$
(5)

where \(x^{\prime} \,=\,WRx\,\ne \,0\) and \(y^{\prime} \,=\,W{L}^{-1}y\,\ne \,0\). The last equality in Eq. (5) follows from the assumption that π is Pauli mixing. Thus \(\pi ^{\prime}\) is a convex linear combination of Pauli mixing distributions, that is, \(\pi ^{\prime}\) itself is Pauli mixing.

The above shows that an optimal Clifford 2-design can be found by minimizing the average cost Eq. (4) over symmetric Pauli mixing distributions π such that the probability π(U) depends only on the equivalence class [U] that contains U. Such distribution π can be compactly specified by considering the set of reduced elements

$${{{{\mathcal{R}}}}}_{n}:=\{{\mathsf{ReduceU}}(U):\,U\,\in \,{{{{\mathcal{C}}}}}_{n}\}.$$

Given a reduced element \(U\,\in \,{{{{\mathcal{R}}}}}_{n}\), define the probability distribution

$$\eta (U)=\mathop{\sum}\limits_{U^{\prime} \in [U]}\pi (U^{\prime} )=\pi (U)\cdot | [U]| .$$

Note that η is a probability distribution on \({{{{\mathcal{R}}}}}_{n}\) since each equivalence class [U] contains a unique reduced element, see the section “Computation of ReduceU”. For brevity, we will refer to η as a reduced distribution. The average cost of the original distribution π depends only on η and can be computed using the formula

$$\mathop{\sum}\limits_{U\in {{{{\mathcal{R}}}}}_{n}}\eta (U)\cdot {{{\rm{cost}}}}(U).$$
(6)

It remains to express the Pauli mixing constraint in terms of the reduced distribution η. Given a reduced element \(U\,\in \,{{{{\mathcal{R}}}}}_{n}\) and non-zero vectors \(x,y\,\in \,{\{0,1\}}^{2n}\), define the quantity

$$g(U,x,y)=\frac{\#\{U^{\prime} \in [U]\,:\,U^{\prime} x=y\}}{| [U]| }.$$

In words, g(U, x, y) is the probability that a random uniformly distributed element of the equivalence class [U] maps x to y. Then π is Pauli mixing iff

$$\mathop{\sum}\limits_{U\in {{{{\mathcal{R}}}}}_{n}}\eta (U)g(U,x,y)=\frac{1}{{4}^{n}-1}$$
(7)

for all non-zero vectors \(x,y\,\in \,{\{0,1\}}^{2n}\). It remains to note that some constraints Eq. (7) are redundant. Indeed, since the equivalence class [U] is invariant under the left/right multiplications of U by the elements of the local subgroup \({{{{\mathcal{C}}}}}_{n}^{0}\), one has g(U, x, y) = g(U, Lx, Ry) for all \(L,R\in {{{{\mathcal{C}}}}}_{n}^{0}\). Suppose \(({x}_{j},{x}_{n+j})\,\ne \,(0,0)\) for some qubit j. Then one can choose \(L\,\in \,{{{{\mathcal{C}}}}}_{n}^{0}\) acting non-trivially only on the jth qubit such that \({(Lx)}_{j}=0\) and \({(Lx)}_{n+j}=1\), see the section “Computation of ReduceU”. Applying this transformation to all qubits we conclude that the Pauli mixing constraint Eq. (7) has to be imposed only for vectors

$$x,y\in \{({0}^{n}z):\,z\in {\{0,1\}}^{n}\setminus {0}^{n}\}.$$
(8)

Minimizing the average cost Eq. (6) over variables η(U) ≥ 0 with \(U\,\in \,{{{{\mathcal{R}}}}}_{n}\), subject to the normalization \({\sum }_{U\in {{{{\mathcal{R}}}}}_{n}}\eta (U)\,=\,1\) and the Pauli mixing constraints Eqs. (7) and (8), gives a linear program with \(| {{{{\mathcal{R}}}}}_{n}|\) variables and \(1\,+\,{({2}^{n}-1)}^{2}\) equality constraints. We were able to find an optimal solution of this linear program numerically for n = 2, 3, 4 qubits. The optimal reduced distributions η presented in Table 3, Table 4, and Table 5 are compactly represented by a list of reduced elements \({U}_{1},{U}_{2},\ldots,{U}_{m}\in {{{{\mathcal{R}}}}}_{n}\) along with their probabilities \(\eta ({U}_{j})\). Only reduced elements that appear with non-zero probability are shown. The tables display an optimal circuit implementation of each reduced element \({U}_{j}\). To avoid clutter, we omit single-qubit gates on the left and on the right. The actual 2-design has the form \(L{W}^{-1}{U}_{j}WR\), where the index \(j\,\in \,\{1,2,\ldots,m\}\) is sampled with the probability \(\eta ({U}_{j})\), the qubit permutation W is sampled uniformly from \({S}_{n}\), and L, R are sampled uniformly from the local subgroup \({{{{\mathcal{C}}}}}_{n}^{0}\).
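For reference, the reduced linear program can be collected in one place. The display below only restates Eqs. (6), (7), and (8) together with the normalization and introduces no new assumptions; the constraint count \(1+{({2}^{n}-1)}^{2}\) evaluates to 10, 50, and 226 for n = 2, 3, and 4.

$$\begin{array}{ll}\mathop{\min }\limits_{\eta }&\mathop{\sum}\limits_{U\in {{{{\mathcal{R}}}}}_{n}}\eta (U)\cdot {{{\rm{cost}}}}(U)\\ {{{\rm{subject}}}}\,{{{\rm{to}}}}&\mathop{\sum}\limits_{U\in {{{{\mathcal{R}}}}}_{n}}\eta (U)\,g(U,x,y)=\frac{1}{{4}^{n}-1}\quad {{\mbox{for all}}}\,\,x,y\in \{({0}^{n}z):\,z\in {\{0,1\}}^{n}\setminus {0}^{n}\},\\ &\mathop{\sum}\limits_{U\in {{{{\mathcal{R}}}}}_{n}}\eta (U)=1,\quad \eta (U)\ge 0\,\,{{\mbox{for all}}}\,\,U\in {{{{\mathcal{R}}}}}_{n}.\end{array}$$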

Table 3 Optimal two-qubit Clifford 2-design with the average cost 1.5. This coincides with the average cost of the full Clifford group \({{{{\mathcal{C}}}}}_{2}\).

Table 4 Optimal three-qubit Clifford 2-design with the average cost 3.12363… For comparison, the full Clifford group \({{{{\mathcal{C}}}}}_{3}\) has the average cost 3.50937…

Table 5 Optimal four-qubit Clifford 2-design with the average cost 5.08034… For comparison, the full Clifford group \({{{{\mathcal{C}}}}}_{4}\) has the average cost 5.85856… We note that all except for two circuits in this table have cost 5. The remaining pair of circuits have cost 6.

Comparison to prior work

Similar-spirited prior work includes the synthesis of 4-qubit optimal Clifford circuits19, the synthesis of 4-bit optimal reversible circuits20, and the optimal solution of the Rubik’s cube puzzle21. Ref. 19 is most closely related to our work, given the focus on Clifford circuits; the difference is that we chose to study the two-qubit gate cost, which better reflects the constraints of the existing quantum computers than the total gate count. The search space size is \(4.7\times 1{0}^{10}\) in ref. 19 compared to \(2.1\times 1{0}^{23}\) in our work, a difference of almost 13 orders of magnitude. Ref. 20 studies reversible circuits, a highly relevant type of computation. Its search space size is \(2.1\times 1{0}^{13}\), meaning we solved a problem with a search space 10 orders of magnitude larger. Finally, ref. 21 studies the Rubik’s cube, which is also a finite group. Its search space size is \(4.3\times 1{0}^{19}\), meaning ours is almost 4 orders of magnitude larger.

Discussion

In this paper, we reported algorithms and their C++ implementation that compute all two-qubit gate count optimal 6-qubit Clifford circuits. There are about \(2.1\times 1{0}^{23}\) different Clifford functions. The large search space required us to employ server-class machines to make the computation possible. In particular, we used HPC to break down the set of canonical representatives of Clifford group elements sharing similar optimal circuit structure, and store them in a database of size 2.1TB. Given this database on an SSD and a 2.5GB index file in RAM, the time to extract an optimal circuit using a consumer-grade laptop is 0.0009358 s, about 10 times faster than the typical access time for a spindle drive. The time to extract an optimal circuit using an enterprise-level system while storing the database in RAM is 0.0006274 s, about 15 times faster than the typical HDD access time. We used the database to establish the maximal gate count needed to implement an arbitrary 6-qubit Clifford unitary and showed the distribution of the number of Clifford functions across their required gate counts. We established a new example of quantum advantage by Clifford circuits over CNOT gate circuits and found optimal Clifford 2-designs for the number of qubits up to, and including, 4.

Methods

Algorithm and its implementation: an overview

Our approach relies on the use of pruned breadth-first search (BFS) to generate a number of databases containing Clifford unitaries that can be implemented by equal cost optimal circuits, and augment it by a set of tools that extract useful statistics (e.g., distribution of the number of unitaries by entangling gate cost, average cost, largest cost) as well as individual optimal circuits. BFS is a strategy that relies on taking optimal implementations of cost up to k, modifying them by applying cost-1 transformations to cost-k elements, and recording the result as a cost k + 1 element if it is not yet found in the database. BFS is initiated with the identity operator costing zero and ends when all elements in the target set were explored. While our algorithm can be applied to obtain optimal 2-, 3-, 4-, 5-, and 6-qubit Clifford circuits using modern computers, we focus the rest of the description on the most difficult but still amenable to classical computers 6-qubit case.
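The level-by-level structure of the search can be illustrated on a toy scale. The sketch below is not the pruned BFS over canonical representatives described in this paper: as a simplifying assumption it enumerates every group element explicitly, which is only practical for very small n. It treats single-qubit gates as free and a CNOT as a cost-1 move, with tableaux stored as tuples of rows and gates applied as column operations; all function names are ours.

```python
def apply(m, gate):
    """Apply a gate to tableau m (a tuple of row tuples) as a column operation."""
    n = len(m) // 2
    rows = [list(r) for r in m]
    for r in rows:
        if gate[0] == "H":
            k = gate[1]
            r[k], r[n + k] = r[n + k], r[k]
        elif gate[0] == "P":
            k = gate[1]
            r[n + k] ^= r[k]
        else:  # ("CNOT", control, target)
            k, j = gate[1], gate[2]
            r[j] ^= r[k]
            r[n + k] ^= r[n + j]
    return tuple(tuple(r) for r in rows)


def close_under_local(seeds, n):
    """All tableaux reachable from the seeds by appending single-qubit gates."""
    local = [(g, k) for g in "HP" for k in range(n)]
    found, stack = set(seeds), list(seeds)
    while stack:
        m = stack.pop()
        for gate in local:
            m2 = apply(m, gate)
            if m2 not in found:
                found.add(m2)
                stack.append(m2)
    return found


def cost_levels(n):
    """levels[k] is the set of n-qubit Clifford tableaux with CNOT cost exactly k."""
    ident = tuple(tuple(int(r == c) for c in range(2 * n)) for r in range(2 * n))
    cnots = [("CNOT", k, j) for k in range(n) for j in range(n) if k != j]
    levels = [close_under_local({ident}, n)]
    seen = set(levels[0])
    while True:
        # One more entangling gate applied to every element of the last level.
        seeds = {apply(m, g) for m in levels[-1] for g in cnots} - seen
        if not seeds:
            return levels
        new = close_under_local(seeds, n) - seen
        seen |= new
        levels.append(new)


levels = cost_levels(2)
sizes = [len(level) for level in levels]
mean_cost = sum(k * s for k, s in enumerate(sizes)) / sum(sizes)
print(sizes, mean_cost)
```

For n = 2 the mean cost printed by this sketch can be compared with the average cost of the full group \({{{{\mathcal{C}}}}}_{2}\) quoted in the caption of Table 3.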

Since the database we are synthesizing contains Clifford unitaries, the first order of business is to choose a suitable data structure to store those. The data structure must be both compact and allow quick application of gates; this is because BFS boils down to a series of gate applications and memory lookups. We start with the tableau, which is naturally suited for quick gate application, and modify it to remove the two last rows, corresponding to X and Z stabilizers, respectively3. As described in the section “Data structure”, these rows can be quickly restored. However, removing them reduces the storage from \(4{n}^{2}{| }_{n = 6}=144\) bits to \(2\times 2n(n-1){| }_{n = 6}=120\) bits. Each unitary is thus stored across two 64-bit machine words (each half corresponding to X and Z parts), with 4 bits per machine word of (yet) unused space. While the information-theoretic minimum storage requirement, \(\lceil {\log }_{2}(| {{{{\mathcal{C}}}}}_{6}| )\rceil \,=\,78\), implies that more compact storage exists, BFS imposes the requirement of quick gate application and we furthermore rely on canonicity (discussed in the next paragraph) to reduce the size of the database; thus, it is not obvious if more efficient storage is possible.
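The bit counts in this paragraph follow from short integer arithmetic, reproduced in the sketch below (variable names are ours; the group order is the formula from the Introduction).

```python
from math import prod

n = 6
group_order = 2 ** (n * n) * prod(4 ** j - 1 for j in range(1, n + 1))

full_tableau_bits = (2 * n) ** 2            # full 2n x 2n tableau: 4 n^2 bits
stored_bits = 2 * (2 * n * (n - 1))         # X and Z halves, 2n(n-1) bits each
spare_bits = 2 * 64 - stored_bits           # unused bits across two 64-bit words
min_bits = (group_order - 1).bit_length()   # ceil(log2 |C_6|)

print(full_tableau_bits, stored_bits, spare_bits, min_bits)
```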

Should each Clifford element require storage, the search would not be possible to execute on modern computers since \(| {{{{\mathcal{C}}}}}_{6}| \,\approx \,2\times 1{0}^{23}\). We, therefore, break Clifford group elements into classes of equivalence such that class members share the same optimal circuit structure, a canonical representative exists, and it is efficient to compute. In our approach, a class of equivalence can be thought of as containing unitaries with optimal circuits equivalent up to left- and right-hand multiplication by single-qubit Clifford unitaries, and qubit relabeling; the canonical representative is chosen to be the one with the least lexicographic order across all elements in its equivalence class. This means that we can pack up to \(| {{{{\mathcal{C}}}}}_{1}{| }^{2n}\cdot | {S}_{n}| {| }_{n = 6}={6}^{12}\cdot 6!=1,567,283,281,920\) unitaries into one class. More precisely, the number of unitaries contained in each equivalence class may vary between \(| {{{{\mathcal{C}}}}}_{1}{| }^{n}\) and \(| {{{{\mathcal{C}}}}}_{1}{| }^{2n}\cdot | {S}_{n}|\). The former case is realized for the identity operator which is invariant under all qubit relabelings and does not differentiate between left- and right-hand multiplications by single-qubit Clifford unitaries. The latter case is realized for a generic element of the Clifford group without any special symmetries. Here, \(| {{{{\mathcal{C}}}}}_{1}|\) is the size of the single-qubit Clifford group \({{{{\mathcal{C}}}}}_{1}\) raised to the power 2n to represent one-qubit operators on each qubit in the beginning and end of the circuit, and Sn is the permutation group. However, the computation of canonical representative must be efficient, as otherwise, complexity moves from storage to computation. We utilized a Pareto-efficient definition of the equivalence class, as determined by ReduceU, the function computing the canonical representative, to be most practical. 
Our computationally-defined canonical representative is at most factor 14 storage inefficient, but it allows a quick computation of the canonical representative, taking on average 0.000003 s (using Intel Core i7-10700K processor). The computation of ReduceU turns out to be the runtime-level bottleneck of our implementation since other operations that are applied with a comparable frequency (such as tableau restoration and gate application) are faster. Further details about ReduceU may be found in the section “Computation of ReduceU”.

The restriction to equivalence classes helps not only to dramatically reduce the storage requirement, but also to minimize the number of CNOT-equivalent transformations that we need to apply to a Clifford unitary requiring k gates to explore Clifford unitaries requiring k + 1 entangling gates. Specifically, the number of transformations is only \(9\frac{n(n-1)}{2}{\left|\right.}_{n = 6}=135\), as illustrated in Fig. 3.

Fig. 3: CNOT gate equivalent entangling transformations that need to be applied to each of \(\frac{n(n-1)}{2}\) pairs of qubits of a Clifford group element implementable with k entangling gates to explore the possibility of expanding it into a Clifford group element requiring k + 1 gates.

It suffices to apply these gates to a pair of qubits in an arbitrary fixed order, since the application of a gate in the other order is enabled by some other gate among those listed. For instance, the CNOT with flipped controls with respect to (a) is accomplished by (h), noting that the single-qubit gates on the right side do not matter due to the choice to work with equivalence classes.

The 15-part (one part per fixed gate count ranging from 1 to 15, with 15 turning out to be the maximum) sorted database with canonical representatives of equal cost is 2.1TB in size, and it took roughly 6 months to synthesize it on a small cluster of Intel® server-class machines. Since we made software updates as the search progressed, and improved the performance in doing so, we believe it may take about 2 months to rerun it from scratch. We store the database on an SSD (2+ TB of RAM was expensive at the time of this writing). Given the database, an optimal circuit for a given 6-qubit Clifford unitary U may be found as follows: compute ReduceU(U), find it in the part of the database containing size k unitaries, apply each of \(9\frac{n(n-1)}{2}\) gates, compute the resulting canonical element and look it up in the size k − 1 database; once found, repeat for k := k − 1 until k = 0. Our implementation of the above algorithm takes an average of 0.1 s to extract an optimal circuit. The bottleneck is the database search on the SSD, since the average number of times an element needs to be searched is at most \(\frac{135}{2}=67.5\), the databases for large k are large, and the search needs to make multiple queries that add up quickly given the SSD’s limited access time. Instead, recall that 4 + 4 = 8 bits of the original data structure are unused, and note that 8 bits suffice to store the gate information, since \(\lceil {\log }_{2}(135)\rceil \,=\,8\). We thus augment the database by loading these 8 bits with the last gate information, which allows selecting the correct gate right away during the circuit restoration. This modification reduces the runtime by roughly a factor of 67.5. We further optimize the performance by storing an index with each 1024th element of the database in RAM. 
This allows finding an optimal circuit implementation of an arbitrary 6-qubit Clifford unitary in as little as 0.0009358 s on a MacBook Pro® (2.3 GHz Quad-Core Intel® Core i7-1068NG7 CPU, 16GB RAM) with a USB-C attached SSD (4TB VectoTech Rapid® 540MB/s 3D NAND Flash), and 0.0006274 s on a high-performance server (Quad Intel® Xeon E7-4850 v4 16-Core/2.1GHz, 6TB RAM). These performance figures were established by averaging out the time to synthesize optimal circuits for 10,000 random uniformly distributed Clifford unitaries while relying on kernel-owned memory to cache files with the use of mmap and using a supplementary index for the laptop version of the search.
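The idea of keeping every 1024th element of a sorted database in RAM can be sketched with the standard `bisect` module. This is an illustration under our own assumptions (an in-memory list of integer keys stands in for the on-disk database, and the function names are ours); the actual file layout used by the implementation is not described here.

```python
from bisect import bisect_left, bisect_right

STRIDE = 1024                          # keep every 1024th key in the RAM index
GATE_BITS = (135 - 1).bit_length()     # bits needed to label one of 135 gates


def build_index(sorted_keys, stride=STRIDE):
    """Sparse index: the first key of every block of `stride` keys."""
    return sorted_keys[::stride]


def lookup(sorted_keys, index, key, stride=STRIDE):
    """Return the position of key in sorted_keys, or -1 if it is absent.
    The index narrows the search to one block, so only that block of the
    large array (the part that would live on the SSD) has to be touched."""
    block = bisect_right(index, key) - 1
    if block < 0:
        return -1                      # key is smaller than every stored key
    lo = block * stride
    hi = min(lo + stride, len(sorted_keys))
    pos = bisect_left(sorted_keys, key, lo, hi)
    return pos if pos < hi and sorted_keys[pos] == key else -1


keys = [3 * i for i in range(5000)]    # stand-in for sorted canonical elements
index = build_index(keys)
print(GATE_BITS, len(index), lookup(keys, index, 3 * 1234), lookup(keys, index, 4))
```

With such an index, each query costs one binary search in RAM plus one read of a single block, rather than a binary search that probes many distant locations of the large file.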

In the following subsections we report further details of our implementation.

Database generation

Let \({{{{\mathcal{C}}}}}_{n}^{k}\,\subseteq \,{{{{\mathcal{C}}}}}_{n}\) be the set of all Clifford group elements with the CNOT cost k. Here \(k=0,1,\ldots,{k}_{\max }(n)\) for some a priori unknown maximum cost \({k}_{\max }(n)\). For example, \({{{{\mathcal{C}}}}}_{n}^{0}\) is the local subgroup of \({{{{\mathcal{C}}}}}_{n}\), i.e., one generated by the single-qubit Clifford gates. Suppose \({\mathsf{ReduceU}}:\,{{{{\mathcal{C}}}}}_{n}\to {{{{\mathcal{C}}}}}_{n}\) is a function such that ReduceU(U) = ReduceU(V) if and only if U and V are equivalent up to left and right multiplications by single-qubit gates and a qubit relabeling. In other words, ReduceU(U) is a canonical representative of the equivalence class

$$[U]:=\{K{W}^{-1}UWL\,:\,K,L\in {{{{\mathcal{C}}}}}_{n}^{0},\,W\in {S}_{n}\}.$$
(9)

Here and below \({S}_{n}\,\subseteq \,{{{{\mathcal{C}}}}}_{n}\) is the subgroup of qubit permutations. A specific implementation of the function ReduceU, which we defer to the section “Computation of ReduceU”, does not matter at this point. Let \({{{{\mathcal{R}}}}}_{n}^{k}\) be the set of all reduced cost-k Clifford group elements,

$${{{{\mathcal{R}}}}}_{n}^{k}:=\{{\mathsf{ReduceU}}(U)\,:\,U\in {{{{\mathcal{C}}}}}_{n}^{k}\}.$$

Our database consists of kmax(n) + 1 parts, such that the k-th part contains all elements of \({{{{\mathcal{R}}}}}_{n}^{k}\). The elements are furthermore stored in the lexicographic order to enable binary search.

Let \(I\,\in \,{{{{\mathcal{C}}}}}_{n}\) be the identity matrix and CNOTi,j be the CNOT gate with the control qubit i and the target qubit j. Since any cost-0 and cost-1 element is equivalent to I and CNOT1,2 respectively, we have

$${{{{\mathcal{R}}}}}_{n}^{0}=\{{\mathsf{ReduceU}}(I)\}\,{{{\rm{and}}}}\,{{{{\mathcal{R}}}}}_{n}^{1}=\{{\mathsf{ReduceU}}({{{{\rm{CNOT}}}}}_{1,2})\}.$$

Suppose we have the sets \({{{{\mathcal{R}}}}}_{n}^{0},{{{{\mathcal{R}}}}}_{n}^{1},\ldots,{{{{\mathcal{R}}}}}_{n}^{k-1}\) for some k ≥ 2 (initially k = 2). The rest of this section explains how to compute \({{{{\mathcal{R}}}}}_{n}^{k}\). First, we need to choose a set of cost-1 generators that obey certain technical conditions. Let m = 9n(n − 1)/2 and \({G}_{1},{G}_{2},\ldots,{G}_{m}\in {{{{\mathcal{C}}}}}_{n}^{1}\) be the generators shown in Fig. 3. By definition, each generator has the form \({A}_{i}{B}_{j}{{{{\rm{CNOT}}}}}_{i,j}\) for some pair of qubits i < j and A, B ∈ {I, PH, HP}. We will use the following properties of the generator set.

Lemma 1

Any cost-k element \(U\,\in \,{{{{\mathcal{C}}}}}_{n}^{k}\) can be written as \(U={G}_{{a}_{1}}{G}_{{a}_{2}}\cdots {G}_{{a}_{k}}L\) for some \(L\,\in \,{{{{\mathcal{C}}}}}_{n}^{0}\) and some a1, a2, …, ak ∈ {1, 2, …, m}.

The proof is deferred to the section “Proof of Lemma 1”. This lemma has the following simple corollaries.

Corollary 1

Suppose W ∈ Sn is a qubit permutation and \(L\,\in \,{{{{\mathcal{C}}}}}_{n}^{0}\). For any generator Ga there exist a generator Gb and \(M\,\in \,{{{{\mathcal{C}}}}}_{n}^{0}\) such that WLGa = GbWM.

Proof

Let U = WLGaW−1. Note that \(U\,\in \,{{{{\mathcal{C}}}}}_{n}^{1}\) since U is equivalent to a cost-1 element Ga. Lemma 1 with k = 1 implies that \(U\,=\,{G}_{b}M^{\prime}\) for some generator Gb and some \(M^{\prime} \,\in \,{{{{\mathcal{C}}}}}_{n}^{0}\). Thus \(WL{G}_{a}={G}_{b}M^{\prime} W={G}_{b}WM\), where \(M={W}^{-1}M^{\prime} W\in {{{{\mathcal{C}}}}}_{n}^{0}\).

Corollary 2

For any generator Ga and \(L\,\in \,{{{{\mathcal{C}}}}}_{n}^{0}\) there exists a generator Gb such that \({G}_{a}L{G}_{b}\,\in \,{{{{\mathcal{C}}}}}_{n}^{0}\).

Proof

Let \(U\,=\,{({G}_{a}L)}^{-1}\). Note that \(U\,\in \,{{{{\mathcal{C}}}}}_{n}^{1}\) since the cost is invariant under taking the inverse. Lemma 1 with k = 1 implies that U = GbM for some generator Gb and \(M\,\in \,{{{{\mathcal{C}}}}}_{n}^{0}\). Thus \({G}_{a}L{G}_{b}={M}^{-1}\in {{{{\mathcal{C}}}}}_{n}^{0}\).

We claim that the following algorithm outputs the set \(S\,=\,{{{{\mathcal{R}}}}}_{n}^{k}\).

Algorithm 1

\(S\leftarrow {{\emptyset}}\)

for \(V\in {{{{\mathcal{R}}}}}_{n}^{k-1}\) do

    for b ∈ {1, 2, …, m} do

        U ← ReduceU(VGb)

        if \(U\notin {{{{\mathcal{R}}}}}_{n}^{k-2}\cup {{{{\mathcal{R}}}}}_{n}^{k-1}\) then

            S ← S ∪ {U}

        end if

    end for

end for

Let us first check that \({{{{\mathcal{R}}}}}_{n}^{k}\,\subseteq \,S\). Consider any element \(U\,\in \,{{{{\mathcal{R}}}}}_{n}^{k}\). Then \(U\,=\,{\mathsf{ReduceU}}(\tilde{U})\) for some \(\tilde{U}\,\in \,{{{{\mathcal{C}}}}}_{n}^{k}\). By Lemma 1, we can write \(\tilde{U}\,=\,{G}_{{a}_{1}}{G}_{{a}_{2}}\cdots {G}_{{a}_{k}}M\) for some \(M\,\in \,{{{{\mathcal{C}}}}}_{n}^{0}\). Define

$$\tilde{V}:={G}_{{a}_{1}}{G}_{{a}_{2}}\cdots {G}_{{a}_{k-1}}\,{{{\rm{and}}}}\,V:={\mathsf{ReduceU}}(\tilde{V}).$$

Note that \(\tilde{V}\,\in \,{{{{\mathcal{C}}}}}_{n}^{k-1}\) (if \(\tilde{V}\in {{{{\mathcal{C}}}}}_{n}^{\ell }\) for some ℓ < k − 1 then \(\tilde{U}\,=\,\tilde{V}{G}_{{a}_{k}}M\) would have cost less than k). Accordingly, \(V\,\in \,{{{{\mathcal{R}}}}}_{n}^{k-1}\). By definition of the function ReduceU, we have \(\tilde{V}=K{W}^{-1}VWL\) for some \(K,L\in {{{{\mathcal{C}}}}}_{n}^{0}\) and some qubit relabeling W ∈ Sn. Thus

$$\tilde{U}={G}_{{a}_{1}}{G}_{{a}_{2}}\cdots {G}_{{a}_{k}}M=\tilde{V}{G}_{{a}_{k}}M=K{W}^{-1}VWL{G}_{{a}_{k}}M.$$

Commuting \({G}_{{a}_{k}}\) through WL next to V using Corollary 1 we obtain \(\tilde{U}=K{W}^{-1}(V{G}_{b})WM^{\prime}\) for some generator Gb and some \(M^{\prime} \,\in \,{{{{\mathcal{C}}}}}_{n}^{0}\). This shows that \(\tilde{U}\) is equivalent to VGb and thus \({\mathsf{ReduceU}}(V{G}_{b})={\mathsf{ReduceU}}(\tilde{U})=U\) for some \(V\in {{{{\mathcal{R}}}}}_{n}^{k-1}\) and some generator Gb. Since U has cost k, it belongs to neither \({{{{\mathcal{R}}}}}_{n}^{k-2}\) nor \({{{{\mathcal{R}}}}}_{n}^{k-1}\), so the algorithm adds it to S. Thus U ∈ S. We have proved that \({{{{\mathcal{R}}}}}_{n}^{k}\,\subseteq \,S\).

Conversely, suppose U ∈ S. Then U is a reduced element obtained from some cost-(k − 1) element V by adding a single generator, relabeling the qubits, and left/right multiplications by the single-qubit gates. Since adding a single generator can change the cost by at most one, we conclude that \(U\in {{{{\mathcal{R}}}}}_{n}^{k-2}\cup {{{{\mathcal{R}}}}}_{n}^{k-1}\cup {{{{\mathcal{R}}}}}_{n}^{k}\). The cost cannot grow by more than 1 for an obvious reason. It cannot decline by d > 1 since this would imply that V can be implemented with cost (k − 1 − d) + 1 = k − d < k − 1 as the circuit \((Vg)\cdot {g}^{-1}\), where g is the generator, which contradicts the notion that V is a cost-(k − 1) element. Thus the algorithm adds U to S only if \(U\,\in \,{{{{\mathcal{R}}}}}_{n}^{k}\). We have proved that \(S\,\subseteq \,{{{{\mathcal{R}}}}}_{n}^{k}\).

By sorting the elements of each set \({{{{\mathcal{R}}}}}_{n}^{\ell }\) and using the binary search to check set membership, the above algorithm requires \(\tilde{O}(| {{{{\mathcal{R}}}}}_{n}^{k-1}| m)\) calls to the function ReduceU, where the \(\tilde{O}\) notation hides factors logarithmic in the size of \({{{{\mathcal{R}}}}}_{n}^{k-2}\), \({{{{\mathcal{R}}}}}_{n}^{k-1}\), and \({{{{\mathcal{R}}}}}_{n}^{k}\). The database generation terminates as soon as \({{{{\mathcal{R}}}}}_{n}^{k}\,=\,{{\emptyset}}\). This determines the maximum cost kmax(n) as k − 1.
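The layer-by-layer search of Algorithm 1 can be tried out on n = 2 qubits, where exhaustive enumeration is cheap. The following is a minimal sketch under stated assumptions, not our production code: matrices are stored as tuples of row bitmasks, basis vectors are mapped column-wise (an assumed convention), and `reduce_u` is a brute-force stand-in that returns the smallest element of the class of Eq. (9) rather than the ReduceU function described in the section “Computation of ReduceU”. All function names are hypothetical.

```python
from itertools import product

N = 2            # toy example: two qubits
DIM = 2 * N      # symplectic matrices are DIM x DIM over F_2


def gate(changes):
    """Build a matrix from images of basis vectors; each row is a bitmask."""
    images = {j: {j} for j in range(DIM)}
    images.update(changes)
    return tuple(sum(1 << j for j in range(DIM) if i in images[j])
                 for i in range(DIM))


def mul(A, B):
    """Product over F_2: row i of AB is the XOR of the rows of B picked by row i of A."""
    out = []
    for row in A:
        acc, j = 0, 0
        while row:
            if row & 1:
                acc ^= B[j]
            row >>= 1
            j += 1
        out.append(acc)
    return tuple(out)


IDENT = gate({})
H = [gate({q: {N + q}, N + q: {q}}) for q in range(N)]   # swaps X and Z
P = [gate({q: {q, N + q}}) for q in range(N)]            # maps X to Y
CNOT = gate({0: {0, 1}, N + 1: {N, N + 1}})              # control 0, target 1
SWAP = gate({0: {1}, 1: {0}, N: {N + 1}, N + 1: {N}})


def closure(gens):
    """Subgroup generated by gens (finite group, so products suffice)."""
    group, frontier = {IDENT}, [IDENT]
    while frontier:
        g = frontier.pop()
        for h in gens:
            x = mul(g, h)
            if x not in group:
                group.add(x)
                frontier.append(x)
    return sorted(group)


LOCAL = closure(H + P)    # local subgroup C_2^0
PERMS = [IDENT, SWAP]     # qubit permutations S_2 (each is its own inverse)


def equivalence_class(U):
    """All elements K W^-1 U W L of Eq. (9), by exhaustive enumeration."""
    out = set()
    for W in PERMS:
        X = mul(mul(W, U), W)
        for K in LOCAL:
            KX = mul(K, X)
            out.update(mul(KX, L) for L in LOCAL)
    return out


def reduce_u(U):
    """Brute-force canonical representative: the smallest element of [U]."""
    return min(equivalence_class(U))


def generators():
    """The m = 9 generators A_0 B_1 CNOT with A, B in {I, PH, HP}."""
    A = [IDENT, mul(P[0], H[0]), mul(H[0], P[0])]
    B = [IDENT, mul(P[1], H[1]), mul(H[1], P[1])]
    return [mul(mul(a, b), CNOT) for a, b in product(A, B)]


def build_database():
    """Algorithm 1, iterated until the next layer comes out empty."""
    gens = generators()
    layers = [{reduce_u(IDENT)}]
    while True:
        prev = layers[-1]
        prev2 = layers[-2] if len(layers) > 1 else set()
        S = set()
        for V in prev:
            for G in gens:
                U = reduce_u(mul(V, G))
                if U not in prev and U not in prev2:
                    S.add(U)
        if not S:
            return layers
        layers.append(S)


layers = build_database()
layer_sizes = [len(layer) for layer in layers]
group_order = sum(len(equivalence_class(U)) for layer in layers for U in layer)
```

For n = 2 the sketch yields one reduced element for each cost 0, 1, 2, and 3, and the class sizes add up to 720, the order of the two-qubit Clifford group with Pauli operators factored out. Python sets replace the sorted files and binary search used for the 6-qubit database.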

As discussed in the section “Methods”, the generation of the 6-qubit database spans a few CPU months and involves manipulations with terabytes of data. How can we be confident that this computation is error-free? Our correctness tests included the verification that the size of the Clifford group inferred from the database agrees with the analytic formula \(| {{{{\mathcal{C}}}}}_{n}| ={2}^{{n}^{2}}\mathop{\prod }\nolimits_{j = 1}^{n}({4}^{j}-1)\). In more detail, the number of cost-k Clifford group elements can be inferred from the identity

$$| {{{{\mathcal{C}}}}}_{n}^{k}| =\mathop{\sum}\limits_{U\in {{{{\mathcal{R}}}}}_{n}^{k}}| [U]|,$$
(10)

where ∣[U]∣ is the size of the equivalence class [U] that contains U, see Eq. (9). Furthermore,

$$| [U]| =\frac{| {{{{\mathcal{C}}}}}_{n}^{0}{| }^{2}\cdot | {S}_{n}| }{| {{{\rm{Aut}}}}(U)| }=\frac{{6}^{2n}n!}{| {{{\rm{Aut}}}}(U)| },$$
(11)

where Aut(U) is the automorphism group of U that consists of all triples \(K\times L\times W\in {{{{\mathcal{C}}}}}_{n}^{0}\times {{{{\mathcal{C}}}}}_{n}^{0}\times {S}_{n}\) such that \(U=K{W}^{-1}UWL\). We have checked that the counts \(| {{{{\mathcal{C}}}}}_{n}^{k}|\) inferred from Eqs. (10) and (11) indeed obey \(\mathop{\sum }\nolimits_{k = 0}^{{k}_{max}(n)}| {{{{\mathcal{C}}}}}_{n}^{k}| =| {{{{\mathcal{C}}}}}_{n}|\). Thus our database passed the self-consistency test. Table 6 displays the counts \(| {{{{\mathcal{R}}}}}_{n}^{k}|\); the counts \(| {{{{\mathcal{C}}}}}_{n}^{k}|\) are displayed in Table 1, which can be found in the section “Results”.
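The self-consistency test built from Eqs. (10) and (11) amounts to a few lines of integer arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the automorphism orders in the usage line (72, 8, 8, 72 for n = 2) are an assumption derived from the standard two-qubit class sizes 36, 324, 324, 36, not values read from our tables.

```python
from math import factorial, prod


def clifford_group_order(n):
    """|C_n| = 2^(n^2) * prod_{j=1..n} (4^j - 1), Pauli operators factored out."""
    return 2 ** (n * n) * prod(4 ** j - 1 for j in range(1, n + 1))


def class_size(n, aut_order):
    """Eq. (11): |[U]| = 6^(2n) * n! / |Aut(U)|."""
    numerator = 6 ** (2 * n) * factorial(n)
    if numerator % aut_order != 0:
        raise ValueError("|Aut(U)| must divide 6^(2n) * n!")
    return numerator // aut_order


def database_is_consistent(n, aut_orders_by_cost):
    """aut_orders_by_cost[k] lists |Aut(U)| for every reduced U of cost k.

    Summing Eq. (10) over all costs must reproduce the group order.
    """
    total = sum(class_size(n, a) for layer in aut_orders_by_cost for a in layer)
    return total == clifford_group_order(n)


# Toy usage for n = 2 with the assumed automorphism orders.
two_qubit_ok = database_is_consistent(2, [[72], [8], [8], [72]])
```

Exact integer arithmetic matters here, since for n = 6 the group order far exceeds the range in which floating-point sums stay exact.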

Table 6 The distribution of the number of equivalence classes across Clifford circuits over 2, 3, 4, 5, and 6 qubits.

In order to speed up the synthesis of optimal circuits, we augmented each database entry \(U\,\in \,{{{{\mathcal{R}}}}}_{n}^{k}\) with 8 auxiliary bits specifying a generator Gb that reduces the cost of U by one, such that \(U{G}_{b}\,\in \,{{{{\mathcal{C}}}}}_{n}^{k-1}\). Here we assume k ≥ 1. Let us prove that such a cost-reducing generator Gb exists for any \(U\,\in \,{{{{\mathcal{R}}}}}_{n}^{k}\). Indeed, use Lemma 1 to write \(U={G}_{{a}_{1}}{G}_{{a}_{2}}\cdots {G}_{{a}_{k}}L\) for some \(L\,\in \,{{{{\mathcal{C}}}}}_{n}^{0}\). By Corollary 2, there exists a generator Gb such that \(F\equiv {G}_{{a}_{k}}L{G}_{b}\in {{{{\mathcal{C}}}}}_{n}^{0}\). Now \(U{G}_{b}={G}_{{a}_{1}}{G}_{{a}_{2}}\cdots {G}_{{a}_{k-1}}F\), so UGb has cost at most k − 1. Since multiplying by a single generator changes the cost by at most one, UGb has cost exactly k − 1.

To augment a given element U of the cost-k database \({{{{\mathcal{R}}}}}_{n}^{k}\) we find the first index b ∈ {1, 2, …, m} of a cost-reducing generator, that is, one such that \({\mathsf{ReduceU}}(U{G}_{b})\,\in \,{{{{\mathcal{R}}}}}_{n}^{k-1}\). This requires at most m calls to ReduceU and binary searches in \({{{{\mathcal{R}}}}}_{n}^{k-1}\) (computing the group multiplication takes a negligible time). Once a cost-reducing generator Gb is found, its index b is recorded in the database using the unused bits of U. The augmentation step is applied to all \(U\,\in \,{{{{\mathcal{R}}}}}_{n}^{k}\) and for all k = 1, 2, …, kmax(n).

Synthesis of optimal circuits

The optimal compiler takes as input an element of the Clifford group \(U\,\in \,{{{{\mathcal{C}}}}}_{n}\) and outputs a Clifford circuit (a list of the primitive gates H, P, and CNOT) implementing U with the smallest possible CNOT gate count, equal to the cost of U. The cost can be computed by making a single call to ReduceU and performing at most kmax(n) database searches. Below we assume that the database is augmented with the cost-reducing generators, as discussed in the section “Database generation”. Thus the database search returns the cost-k element V such that \(V\,\equiv \,{\mathsf{ReduceU}}(U)\in {{{{\mathcal{R}}}}}_{n}^{k}\) and a cost-reducing generator Ga such that \(V{G}_{a}\,\in \,{{{{\mathcal{C}}}}}_{n}^{k-1}\). The next step is to convert Ga into a cost-reducing generator for U. To this end, write \(V=K{W}^{-1}UWL\) for some \(K,L\in {{{{\mathcal{C}}}}}_{n}^{0}\) and some qubit permutation W. The group elements K, L, and W that transform U into the reduced form are readily available by adding appropriate bookkeeping steps to the implementation of ReduceU described in the section “Computation of ReduceU”. At this point we have

$$K{W}^{-1}UWL{G}_{a}\in {{{{\mathcal{C}}}}}_{n}^{k-1}.$$

Commute Ga through WL next to U using Corollary 1. This gives

$$K{W}^{-1}U{G}_{b}WM\in {{{{\mathcal{C}}}}}_{n}^{k-1}$$

for some generator Gb and some \(M\,\in \,{{{{\mathcal{C}}}}}_{n}^{0}\). The generator Gb can be computed in time O(1) using the standard commutation rules of the Clifford group. Thus \(U{G}_{b}\,\in \,{{{{\mathcal{C}}}}}_{n}^{k-1}\), that is, Gb is a cost-reducing generator for U. Replacing U by UGb and applying the above step recursively, one constructs a k-tuple of generators such that \(M=U{G}_{{a}_{1}}{G}_{{a}_{2}}\cdots {G}_{{a}_{k}}\in {{{{\mathcal{C}}}}}_{n}^{0}\) is a product of single-qubit gates. This gives \({U}^{-1}={G}_{{a}_{1}}{G}_{{a}_{2}}\cdots {G}_{{a}_{k}}{M}^{-1}\). Decomposing each generator and M−1 into a product of primitive gates H, P, and CNOT gives an optimal circuit implementing U−1. Since all primitive gates are self-inverse up to Pauli operators (which act trivially in the symplectic representation used here), an optimal circuit implementing U is obtained simply by reversing the order of gates. If needed, the number of single-qubit gates in the compiled circuit can be optimized by commuting single-qubit gates to the last time step (whenever possible) and merging them using optimal lookup of \({{{{\mathcal{C}}}}}_{1}\) elements.

Computation of ReduceU

In this section we introduce reduced forms of Clifford group elements and give algorithms for computing these forms. A given matrix \(U\,\in \,{{{{\mathcal{C}}}}}_{n}\) is transformed into a reduced form by applying a sequence of elementary reductions from the following list:

  1. Multiplication of U on the left by single-qubit Clifford gates.

  2. Multiplication of U on the right by single-qubit Clifford gates.

  3. Relabeling of qubits.

Depending on which type of reductions is considered, there are three different reduced forms: a left-reduced form (reductions of type 1 only), a locally reduced form (reductions of types 1 and 2), and a fully reduced form (reductions of types 1, 2, and 3). Each form comes with an algorithm specifying the sequence of reductions to be applied. We define the reduced forms inductively starting from the left-reduced form. The function ReduceU used in the sections “Database generation” and “Synthesis of optimal circuits” computes the fully reduced form.

We begin by defining convenient notations. Let \({e}^{1},{e}^{2},\ldots,{e}^{2n}\in {{\mathbb{F}}}_{2}^{2n}\) be the standard basis of \({{\mathbb{F}}}_{2}^{2n}\): the basis vector \({e}^{j}\) has a single non-zero entry at the jth position. We consider \({e}^{j}\) as column vectors. Let \({e}_{j}:={({e}^{j})}^{T}\) be the corresponding row vector. For example, if n = 1 then

$${e}^{1}=\left[\begin{array}{c}1\\ 0\end{array}\right],\quad {e}^{2}=\left[\begin{array}{c}0\\ 1\end{array}\right],\quad {e}_{1}=\left[\begin{array}{ll}1&0\\ \end{array}\right],\,\,{{\mbox{and}}}\,\,{e}_{2}=\left[\begin{array}{ll}0&1\\ \end{array}\right].$$

We write u ⊕ v to denote the addition of binary vectors u and v modulo 2. Elements of the Clifford group \(U\,\in \,{{{{\mathcal{C}}}}}_{n}\) are treated as binary symplectic matrices of the size 2n × 2n. The jth column and the jth row of a matrix U are \(U{e}^{j}\) and \({e}_{j}U\), respectively.

Recall that \({{{{\mathcal{C}}}}}_{n}^{0}\,\subseteq \,{{{{\mathcal{C}}}}}_{n}\) is the local subgroup generated by the single-qubit gates (H and P). Define a subgroup \({{{{\mathcal{C}}}}}_{n,j}\,\subseteq \,{{{{\mathcal{C}}}}}_{n}^{0}\) generated by the single-qubit gates acting on the jth qubit, where j = 1, 2, …, n. Equivalently, \(U\,\in \,{{{{\mathcal{C}}}}}_{n,j}\) iff \(U{e}^{i}={e}^{i}\) for all \(i\notin \{j,n+j\}\), whereas \(U{e}^{j}=a{e}^{j}\oplus b{e}^{n+j}\) and \(U{e}^{n+j}=c{e}^{j}\oplus d{e}^{n+j}\) for some coefficients \(a,b,c,d\in {{\mathbb{F}}}_{2}\) such that

$$\left[\begin{array}{ll}a&c\\ b&d\\ \end{array}\right]\in {{{\rm{GL}}}}(2,{{\mathbb{F}}}_{2}).$$

Note that the subgroups \({{{{\mathcal{C}}}}}_{n,j}\) pairwise commute.

A matrix \(U\in {{{{\mathcal{C}}}}}_{n}\) is said to be left-reduced if

$${e}_{j}U \,<\, {e}_{n+j}U \,<\, ({e}_{j}\oplus {e}_{n+j})U\,\,{{\mbox{for all}}}\,\,j=1,2,\ldots,n.$$
(12)

Here and below the bit strings are compared using the lexicographic order (i.e., 00 < 01 < 10 < 11 in the case n = 1). The following lemma shows that left-reduced elements of \({{{{\mathcal{C}}}}}_{n}\) can serve as canonical representatives of cosets \({{{{\mathcal{C}}}}}_{n}^{0}U\). In other words, \({{{{\mathcal{C}}}}}_{n}\) is a disjoint union of cosets \({{{{\mathcal{C}}}}}_{n}^{0}U\) and each coset contains a unique left-reduced element, which can be efficiently computed. We refer to the unique left-reduced element of a coset \({{{{\mathcal{C}}}}}_{n}^{0}U\) as the left-reduced form of U and denote it leftReduce(U). Our symplectic matrix data structure described in the section “Data structure” enables the computation of leftReduce(U) for a randomly picked matrix \(U\,\in \,{{{{\mathcal{C}}}}}_{n}\) in time less than \(2\times {10}^{-8}\) s for any n ≤ 6 on a server-class CPU, in this case an Intel® Xeon® CPU E7-4850 v4 @ 2.10GHz.

Lemma 2

Each coset \({{{{\mathcal{C}}}}}_{n}^{0}U\) with \(U\,\in \,{{{{\mathcal{C}}}}}_{n}\) contains a unique left-reduced element that can be computed in time \(O({n}^{2})\), given the symplectic matrix representation of U.

Proof

First note that the rows of a symplectic matrix are linearly independent. Thus for each qubit j the bit strings \({x}_{j}:={e}_{j}U\), \({z}_{j}:={e}_{n+j}U\), and \({y}_{j}:=({e}_{j}\oplus {e}_{n+j})U\) are pairwise distinct. It follows directly from the above definitions that multiplying U on the left by the elements of the subgroup \({{{{\mathcal{C}}}}}_{n,j}\) we can implement any permutation of the bit strings xj, yj, and zj. For example, the Hadamard gate swaps xj and zj, the Phase gate swaps xj and yj. Since \(| {{{{\mathcal{C}}}}}_{n,j}| \,=\,6\), there is a one-to-one correspondence between elements of \({{{{\mathcal{C}}}}}_{n,j}\) and permutations of xj, yj, zj. Multiply U on the left by the unique element of \({{{{\mathcal{C}}}}}_{n,j}\) that permutes the bit strings such that xj < zj < yj. Now Eq. (12) is satisfied for the jth qubit. Repeating this for all n qubits and noting that \({{{{\mathcal{C}}}}}_{n}^{0}\) is generated by the subgroups \({{{{\mathcal{C}}}}}_{n,j}\) proves that the coset \({{{{\mathcal{C}}}}}_{n}^{0}U\) contains a unique left-reduced element. All above steps can be efficiently implemented. Indeed, given a matrix U, one can compute the bit strings xj, yj, and zj and sort all three in time O(n). Repeating this for all n qubits gives the total runtime of \(O({n}^{2})\).
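The sorting argument in the proof translates directly into code. The sketch below is a plain-Python illustration, not the bit-packed routine behind the timings quoted above: it assumes that U is given as a list of 2n rows, each a tuple of 0/1 entries, and the name `left_reduce` and the example matrices are hypothetical.

```python
def left_reduce(U):
    """Return the left-reduced form of U (Eq. (12)).

    For each qubit j the rows x = e_j U, z = e_{n+j} U and y = x XOR z are
    sorted; the smallest becomes row j and the middle one becomes row n + j,
    so that the third (largest) string is automatically their XOR.
    """
    n = len(U) // 2
    V = [tuple(row) for row in U]
    for j in range(n):
        x, z = V[j], V[n + j]
        y = tuple(a ^ b for a, b in zip(x, z))
        smallest, middle, _largest = sorted([x, z, y])
        V[j], V[n + j] = smallest, middle
    return V


# Toy usage: a CNOT with control 0 and target 1 (assumed column convention).
cnot = [(1, 0, 0, 0),
        (1, 1, 0, 0),
        (0, 0, 1, 1),
        (0, 0, 0, 1)]

# The same matrix after a left multiplication by single-qubit gates:
# rows 0 and 2 are swapped, and row 3 is replaced by row 1 XOR row 3.
cnot_left_modified = [(0, 0, 1, 1),
                      (1, 1, 0, 0),
                      (1, 0, 0, 0),
                      (1, 1, 0, 1)]

reduced = left_reduce(cnot)
```

Python compares tuples lexicographically, which matches the ordering of bit strings used in Eq. (12).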

Given a matrix \(U\,\in \,{{{{\mathcal{C}}}}}_{n}\) define a double coset

$${[U]}^{{\mathsf{loc}}}:={{{{\mathcal{C}}}}}_{n}^{0}U{{{{\mathcal{C}}}}}_{n}^{0}.$$

It includes all elements of the Clifford group obtained from U by adding single-qubit Clifford gates on the left and on the right. Clearly, the full Clifford group \({{{{\mathcal{C}}}}}_{n}\) is a disjoint union of double cosets [U]loc and the cost of the matrix U depends only on the double coset that contains U. The next step is to choose an efficiently computable canonical representative of each double coset. First define the map \(\chi \,:\,{{\mathbb{F}}}_{2}^{2n}\to {{\mathbb{F}}}_{2}^{n}\) as

$$\chi (v):=[{v}_{1}\vee {v}_{n+1},\,{v}_{2}\vee {v}_{n+2},\ldots,{v}_{n}\vee {v}_{2n}],$$

where ∨ stands for the logical OR operation. The jth component of χ(v) is non-zero iff \({v}_{j}=1\) or \({v}_{n+j}=1\) (the bit string χ(v) can be interpreted as the support of an n-qubit Pauli operator parameterized by v, according to the standard binary parameterization of Pauli operators3). We claim that the map χ is invariant under left multiplications by the elements of the local subgroup, in the sense that

$$\chi (Lv)=\chi (v)\,{{{\rm{for}}}}\,{{{\rm{all}}}}\,L\,\in \,{{{{\mathcal{C}}}}}_{n}^{0}\,{{{\rm{and}}}}\,v\,\in \,{{\mathbb{F}}}_{2}^{2n}.$$
(13)

Indeed, it suffices to check Eq. (13) for the special case \(L\,\in \,{{{{\mathcal{C}}}}}_{n,j}\) (since the local subgroup is generated by matrices \(L\,\in \,{{{{\mathcal{C}}}}}_{n,j}\) with j = 1, 2, …, n). As discussed above, the action of \(L\,\in \,{{{{\mathcal{C}}}}}_{n,j}\) on v is equivalent to applying a 2 × 2 binary invertible matrix to the components \({v}_{j}\) and \({v}_{n+j}\) while all other components of v remain unchanged. Since an invertible matrix maps non-zero vectors to non-zero vectors, \({(Lv)}_{j}\vee {(Lv)}_{n+j}=1\) iff \({v}_{j}\vee {v}_{n+j}=1\). This implies Eq. (13).

A matrix \(U\,\in \,{{{{\mathcal{C}}}}}_{n}\) is said to be locally ordered if U is left-reduced and

$$\chi (U{e}^{j})\le \chi (U{e}^{n+j})\le \chi (U{e}^{j}\oplus U{e}^{n+j})\,\,{{\mbox{for all}}}\,\,j=1,2,\ldots,n.$$
(14)

Here bit strings are compared using the lexicographic order. Let \({{{\mathcal{L}}}}(U)\,\subseteq \,{[U]}^{{\mathsf{loc}}}\) be the set of all locally ordered elements of the double coset [U]loc. Define a locally reduced form of the matrix \(U\,\in \,{{{{\mathcal{C}}}}}_{n}\), denoted localReduce(U), as the lexicographically smallest element of the set \({{{\mathcal{L}}}}(U)\). The following lemma shows that locally reduced elements of \({{{{\mathcal{C}}}}}_{n}\) can serve as canonical representatives of the double cosets [U]loc. In other words, \({{{{\mathcal{C}}}}}_{n}\) is a disjoint union of the double cosets [U]loc and each double coset contains a unique locally reduced element that can be efficiently computed (albeit slightly less efficiently than leftReduce). The symplectic matrix data structure described in the section “Data structure” enables the computation of localReduce(U) for a randomly picked matrix \(U\,\in \,{{{{\mathcal{C}}}}}_{n}\) in time less than \(4\times {10}^{-7}\) s for all n ≤ 6 on a server-class CPU, in this case an Intel® Xeon® CPU E7-4850 v4 @ 2.10GHz.

Lemma 3

Each double coset \({[U]}^{{\mathsf{loc}}}\,=\,{{{{\mathcal{C}}}}}_{n}^{0}U{{{{\mathcal{C}}}}}_{n}^{0}\) contains a unique locally reduced element that can be computed in time \(O({n}^{2}{6}^{n})\), given the symplectic matrix U.

Proof

For each qubit j define the bit strings \({x}_{j}:=\chi (U{e}^{j})\), \({z}_{j}:=\chi (U{e}^{n+j})\), and \({y}_{j}:=\chi (U{e}^{j}\oplus U{e}^{n+j})\). Same as before, multiplying U on the right by the elements of the subgroup \({{{{\mathcal{C}}}}}_{n,j}\) one can implement any permutation of the bit strings xj, yj, and zj. Define a subset \({{{{\mathcal{S}}}}}_{j}\,\subseteq \,{{{{\mathcal{C}}}}}_{n,j}\) as the one including all elements \({R}_{j}\,\in \,{{{{\mathcal{C}}}}}_{n,j}\) such that the right multiplication U ← URj permutes the bit strings xj, yj, and zj into the non-decreasing order xj ≤ zj ≤ yj. Note that \({{{{\mathcal{S}}}}}_{j}\) is non-empty since the right multiplication by the elements of \({{{{\mathcal{C}}}}}_{n,j}\) can implement any permutation of xj, yj, and zj. Recall that the set \({{{\mathcal{L}}}}(U)\) includes all locally ordered elements of the double coset [U]loc. We claim that

$${{{\mathcal{L}}}}(U)=\{{\mathsf{leftReduce}}(U{R}_{1}{R}_{2}\cdots {R}_{n})\,:\,{R}_{1}\,\in \,{{{{\mathcal{S}}}}}_{1},{R}_{2}\,\in \,{{{{\mathcal{S}}}}}_{2},\ldots,{R}_{n}\,\in \,{{{{\mathcal{S}}}}}_{n}\}.$$
(15)

Indeed, \({{{\mathcal{L}}}}(U)\subseteq {[U]}^{{\mathsf{loc}}}\) since any matrix \(W\,\in \,{{{\mathcal{L}}}}(U)\) has the form W = LUR for some \(L,R\,\in \,{{{{\mathcal{C}}}}}_{n}^{0}\). Furthermore, \({{{\mathcal{L}}}}(U)\) is non-empty since each subset \({{{{\mathcal{S}}}}}_{j}\) is non-empty. Let us check that any element \(W\,\in \,{{{\mathcal{L}}}}(U)\) is locally ordered. Indeed, pick any matrices \({R}_{j}\,\in \,{{{{\mathcal{S}}}}}_{j}\) and let \(R={R}_{1}{R}_{2}\cdots {R}_{n}\). By construction, the matrix V = UR satisfies Eq. (14) with U replaced by V. Let W = leftReduce(V). Then W = LV for some \(L\,\in \,{{{{\mathcal{C}}}}}_{n}^{0}\). The invariance of the map χ under left multiplications by the elements of the local subgroup, see Eq. (13), implies that W satisfies Eq. (14) with U replaced by W. Thus W is locally ordered. Conversely, suppose W ∈ [U]loc is locally ordered. Then W = LUR for some \(L,R\,\in \,{{{{\mathcal{C}}}}}_{n}^{0}\) and leftReduce(W) = W. The invariance of the map χ under left multiplications by the elements of the local subgroup and the local ordering condition imply that the matrix V = UR satisfies Eq. (14) with U replaced by V. Thus \(R={R}_{1}{R}_{2}\cdots {R}_{n}\) for some \({R}_{j}\,\in \,{{{{\mathcal{S}}}}}_{j}\). This proves that \(W\,\in \,{{{\mathcal{L}}}}(U)\). The uniqueness follows from the ability to encode the elements of the sets considered by distinct integers and the existence of the smallest integer in any finite set of integers.

It remains to check that the set \({{{\mathcal{L}}}}(U)\) can be computed in time \(O({n}^{2}{6}^{n})\). Indeed, for any given qubit j one can compute the bit strings xj, yj, and zj and the subset \({{{{\mathcal{S}}}}}_{j}\,\subseteq \,{{{{\mathcal{C}}}}}_{n,j}\) in time O(n). Note that \(| {{{{\mathcal{S}}}}}_{j}| \,\le \,| {{{{\mathcal{C}}}}}_{n,j}| \,=\,6\). Thus the number of matrices \(R={R}_{1}{R}_{2}\cdots {R}_{n}\) with \({R}_{j}\,\in \,{{{{\mathcal{S}}}}}_{j}\) is at most \({6}^{n}\). Since the right multiplication by the elements of the subgroup \({{{{\mathcal{C}}}}}_{n,j}\) changes at most two columns of a matrix, we can compute UR in time \(O({n}^{2})\). By Lemma 2, computing the left-reduced form of UR takes time \(O({n}^{2})\). Thus the overall runtime of computing \({{{\mathcal{L}}}}(U)\) is \(O({n}^{2}{6}^{n})\). Once the set \({{{\mathcal{L}}}}(U)\) is computed, finding its lexicographically smallest element takes time \(O(n| {{{\mathcal{L}}}}(U)| )\,=\,O(n{6}^{n})\).
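The construction of Eq. (15) can be written out as a short enumeration. The following sketch is a straightforward, unoptimized rendering of the proof rather than our streamlined implementation: matrices are lists of 0/1 row tuples, the six elements of \({{{{\mathcal{C}}}}}_{n,j}\) are represented by their coefficients (a, b, c, d), and all function names are hypothetical.

```python
from itertools import product

# Coefficients (a, b, c, d) of the six invertible 2x2 matrices over F_2.
GL2 = [g for g in product((0, 1), repeat=4) if (g[0] & g[3]) ^ (g[1] & g[2])]


def left_reduce(U):
    """Left-reduced form: sort the row triple (x, z, x XOR z) of every qubit."""
    n = len(U) // 2
    V = [tuple(row) for row in U]
    for j in range(n):
        x, z = V[j], V[n + j]
        y = tuple(a ^ b for a, b in zip(x, z))
        V[j], V[n + j] = sorted([x, z, y])[:2]
    return V


def chi(v, n):
    """Support map: component j is v_j OR v_{n+j}."""
    return tuple(v[i] | v[n + i] for i in range(n))


def mix_columns(U, j, g):
    """Right-multiply U by the element of C_{n,j} with coefficients g."""
    n = len(U) // 2
    a, b, c, d = g
    out = []
    for row in U:
        new = list(row)
        new[j] = (a & row[j]) ^ (b & row[n + j])
        new[n + j] = (c & row[j]) ^ (d & row[n + j])
        out.append(tuple(new))
    return out


def is_ordered(U, j):
    """Check the condition of Eq. (14) for qubit j."""
    n = len(U) // 2
    x = tuple(row[j] for row in U)
    z = tuple(row[n + j] for row in U)
    y = tuple(p ^ q for p, q in zip(x, z))
    return chi(x, n) <= chi(z, n) <= chi(y, n)


def local_reduce(U):
    """Smallest element of the set L(U) of Eq. (15)."""
    n = len(U) // 2
    candidates = [[tuple(row) for row in U]]
    for j in range(n):                      # choose R_j from S_j, qubit by qubit
        candidates = [W for V in candidates
                      for W in (mix_columns(V, j, g) for g in GL2)
                      if is_ordered(W, j)]
    return min(left_reduce(V) for V in candidates)


# Toy usage: a CNOT (assumed column convention) and a locally equivalent matrix.
cnot = [(1, 0, 0, 0), (1, 1, 0, 0), (0, 0, 1, 1), (0, 0, 0, 1)]
variant = mix_columns(cnot, 0, (0, 1, 1, 0))          # gate on the right
variant = [variant[2], variant[1], variant[0], variant[3]]  # gate on the left
same_class = local_reduce(variant) == local_reduce(cnot)
```

The list `candidates` holds at most \({6}^{n}\) matrices, which is the source of the worst-case bound in Lemma 3; in practice the ordering filter keeps it short.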

Comment 1: Our implementation of localReduce(U) relies on a streamlined version of the above algorithm with a modified definition of the subsets \({{{{\mathcal{S}}}}}_{j}\). Namely, we define \({{{{\mathcal{S}}}}}_{j}\) as a set of all elements \({R}_{j}\,\in \,{{{{\mathcal{C}}}}}_{n,j}\) such that the right multiplication U ← URj permutes the bit strings xj, yj, and zj into the non-decreasing order and leftReduce(URj) ≠ leftReduce(U). The last condition rules out the possibility that the right multiplication of U by Rj is equivalent to a left multiplication of U by some element of the local subgroup (for example, this is the case if U is the identity matrix). Since leftReduce(U) depends only on the coset \({{{{\mathcal{C}}}}}_{n}^{0}U\), the left multiplication of U by any element of the local subgroup does not change leftReduce(U). Thus the set of locally ordered elements \({{{\mathcal{L}}}}(U)\) can be computed using Eq. (15) with the modified definition of \({{{{\mathcal{S}}}}}_{j}\).

Comment 2: We empirically observed that the average-case runtime of the above algorithm is much better than the worst case upper bound of \(O({n}^{2}{6}^{n})\). Indeed, a direct inspection shows that the runtime scales as \(O({n}^{2}M)\), where \(M\,=\,| {{{{\mathcal{S}}}}}_{1}| \cdot | {{{{\mathcal{S}}}}}_{2}| \cdot \ldots \cdot | {{{{\mathcal{S}}}}}_{n}|\). For randomly picked matrices \(U\,\in \,{{{{\mathcal{C}}}}}_{6}\) we observed that M ≈ 5 on average even though \(M\,=\,| {{{{\mathcal{C}}}}}_{6}^{0}| \,=\,{6}^{6}\,=\,46,656\) in the worst case. We leave it as an open question whether the average-case runtime of the above algorithm scales polynomially with n.

Recall that we consider the symmetric group Sn that includes all qubit permutations as a subgroup of \({{{{\mathcal{C}}}}}_{n}\). If w is a permutation of integers {1, 2, …, n}, then the corresponding symplectic matrix W ∈ Sn acts on the basis vectors as \(W{e}^{j}={e}^{w(j)}\) and \(W{e}^{n+j}={e}^{n+w(j)}\) for all j = 1, 2, …, n. Given a matrix \(U\,\in \,{{{{\mathcal{C}}}}}_{n}\), define the equivalence class

$$[U]:=\{L{W}^{-1}UWR\,:\,L,R\in {{{{\mathcal{C}}}}}_{n}^{0},W\,\in \,{S}_{n}\}.$$

The rest of this section is devoted to choosing an efficiently computable canonical representative of each class [U]. Let \({{\mathbb{Z}}}^{n\times n}\) be the set of n × n matrices with integer entries. Define the map \(\kappa \,:\,{{{{\mathcal{C}}}}}_{n}\,\to \,{{\mathbb{Z}}}^{n\times n}\) such that the matrix element of κ(U) located at the ith row and the jth column is the rank of the 2 × 2 submatrix of U formed by the intersection of rows i and i + n and columns j and j + n. The rank is computed over the binary field \({{\mathbb{F}}}_{2}\). In other words, each matrix element of κ(U) has the form

$$\kappa {(U)}_{i,j}={{{{\rm{rank}}}}}_{{{\mathbb{F}}}_{2}}\left[\begin{array}{ll}{U}_{i,j}&{U}_{i,n+j}\\ {U}_{n+i,j}&{U}_{n+i,n+j}\\ \end{array}\right].$$

By definition, κ(U) contains entries from the set {0, 1, 2} and the full matrix κ(U) can be computed in time \(O({n}^{2})\). We claim that the left and right multiplications of U by the single-qubit Clifford gates leave κ(U) invariant, that is,

$$\kappa (LUR)=\kappa (U)\,\,{{\mbox{for all}}}\,\,L,R\,\in \,{{{{\mathcal{C}}}}}_{n}^{0}.$$
(16)

Indeed, suppose first that L = I and \(R\,\in \,{{{{\mathcal{C}}}}}_{n,j}\). Right multiplication U ← UR applies an invertible linear transformation to the pair of columns \(U{e}^{j}\) and \(U{e}^{n+j}\), and acts trivially on the remaining columns. Since the matrix rank is invariant under applying an invertible linear transformation, we conclude that κ(UR) = κ(U) for all \(R\,\in \,{{{{\mathcal{C}}}}}_{n,j}\). The same argument shows that κ(LU) = κ(U) for all \(L\,\in \,{{{{\mathcal{C}}}}}_{n,j}\). This proves Eq. (16) since the local subgroup \({{{{\mathcal{C}}}}}_{n}^{0}\) is generated by the subgroups \({{{{\mathcal{C}}}}}_{n,j}\).
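The map κ is simple enough to state in a few lines. The sketch below assumes that U is a list of 2n rows of 0/1 entries with 0-based indices; the names `rank2_f2` and `kappa` and the example matrices are hypothetical and serve only to illustrate Eq. (16).

```python
def rank2_f2(a, b, c, d):
    """Rank over F_2 of the 2x2 matrix [[a, b], [c, d]]."""
    if (a & d) ^ (b & c):          # non-zero determinant
        return 2
    return 1 if (a | b | c | d) else 0


def kappa(U):
    """n x n matrix of ranks of the 2x2 blocks (rows i, i+n; columns j, j+n)."""
    n = len(U) // 2
    return [[rank2_f2(U[i][j], U[i][n + j], U[n + i][j], U[n + i][n + j])
             for j in range(n)]
            for i in range(n)]


# Toy usage: a CNOT with control 0 and target 1 (assumed column convention).
cnot = [(1, 0, 0, 0),
        (1, 1, 0, 0),
        (0, 0, 1, 1),
        (0, 0, 0, 1)]

# Left multiplication by a single-qubit gate on qubit 0 swaps rows 0 and 2.
cnot_rows_swapped = [cnot[2], cnot[1], cnot[0], cnot[3]]

# Right multiplication by a single-qubit gate on qubit 1 adds column 1 to column 3.
cnot_cols_mixed = [(r[0], r[1], r[2], r[3] ^ r[1]) for r in cnot]

k = kappa(cnot)
```

In this example the diagonal entries equal 2 and the off-diagonal entries equal 1, and both modified matrices give the same κ, in line with Eq. (16).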

Let κmin(U) be the lexicographically smallest matrix in the set of matrices \(\{\kappa ({W}^{-1}UW)\,:\,W\in {S}_{n}\}\). Define a set of qubit permutations

$${{{\mathcal{S}}}}(U):=\{W\in {S}_{n}\,:\,\kappa ({W}^{-1}UW)={\kappa }_{min}(U)\}$$

and a set of matrices

$${{{\mathcal{R}}}}(U):=\{{\mathsf{localReduce}}({W}^{-1}UW)\,:\,W\in {{{\mathcal{S}}}}(U)\}.$$

Note that \({{{\mathcal{R}}}}(U)\,\subseteq \,[U]\) since

$${\mathsf{localReduce}}({W}^{-1}UW)=L{W}^{-1}UWR\in [U]$$

for some \(L,R\in {{{{\mathcal{C}}}}}_{n}^{0}\). Define a fully reduced form of a matrix \(U\,\in \,{{{{\mathcal{C}}}}}_{n}\), denoted ReduceU(U), as the lexicographically smallest element of the set \({{{\mathcal{R}}}}(U)\). The following lemma shows that the fully reduced elements of \({{{{\mathcal{C}}}}}_{n}\) can serve as canonical representatives of the equivalence classes [U]. In other words, \({{{{\mathcal{C}}}}}_{n}\) is a disjoint union of the equivalence classes [U] and each class contains a unique fully reduced element that can be efficiently computed (albeit slightly less efficiently than localReduce). The symplectic matrix data structure enables the computation of ReduceU(U) for a randomly picked matrix \(U\,\in \,{{{{\mathcal{C}}}}}_{n}\) in time less than \(3\times {10}^{-6}\) s for n = 6 and time less than \({10}^{-6}\) s for all n ≤ 5 on a server-class CPU, in this case an Intel® Xeon® CPU E7-4850 v4 @ 2.10GHz.

Lemma 4

Each equivalence class [U] with \(U\,\in \,{{{{\mathcal{C}}}}}_{n}\) contains a unique fully reduced element that can be computed in time \(O({n}^{2}\cdot n!+{t}_{n}\cdot | {{{\mathcal{S}}}}(U)| )\), given the symplectic matrix representation of U. Here tn is the runtime of localReduce for elements of \({{{{\mathcal{C}}}}}_{n}\).

Proof

Consider a matrix \(U\,\in \,{{{{\mathcal{C}}}}}_{n}\). It follows directly from the definitions that ReduceU(U) ∈ [U]. Thus it suffices to check that

$${{{\mathcal{R}}}}(U^{\prime} )={{{\mathcal{R}}}}(U)\,\,{{\mbox{for all}}}\,\,U^{\prime} \in [U].$$
(17)

Indeed, this equation implies \({\mathsf{ReduceU}}(U)\,=\,{\mathsf{ReduceU}}(U^{\prime} )\) for all \(U^{\prime} \,\in \,[U]\), that is, the equivalence class [U] contains a unique reduced element. Let us prove Eq. (17). Write \(U^{\prime} \,=\,L{W}^{-1}UWR\) for some \(L,R\,\in \,{{{{\mathcal{C}}}}}_{n}^{0}\) and W ∈ Sn. Then

$$\begin{array}{ll}{{{\mathcal{R}}}}(U^{\prime} )&=\{{\mathsf{localReduce}}({\tilde{W}}^{-1}L{W}^{-1}UWR\tilde{W})\,:\,\tilde{W}\in {{{\mathcal{S}}}}(U^{\prime} )\}\\ &=\{{\mathsf{localReduce}}(L^{\prime} {\tilde{W}}^{-1}{W}^{-1}UW\tilde{W}R^{\prime} )\,:\,\tilde{W}\in {{{\mathcal{S}}}}(U^{\prime} )\}\\ &=\{{\mathsf{localReduce}}({\tilde{W}}^{-1}{W}^{-1}UW\tilde{W})\,:\,\tilde{W}\in {{{\mathcal{S}}}}(U^{\prime} )\}.\end{array}$$
(18)

Here \(L^{\prime} \,:=\,{\tilde{W}}^{-1}L\tilde{W}\,\in \,{{{{\mathcal{C}}}}}_{n}^{0}\) and \(R^{\prime} \,:=\,{\tilde{W}}^{-1}R\tilde{W}\,\in \,{{{{\mathcal{C}}}}}_{n}^{0}\). In the third equality we noted that localReduce is invariant under left/right multiplications by the elements of the local subgroup \({{{{\mathcal{C}}}}}_{n}^{0}\), see Lemma 3. Finally, the invariance of the map κ under the left and right multiplications by the elements of the local subgroup, see Eq. (16), implies \({\kappa }_{{\rm{min}}}(U^{\prime} )={\kappa }_{{\rm{min}}}(U)\). Thus \(\tilde{W}\,\in \,{{{\mathcal{S}}}}(U^{\prime} )\) iff \(W\tilde{W}\,\in \,{{{\mathcal{S}}}}(U)\). Combining this and Eq. (18) gives \({{{\mathcal{R}}}}(U^{\prime} )\,=\,{{{\mathcal{R}}}}(U)\), as claimed.

The runtime stated in the lemma consists of two terms. The term \(O({n}^{2}\cdot n!)\) is the time needed to compute the set of permutations \({{{\mathcal{S}}}}(U)\). The term \(O({t}_{n}\cdot | {{{\mathcal{S}}}}(U)| )\) is the time needed to compute the set of matrices \({{{\mathcal{R}}}}(U)\) and pick the lexicographically smallest element of \({{{\mathcal{R}}}}(U)\).

Comment 3: Our implementation of ReduceU(U) relies on a streamlined version of the above algorithm with a modified definition of the set \({{{\mathcal{S}}}}(U)\). Namely, we define \({{{\mathcal{S}}}}(U)\) as the set of all permutations W ∈ Sn such that \(\kappa ({W}^{-1}UW)={\kappa }_{min}(U)\) and \({\mathsf{leftReduce}}({W}^{-1}UW)\ne {\mathsf{leftReduce}}(U)\). The last condition rules out the possibility that the conjugation of U by W is equivalent to a left multiplication of U by some element of the local subgroup (for example, this is the case if U is the identity matrix). Since localReduce(U) depends only on the double coset \({{{{\mathcal{C}}}}}_{n}^{0}U{{{{\mathcal{C}}}}}_{n}^{0}\), a left multiplication of U by any element of the local subgroup does not change localReduce(U). Thus one can compute the set \({{{\mathcal{R}}}}(U)\) using the modified definition of \({{{\mathcal{S}}}}(U)\).

Comment 4: We empirically observed that \(| {{{\mathcal{S}}}}(U)| =1\) for a typical element of the Clifford group and that the maximal value of \(| {{{\mathcal{S}}}}(U)|\) is 14. The mean value of \(| {{{\mathcal{S}}}}(U)|\) is approximately 1.03 for a randomly picked \(U\,\in \,{{{{\mathcal{C}}}}}_{6}\).

By a slight abuse of terminology, we refer to the computationally-defined fully reduced elements of the Clifford group as the reduced elements in the remainder of the paper. This should not lead to confusion since the left-reduced and the locally reduced forms are used only in this subsection.

Data structure

By definition, any element of the Clifford group \(U\,\in \,{{{{\mathcal{C}}}}}_{n}\) can be represented by a binary matrix of size 2n × 2n. However, if we only care about the reduced form of U, a slightly more efficient representation is possible, as given by the following lemma.

Lemma 5

Let \(U^{\prime}\) be the matrix obtained from \(U\,\in \,{{{{\mathcal{C}}}}}_{n}\) by removing the n-th and the 2n-th rows from it. Then U is uniquely determined by \(U^{\prime}\) up to left multiplication by the single-qubit Clifford gates acting on the n-th qubit.

Proof

Let \({{{\mathcal{L}}}}\,\subseteq \,{{\mathbb{F}}}_{2}^{2n}\) be the linear subspace spanned by the j-th rows of U with \(j\,\notin \,\{n,2n\}\) and let \({{{{\mathcal{L}}}}}^{\perp }\,\subseteq \,{{\mathbb{F}}}_{2}^{2n}\) be the linear subspace spanned by the vectors orthogonal to \({{{\mathcal{L}}}}\) with respect to the symplectic inner product. Note that \({{{\mathcal{L}}}}\) depends only on \(U^{\prime}\). The condition that U is a symplectic matrix implies \({{{{\rm{span}}}}}_{{{\mathbb{F}}}_{2}}({e}_{n}U,{e}_{2n}U)\,=\,{{{{\mathcal{L}}}}}^{\perp }\). Here we use the notations from the section “Computation of ReduceU”. The missing pair of rows \({e}_{n}U\) and \({e}_{2n}U\) is uniquely defined by \({{{\mathcal{L}}}}\) up to an invertible linear transformation \({e}_{n}U\,\leftarrow \,a{e}_{n}U\,\oplus \,b{e}_{2n}U\) and \({e}_{2n}U\,\leftarrow \,c{e}_{n}U\,\oplus \,d{e}_{2n}U\) for some

$$\left[\begin{array}{ll}a&c\\ b&d\\ \end{array}\right]\in {{{\rm{GL}}}}(2,{{\mathbb{F}}}_{2}).$$

As discussed in the section “Computation of ReduceU”, there is a one-to-one correspondence between such transformations and left multiplications U ← LU, where \(L\,\in \,{{{{\mathcal{C}}}}}_{n}^{0}\) acts non-trivially only on the nth qubit.
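The subspace \({{{{\mathcal{L}}}}}^{\perp }\) used in the proof is defined through the symplectic inner product on \({{\mathbb{F}}}_{2}^{2n}\). A minimal C++ sketch of that product follows. It assumes, purely for illustration, that a vector v is packed into an integer with bit i − 1 holding the component \({v}_{i}\), so the low n bits form the first half of the vector and the next n bits the second half. The function name is hypothetical and not taken from the authors' code.

```cpp
#include <cassert>
#include <cstdint>

// Symplectic inner product over F_2 of two vectors u, v in F_2^{2n}:
//   sum_i ( u_i * v_{n+i} + u_{n+i} * v_i )  mod 2.
// Assumed packing (illustrative only): bit (i-1) of the integer holds v_i.
inline unsigned symplectic_product(std::uint32_t u, std::uint32_t v, unsigned n) {
    assert(n >= 1 && n <= 16);
    const std::uint32_t mask = (std::uint32_t{1} << n) - 1;
    const std::uint32_t ux = u & mask, uz = (u >> n) & mask;  // two halves of u
    const std::uint32_t vx = v & mask, vz = (v >> n) & mask;  // two halves of v
    std::uint32_t t = (ux & vz) ^ (uz & vx);  // termwise products, XOR-combined
    unsigned parity = 0;
    while (t) {            // parity of the number of set bits
        parity ^= 1u;
        t &= t - 1;        // clear the lowest set bit
    }
    return parity;
}
```

Two rows are symplectically orthogonal exactly when this function returns 0, so a candidate vector belongs to \({{{{\mathcal{L}}}}}^{\perp }\) when it returns 0 against every row of the thin matrix.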

We refer to the matrix \(U^{\prime}\) obtained from \(U\,\in \,{{{{\mathcal{C}}}}}_{n}\) by removing the pair of rows n and 2n as a thin matrix representation of U. Our C++ implementation adopts the thin matrix data format for all intermediate steps of the algorithm. The thin matrix spans 4n(n − 1) bits and, for n ≤ 6, can be conveniently distributed over two machine words, each of length 64 bits. The first word stores the rows \({e}_{1}U,{e}_{2}U,\ldots ,{e}_{n-1}U\) and the second word stores the rows \({e}_{n+1}U,{e}_{n+2}U,\ldots ,{e}_{2n-1}U\). This leaves \(128-4n(n-1)\,\ge \,8\) free bits for n ≤ 6 that can be conveniently used to specify the cost-reducing generator in the augmented database, see the section “Database generation”. Recall that the number of generators is \(m=9n(n-1)/2\,\le \,135\) for n ≤ 6. Thus the generator can be specified using only 8 bits. Note also that storing the full matrix \(U\,\in \,{{{{\mathcal{C}}}}}_{n}\) using only two machine words is impossible for n = 6, as it requires \(4{n}^{2}=144\) bits.
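The bit budget above can be illustrated with a small C++ sketch of one possible packing. The layout is an assumption made for this example, not the authors' actual format: each word holds n − 1 rows of 2n bits in its low 60 bits, and the 8-bit generator index is split into two nibbles stored in the top 4 bits of each word. All type and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical thin-matrix container: w[0] holds rows e_1 U .. e_{n-1} U,
// w[1] holds rows e_{n+1} U .. e_{2n-1} U. Valid for n <= 6 only.
struct ThinMatrix {
    std::uint64_t w[2] = {0, 0};
};

constexpr unsigned kTagShift = 60;  // 2n(n-1) <= 60 matrix bits per word
constexpr std::uint64_t kMatrixMask = (std::uint64_t{1} << kTagShift) - 1;

// Read the entry in row r (0-based within the chosen word) and column c.
inline bool get_bit(const ThinMatrix& m, unsigned n, unsigned half, unsigned r, unsigned c) {
    assert(n >= 2 && n <= 6 && half < 2 && r + 1 < n && c < 2 * n);
    return (m.w[half] >> (r * 2 * n + c)) & 1u;
}

// Write the entry in row r (0-based within the chosen word) and column c.
inline void set_bit(ThinMatrix& m, unsigned n, unsigned half, unsigned r, unsigned c, bool v) {
    assert(n >= 2 && n <= 6 && half < 2 && r + 1 < n && c < 2 * n);
    const std::uint64_t bit = std::uint64_t{1} << (r * 2 * n + c);
    m.w[half] = v ? (m.w[half] | bit) : (m.w[half] & ~bit);
}

// Store an 8-bit generator index in the 4 + 4 free high bits.
inline void set_generator(ThinMatrix& m, unsigned g) {
    assert(g < 256);
    m.w[0] = (m.w[0] & kMatrixMask) | (std::uint64_t{g & 0xFu} << kTagShift);
    m.w[1] = (m.w[1] & kMatrixMask) | (std::uint64_t{g >> 4} << kTagShift);
}

// Recover the generator index from the two high nibbles.
inline unsigned get_generator(const ThinMatrix& m) {
    return static_cast<unsigned>((m.w[0] >> kTagShift) | ((m.w[1] >> kTagShift) << 4));
}
```

With n = 6 the last matrix bit sits at position 59, so the matrix entries and the generator tag never overlap.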

The thin matrix format enables fast left and right multiplication by the single-qubit and two-qubit Clifford gates, which require at most 24 CPU instructions per gate for all n ≤ 6 (each instruction implements a bitwise operation on a single machine word). When needed, the thin matrix \(U^{\prime}\) can be expanded into the full symplectic matrix \(U\,\in \,{{{{\mathcal{C}}}}}_{n}\) by calculating the missing pair of rows \({e}_{n}U\) and \({e}_{2n}U\) using the symplectic version of Gram–Schmidt orthogonalization. Our implementation converts the thin matrix to the full matrix in less than \(2\times {10}^{-7}\) s for any n ≤ 6 on a server-class CPU, in this case an Intel® Xeon® CPU E7-4850 v4 @ 2.10GHz, which is negligible compared with the time it takes to compute the reduced form.
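To give a sense of why a gate costs only a handful of bitwise instructions, the sketch below shows one elementary step: adding one column to another over \({{\mathbb{F}}}_{2}\) in all packed rows of a word at once. In the symplectic picture, right multiplication by a CNOT reduces to two such column additions. The packing (rows of width 2n stored consecutively from bit 0) and the function names are assumptions for illustration; in practice the column masks would be precomputed rather than rebuilt per call.

```cpp
#include <cassert>
#include <cstdint>

// Mask with bit `col` set in each of the (n-1) packed rows of width 2n.
inline std::uint64_t column_mask(unsigned n, unsigned col) {
    assert(n >= 2 && n <= 6 && col < 2 * n);
    std::uint64_t mask = 0;
    for (unsigned r = 0; r + 1 < n; ++r)
        mask |= std::uint64_t{1} << (r * 2 * n + col);
    return mask;
}

// XOR column `src` into column `dst` in every packed row of word `w`.
inline std::uint64_t add_column(std::uint64_t w, unsigned n, unsigned src, unsigned dst) {
    assert(src != dst && dst < 2 * n);
    const std::uint64_t picked = w & column_mask(n, src);  // entries of column src
    const std::uint64_t moved =
        (dst > src) ? (picked << (dst - src)) : (picked >> (src - dst));
    return w ^ moved;  // one mask, one shift, one XOR per word
}
```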

Software tricks

Database generation

The calculation of the reduced cost-k Clifford group set \({{{{\mathcal{R}}}}}_{n}^{k}\), as described in the section “Database generation”, lends itself to parallel processing. Specifically, each element of the set \({{{{\mathcal{R}}}}}_{n}^{k}\) can be calculated concurrently from its own data on its own processor. The implementation considerations for this run-once parallel processing job depended on factors such as:

  i. the cost and availability of scaled-up/scaled-out hardware, and

  ii. the cost-benefit for implementing, measuring, and tuning for different data-level parallel processing options, including shared memory versus distributed memory (e.g., OpenMP/MPI) and specialized processors (e.g., vector processors, GPUs, FPGAs),

not to mention the multiple software options with each, from programming languages to libraries22.

Using Flynn’s taxonomy23, the Single Program, Multiple Data (SPMD) streams model was implemented using the C++ concurrent-set template class; specifically, each reduced cost-k Clifford group set \({{{{\mathcal{R}}}}}_{n}^{k}\) is an instance of set<pair<uint64, uint64>>. This is a good choice for programmer productivity, i.e., letting the container’s semantics deal with the requirements of maintaining distinct and efficiently-searchable elements of a multi-terabyte set on SMP hardware, in this case an Intel® Xeon® 128-CPU E7-4850 v4 @ 2.10GHz with 6TB RAM.

Runtime was extrapolated to take about 100 days to complete the full database generation on a single machine, amounting to approximately 100 × 24 × 128 = 307,200 CPU-hours that can be effectively divided among as many machines as are available. Hardware and software measurements during database generation, using performance analysis tools ranging from vmstat to VTune™, exposed heavy “NUMA thrashing,” i.e., soft page faults24. To alleviate this for the final half of the run, the C++ set template was replaced with C’s most basic systems programming mechanisms, namely malloc, bsearch, and qsort, along with read/write and uint128, to allocate, position, and search raw memory, resulting in a 5x speed-up.
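The replacement of the set container by raw memory can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each database entry is a 128-bit key stored as two 64-bit words and compared lexicographically, and the struct and function names are hypothetical. Only the standard library calls qsort and bsearch are used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical 128-bit database entry: two machine words.
struct Key128 {
    std::uint64_t hi;
    std::uint64_t lo;
};

// Three-way lexicographic comparison in the form qsort/bsearch expect.
int compare_keys(const void* a, const void* b) {
    const Key128* x = static_cast<const Key128*>(a);
    const Key128* y = static_cast<const Key128*>(b);
    if (x->hi != y->hi) return x->hi < y->hi ? -1 : 1;
    if (x->lo != y->lo) return x->lo < y->lo ? -1 : 1;
    return 0;
}

// Sort a contiguous block of keys in place.
void sort_database(Key128* keys, std::size_t count) {
    assert(keys != nullptr || count == 0);
    std::qsort(keys, count, sizeof(Key128), compare_keys);
}

// Membership test on a sorted block via binary search.
bool contains(const Key128* keys, std::size_t count, Key128 query) {
    assert(keys != nullptr || count == 0);
    return std::bsearch(&query, keys, count, sizeof(Key128), compare_keys) != nullptr;
}
```

A flat sorted array keeps the entries contiguous in memory, whereas a node-based set scatters them across allocations; this is one plausible reason for the reduced page-fault pressure reported above.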

Synthesis of optimal circuits

With the one-time generation of the database complete and saved on secondary storage (Solid State Disk), similar systems programming mechanisms in C were exploited to optimize performance and scalability in order to read/search what is now effectively a lookup table (LUT), with the expensive runtime calculation of an optimal 6-qubit Clifford circuit completed and replaceable by a simple array indexing operation. The database can be memory-mapped with mmap25 for a greater degree of

  i. programmer productivity, i.e., the database can be easily referenced as memory using pointers, with no explicit file IO, and

  ii. operational flexibility, i.e., the database can be effectively used by any type of hardware, ranging from a single laptop to a cluster of server-class machines, with scaling solely dependent on the choice of hardware,

all without changing the code; while the OS kernel and mmap transparently and efficiently take care of

  i. demand paging, and

  ii. maintaining only a single copy of data in memory, as opposed to copies in both the file cache and user space.
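The memory-mapping approach described above can be sketched with the POSIX calls open, fstat, mmap, and munmap. This is a minimal example under stated assumptions: the database file is treated as a flat array of 64-bit words, and the struct and function names are hypothetical rather than taken from the authors' code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Read-only view of a file as an array of 64-bit words (hypothetical type).
struct MappedTable {
    const std::uint64_t* words = nullptr;
    std::size_t count = 0;   // number of complete 64-bit words
    std::size_t bytes = 0;   // mapped length, needed for munmap
};

// Map the whole file read-only; returns an empty table on failure.
inline MappedTable map_table(const char* path) {
    assert(path != nullptr);
    MappedTable t;
    const int fd = open(path, O_RDONLY);
    if (fd < 0) { std::perror("open"); return t; }
    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        const std::size_t len = static_cast<std::size_t>(st.st_size);
        void* p = mmap(nullptr, len, PROT_READ, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            t.words = static_cast<const std::uint64_t*>(p);
            t.count = len / sizeof(std::uint64_t);
            t.bytes = len;
        } else {
            std::perror("mmap");
        }
    }
    close(fd);  // the mapping remains valid after the descriptor is closed
    return t;
}

// Release the mapping and reset the table.
inline void unmap_table(MappedTable& t) {
    if (t.words != nullptr)
        munmap(const_cast<std::uint64_t*>(t.words), t.bytes);
    t = MappedTable{};
}
```

Once mapped, `t.words` can be handed directly to a binary search; pages are brought in on demand by the kernel, with no explicit read calls.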

In addition, to reduce the number of SSD queries, being the most time-consuming operation our search relies on, we employed the following strategy:

  i. we store the databases of Clifford circuits requiring 1–8, 14, and 15 gates in RAM,

  ii. we store an index consisting of each 1024th element of Clifford unitaries implementable with 9–13 gates in RAM, and

  iii. when the length-1024 chunk containing the desired element is found by the binary search, we make one long query to extract all 2048 64-bit integers in this chunk.

The above modification limits the number of SSD queries required to synthesize an optimal circuit to at most 10 (at most two queries for each search over the gate counts of 9, 10, 11, 12, and 13) at the cost of 2.5GB of RAM.
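The index-then-chunk strategy can be sketched as below. This is a simplified model under explicit assumptions: an in-memory vector named `storage` stands in for the SSD-resident sorted table, each entry is a pair of 64-bit words, and every lookup issues exactly one chunk read (the sketch does not model the second query per search that the text allows for). All names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Key = std::pair<std::uint64_t, std::uint64_t>;  // one 128-bit entry
constexpr std::size_t kChunk = 1024;                  // entries per chunk

// In-RAM index: the first key of every chunk of the sorted table.
std::vector<Key> build_index(const std::vector<Key>& sorted) {
    std::vector<Key> index;
    for (std::size_t i = 0; i < sorted.size(); i += kChunk)
        index.push_back(sorted[i]);
    return index;
}

// Locate the chunk via the RAM index, then search inside that chunk only.
// `queries` counts simulated storage reads (one contiguous read per lookup).
bool lookup(const std::vector<Key>& storage, const std::vector<Key>& index,
            const Key& q, std::size_t* queries) {
    assert(queries != nullptr);
    auto it = std::upper_bound(index.begin(), index.end(), q);
    if (it == index.begin()) return false;  // q precedes every stored key
    const std::size_t chunk = static_cast<std::size_t>(it - index.begin()) - 1;
    const std::size_t first = chunk * kChunk;
    const std::size_t last = std::min(first + kChunk, storage.size());
    ++*queries;  // the whole chunk (2048 words) is fetched in one read
    return std::binary_search(storage.begin() + first, storage.begin() + last, q);
}
```

The index holds one entry per 1024 stored elements, so its RAM footprint is roughly a thousandth of the table it covers.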

A machine with enough RAM to fit the entire database in will get the best performance as the complete database fills the file cache, and a machine with little-to-no available RAM will get the worst performance as every pointer access to a memory-mapped region (e.g., bsearch) will touch the secondary storage. A commodity machine with typical RAM sizes will get near-best performance as the “hot” parts of the database—the internal nodes of bsearch—will tend to remain in the cache hierarchy (L1-L3, file cache) and result in minimal access to secondary storage. OS-specific parameters were not explored but can also be benchmarked and tuned independently of the database and code, including page sizes and pinned memory.

Proof of Lemma 1

We need to show that any element \(U\,\in \,{{{{\mathcal{C}}}}}_{n}^{k}\) can be written as \(U={G}_{{a}_{1}}{G}_{{a}_{2}}\cdots {G}_{{a}_{k}}L\) for some \(L\,\in \,{{{{\mathcal{C}}}}}_{n}^{0}\) and some k-tuple of generators. We use induction on k. The base of induction is k = 0, in which case the statement is trivial. Suppose k ≥ 1 and \(U\,\in \,{{{{\mathcal{C}}}}}_{n}^{k}\). By definition, U can be implemented by a circuit composed of k CNOT gates and some number of single-qubit gates. Let \({{{{\rm{CNOT}}}}}_{i,j}\) be the last CNOT gate in this circuit. Then

$$U=M{{{{\rm{CNOT}}}}}_{i,j}V$$

for some \(M\,\in \,{{{{\mathcal{C}}}}}_{n}^{0}\) and \(V\,\in \,{{{{\mathcal{C}}}}}_{n}^{k-1}\). We can assume without loss of generality that i < j. Indeed, if i > j, use the identity \({{{{\rm{CNOT}}}}}_{j,i}={{{{\rm{H}}}}}_{i}{{{{\rm{H}}}}}_{j}{{{{\rm{CNOT}}}}}_{i,j}{{{{\rm{H}}}}}_{i}{{{{\rm{H}}}}}_{j}\) to flip the control and the target qubits of the last CNOT gate. The extra H gates can be absorbed into the M and V layers. By the induction hypothesis, \(V={G}_{{a}_{2}}\cdots {G}_{{a}_{k}}L\) for some \(L\,\in \,{{{{\mathcal{C}}}}}_{n}^{0}\). Furthermore, we can assume without loss of generality that \(M={{{{\rm{A}}}}}_{i}{{{{\rm{B}}}}}_{j}\) for some \({{{\rm{A}}}},{{{\rm{B}}}}\,\in \,{{{{\mathcal{C}}}}}_{1}\). Indeed, all single-qubit gates in M that act on qubits \(\notin \,\{i,j\}\) can be commuted through \({{{{\rm{CNOT}}}}}_{i,j}\) and absorbed into V. If \({{{\rm{A}}}},{{{\rm{B}}}}\,\in \,\{{{{\rm{I}}}},{{{\rm{HP}}}},{{{\rm{PH}}}}\}\), we are done. Indeed, in this case \({{{{\rm{A}}}}}_{i}{{{{\rm{B}}}}}_{j}{{{{\rm{CNOT}}}}}_{i,j}\,=\,{G}_{{a}_{1}}\) is a generator and \(U={G}_{{a}_{1}}V={G}_{{a}_{1}}{G}_{{a}_{2}}\cdots {G}_{{a}_{k}}L\) with \(L\,\in \,{{{{\mathcal{C}}}}}_{n}^{0}\). Otherwise, transform A and B into the desired form by “borrowing” the missing single-qubit gates from V and commuting them through \({{{{\rm{CNOT}}}}}_{i,j}\) using the Clifford group identities:

$${{{{\rm{P}}}}}^{2}={{{{\rm{H}}}}}^{2}={({{{\rm{P}}}}{{{\rm{H}}}}{{{\rm{P}}}})}^{2}={{{\rm{I}}}},\quad {{{\rm{P}}}}{{{\rm{H}}}}{{{\rm{P}}}}={{{\rm{H}}}}{{{\rm{P}}}}{{{\rm{H}}}},$$
$$\begin{array}{l}{{{{\rm{P}}}}}_{i}{{{{\rm{CNOT}}}}}_{i,j}={{{{\rm{CNOT}}}}}_{i,j}{{{{\rm{P}}}}}_{i},\quad {{{{\rm{H}}}}}_{i}{{{{\rm{CNOT}}}}}_{i,j}={({{{\rm{H}}}}{{{\rm{P}}}})}_{i}{{{{\rm{CNOT}}}}}_{i,j}{{{{\rm{P}}}}}_{i},\\ {({{{\rm{P}}}}{{{\rm{H}}}}{{{\rm{P}}}})}_{i}{{{{\rm{CNOT}}}}}_{i,j}={({{{\rm{P}}}}{{{\rm{H}}}})}_{i}{{{{\rm{CNOT}}}}}_{i,j}{{{{\rm{P}}}}}_{i},\end{array}$$
$$\begin{array}{l}{{{{\rm{P}}}}}_{j}{{{{\rm{CNOT}}}}}_{i,j}={({{{\rm{H}}}}{{{\rm{P}}}})}_{j}{{{{\rm{CNOT}}}}}_{i,j}{({{{\rm{P}}}}{{{\rm{H}}}}{{{\rm{P}}}})}_{j},{{{{\rm{H}}}}}_{j}{{{{\rm{CNOT}}}}}_{i,j}={({{{\rm{P}}}}{{{\rm{H}}}})}_{j}{{{{\rm{CNOT}}}}}_{i,j}{({{{\rm{P}}}}{{{\rm{H}}}}{{{\rm{P}}}})}_{j},\,\\{{\mbox{and}}}\,{({{{\rm{P}}}}{{{\rm{H}}}}{{{\rm{P}}}})}_{j}{{{{\rm{CNOT}}}}}_{i,j}={{{{\rm{CNOT}}}}}_{i,j}{({{{\rm{P}}}}{{{\rm{H}}}}{{{\rm{P}}}})}_{j}.\end{array}$$

Recall that these identities only apply to elements of the binary symplectic group; the corresponding identities for unitary Clifford operators may include some extra phase factors and Pauli gates. This completes the proof.

Pauli mixing constraint

In this section, we prove that a Pauli-invariant probability distribution μ on the n-qubit Clifford group is a unitary 2-design iff μ is Pauli mixing. The fact that Pauli-invariance and Pauli mixing are sufficient for being a 2-design is known (ref. 16, Appendix D). Thus it suffices to prove that any Pauli-invariant Clifford 2-design is Pauli mixing.

The Haar integral in Eq. (1) can be computed explicitly using Weingarten functions26,

$$\begin{array}{l}{\int}_{U({2}^{n})}({\hat{U}}^{{\dagger} }\hat{A}\hat{U})\otimes ({\hat{U}}^{{\dagger} }\hat{B}\hat{U}){\rm{d}}U={{{\rm{SWAP}}}}\left[\frac{{{{\rm{Tr}}}}(\hat{A}\hat{B})}{{4}^{n}-1}-\frac{{{{\rm{Tr}}}}(\hat{A}){{{\rm{Tr}}}}(\hat{B})}{{2}^{n}({4}^{n}-1)}\right]\\\qquad\qquad\qquad\qquad\qquad\qquad\quad+\hat{I}\otimes \hat{I}\left[\frac{{{{\rm{Tr}}}}(\hat{A}){{{\rm{Tr}}}}(\hat{B})}{{4}^{n}-1}-\frac{{{{\rm{Tr}}}}(\hat{A}\hat{B})}{{2}^{n}({4}^{n}-1)}\right].\end{array}$$

Here SWAP is a unitary operator that swaps the two n-qubit registers separated by the tensor product. It is well-known that any complex matrix of size \({2}^{n}\times {2}^{n}\) can be expanded in the Pauli basis

$${{{{\mathcal{P}}}}}_{n}={\{\hat{I},\hat{X},\hat{Y},\hat{Z}\}}^{\otimes n}.$$

Thus it suffices to impose Eq. (1) only for \(\hat{A},\hat{B}\in {{{{\mathcal{P}}}}}_{n}\). Noting that the Pauli basis is orthonormal with respect to the inner product \({{{\rm{Tr}}}}({\hat{A}}^{{\dagger} }\hat{B})/{2}^{n}\) one concludes that a pair \(({{{\mathcal{D}}}},\mu )\) is a unitary 2-design iff

$$\mathop{\sum}\limits_{\hat{U}\in {{{\mathcal{D}}}}}\mu (\hat{U})({\hat{U}}^{{\dagger} }\hat{A}\hat{U})\otimes ({\hat{U}}^{{\dagger} }\hat{B}\hat{U})=\left\{\begin{array}{rcl}0&{{{\rm{if}}}}&\hat{A}\ne \hat{B},\\ \hat{{{\Lambda }}}&{{{\rm{if}}}}&\hat{A}=\hat{B}\ne \hat{I}\\ \end{array}\right.\quad \,{{\mbox{for all}}}\,\,\hat{A},\hat{B}\in {{{{\mathcal{P}}}}}_{n}$$
(19)

where

$$\hat{{{\Lambda }}}=\frac{1}{{4}^{n}-1}\,({2}^{n}{{{\rm{SWAP}}}}-\hat{I}\,\otimes \,\hat{I})=\frac{1}{{4}^{n}-1}\,\mathop{\sum}\limits_{\hat{O}\in {{{{\mathcal{P}}}}}_{n}\setminus \{\hat{I}\}}\,\hat{O}\otimes \hat{O}.$$
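The second equality in the expression for \(\hat{{{\Lambda }}}\) rests on the standard Pauli expansion of the SWAP operator, which is not spelled out in the text. As a short check, using the same notation:

$${{{\rm{SWAP}}}}=\frac{1}{{2}^{n}}\,\mathop{\sum}\limits_{\hat{O}\in {{{{\mathcal{P}}}}}_{n}}\,\hat{O}\otimes \hat{O}\quad \Longrightarrow \quad {2}^{n}{{{\rm{SWAP}}}}-\hat{I}\,\otimes \,\hat{I}=\mathop{\sum}\limits_{\hat{O}\in {{{{\mathcal{P}}}}}_{n}\setminus \{\hat{I}\}}\,\hat{O}\otimes \hat{O}.$$

For n = 1 this reduces to the familiar identity \({{{\rm{SWAP}}}}=(\hat{I}\otimes \hat{I}+\hat{X}\otimes \hat{X}+\hat{Y}\otimes \hat{Y}+\hat{Z}\otimes \hat{Z})/2\).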

A Pauli operator \(\hat{O}\,\in \,{{{{\mathcal{P}}}}}_{n}\) can be parameterized by a bit string \(v\,\in \,{\{0,1\}}^{2n}\) such that

$$\hat{O}(v)\equiv \hat{O}({v}_{1}{v}_{n+1})\otimes \hat{O}({v}_{2}{v}_{n+2})\otimes \cdots \otimes \hat{O}({v}_{n}{v}_{2n}),$$

where \(\hat{O}(00)\,\equiv \,\hat{I}\), \(\hat{O}(10)\,\equiv \,\hat{X}\), \(\hat{O}(01)\,\equiv \,\hat{Z}\), and \(\hat{O}(11)\,\equiv \,\hat{Y}\). The unitary version of the Clifford group, which we denote \({{\mathfrak{C}}}_{n}\), is a group of complex matrices \(\hat{U}\,\in \,U({2}^{n})\) that map Pauli operators to Pauli operators under conjugation. More formally, \(\hat{U}\in {{\mathfrak{C}}}_{n}\) iff there exists a symplectic matrix \(U\,\in \,{{{{\mathcal{C}}}}}_{n}\) such that

$$\hat{U}\hat{O}(v){\hat{U}}^{{\dagger} }=\pm \hat{O}(Uv)$$
(20)

for all \(v\,\in \,{\{0,1\}}^{2n}\). Here the sign may depend on v. The symplectic matrix \(U\,\in \,{{{{\mathcal{C}}}}}_{n}\) in Eq. (20) is uniquely determined by \(\hat{U}\). Conversely, \(\hat{U}\) is uniquely determined by U up to (right) multiplications by Pauli operators and the overall phase. In other words, \({{\mathfrak{C}}}_{n}\) is isomorphic (as a set) to \({{{{\mathcal{C}}}}}_{n}\times {{{{\mathcal{P}}}}}_{n}\) if one ignores the overall phase of unitary matrices.

Suppose \(\mu :\,{{\mathfrak{C}}}_{n}\,\to \,{{\mathbb{R}}}_{+}\) is a Pauli-invariant probability distribution, that is, \(\mu (\hat{U})\,=\,\mu (\hat{U}\hat{O})\) for all \(\hat{O}\,\in \,{{{{\mathcal{P}}}}}_{n}\) and \(\hat{U}\,\in \,{{\mathfrak{C}}}_{n}\). Using the isomorphism \({{\mathfrak{C}}}_{n}\,\cong \,{{{{\mathcal{C}}}}}_{n}\times {{{{\mathcal{P}}}}}_{n}\), define a distribution \(\pi :\,{{{{\mathcal{C}}}}}_{n}\,\to \,{{\mathbb{R}}}_{+}\) such that \(\mu (U\times P)\,=\,\pi (U)/{4}^{n}\) for all \(U\,\in \,{{{{\mathcal{C}}}}}_{n}\) and \(P\,\in \,{{{{\mathcal{P}}}}}_{n}\). Suppose \(({{\mathfrak{C}}}_{n},\mu )\) is a 2-design, that is, μ obeys Eq. (19) with \({{{\mathcal{D}}}}={{\mathfrak{C}}}_{n}\). Consider the second case of Eq. (19) such that \(\hat{A}\,=\,\hat{B}\,=\,\hat{O}(x)\) for some non-zero vector \(x\,\in \,{\{0,1\}}^{2n}\). Then it is equivalent to

$$\mathop{\sum}\limits_{U\in {{{{\mathcal{C}}}}}_{n}}\pi (U)\hat{O}(Ux)\otimes \hat{O}(Ux)=\frac{1}{{4}^{n}-1}\,\mathop{\sum}\limits_{y\in {\{0,1\}}^{2n}\setminus {0}^{2n}}\,\hat{O}(y)\otimes \hat{O}(y).$$

Since Pauli operators are linearly independent, this is possible only if a random vector Ux with U sampled from π(U) is distributed uniformly on the set of all non-zero vectors \({\{0,1\}}^{2n}\setminus {0}^{2n}\). This gives the Pauli mixing condition Eq. (3).