6-qubit optimal Clifford circuits

The Clifford group lies at the core of quantum computation: it underlies quantum error correction, its elements can be used to perform magic state distillation and form randomized benchmarking protocols, it is used to study quantum entanglement, and more. The ability to utilize Clifford group elements in practice relies heavily on the efficiency of their circuit-level implementation. Finding short circuits is a hard problem; although the Clifford group is finite, its size grows quickly with the number of qubits n, limiting known optimal implementations to n = 4 qubits. For n = 6, the number of Clifford group elements is about $2.1 \times 10^{23}$. In this paper, we report a set of algorithms, along with their C++ implementation, that implicitly synthesize optimal circuits for all 6-qubit Clifford group elements by storing a subset of the latter in a database of size 2.1TB (1kB = 1024B). We demonstrate how to extract an arbitrary optimal 6-qubit Clifford circuit in 0.0009358 and 0.0006274 s using consumer- and enterprise-grade hardware, respectively, while relying on this database. We use this implementation to establish a new example of quantum advantage by Clifford circuits over CNOT gate circuits and find optimal Clifford 2-designs for up to 4 qubits.


Introduction
Quantum computations are studied for their promise to outperform classical counterparts for certain kinds of computations [1]. The Clifford group is an important finite subgroup of the full unitary group, which describes the set of quantum computations. Despite being classically simulable [2,3] in low-degree polynomial time and having a simple structure [4] (admitting an efficient parametrization and being computable by linear-depth circuits), the group is most famous for lying at the core of quantum error correction [1], which is believed to be necessary for scalable quantum computation. Even when restricted to the study of fault tolerance, the Clifford group plays multiple roles. To illustrate, all (standard) encoding circuits are Clifford [1], and so are the circuits for state distillation [5,6], necessary for the fault-tolerant implementation of non-Clifford gates. Clifford circuits lie at the core of randomized benchmarking protocols [7,8]. Other use cases include shadow tomography [9,10], the study of entanglement [1,11], and quantum data hiding [12]. It is perhaps fair to regard the Clifford group as one of the most visible and important subgroups of the group of all quantum computations.

Superconducting circuits and trapped ions are two technological frameworks that have produced a stream of (universal prototype) programmable quantum computers, publicly available since 2016. Each technology comes in a range of flavors: e.g., superconducting circuits can be based on phase, charge, or flux qubits (or even hybrid kinds) and rely on various qubit coupling mechanisms, while trapped ions can be based on various ion species and rely on different approaches to the two-qubit gates (e.g., stationary vs mobile qubits). However, no matter the specific flavor, all prototype quantum computers based on these two approaches share one property [13,14]: the two-qubit gate has notably lower fidelity than a single-qubit gate. Thus, to a first approximation, the fidelity of an entire quantum computation depends on the number of two-qubit gates it uses. To make a more subtle point, since the single-qubit gates are most frequently implemented by pulses with real-valued control parameters, the number of two-qubit gates in a circuit upper bounds the number of single-qubit gates (up to a constant factor), meaning that reducing the two-qubit gate count likely also reduces the number of single-qubit gates. We further note that the cnot gates are available natively (i.e., requiring the minimal number, one, of two-qubit physical-level interactions) in both superconducting circuit and trapped ion technologies. Finally, recall that the physical-level entangling pulses frequently take the form of XX, ZX, and ZZ, requiring single-qubit corrections to turn those interactions into the commonly used cnot or cz gates. This means that minimizing the single-qubit gate count in an abstract circuit may not directly minimize the number of single-qubit physical pulses, since the single-qubit gates will be reshuffled during technology mapping. This justifies our focus on minimizing the cnot gate count, selected as the optimization criterion in this paper.
In this paper, we study the problem of optimal synthesis of Clifford circuits. Since the problem of optimal circuit synthesis is hard, we restrict our attention to a small number of qubits, at most 6. The number of Clifford group elements over 6 qubits, $2.1\cdot10^{23}$, is still very large, and we employ a range of techniques to make the search tractable on modern computers. At the core of our approach are a mechanism to break down the set of Clifford unitaries into classes of unitaries sharing a similar optimal circuit structure, efficient computation of the canonical representative of each class, and efficient manipulation of class members and of the database of canonical representatives.
The rest of the paper is organized as follows. Section 2 reports the definitions necessary to understand the technical parts. Section 3 starts with a subsection containing an overview of our algorithm; all technical details can be found in the following five subsections. Section 4 discusses the results, including a summary of relevant statistics (average and maximal circuit sizes, distribution of optimal costs) and properties of optimal Clifford circuits that could be calculated from the synthesized data (advantage of Clifford circuits over linear reversible circuits, optimal 2-designs), and compares our work to previous similar results.

Definitions
We define the n-qubit Clifford group $C_n$ as the group of $2n\times 2n$ symplectic matrices $M$ over the two-element field $F_2$, $Sp(2n, F_2) := \{M : M^T \Omega_n M = \Omega_n\}$, where $M^T$ denotes the transpose matrix, $\Omega_n$ is the matrix $\begin{bmatrix} 0 & I_n \\ I_n & 0 \end{bmatrix}$, and $I_n$ is the $n\times n$ identity matrix. Symplectic matrices are equivalent to and alternatively known as tableaux [3]. The size of the symplectic group is $|Sp(2n, F_2)| = 2^{n^2} \prod_{j=1}^{n} (2^{2j} - 1)$, which for the purpose of this paper implies $|C_6| = 208{,}114{,}637{,}736{,}580{,}743{,}168{,}000 \approx 2.1\cdot10^{23}$ and assigns a numeric value to the size of the search space we are exploring. The tableau representation is particularly useful since it allows one to define quantum gates and circuits directly, without resorting to the standard definitions in quantum information that employ $2^n \times 2^n$ unitary matrices [1]. Indeed,
• the Hadamard gate h on qubit k can be defined as the $2n\times 2n$ identity matrix with swapped columns $k$ and $n+k$,
• the Phase gate p on qubit k can be defined as the addition of column $k$ to column $n+k$ in the $2n\times 2n$ identity matrix,
• the cnot gate with control qubit k and target j performs simultaneous addition of column $k$ to column $j$ and column $n+j$ to column $n+k$ in the $2n\times 2n$ identity matrix,
and circuits are matrix multiplications. The computational completeness of the {h, p, cnot} library is readily exposed by the ability to apply Gaussian elimination to obtain an arbitrary symplectic matrix as a product of gates. An additional advantage of such a definition of gates and circuits is that transformations by Clifford gates can be implemented efficiently by a computer program.
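The column-operation definitions of h, p, and cnot can be tried out directly. The following is a minimal Python sketch, not the C++ implementation reported in this paper; the function names and the choice to store each matrix column as an integer bitmask (with 0-indexed qubits) are assumptions made for illustration. It checks the condition $M^T \Omega_n M = \Omega_n$ and compares the size of the group generated by the three gates with the formula $2^{n^2}\prod_{j=1}^{n}(4^j-1)$ for n = 1, 2.

```python
from math import prod

def identity(n):
    # Column c is the basis vector e_c, stored as a bitmask (bit r = row r).
    return tuple(1 << c for c in range(2 * n))

def hadamard(n, k):
    # Identity with columns k and n+k swapped.
    m = list(identity(n))
    m[k], m[n + k] = m[n + k], m[k]
    return tuple(m)

def phase(n, k):
    # Identity with column k added to column n+k.
    m = list(identity(n))
    m[n + k] ^= m[k]
    return tuple(m)

def cnot(n, k, j):
    # Identity with column k added to column j and column n+j added to column n+k.
    m = list(identity(n))
    m[j] ^= m[k]
    m[n + k] ^= m[n + j]
    return tuple(m)

def mul(a, b):
    # Matrix product a*b over F_2: column c of the result is a applied to column c of b.
    out = []
    for col in b:
        acc, r = 0, 0
        while col:
            if col & 1:
                acc ^= a[r]
            col >>= 1
            r += 1
        out.append(acc)
    return tuple(out)

def is_symplectic(m, n):
    low = (1 << n) - 1
    def omega(u, v):
        # u^T Omega_n v over F_2 for column vectors u, v.
        return (bin((u & low) & (v >> n)).count("1")
                + bin((u >> n) & (v & low)).count("1")) & 1
    return all(omega(m[a], m[b]) == int(abs(a - b) == n)
               for a in range(2 * n) for b in range(2 * n))

def clifford_order(n):
    return 2 ** (n * n) * prod(4 ** j - 1 for j in range(1, n + 1))

def generated_group(n):
    # Closure of {h, p, cnot} under right multiplication, starting from the identity.
    gens = ([hadamard(n, k) for k in range(n)] + [phase(n, k) for k in range(n)]
            + [cnot(n, k, j) for k in range(n) for j in range(n) if k != j])
    seen, frontier = {identity(n)}, [identity(n)]
    while frontier:
        u = frontier.pop()
        for g in gens:
            v = mul(u, g)
            if v not in seen:
                seen.add(v)
                frontier.append(v)
    return seen

print(len(generated_group(2)), clifford_order(2), clifford_order(6))
```

Exhaustive closure is only practical for very small n; it is used here purely as a sanity check of the gate definitions.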
As a side note, we highlight that each element of the Clifford group $C_n$ defines an equivalence class of $2^n \times 2^n$ unitary matrices realizable by circuits over h, p, and cnot gates (defined, in turn, via unitary matrices [1]). A pair of unitary matrices is considered equivalent if they can be mapped to each other by left (or right) multiplication with single-qubit Pauli gates and overall phase factors. Since we focus on the minimization of the two-qubit gate count, Pauli gates and phase factors can be safely factored out. Had Pauli gates been included in the Clifford group, the search space size for n=6 would read $8.5\cdot10^{26}$.
Algorithm and its implementation

Overview
Our approach relies on pruned breadth-first search (BFS) to generate a number of databases containing Clifford unitaries that can be implemented by equal-cost optimal circuits, and augments it with a set of tools that extract useful statistics (e.g., distribution of the number of unitaries by entangling gate cost, average cost, largest cost) as well as individual optimal circuits. BFS is a strategy that relies on taking optimal implementations of cost up to k, modifying them by applying cost-1 transformations to cost-k elements, and recording the result as a cost-(k+1) element if it is not yet found in the database. BFS is initiated with the identity operator, costing zero, and ends when all elements in the target set have been explored. While our algorithm can be applied to obtain optimal 2-, 3-, 4-, 5-, and 6-qubit Clifford circuits using modern computers, we focus the rest of the description on the 6-qubit case, the most difficult one still amenable to classical computers.
Since the database we are synthesizing contains Clifford unitaries, the first order of business is to choose a suitable data structure to store them. The data structure must be both compact and allow quick application of gates, because BFS boils down to a series of gate applications and memory lookups. We start with the tableau, which is naturally suited for quick gate application, and modify it to remove the two last rows, corresponding to the X and Z stabilizers [3]. As described in Subsection 3.5, these rows can be quickly restored. Removing them reduces the storage from $4n^2|_{n=6} = 144$ bits to $2\cdot 2n(n-1)|_{n=6} = 120$ bits. Each unitary is thus stored across two 64-bit machine words (the halves corresponding to the X and Z parts), with 4 bits per machine word of (yet) unused space. While the information-theoretic minimum storage requirement, $\lceil \log_2(|C_6|) \rceil = 78$ bits, implies that more compact storage exists, BFS imposes the requirement of quick gate application, and we furthermore rely on canonicity (discussed in the next paragraph) to reduce the size of the database; thus, it is not obvious whether more efficient storage is possible.
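The storage figures quoted above can be re-derived with a few lines of arithmetic. This is only an illustrative Python calculation; the variable names are ours and it says nothing about how the bits are actually laid out inside the two machine words.

```python
from math import prod

n = 6
full_tableau_bits = (2 * n) ** 2              # 4 n^2 bits for the full tableau
stored_bits = 2 * 2 * n * (n - 1)             # bits left after dropping two rows
spare_bits = 2 * 64 - stored_bits             # unused bits in two 64-bit words
group_order = 2 ** (n * n) * prod(4 ** j - 1 for j in range(1, n + 1))
min_bits = (group_order - 1).bit_length()     # ceil(log2 |C_6|)

print(full_tableau_bits, stored_bits, spare_bits, min_bits)  # 144 120 8 78
```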
Should each Clifford element require storage, the search would not be possible to execute on modern computers since $|C_6| \approx 2\cdot10^{23}$. We therefore break the Clifford group elements into equivalence classes such that class members share the same optimal circuit structure, a canonical representative exists, and it is efficient to compute. In our approach, an equivalence class can be thought of as containing unitaries with optimal circuits equivalent up to left- and right-hand multiplication by single-qubit Clifford unitaries and qubit relabeling; the canonical representative is chosen to be the one with the least lexicographic order across all elements in its equivalence class. This means that we can pack up to $|C_1|^{2n}\cdot|S_n|\,\big|_{n=6} = 6^{12} \cdot 6! = 1{,}567{,}283{,}281{,}920$ unitaries into one class$^1$. Here, $|C_1|$ is the size of the single-qubit Clifford group $C_1$, raised to the power $2n$ to represent one-qubit operators on each qubit in the beginning and end of the circuit, and $S_n$ is the permutation group. However, the computation of the canonical representative must be efficient, as otherwise complexity moves from storage to computation. We found a Pareto-efficient definition of the equivalence class, as determined by ReduceU, the function computing the canonical representative, to be most practical. Our computationally defined canonical representative is at most a factor of 14 storage inefficient, but it allows a quick computation of the canonical representative, taking on average 0.000003 seconds (using an Intel Core i7-10700K processor). The computation of ReduceU turns out to be the runtime-level bottleneck of our implementation, since other operations that are applied with a comparable frequency (such as tableau restoration and gate application) are faster. Further details about ReduceU may be found in Subsection 3.4.

Figure 1: cnot gate equivalent entangling transformations that need to be applied to each of the $\frac{n(n-1)}{2}$ pairs of qubits of a Clifford group element implementable with k entangling gates to explore the possibility of expanding it into a Clifford group element requiring k+1 gates. It suffices to apply these gates to a pair of qubits in an arbitrary fixed order, since the application of a gate in the other order is enabled by some other gate among those listed. For instance, the cnot with flipped controls with respect to (a) is accomplished by (h), noting that the single-qubit gates on the right side do not matter due to the choice to work with equivalence classes.
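As a quick arithmetic check of the class-size bound, the short Python calculation below recomputes $6^{12}\cdot 6!$ and derives the resulting lower bound on the number of equivalence classes (the group order divided by the largest possible class size). The variable names are ours, and the lower bound is a consequence of the counting argument only, not a figure reported in the text.

```python
from math import factorial, prod

n = 6
max_class_size = 6 ** (2 * n) * factorial(n)   # |C_1|^(2n) * |S_n|
group_order = 2 ** (n * n) * prod(4 ** j - 1 for j in range(1, n + 1))
# Every class has at most max_class_size members, so at least this many classes exist.
min_classes = -(-group_order // max_class_size)  # ceiling division

print(max_class_size, min_classes)
```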
The restriction to equivalence classes helps not only to dramatically reduce the storage requirement, but also to minimize the number of cnot-equivalent transformations that we need to apply to a Clifford unitary requiring k gates to explore Clifford unitaries requiring k+1 entangling gates. Specifically, the number of transformations is only $9\cdot\frac{n(n-1)}{2}\big|_{n=6} = 135$, as illustrated in Fig. 1. The 15-part sorted database of canonical representatives of equal cost (one part per fixed gate count ranging from 1 to 15, with 15 turning out to be the maximum) is 2.1TB in size, and it took roughly 6 months to synthesize it on a small cluster of Intel® server-class machines. Since we made software updates as the search progressed, and improved the performance in doing so, we believe it may take about 2 months to rerun it from scratch. We store the database on an SSD (2+TB of RAM was expensive at the time of this writing).

Given the database, an optimal circuit for a given 6-qubit Clifford unitary $U$ may be found as follows: compute ReduceU($U$), find it in the part of the database containing size-k unitaries, apply each of the $9\cdot\frac{n(n-1)}{2}$ gates, compute the resulting canonical element and look it up in the size-(k−1) database; once found, repeat for k := k−1 until k=0. Our implementation of the above algorithm takes an average of 0.1 seconds to extract an optimal circuit. The bottleneck is the database search on the SSD, since the average number of times an element needs to be searched is at most $\frac{135}{2} = 67.5$, the databases for large k are large, and the search needs to make multiple queries that add up quickly given the SSD's limited access time. Instead, recall that 4+4 = 8 bits of the original data structure are unused, and note that 8 bits suffice to store the gate information, since $\lceil \log_2(135) \rceil = 8$. We thus augment the database by loading these 8 bits with the last gate information, which allows selecting the correct gate right away during the circuit restoration. This modification reduces the runtime by roughly a factor of 67.5. We further optimize the performance by storing an index with each 1024th element of the database in RAM. This allows finding an optimal circuit implementation of an arbitrary 6-qubit Clifford unitary in as little as 0.0009358 seconds on a MacBook Pro® (2.3 GHz Quad-Core Intel® Core i7-1068NG7 CPU, 16GB RAM) with a USB-C attached SSD (4TB VectoTech Rapid® 540MB/s 3D NAND Flash), and 0.0006274 seconds on a high-performance server (Quad Intel® Xeon E7-4850 v4 16-Core/2.1GHz, 6TB RAM). These performance figures were established by averaging the time to synthesize optimal circuits for 10,000 random uniformly distributed Clifford unitaries while relying on kernel-owned memory to cache files with the use of mmap and using a supplementary index for the laptop version of the search.
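The idea of keeping every 1024th database element in RAM can be illustrated with a two-level binary search. The Python sketch below is an assumption-laden stand-in: an in-memory sorted list plays the role of the sorted on-disk database part, plain integers play the role of the packed unitaries, and the names `build_index` and `lookup` are hypothetical. The small index narrows a query to one block of at most `stride` entries, so only that block has to be touched on the slow medium.

```python
from bisect import bisect_left, bisect_right

def build_index(sorted_keys, stride=1024):
    # Keep every stride-th key; this small list is what would live in RAM.
    return sorted_keys[::stride]

def lookup(sorted_keys, index, key, stride=1024):
    # Step 1: find the block whose first key is <= key, using only the index.
    block = bisect_right(index, key) - 1
    if block < 0:
        return -1                      # key is smaller than every stored key
    lo = block * stride
    hi = min(lo + stride, len(sorted_keys))
    # Step 2: binary search restricted to that single block.
    pos = bisect_left(sorted_keys, key, lo, hi)
    if pos < hi and sorted_keys[pos] == key:
        return pos
    return -1

keys = list(range(0, 30000, 3))        # 10,000 sorted stand-in keys
index = build_index(keys)
print(lookup(keys, index, 2997), lookup(keys, index, 2998))  # 999 -1
```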
In the following subsections we report further details of our implementation.

Database generation
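Let $C^k_n \subseteq C_n$ be the set of all Clifford group elements with the cnot cost $k$. Here $k = 0, 1, \ldots, k_{max}(n)$ for some a priori unknown maximum cost $k_{max}(n)$. For example, $C^0_n$ is the local subgroup generated by the single-qubit gates h and p. Define the equivalence class of an element $U \in C_n$ as
$$[U] := \{K W^{-1} U W L : K, L \in C^0_n,\ W \in S_n\}. \tag{1}$$
Here and below $S_n \subseteq C_n$ is the subgroup of qubit permutations. All elements of $[U]$ have the same cost, and the function ReduceU maps every element of $[U]$ to the same canonical representative of $[U]$. A specific implementation of the function ReduceU, which we defer to Subsection 3.4, does not matter at this point. Let $R^k_n$ be the set of all reduced cost-k Clifford group elements,
$$R^k_n := \{\mathrm{ReduceU}(U) : U \in C^k_n\}.$$
Our database consists of $k_{max}(n)+1$ parts, such that the k-th part contains all elements of $R^k_n$. The elements are furthermore stored in the lexicographic order to enable binary search.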
Let $I \in C_n$ be the identity matrix and $\mathrm{cnot}_{i,j}$ be the cnot gate with the control qubit i and the target qubit j. Since any cost-0 and cost-1 element is equivalent to $I$ and $\mathrm{cnot}_{1,2}$, respectively, we have $R^0_n = \{\mathrm{ReduceU}(I)\}$ and $R^1_n = \{\mathrm{ReduceU}(\mathrm{cnot}_{1,2})\}$. Suppose we have the sets $R^0_n, R^1_n, \ldots, R^{k-1}_n$ for some $k \ge 2$ (initially $k=2$). The rest of this section explains how to compute $R^k_n$. First, we need to choose a set of cost-1 generators that obey certain technical conditions. Let $m = 9n(n-1)/2$ and $G_1, G_2, \ldots, G_m \in C^1_n$ be the generators shown in Fig. 1. By definition, each generator has the form $a_i b_j\,\mathrm{cnot}_{i,j}$ for some pair of qubits $i<j$ and $a, b \in \{\mathrm{i}, \mathrm{ph}, \mathrm{hp}\}$. We will use the following properties of the generator set.
The proof is deferred to Appendix A. This lemma has the following simple corollaries.
Corollary 2. For any generator $G_a$ and $L \in C^0_n$ there exists a generator

We claim that the following algorithm outputs the set $R^k_n$.

and some generator $G_b$. Thus $U \in S$. We have proved that $R^k_n \subseteq S$. Conversely, suppose $U \in S$. Then $U$ is a reduced element obtained from some cost-$(k-1)$ element $V$ by adding a single generator, relabeling the qubits, and left/right multiplications by the single-qubit gates. Since adding a single generator can change the cost by at most one$^2$, we conclude that

We have proved that $S \subseteq R^k_n$. By sorting the elements of each set $R^\ell_n$ and using binary search to check set membership, the above algorithm requires $\tilde{O}(|R^{k-1}_n|\, m)$ calls to the function ReduceU, where the $\tilde{O}$ notation hides factors logarithmic in the sizes of $R^{k-2}_n$, $R^{k-1}_n$, and $R^k_n$. The database generation terminates as soon as $R^k_n = \emptyset$. This determines the maximum cost $k_{max}(n)$ as $k-1$.
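To make the level-by-level construction concrete, here is a toy Python sketch for n = 2. It is written under explicit assumptions and is not the algorithm listing of this paper: the canonical representative is found by brute-force enumeration of the whole equivalence class (a stand-in for ReduceU, with the tuple ordering of Python as the lexicographic order), a new class is compared against all earlier levels, the generators are taken as $a_i b_j\,\mathrm{cnot}_{i,j}$ with $a, b$ ranging over the identity and the two products of h and p, and all function names are ours.

```python
from itertools import permutations, product

def identity(n):
    return tuple(1 << c for c in range(2 * n))     # column c = basis vector e_c

def mul(a, b):
    # Product a*b over F_2; matrices are tuples of column bitmasks.
    out = []
    for col in b:
        acc, r = 0, 0
        while col:
            if col & 1:
                acc ^= a[r]
            col >>= 1
            r += 1
        out.append(acc)
    return tuple(out)

def hadamard(n, k):
    m = list(identity(n)); m[k], m[n + k] = m[n + k], m[k]; return tuple(m)

def phase(n, k):
    m = list(identity(n)); m[n + k] ^= m[k]; return tuple(m)

def cnot(n, k, j):
    m = list(identity(n)); m[j] ^= m[k]; m[n + k] ^= m[n + j]; return tuple(m)

def local_group(n):
    # Closure of the single-qubit gates h and p (6^n elements).
    gens = [g(n, k) for g in (hadamard, phase) for k in range(n)]
    seen, frontier = {identity(n)}, [identity(n)]
    while frontier:
        u = frontier.pop()
        for g in gens:
            v = mul(u, g)
            if v not in seen:
                seen.add(v)
                frontier.append(v)
    return sorted(seen)

def qubit_permutations(n):
    # Pairs (W, W^-1) with W e_j = e_w(j) and W e_(n+j) = e_(n+w(j)).
    def mat(p):
        return tuple(1 << p[j] for j in range(n)) + tuple(1 << (n + p[j]) for j in range(n))
    pairs = []
    for w in permutations(range(n)):
        inv = sorted(range(n), key=lambda j: w[j])
        pairs.append((mat(w), mat(inv)))
    return pairs

def equivalence_class(u, local, perms):
    # All K W^-1 U W L with K, L in the local subgroup and W a qubit permutation.
    out = set()
    for w, w_inv in perms:
        c = mul(w_inv, mul(u, w))
        for k in local:
            kc = mul(k, c)
            for l in local:
                out.add(mul(kc, l))
    return out

def generators(n):
    gens = []
    for i in range(n):
        for j in range(i + 1, n):
            def rot(q):
                return [identity(n), mul(phase(n, q), hadamard(n, q)),
                        mul(hadamard(n, q), phase(n, q))]
            for a, b in product(rot(i), rot(j)):
                gens.append(mul(mul(a, b), cnot(n, i, j)))
    return gens

def build_levels(n):
    local, perms, gens = local_group(n), qubit_permutations(n), generators(n)
    reduce_u = lambda u: min(equivalence_class(u, local, perms))  # brute-force stand-in
    levels = [{reduce_u(identity(n))}]
    seen = set(levels[0])
    while levels[-1]:
        new = set()
        for v in levels[-1]:
            for g in gens:
                r = reduce_u(mul(v, g))
                if r not in seen:
                    seen.add(r)
                    new.add(r)
        levels.append(new)
    levels.pop()                                   # the last level is empty
    sizes = [sum(len(equivalence_class(v, local, perms)) for v in lv) for lv in levels]
    return levels, sizes
```

Calling `build_levels(2)` gives one class for each cost 0, 1, 2, and 3, containing 36, 324, 324, and 36 group elements, which add up to $|C_2| = 720$. The brute-force reduction enumerates $6^{2n}\,n!$ products per call and is therefore only usable for tiny n.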
As discussed in Section 3, the generation of the 6-qubit database spans a few CPU months and involves manipulations with terabytes of data. How can we be confident that this computation is error-free? Our correctness tests included the verification that the size of the Clifford group inferred from the database agrees with the analytic formula $|C_n| = 2^{n^2}\prod_{j=1}^{n}(4^j-1)$. In more detail, the number of cost-k Clifford group elements can be inferred from the identity
$$|C^k_n| = \sum_{U \in R^k_n} |[U]|, \tag{2}$$
where $|[U]|$ is the size of the equivalence class $[U]$ that contains $U$, see Eq. (1). Furthermore,
$$|[U]| = \frac{|C^0_n|^2 \cdot |S_n|}{|\mathrm{Aut}(U)|}, \tag{3}$$
where $\mathrm{Aut}(U)$ is the automorphism group of $U$ that consists of all triples $K\times L\times W \in C^0_n\times C^0_n\times S_n$ such that $U = K W^{-1} U W L$. We have checked that the counts $|C^k_n|$ inferred from Eqs. (2,3) indeed obey
$$\sum_{k=0}^{k_{max}(n)} |C^k_n| = |C_n|.$$
Thus our database passed the self-consistency test. Table 1 and Table 2, displaying the counts $|R^k_n|$ and $|C^k_n|$, can be found in Section 4.

In order to speed up the synthesis of optimal circuits, we augmented each database entry $U \in R^k_n$ with 8 auxiliary bits specifying a generator $G_b$ that reduces the cost of $U$ by one, such that $U G_b \in C^{k-1}_n$. Here we assume $k \ge 1$. Such a cost-reducing generator $G_b$ exists for any $U \in C^k_n$ with $k \ge 1$. To augment a given element $U$ of the cost-k database $R^k_n$ we find the first cost-reducing generator $G_b$ such that $\mathrm{ReduceU}(U G_b) \in R^{k-1}_n$. This requires at most $m$ calls to ReduceU and binary searches in $R^{k-1}_n$ (computing the group multiplication takes negligible time). Once a cost-reducing generator $G_b$ is found, its index $b$ is recorded in the database using the unused bits of $U$. The augmentation step is applied to all $U \in R^k_n$ and for all $k = 1, 2, \ldots, k_{max}(n)$.
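The relation between the size of an equivalence class and the number of automorphism triples can be checked by brute force for n = 2. The Python sketch below is illustrative only (bitmask columns, exhaustive enumeration, and all function names are our assumptions); it counts, for a given $U$, the distinct elements $K W^{-1} U W L$ and the triples that map $U$ to itself, and their product should equal $|C^0_n|^2\cdot|S_n|$.

```python
from itertools import permutations

def identity(n):
    return tuple(1 << c for c in range(2 * n))     # column c = basis vector e_c

def mul(a, b):
    # Product a*b over F_2; matrices are tuples of column bitmasks.
    out = []
    for col in b:
        acc, r = 0, 0
        while col:
            if col & 1:
                acc ^= a[r]
            col >>= 1
            r += 1
        out.append(acc)
    return tuple(out)

def hadamard(n, k):
    m = list(identity(n)); m[k], m[n + k] = m[n + k], m[k]; return tuple(m)

def phase(n, k):
    m = list(identity(n)); m[n + k] ^= m[k]; return tuple(m)

def cnot(n, k, j):
    m = list(identity(n)); m[j] ^= m[k]; m[n + k] ^= m[n + j]; return tuple(m)

def local_group(n):
    gens = [g(n, k) for g in (hadamard, phase) for k in range(n)]
    seen, frontier = {identity(n)}, [identity(n)]
    while frontier:
        u = frontier.pop()
        for g in gens:
            v = mul(u, g)
            if v not in seen:
                seen.add(v)
                frontier.append(v)
    return sorted(seen)

def qubit_permutations(n):
    def mat(p):
        return tuple(1 << p[j] for j in range(n)) + tuple(1 << (n + p[j]) for j in range(n))
    pairs = []
    for w in permutations(range(n)):
        inv = sorted(range(n), key=lambda j: w[j])  # inverse permutation
        pairs.append((mat(w), mat(inv)))
    return pairs

def class_and_automorphisms(u, n):
    # Returns (|[U]|, |Aut(U)|, number of triples (K, L, W)).
    local, perms = local_group(n), qubit_permutations(n)
    orbit, aut = set(), 0
    for w, w_inv in perms:
        c = mul(w_inv, mul(u, w))
        for k in local:
            kc = mul(k, c)
            for l in local:
                v = mul(kc, l)
                orbit.add(v)
                aut += (v == u)
    return len(orbit), aut, len(local) ** 2 * len(perms)

print(class_and_automorphisms(cnot(2, 0, 1), 2))   # (324, 8, 2592)
```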

Synthesis of optimal circuits
The optimal compiler takes as input an element of the Clifford group $U \in C_n$ and outputs a Clifford circuit (a list of the primitive gates h, p, and cnot) implementing $U$ with the smallest possible cnot gate count, equal to the cost of $U$. The cost can be computed by making a single call to ReduceU and performing at most $k_{max}(n)$ database searches. Below we assume that the database is augmented with the cost-reducing generators, as discussed in Subsection 3.2. Thus the database search returns the cost $k$, the element $V$ such that $V = \mathrm{ReduceU}(U) \in R^k_n$, and a cost-reducing generator $G_a$ such that $V G_a \in C^{k-1}_n$. The next step is to convert $G_a$ into a cost-reducing generator for $U$. To this end, write $V = K W^{-1} U W L$ for some $K, L \in C^0_n$ and some qubit permutation $W$. The group elements $K$, $L$, and $W$ that transform $U$ into the reduced form are readily available by adding appropriate bookkeeping steps to the implementation of ReduceU described in Subsection 3.4. At this point we have
$$K W^{-1} U W L G_a \in C^{k-1}_n.$$
Commute $G_a$ through $W L$ next to $U$ using Corollary 1. This gives
$$K W^{-1} U G_b M W \in C^{k-1}_n$$
for some generator $G_b$ and some $M \in C^0_n$. The generator $G_b$ can be computed in time $O(1)$ using the standard commutation rules of the Clifford group. Since the cost is invariant under multiplication by local gates and qubit relabeling, $U G_b \in C^{k-1}_n$, that is, $G_b$ is a cost-reducing generator for $U$. Replacing $U$ by $U G_b$ and applying the above step recursively, one constructs a k-tuple of generators $G_{b_1}, G_{b_2}, \ldots, G_{b_k}$ such that $M := U G_{b_1} G_{b_2} \cdots G_{b_k} \in C^0_n$ is a product of single-qubit gates. This gives
$$U^{-1} = G_{b_1} G_{b_2} \cdots G_{b_k} M^{-1}.$$
Decomposing each generator and $M^{-1}$ into a product of the primitive gates h, p, and cnot gives an optimal circuit implementing $U^{-1}$. Since all primitive gates are self-inverse, an optimal circuit implementing $U$ is obtained simply by reversing the order of gates. If needed, the number of single-qubit gates in the compiled circuit can be optimized by commuting single-qubit gates to the last time step (whenever possible) and merging them using optimal lookup of $C_1$ elements.
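The role of the stored cost-reducing gate can be illustrated on the full 2-qubit group, which is small enough to tabulate without equivalence classes. The Python sketch below is a deliberately simplified stand-in for the compiler described above: it runs a 0-1 breadth-first search over all 720 elements (single-qubit gates cost 0, cnot costs 1), remembers for every element the last gate on a cheapest path, and then reads off a circuit by repeatedly undoing that gate. It does not use ReduceU or the commutation step, and the names `build_table` and `optimal_circuit` are hypothetical.

```python
from collections import deque

def identity(n):
    return tuple(1 << c for c in range(2 * n))     # column c = basis vector e_c

def mul(a, b):
    # Product a*b over F_2; matrices are tuples of column bitmasks.
    out = []
    for col in b:
        acc, r = 0, 0
        while col:
            if col & 1:
                acc ^= a[r]
            col >>= 1
            r += 1
        out.append(acc)
    return tuple(out)

def hadamard(n, k):
    m = list(identity(n)); m[k], m[n + k] = m[n + k], m[k]; return tuple(m)

def phase(n, k):
    m = list(identity(n)); m[n + k] ^= m[k]; return tuple(m)

def cnot(n, k, j):
    m = list(identity(n)); m[j] ^= m[k]; m[n + k] ^= m[n + j]; return tuple(m)

def build_table(n):
    gates = {}
    for k in range(n):
        gates[("h", k)] = hadamard(n, k)
        gates[("p", k)] = phase(n, k)
        for j in range(n):
            if j != k:
                gates[("cnot", k, j)] = cnot(n, k, j)
    start = identity(n)
    cost, last = {start: 0}, {start: None}
    queue = deque([start])
    while queue:                                   # 0-1 BFS over right multiplications
        u = queue.popleft()
        for name, g in gates.items():
            w = 1 if name[0] == "cnot" else 0
            v = mul(u, g)
            if cost[u] + w < cost.get(v, float("inf")):
                cost[v], last[v] = cost[u] + w, name   # remember the last gate
                if w:
                    queue.append(v)
                else:
                    queue.appendleft(v)
    return gates, cost, last

def optimal_circuit(u, gates, last):
    names = []
    while last[u] is not None:
        names.append(last[u])
        u = mul(u, gates[last[u]])                 # each gate is its own inverse here
    return names[::-1]

def apply_circuit(n, names, gates):
    u = identity(n)
    for name in names:
        u = mul(u, gates[name])
    return u

gates, cost, last = build_table(2)
swap = (2, 1, 8, 4)                                # columns e_2, e_1, e_4, e_3
print(cost[swap], optimal_circuit(swap, gates, last))
```

Storing one gate label per element turns circuit extraction into a walk of table lookups, which is the same idea as loading the 8 spare bits with the last gate information.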

Computation of ReduceU
In this section we introduce reduced forms of Clifford group elements and give algorithms for computing these forms. A given matrix $U \in C_n$ is transformed into a reduced form by applying a sequence of elementary reductions from the following list:
1. Multiplication of $U$ on the left by single-qubit Clifford gates.
2. Multiplication of $U$ on the right by single-qubit Clifford gates.
3. Relabeling of qubits.
Depending on which types of reductions are considered, there are three different reduced forms: a left-reduced form (reductions of type 1 only), a locally reduced form (reductions of types 1 and 2), and a fully reduced form (reductions of types 1, 2, and 3). Each form comes with an algorithm specifying the sequence of reductions to be applied. We define the reduced forms inductively, starting from the left-reduced form. The function ReduceU used in Subsection 3.2 and Subsection 3.3 computes the fully reduced form.
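We begin by defining convenient notations. Let $e_1, e_2, \ldots, e_{2n} \in F_2^{2n}$ be the standard basis of $F_2^{2n}$: the basis vector $e_j$ has a single non-zero entry at the j-th position. We consider $e_j$ as column vectors. Let $e^j := (e_j)^T$ be the corresponding row vector. For example, if n=1 then $e_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$, $e_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$, $e^1 = \begin{bmatrix} 1 & 0 \end{bmatrix}$, and $e^2 = \begin{bmatrix} 0 & 1 \end{bmatrix}$.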
We write $u \oplus v$ to denote the addition of binary vectors $u$ and $v$ modulo 2. Elements of the Clifford group $U \in C_n$ are treated as binary symplectic matrices of size $2n\times 2n$, see Section 2. The j-th column and the j-th row of a matrix $U$ are $U e_j$ and $e^j U$, respectively. Recall that $C^0_n \subseteq C_n$ is the local subgroup generated by the single-qubit gates (h and p). Define a subgroup $C_{n,j} \subseteq C^0_n$ generated by the single-qubit gates acting on the j-th qubit, where $j = 1, 2, \ldots, n$.
Equivalently, $U \in C_{n,j}$ iff $U e_i = e_i$ for all $i \notin \{j, n+j\}$, whereas $U e_j = a e_j \oplus b e_{n+j}$ and $U e_{n+j} = c e_j \oplus d e_{n+j}$ for some coefficients $a, b, c, d \in F_2$ forming an invertible $2\times 2$ matrix. Note that the subgroups $C_{n,j}$ pairwise commute. A matrix $U \in C_n$ is said to be left-reduced if
$$e^j U < e^{n+j} U < (e^j \oplus e^{n+j}) U \quad \text{for all } j = 1, 2, \ldots, n. \tag{4}$$
Here and below the bit strings are compared using the lexicographic order (i.e., 00 < 01 < 10 < 11 in the case n=1). The following lemma shows that left-reduced elements of $C_n$ can serve as canonical representatives of cosets $C^0_n U$. In other words, $C_n$ is a disjoint union of cosets $C^0_n U$ and each coset contains a unique left-reduced element, which can be efficiently computed. We refer to the unique left-reduced element of a coset $C^0_n U$ as the left-reduced form of $U$ and denote it leftReduce($U$). Our symplectic matrix data structure described in Subsection 3.5 enables the computation of leftReduce($U$) for a randomly picked matrix $U \in C_n$ in time less than $2\cdot10^{-8}$ seconds for any $n \le 6$ on a server-class CPU, in this case an Intel® Xeon® CPU E7-4850 v4 @ 2.10GHz.

Lemma 2. Each coset $C^0_n U$ with $U \in C_n$ contains a unique left-reduced element that can be computed in time $O(n^2)$, given the symplectic matrix representation of $U$.
Proof. First note that the rows of a symplectic matrix are linearly independent. Thus for each qubit j the bit strings $x_j := e^j U$, $z_j := e^{n+j} U$, and $y_j := (e^j \oplus e^{n+j}) U$ are all distinct: $x_j \ne y_j$, $y_j \ne z_j$, and $x_j \ne z_j$. It follows directly from the above definitions that by multiplying $U$ on the left by the elements of the subgroup $C_{n,j}$ we can implement any permutation of the bit strings $x_j$, $y_j$, and $z_j$. For example, the Hadamard gate swaps $x_j$ and $z_j$, and the Phase gate swaps $x_j$ and $y_j$. Since $|C_{n,j}| = 6$, there is a one-to-one correspondence between elements of $C_{n,j}$ and permutations of $x_j$, $y_j$, $z_j$. Multiply $U$ on the left by the unique element of $C_{n,j}$ that permutes the bit strings such that $x_j < z_j < y_j$. Now Eq. (4) is satisfied for the j-th qubit. Repeating this for all n qubits and noting that $C^0_n$ is generated by the subgroups $C_{n,j}$ proves that the coset $C^0_n U$ contains a unique left-reduced element. All the above steps can be efficiently implemented. Indeed, given a matrix $U$, one can compute the bit strings $x_j$, $y_j$, and $z_j$ and sort all three in time $O(n)$. Repeating this for all n qubits gives the total runtime of $O(n^2)$.
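The sorting argument in the proof can be written down in a few lines. The following Python sketch is an illustration under our own conventions (a matrix is a list of rows, each row a tuple of bits compared lexicographically, and qubits are 0-indexed); it is not the bit-packed routine of Subsection 3.5. For each qubit it sorts the three strings $x_j$, $z_j$, $y_j = x_j \oplus z_j$ and writes the two smallest back into rows $j$ and $n+j$, so that $x_j < z_j < y_j$ holds afterwards.

```python
def xor(u, v):
    return tuple(a ^ b for a, b in zip(u, v))

def left_reduce(rows):
    # rows: list of 2n tuples of bits (the rows of a symplectic matrix).
    n = len(rows) // 2
    out = list(rows)
    for j in range(n):
        x, z = out[j], out[n + j]
        smallest, middle, _largest = sorted([x, z, xor(x, z)])
        # Any two of the three strings XOR to the third, so storing the two
        # smallest as rows j and n+j makes the largest one their sum.
        out[j], out[n + j] = smallest, middle
    return out

# Rows of the 2-qubit cnot matrix (control 0, target 1) in the column-operation convention.
cnot_rows = [(1, 1, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 1, 1)]
print(left_reduce(cnot_rows))
# [(0, 0, 1, 0), (0, 0, 1, 1), (1, 1, 0, 0), (0, 1, 0, 0)]
```

Left multiplication by a single-qubit gate on qubit j only recombines rows $j$ and $n+j$, so the output does not change if, for example, those two rows are swapped or one is added to the other beforehand.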
Given a matrix $U \in C_n$ define a double coset
$$[U]_{loc} := C^0_n U C^0_n = \{L U R : L, R \in C^0_n\}.$$
It includes all elements of the Clifford group obtained from $U$ by adding single-qubit Clifford gates on the left and on the right. Clearly, the full Clifford group $C_n$ is a disjoint union of double cosets $[U]_{loc}$ and the cost of the matrix $U$ depends only on the double coset that contains $U$. The next step is to choose an efficiently computable canonical representative of each double coset. First define the map $\chi : F_2^{2n} \to F_2^n$ by
$$\chi(v) := (v_1 \vee v_{n+1},\ v_2 \vee v_{n+2},\ \ldots,\ v_n \vee v_{2n}),$$
where $\vee$ stands for the logical OR operation. The j-th component of $\chi(v)$ is non-zero iff $v_j=1$ or $v_{n+j}=1$ (the bit string $\chi(v)$ can be interpreted as the support of an n-qubit Pauli operator parameterized by $v$, according to the standard binary parameterization of Pauli operators [3]). We claim that the map $\chi$ is invariant under left multiplications by the elements of the local subgroup, in the sense that
$$\chi(Lv) = \chi(v) \quad \text{for all } L \in C^0_n \text{ and } v \in F_2^{2n}. \tag{5}$$
Indeed, it suffices to check Eq. (5) for the special case $L \in C_{n,j}$ (since the local subgroup is generated by matrices $L \in C_{n,j}$ with $j = 1, 2, \ldots, n$). As discussed above, the action of $L \in C_{n,j}$ on $v$ is equivalent to applying a $2\times 2$ binary invertible matrix to the components $v_j$ and $v_{n+j}$ while all other components of $v$ remain unchanged. Since an invertible matrix maps nonzero vectors to nonzero vectors, $(Lv)_j \vee (Lv)_{n+j} = 1$ iff $v_j \vee v_{n+j} = 1$. This implies Eq. (5).
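The invariance of $\chi$ under the local subgroup is easy to confirm exhaustively for small n. The Python sketch below is illustrative; vectors are tuples of bits, qubits are 0-indexed, and the helper `apply_local`, which applies an invertible $2\times 2$ binary matrix to the components $v_j$ and $v_{n+j}$, is a hypothetical name introduced for this check.

```python
from itertools import product

def chi(v):
    # Component j is v_j OR v_(n+j): the support of the Pauli operator encoded by v.
    n = len(v) // 2
    return tuple(v[j] | v[n + j] for j in range(n))

def apply_local(v, j, a, b, c, d):
    # Acts as U e_j = a e_j + b e_(n+j), U e_(n+j) = c e_j + d e_(n+j), identity elsewhere.
    n = len(v) // 2
    out = list(v)
    out[j] = (a & v[j]) ^ (c & v[n + j])
    out[n + j] = (b & v[j]) ^ (d & v[n + j])
    return tuple(out)

n = 2
invertible = [m for m in product((0, 1), repeat=4) if (m[0] & m[3]) ^ (m[1] & m[2])]
invariant = all(chi(apply_local(v, j, *m)) == chi(v)
                for v in product((0, 1), repeat=2 * n)
                for j in range(n) for m in invertible)
print(len(invertible), invariant, chi((1, 0, 0, 1)))  # 6 True (1, 1)
```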
A matrix $U \in C_n$ is said to be locally ordered if $U$ is left-reduced and
$$\chi(U e_j) \le \chi(U e_{n+j}) \le \chi(U e_j \oplus U e_{n+j}) \quad \text{for all } j = 1, 2, \ldots, n. \tag{6}$$
Here bit strings are compared using the lexicographic order. Let $L(U) \subseteq [U]_{loc}$ be the set of all locally ordered elements of the double coset $[U]_{loc}$. Define a locally reduced form of the matrix $U \in C_n$, denoted localReduce($U$), as the lexicographically smallest element of the set $L(U)$. The following lemma shows that locally reduced elements of $C_n$ can serve as canonical representatives of the double cosets $[U]_{loc}$. In other words, $C_n$ is a disjoint union of the double cosets $[U]_{loc}$ and each double coset contains a unique locally reduced element that can be efficiently computed (albeit slightly less efficiently than leftReduce). The symplectic matrix data structure described in Subsection 3.5 enables the computation of localReduce($U$) for a randomly picked matrix $U \in C_n$ in time less than $4\cdot10^{-7}$ seconds for all $n \le 6$ on a server-class CPU, in this case an Intel® Xeon® CPU E7-4850 v4 @ 2.10GHz.

Lemma 3. Each double coset $C^0_n U C^0_n$ with $U \in C_n$ contains a unique locally reduced element that can be computed in time $O(n^2 6^n)$, given the symplectic matrix $U$.
Proof. For each qubit j define the bit strings $x_j := \chi(U e_j)$, $z_j := \chi(U e_{n+j})$, and $y_j := \chi(U e_j \oplus U e_{n+j})$. Same as before, by multiplying $U$ on the right by the elements of the subgroup $C_{n,j}$ one can implement any permutation of the bit strings $x_j$, $y_j$, and $z_j$. Define a subset $S_j \subseteq C_{n,j}$ as the one including all elements $R_j \in C_{n,j}$ such that the right multiplication $U \leftarrow U R_j$ permutes the bit strings $x_j$, $y_j$, and $z_j$ into the non-decreasing order $x_j \le z_j \le y_j$. Note that $S_j$ is non-empty since the right multiplication by the elements of $C_{n,j}$ can implement any permutation of $x_j$, $y_j$, and $z_j$. Recall that the set $L(U)$ includes all locally ordered elements of the double coset $[U]_{loc}$. We claim that
$$L(U) = \{\mathrm{leftReduce}(U R_1 R_2 \cdots R_n) : R_j \in S_j \text{ for } j = 1, 2, \ldots, n\}. \tag{7}$$
Indeed, the right-hand side of Eq. (7) is contained in $[U]_{loc}$ since any of its elements $W$ has the form $W = LUR$ for some $L, R \in C^0_n$. Furthermore, it is non-empty since each subset $S_j$ is non-empty. Let us check that any element $W$ of the right-hand side of Eq. (7) is locally ordered. Indeed, pick any matrices $R_j \in S_j$ and let $R = R_1 R_2 \cdots R_n$. By construction, the matrix $V = UR$ satisfies Eq. (6) with $U$ replaced by $V$. Let $W = \mathrm{leftReduce}(V)$. Then $W = LV$ for some $L \in C^0_n$. The invariance of the map $\chi$ under left multiplications by the elements of the local subgroup, see Eq. (5), implies that $W$ satisfies Eq. (6) with $U$ replaced by $W$. Thus $W$ is locally ordered. Conversely, suppose $W \in [U]_{loc}$ is locally ordered. Then $W = LUR$ for some $L, R \in C^0_n$ and $\mathrm{leftReduce}(W) = W$. The invariance of the map $\chi$ under left multiplications by the elements of the local subgroup and the local ordering condition imply that the matrix $V = UR$ satisfies Eq. (6) with $U$ replaced by $V$. Thus $R = R_1 R_2 \cdots R_n$ for some $R_j \in S_j$. This proves that $W \in L(U)$. The uniqueness follows from the ability to encode the elements of the sets considered by distinct integers and the existence of the smallest integer in any finite set of integers.
It remains to check that the set $L(U)$ can be computed in time $O(n^2 6^n)$. Indeed, for any given qubit j one can compute the bit strings $x_j$, $y_j$, and $z_j$ and the subset $S_j \subseteq C_{n,j}$ in time $O(n)$. Note that $|S_j| \le |C_{n,j}| = 6$, so Eq. (7) involves at most $6^n$ products $R_1 R_2 \cdots R_n$, and each call to leftReduce takes time $O(n^2)$ by Lemma 2.

Comment 1: Our implementation of localReduce($U$) relies on a streamlined version of the above algorithm with a modified definition of the subsets $S_j$. Namely, we define $S_j$ as the set of all elements $R_j \in C_{n,j}$ such that the right multiplication $U \leftarrow U R_j$ permutes the bit strings $x_j$, $y_j$, and $z_j$ into the non-decreasing order and $\mathrm{leftReduce}(U R_j) = \mathrm{leftReduce}(U)$. The last condition rules out the possibility that the right multiplication of $U$ by $R_j$ is equivalent to a left multiplication of $U$ by some element of the local subgroup (for example, this is the case if $U$ is the identity matrix). Since leftReduce($U$) depends only on the coset $C^0_n U$, the left multiplication of $U$ by any element of the local subgroup does not change leftReduce($U$). Thus the set of locally ordered elements $L(U)$ can be computed using Eq. (7) with the modified definition of $S_j$.
Comment 2: We empirically observed that the average-case runtime of the above algorithm is much better than the worst-case upper bound of $O(n^2 6^n)$. Indeed, a direct inspection shows that the runtime scales as $O(n^2 M)$, where $M = \prod_{j=1}^{n} |S_j|$. For randomly picked matrices $U \in C_6$ we observed that $M \approx 5$ on average even though $M = |C^0_6| = 6^6 = 46{,}656$ in the worst case. We leave it as an open question whether the average-case runtime of the above algorithm scales polynomially with n.
Recall that we consider the symmetric group $S_n$ that includes all qubit permutations as a subgroup of $C_n$. If $w$ is a permutation of integers $\{1, 2, \ldots, n\}$, then the corresponding symplectic matrix $W \in S_n$ acts on the basis vectors as $W e_j = e_{w(j)}$ and $W e_{n+j} = e_{n+w(j)}$ for all $j = 1, 2, \ldots, n$. Given a matrix $U \in C_n$, define the equivalence class
$$[U] := \{K W^{-1} U W L : K, L \in C^0_n,\ W \in S_n\}.$$
The rest of this section is devoted to choosing an efficiently computable canonical representative of each class $[U]$. Let $Z^{n\times n}$ be the set of $n\times n$ matrices with integer entries. Define the map $\kappa : C_n \to Z^{n\times n}$ such that the matrix element of $\kappa(U)$ located at the i-th row and the j-th column is the rank of the $2\times 2$ submatrix of $U$ formed by the intersection of rows $i$ and $i+n$ and columns $j$ and $j+n$. The rank is computed over the binary field $F_2$. In other words, each matrix element of $\kappa(U)$ has the form
$$\kappa(U)_{i,j} = \mathrm{rank} \begin{bmatrix} U_{i,j} & U_{i,j+n} \\ U_{i+n,j} & U_{i+n,j+n} \end{bmatrix}.$$
By definition, κ(U) contains entries from the set {0, 1, 2} and the full matrix κ(U) can be computed in time O(n^2). We claim that the left and right multiplications of U by the single-qubit Clifford gates leave κ(U) invariant, that is,

\[ \kappa(L U R) = \kappa(U) \quad \text{for all } L, R \in C_n^0. \qquad (8) \]

Indeed, suppose first that L = I and R ∈ C_{n,j}. Right multiplication U ← U R applies an invertible linear transformation to the pair of columns U e_j and U e_{n+j}, and acts trivially on the remaining columns.
Since the matrix rank is invariant under applying an invertible linear transformation, we conclude that κ(U R) = κ(U) for all R ∈ C_{n,j}. The same argument shows that κ(L U) = κ(U) for all L ∈ C_{n,j}. This proves Eq. (8) since the local subgroup C_n^0 is generated by the subgroups C_{n,j}.

Let κ_min(U) be the lexicographically smallest matrix in the set of matrices {κ(W^{-1} U W) : W ∈ S_n}. Define a set of qubit permutations S(U) = {W ∈ S_n : κ(W^{-1} U W) = κ_min(U)} and a set of matrices R(U) = {localReduce(W^{-1} U W) : W ∈ S(U)}. Define a fully reduced form of a matrix U ∈ C_n, denoted ReduceU(U), as the lexicographically smallest element of the set R(U). The following lemma shows that the fully reduced elements of C_n can serve as canonical representatives of the equivalence classes [U]. In other words, C_n is a disjoint union of the equivalence classes [U] and each class contains a unique fully reduced element that can be efficiently computed (albeit slightly less efficiently than localReduce).

The symplectic matrix data structure enables the computation of ReduceU(U) for a randomly picked matrix U ∈ C_n in less than 3·10^{-6} seconds for n = 6 and less than 10^{-6} seconds for all n ≤ 5 on a server-class CPU, in this case an Intel® Xeon® CPU E7-4850 v4 @ 2.10GHz.
Indeed, this equation implies ReduceU(U) = ReduceU(U′) for all U′ ∈ [U], that is, the equivalence class [U] contains a unique reduced element. Let us prove Eq. (9). Write

Here

In the third equality we noted that localReduce is invariant under left/right multiplications by the elements of the local subgroup C_n^0, see Lemma 3. Finally, the invariance of the map κ under the left and right multiplications by the elements of the local subgroup, see Eq. (8), implies κ_min(U′) = κ_min(U). Thus W ∈ S(U′) iff W W ∈ S(U). Combining this and Eq. (10) gives R(U′) = R(U), as claimed.
The runtime stated in the lemma consists of two terms. The term O(n^2·n!) is the time needed to compute the set of permutations S(U). The term O(t_n·|S(U)|) is the time needed to compute the set of matrices R(U) and pick the lexicographically smallest element of R(U).
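The O(n^2·n!) term corresponds to an exhaustive search over qubit permutations, which can be sketched as follows. This is a hedged illustration, not the authors' code. It assumes that conjugation by a qubit permutation acts on κ(U) by permuting its rows and columns simultaneously, κ(W^{-1} U W)_{i,j} = κ(U)_{w(i),w(j)}, and that matrices are compared lexicographically in row-major order. The names `permuteBoth`, `KappaMinResult`, and `kappaMin` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

using IntMatrix = std::vector<std::vector<int>>;

// Apply the same permutation w to rows and columns: P[i][j] = K[w[i]][w[j]].
IntMatrix permuteBoth(const IntMatrix& K, const std::vector<int>& w) {
    const int n = static_cast<int>(K.size());
    IntMatrix P(n, std::vector<int>(n));
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            P[i][j] = K[w[i]][w[j]];
    return P;
}

struct KappaMinResult {
    IntMatrix kappaMin;                        // lexicographically smallest matrix
    std::vector<std::vector<int>> minimizers;  // all permutations attaining it
};

// Exhaustive search over all n! permutations, O(n^2) work per permutation.
KappaMinResult kappaMin(const IntMatrix& K) {
    const int n = static_cast<int>(K.size());
    std::vector<int> w(n);
    std::iota(w.begin(), w.end(), 0);          // start from the identity permutation
    KappaMinResult res;
    do {
        IntMatrix P = permuteBoth(K, w);
        if (res.minimizers.empty() || P < res.kappaMin) {
            res.kappaMin = P;                  // strictly smaller: restart the list
            res.minimizers.assign(1, w);
        } else if (P == res.kappaMin) {
            res.minimizers.push_back(w);       // tie: another minimizing permutation
        }
    } while (std::next_permutation(w.begin(), w.end()));
    return res;
}
```

The list `minimizers` plays the role of S(U) before the extra leftReduce condition of the streamlined implementation is applied. For n = 6 the loop visits 720 permutations.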
Comment 3: Our implementation of ReduceU(U) relies on a streamlined version of the above algorithm with a modified definition of the set S(U). Namely, we define S(U) as the set of all permutations W ∈ S_n such that κ(W^{-1} U W) = κ_min(U) and leftReduce(W^{-1} U W) = leftReduce(U). The last condition rules out the possibility that the conjugation of U by W is equivalent to a left multiplication of U by some element of the local subgroup (for example, this is the case if U is the identity matrix). Since localReduce(U) depends only on the double coset C_n^0 U C_n^0, a left multiplication of U by any element of the local subgroup does not change localReduce(U). Thus one can compute the set R(U) using the modified definition of S(U).

By a slight abuse of terminology, we refer to the computationally defined fully reduced elements of the Clifford group as the reduced elements in the remainder of the paper. This should not lead to confusion since the left-reduced and the locally reduced forms are used only in this subsection.

Data structure
By definition, any element of the Clifford group U ∈ C_n can be represented by a binary matrix of size 2n×2n. However, if we only care about the reduced form of U, a slightly more efficient representation is possible, as given by the following lemma.
Lemma 5. Let U′ be the matrix obtained from U ∈ C_n by removing the n-th and the 2n-th rows from it. Then U is uniquely determined by U′ up to left multiplication by the single-qubit Clifford gates acting on the n-th qubit.
Proof. Let L ⊆ F_2^{2n} be the linear subspace spanned by the rows j of U with j ∉ {n, 2n}, and let L^⊥ ⊆ F_2^{2n} be the linear subspace of vectors orthogonal to L with respect to the symplectic inner product. Note that L depends only on U′. The condition that U is a symplectic matrix implies span_{F_2}(e_n U, e_{2n} U) = L^⊥. Here we use the notations from Subsection 3.4. The missing pair of rows e_n U and e_{2n} U is uniquely defined by L up to an invertible linear transformation e_n U ← a e_n U ⊕ b e_{2n} U and e_{2n} U ← c e_n U ⊕ d e_{2n} U for some invertible binary matrix [[a, b], [c, d]]. As discussed in Subsection 3.4, there is a one-to-one correspondence between such transformations and left multiplications U ← L U, where L ∈ C_n^0 acts non-trivially only on the n-th qubit.
We refer to the matrix U′ obtained from U ∈ C_n by removing the pair of rows n and 2n as a thin matrix representation of U. Our C++ implementation adopts the thin matrix data format for all intermediate steps of the algorithm. The thin matrix spans 4n(n−1) bits and can be conveniently distributed over two machine words, each of length 64 bits. The first word stores the rows e_1 U, e_2 U, . . ., e_{n−1} U and the second word stores the rows e_{n+1} U, e_{n+2} U, . . ., e_{2n−1} U. This leaves 128 − 4n(n−1) ≥ 8 free bits for n ≤ 6, which can be conveniently used to specify the cost-reducing generator in the augmented database, see Subsection 3.2. Recall that the number of generators is m = 9n(n−1)/2 ≤ 135 for n ≤ 6. Thus the generator can be specified using only 8 bits. Note also that storing the full matrix U ∈ C_n using only two machine words is impossible for n = 6, as it requires 4n^2 = 144 bits.
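The bit budget of the thin matrix format can be checked with a few lines of C++. The packing layout below (row k of a word stored at bit offset 2n·k, spare bits at the top of each word) is an assumption made for illustration, since the text does not specify the exact layout. The helper names `thinBits`, `freeBits`, `numGenerators`, `packRows`, and `unpackRow` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bit budget from the text: a thin matrix has 2(n-1) rows of 2n bits each.
constexpr int thinBits(int n)      { return 4 * n * (n - 1); }
constexpr int freeBits(int n)      { return 128 - thinBits(n); }
constexpr int numGenerators(int n) { return 9 * n * (n - 1) / 2; }

static_assert(thinBits(6) == 120, "n = 6 uses 120 of the 128 bits");
static_assert(freeBits(6) == 8, "8 spare bits remain for n = 6");
static_assert(numGenerators(6) == 135, "generator index fits in 8 bits");
static_assert(4 * 6 * 6 == 144, "the full 12x12 matrix would need 144 bits");

// Pack n-1 rows (each a 2n-bit value) into one 64-bit word,
// with row k at bit offset 2n*k (assumed layout).
std::uint64_t packRows(const std::vector<std::uint32_t>& rows, int n) {
    std::uint64_t word = 0;
    for (std::size_t k = 0; k < rows.size(); ++k)
        word |= static_cast<std::uint64_t>(rows[k]) << (2 * n * k);
    return word;
}

// Recover row k from a packed word.
std::uint32_t unpackRow(std::uint64_t word, int k, int n) {
    const std::uint64_t mask = (std::uint64_t{1} << (2 * n)) - 1;
    return static_cast<std::uint32_t>((word >> (2 * n * k)) & mask);
}
```

With n = 6 each word carries five 12-bit rows (60 bits), so under this layout the top 4 bits of each word stay free, giving the 8 spare bits mentioned in the text.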
The thin matrix format enables fast left and right multiplication by the single-qubit and two-qubit Clifford gates, which require at most 24 CPU instructions per gate for all n ≤ 6 (each instruction implements a bitwise operation on a single machine word). When needed, the thin matrix U′ can be expanded into the full symplectic matrix U ∈ C_n by calculating the missing pair of rows e_n U and e_{2n} U using the symplectic version of Gram-Schmidt orthogonalization. Our implementation converts the thin matrix to the full matrix in less than 2·10^{-7} seconds for any n ≤ 6 on a server-class CPU, in this case an Intel® Xeon® CPU E7-4850 v4 @ 2.10GHz, which is negligible compared with the time it takes to compute the reduced form.
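To illustrate what the expansion step has to produce, the sketch below recovers a valid pair of missing rows by brute force rather than by symplectic Gram-Schmidt. It is a deliberately simpler substitute, practical only because 2^{2n} ≤ 4096 for n ≤ 6. The row encoding (bits 0..n−1 hold the first half of a vector, bits n..2n−1 the second half) and the names `parity`, `sympProduct`, and `missingRows` are assumptions of this example.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Parity (mod-2 sum) of the bits of x.
inline int parity(std::uint32_t x) {
    int p = 0;
    while (x) { p ^= 1; x &= x - 1; }
    return p;
}

// Symplectic inner product of 2n-bit vectors v = (a|b) and w = (a'|b'):
// <v, w> = a.b' + b.a' (mod 2).
inline int sympProduct(std::uint32_t v, std::uint32_t w, int n) {
    const std::uint32_t mask = (1u << n) - 1;
    const std::uint32_t a = v & mask, b = v >> n;
    const std::uint32_t a2 = w & mask, b2 = w >> n;
    return parity((a & b2) ^ (b & a2));
}

// Find two vectors spanning the symplectic complement of the thin rows.
// Brute force over all 2^(2n) vectors (not Gram-Schmidt); fine for n <= 6.
std::pair<std::uint32_t, std::uint32_t>
missingRows(const std::vector<std::uint32_t>& thinRows, int n) {
    std::vector<std::uint32_t> found;
    for (std::uint32_t v = 1; v < (1u << (2 * n)); ++v) {
        bool orthogonal = true;
        for (std::uint32_t r : thinRows)
            if (sympProduct(v, r, n)) { orthogonal = false; break; }
        if (orthogonal) found.push_back(v);
    }
    // A valid thin matrix has a 2-dimensional complement: 3 non-zero vectors.
    assert(found.size() == 3);
    return std::make_pair(found[0], found[1]);
}
```

Any two distinct non-zero vectors of the two-dimensional complement have symplectic inner product 1, so the returned pair completes the matrix. In line with Lemma 5, the pair is determined only up to a single-qubit Clifford gate on the last qubit.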

Software tricks
Database generation: The calculation of the reduced cost-k Clifford group set R_n^k, as described in Subsection 3.2, lends itself to parallel processing. Specifically, each element of the set R_n^k can be calculated concurrently from its own data on its own processor. The implementation considerations for this run-once parallel processing job depended on factors such as: i. the cost and availability of scaled-up/scaled-out hardware, and ii. the cost-benefit of implementing, measuring, and tuning different data-level parallel processing options, including shared memory versus distributed memory (e.g., OpenMP/MPI) and specialized processors (e.g., vector processors, GPUs, FPGAs), not to mention the multiple software options with each, from programming languages to libraries [15]. Using Flynn's taxonomy [16], the Single Program, Multiple Data (SPMD) streams model was implemented using the C++ concurrent-set template class; specifically, each reduced cost-k Clifford group set R_n^k is an instance of set<pair<uint64, uint64>>. This is a good choice for programmer productivity, i.e., it lets the container's semantics deal with the requirements of maintaining distinct and efficiently searchable elements of a multi-terabyte set on SMP hardware, in this case an Intel® Xeon® 128-CPU E7-4850 v4 @ 2.10GHz with 6TB RAM.
The full database generation was extrapolated to take about 100 days on a single machine, amounting to approximately 100·24·128 = 307,200 CPU-hours that can be effectively divided among as many machines as are available. Hardware and software measurements during database generation, using performance analysis tools ranging from vmstat to VTune™, exposed heavy "NUMA thrashing," i.e., soft page faults [17]. To alleviate this for the final half of the run, we replaced the C++ set template with C's most basic systems programming mechanisms for allocating, positioning, and searching raw memory, namely malloc, bsearch, and qsort, along with read/write and uint128. This resulted in a 5x speed-up.
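The switch from a node-based set to a flat sorted array can be sketched as follows. This is an illustration under stated assumptions, not the code used for the database: it uses std::vector, std::sort, and std::binary_search as C++ counterparts of malloc, qsort, and bsearch, and the class name `FlatSet` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Two machine words per element, as in set<pair<uint64, uint64>>.
using Key = std::pair<std::uint64_t, std::uint64_t>;

// A read-mostly set kept as one contiguous sorted array.
class FlatSet {
public:
    // Append without ordering; call finalize() before any lookup.
    void add(const Key& k) { data_.push_back(k); }

    // Sort and drop duplicates so that elements are distinct and searchable.
    void finalize() {
        std::sort(data_.begin(), data_.end());
        data_.erase(std::unique(data_.begin(), data_.end()), data_.end());
    }

    // O(log N) membership test on the sorted array.
    bool contains(const Key& k) const {
        return std::binary_search(data_.begin(), data_.end(), k);
    }

    std::size_t size() const { return data_.size(); }

private:
    std::vector<Key> data_;
};
```

A contiguous array avoids the per-node allocations and pointer chasing of a tree-based set, which is one plausible reason for the reduced page-fault pressure reported above. The sketch itself makes no performance claim.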
Synthesis of optimal circuits: With the one-time generation of the database complete and saved on secondary storage (Solid State Disk), similar systems programming mechanisms in C were exploited to optimize performance and scalability in order to read/search what is now effectively a lookup table (LUT), with the expensive runtime calculation of an optimal 6-qubit Clifford circuit completed and replaceable by a simple array indexing operation. The database can be memory-mapped with mmap [18] for a greater degree of i. programmer productivity, i.e., the database can be easily referenced as memory using pointers, with no explicit file IO, and ii. operational flexibility, i.e., the database can be effectively used by any type of hardware, ranging from a single laptop to a cluster of server-class machines, with scaling solely dependent on the choice of hardware, all without changing the code; while the OS kernel and mmap transparently and efficiently take care of i. demand paging, and ii. maintaining only a single copy of data in memory, as opposed to copies in both the file cache and user space.
In addition, to reduce the number of SSD queries, the most time-consuming operation our search relies on, we employed the following strategy: i. we store the databases of Clifford circuits requiring 1-8, 14, and 15 gates in RAM, ii. we store in RAM an index consisting of every 1024th element of the Clifford unitaries implementable with 9-13 gates, and iii. when the length-1024 chunk containing the desired element is found by the binary search, we make one long query to extract all 2048 64-bit integers in this chunk.
The above modification limits the number of SSD queries required to synthesize an optimal circuit to at most 10 (at most two queries per search over each of the gate counts 9, 10, 11, 12, and 13) at the cost of 2.5GB of RAM usage.
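The sparse-index strategy described above can be sketched in a few lines. The example is a simplified model under explicit assumptions: keys are single 64-bit integers rather than two-word elements, the "SSD" is an in-memory vector, and the function names `buildIndex` and `lookup` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sparse index: every `stride`-th key of a sorted table, kept in RAM.
std::vector<std::uint64_t> buildIndex(const std::vector<std::uint64_t>& table,
                                      std::size_t stride) {
    std::vector<std::uint64_t> index;
    for (std::size_t i = 0; i < table.size(); i += stride)
        index.push_back(table[i]);
    return index;
}

// Look up `key`: a binary search in the RAM index selects one chunk, which is
// then fetched in a single read (simulated here by copying from `table`).
bool lookup(const std::vector<std::uint64_t>& table,
            const std::vector<std::uint64_t>& index,
            std::size_t stride, std::uint64_t key, int* reads = nullptr) {
    // First index entry greater than key; the chunk before it may hold the key.
    auto it = std::upper_bound(index.begin(), index.end(), key);
    if (it == index.begin()) return false;   // key is below the smallest element
    const std::size_t chunk = static_cast<std::size_t>(it - index.begin()) - 1;
    const std::size_t first = chunk * stride;
    const std::size_t last = std::min(first + stride, table.size());
    std::vector<std::uint64_t> buf(
        table.begin() + static_cast<std::ptrdiff_t>(first),
        table.begin() + static_cast<std::ptrdiff_t>(last));  // one simulated query
    if (reads) ++*reads;
    return std::binary_search(buf.begin(), buf.end(), key);
}
```

With a stride of 1024, the index holds roughly 0.1% of the table, and each lookup touches the large table exactly once in this simplified model.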
A machine with enough RAM to fit the entire database will get the best performance, as the complete database fills the file cache, and a machine with little-to-no available RAM will get the worst performance, as every pointer access to a memory-mapped region (e.g., bsearch) will touch the secondary storage. A commodity machine with typical RAM sizes will get near-best performance, as the "hot" parts of the database, the internal nodes of bsearch, will tend to remain in the cache hierarchy (L1-L3, file cache) and result in minimal access to secondary storage. OS-specific parameters were not explored but can also be benchmarked and tuned independently of the database and code, including page sizes and pinned memory.

Figure 2: All the most expensive 6-qubit Clifford unitaries, requiring 15 entangling gates (up to left and right multiplication by the single-qubit gates and qubit relabeling). (a) left: a compact representation in the form (U ⊗ U)SWAP, right: its optimal implementation; (b) left: a compact representation in the form (U′ ⊗ V′)SWAP, right: its optimal implementation. Not illustrated is the cyclic SWAP of all 6 qubits, which also requires 15 entangling gates.

Figure 3: An optimal cnot gate circuit (left) can be implemented with fewer entangling gates as an optimal Clifford circuit (right).

Results
The distribution of the number of equivalence classes across cnot gate costs is shown in Table 1. For the number of qubits 2 through 5 the most complex function to implement is unique (within the equivalence class definition), and it is equivalent to a cyclic permutation of qubits. For n = 6, the cyclic permutation is one of three such functions; the other two are illustrated in Fig. 2. The small number of equivalence classes for a small number of qubits implies an efficient formula (based on ReduceU) to compute the cnot cost of a small Clifford unitary.

We ran a script to calculate the distribution of the number of Clifford group elements across optimal cnot gate costs. Given the database, it took a few days to collect the data using an HPC system. This computation is highly parallelizable, and the runtime can be reduced significantly with many processors, e.g., GPUs; we have not pursued those reductions. The results are reported in Table 2.

We used the database to look for examples of quantum Clifford advantage over classical reversible cnot circuits, meaning optimal cnot circuits that can be implemented with fewer entangling gates as a Clifford circuit. We found one such example, illustrated in Fig. 3, that gives a reduction of 14 gates to 12, improving on the 8 to 7 reduction seen earlier [4]; indeed, 14/12 > 8/7.

The compiler was benchmarked using both consumer-grade and enterprise-grade systems for a test set with 10,000 elements of the Clifford group C_6. Each element was generated by a Clifford circuit with 600 randomly chosen gates over the library {h, p, cnot}. The number of gates was selected to be high enough to effect a close to uniform random distribution over the elements of the group C_6. We observed that such a random test set is dominated by the elements with costs 11 and 12. The compiler runtime reported below is the time required to obtain optimal circuits for all test set elements divided by the size of the test set. We observed a runtime of 0.0009358 seconds for a laptop with an Intel® i7-1068NG7 2.3GHz CPU and 16GB RAM with a USB-C-attached consumer-grade SSD. The search relies on the database stored on the SSD and a 2.5GB index in RAM, see Subsection 3.6 for details. The time reported measures hot-cache performance; cold-cache performance is 0.003708 seconds per optimal circuit, on average. The compiler performance improves when the entire database can be stored in RAM. We observed a hot-cache runtime of approximately 0.0006274 seconds for a server with an Intel® Xeon® 128-CPU E7-4850 v4 @ 2.10GHz and 6TB RAM. The process of loading the full database into RAM took approximately 2 hours.
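The construction of a test element from 600 random gates over {h, p, cnot} can be sketched as below. This is a hedged illustration rather than the benchmark code. The gate-to-matrix conventions (each gate acting as a column operation on a 2n×2n binary matrix whose rows are stored as bitmasks) are assumptions that may differ from the paper's by a transposition, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// A 2n x 2n binary matrix; row r is a bitmask, bit k is the entry in column k.
// Columns j and n+j belong to qubit j (0-based).
using Rows = std::vector<std::uint32_t>;

Rows identity(int n) {
    Rows U(2 * n);
    for (int r = 0; r < 2 * n; ++r) U[r] = 1u << r;
    return U;
}

// Right multiplication by a gate is the same column operation on every row.
void applyH(Rows& U, int n, int j) {             // swap columns j and n+j
    for (auto& r : U) {
        std::uint32_t d = ((r >> j) ^ (r >> (n + j))) & 1u;
        r ^= (d << j) | (d << (n + j));
    }
}
void applyP(Rows& U, int n, int j) {             // column n+j += column j
    for (auto& r : U) r ^= ((r >> j) & 1u) << (n + j);
}
void applyCnot(Rows& U, int n, int c, int t) {   // col t += col c, col n+c += col n+t
    for (auto& r : U) {
        r ^= ((r >> c) & 1u) << t;
        r ^= ((r >> (n + t)) & 1u) << (n + c);
    }
}

int parity(std::uint32_t x) { int p = 0; while (x) { p ^= 1; x &= x - 1; } return p; }

// Rows i and j must have symplectic inner product 1 iff they differ by n.
bool isSymplectic(const Rows& U, int n) {
    const std::uint32_t mask = (1u << n) - 1;
    for (int i = 0; i < 2 * n; ++i)
        for (int j = 0; j < 2 * n; ++j) {
            std::uint32_t a = U[i] & mask, b = U[i] >> n;
            std::uint32_t a2 = U[j] & mask, b2 = U[j] >> n;
            int expected = (i + n == j || j + n == i) ? 1 : 0;
            if (parity((a & b2) ^ (b & a2)) != expected) return false;
        }
    return true;
}

// Apply numGates uniformly chosen gates to the identity (requires n >= 2).
Rows randomClifford(int n, int numGates, std::mt19937& rng) {
    Rows U = identity(n);
    std::uniform_int_distribution<int> gate(0, 2), qubit(0, n - 1), shift(1, n - 1);
    for (int g = 0; g < numGates; ++g) {
        int kind = gate(rng), j = qubit(rng);
        if (kind == 0) applyH(U, n, j);
        else if (kind == 1) applyP(U, n, j);
        else applyCnot(U, n, j, (j + shift(rng)) % n);  // target differs from control
    }
    return U;
}
```

Each of the three column operations preserves the symplectic form, so `isSymplectic` holds for any circuit length. How close 600 gates bring the distribution to uniform is an empirical statement from the text that this sketch does not test.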
This performance makes it possible to use our implementation to obtain individual circuits and entire randomized benchmarking schedules in mere seconds using consumer-grade hardware, as well as online via a web interface. For use in demanding applications such as peephole optimization of large circuits, we suggest relying on large-RAM commercial-grade servers and note that it takes roughly half the time to look up the cost without computing the optimal circuit (the procedure that would likely get called most frequently during peephole optimization).
The average runtime of our compiler for random n-qubit Clifford operators with n ≤ 5 is shown in Table 3.

Optimal 2-designs
Unitary designs [19] are probability distributions on the unitary group that reproduce low-order moments of the Haar (uniform) distribution. Of particular interest are unitary designs that can be efficiently implemented by quantum circuits [20]. Such designs can serve as a substitute for the Haar distribution in certain randomized quantum protocols such as data hiding [12], estimating fidelity of quantum operations [8,21], and quantum state tomography [10]. In this section, we leverage the database of reduced Clifford elements to construct optimal unitary designs that have the minimum average cost, subject to the constraint that all elements of the design are Clifford operators.

Let U(2^n) be the group of unitary complex matrices of size 2^n × 2^n. Suppose D ⊆ U(2^n) is a finite subset and µ: D → R_+ is a probability distribution on D. The pair (D, µ) is called a unitary 2-design [22] if

\[ \sum_{\hat U \in D} \mu(\hat U)\, \hat U \hat A \hat U^\dagger \otimes \hat U \hat B \hat U^\dagger = \int d\hat U\, \hat U \hat A \hat U^\dagger \otimes \hat U \hat B \hat U^\dagger \qquad (11) \]

for any complex matrices Â and B̂. Here the tensor product separates two n-qubit registers and the integral in the right-hand side of Eq. (11) is the average over the Haar distribution on the unitary group U(2^n). We reserve the hat notation for complex unitary matrices to avoid confusion with binary symplectic matrices considered in the rest of the paper. Below we choose D to be the n-qubit Clifford group and construct a probability distribution µ that minimizes the average cost

\[ \sum_{\hat U \in D} \mu(\hat U)\, \mathrm{cost}(\hat U) \qquad (12) \]

subject to the constraint that (D, µ) is a unitary 2-design. Here cost(Û) is the minimum number of cnot gates required to implement Û by a quantum circuit composed of the Hadamard, Phase, and cnot gates. Since Pauli operators have zero cost, we can assume without loss of generality that the optimal solution µ is Pauli-invariant, i.e., µ(Û) = µ(ÛÔ) for all n-qubit Pauli operators Ô. As discussed in Section 2, the unitary version of the n-qubit Clifford group is isomorphic to C_n × {I, X, Y, Z}^n. Here we ignore the overall phase factors. Define the probability distribution π: C_n → R_+ such that π(U) = 4^n µ(U × P) for all U ∈ C_n and P ∈ {I, X, Y, Z}^n. The distribution π is well-defined whenever µ is Pauli-invariant. In Appendix B we show that µ is a Clifford 2-design iff π obeys the so-called Pauli mixing constraint [20]

\[ \Pr_{U \sim \pi}[\, U x = y \,] = \frac{1}{4^n - 1} \quad \text{for all non-zero } x, y \in \{0,1\}^{2n}. \qquad (13) \]

Furthermore, µ has the average cost

\[ \sum_{U \in C_n} \pi(U)\, \mathrm{cost}(U). \qquad (14) \]

Thus it suffices to minimize the average cost, Eq. (14), over variables π(U) ≥ 0 subject to the normalization constraint \sum_{U ∈ C_n} π(U) = 1 and the Pauli mixing constraint, Eq. (13). This gives a linear program with |C_n| variables.

Minimizing the average cost, Eq. (16), over variables η(U) ≥ 0 with U ∈ R_n, subject to the normalization \sum_{U ∈ R_n} η(U) = 1 and the Pauli mixing constraints, Eqs. (17, 18), gives a linear program with |R_n| variables and 1 + (2^n − 1)^2 equality constraints. We were able to find an optimal solution of this linear program numerically for n = 2, 3, 4 qubits. The optimal reduced distributions η presented in Table 4, Table 5, and Table 6 are compactly represented by a list of reduced elements U_1, U_2, . . ., U_m ∈ R_n along with their probabilities η(U_j). Only reduced elements that appear with non-zero probability are shown. The tables display an optimal circuit implementation of each reduced element U_j. To avoid clutter, we omit single-qubit gates on the left and on the right. The actual 2-design has the form L W^{-1} U_j W R, where the index j ∈ {1, 2, . . ., m} is sampled with the probability η(U_j), the qubit permutation W is sampled uniformly from S_n, and L, R are sampled uniformly from the local subgroup C_n^0.
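The sampling recipe for the 2-design (index j with probability η(U_j), a uniform qubit permutation W, and uniform local elements L and R) can be sketched as follows. The representation is an assumption made for illustration: each element of the local subgroup is encoded as one index in 0..5 per qubit, consistent with |C_n^0| = 6^n, and the names `DesignSample` and `sampleDesign` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <vector>

// One draw from the design: which reduced element, which qubit relabeling,
// and which single-qubit Clifford class (index 0..5) acts on each qubit.
struct DesignSample {
    int element;             // index j of the reduced element U_j (0-based)
    std::vector<int> perm;   // qubit permutation W
    std::vector<int> left;   // L: one index in 0..5 per qubit
    std::vector<int> right;  // R: one index in 0..5 per qubit
};

DesignSample sampleDesign(const std::vector<double>& eta, int n, std::mt19937& rng) {
    DesignSample s;
    std::discrete_distribution<int> pickElement(eta.begin(), eta.end());
    s.element = pickElement(rng);                      // j with probability eta[j]
    s.perm.resize(n);
    std::iota(s.perm.begin(), s.perm.end(), 0);
    std::shuffle(s.perm.begin(), s.perm.end(), rng);   // uniform over S_n
    std::uniform_int_distribution<int> local(0, 5);    // 6 classes per qubit
    s.left.resize(n);
    s.right.resize(n);
    for (int q = 0; q < n; ++q) {
        s.left[q] = local(rng);
        s.right[q] = local(rng);
    }
    return s;
}
```

The sketch only selects the random labels. Assembling the circuit L W^{-1} U_j W R from them requires the tabulated circuits U_j and a concrete gate set for the six single-qubit classes, both outside the scope of this example.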

Comparison to prior work
Similar-spirited prior work includes the synthesis of 4-qubit optimal Clifford circuits [23], the synthesis of 4-bit optimal reversible circuits [24], and the optimal solution of the Rubik's cube puzzle [25]. [23] is most closely related to our work, given the focus on Clifford circuits; the difference is that we chose to study the two-qubit gate cost, which better reflects the constraints of the existing quantum computers than the total gate count. The search space size comparison is 4.7·10^10 in [23] to 2.1·10^23 in our work, an almost 13 orders of magnitude difference. [24] studies reversible circuits, a highly relevant type of computation. Their search space size is 2.1·10^13, meaning we solved a problem with a search space 10 orders of magnitude larger. Finally, [25] studies the Rubik's cube, which is also a finite group. Their search space size is 4.3·10^19, meaning ours is almost 4 orders of magnitude larger.

Table 6: Optimal four-qubit Clifford 2-design with the average cost 5.08034.... For comparison, the full Clifford group C_4 has the average cost 5.85856.... We note that all except for two circuits in the above table have cost 5. The remaining pair of circuits have cost 6. Columns: circuit U_j, η(U_j).

Conclusion
In this paper, we reported algorithms and their C++ implementation that compute all two-qubit gate count optimal 6-qubit Clifford circuits. There are about 2.1·10^23 different Clifford functions. The large search space required us to employ server-class machines to make the computation possible. In particular, we used HPC to break down the set of canonical representatives of Clifford group elements sharing similar optimal circuit structure, and store them in a database of size 2.1TB. Given this database on an SSD and a 2.5GB index file in RAM, the time to extract an optimal circuit using a consumer-grade laptop is 0.0009358 seconds, 10 times faster than the typical access time for a spindle drive. The time to extract an optimal circuit using an enterprise-level system while storing the database in RAM is 0.0006274 seconds, 15 times faster than the typical HDD access time. We used the database to establish the maximal gate count needed to implement an arbitrary 6-qubit Clifford unitary and showed the distribution of the number of Clifford functions across their required gate counts. We established a new example of quantum advantage by Clifford circuits over cnot gate circuits and found optimal Clifford 2-designs for up to 4 qubits.

µ(U × P) = π(U)/4^n for all U ∈ C_n and P ∈ P_n. Suppose (C_n, µ) is a 2-design, that is, µ obeys Eq. (19) with D = C_n. Consider the second case of Eq. (19) such that Â = B̂ = Ô(x) for some non-zero vector x ∈ {0, 1}^{2n}. Then it is equivalent to

\[ \sum_{U \in C_n} \pi(U)\, \hat O(Ux) \otimes \hat O(Ux) = \frac{1}{4^n - 1} \sum_{y \in \{0,1\}^{2n} \setminus 0^{2n}} \hat O(y) \otimes \hat O(y). \]

Since Pauli operators are linearly independent, this is possible only if a random vector U x with U sampled from π(U) is distributed uniformly on the set of all non-zero vectors {0, 1}^{2n} \ 0^{2n}. This gives the Pauli mixing condition, Eq. (13).
is at most 6^n. Since the right multiplication by the elements of the subgroup C_{n,j} changes at most two rows of a matrix, we can compute U R in time O(n^2). By Lemma 2, computing the left reduced form of U R takes time O(n^2). Thus the overall runtime of computing L(U) is O(n^2 6^n). Once the set L(U) is computed, finding its lexicographically smallest element takes time O(n|L(U)|) = O(n 6^n).

Lemma 4. Each equivalence class [U] with U ∈ C_n contains a unique fully reduced element that can be computed in time O(n^2·n! + t_n·|S(U)|), given the symplectic matrix representation of U. Here t_n is the runtime of localReduce for elements of C_n.

Proof. Consider a matrix U ∈ C_n. It follows directly from the definitions that ReduceU(U) ∈ [U]. Thus it suffices to check that R(U′) = R(U) for all U′ ∈ [U].

Comment 4: We empirically observed that |S(U)| = 1 for a typical element of the Clifford group and the maximal value of |S(U)| is 14. The mean value of |S(U)| is approximately 1.03 for a randomly picked U ∈ C_6.

Table 4: Optimal two-qubit Clifford 2-design with the average cost 1.5. This coincides with the average cost of the full Clifford group C_2. Columns: circuit U_j, probability η(U_j).

Table 5: Optimal three-qubit Clifford 2-design with the average cost 3.12363.... For comparison, the full Clifford group C_3 has the average cost 3.50937.... Columns: circuit U_j, probability η(U_j).

C_n^0 is the local subgroup of C_n, i.e., the one generated by the single-qubit Clifford gates. Suppose ReduceU: C_n → C_n is a function such that ReduceU(U) = ReduceU(V) if and only if U and V are equivalent up to left and right multiplications by single-qubit gates and a qubit relabeling. In other words, ReduceU(U) is a canonical representative of the equivalence class [U].

Table 1: The distribution of the number of equivalence classes across Clifford circuits over 2, 3, 4, 5, and 6 qubits.

Table 2: The distribution of the number of 6-qubit Clifford unitaries across the entangling gate cost.

Table 3: Average runtime for optimally compiling n-qubit Clifford operators with the full database of reduced elements loaded into RAM. The runtime was measured on a MacBook Pro laptop (early 2015 model) with an Intel® i7-5557U 3.1GHz CPU and 16GB RAM.