QuEST and High Performance Simulation of Quantum Computers

We introduce QuEST, the Quantum Exact Simulation Toolkit, and compare it to ProjectQ, qHipster and a recent distributed implementation of Quantum++. QuEST is the first open source, hybrid multithreaded and distributed, GPU accelerated simulator of universal quantum circuits. Embodied as a C library, it is designed so that a user’s code can be deployed seamlessly to any platform from a laptop to a supercomputer. QuEST is capable of simulating generic quantum circuits of general one and two-qubit gates and multi-qubit controlled gates, on pure and mixed states, represented as state-vectors and density matrices, and under the presence of decoherence. Using the ARCUS and ARCHER supercomputers, we benchmark QuEST’s simulation of random circuits of up to 38 qubits, distributed over up to 2048 compute nodes, each with up to 24 cores. We directly compare QuEST’s performance to ProjectQ’s on single machines, and discuss the differences in distribution strategies of QuEST, qHipster and Quantum++. QuEST shows excellent scaling, both strong and weak, on multicore and distributed architectures.


I. INTRODUCTION
Classical simulation of quantum computation is vital for the study of new algorithms and architectures.As experimental researchers move closer to realising quantum computers of sufficient complexity to be useful, their work must be guided by an understanding of what tasks we can hope to perform.This in turn means we must explore an algorithm's scaling, its robustness versus errors and imperfections, and the relevance of limitations of the underlying hardware.Because of these requirements simulation tools are needed on many different classical architectures; while a workstation may be sufficient for the initial stages of examining an algorithm, further study of scaling and robustness may require more powerful computational resources.Flexible, multi-platform supporting simulators of quantum computers are therefore essential.
Further it is important these simulations are very efficient since they are often repeated many times, for example to study the influence of many parameters, or the behaviour of circuits under noise.But it is expensive to exactly simulate a quantum system using a classical system, since a high-dimensional complex vector must be maintained with high fidelity.Both the memory requirements, and the time required to simulate an elementary circuit operation, grow exponentially with the number of qubits.A quantum computer of only 50 qubits is already too large to be comprehensively simulated by our best classical computers [5], and is barely larger than the 49 qubit computers in development by Intel and Google [6,7].To simulate quantum computers even of the size already experimentally realised, it is necessary that a classical simulator take full advantage of the performance optimisations possible of high performance classical computing.
It is also equally important that the research community have access to an ecosystem of simulators.Verfication of complex simulations is a non-trivial task, one that is much eased by having the facility to compare the results of simulations performed by multiple packages.

A. Target Platforms and Users
Simulations of quantum computation are performed on a wide variety of classical computational platforms, from standard laptops to the most powerful supercomputers in the world, and on standard CPUs or on accelerators such as GPUs.Which is most suitable for the simulation of a given circuit will depend upon the algorithm being studied and the size of the quantum computer being modelled.
To date this has resulted in a number of simulators which typically target one, or a small number, of these architectures.While this leads to a very efficient exploitation of a given architecture, it does mean that should a research project need to move from one architecture to another, for instance due to the need to simulate more qubits, a different simulation tool is required.This may require a complete rewrite of the simulation code, which is time consuming and makes verification across platforms difficult.In this article we describe QuEST which runs efficiently on all architectures typically available to a researcher, thus facilitating the seamless deployment of the researcher's code.This universal support also allows the researcher to easily compare the performance of the different architectures available to them, and so pick that most suitable for their needed simulations.
In the rest of this section we shall examine the nature of the architectures that are available, cover briefly how codes exploit them efficiently, and show how QuEST, the universal simulator, compares with the more platform specific implementations.

B. Simulator Optimisations
Classical simulators of quantum computation can make good use of several performance optimisations.
For instance, the data parallel task of modifying the state vector under a quantum operation can be sped up with single-instruction-multiple-data (SIMD) execution.SIMD instructions, like Intel's advanced vector extensions (AVX), operate on multiple operands held in vector registers to concurrently modify multiple array elements [25], like state vector amplitudes.
Task parallelism can be achieved through multithreading, taking advantage of the multiple cores found in modern CPUs.Multiple CPUs can cooperate through a shared NUMA memory space, which simulators can interface with through OpenMP [26].
Simulators can defer the expensive exchange of data in a CPU's last level cache (LLC) with main memory through careful data access; a technique known as cache blocking [27].Quantum computing simulators can cache block by combining sequential operations on adjacent qubits before applying them, a technique referred to as gate fusion [2,18].For instance, gates represented as matrices can be fused by computing their tensor product.
Machines on a network can communicate and cooperate through message passing.Simulators can partition the state vector and operations upon it between distributed machines, for example through MPI, to achieve both parallelisation and greater aggregate memory.Such networks are readily scalable, and are necessary for simulating many qubit circuits [18].
With the advent of general-purpose graphical processing units (GPGPUs), the thousands of linked cores of a GPU can work to parallelise scientific code.Simulators can make use of NVIDIA's compute unified device architecture (CUDA) to achieve massive speedup on cheap, discrete hardware, when simulating circuits of a limited size [23].We mention too a recent proposal to utilise multi-GPU nodes for highly parallel simulation of many qubit quantum circuits [22].

Single node
ProjectQ is an open-source quantum computing framework featuring a compiler targeting quantum hardware and a C++ quantum computer simulator behind a Python interface [1].In this text, we review the performance of its simulator, which supports AVX instructions, employs OpenMP and cache blocking for efficient parallelisation on single-node shared-memory systems, and emulation to take computational shortcuts [28].
QuEST is a new open source simulator developed in ISO standard conformant C [29], and released under the open source MIT license.Both OpenMP and MPI based parallelisation strategies are supported, and they may be used together in a so-called hybrid strategy.This provides seamless support for both single-node sharedmemory and distributed systems.QuEST also employs CUDA for GPU acceleration, and offers the same interface on single-node, distributed and GPU platforms.Though QuEST does not use cache blocking or emulation, we find QuEST performs equally or better than ProjectQ on multicore systems, and can use its additional message-passing facilities for faster and bigger simulations on distributed memory architectures.
ProjectQ offers a high-level Python interface, but can therefore be difficult to install and run on supercomputing architectures, though containerisation may make this process easier in future [30,31].Conversely, Quest is light-weight, stand-alone, and tailored for highperformance resources -its low-level C interface can be compiled directly to a native executable and run on personal laptops and supercomputers.
Both QuEST and ProjectQ maintain a pure state in 2 n complex floating point numbers for a system of n qubits, with (by default) double precision in each real and imaginary component; QuEST can otherwise be configured to use single or quad precision.Both simulators store the state in C/C++ primitives, and so (by default) consume 16×2 n B [32] in the state vector alone.However ProjectQ incurs a ×1.5 memory overhead during state allocation, and QuEST clones the state vector in distributed applications.Typical memory costs of both simulators on a single thread are shown in Figure 1, which vary insignificantly from their multithreaded costs.While QuEST allows direct read and write access to the state-vector, Pro-jectQ's single amplitude fetching has a Python overhead, and writing is only supported in batch which is memory expensive due to Python objects consuming more memory than a comparable C primitive -as much as 3× [33,34].Iterating the state-vector in ProjectQ is therefore either very slow, or comes with an appreciable memory cost.
QuEST applies a single-qubit gate (a 2 × 2 matrix G) on qubit q of an N -qubit pure state-vector |ψ = α n |n , represented as the complex vector α, by updating vector elements where . This applies G via 2 N computations of ab + cd for complex a, b, c, d and avoids having to compute and matrix-multiply a full 2 N × 2 N unitary on the state-vector.This has a straight-forward generalisation to multi-control single-target gates, and lends itself to parallelisation.
We leverage the same hardware-optimised code to enact gates on N -qubit density matrices, by storing them as 2N -qubit state-vectors, Here the object ρ does not, in general, respect the constraint |α n | 2 = 1.An operation G q ρG † q , that is a gate on qubit q, can then be effected on ρ as G * q+N G q ρ , by exploiting the Choi-Jamiolkowski isomorphism [35] This holds also for multi-qubit gates.The distribution of the density matrix in this form lends itself well to the parallel simulation of dephasing and depolarising noise channels.

Distributed
How simulators partition the state vector between processes and communicate over the network is key to their performance on distributed memory architectures.All simulators we have found so far employ a simple partitioning scheme; the memory to represent a state vector is split equally between all processes holding that vector.A common strategy to then evaluate a circuit is to pair nodes such that upon applying a single qubit gate, every process must send and receive the entirety of its portion of the state vector to its paired process [2,3,17].
The number of communications between paired processes, the amount of data sent in each and the additional memory incurred on the compute nodes form a 2.An illustration of strategies to distribute the state vector between two 64 GiB nodes.A complete cloning (×2 memory) of the partition on each node is wasteful.Half the partition can be cloned, at the cost of twice as many MPI messages, to fit another qubit into memory [17].Further division requires more communication for less memory overhead [2].The bottom plot shows the maximum number of qubits which can fit on 2 k nodes of varying memory, assuming a 50 MiB overhead per node.
tradeoff.A small number of long messages will ensure that the communications are bandwidth limited, which leads to best performance in the communications layer.However this results in a significant memory overhead, due to the process having to store buffers for both the data it is sending and receiving, and in an application area so memory hungry as quantum circuit simulation this may limit the size of circuit that can be studied.On the other hand many short messages will minimise the memory overhead as the message buffers are small, but will lead to message latency limited performance as the bandwidth of the network fabric will not be saturated.This in turn leads to poor parallel scaling, and hence again limits the size of the circuit under consideration, but now due to time limitations.Note that the memory overhead is at most a factor 2, which due to the exponential scaling of the memory requirements, means only 1 less qubit may be studied.Some communication strategies and their memory overheads and visualised in Figure 2.
QuEST partitions the state vector equally between the processes within the job, and the message passing between the process pairs is so organised as to absolutely minimise the number of communications during the operation of a single gate.Thus parallel performance should be good, but there will be a significant memory overhead; in practice a factor of 2 as described above.For n qubits distributed over 2 k nodes, these communications occur when operating on qubits with index ≥ n − k, indexing from 0.
An alternative strategy is to clone, send and receive only half of each node's data in two exchanges [17], incurring instead a 1.5× memory cost.This often leaves room to simulate an additional qubit, made clear in Figure 2.This strategy can be recursed further to reduce the memory overhead even more, and negligible additional memory cost can be achieved by communicating every amplitude separately as in [3], though this comes at a significant communication cost, since a message passing pattern is latency dominated and will exhibit poor scaling with process count.However an improvement made possible by having two exchanges is to overlap the communication of the first message with the computation on the second half of the state vector, an optimisation implemented in qHipster [2].This depends on the network effectively supporting asynchronous communications.
We also mention recent strategies for further reducing network traffic by optimising the simulated circuit through gate fusion, state reordering [2,18] and rescheduling operations [18], though opportunities for such optimisations may be limited.
In terms of the functionality implemented in the simulation packages we note that while qHipster is limited to single and two-qubit controlled gates, QuEST additionally allows the distributed operation of any-qubit controlled gates.

GPU
Though MPI distribution can be used for scalable parallelisation, networks are expensive and are overkill for deep circuits of few qubits.Simulations limited to 29 qubits can fit into a 12 GB GPU which offers high parallelisation at low cost.In our testing, QuEST running a single Tesla K40m GPU (retailing currently for ∼3.6 k USD) outperforms 8 distributed 12-core Xeon E5-2697 v2 series processors, currently retailing at ∼21 k USD total, ignoring the cost of the network.
QuEST is the first available simulator of both statevectors and density matrices which can run on a CUDA enabled GPU, offering speedups of ∼5× over already highly-parallelised 24-threaded single-node simulation.We QCGPU [36] which is a recent GPUaccelerated single-node simulator being developed with Python and OpenCL, and Quantumsim, a CUDA-based simulator of density matrices [37].

Multi-platform
QuEST is the only simulator which supports all of the above classical architectures.A simulation written in QuEST can be immediately deployed to all environments, from a laptop to a national-grade supercomputer, performing well at all simulation scales.We list the facilities supported by other state-of-the-art simulators in Table II FIG. 3.An example of a depth 10 random circuit on 5 qubits, of the linear topology described in [38].This diagram was generated using ProjectQ's circuit drawer.

C. Algorithm
We compare QuEST and ProjectQ performing simulations of universal psuedo-random quantum circuits of varying depth and number of qubits.A random circuit contains a random sequence of gates, in our case with gates from the universal set {H, T , C(Z), X 1/2 , Y 1/2 }.These are the Hadamard, π/8, controlled-phase and root Pauli X and Y gates.Being computationally hard to simulate, random circuits are a natural algorithm for benchmarking simulators [38].We generate our random circuits by the algorithm in [38], which fixes the topology for a given depth and number of qubits, though randomises the sequence of single qubit gates.An example is shown in Figure 3.The total number of gates (single plus control) goes like O(nd) for an n qubit, depth d random circuit, and the ratio of single to control gates is mostly fixed at 1.2 ± 0.2, so we treat these gates as equal in our runtime averaging.For measure, a depth 100 circuit of 30 qubits features 1020 single qubit gates and 967 controlled phase gates.
Though we here treat performance in simulating a random circuit as an indication of the general performance of the simulator, we acknowledge that specialised simulators may achieve better performance on particular classes of circuits.For example, ProjectQ can utilise topological optimisation or classical emulation to shortcut the operation of particular subcircuits, such as the quantum Fourier transform [28].
We additionally study QuEST's communication efficiency by measuring the time to perform single qubit rotations on distributed hardware.

A. Hardware
We evaluate the performance of QuEST and ProjectQ using Oxford's computing facilities, specifically the AR-CUS Phase-B supercomputer, and the UK National Supercomputing facility ARCHER.
QuEST and ProjectQ are compared on single nodes
On ARCHER, ProjectQ v0.3.6 is compiled with GCC 5.3.0, and run in Python 3.5.3inside an Anaconda 4.0.6.QuEST is compiled with ICC 17.0.0which supports OpenMP 4.5 [26], and is distributed with the MPICH3 implementation of the MPI 3.0 standard, optimised for the Aries interconnect.

Configuration
We attempt to optimise ProjectQ when simulating many qubits by enabling gate fusion only for multithreaded simulations [31].
from projectq import MainEngine from projectq.backendsimport Simulator MainEngine( backend=Simulator( gate_fusion=(threads > 1))) We found that ProjectQ's multithreaded simulation of few qubit random circuits can be improved by disabling all compiler engines, to reduce futile time spent optimising the circuit in Python.

MainEngine( backend=Simulator(gate_fusion=True), engine_list=[])
However, this disables ProjectQ's ability to perform classical emulation and gate decomposition, and so is not explored in our benchmarking.We studied ProjectQ's performance for different combinations of compiler engines, number of gates considered in local optimisation and having gate fusion enabled, and found the above configurations gave the best performance for random circuits on our tested hardware.
Our benchmarking measures the runtime of strictly the code responsible for simulating the sequence of gates, and excludes the time spent allocating the state vector, instantiating or freeing objects or other one-time overheads.
In ProjectQ, this looks like:

A. Single Node Performance
The runtime performance of QuEST and ProjectQ, presented in Figure 4, varies with the architecture on which they are run, and the system size they simulate.Anomalous slowdown of ProjectQ at 22 qubits may be explained by the LLC becoming full, due to its use of cache blocking through gate fusion [31].
For fewer than ∼22 qubits, ProjectQ's Python overhead is several orders of magnitude slower than QuEST's C overhead, independent of circuit depth.The Python overhead can be reduced by disabling some simulation facilities -see Section III B 2. For larger systems, the time spent in ProjectQ's C++ backend operating on the state vector dominates total runtime, and the time per gate of both simulators grows exponentially with increasing number of qubits.
On a single ARCUS-B thread, ProjectQ becomes twice as fast as QuEST, attributable to its sophisticated circuit evaluation.However, these optimisations appear to scale poorly; QuEST outperforms ProjectQ on 16 threads on ARCUS-B, and on ARCHER both simulation packages are equally fast on 24 threads.This is made explicit in the strong scaling over threads shown in Figure 5, which reveals ProjectQ's scaling is not monotonic.Per-  formance suffers with the introduction of more than 8 threads, though is restored at 16.
We demonstrate QuEST's utilisation of a GPU for highly parallelised simulation in Figure 6, achieving a speedup of ∼5× from QuEST and ProjectQ on 24 threads.

B. Distributed Performance
Strong scaling of QuEST simulating a 30 and 38 qubit random circuit, distributed over 1 to 2048 ARCHER nodes, is shown in Figure 7.In all cases one MPI process per node was employed, each with 24 threads.Recall that QuEST's communication strategy involves cloning the state vector partition stored on each node.The 30 qubit (38 qubit) simulations therefore demand 32 GiB (8 TiB) memory (excluding overhead), and require at least 1 node (256 nodes), whereas qHipster's strategy would fit a 31 qubit (39 qubit) simulation on the same hardware [2].
Communication cost is shown in Figure 8 as the time to rotate a single qubit when using just enough nodes to store the state-vector; the size of the partition on each node is constant for increasing nodes and qubits.QuEST shows excellent weak scaling, and moving from 34 to 37 simulated qubits slows QuEST by a mere ≈ 9%.It is interesting to note that the equivalent results for qHipster show a slowdown of ≈ 148% [2], but this is almost certainly a reflection of the different network used in generating those results, rather than in any inherent weakness in qHipster itself.QuEST and qHipster show comparable ∼ 10 1 slowdown when operating on qubits which require communication against operations on qubits which do not (shown in the bottom subplot of Figure 8).Though such slowdown is also network dependent, it is significantly smaller than the ∼ 10 6 slowdown reported by the Quantum++ adaptation on smaller systems [3], and reflects a more efficient communication strategy.We will discuss these network and other hardware dependencies further in future work, and also intend to examine qHipster on ARCHER so a true like with like comparison with QuEST can be made.

V. SUMMARY
This paper introduced QuEST, a new high performance open source framework for simulating universal quantum computers.We demonstrated QuEST shows good strong scaling over OpenMP threads, competitive with a state of the art single-node simulator ProjectQ when performing multithreaded simulations of random circuits.We furthermore parallelised QuEST on a GPU for a 5× speedup over a 24 threaded simulation, and a 40× speedup over single threaded simulation.QuEST also supports distributed memory architectures via message passing with MPI, and we've shown QuEST to have excellent strong and weak scaling over multiple nodes.This behaviour has been demonstrated for up to 2048 nodes and has been used to simulate a 38 qubit random circuit.Despite its relative simplicity, we found QuEST's communication strategy yields comparable performance to qHipster's, and strongly outperforms the distributed adaptation of Quantum++.QuEST can be downloaded in Reference [39] FIG.2.An illustration of strategies to distribute the state vector between two 64 GiB nodes.A complete cloning (×2 memory) of the partition on each node is wasteful.Half the partition can be cloned, at the cost of twice as many MPI messages, to fit another qubit into memory[17].Further division requires more communication for less memory overhead[2].The bottom plot shows the maximum number of qubits which can fit on 2 k nodes of varying memory, assuming a 50 MiB overhead per node.

ARCHERFIG. 4 .FIG. 5 .
FIG. 4. Comparison of QuEST and ProjectQ when simulating random circuits over 1, 16 (on ARCUS Phase-B) and 24 (on ARCHER) threads (top to bottom).Coloured lines indicate the mean, with shaded regions indicating a standard deviation either side, over a total of ∼77 k simulations of varying depth.Vertical dashed lines indicate the maximum number of qubits for which the entire state vector fits into the LLC.The speedup subgraphs show the ratio of ProjectQ to QuEST runtime.

10 FIG. 6 .
FIG. 6.QuEST's single-node performance using multithreading and GPU acceleration to parallelise random circuit simulations.The subplot shows the speedup (ratio of runtimes) that a GPU of 2880 CUDA cores on ARCUS Phase-B achieves against 24 threads on ARCHER.
Memory consumption of QuEST's C and ProjectQ's Python processes, as reported by Linux's /proc/self/status during random circuit simulation on a single 256 GiB ARCUS Phase-B compute node.Full and dashed lines show the typical and maximum usage respectively, while the gray dashed line marks the memory required to store only the state-vector (in double precision).The subplot shows the ratio of total memory consumed to that by only the state-vector.

TABLE I .
[3]omparison of the facilities offered by some publicly available, state-of-the-art simulators.Note the distributed adaptation of Quantum++[3]is not currently publicly available.Here, density matrices refers to the ability to precisely represent mixed states.GiB compute nodes, each with two 12-core Intel Xeon E5-2697 v2 series processors linked by two QuickPath Interconnects, and a collective LLC of 61 MB between two NUMA banks.Thus a single node is capable of simulating up to 32 qubits with 24 threads.We furthermore evaluate the scalability of QuEST when distributed over up to 2048 ARCHER compute nodes, linked by a Cray Aries interconnect, which supports an MPI latency of ∼ 1.4 ± 0.1 µs and a bisection bandwidth of 19 TB/s.
B. Software FIG.8.QuEST multinode weak scaling of a single qubit rotation, distributed on {16, 32, 64, 128, 256} ARCHER nodes respectively, each with 24 threads between two sockets and 64 GiB of memory.Communication occurs for qubits at positions ≥ 30, indexing from 0. Time to rotate qubits at positions 0-25 are similar to those of 26-29 and are omitted.The bottom subplot shows the slowdown caused by communication, while the top subplot shows the slowdown of rotating the final (communicated) qubit as the total number of qubits simulated increases from 34.