Abstract
Classical simulation of quantum computation is vital for verifying quantum devices and assessing quantum algorithms. We present a new quantum circuit simulator developed on the Sunway TaihuLight supercomputer. Compared with other simulators, the present one is distinguished in two aspects. First, our simulator is more versatile. It consists of three mutually independent parts that compute the full, partial and single amplitudes of a quantum state with different methods. It can emulate the effect of noise and supports more kinds of quantum operations. Second, our simulator is highly efficient. It is designed with a two-level parallel structure so that it can be implemented efficiently on the distributed many-core Sunway TaihuLight supercomputer. Random quantum circuits with 40, 75 and 200 qubits can be simulated in the full, partial and single amplitude modes, respectively. As illustrative applications of the simulator, we present a quantum fast Poisson solver and an algorithm for quantum arithmetic that evaluates transcendental functions. Our simulator is expected to have broader applications in developing quantum algorithms in various fields.
Introduction
In recent years, tremendous technological progress has been made in the construction of quantum computers, especially with superconducting qubits^{1,2}. As these nascent quantum computers become competitive against classical computers in simulating general quantum circuits, an interesting race is coming to its climax. The quantum side is eager to accomplish the first demonstration of quantum supremacy^{1,3}, while the classical side tries to push back the classical simulation barrier as far as possible^{4,5,6,7}.
During the race, many novel methods and programs have been developed to simulate quantum circuits efficiently on conventional computers^{8,9,10,11}, including parallel platforms^{12,13,14,15,16}, FPGA-based hardware^{17,18}, etc. In fact, classical simulation of quantum computation is vital both for the verification of quantum computers and for the assessment of the correctness and performance of new quantum algorithms. The fundamental task of such simulation is to calculate all or a certain number of amplitudes of quantum states produced by a quantum circuit.
However, it is extremely expensive to simulate quantum computation classically because of the curse of dimensionality, i.e., the memory and time requirements grow exponentially with the number of qubits. For instance, to accurately simulate a quantum system with 50 qubits, one needs a classical computer with slightly more than 16 Petabytes of memory (with double precision). Moreover, increasing the number of qubits by one doubles the amount of memory required. Performing such a large-scale computation requires one to take advantage of state-of-the-art high-performance distributed computation.
In the present work, we develop a new quantum circuit simulator on the Sunway TaihuLight supercomputer. Although other simulators have been developed on supercomputers including Sunway TaihuLight^{12,14,16}, our simulator is designed to be a powerful tool for quantum algorithm research. The simulator consists of three mutually independent subprograms that calculate the full, partial and single amplitudes of a quantum state with three completely different methods. Therefore, a wide range of qubit numbers and circuit depths can be covered, so that users can choose the mode that best suits the quantum algorithm at hand. In addition, it can emulate the effect of noise and supports more quantum operations, such as the controlled and inverse operations on a group of gates, which are very useful in practical applications. On the other hand, the efficiency of the simulator is high. The algorithms of the simulator have a two-level parallel structure to take full advantage of the Sunway system architecture. We can simulate random quantum circuits with 40, 75 and 200 qubits in the full, partial and single amplitude modes, respectively. With this simulator, we further develop quantum algorithms for solving the Poisson equation and for quantum arithmetic that evaluates transcendental functions.
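The memory estimate above can be checked with a few lines of arithmetic. The sketch below assumes that each amplitude is stored as two 8-byte doubles (16 bytes); the function name is ours. Under that assumption 50 qubits need 2^54 bytes, i.e. 16 PiB, or about 18 × 10^15 bytes.

```python
def state_vector_bytes(n_qubits, bytes_per_amplitude=16):
    """Bytes needed to hold all 2**n complex amplitudes (two 8-byte doubles each)."""
    return (2 ** n_qubits) * bytes_per_amplitude

PIB = 2 ** 50  # one pebibyte, in bytes

mem_50 = state_vector_bytes(50)
mem_51 = state_vector_bytes(51)

print(mem_50 / PIB)      # size of a 50-qubit state vector in PiB
print(mem_51 // mem_50)  # growth factor when one qubit is added
```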
Simulation techniques
The present quantum circuit simulator consists of three mutually independent subprograms, referred to as the three working modes of the simulator, i.e., the full amplitude, partial amplitude and single amplitude modes. The fundamental methodologies of the three modes are completely different. They are, respectively, direct evolution of the quantum state, circuit partition by decomposing controlled-Z gates^{10}, and the complex undirected graphical model^{9}. In addition, noisy one- and two-qubit gates are defined to emulate the effect of noise. A description of the instruction set of our simulator and an illustrative example of the input and output are given in the supplementary material.
Sunway TaihuLight supercomputer
Before proceeding to the details of the simulation techniques, we first give a brief introduction to the classical hardware. Our simulator is developed on the Sunway TaihuLight at the National Supercomputer Center in Wuxi, China. The Sunway TaihuLight is so far the most powerful supercomputer in China. It can reach a peak performance of 125 PFlops, and ranked first on the TOP500 list four times in 2016 and 2017.
The supercomputer consists of 40,960 homegrown processors called SW26010. Each SW26010 processor contains four coregroups. Each coregroup contains one management processing element (hereafter called the master core) with a memory space of 8 GB, and 64 computing processing elements (hereafter called slave cores) in an 8 × 8 array^{19}. Within a coregroup, the 64 slave cores can communicate with each other in a few cycles. In the present work, one coregroup is set as a unique MPI process. A computational node hereafter refers to one coregroup, namely 1 master core plus 64 slave cores. The simulator is written in C++.
To take full advantage of the system architecture of Sunway TaihuLight, we implement the algorithms of the three working modes in a two-level parallel way. More specifically, the entire simulation is first divided equally among the available nodes, which is the first level of parallelism. In each node, corresponding to a unique MPI process, the computing task is further assigned equally to the 64 slave cores, while the master core is responsible for process control and I/O operations. This is the second level of parallelism. The specific designs of the algorithms are discussed in the subsequent sections.
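The two-level division of work can be pictured with a small index-range calculation. This is only a sketch under assumptions made here: work items are split into contiguous ranges, and the helper names `split_evenly` and `two_level_plan` are hypothetical rather than taken from the simulator.

```python
def split_evenly(total, parts):
    """Return `parts` contiguous (start, stop) ranges that cover range(total)."""
    base, extra = divmod(total, parts)
    ranges, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

def two_level_plan(n_items, n_nodes, cores_per_node=64):
    """Level 1: items -> nodes (MPI processes). Level 2: node share -> slave cores."""
    plan = []
    for node_start, node_stop in split_evenly(n_items, n_nodes):
        core_ranges = [(node_start + a, node_start + b)
                       for a, b in split_evenly(node_stop - node_start, cores_per_node)]
        plan.append(core_ranges)
    return plan

# 2**12 amplitudes on 4 nodes: 1024 per node, 16 per slave core.
plan = two_level_plan(2 ** 12, n_nodes=4)
print(plan[0][0], plan[3][63])
```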
Full amplitude mode
The full amplitude mode of the simulator is an instance of the so-called Schrödinger simulation. It is based on the direct evolution of the quantum state through the product of unitary operations, in contrast to linear combinations of unitary operations^{20}. All the information of the quantum state is precisely maintained and updated step by step throughout the simulation. The Schrödinger approach is straightforward, and it is fast when simulating low-width circuits. However, when processing many-qubit circuits, it requires a significant amount of RAM to store all amplitudes. In the present work, we use at most 16,384 computational nodes, roughly 10% of the computing resource of Sunway TaihuLight, and can simulate a quantum circuit with up to 40 qubits in this mode.
Now we use the single-qubit and controlled two-qubit operations as examples to illustrate the distributed implementation of Schrödinger simulation. It is well known that an n-qubit quantum state can be represented in Dirac notation and as a column vector as follows,
\(\left| \psi \right\rangle = \sum\nolimits_{i = 0}^{2^{n} - 1} {\alpha_{i} \left| {i_{n - 1} \cdots i_{k} \cdots i_{0} } \right\rangle } = \left( {\alpha_{0} ,\alpha_{1} , \ldots ,\alpha_{2^{n} - 1} } \right)^{T} ,\quad (1)\)
where the decimal and binary indices are related by \(i = i_{n - 1} \times 2^{n - 1} + \cdots + i_{k} \times 2^{k} + \cdots + i_{0} \times 2^{0}\). In practice, we store all the amplitudes α_{i} in memory during the simulation and update them according to the action of unitary operations.
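Let U^{k} denote a single-qubit gate acting on the kth qubit, namely i_{k} in Eq. (1), and let u_{00}, u_{01}, u_{10} and u_{11} denote its matrix elements. It can be easily verified that the amplitudes are updated in the following way,
\(\alpha^{\prime}_{i} = u_{00} \alpha_{i} + u_{01} \alpha_{i + 2^{k}} ,\quad \alpha^{\prime}_{i + 2^{k}} = u_{10} \alpha_{i} + u_{11} \alpha_{i + 2^{k}} ,\quad {\text{for all }}i{\text{ with }}i_{k} = 0.\quad (2)\)
Note that the amplitudes indexed by i_{k} = 1 are calculated when traversing the index i + 2^{k}. As can be seen from the equation, one action of U^{k} changes all the 2^{n} amplitudes. Thus, one single-qubit operation corresponds to a computation on the scale of 2^{n} additions and multiplications. The controlled two-qubit operation can be implemented similarly. Let CU^{q,k} denote a controlled two-qubit gate, where qubit q (k) is the control (target) qubit. That is to say, when i_{q} is 0, the gate CU^{q,k} does nothing; when i_{q} is 1, the gate performs the same transformation as Eq. (2). This can be formalized as
\(\alpha^{\prime}_{i} = u_{00} \alpha_{i} + u_{01} \alpha_{i + 2^{k}} ,\quad \alpha^{\prime}_{i + 2^{k}} = u_{10} \alpha_{i} + u_{11} \alpha_{i + 2^{k}} ,\quad {\text{for all }}i{\text{ with }}i_{q} = 1{\text{ and }}i_{k} = 0,\quad (3)\)
while the amplitudes with i_{q} = 0 are left unchanged.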
Here we remark that although the above single-qubit and controlled two-qubit operations are sufficient, as they form a universal set for quantum computation^{21}, our simulator supports more quantum gates and operations. They are very useful in the practical design of quantum circuits. In particular, the simulator supports arbitrary single-qubit rotation gates, controlled operations on a group of gates, and inverse operations on a group of gates, etc. (see supplementary material for details).
The above two equations are of great importance because they make the process of updating amplitudes amenable to parallelization and distribution. That is, they update amplitudes via 2^{n} computations of the form aα_{i} + bα_{j}, as shown in Eq. (3), rather than by multiplying the column vector by a full 2^{n} × 2^{n} matrix. Such equations can be implemented in a two-level parallel way. Specifically, all the amplitudes are divided equally among the nodes and stored in the corresponding master cores. Then the master core in each node calls the slave cores to update the amplitudes in parallel.
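The pairwise update described above can be sketched in plain Python. This is a minimal single-process illustration rather than the simulator's C++ code; the function names and the nested-list representation of a gate are assumptions made here. Each pair (α_i, α_{i+2^k}) is visited once, from its member with i_k = 0, and the controlled version additionally skips indices whose control bit is 0.

```python
import math

def apply_single_qubit_gate(amps, u, k):
    """Apply the 2x2 matrix u to qubit k of the state vector amps, in place."""
    stride = 1 << k
    for i in range(len(amps)):
        if i & stride:                 # handle each pair from its i_k = 0 member
            continue
        a0, a1 = amps[i], amps[i + stride]
        amps[i] = u[0][0] * a0 + u[0][1] * a1
        amps[i + stride] = u[1][0] * a0 + u[1][1] * a1

def apply_controlled_gate(amps, u, q, k):
    """Apply u to target qubit k only on indices whose control bit i_q is 1."""
    stride, control = 1 << k, 1 << q
    for i in range(len(amps)):
        if (i & stride) or not (i & control):
            continue
        a0, a1 = amps[i], amps[i + stride]
        amps[i] = u[0][0] * a0 + u[0][1] * a1
        amps[i + stride] = u[1][0] * a0 + u[1][1] * a1

S = 1 / math.sqrt(2)
H = [[S, S], [S, -S]]
X = [[0, 1], [1, 0]]

state = [1 + 0j, 0j, 0j, 0j]            # |00>, with index i = 2 * i_1 + i_0
apply_single_qubit_gate(state, H, 1)    # Hadamard on qubit 1
apply_controlled_gate(state, X, 1, 0)   # CNOT with control 1 and target 0
print(state)                            # amplitudes of a Bell state
```

Because every pair is independent of all others, the loop body can be distributed over nodes and slave cores without changing the result.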
In summary, the program of this mode proceeds in the following three steps:

1st: Configure the computation nodes. Then every node parses the script to obtain a linked list recording instructions of the quantum circuit.

2nd: Assign all the amplitudes equally to the nodes. The amplitudes are initialized as zero in the master core of each node.

3rd: The master core traverses every node of the linked list in turn, and prepares the computing parameters, including the matrix coefficients of the gate, the number of amplitudes of each node, the starting address of the target amplitude, etc. Then the master core assigns the task of updating the amplitudes equally to the 64 slave cores. The slave cores fetch the required data using the address information according to Eq. (2) or (3), compute the new amplitude values, and then send them back to the same position in the master core.
Partial amplitude mode
The partial amplitude mode uses a hybrid algorithm to simulate a quantum circuit with more than 50 qubits but of limited depth. Generally, in this mode the original quantum circuit is divided into several subcircuits with fewer qubits, which are then simulated independently using the same method as the full amplitude mode. With 16,384 computational nodes, we can simulate a quantum circuit with up to 75 qubits in this mode. Below is a brief introduction to the partition scheme of the circuit. More information can be found in our previous paper^{10}.
The controlled-Z gate can be decomposed into projection and single-qubit Z gates as follows,
\(CZ^{i,j} = P_{0}^{i} \otimes I^{j} + P_{1}^{i} \otimes Z^{j} ,\quad (4)\)
where \(P_{0} = \left| 0 \right\rangle \left\langle 0 \right|\) and \(P_{1} = \left| 1 \right\rangle \left\langle 1 \right|\) are the projection operators. The superscripts indicate that qubit i is the control qubit and qubit j the target qubit. On the left-hand side of the equation, qubits i and j are entangled, while on the right-hand side they are independent within each term. Therefore, after decomposing the CZ gate, the quantum states of qubits i and j can evolve independently, and then be recombined to get the final state. This turns out to be a very useful method of reducing the memory requirements when simulating a quantum circuit with many qubits.
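The decomposition of the CZ gate into projectors and Z gates can be checked numerically on two qubits. The sketch below writes the identity as CZ = P0 ⊗ I + P1 ⊗ Z with P0 = |0⟩⟨0| and P1 = |1⟩⟨1|, and then shows the recombination step on two one-qubit blocks: each branch evolves the blocks independently, and the branches are summed at the end. The helper names and the numerical block states are illustrative assumptions.

```python
def kron(a, b):
    """Kronecker product of two matrices given as nested lists."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def matvec(m, v):
    return [sum(m[r][c] * v[c] for c in range(len(v))) for r in range(len(m))]

def kron_vec(a, b):
    return [x * y for x in a for y in b]

I2 = [[1, 0], [0, 1]]
Z = [[1, 0], [0, -1]]
P0 = [[1, 0], [0, 0]]
P1 = [[0, 0], [0, 1]]
CZ = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, -1]]

# Operator identity: CZ = P0 (x) I + P1 (x) Z.
decomposed = add(kron(P0, I2), kron(P1, Z))

# Recombination on states: block A holds the control qubit, block B the target.
psi_a = [0.6, 0.8]
psi_b = [0.8, -0.6]
direct = matvec(CZ, kron_vec(psi_a, psi_b))
branch0 = kron_vec(matvec(P0, psi_a), psi_b)
branch1 = kron_vec(matvec(P1, psi_a), matvec(Z, psi_b))
recombined = [x + y for x, y in zip(branch0, branch1)]
print(decomposed == CZ, direct, recombined)
```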
Now we take a quantum circuit with 8 qubits and 8 layers of depth as an example to illustrate the partition scheme. The circuit is shown in Fig. 1. The circuit is made up of two blocks, that is, the upper block with qubits 0 to 3, and the lower one with the other qubits. The two blocks are entangled by the CZ gates in the 7th and 8th layers. The entanglement between the two blocks can be removed by decomposing the two CZ gates in turn, as shown in Fig. 1. After the decomposition, the original circuit results in four circuits, whose upper and lower blocks are disentangled. Then each of the four circuits can be divided into two subcircuits with half the number of qubits, which can be simulated independently. Therefore, the task of simulating the original circuit with 8 qubits is converted to the simulation of 8 independent subcircuits with 4 qubits each. The number of amplitudes stored in memory is reduced from 2^{8} to 2^{7}. Since the subcircuits are simulated in parallel, the time span of the simulation is also reduced.
There are also restrictions on the partition scheme. The gates crossing the dividing line should be controlled two-qubit gates, such as the CNOT and CZ gates, not gates like SWAP. Furthermore, the number of subcircuits grows exponentially with the number of decomposed CZ gates. For example, if there is one more CZ gate crossing the dividing line between qubits 3 and 4 in Fig. 1, the partition is no longer efficient. Therefore, this method is suitable for quantum circuits with low depth and a large sampling number (the large sampling number originates from the fact that all the subcircuits are simulated in the full amplitude mode).
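The trade-off between the number of decomposed gates and the memory saving can be made concrete with a short count. The sketch assumes an even number of qubits cut in the middle and that all subcircuit states are held at once; `partition_cost` is a hypothetical helper name.

```python
def partition_cost(n_qubits, n_cut_gates):
    """Number of subcircuits and of stored amplitudes after cutting in the middle."""
    half = n_qubits // 2
    n_subcircuits = 2 ** (n_cut_gates + 1)      # 2**c circuits, each cut in two
    stored_amplitudes = n_subcircuits * 2 ** half
    return n_subcircuits, stored_amplitudes

two_cuts = partition_cost(8, 2)     # the 8-qubit example with two CZ gates cut
three_cuts = partition_cost(8, 3)   # one more CZ gate crossing the dividing line
print(two_cuts, three_cuts, 2 ** 8)
```

With two cuts the 8-qubit example stores 2^7 amplitudes instead of 2^8; with a third cut the count is back at 2^8, which is why the partition stops paying off.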
In summary, the program for the partial amplitude mode proceeds in the following four steps:

1st: Configure the computation nodes. Then every node parses the script to extract the gates. Check whether each gate crossing the dividing line is a controlled two-qubit gate, and if so decompose it, which doubles the number of circuits. The dividing line is always set in the middle of the qubits.

2nd: Cut each of the final circuits into two subcircuits along the dividing line. There should be 2^{c+1} subcircuits generated, where c is the number of decomposed gates. Establish a linked list of quantum gates for each subcircuit.

3rd: Assign the task of simulating the subcircuits equally to the nodes. The result of assignment would be that one node simulates one subcircuit, one node simulates several subcircuits, or several nodes simulate one subcircuit. The simulations are implemented in the same way as the full amplitude mode.

4th: Combine the state of each subcircuit to get the final states.
Single amplitude mode
The single amplitude mode makes use of the undirected graphical model, which allows it to simulate quantum circuits with many more qubits. Broadly, the original quantum circuit is first mapped to an undirected graphical model, then the undirected graph is split into several graphs by fixing the values of some variables, and then the resulting graphs are processed in parallel by the vertical variable elimination algorithm.
The undirected graphical model is a way of interpreting the relation between the quantum gates and the change of the bit values of the qubit state. Naturally, the bit value of a state changes under the action of a sequence of quantum gates. We define a sequence of Boolean variables to describe the change. For example, when acted upon by the Pauli-X and H gates in sequence, the state \(\left| 0 \right\rangle\) is first changed to \(\left| 1 \right\rangle\), then to \(\frac{1}{\sqrt 2 }\left( {\left| 0 \right\rangle - \left| 1 \right\rangle } \right)\). The corresponding Boolean variables are then a_{0} = 0, a_{1} = 1, and a_{2} = {0, 1}, respectively. The undirected graph is constructed based on the Boolean variables and quantum gates. Specifically, each Boolean variable in the circuit corresponds to exactly one vertex in the graph, and one or multiple gates in the circuit result in one edge in the graph.
The rule for mapping a quantum circuit to an undirected graph is simple and easy to follow^{9}. It is summarized in four cases, as shown in Fig. 2. A diagonal one- or two-qubit gate does not change the Boolean variable, so the vertices corresponding to the same variable merge into one. For example, the CZ gate transforms the state \(\left| {11} \right\rangle\) to \(- \left| {11} \right\rangle\) without flipping any bit, so the input and output vertices are merged as shown in Fig. 2c. The crossed lines in the graph should be considered as one line, which corresponds to one gate, as shown in Fig. 2d. Figure 3 presents an example to further illustrate the mapping of a circuit to an undirected graph.
After getting the undirected graph, tensor techniques are used to process it. One edge in the graph corresponds to a particular tensor, and the number of vertices connected to the edge is the rank of the tensor. For example, the edge in Fig. 2d corresponds to a tensor T of rank 4, with 2^{4} elements indexed by \(T_{{a_{0} b_{0} a_{1} b_{1} }}\). The elements of tensor T are filled using U_{2n} in the lexicographical order of the index, such that T_{00,00} = (U_{2n})_{0,0}, T_{00,10} = (U_{2n})_{0,2}, T_{01,00} = (U_{2n})_{1,0}, T_{10,00} = (U_{2n})_{2,0} and so on.
There are two kinds of processes performed on the undirected graph, namely edge merging and vertex elimination. Edge merging means that two edges connected to the same vertex are merged into one. This amounts to merging two tensors that share a subscript into one. For instance, suppose that the edge between vertices b_{0} and b_{1} in Fig. 3b corresponds to a tensor \(A_{{b_{0} b_{1} }}\), and the edge between vertices b_{1} and d_{1} corresponds to \(B_{{b_{1} d_{1} }}\), then the two edges merge into one to give a higher-rank tensor \(C_{{b_{0} b_{1} d_{1} }} = A_{{b_{0} b_{1} }} B_{{b_{1} d_{1} }}\).
Vertex elimination reduces the number of vertices connected to a particular edge. This is actually a variant of tensor contraction. We do this using two different methods, of which one is a differential way and the other an integral way. In the differential method, the variable corresponding to a vertex is fixed to 0 or 1^{22}. For example, if the vertex b_{1} in Fig. 3b is fixed to 0, then the tensor \(B_{{b_{1} d_{1} }}\) is converted to \(B_{{0d_{1} }}\). Thus, the tensor rank is reduced from 2 to 1, and the number of elements from 2^{2} to 2. The expense of this method is that it doubles the number of graphs. That is, the graph needs to be computed twice, with the target variable being 0 and 1, respectively. In the integral method, all the elements of a tensor corresponding to a specific subscript are summed over to eliminate that index. For instance, the subscript b_{1} in tensor \(C_{{b_{0} b_{1} d_{1} }}\) is eliminated by \(C^{\prime}_{{b_{0} d_{1} }} = C_{{b_{0} 0d_{1} }} + C_{{b_{0} 1d_{1} }}\), so the vertex b_{1} is eliminated from the edge corresponding to tensor \(C_{{b_{0} b_{1} d_{1} }}\).
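Edge merging and the two vertex-elimination methods can be tried out on the small tensors A and B of the example. The sketch below stores tensors as dictionaries keyed by index tuples; the numerical entries and the function names are arbitrary choices made here for illustration. It also shows that summing the two graphs produced by the differential method gives the same tensor as the integral method.

```python
from itertools import product

BITS = (0, 1)

def merge_edges(a, b):
    """Edge merging: C[b0, b1, d1] = A[b0, b1] * B[b1, d1], without summation."""
    return {(b0, b1, d1): a[b0, b1] * b[b1, d1]
            for b0, b1, d1 in product(BITS, repeat=3)}

def eliminate_integral(c):
    """Integral method: sum over the shared index b1."""
    return {(b0, d1): c[b0, 0, d1] + c[b0, 1, d1]
            for b0, d1 in product(BITS, repeat=2)}

def eliminate_differential(a, b, value):
    """Differential method: fix b1 to one value; call once per value."""
    return {(b0, d1): a[b0, value] * b[value, d1]
            for b0, d1 in product(BITS, repeat=2)}

A = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}   # indexed by (b0, b1)
B = {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}   # indexed by (b1, d1)

C = merge_edges(A, B)                    # rank-3 tensor with 2**3 elements
integral = eliminate_integral(C)         # rank-2 tensor indexed by (b0, d1)
graph0 = eliminate_differential(A, B, 0)
graph1 = eliminate_differential(A, B, 1)
differential_sum = {key: graph0[key] + graph1[key] for key in graph0}
print(integral)
```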
In summary, the program for the single amplitude mode proceeds in the following four steps:

1st: Configure the computation nodes. Then every node parses the script to obtain a linked list recording instructions of the quantum circuit. Map the quantum circuit to the undirected graphical model using the linked list.

2nd: Eliminate the vertices in the first and last depth of the graph according to the specified initial and measurement states using the differential vertex elimination method. Since the initial and measurement states are certain, this step does not double the number of graphs.

3rd: Find the top N vertices with the largest number of connecting edges. Then perform the differential vertex elimination on the N vertices, which results in 2^{N} graphs. Assign the task of simulating the 2^{N} graphs equally to the nodes. (Note that eliminating the top N high-degree vertices may not be the best way of simplifying the graph. The treewidth of the graph is what really matters, but it is NP-complete to determine^{9,22}. For simplicity, we choose the top N high-degree vertices to remove at this step.)

4th: For each graph, eliminate all the vertices. Specifically, for each vertex, first merge all the connecting edges into one in the order of rank, and then eliminate this vertex using the integral method. Multiply the elements of the tensors corresponding to the remaining edges to obtain the amplitude of each graph. Sum the amplitudes of all graphs to get the final amplitude of the state to be measured.
Simulation of the effect of noise
In practical quantum devices, operations on qubits are imperfect. Various kinds of noise randomly induce errors on the states of qubits. In particular, in the coming NISQ era, quantum computers have noisy gates unprotected by quantum error correction^{23}. Thus, it is important to characterize the effect of noise by classical simulations.
The effect of noise can be described by a series of super operators {K_{1}, K_{2}, …, K_{s}}, which satisfy the relation \(\sum\nolimits_{i} {K_{i}^{\dag } } K_{i} = I\). For the single-qubit gate, we consider the following six kinds of noise,
The parameter p in the equation lies in [0, 1] and sets the noise intensity. Specifically, for the first three kinds of noise, the noise vanishes as p approaches 1; for the last three kinds of noise, the noise vanishes as p approaches 0.
For the two-qubit gate, the noise operators are defined as the Kronecker products of the single-qubit noise operators. For example, suppose the noise operators of the two single-qubit gates are {K_{1}, K_{2}} and {M_{1}, M_{2}}, respectively. Then, the noise operators of the two-qubit gate are \(\left\{ {K_{1} \otimes M_{1} ,\,K_{1} \otimes M_{2} ,\,K_{2} \otimes M_{1} ,\,K_{2} \otimes M_{2} } \right\}\).
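The Kronecker-product construction preserves the completeness relation, which can be verified numerically. As an assumed example only, the sketch uses a bit-flip channel {√p I, √(1 − p) X} for each qubit; this choice, like the helper names, is made here for illustration and is not necessarily one of the six kinds of noise considered above.

```python
import math

def kron(a, b):
    """Kronecker product of two matrices given as nested lists."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]

def dagger(m):
    """Conjugate transpose."""
    return [[complex(m[r][c]).conjugate() for r in range(len(m))]
            for c in range(len(m[0]))]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def completeness(ops):
    """Return sum_i K_i^dagger K_i, which should be the identity matrix."""
    dim = len(ops[0])
    total = [[0j] * dim for _ in range(dim)]
    for k in ops:
        term = matmul(dagger(k), k)
        total = [[total[r][c] + term[r][c] for c in range(dim)] for r in range(dim)]
    return total

def bit_flip(p):
    """Assumed example channel: identity with weight p, X with weight 1 - p."""
    s, t = math.sqrt(p), math.sqrt(1 - p)
    return [[[s, 0], [0, s]], [[0, t], [t, 0]]]

K = bit_flip(0.9)
M = bit_flip(0.8)
two_qubit_ops = [kron(k, m) for k in K for m in M]  # K1xM1, K1xM2, K2xM1, K2xM2
identity_check = completeness(two_qubit_ops)
print(len(two_qubit_ops), identity_check[0][0])
```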
In the program, the procedure of simulating the noise goes as follows:
1st. Determine the class of quantum gates specified to be noisy and the kind of noise. Let every operator of {K_{1}, K_{2}, …, K_{s}} act on the present quantum state using the same method as the full amplitude mode. Then calculate the squared norm of each resulting state, namely the probability associated with each operator.
2nd. Produce a random number between 0 and 1, and compare it with the cumulative sums of the above sequence of probabilities to determine which operator K_{i} is to be used. Multiply the matrix K_{i} with the quantum gate to obtain a new matrix, i.e., the noisy gate.
3rd. Update the state by the new matrix using the same method as the full amplitude mode. Finally, normalize the quantum state (the noisy gate may not be unitary).
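The three steps can be sketched for a single qubit as follows. Several simplifications are assumed here: the state is a plain Python list, the selected operator is applied directly to the state instead of first being multiplied into the gate, the random number is passed in as the argument `r` so that the example is reproducible, and the bit-flip operators are only an example. In practice `r` would come from a generator such as `random.random()`.

```python
import math

def matvec(m, v):
    return [sum(m[r][c] * v[c] for c in range(len(v))) for r in range(len(m))]

def norm_sq(v):
    return sum(abs(x) ** 2 for x in v)

def apply_noisy_step(state, kraus_ops, r):
    """Select one operator using r in [0, 1), apply it and renormalize the state."""
    candidates = [matvec(k, state) for k in kraus_ops]   # step 1: K_i acting on the state
    probs = [norm_sq(c) for c in candidates]             # step 1: probabilities
    chosen = len(candidates) - 1                         # fallback against rounding
    cumulative = 0.0
    for idx, p in enumerate(probs):                      # step 2: pick K_i
        cumulative += p
        if r < cumulative:
            chosen = idx
            break
    scale = math.sqrt(probs[chosen])                     # step 3: normalize
    return chosen, [x / scale for x in candidates[chosen]]

p = 0.9
kraus = [[[math.sqrt(p), 0], [0, math.sqrt(p)]],          # no error
         [[0, math.sqrt(1 - p)], [math.sqrt(1 - p), 0]]]  # bit flip

zero = [1.0, 0.0]
kept = apply_noisy_step(zero, kraus, r=0.50)     # falls in the first bin
flipped = apply_noisy_step(zero, kraus, r=0.95)  # falls in the second bin
print(kept, flipped)
```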
To sum up, we have discussed the basic principles of the full, partial and single amplitude modes, as well as the way of defining noisy gate to emulate the effect of noise. Subsequently, we introduce numerical results and applications of the present simulator.
Results and applications
To characterize the performance of the simulator, we first implement the random quantum circuits (RQCs) generated using the prescription of Google^{24}. Then we demonstrate the quantum circuits for solving the Poisson equation and for the quantum arithmetic of evaluating transcendental functions. Here, we remark that the quantum fast Poisson solver and quantum arithmetic algorithms are implemented mainly in the full amplitude mode, since these circuits have relatively few qubits and high depth. We leave applications that exploit the partial and single amplitude modes, as well as the noise-emulation function, to future work.
Implementation of RQCs
The full amplitude mode is the foundation of the other two modes, because the resulting subcircuits in the partial and single amplitude modes are finally simulated using the same method as the full amplitude mode. The main factor limiting the computing speed of the full amplitude mode is the data communication between nodes. According to Eqs. (2) and (3), when updating one amplitude α_{i}, one needs another amplitude α_{i+2^{k}}, which may be stored in another coregroup or another SW26010 processor. As shown in Fig. 4, for the one-qubit gate, computation on a state stored in one coregroup (node) is about ten times faster than on a state spread over different coregroups. On the other hand, whether the amplitudes are stored in one SW26010 processor or two has almost no influence.
For the partial amplitude mode, we simulate a sequence of RQCs with 4096 nodes. The running time is shown in Fig. 5. In addition to the number of qubits and the depth, the structure of the qubit lattice also has a big impact on the running time, as shown by the results for 60 qubits (6 × 10 and 5 × 12).
For the single amplitude mode, we simulate RQCs with 49, 110 and 200 qubits using 256 nodes. The running time is shown in Fig. 6. By taking advantage of the distributed computing system, we accomplished the simulation of circuits with up to 200 qubits and a depth of 21.
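Whether a gate on qubit k needs inter-node communication depends on how the stride 2^k compares with the number of amplitudes held per node. The sketch below assumes a contiguous block layout and a power-of-two number of nodes, neither of which is stated explicitly above; `partner_location` is a hypothetical helper.

```python
def partner_location(n_qubits, n_nodes, k):
    """Tell whether alpha_i and alpha_(i + 2**k) sit on the same node (block layout)."""
    per_node = 2 ** n_qubits // n_nodes   # amplitudes held by each node
    stride = 2 ** k
    if stride < per_node:
        return "local", 0
    # The partner lives on the node whose rank differs by this distance.
    return "remote", stride // per_node

low = partner_location(40, 16384, 10)    # stride 2**10 is below 2**26 per node
high = partner_location(40, 16384, 30)   # stride 2**30 is above 2**26 per node
print(low, high)
```

Under these assumptions, gates on the low-order qubits are purely local, while gates on the highest-order qubits pair every amplitude with one stored on another node.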
Quantum fast Poisson solver
The Poisson equation is a widely used partial differential equation across many areas of physics and engineering. For instance, when simulating the dynamic process of ocean currents, the Navier–Stokes equations^{25} can be reduced to the Poisson equation under certain conditions^{26}. Solving the Poisson equation thus constitutes the most computationally intensive part of the ocean current simulation. We develop a quantum algorithm for solving the multidimensional Poisson equation^{27}. It could provide an exponential speedup over its classical counterparts in terms of the dimension. Here, we remark that for the one-dimensional Poisson equation, there may exist more efficient quantum algorithms^{28} that could be implemented on near-term NISQ devices. We leave this point to future work.
The general idea of our quantum fast Poisson solver is straightforward. First, we discretize the Laplacian operator to a square matrix using the central difference approximation, and then solve the resulting linear system of equations using the Harrow–Hassidim–Lloyd (HHL) algorithm^{29}. Schematically, the algorithm is shown in Fig. 7. It consists of three main stages, i.e., phase estimation, controlled rotation and uncomputation. The complexity of our algorithm is \(O\left( {d\log^{2} (\varepsilon^{ - \alpha } )} \right)\) in qubits and \(O\left( {\kappa d\log^{3} (\varepsilon^{ - \alpha } )} \right)\) in quantum operations, where ε is the error of the solution, d the dimension of the Poisson equation, α > 0 a smoothness constant and κ the condition number of the discretized matrix. On the other hand, any direct or iterative classical algorithm^{30} has a cost of at least ε^{−αd}. Thus, our quantum Poisson solver could provide an exponential speedup over classical methods in terms of the dimension.
To demonstrate the correctness of the algorithm, we propose a simplified version of the circuit with four discretized points^{27}. The circuit consists of 38 qubits and 800 gates. It is simulated using the full amplitude mode, and the run time is 20 min with 4096 nodes. The input state is \(\frac{1}{\sqrt 2 }\left| {01} \right\rangle + \frac{1}{2}\left| {10} \right\rangle + \frac{1}{2}\left| {11} \right\rangle\). This corresponds to a Poisson equation with the solution (0.9053, 1.1036, 0.8018), which becomes (0.553, 0.674, 0.490) after normalization. The output state is \(0.551\left| {01} \right\rangle + 0.675\left| {10} \right\rangle + 0.491\left| {11} \right\rangle\), which is consistent with the exact solution with an error of less than 0.5%. The running results verify the correctness of our algorithm.
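The quoted reference solution can be reproduced classically. The sketch below assumes that the discretized matrix is the standard tridiagonal matrix with entries (−1, 2, −1) acting on the three unknowns, with the grid-spacing factor absorbed into the right-hand side; with this assumption the input amplitudes (1/√2, 1/2, 1/2) give exactly the values stated above. The solver is the ordinary Thomas algorithm, not the quantum circuit.

```python
import math

def solve_tridiagonal_poisson(b):
    """Solve tridiag(-1, 2, -1) x = b with the Thomas algorithm."""
    n = len(b)
    diag = [2.0] * n
    rhs = list(b)
    for i in range(1, n):                    # forward elimination
        w = -1.0 / diag[i - 1]
        diag[i] -= w * -1.0
        rhs[i] -= w * rhs[i - 1]
    x = [0.0] * n
    x[-1] = rhs[-1] / diag[-1]
    for i in range(n - 2, -1, -1):           # back substitution
        x[i] = (rhs[i] + x[i + 1]) / diag[i]
    return x

b = [1 / math.sqrt(2), 0.5, 0.5]             # amplitudes of the input state
x = solve_tridiagonal_poisson(b)
norm = math.sqrt(sum(v * v for v in x))
x_normalized = [v / norm for v in x]

simulated = [0.551, 0.675, 0.491]            # output amplitudes quoted in the text
max_rel_error = max(abs(s - e) / e for s, e in zip(simulated, x_normalized))
print(x, x_normalized, max_rel_error)
```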
Quantum arithmetic of transcendental functions
Quantum arithmetic in the computational basis constitutes the fundamental component of many circuit-based quantum algorithms. A vast literature provides quantum circuits for evaluating algebraic functions, including the addition^{31}, multiplication^{32}, reciprocal^{33} and square root^{34} operations, etc. However, studies of the higher-level transcendental functions are scarce^{35,36}. We develop a novel quantum algorithm, the qFBE (quantum Function-value Binary Expansion) method, to evaluate the transcendental functions^{37}. The qFBE provides a unified and programmed solution for the evaluation of logarithmic, exponential, trigonometric and inverse trigonometric functions.
Our qFBE method can be used to evaluate two classes of functions: Class 1, including log_{2}(x), ln(x), arccos(x), arcsin(x), arccot(x) and arctan(x), and Class 2, including 2^{x}, e^{x}, cos(x), sin(x), cot(x) and tan(x). More specifically, suppose the functions of Class 1 are defined as f: I → [0,1] with I ⊆ R, then the function value can be expanded in a binary form as follows^{38},
where D_{0} and D_{1} are subintervals of I with D_{0} ∪ D_{1} = I, D_{0} ∩ D_{1} = Ø; r_{0} and r_{1} are functions defined as r_{0}: D_{0} → I, r_{1}: D_{1} → I. On the other hand, the functions of Class 2 can be approximated in the following way^{37},
Apparently, Eq. (6) outputs the function value digit by digit in a recursive way, while Eq. (7) approximates the function value step by step in an iterative way.
The complexity of evaluating transcendental functions by the qFBE method is nO(m) in qubits and nO(m^{2}) in quantum gates, where n is the number of qubits that encode the input or output and m the number of qubits that store the intermediate values. In the worst case, the cost of our method is comparable with the best known results^{36}, while when the binary input has a small number of bits, our method costs much less. Furthermore, all digits of the binary output can be exact, which makes the control of error propagation easy. The qFBE method provides a unified and programmed solution for most transcendental functions, and the circuits are compact and modular, so they are easy to implement on a virtual or a future real quantum machine.
The quantum circuits for evaluating functions of Class 1 and 2 are shown in Fig. 8 (a) and (b), respectively. For functions of Class 1, the circuit consists of (n − 1) modules, which implement the recursions in Eq. (6). Each module outputs one bit of the solution. For functions of Class 2, the circuit consists of n modules, which approximate the function value step by step according to Eq. (7). The last module outputs the final solution. We present the complete quantum circuits for all the functions in Class 1 and 2 and demonstrate the correctness of these circuits on the simulator, including those for arccot(x)/π, cos(πx), arccos(x)/π, cot(πx), 2^{x} and log_{2}(x)^{37}. The running results verify the correctness of our algorithm.
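A classical analogue of such a digit-by-digit recursion is the well-known squaring algorithm for log_{2}(x). The sketch below is an illustration chosen here, not the quantum circuit of ref. 37: it assumes I = [1, 2), D_{0} = [1, √2) with r_{0}(x) = x^{2}, and D_{1} = [√2, 2) with r_{1}(x) = x^{2}/2, and it works in floating point rather than on qubit registers.

```python
import math

def log2_bits(x, n_bits):
    """Binary digits of log2(x) for 1 <= x < 2, one digit per squaring step."""
    bits = []
    for _ in range(n_bits):
        x = x * x
        if x >= 2.0:        # x was in D1: emit 1 and continue with x**2 / 2
            bits.append(1)
            x /= 2.0
        else:               # x was in D0: emit 0 and continue with x**2
            bits.append(0)
    return bits

def bits_to_fraction(bits):
    """Value of the binary fraction 0.b1 b2 b3 ..."""
    return sum(b / 2 ** (j + 1) for j, b in enumerate(bits))

first_bits = log2_bits(1.5, 4)
approx = bits_to_fraction(log2_bits(1.5, 20))
print(first_bits, approx, math.log2(1.5))
```

Each pass through the loop plays the role of one module that emits one output bit and hands the transformed argument to the next module.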
Conclusions
We have developed an efficient quantum circuit simulator on the Sunway TaihuLight supercomputer. The simulator possesses three working modes, capable of calculating the full, partial and single amplitudes of a quantum state. The three modes are built using entirely different methodologies. They are the direct evolution of quantum states, circuit partition by decomposing controlled-Z gates and the complex undirected graphical model. Our simulator can emulate the effects of noise, and it supports many kinds of useful quantum gates and operations. To make full use of the Sunway distributed system, the simulation is implemented in a two-level parallel way. With 16,384 computational nodes, roughly 10% of the computing resource of the Sunway, random quantum circuits with up to 40, 75 and 200 qubits can be simulated in the full, partial and single amplitude modes, respectively.
Based on the simulator, we further developed quantum algorithms for solving the Poisson equation and for quantum arithmetic that evaluates transcendental functions. The present quantum fast Poisson solver takes the HHL algorithm as the framework, and provides an exponential speedup over classical methods in terms of the dimension. The qFBE method provides a unified and programmed way of evaluating the transcendental functions, including the logarithmic, exponential, arccosine, arcsine, cosine, sine, arccotangent, arctangent, cotangent and tangent functions.
For future work, we will (1) advance the study of quantum Poisson solver to further reduce the algorithm complexity and quantify the effect of noise, and (2) optimize the qFBE circuits by selecting the proper circuits of evaluating algebraic functions. Furthermore, we will expand the applications of the present simulator to other fields, like variational quantum algorithms and quantum machine learning.
References
 1.
Arute, F. et al. Quantum supremacy using a programmable superconducting processor. Nature 574, 505 (2019).
 2.
Kjaergaard, M. et al. Superconducting qubits: current state of play. Annu. Rev. Condens. Matter. Phys. 11, 369 (2020).
 3.
Dalzell, A. M., Harrow, A. W., Koh, D. E. & Placa, R. L. L. How many qubits are needed for quantum computational supremacy?. Quantum 4, 264 (2020).
 4.
Pednault, E., Gunnels, J.A., Nannicini, G., Horesh, L. & Wisnieff, R. Leveraging secondary storage to simulate deep 54-qubit Sycamore circuits. arXiv, 1910.09534v2 (2019).
 5.
Villalonga, B. et al. Establishing the quantum supremacy frontier with a 281 Pflops/s simulation. Quantum Sci. Technol. 5, 3 (2020).
 6.
Gray, J. & Kourtis, S. Hyper-optimized tensor network contraction. arXiv, 2002.01935 (2020).
 7.
Huang, C., et al. Classical simulation of quantum supremacy circuits. arXiv, 2005.06787 (2020).
 8.
Pednault, E., et al. Breaking the 49-qubit barrier in the simulation of quantum circuits. arXiv, 1710.05867 (2017).
 9.
Boixo, S., Isakov, S. V., Smelyanskiy, V. N. & Neven, H. Simulation of low-depth quantum circuits as complex undirected graphical models. arXiv, 1712.05384v2 (2018).
 10.
Chen, Z.Y. et al. 64-qubit quantum circuit simulation. Sci. Bull. 63, 964 (2018).
 11.
Zulehner, A. & Wille, R. Advanced simulation of quantum computations. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 38, 5 (2019).
 12.
Li, R.L., Wu, B.J., Ying, M.S., Sun, X.M. & Yang, G.W. Quantum supremacy circuit simulation on Sunway TaihuLight. IEEE. Trans. Parallel. Distrib. Syst. 31, 4 (2020).
 13.
Jones, T., Brown, A., Bush, I. & Benjamin, C. QuEST and high performance simulation of quantum computers. Sci. Rep. 9, 10736 (2019).
 14.
Raedt, H. D. et al. Massively parallel quantum computer simulator, eleven years later. Comput. Phys. Commun. 237, 47 (2019).
 15.
Guo, C. et al. General-purpose quantum circuit simulator with projected entangled-pair states and the quantum supremacy frontier. Phys. Rev. Lett. 123, 190501 (2019).
 16.
Chen, M.C. et al. Quantum-teleportation-inspired algorithm for sampling large random quantum circuits. Phys. Rev. Lett. 124, 080502 (2020).
 17.
Pilch, J. & Długopolski, J. An FPGA-based real quantum computer emulator. J. Comput. Electron. 18, 329 (2019).
 18.
Mahmud, N., El-Araby, E. & Caliga, D. Scaling reconfigurable emulation of quantum algorithms at high precision and high throughput. Quantum Eng. 1, e19 (2019).
 19.
Fu, H.H. et al. The Sunway TaihuLight supercomputer: system and applications. Sci. China Inf. Sci. 59, 072001 (2016).
 20.
Long, G.L. General quantum interference principle and duality computer. Commun. Theor. Phys. 45, 825 (2006).
 21.
Nielsen, M. A. & Chuang, I. L. Quantum Computation and Quantum Information 171–202 (Cambridge University Press, Cambridge, 2010).
 22.
Chen, J.X., Zhang, F., Huang, C., Newman, M. & Shi, Y.Y. Classical simulation of intermediate-size quantum circuits. arXiv, 1805.01450v2 (2018).
 23.
Preskill, J. Quantum Computing in the NISQ era and beyond. Quantum 2, 79 (2018).
 24.
Boixo, S. et al. Characterizing quantum supremacy in near-term devices. Nat. Phys. 14, 595–600 (2018).
 25.
Lukaszewicz, G. & Kalita, P. Navier-Stokes Equations: An Introduction with Applications (Springer International Publishing, Cham, 2016).
 26.
Steijl, R. & Barakos, G. N. Parallel evaluation of quantum algorithms for computational fluid dynamics. Comput. Fluids 173, 22–28 (2018).
 27.
Wang, S. B. et al. Quantum fast Poisson solver: the algorithm and complete and modular circuit design. Quantum Inf. Process 19, 170 (2020).
 28.
Wang, S. B., et al. A quantum Poisson solver implementable on NISQ devices. arXiv, 2005.00256 (2020).
 29.
Harrow, A. W., Hassidim, A. & Lloyd, S. Quantum algorithm for linear systems of equations. Phys. Rev. Lett. 103, 150502 (2009).
 30.
Ritter, K. & Wasilkowski, G. W. On the average case complexity of solving Poisson equations. Lect. Appl. Math. 32, 677 (1996).
 31.
Draper, T. G., Kutin, S. A., Rains, E. M. & Svore, K. M. A logarithmic-depth quantum carry-lookahead adder. Quantum Inf. Comput. 6, 351 (2006).
 32.
Rines, R. & Chuang, I. High Performance Quantum Modular Multipliers. arXiv, 1801.01081 (2018).
 33.
Thapliyal, H., Munoz-Coreas, E., Varun, T.S.S. & Humble, T.S. Quantum circuit designs of integer division optimizing T-count and T-depth. arXiv, 1809.09732 (2018).
 34.
Munoz-Coreas, E. & Thapliyal, H. T-count and qubit optimized quantum circuit design of the non-restoring square root algorithm. ACM J. Emerg. Technol. Comput. Syst. 14, 3 (2018).
 35.
Bhaskar, M. K., Hadfield, S., Papageorgiou, A. & Petras, I. Quantum algorithms and circuits for scientific computing. Quantum Inf. Comput. 16, 197 (2016).
 36.
Häner, T., Roetteler, M. & Svore, K.M. Optimizing quantum circuits for arithmetic. arXiv, 1805.12445 (2018).
 37.
Wang, S. B. et al. Quantum circuits design for evaluating transcendental functions based on a function-value binary expansion method. Quantum Inf. Process 19, 347 (2020).
 38.
Borwein, J. M. & Girgensohn, R. Addition theorems and binary expansions. Can. J. Math. 47, 262 (1995).
Acknowledgements
We are very grateful to the National Supercomputing Center in Wuxi for providing the computing resources. The present work is financially supported by the National Natural Science Foundation of China (Grant Nos. 61575180, 61701464, 11475160) and the Pilot National Laboratory for Marine Science and Technology (Qingdao).
Author information
Contributions
Z.W. designed the quantum algorithms for applications, participated partially in the design of the simulator and prepared the manuscript. Z.C. wrote the simulation programs and tested the simulator. S.W. and W.L. designed the quantum algorithms for applications and tested the simulator. Y.G., G.G. and Z.W. planned, organized and supervised the project. All authors discussed the results and reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wang, Z., Chen, Z., Wang, S. et al. A quantum circuit simulator and its applications on Sunway TaihuLight supercomputer. Sci Rep 11, 355 (2021). https://doi.org/10.1038/s41598-020-79777-y