In recent years, tremendous technological progress has been made in the construction of quantum computers, especially with superconducting qubits1,2. As these nascent quantum computers become competitive against classical computers in simulating general quantum circuits, an interesting race come to the climax. The quantum beings are eager to accomplish the first demonstration of quantum supremacy1,3, while classical beings try to push back the classical simulation barrier as far as possible4,5,6,7.

During the race, many novel methods and programs are developed to simulate quantum circuits efficiently on conventional computers8,9,10,11, including parallel platforms12,13,14,15,16, FPGA-based hardware17,18, etc. In fact, classical simulation of quantum computation is vital both for the verification of quantum computers and for the assessment of the correctness and performance of new quantum algorithms. The fundamental task of such simulation is to calculate all or a certain number of amplitudes of quantum states produced by a quantum circuit.

However, it is extremely expensive to simulate quantum computation classically because of the curse of dimensionality, i.e., the memory and time requirements grow exponentially with the number of qubits. For instance, to accurately simulate a quantum system with 50 qubits, one needs a classical computer with slightly more than 16 Petabytes of memory (with double precision). Moreover, increasing the number of qubits by one requires a doubling of the amount of memory space. Performing such a large-scale computation requires one to take advantage of the state-of-the-art high-performance distributed computation.

In the present work, we develop a new quantum circuit simulator on the Sunway TaihuLight supercomputer. Albeit other simulators have been developed on supercomputers including Sunway TaihuLight12,14,16, our simulator is designed to be a powerful tool for quantum algorithm research. The simulator consists of three mutually independent sub-programs to calculate the full, partial and single amplitudes of a quantum state with three completely different methods. Therefore, a wide range of number of qubits and circuit depths can be covered. This could provide choices when people execute quantum algorithms of different fields. In addition, it can emulate the effect of noise and support more quantum operations, such as the controlled and inverse operations on a group of gates, which are very useful in practical applications. On the other hand, the efficiency of the simulator is high. The algorithms of the simulator has a two-level parallel structure to fully take advantage of the Sunway system architecture. We can simulate random quantum circuits with 40, 75 and 200 qubits on the full, partial and single amplitude, respectively. With this simulator, we further develop quantum algorithms for solving the Poisson equations and for quantum arithmetic of evaluating transcendental functions.

Simulation techniques

The present quantum circuit simulator consists of three mutually independent sub-programs, referred to as three working modes of the simulator, i.e. full amplitude, partial amplitude and single amplitude mode. The fundamental methodologies for the three modes are completely different. They are, respectively, direct evolution of quantum state, circuit partition by decomposing controlled-Z gate10, and the complex undirected graphical model9. In addition, noisy one- and two-qubit gates are defined to emulate the effect of noise. A description of the instruction set of our simulator and an illustrative example of the input and output are given in the supplementary material.

Sunway TaihuLight supercomputer

Before proceeding to the details of the simulation techniques, we first give a brief introduction of the classical hardware. Our simulator is developed based on the Sunway TaihuLight at the National Supercomputer Center in Wuxi, China. The Sunway TaihuLight is so far the most powerful supercomputer in China. It can reach a peak performance of 125 PFlops, and had ranked the first in the TOP500 list for four times in the years of 2016 and 2017.

The supercomputer consists of 40,960 homegrown processors called SW26010. Each SW26010 processor contains four core-groups. Each core-group contains one management processing element (hereafter called as master core) with a memory space of 8 GB, and 64 computing processing elements (hereafter called as slave core) in an 8 × 8 array19. Within a core-group, the 64 slave cores can communicate with each other in a few cycles. In the present work, one core-group is set as a unique MPI process. When mentioning a computational node, it refers to one core-group, namely 1 master core plus 64 slave cores. The simulator is written by C++ language.

To take full advantage of the system architecture of Sunway TaihuLight, we implement algorithms of the three working modes in a two-level parallel way. More specifically, the entire simulation are first divided equally to the available nodes, which is the first level of parallel. In each node corresponding to a unique MPI process, the computing task is further assigned to the 64 slave cores equally, while the master core is responsible for the process control and I/O operation. This is the second level of parallel. The specific designs of algorithms are discussed in the subsequent sections.

Full amplitude mode

The full amplitude mode of the simulator is an instance of the so-called Schrodinger simulation. It is based on the direct evolution of quantum state through the product of unitary operations, contrasting to the linear combinations of unitary operations20. All the information of the quantum state is precisely maintained and updated step-by-step throughout the simulation. The Schrodinger approach is straightforward, and it could provide a great speed in simulating low-width circuits. However, when processing many-qubits circuits, it requires a significant amount of RAM to store all amplitudes. In the present work, we use at most 16,384 computational nodes, roughly 10% of the computing resource of Sunway TaihuLight, and can simulate a quantum circuit with up to 40 qubits on this mode.

Now we use the single-qubit and controlled two-qubit operations as examples to illustrate the distributed implementation of Schrodinger simulation. It is well known that an n-qubit quantum state can be represented by Dirac notations and column vectors as follows,

$$ \begin{aligned} \left| \varphi \right\rangle & = \sum\limits_{i = 0}^{{2^{n} - 1}} {\alpha_{i} \left| i \right\rangle } = \sum\limits_{{i_{n - 1} \cdots i_{0} = \left\{ {0,1} \right\}^{n} }} {\alpha_{i} \left| {i_{n - 1} \cdots i_{k} \cdots i_{0} } \right\rangle } \quad (Dirac). \\ & = (\alpha_{0} ,\alpha_{1} , \cdots ,\alpha_{{2^{n} - 1}} )^{T} \quad (column\;vector) \\ \end{aligned} $$

where the decimal and binary index are related by \(i = i_{n - 1} \times 2^{n - 1} + \cdots + i_{k} \times 2^{k} + \cdots + i_{0} \times 2^{0}.\). In practice, we store all the amplitudes αi in the memory during the simulation and update them according to the action of unitary operations.

Let Uk represents a single-qubit gate acting on the kth qubit, namely ik in Eq. (1). It can be easily verified that the amplitudes can be updated in the following way,

$$ \begin{gathered} \left( {\begin{array}{*{20}l} {\alpha^{\prime}_{i} } \\ {\alpha^{\prime}_{{i + 2^{k} }} } \\ \end{array} } \right) = U^{k} \left( {\begin{array}{*{20}c} {\alpha_{i} } \\ {\alpha_{{i + 2^{k} }} } \\ \end{array} } \right) = \left( {\begin{array}{*{20}c} a & b \\ c & d \\ \end{array} } \right)\left( {\begin{array}{*{20}c} {\alpha_{i} } \\ {\alpha_{{i + 2^{k} }} } \\ \end{array} } \right) = \left( {\begin{array}{*{20}c} {a\alpha_{i} + b\alpha_{{i + 2^{k} }} } \\ {c\alpha_{i} + d\alpha_{{i + 2^{k} }} } \\ \end{array} } \right). \hfill \\ for\;all\;(i)_{10} = (i_{n - 1} \cdots i_{k} \cdots i_{0} )_{2} \;with\;i_{k} = 0. \hfill \\ \end{gathered} $$

Note that the amplitudes indexed by ik = 1 are calculated when traversing the index of i + 2k. As can be seen from the equation, for one action of Uk, all the 2n amplitudes are changed. Thus, the one single-qubit operation corresponds to a computation scale of 2n additions and multiplications. The controlled two-qubit operation can be implemented similarly. Let CUq,k represents a controlled two-qubit gate, where the qubit q (k) is the control (target) bit. That is to say, when iq is zero, the gate CUq,k will do nothing; when iq is 1, the gate performs the same transformation as Eq. (2). This can be formalized as

$$ \begin{gathered} \left\{ {\begin{array}{*{20}c} {\alpha^{\prime}_{i} = \alpha_{i} ,} & {with\;i_{q} = 0\;;} \\ {\left( {\begin{array}{*{20}c} {\alpha^{\prime}_{i} } \\ {\alpha^{\prime}_{{i + 2^{k} }} } \\ \end{array} } \right) = \left( {\begin{array}{*{20}c} {a\alpha_{i} + b\alpha_{{i + 2^{k} }} } \\ {c\alpha_{i} + d\alpha_{{i + 2^{k} }} } \\ \end{array} } \right),} & {with\;i_{q} = 1,\;i_{k} = 0} \\ \end{array} } \right. \hfill \\ for\;all\;(i)_{10} = (i_{n - 1} \cdots i_{q} \cdots i_{k} \cdots i_{0} )_{2} \;or\;(i_{n - 1} \cdots i_{k} \cdots i_{q} \cdots i_{0} )_{2} \hfill \\ \end{gathered} $$

Here we remark that although the above single-qubit and controlled two-qubit operations are enough as they form a universal set for quantum computation21, our simulator could support more quantum gates and operations. They are very useful in the practical design of quantum circuits. Particularly, the simulator supports arbitrary single-quit rotation gates, controlled operation on a group of gates, and inverse operation on a group of gates, etc. (see supplementary material for details).

The above two equations are of great importance because they lend the process of updating amplitudes to parallelization and distribution. That is, they update amplitudes via 2n computations of i + j as shown in Eq. (3), not by multiplying a full 2n × 2n matrix on the column vector. Such equations can be implemented in a way of two-level parallel. Specially, all the amplitudes are divided equally to the nodes and stored in the corresponding master cores. Then the master core in each node calls the slave cores to update the amplitudes in parallel.

In summary, the program of this mode proceeds in the following three steps:

  • 1st: Configure the computation nodes. Then every node parses the script to obtain a linked list recording instructions of the quantum circuit.

  • 2nd: Assign all the amplitudes equally to the nodes. The amplitudes are initialized as zero in the master core of each node.

  • 3rd: The master core traverses every node of the linked list in turn, and prepares the computing parameters, including the matrix coefficients of the gate, the number of amplitudes of each node, starting address of the target amplitude, etc. Then the master core assign the task of updating the amplitudes equally to the 64 slave cores. The slave cores get the requited data using the address information according to Eqs. (2) or (3), and compute the new amplitude values, and then sent them back to the same position in the master core.

Partial amplitude mode

The partial amplitude mode use a hybrid algorithm to simulate a quantum circuit with more than 50 qubits but of limited depth. Generally, in this mode the original quantum circuit are divided into several sub-circuits with less qubits, which are then simulated independently using the same method as the full amplitude mode. With 16,384 computational nodes, we can simulate a quantum circuit with up to 75 qubits under this mode. Below is a brief introduction of the partition scheme of the circuit. More information can be found in our previous paper10.

The controlled-Z gates can be decomposed into the projection and single-qubit Z gates as follows,

$$ CZ^{i,j} = P_{0}^{i} \otimes I^{j} + P_{1}^{i} \otimes Z^{j} = \left( {\begin{array}{*{20}c} 1 & 0 \\ 0 & 0 \\ \end{array} } \right)^{i} \otimes I^{j} + \left( {\begin{array}{*{20}c} 0 & 0 \\ 0 & 1 \\ \end{array} } \right)^{i} \otimes Z^{j} . $$

The superscripts represent that qubit i is the control qubit and qubit j the target qubit. On the left hand side of the equation, qubits i and j are entangled, while on the right hand side they are independent. Therefore, after decomposing the CZ gate, the quantum states of qubits i and j can evolve independently, and then be recombined to get the final state. This turns out to be a very useful method of reducing the memory requirements when simulating a quantum circuit with many qubits.

Now we take a quantum circuit with 8 qubits and 8 layers of depth as an example to illustrate the partition scheme. The circuit is shown in Fig. 1. The circuit is made up of two blocks, that is, the upper block with qubits from 0 to 3, and the lower one with the other qubits. The two blocks are entangled by the CZ gates in 7th and 8th layer. The entanglement between the two blocks can be dismissed by decomposing the two CZ gates in turn, as shown in Fig. 1. After the decomposition, the original circuit results in four circuits, whose upper and lower blocks are untangled. Then each of the four circuits can be divided into two sub-circuits with a half number of qubits, which can be simulated independently. Therefore, the task of simulating the original circuit with 8 qubits is converted to the simulation of 8 independent sub-circuits with 4 qubits. The number of amplitudes stored in the memory is reduced from 28 to 27. Since the sub-circuits are simulated in a parallel way, the time span of the simulation is also reduced.

Figure 1
figure 1

The partition scheme for a quantum circuit with 8 qubits and 8 depths10. The first line shows the original circuit. In the second line, the CZ gate in 7th depth is decomposed to projection and single-qubit Z gate, then the original circuit converts to two circuits. In the third line, the CZ gate in 8th depth is further decomposed, and the number of circuit is again double. In each of the final 4 circuits, the upper and lower parts is untangled, and they can be simulated independently.

There are also restrictions on the partition scheme. The gates crossing the dividing line should be the controlled two-qubit gate, such as the CNOT and CZ gates, not the gate like SWAP. Furthermore, the number of sub-circuits grows exponentially with the number of decomposed CZ gates. For example, if there is one more CZ gate crossing the dividing line between qubits 3 and 4 in Fig. 1, the partition is not efficient. Therefore, this method is suitable for quantum circuits with low depth and large sampling number (the large sampling number is originate from the fact that all the sub-circuits are simulated on the full amplitude mode).

In summary, the program for the partial amplitude mode proceeds in the following four steps:

  • 1st: Configure the computation nodes. Then every node parses the script to extract the gates. Judge whether the gates crossing the dividing line is the controlled two-qubit gates, and decompose it by doubling the circuit. The dividing line is always set to be in the middle of qubits.

  • 2nd: Cut each of the final circuits into two sub-circuits along the dividing line. There should be 2c+1 sub-circuits generated, where c is the number of decomposed gates. Establish a linked list of quantum gates for each sub-circuit.

  • 3rd: Assign the task of simulating the sub-circuits equally to the nodes. The result of assignment would be that one node simulates one sub-circuit, one node simulates several sub-circuits, or several nodes simulate one sub-circuit. The simulations are implemented in the same way as the full amplitude mode.

  • 4th: Combine the state of each sub-circuit to get the final states.

Single amplitude mode

The single amplitude mode makes use of undirected graphical model to be capable of simulating quantum circuit with much more qubits. Broadly, the original quantum circuit is first mapped to an undirected graphical model, then the undirected graph is split into several ones by fixing the value of variables, and then the resulting graphs are processed in parallel by the vertical variable elimination algorithm.

The undirected graph model is a way of interpreting the relation between the change of bit values of qubit state and the quantum gates. Naturally, the bit of state will change with actions of a sequence of quantum gates. We define a sequence of Boolean variables to describe the change. For example, being acted upon by the Pauli-X and H gate in sequence, the state \(\left| 0 \right\rangle\) will be first changed to \(\left| 1 \right\rangle\), then to \({1 \mathord{\left/ {\vphantom {1 {\sqrt 2 }}} \right. \kern-\nulldelimiterspace} {\sqrt 2 }}\left( {\left| 0 \right\rangle - \left| 1 \right\rangle } \right)\). Then the corresponding Boolean variables are a0 = 0, a1 = 1, and a2 = {0, 1}, respectively. The undirected graph is constructed based on the Boolean variables and quantum gates. Specially, each Boolean variable in the circuit corresponds exactly to one vertex in the graph, and one or multiple gates in the circuit result in one edge in the graph.

The rule of mapping a quantum circuit to an undirected graph is simple and easy to follow9. It is summarized to four cases as shown in Fig. 2. For the diagonal one- or two-qubit gate, it does not change the Boolean variable, so the vertices corresponding to the same variable merge into one. For example, the CZ gate will transform the state \(\left| {11} \right\rangle\) to \(- \left| {11} \right\rangle\) without flipping of the bit, so the input and output vertices are merged as shown in Fig. 2c. The cross lines in the graph should be considered as one line, which corresponds to one gate, as shown in Fig. 2d. Figure 3 presents an example to further illustrate the mapping of a circuit to undirected graph.

Figure 2
figure 2

The quantum circuit representations of qubits and gates, and the corresponding undirected graphical model. (a) The diagonal one-qubit gate, (b) the non-diagonal one-qubit gate, (c) the diagonal two-qubit gate, (d) the non-diagonal two-qubit gate. The Boolean variables of a0, a1, b0 and b1 represent the bit value of the state, which is 0 or 1. The vertices in the graph corresponds to the Boolean variables in the circuit, and the edges corresponds to the gates.

Figure 3
figure 3

An illustration example of mapping a quantum circuit to the undirected graphical model. (a) A quantum circuit adapted from the part of Fig. 1, (b) the corresponding undirected graph. The triangle symbols are used to explain the change of Boolean variables along the worldline of qubit 0. Note that since the cross lines are considered as one line in the undirected graph, the vertices are rearranged to avoid the false crossover.

After getting the undirected graph, tensor techniques are used to process it. One edge in the graph corresponds to a particular tensor, and the number of vertices connecting to the edge is the rank of the tensor. For example, the edge in Fig. 2d corresponds to a tensor T of rank 4, with 24 elements indexed by \(T_{{a_{0} b_{0} a_{1} b_{1} }}\). The elements of tensor T are filled using U2n in the lexicographical order of the index, such as that T00,00 = (U2n)0,0, T00,10 = (U2n)0,2, T01,00 = (U2n)1,0, T10,00 = (U2n)2,0 and so on.

There are two kinds of processes performed on the undirected graph, which are edge merging and vertex elimination. Edge merging means that two edges connecting to the same vertex are merged to one. This is actually to merge two tensors with the same subscript into one. For instance, suppose that the edge between vertexes b0 and b1 in Fig. 3b corresponds to a tensor \(A_{{b_{0} b_{1} }}\), and the edge between vertex b1 and d1 corresponds to \(B_{{b_{1} d_{1} }}\), then the two edges merges into one to get a higher-rank tensor as \(C_{{b_{0} b_{1} d_{1} }} = A_{{b_{0} b_{1} }} B_{{b_{1} d_{1} }}\).

Vertex elimination reduces the number of vertexes connecting to a particular edge. This is actually a variant of tensor contraction. We do this using two different methods, of which one is a differential way and the other an integral way. In the differential method, the variable corresponding to a vertex is fixed to be 0 and 122. For example, the vertex b1 in Fig. 3b is fixed to 0, then the tensor \(B_{{b_{1} d_{1} }}\) is converted to \(B_{{0d_{1} }}\). Thus, the tensor rank is reduced from 2 to 1, and the number of elements from 22 to 2. The expense of this method is that it doubles the graph. That is, the graph needs to be computed twice with the target variable being 0 and 1, respectively. In the integral method, all the elements of a tensor corresponding to a specific subscript are summed over to eliminate that index. For instance, the subscript b1 in tensor \(C_{{b_{0} b_{1} d_{1} }}\) is eliminated by \(C^{\prime}_{{b_{0} d_{1} }} = C_{{b_{0} 0d_{1} }} + C_{{b_{0} 1d_{1} }}\), so the vertex b1 is eliminated from the edge corresponding to tensor \(C_{{b_{0} b_{1} d_{1} }}\).

In summary, the program for the single amplitude mode proceeds in the following four steps:

  • 1st: Configure the computation nodes. Then every node parses the script to obtain a linked list recording instructions of the quantum circuit. Map the quantum circuit to the undirected graphical model using the linked list.

  • 2nd: Eliminate the vertices in the first and last depth of the graph according to the specified initial and measurement states using the differential vertex elimination method. Since the initial and measurement states are certain, this step does not double the number of graphs.

  • 3rd: Find the top N vertices with the largest number of connecting edges. Then perform the differential vertex elimination on the N vertices, and this result into 2N graphs. Assign the task of simulating the 2N graphs equally to the nodes. (Note that eliminating the top N high-degree vertices would be not the best way of simplifying the graph. The treewidth of the graph really matters, but it is NP-complete to determine9,22. For simplicity, we choose the top N high-degree vertices to remove at this step.)

  • 4th: For each graph, eliminate all the vertices. Specifically, for each vertex, first merge all the connecting edges into one in the order of rank, and then eliminate this vertex using the integral method. Multiply the elements of the tensors corresponding to the left edges, and obtain the amplitude of each graph. Sum over the amplitude of each graph to get the final amplitude of the state to measure.

Simulation of the effect of noise

In practical quantum devices, qubits are performed imperfectly. Various kinds of noise would randomly induce errors on the states of qubits. Particularly, in the coming NISQ era, quantum computers have noisy gates unprotected by quantum error correction23. Thus, it is important to characterize the effect of noise by classical simulations.

The effect of noise can be described by a series of super operators {K1, K2, …, Ks}, which satisfy the relation \(\sum\nolimits_{i} {K_{i}^{\dag } } K_{i} = I\). For the single-qubit gate, we consider the following six kinds of noise,

$$ \begin{gathered} {\text{Bit}}\;{\text{flip}}:\quad K_{1} = \sqrt p \left[ {\begin{array}{*{20}c} 1 & 0 \\ 0 & 1 \\ \end{array} } \right],\;K_{2} = \sqrt {1 - p} \left[ {\begin{array}{*{20}c} 0 & 1 \\ 1 & 0 \\ \end{array} } \right]; \\ {\text{Phase}}\;{\text{flip}}:\quad K_{1} = \sqrt p \left[ {\begin{array}{*{20}c} 1 & 0 \\ 0 & 1 \\ \end{array} } \right],\;K_{2} = \sqrt {1 - p} \left[ {\begin{array}{*{20}c} 1 & 0 \\ 0 & { - 1} \\ \end{array} } \right]; \\ {\text{Bit - Phase}}\;{\text{flip}}:\quad K_{1} = \sqrt p \left[ {\begin{array}{*{20}c} 1 & 0 \\ 0 & 1 \\ \end{array} } \right],\;K_{2} = \sqrt {1 - p} \left[ {\begin{array}{*{20}c} 0 & { - i} \\ i & 0 \\ \end{array} } \right]; \\ {\text{Amplitude}}\;{\text{Damping}}:\quad K_{1} = \left[ {\begin{array}{*{20}c} 1 & 0 \\ 0 & {\sqrt {1 - p} } \\ \end{array} } \right],\;K_{2} = \left[ {\begin{array}{*{20}c} 0 & {\sqrt p } \\ 0 & 0 \\ \end{array} } \right]; \\ {\text{Phase}}\;{\text{Damping}}:\quad K_{1} = \left[ {\begin{array}{*{20}c} 1 & 0 \\ 0 & {\sqrt {1 - p} } \\ \end{array} } \right],\;K_{2} = \left[ {\begin{array}{*{20}c} 0 & 0 \\ 0 & {\sqrt p } \\ \end{array} } \right]; \\ {\text{Depolarizing}}:\quad K_{1} = \sqrt {1 - {{3p} \mathord{\left/ {\vphantom {{3p} 4}} \right. \kern-\nulldelimiterspace} 4}} \left[ {\begin{array}{*{20}c} 1 & 0 \\ 0 & 1 \\ \end{array} } \right],\;K_{2} = {{\sqrt p } \mathord{\left/ {\vphantom {{\sqrt p } 2}} \right. \kern-\nulldelimiterspace} 2}\left[ {\begin{array}{*{20}c} 0 & 1 \\ 1 & 0 \\ \end{array} } \right],\;K_{3} = {{\sqrt p } \mathord{\left/ {\vphantom {{\sqrt p } 2}} \right. \kern-\nulldelimiterspace} 2}\left[ {\begin{array}{*{20}c} 0 & { - i} \\ i & 0 \\ \end{array} } \right],\;K_{4} = {{\sqrt p } \mathord{\left/ {\vphantom {{\sqrt p } 2}} \right. \kern-\nulldelimiterspace} 2}\left[ {\begin{array}{*{20}c} 1 & 0 \\ 0 & { - 1} \\ \end{array} } \right]. \\ \end{gathered} $$

The value p in the equation is on [0, 1], which is proportional to the noise intensity. Specifically, for the first three kinds of noise, when p approaches 1, the noise close to zero; for the last three kinds of noise, when p approaches zero, the noise close to zero.

For the two-qubit gate, the noise operators are defined as the Kronecker products of single-qubit gates. For example, suppose the noise operators of single-qubit gates are {K1, K2} and {M1, M2}, respectively. Then, the noise operators of two-qubit gate are \(\left\{ {K_{1} \otimes M_{1} ,\,K_{1} \otimes M_{2} ,\,K_{2} \otimes M_{1} ,\,K_{2} \otimes M_{2} } \right\}\).

In the program, the procedure of simulating the noise goes as follows:

1st. Determine the class of quantum gates specified to be noisy and the kind of noise. Let every operator of {K1, K2, …, Ks} act on the present quantum state using the same method as the full amplitude mode. Then calculate the modulus of the states, namely the probabilities of the states.

2nd. Produce a random number between 0 and 1, and compare it with the above sequence of probabilities, then determine which sub-operator Ki to be used. Multiply the matrix Ki with the quantum gate to obtain a new matrix, i.e., the noisy gate.

3rd. Update the state by the new matrix using the same method as the full amplitude mode. Finally, normalize the quantum state (the noisy gate may not be unitary).

To sum up, we have discussed the basic principles of the full, partial and single amplitude modes, as well as the way of defining noisy gate to emulate the effect of noise. Subsequently, we introduce numerical results and applications of the present simulator.

Results and applications

To characterize the performance of the simulator, we first implement the random quantum circuits (RQCs) generated using the prescription of Google24. Then we demonstrate the quantum circuits for solving the Poisson equations and for the quantum arithmetic of evaluating transcendental functions. Here, we remark that the quantum fast Poisson solver and quantum arithmetic algorithms are implemented mainly on the full amplitude mode since these circuits have relatively few qubits and high depth. We leave such applications to future work that the partial and single amplitude modes as well as the function of emulating the effect of noise are exploited.

Implementation of RQCs

The full amplitude mode is the foundation of the other two modes, because the resulting sub-circuits in partial and single modes are finally simulated using the same method as the full amplitude mode. The main factor of limiting the computing speed of full amplitude mode is the data communication between nodes. According to Eqs. (2) and (3), when updating one term of amplitudes αi, one need another term αi+2k, which may be stored in another core-group or another SW26010 processors. As shown in Fig. 4, for the one-qubit gate, the speed of computation on a state stored in one core-group (node) is about ten times faster than that in different core-groups. On the other hand, amplitudes being stored in one SW26010 processor or two has almost no influence.

Figure 4
figure 4

The time span of performing one- and two-qubit gates in different cases on the full amplitude mode. 1Q-1CG means that a one-qubit gate acts on a state stored in one core-group. 1Q-2CG means that a one-qubit gate acts on a state stored in two different core-groups, but in the same SW26010 processor. 1Q-2Pr means that a one-qubit gate acts on a state stored in two SW26010 processors. 2Q represents the two-qubit gate.

For the partial amplitude mode, we simulate a sequence of RQCs with 4096 nodes. The running time is shown in Fig. 5. In addition to the numbers of qubit and depth, the structure of the lattice of qubits also has a big impact on the running time, as shown by the results of 60 qubits (6 × 10 and 5 × 12).

Figure 5
figure 5

The time span of executing RQCs on the partial amplitude mode. The product of two numbers represents the total number of qubits.

For the single amplitude mode, we simulate RQCs with 49, 110 and 200 qubits using 256 nodes. The running time is shown in Fig. 6. By taking advantage of the distributed computing system, we accomplished the simulation of circuits with up to 200 qubits and 21 depths.

Figure 6
figure 6

The time span of executing random quantum circuits on the single amplitude mode. The circuit with 200 qubits and 21 depths was simulated.

Quantum fast Poisson solver

The Poisson equation is a widely used partial differential equation across many areas of physics and engineering. For instance, when simulating the dynamic process of ocean current, the Navier–Stokes equations25 can be reduced to the Poisson equation under certain conditions26. Solving the Poisson equation, thus, constitutes the most computationally intensive part of the ocean current simulation. We develop a quantum algorithm for solving the multi-dimensional Poisson equation27. It could provide an exponential speedup to some degree over the classical counterparts. Here, we remark that for the one-dimensional Poisson equation, there may exist more efficient quantum algorithms28. It could be implemented on the near-term NISQ devices. We leave this point to future work.

The general idea of our quantum fast Poisson solver is straightforward. First, we discretize the Laplacian operator to a square matrix using the central difference approximation, and then solve the resulting linear system of equations using the Harrow-Hassidim-Lloyd (HHL) algorithm29. Schematically, the algorithm is shown in Fig. 7. It consists of three main stages, i.e., phase estimation, controlled rotation and uncomputation. The complexity of our algorithm is \(O\left( {d\log^{2} (\varepsilon^{{{ - }\alpha }} )} \right)\) in qubits and \(O\left( {\kappa d\log^{3} (\varepsilon^{{{ - }\alpha }} )} \right)\) in quantum operations, where ε is the error of the solution, d the dimension of Poisson equation, α > 0 a smoothness constant and κ the condition number of the discretized matrix. On the other hand, any direct or iterative classical algorithms have a cost of at least εαd30. Thus, our quantum Poisson solver could provide an exponential speedup over classical methods in the terms of dimension.

Figure 7
figure 7

The overall quantum circuit of the quantum fast Poisson solver. It has four main registers, i.e. register B, E, A and Ancilla. Register B is used to encode the coefficients of Poisson equation. It is the input of the circuit. The approximated eigenvalues are stored in register E. Register A is used to store angles for the controlled rotation operation. The Ancilla register will transduce the reciprocals of eigenvalues to the amplitudes.

To demonstrate the correctness of the algorithm, we propose a simplified version of the circuit with four discretized points27. The circuit consists of 38 qubits and 800 gates. It is simulated using the full amplitude mode, and the run time is 20 min with 4096 nodes. The input state is \(\frac{1}{\sqrt 2 }{|01}\rangle + \frac{1}{2}{|10}\rangle + \frac{1}{2}{|11}\rangle\). This corresponds to a Poisson equation with the solution of (0.9053, 1.1036 0.8018), which turns to (0.553 0.674 0.490) after normalization. The output state is \(0.551{|01}\rangle + \;0.675{|10}\rangle + 0.491{|11}\rangle\), which is consistent with the real solution with an error less than 0.5%. The running results verify the correctness of our algorithm.

Quantum arithmetic of transcendental functions

Quantum arithmetic in the computational basis constitutes the fundamental component of many circuit-based quantum algorithms. A vast amount of literature provided quantum circuits for solving the algebraic functions, including the addition31, multiplication32, reciprocal33, and square root34 operations, etc. However, studies about the higher-level transcendental functions are scare35,36. We develop a novel quantum algorithm, the qFBE (quantum Function-value Binary Expansion) method, to evaluate the transcendental functions37. The qFBE provides a unified and programmed solution for the evaluation of logarithmic, exponential, trigonometric and inverse trigonometric functions.

Our qFBE method can be used to evaluate two classes of functions: the Class 1 including log2(x), ln(x), arccos(x), arcsin(x), arccot(x) and arctan(x), and Class 2 including 2x, ex, cos(x), sin(x), cot(x), tan(x). More specifically, suppose the functions of Class 1 are define as f: I → [0,1] with I R, then the function value can be expanded in a binary form as follows38,

$$ \begin{gathered} f(x) = \sum\limits_{{n \ge 0,\;a_{n} \in D_{1} }} {\frac{1}{{2^{n + 1} }}} . \hfill \\ with\;a_{0} = x,\;a_{n + 1} = \left\{ \begin{aligned} r_{0} (a_{n} ) & \;\,if\,a_{n} \in D_{0} \\ r_{1} (a_{n} ) & \;\,if\,a_{n} \in D_{1} \\ \end{aligned} \right.. \hfill \\ \end{gathered} $$

where D0 and D1 are subintervals of I with D0D1 = I, D0 ∩ D1 = Ø; r0 and r1 are functions defined as r0: D0 → I, r1: D1 → I. On the other hand, the functions of Class 2 can be approximated in the following way37,

$$ \begin{gathered} f(x) = a_{n + 1} \hfill \\ with\;x = (0.v_{n - 1} v_{n - 2} \cdots v_{1} v_{0} ),\;v_{i} \in \left\{ {0,\,1\} ,} \right. \hfill \\ and\;a_{0} = const,\;a_{i + 1} = \left\{ \begin{gathered} r_{0}^{ - 1} (a_{i} )\;\,if\,v_{i} = 0 \hfill \\ r_{1}^{ - 1} (a_{i} )\;\,if\,v_{i} = 1 \hfill \\ \end{gathered} \right.. \hfill \\ \end{gathered} $$

Apparently, Eq. (6) outputs the function value digit-by-digit in a recursive way, while Eq. (7) approximates the function value step-by-step in an iterative way.

The complexity of evaluating transcendental functions by the qFBE method is nO(m) in qubits and nO(m2) in quantum gates, where n is the number of qubits to encode input or output and m the number of qubits to store the intermediate values. The cost of our method is comparable with the best known results36 at worst case; while when the input binary has a small number of bits, our method cost much lower. Furthermore, all digits of the binary output can be exact, which makes the control of error propagation easy. The qFBE method provides a unified and programmed solution for most transcendental functions, and the circuits are compact and modular which are easy to be implemented on the virtual or the future real quantum machine.

The quantum circuits for evaluating functions of Class 1 and 2 are shown in Fig. 8 (a) and (b), respectively. For functions of Class 1, the circuit consists of (n-1) modules, which actually implement the recursions in Eq. (6). Each module outputs one bit of the solution. For functions of Class 2, the circuit consists of n modules, which approximate the function value step-by-step according to Eq. (7). The last module outputs the final solution. We present the complete quantum circuits for all the functions in Class 1 and 2 and demonstrate the correctness of these circuits on the simulator, which include arccot(x)/π, cos(πx), arccos(x)/π, cot(πx), 2× and log2(x)37. The running results verify the correctness of our algorithm.

Figure 8
figure 8

The overall circuits for evaluating functions in Group 1 (a), and Group 2 (b). The Group 1 circuit consists of n−1 modules G1-i, which implement the recursions in Eq. (7). Similarly, the Group 2 circuit consists of n modules G2-i, which corresponds to the iterations in Eq. (8). Both circuits mainly include two registers for inputs and outputs.


We have developed an efficient quantum circuit simulator on the Sunway TaihuLight supercomputer. The simulator possesses three working modes, being capable of calculating the full, partial and single amplitudes of a quantum state. The three modes are built using entirely different methodologies. They are the direct evolution of quantum states, circuit partition by decomposing controlled-Z gate and the complex undirected graphical model. Our simulator has the function of emulating the effects of noise, and it supports many kinds of useful quantum gates and operations. To make full use of the Sunway distributed system, the simulation was implemented in a two-level parallel way. With 16,384 computational nodes, roughly 10% of the computing resource of the Sunway, random quantum circuits with up to 40, 75 and 200 qubits can be simulated on full, partial and single amplitude modes, respectively.

Based on the simulator, we further developed the quantum algorithms for solving the Poisson equations and for quantum arithmetic of evaluating transcendental functions. The present quantum fast Poisson solver takes the HHL algorithm as the framework, and provides an exponential speedup over the classical methods in the terms of dimension. The qFBE method provides a unified and programmed way of evaluating the transcendental functions, including the logarithmic, exponential, arc-cosine, arc-sine, cosine, sine, arc-cotangent, arc-tangent, cotangent and tangent functions.

For future work, we will (1) advance the study of quantum Poisson solver to further reduce the algorithm complexity and quantify the effect of noise, and (2) optimize the qFBE circuits by selecting the proper circuits of evaluating algebraic functions. Furthermore, we will expand the applications of the present simulator to other fields, like variational quantum algorithms and quantum machine learning.