Efficient representation of quantum many-body states with deep neural networks

Part of the challenge for quantum many-body problems comes from the difficulty of representing large-scale quantum states, which in general requires an exponentially large number of parameters. Neural networks provide a powerful tool to represent quantum many-body states. An important open question is what characterizes the representational power of deep and shallow neural networks, which is of fundamental interest due to the popularity of deep learning methods. Here, we give a proof that, assuming a widely believed computational complexity conjecture, a deep neural network can efficiently represent most physical states, including the ground states of many-body Hamiltonians and states generated by quantum dynamics, while a shallow network representation with a restricted Boltzmann machine cannot efficiently represent some of those states.

In this construction, we restrict to a simple RBM gadget with only one hidden neuron connected to k visible neurons and an identical weight function W on each edge. We need to solve the equation for a certain correlation g(v_1, ..., v_k). Apart from the graph state example given in the main text, we construct RBM representations of three classes of entangled states: the toric code, which is the simplest state with topological order and is useful for quantum error correction; the randomly distributed entangled pair state, which satisfies the entanglement volume law instead of the area law [1]; and the coherent thermal state, which describes a critical system [2]. These examples and the corresponding RBM representations are shown in Supplementary Fig. 1.
For the toric code, the correlation function g(v_1, v_2, v_3, v_4) on each vertex enforces the parity constraint v_1 + v_2 + v_3 + v_4 = 0 (mod 2), and the wave function is a product of these functions over all vertices [3]. Summing over the single hidden neuron h ∈ {0, 1} gives Σ_h exp[Σ_i W(v_i, h)] = (1 + e^{iπ(v_1+v_2+v_3+v_4)})/2, which equals 1 for even parity and 0 for odd parity, so the weight W(v_i, h) = iπ v_i h − (ln 2)/4 reproduces g for the toric code example. Supplementary Ref. [4] gives another construction of an RBM representation of the toric code state; our construction here is significantly simpler. For the randomly distributed entangled pair states, we have g(v_1, v_2) = δ_{v_1 v_2} for each entangled pair |00⟩ + |11⟩. The weight function W(v_i, h) is a special case of the phase gadget W_θ with θ = 0 (the identity gadget).
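As a sanity check, the parity constraint produced by this single-hidden-neuron gadget can be verified numerically. The following Python sketch (only the weight W(v_i, h) = iπ v_i h − (ln 2)/4 is taken from the text; the rest is illustrative) sums over the hidden neuron for all 2^4 visible configurations of a vertex:

```python
import itertools, cmath, math

def gadget(vs):
    # Sum over the single hidden neuron h in {0, 1} with the weight
    # W(v_i, h) = i*pi*v_i*h - (ln 2)/4 on each of the k = len(vs) edges.
    total = 0
    for h in (0, 1):
        expo = sum(1j * math.pi * v * h - math.log(2) / 4 for v in vs)
        total += cmath.exp(expo)
    return total

# The gadget reproduces g(v1, v2, v3, v4): 1 for even parity, 0 for odd.
for vs in itertools.product((0, 1), repeat=4):
    expected = 1.0 if sum(vs) % 2 == 0 else 0.0
    assert abs(gadget(vs) - expected) < 1e-12
```

The check confirms that tracing out the hidden neuron yields (1 + e^{iπΣv_i})/2, i.e. the vertex parity projector.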
For a coherent thermal state, defined as |Ψ_ch⟩ ∝ Σ_v e^{−βE(v)/2} |v⟩ with E(v) the classical energy, it has the same correlation functions as the corresponding thermal state, with β denoting the inverse temperature [2]. We consider binary variables v defined on a square lattice as shown in Supplementary Fig. 1, and the state |Ψ_ch⟩ then defines a coherent thermal state for a 2D Ising model, which has a phase transition at β = β_c. At this critical point β_c, the state |Ψ_ch⟩ describes a many-body entangled state for a critical system. For any value of β, the wave function of the state |Ψ_ch⟩ can be simply represented by an RBM as shown in Supplementary Fig. 1. We have g(v_1, v_2) = e^{−β v_1 v_2 / 2} for each pair of neighboring visible neurons. This correlation is identical to the case that we considered for the DBM representation of a fully connected Boltzmann machine (see the Methods section of the main text). We need just a single hidden neuron to generate this correlation, with the weight function given in the Methods section of the main text with J replaced by −β/2.
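The defining property — diagonal correlators of the coherent thermal state equal classical thermal averages — can be illustrated on a toy system. The sketch below uses a tiny Ising ring (a hypothetical minimal example; the supplementary text uses a 2D square lattice) and checks the property by brute force:

```python
import itertools, math

beta = 0.7
n = 4  # small Ising ring; spins s_i in {+1, -1}

def energy(s):
    # Classical Ising energy E(s) = -sum_i s_i s_{i+1} (periodic chain).
    return -sum(s[i] * s[(i + 1) % n] for i in range(n))

configs = list(itertools.product((1, -1), repeat=n))
# Coherent thermal state: Psi(s) proportional to exp(-beta*E(s)/2),
# so |Psi(s)|^2 is the classical Boltzmann distribution.
amps = {s: math.exp(-beta * energy(s) / 2) for s in configs}
Z = sum(a * a for a in amps.values())

def quantum_zz(i, j):
    # <Psi| sigma^z_i sigma^z_j |Psi> for this diagonal observable.
    return sum(amps[s] ** 2 * s[i] * s[j] for s in configs) / Z

def classical_zz(i, j):
    # Classical thermal correlation <s_i s_j> at inverse temperature beta.
    w = {s: math.exp(-beta * energy(s)) for s in configs}
    W = sum(w.values())
    return sum(w[s] * s[i] * s[j] for s in configs) / W

assert abs(quantum_zz(0, 2) - classical_zz(0, 2)) < 1e-12
```

Because |Ψ(s)|^2 is exactly the Boltzmann weight, every diagonal correlator of |Ψ_ch⟩ coincides with the classical thermal expectation, as the construction requires.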

Supplementary Note 2 A review of definitions of related complexity classes
For completeness, here we review some definitions of complexity classes related to our proof of the limitation of the RBM representation. A more detailed discussion can be found in Supplementary Ref. [5]. The concept of a language L (a subset of the set of strings {0, 1}*) is used to formalize decision problems (whose solution can only be true or false). We denote an instance of the problem by x; if the solution of x is true, x ∈ L, otherwise x ∉ L. First, we review the definitions of the complexity classes P, NP, and the polynomial hierarchy.

Definition 1 (P) A language L is in P if there exists a polynomial-time deterministic Turing machine M such that for every x, M(x) = 1 if and only if x ∈ L.

This class characterizes problems which can be solved efficiently by a classical computer without randomness. In contrast, NP-hardness is usually used to characterize difficult problems.

Definition 2 (NP) A language L is in NP if there exist a polynomial q and a polynomial-time deterministic Turing machine M such that for every x, x ∈ L if and only if there exists u ∈ {0, 1}^{q(|x|)} with M(x, u) = 1.

Intuitively, NP is the set of problems for which the "yes" solutions can be efficiently verified by a classical computer.
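To make the notion of efficient verification concrete, here is a toy verifier M(x, u) for the NP problem Subset-Sum (an illustrative example, not part of the original text): the instance x is a list of numbers and a target, the witness u selects a subset, and the check runs in time linear in the input.

```python
def verify_subset_sum(numbers, target, witness):
    # Polynomial-time verifier M(x, u): x = (numbers, target) and the
    # witness u is a bit string selecting a subset of the numbers.
    assert len(witness) == len(numbers)
    return sum(a for a, bit in zip(numbers, witness) if bit) == target

# x is in the language iff SOME witness makes M accept.
assert verify_subset_sum([3, 9, 8, 4], 12, [0, 0, 1, 1])      # 8 + 4 = 12
assert not verify_subset_sum([3, 9, 8, 4], 12, [1, 0, 0, 1])  # 3 + 4 = 7
```

Deciding membership may require searching over exponentially many witnesses, but checking any single witness is fast — exactly the asymmetry that defines NP.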
Supplementary Fig. 1: RBM representation of some many-body entangled states. a, The toric code is the simplest topologically ordered state used as a quantum error correcting code; it is defined as the stabilizer state of the stabilizers A_v and B_p, where v and p denote vertices and plaquettes, respectively. The wave function of the toric code is a product of the functions shown in the figure for each vertex v, with no constraint on the plaquettes p. b, Randomly distributed entangled pairs |00⟩ + |11⟩ on a line (or any lattice). The state obeys the entanglement volume law. The wave function of this state is a simple product of the functions shown in the figure for each pair. c, Coherent thermal state, which represents a critical system when β reaches β_c, the critical point of the corresponding thermal state of the classical Ising model. The wave function is the square root of the corresponding probability in the classical thermal model. For the RBM representation, the coefficients a, b, c, d are given by the solution in the Methods section of the main text with J replaced by −β/2.

The polynomial hierarchy is a generalization of the complexity class NP.
Definition 3 (polynomial hierarchy) A language L is in Σ_i^p if there exist a polynomial q and a polynomial-time deterministic Turing machine M such that x ∈ L if and only if ∃u_1 ∀u_2 ⋯ Q_i u_i, M(x, u_1, ..., u_i) = 1, where each u_j ∈ {0, 1}^{q(|x|)} (we call the u_j witnesses) and Q_i denotes ∀ or ∃ depending on whether i is even or odd, respectively. The polynomial hierarchy is PH ≡ ∪_i Σ_i^p.

Note that NP = Σ_1^p, and one can extend to i = 0 such that P = Σ_0^p. Clearly, Σ_i^p ⊆ Σ_{i+1}^p ⊆ PH. Most computer scientists believe P ≠ NP. A generalization of this conjecture is that for every i, Σ_i^p is strictly contained in Σ_{i+1}^p, i.e., Σ_i^p ≠ Σ_{i+1}^p. This can also be stated as "the polynomial hierarchy does not collapse". This conjecture is often used in complexity theory.
We also use the complexity classes P/poly and #P in the proof of Theorem 1. The Boolean circuit model is often used in computational complexity theory in addition to the Turing machine. The complexity class P/poly characterizes those problems that can be solved by polynomial-size Boolean circuits.
Definition 4 (P/poly) A polynomial-sized circuit family is a sequence {C_n} of Boolean circuits, where the input size of C_n is n and the size of C_n is bounded by poly(n). P/poly is the class of languages that can be decided by a polynomial-sized circuit family.
The difference between P/poly and P is that the circuit family need not be constructible by an efficient Turing machine; P/poly is a non-uniform class.
Another way to generalize the complexity class NP is to go to so-called counting problems. The definition of NP only requires knowing whether there exists at least one witness that the Turing machine accepts. We can generalize NP to counting the number of witnesses that the Turing machine accepts. This complexity class is defined as #P.

Definition 5 (#P) A function f is in #P if there exist a polynomial q and a polynomial-time classical Turing machine M such that for every x, f(x) = |{u ∈ {0, 1}^{q(|x|)} : M(x, u) = 1}|.

Different from the other complexity classes, which are defined for decision problems, #P defines problems whose output is not 0 or 1 but a nonnegative integer. In order to compare decision problems and counting problems, it is convenient to introduce the oracle machine: P^O is the class of problems decided by a polynomial-time deterministic Turing machine with access to a subroutine or oracle O (which is not necessarily efficiently computable; we only count the number of times the oracle is accessed). For example, P^NP represents the set of problems that can be decided by a polynomial-time Turing machine that accesses a subroutine solving NP problems a polynomial number of times.
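The distinction between deciding and counting can be made concrete with a brute-force toy example (illustrative only; the exponential enumeration is exactly what one does not expect to avoid for #P-hard problems). Here M is a polynomial-time satisfaction check for a CNF formula, f counts its witnesses, and the NP question only asks whether the count is nonzero:

```python
import itertools

def M(x_clauses, u):
    # Polynomial-time machine: does the assignment u satisfy the CNF
    # formula? Literal l > 0 means u[l-1] must be 1; l < 0 means
    # u[-l-1] must be 0.
    return all(any((u[l - 1] if l > 0 else not u[-l - 1]) for l in c)
               for c in x_clauses)

def count_witnesses(x_clauses, n):
    # The #P function f(x) = |{u in {0,1}^n : M(x, u) = 1}|,
    # computed here by brute force over all 2^n assignments.
    return sum(M(x_clauses, u) for u in itertools.product((0, 1), repeat=n))

def decide_NP(x_clauses, n):
    # The NP question only asks whether at least one witness exists.
    return count_witnesses(x_clauses, n) > 0

formula = [[1, 2], [-1, 3]]  # (u1 or u2) and (not u1 or u3)
assert count_witnesses(formula, 3) == 4
assert decide_NP(formula, 3)
```

The decision answer "True" discards almost all the information in the count "4" — which is why an oracle for #P (equivalently, for counting) is so much more powerful than one for NP alone.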

Supplementary Note 3 Proof of Theorem 1 under approximate representation
Here we introduce a specific state on a 2D lattice, denoted as |ψ_GWD⟩ and shown in Supplementary Fig. 2. The state |ψ_GWD⟩ is a cluster state after one layer of translation-invariant single-qubit unitary transformations. This state was introduced in Supplementary Ref. [6] for the proof of quantum supremacy. We prove here that |ψ_GWD⟩ cannot be efficiently represented by RBMs under reasonable conjectures in complexity theory. This no-go theorem holds for both exact and approximate representations.
Supplementary Fig. 2: Construction of a state that cannot be represented by an RBM. The state |ψ_GWD⟩ used in the proof of Theorem 1 was introduced in Supplementary Ref. [6] for the proof of quantum supremacy. To construct this state, we start from a brickwork of white circles [7] (top left), with each white circle represented by seven blue circles. Each blue circle represents a qubit, and the brickwork of blue circles can be filled with additional red and green circles (each representing an ancillary qubit) to form a regular 2D square lattice (top right). We start with a standard 2D cluster state for the square lattice, then apply the phase gates Z(θ) on the blue-circle qubits, with the angle θ forming the periodic pattern shown in the bottom left figure, and the Hadamard gate on each green-circle and blue-circle qubit (no gate on the red-circle qubits). After this layer of single-qubit unitary operations, we get the state |ψ_GWD⟩.

Exact representation
Suppose the coefficient of |ψ_GWD⟩ in the computational basis is Ψ(v) (in terms of notation, |Ψ(v)|^2 corresponds to q_x in Supplementary Ref. [6]). In Supplementary Ref. [6] it is proved that computing Ψ(v) to accuracy 2^{−cmn} is #P-hard, where 0 ≤ c < 1/2 and the lattice size is n × m. This implies that computing |Ψ(v)|^2 to this accuracy is also #P-hard. Denote by O an oracle with the ability to compute the first mn − 1 digits of Ψ(v). The above statement implies P^{#P} ⊆ P^{O}. If |ψ_GWD⟩ can be represented efficiently by an RBM, calculating its wave function in the computational basis belongs to the complexity class P/poly, as discussed in the main text; thus O ⊆ P/poly. Combining these results, we have

Lemma 1 An RBM cannot represent |ψ_GWD⟩ exactly unless P^{#P} ⊆ P^{P/poly}.

Supplementary Ref. [8] proves that if the containment P^{#P} ⊆ P^{P/poly} holds, the polynomial hierarchy collapses [5], which is widely believed to be unlikely.
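The step "an efficient RBM representation puts wave-function evaluation in P/poly" relies on the fact that the hidden neurons of an RBM can be traced out in closed form, so each amplitude costs only O(nm) arithmetic operations rather than a sum over 2^m hidden configurations. A sketch with randomly chosen, purely illustrative complex parameters a, b, W:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 8  # visible and hidden neurons

# Hypothetical complex RBM parameters: visible biases a, hidden
# biases b, and weight matrix W (not from the supplementary text).
a = rng.normal(size=n) + 1j * rng.normal(size=n)
b = rng.normal(size=m) + 1j * rng.normal(size=m)
W = rng.normal(size=(n, m)) + 1j * rng.normal(size=(n, m))

def amplitude(v):
    # Tracing out each hidden neuron independently gives the closed form
    # Psi(v) = exp(a.v) * prod_j (1 + exp(b_j + sum_i W_ij v_i)),
    # computable in O(n*m) time.
    return np.exp(a @ v) * np.prod(1 + np.exp(b + v @ W))

def amplitude_brute(v):
    # Brute-force check: explicit sum over all 2^m hidden configurations.
    total = 0
    for h in range(2 ** m):
        hv = np.array([(h >> j) & 1 for j in range(m)])
        total += np.exp(a @ v + b @ hv + v @ W @ hv)
    return total

v = np.array([1, 0, 1, 1, 0, 1])
assert np.allclose(amplitude(v), amplitude_brute(v))
```

A polynomial-size RBM therefore yields a polynomial-size circuit for Ψ(v) — the containment O ⊆ P/poly used in the argument above.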

Approximate representation
Denote a measurement in the computational basis as a quantum operation E, and let Ψ'(v) be the wave function of an RBM state |ψ'_GWD⟩ that approximates |ψ_GWD⟩. Then we have

D(E(|ψ_GWD⟩⟨ψ_GWD|), E(|ψ'_GWD⟩⟨ψ'_GWD|)) = (1/2) Σ_v | |Ψ(v)|^2 − |Ψ'(v)|^2 | ≤ D(|ψ_GWD⟩⟨ψ_GWD|, |ψ'_GWD⟩⟨ψ'_GWD|),

where D(ρ_1, ρ_2) denotes the trace distance, defined as tr|ρ_1 − ρ_2|/2, and the inequality follows from the monotonicity of the trace distance under quantum operations. The above equation means that if the trace distance between |ψ_GWD⟩ and |ψ'_GWD⟩ is smaller than ε/2, then Σ_v | |Ψ(v)|^2 − |Ψ'(v)|^2 | ≤ ε, and Markov's inequality gives

Pr_v [ | |Ψ(v)|^2 − |Ψ'(v)|^2 | ≥ ε/(δ 2^{mn}) ] ≤ δ,

where ε/δ < 1/2 and Pr_v[f(v)] denotes the probability that v satisfies the condition f(v) when the random variable v is uniformly distributed. This equation means that for a 1 − δ fraction of the whole set of v, we have | |Ψ(v)|^2 − |Ψ'(v)|^2 | < ε/(δ 2^{mn}) < 2^{−mn−1}. Denote by O' an oracle with the ability to compute the first mn − 1 digits of |Ψ(v)|^2 for this fraction of v; then O' ⊆ P/poly. In Supplementary Ref. [6], we introduced a conjecture that the #P-hardness of approximating Ψ(v) to the error given by Supplementary Eq. (10) still holds when we lift from the worst case to the average case (Conjecture 1). Supplementary Ref. [6] has discussed why this is a reasonable conjecture: it is related to the classical hardness of simulating the output distribution of random quantum circuits and is supported by both quantum chaos theory and extensive numerical simulations [9]. With this conjecture, we have P^{#P} ⊆ P^{O'}. Combining the results above, we have the following

Lemma 2 If Conjecture 1 is true, no RBM state can approximate |ψ_GWD⟩ to trace distance smaller than ε/2; otherwise P^{#P} ⊆ P^{P/poly} and the polynomial hierarchy collapses.

Supplementary Note 4 A comment on exact and approximate representations of local tensors
In the proof of Theorem 3 in the main text, we use the universal quantum computing gadget to construct an exact neural network representation of the local tensors in the tensor network state. We could also use the universal approximation theorem [10] to construct an approximate representation of the local tensors, as in Supplementary Ref. [11]. However, we should note that for a tensor network state, some properties of the state are extremely sensitive to approximations in the local tensors. For instance, for the toric code state we have a tensor network representation with exact local tensors.

Supplementary Fig. 3: Construction of a pseudo quantum circuit for the DBM representation of ground states of many-body Hamiltonians. We construct a pseudo quantum circuit where each elementary gate represents a linear but non-unitary operation, which can be described through multiplication by a local tensor. The left diagram represents the construction of the pseudo-gate controlled-(−βĤ/k), which acts as a basic building block for the construction of the pseudo-gate e^{−βĤ} shown on the lower right, through the Taylor series expansion.
If we approximate the local tensors by a deformation of strength ε, where ε can be an arbitrarily small constant, it is shown in Supplementary Ref. [12] that as long as ε ≠ 0, the topological entanglement entropy vanishes in the thermodynamic limit. So the existence of topological entanglement entropy is extremely sensitive to approximations in the local tensors for this tensor network state. The reason is that the approximation error in each local tensor is amplified when we connect the local tensors to build up the global tensor network state, in particular in the thermodynamic limit, where the number of connected local tensors goes to infinity. We therefore prefer the exact representation of the local tensors, as used in Theorem 3. If one uses the approximate representation, one should bound the total error in the resultant tensor network state (instead of the error of each individual local tensor) to be small.

Supplementary Note 5 Efficient tensor network representation for ground states
In this section, we prove Theorem 4 in the main text. For this purpose, we develop a method using the Taylor series expansion to efficiently construct the ground state of any Hamiltonian with a tensor network (and thus with a DBM network as well, due to Theorem 3). Compared with the previous construction method [13], this approach allows an exponential improvement in the scaling with respect to the precision of the representation. We use a pseudo quantum circuit to present our construction, as shown in Supplementary Fig. 3. A pseudo quantum circuit is similar to a conventional quantum circuit, except that a pseudo-gate is not required to be unitary. Each pseudo-gate still represents a linear transformation through matrix multiplication, so it can be easily constructed through a local tensor; the pseudo quantum circuit then just represents a tensor network. Suppose the number of interaction terms in the Hamiltonian is m, i.e., Ĥ ≡ Σ_{i=1}^{m} Ĥ_i, where each Ĥ_i involves at most k-body interactions (k is typically a small constant; for instance, k = 2 for any Hamiltonian with two-body interactions, which may be long-range). We simulate the operator Ĥ by first generating the state Σ_{i=1}^{m} |i⟩, where |i⟩ ≡ |0_1 0_2 ⋯ 1_i ⋯ 0_m⟩, i.e., only the i-th bit is 1. Applying the operation Ĥ_i controlled by the i-th qubit, as shown in Supplementary Fig. 3, and post-selecting the control bits in the state Σ_{i=1}^{m} |i⟩ (note that post-selection can be easily represented in a tensor network), we apply Σ_{i=1}^{m} Ĥ_i = Ĥ to the data register. This requires O(m) pseudo quantum gates, as shown in Supplementary Fig. 3. All of the above operations are controlled by an additional qubit (see Supplementary Fig. 3). We then apply the non-unitary matrix diag(1, −β/k) on this control qubit and obtain the pseudo-gate controlled-(−βĤ/k). Note that this gate requires O(m) elementary pseudo quantum gates for its construction; we use the pseudo-gate controlled-(−βĤ/k) as the building block for the next step.
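The control-and-postselect step above can be emulated numerically on small matrices (an illustrative numpy sketch with hypothetical Hermitian terms H_i, not the tensor network construction itself): preparing the control register in the uniform superposition, applying the block-diagonal controlled-H_i operation, and projecting the control register back yields Ĥ|ψ⟩ up to normalization.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 4, 3  # system dimension and number of Hamiltonian terms

# Hypothetical Hermitian terms H_i with H = sum_i H_i.
Hs = []
for _ in range(m):
    A = rng.normal(size=(d, d))
    Hs.append((A + A.T) / 2)
H = sum(Hs)

psi = rng.normal(size=d)
psi /= np.linalg.norm(psi)

# Control register prepared in the uniform superposition sum_i |i>,
# then each H_i applied controlled on the i-th control state.
u = np.full(m, 1 / np.sqrt(m))          # normalized control state
joint = np.kron(u, psi)                 # control (x) system
U = np.zeros((m * d, m * d))
for i, Hi in enumerate(Hs):
    U[i * d:(i + 1) * d, i * d:(i + 1) * d] = Hi  # sum_i |i><i| (x) H_i
out = U @ joint

# Postselect the control register back onto the uniform state: the
# (unnormalized) system state is proportional to H |psi>.
post = sum(u[i] * out[i * d:(i + 1) * d] for i in range(m))
assert np.allclose(post, (H @ psi) / m)
```

The 1/m prefactor is the normalization cost of postselection; in the tensor network it is absorbed into the overall (unimportant) normalization of the state.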
In the next step, we first generate the superposition state Σ_{k=0}^{K} |k⟩, where K corresponds to the truncation order of the Taylor series expansion and |k⟩ ≡ |1_1 1_2 ⋯ 1_k 0_{k+1} ⋯ 0_K⟩, i.e., only the first k bits are in the state |1⟩. Applying the circuit shown in Supplementary Fig. 3 and post-selecting the output state of the control qubits in Σ_{k=0}^{K} |k⟩, we implement the operator Σ_{k=0}^{K} (−βĤ)^k / k!, which is the Taylor expansion of e^{−βĤ} truncated at the K-th order. Note that this construction of e^{−βĤ} requires O(Km) elementary pseudo quantum gates. We apply the operator e^{−βĤ} on the maximally entangled state 2^{−n/2} Σ_v |v⟩|v⟩ = 2^{−n/2} Σ_i |ψ_i⟩|ψ_i^*⟩, where n represents the total number of qubits in the Hamiltonian Ĥ, |ψ_i⟩ denotes the i-th eigenstate of Ĥ, and |ψ_i^*⟩ is the complex conjugate of |ψ_i⟩ in the computational basis. After tracing out the register storing |ψ_i^*⟩ and dropping the unimportant normalization factor, we get the state of the first register, which equals the targeted ground state |ψ_0⟩ up to two error terms: the first comes from the contribution of all the other eigenstates, shrunk by the imaginary time evolution factor e^{−β∆}, with ∆ denoting the energy gap; the second comes from the truncation error of the Taylor series expansion. Suppose we require the total representation error to be bounded by a small constant ε; then both error terms must be at most of order ε. With an optimal choice of the parameter β = O((n + log(1/ε))/∆), we need K = O(β‖Ĥ‖) to satisfy these requirements. As ‖Ĥ‖ = O(m), we have K = O((n + log(1/ε))m/∆). So the total number of elementary tensors needed to represent the ground state |ψ_0⟩ of the Hamiltonian Ĥ is given by O(Km), which is O((n + log(1/ε))m^2/∆). Each elementary tensor has a constant bond dimension D and a typically small coordination number d. Actually, we can choose D = 2 and d = 4, because each interaction term in the Hamiltonian can be expressed as a tensor product of k Pauli matrices and thus has a unitary form.
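The projection property of the truncated series can be checked numerically. In the illustrative sketch below the Hamiltonian, its spectrum, and the truncation order K are hypothetical choices; the ground energy is shifted to zero so that e^{−βĤ} suppresses the excited eigenstates by e^{−β∆}.

```python
import numpy as np

rng = np.random.default_rng(2)
beta, K = 6.0, 60

# Hypothetical gapped Hamiltonian: known spectrum (ground energy 0,
# gap Delta = 1) in a random orthogonal eigenbasis.
energies = np.array([0.0, 1.0, 1.5, 2.0, 2.5, 3.0])
Q, _ = np.linalg.qr(rng.normal(size=(6, 6)))
H = Q @ np.diag(energies) @ Q.T
ground = Q[:, 0]

# Truncated Taylor expansion exp(-beta*H) ~ sum_{k=0}^{K} (-beta*H)^k / k!
expH = np.zeros_like(H)
term = np.eye(6)
for k in range(K + 1):
    expH += term
    term = term @ (-beta * H) / (k + 1)

# Applying it to a generic state suppresses every excited eigenstate
# by exp(-beta * E_i), leaving (approximately) the ground state.
psi = expH @ (Q @ np.full(6, 1 / np.sqrt(6)))
psi /= np.linalg.norm(psi)
assert abs(ground @ psi) > 0.999
```

Note the trade-off the text quantifies: β must grow with n and log(1/ε) to beat the excited-state weight, and K must then grow like β‖Ĥ‖ to keep the Taylor truncation error small.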
The elementary tensor then has a controlled-controlled-unitary form, which can be decomposed into a small sequence of two-qubit unitary gates [14,15]. Each two-qubit unitary gate can be expressed as a tensor with D = 2 and d = 4. Combining with Theorem 3 in the main text, we find that the total number of neurons in the DBM required to represent the ground state of the Hamiltonian Ĥ is also given by O((n + log(1/ε))m^2/∆), which is the statement of Theorem 4.

Supplementary Note 6 A Prototype Algorithm for Training DBM
Our constructive proofs of Theorems 2 and 4 give a systematic approach to construct an efficient DBM representation (with a polynomial number of neurons and parameters) for any state generated by quantum dynamics or for ground states of physical Hamiltonians. However, they do not address the question of how to extract properties of physical observables from the DBM representation. Furthermore, for practical quantum states there could exist significantly more succinct neural network representations (either by an RBM or by a DBM) than those indicated by these general theorems. To address this question, in this section we give a prototype algorithm for training a DBM by the gradient descent (reinforcement learning) method to directly approximate ground states of given Hamiltonians. This algorithm is similar to the algorithm proposed recently in Supplementary Ref. [16] for training RBMs: it is also based on Monte Carlo sampling, so that computing the variational derivative reduces to computing expectation values of easily computable functions over a probability distribution. The probability distribution can be sampled with a Markov chain, and for each sample from this distribution, the function (corresponding to the expectation value of some physical observable) can be computed efficiently. We emphasize that this algorithm is only a prototype: it has not been numerically tested and could be significantly improved for real problems. The algorithm is also not guaranteed to be efficient: from computational complexity theory, the general contraction problem for either RBM or DBM (or tensor network) representations is at least NP-hard [17], so we do not expect an efficient algorithm for extracting physical observables to exist in the worst case. This general NP-hardness of contraction does not, of course, prevent the use of RBM or DBM learning algorithms for solving real problems, as the typical-case hardness can be very different from the worst-case hardness.
Supplementary Ref. [16] has shown that the learning algorithm based on RBMs successfully solves several spin models. By borrowing ideas and numerical techniques from deep learning and the broader machine learning literature for solving practical problems [18,19], we expect that learning algorithms based on RBM or DBM representations of quantum states will attract a lot of interest in the coming years.

Notations
The wave function of the quantum state represented by a DBM can be written as

Ψ(v) = Σ_{h,h'} exp(Σ_{ij} W_{ij} v_i h_j + Σ_{jk} W'_{jk} h_j h'_k),

as shown in Supplementary Fig. 4, where v, h, h' are the values of the visible neurons, the neurons of the first hidden layer, and the neurons of the second hidden layer, respectively, and W, W' are the weight matrices between successive layers. Without loss of generality, and to simplify the notation, we omit the bias terms (linear terms in the exponent). Notice that the summation over h' can be regarded as the wave function of an RBM with visible neurons h and hidden neurons h'. For convenience of notation, we denote the DBM state as

|Ψ⟩ = Σ_{v,h} exp(Σ_{ij} W_{ij} v_i h_j) ⟨h|RBM⟩ |v⟩,

where the symbol |RBM⟩ is only used to simplify the notation and does not correspond to a real quantum state represented by an RBM.
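The remark that the inner summation is itself an RBM wave function can be checked directly: summing the second hidden layer h' analytically gives a product of factors (1 + e^{(hᵀW')_k}), which must agree with the brute-force double sum. A small sketch with hypothetical real weights (the text allows complex weights; real ones suffice for the structural point):

```python
import numpy as np
import itertools

rng = np.random.default_rng(3)
n, m1, m2 = 3, 3, 3  # visible, first hidden, second hidden layer sizes

# Hypothetical DBM weight matrices W and W'.
W = rng.normal(size=(n, m1))
Wp = rng.normal(size=(m1, m2))

def psi_brute(v):
    # Psi(v) = sum over both hidden layers of exp(v^T W h + h^T W' h').
    return sum(np.exp(v @ W @ np.array(h) + np.array(h) @ Wp @ np.array(hp))
               for h in itertools.product((0, 1), repeat=m1)
               for hp in itertools.product((0, 1), repeat=m2))

def psi_layered(v):
    # The sum over the deep layer h' is an RBM wave function with
    # visible neurons h: it factorizes as prod_k (1 + exp((h^T W')_k)).
    total = 0.0
    for h in itertools.product((0, 1), repeat=m1):
        h = np.array(h)
        total += np.exp(v @ W @ h) * np.prod(1 + np.exp(h @ Wp))
    return total

v = np.array([1, 0, 1])
assert np.allclose(psi_brute(v), psi_layered(v))
```

Only the outer sum over h remains exponential; the deep layer is always traced out for free, which is exactly what the |RBM⟩ shorthand encodes.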

Extracting physical observables
Here, we give a Monte Carlo algorithm to compute the expectation value of a physical observable O from the DBM state, ⟨O⟩ = ⟨Ψ|O|Ψ⟩ / ⟨Ψ|Ψ⟩ ≡ A/B.
We consider the numerator and the denominator separately. For the numerator, we can write A = Σ_{h,h'} p_h p_{h'} O_{h,h'}, where p_h and p_{h'} are positive weights obtained from the RBM factors of the two copies of the state and O_{h,h'} is an efficiently computable function. Since the wave function of an RBM can be factorized given a specific configuration (a specific value assignment to the neurons h and h'), the ratio p_h p_{h'} / p_{h_1} p_{h'_1} (where h_1, h'_1 denote a new configuration) is easy to compute. Then we can use the Metropolis algorithm to sample the distribution:

Pr(h h' → h_1 h'_1) = min(1, p_{h_1} p_{h'_1} / (p_h p_{h'})),

where Pr(h h' → h_1 h'_1) denotes the probability to change the configuration from h h' to h_1 h'_1. The function O_{h,h'} can be computed efficiently given the configurations h and h'. Suppose O is decomposed as a tensor product of at most k Pauli matrices, where k is typically a small constant. Denote by |v̄⟩ the computational basis of the part of the Hilbert space on which O acts trivially, and the remaining part by |v_0⟩, with v_0 ∈ {0, 1}^k. Then O_{h,h'} reduces to a sum over the 2^{2k} matrix elements ⟨v_0|O|v'_0⟩ together with the corresponding RBM factors; the first summation in the numerator thus contains only 2^{2k} = O(1) terms.
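A minimal sketch of the Metropolis step above (the weights p_h are toy values chosen for illustration; as in the text, only ratios of p enter the acceptance rule, so no normalization is ever needed):

```python
import itertools
import math
import random

random.seed(0)
m = 3  # toy configuration space: m binary neurons

def p_unnorm(h):
    # Hypothetical positive weight p_h of a configuration; only the
    # RATIO p_h1 / p_h enters the Metropolis acceptance rule.
    return math.exp(sum((i + 1) * hi for i, hi in enumerate(h)) / 2)

def metropolis(n_steps):
    h = (0,) * m
    counts = {}
    for _ in range(n_steps):
        i = random.randrange(m)                  # propose a single-bit flip
        h1 = h[:i] + (1 - h[i],) + h[i + 1:]
        if random.random() < min(1.0, p_unnorm(h1) / p_unnorm(h)):
            h = h1                               # Metropolis acceptance
        counts[h] = counts.get(h, 0) + 1
    return counts

counts = metropolis(200_000)
# Compare empirical frequencies with the exact normalized distribution.
Z = sum(p_unnorm(h) for h in itertools.product((0, 1), repeat=m))
for h in itertools.product((0, 1), repeat=m):
    assert abs(counts.get(h, 0) / 200_000 - p_unnorm(h) / Z) < 0.01
```

In the actual algorithm the walk is over joint configurations (h, h') of the two hidden layers and the proposal would flip neurons in either copy, but the acceptance rule has exactly this ratio form.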
To minimize the energy, we minimize the numerator A while keeping B almost constant. We take the gradient descent method, for which the change of the parameters, ηw, should be along the direction

max_w (−∂A · w),    (39)
subject to ∂B · w = 0,    (40)

where η denotes a small step size (learning rate) and w denotes the direction of the parameter changes. The optimal direction w is easy to calculate if we know the variational derivatives ∂A and ∂B. First, we consider the derivative ∂A.
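The constrained direction in Supplementary Eqs. (39) and (40) is, to first order, the projection of −∂A onto the hyperplane orthogonal to ∂B. A sketch with hypothetical gradient vectors:

```python
import numpy as np

rng = np.random.default_rng(4)
dA = rng.normal(size=5)  # variational derivative of the numerator A
dB = rng.normal(size=5)  # variational derivative of the denominator B

# Direction w maximizing -dA.w subject to dB.w = 0: project -dA onto
# the hyperplane orthogonal to dB, then normalize.
w = -dA + (dA @ dB) / (dB @ dB) * dB
w /= np.linalg.norm(w)

assert abs(dB @ w) < 1e-12  # keeps B constant to first order
assert dA @ w < 0           # decreases the numerator A
```

By Cauchy-Schwarz, ∂A·w < 0 holds strictly whenever ∂A is not parallel to ∂B, so a small step η·w lowers A while leaving B unchanged to first order.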