Neural Quantum States of frustrated magnets : generalization and sign structure

Tom Westerhout, ∗ Nikita Astrakhantsev, 3, † Konstantin S. Tikhonov, 5, 6, ‡ Mikhail I. Katsnelson, § and Andrey A. Bagrov ¶ Institute for Molecules and Materials, Radboud University, Heijendaalseweg 135, 6525 AJ, Nijmegen, The Netherlands Moscow Institute of Physics and Technology, Dolgoprudny, 141700 Russia BLTP, Joint Institute for Nuclear Research, Joliot-Curie str. 6, 141980 Dubna, Russia Institut für Nanotechnologie, Karlsruhe Institute of Technology, 76021 Karlsruhe, Germany Skolkovo Institute of Science and Technology, 143026 Skolkovo, Russia Landau Institute for Theoretical Physics RAS, 119334 Moscow, Russia

Following fascinating success in image and speech recognition tasks, machine learning (ML) methods have recently proven themselves to be very useful in physical sciences. For example, ML has been used to classify phases of matter [1], to enhance quantum state tomography [2], to bypass expensive dynamic ab initio calculations [3], and much more [4]. Currently, artificial neural networks (NNs) are being explored as variational approximations for many-body quantum systems in the context of variational Monte Carlo (vMC) approach. vMC is a well-established class of methods suitable for studying low-energy physics of many-body quantum systems with a more than fifty-year history [5]. A vast variety of trial wave functions have been suggested in different contexts. One of the simplest choices is mean-field form of the wavefunction which can be enriched by explicit account for particle-particle correlation [6][7][8] and generalized to include many variational parameters [9][10][11][12]. Certain variational ansätze, such as tensor networks [13], do not require stochastic Monte Carlo sampling and are thus amenable to exact optimization. In particular these include Matrix Product States (MPS) [14], Projected Entangled Pair States [15], and Multiscale Entanglement Renormalization Ansatz [16]. The common shortcoming of all these methods is that the trial functions are tailored to a concrete model of interest and often require some prior knowledge about structure of the ground state (such as short-range entanglement for MPS methods) or intuition which help to constrain the optimization landscape (e.g. approximate understanding of nodal surfaces for Quantum Monte Carlo methods [17]). However, in many cases our prior intuition can be insufficient or unreliable. This poses a natural question whether a more generic ansatz that can efficiently approximate ground states of many-body systems could exist.
A novel and fresh look at this problem was given in [18], where the traditional vMC optimization approach was hybridized with ML. A simple yet very unrestricted variational ansatz that inherits the structure of a certain neural network (Restricted Boltzmann Machine) was suggested. For the test cases of one-and twodimensional Heisenberg and transverse field Ising models, it was demonstrated that, optimizing this ansatz with the Stochastic Reconfiguration (SR) scheme [19], one can achieve high accuracy in approximating ground states of systems of up to hundreds of spins, sometimes outperforming the state-of-the-art methods.
In subsequent years, a number of new variational neural quantum states have been suggested and their properties were thoroughly analyzed. Among other important discoveries, it was realized that even the simplest Restricted Boltzmann Machines (RBMs) with polynomial number of parameters have rich enough structure to host volume law entanglement [20,21], indicating that NQS are more flexible than, for instance, tensor networks [13]. Hybrid wave functions, combining properties of RBMs and more traditional pair product wave functions, were demonstrated to significantly reduce relative energy error of variational ground state of two-dimensional Fermi-Hubbard model [22] and to enhance the accuracy of Gutzwiller-projected wave functions in frustrated magnets [23]. Recently, an algorithm for computing the spectrum of low-lying excited states has been suggested [24], opening a route to studying finite-temperature phenomena with NQS. However, it also became evident that NQS must not be perceived as a magic bullet in the area of strongly correlated quantum systems. Although a variational wave function with a network structure may be able to approximate the ground state really well, in some cases the desired point in the space of variational parameters can be hard to reach, and learning algorithm hits a saddle point before approaching the solution. This results in a large relative energy error and a low overlap between the NQS and the actual exact ground state, making the obtained solution almost useless for computing physical observables such as electric conductivity or spin-spin correlation functions. This problem is particularly pronounced for systems where the energy gap between the ground state and the first excited state is very small, like for frustrated spin systems such as J 1 − J 2 antiferromagnetic Heisenberg model on square lattice [25], or the Fermi-Hubbard model away from the neutrality point [26].
So far, significant effort has been put into the search for neural quantum states architectures that have good expressibility, -a potential capacity to represent a manybody wave function with high accuracy using a moderate number of parameters [27][28][29]. At the same time, there is another issue that is not widely discussed in this context. In the variational optimization schemes, an ansatz is adjusted iteratively in a certain way, so that we expect the system to end up in the lowest energy state allowed by the form of the ansatz [19,30,31]. At each step of this iterative procedure, one has to evaluate the change of the trial wavefunction parameters, induced by the evolution operator: ψ n (σ) → ψ n+1 (σ). Evaluation of the NN weights describing the state ψ n+1 (σ) relies on MC sampling from basis of the Hilbert space of the model, and for large systems the total number of samples is negligibly small in comparison with the dimension of the Hilbert space. Hence, it is of crucial importance for the NN to accurately generalize onto a larger subspace that was not sampled in the course of learning and correctly predict phases and amplitudes of the wave function on the full set of basis vectors. Although generalization properties of neural networks have been analyzed [32,33] in more conventional machine learning context, we are not aware of any systematic studies of this issue for the problem at hand -approximating the eigenstates of large quantum systems, -and with the present work we fill this gap.
Although the generalization issue concerns both phases and amplitudes, it turns out that these two components of the wavefunction behave differently in this respect. Already from the first works in the field, it seemed plausible that effectiveness of NN as variational ansatz is somehow connected to the sign structure of the models. For instance, in [18], even for the unfrustrated Heisenberg antiferromagnet on a square lattice, the Hamiltonian must first be brought into stoquastic (sign-definite) form by a unitary transformation in order to reduce noise and attain proper level of convergence (see also Ref. [34]). As another example, let us note that in recent study [31] it was stressed that biasing the NQS anzats with certain predefined (heuristic) sign structures is very important for performance of the method. Therefore, although we will study both aspects, special attention will be paid to the sign structure.
We consider several antiferromagnetic spin models described by the Heisenberg Hamiltonian: where, for each lattice geometry, the first sum is taken over the unfrustrated sublattice (solid lines in Fig. 1), and the second sum is taken over the sublattice that brings in frustrations (dashed lines in Fig. 1). Namely, we consider J 1 − J 2 model on a square lattice [35,36] and the nearest-neighbor antiferromagnetic on spatially anisotropic triangular [37] and Kagome [38,39] lattices. For every model, its ground state belongs to the sector of minimal magnetization, thus the dimension of the corresponding Hilbert space is K = C N [N/2] (where N is the number of spins). It is convenient to work in the basis of eigenstates ofσ z operator: |S ∼ | ↑↓ . . . ↓↑ . In this basis the HamiltonianĤ is real-valued. The ground state is thus also real-valued, and every coefficient in its basis expansion is characterized by a sign s i = sign(ψ i ) (instead of a continuous phase): By running numerical experiments, we shall demonstrate that it is indeed the lack of generalization that prevents a neural quantum state from accurately learning the signs of the wavefunction, even though expressibility of the corresponding ansätze could be good enough. Our strategy is to consider exact ground states of the models and test how well NN can predict sign structure of the whole state when they are given only a small fraction of it during training. We shall demonstrate that, when the models are interpolated between unfrustrated and fully frustrated regimes, networks' generalization abilities change in a non-trivial way, becoming very poor in certain cases. This motivates a search for neural quantum states architectures with better generalization capacity.
For the three models defined above, we systematically study the generalization properties of NNs (separating the signs from the amplitudes) of different architectures varying the degree of frustration (controlled by J 2 /J 1 ) and the size of the training dataset.

RESULTS
We have analyzed how NNs learn ground state structures of three lattice models, in each case considering periodic clusters of 24 spins. Effective dimension of Hilbert space in the zero-magnetization sector is d = C 24 12 2.7 · 10 6 . Our main results seem universal for all studied models and architectures and can be summarized in the following four statements.
(i) Generalization from a relatively small subset of Hilbert space basis of the wave function sign structure is not granted even when the ansatz is able to express the ground state with high accuracy. Very well known to machine learning practitioners, this fact is also valid for spin systems, in both frustrated and ordered phases.
(ii) Construction and training of a network to achieve good generalization, a task which is relatively simple in the ordered phase, becomes much harder upon approaching the frustrated phase.
(iii) Quality of generalization depends on the size of training set in an abrupt way exhibiting a sharp increase at some ε train .
(iv) Prediction of wave function amplitudes turns out to be a substantially easier task than prediction of signs.
In the remaining part of this section we explain the finding in more detail using Kagome lattice as an example. We focus on a two-layer dense neural network architecture. For results for other models and detailed comparison of different architectures we refer the reader to Supplementary Material. Fig. 2 illustrates both points i and ii. Here, we use a small subset (1%) of the Hilbert space basis to train the NN and then evaluate how well it predicts the sign structure on the remaining basis vectors unavailable to it during training. To assess the quality of generalization we use overlap between the exact ground state and the trial state. The latter is defined as a state with amplitudes taken from exact diagonalization and sign structure encoded in a NN. Consider, for example, the middle panel, where generalization quality for Kagome model is studied as a function of J 2 /J 1 . It is known [39] that Kagome model posesses a frustrated phase for 0.51 J 2 /J 1 1.82. Strikingly, this phase transition shows itself as a sharp decrease of overlap around the value J 2 /J 1 ≈ 0.51. As one may expect, the frustrated phase is characterized by very intricate sign distribution leading to a drastic reduction in the overlap. For the square and the triangular lattices (left and right panels of Fig. 2), generalization quality behaves somewhat differently. For J 1 − J 2 model on the square lattice, instead of a sharp transition, it exhibits a large but smooth dip in frustrated phase (0.4 < J 2 /J 1 < 0.6). On the triangular lattice, overlap slightly improves away from the transition point (J 2 /J 1 ≈ 1.25). However, for all three models we see that behavior of generalization quality reflects very well the known phase transitions with generalization being easy in ordered phases and becoming notoriously hard in disordered phases. Note also that different neural networks may generalize very differently: in particular, as shown on the left panel of Fig. 2, dips in performance of convolutional NNs are much smaller than those for dense networks for the square lattice model. We believe that experiments of this kind would help to choose proper architectures to be used in iterative diagonalization schemes.
Let us now turn to observation iii. As we have already mentioned, it is very important to distinguish the ability to represent the data from generalization. In the context of NQS, the former means that a NN is able to express complex quantum states well if training was conducted in a perfect way. We observe that perfect expressibility (overlap = 1) is indeed possible if the training set is large enough -meaning that the network can represent the target state very well. However, this ability does not automatically make a neural network useful if it cannot generalize well. To make the boundary more clear, we study how generalization quality changes when size of the training dataset is increased. Results for Kagome lattice are shown on the Fig. 3. Interestingly, even in the frustrated phase (J 2 = 0.6) it is possible to generalize reasonably well from a relatively small subset of the basis states, but the required ε train becomes substantially larger than in the magnetically ordered phase. Most importantly, the ability of the NN to generalize establishes in an abrupt manner contrary to more typical smooth behavior observed in statistical models of learning [40][41][42] In our discussion up to this point, we concentrated entirely on the quality of generalization of the wavefunction sign structure. One may wonder whether it is indeed the signs rather than amplitudes which are responsible for the difficulty of learning the wavefunction as a whole (this possibility has been discussed in the context of state tomography in Ref. [2]). To prove this statement, we conduct the following analysis. In the context of learning, overlap between a trial wavefunction and the target state can be used to characterize the effectiveness of NNs in two different ways. First, one can fix the amplitudes of the wavefunction and use a NN to predict the signs. This produces a trial wavefunction ψ sign . Alternatively, one can fix the sign structure, and encode the amplitudes in a NN to get a trial wavefunction ψ amp . Clearly, the accuracy of ψ amp and ψ sign will depend on the relative complexity of learning amplitudes and signs of the wavefunction coefficients. We illustrate statement iv with Fig. 4, where we use overlap to compare the quality of generalization of signs and amplitudes (using, again, 1% of the basis for training). Upon increase of J 2 , one moves from a simple ordered phase to frustrated phase where overlap drops sharply. Although the prediction of both signs and amplitudes becomes harder at the point of phase transition J 2 /J 1 = 0.51, drop in the sign curve is much larger, and at even higher J 2 the quality of the learned states becomes too poor to approximate the target wavefunction. At the same time, even deeply in the frustrated regime generalization of amplitudes, given the exact sign structure, leads to a decent result. Moreover, networks manage to generalize amplitudes even for small training datasets (see Fig. S4). These observations suggest that it is indeed the sign part of the wavefunction that becomes problematic for generalization in frustrated region 1 . Comparison of generalization quality as measured by overlap for learning the sign structure (red) and amplitude structure (blue) for Kagome lattice for 2-layer dense architecture. Note that both curves decrease in the frustrated region, but sign structure is much harder to learn.

DISCUSSION
In this paper, we have analyzed the ability of neural networks to generalize many-body quantum states from a small number of basis vectors to the whole Hilbert space. The main observation we made is that for all models we have considered, quality of generalization of the ground state sign structure falls of sharply near quantum phase transitions and remains low in the frustrated phase.
This has important implications for iterative methods of approximating ground states of quantum systems via vMC. At each iterative step, one updates the wavefunction ψ → ψ by acting on it with an evolution operator. Applicability of this method to many-body systems relies on two features. First, evaluation of matrix elements of the evolution operator in the chosen basis should be fast. This holds naturally for systems with pairwise interactions: their Hamiltonians are sparse and desired matrix elements can be evaluated in a time polynomial in the number of particles. In the worst-case scenario, however, one needs to evaluate an exponentially large number of such elements to properly approximate ψ . Here, the second feature comes into play: the wavefunction ansatz and optimization routine should be chosen such that together they provide a scheme which can represent the wavefunction well even when only a small subset of the basis vectors has been sampled to represent ψ . This property is exactly what is called generalization in the context of related to the famous Quantum Monte Carlo sign problem. For example, Fig. 2 shows that in J 2 → 0 limit of J 1 -J 2 model, networks have no trouble learning the sign structure even though in σ z basis, there is sign problem since we are not applying Marshall's Sign Rule.
machine learning and what we have characterized in our study.
We have demonstrated that generalization may indeed be an essential factor that is likely responsible for spoiling the convergence of NQS in a number of physically interesting cases such as frustrated quantum spin systems. Our main observation which is qualitatively valid for all the studied models and NN architectures is that a NN fails to generalize the distribution of signs in the ground state of a many-body system with competing interactions in the regime of strong frustrations if the training is done on a small fraction of basis states. At the same time, even simple neural networks seem to have no problem in generalizing amplitudes from the training dataset onto the full Hilbert space, and have very good capacity to express both sign and amplitude distributions of the studied states. This understanding gives us a possibility to formulate a very concrete and simple test for future NQS architectures that will be used for studying ground state physics of many-body quantum systems: A neural quantum state can be trained to approximate ground state of a large-scale many-body system only if it is capable of generalizing sign structure of a moderatelysized (exactly solvable) system ground state.
Another important feature we have discovered is the threshold behavior of generalization as a function of the training dataset size. This is rather unusual and different from the smooth behavior known for standard models of learning, such as teacher-student scenario in a binary perceptron [40,41] and some other studies of NNs generalization [43]. From the point of view of vMC applications, it is desirable to understand how the required number of samples depends on system parameters such as size and degree of frustration and training algorithm parameters.
As a closely related phenomenon, let us mention the fact known from the binary perceptron problem: bias towards dense clusters of local minima on the loss landscape makes generalization error a significantly steeper function of the number of the samples [42]. This may partially explain the observed abrupt change in generalization quality since stochastic gradient descent employed by us is known to have similar properties (bias towards wide minima of the loss landscape [33,44,45]). It would be very useful to have analytically tractable models which showed the threshold behavior of generalization.
Finally, it is worth mentioning that, while the dip in generalization is not desirable in the context of variational energy optimization, it could be used as a tool to identify -in a completely unsupervised manner -the position of the phase transitions, similarly in spirit to approaches of Refs [46][47][48][49].

METHODS
In this study, we use feed-forward networks of three different architectures (dense 1-layer, dense 2-layer, and convolutional 2-layer) to encode wavefunction coefficients via splitting them into amplitudes and signs. All of our networks have the same input format: spin configuration |S i = |σ 1 σ 2 . . . σ N represented as a binary sequence, σ k = ±1. Network encoding amplitudes outputs a real number, natural logarithm of the amplitude. Network encoding sign structure outputs a real number p ∈ [0, 1] interpreted as a probability for the corresponding sign to be plus. Thus, unlike the approach of Ref. [18], we represent wavefunction sign using a binary classifier.
Both networks are trained on data obtained from exact diagonalization. We sample ε train · K spin configuration from the Hilbert space basis according to probability distribution P (i) = |ψi| 2 j |ψj | 2 . They constitute the training dataset 2 . Then, we sample another ε val · K spin configurations which we use as a validation dataset during training. It enables us to monitor the progress and employ regularization techniques such as early stopping. In practical applications of NQS [18,24,25], SR [19,50], Stochastic Gradient Descent [11,51], or Generalized Lanczos [30], the training dataset is generated by Monte Carlo sampling from basis of the Hilbert space of the model, and, since dimension of the latter grows exponentially with the number of spins, only a tiny fraction of it can be covered with a Monte Carlo chain in reasonable time. Therefore it is natural to mimic this incomplete coverage with ε train , ε val 1. To assess the performance of the NNs we evaluate overlap (scalar product) between exact eigenstate and the trial state. A trial state for sign NN is defined as a state with exact amplitudes but with sign structure encoded in a NN. Analogously, a trial state for amplitude NN is obtained by superimposing the exact sign structure onto the positive amplitudes, predicted by an amplitude NN.
We train the classifier by minimizing binary crossentropy loss function where p i is the predicted probability for the spin configuration |S i to have sign +1, s i = ±1 is the expected sign obtained from ED, and the sum is taken over the training dataset (which is sampled from the Hilbert space basis according to P (i)). Training of the neural network which predicts amplitudes occurs via minimization of where ψ e i is the exact value of i-th coefficient. Usually, in machine learning algorithms it is crucial to choose hyperparameters correctly. For example, dependence of critical ε train on batch size is non-monotonic. Choosing a wrong batch size can lead to an order of magnitude increase of required ε train . In our calculations, we typically work with batches of 64 or 128 samples. For optimizaion, we mostly use Adam [52] (a stochastic gradient-based method) with learning rates around 10 −4 -10 −3 . Early stopping is our main regularization technique, but we have also experimented with dropout layers (which randomly throw away some hidden units) and L 2 -regularization. Code to reproduce our results and more details about the training scheme can be found on GitHub [53]. In this Supplemental Material, we provide some additional information regarding the numerical analysis outlined in the main part of the paper.

I. NETWORK ARCHITECTURES
We use three different Neural Network architectures to demonstrate that the effects we observe might be general for all standard deep learning techniques.
a. One-layer dense network One-layer dense network can be applied to all three models at hand. We arrange 24 spins in a linear pattern and feed them to the network which has one hidden layer of 64 neurons, the ReLU non-linearity, and two output nodes representing the ±1 sign probabilities. The total number of parameters is then 24 × 64 + 64 + 64 × 2 = 1728.
b. Two-layer dense network Two-layer dense network can be applied to all three models at hand. We arrange 24 spins in a linear pattern and feed them to the network which has two hidden layers of 64 neurons each, and two output nodes representing the ±1 sign probabilities. ReLU non-linearity is applied after every hidden layer. The total number of parameters is 24 × 64 + 64 + 64 × 64 + 64 + 64 × 2 = 5888.
c. CNN Periodic convolutional network can applied to the J 1 -J 2 model on a square lattice. The main goal of this architecture is to preserve the translational invariance on the level of NN architecture. In every convolutional layer we apply periodic boundary conditions. When all the convolutional layers are applied, we take mean value operation over all spins in every "channel", i.e. over all lattice sites. One can check that the prediction of such neural network is invariant with respect to any translation of the input spin configuration.
In particular, we arrange 4 × 6 spins in a rectangular pattern and apply Conv(1, 10, 5) and Conv(10, 20, 5) convolutional layers, where Conv(n, m, s) is the convolutional layer which has n input channels, m output channels, and trains an s × s convolutional mask. Each convolutional layer has ReLU non-linearity. After two convolutional layers, one has a 4 × 6 × 20 array that is reduced to 20 numbers by taking the mean value over space in each channel. We finally apply a dense layer with 20 inputs and 2 outputs and use the result as class probabilities.

II. RESULTS OF ADDITIONAL NUMERICAL EXPERIMENTS
In this section, we provide additional results on learning NQS for frustrated spin models. Some of them give extra support for the statements we made in the main part of the paper, and some are not directly related to the main narrative but seem useful for future attempts in constructing NQS suitable for simulations of realistic many-body systems.
In Fig. S1, we show how distribution of amplitudes in ground states of the three studied models depends on J 2 /J 1 by visualizing the cumulative sum of squared amplitudes. For the Kagome lattice, one can clearly see that the amplitude distribution changes drastically as we proceed from J 2 /J 1 = 0.5 to J 2 /J 1 = 0.6, and the sign generalization problem seems to be much more difficult when the amplitude distribution gets more narrow (for J 2 /J 1 = 0.6, 1% of basis states accumulate more than 0.5 of the total probability). For the square lattice, the tendency is similar though less pronounced. The stronger the frustrations, the higher probability is accumulated on the 1% of most significant basis vectors.
In Fig. S2, we show the dependence of sign structure generalization quality for square and the triangular lattices on the size of the training dataset. Sharp transition around certain critical value of ε train is qualitatively the same as we observed for the Kagome lattice in the main text.
In Fig. S3, we highlight how difficult it is to generalize sign structure as compared to amplitude structure for the square and the triangular lattices. While for both learning tasks, there is a dip in the generalization quality, it is evident that the sign structure is much more difficult to learn. Finally, on Fig. S4, exemplifying the Kagome lattice, we show that for the amplitude generalization quality, there is no sharp transition as we increase ε train .