Neural network approach to quasiparticle dispersions in doped antiferromagnets

Numerically simulating spinful fermionic systems is of great interest from the perspective of condensed matter physics. However, the exponential growth of the Hilbert space dimension with system size renders an exact parameterization of large quantum systems prohibitively demanding. This is a perfect playground for neural networks, owing to their immense representative power, which often allows them to use only a fraction of the parameters needed in exact methods. Here, we investigate the ability of neural quantum states (NQS) to represent the bosonic and fermionic $t-J$ model - the high interaction limit of the Fermi-Hubbard model - on different 1D and 2D lattices. Using autoregressive recurrent neural networks (RNNs) with 2D tensorized gated recurrent units, we study the ground state representations upon doping the half-filled system with holes. Moreover, we present a method to calculate dispersion relations from the neural network state representation, applicable to any network architecture and any lattice geometry, which allows one to infer the low-energy physics from the NQS. To demonstrate our approach, we calculate the dispersion of a single hole in the $t-J$ model on different 1D and 2D square and triangular lattices. Furthermore, we analyze the strengths and weaknesses of the RNN approach for fermionic systems, pointing the way towards an accurate and efficient parameterization of fermionic quantum systems using neural quantum states.

The simulation of quantum systems has remained a persistent challenge until today, primarily due to the exponential growth of the Hilbert space, making it exceedingly difficult to parameterize the wave functions of large systems using exact methods. Since the seminal work of Carleo and Troyer [1], the idea of using neural networks to simulate quantum systems [1][2][3][4][5] has been applied successfully to a large number of quantum systems, leveraging various neural network architectures. These architectures include restricted Boltzmann machines [2,3], convolutional neural networks (CNNs) [6], group CNNs [7], autoencoders [8] as well as autoregressive neural networks such as recurrent neural networks (RNNs) [9][10][11][12][13], with neural network representations of both amplitude and phase distributions of the quantum state under consideration. These neural quantum states (NQS) make use of the innate ability of neural networks to efficiently represent probability distributions. When applying them to represent quantum systems, this ability can help to reduce the number of parameters required to encode the system.
Despite their representative power, NQS have been shown to face challenges during the training process, for example when they are trained to minimize the energy, i.e. to represent ground states. This results from the intricate nature of the loss landscape, characterized by numerous saddle points and local minima that complicate the search for the global minimum [14]. One promising avenue to overcome this problem is the use of many uncorrelated samples during the training. This strategy is facilitated when using autoregressive neural networks [15,16], which allow one to sample directly from the wave function's amplitudes. Autoregressive networks have already been applied in the physics context [17,18], such as for the variational simulation of spin systems [10][11][12][13].
Many works have so far focused on NQS representations of spin systems at half-filling, revealing that NQS can be used to study a variety of phenomena that are relevant to state-of-the-art research, as shown e.g. for RNN representations on various lattice geometries, including frustrated spin systems [10,19] and systems with topological order [20]. For all of these systems, the physics becomes even richer when introducing mobile impurities, e.g. holes, into the system, yielding a competition between the magnetic background and the kinetic energy of the impurity. Simulating such systems holds particular relevance for understanding high-temperature superconductivity, where the superconducting dome arises upon doping the antiferromagnetic half-filled state with holes [21]. The search for NQS that are capable of representing such spinful fermionic systems is still in its early stages. In recent years, first NQS have been developed that obey the fermionic statistics, simulating molecules [22][23][24], spinless fermions [16] and spinful fermions [25][26][27][28]. Among those architectures are FermiNet [22,23], Slater-Jastrow ansätze [16,25,27,28] and variants of Jordan-Wigner transformations [24,26,29][30][31].
Here, we use an autoregressive neural network architecture, supplemented with a Jordan-Wigner transformation, to simulate ground states of the high interaction limit of the Fermi-Hubbard model, believed to capture essential features of high-temperature cuprate superconductors. Specifically, we use RNNs, proven to successfully model spin systems [9,10,19,20,32,33], and simulate the ground states of the fermionic (bosonic) $t-J$ model, both in one and two dimensions. In its more generalized form, known as the fermionic (bosonic) $t-$XXZ model, with anisotropic superexchange interactions denoted as $J_z$ and $J_\pm$, the Hamiltonian under consideration reads

$$\hat{H} = -t \sum_{\langle i,j \rangle, \sigma} \hat{\mathcal{P}} \left( \hat{c}^\dagger_{i,\sigma} \hat{c}_{j,\sigma} + \mathrm{h.c.} \right) \hat{\mathcal{P}} + \sum_{\langle i,j \rangle} \left[ \frac{J_\pm}{2} \left( \hat{S}^+_i \hat{S}^-_j + \hat{S}^-_i \hat{S}^+_j \right) + J_z \left( \hat{S}^z_i \hat{S}^z_j - \frac{\hat{n}_i \hat{n}_j}{4} \right) \right], \quad (1)$$

with the fermionic (bosonic) creation and annihilation operators $\hat{c}^\dagger_{i,\sigma}$ and $\hat{c}_{i,\sigma}$ for particles at site $i$ with spin $\sigma$, and $\hat{\mathcal{P}}$ the projector onto the subspace without double occupancies; spin operators are denoted by $\hat{\mathbf{S}}_i = \sum_{\sigma,\sigma'} \hat{c}^\dagger_{i,\sigma} \frac{1}{2} \boldsymbol{\sigma}_{\sigma\sigma'} \hat{c}_{i,\sigma'}$ as well as density operators by $\hat{n}_i$ [34]. For $J_z = J_\pm$, Eq. (1) reduces to the $t-J$ model, and for $J_\pm = 0$ to the $t-J_z$ model.
In the absence of doping ($n_i = 1$), Eq. (1) reduces to the XXZ model or, in the case of $J_z = J_\pm$, the Heisenberg model. Prior studies have already utilized RNNs to simulate these spin models [19,36], with the possibility of rendering the model stoquastic by making use of the Marshall sign rule [37]. This is done by implementing the sign rule directly in the RNN architecture [19], yielding a simplified optimization procedure for the wave function's phase.
When the ground state at $n_i = 1$ is doped with a single hole, the resulting mobile impurity gets dressed with a cloud of magnetic excitations. This yields the formation of a magnetic polaron, which has already been observed in ultracold atom experiments [38]. Its properties strongly depend on the spin background, see Fig. 1a and b. Upon further doping, the strong correlations in the model make the simulation of the Fermi-Hubbard or $t-J$ models numerically challenging, despite impressive numerical advances in the past years [39][40][41][42]: Commonly used methods all come with their specific limitations, e.g. the density matrix renormalization group [43,44] is limited by the area law of entanglement, making it challenging to apply this method in 2D or higher dimensions. Finally, the calculation of spectral functions or dispersion relations $E(\mathbf{k})$ [35], as exemplarily shown in Fig. 1, is of great interest for many fields in physics to reveal the emergent physics of a system under investigation. In condensed matter physics, they are typically used to infer the dominating excitations in the ground state or higher energy states, e.g. upon doping the system. This information is contained in specific features of the spectra, e.g. the bandwidth of the quasiparticle dispersion $E(\mathbf{k})$. However, the calculation of spectra or dispersions $E(\mathbf{k})$ is in general computationally costly using conventional methods, e.g. density-matrix renormalization group (DMRG) simulations: The former typically involves an, in general expensive, time evolution of the state [45], and the latter the calculation of a global operator, the momentum $\mathbf{k}$, which is typically very costly for matrix product states.
The remaining part of the paper is structured as follows: In the first section, we introduce the fermionic RNN architecture and its training. Second, we apply the RNN architecture to the ground state search of the $t-$XXZ model on different lattice geometries, including 1D and 2D lattices. Furthermore, we present a method to map out the dispersion relation of the system under consideration. This method is not limited to our specific RNN quantum state representation, but applicable to any NQS architecture. Moreover, it can in principle be combined with spatial symmetries, which potentially help to improve the accuracy and furthermore enable the analysis of low-lying excitations in a specific symmetry sector, e.g. $m_4$ rotational resonances [46,47]. We present the results for different lattice geometries, including a triangular ladder. Finally, we address the limitations and drawbacks of our RNN ansatz, provide tests on the effects of more sophisticated training procedures, and discuss possible improvements.

I. ARCHITECTURE AND TRAINING
In the present paper we use a recurrent neural network (RNN) [48] to represent a quantum state defined on a 2D lattice with $N_\text{sites} = N_x \cdot N_y$ positions occupied by $N_p$ particles. RNNs and similar generative architectures combined with variational energy minimization have already been applied successfully to spin systems [5,10,32,36]. One of the advantages of these architectures is their autoregressive property, which allows extremely efficient independent sampling from the RNN wave function [18,49], which is important for the training procedure.
In order to represent fermionic wave functions, we start from the same approach as for bosonic spin systems and use an RNN architecture consisting of $N_\text{sites}$ (tensorized) gated recurrent units (GRUs), each one representing one site of the system. The information is passed from the first cell, corresponding to the first lattice site, to the last site in a recurrent fashion, see Fig. 13 in Appendix A.
The RNN architecture and its application to model quantum states can most easily be understood for 1D systems: At each lattice site $i$ we define $\sigma_i$, an $N_s \times d_v$ matrix, to denote the $N_s$ local sample configurations at the respective site, and $\sigma$, an $N_s \times L \times d_v$ matrix, for the complete configuration of system size $L$, with $d_v$ the visible dimension. For the $t-J$ model, each (local) configuration consists of zeros, ones and minus ones to denote holes, spin up and spin down particles, respectively, i.e. the visible dimension is $d_v = 3$. Furthermore, we define the hidden state $h_i$ of dimension $N_s \times d_h$ that is used to pass information from previous lattice sites through the network, with $d_h$ the hidden dimension. Given the configuration $\sigma_i$ at site $i$ and a hidden state $h_{i-1}$, the RNN cell outputs the updated hidden state $h_i$ as well as a conditional probability distribution and a local phase. Hereby, the hidden dimension $d_h$ determines the number of parameters of our RNN quantum state.
Since it is possible to generate $N_s \geq 1$ samples at once by passing sets of local configurations $\sigma_i$ through the network in parallel, we use vector notation $\boldsymbol{\sigma}_i$ and $\boldsymbol{\sigma}$ in the following, where each entry of $\boldsymbol{\sigma}$ ($\boldsymbol{\sigma}_i$) corresponds to one configuration (local configuration).
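The autoregressive pass described above can be sketched in a few lines. The following is a minimal, self-contained illustration with random placeholder weights and a plain tanh cell instead of the tensorized GRU used in this work; it only demonstrates how local states are drawn site by site while the joint probability is accumulated:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for one RNN cell: maps (one-hot input, hidden state) to
# (new hidden state, conditional probabilities over d_v = 3 local states).
# All weights are random placeholders, not a trained model.
d_v, d_h, L = 3, 8, 6
W_in = rng.normal(size=(d_h, d_v))
W_h = rng.normal(size=(d_h, d_h))
W_out = rng.normal(size=(d_v, d_h))

def rnn_step(x_onehot, h):
    h_new = np.tanh(W_in @ x_onehot + W_h @ h)
    logits = W_out @ h_new
    p = np.exp(logits - logits.max())
    return h_new, p / p.sum()          # softmax conditional probabilities

def sample_chain():
    """Draw one configuration autoregressively, site by site."""
    h, x = np.zeros(d_h), np.zeros(d_v)  # initial hidden state and input
    sigma, log_P = [], 0.0
    for _ in range(L):
        h, p = rnn_step(x, h)
        s = rng.choice(d_v, p=p)       # 0 = hole, 1 = spin up, 2 = spin down
        sigma.append(s)
        log_P += np.log(p[s])          # accumulate joint log-probability
        x = np.eye(d_v)[s]             # feed the sample to the next cell
    return sigma, log_P
```

Because each site is sampled from its exact conditional, the draws are independent and unbiased, with no autocorrelation as in Markov-chain sampling.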
The RNN wave function is represented by an RNN with cells that have two output layers, one for the local phase $\phi_\lambda(\sigma_i|\sigma_{<i})$ and one for the local amplitude $P_\lambda(\sigma_i|\sigma_{<i})$ [10]. In total, the RNN wave function is given by

$$\psi_\lambda(\boldsymbol{\sigma}) = \sqrt{P_\lambda(\boldsymbol{\sigma})}\, e^{i \phi_\lambda(\boldsymbol{\sigma})}, \quad (2)$$

where $\phi_\lambda(\boldsymbol{\sigma}) = \sum_{i=1}^{N} \phi_\lambda(\sigma_i|\sigma_{<i})$ is the phase and $P_\lambda(\boldsymbol{\sigma}) = \prod_{i=1}^{N} P_\lambda(\sigma_i|\sigma_{<i})$ is the amplitude of the respective configuration $\boldsymbol{\sigma}$.
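Assembling the full wave function from the per-site conditionals is simple bookkeeping; a minimal sketch of the amplitude-phase decomposition above, with placeholder numbers standing in for the conditional probabilities and local phases:

```python
import numpy as np

# Placeholder per-site conditionals P(σ_i|σ_<i) and local phases φ(σ_i|σ_<i);
# in practice these come from the RNN output layers.
P_cond = np.array([0.5, 0.25, 0.8, 0.1])
phi_cond = np.array([0.0, np.pi / 2, np.pi, 0.0])

# ψ(σ) = sqrt(Π_i P_i) * exp(i Σ_i φ_i), accumulated in log space for stability
log_psi = 0.5 * np.sum(np.log(P_cond)) + 1j * np.sum(phi_cond)
psi = np.exp(log_psi)
```

Here |ψ(σ)|² recovers the joint probability Π_i P_i, which is what makes direct sampling from the amplitudes possible.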
In the present work we use the tensorized 2D version of the RNN wave function introduced above, as proposed in Ref. [50], where the information encoded in the hidden states is passed in a 2D manner, see Appendix A. Furthermore, we use a variant of a gated recurrent unit (GRU) instead of a simple RNN cell, which is more successful in capturing long-term dependencies [51][52][53].
Our RNN ansatz uses a $U(1) = U(1)_{\hat{N}} \times U(1)_{\hat{S}^z}$ symmetry, i.e. conserved total particle number and total magnetization, as in Refs. [9,10,19,24,36,54]. Further details on the RNN architecture can be found in Appendix A. Moreover, in contrast to previous RNN works on the Heisenberg model [10], we do not implement any bias on the phase of the quantum state, such as the Marshall sign rule [37], in order to make our architecture applicable to any number of holes in the system.
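A common way to implement such U(1) symmetries in autoregressive models is to zero out the conditional probabilities of local states that can no longer lead to the target particle number and magnetization, and renormalize. The sketch below is our assumption of such a standard masking scheme, not necessarily the exact implementation used here (the helper name `mask_probs` is ours):

```python
import numpy as np

# Local states: 0 = hole, 1 = spin up, 2 = spin down.
def mask_probs(p, n_up, n_dn, sites_left, N_up_target, N_dn_target):
    """Mask a conditional distribution p over (hole, up, down) so that
    only configurations with the target particle numbers remain reachable."""
    p = p.copy()
    need_up = N_up_target - n_up          # up spins still to be placed
    need_dn = N_dn_target - n_dn          # down spins still to be placed
    if need_up == 0:
        p[1] = 0.0                        # no more up spins allowed
    if need_dn == 0:
        p[2] = 0.0                        # no more down spins allowed
    if sites_left - need_up - need_dn == 0:
        p[0] = 0.0                        # every remaining site must be occupied
    return p / p.sum()                    # renormalize

# Example: 2 sites left, still need 1 up and 1 down -> a hole is forbidden
p = mask_probs(np.array([0.2, 0.5, 0.3]), n_up=1, n_dn=0,
               sites_left=2, N_up_target=2, N_dn_target=1)
```

Because the mask is applied before sampling, every drawn configuration has exactly the target N and S^z by construction, with no rejected samples.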

A. Minimization Procedure
In order to find the ground state of the system under consideration, we use the variational Monte Carlo (VMC) minimization of the energy [49,55]. VMC has already been used in a wide range of machine learning applications (see e.g. Refs. [17,56] for an overview). In VMC, the expectation value of the energy of the RNN trial wave function,

$$\langle \hat{H} \rangle_\lambda = \sum_{\boldsymbol{\sigma}} |\psi_\lambda(\boldsymbol{\sigma})|^2 E_\text{loc}(\boldsymbol{\sigma}), \quad (3)$$

is minimized. Here, we have defined the local energy

$$E_\text{loc}(\boldsymbol{\sigma}) = \sum_{\boldsymbol{\sigma}'} \langle \boldsymbol{\sigma} | \hat{H} | \boldsymbol{\sigma}' \rangle \frac{\psi_\lambda(\boldsymbol{\sigma}')}{\psi_\lambda(\boldsymbol{\sigma})}. \quad (4)$$

As shown e.g. in Refs. [10,26], one can use the cost function

$$C = \frac{2}{N_s} \mathrm{Re} \sum_{\boldsymbol{\sigma} \sim |\psi_\lambda|^2} \log \psi^*_\lambda(\boldsymbol{\sigma}) \left[ E_\text{loc}(\boldsymbol{\sigma}) - \langle E_\text{loc} \rangle \right] \quad (5)$$

to minimize both the local energy as well as the variance of the local energy, making the training more stable. In Eq. (5), the sum runs over $N_s$ samples $\boldsymbol{\sigma}$ drawn from $|\psi_\lambda|^2$.
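The VMC energy estimate can be illustrated on a toy two-state Hamiltonian where the exact wave function is known, so that the sampled mean of the local energies can be checked against the exact expectation value (all numbers below are placeholders of our own choosing):

```python
import numpy as np

# Toy 2-level Hamiltonian and a (known) real trial wave function
H = np.array([[1.0, -0.5],
              [-0.5, -1.0]])
psi = np.array([0.3, 1.0])
psi /= np.linalg.norm(psi)

def E_loc(sigma):
    """Local energy E_loc(σ) = Σ_σ' <σ|H|σ'> ψ(σ')/ψ(σ)."""
    return sum(H[sigma, sp] * psi[sp] / psi[sigma] for sp in range(2))

# Sample configurations from |ψ|^2 and average the local energies
rng = np.random.default_rng(1)
samples = rng.choice(2, size=20000, p=psi ** 2)
E_vmc = np.mean([E_loc(s) for s in samples])

E_exact = psi @ H @ psi    # exact expectation value for comparison
```

At the exact ground state the local energy becomes constant over configurations (zero variance), which is the property the variance term in the cost function exploits.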
One of the main difficulties of neural network quantum states is the optimization of Eq. (5), due to its typically rugged landscape with many local minima and saddle points [14]. If not stated differently, we use the Adam optimizer [57] for the optimization of Eq. (5), following previous works on NQS using RNNs [9,10,36].

[Figure 2. Left: An exemplary configuration $\boldsymbol{\sigma}$ with five holes and ten spin up (red) and spin down (blue) particles each. Sites are labeled in a 1D manner, as denoted by the white numbers. Right: An exemplary hopping process to the nearest neighbor in horizontal direction ends in the configuration $\boldsymbol{\sigma}'$ and effectively exchanges $P$ particles, here $P = 3$. The respective sign of $\boldsymbol{\sigma}'$ relative to $\boldsymbol{\sigma}$ is calculated using Eq. (9).]

To improve the optimization, often stochastic reconfiguration (SR) [58,59] is used. In this method, each parameter $\lambda_k$ of the neural network is optimized individually according to

$$\lambda_k \rightarrow \lambda_k - \eta \sum_{k'} S^{-1}_{kk'} F_{k'}, \quad (6)$$

with learning rate $\eta$, the metric $S_{kk'} = \sum_{\boldsymbol{\sigma}} \bar{O}^*_{\boldsymbol{\sigma} k} \bar{O}_{\boldsymbol{\sigma} k'}$, the forces $F_{k'} = \sum_{\boldsymbol{\sigma}} \bar{O}^*_{\boldsymbol{\sigma} k'} \bar{\epsilon}_{\boldsymbol{\sigma}}$, where $O_{\boldsymbol{\sigma} k} = \partial_{\lambda_k} \log \psi_\lambda(\boldsymbol{\sigma})$, $\bar{O}_{\boldsymbol{\sigma} k} = (O_{\boldsymbol{\sigma} k} - \langle O_{\boldsymbol{\sigma} k} \rangle)/\sqrt{N_s}$ and $\bar{\epsilon}_{\boldsymbol{\sigma}} = (E_\text{loc}(\boldsymbol{\sigma}) - \langle E_\text{loc} \rangle)/\sqrt{N_s}$. In the cases where SR is applied, we use two recently proposed SR variants, namely minimum-step stochastic reconfiguration (minSR) and the SR variant based on a linear algebra trick by Rende et al. [60]. Both enable the use of a large number of NQS parameters, see Appendix B 2. In the minSR update, Eq. (6) is solved in sample space by

$$\delta \boldsymbol{\lambda} = \bar{O}^\dagger T^{-1} \bar{\boldsymbol{\epsilon}}, \quad (7)$$

with $T = \bar{O} \bar{O}^\dagger$ [61]. The version of Rende et al. [60] arrives at an equivalent update via a linear algebra identity, likewise avoiding the inversion of the full parameter-space metric.
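The minSR update can be sketched directly in NumPy; the point is that only an N_s × N_s matrix is inverted even though N_p ≫ N_s. The sizes, the diagonal offset, and the random stand-ins for the centered log-derivatives and local-energy residuals below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
N_s, N_p = 50, 2000                       # many more parameters than samples

# Random stand-ins for the log-derivative matrix O and local-energy residuals
O = rng.normal(size=(N_s, N_p))
O_bar = (O - O.mean(axis=0)) / np.sqrt(N_s)   # centered, scaled by 1/sqrt(N_s)
eps = rng.normal(size=N_s)
eps = eps - eps.mean()                    # residuals E_loc - <E_loc> are centered

# minSR: solve in the N_s x N_s sample space instead of N_p x N_p
delta = 1e-6                              # diagonal offset for stability
T = O_bar @ O_bar.T                       # N_s x N_s kernel
dlambda = O_bar.T @ np.linalg.solve(T + delta * np.eye(N_s), eps)
```

This produces a full parameter update of length N_p while only ever factorizing a 50 × 50 matrix, which is what makes SR feasible for large networks.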

B. Fermionic RNN Wave Functions
The architecture introduced above is per se bosonic. When considering fermionic systems, we need to take the antisymmetry of the wave function into account. This antisymmetry is included during the variational Monte Carlo steps when calculating the local energy introduced in Eq. (4), which we can expand into a sum over all configurations $\boldsymbol{\sigma}'$ connected to $\boldsymbol{\sigma}$ by the Hamiltonian.
[Figure 3. Adding the momentum constraint $C_{\mathbf{k}_\text{target}}$, Eq. (13), on top of the energy minimization $C$, Eq. (5), (top right) changes the loss landscape as schematically shown on the bottom right and forces the NQS into a higher energy state with the desired momentum $\mathbf{k}_\text{target}$ (top left vs. bottom left).]
In this sum, we multiply each term with a factor $(-1)^P$ if $\boldsymbol{\sigma}'$ is connected to $\boldsymbol{\sigma}$ by $P$ two-particle permutations, as suggested in Ref. [26]. In order to do so, we take the permutations along the sampling path into account. For the $t-$XXZ Hamiltonian under consideration, we only need to consider the hopping term for calculating the antisymmetric signs. An example is shown in Fig. 2. This procedure is similar to the implementation of Jordan-Wigner strings as e.g. in Ref. [24].
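The sign bookkeeping for a single hop can be sketched as follows. Sites are labeled along the 1D sampling path, and a hop from site i to site j exchanges the moving particle with every particle strictly in between, giving the factor (−1)^P (the helper name `hop_sign` is our own):

```python
def hop_sign(config, i, j):
    """Fermionic sign of a hop from site i to site j.

    config: list with 0 = hole, +1/-1 = spin up/down particle,
    sites labeled along the 1D sampling path.
    """
    lo, hi = (i, j) if i < j else (j, i)
    # P = number of particles strictly between the two sites
    P = sum(1 for s in config[lo + 1:hi] if s != 0)
    return (-1) ** P

# Example in the spirit of Fig. 2: a hop passing three particles gives P = 3
sigma = [1, 0, -1, 1, -1, 0, 1]
```

For purely nearest-neighbor hopping on the sampling path itself P is always zero; nontrivial signs arise because 2D neighbors are distant along the 1D labeling.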

II. NQS DISPERSION RELATIONS
A lot of information on a physical system under investigation is contained in its dispersion relation $E(\mathbf{k})$, e.g. in the bandwidth (effective mass) and the low-lying elementary excitations relative to the ground state, which determine the physical properties. Hence, it is of high relevance to access $E(\mathbf{k})$. However, its calculation is in general computationally costly [62], since it typically requires a time evolution of the state [45].
In this section, we calculate the dispersion relations $E(\mathbf{k})$ of $t-$XXZ models in different dimensions and on different lattice geometries using NQS. Specifically, we use the RNN wave function introduced in Sec. I. However, the method is applicable to any NQS architecture, in contrast to e.g. Ref. [63]. It only requires the possibility to draw samples from the NQS and to calculate the respective probabilities, making the calculation of $E_\text{NQS}(k_x, k_y)$ computationally efficient. Furthermore, the scheme can also be combined with spatial symmetries, as discussed further in Sec. III 3. This could help to improve the accuracy, e.g. when using an NQS with implemented translational invariance, but additional symmetries could also be used to calculate e.g. $m_4$ rotational resonances [46]. In order to calculate the dispersion relation from the NQS under consideration, we train our NQS to represent the ground state and then turn on a constraint in the loss function that forces the system to a higher energy state with the respective target momentum, see Fig. 3.
The momentum $\mathbf{k}_\text{NQS}$ of the NQS wave function is calculated from the translation operator $\hat{T}_{\mathbf{R}}$, which translates a state $\psi(\mathbf{r})$ by the respective vector $\mathbf{R}$, i.e. $\hat{T}_{\mathbf{R}} \psi(\mathbf{r}) = \psi(\mathbf{r} + \mathbf{R})$. Furthermore, it can be written as [64]

$$\hat{T}_{\mathbf{R}} = e^{i \hat{\mathbf{k}} \cdot \mathbf{R}}, \quad (10)$$

with the momentum operator $\hat{\mathbf{k}}$. To determine the expectation value $\mathbf{k}_\text{NQS} = (k_x, k_y)$ using samples $\boldsymbol{\sigma}$ drawn from the NQS wave function, we calculate the expectation value of $\hat{T}_{\mathbf{R}}$. For example, for a square lattice, this is done by translating all snapshots by $\mathbf{R} = \mathbf{e}_x$ and $\mathbf{R} = \mathbf{e}_y$ with $|\mathbf{e}_\mu| = a$ for lattice constant $a$ and $\mu = x, y$.
Then, we calculate the respective NQS amplitudes of the translated states, $P_\lambda(\hat{T}_{\mathbf{e}_\mu} \boldsymbol{\sigma})$, to determine the expectation value

$$\langle \hat{T}_{\mathbf{e}_\mu} \rangle = \sum_{\boldsymbol{\sigma}} |\psi_\lambda(\boldsymbol{\sigma})|^2 \frac{\psi_\lambda(\hat{T}_{\mathbf{e}_\mu} \boldsymbol{\sigma})}{\psi_\lambda(\boldsymbol{\sigma})} \approx \frac{1}{N_s} \sum_{\boldsymbol{\sigma} \sim |\psi_\lambda|^2} \frac{\psi_\lambda(\hat{T}_{\mathbf{e}_\mu} \boldsymbol{\sigma})}{\psi_\lambda(\boldsymbol{\sigma})}, \quad (11)$$

with the last equality due to the translational invariance of the ground state of a square lattice, which we assume to be (approximately) present for our NQS ground states, see also Appendix C. Hence,

$$k_\mu^\text{NQS} = -\frac{i}{a} \log \langle \hat{T}_{\mathbf{e}_\mu} \rangle. \quad (12)$$

Using a sufficiently converged NQS ground state wave function as initial state, we train using VMC with an additional term in the loss function,

$$C_{\mathbf{k}_\text{target}} = C + \gamma(t) \left| \mathbf{k}_\text{NQS} - \mathbf{k}_\text{target} \right|^2, \quad (13)$$

with the RNN momentum $\mathbf{k}_\text{NQS}$ and the target momentum $\mathbf{k}_\text{target}$. We use a prefactor $\gamma(t) = \gamma_0 \log_{10}(1 + 9(t - t_\text{warmup})/\tau)$ that is turned on with typically $\tau = 100, \ldots, 1000$ and $\gamma_0 = 1, \ldots, 10$ and gradually lifts all areas in the loss landscape that correspond to an NQS wave function with momentum $\mathbf{k}_\text{NQS} \neq \mathbf{k}_\text{target}$, forcing the NQS to a higher energy state at momentum $\mathbf{k}_\text{NQS} = \mathbf{k}_\text{target}$, see Fig. 3.
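The momentum estimator can be checked on a toy state where the momentum is known exactly: a single-particle plane wave on a ring (our own test case, not a system from this work). The sampled mean of the translation ratios recovers the momentum, and the penalty term added to the VMC loss vanishes when the target momentum is met:

```python
import numpy as np

# Plane wave ψ(x) ∝ exp(i k x) on a ring of L sites; "configuration" = position x
L = 8
k_true = 2 * np.pi * 3 / L
psi = np.exp(1j * k_true * np.arange(L)) / np.sqrt(L)

# Sample positions from |ψ|^2 and evaluate the translation ratios ψ(Tσ)/ψ(σ)
rng = np.random.default_rng(0)
samples = rng.choice(L, size=500, p=np.abs(psi) ** 2)
ratios = psi[(samples + 1) % L] / psi[samples]   # translate by one site
k_nqs = np.angle(ratios.mean()) % (2 * np.pi)    # k from -i log <T> (lattice constant 1)

def momentum_penalty(k_nqs, k_target, gamma):
    """Extra loss term in the spirit of the momentum constraint above."""
    return gamma * np.abs(k_nqs - k_target) ** 2
```

For a true momentum eigenstate every ratio equals exp(ik), so the estimator has zero variance; deviations between samples signal broken translational invariance of the NQS.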
For $\mathbf{k}_\text{target}$ far away from the ground state momentum, we observe empirically that the imaginary part of $\mathbf{k}_\text{NQS}$ can become large, on the same order as the real part, in particular if the ground state accuracy was not sufficiently high. In these cases, the RNN ends up in states that are not eigenstates of the momentum operator. In order to prevent our RNN wave function from getting trapped in these states, we apply an additional constraint in the loss function, penalizing large imaginary parts of the momentum, $\mathrm{Im}\, \mathbf{k}_\text{NQS}$.

t−XXZ model in 1D
In Fig. 4a, the dispersion for an antiferromagnetic $t-$XXZ chain with 20 sites and $J_\pm = 1$, $J_z = 4$ and $t = 8$, obtained with a 1D RNN and exact diagonalization (ED), is shown. The relative error on the ground state energy at $k_x = 0.5\pi$, obtained during a training with 20000 iterations, is shown in Fig. 4b. The energies away from the ground state at $k_x = 0.5\pi$, see Fig. 4a, are in relatively good agreement with the exact values from ED. However, at some values of $k_x \neq 0.5\pi$ it can be seen that the RNN is trapped in local minima close to the ground state. Overall, the RNN succeeds in capturing physical properties like the bandwidth very accurately, revealing the underlying physical excitations: For the system under consideration, the bandwidth and the shape of the dispersion in Fig. 4a are a result of spin-charge separation in 1D systems. Spin-charge separation denotes the fact that the motion of a hole in such an AFM spin chain with couplings $J_\pm, J_z \ll t$ can be approximated by an almost free hole that is only weakly coupled to the spin chain. Hence, the dispersion in Fig. 4 can be approximated by two separate dispersions, i.e. holon and spinon dispersions. Hereby, the holon is the charge excitation, associated with the energy scale $t$, and the spinon is the spin excitation, associated with the energies $J_\pm, J_z$. In Ref. [46] it is shown that the combined dispersion is

$$E(k_x) = \epsilon_\text{holon}(k_\text{h}) + \epsilon_\text{spinon}(k_\text{s}), \quad (14)$$

where $k_\text{h}$ is the momentum of the holon and $k_x = k_\text{h} + k_\text{s}$ is the combined momentum of holon and spinon. Eq. (14) is denoted by the gray line in Fig. 4. Again, the agreement with the RNN is relatively good.

t − J model on a square lattice
Due to the layered structure of high-$T_c$ superconductors like cuprates [21] or nickelates [65,66], the physics of $t-J$ systems upon doping is particularly interesting in 2D. In Figs. 1 and 5, the quasiparticle dispersions for a single hole on $10 \times 4$ and $4 \times 4$ $t-J$ and $t-J_z$ lattices are presented. In both cases, Figs. 1b and 5b show that the ground state convergence is better for the $t-J_z$ model, with relative errors on the order of $\Delta\epsilon \approx 10^{-3}$ for both system sizes, yielding a good agreement with the reference energies from DMRG ($10 \times 4$ system) and ED ($4 \times 4$ system) for all considered energies $E(k_x, k_y)$ away from the ground state. With a relative error of $\Delta\epsilon \approx 10^{-2}$, the error of the $t-J$ ground states lies above that of the $t-J_z$ systems, which is also reflected in the accuracy of the dispersion $E_\text{RNN}(k_x, k_y)$ in Figs. 1a and 5a.
In contrast to the previous section, there is no spin-charge separation in the strict sense in two-dimensional systems. In the case $t \gg J_\pm = J_z =: J$ that we consider here ($t/J = 3$), the mobile dopant can be described by fractionalized spinons and chargons that are confined by a string-like potential, which arises due to the spin background distortion when the dopant moves through the system [67,68]. Based on this idea, Laughlin [69] drew the analogy with the 1D Fermi-Hubbard or $t-J$ systems and suggested that the dispersion in the respective 2D systems can be interpreted in terms of pointlike partons, spinons and chargons, that interact with each other. This parton picture explains why the quasiparticle dispersion for a single hole is dominated by the spinon, with a bandwidth on the order of $J_\pm$ and corrections by the chargon on energy scales of $t$ [35]. This mechanism also provides the explanation for the flat dispersion of the $t-J_z$ model in contrast to the $t-J$ model, as captured by the RNN, see Figs. 1 and 5. Despite the small deviations from the dispersions calculated with ED or DMRG, our RNN architecture succeeds in capturing the respective bandwidths of the $t-J_z$ and $t-J$ models very accurately, allowing us to gain valuable insights into the spinon and chargon physics from the RNN dispersions. Furthermore, the fact that node $(\pi/2, \pi/2)$ and antinode $(\pi, 0)$ are degenerate in the $4 \times 4$ system is correctly reproduced.
Lastly, we would like to mention that there is a small region of suppressed spectral weight near (π, π) in the DMRG results of the t − J system [46].This suppression yields difficulties for our RNN scheme that are further discussed in Appendix C.

t − J model on a triangular lattice
On triangular lattices, the physical phenomena that are observed are distinctly different from the physics of bipartite lattices, due to the notion of frustration and the absence of particle-hole symmetry in non-bipartite lattices, among them e.g. kinetic frustration [70,71]. In particular, the underlying constituents upon doping the triangular ladder are not known [71], making the triangular lattice an intriguing system to study. Recent advancements have shown that these lattices can also be studied experimentally using optical triangular lattices [72][73][74] and solid state platforms based on Moiré heterostructures [75][76][77].
Triangular spin systems have already been studied using RNNs [19].Here, we consider a triangular t−J ladder with length L x = 9, with the quasiparticle dispersion for a single hole and the learning curves with and without doping shown in Fig. 6.
As suggested in Ref. [19], we use variational annealing for the training on the triangular lattice, which was shown to improve the performance for frustrated systems like the triangular Heisenberg model [19]. The idea of annealing is to avoid getting stuck in local minima by including an artificial temperature $T$ in the learning process. In order to do so, the variational free energy of the model,

$$F_\lambda = \langle \hat{H}_\lambda \rangle - T S, \quad (15)$$

is minimized instead of the energy, Eq. (3). Here, the averaged Hamiltonian $\langle \hat{H}_\lambda \rangle$ is given by $\langle \hat{H}_\lambda \rangle = \sum_{\boldsymbol{\sigma}} |\psi_\lambda(\boldsymbol{\sigma})|^2 E_\text{loc}(\boldsymbol{\sigma})$. Furthermore, $S$ denotes the Shannon entropy

$$S = -\sum_{\boldsymbol{\sigma}} P_\lambda(\boldsymbol{\sigma}) \log P_\lambda(\boldsymbol{\sigma}). \quad (16)$$

The minimization procedure that we use starts with a warmup phase with a constant temperature $T_0$, before decreasing the temperature $T(t) = T_0(1 - (t - t_\text{warmup})/\tau)$ linearly with the minimization steps $t$, with $\tau = 10000$ and $t_\text{final} = 40000$ training steps.
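The annealing schedule and the free-energy objective are straightforward to write down. The sketch below uses a linear ramp with illustrative schedule values and a sample estimate of the Shannon entropy from the log-probabilities (the function names are ours):

```python
import numpy as np

T0, t_warmup, tau = 1.0, 1000, 10000   # illustrative schedule values

def temperature(t):
    """Constant-T warmup phase, then a linear ramp down to zero."""
    if t < t_warmup:
        return T0
    return max(T0 * (1.0 - (t - t_warmup) / tau), 0.0)

def free_energy(E_mean, log_probs, t):
    """Variational free energy F = <H> - T(t) S, with the Shannon entropy S
    estimated as the sample mean of -log P_λ(σ)."""
    S = -np.mean(log_probs)
    return E_mean - temperature(t) * S
```

Early in training the entropy bonus keeps the sampled distribution broad, and as T(t) reaches zero the objective reduces to the plain VMC energy.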
In Fig. 6b it can be seen that this procedure yields relatively good results for the ground states, with errors of $\Delta\epsilon \approx 0.001$ for both $N_h = 0$ and $N_h = 1$. For the dispersion shown in Fig. 6a, we consider the momentum $k$ defined along the ladder, as shown in the inset figure. When enforcing $k \neq 0.444\pi$ away from the ground state, the exact energy gaps from ED to the first excited states strongly decrease, and the RNN gets trapped in these states in most cases, in particular for $k > 0.444\pi$. Furthermore, the error bars of the enforced momenta are much higher compared to the other lattice geometries studied in Figs. 1, 4 and 5, suggesting that the RNN states partly break the translation invariance and hence challenge the momentum optimization scheme.
III. PERFORMANCE OF THE RNN ANSATZ

In this section, we discuss the capability of our bosonic and fermionic RNN ansätze presented in Sec. I to learn and represent the ground states of the $t-$XXZ model. For our analysis, we focus on $t-J$ and $t-J_z$ models on a $4 \times 4$ square lattice.

Figs. 7 and 8 show the relative error for the ground state energies of $t-J_z$ and $t-J$ models obtained with our RNN ansatz upon doping the half-filled system with $N_h$ holes. Starting from $N_h = 0$ in the $t-J_z$ model, the accuracy of the respective Ising ground state is very high in both cases, with relative errors $\Delta\epsilon = (E_\text{RNN} - E_\text{ED})/|E_\text{ED}|$ below the numerical precision. The $t-J$ model, reducing to the Heisenberg model at $N_h = 0$, features spin-flip terms besides the Ising interactions, making the ground state search more difficult. Our RNN reaches a ground state energy error $\Delta\epsilon \approx 10^{-4}$ after 20000 training steps. For both models, the phase and amplitude distributions shown in Figs. 7b and 8b are relatively simple, with a low variance for the logarithmic amplitude and only two values for the phase, $0$ and $\pi$. In particular, the Ising state at $N_h = 0$ of the $t-J_z$ model features basically only two Néel states with non-zero amplitude (i.e. approximately zero log-amplitudes), shown in Fig. 7b on the very left. Note that when comparing to the literature of ground state representations using RNNs for the Heisenberg model [10,36], the optimization problem in our setup is more challenging due to the following reasons: (i) The RNN that we use has a local Hilbert space dimension of three states instead of two, allowing for all values of $N_h$ in principle. (ii) Our RNN learns the sign structure without any bias, i.e. we do not implement the Marshall sign rule already in the RNN, which would only work for $N_h = 0$. (iii) We do not include the knowledge of spatial symmetries yet, which will be done later in Sec. III 3.
Upon doping, the relative error of the ground states without antisymmetrization of the RNN wave function for the $t-J_z$ model in Fig. 7 stays below $\Delta\epsilon = 5 \cdot 10^{-4}$ for all considered hole dopings $1 \leq N_h \leq 12$. As exemplarily shown for the bosonic $N_h = 6$ case in Fig. 7b in blue, the true ground state from exact diagonalization does not have a phase structure in this case and the logarithmic amplitudes are very similar. When including the antisymmetry for the fermionic wave functions, the variance of both phase and amplitude distributions increases, which can be seen by bare eye when comparing the bosonic and fermionic ED distributions in Fig. 7b. This complicates the ground state search, and the ground state error increases significantly between $2 \leq N_h \leq 9$ for the fermionic $t-J_z$ model. At $N_h = 10$, when only four particles remain in the system and probably a Fermi-liquid regime is entered, the error decreases again to $\Delta\epsilon < 1\%$ in the fermionic case, coinciding with a lower variance of the exact log-probabilities than at intermediate dopings.

The exact log-amplitude and phase distributions from ED for $N_h > 0$ of the $t-J$ model are typically more complicated than for the $t-J_z$ model. For example, at intermediate doping the variance of the exact amplitudes becomes very large, $\sigma_b(\log|\psi|^2) = 15.91$, see Fig. 8b. This yields larger ground state energy errors than for the $t-J_z$ model and is further complicated when including the antisymmetry in the fermionic case. Again, we make the observation that for larger hole dopings, $N_h \geq 6$ for bosons and $N_h \geq 10$ for fermions, the distributions for phase and amplitude become less complicated than in the low to intermediate doping regime, yielding a higher accuracy of the RNN wave function, with errors $\Delta\epsilon \leq 10^{-4}$ for bosons and $\Delta\epsilon \leq 10^{-2}$ for fermions in the respective doping regimes.
Our results show that in the low doping regime of the $t-J$ model, both fermionic and bosonic systems are difficult to learn, see Fig. 8. This suggests that not only the fermionic sign structure is challenging, but also the motion of bosonic holes in the AFM Heisenberg background. When these holes move through the system, the spin background is affected, giving rise to an effective $J_1 - J_2$ spin model with nearest and next-nearest neighbor spin exchange interactions, which is hence more difficult to learn [78]. For the $t-J_z$ model, we observe that, probably due to the lack of spin dynamics resulting from the absence of spin-flip terms, the relative errors are comparably low in the bosonic case.
Furthermore, for all states with high log|ψ| 2 variance, there are several configurations σ with a large negative log-amplitude, i.e. |ψ(σ)| 2 ≈ 0. This makes an accurate determination of expectation values extremely costly and can affect the training process.For example, in Ref. [79] it was shown that this yields higher variances for the gradients determined by stochastic reconfiguration.
Given these relatively high errors on the ground state energies in some cases, we test potential bottlenecks of our approach in the following, namely: (i) Difficulties in learning either the phase or the amplitude, by considering the partial learning problems separately.(ii) The optimization procedure.(iii) The optimization landscape.(iv) The expressivity of the RNN ansatz, compared to the complexity of the learning problem.

The partial learning problem
One potential bottleneck of our approach is the way the RNN wave function is split into amplitude and phase.In order to test if there are problems with the optimization of the phase or amplitude alone, we consider their learning problems separately as suggested e.g. in Refs.[14,80].

1. Phase training:
We sample from the exact ground state distribution |ψ| 2 , calculated with ED, and optimize only the phase.
2. Amplitude training: Given the correct phase distribution from ED, we optimize only the logarithmic amplitude to check if the ground-state probability amplitudes can be learned.
Fig. 9 shows the results of amplitude and phase trainings (dark and light blue), compared to the full training of both amplitude and phase (red). For all considered systems, the results of the partial trainings are closer to the exact ground state, e.g. for open boundaries and $N_h = 1$, the relative error is decreased from $\Delta\epsilon = 0.0147(37)$ to $\Delta\epsilon = 0.0040(30)$ for the amplitude training and $\Delta\epsilon = 0.0039(33)$ for the phase training. However, for all considered cases we observe the same problem as in the full training: the RNN gets stuck in a plateau that survives up to 20000 training steps. Although the relative error of the plateau decreases when considering the partial learning problems, the improvement is surprisingly low given the amount of information that is added to the training. Furthermore, whether the amplitude or phase training is more problematic remains unclear. Even for the phase training, for which the training samples are generated from the exact distribution $|\psi|^2$ calculated with ED, the improvement is not significantly larger than for the amplitude training. This is in agreement with the results of Bukov et al. [14].

Comparison of optimizers
As a next test, we compare the optimization results of different optimizers in Fig. 10a, namely stochastic gradient descent (SGD), adaptive methods like AdaBound [81] and Adam [57], and more advanced methods such as Adam+Annealing [19] and the recently developed variant of stochastic reconfiguration (SR), minimum-step SR (minSR) [61]. We show the optimization results for the t − Jz model on the left and the t − J model on the right, both for N_h = 1.
Typically, Adam is used for RNN wave function optimization [10, 19, 36, 50], adapting the learning rate in each VMC update; here, 200 samples are used in each VMC step. Another modification of the Adam training is the use of variational annealing as introduced in Sec. II 3, shown to improve the performance for frustrated systems [19]. The minimization procedure that we use starts with a warmup phase at constant temperature T_0 = 1, before decreasing the temperature linearly with the minimization step t, T(t) = T_0 (1 − (t − t_warmup)/τ). Typically, we use τ = 5000 and stop the training after t_final = 20000 training iterations, but tests up to τ = 20000 and t_final = 40000 did not yield any improvements. Fig. 10a shows that for the square lattice, the use of annealing does not bring any advantage within the error bars.
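The linear annealing schedule above can be sketched in a few lines. Note that the warmup length t_warmup is not specified in the text, so the default below is only illustrative:

```python
def annealing_temperature(t, T0=1.0, t_warmup=1000, tau=5000):
    """Pseudo-temperature for variational annealing: constant T0 during
    the warmup phase, then T(t) = T0 * (1 - (t - t_warmup)/tau),
    clipped at zero once the linear ramp is exhausted."""
    if t < t_warmup:
        return T0
    return max(0.0, T0 * (1.0 - (t - t_warmup) / tau))
```

During training, this temperature weights a pseudo-entropy term added to the variational energy, as in Ref. [19], so that the effective cost interpolates from a smoothed landscape at T_0 = 1 down to the bare energy at T = 0.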
Lastly, we apply minSR, a recently developed variant of SR [61], as introduced in Sec. I A. For a stable training, we ensure non-exploding gradients by adding an offset δ(t) to the diagonal of the T-matrix, with δ(t) exponentially decaying from 1 to 10^−10. After determining the gradients using Eq. (7), we apply the Adam update rule, which we empirically find to perform better than the plain GD update. Moreover, since it is crucial to use enough samples for a sufficiently good approximation of the gradients in SR, typically more samples are needed than for the other optimization routines. Here, we use 1000 samples in each minSR update and find that the one-hole t − J ground state errors improve below the values obtained with Adam, see Fig. 10a on the right. However, we show in Appendix B 2 that a comparison with Adam using the same number of samples does not conclusively determine which optimization routine is better, similar to the SR results in Ref. [14].
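The regularized minSR step and the decaying diagonal offset can be sketched as follows. This is a schematic reading of the procedure (the matrix conventions and the exact decay law are assumptions; the paper's Eq. (7) defines the precise estimator):

```python
import numpy as np

def minsr_update(O, eps, delta):
    """One minSR step (sketch): instead of inverting the Ntheta x Ntheta
    S-matrix of standard SR, invert the much smaller Ns x Ns matrix
    T = O O^dagger.  O: (Ns, Ntheta) centered log-derivatives,
    eps: (Ns,) centered local-energy residuals."""
    T = O @ O.conj().T                       # (Ns, Ns) minSR T-matrix
    T_reg = T + delta * np.eye(T.shape[0])   # diagonal offset for stability
    return O.conj().T @ np.linalg.solve(T_reg, eps)

def delta_schedule(t, t_final=20000, d0=1.0, d1=1e-10):
    """Exponential decay of the diagonal offset from d0 to d1."""
    return d0 * (d1 / d0) ** (t / t_final)
```

The resulting parameter change is then fed into the Adam update rule rather than applied directly, as described above.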
The reason behind this can be understood by considering the spectrum of the T-matrix of the minSR algorithm: similar to the results of Ref. [82] for the S-matrix of the SR algorithm, Fig. 10b shows that the eigenvalues λ_i of T decrease extremely rapidly, in particular at the beginning of the training, indicating a very flat optimization landscape. This is a typical problem of autoregressive architectures [82] and causes uncontrolled, large values of T^−1 and consequently also of the gradients δθ, see Eq. (7). Furthermore, the spectrum has no feature indicating that it could be cut off at a specific eigenvalue, making a regularization very difficult. Hence, the diagonal offset δ(t) must be chosen relatively large, yielding parameter updates that are very similar to the plain vanilla Adam optimization as long as δ(t) is larger than many of the T-eigenvalues. The spectrum of the X^T X matrix of the SR variant by Rende et al. [60], see Eq. (8), exhibits the same problem.
Comparing the results for different hidden dimensions, e.g. for minSR in Fig. 10a (right), suggests that a hidden dimension h_d > 100 could in principle improve the results further. However, we will show in Sec. III 4 that for such a large number of parameters it is even possible to encode the wave function using exact methods, by restricting to a fixed number of holes and hence reducing the Hilbert space dimension to ≪ 3^Nsites.

Spatial symmetries
The RNN ansatz we use has an implemented U(1) = U(1)_N × U(1)_Ŝz symmetry, i.e. conserved total particle number and total magnetization [10, 24]. This is done by calculating the current particle number N_p(i) (magnetization S_z(i)) after the i-th RNN cell during the sampling process and, once N_p(i) = N_target (S_z(i) = S_z,target), assigning a zero conditional probability to adding further particles (magnetization) on all sites j > i that are considered afterwards, see Appendix A 3. As a next test, we employ additional spatial symmetries: for a symmetry operation T̂ of the lattice symmetry group, the exact ground state in the trivial symmetry sector satisfies ψ(T̂σ) = ψ(σ). For the rotational C4 symmetry of the square lattice, we employ this constraint (i) in the training, by implementing it in the cost function, or (ii) in the RNN ansatz as in Ref. [10].
For (ii), we assign the group-averaged probability P(σ) = (1/|G|) Σ_i P_λ(T̂_i σ) for all operations T̂_i in the symmetry group G, similar to Ref. [10].
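The group averaging of implementation (ii) can be sketched for the C4 rotations; here `prob_fn` stands for the (unsymmetrized) RNN probability of a square-lattice configuration and is a hypothetical placeholder:

```python
import numpy as np

def symmetrized_prob(prob_fn, sigma):
    """Implementation (ii), sketched: average the ansatz probability
    over the four C4 rotations of the configuration sigma (L x L array),
    P_sym(sigma) = (1/4) * sum_i P(T_i sigma)."""
    total, s = 0.0, sigma
    for _ in range(4):
        total += prob_fn(s)
        s = np.rot90(s)  # next element of the C4 group
    return total / 4.0
```

By construction, the averaged probability is invariant under any rotation of the input, so the ansatz obeys the symmetry exactly rather than only approximately through a cost-function penalty. (The phase requires a separate symmetrization, handled in Ref. [10].)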
The optimization results using (i) and (ii) are shown in Fig. 11 for the t − J and t − Jz model on a 4 × 4 square lattice. Constraining the RNN wave function directly via (ii) is more successful than via the cost function (i): using (ii), we obtain an order of magnitude lower relative errors for the t − Jz model compared to the results without spatial symmetries. This possibly results from the fact that the additional symmetry constraint in the cost function leads to barriers in the loss landscape in the regions where the symmetry is violated. Even when increasing the symmetry constraint gradually during the training, as described above, these barriers can prevent getting close to the minimum.
The t − J model results do not improve significantly for either symmetry implementation, (i) or (ii), with an error on the order of ∆ϵ ≈ 10^−2 with and without spatial symmetries. Hence, we conclude that applying symmetries only helps to improve the accuracy if the ground state can already be learned sufficiently well, as for the t − Jz model.
For systems with sufficiently good convergence, also rotational symmetries like s-, p- or d-wave symmetries could be enforced to probe the competition between the ground state energies in the respective symmetry sectors [83], which is highly relevant for the study of high-Tc superconductivity. In addition, low-energy excited states in these symmetry sectors could be calculated by making use of the dispersion scheme from Sec. II, e.g. m4 rotational spectra [47].

Complexity of the learning problem
Lastly, we consider the complexity of our learning problem and compare it to the expressivity of our RNN ansatz in terms of the number of parameters encoded in the RNN. In Fig. 12 on the left, we show the number of parameters used in the RNN ansatz for the 4 × 4 t − J square lattice for hidden dimensions 30 ≤ h_d ≤ 100. The number of parameters effectively encoded in the ansatz is slightly lower than the number of parameters that is actually used (gray circles on the left). This is due to the way we encode the U(1) symmetry in our approach, resulting in a small fraction of weights that are not updated since the respective probabilities are set to zero to obey the U(1) symmetry, see Appendix A 3. Furthermore, we show the dimension of the Hilbert space for the same system, 3^16, in black, i.e. the dimension of the distribution that needs to be learned by our RNN. For the small system size that we consider in Fig. 12, the Hilbert space dimension is two orders of magnitude larger than the number of RNN parameters. For the 10 × 4 system in Fig. 1, however, our RNN representation has 13 orders of magnitude fewer parameters than the Hilbert space with dimension 3^40 that is learned.
The Hilbert space dimension 3^Nsites considered so far allows for three states per site (spin up, spin down and hole), i.e. for a variable number of holes in the system. For a fixed number of holes, the number of parameters needed to describe the exact state reduces to the Hilbert space dimension of the spin system multiplied by the number of ways the holes can be distributed on the lattice. This yields a much lower number of parameters than 3^Nsites, as shown by the blue lines in Fig. 12 for 1 ≤ N_h ≤ 4. In fact, for N_h = 1 our RNNs encode even more parameters than this exact parameterization when h_d > 70. This reveals one main problem of our RNN ansatz, namely its flexibility to encode any number of holes and hence a 3^Nsites-dimensional parameter space. For future studies, we envision an RNN ansatz for a fixed number of holes, reducing the dimension of the parameter space that needs to be learned and hence facilitating the learning problem.
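The counting argument above can be made explicit: the fixed-hole sector dimension is the number of hole placements times the spin Hilbert space on the remaining sites, and summing over all sectors recovers the full 3^Nsites dimension by the binomial theorem.

```python
from math import comb

def fixed_hole_dim(n_sites, n_holes):
    """Basis states with exactly n_holes holes: hole placements times
    the spin-1/2 Hilbert space on the remaining occupied sites."""
    return comb(n_sites, n_holes) * 2 ** (n_sites - n_holes)

full_dim = 3 ** 16  # variable hole number: three local states per site
# Consistency check: summing all hole sectors recovers 3^Nsites.
assert sum(fixed_hole_dim(16, nh) for nh in range(17)) == full_dim
```

For the 4 × 4 lattice with N_h = 1 this gives 16 · 2^15 ≈ 5 × 10^5 parameters, already below the RNN parameter count for h_d > 70, consistent with Fig. 12.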
Lastly, we would like to point out that the learning problem considered here is more complex than for the spin systems that are typically studied with this architecture [10, 32, 33, 36], as can be seen by comparing the Hilbert space dimensions for local dimension d = 2, as for spin systems, vs. d = 3, as for the t − J model, in Fig. 12 on the right. For larger systems, this difference increases: e.g. for the 10 × 4 system in Fig. 1, the Hilbert space dimension increases by seven orders of magnitude when going from a spin to a t − J system (with flexible number of holes). This problem becomes even more pronounced if the Fermi-Hubbard model with local dimension d = 4 were considered.

IV. SUMMARY AND OUTLOOK
To conclude, we present a neural network architecture, based on RNNs [10], to simulate ground states of the fermionic and bosonic t − J model at finite hole doping. We show that, despite many challenges due to the increased complexity of the learning problem compared to spin systems, the RNN succeeds in capturing remarkable physical properties like the shape of the dispersion, indicating the dominant emergent excitations of the systems. In order to calculate the dispersion, we present a new method that can be used with any NQS ansatz and for any lattice geometry, and map out quasiparticle dispersions using the RNN ansatz for several different lattice geometries, including 1D and 2D systems. Moreover, it enables an extremely efficient calculation of dispersion relations compared to conventional methods like DMRG [62], which usually require a time evolution of the state [45]. The dispersion scheme yields a good agreement with exact diagonalization and DMRG results, and is expected to perform even better with improved ground state convergence. In principle, it can also be combined with a translationally symmetric NQS ansatz to improve the accuracy. Furthermore, the scheme could be combined with additional symmetries, e.g. rotational symmetries, enabling the calculation of m4 rotational spectra [84].
In addition, we provide a detailed discussion of the challenges encountered during training of our t − J RNN architecture, namely: (i) the enlarged local Hilbert space with three states (spin up, spin down and hole), yielding 3^Nsites possible configurations instead of 2^Nsites as for spin systems; (ii) the significant number of wave function amplitudes that are close to zero; (iii) the learning plateau associated with a local minimum that is encountered for all considered optimization routines, including annealing [19], minimum-step stochastic reconfiguration (minSR) [61] and the recently proposed SR variant based on a linear algebra trick [60], together with the fact that SR algorithms have problems with autoregressive architectures [82]; (iv) the complicated interplay between phase and amplitude optimization [14]; (v) the difficulty of implementing constraints on the symmetry sector under consideration, e.g. the particle number, magnetization and spatial symmetries, directly into the RNN architecture [10, 36]. Remarkably, all of these challenges are inherent to the simulation of both bosonic and fermionic systems. Our results indicate that the bottleneck for simulating fermionic spinful systems is the training and not the expressivity of the ansatz, and they point the way to possible improvements concerning the ansatz and the training procedure.

Tensorized gated recurrent units
In the present work we use the 2D version of RNN wave functions, as proposed in Ref. [50]. The underlying idea is to separate the sampling path and the information path when going through the network, as indicated in Fig. 13. While the sampling is still done in a one-dimensional fashion (light gray arrows), the information contained in the hidden states is passed in a 2D manner (red arrows in Fig. 13). More precisely, the hidden state of the RNN is calculated via

h_{i,j} = F( [σ_{i−1,j}; σ_{i,j−1}]^T T_{i,j} [h_{i−1,j}; h_{i,j−1}] + b_{i,j} ),   (A7)

with a nonlinear activation function F and a tensor T_{i,j} ∈ R^{2d_v × 2d_h × d_h}. Here, [σ_{i−1,j}; σ_{i,j−1}] has dimension N_s × 2d_v and [h_{i−1,j}; h_{i,j−1}] has dimension N_s × 2d_h for a batch of N_s samples, i.e. the contraction of these vectors with the tensor T_{i,j} is of dimension N_s × d_h, as desired. Furthermore, we use a variant of a gated recurrent unit (GRU) instead of a simple RNN cell. GRUs tackle the difficulties of plain vanilla RNNs in capturing long-term dependencies [51][52][53]. In our GRUs the hidden state h_{i,j} is determined by calculating

u_{i,j} = sig( [σ_{i−1,j}; σ_{i,j−1}]^T W^1_{i,j} [h_{i−1,j}; h_{i,j−1}] + b^1_{i,j} ),
h̃_{i,j} = tanh( [σ_{i−1,j}; σ_{i,j−1}]^T W^2_{i,j} [h_{i−1,j}; h_{i,j−1}] + b^2_{i,j} ),
h_{i,j} = u_{i,j} ⊙ h̃_{i,j} + (1 − u_{i,j}) ⊙ ( W_m [h_{i−1,j}; h_{i,j−1}] ),

where W_m ∈ R^{2d_h × d_h} is used to match the dimensions, and "sig" and "tanh" denote the sigmoid and hyperbolic tangent activation functions, respectively. In contrast to the simple RNN cell, the updated hidden state h_{i,j} is given by a combination of the previous hidden states h_{i−1,j} and h_{i,j−1} and an updated candidate h̃_{i,j}. The update gate u_{i,j} decides how much information from each of them is taken into account in the next step. This implementation differs slightly from usual implementations of GRUs, which additionally involve a reset gate and hence contain more parameters to be optimized. For the tensorized version of a full GRU cell, one would need even more parameters to match the dimensions in the reset gate, which would make the optimization process very slow.
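The tensorized GRU step can be sketched for a single sample; the contraction order is our reading of Eq. (A7), and the shapes of the gate tensors are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tensorized_gru_cell(sig_cat, h_cat, W1, W2, Wm, b1, b2):
    """One step of the tensorized GRU variant (sketch, single sample).
    sig_cat: concatenated one-hot inputs [sigma_{i-1,j}; sigma_{i,j-1}], (2*d_v,)
    h_cat:   concatenated hidden states  [h_{i-1,j}; h_{i,j-1}],        (2*d_h,)
    W1, W2:  gate tensors of shape (2*d_v, 2*d_h, d_h)
    Wm:      (2*d_h, d_h) matrix matching the old-state dimensions."""
    pre_u = np.einsum('v,vhk,h->k', sig_cat, W1, h_cat) + b1
    pre_c = np.einsum('v,vhk,h->k', sig_cat, W2, h_cat) + b2
    u = sigmoid(pre_u)          # update gate
    h_tilde = np.tanh(pre_c)    # candidate hidden state
    # blend candidate with the (dimension-matched) previous hidden states
    return u * h_tilde + (1.0 - u) * (h_cat @ Wm)
```

The einsum contracts both the input and the hidden index of the tensor, so the input configurations multiplicatively gate the hidden-state transformation, which is the point of the tensorized (rather than additive) coupling.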

U (1) Symmetry
Since the ground states of the t − J model have conserved particle number and conserved magnetization, i.e. a U(1) = U(1)_N × U(1)_Ŝz symmetry, it is helpful to enforce this constraint on our RNN wave functions, as shown for the magnetization sector in Ref. [10]. The procedure that we use effectively applies the projectors P̂_{Ŝz=0} (P̂_{Ŝz=1/2}) and P̂_{N=Ntarget} for even (odd) particle numbers N_target. This restricts the RNN wave function to the subspace of configurations of interest, yielding a simpler optimization landscape. To satisfy the U(1) = U(1)_N × U(1)_Ŝz constraint, we use the following algorithm. At each site i, we

1. generate the RNN output y_i^(1) and calculate the conditional probabilities P_λ(1/2 | σ_<i), P_λ(−1/2 | σ_<i) and P_λ(0 | σ_<i) for spin up, spin down and holes, respectively.
2. define the respective masked amplitudes a_i, b_i and c_i for spin up, spin down and holes, with N_holes,target = N_sites − N_target, N_holes(i) = N_sites − (N_up(i) + N_down(i)) and Θ the Heaviside function; N_up(i), N_down(i) and N_holes(i) are averaged values calculated from the samples generated up to site i. This procedure sets all probabilities for non-desired magnetizations and particle numbers to zero, but leaves the amplitudes of the wave function normalized to one.

In order to find the ground state of the system under consideration, we use variational Monte Carlo (VMC) [49, 55], which has already been combined with machine learning in a wide range of applications (see e.g. Refs. [17, 56]). In VMC we minimize the expectation value of the energy of the RNN trial wave function, expressed in terms of the local energy and the probability distribution given by the RNN.

As shown e.g. in Refs. [10, 26], one can use the cost function to minimize both the local energy and the variance of the local energy. Here, ⟨E_loc^λ⟩ denotes the sample average of the local energy.
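The standard sampled VMC gradient with the mean-energy baseline, which simultaneously reduces the variance of the estimator, can be sketched as follows (the paper's exact cost function may differ in normalization conventions):

```python
import numpy as np

def vmc_gradient(log_derivs, e_loc):
    """Sampled VMC gradient with the mean-energy baseline:
      grad_k = 2 Re < (E_loc - <E_loc>) * conj(O_k) >,
    where O_k = d log psi / d theta_k.
    log_derivs: (Ns, Ntheta) complex samples of O_k, e_loc: (Ns,) complex."""
    e_mean = e_loc.mean()
    centered = e_loc - e_mean   # baseline subtraction: zero gradient for an eigenstate
    grad = 2.0 * np.real(centered[:, None] * log_derivs.conj()).mean(axis=0)
    return grad, e_mean
```

A useful sanity check of this estimator is that the gradient vanishes identically when all local energies are equal, i.e. when the ansatz is an eigenstate of the Hamiltonian.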

Figure 1 .
Figure 1. t − J and t − Jz square lattice with 10 × 4 sites, t/Jz = 3 and open boundaries in x, periodic boundaries in y direction: a) Quasiparticle dispersion of a single hole for the t − J system obtained with the RNN (blue markers), compared to the MPS spectral function from Ref. [35] with the spectral weight S indicated by the colormap and shown in the inset for k = (0.44π, 0.44π). We average the energy over the last 100 training iterations, each with 200 samples, with the respective error bars denoted in blue. b) Dispersion of the t − Jz system obtained with the RNN, compared to the MPS spectral function. c) Relative errors ∆ϵ = (E_RNN − E_DMRG)/|E_DMRG| during the training for t − J and t − Jz systems, both with d_h = 300. Dashed vertical lines denote the training steps where the training was restarted. In the last restart, the number of samples per minimization step was increased from 200 to 600 (t − J) or 1000 (t − Jz).

Figure 2 .
Figure 2. Left: A typical configuration σ for a 5 × 5 system with five holes and ten spin up (red) and spin down (blue) particles each. Sites are labeled in a 1D manner, as denoted by the white numbers. Right: An exemplary hopping process to the nearest neighbor in horizontal direction ends in the configuration σ′ and effectively exchanges P particles, here P = 3. The respective sign of σ′ relative to σ is calculated using Eq. (9).

Figure 4 .
Figure 4. 1D t − XXZ system with 20 sites and J± = 1, Jz = 4 and t = 8: a) Quasiparticle dispersion for a single hole obtained with the RNN (red markers), compared to exact energies from ED (light red lines) and the combined spinon and holon dispersions from Eq. (14) (gray). We average the RNN energy over the last 100 training iterations, each with 200 samples, with the errors denoted by the error bars. We show the exact low-energy excited states as well. b) Relative error ∆ϵ = (E_RNN − E_ED)/|E_ED| during the ground state training. a) and b) are obtained using a 1D RNN architecture with d_h = 100.

Figure 5 .
Figure 5. t − J (blue) and t − Jz (red) square lattice with 4 × 4 sites, t/J = 3 and periodic boundaries: a) Quasiparticle dispersion for a single hole obtained with the RNN (blue and red markers), compared to the exact energies from ED (lines).We average the energy over the last 100 training iterations, each with 200 samples, with the respective error bars shown in blue and red.We show the exact low-energy excited states as well.b) Relative error ∆ϵ during the ground state training for t − J (light blue) and t − Jz (light red) square lattice ground states, with d h = 100 and minSR (t − J) and d h = 70 and Adam (t − Jz).Thick lines are averages over 100 training iterations to guide the eye.

Figure 6 .
Figure 6. t − J model on a triangular lattice with 9 × 2 sites, t/J = 3 and periodic boundaries along the x direction: a) Quasiparticle dispersion for a single hole obtained with the RNN (blue markers), compared to the exact energies from ED (light blue lines). We average the energy over the last 100 training iterations, each with 200 samples, with the error denoted by the blue error bars. We show the exact low-energy excited states as well. b) Relative error ∆ϵ during the ground state training without doping (orange) and with one hole (blue).

Figure 7 .
Figure 7. RNN representation for ground states of the bosonic and fermionic t − Jz model with t/Jz = 3 and 0 ≤ N_h ≤ 12 for a 4 × 4 square lattice with open boundaries. a) Relative error for bosons (blue) and fermions (orange). b) Logarithmic amplitude and phase distributions from ED for exemplary bosonic (blue) and fermionic (orange) hole numbers. On the very left, the two states σ_Néel with log|ψ(σ_Néel)|^2 = 0 are the Néel states. We use a hidden dimension of h_d = 100.

Figure 8 .
Figure 8. RNN representation for ground states of the bosonic and fermionic t − J model with t/J = 3 and 0 ≤ N_h ≤ 12 for a 4 × 4 square lattice with open boundaries. a) Relative error for bosons (blue) and fermions (orange). b) Logarithmic amplitude and phase distributions from ED for exemplary bosonic (blue) and fermionic (orange) hole numbers. We use a hidden dimension of h_d = 100.

Figure 9 .
Figure 9. Partial training, i.e. separate amplitude (dark blue) and phase (light blue) training, for ground states of the t − J model on a 4 × 4 square lattice with t/J = 3, open boundaries and N_h = 0 (top) and N_h = 1 (bottom), compared to the full training in red. We use a hidden dimension of h_d = 70.

Figure 10 .
Figure 10. a) Testing different optimizers: optimization results for the t − Jz model (left) and the t − J model (right) on a 4 × 4 square lattice with t/Jz = 3, both for N_h = 1 and periodic boundaries, using SGD, AdaBound, Adam, Adam+Annealing and minSR, with 200 samples (1000 samples for minSR) in each VMC step. b) Eigenvalues of the T-matrix (minSR algorithm [61], solid lines) and of the X^T X matrix (SR variant of Rende et al. [60], dotted lines) before the training, for the 4 × 4 t − J system with one hole and open boundaries and h_d = 30, 70, using 1000 samples.

Figure 11 .
Figure 11. Relative error for t − J (dark blue) and t − Jz (light blue) models on a 4 × 4 square lattice with one hole, t/Jz = 3 and periodic boundaries, for RNNs with implemented U(1) = U(1)_N × U(1)_Ŝz symmetry, and with U(1) and C4 symmetry implemented via the cost function and via the RNN ansatz. We use a hidden dimension of h_d = 70. For the t − Jz model, we provide the relative errors as numbers in light blue.

Figure 12 .
Figure 12. Number of parameters for the exact wave function of a 4 × 4 system compared to the RNN ansatz. Left: We compare the number of parameters of the exact wave function using U(1)_N × U(1)_Ŝz symmetry for 0 ≤ N_h ≤ 4 (blue) to the Hilbert space dimension 3^16 that we want to learn with the RNN ansatz. The number of parameters of the RNN ansatz with hidden dimension 30 ≤ h_d ≤ 100 is denoted in gray. Right: Hilbert space dimension for a local dimension of 2 (Heisenberg model), 3 (t − J model) and 4 (Fermi-Hubbard model).

Figure 13 .
Figure 13. Left: RNN representation of a quantum state: the hidden units h_i and the configurations σ_i are passed through the RNN cell for each lattice position. The network outputs the amplitude P(σ) and the phase ϕ(σ) of the RNN state representation as given by Eq. (2). Right: The two-dimensional, tensorized RNN: the sampling path is represented by the dark gray lines. Internally, the hidden and configuration states h_i and σ_i are passed through the network in two directions. Furthermore, inputs are passed to the tensorized version of a GRU cell (see Eq. (A7)) at each step i, as proposed by Hibat-Allah et al. [50].

3. calculate the new conditional probabilities P̃_λ(σ_i | σ_<i) from a_i, b_i and c_i and normalize by multiplying with 1/√(a_i² + b_i² + c_i²). Hence, the new probabilities P̃_λ are also normalized.