The autoregressive neural network architecture of the Boltzmann distribution of pairwise interacting spins systems

Generative Autoregressive Neural Networks (ARNNs) have recently demonstrated exceptional results in image and language generation tasks, contributing to the growing popularity of generative models in both scientific and commercial applications. This work presents an exact mapping of the Boltzmann distribution of binary pairwise interacting systems into autoregressive form. The resulting ARNN architecture has weights and biases of its first layer corresponding to the Hamiltonian's couplings and external fields, featuring widely used structures such as the residual connections and a recurrent architecture with clear physical meanings. Moreover, its architecture's explicit formulation enables the use of statistical physics techniques to derive new ARNNs for specific systems. As examples, new effective ARNN architectures are derived from two well-known mean-field systems, the Curie-Weiss and Sherrington-Kirkpatrick models, showing superior performance in approximating the Boltzmann distributions of the corresponding physics model compared to other commonly used architectures. The connection established between the physics of the system and the neural network architecture provides a means to derive new architectures for different interacting systems and interpret existing ones from a physical perspective.

Generative Autoregressive Neural Networks (ARNNs) have recently demonstrated exceptional results in image and language generation tasks, contributing to the growing popularity of generative models in both scientific and commercial applications.This work presents an exact mapping of the Boltzmann distribution of binary pairwise interacting systems into autoregressive form.The resulting ARNN architecture has weights and biases of its first layer corresponding to the Hamiltonian's couplings and external fields, featuring widely used structures such as the residual connections and a recurrent architecture with clear physical meanings.Moreover, its architecture's explicit formulation enables the use of statistical physics techniques to derive new ARNNs for specific systems.As examples, new effective ARNN architectures are derived from two well-known mean-field systems, the Curie-Weiss and Sherrington-Kirkpatrick models, showing superior performance in approximating the Boltzmann distributions of the corresponding physics model compared to other commonly used architectures.The connection established between the physics of the system and the neural network architecture provides a means to derive new architectures for different interacting systems and interpret existing ones from a physical perspective.

I. INTRODUCTION
The cross-fertilization between machine learning and statistical physics, in particular of disordered systems, has a long history [1,2].Recently, the development of deep neural network frameworks [3] have been applied to statistical physics problems [4] spanning a wide range of domains, including quantum mechanics [5,6], classical statistical physics [7,8], chemical and biological physics [9,10].On the other hand, techniques borrowed from statistical physics have been used to shed light on the behavior of Machine Learning algorithms [11,12], and even to suggest training or architecture frameworks [13,14].In recent years, the introduction of deep generative autoregressive models [15,16], like transformers [17], has been a breakthrough in the field, generating images and text with a quality comparable to human-generated ones [18].The introduction of deep ARNNs was motivated as a flexible and general approach to sampling from a probability distribution learned from data [19][20][21].
In classical statistical physics, the ARNN was introduced, in a variational setting, to sample from a Boltzmann distribution (or equivalently an energy-based model [22]) as an improvement over the standard variational approach relying on the high expressiveness of the ARNNs [8].
The use of ARNNs in statistical physics problems has largely relied on pre-existing neural network architectures which may not be well-suited for the particular problem at hand.This approach has been largely favored due to the high expressive capacity of ARNNs, which can encapsulate the complexity of the Boltzmann probability distribution, remapped in an autoregressive form, within their parameters that, typically, grow polynomially with system size.To encode this complexity exactly, however, one might expect the need for an exponentially large number of parameters.This work aims to demonstrate how knowledge of the physics model can inform the design of more effective ARNN architectures.I will present the derivation of an ARNN architecture that encodes exactly the classical Boltzmann distribution associated with a general pairwise interacting Hamiltonian of binary variables.Despite the generality of the Hamiltonian, the resulting architecture exhibits interesting properties: the first layer's parameters, which scale polynomially with the system size, are fixed by the Hamiltonian parameters.Additionally, the derivation introduces both residual connections and recurrent structures, each with a clear physical interpretation.As expected for the exact architecture of the general case, the resulting deep ARNN architecture has the number of hidden layer parameters scaling exponentially with the system's size.In the general case, it is possible to approximate these hidden layers with feed-forward neural network structures containing a polynomial number of free parameters.The advantage of this approach over existing architectures is that the first layer's parameters can be fixed by the Hamiltonian parameters, reducing the number of parameters to be learned and trained.For instance the proposed architecture could be used in accelerating Markov chain simulations [23,24].The quality of the approximation of the Boltzmann distribution relies on both the architecture of the feed-forward neural network used and the complexity of the problem being tackled.However, the clear physical interpretation of the architecture allows us to leverage problem-specific knowledge to develop specific feed-forward neural network architectures.As an example, standard statistical physics techniques will be used in the following to find feasible ARNN architecture for specific Hamiltonian.To showcase the potential of the derived representation, the ARNN architectures for two well-known arXiv:2302.08347v3[cond-mat.dis-nn]25 Mar 2024 mean-field models are derived: the Curie-Weiss (CW) and the Sherrington-Kirkpatrick model (SK).These fully connected models are chosen due to their paradigmatic role in the history of statistical physics systems.
The CW model, despite its straightforward Hamiltonian, was one of the first models explaining the behavior of ferromagnet systems, displaying a second-order phase transition [37].In this case, an exact ARNN architecture at finite N and an approximated architecture in the thermodynamic limit are derived, both with a number of parameters scaling polynomially with the system's size.
The SK model [38] is a fully connected spin glass model of disordered magnetic materials.The system admits an analytical solution in the thermodynamic limit, Parisi's celebrated [39] k-step replica symmetric breaking (k-RSB) solution [40,41].The complex many-valley landscape of the Boltzmann probability distribution captured by the k-RSB solution of the SK model is the key concept that unifies the description of many different problems, and similar replica computations are applied to very different domains like neural networks [42,43], optimizations [44], inference problems [11], or in characterizing the jamming of hard spheres [45,46].In this work, an ARNN architecture of the Boltzmann distribution of the SK model for a single instance of disorder, with a finite number of variables will be shown.The derivation will be based on the k-RSB solution, resulting in a deep ARNN architecture with parameters scaling polynomially with the system size.
In the following, I will first present the ARNN architecture for the Boltzmann distribution of the pairwise interacting systems, and then demonstrate how to derive the new architecture for the CW and SK models.Finally, the last section compares the performance of the derived ARNN architectures with standard ARNN architectures used in the literature.

II. AUTOREGRESSIVE ARCHITECTURE OF THE BOLTZMANN DISTRIBUTION OF PAIRWISE INTERACTING SYSTEMS
The Boltzmann probability distribution of a given Hamiltonian H[x] of a set of N binary spin variables /Z, where Z = x e −βH(x) is the normalization factor.It is generally challenging to compute marginals and average quantities when N is large and in particular, generate samples on frustrated systems.By defining the sets of variables x <i = (x 1 , x 2 . . .x i−1 ) and x >i = (x i+1 , x i+2 . . .x N ) respectively with an index smaller and larger than i, then if we can rewrite the Boltzmann distribution in the autoregressive form: P B (x) = i P (x i |x <i ), it becomes straightforward to produce independent samples from it, thanks to the ancestral sampling procedure [8].It has been proposed [8] to use a variational approach to approximate the Boltzmann distribution with trial autoregressive probability distributions where each conditional probability is represented by a feed-forward neural network with a set of parameters θ, Q θ The parameters θ can be learned minimizing the variational free energy of the system: Minimizing the variational free energy F [Q θ ] with respect to the parameters of the ARNN is equivalent to minimizing Kullback-Leibler divergence with the Boltzmann distribution as the target [8].The computation of F [Q θ ] and their derivatives with respect to the ARNN's parameters involve a summation over all the configurations of the systems, that grows exponentially with the system's size, making it unfeasible after a few spins.In practice, they are estimated summing over a subset of configurations sampled directly from the ARNN thanks to the ancestral sampling procedure [8].Besides the minimization procedure, the choice of the architecture of the neural networks is crucial to obtain a good approximation of the Boltzmann distribution.In the following, I will derive an exact ARNN architecture of the Boltzmann distribution of pairwise-interacting spins.
A. The single variable conditional probability In the parametrization Q θi (x i = 1|x <i ) of the single variable conditional probability distribution P (x i = 1|x <i ) as a feed-forward neural network, the set of variables x <i is the input, and a nested set of linear transformations, and non-linear activation functions is applied on them.
Usually, the last layer is a sigma function σ(x) = 1 1+e −x , assuring the output is between 0 and 1.The probability straightforward to obtain.The set of parameters θ i are the weights and biases of the linear transformations.In the following, I will rewrite the single variable conditional probability of the Boltzmann distribution as a feed-forward neural network.The generic i-th conditional probability factor of the Boltzmann distribution can be rewritten in this form: where I defined: The δ a,b is the Kronecker delta function that is one when the two values (a, b) coincide and zero otherwise.Now imposing as last activation function a sigma function, with simple algebraic manipulations, we obtain: Consider a generic two-body interaction Hamiltonian of binary spin variables , where the sets of J ij are the interaction couplings and h i are the external fields.Taking into account a generic variable x i the elements of the Hamiltonian can be grouped into the following five sets: where the dependence on the variable x i has been made explicit.Substituting these expressions in Eq. ( 4), we obtain: where: The set of elements H ss cancels out.The conditional probability, Eq. ( 5), can be interpreted as a feed-forward neural network, following, starting from the input, the operation done on the variables x <i .The first operation on the input is a linear transformation.Defining: x as outputs of the first layer (see Fig. 1), we can write the conditional probability as a feed-forward neural network: H2ARNN Architectures of a single Boltzmann conditional probability of a pairwise interacting Hamiltonian, Eq. ( 9.) The x<i variables are the input, the output provides an estimation of the conditional probability P (xi = 1|x <i ).The first layer computes the x 1 i and x 1 il variables, see Eq. ( 7), where the weight and bias, directly related to the Hamiltonian parameters, are shown in orange.The non-linear operators are represented by square symbols.The width of the second layer increases exponentially with the system size.The log exp(x) = log i e x i represents the set of linear transformations and non-linear activation functions acting on the second layer.The last layer is the sigma function.
As shown in Fig. 1, a second linear transformation acts on the set of x 1 il variables.The parameters of the second layer are where c is the index of the configuration of the set of x >i variables.This second linear transformation compute the 2 N −i possible values of the exponent in the ρ ± i functions, Eq. ( 10).Next, the two functions ρ ± i are obtained by first applying the exponential function to the output of the second layer.Then, for each of ρ ± i , we sum their elements and finally apply the logarithmic function.As the last layer, the values log ρ ± i and x 1 i are combined with the right signs, and the sigma function is applied.The entire ARNN architecture of the Boltzmann distribution of the general pairwise interacting Hamiltonian (H 2 ARNN) is depicted in Fig. 1.The total number of parameters scales exponentially with the system size, making its direct application infeasible for the sampling process.Nevertheless, the H 2 ARNN architecture shows some interesting features: • The weights and biases of the first layer are the parameters of the Hamiltonian of the Boltzmann distribution.As far as the author knows, this is the first time that this connection is derived.
• Residual connections among layers, due to the x 1 i variables, naturally emerge from the derivation.The importance of residual connections has recently been highlighted [47] and has become a crucial element in the success of the ResNet and transformer architectures [48], in classification and generation tasks.They were presented as a way to improve the training of deep neural networks avoiding the exploding and vanishing gradient problem.In this context, they represent the direct interactions among the variable x i and all the previous variables x <i .
• The H 2 ARNN exhibits a recurrent structure [3,49].The first layer, as seen in Figure 1, is composed of a set of linear transformations (see eq.7 and 8).The set of x 1 il = i−1 s=1 J si x s variables, can be rewritten in recursive form observing that: The output of the first layer of the conditional probability of the variable i depend on the output of the first layer, x 1 i−1,l , of the previous i − 1 conditional probability.In practice, we can explicitly write the following dependence: ).These reduce the number of parameters of the first layers and could reduce its total computational cost if efficiently implemented.
The most computationally demanding part of the H 2 ARNN architecture is the computation of the ρ ± i functions, Eq. (10); their parameters scale exponentially with the system size, proportionally to 2 N −i .However, generally, the ρ ± i functions can be approximated using standard feed-forward neural network structures, possessing a polynomial number of parameters.Here, the input variables are those of the first layer x 1 il , while the parameters of the first layer remain unchanged, maintaining the skip connection.Instead of exploring this possibility, in the following, I will show how to derive new ARNN architectures for specific systems, based on statistical physics techniques.In fact, the ρ ± i function can be interpreted as the partition function of a system, where the variables are the x >i and the external fields are determined by the values of the variables x <i .Based on this observation, in the following, I will show how to use standard tools of statistical physics to derive deep ARNN architectures that eliminate the exponential growth of the number of parameters.

A. The Curie-Weiss (CW) model
The Curie-Weiss model (CW) is a uniform, fullyconnected Ising model.The Hamiltonian, with N spins, is H The conditional probability of a spin i, Eq. ( 5), of the CW model is: where: x s , at given x <i , Eq. ( 15) is equivalent to the partition function of a CW model, with N −i spins and external fields h ± i .As shown in the Supporting Information (SI), the summations over x >i can be easily done, finding the following expression: where we defined: The final feed-forward architecture of the Curie-Weiss Autoregressive Neural Network (CW N ) architecture is: , where b = 2βh, ω = 2βJ N are the same, and so shared, among all the conditional probability functions, see Fig. 2. Their parameters have an analytic dependence on the parameters J and h of the Hamiltonian of the systems.
The number of parameters of a single conditional probability of the CW N is 2 + 4(N − i), which decreases as i increases.The total number of parameters of the entire conditional probability distribution scales as 2N 2 .
If we consider the thermodynamical limit, N ≫ 1, the ARNN architecture of the CW model simplifies (see sec.IB of the SI for details) to the following expression: where b = 2βh, ω = 2βJ N are the same as before, and shared, among all the conditional probability functions.The ω 1 i = −2βJ|m i | is different for each of them and can be computed analytically.The total number of parameters of the CW ∞ scales as N + 2.

B. The Sherrington-Kirkpatrick (SK) model
The SK Hamiltonian, considering zero external fields for simplicity, is given by: where the set of couplings, J, are i.i.d.random variable drawn from a Gaussian probability distribution P (J) = N (0, J 2 /N ).
To find a feed-forward representation of the conditional probability of its Boltzmann distribution we have to compute the quantities in Eq.10, that, defining s=1 J sl x s , can be written as: The above equation can be interpreted as an SK model over the variables x >i with site-dependent external fields h ± l [x <i ].I will use the replica trick [50], which is usually applied together with the average over the system's disorder.In our case, we deal with a single instance of disorder, with the set of couplings being fixed.In the following I will assume that N − i ≫ 1, and the average over the disorder E is taken on the coupling parameters J ll ′ with l, l ′ > i.In practice, I will use the following approximation to compute the quantity: In the last equality, I use the replica trick.Implicitly, it is assumed that the quantities log ρ ± i are self-averaged on the x >i variables.The expression for the average over the disorder of the replicated function is: where dP J ll ′ = P (J ll ′ )dJ ll ′ , and the set of x a are the replicated spin variables.Computing the integrals over the disorder, we find: where in the last line I used the Hubbard-Stratonovich transformation to linearize the quadratic terms.See the SI or, for instance, [51], for details about the formal mathematical derivations of the previous and following expressions.The Parisi solution of the SK model prescribes how to parametrize the matrix of the overlaps {Q ab } [50].The easiest way to parametrize the matrix of the overlaps is the replica symmetric solutions (RS), where the overlaps are equal and independent from the replica index: A sequence of better approximations can then be ob-tained by breaking the replica symmetry step by step, from the 1-step replica symmetric breaking (1-RSB) to the k-step replica symmetric breaking (k-RSB) solution.
The infinite k limit of the k-step replica symmetric breaking solution gives the exact solution of the SK model [52].
The sequence of k-RSB approximations can be seen as nested non-linear operations [53], see the SI for details.Every k-step replica symmetric breaking solution leads to adding a Gaussian integral and two more free variational parameters to the representation of the ρ ± functions.In the following, we will use a feed-forward representation that enlarges the space of parameters, using a more computationally friendly non-linear operator.Numerical evidence of the quality of the approximation used is shown in the SI.Overall, the parametrization of the overlaps matrix, which introduces free parameters in the derivation, allows the summing of all the configurations of the variables x i> eliminating the exponential scaling with the system's size of the number of parameters.The final ARNN architecture of the SK model is as follows (see the SI for details): For the RS and 1-RSB cases, we have: The set of x 1 il (x <i ) is the output of the first layer of the ARNN, see eqs.7-8, and (w 0± il , b 1± il , w 1± il , b 2± il , w 2± il ) are free variational parameters of the ARNN (see Fig. 3).The number of parameters of a single conditional probability distribution scales as 2(k + 1)(N − i) where k is the level of the k-RSB solution used, assuming k = 0 as the RS solution.

IV. RESULTS
In this section, I compare several ARNN architectures in learning to generate samples from the Boltzmann distribution of the CW and SK models.Additionally, the ability to recover the Hamiltonian coupling parameters from Monte-Carlo-generated instances is presented.The CW N , CW ∞ and SK RS/kRSB architectures, presented in previous sections, are compared with: • The one parameter (1P) architecture, where a single weight parameter is multiplied by the sums of the input variables, and then the sigma function is applied.This architecture was already used for the CW system in [36].The total number of parameters scales as N .
• The single layer (1L) architecture, where a fully connected single linear layer parametrizes the whole probability distribution, where a mask is applied to a subset of the weights in order to preserve the autoregressive properties.The width of the layer is N , and the total number of parameters scale as N 2 [15].
• The MADE architecture [15], where the whole probability distribution is represented with a deep sequence of fully connected layers, with non-linear activation functions and masks in between them, to assure the autoregressive properties.Respect the 1L the deep architecture of MADE enhances the expressive power.The MADE dc used has d hidden layers, each of them with c channels of width N .For instance, the 1L architecture is equivalent to the MADE 11 and MADE 23 has two hidden fullyconnected layers, each of them composed of three channels of width N .
The parameters of the ARNN are trained to minimize the Kullback-Leibler divergence or, equivalently, the variational free energy (see Eq. ( 1)).Given an ARNN, Q θ , that depends on a set of parameters θ and the Hamiltonian of the system H, the variational free energy can be estimated as: The samples are drawn from the trial ARNN, Q θ , using ancestral sampling.At each step of the training, the derivative of the variational free energy with respect to the parameters θ is estimated and used to update the parameters of the ARNN.Then a new batch of samples is extracted from the ARNN and used again to compute the derivative of the variational free energy and update the parameters [8].This process was repeated until a stop criterion is met or a maximum number of steps is reached.For each model and temperature, a maximum 1000 epochs are allowed, with a batch size of 2000 samples, and a learning rate of 0.001.The ADAM algorithm [54] was applied for the optimization of the ARNN parameters.An annealing procedure was used to improve performance and avoid mode-collapse problems [8], where the inverse temperature β was increased from 0.1 to 2.0 in steps of 0.05.The code was written using the PyTorch framework [55], and it is opensource, released in [56].The CW N has all its parameters fixed and precomputed analytically, see Eq. ( 17).The CW ∞ has one free parameter for each of its conditional probability distributions to be trained, and one shared parameter, see Eq. ( 19).The parameters of the first layer of the SK RS/kRSB architecture are shared and fixed by the values of the couplings and fields of the Hamiltonian.The parameters of the hidden layers are free and trained.The parameters of the MADE dc , 1L and 1P architectures are free and trained.As explained in Sec.II, the variational free energy F [Q θ ] is always an upper bound of the free energy of the system.Its value will be used, in the following, as a benchmark for the performance of the ARNN architecture in approximating the Boltzmann distribution.After the training procedure, the variational free energy was estimated using 20,000 configurations sampled from each of the considered ARNN architectures.The training procedure was the same for all the experiments unless conversely specified.

A. The CW model
The results on the CW model, with J = 1 and h = 0, are shown in Fig. 4. The plots A1, A2, and A3, in the first row, show the relative error of the free energy density (f e[P ] = F [P ]/N ), with respect to the exact one, computed analytically [37], see the SI for details, for different system sizes N .The variational free energy density estimated from samples generated with the CW N architecture does not have an appreciable difference with the analytic solution, and for the CW ∞ , it improves as the system size increases.Fig. 4.B plots the error, in the estimation of the free energy density for the architectures with fewer parameters, 1P and CW ∞ (both scaling linearly with the system's size); It shows clearly that a deep architecture with skip connections, in this The system undergoes a second-order phase transition at β = 1 where a spontaneous magnetization appears [37].[A1, A2, A3] Relative error in the estimation of the free energy for different system sizes with respect to the analytic solution.The CWN architecture has its parameters fixed and precomputed analytically, and the error is too small to be seen at this scale.The y-axis is plotted on a logarithmic scale down to 10 −4 and then linearly to zero.
[B] The dependence on N of the mean and maximum relative error of the two smaller architectures, 1P and CW∞, both of which scale linearly with the size of the system.[C] Distribution of the overlaps of the samples generated by the ARNNs for the CW system with N = 200 variables and β = 1.3 case with only one more parameter, in the skip connection, improves the accuracy by orders of magnitude.The need for deep architectures, already on a simple model as the CW, is indicated by the poor performance of the 1L architecture, despite its scaling of parameters as N 2 , it achieves similar results to the 1P.The MADE architecture obtains good results but was not comparable to CW N , even though having a similar number of parameters.The plot in Fig. 4.c shows the distribution of the overlaps, q a,b = 1 N i a i b i where a i , b i are two system configurations, between the samples generated by the ARNNs.The distribution is computed at β = 1.3 for N = 200.It can be seen as the poor performance of the 1-layer networks (1P, 1L) is due to the difficulty of correctly representing the configurations with magnetization different from zero in the proximity of the phase transition.This could be due to mode collapse problems [36], which do not affect the deeper ARNN architectures tested.

B. The SK model
In Fig. 5, the result of the SK model, with J = 1 and h = 0 are shown; as before in the first row there is the relative error in the estimation of the free energy density at different system sizes.In this case, the exact solution, for a single instance of the disorder and a finite N is not known.The free energy estimation of the SK 2RSB was taken as the reference to compute the relative difference.
The free energy estimations of SK kRSB with k = 1, 2 are very close to each other.The performance of the SK RS net is the same as the 1L architecture even with a much higher number of parameters.The MADE architecture tested, even with a similar number of parameters of the SK kRSB nets, see Fig. 5.C, estimate a larger free energy, with differences increasing with N .To better assess the difference in the approximation of the Boltzmann distribution of the architecture tested, I consider to check the distributions of the overlaps q among the generated samples.The SK model, with J = 1 and h = 0, undergoes a phase transition at β = 1, where a glassy phase is formed, and an exponential number of metastable states appears [50].This fact is reflected in the distribution of overlaps that have values different from zero in a wide region of values of q [57].Observing the distribution of the overlaps in the glassy phase, β = 1.3, between the samples generated by the ARNNs, Fig. 5.C, we can check as the distribution generated by the SK kRSB is higher in the region between the peak and zero overlaps, suggesting that these architectures better capture the complex landscape of the SK Boltzmann probability distribution [57].
The final test for the derived SK kRSB architectures involves evaluating the ability to recover the Hamiltonian couplings of the system using only samples extracted from the Boltzmann distribution of a single instance of the SK model in the glassy phase at β = 2.The Metropolis Monte-Carlo algorithm was used to sample, every 200 Monte-Carlo sweeps, 10,000 system configurations.The SK 1RSB was trained to minimize the loglikelihood computed on these samples (see the SI for details).According to the derivation of the SK kRSB architecture, the weights of the first layer of the neural network should correspond to the coupling parameters of the Hamiltonian.Due to the gauge invariance of the Hamiltonian with respect to the change of sign of all the couplings Js, I will consider their absolute values in the comparison.The weights parameters of the first layers of the SK 1RSB were initialized at small random values.As shown in Fig. 6, there is a strong correlation between the weights of the first layer and the couplings of the Hamiltonian, even though the neural network was trained in an over-parameterized setting; it has 60,000 parameters, significantly more than the number of samples.

V. CONCLUSIONS
In this study, the exact architecture Autoregressive Neural Network (ARNN) architecture (H 2 ARNN) of the Boltzmann distribution of the pairwise interacting system Hamiltonian was derived.The H 2 ARNN is a deep neural network, with the weights and biases of the first layer corresponding to the couplings and external fields of the Hamiltonian, see eqs.7-8.The H 2 ARNN architecture has skip-connections and a recurrent structure with a clear physical interpretation.Although the H 2 ARNN is not directly usable due to the exponential increase in the number of hidden layer parameters with the size of the system, its explicit formulation allows using statistical physics techniques to derive tractable architectures for specific problems.For example, ARNN architectures, scaling polynomially with the system's size, are derived for the CW and SK models.In the case of the SK model, the derivation is based on the sequence of k-step replica symmetric breaking solutions, which were mapped to a sequence of deeper ARNNs architectures.
The results, checking the ability of the ARNN architecture to learn the Boltzmann distribution of the CW and SK models, indicate that the derived architectures outperform commonly used ARNNs.Furthermore, the close connection between the physics of the problem and the neural network architecture is shown in the results of Fig. 6.In this case, the SK 1RSB architecture was trained on samples generated with the Monte-Carlo technique from the Boltzmann distribution of an SK model; the weights of the first layer of the SK 1RSB were found to have a strong correlation with the coupling parameters of the Hamiltonian.
Even though the derivation of a simple and compact ARNN architecture is not always feasible for all types of pairwise interactions and exactly solvable physics systems are rare, the explicit form of the H 2 ARNN and its clear physical interpretation provides a means to derive approximate architectures for specific Boltzmann distributions.
In this work, while the ARNN architecture of an SK model was derived, its learnability was not thoroughly examined.The problem of finding the configurations of minimum energy for the SK model is known to belong to the NP-hard class, and the effectiveness of the ARNN approach in solving this problem is still uncertain and a matter of ongoing research [27,35,36].Further systematic studies are needed to fully understand the learnability of the ARNN architecture presented in this work at very low temperatures and also on different systems.
There are several promising directions for future research to expand upon presented ARNN architectures.For instance, deriving the architecture for statistical models with more than binary variables.In statistical physics, the models with variables that have more than two states are called Potts models.The language mod-els, where each variable represents a word, and could take values among a huge number of states, usually more than tens of thousand possible words (or states), belong to this set of systems.The generalization of the present work to Potts models could allow us to connect the physics of the problem to recent language generative models like the transformer architecture [58].Another direction could be to consider systems with interactions beyond pairwise, to describe more complex probability distributions.Additionally, it would be interesting to examine sparse interacting system graphs, such as systems that interact on grids or random sparse graphs.The first case is fundamental for a large class of physics systems and image generation tasks, while the latter type, such as Erdos-Renyi interaction graphs, is common in optimization [44] and inference problems [59]. where: proving Eq. (10) shown in the main text.
B. Thermodynamical limit of the conditional probability of the CW model In the thermodynamical limit, the Curie-Weiss model admits an analytical solution.The order parameter of the system is the magnetization, m β = 1 N Z x i x i e −βH with Z = x e −βH .At high temperatures, with zero external fields h = 0, the magnetization, β , is zero up to a critical temperature β c = 1 J , where a phase transition occurs, and a non-zero magnetization is observed [1].Considering the following variables: , and for simplicity, the h = 0 case, we can rewrite the expression, Eq. ( 3), as: In the limit N ≫ 1, we obtain: . =e ∓βJ mi (11) where in the second line I used the Stirling approximation for the binomial factor, in the third line I used the saddle point method, assuming N ≫ 1, where mi is the extreme of the function inside the curly brackets.In the last line, I simplify all the elements that are equal between ρ + i and ρ − i that cancel each other in the conditional probability function Eq. (1).Deriving by m i the function inside the curly brackets, we obtain that the extreme mi should satisfy the following equation: In the N large limit, and for a typical sample, I assume that: , where the m β is the magnetization of the Curie-Weiss system at inverse temperature β and sgn(x) = |x| x is the sign function.We can distinguish two cases when the magnetization of the system is zero or not.In the first case, when β ≤ 1/J, the solution of Eq. ( 12) is zero as well, and log(ρ + i ) − log(ρ − i ) = 0 because the only term that makes the two quantities different, ∓βJm i , vanish.When instead the system acquires a magnetization m β different from zero, Eq. ( 12) admit one maximum that depends on the two possible symmetric values of i N s xs i ≈ i N | mβ |sign( q x q ).The solution of Eq. ( 12), ± mextrem depends again on sign( s x s ), and we can write the maximum solution as mi = | mi |sgn( s x s ).Easily we obtain that log(ρ + i ) − log(ρ − i ) = −2βJ| mi |sgn( s x s ), demonstrating the formula in the main text.

II. ANALYTIC EXPRESSION OF THE CURIE-WEISS FREE ENERGY AT FINITE N
The free energy of a system is defined as: For the CW system, with similar computation that leads to Eq. ( 7), we find:

III. ARNN ARCHITECTURES OF THE SHERRINGTON-KIRKPATRICK MODEL
In order to derive the architecture for the SK model based on the replica method, I'll start from Eq. ( 17) presented in the main text: where, the sums over (l, l ′ ), s and a run respectively over (i + 1, N ), (1, i − 1) and (1, n), and dP J ll ′ = P (J ll ′ )dJ ll ′ .Here we assumed that N − i ≫ 1, and the average over the disorder E J is on the parameters J l,l ′ with l, l ′ > i. Defining s=1 J sl x s as an external field, we can observe that the above quantity is the partition function of the SK model on x i> variables, with external fields h ± l [x <i ], at fixed x <i , and J ll ′ as coupling constants.Computing the integrals over the disorder variables {J ll ′ } yields [2]: Using the Hubbard-Stratonovich transformation of the quadratic terms, we can write: = c(n, N, i) where we defined: .
The Parisi solutions of the SK model prescribe how to parametrize the matrix of the overlaps Q [3].The following shows how to obtain neural network architectures based on the replica symmetric (RS) and k-step replica symmetric broken (k-RSB) solutions.

A. Replica Symmetric solution (RS)
We assume that the overlaps between the replicas are symmetric under permutations, and the matrix of the overlaps between replicas is parametrized with only one variable q: obtaining: Using the limit that n → 0 we can write: obtaining: = log(c ′′ (N, i)) In the second line, I use the saddle point methods to evaluate the integral over q, assuming that the single maximum value q 0 does not depend on the input values x <i in the set of h ± l [x <i ].It is a bold assumption to be verified a posteriori on the quality of the neural network architectures performances.In the third line, we use the identity log cosh(x) = 2x − log σ(2x) and the elements that are equals between log(ρ + ) and log(ρ − ) are simplified.We introduced the log σ non-linear operator for computational reasons.Now we consider the following approximation of the Gaussian convolution: where (b 0 , w 0 , b 1 , w 1 ) are free parameters to be determined.In the sec.III C of the SI a numerical analysis of the quality of this approximation is shown.Putting together all the pieces, we can parameterize the conditional probability as: where the set of (b, w) are free variational parameters to learn in the training process.

B. K-step Replica Symmetric Breaking (k-RSB) solution
Assuming that the replica symmetry is broken, we can use the ansatz called 1-step replica symmetric breaking (1RSB), where the overlaps between replicas are divided into m blocks: With the above ansatz, we can compute the following quantities: The equation 19 becomes: where we defined: dP y l = dy l 2π(q 1 − q 0 ) e y 2 l 2(q 1 −q 0 ) .
Considering N ≫ 1 and n → 0 to use the saddle point methods and the identity in Eq. ( 25), we can write: = c i + Extr q0,q1 c ′ i (N, n, q 0 , q 1 ) + 1 m l ˆdP z l log ˆdP y l cosh m β h ± l + βz l + βy l .

C. Numerical evidence of integrals approximations
In this section, I show the numerical evidence of some of the approximations used to represent the non-linear operations derived for the SK model.In the RS case, see Subsec.III A, the non-linear operation is approximated as in the following: The results in Fig. 1 of the fit, using the non-linear least squares method, of the f (x) with the r.h.s of the above expression, with free parameters b 0 , w 0 , b 1 , w 1 , are shown.The fits, for different values of β and q 0 , show the robustness of the approximation used.In the 1RSB case, see subsection III B, one of the non-linear operations is approximated as in the following: (q 1 −q 0 ) e β 2 mz−m log σ(βx+β 2 z) ≈ b 0 + w 0 log σ(b 1 + w 1 log σ(b 2 + w 2 x)) As before, the function f (x) is fitted with the r.h.s of the above expression, and the results in Fig. 2 show the robustness of the approximation used for different values of (q 1 − q 0 ), m, and β.  47) at different values of β and q0.The free parameters are b0, w0, b1, w1.The points are numerical evaluations of the integrals, and the grey lines are the corresponding fit using the approximated functions.The results show the robustness of the approximation used.

IV. MONTE CARLO SAMPLING OF THE SK MODEL
In the main text, in the section results, in order to check the correlation between the weights of the first layer of the SK 1RSB architecture with the coupling of the Hamiltonian, 10,000 samples were generated through the Monte-Carlo Metropolis algorithm [4].The samples are generated for a system with N = 200 variables at β = 2.Each configuration is sampled after 200 * N spin trail flips.I observe that, for the system under study, for this number of trail spin flips of the Metropolis algorithm the autocorrelation time between the configurations drops below 0.5.The training of the SK 1RSB was done minimizing, over the parameters of the SK 1RSB , the average of the negative log-likelihood (LL) of the generated m samples:  48) at different values of beta m, and q1 − q0.The free parameters are b0, w0, b1, w1, b2, w2.The points are numerical evaluations of the integrals, and the gray lines are the corresponding fit using the approximated functions.The results show the robustness of the approximation used.
FIG. 1.H2ARNN Architectures of a single Boltzmann conditional probability of a pairwise interacting Hamiltonian, Eq. (9.)The x<i variables are the input, the output provides an estimation of the conditional probability P (xi = 1|x <i ).The first layer computes the x 1 i and x 1 il variables, see Eq. (7), where the weight and bias, directly related to the Hamiltonian parameters, are shown in orange.The non-linear operators are represented by square symbols.The width of the second layer increases exponentially with the system size.The log exp(x) = log i e x i represents the set of linear transformations and non-linear activation functions acting on the second layer.The last layer is the sigma function.

B
FIG.2.CWN and CW∞ architectures of a single conditional probability.Diagrams A and B represent the CWN and CW∞ architectures, respectively.Both diagrams involve the operation of the sum of the input variables x<i.A skip connection, composed of a shared weight (represented by the orange line), is also present in both cases.In the CWN architecture, 2(N − 1) linear operations are applied (with fixed weights and biases, as indicated in Eq. (7)), followed by two non-linear operations represented by log exp(x).On the other hand, in the CW∞ architecture, apart from the skip connection, the input variables undergo a sgn operation before being multiplied by a free weight parameter and passed through the final layer represented by the sigma function.The number of parameters in the CWN architecture scales as 2N 2 , while in the CW∞ architecture, it scales as N + 1.

FIG. 3 .
FIG. 3. SK RS/kRSB architectures of the single variable conditional probability The diagram depicts the SK RS/kRSB architectures that approximate a single conditional probability of the Boltzmann distribution in the SK model.The input variables are x<i, and the output is the conditional probability Q RS/k-RSB (xi = 1|x<i).The nonlinear operations are represented by squares and the linear operations by solid lines.The parameters, in the orange lines, are equal to the Hamiltonian parameters and shared among the conditional probabilities, as indicated in Eq. (7).The depth of the network is determined by the level of approximation used, with the Q RS architecture having only one hidden layer and the Q k-SRB architecture having a sequence of k +1 hidden layers.The total number of parameters scales as 2(k + 1)N 2 + O(N ), where the RS case corresponds to k = 0.

FIG. 4 .
FIG.4.Results for CW model.The CW model considered has J = 1 and h = 0 (see the text for details).The system undergoes a second-order phase transition at β = 1 where a spontaneous magnetization appears[37].[A1, A2, A3] Relative error in the estimation of the free energy for different system sizes with respect to the analytic solution.The CWN architecture has its parameters fixed and precomputed analytically, and the error is too small to be seen at this scale.The y-axis is plotted on a logarithmic scale down to 10 −4 and then linearly to zero.[B]The dependence on N of the mean and maximum relative error of the two smaller architectures, 1P and CW∞, both of which scale linearly with the size of the system.[C] Distribution of the overlaps of the samples generated by the ARNNs for the CW system with N = 200 variables and β = 1.3

FIG. 5 .
FIG.5.Results for SK model.The SK model considered has J = 1 and h = 0 (see the text for details).The system undergoes a phase transition at β = 1[50].[A1, A2, A3] Relative difference in the estimation of the free energy for increasing system sizes with respect to the free energy computed by SK2RSB architecture.The results are averaged over 10 instances of the disorder.The y-axis is plotted on a logarithmic scale down to 10 −4 and then linearly to −10 4 .[B] Scaling with N of the number of parameters of the ARNN architectures.[C] Distribution of the overlaps of the samples generated by the ARNNs architectures for the SK model with N = 200 variables and β = 1.5, averaged over 10 different instances.The translucent error bands surrounding the plotted lines represent the 95% confidence intervals.

FIG. 6 .
FIG. 6. Scatter plot of the weights vs the couplings.Scatter plot of the absolute values of weights of the first layer of a SK1RSB vs the absolute values of the coupling parameters of the SK model.The weights are trained over 10,000 samples generated by the Metropolis Monte Carlo algorithm on a single instance of the SK model with N = 100 variables at β = 2.They are initialized at small random values.The blue line is the fit of the blue points, clearly showing a strong correlation between the weights and the coupling parameters of the Hamiltonian.

3 FIG. 1 .
FIG.1.Fit of Eq. (47) at different values of β and q0.The free parameters are b0, w0, b1, w1.The points are numerical evaluations of the integrals, and the grey lines are the corresponding fit using the approximated functions.The results show the robustness of the approximation used.

5 FIG. 2 .
FIG.2.Fit of Eq.(48) at different values of beta m, and q1 − q0.The free parameters are b0, w0, b1, w1, b2, w2.The points are numerical evaluations of the integrals, and the gray lines are the corresponding fit using the approximated functions.The results show the robustness of the approximation used.