Inverse problems for structured datasets using parallel TAP equations and restricted Boltzmann machines

We propose an efficient algorithm to solve inverse problems in the presence of binary clustered datasets. We consider the paradigmatic Hopfield model in a teacher-student scenario, where this situation is found in the retrieval phase. This problem has been widely analyzed through various methods, such as mean-field approaches or pseudo-likelihood optimization. Our approach is based on the estimation of the posterior using the Thouless-Anderson-Palmer (TAP) equations in a parallel updating scheme. Unlike other methods, it makes it possible to retrieve the original patterns of the teacher and, thanks to the parallel update, it can be applied to large system sizes. We also observe that the Approximate Message Passing (AMP) equations do not reproduce the expected behavior in the direct problem, questioning the standard practice of obtaining time indexes from Belief Propagation (BP). We tackle the same problem using a restricted Boltzmann machine (RBM) and discuss analogies and differences between our algorithm and RBM learning.

Inverse problems consist in inferring information about the structure of a system from the observation of its configurations. Cases where the system's variables $s_i$ are binary can be studied in the framework of the inverse Ising model, whose parameters $\{J_{ij}, h_i\}$ are tuned in order to describe the observed configurations according to the Boltzmann weight $P(s) \sim \exp\big(\sum_{i<j} J_{ij} s_i s_j + \sum_i h_i s_i\big)$. This is the simplest distribution emerging from the maximum entropy approach when one requires the one- and two-point statistics of the data to be reproduced exactly. One of the most successful applications of this method is the 3D reconstruction of protein structures [1]. Inverse problems also arise in collective behavior [2], immunology [3], neural activity [4,5] and financial time series [6]. In general, inferring the parameters of the model is a challenging problem because maximizing the likelihood involves the computation of the partition function $Z = \sum_s \exp\big(\sum_{i<j} J_{ij} s_i s_j + \sum_i h_i s_i\big)$, which is intractable in most realistic cases. On the other hand, when dealing with time series, it is possible to use a simpler approach that models the sampling process by a stochastic parallel dynamics, analyzed in [7], optimized in [8] and generalized in [9,10]. A recent review on this subject can be found in [11].
The original attempt to solve the problem was based on an Expectation-Maximization algorithm known as Boltzmann learning [12]. This method is impractical on large systems unless heuristic methods, like Monte Carlo (MC) sampling, are used to estimate correlations [13]. Nevertheless, MC is slow and thus many sophisticated techniques coming from statistical mechanics and machine learning have been proposed as alternative approaches [14-25]. These methods, however, share one or both of the following shortcomings: i) they require a large number of observations and ii) their overall performance drops significantly when the dataset is structured. This is often the case when data is produced from a (sub)set of many attractive states, or when data is collected in different regimes, e.g. quiescent and spiking regimes in neural networks. This problem becomes particularly relevant at low temperatures and it has already been studied both in the sparse [26] and in the dense case [27,28]. Pseudo-likelihood [29] based methods [28] were shown to be the best option in a wide range of temperatures. Here, we present two algorithms that compete with the existing state of the art by posing the problem in a Bayesian framework using the Thouless-Anderson-Palmer (TAP) equations [30] and the Restricted Boltzmann Machine (RBM) [31]. Our TAP-based algorithm will be shown to achieve better results while observing far fewer configurations in the clusterized phase. Moreover, it allows us to consider larger system sizes than those studied in [27,28]. As a side aspect of this work, we discuss the behavior of the Approximate Message Passing (AMP) equations for the Hopfield model [32], and show that they do not reproduce the expected thermodynamic behavior.
We consider a dataset with many clusters by drawing configurations from the Hopfield model [33,34]. Given a set of $N$-dimensional binary independent patterns $\{\zeta^\mu\}_{\mu=1,\dots,P}$, the teacher's patterns, the coupling matrix of the associated Hopfield model is defined as $J_{ij} = N^{-1}\sum_{\mu=1}^{P}\zeta^\mu_i \zeta^\mu_j$. Next, we construct a set of equilibrium configurations $D = \{s^a\}_{a=1,\dots,M}$ sampled from the Boltzmann distribution $P(s) = Z^{-1}\exp[-\beta H_\zeta(s)]$, $\beta$ being the inverse temperature. The set $D$ is given to a student whose task is to infer the teacher's patterns. This task cannot be accomplished with the methods developed in [27,28], where the focus was the inference of the coupling matrix $J$. Using a uniform prior on the teacher's patterns, the posterior distribution is proportional to the likelihood.
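As an illustration, here is a minimal sketch of the teacher's side (not the authors' code): random binary patterns define the Hopfield couplings and equilibrium configurations are drawn with sequential Glauber dynamics. Seeding the chains near the patterns is an assumption used only to mimic a clustered dataset with uniformly sampled states; all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, beta = 200, 3, 2.0                          # small sizes for illustration (paper uses N = 1000)
zeta = rng.choice([-1, 1], size=(P, N))           # teacher's patterns
J = (zeta.T @ zeta).astype(float) / N             # J_ij = (1/N) sum_mu zeta^mu_i zeta^mu_j
np.fill_diagonal(J, 0.0)

def glauber_sample(J, beta, rng, n_sweeps=200, s0=None):
    """One equilibrium configuration from sequential Glauber dynamics."""
    N = J.shape[0]
    s = rng.choice([-1, 1], size=N) if s0 is None else s0.copy()
    for _ in range(n_sweeps):
        for i in rng.permutation(N):
            h = J[i] @ s                          # local field (J_ii = 0)
            s[i] = 1 if rng.random() < 1.0 / (1.0 + np.exp(-2.0 * beta * h)) else -1
    return s

M = 30                                            # dataset D handed to the student, shape (M, N)
D = np.array([glauber_sample(J, beta, rng, s0=zeta[a % P].copy()) for a in range(M)])
```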
For $P = 1$, the Hopfield model is nothing but a Curie-Weiss model. In this case, the problem is called the dual Hopfield model [35,36], and the log-likelihood is $\mathcal{L}(\xi) = \frac{\beta}{2N}\sum_{a=1}^{M}\big(\sum_i \xi_i s^a_i\big)^2 + \text{const}$, where $\xi$ denotes the student's pattern and the constant does not depend on $\xi$. This is readily established by absorbing the $\xi$-dependence of the partition function into a redefined set of variables $\tilde s$ via $\tilde s_i = \xi_i s_i$. On the other hand, for $P > 1$ this transformation is not feasible. In this case, the posterior reads

$P(\{\xi^\mu\}\,|\,D)\ \propto\ \frac{\exp\Big[\frac{\beta}{2N}\sum_{a=1}^{M}\sum_{\mu=1}^{P}\big(\sum_i \xi^\mu_i s^a_i\big)^2\Big]}{Z\big(\beta,\{\xi^\mu\}\big)^{M}}, \qquad (1)$

where the partition function retains an explicit dependence on the student's patterns. TAP equations [32,37-40] describe the stationary points of the free energy and their use as an inference method has been pioneered in [41-43], as well as in [40,44], where their relationship with the Bayes theorem and the Belief Propagation (BP) equations was discussed. These works paved the way to their application in a number of other problems [45], such as error correcting codes, compressed sensing and learning in neural networks, as discussed in the recent review [46].
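To make the $P = 1$ gauge argument explicit (a brief derivation, assuming the standard single-pattern Hamiltonian $H_\xi(s) = -\frac{1}{2N}\big(\sum_i \xi_i s_i\big)^2$): since the map $\tilde s_i = \xi_i s_i$ is a bijection on $\{\pm 1\}^N$,

$$Z(\beta,\xi) \;=\; \sum_{s} e^{\frac{\beta}{2N}\left(\sum_i \xi_i s_i\right)^2} \;=\; \sum_{\tilde s} e^{\frac{\beta}{2N}\left(\sum_i \tilde s_i\right)^2} \;=\; Z(\beta),$$

so the partition function does not depend on $\xi$ and maximizing the log-likelihood amounts to maximizing only the data-dependent term $\frac{\beta}{2N}\sum_a\big(\sum_i \xi_i s^a_i\big)^2$.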
For the direct problem, where the coupling matrix is $J_{ij} = N^{-1}\sum_{\mu=1}^{P}\zeta^\mu_i\zeta^\mu_j$, TAP equations estimate the local magnetizations. For the inverse problem, where we are interested in retrieving the model's parameters given the dataset $D$, we propose an algorithm based on these equations to estimate the posterior in Eq. (1). Even if TAP and mean-field methods have already been used to solve this inverse problem [20,26-28,47], this approach is completely different from the previous ones: dealing with the dual model is the key to improving the quality of the reconstructed network. In the dual model, the role of spins and patterns is exchanged: the magnetizations estimate the student's pattern, $m_i = \xi_i$, and the $M$ sampled configurations play the role of the patterns. We notice that a similar approach has been independently proposed in [48] for an RBM with 2 hidden binary units, using BP.
We start by considering the simplest case $P = 1$. We introduce a naive time indexing for an iterative scheme of the TAP equations, Eq. (2), where $J_{ij} = N^{-1}\sum_a s^a_i s^a_j$, $\alpha = P/N$ and $Nq^t = \sum_i (m^t_i)^2$. The entire set of magnetizations $m^t$ is updated in parallel to obtain $m^{t+1}$. In principle, any sophisticated time indexing scheme can be employed as long as it achieves convergence to a physical state. In particular, the so-called Approximate Message Passing (AMP) equations have been the focus of many studies in inference problems [46]. That scheme is motivated by the convergence issues of the naive indexing in the SK model, which arise even in the replica symmetric phase [49]. An explanation of this behavior can be found in [50], where a less trivial time index setting is shown to improve convergence properties outside the glassy phase. It was later realized that this time indexing emerges naturally when deriving the AMP equations from the BP equations, keeping track of the BP time indexes. This approach requires considering the fully connected limit of the BP equations, which are usually written on sparse graphs [40,51-53]. To our surprise, the AMP equations exhibit convergence issues for the Hopfield model in the direct problem, when the initial condition is chosen at random. A different approach to derive AMP equations is proposed in [54]. This new approach, in our case, does not improve the performance obtained by the naive time indexing of Eq. (2). These issues are discussed in detail in appendix A. Thus, in the following, we mainly adopt Eq. (2), unless stated otherwise.
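As an illustration, a minimal numpy sketch of the naive parallel update is given below. It assumes the standard form of the Hopfield TAP equations, with Onsager reaction term $-\alpha\, u^t/(1-u^t)\, m^t_i$ and $u^t = \beta(1-q^t)$; the precise form of Eq. (2) in the paper may differ in detail, so this is a sketch rather than the authors' implementation.

```python
import numpy as np

def tap_parallel(J, beta, alpha, n_iter=500, tol=1e-3, rng=None):
    """Naive parallel TAP iteration from a random initial condition (|m_i| = 1 at t = 0)."""
    if rng is None:
        rng = np.random.default_rng()
    m = rng.choice([-1.0, 1.0], size=J.shape[0])
    for _ in range(n_iter):
        q = np.mean(m ** 2)
        u = beta * (1.0 - q)                       # u^t = beta (1 - q^t), as in appendix A
        onsager = alpha * u / (1.0 - u) * m        # assumed reaction term (requires u != 1)
        m_new = np.tanh(beta * (J @ m - onsager))  # all magnetizations updated in parallel
        if np.mean(np.abs(m_new - m)) < tol:       # stopping rule: average change below tol
            return m_new
        m = m_new
    return m
```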
In Fig. 1, we present the teacher-student overlap $q = N^{-1}\sum_i m_i\zeta_i$ for $N = 1000$. We observe that in the ferromagnetic-retrieval phase, $\beta > 1$, a perfect reconstruction may be achieved already with a small number of samples. This is due to the large signal contained in the correlation matrix $c$ of the data. In particular, we notice that in the ferromagnetic phase the student is able to find a pattern correlated with the teacher's one even at $M = 1$. On the other hand, in the paramagnetic phase the signal in $c$ is weaker and reconstruction is possible only by exploiting finite-size effects, at the price of observing an extensive number of samples. As discussed in [36], the critical fraction $M/N$ of samples necessary for reconstruction corresponds to the paramagnetic-spin glass transition of the direct problem, as long as we restrict the analysis to the Bayes optimal scenario.
The $P > 1$ case is more difficult because of the presence of the denominator in Eq. (1). However, we argue in appendix B that this term effectively acts as a (soft) orthogonality constraint over the inferred patterns. This observation allows us to design an inference algorithm accordingly. First, we construct a time evolution of the coupling matrix $J^\tau_{ij}$, with initial condition $J^{\tau=0}_{ij} = N^{-1}\sum_{a=1}^{M} s^a_i s^a_j$. At each time step $\tau$, we consider $P$ TAP students trying to learn the teacher's patterns independently. Namely, the magnetizations $m^\mu_i = \xi^\mu_i$ of each student evolve according to Eq. (2), starting from a randomly initialized configuration [55]. Upon convergence, we evaluate the $P$ solutions with the score $S^\mu = \sum_{ij}\sum_{a=1}^{M} s^a_i s^a_j m^\mu_i m^\mu_j$. These scores characterize the quality of the TAP fixed points and we pick the one with the largest score [56]. The magnetization selected by this criterion at time $\tau$ is denoted by $m^\tau$. Finally, in order to learn the remaining contributions, we remove the rank-1 part associated with the retrieved state $m^\tau$ from the coupling matrix. When the student knows the actual number of patterns, this corresponds to the rule $J^{\tau+1}_{ij} = J^{\tau}_{ij} - \frac{\gamma}{N} m^\tau_i m^\tau_j$, where $\gamma = M/P$ (assuming that different states are uniformly sampled in the dataset). We repeat these steps until no further patterns are found.
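A schematic implementation of this retrieval loop might look as follows. This is a sketch under the definitions reconstructed above (naive parallel TAP students, score $S^\mu$, and the rank-1 deflation rule); function and variable names are ours, not the paper's.

```python
import numpy as np

def retrieve_patterns(D, beta, P, n_iter=500, tol=1e-3, seed=0):
    """D: (M, N) array of +/-1 samples; P: assumed number of teacher patterns."""
    rng = np.random.default_rng(seed)
    M, N = D.shape
    C = (D.T @ D).astype(float)             # sum_a s^a_i s^a_j, used in the score S^mu
    J = C / N                               # dual couplings J_ij = (1/N) sum_a s^a_i s^a_j
    np.fill_diagonal(J, 0.0)
    alpha, gamma = P / N, M / P             # alpha as defined in the text; gamma = samples per state

    def tap(J):                             # naive parallel TAP run from a random start
        m = rng.choice([-1.0, 1.0], size=N)
        for _ in range(n_iter):
            u = beta * (1.0 - np.mean(m ** 2))
            m_new = np.tanh(beta * (J @ m - alpha * u / (1.0 - u) * m))
            if np.mean(np.abs(m_new - m)) < tol:
                break
            m = m_new
        return m_new

    retrieved = []
    for _ in range(P):
        students = [tap(J) for _ in range(P)]           # P independent students
        scores = [m @ C @ m for m in students]          # S^mu = sum_ij sum_a s^a_i s^a_j m_i m_j
        m_best = students[int(np.argmax(scores))]
        retrieved.append(np.sign(m_best))
        J = J - (gamma / N) * np.outer(m_best, m_best)  # remove the rank-1 part of the found state
        np.fill_diagonal(J, 0.0)
    return np.array(retrieved)
```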
We stress that our algorithm finds solutions correlated with the patterns without any prior information, i.e. we start iterating the TAP equations from a random initial configuration. This is a rather remarkable property in comparison with the method used in [26], where BP equations were guided to converge to the fixed points associated with the patterns by a reinforcement term aligned with the magnetizations of the states. In Fig. 2 we compare the learned $P$ TAP fixed points and the $P$ teacher's patterns in a system of $N = 1000$ with $P = 20$ at $\beta = 2$. In this regime data is generated in the retrieval phase [32]. The students observe $M = 200$ samples, i.e. 10 samples per state. We clearly see that all 20 patterns are successfully retrieved by the first 20 students. In addition, let us define two performance measures: the simplified likelihood $\mathcal{L}$, obtained by neglecting the partition function in Eq. (1), and the reconstruction error between the teacher's coupling matrix $J^*$ and the matrix inferred at time $\tau$, $J^r_{ij} = N^{-1}\sum_{t=1}^{\tau} m^t_i m^t_j$. In Fig. 3 their behavior is reported as a function of the iteration time. As expected, the error decreases as the students learn the patterns, but then increases when the students start to learn the remaining noise. Similarly, the simplified likelihood $\mathcal{L}$ develops a kink at the point where the students have learned all the patterns, which can be used as a stopping condition of the algorithm. In Fig. 4 we study the behavior of the error for different values of the temperature. As a function of $\beta$, the system sweeps through different regions of the phase diagram. Data is generated with a sequential Glauber dynamics and states are equally sampled. In Fig. 4 we show the behavior of the error, computed using the stopping criterion discussed above, as a function of the number of observations in different regions of the phase diagram. As expected, perfect reconstruction is obtained only in the retrieval phase.
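For concreteness, the two measures could be computed as follows. This is a sketch: the simplified likelihood drops the partition function as stated above, while the exact normalization of the reconstruction error used in the paper is not reproduced here and a relative Frobenius norm is assumed.

```python
import numpy as np

def simplified_likelihood(D, patterns, beta):
    """L ~ (beta / 2N) sum_a sum_mu (sum_i m^mu_i s^a_i)^2, the partition function being dropped."""
    M, N = D.shape
    overlaps = D @ patterns.T                  # shape (M, number of retrieved patterns)
    return beta / (2.0 * N) * np.sum(overlaps ** 2)

def reconstruction_error(J_inferred, J_teacher):
    """Relative Frobenius distance between coupling matrices (assumed normalization)."""
    diff = J_inferred - J_teacher
    return np.sqrt(np.sum(diff ** 2) / np.sum(J_teacher ** 2))
```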
Another method to perform inference is the RBM, which is closely related to the Hopfield model [32,36,57-59]. In fact, by means of a Hubbard-Stratonovich transformation, the numerator of Eq. (1) can be recast as a Gaussian integral over auxiliary variables $\lambda_\mu$, one per pattern, coupled linearly to the visible spins. In this form, Eq. (1) defines an RBM with $N_v = N$ binary visible units and $N_h = P$ Gaussian hidden units. Compared to existing methods [16,18,20,21,23-25], the RBM is both time and space efficient, as the number of parameters to be optimized scales as $N_v N_h$ rather than $N^2$. The number of hidden units $N_h$ plays the role of $P$; we consider the general setting $N_h \geq P$ in the following. Following standard practice [60], the weights $W^\mu_i$ are learned by maximizing the log-likelihood with the PCD-T algorithm, where PCD stands for Persistent Contrastive Divergence [61] and T is the number of Monte Carlo steps used to estimate the part of the log-likelihood derivative involving the partition function. We used $T = 10$. The RBM learns a set of weights defining the couplings $J^r_{ij} = \beta^{-1}\sum_\mu W^\mu_i W^\mu_j$, which we can compare with the teacher's coupling matrix. The error between $J^r_{ij}$ and $J^*_{ij}$ decreases during learning but never reaches the values found with TAP. In order to monitor learning, we study the pseudo-likelihood $S$ [29], i.e. a proxy for the likelihood that can be easily computed; its definition in the presence of hidden units, as well as more details on the implementation, are given in appendix C. In Fig. 5 we show the behavior of these quantities for the dataset generated by the teacher at $\beta = 2$, $N = 1000$, $P = 10$, when observing a different number $M$ of samples. An RBM with $N_v = 1000$ visible units and $N_h = 15$ hidden units is used. The minimum of the error is achieved when the pseudo-likelihood flattens. This happens when all of the relevant ($P = 10$) modes of the data have been learned, as can be seen in Fig. 6. Unless learning starts in the vicinity of the teacher's patterns, the final RBM weights do not reproduce them, contrary to the TAP-based algorithm discussed above. In fact, the Hopfield model is invariant under rotations in pattern space [27]: the student RBM can learn, at most, the subspace spanned by the teacher's patterns. To show this, we consider the Singular Value Decomposition (SVD) of the dataset and the SVD of the weights. We denote by $\{\sigma_\alpha\}$ the singular values of the matrix $W^\mu_i$ and by $t_\alpha$ the error in reconstructing the singular vector of the data, indexed by $\alpha$, using only the singular vectors of the weight matrix. In Fig. 6, we show the emergence of the different modes during learning. When the singular values $\sigma_\alpha$ of the weight matrix emerge, the error $t_\alpha$ decreases. The first $P = 10$ principal modes of the dataset are well represented by the subspace spanned by the singular vectors of the weight matrix $W$.
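As an illustration of the training loop described above, here is a minimal sketch of PCD-T for an RBM with $\pm 1$ visible units and Gaussian hidden units. This is not the authors' code: the inverse temperature is absorbed into the weights (consistently with $J^r_{ij} = \beta^{-1}\sum_\mu W^\mu_i W^\mu_j$), and all hyper-parameter names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_hidden(W, s):        # P(lambda | s): Gaussian with mean W s and unit variance
    return W @ s.T + rng.standard_normal((W.shape[0], s.shape[0]))

def sample_visible(W, lam):     # P(s_i = +1 | lambda) = 1 / (1 + exp(-2 sum_mu W^mu_i lambda_mu))
    h = (W.T @ lam).T
    p = 1.0 / (1.0 + np.exp(-2.0 * h))
    return np.where(rng.random(h.shape) < p, 1.0, -1.0)

def train_pcd(D, n_hidden=15, T=10, lr=1e-3, epochs=100, batch=100, n_chains=100):
    """PCD-T: persistent chains kept across batches, T Gibbs steps per gradient update."""
    M, Nv = D.shape
    W = 0.01 * rng.standard_normal((n_hidden, Nv))
    chains = rng.choice([-1.0, 1.0], size=(n_chains, Nv))     # persistent chains, random start
    for _ in range(epochs):
        for b in range(0, M, batch):
            v = D[b:b + batch].astype(float)
            pos = (W @ v.T) @ v / v.shape[0]                  # data term <lambda_mu s_i>, lambda at its mean
            for _ in range(T):                                # model term: T steps of block Gibbs
                chains = sample_visible(W, sample_hidden(W, chains))
            neg = (W @ chains.T) @ chains / chains.shape[0]
            W += lr * (pos - neg)                             # gradient ascent on the log-likelihood
    return W
```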
In summary, we discussed a new method to solve inverse problems with a clusterized dataset. We analyzed the fully connected Hopfield model in a teacher-student scenario and proposed an inference method based on the TAP equations, working directly on the posterior distribution, i.e. on the dual problem. We discussed a retrieval algorithm based on the parallel updating of the TAP equations with a naive time indexing, showing that in our case it gives good results. Contrary to previous methods, our algorithm is able to retrieve the patterns, besides the couplings, because the TAP equations reduce the continuous rotational symmetry to a simple permutation symmetry over the pattern labels. As a side result, we provided an analysis of the failure of the AMP equations in our case, when iterated in parallel starting from a random initial condition. Finally, we compared these results with those obtained with an RBM, exploiting its analogies with the Hopfield model. The RBM is a good candidate model to perform inference with many variables, a task that would require a much longer execution time for methods based on the optimization of the pseudo-likelihood of an associated pairwise Ising model. Its ability to perform inference tasks systematically, as well as its performance in inferring sparse models, will be addressed elsewhere.
Appendix A: Linear stability analysis of TAP and AMP in the paramagnetic state

Here we present the linear stability analysis of the TAP and AMP equations in the paramagnetic state. We focus on the direct problem, where $J_{ij}$ is constructed from random patterns. While the complete analysis is possible for arbitrary $\alpha$, we find it more instructive to focus on the limit $\alpha \to 0$, as it greatly simplifies the discussion. As will be shown below, our results are valid if $|1 - \beta| \sim O(1)$, which is larger than $O(\alpha)$.
From now on, we denote by TAP the simple iterative updating scheme discussed in the text, Eq. (A1), reported here for convenience, and by AMP the iterative scheme, Eq. (A2), derived in [32], where $m^t_i = \tanh(\beta H^t_i)$, $u^t = \beta(1 - q^t)$ and $Nq^t = \sum_i (m^t_i)^2$. This time index setting naturally emerges from the expansion of the BP equations in the large connectivity limit [40].
Let us first consider the linear stability of Eq. (A1). Near the paramagnetic state, $m_i \sim 0$, this equation may be expanded into $m^{t+1}_i \simeq \beta\sum_j J_{ij} m^t_j$, where the reaction term is neglected as it is of $O(\alpha)$. Performing the coordinate change to the basis of eigenvectors of $J_{ij}$, one obtains $m^{t+1}_\lambda \simeq \beta\lambda\, m^t_\lambda$, where $\lambda$ is an eigenvalue of $J_{ij}$. This implies that the paramagnetic solution becomes unstable when $\beta\lambda_{\max} > 1$. The spectrum of the coupling matrix follows the Marchenko-Pastur law [32]: $P$ eigenvalues are $1 + O(\sqrt{\alpha})$ and their eigenvectors span the same space spanned by the set of patterns, while the remaining $N - P$ eigenvalues are zero. Thus we find that the critical temperature is $T_c = 1 + O(\sqrt{\alpha})$ (the true value is $\beta_c = 1/(1 + \sqrt{\alpha})$, found by expanding the TAP equations beyond the $\alpha \to 0$ limit), which is identical to the result of replica theory [32].
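This threshold is easy to check numerically. The following illustrative script (arbitrary sizes, not taken from the paper) builds the Hopfield coupling matrix and verifies that its largest eigenvalue is close to 1, so that the paramagnetic fixed point of the naive iteration destabilizes near $\beta = 1$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 1000, 5                                   # alpha = P / N = 0.005 (illustrative)
zeta = rng.choice([-1.0, 1.0], size=(P, N))
J = zeta.T @ zeta / N                            # Hopfield couplings of the direct problem
np.fill_diagonal(J, 0.0)
lam_max = np.max(np.linalg.eigvalsh(J))
print(lam_max)                                   # close to 1, up to O(sqrt(alpha)) corrections
print("naive iteration: paramagnet unstable for beta >", 1.0 / lam_max)
```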
Similarly, the AMP equation (A2) can be expanded around the paramagnetic state. Because of a $\lambda - 1$ factor in the resulting linearized map, which vanishes on the pattern modes with eigenvalues $1 + O(\sqrt{\alpha})$ and equals $-\beta/(1-\beta)$ on the $N - P$ zero modes, in the limit $\alpha \to 0$ the zero modes give the largest, $O(1)$, contribution to the instability of the paramagnetic fixed point. Coming from the infinite-temperature limit, the iteration first becomes unstable when $-\beta/(1-\beta) = -1$, i.e. at $\beta_c = 1/2$. Nevertheless, this unstable direction is orthogonal to the patterns and the magnetization either converges to an unphysical state or never converges (see Fig. 7). The negative value of the leading eigenvalue for $\beta \in [1/2, 1]$ leads to an oscillating behavior starting from the paramagnetic solution, as can be seen in the second plot of Fig. 7. Similar issues with the parallel updating of the AMP equations were discussed in [46]; they can be alleviated by updating the spins sequentially and introducing a strong damping. Nevertheless, the sequential updating leads to a much slower algorithm, without any improvement in the quality of inference in comparison with the parallel updating scheme of Eq. (A1).
A different updating scheme of the TAP equations has recently been proposed in [54]. This approach does not require considering the fully connected limit of the BP equations and is suitable for systems with dense random coupling matrices. It is based on a dynamical mean-field theory, which allows one to study the dynamics of iterative algorithms in the thermodynamic limit by averaging over the noise contained in the couplings. For the Hopfield model, the resulting updating scheme is Eq. (A7),
where $A^t = \beta/(1 + \alpha u^t)$ and $u^t$ is the same quantity introduced in the AMP equation (A2). Repeating the same $\alpha \to 0$ analysis presented above, one can check that this updating scheme does not suffer from the issues of the AMP algorithm.
In Fig. 8 we compare the performance of these three algorithms for different system sizes. We define $P_c$ as the probability of converging to one of the patterns of the system with overlap greater than 0.7 when the initial condition is chosen at random. Sequential AMP equations were iterated with a damping term $d$, i.e. $m^{t+1}_i = (1 - d)\tanh(\beta H^{t+1}_i) + d\, m^t_i$, with $d = 0.95$. For the two parallel schemes, Eq. (A1) and Eq. (A7), the iteration is stopped when the average difference between $m^{t+1}_i$ and $m^t_i$ is smaller than 0.001. For the sequential AMP algorithm the iteration is stopped when the average difference between $\tanh(\beta H^{t+1}_i)$ and $\tanh(\beta H^t_i)$ is smaller than 0.001. In all cases we observe that convergence to the patterns is achieved in the retrieval phase. For small values of $N$, due to finite-size effects, the convergence regime extends into the metastable retrieval phase as well.
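Schematically, the damped sequential update and the stopping rule can be written as follows. This is a sketch only: the actual AMP local field of Eq. (A2) is not reproduced, and `local_field` is a placeholder, so the snippet illustrates the damping and convergence test rather than any specific scheme.

```python
import numpy as np

def damped_sequential_update(J, beta, m, local_field, d=0.95, tol=1e-3, max_sweeps=10000, rng=None):
    """m_i <- (1 - d) * tanh(beta * H_i) + d * m_i, one spin at a time, until the
    average change over a sweep falls below tol (0.001 in the text)."""
    if rng is None:
        rng = np.random.default_rng()
    for _ in range(max_sweeps):
        m_old = m.copy()
        for i in rng.permutation(len(m)):
            H_i = local_field(J, m, i)            # scheme-dependent field (placeholder)
            m[i] = (1.0 - d) * np.tanh(beta * H_i) + d * m[i]
        if np.mean(np.abs(m - m_old)) < tol:
            break
    return m

# Plain mean-field local field, used here only to make the sketch runnable:
mean_field = lambda J, m, i: J[i] @ m
```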
The instability issue of the AMP equations presented above holds for the direct problem, but it can be extended to the inverse, dual problem as well. In this last case, where $J_{ij} = N^{-1}\sum_{a=1}^{M} s^a_i s^a_j$, if there is enough signal in the data and $\lambda_{\max} > 2$, inference is possible also with the parallel AMP equations. Nevertheless, the analysis shows that obtaining time indexes from BP does not necessarily lead to good algorithms. TAP, as well as BP, equations describe only fixed points of the associated free energy and, in principle, any updating scheme could be used to solve these equations iteratively, as shown in [54]. The relevance of this observation for other problems requires further analysis and, given that the AMP convergence issues are usually mitigated by a sequential updating with a strong damping, it would be interesting to study whether a similar improvement is achieved when the TAP equations with the naive time indexing are iterated sequentially and with a strong damping, in problems where their parallel updating fails.
Appendix B: Posterior for P > 1

We discuss the role of the difficult term arising in the posterior distribution when $P > 1$. We show that, in the simple case $P = 2$, it has a clear interpretation in terms of a constraint on the orthogonality of the inferred patterns. In fact, let us consider the posterior of Eq. (1) for two patterns $\xi^1$ and $\xi^2$, Eq. (B1), and let us define
the mutual overlap $q$ between the two patterns, $Nq = \sum_i \xi^1_i \xi^2_i$. The exponent in Eq. (B1) can then be written in terms of the set $S$ of sites on which the two patterns agree and of its complement $\bar S$. Using again the gauge transformation $\tilde s_i = \xi^1_i s_i$, Eq. (B2) shows that the denominator of the posterior penalizes pairs of inferred patterns with a large mutual overlap, i.e. it acts as a soft orthogonality constraint.

Appendix C: Details on the RBM implementation

The gradient of the log-likelihood is estimated over mini-batches of the data; this introduces randomness in the path to the solution, and the associated learning algorithm is called stochastic gradient ascent. In our experiments we use a mini-batch size equal to 100. Since mini-batch samples are independent, different parallel MC chains can be used. In our experiments we used 100 MC chains, one per mini-batch sample. Their initial conditions can be chosen to be the considered samples, but this quickly results in overfitting the parameters, since the MC dynamics spends all the time in the regions of phase space close to the samples. When the initial condition of the MC dynamics is chosen at random and we keep track of the chain positions across different batches and epochs, the method is called Persistent CD (PCD). Our results are obtained using PCD. As stated above, the likelihood function cannot be easily computed. Thus, we introduce the pseudo-likelihood which, for a model with hidden units, is defined by $S = \sum_{r=1}^{N_v} S_r$, where $S_r$ contains the logarithm of the conditional probability of the $r$-th visible unit, averaged over the hidden units according to

$\langle p(s^a_r|\lambda)\rangle_{P(\lambda|s^a,\{W\})} = \int dP(\lambda)\, p(s^a_r|\lambda)\, P(\lambda|s^a,\{W\}). \qquad (C9)$

In the infinite sampling limit, the gradient of the pseudo-likelihood $S$ vanishes on the same set of parameters $\{W\}$ that solve $0 = \partial_W \mathcal{L}$, and this is the reason why the pseudo-likelihood can be used to monitor the state of learning. In practice, the probability in Eq. (C9) can be estimated after one step of Monte Carlo, $\langle p(s^a_r|\lambda)\rangle_{P(\lambda|s^a,\{W\})} \sim e^{s^a_r \sum_\mu W^\mu_r \lambda_\mu}$, where $\lambda$ is sampled from the distribution $P(\lambda|s^a,\{W\})$; a numerical sketch of this estimate is given below. Finally, we discuss the learning of the RBM compared to our TAP-based algorithm. As mentioned in the text, unless learning starts in the vicinity of the teacher's patterns, the final RBM weights do not reproduce them. In fact, the Hopfield model is invariant under rotations in pattern space: the dataset $D$ analyzed by the student could have been produced by another set of patterns $\{\tilde\zeta^\mu\}_{\mu=1,\dots,P}$, given by $\tilde\zeta^\mu_i = \sum_{\nu=1}^{P} O_{\mu\nu}\zeta^\nu_i$ where $O$ is an orthogonal matrix. This symmetry implies that the student RBM cannot learn exactly the teacher's patterns. One could think that the singular vectors of $W$ should learn at least the principal vectors of the data (which, given the spherical symmetry, are not necessarily aligned with the teacher's patterns), as discussed in [59]. Nevertheless, this is true only during the initial steps of learning, when the couplings are small. This is reminiscent of the results discussed in [27], where the posterior of the problem is analyzed in a perturbative expansion: at first order, corresponding to the small-coupling regime, the student's patterns are aligned with the singular vectors of the data, but this relation breaks down when higher-order corrections are computed.
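A sketch of this one-Monte-Carlo-step estimate, assuming the Gaussian hidden-unit conditionals discussed above and a per-sample normalization (the paper does not specify the normalization):

```python
import numpy as np

def pseudo_likelihood(W, D, rng=None):
    """One-MC-step estimate of S for an RBM with +/-1 visible and Gaussian hidden units.
    W: (N_h, N_v) weights; D: (M, N_v) dataset. Returns S averaged per sample."""
    if rng is None:
        rng = np.random.default_rng()
    M = D.shape[0]
    lam = W @ D.T + rng.standard_normal((W.shape[0], M))   # one draw of lambda ~ P(lambda | s^a, W)
    h = (W.T @ lam).T                                      # local fields h^a_r, shape (M, N_v)
    logp = D * h - np.logaddexp(h, -h)                     # log p(s^a_r | lambda) for +/-1 spins
    return logp.sum() / M
```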
In the following we show that the RBM learns the subspace spanned by the singular vectors of the data. To show this, we consider the Singular Value Decomposition (SVD) of the dataset, $D = U_D \Sigma_D V_D^T$, where, considering $N < M$, $D$ is the $N \times M$ matrix whose columns are the samples, $U_D$ is an orthogonal $N \times N$ matrix, $\Sigma_D$ is an $N \times M$ matrix with only $N$ non-zero diagonal elements, and $V_D$ is an orthogonal $M \times M$ matrix. Similarly, we consider the SVD of the weight matrix, $W^T = U_W \Sigma_W V_W^T$, where $W^T$ is $N_v \times N_h$, $U_W$ is an $N_v \times N_v$ orthogonal matrix, $\Sigma_W$ is an $N_v \times N_h$ matrix with $N_h$ non-zero diagonal elements, and $V_W$ is an $N_h \times N_h$ orthogonal matrix. We consider $N_v = N$ and we decompose each data mode $u^{(\alpha),D}$ on the basis $\{u^{(\mu),W}\}_{\mu=1,\dots,N_h}$ of orthogonal vectors normalized to one. We measure the behavior of $t_\alpha = \big\| u^{(\alpha),D} - \sum_{\mu=1}^{N_h}\big(u^{(\alpha),D}\cdot u^{(\mu),W}\big)\,u^{(\mu),W} \big\|$ at the beginning and at the end of learning. This quantity measures the difference between the original vector and its projection onto the subspace spanned by the basis $\{u^{(\mu),W}\}_{\mu=1,\dots,N_h}$. The results of this analysis are shown in the insets of Fig. 6, where we plot these quantities at the initial stage of learning and at the end.
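In practice, $t_\alpha$ can be computed directly from the two SVDs. A sketch with illustrative conventions (data matrix with one sample per column, weight matrix of shape $N_h \times N_v$):

```python
import numpy as np

def mode_errors(D, W):
    """D: (N, M) data matrix, one sample per column; W: (N_h, N_v) RBM weight matrix.
    Returns t_alpha for every left singular vector of the data."""
    U_D, _, _ = np.linalg.svd(D, full_matrices=False)   # columns: data modes u^(alpha),D
    U_W, _, _ = np.linalg.svd(W.T, full_matrices=False) # columns: weight modes u^(mu),W
    proj = U_W @ (U_W.T @ U_D)                          # projection onto span{u^(mu),W}
    return np.linalg.norm(U_D - proj, axis=0)           # residual norm t_alpha per data mode
```

A value of $t_\alpha$ close to zero means that the corresponding data mode lies almost entirely in the subspace spanned by the RBM weight modes.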
FIG. 1. Overlap between the teacher's pattern and the pattern recovered by the student when using Eq. (2) with P = 1, as a function of the number of samples M, at different temperatures. The system size is N = 1000. When β < 1, there exists a critical value of M ∼ O(N) below which it is impossible to infer the pattern, whereas above only a finite number of samples is needed.

FIG. 2. Overlap between the best TAP solutions and the teacher's patterns. The system size is N = 1000, the teacher generates P = 20 patterns at β = 2. Inference is done with 25 students observing only M = 200 samples, i.e. 10 per state. At each iteration step τ = 1, ..., P, we pick the best TAP solution and plot its overlap with all of the teacher's patterns. We clearly observe that the students are able to retrieve all the patterns of the teacher.

FIG. 3. Evolution of the error and of the simplified likelihood L, as defined in the text, with the iterations of the algorithm. Different lines refer to different values of M = 100, 200, 500, 1000 at β = 2, P = 20, N = 1000. The error decreases with M and reaches zero for M = 1000, although we observe that even with very few samples the errors are very small and, as shown in Fig. 2, the patterns are perfectly recovered. The dependence of L on M is negligible. L is rescaled in order to fit in the figure.

FIG. 4. Average error as a function of M/P for a system of size N = 1000 with a number of patterns P = 10 (top panel) and P = 30 (bottom panel). The reconstruction is done using 2P students. The error is computed by stopping the algorithm with the criterion described in the text. Each point represents an average over 100 independent trials. In the retrieval phase, the error goes to zero with M.

FIG. 5. Pseudo-likelihood S and error during learning, for different numbers of samples M. Data is produced by a teacher Hopfield model with N = 1000, P = 10 at β = 2. Learning is done with an RBM with Nv = N visible units and N_h = 15 hidden units.

FIG. 6. Emergence of the singular values σ during learning, for the same dataset analyzed in Fig. 5. P = 10 modes emerge. Inset: error tα for the different modes at the beginning of learning (blue line) and at the end of learning (orange line).

FIG. 7. Trajectories of the N magnetizations m^t_i under the updating schemes of TAP, Eq. (A1), and AMP, Eq. (A2), for three different temperatures at N = 1000, P = 1. Most of the trajectories are very similar, thus they are indistinguishable. The critical value is at β = 1. The starting point is chosen at random with the absolute value of the local magnetizations equal to one. In the first steps, both TAP and AMP destroy the initial condition and create very small magnetization values. Then, once close to the paramagnetic fixed point m = 0, the AMP equations escape from it for β > 0.5 while the TAP equations do not until β > 1. Moreover, when leaving the paramagnetic state, the direction chosen by AMP is completely random, while TAP moves towards the pattern.
FIG. 8. Probability of converging to a pattern when iterating Eq. (A1) (left), Eq. (A2) (center), Eq. (A7) (right), starting from a random initial condition in the direct problem. This probability is estimated by running 1000 independent experiments with different realizations of the patterns and different initial conditions, and counting the number of times that the equations converged to one of the patterns of the system with overlap greater than 0.7, in order to exclude mixture states. The sequential updating of the AMP equations is done with a damping term equal to 0.95. The performance of all these algorithms is similar, with the second one being much slower. The initial absolute value of the local magnetizations is mostly irrelevant in the first two cases (and it is chosen to be 1), but needs to be chosen small at low temperatures in the third case (and it is chosen to be 0.1).