Mode-Assisted Unsupervised Learning of Restricted Boltzmann Machines

Restricted Boltzmann machines (RBMs) are a powerful class of generative models, but their training requires computing a gradient that, unlike supervised backpropagation on typical loss functions, is notoriously difficult even to approximate. Here, we show that properly combining standard gradient updates with an off-gradient direction, constructed from samples of the RBM ground state (mode), improves their training dramatically over traditional gradient methods. This approach, which we call mode training, promotes faster training and stability, in addition to lower converged relative entropy (KL divergence). Along with the proofs of stability and convergence of this method, we also demonstrate its efficacy on synthetic datasets where we can compute KL divergences exactly, as well as on a larger machine learning standard, MNIST. The mode training we suggest is quite versatile, as it can be applied in conjunction with any given gradient method, and is easily extended to more general energy-based neural network structures such as deep, convolutional and unrestricted Boltzmann machines.

Boltzmann machines 1 and their restricted version (RBMs), are generative models applied to a variety of machine learning problems 2 . They enjoy a universal approximation theorem for discrete probability distributions 3 , are used as building blocks for deep-belief networks 4 and, in no small feat, can even represent correlated states in quantum many-body systems 5,6 .
Training RBMs is typically formulated as a gradient descent in the Kullbach-Leibler (KL) divergence between the data distribution defined by a dataset, and the RBM model distribution, parameterized by a set of weights and biases. This unsupervised procedure results in a computationally intractable expectation value popularly approximated by a Markov Chain Monte Carlo (MCMC) procedure dubbed "contrastive divergence" (CD) 7 . This approach faces difficulty when the model distribution represented by the RBM contains peaks of probability far away from the elements of the dataset, resulting in "spurious modes" that trap the Markov chain 4 . The limitations of CD, the de facto algorithm for training RBMs, combined with the rapid advances in supervised learning approaches, has led to the sideline of their unsupervised learning, known also as "pretraining", in favor of supervised backpropagation from random initial conditions 4 .
However, many state-of-the-art neural networks have been shown to be vulnerable to what are called adversarial examples 8 , or slight perturbations of the input that "fool" the network. In fact, one of the most popular supervised learning techniques, batch normalization, was found to contribute to this phenomenon 9 . It is known that pretraining can be a strong regularizer 10 resulting in better generalization for supervised models, and an improvement in their unsupervised training could lead to more robust performance in a downstream task. This motivates the search for better unsupervised training methods.
In a recent work, a memcomputing-assisted training scheme for RBMs 11 was proposed to address this unsupervised training difficulty. Memcomputing 12 is a novel computing paradigm in which memory (time non-locality) assists in the computation of hard computational decision and optimization problems 13,14 .
In the algorithm of Ref. 11, the difficult model expectation term in the log-likelihood gradient was re-placed by a sample of the mode of the RBM's probability distribution obtained from a memcomputing solver, which led to better quality solutions versus CD in a downstream classification task. However, despite demonstrating a significant reduction in the number of pretraining iterations necessary to achieve a given level of classification accuracy, as well as a total performance gain over CD, the algorithm of Ref. 11 does not fully exploit samples of the mode, in particular it does not give rise to training advantages over standard methods in the unsupervised setting. Additionally, in the present work, we introduce i) a principled schedule for incorporating samples of the RBM ground state into pre-training, ii) an appropriate mode-driven learning rate, iii) comparisons to other state-of-theart unsupervised pre-training approaches without the need of supervised fine-tuning, and iv) proofs of advantageous properties of the method.
We show that by appropriately combining the RBM's mode (ground state) samples and data initiated chains (as in CD) not only improves considerably the model quality over CD and other MCMC procedures, but also improves the stability of the pretraining routine. This mode training utilizes both the dataset (as in CD) and samples of the mode of the RBM model distribution in the training process to "push down" spurious modes of the model, whenever they appear.
Superficially, this method resembles 'mode hopping' MCMC proposed in recent literature 15,16 , where local maxima are either found with some optimization method or assumed to be known before hand (via a dataset). However, a crucial difference between our mode training for RBMs and mode hopping algorithms is that we do not use the modal configuration to initiate a new MCMC update to improve the mixing rate. Instead, the mode itself is used to inform the weight updates directly. The difference is substantial. In fact, since higher energy states are exponentially suppressed, exposing the Markov chain to the mode will most likely get it stuck there, which requires ad hoc constructions to recover detailed balance. Our mode-training method does not suffer from these drawbacks and is thus a more computationally efficient way to utilize the mode to train RBMs. Furthermore, we show that under a sufficiently large learning rate, sampling the global mode alone is capable of exploring efficiently a multi-modal energy landscape.
To realize this method in practice, one must supplement standard gradient updates with updates constructed from samples of the ground state of the RBM. Finding this ground state is equivalent to a Quadratic Unconstrained Binary Optimization (QUBO) problem, known to be NP-hard 17 . Therefore, although we can compute the ground state of RBMs exactly for small datasets, for efficient mode sampling in realistically sized cases, we employ a memcomputing solver that compares favorably to other state-of-the-art optimizers in efficiently sampling the ground states of similar non-convex landscapes 18,19 . The details of our implementation, including computational complexity and energy comparisons with MCMC, can be found in the appendix that accompanies this work. However, in principle, one could use other optimizers for mode training.
To corroborate our method, we will show exact KL/log-likelihood achieved on small synthetic datasets and on the MNIST dataset. In all cases, we find that mode training is able to learn more accurate models than several other training methods such as CD, persistent contrastive divergence (PCD), parallel tempering (PT) used in tandem with enhanced gradient RBMs (E-RBMs), and centered RBMs (C-RBMs), as reported in Ref. 20.
The paper is organized as follows. In Section I, we introduce RBMs and the standard unsupervised training procedures, and identify their main weaknesses. Section II introduces our mode-training method and its main features. Section III contains the results of our numerical experiments. Finally, we offer our thoughts for future work in Section IV.

I. TRAINING RBMS WITH MCMC
RBMs are undirected graphical models with a bipartite structure that differentiates between n visible nodes, v ∈ {0, 1} n and a set of m latent, or 'hidden' nodes, h ∈ {0, 1} m , not directly constrained by the data 2 . These states are usually taken to be binary but can be generalized. Each state of the machine corresponds to an energy of the form where the biases a ∈ R n , b ∈ R m , and weights W ∈ R n×m are the learnable parameters. Note that an RBM does not allow connections within a layer. This defines a distribution over states given by a Boltzmann-Gibbs distribution, The normalizing factor, Z = {v} {h} e −E(v,h) , is the formidable partition function that involves the sum over an exponentially scaling number of states, thus making the exact computation of its value infeasible. Additionally, the bipartite structure of the RBM connectivity implies that the hidden nodes are conditionally independent given any visible nodes (and vice versa), with a closed form conditional distribution given by 7 p(h j = 1|v) = σ( i w ij v i + b j ), where σ(x) = (1 + e −x ) −1 . We indicate the unique elements of the dataset for training and testing of the network as D = {v 1 , · · · , v n d } ⊂ Ω, where Ω = {0, 1} n is the space of all binary sequences of length n. We can then write the data distribution as where δ is the delta function that evaluates to 1 if v = v i and 0 otherwise. This effectively defines a probability mass function (PMF) over Ω with nonzero values only for values v i ∈ D. We then call D the support of q.
Let us assume further that all data points have equal amplitude over the support, i.e., c i = 1/n d . Since most real world datasets consist of unique elements with no exact repeats, this class of distributions includes all relevant ones. However, we will see in Sec. III that our method seems to work equally well also for non-uniform distributions.
Training an RBM then amounts to a search for the appropriate weights and biases, θ = {W, a, b}, that will minimize the quantity This is known as the Kullback-Leibler (KL) divergence between the data distribution, q(v), and the model distribution of the RBM over the visible layer, p(v) = {h} p(v, h), with hidden nodes traced out. The latter can be written as, The optimization of Eq. (4) is typically done with gradient descent over the KL divergence which leads to weight updates of the form 21 , The first term on the rhs of Eq. (6) is an expectation with the hidden nodes driven by the data, and hence is referred to as the data term. Since the conditional distributions across the hidden nodes are factorial and known in closed form, this inference problem is easy in the RBM case. The second term on the rhs of Eq. (6), instead, is an expectation over the model distribution with no nodes fixed, and called the model term. The exact calculation of this term requires computing the partition function of the RBM, which is proved to be hard even to estimate 22 . It is this term that MCMC algorithms (including CD) attempt to approximate.

A. Contrastive Divergence
A popular method to training RBMs is CD, which is a special case of an MCMC method known as Gibbs sampling 7 . The Markov chain is initialized from a point in the dataset, v, then the hidden and visible nodes are sequentially re-sampled a k number of times. A distorted model expectation is then computed from the reconstructed v k . In practice, choosing some finite k introduces a bias, but empirically it is found that using k = 1 gives a sufficient signal for learning 23 . Since the CD chain starts from a point in the dataset (i.e., a sample from the data distribution), difficulties arise when the model distribution represented by the RBM contains modes at points where the data distribution has negligible probability. CD will have a hard time finding and hence pushing down these spurious modes. This, coupled with the prohibitively slow mixing of this MCMC method due to randomwalk exploration, and typical high dimensionality of the problem, renders CD not a particularly effective method for unsupervised learning.

II. MODE INFORMED WEIGHT UPDATES
After these preliminaries we can now describe our mode-training method. In a nutshell, it consists in replacing the average in the model term of Eq. (6) with the mode of p(v) at appropriate steps of the training procedure. However, p(v) is very cumbersome to compute (see Eq. (5)), thereby adding a considerable computational burden. Instead, we sample the mode of p(v, h), the model distribution of the RBM.
The rationale for replacing the mode, v + , of p(v) with the visible configuration of the mode, v * , of p(v, h) is because the two modes are equivalent with high probability under scenarios typical for different stages of the RBM pre-training. We prove this rigorously in the appendix, while here we provide numerical evidence of this fact.

A. Mode Correspondence of Joint and Marginal PMF
To illustrate the equivalence between the modes of p(v) and p(v, h), let us begin by expressing the joint PMF in terms of the product of the marginal PMF over the visible layer and the conditional PMF over the hidden layer p(v, h) = p(v)p(h|v). For any given visible configuration v, we then have We can then define the hidden "activation" of v to be which allows us to write max h p(v, h) = p(v)r(v). Note that we can interpret r(v) as a measure of the "certainty" that the hidden nodes acquire the value 0 or 1.
It is then clear that we can write the probability amplitude of the mode of the joint PMF as max {v,h} where we have used the fact that r(v) ≤ 1 and v + is the mode of the marginal PMF, p(v). If r(v + ) = 1 then we have modal equivalence of the joint and marginal PMFs. In Fig. 1, we plot the evolution of r(v + ) as a function of the number of CD-1 training iterations for a shifting bar synthetic dataset, which is small enough that we can compute the exact mode of p(v) at any iteration. The figure indeed shows that r(v + ) approaches 1 rather quickly as pre-training proceeds. The activation of a random visible configuration is being used as comparison.
In the appendix we also prove that the condition of r(v + ) being close to 1 is not necessary for establishing modal equivalence. In fact, we prove that it is still possible for the two modes to be equal even when the weights are small (thus a smaller r(v + ) value). Additionally, we show in the appendix that mode training is more effective in exploring the PMF of the model distribution for RBM instances of greater frustration. The latter is a measure of the degeneracy of the lowenergy states of an RBM, and thus the difficulty of finding the ground state configuration. Since it was shown that the frustration of the RBM increases as pre-training proceeds 24 , in order to effectively utilize the power of mode training, the frequency of mode updates should be higher at the later stages of the training than the earlier stages.

B. Optimal Mode-Training Schedule
The results from the previous subsection then suggest a schedule for the mode training routine that performs mode updates more frequently the longer the pre-training routine has elapsed.
To realize this, we use a sigmoid, σ, to calculate the probability of replacing the data driven hidden CD term with a mode driven term at the iteration step n P mode (n) = P max σ(αn + β).
Here, 0 < P max ≤ 1 is the maximum probability of employing a mode update, and α and β are parameters that control how the mode updates are introduced into the pre-training. They are chosen such that the frequency of mode updates becomes dominant only when both the conditions of large weights and frustration are met (see Sec. III for the value of these parameters). Initially, P mode will be small, since the joint-and marginal-distribution modes are unequal, and gradually rises to its maximal value when the modes are of equal magnitude. Note that one may employ different functions to quantify the degree to which the joint-and marginal-distribution modes equalize during training. However, we have found that the sigmoid works well enough in practice.

C. Combining MCMC with mode updates
We are now ready to outline the full procedure of mode training, that combines a MCMC method with the mode updates following the schedule (8). Although one may choose any variation of the MCMC method to train RBMs, for definiteness of discussion, we consider here the standard training method, CD 7 . In this case, weight updates follow the modified KL(q||p) gradient. As discussed in Section I, it evaluates to a difference of two expectations called the data term and model term which we can write as where CD is the CD update learning rate, and the expectation in the second term is taken over the reconstructed distribution over a Markov chain initialized from the dataset after k Gibbs samples (k = 1 in most cases). When driving the weights with samples of the RBM ground state with the schedule (8), we use instead the following update, where [ ] p is the mode of the joint RBM model distribution. Note that the mode update learning rate, mode , may be different from the CD learning rate, CD . We also stress that the updates in Eq. (10) are in an off-gradient direction. As we show now, this is the reason for the increased stability of the training over MCMC approaches, and its convergence to arbitrarily small KL divergences.

D. Stability and Convergence
The data term, which is identical in both Eq. (9) and Eq. (10), tends to increase the weights associated with the visible node configurations in the dataset, thereby increasing their relative probabilities compared to states not in the support set, v ∈ Ω\D. Instead, the model term decreases the weights/probability corresponding to highly-probable model states. CD does this poorly and often diverges, while mode training achieves this with better stability and faster convergence (see Fig 2). We provide here an intuitive explanation of this phenomenon, while a formal treatment on this topic will be provided in the appendix.
The pre-training routine can be broken down in two phases. In the first phase, the training procedure attempts to discover the support D of the data distribution q(v). We call this phase the discovery phase. To better see this, consider a randomly initialized RBM with small weights. These small and uncorrelated weights give rise to RBM energies close to zero for all nodal states, or E(v, h) ≈ 0 for all v and h, see Eq. (1). This results in the model distribution p(v, h) being almost uniform.
Therefore, we see that in the discovery phase of training, the model term plays little role in the training as it simply pushes down on the weights in a practically uniform manner, with v i h j M ≈ 0.25. On the other hand, the data term drives the initial phase of the training by increasing the marginal probability of the visible states in the support, v ∈ D. We can then employ a large learning rate (say, CD = 1) in the beginning of the training, driving the visible layer configurations in the dataset, D, to high probability versus configurations outside the support. Empirically, we find that CD training performs in the discovery phase reasonably well, and is quickly able to "find" the visible states in the support. Now, having discovered the support, we arrive at the second phase of the training where we have to bring the model distribution as close to uniformity as possible over the support in order to minimize the KLdivergence. We call this phase the matching phase of the training, where we bring the model distribution as close to the data distribution as possible. CD usually performs poorly in this phase (see Fig. 2). To see this most directly, we simply have to consider a visible state with a slightly larger probability than the other states. It should then be necessary for the model term to locate and "push down" on this state to increase the uniformity of the distribution over the support. However, for any CD approximation of the model term, this rarely happens in a timely manner as the mixing rate of the MCMC chain is far too slow to locate this state before the training diverges. This is where samples of the mode are most effective, and can assist in the correction of the states' amplitudes. As we have discussed in Sec. II A, finding the modal state, v * , of the model distribution, p(v, h) allows us to immediately locate the mode, v + , of the marginal probability, p(v), and "push" down on this state through an iteration of weight updates. This "push" may result in another state "popping" up and becoming the new modal state; however, often times the probability amplitude of this new state will be less than that of the previous mode (see also the appendix). This results in a training routine that "cycles" through any dominant state that emerges at each iteration, and the probability amplitude of the mode decreases as training proceeds until the proba- bility amplitudes of all the states in the support become equal (see the formal demonstration of this in the appendix), which results in the desired uniform distribution over the support. This can be visualized as a "seesaw" between the dominant states, with the oscillation amplitude of this seesaw decaying to zero in time.
We outline the pseudo-code for mode training in Algorithm 1 and a visual depiction of the training side by side with CD-1 is shown in Fig. 2.
As it should now be clearer, these mode-driven updates are deviations from the gradient direction, since in general the mode over the model distribution is different from the expected value. This makes the modetraining algorithm, which mixes mode driven samples and data-driven ones, distinct from gradient descent. This is also supported by the fact that our method tends toward a particular class of distributions (uniform), when gradient descent would settle in some local minima or saddle points in the KL landscape. p mode ← Pmaxσ(αi + β)

6:
if u ≤ p mode then The free parameters in this method are the schedules of the mode sample using P mode (n) (defined by P max , α and β in Eq. (8)) and the CD learning rate, CD . With CD fixed, we set mode = γ CD , where γ = −E 0 /[(n + 1)(m + 1)], with E 0 (< 0) being the ground state of the corresponding RBM with nodal values {−1, 1} n+m . This particular choice of γ is an upper bound to the learning rate which minimizes the RBM energy variance over all states (see the appendix for the proof of this statement).
Finally, we find that the mode training method is not very sensitive to the parameters chosen. In fact, as long as the mode samples are incorporated after the joint and marginal mode equilibration, the training is stabilized and the learned distribution will tend to uniformity (see also the appendix). This result reinforces the intuitive notion that the pushes on the mode provide a stabilizing quality to the training over CD (or any other MCMC approach), which can otherwise diverge when mixing rates grow too large at later times during training.

E. Importance of Representability
Note that since mode training is driven to distributions of a particular form, instead of local minima as in the case of CD or other gradient approaches, the representability of the RBM becomes important. The ability of a RBM to represent a given data distribution is given by the amount of hidden nodes, where one is guaranteed universal representability with n d + 1 hidden nodes 3 . In other words, one more hidden node than the number of visible configurations with nonzero probability is sufficient (but perhaps not necessary) for general representability. In practice, this bound is found to be very conservative and typically much fewer nodes are needed for a reasonable solution.
Representability can become an issue in mode training when the parameter space of the RBM does not include the uniform distribution over the support (or a reasonable approximation). Since the mode training is generally in a non-gradient direction, this means that it may settle to a worse solution than a local optimum obtainable by CD. This is a signal that more hidden nodes are required for an optimal solution.
Since most natural datasets live on a very small dimensional manifold of the total visible phase space, |n d |/|Ω| 1, the amount of hidden nodes required typically scales polynomially with the size of the problem, versus the exponential scaling of the visible phase space. This makes representability not an insurmountable problem for mode training, even in full size problems. To this end, the examples of Fig. 2 and Fig. 3 show that mode training does not necessarily fail if the number of hidden nodes is less than that needed to guarantee representability.

III. RESULTS
As examples of our method, we have computed the log-likelihoods achieved with mode training across two synthetic and one realistic (MNIST) datasets, and compared the results against the best achieved loglikelihoods with CD-1, PCD-1 and PT on standard RBMs, E-RBMs, and C-RBMs 20 . For the small synthetic datasets we could also compute the exact loglikelihoods, thus providing an even stronger comparison. For the larger MNIST case, mode sampling was done via simulation of a digital memcomputing machine based on Ref. 25. The specific details of our implementation can be found in the appendix.
For synthetic data, we use the commonly employed binary shifting bar and bars and stripes datasets, both described in Ref. 26. The former is defined by two parameters: the total length of the vector, L, and the amount of consecutive elements (with periodic bound-ary conditions), B < L, set to one, with the rest set to zero. This results in L unique elements in the dataset with uniform probability, giving a maximum likelihood of L log(1/L). The inverted shifting bar set is obtained by swapping ones and zeros. The bars and stripes dataset is constructed by setting each row of a D × D binary pattern to one with probability 1/2, and then rotating the resulting pattern 90 • with probability 1/2. This produces 2 D+1 elements, with the all-zero and all-one patterns being twice as likely as the others.
For a direct comparison to previous work, we followed the same setup as Ref. 20. A 9×4 RBM was tested on a shifting bar dataset with L = 9, B = 1 and a D = 3 bars and stripes dataset. Both synthetic sets were trained for 50,000 parameter updates, with no mini-batching, and a constant CD = 0.2. For the MNIST dataset, a 784×16 sized model was trained for 100 epochs, with batch sizes of 100. The mode samples in both cases are slowly incorporated into training in a probabilistic way following Eq. (8), initially with P mode = 0 and driven to P max = 0.1 for the shifting bar and MNIST datasets, and P max = 0.05 for the bars and stripes dataset. In both cases, we chose α = 20/N and β = −6, where N is the total number of parameter updates.
We plot an example of training progress in a moderately large synthetic problem in Fig. 3. Reported is the KL divergence (which differs from the log-likelihood by a constant factor independent of the RBM parameters 2 ) of a slightly bigger 14 × 10 RBM as a function of number of parameter updates on a L = 14, B = 7 shifting bar set, for both CD-1 and mode training. We consider two learning rate schedules, constant ( CD = 0.05) and exponential decay ( CD (n) = e −cn , c = 4, n ∈ [0, 1], the fraction of completed training iterations). Additionally, every time a mode sample is taken, CD is allowed to run with k = 720, a number scaled to the equivalent computational cost of taking a mode sample. The details of the computational equivalence between a mode sample using memcomputing and iterations of CD are discussed in greater detail in the appendix. In both cases, even when computational cost is factored in, mode training leads to better solutions and proceeds in a much more stable way across runs (lower KL variance at convergence). Importantly, mode training never diverges while CD oftentimes does. Following our intuition about mode training established in Sec. II, using larger learning rates in the CD-dominated phase accelerates the convergence of mode training.
It is known that using CD to train RBMs can result in poor models for the data distribution 27 , for which PCD and PT are recommended. We note that for the mode training employed in this paper, CD-1 was employed as the gradient approximation (except in the case for MNIST where PCD-1 was used). Impressively, in all cases tested, the mode samples were able to stabilize the CD algorithm sufficiently to overcome the other, more involved approximations (PT) and model enhancements (centering).
In addition, it is clear that mode training exhibits several desirable properties over CD (or other gradient approaches). Most significantly, it seems to perform better with larger learning rates during the gradient dominated phase, and smaller learning rates when using mode samples. CD and other gradient methods generally perform better with smaller learning rates, as their approximation to the exact gradient gets better. Irrespective, even in this regime, the mode training eventually drives the system to the uniform solution compared to the local optimum of CD. The main advantage is that with mode training, one can (and often should) use larger learning rates, resulting in fewer required iterations. For further comparison, we report in Table I  and MNIST datasets obtained with mode training and those reported in Ref. 20. The results show mode training with a standard RBM always converges to models with log-likelihoods higher than E-RBMs, and C-RBMs trained with CD-1, PCD-1, or PT. Furthermore, the mode training log-likelihood increases with an increasing number of hidden nodes (better representability). Empirically, we also find the incredible result that with sufficient representability and the proper learning rate, mode training can find solutions arbitrarily close to the exact distribution.

IV. CONCLUSIONS
In this paper we have introduced an unsupervised non-gradient training method that stabilizes gradient based methods by utilizing samples of the ground state of the RBM, and is empirically seen to get as close as desired to a given uniform data distribution. It relies on the realization that as training proceeds, the RBM becomes increasingly frustrated, leading to the modes of the visible layer distribution and joint model distribution becoming effectively equal. As a consequence, by using the mode (or ground state) of the RBM during training, our approach is able to "flatten" all modes in the model distribution that do not correspond to modes in the data distribution, reaching a steady state only when all modes are of equal magnitude. In this sense, the ground state of the RBM can be thought of as 'supervising' the gradient approximation during training, preventing any pathological evolution of the model distribution.
Our results are valid if the representability of the RBM is enough to include good approximations of the data distribution. Once the representability is sufficient, a properly annealed learning-rate schedule will take the KL divergence as low as desired. Increasing the number of hidden nodes increases the non-convexity of the KL-divergence landscape, easily trapping standard algorithms in sub-optimal states. In practice, after some point, increasing the number of hidden nodes will not decrease the KL divergence that a pre-training procedure actually converges to, as the trade-off between effective gradient update and representation quality is reached. We here claim that this point of tradeoff for our mode-assisted procedure is reached at far greater number of nodes than standard procedures, thus allowing us to find representations with far smaller KL-divergence. The mode training we suggest then provides an extremely powerful tool for unsupervised learning, which i) prevents a divergence of the model, ii) promotes a more stable learning, and iii) for data distributions uniform across some support, can lead to solutions with arbitrary high quality.
To scale our approach, one would need an efficient way to sample low-energy configurations of the RBM, a difficult optimization problem on its own. This is equivalent to a weighted MAX-SAT problem, for which there are several industrial-scale solvers available. Also, the recent successes of memcomputing on these kind of energy landscapes in large cases (million of variables) are fodder for optimism 18,19 .
Finally, fitting general discrete distributions (with modes of different height) with this technique seems also within reach. In this respect, we can point to our results on the bars and stripes dataset (a non-uniform q(v)) for inspiration. We have found the best loglikelihood on that set with mode training with a lower frequency of the mode sampling, P max = 0.1 → 0.05, compared to the shifting bar (a uniform set). This suggests that a general update, which properly weighs the mode sample in combination with the dataset samples, may extend this technique to general nonuniform probabilities, with the weight analogous to a tunable demand for uniformity.
Our method is useful from a number of perspectives. First, from the unsupervised learning point of view, it opens the door to the training of RBMs with unprecedented accuracy in a novel, non-gradient approach. Second, many unsupervised models are used as 'feature learners' in a downstream supervised training task (e.g., classification), where the unsupervised learning is referred to as pre-training. We suspect that starting backpropagation from an initial condition obtained through mode training would be highly advantageous. Third, the mode training we suggest can be done on models with any kind of pairwise connectivity, which include deep, convolutional, and fully-connected Boltzmann machines. We leave the analysis of these types of networks for future work. The mode-training method introduced in the main text requires sampling the mode of the model distribution of a given RBM. This task can be transformed to sampling the optimum of an equivalent weighted, mixed maximum satisfiability (MAX-2-SAT) optimization problem 11 . To obtain high-quality samples for large models, we employ the memcomputing approach 12,14,28 , a novel computing paradigm that employs memory to both store and process information.

Memory Dynamics
Our implementation is based on the approach used in Ref. 25 The voltages, v i ∈ [−1, 1], are continuous representations of the N Boolean variables of the problem, y i , with a false assignment represented as v i < 0, a true assignment represented as v i > 0, and v i = 0 is ambiguous. Rather than thresholding the voltages to check the clause states, we use the clause function directly. A 2-SAT clause in Boolean form is comprised of two literals, {l i,m , l j,m }, where a literal in the m-th clause, l i,m , is either a negated,ȳ i , or unnegated, y i , variable. The Boolean clause is represented as a continuous clause function, The factor q m,i contains the information about the relation between the literal in the m-th clause, l i,m , and its associated variable, and we consider a clause to be satisfied when C m (v i , v j ) < 0.5. By thresholding the clause function we also avoid the ambiguity associated with v i = 0. Each clause has a "fast", x f m , and a "slow", x s m , memory variable that serve as indicators of the history of the state of C m (v i , v j ). The memory is "fast" in the sense that it contains information of the recent history of C m , and "slow" in the sense that it contains information on the entire history of C m . Both memory variables are bounded, where v j is the value of the voltage corresponding to the other literal in the clause. The "rigidity" term in Eq. (A1) is This term only influences the voltage that is closest to the satisfying assignment in the clause.
The weight of each 2-SAT clause, W 2,m , is incorporated in the dynamics of the slow memory variable and the dynamics of voltages. The weights of the 1-SAT clauses are used to bias the voltage dynamics in Eq. (A1) as b i = (W 1,i − W 1,ī )/2, where W 1,i is the weight of the 1-SAT with a literal that is equivalent to variable y i and W 1,ī is the weight of the 1-SAT with a literal that is the negation of variable y i . The weight is zero if no corresponding 1-SAT exists.

Computational Cost of Sampling with Dynamics
For simplicity, let us assume the form of an N × N RBM, resulting in 2N voltage variables and O(N 2 ) clauses as a MAX-2-SAT instance. The time complexity of a forward Euler integration step is dominated by the sparse matrix-vector multiplication of a 2N ×N 2 sparse matrix. Since each node connects to N other nodes this matrix contains N 2 non-zero elements encoding the connectivity structure of the problem. This matrix multiplies the gradient vector of length N 2 , for a total of O(N 2 ) floating point operations per second per time step. If the maximum number of timesteps is independent of system size, the total time complexity is then O(N 2 ). Memory complexity also scales as O(N 2 ), since the algorithm requires the storage of 4 length N 2 floating point elements, and a 2N × N 2 sparse matrix with N 2 non-zero elements.

Performance Comparison Between CD and Memcomputing
Here, we want to show that the RBM energy sampled by the memcomputing approach is consistently better than the one found by CD independently of the size of the RBM. Even though memcomputers are ideally realized as physical devices in hardware 12,28 , here we compare their performance as numerically integrated dynamical systems versus traditional algorithmic methods (e.g., CD).
We then first compute an "exchange rate" between one iteration of the numerical integration and k steps of CD, such that the resulting computational complexity (i.e., wall time on the same processor) will be essentially identical. We discover empirically, across a large range of system sizes, that this exchange rate is about 30 steps of CD per iteration of the dynamics described by Eqs. A1, A2, and A3.
We choose as our test problems a set of randomly initialized N × N RBMs, with all weights sampled from a normal distribution with µ = 0 and σ 2 = 0.01. The system sizes ranged from N = 100 to 1000, which we chose to be large enough to observe the scaling in time. We then compute the relative energy differences, in percentage, ∆ % = 100 * (E M em − E CD )/E CD , between the energy E M em obtained with the memcomputing ODEs described above, as compared to the energy E CD obtained using CD-k. For a direct comparison, we have run the memcomputing solver for N tot = 2N integration steps, scaled with the system size. Contrastive divergence was then run using the empirical exchange rate, k = 30 * N tot , resulting in the same computational cost seen in Fig. 4 (right panel).
The energy results are plotted in Fig. 4 (left panel), where the memcomputing dynamics perform very favorably in terms of energies obtained compared CD-k, consistently above 400%, often showing an improvement of more than 1000%. In terms of time complexity (right plot), both algorithms follow the same linear trend on a log-log plot, indicating a polynomial scaling. Indeed, the best fit asymptotic behavior of both algorithms is almost cubic. This is consistent with our complexity analysis in the last section. Since both algorithms have a leading order scaling of O(N 2 ) for a fixed number of iterations, they would scale cubically if we allowed the number of iterations to grow as N , the system size. Finally, we want to stress that the set of equations used in the present work are only an example of how to implement a memcomputing solver and have not been optimized in terms of both speed and performance. The stability of a pre-training procedure to training neural networks is a very desirable feature. This is because the KL divergence cannot be monitored during the pre-training process for a realistically sized RBM, so it is crucial for us to ensure that the KL divergence does not diverge. In this section, we show that using the mode in the model update term will guarantee convergence to a uniform distribution, and there is an optimal learning rate that provides the largest rate of convergence, with the learning rate being easily computable.
Note that in this work, the data term of a mode-assisted update is the same as traditional CD algorithms, so the difference is entirely in the way that the model term is approximated. Therefore, we only have to focus on the model term (which is "approximated" by the mode of the joint distribution), and point out some of its key properties, in particular those pertaining to the stability of the pre-training procedure.

Gauging the RBM
For the sake of simplicity, we consider an n × m unbiased RBM with nodal values of v ∈ {−1, 1} n and h ∈ {−1, 1} m , then the RBM energy is given by: Note that an RBM with nodal values {0, 1} can be trivially transformed into one with nodal values {−1, 1}. For the analysis in this appendix, we will always assume (unless specifically mentioned) that an RBM is unbiased and equipped with nodal values {−1, 1}.
Since in our work, we are interested in particular to the mode of the joint distribution, which is equivalently the nodal configuration that minimizes the RBM energy, we give a special denotation to this configuration, {v * , h * }, and name it the ground state of the RBM energy. Furthermore, we denote the ground state energy to be Note that in practice, the ground state of an RBM can be thought of as being unique. In fact, for randomly initialized weights, the probability of having two or more minimal energy states, or degenerate ground states, is of measure zero. In theory, if there were to be multiple ground states, we can randomly select one of them to be {v * , h * }, and our analysis will not be affected at all.
Note that for any RBM, we can always map it to an equivalent RBM such that the ground state is +1. This is called a gauge operation, which we formally define as follows Definition B.2 (Gauged RBM). Given an n × m RBM with weights W and ground state {v * , h * }, we define the gauge mapping G : R nm → R nm such that W = G(W) satisfies the following condition: Then we call W a gauged RBM.
Remark. Note that by this definition, it is easy to see that the ground state of any gauged RBM must be +1. This means that the ground state energy of a gauged RBM is simply the sum of its weights Furthermore, note that the form of the weight update equation is invariant under conjugation. In other words, if we let f : R nm → R nm denote one iteration of weight update, then it is clear that This means that the dynamics of W can be analyzed in terms of the dynamics of W . In this section, we will always assume that the RBM is gauged.
For a gauged RBM, the change of the weight elements (under unit learning rate) as a result of an iteration of mode-informed update is: Therefore, we see that every weight element is decremented by 1 uniformly across the entire weight matrix, and the energy change of the ground state energy is:

Metric
In order to investigate how the joint probability mass function (joint PMF), or p(v, h), evolves under mode training, we have to look at how the energy changes for all nodal states. To do so, it is useful to define a distance measure between two states, to have a sense of how "far apart" the two states are. We then propose the following distance measure. , we define the distance to be Remark.
Note that n v simply counts the number of visible nodes that are different between the two states, and m h counts the number of different hidden nodes that are different. Note that the space of s with this distance definition is a pseudometric space, in the sense that it is possible for the distance between two distinct points to be zero, in particular states that are related by Z 2 symmetry (or global spin flips). This can be easily verified by letting s 2 = −s 1 , giving us n v = n and n m = m, and d(s 1 , s 2 ) = 0. In this pseudometric space, the distance d is a measure of how "different" two spin states are up to a Z 2 symmetry. A formal discussion of this metric, including a proof of triangle inequality, is provided in Appendix C of our related work 24 . The usefulness of defining the metric this way will be apparent in proposition B.1.

Remark.
It is important to note that {n v , m h } is not uniquely determined by d. To see this clearly, we rewrite Eq. (B3) in terms of the following Diophantine equation solving for integers n v ≤ n and m h ≤ m. It is easy to see that this equation is over-determined by realizing that it is possible for the RHS to have multiple prime factors.

Energy Change
Equation (B2) gives the change in the energy of the ground state under a mode-assisted update iteration. However, to analyze the stability of the training procedure, it is necessary to look at the energy change of all states. To simplify our discussion, instead of looking at the energy of each individual state, let us consider the average energy of all the states distance d from the modal configuration, which we denote as E(d). Note that the average is not the expected value over the joint PMF p(v, h). Rather, it is an unweighted average (or the expected value over a uniform probability measure). It is interesting to note that this average energy only depends on the ground state energy and the distance d from the ground state.
Proposition B.1 (Average Energy). The average energy of states distance d from the ground state is: Proof. Given some distance d, there can be multiple assignments of {n v , m h } that correspond to this distance. However, if given a particular tuple {n v , m h }, we show that the average energy of all states with spins differing from the ground state by {n v , m h } is only dependent on the distance d corresponding to the tuple, then the average energy of states of distance d from the ground state is simply the average energy of states of with spins differing from the mode by {n v , m h }.
The average energy of states with spins differing from the ground state by {n v , m h } can be expressed as where in the last equality, we used the linearity of the expected value and the symmetry of the RBM. We easily see that the marginal probability distribution of a single spin is given by (with the underlying joint distribution being uniform) which gives us Therefore, the average energy of states distance d from the mode is also E(d ) = E 0 (1 − 2d ).
Since the average energy distance d from the ground state is only dependent on d, we expect this to be true also for the change in energy for a state at a distance d from the ground state, under the weight update routine given in Eq. (B1).
Proposition B.2 (Energy Change). Given any state distance d from the ground state, the change in the energy of that state is given by Proof. Again, we only have to focus on one particular assignment of the tuple {n v , m h } which corresponds to the distance d, and show that the change in energy of a state corresponding to that tuple depends only on d. Without loss of generality (WLOG), we assume that the first n v visible nodes are of value −1, and the first m h hidden nodes are of value −1. Then the change in energy is given by: where we have used the fact that δW ij = −1 from Eq. (B1).
Remark. Note that the energy change is only dependent on the size of the RBM and the distance d from the ground state, so all the states at distance d experience the same energy change. Under a given learning rate γ, the actual energy change is then Combining propositions B.1 and B.2, we see that the energy change can be alternatively written as At this point, it is necessary to take an intermission to look at the role that the mode update term plays in the pre-training procedure. From Eq. (B4), we see that the energy change of a state distance d from the ground state is proportional to the average energy of the states at the same distance E(d). In the context of the entire pre-training procedure, this energy change can be interpreted as a constant drift term that pulls the energy back to zero with strength proportional to the average energy of all the states of the same distance. Loosely speaking, the joint distribution will become more uniform under an iteration of mode-assisted update.
Note that this behavior can also be achieved with standard regularization procedures such as an exponential weight decay term like δW ij = −W ij . However, such regularization techniques are usually undesirable as they do not induce an effective sampling of a multi-modal distribution. Our procedure, however, does not suffer from such drawbacks, and in fact promotes the effective sampling of a multi-modal distribution (see section C).

Approaching Uniformity
In this section, we formalize the argument that the RBM energies over all states become more uniform under a mode-assisted update iteration. To do so, we mainly focus on the energy variance across all states, and show that it must decrease under a suitable learning rate. This statement can be made more precise as follows. Proof. We reiterate the fact that the underlying PMF for the states is assumed to be uniform, or f (s) = 1 2 n+m for every nodal configuration s. We can then define a random variable D with its PMF being: which can be interpreted as the probability of a randomly chosen state to be a distance d from the ground state. From this PMF expression, we can easily derive the expected value and the variance of the distance of two randomly chosen states where we see that the variance is small relative to the expectation value for a large system. We then use the law of total variance to write the variance of the energies over all states as We first begin by focusing on the first term. Note that the term Var s (E(s) d(s) = D) is the conditional variance of energies of the states distance D from the mode. If we update the energies according to Eq. (B4), then the new variance can be written as Var s (E(s) + γnm(1 − 2D) d(s) = D). The term γnm(1 − 2D) is dependent only on D but not the specific nodal configuration s, so it is just a constant offset in the context of the conditional variance, and the variance will remain constant. Therefore, the first term of the variance decomposition is constant, and we only have to focus on the second term, which can be conveniently written as: After a weight update, this variance becomes In this form, it is easy to see that the variance decreases when the learning rate satisfies with the largest decrease being δVar D (E(D)) = 4E 2 0 Var D (D) = E 2 0 nm , which occurs at the learning rate γ = −E 0 /nm. This is then our optimal learning rate.
Remark. To avoid confusion, note that E 0 is negative, so − 2E0 nm is positive, so the learning rate γ is bounded in some positive interval. Note that the two biases for the visible and hidden spins can be expressed as two ghost spins 24 , thereby effectively adding one more spin to each layer. By taking into account the biases, we see that the largest decrease of the variance occurs when which is what we use in the main text.
There are two important things to note here. First, the learning rate, as presented in Eq. (B9) is generally very large and is only optimal in the sense that it provides the fastest convergence to a uniform joint PMF, which is desirable for a stable pre-training routine, but not necessarily optimal for minimizing the KL divergence. The practical usefulness of Eq. (B9) is to mainly provide an upper bound to the learning rate that ensures stability. It should be noted that the analysis ignores the presence of the data term (see Eq. (6) in the main text) and is only carried out over a single iteration; in other words, it may be possible that a large learning rate will force the system into a local minimum in the KL divergence rather quickly. Therefore, in the practical setting a smaller learning rate would be more beneficial. In the main paper, we then normalized this learning rate with the learning rate of CD, which results in CD γ < γ (as CD < 1).
The second thing to note is that Eq. (B9) is not exact as the ghost spins are fixed nodes that cannot be "flipped", so theorem B.1 no longer applies, meaning that the average energy of states distance d from the ground state can no longer be uniquely determined by E 0 and d alone. Nonetheless, for large RBMs, the contribution from biases are relatively small, and the approximation is close to exact.

Suboptimal Updates
Before we conclude this section, we make two final remarks concerning suboptimal updates, or updates that are not informed by the global mode directly. The first remark pertains to a practical setting where locating the global mode is difficult or too computationally expensive, and only an approximate mode can be obtained, or a state with energy close to the ground state. We discuss how an update informed by this state still ensures stability. The second remark compares a mode-assisted update with an update with the model term sampled by some form of stochastic algorithm (such as CD), and we show that the latter update procedure does not ensure stability.
Note that in Eq. (B 1), we transformed the weight elements such that the ground state is v * = +1 and h * = +1. However, this procedure is general and can be done for any given state. Given any two states, v 1 and h 1 with some associated energy E 1 , it is always possible to gauge the RBM in a way such that v 1 = +1 and h 1 = +1. The previous proofs will still carry through for E 1 as long as E 1 < 0. This means that the mode training procedure does not hinge on the fact that the weight update has to be informed by the exact ground state, and any state sufficiently close to the ground state should suffice. However, it should be noted that using the ground state to inform the weight update provides the greatest decrease in energy variance since the maximum of δVar D (E(D)) scales quadratically with E 0 (see Eq. (B7)).
Note that in theorem B.3, the argument that the conditional variance of the energies conditioned on some distance d from the ground state does not change is based on the fact that the weights are updated uniformly across the RBM according to Eq. (B1). However, for a stochastic algorithm, the weight updates are clearly not uniform (or even deterministic for that matter), so nothing can be said about the change of the conditional variance. It is possible for the conditional variance to increase under a stochastic update, thus pulling the energies away from uniformity if the magnitude of the increase overcomes the decrease in the second term in Eq. (B6) (the ground state variance).
To conclude this subsection, we discuss briefly the contribution of the data term in updating the weight matrix. Clearly, if we look at the gauged RBM matrix, the change in each element generated by the data term is bounded above by +1, meaning that its contribution cannot overcome the guaranteed −1 decrease generated by the mode update term. This means that it is impossible for the ground state energy to decrease even in the presence of the data term, so the mode of the joint distribution must not increase, thus the training never diverges. This effectively ensures the global stability of our mode-assisted training method.

Appendix C: Efficient Sampling of Multi-modal Distributions
So far, we have shown that our update procedure guarantees stability. However, as briefly mentioned at the end of section B 3, stability is also guaranteed by standard regularization terms such as the weight decay term, δW ij = −W ij . In this section, we make the crucial distinction between our procedure and standard weight regularization procedures by pointing out the key phenomenon that our procedure is capable of efficiently exploring the landscape of a multi-modal PMF.
This property of the mode-training method is most readily analyzed from the perspective of the frustration index of the RBM instance. The frustration index can be interpreted as a measure of the difficulty of discovering the nodal ground state of a given RBM instance, and interestingly, an increase in the frustration index is correlated with an increased rate of exploration of the multi-modal distribution. Therefore, in some sense, for a given iteration of weight updates, the difficulty of finding the mode of that distribution is "compensated" by an increased efficiency of PMF exploration.
We begin by formally defining the frustration index, followed by a brief explanation of how the mode-training algorithm explores efficiently the PMF. Finally, we relate the two concepts in a cohesive manner. We provide an extensive analysis on the frustration of the RBM and its practical applications in our related work 24 .

Frustration Index
The frustration index is the ratio between the sum of unsatisfied couplings at the ground state and the sum of all coupling strengths. Formally, for a gauged RBM, it can be defined as follows This index is closely related with the degeneracy of the low-energy states. In other words, with an increase in the frustration index, the excited states will be spaced closer to the ground state in energy. Furthermore, for a highly frustrated system, the transition from the ground state to the excited states usually involves flipping a large cluster of nodes. This gives rise to a large population of local minima in the energy landscape spaced far apart in distance but close together in energy (in terms of the metric discussed in section B 2), and this property of a highly frustrated system makes it difficult for local search algorithms to locate the global minimum. This motivates the need for an algorithm that is able to learn the long-range correlations of the RBM spins, and a possible candidate of this algorithm is presented in section A.

Inefficiency of Weight Decay
In this section, we discuss briefly why the standard weight decay algorithm δW ij = −γW ij (where γ is some learning rate) is not efficient in assisting local algorithms in sampling a multi-modal distribution. To begin with, we first recall that the joint distribution of the RBM is where E(v, h) = ij W ij for a gauged RBM. Note that the weight decay update is a contracting affine transformation of the energies of all states, or simply a rescaling of the energies by some constant β = (1 − γ) < 1, meaning that the joint distribution transforms as where the normalization condition is ignored.
Of course, the distribution does become more uniform under this transformation; however, the ordering of the states with respect to their energies will not change, meaning that the ordering of the dominant modes remains invariant under this transformation. In other words, a poorly initialized Markov chain trapped under a dominant mode will still remain trapped unless β becomes sufficiently small; this means that a large learning rate, γ, is required to free the Markov chain and allow efficient exploration of the joint distribution. However, a large learning rate in this context is undesirable, as it brings the RBM to uniformity in a drastic manner, which voids much of the information gained from the previous iterations of pre-training. The inefficiency of this approach boils down to the indiscriminate update of the weight matrix that is ignorant of the energy ordering of the states or the distance between them (see definition B.3).
Our mode-assisted update, on the other hand, updates the weight matrix based on the ground state configuration of the RBM, resulting in a maximal increase in energy for the ground state, and the energy change is "propagated" to the other states based on their distances from the ground state (see proposition B.2). An entirely different energy landscape will then emerge under this update procedure even under a small learning rate, and it is likely that a new ground state at a faraway distance will "pop" up. The next update iteration is then based on this new found mode, and the process is repeated. Effectively, we are dynamically sampling the energy landscape by making large leaps between dominant states without resorting to forcing uniformity on the energies.

Global Mode Cycling
For the sake of simplicity, consider a gauged RBM with a joint distribution having three dominant states, with their RBM energies being E 0 < E 1 < E 2 . The heuristic analysis in this section can be easily generalized to multi-modal distributions with arbitrary number of dominant states. We can also assume that the pairwise distances between the three modes are the same (meaning that the three modes form an equilateral triangle under the metric defined in definition B.3), which we can then denote simply as d.
If we assume that the learning rate is γ, then from Eq. (B4), we see that the new energies of the three states will become where we are assuming that the magnitude of the learning rate is much larger than the energy gaps 29 , or more precisely where we note that the lower bound of gamma is proportional to the energy difference between the first excited state and the ground state. This guarantees that after one update, the ordering of the new energies of the states will become is the new ground state energy, and the next iteration of weight update will be based on state E The energies are then reordered as 2 becomes the new ground state energy. And similarly, the third iteration will recover the original energy ordering E Therefore, we see that in general, whenever we perform a weight update, the energy ordering of the modal states will experience a left circular shift, so we are, in some sense, sampling the multiple modes in a cyclic fashion, which allows us to effectively cover a large volume of the probability measure.

Relationship between Frustration and Mode Sampling
Now, we discuss how an increase in the frustration index is conducive to an efficient sampling of the multimodal distribution. We here consider simply a gauged n × n RBM, with ground state energy E 0 . We denote the average energy of states distance d from the ground state as E(d) (see proposition B.1). Under an iteration of mode update, the new energies are (see Eq. (B4)) 2d).

a. Small Frustration
For the sake of simplicity, consider the case where the frustration index of the RBM is zero, then all the weights can be assumed positive. Furthermore, we make the simplifying assumption that the weights are iid 30 random variables with uniform distribution in [0, 1]. We then make the following claim.
Proposition C.1. If we update the weight matrix continuously with the mode-assisted update procedure, then the new ground state will differ from the original ground state by distance d ∼ 1 n almost surely.
More formally put, if we let E min (d) be the minimum energy of states distance d from the ground state, and ∆E(d) = E min (d) − E 0 . Then the smallest learning rate for which a new ground state can emerge is If we let d be the distance such that 2d n 2 γ = ∆E(d ), then Proof. Given any state distance d from the ground state, we have WLOG, we can also assume that n is a prime number, then the number of states distance k n from the ground state (where k < n) is 2 n k , which gives us (denoting β = √ 6(1 − 2γ) and k = nd).
Pr 2dn 2 γ > ∆E min (d) Note that ∀ ∈ (0, 1), we let β such that J(n, 1, β ) = 1 − , then we have This result implies that in the limit of large n, the new ground state is only likely going to differ from the old ground state by distance d ∼ 1 n , so we are only moving away from the old ground state by a very small distance. This means that a small frustration is not conducive to an efficient sampling of the phase space.
b. Large Frustration A highly frustrated system is generally hard to study, so we here provide a brief heuristic argument for the efficient sampling of the PMF for a highly frustrated RBM. Recall that in the case of large frustration, the first excited state differs from the ground state by a large number of nodes (hence a large distance d) but by only a small amount of energy. Also recall from Eq. (C1) that the lower bound of the learning rate scales proportionally to the energy difference and inversely proportionally to the distance. Putting the two results together, we see that in order for the first excited state to become the new ground state, we only require a very small learning rate (which is conducive to a faster convergence of the KL-divergence), and furthermore, transitioning from the ground state to the new ground state effectively allows us to traverse a large distance, which allows us to efficiently sample the full PMF.

Unnormalized PMFs
Recall that we base the analysis in this section on an n × m unbiased RBM with nodal values of v ∈ {−1, 1} n and h ∈ {−1, 1} m . The discussion in this section can be easily extended to a biased RBM. To ease the burden of notation, we first begin by defining an angle variable θ = v · W, which allows us to rewrite the RBM energy and the joint probability mass function (PMF) as follows: where Z is the partition function of the RBM. The marginal PMF of the visible layer can be obtained by fixing the visible layer and summing the joint PMF over all the hidden layer configurations: where the last equality is obtained by factoring the sum into each individual hidden nodes.
Since we are mainly concerned with the correspondence of the modal configurations instead of the normalized probability mass, we can simply ignore the constant prefactor 1 Z as the normalization prefactor and simply look at the unnormalized PMFs: where the use of the capital letter P is to denote the unnormalized PMF. Note that since p → P is an affine transformation, the ordering of the states in terms of their energies is invariant.
An issue we have to first address is that the nodal configuration of the joint distribution is described by the configurations of both layers {v, h}, while the nodal configuration of the marginal distribution is only described by the visible layer v. So in order to compare the nodal configurations of the two PMFs, we have to relegate P (v, h) into a PMF that only depends on v, which we do as follows: Definition D.1. Given a PMF P (v, h), we denote Remark. In other words, Q(v) is the maximum of the P (v, h) over all h under some fixed v. Note that the purpose of this definition is to have the mode of Q(v) be the same as the mode of P (v, h) "projected" onto the space of v. In other words, if we let {v * , h * } be the mode of the joint distribution P (v, h), then we have the following: This means that the mode of the joint distribution P (v, h) is the same as the mode of Q(v) in the v component.

Remark.
Note that there is a bijection between the visible configurations and the angle variables given by θ = v · W, so we can make Q depend on θ instead, or Q(θ), which is usually the form that we will be using for this section. Similarly, we can also write P (θ) as the unnormalized marginal distribution.
To simplify the analysis of modal correspondence, we first obtain a closed form expression for Q(v): Proof. Note that the expression for P (v, h) can be written as P (v, h) = exp( j θ j h j ). It then follows that arg max h P (v, h) = exp(arg max h j θ j h j ) = exp( m j=1 arg max hj (θ j h j )). Since h j ∈ {−1, 1}, it is easy to see that arg max hj (θ j h j ) = |θ j |. Therefore, we have Q(v) = arg max h P (v, h) = exp( j |θ j |) Now, if we denote v as the v component of the mode of P (v, h) and v as the mode of P (v), then the question of whether the marginal mode equals to the joint mode can be succinctly expressed as The equality, in fact, does not hold in the absolute sense, and it is very easy to construct pathological examples to violate the equality. However, for practical purposes, we only need this equality to hold with some nonnegligible probability for an RBM with weights randomly sampled from some distribution. We then formally define the notion of correspondence as follows Definition D.2. Given an n × m RBM with weights w sampled from some distribution f W (w), we say that the marginal mode and joint mode of the RBM are strongly correlated if the following holds where v is the v component of the mode of P (v, h).

Remark.
First, we recall that v is the v component of the joint distribution P (v, h). If v is also the mode of the marginal distribution P (v), or v = v , then clearly we require that P (v) ≤ P (v ) = P (v ), for all v configurations. In order to weaken the condition of exact modal correspondence, we simply require that the probability of the inequality, P (v) ≤ P (v ), holds for all v to be greater than some arbitrary value, which we chose to be 0.5 here.

Trivial Cases
There are two cases where proving the modal correspondence is trivial; the two cases occur at the beginning and end of the pre-training respectively. At the beginning of the training, the frustration index is small for the RBM, and the system is trivially ferromagnetic. At end of the training, the magnitude of the weights are large, and the nodal activation of the hidden layer is almost certain.

a. Small Frustration
If the frustration index is small, we can state the following.
Proof. We look at the gauged RBM where all weight elements are non-negative. Recall that the ground state of a gauged RBM is +1, then we have arg max {v,h} P (v, h) = +1, which implies arg where the inequality comes from the fact that W ij ≥ 0 and v i ∈ {−1, 1}, so we have arg max v P (v) = +1 as well. The proposition is then shown.
Remark. Note that this proposition implies directly modal correspondence as defined in definition D.2 in the absolute sense.

b. Large Weights
Near the end of the RBM training, the magnitude of the weights are usually very large (thus also the magnitude of the elements of θ), and the activation of the hidden nodes becomes increasingly certain. Intuitively speaking, this means that given any visible configuration, there is only one dominant hidden configuration corresponding to it. Therefore, the marginal distribution p(v) (which involves the sum over all hidden configurations) can be effectively approximated with the joint distribution p(v, h). We formalize this argument as follows: Proposition D.3. Given an n × m weight matrix, W, with the joint mode v satisfying and the ground state is not degenerate. Then ∃M > 0, such that for an RBM with the weight matrix, M W, the following is true Proof. We look at the gauged RBM so that the ground state is +1, then we set Let v be the visible component of any other state, then we denote Then the following must be true Recall from proposition D.1 that Furthermore, we can write the marginal distribution as Note that lim x→∞ = 2 cosh(x) exp(x) = 1. This implies that ∀δ > 0, ∃x > 0 such that 2 cosh(x) < (1 + δ) exp(x). If we assume that the proposition is false, then we can set δ < exp( /m) − 1 and choose M > 0 such that a contradiction. Therefore, the proposition must be true.

Appendix E: Showing Modal Correspondence
We have shown in the previous section that the modes of the joint and marginal PMF of the RBM correspond absolutely under two trivial cases: large weights and small frustration. The remaining case where the weights are small and the frustration is large is highly non-trivial, and we dedicate this entire section to showing, in the probabilistic sense, the modal correspondence as defined in definition D.2. The problem of showing modal correspondence can be reduced to analyzing the value Gaussian integrals over simplexes of varying sizes. Before we tackle this problem, we first formalize the notion of a random RBM.

Random RBM
Definition E.1 (Random RBM). A Random RBM is an RBM with a weight matrix, W, whose elements are iid normal variables with mean µ = 0 and standard deviation σ. Furthermore, the configuration of the visible layer is sampled uniformly from {−1, +1} n .
Lemma E.1. Given a random RBM, {v, W}, the angle variables, are iid normal variables with mean 0 and variance σ 2 θ = nσ 2 . Proof. This is a three stage proof. First, we have to show the product W ij v i is a random normal variable, so the elements of θ are also random normal variables. Second, we show that the probability distribution function (pdf) of θ is a multivariate normal distribution. Finally, we show that the elements of θ are uncorrelated, thus implying that they are independent.
To show that W ij v i is a random normal variable, we find the cumulative distribution function (CDF) of this product, and show that it is the CDF of a normal distribution. The CDF of the product is given by which is simply the CDF of W ij . Note that we have exploited the fact that the PDF of W ij is even. Therefore, θ j = i W ij v i is the sum of n random normal variables, resulting in another random normal variable N (0, nσ 2 ).
To show that the pdf of θ is a multivariate normal distribution, it is sufficient to show that any linear combination of the angle variables is a normal variable. Let the linear combination be If we denote φ i = j W ij c j , then the linear combination can be expresed as i v i φ i . Note that we can show that v i φ i is a random normal variable by the same argument as above, then i v i φ i must be a random normal variable as well, as it is the sum of independent normal variables. Therefore, θ is a multivariate normal distribution.
Finally, since the PDF of θ is a multivariate normal distribution, to show that θ are independent random normal variables, it is sufficient to show that any two elements of θ are uncorrelated. For j 1 = j 2 , we have where we have used the fact that E(v i1 v i2 ) = δ i1i2 . The lemma is then proved.
Remark. An important consequence of this lemma is that we can parameterize a random RBM with the angle variables θ, as the distributions of v and W are fully captured as the distribution of θ as iid normal variables.
As an RBM with large weights trivially satisfies the modal correspondence condition (see proposition D.3), we can assume the weights are small for the sake of non-triviality, and make the following approximation for θ: where the right arrow in the last line denotes an affine transformation which preserves the ordering of the probability masses. Similarly, we approximate Q(v) as follows:

Simplex Condition
To show modal correspondence, it is convenient for us to fix Q(θ), and analyze the conditional distribution of θ. In particularly, we wish to show that if Q(θ) is large, then the conditional expected value of P (θ) will also be large. First, we denote the conditional distribution of θ under a fixed Q(θ) as f (θ Q(θ) = α). Recall that Q(θ) = j |θ j | so the level set of Q(θ) are composed of simplexes, one in each quadrant. Note that θ are iid normal variables, so the PDF is spherically symmetric. Furthermore, θ 2 is also spherically symmetric. This means that all moments of P (θ) are invariant if we rewrite the condition as Lemma E.2. The following two conditional distributions are equivalent.
Proof. Omitted. Follows directly from the spherical symmetry of the PDF of θ.
The graph of condition (E1) is a regular simplex of length √ 2α and dimension m − 1 in the first quadrant, which we can denote as ∆ m−1 α , and we can write the conditional PDF as f (θ ∆ m−1 α ). It is convenient for us to apply an orthogonal transformation to θ, and we denote the new angles as φ = T θ. Note that the new angles are still independent normal random variables, since an orthogonal transformation preserves the independence of normal variables. The orthogonal transformation is chosen such thatφ 1 points from the origin to the centroid of the simplex. We denote φ as the components of ϕ other than φ 1 , meaning that φ = (φ 1 , ϕ). Proof. This can be shown by realizing that n is the displacement of the centroid from the origin, which is perpendicular to the m − 1 hyperplane the simplex is in. In other words Remark. This lemma allows us to express the condition distribution of θ in terms of ϕ.
Lemma E.4. Let ϕ be iid normal variables with variance σ 2 θ , then Proof. This can be shown by realizing that if θ are iid normal variables, then φ must also be iid normal variables. The intersection of the PDF of iid normal variables and a hyperplane is also a PDF of iid normal variables (with one less dimension).
The conditional expected value of P (θ) can then be expressed as Similarly, the conditional variance can be expressed as Note that the k-th moment of ϕ 2 conditioned on the simplex is To lessen the burden of notation, we denote the following Gaussian integral J(σ, α, k) = Note that for the argument of the error function, the real part is close to λ √ 2 = 1 √ π < 1, and the imaginary part approaches zero for large m. We then expand the error function as follows.
erf(x + iy) ≈ erf(x) + 2i √ π exp(−x 2 )y + 2x √ π exp(−x 2 )y 2 , which gives us where we kept terms only up to the order of lim m→∞ (1 + 1 m r ) m = 0 as m → ∞ for r > 1. We can then approximate the m-th power of the above result by using the fact that lim n→∞ x + y √ n + z n n x n exp z x + √ n y x − 1 2 ( y x ) 2 = 1, which allows us to write where we have denoted We then make the following approximation to the integral: ∞ −∞ dp exp(−p 2 ) exp(ap + bp 2 ) = π 1 − b exp a 2 4 (1 − b) .
We can then evaluate the integral J and perform the following linearization around k (the reason for the choice of k will be clear in the following subsection): (E8)

Density of States
We first briefly discuss the choice of k as appeared in Eq. (E5). We first recall that the size of the simplex, is the sum of m iid half-normal variables each with mean 2 π σ θ and variance (1 − 2 π )σ 2 θ . This means that in the limit of large m, α can be considered a normal variable with mean 2 π mσ θ and variance (1 − 2 π )mσ 2 θ . We then see that in the limit of large m, α is sharply peaked at its mean (as the relative standard deviation scales as 1 m ), as the result of the LLN (law of large numbers). This means that the probability that α deviates from its mean by some constant fraction scales as e −m .
However, this exponential decay is compensated by the exponential increase in the number of visible configurations, which is simply 2 n . In fact, for an n × n RBM, the contributions from the law of large numbers and entropy balance out, and a simplex whose size deviates from the typical value of α can still be likely generated by some visible layer configuration. We formalize this argument as follows, where k = α mσ θ is taken to be a random variable with mean 2 π and variance 1 n (1 − 2 π ).
Definition E.2. For a random n × m RBM, we define its density of states at k, D(n, m, k), to be D(n, m, k) = lim δk→0 E W N (k, k + δk) δk , where N (k 1 , k 2 ) denotes the expected number (taken over the probability measure of the weight matrix) of visible configurations that generates a simplex whose size is from kmσ θ to (k + δk)mσ θ .

Remark.
For a n × m random RBM, if we take the graphs of all the simplexes generated by all the visible configurations. D(n, m, k) is simply a measure of how "densely packed" the simplexes are at k.
Proposition E.6. For an n × n random RBM, we denote the density of states to be D(n, k) = D(n, n, k). If we let Then ∀δk > 0, Note that exponent evaluates to 0 at k = k o or k = k , and the exponent is negative if k > k or k < k o , so the proposition follows.

Remark.
This proposition implies that for a random n × n RBM, the largest size of the simplex generated by the visible configuration is typically k , which corresponds to the mode of the joint distribution v . It is convenient to denote the deviation from the size of the largest simplex as κ = k − k, and the size difference between the smallest and largest simplexes as δκ = k − k o , then under this parameterization, we can write the density of states as D(n, κ) = n 2π − 4 exp (κ)(∆κ − κ) If we denote A = (0.376 + 2k ) and B(κ) = 0.942 2 + (0.942 − 0.432κ) 2 , then for sufficiently large κ > 0, we have the following approximation Pr P (κ) > P (0) = 1 2 erfc C(n, κ) ≈ 1 2 1 C(n, κ) √ π exp − C(n, κ) 2 , where we have denoted noting that it scales with √ n. Then clearly, ∀δκ > 0, we have lim m→∞ erfc C(n, κ) C(n, κ) √ π exp − C(n, κ) 2 , meaning that the asymptotic approximation to the erfc function is valid in the limit of large n.
If we denote an instance with the random variable P (θ) conditioned on the simplex j |θ j | = (k − κ)nσ θ as P (κ), then we can make the following approximation to Eq. (D2).
where δκ > 0 is some small constant. We formalize this approximation as follows.
To prove the proposition, it is sufficient to show in the limit of large n that the integral goes to zero for the first and last terms, and the asymptotic approximation for the error function is valid for the second term.
Recall that D(n, κ) = n 2π − 4 exp (κ)(∆κ − κ) Proof. This can be shown by directly evaluating the logarithm of the integral as given in Eq. (E10), and verify that the result is greater than log( 1 2 ), up to a certain value of n max .