Mode-Assisted Joint Training of Deep Boltzmann Machines

The deep extension of the restricted Boltzmann machine (RBM), known as the deep Boltzmann machine (DBM), is an expressive family of machine learning models which can serve as compact representations of complex probability distributions. However, jointly training DBMs in the unsupervised setting has proven to be a formidable task. A recent technique we have proposed, called mode-assisted training, has shown great success in improving the unsupervised training of RBMs. Here, we show that the performance gains of the mode-assisted training are even more dramatic for DBMs. In fact, DBMs jointly trained with the mode-assisted algorithm can represent the same data set with orders of magnitude lower number of total parameters compared to state-of-the-art training procedures and even with respect to RBMs, provided a fan-in network topology is also introduced. This substantial saving in number of parameters makes this training method very appealing also for hardware implementations.

One of the most highly influential models in Artificial Intelligence (AI) and deep learning in particular is the Boltzmann machine (BM). They were constructed 1,2 as a powerful stochastic generalization of Hopfield networks 3 that possess a simple expression for their log-likelihood gradient. However, they remained impractical to train due to their reliance on high-dimensional sampling to calculate that gradient. A relatively efficient learning algorithm, called contrastive divergence (CD), was discovered for BMs with a simplified topology called restricted Boltzmann machines (RBMs) 4 , which have since gone on to see success in various domains 5 . However, the extension of RBMs to deep Boltzmann machines (DBMs) has been difficult 6 , and as such, DBMs are now mostly overshadowed by their deep feedforward cousins 7 in generative applications. This is not due to DBM's lack of ability, but rather the absence of effective means to train these models. There remain quite a few reasons to search for better learning algorithms for DBMs, as they are a versatile computational medium. A principle use is as compact generative models for complex probability distributions in unsupervised settings, considered a critical component of the forthcoming "third-wave" of AI 8 . Trained DBMs can also serve as an informed prior for feedforward networks, leading to better generalization in supervised tasks 9 . In the physical sciences, DBMs serve as powerful variational representations of many body wavefunctions, more efficiently 10 than RBMs 11,12 , and have potential applications in condensed matter physics and quantum computing 13 .
Various attempts have been made to climb the summit of DBM training [14][15][16][17][18] . Most approaches rely on pretraining by breaking up layers into RBMs and training them sequentially, after which the DBM is fine-tuned jointly. Correlations between layers are ignored during pre-training, minimizing the potential advantages of the deep architecture and can be disruptive to joint training 18 . Recently, the authors introduced mode-assisted training 19,20 , which combines CD with samples of the model distribution mode. This stabilizes training, allows the learning of very accurate densities, and strikes a better tradeoff between accuracy and computational cost compared to CD. As the method is agnostic to the connectivity of the network, we can apply mode-assistance to DBMs with the hope of capturing the model capacity missed by other approaches.
In this work, we find that the benefits of mode-assisted training are even more dramatic in the case of DBMs. In fact, it produces more accurate models without requiring pre-training while also utilizing orders of magnitude less parameters, compared to pre-trained 14 or centered DBMs 18 . The role of the network topology is also discussed, where we discover that DBMs are easier to train if the size of the layers decreases with depth. We evaluate the density modeling performance of mode-assisted DBMs by computing exact log-likelihoods achieved on small data sets and approximating the likelihood on the MNIST data set. The approach we propose can be extended to other types of neural networks, and is relevant also for the hardware implementation of these models, where a much smaller number of parameters directly translates into components and energy savings.
To see how mode-assisted training can be extended to deep architectures, we give a quick overview of DBMs and the basic approach to their training. DBMs are undi-rected weighted graphs that differentiate between n v visible nodes, and layers of n latent, or 'hidden', nodes, not directly constrained by the data 14 . We assume that > 1 as = 1 recovers the RBM, and like an RBM, there are no connections within a layer. Each state of the machine corresponds to an energy of the form where the biases a ∈ R n0 , b ∈ R n , and weights W ∈ R n −1 ×n are the learnable parameters. The energy function in Eq. (1) induces a Boltzmann-Gibbs distribution over states, where x = (v, h (1) , . . . , h ( ) ). The partition function, Z = {x} e −E(x) , involves the sum of an exponentially growing number of states, making the exact computation of its value infeasible for most data sets. During learning, a DBM is tasked to match its marginal distribution over the visible layer, p(v) = {h} p(v, h), to an unknown data distribution, q(v), represented by a data set, D. Training a DBM amounts to a search for the appropriate weights and biases that will minimize the quantity known as the Kullback-Leibler (KL) divergence between the two distributions, or, equivalently, maximizing the log-likelihood of the dataset, LL(p) = v∈D log p(v). The optimization of the non-linear, and typically high dimensional Eq. (3) (or log-likelihood), is often done via stochastic gradient descent with respect to the DBM parameters, which leads to weight updates of the form 21 , For every gradient update in Eq. (4), nodal statistics must be computed under two different distributions. The first one on the RHS of Eq. (4) is called the "data term", and is an expectation over the data induced distribution, q(v)p(h|v), with the network's visible layer fixed to the data. The second term on the RHS of Eq. (4) is called the "model term" which is an expectation over the entire model distribution in Eq. (2). In the case of RBMs, the data term can be sampled from exactly, but the model term must be approximated. With DBMs, the data term must also be approximated, most popularly with an iterative mean field procedure (see Methods).
In both cases, model statistics are collected via a Markov Chain Monte Carlo (MCMC) procedure dubbed 'contrastive divergence' (CD) 4 . CD-k is a form of Gibbs sampling that initializes chains of length k from elements of the dataset. Trouble arises when the model distribution contains 'spurious' modes where the data distribution has negligible probability. In these cases, ergodicity breaks down, and mixing times become prohibitively long, frequently resulting in CD becoming biased enough to cause training to diverge 22 . Training a DBM jointly with CD has proven to be a formidable task. Even a two-layer DBM on MNIST has not seen success without some kind of modification. 6,14,18,23 Here, instead, we use mode-assisted training in the joint and unsupervised learning of DBMs. The essence of mode-assisted training is the replacement of some gradient updates in Eq. 4 with ones of the form, Where the notation [f (x)] q represents f (x mode ), evaluated at x mode , the mode of some distribution, q(x). One may employ any optimization solver to sample the mode of the above distributions. Due to its proven efficiency, here we employ a memcomputing one as reported in our previous work 20 .
The mode-assisted update can be thought of as a saddle-point approximation of an expectation 24 . This technique, also known as Laplace's method, is commonly employed to approximate integrals (expectations) of the form, The weight updates driven by the mode are incorporated in a probabilistic way, with the probability of a mode driven update following a sigmoid, starting low in the initial phases of the training and reaching a maximal value at the end of training: Here, n is the current epoch, and α, β, P max control the shape of the sigmoid. Throughout the work we set, α = 20/N (N is the total number of epochs), β = −6 and P max = 0.1. This schedule was introduced in Ref. 20, based on the empirical observation that the mode samples work best when the support has been 'discovered' by CD.
The key insight of mode assisted training is not to approximate the expectation, but rather directly prevent spurious modes from appearing in the model distribution. This allows the inexact but efficient approximation of CD-k to minimize the KL divergence (or maximize the log-likelihood) without diverging. Searching for the mode with a highly specialized optimizer results in a better trade-off between log-likelihood performance and computation time. were trained on the shifting bar data set with n v = 24 (left plot) and n v = 12 (right plot) for 200,000 and 100,000 gradient updates respectively, following a linearly decaying learning rate schedule from = 1 → 0.001. For the DBMs the hidden layer ratio was fixed at α = n h (2) /n h (1) = 0.2. Performance is shown as a function of total number of hidden nodes, n h = n h (1) + n h (2) . The solid lines are the median obtained across an ensemble of 50 networks, and the shaded regions enclose the 95th and 5th percentiles.

Results
To illuminate the effectiveness of mode assistance on DBMs, we first evaluate performance on two differently sized synthetic shifting bar data sets. In Fig. 2, the modeassisted algorithm on a two-layer DBM is compared to two baselines, CD on the same DBM and to CD on an RBM with the same number of hidden nodes. The converged log-likelihoods are reported, and results are shown as a function of the total number of hidden nodes. Overall, mode-assisted training almost always converges to more accurate densities than CD alone on a DBM, and in most cases also does better than the corresponding RBM with CD. Mode assistance is also seen to prevent the divergence sometimes seen with Gibbs sampling, as well as reducing the variance of the converged models, resulting in smaller error bars. Past a certain point, all methods expectedly incur a loss in performance as the number of hidden nodes increases for a fixed number of training iterations. Mode-assisted training suffers the least in this regard.
Although the DBMs in Fig. 2 possess the same total number of hidden nodes as the RBMs, they contain fewer total parameters. This means that a better trained DBM takes advantage of abstract features composition afforded by the depth of the network, mirroring similar gains found in deep feedforward neural networks compared to single layer perceptrons 5 .
Note, however, that this parameter efficiency in DBMs is not present when training jointly with CD. In fact, they perform systematically worse than an equivalently sized RBM in terms of log-likelihood. Mode assisted DBMs on the other hand perform as well as RBMs or better, all the while maintaining parameter efficiency.
To quantify this efficiency gain, let us consider the case of a DBM with two hidden layers. If the total number of hidden nodes is fixed, n h = n h (1) + n h (2) , then a measure of parameter efficiency compared to an RBM with the same number of nodes is captured by the parameter α = n h (2) /n h (1) . The total number of parameters (weights and biases) in an RBM with n v visible nodes and n h hidden nodes is, A DBM with the same total number of hidden nodes would have the following number of parameters n DBM = n v n h (1) + n h (1) n h (2) + n v + n h (1) + n h (2) .
Considering typical training scenarios of a large visible layer, n v n h 1, the fraction of total parameters used in the DBM compared to the RBM (the parameter efficiency) can be simplified as, The smaller e, the fewer parameters a DBM possesses with respect to an RBM with the same number of nodes. We then see that a naïve interpretation of Eq. (8) would suggest that the DBM parameter efficiency becomes more significant with increasing α, meaning the resulting DBM has more neurons in the deeper layers (fans out). However, this assumes equal log-likelihood (performance) of the two models, which is not obvious. In fact, for DBMs we expect that a redistribution of the relative number of nodes in the different hidden layers plays a significant role in their performance. This tradeoff between parameter efficiency and performance merits further investigation. This is shown in Fig. 3, where DBMs trained with and without mode assistance are compared as a function of α. For all cases, the mode-assisted algorithm performs better than its unassisted counterpart, which is nothing other than the confirmation of the results of Fig. 2. Importantly, however, we also observe that the log-likelihood performance increases as the network becomes less parameter efficient compared to an RBM (α → 0), namely when it has a fan-in topology. We find that a balance is reached around α ∼ 0.15, namely the second hidden layer has only about 15% of nodes than the first hidden layer. The message here is that for a fixed number of hidden nodes, it is best to organize the network as a fan-in DBM rather than an RBM, if one has a method to train the DBM close to capacity in the first place.
Finally, to show the substantial improvements provided by the mode-assisted training of DBMs, we compare it to the training of centered DBMs 18 and pretrained DBMs 14 using CD on the MNIST data set. In Fig. 4, we report similar trends we observed in the smaller data sets, but scaled to more dramatic results. As the number of hidden nodes increases, small mode-assisted DBMs surpass the performance of significantly larger standard DBMs, centered DBMs and eclipse the performance of much larger pre-trained DBMs. In fact, with a DBM of only 120×18 hidden nodes (and no pre-training), trained with our mode-assisted approach, we reach the same log-likelihood of a DBM with 500 × 1000 hidden nodes and pre-training -a parameter savings of two orders of magnitude. For a fair comparison, the total training iterations were set to 100 epochs in each case, and the log-likelihood results with our approach are shown as a function of total number of hidden nodes at a fixed ratio, α = 0.15. The partition function was approximated using Annealed Importance Sampling (AIS) (see Methods), identical to the other referenced results 18 . It's worth noting that AIS tends to under-approximate the partition function, leading to an over-estimate of the log-likelihood. This effect is exaggerated in larger models. Even with this negative impact, a mode-assisted DBM with a total of 138 nodes without pre-training achieves about the same performance as one with 1,500 that was pre-trained.

Discussion
We conclude by providing an intuitive understanding of the dramatic performance improvement of the modeassisted training over Gibbs sampling (CD). The standard method of pretraining DBMs initializes weights at a greedy starting point but critically ignores correlations within the system. The end result is a dramatic loss of parameter efficiency compared to mode-assisted joint training of DBMs. Even training methods like centering, with more advanced sampling techniques like parallel tempering 25 , fail to train the DBM near its capacity. The issue is the reliance on Gibbs-like sampling to explore the states of the DBM, and using only these statistics to compute the likelihood gradient. The breakdown occurs when Gibbs chains fail to explore adequately the states of the system.
Because of its random-walk nature, the local dynamics of Gibbs sampling suffers long auto-correlation times when equally likely states are separated by an extensive number of variables flips. As a consequence of long mixing times between such states, statistics become heavily biased. In the initial phase of DBM training, when weights are small and randomly distributed, these issues are minimized. Interactions between nodes in this 'high temperature' regime are weak, and are well approximated by a mean-field theory 26 . In this phase, CD works well without much bias, and is the reason mode-assistance is really not necessary early on in the training.
However, as training continues, the DBM is effectively 'cooled' as weights learn from the data and grow larger in magnitude. Correlations between nodes become significant, and can no longer be ignored or averaged over: mean-field theory is inadequate to describe this phase. Gibbs sampling can then dramatically fail in these conditions, that is why mode-assistance is primarily applied in this phase. During mode updates, information is propagated from the ground state, the most 'collective' or 'coordinated' state of the DBM, to all the weights. This prevents probability mass density from accumulating far away from the current state of the chain, keeping Gibbs sampling in its effective regime.
Aside from mode-assistance, we have discovered that the topology of a DBM plays a major role in its performance as well as its parameter efficiency. Too many hidden nodes act as a noise source during training, slowing down learning. Too few hidden nodes reduce the capacity of the network. For the network sizes considered, a 'sweet spot' was found where these two effects are in balance. At this stage, it is not clear if this is particular to DBMs, or there are deeper reasons why this is the case, and further work in this direction would be interesting.
Finally, mode training relies on an efficient search for the mode of a DBM, which is an NP-hard computational problem on its own. On this front, we employ a novel dynamical systems based optimization technique called memcomputing, which has shown great promise in efficient optimization of non-convex energy landscapes like the DBM 20,27 . Since the DBM energy landscape can be represented within the optimizing dynamics, a path exists to extend these systems to (learning) samplers. In doing so, the need for CD would be entirely eliminated. We leave these avenues of research to be explored in future work.

Methods
DBMs are powerful models in machine learning, but bring with them considerable computational burdens during evaluation. To deal with them, we follow the pro-cedure outlined in Ref. 14. First, due to additional complexity, the log-likelihood is not directly maximized as in the case of RBMs. Instead, a variational lower bound to the log-likelihood is optimized. Second, the most common variational form chosen leads to mean-field updates for the data term. Finally, for evaluation of log-likelihood lower bounds, an approximation of the partition function is necessary. We provide a short description of all these approximations which are frequently encountered in DBM training scenarios.
Likelihood Lower Bound -Since the data expectation in Eq. (4) is no longer in closed form for the DBM, datadependent statistics must be approximated with a sampling technique over the conditional distribution, p(h|v), where h = {h (1) , · · · , h ( ) }. In practice, a variational lower bound to the log-likelihood is maximized instead, which is tractable and is found to work well (as in the model term) 14,28,29 .
The variational approximation replaces the original posterior distribution, p(h|v) with an approximate distribution, r(h|v), where the parameters of r follow the gradient of the resulting lower bound on the original loglikelihood 28 , where H r (v) = − h r(h|v) log r(h|v) is the Shannon entropy of r. This variational loss simultaneously attempts to maximize the log-likelihood of the data set and minimize the KL divergence between the true conditional distribution, p and its approximation, r. Data Term -A fully factorial mean field ansatz is often used in the variational approach, r(h|v) = i r(h i |v) with r(h i = 1|v) = µ i , with µ i ∈ [0, 1] randomly initialized from a uniform distribution. Maximizing the lower bound in Eq. (9) results in updates of the form, Here J is a block matrix containing the weights of the hidden nodes and b i is the bias of the i-th hidden node. Convergence is typically fast, (in our experiments less than 30 iterations are enough) and these states are then used to calculate the data expectation in Eq. (4) during the Gibbs phase of the training. Partition Function -Computing the partition function in Eq. (2) exactly is infeasible for large DBMs. Its value is only required to evaluate the performance of the networks and appears in the log-likelihood as a normalizing constant. Annealed Importance Sampling (AIS) 30 is the procedure often used to approximate the partition function of large RBMs and DBMs. AIS estimates the ratio of partition functions, Z N /Z 0 , using a sequence of probability distributions between a chosen initial distribution and the desired one. The initial distribution, p 0 , is chosen to have an exactly known partition function and to be simple to sample from (e.g. uniform) and p N is the desired distribution whose partition function one wants to compute.
The sequence in the case of DBMs is parameterized by β k (inverse temperatures), giving p k (x) = e −β k E(x) /Z k . Markov chains are initialized uniformly according to p 0 and 0 ≤ β k ≤ 1 is slowly annealed to unity according to a desired schedule, all the while allowing the chains to run. The ratio is then approximated by the product of ratios of the unnormalized intermediate distributions, When dealing with bipartite graphs like DBMs, either all the even or all the odd layers can be analytically traced out, resulting in a tighter approximation. For a direct comparison with previous work 14, 18 , we average over an identical number of 100 AIS runs with 29,000 linearly spaced intermediate distributions.