After the previous post on energy-based models (EBMs) and the Boltzmann distribution, we will now explore a classical family of such generative models. They are known as Boltzmann machines and have played an important role in the development of neural networks. Notably, the 2024 Nobel Prize in Physics was awarded to John Hopfield and Geoffrey Hinton for their pioneering work in this area.
In this post, we will just briefly cover the basic concepts of restricted Boltzmann machines for unsupervised learning. See [Hinton, 2012; Fischer and Igel, 2012; Fischer and Igel, 2014; Salakhutdinov, 2015] for more detailed introductions.
Boltzmann machine
So-called Boltzmann machines (or stochastic Ising models) are EBMs for distributions of binary random vectors. For \(\boldsymbol{x} = (x_1, \ldots, x_M)^\top\) consisting of \(M\) binary-valued units \(x_i \in \{0, 1\}\), the energy function is given by
\[E_{\boldsymbol{\theta}}(\boldsymbol{x}) = - \boldsymbol{x}^\top \boldsymbol{W} \boldsymbol{x} - \boldsymbol{b}^\top \boldsymbol{x}.\]Here, \(\boldsymbol{\theta} = (\boldsymbol{W}, \boldsymbol{b})\) are the parameters of the model. The weights \(\boldsymbol{W} \in \mathbb{R}^{M \times M}\) form a symmetric matrix (\(w_{ij} = w_{ji}\)) with zeros on the diagonal (\(w_{ii} = 0\)). The vector \(\boldsymbol{b} \in \mathbb{R}^M\) contains bias terms.
The state vector \(\boldsymbol{x} = (\boldsymbol{v}, \boldsymbol{h})\) includes the visible units \(\boldsymbol{v}\) that correspond to observable data. It may also comprise additional auxiliary units \(\boldsymbol{h}\) that are not directly observed. These hidden nodes increase the model’s expressivity and thereby allow for richer distributions over the visible ones.
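To make the notation concrete, a minimal NumPy sketch of this energy function could look as follows (the function name and array conventions are illustrative assumptions):

```python
import numpy as np

def boltzmann_energy(x, W, b):
    """Energy E(x) = -x^T W x - b^T x of a Boltzmann machine.

    Assumes x is a binary vector of shape (M,), W is a symmetric
    (M, M) weight matrix with zero diagonal, and b is a bias vector of shape (M,).
    """
    return -x @ W @ x - b @ x
```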
Boltzmann machines can be visualized as undirected graphs in which every node is connected to every other node. The following illustration (from Wikipedia) shows such a fully connected Boltzmann machine:

Restricted Boltzmann machine
Removing the intra-group interactions within the visible nodes \(\boldsymbol{v} \in \{0, 1\}^R\) and within the hidden nodes \(\boldsymbol{h} \in \{0, 1\}^S\) from the fully connected model, while keeping the inter-group interactions, yields the restricted Boltzmann machine (RBM). Its energy function is given by
\[E_{\boldsymbol{\theta}}(\boldsymbol{v}, \boldsymbol{h}) = - \boldsymbol{v}^\top \boldsymbol{W} \boldsymbol{h} - \boldsymbol{b}^\top \boldsymbol{v} - \boldsymbol{c}^\top \boldsymbol{h}.\]The parameters \(\boldsymbol{\theta} = (\boldsymbol{W}, \boldsymbol{b}, \boldsymbol{c})\) comprise the weight matrix \(\boldsymbol{W} \in \mathbb{R}^{R \times S}\), which connects visible and hidden units, and the bias vectors \(\boldsymbol{b} \in \mathbb{R}^R\) and \(\boldsymbol{c} \in \mathbb{R}^S\).
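Analogously, a small NumPy sketch of the RBM energy might read (again, names and shapes are assumptions for illustration):

```python
import numpy as np

def rbm_energy(v, h, W, b, c):
    """Energy E(v, h) = -v^T W h - b^T v - c^T h of an RBM.

    Assumes v has shape (R,), h has shape (S,), W has shape (R, S),
    and b, c are bias vectors of shapes (R,) and (S,), respectively.
    """
    return -v @ W @ h - b @ v - c @ h
```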
It is helpful to think of the hidden and visible nodes as two organizational layers of the RBM. The following figure (again from Wikipedia) illustrates the bipartite structure with inter-layer connections but without intra-layer interactions:
As an energy-based model, the RBM has a joint probability distribution over visible and hidden units that is defined in the Boltzmann form
\[q_{\boldsymbol{\theta}}(\boldsymbol{v}, \boldsymbol{h}) = \frac{1}{Z_{\boldsymbol{\theta}}} \exp(- E_{\boldsymbol{\theta}}(\boldsymbol{v}, \boldsymbol{h})), \quad Z_{\boldsymbol{\theta}} = \sum_{\boldsymbol{v}} \sum_{\boldsymbol{h}} \exp(- E_{\boldsymbol{\theta}}(\boldsymbol{v}, \boldsymbol{h})).\]Due to the bipartite graph structure, the joint distribution \(q_{\boldsymbol{\theta}}(\boldsymbol{v}, \boldsymbol{h})\) can be marginalized and conditioned comparatively easily. For instance, a short calculation shows that marginalizing out the hidden units yields
\[q_{\boldsymbol{\theta}}(\boldsymbol{v}) = \sum_{\boldsymbol{h}} q_{\boldsymbol{\theta}}(\boldsymbol{v}, \boldsymbol{h}) = \frac{1}{Z_{\boldsymbol{\theta}}} \exp(\boldsymbol{b}^\top \boldsymbol{v}) \prod_{j=1}^S \left( 1 + \exp(\boldsymbol{v}^\top \boldsymbol{W}_{\colon, j} + c_j) \right)\]as the distribution over the visible units. The derivation exploits the fact that the sum over \(\boldsymbol{h}\) factorizes into independent sums over the two states \(h_j \in \{0, 1\}\) of each hidden unit. One also finds that the hidden variables are conditionally independent given the observables, and vice versa. Consequently, one has \(q_{\boldsymbol{\theta}}(\boldsymbol{h} \vert \boldsymbol{v}) = \prod_{j=1}^S q_{\boldsymbol{\theta}}(h_j \vert \boldsymbol{v})\) and \(q_{\boldsymbol{\theta}}(\boldsymbol{v} \vert \boldsymbol{h}) = \prod_{i=1}^R q_{\boldsymbol{\theta}}(v_i \vert \boldsymbol{h})\). Furthermore, one can show that
\[\begin{align*} q_{\boldsymbol{\theta}}(h_j = 1 \vert \boldsymbol{v}) &= \operatorname{Sigmoid}(\boldsymbol{v}^\top \boldsymbol{W}_{\colon, j} + c_j), \\ q_{\boldsymbol{\theta}}(v_i = 1 \vert \boldsymbol{h}) &= \operatorname{Sigmoid}(\boldsymbol{W}_{i, \colon} \boldsymbol{h} + b_i). \end{align*}\]Note that this familiar sigmoid-based form may serve as a post-hoc justification for calling \(\boldsymbol{W}\), \(\boldsymbol{b}\) and \(\boldsymbol{c}\) the “weights” and “biases” of a neural network.
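In code, these conditionals are just sigmoids of affine transformations. A minimal NumPy sketch, assuming the shapes from above, could look like this:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prob_h_given_v(v, W, c):
    """Vector of probabilities q(h_j = 1 | v) for all hidden units, shape (S,)."""
    return sigmoid(v @ W + c)

def prob_v_given_h(h, W, b):
    """Vector of probabilities q(v_i = 1 | h) for all visible units, shape (R,)."""
    return sigmoid(W @ h + b)
```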
Training RBMs
For model training, it is convenient to rewrite the marginal distribution over visible units in the familiar Boltzmann form \(q_{\boldsymbol{\theta}}(\boldsymbol{v}) = Z_{\boldsymbol{\theta}}^{-1} \exp(- F_{\boldsymbol{\theta}}(\boldsymbol{v}))\). To this end, one introduces the free energy
\[F_{\boldsymbol{\theta}}(\boldsymbol{v}) = - \boldsymbol{b}^\top \boldsymbol{v} - \sum_{j=1}^S \log \left( 1 + \exp(\boldsymbol{v}^\top \boldsymbol{W}_{\colon, j} + c_j) \right).\]Given a set of data points \(\{\boldsymbol{v}_1, \ldots, \boldsymbol{v}_N\}\), one can learn the parameters \(\boldsymbol{\theta}\) by maximizing the mean log-likelihood \(\frac{1}{N} \sum_{i=1}^N \log q_{\boldsymbol{\theta}}(\boldsymbol{v}_i)\). This corresponds to minimizing the expected negative log-likelihood \(\mathbb{E}_{p(\boldsymbol{v})}[- \log q_{\boldsymbol{\theta}}(\boldsymbol{v})]\), where \(p(\boldsymbol{v})\) denotes the data distribution. With \(\log q_{\boldsymbol{\theta}}(\boldsymbol{v}) = - F_{\boldsymbol{\theta}}(\boldsymbol{v}) - \log Z_{\boldsymbol{\theta}}\) and \(\nabla_{\boldsymbol{\theta}} \log Z_{\boldsymbol{\theta}} = - \mathbb{E}_{q_{\boldsymbol{\theta}}(\boldsymbol{v})}[\nabla_{\boldsymbol{\theta}} F_{\boldsymbol{\theta}}(\boldsymbol{v})]\) one finds
\[\nabla_{\boldsymbol{\theta}} \mathbb{E}_{p(\boldsymbol{v})}[- \log q_{\boldsymbol{\theta}}(\boldsymbol{v})] = \mathbb{E}_{p(\boldsymbol{v})}[\nabla_{\boldsymbol{\theta}} F_{\boldsymbol{\theta}}(\boldsymbol{v})] - \mathbb{E}_{q_{\boldsymbol{\theta}}(\boldsymbol{v})}[\nabla_{\boldsymbol{\theta}} F_{\boldsymbol{\theta}}(\boldsymbol{v})]\]for the gradient of the objective. This is analogous to the expression for general EBMs in this post. The first term on the right-hand side is an expectation over the data distribution \(p(\boldsymbol{v})\). For the given training data, it can be approximated by the sample mean
\[\mathbb{E}_{p(\boldsymbol{v})}[\nabla_{\boldsymbol{\theta}} F_{\boldsymbol{\theta}}(\boldsymbol{v})] \approx \frac{1}{N} \sum_{i=1}^N \nabla_{\boldsymbol{\theta}} F_{\boldsymbol{\theta}}(\boldsymbol{v}_i).\]The second term of the log-likelihood gradient is more challenging, as it involves an intractable expectation over the marginal model distribution \(q_{\boldsymbol{\theta}}(\boldsymbol{v})\). One usually resorts to drawing samples \(\boldsymbol{v}_i^\prime \sim q_{\boldsymbol{\theta}}(\boldsymbol{v}_i^\prime)\) and approximating the expectation as
\[\mathbb{E}_{q_{\boldsymbol{\theta}}(\boldsymbol{v})}[\nabla_{\boldsymbol{\theta}} F_{\boldsymbol{\theta}}(\boldsymbol{v})] \approx \frac{1}{N^\prime} \sum_{i=1}^{N^\prime} \nabla_{\boldsymbol{\theta}} F_{\boldsymbol{\theta}}(\boldsymbol{v}_i^\prime).\]Such samples can be generated via Markov chain Monte Carlo (MCMC) methods. Because the units within each layer are conditionally independent given the other layer, the RBM structure lends itself naturally to blockwise Gibbs sampling, in which one alternates between sampling from \(q_{\boldsymbol{\theta}}(\boldsymbol{h} \vert \boldsymbol{v})\) and \(q_{\boldsymbol{\theta}}(\boldsymbol{v} \vert \boldsymbol{h})\). Asymptotically, this scheme produces samples from the joint distribution and, in turn, from the marginal over the visible units.
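To make the sampling scheme concrete, here is a minimal NumPy sketch of the free energy and of one blockwise Gibbs step; the function names, array shapes and random-number handling are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def free_energy(v, W, b, c):
    """F(v) = -b^T v - sum_j log(1 + exp(v^T W[:, j] + c_j)).

    np.logaddexp(0, z) evaluates log(1 + exp(z)) without overflow.
    """
    return -b @ v - np.sum(np.logaddexp(0.0, v @ W + c))

def gibbs_step(v, W, b, c):
    """One blockwise Gibbs sweep: sample h ~ q(h | v), then v' ~ q(v' | h)."""
    p_h = sigmoid(v @ W + c)                              # q(h_j = 1 | v), shape (S,)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(W @ h + b)                              # q(v_i = 1 | h), shape (R,)
    v_new = (rng.random(p_v.shape) < p_v).astype(float)
    return v_new, h
```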
The Gibbs sampling process is straightforward to implement due to the simple sigmoid-based form of the conditional distributions. To reduce the computational cost of running long chains, one may stop after only a few alternation steps, or even after a single one.
Natural candidates for initializing the MCMC chains are the data points \(\boldsymbol{v}_i\) from the training set. The resulting strategy is commonly referred to as contrastive divergence (CD); running the chain for \(k\) Gibbs steps is known as CD-\(k\). When starting a short Markov chain from each data point, the gradient of the negative log-likelihood is approximated by
\[\nabla_{\boldsymbol{\theta}} \mathbb{E}_{p(\boldsymbol{v})}[- \log q_{\boldsymbol{\theta}}(\boldsymbol{v})] \approx \frac{1}{N} \sum_{i=1}^N \left( \nabla_{\boldsymbol{\theta}} F_{\boldsymbol{\theta}}(\boldsymbol{v}_i) - \nabla_{\boldsymbol{\theta}} F_{\boldsymbol{\theta}}(\boldsymbol{v}_i^\prime) \right).\]After the RBM has been trained, one can use it for various tasks such as synthetic data generation, feature extraction and dimensionality reduction. These topics may be covered in a future post.
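To wrap up, here is a rough NumPy sketch of how the pieces above fit together in a single CD-\(k\) parameter update for a mini-batch of data. It uses the analytic gradients of the free energy, \(\nabla_{\boldsymbol{W}} F_{\boldsymbol{\theta}}(\boldsymbol{v}) = - \boldsymbol{v} \boldsymbol{p}^\top\), \(\nabla_{\boldsymbol{b}} F_{\boldsymbol{\theta}}(\boldsymbol{v}) = - \boldsymbol{v}\) and \(\nabla_{\boldsymbol{c}} F_{\boldsymbol{\theta}}(\boldsymbol{v}) = - \boldsymbol{p}\) with \(p_j = q_{\boldsymbol{\theta}}(h_j = 1 \vert \boldsymbol{v})\); all function names, shapes and hyperparameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd_k_update(V, W, b, c, lr=0.01, k=1):
    """One contrastive divergence (CD-k) update for a data batch V of shape (N, R).

    Positive statistics come from the data, negative statistics from
    k blockwise Gibbs steps started at the data points.
    """
    N = V.shape[0]

    # Positive phase: hidden activation probabilities for the data.
    p_h_data = sigmoid(V @ W + c)                         # shape (N, S)

    # Negative phase: k Gibbs sweeps initialized at the data points.
    Vk = V
    for _ in range(k):
        p_h = sigmoid(Vk @ W + c)
        H = (rng.random(p_h.shape) < p_h).astype(float)
        p_v = sigmoid(H @ W.T + b)
        Vk = (rng.random(p_v.shape) < p_v).astype(float)
    p_h_model = sigmoid(Vk @ W + c)

    # Gradient step: positive minus negative statistics, averaged over the batch.
    W = W + lr * (V.T @ p_h_data - Vk.T @ p_h_model) / N
    b = b + lr * (V - Vk).mean(axis=0)
    c = c + lr * (p_h_data - p_h_model).mean(axis=0)
    return W, b, c
```

In practice, such an update would be repeated over many mini-batches, possibly with refinements such as momentum and weight decay as discussed in [Hinton, 2012].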