This short note gives an introduction to Gaussian process modeling. Gaussian processes enable, for example, regression with built-in uncertainty quantification.
Gaussian processes
A Gaussian process (GP) is a continuously indexed collection of random variables, such that every finite subset thereof follows a multivariate normal (MVN) distribution. Let \(\boldsymbol{x} \in \mathbb{R}^d\) represent a \(d\)-dimensional continuous index, e.g. a spatial coordinate. A GP \(\{f(\boldsymbol{x}) \, \vert \, \boldsymbol{x} \in \mathbb{R}^d\}\) is then defined by a mean function \(m(\boldsymbol{x})\) and a covariance function \(k(\boldsymbol{x}, \boldsymbol{x}^\prime)\). One usually writes
\[f(\boldsymbol{x}) \sim \mathcal{GP} \left( m(\boldsymbol{x}), k(\boldsymbol{x}, \boldsymbol{x}^\prime) \right).\]For every finite collection of indices \(\{\boldsymbol{x}_i\}_{i=1}^n\), the vector \(\boldsymbol{f} = (f(\boldsymbol{x}_1), \ldots, f(\boldsymbol{x}_n))^\top \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\) is jointly normal. Here, the elements of the mean vector and the covariance matrix are given as \(\mu_i = m(\boldsymbol{x}_i)\) and \(\Sigma_{ij} = k(\boldsymbol{x}_i, \boldsymbol{x}_j)\) for \(i,j=1,\ldots,n\), respectively.
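As an illustration of this finite-dimensional view, the following sketch evaluates the mean vector and covariance matrix on a grid of one-dimensional indices and draws a joint sample with NumPy. The zero mean function and the unit-parameter squared exponential kernel (introduced below) are illustrative assumptions, not canonical choices.

```python
import numpy as np

def m(x):
    # illustrative zero mean function
    return np.zeros(len(x))

def k(x, x_prime):
    # illustrative squared exponential kernel with sigma = ell = 1
    return np.exp(-0.5 * (x[:, None] - x_prime[None, :]) ** 2)

x = np.linspace(0.0, 1.0, 50)    # finite collection of 1d indices
mu = m(x)                        # mean vector with mu_i = m(x_i)
Sigma = k(x, x)                  # covariance matrix with Sigma_ij = k(x_i, x_j)
Sigma += 1e-10 * np.eye(len(x))  # small jitter for numerical stability

# since every finite subset of a GP is MVN, sampling reduces to an MVN draw
f = np.random.default_rng(0).multivariate_normal(mu, Sigma)
```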
In a certain sense, a GP is a generalization of an MVN distribution to infinitely many dimensions. It can be viewed as a distribution over functions, which allows for quantifying the uncertainty of an unknown function in a Bayesian setting.
The properties of a GP heavily depend on the chosen covariance kernel. Some of the most common kernel families are the squared exponential (radial basis function), the absolute exponential (Ornstein-Uhlenbeck) and the Matérn kernels. The first two can be written as
\[\begin{align*} k_{\mathrm{RBF}}(\boldsymbol{x}, \boldsymbol{x}^\prime) &= \sigma^2 \exp \left( - \frac{\lVert \boldsymbol{x} - \boldsymbol{x}^\prime \rVert_2^2}{2 \ell^2} \right), \\ k_{\mathrm{OU}}(\boldsymbol{x}, \boldsymbol{x}^\prime) &= \sigma^2 \exp \left( - \frac{\lVert \boldsymbol{x} - \boldsymbol{x}^\prime \rVert_2}{\ell} \right). \end{align*}\]They are isotropic in the sense that they are a function of the distance \(\lVert \boldsymbol{x} - \boldsymbol{x}^\prime \rVert_2\) only. Moreover, they contain a variance \(\sigma^2 > 0\) and a lengthscale parameter \(\ell > 0\).
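For concreteness, the two kernels might be implemented as follows. This is a sketch that assumes the inputs are stacked row-wise into arrays of shape (n, d) and (m, d); the function names and parameter defaults are arbitrary.

```python
import numpy as np
from scipy.spatial.distance import cdist

def k_rbf(X, X_prime, sigma=1.0, ell=1.0):
    # squared exponential (RBF) kernel based on squared Euclidean distances
    sq_dists = cdist(X, X_prime, metric="sqeuclidean")
    return sigma**2 * np.exp(-0.5 * sq_dists / ell**2)

def k_ou(X, X_prime, sigma=1.0, ell=1.0):
    # absolute exponential (Ornstein-Uhlenbeck) kernel based on Euclidean distances
    dists = cdist(X, X_prime, metric="euclidean")
    return sigma**2 * np.exp(-dists / ell)
```

The choice between them matters in practice: the squared exponential kernel yields very smooth sample paths, whereas the Ornstein-Uhlenbeck kernel produces continuous but rough, nowhere differentiable ones.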
GP regression
An interesting application of GPs is as nonparametric models for supervised learning. When considering regression problems, this is usually referred to as GP regression (GPR) or, less commonly, as Kriging. A brief introduction to this subject is given below.
One sometimes distinguishes between noise-free and noisy observations. The former scenario assumes that the dataset \(\{(\boldsymbol{x}_i, y_i)\}_{i=1}^N\) contains noise-free measurements \(y_i = f(\boldsymbol{x}_i)\) of function values at various locations \(\boldsymbol{x}_i\). The joint distribution of the GP at those data locations and at other locations can be written as
\[\begin{pmatrix} \boldsymbol{f} \\ \boldsymbol{g} \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} \boldsymbol{\mu}_{\boldsymbol{f}} \\ \boldsymbol{\mu}_{\boldsymbol{g}} \end{pmatrix}, \begin{pmatrix} \boldsymbol{\Sigma}_{\boldsymbol{f} \boldsymbol{f}} & \boldsymbol{\Sigma}_{\boldsymbol{f} \boldsymbol{g}} \\ \boldsymbol{\Sigma}_{\boldsymbol{g} \boldsymbol{f}} & \boldsymbol{\Sigma}_{\boldsymbol{g} \boldsymbol{g}} \end{pmatrix} \right).\]Here, \(\boldsymbol{f} = (f(\boldsymbol{x}_1), \ldots, f(\boldsymbol{x}_N))^\top\) are the random variables that are measured, whereas \(\boldsymbol{g} = (f(\boldsymbol{x}^\star_1), \ldots, f(\boldsymbol{x}^\star_M))^\top\) denotes the unobserved random variables at some test locations \(\{\boldsymbol{x}^\star_i\}_{i=1}^M\). The corresponding marginals of this prior model are \(\boldsymbol{f} \sim \mathcal{N}(\boldsymbol{\mu}_{\boldsymbol{f}}, \boldsymbol{\Sigma}_{\boldsymbol{f} \boldsymbol{f}})\) and \(\boldsymbol{g} \sim \mathcal{N}(\boldsymbol{\mu}_{\boldsymbol{g}}, \boldsymbol{\Sigma}_{\boldsymbol{g} \boldsymbol{g}})\). Given realizations of the observed variables, one can obtain the conditional distribution of the unobserved variables. Such posterior predictions are given as
\[\begin{align*} \boldsymbol{g} \vert \boldsymbol{f} &\sim \mathcal{N}(\boldsymbol{\mu}_{\boldsymbol{g} \vert \boldsymbol{f}}, \boldsymbol{\Sigma}_{\boldsymbol{g} \vert \boldsymbol{f}}), \\ \boldsymbol{\mu}_{\boldsymbol{g} \vert \boldsymbol{f}} &= \boldsymbol{\mu}_{\boldsymbol{g}} + \boldsymbol{\Sigma}_{\boldsymbol{g} \boldsymbol{f}} \boldsymbol{\Sigma}_{\boldsymbol{f} \boldsymbol{f}}^{-1} (\boldsymbol{f} - \boldsymbol{\mu}_{\boldsymbol{f}}), \\ \boldsymbol{\Sigma}_{\boldsymbol{g} \vert \boldsymbol{f}} &= \boldsymbol{\Sigma}_{\boldsymbol{g} \boldsymbol{g}} - \boldsymbol{\Sigma}_{\boldsymbol{g} \boldsymbol{f}} \boldsymbol{\Sigma}_{\boldsymbol{f} \boldsymbol{f}}^{-1} \boldsymbol{\Sigma}_{\boldsymbol{f} \boldsymbol{g}}. \end{align*}\]
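These formulas translate directly into code. The following is a minimal sketch of noise-free conditioning, assuming a zero prior mean \(\boldsymbol{\mu}_{\boldsymbol{f}} = \boldsymbol{\mu}_{\boldsymbol{g}} = \boldsymbol{0}\) and a kernel function k as sketched above; linear systems are solved instead of forming \(\boldsymbol{\Sigma}_{\boldsymbol{f} \boldsymbol{f}}^{-1}\) explicitly.

```python
import numpy as np

def condition_noise_free(k, x_train, f_train, x_test):
    # covariance blocks of the joint prior
    Sigma_ff = k(x_train, x_train)
    Sigma_gf = k(x_test, x_train)
    Sigma_gg = k(x_test, x_test)

    # in practice a small jitter on Sigma_ff may be needed (cf. the nugget term below)
    mu_post = Sigma_gf @ np.linalg.solve(Sigma_ff, f_train)
    Sigma_post = Sigma_gg - Sigma_gf @ np.linalg.solve(Sigma_ff, Sigma_gf.T)
    return mu_post, Sigma_post
```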
The standard model for noisy data assumes that the measurements \(y(\boldsymbol{x}_i) = f(\boldsymbol{x}_i) + \epsilon_i\) are subject to Gaussian noise \(\epsilon_i \sim \mathcal{N}(0, \sigma_\epsilon^2)\). Assuming that the noise terms are independent of each other and of the GP, such that \(\boldsymbol{y} \vert \boldsymbol{f} \sim \mathcal{N}(\boldsymbol{f}, \sigma_\epsilon^2 \boldsymbol{I})\) and \(\boldsymbol{y} \sim \mathcal{N}(\boldsymbol{\mu}_{\boldsymbol{f}}, \boldsymbol{\Sigma}_{\boldsymbol{f} \boldsymbol{f}} + \sigma_\epsilon^2 \boldsymbol{I})\), the prior model for the observed data \(\boldsymbol{y}\) and the unobserved function values \(\boldsymbol{g}\) simply is
\[\begin{pmatrix} \boldsymbol{y} \\ \boldsymbol{g} \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} \boldsymbol{\mu}_{\boldsymbol{f}} \\ \boldsymbol{\mu}_{\boldsymbol{g}} \end{pmatrix}, \begin{pmatrix} \boldsymbol{\Sigma}_{\boldsymbol{f} \boldsymbol{f}} + \sigma_\epsilon^2 \boldsymbol{I} & \boldsymbol{\Sigma}_{\boldsymbol{f} \boldsymbol{g}} \\ \boldsymbol{\Sigma}_{\boldsymbol{g} \boldsymbol{f}} & \boldsymbol{\Sigma}_{\boldsymbol{g} \boldsymbol{g}} \end{pmatrix} \right).\]It is remarked that this treatment of noise is very similar to adding a white noise component to the GP kernel, which would impact the diagonal entries of both \(\boldsymbol{\Sigma}_{\boldsymbol{f} \boldsymbol{f}}\) and \(\boldsymbol{\Sigma}_{\boldsymbol{g} \boldsymbol{g}}\). Such a nugget term is sometimes introduced for numerical stability. Conditioning on the measurements once again yields the posterior predictions
\[\begin{align*} \boldsymbol{g} \vert \boldsymbol{y} &\sim \mathcal{N}(\boldsymbol{\mu}_{\boldsymbol{g} \vert \boldsymbol{y}}, \boldsymbol{\Sigma}_{\boldsymbol{g} \vert \boldsymbol{y}}), \\ \boldsymbol{\mu}_{\boldsymbol{g} \vert \boldsymbol{y}} &= \boldsymbol{\mu}_{\boldsymbol{g}} + \boldsymbol{\Sigma}_{\boldsymbol{g} \boldsymbol{f}} \left( \boldsymbol{\Sigma}_{\boldsymbol{f} \boldsymbol{f}} + \sigma_\epsilon^2 \boldsymbol{I} \right)^{-1} \left( \boldsymbol{y} - \boldsymbol{\mu}_{\boldsymbol{f}} \right), \\ \boldsymbol{\Sigma}_{\boldsymbol{g} \vert \boldsymbol{y}} &= \boldsymbol{\Sigma}_{\boldsymbol{g} \boldsymbol{g}} - \boldsymbol{\Sigma}_{\boldsymbol{g} \boldsymbol{f}} \left( \boldsymbol{\Sigma}_{\boldsymbol{f} \boldsymbol{f}} + \sigma_\epsilon^2 \boldsymbol{I} \right)^{-1} \boldsymbol{\Sigma}_{\boldsymbol{f} \boldsymbol{g}}. \end{align*}\]
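A corresponding sketch for the noisy case is given below. Instead of an explicit inverse it uses a Cholesky factorization of \(\boldsymbol{\Sigma}_{\boldsymbol{f} \boldsymbol{f}} + \sigma_\epsilon^2 \boldsymbol{I}\), which is numerically preferable; a zero prior mean is again an assumption.

```python
import numpy as np

def gpr_predict(k, x_train, y_train, x_test, sigma_eps):
    # noisy covariance of the observations and cross-covariances
    K_yy = k(x_train, x_train) + sigma_eps**2 * np.eye(len(x_train))
    K_gf = k(x_test, x_train)
    K_gg = k(x_test, x_test)

    # Cholesky factor L with L @ L.T == K_yy
    L = np.linalg.cholesky(K_yy)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))  # K_yy^{-1} y
    V = np.linalg.solve(L, K_gf.T)                             # L^{-1} K_fg

    mu_post = K_gf @ alpha           # posterior mean
    Sigma_post = K_gg - V.T @ V      # posterior covariance
    return mu_post, Sigma_post
```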
It is often difficult to specify all hyperparameters of a GP prior on an ad hoc basis. One may therefore resort to model selection or hyperparameter optimization in order to find “good” values. In the present purely Gaussian context, one can for instance maximize (the logarithm of) the marginal likelihood as a function of the hyperparameters. Let us collect the parameters of the covariance \(\boldsymbol{\Sigma}_{\boldsymbol{f} \boldsymbol{f}}(\ell, \sigma)\) and the noise level \(\sigma_\epsilon\) into \(\boldsymbol{\theta} = (\ell, \sigma, \sigma_\epsilon)\). The likelihood as a function of \(\boldsymbol{\theta}\) is then explicitly given as
\[p_{\boldsymbol{\theta}}(\boldsymbol{y}) = \int p_{\sigma_\epsilon}(\boldsymbol{y} \vert \boldsymbol{f}) \, p_{\ell, \sigma}(\boldsymbol{f}) \, \mathrm{d} \boldsymbol{f} = \mathcal{N}(\boldsymbol{y} \vert \boldsymbol{\mu}_{\boldsymbol{f}}, \boldsymbol{\Sigma}_{\boldsymbol{f} \boldsymbol{f}}(\ell, \sigma) + \sigma_\epsilon^2 \boldsymbol{I}).\]
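The negative logarithm of this expression could then be minimized numerically, as sketched below. The optimization in log space, which automatically enforces the positivity constraints on \(\ell\), \(\sigma\) and \(\sigma_\epsilon\), is one possible choice among others; k_rbf refers to the kernel sketch above and a zero prior mean is assumed.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_theta, x_train, y_train):
    ell, sigma, sigma_eps = np.exp(log_theta)
    N = len(x_train)
    K = k_rbf(x_train, x_train, sigma=sigma, ell=ell) + sigma_eps**2 * np.eye(N)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))  # K^{-1} y
    # -log N(y | 0, K) with log det(K) = 2 * sum(log(diag(L)))
    return (0.5 * y_train @ alpha
            + np.sum(np.log(np.diag(L)))
            + 0.5 * N * np.log(2.0 * np.pi))

# hypothetical usage for some data (x_train, y_train):
# result = minimize(neg_log_marginal_likelihood, x0=np.zeros(3),
#                   args=(x_train, y_train), method="L-BFGS-B")
# ell_opt, sigma_opt, sigma_eps_opt = np.exp(result.x)
```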