Herrick Fang

Gaussian Processes For Data Discovery

  • Created: 18 Dec 2017
  • Modified: 19 Mar 2019

This is an excellent tool for pattern discovery!!

Prerequisites

  • Understand Multivariate Normals
  • Covariance Matrices and Kernel Functions

Introduction

Here, I am taking some notes on Gaussian Processes from Gaussian Processes for Machine Learning by Rasmussen and Williams.

We can model the correlation of data using Gaussian Processes. Essentially, we assume that for some fixed inputs, we can estimate an output with a general mean and variance. R&W say that we can interpret a Gaussian Process (GP) from two perspectives: a function-space view, in which we place a distribution over functions and perform inference directly in that space of functions, and a weight-space view, in which we work with distributions over the weights of a parametric model.

Weight-Space

To interpret a GP in the weight-space view, we consider Bayesian Linear Regression. We see that we can estimate the weights using Bayes’ rule

\[p(\mathbf{w}|\mathbf{y},X) = \frac{p(\mathbf{y}|X,\mathbf{w})p(\mathbf{w})}{p(\mathbf{y}|X)}\]

where the likelihood, prior, and marginal likelihood (normalization constant) are given by

\[\begin{align*} p(\mathbf{y}|X,\mathbf{w}) & = \mathcal{N}(X^T\mathbf{w}, \sigma_n^2 I) \\ p(\mathbf{w}) & = \mathcal{N}(0, \Sigma_p) \\ p(\mathbf{y}|X) & = \int p(\mathbf{y} | X, \mathbf{w}) p(\mathbf{w}) d\mathbf{w} \end{align*}\]

Completing the square in \(\mathbf{w}\), this simplifies to

\[p(\mathbf{w}|\mathbf{y}, X) = \mathcal{N}\left(\bar{\mathbf{w}}, A^{-1}\right), \qquad \bar{\mathbf{w}} = \frac{1}{\sigma_n^2}A^{-1}X\mathbf{y}\]

where \(A = \sigma_n^{-2}X X^{\top} + \Sigma_p^{-1}\). The marginal likelihood does not depend on \(\mathbf{w}\), so it acts only as a normalization constant and can be ignored.
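
To make the posterior concrete, here is a minimal NumPy sketch of these formulas. The function name `weight_posterior` and the column-wise design-matrix convention (inputs stored as columns of X, as in R&W) are my own choices for illustration.

```python
import numpy as np

def weight_posterior(X, y, sigma_n, Sigma_p):
    """Posterior over weights for Bayesian linear regression.

    X : (d, n) design matrix with inputs as columns, y : (n,) targets,
    sigma_n : observation noise std, Sigma_p : (d, d) prior covariance.
    Returns the posterior mean w_bar and covariance A^{-1}.
    """
    A = X @ X.T / sigma_n**2 + np.linalg.inv(Sigma_p)
    A_inv = np.linalg.inv(A)
    w_bar = A_inv @ X @ y / sigma_n**2
    return w_bar, A_inv
```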

We can make predictions from the data via MAP estimation, in which we predict the mode of the posterior. Then, to infer on a test case, we average over all parameter values, so that our predictive distribution is given by

\[p(f_{*}| \mathbf{x}_{*}, X, \mathbf{y}) = \int p(f_{*}|\mathbf{x}_{*}, \mathbf{w})p(\mathbf{w}|X,\mathbf{y})d\mathbf{w} = \mathcal{N}\left(\frac{1}{\sigma_n^2}\mathbf{x}_{*}^{\top} A^{-1} X \mathbf{y}, \;\mathbf{x}_{*}^{\top}A^{-1}\mathbf{x}_{*}\right)\]
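
Continuing the sketch above, the predictive mean and variance at a single test input fall out of the same quantities. The function `predict` and its argument layout are again illustrative assumptions.

```python
def predict(x_star, X, y, sigma_n, Sigma_p):
    """Predictive mean and variance of f_* at a single test input x_star (d,)."""
    w_bar, A_inv = weight_posterior(X, y, sigma_n, Sigma_p)
    mean = x_star @ w_bar            # equals x_star^T A^{-1} X y / sigma_n^2
    var = x_star @ A_inv @ x_star    # predictive variance of f_*
    return mean, var
```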

To add non-linearity to our model, we can project our inputs into a feature space by applying some function \(\phi(\mathbf{x})\) to our input data. This leads to the kernel trick: we rewrite our mapping as \(k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x}) \cdot \phi(\mathbf{x}')\), which lets us work directly with kernels to produce our predictions.
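
As a toy illustration of the kernel trick (my own example, not from R&W): for a scalar input, the explicit quadratic feature map \(\phi(x) = (1, \sqrt{2}x, x^2)\) has the same inner product as the polynomial kernel \(k(x, x') = (1 + x x')^2\), so we never need to form the features explicitly.

```python
import numpy as np

def phi(x):
    # Explicit quadratic feature map for a scalar input (illustrative choice).
    return np.array([1.0, np.sqrt(2) * x, x**2])

def k_poly(x, x_prime):
    # Equivalent kernel: evaluates phi(x) . phi(x') without forming the features.
    return (1.0 + x * x_prime) ** 2

x, x_prime = 0.7, -1.3
assert np.isclose(phi(x) @ phi(x_prime), k_poly(x, x_prime))
```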

Function Space View

Definition 1   A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

We typically write a Gaussian process as

\[f(x) \sim \mathcal{G}\mathcal{P}(m(x), k(x,x'))\]

where the mean and the covariance function are denoted by

\[\begin{align*} m(\mathbf{x}) & = \mathbb{E}[f(\mathbf{x})]\\ k(\mathbf{x}, \mathbf{x}') & = \mathbb{E}[(f(\mathbf{x}) - m(\mathbf{x}))(f(\mathbf{x}') - m(\mathbf{x}'))] \end{align*}\]

In practice, we work with the joint Gaussian distribution over a finite collection of inputs, such that

\[[ f(x_{1}), f(x_{2}), \ldots, f(x_N) ]^{\top} \sim \mathcal{N}(\mathbf{\mu}, \mathbf{K})\]

where the properties are determined by the kernel function. Hence, we use conditioning and marginalization of the joint Gaussian to obtain the desired parameters and improve our estimate.
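
To sketch the conditioning step, assume a zero-mean prior, i.i.d. Gaussian observation noise with standard deviation \(\sigma_n\), and a kernel function k that returns the matrix of pairwise evaluations between two arrays of inputs. The function below and its signature are my own choices for illustration.

```python
import numpy as np

def gp_posterior(k, X_train, y_train, X_test, sigma_n=0.1):
    """Condition a zero-mean GP with kernel k on noisy observations y_train.

    Returns the posterior mean and covariance of f at the test inputs X_test.
    """
    K = k(X_train, X_train) + sigma_n**2 * np.eye(len(X_train))
    K_s = k(X_train, X_test)
    K_ss = k(X_test, X_test)
    # Standard Gaussian conditioning; solve instead of invert for stability.
    mean = K_s.T @ np.linalg.solve(K, y_train)
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, cov
```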

RBF Kernel in Practice
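
As a minimal sketch of the RBF (squared-exponential) kernel in practice, the snippet below defines the kernel for 1-D inputs and reuses the `gp_posterior` function from the previous section on toy data. The sine-curve data and hyperparameter values are arbitrary choices for illustration.

```python
import numpy as np

def rbf(XA, XB, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel: k(x, x') = s^2 * exp(-(x - x')^2 / (2 l^2)).
    d2 = (np.asarray(XA)[:, None] - np.asarray(XB)[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

# Toy data: noisy samples of a sine curve.
rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=10)
y_train = np.sin(X_train) + 0.1 * rng.standard_normal(10)
X_test = np.linspace(-3, 3, 100)

mean, cov = gp_posterior(rbf, X_train, y_train, X_test, sigma_n=0.1)
# Pointwise posterior std; clip tiny negative values from round-off.
std = np.sqrt(np.maximum(np.diag(cov), 0.0))
```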

Sources: Gaussian Processes for Machine Learning by Rasmussen and Williams