Quick introduction to gaussian mixture models with python

Usually we like to model probability distributions with gaussian distributions. Not only are they the maximum entropy distributions if we only know the mean and variance of a dataset, the central limit theorem guarantees that random variables which are the result of summing many different random variables will be gaussian distributed too. But what to do when we have multimodal distributions like this one?

A gaussian mixture with 2 dimensional features
A gaussian mixture with 2 dimensional features

A gaussian distribution would not represent this very well. So what’s the next best thing? Add another gaussian! A gaussian mixture model is defined by a sum of gaussians $$P(x)=\sum_i w_i \, \mathcal{G}(\mu_i, \Sigma_i)$$ with means $\mu$ and covariance matrices $\Sigma$.

Result of the above code. As we wanted, the individual samples follow the desired probability distribution.
The above gaussian mixture can be represented as a contour plot. Note this is the same distribution we sampled from in the metropolis tutorial.

By fitting a bunch of data points to a gaussian mixture model we can then access the means and covariances of the individual modes of the probability distribution. These modes are a good way of clustering the data points into similar groups. After the fit, we can even check out to which mode each data point belongs the best by calculating the maximum of its class probability: $$P(i|x) = \frac{w_i\, \mathcal{G}(\mu_i, \Sigma_i)}{\sum_k w_k \, \mathcal{G}(\mu_k, \Sigma_k)}$$

So how do we obtain a gaussian mixture fit to a set of data points? The simplest way is to use the maximum a priori estimate to find the set of parameters which best describe the points, i.e. $$\underset{\mathbf{\mu}, \mathbf{\Sigma}, \mathbf{w}}{\operatorname{argmax}} P(\mathbf{\mu}, \mathbf{\Sigma}, \mathbf{w}|x) = \underset{\mathbf{\mu}, \mathbf{\Sigma}, \mathbf{w}}{\operatorname{argmax}} P(x|\mathbf{\mu}, \mathbf{\Sigma}, \mathbf{w}) P(\mathbf{\mu}, \mathbf{\Sigma}, \mathbf{w}) $$

If we don’t have a prior (i.e. it is constant), this is just the maximum likelihood estimate: $$\underset{\mathbf{\mu}, \mathbf{\Sigma}, \mathbf{w}}{\operatorname{argmax}} P(x|\mathbf{\mu}, \mathbf{\Sigma}, \mathbf{w})$$

There is a straightforward iterative method to obtain these estimates, the expectation-maximization algorithm. It consists of an expectation step, which calculates the likelihood for the current parameter set averaged over any latent variables in the model; and a maximization step which maximizes the parameters w.r.t. that likelihood. In this case, we consider the mode assignment $i$ to be a discrete latent variable. So intuitively, first we calculate how strongly to which mode each data point “belongs”, sometimes called the responsibility (expectation step). Knowing that, we calculate what each mode’s mean and covariance should be given the various responsibilities (maximization step). This will now change the responsibilities, so we go back to the expectation step.

Mathematically, in the expectation step we need to calculate $P(i|x, \mathbf{\mu}, \mathbf{\Sigma})$ for all points which allows us to calculate the expected likelihood $$Q = \sum_i P(i|x, \mathbf{\mu}, \mathbf{\Sigma}) P(x|\mathbf{\mu}, \mathbf{\Sigma})$$

Then we can easily find $\mathbf{\mu}, \mathbf{\Sigma}$ which maximize $Q$ and then iterate.

We don’t need to worry too much about the implementation of this algorithm here, since we can use the excellent scikit-learn to do this in python. Let’s take a look at some code which sets up a gaussian mixture model and fits it to a data set:


The result of fitting a GMM to data
The result of fitting a GMM to data

One difficulty when using real data is that we don’t know how many components to use for the GMM, so we run the risk of over or underfitting the data. You might need to use another optimization loop on the number of components, using the BIC or the AIC as a cost function.

Below is the full code used to produce the images in this post.