Dimensionality reduction 101: linear algebra, hidden variables and generative models

Suppose you are faced with a high dimensional dataset and want to find some structure in the data: often only a few underlying causes are at play, but they generate many different data points through noise corruption. How can we infer these causes? Here I’m going to cover the simplest method to do this inference: we will assume the data is generated by a linear transformation of the hidden causes. In this case, it is quite simple to recover the parameters of the transformation and thereby determine the hidden (or latent) variables that represent the causes.
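
As a minimal sketch of this generative picture (my own illustration, not the post’s code; the dimensions and noise level are arbitrary), we can draw a few latent causes, push them through a linear map, add noise, and then recover the latent subspace with an SVD:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_latent, n_observed = 500, 2, 10

# Hidden causes: a few latent variables per data point.
z = rng.normal(size=(n_samples, n_latent))

# Linear generative model: x = z A + noise.
A = rng.normal(size=(n_latent, n_observed))
x = z @ A + 0.1 * rng.normal(size=(n_samples, n_observed))

# Recover the latent subspace from the data alone via an SVD
# (equivalent to PCA once the data is centered).
x_centered = x - x.mean(axis=0)
U, s, Vt = np.linalg.svd(x_centered, full_matrices=False)
latent_estimate = x_centered @ Vt[:n_latent].T  # projections onto top directions
```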

Continue reading “Dimensionality reduction 101: linear algebra, hidden variables and generative models”

Quick introduction to gaussian mixture models with python

Usually we like to model probability distributions with gaussian distributions. Not only are they the maximum entropy distribution when we know only a dataset’s mean and variance, but the central limit theorem also guarantees that random variables which are the result of summing many different random variables will be approximately gaussian distributed. But what to do when we have multimodal distributions like this one?

A gaussian mixture with 2 dimensional features

A single gaussian would not represent this very well. So what’s the next best thing? Add another gaussian! A gaussian mixture model is defined by a sum of gaussians $$P(x)=\sum_i w_i \, \mathcal{G}(x;\mu_i, \Sigma_i)$$ with means $\mu_i$, covariance matrices $\Sigma_i$, and weights $w_i$ which sum to one. Continue reading “Quick introduction to gaussian mixture models with python”
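
As a quick hedged sketch of fitting such a model in python (using scikit-learn’s GaussianMixture here, which may differ from the post’s own code; the toy data is made up):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy bimodal data: two gaussian clusters in 2 dimensions.
X = np.vstack([
    rng.normal([-2, 0], 0.8, size=(300, 2)),
    rng.normal([2, 1], 0.5, size=(200, 2)),
])

# Fit a 2-component mixture by expectation-maximization.
gmm = GaussianMixture(n_components=2).fit(X)
print(gmm.weights_)      # the w_i
print(gmm.means_)        # the mu_i
print(gmm.covariances_)  # the Sigma_i
```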

The link between thermodynamics and inference

In recent blog posts I talked a bit about how many aspects of maximum entropy were analogous to methods in statistical physics. In this short post, I’ll summarize the most interesting similarities. In bayesian inference, we are usually interested in the posterior distribution of some parameters $\theta$ given the data $d$. This posterior can be written as a boltzmann distribution: $$P(\theta|d)=\frac{P(\theta,d)}{P(d)}=\left.\frac{e^{-\beta H(\theta,d)}}{Z}\right|_{\beta=1}$$ with $H(\theta,d) = -\log P(\theta,d)/\beta$ and $Z=\int d\theta\;e^{-\beta H(\theta,d)}$. I’ll note that we are working with units such that $k_B=1$ and thus $\beta=1/T$.

The energy is just the expectation value of the hamiltonian $H$ (note that the expectation is taken with respect to $P(\theta|d)$): $$E = \langle H \rangle = -\frac{\partial \log Z}{\partial \beta}$$

And the entropy is equal to $$S=-\int d\theta\;P(\theta|d)\log P(\theta|d)=\beta\langle H \rangle + \log Z$$

We can also define the free energy, which is $$F=E\, - \frac{S}{\beta}=-\frac{\log Z}{\beta}$$

A cool way to approximate Z if we can’t calculate it analytically (we usually can’t calculate it numerically for high dimensional problems because the integrals take a very long time to calculate) is to use laplace’s approximation: $$Z=\int d\theta\;e^{-\beta H(\theta,d)}\simeq\sqrt{\frac{2\pi}{\beta|H''(\theta^*)|}}e^{-\beta H(\theta^*)}$$ where $|H''(\theta^*)|$ is the determinant of the hessian of the hamiltonian (say that 3 times real fast) and $\theta^*$ is such that $H(\theta^*)=\min H(\theta)$ (minimum because of the minus sign). Needless to say this approximation works best for low temperature ($\beta\rightarrow\infty$) which might not be close to the correct value at $\beta=1$. $\theta^*$ is known as the maximum a posteriori (MAP) estimate. Expectation values can also be approximated in a similar way: $$\langle f(\theta) \rangle = \int d\theta \; f(\theta) P(\theta|d) \simeq\sqrt{\frac{2\pi}{\beta|H''(\theta^*)|}} f(\theta^*)P(\theta^*|d)$$
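
Here is a toy numerical illustration of the approximation (my sketch, not from the post: a made-up one dimensional hamiltonian, so the hessian reduces to a second derivative, estimated here by finite differences):

```python
import numpy as np
from scipy.optimize import minimize_scalar

beta = 1.0

def H(theta):
    # Hypothetical one dimensional hamiltonian, H = -log P(theta, d) / beta.
    return 0.5 * (theta - 2.0) ** 2 + 0.1 * theta ** 4

# MAP estimate: the minimum of H.
theta_star = minimize_scalar(H).x

# Second derivative of H at the minimum, by central finite differences.
eps = 1e-4
H2 = (H(theta_star + eps) - 2 * H(theta_star) + H(theta_star - eps)) / eps**2

# Laplace's approximation to the partition function.
Z_laplace = np.sqrt(2 * np.pi / (beta * H2)) * np.exp(-beta * H(theta_star))

# Compare against a brute-force quadrature of Z on a grid.
thetas = np.linspace(-10, 10, 20001)
y = np.exp(-beta * H(thetas))
Z_exact = np.sum((y[1:] + y[:-1]) / 2 * np.diff(thetas))
print(Z_laplace, Z_exact)
```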

So the MAP estimate is defined as $\text{argmax}_{\theta} P(\theta|d)$, which is just $\text{argmax}_{\theta} (-\beta H)$, since taking the log doesn’t change the maximum and $\log Z$ does not depend on $\theta$. A funny way to see the role of temperature here is to note that the posterior is exactly the distribution $p$ which minimizes the free energy functional $$F[p]=\langle H \rangle_p - \frac{S[p]}{\beta}$$ For infinite temperature ($\beta=0$) the parameters reflect total lack of knowledge: the entropy is maximized. As we lower the temperature, the energy term contributes more, reflecting the information provided by the data, until at temperature zero we would only care about the data contribution and ignore the entropy term.

(This is also the basic idea behind the simulated annealing optimization algorithm, where the objective function plays the role of the energy and the algorithm walks around phase space randomly, with jump size proportional to the temperature. The annealing schedule progressively lowers the temperature, restricting the random walk to progressively better regions of the objective, until it freezes at some point.)

Another cool connection is the fact that the heat capacity is given by $$C(\beta)=\beta^2\langle (\Delta H)^2 \rangle=\beta^2\langle (H-\langle H \rangle)^2 \rangle=\beta^2\frac{\partial^2 \log Z}{\partial \beta^2}$$

In the paper I looked at last time, the authors used this fact to estimate the entropy: they calculated $\langle (\Delta H)^2 \rangle$ by MCMC for various betas and used the relation $$S = \, \int_{1}^{\infty} d\beta\; \frac{1}{\beta} C(\beta)$$
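
As a toy check of this identity (my sketch, not the paper’s code: a single two-state spin, for which the averages are analytic, standing in for the MCMC estimates):

```python
import numpy as np

# Two-state spin with H(sigma) = -h * sigma, sigma in {-1, +1}.
h = 1.3

def heat_capacity(beta):
    # C(beta) = beta^2 * Var(H); for this spin Var(H) = h^2 / cosh^2(beta h).
    return beta**2 * h**2 / np.cosh(beta * h) ** 2

# Integrate C(beta)/beta from beta = 1 up to a large cutoff (the integrand
# decays exponentially, so the truncation error is negligible).
betas = np.linspace(1.0, 30.0, 20001)
integrand = heat_capacity(betas) / betas
S_estimate = np.sum((integrand[1:] + integrand[:-1]) / 2 * np.diff(betas))

# Exact entropy at beta = 1 for comparison: S = beta * <H> + log Z.
S_exact = -h * np.tanh(h) + np.log(2 * np.cosh(h))
print(S_estimate, S_exact)  # these should agree closely
```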

Review of ‘Searching for Collective Behavior in a Large Network of Sensory Neurons’

Last time I reviewed the principle of maximum entropy. Today I am looking at a paper which uses it to create a simplified probabilistic representation of neural dynamics. The idea is to measure the spike trains of each neuron individually and simultaneously across the whole population (in this case, around 100 neurons from a salamander retina). In this way, all correlations in the network are preserved, which allows the construction of a probability distribution describing some features of the network.

Naturally, a probability distribution describing the full network dynamics would require a model of the whole network, which is not what the authors are aiming at here. Instead, they wish to capture just the correct statistics of the network states. What are the network states? Imagine you bin time into small windows. In each window, each neuron will either spike or not. Then, for each time point you will have a binary word with 100 bits, where a 1 corresponds to a spike and a -1 to silence. This is a network state, which we will represent by $\boldsymbol{\sigma}$.
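
A small sketch of this binning step (my own illustration with made-up spike times, not the paper’s data):

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, t_max, bin_size = 100, 10.0, 0.02  # seconds; sizes are made up

# Hypothetical spike times: a list of arrays, one per neuron.
spike_times = [np.sort(rng.uniform(0, t_max, rng.poisson(50)))
               for _ in range(n_neurons)]

# Bin time and mark each neuron as spiking (+1) or silent (-1) per window.
bins = np.arange(0, t_max + bin_size, bin_size)
states = -np.ones((len(bins) - 1, n_neurons), dtype=int)
for i, times in enumerate(spike_times):
    counts, _ = np.histogram(times, bins=bins)
    states[counts > 0, i] = 1  # each row is one network state, sigma
```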

So, the goal is to get $P(\boldsymbol{\sigma})$. It would be more interesting to have something like $P(\boldsymbol{\sigma}_{t+1}|\boldsymbol{\sigma}_t)$ (subscript denoting time) but we don’t always get what we want, now do we? It is a much harder problem to get this conditional probability, so we’ll have to settle for the overall probability of each state. According to maximum entropy, this distribution will be given by $$P(\boldsymbol{\sigma})=\frac{1}{Z}\exp\left(-\sum_i \lambda_i f_i(\boldsymbol{\sigma})\right)$$ Continue reading “Review of ‘Searching for Collective Behavior in a Large Network of Sensory Neurons’”

Maximum entropy: a primer and some recent applications

I’ll let Caticha summarize the principle of maximum entropy:

Among all possible probability distributions that agree with whatever we know select that particular distribution that reflects maximum ignorance about everything else. Since ignorance is measured by entropy, the method is mathematically implemented by selecting the distribution that maximizes entropy subject to the constraints imposed by the available information.

It appears to have been introduced by Jaynes in 1957, and has seen a resurgence in the past decade with people taking bayesian inference more seriously. (As an aside, Jaynes’ posthumously published book is well worth a read, in spite of some cringeworthy rants peppered throughout.) I won’t dwell too much on the philosophy as the two previously mentioned sources have already gone into great detail to justify the method.

Usually we consider constraints which are linear in the probabilities, namely we constrain the probability distribution to have specific expectation values. Suppose we know the expectation values of a certain set of functions $f^k$. Then $p(x)$ should be such that $$\langle f^k \rangle = \int dx \; p(x) f^k(x)$$ for all $k$. Let’s omit the argument $(x)$ for simplicity. We can then use variational calculus to find the $p$ which extremizes the functional $$S[p]\; - \alpha \int dx\; p\; - \sum_k \lambda_k \langle f^k \rangle$$ The constraint with $\alpha$ is the normalization condition and $S$ is the shannon entropy.

The solution to this is $$p = \frac{1}{Z}\exp\left(-\sum_k\lambda_k f^k \right) $$ with $$Z=\int dx \; \exp \left(-\sum_k\lambda_k f^k \right)$$ the partition function (which is just the normalization constant). Now, we can find the remaining multipliers by solving the system of equations $$-\frac{\partial \log Z}{\partial \lambda_k} = \langle f^k \rangle$$ I’ll let you confirm that if we fix the mean and variance we get a gaussian distribution. Go on, I’ll wait.
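
(In case you don’t want to wait, here is my own filling-in of that check, not the post’s.) Fixing the first two moments means taking $f^1 = x$ and $f^2 = x^2$, so the solution reads $$p = \frac{1}{Z}e^{-\lambda_1 x - \lambda_2 x^2}$$ Completing the square, $$-\lambda_1 x - \lambda_2 x^2 = -\lambda_2\left(x + \frac{\lambda_1}{2\lambda_2}\right)^2 + \frac{\lambda_1^2}{4\lambda_2}$$ which is a gaussian with mean $\mu = -\lambda_1/(2\lambda_2)$ and variance $\sigma^2 = 1/(2\lambda_2)$ (with $\lambda_2 > 0$ for normalizability).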

Continue reading “Maximum entropy: a primer and some recent applications”

Kernel density estimation

Sometimes you need to estimate a probability distribution from a set of discrete points. You could build a histogram of the measurements, but that provides little information about the regions in phase space with no measurements (it is very likely you won’t have enough points to span the whole phase space). So the data set must be smoothed, as we did with time series. As in that case, we can describe the smoothing by a convolution with a kernel. In this case, the formula is very simple $$f(x)=\frac{1}{N}\sum_{i=1}^{N} K(x-x_i)$$

The choice of K is an art, but the standard choice is the gaussian kernel as we’ve seen before. Let’s try this out on some simulated data
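
Here is a minimal sketch of the procedure (my reconstruction, with an arbitrary two-component mixture standing in for the pictured one):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n):
    # Hypothetical stand-in for the pictured mixture: equal-weight
    # gaussians at -2 and 2 with unit variance.
    means = rng.choice([-2.0, 2.0], size=n)
    return rng.normal(means, 1.0)

def kde(x, samples, sigma=1.0):
    # f(x) = (1/N) sum_i K(x - x_i) with a gaussian kernel of width sigma.
    k = np.exp(-((x[:, None] - samples[None, :]) ** 2) / (2 * sigma**2))
    return k.mean(axis=1) / (np.sqrt(2 * np.pi) * sigma)

xs = np.linspace(-6, 6, 400)
estimate = kde(xs, sample_mixture(1000), sigma=1.0)
```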

Gaussian mixture we will sample from

And now let’s apply KDE to some sampled data.

100 samples, sigma 1
1000 samples, sigma 1

The choice of standard deviation makes a big difference in the final result. With little data we want a reasonably high sigma, to smooth out the large variations in the data. But if we have a lot of points, a lower sigma will represent the original distribution more faithfully:

1000 samples, sigma 0.1

An introduction to smoothing time series in python. Part IV: Particle Filter

Last time we started talking about state space models with the Kalman Filter. Our aim has been to find a smoothed trajectory for some given noisy observed data. In the case of state space models, we incorporate a model for the underlying physical system to calculate a likelihood for each trajectory and select the optimal one. When it comes to implementing this idea, we run into the problem of how to represent the probability distributions for the state of the system: it would be unfeasible to calculate the probability for every single point in phase space at all times. The Kalman filter solves this by approximating the target probability density function (abbreviated pdf) with a Gaussian, which has only two parameters. This approximation works surprisingly well but it might not be optimal for cases where the underlying pdf is multimodal.

Today we will be looking at another idea to implement state space models: the particle filter. Our goal is still to find the maximum of the posterior distribution $P(x|y,\theta)$, given by bayes’ formula: $$P(x|y,\theta)\propto P(y | x, \theta)P(x|\theta)P(\theta)$$ Now the idea is to approximate this distribution by randomly drawing points from it. Because we only look at one time step at a time, the sequence of points we sample forms a markov chain, and because the method relies on random sampling it is a monte carlo method (this sequential variant is usually called sequential monte carlo, a close relative of markov chain monte carlo (MCMC) methods). Continue reading “An introduction to smoothing time series in python. Part IV: Particle Filter”
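
To make this concrete, here is a minimal bootstrap particle filter (my own sketch, not the post’s code, for a 1-dimensional random walk observed with gaussian noise):

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, n_particles = 50, 1000
process_sigma, obs_sigma = 0.5, 1.0  # assumed noise scales

# Simulate a hidden random walk x and noisy observations y of it.
x = np.cumsum(process_sigma * rng.normal(size=n_steps))
y = x + obs_sigma * rng.normal(size=n_steps)

particles = np.zeros(n_particles)
estimates = []
for obs in y:
    # Predict: propagate each particle through the process model.
    particles = particles + process_sigma * rng.normal(size=n_particles)
    # Update: weight particles by the likelihood of the observation.
    weights = np.exp(-((obs - particles) ** 2) / (2 * obs_sigma**2))
    weights /= weights.sum()
    # Resample particles in proportion to their weights.
    particles = rng.choice(particles, size=n_particles, p=weights)
    estimates.append(particles.mean())
```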

An introduction to smoothing time series in python. Part III: Kalman Filter

If we have a mathematical model for the system under study, we can use that information to dramatically improve the quality of our prediction. Like the previous filtering methods, we take advantage of the fact that to estimate the system state at some time $t$ we can use not only the information available at that time but also information from the past (and from the future, if smoothing). But model-free methods tend to suppress large deviations from one time point to the next, while we may actually expect the system to jump drastically in value at some specific time. Adding in the model allows us to take that into account.

Let’s start by looking at the Kalman Filter, which is the optimal estimator for linear and gaussian systems. Let us define such a system first in the discrete case:

$$x_{n+1} = Ax_{n}+\xi \\ y_{n+1} = Bx_{n+1}+\zeta$$

The stochastic process in $x$ is the underlying process we want to follow. Not only is the process in $x$ driven by noise (a brownian-like process with additive white noise, denoted by $\xi$), but we are also unable to observe it directly. The observation is denoted by $y$ and is a function of $x$ corrupted by (again) additive white noise $\zeta$. The gaussian assumption is often a reasonable approximation to the problem’s noise statistics, because the timescale of whichever microscopic process produces the randomness is usually much smaller than that of the actual dynamics, allowing the central limit theorem to kick in.
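
Here is a minimal sketch of the resulting predict/update recursion in the scalar case (my illustration of the model above, with $A$ and $B$ as scalars and assumed noise variances $q$ and $r$):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = 1.0, 1.0   # scalar system and observation matrices
q, r = 0.25, 1.0  # assumed process and observation noise variances

# Simulate the system: x_{n+1} = A x_n + xi, y_{n+1} = B x_{n+1} + zeta.
n = 100
x = np.zeros(n)
for i in range(1, n):
    x[i] = A * x[i - 1] + np.sqrt(q) * rng.normal()
y = B * x + np.sqrt(r) * rng.normal(size=n)

# Kalman filter: track the gaussian posterior (mean m, variance P).
m, P = 0.0, 1.0
means = []
for obs in y:
    # Predict step: push the current estimate through the model.
    m, P = A * m, A * P * A + q
    # Update step: correct with the observation, weighted by the Kalman gain.
    K = P * B / (B * P * B + r)
    m, P = m + K * (obs - B * m), (1 - K * B) * P
    means.append(m)
```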

Continue reading “An introduction to smoothing time series in python. Part III: Kalman Filter”

Simulating networks of nonlinear stochastic systems

arXiv:1209.3700

In this paper we attempt to find a computationally efficient way to numerically simulate networks with nonlinear stochastic dynamics. By this I mean a continuous dynamical model where the differential equation for each variable depends nonlinearly on some or all variables of the system and has additive noise. If $x$ is a vector with all the variables and $\eta$ is a random vector of the same size as $x$ with some unspecified distribution, the dynamics can be compactly described as $$\frac{d x}{dt}=f(x,t)+\eta$$
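
The brute-force way to simulate such a system is to integrate many noisy paths; a minimal Euler-Maruyama sketch (mine, not the paper’s method, with an arbitrary bistable drift and gaussian noise) looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)
dt, n_steps, n_paths = 0.01, 1000, 5000
sigma = 0.3  # assumed noise amplitude

def f(x, t):
    # Hypothetical nonlinear drift: a bistable cubic force.
    return x - x**3

# Euler-Maruyama: x <- x + f(x, t) dt + sigma * sqrt(dt) * noise, per path.
x = np.zeros(n_paths)
for step in range(n_steps):
    t = step * dt
    x = x + f(x, t) * dt + sigma * np.sqrt(dt) * rng.normal(size=n_paths)

# The ensemble of endpoints approximates the PDF of x at the final time.
hist, edges = np.histogram(x, bins=50, density=True)
```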

The challenge lies in the nonlinearity combined with stochasticity. Were only one of them present, the problem would be simple. A deterministic nonlinear problem can be straightforwardly integrated with an ODE package, while a linear stochastic system can be reduced to a system of ODEs for the moments of the probability distribution function (PDF). A full solution would require a Monte Carlo algorithm to simulate a sufficient number of paths to allow us to estimate the PDF of $x$ at each time point. For networks with many nodes we are haunted by the curse of dimensionality, as the volume to be sampled increases exponentially and so does the number of simulated paths required to get a good approximation of the distribution at later time points. In systems where there is a well defined mode around which most of the probability mass is concentrated, we should be able to derive an analytic approximation which is more tractable. This is exactly what we try to do in the paper. Continue reading “Simulating networks of nonlinear stochastic systems”

Information processing systems post-mortem

Slides from the talk

Yesterday I gave an informal talk about information processing systems and lessons learned from the fields of AI and biology. This was a mix of introductory information theory and some philosophical ramblings.

While creating this talk I took the time to review several concepts from machine learning and AI. In Jaynes’ book about probability theory, bayesian inference is presented as a completely general system for logic under uncertainty. The gist of the argument is that an inference system which obeys certain internal consistency requirements must use probability theory as a formal framework. A hypothetical information processing system should obey such consistency requirements when assigning levels of plausibility to all pieces of information, which means its workings should be built upon probability theory. As a bonus, all the theory is developed, so we need only apply it!

To implement such a system we make a connection with biology. I started by arguing that an organism which wants to maximise its long term population growth must be efficient at decoding environmental inputs and responding to them. Thus if we define the long term viability of an organism implementing a given information processing system as a fitness function, we can obtain good implementations of our system by maximising that function.

Continue reading “Information processing systems post-mortem”