## The link between thermodynamics and inference

In recent blog posts I talked a bit about how many aspects of maximum entropy were analogous to methods in statistical physics. In this short post, I’ll summarize the most interesting similarities. In bayesian inference, we are usually interested in the posterior distribution of some parameters $\theta$ given the data d. This posterior can be written as a boltzmann distribution: $$P(\theta|d)=\frac{P(\theta,d)}{P(d)}=\left.\frac{e^{-\beta H(\theta,d)}}{Z}\right|_{\beta=1}$$ with $H(\theta,d) = -\log P(\theta,d)/\beta$ and $Z=\int d\theta\;e^{-\beta H(\theta,d)}$. I’ll note that we are working with units such that $k_B=1$ and thus $\beta=1/T$.

The energy is just the expectation value of the hamiltonian H (note that the expectation is taken with respect to $P(\theta|d)$): $$E = \langle H \rangle = -\frac{\partial \log Z}{\partial \beta}$$

And the entropy is equal to $$S=-\int d\theta\;P(\theta|d)\log P(\theta|d)=\beta\langle H \rangle – \log Z$$

We can also define the free energy, which is $$F=E\, – \frac{S}{\beta}=-\frac{\log Z}{\beta}$$

A cool way to approximate Z if we can’t calculate it analytically (we usually can’t calculate it numerically for high dimensional problems because the integrals take a very long time to calculate) is to use laplace’s approximation: $$Z=\int d\theta\;e^{-\beta H(\theta,d)}\simeq\sqrt{\frac{2\pi}{\beta|H”(\theta^*)|}}e^{-\beta H(\theta^*)}$$ where $|H”(\theta^*)|$ is the determinant of the hessian of the hamiltonian (say that 3 times real fast) and $\theta^*$ is such that $H(\theta^*)=\min H(\theta)$ (minimum because of the minus sign). Needless to say this approximation works best for small temperature ($\beta\rightarrow\infty$) which might not be close to the correct value at $\beta=1$. $\theta^*$ is known as the maximum a posteriori (MAP) estimate. Expectation values can also be approximated in a similar way: $$\langle f(\theta) \rangle = \int d\theta \; f(\theta) P(\theta|d) \simeq\sqrt{\frac{2\pi}{\beta|H”(\theta^*)|}} f(\theta^*)P(\theta^*|d)$$

So the MAP estimate is defined as $\text{argmax}_{\theta} P(\theta|d)$. The result won’t change if we take the log of the posterior, which leads to a form similar to the entropy: \begin{align}\theta_{\text{MAP}}&=\text{argmax}_{\theta} (-\beta H – \log Z)\\&=\text{argmax}_{\theta} (-2\beta H + S)\end{align} Funny, huh? For infinite temperature ($\beta=0$) the parameters reflect total lack of knowledge: the entropy is maximized. As we lower the temperature, the energy term contributes more, reflecting the information provided by the data, until at temperature zero we would only care about the data contribution and ignore the entropy term.

(This is also the basic idea for the simulated annealing optimization algorithm, where in that case the objective function plays the role of the energy and the algorithm walks around phase space randomly, with jump size proportional to the temperature. The annealing schedule progressively lowers the temperature, restricting the random walk to regions of high objective function value, until it freezes at some point.)

Another cool connection is the fact that the heat capacity is given by $$C(\beta)=\beta^2\langle (\Delta H)^2 \rangle=\beta^2\langle (H-\langle H \rangle)^2 \rangle=\beta^2\frac{\partial^2 \log Z}{\partial \beta^2}$$

In the paper I looked at last time, the authors used this fact to estimate the entropy: they calculated $\langle (\Delta H)^2 \rangle$ by MCMC for various betas and used the relation $$S = \, \int_{1}^{\infty} d\beta\; \frac{1}{\beta} C(\beta)$$

## Maximum entropy: a primer and some recent applications

I’ll let Caticha summarize the principle of maximum entropy:

Among all possible probability distributions that agree with whatever we know select that particular distribution that reflects maximum ignorance about everything else. Since ignorance is measured by entropy, the method is mathematically implemented by selecting the distribution that maximizes entropy subject to the constraints imposed by the available information.

It appears to have been introduced by Jaynes in 57, and has seen a resurgence in the past decade with people taking bayesian inference more seriously. (As an aside, Jayne’s posthumously published book is well worth a read, in spite of some cringeworthy rants peppered throughout.) I won’t dwell too much on the philosophy as the two previously mentioned sources have already gone into great detail to justify the method.

Usually we consider constraints which are linear in the probabilities, namely we constrain the probability distribution to have specific expectation values. Consider that we know the expectation values of a certain set of functions $f^k$. Then, $p(x)$ should be such that $$\langle f^k \rangle = \int dx \; p(x) f^k(x)$$ for all k. Let’s omit the notation $(x)$ for simplicity. Then, we can use variational calculus to find p which minimizes the functional $$S[p]\; – \alpha \int dx\; p\; – \sum_k \lambda_k \langle f^k \rangle$$ The constraint with $\alpha$ is the normalization condition and $S$ is the shannon entropy.

The solution to this is $$p = \frac{1}{Z}\exp\left(-\sum_k\lambda_k f^k \right)$$ with $$Z=\int dx \; \exp \left(-\sum_k\lambda_k f^k \right)$$ the partition function (which is just the normalization constant). Now, we can find the remaining multipliers by solving the system of equations $$-\frac{\partial \log Z}{\partial \lambda_k} = \langle f^k \rangle$$ I’ll let you confirm that if we fix the mean and variance we get a gaussian distribution. Go on, I’ll wait.

## How to do inverse transformation sampling in scipy and numpy

Let’s say you have some data which follows a certain probability distribution. You can create a histogram and visualize the probability distribution, but now you want to sample from it. How do you go about doing this with python?

You do inverse transform sampling, which is just a method to rescale a uniform random variable to have the probability distribution we want. The idea is that the cumulative distribution function for the histogram you have maps the random variable’s space of possible values to the region [0,1]. If you invert it, you can sample uniform random numbers and transform them to your target distribution!

To implement this, we calculate the CDF for each bin in the histogram (red points above) and interpolate it using scipy’s interpolate functions. Then we just need to sample uniform random points and pass them through the inverse CDF! Here is how it looks:

## Gamma distribution approximation to the negative binomial distribution

In a recent data analysis project I was fitting a negative binomial distribution to some data when I realized that the gamma distribution was an equally good fit. And with equally good I mean the MLE fits were numerically indistinguishable. This intrigued me. In the internet I could find only a cryptic sentence on wikipedia saying the negative binomial is a discrete analog to the gamma and a paper talking about bounds on how closely the negative binomial approximates the gamma, but nobody really explains why this is the case. So here is a quick physicist’s derivation of the limit for large k.

## How to simulate a model for a genetic oscillator

In the previous post, I showed how to efficiently solve SDEs using python. Today we will use that knowledge to explore a well known model in systems biology: the repressilator.

The repressilator was described in detail by Elowitz and Leibler in Nature. It is essentially a simple way for a cell to create an oscillator by changing the concentration of a number of proteins by the mechanism of gene expression. It has proven to be a difficult model to recreate in practice using synthetic biology, so nobody knows if it is an accurate model of what actually happens inside the cells. But it’s simple to model using differential equations and we’re physicists (yes, you too! for now at least) so let’s have a go at it! The system is composed of N proteins, each of which is repressed by another in the set cyclically. Following the usual hill equation model for gene expression and taking into account that proteins degrade after some time, we can write the sde for the system as such: Continue reading “How to simulate a model for a genetic oscillator”

## Solving stochastic differential equations with theano

In systems biology we often need to solve diffusion equations of the type $$df = f(x,t) dt + g(x,t)dW$$ where W is a white noise process; they’re the most common example of a stochastic differential equation (SDE). There are only very few cases for which we can analytically solve this equation, such as when either f or g are constant or just depend linearly on x. Of course most interesting cases involve complicated f and g functions, so we need to solve them numerically.

One way to solve this is to use a straightforward variant of the Euler method. You just need to do the usual time discretization $\delta t = T/N$ (with T the total time and N the number of steps) and then update your current value with the deterministic contribution $\delta t \times f$ and the stochastic contribution $\sqrt{\delta t} \times g$. If you are not familiar with stochastic calculus you may be wondering why is the time step multiplier $\sqrt{\delta t}$ for the stochastic part.

## Kernel density estimation

Sometimes you need to estimate a probability distribution from a set of discrete points. You could build a histogram of the measurements, but that provides little information about the regions in phase space with no measurements (it is very likely you won’t have enough points to span the whole phase space). So the data set must be smoothed, as we did with time series. As in that case, we can describe the smoothing by a convolution with a kernel. In this case, the formula is very simple $$f(x)=\frac{1}{N}\sum_i^N K(x-x_i)$$

The choice of K is an art, but the standard choice is the gaussian kernel as we’ve seen before. Let’s try this out on some simulated data

And now let’s apply KDE to some sampled data.

The choice of standard deviation makes a big difference in the final result. For low amounts of data we want a reasonably high sigma, to smoothen out the large variations in the data. But if we have a lot of points, a lower sigma will more faithfully represent the original distribution:

## An introduction to smoothing time series in python. Part IV: Particle Filter

Last time we started talking about state space models with the Kalman Filter. Our aim has been to find a smoothed trajectory for some given noisy observed data. In the case of state space models, we incorporate a model for the underlying physical system to calculate a likelihood for each trajectory and select the optimal one. When it comes to implementing this idea, we run into the problem of how to represent the probability distributions for the state of the system: it would be unfeasible to calculate the probability for every single point in phase space at all times. The Kalman filter solves this by approximating the target probability density function (abbreviated pdf) with a Gaussian, which has only two parameters. This approximation works surprisingly well but it might not be optimal for cases where the underlying pdf is multimodal.

Today we will be looking at another idea to implement state space models – the particle filter. Our goal is still to find the maximum of the posterior distribution $P(x|y,\theta)$, given by bayes’ formula: $$P(x|y,\theta)\propto P(y | x, \theta)P(x|\theta)P(\theta)$$ Now the idea is to approximate this distribution by randomly drawing points from it. Because we will only look at one time step at a time, the sequence of points we sample will be a markov chain; and because the method relies on random sampling we call it a markov chain monte carlo (MCMC) method. Continue reading “An introduction to smoothing time series in python. Part IV: Particle Filter”

## An introduction to smoothing time series in python. Part III: Kalman Filter

If we have a mathematical model for the system in study, we can use that information to dramatically improve the quality of our prediction. Like in the previous filtering methods, we are taking advantage of the fact that to estimate the system state at some time $t$ we can use not only the information available at that time but also the information from the past (and the future if smoothing). But model free methods tend to want to reduce large deviations from one point in time to the next, while we may actually expect that at some specific time point the system does jump drastically in value. Adding in the model allows us to take that into account.

Let’s start by looking at the Kalman Filter, which is the optimal estimator for linear and gaussian systems. Let us define such a system first in the discrete case:

$$x_{n+1} = Ax_{n}+\xi \\ y_{n+1} = Bx_{n+1}+\zeta$$

The stochastic process in $x$ is the underlying process we want to follow. Not only is the process in $x$ a brownian process (additive white noise denoted by $\xi$), we are unable to observe it directly. The observation is denoted by $y$ and is a function of $x$ corrupted by (again) additive white noise $\zeta$. The gaussian assumption is often a reasonable approximation to the problem’s noise statistics because the timescale of whichever microscopic process produces randomness is usually much smaller than the one of the actual dynamics, allowing the central limit theorem to kick in.

## An introduction to smoothing time series in python. Part II: wiener filter and smoothing splines

Wiener filter

The wiener filter is a bit more advanced than the filters I previously covered, as it is the first one rooted in probability theory. Consider a more complicated measurement, $y = r*s + n$, where $R$ is an operator describing the response of the measurement equipment (for images, it is known as point spread function). We want to find the signal estimate $\hat{s}$ which minimizes the distance $$E[(\hat{s}-s)^2]$$ i.e., the minimum mean square error. This estimate should be given for a linear filter $w$ such that $\hat{s} = w*y$.