## The link between thermodynamics and inference

In recent blog posts I talked a bit about how many aspects of maximum entropy were analogous to methods in statistical physics. In this short post, I’ll summarize the most interesting similarities. In bayesian inference, we are usually interested in the posterior distribution of some parameters $\theta$ given the data d. This posterior can be written as a boltzmann distribution: $$P(\theta|d)=\frac{P(\theta,d)}{P(d)}=\left.\frac{e^{-\beta H(\theta,d)}}{Z}\right|_{\beta=1}$$ with $H(\theta,d) = -\log P(\theta,d)/\beta$ and $Z=\int d\theta\;e^{-\beta H(\theta,d)}$. I’ll note that we are working with units such that $k_B=1$ and thus $\beta=1/T$.

The energy is just the expectation value of the hamiltonian H (note that the expectation is taken with respect to $P(\theta|d)$): $$E = \langle H \rangle = -\frac{\partial \log Z}{\partial \beta}$$

And the entropy is equal to $$S=-\int d\theta\;P(\theta|d)\log P(\theta|d)=\beta\langle H \rangle – \log Z$$

We can also define the free energy, which is $$F=E\, – \frac{S}{\beta}=-\frac{\log Z}{\beta}$$

A cool way to approximate Z if we can’t calculate it analytically (we usually can’t calculate it numerically for high dimensional problems because the integrals take a very long time to calculate) is to use laplace’s approximation: $$Z=\int d\theta\;e^{-\beta H(\theta,d)}\simeq\sqrt{\frac{2\pi}{\beta|H”(\theta^*)|}}e^{-\beta H(\theta^*)}$$ where $|H”(\theta^*)|$ is the determinant of the hessian of the hamiltonian (say that 3 times real fast) and $\theta^*$ is such that $H(\theta^*)=\min H(\theta)$ (minimum because of the minus sign). Needless to say this approximation works best for small temperature ($\beta\rightarrow\infty$) which might not be close to the correct value at $\beta=1$. $\theta^*$ is known as the maximum a posteriori (MAP) estimate. Expectation values can also be approximated in a similar way: $$\langle f(\theta) \rangle = \int d\theta \; f(\theta) P(\theta|d) \simeq\sqrt{\frac{2\pi}{\beta|H”(\theta^*)|}} f(\theta^*)P(\theta^*|d)$$

So the MAP estimate is defined as $\text{argmax}_{\theta} P(\theta|d)$. The result won’t change if we take the log of the posterior, which leads to a form similar to the entropy: \begin{align}\theta_{\text{MAP}}&=\text{argmax}_{\theta} (-\beta H – \log Z)\\&=\text{argmax}_{\theta} (-2\beta H + S)\end{align} Funny, huh? For infinite temperature ($\beta=0$) the parameters reflect total lack of knowledge: the entropy is maximized. As we lower the temperature, the energy term contributes more, reflecting the information provided by the data, until at temperature zero we would only care about the data contribution and ignore the entropy term.

(This is also the basic idea for the simulated annealing optimization algorithm, where in that case the objective function plays the role of the energy and the algorithm walks around phase space randomly, with jump size proportional to the temperature. The annealing schedule progressively lowers the temperature, restricting the random walk to regions of high objective function value, until it freezes at some point.)

Another cool connection is the fact that the heat capacity is given by $$C(\beta)=\beta^2\langle (\Delta H)^2 \rangle=\beta^2\langle (H-\langle H \rangle)^2 \rangle=\beta^2\frac{\partial^2 \log Z}{\partial \beta^2}$$

In the paper I looked at last time, the authors used this fact to estimate the entropy: they calculated $\langle (\Delta H)^2 \rangle$ by MCMC for various betas and used the relation $$S = \, \int_{1}^{\infty} d\beta\; \frac{1}{\beta} C(\beta)$$

## Review of ‘Searching for Collective Behavior in a Large Network of Sensory Neurons’

Last time I reviewed the principle of maximum entropy. Today I am looking at a paper which uses it to create a simplified probabilistic representation of neural dynamics. The idea is to measure the spike trains of each neuron individually (in this case there are around 100 neurons from a salamander retina being measured) and simultaneously. In this way, all correlations in the network are preserved, which allows the construction of a probability distribution describing some features of the network.

Naturally, a probability distribution describing the full network dynamics would need a model of the whole network dynamics, which is not what the authors are aiming at here. Instead, they wish to just capture the correct statistics of the network states. What are the network states? Imagine you bin time into small windows. In each window, each neuron will be spiking or not. Then, for each time point you will have a binary word with 100 bits, where each a 1 corresponds to a spike and a -1 to silence. This is a network state, which we will represent by $\boldsymbol{\sigma}$.

So, the goal is to get $P(\boldsymbol{\sigma})$. It would be more interesting to have something like $P(\boldsymbol{\sigma}_{t+1}|\boldsymbol{\sigma}_t)$ (subscript denoting time) but we don’t always get what we want, now do we? It is a much harder problem to get this conditional probability, so we’ll have to settle for the overall probability of each state. According to maximum entropy, this distribution will be given by $$P(\boldsymbol{\sigma})=\frac{1}{Z}\exp\left(-\sum_i \lambda_i f_i(\boldsymbol{\sigma})\right)$$ Continue reading “Review of ‘Searching for Collective Behavior in a Large Network of Sensory Neurons’”

## Maximum entropy: a primer and some recent applications

I’ll let Caticha summarize the principle of maximum entropy:

Among all possible probability distributions that agree with whatever we know select that particular distribution that reflects maximum ignorance about everything else. Since ignorance is measured by entropy, the method is mathematically implemented by selecting the distribution that maximizes entropy subject to the constraints imposed by the available information.

It appears to have been introduced by Jaynes in 57, and has seen a resurgence in the past decade with people taking bayesian inference more seriously. (As an aside, Jayne’s posthumously published book is well worth a read, in spite of some cringeworthy rants peppered throughout.) I won’t dwell too much on the philosophy as the two previously mentioned sources have already gone into great detail to justify the method.

Usually we consider constraints which are linear in the probabilities, namely we constrain the probability distribution to have specific expectation values. Consider that we know the expectation values of a certain set of functions $f^k$. Then, $p(x)$ should be such that $$\langle f^k \rangle = \int dx \; p(x) f^k(x)$$ for all k. Let’s omit the notation $(x)$ for simplicity. Then, we can use variational calculus to find p which minimizes the functional $$S[p]\; – \alpha \int dx\; p\; – \sum_k \lambda_k \langle f^k \rangle$$ The constraint with $\alpha$ is the normalization condition and $S$ is the shannon entropy.

The solution to this is $$p = \frac{1}{Z}\exp\left(-\sum_k\lambda_k f^k \right)$$ with $$Z=\int dx \; \exp \left(-\sum_k\lambda_k f^k \right)$$ the partition function (which is just the normalization constant). Now, we can find the remaining multipliers by solving the system of equations $$-\frac{\partial \log Z}{\partial \lambda_k} = \langle f^k \rangle$$ I’ll let you confirm that if we fix the mean and variance we get a gaussian distribution. Go on, I’ll wait.

## How to do inverse transformation sampling in scipy and numpy

Let’s say you have some data which follows a certain probability distribution. You can create a histogram and visualize the probability distribution, but now you want to sample from it. How do you go about doing this with python?

You do inverse transform sampling, which is just a method to rescale a uniform random variable to have the probability distribution we want. The idea is that the cumulative distribution function for the histogram you have maps the random variable’s space of possible values to the region [0,1]. If you invert it, you can sample uniform random numbers and transform them to your target distribution!

To implement this, we calculate the CDF for each bin in the histogram (red points above) and interpolate it using scipy’s interpolate functions. Then we just need to sample uniform random points and pass them through the inverse CDF! Here is how it looks:

## A lovely new minesweeper on android I made

Today I am finally releasing my little minesweeper for android! I’ve been working on this as a hobby for the past few weekends, and now it is finally smooth enough to let other people see it! The problem with most minesweeper applications in the market is that they are either really ugly or haven’t really figured out how to adapt the original mouse controls to a touchscreen. I set out to solve these two problems so I can play some mines on my phone!

To solve the ugliness problem, I drew some tiles in photoshop in a very minimal style, to disturb the eyes as little as possible and let you focus on the game. Here is how it turned out.

To navigate the board, you can use the normal multitouch gestures like pan and pinch to zoom. To place a flag, you can long press a tile or you can double tap an open tile and drag to a closed tile (these gestures won’t let you win speed competitions, but they’re pretty good if you’re lazily solving the board)

You also get some pretty sweet statistics when you win or lose!

## Gamma distribution approximation to the negative binomial distribution

In a recent data analysis project I was fitting a negative binomial distribution to some data when I realized that the gamma distribution was an equally good fit. And with equally good I mean the MLE fits were numerically indistinguishable. This intrigued me. In the internet I could find only a cryptic sentence on wikipedia saying the negative binomial is a discrete analog to the gamma and a paper talking about bounds on how closely the negative binomial approximates the gamma, but nobody really explains why this is the case. So here is a quick physicist’s derivation of the limit for large k.

## Negative binomial with continuous parameters in python

So scipy doesn’t support a negative binomial for a continuous r parameter. The expression for its pdf is $P(k)=\frac{\Gamma(k+r)}{k!\,\Gamma(r)} (1-p)^rp^k$. I coded a small class which computes the pdf and is also able to find MLE estimates for p and k given some data. It relies on the mpmath arbitrary precision library since the gamma function values can get quite large and overflow a double. It might be useful to someone so here’s the code below.

## Automatic segmentation of microscopy images

A few months back I was posed the problem of automatically segmenting brightfield images of bacteria such as this:

I thought this was a really simple problem so I started applying some filters to the image and playing with morphology operations. You can isolate dark spots in the image by applying a threshold to each pixel. The resulting binary image can be modified by using the different morphological operators, and hopefully identifying each individual cell. Turns out there is a reason people stopped using these methods in the 90s and the reason is they don’t really work. If the cells are close enough, there won’t be a great enough difference in brightness to separate the two particles and they will remain stuck.

## How to import structured matlab data into python with scipy

So a few days ago I received this really nice data set from an experimental group in matlab format which contains a list of structs with some properties, some of which are structs themselves. I usually just open it in matlab using my university’s license and export the data as a .csv , but in this case with the structs there was no direct way to export the data and preserve all the associated structure. Luckily scipy has a method to import .mat files into python, appropriately called loadmat.

In the case of a struct array the resulting file is kind of confusing to navigate. You’d expect to access each record with data[i], where data is the struct list. For some reason I cannot hope to understand you need to iterate over data in the following way: data[0,i].

Each record is loaded as a numpy structured array, which allow you to access the data by its original property names. That’s great, but what I don’t understand is why some data gets nested inside multiple one dimensional arrays which you need to navigate out of. An example: I needed to access a 2d array of floats which was a property of a property of a struct (…). You’d expect to access it as record[‘property’][‘subproperty’]. But actually you have to dig it out of record[‘property’][‘subproperty’][0][0]. I’m not sure if this is due to the way .mat files are structured or scipy’s behavior. This is relatively easy to figure out using the interactive shell, although it makes for some ugly code to parse the whole file.

The best way to map the structure would be to create an array of dicts in python with the corresponding properties. However I wanted to have the data in numpy format which led to a slightly awkward design decision: I create a table where each row contains the (unique) value of the properties in the child structs and the corresponding values for the properties in the parent structs. This means that the properties in the parent structs are duplicated across all rows corresponding to their children. With this I traded off memory space for being able to directly access all values for a single property without traversing some complicated structure. I believe this was a reasonable tradeoff.

What about selecting subsets of data based on the parent properties? To solve this problem, I actually converted the massive numpy table into a pandas dataframe. Pandas is extremely useful when your data fits the “spreadsheet” paradigm (i.e. each column corresponds to a different kind of data type), and its advanced selection operations allow you to do SQL-like queries on the data (yes, you can even do joins!), which is what I have been using to do advanced selections.