## An introduction to the metropolis method with python

I already talked about MCMC methods before, but today I want to cover one of the most well known methods of all, Metropolis-Hastings. The goal is to obtain samples according to to the equilibrium distribution of a given physical system, the Boltzmann distribution. Incidentally, we can also rewrite arbitrary probability distributions in this form, which is what allows for the cross pollination of methods between probabilistic inference and statistical mechanics (look at my older post on this). Since we don’t know how to sample directly from the boltzmann distribution in general, we need to use some sampling method. Continue reading “An introduction to the metropolis method with python”

## Simple pattern formation with cellular automata

A cellular automaton is a dynamical system where space, time and dynamic variable are all discrete. The system is thus composed of a lattice of cells (discrete space), each described by a state (discrete dynamic variable) which evolve into the next time step (discrete time) according to a dynamic rule.

x_i^{t+1} = f(x_i^t, \Omega_i^t, \xi)

This rule generally depends on the state of the target cell $x_i^t$, the state of its neighbors $\Omega_i^t$, and a number of auxiliary external variables $\xi$. Since all these inputs are discrete, we can enumerate them and then define the dynamic rule by a transition table. The transition table maps each possible input to the next state for the cell. As an example consider the elementary 1D cellular automaton. In this case the neighborhood consists of only the 2 nearest neighbors $\Omega_i^t = \{x_{i-1}^t, x_{i+1}^t\}$ and no external variables.

In general, there are two types of neighborhoods, commonly classified as Moore or Von Neumann. A Moore neighborhood of radius $r$ corresponds to all cells within a hypercube of size $r$ centered at the current cell. In 2D we can write it as $\Omega_{ij}^t = \{x^t_{kl}:|i-k|\leq r \wedge |j-l|\leq r\}\setminus x^t_{ij}$. The Von Neumann neighborhood is more restrictive: only cells within a manhattan distance of $r$ belong to the neighborhood. In 2D we write $\Omega_{ij}^t = \{x^t_{kl}:|i-l|+|j-k| \leq r\}\setminus x^t_{ij}$.

Finally it is worth elucidating the concept of totalistic automata. In high dimensional spaces, the number of possible configurations of the neighborhood $\Omega$ can be quite large. As a simplification, we may consider instead as an input to the transition table the sum of all neighbors in a specific state $N_k = \sum_{x \in \Omega}\delta(x = k)$. If there are only 2 states, we need only consider $N_1$, since $N_0 = r – N_1$. For an arbitrary number $m$ of states, we will obviously need to consider $m-1$ such inputs to fully characterize the neighborhood. Even then, each input $N_k$ can take $r+1$ different values, which might be too much. In such cases we may consider only the case when $N_k$ is above some threshold. Then we can define as an input the boolean variable

P_{k,T}=\begin{cases}
1& \text{if $N_k \geq T$},\\
0& \text{if $N_k < T$}.
\end{cases}

In the simulation you can find here, I considered a cellular automaton with the following properties: number of states $m=2$; moore neighborhood with radius $r=1$; lattice size $L_x \times L_y$; and 3 inputs for the transition table:

• Current state $x_{ij}^t$
• Neighborhood state $P_{1,T}$ with $T$ unspecified
• One external input $\xi$
\xi_{ij}=\begin{cases}
1& \text{if $i \geq L_x/2$},\\
0& \text{if $i < L_x/2$}.
\end{cases}
• Initial condition $x_{ij} = 0 \; \forall_{ij}$

For these conditions a deterministic simulation of these conditions yields only a few steady states: homogeneous 1 or 0, half the lattice 1 and the other 0, and oscillation between a combination of the previous.

One possibility would be to add noise to the cellular automaton in order to provide more interesting dynamics. There are two ways to add noise to a cellular automaton:

The most straightforward way is to perform the following procedure at each time step:

• Apply the deterministic dynamics to the whole lattice
• For each lattice site $ij$, invert the state $x_{ij}$ with probability $p$

This procedure only works of course for $m=2$. In the case of more states there is no obvious way to generalize the procedure and we need to use a proper monte carlo method to get the dynamics.

A second way is to implement a probabilistic cellular automaton. In this case the transition table is generalized to a markov matrix: each input is now mapped not to a specific state but rather to a set of probabilities for a transition to each state ($m$ probabilities). Naturally for each input these sum to one. In this case we have $m$ times more parameters than before.

## The link between thermodynamics and inference

In recent blog posts I talked a bit about how many aspects of maximum entropy were analogous to methods in statistical physics. In this short post, I’ll summarize the most interesting similarities. In bayesian inference, we are usually interested in the posterior distribution of some parameters $\theta$ given the data d. This posterior can be written as a boltzmann distribution: $$P(\theta|d)=\frac{P(\theta,d)}{P(d)}=\left.\frac{e^{-\beta H(\theta,d)}}{Z}\right|_{\beta=1}$$ with $H(\theta,d) = -\log P(\theta,d)/\beta$ and $Z=\int d\theta\;e^{-\beta H(\theta,d)}$. I’ll note that we are working with units such that $k_B=1$ and thus $\beta=1/T$.

The energy is just the expectation value of the hamiltonian H (note that the expectation is taken with respect to $P(\theta|d)$): $$E = \langle H \rangle = -\frac{\partial \log Z}{\partial \beta}$$

And the entropy is equal to $$S=-\int d\theta\;P(\theta|d)\log P(\theta|d)=\beta\langle H \rangle – \log Z$$

We can also define the free energy, which is $$F=E\, – \frac{S}{\beta}=-\frac{\log Z}{\beta}$$

A cool way to approximate Z if we can’t calculate it analytically (we usually can’t calculate it numerically for high dimensional problems because the integrals take a very long time to calculate) is to use laplace’s approximation: $$Z=\int d\theta\;e^{-\beta H(\theta,d)}\simeq\sqrt{\frac{2\pi}{\beta|H”(\theta^*)|}}e^{-\beta H(\theta^*)}$$ where $|H”(\theta^*)|$ is the determinant of the hessian of the hamiltonian (say that 3 times real fast) and $\theta^*$ is such that $H(\theta^*)=\min H(\theta)$ (minimum because of the minus sign). Needless to say this approximation works best for small temperature ($\beta\rightarrow\infty$) which might not be close to the correct value at $\beta=1$. $\theta^*$ is known as the maximum a posteriori (MAP) estimate. Expectation values can also be approximated in a similar way: $$\langle f(\theta) \rangle = \int d\theta \; f(\theta) P(\theta|d) \simeq\sqrt{\frac{2\pi}{\beta|H”(\theta^*)|}} f(\theta^*)P(\theta^*|d)$$

So the MAP estimate is defined as $\text{argmax}_{\theta} P(\theta|d)$. The result won’t change if we take the log of the posterior, which leads to a form similar to the entropy: \begin{align}\theta_{\text{MAP}}&=\text{argmax}_{\theta} (-\beta H – \log Z)\\&=\text{argmax}_{\theta} (-2\beta H + S)\end{align} Funny, huh? For infinite temperature ($\beta=0$) the parameters reflect total lack of knowledge: the entropy is maximized. As we lower the temperature, the energy term contributes more, reflecting the information provided by the data, until at temperature zero we would only care about the data contribution and ignore the entropy term.

(This is also the basic idea for the simulated annealing optimization algorithm, where in that case the objective function plays the role of the energy and the algorithm walks around phase space randomly, with jump size proportional to the temperature. The annealing schedule progressively lowers the temperature, restricting the random walk to regions of high objective function value, until it freezes at some point.)

Another cool connection is the fact that the heat capacity is given by $$C(\beta)=\beta^2\langle (\Delta H)^2 \rangle=\beta^2\langle (H-\langle H \rangle)^2 \rangle=\beta^2\frac{\partial^2 \log Z}{\partial \beta^2}$$

In the paper I looked at last time, the authors used this fact to estimate the entropy: they calculated $\langle (\Delta H)^2 \rangle$ by MCMC for various betas and used the relation $$S = \, \int_{1}^{\infty} d\beta\; \frac{1}{\beta} C(\beta)$$

## Review of ‘Searching for Collective Behavior in a Large Network of Sensory Neurons’

Last time I reviewed the principle of maximum entropy. Today I am looking at a paper which uses it to create a simplified probabilistic representation of neural dynamics. The idea is to measure the spike trains of each neuron individually (in this case there are around 100 neurons from a salamander retina being measured) and simultaneously. In this way, all correlations in the network are preserved, which allows the construction of a probability distribution describing some features of the network.

Naturally, a probability distribution describing the full network dynamics would need a model of the whole network dynamics, which is not what the authors are aiming at here. Instead, they wish to just capture the correct statistics of the network states. What are the network states? Imagine you bin time into small windows. In each window, each neuron will be spiking or not. Then, for each time point you will have a binary word with 100 bits, where each a 1 corresponds to a spike and a -1 to silence. This is a network state, which we will represent by $\boldsymbol{\sigma}$.

So, the goal is to get $P(\boldsymbol{\sigma})$. It would be more interesting to have something like $P(\boldsymbol{\sigma}_{t+1}|\boldsymbol{\sigma}_t)$ (subscript denoting time) but we don’t always get what we want, now do we? It is a much harder problem to get this conditional probability, so we’ll have to settle for the overall probability of each state. According to maximum entropy, this distribution will be given by $$P(\boldsymbol{\sigma})=\frac{1}{Z}\exp\left(-\sum_i \lambda_i f_i(\boldsymbol{\sigma})\right)$$ Continue reading “Review of ‘Searching for Collective Behavior in a Large Network of Sensory Neurons’”

## Maximum entropy: a primer and some recent applications

I’ll let Caticha summarize the principle of maximum entropy:

Among all possible probability distributions that agree with whatever we know select that particular distribution that reflects maximum ignorance about everything else. Since ignorance is measured by entropy, the method is mathematically implemented by selecting the distribution that maximizes entropy subject to the constraints imposed by the available information.

It appears to have been introduced by Jaynes in 57, and has seen a resurgence in the past decade with people taking bayesian inference more seriously. (As an aside, Jayne’s posthumously published book is well worth a read, in spite of some cringeworthy rants peppered throughout.) I won’t dwell too much on the philosophy as the two previously mentioned sources have already gone into great detail to justify the method.

Usually we consider constraints which are linear in the probabilities, namely we constrain the probability distribution to have specific expectation values. Consider that we know the expectation values of a certain set of functions $f^k$. Then, $p(x)$ should be such that $$\langle f^k \rangle = \int dx \; p(x) f^k(x)$$ for all k. Let’s omit the notation $(x)$ for simplicity. Then, we can use variational calculus to find p which minimizes the functional $$S[p]\; – \alpha \int dx\; p\; – \sum_k \lambda_k \langle f^k \rangle$$ The constraint with $\alpha$ is the normalization condition and $S$ is the shannon entropy.

The solution to this is $$p = \frac{1}{Z}\exp\left(-\sum_k\lambda_k f^k \right)$$ with $$Z=\int dx \; \exp \left(-\sum_k\lambda_k f^k \right)$$ the partition function (which is just the normalization constant). Now, we can find the remaining multipliers by solving the system of equations $$-\frac{\partial \log Z}{\partial \lambda_k} = \langle f^k \rangle$$ I’ll let you confirm that if we fix the mean and variance we get a gaussian distribution. Go on, I’ll wait.

## How to import structured matlab data into python with scipy

So a few days ago I received this really nice data set from an experimental group in matlab format which contains a list of structs with some properties, some of which are structs themselves. I usually just open it in matlab using my university’s license and export the data as a .csv , but in this case with the structs there was no direct way to export the data and preserve all the associated structure. Luckily scipy has a method to import .mat files into python, appropriately called loadmat.

In the case of a struct array the resulting file is kind of confusing to navigate. You’d expect to access each record with data[i], where data is the struct list. For some reason I cannot hope to understand you need to iterate over data in the following way: data[0,i].

Each record is loaded as a numpy structured array, which allow you to access the data by its original property names. That’s great, but what I don’t understand is why some data gets nested inside multiple one dimensional arrays which you need to navigate out of. An example: I needed to access a 2d array of floats which was a property of a property of a struct (…). You’d expect to access it as record[‘property’][‘subproperty’]. But actually you have to dig it out of record[‘property’][‘subproperty’][0][0]. I’m not sure if this is due to the way .mat files are structured or scipy’s behavior. This is relatively easy to figure out using the interactive shell, although it makes for some ugly code to parse the whole file.

The best way to map the structure would be to create an array of dicts in python with the corresponding properties. However I wanted to have the data in numpy format which led to a slightly awkward design decision: I create a table where each row contains the (unique) value of the properties in the child structs and the corresponding values for the properties in the parent structs. This means that the properties in the parent structs are duplicated across all rows corresponding to their children. With this I traded off memory space for being able to directly access all values for a single property without traversing some complicated structure. I believe this was a reasonable tradeoff.

What about selecting subsets of data based on the parent properties? To solve this problem, I actually converted the massive numpy table into a pandas dataframe. Pandas is extremely useful when your data fits the “spreadsheet” paradigm (i.e. each column corresponds to a different kind of data type), and its advanced selection operations allow you to do SQL-like queries on the data (yes, you can even do joins!), which is what I have been using to do advanced selections.

## How to simulate a model for a genetic oscillator

In the previous post, I showed how to efficiently solve SDEs using python. Today we will use that knowledge to explore a well known model in systems biology: the repressilator.

The repressilator was described in detail by Elowitz and Leibler in Nature. It is essentially a simple way for a cell to create an oscillator by changing the concentration of a number of proteins by the mechanism of gene expression. It has proven to be a difficult model to recreate in practice using synthetic biology, so nobody knows if it is an accurate model of what actually happens inside the cells. But it’s simple to model using differential equations and we’re physicists (yes, you too! for now at least) so let’s have a go at it! The system is composed of N proteins, each of which is repressed by another in the set cyclically. Following the usual hill equation model for gene expression and taking into account that proteins degrade after some time, we can write the sde for the system as such: Continue reading “How to simulate a model for a genetic oscillator”

## Simulating tissues with pressure

One small project I did was to code up a simulation of a growing tissue which feels pressure and where each cell has a dynamic state which depends on its neighbors and the pressure it feels. The idea is to reproduce some essential properties of morphogenesis. You can look at the code here. I am going to talk about the most interesting parts of the code.

I initialize the cells in an ordered lattice, with random perturbations in their positions, except those which are in the borders (bottom, left, right). Those are static and do not evolve in the simulation like the others. They are there just to represent the pressure from the rest of the body (huge) on the simulated tissue (tiny). This is not a very realistic assumption because many developmental systems have a size of the order of the body size, but we have to start somewhere!

All cells are connected with springs, which simulate adhesive and pressure forces in the tissue. If left alone, the system relaxes into a hexagonal configuration, since this minimises the spring potential energy. I integrate the harmonic oscillator equations using a fourth order Adams Moulton algorithm.

Now, it is important to realize that there are two time scales in the system: the pressure equilibration and cell lifetimes. We can assume the mechanical pressure equilibrates very fast, while cell divisions take their time. So what we do is run the oscillator system until equlibrium for each time step of the cellular state evolution, which we will talk about later.

## Preprint review: Parameter Space Compression Underlies Emergent Theories and Predictive Models

So here’s a preprint I found really interesting [arxiv:1303.6738]. I’ll try to give a quick overview of the story in my own words.

The main concept used in the paper is the Fisher Information, which is no more than a measure of the curvature in the space of probability distributions. It is easy to intuitively understand what it is in the 1D case. Suppose you have a probability distribution for some random variable $x$ parametrized by $\theta$: $P(x|\theta)$. If you change $\theta$ by an infinitesimal amount, how will the probability distribution itself change? Will it be vastly different or almost the same? We can quantify that change by averaging the square of the relative changes of the probabilities of all the points: $$\mathcal{I}=E\left[\left(\frac{dP(x)}{d\theta}\frac{1}{P(x)}\right)^2\right]_x$$

Another way to look at it is as quantifying the “resolution” with which we can detect the parameter $\theta$: when the FI is high, we can distinguish between 2 parameters with close values more easily than for low FI, which corresponds to a higher resolution in parameter space. But why can I make this statement? After all the FI quantifies the difference between the probability distributions, not the parameters which specify them. The reason the “resolution” picture makes sense is thanks to the Cramér-Rao bound: $$var(\hat{\theta})=\frac{1}{\mathcal{I}}$$

To understand the bound we must make the following definitions: a hat over a parameter denotes the estimator of that parameter (an estimator is just a function that takes a set of realizations (or “measurements”) of the random variable we are looking at and spits out a number, which we hope will be close to the true value of the estimator) and the variance of an estimator is just the variance of the result of applying the estimator to many independent sets of measurements. With that in mind, the statement is that the variance of an estimator of our parameter is bounded by the inverse of the FI (note: we assume the estimator to be “unbiased“). Because FI for a set of N independent samples scales as N, for a very large sample size the variance of the estimator tends to zero, meaning we always get the same, “correct value”, as we’d intuitively expect.

Now bear in mind this is a very mathematical construct, hence all the quotation marks. The intuition I described above is only valid when the model is simple enough that we can afford to use the concept of unbiased estimators (which imply discarding all prior information we might have) and can assume that the underlying parameter is indeed a single number. Im many interesting cases we cannot assume this, but rather that it is a random variable itself (either due to nature, or due to intermediate processes we neglect to add to our model). I digress. The FI stands on its own as a useful tool in a myriad of applications, often related to quantifying the “resolution” of a system in the sense I tried to convey above.

In the paper we are looking at, the authors consider dynamic systems at a microscopic level with many degrees of freedom. In these systems you can consider the attributes of each particle a parameter, and code the many possible state of the system at some point in time by a probability distribution, which of course depends on this very large set of parameters. In this case the FI for any parameter is likely very small, as if you were to make a small change in one of these parameters the system would evolve in a very similar way leaving the probability distributions relatively unchanged, in the FI sense. The interesting step is now to find the eigenvalues of the FI matrix. This projects the parameters onto a new space where the directions correspond to some natural observables of the system.

Now, if we coarsen the system by allowing a long time to pass (i.e. a diffusion process) or by looking at it from a macroscopic scale (i.e. coarse graining an ising model) it turns out a few of these directions have a very large weight and the rest have comparatively low weight. The authors argue that these directions, when cast as observables, correspond to the macroscopic parameters of the system. Going back to the picture of the FI as resolution, these few observables will be the ones which we will be able to easily distinguish, while all the others will get lost in the noise. This is an appealing statement because it agrees with what we already know from statistical physics: we can accurately model systems at the macroscopic scale even if we have no hope to know what is going on at the microscopic level. Now we can see this idea emerge naturally from probability theory.

Another point they make is that this procedure works for both diffusive type processes, where we attribute this scale separation due to the fact that fluctuations are only relevant at the micro scale but not at the macro; and for processes with phase transitions where fluctuations are relevant at all scales at the critical point (cf. renormalization group). Under this framework there is a single explanation for why universal behavior is so prevalent in physics which I think is pretty cool.