Dimensionality reduction 101: linear algebra, hidden variables and generative models

Suppose you are faced with a high dimensional dataset and want to find some structure in the data: often there are only a few causes, but lots of different data points are generated due to noise corruption. How can we infer these causes? Here I’m going to cover the simplest method to do this inference: we will assume the data is generated by a linear transformation of the hidden causes. In this case, it is quite simple to recover the parameters of this transformation and therefore determine the hidden (or latent) variables which represent their cause.

Continue reading “Dimensionality reduction 101: linear algebra, hidden variables and generative models”

How to fix scipy’s interpolating spline default behavior

Scipy’s UnivariateSpline class is a super useful way to smooth time series, especially if you need an estimate of the derivative. It is an implementation of an interpolating spline, which I’ve previously covered in this blog post. Its big problem is that the default parameters suck. Depending on the absolute value of your data, the spline produced by leaving the parameters at their default values can be overfit, underfit or just fine. Below I visually reproduce the problem for two time series from an experiment with very different numerical values.

Two time series with different numerical values and their derivatives below
Two time series with different numerical values and their derivatives below. The first is overfit, the second underfit.

My usual solution was just to manually adjust the $s$ parameter until the result looked good. But this time I have hundreds of time series, so I have to do it right this time. And doing it right requires actually understanding what’s going on. In the documentation, $s$ is described as follows:

Positive smoothing factor used to choose the number of knots. Number of knots will be increased until the smoothing condition is satisfied:

sum((w[i]*(y[i]-s(x[i])))**2,axis=0) <= s

If None (default), s=len(w) which should be a good value if 1/w[i] is an estimate of the standard deviation of y[i]. If 0, spline will interpolate through all data points.

So the default value of $s$ should be fine if $w^{-1}$ were an estimate of the standard deviation of $y$. However, the default value for $w$ is 1/len(y) which is clearly not a decent estimate. The solution then is to calculate a rough estimate of the standard deviation of $y$ and pass the inverse of that as $w$. My solution to that is to use a gaussian kernel to smooth the data and then calculate a smoothed variance as well. Code below:


Same timeseries with the variance estimate fix
Same timeseries with the variance estimate fix

Now, you may be thinking I only moved the parameter dependence around: before I had to fine tune $s$ but now there is a free parameter sigma. The difference is that a) the gaussian filter results are much more robust with respect to the choice of sigma;  b) we only need to provide an estimate of the standard deviation, so it’s fine if the result coming out is not perfect; and c) it does not depend on the absolute value of the data. In fact, for the above dataset I left sigma at its default value of 3 for all timeseries and all of them came out perfect. So I’d consider this problem solved.

I understand why the scipy developers wouldn’t use a method similar to mine to estimate $w$ as default, after all it may not work for all types of data. On the other hand, I think the documentation as it stands is confusing. The user would expect that parameters which have a default value would work without fine tuning, instead what happens here is that if you leave $w$ as the default you must change $s$ and vice versa.

Quick introduction to gaussian mixture models with python

Usually we like to model probability distributions with gaussian distributions. Not only are they the maximum entropy distributions if we only know the mean and variance of a dataset, the central limit theorem guarantees that random variables which are the result of summing many different random variables will be gaussian distributed too. But what to do when we have multimodal distributions like this one?

A gaussian mixture with 2 dimensional features
A gaussian mixture with 2 dimensional features

A gaussian distribution would not represent this very well. So what’s the next best thing? Add another gaussian! A gaussian mixture model is defined by a sum of gaussians $$P(x)=\sum_i w_i \, \mathcal{G}(\mu_i, \Sigma_i)$$ with means $\mu$ and covariance matrices $\Sigma$. Continue reading “Quick introduction to gaussian mixture models with python”

Review of ‘Searching for Collective Behavior in a Large Network of Sensory Neurons’

Last time I reviewed the principle of maximum entropy. Today I am looking at a paper which uses it to create a simplified probabilistic representation of neural dynamics. The idea is to measure the spike trains of each neuron individually (in this case there are around 100 neurons from a salamander retina being measured) and simultaneously. In this way, all correlations in the network are preserved, which allows the construction of a probability distribution describing some features of the network.

Naturally, a probability distribution describing the full network dynamics would need a model of the whole network dynamics, which is not what the authors are aiming at here. Instead, they wish to just capture the correct statistics of the network states. What are the network states? Imagine you bin time into small windows. In each window, each neuron will be spiking or not. Then, for each time point you will have a binary word with 100 bits, where each a 1 corresponds to a spike and a -1 to silence. This is a network state, which we will represent by $\boldsymbol{\sigma}$.

So, the goal is to get $P(\boldsymbol{\sigma})$. It would be more interesting to have something like $P(\boldsymbol{\sigma}_{t+1}|\boldsymbol{\sigma}_t)$ (subscript denoting time) but we don’t always get what we want, now do we? It is a much harder problem to get this conditional probability, so we’ll have to settle for the overall probability of each state. According to maximum entropy, this distribution will be given by $$P(\boldsymbol{\sigma})=\frac{1}{Z}\exp\left(-\sum_i \lambda_i f_i(\boldsymbol{\sigma})\right)$$ Continue reading “Review of ‘Searching for Collective Behavior in a Large Network of Sensory Neurons’”

How to do inverse transformation sampling in scipy and numpy

Let’s say you have some data which follows a certain probability distribution. You can create a histogram and visualize the probability distribution, but now you want to sample from it. How do you go about doing this with python?

gaussian mixture
gaussian mixture

The short answer:

The long answer:

You do inverse transform sampling, which is just a method to rescale a uniform random variable to have the probability distribution we want. The idea is that the cumulative distribution function for the histogram you have maps the random variable’s space of possible values to the region [0,1]. If you invert it, you can sample uniform random numbers and transform them to your target distribution!

How the inverse CDF looks for the the above gaussian mixture
How the inverse CDF looks for the the above gaussian mixture

To implement this, we calculate the CDF for each bin in the histogram (red points above) and interpolate it using scipy’s interpolate functions. Then we just need to sample uniform random points and pass them through the inverse CDF! Here is how it looks:

New samples in red, original in blue
New samples in red, original in blue


Gamma distribution approximation to the negative binomial distribution

In a recent data analysis project I was fitting a negative binomial distribution to some data when I realized that the gamma distribution was an equally good fit. And with equally good I mean the MLE fits were numerically indistinguishable. This intrigued me. In the internet I could find only a cryptic sentence on wikipedia saying the negative binomial is a discrete analog to the gamma and a paper talking about bounds on how closely the negative binomial approximates the gamma, but nobody really explains why this is the case. So here is a quick physicist’s derivation of the limit for large k.

Continue reading “Gamma distribution approximation to the negative binomial distribution”

Negative binomial with continuous parameters in python

So scipy doesn’t support a negative binomial for a continuous r parameter. The expression for its pdf is $P(k)=\frac{\Gamma(k+r)}{k!\,\Gamma(r)} (1-p)^rp^k$. I coded a small class which computes the pdf and is also able to find MLE estimates for p and k given some data. It relies on the mpmath arbitrary precision library since the gamma function values can get quite large and overflow a double. It might be useful to someone so here’s the code below.

Continue reading “Negative binomial with continuous parameters in python”

How to import structured matlab data into python with scipy

So a few days ago I received this really nice data set from an experimental group in matlab format which contains a list of structs with some properties, some of which are structs themselves. I usually just open it in matlab using my university’s license and export the data as a .csv , but in this case with the structs there was no direct way to export the data and preserve all the associated structure. Luckily scipy has a method to import .mat files into python, appropriately called loadmat.

In the case of a struct array the resulting file is kind of confusing to navigate. You’d expect to access each record with data[i], where data is the struct list. For some reason I cannot hope to understand you need to iterate over data in the following way: data[0,i].

Each record is loaded as a numpy structured array, which allow you to access the data by its original property names. That’s great, but what I don’t understand is why some data gets nested inside multiple one dimensional arrays which you need to navigate out of. An example: I needed to access a 2d array of floats which was a property of a property of a struct (…). You’d expect to access it as record[‘property’][‘subproperty’]. But actually you have to dig it out of record[‘property’][‘subproperty’][0][0]. I’m not sure if this is due to the way .mat files are structured or scipy’s behavior. This is relatively easy to figure out using the interactive shell, although it makes for some ugly code to parse the whole file.

The best way to map the structure would be to create an array of dicts in python with the corresponding properties. However I wanted to have the data in numpy format which led to a slightly awkward design decision: I create a table where each row contains the (unique) value of the properties in the child structs and the corresponding values for the properties in the parent structs. This means that the properties in the parent structs are duplicated across all rows corresponding to their children. With this I traded off memory space for being able to directly access all values for a single property without traversing some complicated structure. I believe this was a reasonable tradeoff.

What about selecting subsets of data based on the parent properties? To solve this problem, I actually converted the massive numpy table into a pandas dataframe. Pandas is extremely useful when your data fits the “spreadsheet” paradigm (i.e. each column corresponds to a different kind of data type), and its advanced selection operations allow you to do SQL-like queries on the data (yes, you can even do joins!), which is what I have been using to do advanced selections.