## Dimensionality reduction 101: linear algebra, hidden variables and generative models

Suppose you are faced with a high dimensional dataset and want to find some structure in the data: often there are only a few causes, but lots of different data points are generated due to noise corruption. How can we infer these causes? Here I’m going to cover the simplest method to do this inference: we will assume the data is generated by a linear transformation of the hidden causes. In this case, it is quite simple to recover the parameters of this transformation and therefore determine the hidden (or latent) variables which represent their cause.

## Parallel programming with opencl and python: parallel scan

This post will continue the subject of how to implement common algorithms in a parallel processor, which I started to discuss here. Today we come to the second pattern, the scan. An example is the cumulative sum, where you iterate over an array and calculate the sum of all elements up to the current one. Like reduce, the algorithm we’ll talk about is not exclusive for the sum operation, but for any binary associative operation (the max is another common example). There are two ways to do a parallel scan:  the hills steele scan, which needs $\log n$ steps; and the blelloch scan, requiring $2 \log n$ steps. The blelloch scan is useful if you have more data than processors, because it only needs to do $\mathcal{O}(n)$ operations (this quantity is also referred to as the work efficiency); while the hillis steele scan needs $\mathcal{O}(n \log n)$ operations. So let’s look at how to implement both of them with opencl kernels.

## Parallel programming with opencl and python: parallel reduce

Once you know how to use python to run opencl kernels on your device (read Part I and Part II of this series) you need to start thinking about the programming patterns you will use. While many tasks are inherently parallel (like calculating the value of a function for N different values) and you can just straightforwardly run N copies on your processors, most interesting tasks involve dependencies in the data. For instance if you want to simply sum N numbers in the simplest possible way, the thread doing the summing needs to know about all N numbers, so you can only run one thread, leaving most of your cores unused. So what we need to come up with are clever ways to decompose the problem into individual parts which can be run in parallel, and then combine them all in the end.

## Parallel programming with opencl and python: vectors and concurrency

Last time we saw how to run some simple code on the gpu. Now let’s look at some particular aspects related to parallel programming we should be aware of. Since gpus are massively parallel processors, you’d expect you could write your kernel code for a single data piece, and by running enough copies of the kernel you’d be maximizing your device’s performance. Well, you’d be wrong! I’m going to focus on the three most obvious issues which could hamper your parallel code’s performance:

• Each of the individual cores is actually a vector processor, which means it can perform an operation on multiple numbers at a time.
• At some point the individual threads might need to write to the same position in memory (i.e. to accumulate a value). To make sure the result is correct, they need to take turns doing it, which means they spend time waiting for each other doing nothing.
• Most code is limited by memory bandwidth, not compute performance. This means that the gpu can’t get the data to the processing cores as fast as they can actually perform the computation required.

## Parallel programming with opencl and python

In the next few posts I’ll cover my experiences with learning how to program efficient parallel programs on gpus using opencl. Because the machine I got was a mac pro with the top of the line gpus (7 teraflops) I needed to use opencl, which is a bit complex and confusing at first glance. It also requires a lot of boilerplate code which makes it really hard to just jump in and start experimenting. I ultimately decided to use pyopencl, which allows us to do the boring boilerplate stuff in just a few lines of python and focus on the actual parallel programs (the kernels).

First, a few pointers on what I read. A great introduction to the abstract concepts of parallel programming is the udacity course Introduction to parallel programming. They use C and CUDA to illustrate the concepts, which means you can’t directly apply what you see there on a computer with a non nvidia gpu. To learn the opencl api itself, I used the book OpenCL in Action: How to Accelerate Graphics and Computation. As for pyopencl, the documentation is a great place to start. You can also find all the python code I used in github. Continue reading “Parallel programming with opencl and python”

## How to fix scipy’s interpolating spline default behavior

Scipy’s UnivariateSpline class is a super useful way to smooth time series, especially if you need an estimate of the derivative. It is an implementation of an interpolating spline, which I’ve previously covered in this blog post. Its big problem is that the default parameters suck. Depending on the absolute value of your data, the spline produced by leaving the parameters at their default values can be overfit, underfit or just fine. Below I visually reproduce the problem for two time series from an experiment with very different numerical values.

My usual solution was just to manually adjust the $s$ parameter until the result looked good. But this time I have hundreds of time series, so I have to do it right this time. And doing it right requires actually understanding what’s going on. In the documentation, $s$ is described as follows:

Positive smoothing factor used to choose the number of knots. Number of knots will be increased until the smoothing condition is satisfied:

sum((w[i]*(y[i]-s(x[i])))**2,axis=0) <= s

If None (default), s=len(w) which should be a good value if 1/w[i] is an estimate of the standard deviation of y[i]. If 0, spline will interpolate through all data points.

So the default value of $s$ should be fine if $w^{-1}$ were an estimate of the standard deviation of $y$. However, the default value for $w$ is 1/len(y) which is clearly not a decent estimate. The solution then is to calculate a rough estimate of the standard deviation of $y$ and pass the inverse of that as $w$. My solution to that is to use a gaussian kernel to smooth the data and then calculate a smoothed variance as well. Code below:

Now, you may be thinking I only moved the parameter dependence around: before I had to fine tune $s$ but now there is a free parameter sigma. The difference is that a) the gaussian filter results are much more robust with respect to the choice of sigma;  b) we only need to provide an estimate of the standard deviation, so it’s fine if the result coming out is not perfect; and c) it does not depend on the absolute value of the data. In fact, for the above dataset I left sigma at its default value of 3 for all timeseries and all of them came out perfect. So I’d consider this problem solved.

I understand why the scipy developers wouldn’t use a method similar to mine to estimate $w$ as default, after all it may not work for all types of data. On the other hand, I think the documentation as it stands is confusing. The user would expect that parameters which have a default value would work without fine tuning, instead what happens here is that if you leave $w$ as the default you must change $s$ and vice versa.

## Quick introduction to gaussian mixture models with python

Usually we like to model probability distributions with gaussian distributions. Not only are they the maximum entropy distributions if we only know the mean and variance of a dataset, the central limit theorem guarantees that random variables which are the result of summing many different random variables will be gaussian distributed too. But what to do when we have multimodal distributions like this one?

A gaussian distribution would not represent this very well. So what’s the next best thing? Add another gaussian! A gaussian mixture model is defined by a sum of gaussians $$P(x)=\sum_i w_i \, \mathcal{G}(\mu_i, \Sigma_i)$$ with means $\mu$ and covariance matrices $\Sigma$. Continue reading “Quick introduction to gaussian mixture models with python”

## An introduction to the metropolis method with python

I already talked about MCMC methods before, but today I want to cover one of the most well known methods of all, Metropolis-Hastings. The goal is to obtain samples according to to the equilibrium distribution of a given physical system, the Boltzmann distribution. Incidentally, we can also rewrite arbitrary probability distributions in this form, which is what allows for the cross pollination of methods between probabilistic inference and statistical mechanics (look at my older post on this). Since we don’t know how to sample directly from the boltzmann distribution in general, we need to use some sampling method. Continue reading “An introduction to the metropolis method with python”

## How to do inverse transformation sampling in scipy and numpy

Let’s say you have some data which follows a certain probability distribution. You can create a histogram and visualize the probability distribution, but now you want to sample from it. How do you go about doing this with python?

So scipy doesn’t support a negative binomial for a continuous r parameter. The expression for its pdf is $P(k)=\frac{\Gamma(k+r)}{k!\,\Gamma(r)} (1-p)^rp^k$. I coded a small class which computes the pdf and is also able to find MLE estimates for p and k given some data. It relies on the mpmath arbitrary precision library since the gamma function values can get quite large and overflow a double. It might be useful to someone so here’s the code below.