Dimensionality reduction 101: linear algebra, hidden variables and generative models

Suppose you are faced with a high dimensional dataset and want to find some structure in the data: often there are only a few causes, but lots of different data points are generated due to noise corruption. How can we infer these causes? Here I’m going to cover the simplest method to do this inference: we will assume the data is generated by a linear transformation of the hidden causes. In this case, it is quite simple to recover the parameters of this transformation and therefore determine the hidden (or latent) variables which represent their cause.

Continue reading “Dimensionality reduction 101: linear algebra, hidden variables and generative models”

Parallel programming with opencl and python: parallel scan

This post will continue the subject of how to implement common algorithms in a parallel processor, which I started to discuss here. Today we come to the second pattern, the scan. An example is the cumulative sum, where you iterate over an array and calculate the sum of all elements up to the current one. Like reduce, the algorithm we’ll talk about is not exclusive for the sum operation, but for any binary associative operation (the max is another common example). There are two ways to do a parallel scan:  the hills steele scan, which needs $\log n$ steps; and the blelloch scan, requiring $2 \log n$ steps. The blelloch scan is useful if you have more data than processors, because it only needs to do $\mathcal{O}(n)$ operations (this quantity is also referred to as the work efficiency); while the hillis steele scan needs $\mathcal{O}(n \log n)$ operations. So let’s look at how to implement both of them with opencl kernels.

Continue reading “Parallel programming with opencl and python: parallel scan”

Parallel programming with opencl and python: parallel reduce

Once you know how to use python to run opencl kernels on your device (read Part I and Part II of this series) you need to start thinking about the programming patterns you will use. While many tasks are inherently parallel (like calculating the value of a function for N different values) and you can just straightforwardly run N copies on your processors, most interesting tasks involve dependencies in the data. For instance if you want to simply sum N numbers in the simplest possible way, the thread doing the summing needs to know about all N numbers, so you can only run one thread, leaving most of your cores unused. So what we need to come up with are clever ways to decompose the problem into individual parts which can be run in parallel, and then combine them all in the end.

Continue reading “Parallel programming with opencl and python: parallel reduce”

Parallel programming with opencl and python: vectors and concurrency

Last time we saw how to run some simple code on the gpu. Now let’s look at some particular aspects related to parallel programming we should be aware of. Since gpus are massively parallel processors, you’d expect you could write your kernel code for a single data piece, and by running enough copies of the kernel you’d be maximizing your device’s performance. Well, you’d be wrong! I’m going to focus on the three most obvious issues which could hamper your parallel code’s performance:

  • Each of the individual cores is actually a vector processor, which means it can perform an operation on multiple numbers at a time.
  • At some point the individual threads might need to write to the same position in memory (i.e. to accumulate a value). To make sure the result is correct, they need to take turns doing it, which means they spend time waiting for each other doing nothing.
  • Most code is limited by memory bandwidth, not compute performance. This means that the gpu can’t get the data to the processing cores as fast as they can actually perform the computation required.

Continue reading “Parallel programming with opencl and python: vectors and concurrency”

Parallel programming with opencl and python

In the next few posts I’ll cover my experiences with learning how to program efficient parallel programs on gpus using opencl. Because the machine I got was a mac pro with the top of the line gpus (7 teraflops) I needed to use opencl, which is a bit complex and confusing at first glance. It also requires a lot of boilerplate code which makes it really hard to just jump in and start experimenting. I ultimately decided to use pyopencl, which allows us to do the boring boilerplate stuff in just a few lines of python and focus on the actual parallel programs (the kernels).

First, a few pointers on what I read. A great introduction to the abstract concepts of parallel programming is the udacity course Introduction to parallel programming. They use C and CUDA to illustrate the concepts, which means you can’t directly apply what you see there on a computer with a non nvidia gpu. To learn the opencl api itself, I used the book OpenCL in Action: How to Accelerate Graphics and Computation. As for pyopencl, the documentation is a great place to start. You can also find all the python code I used in github. Continue reading “Parallel programming with opencl and python”

How to fix scipy’s interpolating spline default behavior

Scipy’s UnivariateSpline class is a super useful way to smooth time series, especially if you need an estimate of the derivative. It is an implementation of an interpolating spline, which I’ve previously covered in this blog post. Its big problem is that the default parameters suck. Depending on the absolute value of your data, the spline produced by leaving the parameters at their default values can be overfit, underfit or just fine. Below I visually reproduce the problem for two time series from an experiment with very different numerical values.

Two time series with different numerical values and their derivatives below
Two time series with different numerical values and their derivatives below. The first is overfit, the second underfit.

My usual solution was just to manually adjust the $s$ parameter until the result looked good. But this time I have hundreds of time series, so I have to do it right this time. And doing it right requires actually understanding what’s going on. In the documentation, $s$ is described as follows:

Positive smoothing factor used to choose the number of knots. Number of knots will be increased until the smoothing condition is satisfied:

sum((w[i]*(y[i]-s(x[i])))**2,axis=0) <= s

If None (default), s=len(w) which should be a good value if 1/w[i] is an estimate of the standard deviation of y[i]. If 0, spline will interpolate through all data points.

So the default value of $s$ should be fine if $w^{-1}$ were an estimate of the standard deviation of $y$. However, the default value for $w$ is 1/len(y) which is clearly not a decent estimate. The solution then is to calculate a rough estimate of the standard deviation of $y$ and pass the inverse of that as $w$. My solution to that is to use a gaussian kernel to smooth the data and then calculate a smoothed variance as well. Code below:


Same timeseries with the variance estimate fix
Same timeseries with the variance estimate fix

Now, you may be thinking I only moved the parameter dependence around: before I had to fine tune $s$ but now there is a free parameter sigma. The difference is that a) the gaussian filter results are much more robust with respect to the choice of sigma;  b) we only need to provide an estimate of the standard deviation, so it’s fine if the result coming out is not perfect; and c) it does not depend on the absolute value of the data. In fact, for the above dataset I left sigma at its default value of 3 for all timeseries and all of them came out perfect. So I’d consider this problem solved.

I understand why the scipy developers wouldn’t use a method similar to mine to estimate $w$ as default, after all it may not work for all types of data. On the other hand, I think the documentation as it stands is confusing. The user would expect that parameters which have a default value would work without fine tuning, instead what happens here is that if you leave $w$ as the default you must change $s$ and vice versa.

Quick introduction to gaussian mixture models with python

Usually we like to model probability distributions with gaussian distributions. Not only are they the maximum entropy distributions if we only know the mean and variance of a dataset, the central limit theorem guarantees that random variables which are the result of summing many different random variables will be gaussian distributed too. But what to do when we have multimodal distributions like this one?

A gaussian mixture with 2 dimensional features
A gaussian mixture with 2 dimensional features

A gaussian distribution would not represent this very well. So what’s the next best thing? Add another gaussian! A gaussian mixture model is defined by a sum of gaussians $$P(x)=\sum_i w_i \, \mathcal{G}(\mu_i, \Sigma_i)$$ with means $\mu$ and covariance matrices $\Sigma$. Continue reading “Quick introduction to gaussian mixture models with python”

An introduction to the metropolis method with python

I already talked about MCMC methods before, but today I want to cover one of the most well known methods of all, Metropolis-Hastings. The goal is to obtain samples according to to the equilibrium distribution of a given physical system, the Boltzmann distribution. Incidentally, we can also rewrite arbitrary probability distributions in this form, which is what allows for the cross pollination of methods between probabilistic inference and statistical mechanics (look at my older post on this). Since we don’t know how to sample directly from the boltzmann distribution in general, we need to use some sampling method. Continue reading “An introduction to the metropolis method with python”

Pinch to zoom in libgdx

So I was a bit confused how to reproduce the multitouch gesture you often see in mobile gallery apps using libgdx. The idea is to zoom and recenter the viewport such that the points where your fingers are anchored are always the same (in game coordinates). Assuming you don’t need to rotate, here is the code I came up with:

Of course, you shouldn’t put all this stuff into this method: each logical piece of code should be in its own method (and in minesweeper most of it is actually on another object, since I like to have only code relating to gesture handling on the gesture handler object)

Simple pattern formation with cellular automata

A cellular automaton is a dynamical system where space, time and dynamic variable are all discrete. The system is thus composed of a lattice of cells (discrete space), each described by a state (discrete dynamic variable) which evolve into the next time step (discrete time) according to a dynamic rule.
x_i^{t+1} = f(x_i^t, \Omega_i^t, \xi)
This rule generally depends on the state of the target cell $x_i^t$, the state of its neighbors $\Omega_i^t$, and a number of auxiliary external variables $\xi$. Since all these inputs are discrete, we can enumerate them and then define the dynamic rule by a transition table. The transition table maps each possible input to the next state for the cell. As an example consider the elementary 1D cellular automaton. In this case the neighborhood consists of only the 2 nearest neighbors $\Omega_i^t = \{x_{i-1}^t, x_{i+1}^t\}$ and no external variables.

In general, there are two types of neighborhoods, commonly classified as Moore or Von Neumann. A Moore neighborhood of radius $r$ corresponds to all cells within a hypercube of size $r$ centered at the current cell. In 2D we can write it as $\Omega_{ij}^t = \{x^t_{kl}:|i-k|\leq r \wedge |j-l|\leq r\}\setminus x^t_{ij}$. The Von Neumann neighborhood is more restrictive: only cells within a manhattan distance of $r$ belong to the neighborhood. In 2D we write $\Omega_{ij}^t = \{x^t_{kl}:|i-l|+|j-k| \leq r\}\setminus x^t_{ij}$.

Finally it is worth elucidating the concept of totalistic automata. In high dimensional spaces, the number of possible configurations of the neighborhood $\Omega$ can be quite large. As a simplification, we may consider instead as an input to the transition table the sum of all neighbors in a specific state $N_k = \sum_{x \in \Omega}\delta(x = k)$. If there are only 2 states, we need only consider $N_1$, since $N_0 = r – N_1$. For an arbitrary number $m$ of states, we will obviously need to consider $m-1$ such inputs to fully characterize the neighborhood. Even then, each input $N_k$ can take $r+1$ different values, which might be too much. In such cases we may consider only the case when $N_k$ is above some threshold. Then we can define as an input the boolean variable

1& \text{if $N_k \geq T$},\\
0& \text{if $N_k < T$}.

In the simulation you can find here, I considered a cellular automaton with the following properties: number of states $m=2$; moore neighborhood with radius $r=1$; lattice size $L_x \times L_y$; and 3 inputs for the transition table:

  • Current state $x_{ij}^t$
  • Neighborhood state $P_{1,T}$ with $T$ unspecified
  • One external input $\xi$\begin{equation}
    1& \text{if $i \geq L_x/2$},\\
    0& \text{if $i < L_x/2$}.
  • Initial condition $x_{ij} = 0 \; \forall_{ij}$

For these conditions a deterministic simulation of these conditions yields only a few steady states: homogeneous 1 or 0, half the lattice 1 and the other 0, and oscillation between a combination of the previous.

One possibility would be to add noise to the cellular automaton in order to provide more interesting dynamics. There are two ways to add noise to a cellular automaton:

The most straightforward way is to perform the following procedure at each time step:

  • Apply the deterministic dynamics to the whole lattice
  • For each lattice site $ij$, invert the state $x_{ij}$ with probability $p$

This procedure only works of course for $m=2$. In the case of more states there is no obvious way to generalize the procedure and we need to use a proper monte carlo method to get the dynamics.

A second way is to implement a probabilistic cellular automaton. In this case the transition table is generalized to a markov matrix: each input is now mapped not to a specific state but rather to a set of probabilities for a transition to each state ($m$ probabilities). Naturally for each input these sum to one. In this case we have $m$ times more parameters than before.