## Parallel programming with opencl and python: parallel scan

This post will continue the subject of how to implement common algorithms in a parallel processor, which I started to discuss here. Today we come to the second pattern, the scan. An example is the cumulative sum, where you iterate over an array and calculate the sum of all elements up to the current one. Like reduce, the algorithm we’ll talk about is not exclusive for the sum operation, but for any binary associative operation (the max is another common example). There are two ways to do a parallel scan:  the hills steele scan, which needs $\log n$ steps; and the blelloch scan, requiring $2 \log n$ steps. The blelloch scan is useful if you have more data than processors, because it only needs to do $\mathcal{O}(n)$ operations (this quantity is also referred to as the work efficiency); while the hillis steele scan needs $\mathcal{O}(n \log n)$ operations. So let’s look at how to implement both of them with opencl kernels.

## Parallel programming with opencl and python: parallel reduce

Once you know how to use python to run opencl kernels on your device (read Part I and Part II of this series) you need to start thinking about the programming patterns you will use. While many tasks are inherently parallel (like calculating the value of a function for N different values) and you can just straightforwardly run N copies on your processors, most interesting tasks involve dependencies in the data. For instance if you want to simply sum N numbers in the simplest possible way, the thread doing the summing needs to know about all N numbers, so you can only run one thread, leaving most of your cores unused. So what we need to come up with are clever ways to decompose the problem into individual parts which can be run in parallel, and then combine them all in the end.

## Parallel programming with opencl and python: vectors and concurrency

Last time we saw how to run some simple code on the gpu. Now let’s look at some particular aspects related to parallel programming we should be aware of. Since gpus are massively parallel processors, you’d expect you could write your kernel code for a single data piece, and by running enough copies of the kernel you’d be maximizing your device’s performance. Well, you’d be wrong! I’m going to focus on the three most obvious issues which could hamper your parallel code’s performance:

• Each of the individual cores is actually a vector processor, which means it can perform an operation on multiple numbers at a time.
• At some point the individual threads might need to write to the same position in memory (i.e. to accumulate a value). To make sure the result is correct, they need to take turns doing it, which means they spend time waiting for each other doing nothing.
• Most code is limited by memory bandwidth, not compute performance. This means that the gpu can’t get the data to the processing cores as fast as they can actually perform the computation required.

## Solving stochastic differential equations with theano

In systems biology we often need to solve diffusion equations of the type $$df = f(x,t) dt + g(x,t)dW$$ where W is a white noise process; they’re the most common example of a stochastic differential equation (SDE). There are only very few cases for which we can analytically solve this equation, such as when either f or g are constant or just depend linearly on x. Of course most interesting cases involve complicated f and g functions, so we need to solve them numerically.

One way to solve this is to use a straightforward variant of the Euler method. You just need to do the usual time discretization $\delta t = T/N$ (with T the total time and N the number of steps) and then update your current value with the deterministic contribution $\delta t \times f$ and the stochastic contribution $\sqrt{\delta t} \times g$. If you are not familiar with stochastic calculus you may be wondering why is the time step multiplier $\sqrt{\delta t}$ for the stochastic part.

## Implementing a recurrent neural network in python

In one of my recent projects I was interested in learning a regression for a quite complicated data set (I will detail the model in a later post, for now suffice to say it is a high dimensional time series). The goal is to have a model which given an input time series and an initial condition is able to predict the output at subsequent times. One good tool to tackle this problem is the recurrent neural network. Let’s look at how it works and how to implement it easily in python using the excellent theano library.

A simple feed forward neural network consists of several layers of neurons: units which sum up the input from the previous layer and a constant bias and pass it through a nonlinear function (usually a sigmoid). Neural networks of this kind are known to be universal function approximators (i.e. for an arbitrary number of layers and/or neurons you can approximate any function sufficiently well). This means that when you don’t have an explicit probabilistic model for your data but just want to find a nonparametric model for the input output relation a neural network is (in theory, not necessarily in practice) a great choice. Continue reading “Implementing a recurrent neural network in python”