Last time we saw how to run some simple code on the gpu. Now let’s look at some particular aspects related to parallel programming we should be aware of. Since gpus are massively parallel processors, you’d expect you could write your kernel code for a single data piece, and by running enough copies of the kernel you’d be maximizing your device’s performance. Well, you’d be wrong! I’m going to focus on the three most obvious issues which could hamper your parallel code’s performance:
- Each of the individual cores is actually a vector processor, which means it can perform an operation on multiple numbers at a time.
- At some point the individual threads might need to write to the same position in memory (i.e. to accumulate a value). To make sure the result is correct, they need to take turns doing it, which means they spend time waiting for each other doing nothing.
- Most code is limited by memory bandwidth, not compute performance. This means that the gpu can’t get the data to the processing cores as fast as they can actually perform the computation required.