First, parallel computing frameworks like OpenCL and CUDA bring some restrictions of their own, like limits on how much memory each workgroup can have, how wavefronts are executed, and so on. Just like with any other framework, you need to learn how to deal with those.
The second issue is more serious: the paradigm shift from serial computing to parallel computing. You need to rethink your approach to the task at hand. Algorithms that are trivial when executed serially may look very difficult to implement in parallel.
If you are serious about parallel programming, there is one tool you should really get to know. It is a simple tool, but it is so powerful that you will find yourself using it over and over again: the "Scan", also known as the "Prefix Sum" operation. You should get this one straight from the source: Prefix Sums and Their Applications by Guy Blelloch. Anyway, I will provide a quick introduction to the method and explain why it was useful to me.
One of the simplest problems you can find in traditional programming is to compute the sum of all the values in one array. This is one of the first things you would learn in most introductory courses to programming. Now, the same problem in parallel computing seems a daunting task, if you want it to be at least as efficient as the serial version.
It is not difficult to see why. When done sequentially, the problem involves iterating over each element in the array and adding the current value to a temporary variable. Once you are done iterating, the sum of all elements is contained in the temporary variable.
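The serial version described above is just a loop with a single accumulator. A minimal Python sketch, purely for illustration:

```python
# Serial sum: iterate once over the array, accumulating into one
# temporary variable. The final total is the sum of all elements.
def serial_sum(values):
    total = 0
    for v in values:
        total += v
    return total

print(serial_sum([3, 1, 7, 0, 4, 1, 6, 3]))  # 25
```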
When executed in parallel, having a single variable to keep the sum of all elements is not practical or even feasible. Every thread would be trying to read and write this single variable at the same time. Unless you force all threads to wait for one another, the results cannot be trusted. And if threads are forced to wait, there is no point to the parallel approach; it would be slower than the serial one.
How do you solve this? The trick lies in the nature of the operation being performed, in this case the sum. If the operation is associative, meaning:
(A + B) + C == A + (B + C)
Then nothing is really forcing you to perform it in any particular sequence. You can break the problem down into multiple segments. You perform the operation within each segment, then over the results produced by those segments. Since the operation is associative, the results are guaranteed to be the same.
So if we needed to compute:
V = A + B + C + D + E + F + G + H
You could do (A + B), (C + D), (E + F) and (G + H) in parallel:
V1 = A + B
V2 = C + D
V3 = E + F
V4 = G + H
Then you would do a second pass:
V11 = V1 + V2
V12 = V3 + V4
And then one last pass:
V = V11 + V12
For the sake of argument, let's assume each sum takes a millisecond. The serial approach would take seven milliseconds. The parallel approach would take only three, since all the additions within each pass happen at the same time.
In general this method requires breaking the problem down into a tree-like structure. Branches of the tree that are at the same level and do not depend on each other can be evaluated at the same time. Then the next level of the tree is processed, and so on, until we reach the root and obtain the final result.
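This tree reduction can be sketched in a few lines of serial Python. Each pass of the loop corresponds to one level of the tree; on a GPU, every pair within a pass would be handled by a separate thread:

```python
# Tree reduction: repeatedly sum adjacent pairs until one value remains.
# Each iteration of the while loop is one level of the tree.
def tree_sum(values):
    while len(values) > 1:
        if len(values) % 2:            # pad odd-length levels with a zero
            values = values + [0]
        values = [values[i] + values[i + 1]
                  for i in range(0, len(values), 2)]
    return values[0]

print(tree_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # 36, in three passes
```

This serial sketch still performs the same seven additions, of course; the point is that the passes are independent, so a parallel version can collapse each level into a single step.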
This can be extended to many other applications, even some that are not obvious, like parsing tokens, evaluating regular expressions, performing searches and sorting.
In my case it made a big difference. From the naive implementation of Dual Contouring it was clear I was wasting threads on elements that were not crossed by a surface at all. Instead of running expensive functions on every edge and voxel of the grid, it would be better to first select which edges were crossed by the surface and consider only those edges and their voxels.
This could be solved using the Scan algorithm. In its most common form, this algorithm not only returns the sum of all values in an array, it also computes, for each element in the array, the sum of all preceding elements.
If you had:
[3 1 7 0 4 1 6 3]
The scan would return:
[0 3 4 11 11 15 16 22]
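A serial reference version of this exclusive scan is straightforward (the parallel GPU formulation, covered in Blelloch's paper, is more involved):

```python
# Exclusive prefix scan: output[i] is the sum of all elements before i.
def exclusive_scan(values):
    out = []
    running = 0
    for v in values:
        out.append(running)  # sum of everything preceding this element
        running += v
    return out

print(exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# [0, 3, 4, 11, 11, 15, 16, 22]
```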
This allows for a very nice trick. Imagine an array with one entry for each edge in the voxel grid. If the edge is crossed by the surface, a 1 is written in that location; otherwise a zero is written.
Imagine we have eight edges, of which only edges 3 and 5 are crossed. This is how the array would look:
[0 0 0 1 0 1 0 0]
The Scan would return:
[0 0 0 0 1 1 2 2]
If you look closely, you will see the new array can be used to compact the edge array. First, we know how many elements we need to allocate: 2, the number of ones in the edge array. Second, we can map each element of the compacted array back to the elements in the first array. For each crossed edge, we take the value at the same position in the Scan result and use it as the index into the compacted array, writing the current edge index at that location. This would be the result:

[3 5]
This is a neat compact array that contains only the edges we want to look at. The compaction itself is also suitable for parallelization, so most of the contouring logic can remain on the GPU.
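Putting the pieces together, the compaction step can be sketched like this (serial Python for clarity; on the GPU each edge would be handled by its own thread):

```python
# Stream compaction via scan. crossed[i] is 1 when edge i is crossed
# by the surface; the result lists the indices of the crossed edges.
def compact_edges(crossed):
    # Exclusive scan of the flags assigns each crossed edge its slot.
    scan, running = [], 0
    for flag in crossed:
        scan.append(running)
        running += flag
    out = [0] * running                 # 'running' ends as the total count
    for i, flag in enumerate(crossed):
        if flag:
            out[scan[i]] = i            # scan value is the compacted index
    return out

print(compact_edges([0, 0, 0, 1, 0, 1, 0, 0]))  # [3, 5]
```

Note that the scatter loop has no write conflicts: the scan guarantees every crossed edge gets a distinct output index, which is exactly why this step parallelizes cleanly.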
In the next post I will describe how these compact arrays are used and how many times I ended up running the Scan function.