Wednesday, January 12, 2011

OpenCL, second iteration

OpenCL and CUDA evangelists will tell you about speedups of 5, 10, and even 100 times. What they don't say is that some of these case studies compare a CUDA port with a very dated FORTRAN implementation. Or that the algorithm now uses floats instead of doubles. Or that the previous implementation was not threaded at all. When you compare against an optimized, multithreaded CPU implementation, the gains are not so obvious.

It is difficult to achieve good performance in CUDA and OpenCL. It feels like getting all your ducks in a row: if anything falls out of line, you get worse performance than an optimized CPU implementation. And things fall out of line very easily.

First, you must be aware of how your algorithm will access memory. You must access your data in the same order the GPU fetches it, so that reads from neighboring workitems coalesce into as few memory transactions as possible. You may find yourself doing weird things, like using a space-filling curve to linearize access to data in multiple dimensions. Most likely you will need to use the workgroup's local memory to cache values each workitem would normally get from global memory.
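
For illustration, here is a minimal sketch of that local-memory caching pattern. The kernel and buffer names are made up, and the local buffer is assumed to be sized by the host at one float per workitem:

    // Hypothetical sketch: each work-group caches a tile of the density grid
    // in local memory so neighboring work-items don't re-read global memory.
    __kernel void smooth_density(__global const float* density,
                                 __global float* out,
                                 __local float* tile)   // one float per work-item
    {
        const int gid = get_global_id(0);
        const int lid = get_local_id(0);

        // Coalesced read: consecutive work-items load consecutive elements.
        tile[lid] = density[gid];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Neighbors now come from fast local memory instead of global memory.
        float left  = (lid > 0) ? tile[lid - 1] : tile[lid];
        float right = (lid < get_local_size(0) - 1) ? tile[lid + 1] : tile[lid];
        out[gid] = 0.25f * left + 0.5f * tile[lid] + 0.25f * right;
    }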

You must pick workload sizes that result in optimal processor occupancy. But be very careful here: all your local variables live in GPU registers. If you cram too many threads in at once you may run out of registers. Your code will still run, but the variables that did not fit in registers will spill to global memory, making performance orders of magnitude slower than a CPU implementation. Different cards have different numbers of registers, so if you want an optimal workload size, your implementation must become aware of the different cards out there.
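
The runtime can at least tell you what a given kernel supports on a given device, instead of hard-coding a size. A rough sketch on the host side; the helper name is made up, and `kernel` and `device` are assumed to have been created already:

    #include <CL/cl.h>

    /* Sketch: pick a work-group size the kernel can actually sustain on this
       device, rather than one that spills registers on some other card. */
    size_t pick_local_size(cl_kernel kernel, cl_device_id device)
    {
        size_t max_wg = 0, multiple = 0;

        clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                                 sizeof(max_wg), &max_wg, NULL);

        /* OpenCL 1.1+: typically the warp/wavefront size of the device. */
        clGetKernelWorkGroupInfo(kernel, device,
                                 CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                                 sizeof(multiple), &multiple, NULL);

        /* Largest multiple of the preferred size that still fits. */
        return (max_wg / multiple) * multiple;
    }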

And then you need to think about what your code actually does. It pays to use vector operations whenever possible, and to sacrifice precision when it doesn't really matter. Control flow instructions are very expensive, so use them only when there is no other choice. OpenCL and CUDA are SIMT (Single Instruction, Multiple Thread), which means all concurrent threads run the same instruction. Whenever your code branches, the processor has to serialize the different branches and run them one after the other. If you want to get around that, your code needs to become aware of processor wavefronts, or warps as they are called in CUDA.
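
A toy example of what that means in practice. The kernel is invented for illustration, but select() is a standard OpenCL built-in that lets every workitem follow the same instruction stream:

    // Hypothetical sketch: replacing a divergent branch with select().
    __kernel void classify(__global const float* density, __global float* result)
    {
        const int gid = get_global_id(0);
        const float inside  = 1.0f;
        const float outside = 0.0f;

        // Branchy version (diverging work-items get serialized):
        //   if (density[gid] < 0.0f) result[gid] = inside;
        //   else                     result[gid] = outside;

        // Branchless version: same result, no divergence.
        result[gid] = select(outside, inside, density[gid] < 0.0f);
    }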

As you see, you must be fully aware of the architecture and how your solution is being mapped into it. If you are a GPU developer who was doing generic computing using vertex and fragment shaders, you will find OpenCL and CUDA refreshing and liberating. If you are a CPU developer, well, I hope you enjoy learning new things.

On top of everything there is Amdahl's Law. This is just a way to compute how much parallelization will improve performance. It tells you that any portion of your code that must remain serial has a large impact on overall performance, no matter how many processors or threads are running the rest. Real-life solutions cannot be fully parallel. Often there are large fragments of your logic that you do not want to run in parallel, either because porting them would be expensive or because they are out of your control. So in some cases the massive parallelization offered by OpenCL and CUDA, even when exploited properly, improves very little.
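
For reference: if P is the fraction of the work that can run in parallel and N is the number of processors, the overall speedup is 1 / ((1 - P) + P / N). Even with just 10% of the work stuck in serial code, the speedup can never exceed 10x, no matter how many cores you throw at the rest.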

After my initial experiment porting Dual Contouring to OpenCL, I realized some improvements were necessary. My first attempt was the naive approach: use a regular grid and run the kernels on every element of the grid, regardless of whether it was part of the surface or not. The results were better than a single-threaded CPU implementation, but the speedup was only about two times.

For the second iteration, it was clear I needed to be selective about which elements to compute. This is how I did it:


  1. Run the density function for every point in the regular grid
  2. Classify which edges in the grid contain a sign change between endpoints
  3. Scan the edge classification grid and create a compact array of intersected edges
  4. Using the edge array, compute intersection point for each edge
  5. Compute the normal of the surface at each intersection point
  6. Classify which voxels in the grid have at least one intersection in one of its 12 edges
  7. Scan the voxel classification and create a compact array of voxels
  8. For each voxel in the compacted array, run the QEF solution


As you can see, that's a lot of kernels. Everything needs to be classified, scanned and compacted. The scanning alone may involve more than one call to a single kernel. If you are not clear about this operation, please check an earlier post: Scan or Die. As you may remember, the scan operation works in several passes over a tree of partial sums. This solution took more memory as well, since we need classification grids for the edges and voxels, and also the arrays to store only the edges and voxels that really need to be processed.
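
To make the classify-scan-compact pattern concrete, here is a rough sketch of how steps 2 and 3 can look as kernels. The buffer names and data layout are invented for the example, and edge_offsets is assumed to be the exclusive scan (prefix sum) of edge_flags produced in between:

    // Hypothetical sketch of step 2: flag edges whose endpoints change sign.
    __kernel void classify_edges(__global const float* density,  // one value per grid point
                                 __global const int2* edge_ends, // indices of the two endpoints
                                 __global int* edge_flags)       // 1 = crossing, 0 = no crossing
    {
        const int gid = get_global_id(0);
        const int2 e = edge_ends[gid];
        const float d0 = density[e.x];
        const float d1 = density[e.y];
        edge_flags[gid] = (d0 < 0.0f) != (d1 < 0.0f) ? 1 : 0;
    }

    // Hypothetical sketch of step 3: scatter flagged edges into a compact array,
    // using the exclusive scan of edge_flags as destination offsets.
    __kernel void compact_edges(__global const int* edge_flags,
                                __global const int* edge_offsets,
                                __global int* active_edges)      // compact list of edge indices
    {
        const int gid = get_global_id(0);
        if (edge_flags[gid])
            active_edges[edge_offsets[gid]] = gid;
    }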

The million dollar question is: was it faster? It was. It was really fast. This optimization hit a nerve, which was computing the surface normals. The density functions I was running were very expensive to evaluate. At any point the terrain had dozens of layers, each one with five to ten octaves of Perlin noise; in total, over a hundred octaves of Perlin noise for each point. To compute a single normal I had to evaluate the density function another three times. By being selective and computing normals only where they were needed, all that expensive evaluation went into points that actually contribute to the surface.
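
Roughly, the normal comes from the gradient of the density field taken by forward differences, something along these lines. Here field() is just a placeholder for the actual noise-based density function, and the helper name is made up:

    float field(float3 p);  // the expensive density function, defined elsewhere

    // Sketch: gradient by forward differences. With the density d at p already
    // known, each normal costs three extra evaluations of field(), which is why
    // computing normals only for the compacted intersection points pays off.
    float3 surface_normal(float3 p, float d, float eps)
    {
        const float dx = field(p + (float3)(eps, 0.0f, 0.0f)) - d;
        const float dy = field(p + (float3)(0.0f, eps, 0.0f)) - d;
        const float dz = field(p + (float3)(0.0f, 0.0f, eps)) - d;
        return normalize((float3)(dx, dy, dz));
    }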

But this was not the last iteration I did. As I had suspected, evaluating the density function was the bottleneck of the entire algorithm. I was still evaluating it for every point in the grid. Improving that took me to the next iteration, which may come with a surprise (or not, if you consider my earlier rant about OpenCL complexity, Amdahl et al).

6 comments:

  1. Hi there, I wanted to say (as others did) that this blog really caught me from the first time I came across it.
    This whole topic (voxel terrain generation) is damn interesting (I bought the GPU Gems 3 book because of Chapter 1), so I wanted to write a comment here :).

    After reading all of your blog entries I started thinking of creating something similar on my own :).

    Well.... so much more to say and ask, but I hope you will reveal everything in the next few days? or weeks? :D

    Thank you for this great blog!

    ReplyDelete
  2. I have been reading your blog for several months. I am quite interested in the Dual Contouring algorithm and hope to implement it on the GPU with CUDA. I also searched and read some papers. I found there are three main reasons why mapping the Dual Contouring algorithm to the GPU is not entirely straightforward.
    Firstly, it uses an Octree data structure to generate adaptively refined meshes; it requires a more elaborate cell neighbouring analysis to stitch individual isosurface elements together.
    Secondly, vertices are placed on the isosurface by minimizing a Quadric Error Function (QEF). This can be memory expensive, because it requires 10 floating point numbers per QEF.
    Thirdly, because this minimization requires a QR or SVD decomposition of the Error Function, its implementation is not as simple as the interpolation strategy used by Marching Cubes.

    You successfully implemented the Dual Contouring algorithm on the GPU. Could you briefly explain how to minimize the QEF on the GPU? What should I pay more attention to when I implement this algorithm on the GPU? I really hope to fully understand the algorithm and implement it myself in order to verify it.
    Look forward to your reply.

    ReplyDelete
    Replies
    1. Your questions go to the real issues. I have two solutions, one is 100% GPU, but it does not do adaptive sampling. The other solution, which is the one I use most, is a hybrid between GPU and CPU.

      Using octrees is not a problem as long as the octree is serialized. I use different kernels throughout the process. What the kernel sees is just several large vectors. A vector may contain just edge information, or field values, etc.

      For the adaptive version, I do not solve QEF in the GPU. I found the really expensive process was to compute the field function. Traversing the octree and evaluating the QEF is done in the CPU.

      For a non-adaptive grid, solving the QEF does not require additional memory, since you only need the edge crossings and you can construct the QEF matrix inside the kernel.

      Delete
    2. Dear Miguel Cepero,

      I have lots of questions to ask, but I am new to this area. Is it possible for you to send me some parts of your code please? I am really interested in the implementation of the QEF method on the GPU. For example, what is the density function? How do you define the density function? How do you calculate the normal of each intersection point? I think it is better for me to understand your code and do some implementation first before asking you more questions. Thank you so much. (funnyhukang at gmail)

      Delete
  3. It seems as if you didn't post anything about the actual polygon creation, so I wonder how you generate polygons out of the set of points your algorithm outputs, as the naive approach would be to iterate through all points and find their neighbors by iterating a second time over them. However, this method would be painfully slow.

    ReplyDelete
  4. Was there a sequel to this article? It seems the series abruptly ends here.

    ReplyDelete