Sunday, December 5, 2010

Scan or Die

Getting your head around parallel programming requires some re-adjustment. You will soon find that a naive approach rarely shows any real advantage. As I see it, there are two big issues to overcome.

First, parallel computing frameworks like OpenCL and CUDA bring some restrictions of their own, like how much memory each workgroup can have, how wavefronts are executed, etc. Just like with any other framework, you need to learn how to deal with those.

The second issue is more serious. It is the paradigm shift from serial computing to parallel computing. You need to rethink your approach to the task at hand. Algorithms that are trivial when executed serially may look very difficult to implement in parallel.

If you are serious about parallel programming, there is one tool you should really get to know. It is a simple tool, but it is so powerful that you will find yourself using it over and over again: "Scans", or as they are also known, the "Prefix Sum" operation. You should get this one straight from the source: Prefix Sums and Their Applications by Guy Blelloch. Anyway, I will provide a quick introduction to the method and why it was useful to me.

One of the simplest problems you can find in traditional programming is to compute the sum of all the values in an array. It is one of the first things you would learn in most introductory programming courses. The same problem in parallel computing seems a daunting task, if you want it to be at least as efficient as the serial version.

It is not difficult to see why. When done sequentially, the problem involves iterating over each element in the array and adding the current value to a temporary variable. Once you are done iterating, the sum of all elements is contained in the temporary variable.

When executed in parallel, having a single variable keep the sum of all elements is not practical or even feasible. Every thread would be trying to read and write this single variable at the same time. Unless you force all threads to wait for one another, the results cannot be trusted. And if threads are forced to wait, there is no point to the parallel approach; it would be slower.

How to solve this? The trick lies in the nature of the operation being performed, in this case the sum. If the operation is associative, meaning:

          (A + B) + C == A + (B + C)

then nothing really forces you to perform the operations in any particular sequence. You can break the problem down into multiple segments, perform the operation within each segment, and then perform it again over the results produced by the segments. Since the operation is associative, the results are guaranteed to be the same.

So if we needed to compute:

          V = A + B + C + D + E + F + G + H

You could do (A + B), (C + D), (E + F) and (G + H) in parallel:

          V1 = A + B
          V2 = C + D
          V3 = E + F
          V4 = G + H

Then you would do a second pass:

          V11 = V1 + V2
          V12 = V3 + V4

And then one last pass:

          V = V11 + V12

For the sake of argument let's assume each sum takes one millisecond. The serial approach would take seven milliseconds. The parallel approach would take only three, since all the additions in each pass happen at the same time.

In general this method requires breaking the problem down into a tree-like structure. Branches of the tree that are at the same level and do not depend on each other can be evaluated at the same time. Then the next level of the tree is processed, and so on, until we reach the root and obtain the final result.
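
To make this concrete, here is a rough sketch of one reduction pass written as an OpenCL kernel. This is not code from my implementation, just an illustration; the names are made up and it assumes the input size is a multiple of the workgroup size:

    __kernel void reduce_sum(__global const float* input,
                             __global float* partial_sums,
                             __local float* scratch)
    {
        size_t gid = get_global_id(0);
        size_t lid = get_local_id(0);

        // Each thread loads one element into fast local memory.
        scratch[lid] = input[gid];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Pairwise sums, halving the number of active threads each pass.
        for (size_t stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
            if (lid < stride)
                scratch[lid] += scratch[lid + stride];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        // One partial sum per workgroup remains.
        if (lid == 0)
            partial_sums[get_group_id(0)] = scratch[0];
    }

One dispatch of this kernel collapses several levels of the tree at once; the partial sums are then reduced again, either by another dispatch or on the CPU, until a single value remains.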

This can be extended to many other applications, even some that are not obvious, like parsing tokens, evaluating regular expressions, performing searches and sorting.

In my case it made a big difference. From the naive implementation of Dual Contouring it was clear I was wasting threads on elements that were not crossed by the surface at all. Instead of running expensive functions on every edge and voxel of the grid, it would be better to first select which edges were crossed by the surface and consider only those edges and their voxels.

This could be solved using the Scan algorithm. In its most common form, this algorithm not only returns the sum of all values in an array; for each element, it also gives you the sum of all the elements preceding it.

If you had:
          [3 1 7 0 4 1 6 3]

The scan would return:

          [0 3 4 11 11 15 16 22]
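
This is the "exclusive" flavor of the scan: each output is the sum of everything before it, with the first being zero. In serial C it is trivial; Blelloch's paper shows how to get the same result in O(log n) parallel passes:

    void exclusive_scan(const int* input, int* output, int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            output[i] = sum;   /* sum of all elements before position i */
            sum += input[i];
        }
    }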

This allows for a very nice trick. Imagine that in the first array we have an entry for each edge in the voxel grid. If the edge is crossed, a 1 is written in that location of the array. Otherwise a zero is written.

Imagine we have eight edges, of which only edges 3 and 5 are crossed. This is how the array would look:


          [0 0 0 1 0 1 0 0]

The Scan would return:


          [0 0 0 0 1 1 2 2]


If you look closely, you will see the new array can be used to compact the edge array. First, we know how many elements we need to allocate: 2, the number of ones in the edge array. Second, we can map each crossed edge to its slot in the compacted array. For that, we go over the elements whose flag is 1 and use the corresponding value in the Scan result as the index into the compacted array. In that location we write the element's own index. This would be the result:


          [3 5]


Which is a neat compact array that contains only the edges we want to look at. This compacting is itself suitable for parallelization, so most of the contouring logic can remain on the GPU.
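
A sketch of the compaction step, written serially in C for clarity; the names are hypothetical. Note that no iteration depends on any other: each flagged element owns a unique slot in the output, which is exactly what makes a one-thread-per-element GPU version safe:

    /* flags[i] is 1 if edge i is crossed; scan[] is the exclusive scan
       of flags[]. Each crossed edge writes its own index into its
       private slot of the compacted array, so no two writes collide. */
    int compact_edges(const int* flags, const int* scan, int n, int* compacted)
    {
        for (int i = 0; i < n; i++)
            if (flags[i])
                compacted[scan[i]] = i;
        /* Total count: last scan value plus the last flag. */
        return n > 0 ? scan[n - 1] + flags[n - 1] : 0;
    }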

In the next post I will describe how these compact arrays are used and how many times I ended up running the Scan function.

Sunday, November 28, 2010

OpenCL, first iteration

OpenCL is a general purpose computing framework. It is designed for parallel computing, so it runs very well on modern GPUs, which are parallel by nature. The whole idea is to have a single "job" executed by a large number of threads at the same time.

Voxels are a natural fit for OpenCL. Many voxel algorithms run locally, meaning the algorithm only needs to look at one or a few voxels at a time. There are no dependencies between a voxel and other voxels far away.

I already had an implementation of Dual Contouring for the CPU, which was a good starting point for a port to OpenCL. A straight port was out of the question; in almost every case OpenCL will require you to rethink your problem from a new angle. Still, having the traditional CPU implementation gave me a good idea of the issues I needed to address.

This first iteration was only to test the waters. I wanted to see if a naive approach would produce acceptable results. The bare minimum I could do was:

  1. For each point in the voxel grid, evaluate the 3D field function
  2. For each edge in the grid, see if the edge crossed the surface and compute the intersection point
  3. For each voxel in the grid, see if its edges had an intersection and compute the inner point for the voxel

I could then create the resulting mesh on the CPU. I would need to iterate through all the edges again; if an edge was crossed, I would output a quadrilateral connecting the inner points of its four neighboring voxels.

From the start I saw this would require running three different "jobs", or "kernels" as they are called in OpenCL-speak: one to evaluate the points in the grid, another to intersect the edges, and a final one to compute the inner points.

A kernel is like a traditional C function that takes multiple parameters. Some parameters are inputs, some are outputs. In most cases, arrays are used to move data in and out.

For instance, the first kernel, which evaluates the grid function, would take an array of floats as a parameter. It would compute one single value of the function and store it in the array, making sure the value went into the correct index. Why only one value? This is where the parallel way of thinking comes in: this same kernel runs for each point of the array, and hundreds of instances may actually be running at the same time. OpenCL will spawn as many instances of your kernel as necessary to cover the entire array.
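
A minimal sketch of such a kernel follows. The sphere function is only a placeholder for the real 3D field, and the flat buffer layout is an assumption:

    __kernel void evaluate_field(__global float* density,
                                 const int dim,        /* grid points per axis */
                                 const float cell_size)
    {
        size_t i = get_global_id(0);

        /* Recover the 3D grid coordinates from the flat thread index. */
        int x = i % dim;
        int y = (i / dim) % dim;
        int z = i / (dim * dim);
        float3 p = (float3)(x, y, z) * cell_size;

        /* Placeholder field: signed distance to a sphere in the middle
           of the grid. Negative means inside the volume. */
        float3 center = (float3)(dim * 0.5f * cell_size);
        density[i] = length(p - center) - 10.0f;
    }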

It did not take long to implement this version. The results were good, but it was only a few times faster than the CPU implementation. I knew this was not good enough. So what was the problem with the naive approach? It was not difficult to see: it was wasting a lot of processing power.

The CPU implementation, being serial by nature, had the chance to skip further processing of edges, voxels and points once it realized they were not crossed by the surface. For instance, it would compute the inner point of a voxel only if any of the voxel's edges was intersected by the surface. If there was no intersection, the next CPU cycles would be spent processing the next voxel.

The key word in the sentences above is "if".

With OpenCL, and parallel computing in general, "if" statements won't save you any processing unless you can guarantee the same outcome every time the statement is evaluated. Since every thread runs the same code in parallel, realizing that no further processing is needed and bailing out saves very little. It is already too late: a thread was allocated to the task anyway.

The naive approach was doing an awful lot of work that was not needed at all. If I wanted really complex field functions, with many octaves of Perlin noise or distance fields to point clouds for rocks and vegetation, I would have to do better.

I was also solving Quadratic Error Functions to compute the location of the points inside the voxels. This required a normal vector for each intersection point. The way to compute these normals is to take the gradient of the 3D field at the point, which meant evaluating the field function another four times. The amount of computation was staggering. I clearly needed to optimize the inputs of my OpenCL kernels so they would look only at the elements contributing to the surface.
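
For reference, the gradient is usually estimated with finite differences, something like the sketch below in OpenCL C. Here field() stands for the 3D field function, and the exact stencil I used may differ:

    float field(float3 p);   /* the 3D field function, defined elsewhere */

    float3 field_gradient(float3 p, float h)
    {
        /* Forward differences: the field at p plus three offset
           samples. Every normal costs several extra field evaluations. */
        float base = field(p);
        return (float3)(field(p + (float3)(h, 0.0f, 0.0f)) - base,
                        field(p + (float3)(0.0f, h, 0.0f)) - base,
                        field(p + (float3)(0.0f, 0.0f, h)) - base) / h;
    }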

This realization took me to the second iteration using OpenCL, which will be the subject of my next post.

Monday, November 22, 2010

Space Warps and Octrees

When you think about voxels, the image you see in your mind is that of evenly sized, stacked boxes:


This arrangement gives equal importance to every point in space. In practice, however, space is not always that interesting. Actually, most of it is either empty or entirely solid, and thus impossible to see. If all space is not the same, why should voxels be?

If we could have only voxels in regions of interest, then we would save a lot of memory and processing time. There are two different techniques to address this.

The first is Space Warps. In homogeneous space voxels are perfect cubes; each edge measures exactly the same. What would happen if we could place voxel corners anywhere we like?


While the total number of voxels did not change, they now cover the same space in a different way. Voxels in the center are smaller, voxels to the outside are bigger. Since the topology of the voxel grid did not change, it is guaranteed that the same algorithms that run in a regular grid will still run here. Marching Cubes and Dual Contouring, for instance, remain the same.

The advantage is that for the same processing cost and memory footprint, we now have more detail allocated to the voxels in the center. If we could pinch space around the features we consider interesting, and relax it around empty air or entirely solid regions, we would end up with improved triangle resolution at no cost.
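
Just to illustrate the idea, a simple radial warp could look like the sketch below, in OpenCL C. This is purely hypothetical; it is not the warp I actually tested:

    /* Pull grid corners toward a point of interest: cells near the
       center shrink, cells near the rim grow. Intended for points
       within the region of interest (r <= 1); k > 0 controls the pinch. */
    float3 warp(float3 p, float3 center, float radius, float k)
    {
        float3 d = p - center;
        float r = length(d) / radius;    /* 0 at the center, 1 at the rim */
        return center + d * pow(r, k);   /* rim points stay where they are */
    }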

The second technique is Octrees, or to be more specific: Sparse Voxel Octrees.

As we saw before, for most applications we are only interested in the surface of the voxels. That is where things stop being solid and the empty space begins. Octrees are ideal for representing that.

Above all, an octree is a way to divide space. You start with a box that encompasses everything you want to represent. If you need more detail in the box, let's say a surface is going through it, you divide it into eight boxes. Then you look at each of those boxes and ask the same question again: is there any detail in here? If there is, you break the box into another eight smaller boxes, and ask the question for each one of them.

Eventually you will have a lot of small boxes in the regions where there is detail, and a few large boxes that contain only air or solid. 
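
In code, the subdivision is a short recursion. This is just a sketch in C; has_detail() is a placeholder for the "is the surface going through this box?" test:

    #include <stdlib.h>

    typedef struct Node {
        float min[3];             /* corner of the cube this node covers */
        float size;               /* edge length of the cube             */
        struct Node* children[8]; /* NULL for leaves                     */
    } Node;

    int has_detail(const Node* n); /* placeholder predicate */

    void subdivide(Node* n, int depth, int max_depth)
    {
        if (depth >= max_depth || !has_detail(n))
            return; /* leaf: all air, all solid, or maximum resolution */

        float h = n->size * 0.5f;
        for (int i = 0; i < 8; i++) {
            Node* c = calloc(1, sizeof(Node));
            c->size = h;
            c->min[0] = n->min[0] + ((i >> 0) & 1) * h;
            c->min[1] = n->min[1] + ((i >> 1) & 1) * h;
            c->min[2] = n->min[2] + ((i >> 2) & 1) * h;
            n->children[i] = c;
            subdivide(c, depth + 1, max_depth);
        }
    }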

You can use Quadratic Error Functions (QEF) to simplify the boxes even further. If you find that the error of collapsing eight boxes into their parent box is small enough, you can just discard the smaller boxes.

Each terminal node in the octree, a leaf, can be considered a voxel. With small modifications, Dual Contouring can be run on those voxels. The main difference is that with a regular grid, Dual Contouring would run on every edge. With the octree, an edge that belongs to a larger voxel may also be shared by smaller voxels. Care must be taken that only the smallest edges are considered.

The results are quite impressive. Instead of producing a surface that is evenly covered by triangles, you now see that areas of interest have a lot of triangles while flat and less detailed areas only have a few.

So which of these techniques did I choose to implement? It was not a simple choice. I was not entirely happy with some distortions I was seeing in my tests with Space Warps. On the other hand, since I wanted to use OpenCL, I knew it would be cheap to compute many points at once, so a brute force method could end up being faster than using an octree. I was especially concerned about memory coherence if I moved the contouring to an octree.

In my future posts you will see which approach made the cut, and which one did not.


Friday, November 19, 2010

From Voxels to Polygons

Voxels are just a way to store 3D data. Somehow they need to be visualized. At this point you have two avenues: use volumetric rendering, or extract a surface out of the voxel data and display it using a traditional triangle mesh.

Assuming there is enough processing power on the client, the best option is probably ray tracing and volumetric rendering from a Sparse Voxel Octree. In the long run this may become the predominant method. With generic GPU computing frameworks like OpenCL and CUDA, a paradigm shift at the hardware level is not necessary anymore. The migration will happen as OpenCL/CUDA cards become the norm.

Content creation tools will catch up quickly. There are already some impressive tools that work directly on voxels. Check out 3D Coat if you haven't. Then it will be bye-bye to polygonal meshes, UV-Mapping, texture unwrapping and other oddities of the raster world.

But it is too soon to ditch polygons. What if you want your engine to run on mobile devices, tablets, WebGL, or even just plain old PCs? If that is your case, it makes more sense right now to keep voxels on the content-generation side and use traditional triangle meshes on the clients. That is what I intended to do.

As I described in an earlier post, I tried extracting the surface out of the voxels by relaxing the voxel grid until it became smooth. It produced fairly good results, but not good enough for every case. I needed to look at other methods for extracting surfaces. At this point I had given up on simplistic binary voxels where each cell would be entirely solid or entirely empty. I knew I had to keep different voxel densities.

Then there was Marching Cubes, the grand-daddy of all surface extraction algorithms. This method is so old that computers were really slow when it was invented. It is very simple to implement, and since it is all based on table lookups, it lends itself very well to GPU implementation.

The general idea is that for any voxel there is a fixed number of ways a surface may cross it. Based on the density values at the voxel corners, it is possible to know if the surface crosses the edge between two corners. There is a finite number of combinations, as the image below shows:


I borrowed this image from a nice article in GPU Gems 3. If you are interested in Marching Cubes and GPU terrain generation, this is one you should read.
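
The heart of the method fits in a few lines. As a sketch: the signs at the eight corners form an 8-bit index into the table of 256 cases, and each crossed edge is intersected by linear interpolation. The sign convention here, negative meaning inside, is an assumption:

    /* Build the case index from the densities at the voxel's 8 corners. */
    int mc_case_index(const float corner[8])
    {
        int index = 0;
        for (int i = 0; i < 8; i++)
            if (corner[i] < 0.0f)    /* corner is inside the volume */
                index |= 1 << i;
        return index;                /* 0..255, key into the triangle table */
    }

    /* Parameter t in [0,1] of the surface crossing along one edge, given
       densities d0 and d1 at its endpoints (they must differ in sign). */
    float edge_crossing(float d0, float d1)
    {
        return d0 / (d0 - d1);
    }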

Marching Cubes' biggest problem is that it cannot produce sharp features. Everything comes out of Marching Cubes looking blobby. Since I wanted sharp corners, especially for architecture, Marching Cubes was not a good option.

Why so keen on sharp corners? It is not only that they look good. If you remember my goals, I needed to stream geometry to clients. Having sharp features from the onset would allow the best simplification of the meshes sent to the client. Marching Cubes creates too much noise around sharp features; it would never translate into clean meshes.

The lack of sharp features in the resulting surface is caused by all the guessing Marching Cubes does. While it may compute the right intersection points along the edges, it knows nothing about the behavior of the surface inside the cube. The coarser the voxel grid, the larger this error becomes. You can see this in the following image:
I borrowed this image from another nice article, this time about isocontouring. It has everything you need to know, even source code.

What if we could know more about the surface's behavior inside the voxel and predict where it goes? If, in addition to the intersection point, we take into account the direction of the surface at that point, we can get a much better idea:

The first section (a) shows the output of Marching Cubes. If there had been a sharp corner in any of the voxels, it would have been smoothed out. The second section (b) shows that if you know the normal vector to the surface at the intersection point, you may still be able to detect a sharp corner inside the voxel.

Marching Cubes is not a good framework for adding this, since it can only consider points that lie along the voxel edges. In this case we clearly need to be able to position points anywhere inside the voxel's space. There is another method that does exactly this: Dual Contouring.

Dual Contouring looks at one edge of the voxel grid at a time. If the corners at the ends of the edge have different signs, meaning one is inside the volume and the other is outside, the surface crosses the edge.

Now, each edge has four neighbor voxels. If the surface crosses the edge, it necessarily crosses those four voxels as well. That means there is at least one point inside each of them that lies on the surface. If you then output a quadrilateral using the points of the four neighboring voxels, you will have a tiny square patch of the surface. Repeating this for every edge in the voxel data produces the entire surface.
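
A sketch of that quad emission, in C, for edges running along the x axis (the other two axes are symmetrical). The indexing helpers and the winding fix-up are assumptions:

    /* The edge from grid point (x,y,z) to (x+1,y,z) is shared by the
       four voxels (x,y-1,z-1), (x,y,z-1), (x,y,z) and (x,y-1,z).
       inner[], edge_crossed_x() and emit_quad() are placeholders. */
    enum { NX = 64, NY = 64 };
    typedef struct { float x, y, z; } point3;
    extern point3 inner[];                     /* one point per voxel  */
    int  edge_crossed_x(int x, int y, int z);  /* sign change on edge? */
    void emit_quad(point3 a, point3 b, point3 c, point3 d);
    #define V(x, y, z) ((x) + (y) * NX + (z) * NX * NY)

    void emit_x_edge_quad(int x, int y, int z)
    {
        if (!edge_crossed_x(x, y, z) || y == 0 || z == 0)
            return;
        /* Winding may need flipping depending on which end of the
           edge is inside the volume. */
        emit_quad(inner[V(x, y - 1, z - 1)],
                  inner[V(x, y,     z - 1)],
                  inner[V(x, y,     z    )],
                  inner[V(x, y - 1, z    )]);
    }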

You still need to find the position of this point for each voxel. The best way is to look at all the edges of the voxel and see which ones are intersected by the surface. Then, using the surface normal at each intersection point, you can predict where all these intersections converge inside the voxel. For that it is best to use Quadratic Error Functions (QEF). At this point you may have multiple planes crossing different edges of the voxel, and the QEF will find the point that is closest to the intersection of all these planes.
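
A minimal sketch of such a QEF in C: each intersection point p with normal n defines a plane, contributing n n^T to a 3x3 matrix and (n.p) n to a vector, and the sought point solves the resulting linear system. Real implementations typically solve it with an SVD and truncated singular values to handle degenerate cases; Cramer's rule below is just for brevity:

    #include <math.h>

    /* Start from a zero-initialized struct, then add one plane per
       crossed edge. */
    typedef struct { double A[3][3]; double b[3]; } QEF;

    void qef_add(QEF* q, const double p[3], const double n[3])
    {
        double d = n[0]*p[0] + n[1]*p[1] + n[2]*p[2];
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 3; j++)
                q->A[i][j] += n[i] * n[j];
            q->b[i] += d * n[i];
        }
    }

    static double det3(const double m[3][3])
    {
        return m[0][0] * (m[1][1]*m[2][2] - m[1][2]*m[2][1])
             - m[0][1] * (m[1][0]*m[2][2] - m[1][2]*m[2][0])
             + m[0][2] * (m[1][0]*m[2][1] - m[1][1]*m[2][0]);
    }

    /* Solve A x = b. Returns 0 when the planes do not pin down a point
       (e.g. all normals parallel); fall back to the voxel center then. */
    int qef_solve(const QEF* q, double x[3])
    {
        double d = det3(q->A);
        if (fabs(d) < 1e-12)
            return 0;
        for (int col = 0; col < 3; col++) {
            double t[3][3];
            for (int i = 0; i < 3; i++)
                for (int j = 0; j < 3; j++)
                    t[i][j] = (j == col) ? q->b[i] : q->A[i][j];
            x[col] = det3(t) / d;
        }
        return 1;
    }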

If you want more information on this method, you can check this article.

So I implemented my own Dual Contouring with QEF. The results were encouraging. The method was fast and still produced very nice sharp features. All the videos I have posted on YouTube were obtained using this method.

In future posts, I will describe how I represented everything in the world as a 3D field so I could extract surfaces using Dual Contouring. I will also describe how I moved the Dual Contouring implementation to the GPU using OpenCL. Stick around for that one, you may find it interesting.

Wednesday, November 17, 2010

Doughnuts and Coffee Mugs

My first attempts to extract a smooth surface out of the voxels had failed. Relaxation was not good enough to preserve the features of traditional meshes once they were converted into voxels. Maybe it was a sign that I should move away from voxels?

There was another advantage to voxels that was key for the results I had envisioned. I will take a break from the surface extraction topic to discuss it.

It is mainly about doughnuts and coffee mugs. But I will come to them later.

The number of pixels on the screen is constant. Objects, when viewed in perspective, appear at different sizes depending on how close they are. A rock next to the viewer may take 1000 pixels on the screen. The same rock viewed from a distance may be only 10 pixels. It would take 100 rocks of the same size to fill the screen space a single rock occupied when viewed up close. In a perspective view, objects multiply with distance.

This is a big problem for most 3D engines. There are a few tricks for it, like a dense fog that eats everything in the distance, or popping objects in when they are close enough. In most cases engines use a Level Of Detail (LOD) system, which switches to low-resolution models for objects far away. Still, this is not enough to preserve detail at large distances.

I wanted to have the horizon line very far away, and I wanted all the complexity of the world visible even at that distance. If there was a village a couple of miles away, I wanted the windows in the houses to appear as tiny dots, and still be able to see all the walls and roofs for the entire village.

Current terrain engines are able to display very large expanses of land using a technique called Clipmaps. The scene around the viewer is broken into multiple tiles. The tiles get bigger depending on how far they are from the viewer. A tile next to the viewer may be just a few square meters, but a tile next to the horizon line may be kilometers wide.



The nice thing about clipmaps is that each tile, no matter where it is located, has roughly the same amount of information. Due to the perspective projection, tiles far away are shrunk. Even though they cover a larger expanse of terrain, they end up taking the same space on screen as nearby tiles. As a result, it is possible to cover very large areas at a fairly constant degree of detail.

This technique is normally restricted to the terrain. Any objects on top of it, like vegetation and architecture, are not part of the clipmap. This is where the traditional LOD approach comes in, but it is not good enough to handle all the instances necessary for a distant horizon line. The distance to the horizon is severely restricted, and the benefits of the terrain clipmap go to waste.

I wanted to use clipmaps for all of the virtual world's geometry. The classical clipmap algorithm is 2D; terrain is usually represented as a heightmap. In my case I would need to extend clipmaps to 3D. There are a few non-trivial issues there, like how to generate seams between tiles of different resolution, but I will address them in a future post.

I set my tile resolution at 64x64x64. Each unit would be a decimeter for the tiles closest to the viewer, giving tiles of 6.4 meters. The tiles at the next level would be twice that length, 12.8 meters. If I wanted the horizon at 5 km, some of the tiles would be over a kilometer in length.
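
In general, a tile at clipmap level n measures 64 x 0.1 x 2^n = 6.4 x 2^n meters per side, so level 8 tiles are already about 1.6 kilometers wide; it only takes a few levels to push the horizon out to 5 km.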

This all sounds very nice in theory. Now remember that each tile, no matter its size, must have roughly the same amount of information. I could easily see my small 6.4 meter tile fitting in a budget of 1000 triangles, but how could I fit an entire cubic kilometer of my world into 1000 triangles and still make it look good?

This is where the doughnuts and mugs come into play. Traditional mesh optimization methods can greatly reduce the number of polygons in a mesh, but they either cannot simplify the topology of the mesh at all, or are not very good at it. An example will help:

Let's assume we have a large wall with a small window. The window in this case is just a square hole in the wall. Traditional mesh simplification methods collapse edges in the mesh until the desired triangle count is reached. Still, no matter how much you crank the optimization up, the algorithm will preserve the hole in the wall. If the algorithm could remove the hole (it is a tiny hole anyway), it could save a lot of triangles.

The problem here is that the topology of a wall is different from the topology of a wall with a hole. It is not like the doughnut and the mug, which are topologically identical:

This is what mathematicians call the genus of the object, which is probably a fancy way of saying how many holes a solid has. Traditional mesh optimization methods cannot cross that barrier; they cannot close holes in the solid to make the object simpler. There are other mesh optimization methods that use clustering of vertices. While those do modify the topology of the mesh, it is hard to control the results.

But if we optimize in voxel space, it is a different story. Voxels can be filtered, summed up and transformed in many advantageous ways. In the example of the large wall with the small window, the voxels of the wall would eventually take over the space of the window, naturally filling up the hole.

I knew voxels would provide an edge when it was time to synthesize the larger tiles of my clipmap. I just needed a better way to extract the surfaces out of them.