
Tuesday, March 18, 2014

Parallel Concerns

Computers continue to increase in power, pretty much at the same rate as before. They double their performance every 18 to 24 months. This trend is known as Moore's law.  The leaders of the microprocessor industry swear they see no end to this law in the 21st century, that everything is fine. Some others say this will come to an end around 2020, but with some luck we should be able to find another exponential to ride.

Regardless of who you believe, you should be wary of the fine print. Moore's law is about transistor count; it says nothing about how fast they operate. Since the sixties, programmers grew used to seeing their algorithms improve in performance with every new hardware generation. You could count on Moore's law to make your software run twice as fast every 18 months. No changes were needed, not even recompiling your code. The same program would just run faster.

This way of thinking came to an end around 2005. Clock frequencies hit a physical limit. Since then you cannot compute any faster, you can only compute more. To achieve faster results the only option is to compute in parallel. 

Chip-makers try not to make a big deal out of this, but for programmers there is a world of difference. If they were car-makers, it would be as if their cars had reached a physical limit of 60 miles per hour. Asked how you could do 120 miles per hour, they would suggest you take two cars.


Many programmers today ignore this monumental shift. It is often because they produce code for platforms that already deal with parallelization, like database engines or web servers. Or they work on user interfaces, where a single thread of code is already fast enough to outpace humans.

Procedural generation requires all the processing power you can get. Any algorithms you design will have to run in this new age of computing. You have no choice but to make them run in parallel.

Parallel computing is an old subject. For many years now programmers have been trying to find a silver-bullet approach that will take a simple serial algorithm and re-shuffle it so it can run in parallel. None has succeeded. You simply cannot ignore parallelization and rely on wishful thinking. Your algorithms have to be designed with it in mind. What is worse, your algorithm design will depend on the hardware you choose, because parallel means something different depending on where you go.

This post is about my experience with different parallel platforms.

In my view, today there are three main hardware platforms worth considering. The first one is the traditional CPU, like the x86 series by Intel. Multicore CPUs are now common, and the number of cores may still grow exponentially. Ten years from now we could have hundreds of them. If you manage to break your problem into many different chunks, you can feed each chunk to an individual core and have them run in parallel. As the number of cores grows, your program will run faster.

Let's say you want to generate procedural terrain. You could break the terrain into regular chunks by using an Octree, then process many Octree cells in parallel by feeding them to the available cores.
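
If it helps to picture it, here is a minimal sketch in C using OpenMP. The cell type and the per-cell function are made up for the example; the point is that each cell is an independent chunk of work handed to whatever core is free:

typedef struct OctreeCell OctreeCell;            /* hypothetical cell type */
void GenerateTerrainForCell(OctreeCell* cell);   /* hypothetical per-cell job */

/* Each cell is independent, so OpenMP can hand them to the available
   cores. The code inside each cell stays completely serial. */
void GenerateTerrain(OctreeCell** cells, int cellCount)
{
    #pragma omp parallel for
    for (int i = 0; i < cellCount; i++)
        GenerateTerrainForCell(cells[i]);
}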

The x86 platform has the nicest, most mature development tools. Also, since the number of concurrent threads is not very high, the parallelization effort centers on breaking the work into large chunks and sewing the results back together. Most of the actual computation remains serial. This is a bonus: anyone can write stable serial code, but you will pay dearly for good parallel programmers. The more old-fashioned serial code you can write within your fully parallel solution, the better.

Being so oriented towards serial, generic code is also the weakness of traditional CPUs. They devote a great deal of transistors and energy to dealing with the unpredictability of generic algorithms. If your logic is very simple and linear, all this silicon goes to waste.

The second hardware platform is the GPU. These are the video cards powering games in PCs and consoles. Graphics rendering is highly parallel. The image you see on screen is a matrix of pixels, where each pixel's color can be computed mostly in isolation from the rest. Video cards have evolved around this principle. They allow hundreds of concurrent threads to run in parallel, each one devoted to producing one fragment of the final image. Compared to today's average CPUs, which allow only eight simultaneous threads, hundreds of threads may seem a bonanza.

The catch is all these GPU threads are required to run the same instruction at the same time. Imagine you have 100 dancers on stage. Each dancer must execute exactly the same steps. If just one dancer deviates from the routine, the other dancers stop and wait.


Perfect synchronization is desirable in ballet, not so much in general-purpose computing. A single "if" statement in the code can be enough to create divergence. What often happens when an "if" is encountered is that both branches are executed by all threads, then the results of the branch that was not supposed to run are discarded. It is like some sort of weird quantum reality where all alternate scenarios do happen, but only the outcome of one is picked afterwards. The cat is really both dead and alive in your GPU.

Loops with a variable number of iterations are a problem too. If 99 of the dancers spin twice and the one remaining dancer -for some inexplicable reason- decides to spin forty times, the 99 dancers won't be able to do anything else until the rogue dancer is done. The execution time is the same as if all the dancers had spun forty times.

So programming mainstream GPUs is great as long as you can avoid loops and conditional statements. This sounds absurd, but with the right mindset it is very much possible. The speedup compared to a multithreaded CPU implementation may be significant.

There are some frameworks that allow general purpose programs to run on GPUs. CUDA is probably the best known. It is deceptively simple. You write a single function in a language almost identical to C. Each one of the many threads in the GPU will run this function at the same time, but each one will input and output data from a different location. Let's say you have two large arrays of the same size, A and B. You want to compute array C as the sum of these two arrays. To solve this using CUDA you would write a function that looks like this:

void Add(const float A[], const float B[], float C[])
{
    int i = getThreadIndex();  // each thread gets a different index
    C[i] = A[i] + B[i];        // so each thread adds exactly one element
}
This is pseudocode; the actual CUDA code would be a bit different, but the idea is the same. This function processes only one element of the arrays. To have the entire array processed, you need to ask CUDA to spawn a thread for each element.

One big drawback of CUDA is that it is a proprietary framework. It is owned by Nvidia and so far it is limited to their hardware. This means you cannot run CUDA programs on AMD GPUs.

An alternative to CUDA is OpenCL. OpenCL was proposed by Apple, but it is an open standard like OpenGL. It is almost identical to CUDA, maybe a tad more verbose, but for a good reason: not only does it run on both Nvidia and AMD GPUs, it also runs on CPUs. This is great news for developers. You can write code that will use all the computing resources available.

Even with these frameworks to aid you, GPU programming requires a whole different way of thinking. You need to address your problem in a way that can be digested by this swarm of rather stupid worker threads. You will need to come up with the right choreography for them, otherwise they will wander aimlessly, scratching their heads. And there is one big skeleton in the closet. It is easy to write programs that run on the GPU, but it is hard to make full use of it. Often the bottleneck is between the GPU and main memory. It takes time to feed data and read results. Adding two arrays in the naive form, as shown in the example before, would spend 99% of the time moving data and 1% doing the actual operation. As the arrays grow in size, the GPU performs poorly compared to a CPU implementation.

So which platform should you target, CPUs or GPUs?

Soon it may not be so clear-cut anymore. CPUs and GPUs are starting to converge. CPUs may soon include a higher number of less intelligent cores. They will not be very good at running unpredictable generic code, but will do great with linear tasks. On the other side, GPUs will get better at handling conditionals and loops. And new architectures are already improving the bandwidth issues. Still, it will take a few years for all this to happen and for this hardware to become mainstream.

If you were building a procedural supercomputer today, it would make sense to use a lot of GPUs. You would get more done for a smaller electricity bill. You could stick to Nvidia hardware and hire a few brainiacs to develop your algorithms in CUDA.

If you want to crowd-source your computing and have your logic run by your users, GPUs also make a lot of sense. But then using a general-purpose computing framework like CUDA or OpenCL may not be a good idea. In my experience you cannot trust everyone to have the right stack of drivers you will require. In the absence of a general computing GPU framework you would need to use graphics rendering APIs like DirectX and OpenGL to perform general-purpose computing. There is a lot of literature and research on GPGPU (General-Purpose computing on GPUs), but this path is not for the faint of heart. Things will get messy.

On the other hand, CPUs are very cheap and easy to program. It is probably better to get a lot of them running not-so-brilliant code, which after all is easier to produce and maintain. As often happens, hardware excellence does not prevail. You may achieve a lot more just because of how developer-friendly the platform is and how easy it will be to port your code.

This brings us to the third parallel platform, which is a network of computers (aka the cloud). Imagine you get hundreds of PCs talking to each other over a Local Area Network. Individually they could be rather mediocre systems. They will be highly unreliable: hard drives can stop working anytime, power supply units get busted without warning. Still, as a collective they could be cheaper, faster and simpler to program than anything else.

Sunday, May 22, 2011

Hello Worley

If you are into procedural things you have probably heard about Perlin noise. This noise function can be used to recreate many of the patterns you find in nature, like clouds, large mountain ranges and marble. While it is not the only way to produce them, it is probably the most efficient way to achieve very good-looking results.

There is another type of noise that is as useful as Perlin noise: Worley noise, also called cellular noise. Here you can see how it looks compared to Perlin:

Perlin is great for phenomena that originated from mixing different entities, like how moisture permeates a wall or mold grows on a tree. Worley is best for self-organizing patterns, things that break down into clear boundaries like cells, pebbles and even man-made objects like a stone wall.

The idea behind this noise function is simple. The space is scattered with random feature points, and for any point in space the function determines how this point relates to the nearby feature points.

Unlike Perlin's function, which you may take as a magical black box, it helps to learn the implementation details of the Worley function. Many interesting effects can be achieved by using different criteria. One example would be to return the distance to the closest random point. This is the criterion used in the image before. As you can see, it outputs the Voronoi diagram for the field.
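
In case you want to play with it, here is a minimal sketch of that closest-point criterion in OpenCL C. The hash-based point placement is just a stand-in of my own; Worley's original generator is more elaborate:

/* Hypothetical hash: places one pseudo-random feature point per unit cell. */
float3 featurePoint(int3 cell)
{
    uint h = (uint)(cell.x) * 73856093u ^ (uint)(cell.y) * 19349663u
           ^ (uint)(cell.z) * 83492791u;
    float3 f;
    h = h * 1664525u + 1013904223u; f.x = (h >> 8) / 16777216.0f;
    h = h * 1664525u + 1013904223u; f.y = (h >> 8) / 16777216.0f;
    h = h * 1664525u + 1013904223u; f.z = (h >> 8) / 16777216.0f;
    return convert_float3(cell) + f;
}

/* F1 Worley: distance from p to the closest feature point in the 27
   cells surrounding p. */
float worleyF1(float3 p)
{
    int3 base = convert_int3(floor(p));
    float best = MAXFLOAT;
    for (int z = -1; z <= 1; z++)
        for (int y = -1; y <= 1; y++)
            for (int x = -1; x <= 1; x++)
                best = min(best, distance(p, featurePoint(base + (int3)(x, y, z))));
    return best;
}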

I had delayed including Worley noise in my terrain generator for too long. All the screenshots and videos I had posted so far were produced using only some form of Perlin-like noise. It was a long weekend, so while the wife and twins were taking a nap I pulled out the holy book for procedural content and went straight to Worley's chapter.

It only took a few hours to have his noise ported to OpenCL. Here are screenshots of a large rock formation out of Worley noise:


Here it can be seen in more detail:



Worley's algorithm is perfectly suited for parallelization. I only changed the criteria for which neighbor cells are visited. For a CPU implementation, Worley goes into an elaborate scheme to avoid testing points inside a cell if it can be discarded early. My parallel superstitious instinct told me this type of optimization would not carry well into OpenCL. If at least one of the points in the current wavefront hits the worst-case scenario, then it is as if all the points were worst-case too. So I removed all conditions and went with the brute-force approach. I do not have any numbers to back this up, however. I could be wrong.

Now I have this new toy and I keep thinking of shiny things I could build. It is that warm feeling a new noise function will give you.

Saturday, April 9, 2011

OpenCL Voxelization

If you are going down the voxel way, odds are at some point you will need to include polygonal assets in your workflow.

For me it happened when I integrated the output of the architecture L-System into the voxel world. Architecture is produced as a collection of polygonal volumes, so I needed to translate those volumes into voxels.

As usual, it had to be fast. It also needed to be accurate: sharp features in the original polygon mesh should appear in the voxelized version.

There are several approaches to voxelization. One is to use the GPU to render slices of the polygonal object and then construct the voxel information out of the slices. Another way to do it is to compute a 3D distance field, that is, the distance from each point in space to the closest polygon surface. In GPU Gems 3 there is a nice implementation of this method. And then there is the classical method, which is to shoot rays at the polygonal solid and construct the voxels from the intersections.

I chose to implement a flavor of the classical method. The clean intersection points it produced would help later when reconstructing the surface. I also suspected the ray-tracing nature of this approach would translate well into OpenCL or CUDA.

In particular, my method is inspired by this one (PDF link). The article is very detailed, but if you want to skip reading it, here is the basic idea:

It uses a regular grid. Each element in the grid is a voxel. One of the coordinate planes is used to build a Quad-tree. All the triangles in the polygonal mesh are projected onto this plane and inserted into the Quad-tree. I stole an image from the article above that illustrates this phase:


In this image all the triangles are indexed in a Quad-tree aligned to the XZ plane (assuming Y goes up).

The next step is to shoot rays perpendicular to the Quad-tree plane. Each ray is intersected with all the triangles found in that cell of the Quad-tree. This is actually why a Quad-tree is used: it allows testing only the triangles that could potentially intersect the ray.

Each ray may intersect several triangles along its path. For each voxel visited by the ray, the algorithm must find a way to tell whether it is inside or outside the solid. This is actually simple. If we assume the polygonal volume is closed, then it is possible to count how many intersections we passed before getting to the current voxel. If the number is odd, the voxel is inside the volume; if it is even, it is outside.

This other image, also from the same article, shows this principle:



And that's it. Once you have sent enough rays to cover the grid, the voxelized solid is there.

This method, however, could not produce the results I needed. It only tells you whether a voxel is solid or not; I needed a 3D density field. I also wanted it to preserve the sharp corners in the original mesh, which is something this method doesn't consider. And last but not least, this algorithm is not very fast. Ray-casting on the CPU quickly becomes expensive as grid resolution goes up.

This is what I did:

First, instead of having just one Quad-tree, I built three of them, one for each coordinate plane. I realized I had to cast rays along the three main axes instead of just one as the original method does.

Why? The original method only cares about voxel occupancy. In my case I needed to record the normals at the intersection points. If rays are shot only in the Y direction, any polygons perpendicular to the XZ plane would never be intersected, and their normals would remain unknown. And it is not only that. For a sharp edge to be reconstructed, you need at least two intersections inside a voxel. To reconstruct a vertex, you need three rays.

The following screenshot shows how sharp features are reconstructed. This is a cube that is not aligned with any of the coordinate axes, yet its sharp features are preserved.



The Quad-trees are built on the CPU as a pre-processing phase. It takes less than 1% of the entire cost of the voxelization, so it is something I won't move into the GPU soon.

Next, the OpenCL voxelization kernel runs. Each instance of the kernel processes a single ray. Before invoking the kernel I make sure all the rays are packaged in sequence, regardless of their direction.

The ray has access to the corresponding Quad-tree so it can iterate through the list of triangles in that cell of the Quad-tree. For each triangle it will test if there is an intersection. If that is the case, it writes the coordinates to an array.

Then comes a twist. The voxelization method I described before relies on counting how many intersections were found up to the current voxel. For this to work, the list of intersections needs to be sorted. Sorting the intersections in the CPU is no problem, but inside the OpenCL kernel it could get messy very quickly. How to get around this?

I realized that it was still possible to determine if a voxel was inside or outside without having to sort the intersections. The trick is to "flip" the occupancy state of all voxels preceding the current intersection point. If the count is odd, the voxel will flip an odd number of times and remain set at the end. If the count is even, the voxel will finish empty.
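
Here is a minimal sketch of the trick in C, for a single ray stored as one byte per voxel (the names are made up):

/* Toggle every voxel before the intersection point. Intersections can
   arrive in any order: inside voxels end up flipped an odd number of
   times (solid), outside voxels an even number of times (empty). */
void applyIntersection(unsigned char* rayVoxels, int intersectionIndex)
{
    for (int i = 0; i < intersectionIndex; i++)
        rayVoxels[i] ^= 1;
}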

You can see this in the following series of images. The intersection points are marked in red, and as you can see they arrive in any order:





I'm satisfied with the speed of this method, but there are still some optimizations I could do. If some restrictions are imposed on the polygonal solids, it becomes possible to insert triangles into the Quad-tree already sorted. This means the OpenCL rasterization would only need one pass per ray. Right now each intersection requires its own pass.

But I'm still on the lookout for a faster voxelization method. If you know of something better, please let me know.

Friday, April 1, 2011

Tree Bits

In an earlier post I mentioned that I was using a Space Colonization algorithm to grow trees. This was more about producing an internal representation of the tree, for instance a list of line segments describing the trunk and branches. Still, somehow the tree needs to materialize into voxels. In this post I will describe how I have done it so far.

If you remember another post, at some point I had decided to go with a full isosurface representation of the world. This means that for any point of the 3D space a density value is defined. If the value is positive, the point is inside a solid. If it is negative, it is outside. And if it is zero, the point is on the surface and should be visualized.

The question is, how to create a density function that will result in a tree?

You will find two different entities in your typical tree: the leaves and the branches, which include the trunk as some sort of over-sized branch. It made sense that the tree density function was the sum of two different density functions, one for the leaves and one for the branches.

For the trunk and branches, the colonization algorithm was producing a series of connected segments. This was just a collection of start and end points. First I needed to compute the girth of each branch.


A long time ago, Leonardo da Vinci noticed a rule about tree branches. It seems that when a branch splits into two or more branches, the cross-section areas of the new branches add up to the area of the original branch.

At the right you can see a doodle I took from his handbook that illustrates this point.

This is also called the pipe model. It kind of makes sense when you think most trees just need to move their juices up and down.


With this in mind, it is possible to back-track the hierarchy of connected segments, starting from the thinnest and adding up the areas down to the trunk. At the end of this process each segment will have its starting and ending diameters defined.
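
Here is a sketch of that back-tracking in C, over a hypothetical segment hierarchy. Since the cross-section area is proportional to the square of the radius, the parent radius is just the square root of the sum of the squared child radii:

#include <math.h>

typedef struct Segment {
    struct Segment* children[8];  /* hypothetical: up to 8 child branches */
    int childCount;
    float radius;
} Segment;

/* Back-track from the tips: leaf segments get a base tip radius; every
   other segment gets the radius whose cross-section area equals the sum
   of its children's areas (the pipe model). */
float computeRadius(Segment* s, float tipRadius)
{
    if (s->childCount == 0)
        return s->radius = tipRadius;
    float areaSum = 0.0f;
    for (int i = 0; i < s->childCount; i++) {
        float r = computeRadius(s->children[i], tipRadius);
        areaSum += r * r;   /* the pi factor cancels out on both sides */
    }
    return s->radius = sqrtf(areaSum);
}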

That was the easy part, but this is still far from a real density function.

If you remember yet another post, I was computing all the density functions using OpenCL kernels. Each instance of the kernel would take a point in 3D space and compute the density function.

I wrote a new kernel for a density function based on a list of segments. It is fairly straightforward to compute the value, as the following image shows:
Here the large dot is the point in 3D space we need to compute. The line coming out of this point is the shortest distance from the point to the segment. If you subtract the radius of the cone at that position from the length of this line, you end up with the distance from the point to the surface of the cone. This distance is the value of the density function for the field.
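
In OpenCL C terms, the computation for a single segment could look like this sketch (flip the sign at the end if your convention is positive-inside):

/* Distance from point p to the cone around segment (a, b), with radii
   ra and rb at the two endpoints. */
float segmentDistance(float3 p, float3 a, float3 b, float ra, float rb)
{
    float3 ab = b - a;
    float t = clamp(dot(p - a, ab) / dot(ab, ab), 0.0f, 1.0f);
    float3 onAxis = a + t * ab;   /* closest point on the segment */
    float r = mix(ra, rb, t);     /* cone radius at that position */
    return distance(p, onAxis) - r;
}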

The tricky part is making this fast. In theory each point should be tested against each segment. For high-density grids and a large number of segments, the naive approach is just too slow.

The first obvious optimization is to avoid testing all segments. If only segments that are reasonably close are tested, it makes a big difference. So I stored the segments in an octree and tested only those segments that were in cells neighboring the point.

To speed up access to the octree in OpenCL, I made sure I was moving the octree data into local memory before going over the list of segments. You can see this trick in the ATI Stream OpenCL examples, specifically the one about solving the N-body problem. This case is similar to the N-body problem, only that some bodies are points and some others are segments.

Now, this produced segments that were far too smooth. For pipes or columns they would be great, but tree trunks and branches are not that regular. I realized I needed to add some noise to the segment surfaces.

This was a rather small change. Instead of using the regular segment radius (noted as "r" in the image before), we can perturb this radius with 3D noise. For this to work, you need to pick the right coordinates to evaluate the noise function. A good approach is to use the point where the shortest line intersects the smooth cone volume, that is, the same point we were using before to compute the density field. Using these coordinates we evaluate the noise function. This returns a value we can use to offset the radius of the cone and produce a new distance to the volume.

In the following image you can see the effect this produces. By playing with the noise type and properties you can achieve different effects. Instead of a noise function you could also use cellular fields like the ones described by Worley.


This works for branches, roots and trunks. What about the other half of the problem, the leaves?

I chose to address this in a peculiar way. Maybe my eyesight is not alright, but when seen from a distance, a forest looks more like a field of clouds than a massive collection of individual leaves. You can see the leaves as some sort of noise in the profile of the crowns. One of my goals is to have individual objects that are too distant still blend properly into the background, instead of just disappearing from view.

Now this works only if the trees are healthy and full of leaves. It is a limitation of my approach. I will see how far I can go with this before considering an alternative. It becomes very apparent when you are really close to the leaves that they are just a really noisy volume.

I have an idea to address this, which is to render additional billboards on the client based on the crown polygons. This is similar to what most games do when rendering grass. I would render tree leaves instead.

With that out of the way, we can start thinking about the tree crown as a cloud. Its definition is quite simple: just a list of ellipsoids. When a branch ends, the colonization algorithm inserts a point in the cloud. The diameter of the ellipsoid at that point is a function of the size of the tree and how distant this point is from the main branches.

If enough ellipsoids are created they eventually blend into some sort of cloud. The density function of this cloud is computed in a kernel very similar to the one used for the segments.
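
As a sketch, a single ellipsoid could contribute something like this in OpenCL C. It is a rough approximation that scales space so the ellipsoid becomes a unit sphere:

/* Approximate density of one crown ellipsoid: positive inside, negative
   outside, in the same spirit as the segment kernel. */
float ellipsoidDensity(float3 p, float3 center, float3 radii)
{
    float3 q = (p - center) / radii;   /* ellipsoid becomes a unit sphere */
    return 1.0f - length(q);
}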

Adding a lot of noise helps make the crown more interesting. This noise needs large low-frequency amplitudes but also a lot of high-frequency detail. The low frequencies break the appearance of the base ellipsoids, while the high frequencies resemble the look of individual leaves.

The following image shows the results:


I know there is a lot of room for improvement. I still plan to revisit the generation of trees. I want more realistic trunks and roots, and also create new types of trees.

As I said before, individual leaves will be rendered by instancing on the client. This should improve the realism of this method a lot. What you have seen here is rather the content-generation side of it. I also plan to revisit the global illumination solution to add light scattering in the tree crowns.

Until then, I hope this will inspire you to make some voxel trees of your own.


Wednesday, January 12, 2011

OpenCL, second iteration

OpenCL and CUDA evangelists will tell you about speedups of 5, 10, and even 100 times. What they don't say is that some of these case studies compare a CUDA port with a very dated FORTRAN implementation. Or that the algorithm is now using floats instead of doubles. Or that the previous implementation was not threaded at all. When you compare to an optimized, multithreaded CPU implementation, the gains are not so obvious.

It is difficult to achieve good performance in CUDA and OpenCL. It really feels like getting all your ducks in a row. If anything goes out of line, you get worse performance than an optimized CPU implementation. And things go out of line very easily.

First, you must be aware of how memory will be accessed by your algorithm. You must access your data in the same way it is being fetched by the GPU. You may find yourself doing weird things, like using a space-filling curve to linearize access to data in multiple dimensions. Most likely you will need to use the workgroup's local memory to cache values each workitem would normally get from global memory.
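
As an illustration, this is the usual local memory pattern in OpenCL C, shown here with a simple per-workgroup sum. The workgroup size is assumed to be known at compile time and to match the local size used when enqueueing the kernel:

#define GROUP_SIZE 64   /* assumed workgroup size */

/* Each work-item fetches one value from global memory into the group's
   local memory; after the barrier, the whole group reuses those values
   without touching global memory again. */
__kernel void groupSum(__global const float* src, __global float* groupTotals)
{
    __local float tile[GROUP_SIZE];
    int lid = get_local_id(0);
    tile[lid] = src[get_global_id(0)];   /* one global fetch per item */
    barrier(CLK_LOCAL_MEM_FENCE);        /* wait until the tile is full */

    for (int stride = GROUP_SIZE / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            tile[lid] += tile[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        groupTotals[get_group_id(0)] = tile[0];
}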

You must pick workload sizes that will result in optimal processor occupancy. But be very careful here. All your local variables will be using GPU registers. If you cram too many threads in at once you may run out of registers. Your code will still run, but the variables that did not fit in registers will be stored in global memory. This will make the performance orders of magnitude slower than a CPU implementation. Different cards will have different numbers of registers. If you want an optimal workload size, your implementation must become aware of the different cards out there.

And then you need to think about what your code actually does. It pays to use vector operations whenever possible, and to sacrifice precision when it doesn't really matter. Control flow instructions are very expensive, so use them only if there is no other choice. OpenCL and CUDA are SIMT (Single Instruction Multiple Thread). What this means is that all concurrent threads are running the same instruction. Whenever your code branches, the processor will have to serialize the different branches and run them one after the other. If you want to get around that, your code needs to become aware of processor wavefronts, or warps as they are called in CUDA.

As you see, you must be fully aware of the architecture and how your solution is being mapped into it. If you are a GPU developer who was doing generic computing using vertex and fragment shaders, you will find OpenCL and CUDA refreshing and liberating. If you are a CPU developer, well, I hope you enjoy learning new things.

On top of everything there is Amdahl's Law. This is just a way to compute how much parallelization will improve performance. It tells you that any portions of your code that need to remain serial will have a large impact on the overall performance, no matter how many processors or threads are running the rest. Real-life solutions cannot be fully parallel. Often there are large fragments of your logic that you do not want to run in parallel, either because it would be expensive to port them, or because they are out of your control. So in some cases the massive parallelization offered by OpenCL and CUDA, even if exploited properly, improves very little.

After my initial experiment porting Dual Contouring to OpenCL, I realized some improvements were necessary. My first attempt was the naive approach: use a regular grid and run the kernels on every element of the grid, regardless of whether it was part of the surface or not. The results were better than a single-threaded CPU implementation, but the speedup was only about two times.

For the second iteration, it was clear I needed to be selective about which elements to compute. This is how I did it:


  1. Run the density function for every point in the regular grid
  2. Classify which edges in the grid contain a sign change between endpoints (see the sketch after this list)
  3. Scan the edge classification grid and create a compact array of intersected edges
  4. Using the edge array, compute intersection point for each edge
  5. Compute the normal of the surface at each intersection point
  6. Classify which voxels in the grid have at least one intersection in one of its 12 edges
  7. Scan the voxel classification and create a compact array of voxels
  8. For each voxel in the compacted array, run the QEF solution
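
For illustration, here is how step 2 could look in OpenCL C for the edges running along X. The grid layout and names are assumptions of mine; the kernel would be launched over dimX-1 points in x so the second endpoint stays in bounds:

/* One thread per edge: a sign change between the two endpoint densities
   means the surface crosses this edge. */
__kernel void classifyEdgesX(__global const float* density,
                             __global uchar* crossed,
                             int dimX, int dimY)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    int z = get_global_id(2);
    int i = x + y * dimX + z * dimX * dimY;
    crossed[i] = (density[i] > 0.0f) != (density[i + 1] > 0.0f);
}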


As you can see, that's a lot of kernels. Everything needs to be classified, scanned and compacted. The scanning alone may involve more than one call to a single kernel. If you are not clear about this operation, please check an earlier post: Scan or Die. As you may remember, the scan operation works over several branches of a tree. This solution took more memory as well, since we need classification grids for edges and voxels, and also the arrays to store the edges and voxels that really need to be processed.

The million dollar question is: was it faster? It was. It was really fast. This optimization hit a nerve, which was computing the surface normals. The density functions I was running were very complex to evaluate. At any point the terrain had dozens of layers, each one with five to ten octaves of Perlin noise. In total, over a hundred octaves of Perlin for each point. To compute a single normal I had to evaluate the density function another three times. By being selective and computing normals only where needed, I was able to compute many more useful normals in the same time.

But this was not the last iteration I did. As I had suspected, evaluating the density function was the bottleneck of the entire algorithm. I was still evaluating it for every point in the grid. Improving that took me to the next iteration, which may come with a surprise (or not, if you consider my earlier rant about OpenCL complexity, Amdahl et al).

Sunday, December 5, 2010

Scan or Die

Getting your head around parallel programming will require some re-adjustment. You will soon find that a naive approach will rarely show any real advantages. As I see it, there are two big issues to overcome.

First, parallel computing frameworks like OpenCL and CUDA bring some restrictions of their own, like how much memory each workgroup can have, how wavefronts are executed, etc. Just like any other framework, you just need to learn how to deal with those.

The second issue is more serious. It is the paradigm shift from serial computing to parallel computing. You need to rethink your approach to the task at hand. Algorithms that are trivial when executed serially may look very difficult to implement in parallel.

If you are serious about parallel programming, there is one tool you should really get to know. It is one simple tool, but it is so powerful that you will find yourself using it over and over again: the "Scan", also known as the "Prefix Sum" operation. You should get this one straight from the source: Prefix Sums and Their Applications by Guy Blelloch. Anyway, I will provide a quick introduction to the method and why it was useful to me.

One of the simplest problems you can find in traditional programming is computing the sum of all the values in an array. This is one of the first things you learn in most introductory programming courses. Now, the same problem in parallel computing seems a daunting task, if you want it to be at least as efficient as the serial version.

It is not difficult to see why. When done sequentially, the problem involves iterating over each element in the array and adding the current value to a temporary variable. Once you are done iterating, the sum of all elements is contained in the temporary variable.

When executed in parallel, having a single variable keep the sum of all elements is not practical or even feasible. Every thread would be trying to read and write this single variable at the same time. Unless you force all threads to wait for one another, the results cannot be trusted. And if threads are forced to wait, then there is no point to the parallel approach; it would be slower.

How to solve this? The trick lies in the nature of the operation being performed, in this case the sum. If the operation is associative, meaning:

          (A + B) + C == A + (B + C)

Nothing is really forcing you to perform it in any particular sequence. You can break down the problem into multiple segments. You perform the operation within each segment, then over the results produced by these segments. Since the operation is associative, it is guaranteed the results will be the same.

So if we needed to compute:

          V = A + B + C + D + E + F + G + H

You could do (A + B), (C + D), (E + F) and (G + H) in parallel:

          V1 = A + B
          V2 = C + D
          V3 = E + F
          V4 = G + H

Then you would do a second pass:

          V11 = V1 + V2
          V12 = V3 + V4

And then one last pass:

          V = V11 + V12

For the sake of argument let's assume each sum takes a millisecond. The serial approach would take 7 milliseconds. The parallel approach would take only three milliseconds, since all the additions for each pass happen at the same time.

In general this method requires breaking down the problem into a tree-like structure. Branches of the tree which are at the same level and do not depend on each other can be evaluated at the same time. Then the next level of the tree is processed and so on, until we reach the root and obtain the final result.

This can be extended to many other applications, even some that are not obvious, like parsing tokens, evaluating regular expressions, performing searches and sorting.

In my case it made a big difference. From the naive implementation of Dual Contouring it was clear I was wasting threads on elements that were not crossed at all by the surface. Instead of running expensive functions on every edge and voxel of the grid, it would be better to first select which edges were crossed by the surface and consider only those edges and their voxels.

This could be solved using the Scan algorithm. In its most common form, this algorithm not only returns the sum of all values in an array; it can also be used to find, for each element in the array, the sum of all preceding elements.

If you had:
          [3 1 7 0 4 1 6 3]

The scan would return:

          [0 3 4 11 11 15 16 22]

This allows for a very nice trick. Imagine that in the first array we have an entry for each edge in the voxel grid. If the edge is crossed, a 1 is written in that location of the array. Otherwise a zero is written.

Imagine we have eight edges. Of those, only edges 3 and 5 are crossed. This is how the array would look:


          [0 0 0 1 0 1 0 0]

The Scan would return:


          [0 0 0 0 1 1 2 2]


If you look closely, you will see the new array can be used to compact the edge array. First, we know how many elements we need to allocate: 2, the number of ones found in the edge array. Second, we can use it to map entries in the original array to slots in the compacted array. For that, we go over each element in the Scan result whose flag is set, and use the value found there as the index into the compacted array. At that location we write the current position in the Scan array. This would be the result:


          [3 5]


This is a neat compact array that contains only the edges we want to look at. The compacting itself is also suitable for parallelization, so most of the contouring logic can remain in the GPU.
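
In OpenCL C, the scatter step could be as small as this sketch (the names are made up):

/* After the exclusive Scan of the flags, every flagged edge knows its
   slot in the compact array and writes its own index there. No two
   threads write to the same slot, so no synchronization is needed. */
__kernel void compactEdges(__global const uchar* flags,
                           __global const int* scanned,
                           __global int* compacted)
{
    int i = get_global_id(0);
    if (flags[i])
        compacted[scanned[i]] = i;
}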

In the next post I will describe how these compact arrays are used and how many times I ended up running the Scan function.

Sunday, November 28, 2010

OpenCL, first iteration

OpenCL is a general purpose computing framework. It is designed for parallel computing so it runs very well in modern GPUs, which are parallel by nature. The whole idea is to have a single "job" executed by a large number of threads at the same time.

Voxels are a natural fit for OpenCL. Many voxel algorithms run locally, meaning that the algorithm only needs to look at one or a few voxels at a time. There are no dependencies between a voxel and other voxels far away.

I already had an implementation of Dual Contouring for the CPU, which was a good starting point for a port to OpenCL. A straight port was out of the question; in almost every case OpenCL will require you to re-think your problem from a new angle. Still, having the traditional CPU implementation gave me a good idea of the issues I needed to address.

This first iteration was only to test the waters. I wanted to see if a naive approach would produce acceptable results. The bare minimum I could do was:

  1. For each point in the voxel grid, evaluate the 3D field function
  2. For each edge in the grid, see if the edge crossed the surface and compute the intersection point
  3. For each voxel in the grid, see if its edges had an intersection and compute the inner point for the voxel

I could then create the resulting mesh on the CPU. I would need to iterate through all the edges again; if an edge was crossed, I would output a quadrilateral connecting the inner points of the four neighboring voxels.

From the start I saw this would require running three different "jobs", or "kernels" as they are called in OpenCL-speak: one to evaluate the points in the grid, another for intersecting edges, and a last one for computing inner points.

A kernel is like a traditional C function that takes multiple parameters. Some parameters are inputs, some are outputs. In most cases, arrays are used to move data in and out.

For instance, the first kernel for computing the grid function would take an array of floats as input. It would compute one single value for the function and put it into the array, making sure the value was put at the correct index. Why only one value? This is where the parallel way of thinking comes in: this same kernel will run for each point of the array; hundreds of them may actually be running at the same time. OpenCL will spawn as many instances of your kernel as necessary so the entire array can be covered.
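
As a sketch, such a kernel could look like this in OpenCL C, with fieldFunction standing in for whatever 3D density you use:

float fieldFunction(float3 p);   /* the 3D field, defined elsewhere */

/* One thread per grid point, each writing exactly one value at its own
   index. */
__kernel void evalField(__global float* grid, int dimX, int dimY)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    int z = get_global_id(2);
    grid[x + y * dimX + z * dimX * dimY] = fieldFunction((float3)(x, y, z));
}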

It did not take long to implement this version. The results were good, but it was only a few times faster than the CPU implementation. I knew this was not good enough. So what was the problem with the naive approach? It was not difficult to see it was wasting a lot of processing power.

The CPU implementation, being serial by nature, had the chance to skip further processing of edges, voxels and points when it realized they were not crossed by the surface. For instance, it would compute the inner point of a voxel only if any of the voxel's edges was intersected by the surface. If there was no intersection, the next CPU cycles would be spent processing the next voxel.

The key in the sentences before is the word "if".

With OpenCL and parallel computing in general, "if" statements won't save you any processing, unless you can guarantee the same outcome every time the if statement is evaluated. Since every thread is running the same code in parallel, it saves very little when you realize that no further processing is needed and the function can just bail out. It is already too late, because a thread was allocated to the task anyway.

The naive approach was doing an awful lot of work that was not needed at all. If I wanted really complex field functions, with many octaves of Perlin noise or distance fields to point clouds for rocks and vegetation, I would have to do better.

I was also solving Quadratic Error Functions to compute the location of the points inside the voxels. This required normal vectors for each intersection point. The way to compute these normals is by using the gradient of the 3D field at the point, which meant evaluating the field function another four times. The amount of computation was staggering. I clearly needed to optimize the inputs of my OpenCL kernels, so they would look only at the elements that were contributing to the surface.
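
For reference, a normal from the field gradient can be sketched like this in OpenCL C, using forward differences (four evaluations in total: the point itself plus three offsets):

float fieldFunction(float3 p);   /* the 3D field, defined elsewhere */

/* Normal from the gradient of the field; depending on your sign
   convention you may need to negate it. eps is a small step. */
float3 fieldNormal(float3 p, float eps)
{
    float d = fieldFunction(p);
    return normalize((float3)(fieldFunction(p + (float3)(eps, 0, 0)) - d,
                              fieldFunction(p + (float3)(0, eps, 0)) - d,
                              fieldFunction(p + (float3)(0, 0, eps)) - d));
}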

This realization took me to the second iteration using OpenCL, which will be the subject of my next post.





Friday, November 12, 2010

This blog will document my attempts at creating a fully procedural world. It is a hobby, so there is no promise that any of this will ever be released.

If you are into computer graphics, massive online games, digital botany and architecture, voxels and parallel computing, you may want to stick around and check for updates.

This is a screen capture of the progress I have made so far:



It is still far from where I want to be, but I'm happy with the results seen here. You can find more videos on my YouTube Channel page: http://www.youtube.com/user/MCeperoG

A few years ago I started wondering how far you can go into creating a fully procedural world. This is a recurrent idea. It has been implemented with some success many times before. Still, I was not satisfied with what I could find. The procedural worlds out there seemed too repetitive, and their generation patterns soon became evident.

The real world itself is full of ever-repeating patterns, but in this case we take them as natural. I had the feeling that maybe the techniques were right, that the problem was just the degree to which  they were applied. As it appeared, it would take many layers of complex patterns to make a single believable world. The current implementations were not rich enough to trick the eye.

My goal was to create a rather large virtual environment, around 100 square kilometers, and allow people to connect and explore it over the Internet. Maybe play a game on it. It would cover not only different types of terrain, but also vegetation and architecture. And I wanted all that generated with a minimum of artistic input, but still be visually appealing.

So I started by devising a framework that could support the amount of detail I felt was necessary. Geometric detail could go as fine as 10 centimeters (~4 inches), and the texturing maybe a few texels per square centimeter. The texturing of the world would be unique, meaning that each texture pixel would be mapped exclusively to one place in the world's geometry.

Considering the world size, this would translate into a few terabytes of geometry and textures. Soon it became evident that I would need to stream the world's representation to viewers as they moved. The sheer size of the world data made it prohibitive to pack as a one-time download.

All these requirements shaped the solution I chose. In my next post I will introduce the basic building block that made all this possible.