Procedural World: Parallel Concerns

Tuesday, March 18, 2014

Computers continue to increase in power at pretty much the same rate as before. They double their performance every 18 to 24 months. This trend is known as Moore's law. The leaders of the microprocessor industry swear they see no end to this law in the 21st century, that everything is fine. Others say it will come to an end around 2020, but with some luck we should be able to find another exponential to ride.

Regardless of whom you believe, you should be wary of the fine print. Moore's law is about transistor count; it says nothing about how fast they operate. Since the sixties, programmers have been used to seeing algorithms improve their performance with every new hardware generation. You could count on Moore's law to make your software run twice as fast every 18 months. No changes were needed, not even recompiling your code. The same program would just run faster.

This way of thinking came to an end around 2005. Clock frequencies hit a physical limit. Since then you cannot compute any faster, you can only compute more. To achieve faster results the only option is to compute in parallel.

Chip-makers try not to make a big deal out of this, but there is a world of difference to programmers. If they were car-makers, it would be as if their cars had reached a physical limit of 60 miles per hour. When asked how you could do 120 miles per hour, they would suggest you take two cars.

Many programmers today ignore this monumental shift. It is often because they produce code for platforms that already deal with parallelization, like database engines or web servers. Or they work creating User Interfaces, where a single thread of code is already fast enough to outpace humans.

Procedural generation requires all the processing power you can get. Any algorithms you design will have to run in this new age of computing. You have no choice but to make them run in parallel.

Parallel computing is an old subject. For many years now programmers have been trying to find a silver-bullet approach that will take a simple serial algorithm and re-shuffle it so it can run in parallel. None has succeeded. You simply cannot ignore parallelization and rely on wishful thinking. Your algorithms have to be designed with it in mind. What is worse, your algorithm design will depend on the hardware you choose, because parallel means something different depending on where you go.

This post is about my experience with different parallel platforms.

In my view, there are three main hardware platforms worth considering today. The first one is traditional CPUs, like the x86 series by Intel. Multicore CPUs are now common, and the number of cores may still grow exponentially. Ten years from now we could have hundreds of them. If you manage to break your problem into many different chunks, you can feed each chunk to an individual core and have them run in parallel. As the number of cores grows, your program will run faster.

Let's say you want to generate procedural terrain. You could break the terrain into regular chunks by using an Octree, then process many Octree cells in parallel by feeding them to the available cores.
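A minimal C++ sketch of that idea follows. The `Cell` type and the `generateCell` function are hypothetical placeholders, not anything from an actual engine; the real per-cell work would go where `generateCell` is. Worker threads claim cells one at a time through a shared atomic counter, so the per-cell code itself stays serial.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical octree leaf cell: only what this sketch needs.
struct Cell {
    int x, y, z;      // cell coordinates
    float density;    // result written by the generator
};

// Placeholder for the real, serial generation work on a single cell.
void generateCell(Cell& c) {
    c.density = static_cast<float>(c.x + c.y + c.z);
}

// Process every cell using `workers` threads. Each thread repeatedly
// claims the next unprocessed index, so no two threads share a cell.
void generateAll(std::vector<Cell>& cells, unsigned workers) {
    assert(workers > 0);
    std::atomic<std::size_t> next{0};
    auto work = [&]() {
        for (std::size_t i = next.fetch_add(1); i < cells.size();
             i = next.fetch_add(1)) {
            generateCell(cells[i]);
        }
    };
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < workers; ++t) pool.emplace_back(work);
    for (auto& th : pool) th.join();  // wait for all cells to finish
}
```

A caller would typically pass `std::thread::hardware_concurrency()` as the worker count, falling back to 1 when it reports 0.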

The x86 platform has the nicest, most mature development tools. Also, since the number of concurrent threads is not very high, the parallelization effort is mostly about breaking the work into large chunks and sewing the results back together. Most of the actual computation remains serial. This is a bonus: anyone can write stable serial code, but you will pay dearly for good parallel programmers. The more old-fashioned serial code you can write within your fully parallel solution, the better.

Being so oriented towards serial, generic code is also the weakness of traditional CPUs. They devote a great deal of transistors and energy to dealing with the unpredictability of generic algorithms. If your logic is very simple and linear, all this silicon goes to waste.

The second hardware platform is the GPU. These are the video cards powering games in PCs and consoles. Graphics rendering is highly parallel. The image you see on screen is a matrix of pixels, where each pixel's color can be computed mostly in isolation from the rest. Video cards have evolved around this principle. They allow hundreds of concurrent threads to run in parallel, each one devoted to producing one fragment of the final image. Compared to today's average CPUs, which allow only eight simultaneous threads, hundreds of threads may seem a bonanza.

The catch is all these GPU threads are required to run the same instruction at the same time. Imagine you have 100 dancers on stage. Each dancer must execute exactly the same steps. If just one dancer deviates from the routine, the other dancers stop and wait.

Perfect synchronization is desirable in ballet, not so much in general purpose computing. A single "if" statement in the code could be enough to create divergence. What often happens when an "if" is encountered is that both branches are executed by all threads, then the results of the branch that was not supposed to run are discarded. It is like some sort of weird quantum reality where all alternate scenarios do happen, but only the outcome of one is picked afterwards. The cat is really both dead and alive in your GPU.
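The "run both branches, keep one" behavior can be mimicked on a CPU. The C++ sketch below is only an illustration: the function names and the arithmetic are made up, and real GPU compilers and hardware decide for themselves when to handle a branch this way.

```cpp
// Serial style: only the branch that applies is evaluated.
float shadeBranchy(float height) {
    if (height > 0.5f) {
        return height * 2.0f;
    }
    return height + 10.0f;
}

// Predicated style: both branches are evaluated unconditionally,
// then the condition only selects which result survives.
float shadePredicated(float height) {
    float ifTaken = height * 2.0f;      // work for the "then" branch
    float ifNotTaken = height + 10.0f;  // work for the "else" branch
    bool keepTaken = height > 0.5f;
    return keepTaken ? ifTaken : ifNotTaken;  // discard the other result
}
```

Both functions return the same value for any input; the second simply pays for both branches every time, which is the cost the paragraph above describes.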

Loops having a variable number of iterations are a problem too. If 99 of the dancers spin twice and the one remaining dancer, for some inexplicable reason, decides to spin forty times, the 99 dancers won't be able to do anything else until the rogue dancer is done. The execution time is the same as if all the dancers had done forty spins.
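The cost of the rogue dancer can be put in numbers with a small C++ sketch. It is a simplified model, assuming every member of a lockstep group is held for as many iterations as the slowest one; the function names are invented for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Steps a lockstep group occupies: every member waits for the slowest.
long lockstepSteps(const std::vector<int>& iterations) {
    assert(!iterations.empty());
    int slowest = *std::max_element(iterations.begin(), iterations.end());
    return static_cast<long>(slowest) * static_cast<long>(iterations.size());
}

// Steps of useful work that were actually requested.
long usefulSteps(const std::vector<int>& iterations) {
    return std::accumulate(iterations.begin(), iterations.end(), 0L);
}
```

For 99 dancers spinning twice and one spinning forty times, this model gives 4000 occupied steps against 238 useful ones.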

So programming mainstream GPUs is great as long as you can avoid loops and conditional statements. This sounds absurd, but with the right mindset it is very much possible. The speedup compared to a multithreaded CPU implementation may be significant.

There are some frameworks that allow general purpose programs to run on GPUs. CUDA is probably the best known. It is deceptively simple. You write a single function in a language almost identical to C. Each one of the many threads in the GPU will run this function at the same time, but each one will input and output data from a different location. Let's say you have two large arrays of the same size, A and B. You want to compute array C as the sum of these two arrays. To solve this using CUDA you would write a function that looks like this:

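// Pseudocode: every GPU thread runs this same function.
void Add(in float A[], in float B[], out float C[])
{
    // Each thread gets a unique index, so it handles exactly one element.
    int i = getThreadIndex();
    C[i] = A[i] + B[i];
}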

This is pseudocode; the actual CUDA code would be different, but the idea is the same. This function processes only one element in the array. To have the entire array processed you need to ask CUDA to spawn a thread for each element.
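To make "a thread for each element" concrete, here is a CPU-only emulation in plain C++. It is not CUDA and uses no GPU API: `addOne` and `launchAdd` are hypothetical names, and the loop stands in for the hardware, which would run all the per-element calls concurrently rather than one after another.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the per-thread function: the index is passed in
// explicitly instead of being read from the hardware.
void addOne(std::size_t i, const std::vector<float>& A,
            const std::vector<float>& B, std::vector<float>& C) {
    C[i] = A[i] + B[i];
}

// Emulated "launch": one call per element. No call depends on another,
// which is what lets a GPU run them all at the same time.
void launchAdd(const std::vector<float>& A, const std::vector<float>& B,
               std::vector<float>& C) {
    assert(A.size() == B.size() && B.size() == C.size());
    for (std::size_t i = 0; i < C.size(); ++i) {
        addOne(i, A, B, C);
    }
}
```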

One big drawback of CUDA is that it is a proprietary framework. It is owned by Nvidia and so far it is limited to their hardware. This means you cannot run CUDA programs on AMD GPUs.

An alternative to CUDA is OpenCL. OpenCL was proposed by Apple, but it is an open standard like OpenGL. It is almost identical to CUDA, maybe a tad more verbose, but for a good reason: not only does it run on both Nvidia and AMD GPUs, it also runs on CPUs. This is great news for developers. You can write code that will use all computing resources available.

Even with these frameworks to aid you, GPU programming requires a whole different way of thinking. You need to address your problem in a way that can be digested by this swarm of rather stupid worker threads. You will need to come up with the right choreography for them, otherwise they will wander aimlessly scratching their heads. And there is one big skeleton in the closet. It is easy to write programs that run on the GPU, but it is hard to make full use of it. Often the bottleneck is between the GPU and main memory. It takes time to feed data and read results. Adding two arrays in the naive form, as in the earlier example, would spend 99% of the time moving data and 1% doing the actual operation. As the arrays grow in size, the GPU performs poorly compared to a CPU implementation.
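A back-of-the-envelope model shows why the naive array addition is dominated by data movement. The C++ sketch below is a rough estimate, not a benchmark: it assumes transfer and compute do not overlap, and the bandwidth and throughput figures are parameters you would have to supply for your own hardware.

```cpp
#include <cassert>

// Estimated fraction of total time spent moving data when adding two
// arrays of n floats on a GPU. Assumes transfer and compute do not overlap.
double transferFraction(double n, double busBytesPerSec, double addsPerSec) {
    assert(n > 0 && busBytesPerSec > 0 && addsPerSec > 0);
    double bytesMoved = 3.0 * n * sizeof(float);  // A and B in, C out
    double transferTime = bytesMoved / busBytesPerSec;
    double computeTime = n / addsPerSec;          // one addition per element
    return transferTime / (transferTime + computeTime);
}
```

With made-up round numbers, say a 10 GB/s bus and 10^12 additions per second, the result is above 0.99. In this model the array size cancels out of the ratio, so larger arrays do not improve the balance.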

So which platform should you target, CPUs or GPUs?

Soon it may not be clear-cut anymore. CPUs and GPUs are starting to converge. CPUs may soon include a higher number of less intelligent cores. They will not be very good at running unpredictable generic code, but will do great with linear tasks. On the other side, GPUs will get better at handling conditionals and loops. And new architectures are already improving the bandwidth issues. Still, it will take a few years for all this to happen and for this hardware to become mainstream.

If you were building a procedural supercomputer today, it would make sense to use a lot of GPUs. You would get more done for a smaller electricity bill. You could stick to Nvidia hardware and hire a few brainiacs to develop your algorithms in CUDA.

If you want to crowd-source your computing, having your logic run by your users, GPUs also make a lot of sense. But then using a general purpose computing framework like CUDA or OpenCL may not be a good idea. In my experience you cannot trust everyone to have the right stack of drivers you will require. In the absence of a general computing GPU framework, you would need to use graphics rendering APIs like DirectX and OpenGL to perform general purpose computing. There is a lot of literature and research on GPGPU (General-Purpose computing on GPUs), but this path is not for the faint of heart. Things will get messy.

On the other hand, CPUs are very cheap and easy to program. It is probably better to get a lot of them running not-so-brilliant code, which after all is easier to produce and maintain. As often happens, hardware excellence does not prevail. You may achieve a lot more just because of how developer-friendly the platform is and how easy it will be to port your code.

This brings us to the third parallel platform, which is a network of computers (aka the cloud). Imagine you get hundreds of PCs talking to each other over a Local Area Network. Individually they could be rather mediocre systems. They will be highly unreliable: hard drives could stop working at any time, and power supply units could fail without warning. Still, as a collective they could be cheaper, faster and simpler to program than anything else.

23 comments:

Scott Richmond, March 18, 2014 at 12:49 AM
It really is a very exciting time to be in right now and the compute world is exploding at a pace I don't think anyone can keep up with at the moment. For example the soon to be released CUDA 6.0 comes with built in abstraction and management layers for what you have correctly identified as the biggest bottleneck in GPU compute at the moment - GPU<->CPU memory communication. The problem still exists of course, but it's finally managed by something often smarter than us. :)

It would be nice for it to start to normalize soon though as I think a lot of applications (3D engines) could be drastically improved if they were made with a modern parallel architecture.

niad, March 18, 2014 at 12:57 AM

Good post ;)

Don't forget AVX. Not long ago all we had was 128-bit SSE.

Now with AVX we are at 256. In a year or two Intel will release AVX-512. You have to explicitly use these instructions, but it does give a nice speedup if you can make it fit your algorithm.

I do wish GPU compute had full support for double without loss of performance.

Ajm, March 18, 2014 at 9:38 AM
So... 10 or more years from now, where do you think the focus will be? With smarter drivers/architecture, and breakthroughs like resistive RAM, graphene etc.

Miguel Cepero, March 19, 2014 at 2:00 AM
No idea. I guess it really depends on whether Moore's law continues. If it does not, my bet is on the cloud. Computing will get bulky and power hungry, we will have no choice but to hide these monsters away in data centers and keep thinner clients for interface.

If Moore's law were to continue for a long time, I think there may be a shift towards power at your fingertips. Processing will move as close to the client as possible. It is more convenient, secure and private.

In an ideal world networks are important because of the exchange of information, but we want processing power to be as close to the client as possible. It is a more resilient architecture.

For instance, I would like our future robot nurses to be really powerful computers walking around. They should be as self-reliant as possible. On the other hand, if they are rather stupid drones controlled by a central intelligence in the cloud, what happens when the network goes down?

demagogue, March 20, 2014 at 7:36 AM
I think parallel architecture would be a good way to take full advantage of the coming quantum computers, but especially dealing with big data sets like lighting and voxels.

Ajm, March 20, 2014 at 3:13 PM
Any thoughts on the DX12 press release? Are you excited by the low level optimizations? Does this open any new doors?

_undex, March 21, 2014 at 10:00 AM
Still the only thing programmers could do to run their code faster is to optimize it. Which brings me to a question: https://www.youtube.com/watch?v=f_ns9R9CC1U
What tips could you give me to optimize my own procedural voxel terrain?

Kuka, March 21, 2014 at 5:17 PM
What about FPGAs?

Unknown, March 24, 2014 at 1:54 PM
Very nice post :)
Is there any literature on parallel programming concepts that you would recommend?

Ajm, March 25, 2014 at 1:55 PM
New press release today, it looks like Nvidia's most pressing concerns are with what you have detailed above. They are looking at improving memory transfer rates as well as opening up CPU/GPU memory for more flexible performance.

http://blogs.nvidia.com/blog/2014/03/25/gpu-roadmap-pascal/