Computers continue to increase in power at roughly the same rate as before, doubling their capabilities every 18 to 24 months. This trend is known as Moore's law. The leaders of the microprocessor industry swear they see no end to this law in the 21st century, that everything is fine. Others say it will come to an end around 2020, but that with some luck we should be able to find another exponential to ride.
Regardless of who you believe, you should be wary of the fine print. Moore's law is about transistor count; it says nothing about how fast those transistors operate. Since the sixties, programmers grew used to seeing their algorithms get faster with every new hardware generation. You could count on Moore's law to make your software run twice as fast every 18 months. No changes were needed, not even recompiling your code. The same program would just run faster.
This way of thinking came to an end around 2005, when clock frequencies hit a physical limit. Since then you cannot compute any faster; you can only compute more. To get results sooner, the only option is to compute in parallel.
Chip-makers try not to make a big deal out of this, but there is a world of difference for programmers. If they were car-makers, it would be as if their cars had reached a physical limit of 60 miles per hour. When asked how you could do 120 miles per hour, they would suggest you take two cars.
Many programmers today ignore this monumental shift, often because they produce code for platforms that already deal with parallelization, like database engines or web servers. Or they work on user interfaces, where a single thread of code is already fast enough to outpace humans.
Procedural generation requires all the processing power you can get.
Any algorithms you design will have to run in this new age of computing.
You have no choice but to make them run in parallel.
Parallel computing is an old subject. For many years now
programmers have been trying to find a silver bullet approach that will
take a simple serial algorithm and re-shuffle it so it can run in parallel.
None has succeeded. You simply cannot ignore parallelization and rely on wishful thinking. Your
algorithms have to be designed with it in mind.
What is worse, your algorithm design will depend on the hardware you choose, because parallel means something different depending on where you go. This post is about my experience with different parallel platforms. In my view, there are three main hardware platforms worth considering today.
The first is the traditional CPU, like Intel's x86 series. Multicore CPUs are now common, and the number of cores may still grow exponentially; ten years from now we could have hundreds of them. If you manage to break your problem into many different chunks, you can feed each chunk to an individual core and have them run in parallel. As the number of cores grows, your program will run faster.
Let's say you want to generate procedural terrain. You could break the terrain into regular chunks using an octree, then process many octree cells in parallel by feeding them to the available cores.
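As a rough sketch of this idea, here is what the chunk-splitting could look like in C++ with standard threads. The OctreeCell type and GenerateTerrainForCell function are hypothetical placeholders for whatever your terrain pipeline actually does:

// A minimal sketch: spread octree cells over the available CPU cores.
// OctreeCell and GenerateTerrainForCell are hypothetical placeholders.
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

struct OctreeCell { /* bounds, level of detail, etc. */ };

void GenerateTerrainForCell(OctreeCell& cell)
{
    // Plain old serial code: noise evaluation, meshing, and so on.
}

void GenerateTerrain(std::vector<OctreeCell>& cells)
{
    const unsigned coreCount = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;

    // Each worker takes every coreCount-th cell, a simple static split.
    for (unsigned t = 0; t < coreCount; ++t)
    {
        workers.emplace_back([&cells, t, coreCount]()
        {
            for (std::size_t i = t; i < cells.size(); i += coreCount)
                GenerateTerrainForCell(cells[i]);
        });
    }

    for (std::thread& w : workers)
        w.join();
}

Note that inside each worker the code is entirely serial; the parallel part is just the split up front and the join at the end.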
The x86 platform has the nicest, most mature development tools. Also, since the number of concurrent threads is not very high, the parallelization effort revolves around breaking the work into large chunks and stitching the results back together. Most of the actual computation remains serial. This is a bonus: anyone can write stable serial code, but you will pay dearly for good parallel programmers. The more old-fashioned serial code you can keep inside your fully parallel solution, the better.
Being so oriented towards serial, generic code is also the weakness of traditional CPUs. They devote a great deal of transistors and energy to dealing with the unpredictability of generic algorithms. If your logic is very simple and linear, all this silicon goes to waste.
The second hardware platform is the GPU. These are the video cards powering games on PCs and consoles. Graphics rendering is highly parallel. The image you see on screen is a matrix of pixels, where each pixel's color can be computed mostly in isolation from the rest. Video cards have evolved around this principle. They allow hundreds of concurrent threads to run in parallel, each one devoted to producing one fragment of the final image. Compared to today's average CPU, which allows only around eight simultaneous threads, hundreds of threads may seem like a bonanza.
The catch is that all these GPU threads are required to run the same instruction at the same time.
Imagine you have 100 dancers on stage. Each dancer must execute
exactly the same steps. If just one dancer deviates from the routine, the
other dancers stop and wait.
Perfect synchronization is desirable in ballet, not so much in general-purpose computing. A single "if" statement in the code can be enough to create divergence. What often happens when an "if" is encountered is that both branches are executed by all threads, and the results of the branch that was not supposed to run are discarded. It is like some weird quantum reality where all alternate scenarios do happen, but only the outcome of one is picked afterwards. The cat really is both dead and alive in your GPU.
Loops with a variable number of iterations are a problem too. If 99 of the dancers spin twice and the one remaining dancer, for some inexplicable reason, decides to spin forty times, the 99 dancers won't be able to do anything else until the rogue dancer is done. The execution time is the same as if all the dancers had spun forty times.
So programming mainstream GPUs is great as long as you can avoid loops and
conditional statements. This sounds absurd, but with the right mindset it is
very much possible. The speedup compared to a multithreaded CPU
implementation may be significant.
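To make that mindset a bit more concrete, here is a small sketch of the kind of rewrite involved. The shading rule is invented just for illustration, and compilers will often predicate branches this trivial on their own; the technique matters more as the divergent work grows:

// Divergent formulation: threads that disagree on the test serialize.
float ShadeDivergent(float height)
{
    if (height > 0.0f)
        return height * 2.0f;   // "land" branch
    return height * 0.5f;       // "water" branch
}

// Branch-free formulation: every thread runs the same instruction stream.
float ShadeUniform(float height)
{
    float land = (height > 0.0f) ? 1.0f : 0.0f;  // usually compiles to a select, not a jump
    return height * (2.0f * land + 0.5f * (1.0f - land));
}

Both functions compute the same values; the second just makes sure every thread walks the exact same path.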
There are frameworks that allow general-purpose programs to run on GPUs. CUDA is probably the best known. It is deceptively simple. You write a single function in a language almost identical to C. Each one of the many threads in the GPU will run this function at the same time, but each will read its input from and write its output to a different location.
Let's say you have two large arrays of the same size, A and B. You want to compute a third array, C, as the element-wise sum of the two. To solve this using CUDA you would write a function that looks like this:
void Add(in float A[], in float B[], out float C[])
{
    int i = getThreadIndex();
    C[i] = A[i] + B[i];
}
This is pseudocode; the actual CUDA code would look different, but the idea is the same. This function processes only one element of the array. To have the entire array processed, you need to ask CUDA to spawn a thread for each element.
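For the curious, here is a minimal sketch of what the real thing could look like, with error handling left out. The kernel adds one element per thread; the host-side function (called AddOnGPU here just for this sketch) copies the data in, launches one thread per element, and copies the result back:

#include <cuda_runtime.h>

// Each thread adds the single element it is responsible for.
__global__ void Add(const float* A, const float* B, float* C, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        C[i] = A[i] + B[i];
}

void AddOnGPU(const float* A, const float* B, float* C, int count)
{
    size_t bytes = count * sizeof(float);
    float *dA, *dB, *dC;

    // Move the inputs from main memory into the GPU's own memory.
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

    // Spawn one thread per element, grouped in blocks of 256.
    int threadsPerBlock = 256;
    int blocks = (count + threadsPerBlock - 1) / threadsPerBlock;
    Add<<<blocks, threadsPerBlock>>>(dA, dB, dC, count);

    // Bring the result back and release the GPU memory.
    cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
}

The triple-bracket launch syntax is how you ask CUDA to spawn all those threads at once.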
One big drawback of CUDA is that it is a proprietary framework. It is
owned by Nvidia and so far it is limited to their hardware. This means you
cannot run CUDA programs on AMD GPUs.
An alternative to CUDA is OpenCL. OpenCL was proposed by Apple, but it is an open standard like OpenGL. It is almost identical to CUDA, maybe a tad more verbose, but for a good reason: not only does it run on both Nvidia and AMD GPUs, it also runs on CPUs. This is great news for developers. You can write code that will use all the computing resources available.
Even with these frameworks to aid you, GPU programming requires a whole different way of thinking. You need to address your problem in a way that can be digested by this swarm of rather stupid worker threads. You will need to come up with the right choreography for them; otherwise they will wander around aimlessly, scratching their heads.
And there is one big skeleton in the closet. It is easy to write programs that run on the GPU, but it is hard to make full use of it. Often the bottleneck is between the GPU and main memory. It takes time to feed data in and read results back. Adding two arrays in the naive form, as in the example above, would spend 99% of the time moving data and 1% doing the actual operation. As the arrays grow in size, the GPU performs poorly compared to a CPU implementation.
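If you want to see where the time goes, CUDA events are one simple way to measure it. This fragment slots into the hypothetical AddOnGPU sketch from before, reusing its device pointers and launch configuration:

// Wrap the copies and the kernel launch in CUDA events to time each stage.
cudaEvent_t start, afterUpload, afterKernel, afterDownload;
cudaEventCreate(&start);
cudaEventCreate(&afterUpload);
cudaEventCreate(&afterKernel);
cudaEventCreate(&afterDownload);

cudaEventRecord(start);
cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);
cudaEventRecord(afterUpload);

Add<<<blocks, threadsPerBlock>>>(dA, dB, dC, count);
cudaEventRecord(afterKernel);

cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);
cudaEventRecord(afterDownload);
cudaEventSynchronize(afterDownload);

float uploadMs, kernelMs, downloadMs;
cudaEventElapsedTime(&uploadMs, start, afterUpload);
cudaEventElapsedTime(&kernelMs, afterUpload, afterKernel);
cudaEventElapsedTime(&downloadMs, afterKernel, afterDownload);
// For a plain add, expect the two copy times to dwarf the kernel time.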
So which platform should you target, CPUs or GPUs?
Soon it may not be a clear-cut choice anymore. CPUs and GPUs are starting to converge. CPUs may soon include a larger number of less intelligent cores. They will not be very good at running unpredictable generic code, but they will do great with linear tasks. On the other side, GPUs will get better at handling conditionals and loops, and new architectures are already easing the bandwidth issues. Still, it will take a few years for all this to happen and for such hardware to become mainstream.
If you were building a procedural supercomputer today, it would make sense to
use a lot of GPUs. You would get more done for a smaller electricity bill.
You could stick to Nvidia hardware and hire a few brainiacs to develop
your algorithms in CUDA.
If you want to crowd-source your computing and have your logic run by your users, GPUs also make a lot of sense. But then using a general-purpose computing framework like CUDA or OpenCL may not be a good idea. In my experience you cannot trust everyone to have the driver stack you require. In the absence of a general computing framework you would need to use graphics rendering APIs like DirectX and OpenGL to perform general-purpose computing. There is a lot of literature and research on GPGPU (general-purpose computing on GPUs), but this path is not for the faint of heart. Things will get messy.
On the other hand, CPUs are very cheap and easy to program. It is probably better to get a lot of them running not-so-brilliant code, which after all is easier to produce and maintain. As often happens, hardware excellence does not prevail. You may achieve a lot more just because of how developer-friendly the platform is and how easily your code can be ported.
This brings us to the third parallel platform: a network of computers (aka the cloud). Imagine you get hundreds of PCs talking to each other over a local area network. Individually they could be rather mediocre systems. They will be highly unreliable: hard drives could stop working at any time, power supply units could get busted without warning. Still, as a collective they could be cheaper, faster, and simpler to program than anything else.