Computers continue to increase in power at roughly the same rate as before, doubling their performance every 18 to 24 months. This trend is known as Moore's law. The leaders of the microprocessor industry swear they see no end to it in the 21st century, that everything is fine. Others say it will come to an end around 2020, but that with some luck we should be able to find another exponential to ride.
Regardless of who you believe, you should 
be wary of the fine print. Moore's law is about transistor count; it says nothing about how fast those transistors run. Since the sixties, programmers grew used to seeing their code run faster with every new hardware generation. You
could count on Moore's law to make your software run twice as fast every 
18 months. No changes were needed, not even to recompile your code. The 
same program would just run faster.
This way of thinking came to an end around 2005, when clock frequencies hit a physical limit. Since then you cannot compute any faster; you can only compute more. To achieve faster results, the only option is to compute in parallel.
Chip-makers try not to make a big deal out of this, but for programmers it is a world of difference. If they were car-makers, it would be as if their cars had reached a physical limit of 60 miles per hour. When asked how you could go 120 miles per hour, they would suggest you take two cars.
Many programmers today ignore this monumental shift. Often it is because they write code for platforms that already deal with parallelization, like database engines or web servers. Or they build user interfaces, where a single thread of code is already fast enough to outpace any human.
Procedural generation requires all the processing power you can get. 
Any algorithms you design will have to run in this new age of computing. 
You have no choice but to make them run in parallel.
Parallel computing is an old subject. For many years now 
programmers have been trying to find a silver bullet approach that will 
take a simple serial algorithm and re-shuffle it so it can run in parallel. 
None has succeeded. You simply cannot ignore parallelization and rely on wishful thinking. Your 
algorithms have to be designed with it in mind.
What is worse, your algorithm design will depend on the hardware you choose, because parallel means something different depending on where you go.
This post is about my experience with different parallel platforms.
In my view, there are three main hardware platforms worth considering today.
The first one is the traditional CPU, like the x86 series from Intel. Multicore CPUs are now common, and the number of cores may still grow exponentially. Ten years from now we could have hundreds of them. If you manage to break your problem into many
different chunks, you can feed each chunk to an individual core and have 
them run in parallel. As the number of cores grows, your program will run 
faster.
Let's say you want to generate procedural terrain. You could break the 
terrain into regular chunks by using an Octree, then process many Octree 
cells in parallel by feeding them to the available cores.
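Here is a minimal sketch of that idea in C++. The OctreeCell structure and the generateCell function are hypothetical placeholders, just enough to show how independent cells can be dispatched to whatever cores are available:

#include <future>
#include <vector>

// Hypothetical octree cell; a real generator would store bounds,
// level of detail, voxel data and so on.
struct OctreeCell { int x, y, z, size; };

// Placeholder for the expensive, serial generation of one cell.
void generateCell(OctreeCell& cell)
{
  // ... noise evaluation, meshing, etc. ...
}

// Process every cell in parallel. std::async lets the runtime spread
// the tasks over the available cores.
void generateTerrain(std::vector<OctreeCell>& cells)
{
  std::vector<std::future<void>> tasks;
  tasks.reserve(cells.size());

  for (OctreeCell& cell : cells)
    tasks.push_back(std::async(std::launch::async, generateCell, std::ref(cell)));

  // Wait for every cell to finish before stitching the results together.
  for (std::future<void>& task : tasks)
    task.wait();
}

A production system would use a thread pool or a job scheduler rather than one thread per cell, but the structure is the same: independent chunks in, plain serial work inside each chunk, results sewn together at the end.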
The x86 platform has the nicest, most mature development tools. Also, since the number of concurrent threads is not very high, the parallelization effort comes down to breaking the work into large chunks and sewing the results back together. Most of the actual computation remains serial. This is a bonus: anyone can write stable serial code, but you will pay dearly for good parallel programmers. The more old-fashioned serial code you can fit within your fully parallel solution, the better.
Being so oriented towards serial, generic code is also the weakness of traditional CPUs. They devote a great deal of transistors and energy to dealing with the unpredictability of generic algorithms. If your logic is very simple and linear, all this silicon goes to waste.
The second hardware platform is the GPU. These are the video cards powering games on PCs and consoles. Graphics rendering is highly parallel. The image you see on screen is a matrix of pixels, where each pixel's color can be computed mostly in isolation from the rest. Video cards have evolved around this principle. They allow hundreds of concurrent threads to run in parallel, each one devoted to producing one fragment of the final image. Compared to today's average CPU, which allows only eight simultaneous threads, hundreds of threads may seem like a bonanza. The catch is that all these GPU threads are required to run the same instruction at the same time.
Imagine you have 100 dancers on stage. Each dancer must execute 
exactly the same steps. If just one dancer deviates from the routine, the 
other dancers stop and wait.
Perfect synchronization is desirable in ballet, not so much in general purpose computing. A single "if" statement in the code can be enough to create divergence. What often happens when an "if" is encountered is that both branches are executed by all threads, and the results of the branch that was
not supposed to run are discarded. It is like some sort of weird quantum 
reality where all alternate scenarios do happen, but only the outcome of 
one is picked afterwards. The cat is really both dead and alive in your GPU.
Loops with a variable number of iterations are a problem too. If 99 of the dancers spin twice and the one remaining dancer, for some inexplicable reason, decides to spin forty times, the 99 dancers won't be able to do anything else until the rogue dancer is done. The execution time is the same as if all the dancers had spun forty times.
So programming mainstream GPUs is great as long as you can avoid loops and 
conditional statements. This sounds absurd, but with the right mindset it is 
very much possible. The speedup compared to a multithreaded CPU 
implementation may be significant.
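To give a flavor of that mindset, here is a toy example. Suppose each thread must clamp a value to some limit. The branchy version invites divergence; the branchless version turns the condition into plain arithmetic so every dancer executes the exact same steps. The function names are mine, not from any particular API:

// Divergent version: threads that take the "if" and threads that skip it
// force the hardware to walk through both paths.
float clampBranchy(float x, float limit)
{
  if (x > limit)
    return limit;
  return x;
}

// Branchless version: the condition becomes a value (0.0 or 1.0)
// that is used in ordinary arithmetic to select the result.
float clampBranchless(float x, float limit)
{
  float isOver = (float)(x > limit);
  return isOver * limit + (1.0f - isOver) * x;
}

In practice you would just call the built-in min function, which performs exactly this kind of branchless selection. Variable loops get a similar treatment: you run a fixed, worst-case number of iterations for everyone and discard the results you do not need.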
There are some frameworks that allow general purpose programs to run 
on GPUs. CUDA is probably the best known. It is deceptively simple. You
write a single function in a language almost identical to C. Each one of the many threads in the GPU will run this function at the same time, but each one will read its input and write its output at a different location.
Let's say you have two large arrays of the same size, A and B. You want 
to compute array C as the element-wise sum of these two arrays. To solve this using
CUDA you would write a function that looks like this:
void Add(const float A[], const float B[], float C[])
{
  // Each thread handles a single element of the arrays
  int i = getThreadIndex();
  C[i] = A[i] + B[i];
}
This is pseudocode; the actual CUDA code would look a bit different, but the idea is the same. The function processes only one element of the array. To have the entire array processed, you need to ask CUDA to spawn a thread for each element.
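For reference, here is a hedged sketch of what the real thing might look like, kernel and host side together. It is minimal and unoptimized; the kernel derives its index from CUDA's built-in blockIdx and threadIdx variables, and the host has to copy the arrays to the GPU and back, which is exactly the data movement discussed below:

__global__ void Add(const float* A, const float* B, float* C, int n)
{
  // One thread per element; guard against the spare threads in the last block
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    C[i] = A[i] + B[i];
}

void AddOnGPU(const float* A, const float* B, float* C, int n)
{
  size_t bytes = n * sizeof(float);
  float *dA, *dB, *dC;

  // Allocate GPU memory and push the inputs across the bus
  cudaMalloc((void**)&dA, bytes);
  cudaMalloc((void**)&dB, bytes);
  cudaMalloc((void**)&dC, bytes);
  cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

  // Spawn one thread per element, grouped in blocks of 256
  int blockSize = 256;
  int numBlocks = (n + blockSize - 1) / blockSize;
  Add<<<numBlocks, blockSize>>>(dA, dB, dC, n);

  // Pull the result back and release the GPU memory
  cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);
  cudaFree(dA);
  cudaFree(dB);
  cudaFree(dC);
}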
One big drawback of CUDA is that it is a proprietary framework. It is 
owned by Nvidia and so far it is limited to their hardware. This means you 
cannot run CUDA programs on AMD GPUs.
An alternative to CUDA is OpenCL. OpenCL was proposed by Apple, 
but it is an open standard like OpenGL. It is almost identical to CUDA, 
maybe a tad more verbose, but for a good reason: not only does it run on both Nvidia and AMD GPUs, it also runs on CPUs. This is great news for
developers. You can write code that will use all computing resources 
available.
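As a point of comparison, the same array addition written as an OpenCL kernel might look like the sketch below. The host-side boilerplate that compiles the kernel and binds its arguments is omitted, and that is where most of the extra verbosity lives:

__kernel void Add(__global const float* A,
                  __global const float* B,
                  __global float* C)
{
  // get_global_id plays the same role as the thread index in CUDA
  int i = get_global_id(0);
  C[i] = A[i] + B[i];
}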
Even with these frameworks to aid you, GPU programming requires a 
whole different way of thinking. You need to address your problem in a 
way that can be digested by this swarm of rather stupid worker 
threads. You will need to come up with the right choreography for them, 
otherwise they will wander aimlessly scratching their heads.
And there is one big skeleton in the closet. It is easy to write programs that run on the GPU, but it is hard to make full use of it. Often the bottleneck is the link between the GPU and main memory. It takes time to feed data in and read results back. Adding two arrays in the naive form shown in the example before would spend 99% of the time moving data and 1% doing the actual operation: for every addition, two input values must travel to the GPU and one result must travel back, while the addition itself is a single instruction. As the arrays grow in size, the GPU performs poorly compared to a CPU implementation.
So which platform should you target, CPUs or GPUs?
Soon it may not be a clear-cut choice anymore. CPUs and GPUs are starting to converge. CPUs may soon include a higher number of less intelligent cores. They will not be very good at running unpredictable generic code, but will do great with linear tasks. On the other side, GPUs will get better at handling conditionals and loops. And new architectures are already improving on the bandwidth issues. Still, it will take a few years for all this to happen and for such hardware to become mainstream.
If you were building a procedural supercomputer today, it would make sense to 
use a lot of GPUs. You would get more done for a smaller electricity bill. 
You could stick to Nvidia hardware and hire a few brainiacs to develop 
your algorithms in CUDA.
If you want to crowd-source your computing and have your logic run by your users, GPUs also make a lot of sense. But then using a general purpose computing framework like CUDA or OpenCL may not be a good idea. In my experience you cannot trust everyone to have the driver stack you will require. In the absence of a general computing framework, you would need to use graphics rendering APIs like DirectX and OpenGL to perform general purpose computing. There is a lot of literature and research on GPGPU (general-purpose computing on GPUs), but this path is not for the faint of heart. Things will get messy.
On the other hand, CPUs are very cheap 
and easy to program. It is probably better to get a lot of them running not-so-brilliant 
code, which after all is easier to produce and maintain. As often happens, hardware excellence does not prevail. You may achieve a lot more simply because of how developer-friendly the platform is and how easy it will be to port your code.
This brings us to the third parallel platform, which is a network of computers (aka the cloud). Imagine you get hundreds of PCs talking to
each other over a Local Area Network. Individually they could be rather
mediocre systems. They will be highly unreliable: hard drives can stop working at any time, power supply units can get busted without warning. Still, as a collective they could be cheaper, faster and simpler to program than anything else.