Monday, March 24, 2014

Landmark Voxel Creations

Not sure if you knew about this, but Everquest Next Landmark entered Alpha about one month ago. During that month players were introduced for the first time to the voxel world and tools we have jointly developed with Sony Online Entertainment.

We still have a lot of work to do. The game has only just entered Beta. Still, I marvel at the incredible creations players have made in such a short time and with such early versions of the tools.

Have a look:

I do not know about you, but it seems to me player-generated content does come close to what game studios can do. Hopefully very soon we will be able to completely blur that line.

Probably the biggest surprise was to see all the emergent techniques devised by the players. We knew our voxels were able to encode all sorts of funny things, but the specifics of how they could be achieved were a purely player-driven development. Players even had to name these things, so they gave us "microvoxels", "antivoxels", "zero-volume voxels" and other similar things that actually make a big difference in how you can create in the game.

Someone once told me the best software you can write is one that won't have any users. You can relax and have a life. Users (or players in this case) are that reality check developers secretly fear so much. Now I realize this software cannot exist in isolation from the builder community. Thanks to our players we continue to learn about and understand the emergent properties of the platform we have created.

Keep up the amazing work, guys!

Tuesday, March 18, 2014

Parallel Concerns

Computers continue to increase in power, pretty much at the same rate as before. They double their performance every 18 to 24 months. This trend is known as Moore's law.  The leaders of the microprocessor industry swear they see no end to this law in the 21st century, that everything is fine. Some others say this will come to an end around 2020, but with some luck we should be able to find another exponential to ride.

Regardless of whom you believe, you should be wary of the fine print. Moore's law is about transistor count; it says nothing about how fast they operate. Since the sixties, programmers had become used to seeing algorithms improve their performance with every new hardware generation. You could count on Moore's law to make your software run twice as fast every 18 months. No changes were needed, not even recompiling your code. The same program would just run faster.

This way of thinking came to an end around 2005. Clock frequencies hit a physical limit. Since then you cannot compute any faster, you can only compute more. To achieve faster results the only option is to compute in parallel. 

Chip-makers try not to make a big deal out of this, but it makes a world of difference to programmers. If they were car-makers, it would be as if their cars had reached a physical limit of 60 miles per hour. When asked how you could do 120 miles per hour, they would suggest you take two cars.

Many programmers today ignore this monumental shift. It is often because they produce code for platforms that already deal with parallelization, like database engines or web servers. Or they work creating User Interfaces, where a single thread of code is already fast enough to outpace humans.

Procedural generation requires all the processing power you can get. Any algorithms you design will have to run in this new age of computing. You have no choice but to make them run in parallel.

Parallel computing is an old subject. For many years now programmers have been trying to find a silver bullet approach that will take a simple serial algorithm and re-shuffle it so it can run in parallel. None has succeeded. You simply cannot ignore parallelization and rely on wishful thinking. Your algorithms have to be designed with it in mind. What is worse, your algorithm design will depend on the hardware you choose, because parallel means something different depending on where you go.

This post is about my experience with different parallel platforms.

In my view, there are three main hardware platforms worth considering today. The first one is traditional CPUs, like the x86 series by Intel. Multicore CPUs are now common, and the number of cores may still grow exponentially. Ten years from now we could have hundreds of them. If you manage to break your problem into many different chunks, you can feed each chunk to an individual core and have them run in parallel. As the number of cores grows, your program will run faster.

Let's say you want to generate procedural terrain. You could break the terrain into regular chunks by using an Octree, then process many Octree cells in parallel by feeding them to the available cores.
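A minimal sketch of that idea in Python. The `generate_cell` function is a hypothetical stand-in for a real terrain generator, and the octree is simply a uniform subdivision to a fixed depth. A thread pool plays the role of the cores here to keep the sketch short; in CPython, CPU-bound work would need a process pool to really run on several cores.

```python
from concurrent.futures import ThreadPoolExecutor

def subdivide(origin, size, depth):
    """Return the leaf cells of an octree as (origin, size) pairs."""
    if depth == 0:
        return [(origin, size)]
    half = size / 2
    x, y, z = origin
    leaves = []
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                child = (x + dx * half, y + dy * half, z + dz * half)
                leaves.extend(subdivide(child, half, depth - 1))
    return leaves

def generate_cell(cell):
    """Hypothetical stand-in for terrain generation: plain serial code per cell."""
    (x, y, z), size = cell
    # Placeholder "density" at the cell centre; real code would fill voxels here.
    cx, cy, cz = x + size / 2, y + size / 2, z + size / 2
    return cell, cy - (cx + cz) * 0.1

def generate_terrain(world_size=64.0, depth=2, workers=8):
    cells = subdivide((0.0, 0.0, 0.0), world_size, depth)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Cells are independent, so the pool can hand them out in any order.
        return list(pool.map(generate_cell, cells))
```

The only parallel part is the `pool.map` call. Everything inside `generate_cell` is ordinary serial code.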

The x86 platform has the nicest, most mature development tools. Also, since the number of concurrent threads is not very high, the parallelization effort is mostly about breaking the work into large chunks and sewing the results back together. Most of the actual computation remains serial. This is a bonus: anyone can write stable serial code, but you will pay dearly for good parallel programmers. The more old-fashioned serial code you can write within your fully parallel solution the better.

Being so oriented towards serial, generic code is also the weakness of traditional CPUs. They devote a great deal of transistors and energy to dealing with the unpredictability of generic algorithms. If your logic is very simple and linear, all this silicon goes to waste.

The second hardware platform is the GPU. These are the video cards powering games on PCs and consoles. Graphics rendering is highly parallel. The image you see on screen is a matrix of pixels, where each pixel's color can be computed mostly in isolation from the rest. Video cards have evolved around this principle. They allow hundreds of concurrent threads to run in parallel, each one devoted to producing one fragment of the final image. Compared to today's average CPUs, which allow only eight simultaneous threads, hundreds of threads may seem a bonanza.

The catch is all these GPU threads are required to run the same instruction at the same time. Imagine you have 100 dancers on stage. Each dancer must execute exactly the same steps. If just one dancer deviates from the routine, the other dancers stop and wait.

Perfect synchronization is desirable in ballet, not so much in general purpose computing. A single "if" statement in the code could be enough to create divergence. What often happens when an "if" is encountered is that both branches are executed by all threads, then the results of the branch that was not supposed to run are discarded. It is like some sort of weird quantum reality where all alternate scenarios do happen, but only the outcome of one is picked afterwards. The cat is really both dead and alive in your GPU.
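The effect can be imitated in plain Python. This is a toy model of the behavior described above, not a description of how any particular GPU is built; the function name and the two branch bodies are made up for illustration. Real hardware can also skip a side that no thread in the group takes, which the sketch ignores.

```python
def run_branch_lockstep(values):
    """Toy model of a group of GPU threads hitting `if v < 0: ... else: ...`."""
    mask = [v < 0 for v in values]          # which lanes want the "if" side
    then_results = [-v for v in values]     # every lane executes the "if" side
    else_results = [v * 2 for v in values]  # every lane executes the "else" side
    # Only now is one result kept per lane; the other one is thrown away.
    kept = [t if m else e for m, t, e in zip(mask, then_results, else_results)]
    work_done = len(then_results) + len(else_results)
    return kept, work_done
```

For four lanes the function performs eight branch evaluations and keeps four of them.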

Loops with a variable number of iterations are a problem too. If 99 of the dancers spin twice and the one remaining dancer, for some inexplicable reason, decides to spin forty times, the 99 dancers won't be able to do anything else until the rogue dancer is done. The execution time is the same as if all the dancers had done forty loops.
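The dancer arithmetic can be written down as a small cost model. This is a sketch under the simplifying assumption that all threads belong to a single lockstep group; `lockstep_loop_cost` is a name invented for this example.

```python
def lockstep_loop_cost(iterations_per_thread):
    """Toy cost model: threads in one group all wait for the slowest loop."""
    steps = max(iterations_per_thread)             # the group runs until the last thread is done
    useful = sum(iterations_per_thread)            # iterations that actually produced something
    occupied = steps * len(iterations_per_thread)  # thread slots held busy in total
    return steps, useful / occupied

steps, efficiency = lockstep_loop_cost([2] * 99 + [40])
```

With 99 threads looping twice and one looping forty times, the group takes 40 steps and only about 6% of the occupied thread slots do useful work.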

So programming mainstream GPUs is great as long as you can avoid loops and conditional statements. This sounds absurd, but with the right mindset it is very much possible. The speedup compared to a multithreaded CPU implementation may be significant.

There are some frameworks that allow general purpose programs to run on GPUs. CUDA is probably the best known. It is deceptively simple. You write a single function in a language almost identical to C. Each one of the many threads in the GPU will run this function at the same time, but each one will input and output data from a different location. Let's say you have two large arrays of the same size, A and B. You want to compute array C as the sum of these two arrays. To solve this using CUDA you would write a function that looks like this:

void Add(in float A[], in float B[], out float C[])
{
  int i = getThreadIndex();   // each thread gets a different index
  C[i] = A[i] + B[i];         // and handles only that one element
}

This is pseudocode; the actual CUDA code would be different, but the idea is the same. This function processes only one element in the array. To have the entire array processed you need to ask CUDA to spawn a thread for each element.
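The "one thread per element" launch can be mimicked in Python to make the division of labor concrete. This is an emulation, not CUDA: `launch` is a hypothetical helper that runs the kernel once per index in a serial loop, where a GPU would run the calls at the same time.

```python
def add_kernel(i, a, b, c):
    """Body run by one thread: it handles only element i."""
    c[i] = a[i] + b[i]

def launch(kernel, n_threads, *args):
    """Stand-in for a GPU launch: call the kernel once per thread index.
    A real GPU would run these calls concurrently; this loop is serial."""
    for i in range(n_threads):
        kernel(i, *args)

a = [1.0, 2.0, 3.0]
b = [10.0, 20.0, 30.0]
c = [0.0] * len(a)
launch(add_kernel, len(a), a, b, c)
```

Note that the kernel itself contains no loop. The loop over elements is replaced by the launch.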

One big drawback of CUDA is that it is a proprietary framework. It is owned by Nvidia and so far it is limited to their hardware. This means you cannot run CUDA programs on AMD GPUs.

An alternative to CUDA is OpenCL. OpenCL was proposed by Apple, but it is an open standard like OpenGL. It is almost identical to CUDA, maybe a tad more verbose, but for a good reason: not only does it run on both Nvidia and AMD GPUs, it also runs on CPUs. This is great news for developers. You can write code that will use all computing resources available.

Even with these frameworks to aid you, GPU programming requires a whole different way of thinking. You need to address your problem in a way that can be digested by this swarm of rather stupid worker threads. You will need to come up with the right choreography for them, otherwise they will wander aimlessly scratching their heads. And there is one big skeleton in the closet. It is easy to write programs that run on the GPU, but it is hard to make full use of it. Often the bottleneck is between the GPU and the main memory. It takes time to feed data and read results. Adding two arrays in the naive form, as shown in the example before, would spend 99% of the time moving data and 1% doing the actual operation. As the arrays grow in size, the GPU performs poorly compared to a CPU implementation.
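A back-of-the-envelope estimate shows why the naive array addition is dominated by data movement. Both rates below are assumed round numbers chosen for illustration (8 GB/s over the bus, 10^12 additions per second on the GPU), not measurements of any real card, and the model ignores latency and any overlap between transfer and compute.

```python
def transfer_fraction(n, bus_bytes_per_s=8e9, gpu_adds_per_s=1e12):
    """Fraction of total time spent moving data for C = A + B on n floats.

    Both rates are assumed round numbers for illustration, not measurements.
    """
    bytes_moved = 3 * 4 * n            # send A and B, read back C, 4 bytes per float
    transfer_time = bytes_moved / bus_bytes_per_s
    compute_time = n / gpu_adds_per_s  # one addition per element
    return transfer_time / (transfer_time + compute_time)

share = transfer_fraction(1_000_000)
```

Under these assumptions the transfer share comes out above 99%. The remedy is to do much more arithmetic per byte moved, or to keep the data on the GPU between steps.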

So which platform should you target, CPUs or GPUs?

Soon it may not be so clear-cut anymore. CPUs and GPUs are starting to converge. CPUs may soon include a higher number of less intelligent cores. They will not be very good at running unpredictable generic code, but will do great with linear tasks. On the other side, GPUs will get better at handling conditionals and loops. And new architectures are already improving the bandwidth issues. Still, it will take a few years for all this to happen and for this hardware to become mainstream.

If you were building a procedural supercomputer today, it would make sense to use a lot of GPUs. You would get more done for a smaller electricity bill. You could stick to Nvidia hardware and hire a few brainiacs to develop your algorithms in CUDA.

If you want to crowd-source your computing, have your logic run by your users, GPUs also make a lot of sense. But then using a general purpose computing framework like CUDA or OpenCL may not be a good idea. In my experience you cannot trust everyone to have the right stack of drivers you will require. In the absence of a general computing GPU framework you would need to use graphics rendering APIs like DirectX and OpenGL to perform general purpose computing. There is a lot of literature and research on GPGPU (General-Purpose computing on GPUs) but this path is not for the faint of heart. Things will get messy.

On the other hand, CPUs are very cheap and easy to program. It is probably better to get a lot of them running not-so-brilliant code, which after all is easier to produce and maintain. As often happens, hardware excellence does not prevail. You may achieve a lot more just because of how developer-friendly the platform is and how easy it will be to port your code.

This brings us to the third parallel platform, which is a network of computers (aka the cloud). Imagine you get hundreds of PCs talking to each other over a Local Area Network. Individually they could be rather mediocre systems. They will be highly unreliable: hard drives could stop working at any time, and power supply units could fail without warning. Still, as a collective they could be cheaper, faster and simpler to program than anything else.

Friday, March 14, 2014

GDC 2014

I will be visiting GDC this year with a couple of friends from Voxel Farm. If you would like to meet us drop me a line at miguel at

Saturday, March 8, 2014

Procedural Information

Information is the measure of what you do not know. 

You just looked at your watch and realized it is 3:00 PM, then someone comes into your office and tells you it is 3:00 PM. The amount of information the person gave you amounts to zero. You already knew that. That person did give you data, but data is not necessarily information.

Information is measured in bits, bytes, etc. If you ask someone, "Is it raining out there?", the answer will be one bit worth of information, no matter what the weather looks like.
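One common way to put numbers on this is Shannon's entropy, which measures the average information of an answer from how likely you thought each possible answer was. The sketch below assumes that definition; `entropy_bits` is a name chosen for this example.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy, in bits, of a set of possible answers."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

rain_unknown = entropy_bits([0.5, 0.5])   # no idea whether it is raining
time_known = entropy_bits([1.0])          # you already looked at your watch
rain_likely = entropy_bits([0.9, 0.1])    # you were already fairly sure
```

A yes/no answer you could not predict at all is worth exactly one bit, being told the time you just read is worth zero, and an answer you mostly expected is worth less than one bit on average.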

You are now looking at a photo of a real lake on your computer screen:

Let's imagine it is the first time you see this photo. This is information to you, but how many bits of it? You could check the file size; it is already in bytes. It turns out it is a BMP file and it is 300 KBytes. Did you just receive 300 KBytes through your eyes? Somehow this seems suspicious to you. You know that if the file was compressed as a PNG the file size would be a lot less, probably around 90 KBytes, with no visual degradation. So what is going on? Was it 300 or 90 KBytes that you just saw?

Nobody can tell you the right amount. Your eyes, brain and psyche are still mysterious objects to modern science. But whatever it is, it will be closer to 90 than 300. The PNG compression took out a lot of bits that were not really information. Compression algorithms reshuffle data so that redundancy becomes evident. Then they take it out. It is like having someone else stop that person before entering your office to announce it was 3:00 PM.

How is this related to procedural generation?

Now imagine I have sent you this little EXE file. It is only 300 KBytes. When you run it, it turns out to be a game. You see terrain, trees, buildings. There are some creatures that want you dead. You learn to hate them back, you fight them everywhere you go. You find it amusing that even if you keep walking, this world appears to never end. You play for days, weeks. Eventually you realize the game's world is infinite, it has no limit. All this was contained in 300K, yet the information coming out of it appears to be infinite. How is this possible?

You are being tricked. You are not getting infinite information, it is all redundant. The information was the initial 300 KBytes. You have been listening to echoes believing someone was talking to you.

This is a hallmark of procedural generation: a trick of mirrors that produces interesting effects, like a kaleidoscope. A successful procedural generator deceives you into thinking you are getting information when you are not. That is hard to achieve. In the same way we love information, we dislike redundancy. It wastes space and time, it does nothing for us. Our brains are very good at discovering it, and we adapt quickly to see through any new trick.

Now, does this mean software cannot create information? There is energy going into this system, can it be used for more than powering infinite echoes? This is one of the big questions out there. It is beyond software. Can anyone create information at all?

If you look at the lake picture again, you may ask yourself how it came to be. Not the picture, the actual lake. Is it there partly by chance, or because there was no other choice? Its exact shape, location and size, could they be the inevitable result of a chain of cause-effect events that started when the Universe began? If that is the case, the real lake is not information, it is an echo of a much smaller but powerful universal seed.

The real answer probably does not matter. Even if the lake was an echo of the Big Bang, 42 or some sort of universal seed, the emergent complexity is so high we cannot recognize it as an echo. Our brains and senses cannot go that far. If you are ready to accept that, then, yes, software can create information.

The key is simulation. Simulations are special because they acknowledge the existence of time, cause and effect. You pick a set of rules, a starting state and you let things unfold from there. If humans are allowed to participate by changing the current state at any point in time, the end results could be very surprising.

The problem with simulation is that it is very expensive. If you keep rules too simple or simulate for very little time, results may not be realistic enough. If you choose the right amount of rules and simulate for the right amount of time, you may realize it would take too long to be practical.

When it comes to procedural generation you will find these two big families of techniques. One family is based on deception, produces no information, but it is fast and cheap. The other family has great potential, but it is expensive and difficult. As a world builder you should play to their strengths, avoid their pitfalls. And what is more interesting: learn how to mix them.
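The two families can be contrasted in a few lines of Python. This is a toy sketch, not a real world generator: the hash-based `height_at` and the one-rule `simulate` are both invented for illustration. The first is a pure function of a seed and a position, so it can produce endless terrain with no state and no notion of time. The second applies a rule step by step, so its result depends on the starting state and on how long it runs.

```python
import hashlib

def height_at(seed, x):
    """Family 1: a pure function of (seed, x). No state, no time, endless output."""
    digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
    return digest[0] % 8            # a terrain height from 0 to 7

def simulate(heights, steps):
    """Family 2: a toy simulation. At each step, a column more than one unit
    taller than its right-hand neighbour sheds one unit onto it."""
    heights = list(heights)
    for _ in range(steps):
        nxt = list(heights)
        for i in range(len(heights) - 1):
            if heights[i] - heights[i + 1] > 1:
                nxt[i] -= 1
                nxt[i + 1] += 1
        heights = nxt
    return heights

# Mixing the two: the cheap generator provides the starting state,
# and a few simulation steps reshape it.
terrain = [height_at(42, x) for x in range(16)]
settled = simulate(terrain, 5)
```

Calling `height_at` again with the same seed and position always returns the same height, which is the echo at work. The simulation conserves the total amount of material and only moves it around over time.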