Wednesday, March 27, 2013

Storage Matters

Imagine you were creating a massive persistent world where everyone would be able to change anything at will. It is a simple, powerful idea that has occurred to just about everyone ever exposed to a game. Why aren't there many of these worlds out there? Well, this very simple idea is quite difficult and expensive to execute. Not only do you need to store the information, you have to be able to write it and read it in a timely fashion.

Then how about your own personal world, something you could run on your PC and invite some friends to play in? How much of your PC's performance are you willing to sacrifice? How many people could you actually invite before you would see the quality of your gameplay begin to suffer?

I began wondering whether all of the above could be manifestations of the same problem. What if you could have a storage solution lightweight enough for enthusiasts to run at home, and if you pieced enough of them together, it could scale to run massive worlds the size of planet Earth?

As it turns out, it was possible. I now have a shiny new database system that does exactly that. The main trick is that it aligns with the same concepts as the rest of the voxel world. This is mainly a voxel database. It won't do SQL queries, XPath evaluation or any other form of traditional DB interaction. It just stores and retrieves voxel data very fast.

How fast? Over a 10 minute period, a machine with a six-year-old Intel processor (T2500 at 2GHz) and an equally crappy HD was able to serve 10 Gigabytes worth of individual queries while another 10 Gigabytes worth of queries were being written. Each query ranged from 500 bytes to 100 KBytes worth of data.

That would translate into a lot of friends sharing your server. To give you a better idea, a volume of 40x40x40 player voxels compresses to 2 KB on average. Here is how you would compute how much world 10 GB of voxel data represents:

1 chunk = 40x40x40 voxels = 12x12x12 meters
1 chunk = 2 KB
10 GB = 5,242,880 chunks = 2048x2048x2048 meters
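Here is the same back-of-the-envelope calculation as a small Python sketch; the 2 KB average compressed chunk size is the only assumption:

```python
# Back-of-the-envelope check of the storage figures above.
VOXELS_PER_SIDE = 40        # one chunk is 40x40x40 voxels...
METERS_PER_SIDE = 12        # ...covering a 12x12x12 meter cube
CHUNK_BYTES = 2 * 1024      # average compressed chunk size (2 KB)
BUDGET = 10 * 1024**3       # 10 GB of voxel data

chunks = BUDGET // CHUNK_BYTES
side_in_chunks = round(chunks ** (1 / 3))
side_in_meters = side_in_chunks * METERS_PER_SIDE

print(chunks)               # 5242880
print(side_in_meters)       # 2088, roughly the 2048 quoted above
```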

How many people could create this amount of voxel content in 10 minutes? I have no idea, but I bet it would take an entire army. At this point the DB is the least of your concerns. The bottleneck is the network.

The twist comes now: while this rate was sustained for 10 minutes, it was not meant to push the system to the limit. The DB process CPU usage never went above 1% and the memory usage for the process remained at 3 MB. The system was responsive and usable (well, as usable as a six-year-old PC can be), showing no big difference in behavior.

Here is some evidence:

For those of you who are more artistically or design inclined, this is certainly the most boring screenshot I have ever posted. But if you are into programming this kind of thing, this is process porn.

Of course the system is doing real work. The main clue is in a different column not displayed by Task Manager by default: virtual memory, which hovered below 20 Megs the whole time. Even then, the virtual memory was lower than what Google Chrome was using, which was a whopping 99 Megs.

The voxel database is so fast because it uses the OS's own virtual memory management. Instead of writing to files on the HD directly, all the information is mapped through the OS paging system. Only the pages that need to be altered go into memory. The system also does a lazy write to the HD: even after the process is gone, the OS continues to save the changes to disk.
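The principle can be sketched with a memory-mapped file. This is a hypothetical minimal example, not the engine's actual code, but it shows the key point: the process never issues explicit reads or writes, it just touches mapped memory and lets the OS page lazily.

```python
import mmap
import os
import tempfile

# Hypothetical sketch (not the engine's code): fixed-size chunk records
# stored in a memory-mapped file. The process never calls read() or
# write(); it touches mapped memory, and the OS paging system decides
# which pages live in RAM and flushes dirty pages to disk lazily.
CHUNK_BYTES = 2048
CHUNK_COUNT = 100

path = os.path.join(tempfile.mkdtemp(), "voxels.dat")
with open(path, "wb") as f:
    f.truncate(CHUNK_BYTES * CHUNK_COUNT)   # reserve space up front

db = open(path, "r+b")
view = mmap.mmap(db.fileno(), 0)            # map the whole file

def write_chunk(index, payload):
    # Touching mapped bytes just dirties a page; the kernel writes it
    # back whenever it sees fit (or at close/flush).
    offset = index * CHUNK_BYTES
    view[offset:offset + len(payload)] = payload

def read_chunk(index, size):
    offset = index * CHUNK_BYTES
    return bytes(view[offset:offset + size])

write_chunk(3, b"voxel payload")
print(read_chunk(3, 13))   # b'voxel payload'
```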

I feel this is the stepping stone for great things. It will be fairly easy and inexpensive for people to set up their own servers. They could be hosting a lot of players and barely take a hit for it. This of course depends on how the networking is implemented, which leads into another favorite topic of mine: how to make a server that will not bring your PC to its knees. I will be covering that in the near future.


  1. I love you.

    This is just amazing. The work you are doing could REALLY change the way games work. I mean, think about whole games where EVERYTHING can be interacted with or destroyed! This would be crazy to use in an FPS style game. The GFX of this are great and I can see a major future in this project.

    I bet some company buys it from you or hires you.

    Keep up the amazing work, I hope you continue working on this amazing project!

    1. Thanks but not so sure about FPS games. Realistic destruction is very difficult to achieve as there is no concept of rigid bodies.

      This certainly will help for people who like to create stuff and share it with others.

  2. Are you off-loading the heavy lifting of things (i.e. collision, building, destroying) onto the client in order to get that small of a memory footprint? Or, is there more usage going on (in private bytes or some other type thing)?

    The big concern for multi-player would be that you can't trust the data sent from the client. Most of these calculations would need to be done server-side and would require a decent amount of this data to be loaded into memory.

    I suppose (if you could decompress fast enough), you could constantly load/decompress the chunks needed for client interaction. I think that could fairly quickly limit the number of clients you could support though.

    I'm interested to hear what your thoughts/solutions are on this.

    1. Yes, these are different problems. No matter how you split concerns, you will need a fast way to store information, even if it is generated and used by servers.

      One tough nut I want to crack is the matter of trust. Of course the simple approach to it is to never trust the client. I did a realtime MMO strategy game in the past and that was how we did it. The client was only issuing commands and visualizing stuff. This works, but it is kind of boring from a development perspective. Also expensive to run.

      For some games there is no choice but to move all processing to a server and never trust a client.

      Then for some other game mechanics this extreme may not be necessary. For instance, take a fully creative sandbox mode, where the goal is just to build stuff. No constraints are required, other than maybe authorizing changes (which is pretty fast). If you make sure the input rates are within human capabilities, so automated griefing is on par with human griefing, it could work.

      I mean, if someone figures out a nice robot to read voxels from one source and feed them to the system, it may not necessarily be bad, as long as it does not put other players at a disadvantage.

      And even for sensitive simulation, I have been toying with the idea of having clients audit the information produced by other clients. In this case you would have many clients repeating the same work, but still that beats doing the work in the server. The server would only need to consolidate the audits. I know it is crazy, but I believe trust can be crowdsourced. You would need a statistically significant number of compromised clients for the system to fall apart.
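To make the auditing idea concrete, here is a hypothetical sketch of how a server could consolidate audits: accept a result only once a quorum of independent clients reports the same hash for it. The function name and quorum size are made up for illustration.

```python
from collections import Counter

# Hypothetical sketch of crowdsourced auditing: several clients repeat the
# same computation and report a hash of the result; the server accepts a
# result only once a quorum of independent audits agree.
def consolidate(audit_hashes, quorum=3):
    winner, votes = Counter(audit_hashes).most_common(1)[0]
    return winner if votes >= quorum else None

print(consolidate(["abc", "abc", "abc", "xyz"]))   # abc (quorum reached)
print(consolidate(["abc", "xyz"]))                 # None (not enough audits)
```

A statistically significant number of compromised clients would have to collude on the same wrong hash before the quorum check admits a bad result.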

    2. Client/server trust has been lurking in the back of my mind when thinking about my own multiplayer game processes. I really like this crowdsourcing idea and it would never have occurred to me! :)

      The crowdsourced trust sounds a little like the system for bitcoins.

      And when it comes to processing the changes, couldn't you do something like a render farm, where the processing is split up among all connected clients? So instead of doing the same processing on multiple clients, which seems like unnecessary overhead, you would divide each processing task into chunks spread out over multiple clients. And I'm guessing reading the changes for each client would still be the same.

    4. Bitcoin spends a lot of time and energy asserting trust. Not sure if this is feasible for a realtime environment.

      The engine is called Voxel Farm for a reason: world blocks can be generated anywhere: servers, clients, or even peer clients. The issue with P2P is about speed. A P2P network may not be able to respond fast enough to random queries. Usually you need to come up with each world chunk in 30 milliseconds or so. Both a dedicated server and local generation can do that. A P2P network, I'm not so sure.

    5. I'm still on my "quest" to read your whole blog from start to finish. Almost there..

      I have to reply to this one though as I was conceptualizing this exact thing a month ago (but keep in mind I'm a newbie programmer).

      When you say P2P network, I always think of torrents, though I know games don't operate that way. As you know, torrents split the data up into what I call "data chunks" and you leech them from everyone else who is seeding. This is what my mind wraps around for a P2P online game or MMO. Could you have every player seeding/leeching the whole world in this way, but with some "connection conditions" that revolve around players being close to each other? I.e. Player A is near players 1, 2, & 3, so that is who he seeds most of his "data chunks" to (say 80% of his total uploaded data). Then players 1, 2, & 3 are seeding what they leeched from Player A (say 15% of their total) to the next closest players, and so on as the distance ring gets larger. (Players that are super far from anyone will either have a random connection to other super "alone" players or to the server.)

      If your world has more players, then there's more seeders and leechers and it remains at equilibrium. As players get closer, their rates and connection-interaction increases. And all players have some of the new "torrented" data ready with them so that no upload/download/processing is wasted between all players.

      Anyways, this is just something that went around my head for a while. I researched it heavily, but found limited information that related to game connections and then decided I'm not experienced enough for it.

      I just thought I'd share one of my concepts. I've got many, and you're a guy that may appreciate them or even find one useful :) Cheers!

  3. This really isn't that amazing.

    Your use case story is not actually as compelling as it seems. 10 GB in 10 minutes? That's only ~17 MB/s, a speed most flash drives regularly achieve. Your task manager shows ~3.4 GB total RAM, so your program was able to keep using and recycling the available system RAM. Had you really ramped up the application, say 100 GB in 10 minutes (10x more than your test case), you would have maxed out the available system RAM, and then each new read/write would have forced your six-year-old disk tray to start spinning, stalling performance badly.

    Further - your solution basically involves reading/writing raw data on the disk. You can't do compression/decompression at all: compression and decompression require that you duplicate data for the uncompressed stream, which would significantly increase your RAM footprint. But not doing compression/decompression also means your 10 GB database might be equivalent to a 50 MB Minecraft world. Not exactly ideal.

    And last - this would not work with any kind of threading. The OS paging system has "undefined" behavior with multithreading, so your voxel data is now limited to 1 thread only. In an age of 8-core computers, not good.

    I'm not trying to criticize your approach - this type of solution is the best one we have, OS paging is the best approach to superior IO performance. It's just not the silver bullet you think it is.

    1. I don't mind criticism, just having a bit of trouble with your logic and information.

      Note it is 10 GB read and 10 GB written, simultaneously from multiple threads. It is a DB, not just flat storage, so there are spatial queries going on in every case. It involved updating a lot of existing records, managing fragmentation, and other databasey things. It is not just appending information, which would be straightforward. You would just write to the HD in that case.

      Like I said, this is happening from multiple threads, 12 in this particular case. Not sure where you get this idea of the OS paging system having undefined behavior with multithreading. This is actually the recommended way to do IPC in Windows, with processes and many threads participating at the same time.

      As for compression and decompression, it does not increase RAM use at all. Data can only be used by the cores you have available; if you have 8 cores, the maximum number of worker threads you should have is 16. That is 16 uncompressed buffers, which is peanuts.

      This notion of my compressed 10GB being equivalent to a 50MB Minecraft dataset, not sure where it is coming from. Is that bistromatics?

      >Like I said, this is happening from multiple threads, 12 in this particular case. Not sure where you get this idea of the OS paging system having undefined behavior with multithreading. This is actually the recommended way to do IPC in Windows, with processes and many threads participating at the same time.

      Sure. With locking and proper synchronization, which significantly hurts the IO performance.

      >Note it is 10 GB read and 10 GB written, simultaneously from multiple threads. It is a DB, not just flat storage, so there are spatial queries going on in every case. It involved updating a lot of existing records, managing fragmentation, and other databasey things. It is not just appending information, which would be straightforward. You would just write to the HD in that case.

      The very nature of OS paging is completely file based. You might have a database abstraction on top, but at its core, it is very much flat files.

      >As for compression and decompression, it does not increase RAM use at all. Data can only be used by the cores you have available; if you have 8 cores, the maximum number of worker threads you should have is 16. That is 16 uncompressed buffers, which is peanuts.

      To decompress the data, you take the compressed data (from the database) and apply the algorithm to the stream, which creates a new copy of the data. So you now have two copies of the data: compressed, from the database, and uncompressed, from the decompression algorithm. In a best case scenario, your RAM use is doubled.

      >This notion of my compressed 10GB being equivalent to a 50MB Minecraft dataset, not sure where it is coming from. Is that bistromatics?

      Not at all.

      Your point about being limited to one thread, because otherwise you would end up having some critical sections, applies to every field of computer science, not only data storage.

      Multithreading implies some degree of synchronization. How much it hurts is a matter of how many serial aspects your process has (see Amdahl's law).

      If you have an octree database like this one, you can have queries for different octants go in parallel, as the very spatial hashing you use guarantees no collisions.

      What is even better, you can make the write-lock resolution adaptive. If you see some octants having a lot of write activity, you can refine the octant into sub-octants and you will produce a new set of locks that will allow more operations to go in parallel.

      Only when you have a collision within an octant do you need to serialize, but this is a fraction of cases. Even in this case you would have a reader/writer type of lock, so all reads go in parallel; only writing blocks reads. For a system where you read a lot more than you write, this has incredible gains.
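A minimal sketch of this per-octant reader/writer locking (hypothetical names, not the actual engine code): each octant key gets its own lock, so queries on different octants never contend, and readers within one octant never block each other.

```python
import threading
from collections import defaultdict

# Hypothetical sketch of per-octant locking: one reader/writer lock per
# octant key, so operations on different octants run fully in parallel.
class RWLock:
    def __init__(self):
        self._readers = 0
        self._lock = threading.Lock()
        self._no_readers = threading.Condition(self._lock)

    def acquire_read(self):
        with self._lock:
            self._readers += 1

    def release_read(self):
        with self._lock:
            self._readers -= 1
            if self._readers == 0:
                self._no_readers.notify_all()

    def acquire_write(self):
        self._lock.acquire()          # excludes other writers and new readers
        while self._readers:
            self._no_readers.wait()   # wait for in-flight readers to drain

    def release_write(self):
        self._lock.release()

locks = defaultdict(RWLock)           # one lock per octant key

def read_octant(key, store):
    lock = locks[key]
    lock.acquire_read()
    try:
        return store.get(key)
    finally:
        lock.release_read()

def write_octant(key, value, store):
    lock = locks[key]
    lock.acquire_write()
    try:
        store[key] = value
    finally:
        lock.release_write()

store = {}
write_octant((0, 0, 0, 1), b"chunk data", store)
print(read_octant((0, 0, 0, 1), store))   # b'chunk data'
```

Refining a hot octant into sub-octants then simply means handing out new keys, which produces a fresh set of independent locks.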

      Regarding the RAM cost of compression/decompression: have you looked at how completion ports work? The reality for any server is that you have a very limited number of threads doing your work anyway. For obvious reasons you store the data compressed; only when you need to use it do you decompress it. There is no point in having decompressed data you are not using. Since you have a very limited number of threads, you are also using a very limited set of data at any time. This set is what you need to keep in RAM.

      In my case the uncompressed data for a chunk is 250 KB. A server with 4 cores would have 8 worker threads. That means less than 2 Megs worth of RAM to deal with decompression. Also, most of the decompression actually happens in the clients, as you want to send compressed data down the wire as well.
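The bounded-buffer argument can be sketched like this: with a fixed-size worker pool, only that many uncompressed chunks ever exist at once, no matter how many queries are queued. This is a hypothetical Python illustration using zlib, not the engine's completion-port code.

```python
import os
import zlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical illustration: a bounded worker pool bounds the number of
# uncompressed buffers alive at any moment. With two threads per core and
# ~250 KB per uncompressed chunk, peak decompression RAM stays at a few
# megabytes regardless of how many queries are in flight.
CHUNK_UNCOMPRESSED = 250 * 1024
workers = 2 * (os.cpu_count() or 4)

def serve_query(compressed):
    # The uncompressed copy exists only while this worker handles the query.
    chunk = zlib.decompress(compressed)
    return len(chunk)

payload = zlib.compress(bytes(CHUNK_UNCOMPRESSED))   # all-zero test chunk
with ThreadPoolExecutor(max_workers=workers) as pool:
    sizes = list(pool.map(serve_query, [payload] * 100))

print(set(sizes))                             # {256000}
print(workers * CHUNK_UNCOMPRESSED // 1024)   # worst-case buffer RAM, in KB
```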

    4. >If you have an octree database like this one

      Woah, slow down. You mentioned octree databases, the original blog post just said "shiny". I don't think we're even arguing the same things. I said one thread w/o synchronization for fast access. You clearly have information about the database implementation not given in the above blog post. That's kind of important. Maybe you should have had a blog post on that instead?

      I'm not trying to question your credibility or decisions here - I agreed in my first comment that your solution was indeed the way to go. My point is that just because Windows Task Manager shows low RAM usage doesn't mean that's actually what's happening. And if you did more volume reading/writing in a shorter timespan, you'd notice that too.

      No need to slow down. You ask questions, I provide answers, and others benefit from the exchange.

      This was meant as an introductory post on the system. There is no reason why it should mention how locks are handled, etc. That is up to me to disclose.

      I get the feeling you have tried something like this in the past, but probably a lot simpler, which may justify some of the assumptions you have made. Saving voxel blocks on your own by any chance?

    6. I'd kinda like to take part in this conversation, but my programming credentials are sadly lacking.

      Being artistically inclined, I keep forgetting there is a lot of essential unseen work like this that needs to go on to make this kind of persistent editable world happen.

      I'll come back later when we're talking about making things look pretty ;-)

      keep up the good work

      PS As someone who has read through the whole blog, I can confirm Miguel does refer a number of times to the octree principle the terrain data is built on. Even I vaguely understood it!

  4. This comment has been removed by the author.

  5. The database seems nice so far, but will it be able to "scale up" as an option?
    Also, how about the network load? It's only necessary to send user-manipulated chunks, but if something changes like the server's world generation algorithm, wouldn't it make the data unsynced?

    1. Well, when something changes on the server, you could re-build the client-side worlds... Or, at least, only change the things which changed. With other online games (multiplayer or otherwise) you also need to pause often to update the client. If you change the world generation algorithm, you'd probably want to shut down the server and update the clients anyway.

    2. A single instance of the database can only scale as much as the hardware allows. That means there is a hard limit on what one single DB can do. The thing is, this is an octree-based system, so you can have different databases serving different octants. The same applies to tasks other than storage.

      Like you said, for some applications only user changes need to be transmitted and stored. If something drastic changes in the servers, like the world definition for mountains, it would mean the user-created data does not apply anymore. If you had a house on a hill, the hill may not be there anymore. Your house would be floating. If your application is like that, you need to make sure the main world features do not change once people have started playing in it.

      Now if that is somehow part of the gameplay you want, by removing a hill your server game logic could also clear all the user-built stuff over it and place it as a pile of rubbish on top of the new location.

      That would make for a nice God game actually.

  6. This looks great. I've been thinking something like this should exist, so I'm glad to see someone with the skill and ability to do it. Hope this doesn't die here :)

  7. Have you considered using node.js as your server technology? (or just the libuv) ?

    1. What I have right now is equivalent to libuv, although it does not cover Unix. It uses the same principles: very few worker threads efficiently managed by the OS completion ports and an asynchronous message model on top of that.

      Libuv does fit the profile, and I did consider it, but I would not miss the fun of creating similar components for anything in the world. You can count it as some sort of inspiration.

    2. As long as you have the time and skills to do this! Ambitious dude! :D

  8. Wow, you are really not sitting still, you are doing a lot of work here.
    The size a game world takes is really important.
    One good example of world size reduction was the game Fuel: it was a huge world (mostly useless, but still) and it took up really little space on the HD.
    Keeping good performance and size matters, since it will grow the community that can work with the finished project.

    This engine is best for huge open world/sandbox areas.
    That is the type of thing I like the most in the gaming industry, the feeling of an open world around you, and it seems your work is doing a good job with that.

    good luck :)

  9. Looking forward to the day when PCs with WebGL are capable of running environments like this xD

    May I ask how you went about the database? Was it an already existing database solution or a custom crafted one, and if so, which language/platform was it written in/for?

    1. Oh! And I'm a huge fan of your work; I look forward to one day seeing a retail game using the engine!

  10. While your post doesn't address several important questions, it nevertheless is a good start. Most of the issues that DO arise have already been solved or can be dodged with clever thinking.

    For starters there is the issue of bandwidth, which is more of a factor than CPU/cache performance, HDD read/write times, and RAM performance combined. You already mentioned one solution, which is to initially transmit level/environmental data and from there only transmit world updates to things that are immediately visible. Visibility calculations for determining strictly what IS within the radius of a given player can be done client side, and then the server can throttle this visibility radius as a simple precaution.
    Obviously everything not above the ground can remain unsent.

    More important is compression. If we are talking chunk encoding of voxel data, then we don't have to have XYZ coordinates for data within a given chunk (let's say 8x8x8 subchunks). We can encode each chunk and subchunk as one long chain of bytes. The engine has the dimensions of a chunk preprogrammed, so if each voxel is a byte then the first 16 voxels, say, are arranged in a line starting at the top of the chunk in the northwest corner and ending at the top northeast corner. Move one voxel-width south and start a new line, repeating the same NW-to-NE sequence until you have a layer. Shift down one layer and continue. Rinse and repeat until a whole chunk is reassembled. The position of the individual voxels is then relative to the absolute position of the containing chunk, so you only have to transmit the chunk XYZ and the voxel data, but not the XYZ of individual voxels.
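The layout described here is plain row-major order; a sketch of the index math (the helper names are made up for illustration):

```python
# Sketch of the chunk layout described above: voxels stored as one flat
# byte chain in row-major order, so only the chunk's XYZ travels on the
# wire; each voxel's position is implied by its offset in the chain.
SIDE = 8   # 8x8x8 subchunk, as in the comment

def voxel_index(x, y, z, side=SIDE):
    # x runs along a line, z advances per line, y advances per layer.
    return (y * side + z) * side + x

def voxel_coords(index, side=SIDE):
    x = index % side
    z = (index // side) % side
    y = index // (side * side)
    return x, y, z

chunk = bytearray(SIDE ** 3)          # one byte per voxel
chunk[voxel_index(5, 2, 7)] = 42      # set a voxel by local coordinates

print(voxel_index(5, 2, 7))   # (2*8 + 7)*8 + 5 = 189
print(voxel_coords(189))      # (5, 2, 7)
```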

  11. MongoDB uses a similar strategy (mmapping files into RAM) to achieve its speed. The cost is data corruption on crash: if you let the OS choose the order of writes, you can't guarantee whatever is on disk after a write is consistent. You can do tricks with sentinels per page, but most people who start with this strategy end up abandoning it in favor of some form of write-ahead logging or log-structured store.

    You are already a step ahead of most applications since I'm assuming your data is pointer-free and has a lot of locality. Otherwise, the mmap approach would be a lot less attractive to you. Unless you really don't care about losing your data in a crash, you will end up implementing most of what RocksDB gives you for free, so you might as well just bite the bullet and use RocksDB. Your keys can just be the octree node addresses using a simple prefix scheme.
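The prefix scheme could be as simple as one byte per octree level. The snippet below is a hypothetical sketch using a dict as a stand-in for the key-value store; the point is that a parent's key is a prefix of every descendant's key, so in a sorted store like RocksDB a prefix scan returns a whole subtree.

```python
# Sketch of an octree-node key scheme: a node's key is its path of octant
# digits (0-7) from the root, one byte per level. A parent's key is then a
# prefix of every descendant's key, so in a sorted key-value store a
# single prefix scan fetches an entire spatial subtree.
def octree_key(path):
    return bytes(path)          # e.g. (5, 0, 3) -> b'\x05\x00\x03'

store = {}                      # stand-in for the real key-value store
store[octree_key((5, 0))] = b"parent chunk"
store[octree_key((5, 0, 3))] = b"child chunk"
store[octree_key((6,))] = b"unrelated chunk"

def subtree(store, path):
    # In RocksDB this would be an iterator seek to the prefix; here we
    # just filter the dict for illustration.
    prefix = octree_key(path)
    return {k: v for k, v in store.items() if k.startswith(prefix)}

print(sorted(subtree(store, (5, 0)).values()))
# [b'child chunk', b'parent chunk']
```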

    I have experience doing this kind of storage on huge scale, so I'm happy to give pointers and critique ideas.

    1. Yes, the data chunks have no pointers; all the data is local, so in case of corruption the damage is constrained to one chunk. This DB is mostly for implementing caches. For long-term persistence I recommend using something more reliable. I will take a look at RocksDB, thanks for the suggestion.

  12. The RocksDB team just released 2.8 today! Note that it's optimized for flash, so I'm not sure whether it's ideal for client side use in a general-purpose engine, though you could definitely use it on the server side.

    Regarding compression, Facebook uses in-memory compression all over the place. It's highly application-dependent whether it will help, but for large caches it's almost always a win, because the choice is usually between spending some CPU time decompressing the data, after which it will be in the L1, L2, and/or L3 cache, or spending even more CPU time doing nothing while you wait for the data to come back from main memory.

    As for caching, do you really expect your app to be stable enough that you're willing to throw out the cache entirely on a crash? Maybe you'd want to invalidate the cache on crash anyway, so perhaps that's not a big deal.

    Incidentally, letting the kernel handle your paging for you is not always the best idea. Sometimes it will end up prioritizing your cache over, say, program code or some other data you might want to keep close at hand. You can also run into a situation where you're reading a bunch of data sequentially, and the kernel thinks that means you're going to read a whole bunch more, so it kicks a bunch of stuff out of RAM to make room. This can happen whether or not the memory you're accessing is mapped to a file; it could just be memory that hasn't been touched since you've allocated it.

    It's possible, at least on Unix-like systems (not a Windows person myself), to handle your own page faults, passing any that don't belong to your caching system on to the kernel. This may not work so well for pointer-free structures, but you can always turn pointer-free structures into pointer-ful ones. What you do is allocate space for the root of your structure, then protect it so that it can neither be read nor written. When you get a page fault for that space, you map the root in, "unswizzling" any pointers from their persistent form to pointers to more newly allocated space, which you also protect. Repeat recursively ad infinitum. Now that you're in control of the paging mechanism (which is no slower than letting the kernel do it, except for the necessity of unswizzling the pointers and possibly decompressing now that you have less in RAM), you can do all your allocations from a pre-allocated pool which you can then lock into memory so it never gets paged to disk. This technique actually works for all sorts of huge data structures, even on 32-bit hardware, and even when your "source" address space spans the network.