Procedural World: Storage Matters

Wednesday, March 27, 2013

Storage Matters

Imagine you were creating a massive persistent world where everyone would be able to change anything at will. It is a simple, powerful idea that eventually has occurred to everyone ever exposed to a game. Why there aren't many of these worlds out there? Well, this very simple idea is quite difficult and expensive to execute. Not only you need to store the information, you have to be able to write it and read it in a timely fashion.

Then how about your own personal world, something you can run in your PC and invite some friends to play over. How much of your PC's performance are you willing to sacrifice, how many people could you actually invite before you would see the quality of your gameplay begin to suffer?

I began wondering whether all the above could be manifestations of the same problem. What if you could have a storage solution that is lightweight so enthusiasts could run at home, and if you pieced enough of them together you could scale it so it would run massive worlds the size of planet Earth?

As it turns out it was possible. I have now a shinny new database system that does exactly that. The main trick is it aligns with the same other concepts of the voxel world. So this is mainly a voxel database. It won't do any SQL queries, XPath evaluation or any other form of traditional DB interaction. It just stores and retrieves voxel data very fast.

How fast? Over a 10 minute period, a machine with six-year-old Intel processor (T2500 at 2GHz) and an equally crappy HD was able to serve 10 Gigabytes worth of individual queries while another 10 Gigabytes worth of queries were being written. Each query ranged from 500 bytes to 100KBytes worth of data.

That would translate into a lot of friends sharing your server. To give you a better idea, a volume of 40x40x40 worth of player voxels compresses to 2K as an average. Here is how you would compute how much space 10 GB of voxel data would be:

1 chunk = 40x40x40 voxels = 12x12x12 meters
1 chunk = 2K
10 GB = 5,242,880 chunks = 2048x2048x2048 meters

How many people can create this amount of voxel content in 10 minutes? I have no idea, but I bet it will be an entire army. At this point the DB is the least of your concerns. The bottleneck is in the network.

The twist comes now: While this rate was sustained for 10 minutes, it was not meant to push the system to the limit. The DB process CPU usage never went up 1% and the memory usage for the process remained at 3 MB. The system was responsive and usable (well as usable as a six year old PC can be), showing no big difference in behavior.

Here is some evidence:

For most of you who are more artistically or design inclined this is certainly the most boring screenshot I have ever posted. But if you are into programming this kind of thing, this is process porn.

Of course the system is doing real work. The main clue is in a different column not displayed by Task Manager: Virtual memory, which was hovering all the time below 20 Megs. Even then the virtual memory was lower than what Google Chrome was using, which was a whooping 99 Megs.

The voxel database is so fast because it uses the same virtual memory management of the OS. So, instead of writing to files in the HD directly, all the information is mapped through the OS paging system. Only the pages that need to be altered go into memory. Also the system does a lazy write to the HD. Even after the process is gone, the OS continues to save the changes to disk.

I feel this is the stepping stone for great things. It will be fairly easy and inexpensive for people to set up their own servers. They could be hosting a lot of players and barely take a hit for it. This of course depends on how the networking is implemented, which leads into another favorite topic of mine: how to make a server that will not bring your PC to its knees. I will be covering that in the near future.

30 comments:

AnonymousMarch 27, 2013 at 8:08 PM
I love you.

This is just amazing, The work you are doing could REALLY change the way games work. I mean think about hole games where EVERYTHING can be interacted with or destroyed! This would be crazy to use in a FPS style game. The GFX of this are great and i can see some major future in this project.

I bet some company buys it from you or hires you.

Keep up the amazing work, I hope you continue working on this amazing project!
ReplyDelete
Replies
JaimeBeezMarch 27, 2013 at 8:16 PM
Are you off-loading the heavy lifting of things (i.e. collision, building, destroying) onto the client in order to get that small of a memory footprint? Or, is there more usage going on (in private bytes or some other type thing)?

The big concern for multi-player would be that you can't trust the data sent from the client. Most of these calculations would need to be done server-side and would require a decent amount of this data to be loaded into memory.

I suppose (if you could decompress fast enough), you could constantly load/decompress the chunks needed for client interaction. I think that could fairly quickly limit the number of clients you could support though.

I'm interested to hear what your thoughts/solutions are on this.
ReplyDelete
Replies
AfforessMarch 27, 2013 at 9:18 PM
This really isn't that amazing.

Your use case story is not actually as compelling as it seems. 10gb in 10 minutes? That's only ~17mb/s - a speed most flash drives regularly achieve. Your task manager shows ~3.4GB total ram, so his program was able to keep using and recycling the available system RAM. Had you really ramped up his application, say 100gb in 10 minutes (10x more than your test case), you would have maxed out the available system RAM, and then each new read/write would have forced his 6 year old disk tray to start spinning, stalling his performance badly.

Further - your solution basically involves reading/writing raw data onto the disk. You can't do compression/decompression at all, compression and decompression requires that you duplicate data for the uncompressed stream, which would significantly increase your RAM footprint. But not doing compression/decompression also means your 10GB database might be equivalent to a 50mb minecraft world. Not exactly ideal.

And last - this would not work with any kind of threading. The OS paging system has "undefined" behavior with multithreading, so your voxel data is now limited to 1 thread only. In an age of 8-core computers, not good.

I'm not trying to criticize your approach - this type of solution is the best one we have, OS paging is the best approach to superior IO performance. It's just not the silver bullet you think it is.
ReplyDelete
Replies
CameronMarch 27, 2013 at 11:33 PM
This comment has been removed by the author.
ReplyDelete
Replies
AnonymousMarch 28, 2013 at 6:08 AM
The Database seems nice so far, but will it be able to "scale up" as an option?
Also How about the network load? Its only neccessary to send user-manipulated chunk. But if something changes like the servers, World Generation Algorithm. Wouldn't it make the data unsynced?
ReplyDelete
Replies
SybenMarch 28, 2013 at 6:16 PM
This looks great, I've been thinking something like this should exist. So, I'm proud of you, with the apparent skill and ability, to do it. Hope this doesn't die here :)
ReplyDelete
Replies
UnknownMarch 29, 2013 at 4:33 AM
Have you considered using node.js as your server technology? (or just the libuv) ?
ReplyDelete
Replies
Oskars Par KkoMarch 30, 2013 at 5:04 PM
Wow, you are really not sitting down, but doing lot of work here.
Size what takes game world is really important.
One of good examples of world size reduction we could see in game fuel, it was huge world (useless mostly but still) and it took really low space in hd.
It is important that we keep good performance and size, so it will make community what can work with ending project bigger.

This engine is best for huge open world/sandbox areas.
That is type of thing I like tho most in gaming industry - feeling of open world around and seems your work is dooing good job with that.

good luck :)
ReplyDelete
Replies
JezterApril 2, 2013 at 6:50 AM
Look forward to one day, PC's with webgl being capable of running environments like this xD

May I ask how you went about the database? an already existing database solution or custom crafted one, and if so which language / platform was this written in / for ?
ReplyDelete
Replies
JamesApril 12, 2013 at 2:12 AM
While your post doesn't address several important questions, it nevertheless is a good start. Most of the issues that DO arise have already been solved or can be dodged with clever thinking.

For starters there is the issue of bandwidth, which is more a factor than cpu/cache performance, HDD read/write times, and RAM perf combine. You already mentioned one solution, which is initially transmit level/environmental data and from there only transmit world updates to things that are immediately visible. Visibility calculations for determining strictly what IS within the radius of a given player can be done client side, and then the server can throttle this visibility radius as a simple precaution.
Obviously everything not above the ground can remain unsent.

More important is compression. If we are talking chunk encoding of voxel data then we don't have to have xyz coordinates for data
within a given chunk (lets say 8x8x8 subchunks). We can encode each chunk and subchunk as one long chain of bytes. The engine has the dimensions of a chunk preprogrammed, so if each voxel is a byte than the first 16 voxels say, are arranged in a line starting at the top of the chunk in the north west corner and ending at the top north east corner. Move one voxel-width south, and start a new line, repeating the same NW-to-NE sequence until you have a layer. Shift down one layer and continue. Rinse and repeat until a whole chunk is reassembled. The position of the individual voxels is then relative to the absolute position of the containing chunk, so you only have to transmit the chunk XYZ and the voxel data but not the XYZ of individual voxels.
ReplyDelete
Replies
AnonymousApril 5, 2014 at 12:13 PM
MongoDB uses a similar strategy (mmapping files into RAM) to achieve its speed. The cost is data corruption on crash: if you let the OS choose the order of writes, you can't guarantee whatever is on disk after a write is consistent. You can do tricks with sentinels per page, but most people who start with this strategy end up abandoning it in favor of some form of write-ahead logging or log-structured store.

You are already a step ahead of most applications since I'm assuming your data is pointer-free and has a lot of locality. Otherwise, the mmap approach would be a lot less atteactive to you. Unless you really don't care about losing your data in a crash, you will end up implementing most of what RocksDB gives you for free, so you might as well just bite the bullet and use RocksDB. Your keys can just be the octree node addresses using a simple prefix scheme.

I have experience doing this kind of storage on huge scale, so I'm happy to give pointers and critique ideas.
ReplyDelete
Replies
AnonymousApril 7, 2014 at 2:34 PM
The RocksDB team just released 2.8 today! Note that it's optimized for flash, so I'm not sure whether it's ideal for client side use in a general-purpose engine, though you could definitely use it on the server side.

Regarding compression, Facebook uses in-memory compression all over the place. It's highly application-dependent whether it will help, but for large caches it's almost always a win, because the choice is usually between spending some CPU time decompressing the data, after which it will be in the L1, L2, and/or L3 cache, or spending even more CPU time doing nothing while you wait for the data to come back from main memory.

As for caching, do you really expect your app to be stable enough that you're willing to throw out the cache entirely on a crash? Maybe you'd want to invalidate the cache on crash anyway, so perhaps that's not a big deal.

Incidentally, letting the kernel handle your paging for you is not always the best idea. Sometimes it will end up prioritizing your cache over, say, program code or some other data you might want to keep close at hand. You can also run into a situation where you're reading a bunch of data sequentially, and the kernel thinks that means you're going to read a whole bunch more, so it kicks a bunch of stuff out of RAM to make room. This can happen whether or not the memory you're accessing is mapped to a file; it could just be memory that hasn't been touched since you've allocated it.

It's possible, at least on Unix-like systems (not a Windows person myself) to handle your own page faults, passing any that don't belong to your caching system on to the kernel. This may not work so well for pointer-free structures, but you can always turn pointer-free structures into pointer-ful structures. What you do is to allocate space for the root of your structure, then protect it so that it can neither be read nor written. When you get a page fault for that space, you map the root in, "unswizzling" any pointers from their persistent form to pointers to more newly allocated space, which you also protect. Repeat recursively ad infinitum. Now that you're in control of the paging mechanism (which is no slower than letting the kernel do it, except for the necessity of unswizzling the pointers and possibly decompressing now that you have less in RAM), you can do all your allocations from a pre-allocated pool which you can then lock into memory so it never gets paged to disk. See http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.55.648 for a description of this technique, which actually works for all sorts of huge data structures, even on 32 bit hardware, and even when your "source" address space spans the network.
ReplyDelete
Replies

Add comment