Monday, September 12, 2011

Big Data

The term "Big Data" is frequently thrown around these days. I don't know about you, but this is what comes to mind:

Big Data is when you have so much information that traditional storage and management techniques begin to lose their edge.

One of the promises of Procedural Generation is to cut the amount of data required to render virtual worlds. In theory it sounds perfect. Just describe the world with a few classes, pick a few seeds and let everything grow from there. You store metadata instead of data. The actual generation is done on the player's system, and data is created on demand.
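
As a rough illustration of "seeds instead of data", here is a minimal Python sketch. Everything in it (the `chunk_seed` and `generate_chunk` names, the hash-based seed derivation, the toy height grid) is an assumption made for the example and not the generator discussed on this blog. The only point is that the same world seed and coordinates always reproduce the same chunk, so nothing has to be stored.

```python
import hashlib
import random


def chunk_seed(world_seed, cx, cy):
    """Derive a stable per-chunk seed from the world seed and chunk coordinates."""
    key = f"{world_seed}:{cx}:{cy}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")


def generate_chunk(world_seed, cx, cy, size=4):
    """Create a size x size grid of toy height values (0-255) on demand."""
    rng = random.Random(chunk_seed(world_seed, cx, cy))
    return [[rng.randrange(256) for _ in range(size)] for _ in range(size)]


# Two independent calls with the same inputs rebuild the same chunk.
first = generate_chunk(42, 3, -1)
second = generate_chunk(42, 3, -1)
print(first == second)
```

A real generator would replace the random grid with something far more expensive, which is exactly where the trouble starts.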

The problem is how to generate compelling content. The player's machine may not be powerful enough for this. Odds are you will end up targeting the lowest common denominator and your options will become seriously limited. For instance, Minecraft generates the entire world locally. This limits its complexity. Blocks are huge and there is little diversity. For Minecraft it works, but not many games could get away with it. If you want some kind of realistic graphics it is probably out of the question.

It seems, then, that data must be generated away from the user's system. One possible approach is to use procedural techniques to create a static world. Whatever data constraints we may have can be built in as parameters to the procedural generation. For instance, we could make sure the entire world definition fits on a DVD.

On the other hand, procedural techniques can produce huge datasets full of unique features. If you don't really need to fit on a DVD, Blu-ray or any other type of media you can keep around your house, you could create very large worlds. This is not new; Google Earth already does it. You don't need to download the entire dataset to start browsing, because it all comes on demand. A game world would not be as large as our real planet, but it would be a lot more complex and detailed. The world's topology will surely be more than just a sphere.

Then it becomes about Big Data. While generation and storage may be simple and affordable, moving that much data becomes a real challenge. What is the point of becoming so big when you cannot fit through the door anymore?

One option is not to move any data and render the user's point of view on the server. You would stream these views down to the player's machine pretty much like you would stream a movie. This is what the OnLive service does for conventional games. It is appealing because you can deal with bandwidth using generic techniques that are not really related to the type of content you produce, like the magic codecs from OnLive and their competition. But there are a few issues with this. Every user will create a significant load on the servers. Bandwidth costs may also be high, since every frame experienced by the users must go down your wires.

The other option is to do it like Google Earth: send just enough data to the client so the virtual world can be rendered there. This approach requires you to walk a thin line. You must decide which portions of data to send over the wire and which ones can still be created on the client at runtime. Then, whatever you choose to send, you must devise ways to compress it as much as possible.

It is not only about network bandwidth. In some cases the data will be so big that traditional polygon rendering may not make sense anymore. When large numbers of triangles become smaller than actual pixels after they are projected onto the screen, it is best to consider other forms of rendering like ray-tracing. For instance, this is what these guys found:

This is taken from their realtime terrain engine. This dataset covers a 56 km × 85 km area of the Alps at a resolution of 1 m and 12.5 cm for the DEM and the photo texture, respectively. It amounts to over 860 GB of data.
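
To get a feel for those numbers, here is a back-of-the-envelope calculation in Python. The area and resolutions come from the quote above. The bytes per sample (2 for a height, 3 for an RGB texel, no compression) and the 1920 × 1080 screen are my own assumptions for the sketch, since the quote does not state the storage format.

```python
AREA_WIDTH_M = 56_000    # 56 km, from the quoted description
AREA_HEIGHT_M = 85_000   # 85 km, from the quoted description


def grid_samples(width_m, height_m, resolution_m):
    """Number of samples needed to cover the area at a given resolution."""
    return round(width_m / resolution_m) * round(height_m / resolution_m)


dem_samples = grid_samples(AREA_WIDTH_M, AREA_HEIGHT_M, 1.0)   # 1 m DEM
texels = grid_samples(AREA_WIDTH_M, AREA_HEIGHT_M, 0.125)      # 12.5 cm texture

BYTES_PER_HEIGHT = 2   # assumption: 16-bit heights
BYTES_PER_TEXEL = 3    # assumption: uncompressed 24-bit RGB

total_bytes = dem_samples * BYTES_PER_HEIGHT + texels * BYTES_PER_TEXEL
total_gib = total_bytes / 2**30

SCREEN_PIXELS = 1920 * 1080   # assumption: a 1080p display
samples_per_pixel = dem_samples / SCREEN_PIXELS

print(f"DEM samples: {dem_samples:,}")
print(f"Texels: {texels:,}")
print(f"Raw size: {total_gib:.0f} GiB")
print(f"DEM samples per screen pixel: {samples_per_pixel:.0f}")
```

Under these assumptions the raw size lands in the same range as the quoted figure, and the height samples alone outnumber the pixels of a 1080p screen by a factor of a few thousand. Whenever a large part of the terrain is in view, most triangles built from those samples would be smaller than a pixel.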

I am a hobbyist, so having dedicated servers to render the world for users is not even an option. I would rather pay my mortgage. If I want anyone to experience the virtual world I'm creating, I will need to stream data to end-users and render it there. I have not figured out a definitive solution, but I have something I think will work. My next post will be about how I'm dealing with my big data.


  1. It is essentially about the trade-off between content data size and content generation speed. But it's not all black-and-white, one or the other. It seems there must be some optimum where the power of the user's PC is harnessed to generate the final data, while you do all the prep work. This would result in less data than when you do all the work, and less generation time for the user than when the user does all the work, while utilizing both to the max.

  2. Looking forward to next post... I think it's going to be all about optimization of resources, knowing the algorithms you have created, understanding their requirements (order of operation etc.), and trying to let each part do what it does best.

    Strategically using deterministic algorithms is a good way of moving the load around from up-front to real-time computation for certain things.

    Realizing that "if my screen shot of a hill has slightly different tufts of grass than yours", no one is going to care, should give you some leeway.

    Personally, I think the greatness of what you've created thus far is in your voxel, terrain, flora (though more to do there certainly) and city, roads, building engine.

    Things that don't have to be done upfront:
    - Flora display, if significantly driven by the fixed terrain.
    - Populating the world, with people and animals, to make it "feel" alive.
    - Populating the world with movable objects, and then their dynamic location stored if moved.
    - Some game mechanics, dynamic quests, events, etc...


    - Storage IS "cheap", no getting around that! So perhaps preprocessed base-seeded terrain data, with a number of post-processes applied to make it a detailed terrain.

    - Client side CPU can be powerful, or not, but it's good at some things. Best pinned to things that can have "quality sliders"... controlling populations of NPC, Ambient Audio complexity, text overlay/display or other things.

    - Client Side GPU can be powerful, or not, but it's good at some things. Best pinned to things that can have "quality sliders"... Post processing effects, bloom, antialiasing, shaders.


    I do think you will be best served not trying to tackle "the holy grail" of a completely cloud based computing structure etc... Especially because online-only options may cut into your user base.

    My opinion: For the people who would get excited for your eventual game, a 4 GB+ install folder is nothing. If you get it to run decently on decent hardware, you can ALWAYS leverage things like On-Live to get it to run on weaker devices, or tackle the online mobile market later... once the money is rolling -- then release a mobile version that does all the sexy remote rendering for a small monthly fee, or whatever :)

  3. Another way to manage data is to cut it down into load-screens and areas.

    If I were making a game (I'm not, and don't know enough to even start), I'd probably run it at two different resolutions. Much like the game Neverwinter Nights, I'd have sections cut around logically-consistent areas, with loading screens in between. Beyond the edge of the screen, a lower-resolution backdrop would be visible. The far-off distance could be rendered as a wallpaper (or even dynamic wallpaper, with waving grass or something).

    Sectioning off areas for loading screens could, I'd think, be logical too. Cities (or districts of cities) can be divided. Large plains can be divided smoothly. Mountains have edges, both along the cliffs and between peaks. Depending on how each section is processed, the 3D model could be created on the fly, while simple PNG files (or low-resolution polys, or whatever) could be saved for each section.

    That would require a larger initial download (which could be spread out via BitTorrent really), but there would be a lot less streaming data than streaming video or rendering info.

    But that's just how I would do it, and it obviously has the drawback of loading screens.

  4. Procedural is compression. This is a great benefit. You just say "I'm at (x,y)" and the algorithm will generate the portion of the static world around you. So why not let the clients generate the static world and have the servers stream only the positions of movable objects? This would cut down your bandwidth.
    Google Earth can't do this because it deals with non-procedural content.


  5. @Pietro: The problem with proceduralism as compression is that you need to "decompress" in a timely manner. For instance, if it takes you 10 seconds to walk through a patch of land, but it takes you one minute to generate it from scratch, your procedural output won't be able to keep up with the user's experience.

    You are left with two options: Either reduce the complexity of the world so you can have patches generated in less than 10 seconds, or do some of the work beforehand and load it from a device like a disk or the network.
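
The budget argument above can be written down in a few lines. This is only a sketch: the function names are made up for the example, and it assumes generation cost splits cleanly into a part done beforehand and a part done at runtime, which real pipelines will not do so neatly.

```python
def can_keep_up(traverse_s, generate_s):
    """True if a patch can be generated before the player has crossed it."""
    return generate_s <= traverse_s


def precompute_fraction(traverse_s, generate_s):
    """Share of the generation work that must be done beforehand so the
    remainder fits within the time it takes to cross the patch."""
    if generate_s <= traverse_s:
        return 0.0
    return 1.0 - traverse_s / generate_s


# The numbers from the comment: 10 s to cross a patch, 60 s to generate it.
print(can_keep_up(10, 60))
print(f"{precompute_fraction(10, 60):.0%} of the work must happen up front")
```

With those numbers, about five sixths of the work has to come from disk or the network, unless the world is simplified until a patch costs 10 seconds or less.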

  6. @Jonathan: I'm just curious, what would you consider as "decent hardware"?

  7. Why not split the world into 2 scopes? The bigger one is generated and stored on the server. The smaller one is generated on the client machines from data sent by the server.
    Simple example:
    The server generates a forest and, for every tree, a seed that describes its shape (this seed isn't purely random).
    The server sends the client the position of each tree and its seed.
    The client can now build that part of the forest from them.
    If generating every tree shape in this area is too costly for that machine and the trees are only decoration, it can create a single tree shape and copy it to every other tree position.
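
A minimal Python sketch of the two-scope idea in the comment above. All names here (`TreeRecord`, `server_generate_forest`, `client_build_tree`, `client_build_forest`) are hypothetical, and a "tree shape" is reduced to a list of branch angles and lengths. It only shows that position plus seed is enough for every client to rebuild the same tree, and that a weak client can fall back to one shared shape.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TreeRecord:
    """What the server sends per tree: a position and a shape seed."""
    x: float
    y: float
    seed: int


def server_generate_forest(forest_seed, count, extent=100.0):
    """Server side: place trees and pick a shape seed for each one."""
    rng = random.Random(forest_seed)
    return [
        TreeRecord(rng.uniform(0, extent), rng.uniform(0, extent), rng.getrandbits(32))
        for _ in range(count)
    ]


def client_build_tree(seed, branches=5):
    """Client side: expand a seed into a toy shape of (angle, length) branches."""
    rng = random.Random(seed)
    return [(rng.uniform(0, 360), rng.uniform(0.5, 2.0)) for _ in range(branches)]


def client_build_forest(records, cheap=False):
    """Build every tree, or reuse a single shape when the machine is weak."""
    if cheap and records:
        shared = client_build_tree(records[0].seed)
        return [(r.x, r.y, shared) for r in records]
    return [(r.x, r.y, client_build_tree(r.seed)) for r in records]


records = server_generate_forest(forest_seed=2011, count=3)
print(client_build_forest(records) == client_build_forest(records))
```

The payload per tree is two coordinates and one integer, no matter how detailed the client decides to make the shape.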

  8. @-=Y=-: It is a possibility, but then having a consistent transition zone between the two scopes will be very challenging. The transition could be very noticeable if too close to the player. If you placed it far enough, then the amount of generation done locally would be too much.

  9. @Miguel: Think first of the benefits of having complete runtime generation on the client side. Set aside the new operating conditions for a moment. Those can be reviewed later. Also consider that by the time you come out with a complete product, client hardware will be better, while bandwidth probably will not.
    There are interesting examples out there. I would point you to "Infinity - quest for earth". I believe its generation is done completely at runtime and on the client side.
    I see you are doing great work and I hope you will keep it up. If watching a video of my hobby work makes you curious about any detail, feel free to ask; I will be happy to share.

  10. @Pietro: Sure I'd like to see your work. If you have a live link please post it here so others can benefit too.

    Both client hardware and bandwidth will not change much in one or two years. I have crunched the numbers. What average clients can do, and will be able to do in the near future, is not enough. They cannot handle the amount of detail I want to have. Also, why wait if a solution involving the cloud is possible now?

    I have seen Infinity. It is impressive at planetary scale. However when you get close at human eye level it is not detailed at all. Terrain is heightmap based, no complex botany, no complex architecture. It suits their goals, but it is far from what I want.

  11. @MCepero: This is off the cuff, but I'd picture the lowest end of the usable PC spectrum as a mid-level PC bought 2 years ago.

    Something on the lines of which is used as minimum specs on some newer games:

    - Operating system: Windows XP, Vista or 7
    - Processor: Dual Core 2.0 GHz or better
    - Memory: 3 GB
    - Video: Nvidia GeForce FX 5900, ATI/AMD Radeon X300, Intel GMA X4500 or better.

    Ultimately, the results you are getting, the beauty and the detail are top notch. Graphically speaking getting anywhere near real time at max settings is going to MAX out any modern rig, BY ANY MEANS!

    To me this means targeting higher end systems, but still not just top of the line:

    To me this is close to "decent", from a gaming perspective:
    - Operating system: Windows XP, Vista or 7
    - Processor: Core 2 Duo 2.2 GHz or better
    - Memory: 4 GB
    - Desktop Video: Nvidia GTS 250 or better
    - Notebook Video: Nvidia GTX 200M series or better

    You are quite intelligent :)
    ...but as you've already said there are just limits to what current, let alone older tech can do... And as much as I think cloud computing can do, I'm not yet convinced you can make a business case to deliver more, within a price point people would be willing to spend... but I've not crunched any of those numbers as you have, I'm excited by the direction you are looking into.

    I think the MOST important aspect of a successful indie product is a COMPELLING game that people just "HAVE to talk about" (a la Minecraft).

    With the visuals you have shown to be possible, if you can optimize the processing so it's usable on ANY current (even close to top of the line) hardware, then the internet (1080p/720p Let's Plays on YouTube) will sell it the rest of the way. I think people will play it dumbed down greatly, if they know they can eventually crank up the quality as their hardware improves, or if you are able to build more robust streaming versions.

    Anyway, I'm biased because I WANT to walk in the lands you've made :)

    P.S. I'm sure you know this, but it's crucial that the game-play is solid, and fun, in itself, in spite of the visuals... Once you've dealt with other things to your satisfaction, you may want to give yourself some direction on where you want to take it (you may have already done that and aren't discussing it publicly, which is fine too).

  12. I'm confident that complex procedural generation on the client side at runtime is feasible. Maybe my vision is wrong. However, this subject is worth debating.
    As far as I've seen, Flavien (programmer of Infinity) has tried simple vegetation (trees) and procedural cities as well.
    I think tiles can help find a good trade-off. A palette of 10 branches can generate millions of different trees. No need to generate everything.
    Regarding my hobby work, go to

    and just watch the video in the last post. It takes 5 minutes.
    It's yet another heightmap based viewer. No complex geometries. What may be interesting is that it generates at runtime, while the observer moves, that it allows a huge view distance even though the terrain is densely tessellated, and that it's written against OpenGL 1.2.
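
The "palette of 10 branches" claim in the comment above is simple combinatorics, sketched here in Python. The number of attachment slots per tree (6) is an assumption picked for the example, and the function names are made up. With 10 palette entries and 6 slots where any entry may be placed, there are 10^6 distinct combinations, and a single integer is enough to identify one of them.

```python
def tree_count(palette_size, slots):
    """Number of distinct trees when each slot takes any palette branch."""
    return palette_size ** slots


def tree_from_index(index, palette_size=10, slots=6):
    """Decode a tree index into one palette choice per slot."""
    choices = []
    for _ in range(slots):
        index, branch = divmod(index, palette_size)
        choices.append(branch)
    return choices


print(tree_count(10, 6))         # 1000000
print(tree_from_index(123456))   # [6, 5, 4, 3, 2, 1]
```

Only the 10 branch models need to be generated or stored, and each tree reduces to a small index.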

  13. Also, regarding specs, though they are only regarding steam users, there are some solid metrics here:

  14. @Jonathan: Thanks a lot for the info. Very helpful.

  15. @Pietro: Very nice work. It is among the nicest looking heightmap rendering I have seen.

  16. @Miguel: Thank you. But what you do now, I hope to be able to do in 5 years :)

  17. I started reading your posts from the oldest and I now catch a glimpse of your vision. It is crazy enough to be revolutionary and surely has all my admiration and regard. Too daring for me though. I will definitely follow another path. However, I will keep on reading your posts, which are interesting and detailed, and offer my modest opinion if you like.

  18. Another way to get more processing power is using P2P, for multiplayer games anyway. Especially useful if you can convince idle users to keep their machines running (e.g. EVE mining).

    Obviously it has issues (like trust) and makes everything that much more complicated to manage, but it could have great rewards.

  19. @John: P2P is very appealing. I have considered it for hosting the world data files. As the player moves, the data for the next cells could come from other players instead of coming from a server. A karma system could encourage you to leave a portion of the client running so other players can be served while you don't play.

    Security is not a big concern for static data. I could use some sort of asymmetric signature to make sure the files really are the originals.

    What I did not like about it was that latency could be too high. I would still need to involve a server to coordinate P2P transfers and make sure karma is honored, and even then it cannot really be guaranteed that peers will respond in a timely manner.

    But I have very little experience with P2P. I may be wrong and missing on a great solution.
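
A hedged sketch of the integrity check mentioned above, using only Python's standard library. It swaps in a simpler technique than a per-file signature: each chunk is checked against a manifest of SHA-256 hashes. The assumption is that the manifest itself is what gets signed with the publisher's private key and verified on the client with a real signature library, a step that is not shown here. The function and chunk names are made up for the example.

```python
import hashlib


def build_manifest(chunks):
    """Publisher side: map each chunk name to the SHA-256 of its bytes.
    Assumption: this manifest is then signed and shipped from a trusted server."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in chunks.items()}


def verify_chunk(name, data, manifest):
    """Client side: accept a chunk from a peer only if its hash matches."""
    expected = manifest.get(name)
    return expected is not None and hashlib.sha256(data).hexdigest() == expected


original = {"cell_0_0": b"terrain bytes", "cell_0_1": b"more terrain bytes"}
manifest = build_manifest(original)

print(verify_chunk("cell_0_0", b"terrain bytes", manifest))    # True
print(verify_chunk("cell_0_0", b"tampered bytes", manifest))   # False
```

Because the world data is static, the manifest only has to be published once per world version, and peers never need to be trusted individually.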

  20. I was imagining sharing the generation of data, rather than just bandwidth, although that's obviously not useful for your system. Your way does solve the trust problem though.

    For latency, if you could try requesting small chunks of data at a time, then you can easily use multiple peers and swap out slow peers without losing much time/data.

    I don't have much P2P experience either. I suspect there is a great solution there, maybe organising data requests with something like OpenCL events, but it'd probably take more time than 1 person has to invest when it's only a small part of their project.
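
The "small chunks from multiple peers, swap out the slow ones" idea can be sketched as a tiny simulation in Python. Nothing here touches a network: peer latencies are given as plain numbers, and the names (`fetch_chunks`, `peer_latency_s`, `timeout_s`) are hypothetical. It only illustrates the scheduling policy, not a real P2P protocol.

```python
def fetch_chunks(chunk_ids, peer_latency_s, timeout_s):
    """Assign chunks to peers round-robin, dropping any peer whose
    simulated latency exceeds timeout_s and retrying with the next one."""
    active = list(peer_latency_s)
    assignments = {}
    dropped = []
    turn = 0
    for chunk in chunk_ids:
        while active:
            peer = active[turn % len(active)]
            if peer_latency_s[peer] <= timeout_s:
                assignments[chunk] = peer
                turn += 1
                break
            # Too slow: swap this peer out and retry the same chunk.
            active.remove(peer)
            dropped.append(peer)
        else:
            raise RuntimeError("no responsive peers left")
    return assignments, dropped


peers = {"a": 0.05, "b": 2.0, "c": 0.08}   # simulated seconds per request
assignments, dropped = fetch_chunks([0, 1, 2, 3], peers, timeout_s=0.5)
print(assignments)   # {0: 'a', 1: 'c', 2: 'a', 3: 'c'}
print(dropped)       # ['b']
```

Because each request is small, the cost of discovering a slow peer is bounded by one chunk and one timeout, which is the property the comment is after.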

  21. Your mention of OnLive along with huge data sets is interesting. It wouldn't work for your own hobby project, of course, but for a game specifically targeting OnLive as its primary platform you could get away with some insanely huge datasets. All the OnLive servers could read from a single copy of all the game's static world data, so you could get away with a space requirement that would be impossible if each system needed its own copy. And streaming data from a server isn't that hard when the server is in the same room.

  22. @Squash: This may be a viable future for a popular MMO franchise like WoW. RPGs won't be killed by the small latency added by server-side rendering. WoW will get a lot of users, so investing in that kind of infrastructure would not be a problem for them. It would allow for a very rich and detailed world, unlike anything we have seen. Think of movie CGI quality level. It could be experienced by anyone, even on mobiles and tablets. Network issues affecting multiplayer gameplay would be gone, and you could have massive raids where thousands of players take part. A new age of gaming, and the technology is already proven. It is a matter of time and will.

