Upgrading Second Life, and Why It Is So Hard To Do

Frowning Gwyn

There are so many uh “professionals” claiming that everything Linden Lab does is handled incompetently and amateurishly, and that they don’t know a thing about how to deliver updates, maintain a grid, or develop software at all…

Well, I also claim my share of professionalism 😉 Yes, I’ve worked for ISPs dealing with a few tens of thousands of simultaneous users — with far fewer servers than LL’s 5000 virtual servers or so, though. And yes, my application to work for LL was rejected, very likely because my 12+ years of experience running relatively complex network/server operations were not enough. And they’re right: SL is far more complex than anything I might have worked on 🙂 It shows, at least, that they’re raising a very high threshold when accepting new people to work with them on mission-critical issues.

Some recent comments on LL’s blog are indeed naive and show a lack of understanding of how the grid operates. Linden Lab does stress-testing. They do have QA procedures — they used to be publicly available on their old wiki, and they even invited residents to suggest and implement new QA procedures (not many rose to the occasion — and the ones who did are very likely LL employees these days). They have a whole grid for public testing purposes. They have at least one more grid for internal testing purposes.

As anyone who has deployed very complex software knows, you can test as much as you want in “controlled laboratory environments” — even stress-test to incredible levels — but there is no better test than “the real thing”. It’s the nature of so-called “complex environments”. A computer programme running on a single computer, on a single CPU, with no other programme running at the same time, can be analysed in extreme detail, and you can scientifically predict what its output will be for a specific input. Run the same test under a multi-tasking system, and that prediction will be right most of the time — but sometimes it won’t. Increase the complexity so that the system is not only multi-tasking but also networked, and it becomes even harder to predict the outcome. Now go towards the full scale of complexity: multi-tasking, multi-CPU, multiple nodes on the network, multiple virtual servers per physical server — and the “predictions” become chaotic.

In essence, it’s easy to predict the weather inside your own living room (just turn up the heat, and the room’s temperature will rise uniformly) but not on the whole of the Earth — a chaotic system that cannot be analysed with statistical methods alone, and whose variables are too poorly known to build a model that faithfully replicates Earth’s weather. Now, Second Life is indeed a chaotic system, with a limited basis for prediction. Like a weather system, you can simulate it. You can recreate a system that has not 5000 sims but just 50, and not 15,000 simultaneous users but at most 150, and see how it behaves in a controlled environment. LL even uses very old computers on the testing grids to make sure algorithm errors can be pinpointed (ie. doing things far less efficiently than one could get away with on a fast system) — it’s an old trick of the trade. However, a system 100 times bigger than the lab environment is not 100 times more complex — the relationships are exponential, not linear. At a time when there were just 20,000 accounts in the database, one could probably extrapolate from tests made on a smaller system by applying heuristics — “if the real grid is 100 times bigger, this will be 100 times slower”. With 2 million accounts, things sadly don’t work so well — a grid 100 times as big as the testing environment might be a million — even a billion — times more complex. Exponentials kill almost all systems and turn semi-predictable ones into chaotic ones. That’s the first issue: SL is complex. Much more complex than people tend to think.
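
As a toy illustration of that non-linear growth (the 50-sim and 5000-sim figures come from the paragraph above; treating “complexity” as the number of possible interactions between sims is a deliberately naive assumption of mine, not a model of the actual grid), consider how fast the count of possible interactions grows with size:

```python
from math import comb

# Toy illustration only: count potential interactions between sims.
# A 100x larger grid has roughly 10,000x more possible pairs and roughly
# 1,000,000x more possible triples:
#   pairs:   1,225      -> 12,497,500
#   triples: 19,600     -> 20,820,835,000
for sims in (50, 5000):
    pairs = comb(sims, 2)     # two sims exchanging data (object crossings, messages, ...)
    triples = comb(sims, 3)   # three-way interactions
    print(f"{sims} sims: {pairs:,} possible pairs, {triples:,} possible triples")
```

Even this naive counting multiplies the possibilities by roughly ten thousand (for pairs) or a million (for triples) when the grid grows 100-fold; real interactions chain far deeper than that, which is why the lab results stop extrapolating cleanly.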

The second issue is simply a matter of “time”. Rebooting the whole grid takes 5 hours — you can’t “cut” time on that. It’s like a constant of the universe — and kudos to LL for keeping that “constant”, well, constant during all these years in which the grid grew exponentially. So the suggestion that the grid should be updated for one hour every day is, sadly, technically not feasible; that “demand” comes mostly from people familiar with other MMORPGs, where there is usually a daily “maintenance hour” to reboot all servers. Most MMORPGs have very simple back-end systems, and the complexity is on the viewer. Most of them also run on Windows servers; stability on Windows platforms can, surprisingly, be achieved by rebooting the servers often (ie. once per day), since that is a good way to deal with the memory leaks inherent to that OS.

Linux- or FreeBSD-based operating systems manage memory in a very different way (I won’t say “better”, since that would be a biased answer 🙂 ) and a “maintenance reboot” is really not required — it neither helps nor makes things faster afterwards. It’s just different: rebooting Linux servers does not mean they’ll be faster, suddenly get rid of bugs, or perform better. It’s simply not the way they work — thus a “daily one-hour maintenance” is sadly impossible to accomplish. It will always take 5 hours. And that’s why LL does its upgrades at most once per week, usually once every other week.

Thirdly, encouraged by the success of the Focus Beta Viewer, which lets new functionality be experienced on the client’s side while connecting to the real grid — ie. “talking” to the “same” servers that run the “stable release” — some people are now advocating that Linden Lab should only do things that way: test all releases on the main grid and never on any “beta grids”, so that they are tested in a real environment. This forgets that almost all releases/patches/upgrades/versions — at least 99% of them — are not purely client-side! As a matter of fact, almost all releases require both new client code and new server code. There is a good reason for this — the “Second Life Protocol”, the communication layer that binds your client to the grid servers, evolves with every new release. This means that former versions of the SL client will “talk” a slightly outdated version of the SL Protocol, which will simply not work anymore.

However, when LL can cleverly deploy the SL viewer with only client-side modifications, you’ll get one of those “optional viewer download” messages: the SL Protocol did not need any changes, and only the client was changed. When you get a “rolling server upgrade” to fix things, the reverse is true: only the server needs a change, the SL Protocol doesn’t, and you can still use the same client with the new server version. And, although this hasn’t happened yet, I can assume that one day both will be done simultaneously: a “rolling upgrade” underway and an “optional client download” for those wishing to use it.

The Focus Beta Viewer is, in essence, something like that — only client-side things have been changed, but the SL Protocol to talk to the servers is still exactly the same one (of course, if the SL Protocol changes, LL has to release a new version of the Focus Beta Viewer as well). However, most of the bug fixes and new features require changing all three simultaneously: client, server, and SL Protocol. And that’s what happens every other week or so when the grid needs to be shut down and relaunched (taking 5 hours) and everybody needs a new version of their client to communicate with the new server software on the grid.
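
To make those cases concrete, here is a minimal sketch of the kind of decision logic I am describing. Every name in it is my own invention (the real SL infrastructure exposes nothing of the sort); the point is simply that it is the SL Protocol, not the client or server code on its own, that determines how disruptive a release has to be:

```python
# Hypothetical sketch; none of these names exist in the real SL code base.
def classify_release(client_changed: bool, server_changed: bool, protocol_changed: bool) -> str:
    """Decide how a release can be rolled out, given what it touches."""
    if protocol_changed:
        # Old clients would "speak" an outdated SL Protocol: mandatory client
        # download plus a full grid shutdown/relaunch (the ~5-hour kind).
        return "full upgrade: grid shutdown + mandatory client update"
    if server_changed and client_changed:
        # Possible in principle (hasn't happened yet, per the text above):
        # rolling server upgrade plus an optional client download.
        return "rolling server upgrade + optional client download"
    if server_changed:
        return "rolling server upgrade (existing clients keep working)"
    if client_changed:
        return "optional client download (existing servers keep working)"
    return "nothing to deploy"

# The Focus Beta Viewer case: only the client changed, the protocol did not.
print(classify_release(client_changed=True, server_changed=False, protocol_changed=False))
```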

Other games/platforms simply don’t work that way. Perhaps people are familiar with the Web, which has existed since 1992/3. It also has a communications protocol: HTTP. Since 1992, there have only been 3 versions of it: 0.9, 1.0 and 1.1. Most current browsers use 1.1, though a few still only speak 1.0. On the content side, we went from the earliest HTML for describing pages to HTML 4.01 — a handful of major versions in over a decade. Second Life, by contrast, gets both a new version of the protocol and a new version of the client viewer roughly every other week. It’s totally different, conceptually, from the way the Web works. And, unlike SL, nobody needs to “shut down the Web” when Mozilla, Opera, or Microsoft launches a new browser — they’re all compatible with the current versions of both HTTP and HTML. In Second Life, things simply don’t work like that.

One can, of course, argue about why Linden Lab did not use a more Web-like conceptual environment instead of their current approach. Two things should be said at this point. First, they are, indeed, changing the way the SL Protocol works, towards an approach where the communications layer needs far fewer changes. This would allow LL to turn all server releases into rolling updates, and all client upgrades into optional downloads. I think this is their ultimate goal — even allowing people to use their own favourite version of the client software (LL-supported or third-party, either open source or commercial). This requires, however, changing the whole SL Protocol — a daunting task of huge proportions, which has been hinted at as being underway for around 18 months now. Several parts of it have already changed — allegedly since 1.9, I think, the servers already talk among themselves using a Web-based, version 2.0 of the SL Protocol. I can only imagine that work is in progress to bring the client-server communications up to the same version as the one already deployed “under the hood”. But it’ll take time.

The second point is that LL never really needed to think much about this in the early stages of deployment, and from studying their SL Protocol it’s quite clear that their first group of developers came from games design in the late 1990s, where speed of communication on a local area network, to ensure a quick response time, was far more important than stability and compatibility across millions of users, tens of thousands of them simultaneous, and thousands and thousands of servers. They aren’t, however, sitting on their hands and saying “tough luck, this is what we devised in the early 2000s, this is what you get…”. Rather, they’re constantly rethinking how to “migrate” the conceptual framework of their original platform into a new framework with totally different requirements and expectations, without too much mess.

You can imagine the analogy of having installed Windows 95 on your PC and having Microsoft constantly throwing patches at you so that at the end of the day you’d have Vista installed. Nobody in the software industry does things that way: at some point, you simply can’t keep patching something that was supposed to work well in 1995 and expect it to work in the computing environment of 2007. But that’s what LL is doing with SL! They have no choice — it’s impossible to demand that people suddenly start from scratch on “Second Life 2.0”, forfeiting all content — objects, avatars, scripts… — just because the things LL can do in 2006 are way beyond what was possible in 1999. Instead, they have a migration path — slowly introducing the changes that will become “Second Life 2.0” without ever breaking anything done in the past. Anyone who has tried to install, say, a current version of the X Window System over a Linux installation done in 1999 knows what I mean 🙂 It’s possible, but be prepared — it’s a nightmare taking weeks (and at every point of the process you’ll be wishing you could simply install a fresh copy of Linux instead).

One could also argue that what LL should be doing is developing a parallel grid with 5000 servers running “Second Life 2.0” (which doesn’t really exist) and creating a “migration tool” to upload all the “old 1.0 content” and “convert” it to “new 2.0 content”. I’m not even claiming this is technically possible. What I’m sure of is that it’s financially and resource-wise simply not feasible — it would mean adequately testing hundreds of millions of assets and making sure that not a single one breaks when updated. We’re not talking about a software product with a few gigabytes of data stored on your disk. We’re talking about dozens of terabytes of shared content that need to be changed! If that doesn’t impress you, it should. IMHO, it’s almost a miracle that during an upgrade only a negligible amount of data is “lost” when, say, converting databases from one asset server to another…

What could, indeed, be done is something slightly different: not relying so much on a single, centralised server cluster to do all the “tracking”. Right now, as I have patiently explained to some residents, centralisation is the more economically sound alternative. It would be very easy, for instance, to change the way objects are tracked by the asset server, relying not on the UUID alone but on a (sim server, UUID) pair as the identifier.

What would this accomplish? Well, currently, assets are stored on the simulator server where they were first uploaded, but the information about that storage is kept on the asset server. Every time you need an asset on a different sim, all it takes is a request to the asset server asking where the asset is, contacting the sim where it was uploaded, and downloading it to you, while keeping a local cache on the sim you are on.
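
Here is a rough sketch of that lookup path, following the description above. All the names are invented for illustration; the real asset server’s internals are, of course, not public:

```python
# Illustrative only: the centralised lookup flow described above, with invented names.
ASSET_DIRECTORY = {}   # central asset server: UUID -> name of the sim storing the asset
SIM_STORAGE = {}       # per-sim storage: sim name -> {UUID: asset data}
LOCAL_CACHE = {}       # per-sim cache: sim name -> {UUID: asset data}

def fetch_asset(uuid: str, current_sim: str):
    """Fetch an asset for a viewer standing on `current_sim`."""
    cache = LOCAL_CACHE.setdefault(current_sim, {})
    if uuid in cache:                    # already cached on the sim you are on
        return cache[uuid]
    home_sim = ASSET_DIRECTORY[uuid]     # 1. ask the central asset server where it lives
    data = SIM_STORAGE[home_sim][uuid]   # 2. download it from the sim it was uploaded to
    cache[uuid] = data                   # 3. keep a local copy on the sim you are on
    return data
```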

But technically you could get rid of the central asset server by simply storing the (server, UUID) pair and having your client contact that server directly, while (also) storing a local copy on the sim you are on now. There would be no need to keep the expensive bottleneck called the “asset server cluster” — which causes most of the issues you’re familiar with: lost inventory, Search not working, being unable to teleport, etc.
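
And the decentralised variant, reusing the same toy SIM_STORAGE and LOCAL_CACHE dictionaries from the previous sketch: the key itself names the home server, so no central directory is ever consulted.

```python
# Illustrative only: the decentralised alternative, with the same invented names.
def fetch_asset_decentralised(asset_key, current_sim: str):
    """`asset_key` is a (home server, UUID) pair, so the client knows whom to ask."""
    home_sim, uuid = asset_key
    cache = LOCAL_CACHE.setdefault(current_sim, {})
    if uuid in cache:
        return cache[uuid]
    data = SIM_STORAGE[home_sim][uuid]   # contact the named server directly
    cache[uuid] = data                   # and (also) keep a local copy where you are now
    return data
```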

So why does LL use a centralised method? It’s purely an engineering/financial decision. It’s far better and cheaper to maintain a centralised system than a fully distributed one. Imagine the following example: you send a bug report to LL complaining that you have lost an object called “My Nice Clothes” —

A) Today’s system: “Sure, we’ll take a look at the central database and retrieve it for you. There you are. Sorry for the mess.” All the tech support has to do is look it up on the central asset cluster, see where a copy is found, and retrieve it from the backups.

B) Decentralised system: “Sure, do you know the server’s name and the UUID of the object you’ve lost?” Well. Under this model, naturally, people have no idea where the object was first uploaded. So LL’s tech support team would have to log in to all 5000 servers separately and search them, one by one, to find where a copy is located. This takes a huge amount of time, as you can imagine.

What will very likely happen in the future is a model with sub-grids, ie., parts of the grid being controlled by their own “local asset servers”. Assets would be identified by pairs of (subgrid, UUID). Immediate advantages:

  • LL could take one part of the grid down for maintenance and upgrade it while keeping the other parts running. The cool thing is that there would be “no absolute downtime”. People would just have to jump from one part of the grid to another (like during a rolling upgrade). Sure, during the upgrade period, some assets would not be visible (if they were on the “other subgrid” and not locally cached), but in most cases most things would still work. If there were four sub-grids, for instance, you’d have 75% of all content always available — and probably more, due to local caching — and, more importantly, you would always be able to log in.
  • LL could abort an upgrade process mid-way. Having 25% of the grid to experiment with under real load would very likely be enough to figure out what is going wrong after an upgrade is deployed. They could just warn people to restrict themselves to the rest of the grid while the problems are fixed on the one sub-grid being updated. Sure, the 25% of users who happen to have all their content on the sub-grid under upgrade would still complain. But they would, at least, be able to jump to the rest of the grid running the “old version” while the new one is tested. It’s a fairly good compromise.
  • This would naturally allow sims to be on geographically different co-location facilities. Since we know that Linden Lab wants to do this, it would be one possible approach (the much easier one would be to simply have the asset server cluster separated geographically, but it would always be a single grid nevertheless).
  • Manually looking up assets across a handful of sub-grid asset clusters takes not much longer than looking them up on a single asset cluster. There is quite a difference between doing 4 queries for 4 sub-grids and doing 5000 manual queries under a totally decentralised model (see the sketch just after this list).
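
Here is what that sub-gridded addressing could look like, in the same toy style as before. The sub-grid names, and the choice of exactly four of them, are my own invention, not anything LL has announced:

```python
# Hypothetical sub-grid layout; the sub-grid names are invented for illustration.
SUBGRID_ASSET_CLUSTERS = {
    "north": {},   # each dict stands in for one sub-grid's own asset cluster
    "south": {},
    "east":  {},
    "west":  {},
}

def lookup(asset_key):
    """Assets are identified by a (subgrid, UUID) pair, so one query finds them."""
    subgrid, uuid = asset_key
    return SUBGRID_ASSET_CLUSTERS[subgrid].get(uuid)

def find_lost_asset(uuid):
    """Support case where only the UUID is known: at worst one query per sub-grid
    (4 here), instead of 5000 per-sim searches in a fully decentralised model."""
    for subgrid, cluster in SUBGRID_ASSET_CLUSTERS.items():
        if uuid in cluster:
            return subgrid, cluster[uuid]
    return None
```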

What are the requirements for implementing this approach? Actually, not many. LL is working on changing the way the objects are stored on the grid anyway — to allow for “first use” tags, or attachment of Creative Commons licenses — so they would just need to add a different way to reference assets, using a pair of keys instead of a UUID. When “migrating” to a sub-gridded grid, objects would all be automatically “updated” to reflect this key pair instead, so that would be easy as well.
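
That “automatic update” of existing references could, in principle, be a single migration pass. The sketch below assumes (and this is purely my assumption) that every stored reference can be walked over and that the destination sub-grid of each asset is known at migration time:

```python
# Hypothetical one-off migration pass: rewrite plain UUID references as
# (subgrid, UUID) pairs. All names here are invented for illustration.
def migrate_references(asset_references, uuid_to_subgrid):
    """asset_references: plain UUID strings as stored today.
    uuid_to_subgrid: mapping saying which sub-grid each asset will live on."""
    migrated = []
    for uuid in asset_references:
        subgrid = uuid_to_subgrid.get(uuid, "unknown")   # anything unknown gets flagged
        migrated.append((subgrid, uuid))
    return migrated

print(migrate_references(
    ["6ba7b810-9dad-11d1-80b4-00c04fd430c8"],
    {"6ba7b810-9dad-11d1-80b4-00c04fd430c8": "north"},
))
# -> [('north', '6ba7b810-9dad-11d1-80b4-00c04fd430c8')]
```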

So, what would be the biggest issue, then (or: “why hasn’t anyone at LL done this before, if it’s so good?”)? Mostly, LSL scripts. There is no “easy” way to change millions of lines of code to deal with the notion that keys would now be expressed as pairs of information instead of single UUIDs. This doesn’t mean the task is impossible, just very hard to do. A lot of scripts rely on the fact that a UUID is really just a string, and lots of clever trickery would have to be employed to keep all existing LSL scripts “backwards compatible”. It’s not an impossible task — just a very, very hard one.
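
To give a flavour of the kind of trickery that might be needed (this is purely my speculation, not anything LL has described), one option would be to keep handing scripts a single opaque string and only unpack it into the (subgrid, UUID) pair behind the scenes, so that old scripts that pass keys around as plain strings never notice the difference:

```python
# Speculative sketch: keys keep looking like single strings to old scripts,
# while newer infrastructure can unpack them into (subgrid, UUID) pairs.
LEGACY_SUBGRID = "main"   # invented default for keys minted before the change

def pack_key(subgrid: str, uuid: str) -> str:
    """New-style key, still just a plain string from a script's point of view."""
    return f"{subgrid}:{uuid}"

def unpack_key(key: str):
    """Accept both new-style 'subgrid:uuid' keys and bare legacy UUIDs."""
    if ":" in key:
        subgrid, uuid = key.split(":", 1)
        return subgrid, uuid
    return LEGACY_SUBGRID, key   # old keys keep working unchanged

print(unpack_key(pack_key("north", "6ba7b810-9dad-11d1-80b4-00c04fd430c8")))
print(unpack_key("6ba7b810-9dad-11d1-80b4-00c04fd430c8"))   # a legacy key
```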