Upgrading Second Life, and Why It Is So Hard To Do So

There are so many, uh, “professionals” claiming that everything Linden Lab does is incompetent and amateurish, and that they don’t know a thing about how to deliver updates, maintain a grid, or develop software at all…

Well, I also claim my share of professionalism 😉 Yes, I’ve worked for ISPs dealing with a few tens of thousands of simultaneous users (with far fewer servers than LL’s 5,000 or so virtual servers, though). And yes, my application to work for LL was rejected, very likely because my 12+ years of experience running relatively complex network/server operations were not enough. And they’re right: SL is far more complex than anything I have worked on 🙂 It shows, at least, that they set a very high threshold when accepting new people to work with them on mission-critical issues.

Some recent comments on LL’s blog are indeed naive and show a lack of understanding of how the grid operates. Linden Lab does stress-testing. They do have QA procedures; these used to be publicly available on their old wiki, and they even invited residents to suggest and implement new QA procedures (not many rose to the occasion, and the ones who did are very likely employees of LL these days). They have a whole grid for public testing purposes. They have at least one other grid for internal testing purposes.

As anyone who has deployed very complex software knows, you can test as much as you want in “controlled laboratory environments”, even stress-testing to incredible levels, but there is no better test than “the real thing”. It’s the nature of so-called “complex environments”. A computer programme running on a single computer, on a single CPU, with no other programme running at the same time, can be analysed in extreme detail, and you can scientifically predict what its output will be for a given input. Do the same test under a multi-tasking system, and that prediction will be right most of the time, but sometimes it won’t. Increase the complexity so that the system is not only multi-tasking but networked, and the outcome becomes even harder to predict. Now go to the full scale of complexity: multi-tasking, multi-CPU, multiple nodes on the network, multiple virtual servers per physical server. At that point, the “predictions” become chaotic.

In essence, it’s easy to predict the weather inside your own living room (just turn up the heat, and the room’s temperature will rise uniformly) but not on the whole of the Earth: a chaotic system that cannot be analysed with simple statistical methods, and whose variables you don’t know well enough to build a chaotic model that replicates its weather. Second Life is, indeed, a chaotic system with a limited predictability base. Like a weather system, you can simulate it. You can recreate a system that has not 5,000 sims but just 50, and not 15,000 simultaneous users but at most 150, and see how it behaves in a controlled environment. LL even uses very old computers on the testing grids to make it easier to pinpoint algorithmic errors (i.e. doing things far less efficiently than you could on a fast system); it’s an old trick of the trade. However, a system 100 times bigger than the lab environment is not just 100 times more complex: the relationships grow exponentially, not linearly. Back when there were just 20,000 accounts in the database, one could probably extrapolate the tests made on a smaller system by applying heuristics (“if the real grid is 100 times bigger, this will be 100 times slower”). With 2 million accounts, things sadly don’t work so well; a grid 100 times as big as the testing environment might be a million or even a billion times more complex. Exponentials kill almost all systems, and turn semi-predictable ones into chaotic ones. That’s the first issue: SL is complex. Much more complex than people tend to think.
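
The gap between linear intuition and combinatorial reality can be sketched with trivial arithmetic. Even the simplest measure of coupling (the number of distinct server pairs that can interact) grows quadratically, and richer interaction patterns grow faster still. A back-of-the-envelope sketch in Python, using the grid sizes quoted above:

```python
# Illustrative only: compare a naive linear heuristic with the growth of
# pairwise interactions when scaling a 50-sim lab up to a 5,000-sim grid.

def pairwise_interactions(n):
    """Number of distinct server pairs that can interact: n choose 2."""
    return n * (n - 1) // 2

lab, production = 50, 5000
ratio = pairwise_interactions(production) / pairwise_interactions(lab)
print(f"{production // lab}x the servers, {ratio:.0f}x the possible server pairs")
# prints "100x the servers, 10202x the possible server pairs"
```

And pairs are only the floor: once chains of three or more servers can influence each other, the state space explodes far faster, which is why the heuristics stop working.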

The second issue is simply a matter of time. Rebooting the whole grid takes 5 hours; you can’t “cut” time on that. It’s like a constant of the universe, and kudos to LL for keeping that “constant”, well, constant during all these years that the grid grew exponentially. So, to suggest that the grid should be updated for one hour every day is, sadly, technically not possible. That “demand” comes mostly from people familiar with other MMORPGs, where there is usually a “maintenance hour” every day to reboot all servers. Most MMORPGs have very simple back-end systems, and the complexity is on the viewer. Most of them also run on Windows servers; stability on Windows platforms can, surprisingly, be maintained by rebooting the servers often (i.e. once per day), since that is a good way to deal with all the memory leaks inherent to that OS.

Linux or FreeBSD-based operating systems manage memory in a very different way (I won’t say “better”, since that would be a biased answer 🙂 ), and a “maintenance reboot” is really not required: it neither helps nor makes things faster afterwards. Rebooting Linux servers does not mean they’ll be faster, suddenly get rid of bugs, or otherwise perform better. It’s simply not the way they work; thus a “daily maintenance of one hour” is sadly impossible to accomplish. It will always take 5 hours. And that’s why LL does their upgrades at most once per week, usually once every other week.

Thirdly, encouraged by the success of the Focus Beta Viewer (which allows new client-side functionality to be tested while connected to the real grid, i.e. “talking” to the same servers that the stable release uses), some people are now advocating that Linden Lab should only do things that way: have all releases tested on the main grid and never on any “beta grids”, so they could be tested in a real environment. This overlooks the fact that almost all releases/patches/upgrades/versions (at least 99% of them) are not purely client-side! As a matter of fact, almost all releases require both new client code and new server code. There is a good reason for this: the “Second Life Protocol”, the communication layer that binds your client to the grid servers, is constantly evolving with every new release. This means that former versions of the SL client will “talk” a slightly outdated version of the SL Protocol, which will simply not work any more.

However, when LL can cleverly deploy the SL viewer with only client-side modifications, you’ll get one of those “optional viewer download” messages: the version of the SL Protocol did not need any changes, and only the client was changed. When you get a “rolling server upgrade” to fix things, the reverse is true: only the server needed a change, not the SL Protocol, so you can still use the same client with the new server version. And, although this hasn’t happened yet, I assume that one day both will be done simultaneously: a “rolling upgrade” under way and, at the same time, an “optional client download” for those wishing to use it.
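
The decision logic described above can be summarised in a few lines. This is a hypothetical sketch, not LL’s actual release tooling; the function name and labels are mine:

```python
# Hypothetical sketch: which kind of deployment a release needs depends on
# whether the client, the server, or the SL Protocol itself changed.

def deployment_type(client_changed, server_changed, protocol_changed):
    if protocol_changed:
        # A new protocol version means old clients can no longer talk to
        # new servers: the whole grid goes down and everyone must upgrade.
        return "grid-wide downtime + mandatory client download"
    if server_changed and client_changed:
        return "rolling server upgrade + optional client download"
    if server_changed:
        return "rolling server upgrade"      # same client keeps working
    if client_changed:
        return "optional client download"    # same servers keep working
    return "no release needed"

print(deployment_type(True, False, False))   # prints "optional client download"
```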

The Focus Beta Viewer is, in essence, something like that: only client-side things have been changed, but the SL Protocol used to talk to the servers is still exactly the same one (of course, if the SL Protocol changes, LL has to release a new version of the Focus Beta Viewer as well). However, most of the bug fixes and new features require changing all three simultaneously: client, server, and SL Protocol. And that’s what happens every other week or so, when the grid needs to be shut down and relaunched (taking 5 hours) and everybody needs a new version of their client to communicate with the new server software on the grid.

Other games/platforms simply don’t work that way. Perhaps people are familiar with the Web, which has existed since 1992/93. It also has a communications protocol: HTTP. Since 1992, there have been only three versions of it: 0.9, 1.0 and 1.1. Most current browsers these days use 1.1, but a few still use only 1.0. On the content side, we went from HTML 1.0 for describing pages to HTML 4.01: four major versions in over a decade. Second Life, by contrast, gets a new version of both the protocol and the client viewer roughly every other week. It’s totally different, conceptually, from the way the Web works. And, unlike SL, nobody needs to “shut down the Web” when Mozilla, Opera, or Microsoft launches a new browser; they’re all compatible with the current versions of HTTP and HTML. In Second Life, things simply don’t work that way.

One can, of course, argue about why Linden Lab did not use a more Web-like conceptual environment instead of their current approach. Two things should be said at this point. First, they are, indeed, changing the way the SL Protocol works, towards an approach where there is far less need to change the communications layer. This would allow LL to do all server releases as rolling updates, and to make all client upgrades optional downloads. I think this is their ultimate goal: even allowing people to use their own favourite version of the client software (LL-supported or third-party, either open source or commercial). This requires, however, changing the whole SL Protocol: a daunting task of huge proportions, which has been hinted at as being under way for around 18 months now. Several parts of it have already changed; allegedly, since 1.9 I think, the servers already talk among themselves using a Web-based version 2.0 of the SL Protocol. I can only imagine that work is in progress to bring the client-server communications up to the same version as the one already deployed “under the hood”. But it’ll take time.

The second point is that LL never really needed to think much about this in the early stages of deployment. From studying their SL Protocol, it’s quite clear that their first group of developers came from game design in the late 1990s, where speed of communication on a local area network, to ensure quick response times, was far more important than stability and compatibility across millions of users (tens of thousands of them simultaneous) and thousands upon thousands of servers. They aren’t, however, folding their arms and saying “tough luck, this is what we devised in the early 2000s, this is what you get…”. Rather, they’re constantly rethinking the way they can “migrate” the conceptual framework of their original platform onto a new framework with totally different requirements and expectations, without too much mess.

Imagine the analogy of having installed Windows 95 on your PC and having Microsoft constantly throw patches at you so that at the end of the day you’d have Vista installed. Nobody in the software industry does things that way: at some point, you simply can’t keep patching something that was supposed to work well in 1995 and expect it to work in the computing environment of 2007. But that’s what LL is doing with SL! They have no choice; it’s impossible to demand that people suddenly start from scratch on “Second Life 2.0”, forfeiting all content (objects, avatars, scripts…) just because the things LL can do in 2006 are way beyond what was possible in 1999. Instead, they have a migration path: slowly introducing the changes that will become “Second Life 2.0” without ever breaking anything done in the past. Anyone who has tried to install, say, a current version of the X Window System over a Linux installation done in 1999 knows what I mean 🙂 It’s possible, but be prepared: it’s a nightmare taking weeks (and at every point of the process you’ll be wishing you could simply install a fresh copy of Linux instead).

One could also argue that what LL should be doing is develop a parallel grid of 5,000 servers running “Second Life 2.0” (which doesn’t really exist) and create a “migration tool” to upload all “old 1.0 content” and “convert” it into “new 2.0 content”. I won’t even debate whether this is technically possible. What I’m sure of is that it’s financially and resource-wise simply not feasible: it means adequately testing hundreds of millions of assets and making sure that not a single one breaks during the update. We’re not talking about a software product with a few gigabytes of data stored on your disk. We’re talking about dozens of terabytes of shared content that need to be changed! If that does not impress you, it should. IMHO, it’s nothing short of a miracle that during an upgrade only a negligible amount of data is “lost” when, say, converting databases from one asset server to another…

What could indeed be done is something slightly different: not relying so much on a single, centralised server cluster to do all the “tracking”. Right now, as I have patiently explained to some residents, that centralised cluster is the most economically sound alternative. It would be very easy, for instance, to change the way objects are tracked by the asset server by relying not on the UUID alone, but on a pair of (sim server, UUID) as the identifier.

What would this accomplish? Well, currently, assets are stored on the simulator server where they were first uploaded, but the information about that storage is kept on the asset server. Every time you need an asset on a different sim, a request goes to the asset server to find out where the asset is; the sim where it was uploaded is contacted, and the asset is downloaded to you, while a local cache is kept on the sim you are on.

But technically you could get rid of the central asset server by simply storing the (server, UUID) pair and having your client contact that server directly (also storing a local copy on the sim you are on). There would be no need to keep the expensive bottleneck called the “asset server cluster”, which causes most of the issues you’re familiar with: lost inventory, Search not working, being unable to teleport, and so on.
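
A toy model, with hypothetical names throughout, makes the difference concrete: the centralised design pays for an extra query against one shared index on every fetch, while the pair-based design carries the routing information inside the reference itself:

```python
# Toy model of the two lookup schemes described above (names hypothetical).

sims = {"sim-42": {"a1b2": b"asset bytes"}}   # each sim stores its own uploads

# Centralised: one index maps every UUID to the sim holding the asset.
central_index = {"a1b2": "sim-42"}

def fetch_centralised(uuid):
    home = central_index[uuid]    # the bottleneck: one shared lookup service
    return sims[home][uuid]

# Decentralised: the reference itself is a (server, UUID) pair, so the
# client can go straight to the right sim with no central index at all.
def fetch_decentralised(ref):
    home, uuid = ref
    return sims[home][uuid]

assert fetch_centralised("a1b2") == fetch_decentralised(("sim-42", "a1b2"))
```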

So why does LL use a centralised method? It’s purely an engineering/financial decision. It’s far better and cheaper to maintain a centralised system than a fully distributed one. Imagine the following example: you send a bug report to LL complaining that you have lost an object called “My Nice Clothes”:

A) Today’s system: “Sure, we’ll take a look at the central database and retrieve it for you. There you are. Sorry for the mess.” All the tech support has to do is look it up on the central asset cluster, see where a copy is found, and retrieve it from the backups.

B) Decentralised system: “Sure, do you know the server’s name and the UUID of the object you’ve lost?” Well. Under this model, people naturally have no idea where the object was first uploaded. So LL’s tech support team would have to log in to all 5,000 servers separately and search, one by one, to see where a copy is located. That takes a huge amount of time, as you can imagine.

What will very likely happen in the future is a model with sub-grids, i.e. parts of the grid controlled by their own “local asset servers”. Assets would be identified by pairs of (subgrid, UUID). Immediate advantages:

  • LL could take part of the grid down for an upgrade while keeping the other parts running. The cool thing is that there would be no absolute downtime. People would just have to jump from one part of the grid to another (as during a rolling upgrade). Sure, during the upgrade period, some assets would not be visible (those on the subgrid being upgraded and not locally cached), but in most cases, most things would still work. If there were four subgrids, for instance, you’d have 75% of all content always available (probably more, due to local caching) and, more importantly, you would always be able to log in.
  • LL could abort an upgrade process mid-way. Having 25% of the grid to experiment with under load is very likely enough to figure out what is going wrong after an upgrade is deployed. They could just warn people to stick to the rest of the grid while they fix the problems on the sub-grid being updated. Sure, the 25% of users who happen to have all their content on that sub-grid would still complain. But they would at least be able to jump to the rest of the grid, running the “old version”, while the new one is tested. It’s a fairly good compromise.
  • This would naturally allow sims to be hosted at geographically different co-location facilities. Since we know that Linden Lab wants to do this, it would be one possible approach (a much easier one would be simply to have the asset server cluster itself geographically separated, but that would still be a single grid nevertheless).
  • Manually looking up assets across a handful of sub-grid asset clusters takes not much longer than looking them up on a single asset cluster. There is quite a difference between doing 4 queries for 4 subgrids and doing 5,000 manual queries on a totally decentralised model.
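
Under the stated assumptions (four subgrids, one of them down for an upgrade), the addressing and availability arithmetic might look like this sketch; nothing here reflects LL’s real code, and all names are mine:

```python
# Sketch of (subgrid, UUID) addressing during a four-subgrid rolling upgrade.

SUBGRID_STATUS = {"A": "up", "B": "up", "C": "up", "D": "upgrading"}

def asset_reachable(ref):
    """An asset reference is reachable if its home subgrid is up."""
    subgrid, _uuid = ref
    return SUBGRID_STATUS[subgrid] == "up"

up = sum(1 for status in SUBGRID_STATUS.values() if status == "up")
print(f"{up / len(SUBGRID_STATUS):.0%} of content directly reachable")
# prints "75% of content directly reachable"

assert asset_reachable(("A", "a1b2"))
assert not asset_reachable(("D", "c3d4"))
```

Locally cached copies would push the effective availability above that 75% floor, as noted in the first bullet.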

What are the requirements for implementing this approach? Actually, not many. LL is changing the way objects are stored on the grid anyway (to allow for “first use” tags, or the attachment of Creative Commons licenses), so they would just need to add a different way of referencing assets, using a pair of keys instead of a single UUID. When “migrating” to a sub-gridded grid, all objects would automatically be “updated” to reflect the key pair instead, so that would be easy as well.

So what would the biggest issue be, then (or: “why hasn’t anyone at LL done this before, if it’s so good?”)? Mostly, LSL scripts. There is no “easy” way to change millions of lines of code to deal with the notion that keys would now be expressed as pairs of values instead of single UUIDs. A lot of scripts rely upon the fact that a UUID is really just a string, and lots of clever trickery would be needed to keep all LSL scripts “backwards compatible”. It’s not an impossible task, just a very, very hard one.
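
Since LSL keys are, at bottom, just strings, one purely hypothetical compatibility trick is to let new-style keys carry a subgrid prefix while bare legacy UUIDs implicitly belong to the original grid. The separator, the function, and the “grid-0” label are all my inventions, not anything LL has announced:

```python
# Hypothetical backwards-compatible key parsing: old keys stay valid.

LEGACY_SUBGRID = "grid-0"   # assumed name for the pre-split grid

def parse_key(key):
    """Return a (subgrid, uuid) pair for either key style."""
    if ":" in key:                    # new-style "subgrid:uuid"
        subgrid, uuid = key.split(":", 1)
        return subgrid, uuid
    return LEGACY_SUBGRID, key        # legacy bare UUID string

old = "6076e2b0-0000-0000-0000-000000000000"
assert parse_key(old) == ("grid-0", old)
assert parse_key("subgrid-3:" + old) == ("subgrid-3", old)
```

Every script that merely passes keys around as opaque strings would keep working; only scripts that pick UUIDs apart character by character would need the “clever trickery” mentioned above.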

About Gwyneth Llewelyn

I'm just a virtual girl in a virtual world...


  • There’s been a similar discussion of sorts going on in the ‘Uru Live Preview’ forum, related to how to deal with the last big show-stopper for many – huge latency in the common ‘City.’ It’s come down to a debate on instancing, and if that would fracture the community that they’re trying to ensure is kept together during gameplay.

    The link is here:

    http://www.urulive.com/forums/viewtopic.php?t=3749&start=0&postdays=0&postorder=asc&highlight=

    My suggestion (based on the SL simulator system no less) is the beginning of page 8. Richard Watson (aka RAWA) posted about mid-way through to shoot down the core of the original poster’s discussion thankfully, but it still comes down to how they want to implement scaling that does not fracture a community too much.

    As for myself (and back to original topic), I don’t think that SL is going to survive past 2007 without *some* sort of major re-write of some caliber. It may, in fact, try for something like the subgrid approach. Another would be for total decentralization via leveraging GRID technology (though that would pose an interesting challenge to quantify what the heck you’re getting, when buying an island!).

    Another is to leverage an external company for optimization, be it IBM for Linux-running tweaks or Sun for conversion and tweaking to Solaris 10 and the Sparc T1 chip (a 1U system with up to six cores, each having four threads represented as CPUs) or their AMD-based systems now in production.

    There’s lots that can be done really, it just takes time. Unfortunately I think that it’s time that LL may not have much more of.

    –TSK

  • They might “outsource” time… they have recently announced that they’re willing to accept submissions from software development companies to work on specific aspects of their code.

  • Interesting piece of writing. I think a lot of people underestimate the complexity of the Second Life grid, and I doubt those who underestimate it will ever read this piece of writing.

    But I totally agree on the sub grid “solution” you offered. It is needed and it is needed fast, because they found their single point of failure in their current asset server. Maybe the quick solution/work around is to be found in connection pooling with user UUID as parameter to the asset server instead of impersonated connections (which are unique). But that might be simple thinking because I don’t know the SL grid architecture in detail (wished i did from my profession as software architect).

  • Umm, why does splitting assets over four grids require reworking assets to be a pair of (subgrid,key)? This would be nearly impossible to develop and maintain. Instead recognize that a UUID is an opaque “baton”.

    The first few bits of the UUID probably have meaning here (even if in Microsoft’s case they mean “this UUID was generated by Microsoft”), and there’s nothing preventing part of the UUID from being repurposed to mean “source grid”. You have enough bits available; heck, you could identify the individual sim that the asset was created on inside there, even if you are tracking only by the “quarter grid” that you describe.

  • ToryMicheline

    Gwen,

    Having only been in the SL environment since 10/15/2006 I am still in the “WOW” stage. Those of you that have been inside computers for a long time see the SL environment more clearly.
    I would rather work in one interface for a while and then get upgraded, but you pay your money and …. I think SL is fascinating even if it doesn’t always work perfectly.

    -Tory
    [email protected]
    http://picasaweb.google.com/tory.micheline

    CARRY ON !!

  • Being a noobie, DOB 10.15.2006, and not being a computer programmer/analyst I am way over my head commenting here. I will say that “This softward IS REALLY complicated”. I don’t know if new releases every couple of weeks is the way to go but…… at least the Lindens are not sitting on their hands.

    -Tory Micheline

    Gwen – Your site is NOT for the faint of heart but requires careful thought and consideration. I really appreciate your efforts. I never thought so many smart people could interact as avatars on one piece of software.

  • SL never ceases to amaze us, does it? 🙂

    Some further thoughts on this

    odysseus, take a look at the UUID algorithm. It’s quite a complex one, and all the “bits” of the UUID are generated automatically: all of them. There is no “prefix” identifying “this is a UUID generated by Microsoft, this is one from Linden Lab”. One could, however, imagine a model where the first, say, N characters of the UUID specify a server/subgrid/whatever, and the rest is a “restricted”, randomly generated UUID.

    Sadly, this means inventing a new algorithm “on the spot”, which very likely will neither be random nor guarantee uniqueness. Worse than that, how would LL deal with the existing keys, specifically those hard-coded inside so many scripts (mostly sounds and textures, but sometimes avatar keys as well)? Not to mention that this would also break all existing third-party websites which rely upon LL’s UUIDs being fully compliant with the OSF standard.

    Of course, one could develop a two-step process: generating a new key and checking whether it “matches” any existing key. All keys that are “unmatched” would belong to the “first grid” (if the algorithm generates an existing key, it is retried until it generates one that does not exist yet); all new keys would have their first N bits “tagged” to mean something (like belonging to a specific subgrid). Assuming that LL can indeed develop their own derivative algorithm, and validate it mathematically to generate sufficiently random and unique keys, this could be possible to do, although, strictly speaking, it would be a mess 🙂
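
That two-step scheme can be mocked up in a few lines. The tag length, names, and retry loop are all assumptions of mine, and, as noted above, overwriting the leading digits weakens the statistical uniqueness guarantees of the underlying algorithm:

```python
import uuid

# Hypothetical sketch: tag a new key's leading hex digits with a subgrid
# identifier, retrying generation until it avoids all pre-existing keys.

TAG_LEN = 2   # number of leading hex digits reserved for the subgrid tag

def new_tagged_key(tag, existing):
    assert len(tag) == TAG_LEN
    while True:
        candidate = tag + uuid.uuid4().hex[TAG_LEN:]
        if candidate not in existing:    # retry on (unlikely) collision
            return candidate

legacy_keys = {"00" + "f" * 30}          # pretend these are untagged old keys
key = new_tagged_key("7f", legacy_keys)
assert key.startswith("7f") and len(key) == 32 and key not in legacy_keys
```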

  • Wubs Ooo Gwyn!

    But ya know I’ve read a couple two three break downs similar to yours over the years.

    All greek to me, but some of us DO know how complicated, complex etc SL is, not to mention the burdens of tryin to upgrade from the hampster in the wheel to more modern equipment, software what have you.

    I for one would even be willing to forgoe (sobs!) my Simone folder – and the rest of my inventory – for a start from stratch hey yer a noobie again Second Life 2.0

    But thats just me. I do feel the pain of the LL employees who have to do all that revamping, but I think the main complaint of folks is this:

    1. While yer doing all that, make sure that the employees who give out info/explanations/sorry for the inconvenience type posts a little bit more sensitivity training.

    e.g. – during yet another update oopsie Josh Lindie posts about the shiny new log in screen he made – and basically laughed at any customers who thought his timing was a little uncool.

    2. Most of us are iggerant fools who don’t know how the car runs, and when put the key in and stuff and it doesn’t run – you expect those who know how to fix it – should be able to fix it.

    3. We’d love for like just once LL would say – yeah mostly we need yall to help us test stuff in a real time environment – those who are still payin to play, wouldn’t feel so ouchy (mebbe)
    (ie acknowledge that SL wasn’t never really ready for prime time – and indeed couldn’t be without those exponential numbers testing thingers you splained)

    There’s more, but its yer blog not mine and really I’m just plain tired

    *smoochies!*
    -BC

  • haha Brace — you’re most welcome to write as much as you wish, as always 🙂

    I do agree with the issue of the “communication problem” of Linden Lab. They have the most amazing product that was ever invented for a computer, well, since DOS 1.0 came out for the PC 🙂 But sometimes they haven’t a clue how to “sell” it. Some of the Lindens tend to say “oh, we never do presentations of Second Life like the other companies do, we just show them machinimas”. And that’s mostly because it’s hard to explain what SL is about. If a picture is worth a thousand words, a movie is worth a million 🙂

  • niko donburi

    So, to summarize:

    (1) The “butterfly effect” is alive and well in SL, and

    (2) We should just be happy the damn thing works in the first place!

    😉

    Niko

  • KimbeauSurveryor

    Great article, Gwyneth; I hope it goes some way towards muting just a few of the moaners on the SL blog!

    However, i think that SL’s major problem is not so much technical as user-psychological: LL have failed to make the transition from delivering cool toys to a bunch of techie pathfinders, to the situation today where a lot of users are here because it’s “just another cool game”; and a significant minority are in SL because they want to be part of a serious, money-earning future.

    Perhaps LL was more than a tad unwise to go for the rapid expansion phase at the time they did, but that’s water under the bridge, and they have to learn to live with their self-imposed new order!

    In this current, quasi-mature phase of operations, *every* change is going to annoy someone. New bugs will annoy huge numbers — as they do! Every proposed change to the grid needs to be examined very carefully, from a customer-management point of view as well as from a technical standpoint. I suspect there is a good case for switching to a “no new features at all” mode, and just fix the bugs, in order of seriousness. This has the attraction that more of the team could be released to work on a major re-vamp.

    If it was up to me (and if I wasn’t up to my neck in rolling out cool fibre broadband to a reluctant British public, I’d be applying for the job ;-), I’d make available a rapidly changing (but nominally releasable) public alpha grid of SL2 as soon as possible, and pay people L$ on the main grid for spending time in the alpha grid (testing the new version would certainly beat camping for money on the main grid!)

    Let me make a prediction; I suspect that the upgrade on Wednesday will be met with howls of anguish and dissatisfaction; that complaints will go UP, not down, even if the bugs fixed do improve the “technical” situation. In my opinion, LL would be much better advised to stick to addressing bugs they can fix safely with rolling upgrades, until they have pushed that approach as far as they can. When the grid is a lot more stable, users will be a little more tolerant to ups and downs from Wednesday upgrades.

    It’s all about the customer, LL!

  • I hope this is the proper forum for me to be discussing this geeky stuff at such length, lol.

    The article you linked to on Wikipedia on UUID’s took a bit of wading through before I found that link to RFC4122 which is probably where you were trying to send me to in the first place.

    There are a number of different places that LL can “squeeze” in extra information such as the sim that generated the UUID and whether they are dealing with an “old” UUID or a “new” one. Many of the different UUID formulas incorporate the MAC address identifying the computer that created it (and it’s possible that the UUIDs in the system already have a “machine assignment” then, albeit one that doesn’t move with the sim). I have a number of virtual VPN adapters, and as they need to have MAC addresses of their own, they came up with “00:FF” followed by what may be a random number (or opaque baton, who knows).

    My reference to MS is that they seem to be doing their own thing (as usual) and have come up with a completely different (and probably undocumented) way of building UUID’s (described in section 4.1.1 of that RFC). I think that one point that I was trying to make is that things like UUID’s should only be understood by the machine that created them, everyone else should just look at them as if they were these long strings of digits that have no inherent meaning other than to identify an object.

    I do agree that if LL does change the mechanism that UUID’s are generated with, that they are going to have a major legacy issue, which will grow the longer they wait before doing something like this.

  • KimbeauSurveryor

    Can’t the performance problems be solved by having local mirrors of the database? As I understand it, objects are created, get a UUID, and thereafter neither the UUID nor the location of the object ever change again — if the object is edited, it gets a new UUID? If that’s the case, then client processes can look up in the mirror first, and only address the master if the info is not found.

  • Kimbeau, sims now have their own localized asset system, but its currently restricted to copies of rezzed objects and baked avatar textures.

    LL could in theory extend that for inventories of avatars who make that sim their ‘home’ point (i.e. if I still had land on Boardman a localized cache copy of inventory could reside there).

    Gwyn – I’ve said for a very long time now that LL should consider licensing the code base out to companies that would like to be part of SL but not part of the centrally-managed grid. I still think this has merit since companies that take a bite at this would have the impetus (and apparently time) to update that code to either fix/complete features or make maintenance easier. I’d like to see some comment by more people on this, but in your case it may be easier to just start a new post. 😉

    –TSK

  • lexneva

    I’m glad you took the time to break this down. I agree with a lot of what you’ve said… the problem is not that LL is incompetent, but that SL is hard. In fact, LL pretty much has to be full of awesomely talented people to have SL be as good as it is now.

    Well, currently, assets are stored on the simulator server they were uploaded first, but the information on that storage is kept on the asset server. Every time you need an asset on a different sim, all it takes is a request to the asset server to ask where the asset is, contact the sim where it was uploaded, and download it to you, while keeping a local cache at the sim you are.

    I’m almost sure that this isn’t the way things are currently done. I’m basing this off the same thing you are, namely my assembly of the various tidbits of architecture information that Lindens have dropped in the forums and such. I’m pretty sure that all actual asset data is stored on the asset server cluster. Inventory is stored on a separate server as, essentially, a list of “this person owns something named Foo which has asset UUID ___”. Any time something needs to be downloaded by a client, it asks the sim it’s currently in, which asks the asset server, grabs the data, and forwards it on.

    A sim also stores a local copy in its own cache, which is a really awesome opportunity for optimization. Given LL’s copy-on-write storage paradigm, every asset UUID is read-only, so caching can be done with 100% confidence that the cache will never go stale. Since people are always going through the sim they’re in for assets, that means that a sim’s cache fills mostly with the data referenced within the builds in the sim itself.

    Also, about the whole UUID thing… I’m pretty sure Kelly Linden once referenced RFC 4122 (or one very similar) as the standard that LL based their UUIDs on. One nice thing hinted at by a commenter above was this little piece of data:

    node = 6hexOctet

    The RFC suggests using the MAC address, but LL may well have used another kind of node ID because sims might share the same MAC address. Anyway, every UUID already has SOME kind of unique node ID, so it’s likely that it’s very easy to partition the existing set of asset UUIDs by their sim of origin.
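    In Python this is easy to see: a time-based (version 1) UUID carries its node field directly, so grouping UUIDs by origin node is a one-liner. A sketch, assuming for the sake of illustration that the UUIDs really are version 1 (`partition_by_node` is an invented helper):

```python
import uuid

# A version 1 UUID embeds a 48-bit node ID in its last six
# octets -- the "node = 6hexOctet" field from RFC 4122.
u = uuid.uuid1()
print(u.version, hex(u.node))

def partition_by_node(uuids):
    """Group UUIDs by the node ID embedded in each one."""
    buckets = {}
    for item in uuids:
        buckets.setdefault(item.node, []).append(item)
    return buckets
```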

    But why?

    I’m not sure I see a marked advantage to having the existing sim servers take on the distributed burden of serving assets. As far as I understand things, a given Region can show up on any Sim at any time. It gets its “simstate” from the asset server, loads it up, and connects up to the grid and registers the Region, and people can connect. The only performance loss is that the cache has to be recreated.

    If assets had to be stored on each sim, you’d have to somehow cart around all of those assets every time a sim moved. Think about a hard disk failure on any given sim (I’m sure it happens a bunch). You’d have to recover all of those assets somehow, or store them elsewhere… say on a centralized asset server…

    Also, clients would have to be opening hundreds or possibly thousands of connections to various sims. Sims would have hundreds or thousands of clients connecting to them. Kernel network tables could fill.

    Instead, if the asset system is a cluster, you know where all of the data is and can make it redundant with all of the standard redundancy techniques like RAID and such. You can partition the UUID space up and assign various chunks of assets to various nodes (I bet they do this in some form or another). You can insert an arbitrary number of layers of caching in between requests from sims and the actual asset nodes, as needed. You can add more asset nodes as needed. All in all, I think the existing system is a good solution.
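    Partitioning the UUID space across asset nodes can be sketched in a few lines. Assuming, purely for illustration, a simple modulo scheme over the UUID’s 128-bit integer value (nothing here is LL’s actual method):

```python
import uuid

NUM_NODES = 8  # hypothetical size of the asset cluster

def asset_node_for(asset_id: uuid.UUID) -> int:
    """Map an asset UUID to one of NUM_NODES storage nodes."""
    return asset_id.int % NUM_NODES

a = uuid.uuid4()
print(f"asset {a} lives on node {asset_node_for(a)}")
```

    A real cluster would more likely use consistent hashing, so that adding a node doesn’t remap almost every existing asset, but the principle is the same.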

  • AndrewLinden

    The SL project is indeed very complex, and yet, most of the complaints about its shortcomings are partially legit. Yes, the grid shouldn’t require 5 hours of downtime to update, yes an ideal system shouldn’t need to suffer a monolithic update anyway. Yes, the asset system should be distributed.

    Many people correctly conclude that there are better ways to implement the various parts of the SL project. What they often fail to understand is just how hard it is to write perfect software in a timely manner. It really is easier to conceptualize the idea than it is to build it. Even unlimited resources (time and manpower) cannot completely collapse the time it takes to build new stuff.

    So how does a company with limited resources (money and manpower) bootstrap something like SL into existence? The answer is to punt the hardest parts for later, make the required parts work well enough, and complete the minimum set of features to attract enough early adopters to prove the concept to the unbelievers. As the whole project acquires more people it becomes more interesting, and the company is able to convince not only new people to use the product, but also investors to buy into the effort, so that the employees can be paid as they complete or fix the missing or broken pieces.

    I think the obvious shortcomings of the project are a consequence of the bootstrap system. The good news is that most of the shortcomings really are solvable. I think you’ll be seeing lots of improvements to SL over the next year.

  • Andrew, thanks for coming here 🙂 We’re honoured by your presence!

    Lex, the last time a rumour spread about the way the asset server was done in the past, it was a very, very simple system, which didn’t even require a “database”: basically, it was just a collection of pointers to assets stored on a sim. Of course, the whole asset server has since had a major overhaul, as have the inventory servers, and it’s not easy to understand how things might have changed.

    Still, one thing is certain. Upload a texture to a sim, and it’ll load way faster there (even on a very busy one). After uploading it, rez it anywhere else — even an empty sim! — and see the difference. With enough time and patience, and enough uploads, this seems quite consistent: textures (and sounds, and animations…) load much faster the first time on the sim they were uploaded to.

    Now this can be due to two things:
    1) The texture is immediately cached locally and then goes to a central server (possible)
    2) The texture stays locally, and just a reference is made to it on the central asset server (more likely)

    Of course, after the caching kicks in, in theory at least, you’ll be able to rez things fast no matter where you currently are. In the past, I tended to upload everything on the then-new Sandbox Island (as opposed to my residence at the time, which is on a painfully slow early sim), and it made quite a difference…

    In essence, it’s more reasonable to have the assets distributed among 5000 servers, each with its own database, and keep just the pointers to the assets on the asset server, instead of storing everything there. With “hundreds of millions” of objects (I’m loosely quoting Philip from his last or before-last TH), I find it hard to imagine that LL has put all the assets on the same server – especially taking into account textures, which would take vast amounts of bandwidth to serve to all the other servers…
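    Possibility 2 above (pointers on the central server, bytes on the origin sim) can be sketched in Python like this; all class and method names are invented for illustration, not LL’s actual design:

```python
class AssetDirectory:
    """Central asset server: maps asset UUID -> the sim holding the bytes."""
    def __init__(self):
        self.location = {}

    def register(self, asset_id, sim_name):
        self.location[asset_id] = sim_name

    def locate(self, asset_id):
        return self.location[asset_id]

class Sim:
    def __init__(self, name, directory):
        self.name, self.directory = name, directory
        self.storage = {}  # assets uploaded to this sim
        self.cache = {}    # assets fetched from other sims

    def upload(self, asset_id, data):
        self.storage[asset_id] = data
        self.directory.register(asset_id, self.name)

    def fetch(self, asset_id, sims):
        if asset_id in self.storage:    # uploaded here: fastest case
            return self.storage[asset_id]
        if asset_id not in self.cache:  # otherwise, one hop via the directory
            origin = self.directory.locate(asset_id)
            self.cache[asset_id] = sims[origin].storage[asset_id]
        return self.cache[asset_id]
```

    This would match the observed behaviour: the first rez on a foreign sim pays the extra hop, and every rez after that hits the local cache.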

  • FWIW, here’s another little tidbit about how the asset servers are organized. Linden comment at the very bottom, in response to someone saying that they could build a better Second Life:

    http://www.secondlifeherald.com/slh/2006/01/interview_with_.html

    The focus I have here is the description of the asset server as a “server cloud”; there may be a couple hundred machines that contribute to storing your inventory, completely separate from the grid…

  • Pingback: Jeff Barr’s Blog » Links for Thursday, December 14, 2006