The Schism Around Voice: Multicasting vs. Broadcasting

Imagine that you had an awesome technology that allowed you to create a universe you have just pictured in your mind, in as much detail as you wish, and to populate it with realistic characters walking around, so perfect in their minutiae that their behaviour and looks would be completely indistinguishable from those of real human beings. Now imagine that this technology were available to everybody in the world, and that anyone, anywhere, could have access to it.

Actually, that technology does, indeed, exist. It’s even quite old, having first been developed some 5,000 years ago. It’s called a book.

Nevertheless, a “book” is not perfect — there is an author, and there is an audience. The author can do with their book whatever they wish; but you, the reader, cannot. All you can do is read and imagine along with the author, but not contribute to the book.

Enter the Internet and its many social environments: from the old bulletin boards, through FidoNet and USENET, to IRC and webchats, we finally come to things like, well, Second Life. Here a new paradigm has emerged: the notion of a collaborative environment, where readers and authors alternate roles, and both contribute, at the same time, to a collective work. Early analysts of the “Internet revolution” hailed this as the fundamental change in the ancient roles of author-editor-publisher-audience: the old “broadcasting” paradigm (one sender, many receivers) has been replaced, on the Web, by a new model, multicasting (all are senders and receivers), and by the notion of collaborative environments, where all are readers and authors at the same time.

From the roots of that paradigm, buzzwords like “Web 2.0” have emerged: social, interactive environments, where people collaborate to provide content. So far, these have been mostly asynchronous — in the sense that on a blog, a forum, or any of the social sites like MySpace or Flickr, the author(s) publish something (a text, a picture, a movie), and others add to it afterwards (with comments, or, in the case of Wikipedia, with edits and additional articles). It’s not done in real time; there are delays. Still, those delays are tiny when compared to an author who writes a book and then receives fan/hate mail, which might influence a future revision of their work.

Real-time interactivity comes, however, with chatrooms — from IRC to webchat. Here the issue is slightly different: you don’t have persistence of context. An IRC session can take hours, with hundreds of participants, but after it finishes, the text “disappears” (obviously, you can grab a copy and read it later; but IRC, as a medium, is not persistent on its own). Thus, although it’s definitely real-time, it lacks the persistence effect of something like, say, Wikipedia — which grows and grows as more people interact with it.

There is, to a degree, a system that allows both: real-time interaction in a “multicasting” model, with many authors and readers who interchange roles, and persistence of content. The best-known examples are the MUDs pioneered by Richard Bartle (with Roy Trubshaw) and their many successors. In that environment, people build rooms — snippets of text that, like a book, appeal to the users’ imagination to evoke images in their minds when they read what others have placed as a description of the room. The rooms are also interactive — they are not “merely” nice, textual descriptions, but they allow, through commands and other special options, agents in the environment to change the rooms. Whether it’s a “game” (in the sense of a role-playing game with missions, quests, and goals), or simply “interactive art” (a narrative that unfolds, depending on the users’ choices), the issue is the same: MUDs allow collaboration, interaction, multicasting, and they have persistence of content.

Second Life and the Collaborative Environment

Now enter Second Life. Bartle and others tend to scorn the “visual” aspect of Second Life, in the sense that it “distracts” the user — in the same sense that a movie based on a book will short-circuit the viewer’s imagination, by presenting a “ready-made” universe that is “consumed” without interaction. If I say that my hair is red, people will imagine different shades of red, and probably some will imagine my hair is curly, or perhaps spiky, or clipped short. In Second Life, however, they’ll get a 3D representation of my red hair — and there will be no room for “imagining” what it looks like. An image, being worth more than a thousand words, will “spare” us a thousand words of patient description of each flexiprim in my hair — but it will also constrain how people imagine my hair.

This is an old argument about 3D virtual worlds: they stifle the users’ imagination by presenting a “canned” visual environment. The issue, of course, does not apply to Second Life, for a very good reason: everybody can change the environment. So, although one might agree that the argument holds for a “ready-made” 3D environment that does not allow much change (which is the case with the majority of virtual worlds these days — a company produces the highly detailed environment, and the users are simple consumers of that content), Second Life is very different. When “red hair” is described, it can apply to any different conception of what red hair looks like; you can build it the way you wish; if someone disagrees and says “that’s not what red hair looks like”, that someone can create their own representation of red hair. Pure imagination is replaced by the absolute freedom to redefine the whole environment: with an adequate set of tools (the in-world building and scripting tools), anyone can precisely pin down their own representation of the environment, and share it with everybody else.

But that’s not all — in Second Life you can also do more: change (or, better, contribute to) existing content. So, back to the red hair example, you aren’t limited to saying “I can do better red hair than you”; you can, effectively, change someone else’s red hair — interactively, and in real time.

What has all the above to do with “voice”? 🙂

Multicasting vs. Broadcasting in a Collaborative Environment

Most people take technology and “the way things are” for granted. When you turn a TV on, you expect it to display an image after a second or less, to show you a movie that “someone” has prepared (the broadcasting station and their producers), and to show it in decent quality at 25 frames per second (or more). You don’t “expect” to be able to talk to someone in a studio while watching the news report. You don’t “expect” to interactively change the patterns of the clouds on the weather report. You don’t expect someone else to suddenly pop in and start broadcasting their own popular show in the middle of an opera concert. We assume that TV works the way it does because we’re used to it: it’s a broadcasting medium, not a multicasting one, and it has worked well that way for 70 years or so. Naturally, you might have post-modern art trying to break the rules (“conceptual television”), or a degree of interactivity on some live events where people can call a phone number and talk to the hosts, but the nature of TV as a broadcasting medium hasn’t changed much. Its biggest change was perhaps the VCR, allowing people to watch things out of order, i.e., still being consumers of video, but not limited to a predefined sequence of events as established by the broadcasting company. But that’s about it. If you want an interactive video medium, you’re far better off with something like YouTube (which is not a real-time interactive medium, but at least it allows some feedback and change, through comments and, most importantly, video comments).

Second Life also has a few assumptions, which have worked well to give us a collective opinion of what it actually is. It’s a “medium” as well — no question about it — and it’s real-time. It’s definitely collaborative: Linden Lab has added little content, and practically everything (except the sky, the colour of the sea, the Sun and the Moon, and the Earthlike gravity) can be changed by the users. But it’s not an asynchronous medium like, say, YouTube, or a blog or a forum: very much like IRC or, better, a MUD, it allows synchronous, real-time change to occur all over the world. At any given moment, almost 40,000 people are changing the environment around them, and this change is collaborative and real-time. You can watch things being changed in front of your eyes; and, although people might be physically apart, they work on the same virtual environment. Better still: you can build things in a different virtual location — like someone building a tower on their private island — and then join it with a common build elsewhere (placing the content at its final destination). Buying some outfit at, say, Pixel Dolls, and then going to a club and displaying your new dress is the simplest form of “location disruption”: the notion that content does not need to stay at a single location, but can move across the grid to different locations. Moreover, a single piece of content can be used at different locations at the same time (i.e., that same dress can be worn by several different avatars in different clubs).

The closest analogue on the 2D Web is someone copying the content of a blog article and simultaneously placing it on several other sites. In a sense, “feed aggregators” and “blog portals”, or simply placing links from MySpace to YouTube, are similar Web 2.0 concepts that replicate (perhaps a bit artificially) what we expect to be “normal” in Second Life.

The Environment Shapes Communication and Behaviour

We take all the above for granted: the notion that content is not stuck to locations; that content is created dynamically and in real time; that content is (or at least can be) created collaboratively, synchronously, and without “delays”. We’re so used to it that if for some reason this changed, we would be as surprised as if the news report on TV were suddenly “taken over” by a pirate broadcaster who popped in the middle of the report and started showing pictures of surfers in Hawaii.

Thus, when using Second Life (in the sense of being an active participant in a multicasting, real-time, collaborative medium), we adopt — naturally, one would say — a model of social interaction that is appropriate for that environment. Since content can be anywhere, we use IMs to talk to people whom we cannot “see”. Since there is no one-to-one relationship between location and content, we tend to view our friends and contacts as being “detached” from the location (or even from their appearance). They are “somewhere on the grid”, probably “wearing some outfit or other” and “surrounded by some kind of content”. We don’t care; we know that an IM will always reach them, no matter what their location or appearance.

Since we’re used to collaborative content in Second Life (and also because of older paradigms like MUDs or even IRC), we also tend to look upon social happenings in SL as being highly interactive and collaborative. A discussion in Second Life can have 25-50 participants — all talking (writing!) at the same time. You don’t really need “moderators” as in RL — the dynamics of IRC work well in SL, and fit its collaborative and multicasting nature like a glove. In an environment with multiple agents that are producers and consumers at the same time, a social form of communication (a discussion) will adopt exactly the same criterion: it’ll be multicasting as well, with everybody participating and everybody reading at the same time, in real time. Thus, the nature of the environment shapes the social interactions. And while this is not surprising (the same happens in RL, obviously — we don’t shout in a library or walk around naked in a church), it defines a unique quality of Second Life. While many might find the way social interaction occurs in SL a bit strange at the beginning, they get used to it very quickly. No wonder: all the content in the world is generated the same way, so why should social interactions be different?

This obviously isn’t restricted to discussion events — which are not very popular anyway. In fact, you can see the same behaviour at all social events in SL. When you go to a club, unlike in real life, the focus is mostly on “talking to other people” — flirting, telling jokes, communicating — while, of course, listening to music (and commenting on it). RL clubs obviously also offer an environment, but mostly music and a place for physical self-expression (dancing). What they don’t offer is a place for communication — the music is too loud for it. A jazz club might allow small groups of people at a table to whisper among themselves, but communication will be mostly limited to single tables and small groups. There are no environments in RL where, say, a hundred people can be in the same place, listening to loud music, and talk among themselves at the same time. RL does not allow that; but SL does! More than that: besides a public chat, where everybody can contribute and read at the same time, you can also flirt with half a dozen people, in private, at the same time, while still listening to the music and participating in the public chat.

(The old Thinker’s group once held a philosophical discussion in one of the then-popular clubs, while dancing and listening to the music. It works! It’s possible in SL. Now try to do the same in RL…)

We don’t find this kind of behaviour “strange” in SL. In fact, it follows from the whole multicasting environment: when going to a club in SL, you expect to be an active participant in the social happening. And this is done simply by contributing to the ongoing, real-time chat. A good event host will not be “simply a DJ”; rather, they will tell jokes in public chat, take requests for the next song, flirt with the new arrivals, encourage them to tip the dancers, and talk about trivial things — all at the same time! (“Why, Jeanne, you have the most lovely dress tonight — everybody, give it up for Jeanne!” [the audience cheers, types some witty comments, or howls]).

Many similar examples exist. On “Who Wants to Be a Lindennaire?”, a popular contest named after a similar TV show, the participants can talk among themselves all the time. Trivia contests are no different. How would any of those events be possible in RL, with the contestants talking all the time? In SL, however, those events attract a social audience who, besides winning some Linden dollars, expect to be able to chat.

Even AOL Pointe, AOL’s virtual presence, uses social interaction in SL as a selling point: in SL, people can, as in RL, watch videos “together”. The difference is that you can talk about the video while you’re watching it together. That’s a kind of experience that you simply don’t have in RL; and it makes sense to explore SL as a new medium to provide it.

Again, most people who are familiar with all the above (“it’s just the way SL works, there is nothing special about it”) simply overlook the following issue: everything is going to change in June. Dramatically. For ever. And there is no going back 🙂

The Old Arguments Against Voice in SL

Naysayers about the introduction of voice in SL usually raise one of three arguments. The easiest, and best understood, is the issue of anonymity. I will not consider this aspect, since it has been discussed ad nauseam for ever and ever, and at some point sheer pressure of numbers will always finish off that argument: as SL grows exponentially, if 1-2 million people suddenly choose to leave SL because of their unwillingness to disclose RL data (and voice is a “signature” which you can’t easily escape from), they will not make a difference. Sure, if they left SL now, SL would lose one third of its users. But in June it’ll be only one sixth. In December, less than one tenth. Next year, it won’t even register on the statistics. By 2010, it will be a side note on the SL History Wiki: “Before June 2007, some people did not like to reveal their real life identities, and they sadly had to leave when voice was introduced; however, they were a tiny minority that never really affected the growth of Second Life”. Exponential growth kills the voice of minorities (pun intended!) and is ruthless: a million today is nothing tomorrow.

So, overall, SL will not “disappear” and the sky is not going to fall if a tiny minority of a couple million users leave SL overnight. A few, of course, will survive in their ghettos and still have some fun; they will be freaks, outcasts, but still enjoy SL, of course, in their tiny text-based communities. They will be ignored and disregarded, and nobody will have any interest left in them; in the meantime, the mainstream will go on, enjoying renewed growth from all the users that never came to SL because it didn’t support in-world voice chat. Overall, SL will be a “bigger” place — and, since revealing your RL identity will be a requirement for joining SL from day one, the hordes of dozens of millions coming in after June 2007 will never understand why “anonymity” was such an issue in the past. It will be something to talk about at an SL history trivia event, but nothing more. SL, after June 2007, will be the community of augmentists that have, once and for all, thrown out those pesky immersionists — and when the augmentists are the majority (if not in June, then certainly in December), the issue of anonymity will be buried in the past for good. Linden Lab can next try to introduce that cool feature of mapping your RL face onto your avatar, and bring less anonymity and more “real data” into SL. That will come some day.

Exponential growth is the mother of all arguments: “There will always be more happy users than unhappy ones”. For every unhappy one that leaves by June, three new happy ones will come. So, while the discussion around anonymity will continue to foster flame wars on blogs, forums, and even at in-world events, it is a “lost cause”. The New Second Life has no more space for anonymity, just as the current SL had no space for gaming the ratings, or telehubs, or taxes on prims. The past is the past, and while one can always speculate how things could have been, we should aim for the future.

The second argument is a bit more complex, and it relies on published research showing that the co-habitation of voice-enabled and voice-impaired users on a virtual world platform is never easy. All those published studies reach about the same conclusion: it will never work 🙂 During the first weeks, we’ll see people gravitating towards their own ghettos, the voice-enabled and the voice-impaired limiting themselves to friends with similar attitudes. You’ll lose all your friends that moved to the “other field”, but at the beginning, this will be compensated by new friends with a similar alignment. But remember that exponential growth will make things change rather quickly; all new users will be voice-enabled. At some point, event hosts will simply give up holding voice-impaired events, since the majority will be voice-enabled anyway; shops will have shop attendants with no patience for text-chatty customers; Mentors will never bother to answer questions in text; and so on. There are no examples of a single world with voice-enabled and voice-impaired users at the same time, both enjoying the world in the same way. Under exponential growth, the world will align itself with the majority, and after June, the majority will be voice-enabled. There is no “room” left for the voice-impaired; they will either leave for other platforms, or content themselves with living in their ghettos, uncared for.

The third argument is called “Bartle’s Argument”, based on an old essay he wrote in 2003, which I addressed last year. His argument is very simple: in a world where you can tweak everything, being unable to tweak your own voice in the same way doesn’t make any sense — it’s a rupture of the “suspension of disbelief” which is at the core of all literary and movie production in the world. The argument is stretched a bit thin, since it’s not clear that Second Life, as a social platform, attempts to convey any suspension of disbelief to anyone beyond the role-players. But part of the argument still holds: why can’t I have a set of sliders to change my voice, as I can for my avatar (or, really, for any other aspect of SL)?

In 2003, when Bartle wrote his essay, that technology was not very advanced, so he recommended caution and patience. In 2007, however, this technology, even for amateurs, is becoming quite mature. The very friendly Screaming Bee corporation, for instance, has a delightful voice morphing program called MorphVOX Pro, which I recommend (if you’re a Windows user; I’ve been patiently trying to encourage Screaming Bee to release a Mac version of it). But the best part of their technology is that they have a server-based version of their software. Philip even talked to them about this, with some gentle nudging by yours truly; however, at the end of the day, Linden Lab thought the integration would take too much time. They would go with un-morphed voice first, and perhaps by the end of 2008, they’d use some sort of morphing technology, if available.

Eighteen months of waiting is too much. By the end of 2008, SL will have 35-50 million users. Among these, only about a million will have been patiently waiting for voice morphing and brandishing Bartle’s Argument. Will it really, really be worth it? I seriously suspect not; people will be far better off buying their own voice morphing tool (assuming they have Windows, that is) and relying on it, if all they wish is to eliminate the real life “signature” from their voice (and believe me, it works rather better than most people might imagine).

But all the above arguments are totally missing the point; the issue is not anonymity. It’s about disrupting multicasting.

The End of a Paradigm — Bye bye, Multicasting

We now come to the end of Second Life as we know it. The key to understanding the change is not to focus on the above three arguments against voice, which are the kind of arguments that always have two sides, and can be argued over endlessly without ever convincing the other side. You can read pro and con sites and blogs dedicating huge amounts of argumentation to each viewpoint, and it’s quite clear that the two factions are not talking to each other — they’re just praising their own faction and ignoring the other. Ultimately, as we have seen, the voice-impaired community will always lose, which will be the winning argument for the voice-enabled one. One can build the same kind of argument for immersionism vs. augmentism; the augmentists, growing exponentially in number, will also always “win” every argument by sheer pressure of numbers. In any case, it’s not the quality of the reasoning that will win arguments — in these cases, as in so many others, the majority will always win even with weak arguments, and it’s pointless to debate things like that. The battle is not fought on equal footing. At the very least, the old argument that “Second Life needs voice, because it’s the only virtual world that doesn’t support it natively” will win votes at Linden Lab: they know they have to keep abreast of their competition and can’t lag much behind (another intended pun 😉 ). So, in the end, even Linden Lab will need to respond to market pressures. Voice will be a reality in June; there is no turning back.

However, let’s see what we’re really going to lose in the process. Not users; the new users will come to a world that “always” had voice, and will never think twice about it. Not anonymity; the new users, again, will not care about that; they’ll have other uses for voice, and will rather insist that people not “hide behind their avatars”. After all, people posting “anonymous” pictures on MySpace are a tiny minority, aren’t they? They do exist, but they’re certainly not a significant percentage of the 150 million users of MySpace — instead, as in SL, most users are very willing to give out as much RL data as possible.

The difference between the Old Second Life and the New Second Life is much more drastic. Again, this does not mean that SL will “disappear” or that people will “leave it in despair” or that “Linden Lab is doomed”. Quite the contrary — it will continue to grow in spite of everything, but it’ll grow into something different. The abolition of the telehubs and the introduction of point-to-point teleport drastically changed the landscape on the mainland almost “overnight”, and SL did not “disappear”, but kept on growing; in the same way, voice will transform SL so much as to make it a completely different product, but it won’t make it disappear. One might even argue that it will lead to faster growth.

There is a common fallacy put forward by the voice-enabled group. They argue that communication is difficult when typing, because even a fast typist types 3-4 times slower than they can talk. The pro-voice group is notoriously lazy, and thinks that “talking” is a much more efficient method of conveying information and communicating.

Actually, this is a very old fallacy. It’s true that we type 3-4 times slower; however, we read ten times faster than we listen. This means that in the same period of time that we spend focusing on a single conversation by listening to it, we could follow ten simultaneous conversations in chat — obviously, writing less in each one, but receiving ten times as much information.
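To put rough numbers on that trade-off, here is a small Python sketch. It takes the ratios above at face value (typing about 4 times slower than speech, reading 10 times faster than listening); the 150 words-per-minute speaking rate is an assumed round figure, not a measured one, and the model deliberately ignores everything except raw throughput.

```python
# Back-of-the-envelope model of the voice vs. text-chat trade-off.
# Only the two ratios come from the text above; SPEAKING_WPM is an
# assumed round number used purely for illustration.

SPEAKING_WPM = 150.0    # assumption: typical conversational speech rate
TYPING_SLOWDOWN = 4.0   # "we type 3-4 times slower" (upper end)
READING_SPEEDUP = 10.0  # "we read ten times faster than we listen"


def voice_session(minutes: float) -> dict:
    """One spoken conversation: listening can only keep pace with speech."""
    return {
        "conversations": 1,
        "words_received": SPEAKING_WPM * minutes,
    }


def text_session(minutes: float, conversations: int) -> dict:
    """Parallel text chats: reading is fast, typing is split among chats.

    Assumes there is always enough incoming text to keep the reader busy.
    """
    reading_wpm = SPEAKING_WPM * READING_SPEEDUP
    typing_wpm = SPEAKING_WPM / TYPING_SLOWDOWN
    return {
        "conversations": conversations,
        "words_received": reading_wpm * minutes,
        "words_typed_per_conversation": typing_wpm * minutes / conversations,
    }


if __name__ == "__main__":
    voice = voice_session(10)
    text = text_session(10, conversations=10)
    print(voice)
    print(text)
    print("received ratio:", text["words_received"] / voice["words_received"])
```

Under those assumptions, ten minutes spread over ten text chats lets you read 15,000 words against the 1,500 you would hear in a single voice conversation, while typing only about 37 words into each chat: less writing per conversation, ten times as much information received.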

There is also another issue, which simply has to do with the way our brains are wired; although some people can definitely train their abilities, not everyone can (or is willing to): listening is a “high-bandwidth” activity that requires much more processing power from our brains than reading. Writing and reading, by contrast, are “low-bandwidth”. More than that, they are also time-independent. Why is this important? Well, you can easily pick up something you were reading and resume from where you left off. We do that all the time when reading letters or email, getting interrupted by a phone call or a colleague at work (notice: voice communications!), and then getting back to the exact spot where we were.

Why is that so? Well, neurologists and cognitive scientists think that “reading” (unlike “hearing”) has evolved from the uncanny pattern-matching abilities that are built into the vision processing modules in our brains. They’re highly parallelised (our brain is very slow, but due to the huge interconnection of its neurons, it can parallelise information processing at an astonishing rate), and even the optic nerve “helps” the brain by pre-processing information. So your eye is not really sending high-definition frames, 25 per second, to your brain for processing; rather, our eyes and optic nerve “filter” unnecessary data and just send the relevant bits to the several areas of the brain that deal with the final processing. We are very good at doing this; that’s why, compared with other animals, we have exceptionally good vision, but very limited hearing (as well as smell and taste, of course).

Hearing, by contrast, is a complex process, which is less evolved than vision, although it uses similar mechanisms. However, it uses a lot of our brain’s processing power. Most human beings cannot follow more than one audio conversation at the same time: in real life, the need for that never arose.

So think a bit about how our “voice communications” work in real life. In business meetings, we politely wait for the speaker to stop, and then add our comments. Discussions are moderated; there is a panel of speakers, and the moderator will circulate questions among them, and then with the audience: one at a time. We don’t have “perfect recall” of audio; moderators, if they get two questions at the same time, will serialise them to the panel, sometimes asking the person to rephrase the second question (it’s easier to remember a face — visuals again! — than to remember what people said). But even ten years after having read The Lord of the Rings, almost everybody will remember the names of the major characters and the overall plot; quick question — what was the name of the little girl in Edward Scissorhands? (But you’ll definitely remember that Edward was played by Johnny Depp, obviously — due to his visual appeal on screen 😉 )

A study from the late 1980s or so also showed that people listening to a 15-minute seminar or lesson will, at most, remember 5% of what has been said. This is the reason why teachers and speakers at conferences and seminars, these days, also use slideshows and hand out notes; or why teachers use blackboards and books to complement their lessons. Have you ever noticed that during a 15-minute podcast you could read ten podcast transcripts, while answering your email and replying to your friends on IM chat?

Why do all TV networks put a ticker with the headlines during the news? Well, in a minute, everyone will get the relevant headlines; if they had to listen to the whole news, they’d need an hour to absorb all the major news. Voice is slow, not fast, when you’re at the receiving end. Written text is way faster.

None of this should surprise us when we see how our real world works. We have painfully adapted to our limitations in dealing with voice; it works great for one-to-one communication without interruptions. But try to speak on the phone with someone while giving a lecture on modern art at the same time; have you ever seen a speaker do that in public?

Of course not.

Our environment, again, shapes the way we communicate. In the real world, since we’re so limited in “multitasking” when using voice, we have evolved ways to deal with that limitation. We organise our whole environment so that we only need to talk to one person at a time, to the exclusion of everything else. We pick up the phone and interrupt our meetings or leave that email unfinished (or “go afk” in Second Life while talking to our mom on the phone). When talking to an audience of dozens (conferences) or millions (TV and radio broadcasting), only one person talks, while the others only listen. We don’t allow people to “interact back” on broadcast events. People are not allowed to interrupt others in a library. And even when talking in a small group of friends, only one will be talking, and the others listening — our parents teach us that it is “impolite” to interrupt others. Why do you think that these social norms have evolved? It’s not mere stubbornness or tradition; it’s simply the way we’re “wired” to cope with the high-bandwidth demands of voice communication.

I’m pretty sure that someone will comment sooner or later that they are perfectly able to listen to music, talk to their friends, and be on the phone at the same time, and that my picture of the way the world deals with audio communication is very pessimistic. Well, there are always exceptions. I’m pretty sure that people who consider themselves to be “natural multicasters” think that others can do the same if they only try. The reality is not so simple. There is even a gender difference — women can usually multicast more, and can also change subjects much faster (and never miss a beat when having several conversations!) than men — although obviously there is a lot of overlap between genders. Also, musicians can train their hearing to listen to several instruments in an orchestra at the same time (and this really works to an extent, although one tends to focus on one instrument for short periods of time, and has to find the others again when shifting focus). So, yes, exceptional individuals can “train” themselves to listen to different conversations at the same time — but not the average person. In contrast, everybody can easily follow five separate IM conversations, and read a discussion with 20 people chatting at the same time. It only needs a slight adjustment, and we can handle that with little training.

Back to the New Second Life of late 2007. All the multicasting events will slowly fade out and die, as people experiment with running them using voice (since they’re used to running them using text-based chat) and find out, quickly enough, that “it doesn’t work”. If people start yelling in a club, you’ll tell them to hush, because you want to dance and listen to the music; if they don’t behave, you’ll mute them. But today you’re expected to text chat in clubs; that’s what makes them socially appealing! So why will people come to clubs after June 2007? Basically, to find a partner for an “intimate” private voice chat 🙂 … and you’ll have to limit your flirting to one partner at a time. Remember that you won’t have many visual cues, only “disembodied voices”, and hopefully you won’t mix up the blonde with the husky voice and the redhead with the thin, squeaky voice… or you’re in for trouble 🙂

This will mean a dramatic change overall. You’ll be hanging around with a couple of friends, talking about everything, and then a Voice IM pops up. You’ll excuse yourself because you have to handle that Voice IM, and mute everybody while you listen to it. When you “hang up” the call, you’ll apologise again and ask: could they please repeat what they were just saying? Obviously, there won’t be any transcripts to scroll back through…

You’ll also attend a conference by a major speaker; suddenly, your cat pukes on the carpet, and you need to clean up the mess, only to return to an empty conference room. Ah well. With luck, someone saved a podcast for you. Today, you can quickly scroll back through the chat history, read about what you’ve missed, and still participate in the Q&A session.

There are going to be numerous examples of dramatic changes like that. In a way, we’ll lose a special quality of Second Life: de-localisation, unstuckness from the place you are, multicasting, synchronous communication with anyone, anywhere in SL, and the notion that you don’t need to dedicate your full attention to SL, but can leave your computer, come back, and scroll back through the history. People will be angry if you don’t pick up your Voice IM; but you’ll only be able to take one at a time. After some months, people will get used to it; they’ll know they can’t expect everybody to answer all their Voice IMs simultaneously. And since each call demands total concentration, if you log in for an hour, you might be able to talk to a dozen people, 5 minutes at a time. Right now, in an hour, you’re able to communicate with a hundred people. What a difference that will make! Naturally, we’ll adapt, and we have an advantage: this will be just like real life, where you can’t talk to a hundred people in an hour anyway.

Second Life, becoming just like real life, will lose the extraordinary breakthrough of being one of the first true real-time multicasting environments. At least, historically, we will be able to tell our children how fantastically amazing this once was, for a period of three years. All this will be lost, forever; and a New Second Life will emerge, one much more similar to real life in terms of social interaction.

The “Many Countries” Paradigm

The vision behind Philip “Linden” Rosedale’s famous quote, which I still keep in my email signature — “I’m not building a game. I’m building a new country.” [interview with Wired, 2004-05-08] — received a serious blow when Linden Lab removed the telehubs. It was clear that what would follow was a jump to private islands, where you can devise your own urban planning, and that there wouldn’t be “one country”, but “several communities” inside the same virtual environment.

Up to now, these communities have (mostly) had a major advantage: it did not really matter where people came from; they would share a common goal (or lifestyle!) and aggregate around that concept. Physical location would not be an issue, since most of those communities would have people writing in a common language (mostly English, but recently more and more in other national languages as well), more or less well enough for everybody to understand. A non-native speaker would only take a bit longer to type, but would generally manage reasonably well.

Contrast that with “voice”. In the “early days” of SL, you had mostly people with a certain degree of education and open-mindedness, often gained from travelling around the world, either for business or pleasure, and they would be rather tolerant of different accents, bad grammar, and a certain amount of “uhs” and “ahs” while people figured out how to say what they wished to say.

But SL is a mainstream application these days. People will be surprised to “hear” that their friends do not talk like TV or radio anchors, or movie actors. The “best” event host they knew might have a rasping voice, an unidentifiable accent, and such strange grammar that suddenly their events have no appeal whatsoever; you won’t be able to understand half (or all) of what they’re saying. People will naturally gravitate towards those who have clear voices, speak flawless English (or your own language), have some training as a speaker (e.g. a journalist, DJ, or actor), and have good audio equipment delivering high-quality sound in SL. This will naturally leave out many people.

Or will they? Perhaps not. They will have their own small communities, where their own voice will be “understood” and accepted without problems. In a sense, voice communication in an international medium will push people towards the communities whose speech patterns are closest to their own; this is only natural, since, in real life, your circle of friends and acquaintances will very likely come from a geographically limited area (since geography plays a role in real life). Once again, Second Life will resemble more and more the “limitations” that we encounter in real life as well: to be understood, you’ll need to find the kind of people you can understand.

How much of a difficulty this is depends on the number of users in SL. Obviously enough, when you have a billion users, this doesn’t make a big difference: almost every tiny city will have a large part of its population in Second Life. But they will be the same people you see every day! With a hundred million users, you’ll still be able to find a reasonably large number of friends from your home town to talk to, but, again, it would seem pointless, since you see them every day as well. With just ten million users, it’s very likely that you’ll only find “strangers”, perhaps many from your own city, if it’s large enough.

So will Second Life be “fragmented” again, but this time along physical boundaries? It’s the most likely outcome, since the same happens, in fact, with almost all “social” sites when they grow. There is obviously a difference in “education”, which is not apparent these days, but which will very soon become visible: people with an open mindset will move across all boundaries and fragmented communities, but they will be a tiny minority, just like in RL. Contrast that with what happens today: even someone who hardly speaks any English, and writes little more than “hi” and “my name is John”, is perfectly able to drop into an art lecture or a popular club somewhere and have some fun. They might even be bold enough to join the conversation; nobody knows if they’re using a dictionary or something similar to pick up some words. In fact, the best example is the many “automatic translators” used by Mentors to try to help non-native English speakers. While these translators are very, very weak, they enable rudimentary communication: a word here, a word there, and something gets conveyed, at least enough to point people to a place where they can find out more (in their own language).

But there are no real-time voice translators. If you don’t understand what people are talking about, because you simply aren’t fluent enough, you’re out of luck. Your only hope is to stumble upon a community that understands you and can help you out; and remember that SL does, indeed, have a steep learning curve, compared to, say, YouTube or even MySpace.

Also, even if you are able to multitask with voice, be prepared: as soon as you meet someone with a bad audio setup, a very hard accent, or not enough fluency in the language, you’ll need your full attention just to communicate the bare minimum, to the exclusion of everything else. Very few are that patient. In fact, the biggest argument for voice is that it doesn’t require you to painfully type everything, and that communication is thus much faster. That’s true if all you want is to talk to your best friend, lover, or latest sexual conquest (assuming they are very fluent as well); but you’re forfeiting the ability to communicate with, say, perhaps 80% of all users in Second Life, for the privilege of talking to a handful of people, which you could do anyway on Skype, if you really wanted.

Is it really worth it?

Conclusions

The voice-enabled SL will be a Tower of Babel of people not understanding each other, and speaking to each of them will be a nightmare. Ultimately, we’ll forfeit the benefits of a multicasting environment and revert to what Humankind has had for millennia: one-to-one conversation (one at a time) and broadcasting (one speaker, many listeners). Second Life, already fragmented once under the “many countries” paradigm, will fragment further, creating tiny communities of people that are able to understand each other, to the exclusion of everybody else.

There will be new opportunities, of course. Conferences, seminars, classes, even business meetings, will benefit a lot from voice chat, especially if you don’t need to take notes. And there will be a new job opportunity for actors or journalists with clear, pleasing voices: as in RL, they will be in high demand for special events, and as in RL, few people have those qualities, so they will be able to charge a lot for their services.

The world is not “coming to an end”. Second Life, resembling real life more closely, might even be easier for the new users arriving after June. In real life, most of us are not used to taking 12 phone calls at the same time and writing emails while talking to all these people. We’re used to one-to-one conversations with a restricted number of friends. Newcomers will never understand the appeal of a multicasting environment; they will expect, instead, good broadcasting events, like they have when they turn on the TV or the radio, or go out for a movie. New forms of social interaction will develop; people will excuse themselves to “pick up a voice IM” while in meetings or at a club. When talking to others, you’ll exclude everybody else in the world, and people will understand that (and hopefully respect it!).

One might also speculate on what this means for businesses using Second Life. You can’t simply have an open-space office with everybody yelling at the screen; in fact, in the future, office environments actively using SL with voice will resemble call centres. On the other hand, at home, it’ll be hard to reconcile your family duties, your neighbours (who certainly don’t want to listen to you flirting with your latest friend at 2 AM), and the need for a relatively noise-free environment. Voice has been popular in games for teenagers because they face less social pressure; you expect teenagers to be rowdy and loud while they’re “playing”. They are used to playing games with voice chat, as Hiro Pendragon reports on his blog. But all this requires a large degree of adaptation from the old users of Second Life.

The new ones will be the lucky ones; they will never have experienced a “different” Second Life.