[**MUTED**]

Every so often, people bring up this subject again: when will voice be available in Second Life, and how will it split the community into two halves, those who will use it and those who won’t or can’t?

The discussion is old, and I have engaged in it twice on the same day; so I had better put the basis of my own thoughts in a single place and refer people to this article. I’m getting lazier with age and lack of sleep…

Enjoy 🙂

As to the lovely gadgets and toys for ‘morphing’ voice (not really important for the educator community here, but since they’re being mentioned…), as well as text-to-speech and speech-to-text, well, they’re gadgets really. Comparing them to what we actually want is like comparing the isometric games of the 1980s, on 320×240 screens with 16 colours, to SL. Sure, with enough imagination, you got an ‘illusion’ of 3D, but it fell far short of what we can do today with avatars made of tens of thousands of polygons, rendered at 40–50 fps on a fast machine in SL. Voice morphing software has improved dramatically in the past few years, but it’s a “toy”: you can mask your voice, but it won’t get rid of your accent. For the Windows fans out there, I’d recommend taking a look at what companies like http://www.audio4fun.com/ are doing. Although I use a Mac exclusively, and they only do Windows versions (yes, I’ve actively tried to convince them to do Macintosh ports), I bought one of their tools once to see the results, and I am a somewhat lazy beta tester of their latest technologies. I managed to have some fun sounding a bit like Cindy Crawford or Humphrey Bogart with a toe-curlingly bad accent; there is lots of fun to be had that way, and the delay due to voice processing & morphing (a second or so) is acceptable for game role-playing. Still, this technology needs some years of development; it’s simply at the “cute toy” level. People will *always* know your voice is being masked; they might only get a bit confused as to your identity. So you’re able to successfully mask that, but a Russian woman of 50 won’t “pass” for a Valley Girl of 17 using this tool … or for an Orc, for that matter.

I tried out several different tools at one point (almost 2 years ago!), some of them the result of academic research, and a few of them free. There is not really much choice (and I found none for the Mac or Linux), and the results are not very impressive. We’ll have to wait until this technology develops further (after 2 years of development, for example, the differences are not noticeable).

Text-to-speech looks (sounds?) much more promising. When trying out demos of things like Rhetorical (now acquired by Nuance, http://www.nuance.com/, and called “RealSpeak”) and Oddcast (http://www.oddcast.com/home/), I was much more impressed. RealSpeak is far better in terms of quality (from the demos at least), but it is probably incredibly costly. Oddcast is still too expensive for the average user, but at least it’s targeted at the mid-to-low end of the market.

What this means is that, at least on that front, you’ll be able to use one of those technologies to “give voice” to your Orc avatar with ease. Of course, you’ll still be slightly handicapped, as people using their natural voice will be able to talk 4 times as quickly as you can type. I believe that with a good combination of shortcuts (we’d need a better “gesture” system though) you’d almost be able to keep up a “normal-speed” conversation using this. Definitely a “middle-way” solution. I suggested to Philip a long time ago that they look into Rhetorical (now RealSpeak) and integrate it into SL whenever they wished to introduce voice chat in Second Life; “corporate pricing” would eventually allow LL to deploy something like that very cost-effectively, and in turn LL would be able to charge, say, an extra dollar or two per month to people wishing to use TTS. Both technologies also allow you to get your own “personalised voice” for a fee, so that would very likely be feasible (mind you, since this would be embedded into the SL client, it would be as low-bandwidth as regular text chat…).

Voice recognition software is another beast entirely. I tried IBM’s ViaVoice and Philips’ own dictation system. Although both are also “mid-level”, they’re impossible to use in an informal setting, where people are talking all the time in a very busy chatroom, with dozens of different dialects. These are tools designed to work in “limited environments”, and the ones giving the best results need to be trained for a specific user. This is more than adequate for someone who has problems typing due to some disability; but it won’t work for “capturing” a busy chatroom and converting it to typed text!

So, except for using TTS (the only promising technology in this bunch), there is no way in 2006 to fulfil the following requirements:

  • masking your voice and personalising it to fit your avatar
  • making sure people who can’t type are able to use voice software instead
  • no exclusion of hearing/speech-impaired people
  • dealing with dialects and accents
  • low-bandwidth (“low” in the sense that 30 people chatting don’t need 1 Mbps just for that!)
  • the ability to keep written transcripts of what has been said

What might work is keeping all “chat” communication text-only (low bandwidth), while each end has both text-to-speech and speech-to-text (trained to the user’s voice). This, I think, will be the way to go; it fulfils all the above requirements, and the technology is available in 2006: it’s just very expensive for the average user.
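
To make that idea a bit more concrete, here is a rough Python sketch of how such a setup could hang together. Everything in it is made up for illustration: `TextOnlyChannel`, `speak`, `type_text` and `listen` are hypothetical names, and `fake_speech_to_text` / `fake_text_to_speech` are placeholders standing in for real engines, not anyone's actual API. The only point is that text is the one thing that travels between users, with speech conversion happening at each end.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ChatMessage:
    speaker: str
    text: str

    def wire_bytes(self) -> int:
        # Only UTF-8 text would cross the network, never audio.
        return len(f"{self.speaker}: {self.text}".encode("utf-8"))


@dataclass
class TextOnlyChannel:
    """Hypothetical text-only chat channel; its log doubles as the transcript."""
    transcript: List[ChatMessage] = field(default_factory=list)

    def send(self, message: ChatMessage) -> ChatMessage:
        self.transcript.append(message)
        return message


def fake_speech_to_text(audio: bytes) -> str:
    # Placeholder: a real engine, trained to the user's voice, would go here.
    return audio.decode("utf-8")


def fake_text_to_speech(text: str, voice: str) -> bytes:
    # Placeholder: a real engine would synthesise audio in the chosen voice.
    return f"[{voice}] {text}".encode("utf-8")


def speak(channel: TextOnlyChannel, speaker: str, audio: bytes,
          stt: Callable[[bytes], str] = fake_speech_to_text) -> ChatMessage:
    # Sender side: speech becomes text before it leaves the machine.
    return channel.send(ChatMessage(speaker, stt(audio)))


def type_text(channel: TextOnlyChannel, speaker: str, text: str) -> ChatMessage:
    # People who prefer (or need) to type use the very same channel.
    return channel.send(ChatMessage(speaker, text))


def listen(message: ChatMessage, voice: str,
           tts: Callable[[str, str], bytes] = fake_text_to_speech) -> bytes:
    # Receiver side: text becomes audio, in whatever voice fits the avatar.
    return tts(message.text, voice)


if __name__ == "__main__":
    channel = TextOnlyChannel()
    heard = speak(channel, "Orc", b"hello there")
    typed = type_text(channel, "Elf", "hi")
    print(listen(heard, "orc-voice"))
    print([f"{m.speaker}: {m.text}" for m in channel.transcript])
    print(typed.wire_bytes(), "bytes on the wire")
```

In this sketch the transcript comes for free, since the channel only ever sees text, and whoever can't speak or can't hear simply skips the conversion step at their own end.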

Mind you, I do fully agree that first-person shooters, fast-paced and full of action, really do benefit a lot from voice chat. Not being interested in that kind of use of SL, I tend to minimize the importance of voice chat; I still think you can use external software for that. When it comes to general-purpose usage of voice in virtual worlds, I think I have to side with [*DELETED RL NAME HERE*] on this: we need full immersion first, and that will take a few more years. Not 50 years as I originally thought, but perhaps 5–10 years to get convincing immersion. 10 years ago, we speculated on what we would “need” for full immersion to be possible. Nowadays, we have all the tools we need:

  • Tracking body motion and expressions; we have all the key components for that, and the required hardware (some sort of laser thingy) is cheap and available; software has made huge leaps in that respect, but is still expensive.
  • Cheap webcams and microphones with reasonably high quality (we have all that already).
  • Mapping expressions to avatars’ faces (LL has developed that technology for the Bedazzle group; it was demoed on the Silver Bells and Golden Spurs video).
  • Text-to-speech and voice morphing technology (available, still not good enough for general purpose use, and still very expensive).
  • Speech-to-text technology (still unusable, except as described above).
  • Goggles with gyroscopes (they exist, they’re cheaper these days than we think, but they’re not fully supported by SL, due to the way their OpenGL implementation works — there is a thread on that on the forums as well).
  • Bodysuits (full or partial) for tactile impressions. They’re still expensive, but the technology is quite well-developed, just not easily adaptable. There were rumours and urban legends that Linden Lab itself had started as a company developing gloves for virtual worlds, and that SL was a way to demo their technology. According to the legend, Philip saw that the future of LL was in the software and not in the hardware, so he dropped that path (imagine if he hadn’t!).
  • Low broadband costs. We have that.
  • Powerful computers. On average, a 2006-bought computer is powerful enough to drive all the above applications and devices.
  • Second Life. Ok, that we have, and the price is right (free ?)

So, compared to the mid-1990s, when all these things were “prototypes” or “things-we-need”, in the mid-2000s we have them all; it’s just that most of them are still too expensive, for now. Thus the “prediction” that it’ll only take 5–10 years to have, say, full bodysuits + goggles + TTS/STT software for US$10 a month.

Again, this is wildly speculative, but I look forward to it; it’s not a “dream” right now, since it’s available to you in 2006 if you have enough money.

Anyway, do not take my word for it. Read Richard Bartle’s old article on the subject; although everyone knows that Bartle is biased towards text-based communications, he certainly raises the correct points — and I must agree with him to a degree. What we should wish for is full immersion, not a crippled “voice chat”…