Revisiting the Colourising Script

In late 2004, I’ve joined a few groups that did a lot of discussion events. Often people could not attend them, so transcripts were kept and posted on the Linden Lab forums.

Reading a transcript ‘after-the-fact’ is not the same thing as participating in a discussion in real time. For me, the major difficulty is keeping track on who said what. After a while it’s hard to remember who is actually speaking.

One of those groups I joined in 2004 was starting an experiment in running a community in Second Life® democratically. I’ve extensively written over the subject, so I’ll just mention that any democratic community needs to use Transcription Services for all ‘official’ meetings and discussions. So one of the members had created a fantastic little tool that you could feed a transcript to, and it would automatically assign different colours to each speaker, which would make things far easier to read. They got a server using a HostGator Cyber Monday Deal 2016 coupon to host it – and it’s been running ever since. Way cool! This was done as a simple form on a website which was really easy to use: just copy & paste the chat log, and you’d get a page with everything properly formatted in colour.

This person left 2006, and her site disappeared at the same time. Because the tool was so useful, and we kept using it every day for all our meetings, I decided to do something similar — the page for Gwyn’s Colourising Script was born. At that time I was also toying with novel ideas to attract more readers to my blog, so this seemed like a cool idea to get people visiting over and over again.

My own script had lots of small improvements over the original. At that time, our community would publish logs on Linden Lab’s forums, which used at the time phpBB, and so it formatted the output using BBCode. But sometimes people like me that prefers Human transcription vs speech recognition want to post the transcripts on our own blogs, so HTML formatting was also useful. Our community also started to use TikiWiki for some of the documentation (including transcripts), so I did the formatting for that as well. In fact, the formatting is applied separately, and, in theory, I could easily extend that to include Markdown or other popular formatting syntaxes relatively easily…

It already included a few twists like handling ‘online’ and ‘offline’ messages, which would be deleted by the script, and distinguished between object chat and avatar chat, eliminating the former.

Then the SL Viewer went to several successive changes in the way it produces logs. First, copying logs from in-world produced different results than copying them from the logs saved automatically to disk, so I had to deal with that as well. The SL 2.0 Viewer placed emotes (anything started with /me) in italics, so I wanted to replicate that functionality as well. Here the difficulty was to catch the end of the avatar name. For simplicity’s sake, because all avatar names were of the form FirstName LastName, I did a simplification: separate the whole line into words. Take the second word: if it finishes with a colon, then it was a ‘normal’ chat. If not, then it was an emote, and would be rendered in italics instead. This would also work with things like Gwyneth Llewelyn whispers: I’m so stupid! which would also be rendered in italics, like the SL viewer does, because the colon appears on the third word, not on the second.

Sometimes, however, the chat log would actually have /me explicitly on the log, which I would adjust accordingly. Similarly, I wanted an easy way to replace You: by my own avatar name (instead of doing a search & replace), so that option was added as well.

Anyone who has posted very long transcripts on forums knows the pain it is to split the text neatly so that the character limit is preserved. So I’ve added that feature as well; splitting is done automatically. Because I found out that each forum has different character limits, there was no other choice but to allow the user to define by how many characters the log should be split (of course, I avoid splitting in the middle of a sentence!).

But then things started to get awkward, and I gave up fiddling with the code. For a while, copying text chat with /me would render the avatar name twice, i.e. Gwyneth LlewelynGwyneth Llewelyn nods. I don’t know if LL did that deliberately or not; the truth is that this bug was flagged and apparently corrected.

There was also an issue with multi-lines. If you type something in the text chat box which has newlines in it — like copying from a notecard, for instance — the chat log timestamps those lines, but doesn’t place the avatar name in front of it, to make it consistent with what is seen on screen in real-time. But that means, for a log parser, that you have no clue who is saying that. I attempted to deal with that as well, but I had a flaw in my code, and so multi-line support never really worked.

Then, of course, Linden Lab completely changed the way avatars are named. You could pretty much use any name, with any amount of spaces in it — as a Display Name. On some viewers, the text chat is rendered with the Display Name followed by the (old-style) avatar name in parenthesis. And after Display Names, LL started allowing any name as the avatar name, making things pretty much impossible to parse. So I gave up. At the end of the day, it was almost always easier to fix the mistakes afterwards manually, because we humans are so much better at pattern matching than a computer!

Recently, however, I had to post another long transcript to the forums. Because the attendance just had one user with a Display Name, and one user with a ‘short’ avatar name, I thought that changing everything manually would be simple enough. Alas, it wasn’t — I spent a few hours doing so! And I can hardly waste so much time!

So I tackled my ancient colourising script from 2006 again. Clearly I needed a new approach to identify where the names are. But there seems to be no simple solution; there are always so many exceptions where things can go wrong!

My main issue was how to identify names, since they a) can now be of any size, with any amount of spaces, with or without a colon at the end, depending if the user is emoting or not, and optionally prefixed with whispers or shouts which will have a colon; b) you cannot assume that LL and/or TPVs use exactly the same format consistently across versions.

Basically, I had to work from some assumptions. The first is that most text chat will not be an emote, multi-line, or include whispers: and shouts:. In that case, the algorithm will simply split each line of text at the colon, and consider everything to the left of it as an avatar name, and place it on a table with the assigned colour. Sometimes, however, some lines will not have colons. This can be for several reasons, the more obvious one being an emote. Sometimes the colon will be at the wrong place (e.g. after whispers or shouts), so the algorithm will have to rely on partial matches instead of exact matches to figure out things. This obviously will make a lot of things wrong in extreme cases. I’ll list a few of them below, but first, let me share with you the pseudocode I came up with. Maybe expert programmers can come up with better solutions.

while there are still lines in the log
  check if the line isn't empty or just has 'is online' or 'is offline', and discard it
  replace 'You:' by your avatar name
  extract timestamp (multi-lines don't have timestamps in logs, so flag this as a multi-line if it has no timestamp)
  check if the log has '/me' (can happen with some viewers) and flag it as italic
  if there is a timestamp:
    split the rest of the line at the colon
    if there is no colon:
      assign a random color
      loop through all names on the lookup table so far
        do a partial match of each name on the lookup table, to see if it matches anything, starting at the beginning of the line
        if yes, we might have found a candidate for an avatar name, who is emoting
          assign this name and the respective colour to this line
      flag line as italic
    else (there is a colon in the line):
      extract the avatar name from the left-side part of the line
      extract the text chat for this line from the right-side part of the line
      assign an empty colour
      loop through all names on the lookup table so far
        if we find a partial match, then assign the respective colour to this line
      if we didn't find any match:
        get the list of possible colours and assign the next one to this avatar
        update the avatar lookup table with this avatar name and the colour we just assigned
    else (no timestamp):
      flag this line as multi-line

    output the formatted line:
      if the line is in italic, add italics
      if we have italics or multi-line, do not add a colon after the avatar name
      if we have multi-line, output the whole line (avatar name will be empty)
      else just output the line without the avatar name

    if user wants to split the text into blocks
      check if we have already enough characters for the current block and chop text if it's the case

This has a lots of caveats. Let me tell you about some of the most obvious ones.

If two avatars are chatting, one called ‘Maria’ and the other ‘Maria Gherardi’, then the algorithm will always think it’s the same person. Note that the order appearing in chat is important; the reverse will not be true (i.e. if ‘Maria Gherardi’ appears on chat before ‘Maria’, then the algorithm should distinguish both), because the lookup table is not sorted, but rather shows things by order of appearance.

If people emote or send multi-lines before ‘regular talk’, this means that the first line in the chat will get the wrong colour. This happens a lot to me, because usually I drop in a discussion meeting with Gwyneth Llewelyn says hi, and that means that the algorithm will think this is a multi-line, cannot find any avatar for assigning a colour, and will render the text in black.

Shouting and whispering on the first line will add a new entry on the lookup table! Consider the following excerpt:

Gwyneth Llewelyn shouts: hello there!

Gwyneth Llewelyn waves.

Gwyneth Llewelyn: how is everybody?

When the algorithm runs, it will split the first name before the colon, so it will assign a colour to this strange avatar named Gwyneth Llewelyn whispers. On the second line, we don’t have an avatar named Gwyneth Llewelyn yet, so the algorithm assumes it’s a multi-line, and will probably render the text black or with the colour previous selected. Now comes the third line: this time, a new entry for Gwyneth Llewelyn will be added. However, every time I whisper again, I will get the first colour, not the second one.

Similarly, there is now no way to distinguish between object names chatting in public, and avatars chatting in public. One possible solution for that would be to look up if that avatar exists. Sadly, this requires an external service of some sort, because LL does not offer an easy way to check if an avatar exists. I might think of that, because it would allow something neat, i.e. storing the same colour consistently for each avatar, forever.

The algorithm also gets confused with an emote or status message before the avatar has said something ‘normally’. Here is a typical example. Imagine the following at the very beginning of the chat:

Moon Adamant: hi everybody

Gwyneth Llewelyn gives Moon Adamant a hug.

The algorithm will correctly assign a colour to Moon Adamant. But on the second line, we have an emote, without an entry for Gwyneth Llewelyn. On the other hand, Moon Adamant will have been assigned a colour previously, and the algorithm will find a match on that line, so it will be rendered as italic (because it has no colon) but with the colour assigned to Moon Adamant!

Another variant of this bug comes with emotes or multi-lines with colons in the middle, specially at the start of the conversation:

Gwyneth Llewelyn thinks that she looks great today: she has a new dress and everything!

If Gwyneth Llewelyn hasn’t chatted before in a ‘normal’ way, then this will mean that a new avatar will be stored on the lookup table, named Gwyneth Llewelyn thinks that she looks great today.

I have no simple solution for all the above. Remember that avatar names can, these days, appear on chat looking like this: The CEO of Linden Lab (ebbe.linden). There is no easy way to know exactly where the avatar name stops. A more complex parser could probably check all words and word combinations for valid avatar names. To simplify, you would just need to parse the beginning characters. Currently, Display Names cannot have more than 40 characters, while avatar names are 31 characters maximum. Adding the parethensis and the dot, that would mean around 75 characters, which would appear at the beginning of each line, which one could check up somewhere to see if any of it belongs to an avatar name. The problem is that you would need to do partial matching, and that means looping over the 32 million avatar names or so — one by one — for each line! A nightmare!

I have also noticed that if a resident actually changes their Display Name in the middle of the conversation, it will get a new colour assigned — which will be confusing. Alas! I never thought that would actually happen! There is no easy way to link the new name to the old name.

So if you have any suggestions on how the parser can be improved, please let me know. Also, feel free to make many tests with the Colourising Script and check what further bugs you might have found. While I have tested out all possibilities — emotes, whisper/shout, multi-lines, etc. — I haven’t tested all combinations. As you can see, the algorithm is very sensitive towards correctly finding the first instance of someone chatting in text. And it can be easily confused with avatars having Display Names which are partially similar — the order in which they appear can throw the algorithm out of track.

It is still very useful to me, so I hope it continues to be useful for you too!

Revisiting the Colourising Script

Like this:

AUTHOR

Gwyneth Llewelyn

Like this:

Like this:

Like this:

Like this: