Ask Leo: Odd Characters Instead of Quotes


By Leo Notenboom

I have noticed for years that certain emails and documents
have strange characters where punctuation and other
characters should be. An example is this word:
yesterday’s, where the characters ’ should
clearly be an apostrophe. Why is this happening and what
can I do to eliminate this occurring? I suspect that it
happens more often when the originating computer system is
a Mac.

It’s all about character encoding.

And that simple sentence represents a bit of complexity.

Let me cover a few concepts, and throw out a few tips on how it can sometimes be avoided.

As I’ve discussed before, typically in the context of email,
there are several ways to “encode” the characters – the letters and numbers and
symbols – you see on the screen.

The fundamental concept is that all characters are actually stored as numbers.
The uppercase letter “A”, for example, is the number 65. “B” is 66, and so on.
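If you want to see those numbers for yourself, here is a minimal sketch in Python (my choice of language for illustration, not something the examples here depend on). The built-in ord() returns the number behind a character and chr() goes the other way.

```python
# Look up the number behind each character, and go back the other way.
code_a = ord("A")   # 65
code_b = ord("B")   # 66
letter = chr(65)    # "A"

for ch in "AB":
    print(ch, "is stored as", ord(ch))
```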


The “ASCII” character set or encoding is stored in a single byte – values from 0 to 255 –
which could represent up to 256 different characters. (Technically ASCII only uses
7 bits of that byte, or values from 0-127. A very common true 8-bit encoding used
on the internet is “ISO-8859-1”.)
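A rough way to see the 7-bit limit in action is to ask Python to encode a character that ASCII has no slot for. The helper name try_encode below is my own, purely for illustration.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the encoding cannot represent the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

ascii_a = try_encode("A", "ascii")              # b"A": fits within 0-127
ascii_e = try_encode("\u00e9", "ascii")         # None: é is outside ASCII's range
latin1_e = try_encode("\u00e9", "iso-8859-1")   # b"\xe9": one byte, value 233

print(ascii_a, ascii_e, latin1_e)
```

The accented letter only fits once the eighth bit of the byte is put to use, which is exactly what ISO-8859-1 does.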

The problem, of course, is that there are way more than 256 possible characters.
While we might spend most of our time with common characters like A-Z, a-z, 0-9 and
a handful of punctuation, in reality there are thousands of other possible characters – particularly
if you think globally.

At the other end of the spectrum is “Unicode”, which in its simplest stored form uses two (or more) bytes
per character, giving many more possible different characters. “A” is still 65, but if we look at it
in hexadecimal the single-byte ASCII “A” is 41, while the two-byte Unicode “A” is 0041.
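Here is a small sketch of that same “A” in hexadecimal. I am assuming that “two-byte Unicode” corresponds to what Python calls "utf-16-be" (two bytes per character for common characters, most significant byte first); other two-byte layouts put the bytes in the opposite order.

```python
# The same letter, shown as hexadecimal bytes in a one-byte and a two-byte encoding.
one_byte_hex = "A".encode("ascii").hex()        # "41"
two_byte_hex = "A".encode("utf-16-be").hex()    # "0041"

print("ASCII:   ", one_byte_hex)
print("Two-byte:", two_byte_hex)
```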

At this point, it should be clear that switching from ASCII to two-byte Unicode would immediately
double the size of every email, every document, and everything else that stores text. Possible, and
in some cases even the right solution, but when you consider that the majority of communications,
particularly in the western world, focus on the basic roman alphabet and a few numbers and
punctuation, it starts to seem wasteful.
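To put a number on that doubling, here is a quick sketch that measures the same short message both ways. The sample text is made up, and "utf-16-be" again stands in for a two-bytes-per-character encoding.

```python
message = "Hello, world"  # hypothetical sample text, 12 characters

one_byte_size = len(message.encode("ascii"))       # one byte per character
two_byte_size = len(message.encode("utf-16-be"))   # two bytes per character

print(one_byte_size, two_byte_size)  # prints: 12 24
```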

Enter “UTF-8”, for “8 bit Unicode Transformation Format”.

In UTF-8 the entire Unicode character set is broken down by an algorithm into byte sequences that are either 1, 2, 3 or 4 bytes long.
The reason is simple: the vast majority of characters in common usage in Western languages fall into the 1 byte range. Messages
remain smaller, but should one of those “other” characters be needed it can be incorporated by using its “longer” representation.
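The 1, 2, 3 or 4 byte lengths are easy to check. This sketch encodes four sample characters of my own choosing (a plain letter, an accented letter, the euro sign and an emoji) and counts the bytes each one takes in UTF-8.

```python
# One sample character for each UTF-8 length.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]   # A, é, €, 😀

utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, size in utf8_lengths.items():
    print(repr(ch), "takes", size, "byte(s):", ch.encode("utf-8").hex())
```

Plain English text stays at one byte per character, so it is byte-for-byte identical to ASCII, and only the less common characters pay for the longer forms.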

All that is a lot of back story to the problem.

Mis-Interpretation

When you see funny characters, it’s most likely because data encoded using UTF-8 is being interpreted as ISO-8859-1.

Let’s use an example: that apostrophe.

First, let’s be clear as mud: there are apostrophes, and apostrophes. In reality the characters we often
refer to as apostrophes could be:

  • the apostrophe: (')

  • the acute accent: (´)

  • the grave accent: (`)

  • the right single quote: (’)

  • the left single quote: (‘)

Those might look similar, different, or not appear at all depending on the fonts and character sets
available on your computer. I told you this was complex. :-)

Each, of course, has a different encoding. Let’s take the right single quote (for reasons I’ll explain below):

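  • ASCII: doesn’t exist

  • ISO-8859-1: strictly speaking, doesn’t exist either (0xB4 is the acute accent); Windows-1252, the superset that web browsers and much other software use when told “ISO-8859-1”, puts it at 0x92 in hexadecimal

  • Unicode: 0x2019 in hexadecimal

  • UTF-8: 0xE28099

I don’t expect you to care about the actual numbers there, but simply notice how dramatically different they are.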
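If you would rather not take the numbers on faith, Python's built-in codecs will report them. In this sketch the helper encode_hex is my own invention, and "cp1252" is Python's name for Windows-1252; a result of None means the encoding has no way to represent the character at all.

```python
RIGHT_SINGLE_QUOTE = "\u2019"   # the curly apostrophe, ’

def encode_hex(text, encoding):
    """Return the encoded bytes as uppercase hex, or None if not representable."""
    try:
        return text.encode(encoding).hex().upper()
    except UnicodeEncodeError:
        return None

code_point = format(ord(RIGHT_SINGLE_QUOTE), "04X")   # the Unicode number itself

results = {
    name: encode_hex(RIGHT_SINGLE_QUOTE, name)
    for name in ("ascii", "iso-8859-1", "cp1252", "utf-8")
}

print("Unicode code point:", code_point)
for name, value in results.items():
    print(name, "->", value)
```

Python reports the code point as 2019, nothing for ASCII or strict ISO-8859-1, 92 for Windows-1252, and E28099 for UTF-8.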

Now, what happens when the UTF-8 series of numbers is interpreted as if it were ISO-8859-1?

’

Look familiar?

0xE28099 breaks down as 0xE2 (â), 0x80 (€) and 0x99 (™). What was one character in UTF-8 (’) gets
mistakenly displayed as three (’) when misinterpreted as ISO-8859-1 (or, to be precise, its Windows-1252 superset, which is where € and ™ live).
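The whole mix-up can be reproduced in a few lines. This is only a sketch: it assumes the misreading software applies Windows-1252 ("cp1252" in Python), and it uses the word from the question as sample text.

```python
original = "yesterday\u2019s"            # yesterday’s, with a curly apostrophe

raw_bytes = original.encode("utf-8")     # what the sender actually transmitted
garbled = raw_bytes.decode("cp1252")     # what a mistaken reader displays

# Undo the damage by reversing the mistake: back to bytes, then decode correctly.
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)    # the three-character version
print(repaired)   # the original word again
```

Reversing the mistake like this only works when the garbled text still carries every original byte. Once some program along the way has replaced or dropped characters, the original may not be recoverable.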

The Culprits

There are typically two.

[This post is excerpted with Leo’s permission from his Ask Leo blog.]

Leo Notenboom has been involved in the tech industry for nearly 30 years. After retiring from an 18-year career as a Microsoft Software Engineer, Leo went on to create Ask Leo!, a free web site where he answers real questions from ordinary computer users.

Facebook URL: Leo’s Facebook

Twitter URL: http://twitter.com/askleo
