Character Encodings Are a PITA

Character encoding schemes (UTF-8, ASCII, ISO-8859-1/-15,
Windows-1252, etc.) are an incredible source of headaches. Stay away
from them.

(Oh, and if you tell me I mean “raw character encoding” or “codepoint
set” or some such, I’ll whack you upside the head with a thick Unicode
reference.)

In case you hadn’t noticed, I upgraded WordPress not too long ago.
Being the cautious sort, I did a dump of the back-end database before
doing so, as I’ve done every other time I upgraded. And, like every
other time, I noticed that some characters got mangled. This time
around, though, I decided to do something about it.

It turned out that when I originally set up the database, I told it to
use ISO-8859-1 as the default text encoding. But later, I told
WordPress to use UTF-8. And somewhere between dumping, restoring, and
WordPress’s upgrade of the schema, various characters got mangled. For
the most part, various ISO-8859-1 quotation marks got converted to
UTF-8, then interpreted as ISO-8859-1, and converted again. On top of
which, some commenters used broken software that insisted on posting
comments in cp1252 or cp1258 (and I even saw something that might have
been IBM-CP1133), and those characters also got converted to and from
UTF-8 and ISO-8859-1 or -15.
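
To make that concrete, here is a minimal sketch (in Perl, using the
Encode module; this is not the actual repair script) of what one round
of that double conversion does to a single character:

    #!/usr/bin/perl
    # Sketch of the mangling: correct UTF-8 bytes get misread as
    # ISO-8859-1 and then re-encoded as UTF-8.
    use strict;
    use warnings;
    use Encode qw(encode decode);

    my $char    = "\x{00E9}";                   # LATIN SMALL LETTER E WITH ACUTE
    my $utf8    = encode('UTF-8', $char);       # correct UTF-8 bytes: C3 A9
    my $misread = decode('ISO-8859-1', $utf8);  # misread as two Latin-1 characters
    my $mangled = encode('UTF-8', $misread);    # re-encoded: C3 83 C2 A9

    printf "original bytes: %vX\n", $utf8;      # C3.A9
    printf "mangled bytes:  %vX\n", $mangled;   # C3.83.C2.A9

Each additional misinterpretation adds another layer of the same kind
of garbage.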

Obviously, with 13 MB of data, I wasn’t going to correct it all by
hand; I needed to write a script. But that introduced additional
hand; I needed to write a script. But that introduced additional
problems: a Perl script that’s basically “s/foo/bar/g” is
pretty simple, but when foo and bar are strings that
represent the same character using different encodings, things can get
hairy: what if bar is UTF-8, but Perl thinks that the file is
in ISO-8859-15?
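
For what it’s worth, the general shape of such a fix-up filter looks
something like this (a sketch, not the script I actually ran; the byte
sequences are just the “é” example from above, and you’d add one
substitution per mangled sequence). Working on raw bytes sidesteps the
question of what encoding Perl thinks the file is in:

    #!/usr/bin/perl
    # Byte-level fix-up filter: read raw bytes, map the double-encoded
    # form of a character back to plain UTF-8, write raw bytes out.
    use strict;
    use warnings;

    binmode STDIN;     # no encoding layer: bytes in...
    binmode STDOUT;    # ...and bytes out

    while (my $line = <STDIN>) {
        # Undo a double-encoded "é": C3 83 C2 A9 -> C3 A9
        $line =~ s/\xC3\x83\xC2\xA9/\xC3\xA9/g;
        print $line;
    }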

On top of that, you have to keep track of which encoding Emacs is
using to show you any given file.

iconv turned out to be an invaluable forensic tool, but it has one
limitation: you can’t use it to simply decode UTF-8 (or if you can, I
wasn’t able to figure out how to do so). There were times when I
wanted to decode a snippet of text and look at it to see if I could
recognize the encoding. But iconv only allows you to convert from one
encoding to another; so if you try to convert from UTF-8 to
ISO-8859-1, and the resulting character isn’t defined in ISO-8859-1,
you get an error. Bleah.
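
What I ended up wanting was something that would just decode a snippet
as UTF-8 and show me the codepoints, so I could stare at them and guess
where they came from. Here is a quick sketch of that step using Perl’s
Encode and charnames modules (not iconv):

    #!/usr/bin/perl
    # Decode STDIN as UTF-8 and dump each character's codepoint and name.
    use strict;
    use warnings;
    use Encode qw(decode);
    use charnames ();

    binmode STDIN;                          # read raw bytes
    my $bytes = do { local $/; <STDIN> };   # slurp the whole snippet
    my $text  = decode('UTF-8', $bytes, Encode::FB_WARN);  # warn on malformed UTF-8

    for my $ch (split //, $text) {
        printf "U+%04X  %s\n", ord($ch),
            charnames::viacode(ord($ch)) || '(no name)';
    }

iconv is still fine for the actual conversions once you know what
you’re looking at; this is just for the staring part.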

The moral of the story is, use UTF-8 for everything. If the software
you’re using doesn’t give you UTF-8 as an option, ditch it and use
another package.

One thought on “Character Encodings Are a PITA”

  1. Stuff like this makes me happy I spend most of my time either in driver code or DSP / imaging code. The less I interact directly with human beings, the happier I am. If you don’t speak directly in binary integer form, I’m not very interested in writing code to talk to you.

    Being the type of programmer I am, I really don’t know much about the history of these different encodings. Why haven’t we settled on one that will handle all of our symbols? Has such a standard been agreed upon? I imagine that if/when it happens, it will be years before everybody starts using it, not least of all because it’s a lot easier for Americans like me to code standard ASCII string literals into our systems all over the place. Are lazy slobs like me the main reason things are so messed up?

  2. Troublesome Frog:

    If you don’t speak directly in binary integer form, I’m not very interested in writing code to talk to you.

    Unfortunately, I think that a lot of this stuff might be more of a problem for programmers than for end users. If you accept or display any kind of human-readable string, then (if you want your software to work in Beijing as well as it does in Cincinnati) you’re going to need to know whether it uses wide characters (4 bytes per character, whatever a character is) or a variable-length encoding (a character can be represented by one or more bytes). And at the very least, you need to know what the encoding is (US-ASCII, ISO-8859-1, UTF-16BE, etc.) so that you can spit it out in the HTML headers, or put the right byte-order mark in the output file, or whatever.

    If you’re using wide chars (wchar_t in C), you’re using more storage, but operations like finding the 12th character in a string are fast. Also, if you write files this way, you break utilities like strings, and screw over anyone who thinks a string is NUL-terminated. If you use a variable-width encoding like UTF-8, you save space, but you lose the ability to use strlen() and friends to count characters.
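
    To put the same trade-off in Perl terms (a quick sketch, since that’s the language the post above uses): the byte length and the character length of the same text are simply different numbers, and you have to know which one you’re asking for.

        #!/usr/bin/perl
        # Byte length vs. character length of the same text.
        use strict;
        use warnings;
        use Encode qw(encode);

        my $text  = "r\x{00E9}sum\x{00E9}";   # "résumé" as characters
        my $bytes = encode('UTF-8', $text);   # the same thing as UTF-8 bytes

        print length($text),  "\n";   # 6 characters
        print length($bytes), "\n";   # 8 bytes (each "é" takes two)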

    Of course, if your code’s interaction with the rest of the world consists entirely of taking numeric data structures as arguments and returning a numeric error code, you can probably avoid thinking about this.

    But if you ever do need to deal with it, I recommend using UTF-8 for file I/O, and wide character Unicode for internal representation (see the wcs*() functions). Although I haven’t done any truly internationalized programming in a language that didn’t already have an internal representation that I had to worry about.
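
    The Perl version of that split, for what it’s worth, is to put an encoding layer on the filehandles and let everything in between deal only in characters. A sketch (the file names are whatever you like):

        #!/usr/bin/perl
        # Decode at the edges, work with characters in the middle: the
        # I/O layers do the UTF-8 conversion, and the code in between
        # sees Perl's internal character strings.
        use strict;
        use warnings;

        my ($in, $out) = @ARGV;
        open my $rfh, '<:encoding(UTF-8)', $in  or die "can't read $in: $!";
        open my $wfh, '>:encoding(UTF-8)', $out or die "can't write $out: $!";

        while (my $line = <$rfh>) {
            # $line holds characters here, not bytes; length(), regexes,
            # uc(), etc. all work per character.
            print {$wfh} uc($line);
        }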

    Why haven’t we settled on one that will handle all of our symbols? Has such a standard been agreed upon?

    There is: Unicode.

    Basically, it’s a huge collection of every character in every writing system ever devised (including, I believe, Tolkien’s runes). Every character has a number. There are tables that say what’s a letter, what’s a number, what’s an upper-case letter, what’s a punctuation symbol, and so forth. There are also standard encodings, which allow you to convert between the Platonic-ideal numerical representation of a string and actual bits in a file: UTF-16 basically just uses a 16-bit integer for each character (with a pair of them for the rarer characters), and exists in two endiannesses.

    UTF-8 uses a variable number of bytes per character, and has some useful properties: for characters 0-127, it’s the same as ASCII (and those characters are also the same in ISO-8859-*), so your plain-English files remain readable. Also, NUL isn’t used, so your old C code that assumes NUL-terminated strings continues to work.
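
    A quick illustration (again in Perl, just to have something concrete to poke at): ASCII characters encode to themselves, accented Latin letters take two bytes, and CJK characters take three.

        #!/usr/bin/perl
        # How many UTF-8 bytes various characters take.
        use strict;
        use warnings;
        use Encode qw(encode);

        for my $ch ("A", "\x{00E9}", "\x{4E2D}") {   # "A", "é", "中"
            my $bytes = encode('UTF-8', $ch);
            printf "U+%04X -> %d byte(s): %vX\n", ord($ch), length($bytes), $bytes;
        }
        # U+0041 -> 1 byte(s): 41
        # U+00E9 -> 2 byte(s): C3.A9
        # U+4E2D -> 3 byte(s): E4.B8.AD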

    UTF-7 is like UTF-8, but for situations where ancient software might use the 8th bit for its own nefarious purposes. I think it’s mostly used in email headers (where antediluvian RFCs are still being followed) and in discussions of Unicode encodings.

    it’s a lot easier for Americans like me to code standard ASCII string literals into our systems all over the place. Are lazy slobs like me the main reason things are so messed up?

    I’m afraid so.

    I have a theory (colloquial sense) that the US was the best place for computing to be developed, because text I/O is so simple. You can get away with 26 letters (if you’re willing to use monocase) and a handful of punctuation symbols. (And this is a big deal with things like drum printers, not to mention the cost of storing text. In the early days, every bit counted; that’s why a lot of old programs used the high bit to annotate text.)

    Countries like Britain and Australia had the same linguistic advantages as the US in this regard, although the British would have had to sacrifice accents in words like “coördination”, just as we sacrificed the accents in “résumé”. But they didn’t have the industrial or brain-trust resources that the post-WWII US had.

    Most European countries — France, Germany, Italy, Spain, etc. — could’ve sacrificed accents as well. But their languages use accents a lot more than English does, and “computer French” would’ve been a severely limited version of French.

    Okay, Greek uses 24 letters, so we’re back to the “industrial base” argument. And Russian uses 33 letters, which is a disadvantage compared to English, but not an insurmountable one.

    As for Chinese and Japanese, forget it. They’d have had to invent a whole new language to allow programmers to talk to computers.

    And the more I look at Unicode, the more I think that while it’s butt-ugly, it’s probably the best solution we’re going to get to a butt-ugly problem. A roman capital A and a cyrillic capital A are written the same way, but are arguably different characters. A character like “à” is different from “a”, but is related to it. You could declare that “à” should be represented as “a” followed by “`”, but then that breaks strlen(), so you need to be able to represent both “a” and “à”. And “`”, since you might be writing a grammar book.
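
    Unicode actually allows both forms, which is where normalization comes in. Here is a quick sketch with Perl’s Unicode::Normalize (a core module) showing “à” as one codepoint and as two:

        #!/usr/bin/perl
        # Precomposed "à" vs. "a" plus a combining grave accent.
        use strict;
        use warnings;
        use Unicode::Normalize qw(NFC NFD);

        my $precomposed = "\x{00E0}";          # "à" as a single codepoint
        my $decomposed  = NFD($precomposed);   # "a" (U+0061) + COMBINING GRAVE ACCENT (U+0300)

        printf "precomposed: %d codepoint(s)\n", length($precomposed);   # 1
        printf "decomposed:  %d codepoint(s)\n", length($decomposed);    # 2
        print  "equal after NFC? ", (NFC($decomposed) eq $precomposed ? "yes" : "no"), "\n";   # yes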

    In Hebrew, you have a similar problem, in that vowels are represented by diacritical marks above consonants. In German, “ß” is equivalent to “ss”. In Dutch, “ij” is treated as a single letter, as are “ll” and “rr” in Spanish. In Arabic, certain letters can have different shapes depending on whether they come at the beginning, middle, or end of a word, or on their own (I glossed over the definition of “character”, earlier. This is why. Jesuit theologians have nothing on the people who argue over what constitutes a letter). In Chinese, Japanese, and Korean, I think you also get into the problem of what’s the main pen-stroke of a character, and in which order you add more strokes to build up the full character (which is important for things like sorting strings).

    Some languages are written left-to-right. Others right-to-left. Chinese is optionally written top-to-bottom. And some ancient Greek dialects are written boustrophedonically, alternating direction with every line.

    For computer displays, it’s quite reasonable to have monospace fonts: it only takes a little bit of mangling to have “i” fit in the same sized box as “M”. The same is, I think, true in Chinese as well. But it wouldn’t make sense to have a monospace font that can accommodate both English and Chinese: either the Chinese text would be so cramped as to be illegible, or the English text would be grotesquely spaced out. So you also need to keep track of which characters need wide boxes and which ones need narrow ones.

    Of course, what would you expect from a human endeavor that’s been evolving for thousands of years, using completely different technologies under completely different limitations? Of course it’s a mess. So any standard that tries to unify all of this is going to be a mess as well. As Douglas Adams said, in his summary of the summary of the summary, people are a problem.
