<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Character Encodings Are a PITA</title>
	<atom:link href="http://www.ooblick.com/weblog/2008/12/06/character-encodings-are-a-pita/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.ooblick.com/weblog/2008/12/06/character-encodings-are-a-pita/</link>
	<description>All the etcetera that&#039;s fit to read.</description>
	<lastBuildDate>Tue, 31 Jan 2012 13:31:01 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.1</generator>
	<item>
		<title>By: Fez</title>
		<link>http://www.ooblick.com/weblog/2008/12/06/character-encodings-are-a-pita/comment-page-1/#comment-163502</link>
		<dc:creator>Fez</dc:creator>
		<pubDate>Mon, 12 Jan 2009 16:13:32 +0000</pubDate>
		<guid isPermaLink="false">http://www.ooblick.com/weblog/?p=641#comment-163502</guid>
		<description>&lt;p&gt;Even more fun with encodings and why it&#039;s important to consider the breadth of their effects:  http://www.securityfocus.com/archive/1/499926 .  A demonstrated ability to think one&#039;s way through a corkscrew should probably be added to any interview for IT security related positions.&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>Even more fun with encodings and why it&#8217;s important to consider the breadth of their effects:  <a href="http://www.securityfocus.com/archive/1/499926" rel="nofollow">http://www.securityfocus.com/archive/1/499926</a> .  A demonstrated ability to think one&#8217;s way through a corkscrew should probably be added to any interview for IT security related positions.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: arensb</title>
		<link>http://www.ooblick.com/weblog/2008/12/06/character-encodings-are-a-pita/comment-page-1/#comment-161509</link>
		<dc:creator>arensb</dc:creator>
		<pubDate>Mon, 08 Dec 2008 03:56:01 +0000</pubDate>
		<guid isPermaLink="false">http://www.ooblick.com/weblog/?p=641#comment-161509</guid>
		<description>&lt;p&gt;Troublesome Frog:&lt;/p&gt;

&lt;blockquote&gt;If you don’t speak directly in binary integer form, I’m not very interested in writing code to talk to you.&lt;/blockquote&gt;

&lt;p&gt;Unfortunately, I think that a lot of this stuff might be more of a problem for programmers than for end users. If you accept or display any kind of human-readable string, then (if you want your software to work in Beijing as well as it does in Cincinnati) you&#039;re going to need to know whether it&#039;s wide characters (4 bytes per character, whatever a character is) or variable-length (a character can be represented by one or more bytes). And at the very least, you need to know what the encoding is (US-ASCII, ISO-8859-1, UTF-16BE, etc.) so that you can spit it out in the HTML headers, or put the right byte-order mark in the output file, or whatever.&lt;/p&gt;

&lt;p&gt;If you&#039;re using wide chars (&lt;tt&gt;wchar_t&lt;/tt&gt; in C), you&#039;re using more storage, but operations like finding the 12th character in a string are fast. Also, if you write files this way, you break utilities like &lt;tt&gt;strings&lt;/tt&gt;, and screw over anyone who thinks a string is NUL-terminated, since most of the padding bytes are zero. If you use a variable-width encoding like UTF-8, you save space, but &lt;tt&gt;strlen()&lt;/tt&gt; and friends now count bytes rather than characters.&lt;/p&gt;

&lt;p&gt;Of course, if your code&#039;s interaction with the rest of the world consists entirely of taking numeric data structures as arguments and returning a numeric error code, you can probably avoid thinking about this.&lt;/p&gt;

&lt;p&gt;But if you ever do need to deal with it, I recommend using UTF-8 for file I/O, and wide character Unicode for internal representation (see the &lt;tt&gt;wcs*()&lt;/tt&gt; functions). Although I should admit that the only truly internationalized programming I&#039;ve done has been in languages that already had their own internal representation, so I didn&#039;t have to worry about it.&lt;/p&gt;

&lt;blockquote&gt;Why haven’t we settled on one that will handle all of our symbols? Has such a standard been agreed upon?&lt;/blockquote&gt;

&lt;p&gt;There is: Unicode.&lt;/p&gt;

&lt;p&gt;Basically, it&#039;s a huge collection of every character in every writing system ever devised (including, I believe, Tolkien&#039;s runes). Every character has a number. There are tables that say what&#039;s a letter, what&#039;s a number, what&#039;s an upper-case letter, what&#039;s a punctuation symbol, and so forth. There are also standard encodings, which allow you to convert between the Platonic-ideal numerical representation of a string and actual bits in a file: UTF-16 basically just uses a 16-bit integer for each character (two of them for characters outside the first 65,536), and exists in two endiannesses.&lt;/p&gt;

&lt;p&gt;UTF-8 uses a variable number of bytes per character, and has some useful properties: for characters 0-127, it&#039;s the same as ASCII (and those characters are also the same in ISO-8859-*), so your plain-English files remain readable. Also, the zero byte is used only for NUL itself, so your old C code that assumes NUL-terminated strings continues to work.&lt;/p&gt;

&lt;p&gt;UTF-7 is like UTF-8, but for situations where ancient software might use the 8th bit for its own nefarious purposes. I think it&#039;s mostly used in email headers (where antediluvian RFCs are still being followed) and in discussions of Unicode encodings.&lt;/p&gt;

&lt;blockquote&gt;it’s a lot easier for Americans like me to code standard ASCII string literals into our systems all over the place. Are lazy slobs like me the main reason things are so messed up?&lt;/blockquote&gt;

&lt;p&gt;I&#039;m afraid so.&lt;/p&gt;

&lt;p&gt;I have a theory (colloquial sense) that the US was the best place for computing to be developed, because text I/O is so simple. You can get away with 26 letters (if you&#039;re willing to use monocase) and a handful of punctuation symbols. (And this is a big deal with things like drum printers, not to mention the cost of storing text. In the early days, every bit counted; that&#039;s why a lot of old programs used the high bit to annotate text.)&lt;/p&gt;

&lt;p&gt;Countries like Britain and Australia had the same linguistic advantages as the US in this regard, although the British would have had to sacrifice accents in words like &quot;coördination&quot;, just as we sacrificed the accents in &quot;résumé&quot;. But they didn&#039;t have the industrial or brain-trust resources that post-WWII US had.&lt;/p&gt;

&lt;p&gt;Most European countries &#8212; France, Germany, Italy, Spain, etc. &#8212; could&#039;ve sacrificed accents as well. But their languages use accents a lot more than English does, and &quot;computer French&quot; would&#039;ve been a severely limited version of French.&lt;/p&gt;

&lt;p&gt;Okay, Greek uses 24 letters, so we&#039;re back to the &quot;industrial base&quot; argument. And Russian uses 33 letters, which is a disadvantage compared to English, but not an insurmountable one.&lt;/p&gt;

&lt;p&gt;As for Chinese and Japanese, forget it. They&#039;d have had to invent a whole new language to allow programmers to talk to computers.&lt;/p&gt;

&lt;p&gt;And the more I look at Unicode, the more I think that while it&#039;s butt-ugly, it&#039;s probably the best solution we&#039;re going to get to a butt-ugly problem. A roman capital A and a cyrillic capital A are written the same way, but are arguably different characters. A character like &quot;à&quot; is different from &quot;a&quot;, but is related to it. You could declare that &quot;à&quot; should be represented as &quot;a&quot; followed by &quot;&#096;&quot;, but then that breaks &lt;tt&gt;strlen()&lt;/tt&gt;, so you need to be able to represent both &quot;a&quot; and &quot;à&quot;. And &quot;&#096;&quot;, since you might be writing a grammar book.&lt;/p&gt;

&lt;p&gt;In Hebrew, you have a similar problem, in that vowels are represented by diacritical marks placed around the consonants (mostly below them). In German, &quot;ß&quot; is equivalent to &quot;ss&quot;. In Dutch, &quot;ij&quot; is treated as a single letter, as &quot;ch&quot; and &quot;ll&quot; traditionally were in Spanish. In Arabic, certain letters can have different shapes depending on whether they come at the beginning, middle, or end of a word, or on their own. (I glossed over the definition of &quot;character&quot; earlier. This is why. Jesuit theologians have nothing on the people who argue over what constitutes a letter.) In Chinese, Japanese, and Korean, I think you also get into the problem of what&#039;s the main pen-stroke of a character, and in which order you add more strokes to build up the full character (which is important for things like sorting strings).&lt;/p&gt;

&lt;p&gt;Some languages are written left-to-right. Others right-to-left. Chinese is optionally written top-to-bottom. And some ancient Greek inscriptions were written boustrophedonically, alternating direction with every line.&lt;/p&gt;

&lt;p&gt;For computer displays, it&#039;s quite reasonable to have monospace fonts: it only takes a little bit of mangling to have &quot;i&quot; fit in the same sized box as &quot;M&quot;. The same is, I think, true in Chinese as well. But it wouldn&#039;t make sense to have a monospace font that can accommodate both English and Chinese: either the Chinese text would be so cramped as to be illegible, or the English text would be grotesquely spaced out. So you also need to keep track of which characters need wide boxes and which ones need narrow ones.&lt;/p&gt;

&lt;p&gt;Of course, what would you expect from a human endeavor that&#039;s been evolving for thousands of years, using completely different technology and limitations? Of course it&#039;s a mess. So any standard that tries to unify all of this is going to be a mess as well. As Douglas Adams said, in his summary of the summary of the summary, people are a problem.&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>Troublesome Frog:</p>
<blockquote><p>If you don’t speak directly in binary integer form, I’m not very interested in writing code to talk to you.</p></blockquote>
<p>Unfortunately, I think that a lot of this stuff might be more of a problem for programmers than for end users. If you accept or display any kind of human-readable string, then (if you want your software to work in Beijing as well as it does in Cincinnati) you&#8217;re going to need to know whether it&#8217;s wide characters (4 bytes per character, whatever a character is) or variable-length (a character can be represented by one or more bytes). And at the very least, you need to know what the encoding is (US-ASCII, ISO-8859-1, UTF-16BE, etc.) so that you can spit it out in the HTML headers, or put the right byte-order mark in the output file, or whatever.</p>
<p>If you&#8217;re using wide chars (<tt>wchar_t</tt> in C), you&#8217;re using more storage, but operations like finding the 12th character in a string are fast. Also, if you write files this way, you break utilities like <tt>strings</tt>, and screw over anyone who thinks a string is NUL-terminated, since most of the padding bytes are zero. If you use a variable-width encoding like UTF-8, you save space, but <tt>strlen()</tt> and friends now count bytes rather than characters.</p>
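<p>To make that trade-off concrete, here is a minimal sketch. It uses Python rather than C purely for illustration, the sample text is made up, and the <tt>utf-32-be</tt> codec stands in for a 4-byte wide-character representation (an assumption; <tt>wchar_t</tt> is not 4 bytes on every platform).</p>
<pre>
```python
# Compare "length in characters" with "length in bytes" for the same text.
text = "na\u00efve caf\u00e9"      # "naïve café": 10 characters, 2 of them non-ASCII

utf8 = text.encode("utf-8")        # variable width: 1 byte per ASCII char, 2 for ï and é
utf32 = text.encode("utf-32-be")   # fixed width: always 4 bytes per code point

char_count = len(text)             # what a human would call the length
utf8_bytes = len(utf8)             # what a byte-counting strlen() would report
utf32_bytes = len(utf32)

# With fixed width, the Nth character sits at a known byte offset (N * 4).
third = utf32[2 * 4:3 * 4].decode("utf-32-be")

# The fixed-width form is full of zero bytes, which is what trips up
# code that treats a zero byte as the end of the string.
has_zero_bytes = 0 in utf32

print(char_count, utf8_bytes, utf32_bytes, third, has_zero_bytes)
# prints: 10 12 40 ï True
```
</pre>
<p>The same ten characters take 12 bytes in one form and 40 in the other, and only the larger one lets you jump straight to a given character.</p>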
<p>Of course, if your code&#8217;s interaction with the rest of the world consists entirely of taking numeric data structures as arguments and returning a numeric error code, you can probably avoid thinking about this.</p>
<p>But if you ever do need to deal with it, I recommend using UTF-8 for file I/O, and wide character Unicode for internal representation (see the <tt>wcs*()</tt> functions). Although I should admit that the only truly internationalized programming I&#8217;ve done has been in languages that already had their own internal representation, so I didn&#8217;t have to worry about it.</p>
<blockquote><p>Why haven’t we settled on one that will handle all of our symbols? Has such a standard been agreed upon?</p></blockquote>
<p>There is: Unicode.</p>
<p>Basically, it&#8217;s a huge collection of every character in every writing system ever devised (including, I believe, Tolkien&#8217;s runes). Every character has a number. There are tables that say what&#8217;s a letter, what&#8217;s a number, what&#8217;s an upper-case letter, what&#8217;s a punctuation symbol, and so forth. There are also standard encodings, which allow you to convert between the Platonic-ideal numerical representation of a string and actual bits in a file: UTF-16 basically just uses a 16-bit integer for each character (two of them for characters outside the first 65,536), and exists in two endiannesses.</p>
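<p>A small sketch of what the two endiannesses look like in practice, again in Python for illustration only. The sample characters are my own choice; the codec names (<tt>utf-16-be</tt>, <tt>utf-16-le</tt>, <tt>utf-16</tt>) are the standard Python ones.</p>
<pre>
```python
# UTF-16 sketch: byte order, the byte-order mark, and surrogate pairs.
e_acute = "\u00e9"                            # é, code point U+00E9

big_endian = e_acute.encode("utf-16-be")      # b'\x00\xe9': high byte first
little_endian = e_acute.encode("utf-16-le")   # b'\xe9\x00': low byte first
with_bom = e_acute.encode("utf-16")           # byte-order mark, then native order

# A character beyond U+FFFF does not fit in a single 16-bit unit,
# so UTF-16 spends two units (a surrogate pair) on it.
g_clef = "\U0001d11e"                         # MUSICAL SYMBOL G CLEF
units = len(g_clef.encode("utf-16-be")) // 2

print(big_endian, little_endian, with_bom[:2], units)
```
</pre>
<p>The first two bytes of <tt>with_bom</tt> are the byte-order mark, which is how a reader can tell which of the two layouts follows.</p>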
<p>UTF-8 uses a variable number of bytes per character, and has some useful properties: for characters 0-127, it&#8217;s the same as ASCII (and those characters are also the same in ISO-8859-*), so your plain-English files remain readable. Also, the zero byte is used only for NUL itself, so your old C code that assumes NUL-terminated strings continues to work.</p>
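<p>Those properties are easy to check. A minimal sketch in Python (illustration only; the sample strings are invented):</p>
<pre>
```python
# Check two UTF-8 properties: ASCII compatibility and the absence of zero bytes.
ascii_text = "plain English"
same_as_ascii = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

mixed = "r\u00e9sum\u00e9 \u4e2d\u6587"      # "résumé 中文"
encoded = mixed.encode("utf-8")

# No zero byte appears unless the text itself contains U+0000.
has_zero_byte = 0 in encoded

# Bytes used per character: 1 for ASCII, 2 for é, 3 for each Chinese character.
bytes_per_char = [len(ch.encode("utf-8")) for ch in mixed]

print(same_as_ascii, has_zero_byte, bytes_per_char)
# prints: True False [1, 2, 1, 1, 1, 2, 1, 3, 3]
```
</pre>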
<p>UTF-7 is like UTF-8, but for situations where ancient software might use the 8th bit for its own nefarious purposes. I think it&#8217;s mostly used in email headers (where antediluvian RFCs are still being followed) and in discussions of Unicode encodings.</p>
<blockquote><p>it’s a lot easier for Americans like me to code standard ASCII string literals into our systems all over the place. Are lazy slobs like me the main reason things are so messed up?</p></blockquote>
<p>I&#8217;m afraid so.</p>
<p>I have a theory (colloquial sense) that the US was the best place for computing to be developed, because text I/O is so simple. You can get away with 26 letters (if you&#8217;re willing to use monocase) and a handful of punctuation symbols. (And this is a big deal with things like drum printers, not to mention the cost of storing text. In the early days, every bit counted; that&#8217;s why a lot of old programs used the high bit to annotate text.)</p>
<p>Countries like Britain and Australia had the same linguistic advantages as the US in this regard, although the British would have had to sacrifice accents in words like &#8220;coördination&#8221;, just as we sacrificed the accents in &#8220;résumé&#8221;. But they didn&#8217;t have the industrial or brain-trust resources that post-WWII US had.</p>
<p>Most European countries &mdash; France, Germany, Italy, Spain, etc. &mdash; could&#8217;ve sacrificed accents as well. But their languages use accents a lot more than English does, and &#8220;computer French&#8221; would&#8217;ve been a severely limited version of French.</p>
<p>Okay, Greek uses 24 letters, so we&#8217;re back to the &#8220;industrial base&#8221; argument. And Russian uses 33 letters, which is a disadvantage compared to English, but not an insurmountable one.</p>
<p>As for Chinese and Japanese, forget it. They&#8217;d have had to invent a whole new language to allow programmers to talk to computers.</p>
<p>And the more I look at Unicode, the more I think that while it&#8217;s butt-ugly, it&#8217;s probably the best solution we&#8217;re going to get to a butt-ugly problem. A roman capital A and a cyrillic capital A are written the same way, but are arguably different characters. A character like &#8220;à&#8221; is different from &#8220;a&#8221;, but is related to it. You could declare that &#8220;à&#8221; should be represented as &#8220;a&#8221; followed by &#8220;&#96;&#8221;, but then that breaks <tt>strlen()</tt>, so you need to be able to represent both &#8220;a&#8221; and &#8220;à&#8221;. And &#8220;&#96;&#8221;, since you might be writing a grammar book.</p>
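<p>Unicode does in fact allow both spellings of &#8220;à&#8221;, with the detail that the decomposed form uses a dedicated combining accent (U+0300) rather than the standalone grave accent (U+0060). A minimal sketch in Python, using the standard <tt>unicodedata</tt> module; the variable names are mine:</p>
<pre>
```python
import unicodedata

precomposed = "\u00e0"     # à as one code point
decomposed = "a\u0300"     # a followed by COMBINING GRAVE ACCENT
standalone = "\u0060"      # the grave accent on its own, for the grammar book

# Same letter on screen, different lengths, and not equal as raw strings.
lengths = (len(precomposed), len(decomposed))
raw_equal = precomposed == decomposed

# Normalization converts between the two spellings.
composed_equal = unicodedata.normalize("NFC", decomposed) == precomposed
decomposed_equal = unicodedata.normalize("NFD", precomposed) == decomposed

print(lengths, raw_equal, composed_equal, decomposed_equal)
# prints: (1, 2) False True True
print(unicodedata.name(standalone))
# prints: GRAVE ACCENT
```
</pre>
<p>So comparing or measuring strings reliably usually means normalizing them to one form first.</p>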
<p>In Hebrew, you have a similar problem, in that vowels are represented by diacritical marks placed around the consonants (mostly below them). In German, &#8220;ß&#8221; is equivalent to &#8220;ss&#8221;. In Dutch, &#8220;ij&#8221; is treated as a single letter, as &#8220;ch&#8221; and &#8220;ll&#8221; traditionally were in Spanish. In Arabic, certain letters can have different shapes depending on whether they come at the beginning, middle, or end of a word, or on their own. (I glossed over the definition of &#8220;character&#8221; earlier. This is why. Jesuit theologians have nothing on the people who argue over what constitutes a letter.) In Chinese, Japanese, and Korean, I think you also get into the problem of what&#8217;s the main pen-stroke of a character, and in which order you add more strokes to build up the full character (which is important for things like sorting strings).</p>
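<p>The German case is the easiest of these to demonstrate. A small sketch in Python, using only the built-in <tt>str.upper()</tt> and <tt>str.casefold()</tt> methods; the example word is my own:</p>
<pre>
```python
# One letter can turn into two: the German sharp s has "ss" as its equivalent.
word = "Stra\u00dfe"                 # "Straße", 6 characters

upper = word.upper()                 # "STRASSE", 7 characters
folded = word.casefold()             # "strasse", for caseless comparison

length_changed = (len(word), len(upper))
same_word = folded == "STRASSE".casefold()

print(upper, folded, length_changed, same_word)
# prints: STRASSE strasse (6, 7) True
```
</pre>
<p>Changing case can change the length of a string, which is one more reason not to assume one character in means one character out.</p>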
<p>Some languages are written left-to-right. Others right-to-left. Chinese is optionally written top-to-bottom. And some ancient Greek inscriptions were written boustrophedonically, alternating direction with every line.</p>
<p>For computer displays, it&#8217;s quite reasonable to have monospace fonts: it only takes a little bit of mangling to have &#8220;i&#8221; fit in the same sized box as &#8220;M&#8221;. The same is, I think, true in Chinese as well. But it wouldn&#8217;t make sense to have a monospace font that can accommodate both English and Chinese: either the Chinese text would be so cramped as to be illegible, or the English text would be grotesquely spaced out. So you also need to keep track of which characters need wide boxes and which ones need narrow ones.</p>
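<p>Unicode records this as the East Asian Width property. Here is a rough sketch in Python of counting display columns with it. The <tt>display_columns</tt> helper is hypothetical and deliberately simplified: it ignores combining marks and the &#8220;ambiguous&#8221; width class, which a real terminal has to handle.</p>
<pre>
```python
import unicodedata

def display_columns(text):
    """Rough column count: 2 for Wide or Fullwidth characters, 1 for everything else."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )

narrow = display_columns("iM")               # two narrow boxes
wide = display_columns("\u4e2d\u6587")       # "中文": two wide boxes
mixed = display_columns("M\u4e2d")           # one narrow box plus one wide box

print(narrow, wide, mixed)
# prints: 2 4 3
```
</pre>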
<p>Of course, what would you expect from a human endeavor that&#8217;s been evolving for thousands of years, using completely different technology and limitations? Of course it&#8217;s a mess. So any standard that tries to unify all of this is going to be a mess as well. As Douglas Adams said, in his summary of the summary of the summary, people are a problem.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Troublesome Frog</title>
		<link>http://www.ooblick.com/weblog/2008/12/06/character-encodings-are-a-pita/comment-page-1/#comment-161499</link>
		<dc:creator>Troublesome Frog</dc:creator>
		<pubDate>Mon, 08 Dec 2008 00:41:43 +0000</pubDate>
		<guid isPermaLink="false">http://www.ooblick.com/weblog/?p=641#comment-161499</guid>
		<description>&lt;p&gt;Stuff like this makes me happy I spend most of my time either in driver code or DSP / imaging code.  The less I interact directly with human beings, the happier I am.  If you don&#039;t speak directly in binary integer form, I&#039;m not very interested in writing code to talk to you.&lt;/p&gt;

&lt;p&gt;Being the type of programmer I am, I really don&#039;t know much about the history of these different encodings.  Why haven&#039;t we settled on one that will handle all of our symbols?  Has such a standard been agreed upon? I imagine that if/when it happens, it will be years before everybody starts using it, not least of all because it&#039;s a lot easier for Americans like me to code standard ASCII string literals into our systems all over the place.  Are lazy slobs like me the main reason things are so messed up?&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>Stuff like this makes me happy I spend most of my time either in driver code or DSP / imaging code.  The less I interact directly with human beings, the happier I am.  If you don&#8217;t speak directly in binary integer form, I&#8217;m not very interested in writing code to talk to you.</p>
<p>Being the type of programmer I am, I really don&#8217;t know much about the history of these different encodings.  Why haven&#8217;t we settled on one that will handle all of our symbols?  Has such a standard been agreed upon? I imagine that if/when it happens, it will be years before everybody starts using it, not least of all because it&#8217;s a lot easier for Americans like me to code standard ASCII string literals into our systems all over the place.  Are lazy slobs like me the main reason things are so messed up?</p>
]]></content:encoded>
	</item>
</channel>
</rss>

