home | O'Reilly's CD bookshelfs | FreeBSD | Linux | Cisco | Cisco Exam  

Book HomeXML in a NutshellSearch this book

5.5. Unicode

Unicode is an international standard character set that can be used to write documents in almost any language you're likely to speak, learn, or encounter in your lifetime, barring alien abduction. Version 3.2, the current version as of May, 2002, contains 95,156 characters from most of Earth's living languages as well as several dead ones. Unicode easily covers the Latin alphabet, in which most of this book is written. Unicode also covers Greek-derived scripts, including ancient and modern Greek and the Cyrillic scripts used in Serbia and much of the former Soviet Union. Unicode covers several ideographic scripts, including the Han character set used for Chinese and Japanese, the Korean Hangul syllabary, and phonetic representations of these languages, including Katakana and Hiragana. It covers the right-to-left Arabic and Hebrew scripts. It covers various scripts native to the Indian subcontinent, including Devanagari, Thai, Bengali, Tibetan, and many more. And that's still less than half of the scripts in Unicode 3.1.1. Probably less than one person in a thousand today speaks a language that cannot be reasonably represented in Unicode. In the future, Unicode will add still more characters, making this fraction even smaller. Unicode can potentially hold more than a million characters, but no one is willing to say in public where they think most of the remaining million characters will come from.[4]

[4]Privately, some developers are willing to admit that they're preparing for a day when we're part of a Galactic Federation of thousands of intelligent species.

The Unicode character set assigns characters to code points, that is, numbers. These numbers can then be encoded in a variety of schemes, including:

  • UCS-2

  • UCS-4

  • UTF-8

  • UTF-16

5.5.1. UCS-2 and UTF-16

UCS-2, also known as ISO-10646-UCS-2, is perhaps the most natural encoding of Unicode. It represents each character as a 2-byte, unsigned integer between 0 and 65,535. Thus the capital letter A, code point 65 in Unicode, is represented by the 2 bytes 00 and 41 (in hexadecimal). The capital letter B, code point 66, is represented by the 2 bytes 00 and 42. The 2 bytes 03 and A3 represent the capital Greek letter Figure , code point 931.

UCS-2 comes in two variations, big endian and little endian. In big-endian UCS-2, the most significant byte of the character comes first. In little-endian UCS-2, the order is reversed. Thus, in big-endian UCS-2, the letter A is #x0041.[5] In little-endian UCS-2, the bytes are swapped, and A is #x4100. In big-endian UCS-2, the letter B is #x0042; in little-endian UCS-2, it's #x4200. In big-endian UCS-2, the letter Figure is #x03A3; in little-endian UCS-2, it's #xA303. In this book we use big-endian notation, but parsers cannot assume this. They must be able to determine the endianness from the document itself.

[5]For reasons that will become apparent shortly, this book has adopted the convention that #x precedes hexadecimal numbers. Every two hexadecimal digits map to one byte.

To distinguish between big-endian and little-endian UCS-2, a document encoded in UCS-2 customarily begins with Unicode character #xFEFF, the zero-width nonbreaking space, more commonly called the byte-order mark. This character has the advantage of being invisible. Furthermore, if its bytes are swapped, the resulting #xFFFE character doesn't actually exist. Thus, a program can look at the first two bytes of a UCS-2 document and tell immediately whether the document is big endian, depending on whether those bytes are #xFEFF or #xFFFE.

UCS-2 has three major disadvantages, however:

  • Files containing mostly Latin text are about twice as large in UCS-2 as they are in a single-byte character set, such as ASCII or Latin-1.

  • UCS-2 is not backward or forward compatible with ASCII. Tools that are accustomed to single-byte character sets often can't process a UCS-2 file in a reasonable way, even if the file only contains characters from the ASCII character set. For instance, a program written in C that expects the zero byte to terminate strings will choke on a UCS-2 file containing mostly English text because almost every other byte is zero.

  • UCS-2 is limited to 65,536 characters.

The last problem isn't so important in practice, since the first 65,536 code points of Unicode nonetheless manage to cover most people's needs except for dead languages like Ugaritic and fictional scripts like Tengwar. Mathematical symbols are also encountering these issues. Unicode does, however, provide a means of representing code points beyond 65,535 by recognizing certain two-byte sequences as half of a surrogate pair. A Unicode document that uses UCS-2 plus surrogate pairs is said to be in the UTF-16 encoding.

The other two problems, however, are more likely to affect most developers. UTF-8 is an alternative encoding for Unicode that addresses both.

Library Navigation Links

Copyright © 2002 O'Reilly & Associates. All rights reserved.