Unicode (XML in a Nutshell, 2nd Edition)

5.5.1. UCS-2 and UTF-16

UCS-2, also known as ISO-10646-UCS-2, is perhaps the most natural encoding of Unicode. It represents each character as a 2-byte, unsigned integer between 0 and 65,535. Thus the capital letter A, code point 65 in Unicode, is represented by the 2 bytes 00 and 41 (in hexadecimal). The capital letter B, code point 66, is represented by the 2 bytes 00 and 42. The 2 bytes 03 and A3 represent the capital Greek letter Σ, code point 931.

UCS-2 comes in two variations, big endian and little endian. In big-endian UCS-2, the most significant byte of the character comes first. In little-endian UCS-2, the order is reversed. Thus, in big-endian UCS-2, the letter A is #x0041.[5] In little-endian UCS-2, the bytes are swapped, and A is #x4100. In big-endian UCS-2, the letter B is #x0042; in little-endian UCS-2, it's #x4200. In big-endian UCS-2, the letter Σ is #x03A3; in little-endian UCS-2, it's #xA303. In this book we use big-endian notation, but parsers cannot assume this. They must be able to determine the endianness from the document itself.
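The two byte orders can be reproduced with a few lines of Python. This is only a sketch: the helper name `ucs2_bytes` is our own invention for illustration, and it relies on nothing beyond the built-in `ord` and `int.to_bytes`.

```python
def ucs2_bytes(ch: str, big_endian: bool = True) -> bytes:
    """Encode one character as a 2-byte unsigned integer (hypothetical helper)."""
    code_point = ord(ch)
    if code_point > 0xFFFF:
        # UCS-2 has no way to represent code points above 65,535.
        raise ValueError("code point does not fit in two bytes")
    return code_point.to_bytes(2, "big" if big_endian else "little")


# Show A, B, and the capital Greek sigma in both byte orders.
for ch in ("A", "B", "\u03a3"):
    big = ucs2_bytes(ch).hex().upper()
    little = ucs2_bytes(ch, big_endian=False).hex().upper()
    print(f"U+{ord(ch):04X}  big endian: {big}  little endian: {little}")
```

For the three characters discussed above, the big-endian column reads 0041, 0042, and 03A3, and the little-endian column reads 4100, 4200, and A303.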

[5]For reasons that will become apparent shortly, this book has adopted the convention that #x precedes hexadecimal numbers. Every two hexadecimal digits map to one byte.

To distinguish between big-endian and little-endian UCS-2, a document encoded in UCS-2 customarily begins with Unicode character #xFEFF, the zero-width nonbreaking space, more commonly called the byte-order mark. This character has the advantage of being invisible. Furthermore, if its bytes are swapped, the resulting #xFFFE is not assigned to any character. Thus, a program can look at the first two bytes of a UCS-2 document and tell immediately whether the document is big endian or little endian, depending on whether those bytes are #xFEFF or #xFFFE.
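The byte-order check described above might be sketched as follows. The function name `detect_ucs2_byte_order` is hypothetical, and the sketch assumes the caller already knows the data is UCS-2; it makes no attempt to tell UCS-2 apart from other encodings.

```python
from typing import Optional


def detect_ucs2_byte_order(data: bytes) -> Optional[str]:
    """Return "big", "little", or None when no byte-order mark is present."""
    first_two = data[:2]
    if first_two == b"\xfe\xff":
        return "big"      # bytes arrive as FE FF, so the mark reads #xFEFF
    if first_two == b"\xff\xfe":
        return "little"   # bytes arrive swapped, as FF FE
    return None           # no mark: the byte order must be learned some other way


# A byte-order mark followed by the letter A in each byte order.
print(detect_ucs2_byte_order(b"\xfe\xff\x00\x41"))
print(detect_ucs2_byte_order(b"\xff\xfe\x41\x00"))
```

Returning `None` rather than guessing is a deliberate choice here, since a document without the mark gives this check nothing to go on.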

UCS-2 has three major disadvantages, however:

Files containing mostly Latin text are about twice as large in UCS-2 as they are in a single-byte character set, such as ASCII or Latin-1.
UCS-2 is not backward or forward compatible with ASCII. Tools that are accustomed to single-byte character sets often can't process a UCS-2 file in a reasonable way, even if the file only contains characters from the ASCII character set. For instance, a program written in C that expects the zero byte to terminate strings will choke on a UCS-2 file containing mostly English text because almost every other byte is zero.
UCS-2 is limited to 65,536 characters.

The last problem isn't so important in practice, since the first 65,536 code points of Unicode cover most people's needs; the exceptions are mainly dead languages like Ugaritic and fictional scripts like Tengwar. Some mathematical symbols also fall outside this range. Unicode does, however, provide a means of representing code points beyond 65,535 by recognizing certain two-byte sequences as half of a surrogate pair. A Unicode document that uses UCS-2 plus surrogate pairs is said to be in the UTF-16 encoding.
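The text does not spell out how a surrogate pair is computed, so the following sketch fills that in from the standard UTF-16 rule: subtract #x10000 from the code point, then split the remaining 20 bits into two 10-bit halves added to #xD800 and #xDC00. The function name `to_surrogate_pair` is our own.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above 65,535 into two 16-bit surrogate values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points beyond #xFFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


# The first letter of the Ugaritic block, code point #x10380.
high, low = to_surrogate_pair(0x10380)
print(f"{high:04X} {low:04X}")
```

For #x10380 this yields the pair D800 DF80, which matches the four bytes Python's own `utf-16-be` codec produces for that character.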

The other two problems, however, are more likely to affect most developers. UTF-8 is an alternative encoding for Unicode that addresses both.
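The first two disadvantages are easy to observe directly. The sketch below uses Python's `utf-16-be` codec, which yields the same bytes as big-endian UCS-2 for ASCII-range text; the sample string and variable names are arbitrary, and the `split` call only imitates how a C routine would stop at the first zero byte.

```python
text = "Hello, XML"

ascii_data = text.encode("ascii")
ucs2_data = text.encode("utf-16-be")   # same bytes as big-endian UCS-2 here

# Disadvantage 1: the UCS-2 form is twice the size of the single-byte form.
print(len(ascii_data), len(ucs2_data))

# Disadvantage 2: every other byte is zero for ASCII-range characters.
print(ucs2_data.count(0))

# Imitation of a C string routine: everything from the first zero byte is lost.
c_view = ucs2_data.split(b"\x00", 1)[0]
print(c_view)
```

With big-endian data the very first byte is zero, so the C-style view comes back empty; with little-endian data it would contain only the first letter.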
