5.5.1. UCS-2 and UTF-16
UCS-2, also known as ISO-10646-UCS-2,
is perhaps the most natural encoding of Unicode. It represents each
character as a 2-byte, unsigned integer between 0 and 65,535. Thus
the capital letter A, code point 65 in Unicode,
is represented by the 2 bytes 00 and 41 (in hexadecimal). The capital
letter B, code point 66, is represented by the 2
bytes 00 and 42. The 2 bytes 03 and A3 represent the capital Greek
letter Σ, code point 931.
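The byte values quoted above can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper name `ucs2_be_bytes` is hypothetical, and it assumes the most significant byte is written first.

```python
def ucs2_be_bytes(ch: str) -> bytes:
    """Return the 2-byte UCS-2 form of one character, most significant byte first."""
    code_point = ord(ch)
    if code_point > 0xFFFF:
        # UCS-2 has no way to express code points above 65,535.
        raise ValueError("code point does not fit in 2 bytes")
    return code_point.to_bytes(2, byteorder="big")


for ch in ("A", "B", "\u03a3"):  # \u03a3 is the capital Greek letter sigma
    print(ch, ord(ch), ucs2_be_bytes(ch).hex(" "))
# A 65 00 41
# B 66 00 42
# Σ 931 03 a3
```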
UCS-2 comes in two variations, big endian and little endian. In
big-endian UCS-2, the most significant byte of the character comes
first. In little-endian UCS-2, the order is reversed. Thus, in
big-endian UCS-2, the letter A is
#x0041.[5] In
little-endian UCS-2, the bytes are swapped, and
A is #x4100. In big-endian UCS-2, the letter
B is #x0042; in little-endian UCS-2,
it's #x4200. In big-endian UCS-2, the letter Σ
is #x03A3; in little-endian UCS-2, it's #xA303. In this book we use big-endian notation, but parsers cannot assume this. They must be able to determine the endianness from the document itself.
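As a rough illustration of the two byte orders, the following sketch packs the same code point both ways with Python's standard `struct` module. The function name `ucs2_encode` and its `big_endian` parameter are made up for this example.

```python
import struct


def ucs2_encode(ch: str, big_endian: bool = True) -> bytes:
    """Pack one character as a 16-bit unsigned integer in the chosen byte order."""
    fmt = ">H" if big_endian else "<H"  # ">" is big endian, "<" is little endian
    return struct.pack(fmt, ord(ch))


for ch in ("A", "B", "\u03a3"):
    big = ucs2_encode(ch, big_endian=True).hex(" ")
    little = ucs2_encode(ch, big_endian=False).hex(" ")
    print(f"{ch}: big endian {big}, little endian {little}")
```

The two results always contain the same pair of bytes in opposite order, which is why a reader needs some outside signal to know which order a given document uses.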
To distinguish between big-endian and little-endian UCS-2, a document
encoded in UCS-2 customarily begins with Unicode character
#xFEFF, the zero-width nonbreaking
space, more commonly called the byte-order
mark. This character has the advantage of being
invisible. Furthermore, if its bytes are swapped, the resulting
#xFFFE is not a valid Unicode character. Thus, a
program can look at the first two bytes of a UCS-2 document and tell
immediately whether the document is big endian or little endian,
depending on whether those bytes are #xFEFF or #xFFFE.
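The check described above might be sketched as follows. The function name `detect_ucs2_byte_order` and its return values are assumptions chosen for illustration, and the sketch only covers the UCS-2 case discussed here, not other encodings that also begin with a byte-order mark.

```python
from typing import Optional


def detect_ucs2_byte_order(data: bytes) -> Optional[str]:
    """Guess the byte order of a UCS-2 document from its first two bytes.

    Returns "big", "little", or None when no byte-order mark is present.
    """
    prefix = data[:2]
    if prefix == b"\xfe\xff":
        return "big"     # #xFEFF read in order: most significant byte first
    if prefix == b"\xff\xfe":
        return "little"  # bytes swapped: least significant byte first
    return None


print(detect_ucs2_byte_order(b"\xfe\xff\x00\x41"))  # big
print(detect_ucs2_byte_order(b"\xff\xfe\x41\x00"))  # little
print(detect_ucs2_byte_order(b"\x00\x41"))          # None
```

When the function returns None, the caller has to fall back on some other source of information, such as an external declaration of the encoding.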
UCS-2 has three major disadvantages, however:
- Files containing mostly Latin text are about twice as large in UCS-2
  as they are in a single-byte character set, such as ASCII or Latin-1.
- UCS-2 is not backward or forward compatible with ASCII. Tools that
  are accustomed to single-byte character sets often
  can't process a UCS-2 file in a reasonable way, even
  if the file only contains characters from the ASCII character set.
  For instance, a program written in C that expects the zero byte to
  terminate strings will choke on a UCS-2 file containing mostly
  English text because almost every other byte is zero.
- UCS-2 is limited to 65,536 characters.
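The first two disadvantages are easy to observe directly. The sketch below is illustrative only: Python has no codec named UCS-2, so it uses `utf-16-be` and `utf-16-le`, which produce the same bytes as UCS-2 for characters in the first 65,536 code points. The helper `c_string_view` is a hypothetical stand-in for what a C routine that stops at the first zero byte would see.

```python
def c_string_view(data: bytes) -> bytes:
    """Return the bytes a zero-terminated string reader would see."""
    return data.split(b"\x00", 1)[0]


text = "Hello, XML"
latin1 = text.encode("latin-1")       # one byte per character
ucs2_be = text.encode("utf-16-be")    # two bytes per character, no byte-order mark
ucs2_le = text.encode("utf-16-le")

print(len(latin1), len(ucs2_be))      # 10 20
print(ucs2_be.count(0))               # 10 zero bytes, one per ASCII character
print(c_string_view(ucs2_be))         # b'' (the very first byte is zero)
print(c_string_view(ucs2_le))         # b'H' (the second byte is zero)
```

Either way, a tool that treats the zero byte as a terminator sees at most one character of the text.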
The last problem isn't so important in practice,
since the first 65,536 code points of Unicode cover most
people's needs, apart from dead languages
like Ugaritic and fictional scripts like Tengwar. Some mathematical
symbols also fall outside this range. Unicode does, however,
provide a means of representing code points beyond 65,535 by
recognizing certain two-byte sequences as half of a surrogate pair. A
Unicode document that uses UCS-2 plus surrogate pairs is said to be
in the UTF-16 encoding.
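For readers who want to see the arithmetic, here is a minimal sketch of how UTF-16 splits a code point above 65,535 into two 16-bit halves. The function name `to_surrogate_pair` is hypothetical; the constants #xD800 and #xDC00 are the starting values of the high and low surrogate ranges in UTF-16.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above 65,535 into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the surrogate-pair range")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


# #x1D400 is a mathematical bold capital A, one of the symbols beyond 65,535.
high, low = to_surrogate_pair(0x1D400)
print(f"{high:04X} {low:04X}")  # D835 DC00

# Cross-check against Python's own UTF-16 encoder.
print("\U0001D400".encode("utf-16-be").hex(" "))  # d8 35 dc 00
```

Each half occupies two bytes, so a character beyond 65,535 takes four bytes in UTF-16 while everything else still takes two.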
The other two problems, however, are more likely to affect most
developers. UTF-8 is an alternative encoding for Unicode that
addresses both.