2.6. Unicode, Character Sets, and Encodings
At the lowest level, computers see text as a series of non-negative
integers mapped onto character sets, which are collections of numbered
characters (and sometimes control codes) that some standards body
created. A very
common collection is the venerable US-ASCII character set,
which contains 128 characters, including upper- and lowercase letters
of the Latin alphabet, numerals, various symbols and space
characters, and a few special print codes inherited from the old days
of teletype terminals. By adding an eighth bit, this 7-bit system
is extended into a larger set with twice as many characters, such as
ISO Latin-1 (ISO-8859-1), used on many Unix systems. The added characters
include other European characters, such as Latin letters with accents,
Icelandic characters, ligatures, footnote marks, and legal symbols. Alas,
humanity, a species bursting with both creativity and pride, has
invented many more linguistic symbols than can be mapped onto an
8-bit number.
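
To make the idea of numbered characters concrete, here is a minimal sketch in Python (chosen only for illustration). The helper name `fits` is our own invention, not part of any library. It reports the number behind a few sample characters and whether each one fits in US-ASCII and in ISO Latin-1.

```python
import unicodedata

def fits(char, charset):
    """Return True if char can be represented in the named character set."""
    try:
        char.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

# Sample characters, written as escapes so the source stays pure ASCII:
# Latin capital A, e with acute accent, Icelandic thorn, a Han ideograph.
samples = ["A", "\u00e9", "\u00fe", "\u4e2d"]

for char in samples:
    print(
        ord(char),                      # the character's number
        unicodedata.name(char),         # its official name
        "ascii:", fits(char, "ascii"),
        "latin-1:", fits(char, "latin-1"),
    )
```

Only the first sample fits in the 7-bit set, the first three fit in the 8-bit set, and the Han ideograph fits in neither.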
For this reason, a new character encoding architecture called Unicode
has gained acceptance as the standard way to represent every written
script in which people might want to store data (or write computer
code). Depending on the encoding form used, it takes up to 32 bits to
describe a character, and the standard has room for more than a million
individual characters. For over a decade, the Unicode Consortium has been
filling up this space with characters ranging from the entire Han Chinese
character set to various mathematical, notational, and signage
symbols, and still leaves the encoding space with enough room to grow
for the coming millennium or two.
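
As a rough illustration of how far this numbering extends beyond 8 bits, the following Python sketch prints the Unicode address of a Han ideograph, a mathematical symbol, and a musical notation symbol, then the total size of the address space. The helper `code_point_label` is a hypothetical name used only for this example.

```python
import sys
import unicodedata

def code_point_label(char):
    """Format a character's Unicode address in the usual U+XXXX notation."""
    return f"U+{ord(char):04X}"

# A Han ideograph, the infinity sign, and a musical G clef.
samples = ["\u4e2d", "\u221e", "\U0001D11E"]

for char in samples:
    print(code_point_label(char), unicodedata.name(char))

# Python exposes the highest Unicode address as sys.maxunicode;
# adding one (for address zero) gives the size of the whole space.
print("addresses available:", sys.maxunicode + 1)
```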
Given all this effort we're putting into hyping it,
it shouldn't surprise you to learn that, while an
XML document can use any type of encoding, it will by default assume
the Unicode-flavored, variable-length encoding known as
UTF-8. This encoding uses between one and four bytes per character
(the original definition allowed up to six). A character whose Unicode
address is 127 or lower takes a single byte; for any higher address,
the first byte also signals how many bytes the character occupies.
It's possible to write an entire document in 1-byte characters and
have it be indistinguishable from US-ASCII (a humble address block
with addresses ranging from 0 to 127), but if you need the occasional
high character, or if you need a lot of them (as you would when
storing Asian-language data, for example), it's easy to encode in UTF-8.
Unicode-aware processors handle the encoding correctly and display
the right glyphs, while many older applications that treat text as plain
bytes simply pass the multibyte characters through unharmed. Since Version
5.6, Perl has handled UTF-8 characters with increasing finesse.
We'll discuss Perl's handling of
Unicode in more depth in Chapter 3, "XML Basics: Reading and Writing".
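
The variable-length behavior of UTF-8 can be seen in a short Python sketch (Python is used here only for illustration, and `utf8_length` is our own helper name). It encodes four characters of increasing Unicode address, prints how many bytes each needs, and checks that plain ASCII text produces identical bytes under both encodings.

```python
def utf8_length(char):
    """Return the number of bytes UTF-8 needs for a single character."""
    return len(char.encode("utf-8"))

# Latin capital A, e with acute accent, a Han ideograph, a musical G clef.
samples = ["A", "\u00e9", "\u4e2d", "\U0001D11E"]

for char in samples:
    encoded = char.encode("utf-8")
    print(f"U+{ord(char):04X}", utf8_length(char), "byte(s):", encoded.hex())

# Text limited to addresses 0 through 127 is byte-for-byte the same
# in US-ASCII and UTF-8.
plain = "plain text"
print(plain.encode("ascii") == plain.encode("utf-8"))

# An accented letter is one byte in ISO Latin-1 but two in UTF-8.
print("\u00e9".encode("latin-1").hex(), "\u00e9".encode("utf-8").hex())
```

The byte counts grow from one to four as the addresses climb, which is why a document that never leaves the ASCII range costs nothing extra to store as UTF-8.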
Copyright © 2002 O'Reilly & Associates. All rights reserved.