Chapter 26. Character Sets
By default, an XML parser assumes that XML documents
are written in the UTF-8 encoding of
Unicode. However, documents may instead be written in any character
set the XML processor understands, provided that
there's either some external metadata like an HTTP
header or internal metadata like a byte order mark or an encoding
declaration that specifies the character set. For example, a document
written in the Latin-5 character set would need this XML declaration:
<?xml version="1.0" encoding="ISO-8859-9"?>
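As a small illustration of the encoding declaration at work, the following Python sketch saves a document as Latin-5 bytes and lets the standard library parser decode it. The element name city and its content are made up for the example. It assumes a CPython build whose bundled expat parser can map ISO-8859-9 through Python's own codec; other parsers may not support this character set.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element name and text are illustrative only.
text = '<?xml version="1.0" encoding="ISO-8859-9"?>\n<city>\u0130stanbul</city>'

# Encode to Latin-5 bytes, as an author working in that character set would save it.
latin5_bytes = text.encode("iso-8859-9")

# The parser reads the encoding declaration and decodes the bytes accordingly.
root = ET.fromstring(latin5_bytes)
print(root.text)

# Without a declaration, the same bytes are assumed to be UTF-8 and fail to parse.
undeclared = "<city>\u0130stanbul</city>".encode("iso-8859-9")
try:
    ET.fromstring(undeclared)
    parsed_without_declaration = True
except ET.ParseError:
    parsed_without_declaration = False
print(parsed_without_declaration)
```

The second half of the sketch shows why the metadata matters: the Latin-5 byte for the dotted capital I is not a valid UTF-8 sequence, so the undeclared document is rejected.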
Most good XML processors understand many common character sets. The
XML specification recommends the character set names shown in Table 26-1.
When using any of these character sets, you should use these names.
Of these character sets, only UTF-8 and UTF-16 must be supported by
all XML processors, though many XML processors support all character
sets listed here, and many support additional character sets besides.
When using character sets not listed here, you should use the names
specified in the IANA character sets registry at http://www.iana.org/assignments/character-sets.
Table 26-1. Character set names defined by the XML 1.0 specification
- UTF-8
The default encoding used in XML documents, unless an encoding
declaration, byte order mark, or external metadata specifies
otherwise; a variable-width encoding of Unicode that uses one to four
bytes per character. UTF-8 is designed such that all ASCII documents
are legal UTF-8 documents, which is not true for other Unicode
encodings, such as UTF-16. This character set is the best
encoding choice if your XML documents contain limited amounts of
non-ASCII text.
- UTF-16
A two-byte encoding of Unicode in which all Unicode characters
defined in Unicode 3.0 and earlier (including the ASCII characters)
occupy exactly two bytes. However, characters from planes 1 through
14, added in Unicode 3.1 and later, are encoded as surrogate pairs
that occupy four bytes each. This encoding is the best choice if your
XML documents contain substantial amounts of Chinese, Japanese, or
Korean.
- ISO-10646-UCS-2
The Basic Multilingual Plane of Unicode, i.e., plane 0. This
character set is the same as UTF-16, except that it does not allow
surrogate pairs to represent characters with code points beyond
65,535. The difference is only significant in Unicode 3.1 and later.
Each Unicode character is represented as exactly one two-byte,
unsigned integer. Determining endianness requires a byte-order mark
at the beginning of the file.
- ISO-10646-UCS-4
A four-byte encoding of Unicode in which each Unicode character is
represented as exactly one four-byte, unsigned integer. Determining
endianness requires a byte-order mark at the beginning of the file.
- ISO-8859-1
Latin-1; ASCII plus the characters needed for most Western European
languages, including Danish, Dutch, English, Faroese, Finnish,
Flemish, German, Icelandic, Irish, Italian, Norwegian, Portuguese,
Spanish, and Swedish. Some non-European languages, such as Hawaiian,
Indonesian, and Swahili, also use these characters.
- ISO-8859-2
Latin-2; ASCII plus the characters needed for most Central European
languages, including Croatian, Czech, Hungarian, Polish, Slovak, and
Slovenian.
- ISO-8859-3
Latin-3; ASCII plus the characters needed for Esperanto, Maltese,
Turkish, and Galician. Latin-5, ISO-8859-9, however, is now preferred
for Turkish.
- ISO-8859-4
Latin-4; ASCII plus the characters needed for the Baltic languages
Latvian, Lithuanian, Greenlandic, and Lappish. Now largely replaced by
ISO-8859-10, Latin-6.
- ISO-8859-5
ASCII plus the Cyrillic characters used for Byelorussian, Bulgarian,
Macedonian, Russian, Serbian, and Ukrainian.
- ISO-8859-6
ASCII plus Arabic.
- ISO-8859-7
ASCII plus modern Greek.
- ISO-8859-8
ASCII plus Hebrew.
- ISO-8859-9
Latin-5, which is essentially the same as Latin-1 (ASCII plus Western
Europe), except that the Turkish letters İ, ı, Ğ, ğ, Ş, and ş replace
the less commonly used Icelandic letters Ý, ý, Ð, ð, Þ, and þ.
- ISO-8859-10
Latin-6, which covers the characters needed for the Northern European
languages Estonian, Lithuanian, Greenlandic, Icelandic, Inuit, and
Lappish. It's similar to Latin-4, but drops some symbols and the
Latvian letter ŗ, adds a few extra letters needed for Inuit and
Lappish, and moves various characters around. ISO-8859-13 now
supersedes this character set.
- ISO-8859-11
Adds the Thai alphabet to basic ASCII.
However, it is not well supported by current XML parsers, and
you're probably better off using Unicode instead.
- ISO-8859-12
Not yet in existence and unlikely to exist in the foreseeable future.
At one point, this character set was considered for Devanagari, so
the number was reserved. However, this effort is not yet off the
ground, and it now seems likely that the increasing acceptance of
Unicode will make such a character set unnecessary.
- ISO-8859-13
Latin-7; another character set designed to cover the Baltic
languages. This set adds back in the Latvian letter ŗ and other
symbols dropped from Latin-6.
- ISO-8859-14
Latin-8, a variant of Latin-1 with extra letters needed for Gaelic
and Welsh, such as Ŵ, ŵ, and Ŷ. These letters mostly replace
punctuation marks, such as × and ¦.
- ISO-8859-15
Known officially as Latin-9 and unofficially as Latin-0; a revision of
Latin-1 that replaces the international currency symbol ¤
with the Euro sign €. It also replaces the seldom-used fraction
characters ¼, ½, and ¾ with the uncommon French letters Œ, œ, and Ÿ,
and the ¦, ¨, ´, and ¸ symbols with the letters Š, š, Ž, and ž, which
are used in Finnish. Otherwise, it's identical to ISO-8859-1.
- ISO-8859-16
Latin-10; intended primarily for Romanian.
- ISO-2022-JP
A seven-bit encoding of the character set defined in the Japanese
national standard JIS X-0208-1997 used on web pages and in email; see
RFC 1468.
- Shift_JIS
The encoding of the Japanese national standard character set JIS
X-0208-1997 used in Microsoft Windows.
- EUC-JP
The encoding of the Japanese national standard character set JIS
X-0208-1997 used by most Unixes.
Some parsers do not understand all these encodings. Specifically,
parsers based on James Clark's expat often support only the UTF-8,
UTF-16, ISO-8859-1, and US-ASCII encodings. Xerces-C supports ASCII,
UTF-8, UTF-16, UCS4, IBM037, IBM1140, ISO-8859-1, and Windows-1252.
IBM's XML4C parser, derived from the Xerces
codebase, adds over 100 more encodings, including ISO-8859 character
sets 1 through 9 and 15. However, for maximum cross-parser
compatibility, you should convert your documents to either UTF-8 or
UTF-16 before publishing them, even if you author them in another
encoding.
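The advice to convert documents to UTF-8 before publishing can be sketched in a few lines of Python. This is a minimal illustration, not a complete tool: the function name to_utf8 and the regular expression are assumptions made for the example, the caller must already know the source encoding, and the sketch only rewrites an encoding declaration that appears at the very start of the document.

```python
import re

# Matches the encoding pseudo-attribute inside an XML declaration at the
# start of the document. Group 1 is everything up to the opening quote,
# group 2 is the quote character itself.
_ENCODING_RE = re.compile(
    r'^(<\?xml\b[^>]*?\bencoding\s*=\s*)(["\'])[A-Za-z][A-Za-z0-9._-]*\2'
)

def to_utf8(data, source_encoding):
    """Re-encode an XML document as UTF-8 and update its encoding declaration.

    Hypothetical helper: source_encoding is a Python codec name supplied by
    the caller, not something detected from the bytes.
    """
    text = data.decode(source_encoding)
    # Rewrite the declared encoding, if any. A document with no declaration
    # is treated as UTF-8 by default, so nothing needs to be added.
    text = _ENCODING_RE.sub(r"\g<1>\g<2>UTF-8\g<2>", text, count=1)
    return text.encode("utf-8")

# Usage: a Latin-5 document (made-up content) converted for publication.
latin5 = ('<?xml version="1.0" encoding="ISO-8859-9"?>\n'
          '<city>\u0130stanbul</city>').encode("iso-8859-9")
utf8 = to_utf8(latin5, "iso-8859-9")
print(utf8.decode("utf-8"))
```

Updating the declaration matters as much as re-encoding the bytes: a UTF-8 document that still claims to be ISO-8859-9 would be misread by any parser that trusts the declaration.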
26.1. Character Tables
The XML 1.0 specification divides Unicode into five overlapping sets:
- Name characters
Characters that can appear in an
element, attribute, or entity name. These characters are letters,
ideographs, digits, and the punctuation marks _,
-, ., and :.
In the tables that follow, name characters are shown in bold type,
such as A, Å, 1, 2, 3, and _.
- Name start characters
Characters that can be the first
character of an element, attribute, or entity name. These characters
are letters, ideographs, and the underscore _. In
the tables that follow, these characters are shown with a gray
background, such as A, Å, and _. Because name start characters are a
subset of name characters, they are also shown in bold.
- Character data characters
All characters that can be used
anywhere in an XML document, including element and attribute content,
comments, and DTDs. This set includes almost all Unicode characters,
except for surrogates and most C0 control characters. These
characters are shown in a normal typeface. If they are name
characters, then they will be bold. If they are also name start
characters, they'll have a gray background.
- Illegal characters
Characters that may not appear
anywhere in an XML document, such as in part of a name, character
data, or comment text. These characters are shown in italic, such as
NUL or BEL. Most of these
characters are either C0 control characters or half of a surrogate
pair.
- Unassigned code points
Byte sequences that are not assigned to a character as of Unicode
3.1.1. Theoretically, a program could produce a file containing one
of these byte sequences, but their meaning is undefined and they
should be avoided. They are represented in the following tables as
Figure 26-1 shows the relationship between these sets. Note that all
name start characters are name characters and that all name
characters are character data characters.
Figure 26-1. XML's division of Unicode characters
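The name and name start sets described above can be approximated in code. The sketch below uses the NameStartChar and NameChar ranges from the fifth edition of XML 1.0, which is an assumption on my part: those ranges are broader than the letter, ideograph, and digit classes that the tables in this chapter are based on, and they formally admit the colon as a start character. The function names are invented for the example.

```python
# Ranges from the NameStartChar production of XML 1.0 (Fifth Edition).
_NAME_START_RANGES = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]

# Extra ranges that NameChar allows after the first character:
# hyphen and period, digits, middle dot, combining marks, and two ties.
_NAME_EXTRA_RANGES = [
    (0x2D, 0x2E), (0x30, 0x39), (0xB7, 0xB7),
    (0x300, 0x36F), (0x203F, 0x2040),
]

def _in_ranges(code_point, ranges):
    return any(low <= code_point <= high for low, high in ranges)

def is_name_start_char(ch):
    """True if ch may begin an XML name."""
    return _in_ranges(ord(ch), _NAME_START_RANGES)

def is_name_char(ch):
    """True if ch may appear anywhere in an XML name."""
    return is_name_start_char(ch) or _in_ranges(ord(ch), _NAME_EXTRA_RANGES)

def is_xml_name(name):
    """True if name is non-empty, starts with a name start character,
    and continues with name characters only."""
    return (bool(name)
            and is_name_start_char(name[0])
            and all(is_name_char(ch) for ch in name[1:]))

for candidate in ["\u00c5ngstr\u00f6m", "_x-1.y", "1A", "-a"]:
    print(candidate, is_xml_name(candidate))
```

Because every name start character is also a name character, is_name_char simply extends is_name_start_char, mirroring the subset relationship between the two sets.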
In all the tables that follow, each cell's upper
lefthand corner contains the character's two-digit
Unicode hexadecimal value and the upper righthand corner contains the
character's Unicode decimal value. You can insert a
character in an XML document by prefixing the decimal value with
&# and suffixing it with a semicolon. Thus,
Unicode character 69, the capital letter E, can be written as
&#69;. Hexadecimal values work the same way,
except that you prefix them with &#x. In
hexadecimal, the letter E is 45, so it can also be written as
&#x45;.
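The two forms of character reference can be generated and checked with a short Python sketch. The helper names and the element name letters are hypothetical; the parser is the one in Python's standard library.

```python
import xml.etree.ElementTree as ET

def decimal_reference(ch):
    """Build a decimal character reference for one character."""
    return f"&#{ord(ch)};"

def hex_reference(ch):
    """Build a hexadecimal character reference for one character."""
    return f"&#x{ord(ch):X};"

# Both forms of the reference for the capital letter E.
dec_e = decimal_reference("E")
hex_e = hex_reference("E")

# A parser resolves either reference back to the same character.
# The element name "letters" is a placeholder for the example.
root = ET.fromstring(f"<letters>{dec_e}{hex_e}</letters>")
print(dec_e, hex_e, root.text)
```

Both references resolve to the same letter, so the choice between decimal and hexadecimal is purely a matter of convenience.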
26.1.1. ASCII
Most character sets in common use today are supersets of ASCII. That
is, code points 0 through 127 are assigned to the same characters to
which ASCII assigns them. The only notable exceptions are the
EBCDIC-derived character sets. In particular, Unicode is a superset
of ASCII, and code points 0 through 127 identify the same characters
in Unicode as they do in ASCII. Figure 26-2 lists the ASCII character
set.
Figure 26-2. The first 128 Unicode characters (known as the ASCII character set)
Characters 0 through 31 and character 127 are nonprinting control
characters, sometimes called the C0 controls to distinguish
them from the C1 controls used in the ISO-8859 character sets. Of
the 32 characters from 0 through 31, only the carriage return,
linefeed, and horizontal tab may appear in XML documents. The other 29
may not appear anywhere in an XML document, including in tags,
comments, or parsed character data. They may not be inserted with
character references, such as &#12;. For example, you
may not use form feeds to insert page breaks. Character 127 (DEL) is
technically allowed by the XML 1.0 grammar, but it has no useful role
in a document and is best avoided.
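The restriction on C0 controls can be checked mechanically. The following Python sketch encodes the Char production of XML 1.0 as a predicate and then confirms, using the standard library parser, that a form feed is rejected whether it appears literally or as a character reference. The function names and the p element are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def is_xml10_char(code_point):
    """True if the code point is allowed by the XML 1.0 Char production."""
    return (code_point in (0x9, 0xA, 0xD)
            or 0x20 <= code_point <= 0xD7FF
            or 0xE000 <= code_point <= 0xFFFD
            or 0x10000 <= code_point <= 0x10FFFF)

# The code points from 0 through 31 that may not appear in a document.
forbidden_c0 = [cp for cp in range(0x20) if not is_xml10_char(cp)]
print(len(forbidden_c0))

def parses(document):
    """True if the standard library parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

tab_reference_ok = parses("<p>a&#9;b</p>")         # horizontal tab
form_feed_reference_ok = parses("<p>a&#12;b</p>")  # form feed reference
form_feed_literal_ok = parses("<p>a\x0cb</p>")     # literal form feed
print(tab_reference_ok, form_feed_reference_ok, form_feed_literal_ok)
```

A check like is_xml10_char is useful when generating XML from untrusted text, because stray control characters are a common cause of documents that fail to parse downstream.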
26.1.2. ISO-8859-1, Latin-1
The character sets defined by the ISO-8859
standard are popular supersets of ASCII.
They all provide the normal ASCII characters at code
points 0 through 127 and the C1 controls at 128 through 159; they
differ in the characters assigned to code points 160 through 255.
In particular, many Western European and American systems use a
character set called Latin-1. This set is the first code page defined
in the ISO-8859 standard and is also called ISO-8859-1. Although all
common encodings of Unicode represent code points 128 through 255
with different byte sequences than Latin-1 does, these code points
identify the same characters in both Latin-1 and Unicode. This
situation does not occur in other character sets.
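The relationship between Latin-1 and Unicode is easy to confirm with Python's built-in codecs. This sketch decodes every byte from 128 through 255 as ISO-8859-1 and checks that each resulting code point equals the byte value, then contrasts that with the UTF-8 encoding of one of those characters and with a Latin-5 byte. The variable names are invented for the example.

```python
# Every byte value in the upper half of a single-byte character set.
upper_half = bytes(range(128, 256))

# Decoding as Latin-1 yields the Unicode code point with the same number.
decoded = upper_half.decode("iso-8859-1")
same_code_points = all(ord(ch) == byte for ch, byte in zip(decoded, upper_half))
print(same_code_points)

# The encodings still differ: e with acute accent (code point 233) is one
# byte in Latin-1 but two bytes in UTF-8.
e_acute_latin1 = "\u00e9".encode("iso-8859-1")
e_acute_utf8 = "\u00e9".encode("utf-8")
print(e_acute_latin1, e_acute_utf8)

# In Latin-5 (ISO-8859-9), byte 221 is the Turkish dotted capital I at
# code point 304, so byte values and code points no longer line up.
latin5_char = bytes([0xDD]).decode("iso-8859-9")
print(ord(latin5_char))
```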
26.1.2.1. C1 controls
All ISO-8859 character sets begin with the same
32 extra nonprinting control characters in code points 128 through
159. These characters are used on terminals like the DEC VT-320 to provide
graphics functionality not included in ASCII, for example, erasing
the screen and switching it to inverse video or graphics mode. These
characters cause severe problems for anyone reading or editing an XML
document on a terminal or terminal emulator.
Fortunately, these characters are not necessary in XML documents.
Their inclusion in XML 1.0 was an oversight. They should have been
banned like the C0 controls. Unfortunately, many editors
incorrectly label documents written in the Cp1252 Windows
character set as ISO-8859-1. This character set does use
the code points between 128 and 159 for noncontrol graphics
characters. When documents written with this character set are
displayed or edited on a dumb terminal, they can effectively disable
the user's terminal. Similar problems exist with
most other Windows code pages for single-byte character sets.
In the spirit of being liberal in what you accept and conservative in
what you generate, you should never use Cp1252, correctly labeled or
otherwise. You should also avoid using other nonstandard code pages
for documents that move beyond a single system. On the other hand, if
you receive a document labeled as Cp1252 (or any other Windows code
page), it can be displayed if you're careful not to
throw it at a terminal unchanged. If you suspect that a document
labeled as ISO-8859-1 that uses characters between 128 and 159 is in
fact a Cp1252 document, you should probably reject it. This decision
is difficult, however, given the prevalence of broken software that
does not properly identify the documents it sends.
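One way to spot a likely mislabeled Cp1252 document is to look for bytes in the C1 range. The sketch below is a heuristic under stated assumptions, not a reliable detector: it only makes sense for a single-byte document that claims to be ISO-8859-1, the function name is invented, and the sample document is made up.

```python
def c1_byte_offsets(data):
    """Return the offsets of bytes in the C1 range (128 through 159)."""
    return [i for i, b in enumerate(data) if 0x80 <= b <= 0x9F]

# A made-up document containing curly quotes, saved as Cp1252 but
# labeled as ISO-8859-1.
sample = ('<?xml version="1.0" encoding="ISO-8859-1"?>'
          '<q>\u201cquoted\u201d</q>').encode("cp1252")

offsets = c1_byte_offsets(sample)
print(offsets)

# Trusting the label turns the quotes into C1 control characters.
as_latin1 = sample.decode("iso-8859-1")
# Treating the bytes as Cp1252 recovers the intended punctuation.
as_cp1252 = sample.decode("cp1252")
print(repr(as_latin1[-11:]), repr(as_cp1252[-11:]))
```

A nonempty result does not prove the document is Cp1252, since genuine C1 controls are legal in ISO-8859-1, but in practice it is a strong hint that the label is wrong.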
Latin-1 covers most Western European languages
that use some variant of the Latin alphabet. Characters 0 through 127
in this set are identical to the ASCII characters with the same code
points. Characters 128 to 159 are the C1 control characters used only
for dumb terminals. Character 160 is the nonbreaking space.
Characters 161 through 255 are accented characters, such as
è, á, and ö; non-U.S. punctuation
marks, such as £ and ¿; and a few new letters,
such as the Icelandic þ and the German ß. Figure 26-3 shows the upper
half of this character set. The lower half is identical to the ASCII
character set shown in Figure 26-2.
Figure 26-3. Unicode characters between 160 and 255 and the second half of the Latin-1, ISO-8859-1 character set
Copyright © 2002 O'Reilly & Associates. All rights reserved.