The XML Declaration (XML in a Nutshell, 2nd Edition)

2.9.1. encoding

So far we've been a little cavalier about encodings. We've said that XML documents are composed of pure text, but we haven't said what encoding that text uses. Is it ASCII? Latin-1? Unicode? Something else?

The short answer to this question is "Yes." The long answer is that by default XML documents are assumed to be encoded in UTF-8, a variable-length encoding of the Unicode character set. UTF-8 is a strict superset of ASCII, so pure ASCII text files are also UTF-8 documents. However, most XML processors, especially those written in Java, can handle a much broader range of character sets. All you have to do is tell the parser which character encoding the document uses. Preferably this is done through metadata stored in the filesystem or provided by the server. However, not all systems provide character-set metadata, so XML also allows documents to specify their own character set with an encoding declaration inside the XML declaration. Example 2-8 shows how you'd indicate that a document was written in the ISO-8859-1 (Latin-1) character set, which includes letters like ö and ç needed for many non-English Western European languages.

Example 2-8. An XML document encoded in Latin-1

<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<person>
  Erwin Schrödinger
</person>
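
To see what the encoding declaration buys you, here is a minimal sketch in Python that hands the bytes of Example 2-8 to the standard library's xml.etree.ElementTree, once with the declaration and once without. Python stands in here for any XML processor, and the variable names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Example 2-8 as raw bytes; in Latin-1 the letter ö is the single byte 0xF6.
with_declaration = (
    b'<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>\n'
    b'<person>\n'
    b'  Erwin Schr\xf6dinger\n'
    b'</person>\n'
)

# The same content with no XML declaration, so the parser assumes UTF-8.
without_declaration = b'<person>\n  Erwin Schr\xf6dinger\n</person>\n'

# With the declaration, the parser decodes 0xF6 as ö.
name = ET.fromstring(with_declaration).text.strip()

try:
    ET.fromstring(without_declaration)
    fallback_ok = True
except ET.ParseError:
    # 0xF6 followed by "d" is not a valid UTF-8 byte sequence.
    fallback_ok = False

# Show the decoded name re-encoded as UTF-8, plus the fallback result.
print(name.encode("utf-8"), fallback_ok)
```

The second parse is expected to fail because the Latin-1 byte for ö is not legal UTF-8 in that position, which is exactly the situation the encoding declaration is meant to prevent.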

The encoding attribute is optional in an XML declaration. If it is omitted and no metadata is available, then the Unicode character set is assumed. The parser may use the first several bytes of the file to try to guess which encoding of Unicode is in use. If metadata is available and it conflicts with the encoding declaration, then the encoding specified by the metadata wins. For example, if an HTTP header says a document is encoded in ASCII but the encoding declaration says it's encoded in UTF-8, then the parser will pick ASCII.
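
The precedence rule and the byte-sniffing fallback can both be sketched with Python's xml.etree.ElementTree, whose XMLParser accepts an encoding argument that overrides the encoding named in the document. The helper parse_with_metadata and the charset value below are hypothetical; a real application would take the charset from wherever its metadata lives, such as an HTTP header.

```python
import xml.etree.ElementTree as ET

# Latin-1 bytes whose encoding declaration wrongly claims UTF-8.
mislabeled = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<person>Erwin Schr\xf6dinger</person>'
)

def parse_with_metadata(data, charset=None):
    """Parse data, letting an externally supplied charset (if any) win."""
    parser = ET.XMLParser(encoding=charset)
    parser.feed(data)
    return parser.close()

# Hypothetical charset taken from external metadata: it beats the declaration.
overridden = parse_with_metadata(mislabeled, charset="ISO-8859-1").text

# Without metadata, the (wrong) declaration is believed and parsing fails.
try:
    parse_with_metadata(mislabeled)
    declaration_ok = True
except ET.ParseError:
    declaration_ok = False

# No declaration and no metadata: the leading byte order mark lets the
# parser recognize UTF-16 from the first bytes of the document.
utf16_bytes = "<person>Erwin Schr\u00f6dinger</person>".encode("utf-16")
guessed = ET.fromstring(utf16_bytes).text
```

In this sketch the metadata happens to be right and the declaration wrong. The rule in the text cuts both ways: if the metadata were wrong, it would still win, and the parse would fail or produce the wrong characters.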

The different encodings and the proper handling of non-English XML documents will be discussed in greater detail in Chapter 5.

2.9.2. standalone

If the standalone attribute has the value no, then an application may be required to read an external DTD (that is, a DTD in a file other than the one it's reading now) to determine the proper values for parts of the document. For instance, a DTD may provide default values for attributes that a parser is required to report even though they aren't actually present in the document.
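
To make the idea of a defaulted attribute concrete, the following Python sketch uses a purely internal DTD subset as a stand-in, since the standard-library parser is not expected to fetch external DTDs on its own. The nationality attribute and its default value are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# The person element carries no attributes in the instance document;
# the DTD supplies a default value for a hypothetical nationality attribute.
document = """<?xml version="1.0" standalone="yes"?>
<!DOCTYPE person [
  <!ATTLIST person nationality CDATA "unknown">
]>
<person>Erwin Schrodinger</person>
"""

person = ET.fromstring(document)

# The parser reports the defaulted attribute as if it had been written out.
reported = person.get("nationality")
print(reported)
```

Had the same ATTLIST declaration lived in an external file, a parser that skipped that file would report no nationality attribute at all, which is why such a document would need standalone="no".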

Documents that do not have DTDs, like all the documents in this chapter, can have the value yes for the standalone attribute. Documents that do have DTDs can also have the value yes for the standalone attribute if the DTD doesn't in any way change the content of the document or if the DTD is purely internal. Details for documents with DTDs are covered in Chapter 3.

The standalone attribute is optional in an XML declaration. If it is omitted, then the value no is assumed.
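
One way to observe this default is to ask the parser what the XML declaration actually said. The sketch below uses the XmlDeclHandler callback of Python's xml.parsers.expat module; the two helper functions are hypothetical names introduced for illustration.

```python
from xml.parsers import expat

def declared_standalone(document):
    """Return "yes", "no", or None (omitted) for a document's standalone flag."""
    seen = []

    def on_xml_decl(version, encoding, standalone):
        # expat reports 1 for yes, 0 for no, and -1 when the flag is omitted.
        seen.append({1: "yes", 0: "no"}.get(standalone))

    parser = expat.ParserCreate()
    parser.XmlDeclHandler = on_xml_decl
    parser.Parse(document, True)
    # No XML declaration at all means the handler never fired.
    return seen[0] if seen else None

def effective_standalone(document):
    """Apply the rule from the text: an omitted flag is treated as no."""
    return declared_standalone(document) or "no"

print(effective_standalone('<?xml version="1.0"?><person/>'))
```

Keeping the declared and effective values separate lets a caller tell an explicit standalone="no" apart from one that was simply left out, even though both are treated the same way.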
