However, XML parsers generally can't count on the
availability of such information. Even if they can, they
can't necessarily assume that it's
accurate. Therefore, an XML parser will attempt to guess the
character set based on the first several bytes of the document. The
main checks the parser makes include the following:
-
If the first two bytes of the document are #xFEFF, then the
parser recognizes the bytes as the Unicode byte-order
mark. It then guesses that the document is written in the
big-endian, UCS-2 encoding of Unicode. With that knowledge, it can
read the rest of the document.
-
If the first two bytes of the document are #xFFFE, then the parser
recognizes the little-endian form of the Unicode byte-order mark. It
now knows that the document is written in the little-endian, UCS-2
encoding of Unicode, and with that knowledge it can read the rest of
the document.
-
If the first four bytes of the document are #x3C3F786D, that is, the
ASCII characters <?xm, then the parser guesses that the
file is written in a superset of ASCII. In particular, it assumes
that the file is written in the UTF-8 encoding of Unicode. Even if
that guess is wrong, it is sufficient to continue reading the
document, because the XML declaration itself uses only ASCII
characters. The parser reads on until it reaches the encoding
declaration and finds out what the character set really is.
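
The three checks above can be sketched in a few lines of Python. This is only an illustration, not how any particular parser is implemented. The function name guess_encoding is hypothetical, and the returned labels are Python codec names chosen for convenience. Python has no separate UCS-2 codec, so the sketch assumes the UTF-16 codecs as stand-ins for big-endian and little-endian UCS-2.

```python
from typing import Optional


def guess_encoding(prefix: bytes) -> Optional[str]:
    """Guess a document's encoding from its first few bytes.

    Hypothetical helper that mirrors only the three checks described
    in the text. Returns a Python codec name, or None if no check matches.
    """
    # Byte-order mark #xFEFF: big-endian two-byte Unicode.
    # Assumption: "utf-16-be" stands in for big-endian UCS-2.
    if prefix[:2] == b"\xfe\xff":
        return "utf-16-be"

    # Byte-order mark #xFFFE: little-endian two-byte Unicode.
    # Assumption: "utf-16-le" stands in for little-endian UCS-2.
    if prefix[:2] == b"\xff\xfe":
        return "utf-16-le"

    # #x3C3F786D is "<?xm" in ASCII: assume an ASCII superset,
    # provisionally UTF-8, until the encoding declaration is read.
    if prefix[:4] == b"\x3c\x3f\x78\x6d":
        return "utf-8"

    # None of the checks described in the text matched.
    return None
```

For example, guess_encoding(b"<?xml version='1.0'?>") returns "utf-8", and a document that begins with the bytes FE FF yields "utf-16-be". A result of "utf-8" is provisional: a real parser would go on to read the encoding declaration and switch if it names a different character set.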