

B.3. Document Information Item

The Document Information Item is the root of the information found in an XML document. There is only one such root item.

This information item begins with the ContentHandler.startDocument() call and ends with the ContentHandler.endDocument() call. Many SAX2 event calls are used to construct its children or constituents.

[children]

See the sections for each type of Information Item: Document Type Declaration (one, if present), Element (exactly one), Processing Instruction (possibly many), and Comment (possibly many).

[document element]


This is the element in the [children] property.
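As a rough illustration of how these callbacks bracket the document and expose its [document element], here is a minimal sketch. The class name DocumentTracker and its fields are hypothetical, and it assumes the JAXP parser bundled with the JDK. It also assumes the input has no DTD, since processing instructions inside a DTD would be reported through the same callback and counted too.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records the boundaries of the Document Information
// Item and the name of its [document element].
class DocumentTracker extends DefaultHandler {
    boolean started;          // set by startDocument()
    boolean ended;            // set by endDocument()
    String documentElement;   // qName of the single top-level element
    int topLevelPIs;          // processing instructions seen outside any element
    private int depth;        // current element nesting depth

    @Override public void startDocument() { started = true; }

    @Override public void endDocument() { ended = true; }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (depth == 0) {
            documentElement = qName;   // first element at depth 0 is the document element
        }
        depth++;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        depth--;
    }

    @Override
    public void processingInstruction(String target, String data) {
        if (depth == 0) {
            topLevelPIs++;             // a child of the document, not of an element
        }
    }

    // Parses the given XML text and returns the populated tracker.
    static DocumentTracker track(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        DocumentTracker tracker = new DocumentTracker();
        reader.setContentHandler(tracker);
        reader.parse(new InputSource(new StringReader(xml)));
        return tracker;
    }
}
```

Comments among the document's children would need a LexicalHandler as well; this sketch leaves them out.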



[notations]

See the section on Notation Information Items. (Unordered.)

[unparsed entities]


See the section on Unparsed Entity Information Items. (Unordered.)

[base URI]

Locator.getSystemId(), or XMLReader.parse()

Locator may be used during the startDocument() callback (and earlier callbacks, unless they were made in the context of an external parameter entity).


Alternatively, for any parsers that don't provide a Locator, applications using an XMLReader are responsible for providing this information (if it exists) to the parse() method. This is passed directly as the string parameter or indirectly as the systemId property of an InputSource.
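The two routes described for [base URI] might be combined as in the following sketch: prefer what the Locator reports during startDocument(), and fall back to the system ID the application supplied. The class name BaseUriHandler, the method baseUriOf(), and the example URI are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: captures the document's base URI.
class BaseUriHandler extends DefaultHandler {
    private Locator locator;   // stays null if the parser provides no Locator
    String baseUri;

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startDocument() {
        if (locator != null) {
            baseUri = locator.getSystemId();
        }
    }

    // Parses XML text, passing the system ID indirectly through the InputSource.
    static String baseUriOf(String xml, String systemId) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        BaseUriHandler handler = new BaseUriHandler();
        reader.setContentHandler(handler);
        InputSource source = new InputSource(new StringReader(xml));
        source.setSystemId(systemId);
        reader.parse(source);
        // Fall back to what the application supplied if the Locator gave nothing.
        return handler.baseUri != null ? handler.baseUri : systemId;
    }
}
```

Because the character stream is supplied directly, the parser does not need to open the system ID; it serves only as the base for resolving relative URIs.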

[character encoding scheme]

unavailable; or InputSource.getEncoding()

Normally this property is unavailable; it won't affect the interpretation of character data in Java. However, applications will in rare cases provide this to the parser when they call XMLReader.parse(InputSource) to start parsing. It's likely that an upcoming extension API will provide this information.
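The rare case where the application knows the encoding and passes it to the parser could look like the sketch below. The class and method names are hypothetical; the only SAX calls used are InputSource.setEncoding() and XMLReader.parse(InputSource). It assumes the document carries no encoding declaration of its own.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: parses a byte stream whose encoding is known only
// to the application (for example, from a transport-level header).
class EncodingHint {
    static String textOf(byte[] bytes, String encoding) throws Exception {
        InputSource source = new InputSource(new ByteArrayInputStream(bytes));
        source.setEncoding(encoding);   // readable later via source.getEncoding()

        StringBuilder text = new StringBuilder();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);   // already decoded into Java chars
            }
        });
        reader.parse(source);
        return text.toString();
    }
}
```

The handler never sees the encoding name; it receives ordinary Java characters, which is why the property rarely matters to application code.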



[standalone]

It's likely that an upcoming extension API will provide this information using an is-standalone feature flag.



[version]

You can probably assume the value of this property is "1.0" for now. It's likely that an upcoming extension API will provide this information.
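SAX releases later than the one described here added the org.xml.sax.ext.Locator2 interface, whose getXMLVersion() method appears to be the kind of extension anticipated above. The following sketch uses it when the parser's Locator happens to implement it and otherwise keeps the "1.0" assumption. The class name VersionHandler is hypothetical, and whether a given parser supplies a Locator2 is an assumption to check at run time.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.Locator2;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports the XML version, defaulting to "1.0".
class VersionHandler extends DefaultHandler {
    private Locator locator;
    String version = "1.0";   // the default assumed by the text above

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // By the first element, any XML declaration has been read.
        if (locator instanceof Locator2) {
            String reported = ((Locator2) locator).getXMLVersion();
            if (reported != null) {
                version = reported;
            }
        }
    }

    static String versionOf(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        VersionHandler handler = new VersionHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.version;
    }
}
```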

[all declarations processed]

ContentHandler.skippedEntity(); LexicalHandler.endDTD()

When endDTD() is invoked, the value of this property is known: if no external parameter entities were reported as skipped through skippedEntity(), the value is true. If the parser doesn't support LexicalHandler, the first call to startElement() may be used instead of endDTD(), since it can only occur after the DTD has been processed.
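The rule for [all declarations processed] can be sketched as a handler that watches skippedEntity() and fixes the value at endDTD(), or at the first startElement() if endDTD() is never delivered. The class name DeclarationsTracker is hypothetical. Treating the name "[dtd]" (the external subset) like a skipped parameter entity is an assumption of this sketch, as is using org.xml.sax.ext.DefaultHandler2, a convenience base class from later SAX releases.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical helper: derives [all declarations processed] from SAX2 events.
class DeclarationsTracker extends DefaultHandler2 {
    private boolean skippedDeclarations;
    Boolean allDeclarationsProcessed;   // null until the value is known

    @Override
    public void skippedEntity(String name) {
        // Parameter entity names start with '%'; "[dtd]" names the external subset.
        if (name.startsWith("%") || name.equals("[dtd]")) {
            skippedDeclarations = true;
        }
    }

    @Override
    public void endDTD() {
        decide();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        decide();   // fallback when no LexicalHandler events arrive
    }

    private void decide() {
        if (allDeclarationsProcessed == null) {
            allDeclarationsProcessed = !skippedDeclarations;
        }
    }

    static boolean check(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        DeclarationsTracker tracker = new DeclarationsTracker();
        reader.setContentHandler(tracker);
        try {
            reader.setProperty("http://xml.org/sax/properties/lexical-handler", tracker);
        } catch (SAXException e) {
            // Parser doesn't support LexicalHandler; startElement() decides instead.
        }
        reader.parse(new InputSource(new StringReader(xml)));
        return tracker.allDeclarationsProcessed;
    }
}
```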

Because text in Java is always accessed using UTF-16 character strings or arrays, most applications won't need to worry about encoding issues; the SAX2 parser handles that. However, there are cases when encoding may matter:

Input normalization

Some recent XML standards require that text be normalized. For example, XML Canonicalization (as used in digital signature applications) requires the use of Unicode Normalization Form C; some other W3C specifications have the same requirement. Text originally represented in UTF-8 or UTF-16 might need further normalization to remove some deprecated character codes that can be represented using those encodings.

Such encoding data is required on a per-entity basis, not a per-document basis as implied by the Infoset specification. And for internal entity expansions or defaulted attributes, you'll need to normalize if the encoding associated with the original definition supported denormalized text.

Output encoding

When using an output encoding that is not based on the Unicode character set, you may not be able to represent XML names that use particular characters. For example, ASCII cannot handle element or attribute names using accented characters (used in Europe and Latin America) or using ideographic characters (used in Asia).

The preferred encoding solution is to always use UTF-8 or UTF-16 when outputting XML, so that such problems cannot occur and so that all XML processors can work with such output. Similar logic applies to display systems like window systems: prefer font rendering systems that use Unicode over those tied to some specific encoding.
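Both concerns above can be checked with standard Java library calls, as in this small sketch. The class and method names are hypothetical. It uses java.text.Normalizer for Normalization Form C and a CharsetEncoder to test whether an XML name survives a chosen output encoding.

```java
import java.nio.charset.Charset;
import java.text.Normalizer;

// Hypothetical helpers for the two encoding concerns discussed above.
class EncodingChecks {
    // Input normalization: returns the text in Unicode Normalization Form C.
    // Text delivered through several characters() calls should be buffered
    // first, since a combining sequence may be split across calls.
    static String toNfc(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    // Output encoding: true if every character of the name can be written
    // in the named encoding without loss.
    static boolean canWriteName(String name, String encoding) {
        return Charset.forName(encoding).newEncoder().canEncode(name);
    }
}
```

For example, canWriteName("résumé", "US-ASCII") is false while the same name passes for UTF-8, which is the practical argument for always writing XML in UTF-8 or UTF-16.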


Copyright © 2002 O'Reilly & Associates. All rights reserved.