Appendix B. SAX2 and the XML Infoset

This appendix shows how the various parts of the XML Infoset are made available through the SAX2 event consumer APIs. Think of it as a structural index for concepts in SAX2, or for the underlying XML information structure. Use it when you're developing SAX2-based software that needs access to particular data. It can also be viewed as an Infoset conformance statement for SAX2; it will help you understand which parts of the XML Infoset aren't supported by SAX2 and see where SAX2 lets you access information beyond what the Infoset addresses. The Infoset is not a data structure; what matters is that the information is provided, not that it is randomly accessible.

The presentation here follows the Infoset specification itself; the structure and order are identical. Information items are similar to object types, and each is presented in its own section. Information items consist of sets of named [properties], each of which is presented in a table. Properties can have one or more values, sometimes ordered, which are provided in SAX2 using consumer callbacks. If you know XML, you should be able to make sense of this appendix without reading the Infoset specification, but you'll need the specification to understand some details.

As of this writing, the XML Infoset (http://www.w3.org/TR/xml-infoset/) has recently been finalized. This appendix was written using the 24 October 2001 "Recommendation," which omits almost all declarations found in the DTD. Some other W3C specifications use related data models, like the XPath Data Model. The W3C approach to XML Schema augments this core Infoset with additional data-typing information items, defining the Post-Schema-Validation Infoset (PSVI) items and properties associated with schema-valid XML text. Most of those PSVI properties relate to data-typing models.

B.1. Event Producer Issues

Although the focus of this appendix is on how SAX2 event consumers see Infoset data, you may also need to pay attention to some producer-side issues beyond ensuring that the event stream itself is legal (and perhaps valid). As the Infoset specification puts it, "synthetic" infosets might have inconsistencies that real ones (from XML documents) don't. If you produce a synthetic infoset, by writing SAX events directly rather than by using a parser, make sure the event stream is properly constructed.
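As a minimal sketch of what a properly constructed synthetic event stream looks like, the following Java class writes events directly to a ContentHandler instead of using a parser. The class name, the namespace URI, and the element and attribute names are hypothetical, chosen only for illustration. The point is the ordering: prefix mappings are declared just before the startElement() they scope, every start event has a matching end event, and the whole stream is bracketed by startDocument() and endDocument().

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

class SyntheticEvents {
    // Hypothetical namespace URI, used only for this illustration.
    static final String NS = "urn:example:greeting";

    /** Emits <g:hello lang="en">world</g:hello> as a hand-built event stream. */
    static void emitGreeting(ContentHandler out) throws SAXException {
        AttributesImpl atts = new AttributesImpl();
        // Unprefixed attributes are in no namespace, so the URI is empty.
        atts.addAttribute("", "lang", "lang", "CDATA", "en");

        out.startDocument();
        // The prefix mapping is reported immediately before its element ...
        out.startPrefixMapping("g", NS);
        out.startElement(NS, "hello", "g:hello", atts);

        char[] text = "world".toCharArray();
        out.characters(text, 0, text.length);

        out.endElement(NS, "hello", "g:hello");
        // ... and closed immediately after the matching endElement().
        out.endPrefixMapping("g");
        out.endDocument();
    }
}
```

Any ContentHandler can consume this stream; nothing in the consumer can tell whether a parser or application code produced it, which is why the producer has to get the consistency right.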

As noted earlier, you should make sure you always provide the document URI when you invoke XMLReader.parse(). Not only is this needed to correctly absolutize relative URIs found in the document's DTD (for notations and all types of external entities) and to provide accurate diagnostics, but it is essential for computing [base URI] properties in the document entity.
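The following sketch shows one way to supply the document URI when the text comes from a character stream rather than from a URI the parser opens itself. It assumes the JAXP bootstrap (SAXParserFactory) that ships with the JDK; the class and method names are hypothetical. The handler reads the system ID back through the Locator, which works only if the parser provides one.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class BaseUriDemo {
    /** Parses xml and returns the system ID reported at the first element, or null. */
    static String reportedSystemId(String xml, String documentUri) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        final String[] seen = new String[1];

        reader.setContentHandler(new DefaultHandler() {
            private Locator locator;   // stays null if the parser supplies no Locator

            @Override public void setDocumentLocator(Locator locator) {
                this.locator = locator;
            }

            @Override public void startElement(String uri, String local,
                                               String qName, Attributes atts) {
                // Locator data is only meaningful during a callback.
                if (locator != null && seen[0] == null) {
                    seen[0] = locator.getSystemId();
                }
            }
        });

        InputSource input = new InputSource(new StringReader(xml));
        // Without this call the parser has no base for relative URIs.
        input.setSystemId(documentUri);
        reader.parse(input);
        return seen[0];
    }
}
```

Calling reader.parse(String) with an absolute URI has the same effect; the explicit setSystemId() call matters when you hand the parser a stream it did not open.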

The namespace-prefixes feature on XMLReader instances has a problematic default (false); set its value to true unless you're comfortable with parsers hiding [namespace attributes] and [prefix] properties. (In this book, running with both the namespaces and namespace-prefixes features set to true is called mixed mode namespace support.) SAX2 parsers aren't required to support setting this feature value to true, but most do. If your parser doesn't support this, you can re-create prefixes and declarations, but they normally won't correspond to the original versions. This appendix also assumes you kept the default setting (true) for the namespaces feature flag.
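A short sketch of the effect of this flag, again assuming the JDK's JAXP bootstrap (where namespace awareness must be requested from the factory) and using hypothetical class and method names. It collects the qualified names of the attributes reported on each element, so you can see whether the xmlns declarations are exposed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class MixedModeDemo {
    static final String NAMESPACE_PREFIXES =
        "http://xml.org/sax/features/namespace-prefixes";

    /** Returns the qNames of all attributes reported while parsing xml. */
    static List<String> attributeQNames(String xml, boolean reportPrefixes)
            throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);   // turns on the "namespaces" feature
        XMLReader reader = factory.newSAXParser().getXMLReader();

        // May throw SAXNotRecognizedException or SAXNotSupportedException
        // on a parser that cannot honor the request; this sketch lets it propagate.
        reader.setFeature(NAMESPACE_PREFIXES, reportPrefixes);

        final List<String> names = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override public void startElement(String uri, String local,
                                               String qName, Attributes atts) {
                for (int i = 0; i < atts.getLength(); i++) {
                    names.add(atts.getQName(i));
                }
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return names;
    }
}
```

With the flag set to true, a declaration such as xmlns:p appears among the attributes; with the default, the declaration is visible only through startPrefixMapping() calls.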

Some SAX2 XMLReader implementations may not produce all of this information. Most of today's widely used SAX2 parsers are fully featured, so in practice this won't be a common problem. However, information provided through the optional SAX2 extension callbacks DeclHandler or LexicalHandler might not be available. Similarly, reporting of [base URI] ingredients through a Locator is also optional.
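Because the extension handlers are optional, code that depends on them should cope with a parser that rejects them. The sketch below, with hypothetical class and method names, tries to install a LexicalHandler through the standard property ID and simply reports no comments if the parser does not recognize or support the property. It uses DefaultHandler2 from org.xml.sax.ext as a convenient base class that already implements LexicalHandler.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

class OptionalHandlersDemo {
    static final String LEXICAL_HANDLER =
        "http://xml.org/sax/properties/lexical-handler";

    /** Returns true if the parser accepted the handler for propertyId. */
    static boolean tryInstall(XMLReader reader, String propertyId, Object handler) {
        try {
            reader.setProperty(propertyId, handler);
            return true;
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return false;   // this parser cannot deliver those events
        }
    }

    /** Collects comment text; the list is empty if comments are not reported. */
    static List<String> comments(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        final List<String> found = new ArrayList<>();

        DefaultHandler2 handler = new DefaultHandler2() {
            @Override public void comment(char[] ch, int start, int length) {
                found.add(new String(ch, start, length));
            }
        };
        reader.setContentHandler(handler);
        tryInstall(reader, LEXICAL_HANDLER, handler);

        reader.parse(new InputSource(new StringReader(xml)));
        return found;
    }
}
```

The same pattern applies to DeclHandler, whose property ID is http://xml.org/sax/properties/declaration-handler.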

The SAX2 ErrorHandler exposes some data that is not addressed by the XML Infoset: validity and well-formedness errors. Exposing such information is required for parser conformance to the XML 1.0 specification.
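To illustrate the distinction, here is a sketch of an ErrorHandler that tallies what the parser reports. It assumes the JDK's JAXP bootstrap with validation turned on; the class names are hypothetical. Validity errors arrive through error() and parsing continues, while well-formedness errors arrive through fatalError(), after which this handler stops the parse by rethrowing the exception.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

class ErrorReportDemo {
    /** Counts each kind of report; fatal errors also abort the parse. */
    static class Tally implements ErrorHandler {
        int warnings, errors, fatals;

        @Override public void warning(SAXParseException e) { warnings++; }

        // Validity errors are recoverable, so just count them.
        @Override public void error(SAXParseException e) { errors++; }

        // Well-formedness errors are not recoverable.
        @Override public void fatalError(SAXParseException e) throws SAXException {
            fatals++;
            throw e;
        }
    }

    static Tally check(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(true);   // ask for DTD validity checking
        XMLReader reader = factory.newSAXParser().getXMLReader();

        Tally tally = new Tally();
        reader.setErrorHandler(tally);
        try {
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXParseException e) {
            // Already counted by fatalError(); nothing more to do here.
        }
        return tally;
    }
}
```

None of these reports corresponds to an Infoset information item; they describe the text that failed to yield one.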
