What XML Are We Talking About? (SAX2)

1.8. What XML Are We Talking About?

Over the past few years, there has been an explosive growth in the number of XML-related standards. Talking about XML has become confusing, because those three letters can mean so many different things. Some people actually mean what I've called "Greater XML." Think of it this way: Boston is a significant city, but people who don't live there often say "Boston" when they mean other nearby towns (Arlington, Cambridge, and so on). What they're really talking about is the "Greater Boston Metropolitan Area," or sometimes even just "Eastern Massachusetts."

In much the same way, many people now talk about "XML" when they really mean one of dozens of related technologies built around the nucleus of XML. Some of these may even be part of the original XML vision as "SGML for the Web." Writing XML documents against a DTD like DocBook (http://www.docbook.org) is clearly part of that original open systems vision. However, it's also been trendy to market "new and improved!" software as based on XML. Such ambiguities can be confusing and can even implicitly promote vendor lock-in, rather than liberate customer data from vendor control. The simplicity at the core of XML isn't friendly to lock-in strategies, but complex application layers on top of XML can certainly create closed systems.

So when someone says that SAX is a great API for XML processing, exactly what part of Greater XML does that mean? Briefly, it means the parts built with the "core" XML specifications. The following list shows the parts that this book uses in most of its examples.

XML 1.0 (Second Edition)

http://www.w3.org/TR/REC-xml

This text document format is the core of XML. SAX2 parsers work with this format and turn it into a stream of events that present the XML Infoset. However, as we'll see, SAX can be quite useful without even parsing XML text. (The second edition incorporates a variety of bug fixes and a few functional changes, which were previously published as a separate list of errata.)

XML includes Document Type Definitions, or DTDs, which a document introduces through its document type declaration. These provide several processing facilities, most of which you can rely on even when you don't use a validating parser. All XML parsers must support DTDs; they're what "schema" technologies attempt to improve on.
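As a rough illustration of DTD facilities that work without validation, the sketch below parses a small document whose internal DTD subset declares a default attribute value and an internal entity. The class name DtdDefaultsDemo, the sample document, and the output format are all hypothetical, and the behavior shown assumes the parser bundled with a current JDK in its default configuration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class DtdDefaultsDemo {
    // Hypothetical sample: the internal subset supplies a default value for
    // the "priority" attribute and defines the internal entity "who".
    static final String DOC =
        "<?xml version='1.0'?>\n"
        + "<!DOCTYPE note [\n"
        + "  <!ELEMENT note (#PCDATA)>\n"
        + "  <!ATTLIST note priority CDATA 'normal'>\n"
        + "  <!ENTITY who 'the editor'>\n"
        + "]>\n"
        + "<note>Ask &who;.</note>";

    // Parses the text without validation and reports what the handler saw.
    static String describe(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(false); // explicit, although this is the default
        XMLReader reader = factory.newSAXParser().getXMLReader();

        StringBuilder out = new StringBuilder();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // The attribute is absent from the start tag; any value
                // reported here comes from the ATTLIST default.
                out.append(qName)
                   .append("[priority=")
                   .append(atts.getValue("priority"))
                   .append("] ");
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Entity references are already expanded in this text.
                out.append(ch, start, length);
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(describe(DOC));
    }
}
```

Under those assumptions the handler should see the defaulted attribute and the expanded entity text, even though nothing was validated. Defaults and entities declared only in an external subset are a different matter, since a nonvalidating parser is not required to read external declarations.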

Unicode support has been part of XML from the earliest days. Java programmers may tend to overlook the significance of that fact, since it's always been part of Java too. But it's actually a big deal that XML moves web technologies firmly away from ASCII toward Unicode, in all programming environments (not just Java) -- not everyone needs to be a native English speaker to make best use of Internet technologies. XML has even been called a "virus for Unicode."

XML Infoset

http://www.w3.org/TR/xml-infoset/

The Infoset is best explained as an abstract model for what XML represents: information like elements, attributes, and character data. The Infoset exposes XML structure, not meaningful data. Applications transform Infoset data into forms that are suited to their particular tasks, normally behind a veil of application objects, unless they manipulate the text like a text editor.

The SAX2 event APIs present Infoset-level data; the lower-level alternative is to work directly with text. (See Appendix B, "SAX2 and the XML Infoset" for details about Infoset support in SAX2.) Other XML infrastructure, such as XInclude, generally transforms or augments Infoset data. Higher-level APIs generally hide such XML structures.
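To make the idea of Infoset-level events concrete, here is a minimal sketch that records the callbacks a SAX2 parser delivers for a small document. The class name EventLogDemo, the sample document, and the log format are hypothetical; only the standard org.xml.sax and javax.xml.parsers interfaces are assumed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class EventLogDemo {
    // Returns one log entry per event reported for the given XML text.
    static List<String> events(String xml) throws Exception {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        List<String> log = new ArrayList<>();

        reader.setContentHandler(new DefaultHandler() {
            // A parser may split character data across several calls,
            // so text is buffered until the next element boundary.
            private final StringBuilder text = new StringBuilder();

            private void flushText() {
                if (text.length() > 0) {
                    log.add("text:" + text);
                    text.setLength(0);
                }
            }

            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                flushText();
                StringBuilder entry = new StringBuilder("start:" + qName);
                for (int i = 0; i < atts.getLength(); i++) {
                    entry.append(" @").append(atts.getQName(i))
                         .append('=').append(atts.getValue(i));
                }
                log.add(entry.toString());
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                flushText();
                log.add("end:" + qName);
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return log;
    }

    public static void main(String[] args) throws Exception {
        String doc = "<memo id='7'><to>Ann</to><body>Hi &amp; bye</body></memo>";
        events(doc).forEach(System.out::println);
    }
}
```

Notice what the log does not contain: the quote style of the attribute, the spelling of the &amp; reference, and other details of the text. The handler sees elements, attributes, and character data, which is the sense in which the events are Infoset-level rather than text-level.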

XML Namespaces

http://www.w3.org/TR/REC-xml-names/

Namespaces are an optional convention for XML 1.0 documents. Namespaces distinguish elements and attributes so that names can be reused when necessary. For example, in document markup a <table> probably refers to a tabular presentation of data, but in a furniture catalog it might refer to something rather different. XML namespaces distinguish those cases with name prefixes that are bound to namespace URIs; unlike "straight XML" with DTDs, those prefixes are expected to change in different contexts (such as different parts of that furniture catalog). This makes combining namespaces and DTDs complicated.

One of the most visible differences between SAX1 and SAX2 is that SAX2 has integrated support for XML namespaces to promote their widespread adoption.
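The sketch below shows that integrated support from the handler's side: with namespace processing turned on, startElement reports a namespace URI and local name for each element, so the prefix used in the text stops mattering. The class name NamespaceDemo, the urn:example URIs, and the "{uri}local" output notation are hypothetical choices made for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class NamespaceDemo {
    // Hypothetical catalog: two different prefixes (c and f) are bound to
    // the same namespace URI, and h is bound to a different one.
    static final String DOC =
        "<c:catalog xmlns:c='urn:example:furniture'"
        + " xmlns:h='urn:example:markup'>"
        + "<c:table/>"
        + "<h:table/>"
        + "<f:table xmlns:f='urn:example:furniture'/>"
        + "</c:catalog>";

    // Returns each element's name as "{namespace-uri}local-name".
    static List<String> expandedNames(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // JAXP factories default to false
        XMLReader reader = factory.newSAXParser().getXMLReader();

        List<String> names = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // Compare on uri + localName, never on the prefix.
                names.add("{" + uri + "}" + localName);
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return names;
    }

    public static void main(String[] args) throws Exception {
        expandedNames(DOC).forEach(System.out::println);
    }
}
```

Here the c:table and f:table elements come out with identical expanded names, while h:table stays distinct. That is the behavior the furniture catalog example relies on, and it is also why code that matches on prefixed names tends to break when prefixes change between contexts.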

Over time, some other simple layers (and conventions) may become appropriate to view as part of the core of XML. The XML Base specification (http://www.w3.org/TR/xml-base/) might be an example of such a facility; it explains how to use an xml:base attribute to augment normal processing of relative URIs found in text.[8] Various internationalization rules and policies are also likely to fit into that core. One example is W3C work on the Character Model for the World Wide Web (http://www.w3.org/TR/charmod/), which promotes uniform handling of sequences used to represent some non-ASCII characters. Another is currently called XML Blueberry, which will modify XML 1.0 to allow use of new Unicode characters in element and attribute names. Those characters support languages not previously supported (before Unicode 3.1) and also improve support for languages such as Japanese.

[8] In fact, since this list includes the XML Infoset in the core, documents with the xml:base attribute implicitly need XML Base in their core view of XML to augment normal interpretation of URIs in document content. Example 5-1 shows one way to implement such processing in SAX.

Many of the increasingly substantial layers over XML deserve to be viewed as technology choices in their own right. Prime examples include schemas (there are many schema approaches, with one from W3C), schema APIs and tools (which may focus on non-XML data models, distant from "downtown XML"), Remote Procedure Calls ("RPCs"; again, many approaches including one from W3C), XPath (and its outgrowths), and XSLT. They are other cities in the metropolis of Greater XML, satellites of the original village that leverage the original civic infrastructure. Some of those layers may even reflect different fundamental goals and requirements from those that originally drove the creation and adoption of XML. That doesn't mean that you won't put SAX interfaces on them (or at least SAX-friendly ones), but because they are data layers over the core of XML, they may involve API layers too.

If you look at Java implementations of other technologies in Greater XML, you'll probably find SAX not far from the surface. This book identifies a number of such SAX-based tools and shows SAX events used as a framework to efficiently integrate these different technologies.