Turning SAX Events into Data Structures (SAX2)

4.4. Turning SAX Events into Data Structures

As described earlier, one of the great strengths of SAX is that it lets applications use appropriate data structures, instead of forcing the use of generic data structures. In Section 3.5.2, "Push Mode Event Production" in Chapter 3, "Producing SAX2 Events", we looked at the problem of producing SAX events from data structures. Here we look at the reverse process: producing data structures from SAX events. This is a process that most SAX applications handle to one degree or another. One of the most traditional names for this process is unmarshaling; it's also sometimes called deserializing. (I tend to avoid using the latter term with Java except when talking about RMI.)

We'll first look at how to turn SAX into generic DOM (and DOM-like) data structures. If you're working with such data structures, you may find it's advantageous to build them using SAX. With SAX, you can easily discard data you don't need, filtering it out so you don't need to pay its costs. Afterward we'll look briefly at some of the concerns associated with working with data structures that are more specialized to your application.

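4.4.1. SAX-to-DOM Consumers

Table 4-1. SAX-to-DOM consumer classes

Implementation  Class name                                  Comment
--------------  ------------------------------------------  -------------------------------------------------------
Crimson         org.apache.crimson.tree.XmlDocumentBuilder  Implements all the event consumer handlers.
DOM4J           org.dom4j.io.SAXContentHandler              Extends DefaultHandler; does not implement DeclHandler.
GNUJAXP         gnu.xml.dom.Consumer                        Uses the gnu.xml.pipeline framework.
JDOM            org.jdom.input.SAXHandler                   Extends DefaultHandler.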

Example 4-1. Converting SAX events to a DOM document (Crimson)

    public Document SAX2DOM (String uri)
    throws SAXException, IOException
    {
        XmlDocumentBuilder consumer;
        XMLReader producer;

        consumer = new XmlDocumentBuilder ();
        producer = XMLReaderFactory.createXMLReader ();

        producer.setContentHandler (consumer);
        producer.setDTDHandler (consumer);
        producer.setProperty
            ("http://xml.org/sax/properties/lexical-handler", consumer);
        producer.setProperty
            ("http://xml.org/sax/properties/declaration-handler", consumer);

        producer.parse (uri);
        return consumer.getDocument ();
    }

4.4.2. Pruning Noise Data from a DOM Tree

Example 4-2. Converting SAX events to DOM, discarding noise (Crimson)

    public Document SAX2DOM (String uri)
    throws SAXException, IOException
    {
        XmlDocumentBuilder consumer;
        XMLReader producer;

        consumer = new XmlDocumentBuilder ();
        consumer.setIgnoreWhitespace (true);
        producer = XMLReaderFactory.createXMLReader ();

        producer.setContentHandler (consumer);

        producer.parse (uri);
        return consumer.getDocument ();
    }
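Example 4-2 leans on a Crimson-specific switch to drop whitespace. If your DOM builder has no such switch, one possible approach is to prune the events before they reach it, using a filter built on the standard org.xml.sax.helpers.XMLFilterImpl class. The following is only a sketch under stated assumptions: the class name WhitespacePruner is made up for illustration, and it assumes that text made up entirely of whitespace is noise, which is not true for every document type (mixed content in particular).

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter, not part of any of the DOM packages listed above.
// It sits between an XMLReader and a consumer and drops whitespace noise.
class WhitespacePruner extends XMLFilterImpl {

    // Whitespace that the parser already knows is ignorable
    // (it saw a DTD with element content models) is simply dropped.
    @Override
    public void ignorableWhitespace(char[] buf, int offset, int length) {
        // intentionally empty
    }

    // ASSUMPTION: a characters() call holding only whitespace is noise.
    // Parsers may split text across calls, so this is a heuristic.
    @Override
    public void characters(char[] buf, int offset, int length)
            throws SAXException {
        for (int i = offset; i < offset + length; i++) {
            if (!Character.isWhitespace(buf[i])) {
                super.characters(buf, offset, length);
                return;
            }
        }
        // only whitespace was seen: pass nothing downstream
    }
}
```

To use such a filter you would call setParent() with the real XMLReader, register the DOM-building consumer with setContentHandler() on the filter, and then call parse() on the filter rather than on the parser.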

4.4.3. Building a Partial DOM

Example 4-3. Using SAX to stream DOM subtrees

    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.*;
    import org.xml.sax.*;
    import org.xml.sax.helpers.DefaultHandler;

    // a kind of event handler
    interface DomListener {
        public String getURI ();
        public String getLocalName ();
        public void processTree (Element tree) throws SAXException;
    }

    public class DomFilter extends DefaultHandler
    {
        private Document        factory;
        private Element         current;
        private DomListener     listener;

        public DomFilter (DomListener l)
            { listener = l; }

        public void startDocument ()
        throws SAXException
        {
            // all this just to get an empty document;
            // we need one to use as a factory
            try {
                factory = DocumentBuilderFactory
                    .newInstance ()
                    .newDocumentBuilder ()
                    .newDocument ();
            } catch (Exception e) {
                throw new SAXException ("can't get DOM factory", e);
            }
        }

        public void startElement (String uri, String local,
            String qName, Attributes atts)
        throws SAXException
        {
            // start a new subtree, or ignore
            if (current == null) {
                if (!listener.getURI ().equals (uri))
                    return;
                if (!listener.getLocalName ().equals (local))
                    return;
                current = factory.createElementNS (uri, qName);

            // Add to current subtree, descend.
            } else {
                Element e;

                if ("".equals (uri))
                    e = factory.createElement (qName);
                else
                    e = factory.createElementNS (uri, qName);
                current.appendChild (e);
                current = e;
            }

            // NOTE:  this example discards all attributes!
            // They ought to be saved to the current element.
        }

        public void endElement (String uri, String local, String qName)
        throws SAXException
        {
            Node parent;

            // ignore?
            if (current == null)
                return;
            parent = current.getParentNode ();

            // end subtree?
            if (parent == null) {
                current.normalize ();
                listener.processTree (current);
                current = null;

            // else climb up one level
            } else
                current = (Element) current.getParentNode ();
        }

        // if saving, append and continue
        public void characters (char buf [], int offset, int length)
        throws SAXException
        {
            if (current != null)
                current.appendChild (factory.createTextNode (
                    new String (buf, offset, length)));
        }
    }
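The note inside startElement() in the DOM subtree example points out that attributes are thrown away. One way they might be saved is sketched here. The helper class AttributeCopier and its copy() method are hypothetical names introduced for illustration; they are not part of the original example. The sketch assumes the parser is not reporting xmlns declarations as attributes (the SAX2 default), and it falls back to the local name when a parser leaves the qualified name empty.

```java
import org.w3c.dom.Element;
import org.xml.sax.Attributes;

// Hypothetical helper, not part of the original example.
final class AttributeCopier {
    private AttributeCopier() { }

    // Copies every attribute reported by SAX onto the given DOM element.
    static void copy(Attributes atts, Element target) {
        for (int i = 0; i < atts.getLength(); i++) {
            String uri = atts.getURI(i);
            String qName = atts.getQName(i);
            String value = atts.getValue(i);

            // Some parsers may omit qNames when namespace processing is on.
            String name = (qName == null || qName.isEmpty())
                    ? atts.getLocalName(i)
                    : qName;

            if (uri == null || uri.isEmpty()) {
                // attribute in no namespace
                target.setAttribute(name, value);
            } else {
                // namespace-qualified attribute
                target.setAttributeNS(uri, name, value);
            }
        }
    }
}
```

In the filter shown earlier, such a helper could be invoked at the end of startElement(), once current refers to the newly created element.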

4.4.4. Turning SAX Events into Custom Data Structures

If your application's data structures or interchange syntax are already defined, you may not be able to unmarshal them with software built on any of the numerous schema-oriented tools. However, lots of software uses SAX to do this efficiently. Once you understand how SAX models data in XML documents, you can treat unmarshaling much like any other parsing problem. It's closely associated with marshaling your data structures to XML. Here we'll look at some of the issues you may want to consider when transforming XML into your data structures.

You may find that some individual data items, such as integers and dates, use the low-level encoding rules that are specified in Part 2 of the W3C XML Schema specification (http://www.w3c.org/TR/xmlschema-2/). Those encodings are low-level policy decisions, and they're conceptually independent of the rest of W3C XML Schema; you can use them even if you don't buy the W3C approach to schemas. Some other schema systems, such as RELAX NG, incorporate those low-level encoding policies without adopting more problematic parts of the W3C XML Schema specification. Your application might likewise want to use these policies.
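As a rough illustration of decoding such low-level values, the following sketch turns the text of an integer and of a date into Java objects using only standard JDK classes. The class SchemaValues and its method names are hypothetical. The sketch assumes the values follow the Part 2 lexical forms and that trimming leading and trailing whitespace is an acceptable stand-in for the whitespace handling that specification describes.

```java
import java.math.BigInteger;
import javax.xml.datatype.DatatypeConfigurationException;
import javax.xml.datatype.DatatypeFactory;
import javax.xml.datatype.XMLGregorianCalendar;

// Hypothetical helper for decoding individual data items.
final class SchemaValues {
    private SchemaValues() { }

    // Integers: an optional sign followed by decimal digits.
    // Throws NumberFormatException on malformed input.
    static BigInteger parseInteger(String text) {
        return new BigInteger(text.trim());
    }

    // Dates and times: forms such as "2002-01-15" or
    // "2002-01-15T09:30:00Z".
    // Throws IllegalArgumentException on malformed input.
    static XMLGregorianCalendar parseDate(String text)
            throws DatatypeConfigurationException {
        return DatatypeFactory.newInstance()
                .newXMLGregorianCalendar(text.trim());
    }
}
```

A handler would typically call methods like these from endElement(), after it has collected all the character data for the element, or from startElement() for attribute values.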

One basic high-level encoding issue is how closely the XML structures and application structures should match. For example, an element is easier to unmarshal when its attributes (or child elements) map directly to properties of a single application object than when they map to properties of several different objects. The latter design is more complex, and for many purposes it could be much more appropriate, but its unmarshaling code needs to track more complex state.
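The simple end of that spectrum might look like the following sketch, in which each element maps to exactly one object and each attribute to one property. The point element, its x and y attributes, and the Point and PointHandler classes are all invented for illustration; the sketch also assumes the vocabulary uses no namespace.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical application class: one object per <point> element.
final class Point {
    final int x;
    final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }
}

// Unmarshals <point x="..." y="..."/> elements; no parsing state is
// needed because everything about a point arrives in one event.
class PointHandler extends DefaultHandler {
    private final List<Point> points = new ArrayList<>();

    @Override
    public void startElement(String uri, String local, String qName,
            Attributes atts) throws SAXException {
        // local is empty when the parser is not namespace-aware
        String tag = local.isEmpty() ? qName : local;
        if (!"point".equals(tag)) {
            return;
        }
        try {
            points.add(new Point(
                    Integer.parseInt(atts.getValue("x")),
                    Integer.parseInt(atts.getValue("y"))));
        } catch (NumberFormatException e) {
            // also covers a missing attribute (null value)
            throw new SAXException("bad <point> coordinates", e);
        }
    }

    List<Point> getPoints() {
        return points;
    }
}
```

If x and y instead had to be routed to two different application objects, the handler would need fields recording which objects are currently being built, which is the extra state mentioned above.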

Regularity of the various structures is another issue. It's usually less work to handle regular structures, since it's easy to create general methods and reuse them. Bugs are less frequent and more easily found than when every transformation involves yet another special case.

You'll need to figure out how much state you need to track and what techniques you will use. You might be able to use extremely simple parsing state machines; one of these is shown later, in Example 6-2. Sometimes it might be easier to unmarshal fragments into an intermediate form (as in the DOM subtrees example earlier), and map that form to your application structure before discarding it.
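To give a feel for how small such a state machine can be, here is a minimal sketch; it is not the example referred to above. It assumes a made-up vocabulary in which person elements hold name and email children, and the class name PersonHandler and the string format it produces are likewise illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler driven by a four-state machine.
class PersonHandler extends DefaultHandler {
    private enum State { OUTSIDE, IN_PERSON, IN_NAME, IN_EMAIL }

    private State state = State.OUTSIDE;
    private final StringBuilder text = new StringBuilder();
    private final List<String> people = new ArrayList<>();
    private String name;
    private String email;

    @Override
    public void startElement(String uri, String local, String qName,
            Attributes atts) throws SAXException {
        String tag = local.isEmpty() ? qName : local;
        switch (state) {
        case OUTSIDE:
            if ("person".equals(tag)) {
                state = State.IN_PERSON;
                name = email = null;
            }
            break;
        case IN_PERSON:
            if ("name".equals(tag)) state = State.IN_NAME;
            else if ("email".equals(tag)) state = State.IN_EMAIL;
            text.setLength(0);
            break;
        default:
            // name and email are assumed to hold text only
            throw new SAXException("unexpected <" + tag + ">");
        }
    }

    @Override
    public void characters(char[] buf, int offset, int length) {
        if (state == State.IN_NAME || state == State.IN_EMAIL) {
            text.append(buf, offset, length);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        String tag = local.isEmpty() ? qName : local;
        switch (state) {
        case IN_NAME:
            name = text.toString();
            state = State.IN_PERSON;
            break;
        case IN_EMAIL:
            email = text.toString();
            state = State.IN_PERSON;
            break;
        case IN_PERSON:
            if ("person".equals(tag)) {
                people.add(name + " <" + email + ">");
                state = State.OUTSIDE;
            }
            break;
        default:
            break;
        }
    }

    List<String> getPeople() {
        return people;
    }
}
```

A machine like this works only because the structure is flat and regular; once elements can nest to arbitrary depth, a single state variable is no longer enough.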

Often some sort of recursive-descent parsing algorithm that explicitly tracks the state of your parsing activities will be useful. It will often be helpful to keep a stack of pending elements and attributes, as shown later (in Example 5-1). But since the XML structures might not map directly to your application structures, you might also need to stack the objects that you are in various stages of unmarshaling.
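Stacking partially built objects might be sketched as follows. This is not the example referred to above; the folder and file vocabulary and the Folder and FolderHandler classes are assumptions made up for illustration. Each folder start tag pushes a new object, and the matching end tag pops it and attaches it to whatever is then on top of the stack.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical application class that can nest to any depth.
final class Folder {
    final String name;
    final List<Folder> folders = new ArrayList<>();
    final List<String> files = new ArrayList<>();

    Folder(String name) {
        this.name = name;
    }
}

class FolderHandler extends DefaultHandler {
    // objects whose end tags have not been seen yet
    private final Deque<Folder> pending = new ArrayDeque<>();
    private final StringBuilder text = new StringBuilder();
    private Folder root;

    @Override
    public void startElement(String uri, String local, String qName,
            Attributes atts) {
        String tag = local.isEmpty() ? qName : local;
        if ("folder".equals(tag)) {
            pending.push(new Folder(atts.getValue("name")));
        } else if ("file".equals(tag)) {
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] buf, int offset, int length) {
        text.append(buf, offset, length);
    }

    @Override
    public void endElement(String uri, String local, String qName)
            throws SAXException {
        String tag = local.isEmpty() ? qName : local;
        if ("file".equals(tag)) {
            if (pending.isEmpty()) {
                throw new SAXException("<file> outside any <folder>");
            }
            pending.peek().files.add(text.toString().trim());
        } else if ("folder".equals(tag)) {
            // this folder is complete: hand it to its parent, if any
            Folder done = pending.pop();
            if (pending.isEmpty()) {
                root = done;
            } else {
                pending.peek().folders.add(done);
            }
        }
    }

    Folder getRoot() {
        return root;
    }
}
```

The stack here holds application objects rather than element names, so the handler never needs to revisit the XML once an end tag has been processed.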

The worst scenario is when neither the XML text nor the application data structures are very regular. Software to work with that kind of system quickly gets fragile as it grows, and you'll probably want to change some of your application constraints.
