Beginning SAX (SAX2)

2.2. Beginning SAX

This chapter explores SAX through some progressively more functional examples, which build on each other to present the key concepts that are discussed later in more detail. Essential producer and consumer interfaces are presented together to show how they interact, and you'll see how to customize classic SAX configurations. We'll focus first on the producer side, saving most details about consumer-side APIs for a bit later.

2.2.1. How Do the Parts Fit Together?

In the simplest possible example, you (in your role as director) will get an XML parser, which will later produce parsing events. Then you will get a consumer and connect it to the producer for processing the most important events. Finally, you'll ask that parser to produce events, pushing them through to the consumer.

To start, focus on what the different parts are, and how they relate to each other. Example 2-1 is a simple SAX program, which you can compile and run if you like.

Example 2-1. SAX2 application skeleton

import java.io.IOException;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class Skeleton {

    // argv[0] must be the absolute URL of an XML document
    public static void main (String argv [])
    {
        XMLReader       producer;
        DefaultHandler  consumer;

        // Get an instance of the default XML parser class
        try {
            producer = XMLReaderFactory.createXMLReader ();
        } catch (SAXException e) {
            System.err.println (
                  "Can't get parser, check configuration: "
                + e.getMessage ());
            return;
        }

        // Set up the consumer
        try {

            // Get a consumer for all the parser events
            consumer = new DefaultHandler ();

            // Connect the most important standard handler
            producer.setContentHandler (consumer);

            // Arrange error handling
            producer.setErrorHandler (consumer);
        } catch (Exception e) {
            // Consumer setup can uncover errors,
            // though this simple one shouldn't
            System.err.println (
                  "Can't set up consumers: "
                + e.getMessage ());
            return;
        }

        // Do the parse!
        try {
            producer.parse (argv [0]);
        } catch (IOException e) {
            System.err.println ("I/O error: ");
            e.printStackTrace ();
        } catch (SAXException e) {
            System.err.println ("Parsing error: ");
            e.printStackTrace ();
        }
    }
}

This is a complete SAX application, though it's sort of boring since it throws away all the data the parser delivers. The only reason this program would print anything at all is if you didn't pass it an argument that was the URL for a well-formed XML file. Other than that, it's fairly typical of how you'll be using SAX2, at least in terms of the basic structure. You can make real programs from this skeleton if you substitute smarter components for the simple ones shown here.
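As one illustration of substituting a smarter component, here is a minimal sketch of a consumer that counts element names instead of discarding everything. The ElementCounter class and its count() helper are hypothetical names invented for this sketch. It also assumes the parser is obtained through JAXP's SAXParserFactory rather than XMLReaderFactory, and that the document is supplied as a string rather than a URL, so the example is self-contained.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical consumer: counts how often each element name occurs.
class ElementCounter extends DefaultHandler {

    final Map<String, Integer> counts = new TreeMap<> ();

    // Called by the parser once per start tag (or empty-element tag)
    @Override
    public void startElement (String uri, String localName,
            String qName, Attributes atts)
    {
        counts.merge (qName, 1, Integer::sum);
    }

    // Same director role as the skeleton: get a producer, connect
    // the consumer, then ask the producer to parse.
    static Map<String, Integer> count (String xml) throws Exception
    {
        // Assumption: JAXP bootstrapping instead of XMLReaderFactory
        XMLReader producer = SAXParserFactory.newInstance ()
            .newSAXParser ().getXMLReader ();
        ElementCounter consumer = new ElementCounter ();

        producer.setContentHandler (consumer);
        producer.setErrorHandler (consumer);
        producer.parse (new InputSource (new StringReader (xml)));
        return consumer.counts;
    }

    public static void main (String argv []) throws Exception
    {
        System.out.println (count ("<a><b/><b/><c/></a>"));
    }
}
```

Only the consumer changed; the producer setup and the parse call have the same shape as in the skeleton.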

We introduced a few SAX classes and interfaces, so we can add some details to our earlier producer/consumer picture to get Figure 2-2. This producer is an XMLReader, and the application has registered two handlers with it: one consumer interface (ContentHandler) and the ErrorHandler. The whole thing is driven by an application that pulls the whole document through the reader.

Figure 2-2. Basic SAX roles and components


The application thread is "pulling" the XML text through the XMLReader-style producer: the parse() call won't return until the whole document is parsed, or until parsing is aborted by throwing an exception. Until it returns, the thread that called the XMLReader is either blocking on I/O, parsing data that it just read, or "pushing" data into one of the consumer interfaces. That is, from the perspective of event consumers SAX2 is a "push" API: handlers do nothing until they're asked.
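The point about aborting a parse by throwing an exception can be sketched as follows. This is only an illustration under stated assumptions: the FirstElement class, its nested Found exception, and the rootName() helper are hypothetical names, and the parser is obtained through JAXP. The handler throws as soon as the first start tag is pushed to it, and the exception surfaces from the blocked parse() call in the application thread.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class FirstElement {

    // Hypothetical marker exception, so a deliberate abort can be
    // told apart from a real parsing error.
    static class Found extends SAXException {
        Found (String name) { super (name); }
    }

    // Returns the name of the document's root element, stopping
    // the parse as soon as that name is known.
    static String rootName (String xml) throws Exception
    {
        XMLReader producer = SAXParserFactory.newInstance ()
            .newSAXParser ().getXMLReader ();

        producer.setContentHandler (new DefaultHandler () {
            @Override
            public void startElement (String uri, String localName,
                    String qName, Attributes atts)
            throws SAXException
            {
                // The parser "pushed" the first element at us;
                // abort the rest of the parse.
                throw new Found (qName);
            }
        });

        try {
            // Blocks until the document ends or a handler throws
            producer.parse (new InputSource (new StringReader (xml)));
        } catch (Found e) {
            return e.getMessage ();
        }
        // Not expected for well-formed input, which always has a root
        return null;
    }
}
```

Nothing in the handler runs until the parser calls it, and nothing after parse() runs until the parser is finished or stopped.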

2.2.2. What Are the SAX2 Event Handlers?

SAX2 events are grouped into several interfaces, which we explore later in more detail. All except two are implemented by DefaultHandler. Each interface encapsulates a set of events; to see those events, applications give parsers objects that implement the handler interfaces they're interested in.

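org.xml.sax.ContentHandler

org.xml.sax.ErrorHandler

org.xml.sax.DTDHandler

org.xml.sax.ext.DeclHandler

org.xml.sax.ext.LexicalHandler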

With the exception of ErrorHandler, you'll normally want to work with all of these interfaces as a single group: four interfaces, two for content in the document body and two for DTD content. That way, you will work with all the XML data from a document (its Infoset) as part of a cohesive whole. There are SAX2 helper classes (like DefaultHandler and XMLFilterImpl) that group most of these interfaces into classes, but they ignore the two extension handlers (DeclHandler and LexicalHandler, in the org.xml.sax.ext package). SAX2 application layers often handle such grouping; for example, you can subclass those helper classes in a different package, adding extension interface support.
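A minimal sketch of working with the handlers as a group might look like the following. It assumes a SAX release that ships org.xml.sax.ext.DefaultHandler2 (a later helper class that adds the extension interfaces to DefaultHandler) and a parser that recognizes the two standard extension-handler property names. CommentCollector and collect() are hypothetical names for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical consumer: remembers the text of every comment.
class CommentCollector extends DefaultHandler2 {

    final List<String> comments = new ArrayList<> ();

    // LexicalHandler callback; the core handlers never see comments
    @Override
    public void comment (char ch [], int start, int length)
    {
        comments.add (new String (ch, start, length));
    }

    static List<String> collect (String xml) throws Exception
    {
        XMLReader producer = SAXParserFactory.newInstance ()
            .newSAXParser ().getXMLReader ();
        CommentCollector consumer = new CommentCollector ();

        // Core handlers each have their own setter
        producer.setContentHandler (consumer);
        producer.setDTDHandler (consumer);
        producer.setErrorHandler (consumer);

        // Extension handlers are bound as properties; a parser that
        // doesn't recognize them throws SAXNotRecognizedException
        producer.setProperty (
            "http://xml.org/sax/properties/lexical-handler", consumer);
        producer.setProperty (
            "http://xml.org/sax/properties/declaration-handler", consumer);

        producer.parse (new InputSource (new StringReader (xml)));
        return consumer.comments;
    }
}
```

Binding one object to every handler slot is what lets it see the document as a cohesive whole.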

The logic behind keeping these interfaces separate, rather than merging all of their methods into one huge interface, is that it's more appropriate for simple applications. You must explicitly ask for bells and whistles; they aren't thrust upon you by default. You can easily prune out certain data by ignoring the interfaces that report it. Most code only uses ContentHandler and ErrorHandler implementations, so the methods in other interfaces are easy to ignore. Plus, from the application perspective, parser recognition of the extension handlers isn't guaranteed. There's a slight awkwardness associated with needing to bind each type of handler separately, but that's a small trade-off for the benefit of having a modular API extension model already in place.

SAX2 defines another important interface beyond these handlers and the XMLReader: parsers use EntityResolver to retrieve external entity text they must parse. That interface is also stubbed out by DefaultHandler. If you want the parser to use local copies of DTDs rather than DTDs accessed from a server that might not be available, you'll want to become familiar with EntityResolver. However, it isn't really a consumer API since it doesn't deal directly with parsed XML data (the Infoset); it deals with accessing raw unparsed text, the same stuff that's given to XMLReader.parse() methods. This book presents it as a producer-side helper for parsers, in Section 3.4, "The EntityResolver Interface" in Chapter 3, "Producing SAX2 Events".
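As a rough sketch of the local-copy idea, the resolver below answers requests for one DTD from text held in memory, so no server is contacted. Everything specific here is an assumption made for illustration: the LocalDtdResolver class, the example.invalid URL, the one-line DTD, and the text() helper. It also assumes the parser reads the external DTD subset, which non-validating parsers are permitted to skip.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class LocalDtdResolver implements EntityResolver {

    // Hypothetical system ID and local DTD text
    static final String DTD_URL = "http://example.invalid/doc.dtd";
    static final String LOCAL_DTD = "<!ENTITY greeting \"hello\">";

    @Override
    public InputSource resolveEntity (String publicId, String systemId)
    {
        if (DTD_URL.equals (systemId)) {
            InputSource local
                = new InputSource (new StringReader (LOCAL_DTD));
            local.setSystemId (systemId);
            return local;
        }
        // null asks the parser to fetch the entity itself
        return null;
    }

    // Parses the document and returns its character content
    static String text (String xml) throws Exception
    {
        XMLReader producer = SAXParserFactory.newInstance ()
            .newSAXParser ().getXMLReader ();
        StringBuilder text = new StringBuilder ();

        producer.setEntityResolver (new LocalDtdResolver ());
        producer.setContentHandler (new DefaultHandler () {
            @Override
            public void characters (char ch [], int start, int length)
            {
                text.append (ch, start, length);
            }
        });
        producer.parse (new InputSource (new StringReader (xml)));
        return text.toString ();
    }
}
```

Note that the resolver hands back raw text through an InputSource; the parsed result still arrives through the ordinary consumer interfaces.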

2.2.3. XMLWriter: an Event Consumer

The next part of SAX we show in this overview is really not a part of SAX, except that it uses SAX to do something you'll likely need to do fairly often. (Pretty much everyone does!) As you've seen, SAX2 includes an XMLReader interface, used to turn XML text into a stream of SAX events. But it does not include the corresponding XMLWriter to reverse the process: turning such events back into text and supporting XML for program outputs as well as inputs. SAX isn't only for reading XML. The same APIs are used to write XML too.

It's almost a tradition to show how to write most of such a class as an example when explaining SAX. We avoid that in this book because getting all the XML details right is tricky, and because this class is a clear example of something that should be treated as a reusable SAX library component. There are lots of ways the data needs to be escaped, and sometimes you need to use output encodings (like ASCII) that have problems representing some XML characters.

There's a better solution: use one of several such classes, which are widely available. This book uses the gnu.xml.util.XMLWriter class (bundled with gnujaxp.jar and Ælfred) when it needs XML generation functionality, because it doesn't force applications to discard as much of the XML data. It supports all of the SAX2 handlers, including the extension handlers LexicalHandler and DeclHandler, so it can round-trip almost all XML data. To use such classes, at least in their simple low-fidelity modes, you can modify the skeleton program shown earlier to something like this:

import java.io.FileOutputStream;
import gnu.xml.util.XMLWriter;

public class ... {

    ...
        setContentHandler (
            new XMLWriter (new FileOutputStream ("out.xml"))
            );
    ...
}

In addition to the GNU class used in this book, other versions are available. One is provided with DOM4J (org.dom4j.io.XMLWriter); it supports the content and lexical handlers and evolved from the com.megginson.sax.XMLWriter class, which supports only ContentHandler. Curiously, neither Crimson nor Xerces includes such SAX-to-text functionality at this time.
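If none of these classes is on your class path, the same producer-to-text wiring can be sketched with a different component. The example below is not one of the XMLWriter classes named above: it assumes a JAXP implementation whose transformer factory supports the SAX extensions, and uses an identity TransformerHandler as the event consumer that writes text. The Echo class and echo() method are hypothetical names.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class Echo {

    // Parses XML text and writes the resulting events back as text
    static String echo (String xml) throws Exception
    {
        SAXParserFactory parsers = SAXParserFactory.newInstance ();
        parsers.setNamespaceAware (true);
        XMLReader producer = parsers.newSAXParser ().getXMLReader ();

        // Assumption: the factory implements SAXTransformerFactory
        SAXTransformerFactory factory = (SAXTransformerFactory)
            TransformerFactory.newInstance ();
        TransformerHandler writer = factory.newTransformerHandler ();
        StringWriter out = new StringWriter ();

        writer.getTransformer ().setOutputProperty (
            OutputKeys.OMIT_XML_DECLARATION, "yes");
        writer.setResult (new StreamResult (out));

        // The writer stands in for XMLWriter as the event consumer
        producer.setContentHandler (writer);
        producer.setProperty (
            "http://xml.org/sax/properties/lexical-handler", writer);

        producer.parse (new InputSource (new StringReader (xml)));
        return out.toString ();
    }
}
```

As with the XMLWriter fragment, the only change from the skeleton is which object is handed to setContentHandler().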

2.2.3.1. Event pipelines

Of course, just parsing and echoing data is not very useful. Such classes are best used to output XML data that you've massaged a bit. We'll look at two ways to do this later. One way is to use an XML pipeline, where consumers produce data for other consumers, as illustrated in Figure 2-3. For example, one stage could filter the event stream from a parser to remove various uninteresting elements, or otherwise transform the data, and then feed the result to an XMLWriter. You can combine several such stages into a "pipeline" and debug them using an XMLWriter to watch data as it flows through particular stages. Remember that XMLReader isn't the only kind of SAX event producer: programs can write events and feed the result to an XMLWriter. Also, the consumer doesn't need to be an XMLWriter; it could construct any kind of useful data structure. In fact we'll look later at doing this with DOM.

Figure 2-3. Simple SAX2 event pipeline

This kind of processing pipeline is a fundamental model for more advanced uses of SAX and for structuring components that are SAX-aware. We look at pipelines again in Section 4.5, "XML Pipelines" in Chapter 4, "Consuming SAX2 Events". For now, keep in mind that sometimes event consumers will be producing events for later processing components.
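A filtering stage of the kind just described can be sketched with the XMLFilterImpl helper class. This is a minimal illustration, not the pipeline framework presented later: DropFilter, its "dropped" parameter, and the names() helper are hypothetical, and the final consumer simply records element names where a real pipeline might end in an XMLWriter.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical stage: removes one kind of element and its content.
class DropFilter extends XMLFilterImpl {

    private final String dropped;
    private int depth;      // > 0 while inside a dropped element

    DropFilter (XMLReader parent, String dropped)
    {
        super (parent);
        this.dropped = dropped;
    }

    @Override
    public void startElement (String uri, String localName,
            String qName, Attributes atts)
    throws SAXException
    {
        if (depth > 0 || qName.equals (dropped)) {
            depth++;
            return;
        }
        super.startElement (uri, localName, qName, atts);
    }

    @Override
    public void endElement (String uri, String localName, String qName)
    throws SAXException
    {
        if (depth > 0) {
            depth--;
            return;
        }
        super.endElement (uri, localName, qName);
    }

    @Override
    public void characters (char ch [], int start, int length)
    throws SAXException
    {
        if (depth == 0)
            super.characters (ch, start, length);
    }

    // Pipeline: parser -> DropFilter -> name-recording consumer
    static List<String> names (String xml, String dropped)
    throws Exception
    {
        XMLReader parser = SAXParserFactory.newInstance ()
            .newSAXParser ().getXMLReader ();
        DropFilter stage = new DropFilter (parser, dropped);
        List<String> seen = new ArrayList<> ();

        stage.setContentHandler (new DefaultHandler () {
            @Override
            public void startElement (String uri, String localName,
                    String qName, Attributes atts)
            {
                seen.add (qName);
            }
        });
        stage.parse (new InputSource (new StringReader (xml)));
        return seen;
    }
}
```

The filter is a consumer of the parser's events and a producer for the next stage, which is the dual role the pipeline model relies on.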

2.2.3.2. Concerns when writing XML text

There are several important issues to consider when writing XML output, which should be mentioned in the documentation for the XMLWriter you use. You may even be able to use your XMLWriter to canonicalize output, so you can safely compare processor output or create digital signatures. The GNU class shown earlier handles most of these directly, but that's not true for all such classes.

You need the flexibility to choose different line endings, such as Macintosh style (CR only), DOS style (CRLF), and Unix style (LF only). The default should be right for the host operating system, but sometimes that's not right for the destination.

The SAX2 event stream might discard essential namespace prefix information. If you're using documents with namespaces, you need to provide a sanitized event stream, making sure either that such data is not discarded (using the "mixed mode" namespace handling discussed later in this chapter) or that corresponding data gets synthesized (maybe in some pipeline stage).
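As a small sketch of keeping that prefix information, the example below turns on the standard namespace-prefixes feature alongside namespace processing, so that namespace declarations are still reported as attributes. The MixedMode class and attributeNames() helper are hypothetical names, and the sketch assumes the parser supports having both features enabled at once.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class MixedMode {

    // Returns the qualified names of all attributes reported
    // by the parser, including any xmlns declarations.
    static List<String> attributeNames (String xml) throws Exception
    {
        SAXParserFactory parsers = SAXParserFactory.newInstance ();
        parsers.setNamespaceAware (true);
        XMLReader producer = parsers.newSAXParser ().getXMLReader ();

        // Ask for xmlns* attributes in addition to namespace URIs
        producer.setFeature (
            "http://xml.org/sax/features/namespace-prefixes", true);

        List<String> names = new ArrayList<> ();
        producer.setContentHandler (new DefaultHandler () {
            @Override
            public void startElement (String uri, String localName,
                    String qName, Attributes atts)
            {
                for (int i = 0; i < atts.getLength (); i++)
                    names.add (atts.getQName (i));
            }
        });
        producer.parse (new InputSource (new StringReader (xml)));
        return names;
    }
}
```

With the feature left at its default of false, a consumer would see the "id" attribute here but not the declaration that a writer needs in order to reproduce the prefix.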

You might be sending XML to applications that don't handle DTDs or external entities very well. For example, many web browsers won't read DTDs. To talk robustly to such applications, you might need to send standalone documents.

If your application just uses ContentHandler events, you'll have discarded information needed to re-create "high-fidelity" output reflecting DTD content, comments, entity references, and CDATA section boundaries. More handlers are detailed in Chapter 4, "Consuming SAX2 Events", and briefly summarized later in this section; most of the writers implement many such interfaces.

If you don't want to use UTF-8 (or UTF-16) as your character encoding, you'll have to be sure the names used by your markup can be expressed in the encoding you choose. That's because while numeric character references can be used inside text, they can't be used inside markup components like element and attribute names. ASCII, for example, is hopeless at handling element names that use Japanese ideographic characters, but it can handle Japanese text if you don't mind that every character in the document text is cryptically expressed as a numeric character reference.
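The text half of that trade-off can be sketched in a few lines. This is an illustrative helper, not code from any particular XMLWriter: TextEscaper and escape() are hypothetical names, and the sketch handles only character content (not attribute values or names, which need different treatment).

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class TextEscaper {

    // Escapes character content for output in the given encoding:
    // markup characters become predefined entity references, and
    // anything the encoding can't represent becomes a numeric
    // character reference.
    static String escape (String text, Charset charset)
    {
        CharsetEncoder encoder = charset.newEncoder ();
        StringBuilder out = new StringBuilder ();

        for (int i = 0; i < text.length (); ) {
            int cp = text.codePointAt (i);
            i += Character.charCount (cp);

            switch (cp) {
            case '<':
                out.append ("&lt;");
                break;
            case '>':
                out.append ("&gt;");
                break;
            case '&':
                out.append ("&amp;");
                break;
            default:
                String s = new String (Character.toChars (cp));
                if (encoder.canEncode (s))
                    out.append (s);
                else
                    out.append ("&#x")
                        .append (Integer.toHexString (cp).toUpperCase ())
                        .append (';');
            }
        }
        return out.toString ();
    }
}
```

No comparable escape exists for an element or attribute name, which is why the names themselves must fit the chosen encoding.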

The first time you try to debug XML output where a single line is even just a few kilobytes in length, you'll want your XMLWriter to be "pretty printing." Minimally it should add line breaks; ideally it should be able to indent to show document structure.

Such an XMLWriter is part of almost every developer's SAX toolkit, even though it isn't part of SAX itself. As you work with SAX, you'll probably start to collect and develop your own library of such reusable event consumer code.