Beginning SAX (SAX2)

This chapter explores SAX through some progressively more functional examples, which build on each other to present the key concepts that are discussed later in more detail. Essential producer and consumer interfaces are presented together to show how they interact, and you'll see how to customize classic SAX configurations. We'll focus first on the producer side, saving most details about consumer-side APIs for a bit later.

2.2.1. How Do the Parts Fit Together?

In the simplest possible example, you (in your role as director) will get an XML parser, which will later produce parsing events. Then you will get a consumer and connect it to the producer for processing the most important events. Finally, you'll ask that parser to produce events, pushing them through to the consumer.

To start, focus on what the different parts are, and how they relate to each other. Example 2-1 is a simple SAX program, which you can compile and run if you like.

Example 2-1. SAX2 application skeleton

import java.io.IOException;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class Skeleton {

    // argv[0] must be the absolute URL of an XML document
    public static void main (String argv [])
    {
        XMLReader       producer;
        DefaultHandler  consumer;

        // Get an instance of the default XML parser class
        try {
            producer = XMLReaderFactory.createXMLReader ();
        } catch (SAXException e) {
            System.err.println (
                  "Can't get parser, check configuration: "
                + e.getMessage ());
            return;
        }

        // Set up the consumer
        try {
            // Get a consumer for all the parser events
            consumer = new DefaultHandler ();

            // Connect the most important standard handler
            producer.setContentHandler (consumer);

            // Arrange error handling
            producer.setErrorHandler (consumer);
        } catch (Exception e) {
            // Consumer setup can uncover errors,
            // though this simple one shouldn't
            System.err.println (
                  "Can't set up consumers: "
                + e.getMessage ());
            return;
        }

        // Do the parse!
        try {
            producer.parse (argv [0]);
        } catch (IOException e) {
            System.err.println ("I/O error: ");
            e.printStackTrace ();
        } catch (SAXException e) {
            System.err.println ("Parsing error: ");
            e.printStackTrace ();
        }
    }
}

This is a complete SAX application, though it's a boring one, since it throws away all the data the parser delivers. It prints something only when things go wrong, for example when the argument you pass isn't the URL of a well-formed XML document. Other than that, it's fairly typical of how you'll be using SAX2, at least in terms of the basic structure. You can make real programs from this skeleton if you substitute smarter components for the simple ones shown here.

We introduced a few SAX classes and interfaces, so we can add some details to our earlier producer/consumer picture to get Figure 2-2. This producer is an XMLReader, and we're listening to one consumer interface and the ErrorHandler. The whole thing is driven by an application which is pulling the whole document through the reader.

Figure 2-2. Basic SAX roles and components

XMLReader producer;

The most common type of SAX2 event producer is an XML parser. Like most producers, XML parsers implement the XMLReader interface. Whether or not a producer parses actual XML (instead of HTML or something else), it is required to produce events as if it did.

Don't confuse this interface with the java.io.Reader class, from which you pull a stream of character data. SAX parsers produce streams of SAX events, which they push to event consumers. Those are rather different models for delivering data.

producer = XMLReaderFactory.createXMLReader ();

This is the best all-around SAX2 bootstrap API when you need an XML parser. The only time it should produce any kind of exception is when your environment is misconfigured. For example, you might need to set the org.xml.sax.driver system property to the class name for your parser (see Section 3.2.1, "The XMLReaderFactory Class" in Chapter 3, "Producing SAX2 Events").
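
As a hedged sketch of that bootstrap step, the class below first asks for an explicitly named driver and falls back to the default parser when that class can't be loaded. The names Bootstrap and getReader, and the driver class com.example.NoSuchParser, are hypothetical and exist only for illustration. Recent JDKs mark XMLReaderFactory as deprecated, although it is still available.

```java
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class Bootstrap {

    // Returns a parser of the named class if it can be loaded,
    // otherwise whatever parser the environment provides by default.
    public static XMLReader getReader (String driverClass)
    throws SAXException
    {
        try {
            return XMLReaderFactory.createXMLReader (driverClass);
        } catch (SAXException e) {
            System.err.println ("No driver " + driverClass
                + ", using the default: " + e.getMessage ());
            return XMLReaderFactory.createXMLReader ();
        }
    }

    public static void main (String argv []) throws SAXException
    {
        // Hypothetical class name; it is not expected to exist
        XMLReader producer = getReader ("com.example.NoSuchParser");

        System.out.println ("Using " + producer.getClass ().getName ());
    }
}
```

Setting the system property on the command line, as in java -Dorg.xml.sax.driver=<class name> Skeleton <url>, selects a driver for the no-argument call without changing any code.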

You can (and should!) keep reusing this XMLReader, but you should only have one thread touch a parser at a time. That is, parsing is not re-entrant. Parsers are perfectly safe to use with multiple threads, except that two threads can't use the same parser at the same time. (That's a good rule of thumb for most objects in multithreaded code, in all programming languages; it should feel natural to apply that rule to SAX parsers.)
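
A minimal sketch of that reuse, assuming the documents are short strings held in memory: one XMLReader parses several documents in turn from a single thread. The class name ReuseParser and the method parseAll are hypothetical. Because the text has no URL here, the sketch uses the parse() variant that takes an InputSource.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class ReuseParser {

    // Parses each document with ONE XMLReader and returns how
    // many of them were parsed without a fatal error.
    public static int parseAll (String documents [])
    throws SAXException, IOException
    {
        XMLReader       producer = XMLReaderFactory.createXMLReader ();
        DefaultHandler  consumer = new DefaultHandler ();
        int             parsed = 0;

        producer.setContentHandler (consumer);
        producer.setErrorHandler (consumer);

        for (String text : documents) {
            try {
                producer.parse (new InputSource (new StringReader (text)));
                parsed++;
            } catch (SAXException e) {
                // This document was rejected; keep using the parser
                System.err.println ("Skipping: " + e.getMessage ());
            }
        }
        return parsed;
    }

    public static void main (String argv []) throws Exception
    {
        String docs [] = { "<a/>", "<b><c/></b>", "<oops>" };

        System.out.println (parseAll (docs) + " documents parsed");
    }
}
```

If several threads need to parse at once, the simplest arrangement is to give each thread its own XMLReader rather than sharing one.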

consumer = new DefaultHandler ();

The DefaultHandler class is particularly handy when you're just starting to use SAX. It implements most of the event consumer interfaces, providing stubbed-out (no-op) implementations for each method that's not part of an extension handler. That means it's easy to subclass this class if you need a place to start: just override each stub method to provide real code when you need it. We'll use DefaultHandler to avoid presenting extra callback methods.
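
For illustration, here is one way such a subclass might look: it overrides only two of the stub methods in order to count elements. The class name ElementCounter and its count helper are hypothetical, and the helper parses in-memory text only to keep the sketch self-contained.

```java
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class ElementCounter extends DefaultHandler {
    private int count;

    // Reset at the start of each document so the handler can be reused
    @Override
    public void startDocument ()
    { count = 0; }

    // Called once for every element start tag (or empty element)
    @Override
    public void startElement (String uri, String localName,
            String qName, Attributes atts)
    { count++; }

    public int getCount ()
    { return count; }

    // Convenience wrapper: parse in-memory text, return element count
    public static int count (String xmlText) throws Exception
    {
        XMLReader       producer = XMLReaderFactory.createXMLReader ();
        ElementCounter  consumer = new ElementCounter ();

        producer.setContentHandler (consumer);
        producer.setErrorHandler (consumer);
        producer.parse (new InputSource (new StringReader (xmlText)));
        return consumer.getCount ();
    }

    public static void main (String argv []) throws Exception
    {
        System.out.println (count ("<a><b/><b/></a>") + " elements");
    }
}
```

Every callback that isn't overridden keeps its inherited no-op behavior, so the subclass stays as small as the task requires.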

producer.setContentHandler (consumer);

In this chapter, we're only showing the most commonly used consumer interfaces. ContentHandler is used to report elements, attributes, and characters; that's enough to get almost all serious XML work done.
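
To make those three kinds of report concrete, the following sketch echoes elements, attributes, and character data into a string. ContentTrace and trace are hypothetical names. The sketch assumes the parser supplies qName values, and it is not a real XML writer, since it does no escaping.

```java
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class ContentTrace extends DefaultHandler {
    private final StringBuilder out = new StringBuilder ();

    // Elements and their attributes arrive together
    @Override
    public void startElement (String uri, String localName,
            String qName, Attributes atts)
    {
        out.append ('<').append (qName);
        for (int i = 0; i < atts.getLength (); i++) {
            out.append (' ').append (atts.getQName (i));
            out.append ("='").append (atts.getValue (i)).append ('\'');
        }
        out.append ('>');
    }

    @Override
    public void endElement (String uri, String localName, String qName)
    { out.append ("</").append (qName).append ('>'); }

    // Character data may arrive in several chunks; appending each
    // chunk in order reassembles the original text
    @Override
    public void characters (char buf [], int offset, int length)
    { out.append (buf, offset, length); }

    public static String trace (String xmlText) throws Exception
    {
        XMLReader     producer = XMLReaderFactory.createXMLReader ();
        ContentTrace  consumer = new ContentTrace ();

        producer.setContentHandler (consumer);
        producer.setErrorHandler (consumer);
        producer.parse (new InputSource (new StringReader (xmlText)));
        return consumer.out.toString ();
    }

    public static void main (String argv []) throws Exception
    {
        System.out.println (trace ("<doc id=\"1\">hi<br/></doc>"));
    }
}
```

Notice that the empty element is reported as a start event followed by an end event; a ContentHandler can't tell which syntax the document used.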

producer.setErrorHandler (consumer);

ErrorHandler lets applications control handling of various kinds of errors, and we'll need it in later examples. We'll usually look at error handling as a specialized kind of task, different from other consumer roles. Even though "handler" is part of its name, it's a different kind of object.
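
As a rough sketch of what a custom ErrorHandler could look like, the class below records every report and stops parsing only on fatal errors. ReportingErrorHandler and check are hypothetical names, and whether to continue after warnings and nonfatal errors is a policy choice made here purely for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class ReportingErrorHandler implements ErrorHandler {
    private final List<String> reports = new ArrayList<String> ();

    private void report (String kind, SAXParseException e)
    {
        reports.add (kind + " at line " + e.getLineNumber ()
            + ": " + e.getMessage ());
    }

    // Warnings and nonfatal errors are recorded; parsing continues
    public void warning (SAXParseException e)
    { report ("warning", e); }

    public void error (SAXParseException e)
    { report ("error", e); }

    // Fatal errors are recorded, then rethrown to end the parse
    public void fatalError (SAXParseException e) throws SAXException
    {
        report ("fatal", e);
        throw e;
    }

    // Parses in-memory text and returns everything that was reported
    public static List<String> check (String xmlText)
    throws IOException, SAXException
    {
        ReportingErrorHandler  errors = new ReportingErrorHandler ();
        XMLReader              producer = XMLReaderFactory.createXMLReader ();

        producer.setContentHandler (new DefaultHandler ());
        producer.setErrorHandler (errors);
        try {
            producer.parse (new InputSource (new StringReader (xmlText)));
        } catch (SAXParseException e) {
            // Already recorded by fatalError ()
        }
        return errors.reports;
    }

    public static void main (String argv []) throws Exception
    {
        for (String line : check ("<a><b></a>"))
            System.out.println (line);
    }
}
```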

producer.parse (argv [0]);

This call tells a parser to read the XML text found at a particular fully qualified URL. There's another call you'll use when you don't have a URL for that text, but most of the time this is the call you ought to use. If you're tempted to pass filenames or relative URIs, just say no! Filenames need to be converted to URLs first (see Section 3.1.3, "Filenames Versus URIs" in Chapter 3, "Producing SAX2 Events"), and relative URIs must be converted to absolute ones.
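
One possible conversion, which relies on java.io.File and is not necessarily the technique that Section 3.1.3 describes, is sketched below. FileToUrl and toAbsoluteUrl are hypothetical names.

```java
import java.io.File;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class FileToUrl {

    // Turns a relative or absolute filename into an absolute
    // "file:" URL, escaping characters such as spaces
    public static String toAbsoluteUrl (String filename)
    { return new File (filename).toURI ().toString (); }

    // argv[0] is a filename here, not a URL
    public static void main (String argv []) throws Exception
    {
        XMLReader       producer = XMLReaderFactory.createXMLReader ();
        DefaultHandler  consumer = new DefaultHandler ();

        producer.setContentHandler (consumer);
        producer.setErrorHandler (consumer);
        producer.parse (toAbsoluteUrl (argv [0]));
    }
}
```

Relative filenames are resolved against the current working directory, so the resulting URL depends on where the program is started.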

Parsing can report exceptions. This is important, and not just because it's the only way that a chunk of code like this (using just an XMLReader) could seem to "do" anything. Normally, those exceptions will be thrown only for fatal errors, such as well-formedness errors in an XML document, or for document I/O problems.
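
The sketch below shows one way to tell those exceptions apart, assuming the text is held in memory. Diagnose and diagnose are hypothetical names. SAXParseException is the SAXException subclass that carries location information, so it has to be caught before the more general type.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class Diagnose {

    // Returns "ok", or a one-line description of what went wrong
    public static String diagnose (String xmlText)
    {
        try {
            XMLReader       producer = XMLReaderFactory.createXMLReader ();
            DefaultHandler  consumer = new DefaultHandler ();

            producer.setContentHandler (consumer);
            producer.setErrorHandler (consumer);
            producer.parse (new InputSource (new StringReader (xmlText)));
            return "ok";
        } catch (SAXParseException e) {
            // Fatal errors in the document, with their location
            return "line " + e.getLineNumber ()
                + ", column " + e.getColumnNumber ()
                + ": " + e.getMessage ();
        } catch (SAXException e) {
            // Other SAX problems, such as a misconfigured environment
            return "SAX error: " + e.getMessage ();
        } catch (IOException e) {
            // Trouble reading the document itself
            return "I/O error: " + e.getMessage ();
        }
    }

    public static void main (String argv [])
    {
        System.out.println (diagnose ("<a/>"));
        System.out.println (diagnose ("<a>\n<b>\n</a>"));
    }
}
```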

The application thread is "pulling" the XML text through the XMLReader-style producer: the parse() call won't return until the whole document is parsed, or until parsing is aborted by throwing an exception. Until it returns, the thread that called the XMLReader is either blocking on I/O, parsing data that it just read, or "pushing" data into one of the consumer interfaces. That is, from the perspective of event consumers SAX2 is a "push" API: handlers do nothing until they're asked.
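
A small sketch can make that flow of control visible. It logs each callback, and checks that every callback runs on the thread that called parse(), before parse() returns. PushDemo and trace are hypothetical names, and the sketch assumes the parser supplies qName values.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class PushDemo extends DefaultHandler {
    private final List<String>  log = new ArrayList<String> ();
    private final Thread        caller = Thread.currentThread ();
    private boolean             sameThread = true;

    // Records an event and notes whether it arrived on another thread
    private void note (String what)
    {
        if (Thread.currentThread () != caller)
            sameThread = false;
        log.add (what);
    }

    @Override
    public void startDocument ()
    { note ("startDocument"); }

    @Override
    public void endDocument ()
    { note ("endDocument"); }

    @Override
    public void startElement (String uri, String localName,
            String qName, Attributes atts)
    { note ("start " + qName); }

    @Override
    public void endElement (String uri, String localName, String qName)
    { note ("end " + qName); }

    public static List<String> trace (String xmlText) throws Exception
    {
        PushDemo   consumer = new PushDemo ();
        XMLReader  producer = XMLReaderFactory.createXMLReader ();

        producer.setContentHandler (consumer);
        producer.setErrorHandler (consumer);

        consumer.log.add ("before parse");
        producer.parse (new InputSource (new StringReader (xmlText)));
        consumer.log.add ("after parse, same thread: " + consumer.sameThread);
        return consumer.log;
    }

    public static void main (String argv []) throws Exception
    {
        for (String line : trace ("<a><b/></a>"))
            System.out.println (line);
    }
}
```

All the document events land between the "before parse" and "after parse" entries, which is the pull-outside, push-inside arrangement described above.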
