Producing SAX2 Events (SAX2)

The preceding chapter provided an overview of the most widely used SAX classes and showed how they relate to each other. This chapter drills more deeply into how to produce XML events with SAX, including further customization of SAX parsers.

3.1. Pull Mode Event Production with XMLReader

Most of the time you work with SAX2, you'll be using some kind of org.xml.sax.XMLReader implementation that turns XML text into a SAX event stream. Such a class is loosely called a "SAX parser." Don't confuse this with the older SAX1 org.xml.sax.Parser interface, which new code should not use.

This interface works in a kind of "pull" mode: when a thread makes an XMLReader.parse() request, it blocks until the XML document has been fully read and processed. Inside the parser there's a lot of work going on, including a "pull-to-push" adapter: the parser pulls data out of the input source provided to parse() and converts it to events that it pushes to event consumers. This model is different from the model of a java.io.Reader, from which applications can only get individual buffers of character data, but it's also similar because in both cases the calling thread is pulling data from a stream.
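The blocking behavior described above can be seen in a minimal sketch. It assumes the JAXP SAXParserFactory bundled with the JDK as the way to obtain an XMLReader; the class name PullModeSketch and the sample document are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class PullModeSketch {
    // Parses the text and returns the events in the order they were seen.
    static List<String> parse(String xml) throws Exception {
        List<String> events = new ArrayList<>();

        // Assumption: JAXP is used here only as a convenient parser source.
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String local,
                                     String qName, Attributes atts) {
                events.add("start " + qName);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                events.add("end " + qName);
            }
        });

        // The calling thread blocks here; every callback above runs on it.
        reader.parse(new InputSource(new StringReader(xml)));
        events.add("parse() returned");
        return events;
    }

    public static void main(String[] args) throws Exception {
        for (String event : parse("<lichen><color/></lichen>")) {
            System.out.println(event);
        }
    }
}
```

All the element events are recorded before the final entry, because parse() does not return until the whole document has been pushed through the handler.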

You can also have pure "push" mode event producers. The most common kind writes events directly to event handlers and doesn't use any kind of input abstraction to indicate the data's source; it's not parsing XML text. We discuss several types of such producers later in this chapter. Using threads, you could also create a producer that lets you write raw XML text, a buffer at a time, to an XMLReader that parses the text; that's another kind of "push" mode producer.

3.1.1. The XMLReader Interface

The SAX overview presented the most important parts of the XMLReader interface. Here we discuss the whole thing, in functional groups. Most of the handlers are presented in more detail in the next chapter, which focuses on the consumption side of the SAX event streaming process. Each handler has get and set accessor methods, and has a default value of null.

XMLReader has the following functional groups:

void parse(String uri)

void parse(InputSource in)

There are two methods to parse documents. In most cases, the Java environment is able to resolve the document's URI; the form with the absolute URI should be used when possible. (You may need to convert filenames to URIs before passing them to SAX. SAX specifically disallows passing relative URIs.) The second form is discussed in more detail along with the InputSource class. Both of these methods can throw a SAXException or java.io.IOException, as presented earlier. A SAXException is normally thrown only when an event handler throws it to terminate parsing. That policy is best encapsulated in an ErrorHandler, but handler methods can make such decisions themselves.

Only one thread may call a given parser's parse() method at a time; applications are responsible for ensuring that threads don't share parsers that are in active use. (SAX parsers aren't necessarily going to report applications that break that rule, though!) The thread doing the parsing will normally block only while it's waiting for data to be delivered to it, or if a handler's processing causes it to block.

void setContentHandler(ContentHandler handler)

ContentHandler getContentHandler()

Key parts of the ContentHandler interface were presented as part of the SAX overview; ContentHandler packages the fundamental parsing callbacks used by SAX event consumers. This interface is presented in more detail in Chapter 4, "Consuming SAX2 Events", in Section 4.1.1, "Other ContentHandler Methods".

void setDTDHandler(DTDHandler handler)

DTDHandler getDTDHandler()

The DTDHandler is presented in detail later, in Chapter 4, "Consuming SAX2 Events", in Section 4.3.2, "The DTDHandler Interface".

void setEntityResolver(EntityResolver handler)

EntityResolver getEntityResolver()

The EntityResolver is presented later in this chapter, in Section 3.4, "The EntityResolver Interface". It is used by the parser to help locate the content for external entities (general or parameter) to be parsed.

void setErrorHandler(ErrorHandler handler)

ErrorHandler getErrorHandler()

The ErrorHandler was presented in Section 2.5.2, "ErrorHandler Interface" in Chapter 2, "Introducing SAX2". It is often used by consumer code that interprets events reported through other handlers, since they may need to report errors detected at higher levels than XML syntax.

void setFeature(String uri, boolean value)

boolean getFeature(String uri)

Parser feature flags were discussed in Chapter 2, "Introducing SAX2", and are presented in more detail later in this chapter in Section 3.3.2, "XMLReader Feature Flags".

void setProperty(String uri, Object value)

Object getProperty(String uri)

Parser properties are used for data such as additional event handlers, and are presented in more detail later in this chapter in Section 3.3.1, "XMLReader Properties".
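The feature and property accessors listed above can be exercised roughly as follows. This is a sketch, not a definitive recipe: it assumes a parser obtained through JAXP's SAXParserFactory, uses the standard SAX2 namespaces feature and lexical-handler property URIs, and the class and method names (ReaderConfigSketch, trySetFeature, comments) are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

class ReaderConfigSketch {
    static final String NAMESPACES =
        "http://xml.org/sax/features/namespaces";
    static final String LEXICAL_HANDLER =
        "http://xml.org/sax/properties/lexical-handler";

    // Assumption: JAXP is used only as a convenient source of an XMLReader.
    static XMLReader newReader() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    // Returns true if the flag was accepted, false if the parser refused it.
    static boolean trySetFeature(XMLReader reader, String uri, boolean value) {
        try {
            reader.setFeature(uri, value);
            return true;
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return false;
        }
    }

    // Collects comment text, which is only reported through a LexicalHandler.
    static List<String> comments(String xml) throws Exception {
        List<String> seen = new ArrayList<>();
        XMLReader reader = newReader();
        trySetFeature(reader, NAMESPACES, true);

        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void comment(char[] ch, int start, int length) {
                seen.add(new String(ch, start, length));
            }
        };
        reader.setContentHandler(handler);
        // The extra handler is registered as a property, not with a setter.
        reader.setProperty(LEXICAL_HANDLER, handler);

        reader.parse(new InputSource(new StringReader(xml)));
        return seen;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(comments("<a><!-- hi --></a>"));
    }
}
```

Wrapping setFeature() in a helper that reports refusal keeps the two exception types out of the calling code; whether refusal should be fatal depends on how much the application relies on the flag.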

All the event handlers and the entity resolver may be reassigned inside event callbacks. At this level, SAX guarantees "late binding" of handlers. Layers built on top of SAX might use earlier binding, which can optimize event processing.
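As an illustration of that late binding, the sketch below swaps content handlers from inside a callback. The handler names (outer, inner) and the class LateBindingSketch are hypothetical, and the parser is again assumed to come from JAXP's SAXParserFactory.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class LateBindingSketch {
    static List<String> parse(String xml) throws Exception {
        List<String> log = new ArrayList<>();
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();

        // Handler that takes over after the first element has been seen.
        ContentHandler inner = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local,
                                     String qName, Attributes atts) {
                log.add("inner:" + qName);
            }
        };

        // Initial handler: records one element, then reassigns the handler.
        ContentHandler outer = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local,
                                     String qName, Attributes atts) {
                log.add("outer:" + qName);
                reader.setContentHandler(inner);  // effective for the next event
            }
        };

        reader.setContentHandler(outer);
        reader.parse(new InputSource(new StringReader(xml)));
        return log;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parse("<a><b/><c/></a>"));
    }
}
```

Only the root element reaches the first handler; every later event goes to the replacement, since the parser looks up the current handler each time it reports an event.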

Many SAX parsers let you set handlers to null as a way to ignore the events reported by that type of handler. Strictly speaking, they don't need to do that; they're allowed to throw a NullPointerException when you use null. So if you need to restore the default behavior of a parser, you should use a DefaultHandler (or something implementing the appropriate extension interface) just in case, rather than use the more natural idiom of setting the handler to its default value, null.

If for any reason you need a push mode XML parser, which takes blocks of character or byte data (encapsulating XML text) that you write to a parser, you can easily create one from a standard pull mode parser. The cost is one helper thread and some API glue. The helper thread will call parse() on an InputSource that uses a java.io.PipedInputStream to read text. The push thread will write such data blocks to an associated java.io.PipedOutputStream when it becomes available. Most SAX parsers will in turn push the event data out incrementally, but there's no guarantee (at least from SAX) that they won't buffer megabytes of data before they start to parse.
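A minimal sketch of that arrangement follows. It is one possible shape for the glue, not a standard API: the class name PushParserSketch and its method are invented, the parser is assumed to come from JAXP's SAXParserFactory, and the pushed text is assumed to be UTF-8 without an encoding declaration.

```java
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class PushParserSketch {
    // Pushes the chunks to a parser running on a helper thread and
    // returns the number of elements that parser reported.
    static int countElementsPushed(String[] chunks) throws Exception {
        PipedOutputStream out = new PipedOutputStream();
        PipedInputStream in = new PipedInputStream(out);
        AtomicInteger count = new AtomicInteger();
        AtomicReference<Exception> failure = new AtomicReference<>();

        XMLReader parser =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        parser.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String local,
                                     String qName, Attributes atts) {
                count.incrementAndGet();
            }
        });

        // Helper thread: an ordinary pull mode parse() reading from the pipe.
        Thread helper = new Thread(() -> {
            try {
                parser.parse(new InputSource(in));
            } catch (Exception e) {
                failure.set(e);
            }
        });
        helper.start();

        // Push thread: write blocks of XML text as they become available.
        try {
            for (String chunk : chunks) {
                out.write(chunk.getBytes(StandardCharsets.UTF_8));
            }
        } finally {
            out.close();  // tells the parser that the input has ended
        }

        helper.join();
        if (failure.get() != null) {
            throw new Exception("parse failed", failure.get());
        }
        return count.get();
    }

    public static void main(String[] args) throws Exception {
        String[] chunks = { "<a><b", "/><c/>", "</a>" };
        System.out.println(countElementsPushed(chunks));
    }
}
```

Chunk boundaries need not line up with markup boundaries, as the split start tag in main() shows. A production version would also need a policy for what the push thread does when the parser fails partway through.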

3.1.2. The InputSource Class

The InputSource class shows up in both places where SAX needs to parse data: for the document itself, through parse(), and for the external parsed entities it might reference through the EntityResolver interface.

In almost all cases you should simply pass an absolute URI to the XMLReader.parse() method. (If you have a relative URI or a filename, turn it into an absolute URI first.) However, there are cases when you may need to parse data that has no URI. It might be in unnamed storage like a String; or it might need to be read using a specialized access scheme (maybe a java.io.PipedInputStream, or POST input to a servlet, or something named by a URN). Or the web server for the URI might misidentify the document's character encoding, and you need to work around that server bug. In such cases, you must use the alternative XMLReader.parse() method and pass an InputSource object to the parser.

InputSource objects are fundamentally holders for one or two things: an entity's URI and the entity text. (There can be a "public ID" too, but it's rarely useful.) When only one of those is needed, an application's work for setting up the InputSource might end with choosing the right constructor. Whenever you provide the entity text, you need to pay attention to some character encoding issues. Because character encoding is easy to get wrong, avoid directly providing entity text when you can.

3.1.2.1. Always provide absolute URIs

You should try to always provide the fully qualified (absolute) URI of the entity as its systemId, even if you also provide the entity text. That URI will often be the only data you need to provide. You must convert filenames to URIs (as described later in this chapter in Section 3.1.3, "Filenames Versus URIs"), and turn relative URIs into absolute ones. Some parsers have bugs and will attempt to turn relative URIs into absolute ones, guessing at an appropriate base URI. Do not rely on such behavior.

If you don't provide that absolute URI, then diagnostics may be useless. More significantly, relative URIs within the document can't be correctly resolved by the parser if the base URI is forgotten. XML parsers need to handle relative URIs within DTDs. To do that they need the absolute document (or entity) base URIs to be provided in InputSource (or parse() methods) by the application. Parsers use those base URIs to absolutize relative URIs, and then use EntityResolver to map the URIs (or their public identifiers) to entity text. Applications sometimes need to do similar things to relative URIs in document content. The xml:base attribute may provide an alternative solution for applications to determine the base URI, but it is normally needed only when relative URIs are broken. This can happen when someone moves the base document without moving its associated resources, or when you send the document through DOM (which doesn't record base URIs). Moreover, relative URIs in an xml:base attribute still need to be resolved with respect to the real base URI of the document.
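When an application needs to absolutize relative URIs found in document content, the computation is the same one the parser performs for DTD references. The sketch below assumes java.net.URI (available since JDK 1.4) does the resolution; the class name BaseUriSketch and the example.com URIs are illustrative only.

```java
import java.net.URI;

class BaseUriSketch {
    // Resolves a possibly relative reference against an absolute base URI.
    static String absolutize(String base, String reference) {
        return URI.create(base).resolve(reference).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.example.com/docs/book.xml";

        // A sibling of the base document.
        System.out.println(absolutize(base, "chapter1.xml"));
        // A reference that climbs out of the base document's directory.
        System.out.println(absolutize(base, "../dtds/book.dtd"));
        // An absolute reference is returned unchanged.
        System.out.println(absolutize(base, "http://other.example.org/x.dtd"));
    }
}
```

Without the base URI, the first two references could not be turned into anything a parser or resolver can fetch.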

The following methods are used to provide absolute URIs:

InputSource(String uri): Use this constructor when you are creating an InputSource consisting only of a fully qualified URI in a scheme understood by the JVM you are using. Such schemes commonly include http://, file://, ftp://, and increasingly https://.
InputSource.setSystemId(String uri): Use this method to record the URI associated with text you are providing directly.

For example, these three ways to parse a document are precisely equivalent:

String    uri = ...;
XMLReader parser = ...;

parser.parse (uri);
// or
parser.parse (new InputSource (uri));
// or
InputSource in = new InputSource ();
in.setSystemId (uri);
parser.parse (in);

3.1.2.2. Providing entity text

For data without a URI, or that uses a URI scheme not supported by your JVM, applications must provide entity text themselves. There are two ways to provide the text through an InputSource: as character data or as binary data, which needs to be decoded into character data before it can be parsed. In both cases your application will create an open data stream and give it to the parser. It will no longer be owned by your application; the parser should later close it as part of its end-of-input processing. If you provide binary data, you might know the character encoding used with it and can give that information to the parser rather than turning it to character data yourself using something like an InputStreamReader.

InputSource(java.io.Reader in)

Use this constructor when you are providing predecoded data to the parser, which will then ignore what any XML or text declaration says about the character encoding. (Also, call setSystemId(uri) when possible.) This constructor is useful for parsing data from a java.io.Reader such as java.io.CharArrayReader and for working around configuration bugs in HTTP servers.

Some HTTP servers will misidentify the text encoding used for XML documents, using the content type text/xml for non-ASCII data, instead of text/xml;charset=... or application/xml.[12] If you know a particular server does this, and that the encoding won't be autodetected, create an InputSource by using an InputStreamReader that uses the correct encoding. If the correct encoding will be autodetectable, you can use the InputStream constructor.

[12]application/xml is the safest MIME type to use for *.xml, *.dtd, and other XML files. See RFC 3023 for information about XML MIME types and character encodings.
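The server workaround might look something like the following sketch. It assumes you already hold the raw bytes and know they are ISO-8859-1; the class name PredecodedInputSketch, its methods, and the example.com URI are hypothetical, and the parser is assumed to come from JAXP's SAXParserFactory.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class PredecodedInputSketch {
    // Parses the source and returns all character data it contains.
    static String textOf(InputSource source) throws Exception {
        StringBuilder text = new StringBuilder();
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        reader.parse(source);
        return text.toString();
    }

    // Decodes the bytes ourselves, so the parser never sees the raw encoding.
    static String parseLatin1(byte[] data, String systemId) throws Exception {
        Reader decoded = new InputStreamReader(
            new ByteArrayInputStream(data), StandardCharsets.ISO_8859_1);
        InputSource source = new InputSource(decoded);
        source.setSystemId(systemId);  // still record where the text came from
        return textOf(source);
    }

    public static void main(String[] args) throws Exception {
        // No encoding declaration, and the byte for e-acute is not valid UTF-8.
        byte[] data = "<p>caf\u00e9</p>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(
            parseLatin1(data, "http://www.example.com/menu.xml"));
    }
}
```

Handing the same bytes to the parser as an InputStream would leave it to assume UTF-8, which this document is not.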

InputSource(java.io.InputStream in)

Use this constructor when you are providing binary data to a parser and expect the parser to be able to detect the encoding from the binary data. (Also, call setSystemId(uri) when possible.)

For example, UTF-16 text always includes a Byte Order Mark, a document beginning <?xml ... encoding="Big5"?> is understood by most parsers as a Big5 (traditional Chinese) document, and UTF-8 is the default for XML documents without a declaration identifying the actual encoding in use.

InputSource.setEncoding(String id)

Use this method if you know the character encoding used with data you are providing as a java.io.InputStream. (Or provide a java.io.Reader if you can, though some parsers know more about encodings than the underlying JVM does.)[13] If you don't know the encoding, don't guess. XML parsers know how to use XML and text declarations to correctly determine the encoding in use. However, some parsers don't autodetect EBCDIC encodings, which are mostly used with IBM mainframes. You can use this method to help parsers handle documents using such encodings, if you can't provide the document in a fully interoperable encoding such as UTF-8.

[13]JDK 1.4 includes public APIs through which applications can support new character encodings. Some applications may need to use those APIs to support encodings beyond those the JVM handles natively.

All XML parsers support "UTF-8" and "UTF-16" values here, and most support other values, such as US-ASCII and ISO-8859-1. Consult your parser documentation for information about other encodings it supports. Typically, all encodings supported by the underlying JVM will be available, but they might be inconsistently named. (As one example, Sun's JDK supports many EBCDIC encodings, but gives them unusual names that don't suggest they're actually EBCDIC.) You should use standard Internet (IANA) encoding names, rather than Java names, where possible. In particular, don't use the name "UTF8"; use "UTF-8".
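When you would rather leave decoding to the parser, setEncoding() carries the same information. The sketch below assumes the parser bundled with the JDK (obtained through JAXP) and bytes known to be ISO-8859-1; DeclaredEncodingSketch and its parse() method are names invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class DeclaredEncodingSketch {
    // Parses binary data, telling the parser which encoding to use,
    // and returns the character data found in the document.
    static String parse(byte[] data, String ianaName) throws Exception {
        InputSource source = new InputSource(new ByteArrayInputStream(data));
        source.setEncoding(ianaName);  // IANA name, e.g. "ISO-8859-1"

        StringBuilder text = new StringBuilder();
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        reader.parse(source);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "<p>caf\u00e9</p>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(parse(data, "ISO-8859-1"));
    }
}
```

This keeps the byte stream intact for the parser, which matters for parsers that understand encodings the JVM's own decoders do not.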

So if you want to parse some XML text you have lying around in a character array or String, the natural thing to do is package it as a java.io.Reader and wrap it up in something like this:

String    text = "<lichen color='red'/>";
Reader    reader = new StringReader (text);
XMLReader parser = ... ;

parser.setContentHandler (...);
parser.parse (new InputSource (reader));

In the same way, if you're implementing a servlet's POST handler and the servlet accepts XML text as its input, you'll create an InputSource. The InputSource will never have a URI, though you could support URIs for multipart/related content (sending a bundle of related components, such as external entities). Example 3-1 handles the MIME content type correctly, though it does so by waving a magic wand: it calls a routine that implements the rules in RFC 3023. That is, text/* content is US-ASCII (seven-bit code) by default, and any charset=... attribute is authoritative. When parsing XML requests inside a servlet, you'd typically apply a number of configuration techniques to speed up per-request processing and maintain security.[14]

[14]You might have a pool of parsers, to reduce bootstrap costs. You'd use an entity resolver to turn most entity accesses from remote ones into local ones. Depending on your application, you might even prevent all access to nonlocal entities so the servlet won't hang when remote network accesses get delayed.

Some security policies would also involve the entity resolver. Basically, every entity access "requested" by the client (through a reference in the document) is a potential attack. If it's not known to be safe (for example, access to standard DTD components), it may be important to prevent or nullify the access. (This does not always happen in the entity resolver; sometimes system security policies will be more centralized.) In a small trade-off against performance, security might require that the request data always be validated, and that validity errors be treated as fatal, because malformed input data is likely to affect system integrity.
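One way to nullify entity accesses is an entity resolver that never lets the parser open the requested URI. The following is a deliberately blunt sketch, assuming the JDK's bundled parser obtained through JAXP: it answers every request with empty text, where a real policy would first map known-safe identifiers to local copies. The class name LockedDownResolverSketch and the example.invalid URI are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class LockedDownResolverSketch {
    // Parses the text without fetching any external entity, and returns
    // the system IDs whose access was nullified.
    static List<String> parseOffline(String xml) throws Exception {
        List<String> blocked = new ArrayList<>();
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();

        reader.setEntityResolver(new EntityResolver() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                blocked.add(systemId);
                // Supplying the (empty) text ourselves means the parser
                // has no reason to open the URI.
                InputSource empty = new InputSource(new StringReader(""));
                empty.setSystemId(systemId);
                return empty;
            }
        });
        reader.setContentHandler(new DefaultHandler());

        reader.parse(new InputSource(new StringReader(xml)));
        return blocked;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE doc SYSTEM \"http://example.invalid/doc.dtd\">"
                   + "<doc/>";
        System.out.println(parseOffline(xml));
    }
}
```

Returning empty text rather than throwing lets well-formed requests proceed; throwing a SAXException from the resolver is the stricter alternative when any unexpected reference should reject the request outright.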

Example 3-1. Parsing POST input to an HTTP Servlet

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;
import gnu.xml.util.Resolver;

public void doPost (HttpServletRequest request, HttpServletResponse response)
throws IOException, ServletException
{
    String       type = request.getContentType ();
    InputSource  in;
    XMLReader    parser;

    // a request without a Content-Type header reports null here
    if (type == null
            || !(type.startsWith ("text/xml")
                || type.startsWith ("application/xml"))) {
        response.sendError (HttpServletResponse.SC_UNSUPPORTED_MEDIA_TYPE,
            "non-XML content type: " + type);
        return;
    }

    // there's no URI for this input data!
    in = new InputSource (request.getInputStream ());

    // use any encoding associated with the MIME type
    in.setEncoding (Resolver.getEncoding (type));

    try {
        parser = XMLReaderFactory.createXMLReader();
        ...
        parser.setContentHandler (...);
        parser.parse (in);
        // content handler expected to handle response generation

    } catch (SAXException e) {
        // the client sent something the parser or a handler rejected
        response.sendError (HttpServletResponse.SC_BAD_REQUEST,
            "bad input: " + e.getMessage ());
        return;

    } catch (IOException e) {
        // maybe a relative URI in the input couldn't be resolved
        response.sendError (HttpServletResponse.SC_INTERNAL_SERVER_ERROR,
            "i/o problem: " + e.getMessage ());
        return;
    }
}

You might have some XML text in a database, stored as a binary large object (BLOB, accessed using java.sql.Blob) and potentially referring to other BLOBs in the database. Constructing input sources for such data should be slightly different because of those references. You'd want to be sure to provide a URI, so the references can be resolved:

String        key = "42";
byte          data [] = Storage.keyToBlob (key);
InputStream   stream = new ByteArrayInputStream (data);
InputSource   source = new InputSource (stream);
XMLReader     parser = ... ;

source.setSystemId ("blob:" + key);
parser.parse (source);

In such cases, where you are using a URI scheme that your JVM doesn't support directly, consider using an EntityResolver to create the InputSource objects you hand to parse(). Such schemes might be standard (such as members of a MIME multipart/related bundle), or they might be private to your application (like this blob: scheme). (Example 3-3 shows how to package handling for such nonstandard URI schemes so that you can use them in your application, even when your JVM does not understand them. You may need to pass such URIs using public IDs rather than system IDs, so that parsers won't report errors when they try to resolve them.)

3.1.3. Filenames Versus URIs

Filenames are not URIs, so you may not provide them as system identifiers where SAX expects a system identifier: in parse() or in an InputSource object. If you are depending on JDK 1.2 or later, you can rely on new File(name).toURL().toString() to turn a filename into a URI. To be most portable, you may prefer to use a routine as shown in Example 3-2, which handles key issues like mapping DOS or Mac OS filenames into legal URIs.
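On JDK 1.4 and later, File.toURI() is another option; unlike File.toURL(), which has since been deprecated, it escapes characters such as spaces that are not legal in URIs. The sketch below is an aside under that assumption; FileUriSketch and toSystemId are names made up for illustration.

```java
import java.io.File;

class FileUriSketch {
    // Turns a filename (relative or absolute) into an absolute file: URI
    // suitable for use as a SAX system identifier.
    static String toSystemId(String filename) {
        // toURI() makes the path absolute and escapes illegal characters;
        // toASCIIString() additionally escapes any non-ASCII characters.
        return new File(filename).toURI().toASCIIString();
    }

    public static void main(String[] args) {
        // The space in the directory name comes out as %20.
        System.out.println(toSystemId("my docs/book.xml"));
    }
}
```

The exact result depends on the current directory and platform, but it always begins with file: and contains no unescaped spaces.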

Example 3-2. File.toURL() analogue for JDK 1.1

public static String fileToURL (File f)
throws IOException
{
    String      temp;

    if (!f.exists ())
        throw new IOException ("no such file: " + f.getName ());

    temp = f.getAbsolutePath ();

    if (File.separatorChar != '/')
        temp = temp.replace (File.separatorChar, '/');
    if (!temp.startsWith ("/"))
        temp = "/" + temp;
    if (!temp.endsWith ("/") && f.isDirectory ())
        temp = temp + "/";
    return "file:" + temp;
}

If you're using the GNU software distribution that is described earlier, gnu.xml.util.Resolver.fileToURL() is available so you won't need to enter that code yourself.
