Chapter 3. Producing SAX2 EventsContents: Pull Mode Event Production with XMLReader The preceding chapter provided an overview of the most widely used SAX classes and showed how they relate to each other. This chapter drills more deeply into how to produce XML events with SAX, including further customization of SAX parsers. 3.1. Pull Mode Event Production with XMLReaderMost of the time you work with SAX2, you'll be using some kind of org.xml.sax.XMLReader implementation that turns XML text into a SAX event stream. Such a class is loosely called a "SAX parser." Don't confuse this with the older SAX1 org.xml.sax.Parser class. New code should not be using that class! This interface works in a kind of "pull" mode: when a thread makes an XMLReader.parse() request, it blocks until the XML document has been fully read and processed. Inside the parser there's a lot of work going on, including a "pull-to-push" adapter: the parser pulls data out of the input source provided to parse() and converts it to events that it pushes to event consumers. This model is different from the model of a java.io.Reader, from which applications can only get individual buffers of character data, but it's also similar because in both cases the calling thread is pulling data from a stream. You can also have pure "push" mode event producers. The most common kind writes events directly to event handlers and doesn't use any kind of input abstraction to indicate the data's source; it's not parsing XML text. We discuss several types of such producers later in this chapter. Using threads, you could also create a producer that lets you write raw XML text, a buffer at a time, to an XMLReader that parses the text; that's another kind of "push" mode producer. 3.1.1. The XMLReader InterfaceThe SAX overview presented the most important parts of the XMLReader interface. Here we discuss the whole thing, in functional groups. Most of the handlers are presented in more detail in the next chapter, which focuses on the consumption side of the SAX event streaming process. Each handler has get and set accessor methods, and has a default value of null. XMLReader has the following functional groups:
All the event handlers and the entity resolver may be reassigned inside event callbacks. At this level, SAX guarantees "late binding" of handlers. Layers built on top of SAX might use earlier binding, which can optimize event processing. Many SAX parsers let you set handlers to null as a way to ignore the events reported by that type of handler. Strictly speaking, they don't need to do that; they're allowed to throw a NullPointerException when you use null. So if you need to restore the default behavior of a parser, you should use a DefaultHandler (or something implementing the appropriate extension interface) just in case, rather than use the more natural idiom of setting the handler to its default value, null. If for any reason you need a push mode XML parser, which takes blocks of character or byte data (encapsulating XML text) that you write to a parser, you can easily create one from a standard pull mode parser. The cost is one helper thread and some API glue. The helper thread will call parse() on an InputSource that uses a java.io.PipedInputStream to read text. The push thread will write such data blocks to an associated java.io.PipedOutputStream when it becomes available. Most SAX parsers will in turn push the event data out incrementally, but there's no guarantee (at least from SAX) that they won't buffer megabytes of data before they start to parse. 3.1.2. The InputSource ClassThe InputSource class shows up in both places where SAX needs to parse data: for the document itself, through parse(), and for the external parsed entities it might reference through the EntityResolver interface. In almost all cases you should simply pass an absolute URI to the XMLReader.parse() method. (If you have a relative URI or a filename, turn it into an absolute URI first.) However, there are cases when you may need to parse data that has no URI. It might be in unnamed storage like a String; or it might need to be read using a specialized access scheme (maybe a java.io.PipedInputStream, or POST input to a servlet, or something named by a URN). The web server for the URI might misidentify the document's character encoding, so you'd need to work around that server bug. In such cases, you must use the alternative XMLReader.parse() method and pass an InputSource object to the parser. InputSource objects are fundamentally holders for one or two things: an entity's URI and the entity text. (There can be a "public ID" too, but it's rarely useful.) When only one of those is needed, an application's work for setting up the InputSource might end with choosing the right constructor. Whenever you provide the entity text, you need to pay attention to some character encoding issues. Because character encoding is easy to get wrong, avoid directly providing entity text when you can. 3.1.2.1. Always provide absolute URIsYou should try to always provide the fully qualified (absolute) URI of the entity as its systemId, even if you also provide the entity text. That URI will often be the only data you need to provide. You must convert filenames to URIs (as described later in this chapter in Section 3.1.3, "Filenames Versus URIs"), and turn relative URIs into absolute ones. Some parsers have bugs and will attempt to turn relative URIs into absolute ones, guessing at an appropriate base URI. Do not rely on such behavior. If you don't provide that absolute URI, then diagnostics may be useless. More significantly, relative URIs within the document can't be correctly resolved by the parser if the base URI is forgotten. XML parsers need to handle relative URIs within DTDs. To do that they need the absolute document (or entity) base URIs to be provided in InputSource (or parse() methods) by the application. Parsers use those base URIs to absolutize relative URIs, and then use EntityResolver to map the URIs (or their public identifiers) to entity text. Applications sometimes need to do similar things to relative URIs in document content. The xml:base attribute may provide an alternative solution for applications to determine the base URI, but it is normally needed only when relative URIs are broken. This can happen when someone moves the base document without moving its associated resources, or when you send the document through DOM (which doesn't record base URIs). Moreover, relative URIs in an xml:base attribute still need to be resolved with respect to the real base URI of the document. The following methods are used to provide absolute URIs:
For example, these three ways to parse a document are precisely equivalent: String uri = ...; XMLReader parser = ...; parser.parse (uri); // or parser.parse (new InputSource (uri); 3.1.2.2. Providing entity textFor data without a URI, or that uses a URI scheme not supported by your JVM, applications must provide entity text themselves. There are two ways to provide the text through an InputSource: as character data or as binary data, which needs to be decoded into character data before it can be parsed. In both cases your application will create an open data stream and give it to the parser. It will no longer be owned by your application; the parser should later close it as part of its end-of-input processing. If you provide binary data, you might know the character encoding used with it and can give that information to the parser rather than turning it to character data yourself using something like an InputStreamReader.
So if you want to parse some XML text you have lying around in a character array or String, the natural thing to do is package it as a java.io.Reader and wrap it up in something like this: String text = "<lichen color='red'/>"; Reader reader = new StringReader (text); XMLReader parser = ... ; parser.setContentHandler (...); parser.parse (new InputSource (reader)); In the same way, if you're implementing a servlet's POST handler and the servlet accepts XML text as its input, you'll create an InputSource. The InputSource will never have a URI, though you could support URIs for multipart/related content (sending a bundle of related components, such as external entities). Example 3-1 handles the MIME content type correctly, though it does so by waving a magic wand: it calls a routine that implements the rules in RFC 3023. That is, text/* content is US-ASCII (seven-bit code) by default, and any charset=... attribute is authoritative. When parsing XML requests inside a servlet, you'd typically apply a number of configuration techniques to speed up per-request processing and maintain security.[14]
Example 3-1. Parsing POST input to an HTTP Servletimport gnu.xml.util.Resolver; public void doPost (HttpServletRequest request, HttpServletResponse response) throws IOException, ServletException { String type = req.getContentType (); InputSource in; XMLReader parser; if (!(type.startsWith ("text/xml") || type.startsWith ("application/xml")) { response.sendError (response.SC_UNSUPPORTED_MEDIA_TYPE, "non-XML content type: " + type); return; } // there's no URI for this input data! in = new InputSource (req.getInputStream ()); // use any encoding associated with the MIME type in.setEncoding (Resolver.getEncoding (req.getContentType ())); try { parser = XMLReaderFactory.createXMLReader(); ... parser.setContentHandler (...); parser.parse (in); // content handler expected to handle response generation } catch (SAXException e) { response.sendError (response.SC_BAD_REQUEST, "bad input: " + e.getMessage ()); return; } catch (IOException e) { // maybe a relative URI in the input couldn't be resolved response.sendError (response.SC_INTERNAL_SERVER_ERROR "i/o problem: " + e.getMessage ()); return; } } You might have some XML text in a database, stored as a binary large object (BLOB, accessed using java.sql.Blob) and potentially referring to other BLOBs in the database. Constructing input sources for such data should be slightly different because of those references. You'd want to be sure to provide a URI, so the references can be resolved: String key = "42"; byte data [] = Storage.keyToBlob (key); InputStream stream = new ByteArrayInputStream (data); InputSource source = new InputSource (stream); XMLReader parser = ... ; source.setSystemId ("blob:" + key); parser.parse (source); In such cases, where you are using a URI scheme that your JVM doesn't support directly, consider using an EntityResolver to create the InputSource objects you hand to parse(). Such schemes might be standard (such as members of a MIME multipart/related bundle), or they might be private to your application (like this blob: scheme). (Example 3-3 shows how to package handling for such nonstandard URI schemes so that you can use them in your application, even when your JVM does not understand them. You may need to pass such URIs using public IDs rather than system IDs, so that parsers won't report errors when they try to resolve them.) 3.1.3. Filenames Versus URIsFilenames are not URIs, so you may not provide them as system identifiers where SAX expects a system identifier: in parse() or in an InputSource object. If you are depending on JDK 1.2 or later, you can rely on new File(name).toURL().toString() to turn a filename into a URI. To be most portable, you may prefer to use a routine as shown in Example 3-2, which handles key issues like mapping DOS or Mac OS filenames into legal URIs. Example 3-2. File.toURL() analogue for JDK 1.1public static String fileToURL (File f) throws IOException { String temp; if (!f.exists ()) throw new IOException ("no such file: " + f.getName ()); temp = f.getAbsolutePath (); if (File.separatorChar != '/') temp = temp.replace (File.separatorChar, '/'); if (!temp.startsWith ("/")) temp = "/" + temp; if (!temp.endsWith ("/") && f.isDirectory ()) temp = temp + "/"; return "file:" + temp; } If you're using the GNU software distribution that is described earlier, gnu.xml.util.Resolver.fileToURL() is available so you won't need to enter that code yourself. Copyright © 2002 O'Reilly & Associates. All rights reserved. |
|