Figure 4-1. SAX2 event pipeline
Pipeline stages can be used to create functional layers,
or they can simply be used to define clean module boundaries.
Some stages may work well with fragments of XML, while others
may expect to process entire documents.
The order in which processing tasks occur could be critically
important or largely incidental.
Stages can be application specific or general purpose.
In addition to reading and writing XML, examples of such
general-purpose stages include:
- Cleaning up namespace information
  to re-create prefix declarations and references,
  replace old URIs with current ones, or
  give unqualified names a namespace
- Performing XSLT transformations
- Validating against an appropriate DTD or schema
- Transforming input text to eliminate
  problematic character representations
  (several recent W3C specifications require using Unicode
  Normalization Form C)
- Supporting the xml:base
  model for determining base URIs
- Passing data through pipeline
  stages on remote servers
- Implementing XInclude or similar
  replacements for DTD-based external entity processing
- Performing well-formedness tests to
  guard against sloppy producers (parsers won't need this)
More application-specific pipeline stages
might include:
- Performing validation using procedural
  logic with access to system state
- Collecting links, to support tasks
  such as verifying they all work
- Unmarshaling application-specific data
  structures
- Stripping out data that later processing
  must never see; for example, SOAP 1.1 messages must never
  include processing instructions or DTDs, and some kinds
  of XHTML rendering engines must not see
  <font> tweaks
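A stripping stage like the last one can be written as a small
XMLFilterImpl subclass using nothing beyond the standard SAX
classes. This sketch (the class name PIStripper is invented for
illustration) silently drops processing instructions on their
way downstream, as a SOAP intermediary might require:

```java
import java.io.StringReader;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// A pipeline stage that swallows processing instructions; all
// other events pass through unchanged via the XMLFilterImpl
// superclass.
public class PIStripper extends XMLFilterImpl {
    @Override
    public void processingInstruction(String target, String data) {
        // drop the event instead of forwarding it downstream
    }

    // Parse a document through the filter and report which PI
    // targets reached the downstream handler (none should).
    public static String survivingPITargets(String xml) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance()
                .newSAXParser().getXMLReader();
        final StringBuilder seen = new StringBuilder();
        PIStripper filter = new PIStripper();
        filter.setParent(parser);   // producer mode: filter wraps the parser
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void processingInstruction(String target, String data) {
                seen.append(target).append(';');
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return seen.toString();
    }
}
```

Without the overridden method, both PI targets would reach the
downstream handler; with it, none do.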
This process is different from the way a workflow is managed
with a data structure API such as DOM.
In both cases you can assemble workflow components, with
intermediate work products represented as data structures.
With SAX, those workflow components would be pipelines;
pipeline stages wouldn't necessarily correspond to individual
workflow components, although they might.
With a data structure API,
the intermediate work products must always use that API;
with SAX they can use whatever representation is convenient,
including XML text or a specialized application data structure.
Beyond defining the event consumer interfaces and
how to hook them up to XML parsers, SAX includes only
limited support for pipelines. That is primarily
through the XMLFilterImpl class.
The support is limited in part because
XMLFilterImpl doesn't fully support
the two extension handlers, so by default
it won't pass enough of the XML Infoset to support
some interesting tasks (including several in the previous lists).
In the rest of this section we talk about that class,
XSLT and the javax.xml.transform package,
and about a more complete framework (the gnu.xml.pipeline package), to illustrate one alternative approach.
4.5.2. XMLFilter Examples
This book includes some examples that use
XMLFilterImpl as a base class, supporting both filter modes:
Example 6-3
shows a custom handler interface, delivering
application-specific unmarshaled data. This interface can be used either to postprocess or to preprocess SAX events, without additional setup.
Example 6-9
replaces processing instructions with the content of an
included document so that downstream stages won't know
about the substitution.
When used to postprocess events, the handler
may need to be set up
with appropriate EntityResolver and ErrorHandler objects.
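The other filter mode is consumer (postprocessing) mode, where the
filter is not wired to a parent parser at all; the parser simply
treats it as an ordinary ContentHandler that forwards events onward.
A minimal sketch (the class name and the renaming it performs are
invented for illustration):

```java
import java.io.StringReader;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Consumer mode: the filter is handed to the parser as its
// ContentHandler; events it does not override pass straight
// through to the next handler in the chain.
public class UppercaseFilter extends XMLFilterImpl {
    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) throws SAXException {
        super.startElement(uri, local, qName.toUpperCase(), atts);
    }

    @Override
    public void endElement(String uri, String local, String qName)
            throws SAXException {
        super.endElement(uri, local, qName.toUpperCase());
    }

    // Parse through the filter, collecting element names seen
    // by the downstream handler.
    public static String collectNames(String xml) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance()
                .newSAXParser().getXMLReader();
        final StringBuilder names = new StringBuilder();
        UppercaseFilter filter = new UppercaseFilter();
        parser.setContentHandler(filter);   // no setParent(): consumer mode
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String u, String l, String q,
                                     Attributes a) {
                names.append(q).append(';');
            }
        });
        parser.parse(new InputSource(new StringReader(xml)));
        return names.toString();
    }
}
```

Note that in this mode the filter never calls parse() itself, which
is why entity resolution and error handling may need to be
configured separately, as mentioned above.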
Sun is developing a "Multi-Schema Validator"
engine, which uses SAX filters to implement validators for
schema systems including RELAX (also called ISO RELAX),
TREX, RELAX NG (combining the best of RELAX and TREX),
and W3C XML Schema.
This work ties in to the
org.iso_relax.verifier
framework for validator APIs (at http://iso-relax.sourceforge.net), which
also supports using SAX objects (such as filters and
content handlers) to validate documents against schemas.
If you're using RDDL
(http://www.rddl.org)
as a convention for associating resources with XML namespaces,
you may find the
org.rddl.sax.RDDLFilter
class to be useful. It parses RDDL documents
and lets you determine the various resources
associated with namespaces, such as a
DTD, a preferred CSS or XSLT stylesheet,
or a schema in any of several schema languages.
This is another "producer-mode only" filter.
4.5.4. The gnu.xml.pipeline Framework
This framework takes a different approach to building
pipelines than XMLFilterImpl
or XMLFilter. Two key characteristics are its built-in support for all the SAX2 handlers, including the extension handlers, and its exclusive focus on the postprocessing model. In addition, it has several utility filters and some factory methods that can automate construction and initialization of pipelines. The core interface is EventConsumer:
public interface EventConsumer
{
    public ContentHandler getContentHandler ();
    public DTDHandler getDTDHandler ();
    public Object getProperty (String id)
        throws SAXNotRecognizedException;
    public void setErrorHandler (ErrorHandler handler);
}
With that interface, pipelines are normally set up
beginning with the last consumer and working back toward
the first one.
There is a formal convention that pipeline stages provide a
constructor taking an EventConsumer
parameter; that convention supports constructing pipelines from
simple textual descriptions (which look like Unix-style command
pipelines), and it also makes it easy to construct
a pipeline by hand, as shown in the following code.
Stages are strongly expected to share the same error handling;
the error handler is normally established after the pipeline
is set up, when a pipeline is bound to an event producer.
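The constructor convention and the back-to-front assembly order can
be illustrated with a simplified analogue of the GNU framework (not
the gnu.xml.pipeline classes themselves; the names here are invented
for this sketch). Each stage receives its downstream consumer when
it is constructed, so the last consumer must exist first:

```java
import java.util.ArrayList;
import java.util.List;

// A toy analogue of the gnu.xml.pipeline constructor convention:
// each stage takes the next (downstream) consumer as a constructor
// parameter, so a pipeline is built starting from its last stage.
public class MiniPipeline {
    interface Stage {
        void emit(String event, List<String> log);
    }

    // terminal consumer: just records what reaches the end
    static class Terminal implements Stage {
        public void emit(String event, List<String> log) {
            log.add(event);
        }
    }

    // intermediate stage: decorates events, then passes them on
    static class Tagger implements Stage {
        private final Stage next;
        private final String tag;
        Tagger(String tag, Stage next) {
            this.tag = tag;
            this.next = next;
        }
        public void emit(String event, List<String> log) {
            next.emit(tag + ":" + event, log);
        }
    }

    public static List<String> run() {
        // assembled back to front: Terminal first, then each
        // upstream stage wrapping what comes after it
        Stage pipeline = new Tagger("outer",
                new Tagger("inner", new Terminal()));
        List<String> log = new ArrayList<>();
        pipeline.emit("startDocument", log);
        return log;
    }
}
```

Here run() shows an event flowing through "outer", then "inner",
then into the terminal consumer, exactly the order the stages were
named when the pipeline was wired up in reverse.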
There is a class that corresponds to the pure consumer-mode
XMLFilterImpl, except that it
implements all the SAX2 event consumer interfaces, not
just the ones in the core API: LexicalHandler and DeclHandler are fully supported. This class also adds convenience methods such as the following:
public class EventFilter
    implements EventConsumer, ContentHandler, DTDHandler,
        LexicalHandler, DeclHandler
{
    ... lots omitted ...

    // hook up all event consumer interfaces to the producer
    // map some known EventFilters into XMLReader feature settings
    public static void bind (XMLReader producer, EventConsumer consumer)
        { /* code omitted */ }

    // wrap a "consumer mode" XMLFilterImpl
    public void chainTo (XMLFilterImpl next)
        { /* code omitted */ }

    ... lots omitted ...
}
Example 4-4 shows
how one simple event pipeline works using
the GNU pipeline framework. It looks like it has
three pipeline components (in addition to the parser),
but in this case it's likely that two of them will be
optimized away into parser feature flag settings:
NSFilter restores namespace-related
information that is discarded by SAX2 parser defaults
(bind() sets
namespace-prefixes to true and
discards that filter),
and ValidationFilter is a layered
validator that may not be necessary if the underlying
parser can support validation (in which case the
validation flag is set to true and
the filter is discarded).
Apart from arranging that validation errors are reported
and using the GNU DOM implementation instead of Crimson's,
this code does exactly what the first
SAX-to-DOM example above does.[22]
Example 4-4. SAX events to DOM document (using GNU DOM)
import java.io.IOException;

import org.w3c.dom.Document;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

import gnu.xml.pipeline.*;

public Document SAX2DOM (String uri)
throws SAXException, IOException
{
    DomConsumer domConsumer;
    EventConsumer consumer;
    XMLReader producer;

    domConsumer = new gnu.xml.dom.Consumer ();
    consumer = new ValidationConsumer (domConsumer);
    consumer = new NSFilter (consumer);

    producer = XMLReaderFactory.createXMLReader ();
    producer.setErrorHandler (new DefaultHandler () {
        public void error (SAXParseException e)
        throws SAXException
            { throw e; }
    });
    EventFilter.bind (producer, consumer);

    producer.parse (uri);
    return domConsumer.getDocument ();
}
There are some interesting notions lurking in this
example. For instance, when validation is a postprocessing
stage, it can be initialized with a particular DTD and
hooked up to an XMLReader that walks DOM nodes. That way, that DOM content can be incrementally validated as applications change it. Similarly, application code can produce a SAX event stream and validate content without saving it to a file. This same postprocessing approach could be taken with validators based on any of the various schema systems.
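The idea of producing SAX events directly from application code,
with no XML text involved, can be sketched with nothing but the
standard SAX classes. Here a hand-rolled producer fires events at a
consumer; the element names are made up, and the counting consumer
stands in for where a validation stage could sit:

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Application code as a SAX event producer: events are delivered
// straight to a consumer without ever serializing XML text.
public class SyntheticEvents {
    public static int countElements() throws SAXException {
        final int[] count = {0};
        DefaultHandler consumer = new DefaultHandler() {
            @Override
            public void startElement(String u, String l, String q,
                                     Attributes a) {
                count[0]++;
            }
        };
        AttributesImpl noAttrs = new AttributesImpl();

        // fire a well-formed event sequence by hand
        consumer.startDocument();
        consumer.startElement("", "order", "order", noAttrs);
        consumer.startElement("", "item", "item", noAttrs);
        consumer.endElement("", "item", "item");
        consumer.endElement("", "order", "order");
        consumer.endDocument();

        return count[0];
    }
}
```

A validating consumer hooked up the same way would check this
synthetic stream against its DTD or schema as the events arrive.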
There are a variety of other utility pipeline
stages and support classes in the
gnu.xml.pipeline package.
One is briefly shown later
(in Example 6-7).
Others include XInclude and XSLT support,
as well as a TeeConsumer
to send events down two pipelines
(like a tee joint used in plumbing).
This can be useful to save output for debugging;
you can write XML text to a file, or save it
as a DOM tree, and watch the events that come out
of a particular pipeline stage to find problematic areas.
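The tee idea is simple enough to sketch in the spirit of
TeeConsumer using only standard SAX classes (this is not the GNU
class itself): every event is forwarded to two downstream handlers,
say one doing real work and one saving a debugging trace. Only
startElement is shown; a full tee would forward every event the
same way:

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// A minimal tee stage: duplicates each event to two consumers.
public class Tee extends DefaultHandler {
    private final DefaultHandler first, second;

    public Tee(DefaultHandler first, DefaultHandler second) {
        this.first = first;
        this.second = second;
    }

    @Override
    public void startElement(String u, String l, String q, Attributes a)
            throws SAXException {
        first.startElement(u, l, q, a);
        second.startElement(u, l, q, a);
    }

    // demo: both branches see the same event
    public static String demo() throws SAXException {
        final StringBuilder a = new StringBuilder();
        final StringBuilder b = new StringBuilder();
        Tee tee = new Tee(
            new DefaultHandler() {
                @Override
                public void startElement(String u, String l, String q,
                                         Attributes at) { a.append(q); }
            },
            new DefaultHandler() {
                @Override
                public void startElement(String u, String l, String q,
                                         Attributes at) { b.append(q); }
            });
        tee.startElement("", "root", "root", new AttributesImpl());
        return a + "|" + b;
    }
}
```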
Even if you don't use that GNU framework, you should
keep in mind that SAX pipeline stages can be used to package
significant and reusable XML processing components.