Chapter 4. Consuming SAX2 Events
Most of the power of SAX is exposed through event callbacks.
In previous chapters you've seen some of the most widely
used event callbacks as well as how to ensure that all the callbacks
are generated and reported to application code.
This chapter presents the rest of the standard SAX
event-handling interfaces (including the extension handlers),
then talks about some of the common ways
that event consumers use those interfaces.
These interfaces are primarily implemented by application code
that consumes events and needs to solve particular problems.
You might also write custom event producers, which call
these interfaces directly rather than expecting some
type of XMLReader to issue them.
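As a rough sketch of that last idea, here is a producer that calls the ContentHandler callbacks directly, with no XMLReader involved. The class names ListProducer and EventLog, and the list/item element names, are made up for this illustration; only the org.xml.sax types are standard.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical producer: turns an array of strings into SAX events by
// calling the handler directly instead of going through an XMLReader.
class ListProducer {
    static void produce (String[] items, ContentHandler out)
    throws SAXException
    {
        AttributesImpl noAtts = new AttributesImpl ();

        out.startDocument ();
        out.startElement ("", "list", "list", noAtts);
        for (String item : items) {
            char[] buf = item.toCharArray ();
            out.startElement ("", "item", "item", noAtts);
            out.characters (buf, 0, buf.length);
            out.endElement ("", "item", "item");
        }
        out.endElement ("", "list", "list");
        out.endDocument ();
    }
}

// Hypothetical consumer: records the events it sees as text.
class EventLog extends DefaultHandler {
    final StringBuilder seen = new StringBuilder ();

    public void startElement (String uri, String lname, String qname,
            Attributes atts)
        { seen.append ('<').append (qname).append ('>'); }

    public void characters (char[] buf, int offset, int len)
        { seen.append (buf, offset, len); }

    public void endElement (String uri, String lname, String qname)
        { seen.append ("</").append (qname).append ('>'); }
}
```

The consumer can't tell whether a parser or ListProducer.produce() is on the other end, which is the point: any code that makes the callbacks in a legal order is a usable event producer.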
4.1. More About ContentHandler
In Section 2.3, "Basic ContentHandler Events", in Chapter 2, "Introducing SAX2",
we looked at the most important APIs used to handle XML
document content. Some other APIs were deferred to this
section because they aren't used as widely.
Depending on what problems you're solving, you may rely
heavily on some of these additional methods.
4.1.1. Other ContentHandler Methods
Five ContentHandler
callbacks were discussed in Chapter 2:
Section 2.3.4, "Essential ContentHandler Callbacks" explained how
characters and element boundaries were reported, and
Section 2.6.4, "ContentHandler and Prefix Mappings" explained how
namespace-prefix scopes were reported.
But the interface has six other methods.
Here's what they do and when you'll want to use them:
- void setDocumentLocator (Locator l)
This is normally the first callback from a
parser; the single parameter is a
Locator, discussed later.
Strictly speaking, SAX parsers are not required to provide
a locator or to make this callback;
however, you'd want to avoid parsers that don't provide this
information.
Your implementation of this callback will normally just
save the locator; it can't do much more since it's the
only SAX
event callback that can't throw a
SAXException:
    class MyHandler implements ContentHandler ... {
        private Locator locator;
        ...
        public void setDocumentLocator (Locator l)
            { locator = l; }
        ...
    }
Use this object as discussed later in this chapter, in
Section 4.1.2, "The Locator Interface".
It is the standard way to report
the base URI of the XML text currently being parsed;
that information is essential for resolving relative URIs.
It's also essential for diagnostics that tell you
where application code detects errors in large quantities
of XML text.
- void startDocument ()
- void endDocument ()
These two callbacks bracket
processing for a
document, and they are normally used to manage application
state associated with the document being parsed.
If you're parsing a document, these methods will always
be called once each, even when parsing is cut
short by a thrown exception.
No other methods have such guarantees.
startDocument() is always called
before any data is reported from the parser, and is normally
used to initialize application data structures.
It will usually be the second callback from the parser;
parsers that provide a Locator
will report that first. You can't rely on a
setDocumentLocator()
call before startDocument();
structure your initialization code to do the
real work in the callback guaranteed to be available.
endDocument() is always called
to report that no more document data will be provided.
The normal application response is to clean up
all state associated with the current parse.
The parser closes any input data streams you gave it
using an InputSource (discussed later),
so the application doesn't need to do that.
Cleanup would include forgetting any saved
Locator since that object
is no longer usable when the parse is complete.
Also, you'd likely close other files or sockets that were
opened while processing this document:
    class MyHandler implements ContentHandler ... {
        ...
        public void startDocument ()
        throws SAXException
        {
            // initialize data structures for ALL handlers here
            ...
        }
        public void endDocument ()
        throws SAXException
        {
            // free those same data structures
            locator = null;
            elementStack = null;
            ...
        }
        ...
    }
These two calls are widely used in robust SAX
code because they provide such good hooks to control
memory usage and manage associated file descriptors. However, some SAX2 parsers have a bug that reduces the robustness offered by SAX; they won't correctly call endDocument() when parsing is aborted by throwing exceptions.
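If you can't be sure your parser honors the endDocument() guarantee, one defensive option is to make cleanup idempotent and also trigger it from a finally clause around the parse. The following is only a sketch under that assumption: CleanupHandler, SafeParse, and the rule that a bad element aborts the parse are all invented for the example, and the parser comes from the JAXP SAXParserFactory.

```java
import java.io.StringReader;
import java.util.ArrayList;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler whose per-document state is an element stack.
class CleanupHandler extends DefaultHandler {
    ArrayList<String> elementStack;
    int cleanups;       // how many times state was really freed

    public void startDocument ()
        { elementStack = new ArrayList<String> (); }

    public void startElement (String uri, String lname, String qname,
            Attributes atts) throws SAXException
    {
        elementStack.add (qname);
        // made-up application rule, used here to abort the parse
        if ("bad".equals (qname))
            throw new SAXException ("application rejected <bad>");
    }

    public void endElement (String uri, String lname, String qname)
        { elementStack.remove (elementStack.size () - 1); }

    public void endDocument ()
        { cleanup (); }

    // Idempotent: safe to call from endDocument() and again afterward.
    void cleanup ()
    {
        if (elementStack != null) {
            elementStack = null;
            cleanups++;
        }
    }
}

class SafeParse {
    // Returns true if the parse ran to completion, false if it was aborted.
    static boolean parse (String xml, CleanupHandler handler)
    throws Exception
    {
        XMLReader reader = SAXParserFactory.newInstance ()
            .newSAXParser ().getXMLReader ();
        reader.setContentHandler (handler);
        try {
            reader.parse (new InputSource (new StringReader (xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } finally {
            // covers parsers that skip endDocument() on abort
            handler.cleanup ();
        }
    }
}
```

With a conforming parser the finally clause does nothing, because endDocument() already ran. With a buggy one it is the only thing that frees the state.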
- void processingInstruction (String target, String data)
Processing Instructions (PIs) are used
in XML for data that doesn't obey the rules of a DTD.
They can be placed anywhere in a document, including within
the DTD, except inside other markup constructs like tags.
Unlike comments, PIs are designed for applications to use.
They're part of the document structure that
programmatic logic must understand; they can
follow rules, just not ones found in a DTD or schema.
This method has two parameters:
- String target
XML applications use this parameter
to determine how to handle the PI.
You can rely on the fact that it'll never
be the string xml
(in any combination of upper- and lowercase characters)
because XML and text declarations
are not processing instructions.
Some documents follow the convention that the target
of a PI names a notation (perhaps the fully qualified URI
found in its system identifier) and the meaning is
associated with the notation rather than the name.
That's a fine practice to follow, but it isn't essential.
Most code just compares target names as strings,
rather than use data reported with
DTDHandler.notationDecl()
to figure out what a target name should mean.
- String data
This parameter is data
associated with the PI,
and it may be the null string if no data was provided
after the target name.
Some applications use the syntax of an attribute here;
others don't bother.
Processing instructions are natural to use in
template systems and other document-oriented
applications.[19]
Processing instructions are normally safe to ignore
when your processing doesn't recognize them (passing them on to any subsequent processing stage), or to store. If your processing does recognize them, it normally acts on them immediately. For example, an <?xml-stylesheet ...?> PI might select a particular XSLT stylesheet to use for generating a servlet's output. The processing instruction event is used later, in Example 6-9.
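Here is a minimal sketch of that pattern: compare the target as a string, act on the one PI you recognize, and let the rest go by. The StylesheetPIHandler class and its fields are invented for the example, and the regular expression is a simplification that assumes the data uses attribute-style syntax, which SAX itself doesn't require.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that remembers the href of an xml-stylesheet PI.
class StylesheetPIHandler extends DefaultHandler {
    // name="value" or name='value'; good enough for a sketch
    private static final Pattern PSEUDO_ATTR = Pattern.compile (
        "([A-Za-z_][\\w.-]*)\\s*=\\s*(\"([^\"]*)\"|'([^']*)')");

    String stylesheetHref;  // stays null until such a PI is seen
    int ignored;            // PIs this handler didn't recognize

    public void processingInstruction (String target, String data)
    {
        if (!"xml-stylesheet".equals (target)) {
            ignored++;      // safe to ignore (or pass downstream)
            return;
        }
        if (data == null)
            return;
        Matcher m = PSEUDO_ATTR.matcher (data);
        while (m.find ()) {
            if ("href".equals (m.group (1)))
                stylesheetHref = (m.group (3) != null)
                    ? m.group (3) : m.group (4);
        }
    }
}
```

A real application would probably also look at the type pseudo-attribute before deciding what to do with the stylesheet.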
- void ignorableWhitespace (char buf[], int offset, int len)
This is an optional callback, made by
most parsers (including all that are validating) to report
whitespace that separates elements in element content models,
like those of the form (title,para*,sect1*) but
not (#PCDATA|para|comment)*,
ANY, or EMPTY.
Whitespace before or after the document's root element
is not treated as ignorable and is completely discarded.
Providing this information is a requirement of the XML
specification, since this kind of whitespace is defined
to be markup rather than document content.
If the parser doesn't see such a content model declaration
for any reason, it can't use this callback;
it'll use characters() instead,
and applications will need to figure out if the
whitespace is part of markup or part of content.
The parameters are exactly the same as those of the
characters() callback, except
that you know the characters in the specified range
will all be spaces, tabs, or newlines.
(Keep that in mind if you're directly producing ignorable
whitespace to feed some event consumer.
Using CRLF- or CR-style line ends here is a bug,
though you might not see immediate consequences.)
Like characters(), this method can be
called several times in a row, to complete processing
a single stretch of characters.
There are two popular ways to handle this callback.
My favorite is to drop all the characters;
they're only in the source document to make the
elements lay out nicely,
so they won't ever mean anything.
There's rarely a reason to even look at the data,
much less save it.
The other option is to delegate handling and just call the
characters() callback with the
whitespace.
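Both policies fit in a few lines. In this sketch the TextCollector class and its keepIgnorable switch are invented names; the callbacks are the standard ones.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler showing both ways to treat ignorable whitespace.
class TextCollector extends DefaultHandler {
    private final boolean keepIgnorable;
    final StringBuilder text = new StringBuilder ();

    TextCollector (boolean keepIgnorable)
        { this.keepIgnorable = keepIgnorable; }

    public void characters (char[] buf, int offset, int len)
        { text.append (buf, offset, len); }

    public void ignorableWhitespace (char[] buf, int offset, int len)
    {
        if (keepIgnorable)
            characters (buf, offset, len);  // delegate
        // otherwise drop it without even looking at the data
    }
}
```

Since DefaultHandler already ignores this callback, the "drop" policy needs no code at all in a handler built on that class; only the delegating policy needs an override.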
- void skippedEntity (String name)
The parameter is a
String
that identifies an internal or external parsed entity.
General entity names are presented as found in their
declarations (dudley).
Parameter entity names begin with a percent sign
(%nell).
The external DTD subset is special; it's an unnamed
parameter entity and is reported with the name
[dtd].
You might not be able to tell if the skipped entity was
an internal or external entity, even
using DeclHandler events.
You probably don't ever want to see this call,
since it means that part of your document has been hidden.
XML 1.0 processors are required to report this case;
SAX 1.0 didn't, and most other parser-level APIs (such
as DOM Level 2) still don't.
This is a call that only nonvalidating parsers
may issue, and even then only if they are not parsing all
the external entities referred to in documents -- that
is, where one or both of the
external entities feature
flags is set to false, to disable reading external
general or parameter entities.
No widely used Java parsers clear those flags by default,
so this is a rare call in Java.
However some C parsers, such as Expat (used in Mozilla),
won't normally parse external entities,
so the notion isn't exotic in all languages.
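If you do need to notice skipped entities, the naming rules above are enough to classify them. In the sketch below, SkippedEntityLog and its helper method are hypothetical names; the two feature URIs are the standard SAX2 external entity flags, and a given parser may refuse to change them, in which case setFeature() throws an exception.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that records what the parser skipped.
class SkippedEntityLog extends DefaultHandler {
    final List<String> skipped = new ArrayList<String> ();

    public void skippedEntity (String name)
    {
        String kind;

        if ("[dtd]".equals (name))
            kind = "external DTD subset";
        else if (name.startsWith ("%"))
            kind = "parameter entity";
        else
            kind = "general entity";
        skipped.add (kind + ": " + name);
    }

    // Asks a parser not to read external entities, which is the
    // situation where skippedEntity() calls become possible.
    static void disableExternalEntities (XMLReader reader)
    throws SAXException
    {
        reader.setFeature (
            "http://xml.org/sax/features/external-general-entities",
            false);
        reader.setFeature (
            "http://xml.org/sax/features/external-parameter-entities",
            false);
    }
}
```

An application that must see the whole document could treat a nonempty list as an error once the parse is finished.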
4.1.2. The Locator Interface
This useful interface is sometimes overlooked.
It gives information that is essential for providing
location-sensitive diagnostics and is often given to
SAXParseException constructors.
That same information is also needed to resolve relative URIs
in document content or attribute values (such as
xml:base).
Parsers provide one instance of this class, which can be
used inside event callbacks to find what entity triggered
the event and approximately where.
Use that locator only during such callbacks.
There are only a few methods in this class.
- String getSystemId ()
This is the most important method in
this interface.
It returns the base URI (system ID) for the entity
being parsed; this is always an absolute URI.
(However, versions of Xerces that are current at this
writing have a bug here. They sometimes return nonabsolute URIs.) Use this method to identify the document or external entity in diagnostics or to resolve relative URIs (perhaps in conjunction with xml:base attributes).
If the parser doesn't know this value, null is
returned. This normally indicates that the parser was
not given such a URI inside of an
InputSource encapsulating document
text. That's bad practice
except when it's unavoidable, such as parsing in-memory data
or input to the POST method in a servlet.
- int getLineNumber ()
- int getColumnNumber ()
These two methods approximate the current
position of a parser within an entity.
The position reflected is where the relevant event's data
ended. It is only an approximation for diagnostics,
but most parsers do try to be accurate about the line number.
These numbers count up from 1 as appropriate for
user-oriented diagnostics. Not all implementations will provide these values;
the value -1 is returned to indicate
that no value was provided.
- String getPublicId ()
A public identifier may be provided
with this method. Otherwise null is returned.
This may be useful for diagnostics in some cases.
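Because any of these methods can come back empty-handed (null, or -1), diagnostic code needs to cope with missing pieces. Here is one way to do that; the Where class and the exact message format are made up for this sketch.

```java
import org.xml.sax.Locator;

// Hypothetical helper that turns a Locator into text for a diagnostic.
class Where {
    static String describe (Locator loc)
    {
        if (loc == null)
            return "(unknown location)";

        StringBuilder sb = new StringBuilder ();
        String id = loc.getSystemId ();

        // fall back to the public ID when there's no system ID
        if (id == null)
            id = loc.getPublicId ();
        sb.append (id != null ? id : "(unknown entity)");

        // -1 means the parser didn't provide the value
        if (loc.getLineNumber () != -1) {
            sb.append (", line ").append (loc.getLineNumber ());
            if (loc.getColumnNumber () != -1)
                sb.append (", column ").append (loc.getColumnNumber ());
        }
        return sb.toString ();
    }
}
```

Call it only from inside an event callback, since that's the only time the locator's values are meaningful.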
One common use for a locator is to report an error
detected while an application processes document content.
The SAXParseException class has two
constructors that take locator parameters. (The descriptive
string is always first, the locator is second, and an
optional "root cause" exception is third.) Once you
create such an exception, it can be thrown directly, which
always terminates a parse. Or you can pass it to an
ErrorHandler to centralize error-handling
policy in your application:
    // "locator" was saved when setDocumentLocator() was called earlier
    // or was initialized to null; this is safe in both cases
    try {
        ...
        engine.setWarpFactor (11);
        ...
    } catch (DriveException e) {
        SAXParseException spe = new SAXParseException (
            "The warp engine's gonna blow!",
            locator,
            e);
        // report the wrapper, which carries the location information
        errHandler.error (spe);
        // we'll get here whenever such problems are ignored
    }
To resolve relative URIs in document content -- for example, one found in an <xhtml:a href="..."/> reference in a link checker -- you'd use code like this
(ignoring xml:base complications):
    public void startElement (String uri, String lname, String qname,
            Attributes atts) throws SAXException
    {
        if (xhtmlURI.equals (uri)) {
            if ("a".equals (lname)) {
                String href = atts.getValue ("href");
                if (href != null) {
                    // ASSUMES: locator is nonnull, and its system ID
                    // is a syntactically valid absolute URI
                    // (URI is java.net.URI)
                    System.out.println ("Found href to: " +
                        URI.create (locator.getSystemId ()).resolve (href));
                }
                // else presumably <xhtml:a name="..."/>
            }
        } ...
    }
Some of the XMLReader
implementations cannot possibly call
ContentHandler.setDocumentLocator()
with a Locator.
When parsing in-memory data structures, such as a DOM document,
a locator will normally be meaningless.
When parsing in-memory buffers like a String (with
a StringReader), there won't
usually be a URI in the locator.
If your application supports the layered
xml:base convention (which lets documents
"lie" about their true locations for purposes of resolving
relative URIs), it will need to track those
attributes itself, as part of a context stack mechanism.
(An example of such a stack is shown later, in
Example 5-1.)
Such attributes can sometimes help make up for SAX event
sources that can't provide locator information, such as
DOM-to-SAX producers.
But they can confuse things too: in the following
example, xml:base would apply to the
top element and its direct children, but nothing
within the external entity reference.
(Let's assume, for the sake of discussion, that no other element
has an xml:base attribute.)
    <top xml:base="http://www.example.com/moved/doc2.xml">
        <xhtml:a href="abc.xml"/>
        <xhtml:div> &external; </xhtml:div>
        <xhtml:a href="xyz.xml"/>
    </top>
When character content of an element is reported,
characters from different external entities will get
different callbacks, so the locator can be used to
tell those different entities apart from each other.
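To make the xml:base tracking concrete, here is a deliberately simplified sketch of such a stack. BaseTracker and its methods are invented names. It assumes namespace processing is enabled, so that xml:base is reported with the standard XML namespace URI, and it ignores the external entity complication just described: a fuller version would also compare the locator's system ID against the one in effect for the parent element.

```java
import java.net.URI;
import java.util.ArrayList;
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that keeps one base URI per open element.
class BaseTracker extends DefaultHandler {
    static final String XML_NS = "http://www.w3.org/XML/1998/namespace";

    // ArrayList rather than ArrayDeque, since entries may be null
    private final ArrayList<URI> bases = new ArrayList<URI> ();
    private Locator locator;

    public void setDocumentLocator (Locator l)
        { locator = l; }

    public void startDocument ()
        { bases.clear (); }

    public void endDocument ()
        { bases.clear (); locator = null; }

    public void startElement (String uri, String lname, String qname,
            Attributes atts)
    {
        URI base = currentBase ();
        String value = atts.getValue (XML_NS, "base");

        // a relative xml:base is resolved against the inherited base
        if (value != null)
            base = (base == null) ? URI.create (value) : base.resolve (value);
        bases.add (base);
    }

    public void endElement (String uri, String lname, String qname)
        { bases.remove (bases.size () - 1); }

    // Innermost base URI; falls back to the locator outside all elements.
    URI currentBase ()
    {
        if (!bases.isEmpty ())
            return bases.get (bases.size () - 1);
        String id = (locator == null) ? null : locator.getSystemId ();
        return (id == null) ? null : URI.create (id);
    }

    URI resolve (String ref)
    {
        URI base = currentBase ();
        return (base == null) ? URI.create (ref) : base.resolve (ref);
    }
}
```

Note that URI.create() throws IllegalArgumentException for strings that aren't legal URIs, so production code would want to catch that and report it with the locator.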
4.1.3. Internationalization Concerns
One of the goals of XML was to bring Unicode into
widespread use so that the Web could really become worldwide
in terms of people, not just technology.
This brings several concerns into text management.
You may not need to worry about these if you're working
only in ASCII or with just one character encoding.
While you're just starting out with Java and XML you
should certainly avoid worrying about these details.
Some other users of SAX2 will need to understand these
issues. Since they surface primarily with
ContentHandler event callbacks,
we briefly summarize them here.
If your application works with MathML, or in various
languages whose character sets gained support in Unicode 3.1
through the so-called Astral Planes, you will need to know
that what Java calls a char is not really
the same thing as a Unicode character or an XML character.
If you aren't using such languages, you'll probably be able
to ignore this issue for a while. Still, you might want to
read about Unicode 3.1 to learn more about this and minimize
trouble later.
By the time you read this, the W3C may even have
completed its "Blueberry" XML update, intended to allow
the use of some such characters within XML names.
In the case of such characters, whose Unicode code point
is above the value U+FFFF (the maximum
16-bit code point), these characters are mapped to two
Java char values, called a
surrogate pair.
The char values are in a range reserved
for surrogate characters, with a
high surrogate always
immediately followed by a low surrogate.
(This is called a big-endian sequence.)
Surrogate pairs can show up in several places in XML,
and hence in SAX2:
in character content, processing instructions, attribute
values (including defaults in the DTD), and comments.
At this time, Java does not have APIs to explicitly
support characters using surrogate pairs, although character
arrays and java.lang.String will hold
them as if the char values weren't part of
the same character.
The java.lang.Character class doesn't
recognize surrogate pairs.
The best precaution seems to be to prefer APIs that talk in
terms of slices of character arrays (or
Strings), rather than
in terms of individual Java char values.
This approach also handles other situations where more
than one char value is needed per character.
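If you do have to look inside a slice, plain range checks on the char values are enough to keep surrogate pairs together. This sketch uses nothing beyond arithmetic on char; the Surrogates class and its method names are invented, and the ranges are the ones Unicode reserves for high (D800 through DBFF) and low (DC00 through DFFF) surrogates.

```java
// Hypothetical helpers for working with slices of character arrays
// without splitting a surrogate pair.
class Surrogates {
    static boolean isHigh (char c)
        { return c >= 0xD800 && c <= 0xDBFF; }

    static boolean isLow (char c)
        { return c >= 0xDC00 && c <= 0xDFFF; }

    // Counts characters in a slice, treating each well-formed
    // surrogate pair as one character.
    static int countCharacters (char[] buf, int offset, int len)
    {
        int end = offset + len;
        int count = 0;

        for (int i = offset; i < end; i++) {
            if (isHigh (buf [i]) && i + 1 < end && isLow (buf [i + 1]))
                i++;        // skip the second half of the pair
            count++;
        }
        return count;
    }

    // Shortens a proposed slice by one char value if it would
    // otherwise end between the two halves of a pair.
    static int safeLength (char[] buf, int offset, int wanted)
    {
        int end = offset + wanted;

        if (wanted > 0 && end < buf.length
                && isHigh (buf [end - 1]) && isLow (buf [end]))
            return wanted - 1;
        return wanted;
    }
}
```

A handler that chops long characters() data into fixed-size chunks could run each proposed chunk length through safeLength() first.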
Depending on the character encodings you're using
and the applications you're implementing, you may also need
to pay attention to the W3C Character Model
(http://www.w3.org/TR/charmod/
at this writing) and Unicode Normalization Form C.
Briefly, these aim to eliminate undesirable representations
of characters and to handle some other cases where Unicode
characters aren't the same as XML characters or a Java
char, such as composite characters.
For example, many accented characters are represented by
composing two or more Unicode characters.
Systems work better when they only need to handle one way to
represent such characters, and Form C addresses that problem.
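As an illustration of what Form C does, the sketch below collects character data and normalizes it in one step. It relies on the java.text.Normalizer class, which arrived in Java releases (Java 6 onward) later than the ones this text otherwise describes; NfcTextCollector and takeText() are invented names. Text is accumulated first because a parser may split a base character and its combining accent across two characters() calls.

```java
import java.text.Normalizer;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that returns collected text in Form C.
class NfcTextCollector extends DefaultHandler {
    private final StringBuilder pending = new StringBuilder ();

    public void characters (char[] buf, int offset, int len)
        { pending.append (buf, offset, len); }

    // Returns everything collected so far, normalized, and starts over.
    String takeText ()
    {
        String text = Normalizer.normalize (pending, Normalizer.Form.NFC);

        pending.setLength (0);
        return text;
    }
}
```

For example, an "e" followed by a combining acute accent comes back as the single precomposed character, which is the one representation downstream code then has to handle.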
Copyright © 2002 O'Reilly & Associates. All rights reserved.