4.2. The LexicalHandler Interface
This extension interface is new in SAX2.
It's in the org.xml.sax.ext package, which
means, among other things, that it is optional and isn't supported
by every SAX API and layer;
DefaultHandler, for example, does not implement it.
However, any SAX2 parser that can be bootstrapped with JAXP
supports this interface.
Parsers that support LexicalHandler
expose comment text and the boundaries of CDATA sections,
DTDs, and most parsed entities.
There is no setLexicalHandler() method;
bind these handlers to parsers like this:
XMLReader producer = ...;
LexicalHandler handler = ...;
producer.setProperty ("http://xml.org/sax/properties/lexical-handler",
handler);
// throws SAXNotSupportedException if parameter isn't a LexicalHandler
// throws SAXNotRecognizedException if parser doesn't support it.
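If you'd rather probe for support than let those exceptions propagate, the binding can be wrapped in a small helper. This is only a sketch: the LexicalBinding class and its method names are hypothetical, and it assumes the parser is bootstrapped through JAXP's SAXParserFactory.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.LexicalHandler;

// Hypothetical helper; neither the class nor its methods are part of SAX.
class LexicalBinding {
    static final String LEXICAL_HANDLER =
        "http://xml.org/sax/properties/lexical-handler";

    // Bootstraps an XMLReader through JAXP.
    static XMLReader newReader()
            throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    // Returns true if the handler was installed, or false if the parser
    // doesn't recognize or support the lexical-handler property.
    static boolean bind(XMLReader producer, LexicalHandler handler) {
        try {
            producer.setProperty(LEXICAL_HANDLER, handler);
            return true;
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return false;
        }
    }
}
```

A caller that gets false back can carry on with just the core ContentHandler events, accepting that comments and the various boundary reports won't arrive.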
Applications that need more in the way of "round-tripping"
support than the SAX2 core allows need the information
this interface exposes.
That is, less of the information read by parsers
will be completely discarded.
Such applications need SAX to provide more complete support
for the XML Infoset (or for the XPath data model).
To completely support DOM, XPath, or XSLT on top of a SAX2
parser, this interface is as necessary as the namespaces
exposed in the SAX2 ContentHandler
and Attributes interfaces.
The downside is that much of this information is in the category
of information applications shouldn't want to deal with.
Be careful how you use these callbacks; don't assume that
just because the information is available, you should use it.
LexicalHandler has the following methods:
- void comment(buf, offset, len)
Reports characters inside a
<!--...-->
comment section (without the delimiting characters).
For many applications, this event is the only reason to use this interface.
This is almost the same convention ContentHandler uses to report
character content or ignorable whitespace; the parameters are identical.
The difference is that comments are always reported in a single callback:
two consecutive comment() calls mean two consecutive comments, while
two consecutive characters() calls just enlarge a given logical span of text.
- char buf []
A character array that holds the
comment text. As with the
ContentHandler.characters()
callback, you must ignore characters in this buffer
that are outside of the specified range.
- int offset
The index of the first comment
character in the buffer.
- int len
How many comment characters are
in the buffer, beginning at the specified offset.
Comments show up in the XPath data model, so they are
reflected in layers (such as XSLT, XPointer, and XLink) that
build on XPath. Strictly speaking, applications should
ignore comments except when they round-trip data
provided during authoring. Instead, they should use
processing instructions when they need to work with
annotations. You might need to use comment data with HTML
processors, because HTML doesn't support processing instructions.
For example, HTML documents often use comments to wrap CSS data,
JavaScript code, or server-side includes.
There are two good ways to handle comments. One
is just to discard them and make the implementation of
this method do nothing. (I like that one!) The other is
to create a new String using the
method parameters and
save the string somewhere. Avoid parsing comment content;
if you're tempted to do that in new applications, try
to use PIs (which were designed for such purposes).
public void comment (char buf [], int offset, int len)
throws SAXException
{
    String value = new String (buf, offset, len);
    // ... now that you have it, what do you want to do?
}
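The "save the string somewhere" approach might look like the following sketch. The CommentCollector class is hypothetical; it assumes a JAXP-bootstrapped parser and extends org.xml.sax.ext.DefaultHandler2, a convenience class from later SAX2 releases that provides empty implementations of the extension handlers.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical handler that saves every comment it is given.
class CommentCollector extends DefaultHandler2 {
    final List<String> comments = new ArrayList<>();

    @Override
    public void comment(char[] buf, int offset, int len) {
        // Copy only the reported range; the rest of the buffer
        // belongs to the parser.
        comments.add(new String(buf, offset, len));
    }

    // Parses the given XML text and returns its comments in document order.
    static List<String> collect(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        CommentCollector handler = new CommentCollector();
        reader.setContentHandler(handler);
        reader.setProperty(
            "http://xml.org/sax/properties/lexical-handler", handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.comments;
    }
}
```

Because each comment arrives in a single callback, the list ends up with one entry per comment; there is no need to buffer and join pieces the way you would with characters().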
- void startDTD(name, publicId, systemId)
- void endDTD()
The startDTD() event
reports the beginning of a document's DTD, and
endDTD() reports the end.
These events can be useful when you save DTD information, as
with the partial DTD support in DOM Level 2. They
are also important when you create SAX event streams
that may need to print as documents that include a DTD.
- String name
The declared name of the root
element for the document. It is never omitted, though
for invalid documents it may not correspond
to the name of the root element.
- String publicId
Normalized version of the public ID
declared for the external subset, or null if no such subset
was provided.
- String systemId
The system ID declared for the external
subset, or null if no such ID was provided.
Note that this URI is not absolutized.
When the end of the DTD is reported, all other
declarations that should have been reported (with
DeclHandler
or DTDHandler callbacks)
will have been reported. If any
ContentHandler.skippedEntity() calls
were made for external parameter entities, applications will
normally infer that some declarations were not processed.
Parsers are not required to distinguish the internal
and external subsets. There are two mechanisms
applications can use, but both of them are optional.
The natural method is to rely on external parameter entity
boundary reports, using other methods in this interface.
Not all parsers report those entities; you can check
the lexical-handler/parameter-entities
feature flag to see if this mechanism will work for you.
The other mechanism compares base URIs as reported through
the Locator.getSystemId() method;
base URIs for external subset components will differ from
those of the document itself.
Most parsers support this method, but it's awkward to
use for this purpose.
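Checking that feature flag might be sketched as follows. The ParameterEntityReporting class is a hypothetical helper; the feature URI is the standard SAX2 feature prefix followed by the flag name given above.

```java
import java.util.Optional;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical helper for probing the parameter-entities feature flag.
class ParameterEntityReporting {
    static final String FEATURE =
        "http://xml.org/sax/features/lexical-handler/parameter-entities";

    // Returns the flag's current value, or an empty Optional if the
    // parser doesn't recognize the flag or can't report it right now.
    static Optional<Boolean> query(XMLReader reader) {
        try {
            return Optional.of(reader.getFeature(FEATURE));
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return Optional.empty();
        }
    }
}
```

Treating an empty result the same as false is the cautious choice: either way, you can't count on seeing parameter entity boundaries and would have to fall back on comparing base URIs.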
If you're saving DTD content, these methods will
bracket a lot of work where you squirrel data
away for later use. Otherwise, you'll probably
arrange to ignore all the other DTD events and will
only need to decide what to do with comments and processing
instructions, if you don't just ignore them.
Ignoring them within DTDs is a popular strategy even when
they're not ignored elsewhere. This is because comments
or PIs inside a DTD would seem to apply to DTD contents,
while most applications are instead working with
document contents.
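The strategy of ignoring comments and PIs only while inside the DTD needs nothing more than a flag set by these two callbacks. Here is a sketch; the DtdSkippingHandler class and its field names are hypothetical, and it again assumes a JAXP-bootstrapped parser and the DefaultHandler2 convenience class from later SAX2 releases.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical handler: keeps comments and PI targets from the document
// body, and drops the ones that appear inside the DTD.
class DtdSkippingHandler extends DefaultHandler2 {
    String rootName;                       // name declared in the DOCTYPE
    final List<String> bodyComments = new ArrayList<>();
    final List<String> bodyPIs = new ArrayList<>();
    private boolean inDtd;

    @Override
    public void startDTD(String name, String publicId, String systemId) {
        rootName = name;
        inDtd = true;
    }

    @Override
    public void endDTD() {
        inDtd = false;
    }

    @Override
    public void comment(char[] buf, int offset, int len) {
        if (!inDtd) {
            bodyComments.add(new String(buf, offset, len));
        }
    }

    @Override
    public void processingInstruction(String target, String data) {
        if (!inDtd) {
            bodyPIs.add(target);
        }
    }

    static DtdSkippingHandler parse(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        DtdSkippingHandler handler = new DtdSkippingHandler();
        reader.setContentHandler(handler);
        reader.setProperty(
            "http://xml.org/sax/properties/lexical-handler", handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler;
    }
}
```

Note that the handler has to be registered twice, once as the ContentHandler (for PIs) and once as the LexicalHandler (for comments and DTD boundaries).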
- void startCDATA()
- void endCDATA()
These methods report the beginning and end of a
<![CDATA[...]]> text section;
the bracketing characters are not reported.
Any content within a CDATA section is reported with
characters()
events; the < and
& characters within CDATA
sections are parsed like normal characters,
not like delimiters for markup.
Most software has little reason to care whether
character content is contained in CDATA sections.
Unless you are trying to round-trip data while preserving
those lexical artifacts (to simplify potential future work
done with text editors), the right response to CDATA
events is to ignore them.
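If you do fall into the round-tripping category, the CDATA events tell you where to put the brackets back. The following is a minimal sketch that handles only character content, not a complete serializer; the CdataPreservingText class is hypothetical and builds on the DefaultHandler2 convenience class from later SAX2 releases.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical handler: rebuilds character content, keeping CDATA
// sections as CDATA and escaping markup characters everywhere else.
class CdataPreservingText extends DefaultHandler2 {
    final StringBuilder out = new StringBuilder();
    private boolean inCdata;

    @Override
    public void startCDATA() {
        inCdata = true;
        out.append("<![CDATA[");
    }

    @Override
    public void endCDATA() {
        inCdata = false;
        out.append("]]>");
    }

    @Override
    public void characters(char[] buf, int offset, int len) {
        for (int i = offset; i < offset + len; i++) {
            char c = buf[i];
            if (inCdata) {
                out.append(c);          // no escaping inside CDATA
            } else if (c == '<') {
                out.append("&lt;");
            } else if (c == '&') {
                out.append("&amp;");
            } else if (c == '>') {
                out.append("&gt;");
            } else {
                out.append(c);
            }
        }
    }

    static String roundTrip(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        CdataPreservingText handler = new CdataPreservingText();
        reader.setContentHandler(handler);
        reader.setProperty(
            "http://xml.org/sax/properties/lexical-handler", handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.out.toString();
    }
}
```

Appending to one buffer means the result doesn't depend on how the parser happens to split text across characters() calls.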
- void startEntity(String name)
- void endEntity(String name)
These methods report the beginning and
end of internal or external entity expansion.
The entity is named using the same rules as the
ContentHandler.skippedEntity() callback.
If you need to indicate which kind of entity is being expanded,
record information from the DeclHandler.externalEntityDecl()
callback and consult it in these methods.
(That means you'll likely want an extended DefaultHandler
or XMLFilterImpl that supports both of the standardized
extension handlers.)
Expansions of general entity references, like
&dudley;, are reported
everywhere except inside attribute values.
Such expansions within attribute values can't meaningfully be
reported, since all markup within start tags
is reported at the same time.
Not all parsers report expansion of parameter entities,
like %nell;, in DTDs.
There is a special parser feature flag
(lexical-handler/parameter-entities)
that determines whether parsers report such events.
As with general entity references, not all parameter
entity expansions can be meaningfully reported.
Parameter entities that expand as part of markup declarations or
conditional section markers won't be seen, since markup
declarations are reported only in their entirety.
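A handler that logs general entity boundaries might be sketched like this. The EntityBoundaryLog class is hypothetical; it assumes a JAXP-bootstrapped parser and relies on the skippedEntity() naming rules, where parameter entity names start with "%" and the external DTD subset is named "[dtd]".

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical handler that records where general entities are expanded.
class EntityBoundaryLog extends DefaultHandler2 {
    final List<String> events = new ArrayList<>();

    // Parameter entities are named "%name" and the external subset is
    // named "[dtd]"; anything else is a general entity.
    private static boolean isGeneralEntity(String name) {
        return !name.startsWith("%") && !name.equals("[dtd]");
    }

    @Override
    public void startEntity(String name) {
        if (isGeneralEntity(name)) {
            events.add("start " + name);
        }
    }

    @Override
    public void endEntity(String name) {
        if (isGeneralEntity(name)) {
            events.add("end " + name);
        }
    }

    static List<String> log(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        EntityBoundaryLog handler = new EntityBoundaryLog();
        reader.setContentHandler(handler);
        reader.setProperty(
            "http://xml.org/sax/properties/lexical-handler", handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.events;
    }
}
```

Given a document that references &dudley; once in an attribute value and once in element content, a conforming parser should produce a single start/end pair, for the reference in the content.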
Copyright © 2002 O'Reilly & Associates. All rights reserved.