The LexicalHandler Interface (SAX2)

4.2. The LexicalHandler Interface

This extension interface is new in SAX2. It's in the org.xml.sax.ext package, which means among other things that it is optional and isn't supported by all SAX APIs and layers, such as DefaultHandler. However, any SAX2 parser that can be bootstrapped with JAXP supports this interface. Parsers that support LexicalHandler expose comment text and the boundaries of CDATA sections, DTDs, and most parsed entities. There is no setLexicalHandler() method; bind these handlers to parsers like this:

XMLReader	producer = ...;
LexicalHandler	handler = ...;

producer.setProperty ("http://xml.org/sax/properties/lexical-handler",
	handler);
// throws SAXNotSupportedException if parameter isn't a LexicalHandler
// throws SAXNotRecognizedException if parser doesn't support it.
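
If you would rather test for support than let those exceptions propagate, the binding can be wrapped in a small helper. This is only a sketch: the class `LexicalBinding` and its methods are hypothetical names, and it assumes a JAXP bootstrap plus `DefaultHandler2`, a do-nothing convenience class from later releases of the `org.xml.sax.ext` package that implements `LexicalHandler`.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;
import org.xml.sax.ext.LexicalHandler;

// Hypothetical helper class; not part of SAX or JAXP.
class LexicalBinding {
    static final String LEXICAL_HANDLER =
        "http://xml.org/sax/properties/lexical-handler";

    // Tries to bind the handler; reports failure instead of throwing.
    static boolean bind(XMLReader producer, LexicalHandler handler) {
        try {
            producer.setProperty(LEXICAL_HANDLER, handler);
            return true;
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            // The parser does not know the property, or rejected the value.
            return false;
        }
    }

    // Bootstraps a parser through JAXP and binds a do-nothing handler.
    static boolean defaultParserSupportsLexicalHandler() {
        try {
            XMLReader producer =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            return bind(producer, new DefaultHandler2());
        } catch (Exception e) {
            // ParserConfigurationException or SAXException during bootstrap.
            throw new IllegalStateException("could not create a parser", e);
        }
    }
}
```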

Applications need the information this interface exposes when they require more "round-tripping" support than the SAX2 core allows; with it, less of the information read by parsers is discarded. Such applications need SAX to provide more complete support for the XML Infoset (or for the XPath data model). To completely support DOM, XPath, or XSLT on top of a SAX2 parser, this interface is as necessary as the namespaces exposed in the SAX2 ContentHandler and Attributes interfaces. The downside is that much of this information is in the category of information applications shouldn't want to deal with. Be careful how you use these callbacks; don't assume that just because the information is available, you should use it.

LexicalHandler has the following methods:

void comment(buf,offset,len)

Reports characters inside a comment section (without the delimiting characters). For many applications, this event is the only reason to use this interface. This is almost the same convention ContentHandler uses to report character content or ignorable whitespace; the parameters are identical. Comments are always reported in a single callback. Two consecutive comment() calls mean two consecutive comments, while two consecutive characters() calls just enlarge a given logical span of text.

char buf []: A character array that holds the comment text. As with the ContentHandler.characters() callback, you must ignore characters in this buffer that are outside of the specified range.
int offset: The index of the first comment character in the buffer.
int len: How many comment characters are in the buffer, beginning at the specified offset.

Comments show up in the XPath data model, so they are reflected in layers (such as XSLT, XPointer, and XLink) that build on XPath. Strictly speaking, applications should ignore comments except when they round-trip data provided during authoring. Instead, they should use processing instructions when they need to work with annotations. You might need to use comment data with HTML processors because HTML doesn't support processing instructions. For example, HTML documents often use comments to wrap CSS data, JavaScript code, or server-side includes.

There are two good ways to handle comments. One is just to discard them and make the implementation of this method do nothing. (I like that one!) The other is to create a new String using the method parameters and save the string somewhere. Avoid parsing comment content; if you're tempted to do that in new applications, try to use PIs (which were designed for such purposes).

public void comment (char buf [], int offset, int len)
throws SAXException
{
    // Copy only the reported range; the rest of the buffer is off limits.
    String value = new String (buf, offset, len);
    // ... now that you have it, what do you want to do?
}
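
The "save the string somewhere" strategy might look like the following sketch. The class name `CommentCollector` and its `collect()` helper are hypothetical, and the parser is assumed to come from JAXP. Because each comment arrives in exactly one callback, every list entry corresponds to one comment in the document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.LexicalHandler;

// Hypothetical handler that keeps every comment and ignores all other
// lexical events.
class CommentCollector implements LexicalHandler {
    final List<String> comments = new ArrayList<>();

    @Override
    public void comment(char[] buf, int offset, int len) {
        // One callback per comment, so one list entry per comment.
        comments.add(new String(buf, offset, len));
    }

    // The remaining LexicalHandler callbacks are deliberately empty.
    @Override public void startDTD(String name, String publicId, String systemId) { }
    @Override public void endDTD() { }
    @Override public void startEntity(String name) { }
    @Override public void endEntity(String name) { }
    @Override public void startCDATA() { }
    @Override public void endCDATA() { }

    // Parses the given document text and returns its comments in order.
    static List<String> collect(String xml) {
        CommentCollector collector = new CommentCollector();
        try {
            XMLReader producer =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            producer.setProperty(
                "http://xml.org/sax/properties/lexical-handler", collector);
            producer.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return collector.comments;
    }
}
```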

void startDTD(name, publicId, systemId)

void endDTD()

The startDTD() event reports the beginning of a document's DTD, and endDTD() reports the end. These events can be useful when you save DTD information, such as the partial support in DOM Level 2. They are also important when you create SAX event streams that may need to print as documents that include a DTD.

String name: The declared name of the root element for the document. It is never omitted, though for invalid documents it may not correspond to the name of the root element.
String publicId: Normalized version of the public ID declared for the external subset, or null if no such subset was provided.
String systemId: The system ID declared for the external subset, or null if no such ID was provided. Note that this URI is not absolutized.

When the end of the DTD is reported, all other declarations that should have been reported (with DeclHandler or DTDHandler callbacks) will have been reported. If any ContentHandler.skippedEntity() calls were made for external parameter entities, applications will normally infer that some declarations were not processed.

Parsers are not required to distinguish the internal and external subsets. There are two mechanisms applications can use, but both of them are optional. The natural method is to rely on external parameter entity boundary reports, using other methods in this interface. Not all parsers report those entities; you can check the lexical-handler/parameter-entities feature flag to see if this mechanism will work for you. The other mechanism compares base URIs as reported through the Locator.getSystemId() method; base URIs for external subset components will differ from those of the document itself. Most parsers support this method, but it's awkward to use for this purpose.

If you're saving DTD content, these methods will bracket a lot of work where you squirrel data away for later use. Otherwise, you'll probably arrange to ignore all the other DTD events and will only need to decide what to do with comments and processing instructions, if you don't just ignore them. Ignoring them within DTDs is a popular strategy even when they're not ignored elsewhere. This is because comments or PIs inside a DTD would seem to apply to DTD contents, while most applications are instead working with document contents.
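
The strategy of ignoring comments inside the DTD while keeping them elsewhere can be sketched with a single flag that startDTD() sets and endDTD() clears. The class `BodyCommentsOnly` and its `scan()` helper are hypothetical names, and the sketch assumes the `DefaultHandler2` convenience class from `org.xml.sax.ext` is available.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical handler: remembers the declared root element name and
// keeps only the comments that appear outside the DTD.
class BodyCommentsOnly extends DefaultHandler2 {
    String rootName;                       // stays null if there is no DTD
    final List<String> comments = new ArrayList<>();
    private boolean inDtd;

    @Override
    public void startDTD(String name, String publicId, String systemId) {
        rootName = name;
        inDtd = true;
    }

    @Override
    public void endDTD() {
        inDtd = false;
    }

    @Override
    public void comment(char[] buf, int offset, int len) {
        if (!inDtd) {
            comments.add(new String(buf, offset, len));
        }
    }

    // Parses the given document text with this handler bound as the
    // lexical handler and returns the populated handler.
    static BodyCommentsOnly scan(String xml) {
        BodyCommentsOnly handler = new BodyCommentsOnly();
        try {
            XMLReader producer =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            producer.setContentHandler(handler);
            producer.setProperty(
                "http://xml.org/sax/properties/lexical-handler", handler);
            producer.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return handler;
    }
}
```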

void startCDATA()

void endCDATA()

These methods report the beginning and end of a <![CDATA[...]]> text section; the bracketing characters are not reported. Any content within a CDATA section is reported with characters() events; the < and & characters within CDATA sections are parsed like normal characters, not like delimiters for markup.

Most software has little reason to care whether character content is contained in CDATA sections. Unless you are trying to round-trip data while preserving those lexical artifacts (to simplify potential future work done with text editors), the right response to CDATA events is to ignore them.
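
For the round-tripping case, a sketch of a handler that writes character content back out with its CDATA boundaries preserved might look like this. The class `CdataEcho` and its `echoText()` helper are hypothetical, it handles only text (not elements or attributes), and it assumes `DefaultHandler2` from `org.xml.sax.ext`.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical handler that re-creates the text of a document,
// keeping CDATA sections as CDATA sections.
class CdataEcho extends DefaultHandler2 {
    private final StringBuilder out = new StringBuilder();
    private boolean inCdata;

    @Override
    public void startCDATA() {
        inCdata = true;
        out.append("<![CDATA[");
    }

    @Override
    public void endCDATA() {
        inCdata = false;
        out.append("]]>");
    }

    @Override
    public void characters(char[] buf, int offset, int len) {
        // May be called several times for one span of text; just append.
        for (int i = offset; i < offset + len; i++) {
            char c = buf[i];
            if (!inCdata && c == '<') {
                out.append("&lt;");
            } else if (!inCdata && c == '&') {
                out.append("&amp;");
            } else {
                out.append(c);      // inside CDATA, nothing is escaped
            }
        }
    }

    // Parses the given document text and returns its re-created text.
    static String echoText(String xml) {
        CdataEcho handler = new CdataEcho();
        try {
            XMLReader producer =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            producer.setContentHandler(handler);
            producer.setProperty(
                "http://xml.org/sax/properties/lexical-handler", handler);
            producer.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return handler.out.toString();
    }
}
```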

void startEntity(String name)

void endEntity(String name)

These methods report the beginning and end of internal or external entity expansion. The entity is named using the same rules as the ContentHandler.skippedEntity() callback. If you need to indicate which kind of entity is being expanded, record information from the DeclHandler.externalEntityDecl() callback and consult it in these methods. (That means you'll likely really want an extended DefaultHandler or XMLFilterImpl that supports both of the standardized extension classes.)
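
Recording declarations so that startEntity() can tell the two kinds of entity apart might be sketched as follows. The class `EntityKinds` and its `scan()` helper are hypothetical; the sketch assumes `DefaultHandler2` from `org.xml.sax.ext` (which implements both LexicalHandler and DeclHandler) and a parser that accepts both the lexical-handler and declaration-handler properties.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical handler that labels each reported entity expansion as
// internal or external, based on the declarations seen earlier.
class EntityKinds extends DefaultHandler2 {
    private final Set<String> external = new HashSet<>();
    final List<String> expansions = new ArrayList<>();

    @Override
    public void externalEntityDecl(String name, String publicId,
                                   String systemId) {
        external.add(name);     // DeclHandler callback, made while in the DTD
    }

    @Override
    public void startEntity(String name) {
        if (name.equals("[dtd]")) {
            return;             // the external subset is never declared
        }
        String kind = external.contains(name) ? "external:" : "internal:";
        expansions.add(kind + name);
    }

    // Parses the given document text and returns the labeled expansions.
    static List<String> scan(String xml) {
        EntityKinds handler = new EntityKinds();
        try {
            XMLReader producer =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            producer.setContentHandler(handler);
            producer.setProperty(
                "http://xml.org/sax/properties/lexical-handler", handler);
            producer.setProperty(
                "http://xml.org/sax/properties/declaration-handler", handler);
            producer.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return handler.expansions;
    }
}
```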

Expansions of general entity references, like &dudley;, are reported everywhere except inside attribute values. Such expansions within attribute values can't meaningfully be reported, since all markup within start tags is reported at the same time.

Not all parsers report expansion of parameter entities, like %nell;, in DTDs. There is a special parser feature flag (lexical-handler/parameter-entities) that determines whether parsers report such events. As with general entity references, not all parameter entity expansions can be meaningfully reported. Parameter entities that expand as part of markup declarations or conditional section markers won't be seen, since markup declarations are reported only in their entirety.
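
Checking that feature flag before relying on parameter entity reports could be sketched like this. The class `ParameterEntityReporting` and its methods are hypothetical names; the feature URI is the standard SAX2 identifier for the flag, and the parser is assumed to come from JAXP.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical helper for querying the parameter-entities feature flag.
class ParameterEntityReporting {
    static final String FEATURE =
        "http://xml.org/sax/features/lexical-handler/parameter-entities";

    // Returns "reported", "not reported", or "unknown" when the parser
    // does not recognize the flag or cannot answer right now.
    static String describe(XMLReader producer) {
        try {
            return producer.getFeature(FEATURE) ? "reported" : "not reported";
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return "unknown";
        }
    }

    // Convenience wrapper that asks the default JAXP parser.
    static String describeDefaultParser() {
        try {
            XMLReader producer =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            return describe(producer);
        } catch (Exception e) {
            throw new IllegalStateException("could not create a parser", e);
        }
    }
}
```

Treating "unknown" the same as "not reported" is the conservative choice: fall back to comparing base URIs from the Locator if you still need to find external subset boundaries.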