XML::SAX: The Second Generation (Perl and XML)

5.7.2. SAX2 Handler Interface

Once you've selected a parser, the next step is to code up a handler package to catch the parser's event stream, much like the SAX modules we've seen so far. XML::SAX specifies events and their properties in exquisite detail and in large numbers. This specification gives your handler considerable control while ensuring absolute conformance to the API.

The supported event handlers fall into several groups. The most familiar are the content handlers (for elements and general document information), entity resolvers, and lexical handlers that deal with CDATA sections and comments. DTD handlers and declaration handlers take care of everything outside of the document element, including element and entity declarations. XML::SAX adds a new group, the error handlers, to catch and process any exceptions that may occur during parsing.

One important new facet of this class of parsers is that they recognize namespaces. This recognition is one of the innovations of SAX2. Previously, SAX parsers treated a qualified name as a single unit: a combined namespace prefix and local name. Now you can separate the prefix from the local name, find the namespace URI it maps to, and see where each namespace's scope begins and ends.

5.7.2.1. Content event handlers

Focusing on the content of the document, these handlers are the most likely ones to be implemented in a SAX handling program. Note the useful addition of a document locator reference, which gives the handler a special window into the machinations of the parser. The support for namespaces is also new.

set_document_locator( locator )

The parser calls this method at the beginning of parsing to tell the handler where the events are coming from. The locator parameter is a reference to a hash containing these properties:

PublicID: The public identifier of the current entity being parsed.
SystemID: The system identifier of the current entity being parsed.
LineNumber: The line number of the current entity being parsed.
ColumnNumber: The last position in the line currently being parsed.

The hash is continuously updated with the latest information. If your handler doesn't like the information it's being fed and decides to abort, it can check the locator to construct a meaningful message to the user about where in the source document an error was found. A SAX parser isn't required to give a locator, though it is strongly encouraged to do so. You should check to make sure that you have a locator before trying to access it. Don't try to use the locator except inside an event handler, or you'll get unpredictable results.
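The advice above can be sketched as a handler that stores the locator reference and consults it only from inside an event handler. This is a minimal illustration, not a definitive implementation: the package name LocatingHandler, its position( ) helper, and the file name log.xml are made up, and the parser is simulated with a hand-built hash so that the sketch runs without any SAX module installed.

```perl
use strict;
use warnings;

# Hypothetical handler: only set_document_locator and start_element are
# SAX event names; everything else is invented for this sketch.
{
    package LocatingHandler;

    sub new { return bless { Locator => undef, Seen => [] }, shift }

    # Keep the reference itself, not a copy: the parser updates the hash in place.
    sub set_document_locator {
        my ( $self, $locator ) = @_;
        $self->{Locator} = $locator;
        return;
    }

    # Record where each element started. The locator is read only here,
    # inside an event handler.
    sub start_element {
        my ( $self, $element ) = @_;
        push @{ $self->{Seen} }, "$element->{Name} at " . $self->position;
        return;
    }

    # Describe the current position, coping with parsers that give no locator.
    sub position {
        my $self = shift;
        my $loc = $self->{Locator} or return 'unknown position';
        return sprintf 'line %d, column %d of %s',
            $loc->{LineNumber}, $loc->{ColumnNumber},
            defined $loc->{SystemID} ? $loc->{SystemID} : '(no system ID)';
    }
}

# Stand-in for a parser: a hand-built locator hash that we update ourselves.
my %locator = (
    PublicID   => undef,
    SystemID   => 'log.xml',
    LineNumber => 1,
    ColumnNumber => 1,
);

my $handler = LocatingHandler->new;
$handler->start_element( { Name => 'early' } );       # no locator supplied yet
$handler->set_document_locator( \%locator );
@locator{qw(LineNumber ColumnNumber)} = ( 12, 7 );    # the "parser" moves on
$handler->start_element( { Name => 'entry' } );

print "$_\n" for @{ $handler->{Seen} };
```

Because the handler keeps the reference rather than copying the hash, the second call sees the updated line and column without any further calls to set_document_locator( ).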

start_document( document )

This handler routine is called right after set_document_locator( ), just as parsing on a document begins. The parameter, document, is an empty reference, as there are no properties for this event.

end_document( document )

This is the last handler method called. If the parser has reached the end of input or has encountered an error and given up, it sends notification of this event. The return value for this method is used as the value returned by the parser's parse( ) method. Again, the document parameter is empty.
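To illustrate how the return value of end_document( ) travels back to the caller, here is a small sketch of a handler that counts elements. The ElementCounter package and the fake_parse( ) routine are hypothetical; fake_parse( ) merely imitates the order in which a real parser's parse( ) method would fire these events.

```perl
use strict;
use warnings;

# Hypothetical handler that counts start tags and reports the total.
{
    package ElementCounter;

    sub new { return bless { Count => 0 }, shift }

    # Reset state so that one handler object can serve several documents.
    sub start_document { my $self = shift; $self->{Count} = 0; return }

    sub start_element { my $self = shift; $self->{Count}++; return }

    # Whatever this returns becomes the result of the parse.
    sub end_document { my $self = shift; return $self->{Count} }
}

# Stand-in for a parser's parse() method: fires the events in order and
# passes along the value returned by end_document().
sub fake_parse {
    my ( $handler, @names ) = @_;
    $handler->start_document( {} );
    $handler->start_element( { Name => $_ } ) for @names;
    return $handler->end_document( {} );
}

my $count = fake_parse( ElementCounter->new, qw(log entry entry) );
print "elements seen: $count\n";
```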

start_element( element )

Whenever the parser encounters a new element start tag, it calls this method. The parameter element is a hash containing properties of the element, including:

Name: The string containing the name of the element, including its namespace prefix.
Attributes: The hash of attributes, in which each key is encoded as {NamespaceURI}LocalName. The value of each item in the hash is a hash of attribute properties.
NamespaceURI: The element's namespace.
Prefix: The prefix part of the qualified name.
LocalName: The local part of the qualified name.

Properties for attributes include:

Name: The qualified name (prefix + local).
Value: The attribute's value, normalized (leading and trailing spaces are removed).
NamespaceURI: The source of the namespace.
Prefix: The prefix part of the qualified name.
LocalName: The local part of the qualified name.

The properties NamespaceURI, LocalName, and Prefix are given only if the parser supports the namespaces feature.
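The following sketch shows one way to walk the element and attribute properties listed above. The event data is built by hand and every value in it is invented for illustration. In particular, keying an attribute that has no namespace as {}width is an assumption about how a parser fills in the {NamespaceURI}LocalName pattern, so check what your parser actually delivers.

```perl
use strict;
use warnings;

# Hand-built event data in the shape described above (all values made up).
my %element = (
    Name         => 'svg:rect',
    Prefix       => 'svg',
    LocalName    => 'rect',
    NamespaceURI => 'http://www.w3.org/2000/svg',
    Attributes   => {
        # Assumption: an attribute with no namespace gets an empty URI part.
        '{}width' => {
            Name         => 'width',
            Value        => '10',
            NamespaceURI => '',
            Prefix       => '',
            LocalName    => 'width',
        },
        '{http://www.w3.org/1999/xlink}href' => {
            Name         => 'xlink:href',
            Value        => '#a',
            NamespaceURI => 'http://www.w3.org/1999/xlink',
            Prefix       => 'xlink',
            LocalName    => 'href',
        },
    },
);

# Return one line for the element and one line per attribute.
sub describe_element {
    my ($element) = @_;
    my @lines = sprintf 'element %s in namespace %s',
        $element->{LocalName}, $element->{NamespaceURI};
    for my $key ( sort keys %{ $element->{Attributes} } ) {
        my $attr = $element->{Attributes}{$key};
        push @lines, sprintf '  %s = "%s" (key %s)',
            $attr->{Name}, $attr->{Value}, $key;
    }
    return @lines;
}

my @report = describe_element( \%element );
print "$_\n" for @report;
```

In a real handler, the body of describe_element( ) would live inside start_element( ), which receives the same hash as its argument.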

end_element( element )

After all the content is processed and an element's end tag has come into view, the parser calls this method. It is even called for empty elements. The parameter element is a hash containing these properties:

Name: The string containing the element's name, including its namespace prefix.
NamespaceURI: The element's namespace.
Prefix: The prefix part of the qualified name.
LocalName: The local part of the qualified name.

The properties NamespaceURI, LocalName, and Prefix are given only if the parser supports the namespaces feature.

characters( characters )

The parser calls this method whenever it finds a chunk of plain text (character data). It might break up a chunk into pieces and deliver each piece separately, but the pieces must always be sent in the same order as they were read. Within a piece, all text must come from the same source entity. The characters parameter is a hash containing one property, Data, which is a string containing the characters from the document.
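Because a single run of text may arrive in several pieces, a handler usually buffers the pieces and acts on them only when the element ends. Here is a minimal sketch of that pattern, with the parser simulated by direct method calls. The TextCollector package is hypothetical, and it deliberately handles only elements that contain nothing but text.

```perl
use strict;
use warnings;

# Hypothetical handler that reassembles chunked character data.
{
    package TextCollector;

    sub new { return bless { Buffer => '', Texts => {} }, shift }

    # Start each element with an empty buffer (text-only elements assumed).
    sub start_element { my $self = shift; $self->{Buffer} = ''; return }

    # Append each piece in the order received; never assume it is complete.
    sub characters {
        my ( $self, $chars ) = @_;
        $self->{Buffer} .= $chars->{Data};
        return;
    }

    # Only now is the element's text known to be whole.
    sub end_element {
        my ( $self, $element ) = @_;
        $self->{Texts}{ $element->{Name} } = $self->{Buffer};
        $self->{Buffer} = '';
        return;
    }
}

# Simulate a parser that splits "Perl & XML" where an entity reference occurred.
my $collector = TextCollector->new;
$collector->start_element( { Name => 'title' } );
$collector->characters( { Data => 'Perl ' } );
$collector->characters( { Data => '&' } );
$collector->characters( { Data => ' XML' } );
$collector->end_element( { Name => 'title' } );

print "title: $collector->{Texts}{title}\n";
```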

ignorable_whitespace( characters )

The term ignorable whitespace is used to describe space characters that appear in places where the element's content model declaration doesn't specifically call for character data. In other words, the newlines often used to make XML more readable by spacing elements apart can be ignored because they aren't really content in the document. A parser can tell if whitespace is ignorable only by reading the DTD, and it would do that only if it supports the validation feature. (If you don't understand this, don't worry; it's not important to most people.) The characters parameter is a hash containing one property, Data, containing the document's whitespace characters.

start_prefix_mapping( mapping )

This method is called when the parser detects a namespace coming into scope. For parsers that are not namespace-aware, this event is skipped, but element and attribute names still include the namespace prefixes. This event always occurs before the start of the element for which the scope holds. The parameter mapping is a hash with these properties:

Prefix: The namespace prefix.
NamespaceURI: The URI that the prefix maps to.

end_prefix_mapping( mapping )

This method is called when a namespace scope closes. This routine's parameter mapping is a hash with one property:

Prefix: The namespace prefix.

This event is guaranteed to come after the end element event for the element in which the scope is declared.
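One way to use these two events together is to keep a stack of URIs for each prefix, so that a prefix redeclared on an inner element shadows the outer declaration until its scope closes. The sketch below assumes exactly that design. The NamespaceTracker package, its uri_for( ) helper, and the urn: URIs are invented for illustration, and the parser is again simulated by hand.

```perl
use strict;
use warnings;

# Hypothetical handler that tracks which URI each prefix currently maps to.
{
    package NamespaceTracker;

    sub new { return bless { Scopes => {} }, shift }

    # A namespace comes into scope: push its URI onto the prefix's stack.
    sub start_prefix_mapping {
        my ( $self, $mapping ) = @_;
        push @{ $self->{Scopes}{ $mapping->{Prefix} } }, $mapping->{NamespaceURI};
        return;
    }

    # The scope closes: the previous mapping, if any, becomes visible again.
    sub end_prefix_mapping {
        my ( $self, $mapping ) = @_;
        pop @{ $self->{Scopes}{ $mapping->{Prefix} } };
        return;
    }

    # Innermost URI bound to a prefix, or undef if the prefix is unbound.
    sub uri_for {
        my ( $self, $prefix ) = @_;
        my $stack = $self->{Scopes}{$prefix} || [];
        return $stack->[-1];
    }
}

my $tracker = NamespaceTracker->new;
my @observed;

$tracker->start_prefix_mapping( { Prefix => 'x', NamespaceURI => 'urn:outer' } );
push @observed, $tracker->uri_for('x');
$tracker->start_prefix_mapping( { Prefix => 'x', NamespaceURI => 'urn:inner' } );
push @observed, $tracker->uri_for('x');
$tracker->end_prefix_mapping( { Prefix => 'x' } );
push @observed, $tracker->uri_for('x');
$tracker->end_prefix_mapping( { Prefix => 'x' } );
push @observed, $tracker->uri_for('x');

print join( ', ', map { defined $_ ? $_ : '(unbound)' } @observed ), "\n";
```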

processing_instruction( pi )

This routine handles processing instruction events from the parser, including those found outside the document element. The pi parameter is a hash with these properties:

Target: The target for the processing instruction.
Data: The instruction's data (or undef if there isn't any).

skipped_entity( entity )

Nonvalidating parsers may skip entities rather than resolve them. For example, if they haven't seen a declaration, they can just ignore the entity rather than abort with an error. This method gives the handler a chance to do something with the entity, and perhaps even implement its own entity resolution scheme.

If a parser skips entities, it will have one or more of these features set:

Handle external parameter entities (feature-ID is http://xml.org/sax/features/external-parameter-entities)
Handle external general entities (feature-ID is http://xml.org/sax/features/external-general-entities)

(In SAX, features are identified by URIs, which need not point to an actual resource. See Chapter 10, "Coding Strategies" for a fuller explanation.)

The parameter entity is a hash with this property:

Name: The name of the entity that was skipped. If it's a parameter entity, the name will be prefixed with a percent sign (%).
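As a small illustration of the Name property, this sketch sorts skipped entities into general and parameter entities by looking for the leading percent sign. The SkipLogger package and the entity names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical handler that records which entities the parser skipped.
{
    package SkipLogger;

    sub new { return bless { General => [], Parameter => [] }, shift }

    sub skipped_entity {
        my ( $self, $entity ) = @_;
        my $name = $entity->{Name};

        # A leading % marks a parameter entity; strip it before storing.
        if ( $name =~ s/^%// ) {
            push @{ $self->{Parameter} }, $name;
        }
        else {
            push @{ $self->{General} }, $name;
        }
        return;
    }
}

# Simulate a nonvalidating parser reporting two skipped entities.
my $logger = SkipLogger->new;
$logger->skipped_entity( { Name => 'copyright' } );
$logger->skipped_entity( { Name => '%common-attrs' } );

print "general:   @{ $logger->{General} }\n";
print "parameter: @{ $logger->{Parameter} }\n";
```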

5.7.2.2. Entity resolver

By default, XML parsers resolve external entity references without your program ever knowing they were there. You may want to override that behavior occasionally. For example, you may have a special way of resolving public identifiers, or the entities are entries in a database. Whatever the reason, if you implement this handler, the parser will call it before attempting to resolve the entity on its own.

The argument to resolve_entity( ) is a hash with two properties: PublicID, a public identifier for the entity, and SystemID, the system-specific location of the entity, such as a filesystem path or a URI. If the public identifier is undef, then none was given, but a system identifier will always be present.
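A resolver that maps public identifiers to local copies might look something like the sketch below. Treat it as an illustration under stated assumptions: the CatalogResolver package, the catalog entry, and the file paths are invented, and the text above does not say what resolve_entity( ) should return. Handing back a hash with a SystemID property is a guess made for the sake of the example, so consult your parser's documentation for the real convention.

```perl
use strict;
use warnings;

# Hypothetical resolver that redirects known public identifiers to local files.
{
    package CatalogResolver;

    sub new {
        my ( $class, %catalog ) = @_;
        return bless { Catalog => {%catalog} }, $class;
    }

    # Assumption: the parser accepts a hash naming the location to read from.
    sub resolve_entity {
        my ( $self, $entity ) = @_;
        my $public = $entity->{PublicID};

        # PublicID may be undef, so test it before using it as a hash key.
        if ( defined $public && exists $self->{Catalog}{$public} ) {
            return { SystemID => $self->{Catalog}{$public} };
        }

        # Unknown entity: fall back to the system identifier we were given.
        return { SystemID => $entity->{SystemID} };
    }
}

# Made-up catalog entry and identifiers.
my $resolver = CatalogResolver->new(
    '-//Example//DTD Log 1.0//EN' => '/usr/local/share/dtd/log.dtd',
);

my $known = $resolver->resolve_entity(
    {
        PublicID => '-//Example//DTD Log 1.0//EN',
        SystemID => 'http://example.com/log.dtd',
    }
);
my $unknown = $resolver->resolve_entity(
    { PublicID => undef, SystemID => 'chapter1.xml' }
);

print "known:   $known->{SystemID}\n";
print "unknown: $unknown->{SystemID}\n";
```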

5.7.2.4. Error event handlers and catching exceptions

XML::SAX lets you customize your error handling with this group of handlers. Each handler takes one argument, called an exception, that describes the error in detail. The particular handler called represents the severity of the error, as defined by the W3C recommendation for parser behavior. There are three types:

warning( ): This is the least serious of the exception handlers. It represents any error that is not bad enough to halt parsing. For example, an ID reference without a matching ID would elicit a warning, but allow the parser to keep grinding on. If you don't implement this handler, the parser will ignore the exception and keep going.
error( ): This kind of error is considered serious, but recoverable. A validity error falls in this category. The parser should still trundle on, generating events, unless your application decides to call it quits. In the absence of a handler, the parser usually continues parsing.
fatal_error( ): A fatal error might cause the parser to abort parsing. The parser is under no obligation to continue, but might do so just to collect more error messages. The exception could be a syntax error that makes the document non-well-formed XML, or it might be an entity that can't be resolved. In any case, this handler represents the highest level of error reporting provided in XML::SAX.
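The three severity levels can be sketched as a handler that collects warnings and errors but gives up on a fatal error. The ErrorCollector package and the messages are hypothetical, and the exceptions are plain hand-built hashes carrying only a Message property. Choosing to die( ) in fatal_error( ) is one possible policy, not a requirement.

```perl
use strict;
use warnings;

# Hypothetical handler showing one policy per severity level.
{
    package ErrorCollector;

    sub new { return bless { Warnings => [], Errors => [] }, shift }

    # Least serious: note it and let parsing continue.
    sub warning {
        my ( $self, $exception ) = @_;
        push @{ $self->{Warnings} }, $exception->{Message};
        return;
    }

    # Serious but recoverable: note it and let parsing continue.
    sub error {
        my ( $self, $exception ) = @_;
        push @{ $self->{Errors} }, $exception->{Message};
        return;
    }

    # Most serious: this sketch chooses to abort.
    sub fatal_error {
        my ( $self, $exception ) = @_;
        die "fatal: $exception->{Message}\n";
    }
}

# Simulate a parser reporting one problem of each kind.
my $errors = ErrorCollector->new;
$errors->warning( { Message => 'IDREF "n7" has no matching ID' } );
$errors->error( { Message => 'element "entry" not allowed here' } );

my $survived = eval { $errors->fatal_error( { Message => 'mismatched tag' } ); 1 };
my $death_message = $@;

print 'warnings: ', scalar @{ $errors->{Warnings} }, "\n";
print 'errors:   ', scalar @{ $errors->{Errors} },   "\n";
print "died with: $death_message";
```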

According to the XML specification, a conformant parser must stop normal processing when it encounters a well-formedness error (validity errors, as noted above, are recoverable). In Perl SAX, halting results in a call to die( ). That's not the end of the story, however. Although the parse session itself is over, your program can trap the exception and carry on by wrapping the call in the eval{} construct, like this:

eval{ $parser->parse( $uri ) };
if( $@ ) {
  # yikes! handle error here...
}

The $@ variable is a blessed hash of properties that piece together the story about why parsing failed.

These properties include:

Message: A text description about what happened
ColumnNumber: The number of characters into the line where the error occurred, if this error is a parse error
LineNumber: Which line the error happened on, if the exception was thrown while parsing
PublicID: A public identifier for the entity in which the error occurred, if this error is a parse error
SystemID: A system identifier pointing to the offending entity, if a parse error occurred

Not all thrown exceptions indicate that a failure to parse occurred. Sometimes the parser throws an exception because of a bad feature setting.
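Since position properties are present only for parse errors, code that inspects $@ should test for them before using them. The sketch below assumes the property names listed above; check your parser's documentation for the exact spelling of these keys, as some implementations spell the last two PublicId and SystemId. The two parse_with_... routines are hypothetical stand-ins for $parser->parse( ), and they throw plain hashes rather than blessed ones.

```perl
use strict;
use warnings;

# Hypothetical stand-in for parse(): fails with a parse error.
sub parse_with_syntax_error {
    my $exception = {
        Message      => 'mismatched tag',
        LineNumber   => 4,
        ColumnNumber => 17,
        PublicID     => undef,
        SystemID     => 'log.xml',
    };
    die $exception;
}

# Hypothetical stand-in for parse(): fails before any parsing happens.
sub parse_with_bad_feature {
    my $exception = { Message => 'feature not supported' };
    die $exception;
}

# Turn whatever was thrown into a one-line report.
sub describe_failure {
    my ($err) = @_;

    # Other code may die with a plain string, so handle that too.
    if ( !ref $err ) {
        chomp $err;
        return $err;
    }

    my $text = $err->{Message};

    # Add position details only when this was a parse error.
    if ( defined $err->{LineNumber} ) {
        $text .= sprintf ' (%s, line %d, column %d)',
            $err->{SystemID}, $err->{LineNumber}, $err->{ColumnNumber};
    }
    return $text;
}

my @reports;
for my $parse ( \&parse_with_syntax_error, \&parse_with_bad_feature ) {
    next if eval { $parse->('log.xml'); 1 };    # true only if nothing died
    push @reports, describe_failure($@);
}

print "$_\n" for @reports;
```

Ending the eval block with a true value and testing the result of eval itself, rather than testing $@ alone, guards against the case in which $@ is cleared or overwritten before it can be examined.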
