5.7. XML::SAX: The Second Generation

The proliferation of SAX parsers presents two problems: how to keep them all synchronized with the standard API and how to keep them organized on your system. XML::SAX, a marvelous team effort by Matt Sergeant, Kip Hampton, and Robin Berjon, solves both problems at once. As a bonus, it also includes support for SAX Level 2 that previous modules lacked.

"What," you ask, "do you mean about keeping all the modules synchronized with the API?" All along, we've touted the wonders of using a standard like SAX to ensure that modules are really interchangeable. But here's the rub: in Perl, there's more than one way to implement SAX. SAX was originally designed for Java, which has a wonderful interface type of class that nails down things like what type of argument to pass to which method. There's nothing like that in Perl.

This wasn't as much of a problem with the older SAX modules we've been talking about so far. They all support SAX Level 1, which is fairly simple. However, a new crop of modules that support SAX2 is breaking the surface. SAX2 is more complex because it introduces namespaces to the mix. An element event handler should receive both the namespace prefix and the local name of the element. How should this information be passed in parameters? Do you keep them together in the same string like foo:bar? Or do you separate them into two parameters? This debate created a lot of heat on the perl-xml mailing list until a few members decided to hammer out a specification for "Perlish" SAX (we'll see in a moment how to use this new API for SAX2).

To encourage others to adhere to this convention, XML::SAX includes a class called XML::SAX::ParserFactory. A factory is an object whose sole purpose is to generate objects of a specific type -- in this case, parsers. XML::SAX::ParserFactory is a useful way to handle housekeeping chores related to the parsers, such as registering their options and initialization requirements.
Tell the factory what kind of parser you want and it doles out a copy to you.

XML::SAX represents a shift in the way XML and Perl work together. It builds on the work of the past, including all the best features of previous modules, while avoiding many of the mistakes. To ensure that modules are truly compatible, the kit provides a base class for parsers, abstracting out most of the mundane work that all parsers have to do and leaving the developer to write only what is unique to the parser at hand. It also creates an abstract interface for users of parsers, allowing them to keep the plethora of modules organized with a registry that is indexed by properties, so that finding the right one takes only a simple query.

It's a bold step and carries a lot of heft, so be prepared for a lot of information and detail in this section. We think it will be worth your while.

5.7.1. XML::SAX::ParserFactory

We start with the parser selection interface, XML::SAX::ParserFactory. Those of you who have used DBI will find this class very familiar. It's a front end to all the SAX parsers on your system. You simply request a new parser from the factory and it will dig one up for you. Let's say you want to use any SAX parser with your handler package XML::SAX::MyHandler. Here's how to fetch the parser and use it to read a file:

    use XML::SAX::ParserFactory;
    use XML::SAX::MyHandler;

    my $handler = XML::SAX::MyHandler->new;
    my $parser = XML::SAX::ParserFactory->parser( Handler => $handler );
    $parser->parse_uri( "foo.xml" );

The parser you get depends on the order in which you've installed the modules. The last one (with all the available features specified with RequiredFeatures, if any) will be returned by default. But maybe you don't want that one. No problem; XML::SAX maintains a registry of SAX parsers that you can choose from. Every time you install a new SAX parser, it registers itself so you can call upon it with ParserFactory.
If you know you have the XML::SAX::BobsParser parser installed, you can require an instance of it by setting the variable $XML::SAX::ParserPackage as follows:

    use XML::SAX::ParserFactory;
    use XML::SAX::MyHandler;

    my $handler = XML::SAX::MyHandler->new;
    $XML::SAX::ParserPackage = "XML::SAX::BobsParser (1.24)";
    my $parser = XML::SAX::ParserFactory->parser( Handler => $handler );

Setting $XML::SAX::ParserPackage to "XML::SAX::BobsParser (1.24)" returns an instance of the package. Internally, ParserFactory is require( )-ing that parser and calling its new( ) class method. The 1.24 in the variable setting specifies a minimum version number for the parser. If that version isn't on your system, an exception will be thrown.

To see a list of all the parsers available to XML::SAX, call the parsers( ) method:

    use XML::SAX;

    my @parsers = @{XML::SAX->parsers( )};

    foreach my $p ( @parsers ) {
        print "\n", $p->{ Name }, "\n";
        foreach my $f ( sort keys %{$p->{ Features }} ) {
            print "$f => ", $p->{ Features }->{ $f }, "\n";
        }
    }

It returns a reference to a list of hashes, with each hash containing information about a parser, including the name and a hash of features. When we ran the program above we were told that XML::SAX had two registered parsers, each supporting namespaces:

    XML::LibXML::SAX::Parser
    http://xml.org/sax/features/namespaces => 1

    XML::SAX::PurePerl
    http://xml.org/sax/features/namespaces => 1

At the time this book was written, these were the only two parsers included with XML::SAX. XML::LibXML::SAX::Parser is a SAX API for the libxml2 library we use in Chapter 6, "Tree Processing". To use it, you'll need to have libxml2, a compiled, dynamically linked library written in C, installed on your system. It's fast, but unless you can find a binary or compile it yourself, it isn't very portable. XML::SAX::PurePerl is, as the name suggests, a parser written completely in Perl. As such, it's completely portable because you can run it wherever Perl is installed.
This starter set of parsers already gives you some different options.

The feature list associated with each parser is important because it allows a user to select a parser based on a set of criteria. For example, suppose you wanted a parser that did validation and supported namespaces. You could request one by calling the factory's require_feature( ) method:

    my $factory = XML::SAX::ParserFactory->new;
    $factory->require_feature( 'http://xml.org/sax/features/validation' );
    $factory->require_feature( 'http://xml.org/sax/features/namespaces' );
    my $parser = $factory->parser( Handler => $handler );

Alternatively, you can pass such information to the factory in its constructor method:

    my $factory = XML::SAX::ParserFactory->new(
        RequiredFeatures => {
            'http://xml.org/sax/features/validation' => 1,
            'http://xml.org/sax/features/namespaces' => 1,
        }
    );
    my $parser = $factory->parser( Handler => $handler );

If multiple parsers pass the test, the most recently installed one is used. However, if the factory can't find a parser to fit your requirements, it simply throws an exception.

To add more SAX modules to the registry, you only need to download and install them. Their installer packages should know about XML::SAX and automatically register the modules with it. To add a module of your own, you can call XML::SAX's add_parser( ) with the module's name. Make sure it follows the conventions of SAX modules by subclassing XML::SAX::Base. Later, we'll show you how to write a parser, install it, and add it to the registry.

5.7.2. SAX2 Handler Interface

Once you've selected a parser, the next step is to code up a handler package to catch the parser's event stream, much like the SAX modules we've seen so far. XML::SAX specifies events and their properties in exquisite detail and in large numbers. This specification gives your handler considerable control while ensuring absolute conformance to the API.

The types of supported event handlers fall into several groups.
The ones we are most familiar with are the content handlers (covering elements and general document information), the entity resolvers, and the lexical handlers, which deal with CDATA sections and comments. DTD handlers and declaration handlers take care of everything outside of the document element, including element and entity declarations. XML::SAX adds a new group, the error handlers, to catch and process any exceptions that may occur during parsing.

One important new facet of this class of parsers is that they recognize namespaces. This recognition is one of the innovations of SAX2. Previously, SAX parsers treated a qualified name as a single unit: a combined namespace prefix and local name. Now you can tease out the namespaces, see where their scope begins and ends, and do more than you could before.

5.7.2.1. Content event handlers

Focusing on the content of the document, these handlers are the ones most likely to be implemented in a SAX handling program. Note the useful addition of a document locator reference, which gives the handler a special window into the machinations of the parser. The support for namespaces is also new.
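To make the namespace support concrete, here is a minimal sketch of a content handler that records each element in "{namespace-URI}local-name" form. The package name NamespaceLister is made up for illustration, and the event is built by hand rather than produced by a parser. The hash keys (Name, Prefix, LocalName, NamespaceURI, Attributes) follow the Perl SAX2 convention for element events.

```perl
use strict;
use warnings;

package NamespaceLister;    # hypothetical handler package, for illustration only

sub new {
    my $class = shift;
    return bless { seen => [] }, $class;
}

# Called once per opening tag; $el is a hash reference describing the element.
sub start_element {
    my ( $self, $el ) = @_;
    my $uri = defined $el->{NamespaceURI} ? $el->{NamespaceURI} : '';
    push @{ $self->{seen} }, '{' . $uri . '}' . $el->{LocalName};
}

package main;

my $handler = NamespaceLister->new;

# A hand-built event standing in for what a SAX2 parser would deliver
# on meeting <foo:bar xmlns:foo="http://example.com/ns">.
$handler->start_element({
    Name         => 'foo:bar',
    Prefix       => 'foo',
    LocalName    => 'bar',
    NamespaceURI => 'http://example.com/ns',
    Attributes   => {},
});

print "$_\n" for @{ $handler->{seen} };
```

Because the prefix and the local name arrive as separate properties, the handler never has to split a string like foo:bar itself.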
5.7.2.2. Entity resolver

By default, XML parsers resolve external entity references without your program ever knowing they were there. You may want to override that behavior occasionally. For example, you may have a special way of resolving public identifiers, or the entities may be stored as entries in a database. Whatever the reason, if you implement this handler, the parser will call it before attempting to resolve the entity on its own.

The argument to resolve_entity( ) is a hash with two properties: PublicId, a public identifier for the entity, and SystemId, the system-specific location of the entity, such as a filesystem path or a URI. If the public identifier is undef, then none was given, but a system identifier will always be present.

5.7.2.3. Lexical event handlers

Implementation of this group of events is optional. You probably don't need to see these events, so not all parsers will give them to you. However, a few very complete ones will. If you want to be able to duplicate the original source XML down to the very comments and CDATA sections, then you need a parser that supports these event handlers. They include start_dtd( ), end_dtd( ), start_entity( ), end_entity( ), start_cdata( ), end_cdata( ), and comment( ).
5.7.2.4. Error event handlers and catching exceptions

XML::SAX lets you customize your error handling with this group of handlers. Each handler takes one argument, called an exception, that describes the error in detail. The particular handler called represents the severity of the error, as defined by the W3C recommendation for parser behavior. There are three types: warning( ), error( ), and fatal_error( ), in increasing order of severity.
According to the XML specification, conformant parsers are supposed to halt when they encounter any kind of well-formedness or validity error. In Perl SAX, halting results in a call to die( ). That's not the end of the story, however. The parse itself cannot be resumed, but you can trap the exception and let your program carry on, using the eval{} construct, like this:

    eval { $parser->parse_uri( $uri ) };
    if( $@ ) {
        # yikes! handle error here...
    }

The $@ variable is a blessed hash of properties that piece together the story about why parsing failed. These properties include Message, LineNumber, ColumnNumber, PublicId, and SystemId.
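The following sketch shows the trap-and-inspect pattern in isolation, without a real parser. The DemoException class and the risky_parse( ) routine are stand-ins invented for this example. Only the property names Message, LineNumber, and ColumnNumber are meant to mirror those carried by XML::SAX exception objects.

```perl
use strict;
use warnings;

package DemoException;    # hypothetical stand-in for a SAX exception class

# Bless the given properties into an object and die with it.
sub throw {
    my ( $class, %props ) = @_;
    die bless {%props}, $class;
}

package main;

# Hypothetical routine that fails the way a parser might.
sub risky_parse {
    DemoException->throw(
        Message      => 'mismatched tag',
        LineNumber   => 12,
        ColumnNumber => 7,
    );
}

my $report = '';
eval { risky_parse(); 1 } or do {
    my $err = $@;    # copy $@ at once, before anything can overwrite it
    if ( ref $err ) {
        # A blessed hash: read the details by key.
        $report = $err->{Message}
                . ' at line '  . $err->{LineNumber}
                . ', column '  . $err->{ColumnNumber};
    }
    else {
        # A plain string passed to die( ).
        $report = "plain die: $err";
    }
};

print "$report\n";
```

Testing ref( ) before using hash keys guards against the case where something died with an ordinary string instead of an exception object.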
Not all thrown exceptions indicate that a failure to parse occurred. Sometimes the parser throws an exception because of a bad feature setting.

5.7.3. SAX2 Parser Interface

After you've written a handler package, you need to create an instance of the parser, set its features, and run it on the XML source. This section discusses the standard interface for XML::SAX parsers.

The parse( ) method, which gets the parsing process rolling, takes a hash of options as an argument. Here you can assign handlers, set features, and define the data source to be parsed. For example, the following line sets both the handler package and the source document to parse:

    $parser->parse( Handler => $handler,
                    Source  => { SystemId => "data.xml" } );

The Handler property sets a generic set of handlers that will be used by default. However, each class of handlers has its own assignment slot that will be checked before Handler. These settings include: ContentHandler, DTDHandler, EntityResolver, and ErrorHandler. All of these settings are optional. If you don't assign a handler, the parser will silently ignore events and handle errors in its own way.

The Source parameter is a hash used by a parser to hold all the information about the XML being input. It has the following properties: CharacterStream (a filehandle yielding decoded characters), ByteStream (a filehandle yielding raw bytes), String (the document held in a scalar), SystemId (a filename or URI), PublicId, and Encoding.
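The lookup order for handlers described in this section (a specific slot such as ErrorHandler first, then the generic Handler) can be pictured with a few lines of plain Perl. The pick_handler( ) helper is hypothetical and only illustrates the rule. It is not how XML::SAX is implemented, and the strings stand in for real handler objects.

```perl
use strict;
use warnings;

# Hypothetical helper: return the handler assigned to a specific slot,
# falling back to the generic Handler when that slot is empty.
sub pick_handler {
    my ( $options, $slot ) = @_;
    return defined $options->{$slot} ? $options->{$slot} : $options->{Handler};
}

# Strings stand in for handler objects to keep the sketch short.
my %options = (
    Handler      => 'generic handler',
    ErrorHandler => 'error handler',
);

my $for_errors  = pick_handler( \%options, 'ErrorHandler' );
my $for_content = pick_handler( \%options, 'ContentHandler' );

print "errors go to: $for_errors\n";
print "content goes to: $for_content\n";
```

Here errors reach the dedicated error handler, while content events fall through to the generic one because no ContentHandler was assigned.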
Any other options you want to set are in the set of features defined for SAX2. For example, you can tell a parser that you are interested in special treatment for namespaces. One way to set features is by defining the Features property in the options hash given to the parse( ) method. Another way is with the method set_feature( ). For example, here's how you would turn on validation in a validating parser using both methods:

    $parser->parse( Features => { 'http://xml.org/sax/features/validation' => 1 } );
    $parser->set_feature( 'http://xml.org/sax/features/validation', 1 );

For a complete list of features defined for SAX2, see the documentation at http://sax.sourceforge.net/apidoc/org/xml/sax/package-summary.html. You can also define your own features if your parser has special abilities others don't. To see what features your parser supports, get_features( ) returns a list and get_feature( ) with a name parameter reports the setting of a specific feature.

5.7.4. Example: A Driver

Making your own SAX parser is simple, as most of the work is handled by a base class, XML::SAX::Base. All you have to do is create a subclass of this class and override anything that isn't taken care of by default. Not only is it convenient to do this, but it will result in code that is much safer and more reliable than if you tried to create it from scratch. For example, checking whether the handler package implements the handler you want to call is done for you automatically.

The next example shows just how easy it is to create a parser that works with XML::SAX. It's a driver, similar to the kind we saw in Section 5.4, "Drivers for Non-XML Sources", except that instead of turning Excel documents into XML, it reads from web server log files.
The parser turns a line like this from a log file:

    10.16.251.137 - - [26/Mar/2000:20:30:52 -0800] "GET /index.html HTTP/1.0" 200 16171

into this snippet of XML:

    <entry>
      <ip>10.16.251.137</ip>
      <date>26/Mar/2000:20:30:52 -0800</date>
      <req>GET /index.html HTTP/1.0</req>
      <stat>200</stat>
      <size>16171</size>
    </entry>

Example 5-8 implements the XML::SAX driver for web logs. The first subroutine in the package is parse( ). Ordinarily, you wouldn't write your own parse( ) method because the base class does that for you, but it assumes that you want to input some form of XML, which is not the case for drivers. Thus, we shadow that routine with one of our own, specifically trained to handle web server log files.

Example 5-8. Web log SAX driver

    package LogDriver;

    require 5.005_62;
    use strict;
    use XML::SAX::Base;
    our @ISA = ('XML::SAX::Base');
    our $VERSION = '0.01';

    # Shadow the base class's parse( ), which expects XML input.
    sub parse {
        my $self = shift;
        my $file = shift;
        open( my $fh, '<', $file ) or die "Can't open $file: $!";
        $self->SUPER::start_element({ Name => 'server-log' });
        while( my $line = <$fh> ) {
            $self->_process_line( $line );
        }
        close $fh;
        $self->SUPER::end_element({ Name => 'server-log' });
    }

    # Break one log entry into fields and emit them as SAX events.
    sub _process_line {
        my $self = shift;
        my $line = shift;

        if( $line =~
            /(\S+)\s\S+\s\S+\s\[([^\]]+)\]\s\"([^\"]+)\"\s(\d+)\s(\d+)/ ) {
            my( $ip, $date, $req, $stat, $size ) = ( $1, $2, $3, $4, $5 );

            $self->SUPER::start_element({ Name => 'entry' });

            $self->SUPER::start_element({ Name => 'ip' });
            $self->SUPER::characters({ Data => $ip });
            $self->SUPER::end_element({ Name => 'ip' });

            $self->SUPER::start_element({ Name => 'date' });
            $self->SUPER::characters({ Data => $date });
            $self->SUPER::end_element({ Name => 'date' });

            $self->SUPER::start_element({ Name => 'req' });
            $self->SUPER::characters({ Data => $req });
            $self->SUPER::end_element({ Name => 'req' });

            $self->SUPER::start_element({ Name => 'stat' });
            $self->SUPER::characters({ Data => $stat });
            $self->SUPER::end_element({ Name => 'stat' });

            $self->SUPER::start_element({ Name => 'size' });
            $self->SUPER::characters({ Data => $size });
            $self->SUPER::end_element({ Name => 'size' });

            $self->SUPER::end_element({ Name => 'entry' });
        }
    }

    1;

Since web logs are line oriented (one entry per line), it makes sense to create a subroutine that handles a single line, _process_line( ). All it has to do is break down the web log entry into component parts and package them in XML elements. The parse( ) routine simply chops the document into separate lines and feeds them into the line processor one at a time.

Notice that we don't call event handlers in the handler package directly. Rather, we pass the data through routines in the base class, using it as an abstraction layer between the parser and the handler. This is convenient for you, the parser developer, because you don't have to check whether the handler package is listening for that type of event. Again, the base class is looking out for us, making our lives easier.

Let's test the parser now. Assuming that you have this module already installed (don't worry, we'll cover the topic of installing XML::SAX parsers in the next section), writing a program that uses it is easy. Example 5-9 creates a handler package and applies it to the parser we just developed.

Example 5-9. A program to test the SAX driver

    use XML::SAX::ParserFactory;
    use LogDriver;

    my $handler = MyHandler->new;
    my $parser = XML::SAX::ParserFactory->parser( Handler => $handler );
    $parser->parse( shift @ARGV );

    package MyHandler;

    # initialize object with options
    #
    sub new {
        my $class = shift;
        my $self = {@_};
        return bless( $self, $class );
    }

    # print an opening tag
    sub start_element {
        my $self = shift;
        my $data = shift;
        print "<", $data->{Name}, ">";
        print "\n" if( $data->{Name} eq 'entry' );
        print "\n" if( $data->{Name} eq 'server-log' );
    }

    # print a closing tag
    sub end_element {
        my $self = shift;
        my $data = shift;
        print "</", $data->{Name}, ">\n";
    }

    # print character data as is
    sub characters {
        my $self = shift;
        my $data = shift;
        print $data->{Data};
    }

We use XML::SAX::ParserFactory to demonstrate how a parser can be selected once it is registered.
If you wish, you can define attributes for the parser so that subsequent queries can select it based on those properties rather than its name.

The handler package is not terribly complicated; it turns the events into an XML character stream. Each handler receives a hash reference as an argument through which you can access each object's properties by the appropriate key. An element's name, for example, is stored under the hash key Name. It all works pretty much as you would expect.

5.7.5. Installing Your Own Parser

Our coverage of XML::SAX wouldn't be complete without showing you how to create an installation package that adds a parser to the registry automatically. Adding a parser is very easy with the h2xs utility. Though it was originally made to facilitate extensions to Perl written in C, it is invaluable in other ways. Here, we will use it to create something much like the module installers you've downloaded from CPAN.[26]
First, we start a new project with the following command:

    h2xs -AX -n LogDriver

h2xs automatically creates a directory called LogDriver, stocked with several files.
LogDriver.pm, the module to be installed, doesn't need much extra code to make h2xs happy. It only needs a variable, $VERSION, since h2xs is (justifiably) finicky about that information.

As you know from installing CPAN modules, the first thing you do when opening an installer archive is run the command perl Makefile.PL. Running this command generates a file called Makefile, which configures the installer to your system. Then you can run make and make install to load the module in the right place.

Any deviation from the default behavior of the installer must be coded in the Makefile.PL program. Untouched, it looks like this:

    use ExtUtils::MakeMaker;
    WriteMakefile(
        'NAME'         => 'LogDriver',     # module name
        'VERSION_FROM' => 'LogDriver.pm',  # finds version
    );

The argument to WriteMakefile( ) is a hash of properties about the module, used in generating a Makefile file. We can add more properties here to make the installer do more sophisticated things than just copy a module onto the system. For our parser, we want to add this line:

    'PREREQ_PM' => { 'XML::SAX' => 0 }

Adding this line triggers a check during installation to see if XML::SAX exists on the system. If not, the installation aborts with an error message. We don't want to install our parser until there is a framework to accept it.

This subroutine should also be added to Makefile.PL:

    sub MY::install {
        package MY;
        my $script = shift->SUPER::install(@_);
        $script =~ s/install :: (.*)$/install :: $1 install_sax_driver/m;
        $script .= <<"INSTALL";

    install_sax_driver :
    \t\@\$(PERL) -MXML::SAX -e "XML::SAX->add_parser(q(\$(NAME)))->save_parsers( )"

    INSTALL

        return $script;
    }

This subroutine adds the parser to the list maintained by XML::SAX. Now you can install your module.

Copyright © 2002 O'Reilly & Associates. All rights reserved.