"What," you ask,
"do you mean about keeping all the modules
synchronized with the API?" All along,
we've touted the wonders of using a standard like
SAX to ensure that modules are really interchangeable. But
here's the rub: in Perl, there's
more than one way to implement SAX. SAX was originally designed for
Java, whose interface construct nails down details such as which
type of argument to pass to which method.
Perl has nothing like that.
This wasn't as much of a problem with the older SAX
modules we've been talking about so far. They all
support SAX Level 1, which is fairly simple. However, a new crop of
modules that support SAX2 is breaking the surface. SAX2 is more
complex because it introduces namespaces to the mix. An element event
handler should receive both the namespace prefix and the local name
of the element. How should this information be passed in parameters?
Do you keep them together in the same string like
foo:bar? Or do you separate them into two
parameters?
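For reference, the Perl SAX2 convention sidesteps the question by passing both forms at once: the element hash carries the qualified name under Name and its parts under Prefix, LocalName, and NamespaceURI. The sketch below builds such a hash by hand so you can see its shape. The helpers split_qname and element_data and the sample namespace URI are made up for this illustration and are not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: split a qualified name such as "foo:bar" into
# its prefix and local name. A name without a colon gets an empty prefix.
sub split_qname {
    my $qname = shift;
    my( $prefix, $local ) = $qname =~ /^(?:([^:]+):)?(.+)$/;
    return ( defined $prefix ? $prefix : '', $local );
}

# Build an element hash shaped like the one a SAX2 start_element
# handler receives. The namespace URI passed in below is invented.
sub element_data {
    my( $qname, $uri ) = @_;
    my( $prefix, $local ) = split_qname( $qname );
    return {
        Name         => $qname,
        Prefix       => $prefix,
        LocalName    => $local,
        NamespaceURI => $uri,
    };
}

my $el = element_data( 'foo:bar', 'http://example.com/ns/foo' );
print "$_ => $el->{$_}\n" for sort keys %$el;
```

A handler that only wants the local name reads $el->{LocalName}, while one that needs the original spelling reads $el->{Name}.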
XML::SAX represents a shift in the way XML and
Perl work together. It builds on the work of the past, including all
the best features of previous modules, while avoiding many of the
mistakes. To ensure that modules are truly compatible, the kit
provides a base class for parsers, abstracting out most of the
mundane work that all parsers have to do, leaving the developer the
task of doing only what is unique to the task. It also creates an
abstract interface for users of parsers, allowing them to keep the
plethora of modules organized with a registry that is indexed by
properties to make it easy to find the right one with a simple query.
It's a bold step and carries a lot of heft, so be
prepared for a lot of information and detail in this section. We
think it will be worth your while.
5.7.1. XML::SAX::ParserFactory
We start with the parser selection interface,
XML::SAX::ParserFactory. If you have used DBI, this class will feel
familiar: it is a front end to all the SAX parsers on your system.
You simply request a new parser from the factory, and it digs one up
for you.
Let's say you want to use any SAX parser with your
handler package XML::SAX::MyHandler.
Here's how to fetch the parser and use it to read a
file:
use XML::SAX::ParserFactory;
use XML::SAX::MyHandler;
# create the handler, then ask the factory for any available parser
my $handler = XML::SAX::MyHandler->new;
my $parser = XML::SAX::ParserFactory->parser( Handler => $handler );
$parser->parse_uri( "foo.xml" );
The parser you get depends on the order in which
you've installed the modules. By default, the factory returns the
most recently installed parser that supports all the features
specified with RequiredFeatures, if any. But maybe you
don't want that one. No
problem; XML::SAX maintains a registry of SAX
parsers that you can choose from. Every time you install a new SAX
parser, it registers itself so you can call upon it with
ParserFactory. If you know you have the
XML::SAX::BobsParser parser installed, you can
require an instance of it by setting the variable
$XML::SAX::ParserPackage as follows:
use XML::SAX::ParserFactory;
use XML::SAX::MyHandler;
my $handler = XML::SAX::MyHandler->new;
# package name, optionally followed by a minimum version in parentheses
$XML::SAX::ParserPackage = "XML::SAX::BobsParser (1.24)";
my $parser = XML::SAX::ParserFactory->parser( Handler => $handler );
Setting $XML::SAX::ParserPackage to
XML::SAX::BobsParser (1.24) returns an instance of the package. Internally,
ParserFactory is require( )-ing
that parser and calling its new( ) class method. The
1.24 in the variable setting specifies a minimum
version number for the parser. If that version isn't
on your system, an exception will be thrown.
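To make the minimum-version check concrete, here is a minimal sketch of the idea using Perl's built-in VERSION method, which dies when a package is older than the version requested. It mimics the behavior described above rather than reproducing the factory's own code. BobsParser is defined inline as a stand-in for an installed module, and load_parser is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an installed parser module.
package BobsParser;
our $VERSION = '1.30';
sub new { return bless {}, shift }

package main;

# Split "Package (1.24)" into a package name and an optional minimum
# version, then let Perl's VERSION method enforce the minimum.
sub load_parser {
    my $spec = shift;
    my( $class, $min ) =
        $spec =~ /^([\w:]+)\s*(?:\(\s*([\d.]+)\s*\))?\s*$/
        or die "Bad parser spec: $spec\n";
    $class->VERSION( $min ) if defined $min;   # dies if too old
    return $class->new;
}

my $parser = load_parser( 'BobsParser (1.24)' );
print ref( $parser ), " loaded\n";

# Asking for a newer version than is available throws an exception.
my $newer = eval { load_parser( 'BobsParser (2.00)' ) };
print "Request for 2.00 failed: $@" unless $newer;
```

Wrapping the request in eval, as in the last two lines, is one way to fall back to another parser when the preferred one is too old.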
To see a list of all the parsers available to
XML::SAX, call the parsers( )
method:
use XML::SAX;
my @parsers = @{XML::SAX->parsers( )};
foreach my $p ( @parsers ) {
    print "\n", $p->{ Name }, "\n";
    foreach my $f ( sort keys %{$p->{ Features }} ) {
        print "$f => ", $p->{ Features }->{ $f }, "\n";
    }
}
The method returns a reference to an array of hashes, each describing
one parser: its name and a hash of the features it supports.
When we ran the program above, we were told that
XML::SAX had two registered parsers, each
supporting namespaces:
XML::LibXML::SAX::Parser
http://xml.org/sax/features/namespaces => 1
XML::SAX::PurePerl
http://xml.org/sax/features/namespaces => 1
At the time this book was written, these were the only two
parsers included with XML::SAX.
XML::LibXML::SAX::Parser is a SAX API for the
libxml2
library we use in Chapter 6, "Tree Processing". To use it,
you'll need to have libxml2, a
compiled, dynamically linked library written in C, installed on your
system. It's fast, but unless you can find a binary
or compile it yourself, it isn't very portable.
XML::SAX::PurePerl is, as the name suggests, a
parser written completely in Perl. As such, it's
completely portable because you can run it wherever Perl is
installed. This starter set of parsers already gives you some
different options.
The feature list associated with each parser is important because it
allows a user to select a parser based on a set of criteria. For
example, suppose you wanted a parser that did validation and
supported namespaces. You could request one by calling the
factory's require_feature( )
method:
my $factory = XML::SAX::ParserFactory->new;
$factory->require_feature( 'http://xml.org/sax/features/validation' );
$factory->require_feature( 'http://xml.org/sax/features/namespaces' );
my $parser = $factory->parser( Handler => $handler );
Alternatively, you can pass such information to the factory in its
constructor method:
my $factory = XML::SAX::ParserFactory->new(
    RequiredFeatures => {
        'http://xml.org/sax/features/validation' => 1,
        'http://xml.org/sax/features/namespaces' => 1,
    }
);
my $parser = $factory->parser( Handler => $handler );
If multiple parsers pass the test, the most recently installed one is
used. However, if the factory can't find a parser to
fit your requirements, it simply throws an exception.
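The selection rule just described can be sketched in a few lines of plain Perl. This is an illustration of the logic, not the factory's actual implementation: the registry below is a hand-built array shaped like the one parsers( ) returns, the entry Hypothetical::ValidatingParser is invented, and select_parser is a name chosen for this example. The order of the array stands in for installation order.

```perl
use strict;
use warnings;

# Registry entries shaped like those returned by XML::SAX->parsers( ):
# each has a Name and a hash of Features. The validating entry is
# invented for this example.
my @registry = (
    { Name     => 'XML::SAX::PurePerl',
      Features => { 'http://xml.org/sax/features/namespaces' => 1 } },
    { Name     => 'Hypothetical::ValidatingParser',
      Features => { 'http://xml.org/sax/features/namespaces' => 1,
                    'http://xml.org/sax/features/validation' => 1 } },
    { Name     => 'XML::LibXML::SAX::Parser',
      Features => { 'http://xml.org/sax/features/namespaces' => 1 } },
);

# Return the name of the last registered parser that supports every
# required feature, or throw an exception if none qualifies.
sub select_parser {
    my( $parsers, @required ) = @_;
    foreach my $p ( reverse @$parsers ) {
        my @missing = grep { !$p->{Features}{$_} } @required;
        return $p->{Name} unless @missing;
    }
    die "No parser supports: @required\n";
}

# Only namespaces required: the most recent entry wins.
print select_parser( \@registry,
    'http://xml.org/sax/features/namespaces' ), "\n";

# Validation required too: only the invented entry qualifies.
print select_parser( \@registry,
    'http://xml.org/sax/features/validation',
    'http://xml.org/sax/features/namespaces' ), "\n";
```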
To add more SAX modules to the registry, you only need to download
and install them. Their installer packages should know about
XML::SAX and automatically register the modules
with it. To add a module of your own, you can use
XML::SAX's add_parser(
) with a list of module names. Make sure
it follows the conventions of SAX modules by subclassing
XML::SAX::Base. Later, we'll show
you how to write a parser, install it, and add it to the
registry.
5.7.4. Example: A Driver
Making your own SAX parser is simple, as most of the work is handled
by a base class, XML::SAX::Base. All you have to
do is create a subclass of this class and override anything that
isn't taken care of by default. Not only is it
convenient to do this, but it will result in code that is much safer
and more reliable than if you tried to create it from scratch. For
example, checking if the handler package implements the handler you
want to call is done for you automatically.
The next example proves just how easy it is to create a parser that
works with XML::SAX. It's a
driver, similar to the kind we saw in Section 5.4, "Drivers for Non-XML Sources", except that instead of
turning Excel documents into XML, it reads from web server log files.
The parser turns a line like this from a log file:
10.16.251.137 - - [26/Mar/2000:20:30:52 -0800] "GET /index.html HTTP/1.0" 200 16171
into this snippet of XML:
<entry>
  <ip>10.16.251.137</ip>
  <date>26/Mar/2000:20:30:52 -0800</date>
  <req>GET /index.html HTTP/1.0</req>
  <stat>200</stat>
  <size>16171</size>
</entry>
Example 5-8 implements the
XML::SAX driver for web logs. The first subroutine
in the package is parse( ). Ordinarily, you
wouldn't write your own parse(
) method because the base class does that for you, but it
assumes that you want to input some form of XML, which is not the
case for drivers. Thus, we shadow that routine with one of our own,
specifically trained to handle web server log files.
Example 5-8. Web log SAX driver
package LogDriver;
require 5.005_62;
use strict;
use XML::SAX::Base;
our @ISA = ('XML::SAX::Base');
our $VERSION = '0.01';
# read a web server log file and wrap its entries in <server-log>
sub parse {
    my $self = shift;
    my $file = shift;
    open( my $fh, '<', $file ) or die "Can't open $file: $!\n";
    $self->SUPER::start_element({ Name => 'server-log' });
    while( my $line = <$fh> ) {
        $self->_process_line( $line );
    }
    close $fh;
    $self->SUPER::end_element({ Name => 'server-log' });
}
# turn one log line into an <entry> element; skip lines that don't match
sub _process_line {
    my $self = shift;
    my $line = shift;
    if( $line =~
        /(\S+)\s\S+\s\S+\s\[([^\]]+)\]\s\"([^\"]+)\"\s(\d+)\s(\d+)/ ) {
        my( $ip, $date, $req, $stat, $size ) = ( $1, $2, $3, $4, $5 );
        $self->SUPER::start_element({ Name => 'entry' });
        $self->SUPER::start_element({ Name => 'ip' });
        $self->SUPER::characters({ Data => $ip });
        $self->SUPER::end_element({ Name => 'ip' });
        $self->SUPER::start_element({ Name => 'date' });
        $self->SUPER::characters({ Data => $date });
        $self->SUPER::end_element({ Name => 'date' });
        $self->SUPER::start_element({ Name => 'req' });
        $self->SUPER::characters({ Data => $req });
        $self->SUPER::end_element({ Name => 'req' });
        $self->SUPER::start_element({ Name => 'stat' });
        $self->SUPER::characters({ Data => $stat });
        $self->SUPER::end_element({ Name => 'stat' });
        $self->SUPER::start_element({ Name => 'size' });
        $self->SUPER::characters({ Data => $size });
        $self->SUPER::end_element({ Name => 'size' });
        $self->SUPER::end_element({ Name => 'entry' });
    }
}
1;
Since web logs are line oriented (one entry per line), it makes sense
to create a subroutine that handles a single line,
_process_line( ). All it has to do is break down
the web log entry into component parts and package them in XML
elements. The parse( ) routine simply chops the
document into separate lines and feeds them into the line processor
one at a time.
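If you want to check the pattern on its own before wiring it into a driver, a small standalone sketch like the following can help. It uses the same regular expression as the driver and the sample line shown earlier. The function name parse_log_line and the lowercase hash keys are choices made for this example only.

```perl
use strict;
use warnings;

# Split one log line into its fields using the same pattern as the
# driver. Returns a hash reference, or nothing if the line does not
# match. parse_log_line is a name made up for this sketch.
sub parse_log_line {
    my $line = shift;
    if( $line =~
        /(\S+)\s\S+\s\S+\s\[([^\]]+)\]\s\"([^\"]+)\"\s(\d+)\s(\d+)/ ) {
        return { ip => $1, date => $2, req => $3, stat => $4, size => $5 };
    }
    return;
}

my $entry = parse_log_line(
    '10.16.251.137 - - [26/Mar/2000:20:30:52 -0800] ' .
    '"GET /index.html HTTP/1.0" 200 16171' );
print "$_: $entry->{$_}\n" for qw( ip date req stat size );
```

Note that the pattern expects a numeric size, so a line whose size field is a hyphen does not match and is skipped.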
Notice that we don't call event handlers in the
handler package directly. Rather, we pass the data through routines
in the base class, using it as an abstract layer between the parser
and the handler. This is convenient for you, the parser developer,
because you don't have to check if the handler
package is listening for that type of event. Again, the base class is
looking out for us, making our lives easier.
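The kind of check the base class performs on your behalf can be pictured with a toy version. The sketch below is a simplified illustration of the idea and not the code in XML::SAX::Base: ToyDriverBase forwards an event only if the handler has a method of that name, and TextOnlyHandler is a made-up handler that listens for character data alone.

```perl
use strict;
use warnings;

# A handler that only cares about character data.
package TextOnlyHandler;
sub new        { return bless { text => '' }, shift }
sub characters { my( $self, $data ) = @_; $self->{text} .= $data->{Data} }

# A toy stand-in for the base class (not XML::SAX::Base itself):
# forward an event only if the handler implements it.
package ToyDriverBase;
sub new { my( $class, %args ) = @_; return bless {%args}, $class }
sub _send {
    my( $self, $event, $data ) = @_;
    my $handler = $self->{Handler} or return;
    my $method  = $handler->can( $event ) or return;   # silently skip
    return $handler->$method( $data );
}
sub start_element { $_[0]->_send( 'start_element', $_[1] ) }
sub characters    { $_[0]->_send( 'characters',    $_[1] ) }
sub end_element   { $_[0]->_send( 'end_element',   $_[1] ) }

package main;

my $handler = TextOnlyHandler->new;
my $driver  = ToyDriverBase->new( Handler => $handler );
$driver->start_element({ Name => 'ip' });           # handler has no such method
$driver->characters({ Data => '10.16.251.137' });   # delivered
$driver->end_element({ Name => 'ip' });             # handler has no such method
print $handler->{text}, "\n";
```

Because the driver never calls the handler directly, a handler that omits start_element and end_element causes no error. The events it does not implement are simply dropped.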
Let's test the parser now. Assuming that you have
this module already installed (don't worry,
we'll cover the topic of installing
XML::SAX parsers in the next section), writing a
program that uses it is easy. Example 5-9 creates a
handler package and applies it to the parser we just developed.
Example 5-9. A program to test the SAX driver
use XML::SAX::ParserFactory;
use LogDriver;
my $handler = MyHandler->new;
my $parser = XML::SAX::ParserFactory->parser( Handler => $handler );
$parser->parse( shift @ARGV );
package MyHandler;
# initialize object with options
#
sub new {
    my $class = shift;
    my $self = {@_};
    return bless( $self, $class );
}
# print a start tag, breaking the line after container elements
sub start_element {
    my $self = shift;
    my $data = shift;
    print "<", $data->{Name}, ">";
    print "\n" if( $data->{Name} eq 'entry' );
    print "\n" if( $data->{Name} eq 'server-log' );
}
# print an end tag
sub end_element {
    my $self = shift;
    my $data = shift;
    print "</", $data->{Name}, ">\n";
}
# print character data unchanged
sub characters {
    my $self = shift;
    my $data = shift;
    print $data->{Data};
}
We use XML::SAX::ParserFactory to demonstrate how
a parser can be selected once it is registered. If you wish, you can
define attributes for the parser so that subsequent queries can
select it based on those properties rather than its name.
The handler package is not terribly complicated; it turns the events
into an XML character stream. Each handler method receives a hash
reference as its argument, and you access the properties of the event
by the appropriate key. An
element's name, for example, is stored under the
hash key Name. It all works pretty much as you
would expect.