3.4. Putting Parsers to Work
Enough
tinkering with the parser's internal details. We
want to see what you can do with the stuff you get from parsers.
We've already seen an example of a complete,
parser-built tree structure in Example 3-3, so
let's do something with the other kind of parser output: an event stream.
We'll take an XML event stream and make it drive
processing by plugging it into some code that handles the events. The
resulting program may not be the most useful tool in the world, but it
will serve well enough to show you how real-world XML processing
programs are written.
XML::Parser
(with Expat running underneath) is at the input
end of our program. Expat subscribes to the event-based parsing
school we described earlier. Rather than loading your whole XML
document into memory and then turning around to see what it hath
wrought, it stops every time it encounters a discrete chunk of data
or markup, such as an angle-bracketed tag or a literal string inside
an element. It then checks to see if our program wants to react to it
in any way.
Your first responsibility is to give the parser an interface to the
pertinent bits of code that handle events. Each type of event is
handled by a different subroutine, or
handler.
We register our handlers with the parser by setting the
Handlers option at initialization time. Example 3-5 shows the entire process.
Example 3-5. A stream-based XML processor
use strict;
use warnings;
use XML::Parser;

my @element_stack;                # remember which elements are open

# initialize the parser
my $parser = XML::Parser->new( Handlers =>
                               {
                                 Start => \&handle_start,
                                 End   => \&handle_end,
                               });
my $file = shift @ARGV or die "Usage: $0 file.xml\n";
$parser->parsefile( $file );

# process a start-of-element event: print message about element
#
sub handle_start {
    my( $expat, $element, %attrs ) = @_;

    # ask the expat object about our position
    my $line = $expat->current_line;

    print "I see a $element element starting on line $line!\n";

    # remember this element and its starting position by pushing a
    # little hash onto the element stack
    push( @element_stack, { element=>$element, line=>$line });

    if( %attrs ) {
        print "It has these attributes:\n";
        while( my( $key, $value ) = each( %attrs )) {
            print "\t$key => $value\n";
        }
    }
}

# process an end-of-element event
#
sub handle_end {
    my( $expat, $element ) = @_;

    # We'll just pop from the element stack with blind faith that
    # we'll get the correct closing element, unlike what our
    # homebrewed well-formedness checker did, since XML::Parser will
    # scream bloody murder if any well-formedness errors creep in.
    my $element_record = pop( @element_stack );
    print "I see that the $element element that started on line ",
          $element_record->{ line }, " is closing now.\n";
}
It's easy to see how this process works.
We've written two handler subroutines called
handle_start( ) and handle_end( ) and registered each with a
particular event in the call to new( ). When we
call parsefile( ), the parser knows it has handlers
for a start-of-element event and an end-of-element event. Every time
the parser trips over an element start tag, it calls the first
handler and gives it information about that element (element name and
attributes). Similarly, any end tag it encounters leads to a call of
the other handler, which receives the name of the element being closed.
Note that the parser also gives each handler a reference called
$expat. This is a reference to the
XML::Parser::Expat object, a low-level interface
to Expat. It has access to interesting information that might be
useful to a program, such as line numbers and element depth.
We've taken advantage of this fact, using the line
number to dazzle users with our amazing powers of document analysis.
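As a small illustration of that positional information, here is a minimal sketch that combines the depth( ) and current_line( ) methods of the XML::Parser::Expat object to print an indented outline of a document. The outline( ) subroutine and the sample document are made up for this illustration; only the two method calls come from the module.

```perl
use strict;
use warnings;
use XML::Parser;

# Return an indented outline of a document's elements, one per line.
# outline() is a hypothetical helper written for this illustration.
sub outline {
    my( $xml ) = @_;
    my @lines;

    my $parser = XML::Parser->new( Handlers => {
        Start => sub {
            my( $expat, $element ) = @_;
            # depth is the number of elements currently open around us
            my $indent = '  ' x $expat->depth;
            my $line   = $expat->current_line;
            push @lines, "$indent$element (line $line)";
        },
    });
    $parser->parse( $xml );    # parse( ) accepts a string of XML

    return join( "\n", @lines ) . "\n";
}

print outline( "<inventory>\n<item><name/></item>\n<item/>\n</inventory>\n" );
```

Because the handler is a closure, it can write to @lines directly, which avoids a global variable for a throwaway job like this one.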
Want to see it run? Here's how the output looks
after processing the customer database document from Example 1-1:
I see a spam-document element starting on line 1!
It has these attributes:
        version => 3.5
        timestamp => 2002-05-13 15:33:45
I see a customer element starting on line 3!
I see a first-name element starting on line 4!
I see that the first-name element that started on line 4 is closing now.
I see a surname element starting on line 5!
I see that the surname element that started on line 5 is closing now.
I see a address element starting on line 6!
I see a street element starting on line 7!
I see that the street element that started on line 7 is closing now.
I see a city element starting on line 8!
I see that the city element that started on line 8 is closing now.
I see a state element starting on line 9!
I see that the state element that started on line 9 is closing now.
I see a zip element starting on line 10!
I see that the zip element that started on line 10 is closing now.
I see that the address element that started on line 6 is closing now.
I see a email element starting on line 12!
I see that the email element that started on line 12 is closing now.
I see a age element starting on line 13!
I see that the age element that started on line 13 is closing now.
I see that the customer element that started on line 3 is closing now.
[... snipping other customers for brevity's sake ...]
I see that the spam-document element that started on line 1 is closing now.
Here we used the element stack again. We didn't
actually need to store the elements' names
ourselves; one of the methods you can call on the
XML::Parser::Expat object, context( ), returns the current
context list, an outermost-to-innermost list of the
elements that are currently open. However, a stack proved to be a
useful way to store additional information like line numbers. It
shows off the fact that you can let events build up structures of
arbitrary complexity: the
"memory" of the
document's past.
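If element names are all you need, the context list can stand in for a hand-rolled stack. The following is a rough sketch, assuming the behavior described in the XML::Parser::Expat documentation: context( ) lists the open elements from the outermost to the innermost, and inside a start handler it does not yet include the element being opened. The element_paths( ) subroutine is a hypothetical name used only here.

```perl
use strict;
use warnings;
use XML::Parser;

# Return a slash-separated path for every element in the document.
# element_paths() is a hypothetical helper written for this illustration.
sub element_paths {
    my( $xml ) = @_;
    my @paths;

    my $parser = XML::Parser->new( Handlers => {
        Start => sub {
            my( $expat, $element ) = @_;
            # context holds the open ancestors, so we append the new
            # element's own name to complete the path
            push @paths, join( '/', '', $expat->context, $element );
        },
    });
    $parser->parse( $xml );

    return @paths;
}

print "$_\n" for element_paths( '<a><b><c/></b><d/></a>' );
```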
There are many more event types than we handle here. We
don't do anything with character data, comments, or
processing instructions, for example. However, for the purpose of
this example, we don't need to go into those event
types. We'll have more exhaustive examples of event
processing in the next chapter, anyway.
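Just to give a taste of those other events, here is a minimal sketch that registers Char, Comment, and Proc handlers, the XML::Parser names for character data, comments, and processing instructions. The collect_extras( ) subroutine and the sample document are invented for this illustration.

```perl
use strict;
use warnings;
use XML::Parser;

# Gather the text, comments, and processing-instruction targets found
# in a document. collect_extras() is a hypothetical helper.
sub collect_extras {
    my( $xml ) = @_;
    my %seen = ( text => '', comments => [], pi_targets => [] );

    my $parser = XML::Parser->new( Handlers => {
        # character data may arrive in several pieces, so append each one
        Char    => sub { my( $expat, $string ) = @_;
                         $seen{text} .= $string },
        Comment => sub { my( $expat, $data ) = @_;
                         push @{ $seen{comments} }, $data },
        Proc    => sub { my( $expat, $target, $data ) = @_;
                         push @{ $seen{pi_targets} }, $target },
    });
    $parser->parse( $xml );

    return \%seen;
}

my $seen = collect_extras( '<note><?render fast?><!-- draft -->Buy milk</note>' );
print "Text: $seen->{text}\n";
print "Comments: @{ $seen->{comments} }\n";
print "PI targets: @{ $seen->{pi_targets} }\n";
```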
Before we close the topic of event processing, we want to mention one
thing: the Simple API for XML, more commonly known as
SAX. It's
very similar to the event processing model we've
seen so far, but the difference is that it's a
widely adopted standard. (Strictly speaking, it is a de facto standard
that grew out of the XML-DEV mailing list rather than a W3C
recommendation.) Being a standard means that it
has a standardized, canonical set of events. How these events should
be presented for processing is also standardized. The cool thing
about it is that with a standard interface, you can hook up different
program components like Legos and it will all work. If you
don't like one parser, just plug in another (and
sophisticated tools like the XML::SAX module
family can even help you pick a parser based on the features you
need). Get your XML data from a database, a file, or your
mother's shopping list; it
shouldn't matter where it comes from. SAX is very
exciting for the Perl community because we've long
been criticized for our lack of standards compliance and general
barbarism. Now we can be criticized for only one of those things. You
can expect a nice, thorough discussion on SAX (specifically,
PerlSAX,
our
beloved language's
mutation thereof) in Chapter 5, "SAX".
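To hint at what that plug-and-play style looks like, here is a rough sketch using the XML::SAX module family. It assumes XML::SAX is installed with at least one parser registered (the distribution includes a pure-Perl one). The ElementCounter class and the count_elements( ) subroutine are hypothetical names invented for this illustration.

```perl
use strict;
use warnings;

# A hypothetical handler class for this illustration. It inherits
# do-nothing versions of every SAX event method from XML::SAX::Base.
package ElementCounter;
use base 'XML::SAX::Base';

sub start_element {
    my( $self, $element ) = @_;
    # SAX passes the element's details in a hash reference
    $self->{counts}{ $element->{Name} }++;
}

package main;
use XML::SAX::ParserFactory;

# Count how many times each element name occurs in a document.
sub count_elements {
    my( $xml ) = @_;
    my $handler = ElementCounter->new;
    # the factory hands back whichever installed SAX parser it selects
    my $parser = XML::SAX::ParserFactory->parser( Handler => $handler );
    $parser->parse_string( $xml );
    return $handler->{counts};
}

my $counts = count_elements( '<list><item/><item/></list>' );
print "$_: $counts->{$_}\n" for sort keys %$counts;
```

Notice that the handler never names a particular parser. Swapping in a different SAX parser should not require any change to ElementCounter.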
Copyright © 2002 O'Reilly & Associates. All rights reserved.