4.6. XML::Parser
Another early parser is XML::Parser, the first fast and efficient parser to
hit CPAN. We detailed its many-faceted interface in Chapter 3, "XML Basics: Reading and Writing". Its built-in stream mode is worth a closer
look, though. Let's return to it now with a solid
stream example.
We'll use XML::Parser to read a
list of records encoded as an XML document. The records contain
contact information for people, including their names, street
addresses, and phone numbers. As the parser reads the file, our
handler will store the information in its own data structure for
later processing. Finally, when the parser is done, the program sorts
the records by the person's name and outputs them as
an HTML table.
The source document is listed in Example 4-3. It has
a <list> element as the root, with four
<entry> elements inside it, each with an
address, a name, and a phone number.
Example 4-3. Address book file
<list>
  <entry>
    <name><first>Thadeus</first><last>Wrigley</last></name>
    <phone>716-505-9910</phone>
    <address>
      <street>105 Marsupial Court</street>
      <city>Fairport</city><state>NY</state><zip>14450</zip>
    </address>
  </entry>
  <entry>
    <name><first>Jill</first><last>Baxter</last></name>
    <address>
      <street>818 S. Rengstorff Avenue</street>
      <zip>94040</zip>
      <city>Mountainview</city><state>CA</state>
    </address>
    <phone>217-302-5455</phone>
  </entry>
  <entry>
    <name><last>Riccardo</last>
      <first>Preston</first></name>
    <address>
      <street>707 Foobah Drive</street>
      <city>Mudhut</city><state>OR</state><zip>32777</zip>
    </address>
    <phone>111-222-333</phone>
  </entry>
  <entry>
    <address>
      <street>10 Jiminy Lane</street>
      <city>Scrapheep</city><state>PA</state><zip>99001</zip>
    </address>
    <name><first>Benn</first><last>Salter</last></name>
    <phone>611-328-7578</phone>
  </entry>
</list>
This simple structure lends itself naturally to event processing.
Each <entry> start tag signals the
preparation of a new part of the data structure for storing data. An
</entry> end tag indicates that all data for
the record has been collected and can be saved. Similarly, start and
end tags for <entry> subelements are cues
that tell the handler when and where to save information. Each
<entry> is self-contained, with no links to
the outside, making it easy to process.
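To make the event sequence concrete, here is a minimal sketch that logs each event the parser reports for a trimmed-down entry. The sample string and the @events array are assumptions made for illustration; they are not part of the address program.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample: one trimmed-down <entry> from the address book.
my $xml = '<list><entry><phone>716-505-9910</phone></entry></list>';

my @events;    # log of events, in the order the parser reports them

my $parser = XML::Parser->new( Handlers => {
    Start => sub { my( $expat, $name ) = @_; push @events, "start $name" },
    End   => sub { my( $expat, $name ) = @_; push @events, "end $name" },
    Char  => sub { my( $expat, $text ) = @_; push @events, "text $text" },
});
$parser->parse( $xml );

print "$_\n" for @events;
```

Each start tag, run of text, and end tag arrives as its own event, in document order, which is what lets a handler react to <entry> boundaries as they stream past.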
The program is listed in Example 4-4. At the top is
code used to initialize the parser object with references to
subroutines, each of which will serve as the handler for a single
event. This style of event handling is called a
callback because you write the subroutine
first, and the parser then calls it back when it needs it to handle
an event.
After the initialization, we declare some global variables to store
information from XML elements for later processing. These variables
give the handlers a memory, as mentioned earlier. Storing information
for later retrieval is often called saving
state because it helps the handlers preserve the state of
the parsing up to the current point in the document.
After reading in the data and applying the parser to it, the rest of
the program defines the handler subroutines. We handle five events:
the start and end of the document, the start and end of elements, and
character data. Other events, such as comments, processing
instructions, and document type declarations, will all be ignored.
Example 4-4. Code for the address program
# initialize the parser with references to handler routines
#
use strict;
use warnings;
use XML::Parser;
my $parser = XML::Parser->new( Handlers => {
    Init  => \&handle_doc_start,
    Final => \&handle_doc_end,
    Start => \&handle_elem_start,
    End   => \&handle_elem_end,
    Char  => \&handle_char_data,
});
#
# globals
#
my $record;     # points to a hash of element contents
my $context;    # name of current element
my %records;    # set of address entries
#
# read in the data and run the parser on it
#
my $file = shift @ARGV;
if( $file ) {
    $parser->parsefile( $file );
} else {
    my $input = "";
    while( <STDIN> ) { $input .= $_; }
    $parser->parse( $input );
}
exit;
###
### Handlers
###
#
# As processing starts, output the beginning of an HTML file.
#
sub handle_doc_start {
    print "<html><head><title>addresses</title></head>\n";
    print "<body><h1>addresses</h1>\n";
}
#
# save element name and attributes
#
sub handle_elem_start {
    my( $expat, $name, %atts ) = @_;
    $context = $name;
    $record = {} if( $name eq 'entry' );    # start a fresh record
}
#
# collect character data into the recent element's buffer
#
sub handle_char_data {
    my( $expat, $text ) = @_;
    # Perform some minimal entitizing of naughty characters
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    # append, because one element's text may arrive in several calls
    $record->{ $context } .= $text;
}
#
# if this is an <entry>, collect all the data into a record
#
sub handle_elem_end {
    my( $expat, $name ) = @_;
    return unless( $name eq 'entry' );
    # key on last name, then first name, so sorting the keys sorts by name
    my $fullname = $record->{'last'} . $record->{'first'};
    $records{ $fullname } = $record;
}
#
# Output the sorted table and the close of the file at the end of processing.
#
sub handle_doc_end {
    print "<table border='1'>\n";
    print "<tr><th>name</th><th>phone</th><th>address</th></tr>\n";
    foreach my $key ( sort( keys( %records ))) {
        print "<tr><td>" . $records{ $key }->{ 'first' } . ' ';
        print $records{ $key }->{ 'last' } . "</td><td>";
        print $records{ $key }->{ 'phone' } . "</td><td>";
        print $records{ $key }->{ 'street' } . ', ';
        print $records{ $key }->{ 'city' } . ', ';
        print $records{ $key }->{ 'state' } . ' ';
        print $records{ $key }->{ 'zip' } . "</td></tr>\n";
    }
    print "</table>\n</body></html>\n";
}
To understand how this program works, we need to study the handlers.
All handlers called by XML::Parser receive a
reference to the expat parser object as their
first argument, a courtesy to developers in case they want to access
its data (for example, to check the input file's
current line number). Other arguments may be passed, depending on the
kind of event. For example, the start-element event handler gets the
name of the element as the second argument, and then gets a list of
attribute names and values.
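As a small illustration of what the expat reference is good for, the following sketch records the input line on which each element starts. It assumes the current_line method of the expat object; the sample string and the %line_of hash are made up for this example.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample spread over several lines.
my $xml = "<list>\n<entry>\n<phone>716-505-9910</phone>\n</entry>\n</list>";

my %line_of;    # element name => line number of its start tag

my $parser = XML::Parser->new( Handlers => {
    Start => sub {
        my( $expat, $name ) = @_;
        # Ask the expat object where in the input this event occurred.
        $line_of{ $name } = $expat->current_line;
        print "<$name> starts on line $line_of{$name}\n";
    },
});
$parser->parse( $xml );
```

The same call is handy inside any handler when you want error messages to point back to a position in the source file.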
Our handlers use global variables to store information. If you
don't like global variables (in larger programs,
they can be a headache to debug), you can create an object that
stores the information internally. You would then give the parser
your object's methods as handlers.
We'll stick with globals for now because they are
easier to read in our example.
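For comparison, here is one way the object-based alternative might look. This is a sketch under stated assumptions, not the approach used in Example 4-4: the AddressCollector class and its method names are hypothetical, and the methods are handed to the parser through small closures so that each call reaches the object.

```perl
use strict;
use warnings;
use XML::Parser;

package AddressCollector;    # hypothetical class that holds the parse state

sub new {
    my $class = shift;
    return bless { record => undef, context => '', records => {} }, $class;
}

sub start {
    my( $self, $expat, $name ) = @_;
    $self->{context} = $name;
    $self->{record}  = {} if $name eq 'entry';
}

sub char {
    my( $self, $expat, $text ) = @_;
    # Ignore text that arrives before the first <entry> has started.
    $self->{record}{ $self->{context} } .= $text if $self->{record};
}

sub end {
    my( $self, $expat, $name ) = @_;
    return unless $name eq 'entry';
    my $r = $self->{record};
    $self->{records}{ $r->{last} . $r->{first} } = $r;
}

package main;

my $collector = AddressCollector->new;
my $parser = XML::Parser->new( Handlers => {
    Start => sub { $collector->start( @_ ) },
    End   => sub { $collector->end( @_ ) },
    Char  => sub { $collector->char( @_ ) },
});
$parser->parse( '<list><entry><name><first>Jill</first>'
              . '<last>Baxter</last></name>'
              . '<phone>217-302-5455</phone></entry></list>' );

print join( ', ', sort keys %{ $collector->{records} } ), "\n";
```

All state now lives inside $collector, so two parsers can run side by side without sharing variables.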
The first handler is handle_doc_start, called at the start of parsing.
This handler is a convenient way to do some work before processing
the document. In our case, it just outputs HTML code to begin the
HTML page in which the sorted address entries will be formatted. This
subroutine has no special arguments.
The next handler, handle_elem_start, is called
whenever the parser encounters the start of a new element. After the
obligatory expat reference, the routine gets two
arguments: $name, which is the element name, and
%atts, a hash of attribute names and values. (Note
that using a hash will not preserve the order of attributes, so if
order is important to you, you should use an @atts
array instead.) For this simple example, we don't
use attributes, but we leave open the possibility of using them
later.
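If attribute order does matter, a start handler can take the attributes as a list and walk it in pairs. The sketch below assumes an <entry> element with three made-up attributes; neither the attributes nor the @order array appear in the address book example.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical element with attributes; the address book file has none.
my $xml = '<entry id="e1" type="home" lang="en"/>';

my @order;    # attribute names, in the order the parser delivered them

my $parser = XML::Parser->new( Handlers => {
    Start => sub {
        my( $expat, $name, @atts ) = @_;
        # @atts is a flat list: name, value, name, value, ...
        while( my( $att, $value ) = splice( @atts, 0, 2 ) ) {
            push @order, $att;
            print "$name: $att = $value\n";
        }
    },
});
$parser->parse( $xml );
```

Copying the same arguments into a hash would keep the name-to-value mapping but lose the sequence.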
This routine sets up processing of an element by saving the name of
the element in a variable called $context. Saving
the element's name ensures that we will know what to
do with character data events the parser will send later. When the
element is an <entry>, the routine also points $record at a fresh,
empty hash, which will contain the data for each of
<entry>'s subelements in a
convenient look-up table.
The handler handle_char_data takes care of
nonmarkup data, which is basically all the character data in elements.
This text is passed as the second argument, here called
$text. After escaping any & and < characters, the handler only needs to
save the content in the buffer $record->{ $context }. Notice
that we append the character data to the buffer, rather than assign
it outright. XML::Parser has a funny quirk in
which it calls the character handler after each line or
newline-separated string of text. Thus, if the content of an element
includes a newline character, this will result in at least two separate
calls to the handler. If you didn't append the data, then the last call
would overwrite the ones before it.
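The following sketch shows why appending matters. It feeds the parser an element whose content contains a newline, counts how often the character handler fires, and rebuilds the text by concatenation. The sample string and the $calls and $buffer variables are assumptions for illustration, and the exact number of calls is up to the parser.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical element whose content is split across two lines.
my $xml = "<street>105 Marsupial\nCourt</street>";

my $calls  = 0;     # how many times the character handler was called
my $buffer = '';    # text accumulated across all calls

my $parser = XML::Parser->new( Handlers => {
    Char => sub {
        my( $expat, $text ) = @_;
        $calls++;
        $buffer .= $text;    # append; plain assignment would lose earlier pieces
    },
});
$parser->parse( $xml );

print "handler calls: $calls\n";
print "buffer: $buffer\n";
```

However the parser chooses to split the text, the concatenated buffer ends up holding the element's complete content.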
Not surprisingly, handle_elem_end handles
end-of-element events. The second argument is the
element's name, as with the start-element event
handler. For most elements, there's not much to do
here, but for <entry>, we have a final
housekeeping task. At this point, all the information for a record
has been collected, so the record is complete. We only have to store
it in a hash, indexed by the person's last name followed by the first
name, so that we can easily sort the records later. The sorting can be done
only after all the records are in, so we need to store the record for
later processing. If we weren't interested in
sorting, we could just output the record as HTML.
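The effect of that key is easy to see in isolation. This sketch uses hand-built records (hypothetical stand-ins for what the parser would collect) and sorts the hash keys as the address program does.

```perl
use strict;
use warnings;

# Hypothetical records, keyed the same way as in the address program:
# last name followed by first name.
my %records = (
    'WrigleyThadeus'  => { first => 'Thadeus', last => 'Wrigley'  },
    'BaxterJill'      => { first => 'Jill',    last => 'Baxter'   },
    'RiccardoPreston' => { first => 'Preston', last => 'Riccardo' },
    'SalterBenn'      => { first => 'Benn',    last => 'Salter'   },
);

# Sorting the keys orders the records by last name, then first name.
my @sorted = map { "$records{$_}{first} $records{$_}{last}" }
             sort keys %records;

print "$_\n" for @sorted;
```

One consequence of keying on the name is that two entries for people with identical names would share a key, and the later one would replace the earlier.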
Finally, the handle_doc_end handler completes our
set, performing any final tasks that remain after reading the
document. It so happens that we do have something to do. We need to
print out the records, sorted alphabetically by contact name. The
subroutine generates an HTML table to format the entries nicely.
This example, which involved a flat sequence of records, was pretty
simple, but not all XML is like that. In some complex document
formats, you have to consider the parent, grandparent, and even
distant ancestors of the current element to decide what to do with an
event. Remembering an element's ancestry requires a
more sophisticated state-saving structure, which we will show in a
later example.
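As a small preview of that idea, one common way to remember ancestry is to keep a stack of open element names: push on each start tag and pop on each end tag. The sketch below is a minimal illustration under that assumption. The @ancestors stack, the %text_at hash, and the sample string are hypothetical names chosen for this example.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample with elements nested a few levels deep.
my $xml = '<list><entry><name><first>Jill</first></name>'
        . '<address><city>Mountainview</city></address></entry></list>';

my @ancestors;    # stack of currently open elements, outermost first
my %text_at;      # slash-separated element path => text found there

my $parser = XML::Parser->new( Handlers => {
    Start => sub { my( $expat, $name ) = @_; push @ancestors, $name },
    End   => sub { pop @ancestors },
    Char  => sub {
        my( $expat, $text ) = @_;
        return unless $text =~ /\S/;    # skip whitespace-only chunks
        # The stack tells us exactly where in the tree this text lives.
        $text_at{ join( '/', @ancestors ) } .= $text;
    },
});
$parser->parse( $xml );

print "$_: $text_at{$_}\n" for sort keys %text_at;
```

With the full path available, a handler can treat a <first> inside <name> differently from a <first> that appears somewhere else.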
Copyright © 2002 O'Reilly & Associates. All rights reserved.