XML::Parser (Perl and XML)

4.6. XML::Parser

We'll use XML::Parser to read a list of records encoded as an XML document. The records contain contact information for people, including their names, street addresses, and phone numbers. As the parser reads the file, our handler will store the information in its own data structure for later processing. Finally, when the parser is done, the program sorts the records by the person's name and outputs them as an HTML table.

The source document is listed in Example 4-3. It has a <list> element as the root, with four <entry> elements inside it, each with an address, a name, and a phone number.

Example 4-3. Address book file

<list>
  <entry>
    <name><first>Thadeus</first><last>Wrigley</last></name>
    <phone>716-505-9910</phone>
    <address>
      <street>105 Marsupial Court</street>
      <city>Fairport</city><state>NY</state><zip>14450</zip>
    </address>
  </entry>
  <entry>
    <name><first>Jill</first><last>Baxter</last></name>
    <address>
      <street>818 S. Rengstorff Avenue</street>
      <zip>94040</zip>
      <city>Mountainview</city><state>CA</state>
    </address>
    <phone>217-302-5455</phone>
  </entry>
  <entry>
    <name><last>Riccardo</last>
    <first>Preston</first></name>
    <address>
      <street>707 Foobah Drive</street>
      <city>Mudhut</city><state>OR</state><zip>32777</zip>
    </address>
    <phone>111-222-333</phone>
  </entry>
  <entry>
    <address>
      <street>10 Jiminy Lane</street>
      <city>Scrapheep</city><state>PA</state><zip>99001</zip>
    </address>
    <name><first>Benn</first><last>Salter</last></name>
    <phone>611-328-7578</phone>
  </entry>
</list>

This simple structure lends itself naturally to event processing. Each <entry> start tag signals the preparation of a new part of the data structure for storing data. An </entry> end tag indicates that all data for the record has been collected and can be saved. Similarly, start and end tags for <entry> subelements are cues that tell the handler when and where to save information. Each <entry> is self-contained, with no links to the outside, making it easy to process.

The program is listed in Example 4-4. At the top is code used to initialize the parser object with references to subroutines, each of which will serve as the handler for a single event. This style of event handling is called a callback because you write the subroutine first, and the parser then calls it back when it needs it to handle an event.

After the initialization, we declare some global variables to store information from XML elements for later processing. These variables give the handlers a memory, as mentioned earlier. Storing information for later retrieval is often called saving state because it helps the handlers preserve the state of the parsing up to the current point in the document.

After reading in the data and applying the parser to it, the rest of the program defines the handler subroutines. We handle five events: the start and end of the document, the start and end of elements, and character data. Other events, such as comments, processing instructions, and document type declarations, will all be ignored.

Example 4-4. Code for the address program

# initialize the parser with references to handler routines
#
use XML::Parser;
my $parser = XML::Parser->new( Handlers => {
    Init =>    \&handle_doc_start,
    Final =>   \&handle_doc_end,
    Start =>   \&handle_elem_start,
    End =>     \&handle_elem_end,
    Char =>    \&handle_char_data,
});

#
# globals
#
my $record;       # points to a hash of element contents
my $context;      # name of current element
my %records;      # set of address entries

#
# read in the data and run the parser on it
#
my $file = shift @ARGV;
if( $file ) {
    $parser->parsefile( $file );
} else {
    my $input = "";
    while( <STDIN> ) { $input .= $_; }
    $parser->parse( $input );
}
exit;

###
### Handlers
###

#
# As processing starts, output the beginning of an HTML file.
# 
sub handle_doc_start {
    print "<html><head><title>addresses</title></head>\n";
    print "<body><h1>addresses</h1>\n";
}

#
# save the element name; start a new record at each <entry>
#
sub handle_elem_start {
    my( $expat, $name, %atts ) = @_;
    $context = $name;
    $record = {} if( $name eq 'entry' );
} 

#
# collect character data into the most recently opened element's buffer
#
sub handle_char_data {
    my( $expat, $text ) = @_;

    # Perform some minimal entitizing of naughty characters
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;

    $record->{ $context } .= $text;
}

#
# if this is an <entry>, collect all the data into a record
#
sub handle_elem_end {
    my( $expat, $name ) = @_;
    return unless( $name eq 'entry' );
    my $fullname = $record->{'last'} . $record->{'first'};
    $records{ $fullname } = $record;
}

#
# At the end of processing, output the sorted records as an HTML table
# and close the HTML file.
#
sub handle_doc_end {
    print "<table border='1'>\n";
    print "<tr><th>name</th><th>phone</th><th>address</th></tr>\n";
    foreach my $key ( sort( keys( %records ))) {
        print "<tr><td>" . $records{ $key }->{ 'first' } . ' ';
        print $records{ $key }->{ 'last' } . "</td><td>";
        print $records{ $key }->{ 'phone' } . "</td><td>";
        print $records{ $key }->{ 'street' } . ', ';
        print $records{ $key }->{ 'city' } . ', ';
        print $records{ $key }->{ 'state' } . ' ';
        print $records{ $key }->{ 'zip' } . "</td></tr>\n";
    }
    print "</table>\n</body></html>\n";
}

To understand how this program works, we need to study the handlers. All handlers called by XML::Parser receive a reference to the expat parser object as their first argument, a courtesy to developers in case they want to access its data (for example, to check the input file's current line number). Other arguments may be passed, depending on the kind of event. For example, the start-element event handler gets the name of the element as the second argument, and then gets a list of attribute names and values.
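As a rough illustration of using the expat reference, the sketch below records the input line on which each start tag appears. It assumes the current_line method of the underlying XML::Parser::Expat object; the two-entry document and the @seen array are made up for this example.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical two-entry document, inlined so the sketch is self-contained.
my $xml = <<'END_XML';
<list>
  <entry><phone>716-505-9910</phone></entry>
  <entry><phone>217-302-5455</phone></entry>
</list>
END_XML

my @seen;    # "name:line" for every start tag, in document order

my $parser = XML::Parser->new( Handlers => {
    Start => sub {
        my( $expat, $name ) = @_;
        # $expat is the XML::Parser::Expat object; current_line reports
        # the line of the input that the parser is currently working on.
        push @seen, $name . ':' . $expat->current_line;
    },
});
$parser->parse( $xml );

print "$_\n" for @seen;
```

The same object is passed to every handler, so a character data or end-element handler could report its position in the same way.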

Our handlers use global variables to store information. If you don't like global variables (in larger programs, they can be a headache to debug), you can create an object that stores the information internally. You would then give the parser your object's methods as handlers. We'll stick with globals for now because they are easier to read in our example.
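One possible shape for the object-based alternative is sketched here. The AddressCollector class and its method names are hypothetical, not part of XML::Parser. Because the parser calls plain subroutines, each handler is a small closure that forwards the event to the object.

```perl
use strict;
use warnings;
use XML::Parser;

# AddressCollector is a hypothetical class; it keeps the state that the
# example program stores in globals.
package AddressCollector;

sub new {
    my $class = shift;
    return bless { record => undef, context => '', records => {} }, $class;
}

sub start {
    my( $self, $expat, $name ) = @_;
    $self->{context} = $name;
    $self->{record}  = {} if $name eq 'entry';
}

sub char {
    my( $self, $expat, $text ) = @_;
    return unless $self->{record};     # ignore text outside any <entry>
    $self->{record}{ $self->{context} } .= $text;
}

sub end {
    my( $self, $expat, $name ) = @_;
    return unless $name eq 'entry';
    my $r = $self->{record};
    $self->{records}{ $r->{last} . $r->{first} } = $r;
    $self->{record} = undef;
}

package main;

my $collector = AddressCollector->new;

# Each closure passes the parser's arguments on to a method call.
my $parser = XML::Parser->new( Handlers => {
    Start => sub { $collector->start( @_ ) },
    Char  => sub { $collector->char( @_ ) },
    End   => sub { $collector->end( @_ ) },
});

$parser->parse( '<list><entry><name><first>Jill</first>'
              . '<last>Baxter</last></name></entry></list>' );

print join( ', ', sort keys %{ $collector->{records} } ), "\n";
```

All state now lives inside $collector, so two parsers with two collectors could run in the same program without interfering with each other.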

The first handler, handle_doc_start, is called once, before parsing begins. Apart from the expat reference it receives nothing of interest, and in our program it simply prints the opening of the HTML page.

The next handler, handle_elem_start, is called whenever the parser encounters the start of a new element. After the obligatory expat reference, the routine gets two arguments: $name, which is the element name, and %atts, a hash of attribute names and values. (Note that a hash does not preserve the order of attributes, so if order is important to you, you should collect them in an @atts array instead.) For this simple example, we don't use attributes, but we leave open the possibility of using them later.
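A minimal sketch of the array variant follows. It relies on the start handler receiving attributes as a flat list of alternating names and values; the type and ext attributes are invented for the illustration, since the address book in Example 4-3 has none.

```perl
use strict;
use warnings;
use XML::Parser;

my @ordered;    # "name=value" strings in document order

my $parser = XML::Parser->new( Handlers => {
    Start => sub {
        my( $expat, $name, @atts ) = @_;
        # @atts is a flat list: name1, value1, name2, value2, ...
        for ( my $i = 0; $i < @atts; $i += 2 ) {
            push @ordered, "$atts[$i]=$atts[$i + 1]";
        }
    },
});

# Hypothetical element with two attributes.
$parser->parse( '<phone type="work" ext="12">716-505-9910</phone>' );

print join( ' ', @ordered ), "\n";
```

If you need both ordered access and lookup by name, you can keep @atts and also build a hash from it with my %atts = @atts.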

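This routine sets up processing of an element by saving the name of the element in a variable called $context. Saving the element's name ensures that we will know what to do with character data events the parser will send later. When the element is an <entry>, the routine also points $record at a new, empty hash, which will hold the data for each of <entry>'s subelements in a convenient look-up table.

The handle_char_data handler receives the text as its second argument. It escapes any & and < characters so that the text is safe to place in the HTML output, then appends the text to the slot in $record named by $context.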
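The character data handler appends with .= rather than assigning because, as the XML::Parser documentation notes, a single run of text may be delivered in more than one call. The sketch below counts the calls for a made-up element containing an entity reference. The %record hash and $calls counter are names chosen for this example, and the exact number of calls depends on the parser, so no particular count is assumed.

```perl
use strict;
use warnings;
use XML::Parser;

my $context = '';
my %record;
my $calls = 0;    # how many times the Char handler ran

my $parser = XML::Parser->new( Handlers => {
    Start => sub { my( $expat, $name ) = @_; $context = $name; },
    Char  => sub {
        my( $expat, $text ) = @_;
        $calls++;
        $record{ $context } .= $text;   # append, never overwrite
    },
});

# Hypothetical one-element document with an entity reference.
$parser->parse( '<street>Rock &amp; Roll Road</street>' );

print "Char handler calls: $calls\n";
print "street: $record{street}\n";
```

However many pieces the text arrives in, the appended result is the complete string.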

Not surprisingly, handle_elem_end handles end-of-element events. The second argument is the element's name, as with the start-element event handler. For most elements, there's not much to do here, but for <entry>, we have a final housekeeping task. At this point, all the information for a record has been collected, so the record is complete. We only have to store it in the %records hash, keyed by the person's last name followed by the first name, so that we can easily sort the records later. The sorting can be done only after all the records are in, so we need to store the record for later processing. If we weren't interested in sorting, we could just output the record as HTML.
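The effect of keying the hash this way can be seen in isolation. The sketch below skips parsing and starts from a hand-built %records hash that holds only the names from Example 4-3; sorting the keys yields the records in last-name order.

```perl
use strict;
use warnings;

# Hand-built stand-in for the hash the handlers would fill, keyed by
# last name followed by first name.
my %records = (
    'WrigleyThadeus'  => { first => 'Thadeus', last => 'Wrigley'  },
    'BaxterJill'      => { first => 'Jill',    last => 'Baxter'   },
    'RiccardoPreston' => { first => 'Preston', last => 'Riccardo' },
    'SalterBenn'      => { first => 'Benn',    last => 'Salter'   },
);

# Sorting the keys sorts by last name, then by first name.
my @sorted = map { "$records{$_}{first} $records{$_}{last}" }
             sort keys %records;

print "$_\n" for @sorted;
```

One consequence of this design is that two people with the same first and last names share a key, so the later record would replace the earlier one.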

Finally, the handle_doc_end handler completes our set, performing any final tasks that remain after reading the document. It so happens that we do have something to do. We need to print out the records, sorted alphabetically by contact name. The subroutine generates an HTML table to format the entries nicely.