Putting Parsers to Work (Perl and XML)

Example 3-5. A stream-based XML processor

use strict;
use warnings;
use XML::Parser;

my @element_stack;                # remember which elements are open

# initialize the parser
my $parser = XML::Parser->new( Handlers =>
                                     {
                                      Start=>\&handle_start,
                                      End=>\&handle_end,
                                     });
$parser->parsefile( shift @ARGV );

# process a start-of-element event: print message about element
#
sub handle_start {
    my( $expat, $element, %attrs ) = @_;

    # ask the expat object about our position
    my $line = $expat->current_line;

    print "I see a $element element starting on line $line!\n";

    # remember this element and its starting position by pushing a
    # little hash onto the element stack
    push( @element_stack, { element=>$element, line=>$line });

    if( %attrs ) {
        print "It has these attributes:\n";
        while( my( $key, $value ) = each( %attrs )) {
            print "\t$key => $value\n";
        }
    }
}

# process an end-of-element event
#
sub handle_end {
    my( $expat, $element ) = @_;

    # We'll just pop from the element stack with blind faith that
    # we'll get the correct closing element, unlike what our
    # homebrewed well-formedness did, since XML::Parser will scream
    # bloody murder if any well-formedness errors creep in.
    my $element_record = pop( @element_stack );
    print "I see that the $element element that started on line ",
          $element_record->{ line }, " is closing now.\n";
}

It's easy to see how this process works. We've written two handler subroutines called handle_start( ) and handle_end( ) and registered each with a particular event in the call to new( ). When we call parsefile( ), the parser knows it has handlers for a start-of-element event and an end-of-element event. Every time the parser trips over an element start tag, it calls the first handler and gives it information about that element (element name and attributes). Similarly, any end tag it encounters leads to a call of the other handler, which receives the name of the element being closed.
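To see the registration mechanism in isolation, here is a minimal sketch that parses an in-memory string with parse( ) rather than a file with parsefile( ). The helper name count_events( ) and the sample document are our own inventions for illustration, and the handlers are anonymous subroutines instead of named ones.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper (not part of XML::Parser): count the start and
# end tags the parser reports for an XML string.
sub count_events {
    my( $xml ) = @_;
    my %count = ( start => 0, end => 0 );

    # register one anonymous handler per event type
    my $parser = XML::Parser->new( Handlers => {
        Start => sub { $count{ start }++ },
        End   => sub { $count{ end }++ },
    });
    $parser->parse( $xml );       # parse( ) accepts a string or an open filehandle
    return %count;
}

my %seen = count_events( '<customer><age>32</age></customer>' );
print "start tags: $seen{start}, end tags: $seen{end}\n";
```

Any event without a registered handler is simply ignored, which is why the character data inside the age element causes no trouble here.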

Note that the parser also passes each handler a reference as its first argument, which we've named $expat. This is a reference to the XML::Parser::Expat object, a low-level interface to Expat. It has access to interesting information that might be useful to a program, such as line numbers and element depth. We've taken advantage of this fact, using the line number to dazzle users with our amazing powers of document analysis.
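The element depth mentioned above can be put to work as well. The following is a rough sketch, assuming a helper of our own called outline( ) and a made-up sample document. It uses the depth( ) and current_line( ) methods of the expat object to build an indented outline of element names.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper (our own name): return one "name (line N)" string
# per element, indented according to expat's idea of the current depth.
sub outline {
    my( $xml ) = @_;
    my @lines;

    my $parser = XML::Parser->new( Handlers => {
        Start => sub {
            my( $expat, $element ) = @_;
            # depth( ) is the size of expat's list of open elements
            my $indent = '  ' x $expat->depth;
            push( @lines, $indent . "$element (line " . $expat->current_line . ")" );
        },
    });
    $parser->parse( $xml );
    return @lines;
}

print "$_\n" for outline(
    "<customer>\n<address>\n<city>Anytown</city>\n</address>\n</customer>" );
```

Inside a start handler, the list of open elements does not yet include the element being opened, so the root should come out unindented. Because expat already tracks depth for us, the @element_stack in Example 3-5 is needed only to remember where each element started.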

Run on a sample document, Example 3-5 prints output like this:

I see a spam-document element starting on line 1!
It has these attributes:
	version => 3.5
	timestamp => 2002-05-13 15:33:45
I see a customer element starting on line 3!
I see a first-name element starting on line 4!
I see that the first-name element that started on line 4 is closing now.
I see a surname element starting on line 5!
I see that the surname element that started on line 5 is closing now.
I see a address element starting on line 6!
I see a street element starting on line 7!
I see that the street element that started on line 7 is closing now.
I see a city element starting on line 8!
I see that the city element that started on line 8 is closing now.
I see a state element starting on line 9!
I see that the state element that started on line 9 is closing now.
I see a zip element starting on line 10!
I see that the zip element that started on line 10 is closing now.
I see that the address element that started on line 6 is closing now.
I see a email element starting on line 12!
I see that the email element that started on line 12 is closing now.
I see a age element starting on line 13!
I see that the age element that started on line 13 is closing now.
I see that the customer element that started on line 3 is closing now.
  [... snipping other customers for brevity's sake ...]
I see that the spam-document element that started on line 1 is closing now.
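The comment in handle_end( ) leans on XML::Parser dying as soon as it meets a well-formedness error. Here is a minimal sketch of how a program might trap that failure instead of crashing, assuming a helper name of our own invention, is_well_formed( ), and two toy documents.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper (our own name): true if the string parses cleanly.
sub is_well_formed {
    my( $xml ) = @_;
    my $parser = XML::Parser->new;    # no handlers; we only want the check

    # parse( ) dies on the first well-formedness error, so wrap it in eval
    return eval { $parser->parse( $xml ); 1 } ? 1 : 0;
}

for my $doc ( '<a><b></b></a>', '<a><b></a></b>' ) {
    print "$doc: ", ( is_well_formed( $doc ) ? "ok" : "not well-formed" ), "\n";
}
```

When the eval fails, the parser's error message is left in $@, which a real program would probably want to report.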
