XML::Parser (Perl and XML)

The C underpinnings are the secret to XML::Parser's success. We've seen how to write a basic parser in Perl. If you apply our previous example to a large XML document, you'll wait a long time before it finishes. Others have written complete XML parsers in Perl that are portable to any system, but you'll find much better performance in a compiled C parser like Expat. Fortunately, as with every other Perl module based on C code (and there are actually lots of these modules because they're not too hard to make, thanks to Perl's standard XS library),[16] it's easy to forget you're driving Expat around when you use XML::Parser.

Example 3-2. Well-formedness checker using XML::Parser

use XML::Parser;

my $xmlfile = shift @ARGV;              # the file to parse

# initialize parser object and parse the file
my $parser = XML::Parser->new( ErrorContext => 2 );
eval { $parser->parsefile( $xmlfile ); };

# report any error that stopped parsing, or announce success
if( $@ ) {
    $@ =~ s/at \/.*?$//s;               # remove module line number
    print STDERR "\nERROR in '$xmlfile':\n$@\n";
} else {
    print STDERR "'$xmlfile' is well-formed\n";
}

3.2.2. Parsing Styles

XML::Parser supports several different styles of parsing to suit various development strategies. The style doesn't change how the parser reads XML. Rather, it changes how the parser presents the results of parsing. If you need a persistent structure containing the document, you can have it. Or, if you'd prefer to have the parser call a set of routines you write, you can do it that way. You set the style when you initialize the object by giving a value for the Style option. Here's a quick summary of the available styles:

Debug: This style prints the document to STDOUT, formatted as an outline (deeper elements are indented more). parse( ) doesn't return anything special to your program.
Tree: This style creates a hierarchical, tree-shaped data structure that your program can use for processing. All elements and their data are crystallized in this form, which consists of nested hashes and arrays.
Objects: Like Tree, this style returns a reference to a hierarchical data structure representing the document. However, instead of using simple data aggregates like hashes and lists, it builds the structure from objects that are specialized to hold XML markup.
Subs: This style lets you set up callback functions to handle individual elements. Create a package of routines named after the elements they should handle and tell the parser about this package by using the Pkg option. Every time the parser finds a start tag for an element called <fooby>, it will look for the function fooby( ) in your package. When it finds the end tag for the element, it will try to call the function fooby_( ) in your package. The parser passes each function the same arguments a start or end handler would receive (the parser object, the element name, and, for start tags, the attributes), so you can do whatever processing you need to do with them.
Stream: Like Subs, this style lets you define callbacks for handling particular XML components, but the callbacks are more general than element names. You can write functions called handlers to be called for "events" like the start of an element (any element, not just a particular kind), a set of character data, or a processing instruction. You must register the handler package with either the Handlers option or the setHandlers( ) method.
Custom: You can subclass the XML::Parser class with your own object. Doing so is useful for creating a parser-like API for a more specific application. For example, the XML::Parser::PerlSAX module uses this strategy to implement the SAX event processing standard.
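To make the Subs style concrete, here is a minimal sketch, assuming XML::Parser is installed. The package name FontSubs and its @events list are hypothetical names chosen for illustration. XML::Parser's documentation describes the end-tag routine as the element name with an underscore appended, so the sketch names it font_. Elements with no matching routine (preferences and size here) are simply skipped.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical callback package: one routine per element we care about.
package FontSubs;

our @events;    # collects what the callbacks saw, for inspection afterward

# Called for each <font> start tag with the parser object, the element
# name, and the attributes as a flat name/value list.
sub font {
    my ( $expat, $tag, %attr ) = @_;
    push @events, "start font role=$attr{role}";
}

# Called for each </font> end tag (element name plus trailing underscore).
sub font_ {
    my ( $expat, $tag ) = @_;
    push @events, "end font";
}

package main;

# Pkg tells the Subs style where to look for the callback routines.
my $parser = XML::Parser->new( Style => 'Subs', Pkg => 'FontSubs' );
$parser->parse(
    '<preferences><font role="console"><size>9</size></font></preferences>'
);

print "$_\n" for @FontSubs::events;
```

Because the Subs style calls each routine inside an eval, an error raised in a callback is silently discarded, so it is worth testing callbacks on their own before relying on them.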

Example 3-3 is a program that uses XML::Parser with Style set to Tree. In this mode, the parser reads the whole XML document while building a data structure. When finished, it hands our program a reference to the structure that we can play with.

Example 3-3. An XML tree builder

use XML::Parser;

# initialize parser and read the file
my $parser = XML::Parser->new( Style => 'Tree' );
my $tree = $parser->parsefile( shift @ARGV );

# serialize the structure
use Data::Dumper;
print Dumper( $tree );

In tree mode, the parsefile( ) method returns a reference to a data structure containing the document, encoded as lists and hashes. We use Data::Dumper, a handy module that serializes data structures, to view the result. Example 3-4 is the datafile.

Example 3-4. An XML datafile

<preferences>
  <font role="console">
    <fname>Courier</fname>
    <size>9</size>
  </font>
  <font role="default">
    <fname>Times New Roman</fname>
    <size>14</size>
  </font>
  <font role="titles">
    <fname>Helvetica</fname>
    <size>10</size>
  </font>
</preferences>

With this datafile, the program produces the following output (condensed and indented to be easier to read):

$tree = [ 
          'preferences', [ 
            {}, 0, '\n', 
            'font', [ 
              { 'role' => 'console' }, 0, '\n',
              'fname', [ {}, 0, 'Courier' ], 0, '\n',
              'size', [ {}, 0, '9' ], 0, '\n'
            ], 0, '\n',
            'font', [ 
              { 'role' => 'default' }, 0, '\n',
              'fname', [ {}, 0, 'Times New Roman' ], 0, '\n',
              'size', [ {}, 0, '14' ], 0, '\n'
            ], 0, '\n', 
            'font', [ 
              { 'role' => 'titles' }, 0, '\n',
              'fname', [ {}, 0, 'Helvetica' ], 0, '\n',
              'size', [ {}, 0, '10' ], 0, '\n'
            ], 0, '\n'
          ]
        ];

It's a lot easier to write code that dissects the above structure than to write a parser of your own. We know, because the parser returned a data structure instead of dying mid-parse, that the document was 100 percent well-formed XML. In Chapter 4, "Event Streams", we will use the Stream mode of XML::Parser, and in Chapter 6, "Tree Processing", we'll talk more about trees and objects.
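As a rough illustration of dissecting such a structure, the sketch below turns a shortened copy of the datafile into a hash keyed by font role. It relies on the layout visible in the output: each element is a tag name followed by a content list, the content list starts with the attribute hash, and a tag of 0 marks character data. The helpers child_elements and text_of are hypothetical names written for this example and are not part of XML::Parser.

```perl
use strict;
use warnings;
use XML::Parser;

# A shortened copy of the datafile, inlined so the sketch is self-contained.
my $xml = <<'XML';
<preferences>
  <font role="console">
    <fname>Courier</fname>
    <size>9</size>
  </font>
  <font role="default">
    <fname>Times New Roman</fname>
    <size>14</size>
  </font>
</preferences>
XML

my $tree = XML::Parser->new( Style => 'Tree' )->parse( $xml );

# A content list looks like [ \%attributes, tag, content, tag, content, ... ].
# Return the child elements as [ tag, content ] pairs, skipping text nodes
# (text content is a plain string, element content is an array reference).
sub child_elements {
    my ($content) = @_;
    my @children;
    for ( my $i = 1; $i < @$content; $i += 2 ) {
        my ( $tag, $kid ) = @{$content}[ $i, $i + 1 ];
        push @children, [ $tag, $kid ] if ref $kid;
    }
    return @children;
}

# Concatenate the character data found directly inside an element.
sub text_of {
    my ($content) = @_;
    my $text = '';
    for ( my $i = 1; $i < @$content; $i += 2 ) {
        $text .= $content->[ $i + 1 ] if !ref $content->[ $i + 1 ];
    }
    return $text;
}

my ( $root_tag, $root_content ) = @$tree;

my %fonts;    # role => { fname => ..., size => ... }
for my $font ( grep { $_->[0] eq 'font' } child_elements($root_content) ) {
    my $role = $font->[1][0]{role};    # attribute hash is element 0
    $fonts{$role}{ $_->[0] } = text_of( $_->[1] )
        for child_elements( $font->[1] );
}

for my $role ( sort keys %fonts ) {
    print "$role: $fonts{$role}{fname}, $fonts{$role}{size}pt\n";
}
```

Skipping the text nodes between elements is a deliberate choice here: in this datafile they hold only the whitespace used for indentation, which a document with mixed content could not afford to discard.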
