XML::Parser (Perl and XML)

The C underpinnings are the secret to XML::Parser's success. We've seen how to write a basic parser in Perl. If you apply our previous example to a large XML document, you'll wait a long time before it finishes. Others have written complete XML parsers in Perl that are portable to any system, but you'll find much better performance in a compiled C parser like Expat. Fortunately, as with every other Perl module based on C code (and there are actually lots of these modules because they're not too hard to make, thanks to Perl's standard XS library),[16] it's easy to forget you're driving Expat around when you use XML::Parser.

3.2.1. Example: Well-Formedness Checker Revisited

To show how XML::Parser might be used, let's return to the well-formedness checker problem. It's very easy to create this tool with XML::Parser, as shown in Example 3-2.

Example 3-2. Well-formedness checker using XML::Parser

use XML::Parser;

my $xmlfile = shift @ARGV;              # the file to parse

# initialize parser object and parse the string
my $parser = XML::Parser->new( ErrorContext => 2 );
eval { $parser->parsefile( $xmlfile ); };

# report any error that stopped parsing, or announce success
if( $@ ) {
    $@ =~ s/at \/.*?$//s;               # remove module line number
    print STDERR "\nERROR in '$file':\n$@\n";
} else {
    print STDERR "'$file' is well-formed\n";
}

Here's how this program works. First, we create a new XML::Parser object to do the parsing. Using an object rather than a static function call means that we can configure the parser once and then process multiple files without the overhead of repeatedly recreating the parser. The object retains your settings and keeps the Expat parser routine alive for as long as you want to parse files, and then cleans everything up when you're done.

Next, we call the parsefile( ) method inside an eval block because XML::Parser tends to be a little overzealous when dealing with parse errors. If we didn't use an eval block, our program would die before we had a chance to do any cleanup. We check the variable $@ for content in case there was an error. If there was, we remove the line number of the module at which the parse method "died" and then print out the message.

When initializing the parser object, we set an option ErrorContext => 2. XML::Parser has several options you can set to control parsing. This one is a directive sent straight to the Expat parser that remembers the context in which errors occur and saves two lines before the error. When we print out the error message, it tells us what line the error happened on and prints out the region of text with an arrow pointing to the offending mistake.

Here's an example of our checker choking on a syntactic faux pas (where we decided to name our program xwf as an XML well-formedness checker):

$ xwf ch01.xml 

ERROR in 'ch01.xml':

not well-formed (invalid token) at line 66, column 22, byte 2354:

<chapter id="dorothy-in-oz">
<title>Lions, Tigers & Bears</title>
=====================^

Notice how simple it is to set up the parser and get powerful results. What you don't see until you run the program yourself is that it's fast. When you type the command, you get a result in a split second.

You can configure the parser to work in different ways. You don't have to parse a file, for example. Use the method parse( ) to parse a text string instead. Or, you could give it the option NoExpand => 1 to override default entity expansion with your own entity resolver routine. You could use this option to prevent the parser from opening external entities, limiting the scope of its checking.

Although the well-formedness checker is a very useful tool that you certainly want in your XML toolbox if you work with XML files often, it only scratches the surface of what we can do with XML::Parser. We'll see in the next section that a parser's most important role is in shoveling packaged data into your program. How it does this depends on the particular style you select.

3.2.2. Parsing Styles

XML::Parser supports several different styles of parsing to suit various development strategies. The style doesn't change how the parser reads XML. Rather, it changes how it presents the results of parsing. If you need a persistent structure containing the document, you can have it. Or, if you'd prefer to have the parser call a set of routines you write, you can do it that way. You can set the style when you initialize the object by setting the value of style. Here's a quick summary of the available styles:

Debug: This style prints the document to STDOUT, formatted as an outline (deeper elements are indented more). parse( ) doesn't return anything special to your program.
Tree: This style creates a hierarchical, tree-shaped data structure that your program can use for processing. All elements and their data are crystallized in this form, which consists of nested hashes and arrays.
Object: Like tree, this method returns a reference to a hierarchical data structure representing the document. However, instead of using simple data aggregates like hashes and lists, it consists of objects that are specialized to contain XML markup objects.
Subs: This style lets you set up callback functions to handle individual elements. Create a package of routines named after the elements they should handle and tell the parser about this package by using the pkg option. Every time the parser finds a start tag for an element called <fooby>, it will look for the function fooby( ) in your package. When it finds the end tag for the element, it will try to call the function _fooby( ) in your package. The parser will pass critical information like references to content and attributes to the function, so you can do whatever processing you need to do with it.
Stream: Like Subs, you can define callbacks for handling particular XML components, but callbacks are more general than element names. You can write functions called handlers to be called for "events" like the start of an element (any element, not just a particular kind), a set of character data, or a processing instruction. You must register the handler package with either the Handlers option or the setHandlers( ) method.
custom: You can subclass the XML::Parser class with your own object. Doing so is useful for creating a parser-like API for a more specific application. For example, the XML::Parser::PerlSAX module uses this strategy to implement the SAX event processing standard.

Example 3-3 is a program that uses XML::Parser with Style set to Tree. In this mode, the parser reads the whole XML document while building a data structure. When finished, it hands our program a reference to the structure that we can play with.

Example 3-3. An XML tree builder

use XML::Parser;

# initialize parser and read the file
$parser = new XML::Parser( Style => 'Tree' );
my $tree = $parser->parsefile( shift @ARGV );

# serialize the structure
use Data::Dumper;
print Dumper( $tree );

In tree mode, the parsefile( ) method returns a reference to a data structure containing the document, encoded as lists and hashes. We use Data::Dumper, a handy module that serializes data structures, to view the result. Example 3-4 is the datafile.

Example 3-4. An XML datafile

<preferences>
  <font role="console">
    <fname>Courier</name>
    <size>9</size>
  </font>
  <font role="default">
    <fname>Times New Roman</name>
    <size>14</size>
  </font>
  <font role="titles">
    <fname>Helvetica</name>
    <size>10</size>
  </font>
</preferences>

With this datafile, the program produces the following output (condensed and indented to be easier to read):

$tree = [ 
          'preferences', [ 
            {}, 0, '\n', 
            'font', [ 
              { 'role' => 'console' }, 0, '\n',
              'size', [ {}, 0, '9' ], 0, '\n', 
              'fname', [ {}, 0, 'Courier' ], 0, '\n'
            ], 0, '\n',
            'font', [ 
              { 'role' => 'default' }, 0, '\n',
              'fname', [ {}, 0, 'Times New Roman' ], 0, '\n',
              'size', [ {}, 0, '14' ], 0, '\n'
            ], 0, '\n', 
            'font', [ 
               { 'role' => 'titles' }, 0, '\n',
               'size', [ {}, 0, '10' ], 0, '\n',
               'fname', [ {}, 0, 'Helvetica' ], 0, '\n',
            ], 0, '\n',
          ]
        ];

It's a lot easier to write code that dissects the above structure than to write a parser of your own. We know, because the parser returned a data structure instead of dying mid-parse, that the document was 100 percent well-formed XML. In Chapter 4, "Event Streams", we will use the Stream mode of XML::Parser, and in Chapter 6, "Tree Processing", we'll talk more about trees and objects.

3.2. XML::Parser

3.2.1. Example: Well-Formedness Checker Revisited

Example 3-2. Well-formedness checker using XML::Parser

3.2.2. Parsing Styles

Example 3-3. An XML tree builder

Example 3-4. An XML datafile