XML::LibXML (Perl and XML)

3.5. XML::LibXML

XML::LibXML, like XML::Parser, is an interface to a library written in C. Called libxml2, it's part of the GNOME project.[17] Unlike XML::Parser, this new parser supports a major standard for XML tree processing known as the Document Object Model ( DOM).

[17]For downloads and documentation, see http://www.libxml.org/.

DOM is another much-ballyhooed XML standard. It does for tree processing what SAX does for event streams. If you have your heart set on climbing trees in your program and you think there's a likelihood that it might be reused or applied to different data sources, you're better off using something standard and interchangeable. Again, we're happy to delve into DOM in a future chapter and get you thinking in standards-complaint ways. That topic is coming up in Chapter 7, "DOM".

Now we want to show you an example of another parser in action. We'd be remiss if we focused on just one kind of parser when so many are out there. Again, we'll show you a basic example, nothing fancy, just to show you how to invoke the parser and tame its power. Let's write another document analysis tool like we did in Example 3-5, this time printing a frequency distribution of elements in a document.

Example 3-6 shows the program. It's a vanilla parser run because we haven't set any options yet. Essentially, the parser parses the filehandle and returns a DOM object, which is nothing more than a tree structure of well-designed objects. Our program finds the document element, and then traverses the entire tree one element at a time, all the while updating the hash of frequency counters.

Example 3-6. A frequency distribution program

use XML::LibXML;
use IO::Handle;

# initialize the parser
my $parser = new XML::LibXML;

# open a filehandle and parse
my $fh = new IO::Handle;
if( $fh->fdopen( fileno( STDIN ), "r" )) {
    my $doc = $parser->parse_fh( $fh );
    my %dist;
    &proc_node( $doc->getDocumentElement, \%dist );
    foreach my $item ( sort keys %dist ) {
        print "$item: ", $dist{ $item }, "\n";
    }
    $fh->close;
}

# process an XML tree node: if it's an element, update the
# distribution list and process all its children
#
sub proc_node {
    my( $node, $dist ) = @_;
    return unless( $node->nodeType eq &XML_ELEMENT_NODE );
    $dist->{ $node->nodeName } ++;
    foreach my $child ( $node->getChildnodes ) {
        &proc_node( $child, $dist );
    }
}

Note that instead of using a simple path to a file, we use a filehandle object of the IO::Handle class. Perl filehandles, as you probably know, are magic and subtle beasties, capable of passing into your code characters from a wide variety of sources, including files on disk, open network sockets, keyboard input, databases, and just about everything else capable of outputting data. Once you define a filehandle's source, it gives you the same interface for reading from it as does every other filehandle. This dovetails nicely with our XML-based ideology, where we want code to be as flexible and reusable as possible. After all, XML doesn't care where it comes from, so why should we pigeonhole it with one source type?

The parser object returns a document object after parsing. This object has a method that returns a reference to the document element -- the element at the very root of the whole tree. We take this reference and feed it to a recursive subroutine, proc_node( ), which happily munches on elements and scribbles into a hash variable every time it sees an element. Recursion is an efficient way to write programs that process XML because the structure of documents is somewhat fractal: the same rules for elements apply at any depth or position in the document, including the root element that represents the entire document (modulo its prologue). Note the "node type" check, which distinguishes between elements and other parts of a document (such as pieces of text or processing instructions).

For every element the routine looks at, it has to call the object's getChildnodes( ) method to continue processing on its children. This call is an essential difference between stream-based and tree-based methodologies. Instead of having an event stream take the steering wheel of our program and push data at it, thus calling subroutines and codeblocks in a (somewhat) unpredictable order, our program now has the responsibility of navigating through the document under its own power. Traditionally, we start at the root element and go downward, processing children in order from first to last. However, because we, not the parser, are in control now, we can scan through the document in any way we want. We could go backwards, we could scan just a part of the document, we could jump around, making multiple passes though the tree -- the sky's the limit. Here's the result from processing a small chapter coded in DocBook XML:

$ xfreq < ch03.xml
chapter: 1
citetitle: 2
firstterm: 16
footnote: 6
foreignphrase: 2
function: 10
itemizedlist: 2
listitem: 21
literal: 29
note: 1
orderedlist: 1
para: 77
programlisting: 9
replaceable: 1
screen: 1
section: 6
sgmltag: 8
simplesect: 1
systemitem: 2
term: 6
title: 7
variablelist: 1
varlistentry: 6
xref: 2

The result shows only a few lines of code, but it sure does a lot of work. Again, thanks to the C library underneath , it's quite speedy.