XPath (Perl and XML)

8.2. XPath

Imagine that you have an army of monkeys at your disposal. You say to them, "I want you to get me a banana frappe from the ice cream parlor on Massachusetts Avenue just north of Porter Square." Not being very smart monkeys, they go out and bring back every beverage they can find, leaving you to taste them all to figure out which is the one you wanted. To retrain them, you send them out to night school to learn a rudimentary language, and in a few months you repeat the request. Now the monkeys follow your directions, identify the exact item you want, and return with it.

We've just described the kind of problem XPath was designed to solve. XPath is one of the most useful technologies supporting XML. It provides an interface to find nodes in a purely descriptive way, so you don't have to write code to hunt them down yourself. You merely specify the kind of nodes that interest you and an XPath processor will retrieve them for you. Suddenly, XML goes from being a vast, confusing pile of nodes to a well-indexed filing cabinet of data.

Consider the XML document in Example 8-4.

Example 8-4. A preferences file

<plist>
  <dict>
    <key>DefaultDirectory</key>
    <string>/usr/local/fooby</string>
    <key>RecentDocuments</key>
    <array>
      <string>/Users/bobo/docs/menu.pdf</string>
      <string>/Users/slappy/pagoda.pdf</string>
      <string>/Library/docs/Baby.pdf</string>
    </array>
    <key>BGColor</key>
    <string>sage</string>
  </dict>
</plist>

This document is a typical preferences file for a program with a series of data keys and values. Nothing in it is too complex. To obtain the value of the key BGColor, you'd have to locate the <key> element containing the word "BGColor" and step ahead to the next element, a <string>. Finally, you would read the value of the text node inside. In DOM, you might do it as shown in Example 8-5.

Example 8-5. Program to get a preferred color

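# return the text of the <string> that follows <key>BGColor</key>
# (assumes $doc holds a document already parsed with XML::DOM)
sub get_bgcolor {
    my @keys = $doc->getElementsByTagName( 'key' );
    foreach my $key ( @keys ) {
        if( $key->getFirstChild->getData eq 'BGColor' ) {
            # the next sibling may be a whitespace text node, so
            # keep stepping until we reach an element
            my $next = $key->getNextSibling;
            $next = $next->getNextSibling
                while $next and $next->getNodeType != XML::DOM::ELEMENT_NODE();
            # the value lives in the text node inside that element
            return $next ? $next->getFirstChild->getData : undef;
        }
    }
    return;
}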

Writing one routine like this isn't too bad, but imagine if you had to do hundreds of queries like it. And this program was for a relatively simple document -- imagine how complex the code could be for one that was many levels deep. It would be nice to have a shorthand way of doing the same thing, say, on one line of code. Such a syntax would be much easier to read, write, and debug. This is where XPath comes in.

XPath is a language for expressing a path to a node or set of nodes anywhere in a document. It's simple, expressive, and standard (backed by the W3C, the folks who brought you XML).[28] You'll see it used in XSLT for matching rules to nodes, and in XPointer, a technology for linking XML documents to resources. You can also find it in many Perl modules, as we'll show you soon.

[28]The recommendation is on the Web at http://www.w3.org/TR/xpath.html.

The kind of XPath expression you will use most often is called a location path. It consists of some number of path steps, each of which extends the path a little bit closer to the goal. Starting from an absolute, known position (for example, the root of the document), the steps "walk" across the document tree to arrive at a node or set of nodes. The syntax looks much like a filesystem path, with steps separated by slash characters (/).

This location path shows how to find that color value in our last example:

/plist/dict/key[text()='BGColor']/following-sibling::*[1]/text( )

A location path is processed by starting at an absolute location in the document and moving to a new node (or nodes) with each step. At any point in the search, a current node serves as the context for the next step. If multiple nodes match the next step, the search branches and the processor maintains a set of current nodes. Here's how the location path shown above would be processed:

1. Start at the root node (one level above the root element).
2. Move to a <plist> element that is a child of the current node.
3. Move to a <dict> element that is a child of the current node.
4. Move to a <key> element that is a child of the current node and whose text is BGColor.
5. Move to the first element that follows the current node as a sibling.
6. Return any text nodes that are children of the current node.
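The walk described above can be tried out directly. What follows is a minimal sketch, assuming the XML::XPath module (introduced later in this section) is installed and that the preferences file from Example 8-4 is supplied inline as a string. The variable names are ours.

```perl
use strict;
use warnings;
use XML::XPath;

# The preferences file from Example 8-4, inlined so the sketch stands alone.
my $prefs = <<'XML';
<plist>
  <dict>
    <key>DefaultDirectory</key>
    <string>/usr/local/fooby</string>
    <key>RecentDocuments</key>
    <array>
      <string>/Users/bobo/docs/menu.pdf</string>
      <string>/Users/slappy/pagoda.pdf</string>
      <string>/Library/docs/Baby.pdf</string>
    </array>
    <key>BGColor</key>
    <string>sage</string>
  </dict>
</plist>
XML

my $xp = XML::XPath->new( xml => $prefs );

# The location path from the text, one step per slash.
# The Perl string is single-quoted, so the XPath literal uses double quotes.
my $path = '/plist/dict/key[text()="BGColor"]/following-sibling::*[1]/text()';

# findvalue returns the string value of whatever the path selects
my $color = $xp->findvalue( $path );
print "BGColor is $color\n";
```

Compared with the DOM routine in Example 8-5, all of the navigation lives in the path string, so changing the query means changing one line.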

Because node searches can branch if multiple nodes match, we sometimes have to add a test condition to a step to restrict the eligible candidates. That was necessary for the <key> step, where every <key> in the <dict> would have matched, so we added a test requiring the text of the element to be BGColor. Without the test, we would have received the text nodes of every element that immediately follows a <key> element.

This location path matches all <key> elements in the document:

/plist/dict/key

Test conditions come in many kinds, but every one of them results in a boolean true/false answer. You can test the position (where a node is in the list), the existence of children and attributes, numeric comparisons, and all kinds of boolean expressions built with the and and or operators (written in lowercase in XPath). Sometimes a test consists of only a number, which is shorthand for specifying an index into a node list, so the test [1] says, "stop at the first node that matches."

You can link multiple tests inside the brackets with boolean operations. Alternatively, you can chain tests with multiple sets of brackets, functioning as an AND operator. Every path step has an implicit test that prunes the search tree of blind alleys. If at any point a step turns up zero matching nodes, the search along that branch terminates.
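To make the different kinds of tests concrete, here is a short sketch using XML::XPath. The sample data is a cut-down inventory of our own; the type attribute on <item> and the names_for helper are assumptions made for illustration, not part of the module or of the examples in this chapter.

```perl
use strict;
use warnings;
use XML::XPath;

# A cut-down inventory; the type attribute on <item> is our own invention.
my $xml = <<'XML';
<inventory>
  <item id="284" type="tree"><name>Pignut Hickory</name></item>
  <item id="222" type="tree"><name>Poison Sumac</name><note>danger</note></item>
  <item id="210" type="shrub"><name>Gray Dogwood</name></item>
</inventory>
XML

my $xp = XML::XPath->new( xml => $xml );

# Hypothetical helper: text of every <name> a path selects, comma-separated.
sub names_for {
    my $path = shift;
    return join ', ', map { $_->string_value } $xp->findnodes( $path );
}

# a bare number is a position test: the second <item> child
my $second = names_for( '/inventory/item[2]/name' );

# two conditions joined with "and" inside one pair of brackets
my $both = names_for( '/inventory/item[@type="tree" and note]/name' );

# chained brackets: each test filters what the previous one let through
my $chained = names_for( '/inventory/item[@type="tree"][not(note)]/name' );

# "or" accepts a node if either condition holds
my $either = names_for( '/inventory/item[@type="shrub" or note]/name' );

print "$second\n$both\n$chained\n$either\n";
```

The test [note] checks only that a <note> child exists, which is how an existence test is written.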

Along with boolean tests, you can shape a location path with directives called axes. An axis is like a compass needle that tells the processor which direction to travel. Instead of the default, which is to descend from the current node to its children, you can make it go up to the parent and ancestors or laterally among its siblings. The axis is written as a prefix to the step with a double colon (::). In our last example, we used the axis following-sibling to jump from the current node to its next-door neighbor.
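The sketch below shows a few axes at work, again assuming XML::XPath is installed. The three-item document is a simplified stand-in for a real datafile, and the variable names are ours.

```perl
use strict;
use warnings;
use XML::XPath;

my $xml = <<'XML';
<category type="tree">
  <item id="284"><name>Pignut Hickory</name></item>
  <item id="222"><name>Poison Sumac</name></item>
  <item id="210"><name>Gray Dogwood</name></item>
</category>
XML

my $xp = XML::XPath->new( xml => $xml );

# down: the default child axis, written out in full
my $first = $xp->findvalue( '/child::category/child::item[1]/child::name' );

# up: from a <name> to the <category> that contains it
my $type = $xp->findvalue( '//name[text()="Poison Sumac"]/ancestor::category/@type' );

# sideways: the neighbors of item 222 on either side
my $next = $xp->findvalue( '//item[@id="222"]/following-sibling::item[1]/name' );
my $prev = $xp->findvalue( '//item[@id="222"]/preceding-sibling::item[1]/name' );

print "first: $first\ntype: $type\nnext: $next\nprev: $prev\n";
```

Each axis name sits in front of the step it governs, so a single path can change direction as often as it needs to.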

A step is not limited to frolicking with elements. You can specify different kinds of nodes, including attributes, text, processing instructions, and comments, or leave it generic with a selector for any node type. You can specify the node type in many ways, some of which are listed here:

Symbol	Matches
`node( )`	Any node
`text( )`	A text node
`child::foo`	A child element named `foo`
`foo`	An element named `foo`
`attribute::foo`	An attribute named `foo`
`@foo`	An attribute named `foo`
`@*`	Any attribute
`*`	Any element
`.`	The current (context) node
`..`	The parent of the current node
`/`	The root node
`/*`	The root element
`//foo`	An element `foo` at any level
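A few of these selectors can be compared side by side. This is a small sketch, assuming XML::XPath is installed; the one-line document, including its status attribute and comment, is made up for the purpose.

```perl
use strict;
use warnings;
use XML::XPath;

# A made-up document with two attributes, one child element, and one comment.
my $xml = '<inventory date="2001.9.4"><item id="284" status="ok">'
        . '<name>Carya glabra</name><!-- verify --></item></inventory>';

my $xp = XML::XPath->new( xml => $xml );

my @any_attr = $xp->findnodes( '//item/@*' );      # every attribute of <item>
my @elements = $xp->findnodes( '//item/*' );       # child elements only
my @children = $xp->findnodes( '//item/node()' );  # any node type: element and comment
my ($root)   = $xp->findnodes( '/*' );             # the root element
my $id       = $xp->findvalue( '//item/attribute::id' );  # longhand for @id
my $text     = $xp->findvalue( '//name/text()' );  # just the text node

printf "%d attributes, %d elements, %d child nodes\n",
    scalar @any_attr, scalar @elements, scalar @children;
print "root element: ", $root->getName, "\n";
print "id: $id, text: $text\n";
```

The difference between * and node( ) shows up in the counts: the comment is a node but not an element.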

Since the thing you're most likely to select in a location path step is an element, the default node type is an element. But there are reasons why you should use another node type. In our example location path, we used text( ) to return just the text node for the <string> element.

Most steps are relative locators because they define where to go relative to the previous locator. Although location paths consist mostly of relative locators, they always start with an absolute locator, which describes a definite point in the document. This locator comes in two flavors: id( ), which starts at an element with a given ID attribute, and the root locator, which starts at the root node of the document (an abstract node that is the parent of the document element). XPath 1.0 writes the root locator as a leading slash rather than as a function, so a path that begins with "/" starts at the root node.

Now that we've trained our monkeys to understand XPath, let's give it a whirl with Perl. The XML::XPath module, written by Matt Sergeant of XML::LibXML fame, is a solid implementation of XPath. We've written a program in Example 8-6 that takes two command-line arguments: a file and an XPath location path. It prints every node that matches the path, serialized as XML.

Example 8-6. A program that uses XPath

use XML::XPath;
use XML::XPath::XMLParser;

# create an object to parse the file and field XPath queries
my $xpath = XML::XPath->new( filename => shift @ARGV );

# apply the path from the command line and get back a list of matches
my $nodeset = $xpath->find( shift @ARGV );

# print each node in the list
foreach my $node ( $nodeset->get_nodelist ) {
  print XML::XPath::XMLParser::as_string( $node ) . "\n";
}

That example was simple. Now we need a datafile. Check out Example 8-7.

Example 8-7. An XML datafile

<?xml version="1.0"?>
<!DOCTYPE inventory [
  <!ENTITY poison "<note>danger: poisonous!</note>">
  <!ENTITY endang "<note>endangered species</note>">
]>
<!-- Rivenwood Arboretum inventory -->
<inventory date="2001.9.4">
  <category type="tree">
    <item id="284">
      <name style="latin">Carya glabra</name>
      <name style="common">Pignut Hickory</name>
      <location>east quadrangle</location>
      &endang;
    </item>
    <item id="222">
      <name style="latin">Toxicodendron vernix</name>
      <name style="common">Poison Sumac</name>
      <location>west promenade</location>
      &poison;
    </item>
  </category>
  <category type="shrub">
    <item id="210">
      <name style="latin">Cornus racemosa</name>
      <name style="common">Gray Dogwood</name>
      <location>south lawn</location>
    </item>
    <item id="104">
      <name style="latin">Alnus rugosa</name>
      <name style="common">Speckled Alder</name>
      <location>east quadrangle</location>
      &endang;
    </item>
  </category>
</inventory>

The first test uses the path /inventory/category/item/name:

> grabber.pl data.xml "/inventory/category/item/name"
<name style="latin">Carya glabra</name>
<name style="common">Pignut Hickory</name>
<name style="latin">Toxicodendron vernix</name>
<name style="common">Poison Sumac</name>
<name style="latin">Cornus racemosa</name>
<name style="common">Gray Dogwood</name>
<name style="latin">Alnus rugosa</name>
<name style="common">Speckled Alder</name>

Every <name> element was found and printed. Let's get more specific with the path /inventory/category/item/name[@style='latin']:

> grabber.pl data.xml "/inventory/category/item/name[@style='latin']"
<name style="latin">Carya glabra</name>
<name style="latin">Toxicodendron vernix</name>
<name style="latin">Cornus racemosa</name>
<name style="latin">Alnus rugosa</name>

Now let's use an ID attribute as a starting point with the path //item[@id='222']/note. (If we had defined the attribute id in a DTD, we'd be able to use the path id('222')/note. We didn't, but this alternate method works just as well.)

> grabber.pl data.xml "//item[@id='222']/note"
<note>danger: poisonous!</note>

How about ditching the element tags? To do so, use this:

> grabber.pl data.xml "//item[@id='222']/note/text( )"
danger: poisonous!

When was this inventory last updated?

> grabber.pl data.xml "/inventory/@date"
 date="2001.9.4"

With XPath, you can go hog wild! Here's the path a silly monkey might take through the tree:

> grabber.pl data.xml "//*[@id='104']/parent::*/preceding-sibling::*/child::*[2]/name[not(@style='latin')]/node( )"
Poison Sumac

The monkey started on the element with the attribute id='104', climbed up a level, jumped to the previous element, climbed down to the second child element, found a <name> whose style attribute was not set to 'latin', and hopped on the child of that element, which happened to be the text node with the value Poison Sumac.

We have just seen how to use XPath expressions to locate and return a set of nodes. The implementation we are about to see is even more powerful. XML::Twig, an ingenious module by Michel Rodriguez, is quite Perlish in the way it uses XPath expressions. It uses a hash to map them to subroutines, so you can have functions called automatically for certain types of nodes.

The program in Example 8-8 shows how this works. When you initialize the XML::Twig object, you can set a bunch of handlers in a hash, where the keys are XPath expressions. During the parsing stage, as the tree is built, these handlers are called for appropriate nodes.

As you look at Example 8-8, you'll notice that at-sign (@) characters are escaped. This is because @ can cause a little confusion with XPath expressions living in a Perl context. In XPath, @foo refers to an attribute named foo, not an array named foo. Keep this distinction in mind when going over the XPath examples in this book and when writing your own XPath for Perl to use -- in double-quoted strings you must escape the @ characters so Perl doesn't try to interpolate arrays in the middle of your expressions. Single-quoted strings are not interpolated, so they need no escaping.

If your code does so much work with Perl arrays and XPath attribute references that it's unclear which @ characters are which, consider referring to attributes in longhand, using the "attribute" XPath axis: attribute::foo. This raises the issue of the double colon and its different meanings in Perl and XPath. Since XPath has only a few hardcoded axes, however, and they're always expressed in lowercase, they're easier to tell apart at a glance.
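The quoting options can be seen side by side in plain Perl, with no XML module involved. This sketch only builds the strings; the variable names are ours.

```perl
use strict;
use warnings;

# Three ways to write the same XPath test in Perl source.
my $escaped  = "name[\@style='latin']";           # double quotes: @ must be escaped
my $single   = q{name[@style='latin']};           # q{} does not interpolate
my $longhand = "name[attribute::style='latin']";  # no @ to worry about

print "$escaped\n$single\n$longhand\n";

# The first two produce exactly the same text for the XPath processor.
print "identical\n" if $escaped eq $single;
```

Leaving the backslash out of the double-quoted version would make Perl look for an array named @style, which fails to compile under use strict.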

Example 8-8. How twig handlers work

use XML::Twig;

# buffers for holding text
my $catbuf = '';
my $itembuf = '';

# initialize parser with handlers for node processing
my $twig = XML::Twig->new( TwigHandlers => {
                             "/inventory/category"    => \&category,
                             "name[\@style='latin']"  => \&latin_name,
                             "name[\@style='common']" => \&common_name,
                             "category/item"          => \&item,
                                          });

# parse, handling nodes on the way
$twig->parsefile( shift @ARGV );

# handle a category element
sub category {
  my( $tree, $elem ) = @_;
  print "CATEGORY: ", $elem->att( 'type' ), "\n\n", $catbuf;
  $catbuf = '';
}

# handle an item element
sub item {
  my( $tree, $elem ) = @_;
  $catbuf .= "Item: " . $elem->att( 'id' ) . "\n" . $itembuf . "\n";
  $itembuf = '';
}

# handle a latin name
sub latin_name {
  my( $tree, $elem ) = @_;
  $itembuf .= "Latin name: " . $elem->text . "\n";
}

# handle a common name
sub common_name {
  my( $tree, $elem ) = @_;
  $itembuf .= "Common name: " . $elem->text . "\n";
}

Our program takes a datafile like the one shown in Example 8-7 and outputs a summary report. Note that since a handler is called only after an element is completely built, the overall order of handler calls may not be what you expect. The handlers for children are called before their parent. For that reason, we need to buffer their output and sort it out at the appropriate time.
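The calling order is easy to observe. The following sketch, which assumes XML::Twig is installed, records each handler call in an array instead of building a report; the @order array and the shortened document are ours.

```perl
use strict;
use warnings;
use XML::Twig;

my @order;   # records the order in which handlers fire

my $twig = XML::Twig->new(
    twig_handlers => {
        # each handler receives the twig and the element just completed
        category => sub { push @order, 'category ' . $_[1]->att( 'type' ) },
        item     => sub { push @order, 'item ' . $_[1]->att( 'id' ) },
        name     => sub { push @order, 'name ' . $_[1]->text },
    },
);

# a shortened version of the inventory datafile
$twig->parse( <<'XML' );
<inventory>
  <category type="tree">
    <item id="284"><name>Pignut Hickory</name></item>
    <item id="222"><name>Poison Sumac</name></item>
  </category>
</inventory>
XML

print "$_\n" for @order;
```

Each <name> is reported before its <item>, and the <category> comes last of all, which is why Example 8-8 collects text in buffers and prints it only when the enclosing element is complete.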

The result comes out like this:

CATEGORY: tree

Item: 284
Latin name: Carya glabra
Common name: Pignut Hickory

Item: 222
Latin name: Toxicodendron vernix
Common name: Poison Sumac

CATEGORY: shrub

Item: 210
Latin name: Cornus racemosa
Common name: Gray Dogwood

Item: 104
Latin name: Alnus rugosa
Common name: Speckled Alder

XPath makes the task of locating nodes in a document and describing types of nodes for processing ridiculously simple. It cuts down on the amount of code you have to write because climbing around the tree to sample different parts is all taken care of. It's easier to read than code too. We're happy with it, and because it is a standard, we'll be seeing more uses for it in many modules to come.