XML::XPath (Perl and XML)

3.6. XML::XPath

We've seen examples of parsers that dutifully deliver the entire document to you. Often, though, you don't need the whole thing. When you query a database, you're usually looking for only a single record. When you crack open a telephone book, you're not going to sit down and read the whole thing. There is obviously a need for some mechanism of extracting a specific piece of information from a vast document. Look no further than XPath.

XPath is a recommendation from the folks who brought you XML.[18] It's a grammar for writing expressions that pinpoint specific pieces of documents. Think of it as an addressing scheme. Although we'll save the nitty-gritty of XPath wrangling for Chapter 8, "Beyond Trees: XPath, XSLT, and More", we can tantalize you by revealing that it works much like a mix of regular expressions with Unix-style file paths. Not surprisingly, this makes it an attractive feature to add to parsers.

[18]Check out the specification at http://www.w3.org/TR/xpath/.

Matt Sergeant's XML::XPath module is a solid implementation, built on the foundation of XML::Parser. Given an XPath expression, it returns a list of all document parts that match the description. It's an incredibly simple way to perform some powerful search and retrieval work.

For instance, suppose we have an address book encoded in XML in this basic form:

<contacts>
  <entry>
    <name>Bob Snob</name>
    <street>123 Platypus Lane</street>
    <city>Burgopolis</city>
    <state>FL</state>
    <zip>12345</zip>
  </entry>
 <!--More entries go here-->
</contacts>

Suppose you want to extract all the zip codes from the file and compile them into a list. Example 3-7 shows how you could do it with XPath.

Example 3-7. Zip code extractor

use XML::XPath;

my $file = 'customers.xml';
my $xp = XML::XPath->new(filename=>$file);

# An XML::XPath nodeset is an object which contains the result of
# smacking an XML document with an XPath expression; we'll do just
# this, and then query the nodeset to see what we get.
my $nodeset = $xp->find('//zip');

my @zipcodes;                   # Where we'll put our results
if (my @nodelist = $nodeset->get_nodelist) {
  # We found some zip elements! Each node is an object of the class
  # XML::XPath::Node::Element, so I'll use that class's 'string_value'
  # method to extract its pertinent text, and throw the result for all
  # the nodes into our array.
  @zipcodes = map($_->string_value, @nodelist);

  # Now sort and prepare for output
  @zipcodes = sort(@zipcodes);
  local $" = "\n";
  print "I found these zipcodes:\n@zipcodes\n";
} else {
  print "The file $file didn't have any 'zip' elements in it!\n";
}

Run the program on a document with three entries and we'll get something like this:

I found these zipcodes:
03642
12333
82649

This module also shows an example of tree-based parsing, by the way, as its parser loads the whole document into an object tree of its own design and then allows the user to selectively interact with parts of it via XPath expressions. This example is just a sample of what you can do with advanced tree processing modules. You'll see more of these modules in Chapter 8, "Beyond Trees: XPath, XSLT, and More".

XML::LibXML's element objects support a findnodes( ) method that works much like XML::XPath's, using the invoking Element object as the current context and returning a list of objects that match the query. We'll play with this functionality later in Chapter 10, "Coding Strategies".