Example 8-5. Program to get a preferred color
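sub get_bgcolor {
    # $doc is the parsed document object, set up elsewhere in the program
    my @keys = $doc->getElementsByTagName( 'key' );
    foreach my $key ( @keys ) {
        # look for the <key> whose text content is 'BGColor'
        if( $key->getFirstChild->getData eq 'BGColor' ) {
            # the node right after that <key> holds the color value
            return $key->getNextSibling->getData;
        }
    }
    return;    # no BGColor key was found
}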
Writing one routine like this isn't too bad, but
imagine if you had to do hundreds of queries like it. And this
program was for a relatively simple document -- imagine how
complex the code could be for one that was many levels deep. It would
be nice to have a shorthand way of doing the same thing, say, on one
line of code. Such a syntax would be much easier to read, write, and
debug. This is where XPath comes in.
Because node searches can branch if multiple nodes match, we
sometimes have to add a test condition to a step to restrict the
eligible candidates. Adding a test condition was necessary for the
<key> sampling step where multiple nodes
would have matched, so we added a test condition requiring the value
of the element to be BGColor. Without the test, we
would have received all text nodes from all siblings immediately
following a <key> element.
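To make the effect of the test concrete, here is a minimal sketch of such a one-line query. It assumes the XML::LibXML module and a small preferences document invented for the occasion; the path is an illustration of the technique, not necessarily the exact one used elsewhere in this chapter.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical preferences file, invented for this sketch.
my $xml = <<'XML';
<plist>
  <dict>
    <key>FontSize</key>
    <string>12</string>
    <key>BGColor</key>
    <string>blue</string>
  </dict>
</plist>
XML

my $doc = XML::LibXML->load_xml( string => $xml );

# The test [text()='BGColor'] keeps only the <key> we want; the next
# step moves to the first element that follows it and takes its text.
my $path = q{/plist/dict/key[text()='BGColor']/following-sibling::*[1]/text()};
my $bgcolor = $doc->findvalue( $path );

# Without the test, the element following every <key> is a match.
my @all = map { $_->textContent }
    $doc->findnodes( '/plist/dict/key/following-sibling::*[1]' );

print "$bgcolor\n";    # blue
print "@all\n";        # 12 blue
```

The single path string does the work of the whole get_bgcolor routine, and dropping the test shows how the search branches to every <key> at once.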
This location path matches every <key> element that is a
child of the <dict> directly under the root <plist>
(in our document, that is all of them):
/plist/dict/key
All test conditions, whatever their kind, result in a boolean
true/false answer. You can test the position (where a node is in the
list), existence of children and attributes, numeric comparisons, and
all kinds of boolean expressions using the and and or operators
(lowercase, since XPath is case-sensitive).
Sometimes a test consists of only a number, which is shorthand for
specifying an index into a node list, counting from 1, so the test
[1] says, "keep only the first node
that matches."
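The kinds of test listed above can be tried out with a short script. This is only a sketch: it assumes XML::LibXML, the document is a small invented inventory, and ids( ) is a helper made up for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

# Small invented document for trying out test conditions.
my $doc = XML::LibXML->load_xml( string => <<'XML' );
<inventory>
  <category type="tree">
    <item id="284"><name>Pignut Hickory</name><note>endangered</note></item>
    <item id="222"><name>Poison Sumac</name><note>poisonous</note></item>
  </category>
  <category type="shrub">
    <item id="210"><name>Gray Dogwood</name></item>
    <item id="104"><name>Speckled Alder</name><note>endangered</note></item>
  </category>
</inventory>
XML

# Hypothetical helper: the id of every <item> a path selects.
sub ids {
    my ($path) = @_;
    return join ' ', map { $_->getAttribute('id') } $doc->findnodes($path);
}

my $first    = ids( '//item[1]' );                   # position test
my $noted    = ids( '//item[note]' );                # existence of a child
my $numeric  = ids( '//item[@id > 200]' );           # numeric comparison
my $combined = ids( '//item[note and @id < 250]' );  # boolean "and"

print "first:    $first\n";       # 284 210
print "noted:    $noted\n";       # 284 222 104
print "numeric:  $numeric\n";     # 284 222 210
print "combined: $combined\n";    # 222 104
```

Note that //item[1] returns two items, not one: the position is counted separately within each parent, so the first <item> of every <category> qualifies.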
A step is not limited to frolicking with elements. You can specify
different kinds of nodes, including attributes, text, processing
instructions, and comments, or leave it generic with a selector for
any node type. You can specify the node type in many ways, some of
which are listed here:
Since the thing you're most likely to select in a
location path step is an element, the default node type is an
element. Sometimes, though, you need a different node type.
In our example location path, we used text( ) to
return just the text node for the <value>
element.
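A short experiment shows how the choice of node type changes what a step returns. The sketch below assumes XML::LibXML and uses an invented fragment that contains a comment and a processing instruction as well as an element.

```perl
use strict;
use warnings;
use XML::LibXML;

# Invented fragment holding several kinds of nodes.
my $doc = XML::LibXML->load_xml( string => <<'XML' );
<item id="222">
  <!-- checked in spring -->
  <?label print="yes"?>
  <name style="common">Poison Sumac</name>
</item>
XML

my @elements = $doc->findnodes( '/item/*' );                         # elements only
my @comments = $doc->findnodes( '/item/comment()' );                 # comments
my @pis      = $doc->findnodes( '/item/processing-instruction()' );  # PIs
my @attrs    = $doc->findnodes( '/item/@*' );                        # attributes
my @text     = $doc->findnodes( '/item/name/text()' );               # text nodes
my @any      = $doc->findnodes( '/item/node()' );                    # any child node

print 'element:   ', $elements[0]->nodeName,    "\n";    # name
print 'comment:   ', $comments[0]->textContent, "\n";
print 'PI target: ', $pis[0]->nodeName,         "\n";    # label
print 'attribute: ', $attrs[0]->nodeName,       "\n";    # id
print 'text:      ', $text[0]->textContent,     "\n";    # Poison Sumac
print 'node( ) matched ', scalar @any, " children\n";
```

The node( ) step returns more children than the * step because it also picks up the comment, the processing instruction, and the whitespace-only text nodes between them.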
Example 8-7. An XML datafile
<?xml version="1.0"?>
<!DOCTYPE inventory [
<!ENTITY poison "<note>danger: poisonous!</note>">
<!ENTITY endang "<note>endangered species</note>">
]>
<!-- Rivenwood Arboretum inventory -->
<inventory date="2001.9.4">
<category type="tree">
<item id="284">
<name style="latin">Carya glabra</name>
<name style="common">Pignut Hickory</name>
<location>east quadrangle</location>
&endang;
</item>
<item id="222">
<name style="latin">Toxicodendron vernix</name>
<name style="common">Poison Sumac</name>
<location>west promenade</location>
&poison;
</item>
</category>
<category type="shrub">
<item id="210">
<name style="latin">Cornus racemosa</name>
<name style="common">Gray Dogwood</name>
<location>south lawn</location>
</item>
<item id="104">
<name style="latin">Alnus rugosa</name>
<name style="common">Speckled Alder</name>
<location>east quadrangle</location>
&endang;
</item>
</category>
</inventory>
The first test uses the path
/inventory/category/item/name:
> grabber.pl data.xml "/inventory/category/item/name"
<name style="latin">Carya glabra</name>
<name style="common">Pignut Hickory</name>
<name style="latin">Toxicodendron vernix</name>
<name style="common">Poison Sumac</name>
<name style="latin">Cornus racemosa</name>
<name style="common">Gray Dogwood</name>
<name style="latin">Alnus rugosa</name>
<name style="common">Speckled Alder</name>
Every <name> element was found and printed.
Let's get more specific with the path
/inventory/category/item/name[@style='latin']:
> grabber.pl data.xml "/inventory/category/item/name[@style='latin']"
<name style="latin">Carya glabra</name>
<name style="latin">Toxicodendron vernix</name>
<name style="latin">Cornus racemosa</name>
<name style="latin">Alnus rugosa</name>
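If you want to repeat these tests from inside a program instead of the command line, a stand-in for the script can be sketched in a few lines. This is not the grabber.pl listing itself: it assumes XML::LibXML, grab( ) is a hypothetical helper, and the data is a cut-down copy of the inventory.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical stand-in for the command-line script: evaluate $path
# against the XML in $xml and return each matching node as a string.
sub grab {
    my ( $xml, $path ) = @_;
    my $doc = XML::LibXML->load_xml( string => $xml );
    return map { $_->toString } $doc->findnodes( $path );
}

# Cut-down copy of the inventory, enough for the first two tests.
my $xml = <<'XML';
<inventory date="2001.9.4">
  <category type="tree">
    <item id="284">
      <name style="latin">Carya glabra</name>
      <name style="common">Pignut Hickory</name>
    </item>
  </category>
  <category type="shrub">
    <item id="210">
      <name style="latin">Cornus racemosa</name>
      <name style="common">Gray Dogwood</name>
    </item>
  </category>
</inventory>
XML

my @all   = grab( $xml, '/inventory/category/item/name' );
my @latin = grab( $xml, q{/inventory/category/item/name[@style='latin']} );

print scalar(@all), " names in all, of which these are Latin:\n";
print "$_\n" for @latin;
```

Adding the [@style='latin'] test halves the result, just as it does in the transcript above.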
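Now let's use an ID attribute as a starting point
with the path //item[@id='222']/note. (If we had
declared the attribute id as type ID in a DTD,
we could use the path id('222')/note instead, though a
validating parser would reject these values, since an ID must be a
valid XML name and cannot start with a digit. We didn't declare it,
and the attribute test works just as well.)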
> grabber.pl data.xml "//item[@id='222']/note"
<note>danger: poisonous!</note>
How about ditching the element tags? To do so, use this:
> grabber.pl data.xml "//item[@id='222']/note/text( )"
danger: poisonous!
When was this inventory last updated?
> grabber.pl data.xml "/inventory/@date"
date="2001.9.4"
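An attribute node carries both a name and a value, and a program usually wants only the value. The following sketch, which assumes XML::LibXML and a one-line version of the inventory, shows two ways to get at it.

```perl
use strict;
use warnings;
use XML::LibXML;

# One-line version of the inventory, for illustration only.
my $doc = XML::LibXML->load_xml(
    string => '<inventory date="2001.9.4"><category type="tree"/></inventory>' );

# findvalue() converts the result to a string, which for an attribute
# is just its value.
my $date = $doc->findvalue( '/inventory/@date' );

# findnodes() returns the attribute node itself, name and all.
my ($attr) = $doc->findnodes( '/inventory/@date' );
my $pair   = $attr->nodeName . '=' . $attr->getValue;

print "$date\n";    # 2001.9.4
print "$pair\n";    # date=2001.9.4
```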
With XPath, you can go hog wild! Here's the path a
silly monkey might take through the tree:
> grabber.pl data.xml "//*[@id='104']/parent::*/preceding-sibling::*/child::*[2]/name[not(@style='latin')]/node( )"
Poison Sumac
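To see how the monkey got there, it helps to take the path one step at a time, using each result as the context node for the next step. The sketch below assumes XML::LibXML and an entity-free copy of the inventory with the notes and locations left out.

```perl
use strict;
use warnings;
use XML::LibXML;

# Entity-free, trimmed copy of the inventory.
my $doc = XML::LibXML->load_xml( string => <<'XML' );
<inventory date="2001.9.4">
  <category type="tree">
    <item id="284">
      <name style="latin">Carya glabra</name>
      <name style="common">Pignut Hickory</name>
    </item>
    <item id="222">
      <name style="latin">Toxicodendron vernix</name>
      <name style="common">Poison Sumac</name>
    </item>
  </category>
  <category type="shrub">
    <item id="210">
      <name style="latin">Cornus racemosa</name>
      <name style="common">Gray Dogwood</name>
    </item>
    <item id="104">
      <name style="latin">Alnus rugosa</name>
      <name style="common">Speckled Alder</name>
    </item>
  </category>
</inventory>
XML

# Each step starts from the node found by the step before it.
my ($item)   = $doc->findnodes( q{//*[@id='104']} );            # item 104
my ($shrub)  = $item->findnodes( 'parent::*' );                 # its category
my ($tree)   = $shrub->findnodes( 'preceding-sibling::*' );     # the category before it
my ($second) = $tree->findnodes( 'child::*[2]' );               # that category's second item
my ($common) = $second->findnodes( q{name[not(@style='latin')]} );
my $text     = $common->findvalue( 'node()' );                  # the name's text

print $shrub->getAttribute('type'),  "\n";    # shrub
print $tree->getAttribute('type'),   "\n";    # tree
print $second->getAttribute('id'),   "\n";    # 222
print "$text\n";                              # Poison Sumac
```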
If your code does so much work with Perl arrays and XPath attribute
references that it's unclear which
@ characters are which, consider referring to
attributes in longhand, using the
"attribute" XPath axis:
attribute::foo. This raises the issue of the
double colon and its different meanings in Perl and XPath. Since
XPath defines a small, fixed set of axes (thirteen in XPath 1.0),
however, and they're always written in lowercase,
they're easy to tell apart from Perl package names at a glance.
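The confusion is more than cosmetic, because Perl interpolates arrays inside double-quoted strings. The sketch below assumes XML::LibXML; the @style array and the count( ) helper are invented to show what can go wrong and how the longhand axis sidesteps it.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->load_xml( string => <<'XML' );
<item id="284">
  <name style="latin">Carya glabra</name>
  <name style="common">Pignut Hickory</name>
</item>
XML

my @style  = ( 'latin' );    # a Perl array that shares the attribute's name
my $wanted = 'latin';

# Double quotes: Perl interpolates @style, so the XPath engine receives
# //name[latin='latin'], which looks for a <latin> child element.
my $mangled = "//name[@style='latin']";

# Single-quoting keeps the @ intact.
my $quoted = q{//name[@style='latin']};

# Longhand axis: no @ at all, so interpolating $wanted is safe.
my $longhand = "//name[attribute::style='$wanted']";

# Hypothetical helper: how many nodes does a path select?
sub count {
    my @nodes = $doc->findnodes( shift );
    return scalar @nodes;
}

printf "%-40s %d\n", $_, count($_) for $mangled, $quoted, $longhand;
```

The first path silently matches nothing, while the other two each find the one Latin name.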
XPath makes the task of locating nodes in a document and describing
types of nodes for processing ridiculously simple. It cuts down on
the amount of code you have to write because climbing around the tree
to sample different parts is all taken care of. It's
easier to read than code too. We're happy with it,
and because it is a standard, we'll be seeing more
uses for it in many modules to come.