Chapter 3. XPath: A Syntax for Describing Needles and Haystacks

XPath is a syntax used to describe parts of an XML document. With XPath, you can refer to the first <para> element, the quantity attribute of the <part-number> element, all <first-name> elements that contain the text "Joe", and many other variations. An XSLT stylesheet uses XPath expressions in the match and select attributes of various elements to indicate how a document should be transformed. In this chapter, we'll discuss XPath in all its glory.

XPath is designed to be used inside an attribute in an XML document. The syntax is a mix of basic programming language expressions (such as $x*6) and Unix-like path expressions (such as /sonnet/author/last-name). In addition to the basic syntax, XPath provides a set of useful functions that allow you to find out various things about the document.
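
As a minimal sketch of how both kinds of expression sit inside attributes, the stylesheet below uses the two expressions just quoted, plus the three selections mentioned at the start of the chapter. The variable x and its value of 7 are made up for illustration, and the element names assume an input document that actually contains them.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- match takes an XPath pattern; "/" matches the root of the document -->
  <xsl:template match="/">
    <!-- Hypothetical variable, used only to show an arithmetic expression -->
    <xsl:variable name="x" select="7"/>
    <xsl:value-of select="$x * 6"/>  <!-- writes 42 -->
    <xsl:text>&#10;</xsl:text>

    <!-- A Unix-like path expression -->
    <xsl:value-of select="/sonnet/author/last-name"/>
    <xsl:text>&#10;</xsl:text>

    <!-- The first para element in the whole document -->
    <xsl:value-of select="(//para)[1]"/>
    <xsl:text>&#10;</xsl:text>

    <!-- The quantity attribute of a part-number element
         (value-of writes the first one in document order) -->
    <xsl:value-of select="//part-number/@quantity"/>
    <xsl:text>&#10;</xsl:text>

    <!-- How many first-name elements contain the text "Joe" -->
    <xsl:value-of select="count(//first-name[contains(., 'Joe')])"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

A path that selects nothing is not an error; <xsl:value-of> simply writes an empty string for it.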

One important point, though: XPath works with the parsed version of your XML document. That means that some details of the original document aren't accessible to you from XPath. For example, entity references are resolved by the XSLT processor before the instructions in your stylesheet are evaluated. CDATA sections are converted to text as well. That means you have no way of knowing whether a text node in an XPath tree appeared in the original XML document as text, as an entity reference, or as part of a CDATA section. As you get used to thinking about your XML documents in terms of XPath expressions, this situation won't be a problem, but it may confuse you at first.

3.1. The XPath Data Model

XPath views an XML document as a tree of nodes. This tree is very similar to a Document Object Model (DOM) tree, so if you're familiar with the DOM, you should have some understanding of how to build basic XPath expressions. (To be precise, this is a conceptual tree; an XSLT processor or anything else that implements the XPath standard doesn't have to build an actual tree.) There are seven kinds of nodes in XPath:

The root node (one per document)
Element nodes
Attribute nodes
Text nodes
Comment nodes
Processing instruction nodes
Namespace nodes

We'll talk about all the different node types in terms of the following document:

<?xml version="1.0"?>
<?xml-stylesheet href="sonnet.xsl" type="text/xsl"?>
<?cocoon-process type="xslt"?>

<!DOCTYPE sonnet [
  <!ELEMENT sonnet (auth:author, title, lines)>
  <!ATTLIST sonnet public-domain CDATA "yes"
            type (Shakespearean | Petrarchan) "Shakespearean">
  <!ELEMENT auth:author (last-name,first-name,nationality,
                         year-of-birth?,year-of-death?)>
  <!ELEMENT last-name (#PCDATA)>
  <!ELEMENT first-name (#PCDATA)>
  <!ELEMENT nationality (#PCDATA)>
  <!ELEMENT year-of-birth (#PCDATA)>
  <!ELEMENT year-of-death (#PCDATA)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT lines (line,line,line,line,
                   line,line,line,line,
                   line,line,line,line,
                   line,line)>
  <!ELEMENT line (#PCDATA)>
]>

<!-- Default sonnet type is Shakespearean, the other allowable  -->
<!-- type is "Petrarchan."                                      -->
<sonnet type="Shakespearean">
  <auth:author xmlns:auth="http://www.authors.com/">
    <last-name>Shakespeare</last-name>
    <first-name>William</first-name>
    <nationality>British</nationality>
    <year-of-birth>1564</year-of-birth>
    <year-of-death>1616</year-of-death>
  </auth:author>
  <!-- Is there an official title for this sonnet?  They're     
       sometimes named after the first line.                   -->
  <title>Sonnet 130</title>
  <lines>
    <line>My mistress' eyes are nothing like the sun,</line>
    <line>Coral is far more red than her lips red.</line>
    <line>If snow be white, why then her breasts are dun,</line>
    <line>If hairs be wires, black wires grow on her head.</line>
    <line>I have seen roses damasked, red and white,</line>
    <line>But no such roses see I in her cheeks.</line>
    <line>And in some perfumes is there more delight</line>
    <line>Than in the breath that from my mistress reeks.</line>
    <line>I love to hear her speak, yet well I know</line>
    <line>That music hath a far more pleasing sound.</line>
    <line>I grant I never saw a goddess go,</line>
    <line>My mistress when she walks, treads on the ground.</line>
    <line>And yet, by Heaven, I think my love as rare</line>
    <line>As any she belied with false compare.</line>
  </lines>
</sonnet>
<!-- The title of Sting's 1987 album "Nothing like the sun" is  -->
<!-- from line 1 of this sonnet.                                -->
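
Before looking at each node type in turn, here is a rough sketch of a stylesheet that counts the nodes of each kind in the sample document, using one XPath node test or axis per kind. It is an illustration, not a definitive tool: the counts depend on how your processor handles the DTD and whitespace, and a few processors do not implement the namespace axis.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- "/" selects the root node; there is exactly one -->
    <xsl:text>root nodes: </xsl:text>
    <xsl:value-of select="count(/)"/>
    <xsl:text>&#10;element nodes: </xsl:text>
    <xsl:value-of select="count(//*)"/>
    <xsl:text>&#10;attribute nodes: </xsl:text>
    <xsl:value-of select="count(//@*)"/>
    <xsl:text>&#10;text nodes: </xsl:text>
    <xsl:value-of select="count(//text())"/>
    <xsl:text>&#10;comment nodes: </xsl:text>
    <xsl:value-of select="count(//comment())"/>
    <xsl:text>&#10;processing instruction nodes: </xsl:text>
    <xsl:value-of select="count(//processing-instruction())"/>
    <xsl:text>&#10;namespace nodes: </xsl:text>
    <xsl:value-of select="count(//namespace::*)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

The text node count includes the whitespace-only text nodes that sit between elements, so expect it to be larger than the number of elements that visibly contain text.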

3.1.1. The Root Node

The root node is the XPath node that contains the entire document. In our example, the root node contains the <sonnet> element; it's not the <sonnet> element itself. In an XPath expression, the root node is specified with a single slash (/).

Unlike other nodes, the root node has no parent. It always has at least one child, the document element. The root node also contains any comments or processing instructions that are outside the document element. In our sample document, the two processing instructions named xml-stylesheet and cocoon-process are both children of the root node, as are the comments that appear before the <sonnet> tag and the comments that appear after the </sonnet> tag. The string value of the root node (returned by <xsl:value-of select="/" />, for example) is the concatenation of all text nodes of the root node's descendants.
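
To see the root node's children for yourself, a small stylesheet along these lines walks over /node() and reports what it finds. This is only a sketch; the labels it prints are arbitrary.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- /node() selects every child of the root node, in document order -->
    <xsl:for-each select="/node()">
      <xsl:choose>
        <xsl:when test="self::processing-instruction()">
          <xsl:text>processing instruction: </xsl:text>
          <xsl:value-of select="name()"/>
        </xsl:when>
        <xsl:when test="self::comment()">
          <xsl:text>comment</xsl:text>
        </xsl:when>
        <xsl:when test="self::*">
          <xsl:text>element: </xsl:text>
          <xsl:value-of select="name()"/>
        </xsl:when>
      </xsl:choose>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Run against the sample document, this should report the xml-stylesheet and cocoon-process processing instructions, two comments, the sonnet element, and two more comments. The XML declaration and the DOCTYPE do not appear, because neither one is a node in the XPath tree.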

3.1.2. Element Nodes

Every element in the original XML document is represented by an XPath element node. In the previous document, an element node exists for the <sonnet> element, the <auth:author> element, the <last-name> element, etc. An element node's children include text nodes, element nodes, comment nodes, and processing instruction nodes that occur within that element in the original document.

An element node's string value (returned by <xsl:value-of select="sonnet"/>, for example) is the concatenation of all the text nodes that are descendants of the element, in document order (the order in which they appear in the original document). All entity references (such as &lt;) and character references (such as &#052;) in the text are resolved automatically; you can't access the entity or character references from XPath.
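
As a short illustration of string values, the following sketch writes the string value of two elements from the sample document. The auth prefix declared in the stylesheet is bound to the same URI the sample document uses.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:auth="http://www.authors.com/">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- An element with a single text node: writes "Sonnet 130" -->
    <xsl:value-of select="/sonnet/title"/>
    <xsl:text>&#10;</xsl:text>

    <!-- An element with element children: the string value joins the text
         of all descendants, including the whitespace between the child
         elements. normalize-space() tidies that whitespace, giving
         "Shakespeare William British 1564 1616" -->
    <xsl:value-of select="normalize-space(/sonnet/auth:author)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```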

The name of an element node (returned by the XPath name() function) is the element's qualified name: the namespace prefix, if there is one, followed by a colon and the local name. In the previous example, the name() of the <sonnet> element is sonnet, the name() of the <auth:author> element is auth:author, and the name() of the <last-name> element is last-name. (A prefixed declaration such as xmlns:auth applies only to names that actually use the prefix, so the unprefixed elements inside <auth:author> are not in the auth namespace; only a default namespace declaration, xmlns="...", is picked up by unprefixed child elements.) Other XPath functions, such as local-name() and namespace-uri(), return other information about the name of the element node.
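
The three name functions are easiest to compare side by side. The sketch below prints all three for <sonnet>, <auth:author>, and <last-name> in the sample document; the output labels are arbitrary.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:auth="http://www.authors.com/">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:for-each select="/sonnet
                        | /sonnet/auth:author
                        | /sonnet/auth:author/last-name">
      <xsl:text>name()=</xsl:text>
      <xsl:value-of select="name()"/>
      <xsl:text> local-name()=</xsl:text>
      <xsl:value-of select="local-name()"/>
      <xsl:text> namespace-uri()=</xsl:text>
      <xsl:value-of select="namespace-uri()"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

For <auth:author>, namespace-uri() returns http://www.authors.com/ and local-name() returns author. For <sonnet> and <last-name>, namespace-uri() returns an empty string, because unprefixed elements belong to a namespace only when a default namespace is declared, and the sample document declares none.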

3.1.3. Attribute Nodes

At a minimum, an element node is the parent of one attribute node for each attribute in the XML source document. In our sample document, the element node corresponding to the <sonnet> element is the parent of an attribute node with a name of type and a value of Shakespearean. A couple of complications for attribute nodes exist, however:

Although an element node is the parent of its attribute nodes, those attribute nodes are not children of their parent. The children of an element are the text, element, comment, and processing instruction nodes contained in the original element. If you want an element's attributes, you must ask for them specifically. That relationship seems odd at first, but you'll find that treating an element's attributes separately is usually what you want to do.
If a DTD or schema defines default values for certain attributes, those attributes don't have to appear in the XML document. For example, our sample DTD declares Shakespearean as the default type, so the tag <sonnet type="Shakespearean"> is functionally equivalent to <sonnet>. Under normal circumstances, the XPath tree contains an attribute node for every attribute with a default value, whether it actually appears in the document or not. Because type is declared with a default value of Shakespearean, both of the <sonnet> elements we just mentioned will have an attribute node with a name of type and a value of Shakespearean. (An attribute declared as #IMPLIED has no default, so it gets an attribute node only when it is coded in the document.) Of course, if the document codes a value other than the default (<sonnet type="Petrarchan">, for example), the attribute node's value will be whatever was coded in the document.

To make this situation even worse, an XML parser isn't required to read an external DTD. If it doesn't, then any attribute nodes that represent default values not coded in the document won't exist. Fortunately, XSLT has some branching elements (<xsl:if> and <xsl:choose>) that can help you deal with these ambiguities; we'll discuss those in Chapter 4, "Branching and Control Elements".
The XML 1.0 specification defines two attributes (xml:lang and xml:space) that work like default namespaces. In other words, if the <auth:author> element in our sample document contains the attribute xml:lang="en-US", that attribute applies to all elements contained inside <auth:author>. Even though that attribute might apply to the <last-name> element, <last-name> won't have an attribute node named xml:lang. Similarly, the xml:space attribute defines whether whitespace in an element should be preserved; valid values for this attribute are preserve and default. Whether these attributes are in effect for a given element or not, the only attribute nodes an element node contains are those tagged in the document and those defined with a default value in the DTD.

For more information on language codes and whitespace handling, see the discussions of the XPath lang() function and the XSLT <xsl:preserve-space> and <xsl:strip-space> elements.
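
The points above about attribute nodes can be tried out with a sketch like the following, run against the sample document. It asks for attributes explicitly with @, and it uses <xsl:choose> to cope with a defaulted attribute (public-domain, from the sample DTD) that may or may not be present, depending on whether the parser applied the DTD.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/sonnet">
    <!-- Attributes must be requested explicitly with @ -->
    <xsl:text>type: </xsl:text>
    <xsl:value-of select="@type"/>

    <!-- The element is the parent of its attribute: writes "sonnet" -->
    <xsl:text>&#10;parent of @type: </xsl:text>
    <xsl:value-of select="name(@type/..)"/>

    <!-- The child axis never returns attribute nodes, so these two
         counts measure different things -->
    <xsl:text>&#10;child elements: </xsl:text>
    <xsl:value-of select="count(*)"/>
    <xsl:text>&#10;attribute nodes: </xsl:text>
    <xsl:value-of select="count(@*)"/>

    <!-- A defaulted attribute exists only if the DTD was processed -->
    <xsl:text>&#10;public-domain: </xsl:text>
    <xsl:choose>
      <xsl:when test="@public-domain">
        <xsl:value-of select="@public-domain"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:text>(no attribute node; default not supplied)</xsl:text>
      </xsl:otherwise>
    </xsl:choose>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```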

3.1.4. Text Nodes

Text nodes are refreshingly simple; they contain text from an element. If the original text in the XML document contained entity or character references, they are resolved before the XPath text node is created. The text node is text, pure and simple. A text node is required to contain as much text as possible; its next or previous sibling can't be another text node.

You might have noticed that there are no CDATA nodes in this list. If your XML document contains text in a CDATA section, you can access the contents of the CDATA section as a text node. You have no way of knowing if a given text node was originally a CDATA section. Similarly, all entity references are resolved before anything in your stylesheet is evaluated, so you have no way of knowing if a given piece of text originally contained entity references.
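
Here is a small, made-up illustration of that point. The hypothetical input document below mixes ordinary text, an entity reference, a CDATA section, and a character reference inside a single element:

```xml
<?xml version="1.0"?>
<note>AT&amp;T <![CDATA[<sells> phones]]> &#38; more</note>
```

A stylesheet sketch that counts the text nodes of <note> and writes its string value:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- All four pieces were merged into one text node: writes 1 -->
    <xsl:text>text nodes: </xsl:text>
    <xsl:value-of select="count(/note/text())"/>

    <!-- The references and CDATA markers are gone from the value -->
    <xsl:text>&#10;value: </xsl:text>
    <xsl:value-of select="/note"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

The value written is AT&T <sells> phones & more, with nothing left to show which characters came from a reference or a CDATA section.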

3.1.5. Comment Nodes

A comment node is also very simple: it contains some text. Every comment in the source document (except for comments in the DTD) becomes a comment node. The string value of the comment node (which you select with the comment() node test) contains everything inside the comment, except the opening <!-- and the closing -->.
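
As a quick sketch, the following stylesheet writes the string value of every comment in the sample document. The square brackets are there only to make leading and trailing whitespace visible.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- //comment() finds comments anywhere, including those that are
         children of the root node -->
    <xsl:for-each select="//comment()">
      <xsl:text>[</xsl:text>
      <xsl:value-of select="."/>
      <xsl:text>]&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```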

3.1.6. Processing Instruction Nodes

A processing instruction node has two parts, a name (returned by the name() function) and a string value. The string value is everything after the name and the whitespace that follows it, up to but not including the ?> that closes the processing instruction. Whitespace inside that text is kept.
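
A sketch along the same lines shows both parts of each processing instruction in the sample document. The brackets mark exactly where each string value begins and ends.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Every processing instruction that is a child of the root node -->
    <xsl:for-each select="/processing-instruction()">
      <xsl:value-of select="name()"/>
      <xsl:text>: [</xsl:text>
      <xsl:value-of select="."/>
      <xsl:text>]&#10;</xsl:text>
    </xsl:for-each>

    <!-- A single processing instruction can be selected by name -->
    <xsl:value-of select="/processing-instruction('cocoon-process')"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

For the sample document, the loop should write xml-stylesheet: [href="sonnet.xsl" type="text/xsl"] and cocoon-process: [type="xslt"]. What looks like attributes inside a processing instruction is plain text as far as XPath is concerned.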

3.1.7. Namespace Nodes

Namespace nodes are almost never used in XSLT stylesheets; they exist primarily for the XSLT processor's benefit. Remember that the declaration of a namespace (such as xmlns:auth="http://www.authors.com/"), even though it is technically an attribute in the XML source, becomes a namespace node, not an attribute node.
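
If you are curious, the namespace axis lets you look at these nodes directly. The sketch below lists the namespace nodes of the <auth:author> element in the sample document and counts its attribute nodes. Treat it as an experiment rather than a recipe: the order of namespace nodes is up to the processor, and some processors do not implement the namespace axis at all.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:auth="http://www.authors.com/">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- For a namespace node, name() is the prefix and the string
         value is the namespace URI -->
    <xsl:for-each select="/sonnet/auth:author/namespace::*">
      <xsl:value-of select="name()"/>
      <xsl:text> = </xsl:text>
      <xsl:value-of select="."/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>

    <!-- xmlns:auth is not an attribute node, so this writes 0 -->
    <xsl:text>attribute nodes: </xsl:text>
    <xsl:value-of select="count(/sonnet/auth:author/@*)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Besides auth, expect to see a namespace node for the xml prefix, which every element has implicitly.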
