Points (XML in a Nutshell, 2nd Edition)

11.6. Points

XPaths, bare names, and child sequences can only point to entire nodes or sets of nodes. However, sometimes you want to point to something that isn't a node, such as the third word of the second paragraph or the year in a date attribute that looks like date="01/03/1950". XPointer adds points and ranges to the XPath syntax to make this possible. A point is the position preceding or following any tag, comment, processing instruction, or character in the #PCDATA. Points can also be positions inside comments, processing instructions, or attribute values. Points cannot be located inside an entity reference, though they can be located inside the entity's replacement text. A range is the span of parsed character data between two points. Nodes, points, and ranges are collectively called locations; a set that may contain nodes, points, and ranges is called a location set. In other words, a location is a generalization of the XPath node that includes points and ranges, as well as elements, attributes, namespaces, text nodes, comments, processing instructions, and the root node.

A point is identified by its container node and a non-negative index into that node. If the node contains child nodes--that is, if it's a document or element node--then there are points before and after each of its children (except at the ends, where the point after one child node will also be the point before the next child node). If the node does not contain child nodes--that is, if it's a comment, processing instruction, attribute, namespace, or text node--then there's a point before and after each character in the string value of the node, and again the point after one character will be the same as the point before the next character.

Consider the document in Example 11-1. It contains a novel element that has seven child nodes, three of which are element nodes and four of which are text nodes containing only whitespace.

Example 11-1. A novel document

<?xml version="1.0"?>
<?xml-stylesheet type="text/css" value="novel.css"?>
<!-- You may recognize this from the last chapter -->
<novel copyright="public domain">
  <title>The Wonderful Wizard of Oz</title>
  <author>L. Frank Baum</author>
  <year>1900</year>
</novel>

There are eight points directly inside the novel element numbered from 0 to 7, one immediately after and one immediately before each tag. Figure 11-1 identifies these points.

Figure 11-1. The points inside the novel element

Inside the text node child of the year element, there are five points:

Point 0 between <year> and 1
Point 1 between 1 and 9
Point 2 between 9 and 0
Point 3 between 0 and 0
Point 4 between 0 and </year>

Notice that the points occur between the characters of the text rather than on the characters themselves. Points are zero-dimensional. They identify a location, but they have no extension, not even a single character. To indicate one or more characters, you need to specify a range between two points.

XPointer adds two functions to XPath that make it very easy to select the first and last points inside a node, start-point( ) and end-point( ). For example, this XPointer identifies the first point inside the title element, that is, the point between the title node and its text node child:

xpointer(start-point(//title))

This XPointer indicates the point immediately before the </author> tag:

xpointer(end-point(//author))

If there were multiple title and author elements in the document, then these functions would select multiple points.

This XPointer points to the point immediately before the letter T in "The Wonderful Wizard of Oz":

xpointer(start-point(//title/text( )))

This point falls immediately after the point indicated by xpointer(start-point(//title)). These are two different points, even though they fall between the same two characters (> and T) in the text.

To select points other than the start-point or end-point of a node, you first need to form a range that begins or ends with the point of interest using string-range( ) and then use the start-point or end-point function on that range. We take this up in the next section.