Beyond Trees: XPath, XSLT, and More (Perl and XML)

In the last chapter, we introduced the concepts behind handling XML documents as memory trees. Our use of them was kind of primitive, limited to building, traversing, and modifying pieces of trees. This is okay for small, uncomplicated documents and tasks, but serious XML processing requires beefier tools. In this chapter, we examine ways to make tree processing easier, faster, and more efficient.

8.1. Tree Climbers

The first in our lineup of power tools is the tree climber. As the name suggests, it climbs a tree for you, finding the nodes in the order you want them, making your code simpler and more focused on per-node processing. Using a tree climber is like having a trained monkey climb up a tree to get you coconuts so you don't have to scrape your own skin on the bark to get them; all you have to do is drill a hole in the shell and pop in a straw.

The simplest kind of tree climber is an iterator (sometimes called a walker). It can move forward or backward in a tree, doling out node references as you tell it to move. The notion of moving forward in a tree involves matching the order of nodes as they would appear in the text representation of the document. The exact algorithm for iterating forward is this:

If there's no current node, start at the root node.
If the current node has children, move to the first child.
Otherwise, if the current node has a following sibling, move to it.
If none of these options work, go back up the list of the current node's ancestors and try to find one with an unprocessed sibling.

With this algorithm, the iterator will eventually reach every node in a tree, which is useful if you want to process all the nodes in a document part. You could also implement this algorithm recursively, but the advantage to doing it iteratively is that you can stop in between nodes to do other things. Example 8-1 shows how one might implement an iterator object for DOM trees. We've included methods for moving both forward and backward.

Example 8-1. A DOM iterator package

package XML::DOMIterator;

sub new {
  my $class = shift;
  my $self = {@_};
  $self->{ Node } = undef;
  return bless( $self, $class );
}

# move forward one node in the tree
#
sub forward {
  my $self = shift;

  # try to go down to the next level
  if( $self->is_element and
      $self->{ Node }->getFirstChild ) {
    $self->{ Node } = $self->{ Node }->getFirstChild;

  # try to go to the next sibling, or an acestor's sibling
  } else {
    while( $self->{ Node }) {
      if( $self->{ Node }->getNextSibling ) {
        $self->{ Node } = $self->{ Node }->getNextSibling;
        return $self->{ Node };
      }
      $self->{ Node } = $self->{ Node }->getParentNode;
    }
  }
}

# move backward one node in the tree
#
sub backward {
  my $self = shift;

  # go to the previous sibling and descend to the last node in its tree
  if( $self->{ Node }->getPreviousSibling ) {
    $self->{ Node } = $self->{ Node }->getPreviousSibling;
    while( $self->{ Node }->getLastChild ) {
      $self->{ Node } = $self->{ Node }->getLastChild;
    }

  # go up
  } else {
    $self->{ Node } = $self->{ Node }->getParentNode;
  }
  return $self->{ Node };
}

# return a reference to the current node
#
sub node {
  my $self = shift;
  return $self->{ Node };
}

# set the current node
#
sub reset {
  my( $self, $node ) = @_;
  $self->{ Node } = $node;
}

# test if current node is an element
#
sub is_element {
  my $self = shift;
  return( $self->{ Node }->getNodeType == 1 );
}

Example 8-2 is a test program for the iterator package. It prints out a short description of every node in an XML document tree -- first in forward order, then in backward order.

Example 8-2. A test program for the iterator package

use XML::DOM;

# initialize parser and iterator
my $dom_parser = new XML::DOM::Parser;
my $doc = $dom_parser->parsefile( shift @ARGV );
my $iter = new XML::DOMIterator;
$iter->reset( $doc->getDocumentElement );

# print all the nodes from start to end of a document
print "\nFORWARDS:\n";
my $node = $iter->node;
my $last;
while( $node ) {
  describe( $node );
  $last = $node;
  $node = $iter->forward;
}

# print all the nodes from end to start of a document
print "\nBACKWARDS:\n";
$iter->reset( $last );
describe( $iter->node );
while( $iter->backward ) {
  describe( $iter->node );
}

# output information about the node
#
sub describe {
  my $node = shift;
  if( ref($node) =~ /Element/ ) {
    print 'element: ', $node->getNodeName, "\n";
  } elsif( ref($node) =~ /Text/ ) {
    print "other node: \"", $node->getNodeValue, "\"\n";
  }
}

Many tree packages provide automated tree climbing capability. XML::LibXML::Node has a method iterator( ) that traverses a node's subtree, applying a subroutine to each node. Data::Grove::Visitor performs a similar function.

Example 8-3 shows a program that uses an automated tree climbing function to test processing instructions in a document.

Example 8-3. Processing instruction tester

use XML::LibXML;

my $dom = new XML::LibXML;
my $doc = $dom->parse_file( shift @ARGV );
my $docelem = $doc->getDocumentElement;
$docelem->iterator( \&find_PI );

sub find_PI {
  my $node = shift;
  return unless( $node->nodeType == &XML_PI_NODE );
  print "Found processing instruction: ", $node->nodeName, "\n";
}

Tree climbers are terrific for tasks that involve processing the whole document, since they automate the process of moving from node to node. However, you won't always have to visit every node. Often, you only want to pick out one from the bunch or get a set of nodes that satisfy a certain criterion, such as having a particular element name or attribute value. In these cases, you may want to try a more selective approach, as we will demonstrate in the next section.

Chapter 8. Beyond Trees: XPath, XSLT, and More

Contents:

8.1. Tree Climbers

Example 8-1. A DOM iterator package

Example 8-2. A test program for the iterator package

Example 8-3. Processing instruction tester