Tree Processing (Perl and XML)

Having done just about all we can do with streams, it's time to move on to another style of XML processing. Instead of letting the XML fly past the program one tiny piece at a time, we will capture the whole document in memory and then start working on it. Having an in-memory representation built behind the scenes for us makes our job much easier, although it tends to require more memory and CPU cycles.

This chapter is an overview of programming with persistent XML objects, better known as tree processing. It looks at a variety of different modules and strategies for building and accessing XML trees, including the rigorous, standard Document Object Model (DOM), fast access to internal document parts with XPath, and efficient tree processing methods.

6.1. XML Trees

Every XML document can be represented as a collection of data objects linked in an acyclic structure called a tree. Each object, or node, is a small piece of the document, such as an element, a piece of text, or a processing instruction. One node, called the root, links to other nodes, and so on down to nodes that aren't linked to anything. Graph this image out and it looks like a big, bushy tree -- hence the name.

A tree structure representing a piece of XML is a handy thing to have. Since a tree is acyclic (it has no circular links), you can use simple traversal methods that won't get stuck in infinite loops. Like a filesystem directory tree, you can represent the location of a node easily in simple shorthand. Like real trees, you can break a piece off and treat it like a smaller tree -- a tree is just a collection of subtrees joined by a root node. Best of all, you have all the information in one place and search through it like a database.

For the programmer, a tree makes life much easier. Stream processing, you will recall, remembers fleeting details to use later in constructing another data structure or printing out information. This work is tedious, and can be downright horrible for very complex documents. If you have to combine pieces of information from different parts of the document, then you might go mad. If you have a tree containing the document, though, all the details are right in front of you. You only need to write code to sift through the nodes and pull out what you need.

Of course, you don't get anything good for free. There is a penalty for having easy access to every point in a document. Building the tree in the first place takes time and precious CPU cycles, and even more if you use object-oriented method calls. There is also a memory tax to pay, since each object in the tree takes up some space. With very large documents (trees with millions of nodes are not unheard of), you could bring your poor machine down to its knees with a tree processing program. On the average, though, processing trees can get you pretty good results (especially with a little optimizing, as we show later in the chapter), so don't give up just yet.

As we talk about trees, we will frequently use genealogical terms to describe relationships between nodes. A container node is said to be the parent of the nodes it branches to, each of which may be called a child of the container node. Likewise, the terms descendant, ancestor, and sibling mean pretty much what you think they would. So two sibling nodes share the same parent node, and all nodes have the root node as their ancestor.

There are several different species of trees, depending on the implementation you're talking about. Each species models the document in a slightly different way. For example, do you consider an entity reference to be a separate node from text, or would you include the reference in the same package? You have to pay attention to the individual scheme of each module. Table 6-1 shows a common selection of node types.

Table 6-1. Typical node type definitions

Type	Properties
Element	Name, attributes, references to children
Namespace	Prefix name, URI
Character data	String of characters
Processing instruction	Target, Data
Comment	String of characters
CDATA section	String of characters
Entity reference	Name, Replacement text (or System ID and/or Public ID)

In addition to this set, some implementations define node types for the DTD, allowing a programmer to access declarations for elements, entities, notations, and attributes. Nodes may also exist for the XML declaration and document type declarations.

Chapter 6. Tree Processing

Contents:

6.1. XML Trees

Table 6-1. Typical node type definitions