Optimized Tree Processing (Perl and XML)

8.4. Optimized Tree Processing

The big drawback to using trees for XML crunching is that they tend to consume scandalous amounts of memory and processor time. This might not be apparent with small documents, but it becomes noticeable as documents grow to many thousands of nodes. A typical book of a few hundred pages' length could easily have tens of thousands of nodes. Each one requires the allocation of an object, a process that takes considerable time and memory.

Perhaps you don't need to build the entire tree to get your work done, though. You might only want a small branch of the tree and can safely do all the processing inside of it. If that's the case, then you can take advantage of the optimized parsing modes in XML::Twig (recall that we dealt with this module earlier in Section 8.2, "XPath"). These modes allow you to specify ahead of time what parts (or "twigs") of the tree you'll be working with so that only those parts are assembled. The result is a hybrid of tree and event processing with highly optimized performance in speed and memory.

XML::Twig has three modes of operation: the regular old tree mode, similar to what we've seen so far; "chunk" mode, which builds a whole tree, but has only a fraction of it in memory at a time (sort of like paged memory); and multiple roots mode, which builds only a few selected twigs from the tree.

Example 8-11 demonstrates the power of XML::Twig in chunk mode. The data to this program is a DocBook book with some <chapter> elements. These documents can be enormous, sometimes a hundred megabytes or more. The program breaks up the processing per chapter so that only a fraction of the space is needed.

Example 8-11. A chunking program

use XML::Twig;

# initalize the twig, parse, and output the revised twig
my $twig = new XML::Twig( TwigHandlers => { chapter => \&process_chapter });
$twig->parsefile( shift @ARGV );
$twig->print;

# handler for chapter elements: process and then flush up the chapter
sub process_chapter {
  my( $tree, $elem ) = @_;
  &process_element( $elem );
  $tree->flush_up_to( $elem );  # comment out this line to waste memory
}

# append 'foo' to the name of an element
sub process_element {
  my $elem = shift;
  $elem->set_gi( $elem->gi . 'foo' );
  my @children = $elem->children;
  foreach my $child ( @children ) {
    next if( $child->gi eq '#PCDATA' );
    &process_element( $child );
  }
}

The program changes element names to append the string "foo" to them. Changing names is just busy work to keep the program running long enough to check the memory usage. Note the line in the function process_chapter( ):

$tree->flush_up_to( $elem );

We get our memory savings from this command. Without it, the entire tree will be built and kept in memory until the document is finally printed out. But when it is called, the tree that has been built up to a given element is dismantled and its text is output (called flushing). The memory usage never rises higher than what is needed for the largest chapter in the book.

To test this theory, we ran the program on a 3 MB document, first without and then with the line shown above. Without flushing, the program's heap space grew to over 30 MB. It's staggering to see how much memory an object-oriented tree processor needs -- in this case ten times the size of the file. But with flushing enabled, the program hovered around only a few MB of memory usage, a savings of about 90 percent. In both cases, the entire tree is eventually built, so the total processing time is about the same. To save CPU cycles as well as memory, we need to use multiple roots mode.

Multiple roots mode works by specifying before parsing the roots of the twigs that you want built. You will save significant time and memory if the twigs are much smaller than the document as a whole. In our chunk mode example, we probably can't do much to speed up the process, since the sum of <chapter> elements is about the same as the size of the document. So let's focus on an example that fits the profile.

The program in Example 8-12 reads in DocBook documents and outputs the titles of chapters -- a table of contents of sorts. To get this information, we don't need to build a tree for the whole chapter; only the <title> element is necessary. So for roots, we specify titles of chapters, expressed in the XPath notation chapter/title.

Example 8-12. A many-twigged program

use XML::Twig;

my $twig = new XML::Twig( TwigRoots => { 'chapter/title' => \&output_title });
$twig->parsefile( shift @ARGV );

sub output_title {
  my( $tree, $elem ) = @_;
  print $elem->text, "\n";
}

The key line here is the one with the keyword TwigRoots. It's set to a hash of handlers and works very similarly to TwigHandlers that we saw earlier. The difference is that instead of building the whole document tree, the program builds only trees whose roots are <title> elements. This is a small fraction of the whole document, so we can expect time and memory savings to be high.

How high? Running the program on the same test data, we saw memory usage barely reach 2 MB, and the total processing time was 13 seconds. Compare that to 30 MB memory usage (the size required to build the whole tree) and a full minute to grind out the titles. This conservation of resources is significant for both memory and CPU time.

XML::Twig can give you a big performance boost for your tree processing programs, but you have to know when chunking and multiple roots will help. You won't save much time if the sum of twigs is almost as big as the document itself. Chunking is not useful unless the chunks are significantly smaller than the document.