Stream-Based Versus Tree-Based Processing (Perl and XML)

3.3. Stream-Based Versus Tree-Based Processing

Remember the Perl mantra , "There's more than one way to do it "? It is also true when working with XML. Depending on how you want to work and what kind of resources you have, many options are available. One developer may prefer a low-maintenance parsing job and is prepared to be loose and sloppy with memory to get it. Another will need to squeeze out faster and leaner performance at the expense of more complex code. XML processing tasks vary widely, so you should be free to choose the shortest path to a solution.

There are a lot of different XML processing strategies. Most fall into two categories: stream-based and tree-based. With the stream-based strategy, the parser continuously alerts a program to patterns in the XML. The parser functions like a pipeline, taking XML markup on one end and pumping out processed nuggets of data to your program. We call this pipeline an event stream because each chunk of data sent to the program signals something new and interesting in the XML stream. For example, the beginning of a new element is a significant event. So is the discovery of a processing instruction in the markup. With each update, your program does something new -- perhaps translating the data and sending it to another place, testing it for some specific content, or sticking it onto a growing heap of data.

With the tree-based strategy, the parser keeps the data to itself until the very end, when it presents a complete model of the document to your program. Instead of a pipeline, it's like a camera that takes a picture and transmits the replica to you. The model is usually in a much more convenient state than raw XML. For example, nested elements may be represented in native Perl structures like lists or hashes, as we saw in an earlier example. Even more useful are trees of blessed objects with methods that help navigate the structure from one place to another. The whole point to this strategy is that your program can pull out any data it needs, in any order.

Why would you prefer one over the other? Each has strong and weak points. Event streams are fast and often have a much slimmer memory footprint, but at the expense of greater code complexity and impermanent data. Tree building, on the other hand, lets the data stick around for as long as you need it, and your code is usually simple because you don't need special tricks to do things like backwards searching. However, trees wither when it comes to economical use of processor time and memory.

All of this is relative, of course. Small documents don't cause much hardship to a typical computer, especially since CPU cycles and megabytes are getting cheaper every day. Maybe the convenience of a persistent data structure will outweigh any drawbacks. On the other hand, when working with Godzilla-sized documents like books, or huge numbers of documents all at once, you'll definitely notice the crunch. Then the agility of event stream processors will start to look better. It's impossible to give you any hard-and-fast rules, so we'll leave the decision up to you.

An interesting thing to note about the stream-based and tree-based strategies is that one is the basis for the other. That's right, an event stream drives the process of building a tree data structure. Thus, most low-level parsers are event streams because you can always write a tree building layer on top. This is how XML::Parser and most other parsers work.

In a related, more recent, and very cool development, XML event streams can also turn any kind of document into some form of XML by writing stream-based parsers that generate XML events from whatever data structures lurk in that document type.

There's a lot more to say about event streams and tree builders -- so much, in fact, that we've devoted two whole chapters to the topics. Chapter 4, "Event Streams" takes a deep plunge into the theory behind event streams with lots of examples for making useful programs out of them. Chapter 6, "Tree Processing" takes you deeper into the forest with lots of tree-based examples. After that, Chapter 8, "Beyond Trees: XPath, XSLT, and More" shows you unusual hybrids that provide the best of both worlds.