3.4. Putting Parsers to Work

Enough tinkering with the parser's internal details. We want to see what you can do with the stuff you get from parsers. We've already seen an example of a complete, parser-built tree structure in Example 3-3, so let's do something with the other type. We'll take an XML event stream and make it drive processing by plugging it into some code to handle the events. It may not be the most useful tool in the world, but it will serve well enough to show you how real-world XML processing programs are written.

XML::Parser (with Expat running underneath) is at the input end of our program. Expat subscribes to the event-based parsing school we described earlier. Rather than loading your whole XML document into memory and then turning around to see what it hath wrought, it stops every time it encounters a discrete chunk of data or markup, such as an angle-bracketed tag or a literal string inside an element. It then checks to see if our program wants to react to it in any way.

Your first responsibility is to give the parser an interface to the pertinent bits of code that handle events. Each type of event is handled by a different subroutine, or handler. We register our handlers with the parser by setting the Handlers option at initialization time. Example 3-5 shows the entire process.

Example 3-5. A stream-based XML processor
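As a rough illustration of the handler registration just described, here is a minimal sketch. It is not the book's own listing: the inline sample document, the @report array, and the wording of the messages are assumptions made for illustration. Only the Handlers option, the Start and End event names, and the current_line method of the $expat object come from XML::Parser itself.

```perl
use strict;
use warnings;
use XML::Parser;

my @element_stack;   # open elements, each stored with its starting line
my @report;          # messages collected by the handlers (our own choice)

# Start-of-element event: gets the Expat object, the element name,
# and the attributes as a flat list of name/value pairs.
sub handle_start {
    my ($expat, $element, %attrs) = @_;
    my $line = $expat->current_line;

    # Remember the element and where it began.
    push @element_stack, { element => $element, line => $line };

    my $msg = "<$element> starts on line $line";
    if (%attrs) {
        $msg .= ' with ' . join(', ', map { "$_=$attrs{$_}" } sort keys %attrs);
    }
    push @report, $msg;
}

# End-of-element event: gets the Expat object and the element name.
sub handle_end {
    my ($expat, $element) = @_;

    # The parser enforces well-formedness, so the top of the stack
    # is the element that is closing now.
    my $record = pop @element_stack;
    push @report, "<$element> (opened on line $record->{line}) closes on line "
                . $expat->current_line;
}

# Register one handler per event type at initialization time.
my $parser = XML::Parser->new(
    Handlers => {
        Start => \&handle_start,
        End   => \&handle_end,
    },
);

# A hypothetical stand-in for a real input file.
my $xml = <<'XML';
<customers>
  <customer id="c1">
    <name>Ann</name>
  </customer>
</customers>
XML

$parser->parse($xml);
print "$_\n" for @report;
```

Collecting messages in @report instead of printing inside the handlers is only a convenience for inspecting the result afterward; printing directly from each handler would work just as well. To read from a file instead of a string, parsefile( ) can be used in place of parse( ).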
It's easy to see how this process works. We've written two handler subroutines called handle_start( ) and handle_end( ) and registered each with a particular event in the call to new( ). When we call parse( ), the parser knows it has handlers for a start-of-element event and an end-of-element event. Every time the parser trips over an element start tag, it calls the first handler and gives it information about that element (element name and attributes). Similarly, any end tag it encounters leads to a call of the other handler, which receives the name of the element being closed.

Note that the parser also gives each handler a reference called $expat. This is a reference to the XML::Parser::Expat object, a low-level interface to Expat. It has access to interesting information that might be useful to a program, such as line numbers and element depth. We've taken advantage of this fact, using the line number to dazzle users with our amazing powers of document analysis.

Want to see it run? Here's how the output looks after processing the customer database document from Example 1-1:
Here we used the element stack again. We didn't actually need to store the elements' names ourselves; one of the methods you can call on the XML::Parser::Expat object returns the current context list, an outermost-to-innermost ordering of all the elements that are currently open. However, a stack proved to be a useful way to store additional information like line numbers. It shows off the fact that you can let events build up structures of arbitrary complexity -- the "memory" of the document's past.

There are many more event types than we handle here. We don't do anything with character data, comments, or processing instructions, for example. However, for the purpose of this example, we don't need to go into those event types. We'll have more exhaustive examples of event processing in the next chapter, anyway.

Before we close the topic of event processing, we want to mention one thing: the Simple API for XML, more commonly known as SAX. It's very similar to the event processing model we've seen so far, but the difference is that it's a standard, albeit a de facto one developed by the XML-DEV mailing list community rather than by the W3C. Being a standard means that it has a standardized, canonical set of events. How these events should be presented for processing is also standardized. The cool thing about it is that with a standard interface, you can hook up different program components like Legos and it will all work. If you don't like one parser, just plug in another (and sophisticated tools like the XML::SAX module family can even help you pick a parser based on the features you need). Get your XML data from a database, a file, or your mother's shopping list; it shouldn't matter where it comes from.

SAX is very exciting for the Perl community because we've long been criticized for our lack of standards compliance and general barbarism. Now we can be criticized for only one of those things. You can expect a nice, thorough discussion on SAX (specifically, PerlSAX, our beloved language's mutation thereof) in Chapter 5, "SAX".
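To give a taste of the plug-and-play idea before Chapter 5, here is a minimal sketch using the XML::SAX module family. The handler class name ElementCounter, its count field, and the inline document are hypothetical names chosen for illustration. The sketch assumes XML::SAX is installed and lets XML::SAX::ParserFactory choose whichever SAX parser is available.

```perl
use strict;
use warnings;

# A hypothetical handler class that tallies element names.
package ElementCounter;
use base 'XML::SAX::Base';

# SAX delivers each start tag as a hash reference; the Name key
# holds the element name as it appears in the document.
sub start_element {
    my ($self, $element) = @_;
    $self->{count}{ $element->{Name} }++;
}

package main;
use XML::SAX::ParserFactory;

my $handler = ElementCounter->new;

# The factory hands back some installed SAX parser. The handler
# neither knows nor cares which one it is.
my $parser = XML::SAX::ParserFactory->parser(Handler => $handler);
$parser->parse_string('<list><item/><item/></list>');

for my $name (sort keys %{ $handler->{count} }) {
    print "$name: $handler->{count}{$name}\n";
}
```

Because the handler only sees standardized events, swapping the parser behind the factory should not require any change to ElementCounter.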
Copyright © 2002 O'Reilly & Associates. All rights reserved.