Chapter 5. SAX

XML::Parser has done remarkably well as a multipurpose XML parser and stream generator, but it really isn't the future of Perl and XML. The problem is that we don't want one standard parser for all ends and purposes; we want to be able to choose from multiple parsers, each serving a different purpose. One parser might be written completely in Perl for portability, while another is accelerated with a core written in C. Or, you might want a parser that translates one format (such as a spreadsheet) into an XML stream. You simply can't anticipate all the things a parser might be called on to do. Even XML::Parser, with its many options and multiple modes of operation, can't please everybody. The future, then, is a multiplicity of parsers that cover any situation you encounter.

An environment with multiple parsers demands some level of consistency. If every parser had its own interface, developers would go mad. Learning one interface and being able to expect all parsers to comply with it is better than having to learn a hundred different ways to do the same thing. We need a standard interface between parsers and code: a universal plug that is flexible and reliable, free from the individual quirks of any particular parser.

The XML development world has settled on an event-driven interface called SAX. SAX evolved from discussions on the XML-DEV mailing list and, shepherded by David Megginson,[24] was quickly shaped into a useful specification. The first incarnation, called SAX Level 1 (or just SAX1), supports elements, attributes, and processing instructions. It doesn't handle some other things like namespaces or CDATA sections, so the second iteration, SAX2, was devised, adding support for just about any event you can imagine in generic XML.
SAX has been a huge success. Its simplicity makes it easy to learn and work with. Early development with XML was mostly in the realm of Java, so SAX was codified as an interface construct. An interface construct is a special kind of class that declares an object's methods without implementing them, leaving the implementation up to the developer.

Enthusiasm for SAX soon infected the Perl community and implementations began to appear in CPAN, but there was a problem. Perl doesn't provide a rigorous way to define a standard interface like Java does. It has weak type checking and forgives all kinds of inconsistencies. Whereas Java compares argument types in functions with those defined in the interface construct at compile time, Perl quietly accepts any arguments you use. Thus, defining a standard in Perl is mostly a verbal activity, relying on the developer's experience and watchfulness to comply.

One of the first Perl implementations of SAX is Ken MacLeod's XML::Parser::PerlSAX module. As a subclass of XML::Parser, it modifies the stream of events from Expat to repackage them as SAX events.

5.1. SAX Event Handlers

To use a typical SAX module in a program, you must pass it an object whose methods implement handlers for SAX events. Table 5-1 describes the methods in a typical handler object. A SAX parser passes a hash to each handler containing properties relevant to the event. For example, in this hash, an element handler would receive the element's name and a list of attributes.

Table 5-1. PerlSAX handlers
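To make the calling convention concrete, here is a minimal sketch of a handler object and the way a parser might drive it. It needs no XML module: the events are built by hand, and both TinyHandler and send_event( ) are hypothetical names invented for this illustration, not part of any SAX module. Because Perl has no interface construct, the stand-in driver asks the handler at runtime whether it implements a given method before calling it.

```perl
use strict;
use warnings;

# A minimal handler: every method receives $self plus one hash reference
# holding the properties of the event.
package TinyHandler;

sub new {
    my $class = shift;
    return bless { log => [] }, $class;
}

sub start_element {
    my( $self, $properties ) = @_;
    my $attrs = $properties->{Attributes} || {};
    my $text  = "start " . $properties->{Name};
    $text .= " $_=$attrs->{$_}" foreach sort keys %$attrs;
    push @{ $self->{log} }, $text;
}

sub end_element {
    my( $self, $properties ) = @_;
    push @{ $self->{log} }, "end " . $properties->{Name};
}

sub characters {
    my( $self, $properties ) = @_;
    push @{ $self->{log} }, "text " . $properties->{Data};
}

package main;

# Hypothetical stand-in for a parser: deliver one event, but only if the
# handler implements a method of that name. Returns 1 if delivered.
sub send_event {
    my( $handler, $event, $properties ) = @_;
    return 0 unless $handler->can( $event );
    $handler->$event( $properties );
    return 1;
}

my $handler = TinyHandler->new;
send_event( $handler, 'start_element',
            { Name => 'para', Attributes => { id => 'p1' } } );
send_event( $handler, 'characters',  { Data => 'hello' } );
send_event( $handler, 'comment',     { Data => 'skipped' } );  # not implemented
send_event( $handler, 'end_element', { Name => 'para' } );

print "$_\n" foreach @{ $handler->{log} };
```

Run as is, the script prints three lines: "start para id=p1", "text hello", and "end para". The comment event is dropped because TinyHandler defines no comment method. Nothing checks the shape of the properties hash, which is exactly the kind of agreement that has to be kept by convention in Perl.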
A few notes about handler methods:
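Let's show an example now. We'll write a program called a filter, a special processor that outputs a replica of the original document with a few modifications. Specifically, it makes these changes to a document:

- It turns every XML comment into a <comment> element.
- It deletes processing instructions.
- It removes the tags, but keeps the content, of any <literal> element that occurs inside a <programlisting> element.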
The code for this program is listed in Example 5-1. Like the last program, we initialize the parser with a set of handlers, except this time they are bundled together in a convenient package: an object of the class MyHandler. Notice that we've implemented a few more handlers, since we want to be able to deal with comments, processing instructions, and the document prolog.

Example 5-1. Filter program

    #
    # initialize the parser
    #
    use XML::Parser::PerlSAX;
    my $parser = XML::Parser::PerlSAX->new( Handler => MyHandler->new( ) );

    if( my $file = shift @ARGV ) {
        $parser->parse( Source => {SystemId => $file} );
    } else {
        # no filename given: slurp the document from standard input
        my $input = "";
        while( <STDIN> ) { $input .= $_; }
        $parser->parse( Source => {String => $input} );
    }
    exit;

    #
    # global variables (file-scoped, so the handler subs below can see them)
    #
    my @element_stack;                # remembers element names
    my $in_intset;                    # flag: are we in the internal subset?

    ###
    ### Document Handler Package
    ###
    package MyHandler;

    #
    # initialize the handler package
    #
    sub new {
        my $type = shift;
        return bless {}, $type;
    }

    #
    # handle a start-of-element event: output start tag and attributes
    #
    sub start_element {
        my( $self, $properties ) = @_;
        # note: the hash %{$properties} will lose attribute order

        # close internal subset if still open
        output( "]>\n" ) if( $in_intset );
        $in_intset = 0;

        # remember the name by pushing onto the stack
        push( @element_stack, $properties->{'Name'} );

        # output the tag and attributes UNLESS it's a <literal>
        # inside a <programlisting>
        unless( stack_top( 'literal' ) and
                stack_contains( 'programlisting' )) {
            output( "<" . $properties->{'Name'} );
            my %attributes = %{$properties->{'Attributes'}};
            foreach( keys( %attributes )) {
                output( " $_=\"" . $attributes{$_} . "\"" );
            }
            output( ">" );
        }
    }

    #
    # handle an end-of-element event: output end tag UNLESS it's from a
    # <literal> inside a <programlisting>
    #
    sub end_element {
        my( $self, $properties ) = @_;
        output( "</" . $properties->{'Name'} . ">" )
            unless( stack_top( 'literal' ) and
                    stack_contains( 'programlisting' ));
        pop( @element_stack );
    }

    #
    # handle a character data event
    #
    sub characters {
        my( $self, $properties ) = @_;
        # parser unfortunately resolves some character entities for us,
        # so we need to replace them with entity references again;
        # the ampersand goes first so the others aren't escaped twice,
        # and /g catches every occurrence, not just the first
        my $data = $properties->{'Data'};
        $data =~ s/&/&amp;/g;
        $data =~ s/</&lt;/g;
        $data =~ s/>/&gt;/g;
        output( $data );
    }

    #
    # handle a comment event: turn into a <comment> element
    #
    sub comment {
        my( $self, $properties ) = @_;
        output( "<comment>" . $properties->{'Data'} . "</comment>" );
    }

    #
    # handle a PI event: delete it
    #
    sub processing_instruction {
        # do nothing!
    }

    #
    # handle internal entity reference (we don't want them resolved)
    #
    sub entity_reference {
        my( $self, $properties ) = @_;
        output( "&" . $properties->{'Name'} . ";" );
    }

    # true if the innermost open element has the given name
    sub stack_top {
        my $guess = shift;
        return $element_stack[ $#element_stack ] eq $guess;
    }

    # true if any open element has the given name
    sub stack_contains {
        my $guess = shift;
        foreach( @element_stack ) {
            return 1 if( $_ eq $guess );
        }
        return 0;
    }

    # single exit point for everything the filter writes
    sub output {
        my $string = shift;
        print $string;
    }

Looking closely at the handlers, we see that one argument is passed, in addition to the obligatory object reference $self. This argument is a reference to a hash of properties about the event. This technique has one disadvantage: in the element start handler, the attributes are stored in a hash, which has no memory of the original attribute order. Semantically, this is not a big deal, since XML is supposed to be ignorant of attribute order. However, there may be cases when you want to replicate that order.[25]
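The original order cannot be recovered from the hash, but the output can at least be made repeatable. The sketch below assumes a properties hash shaped like the one passed to start_element and writes the attributes in sorted order. The helper name start_tag( ) is hypothetical and is not part of the filter or of any module.

```perl
use strict;
use warnings;

# Build a start tag from a PerlSAX-style properties hash. Attributes are
# written in sorted name order, so two runs over the same input produce
# the same output. This does NOT restore the order used in the source
# document; that information is already gone once the attributes are
# stored in a hash.
sub start_tag {
    my( $properties ) = @_;
    my $attributes = $properties->{Attributes} || {};
    my $tag = "<" . $properties->{Name};
    foreach my $name ( sort keys %$attributes ) {
        $tag .= " $name=\"" . $attributes->{$name} . "\"";
    }
    return $tag . ">";
}

print start_tag( { Name       => 'chapter',
                   Attributes => { role => 'intro', id => 'c1' } } ), "\n";
```

This prints <chapter id="c1" role="intro">, whatever order the attributes had in the source. A stable order is mainly useful when comparing the output of two runs.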
As a filter, this program preserves everything about the original document, except for the few details that have to be changed. That means it has to account for the document prolog, processing instructions, and comments, not just elements and text. Even entity references should be preserved as they are instead of being resolved (as the parser may want to do). Therefore, the program has a few more handlers than the last example, in which we were interested only in extracting very specific information.

Let's test this program now. Our input datafile is listed in Example 5-2.

Example 5-2. Data for the filter

    <?xml version="1.0"?>
    <!DOCTYPE book
      SYSTEM "/usr/local/prod/sgml/db.dtd"
    [
      <!ENTITY thingy "hoo hah blah blah">
    ]>

    <book id="mybook">
    <?print newpage?>
      <title>GRXL in a Nutshell</title>
      <chapter id="intro">
        <title>What is GRXL?</title>
    <!-- need a better title -->
        <para>
    Yet another acronym. That was our attitude at first, but then we saw
    the amazing uses of this new technology called
    <literal>GRXL</literal>. Consider the following program:
        </para>
    <?print newpage?>
        <programlisting>AH aof -- %%%%
    {{{{{{ let x = 0 }}}}}}
      print! <lineannotation><literal>wow</literal></lineannotation>
    or not!</programlisting>
    <!-- what font should we use? -->
        <para>
    What does it do? Who cares? It's just lovely to look at. In fact,
    I'd have to say, "&thingy;".
        </para>
    <?print newpage?>
      </chapter>
    </book>

The result, after running the program on the data, is shown in Example 5-3.

Example 5-3. Output from the filter

    <book id="mybook">
      <title>GRXL in a Nutshell</title>
      <chapter id="intro">
        <title>What is GRXL?</title>
    <comment> need a better title </comment>
        <para>
    Yet another acronym. That was our attitude at first, but then we saw
    the amazing uses of this new technology called
    <literal>GRXL</literal>. Consider the following program:
        </para>
        <programlisting>AH aof -- %%%%
    {{{{{{ let x = 0 }}}}}}
      print! <lineannotation>wow</lineannotation>
    or not!</programlisting>
    <comment> what font should we use? </comment>
        <para>
    What does it do? Who cares? It's just lovely to look at. In fact,
    I'd have to say, "&thingy;".
        </para>
      </chapter>
    </book>

Here's what the filter did right. It turned each XML comment into a <comment> element and deleted the processing instructions. The <literal> element in the <programlisting> was removed, with its contents left intact, while other <literal> elements were preserved. Entity references were left unresolved, as we wanted. So far, so good. But something's missing. The XML declaration, document type declaration, and internal subset are gone. Without the declaration for the entity thingy, this document is not valid. It looks like the handlers we had available to us were not sufficient.

Copyright © 2002 O'Reilly & Associates. All rights reserved.