

Perl & XML

Chapter 5. SAX

XML::Parser has done remarkably well as a multipurpose XML parser and stream generator, but it really isn't the future of Perl and XML. The problem is that we don't want one standard parser for every purpose; we want to be able to choose from multiple parsers, each serving a different purpose. One parser might be written completely in Perl for portability, while another is accelerated with a core written in C. Or, you might want a parser that translates one format (such as a spreadsheet) into an XML stream. You simply can't anticipate all the things a parser might be called on to do. Even XML::Parser, with its many options and multiple modes of operation, can't please everybody. The future, then, is a multiplicity of parsers that cover any situation you encounter.

An environment with multiple parsers demands some level of consistency. If every parser had its own interface, developers would go mad. Learning one interface and being able to expect all parsers to comply with it is better than having to learn a hundred different ways to do the same thing. We need a standard interface between parsers and code: a universal plug that is flexible and reliable, free from the individual quirks of any particular parser.

The XML development world has settled on an event-driven interface called SAX. SAX evolved from discussions on the XML-DEV mailing list and, shepherded by David Megginson,[24] was quickly shaped into a useful specification. The first incarnation, called SAX Level 1 (or just SAX1), supports elements, attributes, and processing instructions. It doesn't handle some other things like namespaces or CDATA sections, so the second iteration, SAX2, was devised, adding support for just about any event you can imagine in generic XML.

[24]David Megginson maintains a web page about SAX at http://www.saxproject.org.

SAX has been a huge success. Its simplicity makes it easy to learn and work with. Early development with XML was mostly in the realm of Java, so SAX was codified as an interface construct. An interface construct is a special kind of class that declares an object's methods without implementing them, leaving the implementation up to the developer.

Enthusiasm for SAX soon infected the Perl community and implementations began to appear on CPAN, but there was a problem. Perl doesn't provide a rigorous way to define a standard interface as Java does. It has weak type checking and forgives all kinds of inconsistencies. Whereas Java compares argument types in functions with those defined in the interface construct at compile time, Perl quietly accepts any arguments you use. Thus, defining a standard in Perl is mostly a verbal activity, relying on the developer's experience and watchfulness to comply.
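Because Perl will not check a handler against an interface for us, any check has to happen at runtime. The following is a minimal sketch of one informal way to do that, using the built-in can( ) method to ask an object which methods it implements. The package PartialHandler, the function missing_methods( ), and the list of expected method names are all hypothetical, chosen only for illustration; they are not part of any SAX module.

```perl
use strict;
use warnings;

# Hypothetical handler class that implements only part of the interface.
package PartialHandler;

sub new {
    my $class = shift;
    return bless {}, $class;
}
sub start_element { }
sub end_element   { }

package main;

# Method names we have informally agreed a handler should provide
# (an assumption made for this sketch).
my @expected = qw( start_element end_element characters );

# Return the names of the expected methods that the object lacks.
sub missing_methods {
    my( $handler, @methods ) = @_;
    return grep { !$handler->can( $_ ) } @methods;
}

my @missing = missing_methods( PartialHandler->new, @expected );
print "missing: @missing\n";
```

Run as is, this prints "missing: characters". Nothing forces a developer to perform such a check, which is exactly the point made above: compliance is a matter of convention.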

One of the first Perl implementations of SAX is Ken McLeod's XML::Parser::PerlSAX module. As a subclass of XML::Parser, it modifies the stream of events from Expat to repackage them as SAX events.

5.1. SAX Event Handlers

To use a typical SAX module in a program, you must pass it an object whose methods implement handlers for SAX events. Table 5-1 describes the methods in a typical handler object. A SAX parser passes each handler a reference to a hash containing properties relevant to the event. For example, in this hash, an element handler would receive the element's name and a list of attributes.

Table 5-1. PerlSAX handlers

| Method name | Event | Properties |
|---|---|---|
| start_document | The document processing has started (this is the first event) | (none defined) |
| end_document | The document processing is complete (this is the last event) | (none defined) |
| start_element | An element start tag or empty element tag was found | Name, Attributes |
| end_element | An element end tag or empty element tag was found | Name |
| characters | A string of nonmarkup characters (character data) was found | Data |
| processing_instruction | A parser encountered a processing instruction | Target, Data |
| comment | A parser encountered a comment | Data |
| start_cdata | The beginning of a CDATA section was encountered (the following character data may contain reserved markup characters) | (none defined) |
| end_cdata | The end of a CDATA section was encountered | (none defined) |
| entity_reference | An internal entity reference was found (as opposed to an external entity reference, which would indicate that a file needs to be loaded) | Name, Value |
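To make the calling convention concrete, here is a minimal sketch of a handler object that implements only start_element( ). No parser is involved: the two method calls at the bottom are hand-built stand-ins for the events a parser would generate for <book id="b1"><title/></book>. The package name ElementLister and its seen field are hypothetical.

```perl
use strict;
use warnings;

package ElementLister;

sub new {
    my $class = shift;
    return bless { seen => [] }, $class;
}

# Each handler receives the object itself plus a reference to a hash
# of properties for the event (Name and Attributes for start_element).
sub start_element {
    my( $self, $properties ) = @_;
    my $name  = $properties->{'Name'};
    my $count = keys %{ $properties->{'Attributes'} };
    push @{ $self->{seen} }, "$name($count)";
}

package main;

my $handler = ElementLister->new;

# Stand-in for a parser: simulated events, not real parser output.
$handler->start_element( { Name => 'book',  Attributes => { id => 'b1' } } );
$handler->start_element( { Name => 'title', Attributes => {} } );

print join( ' ', @{ $handler->{seen} } ), "\n";
```

This prints "book(1) title(0)": each element name followed by the number of attributes found in its properties hash.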

A few notes about handler methods:

  • For an empty element, both the start_element( ) and end_element( ) handlers are called, in that order. No handler exists specifically for empty elements.

  • The characters( ) handler may be called more than once for a string of contiguous character data, parceling it into pieces. For example, a parser might break text around an entity reference, which is often more efficient for the parser.

  • The characters( ) handler will be called for any whitespace between elements, even if it doesn't seem like significant data. In XML, all characters are considered part of the data, and it is simpler and more efficient for the parser to report everything than to guess which whitespace matters.

  • Handling of processing instructions, comments, and CDATA sections is optional. In the absence of handlers, the data from processing instructions and comments is discarded. For CDATA sections, calls are still made to the characters( ) handler as before so the data will not be lost.

  • The start_cdata( ) and end_cdata( ) handlers do not receive data. Instead, they merely act as signals to tell you whether reserved markup characters can be expected in future calls to the characters( ) handler.

  • In the absence of an entity_reference( ) handler, all internal entity references will be resolved automatically by the parser, and the resulting text or markup will be handled normally. If you do define an entity_reference( ) handler, the entity references will not be expanded and you can do what you want with them.
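Since characters( ) may deliver one run of text in several pieces, a handler that needs the complete text usually has to collect the pieces itself. The sketch below shows one way to do that under a simplifying assumption: it only handles elements that contain text and no child elements. The package TextCollector and its fields are hypothetical, and the events at the bottom are simulated by hand rather than produced by a parser.

```perl
use strict;
use warnings;

package TextCollector;

sub new {
    my $class = shift;
    return bless { buffer => '', texts => [] }, $class;
}

# Start each element with an empty buffer (assumes no nested elements).
sub start_element {
    my( $self, $properties ) = @_;
    $self->{buffer} = '';
}

# characters() may fire several times for one run of text, so append
# each piece instead of overwriting what we already have.
sub characters {
    my( $self, $properties ) = @_;
    $self->{buffer} .= $properties->{'Data'};
}

# Only at the end tag do we know that the text is complete.
sub end_element {
    my( $self, $properties ) = @_;
    push @{ $self->{texts} }, $properties->{'Name'} . '=' . $self->{buffer};
    $self->{buffer} = '';
}

package main;

my $collector = TextCollector->new;

# Simulated events: a parser might split the text around an entity.
$collector->start_element( { Name => 'title', Attributes => {} } );
$collector->characters( { Data => 'Perl ' } );
$collector->characters( { Data => '&' } );
$collector->characters( { Data => ' XML' } );
$collector->end_element( { Name => 'title' } );

print $collector->{texts}[0], "\n";
```

The three simulated characters( ) calls end up as the single string "title=Perl & XML". A handler that must cope with mixed content would need a stack of buffers rather than a single one.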

Let's show an example now. We'll write a program called a filter, a special processor that outputs a replica of the original document with a few modifications. Specifically, it makes these changes to a document:

  • Turns every XML comment into a <comment> element

  • Deletes processing instructions

  • Removes tags, but leaves the content, for <literal> elements that occur within <programlisting> elements at any level

The code for this program is listed in Example 5-1. As in the last program, we initialize the parser with a set of handlers, except this time they are bundled together in a convenient package: an object of a class called MyHandler. Notice that we've implemented a few more handlers, since we want to be able to deal with comments, processing instructions, and the document prolog.

Example 5-1. Filter program

# initialize the parser
#
use XML::Parser::PerlSAX;
my $parser = XML::Parser::PerlSAX->new( Handler => MyHandler->new( ) );

if( my $file = shift @ARGV ) {
    $parser->parse( Source => {SystemId => $file} );
} else {
    my $input = "";
    while( <STDIN> ) { $input .= $_; }
    $parser->parse( Source => {String => $input} );
}
exit;

#
# global variables
#
my @element_stack;                # remembers element names
my $in_intset;                    # flag: are we in the internal subset?

###
### Document Handler Package
###
package MyHandler;

#
# initialize the handler package
#
sub new {
    my $type = shift;
    return bless {}, $type;
}

#
# handle a start-of-element event: output start tag and attributes
#
sub start_element {
    my( $self, $properties ) = @_;
    # note: the hash %{$properties} will lose attribute order

    # close internal subset if still open
    output( "]>\n" ) if( $in_intset );
    $in_intset = 0;

    # remember the name by pushing onto the stack
    push( @element_stack, $properties->{'Name'} );

    # output the tag and attributes UNLESS it's a <literal>
    # inside a <programlisting>
    unless( stack_top( 'literal' ) and
            stack_contains( 'programlisting' )) {
        output( "<" . $properties->{'Name'} );
        my %attributes = %{$properties->{'Attributes'}};
        foreach( keys( %attributes )) {
            output( " $_=\"" . $attributes{$_} . "\"" );
        }
        output( ">" );
    }
} 

#
# handle an end-of-element event: output end tag UNLESS it's from a
# <literal> inside a <programlisting>
#
sub end_element {
    my( $self, $properties ) = @_;
    output( "</" . $properties->{'Name'} . ">" )
         unless( stack_top( 'literal' ) and
                stack_contains( 'programlisting' ));
    pop( @element_stack );
}

#
# handle a character data event
#
sub characters {
    my( $self, $properties ) = @_;
    # parser unfortunately resolves some character entities for us,
    # so we need to replace them with entity references again
    # (escape & first so the entities added below are not re-escaped)
    my $data = $properties->{'Data'};
    $data =~ s/&/&amp;/g;
    $data =~ s/</&lt;/g;
    $data =~ s/>/&gt;/g;
    output( $data );
}

#
# handle a comment event: turn into a <comment> element
#
sub comment {
    my( $self, $properties ) = @_;
    output( "<comment>" . $properties->{'Data'} . "</comment>" );
}

#
# handle a PI event: delete it
#
sub processing_instruction {
  # do nothing!
}

#
# handle internal entity reference (we don't want them resolved)
#
sub entity_reference {
    my( $self, $properties ) = @_;
    output( "&" . $properties->{'Name'} . ";" );
}

sub stack_top {
    my $guess = shift;
    return $element_stack[ $#element_stack ] eq $guess;
}

sub stack_contains {
    my $guess = shift;
    foreach( @element_stack ) {
        return 1 if( $_ eq $guess );
    }
    return 0;
}

sub output {
    my $string = shift;
    print $string;
}

Looking closely at the handlers, we see that one argument is passed, in addition to the obligatory object reference $self. This argument is a reference to a hash of properties about the event. This technique has one disadvantage: in the element start handler, the attributes are stored in a hash, which has no memory of the original attribute order. Semantically, this is not a big deal, since XML is supposed to be ignorant of attribute order. However, there may be cases when you want to replicate that order.[25]

[25]In the case of our filter, we might want to compare the versions from before and after processing using a utility such as the Unix program diff. Such a comparison would yield many false differences where the order of attributes changed. Instead of using diff, you should consider using the module XML::SemanticDiff by Kip Hampton. This module would ignore syntactic differences and compare only the semantics of two documents.
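The original attribute order cannot be recovered from the hash, but the output can at least be made deterministic. As a small sketch, the hypothetical function start_tag( ) below builds a start tag from the same kind of properties hash the handlers receive and writes the attributes in sorted order. This is an illustration only, not how the filter in Example 5-1 works, and like that filter it does not escape attribute values.

```perl
use strict;
use warnings;

# Build a start tag with attributes in sorted (not original) order,
# so that repeated runs always produce the same text.
sub start_tag {
    my( $properties ) = @_;
    my %attributes = %{ $properties->{'Attributes'} };
    my $tag = '<' . $properties->{'Name'};
    foreach my $name ( sort keys %attributes ) {
        $tag .= qq{ $name="$attributes{$name}"};
    }
    return $tag . '>';
}

print start_tag( { Name       => 'chapter',
                   Attributes => { role => 'intro', id => 'c1' } } ), "\n";
```

This prints <chapter id="c1" role="intro"> regardless of how Perl happens to order the hash internally, which at least keeps two runs of the same filter comparable with each other.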

As a filter, this program should preserve everything about the original document, except for the few details that have to be changed. That means it cannot simply ignore the document prolog, processing instructions, and comments; each needs a handler that decides whether to pass the item through, transform it, or drop it. Even entity references should be preserved as they are instead of being resolved (as the parser may want to do). Therefore, the program has a few more handlers than the last example, in which we were interested only in extracting very specific information.

Let's test this program now. Our input datafile is listed in Example 5-2.

Example 5-2. Data for the filter

<?xml version="1.0"?>
<!DOCTYPE book
  SYSTEM "/usr/local/prod/sgml/db.dtd"
[
  <!ENTITY thingy "hoo hah blah blah">
]>

<book id="mybook">
<?print newpage?>
  <title>GRXL in a Nutshell</title>
  <chapter id="intro">
    <title>What is GRXL?</title>
<!-- need a better title -->
    <para>
Yet another acronym.  That was our attitude at first, but then we saw 
the amazing uses of this new technology called
<literal>GRXL</literal>.  Consider the following program:
    </para>
<?print newpage?>
    <programlisting>AH aof -- %%%%
{{{{{{ let x = 0 }}}}}}
  print!  <lineannotation><literal>wow</literal></lineannotation>
or not!</programlisting>
<!-- what font should we use? -->
    <para>
What does it do?  Who cares?  It's just lovely to look at.  In fact,
I'd have to say, "&thingy;".
    </para>
<?print newpage?>
  </chapter>
</book>

The result, after running the program on the data, is shown in Example 5-3.

Example 5-3. Output from the filter

<book id="mybook">
  <title>GRXL in a Nutshell</title>
  <chapter id="intro">
    <title>What is GRXL?</title>
<comment> need a better title </comment>
    <para>
Yet another acronym.  That was our attitude at first, but then we saw 
the amazing uses of this new technology called
<literal>GRXL</literal>.  Consider the following program:
    </para>

    <programlisting>AH aof -- %%%%
{{{{{{ let x = 0 }}}}}}
  print!  <lineannotation>wow</lineannotation>
or not!</programlisting>
<comment> what font should we use? </comment>
    <para>
What does it do?  Who cares?  It's just lovely to look at.  In fact,
I'd have to say, "&thingy;".
    </para>

  </chapter>
</book>

Here's what the filter did right. It turned each XML comment into a <comment> element and deleted the processing instructions. The <literal> element in the <programlisting> was removed, with its contents left intact, while other <literal> elements were preserved. Entity references were left unresolved, as we wanted. So far, so good. But something's missing. The XML declaration, document type declaration, and internal subset are gone. Without the declaration for the entity thingy, the reference &thingy; cannot be resolved, so the output is no longer valid (and, with no DTD at all, not even well-formed). It looks like the handlers we had available to us were not sufficient.




Copyright © 2002 O'Reilly & Associates. All rights reserved.