Chapter 5. SAX
XML::Parser has done remarkably well as a multipurpose XML
parser and stream generator, but it really isn't the
future of Perl and XML. The problem is that we don't
want one standard parser for all ends and purposes; we want to be
able to choose from multiple parsers, each serving a different
purpose. One parser might be written completely in Perl for
portability, while another is accelerated with a core written in C.
Or, you might want a parser that translates one format (such as a
spreadsheet) into an XML stream. You simply can't
anticipate all the things a parser might be called on to do. Even
XML::Parser, with its many options and multiple
modes of operation, can't please everybody. The
future, then, is a multiplicity of parsers that cover any situation
you encounter.
An environment with multiple parsers demands some level of
consistency. If every parser had its own interface, developers would
go mad. Learning one interface and expecting all parsers
to comply with it is better than having to learn a hundred different
ways to do the same thing. We need a standard interface between
parsers and code: a universal plug that is flexible and reliable,
free from the individual quirks of any particular parser.
The XML development world has settled on an event-driven interface
called SAX. SAX evolved from discussions on the XML-DEV mailing list
and, shepherded by David Megginson,[24]
was quickly shaped into a useful specification. The first
incarnation, called SAX Level 1 (or just SAX1), supports elements,
attributes, and processing instructions. It doesn't
handle some other things like namespaces or CDATA sections, so the
second iteration, SAX2, was devised, adding support for just about
any event you can imagine in generic XML.
SAX has been a huge success. Its simplicity makes it easy to learn
and work with. Early development with XML was mostly in the realm of
Java, so SAX was codified as an interface construct. An interface
construct is a special kind of class that declares an
object's methods without implementing them, leaving
the implementation up to the developer.
Enthusiasm for SAX soon infected the Perl community and
implementations began to appear in CPAN, but there was a problem.
Perl doesn't provide a rigorous way to define a
standard interface like Java does. It has weak type checking
and forgives all kinds of inconsistencies. Whereas Java compares
argument types in functions with those defined in the interface
construct at compile time, Perl quietly accepts any arguments you
use. Thus, defining a standard in Perl is mostly a verbal activity,
relying on the developer's experience and
watchfulness to comply.
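Since Perl won't check an interface for us, the closest substitute is to ask an object at runtime which methods it implements. The following is only a sketch of that idea: the missing_handlers( ) function and the PartialHandler class are hypothetical names invented for illustration, and the list of expected methods is our own selection of PerlSAX handler names, not something any module enforces.

```perl
use strict;
use warnings;

# Handler names we choose to require. Perl itself never enforces
# this list, so we check it by hand.
my @EXPECTED = qw( start_document end_document
                   start_element end_element characters );

# Return the names of the expected methods that $handler lacks.
sub missing_handlers {
    my( $handler, @names ) = @_;
    return grep { !$handler->can( $_ ) } @names;
}

# A hypothetical, deliberately incomplete handler class.
package PartialHandler;
sub new           { return bless {}, shift }
sub start_element { }
sub characters    { }

package main;

my @missing = missing_handlers( PartialHandler->new, @EXPECTED );
print "missing: @missing\n";
```

A check like this catches a misspelled or forgotten method name, but it says nothing about whether the method treats its arguments correctly. That part still depends on the developer.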
One of the first Perl implementations of SAX is Ken
MacLeod's XML::Parser::PerlSAX
module. As a subclass of XML::Parser, it modifies
the stream of events from Expat to repackage them as SAX events.
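To get a feel for what "repackaging" means, here is a rough sketch of the idea. It is not the module's actual code. It assumes only the argument list that XML::Parser passes to a Start handler (the Expat object, the element name, and then attribute names and values) and the single-hash convention of SAX handlers. The names RecordingHandler and make_start_adapter( ) are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical SAX-style handler that just records what it receives.
package RecordingHandler;
sub new { return bless { events => [] }, shift }
sub start_element {
    my( $self, $properties ) = @_;
    push @{ $self->{events} }, $properties;
}

package main;

# Build a callback with the argument list of an XML::Parser Start
# handler: ( $expat, $element_name, attr1 => value1, ... ).
# It repackages those positional arguments as the single hash
# that a SAX handler expects.
sub make_start_adapter {
    my $handler = shift;
    return sub {
        my( $expat, $name, %attributes ) = @_;
        $handler->start_element(
            { Name => $name, Attributes => \%attributes } );
    };
}

my $handler = RecordingHandler->new;
my $start   = make_start_adapter( $handler );

# Simulate the call made for <book id="mybook">. The undef stands
# in for the Expat object, which this sketch ignores.
$start->( undef, 'book', id => 'mybook' );

my $event = $handler->{events}[0];
print "$event->{Name} id=$event->{Attributes}{id}\n";
```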
5.1. SAX Event Handlers
To use a typical SAX module
in a program, you must pass it an object whose methods implement
handlers for SAX events. Table 5-1 describes the
methods in a typical handler object. A SAX parser passes a hash to
each handler containing properties relevant to the event. For
example, in this hash, an element handler would receive the
element's name and a list of attributes.
Table 5-1. PerlSAX handlers
Method name | Event | Properties
start_document | The document processing has started (this is the first event) | (none defined)
end_document | The document processing is complete (this is the last event) | (none defined)
start_element | An element start tag or empty element tag was found | Name, Attributes
end_element | An element end tag or empty element tag was found | Name
characters | A string of nonmarkup characters (character data) was found | Data
processing_instruction | A parser encountered a processing instruction | Target, Data
comment | A parser encountered a comment | Data
start_cdata | The beginning of a CDATA section encountered (the following character data may contain reserved markup characters) | (none defined)
end_cdata | The end of an encountered CDATA section | (none defined)
entity_reference | An internal entity reference was found (as opposed to an external entity reference, which would indicate that a file needs to be loaded) | Name, Value
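To make the calling convention concrete, here is a minimal sketch of a handler object. So that it runs without any parser installed, we call the handler methods by hand with the kind of hashes a PerlSAX parser would pass. The OutlineHandler class and its depth and lines fields are hypothetical names invented for this illustration.

```perl
use strict;
use warnings;

# Hypothetical handler that builds an indented outline of elements.
package OutlineHandler;

sub new { return bless { depth => 0, lines => [] }, shift }

# Each handler receives the object plus one hash of event properties.
sub start_element {
    my( $self, $properties ) = @_;
    my %attributes = %{ $properties->{Attributes} || {} };
    my $line = ( '  ' x $self->{depth} ) . $properties->{Name};
    # sort the names, since a hash has no inherent order
    foreach my $name ( sort keys %attributes ) {
        $line .= " [$name=$attributes{$name}]";
    }
    push @{ $self->{lines} }, $line;
    $self->{depth}++;
}

sub end_element {
    my( $self, $properties ) = @_;
    $self->{depth}--;
}

package main;

my $handler = OutlineHandler->new;

# The calls a parser would make for <book id="mybook"><title/></book>.
# Note that the empty <title/> element triggers both handlers.
$handler->start_element( { Name => 'book',  Attributes => { id => 'mybook' } } );
$handler->start_element( { Name => 'title', Attributes => {} } );
$handler->end_element(   { Name => 'title' } );
$handler->end_element(   { Name => 'book' } );

print "$_\n" foreach @{ $handler->{lines} };
```

In a real program you would hand the object to the parser instead of calling its methods yourself. The handler code stays the same either way.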
A few notes about handler methods:
-
For an empty element, both the start_element( )
and end_element( ) handlers are called, in that
order. No handler exists specifically for empty elements.
-
The characters( ) handler may be called more
than once for a string of contiguous character data, parceling it
into pieces. For example, a parser might break text around an entity
reference, which is often more efficient for the parser.
-
The characters( ) handler will be called for any
whitespace between elements, even if it doesn't seem
like significant data. In XML, all characters are considered part of
the data, and it's simply more efficient for the parser not to
decide which whitespace is significant.
-
Handling of processing instructions, comments, and CDATA sections is
optional. In the absence of handlers, the data from processing
instructions and comments is discarded. For CDATA sections, calls are
still made to the characters( )
handler as before so the data will not be lost.
-
The start_cdata( ) and end_cdata( ) handlers do not
receive data. Instead, they merely act
as signals to tell you whether reserved markup characters can be
expected in future calls to the characters( )
handler.
-
In the absence of an entity_reference( )
handler, all internal entity references will be resolved
automatically by the parser, and the resulting text or markup will be
handled normally. If you do define an entity_reference( )
handler, the entity references will not be expanded and
you can do what you want with them.
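Two of these notes, chunked character data and whitespace between elements, affect almost every handler you write. The sketch below shows one way to cope with both: append each chunk to a buffer, then look at the buffer when the element ends. The TextCollector class is a hypothetical name, the events are fed in by hand rather than by a parser, and the approach assumes each element holds either text or child elements, not a mix.

```perl
use strict;
use warnings;

# Hypothetical handler that collects the text of each element.
package TextCollector;

sub new { return bless { buffer => '', found => [] }, shift }

# A new element starts with an empty buffer.
sub start_element {
    my( $self, $properties ) = @_;
    $self->{buffer} = '';
}

# May be called several times for one run of text,
# so append rather than assign.
sub characters {
    my( $self, $properties ) = @_;
    $self->{buffer} .= $properties->{Data};
}

# The text is complete only when the element ends.
sub end_element {
    my( $self, $properties ) = @_;
    my $text = $self->{buffer};
    $self->{buffer} = '';
    return unless $text =~ /\S/;    # skip whitespace-only text
    push @{ $self->{found} }, "$properties->{Name}: $text";
}

package main;

my $collector = TextCollector->new;

# Simulated events: whitespace between elements, and title text
# that arrives in three pieces split around a resolved &amp;.
$collector->start_element( { Name => 'book',  Attributes => {} } );
$collector->characters(    { Data => "\n  " } );
$collector->start_element( { Name => 'title', Attributes => {} } );
$collector->characters(    { Data => 'AT' } );
$collector->characters(    { Data => '&' } );
$collector->characters(    { Data => 'T' } );
$collector->end_element(   { Name => 'title' } );
$collector->characters(    { Data => "\n" } );
$collector->end_element(   { Name => 'book' } );

print "$_\n" foreach @{ $collector->{found} };
```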
Let's show an example now. We'll
write a program called a filter, a special processor that outputs a
replica of the original document with a few modifications.
Specifically, it makes these changes to a document:
-
Turns every XML comment into a <comment>
element
-
Deletes processing instructions
-
Removes tags, but leaves the content, for
<literal> elements that occur within
<programlisting> elements at any level
The code for this program is listed in Example 5-1.
As in the last program, we initialize the parser with a set of
handlers, except this time they are bundled together in a convenient
package: a class called MyHandler. Notice that
we've implemented a few more handlers, since we want
to be able to deal with comments, processing instructions, and the
document prolog.
Example 5-1. Filter program
# initialize the parser
#
use XML::Parser::PerlSAX;
my $parser = XML::Parser::PerlSAX->new( Handler => MyHandler->new( ) );
if( my $file = shift @ARGV ) {
    $parser->parse( Source => {SystemId => $file} );
} else {
    my $input = "";
    while( <STDIN> ) { $input .= $_; }
    $parser->parse( Source => {String => $input} );
}
exit;
#
# global variables
#
my @element_stack; # remembers element names
my $in_intset; # flag: are we in the internal subset?
###
### Document Handler Package
###
package MyHandler;
#
# initialize the handler package
#
sub new {
    my $type = shift;
    return bless {}, $type;
}
#
# handle a start-of-element event: output start tag and attributes
#
sub start_element {
    my( $self, $properties ) = @_;
    # note: the hash %{$properties} will lose attribute order
    # close internal subset if still open
    output( "]>\n" ) if( $in_intset );
    $in_intset = 0;
    # remember the name by pushing onto the stack
    push( @element_stack, $properties->{'Name'} );
    # output the tag and attributes UNLESS it's a <literal>
    # inside a <programlisting>
    unless( stack_top( 'literal' ) and
            stack_contains( 'programlisting' )) {
        output( "<" . $properties->{'Name'} );
        my %attributes = %{$properties->{'Attributes'}};
        foreach( keys( %attributes )) {
            output( " $_=\"" . $attributes{$_} . "\"" );
        }
        output( ">" );
    }
}
#
# handle an end-of-element event: output end tag UNLESS it's from a
# <literal> inside a <programlisting>
#
sub end_element {
    my( $self, $properties ) = @_;
    output( "</" . $properties->{'Name'} . ">" )
        unless( stack_top( 'literal' ) and
                stack_contains( 'programlisting' ));
    pop( @element_stack );
}
#
# handle a character data event
#
sub characters {
    my( $self, $properties ) = @_;
    # parser unfortunately resolves some character entities for us,
    # so we need to replace them with entity references again
    # (ampersand first, and /g to catch every occurrence)
    my $data = $properties->{'Data'};
    $data =~ s/&/&amp;/g;
    $data =~ s/</&lt;/g;
    $data =~ s/>/&gt;/g;
    output( $data );
}
#
# handle a comment event: turn into a <comment> element
#
sub comment {
    my( $self, $properties ) = @_;
    output( "<comment>" . $properties->{'Data'} . "</comment>" );
}
#
# handle a PI event: delete it
#
sub processing_instruction {
    # do nothing!
}
#
# handle internal entity reference (we don't want them resolved)
#
sub entity_reference {
    my( $self, $properties ) = @_;
    output( "&" . $properties->{'Name'} . ";" );
}
#
# helper: is the innermost open element named $guess?
#
sub stack_top {
    my $guess = shift;
    return $element_stack[ $#element_stack ] eq $guess;
}
#
# helper: is any open element named $guess?
#
sub stack_contains {
    my $guess = shift;
    foreach( @element_stack ) {
        return 1 if( $_ eq $guess );
    }
    return 0;
}
#
# helper: write a string to standard output
#
sub output {
    my $string = shift;
    print $string;
}
Looking closely at the handlers, we see that one argument is passed,
in addition to the obligatory object reference
$self. This argument is a reference to a hash of
properties about the event. This technique has one disadvantage: in
the element start handler, the attributes are stored in a hash, which
has no memory of the original attribute order. Semantically, this is
not a big deal, since XML is supposed to be ignorant of attribute
order. However, there may be cases when you want to replicate that
order.[25]
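The original order can't be recovered from the hash, but you can at least make the output predictable from one run to the next. The sketch below sorts the attribute names and also escapes the values, which the filter above leaves alone. Both start_tag( ) and escape_attribute( ) are hypothetical helpers written for this illustration, not part of any SAX module.

```perl
use strict;
use warnings;

# Escape the characters that may not appear literally in an
# attribute value delimited by double quotes.
sub escape_attribute {
    my $value = shift;
    $value =~ s/&/&amp;/g;     # ampersand first, to avoid double escaping
    $value =~ s/</&lt;/g;
    $value =~ s/"/&quot;/g;
    return $value;
}

# Serialize a start tag from a start_element properties hash,
# sorting attribute names so the output is the same on every run.
# This is a stable order, not the order used in the source document.
sub start_tag {
    my $properties = shift;
    my %attributes = %{ $properties->{Attributes} || {} };
    my $tag = '<' . $properties->{Name};
    foreach my $name ( sort keys %attributes ) {
        $tag .= ' ' . $name . '="'
              . escape_attribute( $attributes{$name} ) . '"';
    }
    return $tag . '>';
}

my $tag = start_tag( { Name       => 'chapter',
                       Attributes => { label => 'Q&A', id => 'intro' } } );
print "$tag\n";
```

Sorting is a compromise. It keeps repeated runs of the filter from producing spurious differences, but a document whose author wrote label before id will still come out rearranged.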
As a filter, this program should preserve everything about the original
document, except for the few details that have to be changed. That
means it has to deal with the document prolog, processing instructions,
and comments rather than ignore them. Even entity references should be
preserved as they are instead of being resolved (as the parser may want
to do). Therefore, the program has a few more handlers than the last
example, in which we were interested only in extracting very specific
information.
Let's test this program now. Our input datafile is
listed in Example 5-2.
Example 5-2. Data for the filter
<?xml version="1.0"?>
<!DOCTYPE book
SYSTEM "/usr/local/prod/sgml/db.dtd"
[
<!ENTITY thingy "hoo hah blah blah">
]>
<book id="mybook">
<?print newpage?>
<title>GRXL in a Nutshell</title>
<chapter id="intro">
<title>What is GRXL?</title>
<!-- need a better title -->
<para>
Yet another acronym. That was our attitude at first, but then we saw
the amazing uses of this new technology called
<literal>GRXL</literal>. Consider the following program:
</para>
<?print newpage?>
<programlisting>AH aof -- %%%%
{{{{{{ let x = 0 }}}}}}
print! <lineannotation><literal>wow</literal></lineannotation>
or not!</programlisting>
<!-- what font should we use? -->
<para>
What does it do? Who cares? It's just lovely to look at. In fact,
I'd have to say, "&thingy;".
</para>
<?print newpage?>
</chapter>
</book>
The result, after running the program on the data, is shown in Example 5-3.
Example 5-3. Output from the filter
<book id="mybook">
<title>GRXL in a Nutshell</title>
<chapter id="intro">
<title>What is GRXL?</title>
<comment> need a better title </comment>
<para>
Yet another acronym. That was our attitude at first, but then we saw
the amazing uses of this new technology called
<literal>GRXL</literal>. Consider the following program:
</para>
<programlisting>AH aof -- %%%%
{{{{{{ let x = 0 }}}}}}
print! <lineannotation>wow</lineannotation>
or not!</programlisting>
<comment> what font should we use? </comment>
<para>
What does it do? Who cares? It's just lovely to look at. In fact,
I'd have to say, "&thingy;".
</para>
</chapter>
</book>
Here's what the filter did right. It turned each XML
comment into a <comment> element and deleted
the processing instructions. The <literal>
element in the <programlisting> was removed,
with its contents left intact, while other
<literal> elements were preserved. Entity
references were left unresolved, as we wanted. So far, so good. But
something's missing. The XML declaration, document
type declaration, and internal subset are gone. Without the
declaration for the entity thingy, this document
is not valid. It looks like the handlers we had available to us were not
sufficient.
Copyright © 2002 O'Reilly & Associates. All rights reserved.