4.5. XML::PYX
In the Perl universe, standard APIs have been slow to catch on for many
reasons. CPAN, the vast storehouse of publicly offered modules, grows
organically, with no central authority to approve of a submission.
Also, because XML is a relative newcomer on the data format scene, the
Perl community has only begun to work out standard solutions.
We can characterize the first era of XML hacking in Perl as the
age of nonstandard parsers. It's a time when
documentation is scarce and modules are experimental. There is much
creativity and innovation, and just as much idiosyncrasy and
quirkiness. Surprisingly, many of the tools that first appeared on
the horizon were quite useful. It's fascinating
territory for historians and developers alike.
XML::PYX is one of these early parsers. Streams
naturally lend themselves to the concept of pipelines, where data
output from one program can be plugged into another, creating a chain
of processors. There's no reason why XML
can't be handled that way, so an innovative and
elegant processing style has evolved around this concept.
Essentially, the XML is repackaged as a stream of easily recognizable
and transmutable symbols, a job that can even be done by a command-line utility.
One example of this repackaging is PYX, a symbolic encoding of XML
markup that is friendly to text processing languages like Perl. It
presents each XML event on a separate line. Many Unix programs like
awk and grep are line oriented, so they work well with PYX. Perl
handles line-by-line input comfortably too.
Table 4-1 summarizes the notation of PYX.
Table 4-1. PYX notation
Symbol | Represents
(      | An element start tag
)      | An element end tag
-      | Character data
A      | An attribute
?      | A processing instruction
For every event coming through the stream, PYX starts a new line,
beginning with one of the five symbols shown in Table 4-1. The symbol
is followed by the element name or whatever other data is pertinent.
Special characters are escaped with a backslash, as you would see in
Perl code.
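To make the line format concrete, here is a minimal sketch of a decoder for a single PYX line. The function name decode_pyx_line and the event names are invented for illustration, and the escape set (\n, \t, and \\) is an assumption; check what your PYX producer actually emits.

```perl
use strict;
use warnings;

# Map each leading symbol from Table 4-1 to a (hypothetical) event name.
my %EVENT_FOR = (
    '(' => 'start',
    ')' => 'end',
    '-' => 'text',
    'A' => 'attribute',
    '?' => 'pi',
);

# Assumed escape set: \n, \t, and \\ (not confirmed by the text above).
my %UNESCAPE = ( n => "\n", t => "\t", '\\' => '\\' );

# Split one PYX line into (event name, payload), undoing the escapes.
# Returns an empty list for blank lines or unknown symbols.
sub decode_pyx_line {
    my ($line) = @_;
    chomp $line;
    my ( $symbol, $data ) = $line =~ /^(.)(.*)$/s
        or return;
    my $event = $EVENT_FOR{$symbol} or return;
    $data =~ s/\\([nt\\])/$UNESCAPE{$1}/g;
    return ( $event, $data );
}

my ( $event, $data ) = decode_pyx_line('Aoptional yes');
print "$event: $data\n";    # prints "attribute: optional yes"
```

Because the first character alone identifies the event, a dispatch table keyed on that character is usually all the parsing a PYX consumer needs.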
Here's how a parser would convert an XML document into PYX notation.
The following XML is the parser's input:
<shoppinglist>
<!-- brand is not important -->
<item>toothpaste</item>
<item>rocket engine</item>
<item optional="yes">caviar</item>
</shoppinglist>
As PYX, it would look like this:
(shoppinglist
-\n
(item
-toothpaste
)item
-\n
(item
-rocket engine
)item
-\n
(item
Aoptional yes
-caviar
)item
-\n
)shoppinglist
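One way to check your reading of the notation is to go the other direction. The following is a rough sketch, not part of any module, that rebuilds XML text from PYX lines. The names pyx_to_xml, unescape_pyx, and escape_xml are hypothetical, and the escape set (\n, \t, \\) is an assumption. The one subtlety is that attribute lines arrive after their start tag line, so the start tag is held open until the next non-attribute event.

```perl
use strict;
use warnings;

# Assumed PYX escapes and the XML characters that need re-escaping.
my %UNESCAPE = ( n => "\n", t => "\t", '\\' => '\\' );
my %ENTITY   = ( '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' );

sub unescape_pyx { my ($s) = @_; $s =~ s/\\([nt\\])/$UNESCAPE{$1}/g; return $s }
sub escape_xml   { my ($s) = @_; $s =~ s/([&<>"])/$ENTITY{$1}/g;     return $s }

# Rebuild XML text from a list of PYX lines.
sub pyx_to_xml {
    my @lines = @_;
    my $xml   = '';
    my $open;    # start tag still waiting for its attributes

    for my $line (@lines) {
        my ( $symbol, $data ) = $line =~ /^(.)(.*)$/ or next;
        $data = unescape_pyx($data);

        if ( $symbol eq 'A' ) {
            # Attribute lines look like "Aname value".
            my ( $name, $value ) = split / /, $data, 2;
            $value = '' unless defined $value;
            $open .= sprintf ' %s="%s"', $name, escape_xml($value)
                if defined $open;
            next;
        }

        # Any other event closes a pending start tag.
        if ( defined $open ) { $xml .= "$open>"; undef $open }

        if    ( $symbol eq '(' ) { $open = "<$data" }
        elsif ( $symbol eq ')' ) { $xml .= "</$data>" }
        elsif ( $symbol eq '-' ) { $xml .= escape_xml($data) }
        elsif ( $symbol eq '?' ) { $xml .= "<?$data?>" }
    }
    $xml .= "$open>" if defined $open;
    return $xml;
}

# A fragment of the shopping list; <<'END' keeps the backslashes literal.
my @pyx = split /\n/, <<'END';
(item
Aoptional yes
-caviar
)item
END
print pyx_to_xml(@pyx), "\n";    # prints <item optional="yes">caviar</item>
```

Anything the PYX stream never carried, such as the comment in the input document, cannot be recovered this way.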
Notice that the comment didn't come through in the
PYX translation. PYX is a little simplistic in some ways, omitting
some details in the markup. It will not alert you to CDATA markup
sections, although it will let the content pass through. Perhaps the
most serious loss is that character entity references disappear from
the stream. You should make sure you don't need that
information before working with PYX.
Matt Sergeant has written a module, XML::PYX, which parses XML
and translates it into PYX. The compact program in Example 4-2 strips
out all XML element tags, leaving only the character data.
Example 4-2. PYX parser
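use strict;
use warnings;
use XML::PYX;

# require a filename on the command line
die "usage: $0 file.xml\n" unless defined $ARGV[0];

# initialize parser and generate PYX
my $parser = XML::PYX::Parser->new;
my $pyx    = $parser->parsefile( $ARGV[0] );

# filter out the tags: PYX puts one event on each line, so split on
# newlines and keep only character data lines (those starting with "-")
foreach my $line ( split( /\n/, $pyx )) {
    # print the data after the "-"; PYX escapes such as \n stay as they are
    print $1 if( $line =~ /^-(.*)/ );
}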
PYX is an interesting alternative to SAX and DOM for quick-and-dirty
XML processing. It's useful for simple tasks like
element counting, separating content from markup, and reporting
simple events. However, it does lack sophistication, making it less
attractive for complex processing.
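As an illustration of how little code such a task needs, here is a minimal element-counting sketch. It works on PYX lines that have already been produced, and the function name count_elements is made up for this example. Every line beginning with "(" marks one element start, so tallying those lines by name is enough.

```perl
use strict;
use warnings;

# Tally element names from PYX lines; each "(" line is one element start.
sub count_elements {
    my @lines = @_;
    my %count;
    for my $line (@lines) {
        $count{$1}++ if $line =~ /^\((.+)/;
    }
    return \%count;
}

my $count = count_elements(
    '(shoppinglist', '(item', '-toothpaste', ')item',
    '(item', 'Aoptional yes', '-caviar', ')item', ')shoppinglist',
);

# Report each element name with its tally, sorted by name.
printf "%-15s %d\n", $_, $count->{$_} for sort keys %$count;
```

For the sample lines above, the report shows two item elements and one shoppinglist element. To count elements in a real file, you could feed the function the lines obtained by splitting a PYX string on newlines.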
Copyright © 2002 O'Reilly & Associates. All rights reserved.