XML::PYX (Perl and XML)

4.5. XML::PYX

In the Perl universe, standard APIs have been slow to catch on for many reasons. CPAN, the vast storehouse of publicly offered modules, grows organically, with no central authority to approve submissions. Also, because XML is a relative newcomer on the data format scene, the Perl community has only begun to work out standard solutions for it.

We can characterize the first era of XML hacking in Perl as the age of nonstandard parsers. It's a time when documentation is scarce and modules are experimental. There is much creativity and innovation, and just as much idiosyncrasy and quirkiness. Surprisingly, many of the tools that first appeared on the horizon were quite useful. It's fascinating territory for historians and developers alike.

XML::PYX is one of these early parsers. Streams naturally lend themselves to the concept of pipelines, where data output from one program can be plugged into another, creating a chain of processors. There's no reason why XML can't be handled that way, so an innovative and elegant processing style has evolved around this concept. Essentially, the XML is repackaged as a stream of easily recognizable and transformable symbols, and the repackaging can even be done by a command-line utility.

One example of this repackaging is PYX, a symbolic encoding of XML markup that is friendly to text processing languages like Perl. It cleverly presents each XML event on a separate line. Many Unix programs like awk and grep are line oriented, so they work well with PYX. Perl is happy with lines too.

Table 4-1 summarizes the notation of PYX.

Table 4-1. PYX notation

Symbol	Represents
(	An element start tag
)	An element end tag
-	Character data
A	An attribute
?	A processing instruction

For every event coming through the stream, PYX starts a new line, beginning with one of the five symbols shown in Table 4-1. The symbol is followed by the element name or whatever other data is pertinent to the event. Special characters are escaped with a backslash, as you would see in Perl code.
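Because every PYX line has the same shape (one symbol, then data), decoding a line takes only a few lines of Perl. The following is a minimal sketch, not part of XML::PYX: the name parse_pyx_line is hypothetical, and the set of escapes it handles (\n, \t, and \\) is an assumption about which sequences appear in the stream.

```perl
use strict;
use warnings;

# Map of PYX escape sequences to the characters they stand for
# (assumed set: newline, tab, and backslash).
my %UNESCAPE = ( 'n' => "\n", 't' => "\t", '\\' => '\\' );

# Split one PYX line into its event symbol and its unescaped data.
# Returns an empty list for a blank line.
sub parse_pyx_line {
    my ($line) = @_;
    chomp $line;
    return unless length $line;
    my $symbol = substr( $line, 0, 1 );
    my $data   = substr( $line, 1 );
    # Any other backslash sequence is left untouched.
    $data =~ s/\\([nt\\])/$UNESCAPE{$1}/g;
    return ( $symbol, $data );
}

# Single-quoted strings keep the backslash escapes literal, as in a PYX stream.
for my $line ( '(item', 'Aoptional yes', '-caviar', '-\n', ')item' ) {
    my ( $symbol, $data ) = parse_pyx_line($line);
    $data =~ s/\n/<newline>/g;    # make the newline visible when printing
    print "$symbol => [$data]\n";
}
```

For an attribute line, the data still holds the name and value together, separated by the first space; splitting them is left to the caller.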

Here's what the conversion of an XML document into PYX notation looks like. The following XML is the input to the parser:

<shoppinglist>
  <!-- brand is not important -->
  <item>toothpaste</item>
  <item>rocket engine</item>
  <item optional="yes">caviar</item>
</shoppinglist>

As PYX, it would look like this:

(shoppinglist
-\n
(item
-toothpaste
)item
-\n
(item
-rocket engine
)item
-\n
(item
Aoptional yes
-caviar
)item
-\n
)shoppinglist

Notice that the comment didn't come through in the PYX translation. PYX is a little simplistic in some ways, omitting some details in the markup. It will not alert you to CDATA marked sections, although it will let their content pass through. Perhaps the most serious loss is that of character entity references, which disappear from the stream. You should make sure you don't need that information before working with PYX.
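One way to see what survives the trip through PYX is to turn the stream back into XML. The sketch below does that for the five symbols in Table 4-1. It is an illustration only: the subroutine names pyx_to_xml, unescape, and xml_escape are hypothetical, it assumes that attribute lines directly follow the start-tag line of their element, and it assumes the escapes \n, \t, and \\.

```perl
use strict;
use warnings;

# Undo the backslash escapes used in PYX data (assumed set: \n, \t, \\).
sub unescape {
    my ($text) = @_;
    my %map = ( 'n' => "\n", 't' => "\t", '\\' => '\\' );
    $text =~ s/\\([nt\\])/$map{$1}/g;
    return $text;
}

# Escape markup-significant characters for XML output.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Turn a list of PYX lines back into an XML string.
sub pyx_to_xml {
    my @lines    = @_;
    my $xml      = '';
    my $tag_open = 0;    # true while a start tag still awaits its ">"
    for my $line (@lines) {
        next unless length $line;
        my ( $symbol, $data ) = ( substr( $line, 0, 1 ), substr( $line, 1 ) );
        if ( $symbol eq 'A' ) {
            # Attribute lines are added to the start tag that is still open.
            my ( $name, $value ) = split / /, $data, 2;
            $value = '' unless defined $value;
            $xml .= qq{ $name="} . xml_escape( unescape($value) ) . '"';
            next;
        }
        if ($tag_open) { $xml .= '>'; $tag_open = 0; }
        if    ( $symbol eq '(' ) { $xml .= "<$data"; $tag_open = 1; }
        elsif ( $symbol eq ')' ) { $xml .= "</$data>"; }
        elsif ( $symbol eq '-' ) { $xml .= xml_escape( unescape($data) ); }
        elsif ( $symbol eq '?' ) { $xml .= "<?$data?>"; }
    }
    return $xml;
}

# The PYX form of the shopping list; single quotes keep "\n" literal.
my @pyx = (
    '(shoppinglist', '-\n',
    '(item', '-toothpaste',    ')item', '-\n',
    '(item', '-rocket engine', ')item', '-\n',
    '(item', 'Aoptional yes', '-caviar', ')item', '-\n',
    ')shoppinglist',
);
print pyx_to_xml(@pyx), "\n";
```

The reconstructed document has the same elements, attribute, and text as the original shopping list, but the comment is gone for good, since nothing in the PYX stream recorded it.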

Matt Sergeant has written a module, XML::PYX, which parses XML and translates it into PYX. The compact program in Example 4-2 strips out all XML element tags, leaving only the character data.

Example 4-2. PYX parser

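use strict;
use warnings;
use XML::PYX;

# initialize parser and generate PYX
die "Usage: $0 file.xml\n" unless defined $ARGV[0];
my $parser = XML::PYX::Parser->new;
my $pyx = $parser->parsefile( $ARGV[0] );

# filter out the tags: PYX puts one event on each line, and
# character data lines begin with "-"
foreach( split( /\n/, $pyx )) {
  print $' if( /^-/ );    # $' holds the text after the leading "-"
}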

PYX is an interesting alternative to SAX and DOM for quick-and-dirty XML processing. It's useful for simple tasks like element counting, separating content from markup, and reporting simple events. However, it does lack sophistication, making it less attractive for complex processing.
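Element counting, for instance, comes down to tallying the lines that begin with an open parenthesis. Here is a minimal sketch that works on PYX lines already held in an array, so it needs no XML module at all; the name count_elements is hypothetical.

```perl
use strict;
use warnings;

# Count how often each element name is started in a list of PYX lines.
sub count_elements {
    my @lines = @_;
    my %count;
    for my $line (@lines) {
        # A start-tag line is "(" followed by the element name.
        $count{$1}++ if $line =~ /^\((.*)/;
    }
    return %count;
}

# The PYX form of the shopping list; single quotes keep "\n" literal.
my @pyx = (
    '(shoppinglist', '-\n',
    '(item', '-toothpaste',    ')item', '-\n',
    '(item', '-rocket engine', ')item', '-\n',
    '(item', 'Aoptional yes', '-caviar', ')item', '-\n',
    ')shoppinglist',
);

my %count = count_elements(@pyx);
print "$_: $count{$_}\n" for sort keys %count;
```

For the shopping list, this reports three item elements and one shoppinglist element. To use the same subroutine at the end of a pipeline, the array could be filled from standard input with chomp( my @pyx = <STDIN> ) instead of the literal list.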