Coding Strategies (Perl and XML)

This chapter sends you off by bringing this book's topics full circle. We return to many of the themes about XML processing in Perl that we introduced in Chapter 3, "XML Basics: Reading and Writing", but in the context of all the detailed material that we've covered in the intervening chapters. Our intent is to take you on one concluding tour through the world of Perl and XML, with its strategies and its gotchas, before sending you on your way.

10.1. Perl and XML Namespaces

You've seen XML namespaces used since we first mentioned this concept back in Chapter 2, "An XML Recap". Many XML applications, such as XSLT, insist that all their elements claim fealty to a certain namespace. The deciding factor here usually involves how symbiotic the application is in its usual use: does it usually work on its own, with a one-document-per-application style, or does it tend to mix with other sorts of XML?

DocBook XML, for example, is not very symbiotic. An instance of DocBook is almost always a whole XML document, defining a book or an article, and all the elements within such a document that aren't explicitly tied to some other namespace are found in the official DocBook documentation.[37] However, within a DocBook document, you might encounter a clump of MathML elements making their home in a rather parasitic fashion, nestled in among the folds of the DocBook elements, from which they derive nourishing context.

[37] See http://www.docbook.org or O'Reilly's DocBook: The Definitive Guide.

This sort of thing is useful for two reasons: first, DocBook, while its element spread tries to cover all kinds of things you might find in a piece of technical documentation,[38] doesn't have the capacity to richly describe everything that might go into a mathematical equation. (It does have <equation> elements, but they are often used to describe the nature of the graphic contained within them.) By adding MathML into the mix, you can use all the tags defined by that markup language's specification inside of a DocBook document, tucked away safely in their own namespace. (Since MathML and DocBook work so well together, the DocBook DTD allows a user to plug in a "MathML module," which adds a <mml:math> element to the mix. Within this mix, everything is handled by MathML's own DTD, which the module imports (along with DocBook's main DTD) into the whole DTD-space when validating.)

[38] Some would say, in fact, that it tries a little too hard; hence the existence of trimmed-down variants such as Simplified DocBook.

Second, and perhaps more interesting from the parser's point of view, tags existing in a given namespace work like embassies; while you stand on its soil (or in its scope), all that country's rules and regulations apply to you, despite the embassy's location in a foreign land. XML namespaces are also similar to Perl namespaces, which let you invoke variables, subroutines, and other symbols that live inside Some::Other::Package, though you may not have defined them within the default main package (or whatever package you are working in).
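The Perl half of that comparison can be sketched in a few lines. The package name Some::Other::Package comes from the text above; the subroutine greet and the variable $motto are made-up names used only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A second namespace, defined inline so the example is self-contained.
# The names greet and $motto are hypothetical.
package Some::Other::Package;

our $motto = "Ook";

sub greet {
    my ($name) = @_;
    return "Hello, $name";
}

# Back in the default namespace.
package main;

# Fully qualified names reach into the other package, much as a
# prefixed element name reaches into another XML namespace.
print Some::Other::Package::greet("Virtram"), "\n";
print "$Some::Other::Package::motto\n";
```

Nothing in main had to be declared for these symbols to resolve; the package qualifier alone says whose rules apply.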

In other words, the presence of a namespace often indicates that another, separate XML application is invoked within the current one. Thus, if you are writing a processor to handle a type of XML application and you know that a certain namespace will probably pop up within it, you can save yourself a lot of work by passing off the work to another Perl module that knows how to handle things in that other application.
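One way to arrange that hand-off is a dispatch table keyed by namespace URI. The following is a minimal sketch using XML::LibXML, not a definitive design: the sample document, the namespace URI, and the handler (which merely reports the element's local name) are all assumptions standing in for a real helper module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Sample namespace URI, for illustration only.
my $MM_NS = 'http://www.jmac.org/projects/monkeys/mm/';

# Map each foreign namespace URI to the code that understands it.
# A real program would call into a dedicated module here.
my %handler_for = (
    $MM_NS => sub {
        my ($node) = @_;
        return "monkey data: " . $node->localname;
    },
);

my $xml = <<'XML';
<list xmlns:mm="http://www.jmac.org/projects/monkeys/mm/">
  <item><mm:monkey><mm:name>Virtram</mm:name></mm:monkey></item>
</list>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Walk the tree. When an element belongs to a namespace we have a
# handler for, pass the whole subtree along and do not descend into it;
# otherwise keep walking.
sub walk {
    my ($node) = @_;
    my @results;
    for my $child ($node->childNodes) {
        next unless $child->nodeType == XML_ELEMENT_NODE;
        my $uri = $child->namespaceURI;
        if (defined $uri and exists $handler_for{$uri}) {
            push @results, $handler_for{$uri}->($child);
        }
        else {
            push @results, walk($child);
        }
    }
    return @results;
}

print "$_\n" for walk($doc->documentElement);
```

The main processor never needs to know what lives inside the foreign subtree; it only needs to recognize the URI at the border.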

URI Identifiers

Many XML technologies, such as XML namespaces, SAX2, and SOAP, rely on URIs as unique identifiers -- strings that differentiate features or properties to prevent ideological conflicts. Any processor that reads such a URI can be sure that it refers to the technology the author intended. URIs used in this way often look like URLs, usually of the http:// variety, which suggests that typing them into a web browser will cause something to happen. However, sometimes the only result is a disappointing HTTP 404 response. URIs, unlike URLs, don't have to point to an actual resource; they only have to be globally unique.

Developers who need to assign a new URI to something often base it on the URL of a web site they have some control over. For example, if you have exclusive control over http://www.greenmonkey-markup.com/~jmac, then you can assign URIs based on it, such as http://www.greenmonkey-markup.com/~jmac/monkeyml/. Even if nothing answers at that address, you are still guaranteed that nobody else will ever use that URI. However, polite developers tend to put something at these URIs -- preferably documentation about the technology that uses them.

Another popular solution involves using a service such as http://purl.org (no relation to Perl), which can put a layer of indirection between a URI you use as a namespace and the location that houses its documentation, letting you change the latter at will while keeping the former constant.

Sometimes a URI does convey information besides mere uniqueness. For example, many XML application processors are sticklers about URIs used to declare XML namespaces, with good reason. XSLT processors, for example, usually don't care that all stylesheet XSLT elements have the usual xsl: prefix, as much as they care what URI that prefix is bound to, in the appropriate xmlns:-prefixed attribute. Knowing what URI the prefix is bound to assures the processor that you're using, for example, the W3C's most recent version of XSLT, and not a pre-1.0 version that some bleeding-edge processor adopted (that has its own namespace).
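A short sketch can make the prefix-versus-URI distinction concrete. It uses XML::LibXML to check what a root element's prefix is bound to. The URI http://www.w3.org/1999/XSL/Transform is the XSLT 1.0 namespace; the three one-line "stylesheets", the is_xslt helper, and the example.com URI are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

my $XSLT_NS = 'http://www.w3.org/1999/XSL/Transform';

# Three hypothetical inputs: the usual prefix, an unusual prefix bound
# to the same URI, and the usual prefix bound to some other URI.
my %stylesheet = (
    usual_prefix => qq{<xsl:stylesheet version="1.0" xmlns:xsl="$XSLT_NS"/>},
    odd_prefix   => qq{<t:stylesheet version="1.0" xmlns:t="$XSLT_NS"/>},
    wrong_uri    => qq{<xsl:stylesheet version="1.0" xmlns:xsl="http://example.com/not-xslt"/>},
);

# Decide by local name and namespace URI; the prefix is never consulted.
sub is_xslt {
    my ($xml) = @_;
    my $doc  = XML::LibXML->load_xml(string => $xml);
    my $root = $doc->documentElement;
    my $uri  = $root->namespaceURI;
    return ($root->localname eq 'stylesheet'
            and defined $uri
            and $uri eq $XSLT_NS) ? 1 : 0;
}

for my $name (sort keys %stylesheet) {
    printf "%-12s %s\n", $name,
        is_xslt($stylesheet{$name}) ? 'XSLT' : 'not XSLT';
}
```

Under these assumptions, the first two inputs are treated as XSLT even though their prefixes differ, while the third is rejected despite its familiar xsl: prefix.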

Robin Berjon's XML::NamespaceSupport module, available on CPAN, can help you process XML documents that use namespaces and manage their prefix-to-URI mappings.
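As a rough illustration of what the module does, the sketch below tracks a single prefix as a processor enters and leaves the element that declares it. The mm prefix, its URI, and the element name are sample values; a real SAX handler would make these calls from its start_element and end_element methods.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::NamespaceSupport;

my $nsup = XML::NamespaceSupport->new;

# Entering an element that carries xmlns:mm="...": open a new scope
# and record the declaration.
$nsup->push_context;
$nsup->declare_prefix('mm', 'http://www.jmac.org/projects/monkeys/mm/');

# Resolve a qualified element name seen inside that scope into its
# namespace URI, prefix, and local name.
my ($uri, $prefix, $local) = $nsup->process_element_name('mm:monkey');
print "{$uri}$local (prefix $prefix)\n";

# Leaving the element: its declarations go out of scope.
$nsup->pop_context;
my $after = $nsup->get_uri('mm');
print defined $after ? "mm is still bound\n" : "mm is no longer bound\n";
```

The push and pop calls mirror element nesting, so a prefix declared deep in a document cannot leak into its siblings or ancestors.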

For example, let's say that on your machine you have an XML file whose document keeps a list of the monkeys living in your house. Much of this file contains elements of your own design, but because you are both crafty and lazy, your document also uses the Monkey Markup Language, a standard way to describe monkeys with XML. Because it's designed for use in larger documents, it defines its own namespace:

<?xml version="1.0"?>
<monkey-list>
 <monkey>
  <description xmlns:mm="http://www.jmac.org/projects/monkeys/mm/">
   <mm:monkey> <!-- start of monkey section -->
    <mm:name>Virtram</mm:name>
    <mm:color>teal</mm:color>
    <mm:favorite-foods>
     <mm:food>Banana</mm:food> <mm:food>Walnut</mm:food>
    </mm:favorite-foods>
    <mm:personality-matrix>
     F6 30 00 0A 1B E7 9C 20
    </mm:personality-matrix>
   </mm:monkey>
  </description>
  <location>Living Room</location>
  <job>Scarecrow</job>
 </monkey>
 <!-- Put more monkeys here later -->
</monkey-list>

Luckily, we have a Perl module on our system, XML::MonkeyML, that can parse a MonkeyML document into an object. This module is useful because the XML::MonkeyML class contains code for handling MonkeyML's personality-matrix element, which condenses a monkey's entire personality down to a short hexadecimal code. Let's write a program that predicts how all our monkeys will react in a given situation:

#!/usr/bin/perl

# This program takes an action specified on the command line, and
# applies it to every monkey listed in a monkey-list XML document
# (whose filename is also supplied on the command line)

use warnings;
use strict;

use XML::LibXML;
use XML::MonkeyML;

my ($filename, $action) = @ARGV;

unless (defined ($filename) and defined ($action)) {
  die "Usage: $0 monkey-list-file action\n";
}

my $parser = XML::LibXML->new;
my $doc = $parser->parse_file($filename);

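# Get all of the mm:monkey elements. The mm prefix is declared on
# <description>, which is out of scope at the document root, so we
# bind it explicitly in an XPath context before searching.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(mm => 'http://www.jmac.org/projects/monkeys/mm/');
my @monkey_nodes = $xpc->findnodes('//monkey/description/mm:monkey');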

foreach (@monkey_nodes) {
  my $monkeyml = XML::MonkeyML->parse_string($_->toString);
  my $name = $monkeyml->name . " the " . $monkeyml->color . " monkey";
  print "$name would react in the following fashion:\n";
  # The magic MonkeyML 'action' object method takes an English
  # description of an action performed on this monkey, and returns a
  # phrase describing the monkey's reaction.
  print $monkeyml->action($action); print "\n";
}

Here is the output:

$ ./monkey_action.pl monkeys.xml "Give it a banana"

Virtram the teal monkey would react in the following fashion:
Take the banana. Eat it. Say "Ook".

Speaking of laziness, let's look at how a programmer might create a helper module like XML::MonkeyML.
