XML::LibXML (Perl and XML)

7.4. XML::LibXML

Matt Sergeant's XML::LibXML module is an interface to the GNOME project's libxml2 library. It's quickly becoming a popular implementation of DOM, offering greater speed and completeness than the older XML::Parser-based modules. It also implements DOM Level 2, which means it supports namespaces.

So far, we haven't worked much with namespaces. A lot of people opt to avoid them. They add a new level of complexity to markup and code, since you have to handle both local names and prefixes. However, namespaces are becoming more important in XML, and sooner or later, we all will have to deal with them. The popular transformation language XSLT uses namespaces to distinguish between tags that are instructions and tags that are data (i.e., which elements should be output and which should be used to control the output).

You'll even see namespaces used in good old HTML. Namespaces provide a way to import specialized markup into documents, such as equations into regular HTML pages. The MathML language (http://www.w3.org/Math/) does just that. Example 7-1 shows an HTML document that uses a namespace to incorporate MathML.

Example 7-1. A document with namespaces

<html>
<body xmlns:eq="http://www.w3.org/1998/Math/MathML">
<h1>Billybob's Theory</h1>
<p>
It is well-known that cats cannot be herded easily. That is, they do
not tend to run in a straight line for any length of time unless they
really want to. A cat forced to run in a straight line against its
will has an increasing probability, with distance, of deviating from
the line just to spite you, given by this formula:</p>
<p>
  <!-- P = 1 - 1/(x^2) -->
  <eq:math>
    <eq:mi>P</eq:mi><eq:mo>=</eq:mo><eq:mn>1</eq:mn><eq:mo>-</eq:mo>
    <eq:mfrac>
      <eq:mn>1</eq:mn>
      <eq:msup>
        <eq:mi>x</eq:mi>
        <eq:mn>2</eq:mn>
      </eq:msup>
    </eq:mfrac>
  </eq:math>
</p>
</body>
</html>

The tags with eq: prefixes are part of a namespace identified by the URI http://www.w3.org/1998/Math/MathML, which is declared by the xmlns:eq attribute on the <body> element. Using a namespace helps the browser distinguish between what is native to HTML and what is not. Browsers that understand MathML route the qualified elements to their equation formatter instead of the regular HTML formatter.
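To see how a namespace-aware DOM reports these names, here is a minimal sketch. It assumes a cut-down copy of Example 7-1 inlined as a string, and the helper name_parts is our own illustrative name, not part of the module. It shows the qualified name, prefix, local name, and namespace URI for one plain HTML element and one MathML element:

```perl
use strict;
use warnings;
use XML::LibXML;

my $mathuri = 'http://www.w3.org/1998/Math/MathML';

# A cut-down version of Example 7-1, inlined so the sketch is self-contained.
my $doc = XML::LibXML->new->parse_string( <<"XML" );
<html>
<body xmlns:eq="$mathuri">
<p><eq:math><eq:mi>P</eq:mi></eq:math></p>
</body>
</html>
XML

# Hypothetical helper: collect the four name-related properties of an element.
sub name_parts {
  my $elem = shift;
  return [ $elem->nodeName, $elem->prefix, $elem->localname, $elem->namespaceURI ];
}

my ( $para ) = $doc->getElementsByTagName( 'p' );
my ( $math ) = $doc->getElementsByTagNameNS( $mathuri, 'math' );

foreach my $elem ( $para, $math ) {
  # Unqualified elements have no prefix or URI, so show those as "(none)".
  my @parts = map { defined $_ ? $_ : '(none)' } @{ name_parts( $elem ) };
  printf "name=%s prefix=%s local=%s uri=%s\n", @parts;
}
```

The prefix is only a local shorthand. The URI is what actually identifies the namespace, which is why the second lookup asks for the element by URI and local name rather than by the string eq:math.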

Some browsers are confused by the MathML tags and render them unpredictably. A useful utility, then, is a program that detects and removes the namespace-qualified elements that would gum up an older HTML processor. The following example uses DOM2 to sift through a document and strip out all elements that have a namespace prefix.

The first step is to parse the file:

use XML::LibXML;

my $parser = XML::LibXML->new( );
my $doc = $parser->parse_file( shift @ARGV );

Next, we locate the document element and run a recursive subroutine on it to ferret out the namespace-qualified elements. Afterwards, we print out the document:

my $mathuri = 'http://www.w3.org/1998/Math/MathML';
my $root = $doc->getDocumentElement;
&purge_nselems( $root );
print $doc->toString;

This routine takes an element node and, if it has a namespace prefix, removes it from its parent's content list. Otherwise, it goes on to process the descendants:

sub purge_nselems {
  my $elem = shift;
  return unless( ref( $elem ) =~ /Element/ );
  if( $elem->prefix ) {
    my $parent = $elem->parentNode;
    $parent->removeChild( $elem );
  } elsif( $elem->hasChildNodes ) {
    my @children = $elem->getChildnodes;
    foreach my $child ( @children ) {
      &purge_nselems( $child );
    }
  }
}
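
Testing for a prefix misses elements that pick up their namespace from a default xmlns declaration, since those carry no prefix at all. The following is a sketch of one possible variant, not part of the program above: a routine we've named purge_by_uri (a hypothetical name) that compares each element's namespace URI against $mathuri instead. The sample document is also invented for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

my $mathuri = 'http://www.w3.org/1998/Math/MathML';

# Hypothetical variant of purge_nselems: remove elements bound to a given
# namespace URI, whatever prefix (if any) the document uses for it.
sub purge_by_uri {
  my ( $elem, $uri ) = @_;
  return unless $elem->isa( 'XML::LibXML::Element' );
  my $ns = $elem->namespaceURI;
  if( defined $ns and $ns eq $uri ) {
    $elem->parentNode->removeChild( $elem );
  } else {
    # The child list is copied into a Perl list first, so removing
    # a child during the loop does not disturb the iteration.
    my @children = $elem->childNodes;
    purge_by_uri( $_, $uri ) foreach @children;
  }
}

# Here MathML is the default namespace of <math>, so there is no prefix to test.
my $doc = XML::LibXML->new->parse_string(
  '<html><body><p>before</p>'
  . qq{<math xmlns="$mathuri"><mi>x</mi></math>}
  . '<p>after</p></body></html>' );

purge_by_uri( $doc->documentElement, $mathuri );
print $doc->toString;
```

Matching on the URI is stricter than matching on any prefix: it removes only MathML and leaves elements from other namespaces alone, which may or may not be what you want in a general-purpose cleanup tool.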

You might have noticed that this DOM implementation adds some Perlish conveniences over the recommended DOM interface. The call to getChildnodes, in an array context, returns a Perl list instead of a more cumbersome NodeList object. Called in a scalar context, it would return the number of child nodes for that node, so NodeLists aren't really used at all.
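
As a small illustration of the list-context behavior, here is a sketch that uses an invented three-element document. It calls the DOM-named method childNodes, and it counts the children by measuring the Perl array. That sidesteps the scalar-context return value, which may depend on the release of the module you have installed.

```perl
use strict;
use warnings;
use XML::LibXML;

# Invented sample document with no whitespace, so every child is an element.
my $doc  = XML::LibXML->new->parse_string( '<list><a/><b/><c/></list>' );
my $root = $doc->documentElement;

# List context: an ordinary Perl list of node objects.
my @children = $root->childNodes;
my @names    = map { $_->nodeName } @children;

# Count the array itself rather than relying on scalar-context behavior.
my $count = scalar @children;

print "$count children: @names\n";
```

Because the result is a plain list, the usual Perl tools such as map, grep, and foreach apply to it directly.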

Simplifications like this are common in the Perl world, and no one really seems to mind. The emphasis is usually on ease of use over rigorous object-oriented protocol. Of course, one would hope that all DOM implementations in the Perl world adopt the same conventions, which is why many long discussions on the perl-xml mailing list try to decide the best way to adopt standards. A current debate discusses how to implement SAX2 (which supports namespaces) in the most logical, Perlish way.

Matt Sergeant has stocked the XML::LibXML package with other goodies. The Node class has a method called findnodes( ), which takes an XPath expression as an argument, allowing retrieval of nodes in more flexible ways than permitted by the ordinary DOM interface. The parser has options that control how pedantically the parser runs, entity resolution, and whitespace significance. One can also opt to use special handlers for unparsed entities. Overall, this module is excellent for DOM programming.
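
As a rough sketch of what findnodes( ) makes possible, the following locates every MathML element in a cut-down copy of Example 7-1 with a single XPath expression, with no recursive subroutine required. The sample document and variable names are our own. The expressions use the standard XPath namespace-uri( ) function, so they don't depend on which prefix the document happens to use.

```perl
use strict;
use warnings;
use XML::LibXML;

my $mathuri = 'http://www.w3.org/1998/Math/MathML';

# A cut-down version of Example 7-1, inlined so the sketch is self-contained.
my $doc = XML::LibXML->new->parse_string( <<"XML" );
<html>
<body xmlns:eq="$mathuri">
<p>No equation here.</p>
<p><eq:math><eq:mi>P</eq:mi><eq:mo>=</eq:mo><eq:mn>1</eq:mn></eq:math></p>
</body>
</html>
XML

# Select every element in the MathML namespace by URI, not by prefix.
my @math_elems = $doc->findnodes( "//*[namespace-uri() = '$mathuri']" );
my @math_names = map { $_->localname } @math_elems;

# Select only the <p> elements that contain at least one such element.
my @paras = $doc->findnodes( "//p[.//*[namespace-uri() = '$mathuri']]" );

print "MathML elements: @math_names\n";
print "Paragraphs with equations: ", scalar @paras, "\n";
```

The same idea could replace the tree walk in the purging program: collect the matching nodes with one query, then remove each from its parent.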