XML::DOM (Perl and XML)

7.3. XML::DOM

Enno Derksen's XML::DOM module is a good place to start exploring DOM in Perl. It's a complete implementation of Level 1 DOM with a few extra features thrown in for convenience. XML::DOM::Parser extends XML::Parser to build a document tree, stores that tree in an XML::DOM::Document object, and returns a reference to the object. This reference gives you complete access to the tree. The rest, we happily report, works pretty much as you'd expect.

Here's a program that uses DOM to process an XHTML file. It looks inside <p> elements for the word "monkeys," replacing every instance with a link to monkeystuff.com. Sure, you could do it with a regular expression substitution, but this example is valuable because it shows how to search for and create new nodes, and read and change values, all in the unique DOM style.

The first part of the program creates a parser object and gives it a file to parse with the call to parsefile( ):

use strict;
use warnings;
use XML::DOM;

process_file( shift @ARGV );

sub process_file {
    my $infile = shift;
    my $dom_parser = XML::DOM::Parser->new;           # create a parser object
    my $doc = $dom_parser->parsefile( $infile );      # make it parse a file
    add_links( $doc );                                # perform our changes
    print $doc->toString;                             # output the tree again
    $doc->dispose;                                    # clean up memory
}

The parsefile( ) method returns a reference to an XML::DOM::Document object, which is our gateway to the nodes inside. We pass this reference along to a routine called add_links( ), which does all the processing we require. Finally, we output the tree with a call to toString( ) and then dispose of the object. This last step is necessary cleanup: nodes in the tree refer to one another in cycles, which Perl's reference counting cannot free by itself, so skipping dispose( ) could result in a memory leak.
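For quick experiments it can be handy to parse a string instead of a file. The following is a minimal sketch of that idea, assuming the parse( ) method that XML::DOM::Parser inherits from XML::Parser; the inline document and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use XML::DOM;

# Hypothetical inline document, used here instead of a file on disk
my $xml = '<html><body><p>Monkeys are cute.</p></body></html>';

my $parser = XML::DOM::Parser->new;
my $doc    = $parser->parse( $xml );       # parse a string rather than a file

# The document object leads to the top-level element node
my $root      = $doc->getDocumentElement;
my $root_name = $root->getTagName;
print "Root element: $root_name\n";

$doc->dispose;                             # break circular references
```

As in the main program, everything else hangs off the document reference, so the string form is convenient for trying out individual DOM calls.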

The next part burrows into the tree to start processing paragraphs:

sub add_links {
    my $doc = shift;                                  

    # find all the <p> elements
    my $paras = $doc->getElementsByTagName( "p" );
    for( my $i = 0; $i < $paras->getLength; $i++ ) {
        my $para = $paras->item( $i );

        # for each child of a <p>, if it is a text node, process it
        my @children = $para->getChildNodes;
        foreach my $node ( @children ) {
            fix_text( $node ) if $node->getNodeType == TEXT_NODE;
        }
    }
}

The add_links( ) routine starts with a call to the document object's getElementsByTagName( ) method. It returns an XML::DOM::NodeList object containing all matching <p>s in the document (multilevel searching is so convenient) from which we can select nodes by index using item( ).
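To see the multilevel search in action, here is a small sketch with one <p> directly under the root and another nested inside a <div>. The sample markup is hypothetical; the calls are the same getElementsByTagName( ), getLength( ), and item( ) used above.

```perl
use strict;
use warnings;
use XML::DOM;

# Hypothetical sample: one <p> under <body>, one nested inside a <div>
my $xml = '<body><p>first</p><div><p>second</p></div></body>';

my $parser = XML::DOM::Parser->new;
my $doc    = $parser->parse( $xml );

# Scalar context gives back an XML::DOM::NodeList object
my $paras = $doc->getElementsByTagName( 'p' );
my $count = $paras->getLength;
print "Found $count paragraphs\n";

# Select each match by its index in the list
for my $i ( 0 .. $count - 1 ) {
    my $para = $paras->item( $i );
    print "  $i: ", $para->toString, "\n";
}

$doc->dispose;
```

Both paragraphs turn up in the list, regardless of how deeply each one is nested.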

The bit we're interested in will be hiding inside a text node inside the <p> element, so we have to iterate over the children to find text nodes and process them. The call to getChildNodes( ) returns the element's child nodes, either as a generic Perl list (when called in list context) or as another XML::DOM::NodeList object (when called in scalar context); for variety's sake, we've selected the first option. For each node, we test its type with a call to getNodeType( ) and compare the result to XML::DOM's numeric constant for text nodes, provided by TEXT_NODE( ). Nodes that pass the test are sent off to a routine for some node massaging.
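The two calling styles of getChildNodes( ) and the node-type test can be tried side by side. This is only a sketch under the assumption that the paragraph below is the whole document; the labels in @kinds are our own, not part of XML::DOM.

```perl
use strict;
use warnings;
use XML::DOM;

my $parser = XML::DOM::Parser->new;
my $doc    = $parser->parse( '<p>Monkeys are <b>cute</b>.</p>' );
my $para   = $doc->getDocumentElement;

# List context: an ordinary Perl list of nodes
my @children = $para->getChildNodes;

# Scalar context: an XML::DOM::NodeList object
my $list   = $para->getChildNodes;
my $length = $list->getLength;

# Classify each child by comparing its type to the exported constants
my @kinds;
for my $node ( @children ) {
    my $type = $node->getNodeType;
    push @kinds, $type == TEXT_NODE()    ? 'text'
               : $type == ELEMENT_NODE() ? 'element'
               :                           'other';
}

print scalar( @children ), " children: @kinds\n";
print "NodeList length: $length\n";

$doc->dispose;
```

The node-type constants are plain numbers, which is why the comparison uses == rather than a string operator.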

The last part of the program targets text nodes and splits them around the word "monkeys" to create a link:

sub fix_text {
    my $node = shift;
    my $text = $node->getNodeValue;
    if( $text =~ /(monkeys)/i ) {

        # split the text node into 2 text nodes around the monkey word
        my( $pre, $orig, $post ) = ( $`, $1, $' );
        my $tnode = $node->getOwnerDocument->createTextNode( $pre );
        $node->getParentNode->insertBefore( $tnode, $node );
        $node->setNodeValue( $post );

        # insert an <a> element between the two nodes
        my $link = $node->getOwnerDocument->createElement( 'a' );
        $link->setAttribute( 'href', 'http://www.monkeystuff.com/' );
        $tnode = $node->getOwnerDocument->createTextNode( $orig );
        $link->appendChild( $tnode );
        $node->getParentNode->insertBefore( $link, $node );

        # recurse on the rest of the text node 
        # in case the word appears again
        fix_text( $node );
    }
}

First, the routine grabs the node's text value by calling its getNodeValue( ) method. DOM specifies redundant accessor methods used to get and set values or names, either through the generic Node class or through the more specific class's methods. Instead of getNodeValue( ), we could have used getData( ), which is specific to the text node class. For some nodes, such as elements, there is no defined value, so the generic getNodeValue( ) method would return an undefined value.
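The redundancy between the generic and the specific accessors is easy to check. The snippet below is a minimal sketch using a one-paragraph document of our own invention.

```perl
use strict;
use warnings;
use XML::DOM;

my $parser = XML::DOM::Parser->new;
my $doc    = $parser->parse( '<p>Monkeys are cute.</p>' );
my $para   = $doc->getDocumentElement;
my $text   = $para->getFirstChild;         # the paragraph's only child

my $generic  = $text->getNodeValue;        # accessor from the generic Node class
my $specific = $text->getData;             # accessor specific to text nodes
print "Both accessors agree\n" if $generic eq $specific;

# Elements have no value of their own
my $elem_value = $para->getNodeValue;
print "Element value is undefined\n" unless defined $elem_value;

$doc->dispose;
```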

Next, we slice the node in two. We do this by creating a new text node and inserting it before the existing one. After we set the text values of each node, the first will contain everything before the word "monkeys", and the other will have everything after the word. Note the use of the XML::DOM::Document object as a factory to create the new text node. This DOM feature takes care of many administrative tasks behind the scenes, making the genesis of new nodes painless.

After that step, we create an <a> element and insert it between the text nodes. Like all good links, it needs a place to put the URL, so we set it up with an href attribute. To have something to click on, the link needs text, so we create a text node with the word "monkeys" and append it to the element's child list. Then the routine will recurse on the text node after the link in case there are more instances of "monkeys" to process.
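The same splitting logic can also be written as a loop instead of a recursive call. What follows is a sketch of that alternative, not the routine used in this section: link_word( ) is a hypothetical helper name, and the regular expression captures the surrounding text in groups rather than relying on $` and $'.

```perl
use strict;
use warnings;
use XML::DOM;

# link_word() is a hypothetical helper, not part of XML::DOM
sub link_word {
    my( $node, $url ) = @_;
    my $doc    = $node->getOwnerDocument;
    my $parent = $node->getParentNode;

    # Keep splitting while the remaining text still contains the word
    while( $node->getNodeValue =~ /^(.*?)(monkeys)(.*)$/is ) {
        my( $pre, $word, $post ) = ( $1, $2, $3 );

        # Text before the word goes into a new node ahead of the original
        $parent->insertBefore( $doc->createTextNode( $pre ), $node );

        # The word itself is wrapped in an <a> element
        my $link = $doc->createElement( 'a' );
        $link->setAttribute( 'href', $url );
        $link->appendChild( $doc->createTextNode( $word ) );
        $parent->insertBefore( $link, $node );

        # The original node keeps whatever follows the word
        $node->setNodeValue( $post );
    }
}

my $parser = XML::DOM::Parser->new;
my $doc    = $parser->parse( '<p>Big monkeys and small monkeys.</p>' );
my $para   = $doc->getDocumentElement;

link_word( $para->getFirstChild, 'http://www.monkeystuff.com/' );

my @links  = $para->getElementsByTagName( 'a' );
my $result = $para->toString;
print scalar( @links ), " links created\n";
print "$result\n";

$doc->dispose;
```

Each pass through the loop peels off one occurrence, so a paragraph that mentions the word twice ends up with two links, just as the recursive version would produce.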

Does it work? Running the program on this file:

<html>
<head><title>Why I like Monkeys</title></head>
<body><h1>Why I like Monkeys</h1>
<h2>Monkeys are Cute</h2>
<p>Monkeys are <b>cute</b>. They are like small, hyper versions of
ourselves. They can make funny facial expressions and stick out their
tongues.</p>
</body>
</html>

produces this output:

<html>
<head><title>Why I like Monkeys</title></head>
<body><h1>Why I like Monkeys</h1>
<h2>Monkeys are Cute</h2>
<p><a href="http://www.monkeystuff.com/">Monkeys</a> 
are <b>cute</b>. They are like small, hyper versions of
ourselves. They can make funny facial expressions and stick out their
tongues.</p>
</body>
</html>