Modifying HTML with Trees (Perl & LWP)

In Chapter 9, "HTML Processing with Trees", we saw how to extract information from HTML trees. But that's not the only thing you can use trees for. HTML::TreeBuilder trees can be altered and can even be written back out as HTML, using the as_HTML( ) method. There are four ways in which a tree can be altered: you can alter a node's attributes; you can delete a node; you can detach a node and reattach it elsewhere; and you can add a new node. We'll treat each of these in turn.

10.1. Changing Attributes

Suppose that in your new role as fixer of large sets of HTML documents, you are given a bunch of documents that have headings like this:

<h3 align=center>Free Monkey</h3>
<h3 color=red>Inquire Within</h3>

that need to be changed like this:

<h2 style=scream>Free Monkey</h2>
<h4 style=mumble>Inquire Within</h4>

Before you start phrasing this in terms of HTML::Element methods, you should consider whether this can be done with a search-and-replace operation in an editor. In this case, it cannot, because you're not just changing every <h3 align=center> to <h2 style=scream> and every <h3 color=red> to <h4 style=mumble> (which are apparently simple search-and-replace operations); you also have to change each </h3> to </h2> or to </h4>, depending on what you did to the element that it closes. That sort of context dependency puts this well outside the realm of simple search-and-replace operations. One could try to implement this with HTML::TokeParser, reading every token and printing it back out, possibly after altering it. In such a program, every time we see an <h3...> start-tag and maybe alter it, we'd have to set a flag indicating what the next </h3> should be changed to.
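For comparison, here is a rough sketch of what that token-based approach might look like. It is not part of the original program: the sample input, the third (unaltered) heading, and the $pending_close flag name are illustrative assumptions, and the replacement attributes simply follow what Example 10-1 sets.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Illustrative input (an assumption, not the book's test file).
my $html = <<'END_HTML';
<h3 align=center>Free Monkey</h3>
<h3 color=red>Inquire Within</h3>
<h3>Unchanged</h3>
END_HTML

my $stream = HTML::TokeParser->new(\$html)
  or die "Couldn't set up the token stream";

my $rewritten = '';
my $pending_close;   # the flag: what the next </h3> should turn into

while (my $token = $stream->get_token) {
  my $type = $token->[0];

  if ($type eq 'S' and $token->[1] eq 'h3') {
    my $attr = $token->[2];            # hashref of attribute values
    if (($attr->{'align'} || '') eq 'center') {
      $rewritten .= '<h2 style="scream">';
      $pending_close = '</h2>';
    } elsif (($attr->{'color'} || '') eq 'red') {
      $rewritten .= '<h4 style="mumble">';
      $pending_close = '</h4>';
    } else {
      $rewritten .= $token->[4];       # original source of the start-tag
      $pending_close = undef;
    }
  } elsif ($type eq 'E' and $token->[1] eq 'h3') {
    # Use the remembered replacement, or else the original end-tag.
    $rewritten .= defined $pending_close ? $pending_close : $token->[2];
    $pending_close = undef;
  } elsif ($type eq 'T') {
    $rewritten .= $token->[1];         # text tokens keep their source at index 1
  } else {
    $rewritten .= $token->[-1];        # other token types keep their source last
  }
}

print $rewritten;
```

The single flag is the fragile part: every new kind of rewrite needs more state carried between tokens, which is the bookkeeping that the tree-based approach avoids.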

So far, you've seen the method $element->attr(attrname) to get the value of an attribute (returning undef if there is no such attribute). To alter attribute values, you need only two additional syntaxes: $element->attr(attrname, newval) sets a value (regardless of whether that attribute had a previous value), and $element->attr(attrname, undef) deletes an attribute. That works even for changing the _tag attribute (for which the $element->tag method is a shortcut).
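The three forms of attr can be tried on a single free-standing element. This is only a small sketch: the element is built by hand with HTML::Element->new purely for illustration, and the variable names are assumptions.

```perl
use strict;
use warnings;
use HTML::Element;

# A free-standing element built by hand, just for illustration.
my $heading = HTML::Element->new('h3', 'align' => 'center');
$heading->push_content('Free Monkey');

my $old_align = $heading->attr('align');   # get: 'center'
my $no_such   = $heading->attr('color');   # get: undef, no such attribute

$heading->attr('style', 'scream');         # set (whether or not it existed)
$heading->attr('align', undef);            # delete
$heading->attr('_tag', 'h2');              # same effect as $heading->tag('h2')

my $new_tag   = $heading->tag;
my $start_tag = $heading->starttag;        # the start-tag as it would be printed
print "$start_tag\n";

$heading->delete;
```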

That said, it's just a matter of knowing what nodes to change and then changing them, as in Example 10-1.

Example 10-1. Modifying attributes

use strict;
use HTML::TreeBuilder;
my $root = HTML::TreeBuilder->new;
$root->parse_file('rewriters1/in1.html') || die $!;
 
print "Before:\n";
$root->dump;
 
my @h3_center = $root->look_down('_tag', 'h3', 'align', 'center');
my @h3_red    = $root->look_down('_tag', 'h3', 'color', 'red');
foreach my $h3c (@h3_center) {
  $h3c->attr('_tag', 'h2');
  $h3c->attr('style', 'scream');
  $h3c->attr('align', undef);
}
 
foreach my $h3r (@h3_red) {
  $h3r->attr('_tag', 'h4');
  $h3r->attr('style', 'mumble');
  $h3r->attr('color', undef);
}
 
print "\n\nAfter:\n";
$root->dump;

Suppose that the input file consists of this:

<html><body>
  
<h3 align=center>Free Monkey</h3>
<h3 color=red>Inquire Within</h3>
<p>It's a monkey!  <em>And it's free!</em></html>

When we run the program, we can see the tree dump before and after the modifications happen:

Before:
<html> @0
  <head> @0.0 (IMPLICIT)
  <body> @0.1
    <h3 align="center"> @0.1.0
      "Free Monkey"
    <h3 color="red"> @0.1.1
      "Inquire Within"
    <p> @0.1.2
      "It's a monkey! "
      <em> @0.1.2.1
        "And it's free!"
 
After:
<html> @0
  <head> @0.0 (IMPLICIT)
  <body> @0.1
    <h2 style="scream"> @0.1.0
      "Free Monkey"
    <h4 style="mumble"> @0.1.1
      "Inquire Within"
    <p> @0.1.2
      "It's a monkey! "
      <em> @0.1.2.1
        "And it's free!"

The changes applied correctly, so we can go ahead and add this code to the end of the program, to dump the tree to disk:

open(OUT, ">rewriters1/out1.html") || die "Can't write: $!";
print OUT $root->as_HTML;
close(OUT);
$root->delete; # done with it, so delete it
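A variation on the same step, sketched with a lexical filehandle and the three-argument form of open, might look like the following. The inline HTML and the temporary directory are assumptions made so that the fragment runs on its own; the rewriters1/ paths above are what the actual program uses.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new;
$root->parse('<html><body><h3 align=center>Free Monkey</h3></body></html>');
$root->eof;   # tell the parser that the input is complete

my $dir      = tempdir(CLEANUP => 1);   # hypothetical scratch directory
my $out_path = "$dir/out1.html";

open(my $out, '>', $out_path) || die "Can't write $out_path: $!";
print $out $root->as_HTML;
close($out) || die "Can't finish writing $out_path: $!";
$root->delete;   # done with it, so delete it

# Read the file back to confirm what was written.
open(my $in, '<', $out_path) || die "Can't read $out_path: $!";
my $written = do { local $/; <$in> };
close($in);
print $written;
```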

10.1.1. Whitespace

Examining the output file shows it to be one single line, consisting of this (wrapped so it will fit on the page):

<html><head></head><body><h2 style="scream">Free Monkey</h2><h4
style="mumble">Inquire Within</h4><p>It's a monkey! <em>And it's
free!</em></body></html>

Where did all the nice whitespace from the original go, such as the newline after each </h3>?

Whitespace in HTML (except in pre elements and a few others) isn't contrastive. That is, any amount of whitespace is as good as just one space. So whenever HTML::TreeBuilder sees whitespace tokens as it is parsing the HTML source, it compacts each group into a single space. Furthermore, whitespace between some kinds of tags (such as between </h3> and <h3>, or between </h3> and <p>) isn't meaningful at all, so when HTML::TreeBuilder sees such whitespace, it just discards it.

This whitespace mangling is the default behavior of an HTML::TreeBuilder tree and can be changed by two options that you set before parsing from a file:

my $root = HTML::TreeBuilder->new;

$root->ignore_ignorable_whitespace(0);
  # Don't try to delete whitespace between block-level elements.

$root->no_space_compacting(1);
  # Don't smash every whitespace sequence into a single space.

With those lines added to our program, the output file keeps the whitespace of the original:

<html><head></head><body>
  
<h2 style="scream">Free Monkey</h2>
<h4 style="mumble">Inquire Within</h4>
  
<p>It's a monkey!  <em>And it's free!</em></body>
  
</html>
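To see the two behaviors side by side, a small sketch can parse the same source twice, once with the defaults and once with both options set. The reserialize helper and the sample string are hypothetical and exist only for this comparison.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample source with a run of spaces and with newlines between headings.
my $source = "<html><body>\n<h3>Free   Monkey</h3>\n"
           . "<h3>Inquire Within</h3>\n</body></html>";

# Hypothetical helper: parse $html and write it straight back out.
sub reserialize {
  my ($html, $keep_whitespace) = @_;
  my $root = HTML::TreeBuilder->new;
  if ($keep_whitespace) {
    # Both options must be set before any parsing happens.
    $root->ignore_ignorable_whitespace(0);
    $root->no_space_compacting(1);
  }
  $root->parse($html);
  $root->eof;
  my $out = $root->as_HTML;
  $root->delete;
  return $out;
}

my $compacted = reserialize($source, 0);
my $preserved = reserialize($source, 1);

print "Default:\n$compacted\n\n";
print "Whitespace kept:\n$preserved\n";
```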

An alternative is to have the as_HTML( ) method try to indent the HTML as it prints it. This is achieved by calling as_HTML like so:

print OUT $root->as_HTML(undef, "  ");

This feature is still somewhat experimental, and its implementation might change, but at the time of this writing, it makes the output file's code look like this:

<html>
  <head>
  </head>
  <body>
    <h2 style="scream">Free Monkey</h2>
    <h4 style="mumble">Inquire Within</h4>
    <p>It's a monkey! <em>And it's free!</em></body>
</html>

10.1.2. Other HTML Options

Besides this indenting option, there are further options to as_HTML( ), as described in Chapter 9, "HTML Processing with Trees". One option controls whether omissible end-tags (such as </p> and </li>) are printed.

Another controls what characters are escaped using &foo; sequences. Notably, by default, this encodes all characters over ASCII 126, so for example, as_HTML will print an é in the parse tree as &eacute; (whether it came from a literal é or from an &eacute;). This is always safe, but in cases where you're dealing with text with a lot of Latin-1 or Unicode characters, having every one of those characters encoded as a &foo; sequence might be bothersome to any people looking at the HTML markup output.

In that case, your call to as_HTML can consist of $root->as_HTML('<>&'), in which case only the minimum of characters (<, >, and &) will be escaped. There's no point in using these options (or in preserving whitespace with ignore_ignorable_whitespace and no_space_compacting) if you're reasonably sure nobody will ever be looking at the resulting HTML. But for cases where people might need to look at the HTML, these options will make the code more inviting than just one huge block of HTML.
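The following sketch puts these options next to each other. The sample paragraph is an assumption, and passing an empty hash reference as the third argument (so that no end-tag is treated as omissible) reflects my reading of the HTML::Element documentation rather than anything shown in this chapter.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new;
# \xE9 is a literal e-acute; the e-grave arrives as an entity reference.
$root->parse("<html><body><p>caf\xE9 &amp; cr&egrave;me</p></body></html>");
$root->eof;

my $all_escaped   = $root->as_HTML;                    # default escaping
my $min_escaped   = $root->as_HTML('<>&');             # escape only <, >, and &
my $with_end_tags = $root->as_HTML(undef, undef, {});  # print every end-tag
$root->delete;

print "Default escaping: $all_escaped\n";
print "All end-tags:     $with_end_tags\n";
print "Minimal escaping leaves the accented letters as literal characters\n"
  if $min_escaped =~ /caf\xE9/;
```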
