Detaching and Reattaching (Perl & LWP)

10.3. Detaching and Reattaching

Example 10-2. Detaching and reattaching nodes

use strict;
use HTML::TreeBuilder;
my $root = HTML::TreeBuilder->new;
$root->parse_file('rewriters1/in002.html') || die $!;

my $good_td = $root->look_down( '_tag', 'td',  'class', 'story', );
die "No good td?!" unless $good_td;      # sanity checking
my $big_table = $root->look_down( '_tag', 'table' );
die "No big table?!" unless $big_table;  # sanity checking

$good_td->detach;
$big_table->replace_with($good_td);
  # Yes, there's even a method for replacing one node with another!

open(my $out, '>', 'rewriters1/out002b.html') or die "Can't write: $!";
print $out $root->as_HTML(undef, '  '); # two-space indent in output
close($out);
$root->delete; # done with it, so delete it

The resulting document looks like this:

<html>
  <head>
    <title>Shatner and Kunis Sweep the Oscars</title>
  </head>
  <body>
    <td class="story">
      <h1>Shatner and Kunis Sweep the Oscars</h1>
      <p>Stars of <cite>American Psycho II</cite> walked [...] </td>
    <hr>Copyright 2002, United Lies Syndicate </body>
</html>

One problem, though: we now have a td outside of a table, which is not valid HTML. The fix is to change it from a td element into something innocuous, such as a div, and while we're at it, delete that class attribute:

$good_td->tag('div'); 
$good_td->attr('class', undef);

That makes the output look like this:

<html>
  <head>
    <title>Shatner and Kunis Sweep the Oscars</title>
  </head>
  <body>
    <div>
      <h1>Shatner and Kunis Sweep the Oscars</h1>
      <p>Stars of <cite>American Psycho II</cite> walked [...] </div>
    <hr>Copyright 2002, United Lies Syndicate </body>
</html>
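
The whole sequence so far can be tried without the book's input file. The following is a minimal sketch, assuming a small stand-in document held in a string (the markup and the variable names $html, $new_tag, $parent_tag, and @tables are invented for illustration); it uses only parse, eof, look_down, detach, replace_with, tag, attr, and delete.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical stand-in for rewriters1/in002.html
my $html = <<'END_HTML';
<html><head><title>Demo</title></head><body>
<table><tr><td>Ad</td><td class="story"><h1>Headline</h1><p>Text</p></td></tr></table>
<hr>Footer
</body></html>
END_HTML

my $root = HTML::TreeBuilder->new;
$root->parse($html);
$root->eof;   # tell the parser that the document is complete

my $good_td = $root->look_down('_tag', 'td', 'class', 'story')
  or die "No good td?!";
my $big_table = $root->look_down('_tag', 'table')
  or die "No big table?!";

$good_td->detach;                    # pull the td out of the table
$big_table->replace_with($good_td);  # put it where the table was
$big_table->delete;                  # the old table is detached; free it

$good_td->tag('div');                # td becomes div
$good_td->attr('class', undef);      # drop the class attribute

# Record the results before destroying the tree
my $new_tag    = $good_td->tag;
my $parent_tag = $good_td->parent->tag;
my @tables     = $root->look_down('_tag', 'table');

print "element is now <$new_tag> under <$parent_tag>\n";
print "tables remaining: ", scalar(@tables), "\n";
```

Call $root->delete once the results have been read; reading them first matters because delete empties every node in the tree.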

An alternative is not to detach and save the td in the first place, but to detach and save only its content. That's simple enough:

my @good_content = $good_td->content_list;
foreach my $c (@good_content) {
  $c->detach if ref $c;
    # text nodes aren't objects, so aren't really "attached" anyhow
}
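
One consequence of that loop is easy to miss: because text nodes are plain strings, detaching the element children leaves the text behind in the original parent, while the saved list holds a copy of it. The sketch below illustrates this on an invented fragment (the div, its id, and the variable names are assumptions made for the example).

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical fragment: one text node and one element inside a div
my $root = HTML::TreeBuilder->new;
$root->parse('<html><body><div id="box">loose text<b>bold</b></div></body></html>');
$root->eof;

my $box = $root->look_down('id', 'box') or die "No div?!";
my @content = $box->content_list;

# Elements are objects (references); text nodes are plain strings
my @kinds = map { ref $_ ? 'element' : 'text' } @content;

foreach my $c (@content) {
  $c->detach if ref $c;   # only element nodes can be detached
}

my $bold      = $content[1];
my $remaining = scalar( $box->content_list );  # what is still in the div

print "kinds: @kinds\n";
print "items left in the div: $remaining\n";
print "the b element is ", ( defined $bold->parent ? "attached" : "detached" ), "\n";
```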

10.3.1. The detach_content( ) Method

The above task is so common that there's a method for it, called detach_content( ), which detaches and returns the content of the node on which it's called. So we can simply modify our program to read:

my @good_content = $good_td->detach_content;
  
$big_table->replace_with(@good_content);
$big_table->delete;

However you choose to express the node-moving operations, the resulting output looks like this:

<html>
  <head>
    <title>Shatner and Kunis Sweep the Oscars</title>
  </head>
  <body>
    <h1>Shatner and Kunis Sweep the Oscars</h1>
    <p>Stars of <cite>American Psycho II</cite> walked [...]
    <hr>Copyright 2002, United Lies Syndicate </body>
</html>

In fact, every HTML::Element method that allows you to attach a node someplace (as replace_with does) will first detach that node if it's already attached elsewhere. So you could actually skip the detach_content( ) step altogether and just write this:

$big_table->replace_with( $good_td->content_list );
$big_table->delete;

It does the same thing and results in the same output.
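
To see the automatic detaching at work, you can check a node's parent before and after the call. This is a sketch on an invented one-cell table (the markup and the names $h1, $before, $after, and $td_left are illustrative, not from the example program).

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical miniature of the news page
my $root = HTML::TreeBuilder->new;
$root->parse('<html><body><table><tr><td class="story">'
           . '<h1>Headline</h1><p>Text</p></td></tr></table></body></html>');
$root->eof;

my $good_td = $root->look_down('_tag', 'td', 'class', 'story')
  or die "No good td?!";
my $big_table = $root->look_down('_tag', 'table')
  or die "No big table?!";

my ($h1)   = $good_td->content_list;   # first child of the td
my $before = $h1->parent->tag;

# No explicit detach: replace_with detaches each element itself
$big_table->replace_with( $good_td->content_list );

my $after   = $h1->parent->tag;
my $td_left = scalar( $good_td->content_list );
$big_table->delete;

print "h1 parent before: $before, after: $after\n";
print "items left in the td: $td_left\n";
```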

10.3.2. Constraints

There are some constraints on what you can expect replace_with( ) to do, but these are just three constraints against fairly odd things that you would probably not try anyway. Namely, the documentation says you can't replace an element with multiple instances of itself; you can't replace an element with one (or more) of its siblings; and you can't replace an element that has no parent, because replacing an element inherently means altering the content list of its parent.

Many methods in the HTML::Element documentation have similar constraints spelled out, although the typical programmer will never find them to be an obstacle in and of themselves. If one of those constraints is violated, it is typically a sign that something is conceptually wrong elsewhere in the program.

For example, if you try $element->replace_with(...) and are surprised by an error message that "the target node has no parent," it is almost certainly because you either already replaced the element with something (leaving it parentless) or deleted it (leaving it parentless, contentless, and attributeless). That error message would result if our program had this:

$big_table->delete;
$big_table->replace_with( $good_td->content_list );
# Wrong!

instead of this:

$big_table->replace_with( $good_td->content_list );
$big_table->delete;
# Right.
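
If you want to see the failure without killing your program, you can wrap the call in eval. The sketch below deliberately uses the wrong order on an invented document; the variable names $ok and $error are illustrative, and the exact wording of the message may differ between versions of HTML::Element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical document with a single table
my $root = HTML::TreeBuilder->new;
$root->parse('<html><body><table><tr><td>x</td></tr></table></body></html>');
$root->eof;

my $table = $root->look_down('_tag', 'table') or die "No table?!";

$table->delete;   # wrong order, on purpose: the node is now parentless

# replace_with dies on a parentless node, so trap the error
my $ok    = eval { $table->replace_with('replacement text'); 1 };
my $error = $ok ? '' : $@;

print $ok ? "replaced\n" : "replace_with failed: $error";
```

Testing $table->parent before calling replace_with is a cheaper guard, but an unexpected parentless node usually points to a logic error worth fixing rather than working around.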