HTML::TreeBuilder (Perl & LWP)

9.2. HTML::TreeBuilder

There are five steps to an HTML::TreeBuilder program:

1. Create the HTML::TreeBuilder object.
2. Set the parse options.
3. Parse the HTML.
4. Process it according to the needs of your problem.
5. Delete the HTML::TreeBuilder object.

Example 9-2 is a simple HTML::TreeBuilder program.

Example 9-2. Simple HTML::TreeBuilder program

#!/usr/bin/perl -w
use strict;
use HTML::TreeBuilder 3;  # make sure our version isn't ancient
my $root = HTML::TreeBuilder->new;
$root->parse(  # parse a string...
q{
   <ul>
     <li>Ice cream.</li>
     <li>Whipped cream.
     <li>Hot apple pie <br>(mmm pie)</li>
   </ul>
});
$root->eof( );  # done parsing for this tree
$root->dump;   # print( ) a representation of the tree
$root->delete; # erase this tree because we're done with it

Four of the five steps are shown here. The HTML::TreeBuilder class's new( ) constructor creates a new object. We don't set parse options, preferring instead to use the defaults. The parse( ) method parses HTML from a string. It's designed to let you supply HTML in chunks, so you use the eof( ) method to tell the parser when there's no more HTML. The dump( ) method is our processing here, printing a string form of the tree (the output is given in Example 9-3). And finally we delete( ) the tree to free the memory it used.

Example 9-3. Output of Example 9-2

<html> @0 (IMPLICIT)
  <head> @0.0 (IMPLICIT)
  <body> @0.1 (IMPLICIT)
    <ul> @0.1.0
      <li> @0.1.0.0
        "Ice cream."
      <li> @0.1.0.1
        "Whipped cream. "
      <li> @0.1.0.2
        "Hot apple pie "
        <br> @0.1.0.2.1
        "(mmm pie)"

Each line in the dump represents either an element or text. Each element is identified by a dotted sequence of numbers (e.g., 0.1.0.2). This sequence identifies the position of the element in the tree, counting children from zero: the leading 0 is the root itself, so 0.1.0.2 is child 2 of child 0 of child 1 of the root. The dump also identifies some nodes as (IMPLICIT), meaning they weren't present in the HTML fragment but have been inferred to make a valid document parse tree.
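The addresses shown in the dump can also be used from code. The following is a minimal sketch, assuming the address( ) method that HTML::TreeBuilder objects inherit from HTML::Element: called with no arguments it returns a node's address, and called with an address string it returns the node at that position. The variable names are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder 3;

# Same list as Example 9-2, written on one line.
my $root = HTML::TreeBuilder->new_from_content(
  '<ul><li>Ice cream.</li><li>Whipped cream.<li>Hot apple pie <br>(mmm pie)</li></ul>'
);

# Look up the node at address 0.1.0.2: root -> body -> ul -> third li.
my $li   = $root->address('0.1.0.2');
my $tag  = $li->tag;       # tag name of the node we found
my $addr = $li->address;   # ask the node for its own address

print "Found <$tag> at $addr\n";

$root->delete;  # free the tree once we have what we need
```

Note that the tag name and address are copied into plain scalars before delete( ) is called, because the element objects are no longer usable afterward.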

9.2.1. Constructors

To create a new empty tree, use the new( ) method:

$root = HTML::TreeBuilder->new( );

To create a new tree and parse the HTML in one go, pass one or more strings to the new_from_content( ) method:

$root = HTML::TreeBuilder->new_from_content([string, ...]);

To create a new HTML::TreeBuilder object and parse HTML from a file, pass the filename or a filehandle to the new_from_file( ) method:

$root = HTML::TreeBuilder->new_from_file(filename);
$root = HTML::TreeBuilder->new_from_file(filehandle);

If you use new_from_file( ) or new_from_content( ), the parse is carried out with the default parsing options. To parse with any nondefault options, you must use the new( ) constructor and call parse_file( ) or parse( ).
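As a rough illustration of the difference, the sketch below parses the same string twice: once with new_from_content( ) and the defaults, and once with new( ) followed by an option change and an explicit parse( ) and eof( ). The choice of ignore_text as the option to change, and the sample HTML, are assumptions made only for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder 3;

my $html = '<p>Hello, <b>world</b></p>';

# One step: construct and parse with the default options.
my $quick = HTML::TreeBuilder->new_from_content($html);

# Several steps: construct, change an option, then parse.
my $custom = HTML::TreeBuilder->new;
$custom->ignore_text(1);   # nondefault: leave text out of the tree
$custom->parse($html);
$custom->eof;

my $quick_text  = $quick->as_text;    # text is present in the default tree
my $custom_text = $custom->as_text;   # no text nodes were stored here

print "default: '$quick_text'\n";
print "custom:  '$custom_text'\n";

$_->delete for $quick, $custom;
```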

9.2.2. Parse Options

Set options for the parse by calling methods on the HTML::TreeBuilder object. These methods return the old value for the option and set the value if passed a parameter. For example:

$comments = $root->strict_comment( );
print "Strict comment processing is ";
print $comments ? "on\n" : "off\n";
$root->strict_comment(0);       # disable

Some options affect the way the HTML standard is ignored or obeyed, while others affect the internal behavior of the parser. The full list of parser options follows.

$root->strict_comment([boolean]);
The HTML standard says that a comment is terminated by an even number of -- s between the opening < and the closing >, and there must be nothing but whitespace between even and odd -- s. That part of the HTML standard is little known, little understood, and little obeyed. So most browsers simply accept any --> as the end of a comment. If enabled via a true value, this option makes HTML::TreeBuilder recognize only those comments that obey the HTML standard. By default, this option is off, so that HTML::TreeBuilder will parse comments as normal browsers do.

$root->strict_names([boolean]);
Some HTML has unquoted attribute values that include spaces, e.g., <img alt=big dog! src="dog.jpg">. If this option is enabled, that tag would be reported as text, because it doesn't obey the standard (dog! is not a valid attribute name). If the option is disabled, as it is by default, source such as this is parsed as a tag, with a Boolean attribute called dog! set.

$root->implicit_tags([boolean]);
Enabled by default, this option makes the parser create nodes for missing start- or end-tags. If disabled, the parse tree simply reflects the input text, which is rarely useful.

$root->implicit_body_p_tag([boolean]);
This option controls what happens to text or phrasal tags (such as <i>...</i>) that are directly in a <body>, without a containing <p>. By default, the text or phrasal tag nodes are children of the <body>. If enabled, an implicit <p> is created to contain the text or phrasal tags.

$root->ignore_unknown([boolean]);
Enabled by default, this option makes the parser ignore unknown tags, such as <footer>. Disable it (pass a false value) to create nodes in the parse tree for unknown tags.

$root->ignore_text([boolean]);
By default, text in elements appears in the parse tree. Enable this option to create parse trees without the text from the document.

$root->ignore_ignorable_whitespace([boolean]);
Whitespace between most tags is ignorable, and by default the parser does not create text nodes for it. If you want to preserve the whitespace present in the original HTML, disable this option (pass a false value). The collapsing of runs of whitespace inside text into a single space is a separate behavior, controlled by the no_space_compacting option.
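To see an option take effect, the sketch below parses the same fragment with and without ignore_unknown. It assumes, as the HTML::TreeBuilder documentation describes, that ignore_unknown is on by default and must be switched off to keep unknown elements. The <sprocket> tag is deliberately invented so that no version of the tag tables will recognize it, and look_down( ) is the HTML::Element search method, used here only to check whether the node exists.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder 3;

# <sprocket> is a made-up tag, so the parser treats it as unknown.
my $html = '<p>Before <sprocket>Gear</sprocket> after</p>';

# Default options: the unknown tag is dropped, its text is kept.
my $default = HTML::TreeBuilder->new_from_content($html);

# ignore_unknown switched off: the unknown tag becomes a node.
my $keep = HTML::TreeBuilder->new;
$keep->ignore_unknown(0);
$keep->parse($html);
$keep->eof;

my $found_default = $default->look_down(_tag => 'sprocket') ? 1 : 0;
my $found_keep    = $keep->look_down(_tag => 'sprocket')    ? 1 : 0;

print "default tree has <sprocket>: $found_default\n";
print "custom tree has <sprocket>:  $found_keep\n";

$_->delete for $default, $keep;
```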

9.2.3. Parsing

There are two ways of parsing HTML: from a file or from strings.

Pass the parse_file( ) method a filename or filehandle to parse the HTML in that file:

$success = $root->parse_file(filename);
$success = $root->parse_file(filehandle);

For example, to parse HTML from STDIN:

$root->parse_file(*STDIN) or die "Can't parse STDIN";

The parse_file( ) method returns the HTML::TreeBuilder object if successful or undef if an error occurred.
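A self-contained way to try parse_file( ) is to write a small HTML file first and then parse it by name. The sketch below does this with the core File::Temp module; the temporary file, its contents, and the use of look_down( ) and as_text( ) to read back the title are assumptions made for the example, not part of the parsing interface itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::TreeBuilder 3;

# Create a throwaway HTML file to parse.
my ($fh, $filename) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$fh} '<title>Demo</title><p>From a file.</p>';
close $fh or die "Can't close $filename: $!";

# Parse by filename; parse_file( ) returns undef on failure.
my $root = HTML::TreeBuilder->new;
$root->parse_file($filename) or die "Can't parse $filename: $!";

# Pull one value out of the tree before freeing it.
my $title = $root->look_down(_tag => 'title')->as_text;
print "Title: $title\n";

$root->delete;
```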

The parse( ) method takes a chunk of HTML and parses it. Call parse( ) on each chunk, then call the eof( ) method when there's no more HTML to come.

$success = $root->parse(chunk);
$success = $root->eof( );

This method is designed for situations where you are acquiring your HTML one chunk at a time. It's also useful when you're extracting HTML from a larger file and can't simply parse the entire file with parse_file( ). In many cases, you could use new_from_content( ), but recall that new_from_content( ) doesn't give you an opportunity to set nondefault parsing options.
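The chunked interface can be sketched as follows. The chunks here are hardcoded strings standing in for data that would normally arrive piece by piece (from a socket, for instance), and the first chunk intentionally ends in the middle of an end-tag to show that chunk boundaries need not line up with the markup. The look_down( ) and as_text( ) calls are only there to confirm the result.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder 3;

# Stand-ins for HTML arriving in pieces; note the split inside "</h1>".
my @chunks = (
  '<h1>Chunked</h',
  '1><p>Arrives in pieces.</p>',
);

my $root = HTML::TreeBuilder->new;
$root->parse($_) for @chunks;   # feed each chunk as it arrives
$root->eof;                     # no more HTML to come

my $heading = $root->look_down(_tag => 'h1')->as_text;
print "Heading: $heading\n";

$root->delete;
```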

9.2.4. Cleanup

The delete( ) method frees the tree and its elements, giving the memory it used back to Perl:

$root->delete( );

Use this method in persistent environments such as mod_perl or when your program will parse a lot of HTML files. It's not enough to simply have $root be a private variable that goes out of scope, or to assign a new value to $root. Perl's reference-counting memory management cannot reclaim circular data structures, and an HTML::Element tree is circular: each parent refers to its children, and each child refers back to its parent.
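One way to make cleanup routine is to confine each tree to a subroutine that copies out what it needs and deletes the tree before returning. The sketch below shows that pattern; count_links( ) is a hypothetical helper written for this example, and the sample documents are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder 3;

# Hypothetical helper: count the <a> elements in a string of HTML.
sub count_links {
  my ($html) = @_;
  my $root  = HTML::TreeBuilder->new_from_content($html);
  my @links = $root->look_down(_tag => 'a');
  my $count = scalar @links;   # copy out a plain number...
  $root->delete;               # ...then free the tree explicitly
  return $count;
}

my @documents = (
  '<p><a href="a.html">A</a> and <a href="b.html">B</a></p>',
  '<p>No links here.</p>',
);

my @counts = map { count_links($_) } @documents;
print "Link counts: @counts\n";
```

Returning a plain number rather than the element objects matters here: once delete( ) has run, the elements found by look_down( ) should not be used.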