9.2. HTML::TreeBuilder

There are five steps to an HTML::TreeBuilder program:

  1. Create the HTML::TreeBuilder object.
  2. Set the parse options.
  3. Parse the HTML.
  4. Process it according to the needs of your problem.
  5. Delete the HTML::TreeBuilder object.

Example 9-2 is a simple HTML::TreeBuilder program.

Example 9-2. Simple HTML::TreeBuilder program

#!/usr/bin/perl -w
use strict;
use HTML::TreeBuilder 3;  # make sure our version isn't ancient
my $root = HTML::TreeBuilder->new;
$root->parse(  # parse a string...
q{
   <ul>
     <li>Ice cream.</li>
     <li>Whipped cream.
     <li>Hot apple pie <br>(mmm pie)</li>
   </ul>
});
$root->eof( );  # done parsing for this tree
$root->dump;   # print( ) a representation of the tree
$root->delete; # erase this tree because we're done with it

Four of the five steps are shown here. The HTML::TreeBuilder class's new( ) constructor creates a new object. We don't set parse options, preferring instead to use the defaults. The parse( ) method parses HTML from a string. It's designed to let you supply HTML in chunks, so you use the eof( ) method to tell the parser when there's no more HTML. The dump( ) method is our processing here, printing a string form of the tree (the output is given in Example 9-3). And finally we delete( ) the tree to free the memory it used.
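
Because parse( ) accepts HTML in chunks, the same steps work when the document arrives piece by piece. The following is a minimal sketch, not part of the original example: the @chunks array is a made-up stand-in for data read from a file or socket, and the chunk boundaries are deliberately placed in the middle of words.

```perl
#!/usr/bin/perl -w
use strict;
use HTML::TreeBuilder 3;

# Hypothetical input: one document split into arbitrary pieces,
# as if it had arrived in separate reads from a file or socket.
my @chunks = (
    '<ul><li>Ice ',
    'cream.</li><li>Whip',
    'ped cream.</li></ul>',
);

my $root = HTML::TreeBuilder->new;
$root->parse($_) for @chunks;   # feed each piece as it arrives
$root->eof( );                  # no more HTML is coming

# Collect the text of every <li> element, in document order
my @items = map { $_->as_text } $root->find_by_tag_name('li');
print "$_\n" for @items;

$root->delete;                  # free the tree
```

The tree should come out the same as if the whole string had been passed to a single parse( ) call, since the parser buffers incomplete input until more arrives or eof( ) is called.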

Example 9-3. Output of Example 9-2

<html> @0 (IMPLICIT)
  <head> @0.0 (IMPLICIT)
  <body> @0.1 (IMPLICIT)
    <ul> @0.1.0
      <li> @0.1.0.0
        "Ice cream."
      <li> @0.1.0.1
        "Whipped cream. "
      <li> @0.1.0.2
        "Hot apple pie "
        <br> @0.1.0.2.1
        "(mmm pie)"

Each line in the dump represents either an element or text. Each element is identified by the dotted sequence of numbers after the @ (e.g., 0.1.0.2). This sequence gives the position of the element in the tree: the leading 0 is the root, and each number after it is a child index counted from zero, so 0.1.0.2 is the 2nd child of the 0th child of the 1st child of the root of the tree. The dump also identifies some nodes as (IMPLICIT), meaning they weren't present in the HTML fragment but have been inferred to make a valid document parse tree.
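
The addresses in the dump can also be used from code. As a small sketch, assuming the address( ), look_down( ), tag( ), and as_text( ) methods that HTML::TreeBuilder inherits from HTML::Element, the following looks up a node by its dotted address and then asks another node for its address:

```perl
#!/usr/bin/perl -w
use strict;
use HTML::TreeBuilder 3;

my $root = HTML::TreeBuilder->new;
$root->parse(q{
   <ul>
     <li>Ice cream.</li>
     <li>Whipped cream.
     <li>Hot apple pie <br>(mmm pie)</li>
   </ul>
});
$root->eof( );

# Look up a node by the address shown in the dump
my $li      = $root->address('0.1.0.2');
my $li_tag  = $li->tag;       # tag name of the node at that address
my $li_text = $li->as_text;   # text content; the <br> contributes nothing
print "0.1.0.2 is <$li_tag> containing '$li_text'\n";

# Go the other way: find an element, then ask for its address
my $br         = $root->look_down('_tag', 'br');
my $br_address = $br->address;
print "<br> is at $br_address\n";

$root->delete;
```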

9.2.2. Parse Options

Set options for the parse by calling methods on the HTML::TreeBuilder object. These methods return the old value for the option and set the value if passed a parameter. For example:

my $comments = $root->strict_comment( );
print "Strict comment processing is ";
print $comments ? "on\n" : "off\n";
$root->strict_comment(0);      # disable
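
Because a setter call returns the previous value, an option can be changed temporarily and then put back. This is only a sketch of that pattern, using the implicit_tags option as an arbitrary example. The variable names are illustrative.

```perl
#!/usr/bin/perl -w
use strict;
use HTML::TreeBuilder 3;

my $root = HTML::TreeBuilder->new;

# Passing a value sets the option; the return value is the previous setting
my $previous = $root->implicit_tags(0);    # turn implicit tags off
my $current  = $root->implicit_tags( );    # no argument: just read it
print "implicit_tags was ", ($previous ? "on" : "off"),
      ", is now ", ($current ? "on" : "off"), "\n";

# ... parse with the changed setting here ...

$root->implicit_tags($previous);           # put the original value back
my $restored = $root->implicit_tags( );
print "implicit_tags restored to ", ($restored ? "on" : "off"), "\n";

$root->delete;
```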

Some options affect the way the HTML standard is ignored or obeyed, while others affect the internal behavior of the parser. The full list of parser options follows.

$root->strict_comment([boolean]);
The HTML standard says that a comment is terminated by an even number of --s between the opening < and the closing >, and there must be nothing but whitespace between even and odd --s. That part of the HTML standard is little known, little understood, and little obeyed, so most browsers simply accept any --> as the end of a comment. If enabled via a true value, this option makes HTML::TreeBuilder recognize only those comments that obey the HTML standard. By default, this option is off, so that HTML::TreeBuilder parses comments as normal browsers do.
$root->strict_names([boolean]);
Some HTML has unquoted attribute values that include spaces, e.g., <img alt=big dog! src="dog.jpg">. If this option is enabled, that tag would be reported as text, because it doesn't obey the standard (dog! is not a valid attribute name). If the option is disabled, as it is by default, source such as this is parsed as a tag, with a Boolean attribute called dog! set.
$root->implicit_tags([boolean]);
Enabled by default, this option makes the parser create nodes for missing start- or end-tags. If disabled, the parse tree simply reflects the input text, which is rarely useful.
$root->implicit_body_p_tag([boolean]);
This option controls what happens to text or phrasal tags (such as <i>...</i>) that are directly in a <body>, without a containing <p>. By default, the text or phrasal tag nodes are children of the <body>. If enabled, an implicit <p> is created to contain the text or phrasal tags.
$root->ignore_unknown([boolean]);
By default, this option is enabled and unknown tags, such as <footer>, are ignored. Disable it to create nodes in the parse tree for unknown tags.
$root->ignore_text([boolean]);
By default, text in elements appears in the parse tree. Enable this option to create parse trees without the text from the document.
$root->ignore_ignorable_whitespace([boolean]);
Whitespace between most tags is ignorable, and multiple whitespace characters are collapsed to one. This option is enabled by default, so ignorable whitespace is left out of the parse tree. If you want to keep the whitespace present in the original HTML, disable this option.
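
To see what a few of these options do, it can help to parse the same input under different settings and compare the trees. The following is a rough sketch rather than a definitive test: build_tree( ) is a hypothetical helper written for this illustration, <sparkle> is a made-up tag standing in for any tag the parser doesn't know, and the comparison relies on the documented defaults (ignore_unknown on, implicit_body_p_tag and ignore_text off).

```perl
#!/usr/bin/perl -w
use strict;
use HTML::TreeBuilder 3;

# build_tree( ) is a hypothetical helper, not part of HTML::TreeBuilder.
# It sets the named options, parses $html, and returns the finished tree.
sub build_tree {
    my ($html, %options) = @_;
    my $root = HTML::TreeBuilder->new;
    for my $name (keys %options) {
        $root->$name( $options{$name} );   # e.g., $root->ignore_text(1)
    }
    $root->parse($html);
    $root->eof( );
    return $root;
}

# <sparkle> is a made-up tag, so the parser treats it as unknown
my $html = 'Plain text and <sparkle>a made-up tag</sparkle>';

my $default = build_tree($html);
my $with_p  = build_tree($html, implicit_body_p_tag => 1);
my $unknown = build_tree($html, ignore_unknown      => 0);
my $no_text = build_tree($html, ignore_text         => 1);

# Record what each tree contains (1 if the element exists, 0 if not)
my $default_p       = defined( $default->look_down('_tag', 'p') )       ? 1 : 0;
my $default_sparkle = defined( $default->look_down('_tag', 'sparkle') ) ? 1 : 0;
my $with_p_p        = defined( $with_p->look_down('_tag', 'p') )        ? 1 : 0;
my $unknown_sparkle = defined( $unknown->look_down('_tag', 'sparkle') ) ? 1 : 0;
my $no_text_text    = $no_text->as_text;

print "defaults:               p=$default_p sparkle=$default_sparkle\n";
print "implicit_body_p_tag(1): p=$with_p_p\n";
print "ignore_unknown(0):      sparkle=$unknown_sparkle\n";
print "ignore_text(1):         text='$no_text_text'\n";

$_->delete for $default, $with_p, $unknown, $no_text;
```

Calling dump( ) on each tree before deleting it gives a fuller picture of how the options change the structure.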



Copyright © 2002 O'Reilly & Associates. All rights reserved.