9.3. Processing
Once you have parsed some HTML, you need to process it. Exactly what
you do will depend on the nature of your problem. Two common models
are extracting information and producing a transformed version of the
HTML (for example, to remove banner advertisements).
Whether extracting or transforming, you'll probably
want to find the bits of the document you're
interested in. They might be all headings, all bold italic regions,
or all paragraphs with class="blinking".
HTML::Element provides several functions for searching the tree.
9.3.1. Methods for Searching the Tree
In scalar context, these methods return the first node that satisfies
the criteria. In list context, all such nodes are returned. The methods
can be called on the root of the tree or any node in it.
- $node->find_by_tag_name(tag [, ...])
-
Return node(s) for tags of the names listed. For example, to find all
h1 and h2 nodes:
@headings = $root->find_by_tag_name('h1', 'h2');
- $node->find_by_attribute(attribute, value)
-
Returns the node(s) with the given attribute set to the given value.
For example, to find all nodes with
class="blinking":
@blinkers = $root->find_by_attribute("class", "blinking");
- $node->look_down(...)
- $node->look_up(...)
-
These two methods search $node and its children
(and children's children, and so on) in the case of
look_down, or its parent (and the
parent's parent, and so on) in the case of
look_up, looking for nodes that match whatever
criteria you specify. The parameters are either
attribute =>
value pairs (where the special attribute
_tag represents the tag name), or a subroutine
that is passed a current node and returns true to indicate that this
node is of interest.
For example, to find all h2 nodes in the tree with
class="blinking":
@blinkers = $root->look_down(_tag => 'h2', class => 'blinking');
We'll discuss look_down in
greater detail later.
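As a rough illustration of the search methods above, here is a minimal sketch run against a small document invented for the purpose (the headings and class names are assumptions, not part of any real page). It shows the difference between list and scalar context, and how look_down can mix attribute => value pairs with a subroutine criterion.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample document, invented for this sketch.
my $root = HTML::TreeBuilder->new_from_content(<<'EOHTML');
<h1>News</h1>
<h2 class="blinking">Sports</h2>
<p class="blinking">Scores are in.</p>
<h2>Weather</h2>
EOHTML

# List context: every matching node, in document order.
my @headings = $root->find_by_tag_name('h1', 'h2');

# Scalar context: only the first matching node.
my $first_blinker = $root->find_by_attribute('class', 'blinking');

# look_down: all criteria must hold for a node to match.
my @short_h2 = $root->look_down(
    _tag => 'h2',
    sub { length($_[0]->as_text) < 7 },   # keep only short headings
);

# look_up climbs from the node through its ancestors.
my $body = $first_blinker->look_up(_tag => 'body');

print scalar(@headings), " headings found\n";
print "first blinker is a ", $first_blinker->tag, " element\n";
print "short h2 headings: ", join(', ', map { $_->as_text } @short_h2), "\n";
print "enclosing element: ", $body->tag, "\n";
```

The body element found by look_up never appears in the source text; HTML::TreeBuilder supplies the implied html, head, and body elements while parsing.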
9.3.2. Attributes of a Node
Four methods give access to the
basic information in a node:
- $node->tag( )
-
The tag name string of this element. Example values:
html, img,
blockquote. Note that this is always lowercase.
- $node->parent( )
-
This returns the node object that is the parent of this node. If
$node is the root of the tree,
$node->parent( ) will return
undef.
- $node->content_list( )
-
This returns the (potentially empty) list of nodes that are this
node's children.
- $node->attr(attributename)
-
This returns the value of the HTML
attributename attribute for this element.
If there is no such attribute for this element, this returns
undef. For example: if $node is
parsed from <img
src="x1.jpg" alt="Looky!">,
then $node->attr("src") will return the string
x1.jpg.
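The four accessors can be tried together on a tiny fragment. The following is only a sketch: the surrounding paragraph is invented for illustration, while the img element is the one used in the description above.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical fragment wrapped around the img example from the text.
my $root = HTML::TreeBuilder->new_from_content(
    '<p>See <img src="x1.jpg" alt="Looky!"> here</p>'
);

my $img = $root->find_by_tag_name('img');   # scalar context: first match

my $tag    = $img->tag;             # tag name, always lowercase
my $src    = $img->attr('src');     # value of an attribute that is present
my $width  = $img->attr('width');   # undef, since there is no such attribute
my $parent = $img->parent;          # the enclosing p element

# content_list mixes element objects with plain strings (text segments).
my @kinds = map { ref($_) ? 'element:' . $_->tag : 'text' }
            $parent->content_list;

print "$tag inside ", $parent->tag, ", src=$src\n";
print "width is ", (defined $width ? $width : 'undef'), "\n";
print join(' ', @kinds), "\n";
print "the root has no parent\n" unless defined $root->parent;
```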
Four more methods convert a tree or part of a tree into another
format, such as HTML or text.
- $node->as_HTML([ entities [, indent_char [, optional_end_tags ]]]);
-
Returns a string consisting of the node and its children as HTML. The
entities parameter is a string containing
characters that should be entity escaped (if omitted or undef, all
potentially unsafe characters are encoded as entities; if you pass just
<>&, just those characters will get
encoded, which is a bare minimum for valid HTML). Pass undef rather
than an empty string to get the default, because recent releases of
HTML::Element treat an empty string as "encode nothing." The
indent_char parameter is a string used for
indenting the HTML. The optional_end_tags
parameter is a reference to a hash that has a true value for every
key that is the name of a tag whose closing tag is optional. The most
common value for this parameter is {} to force all
tags to be closed:
$html = $node->as_HTML(undef, "", {});
For example, this will emit </li> tags for
any li nodes under $node, even
though </li> tags are technically optional,
according to the HTML specification.
Using $node->as_HTML( ) with no parameters
should be fine for most purposes.
- $node->as_text( )
-
Returns a string consisting of all the text nodes from this element
and its children.
- $node->starttag([entities])
-
Returns the HTML for the start-tag for this node. The
entities parameter is a string of
characters to entity escape, as in the as_HTML( )
method; you can omit this. For example, if this node came from
parsing <TD
class=loud>Hooboy</TD>, then
$node->starttag( ) returns
<td class="loud">. Note
that the original source text is not reproduced exactly, because
insignificant differences, such as the capitalization of the tag name
or attribute names, will have been discarded during parsing.
- $node->endtag( )
-
Returns the HTML for the end-tag for this node. For example, if this
node came from parsing <TD
class=loud>Hooboy</TD>, then
$node->endtag( ) returns
</td>.
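To see the four conversion methods side by side, here is a small sketch built around the TD example above. The enclosing table markup is an assumption added so that the cell has a sensible home in the tree.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# The table and tr wrappers are invented; the TD is the one from the text.
my $root = HTML::TreeBuilder->new_from_content(
    '<table><tr><TD class=loud>Hooboy</TD></tr></table>'
);
my $cell = $root->find_by_tag_name('td');

my $start = $cell->starttag;   # normalized start-tag
my $end   = $cell->endtag;     # matching end-tag
my $text  = $cell->as_text;    # text content only, no markup
my $html  = $cell->as_HTML;    # the element and everything under it

print "start-tag: $start\n";
print "end-tag:   $end\n";
print "text:      $text\n";
print "HTML:      $html\n";
```

Note that as_HTML, like starttag, emits the normalized form of the markup rather than the original source text.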
These methods are useful once you've found the
desired content. Example 9-4 prints all the bold
italic text in a document.
Example 9-4. Bold-italic headline printer
#!/usr/bin/perl -w
use HTML::TreeBuilder;
use strict;
my $root = HTML::TreeBuilder->new_from_content(<<"EOHTML");
<b><i>Shatner wins Award!</i></b>
Today in <b>Hollywood</b> ...
<b><i>End of World Predicted!</i></b>
Today in <b>Washington</b> ...
EOHTML
$root->eof( );
# print contents of <b><i>...</i></b>
my @bolds = $root->find_by_tag_name('b');
foreach my $node (@bolds) {
    my @kids = $node->content_list( );
    if (@kids and ref $kids[0] and $kids[0]->tag( ) eq 'i') {
        print $kids[0]->as_text( ), "\n";
    }
}
Example 9-4 is fairly straightforward. Having parsed
the string into a new tree, we get a list of all the bold nodes. Some
of these will be the headlines we want, while others will simply be
bolded text. In this case, we can identify headlines by checking that
the first child of the bold node represents
<i>...</i>. If it is an italic node,
we print its text content.
The only complicated part of Example 9-4 is the test
to see whether it's an interesting node. This test
has three parts:
- @kids
-
True if there are children of this node. An empty
<b></b> would fail this test.
- ref $kids[0]
-
True if the first child of this node is an element. This is false in
cases such as <b>Washington</b>, where
the first (and here, only) child is text. If we fail to check this,
the next expression, $kids[0]->tag( ), would
produce an error when $kids[0]
isn't an object value.
- $kids[0]->tag( ) eq 'i'
-
True if the first child of this node is an i
element. This would weed out anything like
<b><img
src="shatner.jpg"></b>, where
$kids[0]->tag( ) would return
img, or
<b><strong>Yes,
Shatner!</strong></b>, where
$kids[0]->tag( ) would return
strong.
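The same three-part test can also be folded into a subroutine criterion for look_down, so that the search and the filtering happen in one call. This is only a sketch of an alternative, not a replacement for Example 9-4; it reuses that example's sample HTML.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new_from_content(<<"EOHTML");
<b><i>Shatner wins Award!</i></b>
Today in <b>Hollywood</b> ...
<b><i>End of World Predicted!</i></b>
Today in <b>Washington</b> ...
EOHTML

# Find b elements whose first child is an i element.
my @headlines = $root->look_down(
    _tag => 'b',
    sub {
        my @kids = $_[0]->content_list;
        return @kids && ref($kids[0]) && $kids[0]->tag eq 'i';
    },
);

print $_->as_text, "\n" for @headlines;
```

Here as_text is called on the b node itself, which yields the same string as calling it on the i child, because the headline text is all the b node contains.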
9.3.3. Traversing
For many tasks, you can use the built-in search functions. Sometimes,
though, you'd like to visit every node of the tree. You have two
choices: you can use the existing traverse( ) function or write your
own using either recursion or your own stack.
The act of visiting every node in a tree is called a
traversal. Traversals can either be
preorder (where you process the current node before
processing its children) or postorder (where you
process the current node after processing its children). The
traverse( ) method lets you do both:
$node->traverse(callbacks [, ignore_text]);
The traverse( ) method calls a callback before
processing the children and again afterward. If the
callbacks parameter is a single function
reference, the same function is called before and after processing
the children. If the callbacks parameter
is an array reference, the first element is a reference to a function
called before the children are processed, and the second element is
similarly called after the children are processed, unless this node
is a text segment or an element that is prototypically empty, such as
br or hr. (This last quirk of
the traverse( ) method is one of the reasons that
I discourage its use.)
Callbacks get called with up to five values:
sub callback {
    my ($node, $startflag, $depth,
        $parent, $my_index) = @_;
    # ...
}
The current node is the first parameter. The next is a Boolean value
indicating whether we're being called before (true) or after (false)
the children, and the third is a number indicating how deep into the
traversal we are. The fourth and fifth parameters are supplied only
for text elements: the parent node object and the index of the
current node in its parent's list of children.
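To make the calling convention concrete, here is a minimal sketch that logs every callback invocation for a small fragment. The fragment and the log_visit name are invented for illustration. A single code reference is passed, so the same routine serves as both the preorder and the postorder callback.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical fragment containing text, an empty element, and a nested element.
my $root = HTML::TreeBuilder->new_from_content(
    '<p>Stop<br>Go <b>now</b></p>'
);

my @log;
sub log_visit {
    my ($node, $startflag, $depth) = @_;
    # Elements are objects; text segments arrive as plain strings.
    my $label = ref($node) ? $node->tag : qq{"$node"};
    push @log, ('  ' x $depth) . ($startflag ? 'enter ' : 'leave ') . $label;
    return HTML::Element::OK();   # keep traversing
}

$root->traverse(\&log_visit);
print "$_\n" for @log;
```

In the resulting log, text segments and the br element should each appear only once, as an "enter" line with no matching "leave" line, which is the quirk described above.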
A callback can return any of the following values:
- HTML::Element::OK (or any true value)
-
Continue traversing.
- HTML::Element::PRUNE (or any false value)
-
Do not go into the children. The postorder callback is not called.
(Ignored if returned by a postorder callback.)
- HTML::Element::ABORT
-
Abort the traversal immediately.
- HTML::Element::PRUNE_UP
-
Do not go into this node's children or into its
parent node.
- HTML::Element::PRUNE_SOFTLY
-
Do not go into the children, but do call this node's
postorder callback.
For example, to extract text from a node but not go into
table elements:
my $text;
sub text_no_tables {
    return if ref $_[0] && $_[0]->tag eq 'table';
    $text .= $_[0] unless ref $_[0];  # only append text nodes
    return 1;                         # all is copacetic
}
$root->traverse([\&text_no_tables]);
This prevents descent into the contents of tables, while accumulating
the text nodes in $text.
It can be hard to think in terms of callbacks, though, and the
multiplicity of return values and calling parameters you get with
traverse( ) makes for confusing code, as you will
likely note when you come across its use in existing programs that
use HTML::TreeBuilder.
Instead, it's usually easier and clearer to simply
write your own recursive subroutine, like this one:
my $text = '';
sub scan_for_non_table_text {
    my $element = $_[0];
    return if $element->tag eq 'table';       # prune!
    foreach my $child ($element->content_list) {
        if (ref $child) {                     # it's an element
            scan_for_non_table_text($child);  # recurse!
        } else {                              # it's a text node!
            $text .= $child;
        }
    }
    return;
}
scan_for_non_table_text($root);
Alternatively, implement it using a stack, doing the same work:
my $text = '';
my @stack = ($root);  # where to start
while (@stack) {
    my $node = shift @stack;
    next if ref $node and $node->tag eq 'table';  # skip tables
    if (ref $node) {
        unshift @stack, $node->content_list;      # add children
    } else {
        $text .= $node;                           # add text
    }
}
The while( ) loop version can be faster than the
recursive version, but at the cost of being much less clear to people
who are unfamiliar with this technique. If speed is a concern, you
should always benchmark the two versions to make sure you really need
the speedup and that the while( ) loop version
actually delivers. The speed difference is sometimes insignificant.
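One way to run such a comparison is with the standard Benchmark module. The sketch below is an assumption-laden starting point: the sample document, the iteration count, and the subroutine names text_recursive and text_stack are all invented here, and both routines return their text instead of appending to a shared variable so that they can be called repeatedly. Meaningful numbers require a document of realistic size.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use HTML::TreeBuilder;

# Hypothetical test document; substitute a realistic page for real timings.
my $root = HTML::TreeBuilder->new_from_content(
    '<p>Keep <b>this</b></p><table><tr><td>skip</td></tr></table><p>and this</p>'
);

# Recursive version: returns the non-table text under $element.
sub text_recursive {
    my ($element) = @_;
    return '' if $element->tag eq 'table';
    my $text = '';
    foreach my $child ($element->content_list) {
        $text .= ref($child) ? text_recursive($child) : $child;
    }
    return $text;
}

# Stack version: same result, with no recursion.
sub text_stack {
    my @stack = @_;
    my $text  = '';
    while (@stack) {
        my $node = shift @stack;
        if (!ref $node) { $text .= $node; next }   # text segment
        next if $node->tag eq 'table';             # skip tables
        unshift @stack, $node->content_list;       # children go first
    }
    return $text;
}

# Check that the two agree before timing them.
die "versions disagree\n" unless text_recursive($root) eq text_stack($root);

cmpthese(5000, {
    recursive => sub { text_recursive($root) },
    stack     => sub { text_stack($root) },
});
```

With a document this small, Benchmark may warn that there were too few iterations for a reliable count; raise the count or use a larger document if that happens.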
The manual page perldoc
HTML::Element::traverse discusses writing more
complex traverser routines, in the rare cases where you might find
this necessary.
Copyright © 2002 O'Reilly & Associates. All rights reserved.