More HTML::TokeParser Methods (Perl & LWP)

7.5. More HTML::TokeParser Methods

Example 7-1 illustrates that you often aren't interested in every kind of token in a stream, but care only about tokens of a certain kind. The HTML::TokeParser interface supports this with three methods, get_tag( ), get_text( ), and get_trimmed_text( ), that do more than simply get the next token.

$text_string = $stream->get_text( );: If the next token is text, return its value.
$text_string = $stream->get_text('foo');: Return all text up to the next foo start-tag.
$text_string = $stream->get_text('/bar');: Return all text up to the next /bar end-tag.
$text = $stream->get_trimmed_text( );
$text = $stream->get_trimmed_text('foo');
$text = $stream->get_trimmed_text('/bar');: Like the corresponding get_text( ) calls, except with leading and trailing whitespace removed and every other run of whitespace collapsed to a single space.
$tag_ref = $stream->get_tag( );: Return the next start-tag or end-tag token.
$tag_ref = $stream->get_tag('foo', '/bar', 'baz');: Return the next foo start-tag, /bar end-tag, or baz start-tag.

We will explain these methods in detail in the following sections.

7.5.1. The get_text( ) Method

The get_text( ) syntax is:

$text_string = $stream->get_text( );

If $stream's next token is text, this gets it, resolves any entities in it, and returns its string value. Otherwise, this returns an empty string.

For example, if you are parsing this snippet:

<h1 lang='en-GB'>Shatner Reprises Kirk R&ocirc;le</h1>

and have just parsed the token for h1, $stream->get_text( ) returns "Shatner Reprises Kirk Rôle". If you call it again (and again and again), it will return the empty string, because the next token waiting is not a text token but an h1 end-tag token.
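
The behavior just described can be checked with a short script. This is a minimal sketch rather than a definitive recipe: the variable names $html, $first, and $second are chosen here for illustration, and the binmode call is an assumption about wanting UTF-8 output on your terminal.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html   = q{<h1 lang='en-GB'>Shatner Reprises Kirk R&ocirc;le</h1>};
my $stream = HTML::TokeParser->new(\$html) or die "Couldn't create parser";

$stream->get_token( );                 # consume the h1 start-tag token
my $first  = $stream->get_text( );     # the text token, with &ocirc; resolved
my $second = $stream->get_text( );     # next token is </h1>, so an empty string

binmode STDOUT, ':encoding(UTF-8)';    # the resolved entity is a non-ASCII character
print "first:  [$first]\n";
print "second: [$second]\n";
```

The second call returns an empty string rather than undef, so test its length (not its definedness) if you need to know whether any text was found.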

7.5.2. The get_text( ) Method with Parameters

The syntax for get_text( ) with parameters is:

$text_string = $stream->get_text('foo');
$text_string = $stream->get_text('/bar');

Specifying a foo or /bar parameter changes the meaning of get_text( ). If you specify a tag, you get all the text up to the next time that tag occurs (or until the end of the file, if that tag never occurs).

For however many text tokens are found, their text values are taken, entity sequences are resolved, and they are combined and returned. (All the other sorts of tokens seen along the way are just ignored.)

Note that the tag name that you specify (whether foo or /bar) must be in lowercase.

This sounds complex, but it works out well in real use. For example, imagine you've got this snippet:

<h1 lang='en-GB'>Star of <cite>Star Trek</cite> in New R&ocirc;le</h1>
   <cite>American Psycho II</cite> in Production.
   <!-- I'm not making this up, folks. -->
   <br>Shatner to play FBI profiler.

and that you've just parsed the token for h1. Calling $stream->get_text( ) simply gets "Star of " (note the trailing space). If, however, the task you're performing is the extraction of the text content of <h1> elements, then what's called for is:

$stream->get_text('/h1')

This returns Star of Star Trek in New Rôle.

Calling:

$stream->get_text('br')

returns:

"Star of Star Trek in New Rôle\n  American Psycho II in Production.\n   \n  "

And if you instead call $stream->get_text('schlock') and there is no <schlock...> in the rest of the document, you get "Star of Star Trek in New Rôle\n American Psycho II in Production.\n \n Shatner to play FBI profiler.\n", plus whatever text there is in the rest of the document.
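
To compare these calls side by side, here is a rough sketch. Because each get_text( ) call consumes tokens, it uses a hypothetical helper, text_after_h1( ), that parses the snippet afresh each time; the helper and the variable names are inventions for this illustration and are not part of HTML::TokeParser.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = <<'EOF';
<h1 lang='en-GB'>Star of <cite>Star Trek</cite> in New R&ocirc;le</h1>
   <cite>American Psycho II</cite> in Production.
   <!-- I'm not making this up, folks. -->
   <br>Shatner to play FBI profiler.
EOF

# Hypothetical helper: parse $html from the start, consume the h1
# start-tag, and return what get_text( ) yields for the given arguments.
sub text_after_h1 {
    my @stop_at = @_;
    my $stream = HTML::TokeParser->new(\$html) or die "Couldn't create parser";
    $stream->get_token( );                  # skip the h1 start-tag
    return $stream->get_text(@stop_at);
}

my $next_only = text_after_h1( );           # stops at the first tag of any kind
my $to_h1_end = text_after_h1('/h1');       # stops at </h1>
my $to_br     = text_after_h1('br');        # stops at <br>
my $to_end    = text_after_h1('schlock');   # never matches, so runs to end of input

binmode STDOUT, ':encoding(UTF-8)';
print "no argument: [$next_only]\n";
print "to /h1:      [$to_h1_end]\n";
print "to br:       [$to_br]\n";
print "to schlock:  [$to_end]\n";
```

The bracketed output makes the leading and trailing whitespace of each result visible.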

Note that this never introduces whitespace where it's not there in the original. So if you're parsing this:

<table>
<tr><th>Height<th>Weight<th>Shoe Size</tr>
<tr><th>6' 2"<th>180lbs<th>n/a</tr>
</table>

and you've just parsed the table token, if you call:

$stream->get_text('/table')

you'll get back:

"\nHeightWeightShoe Size\n6' 2"180lbsn/a\n"

Not all nontext tokens are ignored by $stream->get_text( ). Some tags receive special treatment: if an img or applet tag is seen, it is treated as if it were a text token; if it has an alt attribute, its value is used as the content of the virtual text token; otherwise, you get just the uppercase tag name in brackets: [IMG] or [APPLET]. For further information on altering and expanding this feature, see perldoc HTML::TokeParser in the documentation for the get_text method, and possibly even the surprisingly short HTML::TokeParser source code.

If you just want to turn off such special treatment for all tags:

$stream->{'textify'} = {};

This is the only case of the $object->{'thing'} syntax we'll discuss in this book. No other object here requires us to reach into its internals directly like this; we do it in this case only because the object has no method for more normal access. For more information on this particular syntax, see perldoc perlref's documentation on hash references.
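
The following sketch contrasts the default treatment of img tags with the result after emptying the textify hash. The sample HTML and the helper para_text( ) are made up for this illustration; only the textify setting and the get_text( ) call come from the discussion above.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = q{<p>Kirk: <img src="kirk.jpg" alt="Shatner as Kirk"> and <img src="spock.jpg"></p>};

# Hypothetical helper: return the text content of the p element,
# optionally switching off the special treatment of img and applet first.
sub para_text {
    my ($disable_textify) = @_;
    my $stream = HTML::TokeParser->new(\$html) or die "Couldn't create parser";
    $stream->{'textify'} = {} if $disable_textify;
    $stream->get_token( );                  # skip the p start-tag
    return $stream->get_text('/p');
}

my $with_alt    = para_text(0);   # alt text, or [IMG] when there is no alt
my $without_alt = para_text(1);   # img tags ignored like any other tag

print "default:  [$with_alt]\n";
print "disabled: [$without_alt]\n";
```

With the special treatment disabled, the two text tokens around the first img are simply joined, so the result contains two adjacent spaces.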

7.5.3. The get_trimmed_text( ) Method

The syntax for the get_trimmed_text( ) method is:

$text = $stream->get_trimmed_text( );
$text = $stream->get_trimmed_text('foo');
$text = $stream->get_trimmed_text('/bar');

These work exactly like the corresponding $stream->get_text( ) calls, except any leading and trailing whitespace is removed and each sequence of whitespace is replaced with a single space.

Returning to our news example:

use HTML::TokeParser;

my $html = <<'EOF';
<h1 lang='en-GB'>Star of <cite>Star Trek</cite> in New R&ocirc;le</h1>
   <cite>American Psycho II</cite> in Production.
   <!-- I'm not making this up, folks. -->
   <br>Shatner to play FBI profiler.
EOF
my $stream = HTML::TokeParser->new(\$html);
$stream->get_token( );                      # skip the h1 start-tag

The get_text( ) method would return "Star of " (with the trailing space), while the get_trimmed_text( ) method would return "Star of" (no trailing space).

Similarly, $stream->get_text('br') would return:

"Star of Star Trek in New Rôle\n  American Psycho II in Production.\n   \n  "

whereas $stream->get_trimmed_text('br') would return:

"Star of Star Trek in New Rôle American Psycho II in Production."

Notice that the newline and indentation in the middle became a single space, and the trailing run of newlines and spaces was simply removed.

The caveat that get_text( ) does not introduce any new whitespace applies also to get_trimmed_text( ). So where, in the last example in get_text( ), you would have gotten \nHeightWeightShoe Size\n6' 2"180lbsn/a\n, get_trimmed_text( ) would return HeightWeightShoe Size 6' 2"180lbsn/a.
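
Here is a sketch of the table example, assuming you want to see both the run-together result and one way around it. Reading cell by cell with get_tag('th') (described in the following sections) followed by get_trimmed_text( ) is only one possible approach, and the names $joined and @cells are chosen for this illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = <<'EOF';
<table>
<tr><th>Height<th>Weight<th>Shoe Size</tr>
<tr><th>6' 2"<th>180lbs<th>n/a</tr>
</table>
EOF

# All the text of the table at once: cells run together, because no
# whitespace separates them in the source.
my $stream = HTML::TokeParser->new(\$html) or die "Couldn't create parser";
$stream->get_token( );                          # skip the table start-tag
my $joined = $stream->get_trimmed_text('/table');

# One cell at a time: jump to each th start-tag, then take the text
# up to whatever tag comes next.
my @cells;
$stream = HTML::TokeParser->new(\$html) or die "Couldn't create parser";
while ($stream->get_tag('th')) {
    push @cells, $stream->get_trimmed_text( );
}

print "joined: [$joined]\n";
print "cell:   [$_]\n" for @cells;
```

Keeping the cells in an array leaves the choice of separator to you when you join them later.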

7.5.4. The get_tag( ) Method

The syntax for the get_tag( ) method is:

$tag_reference = $stream->get_tag( );

This returns the next start-tag or end-tag token, throwing out anything else it has to skip to get there. Its return format differs from that of get_token( ), however. Where get_token( ) would return start-tags and end-tags in these formats:

['S', 'hr', {'class','Ginormous'}, ['class'], '<hr class=Ginormous>']
['E', 'p' , '</P>']

get_tag( ) instead returns them in this format:

['hr', {'class','Ginormous'}, ['class'], '<hr class=Ginormous>']
['/p' , '</P>']

That is, the first item has been taken away, and end-tag names start with /.

7.5.4.1. Start-tags

Unless $tag->[0] begins with a /, the tag represents a start-tag:

[$tag, $attribute_hashref, $attribute_order_arrayref, $source]

The components of this token are:

$tag: The tag name, in lowercase.
$attribute_hashref: A reference to a hash encoding the attributes of this tag. The (lowercase) attribute names are the keys of the hash.
$attribute_order_arrayref: A reference to an array of (lowercase) attribute names, in case you need to access elements in order.
$source: The original HTML for this token.

The first two values are the most interesting ones, for most purposes.

For example, parsing this HTML with $stream->get_tag( ) :

<IMG SRC="kirk.jpg" alt="Shatner in r&ocirc;le of Kirk" WIDTH=352 height=522>

gives this tag:

[
  'img',
  { 'alt' => 'Shatner in rôle of Kirk',
     'height' => '522', 'src' => 'kirk.jpg', 'width' => '352'
  },
  [ 'src', 'alt', 'width', 'height' ],
  '<IMG SRC="kirk.jpg" alt="Shatner in r&ocirc;le of Kirk" WIDTH=352 height=522>'
]

Notice that the tag and attribute names have been lowercased, and that the &ocirc; entity within the alt attribute has been decoded to ô.
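
As a quick check of that structure, the sketch below unpacks the four items of the returned array reference. The variable names ($name, $attr, $order, $source) are chosen here for illustration, and the binmode call assumes UTF-8 output is wanted.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = q{<IMG SRC="kirk.jpg" alt="Shatner in r&ocirc;le of Kirk" WIDTH=352 height=522>};
my $stream = HTML::TokeParser->new(\$html) or die "Couldn't create parser";

my $tag = $stream->get_tag( );
my ($name, $attr, $order, $source) = @$tag;

binmode STDOUT, ':encoding(UTF-8)';
print "tag name: $name\n";
# Walk the attributes in their original order, using the lowercased names.
print "  $_ = $attr->{$_}\n" for @$order;
print "source: $source\n";
```

Iterating over @$order rather than keys %$attr is what preserves the order in which the attributes appeared in the source.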

7.5.4.2. End-tags

When $tag->[0] does begin with a /, the token represents an end-tag:

[ "/$tag", $source ]

The components of this tag are:

"/$tag": The lowercase name of the tag being closed, with a leading /.
$source: The original HTML for this token.

Parsing this HTML with $stream->get_tag( ) :

</A>

gives this tag:

[ '/a', '</A>' ]

Note that if get_tag( ) reads to the end of the stream and finds no tag tokens, it will return undef.
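
Putting the two formats together, this sketch walks every tag in a small made-up document and tells start-tags from end-tags by the leading /. The sample HTML and the @seen array exist only for this illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = q{<P>See <A HREF="kirk.html" title="Bio">Kirk</A>.<hr class=Ginormous></P>};
my $stream = HTML::TokeParser->new(\$html) or die "Couldn't create parser";

my @seen;
while (my $tag = $stream->get_tag( )) {       # undef at end of stream ends the loop
    my $name = $tag->[0];
    if ($name =~ m{^/}) {
        push @seen, "end $name";              # end-tag: [ "/name", $source ]
    }
    else {
        my ($attr, $order) = @$tag[1, 2];     # start-tag: attributes and their order
        push @seen, "start $name" . join '', map { " $_=$attr->{$_}" } @$order;
    }
}

print "$_\n" for @seen;
```

Text tokens such as "See " and "Kirk" never reach the loop body, since get_tag( ) skips them.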

7.5.5. The get_tag( ) Method with Parameters

Pass a list of tags to skip through the tokens until a matching tag is found:

$tag_reference = $stream->get_tag('foo', '/bar', 'baz');

This returns the next start-tag or end-tag that matches any of the strings you provide (throwing out anything it has to skip to get there). Note that the tag name(s) that you provide as parameters must be in lowercase.

If get_tag( ) reads to the end of the stream and finds no matching tag tokens, it will return undef. For example, this code's get_tag( ) looks for img start-tags:

while (my $img_tag = $stream->get_tag('img')) {
  my $i = $img_tag->[1];            # attributes of this img tag
  my @lack = grep !exists $i->{$_}, qw(alt height width);
  print "Missing for ", $i->{'src'} || "????", ": @lack\n" if @lack;
}
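
The methods in this section also combine naturally. As a further sketch (with made-up sample HTML and an illustrative @links array), the following collects each link's URL and label by pairing get_tag('a') with get_trimmed_text('/a'):

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = <<'EOF';
<p>Cast: <a href="shatner.html">William
   Shatner</a>, <a name="top">no link here</a>,
<A HREF="nimoy.html"><b>Leonard</b> Nimoy</A></p>
EOF

my $stream = HTML::TokeParser->new(\$html) or die "Couldn't create parser";

my @links;
while (my $a_tag = $stream->get_tag('a')) {
    my $href = $a_tag->[1]{'href'};
    next unless defined $href;                  # skip <a name=...> anchors
    my $label = $stream->get_trimmed_text('/a');
    push @links, [$href, $label];
}

print "$_->[0]: $_->[1]\n" for @links;
```

Because get_trimmed_text('/a') ignores tags such as <b> along the way and collapses the line break inside the first link, each label comes back as a single tidy line.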