
7.2. Basic HTML::TokeParser Use

The HTML::TokeParser module is a class for accessing HTML as tokens. An HTML::TokeParser object gives you one token at a time, much as a filehandle gives you one line at a time from a file. The HTML can be tokenized from a file or string. The tokenizer decodes entities in attributes, but not entities in text.

Create a token stream object using one of these two constructors:

my $stream = HTML::TokeParser->new($filename)
  || die "Couldn't read HTML file $filename: $!";


my $stream = HTML::TokeParser->new( \$string_of_html );

Once you have that stream object, you get the next token by calling:

my $token = $stream->get_token( );

The $token variable then holds an array reference, or undef if there's nothing left in the stream's file or string. This code processes every token in a document:

my $stream = HTML::TokeParser->new($filename)
  || die "Couldn't read HTML file $filename: $!";

while(my $token = $stream->get_token) {
  # ... consider $token ...
}
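
As a minimal sketch of that loop in action, the following tokenizes a short in-memory string and records the type code found in the first element of each token. The sample HTML and the @types array are illustrative assumptions, not part of the original example.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample input; any HTML string would do.
my $html = '<p>Hello, <b>world</b>!</p>';

# Pass a reference to the string, as in the second constructor form.
my $stream = HTML::TokeParser->new( \$html )
  || die "Couldn't create a token stream";

my @types;   # first element of each token, in document order
while ( my $token = $stream->get_token ) {
  push @types, $token->[0];
}

# get_token returned undef, so the stream is exhausted.
print join( ' ', @types ), "\n";
```

Because get_token returns undef at the end of the input, the while condition doubles as the end-of-document test and no separate check is needed.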

The $token can have one of six kinds of values, distinguished first by the value of $token->[0], as shown in Table 7-1.

Table 7-1. Token types

Token                    Values
Start-tag                ["S",  $tag, $attribute_hashref, $attribute_order_arrayref, $source]
End-tag                  ["E",  $tag, $source]
Text                     ["T",  $text, $should_not_decode]
Comment                  ["C",  $source]
Declaration              ["D",  $source]
Processing instruction   ["PI", $content, $source]
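
The following is a rough sketch of how a program might branch on $token->[0] and pick the remaining fields out of each token. The sample HTML, the @report array, and the report wording are assumptions made for illustration. The input is also chosen to show the entity handling described at the start of this section: the href attribute value arrives decoded, while the text token still contains the raw entity.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical input that produces comment, start-tag, text, and end-tag tokens.
my $html = '<!-- note --><a href="/search?a=1&amp;b=2">Q&amp;A</a>';

my $stream = HTML::TokeParser->new( \$html )
  || die "Couldn't create a token stream";

my @report;   # one human-readable line per token
while ( my $token = $stream->get_token ) {
  my $type = $token->[0];
  if ( $type eq 'S' ) {
    # ["S", $tag, $attribute_hashref, ...]
    my ( $tag, $attr ) = @$token[ 1, 2 ];
    my $href = defined $attr->{href} ? $attr->{href} : '(none)';
    push @report, "start <$tag> href=$href";
  }
  elsif ( $type eq 'E' ) {
    # ["E", $tag, $source]
    push @report, "end </$token->[1]>";
  }
  elsif ( $type eq 'T' ) {
    # ["T", $text, $should_not_decode]; entities in $text are not decoded
    push @report, "text $token->[1]";
  }
  elsif ( $type eq 'C' ) {
    # ["C", $source]
    push @report, "comment $token->[1]";
  }
  # "D" and "PI" tokens could be handled with further branches.
}

print "$_\n" for @report;
```

Testing the type code first keeps each branch free to assume the field layout for that token kind, which matters because the position of $source differs from one kind to the next.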


Copyright © 2002 O'Reilly & Associates. All rights reserved.