new( )
Constructor. Takes a single argument that identifies the HTML to
parse: a filename, a filehandle, or a reference to a scalar. If the
argument is a plain scalar, new treats it as the name of a file to
open and read. If it is a reference to a scalar, new parses the HTML
held in the referenced scalar (\$scalar). If it is a filehandle, new
reads from it until end-of-file. Returns undef on failure.
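As a minimal sketch of the constructor forms described above, the following parses HTML from a scalar reference and then checks for failure when given a filename. The filename no-such-file.html is hypothetical and assumed not to exist; the variable names are illustrative only.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# A reference to a scalar: the scalar's contents are parsed as HTML.
my $html = '<p>Hello</p>';
my $from_string = HTML::TokeParser->new(\$html)
    or die "could not create parser";

# A plain scalar is treated as a filename. This name is hypothetical and
# assumed not to exist, so new is expected to return undef here.
my $from_file = HTML::TokeParser->new('no-such-file.html');
print defined $from_file ? "opened file\n" : "could not open file: $!\n";
```

Checking the return value before calling any other method avoids calling methods on undef when a file cannot be opened.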
get_tag( )
Returns the next start or end tag in a document, skipping any other
tokens such as text and comments. If there are no remaining start or
end tags, get_tag returns undef. You can pass a tag name (such as "a"
or "/a") as an argument, in which case get_tag skips ahead to the
next tag of that name, if it exists. A start tag is returned as an
array reference of the form [$tag, $attr, $attrseq, $text]. An end
tag is returned with a slash in front of the tag name, e.g.,
["/$tag", $text].
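A common use of get_tag is pulling out every link in a page. The sketch below assumes a small inline document; the example.com URLs and the @links array are placeholders introduced for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = '<p>See <a href="http://example.com/one">the first</a> and '
         . '<a href="http://example.com/two">the second</a> link.</p>';
my $p = HTML::TokeParser->new(\$html) or die "could not create parser";

my @links;
# get_tag('a') skips everything up to the next <a> start tag.
while (my $tag = $p->get_tag('a')) {
    my ($name, $attr) = @{$tag};    # start tag: [$tag, $attr, $attrseq, $text]
    push @links, $attr->{href};
    my $end = $p->get_tag('/a');    # end tag: ["/a", $text]
    print "$name -> $attr->{href} (closed by $end->[0])\n";
}
```

The attribute hash in the second element is keyed by attribute name, so $attr->{href} holds the link target.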
get_text( )
Returns all text found at the current position. If the next token is
not text, get_text returns a zero-length string. You can pass a tag
name, typically an end tag such as "/$tag", to get_text to return all
of the text that occurs before that tag.
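The difference between calling get_text with and without an argument can be sketched as follows. The sample heading is made up for illustration, and the parser is created twice so that each call starts from the same position.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = '<h1>Hello <b>big</b> world</h1>';

# Without an argument, get_text stops at the next tag (<b> here).
my $p = HTML::TokeParser->new(\$html) or die "could not create parser";
$p->get_tag('h1');
my $first = $p->get_text;

# With "/h1", get_text collects the text up to the closing </h1>.
$p = HTML::TokeParser->new(\$html) or die "could not create parser";
$p->get_tag('h1');
my $all = $p->get_text('/h1');

print "first: [$first]\n";
print "all:   [$all]\n";
```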
get_token( )
Returns the next token found in the HTML document, or
undef if no next token exists. Each token is
returned as an array reference. The first element of the array is a
string that identifies the token type; the remaining elements depend
on that type and hold the tag name, attributes, and original text for
start and end tags, or the content of text, comments, declarations,
and processing instructions. get_token uses the following labels for
the token types:
- S: Start tag
- E: End tag
- T: Text
- C: Comment
- D: Declaration
- PI: Processing instruction
Consider the following code:
#!/usr/local/bin/perl -w
use strict;
use HTML::TokeParser;

# Parse from a reference to a scalar that holds the HTML
my $html = '<a href="http://blah">My name is Nate!</a></p>';
my $p = HTML::TokeParser->new(\$html);

# Print every element of every token, numbered by position
while (my $token = $p->get_token) {
    my $i = 0;
    foreach my $tk (@{$token}) {
        print "token[$i]: $tk\n";
        $i++;
    }
}
The script prints the items of each token in the HTML as follows (the HASH and ARRAY addresses vary from run to run):
token[0]: S
token[1]: a
token[2]: HASH(0x8146d3c)
token[3]: ARRAY(0x814a380)
token[4]: <a href="http://blah">
token[0]: T
token[1]: My name is Nate!
token[2]:
token[0]: E
token[1]: a
token[2]: </a>
token[0]: E
token[1]: p
token[2]: </p>
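Because the first item of every token is its type label, a loop over get_token usually branches on that label. The following is a minimal sketch under that assumption; the @seen array and the description strings are illustrative and not part of the module.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = '<!-- note --><a href="http://blah">My name is Nate!</a>';
my $p = HTML::TokeParser->new(\$html) or die "could not create parser";

my @seen;
while (my $token = $p->get_token) {
    my $type = $token->[0];
    if ($type eq 'S') {
        # Start tag: type, tag name, attribute hash, attribute order, raw text
        my (undef, $tag, $attr) = @{$token};
        my @pairs = map { "$_=$attr->{$_}" } sort keys %{$attr};
        push @seen, "start <$tag> @pairs";
    }
    elsif ($type eq 'E') {
        push @seen, "end </$token->[1]>";
    }
    elsif ($type eq 'T') {
        push @seen, "text '$token->[1]'";
    }
    else {
        # C, D, and PI tokens are only counted here
        push @seen, "other ($type)";
    }
}
print "$_\n" for @seen;
```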
get_trimmed_text( )
Works the same as get_text, but collapses each run of
whitespace to a single space and removes leading
and trailing whitespace.
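A short sketch of the whitespace handling, using a made-up paragraph with irregular spacing and an embedded newline:

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = "<p>  My   name\n   is Nate!  </p>";
my $p = HTML::TokeParser->new(\$html) or die "could not create parser";

$p->get_tag('p');
# Runs of whitespace (including the newline) collapse to one space,
# and the leading and trailing whitespace is removed.
my $text = $p->get_trimmed_text('/p');
print "[$text]\n";    # prints "[My name is Nate!]"
```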
unget_token( )
Pushes one or more tokens back to the parser so that they are
returned again the next time you call get_token.
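One use for unget_token is peeking at the next token without consuming it. The sketch below assumes a small inline document; the variable names $peeked and $again are illustrative.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = '<b>bold</b> plain';
my $p = HTML::TokeParser->new(\$html) or die "could not create parser";

# Look at the next token, then push it back onto the parser.
my $peeked = $p->get_token;
$p->unget_token($peeked);

# The next call to get_token returns the token that was pushed back.
my $again = $p->get_token;
print "peeked: $peeked->[0] $peeked->[1]\n";
print "again:  $again->[0] $again->[1]\n";
```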