The HTML Modules (Perl in a Nutshell, 2nd Edition)

#!/usr/local/bin/perl -w

require HTML::TokeParser;

# Our string that turns out to be HTML!
my $html = '<p>Some text. <a href="http://blah">My name is Nate!</a></p>';

my $parser = HTML::TokeParser->new(\$html);

# get_tag( ) tells TokeParser to match a tag by name
while (my $token = $parser->get_tag("a")) {
    my $url  = $token->[1]{href} || "-";
    my $text = $parser->get_trimmed_text("/a");
    print "URL is: $url.\nURL text is: $text.\n";
}

20.4.2.1. HTML::TokeParser methods

new

new(  )

Constructor. Takes a filename, a filehandle, or a reference to a scalar as its argument, which identifies the content to parse. A plain scalar is treated as the name of a file to open. A reference to a scalar means that the scalar itself holds the HTML. A filehandle is read until end-of-file. Returns undef on failure.
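The three argument forms can be sketched as follows. This is a minimal illustration, not a definitive recipe: the sample HTML, the variable names, and the path /no/such/dir/page.html are assumptions made for the example, and an in-memory filehandle stands in for a real file.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = '<p>Hello, <b>world</b>!</p>';

# Reference to a scalar: the scalar's contents are parsed as HTML.
my $from_string = HTML::TokeParser->new(\$html)
    or die "Could not create a parser from the string\n";

# Filehandle: read until end-of-file. An in-memory handle stands in
# for a real file here.
open(my $fh, '<', \$html) or die "Could not open in-memory handle: $!\n";
my $from_handle = HTML::TokeParser->new($fh)
    or die "Could not create a parser from the filehandle\n";

# Plain scalar: treated as a filename. This path is hypothetical and is
# assumed not to exist, so new should return undef.
my $missing = HTML::TokeParser->new('/no/such/dir/page.html');

print "first tag from string: ", $from_string->get_tag->[0], "\n";
print "first tag from handle: ", $from_handle->get_tag->[0], "\n";
print "missing file gives undef\n" unless defined $missing;
```

Checking the return value of new before using the parser avoids calling methods on undef when a file cannot be opened.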

get_tag

get_tag(  )

Returns the next start or end tag in the document, or undef if there are no more tags. If you pass a tag name, get_tag skips tokens until it finds a tag with that name, which makes it easy to ignore everything except the tag you want. A start tag is returned as an array reference of the form [$tag, $attr, $attrseq, $text]. An end tag is returned with a leading slash in its name, as in ["/$tag", $text].
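To make the two return shapes concrete, here is a small sketch that walks every tag in a string and tells start tags from end tags by the leading slash. The helper name describe_tags and the sample HTML are hypothetical, chosen only for this illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# describe_tags is a hypothetical helper written for this example.
sub describe_tags {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "Cannot parse HTML\n";
    my @lines;

    # With no argument, get_tag returns every start and end tag in turn.
    while (my $tag = $p->get_tag) {
        if ($tag->[0] =~ m{^/}) {
            # End tag: the name carries a leading slash.
            push @lines, "end $tag->[0]";
        }
        else {
            # Start tag: [$tag, $attr, $attrseq, $text]
            my ($name, $attr, $attrseq) = @$tag;
            my @pairs = map { "$_=$attr->{$_}" } @$attrseq;
            push @lines, join(' ', "start $name", @pairs);
        }
    }
    return @lines;
}

print "$_\n" for describe_tags('<h1 class="top">Title</h1><p>Body</p>');
```

The $attrseq element lists the attribute names in their original order, which is why the sketch iterates over it rather than over the keys of the $attr hash.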

get_text

get_text(  )

Returns all text found at the current position. If the next token is not text, get_text returns a zero-length string. You can pass an end tag, such as "/a", to get_text to return all of the text that comes before that end tag.
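The difference between calling get_text with and without an end tag can be sketched as below. The sample HTML and variable names are assumptions for the example. Two parsers are used so that each call starts from a known position.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = '<title>My <b>Page</b> Title</title>';

# Without an argument, get_text stops at the next tag.
my $p1 = HTML::TokeParser->new(\$html) or die "Cannot parse HTML\n";
my $before_tag = $p1->get_text;    # next token is <title>, not text
$p1->get_tag('title');
my $first_run = $p1->get_text;     # text up to the <b> tag

# With an end tag, get_text collects all text up to that tag and
# drops the markup in between.
my $p2 = HTML::TokeParser->new(\$html) or die "Cannot parse HTML\n";
$p2->get_tag('title');
my $whole = $p2->get_text('/title');

print "before tag: [$before_tag]\n";
print "first run:  [$first_run]\n";
print "whole:      [$whole]\n";
```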

get_token

get_token(  )

Returns the next token found in the HTML document, or undef if no next token exists. Each token is returned as an array reference. The first element of the array is a string that identifies the token type, and the remaining elements depend on that type: a start tag carries its name, attributes, attribute order, and original text, while text, comments, declarations, and processing instructions carry their content. get_token uses the following labels for the token types:

S: Start tag
E: End tag
T: Text
C: Comment
D: Declaration
PI: Processing instruction

Consider the following code:

#!/usr/local/bin/perl -w

require HTML::TokeParser;

my $html = '<a href="http://blah">My name is Nate!</a></p>';
my $p = HTML::TokeParser->new(\$html);

while (my $token = $p->get_token) {
    my $i = 0;
    foreach my $tk (@{$token}) {
        print "token[$i]: $tk\n";
        $i++;
    }
}

The script prints the elements of each token as follows (the HASH and ARRAY addresses vary from run to run):

token[0]: S
token[1]: a
token[2]: HASH(0x8146d3c)
token[3]: ARRAY(0x814a380)
token[4]: <a href="http://blah">
token[0]: T
token[1]: My name is Nate!
token[2]:
token[0]: E
token[1]: a
token[2]: </a>
token[0]: E
token[1]: p
token[2]: </p>
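Because the first element of each token is its type label, a loop can dispatch on that label instead of printing every element. The following is a minimal sketch under that assumption. The helper name token_summary and the sample HTML are hypothetical.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# token_summary is a hypothetical helper written for this example.
sub token_summary {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "Cannot parse HTML\n";
    my @summary;

    while (my $token = $p->get_token) {
        my $type = $token->[0];    # S, E, T, C, D, or PI
        if    ($type eq 'S')  { push @summary, "start <$token->[1]>" }
        elsif ($type eq 'E')  { push @summary, "end </$token->[1]>" }
        elsif ($type eq 'T')  { push @summary, "text '$token->[1]'" }
        elsif ($type eq 'C')  { push @summary, "comment" }
        elsif ($type eq 'D')  { push @summary, "declaration" }
        elsif ($type eq 'PI') { push @summary, "processing instruction" }
    }
    return @summary;
}

print "$_\n" for token_summary('<!DOCTYPE html><!-- note --><p>Hi</p>');
```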

get_trimmed_text

get_trimmed_text(  )

Works the same as get_text, but collapses each run of whitespace into a single space and removes leading and trailing whitespace.
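A side-by-side comparison with get_text shows the effect of the trimming. This is a small sketch. The sample HTML and variable names are assumptions for the example.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = "<p>  Some   spaced\n   text  </p>";

# get_text returns the text exactly as it appears in the document.
my $raw_parser = HTML::TokeParser->new(\$html) or die "Cannot parse HTML\n";
$raw_parser->get_tag('p');
my $raw = $raw_parser->get_text('/p');

# get_trimmed_text normalizes the whitespace in the same text.
my $trim_parser = HTML::TokeParser->new(\$html) or die "Cannot parse HTML\n";
$trim_parser->get_tag('p');
my $trimmed = $trim_parser->get_trimmed_text('/p');

print "raw:     [$raw]\n";
print "trimmed: [$trimmed]\n";
```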

unget_token

unget_token(  )

Pushes one or more tokens back onto the parser so that the next call to get_token returns them again. This is useful when you need to look at a token before deciding how to handle it.
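One common use is a "peek" operation that inspects the next token without consuming it. The sketch below assumes a hypothetical helper named peek_token. It is not part of HTML::TokeParser.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# peek_token is a hypothetical helper: it reads the next token and
# immediately pushes it back, so the parser position is unchanged.
sub peek_token {
    my ($p) = @_;
    my $token = $p->get_token;
    $p->unget_token($token) if defined $token;
    return $token;
}

my $html = '<p>Hi</p>';
my $p = HTML::TokeParser->new(\$html) or die "Cannot parse HTML\n";

my $peeked = peek_token($p);    # look at the first token
my $next   = $p->get_token;     # the same token is returned again

print "peeked at $peeked->[0] $peeked->[1], ",
      "then read $next->[0] $next->[1]\n";
```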
