The HTML Modules (Perl in a Nutshell, 2nd Edition)

20.4. The HTML Modules

HTML modules provide an interface to parse HTML documents. After you parse the document, you can print or display it according to the markup tags or extract specific information such as hyperlinks.

The HTML::Parser module provides methods for, literally, parsing HTML. It can handle HTML text from a string or file and can separate out the syntactic structures and data. You shouldn't use HTML::Parser directly, however, since its interface hasn't been designed to make your life easy when you parse HTML. It's merely a base class from which you can build your own parser to deal with HTML in any way you want. And if you don't want to roll your own HTML parser or parser class, then there's always HTML::TokeParser and HTML::TreeBuilder, both of which are covered in this chapter.

HTML::TreeBuilder is a class that parses HTML into a syntax tree. In a syntax tree, each element of the HTML, such as container elements with beginning and end tags, is stored relative to other elements. This preserves the nested structure and behavior of HTML and its hierarchy.

A syntax tree of the TreeBuilder class is formed of connected nodes that represent each element of the HTML document. These nodes are saved as objects from the HTML::Element class. An HTML::Element object stores all the information from an HTML tag: the start tag, end tag, attributes, plain text, and pointers to any nested elements.

The remaining classes of the HTML modules use the syntax tree and its nodes of element objects to output useful information from the HTML documents. The format classes, such as HTML::FormatText and HTML::FormatPS, allow you to produce text and PostScript from HTML. The HTML::LinkExtor class extracts all of the links from a document. Additional modules provide means for replacing HTML character entities and implementing HTML tags as subroutines.

20.4.1. HTML::Parser

This module implements the base class for the other HTML modules. A parser object is created with the new constructor:

$p = HTML::Parser->new( );

The constructor takes no arguments.

The parser object provides methods that read in HTML from a string or a file. The string-reading method can be fed the document in several smaller chunks if the HTML is too large to handle at once. Each chunk of HTML will be appended to the object, and the eof method indicates the end of the document. These basic methods are described below.

eof

$p->eof(  )

Indicates the end of a document and flushes any buffered text. Returns the parser object.

parse

$p->parse(string)

Reads HTML into the parser object from a given string. Performance problems occur if the string is too large, so the HTML can be broken up into smaller pieces, which will be appended to the data already contained in the object. The parse can be terminated with a call to the eof method.

parse_file

$p->parse_file(file)

Reads HTML into the parser object from the given file, which can be a filename or an open filehandle.

When the parse or parse_file method is called, it parses the incoming HTML with a few internal methods. In HTML::Parser, these methods are defined, but empty. Additional HTML parsing classes (included in the HTML modules or ones you write yourself) override these methods for their own purposes. For example:

package HTML::MyParser;
require HTML::Parser;
@ISA = qw(HTML::Parser);

sub start {
    my ($self, $tag, $attr, $attrseq, $origtext) = @_;
    # your subroutine defined here
}

The following list shows the internal methods contained in HTML::Parser.

comment

comment(comment)

Invoked on comments from HTML (text between <!-- and -->). The text of the comment (without the tags) is given to the method as the string comment.

end

end(tag, origtext)

Invoked on end tags (those with the </tag> form). The first argument, tag, is the tag name in lowercase, and the second argument, origtext, is the original HTML text of the tag.

start

start(tag, attr, attrseq, origtext)

Invoked on start tags. The first argument, tag, is the name of the tag in lowercase. The second argument is a reference to a hash, attr. This hash contains all the attributes and their values in key/value pairs. The keys are the names of the attributes in lowercase. The third argument, attrseq, is a reference to an array that contains the names of all the attributes in the order they appeared in the tag. The fourth argument, origtext, is a string that contains the original text of the tag.

text

text(text)

Invoked on plain text in the document. The text is passed unmodified and may contain newlines. Character entities in the text are not expanded.

xml_mode

xml_mode(bool)

Unlike the methods above, xml_mode is not a callback but an option that you set on the parser object, as in $p->xml_mode(1). Enabling it changes the parser to allow some XML constructs such as empty element tags and XML processing instructions. It also disables forcing tag and attribute names to lowercase when they are reported by the tag and attr arguments, and suppresses special treatment of elements parsed as CDATA for HTML.
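The subclassing approach described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the package name HeadingParser and the hash keys in_h1 and headings are hypothetical names chosen for the example. It assumes that a parser constructed without arguments dispatches to the start, end, and text methods as described in this section, and it feeds the document in two chunks before calling eof.

```perl
use strict;
use warnings;

# Hypothetical subclass: collects the text found inside <h1> elements.
package HeadingParser;
require HTML::Parser;
our @ISA = qw(HTML::Parser);

sub start {
    my ($self, $tag, $attr, $attrseq, $origtext) = @_;
    $self->{in_h1} = 1 if $tag eq 'h1';    # entering a heading
}

sub end {
    my ($self, $tag, $origtext) = @_;
    $self->{in_h1} = 0 if $tag eq 'h1';    # leaving a heading
}

sub text {
    my ($self, $text) = @_;
    push @{ $self->{headings} }, $text if $self->{in_h1};
}

package main;

my $p = HeadingParser->new;
$p->parse('<h1>First</h1><p>body</p>');    # first chunk
$p->parse('<h1>Second</h1>');              # second chunk is appended
$p->eof;                                   # signal the end of the document

my @headings = @{ $p->{headings} || [] };
print "$_\n" for @headings;
```

Keeping state in the parser object itself (here, the in_h1 flag) is one simple way to let the start, end, and text methods cooperate.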

20.4.2. HTML::TokeParser

As we said, you should use a subclassed HTML parser if you want a better interface to HTML parsing features than what HTML::Parser gives you. HTML::TokeParser by Gisle Aas is one such example. While HTML::TokeParser is actually a subclass of HTML::PullParser, it can help you do many useful things, such as link extraction and HTML checking.

In short, HTML::TokeParser breaks an HTML document into tokens, attributes, and content. For example, the HTML <a href="http://url">link</a> would break down as:

token: a
    attrib: href
        content: http://url
    content: link
token: /a

For example, you can use HTML::TokeParser to extract links from a string that contains HTML:

#!/usr/local/bin/perl -w

require HTML::TokeParser;

# Our string that turns out to be HTML!
my $html = '<p>Some text. <a href="http://blah">My name is Nate!</a></p>';
my $parser = HTML::TokeParser->new(\$html);

# get_tag( ) tells TokeParser to match a tag by name
while (my $token = $parser->get_tag("a")) {
    my $url = $token->[1]{href} || "-";
    my $text = $parser->get_trimmed_text("/a");
    print "URL is: $url.\nURL text is: $text.\n";
}

20.4.2.1. HTML::TokeParser methods

new

new(  )

Constructor. Takes a filename, a filehandle, or a reference to a scalar as its argument, which identifies the content to be parsed. If a plain scalar is passed, new treats it as the name of a file to open. If a reference to a scalar is passed, new parses the HTML contained in that scalar. A filehandle is read until end-of-file. Returns undef on failure.

get_tag

get_tag(  )

Returns the next start or end tag in a document. If there are no remaining start or end tags, get_tag returns undef. If you pass one or more tag names as arguments, get_tag skips unwanted tokens and returns only the next tag that matches one of those names, if it exists. A start tag is returned as an array reference of the form [$tag, $attr, $attrseq, $text]. An end tag is returned as an array reference whose tag name has a leading slash: ["/$tag", $text].

get_text

get_text(  )

Returns all text found at the current position. If the next token is not text, get_text returns a zero-length string. You can pass an end-tag argument such as "/a" to get_text to return all of the text before that end tag.

get_token

get_token(  )

Returns the next token found in the HTML document, or undef if no next token exists. Each token is returned as an array reference. The first item in the array is a label that identifies the type of token, and the remaining items depend on that type. Tokens can be start tags, end tags, text, comments, declarations, and processing instructions. get_token uses the following labels for the tokens:

S: Start tag
E: End tag
T: Text
C: Comment
D: Declaration
PI: Processing instructions

Consider the following code:

#!/usr/local/bin/perl -w

require HTML::TokeParser;

my $html = '<a href="http://blah">My name is Nate!</a></p>';
my $p = HTML::TokeParser->new(\$html);

while (my $token = $p->get_token) {
    my $i = 0;
    foreach my $tk (@{$token}) {
        print "token[$i]: $tk\n";
        $i++;
    }
}

The items in each token (in the HTML) are displayed as follows:

token[0]: S
token[1]: a
token[2]: HASH(0x8146d3c)
token[3]: ARRAY(0x814a380)
token[4]: <a href="http://blah">
token[0]: T
token[1]: My name is Nate!
token[2]:
token[0]: E
token[1]: a
token[2]: </a>
token[0]: E
token[1]: p
token[2]: </p>

get_trimmed_text

get_trimmed_text(  )

Works the same as get_text, but reduces all instances of multiple spaces to a single space and removes leading and trailing whitespace.

unget_token

unget_token(  )

Useful for pushing tokens back to the parser so they can be reused the next time you call get_token.
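As a hedged sketch of how get_token and unget_token can work together, the following script peeks at the first token, pushes it back, and then tallies every token by its type label. The sample HTML string and the variable names are assumptions made for the illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample document (an assumption for this sketch).
my $html = '<p>Hello <b>world</b><!-- note --></p>';
my $p = HTML::TokeParser->new(\$html) or die "Cannot create parser";

# Peek at the first token, then push it back so the loop below sees it again.
my $first       = $p->get_token;
my $first_label = $first->[0];
$p->unget_token($first);

# Tally tokens by their type label (S, E, T, C, D, or PI).
my %count;
while (my $token = $p->get_token) {
    $count{ $token->[0] }++;
}

print "first token type: $first_label\n";
print "$_: $count{$_}\n" for sort keys %count;
```

Because the first token is returned to the parser with unget_token, it is counted in the loop just like every other token.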

20.4.3. HTML::Element

The HTML::Element module provides methods for dealing with nodes in an HTML syntax tree. You can get or set the contents of each node, traverse the tree, and delete a node.

HTML::Element objects are used to represent elements of HTML. These elements include start and end tags, attributes, contained plain text, and other nested elements.

The constructor for this class requires the name of the tag for its first argument. You may optionally specify initial attributes and values as hash elements in the constructor. For example:

$h = HTML::Element->new('a', 'href' => 'http://www.oreilly.com');

The new element is created for the anchor tag, <a>, which links to the URL through its href attribute.

The following methods are provided for objects of the HTML::Element class.

as_HTML

$h->as_HTML(  )

Returns the HTML string that represents the element and its children.

attr

$h->attr(name [,value])

Sets or retrieves the value of attribute name in the current element.

content

$h->content(  )

Returns the content contained in this element as a reference to an array that contains plain-text segments and references to nested element objects.

delete

$h->delete(  )

Deletes the current element and all of its child elements.

delete_content

$h->delete_content(  )

Removes the content from the current element.

dump

$h->dump(  )

Prints the tag name of the element and all its children to STDOUT. Useful for debugging. The structure of the document is shown by indentation.

endtag

$h->endtag(  )

Returns the original text of the end tag, including the </ and >.

extract_links

$h->extract_links([types])

Retrieves the links contained within an element and all of its child elements. This method returns a reference to an array in which each element is a reference to an array with two values: the value of the link and a reference to the element in which it was found. You may specify the tags from which you want to extract links by providing their names in a list of types.

implicit

$h->implicit([boolean])

Indicates whether the element was contained in the original document (false) or whether it was assumed to be implicit (true) by the parser. Implicit tags are elements that the parser included to conform to proper HTML structure, such as an ending paragraph tag (</p>). You may also set this attribute by providing a boolean argument.

insert_element

$h->insert_element($element, implicit)

Inserts the object $element at the current position relative to the root object $h and updates the position (indicated by pos) to the inserted element. Returns the new $element. The implicit argument is a Boolean indicating whether the element is an implicit tag (true) or the original HTML (false).

is_empty

$h->is_empty(  )

Returns true if the current object has no content.

is_inside

$h->is_inside(tag1 [,tag2, ...])

Returns true if the tag for this element is contained inside one of the tags listed as arguments.

parent

$h->parent([$new])

Without an argument, returns the parent object for this element. If given a reference to another element object, that element is set as the new parent object and is returned.

pos

$h->pos([$element])

Sets or retrieves the current position in the syntax tree of the current object. The returned value is a reference to the element object that holds the current position. The "position" object is an element contained within the tree that has the current object ($h) at its root.

push_content

$h->push_content(content)

Inserts the specified content into the current element. content can be either a scalar containing plain text or a reference to another element. Multiple arguments can be supplied.

starttag

$h->starttag(  )

Returns the original text of the start tag for the element. This includes the < and > and all attributes.

tag

$h->tag([name])

Sets or retrieves the tag name for the element. Tag names are always converted to lowercase.

traverse

$h->traverse(sub, [ignoretext])

Traverses the current element and all of its children, invoking the callback routine sub for each element. The callback routine is called with a reference to the current element (the node), a start flag, and the depth as arguments. The start flag is 1 when entering a node and 0 when leaving (returning to a parent element). If the ignoretext parameter is true (the default), then the callback routine will not be invoked for text content. If the callback routine returns false, the method will not traverse any child elements of that node.
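The following sketch builds a small tree by hand and exercises several of the methods listed above. The element contents and variable names are assumptions chosen for illustration. The ignoretext argument is passed to traverse explicitly rather than relying on a default, since the default may differ between versions of the module.

```perl
use strict;
use warnings;
use HTML::Element;

# Build <p>Visit <a href="...">the publisher</a> for more.</p> by hand.
my $link = HTML::Element->new('a', 'href' => 'http://www.oreilly.com');
$link->push_content('the publisher');

my $para = HTML::Element->new('p');
$para->push_content('Visit ', $link, ' for more.');

my $href   = $link->attr('href');          # retrieve an attribute value
my $inside = $link->is_inside('p');        # true: the link sits inside <p>
my $pieces = scalar @{ $para->content };   # two text segments plus one element

# Record tag names on entry to each node; text content is skipped.
my @tags;
$para->traverse(
    sub {
        my ($node, $start, $depth) = @_;
        push @tags, $node->tag if $start;
        return 1;                          # a true value continues the traversal
    },
    1                                      # ignoretext
);

print $para->as_HTML, "\n";
print "tags visited: @tags\n";
```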

20.4.4. HTML::TreeBuilder

The HTML::TreeBuilder class provides a parser that creates an HTML syntax tree. Each node of the tree is an HTML::Element object. This class inherits from both HTML::Parser and HTML::Element, so methods from both of those classes can be used on its objects.

The methods provided by HTML::TreeBuilder control how the parsing is performed. Values for these methods are set by providing a Boolean value for their arguments.

ignore_text

$p->ignore_text(boolean)

If set to true, text content of elements will not be included in elements of the parse tree. The default is false.

ignore_unknown

$p->ignore_unknown(boolean)

If set to true, unknown tags in the HTML are ignored. If set to false, they are represented as elements in the parse tree.

implicit_tags

$p->implicit_tags(boolean)

If set to true, the parser will try to deduce implicit tags such as missing elements or end tags that are required to conform to proper HTML structure. If false, the parse tree will reflect the HTML as is.

warn

$p->warn(boolean)

If set to true, the parser will make calls to warn with messages describing syntax errors when they occur. Error messages are off by default.
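Because a TreeBuilder object inherits from both classes, the parsing methods of HTML::Parser and the tree methods of HTML::Element can be combined on a single object. The following is a minimal sketch under that assumption. The sample HTML and variable names are made up for the example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample document (an assumption for this sketch).
my $html = '<html><body><p>See <a href="http://www.perl.com/">Perl</a> '
         . 'and <img src="camel.png"></p></body></html>';

my $tree = HTML::TreeBuilder->new;
$tree->implicit_tags(1);     # let the parser supply any missing tags
$tree->parse($html);         # inherited from HTML::Parser
$tree->eof;

# extract_links is inherited from HTML::Element. Each entry begins with
# the link value and the element in which it was found.
my @links;
for my $link (@{ $tree->extract_links('a', 'img') }) {
    my ($url, $element) = @$link;
    push @links, $element->tag . " => $url";
}
print "$_\n" for @links;

$tree->delete;               # release the tree when finished
```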

20.4.5. HTML::FormatPS

The HTML::FormatPS module converts an HTML parse tree into PostScript. The formatter object is created with the new constructor, which can take parameters that assign PostScript attributes. For example:

$formatter = HTML::FormatPS->new('papersize' => 'Letter');

You can now give parsed HTML to the formatter and produce PostScript output for printing. HTML::FormatPS does not handle table or form elements at this time.

The main method of this class is format, which takes an HTML::TreeBuilder object representing a parsed HTML document and returns a scalar containing the document formatted in PostScript. The following example shows how to use this module to print a file in PostScript:

use HTML::TreeBuilder;
use HTML::FormatPS;

# $somefile is assumed to hold the name of an HTML file
$html = HTML::TreeBuilder->new;
$html->parse_file($somefile);
$formatter = HTML::FormatPS->new( );
print $formatter->format($html);

The following list describes the attributes that can be set in the constructor:

PaperSize: Possible values are A3, A4, A5, B4, B5, Letter, Legal, Executive, Tabloid, Statement, Folio, 10x14, and Quarto. The default is A4.
PaperWidth: Width of the paper in points.
PaperHeight: Height of the paper in points.
LeftMargin: Left margin in points.
RightMargin: Right margin in points.
HorizontalMargin: Left and right margin. Default is 4 cm.
TopMargin: Top margin in points.
BottomMargin: Bottom margin in points.
VerticalMargin: Top and bottom margin. Default is 2 cm.
PageNo: Boolean value to display page numbers. Default is 0 (off).
FontFamily: Font family to use on the page. Possible values are Courier, Helvetica, and Times. Default is Times.
FontScale: Scale factor for the font.
Leading

20.4.6. HTML::FormatText

The HTML::FormatText module takes a parsed HTML file and outputs a plain-text version of it. Character attributes such as bold or italic fonts and font sizes are not preserved in the output.

This module is similar to FormatPS in that the constructor takes attributes for formatting, and the format method produces the output. A formatter object can be constructed like this:

$formatter = HTML::FormatText->new(leftmargin => 10, rightmargin => 80);

The constructor can take two parameters: leftmargin and rightmargin. The value for the margins is given in column numbers. The aliases lm and rm can also be used.

The format method takes an HTML::TreeBuilder object and returns a scalar containing the formatted text. You can print it with:

print $formatter->format($html);
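Putting the pieces together, a complete script might look like the following sketch. The sample HTML and the margin values are assumptions for illustration, and the exact layout of the output (blank lines, heading decoration) depends on the version of the module installed.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# Parse a small sample document into a syntax tree.
my $tree = HTML::TreeBuilder->new;
$tree->parse('<html><body><h1>Title</h1>'
           . '<p>Some <b>bold</b> text.</p></body></html>');
$tree->eof;

# Format the tree as plain text between columns 0 and 50.
my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);
my $text = $formatter->format($tree);
print $text;

$tree->delete;
```

The bold markup is dropped in the result, leaving only the words themselves.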