17.4 The HTML Module

The HTML modules provide an interface to parse HTML documents. After you parse the document, you can print or display it according to the markup tags, or you can extract specific information such as hyperlinks.

The HTML::Parser module provides the base class for the usable HTML modules. It provides methods for reading in HTML text from either a string or a file and then separating out the syntactic structures and data. As a base class, Parser does virtually nothing on its own. The other modules call it internally and override its empty methods for their own purposes. However, the HTML::Parser class is useful to you if you want to write your own classes for parsing and formatting HTML.

HTML::TreeBuilder is a class that parses HTML into a syntax tree. In a syntax tree, each element of the HTML, such as container elements with beginning and end tags, is stored relative to other elements. This preserves the nested structure and behavior of HTML and its hierarchy.

A syntax tree built by the TreeBuilder class is formed of connected nodes, one for each element of the HTML document. These nodes are stored as objects of the HTML::Element class. An HTML::Element object stores all the information from an HTML tag: the start tag, end tag, attributes, plain text, and pointers to any nested elements.

The remaining classes of the HTML modules use the syntax tree and its nodes of element objects to output useful information from HTML documents. The format classes, such as HTML::FormatText and HTML::FormatPS, allow you to produce text and PostScript from HTML. The HTML::LinkExtor class extracts all of the links from a document. Additional modules provide means for replacing HTML character entities and implementing HTML tags as subroutines.

17.4.1 HTML::Parser

This module implements the base class for the other HTML modules. A parser object is created with the new constructor:

$p = HTML::Parser->new();

The constructor takes no arguments.

The parser object provides methods that read in HTML from either a string or a file. The string-reading method can accept the data as several smaller chunks if the HTML is too big to handle at once. Each chunk of HTML is appended to what the object has already received, and the eof method indicates the end of the document. These basic methods are described below.

When the parse or parse_file method is called, it parses the incoming HTML with a few internal methods. In HTML::Parser, these methods are defined, but empty. Additional HTML parsing classes (included in the HTML modules or ones you write yourself) override these methods for their own purposes. For example:

package HTML::MyParser;
require HTML::Parser;
@ISA = qw(HTML::Parser);

sub start {
    # your subroutine defined here
}
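To make the skeleton above more concrete, here is a minimal sketch of a subclass that overrides start, end, and text, and feeds the parser a document in two chunks. The class name TagCounter and the hash keys used to store state are hypothetical, chosen only for illustration. The sketch assumes that HTML::Parser is installed and that a parser created by calling new with no arguments dispatches to overridden methods, as described above.

```perl
use strict;
use warnings;

# Hypothetical subclass; only start, end, and text are overridden.
package TagCounter;
require HTML::Parser;
our @ISA = qw(HTML::Parser);

# Called for each start tag. $tag is the tag name and $attr is a
# reference to a hash of the tag's attributes (unused here).
sub start {
    my ($self, $tag, $attr) = @_;
    $self->{seen}{$tag}++;
}

# Called for each end tag.
sub end {
    my ($self, $tag) = @_;
    $self->{closed}++;
}

# Called for plain text. A run of text may arrive in several pieces,
# so the pieces are concatenated rather than handled one at a time.
sub text {
    my ($self, $text) = @_;
    $self->{text} .= $text;
}

package main;

my $p = TagCounter->new();

# Feed the document in two chunks, then mark the end with eof.
$p->parse('<p>Hello, <b>wor');
$p->parse('ld</b></p><p>Bye</p>');
$p->eof;

print "$_: $p->{seen}{$_}\n" for sort keys %{ $p->{seen} };
print "end tags: $p->{closed}\n";
print "text: $p->{text}\n";
```

Keeping the counters in the parser object itself, rather than in global variables, lets several parsers run independently in the same program.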

The following list shows the internal methods contained in HTML::Parser:

17.4.2 HTML::Element

The HTML::Element module provides methods for dealing with nodes in an HTML syntax tree. You can get or set the contents of each node, traverse the tree, and delete a node.

HTML::Element objects are used to represent elements of HTML. These elements include start and end tags, attributes, contained plain text, and other nested elements.

The constructor for this class requires the name of the tag for its first argument. You may optionally specify initial attributes and values as hash elements in the constructor. For example:

$h = HTML::Element->new('a', 'href' => 'http://www.oreilly.com');

The new element is created for the anchor tag, <a>, which links to the URL through its href attribute.
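As a brief illustration, the sketch below builds the same anchor element, gives it some text, nests it inside a paragraph element, and reads values back. It assumes HTML::Element is installed (it is distributed with HTML::TreeBuilder); the link text and the extra title attribute are made up for the example.

```perl
use strict;
use warnings;
use HTML::Element;

# Create the anchor element with an initial href attribute.
my $h = HTML::Element->new('a', 'href' => 'http://www.oreilly.com');

# Add plain text as the element's content.
$h->push_content('the publisher');

# Get the tag name and an attribute value; set another attribute.
my $tag  = $h->tag;
my $href = $h->attr('href');
$h->attr('title', 'Example link');   # illustrative attribute

# Nest the anchor inside a paragraph, between two pieces of text.
my $para = HTML::Element->new('p');
$para->push_content('See ', $h, '.');

my $text     = $para->as_text;        # text of the node and its children
my @children = $para->content_list;   # two strings and one element

print "$tag -> $href\n";
print "$text\n";
print $para->as_HTML, "\n";

# Free the tree explicitly when done with it.
$para->delete;
```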

The following methods are provided for objects of the HTML::Element class:

17.4.3 HTML::TreeBuilder

The HTML::TreeBuilder class provides a parser that creates an HTML syntax tree. Each node of the tree is an HTML::Element object. This class inherits from both HTML::Parser and HTML::Element, so methods from both of those classes can be used on its objects.
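Because a TreeBuilder object is both a parser and an element, you can feed it HTML with the parser methods and then query it with the element methods. The following is a minimal sketch under that assumption; the sample document is invented for illustration, and HTML::TreeBuilder is assumed to be installed.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# A small, made-up document to parse.
my $html = '<html><head><title>Demo</title></head><body>'
         . '<p>See <a href="http://www.oreilly.com">one site</a> and '
         . '<a href="http://www.perl.com">another</a>.</p></body></html>';

my $tree = HTML::TreeBuilder->new;

# Methods inherited from HTML::Parser: read the HTML, then mark the end.
$tree->parse($html);
$tree->eof;

# Methods inherited from HTML::Element: search the finished tree.
# In list context find_by_tag_name returns every match; when used as
# the invocant of another call it returns only the first match.
my @hrefs = map { $_->attr('href') } $tree->find_by_tag_name('a');
my $title = $tree->find_by_tag_name('title')->as_text;

print "title: $title\n";
print "link: $_\n" for @hrefs;

# Free the tree explicitly when done with it.
$tree->delete;
```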

The methods provided by HTML::TreeBuilder control how the parsing is performed. Values for these methods are set by providing a boolean value for their arguments. Here are the methods:

17.4.4 HTML::FormatPS

The HTML::FormatPS module converts an HTML parse tree into PostScript. The formatter object is created with the new constructor, which can take parameters that assign PostScript attributes. For example:

$formatter = new HTML::FormatPS('papersize' => 'Letter');

You can now give parsed HTML to the formatter and produce PostScript output for printing. HTML::FormatPS does not handle table or form elements at this time.

The main method of this class is format, which takes a reference to an HTML::TreeBuilder object representing a parsed HTML document. It returns a scalar containing the document formatted in PostScript. The following example shows how to use this module to print a file in PostScript:

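use HTML::TreeBuilder;
use HTML::FormatPS;

# parse_file is called on a parser object, so create the tree first
$html = HTML::TreeBuilder->new;
$html->parse_file('somefile');
$formatter = new HTML::FormatPS;
print $formatter->format($html);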

The following list describes the attributes that can be set in the constructor:

PaperSize: Possible values are A3, A4, A5, B4, B5, Letter, Legal, Executive, Tabloid, Statement, Folio, 10x14, and Quarto. The default is A4.
PaperWidth: Width of the paper in points.
PaperHeight: Height of the paper in points.
LeftMargin: Left margin in points.
RightMargin: Right margin in points.
HorizontalMargin: Left and right margin. Default is 4 cm.
TopMargin: Top margin in points.
BottomMargin: Bottom margin in points.
VerticalMargin: Top and bottom margin. Default is 2 cm.
PageNo: Boolean value to display page numbers. Default is 0 (off).
FontFamily: Font family to use on the page. Possible values are Courier, Helvetica, and Times. Default is Times.
FontScale: Scale factor for the font.
Leading: Space between lines, as a factor of the font size. Default is 0.1.
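
As a rough sketch of how several of these attributes might be combined, the example below formats a small in-memory document for Letter paper in Helvetica with page numbers turned on. The sample HTML is invented, and the sketch assumes that HTML::TreeBuilder and HTML::FormatPS (with the font metrics it depends on) are installed.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatPS;

# Parse a small, made-up document into a syntax tree.
my $tree = HTML::TreeBuilder->new;
$tree->parse('<html><body><h1>Report</h1>'
           . '<p>First paragraph.</p></body></html>');
$tree->eof;

# Set a few of the attributes listed above in the constructor.
my $formatter = HTML::FormatPS->new(
    PaperSize  => 'Letter',
    FontFamily => 'Helvetica',
    PageNo     => 1,
);

# format returns the whole PostScript document as one scalar.
my $ps = $formatter->format($tree);
$tree->delete;

print length($ps), " bytes of PostScript generated\n";
```

In a real program you would normally write the returned scalar to a file or pipe it to a printer rather than report its length.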

17.4.5 HTML::FormatText

The HTML::FormatText module takes a parsed HTML document and outputs a plain-text version of it. Character attributes such as bold or italic fonts and font sizes are not carried over into the output.

This module is similar to FormatPS in that the constructor takes attributes for formatting, and the format method produces the output. A formatter object can be constructed like this:

$formatter = new HTML::FormatText (leftmargin => 10, rightmargin => 80);

The constructor can take two parameters: leftmargin and rightmargin. The value for the margins is given in column numbers. The aliases lm and rm can also be used.

The format method takes an HTML::TreeBuilder object and returns a scalar containing the formatted text. You can print it with:

print $formatter->format($html);
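
Putting the pieces together, a complete run might look like the following sketch. The sample HTML and the margin values are invented for illustration, and the sketch assumes that HTML::TreeBuilder and HTML::FormatText are installed.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# Parse a small, made-up document into a syntax tree.
my $tree = HTML::TreeBuilder->new;
$tree->parse('<html><body><h1>Title</h1>'
           . '<p>Some <b>bold</b> text.</p></body></html>');
$tree->eof;

# Margins are given as column numbers.
my $formatter = HTML::FormatText->new(leftmargin => 4, rightmargin => 40);

# format returns the plain-text rendering as one scalar.
my $text = $formatter->format($tree);
$tree->delete;

print $text;
```

The bold markup is dropped in the output, since plain text has no way to represent it.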