20.4. The HTML Modules

HTML modules provide an interface to parse HTML documents. After you parse the document, you can print or display it according to the markup tags or extract specific information such as hyperlinks.

The HTML::Parser module provides methods for, literally, parsing HTML. It can handle HTML text from a string or file and can separate out the syntactic structures and data. You shouldn't use HTML::Parser directly, however, since its interface isn't designed to make parsing HTML easy. It's merely a base class from which you can build your own parser to deal with HTML in any way you want. If you don't want to roll your own HTML parser or parser class, there's always HTML::TokeParser and HTML::TreeBuilder, both of which are covered in this chapter.

HTML::TreeBuilder is a class that parses HTML into a syntax tree. In a syntax tree, each element of the HTML, such as container elements with beginning and end tags, is stored relative to other elements. This preserves the nested structure and behavior of HTML and its hierarchy.

A syntax tree of the TreeBuilder class is formed of connected nodes that represent each element of the HTML document. These nodes are saved as objects from the HTML::Element class. An HTML::Element object stores all the information from an HTML tag: the start tag, end tag, attributes, plain text, and pointers to any nested elements.

The remaining classes of the HTML modules use syntax trees and their nodes of element objects to output useful information from HTML documents. The format classes, such as HTML::FormatText and HTML::FormatPS, allow you to produce text and PostScript from HTML. The HTML::LinkExtor class extracts all of the links from a document. Additional modules provide means for replacing HTML character entities and implementing HTML tags as subroutines.

20.4.1. HTML::Parser

This module implements the base class for the other HTML modules. A parser object is created with the new constructor:

$p = HTML::Parser->new( );

When called with no arguments, as here, the constructor creates a parser that invokes the callback methods described below.

The parser object provides methods that read in HTML from a string (parse) or a file (parse_file). The string-reading method can take the data in several smaller chunks if the HTML is too big to handle at once. Each chunk of HTML is appended to what the object has already received, and the eof method indicates the end of the document. These basic methods are described below.

When the parse or parse_file method is called, it parses the incoming HTML with a few internal methods. In HTML::Parser, these methods are defined, but empty. Additional HTML parsing classes (included in the HTML modules or ones you write yourself) override these methods for their own purposes. For example:

package HTML::MyParser;
require HTML::Parser;
@ISA = qw(HTML::Parser);    # inherit from the base parser class

sub start {
    my ($self, $tag, $attr, $attrseq, $origtext) = @_;
    # your subroutine defined here
}

The following list shows the internal methods contained in HTML::Parser.

comment

comment(comment)

Invoked on comments from HTML (text between <!-- and -->). The text of the comment (without the delimiters) is given to the method as the string comment.

end

end(tag, origtext)

Invoked on end tags (those with the </tag> form). The first argument, tag, is the tag name in lowercase, and the second argument, origtext, is the original HTML text of the tag.

start

start(tag, attr, attrseq, origtext)

Invoked on start tags. The first argument, tag, is the name of the tag in lowercase. The second argument is a reference to a hash, attr. This hash contains all the attributes and their values in key/value pairs. The keys are the names of the attributes in lowercase. The third argument, attrseq, is a reference to an array that contains the names of all the attributes in the order they appeared in the tag. The fourth argument, origtext, is a string that contains the original text of the tag.

xml_mode

xml_mode(bool)

Enabling this attribute changes the parser to allow some XML constructs such as empty element tags and XML processing instructions. It also disables forcing tag and attribute names to lowercase when they are reported by the tagname and attr arguments, and suppresses special treatment of elements parsed as CDATA for HTML.
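As a rough sketch of how these callback methods fit together, the following subclass records each start tag, attribute, end tag, and comment that the parser reports. The class name TagLister and the seen hash key are invented for this illustration. The document is passed to parse in three chunks to show that a tag may be split across calls.

```perl
use strict;
use warnings;

package TagLister;               # hypothetical class name
use HTML::Parser;
our @ISA = ('HTML::Parser');

# Called for each start tag; attributes are listed in their original order.
sub start {
    my ($self, $tag, $attr, $attrseq, $origtext) = @_;
    push @{ $self->{seen} }, "start:$tag";
    push @{ $self->{seen} }, "attr:$_=$attr->{$_}" for @$attrseq;
}

# Called for each end tag.
sub end {
    my ($self, $tag, $origtext) = @_;
    push @{ $self->{seen} }, "end:$tag";
}

# Called for each comment, without the comment delimiters.
sub comment {
    my ($self, $comment) = @_;
    push @{ $self->{seen} }, "comment:$comment";
}

package main;

my $p = TagLister->new;

# Feed the document in chunks; the <a> tag is split across two of them.
$p->parse($_) for ('<p><a hr',
                   'ef="http://www.oreilly.com">O',
                   'Reilly</a><!-- note --></p>');
$p->eof;                         # signal the end of the document

print "$_\n" for @{ $p->{seen} };
```

Storing results in the parser object itself keeps the sketch short; a larger program might prefer to collect them in a separate data structure.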

20.4.2. HTML::TokeParser

As noted earlier, if you want a friendlier interface to HTML parsing than HTML::Parser gives you, use one of its subclasses. HTML::TokeParser by Gisle Aas is one such example. Although HTML::TokeParser is actually a subclass of HTML::PullParser, it can help you do many useful things, such as link extraction and HTML checking.

In short, HTML::TokeParser breaks an HTML document into tokens, attributes, and content. For example, the HTML <a href="http://url">link</a> breaks down as:

token: a
    attrib: href
    content: http://url
content: link
token: /a

For example, you can use HTML::TokeParser to extract links from a string that contains HTML:

#!/usr/local/bin/perl -w

require HTML::TokeParser;

# Our string that turns out to be HTML!
my $html = '<p>Some text. <a href="http://blah">My name is Nate!</a></p>';
my $parser = HTML::TokeParser->new(\$html);

# get_tag( ) tells TokeParser to match a tag by name
while (my $token = $parser->get_tag("a")) {
    my $url = $token->[1]{href} || "-";
    my $text = $parser->get_trimmed_text("/a");
    print "URL is: $url.\nURL text is: $text.\n";
}

20.4.2.1. HTML::TokeParser methods

new

new(  )

Constructor. Takes a single argument that identifies the content to be parsed: a filename, a filehandle, or a reference to a scalar. If a plain scalar is passed, new treats it as the name of a file to open. If a reference to a scalar is passed, new parses the HTML stored in that scalar. A filehandle is read until end-of-file. Returns undef on failure.

get_tag

get_tag(  )

Returns the next start or end tag in a document, or undef if there are no remaining start or end tags. If you pass one or more tag names, get_tag skips tokens until it finds one of those tags, which makes it useful for matching only the tags you want. A start tag is returned as an array reference of the form [$tag, $attr, $attrseq, $text]. An end tag is returned with a slash prefixed to its name, e.g., ["/$tag", $text].
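To illustrate the shape of the return value, here is a minimal sketch that calls get_tag with no arguments and collects the first item of each array reference. The sample HTML and variable names are made up for the example.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = '<p><a href="http://blah">link</a></p>';
my $p = HTML::TokeParser->new(\$html) or die "Can't create parser";

# With no arguments, get_tag returns every start and end tag in turn.
my @names;
while (my $tag = $p->get_tag) {
    push @names, $tag->[0];      # end tags carry a leading slash
}

print join(' ', @names), "\n";   # prints "p a /a /p"
```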

get_text

get_text(  )

Returns all text found at the current position. If the next token is not text, get_text returns a zero-length string. If you pass an end tag such as "/a" as an argument, get_text returns all of the text that occurs before that tag.

get_token

get_token(  )

Returns the next token found in the HTML document, or undef if no next token exists. Each token is returned as an array reference. The first item in the array is a label that identifies the type of token, and the remaining items depend on that type: the tag name, attributes, and original text for tags, or the text itself for text, comments, declarations, and processing instructions. get_token uses the following labels for the tokens:

S
Start tag

E
End tag

T
Text

C
Comment

D
Declaration

PI
Processing instruction

Consider the following code:

#!/usr/local/bin/perl -w

require HTML::TokeParser;

my $html = '<a href="http://blah">My name is Nate!</a></p>';
my $p = HTML::TokeParser->new(\$html);

while (my $token = $p->get_token) {
    my $i = 0;
    foreach my $tk (@{$token}) {
        print "token[$i]: $tk\n";
        $i++;
    }
}

The items in each token (in the HTML) are displayed as follows:

token[0]: S
token[1]: a
token[2]: HASH(0x8146d3c)
token[3]: ARRAY(0x814a380)
token[4]: <a href="http://blah">
token[0]: T
token[1]: My name is Nate!
token[2]:
token[0]: E
token[1]: a
token[2]: </a>
token[0]: E
token[1]: p
token[2]: </p>

get_trimmed_text

get_trimmed_text(  )

Works the same as get_text, but reduces all instances of multiple spaces to a single space and removes leading and trailing whitespace.
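The difference between get_text and get_trimmed_text is easiest to see side by side. The following sketch reads the same heading with each method; the sample HTML and variable names are invented for the example, and each parser is given its own copy of the string.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = "<h1>  Perl   in a\n Nutshell </h1>";
my ($doc1, $doc2) = ($html, $html);   # one copy per parser

# Read the heading text exactly as it appears in the document.
my $p1 = HTML::TokeParser->new(\$doc1) or die "Can't create parser";
$p1->get_tag('h1');
my $raw = $p1->get_text('/h1');

# Read the same text with whitespace collapsed and trimmed.
my $p2 = HTML::TokeParser->new(\$doc2) or die "Can't create parser";
$p2->get_tag('h1');
my $trimmed = $p2->get_trimmed_text('/h1');

print "raw:     [$raw]\n";
print "trimmed: [$trimmed]\n";        # trimmed: [Perl in a Nutshell]
```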

unget_token

unget_token(  )

Pushes tokens back to the parser so that they are returned again the next time you call get_token. This is useful when you have read further ahead than you intended.
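One common use of unget_token is to peek at the next token without consuming it. The sketch below assumes a made-up snippet of HTML and simply checks that the token comes back on the following call.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = '<b>bold</b>';
my $p = HTML::TokeParser->new(\$html) or die "Can't create parser";

# Look at the next token, then hand it back to the parser.
my $peek = $p->get_token;
$p->unget_token($peek);

# The same token is returned by the next call to get_token.
my $again = $p->get_token;
print "$again->[0] $again->[1]\n";   # prints "S b"
```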

20.4.3. HTML::Element

The HTML::Element module provides methods for dealing with nodes in an HTML syntax tree. You can get or set the contents of each node, traverse the tree, and delete a node.

HTML::Element objects are used to represent elements of HTML. These elements include start and end tags, attributes, contained plain text, and other nested elements.

The constructor for this class requires the name of the tag for its first argument. You may optionally specify initial attributes and values as hash elements in the constructor. For example:

$h = HTML::Element->new('a', 'href' => 'http://www.oreilly.com');

The new element is created for the anchor tag, <a>, which links to the URL through its href attribute.
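As a small sketch of working with such an element, the following code builds the same anchor, adds some text to it, and reads the pieces back. The link text is invented for the example; tag, attr, push_content, as_text, and as_HTML are HTML::Element methods.

```perl
use strict;
use warnings;
use HTML::Element;

# Create the anchor element with its href attribute.
my $h = HTML::Element->new('a', 'href' => 'http://www.oreilly.com');

# Add plain text as the content of the element.
$h->push_content('Perl books');

print "tag:  ", $h->tag, "\n";            # the tag name
print "href: ", $h->attr('href'), "\n";   # the value of one attribute
print "text: ", $h->as_text, "\n";        # the text content only
print "html: ", $h->as_HTML, "\n";        # the element rendered as HTML
```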

The following methods are provided for objects of the HTML::Element class.

20.4.5. HTML::FormatPS

The HTML::FormatPS module converts an HTML parse tree into PostScript. The formatter object is created with the new constructor, which can take parameters that assign PostScript attributes. For example:

$formatter = HTML::FormatPS->new('papersize' => 'Letter');

You can now give parsed HTML to the formatter and produce PostScript output for printing. HTML::FormatPS does not handle table or form elements at this time.

The method for this class is format. format takes a reference to an HTML TreeBuilder object, representing a parsed HTML document. It returns a scalar containing the document formatted in PostScript. The following example shows how to use this module to print a file in PostScript:

use HTML::TreeBuilder;
use HTML::FormatPS;

$html = HTML::TreeBuilder->new->parse_file('somefile');
$formatter = HTML::FormatPS->new( );
print $formatter->format($html);

The following list describes the attributes that can be set in the constructor:

PaperSize
Possible values are A3, A4, A5, B4, B5, Letter, Legal, Executive, Tabloid, Statement, Folio, 10x14, and Quarto. The default is A4.

PaperWidth
Width of the paper in points.

PaperHeight
Height of the paper in points.

LeftMargin
Left margin in points.

RightMargin
Right margin in points.

HorizontalMargin
Left and right margin. Default is 4 cm.

TopMargin
Top margin in points.

BottomMargin
Bottom margin in points.

VerticalMargin
Top and bottom margin. Default is 2 cm.

PageNo
Boolean value to display page numbers. Default is true (on); set it to 0 to suppress page numbers.

FontFamily
Font family to use on the page. Possible values are Courier, Helvetica, and Times. Default is Times.

FontScale
Scale factor for the font.

Leading
Space between lines, as a factor of the font size. Default is 0.1.
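Several of these attributes can be combined in one constructor call. The following is a minimal sketch, assuming the HTML::TreeBuilder and HTML::FormatPS modules are installed; the sample HTML and the attribute values are arbitrary choices for the example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatPS;

# Build a parse tree from a small in-memory document.
my $tree = HTML::TreeBuilder->new;
$tree->parse('<html><body><h1>Title</h1><p>Some text.</p></body></html>');
$tree->eof;

# Set the paper size, font, and page numbering in the constructor.
my $formatter = HTML::FormatPS->new(
    PaperSize  => 'Letter',
    FontFamily => 'Helvetica',
    PageNo     => 0,
);

# format returns the PostScript document as a single scalar.
my $ps = $formatter->format($tree);
print $ps;

$tree->delete;    # release the parse tree
```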



Copyright © 2002 O'Reilly & Associates. All rights reserved.