home | O'Reilly's CD bookshelfs | FreeBSD | Linux | Cisco | Cisco Exam  


Book HomePerl & LWPSearch this book

1.5. LWP in Action

Enough of why you should be careful when you automate the Web. Let's look at the types of things you'll be learning in this book. Chapter 2, "Web Basics" introduces web automation and LWP, presenting straightforward functions to let you fetch web pages. Example 1-1 shows how to fetch the O'Reilly home page and count the number of times Perl is mentioned.

Example 1-1. Count "Perl" in the O'Reilly catalog

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
  
my $catalog = get("http://www.oreilly.com/catalog");
my $count = 0;
$count++ while $catalog =~ m{Perl}gi;
print "$count\n";

The LWP::Simple module's get( ) function returns the document at a given URL or undef if an error occurred. A regular expression match in a loop counts the number of occurrences.

1.5.3. Parsing HTML

The regular expression techniques in Examples Example 1-1 and Example 1-3 are discussed in detail in Chapter 6, "Simple HTML Processing with Regular Expressions". Chapter 7, "HTML Processing with Tokens" shows a different approach, where the HTML::TokeParser module turns a string of HTML into a stream of chunks ("start-tag," "text," "close-tag," and so on). Chapter 8, "Tokenizing Walkthrough" is a detailed step-by-step walkthrough showing how to solve a problem using HTML::TokeParser. Example 1-4 uses HTML::TokeParser to extract the src parts of all img tags in the O'Reilly home page.

Example 1-4. Extract image locations

#!/usr/bin/perl -w
  
use strict;
use LWP::Simple;
use HTML::TokeParser;
  
my $html   = get("http://www.oreilly.com/");
my $stream = HTML::TokeParser->new(\$html);
my %image  = ( );
  
while (my $token = $stream->get_token) {
    if ($token->[0] eq 'S' && $token->[1] eq 'img') {
        # store src value in %image
        $image{ $token->[2]{'src'} }++;
    }
}
  
foreach my $pic (sort keys %image) {
    print "$pic\n";
}

The get_token( ) method on our HTML::TokeParser object returns an array reference, representing a token. If the first array element is S, it's a token representing the start of a tag. The second array element is the type of tag, and the third array element is a hash mapping attribute to value. The %image hash holds the images we find.

Chapter 9, "HTML Processing with Trees" and Chapter 10, "Modifying HTML with Trees" show how to use tree data structures to represent HTML. The HTML::TreeBuilder module constructs such trees and provides operations for searching and manipulating them. Example 1-5 extracts image locations using a tree.

Example 1-5. Extracting image locations with a tree

#!/usr/bin/perl -w
  
use strict;
use LWP::Simple;
use HTML::TreeBuilder;
  
my $html = get("http://www.oreilly.com/");
my $root = HTML::TreeBuilder->new_from_content($html);
my %images;
foreach my $node ($root->find_by_tag_name('img')) {
    $images{ $node->attr('src') }++;
}
  
foreach my $pic (sort keys %images) {
    print "$pic\n";
}

We create a new tree from the HTML in the O'Reilly home page. The tree has methods to help us search, such as find_by_tag_name( ), which returns a list of nodes corresponding to those tags. We use that to find the img tags, then use the attr( ) method to get their src attributes.



Library Navigation Links

Copyright © 2002 O'Reilly & Associates. All rights reserved.