LWP in Action (Perl & LWP)

1.5. LWP in Action

Enough of why you should be careful when you automate the Web. Let's look at the types of things you'll be learning in this book. Chapter 2, "Web Basics", introduces web automation and LWP, presenting straightforward functions to let you fetch web pages. Example 1-1 shows how to fetch the O'Reilly catalog page and count the number of times Perl is mentioned.

Example 1-1. Count "Perl" in the O'Reilly catalog

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
  
my $catalog = get("http://www.oreilly.com/catalog");
die "Couldn't fetch the catalog\n" unless defined $catalog;
my $count = 0;
$count++ while $catalog =~ m{Perl}gi;
print "$count\n";

The LWP::Simple module's get( ) function returns the document at a given URL or undef if an error occurred. A regular expression match in a loop counts the number of occurrences.
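The counting loop can be tried without a network connection. The following is a minimal sketch, assuming a made-up sample string in place of a fetched page; count_matches( ) is a hypothetical helper, not part of LWP::Simple. It also shows one way to handle the undef that get( ) returns on error.

```perl
#!/usr/bin/perl -w
use strict;

# count_matches() is a hypothetical helper for illustration only.
# It returns undef when handed undef, which is what get() gives on error.
sub count_matches {
    my ($text, $word) = @_;
    return unless defined $text;
    my $count = 0;
    # In scalar context, each m//g match resumes where the last one ended,
    # so the loop runs once per occurrence.
    $count++ while $text =~ m{\Q$word\E}gi;
    return $count;
}

# Sample text standing in for a fetched document (an assumption).
my $sample = "Perl books, perl scripts, and PERL modules";
my $count  = count_matches($sample, "Perl");
print defined $count ? "$count\n" : "fetch failed\n";
```

With the sample string above, the script prints 3, because the i modifier makes the match case-insensitive.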

1.5.1. The Object-Oriented Interface

Chapter 3, "The LWP Class Model", goes beyond LWP::Simple to show LWP's larger, more powerful object-oriented interface. The most useful features it covers are setting headers in requests and checking the headers of responses. Example 1-2 prints the identifying string that every server returns.

Example 1-2. Identify a server

#!/usr/bin/perl -w
use strict;
use LWP;
  
my $browser = LWP::UserAgent->new( );
my $response = $browser->get("http://www.oreilly.com/");
print $response->header("Server"), "\n";

The two variables, $browser and $response, are references to objects. The LWP::UserAgent object $browser makes requests of a server and creates HTTP::Response objects such as $response to represent the server's reply. In Example 1-2, we call the header( ) method on the response to check one of the HTTP header values.
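Request and response objects can also be built and inspected by hand, which is a convenient way to experiment with headers offline. This is only a sketch: the request is never sent, and the response is a stand-in whose Server value (ExampleServer/1.0) is invented for illustration.

```perl
#!/usr/bin/perl -w
use strict;
use HTTP::Request;
use HTTP::Response;

# Set a header on an outgoing request (it is never actually sent here).
my $request = HTTP::Request->new(GET => "http://www.oreilly.com/");
$request->header("Accept-Language" => "en");

# A hand-built stand-in for a server's reply; the header values are made up.
my $response = HTTP::Response->new(
    200, "OK",
    [ "Server" => "ExampleServer/1.0", "Content-Type" => "text/html" ],
);

# header() reads a header back from either kind of message.
print "Request asks for: ", $request->header("Accept-Language"), "\n";
print "Server says:      ", $response->header("Server"), "\n";
print "Status line:      ", $response->status_line, "\n";
```

The same header( ) method works on both objects because requests and responses are both HTTP messages.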

1.5.2. Forms

Chapter 5, "Forms", shows how to analyze and submit forms with LWP, including both GET and POST submissions. Example 1-3 queries the California license plate database to see whether a personalized plate is available.

Example 1-3. Query California license plate database

#!/usr/bin/perl -w
# pl8.pl -  query California license plate database
 
use strict;
use LWP::UserAgent;
my $plate = $ARGV[0] || die "Plate to search for?\n";
$plate = uc $plate;
$plate =~ tr/O/0/;  # we use zero for letter-oh
die "$plate is invalid.\n"
 unless $plate =~ m/^[A-Z0-9]{2,7}$/
    and $plate !~ m/^\d+$/;  # no all-digit plates
 
my $browser = LWP::UserAgent->new;
my $response = $browser->post(
  'http://plates.ca.gov/search/search.php3',
  [
    'plate'  => $plate,
    'search' => 'Check Plate Availability'
  ],
);
die "Error: ", $response->status_line
 unless $response->is_success;
 
if($response->content =~ m/is unavailable/) {
  print "$plate is already taken.\n";
} elsif($response->content =~ m/and available/) {
  print "$plate is AVAILABLE!\n";
} else {
  print "$plate... Can't make sense of response?!\n";
}
exit;

Here's how you might use it:

% pl8.pl knee
KNEE is already taken.
% pl8.pl ankle
ANKLE is AVAILABLE!

We use the post( ) method on an LWP::UserAgent object to POST form parameters to a page.
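To see what a form submission looks like before it goes over the wire, you can build the request without sending it. The sketch below assumes the POST function from HTTP::Request::Common, which accepts the same URL and array of key/value pairs that we gave to post( ) in Example 1-3; the plate value is just sample input.

```perl
#!/usr/bin/perl -w
use strict;
use HTTP::Request::Common qw(POST);

# Build, but do not send, a form submission like the one in Example 1-3.
my $request = POST(
  'http://plates.ca.gov/search/search.php3',
  [
    'plate'  => 'KNEE',                      # sample input
    'search' => 'Check Plate Availability',
  ],
);

# The form fields end up URL-encoded in the body of the request.
print $request->method, " ", $request->uri, "\n";
print "Content-Type: ", $request->content_type, "\n";
print "Body: ", $request->content, "\n";
```

Printing the request this way is a quick check that the field names and values match what the form on the page would have sent.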

1.5.3. Parsing HTML

The regular expression techniques in Example 1-1 and Example 1-3 are discussed in detail in Chapter 6, "Simple HTML Processing with Regular Expressions". Chapter 7, "HTML Processing with Tokens", shows a different approach, where the HTML::TokeParser module turns a string of HTML into a stream of chunks ("start-tag," "text," "close-tag," and so on). Chapter 8, "Tokenizing Walkthrough", is a detailed step-by-step walkthrough showing how to solve a problem using HTML::TokeParser. Example 1-4 uses HTML::TokeParser to extract the src parts of all img tags in the O'Reilly home page.

Example 1-4. Extract image locations

#!/usr/bin/perl -w
  
use strict;
use LWP::Simple;
use HTML::TokeParser;
  
my $html   = get("http://www.oreilly.com/");
my $stream = HTML::TokeParser->new(\$html);
my %image  = ( );
  
while (my $token = $stream->get_token) {
    if ($token->[0] eq 'S' && $token->[1] eq 'img') {
        # store src value in %image
        $image{ $token->[2]{'src'} }++ if defined $token->[2]{'src'};
    }
}
  
foreach my $pic (sort keys %image) {
    print "$pic\n";
}

The get_token( ) method on our HTML::TokeParser object returns an array reference representing a token. If the first array element is S, the token represents the start of a tag. The second array element is the type of tag, and the third array element is a reference to a hash mapping attribute names to values. The %image hash holds the images we find.
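The shape of each token is easier to see on a small piece of HTML. The following sketch parses a made-up fragment (an assumption, not the O'Reilly page) and prints one line per token. It handles only start-tag (S), end-tag (E), and text (T) tokens and ignores the other kinds, such as comments.

```perl
#!/usr/bin/perl -w
use strict;
use HTML::TokeParser;

# A made-up fragment standing in for a fetched page.
my $html   = '<p>Logo: <img src="/images/logo.gif" alt="logo"></p>';
my $stream = HTML::TokeParser->new(\$html);

my (@kinds, @srcs);   # token types seen, and img src values found
while (my $token = $stream->get_token) {
    my $kind = $token->[0];
    push @kinds, $kind;
    if ($kind eq 'S') {
        # Start-tag: element 1 is the tag name, element 2 the attribute hash.
        my ($tag, $attr) = @$token[1, 2];
        my @pairs = map { "$_=$attr->{$_}" } sort keys %$attr;
        print "start-tag $tag @pairs\n";
        push @srcs, $attr->{'src'} if $tag eq 'img' && defined $attr->{'src'};
    } elsif ($kind eq 'E') {
        # End-tag: element 1 is the tag name.
        print "end-tag   $token->[1]\n";
    } elsif ($kind eq 'T') {
        # Text: element 1 is the text itself.
        print "text      '$token->[1]'\n";
    }
}
```

Checking that src is defined before using it avoids a warning under -w when an img tag has no src attribute.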

Chapter 9, "HTML Processing with Trees", and Chapter 10, "Modifying HTML with Trees", show how to use tree data structures to represent HTML. The HTML::TreeBuilder module constructs such trees and provides operations for searching and manipulating them. Example 1-5 extracts image locations using a tree.

Example 1-5. Extracting image locations with a tree

#!/usr/bin/perl -w
  
use strict;
use LWP::Simple;
use HTML::TreeBuilder;
  
my $html = get("http://www.oreilly.com/");
my $root = HTML::TreeBuilder->new_from_content($html);
my %images;
foreach my $node ($root->find_by_tag_name('img')) {
    $images{ $node->attr('src') }++ if defined $node->attr('src');
}
  
foreach my $pic (sort keys %images) {
    print "$pic\n";
}

We create a new tree from the HTML in the O'Reilly home page. The tree has methods to help us search, such as find_by_tag_name( ), which returns a list of nodes corresponding to those tags. We use that to find the img tags, then use the attr( ) method to get their src attributes.
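The tree methods can be tried on a literal string as well. This sketch builds a tree from a made-up document (an assumption for illustration) and compares find_by_tag_name( ) with look_down( ), a more selective search method that accepts attribute criteria, including regular expressions.

```perl
#!/usr/bin/perl -w
use strict;
use HTML::TreeBuilder;

# A made-up document standing in for a fetched page.
my $html = '<html><body><img src="a.gif">'
         . '<p><img src="b.png" alt="B"></p></body></html>';
my $root = HTML::TreeBuilder->new_from_content($html);

# Every img element, wherever it sits in the tree.
my @all = map { $_->attr('src') } $root->find_by_tag_name('img');

# Only img elements whose src attribute ends in .gif.
my @gifs = map { $_->attr('src') }
           $root->look_down(_tag => 'img', src => qr/\.gif$/);

print "all images: @all\n";
print "GIF images: @gifs\n";

$root->delete;   # release the tree's memory when finished with it
```

Calling delete( ) when you are done with a tree is a habit worth keeping in long-running programs that parse many pages.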

1.5.4. Authentication

Chapter 11, "Cookies, Authentication, and Advanced Requests", talks about advanced request features such as cookies (used to identify a user between web page accesses) and authentication. Example 1-6 shows how easy it is to request a protected page with LWP.

Example 1-6. Authenticating

#!/usr/bin/perl -w
  
use strict;
use LWP;
  
my $browser = LWP::UserAgent->new( );
$browser->credentials("www.example.com:80", "music", "fred" => "l33t1");
my $response = $browser->get("http://www.example.com/mp3s");
# ...

The credentials( ) method on an LWP::UserAgent adds the authentication information (the host, realm, and username/password pair are the parameters). The realm identifies which username and password are expected if there are multiple protected areas on a single host. When we request a document using that LWP::UserAgent object, the authentication information is used if necessary.
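The role of the realm can be checked without contacting a server. The sketch below stores the same credentials as Example 1-6 and then asks the user agent which username it would offer for two realms, using get_basic_credentials( ), the method LWP::UserAgent consults when a server demands authentication. The "video" realm is hypothetical and has no credentials stored for it.

```perl
#!/usr/bin/perl -w
use strict;
use LWP::UserAgent;
use URI;

my $browser = LWP::UserAgent->new;
$browser->credentials("www.example.com:80", "music", "fred" => "l33t1");

my $uri = URI->new("http://www.example.com/mp3s");

# Credentials are looked up by host:port and realm together.
my @music = $browser->get_basic_credentials("music", $uri);
my @video = $browser->get_basic_credentials("video", $uri);  # hypothetical realm

print "music realm: ", (@music ? "user $music[0]" : "no credentials"), "\n";
print "video realm: ", (@video ? "user $video[0]" : "no credentials"), "\n";
```

Only the "music" realm yields a username and password; a request to a differently named realm on the same host would go out without them.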