
20.1. Fetching a URL from a Perl Script


You have a URL that you want to fetch from a script.


Use the get function provided by the CPAN module LWP::Simple, part of LWP.

use LWP::Simple;
$content = get($URL);


The right library makes life easier, and the LWP modules are the right ones for this task.

The get function from LWP::Simple returns undef on error, so check for errors this way:

use LWP::Simple;
unless (defined ($content = get $URL)) {
    die "could not get $URL\n";
}

When it's run that way, however, you can't determine the cause of the error. For this and other elaborate processing, you'll have to go beyond LWP::Simple.
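Short of switching modules, LWP::Simple's getstore function offers a middle ground: instead of the content, it returns the HTTP status code, which the is_success and status_message functions (also exported by LWP::Simple) can interpret. The following is a minimal sketch, assuming that saving the document to a file is acceptable; fetch_to_file is a hypothetical helper name, not part of LWP.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;    # exports getstore, is_success, is_error, status_message

# fetch_to_file is a hypothetical helper, not an LWP function.
# It saves $url into $file and returns the numeric HTTP status code
# together with the standard text for that code (undef if unknown).
sub fetch_to_file {
    my ($url, $file) = @_;
    my $code = getstore($url, $file);    # e.g. 200 or 404
    return ($code, status_message($code));
}

# Usage: script url output-file
if (@ARGV == 2) {
    my ($url, $file) = @ARGV;
    my ($code, $message) = fetch_to_file($url, $file);
    $message = "unknown status" unless defined $message;
    die "could not get $url: $code $message\n" unless is_success($code);
    print "saved $url to $file ($code $message)\n";
}
```

The status code narrows down the failure (not found, forbidden, server error), but headers and finer control over the request still require the full LWP::UserAgent interface.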

Example 20.1 is a program that fetches a document remotely. If it fails, it prints out the error status line. Otherwise it prints out the document title and the number of bytes of content. We use three modules from LWP and one other from CPAN.


LWP::UserAgent
This module creates a virtual browser. The object returned from the new constructor is used to make the actual request. We've set the name of our agent to "Schmozilla/v9.14 Platinum" just to give the remote webmaster browser-envy when they see it in their logs.

HTTP::Request
This module creates a request but doesn't send it. We create a GET request and set the referring page to a fictitious URL.

HTTP::Response
This is the object type returned when the user agent actually runs the request. We check it for errors and contents.

URI::Heuristic
This curious little module uses Netscape-style guessing algorithms to expand partial URLs. For example:

Simple              Guess
perl                http://www.perl.com
www.oreilly.com     http://www.oreilly.com
ftp.funet.fi        ftp://ftp.funet.fi
/etc/passwd         file:/etc/passwd

Although these aren't legitimate URLs (their format is not in the URI specification), Netscape tries to guess the URL they stand for. Because Netscape does it, most other browsers do too.
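The guessing can be tried directly with the module's uf_urlstr function. This is a small sketch that assumes URI::Heuristic is installed; the sample inputs are illustrative, and they all contain a dot or a leading slash, because one-word names such as perl may trigger DNS lookups and so depend on the network and on how the module is configured.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI::Heuristic qw(uf_urlstr);

# Illustrative partial URLs; each one is expanded by pattern matching alone.
my @partials = ("www.oreilly.com", "ftp.funet.fi", "/etc/passwd");

for my $partial (@partials) {
    # uf_urlstr returns the guessed URL as a plain string.
    printf "%-16s => %s\n", $partial, uf_urlstr($partial);
}
```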

The source is in Example 20.1.

Example 20.1: titlebytes

#!/usr/bin/perl -w
# titlebytes - find the title and size of documents
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use URI::Heuristic;
my $raw_url = shift                      or die "usage: $0 url\n";
my $url = URI::Heuristic::uf_urlstr($raw_url);   # expand partial URLs
$| = 1;                                  # to flush next line
printf "%s =>\n\t", $url;
my $ua = LWP::UserAgent->new();
$ua->agent("Schmozilla/v9.14 Platinum"); # give it time, it'll get there
my $req = HTTP::Request->new(GET => $url);
$req->referer("http://wizard.yellowbrick.oz");
                                         # perplex the log analysers
my $response = $ua->request($req);       # send the request
if ($response->is_error()) {
    printf " %s\n", $response->status_line;
} else {
    my $content = $response->content();
    my $bytes   = length $content;
    my $count   = ($content =~ tr/\n/\n/);   # count newlines
    my $title   = $response->title();
    $title = "(no title)" unless defined $title;   # avoid a -w warning
    printf "%s (%d lines, %d bytes)\n", $title, $count, $bytes;
}

When run, the program produces output like this:

% titlebytes http://www.tpj.com/
http://www.tpj.com/ =>
    The Perl Journal (109 lines, 4530 bytes)

Yes, "referer" is not how "referrer" should be spelled. The standards people got it wrong when they misspelled HTTP_REFERER. Please use two r's when referring to things in English.
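The one-r spelling carries over into code: the accessor on the request object is referer, matching the header name. As a rough sketch that needs no network access, the following builds a request and a response by hand to show the pieces titlebytes inspects. The example.com and example.org URLs are placeholders, and the 404 response is constructed locally rather than received from a server.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Request;
use HTTP::Response;

# Build a request without sending it; both URLs are placeholders.
my $req = HTTP::Request->new(GET => "http://www.example.com/");
$req->referer("http://www.example.org/from.html");   # sets the Referer header

print $req->method, " ", $req->uri, "\n";
print "Referer: ", $req->header("Referer"), "\n";

# A hand-built response standing in for what $ua->request($req) would return.
my $response = HTTP::Response->new(404, "Not Found");
$response->request($req);

# is_error is true for 4xx and 5xx codes; status_line joins code and message.
print $response->status_line, "\n" if $response->is_error;
```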

See Also

The documentation for the CPAN module LWP::Simple, and the lwpcook(1) manpage that came with LWP; the documentation for the modules LWP::UserAgent, HTTP::Request, HTTP::Response, and URI::Heuristic; Recipe 20.2


Copyright © 2001 O'Reilly & Associates. All rights reserved.