Fetching a URL from a Perl Script (Perl Cookbook, 2nd Edition)

20.1.3. Discussion

The right library makes life easier, and the LWP modules are the right ones for this task: as the Solution shows, they make fetching a URL trivial.

The get function from LWP::Simple returns undef on error, so check for errors this way:

use LWP::Simple;
unless (defined ($content = get $URL)) {
    die "could not get $URL\n";
}

When called that way, however, you can't determine the cause of the error. For this and other elaborate processing, you'll have to go beyond LWP::Simple.
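To make that limitation concrete, here is a minimal sketch that runs the same failing fetch through both interfaces. It assumes a file: URL whose path (a made-up one, not from the recipe) does not exist on your machine, so the failure can be reproduced without touching the network; a real web failure would behave the same way, only with a different status line.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);
use LWP::UserAgent;

# Hypothetical path, assumed not to exist, so the fetch fails offline.
my $url = "file:/no/such/dir/titlebytes-demo.html";

# LWP::Simple: a failure is reported only as undef.
my $content = get($url);
print "LWP::Simple:    ",
    (defined $content ? "ok" : "failed, no reason given"), "\n";

# LWP::UserAgent: the response object carries the status code and message.
my $ua       = LWP::UserAgent->new;
my $response = $ua->get($url);
if ($response->is_error) {
    printf "LWP::UserAgent: %s\n", $response->status_line;
}
```

The first line of output can only say that something went wrong; the second names the status code and reason, which is what you need for logging or retry decisions.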

Example 20-1 is a program that fetches a remote document. If it fails, it prints out the error status line. Otherwise, it prints out the document title and the number of bytes of content. We use three modules, two of which are from LWP.

LWP::UserAgent: This module creates a virtual browser. The object returned from the new constructor is used to make the actual request. We've set the name of our agent to "Schmozilla/v9.14 Platinum" just to give the remote webmaster browser-envy when they see it in their logs. This is useful on obnoxious web servers that needlessly consult the user agent string to decide whether to return a proper page or an infuriating "you need Internet Navigator v12 or later to view this site" cop-out.
HTTP::Response: This is the object type returned when the user agent actually runs the request. We check it for errors and contents.
URI::Heuristic: This curious little module uses Netscape-style guessing algorithms to expand partial URLs. For example:

Simple	Guess
`perl`	http://www.perl.com
`www.oreilly.com`	http://www.oreilly.com
`ftp.funet.fi`	ftp://ftp.funet.fi
`/etc/passwd`	file:/etc/passwd

Although the simple forms listed aren't legitimate URLs (their format is not in the URI specification), Netscape tries to guess the URLs they stand for. Because Netscape does it, most other browsers do, too.
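The guesses in the table can be tried directly. The sketch below uses uf_uristr, the name under which URI::Heuristic documents the string-returning function that Example 20-1 calls as uf_urlstr. It covers only the three inputs that are expanded by pattern alone; the bare word perl is left out because, as far as we know, guessing it can involve hostname lookups, so the answer depends on your network.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI::Heuristic qw(uf_uristr);

# Partial URLs taken from the table above (all but the bare word "perl").
my @partial = ("www.oreilly.com", "ftp.funet.fi", "/etc/passwd");

# Map each partial form to the module's guess.
my %guess = map { $_ => uf_uristr($_) } @partial;

printf "%-16s %s\n", $_, $guess{$_} for @partial;
```

Input that already carries a scheme, such as http://www.perl.com/, should come back unchanged, which is why titlebytes can pass every command-line argument through the function without checking it first.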

The source is in Example 20-1.

Example 20-1. titlebytes

  #!/usr/bin/perl -w 
  # titlebytes - find the title and size of documents 
  use strict;
  use LWP::UserAgent; 
  use HTTP::Response; 
  use URI::Heuristic;
  my $raw_url = shift                      or die "usage: $0 url\n"; 
  my $url = URI::Heuristic::uf_urlstr($raw_url);
  $| = 1;                                  # to flush next line 
  printf "%s =>\n\t", $url;
  # bogus user agent
  my $ua = LWP::UserAgent->new( ); 
  $ua->agent("Schmozilla/v9.14 Platinum"); # give it time, it'll get there
  # bogus referrer to perplex the log analyzers
  my $response = $ua->get($url, Referer => "http://wizard.yellowbrick.oz");
  if ($response->is_error( )) {
    printf " %s\n", $response->status_line;
  } else {
    my $content = $response->content( );
    my $bytes = length $content;
    my $count = ($content =~ tr/\n/\n/);
    printf "%s (%d lines, %d bytes)\n",
      $response->title( ) || "(no title)", $count, $bytes;
  }

When run, the program produces output like this:

% titlebytes http://www.tpj.com/
http://www.tpj.com/ =>
    The Perl Journal (109 lines, 4530 bytes)

Yes, "referer" is not how "referrer" should be spelled. The standards people got it wrong when they misspelled HTTP_REFERER. Please use double r's when referring to things in English.

The first argument to the get method is the URL, and subsequent pairs of arguments are headers and their values.
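If you want to see what such a call puts on the wire without sending anything, one option is the GET function from HTTP::Request::Common, which takes the same URL-then-header-pairs arguments and returns the request object. This is only a sketch: the URL is a placeholder and the Accept header is added purely for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Request::Common qw(GET);

# Same argument shape as in Example 20-1: URL first, then header pairs.
my $request = GET(
    "http://www.example.com/",               # placeholder URL
    Referer => "http://wizard.yellowbrick.oz",
    Accept  => "text/html",                  # illustrative extra header
);

# Print the request line and headers; nothing is sent over the network.
print $request->as_string;
```

Printing the request this way is a cheap check that a header name is spelled the way you intended (one r in Referer included) before a remote server ever sees it.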
