Perl & LWP

3.4. User Agents

The first of LWP's two basic classes, and the simpler one to use, is LWP::UserAgent, which manages HTTP connections and performs requests for you. The new( ) constructor makes a user agent object:

$browser = LWP::UserAgent->new(%options);

The options and their default values are summarized in Table 3-1. The options are attributes whose values can be fetched or altered by the method calls described in the next section.

Table 3-1. Constructor options and default values for LWP::UserAgent

Key                    Default
agent                  "libwww-perl/#.###"
conn_cache             undef
cookie_jar             undef
from                   undef
max_size               undef
parse_head             1
protocols_allowed      undef
protocols_forbidden    undef
requests_redirectable  ['GET', 'HEAD']
timeout                180
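
As a minimal sketch of how the constructor options map onto attribute methods, the following builds a user agent with a few non-default settings and reads them back. The agent name, email address, and numeric values are hypothetical and chosen only for illustration; no request is sent.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical settings for a small spider; all values are illustrative.
my $browser = LWP::UserAgent->new(
    agent    => 'ExampleSpider/0.1',          # sent as the User-Agent header
    from     => 'spider-admin@example.com',   # sent as the From header
    timeout  => 10,                           # seconds
    max_size => 500_000,                      # bytes
);

# Each constructor key is also a get/set method on the object.
printf "agent=%s timeout=%d max_size=%d\n",
    $browser->agent, $browser->timeout, $browser->max_size;

# Calling a setter returns the value the attribute had before the call.
my $previous = $browser->timeout(30);
print "timeout was $previous, now ", $browser->timeout, "\n";
```

Passing options to new( ) and calling the attribute methods afterward are interchangeable; the constructor form simply keeps the configuration in one place.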

If you have a user agent object and want a copy of it (for example, you want to run the same requests over two connections, one persistent with KeepAlive and one without) use the clone( ) method:

$copy = $browser->clone( );

This object represents a browser and has attributes you can get and set by calling methods on the object. Attributes modify future connections (e.g., proxying, timeouts, and whether the HTTP connection can be persistent) or the requests sent over the connection (e.g., authentication and cookies, or HTTP headers).

3.4.1. Connection Parameters

The timeout( ) attribute represents how long LWP will wait for a server to respond to a request:

$oldval = $browser->timeout([newval]);

That is, if you want to set the value, you'd do it like so:

$browser->timeout(newval);

And if you wanted to read the value, you'd do it like this:

$value = $browser->timeout( );

And you could even set the value and get back the old value at the same time:

$previously = $browser->timeout(newval);

The default value of the timeout attribute is 180 seconds. If you're spidering, you might want to change this to a lower number to prevent your spider from wasting a lot of time on unreachable sites:

$oldval = $browser->timeout( );
$browser->timeout(10);
print "Changed timeout from $oldval to 10\n";
Changed timeout from 180 to 10

The max_size( ) method limits the number of bytes of an HTTP response that the user agent will read:

$size = $browser->max_size([bytes]);

The default value of the max_size( ) attribute is undef, signifying no limit. If the maximum size is exceeded, the response will have a Client-Aborted header. Here's how to test for that:

$response = $browser->request($req);
if ($response->header("Client-Aborted")) {
  warn "Response exceeded maximum size."
}

To have your browser object support HTTP Keep-Alive, call the conn_cache( ) method with a connection cache object of class LWP::ConnCache. This is done like so:

use LWP::ConnCache;
$cache = $browser->conn_cache(LWP::ConnCache->new( ));

The newly created connection cache object will cache only one connection at a time. To have it cache more, you access its total_capacity attribute. Here's how to increase that cache to 10 connections:

$browser->conn_cache->total_capacity(10);

To cache all connections (no limits):

$browser->conn_cache->total_capacity(undef);
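
Putting these steps together, here is a small sketch that attaches a connection cache and adjusts its capacity. It only inspects the cache object and sends no requests; the variable names are illustrative.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use LWP::ConnCache;

my $browser = LWP::UserAgent->new;

# Attach a fresh connection cache so connections can be kept alive.
$browser->conn_cache(LWP::ConnCache->new);

# A new cache holds a single connection by default.
my $default_capacity = $browser->conn_cache->total_capacity;
print "default capacity: $default_capacity\n";

# Raise the limit to 10 cached connections.
$browser->conn_cache->total_capacity(10);
my $raised_capacity = $browser->conn_cache->total_capacity;
print "raised capacity: $raised_capacity\n";

# Remove the limit entirely; the attribute then reads back as undef.
$browser->conn_cache->total_capacity(undef);
my $unlimited = !defined $browser->conn_cache->total_capacity;
print "unlimited: ", ($unlimited ? "yes" : "no"), "\n";
```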

3.4.2. Request Parameters

The agent( ) attribute gets and sets the string that LWP sends for the User-Agent header:

$oldval = $browser->agent([agent_string]);

Some web sites use this string to identify the browser. To pretend to be Netscape to get past web servers that check to see whether you're using a "supported browser," do this:

print "My user agent name is ", $browser->agent( ), ".\n";
$browser->agent("Mozilla/4.76 [en] (Windows NT 5.0; U)");
print "And now I'm calling myself ", $browser->agent( ), "!\n";
My user agent name is libwww-perl/5.60.
And now I'm calling myself Mozilla/4.76 [en] (Windows NT 5.0; U)!

The from( ) attribute controls the From header, which contains the email address of the user making the request:

$old_address = $browser->from([email_address]);

The default value is undef, which indicates no From header should be sent.

The user agent object can manage the sending and receiving of cookies for you. Control this with the cookie_jar( ) method:

$old_cj_obj = $browser->cookie_jar([cj_obj]);

This reads or sets the HTTP::Cookies object that's used for holding all this browser's cookies. By default, there is no cookie jar, in which case the user agent ignores cookies.

To create a temporary cookie jar, which will keep cookies only for the duration of the user agent object:

$browser->cookie_jar(HTTP::Cookies->new);

To use a file as a persistent store for cookies:

my $some_file = '/home/mojojojo/cookies.lwp';
$browser->cookie_jar(HTTP::Cookies->new(
  'file' => $some_file, 'autosave' => 1
));

Cookies are discussed in more detail in Chapter 11, "Cookies, Authentication, and Advanced Requests".

3.4.3. Protocols

LWP allows you to control the protocols with which a user agent can fetch documents. You can choose to allow only a certain set of protocols, or allow all but a few. You can also test a protocol to see whether it's supported by LWP and by this particular browser object.

The protocols_allowed( ) and protocols_forbidden( ) methods explicitly permit or forbid certain protocols (e.g., FTP or HTTP) from being used by this user agent:

$aref_maybe = $browser->protocols_allowed([\@protocols]);
$aref_maybe = $browser->protocols_forbidden([\@protocols]);

Call the methods with no arguments to get an array reference containing the allowed or forbidden protocols, or undef if the attribute isn't set. By default, neither is set, which means that this browser supports all the protocols that your installation of LWP supports.

For example, if you're processing a list of URLs and don't want to parse them to weed out the FTP URLs, you could write this:

$browser->protocols_forbidden(["ftp"]);

Then you can blindly execute requests, and any ftp URLs will fail automatically. That is, if you request an ftp URL, the browser object returns an error response without performing any actual request.

Instead of forbidden protocols, you can specify which to allow by using the protocols_allowed method. For example, to set this browser object to support only http and gopher URLs, you could write this:

$browser->protocols_allowed(["http", "gopher"]);

To check whether LWP and this particular browser object support a particular URL protocol, use the is_protocol_supported( ) method. It returns true if LWP supports the protocol, the protocol isn't in protocols_forbidden, and, when a protocols_allowed list has been set, the protocol appears in that list. You call it like this:

$boolean = $browser->is_protocol_supported(scheme);

For example:

unless ($browser->is_protocol_supported("https")) {
  warn "Cannot process https:// URLs.\n";
}
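
The following sketch shows the forbidden-protocol behavior end to end. The host name is a placeholder; because ftp is forbidden, the user agent returns an error response without contacting any server. The exact status line text is left unasserted, since it may differ between LWP versions.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;
$browser->protocols_forbidden(['ftp']);

# Ask the browser object what it is willing to fetch.
my $ftp_ok  = $browser->is_protocol_supported('ftp');
my $http_ok = $browser->is_protocol_supported('http');
print "ftp supported:  ", ($ftp_ok  ? "yes" : "no"), "\n";
print "http supported: ", ($http_ok ? "yes" : "no"), "\n";

# A request for a forbidden scheme fails immediately with an error response.
# ftp.example.com is a placeholder host and is never contacted.
my $resp = $browser->get('ftp://ftp.example.com/pub/README');
print "ftp request: ", $resp->status_line, "\n" if $resp->is_error;
```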

3.4.7. Request Methods

There are three basic request methods:

$resp = $browser->get(url);
$resp = $browser->head(url);
$resp = $browser->post(url, \@form_data);

If you're specifying extra header lines to be sent with the request, do it like this:

$resp = $browser->get(url, Header1 => Value1, Header2 => Value2, ...);
$resp = $browser->head(url, Header1 => Value1, Header2 => Value2, ...);
$resp = $browser->post(url, \@form_data,
                       Header1 => Value1, Header2 => Value2, ...);

For example:

$resp = $browser->get("http://www.nato.int",
  'Accept-Language' => 'en-US',
  'Accept-Charset' => 'iso-8859-1,*,utf-8',
  'Accept-Encoding' => 'gzip',
  'Accept' =>
   "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*",
);

3.4.7.1. Saving response content to a file

With normal requests, the body of the response is stored in the response object's $response->content( ) attribute by default. That's fine when the response body is a moderately small piece of data such as a 20-kilobyte HTML file. But a 6-megabyte MP3 file should probably be saved to disk without saving it in memory first.

The request methods support this by providing sort of fake header lines that don't turn into real headers in the request but act as options for LWP's handling of the request. Each option/header starts with a ":" character, a character that no real HTTP header name could contain. The simplest option is ':content_file' => filename.

$resp = $browser->get(url, ':content_file' => filename, ...);
$resp = $browser->head(url, ':content_file' => filename, ...);
$resp = $browser->post(url, \@form_data,
  ':content_file' => filename, ...);

With this option, the content of the response is saved to the given filename, overwriting whatever might be in that file already. (In theory, no response to a HEAD request should ever have content, so it seems odd to specify where content should be saved. However, in practice, some strange servers and many CGIs on otherwise normal servers do respond to HEAD requests as if they were GET requests.)

A typical example:

my $out = 'weather_satellite.jpg';
my $resp = $browser->get('http://weathersys.int/',
  ':content_file' => $out,
);
die "Couldn't get the weather picture: ", $resp->status_line
 unless $resp->is_success;

This feature is also useful for cases in which you were planning on saving the content to that file anyway. Also see the mirror( ) method described below, which does something similar to $browser->get($url, ':content_file' => filename, ...).
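
To try ':content_file' without a network connection, one option is to point the request at a local file through LWP's file: URL support. This is only a sketch under that assumption: the temporary files stand in for a remote resource and a download target, and URI::file is used just to build the URL.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use File::Temp qw(tempfile);
use URI::file;

# Stand-in for a remote resource: a 5000-byte local file.
my ($src_fh, $src_path) = tempfile(UNLINK => 1);
binmode $src_fh;
print {$src_fh} 'x' x 5000;
close $src_fh or die "Couldn't close $src_path: $!";

# Destination that the response body will be written to.
my (undef, $out) = tempfile(UNLINK => 1);

my $browser = LWP::UserAgent->new;
my $resp = $browser->get(
    URI::file->new_abs($src_path),
    ':content_file' => $out,
);
die "Couldn't fetch the file: ", $resp->status_line
    unless $resp->is_success;

# The body went to disk rather than into the response object.
printf "saved %d bytes to %s\n", -s $out, $out;
```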

3.4.7.2. Sending response content to a callback

If you instead provide an option/header pair consisting of ':content_cb' and a subroutine reference, LWP won't save the content in memory or to a file but will instead call the subroutine every so often, as new data comes in over the connection to the remote server. This is the syntax for specifying such a callback routine:

$resp = $browser->get(url, ':content_cb' => \&mysub, ...);
$resp = $browser->head(url, ':content_cb' => \&mysub, ...);
$resp = $browser->post(url, \@form_data,
  ':content_cb' => \&mysub, ...);

Whatever subroutine you define will get chunks of the newly received data passed in as the first parameter, and the second parameter will be the new HTTP::Response object that will eventually get returned from the current get/head/post call. So you should probably start every callback routine like this:

sub callbackname {
  my($data, $response) = @_;
  ...
}

Here, for example, is a routine that hex-dumps whatever data is received as a response to this request:

my $resp = $browser->get('http://www.perl.com',
  ':content_cb' => \&hexy,
);
sub hexy {
  my($data, $resp) = @_;
  print length($data), " bytes:\n";
  print '  ', unpack('H*', substr($data,0,16,'')), "\n"
   while length $data;
  return;
}

In fact, you can pass an anonymous routine as the callback. The above could just as well be expressed like this:

my $resp = $browser->get('http://www.perl.com/',
  ':content_cb' => sub {
    my($data, $resp) = @_;
    print length($data), " bytes:\n";
    print '  ', unpack('H*', substr($data,0,16,'')), "\n"
     while length $data;
    return;
  }
);

The size of the $data string is unpredictable. If it matters to you how big each chunk is, you can specify another option, ':read_size_hint' => byte_count, which LWP will take as a hint for how many bytes you want the typical $data string to be:

$resp = $browser->get(url,
  ':content_cb' => \&mysub,
  ':read_size_hint' => byte_count,
  ...,
);
$resp = $browser->head(url,
  ':content_cb' => \&mysub,
  ':read_size_hint' => byte_count,
  ...,
);
$resp = $browser->post(url, \@form_data,
  ':content_cb' => \&mysub,
  ':read_size_hint' => byte_count,
  ...,
);

We can modify our hex-dumper routine to be called like this:

my $resp = $browser->get('http://www.perl.com',
  ':content_cb' => \&hexy,
  ':read_size_hint' => 1024,
);

However, there is no guarantee that's how big the $data string will actually be. It is merely a hint, which LWP may disregard.
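
Because chunk sizes are not guaranteed, a callback is usually safest when it accumulates state rather than assuming a fixed chunk length. The sketch below counts chunks and bytes. It assumes LWP's file: URL support so that it runs without a network; the 40-byte temporary file is a hypothetical stand-in for a remote document.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use File::Temp qw(tempfile);
use URI::file;

# Stand-in for a remote document: a 40-byte local file.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;
print {$fh} 'A' x 40;
close $fh or die "Couldn't close $path: $!";

my ($chunks, $total) = (0, 0);

my $browser = LWP::UserAgent->new;
my $resp = $browser->get(
    URI::file->new_abs($path),
    ':content_cb' => sub {
        my ($data, $response) = @_;
        $chunks++;                 # one call per chunk delivered
        $total += length $data;    # chunk sizes may vary from call to call
        return;
    },
    ':read_size_hint' => 16,       # a hint only; LWP may disregard it
);
die "Request failed: ", $resp->status_line unless $resp->is_success;

print "received $total bytes in $chunks chunk(s)\n";
```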

3.4.7.3. Mirroring a URL to a file

The mirror( ) method GETs a URL and stores the result to a file:

$response = $browser->mirror(url_to_get, filename);

But it has the added feature that it uses an HTTP If-Modified-Since header line on the request it performs, to avoid transferring the remote file unless it has changed since the local file (filename) was last changed. The mirror( ) method returns a new HTTP::Response object but without a content attribute (any interesting content will have been written to the local file). You should at least check $response->is_error( ):

$response = $browser->mirror("http://www.cpan.org/",
                             "cpan_home.html");
if( $response->is_error( ) ){
  die "Couldn't access the CPAN home page: " .
    $response->status_line;
}
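
When the local copy is already current, mirror( ) gets back a 304 (Not Modified) response, which is neither a success nor an error. The sketch below mirrors the same source twice to show both outcomes. It assumes LWP's file: URL support so that it runs without a network; the temporary file and directory are hypothetical stand-ins for a remote page and a local download location.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use File::Temp qw(tempfile tempdir);
use URI::file;

# Stand-in for the remote page.
my ($src_fh, $src) = tempfile(UNLINK => 1);
print {$src_fh} "hello, mirror\n";
close $src_fh or die "Couldn't close $src: $!";

# Where the mirrored copy will live.
my $dir  = tempdir(CLEANUP => 1);
my $copy = "$dir/copy.txt";

my $browser = LWP::UserAgent->new;
my $url     = URI::file->new_abs($src);

# The first call transfers the content; the second finds the copy up to date.
my $first  = $browser->mirror($url, $copy);
my $second = $browser->mirror($url, $copy);

for my $response ($first, $second) {
    die "Couldn't mirror: ", $response->status_line
        if $response->is_error;
    print $response->code == 304 ? "unchanged\n" : "fetched\n";
}
```

Checking is_error( ) rather than is_success( ) matters here, since a 304 response would fail an is_success( ) test even though nothing went wrong.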



Copyright © 2002 O'Reilly & Associates. All rights reserved.