User Agents (Perl & LWP)

3.4. User Agents

The first and simplest of LWP's two basic classes is LWP::UserAgent, which manages HTTP connections and performs requests for you. The new( ) constructor makes a user agent object:

$browser = LWP::UserAgent->new(%options);

The options and their default values are summarized in Table 3-1. The options are attributes whose values can be fetched or altered by the method calls described in the next section.

Table 3-1. Constructor options and default values for LWP::UserAgent

Key	Default
agent	"libwww-perl/#.###"
conn_cache	undef
cookie_jar	undef
from	undef
max_size	undef
parse_head	1
protocols_allowed	undef
protocols_forbidden	undef
requests_redirectable	['GET', 'HEAD']
timeout	180
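As a minimal sketch of passing options to the constructor, the following overrides three of the defaults from Table 3-1. The agent string and the particular values are made up for illustration.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Override a few defaults at construction time.
# 'ExampleSpider/0.1' and the numbers below are illustrative assumptions.
my $browser = LWP::UserAgent->new(
    agent    => 'ExampleSpider/0.1',
    timeout  => 30,           # seconds
    max_size => 1_000_000,    # bytes
);

# Each option can be read back through the method of the same name.
printf "agent=%s timeout=%d max_size=%d\n",
    $browser->agent, $browser->timeout, $browser->max_size;
```

Options you leave out keep the defaults listed in the table.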

If you have a user agent object and want a copy of it (for example, you want to run the same requests over two connections, one persistent with KeepAlive and one without), use the clone( ) method:

$copy = $browser->clone( );
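A short sketch of why a clone is useful: the copy starts with the original's settings, and changing an attribute on one should leave the other alone. The timeout values here are arbitrary.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new(timeout => 30);

# The clone inherits the original's attributes...
my $copy = $browser->clone;

# ...but can then be adjusted independently.
$copy->timeout(5);

print "original: ", $browser->timeout, ", copy: ", $copy->timeout, "\n";
```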

This object represents a browser and has attributes you can get and set by calling methods on the object. Attributes modify future connections (e.g., proxying, timeouts, and whether the HTTP connection can be persistent) or the requests sent over the connection (e.g., authentication and cookies, or HTTP headers).

3.4.1. Connection Parameters

The timeout( ) attribute represents how long LWP will wait for a server to respond to a request:

$oldval = $browser->timeout([newval]);

That is, if you want to set the value, you'd do it like so:

$browser->timeout(newval);

And if you wanted to read the value, you'd do it like this:

$value = $browser->timeout( );

And you could even set the value and get back the old value at the same time:

$previously = $browser->timeout(newval);

The default value of the timeout attribute is 180 seconds. If you're spidering, you might want to change this to a lower number to prevent your spider from wasting a lot of time on unreachable sites:

$oldval = $browser->timeout( );
$browser->timeout(10);
print "Changed timeout from $oldval to 10\n";
Changed timeout from 180 to 10

The max_size( ) method limits the number of bytes of an HTTP response that the user agent will read:

$size = $browser->max_size([bytes])

The default value of the max_size( ) attribute is undef, signifying no limit. If the maximum size is exceeded, the response will have a Client-Aborted header. Here's how to test for that:

$response = $browser->request($req);
if ($response->header("Client-Aborted")) {
  warn "Response exceeded maximum size."
}
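To try this without a network connection, one option is to fetch a local file through a file: URL. The sketch below assumes LWP's file: protocol support is available and uses a scratch file, created with File::Temp, as a stand-in for a large remote document. The sizes are arbitrary.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use File::Temp qw(tempfile);
use URI::file;

# A 10,000-byte scratch file stands in for a large remote document.
my ($fh, $path) = tempfile(SUFFIX => '.txt', UNLINK => 1);
print $fh 'x' x 10_000;
close $fh or die "Can't close $path: $!";

# Refuse to read much more than 100 bytes of any response.
my $browser  = LWP::UserAgent->new(max_size => 100);
my $response = $browser->get(URI::file->new_abs($path));

if ($response->header('Client-Aborted')) {
    warn "Response exceeded maximum size.\n";
}
```

Note that the limit is checked as chunks arrive, so the truncated content may be somewhat longer than max_size.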

To have your browser object support HTTP Keep-Alive, call the conn_cache( ) method with a connection cache object of class LWP::ConnCache. This is done like so:

use LWP::ConnCache;
$cache = $browser->conn_cache(LWP::ConnCache->new( ));

The newly created connection cache object will cache only one connection at a time. To have it cache more, you access its total_capacity attribute. Here's how to increase that cache to 10 connections:

$browser->conn_cache->total_capacity(10);

To cache all connections (no limits):

$browser->conn_cache->total_capacity(undef);
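Putting those calls together, a minimal sketch (no requests are made, so it only shows the configuration):

```perl
use strict;
use warnings;
use LWP::UserAgent;
use LWP::ConnCache;

my $browser = LWP::UserAgent->new;

# Attach a fresh connection cache to enable Keep-Alive.
$browser->conn_cache(LWP::ConnCache->new);
print "initial capacity: ", $browser->conn_cache->total_capacity, "\n";

# Allow up to 10 cached connections.
$browser->conn_cache->total_capacity(10);
print "new capacity: ", $browser->conn_cache->total_capacity, "\n";
```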

3.4.2. Request Parameters

The agent( ) attribute gets and sets the string that LWP sends for the User-Agent header:

$oldval = $browser->agent([agent_string]);

Some web sites use this string to identify the browser. To pretend to be Netscape to get past web servers that check to see whether you're using a "supported browser," do this:

print "My user agent name is ", $browser->agent( ), ".\n";
$browser->agent("Mozilla/4.76 [en] (Windows NT 5.0; U)");
print "And now I'm calling myself ", $browser->agent( ), "!\n";
My user agent name is libwww-perl/5.60.
And now I'm calling myself Mozilla/4.76 [en] (Windows NT 5.0; U)!

The from( ) attribute controls the From header, which contains the email address of the user making the request:

$old_address = $browser->from([email_address]);

The default value is undef, which indicates no From header should be sent.

The user agent object can manage the sending and receiving of cookies for you. Control this with the cookie_jar( ) method:

$old_cj_obj = $browser->cookie_jar([cj_obj])

This reads or sets the HTTP::Cookies object that's used for holding all this browser's cookies. By default, there is no cookie jar, in which case the user agent ignores cookies.

To create a temporary cookie jar, which will keep cookies only for the duration of the user agent object:

use HTTP::Cookies;
$browser->cookie_jar(HTTP::Cookies->new);

To use a file as a persistent store for cookies:

my $some_file = '/home/mojojojo/cookies.lwp';
$browser->cookie_jar(HTTP::Cookies->new(
  'file' => $some_file, 'autosave' => 1
));

Cookies are discussed in more detail in Chapter 11, "Cookies, Authentication, and Advanced Requests".

3.4.3. Protocols

LWP allows you to control the protocols with which a user agent can fetch documents. You can choose to allow only a certain set of protocols, or allow all but a few. You can also test a protocol to see whether it's supported by LWP and by this particular browser object.

The protocols_allowed( ) and protocols_forbidden( ) methods explicitly permit or forbid certain protocols (e.g., FTP or HTTP) from being used by this user agent:

$aref_maybe = $browser->protocols_allowed([\@protocols]);
$aref_maybe = $browser->protocols_forbidden([\@protocols]);

Call the methods with no arguments to get an array reference containing the allowed or forbidden protocols, or undef if the attribute isn't set. By default, neither is set, which means that this browser supports all the protocols that your installation of LWP supports.

For example, if you're processing a list of URLs and don't want to parse them to weed out the FTP URLs, you could write this:

$browser->protocols_forbidden(["ftp"]);

Then you can blindly execute requests, and any ftp URLs will fail automatically. That is, if you request an ftp URL, the browser object returns an error response without performing any actual request.

Instead of forbidden protocols, you can specify which to allow by using the protocols_allowed method. For example, to set this browser object to support only http and gopher URLs, you could write this:

$browser->protocols_allowed(["http", "gopher"]);

To check whether LWP and this particular browser support a particular URL protocol, use the is_protocol_supported( ) method. It returns true if LWP supports the protocol and this browser object permits it: the protocol is not listed in protocols_forbidden and, if a protocols_allowed list has been set, the protocol appears in that list. You call it like this:

$boolean = $browser->is_protocol_supported(scheme);

For example:

unless ($browser->is_protocol_supported("https")) {
  warn "Cannot process https:// URLs.\n";
}
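The following sketch combines the two ideas: it forbids ftp, asks the browser which schemes it will handle, and then shows that a forbidden request fails without touching the network. The hostname ftp.example.int is a placeholder.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;
$browser->protocols_forbidden(['ftp']);

# Ask the browser object which schemes it is willing to handle.
for my $scheme (qw(http ftp)) {
    printf "%-5s %s\n", $scheme,
        $browser->is_protocol_supported($scheme) ? 'supported' : 'not supported';
}

# A forbidden URL yields an error response; no connection is attempted.
my $response = $browser->get('ftp://ftp.example.int/pub/README');
print "ftp request failed: ", $response->status_line, "\n"
    if $response->is_error;
```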

3.4.4. Redirection

A server can reply to a request with a response that redirects the user agent to a new location. A user agent can automatically follow redirections for you. By default, LWP::UserAgent objects follow GET and HEAD method redirections.

The requests_redirectable( ) attribute controls the list of methods for which the user agent will automatically follow redirections:

$aref = $browser->requests_redirectable([\@methods]);

To disable the automatic following of redirections, pass in a reference to an empty array:

$browser->requests_redirectable([]);

To add POST to the list of redirectable methods:

push @{$browser->requests_redirectable}, 'POST';

You can test a request to see whether the method in that request is one for which the user agent will follow redirections:

$boolean = $browser->redirect_ok(request);

The redirect_ok( ) method returns true if redirections are permitted for the method in the request.
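As a small sketch, you can also inspect the list directly to see which methods will be followed. The set of methods tested below is arbitrary.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# Start from the default list and add POST to it.
push @{ $browser->requests_redirectable }, 'POST';

# Build a lookup table from the current list.
my %follow = map { $_ => 1 } @{ $browser->requests_redirectable };

for my $method (qw(GET HEAD POST PUT)) {
    printf "%-4s redirects %s\n", $method,
        $follow{$method} ? 'are followed' : 'are not followed';
}
```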

3.4.5. Authentication

The user agent can manage authentication information for a series of requests to the same site. The credentials( ) method sets a username and password for a particular realm on a site:

$browser->credentials(host_port, realm, uname, pass);

A realm is a string that's used to identify the locked-off area on the given server and port. In interactive browsers, the realm is the string that's displayed as part of the pop-up window that appears. For example, if the pop-up window says "Enter username for Unicode-MailList-Archives at www.unicode.org," then the realm string is Unicode-MailList-Archives, and the host_port value is www.unicode.org:80. (The browser doesn't typically show the :80 part for HTTP, nor the :443 part for HTTPS, as those are the default port numbers.)

The username and password will then be sent for any request whose hostname and port match the ones given in host_port and that requires authorization in that realm. For example:

$browser->credentials("intranet.example.int:80", "Finances",
                      "fred", "3l1t3");

From that point on, any request this browser makes to intranet.example.int on port 80 that requires authentication with a realm name of "Finances" will be tried with the username "fred" and the password "3l1t3".
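To see how the host_port match works without making a request, the sketch below uses get_basic_credentials( ), the LWP::UserAgent method that LWP consults when a server demands authentication. That method is not otherwise covered in this section, and the two URLs are hypothetical.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use URI;

my $browser = LWP::UserAgent->new;
$browser->credentials('intranet.example.int:80', 'Finances',
                      'fred', '3l1t3');

# Only the URL on port 80 (the HTTP default) matches the stored host_port.
for my $url ('http://intranet.example.int/budget.html',
             'http://intranet.example.int:8080/budget.html') {
    my ($user, $pass) =
        $browser->get_basic_credentials('Finances', URI->new($url));
    print "$url: ",
        (defined $user ? "would try user $user" : "no stored credentials"),
        "\n";
}
```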

For more information on authentication, see Chapter 11, "Cookies, Authentication, and Advanced Requests".

3.4.6. Proxies

One potentially important function of the user agent object is managing proxies. The env_proxy( ) method configures the proxy settings:

$browser->env_proxy( );

This method inspects proxy settings from environment variables such as http_proxy, gopher_proxy, and no_proxy. If you don't use a proxy, those environment variables aren't set, and the call to env_proxy( ) has no effect.

To set proxying from within your program, use the proxy( ) and no_proxy( ) methods. The proxy( ) method sets or retrieves the proxy for a particular scheme:

$browser->proxy(scheme, proxy);
$browser->proxy(\@schemes, proxy);
$proxy = $browser->proxy(scheme);

The first two forms set the proxy for one or more schemes. The third form returns the proxy for a particular scheme. For example:

$p = $browser->proxy("ftp");
$browser->proxy("ftp", "http://firewall:8001/");
print "Changed proxy from $p to our firewall.\n";

The no_proxy( ) method lets you disable proxying for particular domains:

$browser->no_proxy([ domain, ... ]);

Pass a list of domains to no_proxy( ) to add them to the list of domains that are not proxied (e.g., those within your corporate firewall). For example:

$browser->no_proxy("c64.example.int", "localhost", "server");

Call no_proxy( ) with no arguments to clear the list of unproxied domains:

$browser->no_proxy( );  # no exceptions to proxying
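A brief sketch of configuring proxies in code. The proxy URL comes from the example above; the domain list is illustrative.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# Route both http and ftp requests through the same proxy.
$browser->proxy(['http', 'ftp'], 'http://firewall:8001/');

# Hosts in these domains are contacted directly.
$browser->no_proxy('localhost', 'example.int');

# Read the settings back, one scheme at a time.
for my $scheme (qw(http ftp gopher)) {
    my $proxy = $browser->proxy($scheme);
    printf "%-6s %s\n", $scheme, defined $proxy ? $proxy : '(no proxy)';
}
```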

3.4.7. Request Methods

There are three basic request methods:

$resp = $browser->get(url);
$resp = $browser->head(url);
$resp = $browser->post(url, \@form_data);

If you're specifying extra header lines to be sent with the request, do it like this:

$resp = $browser->get(url, Header1 => Value1, Header2 => Value2, ...);
$resp = $browser->head(url, Header1 => Value1, Header2 => Value2, ...);
$resp = $browser->post(url, \@form_data,
                       Header1 => Value1, Header2 => Value2, ...);

For example:

$resp = $browser->get("http://www.nato.int",
  'Accept-Language' => 'en-US',
  'Accept-Charset' => 'iso-8859-1,*,utf-8',
  'Accept-Encoding' => 'gzip',
  'Accept' =>
   "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*",
);
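One way to check which headers were actually attached is to look at the request object stored in the response. The sketch below fetches a scratch file through a file: URL so that it runs offline; the file and its contents are assumptions made for the example.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use File::Temp qw(tempfile);
use URI::file;

# A local scratch file stands in for a remote document.
my ($fh, $path) = tempfile(SUFFIX => '.txt', UNLINK => 1);
print $fh "hello\n";
close $fh or die "Can't close $path: $!";

my $browser = LWP::UserAgent->new;
my $resp = $browser->get(URI::file->new_abs($path),
    'Accept-Language' => 'en-US',
    'Accept-Encoding' => 'gzip',
);

# The response remembers the request that produced it.
print $resp->request->as_string;
```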

3.4.7.1. Saving response content to a file

With normal requests, the body of the response is stored in the response object's $response->content( ) attribute by default. That's fine when the response body is a moderately small piece of data such as a 20-kilobyte HTML file. But a 6-megabyte MP3 file should probably be saved to disk without saving it in memory first.

The request methods support this through pseudo-header lines: options that look like headers but are never sent in the request and instead control how LWP handles it. Each such option starts with a ":" character, which no real HTTP header name can contain. The simplest option is ':content_file' => filename.

$resp = $browser->get(url, ':content_file' => filename, ...);
$resp = $browser->head(url, ':content_file' => filename, ...);
$resp = $browser->post(url, \@form_data,
  ':content_file' => filename, ...);

With this option, the content of the response is saved to the given filename, overwriting whatever might be in that file already. (In theory, no response to a HEAD request should ever have content, so it seems odd to specify where content should be saved. However, in practice, some strange servers and many CGIs on otherwise normal servers do respond to HEAD requests as if they were GET requests.)

A typical example:

my $out = 'weather_satellite.jpg';
my $resp = $browser->get('http://weathersys.int/',
  ':content_file' => $out,
);
die "Couldn't get the weather picture: ", $resp->status_line
 unless $resp->is_success;
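Here is an offline sketch of the same idea, assuming LWP's file: protocol support. A scratch file plays the role of the remote resource, and the output filename is derived from it purely for convenience.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use File::Temp qw(tempfile);
use URI::file;

# Create a source file to fetch.
my ($fh, $src) = tempfile(SUFFIX => '.txt', UNLINK => 1);
print $fh "line $_\n" for 1 .. 1000;
close $fh or die "Can't close $src: $!";
my $expected = -s $src;

my $out     = "$src.copy";    # hypothetical destination
my $browser = LWP::UserAgent->new;
my $resp    = $browser->get(URI::file->new_abs($src),
    ':content_file' => $out,
);
die "Couldn't fetch $src: ", $resp->status_line unless $resp->is_success;

# The body went to disk rather than into the response object.
my $saved     = -s $out;
my $in_memory = length $resp->content;
print "saved $saved bytes to disk; $in_memory bytes held in memory\n";
unlink $out;
```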

This feature is also useful for cases in which you were planning on saving the content to that file anyway. Also see the mirror( ) method described below, which does something similar to $browser->get($url, ':content_file' => filename, ...).

3.4.7.2. Sending response content to a callback

If you instead provide an option/header pair consisting of ':content_cb' and a subroutine reference, LWP won't save the content in memory or to a file but will instead call the subroutine every so often, as new data comes in over the connection to the remote server. This is the syntax for specifying such a callback routine:

$resp = $browser->get(url, ':content_cb' => \&mysub, ...);
$resp = $browser->head(url, ':content_cb' => \&mysub, ...);
$resp = $browser->post(url, \@form_data,
  ':content_cb' => \&mysub, ...);

Whatever subroutine you define will get chunks of the newly received data passed in as the first parameter, and the second parameter will be the new HTTP::Response object that will eventually get returned from the current get/head/post call. So you should probably start every callback routine like this:

sub callbackname {
 my($data, $response) = @_;
 ...

Here, for example, is a routine that hex-dumps whatever data is received as a response to this request:

my $resp = $browser->get('http://www.perl.com',
 ':content_cb' => \&hexy,
);
sub hexy {
  my($data, $resp) = @_;
  print length($data), " bytes:\n";
  print '  ', unpack('H*', substr($data,0,16,'')), "\n"
   while length $data;
  return;
}

In fact, you can pass an anonymous routine as the callback. The above could just as well be expressed like this:

my $resp = $browser->get('http://www.perl.com/',
  ':content_cb' => sub {
    my($data, $resp) = @_;
    print length($data), " bytes:\n";
    print '  ', unpack('H*', substr($data,0,16,'')), "\n"
     while length $data;
    return;
  }
);

The size of the $data string is unpredictable. If it matters to you how big each is, you can specify another option, :read_size_hint => byte_count, which LWP will take as a hint for how many bytes you want the typical $data string to be:

$resp = $browser->get(url,
  ':content_cb' => \&mysub,
  ':read_size_hint' => byte_count,
  ...,
);
$resp = $browser->head(url,
  ':content_cb' => \&mysub,
  ':read_size_hint' => byte_count,
  ...,
);
$resp = $browser->post(url, \@form_data,
  ':content_cb' => \&mysub,
  ':read_size_hint' => byte_count,
  ...,
);

We can modify our hex-dumper routine to be called like this:

my $resp = $browser->get('http://www.perl.com',
  ':content_cb' => \&hexy,
  ':read_size_hint' => 1024,
);

However, there is no guarantee that's how big the $data string will actually be. It is merely a hint, which LWP may disregard.
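To watch the chunking in action, a callback can simply count what it receives. The sketch below reads a 5,000-byte scratch file through a file: URL (an assumption made so that it runs offline) and reports how the data arrived.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use File::Temp qw(tempfile);
use URI::file;

# A 5,000-byte scratch file stands in for the remote document.
my ($fh, $src) = tempfile(SUFFIX => '.txt', UNLINK => 1);
print $fh 'x' x 5_000;
close $fh or die "Can't close $src: $!";

my ($chunks, $bytes) = (0, 0);
my $browser = LWP::UserAgent->new;
my $resp = $browser->get(URI::file->new_abs($src),
    ':content_cb' => sub {
        my ($data, $response) = @_;
        $chunks++;                  # one call per chunk received
        $bytes += length $data;     # running total of body bytes
        return;
    },
    ':read_size_hint' => 1024,
);

print "received $bytes bytes in $chunks chunk(s)\n";
```

Whatever the chunk sizes turn out to be, the byte total should equal the size of the document.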

3.4.7.3. Mirroring a URL to a file

The mirror( ) method GETs a URL and stores the result to a file:

$response = $browser->mirror(url_to_get, filename)

But it has the added feature that it uses an HTTP If-Modified-Since header line on the request it performs, to avoid transferring the remote file unless it has changed since the local file (filename) was last changed. The mirror( ) method returns a new HTTP::Response object but without a content attribute (any interesting content will have been written to the local file). You should at least check $response->is_error( ):

$response = $browser->mirror("http://www.cpan.org/",
                             "cpan_home.html");
if( $response->is_error( ) ){
  die "Couldn't access the CPAN home page: " .
    $response->status_line;
}
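The effect of If-Modified-Since is easiest to see by mirroring the same URL twice. This sketch assumes LWP's file: protocol support and uses a scratch file as the "remote" resource. The second attempt should not need to transfer anything, and the usual status for that case is 304 (Not Modified), which counts as neither an error nor a success.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use File::Temp qw(tempfile);
use URI::file;

# Source file to mirror.
my ($fh, $src) = tempfile(SUFFIX => '.txt', UNLINK => 1);
print $fh "hello\n";
close $fh or die "Can't close $src: $!";

my $url     = URI::file->new_abs($src);
my $copy    = "$src.mirror";    # hypothetical local copy
my $browser = LWP::UserAgent->new;

my @codes;
for my $attempt (1, 2) {
    my $response = $browser->mirror($url, $copy);
    die "Couldn't mirror $url: ", $response->status_line
        if $response->is_error;
    push @codes, $response->code;
    print "attempt $attempt: ", $response->status_line, "\n";
}
unlink $copy;
```

This is why the text recommends testing is_error( ) rather than is_success( ): an up-to-date local copy is a perfectly good outcome.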

3.4.8. Advanced Methods

The HTML specification permits meta tags in the head of a document, some of which are alternatives to HTTP headers. By default, if the response is an HTML document, its head section is parsed, and some of the content of the head tags is copied into the HTTP::Response object's headers. For example, consider an HTML document that starts like this:

<html>
<head><title>Kiki's Pie Page</title>
 <base href="http://cakecity.int/">
 <meta name="Notes" content="I like pie!">
 <meta http-equiv="Description" content="PIE RECIPES FROM KIKI">
</head>

If you request that document and call print $response->headers_as_string on it, you'll see this:

Date: Fri, 05 Apr 2002 11:19:51 GMT
Accept-Ranges: bytes
Server: Apache/1.3.23
Content-Base: http://cakecity.int/
Content-Length: 204
Content-Type: text/html
Last-Modified: Fri, 05 Apr 2002 11:19:38 GMT
Client-Date: Fri, 05 Apr 2002 11:19:51 GMT
Description: PIE RECIPES FROM KIKI
Title: Kiki's Pie Page
X-Meta-Notes: I like pie!

You can access those headers individually with $response->header('Content-Base'), $response->header('Description'), $response->header('Title'), and $response->header('X-Meta-Notes'), respectively, as we shall see in the next section.

The documentation for the HTML::HeadParser module, which LWP uses to implement this feature, explains the exact details.
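As a rough sketch of what happens behind the scenes, you can run HTML::HeadParser yourself on the sample document above and read the resulting headers. This uses the parser directly rather than going through a user agent.

```perl
use strict;
use warnings;
use HTML::HeadParser;

my $html = <<'END_HTML';
<html>
<head><title>Kiki's Pie Page</title>
 <base href="http://cakecity.int/">
 <meta name="Notes" content="I like pie!">
 <meta http-equiv="Description" content="PIE RECIPES FROM KIKI">
</head>
END_HTML

my $parser = HTML::HeadParser->new;
$parser->parse($html);

# Each head element maps to a header name.
for my $name (qw(Title Content-Base Description X-Meta-Notes)) {
    my $value = $parser->header($name);
    print "$name: ", (defined $value ? $value : '(not set)'), "\n";
}
```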