Every kind of browser sends its own set of HTTP headers, well beyond
the very minimal headers that LWP::UserAgent typically sends. For example,
whereas an LWP::UserAgent browser by default sends this header line:
User-Agent: libwww-perl/5.5394
Netscape 4.76 sends a header line like this:
User-Agent: Mozilla/4.76 [en] (Win98; U)
It also sends these header fields, which an LWP::UserAgent browser
normally doesn't send at all:
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
Accept-Charset: iso-8859-1,*,utf-8
Accept-Encoding: gzip
Accept-Language: en-US
(That's assuming you've set your language preferences to U.S. English.)
That's on top of any Connection: keep-alive header that may be sent
if the browser or any intervening firewall supports HTTP's keep-alive
feature.
Opera 5.12 is not much different:
User-Agent: Opera/5.12 (Windows 98; U) [en]
Accept: text/html, image/png, image/jpeg, image/gif, image/x-xbitmap, */*
Accept-Language: en
Accept-Encoding: deflate, gzip, x-gzip, identity, *;q=0
But a recent version of Netscape gets rather more verbose:
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-US;
rv:0.9.4) Gecko/20011126 Netscape6/6.2.1
Accept: text/xml, application/xml, application/xhtml+xml, text/html;q=0.9,
image/png, image/jpeg, image/gif;q=0.2, text/plain;q=0.8,
text/css, */*;q=0.1
Accept-Charset: ISO-8859-1, utf-8;q=0.66, *;q=0.66
Accept-Encoding: gzip, deflate, compress;q=0.9
Accept-Language: en-us
Internet Explorer 5.12, in true Microsoft fashion, emits a few
nonstandard headers:
Accept: */*
Accept-Language: en
Extension: Security/Remote-Passphrase
UA-CPU: PPC
UA-OS: MacOS
User-Agent: Mozilla/4.0 (compatible; MSIE 5.12; Mac_PowerPC)
Lynx can be verbose in reporting which MIME types my system's
/etc/mailcap tells it that it can handle:
Accept: text/html, text/plain, audio/mod, image/*, video/*, video/mpeg,
application/pgp, application/pgp, application/pdf, message/partial,
message/external-body, application/postscript, x-be2,
application/andrew-inset, text/richtext, text/enriched
Accept: x-sun-attachment, audio-file, postscript-file, default,
mail-file, sun-deskset-message, application/x-metamail-patch,
text/sgml, */*;q=0.01
Accept-Encoding: gzip, compress
Accept-Language: en, es
User-Agent: Lynx/2.8.3dev.18 libwww-FM/2.14
This information can come in handy when trying to make your LWP
program seem as much like a well-known interactive browser as
possible.
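To compare your own program against the lists above, it can help to see what headers your browser object would send before it sends anything. What follows is only a minimal sketch, assuming an LWP::UserAgent recent enough to provide the prepare_request method (newer than the version whose User-Agent line is shown above); the example.com URL is a placeholder, and nothing is actually fetched:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Placeholder URL (an assumption for illustration); no request is sent.
my $url = 'http://www.example.com/';

my $browser = LWP::UserAgent->new;
my $request = HTTP::Request->new(GET => $url);

# Copy the browser object's default headers onto the request,
# which LWP would otherwise do itself just before sending it.
$browser->prepare_request($request);

# Show the header lines this request would carry.
print $request->headers->as_string;
```

On an unmodified browser object, this should show little more than the libwww-perl User-Agent line. Headers that the protocol layer adds at send time, such as Host, won't appear here.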
11.2.1. Pretending to Be Netscape
For example, suppose you're looking at
http://www.expreszo.nl/home.php
and you see that it has interesting headlines. You'd
like to write a headline detector for this site to go with the other
headline detectors we've been producing throughout
the book. You look at the source in Netscape and see that each
headline link looks like this:
<A class=pink href="headlines.php?id=749">...text...</A>
So you write something quite simple to capture those links:
use strict;
use warnings;
use LWP;
use URI;  # for URI->new_abs
my $browser = LWP::UserAgent->new;
my $url = 'http://www.expreszo.nl/home.php';
my $response = $browser->get($url);
die "Can't get $url: ", $response->status_line
  unless $response->is_success;
# Scan the page source for headline links, reporting each URL only once
$_ = $response->content;
my %seen;
while( m{href="(headlines\.php[^"]+)">(.*?)</A>}sg ) {
  my $this = URI->new_abs($1, $response->base);
  print "$this\n $2\n" unless $seen{$this}++;
}
print "NO HEADLINES?! Source:\n", $response->content unless keys %seen;
And you run it, and it quite stubbornly says:
NO HEADLINES?! Source:
<html><body>
...
Je hebt minimaal Microsoft Internet Explorer versie 4 of hoger, of
Netscape Navigator versie 4 of hoger nodig om deze site te bekijken.
...
</body></html>
That is, "you need MSIE 4 or higher, or Netscape 4
or higher, to view this site." The server seems to be
checking the User-Agent string of whatever browser
visits the site and throwing a fit unless it's MSIE
or Netscape! A Netscape User-Agent string is easily simulated by adding
this line right after $browser is created:
$browser->agent('Mozilla/4.76 [en] (Win98; U)');
With that one small change, the server sends the same page you saw in
Netscape, and the headline extractor happily sees the headlines, and
everything works:
http://www.expreszo.nl/headlines.php?id=752
Meer syfilis en HIV bij homo's
http://www.expreszo.nl/headlines.php?id=751
Imam hangt geldboete van
1200 boven het hoofd
http://www.expreszo.nl/headlines.php?id=740
SGP wil homohuwelijk terugdraaien
http://www.expreszo.nl/headlines.php?id=750
Gays en moslims worden vaak gediscrimineerd
http://www.expreszo.nl/headlines.php?id=749
Elton's gaydar rinkelt bij bruidegom Minnelli
http://www.expreszo.nl/headlines.php?id=746
Lekkertje Drew Barrymore liever met een vrouw?
This approach works fine when the web site is looking only at the
User-Agent line, which you can most easily control
with $browser->agent(...). If you were dealing
with some other site that insisted on seeing even more Netscape-like
headers, that could be done, too:
my @netscape_like_headers = (
  'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
  'Accept-Language' => 'en-US',
  'Accept-Charset' => 'iso-8859-1,*,utf-8',
  'Accept-Encoding' => 'gzip',
  'Accept' =>
    "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*",
);
# Header name/value pairs given after the URL are sent with this one request
my $response = $browser->get($url, @netscape_like_headers);
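If every request in a program should carry those headers, one possible alternative is to install them once on the browser object instead of passing them to each get call. This is only a sketch, assuming an LWP::UserAgent recent enough to provide default_header and prepare_request. The %netscape_like_defaults name and the example.com URL are made up for illustration, and no request is actually sent:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $browser = LWP::UserAgent->new;
$browser->agent('Mozilla/4.76 [en] (Win98; U)');

# Register the other Netscape-like headers as defaults on the browser
# object, so that they accompany every request it makes.
my %netscape_like_defaults = (
  'Accept-Language' => 'en-US',
  'Accept-Charset'  => 'iso-8859-1,*,utf-8',
  'Accept-Encoding' => 'gzip',
  'Accept'          =>
    'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
);
$browser->default_header($_ => $netscape_like_defaults{$_})
  for keys %netscape_like_defaults;

# Build (but don't send) a request to a placeholder URL, then show
# the headers it would carry.
my $request = HTTP::Request->new(GET => 'http://www.example.com/');
$browser->prepare_request($request);
print $request->headers->as_string;
```

One caution with either approach: sending Accept-Encoding: gzip invites the server to compress its reply, in which case $response->content returns the still-compressed bytes. In LWP versions that provide it, $response->decoded_content is probably the safer way to read the page before matching against it.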