11.2. Adding Extra Request Header Lines

Here's some simplistic debugging advice: if your browser sees one thing at a given URL, but your LWP program sees another, first try just turning on cookie support, with an empty cookie jar. If that fails, have it read in your browser's cookie file.[4] And if that fails, it's time to start wondering how the remote site is distinguishing your LWP program's requests from your browser's requests.
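The first two steps of that advice can be sketched as follows. This is only an illustration: the helper name use_browser_cookies is made up for this example, the cookie file is assumed to be in Netscape's cookies.txt format, and its path is taken from the command line because where a browser keeps that file varies by browser and platform.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies::Netscape;

my $browser = LWP::UserAgent->new;

# First attempt: turn on cookie support with an empty, in-memory cookie jar.
$browser->cookie_jar({});

# Second attempt, only if the first made no difference: read the browser's
# cookie file instead. (use_browser_cookies is a hypothetical helper.)
sub use_browser_cookies {
    my ($ua, $cookie_file) = @_;
    return 0 unless defined $cookie_file && -r $cookie_file;
    # autosave is left off, so the browser's own file is read, never rewritten.
    $ua->cookie_jar( HTTP::Cookies::Netscape->new( file => $cookie_file ) );
    return 1;
}

# Hypothetical usage: the cookie file's path is the first command-line argument.
use_browser_cookies($browser, $ARGV[0])
    or print "Using an empty cookie jar\n";
```

If neither jar changes what the server sends back, cookies are probably not what the site is looking at, and the request headers discussed next are the likelier suspect.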
Every kind of browser sends different HTTP headers besides the very minimal headers that LWP::UserAgent typically sends. For example, whereas an LWP::UserAgent browser by default sends this header line:

    User-Agent: libwww-perl/5.5394

Netscape 4.76 sends a header line like this:

    User-Agent: Mozilla/4.76 [en] (Win98; U)

It also sends these header fields, which an LWP::UserAgent browser doesn't normally send at all:

    Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
    Accept-Charset: iso-8859-1,*,utf-8
    Accept-Encoding: gzip
    Accept-Language: en-US

(That's assuming you've set your language preferences to U.S. English.) That's on top of any Connection: keep-alive headers that may be sent, if the browser or any intervening firewall supports that feature (keep-alive) of HTTP.

Opera 5.12 is not much different:

    User-Agent: Opera/5.12 (Windows 98; U) [en]
    Accept: text/html, image/png, image/jpeg, image/gif, image/x-xbitmap, */*
    Accept-Language: en
    Accept-Encoding: deflate, gzip, x-gzip, identity, *;q=0

But a recent version of Netscape gets rather more verbose:

    User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-US; rv:0.9.4) Gecko/20011126 Netscape6/6.2.1
    Accept: text/xml, application/xml, application/xhtml+xml, text/html;q=0.9, image/png, image/jpeg, image/gif;q=0.2, text/plain;q=0.8, text/css, */*;q=0.1
    Accept-Charset: ISO-8859-1, utf-8;q=0.66, *;q=0.66
    Accept-Encoding: gzip, deflate, compress;q=0.9
    Accept-Language: en-us

Internet Explorer 5.12, in true Microsoft fashion, emits a few nonstandard headers:

    Accept: */*
    Accept-Language: en
    Extension: Security/Remote-Passphrase
    UA-CPU: PPC
    UA-OS: MacOS
    User-Agent: Mozilla/4.0 (compatible; MSIE 5.12; Mac_PowerPC)

Lynx can be verbose in reporting what MIME types my system's /etc/mailcap tells it that it can handle:

    Accept: text/html, text/plain, audio/mod, image/*, video/*, video/mpeg, application/pgp, application/pgp, application/pdf, message/partial, message/external-body, application/postscript, x-be2, application/andrew-inset, text/richtext, text/enriched
    Accept: x-sun-attachment, audio-file, postscript-file, default, mail-file, sun-deskset-message, application/x-metamail-patch, text/sgml, */*;q=0.01
    Accept-Encoding: gzip, compress
    Accept-Language: en, es
    User-Agent: Lynx/2.8.3dev.18 libwww-FM/2.14

This information can come in handy when trying to make your LWP program seem as much like a well-known interactive browser as possible.

11.2.1. Pretending to Be Netscape

For example, suppose you're looking at http://www.expreszo.nl/home.php and you see that it has interesting headlines. You'd like to write a headline detector for this site to go with the other headline detectors we've been producing throughout the book. You look at the source in Netscape and see that each headline link looks like this:

    <A class=pink href="headlines.php?id=749">...text...</A>

So you write something quite simple to capture those links:

    use strict;
    use warnings;
    use LWP;
    use URI;    # for URI->new_abs below

    my $browser = LWP::UserAgent->new;
    my $url = 'http://www.expreszo.nl/home.php';
    my $response = $browser->get($url);
    die "Can't get $url: ", $response->status_line
      unless $response->is_success;

    $_ = $response->content;
    my %seen;
    # Print each headline link's absolute URL and text, once per URL
    while( m{href="(headlines\.php[^"]+)">(.*?)</A>}sg ) {
      my $this = URI->new_abs($1,$response->base);
      print "$this\n  $2\n" unless $seen{$this}++;
    }
    print "NO HEADLINES?!  Source:\n", $response->content unless keys %seen;

And you run it, and it quite stubbornly says:

    NO HEADLINES?!  Source:
    <html><body>
    ...
    Je hebt minimaal Microsoft Internet Explorer versie 4 of hoger, of
    Netscape Navigator versie 4 of hoger nodig om deze site te bekijken.
    ...
    </body></html>

That is, "you need MSIE 4 or higher, or Netscape 4 or higher, to view this site." It seems to be checking the User-Agent string of whatever browser visits the site and throwing a fit unless it's MSIE or Netscape!
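One way to check that suspicion is to print the identification string your program sends. A minimal sketch, assuming a stock LWP::UserAgent object; the version number printed depends on the LWP you have installed, and the pattern in the last statement is only a guess at the kind of test such a server might apply:

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# agent() with no arguments returns the User-Agent string that will be sent.
# By default it is "libwww-perl/" followed by the installed LWP version.
my $agent = $browser->agent;
print "This program identifies itself as: $agent\n";

# A guess at the server's test: does the string mention MSIE or Mozilla?
print "Looks like MSIE or Netscape? ",
      ( $agent =~ /MSIE|Mozilla/ ? "yes" : "no" ), "\n";
```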
A Netscape User-Agent string is easily simulated by adding this line right after $browser is created:

    $browser->agent('Mozilla/4.76 [en] (Win98; U)');

With that one small change, the server sends the same page you saw in Netscape, and the headline extractor happily sees the headlines, and everything works:

    http://www.expreszo.nl/headlines.php?id=752
      Meer syfilis en HIV bij homo's
    http://www.expreszo.nl/headlines.php?id=751
      Imam hangt geldboete van 1200 boven het hoofd
    http://www.expreszo.nl/headlines.php?id=740
      SGP wil homohuwelijk terugdraaien
    http://www.expreszo.nl/headlines.php?id=750
      Gays en moslims worden vaak gediscrimineerd
    http://www.expreszo.nl/headlines.php?id=749
      Elton's gaydar rinkelt bij bruidegom Minnelli
    http://www.expreszo.nl/headlines.php?id=746
      Lekkertje Drew Barrymore liever met een vrouw?

This approach works fine when the web site is looking only at the User-Agent line, which you can most easily control with $browser->agent(...). If you were dealing with some other site that insisted on seeing even more Netscape-like headers, that could be done, too:

    my @netscape_like_headers = (
      'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
      'Accept-Language' => 'en-US',
      'Accept-Charset' => 'iso-8859-1,*,utf-8',
      'Accept-Encoding' => 'gzip',
      'Accept' =>
       "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*",
    );
    my $response = $browser->get($url, @netscape_like_headers);

11.2.2. Referer

For some sites, that's not enough: they want to see that your Referer header value is something they consider appropriate. A Referer header line signals the URL of a page that either linked to the item you're requesting (as with <a href="url">) or inlines that image item (as with <img src="url">).

For example, I am a big fan of the comic strip Dennis The Menace. I find it to be the truest realization of deep satire, and I admire how its quality has kept up over the past 50 years, quite undeterred by the retirement and eventual death of its auteur, the comic genius Hank Ketcham.
And nothing brightens my day more than laughing over the day's Dennis The Menace strip and hardcopying a really good one now and then, so I can pin it up on my office door to amuse my colleagues and to encourage them to visit the DTM web site. However, the people running the server for the strip's image files don't want the images inlined on pages that aren't authorized to do so, so they check the Referer line. Unfortunately, they have forgotten to allow for requests with no Referer line at all, which is what happens when I try to hardcopy the day's image file using my browser. But LWP comes to the rescue:

    my $response = $browser->get(
      # The URL of the image:
      'http://pst.rbma.com/content/Dennis_The_Menace',
      'Referer' =>  # The URL where I see the strip:
        'http://www.sfgate.com/cgi-bin/article.cgi?file=/comics/Dennis_The_Menace.dtl',
    );
    # Don't save an error page as if it were the image
    die "Can't get the image: ", $response->status_line
      unless $response->is_success;
    open(my $out, '>', 'today_dennis.gif') or die $!;
    binmode($out);
    print $out $response->content;
    close($out);

By giving a Referer value that passes the image server's test for a good URL, I get to make a local copy of the image, which I can then print out and put on my office door.

Copyright © 2002 O'Reilly & Associates. All rights reserved.