Because Perl supports Berkeley sockets, all kinds of networking tasks
can be automated with it. Below are some common idioms that show
what's possible with Perl and a little elbow grease.
41.13.1. Be Your Own Web Browser with LWP
The suite of classes that handles all aspects of HTTP is
collectively known as LWP (short for the libwww-perl library). If your Perl
installation doesn't currently have LWP, you can
easily install it with the CPAN
module (Section 41.11) like this:
# perl -MCPAN -e 'install Bundle::LWP'
If you also install an X widget library such as Tk, you can create
a graphical web browser in Perl (an example of this comes with the Perl
Tk library). However, you don't need all of that if
you simply want to grab a file from a web server:
use LWP::Simple;
my $url = "http://slashdot.org/slashdot.rdf";
getstore($url, "s.rdf");
This example grabs the Rich Site Summary (RSS) file from the popular
tech news portal, Slashdot, and saves it to a local file called
s.rdf. In fact, you don't even
need to bother with a full-fledged script:
$ perl -MLWP::Simple -e 'getstore("http://slashdot.org/slashdot.rdf", "s.rdf")'
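Neither of those examples checks whether the fetch actually worked.
getstore returns the HTTP status code of the request, and LWP::Simple
also exports an is_success function, so a slightly more careful
version (a minimal sketch) looks like this:

use LWP::Simple;

# getstore returns the HTTP status code of the request;
# is_success (also exported by LWP::Simple) reports whether
# that code indicates success.
my $status = getstore("http://slashdot.org/slashdot.rdf", "s.rdf");
print "Fetch failed with status $status\n" unless is_success($status);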
Sometimes you want to process a web page to extract information from
it. Here, the title of the page at the URL given on the command
line is extracted and reported:
use LWP::Simple;
use HTML::TokeParser;

my $url = $ARGV[0] || 'http://www.oreilly.com';
my $content = get($url);
die "Can't fetch page: halting\n" unless defined $content;

my $parser = HTML::TokeParser->new(\$content);
$parser->get_tag("title");
my $title = $parser->get_token;
print $title->[1], "\n" if $title;
After bringing in the library that fetches the web page (LWP::Simple) and
the one that parses HTML (HTML::TokeParser), the command line is
inspected for a user-supplied URL. If one isn't
there, a default URL is used. The get function,
exported by LWP::Simple, attempts to fetch the URL. If
it succeeds, the whole page is kept in memory in the scalar
$content. If the fetch fails,
$content will be undefined, and the script halts. If
there's something to parse, a reference to the
content is passed into the HTML::TokeParser constructor.
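If you need to know why a fetch failed, LWP::Simple is too terse: its
get function returns undef with no explanation. The full
LWP::UserAgent class returns an HTTP::Response object that you can
interrogate. Here's a minimal sketch of the same fetch with error
reporting (the ten-second timeout is just an illustrative choice):

use LWP::UserAgent;

my $ua = LWP::UserAgent->new(timeout => 10);
my $response = $ua->get($ARGV[0] || 'http://www.oreilly.com');

# The response object carries a full status line (e.g., "404 Not
# Found"), so failures can be reported rather than silently ignored.
die "Can't fetch page: ", $response->status_line, "\n"
    unless $response->is_success;
my $content = $response->content;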
HTML::TokeParser deconstructs a page into individual HTML elements.
Although this isn't the way most people think of
HTML, it does make it easier for both computers and programmers to
process web pages. Since nearly every web page has only one
<title> tag, the parser is instructed to
ignore all tokens until it finds the opening
<title> tag. The title itself is a
text token, so fetching it requires reading one more token.
The get_token method returns an array reference whose
size varies with the kind of token returned (see the
HTML::TokeParser manpage for details). In this case, the desired
element is the second one.
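These same two methods can pull out more than titles. As a sketch of
the general pattern, the following loop calls get_tag repeatedly to
collect every link on a page; the second element of the array
reference that get_tag returns is a hash of the tag's attributes:

use LWP::Simple;
use HTML::TokeParser;

my $content = get($ARGV[0] || 'http://www.oreilly.com');
die "Can't fetch page: halting\n" unless defined $content;

my $parser = HTML::TokeParser->new(\$content);

# For a start tag, get_tag returns
# [$tag, \%attributes, \@attribute_order, $original_text],
# so $tag->[1]{href} is the link target of each <a> tag.
while (my $tag = $parser->get_tag("a")) {
    print "$tag->[1]{href}\n" if defined $tag->[1]{href};
}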
One important word of caution: these scripts are very simple web
crawlers, and if you plan to grab a lot of pages from a web
server you don't own, you should do more research
into how to build polite web robots. See
O'Reilly's Perl & LWP.
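One concrete step toward politeness: the LWP bundle includes
LWP::RobotUA, a drop-in replacement for LWP::UserAgent that honors
each site's robots.txt file and pauses between requests to the same
host. A minimal sketch (the robot name and email address are
placeholders you'd replace with your own):

use LWP::RobotUA;

# LWP::RobotUA checks robots.txt before each request and enforces a
# per-host delay; the name and address identify the robot to server
# administrators.
my $ua = LWP::RobotUA->new('my-crawler/0.1', 'me@example.com');
$ua->delay(1);    # wait at least one minute between requests to a host

my $response = $ua->get('http://www.oreilly.com/');
print $response->status_line, "\n";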