
19.10 Perl and the Web: Beyond CGI Programming

Perl is used for much more than CGI programming. Other uses include logfile analysis, cookie and password management, clickable images, and image manipulation.[15] And that's still just the tip of the iceberg.

[15] See the GD.pm module on CPAN for a Perl interface to Thomas Boutell's gd graphics library.

19.10.1 Custom Publishing Systems

Commercial web publishing systems may make easy things easy, especially for nonprogrammers, but they just aren't infinitely flexible the way a real programming language is. Without source code, you're locked into someone else's design decisions: if something doesn't work quite the way you want it to, you can't fix it. No matter how many whiz-bang programs become available for the consumer to purchase, a programmer will always be needed for those special jobs that don't quite fit the mold. And of course someone has to write the publishing software in the first place.

Perl is great for creating custom publishing systems tailored to your unique needs. It's easy to convert raw data into zillions of HTML pages en masse. Sites all over the Web use Perl to generate and maintain their entire web sites. The Perl Journal (www.tpj.com) uses Perl to generate all its pages. The Perl Language Home Page (www.perl.com) has nearly 10,000 web pages, all automatically maintained and updated by various Perl programs.
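As a rough illustration of converting raw data into HTML pages en masse, here is a minimal sketch that uses only core Perl. The record format (slug|title|body), the sample data, and the names escape_html and render_page are all hypothetical, invented for this example; a real publishing system would read its data from files or a database and use a proper template.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical raw data: one "slug|title|body" record per entry.
my @records = (
    "camels|All About Camels|Camels have one or two humps.",
    "llamas|Llamas & Friends|Llamas are camelids <without> humps.",
);

# Escape the characters that are special in HTML.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;      # must come first
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Wrap an escaped title and body in a very plain page layout.
sub render_page {
    my ($title, $body) = map { escape_html($_) } @_;
    return "<html>\n<head><title>$title</title></head>\n"
         . "<body>\n<h1>$title</h1>\n<p>$body</p>\n</body>\n</html>\n";
}

# Build every page in memory, keyed by output filename.
my %pages;
foreach my $record (@records) {
    my ($slug, $title, $body) = split /\|/, $record, 3;
    $pages{"$slug.html"} = render_page($title, $body);
}

# If a directory was named on the command line, write the pages there;
# either way, report each page and its size.
my $outdir = shift @ARGV;
foreach my $file (sort keys %pages) {
    if (defined $outdir) {
        open(my $fh, '>', "$outdir/$file")
            or die "$0: Can't write $outdir/$file: $!\n";
        print $fh $pages{$file};
        close($fh) or die "$0: Can't close $outdir/$file: $!\n";
    }
    print "$file: ", length($pages{$file}), " bytes\n";
}
```

Run with no arguments, the sketch only lists the pages it would produce; given a directory name, it writes one file per record into that directory.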

19.10.2 Embedded Perl

The fastest, cheapest (it's hard to get any cheaper than free), and most popular web server on the Net, Apache, can run with Perl embedded inside it using the mod_perl module from CPAN. With mod_perl, Perl becomes the extension language for your web server. You can write little Perl snippets to handle authorization requests, error handling, logging, and anything else you can think of. These don't require a new process because Perl is now built into the web server. Even more appealing for many is that under Apache you don't have to fire off a whole new process each time a CGI request comes in. Instead, a new thread executes a precompiled Perl program. This speeds up your CGI programs significantly; typically it's the fork/exec overhead that slows you down, not the size of the program itself.

Another strategy for speeding up CGI execution is through the standard CGI::Fast module. Unlike the embedded Perl interpreter described above, this approach doesn't require the Apache web server. See the CGI::Fast module's manpage for more details about this.

If you're running a web server under Windows NT, you should definitely check out the ActiveWare site, www.activeware.com. Not only do they have prebuilt binaries of Perl for Windows platforms,[16] they also provide PerlScript and PerlIS. PerlScript is an ActiveX scripting engine that lets you embed Perl code in your web pages as you would with JavaScript or VBScript. PerlIS is an ISAPI DLL that runs Perl scripts directly from IIS and other ISAPI-compliant web servers, providing significant performance benefits.

[16] As of release 5.004, the standard distribution of Perl builds under Windows, assuming you have a C compiler, that is.

19.10.3 Web Automation with LWP

Have you ever wanted to check a web document for dead links, find its title, or figure out which of its links have been updated since last Thursday? Or wanted to download the images contained within a document or mirror an entire directory full of documents? What happens if you have to go through a proxy server or server redirects?

Now, you could do these things by hand using your browser. But because graphical interfaces are woefully inadequate for programmatic automation, this would be a slow and tedious process requiring more patience and less laziness[17] than most of us tend to possess.

[17] Remember that according to Larry Wall, the three principal virtues of a programmer are Laziness, Impatience, and Hubris.

The LWP ("Library for WWW access in Perl") modules from CPAN do all this for you and more. Fetching a document from the Web in a script is so easy using these modules that you can write it as a one-liner. For example, to get the /perl/index.html document from www.perl.com, just type this into your shell or command interpreter:

perl -MLWP::Simple -e "getprint 'http://www.perl.com/perl/index.html'"

Apart from the LWP::Simple module, most of the modules included in the LWP suite are strongly object-oriented. For example, here's a tiny program that takes URLs as arguments and produces their titles:

use LWP;
$browser = LWP::UserAgent->new(); # create virtual browser
$browser->agent("Mothra/126-Paladium"); # give it a name
foreach $url (@ARGV) { # expect URLs as args
    # make a GET request on the URL via fake browser
    $webdoc = $browser->request(HTTP::Request->new(GET => $url));
    if ($webdoc->is_success) { # found it
        print STDOUT "$url: ", $webdoc->title, "\n";
    } else { # something went wrong
        print STDERR "$0: Couldn't fetch $url\n";
    }
}

As you see, familiarity with Perl's objects is important. But just as with the CGI.pm module, the LWP modules hide most of the complexity.

This script works as follows: first create a user agent object, something like an automated, virtual browser. This object is used to make requests to remote servers. Give our virtual browser a silly name just to make people's logfiles more interesting. Then pull in the remote document by making an HTTP GET request to the remote server. If the result is successful, print out the URL and its title; otherwise, complain a bit.
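The same steps can be sketched with a few extra settings on the user agent. The version below is only an illustration: the 30-second timeout is an arbitrary assumption, the helper name describe is hypothetical, and env_proxy simply picks up whatever proxy settings (such as http_proxy) happen to be in your environment. It also reports the server's status line on failure instead of a generic complaint.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $browser = LWP::UserAgent->new;        # create virtual browser
$browser->agent("Mothra/126-Paladium");   # give it a name
$browser->timeout(30);                    # assumption: give up after 30 seconds
$browser->env_proxy;                      # use proxy settings from the environment

# Hypothetical helper: fetch one URL and return a one-line description.
sub describe {
    my ($url) = @_;
    my $webdoc = $browser->request(HTTP::Request->new(GET => $url));
    if ($webdoc->is_success) {
        my $title = $webdoc->title;
        return "$url: " . (defined $title ? $title : "(no title)");
    }
    return "$url: " . $webdoc->status_line;   # e.g. the HTTP error code and message
}

print describe($_), "\n" foreach @ARGV;    # expect URLs as args
```

Redirects need no special code here, because the request method follows them for you.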

Here's a program that prints out a sorted list of unique links and images contained in URLs passed as command-line arguments:

#!/usr/bin/perl -w
use strict;
use LWP 5.000;
use URI::URL;
use HTML::LinkExtor;

my($url, $browser, %saw);
$browser = LWP::UserAgent->new(); # make fake browser
foreach $url ( @ARGV ) {
    # fetch the document via fake browser
    my $webdoc = $browser->request(HTTP::Request->new(GET => $url));
    next unless $webdoc->is_success;
    next unless $webdoc->content_type eq 'text/html'; # can't parse gifs

    my $base = $webdoc->base;

    # now extract all links of type <A ...> and <IMG ...>
    foreach (HTML::LinkExtor->new->parse($webdoc->content)->eof->links) {
        my($tag, %links) = @$_;
        next unless $tag eq "a" or $tag eq "img";
        my $link;
        foreach $link (values %links) {
            $saw{ url($link,$base)->abs->as_string }++;
        }
    }
}
print join("\n", sort keys %saw), "\n";

This looks pretty complicated, but most of the complexity lies in understanding how the various objects and their methods work. We aren't going to explain all these here, because this book is long enough already. Fortunately, LWP comes with extensive documentation and examples.
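To see the link-extraction step in isolation, the following sketch runs HTML::LinkExtor over an in-memory document, so no network access is needed. The base URL and the HTML fragment are made up for the example, and it resolves relative links with URI->new_abs rather than the url() function used above; treat it as one possible way to experiment with the parser, not as a replacement for the program shown earlier.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;
use URI;

# Hypothetical base URL and document, standing in for a fetched page.
my $base = 'http://www.example.com/docs/index.html';
my $html = <<'END_HTML';
<html><body>
<a href="intro.html">Intro</a>
<a href="/about.html">About</a>
<img src="images/logo.gif">
<a href="intro.html">Intro again</a>
<link rel="stylesheet" href="style.css">
</body></html>
END_HTML

# Feed the whole document to the parser, then signal end of input.
my $parser = HTML::LinkExtor->new;
$parser->parse($html);
$parser->eof;

# Each element of links() is an array ref: [ $tag, attr => value, ... ].
my %saw;
foreach my $found ($parser->links) {
    my ($tag, %attrs) = @$found;
    next unless $tag eq 'a' or $tag eq 'img';   # skip <link> and friends
    foreach my $link (values %attrs) {
        # Resolve the link against the base; duplicates collapse in the hash.
        $saw{ URI->new_abs($link, $base)->as_string }++;
    }
}

my @links = sort keys %saw;
print join("\n", @links), "\n";
```

Because the results are stored as hash keys, the repeated intro.html link appears only once, and the stylesheet reference is filtered out by the tag test.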