Writing Apache Modules with Perl and C
By:   Lincoln Stein and Doug MacEachern
Published:   O'Reilly & Associates, Inc.  - March 1999

Copyright © 1999 by O'Reilly & Associates, Inc.


 



Chapter 7 - Other Request Phases
The URI Translation Phase

In this section...

Introduction
A Very Simple Translation Handler
A Practical Translation Handler
Using a Translation Handler to Change the URI
Installing a Custom Response Handler in the URI Translation Phase

Introduction


One of the web's virtues is its Uniform Resource Identifier (URI) and Uniform Resource Locator (URL) standards. End users never know for sure what is sitting behind a URI. It could be a static file, a dynamic script, a proxied request, or something even more esoteric. The file or program behind a URI may change over time, but this too is transparent to the end user.

Much of Apache's power and flexibility comes from its highly configurable URI translation phase, which comes relatively early in the request cycle, after the post_read_request and before the header_parser phases. During this phase, the URI requested by the remote browser is translated into a physical filename, which may in turn be returned directly to the browser as a static document or passed on to a CGI script or Apache API module for processing. During URI translation, each module that has declared its interest in handling this phase is given a chance to modify the URI. The first module to handle the phase (i.e., return something other than a status of DECLINED) terminates the phase. This prevents several URI translators from interfering with one another by trying to map the same URI onto several different file paths.
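The "first handler wins" rule can be pictured outside Apache with a few lines of plain Perl. This is only a sketch under stated assumptions: run_translators(), alias_like(), and default_like() are hypothetical names invented for illustration, the request is a bare hash rather than a real request object, and the paths are made up. The constants are defined locally so that the script runs without mod_perl.

```perl
use strict;
use warnings;

# Local stand-ins for the Apache status codes, so the sketch runs
# without mod_perl.
use constant { OK => 0, DECLINED => -1 };

# Call each translator in turn. The first one that returns anything
# other than DECLINED ends the phase.
sub run_translators {
    my ($req, @translators) = @_;
    for my $translator (@translators) {
        my $status = $translator->($req);
        return $status unless $status == DECLINED;
    }
    return DECLINED;
}

# Hypothetical translator that only claims URIs under /icons/.
sub alias_like {
    my $req = shift;
    return DECLINED unless $req->{uri} =~ m{^/icons/(.*)$};
    $req->{filename} = "/usr/local/apache/icons/$1";
    return OK;
}

# Hypothetical fallback that appends the URI to a document root.
sub default_like {
    my $req = shift;
    $req->{filename} = "/home/httpd/htdocs$req->{uri}";
    return OK;
}

for my $uri ('/icons/back.gif', '/index.html') {
    my $req = { uri => $uri };
    run_translators($req, \&alias_like, \&default_like);
    print "$uri -> $req->{filename}\n";
}
```

The first URI is claimed by alias_like(), so default_like() is never called for it. The second is declined by alias_like() and falls through to the fallback.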

By default, two URI translation handlers are installed in stock Apache distributions. The mod_alias module looks for any of several directives that may apply to the current URI. These include Alias, ScriptAlias, Redirect, AliasMatch, and other directives. If it finds one, it uses the directive's value to map the URI to a file or directory somewhere on the server's physical filesystem. Otherwise, the request falls through to the default URI translation handler, which simply appends the URI to the value of the DocumentRoot configuration directive, forming a file path relative to the document root.

The optional mod_rewrite module implements a much more comprehensive URI translator that allows you to slice and dice URIs in various interesting ways. It is extremely powerful but uses a series of pattern matching conditions and substitution rules that can be difficult to get right.

Once a translation handler has done its work, Apache walks along the returned filename path in the manner described in Chapter 4, Content Handlers, finding where the path part of the URI ends and the additional path information begins. This phase of processing is performed internally and cannot be modified by the module API.

In addition to their intended role in transforming URIs, translation handlers are sometimes used to associate certain types of URIs with specific upstream handlers. We'll see examples of this later in the chapter when we discuss creating custom proxy services in the section "Handling Proxy Requests."

A Very Simple Translation Handler


Let's look at an example. Many of the documents browsed on a web site are files that are located under the configured DocumentRoot. That is, the requested URI is a filename relative to a directory on the hard disk. Just so you can see how simple a translation handler's job can be, we present a Perl version of Apache's default translation handler found in the http_core module.

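package Apache::DefaultTrans;
use strict;
use Apache::Constants qw(:common BAD_REQUEST);
use Apache::Log ();
sub handler {
   my $r = shift;
   my $uri = $r->uri;
   # Reject URIs that lack a leading slash or contain a "*".
   # index() returns -1 when "*" is absent, so test for >= 0.
   if ($uri !~ m:^/: or index($uri, '*') >= 0) {
      $r->log->error("Invalid URI in request ", $r->the_request);
      return BAD_REQUEST;
   }
   # Map the URI onto a path beneath the document root.
   $r->filename($r->document_root . $uri);
   return OK;
}
1;
__END__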

The handler begins by subjecting the requested URI to a few sanity checks, making sure that it begins with a slash and doesn't contain any * characters. If the URI fails these tests, we log an error message and return BAD_REQUEST. Otherwise, all is well and we join together the value of the DocumentRoot directive (retrieved by calling the request object's document_root() method) and the URI to create the complete file path. The file path is now written into the request object by passing it to the filename() method.
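The two sanity checks are easy to try outside Apache. The following is a minimal sketch, assuming a hypothetical helper named uri_looks_sane() that is not part of the handler itself. One detail deserves care: Perl's index() returns -1, a true value, when the substring is absent, so the result has to be compared explicitly rather than used as a boolean.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the URI begins with "/" and has no "*".
sub uri_looks_sane {
    my $uri = shift;
    return 0 unless $uri =~ m{^/};
    # index() returns -1 when "*" is absent, so compare explicitly.
    return 0 if index($uri, '*') >= 0;
    return 1;
}

for my $uri ('/index.html', 'index.html', '/docs/*.html') {
    printf "%-14s %s\n", $uri, uri_looks_sane($uri) ? 'ok' : 'rejected';
}
```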

We don't check at this point whether the file exists or can be opened. This is the job of handlers further down the request chain.

To install this handler, just add the following directive to the main part of your perl.conf configuration file (or any other Apache configuration file, if you prefer):

PerlTransHandler Apache::DefaultTrans

Beware. You probably won't want to keep this handler installed for long. Because it overrides other translation handlers, you'll lose the use of Alias, ScriptAlias, and other standard directives.

A Practical Translation Handler


Here's a slightly more complex example. Consider a web-based system for archiving software binaries and source code. On a nightly basis an automated system will copy changed and new files from a master repository to multiple mirror sites. Because of the vagaries of the Internet, it's important to confirm that the entire file, and not just a fragment of it, is copied from one mirror site to the other.

One technique for solving this problem would be to create an MD5 checksum for each file and store the information on the repository. After the mirror site copies the file, it checksums the file and compares it against the master checksum retrieved from the repository. If the two values match, then the integrity of the copied file is confirmed.

In this section, we'll begin a simple system to retrieve precomputed MD5 checksums from an archive of files. To retrieve the checksum for a file, you simply append the extension .cksm to the end of its URI. For example, if the archived file you wish to retrieve is:

/archive/software/cookie_cutter.tar.gz

then you can retrieve a text file containing its MD5 checksum by fetching this URI:

/archive/software/cookie_cutter.tar.gz.cksm

The checksum files will be precomputed and stored in a physical directory tree that parallels the document hierarchy. For example, if the document itself is physically stored in:

/home/httpd/htdocs/archive/software/cookie_cutter.tar.gz

then its checksum will be stored in a parallel tree in this file:

/home/httpd/checksums/archive/software/cookie_cutter.tar.gz

The job of the URI translation handler is to map requests for /file/path/filename.cksm files into the physical file /home/httpd/checksums/file/path/filename. When called from a browser, the results look something like the screenshot in Figure 7-1.

Figure 7-1. A checksum file retrieved by Apache::Checksum1

As often happens with Perl programs, the problem takes longer to state than to solve. Example 7-1 shows a translation handler, Apache::Checksum1, that accomplishes this task. The structure is similar to other Apache Perl modules. After the usual preamble, the handler() subroutine shifts the Apache request object off the call stack and uses it to recover the URI of the current request, which is stashed in the local variable $uri. The subroutine next looks for a configuration directive named ChecksumDir which defines the top of the tree where the checksums are to be found. If defined, handler() stores the value in a local variable named $cksumdir. Otherwise, it assumes a default value defined in DEFAULT_CHECKSUM_DIR.

Now the subroutine checks whether this URI needs special handling. It does this by attempting a string substitution which will replace the .cksm URI with a physical path to the corresponding file in the checksums directory tree. If the substitution returns a false value, then the requested URI does not end with the .cksm extension and we return DECLINED. This leaves the requested URI unchanged and allows Apache's other translation handlers to work on it. If, on the other hand, the substitution returns a true result, then $uri holds the correct physical pathname to the checksum file. We call the request object's filename() method to set the physical path returned to Apache and return OK. This tells Apache that the URI was successfully translated and prevents any other translation handlers from being called.

Example 7-1. A URI Translator for Checksum Files

package Apache::Checksum1;
# file: Apache/Checksum1.pm
use strict;
use Apache::Constants qw(:common);
use constant DEFAULT_CHECKSUM_DIR => '/usr/tmp/checksums';
sub handler {
   my $r = shift;
   my $uri = $r->uri;
   my $cksumdir = $r->dir_config('ChecksumDir') || DEFAULT_CHECKSUM_DIR;
   $cksumdir = $r->server_root_relative($cksumdir);
   return DECLINED unless $uri =~ s!^(.+)\.cksm$!$cksumdir$1!;
   $r->filename($uri);
   return OK;
}
1;
__END__
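To see what the substitution does without running a server, the mapping step can be lifted into a stand-alone function. This is a sketch only: checksum_path() is a hypothetical name, and it returns an undefined value in the case where the real handler would return DECLINED.

```perl
use strict;
use warnings;

# Hypothetical stand-alone version of the handler's mapping step.
# Returns the checksum file path, or nothing when the URI does not
# end in .cksm (the case in which the handler returns DECLINED).
sub checksum_path {
    my ($uri, $cksumdir) = @_;
    return unless $uri =~ s!^(.+)\.cksm$!$cksumdir$1!;
    return $uri;
}

my $dir = '/home/httpd/checksums';
for my $uri ('/archive/software/cookie_cutter.tar.gz.cksm',
             '/archive/software/cookie_cutter.tar.gz') {
    my $path = checksum_path($uri, $dir);
    print "$uri => ", (defined $path ? $path : 'DECLINED'), "\n";
}
```

The first URI maps to /home/httpd/checksums/archive/software/cookie_cutter.tar.gz, while the second is left for Apache's other translation handlers.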

The configuration for this translation handler should look something like this:

# checksum translation handler directives
PerlTransHandler  Apache::Checksum1
PerlSetVar        ChecksumDir /home/httpd/checksums
<Directory /home/httpd/checksums>
 ForceType text/plain
</Directory>

This configuration declares a URI translation handler with the PerlTransHandler directive and sets the Perl configuration variable ChecksumDir to /home/httpd/checksums, the top of the checksum tree. We also need a <Directory> section to force all files in the checksums directory to be of type text/plain. Otherwise, the default MIME type checker will try to use each checksum file's extension to determine its MIME type.

There are a couple of important points about this configuration section. First, the PerlTransHandler and PerlSetVar directives are located in the main section of the configuration file, not in a <Directory>, <Location>, or <Files> section. This is because the URI translation phase runs very early in the request processing cycle, before Apache has a definite URI or file path to use in selecting an appropriate <Directory>, <Location>, or <Files> section to take its configuration from. For the same reason, PerlTransHandler is not allowed in .htaccess files, although you can use it in virtual host sections.

The second point is that the ForceType directive is located in a <Directory> section rather than in a <Location> block. The reason for this is that the <Location> section refers to the requested URI, which is not changed by this particular translation handler. To apply access control rules and other options to the physical file path returned by the translation handler, you must use <Directory> or <Files>.

To set up the checksum tree, you'll have to write a script that will recurse through the web document hierarchy (or a portion of it) and create a mirror directory of checksum files. In case you're interested in implementing a system like this one, Example 7-2 gives a short script named checksum.pl that does this. It uses the File::Find module to walk the tree of source files, the MD5 module to generate MD5 checksums, and File::Path and File::Basename for filename manipulations. New checksum files are only created if the checksum file doesn't exist or the modification time of the source file is more recent than that of an existing checksum file.

You call the script like this:

% checksum.pl -source ~www/htdocs -dest ~www/checksums

Replace ~www/htdocs and ~www/checksums with the paths to the web document tree and the checksums directory on your system.

Example 7-2. checksum.pl Creates a Parallel Tree of Checksum Files

#!/usr/local/bin/perl
use File::Find;
use File::Path;
use File::Basename;
use IO::File;
use MD5;
use Getopt::Long;
use strict;
use vars qw($SOURCE $DESTINATION $MD5);
GetOptions('source=s'     =>  \$SOURCE,
         'destination=s' =>  \$DESTINATION)  || die <<USAGE;
Usage: $0
    Create a checksum tree.
Options:
   -source       <path>  File tree to traverse [.]
   -destination  <path>  Destination for checksum tree [TMPDIR]
Option names may be abbreviated.
USAGE
$SOURCE      ||= '.';
$DESTINATION ||= $ENV{TMPDIR} || '/tmp';
die "Must specify absolute destination directory" unless $DESTINATION=~m!^/!;
$MD5 = new MD5;
find(\&wanted,$SOURCE);
# This routine is called for each node (directory or file) in the
# source tree.  On entry, $_ contains the filename,
# and $File::Find::name contains its full path.
sub wanted {
   return unless -f $_ && -r _;
   my $modtime = (stat _)[9];
   my ($source,$dest,$url);
   $source = $File::Find::name;
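   # \Q...\E keeps characters such as "." in the source path from
   # being treated as regular expression metacharacters.
   ($dest = $source)=~s/^\Q$SOURCE\E/$DESTINATION/o;
   return if -e $dest && $modtime <= (stat $dest)[9];
   ($url = $source) =~s/^\Q$SOURCE\E//o;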
   make_checksum($_,$dest,$url);
}
# This routine is called with the source file, the destination in which
# to write the checksum, and a URL to attach as a comment to the checksum.
sub make_checksum {
   my ($source,$dest,$url) = @_;
   my $sfile = IO::File->new($source) || die "Couldn't open $source: $!\n";
   mkpath dirname($dest);  # create the intermediate directories
   my $dfile = IO::File->new(">$dest") || die "Couldn't open $dest: $!\n";
   $MD5->reset;
   $MD5->addfile($sfile);
   print $dfile $MD5->hexdigest(),"\t$url\n"; # write the checksum
}
__END__
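
On the mirror side, the comparison described at the start of this section might look something like the following. This is a sketch under several assumptions, not part of the system above: it uses the Digest::MD5 module rather than the older MD5 module used in Example 7-2, the helper names file_md5() and copy_is_intact() are hypothetical, and the checksum file is assumed to have already been fetched from the master and saved locally. It relies on the line format written by checksum.pl, a hex digest followed by a tab and the URL.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: compute the hex MD5 digest of a file on disk.
sub file_md5 {
    my $path = shift;
    open(my $fh, '<', $path) or die "Couldn't open $path: $!\n";
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Hypothetical helper: compare a local copy against the text of a
# checksum file, which has the form "<hexdigest>\t<url>\n".
sub copy_is_intact {
    my ($local_file, $checksum_text) = @_;
    $checksum_text = '' unless defined $checksum_text;
    my ($expected) = split /\t/, $checksum_text;
    return 0 unless defined $expected;
    return lc($expected) eq file_md5($local_file) ? 1 : 0;
}

# Usage: verify.pl <local file> <checksum file fetched from the master>
if (@ARGV == 2) {
    my ($local_file, $checksum_file) = @ARGV;
    open(my $fh, '<', $checksum_file)
        or die "Couldn't open $checksum_file: $!\n";
    my $text = <$fh>;
    close $fh;
    print copy_is_intact($local_file, $text) ? "intact\n" : "MISMATCH\n";
}
```

Only the first line of the checksum file is read, which matches the single line that checksum.pl writes for each source file.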
