Writing Apache Modules with Perl and C

By:	Lincoln Stein and Doug MacEachern
Published:	O'Reilly & Associates, Inc. - March 1999

Chapter 7 - Other Request Phases / Handling Proxy Requests
Handling the Proxy Process on Your Own

As long as you only need to monitor or modify the request half of a proxy transaction, you can use Apache's mod_proxy module directly as we did in the previous two examples. However, if you also want to intercept the response so as to modify the information returned from the remote server, then you'll need to handle the proxy request on your own.

In this section, we present Apache::AdBlocker. This module replaces Apache's mod_proxy with a specialized proxy that filters the content of certain URLs. Specifically, it looks for URLs that are likely to be banner advertisements and replaces their content with a transparent GIF image that says "Blocked Ad." This can be used to "lower the volume" of commercial sites by removing distracting animated GIFs and brightly colored banners. Figure 7-3 shows what the AltaVista search site looks like when fetched through the Apache::AdBlocker proxy.

Figure 7-3. The AltaVista search engine after filtering by Apache::AdBlocker

The code for Apache::AdBlocker is given in Example 7-12. It is a bit more complicated than the other modules we've worked with in this chapter but not much more. The basic strategy is to install two handlers. The first handler is activated during the URI translation phase. It doesn't actually alter the URI or filename in any way, but it does inspect the transaction to see if it is a proxy request. If this is the case, the handler installs a custom content handler to actually go out and do the request. In this respect, the translation handler is similar to Apache::Checksum3, which also installs a custom content handler for certain URIs.

Later on, when its content handler is called, the module uses the Perl LWP library to fetch the remote document. If the document does not appear to be a banner ad, the content handler forwards it on to the waiting client. Otherwise, the handler does a little switcheroo, replacing the advertisement with a custom GIF image of exactly the same size and shape as the ad. This bit of legerdemain is completely invisible to the browser, which goes ahead and renders the image as if it were the original banner ad.

In addition to the LWP library, this module requires the GD and Image::Size libraries for creating and manipulating images. They are available on CPAN if you do not already have them installed.

Turning to the code, after the familiar preamble we create a new LWP::UserAgent object that we will use to make all our requests for documents from remote servers:

@ISA = qw(LWP::UserAgent);
$VERSION = '1.00';

my $UA = __PACKAGE__->new;
$UA->agent(join "/", __PACKAGE__, $VERSION);

We actually subclass LWP::UserAgent, using the @ISA global to create an inheritance relationship between LWP::UserAgent and our own package. Although we don't override any of LWP::UserAgent's methods, making our module a subclass of LWP::UserAgent allows us to cleanly customize these methods at a later date should we need to.

We now create a new instance of the LWP::UserAgent subclass, using the special token __PACKAGE__, which evaluates at compile time to the name of the current package. In this case, __PACKAGE__->new is equivalent to Apache::AdBlocker->new (or new Apache::AdBlocker if you prefer Perl's indirect object syntax). Immediately afterward we call the object's agent() method with a string composed of the package name and version number. This is the calling card that LWP sends to the remote hosts' web servers as the HTTP User-Agent field. The method we use for constructing the User-Agent field creates the string Apache::AdBlocker/1.00.
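The subclass-and-instantiate idiom can be tried outside Apache. The following is a minimal sketch, not the module's own code: Demo::Agent and Demo::AdBlocker are hypothetical names, and Demo::Agent stands in for LWP::UserAgent so that the snippet has no dependencies.

```perl
use strict;
use warnings;

# Hypothetical stand-in for LWP::UserAgent, so the sketch runs
# without LWP installed. It offers only new() and agent().
package Demo::Agent;

sub new {
    my $class = shift;
    return bless { agent => '' }, $class;
}

sub agent {
    my $self = shift;
    $self->{agent} = shift if @_;   # act as a setter when given a value
    return $self->{agent};
}

# Hypothetical subclass, playing the role of Apache::AdBlocker.
package Demo::AdBlocker;

our @ISA     = ('Demo::Agent');     # inherit new() and agent()
our $VERSION = '1.00';

# __PACKAGE__ is replaced by the string "Demo::AdBlocker" at compile time.
my $UA = __PACKAGE__->new;
$UA->agent(join "/", __PACKAGE__, $VERSION);

print ref($UA), "\n";               # prints "Demo::AdBlocker"
print $UA->agent, "\n";             # prints "Demo::AdBlocker/1.00"
```

Because new() blesses into whatever class name it is given, the object belongs to the subclass even though the constructor is inherited.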

my $Ad = join "|", qw{ads? advertisements? banners? adv promotions?};

The last initialization step is to define a file-scoped lexical variable named $Ad (declared with my, so it is private to the module's file rather than a true package global) that holds a regular expression pattern matching many (but certainly not all) banner advertisement URIs. Most ads contain variants on the words "ad," "advertisement," "banner," or "promotion" somewhere in the URI, although this may have changed by the time you read this!
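To get a feel for what the pattern does and does not catch, it can be run against a few URIs. This is a small sketch; the URIs are made up for illustration, and the match expression is the one the module applies later in its content handler.

```perl
use strict;
use warnings;

# Same alternation as in the module.
my $Ad = join "|", qw{ads? advertisements? banners? adv promotions?};

# Hypothetical URIs, made up for illustration.
my @uris = (
    'http://www.example.com/ads/top.gif',
    'http://www.example.com/images/banner/468x60.gif',
    'http://www.example.com/images/logo.gif',
    'http://www.example.com/downloads/readme.gif',
);

for my $uri (@uris) {
    # \b anchors each alternative at word boundaries, so "ads" inside
    # "downloads" and "ad" inside "readme" do not count as matches.
    my $verdict = $uri =~ /\b($Ad)\b/i ? "looks like an ad" : "passes through";
    print "$uri: $verdict\n";
}
```

The first two URIs are flagged and the last two pass through. The word-boundary anchors are what keep ordinary words that merely contain "ad" from being treated as advertisements.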

sub handler {
   my $r = shift;
   return DECLINED unless $r->proxyreq;
   $r->handler("perl-script"); #ok, let's do it
   $r->push_handlers(PerlHandler => \&proxy_handler);
   return OK;
}

The next part of the module is the definition of the handler() subroutine, which in this case will be run during the URI translation phase. It simply checks whether the current transaction is a proxy request and declines the transaction if not. Otherwise, it calls the request object's handler() method to set the content handler to perl-script and calls push_handlers() to make the module's proxy_handler() subroutine the callback for the response phase of the transaction. handler() then returns OK to flag that it has handled the URI translation phase.
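The decline-or-claim logic can be exercised without a running server. The sketch below assumes a hypothetical Mock::Request class that imitates only the three request methods the handler touches. The constants are defined locally with the numeric values Apache uses for OK and DECLINED rather than imported from Apache::Constants.

```perl
use strict;
use warnings;

# Numeric values Apache uses for these two status codes.
use constant OK       => 0;
use constant DECLINED => -1;

# Hypothetical mock of the Apache request object; it records what
# the handler asks for instead of talking to a real server.
package Mock::Request;

sub new {
    my ($class, %args) = @_;
    return bless { proxyreq => $args{proxyreq}, handler => undef, pushed => [] }, $class;
}

sub proxyreq { return $_[0]{proxyreq} }

sub handler {
    my $self = shift;
    $self->{handler} = shift if @_;
    return $self->{handler};
}

sub push_handlers {
    my ($self, $phase, $code) = @_;
    push @{ $self->{pushed} }, $phase;   # remember which phase was requested
    return;
}

package main;

sub proxy_handler { return OK }          # placeholder for the real content handler

sub handler {
    my $r = shift;
    return DECLINED unless $r->proxyreq;
    $r->handler("perl-script");
    $r->push_handlers(PerlHandler => \&proxy_handler);
    return OK;
}

my $plain = Mock::Request->new(proxyreq => 0);
my $proxy = Mock::Request->new(proxyreq => 1);

my $rc_plain = handler($plain);
my $rc_proxy = handler($proxy);

print "ordinary request: $rc_plain\n";                     # -1, i.e. DECLINED
print "proxy request: $rc_proxy, ", $proxy->handler, "\n"; # 0 and perl-script
```

An ordinary request leaves the mock untouched, so Apache's normal translation would proceed. Only the proxy request has its content handler replaced.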

Most of the work is done in proxy_handler(). Its job is to use LWP's object-oriented methods to create an HTTP::Request object. The HTTP::Request is then forwarded to the remote host by the LWP::UserAgent, returning an HTTP::Response. The response must then be returned to the waiting browser, possibly after replacing the content. The only subtlety here is the need to copy the request headers from the incoming Apache request's headers_in() table to the HTTP::Request and, in turn, to copy the response headers from the HTTP::Response into the Apache request headers_out() table. If this copying back and forth isn't performed, then documents that rely on the exact values of certain HTTP fields, such as CGI scripts, will fail to work correctly across the proxy.

sub proxy_handler {
   my $r = shift;

   my $request = HTTP::Request->new($r->method, $r->uri);

proxy_handler() starts by recovering the Apache request object. It then uses the request object's method() and uri() methods to fetch the request method and the URI. These are used to create and initialize a new HTTP::Request. We now feed the incoming header fields from the Apache request object into the corresponding fields in the outgoing HTTP::Request:

   $r->headers_in->do(sub {
      $request->header(@_);
   });

We use a little trick to accomplish the copy. The headers_in() method (as opposed to the header_in() method that we have seen before) returns an instance of the Apache::Table class. This class, described in more detail in Chapter 9 (see "The Apache::Table Class"), implements methods for manipulating Apache's various table-like structures, including the incoming and outgoing HTTP header fields. One of these methods is do(), which when passed a CODE reference invokes the code once for each header field, passing to the routine the header's name and value each time. In this case, we call do() with an anonymous subroutine that passes the header keys and values on to the HTTP::Request object's header() method. It is important to use headers_in->do() here rather than copying the headers into a hash because certain headers, particularly Cookie, can be multivalued. Be aware, though, that header() replaces any value previously set for the same field; HTTP::Headers also provides push_header(), which appends a value instead, for cases where every occurrence of a repeated field must be preserved.
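The difference between walking a table with a callback and flattening it into a hash is easy to see in miniature. The sketch below assumes a hypothetical Mini::Table class, a much simplified imitation of a header table that stores ordered name/value pairs and ignores the callback's return value. It is not the real Apache::Table implementation.

```perl
use strict;
use warnings;

# Hypothetical miniature of a header table: an ordered list of
# [name, value] pairs with a do() method that walks them in order.
package Mini::Table;

sub new {
    my $class = shift;
    return bless { pairs => [] }, $class;
}

sub add {
    my ($self, $name, $value) = @_;
    push @{ $self->{pairs} }, [$name, $value];
    return;
}

sub do {
    my ($self, $code) = @_;
    for my $pair (@{ $self->{pairs} }) {
        $code->(@$pair);          # callback receives (name, value)
    }
    return;
}

package main;

my $table = Mini::Table->new;
$table->add('Accept', 'text/html');
$table->add('Cookie', 'session=abc');
$table->add('Cookie', 'theme=dark');

# Copying into a plain hash keeps only the last value per name.
my %flat;
$table->do(sub { my ($name, $value) = @_; $flat{$name} = $value });

# Collecting each pair in a callback keeps every value.
my @all;
$table->do(sub { push @all, [@_] });

print scalar(keys %flat), " fields in the hash\n";          # 2
print scalar(@all), " fields seen by the callback\n";       # 3
```

The hash ends up with a single Cookie entry, while the callback sees both. What the callback then does with each pair decides whether the duplicates survive the copy.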

   # copy POST data, if any
   if($r->method eq 'POST') {
      my $len = $r->header_in('Content-length');
      my $buf;
      $r->read($buf, $len);
      $request->content($buf);
   }

The next block of code checks whether the request method is POST. If so, we must copy the POSTed data from the incoming request to the HTTP::Request object. We do this by calling the request object's read() method to read the POST data into a temporary buffer. The data is then copied into the HTTP::Request by calling its content() method. Request methods other than POST may include a request body, but this example does not cope with these rare cases.

The HTTP::Request object is now complete, so we can actually issue the request:

    my $response = $UA->request($request);

We pass the HTTP::Request object to the user agent's request() method. After a delay for the network fetch, the call returns an HTTP::Response object, which we copy into a variable named $response.

   $r->content_type($response->header('Content-type'));
   $r->status($response->code);
   $r->status_line(join " ", $response->code, $response->message);

Now the process of copying the headers is reversed. Every header in the LWP HTTP::Response object must be copied to the Apache request object. First, we handle a few special cases. We call the HTTP::Response object's header() method to fetch the content type of the returned document and immediately pass the result to the Apache request object's content_type() method. Next, we set the numeric HTTP status code and the human-readable HTTP status line. We call the HTTP::Response object's code() and message() methods to return the numeric code and human-readable messages, respectively, and copy them to the Apache request object, using the status() and status_line() methods to set the values.

When the special case headers are done, we copy all the other header fields, using the HTTP::Response object's scan() method:

   $response->scan(sub {
      $r->header_out(@_);
   });

scan() is similar to the Apache::Table do() method: it loops through each of the header fields, invoking an anonymous callback routine for each one. The callback sets the corresponding field in the Apache request object using the header_out() method.

   if ($r->header_only) {
      $r->send_http_header();
      return OK;
   }

The outgoing header is complete at this point, so we check whether the current transaction is a HEAD request. If so, we emit the HTTP header and exit with an OK status code.

   my $content = \$response->content;
   if($r->content_type =~ /^image/ and $r->uri =~ /\b($Ad)\b/i) {
      block_ad($content);
      $r->content_type("image/gif");
   }

Otherwise, the time has come to deal with potential banner ads. To identify likely ads, we require that the document be an image and that its URI satisfy the regular expression match defined at the top of the module. We retrieve the document contents by calling the HTTP::Response object's content() method, and store a reference to the contents in a local variable named $content.¹⁰ We now check whether the document's MIME type is one of the image variants and that the URI satisfies the advertisement pattern match. If both of these are true, we call block_ad() to replace the content with a customized image. We also set the document's content type to image/gif, since this is what block_ad() produces.

   $r->content_type('text/html') unless $$content;
   $r->send_http_header;
   $r->print($$content || $response->error_as_HTML);

We send the HTTP header, then print the document contents. Notice that the document content may be empty, which can happen when LWP connects to a server that is down or busy. In this case, instead of printing an empty document, we print the nicely formatted error message produced by the HTTP::Response object's error_as_HTML() method. Because that message is HTML, we set the content type to text/html before sending the header whenever the content is empty.

   return OK;
}

Our work is done, so we return an OK status code.

The block_ad() subroutine is short and sweet. Its job is to take an image in any of several possible formats and replace it with a custom GIF of exactly the same dimensions. The GIF will be transparent, allowing the page background color to show through, and will have the words "Blocked Ad" printed in large friendly letters in the upper lefthand corner.

sub block_ad {
   my $data = shift;
   my($x, $y) = imgsize($data);

   my $im = GD::Image->new($x,$y);

To get the width and height of the image, we call imgsize(), a function imported from the Image::Size module. imgsize() recognizes most web image formats, including GIF, JPEG, XBM, and PNG. Using these values, we create a new blank GD::Image object and store it in a variable named $im.

   my $white = $im->colorAllocate(255,255,255);
   my $black = $im->colorAllocate(0,0,0);      
   my $red = $im->colorAllocate(255,0,0);

We call the image object's colorAllocate() method three times to allocate color table entries for white, black, and red. Then we declare that the white color is transparent, using the transparent() method:

   $im->transparent($white);
   $im->string(GD::gdLargeFont(),5,5,"Blocked Ad",$red);
   $im->rectangle(0,0,$x-1,$y-1,$black);

   $$data = $im->gif;
}

The routine calls the string() method to draw the message starting at coordinates (5,5) and finally frames the whole image with a black rectangle. The custom image is now converted into GIF format with the gif() method and copied into $$data, overwriting whatever was there before.
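The in-place replacement works because block_ad() receives a reference to the content rather than a copy of it. The following minimal sketch shows the same mechanism with plain strings; replace_body() is a hypothetical stand-in for block_ad(), and no image libraries are involved.

```perl
use strict;
use warnings;

# Hypothetical stand-in for block_ad(): it receives a reference to the
# document body and overwrites the referenced scalar in place.
sub replace_body {
    my $data = shift;                 # $data is a reference, not a copy
    my $old_length = length $$data;   # dereference to read the original
    $$data = "placeholder";           # the caller's scalar changes too
    return $old_length;
}

my $body    = "original banner bytes";
my $content = \$body;                 # same idea as \$response->content

replace_body($content);

print "$$content\n";                  # prints "placeholder"
print "$body\n";                      # prints "placeholder"
```

Passing a reference avoids copying what may be a large image, and lets the caller keep using the same $content variable whether or not the document was replaced.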

sub redirect_ok {return undef;}

The last detail is to define a redirect_ok() method to override the default LWP::UserAgent method. By returning undef, a false value, this method tells LWP not to follow redirects internally but to pass them on to the browser to handle. This is the correct behavior for a proxy server.

Activating this module is just a matter of adding the following line to one of the configuration files:

PerlTransHandler Apache::AdBlocker

Users who wish to make use of this filtering service should configure their browsers to proxy their requests through your server.

Example 7-12. A Banner Ad Blocking Proxy

package Apache::AdBlocker;
# file: Apache/AdBlocker.pm

use strict;
use vars qw(@ISA $VERSION);
use Apache::Constants qw(:common);
use GD ();
use Image::Size qw(imgsize);
use LWP::UserAgent ();

@ISA = qw(LWP::UserAgent);
$VERSION = '1.00';

my $UA = __PACKAGE__->new;
$UA->agent(join "/", __PACKAGE__, $VERSION);

my $Ad = join "|", qw{ads? advertisements? banners? adv promotions?};

sub handler {
   my $r = shift;
   return DECLINED unless $r->proxyreq;
   $r->handler("perl-script"); #ok, let's do it
   $r->push_handlers(PerlHandler => \&proxy_handler);
   return OK;
}

sub proxy_handler {
   my $r = shift;

   my $request = HTTP::Request->new($r->method, $r->uri);

   $r->headers_in->do(sub {
      $request->header(@_);
   });

   # copy POST data, if any
   if($r->method eq 'POST') {
      my $len = $r->header_in('Content-length');
      my $buf;
      $r->read($buf, $len);
      $request->content($buf);
   }

   my $response = $UA->request($request);
   $r->content_type($response->header('Content-type'));

   #feed response back into our request_rec*
   $r->status($response->code);
   $r->status_line(join " ", $response->code, $response->message);
   $response->scan(sub {
      $r->header_out(@_);
   });

   if ($r->header_only) {
      $r->send_http_header();
      return OK;
   }

   my $content = \$response->content;
   if($r->content_type =~ /^image/ and $r->uri =~ /\b($Ad)\b/i) {
      block_ad($content);
      $r->content_type("image/gif");
   }

   $r->content_type('text/html') unless $$content;
   $r->send_http_header;
   $r->print($$content || $response->error_as_HTML);

   return OK;
}

sub block_ad {
   my $data = shift;
   my($x, $y) = imgsize($data);

   my $im = GD::Image->new($x,$y);

   my $white = $im->colorAllocate(255,255,255);
   my $black = $im->colorAllocate(0,0,0);
   my $red = $im->colorAllocate(255,0,0);

   $im->transparent($white);
   $im->string(GD::gdLargeFont(),5,5,"Blocked Ad",$red);
   $im->rectangle(0,0,$x-1,$y-1,$black);

   $$data = $im->gif;
}

sub redirect_ok {return undef;}

1;
__END__

Footnotes

9 There are several third-party Perl API modules on CPAN that handle proxy requests, including one named Apache::ProxyPass and another named Apache::ProxyPassThru. If you are looking for the functionality of Apache::PassThru, you should examine one of these more finished products before using this one as the basis for your own module.

10 In this example, we call the response object's content() method to slurp the document content into a scalar. However, it can be more efficient to use the three-argument form of LWP::UserAgent's request() method to read the content in fixed-size chunks. See the LWP::UserAgent manual page for details.
