Writing Apache Modules with Perl and C

By:	Lincoln Stein and Doug MacEachern
Published:	O'Reilly & Associates, Inc. - March 1999

Chapter 7 - Other Request Phases
Handling Proxy Requests

In this section...

Introduction

Invoking mod_proxy for Nonproxy Requests

An Anonymizing Proxy

Handling the Proxy Process on Your Own

Introduction

The HTTP proxy protocol was originally designed to allow users unfortunate enough to be stuck behind a firewall to access external web sites. Instead of connecting to the remote server directly, an action forbidden by the firewall, users point their browsers at a proxy server located on the firewall machine itself. The proxy goes out and fetches the requested document from the remote site and forwards the retrieved document to the user.

Nowadays most firewall systems have a web proxy built right in so there's no need for dedicated proxying servers. However, proxy servers are still useful for a variety of purposes. For example, a caching proxy (of which Apache is one example) will store frequently requested remote documents in a disk directory and return the cached documents directly to the browser instead of fetching them anew. Anonymizing proxies take the outgoing request and strip out all the headers that can be used to identify the user or his browser. By writing Apache API modules that participate in the proxy process, you can achieve your own special processing of proxy requests.

The proxy request/response protocol is nearly the same as vanilla HTTP. The major difference is that instead of requesting a server-relative URI in the request line, the client asks for a full URL, complete with scheme and host. In addition, a few optional HTTP headers beginning with Proxy- may be added to the request. For example, a normal (nonproxy) HTTP request sent by a browser might look like this:

GET /foo/index.html HTTP/1.0
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
Pragma: no-cache
Connection: Keep-Alive
User-Agent: Mozilla/2.01 (WinNT; I)
Host: www.modperl.com:80

In contrast, the corresponding HTTP proxy request will look like this:

GET http://www.modperl.com/foo/index.html HTTP/1.0
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
Pragma: no-cache
User-Agent: Mozilla/2.01 (WinNT; I)
Host: www.modperl.com:80
Proxy-Connection: Keep-Alive

Notice that the URL in the request line of an HTTP proxy request includes the scheme and hostname. This information enables the proxy server to initiate a connection to the distant server. To generate this type of request, the user must configure his browser so that HTTP and, optionally, FTP requests are proxied to the server. This usually involves setting values in the browser's preference screens. An Apache server will be able to respond to this type of request if it has been compiled with the mod_proxy module. This module is part of the core Apache distribution but is not compiled in by default.
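The difference between the two request forms can be sketched in a few lines of standalone Perl. This is only an illustration, not how Apache itself parses requests: parse_request_line() is a hypothetical helper that splits a request line and reports whether the target is a full URL (proxy form) or a server-relative path (ordinary form).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Apache or mod_perl): split an HTTP
# request line and classify its target as proxy-style or ordinary.
sub parse_request_line {
    my ($line) = @_;
    my ($method, $target, $protocol) = split ' ', $line;
    return unless defined $protocol;    # malformed request line

    # A proxy request names the scheme and host, e.g. http://host/path
    if ($target =~ m{^([a-z][a-z0-9+.-]*)://([^/:]+)(?::(\d+))?(/.*)?$}i) {
        return {
            method => $method,
            proxy  => 1,
            scheme => lc($1),
            host   => $2,
            port   => $3,               # undef when no explicit port is given
            path   => defined $4 ? $4 : '/',
        };
    }

    # Anything else is treated as an ordinary server-relative request
    return { method => $method, proxy => 0, path => $target };
}

for my $line ('GET /foo/index.html HTTP/1.0',
              'GET http://www.modperl.com/foo/index.html HTTP/1.0') {
    my $req = parse_request_line($line);
    printf "%-52s %s\n", $line, $req->{proxy} ? 'proxy' : 'ordinary';
}
```

In a real handler you would not parse the request line yourself; the request object's proxyreq() method, used in the examples that follow, already tells you which form arrived.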

You can interact with Apache's proxy mechanism at the translation handler phase, and there are two types of intervention you can make. You can take an ordinary (nonproxy) request and turn it into a proxy request so that it will be handled by Apache's standard proxy module, or you can take an incoming proxy request and install your own content handler for it so that you can examine and possibly modify the response from the remote server.

Invoking mod_proxy for Nonproxy Requests

We'll look first at Apache::PassThru, an example of how to turn an ordinary request into a proxy request.⁹ Because this technique uses Apache's mod_proxy module, this module will have to be compiled and installed in order for this example to run on your system.

The idea behind the example is simple. Requests for URIs beginning with a certain path will be dynamically transformed into a proxy request. For example, we might transform requests for URLs beginning with /CPAN/ into a request for http://www.perl.com/CPAN/. The request to www.perl.com will be done completely behind the scenes; nothing will reveal to the user that the directory hierarchy is being served from a third-party server rather than our own. This functionality is the same as the ProxyPass directive provided by mod_proxy itself. You can also achieve the same effect by providing an appropriate rewrite rule to mod_rewrite.

The configuration for this example uses a PerlSetVar directive to set a variable named PerlPassThru. A typical entry in the configuration file will look like this:

PerlTransHandler Apache::PassThru
PerlSetVar PerlPassThru '/CPAN/   => http://www.perl.com/,\
                        /search/ => http://www.altavista.digital.com/'

The PerlPassThru variable contains a string representing a series of URI=>proxy pairs, separated by commas. A backslash at the end of a line can be used to split the string over several lines, improving readability (the ability to use backslash as a continuation character is actually an Apache configuration file feature but not a well-publicized one). In this example, we map the URI /CPAN/ to http://www.perl.com/ and /search/ to http://www.altavista.digital.com/. For the mapping to work correctly, local directory names should end with a slash in the manner shown in the example.
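To see what the handler gets out of such a string, here is a minimal standalone sketch. The $spec literal is a stand-in for what $r->dir_config('PerlPassThru') would return, on the assumption that the joined value keeps the indentation of the continuation line. parse_mappings() is a hypothetical helper that adds whitespace trimming and a pair-count check on top of the same split pattern the module uses.

```perl
use strict;
use warnings;

# Assumed stand-in for $r->dir_config('PerlPassThru') after Apache has
# joined the continued line; the run of spaces mimics the indentation.
my $spec = '/CPAN/   => http://www.perl.com/,'
         . '                        /search/ => http://www.altavista.digital.com/';

# Hypothetical helper: turn the string into a flat list of URI/proxy pairs.
sub parse_mappings {
    my ($value) = @_;
    $value =~ s/^\s+//;     # drop leading whitespace
    $value =~ s/\s+$//;     # drop trailing whitespace
    my @fields = split /\s*(?:,|=>)\s*/, $value;
    die "PerlPassThru needs URI => proxy pairs\n" if @fields % 2;
    return @fields;
}

my %mappings = parse_mappings($spec);
print "$_ -> $mappings{$_}\n" for sort keys %mappings;
```

Because the pattern swallows whitespace on either side of each comma and arrow, the extra spaces introduced by the continuation line do not end up in the keys or values.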

The code for Apache::PassThru is given in Example 7-10. The handler() subroutine begins by retrieving the request object and calling its proxyreq() method to determine whether the current request is a proxy request:

sub handler {
   my $r = shift;
   return DECLINED if $r->proxyreq;

If this is already a proxy request, we don't want to alter it in any way, so we decline the transaction. Otherwise, we retrieve the value of PerlPassThru, split it into its key/value components with a pattern match, and store the result in a hash named %mappings:

   my $uri = $r->uri;
   my %mappings = split /\s*(?:,|=>)\s*/, $r->dir_config('PerlPassThru');

We now loop through each of the local paths, looking for a match with the current request's URI. If a match is found, we perform a string substitution to replace the local path with the corresponding proxy URI. Otherwise, we continue to loop:

   for my $src (keys %mappings) {
      next unless $uri =~ s/^$src/$mappings{$src}/;
      $r->proxyreq(1);
      $r->uri($uri);
      $r->filename("proxy:$uri");
      $r->handler('proxy-server');
      return OK;
   }
   return DECLINED;
}

If the URI substitution succeeds, there are four steps we need to take to transform this request into something that mod_proxy will handle. The first two are obvious, but the others are less so. First, we need to set the proxy request flag to a true value by calling $r->proxyreq(1). Next, we change the requested URI to the proxied URI by calling the request object's uri() method. In the third step, we set the request filename to the string proxy: followed by the URI, as in proxy:http://www.perl.com/CPAN/. This is a special filename format recognized by mod_proxy, and as such is somewhat arbitrary. The last step is to set the content handler to proxy-server, so that the request is passed to mod_proxy to handle the response phase.

If we turned the local path into a proxy request, we return OK from the translation handler. Otherwise, we return DECLINED.
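Two details of the substitution are worth keeping in mind. The local path is interpolated into the pattern, so any regular expression metacharacters in it are treated as pattern syntax, and keys() returns the paths in no particular order, so overlapping prefixes may match unpredictably. The following standalone sketch shows one possible variation, not the module's own code: rewrite_uri() is a hypothetical helper that compares prefixes literally and tries the longest local path first.

```perl
use strict;
use warnings;

# Hypothetical helper: return the proxied URL for $uri, or nothing when
# no local path matches. Prefixes are compared as literal strings.
sub rewrite_uri {
    my ($uri, $mappings) = @_;
    for my $src (sort { length($b) <=> length($a) } keys %$mappings) {
        next unless index($uri, $src) == 0;          # literal prefix test
        return $mappings->{$src} . substr($uri, length $src);
    }
    return;
}

my %mappings = (
    '/CPAN/'   => 'http://www.perl.com/',
    '/search/' => 'http://www.altavista.digital.com/',
);

for my $uri ('/CPAN/modules/index.html', '/index.html') {
    my $proxied = rewrite_uri($uri, \%mappings);
    print "$uri => ", (defined $proxied ? $proxied : 'DECLINED'), "\n";
}
```

Inside the handler, a defined result would be followed by the same four steps described above, and an undefined one by returning DECLINED.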

Example 7-10. Invoking Apache's Proxy Request Mechanism from Within a Translation Handler

package Apache::PassThru;
# file: Apache/PassThru.pm
use strict;
use Apache::Constants qw(:common);

sub handler {
   my $r = shift;
   return DECLINED if $r->proxyreq;
   my $uri = $r->uri;
   my %mappings = split /\s*(?:,|=>)\s*/, $r->dir_config('PerlPassThru');
   for my $src (keys %mappings) {
      next unless $uri =~ s/^$src/$mappings{$src}/;
      $r->proxyreq(1);
      $r->uri($uri);
      $r->filename("proxy:$uri");
      $r->handler('proxy-server');
      return OK;
   }
   return DECLINED;
}
1;
__END__

An Anonymizing Proxy

As public concern about the ability of web servers to track people's surfing sessions grows, anonymizing proxies are becoming more popular. An anonymizing proxy is similar to an ordinary web proxy, except that certain HTTP headers that provide identifying information, such as the Referer, Cookie, User-Agent, and From fields, are quietly stripped from the request before it is forwarded to the remote server. Not only is this identifying information removed, but the identity of the requesting host is obscured. The remote server knows only the hostname and IP address of the proxy machine, not the identity of the machine the user is browsing from.

You can write a simple anonymizing proxy in the Apache Perl API in all of 18 lines (including comments). The source code listing is shown in Example 7-11. Like the previous example, it uses Apache's mod_proxy, so that module must be installed before this example will run correctly.

The module defines a package global named @Remove containing the names of all the request headers to be stripped from the request. In this example, we remove User-Agent, Cookie, Referer, and the infrequently used From field. The handler() subroutine begins by fetching the Apache request object and checking whether the current request uses the proxy protocol. However, unlike the previous example where we wanted the existence of the proxy to be secret, here we expect the user to explicitly configure his browser to use our anonymizing proxy. So here we return DECLINED if proxyreq() returns false.

If proxyreq() returns true, we know that we are in the midst of a proxy request. We loop through each of the fields to be stripped and delete them from the incoming headers table by using the request object's header_in() method to set the field to undef. We then return OK to signal Apache to continue processing the request. That's all there is to it.
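The stripping step can be tried outside Apache with a plain hash standing in for the incoming headers table. This is only a sketch: anonymize_headers() is a hypothetical helper, and because an ordinary Perl hash is case-sensitive (unlike Apache's header tables), the names are compared in lowercase.

```perl
use strict;
use warnings;

# Same list of fields as the module removes
my @Remove = qw(user-agent cookie from referer);

# Hypothetical helper: return a copy of the headers without the
# identifying fields, comparing names case-insensitively.
sub anonymize_headers {
    my (%headers) = @_;
    my %strip = map { (lc($_) => 1) } @Remove;
    my %kept;
    for my $name (keys %headers) {
        next if $strip{ lc $name };
        $kept{$name} = $headers{$name};
    }
    return %kept;
}

my %clean = anonymize_headers(
    'Accept'           => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*',
    'Pragma'           => 'no-cache',
    'User-Agent'       => 'Mozilla/2.01 (WinNT; I)',
    'Host'             => 'www.modperl.com:80',
    'Proxy-Connection' => 'Keep-Alive',
);
print "$_: $clean{$_}\n" for sort keys %clean;
```

Run against the headers of the sample proxy request shown earlier, only the User-Agent field is dropped; the Cookie, From, and Referer fields would be removed in the same way if present.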

To activate the anonymizing proxy, install it as a URI translation handler as before:

PerlTransHandler Apache::AnonProxy

An alternative that works just as well is to call the module during the header parsing phase (see the discussion of this phase earlier). In some ways, this makes more sense because we aren't doing any actual URI translation, but we are modifying the HTTP header. Here is the appropriate directive:

PerlHeaderParserHandler Apache::AnonProxy

The drawback to using PerlHeaderParserHandler like this is that, unlike PerlTransHandler, the directive is allowed in directory configuration sections and .htaccess files. But directory configuration sections are irrelevant in proxy requests, so the directive will silently fail if placed in one of these sections. The directive should go in the main part of one of the configuration files or in a virtual host section.

Example 7-11. A Simple Anonymizing Proxy

package Apache::AnonProxy;
# file: Apache/AnonProxy.pm
use strict;
use Apache::Constants qw(:common);

my @Remove = qw(user-agent cookie from referer);

sub handler {
   my $r = shift;
   return DECLINED unless $r->proxyreq;
   foreach (@Remove) {
      $r->header_in($_ => undef);
   }
   return OK;
}

1;
__END__

In order to test that this handler was actually working, we set up a test Apache server as the target of the proxy requests and added the following entry to its configuration file:

CustomLog logs/nosy_log "%h %{Referer}i %{User-Agent}i %{Cookie}i %U"

This created a "nosy" log that records the client's address, the Referer, User-Agent, and Cookie fields, and the requested URL path. Before installing the anonymizing proxy module, entries in this log looked like this (the lines have been wrapped to fit on the page):

192.168.2.5 http://prego/ Mozilla/4.04 [en] (X11; I; Linux 2.0.33 i686)
      - /tkdocs/tk_toc.ht
192.168.2.5 http://prego/ Mozilla/4.04 [en] (X11; I; Linux 2.0.33 i686)
      POMIS=10074 /perl/hangman1.pl

In contrast, after installing the anonymizing proxy module, all the identifying information was stripped out, leaving only the IP address of the proxy machine:

192.168.2.5 - - -  /perl/hangman1.pl
192.168.2.5 - - -  /icons/hangman/h0.gif
192.168.2.5 - - -  /cgi-bin/info2www
