Chapter 6 - Authentication and Authorization / Access Control with mod_perl

Browser-Based Access Control

Web-crawling robots are an increasing problem for webmasters. Robots are supposed
to abide by an informal agreement known as the robot exclusion standard (RES),
in which the robot checks a file named robots.txt that tells it what
parts of the site it is allowed to crawl through. Many rude robots, however,
ignore the RES or, worse, exploit robots.txt to guide them to the "interesting"
parts. The next example (Example 6-3) gives the
outline of a robot exclusion module called Apache::BlockAgent. With
it you can block the access of certain web clients based on their User-Agent
field (which frequently, although not invariably, identifies robots).
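For reference, the exclusion file that well-behaved robots fetch is a plain text file named robots.txt at the top of the document tree. A minimal example (the directory names here are invented) looks like this:
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/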
The module is configured with a "bad agents" text file. This file contains a series of pattern matches, one per line. The incoming request's User-Agent field will be compared to each of these patterns in a case-insensitive manner. If any of the patterns hit, the request will be refused. Here's a small sample file that contains pattern matches for a few robots that have been reported to behave rudely:
^teleport pro\/1\.28
^nicerspro
^mozilla\/3\.0 \(http engine\)
^netattache
^crescent internet toolpak http ole control v\.1\.0
^go-ahead-got-it
^wget
^devsoft's http component v1\.0
^www\.pl
^digout4uagent
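Each line is an ordinary Perl pattern. As a quick sketch of how one of these entries behaves against an incoming User-Agent string (the agent value is invented):
my $agent = 'Wget/1.5.3';
print "blocked\n" if $agent =~ /^wget/i;   # /i makes the comparison case-insensitive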
Rather than hardcode the location of the bad agents file, we set its path using a configuration variable named BlockAgentFile. A directory configuration section like this sample perl.conf entry will apply the Apache::BlockAgent handler to the entire site:
<Location />
PerlAccessHandler Apache::BlockAgent
PerlSetVar BlockAgentFile conf/bad_agents.txt
</Location>
Apache::BlockAgent is a long module, so we'll step through the code a section at a time.
package Apache::BlockAgent;
use strict;
use Apache::Constants qw(:common);
use Apache::File ();
use Apache::Log ();
use Safe ();
my $Safe = Safe->new;
my %MATCH_CACHE;
The module brings in the common Apache constants and loads file-handling code from Apache::File. It also brings in the Apache::Log module, which makes the logging API available. The standard Safe module is pulled in next, and a new compartment is created where code will be compiled. We'll see later how the %MATCH_CACHE package variable is used to cache the code routines that detect undesirable user agents. Most of Apache::BlockAgent's logic is contained in the short handler() subroutine:
sub handler {
   my $r = shift;
   my($patfile, $agent, $sub);
   return DECLINED unless $patfile = $r->dir_config('BlockAgentFile');
   return FORBIDDEN unless $agent = $r->header_in('User-Agent');
   return SERVER_ERROR unless $sub = get_match_sub($r, $patfile);
   return OK if $sub->($agent);
   $r->log_reason("Access forbidden to agent $agent", $r->filename);
   return FORBIDDEN;
}
The code first checks that the BlockAgentFile configuration variable is present. If not, it declines to handle the transaction. It then attempts to fetch the User-Agent field from the HTTP header by calling the request object's header_in() method. If no value is returned by this call (which might happen if a sneaky robot declines to identify itself), we return FORBIDDEN from the subroutine, blocking access.
Otherwise, we call an internal function named get_match_sub() with the request object and the path to the bad agents file. get_match_sub() uses the information contained in the file to compile an anonymous subroutine which, when called with the user agent string, returns a true value if the client is accepted or false if it matches one of the forbidden patterns. If get_match_sub() returns an undefined value, it indicates that one or more of the patterns didn't compile correctly, and we return a server error. Otherwise, we call the returned subroutine with the agent name and return OK or FORBIDDEN, depending on the outcome.
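To make that contract concrete, here is a sketch of what the compiled subroutine does, using the module's $Safe compartment (the pattern and agent strings are invented for illustration):
my $sub = $Safe->reval('sub { local $_ = shift; return if /^wget/i; 1 }');
print $sub->('Wget/1.5.3')         ? "pass\n" : "block\n";   # prints "block"
print $sub->('Mozilla/4.08 (X11)') ? "pass\n" : "block\n";   # prints "pass"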
The remainder of the module is taken up by the definition of get_match_sub(). This subroutine is interesting because it illustrates the advantage of a persistent module over a transient CGI script:
sub get_match_sub {
   my($r, $filename) = @_;
   $filename = $r->server_root_relative($filename);
   my $mtime = (stat $filename)[9];
   # try to return the sub from cache
   return $MATCH_CACHE{$filename}->{'sub'} if
       $MATCH_CACHE{$filename} &&
           $MATCH_CACHE{$filename}->{'mod'} >= $mtime;
Rather than tediously read in the bad agents file each time we're called, compile each of the patterns, and test them one by one, we compile the pattern match tests into an anonymous subroutine and store it in the %MATCH_CACHE package variable, along with the name of the pattern file and its modification date. Each time it is called, get_match_sub() checks %MATCH_CACHE to see whether this particular pattern file has been processed before. If it has, the routine compares the file's modification time against the date stored in the cache. If the file is not more recent than the cached version, we return the cached subroutine; otherwise, we recompile it.
Next we open up the bad agents file, fetch the patterns, and build up a subroutine line by line using a series of string concatenations:
   my($fh, @pats);
   return undef unless $fh = Apache::File->new($filename);
   chomp(@pats = <$fh>);   # get the patterns into an array
   my $code = "sub { local \$_ = shift;\n";
   foreach (@pats) {
      next if /^#/;
      $code .= "return if /$_/i;\n";
   }
   $code .= "1; }\n";
   $r->server->log->debug("compiled $filename into:\n $code");
Note the use of $r->server->log->debug() to send a debugging message to the server log file. This message will only appear in the error log if the LogLevel is set to debug. If all goes well, the synthesized subroutine stored in $code will end up looking something like this:
sub {
   local $_ = shift;
   return if /^teleport pro\/1\.28/i;
   return if /^nicerspro/i;
   return if /^mozilla\/3\.0 \(http engine\)/i;
   ...
   1;
}
After building up the subroutine, we run a match-all regular expression over the code in order to untaint what was read from disk. In most cases, blindly untainting data like this is a bad idea that renders the taint check mechanism useless. Here we mitigate the risk by compiling the code inside a Safe compartment with its reval() method, which disables potentially dangerous operations such as system().
   # create the sub, cache and return it
   ($code) = $code =~ /^(.*)$/s;   # untaint
   my $sub = $Safe->reval($code);
   unless ($sub) {
      $r->log_error($r->uri, ": ", $@);
      return;
   }
The untainting step is required only if taint checks are turned on with the
PerlTaintCheck On directive (see Appendix A,
Standard Noncore Modules). The result of reval()ing the
string is a CODE reference to an anonymous subroutine or undef
if something went wrong during the compilation. In the latter case, we log the
error and return.
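For reference, taint checking of embedded Perl code is switched on in httpd.conf with a single server-level directive:
PerlTaintCheck On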
The final step is to store the compiled subroutine and the bad agent file's modification time into %MATCH_CACHE :
   @{ $MATCH_CACHE{$filename} }{'sub','mod'} = ($sub, $mtime);
   return $MATCH_CACHE{$filename}->{'sub'};
}
Because there may be several pattern files applicable to different parts of the site, we key %MATCH_CACHE by the path to the file. We then return the compiled subroutine to the caller.
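For example, two parts of the site could be guarded by different pattern files, each of which would then get its own slot in %MATCH_CACHE (the paths below are hypothetical):
<Location /archives>
PerlAccessHandler Apache::BlockAgent
PerlSetVar BlockAgentFile conf/bad_agents.txt
</Location>

<Location /downloads>
PerlAccessHandler Apache::BlockAgent
PerlSetVar BlockAgentFile conf/strict_agents.txt
</Location>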
As we saw in Chapter 4, Content
Handlers, this technique of compiling and caching a dynamically evaluated
subroutine is a powerful optimization that allows Apache::BlockAgent
to keep up with even very busy sites. Going one step further, the Apache::BlockAgent
module could avoid parsing the pattern file entirely by defining its own custom
configuration directives. The technique for doing this is described in Chapter 7, Other Request Phases.

Example 6-3. Blocking Rude Robots with Apache::BlockAgent
package Apache::BlockAgent;

use strict;
use Apache::Constants qw(:common);
use Apache::File ();
use Apache::Log ();
use Safe ();

my $Safe = Safe->new;
my %MATCH_CACHE;

sub handler {
   my $r = shift;
   my($patfile, $agent, $sub);
   return DECLINED unless $patfile = $r->dir_config('BlockAgentFile');
   return FORBIDDEN unless $agent = $r->header_in('User-Agent');
   return SERVER_ERROR unless $sub = get_match_sub($r, $patfile);
   return OK if $sub->($agent);
   $r->log_reason("Access forbidden to agent $agent", $r->filename);
   return FORBIDDEN;
}

# This routine creates a pattern matching subroutine from a
# list of pattern matches stored in a file.
sub get_match_sub {
   my($r, $filename) = @_;
   $filename = $r->server_root_relative($filename);
   my $mtime = (stat $filename)[9];

   # try to return the sub from cache
   return $MATCH_CACHE{$filename}->{'sub'} if
       $MATCH_CACHE{$filename} &&
           $MATCH_CACHE{$filename}->{'mod'} >= $mtime;

   # if we get here, then we need to create the sub
   my($fh, @pats);
   return undef unless $fh = Apache::File->new($filename);
   chomp(@pats = <$fh>);   # get the patterns into an array
   my $code = "sub { local \$_ = shift;\n";
   foreach (@pats) {
      next if /^#/;
      $code .= "return if /$_/i;\n";
   }
   $code .= "1; }\n";
   $r->server->log->debug("compiled $filename into:\n $code");

   # create the sub, cache and return it
   ($code) = $code =~ /^(.*)$/s;   # untaint
   my $sub = $Safe->reval($code);
   unless ($sub) {
      $r->log_error($r->uri, ": ", $@);
      return;
   }
   @{ $MATCH_CACHE{$filename} }{'sub','mod'} = ($sub, $mtime);
   return $MATCH_CACHE{$filename}->{'sub'};
}

1;
__END__