

Writing Apache Modules with Perl and C
By:   Lincoln Stein and Doug MacEachern
Published:   O'Reilly & Associates, Inc.  - March 1999

Copyright © 1999 by O'Reilly & Associates, Inc.


 



Chapter 6 - Authentication and Authorization / Access Control with mod_perl
Blocking Greedy Clients

A limitation of using pattern matching to identify robots is that it only catches the robots that you know about and that identify themselves by name. A few devious robots masquerade as users by using user agent strings that identify themselves as conventional browsers. To catch such robots, you'll have to be more sophisticated.

A trick that some mod_perl developers have used to catch devious robots is to block access to any client that acts like a robot by requesting URIs at a rate faster than even the twitchiest of humans can click a mouse. The strategy is to record the time of the remote agent's initial access and to count the number of requests it makes over a period of time. If the agent exceeds the speed limit, it gets locked out. Apache::SpeedLimit (Example 6-4) shows one way to write such a module.

The module starts out much like the previous examples:

package Apache::SpeedLimit;
use strict;
use Apache::Constants qw(:common);
use Apache::Log ();
use IPC::Shareable ();
use vars qw(%DB);

Because it needs to track the number of hits each client makes on the site, Apache::SpeedLimit faces the problem of maintaining a persistent variable across multiple processes. Here, because performance is an issue in a script that will be called for every URI on the site, we solve the problem by tying a hash to shared memory using IPC::Shareable. The tied variable, %DB, is keyed to the name of the remote client. Each entry in the hash holds four values: the time of the client's first access to the site, the time of the most recent access, the number of hits the client has made on the site, and whether the client has been locked out for exceeding the speed limit.5
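
To make the record layout concrete, here is a minimal illustration (using made-up sample values, not part of the module itself) of what a single entry in the tied hash looks like. The four fields are packed into one space-delimited string, which is exactly what the join() and split() calls later in the handler expect:

# hypothetical entry: the key is "IP address:User-Agent", the value is
# "first last hits locked" (the times are in minutes since the epoch)
$DB{'192.168.2.5:Mozilla/4.5 (compatible)'} =
    join " ", 15345678, 15345681, 12, 0;
# the handler later recovers the four fields with split:
my($first, $last, $hits, $locked) =
    split ' ', $DB{'192.168.2.5:Mozilla/4.5 (compatible)'};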

sub handler {
   my $r = shift;
   return DECLINED unless $r->is_main;  # don't handle sub-requests
   my $speed_limit = $r->dir_config('SpeedLimit') || 10;
   # Accesses per minute
   my $samples = $r->dir_config('SpeedSamples')   || 10;
   # Sampling threshold (hits)
   my $forgive = $r->dir_config('SpeedForgive')   || 20;
   # Forgive after this period (minutes)

The handler() subroutine first fetches some configuration variables. The recognized directives include SpeedLimit, the number of accesses per minute that any client is allowed to make; SpeedSamples, the number of hits that the client must make before the module starts calculating statistics; and SpeedForgive, a "statute of limitations" on breaking the speed limit. If the client pauses for SpeedForgive minutes before trying again, the module will forgive it and treat the access as if it were the very first one.

A small but important detail is the second line in the handler, where the subroutine declines the transaction unless is_main() returns true. It is possible for this handler to be invoked as the result of an internal subrequest, for example, when Apache is rapidly iterating through the contents of an automatically indexed directory to determine the MIME types of each of the directory's files. We do not want such subrequests to count against the user's speed limit totals, so we ignore any request that isn't the main one. is_main() returns true for the main request, false for subrequests.

In addition to this, there's an even better reason for the is_main() check: the very next thing the handler routine does is call lookup_uri() to look up the requested file's content type so that it can ignore requests for image files. Without the check, the handler would recurse infinitely:

   my $content_type = $r->lookup_uri($r->uri)->content_type;
   return OK if $content_type =~ m:^image/:i; # ignore images

The rationale for the check for image files is that when a browser renders a graphics-intensive page, it generates a flurry of requests for inline images that can easily exceed the speed limit. We don't want to penalize users for this, so we ignore requests for inline images. It's necessary to make a subrequest to fetch the requested file's MIME type because access control handlers ordinarily run before the MIME type checker phase.

If we are dealing with a nonimage document, then it should be counted against the client's total. In the next section of the module, we tie a hash named %DB to shared memory using the IPC::Shareable module. We're careful only to tie the variable the first time the handler is called. If %DB is already defined, we don't tie it again:6

    tie %DB, 'IPC::Shareable', 'SPLM', {create => 1, mode => 0644}
     unless defined %DB;

The next task is to create a unique ID for the client to use as a key into the hash:

    my($ip, $agent) = ($r->connection->remote_ip,
                       $r->header_in('User-Agent'));
   my $id = "$ip:$agent";
   my $now = time()/60; # minutes since the epoch

The client's IP address alone would be adequate in a world of one desktop PC per user, but the existence of multiuser systems, firewalls, and web proxies complicates the issue, making it possible for multiple users to appear to originate at the same IP address. This module's solution is to create an ID that consists of the IP address concatenated with the User-Agent field. As long as Microsoft and Netscape release new browsers every few weeks, this combination will spread clients out sufficiently for this to be a practical solution. A more robust solution could make use of the optional cookie generated by Apache's mod_usertrack module, but we didn't want to make this example overly complex. A final preparatory task is to fetch the current time and scale it to minute units.
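
Before moving on, here is a minimal sketch, not part of Example 6-4, of how the cookie-based approach might look. It assumes that mod_usertrack is enabled (CookieTracking On) and is using its default cookie name of "Apache"; when no tracking cookie is present, it falls back to the IP/user-agent combination:

    # hedged sketch: prefer mod_usertrack's cookie as the client ID
    my $cookie = $r->header_in('Cookie') || '';
    my($tracker) = $cookie =~ /\bApache=([^;]+)/;  # mod_usertrack's default cookie
    my $id = $tracker || "$ip:$agent";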

    tied(%DB)->shlock;
   my($first, $last, $hits, $locked) = split ' ', $DB{$id};

Now we update the user's statistics and calculate his current fetch speed. In preparation for working with the shared data we call the tied hash's shlock() method, locking the data structure for writing. Next, we look up the user's statistics and split them into individual fields.

At this point in the code, we enter a block named CASE in which we take a variety of actions depending on the current field values:

    my $result = OK;
   my $l = $r->server->log;
 CASE:
   {

Just before entering the block, we set a variable named $result to a default of OK. We also retrieve an Apache::Log object to use for logging debugging messages.

The first case we consider is when the $first access time is blank:

       unless ($first) { # we're seeing this client for the first time
          $l->debug("First request from $ip.  Initializing speed counter.");
          $first = $last = $now;
          $hits = $locked = 0;
          last CASE;
      }

In this case, we can safely assume that this is the first time we're seeing this client. Our action is to initialize the fields and exit the block.

The second case occurs when the interval between the client's current and last accesses is longer than the grace period:

      if ($now - $last > $forgive) {
         # beyond the grace period.  Treat like first
         $l->debug("$ip beyond grace period.Reinitializing speed counter.");
         $last = $first = $now;
         $hits = $locked = 0;
         last CASE;
     }

In this case, we treat this access as a whole new session and reinitialize all the fields to their starting values. This "forgives" the client, even if it previously was locked out.

At this point, we can bump up the number of hits and update the last access time. If the number of hits is too small to make decent statistics, we just exit the block at this point:

       $last = $now; $hits++;
      if ($hits < $samples) {
          $l->debug("$ip not enough samples to calculate speed.");
          last CASE;
      }

Otherwise, if the user is already locked out, we set the result code to FORBIDDEN and immediately exit the block. Once a client is locked out of the site, we don't unlock it until the grace period has passed:

       if ($locked) { # already locked out, so forbid access
          $l->debug("$ip locked");
          $result = FORBIDDEN;
          last CASE;
      }

If the client isn't yet locked out, then we calculate its average fetch speed by dividing the number of accesses it has made by the time interval between now and its first access. If this value exceeds the speed limit, we set the $locked variable to true and set the result code to FORBIDDEN:

       my $interval = $now - $first;
      $l->debug("$ip speed = ", $hits/$interval);
      if ($hits/$interval > $speed_limit) {
          $l->debug("$ip exceeded speed limit.  Blocking.");
          $locked = 1;
          $result = FORBIDDEN;
          last CASE;
      }
   }

At the end of the module, we check the result code. If it's FORBIDDEN we emit a log entry to explain the situation. We now update %DB with new values for the access times, number of hits, and lock status and unlock the shared memory. Lastly, we return the result code to Apache:

   $r->log_reason("Client exceeded speed limit.", $r->filename)
      if $result == FORBIDDEN;
   $DB{$id} = join " ", $first, $now, $hits, $locked;
   tied(%DB)->shunlock;
   return $result;
}

To apply the Apache::SpeedLimit module to your entire site, you would create a configuration file entry like the following:

<Location />
 PerlAccessHandler Apache::SpeedLimit
 # max 20 accesses/minute
 PerlSetVar        SpeedLimit   20
 # 5 hits before doing statistics
 PerlSetVar        SpeedSamples  5
 # amnesty after 30 minutes
 PerlSetVar        SpeedForgive 30
</Location>

Example 6-4. Blocking Greedy Clients

package Apache::SpeedLimit;
# file: Apache/SpeedLimit.pm
use strict;
use Apache::Constants qw(:common);
use Apache::Log ();
use IPC::Shareable ();
use vars qw(%DB);
sub handler {
   my $r = shift;
   return DECLINED unless $r->is_main;  # don't handle sub-requests
   my $speed_limit = $r->dir_config('SpeedLimit') || 10;
   # Accesses per minute
   my $samples = $r->dir_config('SpeedSamples')   || 10;
   # Sampling threshold (hits)
   my $forgive = $r->dir_config('SpeedForgive')   || 20;
   # Forgive after this period (minutes)
    my $content_type = $r->lookup_uri($r->uri)->content_type;
   return OK if $content_type =~ m:^image/:i; # ignore images
   tie %DB, 'IPC::Shareable', 'SPLM', {create => 1, mode => 0644}
     unless defined %DB;
    my($ip, $agent) = ($r->connection->remote_ip,
                      $r->header_in('User-Agent'));
   my $id = "$ip:$agent";
   my $now = time()/60; # minutes since the epoch
    # lock the shared memory while we work with it
   tied(%DB)->shlock;
   my($first, $last, $hits, $locked) = split ' ', $DB{$id};
   my $result = OK;
   my $l = $r->server->log;
 CASE:
   {
      unless ($first) { # we're seeing this client for the first time
          $l->debug("First request from $ip.  Initializing speed counter.");
          $first = $last = $now;
          $hits = $locked = 0;
          last CASE;
      }
       if ($now - $last > $forgive) {
           # beyond the grace period.  Treat like first
           $l->debug("$ip beyond grace period.  Reinitializing speed counter.");
           $last = $first = $now;
           $hits = $locked = 0;
           last CASE;
       }
       # update the values now
      $last = $now; $hits++;
      if ($hits < $samples) {
          $l->debug("$ip not enough samples to calculate speed.");
          last CASE;
      }
       if ($locked) { # already locked out, so forbid access
          $l->debug("$ip locked");
          $result = FORBIDDEN;
          last CASE;
      }
       my $interval = $now - $first;
      $l->debug("$ip speed = ", $hits/$interval);
      if ($hits/$interval > $speed_limit) {
          $l->debug("$ip exceeded speed limit.  Blocking.");
          $locked = 1;
          $result = FORBIDDEN;
          last CASE;
      }
   }
    $r->log_reason("Client exceeded speed limit.", $r->filename)
      if $result == FORBIDDEN;
   $DB{$id} = join " ", $first, $now, $hits, $locked;
   tied(%DB)->shunlock;
    return $result;
}
1;
__END__

Footnotes

4 The mod_rewrite module may also be worth perusing. Its rewrite rules can be based on the User-Agent field, time of day, and other variables.

5 On systems that don't have IPC::Shareable available, a tied DBM file might also work, but you'd have to open and close it each time the module is called. This would have performance implications. A better solution would be to store the information in a DBI database, as described in Chapter 5, Maintaining State. Windows systems use a single-process server, and don't have to worry about this issue.
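
For illustration only, a rough sketch of the tied-DBM alternative might look like the following; the file name is made up, and a real multiprocess implementation would also need file locking, which DB_File alone does not provide:

# hedged sketch of the DBM alternative (not from Example 6-4)
use DB_File ();                    # at the top of the module
use Fcntl qw(O_RDWR O_CREAT);

# inside handler(), in place of the IPC::Shareable tie:
tie my %DB, 'DB_File', '/usr/local/apache/data/speedlimit.db',
    O_RDWR|O_CREAT, 0644 or return SERVER_ERROR;
# ... read and update $DB{$id} exactly as in Example 6-4 ...
untie %DB;                         # close the file before returning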

6 An alternative approach would be to use a PerlChildInitHandler to tie the %DB. This technique is described in more detail in Chapter 7, Other Request Phases.
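
As a minimal sketch of that alternative (the child_init() routine shown here is our own sketch, not the version presented in Chapter 7), one might add "PerlChildInitHandler Apache::SpeedLimit::child_init" to the server configuration and a routine like this to the module, so that each child process ties %DB once at startup instead of on its first request:

# hedged sketch of the PerlChildInitHandler alternative
sub child_init {
    tie %DB, 'IPC::Shareable', 'SPLM', {create => 1, mode => 0644};
    return OK;
}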
