Writing Apache Modules with Perl and C

By:	Lincoln Stein and Doug MacEachern
Published:	O'Reilly & Associates, Inc. - March 1999

Chapter 4 - Content Handlers / Content Handlers as File Processors
A Server-Side Include System

The obvious limitation of the Apache::Footer example is that the footer text is hardcoded into the code. Changing the footer becomes a nontrivial task, and using different footers for various parts of the site becomes impractical. A much more flexible solution is provided by Vivek Khera's Apache::Sandwich module. This module "sandwiches" HTML pages between canned headers and footers that are determined by runtime configuration directives. The Apache::Sandwich module also avoids the overhead of parsing the request document; it simply uses the subrequest mechanism to send the header, body, and footer files in sequence.

We can provide more power than Apache::Sandwich by using server-side includes. Server-side includes are small snippets of code embedded within HTML comments. For example, in the standard server-side includes that are implemented in Apache, you can insert the current time and date into the page with a comment that looks like this:

Today is <!--#echo var="DATE_LOCAL"-->.

In this section, we use mod_perl to develop our own system of server-side includes, using a simple but extensible scheme that lets you add new types of includes at a moment's whim. The basic idea is that HTML authors will create files that contain comments of this form:

<!-- #DIRECTIVE PARAM1 PARAM2 PARAM3 ... -->

A directive name consists of any sequence of alphanumeric characters or underscores. This is followed by a series of optional parameters, separated by spaces or commas. Parameters that contain whitespace must be enclosed in single or double quotes in shell command style. Backslash escapes also work in the expected manner.

The directives themselves are not hardcoded into the module but are instead dynamically loaded from one or more configuration files created by the site administrator. This allows the administrator to create a standard menu of includes that are available to the site's HTML authors. Each directive is a short Perl subroutine. A simple directive looks like this one:

sub HELLO { "Hello World!"; }

This defines a subroutine named HELLO() that returns the string "Hello World!" A document can now include the string in its text with a comment formatted like this one:

I said <!--#HELLO-->

A more complex subroutine will need access to the Apache object and the server-side include parameters. To accommodate this, the Apache object is passed as the first function argument, and the server-side include parameters, if any, follow. Here's a function definition that returns any field from the incoming request's HTTP header, using the Apache object's header_in() method:

sub HTTP_HEADER {
 my ($r,$field) = @_;
 $r->header_in($field);
}

With this subroutine definition in place, HTML authors can insert the User-Agent field into their document using a comment like this one:

You are using the browser <!-- #HTTP_HEADER User-Agent -->.

Example 4-2 shows an HTML file that uses a few of these includes, and Figure 4-2 shows what the page looks like after processing.

Figure 4-2. A page generated by Apache::ESSI

Example 4-2. An HTML File That Uses Extended Server-Side Includes

<html> <head> <title>Server-Side Includes</title></head>
<body bgcolor=white>
<h1>Server-Side Includes</h1>
This is some straight text.<p>

This is a "<!-- #HELLO -->" include.<p>

The file size is <strong><!-- #FSIZE --></strong>, and it was 
last modified on <!-- #MODTIME %x --><p>
Today is <!-- #DATE "%A, in <em>anno domini</em> %Y"-->.<p>
The user agent is <em><!--#HTTP_HEADER User-Agent--></em>.<p>
Oops: <!--#OOPS 0--><p>
Here is an included file:
<pre>
<!--#INCLUDE /include.txt 1-->
</pre>

<!--#FOOTER-->
</body> </html>

Implementing this type of server-side include system might seem to be something of a challenge, but in fact the code is surprisingly compact (Example 4-3). This module is named Apache::ESSI, for "extensible server-side includes."

Again, we'll step through the code one section at a time.

package Apache::ESSI;

use strict;
use Apache::Constants qw(:common);
use Apache::File ();
use Text::ParseWords qw(quotewords); 
my (%MODIFIED, %SUBSTITUTION);

We start as before by declaring the package name and loading various Perl library modules. In addition to the modules that we loaded in the Apache::Footer example, we import the quotewords() function from the standard Perl Text::ParseWords module. This routine provides command shell-like parsing of strings that contain quote marks and backslash escapes. We also define two lexical variables, %MODIFIED and %SUBSTITUTION, which are global to the package.
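To get a feel for how quotewords() treats directive parameters, here is a small standalone sketch that needs only the core Text::ParseWords module and no Apache. It splits a few argument strings with the same delimiter pattern and keep flag that the module uses later on. The helper name parse_args() is our own invention for illustration and is not part of Apache::ESSI.

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

# parse_args() is a hypothetical helper for illustration only.
# It splits on a single space or comma; the second argument (0)
# tells quotewords() to strip quote marks and backslashes.
sub parse_args {
    my $args = shift;
    return quotewords('[ ,]', 0, $args);
}

for my $args ('User-Agent',
              '/include.txt 1',
              q{"%A, in <em>anno domini</em> %Y"}) {
    my @words = parse_args($args);
    print scalar(@words), " parameter(s): ", join(' | ', @words), "\n";
}
```

The quoted strftime() format survives as a single parameter even though it contains both a comma and spaces, which is exactly what the DATE include in Example 4-2 relies on.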

sub handler {
   my $r = shift;
   $r->content_type() eq 'text/html' || return DECLINED;
   my $fh = Apache::File->new($r->filename) || return DECLINED;
   my $sub = read_definitions($r)    || return SERVER_ERROR;
   $r->send_http_header;
   $r->print($sub->($r, $fh));
   return OK;
}

The handler() subroutine is quite short. As in the Apache::Footer example, handler() starts by examining the content type of the document being requested and declines to handle requests for non-HTML documents. The handler recovers the file's physical path by calling the request object's filename() method and attempts to open it. If the file open fails, the handler again returns an error code of DECLINED. This avoids Apache::Footer's tedious checking of the file's existence and access permissions, at the cost of some efficiency every time a nonexistent file is requested.

Once the file is opened, we call an internal function named read_definitions(). This function reads the server-side includes configuration file and generates an anonymous subroutine to do the actual processing of the document. If an error occurs while processing the configuration file, read_definitions() returns undef and we return SERVER_ERROR in order to abort the transaction. Otherwise, we send the HTTP header and invoke the anonymous subroutine to perform the substitutions on the contents of the file. The result of invoking the subroutine is sent to the client using the request object's print() method, and we return a result code of OK to indicate that everything went smoothly.

sub read_definitions {
   my $r = shift;
   my $def = $r->dir_config('ESSIDefs');
   return unless $def;
   return unless -e ($def = $r->server_root_relative($def));

Most of the interesting work occurs in read_definitions(). The idea here is to read the server-side include definitions, compile them, and then use them to generate an anonymous subroutine that does the actual substitutions. In order to avoid recompiling this subroutine unnecessarily, we cache its code reference in the lexical hash %SUBSTITUTION and reuse it if we can.

The read_definitions() subroutine begins by retrieving the path to the file that contains the server-side include definitions. This information is contained in a per-directory configuration variable named ESSIDefs, which is set in the configuration file using the PerlSetVar directive and retrieved within the handler with the request object's dir_config() method (see the end of the example for a representative configuration file entry). If, for some reason, this variable isn't present, we return undef. As with other Apache configuration files, this file may be specified as either an absolute path or a partial path relative to the server root. We pass the path to the request object's server_root_relative() method. This convenient function prepends the server root to relative paths and leaves absolute paths alone. We next check that the file exists using the -e file test operator and return undef if not.

   return $SUBSTITUTION{$def} if $MODIFIED{$def} && $MODIFIED{$def} <= -M _;

Having recovered the name of the definitions file, we next check whether the compiled subroutine is already cached and, if so, whether the file has remained unchanged since the code was compiled. We use two hashes for this purpose. The %SUBSTITUTION hash holds the compiled code, and %MODIFIED records the definition file's -M value at the time it was last compiled. Both hashes are indexed by the definition file's path, allowing the module to handle the case in which several server-side include definition files are used for different parts of the document tree. The -M operator returns the file's age in days, measured from the time the Perl interpreter started, so a file that has been modified since it was cached reports a smaller value than the one we stored. If the value listed in %MODIFIED is less than or equal to the file's current -M value, the file has not changed and we return the cached subroutine. The special filehandle _ reuses the stat information gathered by the preceding -e test.
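The caching test is easier to follow in isolation. The following sketch is a simplified stand-in for this part of read_definitions(), not the module's code: the names fetch_cached(), %CACHE, and $compiles are hypothetical, the "compilation" is faked with a counter, and we test the stored value with defined rather than for truth. It uses utime to simulate an edit to a temporary file, and assumes a Unix-like system.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my (%MODIFIED, %CACHE);
my $compiles = 0;   # counts how often we "recompile"

# fetch_cached() is a hypothetical stand-in for read_definitions().
# -M gives the file's age in days relative to script start ($^T),
# so a more recently modified file has a SMALLER value.
sub fetch_cached {
    my $path = shift;
    return unless -e $path;          # fills the stat cache "_"
    my $age = -M _;
    return $CACHE{$path}
        if defined $MODIFIED{$path} && $MODIFIED{$path} <= $age;
    $compiles++;                     # pretend to compile here
    $MODIFIED{$path} = $age;
    return $CACHE{$path} = "version $compiles";
}

my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

utime time() - 1000, time() - 1000, $path;   # pretend the file is old
fetch_cached($path) for 1 .. 3;              # compiled once, then cached
utime time() - 10, time() - 10, $path;       # simulate an edit
print fetch_cached($path), "\n";             # triggers a recompile
print "compiled $compiles time(s)\n";
```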

   my $package = join "::", __PACKAGE__, $def;
   $package =~ tr/a-zA-Z0-9_/_/c;

The next two lines are concerned with finding a unique namespace in which to compile the server-side include functions. Putting the functions in their own namespace decreases the chance that function side effects will have unwanted effects elsewhere in the module. We take the easy way out here by using the path to the definition file to synthesize a package name, which we store in a variable named $package.
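To see what kind of name these two statements produce, we can run them on their own. In this sketch the helper name namespace_for() and the sample path are assumptions made for illustration; only the join and tr statements come from the module.

```perl
use strict;
use warnings;

# namespace_for() is a hypothetical helper name; the two statements
# inside it are the ones used in read_definitions().
sub namespace_for {
    my ($base, $def) = @_;
    my $package = join "::", $base, $def;
    # With the c flag, tr replaces every character NOT in the list.
    # That covers the slashes and the dot in the path, and also the
    # colons of the "::" separators.
    $package =~ tr/a-zA-Z0-9_/_/c;
    return $package;
}

# The path below is only an example.
print namespace_for('Apache::ESSI', '/usr/local/apache/conf/essi.defs'), "\n";
```

Because the colons are translated too, the result is a single flat package name rather than a nested one. Two paths that differ only in punctuation (for example essi.defs and essi_defs) would map to the same name, which is unlikely to matter in practice but is worth knowing.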

   eval "package $package; do '$def'";
   if($@) {
      $r->log_error("Eval of $def did not return true: $@");
      return;
   }

We then invoke eval() to compile the subroutine definitions into the newly chosen namespace. We use the package declaration to set the namespace and do to load and run the definitions file. We use do here rather than the more common require because do unconditionally recompiles code files even if they have been loaded previously. If the eval was unsuccessful, we log an error and return undef.
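The effect of combining package and do inside a string eval can be tried outside Apache. This sketch writes a one-line definitions file to a temporary location and loads it into a made-up namespace, Demo_Defs; the file contents and all names are assumptions chosen for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a tiny definitions file to a temporary location.
my ($def_fh, $def) = tempfile(SUFFIX => '.defs', UNLINK => 1);
print $def_fh q{sub HELLO { "Hello World!" } 1;}, "\n";
close $def_fh;

# Demo_Defs is a made-up namespace for this sketch.
my $package = 'Demo_Defs';
eval "package $package; do '$def'";
die "could not load $def: $@" if $@;

# The subroutine now lives in Demo_Defs, not in main.
my $hello = \&{"${package}::HELLO"};
print $hello->(), "\n";
print "main::HELLO defined? ",
      (defined &main::HELLO ? "yes" : "no"), "\n";
```

The file itself contains no package statement, yet its subroutine ends up in Demo_Defs, because do compiles the file in the package that is current at the point of the call.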

   $SUBSTITUTION{$def} = sub {
      do_substitutions($package, @_);
   };
   $MODIFIED{$def} = -M $def;  # store modification date
   return $SUBSTITUTION{$def};
}

Before we exit read_definitions(), we create a new anonymous subroutine that invokes the do_substitutions() function, store this subroutine in %SUBSTITUTION, and update %MODIFIED with the modification date of the definitions file. We then return the code reference to our caller. We interpose a new anonymous subroutine here so that we can add the contents of the $package variable to the list of variables passed to the do_substitutions() function.
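The trick of wrapping a function call in an anonymous subroutine in order to remember an extra argument is an ordinary Perl closure. A minimal sketch, in which render() and make_renderer() are hypothetical stand-ins for do_substitutions() and the code at the end of read_definitions():

```perl
use strict;
use warnings;

# render() is a hypothetical stand-in for do_substitutions().
sub render {
    my ($package, @rest) = @_;
    return "$package got: @rest";
}

# make_renderer() returns a closure that remembers $package and
# prepends it to whatever arguments the caller supplies later.
sub make_renderer {
    my $package = shift;
    return sub { render($package, @_) };
}

my $sub = make_renderer('Demo_Defs');
print $sub->('request', 'filehandle'), "\n";
```

The caller of $sub never needs to know which package is involved, just as handler() passes only $r and $fh.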

sub do_substitutions {
   my $package = shift;
   my($r, $fh) = @_;
   # Make sure that eval() errors aren't trapped.
   local $SIG{__WARN__};
   local $SIG{__DIE__};
   local $/; #slurp $fh
   my $data = <$fh>;
   $data =~ s/<!--\s*\#(\w+) # start of a function name
              \s*(.*?)         # optional parameters
              \s*-->           # end of comment
             /call_sub($package, $1, $r, $2)/xseg;
   $data;
}

When handler() invokes the anonymous subroutine, it calls do_substitutions() to do the replacement of the server-side include directives with the output of their corresponding routines. We start off by localizing the $SIG{__WARN__} and $SIG{__DIE__} handlers and setting them back to the default Perl CORE::warn() and CORE::die() subroutines. This is a paranoid precaution against the use of CGI::Carp, which some mod_perl users load into Apache during the startup phase in order to produce nicely formatted server error log messages. The subroutine continues by locally undefining the input record separator $/, so that a single read from the filehandle slurps the entire page into a scalar value named $data.
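The effect of localizing $SIG{__DIE__} can be demonstrated without CGI::Carp. In this sketch we install our own global hook, roughly as a module of that kind might, and show that it is bypassed inside a subroutine that localizes the handler. The names guarded() and @seen are hypothetical.

```perl
use strict;
use warnings;

my @seen;   # messages intercepted by the global hook

# Install a global __DIE__ hook for the demonstration.
$SIG{__DIE__} = sub { push @seen, $_[0] };

# guarded() is a hypothetical name; its first line mirrors the
# "local $SIG{__DIE__}" statement in do_substitutions().
sub guarded {
    local $SIG{__DIE__};              # default behavior in this scope
    eval { die "inside guarded\n" };
    return $@;
}

my $err = guarded();                  # the hook is not called
eval { die "outside guarded\n" };     # the hook is called
print "trapped: $err";
print "hook saw ", scalar(@seen), " message(s)\n";
```

When guarded() returns, the global hook is automatically restored, which is why the second die reaches it.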

We then invoke a string substitution to replace properly formatted comment strings with the results of invoking the corresponding server-side include function. The substitution uses the e flag to treat the replacement part as a Perl expression to be evaluated and the g flag to perform the search and replace globally. The x flag permits whitespace and comments inside the pattern, and the s flag lets the . metacharacter match newlines so that a directive may span several lines. The search half of the substitution looks like this:

/<!--\s*\#(\w+)\s*(.*?)\s*-->/

This detects the server-side include comments while capturing the directive name in $1 and its optional arguments in $2.

The replacement half of the substitution looks like this:

/call_sub($package, $1, $r, $2)/

This just invokes another utility function, call_sub(), passing it the package name, the directive name, the request object, and the list of parameters.
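The substitution can be exercised on an ordinary string. In the sketch below, the hash %DIRECTIVE and the function run_directive() are hypothetical stand-ins for the compiled definitions package and call_sub(); the regular expression is the one used by the module.

```perl
use strict;
use warnings;

# %DIRECTIVE and run_directive() are stand-ins for the compiled
# definitions package and call_sub(); both are hypothetical.
my %DIRECTIVE = (
    HELLO => sub { "Hello World!" },
    ECHO  => sub { "[" . shift() . "]" },
);

sub run_directive {
    my ($name, $args) = @_;
    my $code = $DIRECTIVE{$name}
        or return "<em>[unknown directive $name]</em>";
    return $code->($args);
}

sub expand {
    my $data = shift;
    $data =~ s/<!--\s*\#(\w+)   # directive name
               \s*(.*?)         # raw parameter string
               \s*-->           # end of comment
              /run_directive($1, $2)/xseg;
    return $data;
}

print expand('I said <!--#HELLO-->, then <!-- #ECHO a b -->.'), "\n";
print expand('<!-- an ordinary comment -->'), "\n";
```

Comments that do not begin with a # directive pass through untouched, so ordinary HTML comments are safe.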

sub call_sub {
   my($package, $name, $r, $args) = @_;
   my $sub = \&{join '::', $package, $name};
   $r->chdir_file;
   my $res = eval { $sub->($r, quotewords('[ ,]',0,$args)) };
   return "<em>[$@]</em>" if $@;
   return $res;
}

The call_sub() routine starts off by obtaining a reference to the subroutine using its fully qualified name. It does this by joining the package name to the subroutine name and then using the funky Perl \&{...} syntax to turn this string into a subroutine reference. As a convenience to the HTML author, before invoking the subroutine we call the request object's chdir_file() method. This simply makes the current directory the same as the requested file, which in this case is the HTML file containing the server-side includes.

The server-side include function is now invoked, passing it the request object and the optional arguments. We call quotewords() to split up the arguments on commas or whitespace. In order to trap fatal runtime errors that might occur during the function's execution, the call is done inside an eval{} block. If the called function dies, we return the error message captured in $@, wrapped in <em> tags and square brackets so that it stands out on the page. Otherwise, we return the function's result.
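Here is a cut-down version of call_sub() that runs without Apache, so that the error trapping can be observed directly. It omits the chdir_file() call, which needs a real request object, and passes undef in place of $r. The package name Demo_Defs and the function name call_sub_demo() are assumptions made for this sketch.

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

# A miniature definitions package (hypothetical name Demo_Defs).
package Demo_Defs;
sub OOPS { 10 / $_[1] }

package main;

# Like call_sub(), minus the chdir_file() call, which needs a
# real Apache request object. We pass undef in place of $r.
sub call_sub_demo {
    my ($package, $name, $r, $args) = @_;
    my $sub = \&{join '::', $package, $name};
    my $res = eval { $sub->($r, quotewords('[ ,]', 0, $args)) };
    return "<em>[$@]</em>" if $@;
    return $res;
}

print call_sub_demo('Demo_Defs', 'OOPS', undef, '5'), "\n";
print call_sub_demo('Demo_Defs', 'OOPS', undef, '0'), "\n";
```

The first call prints the quotient. The second prints Perl's division-by-zero message inside the <em> markup instead of aborting the whole page.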

At the bottom of Example 4-3 is an example entry for perl.conf (or httpd.conf if you prefer). The idea here is to make Apache::ESSI the content handler for all files ending with the extension .ehtml. We do this with a <Files> configuration section that contains the appropriate SetHandler and PerlHandler directives. We use the PerlSetVar directive to point the module to the server-relative definitions file, conf/essi.defs.

In addition to the <Files> section, we need to ensure that Apache knows that .ehtml files are just a special type of HTML file. We use AddType to tell Apache to treat .ehtml files as MIME type text/html.

You could also use <Location> or <Directory> to assign the Apache::ESSI content handler to a section of the document tree, or a different <Files> directive to make Apache::ESSI the content handler for all HTML files.

Example 4-3. An Extensible Server-Side Include System

package Apache::ESSI;
# file: Apache/ESSI.pm
use strict;
use Apache::Constants qw(:common);
use Apache::File ();
use Text::ParseWords qw(quotewords);
my (%MODIFIED, %SUBSTITUTION);

sub handler {
   my $r = shift;
   $r->content_type() eq 'text/html' || return DECLINED;
   my $fh = Apache::File->new($r->filename) || return DECLINED;
   my $sub = read_definitions($r)    || return SERVER_ERROR;
   $r->send_http_header;
   $r->print($sub->($r, $fh));
   return OK;
}

sub read_definitions {
   my $r = shift;
   my $def = $r->dir_config('ESSIDefs');
   return unless $def;
   return unless -e ($def = $r->server_root_relative($def));
   return $SUBSTITUTION{$def} if $MODIFIED{$def} && $MODIFIED{$def} <= -M _;

   my $package = join "::", __PACKAGE__, $def;
   $package =~ tr/a-zA-Z0-9_/_/c;
   eval "package $package; do '$def'";

   if($@) {
      $r->log_error("Eval of $def did not return true: $@");
      return;
   }

   $SUBSTITUTION{$def} = sub {
      do_substitutions($package, @_);
   };

   $MODIFIED{$def} = -M $def;  # store modification date
   return $SUBSTITUTION{$def};
}

sub do_substitutions {
   my $package = shift;
   my($r, $fh) = @_;
   # Make sure that eval() errors aren't trapped.
   local $SIG{__WARN__};
   local $SIG{__DIE__};
   local $/; #slurp $fh
   my $data = <$fh>;
   $data =~ s/<!--\s*\#(\w+) # start of a function name
              \s*(.*?)         # optional parameters
              \s*-->           # end of comment
             /call_sub($package, $1, $r, $2)/xseg;
   $data;
}

sub call_sub {
   my($package, $name, $r, $args) = @_;
   my $sub = \&{join '::', $package, $name};
   $r->chdir_file;
   my $res = eval { $sub->($r, quotewords('[ ,]',0,$args)) };
   return "<em>[$@]</em>" if $@;
   return $res;
}
1;
__END__

Here are some perl.conf directives to go with Apache::ESSI:

<Files ~ "\.ehtml$">
 SetHandler  perl-script
 PerlHandler Apache::ESSI
 PerlSetVar  ESSIDefs conf/essi.defs 
</Files>
AddType text/html .ehtml

At this point you'd probably like a complete server-side include definitions file to go with the module. Example 4-4 gives a short file that defines a core set of functions that you can build on top of. Among the functions defined here are ones for inserting the size and modification date of the current file, the date, fields from the browser's HTTP request header, and a function that acts like the C preprocessor #include macro to insert the contents of a file into the current document. There's also an include called OOPS which divides the number 10 by the argument you provide. Pass it an argument of zero to see how runtime errors are handled.

The INCLUDE() function inserts whole files into the current document. It accepts either a physical pathname or a "virtual" path in URI space. A physical path is only allowed if it lives in or below the current directory. This is to avoid exposing sensitive files such as /etc/passwd.
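The path check used by INCLUDE() is a single regular expression, and it is easy to try on a handful of candidate paths. The sketch below copies is_below_only() from the definitions file; the sample paths are made up for illustration.

```perl
use strict;
use warnings;

# Same test as is_below_only() in the definitions file: reject
# absolute paths and any path containing a ".." component.
sub is_below_only { $_[0] !~ m:(^/|(^|/)\.\.(/|$)): }

# The sample paths are hypothetical.
for my $path ('include.txt', 'parts/menu.txt', '/etc/passwd',
              '../secret.txt', 'parts/../../secret.txt') {
    printf "%-24s %s\n", $path,
        is_below_only($path) ? 'allowed' : 'refused';
}
```

Keep in mind that the test is purely textual. It does not follow symbolic links, so a link inside the document directory that points elsewhere would still be accepted.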

If the $virtual flag is passed, the function translates from URI space to a physical path name using the lookup_uri() and filename() methods:

$file = $r->lookup_uri($path)->filename;

The request object's lookup_uri() method creates an Apache subrequest for the specified URI. During the subrequest, Apache does all the processing that it ordinarily would on a real incoming request up to, but not including, activating the content handler. lookup_uri() returns an Apache::SubRequest object, which inherits all its behavior from the Apache request class. We then call this object's filename() method in order to retrieve its translated physical file name.

Example 4-4. Server-Side Include Function Definitions

# Definitions for server-side includes.
# Apache::ESSI loads this file with do(); ending with a true
# value keeps it usable with require as well.
use Apache::File ();
use Apache::Util qw(ht_time size_string);

# insert the string "Hello World!"
sub HELLO {
   my $r = shift;
   "Hello World!";
}

# insert today's date possibly modified by a strftime() format
# string
sub DATE {
   my ($r,$format) = @_;
   return scalar(localtime) unless $format;
   return ht_time(time, $format, 0);
}

# insert the modification time of the document, possibly modified
# by a strftime() format string.
sub MODTIME {
   my ($r,$format) = @_;
   my $mtime = (stat $r->finfo)[9];
   return scalar(localtime($mtime)) unless $format;
   return ht_time($mtime, $format, 0);
}

# insert the size of the current document
sub FSIZE {
   my $r = shift;
   return size_string -s $r->finfo;
}

# divide 10 by the argument (used to test runtime error trapping)
sub OOPS { 10/$_[1]; }

# insert a canned footer
sub FOOTER {
   my $r = shift;
   my $modtime = MODTIME($r);
   return <<END;
<hr>
&copy; 1998 <a href="http://www.ora.com/">O'Reilly &amp; Associates</a><br>
<em>Last Modified: $modtime</em>
END
}

# insert the named field from the incoming request 
sub HTTP_HEADER {
   my ($r,$h) = @_;
   $r->header_in($h);
}

#ensure that path is relative, and does not contain ".."
sub is_below_only { $_[0] !~ m:(^/|(^|/)\.\.(/|$)): }

# Insert the contents of a file.  If the $virtual flag is set
# does a document-root lookup, otherwise treats filename as a
# physical path.
sub INCLUDE {
   my ($r,$path,$virtual) = @_;
   my $file;
   if($virtual) {
      $file = $r->lookup_uri($path)->filename;
   }
   else {
      unless(is_below_only($path)) {
          die "Can't include $path\n";
      }
      $file = $path;
   }
   my $fh = Apache::File->new($file) || die "Couldn't open $file: $!\n";
   local $/;
   return <$fh>;
}

1;

If you're a fan of server-side includes, you should also check out the Apache Embperl and ePerl packages. Both packages, along with several others available from the CPAN, build on mod_perl to create a Perl-like programming language embedded entirely within server-side includes.
