Maintaining State (CGI Programming with Perl)

The advantages and disadvantages of each technique are summarized in Table 11-1. We will review each technique separately, so if some of the points in the table are unclear, you may want to refer back to it after reading the sections below. In general, though, you should note that client-side cookies are the most powerful option for maintaining state, but they require the client to support and accept cookies. The other options work regardless of the client, but both limit the number of pages across which we can track the user.

11.1. Query Strings and Extra Path Information

We've passed query information to CGI applications many times throughout this book. In this section, we'll use queries in a slightly less obvious manner, namely to track a user's browsing trail while traversing from one document to the next on the server.

In order to do this, we'll have a CGI script handle every request for a static HTML page. The CGI script will check whether the request URL contains an identifier matching our format. If it doesn't, the script assumes that this is a new user and generates a new identifier. The script then parses the requested HTML document, looking for links to other URLs within our web site and adding the user's identifier to each one. Thus, the identifier will be passed on with future requests and propagated from document to document. Of course, if we want to track users across CGI applications, then we'll also need to parse the output of these CGI scripts. The simplest way to accomplish both goals is to create a general module that handles reading the identifier and parsing the output. This way, we need to write our code only once, and both the script that serves our HTML pages and all our other CGI scripts can share it.

As you may have guessed, this is not a very efficient process, since every request for an HTML document causes a CGI application to be executed. Tools such as mod_perl and FastCGI, discussed in Chapter 17, "Efficiency and Optimization", help because both keep a Perl interpreter running between requests (mod_perl embeds it in the web server; FastCGI keeps it in a persistent process), so the script is not started from scratch each time.

Another strategy to help improve performance is to perform some processing in advance. If you are willing to preprocess your documents, you can reduce the amount of work that happens when the customer accesses the document. The majority of the work involved in parsing a document and replacing links is identifying the links. HTML::Parser is a good module, but the work it does is rather complex. If you parse the links ahead of time and insert a special placeholder keyword rather than a particular user's identifier, then at request time you only need to look for this keyword and do not have to worry about recognizing links. For example, you could parse URLs and add #USERID# as the identifier for each document. The resulting code becomes much simpler. You can effectively handle documents this way:

sub parse {
    my( $filename, $id ) = @_;
    open my $fh, "<", $filename or die "Cannot open $filename: $!";
    
    # Replace every placeholder with this user's identifier
    while (<$fh>) {
        s/#USERID#/$id/g;
        print;
    }
    close $fh;
}
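
The preprocessing pass that inserts the placeholder could be sketched as follows. This is only an illustration under stated assumptions: the helper name add_placeholder and the /store base path are hypothetical, and it uses a simple regular expression in place of HTML::Parser, so it only recognizes double-quoted href attributes that begin with the base path.

```perl
use strict;
use warnings;

# Hypothetical helper: insert the #USERID# placeholder after the base
# path of every double-quoted href that points into the tracked tree.
# Links that do not start with the base path are left alone.
sub add_placeholder {
    my( $html, $base ) = @_;
    $html =~ s{(href\s*=\s*")\Q$base\E/}{$1$base/.#USERID#/}gi;
    return $html;
}

my $page = '<a href="/store/faq.html">FAQ</a> '
         . '<a href="http://www.perl.com/">Perl</a>';
print add_placeholder( $page, "/store" ), "\n";
```

You would run a pass like this once over each document and save the result; at request time, only the cheap #USERID# substitution remains.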

However, when a user traverses through a set of static HTML documents, CGI applications are typically not involved. If that's the case, how do we pass session information from one HTML document to the next, and be able to keep track of it on the server?

The answer to our problem is to configure the server such that when the user requests an HTML document, the server executes a CGI application. The application would then be responsible for transparently embedding special identifying information (such as a query string) into all the hyperlinks within the requested HTML document and returning the newly created content to the browser.

Let's look at how we're actually going to implement the application. It's only a two-step process. To reiterate, the problem we're trying to solve is to determine what documents a particular user requests and how much time he or she spends viewing them. First, we need to identify the set of documents for which we want to obtain the users' browsing history. Once we do that, we simply move these documents to a specific directory under the web server's document root directory.

Next, we need to configure the web server to execute a CGI application each and every time a user requests a document from this directory. We'll use the Apache web server for this example, but the configuration details are very similar for other web servers, as well.

We simply need to insert the following directives into Apache's access configuration file, access.conf:

<Directory /usr/local/apache/htdocs/store>
    AddType text/html   .html
    AddType Tracker     .html
    Action  Tracker     /cgi/track.cgi
</Directory>

When a user requests a document from the /usr/local/apache/htdocs/store directory, Apache executes the application named in the Action directive, /cgi/track.cgi (the query_track script shown in Example 11-2), passing to it the relative URL of the requested document as extra path information. Here's an example. When the user requests a document from the directory for the first time:

http://localhost/store/index.html

the web server will execute the tracking application as though the request had been:

http://localhost/cgi/track.cgi/store/index.html

The application uses the PATH_TRANSLATED environment variable to get the full path of index.html. Then, it opens the file, creates a new identifier for the user, embeds it into each relative URL within the document, and returns the modified HTML stream to the browser. In addition, we log the transaction to a special log file, which you can use to analyze users' browsing habits at a later time.
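
The logging and later analysis mentioned above might be sketched like this. Everything here is an assumption made for illustration: the helper names log_request and viewing_times, and the log format (epoch time, identifier, and document path separated by tabs) are not taken from the examples in this chapter.

```perl
use strict;
use warnings;
use Fcntl qw( :flock );
use File::Temp qw( tempfile );

# Hypothetical log format: epoch time, identifier, and document path,
# separated by tabs, one request per line.
sub log_request {
    my( $log_file, $id, $doc, $when ) = @_;
    $when = time unless defined $when;
    open my $log, ">>", $log_file or die "Cannot open $log_file: $!";
    flock $log, LOCK_EX or die "Cannot lock $log_file: $!";
    print $log join( "\t", $when, $id, $doc ), "\n";
    close $log or die "Cannot close $log_file: $!";
}

# Estimate viewing time as the gap between one request and the next
# request carrying the same identifier. The last page a user visits
# has no following request, so it gets no estimate.
sub viewing_times {
    my $log_file = shift;
    my( %last, @visits );
    open my $log, "<", $log_file or die "Cannot open $log_file: $!";
    while ( my $line = <$log> ) {
        chomp $line;
        my( $when, $id, $doc ) = split /\t/, $line;
        if ( my $prev = $last{$id} ) {
            push @visits, [ $id, $prev->{doc}, $when - $prev->{when} ];
        }
        $last{$id} = { when => $when, doc => $doc };
    }
    close $log;
    return @visits;
}

# Demonstration with a temporary log file and fixed timestamps
my( undef, $file ) = tempfile( UNLINK => 1 );
log_request( $file, "abc", "/store/index.html", 1000 );
log_request( $file, "abc", "/store/faq.html",   1042 );
printf "%s viewed %s for %d seconds\n", @$_ for viewing_times( $file );
```

The exclusive lock keeps concurrent CGI processes from interleaving their log lines. The time estimate is necessarily rough, since the server sees only when a page was requested, not when the user stopped reading it.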

If you're curious as to what a modified URL looks like, here's an example:

http://localhost/store/.CC7e2BMb_H6UdK9KfPtR1g/faq.html
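
To make the URL format concrete, here is a minimal sketch of how a script could separate the identifier from the document path in such a URL. The helper name split_tracking_path is hypothetical, and the sketch assumes the identifier always appears as a dot-prefixed path segment directly after the base path.

```perl
use strict;
use warnings;

# Hypothetical helper: split "/store/.ID/faq.html" into the identifier
# and the original document path. Returns an undefined identifier and
# the unchanged path when the path carries no identifier.
sub split_tracking_path {
    my( $path, $base ) = @_;
    if ( $path =~ m{^(\Q$base\E)/\.([A-Za-z0-9_.-]+)(/.*)$} ) {
        return ( $2, "$1$3" );
    }
    return ( undef, $path );
}

my( $id, $doc ) = split_tracking_path(
    "/store/.CC7e2BMb_H6UdK9KfPtR1g/faq.html", "/store" );
print "id=$id doc=$doc\n";
```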

The identifier is a modified Base64 MD5 message digest, computed using various pieces of information from the request. The code to generate it looks like this:

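use Digest::MD5 qw( md5_base64 );

# md5_base64 is a plain function that digests the concatenation of
# its arguments, so no Digest::MD5 object is needed
my $remote = $ENV{REMOTE_ADDR} . $ENV{REMOTE_PORT};
my $id = md5_base64( time, $$, $remote );
$id =~ tr|+/=|-_.|;  # Make non-word chars URL-friendly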

This does a good job of generating a unique key for each request. However, it is not intended to create keys that cannot be cracked. If you are generating session identifiers that provide access to sensitive data, then you should use a more sophisticated method to generate an identifier.
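
One way to make identifiers harder to guess is to draw them from the operating system's random source instead of hashing predictable values such as the time and process ID. The following is a minimal sketch, assuming a Unix-like system that provides /dev/urandom; the helper name random_id is hypothetical, and this is not the approach used in the examples in this chapter.

```perl
use strict;
use warnings;

# Read 16 bytes (128 bits) from the kernel's random source and return
# them as 32 hexadecimal characters, which are already URL-safe.
sub random_id {
    my $source = "/dev/urandom";    # assumption: a Unix-like system
    open my $fh, "<:raw", $source or die "Cannot open $source: $!";
    my $got = read $fh, my $bytes, 16;
    close $fh;
    die "Short read from $source\n" unless defined $got and $got == 16;
    return unpack "H*", $bytes;
}

print random_id(), "\n";
```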

If you use Apache, you do not have to generate a unique identifier yourself if you build Apache with the mod_unique_id module. It creates a unique identifier for each request, which is available to your CGI script as $ENV{UNIQUE_ID}. mod_unique_id is included in the Apache distribution but not compiled by default.

Let's look at how we could construct code to parse HTML documents and insert identifiers. Example 11-1 shows a Perl module that we use to parse the request URL and HTML output.

Example 11-1. CGIBook::UserTracker.pm

#!/usr/bin/perl -wT

#/----------------------------------------------------------------
# UserTracker Module
# 
# Inherits from HTML::Parser
# 
# 

package CGIBook::UserTracker;

push @ISA, "HTML::Parser";

use strict;
use URI;
use HTML::Parser;

1;


#/----------------------------------------------------------------
# Public methods
# 

sub new {
    my( $class, $path ) = @_;
    my $id;
    
    if ( $ENV{PATH_INFO} and
         $ENV{PATH_INFO} =~ s|^/\.([a-z0-9_.-]*)/|/|i ) {
        $id = $1;
    }
    else {
        $id ||= unique_id(  );
    }
    
    my $self = $class->SUPER::new(  );
    $self->{user_id}    = $id;
    $self->{base_path}  = defined( $path ) ? $path : "";
        
    return $self;
}

sub base_path {
    my( $self, $path ) = @_;
    $self->{base_path} = $path if defined $path;
    return $self->{base_path};
}

sub user_id {
    my $self = shift;
    return $self->{user_id};
}


#/----------------------------------------------------------------
# Internal (private) subs
# 

sub unique_id {
    # Use Apache's mod_unique_id if available
    return $ENV{UNIQUE_ID} if exists $ENV{UNIQUE_ID};
    
    require Digest::MD5;
    
    my $remote = $ENV{REMOTE_ADDR} . $ENV{REMOTE_PORT};
    
    # Note this is intended to be unique, and not unguessable
    # It should not be used for generating keys to sensitive data
    # md5_base64 is a function, not a method, so call it directly
    my $id = Digest::MD5::md5_base64( time, $$, $remote );
    $id =~ tr|+/=|-_.|;  # Make non-word chars URL-friendly
    return $id;
}

sub encode {
    my( $self, $url ) = @_;
    my $uri  = new URI( $url, "http" );
    my $id   = $self->user_id(  );
    my $base = $self->base_path;
    
    # Links that do not fall under the base path are returned
    # unchanged rather than treated as fatal errors
    my $path = $uri->path;
    $path =~ s|^\Q$base\E/|$base/.$id/| or return $url;
    $uri->path( $path );
    
    return $uri->as_string;
}


#/----------------------------------------------------------------
# Subs to implement HTML::Parser callbacks
# 

sub start {
    my( $self, $tag, $attr, $attrseq, $origtext ) = @_;
    my $new_text = $origtext;
    
    my %relevant_pairs = (
        frame       => "src",
        a           => "href",
        area        => "href",
        form        => "action",
# Uncomment these lines if you want to track images too
#        img         => "src",
#        body        => "background",
    );
    
    while ( my( $rel_tag, $rel_attr ) = each %relevant_pairs ) {
        if ( $tag eq $rel_tag and $attr->{$rel_attr} ) {
            $attr->{$rel_attr} = $self->encode( $attr->{$rel_attr} );
            my @attribs = map { "$_=\"$attr->{$_}\"" } @$attrseq;
            $new_text = "<$tag @attribs>";
        }
    }
    
    # Meta refresh tags have a different format, handled separately
    if ( $tag eq "meta" and
         lc( $attr->{"http-equiv"} || "" ) eq "refresh" ) {
        my( $delay, $url ) =
            split /;\s*URL=/i, $attr->{content} || "", 2;
        # A refresh without a URL reloads the same page; leave it alone
        if ( defined $url ) {
            $attr->{content} = "$delay;URL=" . $self->encode( $url );
            my @attribs = map { "$_=\"$attr->{$_}\"" } @$attrseq;
            $new_text = "<$tag @attribs>";
        }
    }
    
    print $new_text;
}

sub declaration {
    my( $self, $decl ) = @_;
    # The parser strips the surrounding "<!" and ">", so restore them
    print "<!$decl>";
}

sub text {
    my( $self, $text ) = @_;
    print $text;
}

sub end {
    my( $self, $tag ) = @_;
    print "</$tag>";
}

sub comment {
    my( $self, $comment ) = @_;
    print "<!--$comment-->";
}

Example 11-2 shows the CGI application that we use to process static HTML pages.

Example 11-2. query_track.cgi

#!/usr/bin/perl -wT

use strict;
use CGIBook::UserTracker;

local *FILE;
my $track = new CGIBook::UserTracker;
$track->base_path( "/store" );

my $requested_doc = $ENV{PATH_TRANSLATED};
unless ( defined $requested_doc and -e $requested_doc ) {
    print "Location: /errors/not_found.html\n\n";
    exit;    # Stop here; there is no document to parse
}

open FILE, "<", $requested_doc or die "Failed to open $requested_doc: $!";

my $doc = do {
    local $/ = undef;
    <FILE>;
};

close FILE;

# This assumes we're only tracking HTML files:
print "Content-type: text/html\n\n";
$track->parse( $doc );
$track->eof;    # Flush any text the parser is still buffering

Once we have inserted the identifier into all the URLs, we simply send the modified content to the standard output stream, along with the content header.

Now that we've looked at how to maintain state between views of multiple HTML documents, our next step is to discuss persistence when using multiple forms. An online store, for example, is typically broken into multiple pages. We need to be able to identify users as they fill out each page. We'll look at techniques for solving such problems in the next section.
