Handling Input with CGI.pm (CGI Programming with Perl)

5.2. Handling Input with CGI.pm

CGI.pm primarily handles two separate tasks: it reads and parses input from the user, and it provides a convenient way to return HTML output. Let's first look at how it collects input.

5.2.1. Environment Information

CGI.pm provides many methods to get information about your environment. Of course, when you use CGI.pm, all of your standard CGI environment variables are still available in Perl's %ENV hash, but CGI.pm also makes most of these available via method calls. It also provides some unique methods. Table 5-1 shows how CGI.pm's functions correspond to the standard CGI environment variables.

Table 5-1. CGI.pm Environment Methods and CGI Environment Variables

CGI.pm Method	CGI Environment Variable
`auth_type`	AUTH_TYPE
Not available	CONTENT_LENGTH
`content_type`	CONTENT_TYPE
Not available	DOCUMENT_ROOT
Not available	GATEWAY_INTERFACE
`path_info`	PATH_INFO
`path_translated`	PATH_TRANSLATED
`query_string`	QUERY_STRING
`remote_addr`	REMOTE_ADDR
`remote_host`	REMOTE_HOST
`remote_ident`	REMOTE_IDENT
`remote_user`	REMOTE_USER
`request_method`	REQUEST_METHOD
`script_name`	SCRIPT_NAME
`self_url`	Not available
`server_name`	SERVER_NAME
`server_port`	SERVER_PORT
`server_protocol`	SERVER_PROTOCOL
`server_software`	SERVER_SOFTWARE
`url`	Not available
`Accept`	HTTP_ACCEPT
`http("Accept-charset")`	HTTP_ACCEPT_CHARSET
`http("Accept-encoding")`	HTTP_ACCEPT_ENCODING
`http("Accept-language")`	HTTP_ACCEPT_LANGUAGE
`raw_cookie`	HTTP_COOKIE
`http("From")`	HTTP_FROM
`virtual_host`	HTTP_HOST
`referer`	HTTP_REFERER
`user_agent`	HTTP_USER_AGENT
`https`	HTTPS
`https("Cipher")`	HTTPS_CIPHER
`https("Keysize")`	HTTPS_KEYSIZE
`https("SecretKeySize")`	HTTPS_SECRETKEYSIZE

Most of these CGI.pm methods take no arguments and return the same value as the corresponding environment variable. For example, to get the additional path information passed to your CGI script, you can use the following method:

my $path = $q->path_info;

This is the same information that you could also get this way:

my $path = $ENV{PATH_INFO};

However, a few methods differ or have features worth noting. Let's take a look at these.

5.2.1.1. Accept

As a general rule, if a CGI.pm method has the same name as a built-in Perl function or keyword (e.g., accept or tr), then the CGI.pm method is capitalized. Although there would be no collision if CGI.pm were available only via an object-oriented syntax, the collision creates problems for people who use it via the standard syntax. accept was originally lowercase, but it was renamed to Accept in version 2.44 of CGI.pm, and the new name affects both syntaxes.

Unlike the other methods that take no arguments and simply return a value, Accept can also be given a content type, and it will evaluate to true or false depending on whether that content type is acceptable according to the HTTP Accept header:

if ( $q->Accept( "image/png" ) ) {
    .
    .
    .

Keep in mind that most browsers today send */* in their Accept header. This matches anything, so using the Accept method in this manner is not especially useful. For new file formats like image/png, it is best to get the values for the HTTP header and perform the check yourself, ignoring wildcard matches (this is unfortunate, since it defeats the purpose of wildcards):

my @accept = $q->Accept;
if ( grep $_ eq "image/png", @accept ) {
    .
    .
    .

5.2.1.2. http

If the http method is called without arguments, it returns the names of the available environment variables that have an HTTP_ prefix. If you call http with an argument, then it will return the value of the corresponding HTTP_ environment variable. When passing an argument to http, the HTTP_ prefix is optional, capitalization does not matter, and hyphens and underscores are interpreted the same. In other words, you can pass the actual HTTP header field name or the environment variable name or even some hybrid of the two, and http will generally figure it out. Here is how you can display all the HTTP_ environment variables your CGI script receives:

#!/usr/bin/perl -wT

use strict;
use CGI;

my $q = new CGI;
print $q->header( "text/plain" );

print "These are the HTTP environment variables I received:\n\n";

foreach ( $q->http ) {
    print "$_:\n";
    print "  ", $q->http( $_ ), "\n";
}
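
The flexible naming that http accepts can be checked with a short sketch. It fakes one header by setting $ENV{HTTP_ACCEPT_LANGUAGE} by hand (an assumption made only so the script runs from the command line; a real web server sets it from the request) and then asks for the value using three different spellings:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use CGI;

# Assumption: simulate a header the web server would normally supply
$ENV{HTTP_ACCEPT_LANGUAGE} = "en-US,en;q=0.8";

# An empty string creates a query object with no parameters
my $q = CGI->new( "" );

# Header field name, environment variable name, and a hybrid of the two
my @spellings = ( "Accept-Language", "HTTP_ACCEPT_LANGUAGE", "accept_language" );
my @values    = map { scalar $q->http( $_ ) } @spellings;

print "$spellings[$_] => $values[$_]\n" for 0 .. $#spellings;
```

All three calls should print the same value, since each spelling resolves to the same environment variable.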

Note that the URL returned by self_url is not necessarily the same URL that was used to call your CGI script. Your CGI script may have been called because of an internal redirection by the web server. Also, because all of the parameters are moved to the query string, this new URL is built to be used with a GET request, even if the current request was a POST request.

5.2.1.6. url

The url method functions similarly to the self_url method, except that it returns a URL to the current CGI script without any parameters, i.e., no path information and an empty query string.

5.2.1.7. virtual_host

The virtual_host method is handy because it returns the value of the HTTP_HOST environment variable, if set, and SERVER_NAME otherwise. Remember that HTTP_HOST is the name of the web server as the browser referred to it, which may differ if multiple domains share the same IP address. HTTP_HOST is available only if the browser supplied the Host HTTP header, added for HTTP 1.1.

5.2.2. Accessing Parameters

param is probably the most useful method CGI.pm provides. It allows you to access the parameters submitted to your CGI script, whether these parameters come to you via a GET request or a POST request. If you call param without arguments, it will return a list of all of the parameter names your script received. If you provide a single argument to it, it will return the value for the parameter with that name. If no parameter with that name was submitted to your script, it returns undef.

It is possible for your CGI script to receive multiple values for a parameter with the same name. This happens when you create two form elements with the same name or you have a select box that allows multiple selections. In this case, param returns a list of all of the values if it is called in a list context and just the first value if it is called in a scalar context. This may sound a little complicated, but in practice it works such that you should end up with what you expect. If you ask param for one value, you will get one value (even if other values were also submitted), and if you ask it for a list, you will always get a list (even if the list contains only one element).
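
The difference between the two contexts is easiest to see side by side. The following is a minimal sketch that builds the query object from a literal query string (an assumption made so that it runs outside a web server). Recent CGI.pm releases print a warning when param is called in list context and offer a multi_param method for that case; the sketch keeps to param as described above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use CGI;

# Assumption: build the query from a string instead of a live request
my $q = CGI->new( "color=red&color=blue&shade=dark" );

my $first_color = $q->param( "color" );   # scalar context: first value only
my @all_colors  = $q->param( "color" );   # list context: every value
my @shades      = $q->param( "shade" );   # list context: one-element list

print "first color: $first_color\n";
print "all colors:  @all_colors\n";
print "shades:      ", scalar @shades, " value(s)\n";
```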

Example 5-1 is a simple example that displays all the parameters your script receives.

Example 5-1. param_list.cgi

#!/usr/bin/perl -wT

use strict;
use CGI;

my $q = new CGI;
print $q->header( "text/plain" );

print "These are the parameters I received:\n\n";

my( $name, $value );

foreach $name ( $q->param ) {
    print "$name:\n";
    foreach $value ( $q->param( $name ) ) {
        print "  $value\n";
    }
}

If you call this CGI script with multiple parameters, like this:

http://localhost/cgi/param_list.cgi?color=red&color=blue&shade=dark

you will get the following output:

These are the parameters I received:

color:
  red
  blue
shade:
  dark

5.2.2.1. Modifying parameters

CGI.pm also lets you add, modify, or delete the value of parameters within your script. To add or modify a parameter, just pass param more than one argument. Using Perl's => operator instead of a comma makes the code easier to read and allows you to omit the quotes around the parameter name, so long as it's a word (i.e., it contains only letters, numbers, and underscores) that does not conflict with a built-in function or keyword:

$q->param( title => "Web Developer" );

You can create a parameter with multiple values by passing additional arguments:

$q->param( hobbies => "Biking", "Windsurfing", "Music" );

To delete a parameter, use the delete method and provide the name of the parameter:

$q->delete( "age" );

You can clear all of the parameters with delete_all:

$q->delete_all;

It may seem odd that you would ever want to modify parameters yourself, since these will typically be coming from the user. Setting parameters is useful for many reasons, but especially when assigning default values to fields in forms. We will see how to do this later in this chapter.
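
As a rough illustration of the default-value case, the sketch below fills in any field the user left out before the rest of the script looks at it. The field names, the default values, and the literal query string are all hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use CGI;

# Assumption: the user filled in "name" but did not send "title" at all
my $q = CGI->new( "name=Ada" );

# Hypothetical defaults for this form
my %defaults = ( name => "Anonymous", title => "Web Developer" );

foreach my $field ( sort keys %defaults ) {
    my $current = $q->param( $field );            # scalar context
    next if defined $current && length $current;  # keep what the user sent
    $q->param( $field => $defaults{$field} );     # otherwise set the default
}

my $name  = $q->param( "name" );
my $title = $q->param( "title" );
print "name:  $name\n";
print "title: $title\n";
```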

5.2.2.2. POST and the query string

param automatically determines if the request method is POST or GET. If it is POST, it reads any parameters submitted to it from STDIN. If it is GET, it reads them from the query string. It is possible to POST information to a URL that already has a query string. In this case, you have two sources of input data, and because CGI.pm determines what to do by checking the request method, it will ignore the data in the query string.

You can change this behavior if you are willing to edit CGI.pm. In fact, CGI.pm includes comments to help you do this. You can find this block of code in the init subroutine (the line number will vary depending on the version of CGI.pm you have):

if ($meth eq 'POST') {
    $self->read_from_client(\*STDIN,\$query_string,$content_length,0)
        if $content_length > 0;
    # Some people want to have their cake and eat it too!
    # Uncomment this line to have the contents of the query string
    # APPENDED to the POST data.
    # $query_string .= (length($query_string) ? '&' : '') . $ENV{'QUERY_STRING'}
             if defined $ENV{'QUERY_STRING'};
    last METHOD;
}

By removing the pound sign from the beginning of the line indicated, you will be able to use POST and query string data together. Note that the line you would need to uncomment is too long to display on one line in this text, so it has been wrapped to the next line, but it is just one line in CGI.pm.
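
If editing the installed module is not an option, CGI.pm also provides a url_param method that reads parameters from the query string regardless of the request method. The sketch below is one possible way to use it. The parameter names page and comment are hypothetical, and the POST is only simulated: the query string is set by hand in %ENV and the literal string passed to the constructor stands in for the POSTed form data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use CGI;

# Assumption: simulate a form POSTed to /cgi/comment.cgi?page=2
$ENV{QUERY_STRING} = "page=2";
my $q = CGI->new( "comment=Nice" );   # stands in for the POSTed form data

my $comment     = $q->param( "comment" );     # from the form data
my $page        = $q->url_param( "page" );    # from the query string
my $page_via_param = $q->param( "page" );     # undef: param ignores the URL here

print "comment: $comment\n";
print "page:    $page\n";
print "param saw page? ", ( defined $page_via_param ? "yes" : "no" ), "\n";
```

This keeps the two sources of input separate, which may be preferable to appending the query string to the POST data when the same name could appear in both.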

5.2.2.3. Index queries

You may receive a query string that contains words that do not comprise name-value pairs. The <ISINDEX> HTML tag, which is not used much anymore, creates a single text field along with a prompt to enter search keywords. When a user enters words into this field and presses Enter, it makes a new request for the same URL, adding the text the user entered as the query string with keywords separated by a plus sign (+), such as this:

http://www.localhost.com/cgi/lookup.cgi?cgi+perl

You can retrieve the list of keywords that the user entered by calling param with "keywords" as the name of the parameter or by calling the separate keywords method:

my @words = $q->keywords;            # these lines do the same thing
my @words = $q->param( "keywords" );

These methods return index keywords only if CGI.pm finds no name-value pair parameters, so you don't have to worry about using "keywords" as the name of an element in your HTML forms; it will work correctly. On the other hand, if you want to POST form data to a URL with a keyword, CGI.pm cannot return that keyword to you. You must use $ENV{QUERY_STRING} to get it.

5.2.2.4. Supporting image buttons as submit buttons

Whether you use <INPUT TYPE="IMAGE" > or <INPUT TYPE="SUBMIT">, the form is still sent to the CGI script. However, with the image button, the name is not transmitted by itself. Instead, the web browser splits an image button name into two separate variables: name.x and name.y.

If you want your program to support image and regular submit buttons interchangeably, it is useful to translate the image button names to normal submit button names. Thus, the main program code can use logic based upon which submit button was clicked even if image buttons later replace them.

To accomplish this, we can use the following code that will set a form variable without the coordinates in the name for each variable that ends in ".x":

foreach ( $q->param ) {
    $q->param( $1, 1 ) if /^(.+)\.x$/;   # anchored so only a trailing ".x" matches
}

5.2.3. Exporting Parameters to a Namespace

One of the problems with using a method to retrieve the value of a parameter is that it is more work to embed the value in a string. If you wish to print the value of someone's input, you can use an intermediate variable:

my $user = $q->param( 'user' );
print "Hi, $user!";

Another way to do this is via an odd Perl construct that forces the subroutine to be evaluated as part of an anonymous list:

print "Hi, @{[ $q->param( 'user' ) ]}!";

The first solution is more work and the second can be hard to read. Fortunately, there is a better way. If you know that you are going to need to refer to many output values in a string, you can import all the parameters as variables to a specified namespace:

$q->import_names( "Q" );
print "Hi, $Q::user!";

Parameters with multiple values become arrays in the new namespace, and any characters in a parameter name other than a letter or number become underscores. You must provide a namespace and cannot pass "main", the default namespace, because that might create security risks.

The price you pay for this convenience is increased memory usage because Perl must create an alias for each parameter.

5.2.4. File Uploads with CGI.pm

As we mentioned in the last chapter, it is possible to create a form with a multipart/form-data media type that permits users to upload files via HTTP. We avoided discussing how to handle this type of input then because handling file uploads properly can be quite complex. Fortunately, there's no need for us to do this ourselves because CGI.pm provides a very simple interface for handling file uploads, just as it does for other form input.

You can access the name of an uploaded file with the param method, just like the value of any other form element. For example, if your CGI script were receiving input from the following HTML form:

<FORM ACTION="/cgi/upload.cgi" METHOD="POST" ENCTYPE="multipart/form-data">
  <P>Please choose a file to upload:
  <INPUT TYPE="FILE" NAME="file">
  <INPUT TYPE="SUBMIT">
</FORM>

then you could get the name of the uploaded file this way, by referring to the name of the file input element, in this case "file":

my $file = $q->param( "file" );

The name you receive from this parameter is the name of the file as it appeared on the user's machine when they uploaded it. CGI.pm stores the file as a temporary file on your system, but the name of this temporary file does not correspond to the name you get from this parameter. We will see how to access the temporary file in a moment.

The name supplied by this parameter varies according to platform and browser. Some systems supply just the name of the uploaded file; others supply the entire path of the file on the user's machine. Because path delimiters also vary between systems, it can be a challenge determining the name of the file. The following command appears to work for Windows, Macintosh, and Unix-compatible systems:

my( $file ) = $q->param( "file" ) =~ m|([^/:\\]+)$|;

However, it may strip parts of filenames, since "report 11/3/99" is a valid filename on Macintosh systems and the above command would in this case set $file to "99". Another solution is to replace any characters other than letters, digits, underscores, dashes, and periods with underscores and prevent any files from beginning with periods or dashes:

my $file = $q->param( "file" );
$file =~ s/([^\w.-])/_/g;
$file =~ s/^[-.]+//;

The problem with this is that Netscape's browsers on Windows send the full path to the file as the filename. Thus, $file may be set to something long and ugly like "C_ _ _Windows_Favorites_report.doc".
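
To compare the two approaches, the following sketch wraps each one in a small subroutine and runs both against a few sample names. The subroutine names and the sample paths are made up for illustration; they are not part of CGI.pm.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Approach 1: keep whatever follows the last /, :, or \ delimiter
sub last_path_component {
    my $name = shift;
    my( $base ) = $name =~ m|([^/:\\]+)$|;
    return $base;
}

# Approach 2: replace unusual characters, then drop leading periods and dashes
sub scrub_name {
    my $name = shift;
    $name =~ s/[^\w.-]/_/g;
    $name =~ s/^[-.]+//;
    return $name;
}

# Hypothetical upload names as different browsers might report them
my @samples = (
    'C:\\Windows\\Favorites\\report.doc',
    '/home/alice/report.doc',
    'report 11/3/99',
    '..profile',
);

foreach my $name ( @samples ) {
    printf "%-34s %-12s %s\n",
        $name, last_path_component( $name ), scrub_name( $name );
}
```

Running it shows the trade-off described above: the first approach shortens full paths nicely but cuts the Macintosh name down to its last component, while the second keeps every name whole at the cost of carrying the scrubbed path along.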

You could try to sort out the behaviors of the different operating systems and browsers, check for the user's browser and operating system, and then treat the filename appropriately, but that would be a very poor solution. You are bound to miss some combinations, you would constantly need to update it, and one of the greatest advantages of the Web is that it works across platforms; you should not build any limitations into your solutions.

So the simple, obvious solution is actually nontechnical. If you do need to know the name of the uploaded file, just add another text field to the form allowing the user to enter the name of the file they are uploading. This has the added advantage of allowing a user to provide a different name than the file has, if appropriate. The HTML form looks like this:

<FORM ACTION="/cgi/upload.cgi" METHOD="POST" ENCTYPE="multipart/form-data">
  <P>Please choose a file to upload:
  <INPUT TYPE="FILE" NAME="file">
  <P>Please enter the name of this file:
  <INPUT TYPE="TEXT" NAME="filename">
  <INPUT TYPE="SUBMIT">
</FORM>

You can then get the name from the text field, remembering to strip out any odd characters:

my $filename = $q->param( "filename" );
$filename =~ s/([^\w.-])/_/g;
$filename =~ s/^[-.]+//;

So now that we know how to get the name of the uploaded file, let's look at how we get at the content. CGI.pm creates a temporary file to store the contents of the upload; you can get a file handle for this file by passing the name of the file input element, in this case "file", to the upload method as follows:

my $file = $q->param( "file" );
my $fh   = $q->upload( "file" );

The upload method was added to CGI.pm in Version 2.47. Prior to this you could use the value returned by param (in this case $file) as a file handle in order to read from the file; if you use it as a string it returns the name of the file. This actually still works, but there are conflicts with strict mode and other problems, so upload is the preferred way to get a file handle now. Be sure that you pass upload the name of the form field and not the name of the file (e.g., the value returned by param, the name the user supplied, the name with nonalphanumeric characters replaced with underscores, etc.).

Note that transfer errors are much more common with file uploads than with other forms of input. If the user presses the Stop button in the browser as the file is uploading, for example, CGI.pm will receive only a portion of the uploaded file. Because of the format of multipart/form-data requests, CGI.pm will recognize that the transfer is incomplete. You can check for errors such as this by using the cgi_error method after creating a CGI.pm object. It returns the HTTP status code and message corresponding to the error, if applicable, or an empty string if no error has occurred. For instance, if the Content-length of a POST request exceeds $CGI::POST_MAX, then cgi_error will return "413 Request entity too large". As a general rule, you should always check for an error when you are recording input on the server. This includes file uploads and other POST requests. It doesn't hurt to check for an error with GET requests either.

Example 5-2 provides the complete code, with error checking, to receive a file upload via our previous HTML form.

Example 5-2. upload.cgi

#!/usr/bin/perl -wT

use strict;
use CGI;
use Fcntl qw( :DEFAULT :flock );

use constant UPLOAD_DIR     => "/usr/local/apache/data/uploads";
use constant BUFFER_SIZE    => 16_384;
use constant MAX_FILE_SIZE  => 1_048_576;       # Limit each upload to 1 MB
use constant MAX_DIR_SIZE   => 100 * 1_048_576; # Limit total uploads to 100 MB
use constant MAX_OPEN_TRIES => 100;

$CGI::DISABLE_UPLOADS   = 0;
$CGI::POST_MAX          = MAX_FILE_SIZE;

my $q = new CGI;
$q->cgi_error and error( $q, "Error transferring file: " . $q->cgi_error );

my $file      = $q->param( "file" )     || error( $q, "No file received." );
my $filename  = $q->param( "filename" ) || error( $q, "No filename entered." );
my $fh        = $q->upload( "file" );   # upload takes the form field name
my $buffer    = "";

if ( dir_size( UPLOAD_DIR ) + $ENV{CONTENT_LENGTH} > MAX_DIR_SIZE ) {
    error( $q, "Upload directory is full." );
}

# Allow letters, digits, periods, underscores, dashes
# Convert anything else to an underscore
$filename =~ s/[^\w.-]/_/g;
if ( $filename =~ /^(\w[\w.-]*)/ ) {
    $filename = $1;
}
else {
    error( $q, "Invalid file name; files must start with a letter or number." );
}

# Open output file, making sure the name is unique
until ( sysopen OUTPUT, UPLOAD_DIR . "/$filename", O_WRONLY | O_CREAT | O_EXCL ) {
    # Add or increment a counter before the extension (which is optional)
    $filename =~ s/(\d*)((?:\.\w+)?)$/($1||0) + 1 . $2/e;
    ( $1 || 0 ) >= MAX_OPEN_TRIES and error( $q, "Unable to save your file." );
}

# This is necessary for non-Unix systems; does nothing on Unix
binmode $fh;
binmode OUTPUT;

# Write contents to output file
while ( read( $fh, $buffer, BUFFER_SIZE ) ) {
    print OUTPUT $buffer;
}

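close OUTPUT;

# Send a response so the browser does not report a server error
print $q->header( "text/plain" ), "File received.\n";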


sub dir_size {
    my $dir = shift;
    my $dir_size = 0;
    
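    # Loop through files and sum the sizes; doesn't descend down subdirs
    opendir DIR, $dir or die "Unable to open $dir: $!";
    # Assign explicitly; older Perls do not set $_ for readdir in a while loop
    while ( defined( my $file = readdir DIR ) ) {
        next unless -f "$dir/$file";    # skip ".", "..", and subdirectories
        $dir_size += -s _;              # reuse the stat from the -f test
    }
    closedir DIR;
    return $dir_size;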
}


sub error {
    my( $q, $reason ) = @_;
    
    print $q->header( "text/html" ),
          $q->start_html( "Error" ),
          $q->h1( "Error" ),
          $q->p( "Your upload was not processed because the following error ",
                 "occurred: " ),
          $q->p( $q->i( $reason ) ),
          $q->end_html;
    exit;
}

We start by creating several constants to configure this script. UPLOAD_DIR is the path to the directory where we will store uploaded files. BUFFER_SIZE is the amount of data to read into memory while transferring from the temporary file to the output file. MAX_FILE_SIZE is the maximum file size we will accept; this is important because we want to limit users from uploading gigabyte-sized files and filling up all of the server's disk space. MAX_DIR_SIZE is the maximum size that we will allow our upload directory to grow to. This restriction is as important as the last because users can fill up our disks by posting lots of small files just as easily as posting large files. Finally, MAX_OPEN_TRIES is the number of times we try to generate a unique filename and open that file before we give up; we'll see why this step is necessary in a moment.

First, we enable file uploads, then we set $CGI::POST_MAX to MAX_FILE_SIZE. Note that $CGI::POST_MAX limits the size of the entire content of the request, which includes the data for other form fields as well as overhead for the multipart/form-data encoding, so the largest file that the script will accept is a little smaller than this value. For this form, the difference is minor, but if you add a file upload field to a complex form with multiple text fields, then you should keep this distinction in mind.

We then create a CGI object and check for errors. As we said earlier, errors with file uploads are much more common than with other forms of CGI input. Next we get the file's upload name and the filename the user provided, reporting errors if either of these is missing. Note that a user may be rather upset to get a message saying that the filename is missing after uploading a large file via a modem. There is no way to interrupt that transfer, but in a production application, it might be more user-friendly to save the unnamed file temporarily, prompt the user for a filename, and then rename the file. Of course, you would then need to periodically clean up temporary files that were abandoned.

We get a file handle, $fh, to the temporary file where CGI.pm has stored the input. We check whether our upload directory is full and report an error if this is the case. Again, this message is likely to create some unhappy users. In a production application you should add code to notify an administrator who can see why the upload directory is full and resolve the problem. See Chapter 9, "Sending Email".

Next, we replace any characters in the filename the user supplied that may cause problems with an underscore and make sure the name doesn't start with a period or a dash. The odd construct that reassigns the result of the regular expression to $filename untaints that variable. We'll discuss tainting and why this is important in Chapter 8, "Security". We confirm again that $filename is not empty (which would happen if it had consisted of nothing but periods and/or dashes) and generate an error if this is the case.
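
The cleanup and validation step can be tried on its own, outside the upload script. The sketch below moves the same two regular expressions into a subroutine; the name clean_filename and the sample inputs are hypothetical. It returns undef where the script above would call error.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns a cleaned-up name, or undef if nothing usable is left.
# Returning the capture ($1) rather than $name itself is what
# untaints the value when the script runs under -T.
sub clean_filename {
    my $name = shift;
    $name =~ s/[^\w.-]/_/g;     # anything unusual becomes an underscore
    return $name =~ /^(\w[\w.-]*)/ ? $1 : undef;
}

foreach my $input ( "My Report.doc", "notes.txt", ".htaccess", "---" ) {
    my $clean = clean_filename( $input );
    print "$input => ", ( defined $clean ? $clean : "(rejected)" ), "\n";
}
```

Note that a name beginning with a period or dash is rejected outright here rather than trimmed, which matches the error message in Example 5-2.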