

CGI Programming with Perl

5.2. Handling Input with CGI.pm

CGI.pm primarily handles two separate tasks: it reads and parses input from the user, and it provides a convenient way to return HTML output. Let's first look at how it collects input.

5.2.1. Environment Information

CGI.pm provides many methods to get information about your environment. Of course, when you use CGI.pm, all of your standard CGI environment variables are still available in Perl's %ENV hash, but CGI.pm also makes most of these available via method calls. It also provides a few methods of its own that have no corresponding environment variable. Table 5-1 shows how CGI.pm's methods correspond to the standard CGI environment variables.

Table 5-1. CGI.pm Environment Methods and CGI Environment Variables

CGI.pm Method              CGI Environment Variable

auth_type                  AUTH_TYPE
Not available              CONTENT_LENGTH
content_type               CONTENT_TYPE
Not available              DOCUMENT_ROOT
Not available              GATEWAY_INTERFACE
path_info                  PATH_INFO
path_translated            PATH_TRANSLATED
query_string               QUERY_STRING
remote_addr                REMOTE_ADDR
remote_host                REMOTE_HOST
remote_ident               REMOTE_IDENT
remote_user                REMOTE_USER
request_method             REQUEST_METHOD
script_name                SCRIPT_NAME
self_url                   Not available
server_name                SERVER_NAME
server_port                SERVER_PORT
server_protocol            SERVER_PROTOCOL
server_software            SERVER_SOFTWARE
url                        Not available
Accept                     HTTP_ACCEPT
http("Accept-charset")     HTTP_ACCEPT_CHARSET
http("Accept-encoding")    HTTP_ACCEPT_ENCODING
http("Accept-language")    HTTP_ACCEPT_LANGUAGE
raw_cookie                 HTTP_COOKIE
http("From")               HTTP_FROM
virtual_host               HTTP_HOST
referer                    HTTP_REFERER
user_agent                 HTTP_USER_AGENT
https                      HTTPS
https("Cipher")            HTTPS_CIPHER
https("Keysize")           HTTPS_KEYSIZE
https("SecretKeySize")     HTTPS_SECRETKEYSIZE

Most of these CGI.pm methods take no arguments and return the same value as the corresponding environment variable. For example, to get the additional path information passed to your CGI script, you can use the following method:

my $path = $q->path_info;

This is the same information that you could also get this way:

my $path = $ENV{PATH_INFO};

However, a few methods differ or have features worth noting. Let's take a look at these.

5.2.2. Accessing Parameters

param is probably the most useful method CGI.pm provides. It allows you to access the parameters submitted to your CGI script, whether these parameters come to you via a GET request or a POST request. If you call param without arguments, it will return a list of all of the parameter names your script received. If you provide a single argument to it, it will return the value for the parameter with that name. If no parameter with that name was submitted to your script, it returns undef.

It is possible for your CGI script to receive multiple values for a parameter with the same name. This happens when you create two form elements with the same name or you have a select box that allows multiple selections. In this case, param returns a list of all of the values if it is called in a list context and just the first value if it is called in a scalar context. This may sound a little complicated, but in practice it works such that you should end up with what you expect. If you ask param for one value, you will get one value (even if other values were also submitted), and if you ask it for a list, you will always get a list (even if the list contains only one element).
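The difference between the two contexts can be seen in a short sketch. It assumes CGI.pm is installed (it is no longer bundled with Perl as of version 5.22) and, purely for illustration, builds the CGI object from a literal query string instead of a real request. Recent CGI.pm releases may print a one-time warning about calling param in list context.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use CGI;

# Illustration only: a literal query string stands in for a real request
my $q = CGI->new( "color=red&color=blue&shade=dark" );

my $first  = $q->param( "color" );    # scalar context: first value only
my @colors = $q->param( "color" );    # list context: every value
my @shades = $q->param( "shade" );    # list context: a one-element list
my $size   = $q->param( "size" );     # never submitted: undef

print "first color: $first\n";
print "all colors:  @colors\n";
print "shades:      @shades\n";
print "size:        ", ( defined $size ? $size : "(undef)" ), "\n";
```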

Example 5-1 is a simple example that displays all the parameters your script receives.

Example 5-1. param_list.cgi

#!/usr/bin/perl -wT

use strict;
use CGI;

my $q = new CGI;
print $q->header( "text/plain" );

print "These are the parameters I received:\n\n";

my( $name, $value );

foreach $name ( $q->param ) {
    print "$name:\n";
    foreach $value ( $q->param( $name ) ) {
        print "  $value\n";
    }
}

If you call this CGI script with multiple parameters, like this:

http://localhost/cgi/param_list.cgi?color=red&color=blue&shade=dark

you will get the following output:

These are the parameters I received:

color:
  red
  blue
shade:
  dark

5.2.4. File Uploads with CGI.pm

As we mentioned in the last chapter, it is possible to create a form with a multipart/form-data media type that permits users to upload files via HTTP. We avoided discussing how to handle this type of input then because handling file uploads properly can be quite complex. Fortunately, there's no need for us to do this ourselves because CGI.pm provides a very simple interface for handling file uploads, just as it does for other form input.

You can access the name of an uploaded file with the param method, just like the value of any other form element. For example, if your CGI script were receiving input from the following HTML form:

<FORM ACTION="/cgi/upload.cgi" METHOD="POST" ENCTYPE="multipart/form-data">
  <P>Please choose a file to upload:
  <INPUT TYPE="FILE" NAME="file">
  <INPUT TYPE="SUBMIT">
</FORM>

then you could get the name of the uploaded file this way, by referring to the name of the file input element, in this case "file":

my $file = $q->param( "file" );

The name you receive from this parameter is the name of the file as it appeared on the user's machine when they uploaded it. CGI.pm stores the file as a temporary file on your system, but the name of this temporary file does not correspond to the name you get from this parameter. We will see how to access the temporary file in a moment.

The name supplied by this parameter varies according to platform and browser. Some systems supply just the name of the uploaded file; others supply the entire path of the file on the user's machine. Because path delimiters also vary between systems, it can be a challenge determining the name of the file. The following command appears to work for Windows, Macintosh, and Unix-compatible systems:

my( $file ) = $q->param( "file" ) =~ m|([^/:\\]+)$|;

However, it may strip parts of filenames, since "report 11/3/99" is a valid filename on Macintosh systems and the above command would in this case set $file to "99". Another solution is to replace any characters other than letters, digits, underscores, dashes, and periods with underscores and prevent any files from beginning with periods or dashes:

my $file = $q->param( "file" );
$file =~ s/([^\w.-])/_/g;
$file =~ s/^[-.]+//;

The problem with this is that Netscape's browsers on Windows send the full path to the file as the filename. Thus, $file may be set to something long and ugly like "C__Windows_Favorites_report.doc".
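To compare the two approaches side by side, here is a minimal sketch that runs both commands over a few filenames. The helper names last_component and replace_odd_chars are hypothetical, and the sample values are illustrative, not captured from real browsers.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helpers wrapping the two commands shown above
sub last_component {
    my ( $path ) = @_;
    my ( $name ) = $path =~ m|([^/:\\]+)$|;
    return $name;
}

sub replace_odd_chars {
    my ( $name ) = @_;
    $name =~ s/[^\w.-]/_/g;    # anything but letters, digits, _ . - becomes _
    $name =~ s/^[-.]+//;       # no leading periods or dashes
    return $name;
}

# Sample values a browser might send (illustrative only)
my @samples = (
    'report.doc',
    '/home/user/report.doc',
    'C:\Windows\Favorites\report.doc',
    'report 11/3/99',
);

foreach my $sample ( @samples ) {
    printf "%-33s => %-12s | %s\n",
        $sample, last_component( $sample ), replace_odd_chars( $sample );
}
```

Neither helper gives a satisfying result for every sample: the first truncates the Macintosh name, and the second keeps the whole Windows path.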

You could try to sort out the behaviors of the different operating systems and browsers, check for the user's browser and operating system, and then treat the filename appropriately, but that would be a very poor solution. You are bound to miss some combinations, you would constantly need to update it, and one of the greatest advantages of the Web is that it works across platforms; you should not build any limitations into your solutions.

So the simple, obvious solution is actually nontechnical. If you do need to know the name of the uploaded file, just add another text field to the form allowing the user to enter the name of the file they are uploading. This has the added advantage of allowing a user to provide a different name than the file has, if appropriate. The HTML form looks like this:

<FORM ACTION="/cgi/upload.cgi" METHOD="POST" ENCTYPE="multipart/form-data">
  <P>Please choose a file to upload:
  <INPUT TYPE="FILE" NAME="file">
  <P>Please enter the name of this file:
  <INPUT TYPE="TEXT" NAME="filename">
  <INPUT TYPE="SUBMIT">
</FORM>

You can then get the name from the text field, remembering to strip out any odd characters:

my $filename = $q->param( "filename" );
$filename =~ s/([^\w.-])/_/g;
$filename =~ s/^[-.]+//;

Now that we know how to get the name of the uploaded file, let's look at how we get at the content. CGI.pm creates a temporary file to store the contents of the upload; you can get a file handle for this file by passing the name of the file input element (the same name you pass to param) to the upload method as follows:

my $file = $q->param( "file" );
my $fh   = $q->upload( "file" );

The upload method was added to CGI.pm in Version 2.47. Prior to this you could use the value returned by param (in this case $file) as a file handle in order to read from the file; if you use it as a string it returns the name of the file. This still works, but there are conflicts with strict mode and other problems, so upload is now the preferred way to get a file handle. Be sure that you pass upload the name of the form field, and not a filename (e.g., the name returned by param, the name the user supplied, the name with nonalphanumeric characters replaced with underscores, etc.); upload returns undef if its argument does not name an upload field.

Note that transfer errors are much more common with file uploads than with other forms of input. If the user presses the Stop button in the browser as the file is uploading, for example, CGI.pm will receive only a portion of the uploaded file. Because of the format of multipart/form-data requests, CGI.pm will recognize that the transfer is incomplete. You can check for errors such as this by using the cgi_error method after creating a CGI.pm object. It returns the HTTP status code and message corresponding to the error, if applicable, or an empty string if no error has occurred. For instance, if the Content-length of a POST request exceeds $CGI::POST_MAX, then cgi_error will return "413 Request entity too large". As a general rule, you should always check for an error when you are recording input on the server. This includes file uploads and other POST requests. It doesn't hurt to check for an error with GET requests either.

Example 5-2 provides the complete code, with error checking, to receive a file upload via our previous HTML form.

Example 5-2. upload.cgi

#!/usr/bin/perl -wT

use strict;
use CGI;
use Fcntl qw( :DEFAULT :flock );

use constant UPLOAD_DIR     => "/usr/local/apache/data/uploads";
use constant BUFFER_SIZE    => 16_384;
use constant MAX_FILE_SIZE  => 1_048_576;       # Limit each upload to 1 MB
use constant MAX_DIR_SIZE   => 100 * 1_048_576; # Limit total uploads to 100 MB
use constant MAX_OPEN_TRIES => 100;

$CGI::DISABLE_UPLOADS   = 0;
$CGI::POST_MAX          = MAX_FILE_SIZE;

my $q = new CGI;
$q->cgi_error and error( $q, "Error transferring file: " . $q->cgi_error );

my $file      = $q->param( "file" )     || error( $q, "No file received." );
my $filename  = $q->param( "filename" ) || error( $q, "No filename entered." );
my $fh        = $q->upload( "file" )    || error( $q, "Unable to read upload." );
my $buffer    = "";

if ( dir_size( UPLOAD_DIR ) + $ENV{CONTENT_LENGTH} > MAX_DIR_SIZE ) {
    error( $q, "Upload directory is full." );
}

# Allow letters, digits, periods, underscores, dashes
# Convert anything else to an underscore
$filename =~ s/[^\w.-]/_/g;
if ( $filename =~ /^(\w[\w.-]*)/ ) {
    $filename = $1;
}
else {
    error( $q, "Invalid file name; files must start with a letter or number." );
}

# Open output file for writing, making sure the name is unique
my $tries = 0;
until ( sysopen OUTPUT, UPLOAD_DIR . "/$filename", O_WRONLY | O_CREAT | O_EXCL ) {
    ++$tries >= MAX_OPEN_TRIES and error( $q, "Unable to save your file." );
    $filename =~ s/(\d*)(\.\w+|)$/($1||0) + 1 . $2/e;
}

# This is necessary for non-Unix systems; does nothing on Unix
binmode $fh;
binmode OUTPUT;

# Write contents to output file
while ( read( $fh, $buffer, BUFFER_SIZE ) ) {
    print OUTPUT $buffer;
}

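close OUTPUT;

# Send a response so the server does not report an empty reply
print $q->header( "text/plain" ), "Your file was saved as $filename.\n";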


sub dir_size {
    my $dir = shift;
    my $dir_size = 0;
    
    # Loop through files and sum the sizes; doesn't descend down subdirs
    opendir DIR, $dir or die "Unable to open $dir: $!";
    while ( defined( my $entry = readdir DIR ) ) {
        # -s yields no number for empty or missing files, so default to 0
        $dir_size += ( -s "$dir/$entry" ) || 0;
    }
    closedir DIR;
    return $dir_size;
}


sub error {
    my( $q, $reason ) = @_;
    
    print $q->header( "text/html" ),
          $q->start_html( "Error" ),
          $q->h1( "Error" ),
          $q->p( "Your upload was not processed because the following error ",
                 "occurred: " ),
          $q->p( $q->i( $reason ) ),
          $q->end_html;
    exit;
}

We start by creating several constants to configure this script. UPLOAD_DIR is the path to the directory where we will store uploaded files. BUFFER_SIZE is the amount of data to read into memory at a time while transferring from the temporary file to the output file. MAX_FILE_SIZE is the maximum file size we will accept; this is important because we want to prevent users from uploading gigabyte-sized files and filling up all of the server's disk space. MAX_DIR_SIZE is the maximum size that we will allow our upload directory to grow to. This restriction is as important as the last because users can fill up our disks by posting lots of small files just as easily as by posting large files. Finally, MAX_OPEN_TRIES is the number of times we try to generate a unique filename and open that file before we give up; we'll see why this step is necessary in a moment.

First, we enable file uploads, then we set $CGI::POST_MAX to MAX_FILE_SIZE. Note that $CGI::POST_MAX limits the size of the entire content of the request, which includes the data for other form fields as well as overhead for the multipart/form-data encoding, so the largest file the script will accept is a little smaller than this value. For this form, the difference is minor, but if you add a file upload field to a complex form with multiple text fields, then you should keep this distinction in mind.
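To get a feel for that overhead, the following sketch assembles by hand a request body similar to what a browser might send for this form and compares its length with the size of the file alone. The boundary string, the field values, and the 1,000-byte payload are all made up for illustration; real browsers choose their own boundaries and may send additional headers.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $boundary = "----ExampleBoundary12345";    # made up; browsers pick their own
my $content  = "x" x 1000;                    # stand-in for the file's bytes

# One part per form field, each introduced by a boundary line
my $body = join "",
    "--$boundary\r\n",
    qq(Content-Disposition: form-data; name="file"; filename="report.txt"\r\n),
    "Content-Type: text/plain\r\n\r\n",
    $content, "\r\n",
    "--$boundary\r\n",
    qq(Content-Disposition: form-data; name="filename"\r\n\r\n),
    "report.txt\r\n",
    "--$boundary--\r\n";

my $overhead = length( $body ) - length( $content );

print "File size:    ", length( $content ), " bytes\n";
print "Request body: ", length( $body ), " bytes\n";
print "Overhead:     $overhead bytes\n";
```

Because the limit applies to the length of the whole body, a file of exactly MAX_FILE_SIZE bytes would be rejected.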

We then create a CGI object and check for errors. As we said earlier, errors with file uploads are much more common than with other forms of CGI input. Next we get the file's upload name and the filename the user provided, reporting errors if either of these is missing. Note that a user may be rather upset to get a message saying that the filename is missing after uploading a large file via a modem. There is no way to interrupt that transfer, but in a production application, it might be more user-friendly to save the unnamed file temporarily, prompt the user for a filename, and then rename the file. Of course, you would then need to periodically clean up temporary files that were abandoned.

We get a file handle, $fh, to the temporary file where CGI.pm has stored the input. We check whether our upload directory is full and report an error if this is the case. Again, this message is likely to create some unhappy users. In a production application you should add code to notify an administrator who can see why the upload directory is full and resolve the problem. See Chapter 9, "Sending Email".

Next, we replace any characters in the filename the user supplied that may cause problems with an underscore and make sure the name doesn't start with a period or a dash. The odd construct that reassigns the result of the regular expression to $filename untaints that variable. We'll discuss tainting and why this is important in Chapter 8, "Security". We confirm again that $filename is not empty (which would happen if it had consisted of nothing but periods and/or dashes) and generate an error if this is the case.
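The cleaning step can be tried on its own, as in the sketch below. The helper name clean_filename and the sample names are hypothetical; the two regular expressions are the ones used in upload.cgi. The sketch does not enable taint mode, so it only shows which names are accepted and which are rejected.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns a cleaned name, or undef if nothing usable remains
sub clean_filename {
    my ( $name ) = @_;
    $name =~ s/[^\w.-]/_/g;

    # Under taint mode (-T), a value captured by a regex is considered clean
    return $name =~ /^(\w[\w.-]*)/ ? $1 : undef;
}

foreach my $name ( "my report.txt", "../../etc/passwd", "-rf", "..." ) {
    my $clean = clean_filename( $name );
    printf "%-18s => %s\n", $name, defined $clean ? $clean : "(rejected)";
}
```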

We try to open a file with this name in our upload directory. If we fail, then we add a digit to $filename and try again. The regular expression allows us to keep the file extension the same: if there is already a report.txt file, then the next upload with that name will be named report1.txt, the next one report2.txt, etc. This continues until we exceed MAX_OPEN_TRIES. It is important that we create a limit to this loop because there may be a reason other than a non-unique name that prevents us from saving the file. If the disk is full or the system has too many open files, for example, we do not want to start looping endlessly. This error should also notify an administrator that something is wrong.
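The naming scheme can be exercised outside of a CGI script. The sketch below is a variation on the loop in upload.cgi, not a copy of it: the helper open_unique is hypothetical, it uses a lexical file handle and a scratch directory from File::Temp, and its regular expression is adjusted so that names without an extension are numbered too.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw( :DEFAULT );
use File::Temp qw( tempdir );

# Hypothetical helper: create $name in $dir, or a numbered variant of it.
# Returns the file handle and the name actually used, or nothing on failure.
sub open_unique {
    my ( $dir, $name, $max_tries ) = @_;

    for ( 1 .. $max_tries ) {
        # O_EXCL makes the open fail if the file already exists
        if ( sysopen( my $fh, "$dir/$name", O_WRONLY | O_CREAT | O_EXCL ) ) {
            return ( $fh, $name );
        }
        # Bump the number in front of the extension: report.txt -> report1.txt
        $name =~ s/(\d*)(\.\w+|)$/ ( $1 || 0 ) + 1 . $2 /e;
    }
    return;
}

my $dir = tempdir( CLEANUP => 1 );    # scratch directory, removed at exit
my @names;

for ( 1 .. 3 ) {
    my ( $fh, $name_used ) = open_unique( $dir, "report.txt", 100 )
        or die "Unable to create a unique file";
    close $fh;
    push @names, $name_used;
}

print "@names\n";
```

Letting sysopen with O_EXCL do the existence check avoids the race that a separate "does the file exist?" test followed by an open would introduce.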

This script is written to handle any type of file upload, including binary files such as images or audio. By default, whenever Perl accesses a file handle on non-Unix systems (more specifically, systems that do not use \n as their end of line character), Perl translates the native operating system's end of line characters, such as \r\n for Windows or \r for MacOS, to \n on input and back to the native characters on output. This works great for text files, but it can corrupt binary files. Thus, we enable binary mode with the binmode function in order to disable this translation. On systems, like Unix, where no end of line translation occurs, binmode has no effect.
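The effect of end of line translation can be reproduced on any platform by requesting Perl's :crlf I/O layer explicitly, which performs the translation that text mode applies on Windows. The following sketch, with made-up file names in a scratch directory, writes the same eight bytes once through :crlf and once in binary mode and then compares the resulting file sizes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw( tempdir );

my $dir  = tempdir( CLEANUP => 1 );
my $data = "\x89PNG\r\n\x1a\n";    # the 8-byte PNG signature as sample binary data

# Write once through the :crlf layer, which translates each \n on output ...
open my $text, ">:crlf", "$dir/text.out" or die "Cannot open text.out: $!";
print $text $data;
close $text;

# ... and once in binary mode, which writes the bytes untouched
open my $bin, ">", "$dir/binary.out" or die "Cannot open binary.out: $!";
binmode $bin;
print $bin $data;
close $bin;

my $text_size = -s "$dir/text.out";
my $bin_size  = -s "$dir/binary.out";

print "original data: ", length( $data ), " bytes\n";
print "via :crlf:     $text_size bytes\n";
print "via binmode:   $bin_size bytes\n";
```

The sample data contains two \n bytes, so the translated copy should come out two bytes longer than the original, which is enough to corrupt an image.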

Finally, we read from our temporary file handle and write to our output file and exit. We use the read function to read and write a chunk of data at a time. The size of this chunk is defined by our BUFFER_SIZE constant. In case you are wondering, CGI.pm will remove its temporary file automatically when our script exits (technically, when $q goes out of scope).
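The copy loop is easy to try in isolation. In this sketch, in-memory file handles stand in for CGI.pm's temporary file and for the output file, and the helper name copy_in_chunks is hypothetical. Unlike the loop in upload.cgi, it also checks the result of print and counts the chunks.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use constant BUFFER_SIZE => 16_384;

# Hypothetical helper: copy everything from $in to $out, one buffer at a time.
# Returns the number of bytes copied and the number of chunks read.
sub copy_in_chunks {
    my ( $in, $out ) = @_;
    my ( $buffer, $total, $chunks ) = ( "", 0, 0 );

    # read returns the number of bytes read, 0 at end of file, undef on error
    while ( my $bytes = read( $in, $buffer, BUFFER_SIZE ) ) {
        print $out $buffer or die "Write failed: $!";
        $total += $bytes;
        $chunks++;
    }
    return ( $total, $chunks );
}

# In-memory file handles stand in for the temporary file and the output file
my $source = "A" x 50_000;
my $copy   = "";
open my $in,  "<", \$source or die "Cannot open source: $!";
open my $out, ">", \$copy   or die "Cannot open copy: $!";
binmode $in;
binmode $out;

my ( $total, $chunks ) = copy_in_chunks( $in, $out );
close $out;

print "Copied $total bytes in $chunks chunks\n";
```

Only one buffer's worth of data is held in memory at a time, no matter how large the upload is.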

There is another way we could have moved the file to our uploads directory. We could have used CGI.pm's undocumented tmpFileName method to get the name of the temporary file containing the upload and then used Perl's rename function to move the file. However, relying on undocumented code is dangerous, because it may not be compatible with future versions of CGI.pm. Thus, in our example we stick to the public API instead.

The dir_size subroutine calculates the size of a directory by summing the size of each of its files. The error subroutine prints a message telling the user why the transfer failed. In a production application, you probably want to provide links for the user to get help or to notify someone about problems.
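As a rough illustration of the summing logic, the sketch below runs a variant of dir_size against a scratch directory. The variant differs from the one in upload.cgi in that it uses a lexical directory handle and counts only plain files; the file names and sizes are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw( tempdir );

# A variant of dir_size that counts only plain files
sub dir_size {
    my $dir      = shift;
    my $dir_size = 0;

    opendir my $dh, $dir or die "Unable to open $dir: $!";
    while ( defined( my $entry = readdir $dh ) ) {
        my $path = "$dir/$entry";
        next unless -f $path;                # skips ".", "..", and subdirectories
        $dir_size += ( -s $path ) || 0;      # -s gives no number for empty files
    }
    closedir $dh;
    return $dir_size;
}

# Build a scratch directory with three files of known size and one subdirectory
my $dir   = tempdir( CLEANUP => 1 );
my %sizes = ( "a.txt" => 10, "b.txt" => 20, "c.txt" => 30 );

while ( my ( $name, $size ) = each %sizes ) {
    open my $fh, ">", "$dir/$name" or die "Cannot create $name: $!";
    binmode $fh;
    print $fh "x" x $size;
    close $fh;
}
mkdir "$dir/subdir" or die "Cannot create subdir: $!";

print "Total: ", dir_size( $dir ), " bytes\n";
```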




Copyright © 2001 O'Reilly & Associates. All rights reserved.