19. CGI Programming

Contents:
Introduction
Writing a CGI Script
Redirecting Error Messages
Fixing a 500 Server Error
Writing a Safe CGI Program
Making CGI Scripts Efficient
Executing Commands Without Shell Escapes
Formatting Lists and Tables with HTML Shortcuts
Redirecting to a Different Location
Debugging the Raw HTTP Exchange
Managing Cookies
Creating Sticky Widgets
Writing a Multiscreen CGI Script
Saving a Form to a File or Mail Pipe
Program: chemiserie

A successful tool is one that was used to do something undreamt of by its author.

- Stephen C. Johnson

19.0. Introduction

Changes in the environment or the availability of food can make certain species more successful than others at getting food or avoiding predators. Many scientists believe a comet struck the earth millions of years ago, throwing an enormous cloud of dust into the atmosphere. Subsequent radical changes to the environment proved too much for some organisms, say dinosaurs, and hastened their extinction. Other creatures, such as mammals, found new food supplies and freshly exposed habitats to compete in.

Much as the comet altered the environment for prehistoric species, the Web has altered the environment for modern programming languages. It's opened up new vistas, and although some languages have found themselves eminently unsuited to this new world order, Perl has positively thrived. Because of its strong background in text processing and system glue, Perl has readily adapted itself to the task of providing information using text-based protocols.

Architecture

The Web is driven by plain text. Web servers and web browsers communicate using a text protocol called HTTP, Hypertext Transfer Protocol. Many of the documents exchanged are encoded in a text markup system called HTML, Hypertext Markup Language. This grounding in text is the source of much of the Web's flexibility, power, and success. The only notable exception to the predominance of plain text is the Secure Sockets Layer (SSL) protocol that encrypts other protocols like HTTP into binary data that snoopers can't decode.

Web pages are identified using the Uniform Resource Locator (URL) naming scheme. URLs look like this:

http://www.perl.com/CPAN/
http://www.perl.com:8001/bad/mojo.html
ftp://gatekeeper.dec.com/pub/misc/netlib.tar.Z
ftp://anonymous:myplace@gatekeeper.dec.com/pub/misc/netlib.tar.Z
file:///etc/motd

The first part (http, ftp, file) is called the scheme, and identifies how the file is retrieved. The next part (://) signifies a hostname will follow, whose interpretation depends on the scheme. After the hostname comes the path identifying the document. This path information is also called a partial URL.
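As a rough illustration of these parts, the following sketch splits a URL with a regular expression. The split_url name and the pattern are assumptions made for this example. They cover only simple URLs like the ones above, and the host part is returned with any user information and port still attached.

```perl
use strict;
use warnings;

# split_url is a hypothetical helper: it breaks a simple URL into
# scheme, host (with any user info and port), and path.
# It returns an empty list if the string doesn't look like a URL.
sub split_url {
    my ($url) = @_;
    my ($scheme, $host, $path) =
        $url =~ m{^([A-Za-z][A-Za-z0-9+.-]*)://([^/]*)(/.*)?$}
        or return;
    return ($scheme, $host, defined $path ? $path : "/");
}

for my $url ("http://www.perl.com:8001/bad/mojo.html", "file:///etc/motd") {
    my ($scheme, $host, $path) = split_url($url);
    print "scheme=$scheme host=$host path=$path\n";
}
```

Note that the file URL has an empty host, which is why three slashes appear in a row.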

The Web is a client-server system. Client browsers like Netscape and Lynx request documents (identified by a partial URL) from web servers like Apache. This browser-to-server dialog is governed by the HTTP protocol. Most of the time, the server merely sends back the contents of a file. Sometimes, however, the web server will run another program to send back a document that could be HTML text, an image, or any other document type. The server-to-program dialog is governed by the CGI (Common Gateway Interface) protocol, so the program that the server runs is a CGI program or CGI script.

The server tells the CGI program what page was requested, what values (if any) came in through HTML forms, where the request came from, who they authenticated as (if they authenticated at all), and much more. The CGI program's reply has two parts: headers to say "I'm sending back an HTML document," "I'm sending back a GIF image," or "I'm not sending you anything, go to this page instead," and a document body, perhaps containing GIF image data, plain text, or HTML.
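To make this exchange concrete, here is a minimal hand-rolled sketch, for illustration only. The server passes request details in environment variables (REQUEST_METHOD, REMOTE_ADDR, and REMOTE_USER are standard CGI variables), and the program replies with headers, a blank line, and then the body. The build_reply function name is an assumption made for this example.

```perl
use strict;
use warnings;

# build_reply is a hypothetical helper: given CGI environment
# variables as name/value pairs, it returns a complete CGI reply.
sub build_reply {
    my (%env) = @_;
    my $method = $env{REQUEST_METHOD} || "GET";
    my $from   = $env{REMOTE_ADDR}    || "unknown";
    my $user   = $env{REMOTE_USER}    || "nobody (not authenticated)";

    # Headers come first, then a blank line, then the document body.
    my $reply = "Content-Type: text/plain\r\n\r\n";
    $reply .= "Method: $method\n";
    $reply .= "Request from: $from\n";
    $reply .= "Authenticated as: $user\n";
    return $reply;
}

# When run by a web server, %ENV holds the values for this request.
print build_reply(%ENV);
```

The body is sent as plain text here so that values from the request need no HTML escaping.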

The CGI protocol is easy to implement wrong and hard to implement right, which is why we recommend using Lincoln Stein's excellent CGI.pm module. It provides convenient functions for accessing the information the server sends you, and for preparing the CGI response the server expects. It is so useful, it is included in the standard Perl distribution as of the 5.004 release, along with helper modules like CGI::Carp and CGI::Fast. We show it off in Recipe 19.1.

Some web servers come with a Perl interpreter embedded in them. This lets you use Perl to generate documents without starting a new process. The system overhead of reading an unchanging page isn't noticeable on infrequently accessed pages, even when it's happening several times a second. CGI accesses, however, bog down the machine running the web server. Recipe 19.5 shows how to use mod_perl, the Perl interpreter embedded in the Apache web server, to get the benefits of CGI programs without the overhead.

Behind the Scenes

CGI programs are called each time the web server needs a dynamic document generated. It is important to understand that your CGI program doesn't run continuously, with the browser calling different parts of the program. Each request for a partial URL corresponding to your program starts a new copy. Your program generates a page for that request, then quits.

A browser can request a document in a number of ways called methods. (Don't confuse HTTP methods with the methods of object orientation. They have nothing to do with each other.) The GET method is the most common, indicating a simple request for a document. The HEAD method is used when the browser wants to know about the document without actually fetching it. The POST method is used to submit form values.

Form values can be encoded in both GET and POST methods. With the GET method, values are encoded in the URL, leading to ugly URLs like this:

http://mox.perl.com/cgi-bin/program?name=Johann&born=1685

With the POST method, values are in a different part of the HTTP request that the browser sends the server. If the form values in the example URL above were sent with a POST request, the user, server, and CGI script all see the URL:

http://mox.perl.com/cgi-bin/program
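The difference between the two methods can be sketched as follows. With GET, the encoded values arrive in the QUERY_STRING environment variable. With POST, they arrive on standard input, and CONTENT_LENGTH says how many bytes to read. The parse_pairs and read_form names are assumptions made for this example, and the decoding is deliberately simple: it handles only URL-encoded forms and keeps the last value when a name repeats. CGI.pm's param function covers these cases and more.

```perl
use strict;
use warnings;

# parse_pairs is a hypothetical helper: it decodes a string such as
# "name=Johann&born=1685" into a hash. Repeated names keep the last value.
sub parse_pairs {
    my ($string) = @_;
    my %form;
    for my $pair (split /[&;]/, $string) {
        my ($name, $value) = split /=/, $pair, 2;
        next unless defined $name && length $name;
        $value = "" unless defined $value;
        for ($name, $value) {
            tr/+/ /;                              # "+" encodes a space
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;  # %XX encodes one byte
        }
        $form{$name} = $value;
    }
    return %form;
}

# read_form is also hypothetical: GET values come from QUERY_STRING,
# POST values from the request body on the given filehandle.
sub read_form {
    my ($env, $fh) = @_;
    my $data = "";
    if (($env->{REQUEST_METHOD} || "GET") eq "POST") {
        read($fh, $data, $env->{CONTENT_LENGTH} || 0);
    } else {
        $data = defined $env->{QUERY_STRING} ? $env->{QUERY_STRING} : "";
    }
    return parse_pairs($data);
}

my %form = read_form(\%ENV, \*STDIN);
print "$_ = $form{$_}\n" for sort keys %form;
```

Either way, the program ends up with the same name and value pairs; only the place it reads them from differs.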

The GET and POST methods differ in another respect: idempotency. This simply means that making a GET request for a particular URL once or multiple times should be no different. This is because the HTTP protocol definition says that a GET request may be cached by the browser, or server, or an intervening proxy. POST requests cannot be cached, because each request is independent and matters. Typically, a POST request changes or depends on the state of the server (querying or updating a database, sending mail, or purchasing a computer).

Most servers log requests to a file (the access log) for later analysis by the webmaster. Error messages produced by CGI programs don't go to the browser by default. Instead they are also logged to a file (the error log), and the browser simply gets a "500 Server Error" message saying that the CGI program didn't uphold its end of the CGI bargain.

Error messages are useful in debugging any program, but they are especially so with CGI scripts. Sometimes, though, the authors of CGI programs either don't have access to the error log or don't know where it is. Having error messages sent to a more convenient location is discussed in Recipe 19.2. Tracking down errors is covered in Recipe 19.3.

Recipe 19.9 shows how to learn what your browser and server are really saying to one another. Unfortunately, some browsers do not implement the HTTP specification correctly, and you can use the tools in this recipe to investigate whether your program or your browser is the cause of a problem.

Security

CGI programs let anyone run a program on your system. Sure, you get to pick the program, but the anonymous user from Out There can send it unexpected values and try to trick it into doing the wrong thing. Thus security is a big concern on the Web.

Some sites address this concern by banning CGI programs. Sites that can't do without the power and utility of CGI programs must find ways to secure their CGI programs. Recipe 19.4 gives a checklist of considerations for writing a secure CGI script, and it briefly covers Perl's tainting mechanism for guarding against accidental use of unsafe data. Recipe 19.6 shows how your CGI program can safely run other programs.
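One common defensive habit is to accept only values that match a pattern you have chosen, rather than trying to list everything that might be dangerous. The sketch below assumes a form field that is supposed to hold a simple filename. The safe_name function and its particular pattern are assumptions made for this example, not a complete security policy.

```perl
use strict;
use warnings;

# safe_name is a hypothetical helper: it returns the value only if it
# consists solely of characters we have decided to allow, else undef.
sub safe_name {
    my ($value) = @_;
    return unless defined $value;
    # Accept 1 to 32 ASCII letters, digits, underscores, hyphens, or
    # dots, starting with a letter or digit so "../x" and "-rf" fail.
    if ($value =~ /\A([A-Za-z0-9][A-Za-z0-9_.-]{0,31})\z/) {
        return $1;    # captured text is also untainted under -T
    }
    return;
}

for my $input ("report.txt", "../../etc/passwd", "x; rm -rf /") {
    my $ok = safe_name($input);
    print defined $ok ? "accepted: $ok\n" : "rejected: $input\n";
}
```

Returning the captured text rather than the original value matters when the script runs with tainting enabled, because data extracted through a pattern match is treated as clean.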

HTML and Forms

Some HTML tags let you create forms, where the user can fill in values that will be submitted to the server. The forms are composed of widgets, like text entry fields and check boxes. CGI programs commonly return HTML, so the CGI module has helper functions to create HTML for everything from tables to form widgets.

In addition to Recipe 19.7, this chapter also has Recipe 19.11, which shows how to create forms that retain their values over multiple calls. Recipe 19.12 shows how to make a single CGI script that produces and responds to a set of pages, for example, a product catalog and ordering system.

Web-Related Resources

Unsurprisingly, some of the best references on the Web are found on the Web:

WWW Security FAQ

http://www.w3.org/Security/Faq/

Web FAQ

http://www.boutell.com/faq/

CGI FAQ

http://www.webthing.com/tutorials/cgifaq.html

HTTP Specification

http://www.w3.org/pub/WWW/Protocols/HTTP/

HTML Specification

http://www.w3.org/TR/REC-html40/

http://www.w3.org/pub/WWW/MarkUp/

CGI Specification

http://www.w3.org/CGI/

CGI Security FAQ

http://www.go2net.com/people/paulp/cgi-security/safe-cgi.txt

We recommend Lincoln Stein's fine book, Official Guide to Programming with CGI.pm (John Wiley & Sons, 1998), Tom Boutell's aging but worthwhile CGI Programming in C and Perl (Addison-Wesley, 1996), and HTML: The Definitive Guide (3rd Edition; O'Reilly & Associates, 1998) by Chuck Musciano and Bill Kennedy. The best periodical to date is the monthly Web Techniques magazine, targeted at web programmers.