19.0.1. Architecture
The Web is
driven by plain text. Web servers and web browsers communicate using
a text protocol called HTTP, Hypertext Transfer Protocol. Many of the
documents exchanged are encoded in a text markup system called HTML,
Hypertext Markup Language. This grounding in text is the source of
much of the Web's flexibility, power, and success. The only notable
exception to the predominance of plain text is the Secure Sockets
Layer (SSL) protocol, which encrypts other protocols like HTTP into
binary data that snoopers can't decode.
Web pages are identified
using the Uniform Resource Locator (URL) naming scheme. URLs look
like this:
http://www.perl.com/CPAN/
http://www.perl.com:8001/bad/mojo.html
ftp://gatekeeper.dec.com/pub/misc/netlib.tar.Z
ftp://anonymous:myplace@gatekeeper.dec.com/pub/misc/netlib.tar.Z
file:///etc/motd
The first part (http, ftp, file) is called the scheme, which
identifies how the file is retrieved. The next part (://) means a
hostname will follow, whose interpretation depends on the scheme. The
hostname may be preceded by login information ending in @ and
followed by a colon and a port number, as in :8001. After the
hostname comes the path identifying the document. This path
information is also called a partial URL.
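The parts described above can be pulled apart with a pair of regular expressions. This is a minimal sketch rather than a full URL parser: the function name split_url and the keys of the hash it returns are our own invention for illustration, and the patterns ignore query strings, fragments, and other corners of the URL syntax.

```perl
use strict;
use warnings;

# Split a URL into scheme, optional login information, host,
# optional port, and path. split_url is a hypothetical helper,
# not part of any module; it handles only the simple forms shown above.
sub split_url {
    my ($url) = @_;

    # scheme, then "://", then everything up to the first slash, then the path
    my ($scheme, $authority, $path) =
        $url =~ m{^([A-Za-z][A-Za-z0-9+.-]*)://([^/]*)(/.*)?$}
        or return;

    # login information (if any) comes before the last "@"
    my ($userinfo, $hostport) = $authority =~ /^(?:(.*)\@)?(.*)$/;

    # a trailing ":digits" is the port
    my ($host, $port) = $hostport =~ /^(.*?)(?::(\d+))?$/;

    return {
        scheme   => $scheme,
        userinfo => $userinfo,
        host     => $host,
        port     => $port,
        path     => defined $path ? $path : '/',
    };
}

for my $url ('http://www.perl.com:8001/bad/mojo.html', 'file:///etc/motd') {
    my $parts = split_url($url) or next;
    printf "%-6s host=%-14s port=%-5s path=%s\n",
        $parts->{scheme},
        $parts->{host},
        defined $parts->{port} ? $parts->{port} : '-',
        $parts->{path};
}
```

For the file URL the host comes back as an empty string, because nothing sits between :// and the path.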
The Web is a client-server system. Client browsers like Netscape and
Lynx request documents (identified by a partial URL) from web servers
like Apache. This browser-to-server dialog is governed by the HTTP
protocol. Most of the time, the server merely sends back the file
contents. Sometimes, however, the web server runs another program to
return a document that could be HTML text, a binary image, or any other
document type.
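Because HTTP is plain text, both halves of the browser-to-server dialog can be built and taken apart with ordinary string operations. The sketch below assumes a minimal HTTP/1.0 exchange. The function names build_request and parse_response and the canned sample response are ours, invented for illustration; nothing here touches the network.

```perl
use strict;
use warnings;

# Build the text a browser might send to request a partial URL.
# (build_request is a hypothetical helper for illustration.)
sub build_request {
    my ($host, $path) = @_;
    return join "\r\n",
        "GET $path HTTP/1.0",
        "Host: $host",      # names the site wanted on that server
        "", "";             # blank line ends the request headers
}

# Split a raw response into status code, header hash, and body.
sub parse_response {
    my ($raw) = @_;
    my ($head, $body) = split /\r?\n\r?\n/, $raw, 2;
    my ($status_line, @header_lines) = split /\r?\n/, $head;
    my ($code) = $status_line =~ m{^HTTP/\d\.\d\s+(\d+)};
    my %headers;
    for my $line (@header_lines) {
        my ($name, $value) = split /:\s*/, $line, 2;
        $headers{lc $name} = $value;
    }
    return ($code, \%headers, $body);
}

# A canned response standing in for what a server might send back.
my $sample = "HTTP/1.0 200 OK\r\n"
           . "Content-Type: text/html\r\n"
           . "\r\n"
           . "<html>Hello</html>\n";

print build_request('www.perl.com', '/CPAN/');
my ($code, $headers, $body) = parse_response($sample);
print "status=$code type=$headers->{'content-type'}\n";
```

Note that the request carries only the path, not the whole URL: the scheme and hostname have already been used to decide which server to contact.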
The server-to-program dialog can be handled in two ways. Either the
code to handle the request is part of the web server process, or else
the web server runs an external program to generate a response. The
first scenario is the model of Java servlets and mod_perl (covered in
Chapter 21). The second is governed by the Common Gateway Interface
(CGI) protocol, so the server runs a CGI program (sometimes known as
a CGI script). This chapter deals with CGI programs.
The server tells the CGI program what page was requested, what values
(if any) came in through HTML forms, where the request came from,
who the user authenticated as (if they authenticated at all), and much
more. The CGI program's reply has two parts: headers to say "I'm
sending back an HTML document," "I'm sending back a GIF image," or
"I'm not sending you anything; go to this page instead," and a
document body, perhaps containing image data, plain text, or HTML.
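To make the two-part reply concrete, here is a bare-bones sketch of a CGI program written without any helper module. It reads a few of the standard CGI environment variables (REQUEST_METHOD, QUERY_STRING, REMOTE_ADDR, REMOTE_USER) and returns headers, a blank line, and a body. The function name cgi_reply and the goto=cpan query string that triggers the redirect are hypothetical, chosen only for this illustration. The function takes a hash reference rather than reading %ENV directly so it can be tried from the command line.

```perl
use strict;
use warnings;

# Build a CGI reply (headers, blank line, body) from the request
# environment. cgi_reply is a hypothetical name for illustration.
sub cgi_reply {
    my ($env) = @_;
    my $method = $env->{REQUEST_METHOD} || 'GET';
    my $query  = defined $env->{QUERY_STRING} ? $env->{QUERY_STRING} : '';
    my $from   = $env->{REMOTE_ADDR} || 'unknown';
    my $user   = $env->{REMOTE_USER} || 'nobody (not authenticated)';

    # "Go to this page instead": a Location header and no body.
    # The goto=cpan trigger is an assumption made up for this sketch.
    if ($query eq 'goto=cpan') {
        return "Location: http://www.perl.com/CPAN/\r\n\r\n";
    }

    # "I'm sending back plain text": a Content-Type header, then the body.
    my $body = "Method: $method\n"
             . "Query:  $query\n"
             . "From:   $from\n"
             . "User:   $user\n";
    return "Content-Type: text/plain\r\n\r\n" . $body;
}

# Under a web server, %ENV holds the values the server passed along.
print cgi_reply(\%ENV);
```

The sketch leaves the query string undecoded and ignores form data sent in the request body, which is a large part of what makes the protocol hard to implement right by hand.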
The CGI protocol is easy to implement wrong and hard to implement
right, which is why we recommend using Lincoln Stein's excellent
CGI.pm module. It provides convenient functions for accessing the
information the server sends you, and for preparing the CGI response
the server expects. It's so useful that it shipped with the standard
Perl distribution through Perl 5.20 (later releases install it from
CPAN), along with helper modules such as CGI::Carp and CGI::Fast. We
show it off in Recipe 19.1.
Some web servers come with a Perl interpreter embedded in them. This
lets Perl generate documents without starting a new process. Serving
an unchanging page costs the system little, even when it happens
several times a second, and the cost of starting a process isn't
noticeable on infrequently accessed pages. Frequent CGI accesses,
however, bog down the machine running the web server, because each
one starts a new process. Chapter 21 shows how to use mod_perl, the
Perl interpreter embedded in the Apache web server, to get the
benefits of CGI programs without the overhead.