Writing Apache Modules with Perl and C

Writing Apache Modules with Perl and C

By:	Lincoln Stein and Doug MacEachern
Published:	O'Reilly & Associates, Inc. - March 1999

Show Contents Previous Page Next Page

Chapter 3 - The Apache Module Architecture and API
How Apache Works

In this section...

Introduction

The HTTP Protocol

Introduction

Show Contents Go to Top Previous Page Next Page

Much of the Apache API is driven by the simple fact that Apache is a hypertext transfer protocol (HTTP) server that runs in the background as a daemon. Because it is a daemon, it must do all the things that background applications do, namely, read its configuration files, go into the background, shut down when told to, and restart in the case of a configuration change. Because it is an HTTP server, it must be able to listen for incoming TCP/IP connections from web browsers, recognize requests for URIs, parse the URIs and translate them into the names of files or scripts, and return some response to the waiting browser (Figure 3-1). Extension modules play an active role in all these aspects of the Apache server's life.

Figure 3-1. The HTTP transaction consists of a URI request from the browser to the server, followed by a document response from the server to the browser.

Like most other servers, Apache multiplexes its operations so that it can start processing a new request before it has finished working on the previous one. On Unix systems, Apache uses a multiprocess model in which it launches a flock of servers: a single parent server is responsible for supervision and one or more children are actually responsible for serving incoming requests.¹ The Apache server takes care of the basic process management, but some extension modules need to maintain process-specific data for the lifetime of a process as well. They can do so cleanly and simply via hooks that are called whenever a child is launched or terminated. (The Win32 version of Apache uses multithreading rather than a multiprocess model, but as of this writing modules are not given a chance to take action when a new thread is created or destroyed.)

However, what extension modules primarily do is to intercede in the HTTP protocol in order to customize how Apache processes and responds to incoming browser requests. For this reason, we turn now to a quick look at HTTP itself.

The HTTP Protocol

Show Contents Go to Top Previous Page Next Page

The HTTP protocol was designed to be so simple that anyone with basic programming skills could write an HTTP client or server. In fact, hundreds of people have tried their hands at this, in languages ranging from C to Perl to Lisp. In the basic protocol a browser that wants to fetch a particular document from a server connects to the server via a TCP connection on the port indicated by the URI, usually port 80. The browser then sends the server a series of request lines terminated by a carriage-return/linefeed pair.² At the end of the request, there is an extra blank line to tell the server that the request is finished. The simplest request looks something like this:

GET /very/important/document.html HTTP/1.1
Host: www.modperl.com

The first line of the request contains three components. The first component is the request method, normally GET, POST, HEAD, PUT, or DELETE. GET is a request to fetch the contents of a document and is the most common. POST is a request which includes a body of data after the headers, normally handled by a dynamic module or an executable of some sort to process the data. It's commonly used to send CGI scripts the contents of fill-out forms. HEAD tells the server to return information about the document but not the document itself. PUT and DELETE are infrequently used: PUT is used to send a new document to the server, creating a new document at the given URI or replacing what was previously there, and DELETE causes the indicated document to be removed. For obvious reasons, PUT and DELETE methods are disabled by default on most servers.

The second component of the request is the URI of the document to be retrieved. It consists of a Unix-style path delimited by slashes. The server often translates the path into an actual file located somewhere on the server's filesystem, but it doesn't have to. In this book, we'll show examples of treating the path as a database query, as a placeholder in a virtual document tree, and other interesting applications.

The third component in the request line is the protocol in use, which in this case is Version 1.1 of the HTTP protocol. HTTP/1.1 is a big improvement over the earlier HTTP/1.0 version because of its support for virtual hosts and its fine-grained control of document caching. However, at the time this book was written most browsers actually implemented a version of HTTP/1.0 with some HTTP/1.1 features grafted on.

Following the first line are a series of HTTP header fields that the browser can send to the server in order to fine-tune the request. Each field consists of a field name, a colon, and then the value of the field, much like an email header. In the HTTP/1.1 protocol, there is only one mandatory header field, a Host field indicating which host the request is directed to. The value of this field allows a single server to implement multiple virtual hosts, each with a separate home page and document tree.

Other request header fields are optional. Here's a request sent by a recent version of Netscape Navigator:

GET /news.html HTTP/1.1
Connection: Keep-Alive
User-Agent: Mozilla/4.05 [en] (X11; I; Linux 2.0.33 i686)
Host: www.modperl.com
Referer: http://www.modperl.com/index.html
If-Modified-Since: Tue, 24 Feb 1998 11:19:03 GMT
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
Accept-Language: en
Accept-Charset: iso-8859-1,*,utf-8

This example shows almost all of the HTTP header fields that you'll ever need to know. The Connection field is a suggestion to the server that it should keep the TCP/IP connection open after finishing this request. This is an optimization that improves performance on pages that contain multiple inline images. The User-Agent field gives the make and model of the browser. It indicates Netscape Navigator Version 4.05 ("Mozilla" is the code name for Netscape's browsers) running on a Linux system. Host is the name of the host given in the URI and is used by the virtual host system to select the right document tree and configuration file. Referer (yes, the protocol misspells it) gives the URI of the document that referred the browser to the current document. It's either an HTML file that links to the current page or, if the current document is an image file, the document that contains the image. In this case, the referrer field indicates that the user was viewing file index.html on the www.modperl.com site before selecting a link to the current document, news.html.

If-Modified-Since is another important performance optimization. Many browsers cache retrieved documents locally so that they don't have to go across the network whenever the user revisits a page. However, documents change and a cached document might be out of date. For this reason, some browsers implement a conditional fetch using If-Modified-Since. This field indicates the date at which the document was cached. The server is supposed to compare the date to the document's current modification time and only return it to the browser if the document has changed.³

Other fields in a typical request are Accept, Accept-Language, and Accept-Charset. Accept is a list of Multipurpose Internet Mime Extension (MIME) types that the browser will accept. In theory, the information in this field is supposed to be used for content negotiation. The browser tells the server what MIME types it can handle, and the server returns the document in the format that the browser most prefers. In practice, this field has atrophied. In the example above, Netscape sends an anemic list of the image types it can display without the help of plug-ins, followed by a catchall wildcard type of */*.

Accept-Language indicates the language the user prefers, in this case "en" for English. When a document is available in multiple languages, Apache can use the information in this field to return the document in the appropriate language. Lastly, Accept-Charset indicates which character sets the browser can display. The iso-8859-1 character set, often known as "Latin-1," contains the characters used in English and most Western European countries. "utf-8" stands for 8-bit Unicode, an expanded alphabet that accommodates most Western and Asian character sets. In this example, there's also a wildcard that tells the server to send the document even if it isn't written in a character set that the browser knows about specifically.

If the request had been a POST or PUT rather than a GET, there would be one or two additional fields at the bottom of the header. The Content-Length field, if present, indicates that the browser will be sending some document data following the header. The value of this field indicates how many bytes of data to expect. The Content-Type field, if present, gives the MIME type of the data. The standard MIME type for the contents of fill-out form fields is application/x-www-form-urlencoded.

The browser doesn't have to send any of these fields. Just the request line and the Host field are sufficient, as you can see for yourself using the telnet application:

% telnet www.modperl.com 80
Trying 207.198.250.44...
Connected to modperl.com.
Escape character is '^]'.
GET /news.html HTTP/1.1
Host: www.modperl.com

HTTP/1.1 200 OK
Date: Tue, 24 Feb 1998 13:16:02 GMT
Server: Apache/1.3.0 (Unix) mod_perl/1.13
Last-Modified: Wed, 11 Feb 1998 21:05:25 GMT
ETag: "65e5a-37c-35a7d395"
Accept-Ranges: bytes
Content-Length: 892
Connection: close
Content-Type: text/html

<HTML>
<HEAD>
<TITLE>What's New</TITLE>
</HEAD>
<BODY>
...
Connection closed by foreign host.

The Apache server will handle the request in the manner described later and, if all goes well, return the desired document to the client. The HTTP response is similar to the request. It contains a status line at the top, followed by some optional HTTP header fields, followed by the document itself. The header is separated from the document by a blank line.

The top line of the response starts with the HTTP version number, which in this case is 1.1. This is followed by a numeric status code, and a human-readable status message. As the "OK" message indicates, a response code of 200 means that the request was processed successfully and that the document follows. Other status codes indicate a problem on the user's end, such as the need to authenticate; problems on the server's end, such as a CGI script that has crashed; or a condition that is not an error, such as a notice that the original document has moved to a new location. The list of common status codes can be found later in this chapter.

After the response status line come optional HTTP header fields. Date indicates the current time and date and Server gives the model and version number of the server. Following this is information about the document itself. Last-Modified and Content-Length give the document's modification date and total length for use in client-side caching. Content-Type gives the document's MIME type, text/html in this case.

ETag, or "entity tag" is an HTTP/1.1-specific field that makes document caching more accurate. It identifies the document version uniquely and changes when the document changes. Apache implements this behavior using a combination of the file's last modified time, length, and inode number. Accept-Ranges is another HTTP/1.1 extension. It tells the browser that it is all right to request portions of this document. This could be used to retrieve the remainder of a document if the user hit the stop button partway through a long download and then tried to reload the page.

The Connection field is set to close as a polite way of warning the browser that the TCP connection is about to be shut down. It's an optional field provided for HTTP/1.1 compliance.

There are also a number of HTTP fields that are commonly used for user authentication and authorization. We'll introduce them in Chapter 6, Authentication and Authorization.

Following the header comes the document itself, partially shown in the example. The document's length must match the length given in Content-Length, and its format must match the MIME type given in the Content-Type field.

When you write your own Apache modules, you don't have to worry about all these fields unless you need to customize them. Apache will fill in the fields with reasonable values. Generally you will only need to adjust Content-Type to suit the type of document your module creates.

Footnotes

¹ As of this writing, plans are underway for Apache Version 2.0 which will include multithreading support on Unix platforms.

² For various historical and political reasons, different operating systems have differing ideas of what character constitutes the end of a line in text files. The HTTP protocol defines the end of a line to be the character pair represented by ASCII characters 0x0D (carriage return) and 0x0A (newline). In most ASCII environments, these characters are represented by the more familiar "\r" and "\n" escape sequences.

³ We actually cheated a bit in the preceding example. The version of Netscape that we used for the example generates a version of the If-Modified-Since header that is not compliant with the current HTTP specification (among other things, it uses a two-digit year that isn't Y2K-compliant). We edited the field to show the correct HTTP format.

Show Contents Go to Top Previous Page Next Page