Chapter 3. Toward a Real Web Site
Now that we have the server running with a basic configuration, we can start to explore more sophisticated possibilities in greater detail. Fortunately, the differences between the Windows and Unix versions of Apache fade as we get past the initial setup and configuration, so it's easier to focus on the details of making a web site work.
3.1 More and Better Web Sites: site.simple
We are now in a position to start creating real(ish) web sites, which can be found in the sample code at the web site for the book, http://oreilly.com/catalog/apache3/. For the sake of a little extra realism, we will base the site loosely round a simple web business, Butterthlies, Inc., that creates and sells picture postcards. We need to give it some web addresses, but since we don't yet want to venture into the outside world, they should be variants on your own network ID. This way, all the machines in the network realize that they don't have to go out on the Web to make contact. For instance, we edited the \windows\hosts file on the Windows 95 machine running the browser and the /etc/hosts file on the Unix machine running the server to read as follows:
127.0.0.1 localhost 192.168.123.2 www.butterthlies.com 192.168.123.2 sales.butterthlies.com 192.168.123.3 sales-IP.butterthlies.com 192.168.124.1 www.faraway.com
localhost is obligatory, so we left it in, but you should not make any server requests to it since the results are likely to be confusing.
You probably need to consult your network manager to make similar arrangements.
site.simple is site.toddle with a few small changes. The script go will work anywhere. To get started, do the following, depending on your operating environment:
test -d logs || mkdir logs httpd -d 'pwd' -f 'pwd'/conf/httpd.conf
Open an MS-DOS window and from the command line, type:
c>cd \program files\apache group\apache c>apache -k start c>Apache/1.3.26 (Win32) running ...
To stop Apache, open a second MS-DOS window:
c>apache -k stop c>cd logs c>edit error.log
This will be true of each site in the demonstration setup, so we will not mention it again.
From here on, there will be minimal differences between the server setups necessary for Win32 and those for Unix. Unless one or the other is specifically mentioned, you should assume that the text refers to both.
It would be nice to have a log of what goes on. In the first edition of this book, we found that a file access_log was created automatically in ...site.simple/logs. In a rather bizarre move since then, the Apache Group has broken backward compatibility and now requires you to mention the log file explicitly in the Config file using the TransferLog directive.
The ... /conf/httpd.conf file now contains the following:
User webuser Group webgroup ServerName www.butterthlies.com DocumentRoot /usr/www/APACHE3/APACHE3/site.simple/htdocs TransferLog logs/access_log
In ... /htdocs we have, as before, 1.txt :
hullo world from site.simple again!
Type ./go on the server. Become the client, and retrieve http://www.butterthlies.com. You should see:
Index of / . Parent Directory . 1.txt
Click on 1.txt for an inspirational message as before.
This all seems satisfactory, but there is a hidden mystery. We get the same result if we connect to http://sales.butterthlies.com. Why is this? Why, since we have not mentioned either of these URLs or their IP addresses in the configuration file on site.simple, do we get any response at all?
The answer is that when we configured the machine on which the server runs, we told the network interface to respond to anyof these IP addresses:
By default Apache listens to all IP addresses belonging to the machine and responds in the same way to all of them. If there are virtual hosts configured (which there aren't, in this case), Apache runs through them, looking for an IP name that corresponds to the incoming connection. Apache uses that configuration if it is found, or the main configuration if it is not. Later in this chapter, we look at more definite control with the directives BindAddress, Listen, and <VirtualHost>.
It has to be said that working like this (that is, switching rapidly between different configurations) seemed to get Netscape or Internet Explorer into a rare muddle. To be sure that the server was functioning properly while using Netscape as a browser, it was usually necessary to reload the file under examination by holding down the Control key while clicking on Reload. In extreme cases, it was necessary to disable caching by going to Edit Preferences Advanced Cache. Set memory and disk cache to 0, and set cache comparison to Every Time. In Internet Explorer, set Cache Compares to Every Time. If you don't, the browser tends to display a jumble of several different responses from the server. This occurs because we are doing what no user or administrator would normally do, namely, flipping around between different versions of the same site with different versions of the same file. Whenever we flip from a newer version to an older version, Netscape is led to believe that its cached version is up-to-date.
Back on the server, stop Apache with ^C, and look at the log files. In ... /logs/access_log, you should see something like this:
192.168.123.1--- [<date-time>] "GET / HTTP/1.1" 200 177
200 is the response code (meaning "OK, cool, fine"), and 177 is the number of bytes transferred. In ... /logs/error_log, there should be nothing because nothing went wrong. However, it is a good habit to look there from time to time, though you have to make sure that the date and time logged correspond to the problem you are investigating. It is easy to fool yourself with some long-gone drama.
Life being what it is, things can go wrong, and the client can ask for something the server can't provide. It makes sense to allow for this with the ErrorDocument command.
The ErrorDocument directive lets you specify what happens when a client asks for a nonexistent document.
ErrorDocument error-code "document(" in Apache v2) Server config, virtual host, directory, .htaccess
In the event of a problem or error, Apache can be configured to do one of four things:
The first option is the default, whereas options 2 through 4 are configured using the ErrorDocument directive, which is followed by the HTTP response code and a message or URL. Messages in this context begin with a double quotation mark ("), which does not form part of the message itself. Apache will sometimes offer additional information regarding the problem or error.
URLs can be local URLs beginning with a slash (/ ) or full URLs that the client can resolve. For example:
ErrorDocument 500 http://foo.example.com/cgi-bin/tester ErrorDocument 404 /cgi-bin/bad_urls.pl ErrorDocument 401 /subscription_info.html ErrorDocument 403 "Sorry can't allow you access today"
Note that when you specify an ErrorDocument that points to a remote URL (i.e., anything with a method such as "http" in front of it), Apache will send a redirect to the client to tell it where to find the document, even if the document ends up being on the same server. This has several implications, the most important being that if you use an ErrorDocument 401 directive, it must refer to a local document. This results from the nature of the HTTP basic authentication scheme.
3.2 Butterthlies, Inc., Gets Going
The httpd.conf file (to be found in ... /site.first) contains the following:
User webuser Group webgroup ServerName my586 DocumentRoot /usr/www/APACHE3/APACHE3/site.first/htdocs TransferLog logs/access_log #Listen is needed for Apache2 Listen 80
In the first edition of this book, we mentioned the directives AccessConfig and ResourceConfig here. If set with /dev/null (NUL under Win32), they disable the srm.conf and access.conf files, and they were formerly required if those files were absent. However, new versions of Apache ignore these files if they are not present, so the directives are no longer required. However, if they are present, the files mentioned will be included in the Config file. In Apache Version 1.3.14 and later, they can be given a directory rather than a filename, and all files in that directory and its subdirectories will be parsed as configuration files.
In Apache v2 the directives AccessConfig and ResourceConfig are abolished and will cause an error. However, you can write: Include conf/srm.conf Include conf/access.conf in that order, and at the end of the Config file.
Apache v2 also, rather oddly, insists on a Listen directive. If you don't include it in your Config file, you will get the error message:
...no listening sockets available, shutting down.
If you are using Win32, note that the User and Group directives are not supported, so these can be removed.
Apache's role in life is delivering documents, and so far we have not done much of that. We therefore begin in a modest way with a little HTML document that lists our cards, gives their prices, and tells interested parties how to get them.
We can look at the Netscape Help item "Creating Net Sites" and download "A Beginners Guide to HTML" as well as the next web person can, then rough out a little brochure in no time flat:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN"> <html> <head> <title> Butterthlies Catalog</title> </head> <body> <h1> Welcome to Butterthlies Inc</h1> <h2>Summer Catalog</h2> <p> All our cards are available in packs of 20 at $2 a pack. There is a 10% discount if you order more than 100. </p> <hr> <p> Style 2315 <p align=center> <img src="bench.jpg" alt="Picture of a bench"> <p align=center> Be BOLD on the bench <hr> <p> Style 2316 <p align=center> <img src="hen.jpg" ALT="Picture of a hencoop like a pagoda"> <p align=center> Get SCRAMBLED in the henhouse <HR> <p> Style 2317 <p align=center> <img src="tree.jpg" alt="Very nice picture of tree"> <p align=center> Get HIGH in the treehouse <hr> <p> Style 2318 <p align=center> <img src="bath.jpg" alt="Rather puzzling picture of a bathtub"> <p align=center> Get DIRTY in the bath <hr> <p align=right> Postcards designed by Harriet@alart.demon.co.uk <hr> <br> Butterthlies Inc, Hopeful City, Nevada 99999 </body> </HTML>
We want this brochure to appear in ... /site.first/htdocs, but we will in fact be using it in many other sites as we progress, so let's keep it in a central location. We will set up links to it using the Unixln command, which creates new directory entries having the same modes as the original file without wasting disk space. Moreover, if you change the "real" copy of the file, all the linked copies change too. We have a directory /usr/www/APACHE3/main_docs, and this document lives in it as catalog_summer.html. This file refers to some rather pretty pictures that are held in four .jpg files. They live in ... /main_docs and are linked to the working htdocs directories:
% ln /usr/www/APACHE3/main_docs/catalog_summer.html . % ln /usr/www/APACHE3/main_docs/bench.jpg .
The remainder of the links follow the same format (assuming we are in .../site.first/htdocs).
If you type ls, you should see the files there as large as life.
Under Win32 there is unfortunately no equivalent to a link, so you will just have to have multiple copies.
3.2.1 Default Index
Type ./go, and shift to the client machine. Log onto http://www.butterthlies.com /:
INDEX of / *Parent Directory *bath.jpg *bench.jpg *catalog_summer.html *hen.jpg *tree.jpg
What we see in the previous listing is the index that Apache concocts in the absence of anything better. We can do better by creating our own index page in the special file ... /htdocs/index.html :
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN"> <html> <head> <title>Index to Butterthlies Catalogs</title> </head> <body> <ul> <li><A href="catalog_summer.html">Summer catalog</A> <li><A href="catalog_autumn.html">Autumn catalog</A> </ul> <hr> <br>Butterthlies Inc, Hopeful City, Nevada 99999 </body> </html>
We needed a second file (catalog_autumn.html) to make our site look convincing. So we did what the management of this outfit would do themselves: we copied catalog_summer.html to catalog_autum.html and edited it, simply changing the word Summer to Autumn and including the link in ... /htdocs.
Whenever a client opens a URL that points to a directory containing the index.html file, Apache automatically returns it to the client (by default, this can be configured with the DirectoryIndex directive). Now, when we visit, we see:
INDEX TO BUTTERTHLIES CATALOGS *Summer Catalog *Autumn Catalog -------------------------------------------- Butterthlies Inc, Hopeful City, Nevada 99999
We won't forget to tell the web search engines about our site. Soon the clients will be logging in (we can see who they are by checking ... /logs/access_log). They will read this compelling sales material, and the phone will immediately start ringing with orders. Our fortune is on its way to being made.
3.3 Block Directives
Apache has a number of block directives that limit the application of other directives within them to operations on particular virtual hosts, directories, or files. These are extremely important to the operation of a real web site because within these blocks — particularly <VirtualHost> — the webmaster can, in effect, set up a large number of individual servers run by a single invocation of Apache. This will make more sense when you get to the Section 4.1.
The syntax of the block directives is detailed next.
The <VirtualHost> directive within a Config file acts like a tag in HTML: it introduces a block of text containing directives referring to one host; when we're finished with it, we stop with </VirtualHost>. For example:
.... <VirtualHost www.butterthlies.com> ServerAdmin firstname.lastname@example.org DocumentRoot /usr/www/APACHE3/APACHE3/site.virtual/htdocs/customers ServerName www.butterthlies.com ErrorLog /usr/www/APACHE3/APACHE3/site.virtual/name-based/logs/error_log TransferLog /usr/www/APACHE3/APACHE3/site.virtual/name-based/logs/access_log </VirtualHost> ...
<VirtualHost> also specifies which IP address we're hosting and, optionally, the port. If port is not specified, the default port is used, which is either the standard HTTP port, 80, or the port specified in a Port directive (not in Apache v2). host can also be _default_ , in which case it matches anything no other <VirtualHost> section matches.
In a real system, this address would be the hostname of our server. There are three more similar directives that also limit the application of other directives:
This list shows the analogues in ascending order of authority, so that <Directory> is overruled by <Files>, and <Files> by <Location>. Files can be nested within <Directory> blocks. Execution proceeds in groups, in the following order:
Group 1 is processed in the order of shortest directory to longest. The other groups are processed in the order in which they appear in the Config file. Sections inside <VirtualHost> blocks are applied after corresponding sections outside.
The <Directory> directive allows you to apply other directives to a directory or a group of directories. It is important to understand that dir refers to absolute directories, so that <Directory /> operates on the whole filesystem, not the DocumentRoot and below. dir can include wildcards — that is, ? to match a single character, * to match a sequence, and [ ] to enclose a range of characters. For instance, [a-d] means "any one of a, b, c, d." If the character ~ appears in front of dir, the name can consist of complete regular expressions.
<DirectoryMatch> has the same effect as <Directory ~ >. That is, it expects a regular expression. So, for instance, either:
<Directory ~ /[a-d].*>
means "any directory name in the root directory that starts with a, b, c, or d."
The <Files> directive limits the application of the directives in the block to that file, which should be a pathname relative to the DocumentRoot. It can include wildcards or full regular expressions preceded by ~. <FilesMatch> can be followed by a regular expression without ~. So, for instance, you could match common graphics extensions with:
Or, if you wanted our catalogs treated in some special way:
Unlike <Directory> and <Location>, <Files> can be used in a .htaccess file.
The <Location> directive limits the application of the directives within the block to those URLs specified, which can include wildcards and regular expressions preceded by ~. In line with regular-expression processing in Apache v1.3, * and ? no longer match to /. <LocationMatch> is followed by a regular expression without the ~.
Most things that are allowed in a <Directory> block are allowed in <Location>, but although AllowOverride will not cause an error in a <Location> block, it makes no sense there.
The <IfDefine> directive enables a block, provided the flag -Dnameis used when Apache starts up. This makes it possible to have multiple configurations within a single Config file. This is mostly useful for testing and distribution purposes rather than for dedicated sites.
The <IfModule> directive enables a block, provided that the named module was compiled or dynamically loaded into Apache. If the ! prefix is used, the block is enabled if the named module was not compiled or loaded. <IfModule> blocks can be nested. The module-file-name should be the name of the module's source file, e.g. mod_log_config.c.
3.4 Other Directives
Other housekeeping directives are listed here.
The ServerName directive sets the hostname of the server; this is used when creating redirection URLs. If it is not specified, then the server attempts to deduce it from its own IP address; however, this may not work reliably or may not return the preferred hostname. For example:
could be used if the canonical (main) name of the actual machine were simple.example.com, but you would like visitors to see www.example.com.
This directive controls how Apache forms URLs that refer to itself, for example, when redirecting a request for http://www.domain.com/some/directory to the correct http://www.domain.com/some/directory/ (note the trailing / ). If UseCanonical-Name is on (the default), then the hostname and port used in the redirect will be those set by ServerName and Port (not Apache v2). If it is off, then the name and port used will be the ones in the original request.
One instance where this directive may be useful is when users are in the same domain as the web server (for example, on an intranet). In this case, they may use the "short" name for the server (www, for example), instead of the fully qualified domain name (www.domain.com, say). If a user types a URL such as http://www/APACHE3/somedir (without the trailing slash), then, with UseCanonicalName switched on, the user will be directed to http://www.domain.com/somedir/. With UseCanonicalName switched off, she will be redirected to http://www/APACHE3/somedir/. An obvious case in which this is useful is when user authentication is switched on: reusing the server name that the user typed means she won't be asked to reauthenticate when the server name appears to the browser to have changed. More obscure cases relate to name/address translation caused by some firewalling techniques.
ServerAdmin gives Apache an email_address for automatic pages generated when some errors occur. It might be sensible to make this a special address such as email@example.com.
This directive allows you to let the client know which server in a chain of proxies actually did the business. ServerSignature on generates a footer to server-generated documents that includes the server version number and the ServerName of the virtual host. ServerSignature email additionally creates a mailto: reference to the relevant ServerAdmin address.
This directive controls the information about itself that the server returns. The security-minded webmaster may want to limit the information available to the bad guys:
ServerAlias gives a list of alternate names matching the current virtual host. If a request uses HTTP 1.1, it arrives with Host: server in the header and can match ServerName, ServerAlias, or the VirtualHost name.
In HTTP 1.1 you can map several hostnames to the same IP address, and the browser distinguishes between them by sending the Host header. But it was thought there would be a transition period during which some browsers still used HTTP 1.0 and didn't send the Host header. So ServerPath lets the same site be accessed through a path instead.
It has to be said that this directive often doesn't work very well because it requires a great deal of discipline in writing consistent internal HTML links, which must all be written as relative links to make them work with two different URLs. However, if you have to cope with HTTP 1.0 browsers that don't send Host headers when accessing virtual sites, you don't have much choice.
For instance, suppose you have site1.example.com and site2.example.com mapped to the same IP address (let's say 192.168.123.2), and you set up the httpd.conf file like this:
<VirtualHost 192.168.123.2> ServerName site1.example.com DocumentRoot /usr/www/APACHE3/site1 ServerPath /site1 </VirtualHost> <VirtualHost 192.168.123.2> ServerName site2.example.com DocumentRoot /usr/www/APACHE3/site2 ServerPath /site2 </VirtualHost>
Then an HTTP 1.1 browser can access the two sites with URLs http://site1.example.com / and http://site2.example.com /. Recall that HTTP 1.0 can only distinguish between sites with different IP addresses, so both of those URLs look the same to an HTTP 1.0 browser. However, with the previously listed setup, such browsers can access http://site1.example.com /site1 and http://site1.example.com /site2 to see the two different sites (yes, we did mean site1.example.com in the latter; it could have been site2.example.com in either, because they are the same as far as an HTTP 1.0 browser is concerned).
The ScoreBoardFile directive is required on some architectures to place a file that the server will use to communicate between its children and the parent. The easiest way to find out if your architecture requires a scoreboard file is to run Apache and see if it creates the file named by the directive. If your architecture requires it, then you must ensure that this file is not used at the same time by more than one invocation of Apache.
If you have to use a ScoreBoardFile, then you may see improved speed by placing it on a RAM disk. But be aware that placing important files on a RAM disk involves a certain amount of risk.
Apache 1.2 and above: Linux 1.x and SVR4 users might be able to add -DHAVE_SHMGET -DUSE_SHMGET_SCOREBOARD to the EXTRA_CFLAGS in your Config file. This might work with some 1.x installations, but not with all of them. (Prior to 1.3b4, HAVE_SHMGET would have sufficed.)
When a program crashes under Unix, a snapshot of the core code is dumped to a file. You can then examine it with a debugger to see what went wrong. This directive specifies a directory where Apache tries to put the mess. The default is the ServerRoot directory, but this is normally not writable by Apache's user. This directive is useful only in Unix, since Win32 does not dump a core after a crash.
SendBufferSize increases the send buffer in TCP beyond the default set by the operating system. This directive improves performance under certain circumstances, but we suggest you don't use it unless you thoroughly understand network technicalities.
When Apache is compiled with USE_FCNTL_SERIALIZED_ACCEPT or USE_FLOCK_SERIALIZED_ACCEPT, it will not start until it writes a lock file to the local disk. If the logs directory is NFS mounted, this will not be possible. It is not a good idea to put this file in a directory that is writable by everyone, since a false file will prevent Apache from starting. This mechanism is necessary because some operating systems don't like multiple processes sitting in accept( ) on a single socket (which is where Apache sits while waiting). Therefore, these calls need to be serialized. One way is to use a lock file, but you can't use one on an NFS-mounted directory.
The AcceptMutex directives sets the method that Apache uses to serialize multiple children accepting requests on network sockets. Prior to Apache 2.0, the method was selectable only at compile time. The optimal method to use is highly architecture- and platform-dependent. For further details, see http://httpd.apache.org/docs-2.0/misc/perf-tuning.html.
If AcceptMutex is not used or this directive is set to default, then the compile-time-selected default will be used. Other possible methods are listed later. Note that not all methods are available on all platforms. If a method is specified that is not available, a message will be written to the error log listing the available methods.
Chances are that if a user logs on to your site, he will reaccess it fairly soon. To avoid unnecessary delay, this command keeps the connection open, but only for number requests, so that one user does not hog the server. You might want to increase this from 5 if you have a deep directory structure. Netscape Navigator 2 has a bug that fouls up keepalives. Apache v1.2 and higher can detect the use of this browser by looking for Mozilla/2 in the headers returned by Netscape. If the BrowserMatch directive is set (see Chapter 13), the problem disappears.
Similarly, to avoid waiting too long for the next request, this directive sets the number of seconds to wait. Once the request has been received, the TimeOut directive applies.
TimeOut sets the maximum time that the server will wait for the receipt of a request and then its completion block by block. This directive used to have an unfortunate effect: downloads of large files over slow connections would time out. Therefore, the directive has been modified to apply to blocks of data sent rather than to the whole transfer.
If this directive is on, then every incoming connection is reverse DNS resolved, which means that, starting with the IP number, Apache finds the hostname of the client by consulting the DNS system on the Internet. The hostname is then used in the logs. If switched off, the IP address is used instead. It can take a significant amount of time to reverse-resolve an IP address, so for performance reasons it is often best to leave this off, particularly on busy servers. Note that the support program logresolve is supplied with Apache to reverse-resolve the logs at a later date.
The new double keyword supports the double-reverse DNS test. An IP address passes this test if the forward map of the reverse map includes the original IP. Regardless of the setting here, mod_access access lists using DNS names require all the names to pass the double-reverse test.
filename points to a file that will be included in the Config file in place of this directive. From Apache 1.3.14, if filename points to a directory, all the files in that directory and its subdirectories will be included.
The <Limit method > directive defines a block according to the HTTP method of the incoming request. For instance:
<Limit GET POST> ... directives ... </Limit>
This directive limits the application of the directives that follow to requests that use the GET and POST methods. Access controls are normally effective for all access methods, and this is the usual desired behavior. In the general case, access-control directives should not be placed within a <Limit> section.
The purpose of the <Limit> directive is to restrict the effect of the access controls to the nominated HTTP methods. For all other methods, the access restrictions that are enclosed in the <Limit> bracket will have no effect. The following example applies the access control only to the methods POST, PUT, and DELETE, leaving all other methods unprotected:
<Limit POST PUT DELETE> Require valid-user </Limit>
The method names listed can be one or more of the following: GET, POST, PUT, DELETE, CONNECT, OPTIONS, TRACE, PATCH, PROPFIND, PROPPATCH, MKCOL, COPY, MOVE, LOCK, and UNLOCK. The method name is case sensitive. If GET is used, it will also restrict HEAD requests.
Generally, Limit should not be used unless you really need it (for example, if you've implemented PUT and want to limit PUTs but not GETs), and we have not used it in site.authent. Unfortunately, Apache's online documentation encouraged its inappropriate use, so it is often found where it shouldn't be.
<LimitExcept> and </LimitExcept> are used to enclose a group of access-control directives that will then apply to any HTTP access method not listed in the arguments; i.e., it is the opposite of a <Limit> section and can be used to control both standard and nonstandard/unrecognized methods. See the documentation for <Limit> for more details.
This directive specifies the number of bytes from 0 (meaning unlimited) to 2147483647 (2GB) that are allowed in a request body. The default value is defined by the compile-time constant DEFAULT_LIMIT_REQUEST_BODY (0 as distributed).
The LimitRequestBody directive allows the user to set a limit on the allowed size of an HTTP request message body within the context in which the directive is given (server, per-directory, per-file, or per-location). If the client request exceeds that limit, the server will return an error response instead of servicing the request. The size of a normal request message body will vary greatly depending on the nature of the resource and the methods allowed on that resource. CGI scripts typically use the message body for passing form information to the server. Implementations of the PUT method will require a value at least as large as any representation that the server wishes to accept for that resource.
This directive gives the server administrator greater control over abnormal client-request behavior, which may be useful for avoiding some forms of denial-of-service attacks.
number is an integer from 0 (meaning unlimited) to 32,767. The default value is defined by the compile-time constant DEFAULT_LIMIT_REQUEST_FIELDS (100 as distributed).
The LimitRequestFields directive allows the server administrator to modify the limit on the number of request header fields allowed in an HTTP request. A server needs this value to be larger than the number of fields that a normal client request might include. The number of request header fields used by a client rarely exceeds 20, but this may vary among different client implementations, often depending upon the extent to which a user has configured her browser to support detailed content negotiation. Optional HTTP extensions are often expressed using request-header fields.
This directive gives the server administrator greater control over abnormal client-request behavior, which may be useful for avoiding some forms of denial-of-service attacks. The value should be increased if normal clients see an error response from the server that indicates too many fields were sent in the request.
This directive specifies the number of bytes from 0 to the value of the compile-time constant DEFAULT_LIMIT_REQUEST_FIELDSIZE (8,190 as distributed) that will be allowed in an HTTP request header.
The LimitRequestFieldsize directive allows the server administrator to reduce the limit on the allowed size of an HTTP request-header field below the normal input buffer size compiled with the server. A server needs this value to be large enough to hold any one header field from a normal client request. The size of a normal request-header field will vary greatly among different client implementations, often depending upon the extent to which a user has configured his browser to support detailed content negotiation.
This directive gives the server administrator greater control over abnormal client-request behavior, which may be useful for avoiding some forms of denial-of-service attacks. Under normal conditions, the value should not be changed from the default.
This directive sets the number of bytes from 0 to the value of the compile-time constant DEFAULT_LIMIT_REQUEST_LINE (8,190 as distributed) that will be allowed on the HTTP request line.
The LimitRequestLine directive allows the server administrator to reduce the limit on the allowed size of a client's HTTP request line below the normal input buffer size compiled with the server. Since the request line consists of the HTTP method, URI, and protocol version, the LimitRequestLine directive places a restriction on the length of a request URI allowed for a request on the server. A server needs this value to be large enough to hold any of its resource names, including any information that might be passed in the query part of a GET request.
This directive gives the server administrator greater control over abnormal client-request behavior, which may be useful for avoiding some forms of denial-of-service attacks. Under normal conditions, the value should not be changed from the default.
3.5 HTTP Response Headers
The webmaster can set and remove HTTP response headers for special purposes, such as setting metainformation for an indexer or PICS labels. Note that Apache doesn't check whether what you are doing is at all sensible, so make sure you know what you are up to, or very strange things may happen.
The HeaderName directive sets the name of the file that will be inserted at the top of the index listing. filename is the name of the file to include.
Apache 1.3.6 and Earlier
The module first attempts to include filename.html as an HTML document; otherwise, it will try to include filename as plain text. filename is treated as a filesystem path relative to the directory being indexed. In no case is SSI (server-side includes — see Chapter 14) processing done. For example:
When indexing the directory /web, the server will first look for the HTML file /web/HEADER.html and include it if found; otherwise, it will include the plain text file /web/HEADER, if it exists.
Apache Versions After 1.3.6
filename is treated as a URI path relative to the one used to access the directory being indexed, and it must resolve to a document with a major content type of "text" (e.g., text/html, text/plain, etc.). This means that filename may refer to a CGI script if the script's actual file type (as opposed to its output) is marked as text/html, such as with a directive like:
AddType text/html .cgi
Content negotiation will be performed if the MultiViews option is enabled. If filename resolves to a static text/html document (not a CGI script) and the Includes option is enabled, the file will be processed for server-side includes (see the mod_include documentation). This directive needs mod_autoindex.
The HeaderName directive takes two or three arguments: the first may be set, add, unset, or append; the second is a header name (without a colon); and the third is the value (if applicable). It can be used in <File>, <Directory>, or <Location> sections.
Header unset headerServer config, virtual host, access.conf, .htaccess
This directive can replace, merge, or remove HTTP response headers. The action it performs is determined by the first argument. This can be one of the following values:
This argument is followed by a header name, which can include the final colon, but it is not required. Case is ignored. For add, append, and set, a value is given as the third argument. If this value contains spaces, it should be surrounded by double quotes. For unset, no value should be given.
Order of Processing
The Header directive can occur almost anywhere within the server configuration. It is valid in the main server config and virtual host sections, inside <Directory>, <Location>, and <Files> sections, and within .htaccess files.
The Header directives are processed in the following order:
main server virtual host <Directory> sections and .htaccess <Location> <Files>
Order is important. These two headers have a different effect if reversed:
Header append Author "John P. Doe" Header unset Author
This way round, the Author header is not set. If reversed, the Author header is set to "John P. Doe".
The Header directives are processed just before the response is sent by its handler. These means that some headers that are added just before the response is sent cannot be unset or overridden. This includes headers such as "Date" and "Server".
The Options directive is unusually multipurpose and does not fit into any one site or strategic context, so we had better look at it on its own. It gives the webmaster some far-reaching control over what people get up to on their own sites. option can be set to None, in which case none of the extra features are enabled, or one or more of the following:
The arguments can be preceded by + or -, in which case they are added or removed. The following command, for example, adds Indexes but removes ExecCGI:
Options +Indexes -ExecCGI
If no options are set and there is no <Limit> directive, the effect is as if All had been set, which means, of course, that MultiViews is notset. If any options are set, All is turned off.
This has at least one odd effect, which we will demonstrate at .../site.options. Notice that the file go has been slightly modified:
test -d logs || mkdir logs httpd -f 'pwd'/conf/httpd$1.conf -d 'pwd'
There is an ... /htdocs directory without an index.html and a very simple Config file:
User Webuser Group Webgroup ServerName www.butterthlies.com DocumentRoot /usr/www/APACHE3/APACHE3/site.ownindex/htdocs
Type ./go in the usual way. As you access the site, you see a directory of ... /htdocs. Now, if you copy the Config file to .../conf/httpd1.conf and add the line:
Kill Apache, restart it with ./go 1, and access it again, you see a rather baffling message:
FORBIDDEN You don't have permission to access / on this server
(or something similar, depending on your browser). The reason is that when Options is not mentioned, it is, by default, set to All. By switching ExecCGI on, you switch all the others off, including Indexes. The cure for the problem is to edit the Config file (.../conf/httpd2.conf) so that the new line reads:
Similarly, if + or - are not used and multiple options could apply to a directory, the last most specific one is taken. For example (.../conf/httpd3.conf ):
Options ExecCGI Options Indexes
results in only Indexes being set; it might surprise you that CGIs did not work. The same effect can arise through multiple <Directory> blocks:
<Directory /web/docs> Options Indexes FollowSymLinks </Directory> <Directory /web/docs/specs> Options Includes </Directory>
Only Includes is set for /web/docs/specs.
3.5.1 FollowSymLinks, SymLinksIfOwnerMatch
When we saved disk space for our multiple copies of the Butterthlies catalogs by keeping the images bench.jpg, hen.jpg, bath.jpg, and tree.jpg in /usr/www/APACHE3/main_docs and making links to them, we used hard links. This is not always the best idea, because if someone deletes the file you have linked to and then recreates it, you stay linked to the old version with a hard link. With a soft, or symbolic, link, you link to the new version. To make one, use ln -s source_filename destination_filename.
However, there are security problems to do with other users on the same system. Imagine that one of them is a dubious character called Fred, who has his own webspace, ... /fred/public_html. Imagine that the webmaster has a CGI script called fido that lives in ... /cgi-bin and belongs to webuser. If the webmaster is wise, she has restricted read and execute permissions for this file to its owner and no one else. This, of course, allows web clients to use it because they also appear as webuser. As things stand, Fred cannot read the file. This is fine, and it's in line with our security policy of not letting anyone read CGI scripts. This denies them explicit knowledge of any security holes.
Fred now sneakily makes a symbolic link to fido from his own web space. In itself, this gets him nowhere. The file is as unreadable via symlink as it is in person. But if Fred now logs on to the Web (which he is perfectly entitled to do), accesses his own web space and then the symlink to fido, he can read it because he now appears to the operating system as webuser.
The Options command without All or FollowSymLinks stops this caper dead. The more trusting webmaster may be willing to concede FollowSymLinks-IfOwnerMatch , since that too should prevent access.
A webmaster will sometimes want to kill Apache and restart it with a new Config file, often to add or remove a virtual host as people's web sites come and go. This can be done the brutal way, by running ps -aux to get Apache's PID, doing kill <PID> to stop httpd and restarting it. This method causes any transactions in progress to fail in an annoying and disconcerting way for logged-on clients. A recent innovation in Apache allowed restarts of the main server without suddenly chopping off any child processes that were running.
There are three ways to restart Apache under Unix (see Chapter 2):
Under Win32 it is enough to open a second MS-DOS window and type:
apache -k shutdown|restart
See Chapter 2.
An alternative to restarting to change Config files is to use the .htaccess mechanism, which is explained in Chapter 5. In effect, the changeable parts of the Config file are stored in a secondary file kept in .../htdocs. Unlike the Config file, which is read by Apache at startup, this file is read at each access. The advantage is flexibility, because the webmaster can edit it whenever he likes without interrupting the server. The disadvantage is a fairly serious degradation in performance, because the file has to be laboriously parsed to serve each request. The webmaster can limit what people do in their .htaccess files with the AllowOverride directive.
He may also want to prevent clients seeing the .htaccess files themselves. This can be achieved by including these lines in the Config file:
<Files .htaccess> order allow,deny deny from all </Files>
3.8 CERN Metafiles
A metafile is a file with extra header data to go with the file served — for example, you could add a Refresh header. There seems no obvious place for this material, so we will put it here, with apologies to those readers who find it rather odd.
Turns metafile processing on or off on a directory basis.
Names the directory in which Apache is to look for metafiles. This is usually a "hidden" subdirectory of the directory where the file is held. Set to the value . to look in the same directory.
Names the suffix of the file containing metainformation.
The default values for these directives will cause a request for DOCUMENT_ROOT/mydir/fred.html to look for metainformation (supplementing the MIME header) in DOCUMENT_ROOT/mydir/fred.html.meta.
Apache Version 1.2 brought the expires module, mod_expires, into the main distribution. The point of this module is to allow the webmaster to set the returned headers to pass information to clients' browsers about documents that will need to be reloaded because they are apt to change or, alternatively, that are not going to change for a long time and can therefore be cached. There are three directives:
ExpiresActive simply switches the expiration mechanism on and off.
ExpiresByType takes two arguments. mime-type specifies a MIME type of file; time specifies how long these files are to remain active. There are two versions of the syntax. The first is this:
There is no space between code and seconds. code is one of the following:
seconds is simply a number. For example:
specifies 565,656 seconds after the access time.
The more readable second format is:
base [plus] number type [number type ...]
where base is one of the following:
The plus keyword is optional, and type is one of the following:
now plus 1 day 4 hours
does what it says.
This directive sets the default expiration time, which is used when expiration is enabled but the file type is not matched by an ExpireByType directive.