Chapter 1. Getting Started
When you connect to the URL of someone's home page -- say
the notional http://www.butterthlies.com/ we
shall meet later on -- you send a message across the Internet to
the machine at that address. That machine, you hope, is up and
running, its Internet connection is working, and it is ready to
receive and act on your message.
URL stands for Uniform Resource Locator. A URL such as
http://www.butterthlies.com/ comes in three parts:
<method>://<host>/<absolute path URL (apURL)>
So, in our example, <method> is http, meaning that the browser should
use HTTP (Hypertext Transfer Protocol); <host> is
www.butterthlies.com; and <apURL> is "/", meaning the top directory of
the host. Using HTTP/1.1, your browser might send the following request:
GET / HTTP/1.1
Host: www.butterthlies.com
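To make the three parts concrete, here is a minimal Python sketch that splits a URL and builds the request shown above. The function names split_url and build_request are our own, chosen for illustration. The sketch ignores query strings and explicit port numbers, and it is not a description of what any particular browser does.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Return the (method, host, apURL) parts of a URL."""
    parts = urlsplit(url)
    # An empty path means the top directory, which is requested as "/".
    return parts.scheme, parts.hostname, parts.path or "/"


def build_request(url):
    """Build a minimal HTTP/1.1 GET request for the given URL."""
    _method, host, ap_url = split_url(url)
    # Each header line ends in CRLF; an empty line ends the headers.
    return f"GET {ap_url} HTTP/1.1\r\nHost: {host}\r\n\r\n"


print(split_url("http://www.butterthlies.com/"))
print(build_request("http://www.butterthlies.com/"))
```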
The request arrives at
port 80 (the default HTTP port) on the host
www.butterthlies.com. The message is again in
three parts: a method (an HTTP method, not a URL method), which in
this case is GET, but could equally be
PUT, POST,
DELETE, or CONNECT; the Uniform
Resource Identifier (URI) "/"; and the
version of the protocol we are using. It is then up to the web server
running on that host to make something of this message.
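As a rough illustration of those three parts, the following Python sketch splits a request line into its method, URI, and protocol version. The name parse_request_line is hypothetical, and the checks are far looser than those a real server applies.

```python
def parse_request_line(line):
    """Split an HTTP request line into (method, uri, version)."""
    # Exactly three space-separated parts are expected; anything else
    # raises ValueError when the result is unpacked.
    method, uri, version = line.strip().split(" ")
    if not version.startswith("HTTP/"):
        raise ValueError(f"not an HTTP request line: {line!r}")
    return method, uri, version


print(parse_request_line("GET / HTTP/1.1"))
```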
It is worth saying here -- and we will say it again -- that the
whole business of a web server is to translate a URL either into a
filename, and then send that file back over the Internet, or into a
program name, and then run that program and send its output back.
That is the meat of what it does: all the rest is trimming.
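The translation step can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the directory names, the /cgi-bin/ prefix, and the index.html default are hypothetical, and Apache's real translation is governed by its configuration rather than by constants like these.

```python
import posixpath

# Hypothetical site layout; a real server takes these from its configuration.
DOCUMENT_ROOT = "/usr/www/site.toddle/htdocs"
SCRIPT_ROOT = "/usr/www/site.toddle/cgi-bin"
SCRIPT_PREFIX = "/cgi-bin/"


def translate(url_path):
    """Map a URL path to ("run", program) or ("send", filename)."""
    # Collapse "." and ".." so a request cannot climb out of the site.
    clean = posixpath.normpath("/" + url_path.lstrip("/"))
    if clean.startswith(SCRIPT_PREFIX):
        return "run", SCRIPT_ROOT + "/" + clean[len(SCRIPT_PREFIX):]
    if url_path.endswith("/"):
        # A directory request is answered with its (assumed) index page.
        clean = posixpath.join(clean, "index.html")
    return "send", DOCUMENT_ROOT + clean


print(translate("/"))
print(translate("/cgi-bin/mycgi"))
```

The server would then either send the named file back or run the named program and send back its output.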
The host machine may be a whole cluster of hypercomputers costing an
oil sheik's ransom, or a humble PC. In either case, it had
better be running a web server, a program that listens to the network
and accepts and acts on this sort of message.
What do we want a web server to do? It should:
- Run fast, so it can cope with a lot of inquiries using a minimum of
hardware.
- Be multitasking, so it can deal with more than one inquiry at once.
- Be multitasking, so that the person running it can maintain the data
it hands out without having to shut the service down. Multitasking is
hard to arrange within a program: the only way to do it properly is
to run the server on a multitasking operating system. In
Apache's case, this is some flavor of Unix (or Unix-like
system), Win32, or OS/2.
- Authenticate inquirers: some may be entitled to more services than
others. When we come to virtual cash, this feature (see Chapter 13,
"Security") becomes essential.
- Respond to errors in the messages it gets with answers that make
sense in the context of what is going on. For instance, if a client
requests a page that the server cannot find, the server should respond
with a "404" error, which the HTTP specification defines as "Not
Found."
- Negotiate a style and language of response with the inquirer. For
instance, it should -- if the people running the server can rise
to the challenge -- be able to respond in the language of the
inquirer's choice. This ability, of course, can open up your
site to a lot more action. And there are parts of the world where a
response in the wrong language can be a bad thing. If you were
operating in Canada, where the English/French divide arouses bitter
feelings, or in Belgium, where the French/Flemish split is as bad,
this feature could make or break your business.
- Offer different formats. On a more technical level, a user might want
JPEG image files rather than GIF, or TIFF rather than either of the
former. He or she might want text in dvi format rather than
PostScript.
- Run as a proxy server. A proxy server accepts
requests for clients, forwards them to the real servers, and then
sends the real servers' responses back to the clients. There
are two reasons why you might want a proxy server:
These are services that the developers of Apache think a server
should offer. There are people who have other ideas, and, as with all
software development, there are lots of features that might be
nice -- features someone might use one day, or that might, if put
into the code, actually make it work better instead of fouling up
something else that has, until then, worked fine. Unless developers
are careful, good software attracts so many improvements that it
eventually rolls over and sinks like a ship caught in an Arctic ice
storm.
Some ideas are in progress: in particular, various proposals for
Apache 2.0 are being kicked around. The main features Apache 2.0 is
supposed to have are multithreading (on platforms that support it),
layered I/O, and a rationalized API.
If you have bugs to
report or more ideas for development, look at http://www.apache.org/bug_report.html. You
can also try news:comp.infosystems.www.servers.unix, where some of the Apache team lurk, along
with many other knowledgeable people, and news:comp.infosystems.www.servers.ms-windows.
1.1. How Does Apache Work?
Apache is a program that runs under a suitable multitasking operating
system. In the examples in this book, the operating systems are Unix
and Windows 95/98/NT, which we call Win32. The binary is called httpd
under Unix and apache.exe under Win32[3] and normally runs in the
background. Each copy of httpd/apache that is started has its attention
directed at a web site, which is, for practical purposes, a
directory. For an example, look at site.toddle
on the demonstration CD-ROM. Regardless of operating
system, a site directory typically contains four subdirectories:
[3]This double name is rather annoying, but
it seems that life has progressed too far for anything to be done
about it. We will, rather clumsily, refer to
httpd/apache and hope that the reader can pick
the right one.
- conf
Contains the configuration file(s), of
which httpd.conf is the most important. It is
referred to throughout this book as the Config
file.
- htdocs
Contains the HTML scripts to be served up to the site's clients. This
directory and those below it, the web space, are
accessible to anyone on the Web and therefore pose a severe security
risk if used for anything other than public data.
- logs
Contains the log data, both of
accesses and errors.
- cgi-bin
Contains the CGI scripts.
These are programs or shell scripts written by or for the webmaster
that can be executed by Apache on behalf of its clients. It is most
important, for security reasons, that this directory not be in the
web space.
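The rule about keeping cgi-bin out of the web space can be checked mechanically. The following is a minimal sketch, assuming a hypothetical site location; it compares path names only and does not follow symbolic links or resolve "..", so it is no substitute for inspecting the real filesystem.

```python
from pathlib import PurePosixPath


def in_web_space(directory, web_space):
    """Return True if directory is web_space itself or lies below it."""
    directory = PurePosixPath(directory)
    web_space = PurePosixPath(web_space)
    return directory == web_space or web_space in directory.parents


site = PurePosixPath("/usr/www/site.toddle")  # hypothetical location
# The layout described above: cgi-bin sits beside htdocs, not inside it.
print(in_web_space(site / "cgi-bin", site / "htdocs"))
# A risky layout: cgi-bin placed inside the web space.
print(in_web_space(site / "htdocs/cgi-bin", site / "htdocs"))
```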
In its idling state, Apache does nothing but listen to the IP
addresses and TCP port or ports specified in its Config file. When a
request appears on a valid port, Apache receives the HTTP request and
analyzes the headers. It then applies the rules it finds in the
Config file and takes the appropriate action.
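What "analyzes the headers" amounts to can be sketched in Python as follows. This is an illustration under simplifying assumptions, not Apache's own parser: the name parse_request is ours, and the sketch ignores folded header lines, repeated headers, and any request body.

```python
def parse_request(raw):
    """Split a raw HTTP request into its request line and a header dict."""
    # The headers end at the first empty line.
    head = raw.split("\r\n\r\n", 1)[0]
    request_line, *header_lines = head.split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them in lowercase.
        headers[name.strip().lower()] = value.strip()
    return request_line, headers


raw = "GET / HTTP/1.1\r\nHost: www.butterthlies.com\r\n\r\n"
request_line, headers = parse_request(raw)
print(request_line)
print(headers["host"])
```

With the request line and headers in hand, a server can decide which of its configured rules apply to the request.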
The webmaster's main control over Apache is through the Config
file. The webmaster has some 150 directives at
his or her disposal; most of this book is an account of what these
directives do and how to use them to reasonable advantage. The
webmaster also has half a dozen flags he or she can use when Apache
starts up.
Apache is freeware: the intending user downloads the source code and
compiles it (under Unix)
or downloads the executable (for Windows) from
www.apache.org or a suitable mirror site. You
can also load the source code from the demonstration CD-ROM included
with this book, although it is not the most recent. Although it
sounds like a difficult business to download the source code and
configure and compile it, it only takes about 20 minutes and is well
worth the trouble.
Under Unix, the webmaster also controls which modules are compiled
into Apache. Each module provides the code to
execute a number of directives. If there is a group of directives
that aren't needed, the appropriate modules can be left out of
the binary by commenting their names out in the configuration file[4]
that controls the compilation of the Apache sources. Discarding
unwanted modules
reduces the size of the binary and may improve performance.
[4]It is important to distinguish between the configuration file
used at compile time and the Config file used to control the
operation of a web site.
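To see at a glance which modules a compile-time configuration file would include, a short Python sketch can sort its lines into enabled and commented-out modules. We assume here, without the text above confirming it, that each module is named on an AddModule line and that a leading # comments it out. The sample lines are illustrative only, and list_modules is a name of our own.

```python
def list_modules(text):
    """Return (enabled, disabled) module names from configuration text."""
    enabled, disabled = [], []
    for line in text.splitlines():
        stripped = line.strip()
        commented = stripped.startswith("#")
        words = stripped.lstrip("#").split()
        # Assumed line format: "AddModule <module file>".
        if len(words) == 2 and words[0] == "AddModule":
            (disabled if commented else enabled).append(words[1])
    return enabled, disabled


# Illustrative sample, not copied from a real configuration file.
sample = """\
AddModule modules/standard/mod_env.o
# AddModule modules/proxy/libproxy.a
AddModule modules/standard/mod_cgi.o
"""
enabled, disabled = list_modules(sample)
print("compiled in:", enabled)
print("left out:", disabled)
```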
Under Windows, Apache is normally precompiled as an executable. The
core modules are compiled in, and others are loaded, if needed, as
dynamic link libraries (DLLs) at runtime, so control of the
executable's size is less urgent. The DLLs supplied in the
.../apache/modules subdirectory are as follows:
APACHE~1 DLL 5,120 19/07/98 11:47 ApacheModuleAuthAnon.dll
APACHE~2 DLL 5,632 19/07/98 11:48 ApacheModuleCERNMeta.dll
APACHE~3 DLL 6,656 19/07/98 11:47 ApacheModuleDigest.dll
APACHE~4 DLL 6,144 19/07/98 11:48 ApacheModuleExpires.dll
APACHE~5 DLL 5,120 19/07/98 11:48 ApacheModuleHeaders.dll
APACHE~6 DLL 46,080 19/07/98 11:48 ApacheModuleProxy.dll
APACHE~7 DLL 35,328 19/07/98 11:48 ApacheModuleRewrite.dll
APACHE~8 DLL 6,656 19/07/98 11:48 ApacheModuleSpeling.dll
APACHE~9 DLL 10,752 19/07/98 11:47 ApacheModuleStatus.dll
APACH~10 DLL 6,144 19/07/98 11:48 ApacheModuleUserTrack.dll
What these are and what they do will become more apparent as we
proceed. You can add other DLLs from outside suppliers; more will
doubtless become available.
It is also possible to download the source code and compile it for
Win32 using Microsoft Visual C++ v5.0. We describe this in
Section 1.9, "Apache Under Windows", later in this chapter.
You might do this if you wanted to write your own module (see Chapter 15, "Writing Apache Modules").
Copyright © 2001 O'Reilly & Associates. All rights reserved.