Chapter 9. Proxying
There are a few good reasons why you should not connect a busy web site straight to the Web:
The answer is to use a proxy server, which can be either Apache itself or a specialized product like Squid.
An important concern on the Web is keeping the Bad Guys out of your network (see Chapter 11). One established technique is to keep the network hidden behind a firewall; this works well, but as soon as you do it, it also means that everyone on the same network suddenly finds that their view of the Net has disappeared (rather like people living near Miami Beach before and after the building boom). This becomes an urgent issue at Butterthlies, Inc., as competition heats up and naughty-minded Bad Guys keep trying to break our security and get in. We install a firewall and, anticipating the instant outcries from the marketing animals who need to get out on the Web and surf for prey, we also install a proxy server to get them out there.
So, in addition to the Apache that serves clients visiting our sites and is protected by the firewall, we need a copy of Apache to act as a proxy server to let us, in our turn, access other sites out on the Web. Without the proxy server, those inside are safe but blind.
9.2 Proxy Directives
We are not concerned here with firewalls, so we take them for granted. The interesting thing is how we configure the proxy Apache to make life with a firewall tolerable to those behind it.
site.proxy has three subdirectories: cache, proxy, real. The Config file from ... /site.proxy/proxy is as follows:
User webuser
Group webgroup
ServerName www.butterthlies.com
Port 8000
ProxyRequests on
CacheRoot /usr/www/APACHE3/site.proxy/cache
CacheSize 1000
The points to notice are as follows:
The AllowCONNECT directive specifies a list of port numbers to which the proxy CONNECT method may connect. Today's browsers use this method when an https connection is requested and proxy tunneling over http is in effect.
By default, only the default https port (443) and the default snews port (563) are enabled. Use the AllowCONNECT directive to override this default and allow connections to the listed ports only.
The ProxyRequests directive turns proxy serving on. Even if ProxyRequests is off, ProxyPass directives are still honored.
The ProxyRemote directive defines remote proxies to this proxy (that is, proxies that should be used for some requests instead of those requests being satisfied directly). It takes two arguments, match and remote-server. match is either the name of a URL scheme that the remote server supports, a partial URL for which the remote server should be used, or * to indicate that the server should be contacted for all requests. remote-server is the URL that should be used to communicate with the remote server (i.e., it is of the form protocol://hostname[:port]). Currently, only HTTP can be used as the protocol for the remote-server. For example:
ProxyRemote ftp http://ftpproxy.mydomain.com:8080
ProxyRemote http://goodguys.com/ http://mirrorguys.com:8000
ProxyRemote * http://cleversite.com
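The three kinds of match can be illustrated with a small Python sketch. It is not Apache's implementation: the function name pick_remote_proxy is hypothetical, and the first-match-wins ordering is an assumption made for the illustration.

```python
from urllib.parse import urlsplit

# Rules mirror the ProxyRemote examples above: (match, remote-server).
RULES = [
    ("ftp", "http://ftpproxy.mydomain.com:8080"),
    ("http://goodguys.com/", "http://mirrorguys.com:8000"),
    ("*", "http://cleversite.com"),
]

def pick_remote_proxy(url, rules=RULES):
    """Return the remote proxy chosen for url, or None if no rule matches.

    Assumption: rules are tried in order and the first match wins.
    """
    scheme = urlsplit(url).scheme
    for match, remote in rules:
        if match == "*":            # use this proxy for all requests
            return remote
        if "://" in match:          # a partial URL: compare as a prefix
            if url.startswith(match):
                return remote
        elif match == scheme:       # a bare scheme name such as "ftp"
            return remote
    return None
```

With these rules, an ftp:// URL goes to the FTP proxy, anything under http://goodguys.com/ goes to the mirror, and everything else falls through to the catch-all.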
The ProxyPass directive runs on an ordinary server and translates requests for a named directory and everything below it into requests to a proxy server. So, on our ordinary Butterthlies site, we might want to pass requests for /secrets on to a proxy server darkstar.com:
ProxyPass /secrets http://darkstar.com
Unfortunately, this is less useful than it might appear, since the proxy does not modify the HTML returned by darkstar.com. This means that URLs embedded in the HTML will refer to documents on the main server unless they have been written carefully. For example, suppose a document one.html is stored on darkstar.com with the URL http://darkstar.com/one.html, and we want it to refer to another document in the same directory. Then the following links will work, when accessed as http://www.butterthlies.com/secrets/one.html:
<A HREF="two.html">Two</A>
<A HREF="/secrets/two.html">Two</A>
<A HREF="http://darkstar.com/two.html">Two</A>
But this example will not work:
<A HREF="/two.html">Not two</A>
When accessed directly, through http://darkstar.com/one.html, these links work:
<A HREF="two.html">Two</A>
<A HREF="/two.html">Two</A>
<A HREF="http://darkstar.com/two.html">Two</A>
But the following doesn't:
<A HREF="/secrets/two.html">Not two</A>
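The behavior of these links when the page is reached through the proxy can be checked with a short Python sketch. It resolves each link the way a browser would and then applies the /secrets mapping. The helper names backend_url and follow are hypothetical, and the mapping is a simplification of what ProxyPass does.

```python
from urllib.parse import urljoin, urlsplit

FRONT = "http://www.butterthlies.com"   # the main server
BACKEND = "http://darkstar.com"         # the proxied server
PREFIX = "/secrets"                     # the path handed to ProxyPass

def backend_url(url):
    """Return the URL that is finally fetched, or None if the main
    server would try (and fail) to serve the document itself."""
    if not url.startswith(FRONT):
        return url                      # absolute link to another host
    path = urlsplit(url).path
    if path == PREFIX or path.startswith(PREFIX + "/"):
        return BACKEND + path[len(PREFIX):]
    return None                         # not under /secrets: not proxied

def follow(link, page=FRONT + PREFIX + "/one.html"):
    """Resolve link relative to page as a browser would, then map it."""
    return backend_url(urljoin(page, link))
```

Here follow("two.html") and follow("/secrets/two.html") both end up at http://darkstar.com/two.html, while follow("/two.html") returns None because the browser asks the main server for /two.html, which lies outside /secrets.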
This directive tends to be useful only for Apache proxy servers within intranets. The ProxyDomain directive specifies the default domain to which the Apache proxy server will belong. If a request to a host without a fully qualified domain name is encountered, a redirection response to the same host with the configured domain appended will be generated. The point of this is that users on intranets often only type the first part of the domain name into the browser, but the server requires a fully qualified domain name to work properly.
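As a rough illustration of that redirection, the following Python sketch appends a default domain to any hostname that contains no dot. The function name and the domain value .butterthlies.com are assumptions made for the example; Apache's own logic may differ in detail.

```python
from urllib.parse import urlsplit, urlunsplit

def proxy_domain_redirect(url, domain=".butterthlies.com"):
    """Return the URL to redirect to, or None if the host is already
    qualified. The domain value is hypothetical."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if not host or "." in host:
        return None                     # nothing to do
    netloc = host + domain
    if parts.port is not None:          # keep an explicit port
        netloc += ":%d" % parts.port
    return urlunsplit((parts.scheme, netloc, parts.path,
                       parts.query, parts.fragment))
```

A request for http://www/catalog.html would be redirected to http://www.butterthlies.com/catalog.html, while a request that already names a full domain is left alone.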
The NoProxy directive specifies a list of subnets, IP addresses, hosts, and/or domains, separated by spaces. A request to a host that matches one or more of these is always served directly, without forwarding to the configured ProxyRemote proxy server(s).
A reverse proxy is a way to masquerade one server as another, perhaps because the "real" server is behind a firewall or because you want part of a web site to be served by a different machine but not to look that way. It can also be used to share loads between several servers: the frontend server simply accepts requests and forwards them to one of several backend servers. The optional module mod_rewrite has some special stuff in it to support this.
The ProxyPassReverse directive lets Apache adjust the URL in the Location response header. If a ProxyPass (or mod_rewrite) has been used to do reverse proxying, then this directive will rewrite Location headers coming back from the reverse-proxied server so that they look as if they came from somewhere else (normally this server, of course).
The ProxyVia directive controls the use of the Via: HTTP header by the proxy. Its intended use is to control the flow of proxy requests along a chain of proxy servers. See RFC2068 (HTTP 1.1) for an explanation of Via: header lines.
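The effect of rewriting a Location header can be sketched in Python. The example assumes a pairing such as ProxyPassReverse /secrets http://darkstar.com to go with the earlier ProxyPass example; that pairing and the function name rewrite_location are assumptions made for illustration.

```python
def rewrite_location(location,
                     backend="http://darkstar.com",
                     front="http://www.butterthlies.com/secrets"):
    """Rewrite a Location header from the backend so that it points
    at the frontend instead. Other locations pass through unchanged."""
    if location.startswith(backend):
        rest = location[len(backend):]
        # Match only on a path boundary, so that a host such as
        # darkstar.com.example is not rewritten by accident.
        if rest == "" or rest.startswith("/"):
            return front + rest
    return location
```

A redirect from the backend to http://darkstar.com/two.html would reach the browser as http://www.butterthlies.com/secrets/two.html, so the browser keeps talking to the frontend.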
The ProxyReceiveBufferSize directive specifies an explicit network buffer size for outgoing HTTP and FTP connections for increased throughput. It has to be greater than 512 or set to 0 to indicate that the system's default buffer size should be used.
The ProxyBlock directive specifies a list of words, hosts, and/or domains, separated by spaces. HTTP, HTTPS, and FTP document requests to sites whose names contain matched words, hosts, or domains are blocked by the proxy server. During startup, the proxy module will also attempt to determine the IP addresses of list items that may be hostnames and cache them for match testing as well. For example:
ProxyBlock joes-garage.com some-host.co.uk rocky.wotsamattau.edu
rocky.wotsamattau.edu would also be matched if referenced by IP address.
Note that wotsamattau would also be sufficient to match wotsamattau.edu.
Note also that:
ProxyBlock *
blocks connections to all sites.
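The name matching described for ProxyBlock amounts to a substring test, which the following Python sketch illustrates. The function name is_blocked is hypothetical, and the sketch leaves out the IP address lookup that the proxy module performs at startup.

```python
# The list from the ProxyBlock example above.
BLOCK = ["joes-garage.com", "some-host.co.uk", "rocky.wotsamattau.edu"]

def is_blocked(host, block_list=BLOCK):
    """Return True if host contains any listed word, host, or domain.

    A lone "*" in the list blocks every site. IP address matching is
    deliberately omitted from this sketch.
    """
    host = host.lower()
    for item in block_list:
        if item == "*" or item.lower() in host:
            return True
    return False
```

Because the test is a substring match, a list containing only wotsamattau is enough to block rocky.wotsamattau.edu.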
9.3 Apparent Bug
When a server is set up as a proxy, then requests of the form:
GET http://someone.else.com/ HTTP/1.0
are accepted and proxied to the appropriate web server. By default, Apache does not proxy, but it can appear that it is prepared to — requests like the previous will be accepted and handled by the default configuration. Apache assumes that someone.else.com is a virtual host on the current machine. People occasionally think this is a bug, but it is, in fact, correct behavior. Note that pages served will be the same as those that would be served for any real unknown virtual host on the same machine, so this does not pose a security risk.
The proxy server's performance can be improved by caching incoming pages so that the next time one is called for, it can be served straight up without having to waste time going over the Web. We can do the same thing for outgoing pages, particularly pages generated on the fly by CGI scripts and database accesses (bearing in mind that this can lead to stale content and is not invariably desirable).
9.4.1 Inward Caching
Another reason for using a proxy server is to cache data from the Web to save the bandwidth of the world's clogged telephone systems and therefore to improve access time on our server. Note, however, that in practice it often saves bandwidth at the expense of increased access times.
The directive CacheRoot, cunningly inserted in the Config file shown earlier, and the provision of a properly permissioned cache directory allow us to show this happening. We start by providing the directory ... /site.proxy/cache, and Apache then improves on it with some sort of directory structure like ... /site.proxy/cache/d/o/j/gfqbZ@49rZiy6LOCw.
The file gfqbZ@49rZiy6LOCw contains the following:
320994B6 32098D95 3209956C 00000000 0000001E
X-URL: http://192.168.124.1/message
HTTP/1.0 200 OK
Date: Thu, 08 Aug 1996 07:18:14 GMT
Server: Apache/1.1.1
Content-length: 30
Last-modified Thu, 08 Aug 1996 06:47:49 GMT

I am a web site far out there
Next time someone wants to access http://192.168.124.1/message, the proxy server does not have to lug bytes over the Web; it can just go and look it up.
There are a number of housekeeping directives that help with caching.
The CacheRoot directive sets the directory that contains the cache files; it must be writable by Apache.
The CacheSize directive sets the size of the cache area in kilobytes. More may be stored temporarily, but garbage collection reduces it to less than the set number.
The CacheGcInterval directive specifies how often, in hours, Apache checks the cache and does a garbage collection if the amount of data exceeds CacheSize.
The CacheMaxExpire directive specifies how long cached documents are retained. This limit is enforced even if a document is supplied with an expiration date that is further in the future.
The CacheLastModifiedFactor directive covers documents supplied without an expiration time: one is estimated by multiplying the time since last modification by factor. CacheMaxExpire takes precedence.
The CacheDefaultExpire directive supplies the expiration time to use if the document is fetched by a protocol that does not support expiration times. CacheMaxExpire does not override it.
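The way the expiry settings (CacheMaxExpire, CacheLastModifiedFactor, and CacheDefaultExpire) interact can be summarized in a Python sketch. The function and parameter names are hypothetical, all times are assumed to be in hours, and falling back to the default when neither an expiration time nor a modification time is known is an assumption, not something the text states.

```python
def expiry_hours(max_expire, default_expire, factor,
                 supplied=None, age=None, supports_expiry=True):
    """Estimate how long (in hours) a document stays in the cache.

    supplied: lifetime given with the document, if any
    age:      hours since the document was last modified, if known
    """
    if not supports_expiry:
        # Protocol without expiration times: use the default, and
        # the maximum does not override it.
        return default_expire
    if supplied is not None:
        # The maximum is enforced even over a supplied expiry.
        return min(supplied, max_expire)
    if age is not None:
        # Estimate from the last-modified time, capped by the maximum.
        return min(age * factor, max_expire)
    return default_expire   # assumption: nothing else to go on
```

For instance, with a maximum of 24 hours and a factor of 0.5, a document last modified 10 hours ago is kept for 5 hours, while one last modified 1,000 hours ago is capped at 24.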
The proxy module stores its cache with filenames that are a hash of the URL. The filename is split into CacheDirLevels of directory using CacheDirLength characters for each level. This is for efficiency when retrieving the files (a flat structure is very slow on most systems). So, for example:
CacheDirLevels 3
CacheDirLength 2
converts the hash "abcdefghijk" into ab/cd/ef/ghijk. A real hash is actually 22 characters long, each character being one of a possible 64 (2^6), so that three levels, each with a length of 1, give 2^18 directories. This number should be tuned to the anticipated number of cache entries (2^18 being roughly a quarter of a million, and therefore good for caches up to several million entries in size).
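The splitting rule and the directory count can be reproduced with a few lines of Python. The function names are hypothetical; the sketch only illustrates the arithmetic and does not compute Apache's actual hash.

```python
def cache_path(hashed, levels, length):
    """Split a hash string into `levels` directories of `length`
    characters each, followed by the remaining characters."""
    parts = [hashed[i * length:(i + 1) * length] for i in range(levels)]
    parts.append(hashed[levels * length:])
    return "/".join(parts)

def directory_count(levels, length, alphabet=64):
    """Number of leaf directories when every character can take
    one of `alphabet` values."""
    return alphabet ** (levels * length)

print(cache_path("abcdefghijk", 3, 2))   # ab/cd/ef/ghijk
print(directory_count(3, 1))             # 262144, that is, 2^18
```

Raising either CacheDirLevels or CacheDirLength multiplies the number of directories by 64 for each extra character, so small changes have a large effect.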
If present in the Config file, the CacheNegotiatedDocs directive allows content-negotiated documents to be cached by proxy servers. This could mean that clients behind those proxies could retrieve versions of the documents that are not the best match for their abilities, but it will make caching more efficient.
This directive only applies to requests that come from HTTP 1.0 browsers. HTTP 1.1 provides much better control over the caching of negotiated documents, and this directive has no effect on responses to HTTP 1.1 requests. Note that very few browsers are HTTP 1.0 anymore.
The NoCache directive specifies a list of hosts and/or domains, separated by spaces, from which documents are not cached, such as the site delivering your real-time stock market quotes.
The cache directory for the proxy server has to be set up rather carefully with owner webuser and group webgroup, since it will be accessed by that insignificant person (see Chapter 2).
You now have to tell your browser that you are going to be accessing the Web via a proxy. For example, in Netscape click on Edit > Preferences > Advanced > Proxies tab > Manual Proxy Configuration. Click on View, and in the HTTP box enter the IP address of our proxy, which is on the same network, 192.168.123, as our copy of Netscape:
192.168.123.2
Enter 8000 in the Port box.
For Microsoft Internet Explorer, select View > Options > Connection tab, check the Proxy Server checkbox, then click the Settings button, and set up the HTTP proxy as described previously. That is all there is to setting up a real proxy server.
You might want to set up a simulation to watch it in action, as we did, before you do the real thing. However, it is not that easy to simulate a proxy server on one desktop, and when we have simulated it, the elements play different roles from those they have supported in demonstrations so far. We end up with four elements:
The configuration in ... /site.proxy/proxy is as shown earlier. Since the proxy server is running on a machine notionally on the other side of the Web from the machine running ... /site.proxy/real, we need to put it on another port, traditionally 8000.
The configuration file in ... /site.proxy/real is:
User webuser
Group webgroup
ServerName www.faraway.com
Listen www.faraway.com:80
DocumentRoot /usr/www/APACHE3/site.proxy/real/htdocs
On this site, we use the more compendious Listen with the server name and port number combined.
Normally www.faraway.com would be a site out on the Web. In our case we dummied it up on the same machine.
In ... /site.proxy/real/htdocs there is a file containing the message:
I am a web site far, far out there.
Also in /etc/hosts there is an entry:
192.168.124.1 www.faraway.com
simulating a proper DNS registration for this far-off site. Note that it is on a different network (192.168.124) from the one we normally use (192.168.123), so that when we try to access it over our LAN, we can't without help.
The file /usr/www/lan_setup on the FreeBSD machine is now:
ifconfig ep0 192.168.123.2
ifconfig ep0 192.168.123.3 alias netmask 0xFFFFFFFF
ifconfig ep0 192.168.124.1 alias
Now for the action: go to ... /site.proxy/real, and start the server with ./go; then go to ... /site.proxy/proxy, and start it with ./go. On your browser, access http://192.168.124.1/. You should see the following:
Index of /
. Parent Directory
. message
If we select message, we see:
I am a web site far out there
Fine, but are we fooling ourselves? Go to the browser's proxy settings, and disable the HTTP proxy by removing the IP address:
Then reaccess http://192.168.124.1/. You should get some sort of network error.
What happened? We asked the browser to retrieve http://192.168.124.1/. Since it is on network 192.168.123, it failed to find this address. So instead it used the proxy server at port 8000 on 192.168.123.2. It sent its message there:
GET http://192.168.124.1/ HTTP/1.0
The copy of Apache running on the FreeBSD machine, listening to port 8000, was offered this morsel and accepted the message. Since that copy of Apache had been told to service proxy requests, it retransmitted the request to the destination we thought it was bound for all the time: 192.168.124.1 (which it can do since it is on the same machine):
GET / HTTP/1.0
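The translation between the two request lines can be sketched in Python: the proxy takes the host out of the absolute URL, connects to it, and sends on only the path. The function name forward_request_line is hypothetical, and port 80 as the fallback is the usual HTTP default rather than something taken from the configuration above.

```python
from urllib.parse import urlsplit

def forward_request_line(request_line):
    """Split a proxy-style request line into the (host, port) to
    connect to and the request line to send to that server."""
    method, target, version = request_line.split()
    parts = urlsplit(target)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    port = parts.port or 80             # default HTTP port
    return (parts.hostname, port), "%s %s %s" % (method, path, version)

print(forward_request_line("GET http://192.168.124.1/ HTTP/1.0"))
# (('192.168.124.1', 80), 'GET / HTTP/1.0')
```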
In real life, things are simpler: you only have to carry out steps two and three, and you can ignore the theology. When you have finished with all this, remember to remove the HTTP proxy IP address from your browser setup.
9.5.1 Reverse Proxy
This section explains a configuration setup for proxying your backend mod_perl servers when you need to use virtual hosts. See perl.apache.org/guide/scenario.html, from which we have quoted freely. While you are better off getting it right in the first place (i.e. using different URLs for the different servers), there are at least three reasons you might want to rewrite:
The term virtual host refers to the practice of maintaining more than one server on one machine, as differentiated by their apparent hostname. For example, it is often desirable for companies sharing a web server to have their own domains, with web servers accessible as www.company1.com and www.company2.com, without requiring the user to know any extra path information.
One approach is to use a unique port number for each virtual host on the backend server, so that the frontend server can redirect to, say, localhost:1234, and to use name-based virtual servers on the frontend, though any technique on the frontend will do.
If you run the frontend and the backend servers on the same machine, you can prevent any direct outside connections to the backend server if you bind tightly to address 127.0.0.1 (localhost), as you will see in the following configuration example.
This is the frontend (light) server configuration:
<VirtualHost 10.10.10.10>
    ServerName www.example.com
    ServerAlias example.com
    RewriteEngine On
    RewriteOptions 'inherit'
    RewriteRule \.(gif|jpg|png|txt|html)$ - [last]
    RewriteRule ^/(.*)$ http://localhost:4077/$1 [proxy]
</VirtualHost>

<VirtualHost 10.10.10.10>
    ServerName foo.example.com
    RewriteEngine On
    RewriteOptions 'inherit'
    RewriteRule \.(gif|jpg|png|txt|html)$ - [last]
    RewriteRule ^/(.*)$ http://localhost:4078/$1 [proxy]
</VirtualHost>
This frontend configuration handles two virtual hosts: www.example.com and foo.example.com. The two setups are almost identical.
The frontend server will handle files with the extensions .gif, .jpg, .png, .txt, and .html internally; the rest will be proxied to be handled by the backend server.
The only difference between the two virtual-host settings is that the former rewrites requests to port 4077 at the backend machine and the latter to port 4078.
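The decision the two RewriteRule lines make for each request can be mimicked in Python. This is only a model of the rule order, not of mod_rewrite itself; the function name route and the host-to-port table are assumptions built from the configuration above.

```python
import re

# Same pattern as the first RewriteRule in each virtual host.
STATIC = re.compile(r"\.(gif|jpg|png|txt|html)$")

# Backend port for each frontend virtual host.
BACKEND_PORTS = {"www.example.com": 4077, "foo.example.com": 4078}

def route(host, path):
    """Return the backend URL a request is proxied to, or None if
    the frontend serves it itself."""
    if STATIC.search(path):
        # "-" means no substitution; [last] stops further rules.
        return None
    m = re.match(r"^/(.*)$", path)
    if m:
        return "http://localhost:%d/%s" % (BACKEND_PORTS[host], m.group(1))
    return None
```

So route("www.example.com", "/logo.gif") stays on the frontend, while route("www.example.com", "/perl/search") is sent to http://localhost:4077/perl/search.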
If your server is configured to run traditional CGI scripts (under mod_cgi), as well as mod_perl CGI programs, then it would be beneficial to configure the frontend server to run the traditional CGI scripts directly. This can be done by altering the gif|jpg|png|txt|html RewriteRule to add |cgi at the end if all your mod_cgi scripts have the .cgi extension, or by adding a new rule to handle all /cgi-bin/* locations locally.
Here is the backend (heavy) server configuration:
Port 80
PerlPostReadRequestHandler My::ProxyRemoteAddr

Listen 4077
<VirtualHost localhost:4077>
    ServerName www.example.com
    DocumentRoot /home/httpd/docs/www.example.com
    DirectoryIndex index.shtml index.html
</VirtualHost>

Listen 4078
<VirtualHost localhost:4078>
    ServerName foo.example.com
    DocumentRoot /home/httpd/docs/foo.example.com
    DirectoryIndex index.shtml index.html
</VirtualHost>
The backend server can tell which virtual host a request is meant for by checking the port number to which the request was proxied, and it uses the appropriate virtual host section to handle it.
We set Port 80 so that any redirects use 80 as the port for the URL, rather than the port on which the backend server is actually running.
To get the real remote IP addresses from the proxy, the My::ProxyRemoteAddr handler is used; it is based on the mod_proxy_add_forward Apache module. Prior to mod_perl 1.22, this setting had to be made per virtual host, since it wasn't inherited by the virtual hosts.
The following configuration is another useful example that works the other way around. It specifies what is to be proxied, and the rest is served by the frontend:
RewriteEngine on
RewriteLogLevel 0
RewriteRule ^/(perl.*)$ http://127.0.0.1:8052/$1 [P,L]
NoCache *
ProxyPassReverse / http://www.example.com/
With this approach we don't have to specify a rule for static objects, as we did in the previous example to make the frontend handle files with the extensions .gif, .jpg, .png, .txt, and .html internally.
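The inverted rule can be modeled in the same way. In this Python sketch, whose function name is hypothetical, only paths that begin with /perl are handed to the backend on port 8052, and everything else stays on the frontend.

```python
import re

def route_perl_only(path):
    """Return the backend URL for paths matched by ^/(perl.*)$,
    or None if the frontend serves the request itself."""
    m = re.match(r"^/(perl.*)$", path)
    if m:
        return "http://127.0.0.1:8052/" + m.group(1)
    return None
```

Note that the pattern matches any path whose first component starts with perl, so /perl/test.pl is proxied but so would /perlish/page be; a tighter pattern may be preferable in practice.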