HTTP Client Security (Building Internet Firewalls, 2nd Edition)

15.2. HTTP Client Security

The security problems of HTTP clients are just as complex as those of HTTP servers. They fall into three broad areas:

Inadvertent release of information
Vulnerabilities in external viewers
Vulnerabilities in extension systems

15.2.1. Inadvertent Release of Information

There are a number of ways in which web browsers may give away information that you didn't intend to make public. The most common is that they fail to protect passwords and usernames in ways that you might expect. Many web pages pass usernames and passwords around completely unprotected; some of them even embed them into URLs, making it easy to accidentally store them in your bookmarks file or mail them to a friend along with the location of an interesting web page.

The default authentication for web pages is something called basic authentication. This is what's happening when you ask for a page, and instead of bringing up the page, your web browser brings up a standard dialog box that asks you for your username and password. There's no encryption protecting that username and password; it's just sent to the server in cleartext. Furthermore, if you ask for another page from the same server, the username and password will be sent again, without any warning, still unencrypted.
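To see how little protection basic authentication offers, consider how the browser builds the header it sends. The following is a minimal sketch in Python; the function names and the sample credentials are ours, chosen purely for illustration. The username and password are joined with a colon and base64-encoded, which is an encoding rather than encryption, so anyone who sees the request can reverse it without a key.

```python
import base64
from typing import Tuple


def basic_auth_header(username: str, password: str) -> str:
    """Build the Authorization header value used by basic authentication."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def recover_credentials(header_value: str) -> Tuple[str, str]:
    """What an eavesdropper can do: undo the encoding, no key required."""
    scheme, token = header_value.split(" ", 1)
    if scheme != "Basic":
        raise ValueError("not a basic authentication header")
    username, password = base64.b64decode(token).decode("utf-8").split(":", 1)
    return username, password


# Hypothetical credentials, for illustration only.
header = basic_auth_header("alice", "opensesame")
print(header)
print(recover_credentials(header))
```

The same header accompanies every later request to that server, which is why the credentials are exposed again on each page you fetch.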

A web site can protect the username and password data by telling the browser to use HTTPS, instead of HTTP, to make the connection. (HTTPS is discussed further later in this chapter.) This will encrypt the entire communication, including the authentication information. Unfortunately, you usually can't tell in advance whether a web site has done this. Although there's a little lock icon to tell you when the page you're looking at is encrypted, most clients don't display it, or any other indication that things have been secured, until the page actually loads, which is after you've already given out the username and password.

That's not the only way in which it's difficult to tell what a web server is going to do with your password. You may be able to tell if the web server is doing something extremely insecure, like embedding the password in cleartext in a URL, but you can't tell if it's storing it properly on the other end. Therefore, while you can sometimes be sure that your password isn't protected, you can never be sure that it is, and you should assume that it isn't. Don't use passwords you've sent over the Web to protect anything you care about deeply. You should use different passwords on different web sites, and they should not be the same passwords that you use anywhere else.

If you're going to send a web site important data, you should be sure that the site has made a legally binding commitment to protect that data appropriately. You should also be sure that you have an encrypted connection to the site.

15.2.1.1. Cookies

Cookies are a way for a site to track information about you. It can be information that lasts a long time (for instance, information about what you want to see when you come to the site) or information about what you've just done (for instance, information about what items you want to buy in the current transaction). Cookies are important for web sites because it's otherwise extremely difficult to keep track of anything about a web browser. Each time a browser requests a page, it's a new transaction, with no way for the server to know what previous transaction it might be related to. Cookies provide a way for a server to ask a web browser to keep track of and possibly store some data.

A cookie is a fairly simple object; it's a small amount of information to be stored that is associated with an identifying string, an expiration date, and a URL pattern that indicates when the cookie should be sent with HTTP requests. Whenever you visit a web site, the browser checks to see if any unexpired cookies match the URL pattern, and if so, the browser sends them along with your request.
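The matching step described above can be sketched in a few lines of Python. This is a simplified model under our own assumptions, not the logic of any particular browser: the `Cookie` class, its field names, and the sample hostnames are hypothetical, the URL pattern is reduced to a domain plus a path prefix, and a leading dot on the domain is taken to mean "this domain and its subdomains".

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List
from urllib.parse import urlsplit


@dataclass
class Cookie:
    name: str          # the identifying string
    value: str         # the stored information
    domain: str        # "shop.example.com" or ".example.com"
    path: str          # path prefix the cookie applies to
    expires: datetime  # expiration date


def matches(cookie: Cookie, url: str, now: datetime) -> bool:
    """Return True if an unexpired cookie's pattern matches the URL."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path or "/"
    domain = cookie.domain.lower()
    if domain.startswith("."):
        host_ok = host == domain[1:] or host.endswith(domain)
    else:
        host_ok = host == domain
    return host_ok and path.startswith(cookie.path) and now < cookie.expires


def cookies_for(jar: List[Cookie], url: str, now: datetime) -> List[str]:
    """Names of the cookies the browser would send with a request for url."""
    return [c.name for c in jar if matches(c, url, now)]


# Hypothetical cookie jar, for illustration only.
later = datetime(2030, 1, 1, tzinfo=timezone.utc)
past = datetime(2000, 1, 1, tzinfo=timezone.utc)
jar = [
    Cookie("customer", "8841", ".example.com", "/", later),
    Cookie("cart", "a1b2", "shop.example.com", "/checkout", later),
    Cookie("stale", "x", ".example.com", "/", past),
]
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(cookies_for(jar, "http://shop.example.com/checkout/pay", now))
print(cookies_for(jar, "http://www.example.com/", now))
```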

The information that's in a cookie isn't usually of any great interest by itself. Cookies tend to contain customer identifiers or coded lists of preferences -- things that make sense only to the site that set the cookie. This is important because cookies are passed across the network unencrypted, and you wouldn't want them to have anything dangerous in them.

On the other hand, once a web site gets a cookie, it may give back information that you do care about. For instance, it might use the cookie to look up the credit card information you used on your last order (in which case somebody with that cookie could order things on your credit card). For that matter, it might just look up your last order and display it along with your name. Since cookies are passed unencrypted, and can be intercepted at any point, it's not good practice to use them for anything critical, but some web sites do.

In addition, many people are worried about situations in which cookies can be used to track patterns of usage. When you use a link on a web page, the site you go to gets information about the page the link was on (this is called the referrer). If the site you go to also has a cookie that will identify you, it can build up a history of the places you have come from. This wouldn't be very interesting for most sites, but sites that put up banner advertisements have links on thousands of web pages and can build up a fairly accurate picture of where you come from. Since referrer information includes the entire URL, this will include information about specific search requests and, in the worst cases, may contain password and username information.
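To make the referrer problem concrete, here is a small Python sketch of what a receiving site can pull out of the referring URL (sent in the HTTP header that the protocol spells `Referer`). The function name and the sample URL are hypothetical; only the standard library's URL parsing is used.

```python
from urllib.parse import parse_qs, urlsplit


def referrer_details(referrer: str) -> dict:
    """Break a referring URL into the pieces a receiving site can log."""
    parts = urlsplit(referrer)
    return {
        "host": parts.hostname,
        "path": parts.path,
        "query": parse_qs(parts.query),   # search terms, form fields, ...
        "embedded_user": parts.username,  # set if credentials are in the URL
    }


# Hypothetical referring URL, for illustration only.
details = referrer_details(
    "http://search.example.com/find?q=firewall+books&user=alice&password=s3cret"
)
print(details["host"])
print(details["query"])
```

A site that pairs these details with an identifying cookie can attach them to a particular visitor, which is the tracking concern described above.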

There are some controls on the use of cookies. Some browsers don't support cookies at all; others will allow you to control the situations where you give out a cookie, by choosing to reject all cookies, asking you before accepting cookies, or accepting only certain cookies. For instance, cookies are intended to be returned only to the site that set them, but the definition of "site" is unclear. The cookie specifies what hostnames it should be returned to and may give a specific hostname or a broad range of hostnames. Some browsers can be configured so that they will accept only cookies that will be returned to the host that originally set the cookie.
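The last of these controls amounts to a simple comparison at the time the cookie is offered. The sketch below is our own illustration in Python, not the behavior of any specific browser; `accept_cookie` and its `strict` parameter are hypothetical names. In strict mode a cookie is accepted only if it would go back to exactly the host that set it; otherwise any broader domain that contains the setting host is allowed.

```python
from typing import Optional


def accept_cookie(setting_host: str, cookie_domain: Optional[str],
                  strict: bool) -> bool:
    """Decide whether to store a cookie offered by setting_host."""
    host = setting_host.lower()
    if cookie_domain is None:
        return True                      # no domain given: host-only cookie
    domain = cookie_domain.lower()
    if domain == host:
        return True                      # returned only to the setting host
    if strict:
        return False                     # refuse anything broader
    suffix = "." + domain.lstrip(".")
    return host.endswith(suffix)         # broader, but still covers the host


print(accept_cookie("shop.example.com", ".example.com", strict=True))
print(accept_cookie("shop.example.com", ".example.com", strict=False))
```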

15.2.2. External Viewers

HTTP servers can provide data in any number of formats: plain text files, HTML files, PostScript documents, image files (PNG, JPEG, and GIF), movie files (MPEG), audio files, and so on. The servers use MIME, discussed briefly in Chapter 16, "Electronic Mail and News", to format the data and specify its type. HTTP clients generally don't attempt to understand and process all of these different data formats. They understand some types (such as HTML, plain text, JPEG, and GIF), and they rely on external programs to deal with the rest. These external programs will display, play, preview, print, or do whatever is appropriate for the format.

For example, web browsers confronted with a PDF file will ordinarily invoke Adobe Acrobat Exchange or Acrobat Reader, and web browsers confronted with a compressed file will ordinarily invoke a decompression program. The user controls (generally via a configuration file) what data types the HTTP client knows about, which programs to invoke for which data types, and what arguments to pass to those programs. If the user hasn't provided a configuration file, the HTTP client generally uses a built-in default or a systemwide default.
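In outline, the client's job here is a table lookup keyed on the MIME type the server declares. The following Python sketch shows the idea under stated assumptions: the `VIEWERS` table, the `{file}` placeholder, and the `acroread` command name are illustrative inventions, not the configuration format of any real browser, and the sketch only builds the command line without running it.

```python
from typing import Dict, List, Optional

# Hypothetical viewer table: MIME type -> command template.
VIEWERS: Dict[str, List[str]] = {
    "application/pdf": ["acroread", "{file}"],
    "application/postscript": ["gs", "-dSAFER", "{file}"],
}


def viewer_command(content_type: str, filename: str,
                   viewers: Dict[str, List[str]] = VIEWERS) -> Optional[List[str]]:
    """Return the external command for a declared type, or None if unknown."""
    # Drop parameters such as "; charset=..." and normalize case.
    mime_type = content_type.split(";", 1)[0].strip().lower()
    template = viewers.get(mime_type)
    if template is None:
        return None
    return [arg.replace("{file}", filename) for arg in template]


print(viewer_command("application/postscript", "paper.ps"))
print(viewer_command("video/mpeg", "clip.mpg"))
```

Everything that matters for security lives in that table: which program is named, and which arguments it is given.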

These external programs may be obviously separate from the web browser, or they may be plug-ins, programs that are not part of the web browser but that integrate with it seamlessly. Plug-ins are simply external programs that can be run by the browser and will display information in windows that the browser controls. There is more than one type of plug-in technology; Microsoft's ActiveX and Netscape's Plug-Ins can both be used to provide seamless integration with a browser. Despite the fact that they look like parts of the browser, they have the same security implications as other external programs.

All of these external programs present two security concerns:

What are the inherent capabilities of the external programs an attacker might take advantage of?
What new programs (or new arguments for existing programs) might an attacker be able to convince the user to add to the local configuration?

Let's consider, for example, what an HTTP client is going to do with a PostScript file. PostScript is a language for controlling printers. While primarily intended for that purpose, it is a full programming language, complete with data structures, flow of control operators, and file input/output operators. These operators ("read file", "write file", "create file", "delete file", etc.) are seldom used, except on printers with local disks for font storage, but they're there as part of the language. PostScript previewers (such as GhostScript) generally implement these operators for completeness.

Suppose that a user uses Internet Explorer to pull down a PostScript document. Internet Explorer invokes GhostScript, and it turns out that the document has PostScript commands in it that say "delete all files in the current directory". If GhostScript executes the commands, who's to blame? You can't really expect Internet Explorer to scan the PostScript on the way through to see if it's dangerous; that's an impossible task. You can't really expect GhostScript not to do what it's told in valid PostScript code. You can't really expect your users not to download PostScript code or to scan it themselves.

Current versions of GhostScript have a safer mode they run in by default. This mode disables "dangerous" operators such as those for file input/output. But what about all the other PostScript interpreters or previewers? And what about the applications to handle all the other data types? How safe are they? Who knows?

Even if you have safe versions of these auxiliary applications, how do you keep your users from changing their configuration files to add new applications, run different applications, or pass different arguments (for example, to disable the safer mode of GhostScript) to the existing applications?

Why would a user do this? Suppose that the user found something on the web that claimed to be something really neat -- a game demo, a graphics file, a copy of the hottest new song, whatever. And suppose that this desirable something came with a note that said "Hey, before you can access this Really Cool Thing, you need to modify your browser configuration, because the standard configuration doesn't know how to deal with this thing; here's what you do. . . ." And suppose that the instructions were something like "remove the `-dSAFER' flag from the configuration for files of type PostScript"?

Would your users recognize that they were being instructed to do something dangerous? Even if they recognized it, would they do it anyway (nice, trusting people that they are)?
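One partial safeguard is to check viewer configurations periodically for exactly this kind of change. The Python sketch below assumes, for illustration, that the configuration is kept as mailcap-style lines of the form "type; command"; the function name and sample configuration are ours. It flags any PostScript entry whose command no longer includes the `-dSAFER' flag.

```python
import shlex
from typing import List


def postscript_entries_missing_safer(config_text: str) -> List[str]:
    """Return PostScript viewer lines whose command lacks -dSAFER."""
    flagged = []
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and comments
        mime_type, _, command = line.partition(";")
        if mime_type.strip().lower() != "application/postscript":
            continue
        if "-dSAFER" not in shlex.split(command):
            flagged.append(line)
    return flagged


# Hypothetical configurations, for illustration only.
good_config = "application/postscript; gs -dSAFER %s\napplication/pdf; acroread %s"
tampered_config = "application/postscript; gs %s\napplication/pdf; acroread %s"
print(postscript_entries_missing_safer(good_config))
print(postscript_entries_missing_safer(tampered_config))
```

A check like this catches only the one change it looks for; it does nothing about a user who swaps in a different interpreter altogether.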

15.2.3. Extension Systems

Using external viewers is not the only way to extend the capabilities of a web browser. Most browsers also support at least one system that allows web pages to download programs that will be executed by the browser. A variety of different extension systems are supported by different browsers, and they have very different security models and implications. The details of the different systems are discussed in Section 15.4, "Mobile Code and Web-Related Languages", later in this chapter. Even though the systems differ, they have a certain number of goals and security implications in common.

These extension systems are very convenient; it is often much more efficient to have the browser do some calculations itself than to have to send data to an HTTP server, have it do some calculations, and get the answer back. In addition, extension languages allow for a much more powerful and flexible interface between the browser and the full capabilities of the computer than you can get by using external viewers.

For instance, if you are filling out a form, it's annoying to have to submit the form to the server and wait for it to come back and tell you that you've omitted a required piece of information. It's preferable for your browser to be able to tell you that immediately. Similarly, if your happiness depends on having penguins dance across the screen, the most efficient way to get that effect is going to be to tell your browser how to draw a dancing penguin and where to move it.

On the other hand, filling out forms and drawing dancing penguins are not all that interesting. In order for extension languages to actually do interesting and useful tasks, they have to have more capabilities, but the more capabilities that are available, the more dangerous a language is.

Of course, normal programming languages have lots of capabilities and therefore lots of dangers, but people don't usually find this worrisome. This is because when you get a program written in a normal programming language, you generally decide that you want the program, you go out looking for it, you have some information about where it comes from, and you explicitly choose to run it. When you get a program as part of a web page, it just shows up and runs; you may be happily watching the dancing penguins and not knowing that anything else is happening.

We discuss the different approaches taken by extension languages in the following sections, as we discuss the specific languages. All of them do attempt to provide security, but none of them is problem free.

15.2.4. What Can You Do?

There is no simple, foolproof defense against the types of problems we've described. For now, you have to rely on a combination of carefully installed and configured client and auxiliary programs, and a healthy dose of user education and awareness training. This is an area of active research and development, and both the safeguards and the attacks will probably develop significantly over the next couple of years.

Content-aware firewalls, whether they are packet filters or proxies, can be of considerable help in reducing client vulnerability. A firewall that pays attention to content can control which extension languages and which types of files are passed through; it is even possible for it to do virus scanning on executables. Unfortunately, it's not possible to do a truly satisfactory job of protection even with a content-aware firewall.

Using content-based filtering, you have two options: you can filter out everything that might be dangerous, or you can filter out only those things you know for certain are dangerous. In the first case, you simply filter out all scripting languages; in the second case, you filter out known attacks. Be cautious of products that claim to filter out all hostile code and only hostile code. Accurately determining what code is hostile by simply looking at the code is impossible in the most specific, logical, and mathematical sense of the term. For useful scripting languages, it is equivalent to solving the Turing halting problem (determining whether an arbitrary piece of code ever stops executing), and the proof that it is impossible is one of the most famous and fundamental results in theoretical computer science.

It is possible to recognize particular pieces of hostile code, and sometimes even to recognize patterns that are common to many pieces of hostile code. Most content-based filtering systems rely on recognizing known attacks. Any time somebody comes up with a new attack, you will be vulnerable to it until the filter is updated, which may of course be after you have been attacked. Many content-based filters are easily fooled by trivial changes to the attack. Content filters that try to remove only hostile code fundamentally use the same technology as virus detectors. This has the advantage that it's a well-understood problem for vendors, who know about creating and distributing signatures. It has the disadvantage that it's a well-understood problem for attackers, as well, who have a variety of programs at their disposal for permuting programs to change their signatures.
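The weakness of signature matching is easy to demonstrate. In the Python sketch below, the signature list and both page fragments are invented for illustration and are not taken from any real filter or attack. The filter looks for a fixed byte sequence, and a variant that merely splits a string in two produces the same effect in the browser while no longer containing that sequence.

```python
from typing import Iterable

# Hypothetical signature: a script writing an applet tag into the page.
SIGNATURES = [b'document.write("<applet']


def looks_hostile(content: bytes, signatures: Iterable[bytes] = SIGNATURES) -> bool:
    """Signature filter: flag content containing any known byte sequence."""
    return any(signature in content for signature in signatures)


original = b'<script>document.write("<applet code=Evil.class>")</script>'
# Same behavior in the browser, but the string is assembled at run time.
variant = b'<script>document.write("<app" + "let code=Evil.class>")</script>'

print(looks_hostile(original))
print(looks_hostile(variant))
```

The first fragment is caught and the second passes, even though both cause the browser to do the same thing.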

Using content filtering to remove all scripting languages is safer. Unfortunately, you really do need to remove all scripting languages, since otherwise, it is possible to pass through JavaScript or VBScript programs that create Java code or ActiveX controls. Many web pages are difficult or impossible to use if you filter out all scripting languages (and a significant percentage of them don't detect the problem and are just mysteriously blank or nonfunctional). In addition, using content filtering to remove scripting can interfere with some of the methods that servers use to try to deal with clients that don't support scripting languages. For instance, some servers attempt to determine capabilities of the client by the information the client provides about the browser version, which doesn't provide any information about the filter in the middle. Some pages may also try to use JavaScript to determine if Java is available. This means that pages that work fine if scripting languages are turned off at the browser may still fail miserably when a filter removes the scripts.
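The "remove everything" approach can be sketched with Python's standard HTML parser. This is a deliberately incomplete illustration under our own assumptions: the class name, the `BLOCKED` set, and the sample page are hypothetical, and a real filter would also have to deal with event-handler attributes, script-bearing URLs, and malformed markup. If a blocked element is never closed, the sketch drops the rest of the page, which errs on the side of caution.

```python
from html.parser import HTMLParser

BLOCKED = {"script", "applet", "object"}  # elements removed with their contents


class ScriptStripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # greater than zero while inside a blocked element

    def handle_starttag(self, tag, attrs):
        if tag in BLOCKED:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag not in BLOCKED and self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in BLOCKED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&#{name};")


def strip_scripts(html: str) -> str:
    """Return html with blocked elements and their contents removed."""
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


page = '<p>Hi</p><script>alert(1)</script><p>Bye &amp; thanks</p>'
print(strip_scripts(page))
```

A page that relies on the removed script to draw its content will come through this filter intact but empty, which is the usability cost described above.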

As we mention later, content filtering is impossible on some web pages; connections made with HTTPS instead of with HTTP are encrypted, and the firewall cannot tell what is in them to do content filtering.

15.2.5. Internet Explorer and Security Zones

One way for a browser to improve its security is to treat different sites differently. It's reasonable to allow an internal web site to do more risky things than an external one, for instance.

Starting with Internet Explorer 4.0, Microsoft introduced the concept of security zones to allow you to configure your browser to do this. Explorer defines multiple security zones and sets different default security policies for them. For instance, there is a security zone for the intranet, which by default accepts all signed ActiveX controls and asks you if you want to allow each unsigned control, and one for the Internet, which by default asks you if you want to accept each signed control and rejects all unsigned controls. (ActiveX controls and signatures are discussed later in this chapter.) There is also a security zone that applies only to data originating on the local machine (this is not supposed to include cached data that was originally loaded from the Internet). The local machine zone is the most trusted zone.

In most cases, Internet Explorer uses the host portion of the URL to determine what zone a page is in. Because the different zones have different security policies, it's important that Internet Explorer get this right. However, there have been several problems with the way that Internet Explorer does this, some of which Microsoft has fixed and some of which are not fixable. In particular, any hostname that does not contain a period is assumed to be in the intranet zone. Originally, there were various ways of referring to Internet hosts by IP address that could force any Internet host to be treated as an intranet host. These problems have been removed, and there is now no known way to write a link that will force Internet Explorer to consider it part of the intranet zone.

However, there are numerous ways for people to set themselves up so that external hosts are considered intranet hosts, and the security implications are unlikely to be clear to them. For instance, adding a domain name to the Domain Suffix Search Order in DNS properties will make all hosts in that domain part of the intranet zone; for a less sweeping effect, any host that's present in LMHOSTS or HOSTS with a short name is also part of the intranet zone. An internal web server that will act as an intermediary and retrieve external pages will make all those pages part of the intranet zone. The most notable class of programs that do this sort of thing are translators, like AltaVista's Babelfish (http://babelfish.altavista.com), which will translate English to French, among other options, or RinkWorks' Dialectizer (http://www.rinkworks.com/dialect), which will show you the page as if it were spoken by the cartoon character Elmer Fudd, among other options.
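The hostname rule, and the way an intermediary defeats it, can be illustrated with a short Python sketch. This models only the "no period means intranet" rule described above and is not Internet Explorer's actual implementation; the function name and the hostnames `payroll` and `translator` are hypothetical.

```python
from urllib.parse import urlsplit


def zone_for(url: str) -> str:
    """Classify a URL by the dotless-hostname rule alone."""
    host = urlsplit(url).hostname
    if host is None:
        return "unknown"
    return "intranet" if "." not in host else "internet"


print(zone_for("http://payroll/index.html"))
print(zone_for("http://www.example.com/"))
# An internal intermediary fetching an external page: only the
# intermediary's short name is visible in the host portion of the URL.
print(zone_for("http://translator/fetch?url=http://www.example.com/"))
```

The third URL is classified as intranet even though the content it returns comes from an external site, because the decision is made on the host portion of the URL alone.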