Chapter 2. The Hypertext Transport ProtocolThe Hypertext Transport Protocol (HTTP) is the common language that web browsers and web servers use to communicate with each other on the Internet. CGI is built on top of HTTP, so to understand CGI fully, it certainly helps to understand HTTP. One of the reasons CGI is so powerful is because it allows you to manipulate the metadata exchanged between the web browser and server and thus perform many useful tricks, including:
We won't cover all of the details of HTTP, just what is important for our understanding of CGI. Specifically, we'll focus on the request and response process: how browsers ask for and receive web pages. If you are interested in understanding more about HTTP than we provide here, visit the World Wide Web Consortium's web site at http://www.w3.org/Protocols/. On the other hand, if you are eager to get started writing CGI scripts, you may be tempted to skip this chapter. We encourage you not to. Although you can certainly learn to write CGI scripts without learning HTTP, without the bigger picture you may end up memorizing what to do instead of understanding why. This is certainly the most challenging chapter, however, because we cover a lot of material without many examples. So if you find it a little dry and want to peek ahead to the fun stuff, we'll forgive you. Just be sure to return here later. 2.1. URLsDuring our discussion of HTTP and CGI, we will be often be referring to URLs , or Uniform Resource Locators. If you have used the Web at all, then you are probably familiar with URLs. In web terms, a resource represents anything available on the web, whether it be an HTML page, an image, a CGI script, etc. URLs provide a standard way to locate these resources on the Web. Note that URLs are not actually specific to HTTP; they can refer to resources in many protocols. Our discussion here will focus strictly on HTTP URLs. 2.1.1. Elements of a URLHTTP URLs consist of a scheme, a host name, a port number, a path, a query string, and a fragment identifier, any of which may be omitted under certain circumstances (see Figure 2-1). Figure 2-1. Components of a URLHTTP URLs contain the following elements:
2.1.2. Absolute and Relative URLsMany of the elements within a URL are optional. You may omit the scheme, host, and port number in a URL if the URL is used in a context where these elements can be assumed. For example, if you include a URL in a link on an HTML page and leave out these elements, the browser will assume the link applies to a resource on the same machine as the link. There are two classes of URLs:
2.1.3. URL EncodingMany characters must be encoded within a URL for a variety of reasons. For example, certain characters such as ?, #, and / have special meaning within URLs and will be misinterpreted unless encoded. It is possible to name a file doc#2.html on some systems, but the URL http://localhost/doc#2.html would not point to this document. It points to the fragment 2.html in a (possibly nonexistent) file named doc. We must encode the # character so the web browser and server recognize that it is part of the resource name instead. Characters are encoded by representing them with a percent sign (%) followed by the two-digit hexadecimal value for that character based upon the ISO Latin 1 character set or ASCII character set (these character sets are the same for the first eight bits). For example, the # symbol has a hexadecimal value of 0x23, so it is encoded as %23. The following characters must be encoded:
Additionally, spaces should be encoded as + although %20 is also allowed. As you can see, most characters must be encoded; the list of allowed characters is actually much shorter:
It is actually permissible and not uncommon for any of the allowed characters to also be encoded by some software. Thus, any application that decodes a URL must decode every occurrence of a percentage sign followed by any two hexadecimal digits. The following code encodes text for URLs: sub url_encode { my $text = shift; $text =~ s/([^a-z0-9_.!~*'( ) -])/sprintf "%%%02X", ord($1)/ei; $text =~ tr/ /+/; return $text; } Any character not in the allowed set is replaced by a percentage sign and its two-digit hexadecimal equivalent. The three percentage signs are necessary because percentage signs indicate format codes for sprintf, and literal percentage signs must be indicated by two percentage signs. Our format code thus includes a percentage sign, %%, plus the format code for two hexadecimal digits, %02X. Code to decode URL encoded text looks like this: sub url_decode { my $text = shift; $text =~ tr/\+/ /; $text =~ s/%([a-f0-9][a-f0-9])/chr( hex( $1 ) )/ei; return $text; } Here we first translate any plus signs to spaces. Then we scan for a percentage sign followed by two hexadecimal digits and use Perl's chr function to convert the hexadecimal value into a character. Neither the encoding nor the decoding operations can be safely repeated on the same text. Text encoded twice differs from text encoded once because the percentage signs introduced in the first step would themselves be encoded in the second. Likewise, you cannot encode or decode entire URLs. If you were to decode a URL, you could no longer reliably parse it, for you may have introduced characters that would be misinterpreted such as / or ?. You should always parse a URL to get the components you want before decoding them; likewise, encode components before building them into a full URL. Note that it's good to understand how a wheel works but reinventing it would be pointless. Even though you have just seen how to encode and decode text for URLs, you shouldn't do so yourself. The URI::URL module (actually it is a collection of modules), available on CPAN (see Appendix B, "Perl Modules"), provides many URL-related modules and functions. One of the included modules, URI::Escape, provides the url_escape and url_unescape functions. Use them. The subroutines in these modules have been vigorously tested, and future versions will reflect any changes to HTTP as it evolves.[2] Using standard subroutines will also make your code much clearer to those who may have to maintain your code later (this includes you).
If, despite these warnings, you still insist on writing your own decoding code yourself, at least place it in appropriately named subroutines. Granted, some of these actions take only a line or two of code, but the code is quite cryptic, and these operations should be clearly labeled. Copyright © 2001 O'Reilly & Associates. All rights reserved. |
|