6.2. Referencing Documents: The URL
As we discussed earlier, every document on the World Wide Web has a unique address. (Imagine the chaos if they didn't.) The document's address is known as its uniform resource locator (URL).
Several tags include a URL attribute value, including hyperlinks, inline images, and forms. All use the same URL syntax to specify the location of a web resource, regardless of the type or content of that resource. That's why it's known as a uniform resource locator.
Since they can be used to represent almost any resource on the Internet, URLs come in a variety of flavors. All URLs, however, have the same top-level syntax:
scheme:scheme_specific_part
The scheme describes the kind of object the URL references; the scheme_specific_part is, well, the part that is peculiar to the specific scheme. The important thing to note is that the scheme is always separated from the scheme_specific_part by a colon with no intervening spaces.
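To make the top-level syntax concrete, here is a minimal Python sketch that splits a URL at its first colon. The function name split_scheme is hypothetical, chosen only for illustration; the sample URLs are taken from later in this section.

```python
def split_scheme(url):
    """Split a URL at its first colon into (scheme, scheme_specific_part)."""
    scheme, colon, rest = url.partition(":")
    if not colon:
        raise ValueError("no colon separating scheme from the rest: %r" % url)
    return scheme, rest


print(split_scheme("http://www.kumquat.com/planting/guide.html"))
print(split_scheme("ftp://www.kumquat.com/sales/pricing"))
```

Everything after the first colon is left untouched, since its interpretation depends on the scheme.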
6.2.1. Writing a URL
Write URLs using the displayable characters in the US-ASCII character set. For example, surely you have heard the now annoyingly common radio announcement of a business's web site: "h, t, t, p, colon, slash, slash, w, w, w, dot, blah-blah, dot, com." That's a simple URL, written:
http://www.blah-blah.com
If you need to use a character in a URL that is not part of this character set, you must encode the character using a special notation. The encoding notation replaces the desired character with three characters: a percent sign and two hexadecimal digits whose value corresponds to the position of the character in the ASCII character set.
This is easier than it sounds. One of the most common special characters is the space (Macintosh owners, take special notice), whose position in the character set is 20 hexadecimal. You can't type a space in a URL (well, you can, but it won't work). Rather, replace spaces in the URL with %20:
http://www.kumquat.com/new%20pricing.html
This URL actually retrieves a document named new pricing.html from the www.kumquat.com server.
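As a rough illustration of the encoding notation, the following Python sketch uses the standard library's urllib.parse.quote and unquote on the document name from the example above. It is one way to do the encoding, not the only one.

```python
from urllib.parse import quote, unquote

filename = "new pricing.html"

# A space sits at position 20 hexadecimal in ASCII, so it becomes %20.
print(format(ord(" "), "X"))

encoded = quote(filename)
url = "http://www.kumquat.com/" + encoded
print(url)

# Decoding reverses the substitution.
print(unquote(encoded))
```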
6.2.1.1. Handling reserved and unsafe characters
Reserved characters are those that have a specific meaning within the URL itself. For example, the slash character separates elements of a pathname within a URL. If you need to include a slash in a URL that is not intended to be an element separator, you'll need to encode it as %2F:
http://www.calculator.com/compute?3%2F4
This URL actually references the resource named compute on the www.calculator.com server and passes the string 3/4 to it, as delineated by the question mark (?). Presumably, the resource is a server-side program that performs some arithmetic function on the passed value and returns a result.
Unsafe characters are those that have no special meaning within the URL, but may have a special meaning in the context in which the URL is written. For example, double quotes (") delimit URL attribute values in tags. If you were to include a double quotation mark directly in a URL, you would probably confuse the browser. Instead, you should encode the double quotation mark as %22 to avoid any possible conflict.
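The two cases just described, a reserved slash and an unsafe double quote, can be sketched in Python. By default urllib.parse.quote leaves the slash alone, so the sketch passes safe="" to force it to be encoded; the variable names are illustrative only.

```python
from urllib.parse import quote

# The slash is reserved, so it must be encoded when it is data, not a separator.
argument = quote("3/4", safe="")
url = "http://www.calculator.com/compute?" + argument
print(url)

# The double quote is unsafe inside a quoted attribute value.
print(quote('"', safe=""))
```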
Other reserved and unsafe characters that should always be encoded are shown in Table 6-1.
Table 6-1. Reserved and Unsafe Characters and Their URL Encodings
In general, you should always encode a character if there is some doubt as to whether it can be placed as-is in a URL. As a rule of thumb, any character other than a letter, number, or any of the characters $-_.+!*'() should be encoded.
It is never an error to encode a character, unless that character has a specific meaning in the URL. For example, encoding the slashes in an http URL causes them to be used as regular characters, not as pathname delimiters, breaking the URL.
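The rule of thumb can be written out by hand. The sketch below is a hypothetical helper, encode_component, that leaves letters, digits, and the listed punctuation alone and encodes everything else. Encoding non-ASCII characters through their UTF-8 bytes is an assumption of this sketch, not something the text prescribes.

```python
import string

# Characters the rule of thumb leaves as-is; everything else is encoded.
SAFE = set(string.ascii_letters + string.digits + "$-_.+!*'()")


def encode_component(text):
    """Percent-encode one URL component (not a whole URL)."""
    out = []
    for ch in text:
        if ch in SAFE:
            out.append(ch)
        else:
            # Assumption: non-ASCII characters are encoded byte by byte as UTF-8.
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)


print(encode_component("new pricing.html"))
# Applied to a path, the helper also encodes the slashes, which is
# exactly the breakage warned about above.
print(encode_component("planting/guide.html"))
```

Apply a helper like this to individual path elements or parameter values only, never to a complete URL.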
6.2.2. The http URL
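The http URL takes one of two forms:
http://server:port/path#fragment
http://server:port/path?search
Some of the parts are optional. In fact, the most common form of the http URL is simply like this:
http://server/path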
which designates the unique server and the directory path and name of a document.
6.2.2.1. The http server
The server is the unique Internet name or Internet Protocol (IP) numerical address of the computer system that stores the web resource. We suspect you'll mostly use more easily remembered Internet names for the servers in your URLs.
The name consists of several parts, including the server's actual name and the successive names of its network domain, each part separated by a period. Typical Internet names look like www.oreilly.com or hoohoo.ncsa.uiuc.edu.
It has become something of a convention that webmasters name their servers www for quick and easy identification on the Web. For instance, O'Reilly & Associates' web server's name is www, which, along with the publisher's domain name, becomes the very easily remembered web site www.oreilly.com. Similarly, Sun Microsystems' web server is named www.sun.com; Apple Computer's is www.apple.com, and even Microsoft makes their web server easily memorable as www.microsoft.com. The naming convention has very obvious benefits, which you, too, should take advantage of if you are called upon to create a web server for your organization.
You may also specify the address of a server using its numerical IP address. The address is a sequence of four numbers, zero to 255, separated by periods. Valid IP addresses look like 184.108.40.206 or 220.127.116.11.
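If you ever need to tell a numerical address from an Internet name in a program, Python's standard ipaddress module applies the "four numbers, zero to 255" rule. The helper name is_ip_address is hypothetical.

```python
import ipaddress


def is_ip_address(server):
    """Return True if server is four period-separated numbers from 0 to 255."""
    try:
        ipaddress.IPv4Address(server)
        return True
    except ValueError:
        return False


print(is_ip_address("184.108.40.206"))
print(is_ip_address("www.oreilly.com"))
print(is_ip_address("256.1.1.1"))
```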
It'd be a dull diversion to tell you now what the numbers mean or how to derive an IP address from a domain name, particularly since you'll rarely if ever use one in a URL. Rather, this is a good place to hyperlink: pick up any good Internet networking treatise for rigorous detail on IP addressing, such as Ed Krol's The Whole Internet User's Guide and Catalog (O'Reilly & Associates).
6.2.2.2. The http port
The port is the number of the communication port through which the client browser connects to the server. It's a networking thing. Servers perform many functions besides serving up web documents and resources to client browsers: electronic mail, FTP document fetches, filesystem sharing, and so on. Although all that network activity may come into the server on a single wire, it's typically divided into software-managed "ports" for service-specific communications -- something analogous to boxes at your local post office.
The default URL port for web servers is 80. Secure web servers (Secure HTTP, or SHTTP, and Secure Sockets Layer, or SSL) run on port 443. Most web servers today use port 80; if the target server uses some other port for web communication, you must include that port number, immediately preceded by a colon, in your URL.
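Here is a small Python sketch of how a program might apply the default. The function name effective_port and the DEFAULT_PORTS table are illustrative; pairing port 443 with the https scheme is an assumption beyond what this section states.

```python
from urllib.parse import urlsplit

# Assumed mapping for illustration: the text gives 80 for web servers and
# 443 for secure servers; "https" as the secure scheme name is an assumption.
DEFAULT_PORTS = {"http": 80, "https": 443}


def effective_port(url):
    """Return the explicit port in the URL, or the scheme's default."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS.get(parts.scheme)


print(effective_port("http://www.kumquat.com/"))
print(effective_port("http://www.kumquat.com:8080/"))
```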
When the Web was in its infancy, pioneer webmasters ran their Wild Wild Web connections on all sorts of port numbers. For technical and security reasons, system-administrator privileges are required to install a server on port 80. Lacking such privileges, these webmasters chose other, more easily accessible, port numbers.
Now that web servers have become acceptable and are under the care and feeding of responsible administrators, documents being served on some port other than 80 or 443 should make you wonder if that server is really on the up and up. Most likely, the maverick server is being run by a clever user unbeknownst to the server's bona fide system administrators.
6.2.2.3. The http path
The document path is the Unix-style hierarchical location of the file in the server's storage system. The pathname consists of one or more names separated by slashes. All but the last name represent directories leading down to the document; the last name is usually that of the document itself.
It has become a convention that for easy identification, HTML document names end with the suffix .html (otherwise they're plain ASCII text files, remember?). Although recent versions of Windows allow longer suffixes, their users often stick to the three-letter .htm name suffix for HTML documents.
Although the server name in a URL is not case-sensitive, the document pathname may be. Since most web servers are run on Unix-based systems and Unix filenames are case-sensitive, the document pathname will be case-sensitive, too. Web servers running on Windows machines are not case-sensitive, so the document pathname is not. But since it is impossible to know the operating system of the server you are accessing, always assume that the server has case-sensitive pathnames and take care to get the case correct when typing your URLs.
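The difference shows up when a URL is taken apart in code. In this Python sketch, the standard library lowercases the server name but leaves the path exactly as written. The mixed-case URL is made up for the demonstration.

```python
from urllib.parse import urlsplit

# Hypothetical mixed-case URL, for demonstration only.
parts = urlsplit("http://WWW.Kumquat.COM/Planting/Guide.html")

# The server name is not case-sensitive, so it can be normalized safely.
print(parts.hostname)

# The path may be case-sensitive on the server, so it is left untouched.
print(parts.path)
```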
Certain conventions regarding the document pathname have arisen. If the last element of the document path is a directory, not a single document, the server usually will send back either a listing of the directory contents or the HTML index document in that directory. You should end the document name for a directory with a trailing slash character, but in practice, most servers will honor the request even if the character is omitted.
If the directory name is just a slash alone or sometimes nothing at all, you will retrieve the first (top-level) document or so-called home page in the uppermost root directory of the server. Every well-designed http server should have an attractive, well-designed "home page"; it's a shorthand way for users to access your web collection since they don't need to remember the document's actual filename, just your server's name. That's why, for example, you can type http://www.oreilly.com into Netscape's "Open" dialog box and get O'Reilly's home page.
Another twist: if the first component of the document path starts with the tilde character (~), the rest of the pathname begins from the personal directory within the home directory of the specified user on the server machine. For instance, the URL http://www.kumquat.com/~chuck/ would retrieve the top-level page from Chuck's document collection.
Different servers have different ways of locating documents within a user's home directory. Many search for the documents in a directory named public_html. Unix-based servers are fond of the name index.html for home pages. When all else fails, servers tend to cough up the first text document in the home page directory.
6.2.2.4. The http document fragment
The fragment is an identifier that points to a specific section of a document. In URL specifications, it follows the server and pathname and is separated from them by the pound sign (#). A fragment identifier indicates to the browser that it should begin displaying the target document at the indicated fragment name. As we describe in more detail later in this chapter, you insert fragment names into a document either with the universal id tag attribute or with the name attribute for the <a> tag. Like pathnames, a fragment name may be any sequence of characters.
The fragment name and the preceding pound sign are optional; omit them when referencing a document without defined fragments.
Formally, the fragment element only applies to HTML or XHTML documents. If the target of the URL is some other document type, the fragment name may be misinterpreted by the browser.
Fragments are useful for long documents. By identifying key sections of your document with a fragment name, you make it easy for readers to link directly to that portion of the document, avoiding the tedium of scrolling or searching through the document to get to the section that interests them.
As a rule of thumb, we recommend that every section header in your documents be accompanied by an equivalent fragment name. By consistently following this rule, you'll make it possible for readers to jump to any section in any of your documents. Fragments also make it easier to build tables of contents for your document families.
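For readers who process URLs in code, the fragment is easy to separate from the rest of the address. This Python sketch uses the standard library's urldefrag on a kumquat.com URL with a soil_prep fragment.

```python
from urllib.parse import urldefrag

document, fragment = urldefrag(
    "http://www.kumquat.com/planting/guide.html#soil_prep"
)
print(document)
print(fragment)

# With no pound sign in the URL, the fragment comes back empty.
print(urldefrag("http://www.kumquat.com/planting/guide.html").fragment == "")
```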
6.2.2.5. The http search parameter
The search component of the http URL, along with its preceding question mark, is optional. It indicates that the path is a searchable or executable resource on the server. The content of the search component is passed to the server as parameters that control the search or execution function.
The actual encoding of parameters in the search component is dependent upon the server and the resource being referenced. The parameters for searchable resources are covered later in this chapter, when we discuss searchable documents. Parameters for executable resources are discussed in Chapter 9, "Forms".
Although our initial presentation of http URLs indicated that a URL can have either a fragment identifier or a search component, some browsers let you use both in a single URL. If you so desire, you can follow the search parameter with a fragment identifier, telling the browser to begin displaying the results of the search at the indicated fragment. Netscape, for example, supports this usage.
We don't recommend this kind of URL, though. First and foremost, it doesn't work on a lot of browsers. Just as important, using a fragment implies that you are sure that the results of the search will have a fragment of that name defined within the document. For large document collections, this is hardly likely. You are better off omitting the fragment, showing the search results from the beginning of the document, and avoiding potential confusion among your readers.
6.2.2.6. Sample http URLs
Here are some sample http URLs:
http://www.oreilly.com/catalog.html
http://www.oreilly.com/
http://www.kumquat.com:8080/
http://www.kumquat.com/planting/guide.html#soil_prep
http://www.kumquat.com/find_a_quat?state=Florida
The first example is an explicit reference to a bona fide HTML document named catalog.html that is stored in the root directory of the www.oreilly.com server. The second references the top-level home page on that same server. That home page may or may not be catalog.html. Sample three, also, assumes that there is a home page in the root directory of the www.kumquat.com server, and that the web connection is to the nonstandard port 8080.
The fourth example is the URL for retrieving the web document named guide.html from the planting directory on the www.kumquat.com server. Once retrieved, the browser should display the document beginning at the fragment named soil_prep.
The last example invokes an executable resource named find_a_quat with the parameter named state set to the value Florida. Presumably, this resource generates an HTML response that is subsequently displayed by the browser.
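To see how a program on the receiving end might pick apart the last sample, here is a Python sketch using urlsplit and parse_qs. It assumes the common name=value parameter style shown in the sample; as noted above, the actual encoding depends on the server and the resource.

```python
from urllib.parse import urlsplit, parse_qs

parts = urlsplit("http://www.kumquat.com/find_a_quat?state=Florida")

# The path names the executable resource.
resource = parts.path
print(resource)

# The search component follows the question mark; parse_qs assumes
# name=value pairs and returns a list of values for each name.
parameters = parse_qs(parts.query)
print(parameters)
```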
6.2.4. The ftp URL
The ftp URL is used to retrieve documents from an FTP (File Transfer Protocol) server.
It has the format:
ftp://user:password@server:port/path;type=typecode
6.2.4.1. The ftp user and password
FTP is an authenticated service, meaning that you must have a valid username and password in order to retrieve documents from a server. However, most FTP servers also support restricted, nonauthenticated access known as anonymous FTP. In this mode, anyone can supply the username "anonymous" and be granted access to a limited portion of the server's documents. Most FTP servers also assume (but may not grant) anonymous access if the username and password are omitted.
If you are using an ftp URL to access a site that requires a username and password, include the user and password components in the URL, along with the colon (:) and "at" sign (@). More commonly, you'll be accessing an anonymous FTP server, and the user and password components can be omitted.
If you keep the user component along with the "at" sign, but omit the password and the preceding colon, most browsers will prompt you for a password after connecting to the FTP server. This is the recommended way of accessing authenticated resources on an FTP server; it prevents others from seeing your password.
We recommend you never place an ftp URL with a username and password in any HTML document. The reasoning is simple: anyone can retrieve the document, extract the username and password from the URL, log into the FTP server, and tamper with its documents.
6.2.4.2. The ftp server and port
The ftp server and port are bound by the same rules as the server and port in an http URL, as described above. The server must be a valid Internet domain name or IP address of an FTP server. The port specifies the port on which the server is listening for requests.
If the port and its preceding colon are omitted, the default port of 21 is used. It is necessary to specify the port only if the FTP server is running on some port other than 21.
6.2.4.3. The ftp path and transfer type
The path component represents a series of directories, separated by slashes, leading to the file to be retrieved. By default, the file is retrieved as a binary file; this can be changed by adding the typecode (and the preceding ;type=) to the URL.
If the typecode is set to d, the path is assumed to be a directory. The browser will request a listing of the directory contents from the server and display this listing to the user. If the typecode is any other letter, it is used as a parameter to the FTP type command before retrieving the file referenced by the path. While some FTP servers may implement other codes, most servers accept i to initiate a binary transfer and a to treat the file as a stream of ASCII text.
6.2.4.4. Sample ftp URLs
Here are some sample ftp URLs:
ftp://www.kumquat.com/sales/pricing
ftp://bob@bobs-box.com/results;type=d
ftp://bob:secret@bobs-box.com/listing;type=a
The first example retrieves the file named pricing from the sales directory on the anonymous FTP server at www.kumquat.com. The second logs into the FTP server on bobs-box.com as user bob, prompting for a password before retrieving the contents of the directory named results and displaying them to the user. The last example logs into bobs-box.com as bob with the password secret and retrieves the file named listing, treating its contents as ASCII characters.
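The pieces described in this section (user bob, password secret, server bobs-box.com, and the typecode) can be pulled out of an ftp URL with a few lines of Python. The function parse_ftp_url and its dictionary keys are hypothetical names for this sketch.

```python
from urllib.parse import urlsplit


def parse_ftp_url(url):
    """Break an ftp URL into the components described in this section."""
    parts = urlsplit(url)
    # The optional typecode trails the path after ";type=".
    path, marker, typecode = parts.path.partition(";type=")
    return {
        "user": parts.username,          # None means anonymous access
        "password": parts.password,      # None means the browser may prompt
        "server": parts.hostname,
        "port": parts.port if parts.port is not None else 21,
        "path": path,
        "typecode": typecode if marker else None,
    }


print(parse_ftp_url("ftp://www.kumquat.com/sales/pricing"))
print(parse_ftp_url("ftp://bob@bobs-box.com/results;type=d"))
print(parse_ftp_url("ftp://bob:secret@bobs-box.com/listing;type=a"))
```

A missing typecode is reported as None here, leaving the default transfer mode to whatever retrieves the file.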
6.2.5. The file URL
The file URL specifies a file stored on a machine without indicating the protocol used to retrieve the file. As such, it has limited use in a networked environment. Its real benefit, however, is that it can reference a file on the user's machine, and is particularly useful for referencing personal HTML document collections, such as those "under construction" and not yet ready for general distribution, or HTML document collections on CD-ROM. It has the following format:
file://server/path
6.2.5.1. The file server
The file server, like the http server described earlier, must be the Internet domain name or IP address of the machine containing the file to be retrieved. No assumptions are made as to how the browser might contact the machine to obtain the file; presumably the browser can make some connection, perhaps via a Network File System or FTP, to obtain the file.
If the server is omitted, or the special name localhost is used, the file is assumed to reside on the same machine upon which the browser is running. In this case, the browser simply accesses the file using the normal facilities of the local operating system. In fact, this is the most common usage of the file URL. By creating document families on a diskette or CD-ROM and referencing your hyperlinks using the file://localhost/ URL, you create a distributable, standalone document collection that does not require a network connection to use.
6.2.5.2. The file path
This is the path of the file to be retrieved on the desired server. The syntax of the path may differ based upon the operating system of the server; be sure to encode any potentially dangerous characters in the path.
6.2.5.3. Sample file URLs
The file URL is easy:
file://localhost/home/chuck/document.html
file:///home/chuck/document.html
file://marketing.kumquat.com/monthly_sales.html
The first URL retrieves /home/chuck/document.html from the user's local machine. The second is identical to the first, except we've omitted the localhost reference to the server; the server name defaults to the local server.
The third example uses some protocol to retrieve monthly_sales.html from the marketing.kumquat.com server.
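The rule that an omitted server and localhost both mean "this machine" can be sketched in Python. The function name file_url_target is hypothetical; it returns None for the server when the file is local.

```python
from urllib.parse import urlsplit, unquote


def file_url_target(url):
    """Return (server, path); server is None when the file is local."""
    parts = urlsplit(url)
    if parts.scheme != "file":
        raise ValueError("not a file URL: %r" % url)
    server = parts.hostname
    if server in (None, "localhost"):
        server = None
    # Decode any %xx sequences to recover the real filename.
    return server, unquote(parts.path)


print(file_url_target("file://localhost/home/chuck/document.html"))
print(file_url_target("file:///home/chuck/document.html"))
print(file_url_target("file://marketing.kumquat.com/monthly_sales.html"))
```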
6.2.6. The news URL
An unfortunate limitation in news URLs is that they don't allow you to specify a server for the newsgroup. Rather, users specify their news-server resource in their browser preferences. At one time, not long ago, Internet newsgroups were nearly universally distributed; all news servers carried all the same newsgroups and their respective articles, so one news server was as good as any. Today, the sheer bulk of disk space needed to store the daily volume of newsgroup activity is often prohibitive for any single news server, and there's also local censorship of newsgroups. Hence you cannot expect that all newsgroups, and certainly not all articles for a particular newsgroup, will be available on the user's news server.
Many users' browsers may not be correctly configured to read news. We recommend you avoid placing news URLs in your documents except in rare cases.
6.2.6.1. Accessing entire newsgroups
There are several thousand newsgroups devoted to nearly every conceivable topic under the sun and beyond. Each group has a unique name, composed of hierarchical elements separated by periods. For example, the World Wide Web announcements newsgroup is:
To access this group, use the URL:
6.2.6.2. Accessing single messages
The unique_string is a sequence of ASCII characters; the server is usually the name of the machine from which the message originated. The unique_string must be unique among all the messages that originated from the server. A sample URL to access a single message might be:
In general, message IDs are cryptic sequences of characters not readily understood by humans. Moreover, the lifespan of a message on a server is usually measured in days, after which the message is deleted and the message ID is no longer valid. The bottom line: single message news URLs are difficult to create, become invalid quickly, and are generally not used.
6.2.7. The nntp URL
6.2.7.1. The nntp server and port
The nntp server and port are defined similarly to the http server and port, described earlier. The server must be the Internet domain name or IP address of an nntp server; the port is the port on which that server is listening for requests.
If the port and its preceding colon are omitted, the default port of 119 is used.
6.2.7.2. The nntp newsgroup and article
The newsgroup is the name of the group from which an article is to be retrieved, as defined in Section 6.2.6, "The news URL".
The article is the numeric id of the desired article within that newsgroup. Although the article number is easier to determine than a message id, it falls prey to the same limitations of single message references using the news URL, described in Section 6.2.6, "The news URL". Specifically, articles do not last long on most nntp servers, and nntp URLs quickly become invalid as a result.
6.2.7.3. Sample nntp URLs
A sample nntp URL might be:
nntp://news.kumquat.com/alt.fan.kumquats/417
This URL retrieves article 417 from the alt.fan.kumquats newsgroup on news.kumquat.com. Keep in mind that the article will be served only to machines that are allowed to retrieve articles from this server. In general, most nntp servers restrict access to those machines on that same local area network.
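As a rough sketch, the four components of an nntp URL can be extracted in Python as follows. The sketch assumes the layout nntp://server:port/newsgroup/article, and the function name parse_nntp_url is hypothetical.

```python
from urllib.parse import urlsplit


def parse_nntp_url(url):
    """Return (server, port, newsgroup, article) from an nntp URL."""
    parts = urlsplit(url)
    # Assumed layout: /newsgroup/article
    newsgroup, _, article = parts.path.lstrip("/").partition("/")
    port = parts.port if parts.port is not None else 119
    return parts.hostname, port, newsgroup, int(article)


print(parse_nntp_url("nntp://news.kumquat.com/alt.fan.kumquats/417"))
```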
6.2.8. The mailto URL
The address is any valid email address, usually of the form:
Thus, a typical mailto URL might look like:
Browsers like Netscape honor multiple recipients in the mailto URL, separated by a comma. For example:
will address the message to all three recipients. There should be no spaces before or after the commas in the URL.
6.2.8.1. Defining mail header fields
Most browsers open an email composition window when the user selects a mailto URL. The recipient's address is filled in, taken from the URL, but the message subject and various other header fields are left blank. Many webmasters would like to fill in these fields as a courtesy to their readers, but the URL standard provides no way to do this.
The modern browsers extend the mailto URL to fill this gap. By adding CGI-like parameters to the mailto URL, you can set the value of the subject with Netscape and Internet Explorer, and also the cc (carbon copy) and bcc (blind carbon copy) fields for the mail message with Netscape. (See the section "Passing parameters explicitly.") The following URLs work with Netscape; only the first one works correctly with Internet Explorer.
mailto:email@example.com?subject=Loved your book! mailto:firstname.lastname@example.orgemail@example.com mailto:firstname.lastname@example.orgemail@example.com
As you can probably guess, the first URL sets the subject of the message. Note that spaces are allowed; you don't have to replace them with the hexadecimal equivalent %20. The second URL places the address firstname.lastname@example.org in the cc field of a Netscape message. Similarly, the last example sets the bcc field of the message. You may also set several fields in one URL by separating the field definitions with ampersands. For example:
mailto:email@example.com?subject=Loved your book!&cc=booktech@ firstname.lastname@example.org
sets the subject and carbon-copy address. (This line would normally appear as a single line but is broken here due to the width of the page.)
Internet Explorer Version 3 does not recognize the bcc and cc fields in the mailto URL and will either complain about them if they appear alone or append them to a preceding subject.
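If you generate mailto URLs from a program, a sketch like the following Python helper can assemble the recipients and header fields. The function name build_mailto and both addresses are hypothetical. The sketch encodes spaces and punctuation in field values, which the text says is optional here but never an error.

```python
from urllib.parse import quote


def build_mailto(recipients, **fields):
    """Join recipients with commas and append header fields after a '?'."""
    url = "mailto:" + ",".join(recipients)
    if fields:
        pairs = []
        for name, value in fields.items():
            # Keep "@" readable; encode spaces and other special characters.
            pairs.append(name + "=" + quote(value, safe="@"))
        url += "?" + "&".join(pairs)
    return url


# Hypothetical addresses, for illustration only.
print(build_mailto(["chuck@kumquat.com"],
                   subject="Loved your book!",
                   cc="sales@kumquat.com"))
```

Remember that browser support for the cc and bcc fields varies, as described above.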
6.2.9. The telnet URL
The telnet URL opens an interactive session with a desired server, allowing the user to log in and use the machine. Often, the connection to the machine automatically starts a specific service for the user; in other cases, the user must know the commands to type to use the system. The telnet URL has the form:
telnet://user:password@server:port/
6.2.9.1. The telnet user and password
The telnet user and password are used exactly like the user and password components of the ftp URL, described previously. In particular, the same caveats apply regarding protecting your password and never placing it within a URL.
Just like the ftp URL, if you omit the password from the URL, the browser should prompt you for a password just before contacting the telnet server.
If you omit both the user and password, the telnet session begins without supplying a username. For some servers, telnet automatically connects to a default service when no username is supplied. For others, the browser may prompt for a username and password when making the connection to the telnet server.
6.2.9.2. The telnet server and port
The telnet server and port are defined similarly to the http server and port, described above. The server must be the Internet domain name or IP address of a telnet server; the port is the port on which that server is listening for requests. If the port and its preceding colon are omitted, the default port of 23 is used.
6.2.10. The gopher URL
Gopher is a web-like document retrieval system that achieved some popularity on the Internet just before the World Wide Web took off, making Gopher obsolete. Some Gopher servers still exist, though, and the gopher URL lets you access Gopher documents. The gopher URL has the form:
gopher://server:port/path
6.2.10.1. The gopher server and port
The gopher server and port are defined similarly to the http server and port, described previously. The server must be the Internet domain name or IP address of a gopher server; the port is the port on which that server is listening for requests.
If the port and its preceding colon are omitted, the default port of 70 is used.
6.2.10.2. The gopher path
The path can take one of three forms:
type/selector
type/selector%09search
type/selector%09search%09gopherplus
If the Gopher resource is actually a Gopher search engine, the search component provides the string for which to search. The search string must be preceded by an encoded horizontal tab (%09).
If the Gopher server supports Gopher+ resources, the gopherplus component supplies the necessary information to locate that resource. The exact content of this component varies based upon the resources on the gopher server. This component is preceded by an encoded horizontal tab (%09). If you want to include the gopherplus component but omit the search component, you must still supply both encoded tabs within the URL.
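The rule about supplying both encoded tabs is easy to get wrong, so here is a small Python sketch that assembles the three path forms. The function name gopher_path and the sample values are hypothetical, and the sketch assumes each piece has already been encoded.

```python
def gopher_path(type_selector, search=None, gopherplus=None):
    """Build a gopher path from pieces that are assumed to be encoded already."""
    path = type_selector
    if search is not None or gopherplus is not None:
        # The first encoded tab is required even when the search is empty.
        path += "%09" + (search or "")
    if gopherplus is not None:
        path += "%09" + gopherplus
    return path


# Hypothetical values, for illustration only.
print(gopher_path("1/fruit"))
print(gopher_path("7/find", search="kumquat"))
print(gopher_path("1/fruit", gopherplus="plus"))
```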
6.2.11. Absolute and Relative URLs
You may write a URL in one of two ways: absolute or relative. An absolute URL is the complete address of a resource and has everything your system needs to find a document and its server on the Web. At the very least, an absolute URL contains the scheme and all required elements of the scheme_specific_part of the URL. It may also contain any of the optional portions of the scheme_specific_part.
With a relative URL, you provide an abbreviated document address that, when automatically combined with a "base address" by the system, becomes a complete address for the document. Within the relative URL, any component of the URL may be omitted. The browser automatically fills in the missing pieces of the relative URL using corresponding elements of a base URL. This base URL is usually the URL of the document containing the relative URL, but may be another document specified with the <base> tag (see Section 6.7.1, "The <base> Header Element").
6.2.11.1. Relative schemes and servers
A common form of a relative URL is missing the scheme and server name. Since many related documents are on the same server, it makes sense to omit the scheme and server name from the relative URL. For instance, assume the base document was last retrieved from the server www.kumquat.com. The relative URL, then:
is equivalent to the absolute URL:
Table 6-2 shows how the base and relative URLs in the example are combined to form an absolute URL.
Table 6-2. Forming an Absolute URL
6.2.11.2. Relative document directories
Another common form of a relative URL omits the leading slash and one or more directory names from the beginning of the document pathname. The directory of the base URL is automatically assumed to replace these missing components. It's the most common abbreviation, because most authors place their collection of documents and subdirectories of support resources in the same directory path as the home page. For example, you might have a special/ subdirectory containing FTP files referenced in your document. Let's say that the absolute URL for that document is:
A relative URL for the file README.txt in the special/ subdirectory looks like this:
special/README.txt
You'll actually be retrieving:
Visually, the operation looks like that in Table 6-3.
Table 6-3. Forming an Absolute FTP URL
Common "dot-slash" pathname notations also let you express the current directory ("./") and directory above the current directory (parent; "../") in a relative URL. The current directory notation is rarely used, since it is redundant. But the parent notation lets you set the target URL to directories in other branches of the filesystem hierarchy.
For example, if the directory portion of the current URL is /planting/special/, and you want to reference an HTML document named new_ground.html in planting/standard/, you may simply form the relative URL as:
../standard/new_ground.html
You'll actually be retrieving:
Note that parent notation has limits. For instance, most web servers will not let you navigate above the base directory: http://www.kumquat.com/../ probably won't deliver any document or directory listing to your browser.
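You can watch the combination of base and relative URLs happen with Python's urljoin, which follows the same resolution rules a browser applies. The base URL below is an assumption for the demonstration: a document named index.html in /planting/special/ on www.kumquat.com.

```python
from urllib.parse import urljoin

# Assumed base document, for illustration only.
base = "http://www.kumquat.com/planting/special/index.html"

# A name with no leading slash is resolved against the base directory.
print(urljoin(base, "README.txt"))

# "./" names the current directory, so it changes nothing.
print(urljoin(base, "./README.txt"))

# "../" climbs to the parent directory before descending again.
print(urljoin(base, "../standard/new_ground.html"))

# A leading slash keeps the scheme and server but replaces the whole path.
print(urljoin(base, "/sales/pricing"))
```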
6.2.11.3. Using relative URLs
Relative URLs are more than just a typing convenience. Because they are relative to the current server and directory, you can move the entire set of documents to another directory or even another server and never have to change a single relative link. Imagine the difficulties if you had to go into every source document and change the URL for every link every time you move it. We'd loathe using hyperlinks! Use relative URLs wherever possible.
Copyright © 2002 O'Reilly & Associates. All rights reserved.