

HTML & XHTML: The Definitive Guide

6.2. Referencing Documents: The URL

As we discussed earlier, every document on the World Wide Web has a unique address. (Imagine the chaos if they didn't.) The document's address is known as its uniform resource locator (URL).[37]

[37]"URL" usually is pronounced "you are ell," not "earl."

Several tags include a URL attribute value, including hyperlinks, inline images, and forms. All use the same URL syntax to specify the location of a web resource, regardless of the type or content of that resource. That's why it's known as a uniform resource locator.

Since they can be used to represent almost any resource on the Internet, URLs come in a variety of flavors. All URLs, however, have the same top-level syntax:

scheme:scheme_specific_part

The scheme describes the kind of object the URL references; the scheme_specific_part is, well, the part that is peculiar to the specific scheme. The important thing to note is that the scheme is always separated from the scheme_specific_part by a colon with no intervening spaces.
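The top-level split described above can be sketched in a few lines of Python. This is only an illustration; the helper name split_scheme is our own, and it does nothing more than cut the URL at its first colon.

```python
def split_scheme(url: str) -> tuple[str, str]:
    """Split a URL at its first colon into (scheme, scheme_specific_part)."""
    scheme, colon, rest = url.partition(":")
    if not colon or not scheme:
        # No colon, or nothing in front of it: there is no scheme to report.
        raise ValueError(f"no scheme found in {url!r}")
    return scheme, rest


print(split_scheme("http://www.kumquat.com/new%20pricing.html"))
print(split_scheme("mailto:cmusciano@aol.com"))
```

Everything after the first colon is handed back untouched, since its meaning depends on the scheme.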

6.2.1. Writing a URL

Write URLs using the displayable characters in the US-ASCII character set. For example, surely you have heard what has become annoyingly common on the radio for an announced business website, "h, t, t, p, colon, slash, slash, w, w, w, dot, blah-blah, dot, com." That's a simple URL, written:

http://www.blah-blah.com

If you need to use a character in a URL that is not part of this character set, you must encode the character using a special notation. The encoding notation replaces the desired character with three characters: a percent sign and two hexadecimal digits whose value corresponds to the position of the character in the ASCII character set.

This is easier than it sounds. One of the most common special characters is the space (Macintosh owners, take special notice), whose position in the character set is 20 hexadecimal. You can't type a space in a URL (well, you can, but it won't work). Rather, replace spaces in the URL with %20:

http://www.kumquat.com/new%20pricing.html

This URL actually retrieves a document named new pricing.html from the www.kumquat.com server.
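If you would rather not look up hexadecimal positions by hand, most programming languages can do the encoding for you. The following sketch assumes Python and its standard urllib.parse module; the variable names are ours.

```python
from urllib.parse import quote, unquote

# The space sits at position 32 decimal (20 hexadecimal) in ASCII.
space_code = f"%{ord(' '):02X}"
print(space_code)

# quote() applies the percent notation; unquote() reverses it.
encoded_name = quote("new pricing.html")
decoded_url = unquote("http://www.kumquat.com/new%20pricing.html")
print(encoded_name)
print(decoded_url)
```

The first line printed is %20, the same three characters you would type by hand.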

6.2.1.1. Handling reserved and unsafe characters

In addition to the nonprinting characters, you'll need to encode reserved and unsafe characters in your URLs as well.

Reserved characters are those that have a specific meaning within the URL itself. For example, the slash character separates elements of a pathname within a URL. If you need to include a slash in a URL that is not intended to be an element separator, you'll need to encode it as %2F:[38]

[38]Hexadecimal numbering is based on 16 characters: 0 through 9 followed by A through F, which in decimal are equivalent to values 0 through 15. Also, letter case for these extended values is not significant; "a" (10 decimal) is the same as "A", for example.

http://www.calculator.com/compute?3%2f4

This URL actually references the resource named compute on the www.calculator.com server and passes the string 3/4 to it, as delineated by the question mark (?). Presumably, the resource is a server-side program that performs some arithmetic function on the passed value and returns a result.
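To see what the server-side program would receive, you can take the URL apart and decode the passed string. This is a sketch using Python's standard urllib.parse module, not a description of how any particular server does it.

```python
from urllib.parse import quote, unquote, urlsplit

parts = urlsplit("http://www.calculator.com/compute?3%2f4")

resource = parts.path            # the resource named before the question mark
passed = unquote(parts.query)    # the string after it, with %2f decoded

print(resource, passed)

# Going the other way: safe="" tells quote() to encode the slash as well.
encoded = quote("3/4", safe="")
print(encoded)
```

Note that the encoded slash stays out of the path: the URL still has only one pathname element, compute.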

Unsafe characters are those that have no special meaning within the URL, but may have a special meaning in the context in which the URL is written. For example, double quotes (") delimit URL attribute values in tags. If you were to include a double quotation mark directly in a URL, you would probably confuse the browser. Instead, you should encode the double quotation mark as %22 to avoid any possible conflict.

Other reserved and unsafe characters that should always be encoded are shown in Table 6-1.

Table 6-1. Reserved and Unsafe Characters and Their URL Encodings

Character  Description                 Usage     Encoding
;          Semicolon                   Reserved  %3B
/          Slash                       Reserved  %2F
?          Question mark               Reserved  %3F
:          Colon                       Reserved  %3A
@          At sign                     Reserved  %40
=          Equal sign                  Reserved  %3D
&          Ampersand                   Reserved  %26
<          Less than sign              Unsafe    %3C
>          Greater than sign           Unsafe    %3E
"          Double quotation mark       Unsafe    %22
#          Hash symbol                 Unsafe    %23
%          Percent                     Unsafe    %25
{          Left curly brace            Unsafe    %7B
}          Right curly brace           Unsafe    %7D
|          Vertical bar                Unsafe    %7C
\          Backslash                   Unsafe    %5C
^          Caret                       Unsafe    %5E
~          Tilde                       Unsafe    %7E
[          Left square bracket         Unsafe    %5B
]          Right square bracket        Unsafe    %5D
`          Back single quotation mark  Unsafe    %60

In general, you should always encode a character if there is some doubt as to whether it can be placed as-is in a URL. As a rule of thumb, any character other than a letter, number, or any of the characters $-_.+!*'() should be encoded.

It is never an error to encode a character, unless that character has a specific meaning in the URL. For example, encoding the slashes in an http URL causes them to be used as regular characters, not as pathname delimiters, breaking the URL.
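The rule of thumb is simple enough to express as a small function. The sketch below is our own illustration in Python; the name encode_conservatively is hypothetical, and encoding non-ASCII characters byte by byte as UTF-8 is an assumption that goes beyond the ASCII positions discussed here.

```python
import string

# Characters the rule of thumb leaves as-is: letters, numbers, and $-_.+!*'()
SAFE_CHARACTERS = set(string.ascii_letters + string.digits + "$-_.+!*'()")


def encode_conservatively(text: str) -> str:
    """Percent-encode every character not covered by the rule of thumb."""
    pieces = []
    for character in text:
        if character in SAFE_CHARACTERS:
            pieces.append(character)
        else:
            # Assumption: non-ASCII text is encoded one UTF-8 byte at a time.
            pieces.extend(f"%{byte:02X}" for byte in character.encode("utf-8"))
    return "".join(pieces)


print(encode_conservatively('say "hi" ~ now'))
```

Apply a function like this only to a single piece of a URL, such as one pathname element or one passed value. Running it over a whole URL would encode the slashes and colon that give the URL its structure, which is exactly the breakage described above.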

6.2.2. The http URL

The http URL is by far the most common within the World Wide Web. It is used to access documents from a web server, and it has two formats:

http://server:port/path#fragment
http://server:port/path?search

Some of the parts are optional. In fact, the most common form of the http URL is simply like this:

http://server/path

which designates the unique server and the directory path and name of a document.
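As a rough illustration of how the parts line up, the following Python sketch splits one URL of each format with the standard urllib.parse module. The port number, path, and fragment name in the first URL are made up for the example.

```python
from urllib.parse import urlsplit

# Hypothetical URL in the first format (server:port/path#fragment).
with_fragment = urlsplit("http://www.kumquat.com:8080/planting/guide.html#soil")
# URL in the second format (server/path?search), with the optional port omitted.
with_search = urlsplit("http://www.calculator.com/compute?3%2f4")

for parts in (with_fragment, with_search):
    print("server:  ", parts.hostname)
    print("port:    ", parts.port)       # None when the URL names no port
    print("path:    ", parts.path)
    print("search:  ", parts.query)      # empty when absent
    print("fragment:", parts.fragment)   # empty when absent
```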

6.2.2.1. The http server

The server is the unique Internet name or Internet Protocol (IP) numerical address of the computer system that stores the web resource. We suspect you'll mostly use more easily remembered Internet names for the servers in your URLs.[39]

[39]Each Internet-connected computer has a unique address, a numeric (IP) address, of course, because computers deal only in numbers. Humans prefer names, so the Internet folks provide us with a collection of special servers and software (Domain Name System or DNS) that automatically resolve Internet names into IP addresses. InterNIC, a nonprofit agency, registers domain names mostly on a first-come, first-served basis, and distributes new names to DNS servers worldwide.

The name consists of several parts, including the server's actual name and the successive names of its network domain, each part separated by a period. Typical Internet names look like www.oreilly.com or hoohoo.ncsa.uiuc.edu.[40]

[40]The three-letter suffix of the domain name identifies the type of organization or business that operates that portion of the Internet. For instance, "com" is a commercial enterprise; "edu" is an academic institution; and "gov" identifies a government-based domain. Outside the United States, a less-descriptive suffix is often assigned, typically a two-letter abbreviation of the country name such as "jp" for Japan and "de" for Deutschland. Many organizations around the world now use the generic three-letter suffixes in place of the more conventional two-letter national suffixes.

It has become something of a convention that webmasters name their servers www for quick and easy identification on the Web. For instance, O'Reilly & Associates' web server's name is www, which, along with the publisher's domain name, becomes the very easily remembered web site www.oreilly.com. Similarly, Sun Microsystems' web server is named www.sun.com; Apple Computer's is www.apple.com, and even Microsoft makes their web server easily memorable as www.microsoft.com. The naming convention has very obvious benefits, which you, too, should take advantage of if you are called upon to create a web server for your organization.

You may also specify the address of a server using its numerical IP address. The address is a sequence of four numbers, zero to 255, separated by periods. Valid IP addresses look like 137.237.1.87 or 192.249.1.33.
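If you ever need to check that a server written as numbers has the right shape (four numbers, each zero to 255), Python's standard ipaddress module can do it. The helper name below is our own; the last two test values are made up to show rejections.

```python
import ipaddress


def is_ipv4_address(text: str) -> bool:
    """Return True if text is four dot-separated numbers, each 0 to 255."""
    try:
        ipaddress.IPv4Address(text)
    except ValueError:
        return False
    return True


for candidate in ("137.237.1.87", "192.249.1.33", "137.237.1.300", "www.oreilly.com"):
    print(candidate, is_ipv4_address(candidate))
```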

It'd be a dull diversion to tell you now what the numbers mean or how to derive an IP address from a domain name, particularly since you'll rarely if ever use one in a URL. Rather, this is a good place to hyperlink: pick up any good Internet networking treatise for rigorous detail on IP addressing, such as Ed Krol's The Whole Internet User's Guide and Catalog (O'Reilly & Associates).

6.2.2.2. The http port

The port is the number of the communication port by which the client browser connects to the server. It's a networking thing: servers perform many functions besides serving up web documents and resources to client browsers: electronic mail, FTP document fetches, filesystem sharing, and so on. Although all that network activity may come into the server on a single wire, it's typically divided into software-managed "ports" for service-specific communications -- something analogous to boxes at your local post office.

The default URL port for web servers is 80. Special secure web servers (Secure HTTP, SHTTP or Secure Socket Layer, SSL) run on port 443. Most web servers today use port 80; you need to include a port number along with an immediately preceding colon in your URL if the target server does not use port 80 for web communication.
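The default can be made concrete with a short Python sketch. The function name effective_port and the port 8080 in the second URL are our own inventions; the sketch handles only the plain http default described above.

```python
from urllib.parse import urlsplit

DEFAULT_HTTP_PORT = 80  # the default stated above


def effective_port(url: str) -> int:
    """Return the port an http URL refers to, filling in the default."""
    port = urlsplit(url).port  # None when the URL carries no ":port"
    return DEFAULT_HTTP_PORT if port is None else port


print(effective_port("http://www.oreilly.com/"))
print(effective_port("http://www.kumquat.com:8080/"))
```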

When the Web was in its infancy, pioneer webmasters ran their Wild Wild Web connections on all sorts of port numbers. For technical and security reasons, system-administrator privileges are required to install a server on port 80. Lacking such privileges, these webmasters chose other, more easily accessible, port numbers.

Now that web servers have become acceptable and are under the care and feeding of responsible administrators, documents being served on some port other than 80 or 443 should make you wonder if that server is really on the up and up. Most likely, the maverick server is being run by a clever user unbeknownst to the server's bona fide system administrators.

6.2.2.3. The http path

The document path is the Unix-style hierarchical location of the file in the server's storage system. The pathname consists of one or more names separated by slashes. All but the last name represent directories leading down to the document; the last name is usually that of the document itself.

It has become a convention that for easy identification, HTML document names end with the suffix .html (otherwise they're plain ASCII text files, remember?). Although recent versions of Windows allow longer suffixes, their users often stick to the three-letter .htm name suffix for HTML documents.

Although the server name in a URL is not case-sensitive, the document pathname may be. Since most web servers are run on Unix-based systems and Unix file names are case-sensitive, the document pathname will be case-sensitive, too. Web servers running on Windows machines are not case-sensitive, so the document pathname is not, but since it is impossible to know the operating system of the server you are accessing, always assume that the server has case-sensitive pathnames and take care to get the case correct when typing your URLs.

Certain conventions regarding the document pathname have arisen. If the last element of the document path is a directory, not a single document, the server usually will send back either a listing of the directory contents or the HTML index document in that directory. You should end the document name for a directory with a trailing slash character, but in practice, most servers will honor the request even if the character is omitted.

If the directory name is just a slash alone or sometimes nothing at all, you will retrieve the first (top-level) document or so-called home page in the uppermost root directory of the server. Every well-designed http server should have an attractive, well-designed "home page"; it's a shorthand way for users to access your web collection since they don't need to remember the document's actual filename, just your server's name. That's why, for example, you can type http://www.oreilly.com into Netscape's "Open" dialog box and get O'Reilly's home page.

Another twist: if the first component of the document path starts with the tilde character (~), it means that the rest of the pathname begins from the personal directory in the home directory of the specified user on the server machine. For instance, the URL http://www.kumquat.com/~chuck/ would retrieve the top-level page from Chuck's document collection.

Different servers have different ways of locating documents within a user's home directory. Many search for the documents in a directory named public_html. Unix-based servers are fond of the name index.html for home pages. When all else fails, servers tend to cough up the first text document in the home page directory.

6.2.2.4. The http document fragment

The fragment is an identifier that points to a specific section of a document. In URL specifications, it follows the server and pathname and is separated from them by the hash symbol (#). A fragment identifier indicates to the browser that it should begin displaying the target document at the indicated fragment name. As we describe in more detail later in this chapter, you insert fragment names into a document either with the universal id tag attribute or with the name attribute for the <a> tag. Like pathnames, a fragment name may be any sequence of characters.

The fragment name and the preceding hash symbol are optional; omit them when referencing a document without defined fragments.

Formally, the fragment element only applies to HTML or XHTML documents. If the target of the URL is some other document type, the fragment name may be misinterpreted by the browser.

Fragments are useful for long documents. By identifying key sections of your document with a fragment name, you make it easy for readers to link directly to that portion of the document, avoiding the tedium of scrolling or searching through the document to get to the section that interests them.

As a rule of thumb, we recommend that every section header in your documents be accompanied by an equivalent fragment name. By consistently following this rule, you'll make it possible for readers to jump to any section in any of your documents. Fragments also make it easier to build tables of contents for your document families.

6.2.2.5. The http search parameter

The search component of the http URL, along with its preceding question mark, is optional. It indicates that the path is a searchable or executable resource on the server. The content of the search component is passed to the server as parameters that control the search or execution function.

The actual encoding of parameters in the search component is dependent upon the server and the resource being referenced. The parameters for searchable resources are covered later in this chapter, when we discuss searchable documents. Parameters for executable resources are discussed in Chapter 9, "Forms".

Although our initial presentation of http URLs indicated that a URL can have either a fragment identifier or a search component, some browsers let you use both in a single URL. If you so desire, you can follow the search parameter with a fragment identifier, telling the browser to begin displaying the results of the search at the indicated fragment. Netscape, for example, supports this usage.

We don't recommend this kind of URL, though. First and foremost, it doesn't work on a lot of browsers. Just as important, using a fragment implies that you are sure that the results of the search will have a fragment of that name defined within the document. For large document collections, this is hardly likely. You are better off omitting the fragment, showing the search results from the beginning of the document, and avoiding potential confusion among your readers.

6.2.3. The javascript URL

The javascript URL actually is a pseudo-protocol, not usually included in discussions of URLs. Yet, with advanced browsers like Netscape and Internet Explorer, the javascript URL can be associated with a hyperlink and used to execute JavaScript commands when the user selects the link. (See Section 12.3.4, "JavaScript URLs.")

6.2.3.1. The javascript URL arguments

What follows the javascript pseudo-protocol is one or more semicolon-separated JavaScript expressions and methods, including references to multi-expression JavaScript functions that you embed within the <script> tag in your documents (see Chapter 12, "Executable Content" for details). For example:

javascript:window.alert('Hello, world!')
javascript:doFlash('red', 'blue'); window.alert('Do not press me!')

are valid URLs that you may include as the value for a link reference (see Section 6.3.1.2, "The href attribute" and Section 6.5.4.3, "The href attribute"). The first example contains a single JavaScript method that activates an alert dialog with the simple message.

The second javascript URL example contains two arguments: the first calls a JavaScript function, doFlash, which presumably you have located elsewhere in the document within the <script> tag and which perhaps flashes the background color of the document window between red and blue. The second expression is the same alert method as in the first example, with a slightly different message.

The javascript URL may appear in a hyperlink sans arguments, too. In that case, the Netscape browser alone -- not Internet Explorer -- opens a special JavaScript editor wherein the user may type in and test the various expressions and methods.

6.2.4. The ftp URL

The ftp URL is used to retrieve documents from an FTP (File Transfer Protocol) server.[41]

[41]FTP is an ancient Internet protocol that dates back to the Dark Ages, around 1975. It was designed as a simple way to move files between machines and is popular and useful to this day. Some people who are unable to run a true web server will place their documents on a server that speaks FTP instead.

It has the format:

ftp://user:password@server:port/path;type=typecode

6.2.4.4. Sample ftp URLs

Here are some sample ftp URLs:

ftp://www.kumquat.com/sales/pricing
ftp://bob@bobs-box.com/results;type=d
ftp://bob:secret@bobs-box.com/listing;type=a

The first example retrieves the file named pricing from the sales directory on the anonymous FTP server at www.kumquat.com. The second logs into the FTP server on bobs-box.com as user bob, prompting for a password before retrieving the contents of the directory named results and displaying them to the user. The last example logs into bobs-box.com as bob with the password secret and retrieves the file named listing, treating its contents as ASCII characters.
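To check your reading of an ftp URL, you can let a parser pull the pieces apart. This sketch uses urlparse from Python's standard urllib.parse module, which separates the ;type= portion into a field it calls params; the helper name describe_ftp is our own.

```python
from urllib.parse import urlparse


def describe_ftp(url: str) -> dict:
    """Break an ftp URL into the parts named in the format above."""
    parts = urlparse(url)
    return {
        "user": parts.username,      # None for anonymous access
        "password": parts.password,  # None when the URL omits it
        "server": parts.hostname,
        "path": parts.path,
        "type": parts.params,        # e.g. "type=a"; empty when absent
    }


for sample in (
    "ftp://www.kumquat.com/sales/pricing",
    "ftp://bob@bobs-box.com/results;type=d",
    "ftp://bob:secret@bobs-box.com/listing;type=a",
):
    print(describe_ftp(sample))
```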

6.2.5. The file URL

The file URL specifies a file stored on a machine without indicating the protocol used to retrieve the file. As such, it has limited use in a networked environment. Its real benefit, however, is that it can reference a file on the user's machine, and is particularly useful for referencing personal HTML document collections, such as those "under construction" and not yet ready for general distribution, or HTML document collections on CD-ROM. It has the following format:

file://server/path

6.2.5.2. The file path

This is the path of the file to be retrieved on the desired server. The syntax of the path may differ based upon the operating system of the server; be sure to encode any potentially dangerous characters in the path.

6.2.5.3. Sample file URLs

The file URL is easy:

file://localhost/home/chuck/document.html
file:///home/chuck/document.html
file://marketing.kumquat.com/monthly_sales.html

The first URL retrieves /home/chuck/document.html from the user's local machine. The second is identical to the first, except we've omitted the localhost reference to the server; the server name defaults to the local server.

The third example uses some protocol to retrieve monthly_sales.html from the marketing.kumquat.com server.
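The difference between the three samples is easiest to see by splitting them. The following Python sketch uses the standard urllib.parse module; the helper name file_url_parts is our own, and the sketch only inspects the URLs without retrieving anything.

```python
from urllib.parse import urlsplit


def file_url_parts(url: str) -> tuple[str, str]:
    """Return (server, path) for a file URL; server is '' when omitted."""
    parts = urlsplit(url)
    return parts.netloc, parts.path


for sample in (
    "file://localhost/home/chuck/document.html",
    "file:///home/chuck/document.html",
    "file://marketing.kumquat.com/monthly_sales.html",
):
    server, path = file_url_parts(sample)
    print(server or "(local machine)", path)
```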

6.2.6. The news URL

The news URL accesses either a single message or an entire newsgroup within the Usenet news system. It has two forms:

news:newsgroup
news:message_id

An unfortunate limitation in news URLs is that they don't allow you to specify a server for the newsgroup. Rather, users specify their news-server resource in their browser preferences. At one time, not long ago, Internet newsgroups were nearly universally distributed; all news servers carried all the same newsgroups and their respective articles, so one news server was as good as any. Today, the sheer bulk of disk space needed to store the daily volume of newsgroup activity is often prohibitive for any single news server, and there's also local censorship of newsgroups. Hence you cannot expect that all newsgroups, and certainly not all articles for a particular newsgroup, will be available on the user's news server.

Many users' browsers may not be correctly configured to read news. We recommend you avoid placing news URLs in your documents except in rare cases.

6.2.6.1. Accessing entire newsgroups

There are several thousand newsgroups devoted to nearly every conceivable topic under the sun and beyond. Each group has a unique name, composed of hierarchical elements separated by periods. For example, the World Wide Web announcements newsgroup is:

comp.infosys.www.announce

To access this group, use the URL:

news:comp.infosys.www.announce

6.2.7. The nntp URL

The nntp URL goes beyond the news URL to provide a complete mechanism for accessing articles in the Usenet news system. It has the form:

nntp://server:port/newsgroup/article

6.2.8. The mailto URL

The mailto URL causes an electronic mail message to be transmitted to a named recipient. It has the format:

mailto:address

The address is any valid email address, usually of the form:

user@server

Thus, a typical mailto URL might look like:

mailto:cmusciano@aol.com

Browsers like Netscape honor multiple recipients in the mailto URL, separated by a comma. For example:

mailto:cmusciano@aol.com,bkennedy@activmedia.com,booktech@ora.com

will address the message to all three recipients. There should be no spaces before or after the commas in the URL.

6.2.8.1. Defining mail header fields

Most browsers open an email composition window when the user selects a mailto URL. The recipient's address is filled in, taken from the URL, but the message subject and various other header fields are left blank. Many webmasters would like to fill in these fields as a courtesy to their readers, but the URL standard provides no way to do this.

The modern browsers extend the mailto URL to fill this gap. By adding CGI-like parameters to the mailto URL, you can set the value of the subject with Netscape and Internet Explorer, and also the cc (carbon copy) and bcc (blind carbon copy) fields for the mail message with Netscape. The following URLs work with Netscape; only the first one works correctly with Internet Explorer. (See Section 9.2.4.2, "Passing parameters explicitly.")

mailto:cmusciano@aol.com?subject=Loved your book!
mailto:cmusciano@aol.com?cc=booktech@oreilly.com
mailto:cmusciano@aol.com?bcc=archive@myserver.com

As you can probably guess, the first URL sets the subject of the message. Note that spaces are allowed; you don't have to replace them with the hexadecimal equivalent %20. The second URL places the address booktech@oreilly.com in the cc field of a Netscape message. Similarly, the last example sets the bcc field of the message. You may also set several fields in one URL by separating the field definitions with ampersands. For example:

mailto:cmusciano@aol.com?subject=Loved your book!&cc=booktech@oreilly.com&bcc=archive@myserver.com

sets the subject, the carbon-copy address, and the blind-carbon-copy address.

Internet Explorer Version 3 does not recognize the bcc and cc fields in the mailto URL and will either complain about them if they appear alone or append them to a preceding subject.
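If you generate mailto URLs from a program, a small helper keeps the ampersands and encodings straight. The sketch below is our own illustration in Python; the function name build_mailto is hypothetical. Although the browsers discussed here accept literal spaces in the subject, the sketch encodes them as %20 anyway, on the assumption that the conservative encoding rule from earlier in this chapter is the safer choice.

```python
from urllib.parse import quote, urlencode


def build_mailto(address: str, **fields: str) -> str:
    """Build a mailto URL, appending header fields such as subject or cc."""
    if not fields:
        return f"mailto:{address}"
    # quote_via=quote yields %20 for spaces; safe="@" keeps addresses readable.
    query = urlencode(fields, quote_via=quote, safe="@")
    return f"mailto:{address}?{query}"


print(build_mailto("cmusciano@aol.com"))
print(build_mailto("cmusciano@aol.com",
                   subject="Loved your book!",
                   cc="booktech@oreilly.com"))
```

Whether a given browser honors the cc and bcc fields is a separate matter, as described above.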

6.2.10. The gopher URL

Gopher is a web-like document retrieval system that achieved some popularity on the Internet just before the World Wide Web took off, making Gopher obsolete. Some Gopher servers still exist, though, and the gopher URL lets you access Gopher documents. The gopher URL has the form:

gopher://server:port/path

6.2.11. Absolute and Relative URLs

You may write a URL in one of two ways: absolute or relative. An absolute URL is the complete address of a resource and has everything your system needs to find a document and its server on the Web. At the very least, an absolute URL contains the scheme and all required elements of the scheme_specific_part of the URL. It may also contain any of the optional portions of the scheme_specific_part.

With a relative URL, you provide an abbreviated document address that, when automatically combined with a "base address" by the system, becomes a complete address for the document. Within the relative URL, any component of the URL may be omitted. The browser automatically fills in the missing pieces of the relative URL using corresponding elements of a base URL. This base URL is usually the URL of the document containing the relative URL, but may be another document specified with the <base> tag. (See Section 6.7.1, "The <base> Header Element.")

6.2.11.2. Relative document directories

Another common form of a relative URL omits the leading slash and one or more directory names from the beginning of the document pathname. The directory of the base URL is automatically assumed to replace these missing components. It's the most common abbreviation, because most authors place their collection of documents and subdirectories of support resources in the same directory path as the home page. For example, you might have a special/ subdirectory containing FTP files referenced in your document. Let's say that the absolute URL for that document is:

http://www.kumquat.com/planting/guide.html

A relative URL for the file README.txt in the special/ subdirectory looks like this:

ftp:special/README.txt

You'll actually be retrieving:

ftp://www.kumquat.com/planting/special/README.txt

Visually, the operation looks like that in Table 6-3.

Table 6-3. Forming an Absolute FTP URL

              Protocol  Server           Directory          File
Base URL      http      www.kumquat.com  /planting          guide.html
Relative URL  ftp                        special            README.txt
Absolute URL  ftp       www.kumquat.com  /planting/special  README.txt

Common "dot-slash" pathname notations also let you express the current directory ("./") and directory above the current directory (parent; "../") in a relative URL. The current directory notation is rarely used, since it is redundant. But the parent notation lets you set the target URL to directories in other branches of the filesystem hierarchy.

For example, if the directory portion of the current URL is /planting/special/, and you want to reference an HTML document named new_ground.html in planting/standard/, you may simply form the relative URL as:

../standard/new_ground.html

You'll actually be retrieving:

http://www.kumquat.com/planting/standard/new_ground.html

Note that parent notation has limits. For instance, most web servers will not let you navigate above the base directory: http://www.kumquat.com/../ probably won't deliver any document or directory listing to your browser.
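You can experiment with relative URLs outside a browser, too. The following sketch uses urljoin from Python's standard urllib.parse module to combine a base URL with a relative one. The document name index.html in the first base URL is hypothetical, and both examples stay within the http scheme, with the relative URLs written without any scheme at all.

```python
from urllib.parse import urljoin

# Parent notation: a document in /planting/special/ refers to a sibling directory.
base_in_special = "http://www.kumquat.com/planting/special/index.html"
via_parent = urljoin(base_in_special, "../standard/new_ground.html")
print(via_parent)

# Omitted leading directories: the base URL's directory fills them in.
base_guide = "http://www.kumquat.com/planting/guide.html"
via_directory = urljoin(base_guide, "special/README.txt")
print(via_directory)
```

The first result matches the absolute URL shown above for new_ground.html; the second shows the base directory /planting being supplied for the missing leading components.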




Copyright © 2002 O'Reilly & Associates. All rights reserved.