Content Negotiation (CGI Programming with Perl)

2.6. Content Negotiation

People from all over the world access the same Internet, using many different languages, many different character sets, and many different browsers. One representation of a document is not going to satisfy the requirements of all these people. This is why HTTP provides something called content negotiation, which allows clients and servers to negotiate the best possible format for each given resource.

For example, say you want to make a document available in multiple languages. You could store each translation of this document separately so that they each have a unique URL. This would be a bad idea for a number of reasons, but most importantly because you would have to advertise multiple URLs for the same resource. URLs have been designed to be easily exchanged offline as well as via hyperlinks, and there is no reason why people who speak different languages should not be able to share the same URL. By utilizing content negotiation, you can offer the appropriate translation of a requested document automatically.

There are four primary forms of content negotiation: language, character set, media type, and encoding. Each has its own corresponding header, but the negotiation process works the same way for all of them. Negotiation can be performed by the server or by the client. In server-side negotiation, the client sends a header indicating the forms of content it accepts, and the server responds by selecting one of these options and returning the resource in the appropriate format. In client-side negotiation, the client requests a resource without special headers, the server sends the client a list of the available formats, the client makes an additional request specifying the format it wants, and the server returns the resource in that format. Clearly there is more overhead in client-side negotiation (although caching helps), but the client is generally better than the server at choosing the most appropriate format.

2.6.1. Media Type

Clients may include a header with their HTTP request indicating a list of preferred formats. The header for media type looks like this:

Accept: text/html;q=1, text/plain;q=0.8, 
        image/jpeg, image/gif, */*;q=0.001

The Accept header contains a list of HTTP media types in the type/subtype format used by the Content-Type header, each followed by an optional quality factor (asterisks serve as wildcards). Quality factors are floating-point numbers between 0 and 1 that indicate a preference for a particular type; the default is 1. Servers are expected to examine the Accept media types and return data in the format the browser prefers. When multiple values have the same quality factor, the more specific one (i.e., where the quality factor is specified or the media type is not a wildcard) has higher priority.

In the previous example, documents would be returned with the following priority:

text/html
image/jpeg or image/gif
text/plain
*/* (anything else)
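
The ordering above can be reproduced with a short sketch. The following Python is only an illustration of the rule as this section describes it (highest quality factor first, with ties broken in favor of an explicitly stated quality factor and a non-wildcard type). The function names parse_accept and rank_media_types are hypothetical, and the header is written on one line for convenience.

```python
def parse_accept(header):
    """Parse an Accept-style header into (value, q, explicit_q) tuples."""
    entries = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        parts = [p.strip() for p in item.split(";")]
        value, q, explicit = parts[0], 1.0, False
        for param in parts[1:]:
            name, _, raw = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(raw)
                    explicit = True
                except ValueError:
                    pass  # malformed quality factor: keep the default of 1
        entries.append((value, q, explicit))
    return entries


def rank_media_types(header):
    """Return media types ordered from most to least preferred."""
    def sort_key(entry):
        value, q, explicit = entry
        wildcard = "*" in value
        # Higher q first; on ties, an explicit q and a non-wildcard type win.
        return (-q, not explicit, wildcard)

    return [value for value, _, _ in sorted(parse_accept(header), key=sort_key)]


header = "text/html;q=1, text/plain;q=0.8, image/jpeg, image/gif, */*;q=0.001"
print(rank_media_types(header))
# prints ['text/html', 'image/jpeg', 'image/gif', 'text/plain', '*/*']
```

Because Python's sort is stable, image/jpeg and image/gif keep the order in which the client listed them, which matches their shared position in the list above.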

In reality, media type negotiation is not often used because it is unwieldy for a browser to list the media types of all documents it supports each time it makes a request. The majority of browsers today specify only new or less common image formats in addition to */*. Examples of the newer formats include image/p-jpeg (progressive JPEG) and image/png. (PNG was created as an open alternative to GIF, which has patent issues; see Chapter 13, "Creating Graphics on the Fly.") Web servers generally do not support media type negotiation for static documents, but we will look at a CGI script that does this in the next chapter.

2.6.2. Internationalization

Although media type negotiation is becoming outdated, other forms of content negotiation are gaining importance. Internationalization is one area where content negotiation plays a significant role. Providing a document to readers in other countries can mean two things: supporting other translations and possibly supporting other character sets. The Roman alphabet, the Cyrillic alphabet, and Kanji, for example, use different character sets. HTTP supports these forms of negotiation with the Accept-Language and Accept-Charset headers. Examples of these headers are:

Accept-Charset: iso-8859-5, iso-8859-1;q=0.5
Accept-Language: ru, en-gb;q=0.5, en;q=0.4

The first line indicates that the server should return the content in Cyrillic if possible or Western Roman otherwise. The second line specifies Russian as the first choice of language, with British English as the second and other forms of English as the third. Note that a single asterisk can be used in place of any of these values to represent a wildcard match. The default character set, unless otherwise specified, is US-ASCII or ISO-8859-1 (US-ASCII is a subset of ISO-8859-1).

Most web servers support language negotiation automatically for static documents. For example, if you perform a new installation of Apache, it will install multiple copies of the "It Worked!" welcome file in /usr/local/apache/htdocs. The files all share the index.html base name but have different extensions indicating the language code: index.html.en, index.html.fr, index.html.de, etc. If you point your browser at index.html, change the preferred language in your browser, and then reload the page, you should see it in another language.
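
To make the selection step concrete, here is a rough Python sketch of how a server might map an Accept-Language header onto files named like the ones above. It is not Apache's actual algorithm. It assumes the HTTP/1.1 matching rule that a language range matches a tag that is equal to it or that begins with it followed by a hyphen, and the names parse_accept_language, matches, and choose_variant, as well as the "en" fallback, are hypothetical choices made for the example.

```python
def parse_accept_language(header):
    """Return (language_range, q) pairs from an Accept-Language header."""
    prefs = []
    for item in header.split(","):
        tag, _, params = item.strip().partition(";")
        tag = tag.strip().lower()
        if not tag:
            continue
        q = 1.0
        name, _, raw = params.partition("=")
        if name.strip().lower() == "q":
            try:
                q = float(raw)
            except ValueError:
                pass  # malformed quality factor: keep the default of 1
        prefs.append((tag, q))
    return prefs


def matches(language_range, code):
    """True if the range is '*', equals the code, or is a prefix before '-'."""
    return (language_range == "*"
            or code == language_range
            or code.startswith(language_range + "-"))


def choose_variant(header, available, default="en"):
    """Pick the index.html variant whose language the client prefers most."""
    best, best_q = default, 0.0
    for language_range, q in parse_accept_language(header):
        for code in available:
            # A strict comparison keeps the earliest match when q values tie,
            # and skips ranges the client marked as unacceptable with q=0.
            if q > best_q and matches(language_range, code.lower()):
                best, best_q = code, q
    return "index.html." + best


available = ["en", "fr", "de"]
print(choose_variant("ru, en-gb;q=0.5, en;q=0.4", available))
# prints index.html.en
```

With the example header from this section, neither ru nor en-gb matches an available file, so the sketch falls through to the plain en range and returns the English page.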

2.6.3. Encoding

The final form of content negotiation concerns encoding. Options for encoding include gzip, compress, and identity (no encoding). Here is an example header specifying that the browser supports compress and gzip:

Accept-Encoding: compress, gzip

A server may be able to speed up the download of a large document to this client by sending an encoded version of the document. The browser should decode the document automatically for the user.
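
As a minimal sketch of the server's side of this exchange, the Python below compresses a response body with the standard gzip module when the client lists gzip in Accept-Encoding, and otherwise sends the body unchanged. The helper names accepted_encodings and encode_body and the dictionary of response headers are assumptions made for the example; the compress encoding is deliberately left unhandled.

```python
import gzip


def accepted_encodings(header):
    """Return the set of encodings the client lists with a non-zero q."""
    accepted = set()
    for item in header.split(","):
        name, _, params = item.strip().partition(";")
        name = name.strip().lower()
        if not name:
            continue
        q = 1.0
        key, _, raw = params.partition("=")
        if key.strip().lower() == "q":
            try:
                q = float(raw)
            except ValueError:
                pass  # malformed quality factor: keep the default of 1
        if q > 0:
            accepted.add(name)
    return accepted


def encode_body(body, accept_encoding):
    """Return (headers, payload), compressing when the client allows gzip."""
    if "gzip" in accepted_encodings(accept_encoding):
        payload = gzip.compress(body)
        headers = {"Content-Encoding": "gzip",
                   "Content-Length": str(len(payload))}
        return headers, payload
    # No supported encoding was offered: send the document as-is (identity).
    return {"Content-Length": str(len(body))}, body


document = b"Hello, world! " * 200
headers, payload = encode_body(document, "compress, gzip")
print(headers["Content-Encoding"], len(payload) < len(document))
```

The Content-Encoding response header is what tells the browser to decode the payload before displaying it.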