Chapter 7. Internationalization
If the Web is to reach a truly worldwide audience, it needs to be able to support the display of all the languages of the world, with all their unique alphabets and symbols, directionality, and specialized punctuation. This poses a big challenge to HTML constructs as we know them. However, according to the W3C, "energetic efforts" are being made toward this complicated goal.
The W3C's efforts for internationalization (often referred to as "i18n" -- an i, then 18 letters, then an n) address two primary issues. First is the handling of alternative character sets that take into account all the writing systems of the world; second is how to specify languages and their unique presentation requirements within an HTML document. Many solutions presented by internationalization experts in a document called RFC 2070 were incorporated into the current HTML 4.0, XML 1.0, and CSS2 specifications.
This chapter addresses key issues for internationalization, including character sets and new language features in HTML 4 and CSS2. Be aware that many of these features are not yet supported by browsers, even the most current.
7.1. Character Sets
The first challenge in internationalization is dealing with the staggering number of unique character shapes (called "glyphs") that occur in the writing systems of the world. This includes not only alphabets, but also all ideographs (characters that indicate a whole word or concept) for languages such as Chinese, Japanese, and Korean.
7.1.1. 8-Bit Encoded Character Sets
Character encodings (or character sets) are organizations of characters -- units of a written language system -- in which each character is assigned a specific number. Each character may be associated with a number of different glyphs; for instance, the "close quote" character may be displayed using a " or » glyph, depending on the language. In addition, a single glyph may correspond to different characters, such as a comma serving as both the punctuation symbol for a pause in a sentence as well as a decimal indicator in some languages.
The number of characters available in a character set is limited by the bit-depth of its encoding. For example, 8 bits are capable of describing 256 unique characters, which is enough for most western languages.
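The arithmetic behind that limit can be checked with a short sketch. This is only an illustration in Python; the helper name `code_space` is hypothetical, and Python's `latin-1` codec is used here as a stand-in for an 8-bit western character set (it maps all 256 byte values, including the control-code positions).

```python
def code_space(bits: int) -> int:
    """Return how many distinct codes an encoding of this bit-depth can hold."""
    return 2 ** bits


# Decode all 256 possible byte values as Latin-1: each byte yields one character.
latin1_chars = bytes(range(256)).decode("latin-1")

print(code_space(8))           # 256
print(len(latin1_chars))       # 256
print("é".encode("latin-1"))   # b'\xe9' -- a single byte for an accented letter
```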
HTML 2.0 and 3.2 are based on the 8-bit character set for western languages called Latin-1 (or ISO 8859-1). There are a number of other 8-bit encodings in the same ISO 8859 family, including ISO 8859-5 (Cyrillic), ISO 8859-7 (Greek), and ISO 8859-8 (Hebrew).
7.1.2. 16-Bit Encoded Character Sets
Sixteen bits of information are capable of representing 65,536 (2^16) different characters -- enough to contain a large number of alphabets and ideographs. In 1991, the Unicode Consortium created a 16-bit encoded "super" character set called Unicode (practically identical to another standard called ISO 10646-1) which includes nearly every character from the world's writing systems. The combination of Unicode and ISO 10646 is called the Universal Character Set (UCS). Each character is assigned a unique two-octet code (2 groups of 8 bits, making 16 bits total). The first 256 slots are given to the ISO 8859-1 character set, so it is backwards compatible.
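The backwards compatibility with ISO 8859-1 can be verified directly. The sketch below is an illustration in Python, not part of any specification; it assumes Python's `utf-16-be` codec as a convenient way to show a two-octet code, and the function name `latin1_matches_unicode` is hypothetical.

```python
def latin1_matches_unicode() -> bool:
    """Check that each Latin-1 byte value equals its Unicode code point."""
    for byte_value in range(256):
        char = bytes([byte_value]).decode("latin-1")
        if ord(char) != byte_value:
            return False
    return True


print(latin1_matches_unicode())    # True

# Two octets per character: the Latin-1 letter keeps its value in the low octet.
print("é".encode("utf-16-be"))     # b'\x00\xe9'
# An ideograph (U+4E2D) also fits in two octets.
print("中".encode("utf-16-be"))    # b'N-', i.e. the octets 0x4E 0x2D
```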
The HTML 4.01 specification officially adopts Unicode as its document character set. So regardless of the character encoding used when a document was created, the browser converts it to the document character set, interprets characters with special meaning in HTML (such as < and >), and converts character entities (such as &copy; for ©). In cases where a character entity points outside of the Latin-1 character set (e.g., &piv; for ϖ), HTML 4.0 browsers use the Unicode character set to display the correct character.
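Entity conversion of this kind can be sketched with Python's standard `html` module, which resolves named entities to Unicode characters. This only illustrates the mapping; it is not how any particular browser implements it, and the sample strings are made up for the example.

```python
import html

# A Latin-1 entity: the copyright sign lives at code point 169 (0xA9).
copyright_sign = html.unescape("&copy;")
print(copyright_sign, ord(copyright_sign))   # © 169

# An entity outside Latin-1: its code point is above 255, so Unicode is needed.
pi_symbol = html.unescape("&piv;")
print(ord(pi_symbol) > 255)                  # True

# Going the other way, markup-significant characters are escaped for output.
print(html.escape("a < b"))                  # a &lt; b
```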
This is the first step toward making the Web truly multilingual. The current refinements to character-set handling on the Web are documented in a working draft, the Character Model for the World Wide Web 1.0, published by the W3C (http://www.w3.org/TR/charmod/).
7.1.3. Specifying Character Encoding
When a web client (a browser) and a server make a transaction, meta-information about the requested and returned document is communicated in the HTTP headers for the request and response. One of the most important bits of information specified is the content-type, which describes the type of data the server is sending. The charset parameter further specifies the character set used for a text document. A typical HTTP header looks like this:
Content-type: text/html; charset=ISO-8859-8
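On the receiving side, a client has to pull the charset parameter out of that header before it can decode the body. A minimal sketch in Python, using the standard library's `email.message.Message` to parse the header value; the helper name `charset_from_content_type` and the sample Hebrew text are assumptions for illustration.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the lowercased charset parameter, or None if it is absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


charset = charset_from_content_type("text/html; charset=ISO-8859-8")
print(charset)   # iso-8859-8

# Use the declared charset to decode the bytes of the response body.
body_bytes = "שלום".encode("iso-8859-8")   # stand-in for bytes from the server
print(body_bytes.decode(charset) == "שלום")   # True
```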
To deliberately set the character-encoding information in a document header, use the <meta> tag with its http-equiv attribute (which adds its values into the HTTP header). The meta tag that corresponds to the above header message looks like this:
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-8">
Note that the browser must support your chosen character set in order for the page to display properly.
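A rough sketch of how a client might read the declaration from the meta tag and check whether it can handle the named encoding follows. It uses Python's `html.parser` and `codecs` modules; the class `MetaCharsetFinder` and the function `is_supported` are hypothetical names, and a real browser's detection logic is considerably more involved.

```python
import codecs
from email.message import Message
from html.parser import HTMLParser
from typing import Optional


class MetaCharsetFinder(HTMLParser):
    """Record the charset from the first <meta http-equiv="Content-Type"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.charset: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <META> matches too.
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        content = attributes.get("content")
        if (attributes.get("http-equiv") or "").lower() == "content-type" and content:
            msg = Message()
            msg["Content-Type"] = content
            self.charset = msg.get_content_charset()


def is_supported(charset: str) -> bool:
    """Report whether this Python installation has a codec for the charset."""
    try:
        codecs.lookup(charset)
        return True
    except LookupError:
        return False


finder = MetaCharsetFinder()
finder.feed('<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-8">')
print(finder.charset)                 # iso-8859-8
print(is_supported(finder.charset))   # True
```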
Browsers that are capable of sending an Accept-Charset header with an HTTP request can specify their preferred character encodings when requesting a document. The server can then serve the document with the appropriate encoding, if a version in a preferred encoding is available.
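The server-side choice can be sketched as a small negotiation routine. The code below is a simplified illustration in Python, assuming the comma-separated list with optional `q=` quality values that HTTP uses for this kind of preference list. The function names are hypothetical, and the sketch ignores the special default that HTTP/1.1 gives to ISO-8859-1.

```python
from typing import Dict, List, Optional


def parse_accept_charset(header_value: str) -> Dict[str, float]:
    """Map each listed charset (lowercased) to its quality value, default 1.0."""
    preferences: Dict[str, float] = {}
    for item in header_value.split(","):
        name, _, params = item.strip().partition(";")
        if not name.strip():
            continue
        quality = 1.0
        params = params.strip()
        if params.lower().startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0   # treat a malformed quality value as "not wanted"
        preferences[name.strip().lower()] = quality
    return preferences


def choose_charset(header_value: str, available: List[str]) -> Optional[str]:
    """Pick the available charset the client rates highest, or None."""
    preferences = parse_accept_charset(header_value)
    wildcard = preferences.get("*", 0.0)
    best, best_quality = None, 0.0
    for charset in available:
        quality = preferences.get(charset.lower(), wildcard)
        if quality > best_quality:
            best, best_quality = charset, quality
    return best


header = "iso-8859-8, utf-8;q=0.7, *;q=0.1"
print(choose_charset(header, ["utf-8", "iso-8859-8"]))   # iso-8859-8
print(choose_charset("iso-8859-5", ["utf-8"]))           # None
```

Returning None leaves it to the caller to decide whether to fall back to a default encoding or report that no acceptable version exists.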
The accept-charset attribute is already a part of the HTML 4.0 specification for form elements. With the accept-charset attribute, the document can specify which character sets the server can receive from the user in text input fields.
Copyright © 2002 O'Reilly & Associates. All rights reserved.