Internationalization (Web Design in a Nutshell, 2nd Edition)

7.1. Character Sets

The first challenge in internationalization is dealing with the staggering number of unique character shapes (called "glyphs") that occur in the writing systems of the world. This includes not only alphabets, but also all ideographs (characters that indicate a whole word or concept) for languages such as Chinese, Japanese, and Korean.

7.1.1. 8-Bit Encoded Character Sets

Character encodings (or character sets) are organizations of characters -- units of a written language system -- in which each character is assigned a specific number. Each character may be associated with a number of different glyphs; for instance, the "close quote" character may be displayed using a " or » glyph, depending on the language. In addition, a single glyph may correspond to different characters, such as a comma serving both as the punctuation symbol for a pause in a sentence and as a decimal indicator in some languages.

The number of characters available in a character set is limited by the bit-depth of its encoding. For example, 8 bits are capable of describing 256 unique characters, which is enough for most western languages.
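The relationship between bit depth and character count can be checked with a few lines of Python. This is only an illustrative sketch; the helper name code_space is made up for the example, and Latin-1 stands in for any 8-bit set.

```python
# code_space is a hypothetical helper: how many distinct values fit in `bits` bits.
def code_space(bits):
    return 2 ** bits

print(code_space(8))  # 256

# In an 8-bit set such as Latin-1, every character occupies exactly one byte.
encoded = "é".encode("latin-1")
print(len(encoded))  # 1

# A character outside the set's 256 slots cannot be encoded at all.
try:
    "я".encode("latin-1")
    fits = True
except UnicodeEncodeError:
    fits = False
print(fits)  # False
```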

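HTML 2.0 and 3.2 are based on the 8-bit character set for western languages called Latin-1 (or ISO 8859-1). There are a number of other encodings built on 8-bit bytes, including the following (the two Japanese encodings combine bytes into multi-byte sequences to reach beyond 256 characters):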

ISO 8859-5	Cyrillic
ISO 8859-6	Arabic
ISO 8859-7	Greek
ISO 8859-8	Hebrew
SHIFT_JIS	Japanese
EUC-JP	Japanese
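Because each of these sets assigns its own meaning to the upper byte values, the same byte produces a different character depending on which encoding is assumed. The sketch below uses Python's built-in codecs to show this; the sample byte and the sample Japanese string are arbitrary choices for the example.

```python
# One byte value, interpreted under several single-byte ISO 8859 sets.
raw = b"\xe0"
single_byte = ["iso8859_1", "iso8859_5", "iso8859_6", "iso8859_7", "iso8859_8"]
decoded = {name: raw.decode(name) for name in single_byte}
for name, char in decoded.items():
    print(name, repr(char), hex(ord(char)))

# The Japanese encodings need more than one byte for each kanji,
# and the two encodings produce different byte sequences for the same text.
japanese = "日本"
sjis = japanese.encode("shift_jis")
eucjp = japanese.encode("euc_jp")
print(len(japanese), len(sjis), len(eucjp))
```

This is why a page must declare its encoding: the bytes alone do not say which table to use.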

7.1.2. 16-Bit Encoded Character Sets

Sixteen bits of information are capable of representing 65,536 (2^16) different characters -- enough to contain a large number of alphabets and ideographs. In 1991, the Unicode Consortium created a 16-bit encoded "super" character set called Unicode (practically identical to another standard called ISO 10646-1) which includes nearly every character from the world's writing systems. The combination of Unicode and ISO 10646 is called the Universal Character Set (UCS). Each character is assigned a unique two-octet code (2 groups of 8 bits, making 16 bits total). The first 256 slots are given to the ISO 8859-1 character set, so it is backwards compatible.
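The two-octet codes and the Latin-1 compatibility described above can be observed directly. In this sketch, Python's big-endian UTF-16 codec is used as a stand-in for the two-octet form (an assumption of the example; the two agree for characters in the 16-bit range), and two_octets is a hypothetical helper name.

```python
print(2 ** 16)  # 65536

# For the first 256 slots, the Latin-1 byte value equals the Unicode code point.
latin1_samples = ["A", "é", "ÿ"]
matches = [ch.encode("latin-1")[0] == ord(ch) for ch in latin1_samples]
print(matches)  # [True, True, True]

# two_octets is a hypothetical helper returning the 16-bit code as two bytes.
def two_octets(ch):
    return ch.encode("utf-16-be")

print(two_octets("é").hex())   # 00e9
print(two_octets("日").hex())  # 65e5
```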

The HTML 4.01 specification officially adopts Unicode as its document character set. So regardless of the character encoding used when a document was created, it is converted to the document character set by the browser, which interprets characters with special meaning in HTML (such as < and >) and converts character entities (such as &copy; for ©). In cases where a character entity points outside of the Latin-1 character set (e.g., &piv; for ϖ), HTML 4.0 browsers use the Unicode character set to display the correct character.
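The entity conversion step can be imitated outside a browser. The following sketch uses html.unescape from Python's standard library, which resolves named and numeric character references to Unicode characters; it approximates what a browser does and is not a browser's actual implementation. The sample strings are chosen for illustration.

```python
import html

# A named entity inside the Latin-1 range.
copyright_sign = html.unescape("&copy;")
print(copyright_sign, hex(ord(copyright_sign)))  # © 0xa9

# A named entity and a numeric reference that both point outside Latin-1.
named = html.unescape("&piv;")
numeric = html.unescape("&#982;")
print(named, numeric, hex(ord(named)))  # ϖ ϖ 0x3d6

# Escaped markup characters are converted back to < and >.
print(html.unescape("&lt;p&gt;"))  # <p>
```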

This is the first step toward making the Web truly multilingual. The current refinements to character-set handling on the Web are documented in a working draft, the Character Model for the World Wide Web 1.0, published by the W3C (http://www.w3.org/TR/charmod/).

A Unicode Font

Bitstream has created a TrueType font called "Cyberbit" that contains a large percentage of the Unicode character set. It is available only via licensing to developers and is unfortunately no longer offered as a retail product. For more information about Cyberbit, contact Bitstream's developer products department at oemsales@bitstream.com.

7.1.3. Specifying Character Encoding

When a web client (a browser) and a server make a transaction, meta-information about the requested and returned document is communicated in the HTTP headers for the request and response. One of the most important pieces of information specified is the content-type, which describes the type of data the server is sending. The charset parameter further specifies the character set used for a text document. A typical HTTP header looks like this:

Content-type: text/html; charset=ISO-8859-8

To deliberately set the character-encoding information in a document header, use the <meta> tag with its http-equiv attribute (which the browser treats as equivalent to the corresponding HTTP header). The meta tag that corresponds to the above header message looks like this:

<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-8">

Note that the browser must support your chosen character set in order for the page to display properly.
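As a rough illustration of how a client might read the charset parameter and check that it supports the encoding, the sketch below parses a Content-Type value with Python's standard email.message module. The function names charset_from_content_type and is_supported are hypothetical, and the sample text is arbitrary.

```python
import codecs
from email.message import Message

# Hypothetical helper: split a Content-Type value into media type and charset.
def charset_from_content_type(value):
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_type(), msg.get_content_charset()

# Hypothetical helper: True if this Python build has a codec for the charset.
def is_supported(charset):
    try:
        codecs.lookup(charset)
        return True
    except LookupError:
        return False

media_type, charset = charset_from_content_type("text/html; charset=ISO-8859-8")
print(media_type, charset)  # text/html iso-8859-8

if is_supported(charset):
    # Round-trip some Hebrew text through the declared encoding.
    body = "שלום".encode(charset)
    print(len(body), body.decode(charset))
```

The same parsing applies to the content value of the meta tag, since it carries an identical string.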

Browsers that are capable of sending an Accept-Charset request header can specify their preferred character encodings when requesting a document. The server can then serve the document with the appropriate encoding, if the preferred version is available.
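A server-side selection along these lines might look like the following. This is a simplified sketch, not a complete implementation of HTTP content negotiation: it assumes the header is a comma-separated list of charset names with optional q weights, and the function names parse_accept_charset and choose_charset are invented for the example.

```python
# Hypothetical helper: map each requested charset to its preference weight.
def parse_accept_charset(header):
    prefs = {}
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, params = item.partition(";")
        weight = 1.0  # a charset without a q value gets full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                weight = float(params[2:])
            except ValueError:
                weight = 0.0
        prefs[name.strip().lower()] = weight
    return prefs

# Hypothetical helper: pick the available version the client prefers most.
def choose_charset(header, available):
    prefs = parse_accept_charset(header)
    best, best_weight = None, 0.0
    for charset in available:
        # "*" in the header matches any charset not listed explicitly.
        weight = prefs.get(charset.lower(), prefs.get("*", 0.0))
        if weight > best_weight:
            best, best_weight = charset, weight
    return best

versions = ["utf-8", "iso-8859-5"]
print(choose_charset("iso-8859-5, utf-8;q=0.8", versions))        # iso-8859-5
print(choose_charset("iso-8859-5;q=0.5, utf-8;q=0.8", versions))  # utf-8
print(choose_charset("shift_jis", versions))                      # None
```

Returning None here models the case where no preferred version is available; what the server sends in that situation is a separate policy decision.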

The accept-charset attribute is already a part of the HTML 4.0 specification for form elements. With the accept-charset attribute, the document can specify which character sets the server can receive from the user in text input fields.
