

Web Design in a Nutshell

Chapter 7. Internationalization

If the Web is to reach a truly worldwide audience, it needs to be able to support the display of all the languages of the world, with all their unique alphabets and symbols, directionality, and specialized punctuation. This poses a big challenge to HTML constructs as we know them. However, according to the W3C, "energetic efforts" are being made toward this complicated goal.

The W3C's efforts for internationalization (often referred to as "i18n" -- an i, then 18 letters, then an n) address two primary issues. First is the handling of alternative character sets that take into account all the writing systems of the world; second is how to specify languages and their unique presentation requirements within an HTML document. Many solutions presented by internationalization experts in a document called RFC 2070 were incorporated into the current HTML 4.0, XML 1.0, and CSS2 specifications.

This chapter addresses key issues for internationalization, including character sets and new language features in HTML 4 and CSS2. Be aware that many of these features are not yet supported by browsers, even the most current.

7.1. Character Sets

The first challenge in internationalization is dealing with the staggering number of unique character shapes (called "glyphs") that occur in the writing systems of the world. This includes not only alphabets, but also all ideographs (characters that indicate a whole word or concept) for languages such as Chinese, Japanese, and Korean.

7.1.2. 16-Bit Encoded Character Sets

Sixteen bits of information are capable of representing 65,536 (2^16) different characters -- enough to contain a large number of alphabets and ideographs. In 1991, the Unicode Consortium created a 16-bit encoded "super" character set called Unicode (practically identical to another standard called ISO 10646-1) which includes nearly every character from the world's writing systems. The combination of Unicode and ISO 10646 is called the Universal Character Set (UCS). Each character is assigned a unique two-octet code (2 groups of 8 bits, making 16 bits total). The first 256 slots are given to the ISO 8859-1 character set, so it is backwards compatible.
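The arithmetic and the ISO 8859-1 compatibility described above can be checked with a short Python sketch. It is only an illustration: the helper name two_octet_code and the variable names are hypothetical, and the two-octet form shown here is simply the code point written as a big-endian 16-bit number.

```python
# Illustrative sketch; two_octet_code is a hypothetical helper, not part of any standard.

def two_octet_code(ch: str) -> bytes:
    """Return the 16-bit (two-octet) big-endian code for one character."""
    code_point = ord(ch)
    if code_point > 0xFFFF:
        raise ValueError("character lies outside the 16-bit range")
    return code_point.to_bytes(2, "big")

# Sixteen bits give 2**16 distinct values.
slots = 2 ** 16

# Each ISO 8859-1 byte value decodes to the Unicode code point with the same number,
# which is what makes the first 256 slots backwards compatible.
latin1_matches = all(
    ord(bytes([value]).decode("latin-1")) == value for value in range(256)
)

# The copyright sign is code point 169 in both ISO 8859-1 and Unicode.
copyright_code = two_octet_code("\u00a9")
```

Here the second octet of copyright_code equals the single ISO 8859-1 byte for the same character, and the first octet is zero.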

The HTML 4.01 specification officially adopts Unicode as its document character set. So regardless of the character encoding used when a document was created, it is converted to the document character set by the browser, which interprets characters with special meaning in HTML (such as < and >) and converts character entities (such as &#169; for ©). In cases where a character entity points outside of the Latin-1 character set (e.g., &#982; for ϖ), HTML 4.0 browsers use the Unicode character set to display the correct character.
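As a rough illustration of how numeric character references resolve against Unicode, the following Python sketch uses the standard library's html.unescape and unicodedata modules. It mimics only the reference-resolution step, not a browser's full parsing behavior, and the variable names are hypothetical.

```python
import html
import unicodedata

# Resolve the two numeric character references mentioned in the text.
copyright_sign = html.unescape("&#169;")   # code point 169, inside Latin-1
pi_symbol = html.unescape("&#982;")        # code point 982, outside Latin-1

# Latin-1 covers code points 0 through 255 only.
copyright_in_latin1 = ord(copyright_sign) <= 0xFF
pi_in_latin1 = ord(pi_symbol) <= 0xFF

# Look up the Unicode name of the character that &#982; refers to.
pi_name = unicodedata.name(pi_symbol)
```

Because 982 exceeds 255, the second reference can only be resolved by treating the number as a Unicode code point, which is the behavior the paragraph above attributes to HTML 4.0 browsers.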

This is the first step toward making the Web truly multilingual. The current refinements to character-set handling on the Web are documented in a working draft, the Character Model for the World Wide Web 1.0, published by the W3C (http://www.w3.org/TR/charmod/).

A Unicode Font

Bitstream has created a TrueType font called "Cyberbit" that contains a large percentage of the Unicode character set. It is available only via licensing to developers and is unfortunately no longer offered as a retail product. For more information about Cyberbit, contact Bitstream's developer products department.




Copyright © 2002 O'Reilly & Associates. All rights reserved.