26.3. Other Unicode Blocks
So far we've accounted for
a little over 300 of the more than 90,000 Unicode characters, which
leaves many thousands unaccounted for. No standard entity names exist
for characters outside the ranges defined in XHTML and SGML.
To use those characters, either work in an editor that can produce
them in the appropriate character set or use character
references. Most of the 90,000-plus Unicode characters are Han
ideographs, Hangul syllables, or rarely used characters, so we
list only a few of the most useful blocks later in this chapter. Others
can be found online at http://www.unicode.org/charts/ or in
The Unicode Standard, Version 3.0 by the Unicode
Consortium (Addison-Wesley, 2000).
In the tables that follow, the upper lefthand corner contains the
character's hexadecimal Unicode value, and the upper
righthand corner contains the character's decimal
Unicode value. You can use either value to form a character reference
and so include these characters in element content and attribute
values, even without an editor or fonts that support them.
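As a minimal sketch of how the two values in each table cell turn into character references, the following Python fragment builds both forms for one Latin Extended-A character and checks that a parser resolves them to the same character. The helper name char_refs and the element and attribute names (word, initial) are made up for this illustration; only the standard library's xml.etree.ElementTree module is assumed.

```python
import xml.etree.ElementTree as ET

def char_refs(ch):
    """Return the (hexadecimal, decimal) character references for one character."""
    code_point = ord(ch)
    return f"&#x{code_point:X};", f"&#{code_point};"

# U+015F (decimal 351) is LATIN SMALL LETTER S WITH CEDILLA, from Latin Extended-A.
hex_ref, dec_ref = char_refs("\u015F")

# Build a pure-ASCII document that uses the hexadecimal form in an attribute
# value and the decimal form in element content. The names "word" and
# "initial" are hypothetical.
doc = f'<word initial="{hex_ref}">{dec_ref}</word>'
root = ET.fromstring(doc)

print(doc)                       # the ASCII-only source text
print(root.get("initial") == root.text)  # both references name the same character
```

Because the document text itself contains nothing but ASCII, it can be typed in any editor and saved in almost any encoding; the parser does the work of turning each reference back into the character.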
26.3.1. Latin Extended-A
The 128 characters in the Latin
Extended-A block of Unicode are used in conjunction with the normal
ASCII and Latin-1 characters. They cover most European Latin letters
missing from Latin-1. The block includes various characters
you'll find in the upper halves of the other
ISO-8859 Latin character sets, including ISO-8859-2, ISO-8859-3,
ISO-8859-4, and ISO-8859-9. When combined with ASCII and Latin-1,
this block lets you write Afrikaans, Basque, Breton, Catalan,
Croatian, Czech, Esperanto, Estonian, French, Frisian, Greenlandic,
Hungarian, Latvian, Lithuanian, Maltese, Polish,
Provençal, Rhaeto-Romanic, Romanian, Romany, Sami, Slovak,
Slovenian, Sorbian, Turkish, and Welsh. See Figure 26-7.
Figure 26-7. Unicode's Latin Extended-A block
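To make the idea of combining this block with ASCII and Latin-1 concrete, here is a small Python sketch that checks whether a piece of text stays within those three ranges. It assumes the block occupies code points U+0100 through U+017F, which is consistent with the 128 characters mentioned above; the function names are invented for the example.

```python
# Latin Extended-A: U+0100 through U+017F, 128 code points.
LATIN_EXTENDED_A = range(0x0100, 0x0180)

def covered_through_extended_a(text):
    """True if every character is in ASCII, Latin-1, or Latin Extended-A."""
    return all(ord(ch) <= 0x017F for ch in text)

def extended_a_chars(text):
    """Return the characters of text that come from Latin Extended-A itself."""
    return [ch for ch in text if ord(ch) in LATIN_EXTENDED_A]

# The Polish city name Lodz in its native spelling: the o-acute comes from
# Latin-1, while the L-stroke and z-acute come from Latin Extended-A.
polish = "\u0141\u00F3d\u017A"

print(covered_through_extended_a(polish))
print([f"U+{ord(ch):04X}" for ch in extended_a_chars(polish)])
```

A check like this only tells you which code points a text uses. It says nothing about whether a given font or output encoding can actually display or store them.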
26.3.7. Cyrillic
While the
Cyrillic script shown in Figure 26-13 is most
familiar to Western readers from its use for Russian,
it's also used for other Slavic languages,
including Serbian, Ukrainian, and Byelorussian, and for many
non-Slavic languages of the former Soviet Union, such as Azerbaijani,
Tuvan, and Ossetian. Indeed, many characters in this block are not
actually found in Russian, but exist only in other languages written
in the Cyrillic script. Since the breakup of the Soviet Union,
some non-Slavic languages, such as Moldavian and Azerbaijani, have been
reverting to Latin-derived scripts.
Figure 26-13. The Cyrillic block of Unicode
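The point that much of this block lies outside Russian can be illustrated with a short Python sketch. It assumes the block spans U+0400 through U+04FF (256 code points, not all of them necessarily assigned) and that the modern Russian alphabet consists of U+0410 through U+044F plus the letter yo in both cases; the names used are invented for the example.

```python
# The Cyrillic block: U+0400 through U+04FF.
CYRILLIC_BLOCK = range(0x0400, 0x0500)

# The modern Russian alphabet: U+0410 through U+044F, plus capital and
# small yo (U+0401 and U+0451), for 33 letters in each case.
RUSSIAN = {chr(cp) for cp in range(0x0410, 0x0450)} | {"\u0401", "\u0451"}

def is_non_russian_cyrillic(ch):
    """True for a character in the Cyrillic block that Russian does not use."""
    return ord(ch) in CYRILLIC_BLOCK and ch not in RUSSIAN

# The Ukrainian spelling of Kyiv uses U+0457 (small letter yi),
# which does not occur in Russian.
kyiv = "\u041A\u0438\u0457\u0432"

print(len(RUSSIAN), "of", len(CYRILLIC_BLOCK), "code points are Russian letters")
print([f"U+{ord(ch):04X}" for ch in kyiv if is_non_russian_cyrillic(ch)])
```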
26.3.11. Devanagari
The Devanagari
script is used for many languages of the Indian
subcontinent, including Awadhi, Bagheli, Bhatneri, Bhili, Bihari,
Braj Bhasa, Chhattisgarhi, Garhwali, Gondi, Harauti, Hindi, Ho,
Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh,
Marwari, Mundari, Newari, Palpa, and Santali. It's
also used for the classical language Sanskrit. See Figure 26-17.
Figure 26-17. The Devanagari block of Unicode
Copyright © 2002 O'Reilly & Associates. All rights reserved.