XML in a Nutshell

26.3. Other Unicode Blocks

So far we've accounted for a little over 300 of the more than 90,000 Unicode characters; many thousands remain. Outside the ranges defined in XHTML and SGML, standard entity names don't exist. You should either use an editor that can produce the characters you need in the appropriate character set, or use character references. Most of the 90,000-plus Unicode characters are Han ideographs, Hangul syllables, or rarely used characters. However, we list a few of the most useful blocks later in this chapter. Others can be found online at http://www.unicode.org/charts/ or in The Unicode Standard, Version 3.0 by the Unicode Consortium (Addison-Wesley, 2000).

In the tables that follow, the upper lefthand corner contains the character's hexadecimal Unicode value, and the upper righthand corner contains the character's decimal Unicode value. You can use either value to form a character reference, which lets you include these characters in element content and attribute values even without an editor or fonts that support them.
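As a rough illustration of how the two values map onto character references, the following Python sketch builds both forms for one code point and checks that an XML parser resolves them to the same character. The helper name char_refs is our own invention, and the Greek small letter alpha (0x3B1, decimal 945) is just a convenient example.

```python
import xml.etree.ElementTree as ET


def char_refs(code_point):
    """Return the (hexadecimal, decimal) character references for a code point."""
    return f"&#x{code_point:X};", f"&#{code_point};"


# Greek small letter alpha: hexadecimal 3B1, decimal 945
hex_ref, dec_ref = char_refs(0x3B1)
print(hex_ref, dec_ref)

# Either form may appear in element content or in an attribute value.
doc = f"<p symbol='{dec_ref}'>{hex_ref}</p>"
elem = ET.fromstring(doc)

# The parser resolves both references to the same character.
print(elem.text == elem.get("symbol"))
```

The document itself stays pure ASCII, so it can be typed and saved in any editor regardless of which fonts are installed.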

26.3.4. Spacing Modifier Letters

The Spacing Modifier Letters block, shown in Figure 26-10, includes characters from multiple languages and scripts that modify the preceding or following character, generally by changing its pronunciation.

Figure 26-10. The Spacing Modifier Letters block of Unicode

26.3.6. Greek and Coptic

The Greek block of Unicode is used primarily for the modern Greek language. Currently, it's the only option for the Greek-derived Coptic script, but it doesn't serve that purpose very well, and a separate Coptic block is a likely addition in the future. Extending coverage to classical and Byzantine Greek requires many more accented characters, which are available in the Greek Extended block, shown in Figure 26-22, or by combining these characters with the Combining Diacritical Marks in Figure 26-11. The Greek alphabet is also a fertile source of mathematical and scientific notation, though some common letters are also encoded separately, for use as mathematical symbols, in the Mathematical Operators block in Figure 26-27 and the Mathematical Alphanumeric Symbols block in Figure 26-28. The Greek and Coptic block of Unicode is shown in Figure 26-12.
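The two routes to an accented Greek letter can be compared with a short Python sketch. It assumes Python's standard unicodedata module, which reflects whatever Unicode version ships with the interpreter rather than the version described here, and it picks alpha with psili purely as an example.

```python
import unicodedata

alpha = "\u03B1"  # GREEK SMALL LETTER ALPHA, from the Greek and Coptic block
psili = "\u0313"  # COMBINING COMMA ABOVE, from Combining Diacritical Marks

# Route 1: base letter followed by a combining mark (two code points)
combined = alpha + psili

# Route 2: the single precomposed character from the Greek Extended block.
# NFC normalization composes the pair into that character.
precomposed = unicodedata.normalize("NFC", combined)

print(len(combined), len(precomposed))
print(f"{ord(precomposed):04X}", unicodedata.name(precomposed))
```

Because the two spellings are canonically equivalent, software that compares strings generally needs to normalize both sides first.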

Figure 26-12. The Greek and Coptic block of Unicode

26.3.8. Armenian

The Armenian script shown in Figure 26-14 is used for writing the Armenian language, currently spoken by about seven million people around the world.

Figure 26-14. The Armenian block of Unicode

26.3.10. Arabic

The Arabic script shown in Figure 26-16 is used for many languages besides Arabic, including Kurdish, Pashto, Persian, Sindhi, and Urdu. Turkish was also written in the Arabic script until early in the twentieth century when Turkey converted to a modified Latin alphabet.

Figure 26-16. The Arabic block of Unicode

26.3.16. Greek Extended

The Greek Extended block shown in Figure 26-22 contains mostly archaic letters and accented letters that are used in classical and Byzantine Greek, but not in modern Greek.

Figure 26-22. The Greek Extended block of Unicode

26.3.17. General Punctuation

The General Punctuation block shown in Figure 26-23 contains punctuation characters used across a variety of languages and scripts that are not already encoded in Latin-1. Characters 0x2000 through 0x200B are spaces of varying widths, ranging from zero width (0x200B) to one em (0x2001 and 0x2003). Characters 0x200C through 0x200F and 0x206A through 0x206F are nonprinting format characters with no graphical representation.
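Since none of these characters can be told apart by eye, it can help to list them by name. The sketch below uses Python's standard unicodedata module to print the code point, general category, and name for the start of the block. The describe helper is our own, and the categories reported come from the Unicode version bundled with your Python, which may differ in detail from the version described in this chapter.

```python
import unicodedata


def describe(first, last):
    """Return (code point, general category, name) for each character in a range."""
    return [
        (cp, unicodedata.category(chr(cp)), unicodedata.name(chr(cp)))
        for cp in range(first, last + 1)
    ]


# 0x2000-0x200B are the spaces; 0x200C-0x200F are format characters.
rows = describe(0x2000, 0x200F)
for cp, category, name in rows:
    print(f"0x{cp:04X}  {category}  {name}")
```

In the output, category Zs marks a space separator and Cf marks a format character.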

Figure 26-23. The General Punctuation block of Unicode

26.3.21. Mathematical Operators

The Mathematical Operators block shown in Figure 26-27 contains a wide variety of symbols used in higher mathematics. A few of these symbols superficially resemble letters in other blocks. For instance, in most fonts character 0x2206, ∆ (the increment operator), is virtually identical to the Greek capital letter delta, 0x394. However, using the characters in this block is preferable for mathematical expressions, because it allows software to distinguish between letters and mathematical symbols. Fonts may use the same glyph to represent different code points in cases like this.
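To see what "distinguish between letters and mathematical symbols" means in practice, here is a small Python sketch that compares the increment operator with the Greek capital delta. It relies only on the standard unicodedata module and is meant as an illustration, not as a recommendation of any particular tool.

```python
import unicodedata

increment = "\u2206"  # from the Mathematical Operators block
delta = "\u0394"      # from the Greek and Coptic block

for ch in (increment, delta):
    # General category Sm is a math symbol; Lu is an uppercase letter.
    print(f"0x{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

# The glyphs may look alike, but the code points never compare equal,
# and only the Greek delta counts as a letter.
print(increment == delta)
print(increment.isalpha(), delta.isalpha())
```

A search, sort, or spell-check routine can therefore treat the two differently even when a font draws them identically.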

Figure 26-27. The Mathematical Operators block of Unicode

Unicode 3.1 adds one more block, Mathematical Alphanumeric Symbols, in Plane 1 between 0x1D400 and 0x1D7FF, as shown in Figure 26-28. Most of these characters repeat the ASCII and Greek letters and digits in what would normally be considered font variations. For instance, 0x1D400 is mathematical bold capital A. The justification is that, when used in an equation, they really aren't the same characters as the equivalent glyphs in text.
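The relationship between these characters and their plain-text counterparts can be checked with a brief Python sketch, again assuming the standard unicodedata module. It looks at 0x1D400, the example mentioned above, and also shows the character reference you would write in an XML document.

```python
import unicodedata

bold_a = chr(0x1D400)

# The character has its own name and code point...
print(unicodedata.name(bold_a))

# ...but compatibility normalization (NFKC) folds it back to the ASCII letter,
# which is what "font variation" amounts to in Unicode's own data.
print(unicodedata.normalize("NFKC", bold_a))

# Plane 1 characters lie beyond 0xFFFF, so UTF-16 needs two code units for one.
print(len(bold_a.encode("utf-16-be")) // 2)

# A hexadecimal character reference works the same way as for any other character.
print(f"&#x{ord(bold_a):X};")
```

Canonical normalization (NFC or NFD) leaves the character alone, so the distinction survives unless software deliberately asks for the compatibility form.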

Figure 26-28. The Mathematical Alphanumeric Symbols block of Unicode

26.3.23. Optical Character Recognition

The Optical Character Recognition (OCR) block shown in Figure 26-30 includes the OCR-A characters that are not already encoded in ASCII, as well as the magnetic-ink character-recognition symbols used on checks.

Figure 26-30. The Optical Character Recognition block of Unicode



Copyright © 2002 O'Reilly & Associates. All rights reserved.