home | O'Reilly's CD bookshelfs | FreeBSD | Linux | Cisco | Cisco Exam  


Perl CookbookPerl CookbookSearch this book

1.9. Canonicalizing Strings with Unicode Combined Characters

1.9.3. Discussion

The same character sequence can sometimes be specified in multiple ways. Sometimes this is because of legacy encodings, such as the letters from Latin1 that contain diacritical marks. These can be specified directly with a single character (like U+00E7, LATIN SMALL LETTER C WITH CEDILLA) or indirectly via the base character (like U+0063, LATIN SMALL LETTER C) followed by a combining character (U+0327, COMBINING CEDILLA).

Another possibility is that you have two or more marks following a base character, but the order of those marks varies in your data. Imagine you wanted the letter "c" to have both a cedilla and a caron on top of it in order to print a Figure . That could be specified in any of these ways:

$string = v231.780;
#   LATIN SMALL LETTER C WITH CEDILLA
#   COMBINING CARON

$string = v99.807.780;
#         LATIN SMALL LETTER C
#         COMBINING CARON
#         COMBINING CEDILLA

$string = v99.780.807
#         LATIN SMALL LETTER C
#         COMBINING CEDILLA
#         COMBINING CARON

The normalization functions rearrange those into a reliable ordering. Several are provided, including NFD( ) for canonical decomposition and NFC( ) for canonical decomposition followed by canonical composition. No matter which of these three ways you used to specify your Figure , the NFD version is v99.807.780, whereas the NFC version is v231.780.

Sometimes you may prefer NFKD( ) and NFKC( ), which are like the previous two functions except that they perform compatible decomposition, which for NFKC( ) is then followed by canonical composition. For example, \x{FB00} is the double-f ligature. Its NFD and NFC forms are the same thing, "\x{FB00}", but its NFKD and NFKC forms return a two-character string, "\x{66}\x{66}".

1.9.4. See Also

The Universal Character Code section at the beginning of this chapter; the documentation for the Unicode::Normalize module; Recipe 8.20



Library Navigation Links

Copyright © 2003 O'Reilly & Associates. All rights reserved.