For example, the word "façade" can be written with one
character between the two a's, "\x{E7}", a
character right out of Latin1 (ISO 8859-1). That character might be
encoded as a two-byte sequence under the UTF-8 encoding that Perl
uses internally, but those two bytes still count as a single
character. That works just fine.
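As a minimal sketch of the character-versus-byte distinction, the following compares the length of the string with the length of an explicitly UTF-8 encoded copy. The variable names are illustrative, and the core Encode module stands in for Perl's internal representation, which you normally never look at directly.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $word  = "fa\x{E7}ade";            # c-cedilla as the single code point U+00E7
my $bytes = encode("UTF-8", $word);   # explicit UTF-8 encoded byte string

my $char_count = length $word;        # length counts characters here
my $byte_count = length $bytes;       # and bytes here, since $bytes is a byte string

printf "%d characters, %d bytes\n", $char_count, $byte_count;
# prints "6 characters, 7 bytes"
```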
There's a thornier issue. Another way to write U+00E7 is with two
different code points: a regular "c" followed by
"\x{0327}". Code point U+0327 is a non-spacing
combining character that means to go back and put a cedilla
underneath the preceding base character.
There are times when you want Perl to treat each combined character
sequence as one logical character. But because they're distinct code
points, Perl's character-related operations treat non-spacing
combining characters as separate characters. That includes
substr, length, and regular
expression metacharacters, such as the ones in /./ or
/[^abc]/.
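A short sketch makes the difference visible. It builds "façade" both ways and applies length, substr, and /./ to each; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $precomposed = "fa\x{E7}ade";      # c-cedilla as one code point, U+00E7
my $decomposed  = "fac\x{327}ade";    # plain "c" followed by combining U+0327

my $len_pre = length $precomposed;       # 6
my $len_dec = length $decomposed;        # 7: the cedilla counts on its own
my $third   = substr($decomposed, 2, 1); # just "c", the cedilla is left behind
my @dots    = $decomposed =~ /./g;       # /./ matches the cedilla separately

printf "precomposed length: %d\n", $len_pre;
printf "decomposed length:  %d\n", $len_dec;
printf "third character:    U+%04X\n", ord $third;
printf "matches of /./:     %d\n", scalar @dots;
```

The two strings look identical when displayed, yet every one of these operations sees an extra character in the decomposed form.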
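In a regular expression, the \X metacharacter matches an extended
Unicode combining character sequence. It is roughly equivalent to
(?:\PM\pM*), or in longhand:
(?x:        # begin non-capturing group
    \PM     # one character without the M (mark) property,
            # such as a letter
    \pM     # one character that does have the M (mark) property,
            # such as an accent mark
    *       # and you can have as many marks as you want
)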
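To check that the shorthand and the longhand agree on the kind of data used in this recipe, the sketch below splits two sample strings both ways and compares the pieces. Treat it as an illustration for simple base-plus-mark sequences only: in recent Perls \X matches an extended grapheme cluster, so the two patterns can differ on input such as a CR/LF pair.

```perl
use strict;
use warnings;

my @words = ("anne\x{301}e", "nin\x{303}o");   # sample words with combining marks
my @agree;                                     # one true/false entry per word
my @counts;                                    # number of sequences per word

for my $word (@words) {
    my @short = $word =~ /\X/g;                # shorthand
    my @long  = $word =~ /(?:\PM\pM*)/g;       # longhand: one non-mark, then marks
    push @agree,  join("|", @short) eq join("|", @long) ? 1 : 0;
    push @counts, scalar @short;
    printf "%d code points, %d sequences\n", length $word, scalar @short;
}
# prints "6 code points, 5 sequences" and then "5 code points, 4 sequences"
```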
Otherwise simple operations become tricky if these beasties are in
your string. Consider the approaches for reversing a word by
character from the previous recipe. Written with combining
characters, "année" and
"niño" can be expressed in Perl as
"anne\x{301}e" and
"nin\x{303}o".
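binmode(STDOUT, ":utf8");   # emit the wide characters as UTF-8
for my $word ("anne\x{301}e", "nin\x{303}o") {
    # plain reverse works on individual code points
    printf "%s simple reversed to %s\n", $word,
        scalar reverse $word;
    # /\X/g keeps each base character together with its marks
    printf "%s better reversed to %s\n", $word,
        join("", reverse $word =~ /\X/g);
}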
That produces:
année simple reversed to éenna
année better reversed to eénna
niño simple reversed to õnin
niño better reversed to oñin
In the lines labeled "simple reversed", the diacritical mark
jumped from one base character to another. That's because a
combining character always follows its base character, and you've
reversed the whole string, so each mark now follows a different base.
By grabbing entire sequences of a base character plus any combining
characters that follow, and then reversing that list, you avoid the
problem.
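The same list-of-sequences idea can be applied to length and substr. The helpers below are hypothetical (glength and gsubstr are names invented here, not part of Perl or any module), and the sketch assumes a non-negative offset and count.

```perl
use strict;
use warnings;

# Count combined character sequences rather than code points.
sub glength {
    my ($str) = @_;
    my $count = () = $str =~ /\X/g;   # list assignment in scalar context counts matches
    return $count;
}

# Return $count sequences starting at sequence number $offset (0-based).
sub gsubstr {
    my ($str, $offset, $count) = @_;
    my @clusters = $str =~ /\X/g;
    my $last = $offset + $count - 1;
    $last = $#clusters if $last > $#clusters;   # clamp to the end of the string
    return join "", @clusters[$offset .. $last];
}

my $word   = "anne\x{301}e";
my $length = glength($word);          # 5, where length($word) gives 6
my $fourth = gsubstr($word, 3, 1);    # "e" together with its acute accent

printf "%d sequences; the fourth has %d code points\n",
    $length, length $fourth;
# prints "5 sequences; the fourth has 2 code points"
```

Splitting the whole string on every call is wasteful for long strings. If you need many such operations, split once and work on the resulting list.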