6.18. Matching Multiple-Byte Characters

Problem

You need to perform regular-expression searches against multiple-byte characters.

A character encoding is a mapping from characters and symbols to digital representations. ASCII is an encoding where each character is represented as exactly one byte, but complex writing systems, such as those for Chinese, Japanese, and Korean, have so many characters that their encodings need to use multiple bytes to represent characters.

Perl works on the principle that each byte represents a single character, which works well in ASCII but makes regular expression matches on strings containing multiple-byte characters tricky, to say the least. The regular expression engine does not understand the character boundaries in your string of bytes, and so can return "matches" from the middle of one character to the middle of another.

Solution

Exploit the encoding by tailoring the pattern to the sequences of bytes that constitute characters. The basic approach is to build a pattern that matches a single (multiple-byte) character in the encoding, and then use that "any character" pattern in larger patterns.

Discussion
As an example, we'll examine one of the encodings for Japanese, called EUC-JP, and then show how we use this in solving a number of multiple-byte encoding issues. EUC-JP can represent thousands of characters, but it's basically a superset of ASCII. Bytes with values ranging from 0 to 127 (
We can convey this information - what bytes can make up characters in this encoding - as a regular expression. For ease of use later, here we'll define a string:

    my $eucjp = q{                        # EUC-JP encoding subcomponents:
          [\x00-\x7F]                     # ASCII/JIS-Roman (one-byte/character)
        | \x8E[\xA0-\xDF]                 # half-width katakana (two bytes/char)
        | \x8F[\xA1-\xFE][\xA1-\xFE]      # JIS X 0212-1990 (three bytes/char)
        | [\xA1-\xFE][\xA1-\xFE]          # JIS X 0208:1997 (two bytes/char)
    };
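As a quick check of the template, the following sketch reports how many bytes the first character of a string occupies. The helper name first_char_length and the sample byte strings are illustrative assumptions, not part of the recipe.

```perl
use strict;
use warnings;

# Template from the recipe: matches exactly one EUC-JP character.
my $eucjp = q{                        # EUC-JP encoding subcomponents:
      [\x00-\x7F]                     # ASCII/JIS-Roman (one-byte/character)
    | \x8E[\xA0-\xDF]                 # half-width katakana (two bytes/char)
    | \x8F[\xA1-\xFE][\xA1-\xFE]      # JIS X 0212-1990 (three bytes/char)
    | [\xA1-\xFE][\xA1-\xFE]          # JIS X 0208:1997 (two bytes/char)
};

# Hypothetical helper: byte length of the first character in $bytes,
# or 0 if the string does not start with a valid EUC-JP character.
sub first_char_length {
    my ($bytes) = @_;
    return $bytes =~ /^($eucjp)/x ? length($1) : 0;
}

# Sample inputs (illustrative byte strings, one per alternative).
for my $sample ("A", "\x8E\xB1", "\x8F\xB0\xA1", "\xC5\xEC", "\xFF") {
    printf "%s => %d byte(s)\n", unpack("H*", $sample), first_char_length($sample);
}
```

The parentheses around the interpolated template keep its top-level alternation from swallowing the ^ anchor.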
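(Because we've inserted comments and whitespace for pretty-printing, we'll have to use the /x modifier whenever we use this template in a pattern.)

With this template in hand, the following sections show how to:

    Avoid false matches
    Split multiple-byte strings into characters
    Validate multiple-byte strings
    Convert between encodings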
All the examples are shown using EUC-JP as the encoding of interest, but they will work with any of the many multiple-byte encodings commonly used for text processing, such as Unicode, Big-5, etc.

Avoiding false matches

A false match is where the regular expression engine finds a match that begins in the middle of a multiple-byte character sequence. We can get around the problem by carefully controlling the match, ensuring that the pattern matching engine stays synchronized with the character boundaries at all times.
This can be done by anchoring the match to the start of the string, then manually bypassing characters ourselves when the real match can't happen at the current location. With the EUC-JP example, the "bypassing characters" part is (?: $eucjp )*?, as in:

    /^ (?: $eucjp )*? \xC5\xEC\xB5\xFE/ox    # Trying to find Tokyo
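To see a false match concretely, the sketch below builds a string of three two-byte characters whose inner bytes happen to spell out the Tokyo byte sequence. The sample string and the variable names other than $eucjp are assumptions made for illustration.

```perl
use strict;
use warnings;

# Template from the recipe: matches exactly one EUC-JP character.
my $eucjp = q{
      [\x00-\x7F]                     # ASCII/JIS-Roman
    | \x8E[\xA0-\xDF]                 # half-width katakana
    | \x8F[\xA1-\xFE][\xA1-\xFE]      # JIS X 0212-1990
    | [\xA1-\xFE][\xA1-\xFE]          # JIS X 0208:1997
};

my $tokyo = "\xC5\xEC\xB5\xFE";       # byte sequence used in the recipe

# Illustrative input: the characters A4C5, ECB5, FEA1. Each pair is valid
# per the template, but bytes 2 through 5 look like the Tokyo sequence.
my $text = "\xA4\xC5\xEC\xB5\xFE\xA1";

# Unanchored byte search: matches across character boundaries.
my $naive  = $text =~ /\xC5\xEC\xB5\xFE/ ? 1 : 0;

# Anchored search that skips whole characters: stays on boundaries.
my $synced = $text =~ /^ (?: $eucjp )*? \xC5\xEC\xB5\xFE/x ? 1 : 0;

# The anchored search still finds a real occurrence later in the string.
my $real   = ($text . $tokyo) =~ /^ (?: $eucjp )*? \xC5\xEC\xB5\xFE/x ? 1 : 0;

print "naive=$naive synced=$synced real=$real\n";   # naive=1 synced=0 real=1
```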
In the EUC-JP encoding, the Japanese word for Tokyo is written with two characters, the first encoded by the two bytes \xC5\xEC and the second by the two bytes \xB5\xFE.
Don't forget to use the
Use in a replacement is similar, but since the text leading to the real match is also part of the overall match, we must capture it with parentheses, being sure to include it in the replacement text. Assuming that $Tokyo and $Osaka hold the byte sequences for those two words:

    s/^ ( (?:$eucjp)*? ) $Tokyo/$1$Osaka/ox
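The replacement can be tried end to end with the sketch below, which covers both a single anchored substitution and a global one that uses \G in place of ^. The Osaka byte values are an assumption (believed to be its EUC-JP encoding, but treat them as illustrative), the sample strings are made up, and \Q...\E is added here as a precaution so that no byte in the search word is treated specially under /x.

```perl
use strict;
use warnings;

# Template from the recipe: matches exactly one EUC-JP character.
my $eucjp = q{
      [\x00-\x7F]                     # ASCII/JIS-Roman
    | \x8E[\xA0-\xDF]                 # half-width katakana
    | \x8F[\xA1-\xFE][\xA1-\xFE]      # JIS X 0212-1990
    | [\xA1-\xFE][\xA1-\xFE]          # JIS X 0208:1997
};

my $Tokyo = "\xC5\xEC\xB5\xFE";       # from the recipe
my $Osaka = "\xC2\xE7\xBA\xE5";       # assumed bytes, for illustration only

# Three characters that contain a false "Tokyo", then a real one.
my $decoy = "\xA4\xC5\xEC\xB5\xFE\xA1";
my $text  = $decoy . $Tokyo;

# Naive substitution: rewrites the false match and corrupts the text.
(my $naive = $text) =~ s/\Q$Tokyo\E/$Osaka/;

# Anchored substitution: $1 carries the skipped characters through.
(my $fixed = $text) =~ s/^ ( (?:$eucjp)*? ) \Q$Tokyo\E/$1$Osaka/x;

# Global variant: \G restarts each search where the previous match ended.
my $many = $Tokyo . "AB" . $Tokyo . $decoy;
(my $all = $many) =~ s/\G ( (?:$eucjp)*? ) \Q$Tokyo\E/$1$Osaka/gx;

print unpack("H*", $_), "\n" for $naive, $fixed, $all;
```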
If used with /g, anchor each match with \G (the point where the previous match ended) instead of ^:

    s/\G ( (?:$eucjp)*? ) $Tokyo/$1$Osaka/gox

Splitting multiple-byte strings
Another common task is to split an input string into its individual characters. With a one-byte-per-character encoding, you can simply split the string between every byte; with a multiple-byte encoding, match against the template instead:

    @chars = /$eucjp/gox;        # One character per list element
Now, each element of @chars holds one character, which a loop can process in turn:

    while (<>) {
        my @chars = /$eucjp/gox;        # One character per list element
        for my $char (@chars) {
            if (length($char) == 1) {
                # Do something interesting with this one-byte character
            } else {
                # Do something interesting with this multiple-byte character
            }
        }
        my $line = join("", @chars);    # Glue list back together
        print $line;
    }
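One way the loop above might be filled in is sketched here: it uppercases the one-byte characters and leaves every multiple-byte character untouched. The helper names upcase_ascii and char_count are hypothetical, and uppercasing merely stands in for "something interesting".

```perl
use strict;
use warnings;

# Template from the recipe: matches exactly one EUC-JP character.
my $eucjp = q{
      [\x00-\x7F]                     # ASCII/JIS-Roman
    | \x8E[\xA0-\xDF]                 # half-width katakana
    | \x8F[\xA1-\xFE][\xA1-\xFE]      # JIS X 0212-1990
    | [\xA1-\xFE][\xA1-\xFE]          # JIS X 0208:1997
};

# Hypothetical helper: number of characters (not bytes) in $line.
sub char_count {
    my ($line) = @_;
    my @chars = $line =~ /$eucjp/gx;
    return scalar @chars;
}

# Hypothetical helper: uppercase one-byte characters only.
sub upcase_ascii {
    my ($line) = @_;
    my @chars = $line =~ /$eucjp/gx;    # one character per list element
    for my $char (@chars) {             # $char aliases the list element
        $char = uc $char if length($char) == 1;
    }
    return join "", @chars;             # glue list back together
}

my $line = "tokyo: \xC5\xEC\xB5\xFE\n";
printf "%d bytes, %d characters\n", length($line), char_count($line);
print unpack("H*", upcase_ascii($line)), "\n";
```

Note that bytes fitting none of the template's alternatives are silently skipped by the /g match, so this sketch assumes the input is well formed.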
In the two "do something interesting" parts, any change to $char is reflected in $line when @chars is glued back together.

Validating multiple-byte strings
The use of
One way to address this is to use

A better approach to confirm that a string is valid with respect to an encoding is to use something like:

    $is_eucjp = m/^(?:$eucjp)*$/xo;

If a string has only valid characters from start to end, you know the string as a whole is valid.
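A minimal sketch of this validation, wrapped in a function so it can be tried on a few inputs. The name is_eucjp and the sample byte strings are illustrative assumptions.

```perl
use strict;
use warnings;

# Template from the recipe: matches exactly one EUC-JP character.
my $eucjp = q{
      [\x00-\x7F]                     # ASCII/JIS-Roman
    | \x8E[\xA0-\xDF]                 # half-width katakana
    | \x8F[\xA1-\xFE][\xA1-\xFE]      # JIS X 0212-1990
    | [\xA1-\xFE][\xA1-\xFE]          # JIS X 0208:1997
};

# Hypothetical helper: 1 if $bytes consists solely of valid EUC-JP
# characters from start to end, 0 otherwise.
sub is_eucjp {
    my ($bytes) = @_;
    return $bytes =~ m/^(?:$eucjp)*$/x ? 1 : 0;
}

my %samples = (
    "ASCII plus two-byte text" => "Tokyo \xC5\xEC\xB5\xFE",
    "truncated character"      => "\xC5\xEC\xB5",
    "bad katakana trail byte"  => "\x8E\x41",
);
for my $label (sort keys %samples) {
    printf "%-26s %s\n", $label, is_eucjp($samples{$label}) ? "valid" : "invalid";
}
```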
There is one potential for a problem, and that's due to how the end-of-string metacharacter
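In Perl, $ is true at the very end of the string and also just before a string-ending newline, whereas \z is true only at the very end. The sketch below shows the difference using a deliberately tiny, hypothetical template ($printable, not part of the recipe) in which a newline is not a valid character. With the EUC-JP template the distinction does not change the result, because a newline is itself matched by the one-byte alternative.

```perl
use strict;
use warnings;

# Hypothetical one-byte "encoding" for illustration: printable ASCII only,
# so a newline is NOT a valid character here.
my $printable = q{ [\x20-\x7E] };

my $with_newline = "abc\n";

# $ also matches just before the trailing newline, so this succeeds
# even though the newline itself was never validated.
my $dollar_ok = $with_newline =~ m/^(?:$printable)*$/x ? 1 : 0;

# \z matches only at the true end of the string, so this fails.
my $z_ok      = $with_newline =~ m/^(?:$printable)*\z/x ? 1 : 0;

# Without the newline, both forms agree.
my $plain_ok  = "abc" =~ m/^(?:$printable)*\z/x ? 1 : 0;

print "dollar=$dollar_ok z=$z_ok plain=$plain_ok\n";   # dollar=1 z=0 plain=1
```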
You can use the basic validation technique to detect which encoding is being used. For example, Japanese is commonly encoded with either EUC-JP or another encoding called Shift-JIS. If you've set up templates for both, as with $eucjp and $sjis, you can test a string against each:

    $is_eucjp = m/^(?:$eucjp)*$/xo;
    $is_sjis  = m/^(?:$sjis)*$/xo;

If both are true, the text is likely ASCII (since, essentially, ASCII is a sub-component of both encodings). (It's not quite foolproof, though, since some strings with multiple-byte characters might appear to be valid in both encodings. In such a case, automatic detection becomes impossible, although one might use character-frequency data to make an educated guess.)

Converting between encodings

Converting from one encoding to another can be as simple as an extension of the process-each-character routine above. Conversions for some closely related encodings can be done by a simple mathematical computation on the bytes, while others might require huge mapping tables. In either case, you insert the code at the "do something interesting" points in the routine.
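As an example of a conversion that is pure arithmetic, a two-byte JIS X 0208 character in EUC-JP is the corresponding 7-bit JIS code with the high bit of each byte set. The sketch below clears that bit. The helper name euc_to_jis_pair is hypothetical, and the sketch covers only the byte mapping: a complete ISO-2022-JP stream also needs escape sequences to switch character sets, which are omitted here.

```perl
use strict;
use warnings;

# Hypothetical helper: map one two-byte EUC-JP character (JIS X 0208)
# to its 7-bit JIS byte pair by clearing the high bit of each byte.
# Returns undef for anything that is not such a two-byte character.
sub euc_to_jis_pair {
    my ($char) = @_;
    return unless $char =~ /\A[\xA1-\xFE]{2}\z/;
    return join "", map { chr(ord($_) & 0x7F) } split //, $char;
}

# The first character of the recipe's Tokyo example, C5 EC, becomes 45 6C.
my $jis = euc_to_jis_pair("\xC5\xEC");
print unpack("H*", $jis), "\n";
```

Inside the process-each-character loop, a call like this would sit in the branch that handles multiple-byte characters.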
Here's an example to convert from EUC-JP to Unicode, using a %euc2uni hash as the mapping table:

    while (<>) {
        my @chars = /$eucjp/gox;        # One character per list element
        for my $euc (@chars) {
            my $uni = $euc2uni{$euc};
            if (defined $uni) {
                $euc = $uni;
            } else {
                ## deal with unknown EUC->Unicode mapping here.
            }
        }
        my $line = join("", @chars);
        print $line;
    }

The topic of multiple-byte matching and processing is of particular importance when dealing with Unicode, which has a variety of possible representations. UCS-2 and UCS-4 are fixed-length encodings. UTF-8 is a variable-length encoding; it was originally defined with one- through six-byte sequences, and the current standard restricts it to at most four bytes. UTF-16, which represents the most common instance of Unicode encoding, is a variable-length 16-bit encoding.

See Also

Jeffrey Friedl's article in Issue 5 of The Perl Journal; CJKV Information Processing, by Ken Lunde, O'Reilly & Associates (due 1999)