4.11. Unicode
Unicode provides a unique number for every character, regardless of
the computing platform, program, or programming language. This matters
because, without a standard such as Unicode, computers would continue
to use different character encodings, many of which would conflict if
used together.
Unicode support was introduced to Perl with Perl 5.6. Although it
still does not conform completely to the Unicode specification,
Unicode support has matured significantly under Perl 5.8. You can now
use Unicode reliably with file I/O and with regular expressions. With
regular expressions, the pattern adapts to the data and automatically
switches to the Unicode character scheme when presented with Unicode
data.
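As a minimal sketch of both points, assuming Perl 5.8 or later with
its I/O layers, the following code writes a string through an
:encoding(UTF-8) layer, reads it back, and matches it with an ordinary
pattern. The in-memory filehandle stands in for a real file, and the
variable names and sample string are illustrative only.

```perl
use strict;
use warnings;

# Sample text: "caf", U+00E9, a space, and U+263A (six characters).
my $text = "caf\x{E9} \x{263A}";

# Write through an encoding layer. The in-memory filehandle stands in
# for a real file, so $buffer receives the encoded bytes.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer)
    or die "Cannot open for writing: $!";
print {$out} $text;
close($out) or die "Cannot close: $!";

# The encoded form is longer than the character count:
# U+00E9 takes two bytes and U+263A takes three.
my $byte_count = length $buffer;    # 9

# Read the bytes back through the same layer to recover characters.
open(my $in, '<:encoding(UTF-8)', \$buffer)
    or die "Cannot open for reading: $!";
my $round_trip = <$in>;
close($in);

# The usual pattern syntax applies; "." matches one whole character.
my ($last_char) = $round_trip =~ /(.)\z/;

printf "%d bytes on disk, %d characters in memory, last is U+%04X\n",
    $byte_count, length($round_trip), ord($last_char);
```

With a real file, you would pass a filename instead of the reference
to $buffer; the layer specification stays the same.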
Perl's Unicode implementation falls into the
following categories:
- I/O
-
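In Perl 5.6, there was no way to mark data that's read from or
written to a file as being Unicode (utf8). Perl 5.8 supports this
through I/O layers: opening a filehandle with the :utf8 or
:encoding(...) layer, or applying one with binmode( ), marks the
data on that filehandle as encoded characters.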
- Regular expressions
-
In Perl 5.6, the determination whether to match Unicode characters
was made when the pattern was compiled, based on whether the pattern
contained Unicode characters, and not when matching happened at
runtime. Under Perl 5.8, the pattern adapts to the data at runtime.
- use utf8
-
The utf8 pragma is still needed to enable a few Unicode features. The
utf8 module, which implements the pragma, supplies tables used for
Unicode support. You must load the pragma explicitly (with use utf8)
to enable recognition of UTF-8 encoded literals and identifiers in
the source text.
- Byte and character semantics
-
As of 5.6.0, Perl uses logically wide characters to represent strings
internally, and this internal representation uses the UTF-8 encoding.
The long-term goal is for Perl to work with characters rather than
bytes, and Perl 5.6 was purposely designed so that programs could
transition from byte semantics to character semantics. Perl switches
to character semantics when it finds that the input data contains
characters on which it can safely operate with UTF-8. You can disable
character semantics by using the bytes pragma, as explained in
Chapter 8, "Standard Modules". Character semantics have the following
effects:
- Strings and patterns may contain characters that have an ordinal
value larger than 255.
- Identifiers within a Perl program may contain Unicode alphanumeric
characters.
- Regular expressions match characters and not bytes.
- Character classes in regular expressions match characters and not
bytes.
- Named Unicode properties and block ranges may be used as character
classes with the \p and \P constructs.
- \X matches any extended Unicode sequence.
- tr// matches characters instead of bytes.
- Case translation operators use the Unicode case translation tables
when provided character input.
- Most operators that deal with positions or lengths in a string switch
to using character positions.
- pack( ) and unpack( ) do not change.
- Bit operators work on characters.
- scalar reverse( ) reverses characters and not bytes.
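Several of these effects can be seen in a short script. This is a
sketch rather than a complete tour, assuming Perl 5.8 or later; the
sample string and variable names are illustrative only, and the
characters are written as \x{...} escapes so that the source file
stays plain ASCII.

```perl
use strict;
use warnings;

# Five characters: "n", U+00E9, U+03B1 (Greek alpha), U+263A, "z".
my $str = "n\x{E9}\x{3B1}\x{263A}z";

# Lengths and positions count characters, not bytes.
my $len   = length $str;           # 5
my $third = substr($str, 2, 1);    # U+03B1

# scalar reverse( ) reverses characters.
my $rev = scalar reverse $str;     # "z", U+263A, U+03B1, U+00E9, "n"

# Case translation uses the Unicode tables: alpha becomes U+0391.
my $upper = uc $third;

# \p matches by Unicode property; U+263A is a symbol, not a letter.
my $letters = () = $str =~ /\p{L}/g;               # 4

# tr// works on characters; here it counts those above U+007F.
my $non_ascii = ($str =~ tr/\x{80}-\x{FFFF}//);    # 3

# \X treats a base character plus its combining marks as one unit.
my @clusters = "e\x{301}x" =~ /(\X)/g;             # 2 elements

# The bytes pragma switches back to byte semantics within its scope,
# exposing the length of the internal UTF-8 representation.
my $byte_len = do { use bytes; length $str };      # 9

print "$len characters, $byte_len bytes internally\n";
```

The last step shows why the distinction matters: the same string
reports a length of 5 under character semantics and 9 under the bytes
pragma.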
Copyright © 2002 O'Reilly & Associates. All rights reserved.