Chapter 15. Unicode

Historically, people made up character sets to reflect what they needed to do in the context of their own culture. Since people of all cultures are naturally lazy, they've tended to include only the symbols they needed, excluding the ones they didn't need. That worked fine as long as we were only communicating with other people of our own culture, but now that we're starting to use the Internet for cross-cultural communication, we're running into problems with the exclusive approach. It's hard enough to figure out how to type accented characters on an American keyboard. How in the world (literally) can one write a multilingual web page?

Unicode is the answer, or at least part of the answer (see also XML). Unicode is an inclusive rather than an exclusive character set. While people can and do haggle over the various details of Unicode (and there are plenty of details to haggle over), the overall intent is to make everyone sufficiently happy[1] with Unicode so that they'll willingly use Unicode as the international medium of exchange for textual data. Nobody is forcing you to use Unicode, just as nobody is forcing you to read this chapter (we hope). People will always be allowed to use their old exclusive character sets within their own culture. But in that case (as we say), portability suffers.

[1] Or in some cases, insufficiently unhappy.

The Law of Conservation of Suffering says that if we reduce the suffering in one place, suffering must increase elsewhere. In the case of Unicode, we must suffer the migration from byte semantics to character semantics. Since, through an accident of history, Perl was invented by an American, Perl has historically confused the notions of bytes and characters. In migrating to Unicode, Perl must somehow unconfuse them.

Paradoxically, by getting Perl itself to unconfuse bytes and characters, we can allow the Perl programmer to confuse them, relying on Perl to keep them straight, just as we allow programmers to confuse numbers and strings and rely on Perl to convert back and forth as necessary. To the extent possible, Perl's approach to Unicode is the same as its approach to everything else: Just Do The Right Thing. Ideally, we'd like to achieve these four Goals:

Goal #1:: Old byte-oriented programs should not spontaneously break on the old byte-oriented data they used to work on.
Goal #2:: Old byte-oriented programs should magically start working on the new character-oriented data when appropriate.
Goal #3:: Programs should run just as fast in the new character-oriented mode as in the old byte-oriented mode.
Goal #4:: Perl should remain one language, rather than forking into a byte-oriented Perl and a character-oriented Perl.

Taken together, these Goals are practically impossible to reach. But we've come remarkably close. Or rather, we're still in the process of coming remarkably close, since this is a work in progress. As Unicode continues to evolve, so will Perl. But our overarching plan is to provide a safe migration path that gets us where we want to go with minimal casualties along the way. How we do that is the subject of the next section.

15.1. Building Character

In releases of Perl prior to 5.6, all strings were viewed as sequences of bytes.[2] In versions 5.6 and later, however, a string may contain characters wider than a byte. We now view strings not as sequences of bytes, but as sequences of numbers in the range 0 .. 2**32-1 (or in the case of 64-bit computers, 0 .. 2**64-1). These numbers represent abstract characters, and the larger the number, the "wider" the character, in some sense; but unlike many languages, Perl is not tied to any particular width of character representation. Perl uses a variable-length encoding (based on UTF-8), so these abstract character numbers may, or may not, be packed one number per byte. Obviously, character number 18,446,744,073,709,551,615 (that is, "\x{ffff_ffff_ffff_ffff}") is never going to fit into a byte (in fact, it takes 13 bytes), but if all the characters in your string are in the range 0..127 decimal, then they are certainly packed one per byte, since UTF-8 is the same as ASCII in the lowest seven bits.

[2] You may prefer to call them "octets"; that's okay, but we think the two words are pretty much synonymous these days, so we'll stick with the blue-collar word.

Perl uses UTF-8 only when it thinks it is beneficial, so if all the characters in your string are in the range 0..255, there's a good chance the characters are all packed in bytes--but in the absence of other knowledge, you can't be sure because internally Perl converts between fixed 8-bit characters and variable-length UTF-8 characters as necessary. The point is, you shouldn't have to worry about it most of the time, because the character semantics are preserved at an abstract level regardless of representation.

In any event, if your string contains any character numbers larger than 255 decimal, the string is certainly stored in UTF-8. More accurately, it is stored in Perl's extended version of UTF-8, which we call utf8, in honor of a pragma by that name, but mostly because it's easier to type. (And because "real" UTF-8 is only allowed to contain character numbers blessed by the Unicode Consortium. Perl's utf8 is allowed to contain any character numbers you need to get your job done. Perl doesn't give a rip whether your character numbers are officially correct or just correct.)

We said you shouldn't worry about it most of the time, but people like to worry anyway. Suppose you use a v-string to represent an IPv4 address:

$locaddr = v127.0.0.1;     # Certainly stored as bytes.
$oreilly = v204.148.40.9;  # Might be stored as bytes or utf8.
$badaddr = v2004.148.40.9; # Certainly stored as utf8.

Everyone can figure out that $badaddr will not work as an IP address. So it's easy to think that if O'Reilly's network address gets forced into a UTF-8 representation, it will no longer work. But the characters in the string are abstract numbers, not bytes. Anything that uses an IPv4 address, such as the gethostbyaddr function, should automatically coerce the abstract character numbers back into a byte representation (and fail on $badaddr).

The interfaces between Perl and the real world have to deal with the details of the representation. To the extent possible, existing interfaces try to do the right thing without your having to tell them what to do. But you do occasionally have to give instructions to some interfaces (such as the open function), and if you write your own interface to the real world, it will need to be either smart enough to figure things out for itself or at least smart enough to follow instructions when you want it to behave differently than it would by default.[3]

[3] On some systems, there may be ways of switching all your interfaces at once. If the -C command-line switch is used, (or the global ${^WIDE_SYSTEM_CALLS} variable is set to 1), all system calls will use the corresponding wide character APIs. (This is currently only implemented on Microsoft Windows.) The current plan of the Linux community is that all interfaces will switch to UTF-8 mode if $ENV{LC_CTYPE} is set to "UTF-8". Other communities may take other approaches. Our mileage may vary.

Since Perl worries about maintaining transparent character semantics within the language itself, the only place you need to worry about byte versus character semantics is in your interfaces. By default, all your old Perl interfaces to the outside world are byte-oriented, so they produce and consume byte-oriented data. That is to say, on the abstract level, all your strings are sequences of numbers in the range 0..255, so if nothing in the program forces them into utf8 representations, your old program continues to work on byte-oriented data just as it did before. So put a check mark by Goal #1 above.

If you want your old program to work on new character-oriented data, you must mark your character-oriented interfaces such that Perl knows to expect character-oriented data from those interfaces. Once you've done this, Perl should automatically do any conversions necessary to preserve the character abstraction. The only difference is that you've introduced some strings into your program that are marked as potentially containing characters higher than 255, so if you perform an operation between a byte string and utf8 string, Perl will internally coerce the byte string into a utf8 string before performing the operation. Typically, utf8 strings are coerced back to byte strings only when you send them to a byte interface, at which point, if the string contains characters larger than 255, you have a problem that can be handled in various ways depending on the interface in question. So you can put a check mark by Goal #2.

Sometimes you want to mix code that understands character semantics with code that has to run with byte semantics, such as I/O code that reads or writes fixed-size blocks. In this case, you may put a use bytes declaration around the byte-oriented code to force it to use byte semantics even on strings marked as utf8 strings. You are then responsible for any necessary conversions. But it's a way of enforcing a stricter local reading of Goal #1, at the expense of a looser global reading of Goal #2.

Goal #3 has largely been achieved, partly by doing lazy conversions between byte and utf8 representations and partly by being sneaky in how we implement potentially slow features of Unicode, such as character property lookups in huge tables.

Goal #4 has been achieved by sacrificing a small amount of interface compatibility in pursuit of the other Goals. By one way of looking at it, we didn't fork into two different Perls; but by another way of looking at it, revision 5.6 of Perl is a forked version of Perl with regard to earlier versions, and we don't expect people to switch from earlier versions until they're sure the new version will do what they want. But that's always the case with new versions, so we'll allow ourselves to put a check mark by Goal #4 as well.

Chapter 15. Unicode

Contents:

15.1. Building Character