Perl Cookbook

8.20. Reading or Writing Unicode from a Filehandle

8.20.3. Discussion

Perl's text manipulation functions handle UTF-8 strings just as well as they do 8-bit data; they just need to know what type of data they're working with. Each string in Perl is internally marked as either UTF-8 or 8-bit data. The encoding(...) layer converts data between various external encodings and the internal UTF-8 within Perl. This is done by way of the Encode module.

In the section on Unicode Support in Perl back in the Introduction to Chapter 1, we explained how under Unicode, every different character has a different code point (i.e., a different number) associated with it. Assigning all characters unique code points solves many problems. No longer does the same number, like 0xC4, represent one character under one character repertoire (e.g., a LATIN CAPITAL LETTER A WITH DIAERESIS under ISO-8859-1) and a different character in another repertoire (e.g., a GREEK CAPITAL LETTER DELTA under ISO-8859-7).

Unique code points still leave one important issue open: the precise format used in memory or on disk for each code point. If most code points fit in 8 bits, it would seem wasteful to use, say, a full 32 bits for each character. But if every character is the same size as every other character, the code is easier to write and may be faster to execute.

This has given rise to different encoding systems for storing Unicode, each offering distinct advantages. Fixed-width encodings fit every code point into the same number of bits, which simplifies programming but at the expense of some wasted space. Variable-width encodings use only as much space as each code point requires, which saves space but complicates programming.

One further complication is combined characters, which may look like single letters on paper but in code require multiple code points. When you see a capital A with two dots above it (a diaeresis) on your screen, it may not even be character U+00C4. As explained in Recipe 1.8, Unicode supports the idea of combining characters, where you start with a base character and add non-spacing marks to it. U+0308 is a "COMBINING DIAERESIS", so you could use a capital A (U+0041) followed by U+0308, or A\x{308} to produce the same output.
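The difference between the two spellings can be checked directly. The following is a minimal sketch, assuming the core Unicode::Normalize module is available (it ships with Perl 5.8 and later); the variable names are illustrative only.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Two spellings of capital A with diaeresis.
my $precomposed = "\x{C4}";      # one code point: U+00C4
my $combined    = "A\x{308}";    # two code points: U+0041 U+0308

# length() counts code points, not glyphs on the screen.
my $len_precomposed = length($precomposed);
my $len_combined    = length($combined);

# A plain string comparison sees two different strings...
my $same_raw = ($precomposed eq $combined) ? 1 : 0;

# ...but after normalizing both to NFC they compare equal.
my $same_nfc = (NFC($precomposed) eq NFC($combined)) ? 1 : 0;

printf "lengths: %d and %d\n", $len_precomposed, $len_combined;
printf "equal before NFC: %d, after NFC: %d\n", $same_raw, $same_nfc;
```

Normalizing before comparing is one common way to treat the two forms as the same text; whether NFC or NFD suits a given program depends on what the data is used for.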

The following table shows the old ISO 8859-1 way of writing a capital A with a diaeresis, in which the logical character code and the physical byte layout enjoyed an identical representation, and the new way under Unicode. We'll include both ways of writing that character: one precomposed in one code point and the other using two code points to create a combined character.

                      Old way      New way
                      Ä            A         Ä            Ä
Character(s)          0xC4         U+0041    U+00C4       U+0041 U+0308
Character repertoire  ISO 8859-1   Unicode   Unicode      Unicode
Character code(s)     0xC4         0x0041    0x00C4       0x0041 0x0308
Encoding              -            UTF-8     UTF-8        UTF-8
Byte(s)               0xC4         0x41      0xC3 0x84    0x41 0xCC 0x88

The internal format used by Perl is UTF-8, a variable-width encoding system. One reason for this choice is that legacy ASCII requires no conversion for UTF-8, looking in memory exactly as it did before—just one byte per character. Character U+0041 is just 0x41 in memory. Legacy data sets don't increase in size, and even those using Western character sets like ISO 8859-n grow only slightly, since in practice you still have a favorable ratio of regular ASCII characters to 8-bit accented characters.
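The size claim is easy to test on small samples. This sketch, which assumes only the core Encode module, counts the bytes two strings occupy in ISO 8859-1 and in UTF-8; the helper name byte_count and the sample strings are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Number of bytes a string occupies once written in a given encoding.
sub byte_count {
    my ($encoding, $string) = @_;
    return length(encode($encoding, $string));
}

my $ascii_text  = "hello";        # pure ASCII
my $latin1_text = "caf\x{E9}";    # ends in U+00E9, e with acute accent

printf "ascii:    iso-8859-1=%d utf-8=%d\n",
    byte_count("iso-8859-1", $ascii_text),
    byte_count("UTF-8",      $ascii_text);
printf "accented: iso-8859-1=%d utf-8=%d\n",
    byte_count("iso-8859-1", $latin1_text),
    byte_count("UTF-8",      $latin1_text);
```

The ASCII string is the same size either way, while the accented string grows by one byte, the single non-ASCII character.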

Just because Perl uses UTF-8 internally doesn't preclude using other formats externally. Perl automatically converts all data between UTF-8 and whatever encoding you've specified for that handle. The Encode module is used implicitly when you specify an I/O layer of the form ":encoding(...)". For example:

binmode(FH, ":encoding(UTF-16BE)")
    or die "can't binmode to utf-16be: $!";

or directly in the open:

open(FH, "< :encoding(UTF-32)", $pathname)
    or die "can't open $pathname: $!";
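To see the conversion happen, one can write a character through an encoding layer and then inspect the file's raw bytes. The following is a sketch only: it uses lexical filehandles and a temporary file from the core File::Temp module, neither of which appears in the fragments above, and all variable names are illustrative.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file that is removed when the program exits.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh or die "can't close $path: $!";

# Write U+00C4 through a UTF-16BE encoding layer.
open(my $out, ">:encoding(UTF-16BE)", $path)
    or die "can't open $path for writing: $!";
print {$out} "\x{C4}";
close $out or die "can't close $path: $!";

# Read the file back with no decoding to see what landed on disk.
open(my $raw, "<:raw", $path) or die "can't open $path: $!";
my $bytes = do { local $/; <$raw> };
close $raw;
my $hex = join " ", map { sprintf "%02x", $_ } unpack("C*", $bytes);

# Read it again through the same layer to recover the character.
open(my $in, "<:encoding(UTF-16BE)", $path) or die "can't open $path: $!";
my $text = do { local $/; <$in> };
close $in;

printf "on disk: %s\n", $hex;
printf "decoded: %d character(s), U+%04X\n", length($text), ord($text);
```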

Here's a comparison of actual byte layouts of those two sequences, both representing a capital A with diaeresis, under several other popular formats:

Encoding   U+00C4                    U+0041 U+0308
UTF-8      c3 84                     41 cc 88
UTF-16BE   00 c4                     00 41 03 08
UTF-16LE   c4 00                     41 00 08 03
UTF-16     fe ff 00 c4               fe ff 00 41 03 08
UTF-32LE   c4 00 00 00               41 00 00 00 08 03 00 00
UTF-32BE   00 00 00 c4               00 00 00 41 00 00 03 08
UTF-32     00 00 fe ff 00 00 00 c4   00 00 fe ff 00 00 00 41 00 00 03 08
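These byte layouts can be reproduced with the Encode module's encode function. The sketch below assumes nothing beyond core Encode; hex_bytes is a hypothetical helper name chosen for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode a string and render the resulting bytes as lowercase hex pairs.
sub hex_bytes {
    my ($encoding, $string) = @_;
    return join " ", map { sprintf "%02x", $_ }
        unpack("C*", encode($encoding, $string));
}

my @encodings = qw(UTF-8 UTF-16BE UTF-16LE UTF-16 UTF-32LE UTF-32BE UTF-32);

for my $encoding (@encodings) {
    printf "%-9s %-24s %s\n",
        $encoding,
        hex_bytes($encoding, "\x{C4}"),      # precomposed form
        hex_bytes($encoding, "A\x{308}");    # base letter plus combining mark
}
```

The same two code points take 3 bytes in UTF-8 but 12 in UTF-32 once its byte-order mark is counted.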

The wider encodings can chew up memory quickly. Matters are further complicated by the fact that some computers are big-endian and others little-endian. So encodings built from multibyte units that don't specify their endianness require a special byte-order mark, the code point U+FEFF, which appears as "FE FF" in big-endian UTF-16 and "FF FE" in little-endian UTF-16. It is usually needed only at the start of the stream.
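When decoding plain "UTF-16", the Encode module reads the byte-order mark to decide which way the bytes run. A small sketch, using hand-built byte strings for U+00C4 in both orders (the variable names are illustrative):

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same character, U+00C4, serialized in each byte order.
# The first two bytes of each string are the byte-order mark.
my $big_endian    = "\xFE\xFF\x00\xC4";
my $little_endian = "\xFF\xFE\xC4\x00";

# Decoding as plain UTF-16 lets the mark select the byte order;
# the mark itself is not part of the decoded text.
my $from_be = decode("UTF-16", $big_endian);
my $from_le = decode("UTF-16", $little_endian);

printf "big-endian:    %d character(s), U+%04X\n", length($from_be), ord($from_be);
printf "little-endian: %d character(s), U+%04X\n", length($from_le), ord($from_le);
```

Data with no mark at all has to be decoded under an explicit name such as UTF-16BE or UTF-16LE instead.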

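If you're reading or writing UTF-8 data, use the :utf8 layer. Because Perl natively uses UTF-8, the :utf8 layer bypasses the Encode module for performance. The trade-off is that :utf8 does not check that input is well-formed UTF-8, so for data from untrusted sources the :encoding(UTF-8) layer is the safer choice.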
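The effect of the layer shows up in what length reports for the same file. This is a sketch under the same assumptions as any small file test: a temporary file from the core File::Temp module and lexical filehandles, with illustrative variable names.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file that is removed when the program exits.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh or die "can't close $path: $!";

# Write one character, U+00C4, through the :utf8 layer.
open(my $out, ">:utf8", $path) or die "can't open $path for writing: $!";
print {$out} "\x{C4}";
close $out or die "can't close $path: $!";

# Read it back once as characters...
open(my $as_text, "<:utf8", $path) or die "can't open $path: $!";
my $chars = do { local $/; <$as_text> };
close $as_text;

# ...and once as undecoded bytes.
open(my $as_bytes, "<:raw", $path) or die "can't open $path: $!";
my $bytes = do { local $/; <$as_bytes> };
close $as_bytes;

printf "characters: %d, bytes: %d\n", length($chars), length($bytes);
```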

The Encode module understands many aliases for encodings, so ascii, US-ascii, and ISO-646-US are synonymous. Read the Encode::Supported manpage for a list of available encodings. Perl supports not only standard names but vendor-specific encodings, too; for example, the rough counterparts of iso-8859-1 are cp850 on DOS, cp1252 on Windows, MacRoman on a Mac, and hp-roman8 on HP systems. These cover much the same Western European repertoire but assign different byte values to some characters, so the Encode module treats each as a distinct encoding rather than as another name for iso-8859-1.
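Alias handling can be inspected with Encode::resolve_alias, which returns the canonical name the module uses for a given spelling. A minimal sketch, with the list of names taken from the paragraph above:

```perl
use strict;
use warnings;
use Encode ();

# Three spellings that should all resolve to one canonical encoding name.
my @names = ("ascii", "US-ascii", "ISO-646-US");

my %canonical;
for my $name (@names) {
    my $resolved = Encode::resolve_alias($name);
    $canonical{$name} = defined $resolved ? $resolved : "(unknown)";
    printf "%-12s -> %s\n", $name, $canonical{$name};
}
```

An unrecognized name comes back as a false value, which makes the same call a cheap way to validate an encoding name supplied by a user before passing it to open or binmode.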




Copyright © 2003 O'Reilly & Associates. All rights reserved.