Strings (Perl Cookbook, 2nd Edition)

1.0. Introduction

Many programming languages force you to work at an uncomfortably low level. You think in lines, but your language wants you to deal with pointers. You think in strings, but it wants you to deal with bytes. Such a language can drive you to distraction. Don't despair; Perl isn't a low-level language, so lines and strings are easy to handle.

Perl was designed for easy but powerful text manipulation. In fact, Perl can manipulate text in so many ways that they can't all be described in one chapter. Check out other chapters for recipes on text processing. In particular, see Chapter 6 and Chapter 8, which discuss interesting techniques not covered here.

Perl's fundamental unit for working with data is the scalar, that is, single values stored in single (scalar) variables. Scalar variables hold strings, numbers, and references. Array and hash variables hold lists or associations of scalars, respectively. References are used for referring to values indirectly, not unlike pointers in low-level languages. Numbers are usually stored in your machine's double-precision floating-point notation. Strings in Perl may be of any length, within the limits of your machine's virtual memory, and can hold any arbitrary data you care to put there—even binary data containing null bytes.

A string in Perl is not an array of characters—nor of bytes, for that matter. You cannot use array subscripting on a string to address one of its characters; use substr for that. Like all data types in Perl, strings grow on demand. Space is reclaimed by Perl's garbage collection system when no longer used, typically when the variables have gone out of scope or when the expression in which they were used has been evaluated. In other words, memory management is already taken care of, so you don't have to worry about it.
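As a minimal sketch of the point above, the following uses substr to pick characters out of a string and then appends to it without declaring any size. The sample text and variable names are only illustrative.

```perl
use strict;
use warnings;

my $string = "Perl Cookbook";    # illustrative sample text

# There is no $string[0]; substr takes an offset and an optional length.
my $first = substr($string, 0, 1);    # "P"
my $word  = substr($string, 5, 4);    # "Cook"
my $last  = substr($string, -1);      # "k" (negative offsets count from the end)

# Strings grow on demand; no buffer size is ever declared.
$string .= ", 2nd Edition";

print "$first $word $last\n";
print "length is now ", length($string), "\n";
```

Recipe-style code elsewhere in this chapter relies on the same substr calls, so it is worth getting comfortable with the offset and length arguments early.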

A scalar value is either defined or undefined. If defined, it may hold a string, number, or reference. The only undefined value is undef. All other values are defined, even numeric 0 and the empty string. Definedness is not the same as Boolean truth, though; to check whether a value is defined, use the defined function. Boolean truth has a specialized meaning, tested with operators such as && and || or in an if or while block's test condition.

Two defined strings are false: the empty string ("") and a string of length one containing the digit zero ("0"). All other defined values (e.g., "false", 15, and \$x) are true. You might be surprised to learn that "0" is false, but this is due to Perl's on-demand conversion between strings and numbers. The values 0., 0.00, and 0.0000000 are all numbers and are therefore false when unquoted, since the number zero in any of its guises is always false. However, those three values ("0.", "0.00", and "0.0000000") are true when used as literal quoted strings in your program code or when they're read from the command line, an environment variable, or an input file.
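The rules for definedness and truth can be checked with a short sketch like the one below. The describe function and the sample list are hypothetical helpers written for this illustration; they are not part of Perl.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a scalar by definedness first, then by truth.
sub describe {
    my ($value) = @_;
    return "undefined" unless defined $value;
    return $value ? "defined, true" : "defined, false";
}

my $x = 42;    # only here so that we can take a reference to it

my @samples = (
    [ 'undef'   => undef   ],
    [ '""'      => ""      ],
    [ '"0"'     => "0"     ],
    [ '0.00'    => 0.00    ],    # a number, so it is zero and false
    [ '"0.00"'  => "0.00"  ],    # a string other than "" or "0", so true
    [ '"false"' => "false" ],
    [ '15'      => 15      ],
    [ '\$x'     => \$x     ],
);

printf "%-8s %s\n", $_->[0], describe($_->[1]) for @samples;
```

Only undef reports as undefined; the empty string, "0", and the number zero are defined but false, and everything else in the list is true.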

This is seldom an issue, since conversion is automatic when the value is used numerically. If it has never been used numerically, though, and you just test whether it's true or false, you might get an unexpected answer—Boolean tests never force any sort of conversion. Adding 0 to the variable makes Perl explicitly convert the string to a number:

print "Gimme a number: ";
0.00000
chomp($n = <STDIN>);  # $n now holds "0.00000";

print "The value $n is ", $n ? "TRUE" : "FALSE", "\n";
The value 0.00000 is TRUE

$n += 0;
print "The value $n is now ", $n ? "TRUE" : "FALSE", "\n";
The value 0 is now FALSE

The undef value behaves like the empty string ("") when used as a string, 0 when used as a number, and the null reference when used as a reference. But in all three possible cases, it's false. Using an undefined value where Perl expects a defined value will trigger a runtime warning message on STDERR if you've enabled warnings. Merely asking whether something is true or false demands no particular value, so this is exempt from warnings. Some operations do not trigger warnings when used on variables holding undefined values. These include the autoincrement and autodecrement operators, ++ and --, and the addition and concatenation assignment operators, += and .= ("plus-equals" and "dot-equals").
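Here is a small sketch of those warning rules. It assumes warnings are enabled and installs a temporary $SIG{__WARN__} handler so the warnings can be counted rather than printed on STDERR; the variable names are illustrative.

```perl
use strict;
use warnings;

my ($count, $text, $total);    # all three start out undefined

$count++;            # undef acts as 0; exempt from warnings
$text  .= "abc";     # undef acts as ""; exempt from warnings
$total += 10;        # undef acts as 0; exempt from warnings

# Collect warnings in an array instead of letting them reach STDERR.
my @warnings;
{
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    my $undef;
    my $sum = $undef + 1;        # ordinary numeric use: triggers a warning
    my $ok  = $undef ? 1 : 0;    # a bare truth test: no warning
}

print "count=$count text=$text total=$total\n";
print "warnings captured: ", scalar(@warnings), "\n";
```

Only the plain addition should produce an "uninitialized value" warning; the increment, the two assignment operators, and the truth test stay silent.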

Specify strings in your program using single quotes, double quotes, the quoting operators q// and qq//, or here documents. No matter which notation you use, string literals are one of two possible flavors: interpolated or uninterpolated. Interpolation governs whether variable references and special sequences are expanded. Most quoting constructs interpolate by default, as in patterns (/regex/) and command execution ($x = `cmd`).

Where special characters are recognized, preceding any special character with a backslash renders that character mundane; that is, it becomes a literal. This is often referred to as "escaping" or "backslash escaping."

Using single quotes is the canonical way to get an uninterpolated string literal. Three special sequences are still recognized: ' to terminate the string, \' to represent a single quote, and \\ to represent a backslash in the string.

$string = '\n';                     # two characters, \ and an n
$string = 'Jon \'Maddog\' Orwant';  # literal single quotes

Double quotes interpolate variables (but not function calls; see Recipe 1.15 to find out how to do this) and expand backslash escapes. These include "\n" (newline), "\033" (the character with octal value 33), "\cJ" (Ctrl-J), "\x1B" (the character with hex value 0x1B), and so on. The full list of these is given in the perlop(1) manpage and the section on "Specific Characters" in Chapter 5 of Programming Perl.

$string = "\n";                     # a "newline" character
$string = "Jon \"Maddog\" Orwant";  # literal double quotes

If there are no backslash escapes or variables to expand within the string, it makes no difference which flavor of quotes you use. When choosing between writing 'this' and writing "this", some Perl programmers prefer to use double quotes so that the strings stand out. This also avoids the slight risk of having single quotes mistaken for backquotes by readers of your code. It makes no difference to Perl, and it might help readers.
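When the string does contain a variable or a backslash escape, the choice of quotes matters. The sketch below puts the same text in both flavors; $name and its value are just sample data.

```perl
use strict;
use warnings;

my $name = "Jon";    # sample value for illustration

my $single = 'Hello, $name\n';    # nothing expanded: literal $name, backslash, n
my $double = "Hello, $name\n";    # $name and \n are both expanded

print "single: $single\n";
print "double: $double";
print "lengths: ", length($single), " vs ", length($double), "\n";
```

The single-quoted version keeps all 14 characters exactly as typed, while the double-quoted one shrinks to 11 once $name becomes Jon and \n becomes a single newline.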

The q// and qq// quoting operators allow arbitrary delimiters on uninterpolated and interpolated literals, respectively, corresponding to single- and double-quoted strings. For an uninterpolated string literal that contains single quotes, it's easier to use q// than to escape all single quotes with backslashes:

$string = 'Jon \'Maddog\' Orwant';   # embedded single quotes
$string = q/Jon 'Maddog' Orwant/;    # same thing, but more legible

Choose the same character for both delimiters, as we just did with /, or pair any of the following four sets of bracketing characters:

$string = q[Jon 'Maddog' Orwant];   # literal single quotes
$string = q{Jon 'Maddog' Orwant};   # literal single quotes
$string = q(Jon 'Maddog' Orwant);   # literal single quotes
$string = q<Jon 'Maddog' Orwant>;   # literal single quotes
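The interpolating counterpart, qq//, accepts the same delimiters. The following sketch contrasts the two operators; the $who variable is a sample value introduced only for this illustration.

```perl
use strict;
use warnings;

my $who = "Maddog";    # sample value for illustration

my $plain  = q{Jon '$who' Orwant};     # uninterpolated: $who stays literal
my $interp = qq{Jon "$who" Orwant};    # interpolated: double quotes need no escaping
my $nested = q(balanced (nested) parens are fine);    # bracketing delimiters nest

print "$plain\n$interp\n$nested\n";
```

With bracketing delimiters, properly nested pairs inside the string do not need backslashes, which is one reason to prefer them over a repeated character such as /.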

Here documents are a notation borrowed from the shell used to quote a large chunk of text. The text can be interpreted as single-quoted, double-quoted, or even as commands to be executed, depending on how you quote the terminating identifier. Uninterpolated here documents do not expand the three backslash sequences the way single-quoted literals normally do. Here we double-quote two lines with a here document:

$a = <<"EOF";
This is a multiline here document
terminated by EOF on a line by itself
EOF

Notice there's no semicolon after the terminating EOF. Here documents are covered in more detail in Recipe 1.16.
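To see how quoting the terminator changes the result, here is a brief sketch with the same body under both flavors. The terminator END_TEXT and the $user variable are arbitrary names chosen for the illustration.

```perl
use strict;
use warnings;

my $user = "Tom";    # sample value for illustration

# Double-quoted terminator: variables and backslash escapes are expanded.
my $expanded = <<"END_TEXT";
Dear $user,\tthanks
END_TEXT

# Single-quoted terminator: the body is taken exactly as written.
my $literal = <<'END_TEXT';
Dear $user,\tthanks
END_TEXT

print $expanded;
print $literal;
```

The first string contains Tom and a real tab character; the second still holds the dollar sign, the variable name, and the two characters \ and t.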

1.0.1. The Universal Character Code

As far as the computer is concerned, all data is just a series of individual numbers, each a string of bits. Even text strings are just sequences of numeric codes interpreted as characters by programs like web browsers, mailers, printing programs, and editors.

Back when memory sizes were far smaller and memory prices far more dear, programmers would go to great lengths to save memory. Strategies such as stuffing six characters into one 36-bit word or jamming three characters into one 16-bit word were common. Even today, the numeric codes used for characters usually aren't longer than 7 or 8 bits, which are the lengths you find in ASCII and Latin1, respectively.

That doesn't leave many bits per character—and thus, not many characters. Consider an image file with 8-bit color. You're limited to 256 different colors in your palette. Similarly, with characters stored as individual octets (an octet is an 8-bit byte), a document can usually have no more than 256 different letters, punctuation marks, and symbols in it.

ASCII, being the American Standard Code for Information Interchange, was of limited utility outside the United States, since it covered only the characters needed for a slightly stripped-down dialect of American English. Consequently, many countries invented their own incompatible 8-bit encodings built upon 7-bit ASCII. Conflicting schemes for assigning numeric codes to characters sprang up, all reusing the same limited range. That meant the same number could mean a different character in different systems and that the same character could have been assigned a different number in different systems.

Locales were an early attempt to address this and other language- and country-specific issues, but they didn't work out so well for character set selection. They're still reasonable for purposes unrelated to character sets, such as local preferences for monetary units, date and time formatting, and even collating sequences. But they are of far less utility for reusing the same 8-bit namespace for different character sets.

That's because if you wanted to produce a document that used Latin, Greek, and Cyrillic characters, you were in for big trouble, since the same numeric code would be a different character under each system. For example, character number 196 is a Latin capital A with a diaeresis above it in ISO 8859-1 (Latin1); under ISO 8859-7, that same numeric code represents a Greek capital delta. So a program interpreting numeric character codes in the ISO 8859-1 locale would see one character, but under the ISO 8859-7 locale, it would see something totally different.
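The clash is easy to reproduce with the standard Encode module, which can interpret a byte under a named encoding. This is only a sketch of the ambiguity, using the encoding names "iso-8859-1" and "iso-8859-7" as Encode spells them; it prints code points in hex rather than the characters themselves.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $byte = chr(196);    # one octet, numeric code 196 (0xC4)

# The same octet, interpreted under two different 8-bit character sets.
my $latin = decode("iso-8859-1", $byte);
my $greek = decode("iso-8859-7", $byte);

printf "ISO 8859-1: U+%04X\n", ord($latin);
printf "ISO 8859-7: U+%04X\n", ord($greek);
```

Under Latin1 the octet decodes to U+00C4, a capital A with diaeresis; under the Greek set it decodes to U+0394, a capital delta.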

This makes it hard to combine different character sets in the same document. Even if you did cobble something together, few programs could work with that document's text. To know what characters you had, you'd have to know what system they were in, and you couldn't easily mix systems. If you guessed wrong, you'd get a jumbled mess on your screen, or worse.

1.0.2. Unicode Support in Perl

Enter Unicode.

Unicode attempts to unify all character sets in the entire world, including many symbols and even fictional character sets. Under Unicode, different characters have different numeric codes, called code points.

Mixed-language documents are now easy, whereas before they weren't even possible. You no longer have just 128 or 256 possible characters per document. With Unicode you can have tens of thousands (and more) of different characters all jumbled together in the same document without confusion.

The problem of mixing, say, an Ä with a Δ evaporates. The first character, formally named "LATIN CAPITAL LETTER A WITH DIAERESIS" under Unicode, is assigned the code point U+00C4 (that's the Unicode preferred notation). The second, a "GREEK CAPITAL LETTER DELTA", is now at code point U+0394. With different characters always assigned different code points, there's no longer any conflict.

Perl has supported Unicode since v5.6 or so, but it wasn't until the v5.8 release that Unicode support was generally considered robust and usable. This by no coincidence corresponded to the introduction of I/O layers and their support for encodings into Perl. These are discussed in more detail in Chapter 8.

All Perl's string functions and operators, including those used for pattern matching, now operate on characters instead of octets. If you ask for a string's length, Perl reports how many characters are in that string, not how many bytes are in it. If you extract the first three characters of a string using substr, the result may or may not be three bytes. You don't know, and you shouldn't care, either. One reason not to care about the particular underlying bytewise representation is that if you have to pay attention to it, you're probably looking too closely. It shouldn't matter, really—but if it does, this might mean that Perl's implementation still has a few bumps in it. We're working on that.

Because characters with code points above 255 are supported, the chr function is no longer restricted to arguments under 256, nor is ord restricted to returning an integer smaller than that. Ask for chr(0x394), for example, and you'll get a Greek capital delta: Δ.

$char = chr(0x394);
$code = ord($char);
printf "char %s is code %d, %#04x\n", $char, $code, $code;

char Δ is code 916, 0x394

If you test the length of that string, it will say 1, because it's just one character. Notice how we said character; we didn't say anything about its length in bytes. Certainly the internal representation requires more than just 8 bits for a numeric code that big. But you the programmer are dealing with characters as abstractions, not as physical octets. Low-level details like that are best left up to Perl.
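If you are curious about the difference, one way to see it is to encode the character explicitly and compare lengths. The sketch below uses Encode's encode function to produce UTF-8 octets; that is one possible external representation, chosen here for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char  = chr(0x394);               # one character: Greek capital delta
my $bytes = encode("UTF-8", $char);   # the same character as UTF-8 octets

print "characters: ", length($char),  "\n";
print "octets:     ", length($bytes), "\n";

# Show each octet in hex.
print join(" ", map { sprintf "%02X", ord } split //, $bytes), "\n";
```

The character string has length 1, while its UTF-8 encoding occupies two octets, CE and 94.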

You shouldn't think of characters and bytes as the same. Programmers who interchange bytes and characters are guilty of the same class of sin as C programmers who blithely interchange integers and pointers. Even though the underlying representations may happen to coincide on some platforms, this is just a coincidence, and conflating abstract interfaces with physical implementations will always come back to haunt you, eventually.

You have several ways to put Unicode characters into Perl literals. If you're lucky enough to have a text editor that lets you enter Unicode directly into your Perl program, you can inform Perl you've done this via the use utf8 pragma. Another way is to use \x escapes in Perl interpolated strings to indicate a character by its code point in hex, as in \xC4. Characters with code points above 0xFF require more than two hex digits, so these must be enclosed in braces.

print "\xC4 and \x{0394} look different\n";

Ä and Δ look different

Recipe 1.5 describes how to use charnames to put \N{NAME} escapes in string literals, such as \N{GREEK CAPITAL LETTER DELTA}, \N{greek:Delta}, or even just \N{Delta} to indicate a Δ character.
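As a quick preview, and only a sketch of what Recipe 1.5 covers properly, the following loads charnames with the :full tag and uses a named escape. It prints code points rather than the character itself, so no output encoding needs to be set.

```perl
use strict;
use warnings;
use charnames ':full';

my $delta = "\N{GREEK CAPITAL LETTER DELTA}";

# The named escape yields the same character as chr(0x394).
printf "code point: U+%04X\n", ord($delta);

# charnames can also map between names and code points at runtime.
print "name: ", charnames::viacode(0x394), "\n";
printf "code: %d\n", charnames::vianame("GREEK CAPITAL LETTER DELTA");
```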

That's enough to get started using Unicode in Perl alone, but getting Perl to interact properly with other programs requires a bit more.

Using the old single-byte encodings like ASCII or ISO 8859-n, when you wrote out a character whose numeric code was NN, a single byte with numeric code NN would appear. What actually appeared depended on which fonts were available, your current locale setting, and quite a few other factors. But under Unicode, this exact duplication of logical character numbers (code points) into physical bytes emitted no longer applies. Instead, they must be encoded in any of several available output formats.

Internally, Perl uses a format called UTF-8, but many other encoding formats for Unicode exist, and Perl can work with those, too. The use encoding pragma tells Perl in which encoding your script itself has been written, or which encoding the standard filehandles should use. The use open pragma can set encoding defaults for all handles. Special arguments to open or to binmode specify the encoding format for that particular handle. The -C command-line flag is a shortcut to set the encoding on all (or just standard) handles, plus the program arguments themselves. The environment variables PERLIO, PERL_ENCODING, and PERL_UNICODE all give Perl various sorts of hints related to these matters.
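A per-handle encoding might look like the sketch below. It writes to and reads from an in-memory file (a reference to the scalar $buffer, an illustrative name) so that the example needs no real file; the same layer syntax applies to ordinary filenames and to binmode.

```perl
use strict;
use warnings;

my $text   = "\x{394}";    # one character, Greek capital delta
my $buffer = "";           # in-memory "file" standing in for a real one

# Writing through an encoding layer turns characters into octets.
open(my $out, ">:encoding(UTF-8)", \$buffer) or die "can't open for writing: $!";
print {$out} $text;
close($out) or die "can't close: $!";

print "octets written: ", length($buffer), "\n";

# Reading through the same layer turns octets back into characters.
open(my $in, "<:encoding(UTF-8)", \$buffer) or die "can't open for reading: $!";
my $back = <$in>;
close($in);

print "round trip ", ($back eq $text ? "succeeded" : "failed"), "\n";
```

One character goes in, two octets land in the buffer, and one character comes back out. The program never handles the octets itself.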
