Programming Perl

5.2. Pattern-Matching Operators

Zoologically speaking, Perl's pattern-matching operators function as a kind of cage for regular expressions, to keep them from getting out. This is by design; if we were to let the regex beasties wander throughout the language, Perl would be a total jungle. The world needs its jungles, of course--they're the engines of biological diversity, after all--but jungles should stay where they belong. Similarly, despite being the engines of combinatorial diversity, regular expressions should stay inside pattern match operators where they belong. It's a jungle in there.

As if regular expressions weren't powerful enough, the m// and s/// operators also provide the (likewise confined) power of double-quote interpolation. Since patterns are parsed like double-quoted strings, all the normal double-quote conventions will work, including variable interpolation (unless you use single quotes as the delimiter) and special characters indicated with backslash escapes. (See "Specific Characters" later in this chapter.) These are applied before the string is interpreted as a regular expression. (This is one of the few places in the Perl language where a string undergoes more than one pass of processing.) The first pass is not quite normal double-quote interpolation, in that it knows what it should interpolate and what it should pass on to the regular expression parser. So, for instance, any $ immediately followed by a vertical bar, closing parenthesis, or the end of the string will be treated not as a variable interpolation, but as the traditional regex assertion meaning end-of-line. So if you say:

$foo = "bar";
/$foo$/;
the double-quote interpolation pass knows that those two $ signs are functioning differently. It does the interpolation of $foo, then hands this to the regular expression parser:
/bar$/;
Another consequence of this two-pass parsing is that the ordinary Perl tokener finds the end of the regular expression first, just as if it were looking for the terminating delimiter of an ordinary string. Only after it has found the end of the string (and done any variable interpolation) is the pattern treated as a regular expression. Among other things, this means you can't "hide" the terminating delimiter of a pattern inside a regex construct (such as a character class or a regex comment, which we haven't covered yet). Perl will see the delimiter wherever it is and terminate the pattern at that point.
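As a small sketch of both points (the strings and variable names here are ours, chosen only for illustration), the following shows the two $ signs in /$foo$/ being treated differently, and shows two ways to get a slash into a character class without ending the pattern early:

```perl
use strict;
use warnings;

my $foo = "bar";

# $foo is interpolated; the trailing $ is passed through to the regex
# parser as an end-of-string assertion, so this behaves like /bar$/.
my $ends_in_bar   = ("crowbar"  =~ /$foo$/) ? 1 : 0;
my $bar_in_middle = ("barstool" =~ /$foo$/) ? 1 : 0;

# The tokener finds the closing delimiter before the regex parser runs,
# so a / cannot hide inside a character class when / is the delimiter.
# Either choose another delimiter or escape the slash.
my $path = "usr/bin";
my $with_braces = ($path =~ m{[/]}) ? 1 : 0;   # delimiter is {}, / is plain
my $with_escape = ($path =~ /[\/]/) ? 1 : 0;   # delimiter is /, so escape it

print "$ends_in_bar $bar_in_middle $with_braces $with_escape\n";
```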

You should also know that interpolating variables into a pattern slows down the pattern matcher, because it feels it needs to check whether the variable has changed, in case it has to recompile the pattern (which will slow it down even further). See "Variable Interpolation" later in this chapter.

The tr/// transliteration operator does not interpolate variables; it doesn't even use regular expressions! (In fact, it probably doesn't belong in this chapter at all, but we couldn't think of a better place to put it.) It does share one feature with m// and s///, however: it binds to variables using the =~ and !~ operators.

The =~ and !~ operators, described in Chapter 3, "Unary and Binary Operators", bind the scalar expression on their lefthand side to one of three quote-like operators on their right: m// for matching a pattern, s/// for substituting some string for a substring matched by a pattern, and tr/// (or its synonym, y///) for transliterating one set of characters to another set. (You may write m// as //, without the m, if slashes are used for the delimiter.) If the righthand side of =~ or !~ is none of these three, it still counts as a m// matching operation, but there'll be no place to put any trailing modifiers (see "Pattern Modifiers" later), and you'll have to handle your own quoting:

print "matches" if $somestring =~ $somepattern;
Really, there's little reason not to spell it out explicitly:
print "matches" if $somestring =~ m/$somepattern/;
When used for a matching operation, =~ and !~ are sometimes pronounced "matches" and "doesn't match" respectively (although "contains" and "doesn't contain" might cause less confusion).

Apart from the m// and s/// operators, regular expressions show up in two other places in Perl. The first argument to the split function is a special match operator specifying what not to return when breaking a string into multiple substrings. See the description and examples for split in Chapter 29, "Functions". The qr// ("quote regex") operator also specifies a pattern via a regex, but it doesn't try to match anything (unlike m//, which does). Instead, the compiled form of the regex is returned for future use. See "Variable Interpolation" for more information.
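Here is a brief sketch of those two other places, using made-up data. The split pattern describes the separators to throw away, and qr// hands back a compiled regex that can be matched against later:

```perl
use strict;
use warnings;

# split: the pattern says what NOT to return (the separators).
my @fields = split /\s*,\s*/, "apples , pears,plums";

# qr//: compile the pattern now, match with it later.
my $starts_with_p = qr/^p\w+$/;
my @p_words = grep { $_ =~ $starts_with_p } @fields;

print "fields:  @fields\n";
print "p words: @p_words\n";
```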

You apply one of the m//, s///, or tr/// operators to a particular string with the =~ binding operator (which isn't a real operator, just a kind of topicalizer, linguistically speaking). Here are some examples:

$haystack =~ m/needle/                # match a simple pattern
$haystack =~  /needle/                # same thing

$italiano =~ s/butter/olive oil/      # a healthy substitution

$rotate13 =~ tr/a-zA-Z/n-za-mN-ZA-M/  # easy encryption (to break)
Without a binding operator, $_ is implicitly used as the "topic":
/new life/ and              # search in $_ and (if found)
    /new civilizations/     #    boldly search $_ again

s/sugar/aspartame/          # substitute a substitute into $_

tr/ATCG/TAGC/               # complement the DNA stranded in $_
Because s/// and tr/// change the scalar to which they're applied, you may only use them on valid lvalues:
"onshore" =~ s/on/off/;      # WRONG: compile-time error
However, m// works on the result of any scalar expression:
if ((lc $magic_hat->fetch_contents->as_string) =~ /rabbit/) {
    print "Nyaa, what's up doc?\n";
}
else {
    print "That trick never works!\n";
}
But you have to be a wee bit careful, since =~ and !~ have rather high precedence--in our previous example the parentheses are necessary around the left term.[3] The !~ binding operator works like =~, but negates the logical result of the operation:
if ($song !~ /words/) {
    print qq/"$song" appears to be a song without words.\n/;
}
Since m//, s///, and tr/// are quote operators, you may pick your own delimiters. These work in the same way as the quoting operators q//, qq//, qr//, and qw// (see "Pick Your Own Quotes" in Chapter 2, "Bits and Pieces").
$path =~ s#/tmp#/var/tmp/scratch#;

if ($dir =~ m[/bin]) {
    print "No binary directories please.\n";
}
When using paired delimiters with s/// or tr///, if the first part is one of the four customary bracketing pairs (angle, round, square, or curly), you may choose different delimiters for the second part than you chose for the first:
s(egg)<larva>;
s{larva}{pupa};
s[pupa]/imago/;
Whitespace is allowed in front of the opening delimiters:
s (egg)   <larva>;
s {larva} {pupa};
s [pupa]  /imago/;
Each time a pattern successfully matches (including the pattern in a substitution), it sets the $`, $&, and $' variables to the text left of the match, the whole match, and the text right of the match. This is useful for pulling apart strings into their components:
"hot cross buns" =~ /cross/;
print "Matched: <$`> $& <$'>\n";    # Matched: <hot > cross < buns>
print "Left:    <$`>\n";            # Left:    <hot >
print "Match:   <$&>\n";            # Match:   <cross>
print "Right:   <$'>\n";            # Right:   < buns>
For better granularity and efficiency, use parentheses to capture the particular portions that you want to keep around. Each pair of parentheses captures the substring corresponding to the subpattern in the parentheses. The pairs of parentheses are numbered from left to right by the positions of the left parentheses; the substrings corresponding to those subpatterns are available after the match in the numbered variables, $1, $2, $3, and so on:[4]
$_ = "Bilbo Baggins's birthday is September 22";
/(.*)'s birthday is (.*)/;
print "Person: $1\n";
print "Date: $2\n";
$`, $&, $', and the numbered variables are global variables implicitly localized to the enclosing dynamic scope. They last until the next successful pattern match or the end of the current scope, whichever comes first. More on this later, in a different scope.
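The scoping rule is easier to see in action. In this sketch (the strings and the @seen array are ours), a failed match leaves the earlier captures alone, and a successful match inside an inner block affects $1 only until that block exits:

```perl
use strict;
use warnings;

my @seen;
if ("Bilbo Baggins" =~ /(\w+) (\w+)/) {
    push @seen, "$2, $1";             # captures from the outer match

    # A failed match does not disturb the previous captures.
    "no digits here" =~ /(\d+)/;
    push @seen, $1;                   # still the outer $1

    {
        # A successful match in an inner block sets $1 for that block only.
        "Frodo" =~ /(Fro)/;
        push @seen, $1;
    }

    push @seen, $1;                   # the outer $1 is visible again
}

print join(" | ", @seen), "\n";
```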

[3] Without the parentheses, the lower-precedence lc would have applied to the whole pattern match instead of just the method call on the magic hat object.

[4] Not $0, though, which holds the name of your program.

Once Perl sees that you need one of $`, $&, or $' anywhere in the program, it provides them for every pattern match. This will slow down your program a bit. Perl uses a similar mechanism to produce $1, $2, and so on, so you also pay a price for each pattern that contains capturing parentheses. (See "Clustering" to avoid the cost of capturing while still retaining the grouping behavior.) But if you never use $`, $&, or $', then patterns without capturing parentheses will not be penalized. So it's usually best to avoid $`, $&, and $' if you can, especially in library modules. But if you must use them once (and some algorithms really appreciate their convenience), then use them at will, because you've already paid the price. $& is not so costly as the other two in recent versions of Perl.
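To illustrate the difference between capturing and mere clustering mentioned above, here is a minimal sketch with an invented string. The (?:...) form groups without capturing, so in list context with /g you get the whole matches back instead of the captured pieces:

```perl
use strict;
use warnings;

my $text = "colour or color";

# Capturing parentheses: list-context /g returns the captured pieces.
my @captured  = $text =~ /col(ou?)r/g;

# Clustering only: nothing is captured, so whole matches are returned.
my @clustered = $text =~ /col(?:ou?)r/g;

print "captured:  @captured\n";
print "clustered: @clustered\n";
```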

5.2.1. Pattern Modifiers

We'll discuss the individual pattern-matching operators in a moment, but first we'd like to mention another thing they all have in common: modifiers.

Immediately following the final delimiter of an m//, s///, qr//, or tr/// operator, you may optionally place one or more single-letter modifiers, in any order. For clarity, modifiers are usually written as "the /o modifier" and pronounced "the slash oh modifier", even though the final delimiter might be something other than a slash. (Sometimes people say "flag" or "option" to mean "modifier"; that's okay too.)

Some modifiers change the behavior of the individual operator, so we'll describe those in detail later. Others change how the regex is interpreted, so we'll talk about them here. The m//, s///, and qr// operators[5] all accept the following modifiers after their final delimiter:

[5] The tr/// operator does not take regexes, so these modifiers do not apply.

Modifier Meaning
/i Ignore alphabetic case distinctions (case insensitive).
/s Let . match newline and ignore deprecated $* variable.
/m Let ^ and $ match next to embedded \n.
/x Ignore (most) whitespace and permit comments in pattern.
/o Compile pattern once only.

The /i modifier says to match both upper- and lowercase (and title case, under Unicode). That way /perl/i would also match the strings "PROPERLY" or "Perlaceous" (amongst other things). A use locale pragma may also have some influence on what is considered to be equivalent. (This may be a negative influence on strings containing Unicode.)

The /s and /m modifiers don't involve anything kinky. Rather, they affect how Perl treats matches against a string that contains newlines. But they aren't about whether your string actually contains newlines; they're about whether Perl should assume that your string contains a single line (/s) or multiple lines (/m), because certain metacharacters work differently depending on whether they're expected to behave in a line-oriented fashion or not.

Ordinarily, the metacharacter "." matches any one character except a newline, because its traditional meaning is to match characters within a line. With /s, however, the "." metacharacter can also match a newline, because you've told Perl to ignore the fact that the string might contain multiple newlines. (The /s modifier also makes Perl ignore the deprecated $* variable, which we hope you too have been ignoring.) The /m modifier, on the other hand, changes the interpretation of the ^ and $ metacharacters by letting them match next to newlines within the string instead of considering only the ends of the string. See the examples in Section 5.6, "Positions" later in this chapter.
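A quick sketch of the difference, using a two-line string of our own invention. The same patterns are tried with and without each modifier:

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without /s the dot refuses to cross the newline; with /s it may.
my $dot_plain = ($text =~ /first.*second/)  ? 1 : 0;
my $dot_s     = ($text =~ /first.*second/s) ? 1 : 0;

# Without /m the ^ anchors only at the start of the string;
# with /m it also matches just after an embedded newline.
my $caret_plain = ($text =~ /^second/)  ? 1 : 0;
my $caret_m     = ($text =~ /^second/m) ? 1 : 0;

# With /m, $ matches before each newline, so /g finds one hit per line.
my @line_ends = $text =~ /(\w+)$/mg;

print "$dot_plain $dot_s $caret_plain $caret_m @line_ends\n";
```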

The /o modifier controls pattern recompilation. Unless the delimiters chosen are single quotes (m'PATTERN', s'PATTERN'REPLACEMENT', or qr'PATTERN'), any variables in the pattern will be interpolated (and may cause the pattern to be recompiled) every time the pattern operator is evaluated. If you want such a pattern to be compiled once and only once, use the /o modifier. This prevents expensive run-time recompilations; it's useful when the value you are interpolating won't change during execution. However, mentioning /o constitutes a promise that you won't change the variables in the pattern. If you do change them, Perl won't even notice. For better control over recompilation, use the qr// regex quoting operator. See "Variable Interpolation" later in this chapter for details.
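As a hedged sketch of the qr// alternative (the word list and sample lines are invented for illustration), each pattern below is interpolated and compiled once when the qr// is evaluated, and the resulting regex objects are then reused inside the loop without making any /o promise:

```perl
use strict;
use warnings;

# Build each regex once, up front; $_ is interpolated at this point.
my @patterns = map { qr/\b$_\b/i } qw(perl regex);

my @lines = ("Perl is fun", "Regex beasties", "nothing here");
my %hits;
for my $line (@lines) {
    for my $re (@patterns) {
        # Matching against a qr// object reuses the compiled form.
        $hits{$line}++ if $line =~ $re;
    }
}

print "$_: $hits{$_}\n" for sort keys %hits;
```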

The /x is the expressive modifier: it allows you to exploit whitespace and explanatory comments in order to expand your pattern's legibility, even extending the pattern across newline boundaries.

Er, that is to say, /x modifies the meaning of the whitespace characters (and the # character): instead of letting them do self-matching as ordinary characters do, it turns them into metacharacters that, oddly, now behave as whitespace (and comment characters) should. Hence, /x allows spaces, tabs, and newlines for formatting, just like regular Perl code. It also allows the # character, not normally special in a pattern, to introduce a comment that extends through the end of the current line within the pattern string.[6] If you want to match a real whitespace character (or the # character), then you'll have to put it into a character class, or escape it with a backslash, or encode it using an octal or hex escape. (But whitespace is normally matched with a \s* or \s+ sequence, so the situation doesn't arise often in practice.)

[6] Be careful not to include the pattern delimiter in the comment--because of its "find the end first" rule, Perl has no way of knowing you didn't intend to terminate the pattern at that point.
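Here is a small sketch of what that means in practice, with sample strings of our own. Under /x, a bare space in the pattern is ignored, so a literal space or # has to be escaped or placed in a character class:

```perl
use strict;
use warnings;

my $s = "hot cross buns";

# Under /x the space is ignored, so this is really /hotcross/ and fails.
my $unescaped = ($s =~ /hot cross/x)   ? 1 : 0;

# Two ways to match a real space under /x.
my $backslash = ($s =~ /hot\ cross/x)  ? 1 : 0;
my $class     = ($s =~ /hot[ ]cross/x) ? 1 : 0;

# A literal # must be escaped too, or it would start a comment.
my $number = ("item #7" =~ / \# (\d+) /x) ? $1 : undef;

print "$unescaped $backslash $class $number\n";
```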

Taken together, these features go a long way toward making traditional regular expressions a readable language. In the spirit of TMTOWTDI, there's now more than one way to write a given regular expression. In fact, there's more than two ways:

m/\w+:(\s+\w+)\s*\d+/;       # A word, colon, space, word, space, digits.

m/\w+: (\s+ \w+) \s* \d+/x;  # A word, colon, space, word, space, digits.

m{
    \w+:                     # Match a word and a colon.

    (                        # (begin group)
         \s+                 # Match one or more spaces.
         \w+                 # Match another word.
    )                        # (end group)
    \s*                      # Match zero or more spaces.
    \d+                      # Match some digits
}x;
We'll explain those new metasymbols later in the chapter. (This section was supposed to be about pattern modifiers, but we've let it get out of hand in our excitement about /x. Ah well.) Here's a regular expression that finds duplicate words in paragraphs, stolen right out of the Perl Cookbook. It uses the /x and /i modifiers, as well as the /g modifier described later.
# Find duplicate words in paragraphs, possibly spanning line boundaries.
#   Use /x for space and comments, /i to match both `is'
#   in "Is is this ok?", and use /g to find all dups.
$/ = "";        # "paragrep" mode
while (<>) {
    while ( m{
                \b            # start at a word boundary
                (\w\S+)       # find a wordish chunk
                (
                    \s+       # separated by some whitespace
                    \1        # and that chunk again
                ) +           # repeat ad lib
                \b            # until another word boundary
             }xig
         )
    {
        print "dup word '$1' at paragraph $.\n";
    }
}
When run on this chapter, it produces warnings like this:
dup word 'that' at paragraph 100
As it happens, we know that that particular instance was intentional.

5.2.2. The m// Operator (Matching)

EXPR =~ m/PATTERN/cgimosx
EXPR =~ /PATTERN/cgimosx
EXPR =~ ?PATTERN?cgimosx
m/PATTERN/cgimosx
/PATTERN/cgimosx
?PATTERN?cgimosx
The m// operator searches the string in the scalar EXPR for PATTERN. If / or ? is the delimiter, the initial m is optional. Both ? and ' have special meanings as delimiters: the first is a once-only match; the second suppresses variable interpolation and the six translation escapes (\U and company, described later).

If PATTERN evaluates to a null string, either because you specified it that way using // or because an interpolated variable evaluated to the empty string, the last successfully executed regular expression not hidden within an inner block (or within a split, grep, or map) is used instead.
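This behavior can be surprising, so here is a minimal sketch of it (the strings are ours). After /gg/ succeeds, an empty pattern in the same scope reuses /gg/ rather than matching the empty string:

```perl
use strict;
use warnings;

my @results;
if ("Baggins" =~ /gg/) {
    # The empty pattern stands in for the last successful one, /gg/.
    push @results, ("luggage" =~ // ? 1 : 0);   # contains "gg"
    push @results, ("Frodo"   =~ // ? 1 : 0);   # does not
}

print "@results\n";
```

If a variable you interpolate might be empty and you do not want this fallback, one common workaround is to test the variable's length before using it as a pattern.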

In scalar context, the operator returns true (1) if successful, false ("") otherwise. This form is usually seen in Boolean context:

if ($shire =~ m/Baggins/) { ... }  # search for Baggins in $shire
if ($shire =~ /Baggins/)  { ... }  # search for Baggins in $shire

if ( m#Baggins# )         { ... }  # search right here in $_
if ( /Baggins/ )          { ... }  # search right here in $_
Used in list context, m// returns a list of substrings matched by the capturing parentheses in the pattern (that is, $1, $2, $3, and so on) as described later under "Capturing and Clustering". The numbered variables are still set even when the list is returned. If the match fails in list context, a null list is returned. If the match succeeds in list context but there were no capturing parentheses (nor /g), a list value of (1) is returned. Since it returns a null list on failure, this form of m// can also be used in Boolean context, but only when participating indirectly via a list assignment:
if (($key,$value) = /(\w+): (.*)/) { ... }
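The three list-context outcomes described above can be seen side by side in this sketch (the sample string and variable names are invented):

```perl
use strict;
use warnings;

$_ = "colour: blue green";

# Captures are returned as a list (and $1, $2 are set as well).
my ($key, $value) = /(\w+): (.*)/;

# Success with no capturing parentheses returns the list (1).
my @no_captures = /blue/;

# Failure returns the empty list.
my @failed = /(\d+)/;

print "key=$key value=$value\n";
print "no captures: @no_captures; failed count: ", scalar(@failed), "\n";
```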
Valid modifiers for m// (in whatever guise) are shown in Table 5-1.

Table 5-1. m// Modifiers

Modifier Meaning
/i Ignore alphabetic case.
/m Let ^ and $ match next to embedded \n.
/s Let . match newline and ignore deprecated $*.
/x Ignore (most) whitespace and permit comments in pattern.
/o Compile pattern once only.
/g Globally find all matches.
/cg Allow continued search after failed /g match.

The first five modifiers apply to the regex and were described earlier. The last two change the behavior of the match operation itself. The /g modifier specifies global matching--that is, matching as many times as possible within the string. How it behaves depends on context. In list context, m//g returns a list of all matches found. Here we find all the places someone mentioned "perl", "Perl", "PERL", and so on:

if (@perls = $paragraph =~ /perl/gi) {
    printf "Perl mentioned %d times.\n", scalar @perls;
}
If there are no capturing parentheses within the /g pattern, then the complete matches are returned. If there are capturing parentheses, then only the strings captured are returned. Imagine a string like:
$string = "password=xyzzy verbose=9 score=0";
Also imagine you want to use that to initialize a hash like this:
%hash = (password => "xyzzy", verbose => 9, score => 0);
Except, of course, you don't have a list, you have a string. To get the corresponding list, you can use the m//g operator in list context to capture all of the key/value pairs from the string:
%hash = $string =~ /(\w+)=(\w+)/g;
The (\w+) sequence captures an alphanumeric word. See Section 5.7, "Capturing and Clustering".
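Putting the pieces of that example together as a runnable sketch, and contrasting it with the same pattern minus the capturing parentheses (the @whole array is our addition):

```perl
use strict;
use warnings;

my $string = "password=xyzzy verbose=9 score=0";

# With captures, /g in list context returns key, value, key, value, ...
my %hash = $string =~ /(\w+)=(\w+)/g;

# Without captures, it returns each complete match instead.
my @whole = $string =~ /\w+=\w+/g;

print "$_ => $hash{$_}\n" for sort keys %hash;
print "whole matches: @whole\n";
```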

Used in scalar context, the /g modifier indicates a progressive match, which makes Perl start the next match on the same variable at a position just past where the last one stopped. The \G assertion represents that position in the string; see Section 5.6, "Positions" later in this chapter for a description of \G. If you use the /c (for "continue") modifier in addition to /g, then when the /g runs out, the failed match doesn't reset the position pointer.
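One place where scalar /g, \G, and /c work together is a simple tokenizer. The following is only a sketch under our own assumptions (the token names and the input string are invented): each alternative is anchored with \G at the current position, and /c keeps a failed attempt from resetting that position so the next alternative can try from the same spot.

```perl
use strict;
use warnings;

my $input = "12 apples";
my @tokens;

while (1) {
    if    ($input =~ /\G\s+/gc)      { next }                      # skip whitespace
    elsif ($input =~ /\G(\d+)/gc)    { push @tokens, "NUM($1)"  }  # a run of digits
    elsif ($input =~ /\G([a-z]+)/gc) { push @tokens, "WORD($1)" }  # a lowercase word
    else                             { last }                      # nothing matched here
}

# Thanks to /c, the position survives the final failed attempts.
my $end = pos($input);

print "@tokens (stopped at offset $end)\n";
```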

If a ? is the delimiter, as in ?PATTERN?, this works just like a normal /PATTERN/ search, except that it matches only once between calls to the reset operator. This can be a convenient optimization when you want to match only the first occurrence of the pattern during the run of the program, not all occurrences. The operator runs the search every time you call it, up until it finally matches something, after which it turns itself off, returning false until you explicitly turn it back on with reset. Perl keeps track of the match state for you.

The ?? operator is most useful when an ordinary pattern match would find the last rather than the first occurrence:

open DICT, "/usr/dict/words" or die "Can't open words: $!\n";
while (<DICT>) {
    $first = $1 if ?(^neur.*)?;
    $last  = $1 if /(^neur.*)/;
}
print $first,"\n";          # prints "neurad"
print $last,"\n";           # prints "neurypnology"
The reset operator will reset only those instances of ?? compiled in the same package as the call to reset. Saying m?? is equivalent to saying ??.
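The dictionary example depends on a file that may not exist on your system, so here is the same idea as a self-contained sketch over an in-memory word list of our own. It uses the explicit m?...? spelling, which later Perl releases require in place of the bare ?...? form:

```perl
use strict;
use warnings;

my @words = qw(neural neuron neurotic newt);
my ($first, $last);

for (@words) {
    $first = $1 if m?^(neur.*)?;   # fires once, then turns itself off
    $last  = $1 if /^(neur.*)/;    # fires on every word that matches
}

print "$first\n";
print "$last\n";
```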

5.2.3. The s/// Operator (Substitution)

LVALUE =~ s/PATTERN/REPLACEMENT/egimosx
s/PATTERN/REPLACEMENT/egimosx
This operator searches a string for PATTERN and, if found, replaces the matched substring with the REPLACEMENT text. (Modifiers are described later in this section.)
$lotr = $hobbit;           # Just copy The Hobbit
$lotr =~ s/Bilbo/Frodo/g;  #   and write a sequel the easy way.
The return value of an s/// operation (in scalar and list contexts alike) is the number of times it succeeded (which can be more than once if used with the /g modifier, as described earlier). On failure, since it substituted zero times, it returns false (""), which is numerically equivalent to 0.
if ($lotr =~ s/Bilbo/Frodo/) { print "Successfully wrote sequel." }
$change_count = $lotr =~ s/Bilbo/Frodo/g;
The replacement portion is treated as a double-quoted string. You may use any of the dynamically scoped pattern variables described earlier ($`, $&, $', $1, $2, and so on) in the replacement string, as well as any other double-quote gizmos you care to employ. For instance, here's an example that finds all the strings "revision", "version", or "release", and replaces each with its capitalized equivalent, using the \u escape in the replacement portion:
s/revision|version|release/\u$&/g;  # Use | to mean "or" in a pattern
All scalar variables expand in double-quote context, not just these strange ones. Suppose you had a %Names hash that mapped revision numbers to internal project names; for example, $Names{"3.0"} might be code-named "Isengard". You could use s/// to find version numbers and replace them with their corresponding project names:
s/version ([0-9.]+)/the $Names{$1} release/g;
In the replacement string, $1 returns what the first (and only) pair of parentheses captured. (You could also use \1 as you would in the pattern, but that usage is deprecated in the replacement. In an ordinary double-quoted string, \1 means a Control-A.)

If PATTERN is a null string, the last successfully executed regular expression is used instead. Both PATTERN and REPLACEMENT are subject to variable interpolation, but a PATTERN is interpolated each time the s/// operator is evaluated as a whole, while the REPLACEMENT is interpolated every time the pattern matches. (The PATTERN can match multiple times in one evaluation if you use the /g modifier.)
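To see the replacement being interpolated afresh for each match, consider this sketch (the $tag variable and the sample text are ours). The pattern is fixed for the whole operation, but $1 in the replacement takes a new value every time the pattern matches:

```perl
use strict;
use warnings;

my $tag  = "b";
my $text = "one two three";

# $tag and $1 are interpolated into the replacement at each match.
my $marked = $text;
my $count  = ($marked =~ s{(\w+)}{<$tag>$1</$tag>}g);

print "$marked\n";
print "replacements: $count\n";
```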

As before, the first five modifiers in Table 5-2 alter the behavior of the regex; they're the same as in m// and qr//. The last two alter the substitution operator itself.

Table 5-2. s/// Modifiers

Modifier Meaning
/i Ignore alphabetic case (when matching).
/m Let ^ and $ match next to embedded \n.
/s Let . match newline and ignore deprecated $*.
/x Ignore (most) whitespace and permit comments in pattern.

/o Compile pattern once only.
/g Replace globally, that is, all occurrences.
/e Evaluate the right side as an expression.

The /g modifier is used with s/// to replace every match of PATTERN with the REPLACEMENT value, not just the first one found. A s///g operator acts as a global search and replace, making all the changes at once, much like list m//g, except that m//g doesn't change anything. (And s///g is not a progressive match as scalar m//g was.)

The /e modifier treats the REPLACEMENT as a chunk of Perl code rather than as an interpolated string. The result of executing that code is used as the replacement string. For example, s/([0-9]+)/sprintf("%#x", $1)/ge would convert all numbers into hexadecimal, changing, for example, 2581 into 0xa15. Or suppose that, in our earlier example, you weren't sure that you had names for all the versions, so you wanted to leave any others unchanged. With a little creative /x formatting, you could say:

s{
    version
    \s+
    (
        [0-9.]+
    )
}{
    $Names{$1}
        ? "the $Names{$1} release"
        : $&
}xge;
The righthand side of your s///e (or in this case, the lower side) is syntax-checked and compiled at compile time along with the rest of your program. Any syntax error is detected during compilation, and run-time exceptions are left uncaught. Each additional /e after the first one (like /ee, /eee, and so on) is equivalent to calling eval STRING on the result of the code, once per extra /e. This evaluates the result of the code expression and traps exceptions in the special $@ variable. See Section 5.10.3, "Programmatic Patterns" later in the chapter for more details.
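The following sketch exercises both forms with data of our own. The first substitution is the hexadecimal conversion mentioned earlier; the second uses /ee to expand a variable name found in a template string, which works because the extra eval sees the lexical $name in scope. Since /ee runs whatever code the string turns out to contain, it is only reasonable on text you trust.

```perl
use strict;
use warnings;

# One /e: the replacement is Perl code, run once per match.
my $hex = "2581 and 255";
$hex =~ s/([0-9]+)/sprintf("%#x", $1)/ge;

# Two /e's: the first evaluates $1 (yielding the text '$name'),
# the second evals that text as code (yielding the value of $name).
my $name   = "World";
my $filled = 'Hello, $name!';
$filled =~ s/(\$\w+)/$1/ee;

print "$hex\n";
print "$filled\n";
```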

5.2.3.1. Modifying strings en passant

Sometimes you want a new, modified string without clobbering the old one upon which the new one was based. Instead of writing:

$lotr = $hobbit;
$lotr =~ s/Bilbo/Frodo/g;
you can combine these into one statement. Due to precedence, parentheses are required around the assignment, as they are with most combinations applying =~ to an expression.
($lotr = $hobbit) =~ s/Bilbo/Frodo/g;
Without the parentheses around the assignment, you'd only change $hobbit and get the number of replacements stored into $lotr, which would make a rather dull sequel.
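The two parses are easy to compare directly. In this sketch (the sample text and the $original and $count names are ours), the parenthesized form modifies the copy, while the unparenthesized form modifies the source string and stores only the replacement count:

```perl
use strict;
use warnings;

my $hobbit = "Bilbo went out. Bilbo came back.";

# Copy first, then substitute in the copy; $hobbit is untouched.
(my $lotr = $hobbit) =~ s/Bilbo/Frodo/g;

# No parentheses: =~ binds tighter than =, so $original is modified
# and $count receives the number of substitutions.
my $original = "Bilbo went out. Bilbo came back.";
my $count = $original =~ s/Bilbo/Frodo/g;

print "$lotr\n";
print "$count\n";
```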

You can't use a s/// operator directly on an array. For that, you need a loop. By a lucky coincidence, the aliasing behavior of for/foreach, combined with its use of $_ as the default loop variable, yields the standard Perl idiom to search and replace each element in an array:

for (@chapters) { s/Bilbo/Frodo/g }  # Do substitutions chapter by chapter.
s/Bilbo/Frodo/g for @chapters;       # Same thing.
As with a simple scalar variable, you can combine the substitution with an assignment if you'd like to keep the original values around, too:
@oldhues = ('bluebird', 'bluegrass',  'bluefish', 'the blues');
for (@newhues = @oldhues) { s/blue/red/ }
print "@newhues\n";           # prints: redbird redgrass redfish the reds
The idiomatic way to perform repeated substitutions on the same variable is to use a once-through loop. For example, here's how to canonicalize whitespace in a variable:
for ($string) {
    s/^\s+//;       # discard leading whitespace
    s/\s+$//;       # discard trailing whitespace
    s/\s+/ /g;      # collapse internal whitespace
}
which just happens to produce the same result as:
$string = join(" ", split " ", $string);
You can also use such a loop with an assignment, as we did in the array case:
for ($newshow = $oldshow) {
    s/Fred/Homer/g;
    s/Wilma/Marge/g;
    s/Pebbles/Lisa/g;
    s/Dino/Bart/g;
}
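As a check on the whitespace idiom shown earlier, this sketch runs the once-through loop and the join/split version on the same messy string (our own sample) so you can compare the results:

```perl
use strict;
use warnings;

my $messy = "  too   many\t spaces \n";

# The once-through loop: $_ is aliased to $via_loop for each s///.
my $via_loop = $messy;
for ($via_loop) {
    s/^\s+//;       # discard leading whitespace
    s/\s+$//;       # discard trailing whitespace
    s/\s+/ /g;      # collapse internal whitespace
}

# The split " " special case skips leading whitespace and
# splits on runs of whitespace.
my $via_split = join(" ", split " ", $messy);

print "[$via_loop]\n[$via_split]\n";
```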

5.2.3.2. When a global substitution just isn't global enough

Occasionally, you can't just use a /g to get all the changes to occur, either because the substitutions have to happen right-to-left or because you need the length of $` to change between matches. You can usually do what you want by calling s/// repeatedly. However, you want the loop to stop when the s/// finally fails, so you have to put it into the conditional, which leaves nothing to do in the main part of the loop. So we just write a 1, which is a rather boring thing to do, but bored is the best you can hope for sometimes. Here are some examples that use a few more of those odd regex beasties that keep popping up:

# put commas in the right places in an integer
1 while s/(\d)(\d\d\d)(?!\d)/$1,$2/;

# expand tabs to 8-column spacing
1 while s/\t+/' ' x (length($&)*8 - length($`)%8)/e;

# remove (nested (even deeply nested (like this))) remarks
1 while s/\([^()]*\)//g;

# remove duplicate words (and triplicate (and quadruplicate...))
1 while s/\b(\w+) \1\b/$1/gi;
That last one needs a loop because otherwise it would turn this:
Paris in THE THE THE THE spring.
into this:
Paris in THE THE spring.
which might cause someone who knows a little French to picture Paris sitting in an artesian well emitting iced tea, since "thé" is French for "tea". A Parisian is never fooled, of course.
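Here is a runnable sketch of two of those loops on sample data of our own, including the single-pass result that the loop is there to improve on:

```perl
use strict;
use warnings;

# Commas are inserted right-to-left, one per pass, until s/// fails.
my $n = "1234567";
1 while $n =~ s/(\d)(\d\d\d)(?!\d)/$1,$2/;

my $s = "Paris in THE THE THE THE spring.";

# A single /g pass pairs the words up and leaves one duplicate behind.
(my $once = $s) =~ s/\b(\w+) \1\b/$1/gi;

# Repeating until nothing changes removes the rest.
my $fixed = $s;
1 while $fixed =~ s/\b(\w+) \1\b/$1/gi;

print "$n\n$once\n$fixed\n";
```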

5.2.4. The tr/// Operator (Transliteration)

LVALUE =~ tr/SEARCHLIST/REPLACEMENTLIST/cds
tr/SEARCHLIST/REPLACEMENTLIST/cds
For sed devotees, y/// is provided as a synonym for tr///. This is why you can't call a function named y, any more than you can call a function named q or m. In all other respects, y/// is identical to tr///, and we won't mention it again.

This operator might not appear to fit into a chapter on pattern matching, since it doesn't use patterns. This operator scans a string, character by character, and replaces each occurrence of a character found in SEARCHLIST (which is not a regular expression) with the corresponding character from REPLACEMENTLIST (which is not a replacement string). It looks a bit like m// and s///, though, and you can even use the =~ or !~ binding operators on it, so we describe it here. (qr// and split are pattern-matching operators, but you don't use the binding operators on them, so they're elsewhere in the book. Go figure.)

Transliteration returns the number of characters replaced or deleted. If no string is specified via the =~ or !~ operator, the $_ string is altered. The SEARCHLIST and REPLACEMENTLIST may define ranges of sequential characters with a dash:

$message =~ tr/A-Za-z/N-ZA-Mn-za-m/;    # rot13 encryption.
Note that a range like A-Z assumes a linear character set like ASCII. But each character set has its own ideas of how characters are ordered and thus of which characters fall in a particular range. A sound principle is to use only ranges that begin and end with letters of the same case (a-e, A-E) or with digits (0-4). Anything else is suspect. When in doubt, spell out the character sets in full: ABCDE.

The SEARCHLIST and REPLACEMENTLIST are not variable interpolated as double-quoted strings; you may, however, use those backslash sequences that map to a specific character, such as \n or \015.

Table 5-3 lists the modifiers applicable to the tr/// operator. They're completely different from those you apply to m//, s///, or qr//, even if some look the same.

Table 5-3. tr/// Modifiers

Modifier Meaning
/c Complement SEARCHLIST.
/d Delete found but unreplaced characters.
/s Squash duplicate replaced characters.

If the /c modifier is specified, the character set in SEARCHLIST is complemented; that is, the effective search list consists of all the characters not in SEARCHLIST. In the case of Unicode, this can represent a lot of characters, but since they're stored logically, not physically, you don't need to worry about running out of memory.

The /d modifier turns tr/// into what might be called the "transobliteration" operator: any characters specified by SEARCHLIST but not given a replacement in REPLACEMENTLIST are deleted. (This is slightly more flexible than the behavior of some tr(1) programs, which delete anything they find in SEARCHLIST, period.)

If the /s modifier is specified, sequences of characters converted to the same character are squashed down to a single instance of the character.

If the /d modifier is used, REPLACEMENTLIST is always interpreted exactly as specified. Otherwise, if REPLACEMENTLIST is shorter than SEARCHLIST, the final character is replicated until it is long enough. If REPLACEMENTLIST is null, the SEARCHLIST is replicated, which is surprisingly useful if you just want to count characters, not change them. It's also useful for squashing characters using /s.

tr/aeiou/!/;                 # change any vowel into !
tr{/\\\r\n\b\f. }{_};        # change strange chars into an underscore

tr/A-Z/a-z/ for @ARGV;       # canonicalize to lowercase ASCII

$count = ($para =~ tr/\n//); # count the newlines in $para
$count = tr/0-9//;           # count the digits in $_

$word =~ tr/a-zA-Z//s;       # bookkeeper -> bokeper

tr/@$%*//d;                  # delete any of those
tr#A-Za-z0-9+/##cd;          # remove non-base64 chars

# change en passant
($HOST = $host) =~ tr/a-z/A-Z/;

$pathname =~ tr/a-zA-Z/_/cs; # change non-(ASCII)alphas to single underbar

tr [\200-\377]
   [\000-\177];              # strip 8th bit, bytewise
If the same character occurs more than once in SEARCHLIST, only the first is used. Therefore, this:
tr/AAA/XYZ/
will change any single character A to an X (in $_).
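To tie the modifiers and the return value together, here is a short sketch that applies several of the forms above to sample strings of our own and keeps the results in named variables:

```perl
use strict;
use warnings;

# Empty REPLACEMENTLIST without /d: count only, the string is unchanged.
my $para = "line one\nline two\n";
my $newlines = ($para =~ tr/\n//);

# /s squashes runs of the same (replaced) character.
(my $squashed = "bookkeeper") =~ tr/a-zA-Z//s;

# /d deletes characters that have no replacement.
(my $deleted = 'a@b$c%d*') =~ tr/@$%*//d;

# /c complements the search list; /s squashes the run of underscores.
(my $safe = "foo//bar.baz") =~ tr/a-zA-Z/_/cs;

# Only the first occurrence of a repeated search character counts.
(my $dup = "AAA") =~ tr/AAA/XYZ/;

print "$newlines $squashed $deleted $safe $dup\n";
```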

Although variables aren't interpolated into tr///, you can still get the same effect by using eval EXPR:

$count = eval "tr/$oldlist/$newlist/";
die if $@;  # propagates exception from illegal eval contents
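A slightly fuller sketch of the same trick follows. The validation step is our own precaution rather than anything the operator requires: because the lists are pasted into code that is then compiled, anything other than plain word characters and dashes is rejected before the eval. The variable names and data are invented.

```perl
use strict;
use warnings;

my ($oldlist, $newlist) = ("a-c", "A-C");
my $text = "abcabc xyz";

# Refuse anything that could end the tr/// early or inject code.
for my $list ($oldlist, $newlist) {
    die "unsafe tr list: $list\n" unless $list =~ /\A[\w-]+\z/;
}

# The string becomes: $text =~ tr/a-c/A-C/
# The eval sees the lexical $text because it is in scope here.
my $count = eval "\$text =~ tr/$oldlist/$newlist/";
die $@ if $@;

print "$text ($count changed)\n";
```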

One more note: if you want to change your text to uppercase or lowercase, don't use tr///. Use the \U or \L sequences in a double-quoted string (or the equivalent uc and lc functions) since they will pay attention to locale or Unicode information and tr/a-z/A-Z/ won't. Additionally, in Unicode strings, the \u sequence and its corresponding ucfirst function understand the notion of titlecase, which for some languages may be distinct from simply converting to uppercase.
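The difference shows up as soon as a string contains a letter outside a-z. In this sketch the sample characters are our own choice: U+0101 (a with macron) and U+01C6 (the dz-with-caron digraph, which has a separate titlecase form). The code prints code points rather than the characters themselves to stay independent of your terminal's encoding.

```perl
use strict;
use warnings;

my $word = "\x{101}bc";                 # a-with-macron, then "bc"

# tr/// only knows the ranges you gave it, so U+0101 is left alone.
(my $via_tr = $word) =~ tr/a-z/A-Z/;

# uc consults Unicode case mappings and produces U+0100.
my $via_uc = uc $word;

# ucfirst yields the titlecase form, which differs from uppercase here.
my $digraph = "\x{1C6}";
my $title   = ucfirst $digraph;         # U+01C5
my $upper   = uc $digraph;              # U+01C4

printf "tr: U+%04X  uc: U+%04X\n", ord($via_tr), ord($via_uc);
printf "ucfirst: U+%04X  uc: U+%04X\n", ord($title), ord($upper);
```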




Copyright © 2002 O'Reilly & Associates. All rights reserved.