7.4 More on the Matching Operator

We have already looked at the simplest uses of the matching operator (a regular expression enclosed in slashes). Now let's look at a zillion ways to make this operator do something slightly different.

7.4.1 Selecting a Different Target (the `=~` Operator)

Usually the string you'll want to match your pattern against is not within the $_ variable, and it would be a nuisance to put it there (perhaps you already have a value in $_ you're quite fond of). No problem. The =~ operator helps us here. This operator takes a regular expression operator on the right side, and changes the target of that operator from the $_ variable to the value named on the left side of the =~ operator. For example:

$a = "hello world";
$a =~ /^he/;         # true
$a =~ /(.)\1/;       # also true (matches the double l)
if ($a =~ /(.)\1/) { # true, so yes...
                     # some stuff
}

The target of the =~ operator can be any expression that yields some scalar string value. For example, <STDIN> yields a scalar string value when used in a scalar context, so we can combine this with the =~ operator and a regular expression match operator to get a compact check for particular input, as in:

print "any last request? ";
if (<STDIN> =~ /^[yY]/) { # does the input begin with a y?
  print "And just what might that request be? ";
  <STDIN>; # discard a line of standard input
  print "Sorry, I'm unable to do that.\n";
}

In this case, <STDIN> yields the next line from standard input, which is then immediately used as the string to match against the pattern ^[yY]. Note that you never stored the input in a variable, so if you wanted to match the input against another pattern, or possibly echo the data out in an error message, you'd be out of luck. But this form frequently comes in handy.
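If you do need the input again, one way around the problem is to copy it into a variable first and bind each match to that variable. Here is a minimal sketch: the variable names are made up for illustration, and a fixed string stands in for <STDIN> so the example runs unattended.

```perl
use strict;
use warnings;

# Stand-in for a line read with <STDIN>; the trailing newline mimics real input.
my $answer = "Yes, please\n";
chomp($answer);                                      # drop the newline before reusing the text

my $wants_request = ($answer =~ /^[yY]/)  ? 1 : 0;   # same test as in the example above
my $is_polite     = ($answer =~ /please/) ? 1 : 0;   # a second match against the same data

print "You said: $answer\n" if $wants_request;       # the input is still around to echo
```

The cost is one extra variable; the benefit is that the same line can be tested as many times as you like.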

7.4.2 Ignoring Case

In the previous example, we used [yY] to match either a lower- or uppercase y. For very short strings, such as y or fred, this match is easy enough, as in [fF][rR][eE][dD]. But what if the string you wanted to match was the word "procedure" in either lower- or uppercase?

In the Windows NT findstr command, a /i flag indicates "ignore case." Perl also has such an option. You indicate the ignore-case option by appending a lowercase i to the closing slash, as in /somepattern/i. This says that the letters of the pattern will match letters in the string in either case. For example, to match the word procedure in either case at the beginning of the line, use /^procedure/i.

Now our previous example looks like this:

print "any last request? ";
if (<STDIN> =~ /^y/i) { # does the input begin with a y?
  # yes! deal with it
  ...
}
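To see the /i option at work on the procedure pattern, here is a small sketch that filters a list of made-up sample lines (the list and the variable names are only for illustration):

```perl
use strict;
use warnings;

# Hypothetical sample lines; only the first two begin with "procedure" in some case.
my @lines = ("Procedure one", "PROCEDURE two", "a procedure", "process");

# grep keeps the elements for which the case-insensitive match succeeds.
my @hits = grep { /^procedure/i } @lines;

print "$_\n" foreach @hits;   # prints "Procedure one" and "PROCEDURE two"
```

The third line is rejected by the ^ anchor, not by the case of its letters.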

7.4.3 Using a Different Delimiter

If you are looking for a string with a regular expression that contains slash characters (/ ), you must precede each slash with a backslash (\ ). For example, you can look for a string that begins with /wwwroot/docs like this:

$path = <STDIN>; # read a pathname (from "find" perhaps?)
if ($path =~ /^\/wwwroot\/docs/) {
  # begins with /wwwroot/docs...
}

As you can see, the backslash-slash combination makes this example look as if there are little valleys between the text pieces. Using this combination for a lot of slash characters can get cumbersome, so Perl allows you to specify a different delimiter character. Simply precede any nonalphanumeric, nonwhitespace character[5] (your selected delimiter) with an m, then list your pattern followed by another identical delimiter character, and you're done, as in:

[5] If the delimiter happens to be the left character of a left-right pair (parentheses, braces, angle brackets, or square brackets), the closing delimiter is the corresponding right character of the same pair. Otherwise, the opening and closing delimiters are the same character.

/^\/wwwroot\/docs/ # using standard slash delimiter
m@^/wwwroot/docs@  # using @ for a delimiter
m#^/wwwroot/docs#  # using # for a delimiter (my favorite)

You can even use slashes again if you want, as in m/fred/ . So the common regular-expression matching operator is really the m operator; however, the m is optional if you choose slash for a delimiter.
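The following sketch checks one made-up pathname with three delimiter styles, including a left-right pair as described in the footnote. All three matches test the same pattern; the variable names are only for illustration.

```perl
use strict;
use warnings;

my $path = "/wwwroot/docs/index.html";   # hypothetical pathname

my $slashes = ($path =~ /^\/wwwroot\/docs/) ? 1 : 0;  # default delimiter, slashes escaped
my $braces  = ($path =~ m{^/wwwroot/docs})  ? 1 : 0;  # left-right pair: { opens, } closes
my $bang    = ($path =~ m!^/wwwroot/docs!)  ? 1 : 0;  # same character at both ends

print "all three agree\n" if $slashes && $braces && $bang;
```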

7.4.4 Using Variable Interpolation

A regular expression is variable interpolated before it is considered for other special characters. As a result, you can construct a regular expression from computed strings rather than just literals. For example:

$what = "bird";
$sentence = "Every good bird does fly.";
if ($sentence =~ /\b$what\b/) {
  print "The sentence contains the word $what!\n";
}

In this example we have effectively constructed the regular expression operator /\bbird\b/ using a variable reference.

Here's a slightly more complicated example:

$sentence = "Every good bird does fly.";
print "What should I look for? ";
$what = <STDIN>;
chomp($what);
if ($sentence =~ /$what/) { # found it!
  print "I saw $what in $sentence.\n";
} else {
  print "nope... didn't find it.\n";
}

If you enter bird, it is found. If you enter scream, it isn't. If you enter [bw]ird, that's also found, showing that the regular expression pattern-matching characters are indeed still significant.

How would you make them insignificant? You'd have to arrange for the non-alphanumeric characters to be preceded by a backslash, which would then turn them into literal matches. That process sounds hard, unless you have the \Q quoting escape at your disposal:

$what = "[box]";
foreach (qw(in[box] out[box] white[sox])) {
  if (/\Q$what\E/) {
    print "$_ matched!\n";
  }
}

Here, the \Q$what\E construct turns into \[box\], making the match look for a literal pair of enclosing brackets, instead of treating the whole thing as a character class.
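If you want to see the quoted string itself, the built-in quotemeta function applies the same backslashing as \Q. The sketch below stores the result in a variable (named $quoted here purely for illustration) and uses it in the match:

```perl
use strict;
use warnings;

my $what   = "[box]";
my $quoted = quotemeta($what);     # built-in function; yields \[box\]

# Only the strings containing a literal "[box]" survive the filter.
my @matched = grep { /$quoted/ } qw(in[box] out[box] white[sox]);

print "pattern: $quoted\n";        # prints: pattern: \[box\]
print "$_ matched!\n" foreach @matched;
```

Without the quoting, /[box]/ would be a character class and would match all three strings, because each contains a b, an o, or an x.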

7.4.5 Special Read-Only Variables

After a successful pattern match, the variables $1, $2, $3, and so on are set to the same values that \1, \2, \3, and so on had inside the pattern. You can use this feature to look at a piece of the match in later code. For example:

$_ = "this is a test";
/(\w+)\W+(\w+)/; # match first two words
                 # $1 is now "this" and $2 is now "is"

You can also gain access to the same values ($1, $2, $3, and so on) by placing a match in a list context. The result is a list of values from $1 up to the number of memorized things, but only if the regular expression matches. (If the match fails, the result is an empty list, so any variables you assign it to end up undefined.) Taking that last example in another way:

$_ = "this is a test";
($first, $second) = /(\w+)\W+(\w+)/; # match first two words
                # $first is now "this" and $second is now "is"
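Because a failed match leaves the assigned variables undefined, it is often worth testing the assignment itself. The following sketch shows both outcomes, with made-up input strings and variable names:

```perl
use strict;
use warnings;

my $text = "oneword";    # hypothetical input with no second word

# The pattern needs two words, so the match fails and returns an empty list.
my ($first, $second) = $text =~ /(\w+)\W+(\w+)/;
my $both_undef = (!defined($first) && !defined($second)) ? 1 : 0;

# A list assignment used as a condition is true when the right side
# produced at least one element, that is, when the match succeeded.
my ($w1, $w2);
my $matched = 0;
if (($w1, $w2) = "this is a test" =~ /(\w+)\W+(\w+)/) {
    $matched = 1;
    print "first=$w1 second=$w2\n";   # prints: first=this second=is
}
```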

Other predefined read-only variables include $&, which is the part of the string that matched the regular expression; $`, which is the part of the string before the part that matched; and $', which is the part of the string after the part that matched. For example:

$_ = "this is a sample string";
/sa.*le/; # matches "sample" within the string
# $` is now "this is a "
# $& is now "sample"
# $' is now " string"

Because these variables are set on each successful match, you should save the values elsewhere if you need them later in the program.[6]

[6] See O'Reilly's Mastering Regular Expressions for the performance ramifications of using these variables.
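Saving the values can be as simple as copying them into ordinary variables right after the match. In this minimal sketch the variable names are invented for illustration, and a second match shows the special variables being overwritten:

```perl
use strict;
use warnings;

my $text = "this is a sample string";
my ($before, $matched, $after);

if ($text =~ /sa.*le/) {
    # Copy the values right away; the next successful match overwrites them.
    ($before, $matched, $after) = ($`, $&, $');
}

$text =~ /string/;     # a later successful match changes $`, $&, and $'
my $now = $&;          # $& now holds "string", not "sample"

print "saved: [$before][$matched][$after]\n";  # prints: saved: [this is a ][sample][ string]
print "now:   [$now]\n";                       # prints: now:   [string]
```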