"If this pattern matching thing is so powerful and so fantastic," you
may be asking, "why don't you have a hundred different recipes on
regular expressions in this chapter?" Regular expressions are the
natural solution to many problems involving numbers, strings, dates,
web documents, mail addresses, and almost everything else in this
book; we use pattern matching over 100 times in other chapters. This
chapter mostly presents recipes in which pattern matching forms part
of the questions, not just part of the answers.
Perl's extensive and integrated support for regular expressions means
that you not only have features available that you won't find in any
other language, but you have new ways of using them, too. Programmers
new to Perl often look for functions like these:
match( $string, $pattern );
subst( $string, $pattern, $replacement );
but matching and substituting are such common tasks that they merit
their own notation:
$meadow =~ m/sheep/; # True if $meadow contains "sheep"
$meadow !~ m/sheep/; # True if $meadow doesn't contain "sheep"
$meadow =~ s/old/new/; # Replace "old" with "new" in $meadow
Pattern matching isn't like direct string comparison, even at its
simplest level; it's more like string searching with mutant wildcards
on steroids. Without anchors, the position where the match occurs can
float freely throughout the string. Any of the following lines would
also be matched by the expression $meadow =~ /ovine/, giving false
positives when looking for lost sheep:
Fine bovines demand fine toreadors.
Muskoxen are a polar ovibovine species.
Grooviness went out of fashion decades ago.
Sometimes they're right in front of you but they still don't match:
Ovines are found typically in oviaries.
The problem is that while you are probably thinking in some human
language, the pattern-matching engine most assuredly is not. When the
engine is presented with the pattern /ovine/ and a
string to match against, it searches the string for an
"o" that is immediately followed by a
"v", then by an "i", then by an
"n", and then finally by an
"e". What comes before or after that sequence
doesn't matter. Additionally, those letters are matched
case-sensitively. That's why it didn't find
"Ovines", since that string starts with a capital
letter.
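To see this for yourself, here is a minimal sketch that runs the bare pattern /ovine/ against the four sample lines above. The variable names (@lines, %matched) are ours, chosen only for illustration.

```perl
use strict;
use warnings;

# The four sample lines quoted in the text above.
my @lines = (
    "Fine bovines demand fine toreadors.",
    "Muskoxen are a polar ovibovine species.",
    "Grooviness went out of fashion decades ago.",
    "Ovines are found typically in oviaries.",
);

my %matched;    # hypothetical bookkeeping hash: line => 1 or 0
for my $line (@lines) {
    # Unanchored and case-sensitive: "ovine" may start anywhere in the line.
    $matched{$line} = ($line =~ /ovine/) ? 1 : 0;
    printf "%-5s %s\n", ($matched{$line} ? "match" : "miss"), $line;
}
```

The first three lines should report a match and the last one a miss, because its only candidate begins with a capital "O".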
As you find your patterns matching some strings you don't want them
to match and not matching other strings that you do want them to
match, you start embellishing. When looking for nothing but sheep,
you probably want to match a pattern more like this:
if ($meadow =~ /\bovines?\b/i) { print "Here be sheep!" }
As you see, certain character sequences have special meaning to the
pattern-matching engine, often standing in for several possible
literal characters. These so-called metacharacters let you do such
things as restrict the pattern to the start or end of the string,
give alternatives for parts of a pattern, allow repetition and
wildcarding, and remember part of the matching substring for use
later in the pattern or in code.
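As a rough illustration of a few of these metacharacters at work, the sketch below applies the embellished sheep pattern to the same four sample lines. We have added capturing parentheses, and the @hits array is a name of our own.

```perl
use strict;
use warnings;

my @lines = (
    "Fine bovines demand fine toreadors.",
    "Muskoxen are a polar ovibovine species.",
    "Grooviness went out of fashion decades ago.",
    "Ovines are found typically in oviaries.",
);

my @hits;    # hypothetical list of the words that really were sheep
for my $line (@lines) {
    # \b  matches only at a word boundary
    # s?  makes the trailing "s" optional
    # ( ) remembers the matched word in $1
    # /i  ignores alphabetic case
    if ($line =~ /\b(ovines?)\b/i) {
        push @hits, $1;
        print "Here be sheep ($1): $line\n";
    }
}
```

Only the last line should produce output now: the word boundaries rule out "bovines", "ovibovine", and "Grooviness", while /i lets the capitalized "Ovines" through.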
Learning the syntax of pattern matching isn't as daunting as it might
appear. Sure, there are a lot of symbols, but each has a reason for
existing. Regular expressions aren't random jumbles of
punctuation—they're carefully thought-out jumbles of
punctuation! If you forget one, you can always look it up. Summary
tables are included in Programming Perl,
Learning Perl, Mastering Regular
Expressions, and the perlre(1) and
perlop(1) manpages included with every Perl
installation.
6.0.2. The Tricky Bits
Much trickier than the syntax of regular expressions is their sneaky
semantics. The three aspects of pattern-matching behavior that seem
to cause folks the most trouble are greed, eagerness, and
backtracking—and also how these three interact with each other.
Greed is the principle that if
a standard quantifier (such as *) can match a
varying number of times, it matches as long a substring as it can.
This is explained in Recipe 6.15.
Eagerness is the notion that the leftmost match wins. The engine is
eager to return you a match as quickly as possible, sometimes even
before you are expecting it. Consider the match "Fred" =~ /x*/. If
asked to explain this in plain language, you might say "Does the
string 'Fred' contain any x's?" If so, you might be surprised to
learn that it seems to. That's because /x*/ doesn't truly mean "any
x's," unless your idea of "any" includes nothing at all. Formally, it
means zero or more of them, and here zero sufficed for the eager
matcher.
A more illustrative example of eagerness would be the following:
$string = "good food";
$string =~ s/o*/e/;
Can you guess which of the following is in $string
after that substitution?
good food
geod food
geed food
geed feed
ged food
ged fed
egood food
The correct answer is the last one, because the earliest point at
which zero or more occurrences of "o" could be
found was right at the beginning of the string. Surprised? Regular
expressions can do that to you if you're unfamiliar with their
semantics.
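The following sketch replays both eagerness examples and, for contrast, shows what happens when the quantifier is changed to one that demands at least one "o". The variable names are ours.

```perl
use strict;
use warnings;

# /x*/ succeeds at offset 0 by matching zero x's.
my $fred = "Fred";
my $fred_report = $fred =~ /x*/
    ? "matched '" . $& . "' at offset " . $-[0]
    : "no match";

my $star = "good food";
$star =~ s/o*/e/;     # zero o's match right before the "g"

my $plus = "good food";
$plus =~ s/o+/e/;     # + needs at least one "o", so the first "oo" is replaced

my $all = "good food";
$all =~ s/o+/e/g;     # /g repeats the substitution for every run of o's

print "$fred_report\n$star\n$plus\n$all\n";
```

If what you wanted was "ged food" or "ged fed", the fix is not to fight the eagerness but to pick a quantifier that cannot match the empty string.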
Here's another example of where greed takes a back seat to eagerness:
$ echo longest | perl -ne 'print "$&\n" if /long|longer|longest/'
long
That's because Perl uses
what's called a traditional NFA,[10] a
non-deterministic finite automaton. This kind of matching engine is
not guaranteed to return the longest overall
match, just the first match. You might think of Perl's greed as being
left-to-right directed, not globally greedy.
[10]As opposed to a
POSIX-style NFA. See Mastering Regular
Expressions for the differences.
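One way to work with this left-to-right behavior, sketched below under our own variable names, is either to list the longer alternatives first or to add something after the alternation that forces the engine to keep trying.

```perl
use strict;
use warnings;

my $word = "longest";

# Alternatives are tried left to right, so "long" succeeds first.
my ($first) = $word =~ /(long|longer|longest)/;

# Listing the longest alternative first lets it win.
my ($reordered) = $word =~ /(longest|longer|long)/;

# Anchoring the end makes "long" and "longer" fail,
# so the engine moves on to "longest".
my ($anchored) = $word =~ /^(long|longer|longest)$/;

print "$first $reordered $anchored\n";
```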
NFAs can be slow, but significant performance gains can be made by
rewriting the patterns to exploit how the particular NFA
implementation runs. This is a major part of Jeffrey Friedl's book,
Mastering Regular Expressions.
The last and most powerful of the three tricky bits in pattern matching
is backtracking. For a pattern to match, the entire regular
expression must match, not just part of it. So if the beginning of a
pattern containing a quantifier succeeds in a way that causes later
parts in the pattern to fail, the matching engine backs up and tries
to find another match for the beginning part—that's why it's
called backtracking. It means that the engine is going to try
different possibilities, systematically investigating alternate
matches until it finds one that works. In some pattern-matching
implementations, the engine keeps backtracking in case other
submatches make the overall match longer. Perl's matcher doesn't do
that; as soon as one possibility works, it uses that—until and
unless something later in the pattern fails, forcing a backtrack to
retry another possible way of matching. This is discussed in Recipe 6.16.
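A small sketch may make backtracking easier to picture. In both patterns below, the greedy first group initially grabs the whole string and must then give characters back before the rest of the pattern can succeed. The sample strings and variable names are ours.

```perl
use strict;
use warnings;

my $digits = "12345";

# (\d+) first takes all five digits, leaving nothing for (\d\d).
# The engine backs up one digit at a time until the rest can match.
my ($head, $tail) = $digits =~ /^(\d+)(\d\d)$/;

my $letters = "aaa";

# (a*) first takes all three a's, then surrenders one so (a+) can match.
my ($most, $rest) = $letters =~ /^(a*)(a+)$/;

print "head=$head tail=$tail\n";
print "most=$most rest=$rest\n";
```

In each case the first group ends up with as much as it can keep while still letting the overall match succeed.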
6.0.3. Pattern-Matching Modifiers
Pattern-matching modifiers are a lot
easier to list and learn than the different metacharacters. Table 6-1 contains a brief summary of them.
Table 6-1. Pattern-matching modifiers and their meanings
Modifier | Meaning
/i | Ignore alphabetic case
/x | Ignore most whitespace in pattern and permit comments
/g | Global: match/substitute as often as possible
/gc | Don't reset search position on failed match
/s | Let . match newline
/m | Let ^ and $ match next to embedded \n
/o | Compile pattern once only
/e | Righthand side of an s/// is code whose result is used as the replacement value
/ee | Righthand side of an s/// is a string that's eval'd twice; the final result then used as the replacement value
/i and /g are the most commonly used modifiers. The
pattern /ram/i matches "ram",
"RAM", "Ram", and so forth.
Backreferences are checked case-insensitively if this modifier is on;
see Recipe 6.16 for an example. Case-insensitive matching honors the
user's current locale settings when the use locale pragma is in
effect.
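Modifiers can be combined. The sketch below pairs /i with /x from Table 6-1, so the pattern can be spread out and commented. We have added word boundaries to the "ram" pattern, and the test words are ours.

```perl
use strict;
use warnings;

# Under /x, whitespace in the pattern is ignored and # starts a comment.
my $word_ram = qr/
    \b ram \b     # the whole word "ram", in any mix of case
/xi;

my @found = grep { $_ =~ $word_ram } qw(ram RAM Ram ramp);
print "@found\n";
```

The word "ramp" is left out only because of the \b we added; the plain /ram/i from the text would match it as well.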
The /g modifier is
used with s/// to replace every non-overlapping
match, not just the first one. /g is also used
with m// in loops to find (but not replace) every
matching occurrence:
while (m/(\d+)/g) {
print "Found number $1\n";
}
Used on m// in list context, /g
pulls out all matches:
@numbers = m/(\d+)/g;
That finds only non-overlapping matches. To get overlapping ones, you
have to be much sneakier and use a zero-width look-ahead with the
(?=...) construct. Because it's zero-width, the
match engine hasn't advanced at all. Within the look-ahead, capturing
parentheses are used to grab the thing anyway. Although we've saved
something, Perl notices we haven't made any forward progress on the
/g, so it bumps us forward one character position.
This shows the difference:
$digits = "123456789";
@nonlap = $digits =~ /(\d\d\d)/g;
@yeslap = $digits =~ /(?=(\d\d\d))/g;
print "Non-overlapping: @nonlap\n";
print "Overlapping: @yeslap\n";
Non-overlapping: 123 456 789
Overlapping: 123 234 345 456 567 678 789
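The scalar-context and list-context uses of /g described earlier can be compared side by side. This is a sketch on made-up sample text; it also calls pos, which reports where the next /g match on that string would resume.

```perl
use strict;
use warnings;

my $text = "3 sheep, 12 goats, 150 chickens";    # made-up sample data

# Scalar context: each iteration resumes where the previous match ended.
my @seen;
while ($text =~ /(\d+)/g) {
    push @seen, "$1 ends at offset " . pos($text);
}

# List context: every capture comes back at once.
my @numbers = $text =~ /(\d+)/g;

print "$_\n" for @seen;
print "All numbers: @numbers\n";
```

The loop form is handy when you need to do some work per match; the list form is the shorter choice when you only want the matched values.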
The /s and
/m modifiers are useful when matching strings with
embedded newlines. /s makes dot match
"\n", something it doesn't normally do; it also
makes the match ignore the value of the old, deprecated
$* variable. /m makes
^ and $ match after and before
"\n", respectively. They are useful with paragraph
slurping mode as explained in the Introduction to Chapter 8, and in Recipe 6.6.
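Here is a brief sketch of what /m and /s change, using a made-up three-line string. The variable names are ours.

```perl
use strict;
use warnings;

my $para = "first line\nsecond line\nthird line\n";    # made-up sample text

# Without /m, ^ matches only at the very start of the string.
my @plain = $para =~ /^(\w+)/g;

# With /m, ^ also matches just after each embedded newline.
my @multi = $para =~ /^(\w+)/mg;

# Without /s, dot refuses to cross a newline; with /s it will.
my ($short) = $para =~ /first(.*)second/;
my ($long)  = $para =~ /first(.*)second/s;

print "plain: @plain\n";
print "multi: @multi\n";
print "without /s: ", (defined $short ? "matched" : "no match"), "\n";
print "with /s: ",    (defined $long  ? "matched" : "no match"), "\n";
```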
The /e modifier is
used on replacements so that the righthand part is run as code and
its return value is used as the replacement string.
s/(\d+)/sprintf("%#x", $1)/ge
converts all numbers into hex, changing, for example,
2581 into
0xa15.
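As a runnable sketch of the same substitution, the code below applies it to a copy of a made-up string so the original is left untouched. The sample text and variable names are ours.

```perl
use strict;
use warnings;

my $text = "between 255 and 4096 bytes";    # made-up sample text

# Copy $text into $hex, then substitute on the copy.
# /e runs sprintf as code; /g applies it to every number.
(my $hex = $text) =~ s/(\d+)/sprintf("%#x", $1)/ge;

print "$hex\n";
```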
Because
different countries have different ideas of what constitutes an
alphabet, the POSIX standard provides systems (and thus programs)
with a standard way of representing alphabets, character set
ordering, and so on. Perl gives you access to some of these through
the use locale pragma; see the
perllocale manpage for more information. When
use locale is in effect, the
\w character class includes accented and other
exotic characters. The case-changing escapes \u,
\U, \l, and
\L (and the corresponding functions uc,
ucfirst, and so on) also respect
use locale, so σ
will be turned into Σ with \u if the
locale says it should. (This only matters in 8-bit encodings, such as
ISO 8859-7 for the Greek character set. If those characters had been
in Unicode, case translation would always apply, irrespective of
current locale setting.)