home | O'Reilly's CD bookshelfs | FreeBSD | Linux | Cisco | Cisco Exam  


Book Home Programming PerlSearch this book

5.9. Staying in Control

As any good manager knows, you shouldn't micromanage your employees. Just tell them what you want, and let them figure out the best way of doing it. Similarly, it's often best to think of a regular expression as a kind of specification: "Here's what I want; go find a string that fits the bill."

On the other hand, the best managers also understand the job their employees are trying to do. The same is true of pattern matching in Perl. The more thoroughly you understand of how Perl goes about the task of matching any particular pattern, the more wisely you'll be able to make use of Perl's pattern matching capabilities.

One of the most important things to understand about Perl's pattern-matching is when not to use it.

5.9.1. Letting Perl Do the Work

When people of a certain temperament first learn regular expressions, they're often tempted to see everything as a problem in pattern matching. And while that may even be true in the larger sense, pattern matching is about more than just evaluating regular expressions. It's partly about looking for your car keys where you dropped them, not just under the streetlamp where you can see better. In real life, we all know that it's a lot more efficient to look in the right places than the wrong ones.

Similarly, you should use Perl's control flow to decide which patterns to execute, and which ones to skip. A regular expression is pretty smart, but it's smart like a horse. It can get distracted if it sees too much at once. So sometimes you have to put blinders onto it. For example, you'll recall our earlier example of alternation:

/Gandalf|Saruman|Radagast/
That works as advertised, but not as well as it might, because it searches every position in the string for every name before it moves on to the next position. Astute readers of The Lord of the Rings will recall that, of the three wizards named above, Gandalf is mentioned much more frequently than Saruman, and Saruman is mentioned much more frequently than Radagast. So it's generally more efficient to use Perl's logical operators to do the alternation:
/Gandalf/ || /Saruman/ || /Radagast/
This is yet another way of defeating the "leftmost" policy of the Engine. It only searches for Saruman if Gandalf was nowhere to be seen. And it only searches for Radagast if Saruman is also absent.

Not only does this change the order in which things are searched, but it sometimes allows the regular expression optimizer to work better. It's generally easier to optimize searching for a single string than for several strings simultaneously. Similarly, anchored searches can often be optimized if they're not too complicated.

You don't have to limit your control of the control flow to the || operator. Often you can control things at the statement level. You should always think about weeding out the common cases first. Suppose you're writing a loop to process a configuration file. Many configuration files are mostly comments. It's often best to discard comments and blank lines early before doing any heavy-duty processing, even if the heavy duty processing would throw out the comments and blank lines in the course of things:

while (<CONF>) {
    next if /^#/;
    next if /^\s*(#|$)/;
    chomp;
    munchabunch($_);
}
Even if you're not trying to be efficient, you often need to alternate ordinary Perl expressions with regular expressions simply because you want to take some action that is not possible (or very difficult) from within the regular expression, such as printing things out. Here's a useful number classifier:
warn "has nondigits"        if     /\D/;
warn "not a natural number" unless /^\d+$/;             # rejects -3
warn "not an integer"       unless /^-?\d+$/;           # rejects +3
warn "not an integer"       unless /^[+-]?\d+$/;
warn "not a decimal number" unless /^-?\d+\.?\d*$/;     # rejects .2
warn "not a decimal number" unless /^-?(?:\d+(?:\.\d*)?|\.\d+)$/;
warn "not a C float"
       unless /^([+-]?)(?=\d|\.\d)\d*(\.\d*)?([Ee]([+-]?\d+))?$/;
We could stretch this section out a lot longer, but really, that sort of thing is what this whole book is about. You'll see many more examples of the interplay of Perl code and pattern matching as we go along. In particular, see the later section Section 5.10.3, "Programmatic Patterns". (It's okay to read the intervening material first, of course.)

5.9.2. Variable Interpolation

Using Perl's control flow mechanisms to control regular expression matching has its limits. The main difficulty is that it's an "all or nothing" approach; either you run the pattern, or you don't. Sometimes you know the general outlines of the pattern you want, but you'd like to have the capability of parameterizing it. Variable interpolation provides that capability, much like parameterizing a subroutine lets you have more influence over its behavior than just deciding whether to call it or not. (More about subroutines in the next chapter).

One nice use of interpolation is to provide a little abstraction, along with a little readability. With regular expressions you may certainly write things concisely:

if ($num =~ /^[-+]?\d+\.?\d*$/) { ... }
But what you mean is more apparent when you write:
$sign = '[-+]?';
$digits = '\d+';
$decimal = '\.?';
$more_digits = '\d*';
$number = "$sign$digits$decimal$more_digits";
...
if ($num =~ /^$number$/o) { ... }
We'll cover this use of interpolation more under "Generated patterns" later in this chapter. We'll just point out that we used the /o modifier to suppress recompilation because we don't expect $number to change its value over the course of the program.

Another cute trick is to turn your tests inside out and use the variable string to pattern-match against a set of known strings:

chomp($answer = <STDIN>);
if    ("SEND"  =~ /^\Q$answer/i) { print "Action is send\n"  }
elsif ("STOP"  =~ /^\Q$answer/i) { print "Action is stop\n"  }
elsif ("ABORT" =~ /^\Q$answer/i) { print "Action is abort\n" }
elsif ("LIST"  =~ /^\Q$answer/i) { print "Action is list\n"  }
elsif ("EDIT"  =~ /^\Q$answer/i) { print "Action is edit\n"  }
This lets your user perform the "send" action by typing any of S, SE, SEN, or SEND (in any mixture of upper- and lowercase). To "stop", they'd have to type at least ST (or St, or sT, or st).

5.9.2.1. When backslashes happen

When you think of double-quote interpolation, you usually think of both variable and backslash interpolation. But as we mentioned earlier, for regular expressions there are two passes, and the interpolation pass defers most of the backslash interpretation to the regular expression parser (which we discuss later). Ordinarily, you don't notice the difference, because Perl takes pains to hide the difference. (One sequence that's obviously different is the \b metasymbol, which turns into a word boundary assertion--outside of character classes, anyway. Inside a character class where assertions make no sense, it reverts to being a backspace, as it is normally.)

It's actually fairly important that the regex parser handle the backslashes. Suppose you're searching for tab characters in a pattern with a /x modifier:

($col1, $col2) = /(.*?) \t+ (.*?)/x;
If Perl didn't defer the interpretation of \t to the regex parser, the \t would have turned into whitespace, which the regex parser would have ignorantly ignored because of the /x. But Perl is not so ignoble, or tricky.

You can trick yourself though. Suppose you abstracted out the column separator, like this:

$colsep = "\t+";                         # (double quotes)
($col1, $col2) = /(.*?) $colsep (.*?)/x;
Now you've just blown it, because the \t turns into a real tab before it gets to the regex parser, which will think you said /(.*?)+(.*?)/ after it discards the whitespace. Oops. To fix, avoid /x, or use single quotes. Or better, use qr//. (See the next section.)

The only double-quote escapes that are processed as such are the six translation escapes: \U, \u, \L, \l, \Q, and \E. If you ever look into the inner workings of the Perl regular expression compiler, you'll find code for handling escapes like \t for tab, \n for newline, and so on. But you won't find code for those six translation escapes. (We only listed them in Table 5-7 because people expect to find them there.) If you somehow manage to sneak any of them into the pattern without going through double-quotish evaluation, they won't be recognized.

How could they find their way in? Well, you can defeat interpolation by using single quotes as your pattern delimiter. In m'...', qr'...', and s'...'...', the single quotes suppress variable interpolation and the processing of translation escapes, just as they would in a single-quoted string. Saying m'\ufrodo' won't find a capitalized version of poor frodo. However, since the "normal" backslash characters aren't really processed on that level anyway, m'\t\d' still matches a real tab followed by any digit.

Another way to defeat interpolation is through interpolation itself. If you say:

$var = '\U';
/${var}frodo/;
poor frodo remains uncapitalized. Perl won't redo the interpolation pass for you just because you interpolated something that looks like it might want to be reinterpolated. You can't expect that to work any more than you'd expect this double interpolation to work:
$hobbit = 'Frodo';
$var = '$hobbit';           # (single quotes)
/$var/;                     # means m'$hobbit', not m'Frodo'.

Here's another example that shows how most backslashes are interpreted by the regex parser, not by variable interpolation. Imagine you have a simple little grep-style program written in Perl:[10]

#!/usr/bin/perl
$pattern = shift;
while (<>) {
    print if /$pattern/o;
}
If you name that program pgrep and call it this way:
% pgrep '\t\d' *.c
then you'll find that it prints out all lines of all your C source files in which a digit follows a tab. You didn't have to do anything special to get Perl to realize that \t was a tab. If Perl's patterns were just double-quote interpolated, you would have; fortunately, they aren't. They're recognized directly by the regex parser.

[10] If you didn't know what a grep program was before, you will now. No system should be without grep--we believe grep is the most useful small program ever invented. (It logically follows that we don't believe Perl is a small program.)

The real grep program has a -i switch that turns off case-sensitive matching. You don't have to add such a switch to your pgrep program; it can already handle that without modification. You just pass it a slightly fancier pattern, with an embedded /i modifier:

% pgrep '(?i)ring' LotR*.pod
That now searches for any of "Ring", "ring", "RING", and so on. You don't see this feature too much in literal patterns, since you can always just write /ring/i. But for patterns passed in on the command line, in web search forms, or embedded in configuration files, it can be a lifesaver. (Speaking of rings.)

5.9.2.2. The qr// quote regex operator

Variables that interpolate into patterns necessarily do so at run time, not compile time. This slows down execution because Perl has to check whether you've changed the contents of the variable; if so, it would have to recompile the regular expression. As mentioned in "Pattern-Matching Operators", if you promise never to change the pattern, you can use the /o option to interpolate and compile only once:

print if /$pattern/o;
Although that works fine in our pgrep program, in the general case, it doesn't. Imagine you have a slew of patterns, and you want to match each of them in a loop, perhaps like this:
foreach $item (@data) {
    foreach $pat (@patterns) {
        if ($item =~ /$pat/) { ... }
    }
}
You couldn't write /$pat/o because the meaning of $pat varies each time through the inner loop.

The solution to this is the qr/PATTERN/imosx operator. This operator quotes--and compiles--its PATTERN as a regular expression. PATTERN is interpolated the same way as in m/PATTERN/. If ' is used as the delimiter, no interpolation of variables (or the six translation escapes) is done. The operator returns a Perl value that may be used instead of the equivalent literal in a corresponding pattern match or substitute. For example:

$regex = qr/my.STRING/is;
s/$regex/something else/;
is equivalent to:
s/my.STRING/something else/is;
So for our nested loop problem above, preprocess your pattern first using a separate loop:
@regexes = ();
foreach $pat (@patterns) {
    push @regexes, qr/$pat/;
}
Or all at once using Perl's map operator:
@regexes = map { qr/$_/ } @patterns;
And then change the loop to use those precompiled regexes:
foreach $item (@data) {
    foreach $re (@regexes) {
        if ($item =~ /$re/) { ... }
    }
}
Now when you run the match, Perl doesn't have to create a compiled regular expression on each if test, because it sees that it already has one.

The result of a qr// may even be interpolated into a larger match, as though it were a simple string:

$regex = qr/$pattern/;
$string =~ /foo${regex}bar/;   # interpolate into larger patterns
This time, Perl does recompile the pattern, but you could always chain several qr// operators together into one.

The reason this works is because the qr// operator returns a special kind of object that has a stringification overload as described in Chapter 13, "Overloading". If you print out the return value, you'll see the equivalent string:

$re = qr/my.STRING/is;
print $re;                  # prints (?si-xm:my.STRING)
The /s and /i modifiers were enabled in the pattern because they were supplied to qr//. The /x and /m, however, are disabled because they were not.

Any time you interpolate strings of unknown provenance into a pattern, you should be prepared to handle any exceptions thrown by the regex compiler, in case someone fed you a string containing untamable beasties:

$re = qr/$pat/is;                      # might escape and eat you
$re = eval { qr/$pat/is } || warn ...  # caught it in an outer cage
For more on the eval operator, see Chapter 29, "Functions".

5.9.3. The Regex Compiler

After the variable interpolation pass has had its way with the string, the regex parser finally gets a shot at trying to understand your regular expression. There's not actually a great deal that can go wrong at this point, apart from messing up the parentheses, or using a sequence of metacharacters that doesn't mean anything. The parser does a recursive-descent analysis of your regular expression and, if it parses, turns it into a form suitable for interpretation by the Engine (see the next section). Most of the interesting stuff that goes on in the parser involves optimizing your regular expression to run as fast as possible. We're not going to explain that part. It's a trade secret. (Rumors that looking at the regular expression code will drive you insane are greatly exaggerated. We hope.)

But you might like to know what the parser actually thought of your regular expression, and if you ask it politely, it will tell you. By saying use re "debug", you can examine how the regex parser processes your pattern. (You can also see the same information by using the -Dr command-line switch, which is available to you if your Perl was compiled with the -DDEBUGGING flag during installation.)

#!/usr/bin/perl
use re "debug";
"Smeagol" =~ /^Sm(.*)g[aeiou]l$/;
The output is below. You can see that prior to execution Perl compiles the regex and assigns meaning to the components of the pattern: BOL for the beginning of line (^), REG_ANY for the dot, and so on:
Compiling REx `^Sm(.*)g[aeiou]l$'
size 24 first at 2
rarest char l at 0
rarest char S at 0
   1: BOL(2)
   2: EXACT <Sm>(4)
   4: OPEN1(6)
   6:   STAR(8)
   7:     REG_ANY(0)
   8: CLOSE1(10)
  10: EXACT <g>(12)
  12: ANYOF[aeiou](21)
  21: EXACT <l>(23)
  23: EOL(24)
  24: END(0)
anchored `Sm' at 0 floating `l'$ at 4..2147483647
     (checking anchored) anchored(BOL) minlen 5 
Omitting $` $& $' support.
Some of the lines summarize the conclusions of the regex optimizer. It knows that the string must start with "Sm", and that therefore there's no reason to do the ordinary left-to-right scan. It knows that the string must end with an "l", so it can reject out of hand any string that doesn't. It knows that the string must be at least five characters long, so it can ignore any string shorter than that right off the bat. It also knows what the rarest character in each constant string is, which can help in searching "studied" strings. (See study in Chapter 29, "Functions".)

It then goes on to trace how it executes the pattern:

EXECUTING...
 
Guessing start of match, REx `^Sm(.*)g[aeiou]l$' against `Smeagol'...
Guessed: match at offset 0
Matching REx `^Sm(.*)g[aeiou]l$' against `Smeagol'
  Setting an EVAL scope, savestack=3
   0 <> <Smeagol>         |  1:  BOL
   0 <> <Smeagol>         |  2:  EXACT <Sm>
   2 <Sm> <eagol>         |  4:  OPEN1
   2 <Sm> <eagol>         |  6:  STAR
                           REG_ANY can match 5 times out of 32767...
  Setting an EVAL scope, savestack=3
   7 <Smeagol> <>         |  8:    CLOSE1
   7 <Smeagol> <>         | 10:    EXACT <g>
                              failed...
   6 <Smeago> <l>         |  8:    CLOSE1
   6 <Smeago> <l>         | 10:    EXACT <g>
                              failed...
   5 <Smeag> <ol>         |  8:    CLOSE1
   5 <Smeag> <ol>         | 10:    EXACT <g>
                              failed...
   4 <Smea> <gol>         |  8:    CLOSE1
   4 <Smea> <gol>         | 10:    EXACT <g>
   5 <Smeag> <ol>         | 12:    ANYOF[aeiou]
   6 <Smeago> <l>         | 21:    EXACT <l>
   7 <Smeagol> <>         | 23:    EOL
   7 <Smeagol> <>         | 24:    END
Match successful!
Freeing REx: `^Sm(.*)g[aeiou]l$'
If you follow the stream of whitespace down the middle of Smeagol, you can actually see how the Engine overshoots to let the .* be as greedy as possible, then backtracks on that until it finds a way for the rest of the pattern to match. But that's what the next section is about.

5.9.4. The Little Engine That /Could(n't)?/

And now we'd like to tell you the story of the Little Regex Engine that says, "I think I can. I think I can. I think I can."

In this section, we lay out the rules used by Perl's regular expression engine to match your pattern against a string. The Engine is extremely persistent and hardworking. It's quite capable of working even after you think it should quit. The Engine doesn't give up until it's certain there's no way to match the pattern against the string. The Rules below explain how the Engine "thinks it can" for as long as possible, until it knows it can or can't. The problem for our Engine is that its task is not merely to pull a train over a hill. It has to search a (potentially) very complicated space of possibilities, keeping track of where it has been and where it hasn't.

The Engine uses a nondeterministic finite-state automaton (NFA, not to be confused with NFL, a nondeterministic football league) to find a match. That just means that it keeps track of what it has tried and what it hasn't, and when something doesn't pan out, it backs up and tries something else. This is known as backtracking. (Er, sorry, we didn't invent these terms. Really.) The Engine is capable of trying a million subpatterns at one spot, then giving up on all those, backing up to within one choice of the beginning, and trying the million subpatterns again at a different spot. The Engine is not terribly intelligent; just persistent, and thorough. If you're cagey, you can give the Engine an efficient pattern that doesn't let it do a lot of silly backtracking.

When someone trots out a phrase like "Regexes choose the leftmost, longest match", that means that Perl generally prefers the leftmost match over longest match. But the Engine doesn't realize it's "preferring" anything, and it's not really thinking at all, just gutting it out. The overall preferences are an emergent behavior resulting from many individual and unrelated choices. Here are those choices:[11]

[11] Some of these choices may be skipped if the regex optimizer has any say, which is equivalent to the Little Engine simply jumping through the hill via quantum tunneling. But for this discussion we're pretending the optimizer doesn't exist.

Rule 1

The Engine tries to match as far left in the string as it can, such that the entire regular expression matches under Rule 2.

The Engine starts just before the first character and tries to match the entire pattern starting there. The entire pattern matches if and only if the Engine reaches the end of the pattern before it runs off the end of the string. If it matches, it quits immediately--it doesn't keep looking for a "better" match, even though the pattern might match in many different ways.

If it is unable to match the pattern at the first position in the string, it admits temporary defeat and moves to the next position in the string, between the first and second characters, and tries all the possibilities again. If it succeeds, it stops. If it fails, it continues on down the string. The pattern match as a whole doesn't fail until it has tried to match the entire regular expression at every position in the string, including after the last character.

A string of n characters actually provides n+ 1 positions to match at. That's because the beginnings and the ends of matches are between the characters of the string. This rule sometimes surprises people when they write a pattern like /x*/ that can match zero or more "x" characters. If you try that pattern on a string like "fox", it won't find the "x". Instead, it will immediately match the null string before the "f" and never look further. If you want it to match one or more x characters, you need to use /x+/ instead. See the quantifiers under Rule 5.

A corollary to this rule is that any pattern matching the null string is guaranteed to match at the leftmost position in the string (in the absence of any zero-width assertions to the contrary).

Rule 2

When the Engine encounters a set of alternatives (separated by | symbols), either at the top level or at the current "cluster" level, it tries them left-to-right, stopping on the first successful match that allows successful completion of the entire pattern.

A set of alternatives matches a string if any of the alternatives match under Rule 3. If none of the alternatives matches, it backtracks to the Rule that invoked this Rule, which is usually Rule 1, but could be Rule 4 or 6, if we're within a cluster. That rule will then look for a new position at which to apply Rule 2.

If there's only one alternative, then either it matches or it doesn't, and Rule 2 still applies. (There's no such thing as zero alternatives, because a null string always matches.)

Rule 3

Any particular alternative matches if every item listed in the alternative matches sequentially according to Rules 4 and 5 (such that the entire regular expression can be satisfied).

An item consists of either an assertion, which is covered in Rule 4, or a quantified atom, covered by Rule 5. Items that have choices on how to match are given a "pecking order" from left to right. If the items cannot be matched in order, the Engine backtracks to the next alternative under Rule 2.

Items that must be matched sequentially aren't separated in the regular expression by anything syntactic--they're merely juxtaposed in the order they must match. When you ask to match /^foo/, you're actually asking for four items to be matched one after the other. The first is a zero-width assertion, matched under Rule 4, and the other three are ordinary characters that must match themselves, one after the other, under Rule 5.

The left-to-right pecking order means that in a pattern like:

/x*y*/
x* gets to pick one way to match, and then y* tries all its ways. If that fails, then x* gets to pick its second choice, and make y* try all of its ways again. And so on. The items to the right "vary faster", to borrow a phrase from multi-dimensional arrays.

Rule 4

If an assertion does not match at the current position, the Engine backtracks to Rule 3 and retries higher-pecking-order items with different choices.

Some assertions are fancier than others. Perl supports many regex extensions, some of which are zero-width assertions. For example, the positive lookahead (?=...) and the negative lookahead (?!...) don't actually match any characters, but merely assert that the regular expression represented by ... would (or would not) match at this point, were we to attempt it, hypothetically speaking.[12]

[12]In actual fact, the Engine does attempt it. The Engine goes back to Rule 2 to test the subpattern, and then wipes out any record of how much string was eaten, returning only the success or failure of the subpattern as the value of the assertion. (It does, however, remember any captured substrings.)

Rule 5

A quantified atom matches only if the atom itself matches some number of times that is allowed by the quantifier. (The atom itself is matched according to Rule 6.)

Different quantifiers require different numbers of matches, and most of them allow a range of numbers of matches. Multiple matches must all match in a row; that is, they must be adjacent within the string. An unquantified atom is assumed to have a quantifier requiring exactly one match (that is, /x/ is the same as /x{1}/). If no match can be found at the current position for any allowed quantity of the atom in question, the Engine backtracks to Rule 3 and retries higher-pecking-order items with different choices.

The quantifiers are *, +, ?, *?, +?, ??, and the various brace forms. If you use the {COUNT} form, then there is no choice, and the atom must match exactly that number of times or not at all. Otherwise, the atom can match over a range of quantities, and the Engine keeps track of all the choices so that it can backtrack if necessary. But then the question arises as to which of these choices to try first. One could start with the maximal number of matches and work down, or the minimal number of matches and work up.

The traditional quantifiers (without a trailing question mark) specify greedy matching; that is, they attempt to match as many characters as possible. To find the greediest match, the Engine has to be a little bit careful. Bad guesses are potentially rather expensive, so the Engine doesn't actually count down from the maximum value, which after all could be Very Large and cause millions of bad guesses. What the Engine actually does is a little bit smarter: it first counts up to find out how many matching atoms (in a row) are really there in the string, and then it uses that actual maximum as its first choice. (It also remembers all the shorter choices in case the longest one doesn't pan out.) It then (at long last) tries to match the rest of the pattern, assuming the longest choice to be the best. If the longest choice fails to produce a match for the rest of the pattern, it backtracks and tries the next longest.

If you say /.*foo/, for example, it will try to match the maximal number of "any" characters (represented by the dot) clear out to the end of the line before it ever tries looking for "foo"; and then when the "foo" doesn't match there (and it can't, because there's not enough room for it at the end of the string), the Engine will back off one character at a time until it finds a "foo". If there is more than one "foo" in the line, it'll stop on the last one, since that will really be the first one it encounters as it backtracks. When the entire pattern succeeds using some particular length of .*, the Engine knows it can throw away all the other shorter choices for .* (the ones it would have used had the current "foo" not panned out).

By placing a question mark after any greedy quantifier, you turn it into a frugal quantifier that chooses the smallest quantity for the first try. So if you say /.*?foo/, the .*? first tries to match 0 characters, then 1 character, then 2, and so on until it can match the "foo". Instead of backtracking backward, it backtracks forward, so to speak, and ends up finding the first "foo" on the line instead of the last.

Rule 6

Each atom matches according to the designated semantics of its type. If the atom doesn't match (or does match, but doesn't allow a match of the rest of the pattern), the Engine backtracks to Rule 5 and tries the next choice for the atom's quantity.

Atoms match according to the following types:

  • A regular expression in parentheses, (...), matches whatever the regular expression (represented by ...) matches according to Rule 2. Parentheses therefore serve as a clustering operator for quantification. Bare parentheses also have the side effect of capturing the matched substring for later use in a backreference. This side effect can be suppressed by using (?:...) instead, which has only the clustering semantics--it doesn't store anything in $1, $2, and so on. Other forms of parenthetical atoms (and assertions) are possible--see the rest of this chapter.

  • A dot matches any character, except maybe newline.

  • A list of characters in square brackets (a character class)matches any one of the characters specified by the list.

  • A backslashed letter matches either a particular character or a character from a set of characters, as listed in Table 5-7.

  • Any other backslashed character matches that character.

  • Any character not mentioned above matches itself.

That all sounds rather complicated, but the upshot of it is that, for each set of choices given by a quantifier or alternation, the Engine has a knob it can twiddle. It will twiddle those knobs until the entire pattern matches. The Rules just say in which order the Engine is allowed to twiddle those knobs. Saying the Engine prefers the leftmost match merely means it twiddles the start position knob the slowest. And backtracking is just the process of untwiddling the knob you just twiddled in order to try twiddling a knob higher in the pecking order, that is, one that varies slower.

Here's a more concrete example, a program that detects when two consecutive words share a common ending and beginning:

$a = 'nobody';
$b = 'bodysnatcher';
if ("$a $b" =~ /^(\w+)(\w+) \2(\w+)$/) {
    print "$2 overlaps in $1-$2-$3\n";
}
This prints:
body overlaps in no-body-snatcher
You might think that $1 would first grab up all of "nobody" due to greediness. And in fact, it does--at first. But once it's done so, there aren't any further characters to put in $2, which needs characters put into it because of the + quantifier. So the Engine backs up and $1 begrudgingly gives up one character to $2. This time the space character matches successfully, but then it sees \2, which represents a measly "y". The next character in the string is not a "y", but a "b". This makes the Engine back up all the way and try several more times, eventually forcing $1 to surrender the body to $2. Habeas corpus, as it were.

Actually, that won't quite work out if the overlap is itself the product of a doubling, as in the two words "rococo" and "cocoon". The algorithm above would have decided that the overlapping string, $2, must be just "co" rather than "coco". But we don't want a "rocococoon"; we want a "rococoon". Here's one of those places you can outsmart the Engine. Adding a minimal matching quantifier to the $1 part gives the much better pattern /^(\w+?)(\w+) \2(\w+)$/, which does exactly what we want.

For a much more detailed discussion of the pros and cons of various kinds of regular expression engines, see Jeffrey Friedl's book, Mastering Regular Expressions. Perl's regular expression Engine works very well for many of the everyday problems you want to solve with Perl, and it even works okay for those not-so-everyday problems, if you give it a little respect and understanding.



Library Navigation Links

Copyright © 2002 O'Reilly & Associates. All rights reserved.