5.9. Staying in ControlAs any good manager knows, you shouldn't micromanage your employees. Just tell them what you want, and let them figure out the best way of doing it. Similarly, it's often best to think of a regular expression as a kind of specification: "Here's what I want; go find a string that fits the bill." On the other hand, the best managers also understand the job their employees are trying to do. The same is true of pattern matching in Perl. The more thoroughly you understand of how Perl goes about the task of matching any particular pattern, the more wisely you'll be able to make use of Perl's pattern matching capabilities. One of the most important things to understand about Perl's pattern-matching is when not to use it. 5.9.1. Letting Perl Do the WorkWhen people of a certain temperament first learn regular expressions, they're often tempted to see everything as a problem in pattern matching. And while that may even be true in the larger sense, pattern matching is about more than just evaluating regular expressions. It's partly about looking for your car keys where you dropped them, not just under the streetlamp where you can see better. In real life, we all know that it's a lot more efficient to look in the right places than the wrong ones. Similarly, you should use Perl's control flow to decide which patterns to execute, and which ones to skip. A regular expression is pretty smart, but it's smart like a horse. It can get distracted if it sees too much at once. So sometimes you have to put blinders onto it. For example, you'll recall our earlier example of alternation: That works as advertised, but not as well as it might, because it searches every position in the string for every name before it moves on to the next position. Astute readers of The Lord of the Rings will recall that, of the three wizards named above, Gandalf is mentioned much more frequently than Saruman, and Saruman is mentioned much more frequently than Radagast. So it's generally more efficient to use Perl's logical operators to do the alternation:/Gandalf|Saruman|Radagast/ This is yet another way of defeating the "leftmost" policy of the Engine. It only searches for Saruman if Gandalf was nowhere to be seen. And it only searches for Radagast if Saruman is also absent./Gandalf/ || /Saruman/ || /Radagast/ Not only does this change the order in which things are searched, but it sometimes allows the regular expression optimizer to work better. It's generally easier to optimize searching for a single string than for several strings simultaneously. Similarly, anchored searches can often be optimized if they're not too complicated. You don't have to limit your control of the control flow to the || operator. Often you can control things at the statement level. You should always think about weeding out the common cases first. Suppose you're writing a loop to process a configuration file. Many configuration files are mostly comments. It's often best to discard comments and blank lines early before doing any heavy-duty processing, even if the heavy duty processing would throw out the comments and blank lines in the course of things: Even if you're not trying to be efficient, you often need to alternate ordinary Perl expressions with regular expressions simply because you want to take some action that is not possible (or very difficult) from within the regular expression, such as printing things out. Here's a useful number classifier:while (<CONF>) { next if /^#/; next if /^\s*(#|$)/; chomp; munchabunch($_); } We could stretch this section out a lot longer, but really, that sort of thing is what this whole book is about. You'll see many more examples of the interplay of Perl code and pattern matching as we go along. In particular, see the later section Section 5.10.3, "Programmatic Patterns". (It's okay to read the intervening material first, of course.)warn "has nondigits" if /\D/; warn "not a natural number" unless /^\d+$/; # rejects -3 warn "not an integer" unless /^-?\d+$/; # rejects +3 warn "not an integer" unless /^[+-]?\d+$/; warn "not a decimal number" unless /^-?\d+\.?\d*$/; # rejects .2 warn "not a decimal number" unless /^-?(?:\d+(?:\.\d*)?|\.\d+)$/; warn "not a C float" unless /^([+-]?)(?=\d|\.\d)\d*(\.\d*)?([Ee]([+-]?\d+))?$/; 5.9.2. Variable InterpolationUsing Perl's control flow mechanisms to control regular expression matching has its limits. The main difficulty is that it's an "all or nothing" approach; either you run the pattern, or you don't. Sometimes you know the general outlines of the pattern you want, but you'd like to have the capability of parameterizing it. Variable interpolation provides that capability, much like parameterizing a subroutine lets you have more influence over its behavior than just deciding whether to call it or not. (More about subroutines in the next chapter). One nice use of interpolation is to provide a little abstraction, along with a little readability. With regular expressions you may certainly write things concisely: But what you mean is more apparent when you write:if ($num =~ /^[-+]?\d+\.?\d*$/) { ... } We'll cover this use of interpolation more under "Generated patterns" later in this chapter. We'll just point out that we used the /o modifier to suppress recompilation because we don't expect $number to change its value over the course of the program.$sign = '[-+]?'; $digits = '\d+'; $decimal = '\.?'; $more_digits = '\d*'; $number = "$sign$digits$decimal$more_digits"; ... if ($num =~ /^$number$/o) { ... } Another cute trick is to turn your tests inside out and use the variable string to pattern-match against a set of known strings: This lets your user perform the "send" action by typing any of S, SE, SEN, or SEND (in any mixture of upper- and lowercase). To "stop", they'd have to type at least ST (or St, or sT, or st).chomp($answer = <STDIN>); if ("SEND" =~ /^\Q$answer/i) { print "Action is send\n" } elsif ("STOP" =~ /^\Q$answer/i) { print "Action is stop\n" } elsif ("ABORT" =~ /^\Q$answer/i) { print "Action is abort\n" } elsif ("LIST" =~ /^\Q$answer/i) { print "Action is list\n" } elsif ("EDIT" =~ /^\Q$answer/i) { print "Action is edit\n" } 5.9.2.1. When backslashes happenWhen you think of double-quote interpolation, you usually think of both variable and backslash interpolation. But as we mentioned earlier, for regular expressions there are two passes, and the interpolation pass defers most of the backslash interpretation to the regular expression parser (which we discuss later). Ordinarily, you don't notice the difference, because Perl takes pains to hide the difference. (One sequence that's obviously different is the \b metasymbol, which turns into a word boundary assertion--outside of character classes, anyway. Inside a character class where assertions make no sense, it reverts to being a backspace, as it is normally.) It's actually fairly important that the regex parser handle the backslashes. Suppose you're searching for tab characters in a pattern with a /x modifier: If Perl didn't defer the interpretation of \t to the regex parser, the \t would have turned into whitespace, which the regex parser would have ignorantly ignored because of the /x. But Perl is not so ignoble, or tricky.($col1, $col2) = /(.*?) \t+ (.*?)/x; You can trick yourself though. Suppose you abstracted out the column separator, like this: Now you've just blown it, because the \t turns into a real tab before it gets to the regex parser, which will think you said /(.*?)+(.*?)/ after it discards the whitespace. Oops. To fix, avoid /x, or use single quotes. Or better, use qr//. (See the next section.)$colsep = "\t+"; # (double quotes) ($col1, $col2) = /(.*?) $colsep (.*?)/x; The only double-quote escapes that are processed as such are the six translation escapes: \U, \u, \L, \l, \Q, and \E. If you ever look into the inner workings of the Perl regular expression compiler, you'll find code for handling escapes like \t for tab, \n for newline, and so on. But you won't find code for those six translation escapes. (We only listed them in Table 5-7 because people expect to find them there.) If you somehow manage to sneak any of them into the pattern without going through double-quotish evaluation, they won't be recognized. How could they find their way in? Well, you can defeat interpolation by using single quotes as your pattern delimiter. In m'...', qr'...', and s'...'...', the single quotes suppress variable interpolation and the processing of translation escapes, just as they would in a single-quoted string. Saying m'\ufrodo' won't find a capitalized version of poor frodo. However, since the "normal" backslash characters aren't really processed on that level anyway, m'\t\d' still matches a real tab followed by any digit. Another way to defeat interpolation is through interpolation itself. If you say: poor frodo remains uncapitalized. Perl won't redo the interpolation pass for you just because you interpolated something that looks like it might want to be reinterpolated. You can't expect that to work any more than you'd expect this double interpolation to work:$var = '\U'; /${var}frodo/; $hobbit = 'Frodo'; $var = '$hobbit'; # (single quotes) /$var/; # means m'$hobbit', not m'Frodo'. Here's another example that shows how most backslashes are interpreted by the regex parser, not by variable interpolation. Imagine you have a simple little grep-style program written in Perl:[10] If you name that program pgrep and call it this way:#!/usr/bin/perl $pattern = shift; while (<>) { print if /$pattern/o; } then you'll find that it prints out all lines of all your C source files in which a digit follows a tab. You didn't have to do anything special to get Perl to realize that \t was a tab. If Perl's patterns were just double-quote interpolated, you would have; fortunately, they aren't. They're recognized directly by the regex parser.% pgrep '\t\d' *.c
The real grep program has a -i switch that turns off case-sensitive matching. You don't have to add such a switch to your pgrep program; it can already handle that without modification. You just pass it a slightly fancier pattern, with an embedded /i modifier: That now searches for any of "Ring", "ring", "RING", and so on. You don't see this feature too much in literal patterns, since you can always just write /ring/i. But for patterns passed in on the command line, in web search forms, or embedded in configuration files, it can be a lifesaver. (Speaking of rings.)% pgrep '(?i)ring' LotR*.pod 5.9.2.2. The qr// quote regex operatorVariables that interpolate into patterns necessarily do so at run time, not compile time. This slows down execution because Perl has to check whether you've changed the contents of the variable; if so, it would have to recompile the regular expression. As mentioned in "Pattern-Matching Operators", if you promise never to change the pattern, you can use the /o option to interpolate and compile only once: Although that works fine in our pgrep program, in the general case, it doesn't. Imagine you have a slew of patterns, and you want to match each of them in a loop, perhaps like this:print if /$pattern/o; You couldn't write /$pat/o because the meaning of $pat varies each time through the inner loop.foreach $item (@data) { foreach $pat (@patterns) { if ($item =~ /$pat/) { ... } } } The solution to this is the qr/PATTERN/imosx operator. This operator quotes--and compiles--its PATTERN as a regular expression. PATTERN is interpolated the same way as in m/PATTERN/. If ' is used as the delimiter, no interpolation of variables (or the six translation escapes) is done. The operator returns a Perl value that may be used instead of the equivalent literal in a corresponding pattern match or substitute. For example: is equivalent to:$regex = qr/my.STRING/is; s/$regex/something else/; So for our nested loop problem above, preprocess your pattern first using a separate loop:s/my.STRING/something else/is; Or all at once using Perl's map operator:@regexes = (); foreach $pat (@patterns) { push @regexes, qr/$pat/; } And then change the loop to use those precompiled regexes:@regexes = map { qr/$_/ } @patterns; Now when you run the match, Perl doesn't have to create a compiled regular expression on each if test, because it sees that it already has one.foreach $item (@data) { foreach $re (@regexes) { if ($item =~ /$re/) { ... } } } The result of a qr// may even be interpolated into a larger match, as though it were a simple string: This time, Perl does recompile the pattern, but you could always chain several qr// operators together into one.$regex = qr/$pattern/; $string =~ /foo${regex}bar/; # interpolate into larger patterns The reason this works is because the qr// operator returns a special kind of object that has a stringification overload as described in Chapter 13, "Overloading". If you print out the return value, you'll see the equivalent string: The /s and /i modifiers were enabled in the pattern because they were supplied to qr//. The /x and /m, however, are disabled because they were not.$re = qr/my.STRING/is; print $re; # prints (?si-xm:my.STRING) Any time you interpolate strings of unknown provenance into a pattern, you should be prepared to handle any exceptions thrown by the regex compiler, in case someone fed you a string containing untamable beasties: For more on the eval operator, see Chapter 29, "Functions".$re = qr/$pat/is; # might escape and eat you $re = eval { qr/$pat/is } || warn ... # caught it in an outer cage 5.9.3. The Regex CompilerAfter the variable interpolation pass has had its way with the string, the regex parser finally gets a shot at trying to understand your regular expression. There's not actually a great deal that can go wrong at this point, apart from messing up the parentheses, or using a sequence of metacharacters that doesn't mean anything. The parser does a recursive-descent analysis of your regular expression and, if it parses, turns it into a form suitable for interpretation by the Engine (see the next section). Most of the interesting stuff that goes on in the parser involves optimizing your regular expression to run as fast as possible. We're not going to explain that part. It's a trade secret. (Rumors that looking at the regular expression code will drive you insane are greatly exaggerated. We hope.) But you might like to know what the parser actually thought of your regular expression, and if you ask it politely, it will tell you. By saying use re "debug", you can examine how the regex parser processes your pattern. (You can also see the same information by using the -Dr command-line switch, which is available to you if your Perl was compiled with the -DDEBUGGING flag during installation.) The output is below. You can see that prior to execution Perl compiles the regex and assigns meaning to the components of the pattern: BOL for the beginning of line (^), REG_ANY for the dot, and so on:#!/usr/bin/perl use re "debug"; "Smeagol" =~ /^Sm(.*)g[aeiou]l$/; Some of the lines summarize the conclusions of the regex optimizer. It knows that the string must start with "Sm", and that therefore there's no reason to do the ordinary left-to-right scan. It knows that the string must end with an "l", so it can reject out of hand any string that doesn't. It knows that the string must be at least five characters long, so it can ignore any string shorter than that right off the bat. It also knows what the rarest character in each constant string is, which can help in searching "studied" strings. (See study in Chapter 29, "Functions".)Compiling REx `^Sm(.*)g[aeiou]l$' size 24 first at 2 rarest char l at 0 rarest char S at 0 1: BOL(2) 2: EXACT <Sm>(4) 4: OPEN1(6) 6: STAR(8) 7: REG_ANY(0) 8: CLOSE1(10) 10: EXACT <g>(12) 12: ANYOF[aeiou](21) 21: EXACT <l>(23) 23: EOL(24) 24: END(0) anchored `Sm' at 0 floating `l'$ at 4..2147483647 (checking anchored) anchored(BOL) minlen 5 Omitting $` $& $' support. It then goes on to trace how it executes the pattern: If you follow the stream of whitespace down the middle of Smeagol, you can actually see how the Engine overshoots to let the .* be as greedy as possible, then backtracks on that until it finds a way for the rest of the pattern to match. But that's what the next section is about.EXECUTING... Guessing start of match, REx `^Sm(.*)g[aeiou]l$' against `Smeagol'... Guessed: match at offset 0 Matching REx `^Sm(.*)g[aeiou]l$' against `Smeagol' Setting an EVAL scope, savestack=3 0 <> <Smeagol> | 1: BOL 0 <> <Smeagol> | 2: EXACT <Sm> 2 <Sm> <eagol> | 4: OPEN1 2 <Sm> <eagol> | 6: STAR REG_ANY can match 5 times out of 32767... Setting an EVAL scope, savestack=3 7 <Smeagol> <> | 8: CLOSE1 7 <Smeagol> <> | 10: EXACT <g> failed... 6 <Smeago> <l> | 8: CLOSE1 6 <Smeago> <l> | 10: EXACT <g> failed... 5 <Smeag> <ol> | 8: CLOSE1 5 <Smeag> <ol> | 10: EXACT <g> failed... 4 <Smea> <gol> | 8: CLOSE1 4 <Smea> <gol> | 10: EXACT <g> 5 <Smeag> <ol> | 12: ANYOF[aeiou] 6 <Smeago> <l> | 21: EXACT <l> 7 <Smeagol> <> | 23: EOL 7 <Smeagol> <> | 24: END Match successful! Freeing REx: `^Sm(.*)g[aeiou]l$' 5.9.4. The Little Engine That /Could(n't)?/And now we'd like to tell you the story of the Little Regex Engine that says, "I think I can. I think I can. I think I can." In this section, we lay out the rules used by Perl's regular expression engine to match your pattern against a string. The Engine is extremely persistent and hardworking. It's quite capable of working even after you think it should quit. The Engine doesn't give up until it's certain there's no way to match the pattern against the string. The Rules below explain how the Engine "thinks it can" for as long as possible, until it knows it can or can't. The problem for our Engine is that its task is not merely to pull a train over a hill. It has to search a (potentially) very complicated space of possibilities, keeping track of where it has been and where it hasn't. The Engine uses a nondeterministic finite-state automaton (NFA, not to be confused with NFL, a nondeterministic football league) to find a match. That just means that it keeps track of what it has tried and what it hasn't, and when something doesn't pan out, it backs up and tries something else. This is known as backtracking. (Er, sorry, we didn't invent these terms. Really.) The Engine is capable of trying a million subpatterns at one spot, then giving up on all those, backing up to within one choice of the beginning, and trying the million subpatterns again at a different spot. The Engine is not terribly intelligent; just persistent, and thorough. If you're cagey, you can give the Engine an efficient pattern that doesn't let it do a lot of silly backtracking. When someone trots out a phrase like "Regexes choose the leftmost, longest match", that means that Perl generally prefers the leftmost match over longest match. But the Engine doesn't realize it's "preferring" anything, and it's not really thinking at all, just gutting it out. The overall preferences are an emergent behavior resulting from many individual and unrelated choices. Here are those choices:[11]
That all sounds rather complicated, but the upshot of it is that, for each set of choices given by a quantifier or alternation, the Engine has a knob it can twiddle. It will twiddle those knobs until the entire pattern matches. The Rules just say in which order the Engine is allowed to twiddle those knobs. Saying the Engine prefers the leftmost match merely means it twiddles the start position knob the slowest. And backtracking is just the process of untwiddling the knob you just twiddled in order to try twiddling a knob higher in the pecking order, that is, one that varies slower. Here's a more concrete example, a program that detects when two consecutive words share a common ending and beginning: This prints:$a = 'nobody'; $b = 'bodysnatcher'; if ("$a $b" =~ /^(\w+)(\w+) \2(\w+)$/) { print "$2 overlaps in $1-$2-$3\n"; } You might think that $1 would first grab up all of "nobody" due to greediness. And in fact, it does--at first. But once it's done so, there aren't any further characters to put in $2, which needs characters put into it because of the + quantifier. So the Engine backs up and $1 begrudgingly gives up one character to $2. This time the space character matches successfully, but then it sees \2, which represents a measly "y". The next character in the string is not a "y", but a "b". This makes the Engine back up all the way and try several more times, eventually forcing $1 to surrender the body to $2. Habeas corpus, as it were.body overlaps in no-body-snatcher Actually, that won't quite work out if the overlap is itself the product of a doubling, as in the two words "rococo" and "cocoon". The algorithm above would have decided that the overlapping string, $2, must be just "co" rather than "coco". But we don't want a "rocococoon"; we want a "rococoon". Here's one of those places you can outsmart the Engine. Adding a minimal matching quantifier to the $1 part gives the much better pattern /^(\w+?)(\w+) \2(\w+)$/, which does exactly what we want. For a much more detailed discussion of the pros and cons of various kinds of regular expression engines, see Jeffrey Friedl's book, Mastering Regular Expressions. Perl's regular expression Engine works very well for many of the everyday problems you want to solve with Perl, and it even works okay for those not-so-everyday problems, if you give it a little respect and understanding. Copyright © 2002 O'Reilly & Associates. All rights reserved. |
|