5.10. Fancy Patterns5.10.1. Lookaround AssertionsSometimes you just need to sneak a peek. There are four regex extensions that help you do just that, and we call them lookaround assertions because they let you scout around in a hypothetical sort of way, without committing to matching any characters. What these assertions assert is that some pattern would (or would not) match if we were to try it. The Engine works it all out for us by actually trying to match the hypothetical pattern, and then pretending that it didn't match (if it did). When the Engine peeks ahead from its current position in the string, we call it a lookahead assertion. If it peeks backward, we call it a lookbehind assertion. The lookahead patterns can be any regular expression, but the lookbehind patterns may only be fixed width, since they have to know where to start the hypothetical match from. While these four extensions are all zero-width assertions, and hence do not consume characters (at least, not officially), you can in fact capture substrings within them if you supply extra levels of capturing parentheses.
5.10.2. Nonbacktracking SubpatternsAs described in "The Little Engine That /Could(n't)?/", the Engine often backtracks as it proceeds through the pattern. You can block the Engine from backtracking back through a particular set of choices by creating a nonbacktracking subpattern. A nonbacktracking subpattern looks like (?>PATTERN), and it works exactly like a simple (?:PATTERN), except that once PATTERN has found a match, it suppresses backtracking on any of the quantifiers or alternatives inside the subpattern. (Hence, it is meaningless to use this on a PATTERN that doesn't contain quantifiers or alternatives.) The only way to get it to change its mind is to backtrack to something before the subpattern and reenter the subpattern from the left. It's like going into a car dealership. After a certain amount of haggling over the price, you deliver an ultimatum: "Here's my best offer; take it or leave it." If they don't take it, you don't go back to haggling again. Instead, you backtrack clear out the door. Maybe you go to another dealership, and start haggling again. You're allowed to haggle again, but only because you reentered the nonbacktracking pattern again in a different context. For devotees of Prolog or SNOBOL, you can think of this as a scoped cut or fence operator. Consider how in "aaab" =~ /(?:a*)ab/, the a* first matches three a's, but then gives up one of them because the last a is needed later. The subgroup sacrifices some of what it wants in order for the whole match to succeed. (Which is like letting the car salesman talk you into giving him more of your money because you're afraid to walk away from the deal.) In contrast, the subpattern in "aaab" =~ /(?>a*)ab/ will never give up what it grabs, even though this behavior causes the whole match to fail. (As the song says, you have to know when to hold 'em, when to fold 'em, and when to walk away.) Although (?>PATTERN) is useful for changing the behavior of a pattern, it's mostly used for speeding up the failure of certain matches that you know will fail anyway (unless they succeed outright). The Engine can take a spectacularly long time to fail, particular with nested quantifiers. The following pattern will succeed almost instantly: But success is not the problem. Failure is. If you remove that final "b" from the string, the pattern will probably run for many, many years before failing. Many, many millennia. Actually, billions and billions of years.[13] You can see by inspection that the pattern can't succeed if there's no "b" on the end of the string, but the regex optimizer is not smart enough (as of this writing) to figure out that /[b]/ is equivalent to /b/. But if you give it a hint, you can get it to fail quickly while still letting it succeed where it can:$_ = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab"; /a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*[b]/; For a (hopefully) more realistic example, imagine a program that's supposed to read in a paragraph at a time and show just the lines that are continued, where contination lines are specified with trailing backslashes. Here's a sample from Perl's Makefile that uses this line-continuation convention:/(?>a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*)[b]/; You could write your simple program this way:# Files to be built with variable substitution before miniperl # is available. sh = Makefile.SH cflags.SH config_h.SH makeaperl.SH makedepend.SH \ makedir.SH myconfig.SH writemain.SH That works, but it's really quite slow. That's because the Engine backtracks a character at a time from the end of the line, shrinking what's in $1. This is pointless. And writing it without the extraneous captures doesn't help much. Using:#!/usr/bin/perl -00p while ( /( (.+) ( (?<=\\) \n .* )+ ) /gx) { print "GOT $.: $1\n\n"; } for a pattern is somewhat faster, but not much. This is where a nonbacktracking subpattern helps a lot. The pattern:(.+(?:(?<=\\)\n.*)+) does the same thing, but more than an order of magnitude faster because it doesn't waste time backtracking in search of something that isn't there.((?>.+)(?:(?<=\\)\n.*)+)
You'll never get a success with (?>...) that you wouldn't get with (?:...) or even a simple (...). But if you're going to fail, it's best to fail quickly and get on with your life. 5.10.3. Programmatic PatternsMost Perl programs tend to follow an imperative (also called procedural) programming style, like a series of discrete commands laid out in a readily observable order: "Preheat oven, mix, glaze, heat, cool, serve to aliens." Sometimes into this mix you toss a few dollops of functional programming ("Use a little more glaze than you think you need, even after taking this into account, recursively"), or sprinkle it with bits of object-oriented techniques ("but please hold the anchovy objects"). Often it's a combination of all of these. But the regular expression Engine takes a completely different approach to problem solving, more of a declarative approach. You describe goals in the language of regular expressions, and the Engine implements whatever logic is needed to solve your goals. Logic programming languages (such as Prolog) don't always get as much exposure as the other three styles, but they're more common than you'd think. Perl couldn't even be built without make(1) or yacc(1), both of which could be considered, if not purely declarative languages, at least hybrids that blend imperative and logic programming together. You can do this sort of thing in Perl, too, by blending goal declarations and imperative code together more miscibly than we've done so far, drawing upon the strengths of both. You can programmatically build up the string you'll eventually present to the regex Engine, in a sense creating a program that writes a new program on the fly. You can also supply ordinary Perl expressions as the replacement part of s/// via the /e modifier. This allows you to dynamically generate the replacement string by executing a bit of code every time the pattern matches. Even more elaborately, you can interject bits of code wherever you'd like in a middle of a pattern using the (?{CODE}) extension, and that code will be executed every time the Engine encounters that code as it advances and recedes in its intricate backtracking dance. Finally, you can use s///ee or (??{CODE}) to add another level of indirection: the results of executing those code snippets will themselves be re-evaluated for further use, creating bits of program and pattern on the fly, just in time. 5.10.3.1. Generated patternsIt has been said[14] that programs that write programs are the happiest programs in the world. In Jeffrey Friedl's book, Mastering Regular Expressions, the final tour de force demonstrates how to write a program that produces a regular expression to determine whether a string conforms to the RFC 822 standard; that is, whether it contains a standards-compliant, valid mail header. The pattern produced is several thousand characters long, and about as easy to read as a crash dump in pure binary. But Perl's pattern matcher doesn't care about that; it just compiles up the pattern without a hitch and, even more interestingly, executes the match very quickly--much more quickly, in fact, than many short patterns with complex backtracking requirements.
That's a very complicated example. Earlier we showed you a very simple example of the same technique when we built up a $number pattern out of its components (see the section Section 5.9.2, "Variable Interpolation"). But to show you the power of this programmatic approach to producing a pattern, let's work out a problem of medium complexity. Suppose you wanted to pull out all the words with a certain vowel-consonant sequence; for example, "audio" and "eerie" both follow a VVCVV pattern. Although describing what counts as a consonant or a vowel is easy, you wouldn't ever want to type that in more than once. Even for our simple VVCVV case, you'd need to type in a pattern that looked something like this: A more general-purpose program would accept a string like "VVCVV" and programmatically generate that pattern for you. For even more flexibility, it could accept a word like "audio" as input and use that as a template to infer "VVCVV", and from that, the long pattern above. It sounds complicated, but really isn't, because we'll let the program generate the pattern for us. Here's a simple cvmap program that does all of that:^[aeiouy][aeiouy][cbdfghjklmnpqrstvwxzy][aeiouy][aeiouy]$ The %map variable holds all the interesting bits. Its keys are each letter of the alphabet, and the corresponding value is all the letters of its type. We throw in C and V, too, so you can specify either "VVCVV" or "audio", and still get out "eerie". Each character in the argument supplied to the program is used to pull out the right character class to add to the pattern. Once the pattern is created and compiled up with qr//, the match (even a very long one) will run quickly. Here's why you might get if you run this program on "fortuitously":#!/usr/bin/perl $vowels = 'aeiouy'; $cons = 'cbdfghjklmnpqrstvwxzy'; %map = (C => $cons, V => $vowels); # init map for C and V for $class ($vowels, $cons) { # now for each type for (split //, $class) { # get each letter of that type $map{$_} .= $class; # and map the letter back to the type } } for $char (split //, shift) { # for each letter in template word $pat .= "[$map{$char}]"; # add appropriate character class } $re = qr/^${pat}$/i; # compile the pattern print "REGEX is $re\n"; # debugging output @ARGV = ('/usr/dict/words') # pick a default dictionary if -t && !@ARGV; while (<>) { # and now blaze through the input print if /$re/; # printing any line that matches } Looking at that REGEX, you can see just how much villainous typing you saved by programming languorously, albeit circuitously.% cvmap fortuitously /usr/dict/wordses REGEX is (?i-xsm:^[cbdfghjklmnpqrstvwxzy][aeiouy][cbdfghjklmnpqrstvwxzy][cbd fghjklmnpqrstvwxzy][aeiouy][aeiouy][cbdfghjklmnpqrstvwxzy][aeiouy][aeiouy][c bdfghjklmnpqrstvwxzy][cbdfghjklmnpqrstvwxzy][aeiouycbdfghjklmnpqrstvwxzy]$) carriageable circuitously fortuitously languorously marriageable milquetoasts sesquiquarta sesquiquinta villainously 5.10.3.2. Substitution evaluationsWhen the /e modifier ("e" is for expression evaluation) is used on an s/PATTERN/CODE/e expression, the replacement portion is interpreted as a Perl expression, not just as a double-quoted string. It's like an embedded do {CODE}. Even though it looks like a string, it's really just a code block that gets compiled up at the same time as rest of your program, long before the substitution actually happens. You can use the /e modifier to build replacement strings with fancier logic than double-quote interpolation allows. This shows the difference: And this converts Celsius temperatures into Fahrenheit:s/(\d+)/$1 * 2/; # Replaces "42" with "42 * 2" s/(\d+)/$1 * 2/e; # Replaces "42" with "84" Applications of this technique are limitless. Here's a filter that modifies its files in place (like an editor) by adding 100 to every number that starts a line (and that is followed by a colon, which we only peek at, but don't actually match, or replace):$_ = "Preheat oven to 233C.\n"; s/\b(\d+\.?\d*)C\b/int($1 * 1.8 + 32) . "F"/e; # convert to 451F Now and then, you want to do more than just use the string you matched in another computation. Sometimes you want that string to be a computation, whose own evaluation you'll use for the replacement value. Each additional /e modifier after the first wraps an eval around the code to execute. The following two lines do the same thing, but the first one is easier to read:% perl -pi -e 's/^(\d+)(?=:)/100 + $1/e' filename You could use this technique to replace mentions of simple scalar variables with their values:s/PATTERN/CODE/ee s/PATTERN/eval(CODE)/e Because it's really an eval, the /ee even finds lexical variables. A slightly more elaborate example calculates a replacement for simple arithmetical expressions on (nonnegative) integers:s/(\$\w+)/$1/eeg; # Interpolate most scalars' values Like any other evalSTRING, compile-time errors (like syntax problems) and run-time exceptions (like dividing by zero) are trapped. If so, the $@ ($EVAL_ERROR) variable says what went wrong.$_ = "I have 4 + 19 dollars and 8/2 cents.\n"; s{ ( \d+ \s* # find an integer [+*/-] # and an arithmetical operator \s* \d+ # and another integer ) }{ $1 }eegx; # then expand $1 and run that code print; # "I have 23 dollars and 4 cents." 5.10.3.3. Match-time code evaluationIn most programs that use regular expressions, the surrounding program's run-time control structure drives the logical execution flow. You write if or while loops, or make function or method calls, that wind up calling a pattern-matching operation now and then. Even with s///e, it's the substitution operator that is in control, executing the replacement code only after a successful match. With code subpatterns, the normal relationship between regular expression and program code is inverted. As the Engine is applying its Rules to your pattern at match time, it may come across a regex extension of the form (?{CODE}). When triggered, this subpattern doesn't do any matching or any looking about. It's a zero-width assertion that always "succeeds", evaluated only for its side effects. Whenever the Engine needs to progress over the code subpattern as it executes the pattern, it runs that code. As the Engine tries to match glyph against this pattern, it first lets the .+ eat up all five letters. Then it prints "hi". When it finds that final dot, all five letters have been eaten, so it needs to backtrack back to the .+ and make it give up one of the letters. Then it moves forward through the pattern again, stopping to print "hi" again before assigning h to the final dot and completing the match successfully."glyph" =~ /.+ (?{ print "hi" }) ./x; # Prints "hi" twice. The braces around the CODE fragment are intended to remind you that it is a block of Perl code, and it certainly behaves like a block in the lexical sense. That is, if you use my to declare a lexically scoped variable in it, it is private to the block. But if you use local to localize a dynamically scoped variable, it may not do what you expect. A (?{ CODE}) subpattern creates an implicit dynamic scope that is valid throughout the rest of the pattern, until it either succeeds or backtracks through the code subpattern. One way to think of it is that the block doesn't actually return when it gets to the end. Instead, it makes an invisible recursive call to the Engine to try to match the rest of the pattern. Only when that recursive call is finished does it return from the block, delocalizing the localized variables.[15]
In the next example, we initialize $i to 0 by including a code subpattern at the beginning of the pattern. Then we match any number of characters with .*--but we place another code subpattern in between the . and the * so we can count how many times . matches. The Engine merrily goes along, setting $i to 0 and letting the .* gobble up all 10 characters in the string. When it encounters the literal lori in the pattern, it backtracks and gives up those four characters from the .*. After the match, $i will still be 10.$_ = 'lothlorien'; m/ (?{ $i = 0 }) # Set $i to 0 (. (?{ $i++ }) )* # Update $i, even after backtracking lori # Forces a backtrack /x; If you wanted $i to reflect how many characters the .* actually ended up with, you could make use of the dynamic scope within the pattern: Here, we use local to ensure that $i contains the number of characters matched by .*, regardless of backtracking. $i will be forgotten after the regular expression ends, so the code subpattern, (?{ $result = $i }), ensures that the count will live on in $result.$_ = 'lothlorien'; m/ (?{ $i = 0 }) (. (?{ local $i = $i + 1; }) )* # Update $i, backtracking-safe. lori (?{ $result = $i }) # Copy to non-localized location. /x; The special variable $^R (described in Chapter 28, "Special Names") holds the result of the last (?{CODE}) that was executed as part of a successful match. You can use a (?{CODE}) extension as the COND of a (?(COND)IFTRUE|IFFALSE). If you do this, $^R will not be set, and you may omit the parentheses around the conditional: Here, we test whether $foo{bar} is greater than symbol. If so, we include . in the pattern, and if not, we include signet in the pattern. Stretched out a bit, it might be construed as more readable:"glyph" =~ /.+(?(?{ $foo{bar} gt "symbol" }).|signet)./; When use re 'eval' is in effect, a regex is allowed to contain (?{CODE}) subpatterns even if the regular expression interpolates variables:"glyph" =~ m{ .+ # some anythings (?(?{ # if $foo{bar} gt "symbol" # this is true }) . # match another anything | # else signet # match signet ) . # and one more anything }x; This is normally disallowed since it is a potential security risk. Even though the pattern above may be innocuous because $suffix is innocuous, the regex parser can't tell which parts of the string were interpolated and which ones weren't, so it just disallows code subpatterns entirely if there were any interpolations./(.*?) (?{length($1) < 3 && warn}) $suffix/; # Error without use re 'eval' If the pattern is obtained from tainted data, even use re 'eval' won't allow the pattern match to proceed. When use re 'taint' is in effect and a tainted string is the target of a regex, the captured subpatterns (either in the numbered variables or in the list of values returned by m// in list context) are tainted. This is useful when regex operations on tainted data are meant not to extract safe substrings, but merely to perform other transformations. See Chapter 23, "Security", for more on tainting. For the purpose of this pragma, precompiled regular expressions (usually obtained from qr//) are not considered to be interpolated: This is allowed if $pat is a precompiled regular expression, even if $pat contains (?{CODE}) subpatterns./foo${pat}bar/ Earlier we showed you a bit of what usere'debug' prints out. A more primitive debugging solution is to use (?{CODE}) subpatterns to print out what's been matched so far during the match: This prints:"abcdef" =~ / .+ (?{print "Matched so far: $&\n"}) bcdef $/x; showing the .+ grabbing all the letters and giving them up one by one as the Engine backtracks.Matched so far: abcdef Matched so far: abcde Matched so far: abcd Matched so far: abc Matched so far: ab Matched so far: a 5.10.3.4. Match-time pattern interpolationYou can build parts of your pattern from within the pattern itself. The (??{ CODE }) extension allows you to insert code that evaluates to a valid pattern. It's like saying /$pattern/, except that you can generate $pattern at run time--more specifically, at match time. For instance: This is equivalent to /\wred\d/ if $threshold is greater than 1, and /\wblue\d/ otherwise./\w (??{ if ($threshold > 1) { "red" } else { "blue" } }) \d/x; You can include backreferences inside the evaluated code to derive patterns from just-matched substrings (even if they will later become unmatched through backtracking). For instance, this matches all strings that read the same backward as forward (known as palindromedaries, phrases with a hump in the middle): You can balance parentheses like so:/^ (.+) .? (??{quotemeta reverse $1}) $/xi; This matches strings of the form (shazam!) and (((shazam!))), sticking shazam! into $2. Unfortunately, it doesn't notice whether the parentheses in the middle are balanced. For that we need recursion.$text =~ /( \(+ ) (.*?) (??{ '\)' x length $1 })/x; Fortunately, you can do recursive patterns too. You can have a compiled pattern that uses (??{CODE}) to refer to itself. Recursive matching is pretty irregular, as regular expressions go. Any text on regular expressions will tell you that a standard regex can't match nested parentheses correctly. And that's correct. It's also correct that Perl's regexes aren't standard. The following pattern[16] matches a set of nested parentheses, however deep they go: You could use it like this to match a function call:$np = qr{ \( (?: (?> [^()]+ ) # Non-parens without backtracking | (??{ $np }) # Group with matching parens )* \) }x; $funpat = qr/\w+$np/; 'myfunfun(1,(2*(3+4)),5)' =~ /^$funpat$/; # Matches!
5.10.3.5. Conditional interpolationThe (?(COND)IFTRUE|IFFALSE) regex extension is similar to Perl's ?: operator. If COND is true, the IFTRUE pattern is used; otherwise, the IFFALSE pattern is used. The COND can be a backreference (expressed as a bare integer, without the \ or $), a lookaround assertion, or a code subpattern. (See Section 5.10.1, "Lookaround Assertions" and Section 5.10.3.3, "Match-time code evaluation" earlier in this chapter.) If the COND is an integer, it is treated as a backreference. For instance, consider: Here, the COND is (2), which is true if a second backreference exists. If that's the case, (\$\d+) is included in the pattern at that point (creating the $3 backreference); otherwise, \w+ is used.#!/usr/bin/perl $x = 'Perl is free.'; $y = 'ManagerWare costs $99.95.'; foreach ($x, $y) { /^(\w+) (?:is|(costs)) (?(2)(\$\d+)|\w+)/; # Either (\$\d+) or \w+ if ($3) { print "$1 costs money.\n"; # ManagerWare costs money. } else { print "$1 doesn't cost money.\n"; # Perl doesn't cost money. } } If the COND is a lookaround or code subpattern, the truth of the assertion is used to determine whether to include IFTRUE or IFFALSE: This uses a lookbehind assertion as the COND to match a DNA sequence that ends in either AAG, or some other base combination and C./[ATGC]+(?(?<=AA)G|C)$/; You can omit the |IFFALSE alternative. If you do, the IFTRUE pattern will be included in the pattern as usual if the COND is true, but if the condition isn't true, the Engine will move on to the next portion of the pattern. 5.10.4. Defining Your Own AssertionsYou can't change how Perl's Engine works, but if you're sufficiently warped, you can change how it sees your pattern. Since Perl interprets your pattern similarly to double-quoted strings, you can use the wonder of overloaded string constants to see to it that text sequences of your choosing are automatically translated into other text sequences. In the example below, we specify two transformations to occur when Perl encounters a pattern. First, we define \tag so that when it appears in a pattern, it's automatically translated to (?:<.*?>), which matches most HTML and XML tags. Second, we "redefine" the \w metasymbol so that it handles only English letters. We'll define a package called Tagger that hides the overloading from our main program. Once we do that, we'll be able to say: Here's Tagger.pm, couched in the form of a Perl module (see Chapter 11, "Modules"):use Tagger; $_ = '<I>camel</I>'; print "Tagged camel found" if /\tag\w+\tag/; The Tagger module is handed the pattern immediately before interpolation, so you can bypass the overloading by bypassing interpolation, as follows:package Tagger; use overload; sub import { overload::constant 'qr' => \&convert } sub convert { my $re = shift; $re =~ s/ \\tag /<.*?>/xg; $re =~ s/ \\w /[A-Za-z]/xg; return $re; } 1; If you wanted the interpolated variable to be customized, call the convert function directly:$re = '\tag\w+\tag'; # This string begins with \t, a tab print if /$re/; # Matches a tab, followed by an "a"... Now if you're still wondering what those sub thingies are there in the Tagger module, you'll find out soon enough because that's what our next chapter is all about.$re = '\tag\w+\tag'; # This string begins with \t, a tab $re = Tagger::convert $re; # expand \tag and \w print if /$re/; # $re becomes <.*?>[A-Za-z]+<.*?> Copyright © 2001 O'Reilly & Associates. All rights reserved. |
|