6. Pattern Matching
Contents:
[Art is] pattern informed by sensibility. - Sir Herbert Read The Meaning of Art 6.0. IntroductionAlthough most modern programming languages offer primitive pattern matching tools, usually through an extra library, Perl's patterns are integrated directly into the language core. Perl's patterns boast features not found in pattern matching in other languages, features that encourage a whole different way of looking at data. Just as chess players see patterns in the board positions that their pieces control, Perl adepts look at data in terms of patterns. These patterns, expressed in the punctuation-intensive language of regular expressions,[ 1 ] provide access to powerful algorithms normally available only to computer science scholars.
"If this pattern matching thing is so powerful and so fantastic," you may be saying, "why don't you have a hundred different recipes on regular expressions in this chapter?" Regular expressions are the natural solution to many problems involving numbers, strings, dates, web documents, mail addresses, and almost everything else in this book ; we used pattern matching over 100 times in other chapters. This chapter mostly presents recipes in which pattern matching forms part of the questions, not just part of the answers. Perl's extensive and ingrained support for regular expressions means that you not only have features available that you won't find in any other language, but you have new ways of using them, too. Programmers new to Perl often look for functions like these: match( $string, $pattern ); subst( $string, $pattern, $replacement ); But matching and substituting are such common tasks that they merit their own syntax: $meadow =~ m/sheep/; # True if $meadow contains "sheep" $meadow !~ m/sheep/; # True if $meadow doesn't contain "sheep" $meadow =~ s/old/new/; # Replace "old" with "new" in $meadow
Pattern matching isn't like direct string comparison, even at its simplest. It's more like string searching with mutant wildcards on steroids. Without anchors, the position where the match occurs can float freely throughout the string. Any of the following lines would also be matched by the expression Fine bovines demand fine toreadors. Muskoxen are a polar ovibovine species. Grooviness went out of fashion decades ago. Sometimes they're right in front of you but they still don't match: Ovines are found typically in oviaries.
The problem is that while you are probably thinking in some human language, the pattern matching engine most assuredly is not. When the engine is presented with the pattern As you find your patterns matching some strings you don't want them to match and not matching other strings that you do want them to match, you start embellishing. If you're really looking for nothing but sheep, you probably want to match more like this: if ($meadow =~ /\bovines?\b/i) { print "Here be sheep!" }
Don't be tricked by the phantom cow lurking in that string. That's not a bovine. It's an ovine with a As you see, some characters or sequences of characters have special meaning to the pattern-matching engine. These metacharacters let you anchor the pattern to the start or end of the string, give alternatives for parts of a pattern, allow repetition and wildcarding, and remember part of the matching substring for use later in the pattern or in subsequent code. Learning the syntax of pattern matching isn't as daunting as it might appear. Sure, there are a lot of symbols, but each has a reason for existing. Regular expressions aren't random jumbles of punctuation - they're carefully thought out jumbles of punctuation! If you forget one, you can always look it up. Summary tables are included in Programming Perl , Learning Perl , Mastering Regular Expressions , and the perlre (1) and perlop (1) manpages included with every Perl installation. The Tricky BitsMuch trickier than the syntax of regular expressions is their sneaky semantics. The three aspects of pattern-matching behavior that seem to cause folks the most trouble are greed, eagerness, and backtracking (and also how these three interact with each other).
Greed is the principle that if a quantifier (like
Eagerness is the notion that the leftmost match wins. The engine is very eager to return you a match as quickly as possible, sometimes even before you are expecting it. Consider the match A more illustrative example of eagerness would be the following: $string = "good food"; $string =~ s/o*/e/;
Can you guess which of the following is in
The answer is the last one because the earliest point at which zero or more occurrences of
Can you guess what adding Here's another example of where greed takes a back seat to eagerness:
% echo ababacaca | perl -ne 'print "$&\n" if /(a|ba|b)+(a|ac)+/'
That's because Perl uses what's called a traditional NFA,[ 2 ] a non-deterministic finite automaton. This kind of matching engine is not guaranteed to return the longest overall match, just the longest, leftmost match. You might think of Perl's greed as being left-to-right directed, not globally greedy.
But it doesn't have to be that way. Here's an example using awk , a language that Perl borrows a lot from:
% echo ababacaca |
awk 'match($0,/(a|ba|b)+(a|ac)+/) { print substr($0, RSTART, RLENGTH) }'
Choosing how to implement pattern matching depends mainly on two factors: are the expressions nonregular (do they use backreferences), and what needs to be returned (yes/no, range of whole match, ranges of subexpressions). Tools like awk , egrep , and lex use regular expressions and only need a yes/no answer or the range of the whole match. This is exactly what DFAs can support, and because DFAs are faster and simpler, these tools have traditionally used DFA implementations. Pattern matching within programs and libraries, such as ed , regex , and perl , is another kettle of fish; typically, we need to support nonregular expressions and we need to know what parts of the string were matched by various parts of the pattern. This is a much harder problem with potentially exponential run times. The natural algorithm for this problem is an NFA, and therein lies both a problem and an opportunity. The problem is that NFAs are slow. The opportunity is that significant performance gains can be made by rewriting the patterns to exploit how the particular NFA implementation runs. This is a major part of Jeffrey Friedl's book, Mastering Regular Expressions . The last and most powerful of the three tricky bits in pattern matching is backtracking. For a pattern to match, the entire regular expression must match, not just part of it. So if the beginning of a pattern containing a quantifier succeeds in a way that causes later parts in the pattern to fail, the matching engine backs up and tries to find another match for the beginning part - that's why it's called backtracking. Essentially, it means that the engine is going to try different possibilities, systematically investigating alternate matches until it finds one that works. In some pattern matching implementations, you keep backtracking in case other submatches make the overall match longer. Perl's matcher doesn't do that; as soon as one possibility works, it uses that - until and unless something later on in the pattern fails, forcing a backtrack to retry another possible way of matching. This is discussed in Recipe 6.16 . Pattern-Matching ModifiersPattern-matching modifiers are a lot easier to list and learn than the different metacharacters. Here's a brief summary of them:
while (m/(\d+)/g) { print "Found number $1\n"; }
Used in list context, @numbers = m/(\d+)/g;
That finds only non-overlapping matches. You have to be much sneakier to get overlapping ones by making a zero-width look-ahead with the This shows the difference: $digits = "123456789"; @nonlap = $digits =~ /(\d\d\d)/g; @yeslap = $digits =~ /(?=(\d\d\d))/g; print "Non-overlapping: @nonlap\n"; print "Overlapping: @yeslap\n";
Special Variables
Perl sets special variables as the result of certain kinds of matches: $string = "And little lambs eat ivy"; $string =~ /l[^s]*s/; print "
All this power may make patterns seem omnipotent. Surprisingly enough, this is not (quite) the case. Regular expressions are fundamentally incapable of doing some things. For some of those, special modules lend a hand. Regular expressions are unable to deal with balanced input, that is, anything that's arbitrarily nested, like matching parentheses, matching HTML tags, etc. For that, you have to build up a real parser, like the HTML::Parser recipes in Chapter 20, Web Automation . Another thing Perl patterns can't do yet is fuzzy matches; Recipe 6.13 shows how to use a module to work around that. To learn far more about regular expressions than you ever thought existed, check out Mastering Regular Expressions , written by Jeffrey Friedl and published by O'Reilly & Associates. This book is dedicated to explaining regular expressions from a practical perspective. It not only covers general regular expressions and Perl patterns well, it also compares and contrasts these with those used in other popular languages. | ||||||||||||||||||
|