5.7. Capturing and ClusteringPatterns allow you to group portions of your pattern together into subpatterns and to remember the strings matched by those subpatterns. We call the first behavior clustering and the second one capturing. 5.7.1. CapturingTo capture a substring for later use, put parentheses around the subpattern that matches it. The first pair of parentheses stores its substring in $1, the second pair in $2, and so on. You may use as many parentheses as you like; Perl just keeps defining more numbered variables for you to represent these captured strings. Some examples: Note the difference between the second and third patterns. The second form is usually what you want. The third form does not create multiple variables for multiple digits. Parentheses are numbered when the pattern is compiled, not when it is matched./(\d)(\d)/ # Match two digits, capturing them into $1 and $2 /(\d+)/ # Match one or more digits, capturing them all into $1 /(\d)+/ # Match a digit one or more times, capturing the last into $1 Captured strings are often called backreferences because they refer back to parts of the captured text. There are actually two ways to get at these backreferences. The numbered variables you've seen are how you get at backreferences outside of a pattern, but inside the pattern, that doesn't work. You have to use \1, \2, etc.[9] So to find doubled words like "the the" or "had had", you might use this pattern: But most often, you'll be using the $1 form, because you'll usually apply a pattern and then do something with the substrings. Suppose you have some text (a mail header) that looks like this:/\b(\w+) \1\b/i and you want to construct a hash that maps the text before each colon to the text afterward. If you were looping through this text line by line (say, because you were reading it from a file) you could do that as follows:From: gnat@perl.com To: camelot@oreilly.com Date: Mon, 17 Jul 2000 09:00:00 -1000 Subject: Eye of the needle Like $`, $&, and $', these numbered variables are dynamically scoped through the end of the enclosing block or eval string, or to the next successful pattern match, whichever comes first. You can use them in the righthand side (the replacement part) of a substitute, too:while (<>) { /^(.*?): (.*)$/; # Pre-colon text into $1, post-colon into $2 $fields{$1} = $2; } Groupings can nest, and when they do, the groupings are counted by the location of the left parenthesis. So given the string "Primula Brandybuck", the pattern:s/^(\S+) (\S+)/$2 $1/; # Swap first two words
/^((\w+) (\w+))$/ would capture "Primula Brandybuck" into $1, "Primula" into $2, and "Brandybuck" into $3. This is depicted in Figure 5-1.
Figure 5.1. Creating backreferences with parenthesesPatterns with captures are often used in list context to populate a list of values, since the pattern is smart enough to return the captured substrings as a list: With the /g modifier, a pattern can return multiple substrings from multiple matches, all in one list. Suppose you had the mail header we saw earlier all in one string (in $_, say). You could do the same thing as our line-by-line loop, but with one statement:($first, $last) = /^(\w+) (\w+)$/; ($full, $first, $last) = /^((\w+) (\w+))$/; The pattern matches four times, and each time it matches, it finds two substrings. The /gm match returns all of these as a flat list of eight strings, which the list assignment to %fields will conveniently interpret as four key/value pairs, thus restoring harmony to the universe.%fields = /^(.*?): (.*)$/gm; Several other special variables deal with text captured in pattern matches. $& contains the entire matched string, $` everything to the left of the match, $' everything to the right. $+ contains the contents of the last backreference. For more explanation of these magical Elvish variables (and for a way to write them in English), see Chapter 28, "Special Names".$_ = "Speak, <EM>friend</EM>, and enter."; m[ (<.*?>) (.*?) (</.*?>) ]x; # A tag, then chars, then an end tag print "prematch: $`\n"; # Speak, print "match: $&\n"; # <EM>friend</EM> print "postmatch: $'\n"; # , and enter. print "lastmatch: $+\n"; # </EM> The @- (@LAST_MATCH_START) array holds the offsets of the beginnings of any submatches, and @+ (@LAST_MATCH_END) holds the offsets of the ends: #!/usr/bin/perl $alphabet = "abcdefghijklmnopqrstuvwxyz"; $alphabet =~ /(hi).*(stu)/; print "The entire match began at $-[0] and ended at $+[0]\n"; print "The first match began at $-[1] and ended at $+[1]\n"; print "The second match began at $-[2] and ended at $+[2]\n"; If you really want to match a literal parenthesis character instead of having it interpreted as a metacharacter, backslash it: This matches a parenthesized example (e.g., this statement). But since dot is a wildcard, this also matches any parenthetical statement with the first letter e and third letter g (ergo, this statement too)./\(e.g., .*?\)/ 5.7.2. ClusteringBare parentheses both cluster and capture. But sometimes you don't want that. Sometimes you just want to group portions of the pattern without creating a backreference. You can use an extended form of parentheses to suppress capturing: the (?:PATTERN) notation will cluster without capturing. There are at least three reasons you might want to cluster without capturing:
In addition, it's more efficient to suppress the capture of something you're not going to use. On the minus side, the notation is a little noisier, visually speaking. In a pattern, a left parenthesis immediately followed by a question mark denotes a regex extension. The current regular expression bestiary is relatively fixed--we don't dare create a new metacharacter, for fear of breaking old Perl programs. Instead, the extension syntax is used to add new features to the bestiary. In the remainder of the chapter, we'll see many more regex extensions, all of which cluster without capturing, as well as doing something else. The (?:PATTERN) extension is just special in that it does nothing else. So if you say: it's like:@fields = split(/\b(?:a|b|c)\b/) but doesn't spit out extra fields. (The split operator is a bit like m//g in that it will emit extra fields for all the captured substrings within the pattern. Ordinarily, split only returns what it didn't match. For more on split see Chapter 29, "Functions".)@fields = split(/\b(a|b|c)\b/) 5.7.3. Cloistered Pattern ModifiersYou may cloister the /i, /m, /s, and /x modifiers within a portion of your pattern by inserting them (without the slash) between the ? and : of the clustering notation. If you say: it matches both "Harry S Truman" and "Harry s Truman", whereas:/Harry (?i:s) Truman/ matches both "Harry S Truman" and "Harry S. Truman", as well as "Harry Truman", and:/Harry (?x: [A-Z] \.? \s )?Truman/ matches all five, by combining the /i and /x modifiers within the cloister./Harry (?ix: [A-Z] \.? \s )?Truman/ You can also subtract modifiers from a cloister with a minus sign: This matches any capitalization of the name--but if the middle initial is provided, it must be capitalized, since the /i applied to the overall pattern is suspended inside the cloister./Harry (?x-i: [A-Z] \.? \s )?Truman/i By omitting the colon and PATTERN, you can export modifier settings to an outer cluster, turning it into a cloister. That is, you can selectively turn modifiers on and off for the cluster one level outside the modifiers' parentheses, like so: Note that the second and third examples create backreferences. If that wasn't what you wanted, then you should have been using (?-i:bar) and (?x-i: bar), respectively./(?i)foo/ # Equivalent to /foo/i /foo((?-i)bar)/i # "bar" must be lower case /foo((?x-i) bar)/ # Enables /x and disables /i for "bar" Setting modifiers on a portion of your pattern is particularly useful when you want "." to match newlines in part of your pattern but not in the rest of it. Setting /s on the whole pattern doesn't help you there. Copyright © 2002 O'Reilly & Associates. All rights reserved. |
|