1.7 Regular Expressions
Regular expressions (aka regexps, regexes, or REs) are used by many UNIX programs, such as grep, sed, and awk,[24] editors like vi and emacs, and even some of the shells. A regular expression is a way of describing a set of strings without having to list all the strings in your set.
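Regular expressions are used several ways in Perl. First and foremost, they're used in conditionals to determine whether a string matches a particular pattern. So when you see a pattern between slashes, like the /http:/ used later in this section, you're looking at an ordinary pattern-matching operator.
Second, if you can locate patterns within a string, you can replace them with something else. So when you see something that looks like s/PATTERN/REPLACEMENT/, you're looking at the substitution operator, which asks Perl to replace whatever the pattern matches, if possible.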
Finally, patterns can specify not only where something is, but also where it isn't. So the split operator uses a regular expression to specify where the data isn't. That is, the regular expression defines the delimiters that separate the fields of data. Our grade example has a couple of trivial examples of this. Lines 5 and 12 each split strings on the space character in order to return a list of words. But you can split on any delimiter you can specify with a regular expression.
(There are various modifiers you can use in each of these situations to do exotic things like ignore case when matching alphabetic characters, but these are the sorts of gory details that we'll cover in Chapter 2.)
The simplest use of regular expressions is to match a literal expression. In the case of the splits we just mentioned, we matched on a single space. But if you match on several characters in a row, they all have to match sequentially. That is, the pattern looks for a substring, much as you'd expect. Let's say we want to show all the lines of an HTML file that are links to other HTML files (as opposed to FTP links). Let's imagine we're working with HTML for the first time, and we're still a little naive. We know that these links will always have "http:" in them somewhere. We could loop through our file with this:[25]
    while ($line = <FILE>) {
        if ($line =~ /http:/) {
            print $line;
        }
    }
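Here, the =~ (the pattern binding operator) tells Perl to look for a match of the regular expression http: in the variable $line. If you leave out the binding operator, Perl searches the default variable $_ instead, and a bare while (<FILE>) reads each line into $_, so the same loop can be written more compactly as:
    while (<FILE>) {
        print if /http:/;
    }
(Hmm, another one of those statement modifiers seems to have snuck in there. Insidious little beasties.)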
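This stuff is pretty handy, but what if we wanted to find all the links, not just the HTTP links? We could give a list of links, like "http:", "ftp:", "mailto:", and so on. But that list could get long, and what would we do when a new kind of link came along?
    while (<FILE>) {
        print if /http:/;
        print if /ftp:/;
        print if /mailto:/;
        # What next?
    }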
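Since regular expressions are descriptive of a set of strings, we can just describe what we are looking for: a number of alphabetic characters followed by a colon. In regular expression talk (Regexpese?), that would be /[a-zA-Z]+:/, where the brackets define a character class, the ranges a-z and A-Z cover all the alphabetic characters, and the + says "one or more of whatever came just before me".
Because certain classes like the alphabetics are so commonly used, Perl defines special cases for them. See Table 1.7 for these special cases.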
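Note that these match single characters. A \w will match any single word character, not an entire word; to match a whole word you can say \w+. (We should note that \w is not always equivalent to [a-zA-Z_0-9]. Some locales define additional alphabetic characters outside the ASCII range, and \w respects them.)
There is one other very special character class, written with a ".", that will match any character whatsoever (except, by default, a newline).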
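As a rough sketch of how these classes behave, the following script tries the alphabetic class, \w, and the dot on a few sample strings. The sample lines and variable names are invented for illustration, and the /g modifier (which collects every match) is used here only to make the results easy to print.

```perl
use strict;
use warnings;

# Hypothetical sample lines, invented for this illustration.
my @lines = (
    '<a href="http://www.perl.com/">Perl</a>',
    '<a href="ftp://ftp.example.org/pub/">Files</a>',
    '<a href="mailto:someone@example.org">Mail</a>',
    'No link on this line',
);

# A character class: one or more alphabetic characters, then a colon.
my @links = grep { /[a-zA-Z]+:/ } @lines;
print scalar(@links), " of ", scalar(@lines), " lines look like links\n";

# \w matches one word character at a time; \w+ matches a whole run of them.
my @single = "Camel-42" =~ /\w/g;     # seven separate one-character matches
my @runs   = "Camel-42" =~ /\w+/g;    # "Camel" and "42"
print "single: @single\n";
print "runs:   @runs\n";

# The dot matches any character except (by default) a newline.
print "dot matched a hyphen\n"    if     "a-"  =~ /a./;
print "dot refused the newline\n" unless "a\n" =~ /a./;
```

One pattern now covers all three kinds of link, and it would cover a new scheme too, which is the point of describing the set rather than listing it.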
1.7.1 Quantifiers
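The characters and character classes we've talked about all match single characters. We mentioned that you could match multiple "word" characters with \w+ in order to match an entire word. The + is a quantifier, and there are others. (All of them are placed after the item being quantified.)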
The most general form of quantifier specifies both the minimum and maximum number of times an item can match. You put the two numbers in braces, separated by a comma. For example, if you were trying to match North American phone numbers, \d{7,11} would match at least seven digits, but no more than eleven.
If you put the minimum and the comma but omit the maximum, then the maximum is taken to be infinity. In other words, it will match at least the minimum number of times, plus as many as it can get after that. For example, \d{7,} will match seven digits, or eight, or as many more as happen to follow.
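Certain combinations of minimum and maximum occur frequently, so Perl defines special quantifiers for them. We've already seen +, which is the same as {1,}, or "at least one of the preceding item". There is also *, which is the same as {0,}, or "zero or more of the preceding item", and ?, which is the same as {0,1}, or "zero or one of the preceding item" (that is, the preceding item is optional).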
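To see the brace quantifiers and their shorthand forms side by side, here is a small sketch. The helper matched_text is hypothetical (it is not part of Perl); it simply returns the text a pattern matched by way of the built-in $& variable. The digit strings and words are invented sample data.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text a pattern matched, or undef.
sub matched_text {
    my ($string, $pattern) = @_;
    return $string =~ $pattern ? $& : undef;
}

my $digits = '123456789012345';    # fifteen digits, invented sample data

my $bounded   = matched_text($digits, qr/\d{7,11}/);  # stops at eleven
my $unbounded = matched_text($digits, qr/\d{7,}/);    # takes all fifteen
my $too_short = matched_text('12345', qr/\d{7,11}/);  # fewer than seven

print "bounded:   $bounded\n";
print "unbounded: $unbounded\n";
print "too short: ", (defined $too_short ? $too_short : 'no match'), "\n";

# The shorthand quantifiers: + is {1,}, * is {0,}, ? is {0,1}.
my $plus     = matched_text('abbbc', qr/ab+/);     # "abbb"
my $star     = matched_text('ac',    qr/ab*/);     # "a": zero b's is fine
my $optional = matched_text('color', qr/colou?r/); # "color": the u is optional
print "plus: $plus, star: $star, optional: $optional\n";
```

Note that the bounded pattern still finds a match inside the fifteen-digit string; the maximum limits how much one match consumes, not how long the surrounding string may be.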
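There are a couple of things about quantification that you need to be careful of. First of all, Perl quantifiers are by default greedy. This means that they will attempt to match as much as they can as long as the entire expression still matches. For example, if you are matching /\d+/ against "1234567890", it will match the entire string. This is something to watch out for especially when you are using ".", any character. Often, someone will have a string like:
    spp:Fe+H20=FeO2;H:2112:100:Stephen P Potter:/home/spp:/bin/tcsh
and try to match "spp:" with /.+:/. However, since the + quantifier is greedy, this pattern will match everything up to and including "/home/spp:".
The other thing to be careful of is that regular expressions will try to match as early as possible. This even takes precedence over being greedy. For example, suppose you want to remove a string of x's from the middle of the default variable $_. If you say:
    $_ = "fred xxxxxxx barney";
    s/x*//;
it will have absolutely no effect. This is because the x* (meaning zero or more "x" characters) is able to match the "nothing" at the very beginning of the string, before the "f" of "fred", and that is the first place Perl tries.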
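Both behaviors are easy to check for yourself. The sketch below reuses the two strings from the text; the variable names are invented, and the s/x+// line is an alternative added here for contrast rather than something the text prescribes.

```perl
use strict;
use warnings;

my $entry = 'spp:Fe+H20=FeO2;H:2112:100:Stephen P Potter:/home/spp:/bin/tcsh';

# Greedy: .+ takes as much as it can while still leaving a colon to match,
# so the match runs to the last colon, not the first.
my $greedy = $entry =~ /.+:/ ? $& : '';
print "greedy match: $greedy\n";

# Leftmost: x* is satisfied by zero x's, and the first place that works
# is the empty string in front of "fred".
my $text = "fred xxxxxxx barney";
(my $star = $text) =~ s/x*//;
print "after s/x*//: $star\n";

# Requiring at least one x (an alternative not taken from the text)
# forces the match forward to the real run of x's.
(my $plus = $text) =~ s/x+//;
print "after s/x+//: $plus\n";
```

The first substitution leaves the string untouched, while the second removes the x's and leaves the two spaces that surrounded them.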
There's one other thing you need to know. By default quantifiers apply to a single preceding character, so /bam{2}/ will match "bamm" but not "bambam". To apply a quantifier to more than one character, use parentheses. So to match "bambam", use the pattern /(bam){2}/.
1.7.2 Minimal Matching
If you were using an ancient version of Perl and you didn't want greedy matching, you had to use a negated character class. (And really, you were still getting greedy matching of a constrained variety.)
In modern versions of Perl, you can force nongreedy, minimal matching by use of a question mark after any quantifier. Our same username match would now be /.+?:/. That .+? will now try to match as few characters as possible, rather than as many as possible, so it stops at the first colon rather than the last.
1.7.3 Nailing Things Down
Whenever you try to match a pattern, it's going to try to match in every location till it finds a match. An anchor allows you to restrict where the pattern can match. Essentially, an anchor is something that matches a "nothing", but a special kind of nothing that depends on its surroundings. You could also call it a rule, or a constraint, or an assertion. Whatever you care to call it, it tries to match something of zero width, and either succeeds or fails. (If it fails, it merely means that the pattern can't match that particular way. The pattern will go on trying to match some other way, if there are any other ways to try.)
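The special character string \b matches at a word boundary, which is defined as the "nothing" between a word character (\w) and a non-word character (\W), in either order. (The characters that don't exist off the beginning and end of your string are considered to be non-word characters.) For example,
    /\bFred\b/
would match both "The Great Fred" and "Fred the Great", but would not match "Frederick the Great" because the "d" and "e" in "Frederick" have no word boundary between them.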
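A short sketch makes the effect of the word-boundary anchor visible. The three sample strings are chosen for illustration, and the variable names are invented.

```perl
use strict;
use warnings;

# Sample strings chosen for illustration.
my @titles = ('The Great Fred', 'Fred the Great', 'Frederick the Great');

# \b demands a word boundary on both sides of "Fred".
my @whole_word = grep { /\bFred\b/ } @titles;

# Without the anchors, "Fred" matches anywhere, even inside "Frederick".
my @anywhere = grep { /Fred/ } @titles;

print "with \\b:    ", join(' | ', @whole_word), "\n";
print "without \\b: ", join(' | ', @anywhere), "\n";
```

The anchors consume no characters themselves; they only veto positions where the surrounding characters are of the wrong kind.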
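In a similar vein, there are also anchors for the beginning of the string and the end of the string. If it is the first character of a pattern, the caret (^) matches the "nothing" at the beginning of the string. Therefore, the pattern /^Fred/ would match "Frederick the Great" and not "The Great Fred". The dollar sign ($) works like the caret, except that it matches the "nothing" at the end of the string instead of the beginning.
So now you can probably figure out that when we said: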
    next LINE if $line =~ /^#/;
we meant "Go to the next iteration of the LINE loop if this line happens to begin with a # character."
1.7.4 Backreferences
We mentioned earlier that you can use parentheses to group things for quantifiers, but you can also use parentheses to remember bits and pieces of what you matched. A pair of parentheses around a part of a regular expression causes whatever was matched by that part to be remembered for later use. It doesn't change what the part matches, so /\d+/ and /(\d+)/ will still match as many digits as possible, but in the latter case they will be remembered in a special variable to be backreferenced later.
How you refer back to the remembered part of the string depends on where you want to do it from. Within the same regular expression, you use a backslash followed by an integer. The integer corresponding to a given pair of parentheses is determined by counting left parentheses from the beginning of the pattern, starting with one. So for example, to match something similar to an HTML tag (like "<B>Bold</B>"), you might use /<(.*?)>.*?<\/\1>/. This forces the two parts of the pattern to match the exact same string, such as the "B" above.
Outside the regular expression itself, such as in the replacement part of a substitution, the special variable is used as if it were a normal scalar variable named by the integer: $1 for the first pair of parentheses, $2 for the second, and so on. So, if you wanted to swap the first two words of a string, for example, you could use:
    s/(\S+)\s+(\S+)/$2 $1/
The right side of the substitution is really just a funny kind of double-quoted string, which is why you can interpolate variables there, including backreference variables. This is a powerful concept: interpolation (under controlled circumstances) is one of the reasons Perl is a good text-processing language. The other reason is the pattern matching, of course. Regular expressions are good for picking things apart, and interpolation is good for putting things back together again. Perhaps there's hope for Humpty Dumpty after all.
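Here is a minimal sketch of both kinds of backreference at work: \1 inside a pattern and $1 and $2 outside it. The tag-matching pattern is only an illustration of the idea, not a serious HTML parser, and the sample strings and variable names are invented.

```perl
use strict;
use warnings;

# Inside the pattern, \1 must repeat whatever the first parentheses matched.
my $html = '<B>Bold</B> and <I>italic</I>';   # invented sample markup
my @tags;
while ($html =~ /<(.*?)>.*?<\/\1>/g) {
    push @tags, $1;          # outside the pattern, the same text is in $1
}
print "matched tags: @tags\n";

# A mismatched pair fails, because the closing tag does not repeat
# what the parentheses remembered.
my $mismatch = '<B>Bold</I>' =~ /<(.*?)>.*?<\/\1>/ ? 1 : 0;
print "mismatched pair matched? $mismatch\n";

# In the replacement, $1 and $2 interpolate like ordinary scalars.
(my $swapped = 'hello world again') =~ s/(\S+)\s+(\S+)/$2 $1/;
print "$swapped\n";
```

Only the first two words trade places in the last example; the rest of the string is left alone because the pattern never reaches it.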