Regular Expressions (Programming Perl)

1.7. Regular Expressions

Regular expressions (a.k.a. regexes, regexps, or REs) are used by many search programs such as grep and findstr, text-munging programs like sed and awk, and editors like vi and emacs. A regular expression is a way of describing a set of strings without having to list all the strings in your set.[22]

[22] A good source of information on regular expression concepts is Jeffrey Friedl's book, Mastering Regular Expressions (O'Reilly & Associates).

Many other computer languages incorporate regular expressions (some of them even advertise "Perl5 regular expressions"!), but none of these languages integrates regular expressions into the language the way Perl does. Regular expressions are used in several ways in Perl. First and foremost, they're used in conditionals to determine whether a string matches a particular pattern, because in a Boolean context they return true or false. So when you see something that looks like /foo/ in a conditional, you know you're looking at an ordinary pattern-matching operator:

if (/Windows 95/) { print "Time to upgrade?\n" }

Second, if you can locate patterns within a string, you can replace them with something else. So when you see something that looks like s/foo/bar/, you know it's asking Perl to substitute "bar" for "foo", if possible. We call that the substitution operator. It also happens to return true or false depending on whether it succeeded, but usually it's evaluated for its side effect:

s/Windows/Linux/;

Finally, patterns can specify not only where something is, but also where it isn't. So the split operator uses a regular expression to specify where the data isn't. That is, the regular expression defines the separators that delimit the fields of data. Our Average Example has a couple of trivial examples of this. Lines 5 and 12 each split strings on the space character in order to return a list of words. But you can split on any separator you can specify with a regular expression:

($good, $bad, $ugly) = split(/,/, "vi,emacs,teco");
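
Because the separator is itself a pattern, it can describe more than one fixed character. The following is only a sketch: the input string and its uneven spacing are made up for illustration, and the \s and * pieces it relies on are explained later in this section.

```perl
use strict;
use warnings;

# Hypothetical input: the spacing around the commas is inconsistent.
my $editors = "vi , emacs,teco";

# The pattern describes the separator: optional whitespace, a comma,
# then optional whitespace again.
my ($good, $bad, $ugly) = split(/\s*,\s*/, $editors);

print "$good|$bad|$ugly\n";   # vi|emacs|teco
```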

(There are various modifiers you can use in each of these situations to do exotic things like ignore case when matching alphabetic characters, but these are the sorts of gory details that we'll cover later when we get to the gory details.)

The simplest use of regular expressions is to match a literal expression. In the case of the split above, we matched on a single comma character. But if you match on several characters in a row, they all have to match sequentially. That is, the pattern looks for a substring, much as you'd expect. Let's say we want to show all the lines of an HTML file that contain HTTP links (as opposed to FTP links). Let's imagine we're working with HTML for the first time, and we're being a little naïve. We know that these links will always have "http:" in them somewhere. We could loop through our file with this:

while ($line = <FILE>) {
    if ($line =~ /http:/) {
        print $line;
    }
}

Here, the =~ operator (the pattern-binding operator) is telling Perl to look for a match of the regular expression "http:" in the variable $line. If it finds the expression, the operator returns a true value and the block (a print statement) is executed.[23]

[23] This is very similar to what the Unix command grep 'http:' file would do. On MS-DOS you could use the find command, but it doesn't know how to do more complicated regular expressions. (However, the misnamed findstr program of Windows NT does know about regular expressions.)

By the way, if you don't use the =~ binding operator, Perl will search a default string instead of $line. It's like when you say, "Eek! Help me find my contact lens!" People automatically know to look around near you without your actually having to tell them that. Likewise, Perl knows that there is a default place to search for things when you don't say where to search for them. This default string is actually a special scalar variable that goes by the odd name of $_. In fact, it's not the default just for pattern matching; many operators in Perl default to using the $_ variable, so a veteran Perl programmer would likely write the last example as:

while (<FILE>) {
    print if /http:/;
}

(Hmm, another one of those statement modifiers seems to have snuck in there. Insidious little beasties.)
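
The same default applies when the strings come from somewhere other than a filehandle. As a rough sketch, assuming a made-up list of lines in place of the file, a for loop with no loop variable puts each element in $_, and the bare pattern match then searches it:

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the lines of a file.
my @lines = (
    "see http://www.perl.com/ for details\n",
    "nothing to see here\n",
);

my @hits;
for (@lines) {                    # each element becomes $_ in turn
    push @hits, $_ if /http:/;    # no =~, so the match looks at $_
}

print @hits;
```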

This stuff is pretty handy, but what if we wanted to find all of the link types, not just the HTTP links? We could give a list of link types, like "http:", "ftp:", "mailto:", and so on. But that list could get long, and what would we do when a new kind of link was added?

while (<FILE>) {
    print if /http:/;
    print if /ftp:/;
    print if /mailto:/;
    # What next?
}

Since regular expressions are descriptive of a set of strings, we can just describe what we are looking for: a number of alphabetic characters followed by a colon. In regular expression talk (Regexese?), that would be /[a-zA-Z]+:/, where the brackets define a character class. The a-z and A-Z represent all alphabetic characters (the dash means the range of all characters between the starting and ending character, inclusive). And the + is a special character that says "one or more of whatever was before me". It's what we call a quantifier, meaning a gizmo that says how many times something is allowed to repeat. (The slashes aren't really part of the regular expression, but rather part of the pattern-match operator. The slashes are acting like quotes that just happen to contain a regular expression.)
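
Here is one way that description might look in practice. The sample lines are invented for illustration, and the test is just as naive as the "http:" one: any word followed by a colon in ordinary text would also count as a "link".

```perl
use strict;
use warnings;

# Made-up sample lines standing in for the lines of an HTML file.
my @lines = (
    '<a href="http://www.perl.com/">Perl</a>',
    '<a href="ftp://ftp.example.com/pub/">Files</a>',
    '<a href="mailto:larry@example.com">Mail</a>',
    '<p>No link on this line</p>',
);

# Keep every line containing one or more letters followed by a colon.
my @with_links;
for my $line (@lines) {
    push @with_links, $line if $line =~ /[a-zA-Z]+:/;
}

print "$_\n" for @with_links;   # prints the first three lines only
```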

Because certain classes like the alphabetics are so commonly used, Perl defines shortcuts for them:

Name	ASCII Definition	Code
Whitespace	`[ \t\n\r\f]`	`\s`
Word character	`[a-zA-Z_0-9]`	`\w`
Digit	`[0-9]`	`\d`

Note that these match single characters. A \w will match any single word character, not an entire word. (Remember that + quantifier? You can say \w+ to match a word.) Perl also provides the negation of these classes by using the uppercased character, such as \D for a nondigit character.

We should note that \w is not always equivalent to [a-zA-Z_0-9] (and \d is not always [0-9]). Some locales define additional alphabetic characters outside the ASCII sequence, and \w respects them. Newer versions of Perl also know about Unicode letter and digit properties and treat Unicode characters with those properties accordingly. (Perl also considers ideographs to be \w characters.)
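
To get a feel for the shortcuts, you might test a few single characters against each class. This sketch sticks to plain ASCII characters, so the locale and Unicode caveats above don't come into play; the %classes_for hash is just a name chosen for the example.

```perl
use strict;
use warnings;

my %classes_for;
for my $char ("a", "_", "7", " ", "!") {
    my @classes;
    push @classes, '\w' if $char =~ /\w/;   # word character
    push @classes, '\d' if $char =~ /\d/;   # digit
    push @classes, '\s' if $char =~ /\s/;   # whitespace
    push @classes, '\D' if $char =~ /\D/;   # anything but a digit
    $classes_for{$char} = "@classes";
    print "[$char] matches: @classes\n";
}
```

The letter and the underscore come out as \w and \D, the digit as \w and \d, the space as \s and \D, and the exclamation point as \D alone.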

There is one other very special character class, written with a ".", that will match any character whatsoever.[24] For example, /a./ will match any string containing an "a" that is not the last character in the string. Thus it will match "at" or "am" or even "a!", but not "a", since there's nothing after the "a" for the dot to match. Since it's searching for the pattern anywhere in the string, it'll match "oasis" and "camel", but not "sheba". It matches "caravan" on the first "a". It could match on the second "a", but it stops after it finds the first suitable match, searching from left to right.

[24] Except that it won't normally match a newline. When you think about it, a "." doesn't normally match a newline in grep(1) either.
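
The claims about /a./ are easy to check for yourself. The sketch below runs the pattern over the strings mentioned above. It also peeks at the special array @-, whose first element records where the most recent successful match began, to confirm that "caravan" matches at its first "a".

```perl
use strict;
use warnings;

my %matched;
for my $word ("at", "a!", "a", "oasis", "camel", "sheba") {
    $matched{$word} = $word =~ /a./ ? "yes" : "no";
    print "$word: $matched{$word}\n";
}

# $-[0] is the offset at which the last successful match started.
my $offset = "caravan" =~ /a./ ? $-[0] : -1;
print "caravan matched at offset $offset\n";   # offset 1, the first "a"
```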

1.7.1. Quantifiers

The characters and character classes we've talked about all match single characters. We mentioned that you could match multiple "word" characters with \w+. The + is one kind of quantifier, but there are others. All of them are placed after the item being quantified.

The most general form of quantifier specifies both the minimum and maximum number of times an item can match. You put the two numbers in braces, separated by a comma. For example, if you were trying to match North American phone numbers, the sequence \d{7,11} would match at least seven digits, but no more than eleven digits. If you put a single number in the braces, the number specifies both the minimum and the maximum; that is, the number specifies the exact number of times the item can match. (All unquantified items have an implicit {1} quantifier.)

If you put the minimum and the comma but omit the maximum, then the maximum is taken to be infinity. In other words, it will match at least the minimum number of times, plus as many as it can get after that. For example, \d{7} will match only the first seven digits (a local North American phone number, for instance, or the first seven digits of a longer number), while \d{7,} will match any phone number, even an international one (unless it happens to be shorter than seven digits). There is no special way of saying "at most" a certain number of times. Just say .{0,5}, for example, to find at most five arbitrary characters.

Certain combinations of minimum and maximum occur frequently, so Perl defines special quantifiers for them. We've already seen +, which is the same as {1,}, or "at least one of the preceding item". There is also *, which is the same as {0,}, or "zero or more of the preceding item", and ?, which is the same as {0,1}, or "zero or one of the preceding item" (that is, the preceding item is optional).
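
A small experiment shows how much each quantifier takes. This is only a sketch: the fourteen-digit number is made up, and the parentheses are there solely to capture what matched (they are covered under "Backreferences" later in this section).

```perl
use strict;
use warnings;

my $number = "12125551234567";   # fourteen digits, invented for the example

my ($exactly_seven)   = $number =~ /(\d{7})/;      # first seven digits
my ($seven_to_eleven) = $number =~ /(\d{7,11})/;   # first eleven digits
my ($seven_or_more)   = $number =~ /(\d{7,})/;     # all fourteen digits
my ($at_most_five)    = "camel caravan" =~ /(.{0,5})/;

# The shorthand + and the explicit {1,} behave identically.
my ($plus)   = "aaa" =~ /(a+)/;
my ($braces) = "aaa" =~ /(a{1,})/;

print "$exactly_seven\n$seven_to_eleven\n$seven_or_more\n$at_most_five\n";
```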

You need to be careful of a couple of things about quantification. First of all, Perl quantifiers are by default greedy. This means that they will attempt to match as much as they can as long as the whole pattern still matches. For example, if you are matching /\d+/ against "1234567890", it will match the entire string. This is something to watch out for especially when you are using ".", any character. Often, someone will have a string like:

larry:JYHtPh0./NJTU:100:10:Larry Wall:/home/larry:/bin/tcsh

and will try to match "larry:" with /.+:/. However, since the + quantifier is greedy, this pattern will match everything up to and including "/home/larry:", because it matches as much as possible before the last colon, including all the other colons. Sometimes you can avoid this by using a negated character class, that is, by saying /[^:]+:/, which says to match one or more noncolon characters (as many as possible), up to the first colon. It's that little caret in there that negates the Boolean sense of the character class.[25]

The other point to be careful about is that regular expressions will try to match as early as possible. This even takes precedence over being greedy. Since scanning happens left-to-right, this means that the pattern will match as far left as possible, even if there is some other place where it could match longer. (Regular expressions may be greedy, but they aren't into delayed gratification.) For example, suppose you're using the substitution command (s///) on the default string (variable $_, that is), and you want to remove a string of x's from the middle of the string. If you say:

$_ = "fred xxxxxxx barney";
s/x*//;

it will have absolutely no effect! This is because the x* (meaning zero or more "x" characters) will be able to match the "nothing" at the beginning of the string, since the null string happens to be zero characters wide and there's a null string just sitting there plain as day before the "f" of "fred".[26]

[25] Sorry, we didn't pick that notation, so don't blame us. That's just how negated character classes are customarily written in Unix culture.

[26] Don't feel bad. Even the authors get caught by this from time to time.
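
Both pitfalls can be seen side by side in a short sketch. The capturing parentheses are used here only to show what each pattern grabbed (they are explained under "Backreferences"), and the x+ variant is one possible fix for the second example, not the only one.

```perl
use strict;
use warnings;

my $entry = "larry:JYHtPh0./NJTU:100:10:Larry Wall:/home/larry:/bin/tcsh";

my ($greedy)  = $entry =~ /(.+:)/;      # runs on to the last colon
my ($negated) = $entry =~ /([^:]+:)/;   # stops at the first colon

print "$greedy\n";    # larry:JYHtPh0./NJTU:100:10:Larry Wall:/home/larry:
print "$negated\n";   # larry:

my $string = "fred xxxxxxx barney";

# Copy the string, then edit the copy, so the original stays intact.
(my $star = $string) =~ s/x*//;   # matches the empty string before the "f"
(my $plus = $string) =~ s/x+//;   # needs at least one x, so it finds the run

print "$star\n";   # fred xxxxxxx barney (unchanged)
print "$plus\n";   # fred  barney
```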

There's one other thing you need to know. By default, quantifiers apply to a single preceding character, so /bam{2}/ will match "bamm" but not "bambam". To apply a quantifier to more than one character, use parentheses. So to match "bambam", use the pattern /(bam){2}/.
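
A quick check of both patterns against both strings makes the difference concrete. The variable names are arbitrary; each holds 1 if the pattern matched and 0 if it didn't.

```perl
use strict;
use warnings;

# Without parentheses, {2} applies only to the "m".
my $plain_bamm   = "bamm"   =~ /bam{2}/ ? 1 : 0;   # 1
my $plain_bambam = "bambam" =~ /bam{2}/ ? 1 : 0;   # 0

# With parentheses, {2} applies to the whole group "bam".
my $group_bamm   = "bamm"   =~ /(bam){2}/ ? 1 : 0;   # 0
my $group_bambam = "bambam" =~ /(bam){2}/ ? 1 : 0;   # 1

print "$plain_bamm $plain_bambam $group_bamm $group_bambam\n";   # 1 0 0 1
```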

1.7.2. Minimal Matching

If you were using an ancient version of Perl and you didn't want greedy matching, you had to use a negated character class. (And really, you were still getting greedy matching of a constrained variety.)

In modern versions of Perl, you can force nongreedy, minimal matching by placing a question mark after any quantifier. Our same username match would now be /.*?:/. That .*? will now try to match as few characters as possible, rather than as many as possible, so it stops at the first colon rather than at the last.
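
The effect of that question mark is easiest to see by comparing greedy and minimal versions of the same pattern. In this sketch the parentheses merely capture what was matched, as described under "Backreferences".

```perl
use strict;
use warnings;

my $entry = "larry:JYHtPh0./NJTU:100:10:Larry Wall:/home/larry:/bin/tcsh";

my ($greedy)  = $entry =~ /(.*:)/;    # as much as possible before a colon
my ($minimal) = $entry =~ /(.*?:)/;   # as little as possible before a colon

print "$greedy\n";    # larry:JYHtPh0./NJTU:100:10:Larry Wall:/home/larry:
print "$minimal\n";   # larry:

# The same applies to any quantifier, not just *.
my ($all_digits) = "1234567890" =~ /(\d+)/;    # 1234567890
my ($one_digit)  = "1234567890" =~ /(\d+?)/;   # 1
```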

1.7.3. Nailing Things Down

Whenever you try to match a pattern, it's going to try to match in every location till it finds a match. An anchor allows you to restrict where the pattern can match. Essentially, an anchor is something that matches a "nothing", but a special kind of nothing that depends on its surroundings. You could also call it a rule, or a constraint, or an assertion. Whatever you care to call it, it tries to match something of zero width, and either succeeds or fails. (Failure merely means that the pattern can't match that particular way. The pattern will go on trying to match some other way, if there are any other ways left to try.)

The special symbol \b matches at a word boundary, which is defined as the "nothing" between a word character (\w) and a nonword character (\W), in either order. (The characters that don't exist off the beginning and end of your string are considered to be nonword characters.) For example,

/\bFred\b/

would match "Fred" in both "The Great Fred" and "Fred the Great", but not in "Frederick the Great" because the "d" in "Frederick" is not followed by a nonword character.
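
Running the pattern over those three strings might look like this (a minimal sketch; %found is just a name chosen for the example):

```perl
use strict;
use warnings;

my %found;
for my $name ("The Great Fred", "Fred the Great", "Frederick the Great") {
    $found{$name} = $name =~ /\bFred\b/ ? 1 : 0;
    print "$name: ", ($found{$name} ? "match" : "no match"), "\n";
}
```

The first string matches because the end of the string counts as a nonword character after the "d".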

In a similar vein, there are also anchors for the beginning of the string and the end of the string. If it is the first character of a pattern, the caret (^) matches the "nothing" at the beginning of the string. Therefore, the pattern /^Fred/ would match "Fred" in "Frederick the Great" but not in "The Great Fred", whereas /Fred^/ wouldn't match either. (In fact, it doesn't even make much sense.) The dollar sign ($) works like the caret, except that it matches the "nothing" at the end of the string instead of the beginning.[27]

[27] This is a bit oversimplified, since we're assuming here that your string contains no newlines. By default, $ also matches just before a newline at the very end of the string, and under the /m modifier ^ and $ become anchors for the beginnings and endings of lines rather than of the whole string. We'll try to straighten this all out in Chapter 5, "Pattern Matching" (to the extent that it can be straightened out).

So now you can probably figure out that when we said:

next LINE if $line =~ /^#/;

we meant "Go to the next iteration of the LINE loop if this line happens to begin with a # character."

Earlier we said that the sequence \d{7,11} would match a number from seven to eleven digits long. While strictly true, the statement is misleading: when you use that sequence within a real pattern match operator such as /\d{7,11}/, it does not preclude there being extra unmatched digits after the 11 matched digits! You often need to anchor quantified patterns on either or both ends to get what you expect.
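
The difference shows up as soon as you feed the pattern a string that is too long. This sketch uses two made-up digit strings, one of fifteen digits and one of seven, and assumes neither contains a newline.

```perl
use strict;
use warnings;

my $too_long = "123456789012345";   # fifteen digits, invented for the example
my $local    = "5551234";           # seven digits

# Unanchored: happy to match the first eleven digits and ignore the rest.
my $unanchored = $too_long =~ /\d{7,11}/ ? 1 : 0;     # 1

# Anchored at both ends: the whole string must be seven to eleven digits.
my $anchored_long  = $too_long =~ /^\d{7,11}$/ ? 1 : 0;   # 0
my $anchored_local = $local    =~ /^\d{7,11}$/ ? 1 : 0;   # 1

print "$unanchored $anchored_long $anchored_local\n";   # 1 0 1
```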

1.7.4. Backreferences

We mentioned earlier that you can use parentheses to group things for quantifiers, but you can also use parentheses to remember bits and pieces of what you matched. A pair of parentheses around a part of a regular expression causes whatever was matched by that part to be remembered for later use. It doesn't change what the part matches, so /\d+/ and /(\d+)/ will still match as many digits as possible, but in the latter case they will be remembered in a special variable to be backreferenced later.

How you refer back to the remembered part of the string depends on where you want to do it from. Within the same regular expression, you use a backslash followed by an integer. The integer corresponding to a given pair of parentheses is determined by counting left parentheses from the beginning of the pattern, starting with one. So for example, to match something similar to an HTML tag like "<B>Bold</B>", you might use /<(.*?)>.*?<\/\1>/. This forces the two parts of the pattern to match the exact same string, such as the "B" in this example.
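
A brief sketch shows the backreference doing its job. The second string, with its mismatched closing tag, is made up here to show a failure.

```perl
use strict;
use warnings;

# In list context the match returns what the parentheses captured.
my ($tag) = "<B>Bold</B>" =~ /<(.*?)>.*?<\/\1>/;
print "tag: $tag\n";   # tag: B

# The closing tag must repeat whatever the parentheses matched,
# so a pair of mismatched tags fails.
my $mismatch = "<B>Bold</I>" =~ /<(.*?)>.*?<\/\1>/ ? 1 : 0;
print "mismatch: $mismatch\n";   # mismatch: 0
```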

Outside the regular expression itself, such as in the replacement part of a substitution, you use a $ followed by an integer, that is, a normal scalar variable named by the integer. So, if you wanted to swap the first two words of a string, for example, you could use:

s/(\S+)\s+(\S+)/$2 $1/

The right side of the substitution (between the second and third slashes) is mostly just a funny kind of double-quoted string, which is why you can interpolate variables there, including backreference variables. This is a powerful concept: interpolation (under controlled circumstances) is one of the reasons Perl is a good text-processing language. The other reason is the pattern matching, of course. Regular expressions are good for picking things apart, and interpolation is good for putting things back together again. Perhaps there's hope for Humpty Dumpty after all.
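
For instance, the word-swapping substitution above could be tried out like this. The sample phrase is made up, and the copy-then-substitute idiom is used only so that the original string stays available for comparison.

```perl
use strict;
use warnings;

my $phrase = "hello world and more";   # sample text for illustration

# Copy $phrase into $swapped, then apply the substitution to the copy.
(my $swapped = $phrase) =~ s/(\S+)\s+(\S+)/$2 $1/;

print "$swapped\n";   # world hello and more
```

Only the first two words change places, because the substitution stops after its first match.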