Chapter 5. Pattern Matching

Perl's built-in support for pattern matching lets you search large amounts of data conveniently and efficiently. Whether you run a huge commercial portal site scanning every newsfeed in existence for interesting tidbits, or a government organization dedicated to figuring out human demographics (or the human genome), or an educational institution just trying to get some dynamic information up on your web site, Perl is the tool of choice, in part because of its database connections, but largely because of its pattern-matching capabilities. If you take "text" in the widest possible sense, perhaps 90% of what you do is 90% text processing. That's really what Perl is all about and always has been about--in fact, it's even part of Perl's name: Practical Extraction and Report Language. Perl's patterns provide a powerful way to scan through mountains of mere data and extract useful information from it.

You specify a pattern by creating a regular expression (or regex), and Perl's regular expression engine (the "Engine", for the rest of this chapter) then takes that expression and determines whether (and how) the pattern matches your data. While most of your data will probably be text strings, there's nothing stopping you from using regexes to search and replace any byte sequence, even what you'd normally think of as "binary" data. To Perl, bytes are just characters that happen to have an ordinal value less than 256. (More on that in Chapter 15, "Unicode".)

If you're acquainted with regular expressions from some other venue, we should warn you that regular expressions are a bit different in Perl. First, they aren't entirely "regular" in the theoretical sense of the word, which means they can do much more than the traditional regular expressions taught in computer science classes. Second, they are used so often in Perl that they have their own special variables, operators, and quoting conventions which are tightly integrated into the language, not just loosely bolted on like any other library. Programmers new to Perl often look in vain for functions like these:

match( $string, $pattern );
subst( $string, $pattern, $replacement );

But matching and substituting are such fundamental tasks in Perl that they merit one-letter operators: m/PATTERN/ and s/PATTERN/REPLACEMENT/ (m// and s///, for short). Not only are they syntactically brief, but they're also parsed like double-quoted strings rather than ordinary operators; nevertheless, they operate like operators, so we'll call them that. Throughout this chapter, you'll see these operators used to match patterns against a string. If some portion of the string fits the pattern, we say that the match is successful. There are lots of cool things you can do with a successful pattern match. In particular, if you are using s///, a successful match causes the matched portion of the string to be replaced with whatever you specified as the REPLACEMENT.
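As a minimal sketch of the two operators at work (the variable names and sample text are ours, made up for illustration, and the =~ binding operator that attaches a string to a pattern has not been introduced yet at this point):

```perl
use strict;
use warnings;

my $string = "Frodo lives in Bag End";

# m// is true if the pattern matches somewhere in $string.
my $found = ($string =~ m/Bag End/) ? 1 : 0;

# s/// replaces the matched portion of $string in place and
# returns the number of substitutions it made.
my $count = ($string =~ s/Bag End/Rivendell/);

print "found=$found count=$count\n";
print "$string\n";
```

Note that s/// changes $string itself rather than returning a modified copy, which is one reason it doesn't look like an ordinary function call.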

This chapter is all about how to build and use patterns. Perl's regular expressions are potent, packing a lot of meaning into a small space. They can therefore be daunting if you try to intuit the meaning of a long pattern as a whole. But if you can break it up into its parts, and if you know how the Engine interprets those parts, you can understand any regular expression. It's not unusual to see a hundred-line C or Java program expressed with a one-line regular expression in Perl. That regex may be a little harder to understand than any single line out of the longer program; on the other hand, the regex will likely be much easier to understand than the longer program taken as a whole. You just have to keep these things in perspective.

5.1. The Regular Expression Bestiary

Before we dive into the rules for interpreting regular expressions, let's see what some patterns look like. Most characters in a regular expression simply match themselves. If you string several characters in a row, they must match in order, just as you'd expect. So if you write the pattern match:

/Frodo/

you can be sure that the pattern won't match unless the string contains the substring "Frodo" somewhere. (A substring is just a part of a string.) The match could be anywhere in the string, just as long as those five characters occur somewhere, next to each other and in that order.
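Here is a small sketch of that rule, using sample strings we invented for the purpose. It filters a list down to the strings that contain "Frodo" anywhere. The last two strings are there to show what does not count: letters that are not adjacent, and letters in the wrong case (matching is case-sensitive unless you ask otherwise).

```perl
use strict;
use warnings;

my @lines = (
    "Frodo Baggins",             # substring at the start
    "Mr. Frodo, wait!",          # substring in the middle
    "the ring went to Frodo",    # substring at the end
    "F r o d o",                 # letters not next to each other
    "frodo",                     # lowercase f is a different character
);

# Keep only the strings in which /Frodo/ matches somewhere.
my @hits = grep { /Frodo/ } @lines;

print "$_\n" for @hits;
```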

Other characters don't match themselves, but "misbehave" in some way. We call these metacharacters. (All metacharacters are naughty in their own right, but some are so bad that they also cause other nearby characters to misbehave as well.)

Here are the miscreants:

\ | ( ) [ { ^ $ * + ? .

Metacharacters are actually very useful and have special meanings inside patterns; we'll tell you all those meanings as we go along. But we do want to reassure you that you can always match any of these twelve characters literally by putting a backslash in front of it. For example, backslash is itself a metacharacter, so to match a literal backslash, you'd backslash the backslash: \\.
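To make the backslashing rule concrete, here is a short sketch with strings of our own invention. It relies on one meaning we haven't formally covered yet, namely that an unescaped dot matches (almost) any character.

```perl
use strict;
use warnings;

# Unescaped, the dot is a metacharacter, so it happily matches "x".
my $loose = ("5x00" =~ /5.00/) ? 1 : 0;

# Backslashed, the dot matches only a literal dot.
my $exact = ("5x00" =~ /5\.00/) ? 1 : 0;

# Several metacharacters taken literally: $ . ( and )
my $price   = 'Cost: $5.00 (approx.)';
my $literal = ($price =~ /\$5\.00 \(approx\.\)/) ? 1 : 0;

# A literal backslash is matched by backslashing the backslash.
my $path          = 'C:\temp';
my $has_backslash = ($path =~ /\\/) ? 1 : 0;

print "loose=$loose exact=$exact literal=$literal backslash=$has_backslash\n";
```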

You see, backslash is one of those characters that makes other characters misbehave. It just works out that when you make a misbehaving metacharacter misbehave, it ends up behaving--a double negative, as it were. So backslashing a character to get it to be taken literally works, but only on punctuational characters; backslashing an (ordinarily well-behaved) alphanumeric character does the opposite: it turns the literal character into something special. Whenever you see such a two-character sequence:

\b \D \t \3 \s

you'll know that the sequence is a metasymbol that matches something strange. For instance, \b matches a word boundary, while \t matches an ordinary tab character. Notice that a tab is one character wide, while a word boundary is zero characters wide because it's the spot between two characters. So we call \b a zero-width assertion. Still, \t and \b are alike in that they both assert something about a particular spot in the string. Whenever you assert something in a regular expression, you're just claiming that that particular something has to be true in order for the pattern to match.
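The difference in width can be checked directly. The sketch below uses sample strings of our own, plus the special variable $&, which holds the text of the last successful match. It measures how much of the string each metasymbol actually consumed, then uses \b for its usual job of pinning a word down at both ends.

```perl
use strict;
use warnings;

my $text = "one\ttwo";

# \t matches the tab itself, so the matched text is one character long.
my $tab_len = ($text =~ /\t/) ? length($&) : -1;

# \b matches a position, not a character, so the matched text is empty.
my $boundary_len = ($text =~ /\b/) ? length($&) : -1;

# With \b on each side, "Frodo" must stand alone as a word.
my $alone  = ("Frodo!" =~ /\bFrodo\b/) ? 1 : 0;
my $inside = ("Frodos" =~ /\bFrodo\b/) ? 1 : 0;

print "tab=$tab_len boundary=$boundary_len alone=$alone inside=$inside\n";
```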

Most pieces of a regular expression are some sort of assertion, including the ordinary characters that simply assert that they match themselves. To be precise, they also assert that the next thing will match one character later in the string, which is why we talk about the tab character being "one character wide". Some assertions (like \t) eat up some of the string as they match, and others (like \b) don't. But we usually reserve the term "assertion" for the zero-width assertions. To avoid confusion, we'll call the thing with width an atom. (If you're a physicist, you can think of nonzero-width atoms as massive, in contrast to the zero-width assertions, which are massless like photons.)

You'll also see some metacharacters that aren't assertions; rather, they're structural (just as braces and semicolons define the structure of ordinary Perl code, but don't really do anything). These structural metacharacters are in some ways the most important ones because the crucial first step in learning to read regular expressions is to teach your eyes to pick out the structural metacharacters. Once you've learned that, reading regular expressions is a breeze.[1]

[1] Admittedly, a stiff breeze at times, but not something that will blow you away.

One such structural metacharacter is the vertical bar, which indicates alternation:

/Frodo|Pippin|Merry|Sam/

That means that any of those strings can trigger a match; this is covered in Section 5.8, "Alternation", later in the chapter. And in Section 5.7, "Capturing and Clustering", we'll show you how to use parentheses around portions of your pattern to do grouping:

/(Frodo|Drogo|Bilbo) Baggins/

or even:

/(Frod|Drog|Bilb)o Baggins/
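The following sketch runs both grouped patterns over a few names (the list, including the made-up "Frodo Gamgee", is ours). For contrast it adds a third, ungrouped pattern, to show that without parentheses the vertical bar splits the whole pattern, so "Frodo" by itself becomes one complete alternative.

```perl
use strict;
use warnings;

my @names = (
    "Frodo Baggins",
    "Drogo Baggins",
    "Bilbo Baggins",
    "Frodo Gamgee",    # hypothetical name, included for contrast
);

# Parentheses limit the alternation to the first name.
my @grouped  = grep { /(Frodo|Drogo|Bilbo) Baggins/ } @names;

# The same set of matches, with the shared "o" factored out.
my @factored = grep { /(Frod|Drog|Bilb)o Baggins/ } @names;

# No parentheses: the alternatives are "Frodo", "Drogo", and
# "Bilbo Baggins", so "Frodo Gamgee" matches too.
my @ungrouped = grep { /Frodo|Drogo|Bilbo Baggins/ } @names;

print scalar(@grouped), " ", scalar(@factored), " ", scalar(@ungrouped), "\n";
```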

Another thing you'll see is what we call quantifiers, which say how many of the previous thing should match in a row. Quantifiers look like this:

*  +  ?  *?  {3}  {2,5}

You'll never see them in isolation like that, though. Quantifiers only make sense when attached to atoms--that is, to assertions that have width.[2] Quantifiers attach to the previous atom only, which in human terms means they normally quantify only one character. If you want to match three copies of "bar" in a row, you need to group the individual characters of "bar" into a single "molecule" with parentheses, like this:

/(bar){3}/

That will match "barbarbar". If you'd said /bar{3}/, that would match "barrr"--which might qualify you as Scottish but disqualify you as barbarbaric. (Then again, maybe not. Some of our favorite metacharacters are Scottish.) For more on quantifiers, see "Quantifiers" later.

[2] Quantifiers are a bit like the statement modifiers in Chapter 4, "Statements and Declarations", which can only attach to a single statement. Attaching a quantifier to a zero-width assertion would be like trying to attach a while modifier to a declaration--either of which makes about as much sense as asking your local apothecary for a pound of photons. Apothecaries only deal in atoms and such.
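To see the grouped and ungrouped forms side by side, here is a brief sketch (the test strings beyond "barbarbar" and "barrr" are our own additions):

```perl
use strict;
use warnings;

# {3} applies to the whole parenthesized group "bar".
my $grouped   = ("barbarbar" =~ /(bar){3}/) ? 1 : 0;

# Two copies are not enough for {3}.
my $too_few   = ("barbar" =~ /(bar){3}/) ? 1 : 0;

# Without parentheses, {3} applies only to the "r" just before it.
my $ungrouped = ("barbarbar" =~ /bar{3}/) ? 1 : 0;
my $scottish  = ("barrr" =~ /bar{3}/) ? 1 : 0;

print "grouped=$grouped too_few=$too_few ",
      "ungrouped=$ungrouped scottish=$scottish\n";
```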

Now that you've seen a few of the beasties that inhabit regular expressions, you're probably anxious to start taming them. However, before we discuss regular expressions in earnest, we need to backtrack a little and talk about the pattern-matching operators that make use of regular expressions. (And if you happen to spot a few more regex beasties along the way, just leave a decent tip for the tour guide.)
