Concepts of Regular Expressions (Learning Perl, 3rd Edition)

Perl has many features that set it apart from other languages. Of all those features, one of the most important is its strong support for regular expressions. These allow fast, flexible, and reliable string handling.

But that power comes at a price. Regular expressions are actually tiny programs in their own special language, built inside Perl. (Yes, you're about to learn another programming language![162] Fortunately it's a simple one.) So for the next two chapters, we'll be learning that language; then we'll take what we've learned back to the world of Perl in Chapter 9, "Using Regular Expressions".

Regular expressions aren't merely part of Perl; they're also found in sed and awk, procmail, grep, most programmers' text editors like vi and emacs, and even in more esoteric places. If you've seen some of these already, you're ahead of the game. Keep watching, and you'll see many more tools that use or support regular expressions, such as search engines on the Web (often written in Perl), email clients, and others.

7.1. What Are Regular Expressions?

A regular expression, often called a pattern in Perl, is a template that either matches or doesn't match a given string.[163] That is, there are an infinite number of possible text strings; a given pattern divides that infinite set into two groups: the ones that match, and the ones that don't. There's never any kinda-sorta-almost-up-to-here wishy-washy matching: either it matches or it doesn't. A pattern may match just one possible string, or just two or three, or a dozen, or a hundred, or an infinite number. Or it may match all strings except for one, or except for some, or except for an infinite number.[164]

[163]Purists would ask for a more rigorous definition. But then again, purists say that Perl's patterns aren't really regular expressions. If you're serious about regular expressions, we highly recommend the book Mastering Regular Expressions by Jeffrey Friedl (O'Reilly & Associates, Inc.).

[164]And as we'll see, you could have a pattern that always matches or that never does. In rare cases, even these may be useful. Generally, though, they're mistakes.
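To make the "two groups" idea concrete, here's a minimal Perl sketch. It runs ahead of the text a little: the helper name partition_by_pattern and the sample strings are made up for illustration, and the pattern-match syntax it relies on is covered properly in the coming chapters.

```perl
use strict;
use warnings;

# Hypothetical helper: split a list of strings into the ones that
# match a pattern and the ones that don't. Every string lands in
# exactly one of the two groups.
sub partition_by_pattern {
    my ($pattern, @strings) = @_;
    my (@matched, @unmatched);
    for my $string (@strings) {
        if ($string =~ $pattern) {
            push @matched, $string;
        } else {
            push @unmatched, $string;
        }
    }
    return (\@matched, \@unmatched);
}

my ($yes, $no) = partition_by_pattern(
    qr/flint/, 'flintlock', 'sandstone', 'a piece of flint'
);
print "matches: @$yes\n";
print "does not match: @$no\n";
```

However many strings you feed it, none of them ends up in both groups or in neither.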

We already referred to regular expressions as being little programs in their own simple programming language. It's a simple language because the programs have just one task: to look at a string and say "it matches" or "it doesn't match".[165] That's all they do.

[165]The programs also pass back some information that Perl can use later. One such piece of information is the "regular expression memories" that we'll learn about a little later.

One of the places you're likely to have seen regular expressions is the Unix grep command, which prints out text lines matching a given pattern. For example, if you wanted to see which lines in a given file mention flint and, somewhere later on the same line, stone, you might do something like this:

$ grep 'flint.*stone' some_file
a piece of flint, a stone which may be used to start a fire by striking
found obsidian, flint, granite, and small stones of basaltic rock, which
a flintlock rifle in poor condition. The sandstone mantle held several
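The same selection can be sketched in Perl itself. This is only an illustration, not how grep is implemented: the helper name grep_flint_stone and the sample lines (standing in for the contents of some_file) are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical sample lines standing in for the contents of some_file.
my @lines = (
    "a piece of flint, a stone which may be used to start a fire by striking",
    "the stone was found near the flint",   # stone comes first: no match
    "a flintlock rifle in poor condition. The sandstone mantle held several",
    "nothing relevant here",
);

# Keep only the lines where "flint" is followed, somewhere later
# on the same line, by "stone".
sub grep_flint_stone {
    return grep { /flint.*stone/ } @_;
}

print "$_\n" for grep_flint_stone(@lines);
```

Note that the second sample line mentions both words but in the wrong order, so it isn't selected.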

Now, if you've used regular expressions somewhere else, that's good, because you have a head start on these three chapters. But Perl's regular expressions have somewhat different syntax than most other implementations; in fact, everybody's regular expressions are a little different. So, if you needed to use a backslash to do something in another implementation, maybe you'll need to leave it off in Perl, or maybe vice versa.
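One small example of that backslash difference: in the basic regular expressions of traditional grep, a plus sign is an ordinary character (and GNU grep treats a backslashed plus as "one or more"), while in Perl it's the other way around. The sketch below shows only the Perl side; the helper names are made up for illustration, and the metacharacters it uses are explained later.

```perl
use strict;
use warnings;

# In Perl, an unescaped + means "one or more of the previous item",
# so this matches "fred", "fredd", "freddd", and so on.
sub plus_as_quantifier {
    my ($string) = @_;
    return $string =~ /^fred+$/ ? 1 : 0;
}

# In Perl, a backslashed + is a literal plus sign,
# so this matches only the string "fred+".
sub plus_as_literal {
    my ($string) = @_;
    return $string =~ /^fred\+$/ ? 1 : 0;
}

for my $string ('fredd', 'fred+') {
    printf "%-6s unescaped: %d  backslashed: %d\n",
        $string, plus_as_quantifier($string), plus_as_literal($string);
}
```

Each of the two sample strings is accepted by exactly one of the two patterns.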

Don't confuse regular expressions with shell filename-matching patterns, called globs. A typical glob is what you use when you type *.pm to the Unix shell to match all filenames that end in .pm. Globs use a lot of the same characters that we use in regular expressions, but those characters are used in totally different ways.[166] We'll visit globs later, in Chapter 12, "Directory Operations", but for now try to put them totally out of your mind.

[166]Globs are also (alas) sometimes called patterns. What's worse, though, is that some bad Unix books for beginners (and possibly written by beginners) have taken to calling globs "regular expressions", which they certainly are not. This confuses many folks at the start of their work with Unix.
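To see how differently the same characters behave, here's a rough Perl sketch. The filenames and helper names are hypothetical. The first pattern is only an approximation of what the glob *.pm selects, and it uses syntax that hasn't been introduced yet.

```perl
use strict;
use warnings;

# Hypothetical names for illustration.
my @names = ('Fred.pm', 'Barney.pm', 'notes.txt', 'shipment');

# Roughly what the glob *.pm selects: names ending in a literal ".pm".
# (Approximation: shells also skip names that begin with a dot.)
sub ends_in_dot_pm {
    return grep { /\.pm$/ } @_;
}

# In a regular expression, an unescaped dot is not a literal dot:
# it stands for any single character (other than a newline),
# so this finds "pm" preceded by any character at all.
sub any_char_then_pm {
    return grep { /.pm/ } @_;
}

print "like the glob: @{[ ends_in_dot_pm(@names) ]}\n";
print "unescaped dot: @{[ any_char_then_pm(@names) ]}\n";
```

The second pattern also picks up shipment, which the glob would never select, because the dot matches the i in "ipm".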
