Extended Regular Expressions (Unix Power Tools, 3rd Edition)

32.15. Extended Regular Expressions

At least two programs use extended regular expressions: egrep and awk. [perl uses expressions that are even more extended. -- JP] With these extensions, special characters preceded by a backslash no longer have special meaning: \{, \}, \<, \>, \(, \), as well as \digit. There is a very good reason for this, which I will delay explaining to build up suspense.

The question mark (?) matches zero or one instance of the character set before it, and the plus sign (+) matches one or more copies of the character set. You can't use \{ and \} in extended regular expressions, but if you could, you might consider ? to be the same as \{0,1\} and + to be the same as \{1,\}.
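As a small illustration of the two quantifiers, here is a sketch that assumes your grep accepts -E (the POSIX spelling of egrep). The sample words and variable names are made up for the example:

```shell
# Hypothetical sample words, one per line.
words='color
colour
colouur'

# ? makes the preceding "u" optional: zero or one copy.
optional_u=$(printf '%s\n' "$words" | grep -E '^colou?r$')

# + demands at least one "u": one or more copies.
one_or_more_u=$(printf '%s\n' "$words" | grep -E '^colou+r$')

echo "matched by u?:"
printf '%s\n' "$optional_u"
echo "matched by u+:"
printf '%s\n' "$one_or_more_u"
```

With this input, the ? pattern should pick up color and colour, while the + pattern should pick up colour and colouur.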

By now, you are wondering why the extended regular expressions are even worth using. Except for two abbreviations, there seem to be no advantages and a lot of disadvantages. Therefore, examples would be useful.

The three important characters in the extended regular expressions are (, |, and ). Parentheses are used to group expressions; the vertical bar acts as an OR operator. Together, they let you match a choice of patterns. As an example, you can use egrep to print all From: and Subject: lines from your incoming mail [which may also be in /var/spool/mail/$USER. -- JP]:

% egrep '^(From|Subject): ' /usr/spool/mail/$USER

All lines starting with From: or Subject: will be printed. There is no easy way to do this with simple regular expressions. You could try something like ^[FS][ru][ob][mj]e*c*t*: and hope you don't have any lines that start with Sromeet:.

Extended expressions don't have the \< and \> characters. You can compensate by using the alternation mechanism. Matching the word "the" at the beginning, middle, or end of a sentence, or at the end of a line, can be done with the extended regular expression (^| )the([^a-z]|$). There are two choices before the word: a space or the beginning of a line. Following the word, there must be either something other than a lowercase letter or the end of the line.

One bonus with extended regular expressions is the ability to use the *, +, and ? modifiers after a (...) grouping.
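To see the word-matching expression and a quantified group at work, here is a rough sketch using grep -E (the POSIX spelling of egrep). The sample lines and variable names are invented for the illustration:

```shell
# Hypothetical test lines; only some contain "the" as a separate word.
lines='the cat sat
other things
bathe daily
over the wall
at the
then what'

# A space or start of line before "the"; a non-lowercase character
# or end of line after it.
word_the='(^| )the([^a-z]|$)'
the_hits=$(printf '%s\n' "$lines" | grep -E "$word_the")
printf '%s\n' "$the_hits"

# A quantifier applied to a whole group: one or more copies of "ab".
group_hits=$(printf 'ab\nabab\naba\n' | grep -E '^(ab)+$')
printf '%s\n' "$group_hits"
```

The first search should skip "other", "bathe", and "then", where the letters are embedded in a longer word. The second should accept ab and abab but not aba.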

[If you're on a Darwin system and use Apple Mail or one of the many other clients, you can grep through your mail files locally. For Mail, look in your home directory's Library/Mail/ directory. There should be a subdirectory there, perhaps named something like iTools:example@mail.example.com, with an IMAP directory tree beneath it. IMAP stores messages individually, not in standard Unix mbox format, so there is no way to look for all matches in a single mailbox by grepping a single file, but fortunately, you can use regular expressions to construct a file list to search. :-) -- SJC]

Here are two ways to match "a simple problem", "an easy problem", as well as "a problem"; the second expression is more exact:

% egrep "a[n]? (simple|easy)? ?problem" data
% egrep "a[n]? ((simple|easy) )?problem" data
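The difference between the two expressions shows up on slightly malformed input. The following sketch uses grep -E in place of egrep and feeds both patterns the same made-up phrases (the variable names are only for illustration):

```shell
# Hypothetical input: three well-formed phrases and two malformed ones
# (a missing space, and a doubled space).
phrases='a simple problem
an easy problem
a problem
a simpleproblem
a  problem'

loose='a[n]? (simple|easy)? ?problem'
exact='a[n]? ((simple|easy) )?problem'

loose_hits=$(printf '%s\n' "$phrases" | grep -E "$loose")
exact_hits=$(printf '%s\n' "$phrases" | grep -E "$exact")

echo "first expression matches:"
printf '%s\n' "$loose_hits"
echo "second expression matches:"
printf '%s\n' "$exact_hits"
```

The first expression treats the adjective and the space after it as independent options, so it also accepts "a simpleproblem" and "a  problem". The second ties the space to the adjective and should accept only the three well-formed phrases.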

I promised to explain why the backslash characters don't work in extended regular expressions. Well, perhaps the \{...\} and \<...\> could be added to the extended expressions, but it might confuse people if those characters are added and the \(...\) are not. And there is no way to add that functionality to the extended expressions without changing the current usage. Do you see why? It's quite simple. If ( has a special meaning, then \( must be the ordinary character. This is the opposite of the simple regular expressions, where ( is ordinary and \( is special. The usage of the parentheses is incompatible, and any change could break old programs.
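The reversed roles of ( and \( can be checked directly. This sketch assumes a POSIX grep, where plain grep uses simple (basic) expressions and grep -E uses extended ones. The two input lines and the variable names are invented for the example:

```shell
# Hypothetical input: one line with literal parentheses, one without.
text='f(x)
fx'

# Simple expressions: ( and ) are ordinary characters.
bre_plain=$(printf '%s\n' "$text" | grep '(x)')
# Simple expressions: \( and \) group, so this is just "x".
bre_escaped=$(printf '%s\n' "$text" | grep '\(x\)')

# Extended expressions: ( and ) group, so this is just "x".
ere_plain=$(printf '%s\n' "$text" | grep -E '(x)')
# Extended expressions: \( and \) are ordinary characters.
ere_escaped=$(printf '%s\n' "$text" | grep -E '\(x\)')

printf 'grep    (x)   : %s\n' "$(echo $bre_plain)"
printf 'grep    \\(x\\) : %s\n' "$(echo $bre_escaped)"
printf 'grep -E (x)   : %s\n' "$(echo $ere_plain)"
printf 'grep -E \\(x\\) : %s\n' "$(echo $ere_escaped)"
```

The same pattern text selects different lines depending on which syntax is in force, which is why neither syntax could adopt the other's parentheses without breaking existing scripts.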

If the extended expressions used (...|...) as regular characters, and \(...\|...\) for specifying alternate patterns, then it would be possible to have one set of regular expressions with full functionality. This is exactly what GNU Emacs (Section 19.1) does, by the way -- it combines all of the features of regular and extended expressions with one syntax.

-- BB