3.2.11. What's the Word? Part II
Let's reevaluate the regular expression
for searching for a single word in light
of the new metacharacters we've discussed.
Our first attempt at writing a regular
expression for grep
to search for a word concluded with the following expression
(quoted here so that the leading and trailing spaces are visible):
' book.* '
This expression is fairly simple, matching a space followed by the string
"book" followed by any number of characters followed by a space. However,
it does not match all possible occurrences and it does match a few
nuisance words.
The following test file contains numerous occurrences of "book."
We've added a notation, which is not part of the file,
to indicate whether the input line should be a "hit" (>) and
included in the output or a "miss" (<).
We've tried to include as many different examples as possible.
$ cat bookwords
> This file tests for book in various places, such as
> book at the beginning of a line or
> at the end of a line book
> as well as the plural books and
< handbooks. Here are some
< phrases that use the word in different ways:
> "book of the year award"
> to look for a line with the word "book"
> A GREAT book!
> A great book? No.
> told them about (the books) until it
> Here are the books that you requested
> Yes, it is a good book for children
> amazing that it was called a "harmful book" when
> once you get to the end of the book, you can't believe
< A well-written regular expression should
< avoid matching unrelated words,
< such as booky (is that a word?)
< and bookish and
< bookworm and so on.
As we search for occurrences of the word "book," there are 13 lines
that should be matched and 7 lines that should not be matched.
First, let's run the previous regular expression on
the sample file and check the results.
$ grep ' book.* ' bookwords
This file tests for book in various places, such as
as well as the plural books and
A great book? No.
told them about (the books) until it
Here are the books that you requested
Yes, it is a good book for children
amazing that it was called a "harmful book" when
once you get to the end of the book, you can't believe
such as booky (is that a word?)
and bookish and
It prints only 8 of the 13 lines that we want to match,
and it prints 2 of the lines that we don't want to match.
The expression matches lines containing
the words "booky" and "bookish." It ignores "book" at the beginning
of a line and at the end of a line.
It ignores "book" when there are certain punctuation marks involved.
To restrict the search even more,
we must use character classes.
Generally, the characters that might end a word
are punctuation marks, such as:
? . , ! ; : '
In addition, quotation marks, parentheses, braces, and brackets
might surround a word or open or close with a word:
" () {} []
You would also have to accommodate
the plural or possessive forms of the word.
Thus, you would have two different character classes: before and after
the word. Remember that all we have to do is list the members of
the class inside square brackets.
Before the word, we now have:
["[{(]
and after the word:
[]})"?!.,;:'s]
Note that putting the closing square bracket as the first character in the class
makes it a member of the class rather than closing the set.
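A quick way to convince yourself of this rule is to feed grep a few one-character lines and count how many the class matches. This is only a small sketch; the sample characters are made up for the demonstration.

```shell
# Each input line holds a single character.
# In the class []})] the first ] is a literal member, so the class
# contains three characters: ] } and ).
count=$(printf '%s\n' ']' ')' '}' 'x' | grep -c '[]})]')
echo "$count"    # three of the four lines match
```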
Putting the two classes together, and keeping the single space that begins
and ends the original expression, we get:
 ["[{(]*book[]})"?!.,;:'s]* 
Show this to the uninitiated, and they'll throw up their hands in
despair! But now that you know the principles involved, you can not
only understand this expression but also easily reconstruct it.
Let's see how it does on the sample file (we use double quotes to
enclose the single quote character, and then a backslash in front of
the embedded double quotes):
$ grep " [\"[{(]*book[]})\"?!.,;:'s]* " bookwords
This file tests for book in various places, such as
as well as the plural books and
A great book? No.
told them about (the books) until it
Here are the books that you requested
Yes, it is a good book for children
amazing that it was called a "harmful book" when
once you get to the end of the book, you can't believe
We eliminated the lines that we don't want, but there
are five lines that we're not getting.
Let's examine the five lines:
book at the beginning of a line or
at the end of a line book
"book of the year award"
to look for a line with the word "book"
A GREAT book!
All of these are problems caused by the string appearing
at the beginning or end of a line.
Because there is no space at the beginning
or end of a line, the pattern is not matched.
We can use the positional metacharacters, ^ and $. Since
we want to match either a space or the beginning or end
of a line, we can use egrep and specify the "or" metacharacter along with
parentheses for grouping.
For instance, to match either the beginning of a line
or a space, you could write the expression:
(^| )
(Because | and ()
are part of the extended set of metacharacters, if
you were using sed, you'd have to write
different expressions to handle each case.)
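As a rough illustration of what "different expressions to handle each case" might look like, the sketch below uses plain grep with one basic regular expression per combination of start-of-line or space and space or end-of-line. The file name bookwords.sample and the variable name core are assumptions made for this example, and the input is only a handful of lines from the test file.

```shell
# A few lines from the test file (hit/miss markers removed).
cat > bookwords.sample <<'EOF'
book at the beginning of a line or
at the end of a line book
"book of the year award"
A GREAT book!
Here are the books that you requested
handbooks. Here are some
such as booky (is that a word?)
EOF

# The word with its two character classes. The pattern is built from a
# single-quoted piece and a double-quoted piece so that it can contain
# both kinds of quotation mark.
core='["[{(]*book[]})"?!.,;:'"'s]*"

# One pattern per case, instead of (^| ) and ( |$).
matched=$(grep -c -e "^$core " -e "^$core\$" -e " $core " -e " $core\$" bookwords.sample)
echo "$matched"    # the first five lines match; "handbooks" and "booky" do not
```

The same four-way split is what you would have to spell out in any tool limited to basic regular expressions.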
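Here's the revised regular expression:
(^| )["[{(]*book[]})"?!.,;:'s]*( |$)
Now let's see how it works. (On the command line, the backslash before !
keeps shells with history substitution, such as csh, from interpreting
the exclamation point. If the shell passes the backslash through to egrep,
it simply becomes one more harmless member of the character class.)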
$ egrep "(^| )[\"[{(]*book[]})\"?\!.,;:'s]*( |$)" bookwords
This file tests for book in various places, such as
book at the beginning of a line or
at the end of a line book
as well as the plural books and
"book of the year award"
to look for a line with the word "book"
A GREAT book!
A great book? No.
told them about (the books) until it
Here are the books that you requested
Yes, it is a good book for children
amazing that it was called a "harmful book" when
once you get to the end of the book, you can't believe
This is certainly a complex regular expression; however,
it can be broken down into parts.
This expression may not match every single instance,
but it can be easily adapted to handle other occurrences
that you may find.
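When you adapt the expression, it helps to check it against the annotated test file automatically. The following is a minimal sketch of such a check, not a definitive test harness: the file names bookwords.marked, expected, and actual are assumptions, and grep -E is used as the POSIX spelling of egrep.

```shell
# The annotated file: "> " marks a wanted line, "< " an unwanted one.
cat > bookwords.marked <<'EOF'
> This file tests for book in various places, such as
> book at the beginning of a line or
> at the end of a line book
> as well as the plural books and
< handbooks. Here are some
< phrases that use the word in different ways:
> "book of the year award"
> to look for a line with the word "book"
> A GREAT book!
> A great book? No.
> told them about (the books) until it
> Here are the books that you requested
> Yes, it is a good book for children
> amazing that it was called a "harmful book" when
> once you get to the end of the book, you can't believe
< A well-written regular expression should
< avoid matching unrelated words,
< such as booky (is that a word?)
< and bookish and
< bookworm and so on.
EOF

# Strip the two-character marker to recreate the real input file.
sed 's/^..//' bookwords.marked > bookwords

# The revised expression, quoted in two pieces to hold both quote characters.
re='(^| )["[{(]*book[]})"?!.,;:'"'s]*( |\$)"

# The wanted lines, in file order, are what grep should print.
grep '^>' bookwords.marked | sed 's/^..//' > expected
grep -E "$re" bookwords > actual
hits=$(grep -E -c "$re" bookwords)

if cmp -s expected actual; then result=pass; else result=fail; fi
echo "$result: $hits lines matched"
```

If you change the expression, only the re assignment needs editing; a result of fail means that some wanted line was missed or some unwanted line was printed.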
You could also create a simple shell script to replace "book" with
a command-line argument. The only problem might be
that the plural of some words is not simply "s."
By sleight of hand, you could handle the "es" plural by adding "e" to the character
class following the word; it would work in many cases.
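Such a script might look like the sketch below, written as a shell function. The name wordgrep is hypothetical, grep -E stands in for egrep, and the word is inserted into the expression as-is, so any regular expression metacharacters it contains are interpreted rather than matched literally. The class after the word includes "e" as well as "s", as suggested above.

```shell
# wordgrep WORD [FILE...]
# Print lines that contain WORD as a separate word, allowing surrounding
# punctuation and a trailing "s" or "es". Reads standard input if no
# file is named.
wordgrep() {
    if [ $# -lt 1 ]; then
        echo "usage: wordgrep WORD [FILE...]" >&2
        return 2
    fi
    word=$1
    shift
    # The two halves of the expression from this section, with "e" added
    # to the class that follows the word.
    before='(^| )["[{(]*'
    after='[]})"?!.,;:'"'es]*( |\$)"
    grep -E "${before}${word}${after}" "$@"
}

# Example: "boxes" and "box." are found, but "boxer" is not.
found=$(printf '%s\n' 'two boxes arrived' 'a box.' 'the boxer' | wordgrep box)
echo "$found"
```

Because "e" and "s" are simply members of a class, the expression also accepts odd endings such as "boxse"; that is the price of the sleight of hand.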