A.7. Answers to Chapter 8 Exercises
-
Here's one way to do it:
/\b(fred|wilma)\s+flintstone\b/
If you forgot to use the \b word-boundary anchors,
take off half a point; without those, this would mistakenly match
strings like "alfred flintstones". The exercise description said to
match words.
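To see what the anchors buy us, here is a small sketch that tries the pattern with and without them. The sub names and the sample strings are made up for this illustration; only the pattern itself comes from the answer above.

```perl
use strict;
use warnings;

# Hypothetical helpers: 1 if the string matches, 0 if not.
sub matches_anchored   { return $_[0] =~ /\b(fred|wilma)\s+flintstone\b/ ? 1 : 0 }
sub matches_unanchored { return $_[0] =~ /(fred|wilma)\s+flintstone/     ? 1 : 0 }

for my $s ("fred flintstone", "wilma   flintstone", "alfred flintstones") {
    printf "%-20s anchored=%d unanchored=%d\n",
        $s, matches_anchored($s), matches_unanchored($s);
}
```

Only the unanchored version should report a match on the last string, since there is no word boundary between the "l" and the "f" of "alfred".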
-
The point of this exercise may not be obvious, but in the real world,
you'll often have to do something similar. Someday,
you'll be unlucky enough to have a confusing program to
maintain, and you'll wonder what the author was trying to
accomplish.[394]
/"([^"]*)"/ matches a simple string in double
quotes. By a "simple" string, we don't mean one
like Perl's double-quoted strings, which could contain a
backslashed double-quote mark or other backslash magic. This matches
just a double-quote mark, the contents of the string (which
can't contain a double quote), and a closing quote mark. The
contents may be empty. The parentheses aren't needed for
grouping, so they seem to be memory parentheses; as we'll see
in the next chapter, this regular expression memory, which holds the
quoted substring, is probably being saved for some later use. Perhaps
this pattern would be used in reading a configuration file with
quoted strings, although in that case it should probably use anchors.
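As a rough sketch of that guess, the pattern might be used like this to pull a value out of a line. The line format and the sub name are assumptions of ours, and the use of $1 to read the memory is explained in the next chapter.

```perl
use strict;
use warnings;

# Hypothetical helper: return the contents of the first simple
# double-quoted string in $text, or undef if there is none.
sub first_quoted {
    my ($text) = @_;
    return $text =~ /"([^"]*)"/ ? $1 : undef;
}

# A made-up configuration line, just for illustration.
my $line  = 'name = "Fred Flintstone"';
my $value = first_quoted($line);
print defined $value ? "got [$value]\n" : "no quoted string\n";
```

Because the pattern is unanchored, it picks up the first quoted string wherever it appears in the line, and an empty pair of quotes yields an empty (but defined) value.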
/^0?[0-3]?[0-7]{1,2}$/ matches if the string has
nothing but an octal number (perhaps with a leading zero) in the
range from 0 through 0377. Note
that this one is anchored at both ends, so it doesn't allow
anything else in the string before or after the number. (The previous
pattern wasn't anchored; it could match anywhere in the
string.)
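If you'd like to convince yourself of that range, a quick sketch such as the following tries the pattern against a handful of strings. The sub name and the sample values are ours, not part of the exercise.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the whole string is an octal number
# from 0 through 0377, 0 otherwise.
sub looks_octal_byte { return $_[0] =~ /^0?[0-3]?[0-7]{1,2}$/ ? 1 : 0 }

for my $s (qw(0 7 0377 377 400 0400 08 1234)) {
    printf "%-5s %s\n", $s, looks_octal_byte($s) ? "matches" : "no match";
}
```

One thing to keep in mind is that the $ anchor also allows a single trailing newline, so a line that hasn't been chomped can still match.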
/^\b[\w.]{1,12}\b$/ matches strings made up of
nothing but letters, digits, underscores, and dots, but never
starting or ending with a dot. Also, the strings are limited to a
maximum of 12 characters.
You may have noticed that the dot inside the character class is not
special, so it doesn't need to be backslashed. That makes the
character class match ordinary letters, digits, and underscores, and
also dots.
The way we can be sure that this one won't allow a string to
start or end with a dot is that it has both a word-boundary anchor
and a start-of-string or end-of-string anchor at each end of the
string. The word-boundary anchor can match only if there's a
"word" starting or ending there, and a dot can't be
part of a word.
So, this would match strings like "perl.tar.gz", but
not "some_excessively_long_filename" or
"perl.tar." or ".profile" or
"...".[395] This pattern could be useful for
validating user-chosen filenames.
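Here is a minimal sketch of how such a check might look, run against the sample names from the text. The sub name valid_name is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the name passes the pattern, 0 otherwise.
sub valid_name { return $_[0] =~ /^\b[\w.]{1,12}\b$/ ? 1 : 0 }

for my $name ("perl.tar.gz", "some_excessively_long_filename",
              "perl.tar.", ".profile", "...") {
    printf "%-32s %s\n", $name, valid_name($name) ? "ok" : "rejected";
}
```

Only the first name should be accepted: the second is longer than 12 characters, and the rest start or end with a dot, where the word-boundary anchor can't match.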
-
Here's one way to do it:
/^\$[A-Za-z_]\w*$/
The dollar sign at the start has to be backslashed to mean a real
dollar sign. What follows must be a letter or underscore, then zero
or more letters, digits, or underscores.
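A short sketch for trying it out follows. The sub name and the sample strings are invented for the illustration; note that the samples are single-quoted so that Perl doesn't try to interpolate them as variables.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the string is a dollar sign followed by
# a letter or underscore and then any number of word characters.
sub matches_dollar_name { return $_[0] =~ /^\$[A-Za-z_]\w*$/ ? 1 : 0 }

for my $s ('$fred', '$_', '$barney_2', '$2fast', 'fred', '$') {
    printf "%-10s %s\n", $s, matches_dollar_name($s) ? "matches" : "no match";
}
```

The first three strings should match. The others fail because a digit follows the dollar sign, the dollar sign is missing, or nothing follows it at all.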
-
This pattern is surprisingly tricky to get right. Here's how we
construct it, step by step.
We start out by needing to match a word, so that's
/\w+/. Of course, we want to remember that word
for later, so we add parentheses: /(\w+)/. And we
want to match it when it occurs two or more times, so that's
/(\w+)\1+/. (The plus sign at the end means
one or more times -- but that's in
addition to the one time that the word occurred originally.)
But we're not done yet. Now we need to allow for the whitespace
which may come between the words. We don't want to memorize the
whitespace (since it may vary), so we'll put it outside the
parentheses: /(\w+)\s\1+/. Oh, but there could be
any number of whitespace characters, so long as there's at
least one, so we'll add a plus sign. So now we have
/(\w+)\s+\1+/.
But that's not right; the final plus sign is modifying the
backreference alone. We need it to apply to both the backreference
(that is, our repeated word) and the whitespace in front of it:
/(\w+)(\s+\1)+/. So, now we can match a triple
word. First, the part in the first parenthesis pair matches the first
occurrence, then the part in the second parenthesis pair can twice
match some whitespace followed by that same word. When we try it out,
it matches all of our sentences with doubled words, so we happily put
it into our program and move on to the next project.
Then, the next week, we get a bug report! The pattern reports a match
on the sentence "This is a test", even though
there's clearly no doubled word there. In moments, we've
fired up the pattern test program[396]
to see what part of the string is matching: |Th<is is> a test|.
There it is, a doubled word "is",
hidden in an ordinary string.
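If you don't have the pattern test program handy, a minimal stand-in might look like the sketch below. The sub name show_match is ours; it marks the matched part of the string with angle brackets, using Perl's automatic match variables.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a pattern tester: return the string with
# the matched part in angle brackets, or undef if nothing matched.
sub show_match {
    my ($text) = @_;
    return $text =~ /(\w+)(\s+\1)+/ ? "|$`<$&>$'|" : undef;
}

my $shown = show_match("This is a test");
print defined $shown ? "$shown\n" : "no match\n";   # prints |Th<is is> a test|
```

The first two attempts, starting at the "T" and the "h", fail, so the engine's first successful match begins at the "is" inside "This".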
Clearly, this is a job for a word boundary anchor; we can't
have our word start in the middle of another word. So we fix the
program to use /\b(\w+)(\s+\1)+/, and sit back,
confident that we've got it right this time.
And then, just when we've gotten started on another project,
another bug report comes in. This time,
we've matched the doubled word "the" in the
phrase "the theory". Yes, we need
a word boundary at the end of the pattern to
keep from matching a partial word there:
/\b(\w+)(\s+\1)+\b/. Now we've finally
gotten it right.
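To see the whole progression at once, the following sketch runs the three versions of the pattern against the sentences from the bug reports, plus a genuine doubled word. The labels and the sub name are invented for this illustration.

```perl
use strict;
use warnings;

# The three versions from the walkthrough; the labels are ours.
my @labels  = qw(no_anchors leading_only both_anchors);
my %version = (
    no_anchors   => qr/(\w+)(\s+\1)+/,
    leading_only => qr/\b(\w+)(\s+\1)+/,
    both_anchors => qr/\b(\w+)(\s+\1)+\b/,
);

# Hypothetical helper: 1 if the given version reports a doubled word.
sub has_doubled_word {
    my ($label, $text) = @_;
    return $text =~ $version{$label} ? 1 : 0;
}

for my $text ("This is a test", "the theory", "Paris in the the spring") {
    for my $label (@labels) {
        printf "%-24s %-13s %s\n", $text, $label,
            has_doubled_word($label, $text) ? "match" : "no match";
    }
}
```

Only the last version should stay quiet on both of the sentences from the bug reports while still catching the real doubled word in the third.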
What you've just read is a true story. The regular expression
has been changed, but the bug reports are real. It does happen, more
often than we'd like to admit, that even after you've
been writing these patterns for years, you can make a pattern which
has a bug, you can test it with a number of test cases, you can put
it into a long-running program, the Perl documentation, or even a
best-selling Perl book, and not realize that the bug is there until
much later.
The moral of the story is that regular expressions can be
challenging. If you're serious about learning about regular
expressions, though (and all Perl programmers should be), we highly
recommend the book Mastering Regular
Expressions, by Jeffrey Friedl (O'Reilly &
Associates, Inc.).
Copyright © 2002 O'Reilly & Associates. All rights reserved.