Answers to Chapter 8 Exercises (Learning Perl, 3rd Edition)

A.7. Answers to Chapter 8 Exercises

Here's one way to do it:
```
/\b(fred|wilma)\s+flintstone\b/
```
If you forgot to use the \b word-boundary anchors, take off half a point; without those, this would mistakenly match strings like alfred flintstones. The exercise description said to match words.
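To see the difference the anchors make, here is a minimal sketch that runs the pattern above next to a variant with the \b anchors removed. The variable names and the sample strings are our own, chosen only for illustration.

```perl
use strict;
use warnings;

# The pattern from the answer, and a hypothetical variant without the anchors.
my $with_anchors    = qr/\b(fred|wilma)\s+flintstone\b/;
my $without_anchors = qr/(fred|wilma)\s+flintstone/;

for my $text ('fred flintstone', 'wilma   flintstone', 'alfred flintstones') {
    my $strict = $text =~ $with_anchors    ? 'match' : 'no match';
    my $loose  = $text =~ $without_anchors ? 'match' : 'no match';
    printf "%-20s with \\b: %-9s without \\b: %s\n", $text, $strict, $loose;
}
```

Only the unanchored variant should report a match for alfred flintstones, since it finds fred flintstone buried inside the longer words.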
The point of this exercise may not be obvious, but in the real world, you'll often have to do something similar. Someday, you'll be unlucky enough to have a confusing program to maintain, and you'll wonder what the author was trying to accomplish.[394]

[394]If you're especially unlucky, this happens when you look at your own code ten minutes after writing it.

/"([^"]*)"/ matches a simple string in double quotes. By a "simple" string, we don't mean one like Perl's double-quoted strings, which could contain a backslashed double-quote mark or other backslash magic. This matches just a double-quote mark, the contents of the string (which can't contain a double quote), and a closing quote mark. The contents may be empty. The parentheses aren't needed for grouping, so they seem to be memory parentheses; as we'll see in the next chapter, this regular expression memory, which holds the quoted substring, is probably being saved for some later use. Perhaps this pattern would be used in reading a configuration file with quoted strings, although in that case it should probably use anchors.
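As a rough sketch of how the saved substring might be used later, the helper below returns whatever the memory parentheses captured. The sub name quoted_contents and the sample configuration lines are hypothetical; the $1 variable it relies on is the regular expression memory that the next chapter introduces.

```perl
use strict;
use warnings;

# Hypothetical helper: return the contents of the first simple quoted
# string in $line, or undef if the line has no quoted string at all.
sub quoted_contents {
    my ($line) = @_;
    return $line =~ /"([^"]*)"/ ? $1 : undef;
}

# Made-up configuration lines, for illustration only.
for my $line ('name = "Fred Flintstone"', 'title = ""', 'no quotes here') {
    my $contents = quoted_contents($line);
    if (defined $contents) {
        print "found [$contents]\n";
    } else {
        print "no quoted string\n";
    }
}
```

Note that an empty pair of quotes still matches; the captured contents are then the empty string, not undef.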

/^0?[0-3]?[0-7]{1,2}$/ matches if the string has nothing but an octal number (perhaps with a leading zero) in the range from 0 through 0377. Note that this one is anchored at both ends, so it doesn't allow anything else in the string before or after the number. (The previous pattern wasn't anchored; it could match anywhere in the string.)
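One way to convince yourself of the range is to try a few candidates. This is a small sketch; the helper name is_small_octal and the list of test values are our own.

```perl
use strict;
use warnings;

my $octal = qr/^0?[0-3]?[0-7]{1,2}$/;

# Hypothetical helper: true if the string is nothing but an octal
# number from 0 through 0377, with or without the leading zero.
sub is_small_octal {
    my ($string) = @_;
    return $string =~ $octal ? 1 : 0;
}

for my $candidate (qw(0 7 377 0377 400 0400 8 12x)) {
    my $verdict = is_small_octal($candidate) ? 'matches' : 'does not match';
    printf "%-5s %s\n", $candidate, $verdict;
}
```

The values 400 and 0400 fail because the optional digit before the last two can be no larger than 3; 8 and 12x fail because they contain characters outside 0 through 7.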

/^\b[\w.]{1,12}\b$/ matches strings made up of nothing but letters, digits, underscores, and dots, but never starting or ending with a dot. Also, the strings are limited to a maximum of 12 characters.

You may have noticed that the dot inside the character class is not special, so it doesn't need to be backslashed. That makes the character class match ordinary letters, digits, and underscores, and also dots.

The way we can be sure that this one won't allow a string to start or end with a dot is that it has both a word-boundary anchor and a start-of-string or end-of-string anchor at each end of the string. The word-boundary anchor can match only if there's a "word" starting or ending there, and a dot can't be part of a word.

So, this would match strings like perl.tar.gz, but not some_excessively_long_filename, perl.tar., .profile, or the two-dot name .. by itself.[395] This pattern could be useful for validating user-chosen filenames.

[395]You may know that file and directory names beginning with a dot are not displayed by default on Unix systems, and that the special directory name .. always means the directory one level higher in the hierarchy.
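The filename examples discussed above can be checked directly. The following is only a sketch; the helper name valid_filename is hypothetical, and the candidate names are the ones mentioned in the text.

```perl
use strict;
use warnings;

my $filename = qr/^\b[\w.]{1,12}\b$/;

# Hypothetical helper: true if the name passes the pattern.
sub valid_filename {
    my ($name) = @_;
    return $name =~ $filename ? 1 : 0;
}

my @candidates = (
    'perl.tar.gz',
    'some_excessively_long_filename',
    'perl.tar.',
    '.profile',
    '..',
);

for my $name (@candidates) {
    my $verdict = valid_filename($name) ? 'accepted' : 'rejected';
    printf "%-32s %s\n", $name, $verdict;
}
```

Only perl.tar.gz should be accepted. The long name exceeds 12 characters, and the other three start or end with a dot, so one of the \b anchors has no word to attach to.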

Here's one way to do it:

/^\$[A-Za-z_]\w*$/

The dollar sign at the start has to be backslashed to mean a real dollar sign. What follows must be a letter or underscore, then zero or more letters, digits, or underscores.
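Here is a short sketch that tries the pattern on a few candidate names. The helper name looks_like_scalar and the candidates are our own; the strings are single-quoted so that Perl doesn't try to interpolate them.

```perl
use strict;
use warnings;

my $scalar_name = qr/^\$[A-Za-z_]\w*$/;

# Hypothetical helper: true if the string looks like a scalar variable name.
sub looks_like_scalar {
    my ($string) = @_;
    return $string =~ $scalar_name ? 1 : 0;
}

for my $candidate ('$fred', '$_', '$barney_2', '$2legit', 'fred', '$wilma!') {
    my $verdict = looks_like_scalar($candidate) ? 'matches' : 'does not match';
    printf "%-10s %s\n", $candidate, $verdict;
}
```

The first three should match. The others fail because of a leading digit after the dollar sign, a missing dollar sign, and a trailing character that isn't a letter, digit, or underscore.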

This pattern is surprisingly tricky to get right. Here's how we construct it, step by step.

We start out by needing to match a word, so that's /\w+/. Of course, we want to remember that word for later, so we add parentheses: /(\w+)/. And we want to match it when it occurs two or more times, so that's /(\w+)\1+/. (The plus sign at the end means one or more times -- but that's in addition to the one time that the word occurred originally.)

But we're not done yet. Now we need to allow for the whitespace which may come between the words. We don't want to memorize the whitespace (since it may vary), so we'll put it outside the parentheses: /(\w+)\s\1+/. Oh, but there could be any number of whitespace characters, so long as there's at least one, so we'll add a plus sign. So now we have /(\w+)\s+\1+/.

But that's not right; the final plus sign is modifying the backreference alone. We need it to apply to both the backreference (that is, our repeated word) and the whitespace in front of it: /(\w+)(\s+\1)+/. So, now we can match a triple word. First, the part in the first parenthesis pair matches the first occurrence, then the part in the second parenthesis pair can twice match some whitespace followed by that same word. When we try it out, it matches all of our sentences with doubled words, so we happily put it into our program and move on to the next project.
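The effect of moving the plus sign can be seen by looking at how much of a tripled word each version matches. This is a sketch only: the helper matched_text and the sample string are our own, and it uses Perl's $& variable, which holds the text matched by the last successful pattern match.

```perl
use strict;
use warnings;

# Hypothetical helper: return the part of $text that $pattern matched,
# or undef if there was no match.
sub matched_text {
    my ($text, $pattern) = @_;
    return $text =~ $pattern ? $& : undef;
}

my $text = 'the the the end';

# Plus sign applies to the backreference alone.
print 'backreference only:           [', matched_text($text, qr/(\w+)\s+\1+/), "]\n";

# Plus sign applies to the whitespace and the backreference together.
print 'whitespace and backreference: [', matched_text($text, qr/(\w+)(\s+\1)+/), "]\n";
```

The first version stops after two copies of the word, because \1+ can repeat only when the copies are jammed together with no whitespace between them. The second version takes in all three.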

Then, the next week, we get a bug report! The pattern reports a match on the sentence "This is a test", even though there's clearly no doubled word there. In moments, we've fired up the pattern test program[396] to see what part of the string is matching: |Th<is is> a test|. There it is: a doubled "is", hidden in an ordinary string.

[396]We told you that it would come in handy, and we weren't kidding.

Clearly, this is a job for a word boundary anchor; we can't have our word start in the middle of another word. So we fix the program to use /\b(\w+)(\s+\1)+/, and sit back, confident that we've got it right this time.

And then, just when we've gotten started on another project, another bug report comes in. This time, we've matched the doubled word "the" in the phrase "the theory". Yes, we need a word boundary at the end of the pattern to keep from matching a partial word there: /\b(\w+)(\s+\1)+\b/. Now we've finally gotten it right.
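To replay both bug reports, the sketch below runs the three versions of the pattern against the two troublesome strings and one sentence with a genuine doubled word. The helper first_match, the stage labels, and the third sentence are our own additions.

```perl
use strict;
use warnings;

# The three versions of the pattern, in the order the story develops them.
my @stages = (
    [ 'no anchors',        qr/(\w+)(\s+\1)+/     ],
    [ 'leading anchor',    qr/\b(\w+)(\s+\1)+/   ],
    [ 'anchors both ends', qr/\b(\w+)(\s+\1)+\b/ ],
);

# Hypothetical helper: return the matched text, or undef if no match.
sub first_match {
    my ($pattern, $text) = @_;
    return $text =~ $pattern ? $& : undef;
}

for my $sentence ('This is a test', 'the theory', 'Paris in the the spring') {
    print "$sentence\n";
    for my $stage (@stages) {
        my ($label, $pattern) = @$stage;
        my $found = first_match($pattern, $sentence);
        printf "  %-18s %s\n", $label, defined $found ? "matched [$found]" : 'no match';
    }
}
```

Only the last version should stay quiet on the first two strings while still finding the doubled word in the third.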

What you've just read is a true story. The regular expression has been changed, but the bug reports are real. It does happen, more often than we'd like to admit, that even after you've been writing these patterns for years, you can make a pattern which has a bug, you can test it with a number of test cases, you can put it into a long-running program, the Perl documentation, or even a best-selling Perl book, and not realize that the bug is there until much later.

The moral of the story is that regular expressions can be challenging. If you're serious about learning about regular expressions, though (and all Perl programmers should be), we highly recommend the book Mastering Regular Expressions, by Jeffrey E. F. Friedl (O'Reilly & Associates, Inc.).