

Learning Perl, 3rd Edition

8.3. Anchors

By default, if a pattern doesn't match at the start of the string, it can "float" on down the string, trying to match somewhere else. But there are a number of anchors that may be used to hold the pattern at a particular point in a string.

The caret[178] anchor (^) marks the beginning of the string, while the dollar sign ($) marks the end.[179] So the pattern /^fred/ will match fred only at the start of the string; it wouldn't match manfred mann. And /rock$/ will match rock only at the end of the string; it wouldn't match knute rockne.

[178]Yes, you've seen that caret is already used in another way in patterns. As the first character of a character class, it negates the class. But outside of a character class, it's a metacharacter in a different way, being the start-of-string anchor. There are only so many characters, so we have to use some of them twice.

[179]Actually, it matches either at the end of the string or at a newline at the end of the string. That's so that you can match the end of the string whether it has a trailing newline or not. Most folks don't worry about this distinction much, but once in a long while it's important to remember that /^fred$/ will match either "fred" or "fred\n" with equal ease.
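Here's a small sketch of those two anchors at work, including the trailing-newline behavior from the footnote. The helper names and the sample strings "fred flintstone" and "hard rock" are made up for illustration; only the patterns come from the text.

```perl
use strict;
use warnings;

# Hypothetical helpers: each returns 1 for a match and 0 otherwise.
sub starts_with_fred { my ($s) = @_; return $s =~ /^fred/  ? 1 : 0 }
sub ends_with_rock   { my ($s) = @_; return $s =~ /rock$/  ? 1 : 0 }
sub is_exactly_fred  { my ($s) = @_; return $s =~ /^fred$/ ? 1 : 0 }

print starts_with_fred("fred flintstone"), "\n";  # 1
print starts_with_fred("manfred mann"), "\n";     # 0: fred isn't at the start
print ends_with_rock("hard rock"), "\n";          # 1
print ends_with_rock("knute rockne"), "\n";       # 0: rock isn't at the end
print is_exactly_fred("fred"), "\n";              # 1
print is_exactly_fred("fred\n"), "\n";            # 1: one trailing newline is allowed
print is_exactly_fred("fred\n\n"), "\n";          # 0: only a final newline is allowed
```

The last line shows the limit of the dollar sign's generosity: it forgives a single newline at the very end, and nothing more.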

Sometimes, you'll want to use both of these anchors, to ensure that the pattern matches an entire string. A common example is /^\s*$/, which matches a blank line. But this "blank" line may include some whitespace characters, like tabs and spaces, which are invisible to you and me. Any line that matches that pattern looks just like any other one on paper, so this pattern treats all blank lines as equivalent. Without the anchors, it would match nonblank lines as well.
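To see why the anchors matter for the blank-line pattern, here is a minimal sketch comparing it with the same pattern minus the anchors. The helper names and the sample lines are assumptions for illustration.

```perl
use strict;
use warnings;

# Anchored: the whole line must be empty or whitespace.
sub is_blank { my ($line) = @_; return $line =~ /^\s*$/ ? 1 : 0 }

# Unanchored, for contrast: \s* is happy to match zero characters anywhere.
sub matches_unanchored { my ($line) = @_; return $line =~ /\s*/ ? 1 : 0 }

print is_blank(""), "\n";                      # 1
print is_blank("   "), "\n";                   # 1
print is_blank("\t \n"), "\n";                 # 1
print is_blank("  fred  "), "\n";              # 0
print matches_unanchored("  fred  "), "\n";    # 1: a nonblank line slips through
```

Since \s* can match the empty string, the unanchored pattern succeeds on every line, which is why both anchors are needed here.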

8.3.1. Word Anchors

Anchors aren't just at the ends of the string. The word-boundary anchor, \b, matches at either end of a word.[180] So we can use /\bfred\b/ to match the word fred but not frederick or alfred or manfred mann. This is similar to the feature often called something like "match whole words only" in a word processor's search command.

[180]Some regular expression implementations have one anchor for start-of-word and another for end-of-word, but Perl uses \b for both.

Alas, these aren't words as you and I are likely to think of them; they're those \w-type words made up of ordinary letters, digits, and underscores. The \b anchor matches at the start or end of a group of \w characters.
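A quick sketch makes the \w-type definition concrete. The helper name is hypothetical, and the strings with an apostrophe, hyphen, underscore, and digit are our own additions to show where \b does and doesn't see the end of a word.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if fred appears as a whole \w-type word, else 0.
sub has_word_fred { my ($s) = @_; return $s =~ /\bfred\b/ ? 1 : 0 }

print has_word_fred("fred"), "\n";             # 1
print has_word_fred("frederick"), "\n";        # 0
print has_word_fred("alfred"), "\n";           # 0
print has_word_fred("manfred mann"), "\n";     # 0
print has_word_fred("fred's"), "\n";           # 1: the apostrophe isn't a \w character
print has_word_fred("fred-flintstone"), "\n";  # 1: neither is the hyphen
print has_word_fred("fred_flintstone"), "\n";  # 0: the underscore is a \w character
print has_word_fred("fred2"), "\n";            # 0: so is a digit
```

The underscore and digit cases are the ones that tend to surprise people, since a word processor's "whole words only" option might treat them differently.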

In Figure 8-1, there's a grey underline under each "word," and the arrows show the corresponding places where \b could match. There are always an even number of word boundaries in a given string, since there's an end-of-word for every start-of-word.

The "words" are sequences of letters, digits, and underscores; that is, a word in this sense is what's matched by /\w+/. There are five words in the sentence shown in the figure: That, s, a, word, and boundary.[181] Notice that the quote marks around word don't change the word boundaries; these words are made of \w characters.

[181]You can see why we wish that we could change the definition of "word"; "That's" should be one word, not two words with an apostrophe in between. And even in text that may be mostly ordinary English, it's normal to find a soupçon of other characters spicing things up.

Each arrow points to the beginning or the end of one of the grey underlines, since the word boundary anchor \b matches only at the beginning or the end of a group of word characters.
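The counting can be checked with a short sketch. The sentence used here is an assumption, picked so that it yields the five words listed above; the variable names are ours as well. The loop relies on pos() to report where each zero-width \b match landed.

```perl
use strict;
use warnings;

# Assumed sample sentence, chosen to contain the five words named in the text.
my $sentence = q{That's a "word" boundary!};

# Every run of \w characters is a "word" in this sense.
my @words = $sentence =~ /\w+/g;

# Collect the offset of every place where \b can match.
my @boundaries;
while ($sentence =~ /\b/g) {
    push @boundaries, pos($sentence);
}

print "words: @words\n";            # words: That s a word boundary
print "boundaries: @boundaries\n";  # boundaries: 0 4 5 6 7 8 10 14 16 24
print scalar(@boundaries), "\n";    # 10
```

Five words give ten boundaries, one start and one end apiece, which is the even count promised above.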

Figure 8-1. Word-boundary matches with \b

The word-boundary anchor is useful to ensure that we don't accidentally find cat in delicatessen, dog in boondoggle, or fish in selfishness. Sometimes you'll want just one word-boundary anchor, as when using /\bhunt/ to match words like hunt or hunting or hunter, but not shunt, or when using /stone\b/ to match words like sandstone or flintstone but not capstones.
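Here's a sketch of those cases, with hypothetical helper names. The last two lines contrast a bare /cat/ with the anchored /\bcat\b/ to show the accidental match that the anchors prevent.

```perl
use strict;
use warnings;

# Hypothetical helpers: each returns 1 for a match and 0 otherwise.
sub starts_hunt  { my ($s) = @_; return $s =~ /\bhunt/   ? 1 : 0 }
sub ends_stone   { my ($s) = @_; return $s =~ /stone\b/  ? 1 : 0 }
sub has_cat      { my ($s) = @_; return $s =~ /cat/      ? 1 : 0 }
sub has_word_cat { my ($s) = @_; return $s =~ /\bcat\b/  ? 1 : 0 }

print starts_hunt("hunting"), "\n";          # 1
print starts_hunt("hunter"), "\n";           # 1
print starts_hunt("shunt"), "\n";            # 0: no boundary before hunt
print ends_stone("sandstone"), "\n";         # 1
print ends_stone("flintstone"), "\n";        # 1
print ends_stone("capstones"), "\n";         # 0: no boundary after stone
print has_cat("delicatessen"), "\n";         # 1: the accidental match
print has_word_cat("delicatessen"), "\n";    # 0
```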

The nonword-boundary anchor is \B; it matches at any point where \b would not match. So the pattern /\bsearch\B/ will match searches, searching, and searched, but not search or researching.
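The same kind of sketch works for \B. The helper name is hypothetical; the words are the ones from the paragraph above.

```perl
use strict;
use warnings;

# Hypothetical helper: search must start a word (\b) but not end one (\B).
sub search_with_suffix { my ($s) = @_; return $s =~ /\bsearch\B/ ? 1 : 0 }

print search_with_suffix("searches"), "\n";     # 1
print search_with_suffix("searching"), "\n";    # 1
print search_with_suffix("searched"), "\n";     # 1
print search_with_suffix("search"), "\n";       # 0: the word ends right after search
print search_with_suffix("researching"), "\n";  # 0: search doesn't start the word
```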



Copyright © 2002 O'Reilly & Associates. All rights reserved.