Positions (Programming Perl)

5.6. Positions

Some regex constructs represent positions in the string to be matched. A position is a location just to the left or right of a real character. These metasymbols are examples of zero-width assertions because they do not correspond to actual characters in the string. We often just call them "assertions". (They're also known as "anchors" because they tie some part of the pattern to a particular position.)

You can always manipulate positions in a string without using patterns. The built-in substr function lets you extract and assign to substrings, measured from the beginning of the string, the end of the string, or from a particular numeric offset. This might be all you need if you were working with fixed-length records, for instance. Patterns are only necessary when a numeric offset isn't sufficient. But most of the time, offsets aren't sufficient--at least, not sufficiently convenient, compared to patterns.
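
As a minimal sketch of offset-based manipulation, suppose a fixed-length record with a 10-column name field followed by a 5-column amount field. The layout, the sample data, and the variable names are invented for illustration:

```perl
use strict;
use warnings;

# Hypothetical fixed-length record: columns 0-9 hold a name, 10-14 an amount.
my $record = "Bilbo     00042";

my $name   = substr($record, 0, 10);   # measured from the beginning
my $amount = substr($record, -5);      # measured from the end

substr($record, 10, 5) = "00099";      # assign to a substring at an offset

print "name='$name' amount='$amount' record='$record'\n";
```

No pattern is involved here because every field sits at a known offset.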

5.6.1. Beginnings: The \A and ^ Assertions

The \A assertion matches only at the beginning of the string, no matter what. However, the ^ assertion is the traditional beginning-of-line assertion as well as a beginning-of-string assertion. Therefore, if the pattern uses the /m modifier[8] and the string has embedded newlines, ^ also matches anywhere inside the string immediately following a newline character:

/\Abar/      # Matches "bar" and "barstool"
/^bar/       # Matches "bar" and "barstool"
/^bar/m      # Matches "bar" and "barstool" and "sand\nbar"

Used in conjunction with /g, the /m modifier lets ^ match many times in the same string:

s/^\s+//gm;             # Trim leading whitespace on each line
$total++ while /^./mg;  # Count nonblank lines

[8] Or you've set the deprecated $* variable to 1 and you're not overriding $* with the /s modifier.
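
To see the difference between \A and ^ under /m, the following sketch counts matches of each against one multiline string. The sample text and variable names are ours, chosen only for illustration:

```perl
use strict;
use warnings;

my $text = "sand\nbar\nbarstool\n";

# The "= () =" idiom counts how many times a /g match succeeds.
my $string_start = () = $text =~ /\Abar/mg;  # \A ignores /m: start of string only
my $line_starts  = () = $text =~ /^bar/mg;   # ^ also matches after each newline

print "\\A matched $string_start time(s), ^ matched $line_starts time(s)\n";
```

Since the string begins with "sand", \A never finds "bar", while ^ finds it at the start of the second and third lines.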

5.6.2. Endings: The \z, \Z, and $ Assertions

The \z metasymbol matches at the end of the string, no matter what's inside. \Z matches right before the newline at the end of the string if there is a newline, or at the end if there isn't. The $ metacharacter usually means the same as \Z. However, if the /m modifier was specified and the string has embedded newlines, then $ can also match anywhere inside the string right in front of a newline:

/bot\z/      # Matches "robot"
/bot\Z/      # Matches "robot" and "abbot\n"
/bot$/       # Matches "robot" and "abbot\n"
/bot$/m      # Matches "robot" and "abbot\n" and "robot\nrules"

/^robot$/    # Matches "robot" and "robot\n"
/^robot$/m   # Matches "robot" and "robot\n" and "this\nrobot\n"
/\Arobot\Z/  # Matches "robot" and "robot\n"
/\Arobot\z/  # Matches only "robot" -- but why didn't you use eq?
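
The comments above can be checked with a short script. This sketch tries each ending assertion against three sample strings and records a 1 or 0 per pattern; the %ends hash and the sample strings are illustrative choices, not part of any API:

```perl
use strict;
use warnings;

my %ends;
for my $s ("robot", "robot\n", "robot\nrules") {
    # One digit per pattern, in this order: \z, \Z, $, and $ with /m
    $ends{$s} = join "", map { ($s =~ $_) ? 1 : 0 }
        qr/bot\z/, qr/bot\Z/, qr/bot$/, qr/bot$/m;

    (my $shown = $s) =~ s/\n/\\n/g;      # make newlines visible when printing
    print qq("$shown": $ends{$s}\n);
}
```

Only the /m version of $ matches in front of the embedded newline in "robot\nrules".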

As with ^, the /m modifier lets $ match many times in the same string when used with /g. (These examples assume that you've read a multiline record into $_, perhaps by setting $/ to "" before reading.)

s/\s*$//gm;   # Trim trailing whitespace on each line in paragraph

while (/^([^:]+):\s*(.*)/gm ) {  # get mail header
    $headers{$1} = $2;
}
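
Here is a self-contained version of those two snippets that you can run as is. The header block is made-up sample data standing in for a record read in paragraph mode, and the addresses are placeholders:

```perl
use strict;
use warnings;

# Hypothetical header block, as if read with $/ set to "".
$_ = "From: gandalf\@example.com\nTo: bilbo\@example.com   \nSubject: Party\n\n";

s/\s*$//gm;                          # trim trailing whitespace on each line

my %headers;
while (/^([^:]+):\s*(.*)/gm) {       # one header per line
    $headers{$1} = $2;
}

print "$_ => '$headers{$_}'\n" for sort keys %headers;
```

Note that \s also matches newlines, so the substitution removes the blank line that ends the paragraph along with the trailing spaces.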

In "Variable Interpolation" later in this chapter, we'll discuss how you can interpolate variables into patterns: if $foo is "bc", then /a$foo/ is equivalent to /abc/. Here, the $ does not match the end of the string. For a $ to match the end of the string, it must be at the end of the pattern or be immediately followed by a vertical bar or closing parenthesis.
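
A small sketch of the three cases, using sample strings of our own choosing:

```perl
use strict;
use warnings;

my $foo = "bc";

# $foo is interpolated, so this is the same as /abc/ and matches mid-string.
my $interpolated = "xabcx" =~ /a$foo/  ? 1 : 0;

# $ at the end of the pattern is the end-of-string assertion; "x" follows "abc".
my $anchored     = "xabcx" =~ /abc$/   ? 1 : 0;

# $ right before a vertical bar is still an assertion, not the $| variable.
my $alternation  = "xabc"  =~ /abc$|z/ ? 1 : 0;

print "interpolated=$interpolated anchored=$anchored alternation=$alternation\n";
```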

5.6.3. Boundaries: The \b and \B Assertions

The \b assertion matches at any word boundary, defined as the position between a \w character and a \W character, in either order. If the order is \W\w, it's a beginning-of-word boundary, and if the order is \w\W, it's an end-of-word boundary. (The ends of the string count as \W characters here.) The \B assertion matches any position that is not a word boundary, that is, the middle of either \w\w or \W\W.

/\bis\b/   # matches "what it is" and "that is it"
/\Bis\B/   # matches "thistle" and "artist"
/\bis\B/   # matches "istanbul" and "so--isn't that butter?"
/\Bis\b/   # matches "confutatis" and "metropolis near you"

Because \W includes all punctuation characters (except the underscore), there are \b boundaries in the middle of strings like "isn't", "booktech@oreilly.com", "M.I.T.", and "key/value".

Inside a character class ([\b]), a \b represents a backspace rather than a word boundary.
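
The following sketch shows where \b splits two of those strings, and that [\b] matches a literal backspace. The variable names are illustrative:

```perl
use strict;
use warnings;

# Punctuation is \W, so each run of \w characters is its own "word".
my @pieces = 'booktech@oreilly.com' =~ /\b(\w+)\b/g;
my @words  = "isn't" =~ /\b\w+\b/g;

# In a double-quoted string, \b is a backspace character;
# inside a character class in a pattern, it means the same thing.
my $backspace = "a\bc" =~ /a[\b]c/ ? 1 : 0;

print join("|", @pieces), "\n";   # booktech|oreilly|com
print join("|", @words), "\n";    # isn|t
print "backspace matched: $backspace\n";
```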

5.6.4. Progressive Matching

When used with the /g modifier, the pos function allows you to read or set the offset where the next progressive match will start:

$burglar = "Bilbo Baggins";
while ($burglar =~ /b/gi) {
    printf "Found a B at %d\n", pos($burglar)-1;
}

(We subtract one from the position because that was the length of the string we were looking for, and pos is always the position just past the match.)

The code above prints:

Found a B at 0
Found a B at 3
Found a B at 6

After a failure, the match position normally resets back to the start. If you also apply the /c (for "continue") modifier, then when the /g runs out, the failed match doesn't reset the position pointer. This lets you continue your search past that point without starting over at the very beginning.

$burglar = "Bilbo Baggins";
while ($burglar =~ /b/gci) {        # ADD /c
    printf "Found a B at %d\n", pos($burglar)-1;
}
while ($burglar =~ /i/gi) {
    printf "Found an I at %d\n", pos($burglar)-1;
}

Besides the three B's it found earlier, Perl now reports finding an i at position 10. Without the /c, the second loop's match would have restarted from the beginning and found another i at position 1 first.
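
You can watch the position pointer directly by inspecting pos after each loop runs out. This is a sketch with variable names of our own; it collects the offsets instead of printing them one by one:

```perl
use strict;
use warnings;

my $burglar = "Bilbo Baggins";

1 while $burglar =~ /b/gi;           # run the /g match until it fails
my $after_plain = pos($burglar);     # undefined: the failure reset the position

1 while $burglar =~ /b/gci;          # same search, but /c keeps the position
my $after_c = pos($burglar);         # 7, just past the last B

my @i_at;
while ($burglar =~ /i/gi) {          # continues from offset 7
    push @i_at, pos($burglar) - 1;
}

printf "after /g: %s, after /gc: %d, i found at: %s\n",
    (defined($after_plain) ? $after_plain : "undef"), $after_c, "@i_at";
```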

5.6.5. Where You Left Off: The \G Assertion

Whenever you start thinking in terms of the pos function, it's tempting to start carving your string up with substr, but this is rarely the right thing to do. More often, if you started with pattern matching, you should continue with pattern matching. However, if you're looking for a positional assertion, you're probably looking for \G.

The \G assertion represents within the pattern the same point that pos represents outside of it. When you're progressively matching a string with the /g modifier (or you've used the pos function to directly select the starting point), you can use \G to specify the position just after the previous match. That is, it matches the location immediately before whatever character would be identified by pos. This allows you to remember where you left off:

($recipe = <<'DISH') =~ s/^\s+//gm;
    Preheat oven to 451 deg. fahrenheit.
    Mix 1 ml. dilithium with 3 oz. NaCl and
    stir in 4 anchovies.  Glaze with 1 g.
    mercury.  Heat for 4 hours and let cool
    for 3 seconds.  Serves 10 aliens.
DISH

$recipe =~ /\d+ /g;
$recipe =~ /\G(\w+)/;           # $1 is now "deg"
$recipe =~ /\d+ /g;
$recipe =~ /\G(\w+)/;           # $1 is now "ml"
$recipe =~ /\d+ /g;
$recipe =~ /\G(\w+)/;           # $1 is now "oz"
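
To make the anchoring effect explicit, this sketch compares a \G match with an unanchored one on a shorter string, and also sets the position by hand with pos. The string and variable names are ours:

```perl
use strict;
use warnings;

my $str = "3 oz. NaCl and 4 anchovies";

$str =~ /\d+ /g;                          # leaves pos($str) at 2, just past "3 "
my ($anchored) = $str =~ /\G(\w+)/;       # "oz": the match must begin at pos

pos($str) = 5;                            # point at the space after "oz."
my $at_space = $str =~ /\G\w+/ ? 1 : 0;   # fails: \G cannot slide forward to "NaCl"

my ($floating) = $str =~ /(\w+)/;         # "3": without \G the leftmost match wins

print "anchored=$anchored at_space=$at_space floating=$floating\n";
```

A match without /g reads the position but does not move it, which is why the \G tests can be repeated at the same spot.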

The \G metasymbol is often used in a loop, as we demonstrate in our next example. We "pause" after every digit sequence, and at that position, we test whether there's an abbreviation. If so, we grab the next two words. Otherwise, we just grab the next word:

pos($recipe) = 0;                 # Just to be safe, reset \G to 0
while ( $recipe =~ /(\d+) /g ) {
    my $amount = $1;
    if ($recipe =~ / \G (\w{0,3}) \. \s+ (\w+) /x) {  # abbrev. + word
        print "$amount $1 of $2\n";
    } else {
        $recipe =~ / \G (\w+) /x;                     # just a word
        print "$amount $1\n";
    }
}

That produces:

451 deg of fahrenheit
1 ml of dilithium
3 oz of NaCl
4 anchovies
1 g of mercury
4 hours
3 seconds
10 aliens