6.6.3. Discussion
A common, brute-force approach to parsing documents where newlines
are not significant is to read the file one paragraph at a time (or
sometimes even the entire file as one string) and then extract tokens
one by one. If the pattern involves dot, such as
.+ or .*?, and must match
across newlines, you need to do something special to make dot match a
newline; ordinarily, it does not. When you've read more than one line
into a string, you'll probably prefer to have ^
and $ match beginning- and end-of-line, not just
beginning- and end-of-string.
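As a minimal sketch of the two reading strategies mentioned above, the following reads the same data once in paragraph mode and once as a single string. The sample text and the in-memory filehandle are assumptions made for illustration; a real program would read from a file or from <>.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data: two paragraphs separated by blank lines.
my $data = "para one, line 1\npara one, line 2\n\n\npara two\n";

my @paragraphs;
{
    local $/ = '';      # paragraph mode: records end at runs of blank lines
    open my $fh, '<', \$data or die "can't open in-memory file: $!";
    @paragraphs = <$fh>;
    close $fh;
}

my $whole;
{
    local $/;           # undef: read the entire input as one string
    open my $fh, '<', \$data or die "can't open in-memory file: $!";
    $whole = <$fh>;
    close $fh;
}

print scalar(@paragraphs), " paragraphs\n";
print "whole input read as one string\n" if $whole eq $data;
```

Either way, each record now holds embedded newlines, which is what makes the behavior of dot, ^, and $ matter.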
The difference between /m and
/s is important: /m allows
^ and $ to match next to an
embedded newline, whereas /s allows
. to match newlines. You can even use them
together—they're not mutually exclusive.
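A short sketch may make the distinction concrete. The sample string is an assumption chosen for illustration, not part of the recipe.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample text with one embedded newline.
my $text = "first line\nsecond line\n";

# Without /m, ^ matches only at the start of the string.
my @plain = $text =~ /^(\w+)/g;                 # ("first")

# With /m, ^ also matches just after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;                # ("first", "second")

# Without /s, dot cannot cross the newline, so this fails.
my ($no_s)   = $text =~ /first(.*)second/;      # undef

# With /s, dot matches the newline too.
my ($with_s) = $text =~ /first(.*)second/s;     # " line\n"

print "without /m: @plain\n";
print "with /m:    @multi\n";
print "without /s: ", (defined $no_s   ? "matched" : "no match"), "\n";
print "with /s:    ", (defined $with_s ? "matched" : "no match"), "\n";
```

The two modifiers touch different metacharacters, which is why combining them causes no conflict.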
Example 6-3. headerfy
#!/usr/bin/perl
# headerfy: change certain chapter headers to html
$/ = '';
while (<>) {                    # fetch a paragraph
    s{
        \A                      # start of record
        (                       # capture in $1
            Chapter             # text string
            \s+                 # mandatory whitespace
            \d+                 # decimal number
            \s*                 # optional whitespace
            :                   # a real colon
            .*                  # anything not a newline till end of line
        )
    }{<H1>$1</H1>}gx;
    print;
}
Here it is as a one-liner from the command line for those of you for
whom the extended comments just get in the way of understanding:
% perl -00pe 's{\A(Chapter\s+\d+\s*:.*)}{<H1>$1</H1>}gx' datafile
This problem is interesting because we need to be able to specify
start-of-record and end-of-line in the same pattern. We could
normally use ^ for start-of-record, but if we wanted
$ to indicate not only end-of-record but
end-of-line as well, we would have to add the /m modifier, which
changes both ^ and $. So rather than
using ^ to match beginning-of-record, we use
\A, which means the same thing with or without /m. In the pattern
shown, .* already stops at the first newline because /s is not in
effect, so no explicit $ is needed. We're not using it here, but in case
you're interested, the version of $ that always
matches at end-of-record, before an optional trailing newline, even in
the presence of /m, is \Z. To match the real
end without the optional newline, use \z.
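To see how these anchors differ, here is a small sketch run against a made-up two-line record (the record text is an assumption for illustration only).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical record: a header line, a body line, and a trailing newline.
my $rec = "Chapter 1: Intro\nbody text\n";

# Under /m, $ matches before every newline, not just the last one.
my @ends_m = $rec =~ /(\w+)$/mg;            # ("Intro", "text")

# \Z ignores /m: only the end of the string, or just before a final newline.
my @ends_Z = $rec =~ /(\w+)\Z/mg;           # ("text")

# \z is the very end of the string; the trailing newline must be matched.
my $z_bare = $rec =~ /text\z/   ? 1 : 0;    # 0
my $z_nl   = $rec =~ /text\n\z/ ? 1 : 0;    # 1

# \A is the start of the string, even with /m in effect.
my @starts = $rec =~ /\A(\w+)/mg;           # ("Chapter")

print "\$ with /m:  @ends_m\n";
print "\\Z with /m: @ends_Z\n";
print "\\z bare: $z_bare, \\z after newline: $z_nl\n";
print "\\A with /m: @starts\n";
```

Only ^ and $ change meaning under /m; \A, \Z, and \z behave the same regardless.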
The following example uses /s and /m together:
$/ = '';                                # paragraph read mode
while (<ARGV>) {
    while (/^START(.*?)^END/smg) {      # /s makes . span line boundaries
                                        # /m makes ^ match near newlines
                                        # /g moves on to the next match
        print "chunk $. in $ARGV has <<$1>>\n";
    }
}
If you're already committed to the /m modifier,
use \A and \Z for the old
meanings of ^ and $,
respectively. But what if you've used the /s
modifier and want the original meaning of dot? You use
[^\n].
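A brief sketch of that substitution, using a made-up string (an assumption for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical two-line string.
my $text = "alpha\nbeta\n";

# Under /s, dot swallows newlines, so .* runs to the end of the string.
my ($dot)   = $text =~ /(.*)/s;         # "alpha\nbeta\n"

# [^\n] never matches a newline, whatever modifiers are in effect.
my ($no_nl) = $text =~ /([^\n]*)/s;     # "alpha"

print "dot under /s captured ", length($dot), " characters\n";
print "[^\\n] under /s captured '$no_nl'\n";
```

Because a negated character class is unaffected by /s, it is a safe way to say "anything but a newline" in a pattern that also needs dot to span lines elsewhere.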
Finally, although $ and \Z can
match one character before the end of a string when that last
character is a newline, \z matches only at the very end of the
string. We can use lookaheads to define the other two as shortcuts
involving \z: