Matching Within Multiple Lines (Perl Cookbook, 2nd Edition)

6.6.3. Discussion

A common, brute-force approach to parsing documents where newlines are not significant is to read the file one paragraph at a time (or sometimes even the entire file as one string) and then extract tokens one by one. If the pattern involves dot, such as .+ or .*?, and must match across newlines, you need to do something special to make dot match a newline; ordinarily, it does not. When you've read more than one line into a string, you'll probably prefer to have ^ and $ match beginning- and end-of-line, not just beginning- and end-of-string.
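As a rough sketch of the reading modes mentioned above, the following reads the same data by line, by paragraph, and as a single string. The sample text and the read_records helper are made up for illustration, and an in-memory filehandle stands in for a real file.

```perl
use strict;
use warnings;

# Sample input (hypothetical): two paragraphs separated by blank lines.
my $data = "first line\nsecond line\n\n\nnext paragraph\n";

# read_records is a hypothetical helper: it reads $data with the given
# value of $/ (the input record separator) and returns the records.
sub read_records {
    my ($separator) = @_;
    open my $fh, '<', \$data or die "cannot open in-memory file: $!";
    local $/ = $separator;          # restored automatically on return
    my @records = <$fh>;
    close $fh;
    return @records;
}

my @lines      = read_records("\n");    # default: one record per line (5 here)
my @paragraphs = read_records('');      # paragraph mode (2 here)
my @whole      = read_records(undef);   # slurp mode: one record, the whole file
```

Using local on $/ keeps the change from leaking into other code that reads files.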

The difference between /m and /s is important: /m allows ^ and $ to match next to an embedded newline, whereas /s allows . to match newlines. You can even use them together—they're not mutually exclusive.
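A minimal sketch of that difference, using a made-up three-line string: /m changes where ^ may match, while /s changes what dot may match.

```perl
use strict;
use warnings;

my $text = "alpha\nbeta\ngamma\n";

# Without /m, ^ matches only at the start of the string.
my @plain = $text =~ /^(\w+)/g;       # ("alpha")

# With /m, ^ also matches just after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;      # ("alpha", "beta", "gamma")

# Without /s, dot stops at a newline, so the match stays on line one.
my ($no_s)   = $text =~ /(a.*a)/;     # "alpha"

# With /s, dot crosses newlines and the greedy match runs to the last "a".
my ($with_s) = $text =~ /(a.*a)/s;    # "alpha\nbeta\ngamma"
```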

Example 6-2 creates a simplistic filter to strip HTML tags out of each file in @ARGV and then send those results to STDOUT. First we undefine the record separator so each read operation fetches one entire file. (There could be more than one file, because @ARGV could have several arguments in it. If so, each readline would fetch the entire contents of one file.) Then we strip out instances of beginning and ending angle brackets, plus anything in between them. We can't use just .* for two reasons: first, it would match closing angle brackets, and second, the dot wouldn't cross newline boundaries. Using .*? in conjunction with /s solves these problems.

Example 6-2. killtags

  #!/usr/bin/perl
  # killtags - very bad html tag killer
  undef $/;           # each read is whole file
  while (<>) {        # get one whole file at a time
      s/<.*?>//gs;    # strip tags (terribly)
      print;          # print file to STDOUT
  }

Because the closing delimiter of a tag is just a single character, it would be much faster to use s/<[^>]*>//g (the /s is unnecessary once the pattern contains no dot), but that's still a naïve approach: it doesn't correctly handle tags inside HTML comments or angle brackets in quotes (<IMG SRC="here.gif" ALT="<<Ooh la la!>>">). Recipe 20.6 explains how to avoid these problems.
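To see both the speedup idea and its limits, here is a small sketch of the negated-character-class version. The strip_tags name is hypothetical; the second input is the IMG tag quoted above.

```perl
use strict;
use warnings;

# strip_tags is a hypothetical helper wrapping the negated-class pattern.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<[^>]*>//g;    # [^>] matches newlines too, so no /s is needed
    return $html;
}

# Tags that span lines are removed correctly.
my $ok = strip_tags("<p>Hello,\n<b\nclass='x'>world</b></p>");
# $ok is "Hello,\nworld"

# A quoted ">" ends the match early and leaves debris behind.
my $bad = strip_tags('<IMG SRC="here.gif" ALT="<<Ooh la la!>>">');
# $bad is '>">'
```

The leftover characters in the second case are what a real HTML parser is needed to avoid.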

Example 6-3 takes a plain text document and looks for lines at the start of paragraphs that look like "Chapter 20: Better Living Through Chemisery". It wraps these with an appropriate HTML level-one header. Because the pattern is relatively complex, we use the /x modifier so we can embed whitespace and comments.

Example 6-3. headerfy

  #!/usr/bin/perl
  # headerfy: change certain chapter headers to html
  $/ = '';
  while (<>) {               # fetch a paragraph
      s{
          \A                  # start of record
          (                   # capture in $1
              Chapter         # text string
              \s+             # mandatory whitespace
              \d+             # decimal number
              \s*             # optional whitespace
              :               # a real colon
              .*              # anything not a newline till end of line
          )
      }{<H1>$1</H1>}gx;
      print;
  }

Here it is as a one-liner from the command line for those of you for whom the extended comments just get in the way of understanding:

% perl -00pe 's{\A(Chapter\s+\d+\s*:.*)}{<H1>$1</H1>}gx' datafile

This problem is interesting because we need to specify start-of-record and end-of-line in the same pattern. We could normally use ^ for start-of-record, but once a pattern needs $ to mean end-of-line as well as end-of-record, it needs the /m modifier, and /m changes ^ as well as $. So instead of ^ we use \A, which matches only at the beginning of the record whatever the modifiers. In the pattern shown, end-of-line needs no explicit anchor: without /s, dot does not match a newline, so .* stops at the end of the first line. We're not using it here, but in case you're interested, the version of $ that always matches end-of-record with an optional newline, even in the presence of /m, is \Z. To match the real end without the optional newline, use \z.
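A short sketch of why \A matters once /m is in play. The sample records are invented, and the pattern here adds an explicit $ with /m purely to illustrate the interaction.

```perl
use strict;
use warnings;

# A record whose chapter-like line is NOT at the start of the paragraph.
my $para = "Intro line\nChapter 3: Later Mention\nmore text\n";

# Under /m, ^ matches at the start of any line, so this finds line two.
my $caret_hit  = $para =~ /^Chapter\s+\d+\s*:.*$/m  ? 1 : 0;    # 1

# \A ignores /m and matches only at the very start of the record.
my $anchor_hit = $para =~ /\AChapter\s+\d+\s*:.*$/m ? 1 : 0;    # 0

# With \A for start-of-record and $ under /m for end-of-line, only a
# heading on the first line of the record is wrapped.
my $heading = "Chapter 20: Better Living\nBody text follows.\n";
(my $marked = $heading) =~ s/\A(Chapter\s+\d+\s*:.*)$/<H1>$1<\/H1>/m;
# $marked is "<H1>Chapter 20: Better Living</H1>\nBody text follows.\n"
```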

The following example uses /s and /m together: we want ^ to match at the beginning of any line in the paragraph, and we also want dot to match a newline. The predefined variable $. represents the record number of the filehandle most recently read from using readline(FH) or <FH>. The predefined variable $ARGV is the name of the file that's automatically opened by implicit <ARGV> processing.

$/ = '';            # paragraph read mode
while (<ARGV>) {
    while (/^START(.*?)^END/smg) {  # /s makes . span line boundaries
                                    # /m makes ^ match near newlines
                                    # /g resumes after the previous match,
                                    #    so the loop terminates
        print "chunk $. in $ARGV has <<$1>>\n";
    }
}

If you're already committed to the /m modifier, use \A and \Z for the old meanings of ^ and $, respectively. But what if you've used the /s modifier and want the original meaning of dot? You use [^\n].
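For instance, with /s in effect, a sketch like the following (sample text invented) shows [^\n] recovering the usual line-bound behavior of dot:

```perl
use strict;
use warnings;

my $text = "key: value one\nnext line\n";

# Under /s, .* runs through the newline to the end of the string.
my ($greedy) = $text =~ /key:(.*)/s;        # " value one\nnext line\n"

# [^\n]* stops at the first newline even though /s is in effect.
my ($line)   = $text =~ /key:([^\n]*)/s;    # " value one"
```

On Perl 5.12 and later, \N is a built-in spelling of the same idea: it matches any character except newline regardless of /s.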

Finally, although $ and \Z can match one character before the end of a string if that last character is a newline, \z matches only at the very end of the string. We can use lookaheads to define the other two as shortcuts involving \z:

`$` without `/m`	`(?=\n?\z)`
`$` with `/m`	`(?=\n)|\z`
`\Z` always	`(?=\n?\z)`
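One way to check where each anchor can match is to test it at every offset of a string. This is only a sketch: offsets_where is a hypothetical helper, and the sample string is made up.

```perl
use strict;
use warnings;

# offsets_where is a hypothetical helper: it returns every offset in
# $str at which the zero-width assertion $re succeeds.
sub offsets_where {
    my ($re, $str) = @_;
    my @hits;
    for my $i (0 .. length $str) {
        pos($str) = $i;                        # try the assertion at offset $i
        push @hits, $i if $str =~ /\G$re/g;    # \G pins the match to pos()
    }
    return @hits;
}

my $s = "one\ntwo\n";    # newlines at offsets 3 and 7; end of string at 8

my @dollar   = offsets_where(qr/$/,  $s);    # (7, 8)
my @dollar_m = offsets_where(qr/$/m, $s);    # (3, 7, 8)
my @big_z    = offsets_where(qr/\Z/, $s);    # (7, 8)
my @small_z  = offsets_where(qr/\z/, $s);    # (8)
```

Only \z is limited to the true end of the string; $ and \Z also succeed just before a final newline, and $ under /m succeeds before every newline.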
