6.10. Speeding Up Interpolated Matches

Problem

You want your function or program to take one or more regular expressions as arguments, but doing so seems to run slower than using literals.

Solution
To overcome this bottleneck, if you have only one pattern whose value won't change during the entire run of a program, store it in a string and use the /o modifier:

    while ($line = <>) {
        if ($line =~ /$pattern/o) {
            # do something
        }
    }

If you have more than one pattern, however, that won't work. Use one of the three techniques outlined in the Discussion for a speed-up of an order of magnitude or so.

Discussion
When Perl compiles a program, it converts patterns into an internal form. This conversion occurs at compile time for patterns without variables in them, but at run time for those that do contain variables. That means that interpolating variables into patterns, as in
The
Using
Example 6.4 is an example of the slow but straightforward technique for matching many patterns against many lines. The array @popstates holds the place abbreviations to search for.

Example 6.4: popgrep1

    #!/usr/bin/perl
    # popgrep1 - grep for abbreviations of places that say "pop"
    # version 1: slow but obvious way
    @popstates = qw(CO ON MI WI MN);
    LINE: while (defined($line = <>)) {
        for $state (@popstates) {
            if ($line =~ /\b$state\b/) {
                print $line;    # the line is in $line, not $_
                next LINE;
            }
        }
    }
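A natural first thought is to add /o to the pattern in the loop above. The following sketch suggests why that does not work with more than one pattern: /o compiles the pattern the first time it is executed and reuses it from then on, so every later value of $state is silently ignored. The sample input lines and the @hits array are made up for illustration and are not part of popgrep1.

```perl
#!/usr/bin/perl
# Sketch: why /o cannot be used when the interpolated variable changes.
# The input lines and @hits are illustrative assumptions, not from popgrep1.
use strict;
use warnings;

my @popstates = qw(CO ON MI WI MN);
my @lines     = ("Born in MN\n", "Lives in CO\n");
my @hits;

LINE: for my $line (@lines) {
    for my $state (@popstates) {
        # /o freezes the pattern as /\bCO\b/ on its first execution;
        # later values of $state never reach the regex engine.
        if ($line =~ /\b$state\b/o) {
            push @hits, $line;
            next LINE;
        }
    }
}

print @hits;    # only the CO line is found; the MN line is missed
```

So the loop in Example 6.4 has to interpolate afresh on each pass, and pays for it.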
Such a direct, obvious, brute-force approach is also horribly slow because it has to recompile all patterns with each line of input. Three different ways of addressing this are described in this section. One builds a string of Perl code and evals it; one caches compiled patterns in a closure; and one uses the Regexp module to hold compiled patterns.
The traditional way to get Perl to speed up a multiple match is to build up a string containing the code and eval it, as in Example 6.5.

Example 6.5: popgrep2

    #!/usr/bin/perl
    # popgrep2 - grep for abbreviations of places that say "pop"
    # version 2: eval strings; fast but hard to quote
    @popstates = qw(CO ON MI WI MN);
    $code = 'while (defined($line = <>)) {';
    for $state (@popstates) {
        $code .= "\tif (\$line =~ /\\b$state\\b/) { print \$line; next; }\n";
    }
    $code .= '}';
    print "CODE IS\n----\n$code\n----\n" if 0;  # turn on to debug
    eval $code;
    die if $@;
The resulting string looks like this, modulo formatting:

    while (defined($line = <>)) {
        if ($line =~ /\bCO\b/) { print $line; next; }
        if ($line =~ /\bON\b/) { print $line; next; }
        if ($line =~ /\bMI\b/) { print $line; next; }
        if ($line =~ /\bWI\b/) { print $line; next; }
        if ($line =~ /\bMN\b/) { print $line; next; }
    }
As you see, those end up looking like constant strings to
The worst thing about this
A solution to these problems is a subtle technique first developed by Jeffrey Friedl. The key here is building an anonymous subroutine that caches the compiled patterns in the closure it creates. To do this, we eval a string that defines an anonymous subroutine. Example 6.6 is a version of our pop grepper that uses that technique.

Example 6.6: popgrep3

    #!/usr/bin/perl
    # popgrep3 - grep for abbreviations of places that say "pop"
    # version 3: use build_match_func algorithm
    @popstates = qw(CO ON MI WI MN);
    $expr = join('||', map { "m/\\b\$popstates[$_]\\b/o" } 0..$#popstates);
    $match_any = eval "sub { $expr }";
    die if $@;
    while (<>) {
        print if &$match_any;
    }

The string that gets evaluated ends up looking like this, modulo formatting:

    sub {
        m/\b$popstates[0]\b/o ||
        m/\b$popstates[1]\b/o ||
        m/\b$popstates[2]\b/o ||
        m/\b$popstates[3]\b/o ||
        m/\b$popstates[4]\b/o
    }
The reference to the

Example 6.7 is a generalized form of this technique showing how to create functions that return true if any of the patterns match or if all match.

Example 6.7: grepauth

    #!/usr/bin/perl
    # grepauth - print lines that mention both Tom and Nat
    $multimatch = build_match_all(q/Tom/, q/Nat/);
    while (<>) {
        print if &$multimatch;
    }
    exit;

    sub build_match_any { build_match_func('||', @_) }
    sub build_match_all { build_match_func('&&', @_) }
    sub build_match_func {
        my $condition = shift;
        my @pattern = @_;   # must be lexical variable, not dynamic one
        my $expr = join $condition => map { "m/\$pattern[$_]/o" } (0..$#pattern);
        my $match_func = eval "sub { local \$_ = shift if \@_; $expr }";
        die if $@;          # propagate $@; this shouldn't happen!
        return $match_func;
    }
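The functions built by build_match_func can also be called with an explicit string, because the generated code localizes $_ when it is given an argument. Here is a small usage sketch; it repeats the builder from Example 6.7 so that it runs on its own, and the sample strings and the names $both and $either are made up for illustration.

```perl
#!/usr/bin/perl
# Sketch: calling the generated matchers with an explicit argument.
use strict;
use warnings;

sub build_match_any { build_match_func('||', @_) }
sub build_match_all { build_match_func('&&', @_) }

sub build_match_func {
    my $condition = shift;
    my @pattern   = @_;    # lexical, so each generated sub keeps its own copy
    my $expr = join $condition => map { "m/\$pattern[$_]/o" } (0..$#pattern);
    my $match_func = eval "sub { local \$_ = shift if \@_; $expr }";
    die $@ if $@;
    return $match_func;
}

my $both   = build_match_all(q/Tom/, q/Nat/);
my $either = build_match_any(q/Tom/, q/Nat/);

# Report, for each sample string, whether all or any of the patterns match.
for my $text ("Tom and Nat", "Tom alone", "nobody here") {
    printf "%-12s all=%s any=%s\n", $text,
        ($both->($text)   ? "yes" : "no"),
        ($either->($text) ? "yes" : "no");
}
```

Because each call to build_match_func evals a fresh subroutine, the /o in one matcher does not interfere with the patterns of another.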
Using
What you really need is some way to get Perl to compile each pattern once and let you directly refer to the compiled form later on. Such functionality is directly supported in the 5.005 release in the form of the qr// regular-expression quoting operator. Example 6.8 is a version of our program that demonstrates a simple use of the Regexp module.

Example 6.8: popgrep4

    #!/usr/bin/perl
    # popgrep4 - grep for abbreviations of places that say "pop"
    # version 4: use Regexp module
    use Regexp;
    @popstates = qw(CO ON MI WI MN);
    @poppats = map { Regexp->new( '\b' . $_ . '\b') } @popstates;
    while (defined($line = <>)) {
        for $patobj (@poppats) {
            if ($patobj->match($line)) {
                print $line;
                last;   # print each matching line only once
            }
        }
    }

You might wonder about the comparative speeds of these approaches. When run against a 22,000-line text file (the Jargon File, to be exact), version 1 ran in 7.92 seconds, version 2 in merely 0.53 seconds, version 3 in 0.79 seconds, and version 4 in 1.74 seconds. The last technique is a lot easier to understand than the others, although it does run slower than versions 2 and 3. It's also more flexible.

See Also
Interpolation is explained in the "Scalar Value Constructors" section of perldata(1), and in the "String literals" section of Chapter 2 of Programming Perl.
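On Perl 5.005 and later, the compile-once idea behind Example 6.8 can also be expressed with the built-in qr// operator, without the Regexp module. The following is a sketch under that assumption, not a drop-in replacement for popgrep4; the helper name grep_pop and the sample lines are hypothetical.

```perl
#!/usr/bin/perl
# Sketch: popgrep with qr//, which compiles each pattern once.
use strict;
use warnings;

my @popstates = qw(CO ON MI WI MN);
my @poppats   = map { qr/\b$_\b/ } @popstates;   # compiled here, once each

# grep_pop (hypothetical helper): return the lines that match any pattern.
sub grep_pop {
    my @found;
    LINE: for my $line (@_) {
        for my $pat (@poppats) {
            if ($line =~ $pat) {
                push @found, $line;
                next LINE;            # report each line only once
            }
        }
    }
    return @found;
}

print grep_pop("Pop in MN\n", "Soda in CA\n", "Pop in ON and WI\n");
```

To filter real input the same way, the loop body could be applied to each line read from <> instead of to a fixed list.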